Re: Apparmor move_mount mediation breaks mount tool in containers

From: John Johansen <john.johansen@canonical.com>
To: "Stéphane Graber" <stgraber@stgraber.org>
Cc: Linux regressions mailing list <regressions@lists.linux.dev>,
	Christian Brauner <brauner@kernel.org>,
	Sasha Levin <sashal@kernel.org>, Aleksa Sarai <cyphar@cyphar.com>,
	Alexander Mihalicyn <alexander@mihalicyn.com>,
	"paul@paul-moore.com" <paul@paul-moore.com>,
	James Morris <jmorris@namei.org>, Serge Hallyn <serge@hallyn.com>,
	linux-security-module@vger.kernel.org
Subject: Re: Apparmor move_mount mediation breaks mount tool in containers
Date: Tue, 5 Dec 2023 19:16:13 -0800	[thread overview]
Message-ID: <a4db48af-4c6b-4573-b4f1-007920b7111c@canonical.com> (raw)
In-Reply-To: <CA+enf=uzHAK3KEisZS3QfuUOMG_ZiFwyJGfP+5cyWB0-mdDQOg@mail.gmail.com>

On 12/5/23 18:18, Stéphane Graber wrote:
> On Tue, Dec 5, 2023 at 2:55 PM John Johansen
> <john.johansen@canonical.com> wrote:
>>
>> On 12/4/23 22:57, Stéphane Graber wrote:
>>> On Mon, Dec 4, 2023 at 9:20 AM Linux regression tracking (Thorsten
>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>
>>>> On 04.12.23 14:14, John Johansen wrote:
>>>>> On 12/3/23 17:34, Stéphane Graber wrote:
>>>>
>>>> Side note: Stéphane, thx for CCing the regressions list.
>>>>
>>>>>> On Sun, Dec 3, 2023 at 8:21 PM John Johansen
>>>>>> <john.johansen@canonical.com> wrote:
>>>>>>> On 12/2/23 17:20, Stéphane Graber wrote:
>>>>>>>>
>>>>>>>> Upstream commit 157a3537d6bc28ceb9a11fc8cb67f2152d860146 which just
>>>>>>>> landed in 6.6.3 stable as 96af45154a0be30485ad07f70f852b1456cb13d7 is
>>>>>>>> blocking new mounts for all LXC, LXD and Incus users (at least) on
>>>>>>>> distributions using the newer version of util-linux.
>>>>>>>>
>>>>>>>> That's because for a simple mount like "mount -t tmpfs tmpfs /tmp",
>>>>>>>> the new mount command now performs:
>>>>>>>> ```
>>>>>>>> fsconfig(3, FSCONFIG_SET_STRING, "source", "tmpfs", 0) = 0
>>>>>>>> fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
>>>>>>>> fsmount(3, FSMOUNT_CLOEXEC, 0)          = 4
>>>>>>>> statx(4, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_MNT_ID,
>>>>>>>> {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID,
>>>>>>>> stx_attributes=STATX_ATTR_MOUNT_ROOT, stx_mode=S_IFDIR|S_ISVTX|0777,
>>>>>>>> stx_size=40, ...}) = 0
>>>>>>>> move_mount(4, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH) = 0
>>>>>>>> ```
>>>>>>>>
>>>>>>>> That last call to "move_mount" is incorrectly interpreted by AppArmor
>>>>>>>> as an attempt to move-mount "/" to "/mnt" rather than as a new mount
>>>>>>>> being created, this therefore results in:
>>>>>>>> ```
>>>>>>>> Dec 03 01:05:03 kernel-test kernel: audit: type=1400
>>>>>>>> audit(1701565503.599:34): apparmor="DENIED" operation="mount"
>>>>>>>> class="mount" info="failed perms check" error=-13
>>>>>>>> profile="incus-a_</var/lib/incus>" name="/tmp/" pid=2190 comm="mount"
>>>>>>>> srcname="/" flags="rw, move"
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Note that the flags here show "move", the fstype isn't even set and
>>>>>>>> the source path at srcname incorrectly shows "/".
>>>>>>>>
>>>>>>>> This operation therefore trips any container manager which has an
>>>>>>>> apparmor security policy preventing arbitrary move-mount (as those
>>>>>>>> could be used to bypass other apparmor path based policies).
>>>>>>>>
>>>>>>>>
>>>>>>>> The way I see it, the current mediation support effectively breaks any
>>>>>>>> attempt at mediating mounts in a useful way in apparmor as it's now
>>>>>>>> impossible to mediate new mounts based on their fstype or even
>>>>>>>> distinguish them from a move-mount operation.
>>>>>>>>
>>>>>>> Indeed it is a far from good solution. It is a stop gap.
>>>>>>>>
>>>>>>>> I don't know if this warrants pulling the mediation patch out of
>>>>>>>> stable (and out of linus' tree), obviously doing that would
>>>>>>>> reintroduce that hole in mount coverage, but at the same time, the
>>>>>>>> current coverage is broken enough that our only alternative is to
>>>>>>>> effectively allow all mounts, making the current mediation useless.
>>>>>>>
>>>>>>> pulling it effectively means ALL applications by-pass mediation, the
>>>>>>> alternative is to block all applications from using the move_mount
>>>>>>> system call as part of mediation. Which might have been acceptable
>>>>>>> as a stop gap when the syscall first landed but not now.
>>>>>>
>>>>>> Pulling it from the stable branch may still make sense, you now have
>>>>>> folks who are updating to get actual bugfixes and end up with broken
>>>>>> containers, that doesn't exactly seem like a good outcome...
>>>>>
>>>>> I will defer such a decision to the maintainers the stable trees. I
>>>>> can see arguments either way.
>>>>
>>>> FWIW, Greg usually does not revert a backported change if the change
>>>> causes the same problem to happen with mainline, as then it should be
>>>> fixed there as well. Which is normally the right thing to do, as Linus
>>>> wants people to be able to upgrade to new kernels without having them to
>>>> upgrade anything else (firmware, anything in userland incl. apparmor
>>>> policies).
>>>>
>>>> So normally this whole issue afaics would be "157a3537d6bc28 needs to be
>>>> fixed in mainline (if nothing else helps with a revert), and that fix
>>>> then needs to be backported to the various stable trees".
>>>>
>>>> It obviously is more complicated because that commit apparently fixes a
>>>> security vulnerability. But even under such circumstances Linus afaik
>>>> wants us to do everything possible to avoid breaking peoples workflows.
>>>> Which is why I wonder if there is a more elegant solution here
>>>> somewhere. I doubt that the answer is yes, but I'll ask the following
>>>> anyway: Would a kernel config option that distros could set when they
>>>> updated their Apparmor policies help here? Or something in sysfs?
>>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>
>>> So there are a few different scenarios here:
>>>
>>> 1) Distribution shipping application specific profiles that were
>>> developed back when the new VFS mount API wasn't around, those
>>> profiles allow the specific mount calls made by the application. The
>>> application has since moved on to using the VFS mount API and thanks
>>> to the lack of apparmor coverage, things have kept on working fine.
>>> Now when the distro pulls this bugfix, the application will now fail
>>> to perform the mount. Apparmor returns EACCESS which doesn't cause a
>>> fallback to the old mount API but instead a straight up error returned
>>> to the user. ENOSYS would usually be handled properly.
>>>
>> I am not opposed to conditionally returning ENOSYS. I also have no doubt
>> that doing so would break fewer applications, it will still break some.
>> As I said I will look into doing that
> 
> Thanks, that would certainly be a much better stop gap measure until
> such time as the mount API can properly be mediated by AppArmor.
> 
>>> 2) All the container managers out there which use AppArmor as a safety
>>> net to block mostly misbehaving or misconfigured software. Those
>>> policies are not meant to be water tight to begin with (and due to the
>>> nature of containers, they really can't be). But they do catch a lot
>>> and while not an actual security barrier, they certainly prevent a lot
>>> of accidents. Unfortunately with the new mediation in place, the only
>>> real way to operate moving forward is to not mediate mounts at all as
>>> mediation of the new VFS API leads to just about everything getting
>>> denied and apparmor doesn't provide a way to separately mediate the
>>> old and new mount APIs.
>>>
>> atm no, I am working on a conditional for detached mounts. I am not
>> convinced we should completely separate move_mount() mediation for
>> attached mounts.
>>
>> But to use any of that you will need a new apparmor userspace to
>> support it. LXD is using the mounts as a behavioral/advisory catch, so
>> allowing an unmediated move_mount() is not the end of the world. Apparmor
>> still needs to be able to mediate mounts for applications that aren't
>> containers and do need a tighter barrier.
>>
>> The reality with the new mount api is that atm we can only mediate
>> move_mount(), but the new mount api does add a twist with the whole
>> detached mounts. The only way to support that on older releases is
>> using a very generic mount rule.
>>
>> New userspaces can pick up more. Leaving move_mount() unmediated even
>> for older releases really isn't an appealing option. New kernels
>> get pulled back and used on old releases all the time.
>>
>>> 3) A system where apparmor is used to effectively prevent any mounting
>>> whatsoever, where the policy is now being bypassed by simply using the
>>> new VFS API. Such a system would also need to not be using Seccomp in
>>> combination with Apparmor as blocking the new VFS API syscalls would
>>> avoid the issue.
>>>
>> sure
>>
>>>
>>> I suspect 3) is what John is hinting at in this thread. I could
>>> certainly see this bug be used to bypass the snapd mount restrictions
>>> for example, though snapd also generates seccomp policies, so just
>>> blocking the new VFS API would be a much simpler fix there.
>>>
>> I am concerned about more than the snapd case, yes snapd could and
>> should use seccomp, but apparmor should be able to stand on its own
>> for other cases.
>>
>>> I have no idea how common 1) is, but it looks like we're about to find
>>> out soon and that has potential for some very "interesting" bug
>>> reports as mount related errors are usually not the easiest to
>>> understand and LSMs make them so much worse.
>>>
>> unfortunately that is always the case when fixing a mediation regression.
>> This one sat way too long that doesn't mean it shouldn't get fixed.
>>
>> Overall though, I think lxd/incus is probably the major one. Most people
>> are running targeted policies and have left most of the stuff doing
>> mounts unconfined.
> 
> We did some research with Aleksa last night and that does seem to be
> true for the profiles we could find.
> 
> However there is a bit of a catch that Docker, Kubernetes, ... pretty
> much all the application container options do support using AppArmor
> but don't provide advanced profiles, instead they put the burden on
> the user to attach profiles to specific containers.
> 
> I have no idea how many people actually do that and since they are
> individually crafted profiles by those users, we have no idea what may
> be in there and whether they got broken by this change.
> 
>>> As for 2), as it stands, we're going to have to effectively turn off
>>> any of the mount related safety nets we had in place in LXC, Incus and
>>> likely most other container runtimes out there to handle this. The
>>> usual issue with doing that kind of heavy hammer change is that once
>>> it's done, it will take a long time to undo it. That is, once AppArmor
>>> is actually capable of mediating the way it's supposed to, a lot of
>>> the profiles will have been changed to just blanket allow all mounts
>>> and they're only ever going to be changed once we're confident that
>>> the vast majority of our users are running a kernel with the fixed
>>> logic.
>>>
>> I am well aware of it. I fall on the side of leaving this completely
>> unmediated is slightly worse, but I can understand why someone
>> would choose to leave this unmediated.
>>
>>> I guess some of that could be handled better if there was a way to
>>> detect the current broken mediation of the new mount API and then
>>> later again, detect that the kernel now has working mediation,
>>> allowing for those container managers to generate accurate profile
>>> rather than have to go for the "safest" option (as far as keeping
>>> users running) and just keep allowing everything.
>>>
>>> John, is there any such detection mechanism currently present that
>>> could be used by userspace to better handle this situation?
>>>
>> sadly not for the move_mount() patch, it is however a one line patch
>>
>> diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
>> index 7465c39ad1bc..6cd52767cfeb 100644
>> --- a/security/apparmor/apparmorfs.c
>> +++ b/security/apparmor/apparmorfs.c
>> @@ -2373,6 +2373,7 @@ static struct aa_sfs_entry aa_sfs_entry_policy[] = {
>>
>>    static struct aa_sfs_entry aa_sfs_entry_mount[] = {
>>          AA_SFS_FILE_STRING("mask", "mount umount pivot_root"),
>> +       AA_SFS_FILE_STRING(move_mount, "detached"),
>>          { }
>>    };
>>
>> or similar. That I could send out today.
> 
> I think that'd be useful to have as something that can be used to
> detect this behavior from userspace and have the affected code turn
> off mount mediation if that's the case.
> 
>> it would surfaced the via the regular securityfs/apparmor interface as
>> /sys/kernel/security/apparmor/features/mount/move_mount
> 
>> I could also get you a simple conditional patch around returning ENOSYS
>> to start testing, so that we could potentially try sending a final version
>> of that patch up next week.
> 
> That'd be good. I feel that if this change had come with the ENOSYS
> behavior, I wouldn't have noticed it at all.
> With the new mount API being still pretty fresh, I'm not aware of any
> userspace which today relies exclusively on it, so having it be
> effectively disabled until such time as you can mediate it properly
> shouldn't really break anyone (always hard to know, but sure would
> break a lot less cases than the current change).
> 
> One catch though from my testing with seccomp, to make this work, you
> need to have fsmount return ENOSYS, if you only have move_mount return
> ENOSYS, it's too late and libmount just returns the failure to the
> user rather than go down the old mount API code path.
> 

ugh, that is a problem. we don't have a hook in fsmount(), the closest
we get is the dentry_open() hook, but I don't see how a call to that
could reasonably detect that its coming from fsmount.

I am not opposed to a new hook at all, and I will throw something together
but I don't see that going into 6.7

>>> Overall, this still feels very rushed and broken to me. I understand
>>> that being able to trivially bypass AppArmor mount rules isn't great,
>>> but shipping code that makes the vast majority of said rules useless
>>> doesn't really feel like such an improvement. It's effectively turning
>>> what was a reasonably flexible policy engine around mount calls, into
>>> a binary switch to either allow everything or block everything.
>>>
>> I get what you are saying. The other side of the coin is oh look apparmor
>> isn't mediating mount if I just do this ...
> 
> Sure, but I suspect we could have avoided a bunch of pain if we had a
> chat around this change, had an opportunity to test it ahead of time.
> At the very least, we'd have had the flag file introduced from the
> start and hopefully some better handling like some kind of ENOSYS
> option.
> 
Sadly, yes I would have loved more testing, and I should have thought to
to poke you directly. I deliberately delayed sending this up during last
cycle when the report to kernel.org came in. I kicked a kernel with it over
to lxd (while still treating the issue as being in embargo. sigh, ...), and
then put it into the ubuntu proposed kernels and apparmor-next.

heard plenty on the user namespace mediation, but nothing of significance
around move_mount.

>>> We at the very least need a way to check whether we're dealing with
>>> this known broken state so that any change that's made to userspace
>>> can be made condition on this, ensuring that once mediation actually
>>> works as intended, those policies also go back to the way they're
>>> supposed to be.
>>>
>> agreed. That really should have been part of the patch to begin with.
> 
> Stéphane