From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751183AbeAVUbX (ORCPT ); Mon, 22 Jan 2018 15:31:23 -0500 Received: from mail-yw0-f195.google.com ([209.85.161.195]:45782 "EHLO mail-yw0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750878AbeAVUbV (ORCPT ); Mon, 22 Jan 2018 15:31:21 -0500 X-Google-Smtp-Source: AH8x22735TgFI0FjUHrj5Q8mh9+SfEx+FA12sIujpWkolaUqq5nWr0D3oN0OMxAshWkXhU336q/qRS3zNj4c1NIFw1A= MIME-Version: 1.0 In-Reply-To: References: <1509128538-50162-1-git-send-email-yang.s@alibaba-inc.com> <20171030124358.GF23278@quack2.suse.cz> <76a4d544-833a-5f42-a898-115640b6783b@alibaba-inc.com> <20171031101238.GD8989@quack2.suse.cz> <20171109135444.znaksm4fucmpuylf@dhcp22.suse.cz> <10924085-6275-125f-d56b-547d734b6f4e@alibaba-inc.com> <20171114093909.dbhlm26qnrrb2ww4@dhcp22.suse.cz> <20171115093131.GA17359@quack2.suse.cz> From: Amir Goldstein Date: Mon, 22 Jan 2018 22:31:20 +0200 Message-ID: Subject: Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg To: Shakeel Butt Cc: Jan Kara , Yang Shi , Michal Hocko , linux-fsdevel , Linux MM , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt wrote: > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara wrote: >> On Wed 15-11-17 01:32:16, Yang Shi wrote: >>> >>> >>> On 11/14/17 1:39 AM, Michal Hocko wrote: >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote: >>> >> >>> >> >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote: >>> >>>[Sorry for the late reply] >>> >>> >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote: >>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote: >>> >>>[...] >>> >>>>>I do agree it is not fair and not neat to account to producer rather than >>> >>>>>misbehaving consumer, but current memcg design looks not support such use >>> >>>>>case. And, the other question is do we know who is the listener if it >>> >>>>>doesn't read the events? >>> >>>> >>> >>>>So you never know who will read from the notification file descriptor but >>> >>>>you can simply account that to the process that created the notification >>> >>>>group and that is IMO the right process to account to. >>> >>> >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of >>> >>>those objects then this should be a target of the charge. >>> >>> >>> >>>>I agree that current SLAB memcg accounting does not allow to account to a >>> >>>>different memcg than the one of the running process. However I *think* it >>> >>>>should be possible to add such interface. Michal? >>> >>> >>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing >>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc, >>> >>>right? >>> >> >>> >>Yes. >>> >> >>> >>I took a look at the implementation and the callsites of >>> >>memcg_kmem_charge_memcg(). It looks it is called by: >>> >> >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg >>> >>* allocate new slab page, charge to memcg_params.memcg >>> >> >>> >>I think this is the plumbing you mentioned, right? >>> > >>> >Maybe I have misunderstood, but you are using slab allocator. So you >>> >would need to force it to use a different charging context than current. >>> >>> Yes. >>> >>> >I haven't checked deeply but this doesn't look trivial to me. >>> >>> I agree. This is also what I explained to Jan and Amir in earlier >>> discussion. >> >> And I also agree. But the fact that it is not trivial does not mean that it >> should not be done... >> > > I am currently working on directed or remote memcg charging for a > different usecase and I think that would be helpful here as well. > > I have two questions though: > > 1) Is fsnotify_group the right structure to hold the reference to > target mem_cgroup for charging? I think it is. The process who set up the group and determined the unlimited events queue size and did not consume the events from the queue in a timely manner is the process to blame for the OOM situation. > 2) Remote charging can trigger an OOM in the target memcg. In this > usecase, I think, there should be security concerns if the events > producer can trigger OOM in the memcg of the monitor. We can either > change these allocations to use __GFP_NORETRY or some new gfp flag to > not trigger oom-killer. So, is this valid concern or am I > over-thinking? > I think that is a very valid concern, but not sure about the solution. For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF (group->priority == 0), I think it makes sense to let oom-killer kill the listener which is not keeping up with consuming events. For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT (group->priority > 0) allowing an adversary to trigger oom-killer and kill the listener could bypass permission event checks. So we could use different allocation flags for permission events or for groups with high priority or for groups that have permission events in their mask, so an adversary could not use non-permission events to oom-kill a listener that is also monitoring permission events. Generally speaking, permission event monitors should not be setting FAN_UNLIMITED_QUEUE and should not be causing oom (because permission events are not queued they are blocking), but there is nothing preventing a process to setup a FAN_CLASS_CONTENT group with FAN_UNLIMITED_QUEUE and also monitor non-permission events. There is also nothing preventing a process from setting up one FAN_CLASS_CONTENT listener for permission events and another FAN_CLASS_NOTIF listener for non-permission event. Maybe it is not wise, but we don't know that there are no such processes out there. I know I am throwing a lot of confusing information, but here is an argument that might be simple enough: If a process has setup a group with permission event mask, then the administrator has granted this process the permission to bring the system to a halt even if the process misbehaves (because better halt the system than let it be compromised). If that interpretation is acceptable then oom-killing a process which has setup a group with permission event mask is not acceptable. Thanks, Amir.