From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756670AbdJJWV6 (ORCPT <rfc822;w@1wt.eu>);
        Tue, 10 Oct 2017 18:21:58 -0400
Received: from mail-wm0-f52.google.com ([74.125.82.52]:45234 "EHLO
        mail-wm0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1756347AbdJJWV4 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 10 Oct 2017 18:21:56 -0400
X-Google-Smtp-Source: AOwi7QDYyz5n0dPfsJzpMBUB0BztxMYi+evIZua5TtuJYG77vIqgutaYGS+GsWrFX33mSlJ3khb4gca3g7x2Nv2wMBc=
MIME-Version: 1.0
In-Reply-To: <20171010091042.eokqlrqec33w3qzt@dhcp22.suse.cz>
References: <20171005222144.123797-1-shakeelb@google.com> <20171006075900.icqjx5rr7hctn3zd@dhcp22.suse.cz>
 <CALvZod7YN4JCG7Anm2FViyZ0-APYy+nxEd3nyxe5LT_P0FC9wg@mail.gmail.com>
 <20171009062426.hmqedtqz5hkmhnff@dhcp22.suse.cz> <xr93a810xl77.fsf@gthelen.svl.corp.google.com>
 <20171009180409.z3mpk3m7m75hjyfv@dhcp22.suse.cz> <20171009181754.37svpqljub2goojr@dhcp22.suse.cz>
 <20171010091042.eokqlrqec33w3qzt@dhcp22.suse.cz>
From: Shakeel Butt <shakeelb@google.com>
Date: Tue, 10 Oct 2017 15:21:53 -0700
Message-ID: <CALvZod5VzPRRbhxLSn5GkgPbJEVJ9X5SfA=rjzRtTqLbCAe+eA@mail.gmail.com>
Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg
To: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Linux MM <linux-mm@kvack.org>, linux-fsdevel@vger.kernel.org,
        LKML <linux-kernel@vger.kernel.org>,
        Johannes Weiner <hannes@cmpxchg.org>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Oct 8, 2017 at 11:24 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>>
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>

I looked at all the syscalls which allocate from 'names_cache' and
tried to narrow down the ones whose man pages do not mention that they
can return ENOMEM. I found a couple of syscalls, like truncate(),
readdir() & getdents(), whose man pages do not list ENOMEM but which
this patch will make return ENOMEM.
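
To illustrate what that means for a caller, here is a rough userspace
sketch, not part of the patch. The helper name truncate_checked() is
made up for this example. It only shows that a caller has to be ready
to see ENOMEM next to the documented errors:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: returns 0 on success or a
 * negative errno value, so the caller can tell -ENOMEM apart from the
 * errors the truncate() man page already documents.
 */
static int truncate_checked(const char *path, off_t length)
{
	if (truncate(path, length) == 0)
		return 0;
	return -errno;
}
```

A caller that only handles the documented errors would see -ENOMEM here
as an unexpected failure.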

>>
>> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>>
>> Yes, there is a system limit of maximum number of open files. However
>> this limit is shared between different users on the system and one
>> user can hog this resource. To cater that, we set the maximum limit
>> very high and let the memory limit of each user limit the number of
>> files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?

For filp, I found _sysctl(), whose man page does not mention ENOMEM.
However, the man page says not to use it.

On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 09-10-17 20:17:54, Michal Hocko wrote:
>> the primary concern for this patch was whether we really need/want to
>> charge short therm objects which do not outlive a single syscall.
>
> Let me expand on this some more. What is the benefit of kmem accounting
> of such an object? It cannot stop any runaway as a syscall lifetime
> allocations are bound to number of processes which we kind of contain by
> other means.

We could contain it by limiting the number of processes or threads, but
for us applications with thousands of threads are very common. So
limiting the number of threads/processes will not work.

> If we do account then we put a memory pressure due to
> something that cannot be reclaimed by no means. Even the memcg OOM
> killer would simply kick a single path while there might be others
> to consume the same type of memory.
>
> So what is the actual point in accounting these? Does it help to contain
> any workload better? What kind of workload?
>

I think the benefits will be isolation and more accurate billing. As I
said before, we have observed 100s of MiBs in names_cache on many
machines, and the cumulative amount is not something we can ignore as
just memory overhead.

> Or am I completely wrong and name objects can outlive a syscall
> considerably?
>

No, I didn't find any instance of the name objects outliving the syscall.

Anyway, we can discuss names_cache further. Do you have any objection
to charging filp?