From: Vlastimil Babka <vbabka@suse.cz>
To: "Uladzislau Rezki (Sony)" <urezki@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>, RCU <rcu@vger.kernel.org>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	"Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Michal Hocko <mhocko@suse.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Theodore Y . Ts'o" <tytso@mit.edu>,
	Joel Fernandes <joel@joelfernandes.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.
Date: Tue, 29 Sep 2020 12:15:34 +0200	[thread overview]
Message-ID: <38f42ca1-ffcd-04a6-bf11-618deffa897a@suse.cz> (raw)
In-Reply-To: <20200918194817.48921-3-urezki@gmail.com>

On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> Some background and kfree_rcu()
> ===============================
> The pointers to be freed are stored in a per-cpu array to improve
> performance, to enable an easier-to-use API, to accommodate vmalloc
> memory and to support the single-argument form of kfree_rcu() where
> only a pointer is passed. More details are below.
> 
> In order to maintain such per-CPU arrays, dynamic allocation is needed
> when the current array is fully populated and a new block is required.
> See the example below:
> 
>  0 1 2 3      0 1 2 3
> |p|p|p|p| -> |p|p|p|p| -> NULL
> 
> There are two pointer blocks; each one can store 4 addresses, which
> will be freed after a grace period has passed. In reality we store
> PAGE_SIZE / sizeof(void *) pointers per block. To maintain such blocks,
> a single page is obtained via the page allocator:
> 
>     bnode = (struct kvfree_rcu_bulk_data *)
>         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> 
> After that, the block is attached as the new "head" and its "next"
> pointer is set to the previous "head", so the list of blocks can be
> maintained and grow dynamically until it gets drained by the
> reclaiming thread.
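> 
> In simplified form (the real code wraps this in per-CPU bookkeeping),
> a block is just a page-sized node of a singly linked list:
> 
>     /* One page holds this header plus as many pointers as fit. */
>     struct kvfree_rcu_bulk_data {
>         unsigned long nr_records;          /* used slots */
>         struct kvfree_rcu_bulk_data *next; /* previous "head" */
>         void *records[];                   /* fills the rest of the page */
>     };
> 
>     /* Chain a freshly obtained block in front of the list: */
>     bnode->nr_records = 0;
>     bnode->next = head;
>     head = bnode;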
> 
> Please note that there is always a fallback if an allocation fails. For
> the single-argument case this is a call to synchronize_rcu(); for the
> two-argument case it is to use the rcu_head structure embedded in the
> object being freed, paying the cache-miss penalty and invoking kfree()
> per object instead of kfree_bulk() on groups of objects.
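> 
> In pseudo-code the fallback looks roughly like this (the names here
> are illustrative, not the actual helpers):
> 
>     if (!bnode) {
>         if (single_argument) {
>             /* No rcu_head is available: block the caller instead. */
>             synchronize_rcu();
>             kvfree(ptr);
>         } else {
>             /* Queue via the object's embedded rcu_head; objects are
>              * later freed one by one with kfree(). */
>             call_rcu(head, free_one_object_cb);
>         }
>     }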
> 
> Why do we maintain arrays/blocks instead of linking objects via the
> regular "struct rcu_head" technique? The main reasons are below:
> 
> a) Memory can be reclaimed by invoking the kfree_bulk()
>    interface, which takes an array and the number of
>    entries in it. That avoids the per-object overhead of
>    calling kfree() per object and reduces the reclamation
>    time (see the sketch after this list).
> 
> b) It improves locality and reduces the number of cache misses
>    caused by "pointer chasing" between objects, which can be
>    spread far apart from each other.
> 
> c) It supports the "single argument" form of kvfree_rcu():
>    void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
>    if (ptr)
>         kvfree_rcu(ptr);
> 
>    We need it when an "rcu_head" is not embedded into a structure but
>    the object must still be freed after a grace period. For the single
>    argument, such objects cannot be queued on a linked list.
> 
>    Nowadays, since we do not have a single-argument variant but see
>    demand for it, people work around its absence with a simple but
>    inefficient sequence:
>    <snip>
>        synchronize_rcu(); /* Can be long and blocks the current context */
>        kfree(p);
>    <snip>
> 
>    More details are here: https://lkml.org/lkml/2020/4/28/1626
> 
> d) To distinguish vmalloc pointers from SLAB ones. It becomes possible
>    to invoke the right freeing API for the right kind of pointer:
>    kfree_bulk() or, TBD, vmalloc_bulk().
> 
> e) Speeding up the post-grace-period freeing reduces the chance of a
>    flood of callbacks OOMing the system.
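> 
> For illustration, draining one block from (a) boils down to a single
> bulk call instead of one kfree() per pointer:
> 
>     /* Free all pointers accumulated in a block at once. */
>     kfree_bulk(bnode->nr_records, bnode->records);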
> 
> Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
> 
> Proposal
> ========
> Introduce a lock-free function that obtains a page from the per-cpu
> lists on the current CPU. It returns NULL rather than acquiring any
> non-raw spinlock.
> 
> Description
> ===========
> The page allocator has two phases, a fast path and a slow one. We are
> interested in the fast path and order-0 allocations. The fast path is
> in turn divided into two steps, a lockless one and one that is not:
> 
> 1) As a first step, the page allocator tries to obtain a page from the
>    per-cpu list, of which each CPU has its own. That is why this step
>    is lockless and fast. Basically it disables irqs on the current CPU
>    in order to access per-cpu data and removes the first element from
>    the pcp-list. That element/page is returned to the user.
> 
> 2) If there is no page available in the per-cpu list, the second step
>    is involved. It removes a specified number of elements from the
>    buddy allocator, transferring them to the per-cpu list described
>    in [1].
> 
> Summarizing: __rcu_alloc_page_lockless() covers only step [1] and
> cannot do step [2], because [2] requires access to zone->lock. This
> makes it very fast, but a higher failure rate is also expected.
> 
> Usage: __rcu_alloc_page_lockless();
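> 
> For example, mirroring the __get_free_page() call above (a sketch; the
> fallback path is whatever the caller already has in place):
> 
>     bnode = (struct kvfree_rcu_bulk_data *)
>         __rcu_alloc_page_lockless();
>     if (!bnode) {
>         /* A higher failure rate is expected here; take the
>          * existing fallback path instead. */
>     }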
> 
> Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
> Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

After reading all the threads and mulling this over, I am going to
diverge from Mel and Michal and not oppose the idea of a lockless
allocation. I would even prefer to do it via a gfp flag and not a
completely separate path. Rather than using the exact code from v1, I
think it could be done in a way where we don't actually look at the new
flag until we find that the pcplist is empty - which should not
introduce overhead to the fast-fast path when the pcplist is not empty.
It's more maintainable than adding new entry points, IMHO.
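
Very roughly, something like this in __rmqueue_pcplist() (flag name and
placement are just illustrative, not a concrete proposal):

	/* Only consult the new flag once the pcplist is found empty,
	 * so the common fast path stays untouched. */
	if (list_empty(list)) {
		if (gfp_flags & __GFP_NO_LOCKS)
			return NULL; /* a refill would need zone->lock */
		pcp->count += rmqueue_bulk(zone, 0, pcp->batch, list,
					   migratetype, alloc_flags);
		if (unlikely(list_empty(list)))
			return NULL;
	}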

But this is still conditional on it being sufficiently shown that the
lockless allocation is useful for RCU. In that case I prefer that the
page allocator (or MM in general) give its users what they need without
them having to work around it. GFP_ATOMIC seems to be insufficient
today, so if that means we need a new flag for the raw spinlock
context, so be it. But if your usage of __GFP_NO_LOCKS depends on the
result of preempt_count() or a similar check of whether this is a
context that needs it, I'd prefer to keep that check in the caller.
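
That is, something along these lines on the caller side (a sketch;
preemptible() stands in for whatever preempt_count()-based check you
actually need):

	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;

	/* Only ask for the lockless-only behaviour in contexts that
	 * really cannot take a non-raw spinlock. */
	if (!preemptible())
		gfp |= __GFP_NO_LOCKS;

	bnode = (struct kvfree_rcu_bulk_data *)__get_free_page(gfp);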
