From: Uladzislau Rezki <urezki@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>, RCU <rcu@vger.kernel.org>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Michal Hocko <mhocko@suse.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Theodore Y . Ts'o" <tytso@mit.edu>,
	Joel Fernandes <joel@joelfernandes.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.
Date: Wed, 30 Sep 2020 00:07:42 +0200	[thread overview]
Message-ID: <20200929220742.GB8768@pc636> (raw)
In-Reply-To: <38f42ca1-ffcd-04a6-bf11-618deffa897a@suse.cz>

On Tue, Sep 29, 2020 at 12:15:34PM +0200, Vlastimil Babka wrote:
> On 9/18/20 9:48 PM, Uladzislau Rezki (Sony) wrote:
> > Some background and kfree_rcu()
> > ===============================
> > The pointers to be freed are stored in a per-CPU array to improve
> > performance, to enable an easier-to-use API, to accommodate vmalloc
> > memory and to support the single-argument form of kfree_rcu() when
> > only a pointer is passed. More details are below.
> > 
> > In order to maintain such per-CPU arrays there is a need for dynamic
> > allocation when the current array is fully populated and a new block is
> > required. See the example below:
> > 
> >  0 1 2 3      0 1 2 3
> > |p|p|p|p| -> |p|p|p|p| -> NULL
> > 
> > there are two pointer blocks, each of which can store 4 addresses
> > that will be freed after a grace period has passed. In reality we
> > store PAGE_SIZE / sizeof(void *) pointers per block. To maintain such
> > blocks, a single page is obtained via the page allocator:
> > 
> >     bnode = (struct kvfree_rcu_bulk_data *)
> >         __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > 
> > after that it is attached as the new "head" and its "next" pointer is
> > set to the previous "head", so the list of blocks can be maintained and
> > grow dynamically until it gets drained by the reclaiming thread.
> > 
> > Please note that there is always a fallback if an allocation fails. For
> > the single-argument case, the fallback is a call to synchronize_rcu();
> > for the two-argument case, it is to use the rcu_head structure embedded
> > in the object being freed, paying the cache-miss penalty and invoking
> > kfree() per object instead of kfree_bulk() for groups of objects.
> > 
> > Why do we maintain arrays/blocks instead of linking objects via the
> > regular "struct rcu_head" technique? The main reasons are below:
> > 
> > a) Memory can be reclaimed by invoking the kfree_bulk()
> >    interface, which takes an array and the number of
> >    entries in it. That avoids the per-object overhead of
> >    calling kfree() per object, which reduces the reclamation
> >    time.
> > 
> > b) It improves locality and reduces the number of cache misses caused
> >    by "pointer chasing" between objects, which can be spread far apart
> >    from each other.
> > 
> > c) It supports the "single argument" form of kvfree_rcu():
> >    void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
> >    if (ptr)
> >         kvfree_rcu(ptr);
> > 
> >    We need it when an "rcu_head" is not embedded in a structure but the
> >    object must still be freed after a grace period. Therefore, in the
> >    single-argument case, such objects cannot be queued on a linked list.
> > 
> >    So nowadays, since we do not have a single argument but do see demand
> >    for it, people work around its absence with a simple but inefficient
> >    sequence:
> >    <snip>
> >        synchronize_rcu(); /* Can be long and blocks a current context */
> >        kfree(p);
> >    <snip>
> > 
> >    More details are here: https://lkml.org/lkml/2020/4/28/1626
> > 
> > d) It distinguishes vmalloc pointers from SLAB ones, making it possible
> >    to invoke the right freeing API for the right kind of pointer:
> >    kfree_bulk() or TBD: vmalloc_bulk().
> > 
> > e) Speeding up the post-grace-period freeing reduces the chance of a
> >    flood of callbacks OOMing the system.
> > 
> > Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166
> > 
> > Proposal
> > ========
> > Introduce a lock-free function that obtains a page from the per-CPU lists
> > of the current CPU. It returns NULL rather than acquiring any non-raw spinlock.
> > 
> > Description
> > ===========
> > The page allocator has two phases, a fast path and a slow one. We are
> > interested in the fast path and order-0 allocations. The fast path is in
> > turn divided into two steps, a lock-less one and one that takes a lock:
> > 
> > 1) As a first step the page allocator tries to obtain a page from the
> >    per-CPU list (each CPU has its own). That is why this step is
> >    lock-less and fast: it merely disables IRQs on the current CPU in
> >    order to access the per-CPU data and removes the first element from
> >    the pcp-list. The element/page is then returned to the user.
> > 
> > 2) If there is no page available in the per-CPU list, the second step is
> >    involved. It removes a specified number of elements from the buddy
> >    allocator, transferring them to the "supplied-list/per-cpu-list"
> >    described in [1].
> > 
> > Summarizing: __rcu_alloc_page_lockless() covers only step [1] and cannot
> > do step [2], because [2] requires access to zone->lock. This makes it
> > super fast, but a higher failure rate is also expected.
> > 
> > Usage: __rcu_alloc_page_lockless();
> > 
> > Link: https://lore.kernel.org/lkml/20200814215206.GL3982@worktop.programming.kicks-ass.net/
> > Not-signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> After reading all the threads and mulling this over, I am going to deflect from
> Mel and Michal and not oppose the idea of lockless allocation. I would even
> prefer to do it via a gfp flag and not a completely separate path. Not using
> the exact code from v1, I think it could be done in a way that we don't actually
> look at the new flag until we find that the pcplist is empty - which should not
> introduce overhead to the fast-fast path when the pcplist is not empty. It's more
> maintainable than adding new entry points, IMHO.
> 
Thanks for reading all of that from the beginning! It must have been tough, since
there were a lot of messages in the threads - I, at least, got lost in them.

I agree that adding a new entry point or a separate lock-less function can be
considered hard to maintain. I do have a question, though, about your different
take on it:

<snip>
bool is_pcp_cache_empty(gfp_t gfp)
{
    struct per_cpu_pages *pcp;
    struct zoneref *ref;
    unsigned long flags;
    bool empty;

    ref = first_zones_zonelist(node_zonelist(
            numa_node_id(), gfp), gfp_zone(gfp), NULL);
    if (!ref->zone)
            return true;

    local_irq_save(flags);
    pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
    empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
    local_irq_restore(flags);

    return empty;
}

local_irq_disable();
if (!is_pcp_cache_empty(GFP_NOWAIT))
    __get_free_page(GFP_NOWAIT);
local_irq_enable();
<snip>

Do you mean something like the above? I.e. some extra API function that
returns true or false depending on whether a fast-fast allocation can
occur. The code above works just fine and never touches the main zone->lock.

In other words, instead of introducing an extra GFP_LOCKLESS flag or any
new extra lock-less function, we could have something that checks the pcp
page cache list, and can thus guarantee that a request will be accomplished
using the fast-fast path.

>
> But there's still the condition that it's sufficiently shown that the allocation
> is useful for RCU. In that case I prefer that the page allocator (or MM in
> general) can give its users what they need without having to work around it.
>
I see your point: if there is something that can be generic and not specific,
that makes sense to me as well. I do share your view.

>
> Seems like GFP_ATOMIC is insufficient today so if that means we need a new flag
> for the raw spinlock context, so be it. But if your usage of __GFP_NO_LOCKS
> depends on the result of preempt_count() or similar check whether this is a
> context that needs it, I'd prefer to keep this in the caller.
>
By having the preemptible() check, we can just work around the GFP_ATOMIC and
GFP_NOWAIT flag issues, because both are insufficient. I mean, when we see that
a context is atomic (cannot sleep), we can bypass any usage of the page
allocator, because it would hit BUG() with the flags in question. And yes, the
best case is when we do not have to think much about the context, i.e. there is
something reliable from the point of view of using the page allocator.

Thanks Vlastimil!

--
Vlad Rezki


