From: Muchun Song <songmuchun@bytedance.com>
To: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH-mm v3] mm/list_lru: Optimize memcg_reparent_list_lru_node()
Date: Wed, 30 Mar 2022 15:20:50 +0800
Message-ID: <CAMZfGtUgFoME16K+ojw1HTpoUFz__CAF1ETw18s1BqVF87hTLA@mail.gmail.com>
In-Reply-To: <CAMZfGtXYH8Lex3hZGW4V8AzmqR03uzJnrBz8z7_1FD_P3Lgk-A@mail.gmail.com>

On Wed, Mar 30, 2022 at 2:38 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Wed, Mar 30, 2022 at 5:53 AM Waiman Long <longman@redhat.com> wrote:
> >
> > On 3/28/22 21:15, Muchun Song wrote:
> > > On Tue, Mar 29, 2022 at 3:12 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > >>> On 3/22/22 22:12, Muchun Song wrote:
> > >>>> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > >>>>> On 3/22/22 21:06, Muchun Song wrote:
> > >>>>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > >>>>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > >>>>>>> to be race free"), we are tracking the total number of lru
> > >>>>>>> entries in a list_lru_node in its nr_items field.  In the case of
> > >>>>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > >>>>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
> > >>>>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
> > >>>>>>> at this point.
> > >>>>>> Hi Waiman,
> > >>>>>>
> > >>>>>> Sorry for the late reply.  Quick question: what if there is an inflight
> > >>>>>> list_lru_add()?  How about the following race?
> > >>>>>>
> > >>>>>> CPU0:                               CPU1:
> > >>>>>> list_lru_add()
> > >>>>>>        spin_lock(&nlru->lock)
> > >>>>>>        l = list_lru_from_kmem(memcg)
> > >>>>>>                                        memcg_reparent_objcgs(memcg)
> > >>>>>>                                        memcg_reparent_list_lrus(memcg)
> > >>>>>>                                            memcg_reparent_list_lru()
> > >>>>>>                                                memcg_reparent_list_lru_node()
> > >>>>>>                                                    if (!READ_ONCE(nlru->nr_items))
> > >>>>>>                                                        // Miss reparenting
> > >>>>>>                                                        return
> > >>>>>>        // Assume 0->1
> > >>>>>>        l->nr_items++
> > >>>>>>        // Assume 0->1
> > >>>>>>        nlru->nr_items++
> > >>>>>>
> > >>>>>> IIUC, we use nlru->lock to serialise this scenario.
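> > >>>>>>
> > >>>>>> For reference, a rough sketch of the add-side path being raced
> > >>>>>> against (simplified from mm/list_lru.c of that era; the exact
> > >>>>>> list_lru_from_kmem() signature is approximated):
> > >>>>>>
> > >>>>>> bool list_lru_add(struct list_lru *lru, struct list_head *item)
> > >>>>>> {
> > >>>>>>         int nid = page_to_nid(virt_to_page(item));
> > >>>>>>         struct list_lru_node *nlru = &lru->node[nid];
> > >>>>>>         struct list_lru_one *l;
> > >>>>>>
> > >>>>>>         spin_lock(&nlru->lock);
> > >>>>>>         if (list_empty(item)) {
> > >>>>>>                 /* may still resolve to the dying src_idx memcg */
> > >>>>>>                 l = list_lru_from_kmem(lru, nid, item, NULL);
> > >>>>>>                 list_add_tail(item, &l->list);
> > >>>>>>                 l->nr_items++;
> > >>>>>>                 nlru->nr_items++;
> > >>>>>>                 spin_unlock(&nlru->lock);
> > >>>>>>                 return true;
> > >>>>>>         }
> > >>>>>>         spin_unlock(&nlru->lock);
> > >>>>>>         return false;
> > >>>>>> }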
> > >>>>> I guess this race is theoretically possible but very unlikely since it
> > >>>>> means a very long pause between list_lru_from_kmem() and the increment
> > >>>>> of nr_items.
> > >>>> It is more likely to happen in a VM.
> > >>>>
> > >>>>> How about the following changes to make sure that this race can't happen?
> > >>>>>
> > >>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
> > >>>>> index c669d87001a6..c31a0a8ad4e7 100644
> > >>>>> --- a/mm/list_lru.c
> > >>>>> +++ b/mm/list_lru.c
> > >>>>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> > >>>>>            struct list_lru_one *src, *dst;
> > >>>>>
> > >>>>>            /*
> > >>>>> -        * If there is no lru entry in this nlru, we can skip it immediately.
> > >>>>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > >>>>> +        * we can skip it immediately.
> > >>>>>             */
> > >>>>> -       if (!READ_ONCE(nlru->nr_items))
> > >>>>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > >>>> I think we also should insert a smp_rmb() between those two loads.
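> > >>>>
> > >>>> Something like this, on top of your diff (illustrative only):
> > >>>>
> > >>>>         if (!READ_ONCE(nlru->nr_items)) {
> > >>>>                 /*
> > >>>>                  * Order the two loads: nr_items must be read
> > >>>>                  * before the lock state is sampled.
> > >>>>                  */
> > >>>>                 smp_rmb();
> > >>>>                 if (!spin_is_locked(&nlru->lock))
> > >>>>                         return;
> > >>>>         }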
> > >>> Thinking about this some more, I believe that adding the spin_is_locked()
> > >>> check will be enough for x86. However, it will likely not be enough for
> > >>> arches with more relaxed memory semantics. So the safest way to avoid this
> > >>> possible race is to move the check inside the lock critical section, though
> > >>> that comes with a slightly higher overhead for the 0 nr_items case. I will
> > >>> send out a patch to correct that. Thanks for bringing this possible race
> > >>> to my attention.
> > >> Yes, I think it's not enough:
> > > I think it may be enough if we insert a smp_rmb() between those two loads.
> > >
> > >> CPU0                                       CPU1
> > >> READ_ONCE(&nlru->nr_items) -> 0
> > >>                                             spin_lock(&nlru->lock);
> > >>                                             nlru->nr_items++;
> > >                                               ^^^
> > >                                               |||
> > >                                               The nlru here is not the
> > >                                               same as the one on CPU0,
> > >                                               since CPU0 has done the
> > >                                               memcg reparenting.  So CPU0
> > >                                               will not miss the nlru
> > >                                               reparenting.  If I am wrong,
> > >                                               please correct me.  Thanks.
> > >>                                             spin_unlock(&nlru->lock);
> > >> && !spin_is_locked(&nlru->lock) -> 0
> >
> > I just realized that there is another lock/unlock pair in
> > memcg_reparent_objcgs():
> >
> > memcg_reparent_objcgs()
> >      spin_lock_irq()
> >      memcg reparenting
> >      spin_unlock_irq()
> >      percpu_ref_kill()
> >          spin_lock_irqsave()
> >          ...
> >          spin_unlock_irqrestore()
> >
> > The lock/unlock pair in percpu_ref_kill() prevents reads and writes from
> > being reordered before the memcg reparenting. Now I think just adding a
> > spin_is_locked() check with smp_rmb() should be enough. However, I would
> > like to change the ordering like this:
> >
> > if (!spin_is_locked(&nlru->lock)) {
> >          smp_rmb();
> >          if (!READ_ONCE(nlru->nr_items))
> >                  return;
> > }
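> >
> > For illustration, the fast path of memcg_reparent_list_lru_node() might
> > then look like this (a sketch on top of the -mm code, not a tested patch):
> >
> > static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> >                                          int src_idx, struct mem_cgroup *dst_memcg)
> > {
> >         struct list_lru_node *nlru = &lru->node[nid];
> >         struct list_lru_one *src, *dst;
> >
> >         /*
> >          * Skip this nlru only if it is empty *and* nobody is currently
> >          * inside the nlru->lock critical section.  The smp_rmb() orders
> >          * the lock-state load before the nr_items load.
> >          */
> >         if (!spin_is_locked(&nlru->lock)) {
> >                 smp_rmb();
> >                 if (!READ_ONCE(nlru->nr_items))
> >                         return;
> >         }
> >
> >         spin_lock_irq(&nlru->lock);
> >         /* ... splice src onto dst and transfer nr_items as before ... */
> >         spin_unlock_irq(&nlru->lock);
> > }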
>
> Does the following race still exist?

Ignore this. My bad. I think your approach could work.

>
>  CPU0:                               CPU1:
>                                         spin_is_locked(&nlru->lock)
>  list_lru_add()
>         spin_lock(&nlru->lock)
>         l = list_lru_from_kmem(memcg)
>                                         memcg_reparent_objcgs(memcg)
>                                         memcg_reparent_list_lrus(memcg)
>                                             memcg_reparent_list_lru()
>                                                 memcg_reparent_list_lru_node()
>                                                     if (!READ_ONCE(nlru->nr_items))
>                                                         // Miss reparenting
>                                                         return
>         // Assume 0->1
>         l->nr_items++
>         // Assume 0->1
>         nlru->nr_items++
>
> >
> > Otherwise, we will have the following problem:
> >
> > list_lru_add()
> >        spin_lock(&nlru->lock)
> >        l = list_lru_from_kmem(memcg)
> >                                        READ_ONCE(nlru->nr_items);
> >        // Assume 0->1
> >        l->nr_items++
> >        // Assume 0->1
> >        nlru->nr_items++
> >        spin_unlock(&nlru->lock)
> >                                        spin_is_locked()
>
> You are right.
>
> >
> > If the spin_is_locked() check happens before the spin_lock() in
> > list_lru_add(), list_lru_from_kmem() is guaranteed to get the reparented
> > memcg, and so the entry won't be added to the lru that is being reparented.
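> >
> > Spelled out as a timeline (roughly):
> >
> > CPU0 (reparenting):                    CPU1 (list_lru_add()):
> > memcg_reparent_objcgs() done
> > spin_is_locked() -> false
> > smp_rmb()
> > READ_ONCE(nlru->nr_items) -> 0
> >                                        spin_lock(&nlru->lock)
> >                                        l = list_lru_from_kmem(memcg)
> >                                        // sees the reparented (parent) memcg
> >                                        l->nr_items++
> >                                        nlru->nr_items++
> >                                        spin_unlock(&nlru->lock)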
> >
