Re: [PATCH v2] mm/list_lru: Fix possible race in memcg_reparent_list_lru_node()

From: Roman Gushchin <roman.gushchin@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Waiman Long <longman@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Muchun Song <songmuchun@bytedance.com>
Subject: Re: [PATCH v2] mm/list_lru: Fix possible race in memcg_reparent_list_lru_node()
Date: Wed, 30 Mar 2022 19:48:45 -0700	[thread overview]
Message-ID: <89B53D3A-FCC5-4107-8D49-81D5B9AE5172@linux.dev> (raw)
In-Reply-To: <20220330191440.1cc1b2de2b849d1ba93d2ba7@linux-foundation.org>

> On Mar 30, 2022, at 7:14 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Wed, 30 Mar 2022 13:26:46 -0400 Waiman Long <longman@redhat.com> wrote:
> 
>> Muchun Song found out there could be a race between list_lru_add()
>> and memcg_reparent_list_lru_node() causing the later function to miss
>> reparenting of a lru entry as shown below:
>> 
>> CPU0:                           CPU1:
>> list_lru_add()
>>    spin_lock(&nlru->lock)
>>    l = list_lru_from_kmem(memcg)
>>                                memcg_reparent_objcgs(memcg)
>>                                memcg_reparent_list_lrus(memcg)
>>                                    memcg_reparent_list_lru()
>>                                        memcg_reparent_list_lru_node()
>>                                            if (!READ_ONCE(nlru->nr_items))
>>                                                // Miss reparenting
>>                                                return
>>    // Assume 0->1
>>    l->nr_items++
>>    // Assume 0->1
>>    nlru->nr_items++
>> 
>> Though it is not likely that a list_lru_node that has 0 item suddenly
>> has a newly added lru entry at the end of its life. The race is still
>> theoretically possible.
>> 
>> With the lock/unlock pair used within the percpu_ref_kill() which is
>> the last function call of memcg_reparent_objcgs(), any read issued
>> in memcg_reparent_list_lru_node() will not be reordered before the
>> reparenting of objcgs.
>> 
>> Adding a !spin_is_locked()/smp_rmb()/!READ_ONCE(nlru->nr_items) check
>> to ensure that either the reading of nr_items is valid or the racing
>> list_lru_add() will see the reparented objcg.
>> 
>> ...
>> 
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -395,10 +395,33 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
>>    struct list_lru_one *src, *dst;
>> 
>>    /*
>> -     * If there is no lru entry in this nlru, we can skip it immediately.
>> +     * With the lock/unlock pair used within the percpu_ref_kill()
>> +     * which is the last function call of memcg_reparent_objcgs(), any
>> +     * read issued here will not be reordered before the reparenting
>> +     * of objcgs.
>> +     *
>> +     * Assuming a racing list_lru_add():
>> +     * list_lru_add()
>> +     *                <- memcg_reparent_list_lru_node()
>> +     *   spin_lock(&nlru->lock)
>> +     *   l = list_lru_from_kmem(memcg)
>> +     *   nlru->nr_items++
>> +     *   spin_unlock(&nlru->lock)
>> +     *                <- memcg_reparent_list_lru_node()
>> +     *
>> +     * The !spin_is_locked(&nlru->lock) check is true means it is
>> +     * either before the spin_lock() or after the spin_unlock(). In the
>> +     * former case, list_lru_add() will see the reparented objcg and so
>> +     * won't touch the lru to be reparented. In the later case, it will
>> +     * see the updated nr_items. So we can use the optimization that if
>> +     * there is no lru entry in this nlru, skip it immediately.
>>     */
>> -    if (!READ_ONCE(nlru->nr_items))
>> -        return;
>> +    if (!spin_is_locked(&nlru->lock)) {
> 
> ick.
> 
>> +        /* nr_items read must be ordered after nlru->lock */
>> +        smp_rmb();
>> +        if (!READ_ONCE(nlru->nr_items))
>> +            return;
>> +    }
> 
> include/linux/spinlock_up.h has
> 
> #define arch_spin_is_locked(lock)    ((void)(lock), 0)
> 
> so this `if' will always be true on CONFIG_SMP=n.  Will the kernel
> still work?

I guess yes, because this race is not possible on a !smp machine.

> 
> At the very least let's have changelogging and commenting explaining
> that we've actually thought about this.
> 
> Preferably, can we fix this hole properly and avoid this hack?  There is
> a reason for this:
> 
> hp2:/usr/src/25> grep spin_is_locked mm/*.c
> hp2:/usr/src/25> 

But honestly, I’d drop the original optimization together with the fix, if only there is no _real world_ data on the problem and the improvement. It seems like it has started as a nice simple improvement, but the race makes it complex and probably not worth the added complexity and fragility.

Thanks!