From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Sahitya Tummala <stummala@codeaurora.org>
Cc: Alexander Polakov <apolyakov@beget.ru>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>,
	viro@zeniv.linux.org.uk, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2] fs/dcache.c: fix spin lockup issue on nlru->lock
Date: Thu, 22 Jun 2017 20:49:29 +0300	[thread overview]
Message-ID: <20170622174929.GB3273@esperanza> (raw)
In-Reply-To: <8d82c32d-6cbb-c39d-2f0e-0af23925b3c1@codeaurora.org>

On Thu, Jun 22, 2017 at 10:01:39PM +0530, Sahitya Tummala wrote:
> 
> 
> On 6/21/2017 10:01 PM, Vladimir Davydov wrote:
> >
> >>index cddf397..c8ca150 100644
> >>--- a/fs/dcache.c
> >>+++ b/fs/dcache.c
> >>@@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb)
> >>  		LIST_HEAD(dispose);
> >>  		freed = list_lru_walk(&sb->s_dentry_lru,
> >>-			dentry_lru_isolate_shrink, &dispose, UINT_MAX);
> >>+			dentry_lru_isolate_shrink, &dispose, 1024);
> >>  		this_cpu_sub(nr_dentry_unused, freed);
> >>  		shrink_dentry_list(&dispose);
> >>+		cond_resched();
> >>  	} while (freed > 0);
> >In an extreme case, a single invocation of list_lru_walk() can skip all
> >1024 dentries, in which case 'freed' will be 0 forcing us to break the
> >loop prematurely. I think we should loop until there's at least one
> >dentry left on the LRU, i.e.
> >
> >	while (list_lru_count(&sb->s_dentry_lru) > 0)
> >
> >However, even that wouldn't be quite correct, because list_lru_count()
> >iterates over all memory cgroups to sum list_lru_one->nr_items, which
> >can race with memcg offlining code migrating dentries off a dead cgroup
> >(see memcg_drain_all_list_lrus()). So it looks like to make this check
> >race-free, we need to account the number of entries on the LRU not only
> >per memcg, but also per node, i.e. add list_lru_node->nr_items.
> >Fortunately, list_lru entries can't be migrated between NUMA nodes.
> It looks like list_lru_count() iterates per node before iterating over all
> memory cgroups, as below -
> 
> unsigned long list_lru_count_node(struct list_lru *lru, int nid)
> {
>         long count = 0;
>         int memcg_idx;
> 
>         count += __list_lru_count_one(lru, nid, -1);
>         if (list_lru_memcg_aware(lru)) {
>                 for_each_memcg_cache_index(memcg_idx)
>                         count += __list_lru_count_one(lru, nid, memcg_idx);
>         }
>         return count;
> }
> 
> The first call to __list_lru_count_one() iterates over all the items on the
> node, i.e. nlru->lru.nr_items.

lru->node[nid].lru.nr_items returned by __list_lru_count_one(lru, nid, -1)
only counts items accounted to the root cgroup, not the total number of
entries on the node.
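
For reference, this is roughly how the relevant structures are laid out
(a sketch of the ~4.12 include/linux/list_lru.h definitions, trimmed to
the fields discussed here, config #ifdefs omitted):

	struct list_lru_one {
		struct list_head	list;
		/* may become negative during memcg reparenting */
		long			nr_items;
	};

	struct list_lru_memcg {
		/* array of per-cgroup lists, indexed by memcg cache id */
		struct list_lru_one	*lru[0];
	};

	struct list_lru_node {
		/* protects all lists on the node, including per cgroup */
		spinlock_t		lock;
		/* global list, used by the root cgroup */
		struct list_lru_one	lru;
		/* for cgroup-aware lrus: per-cgroup lists, otherwise NULL */
		struct list_lru_memcg	*memcg_lrus;
	};

The embedded 'lru' is only the root cgroup's list; everything charged to
a child cgroup lives in one of the memcg_lrus->lru[] lists.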

> Is my understanding correct? If not, could you please clarify on how to get
> the lru items per node?

What I mean is that iterating over list_lru_node->memcg_lrus to count the
number of entries on the node is racy. For example, suppose you have three
cgroups, #0, #1, and #2, with the following values of list_lru_one->nr_items:

  #0   #1   #2
   0    0   10

While list_lru_count_node() is at #1, cgroup #2 is offlined and its
list_lru_one is drained, i.e. its entries are migrated to the parent
cgroup, which happens to be #0, so we now see the following picture:

  #0   #1   #2
  10    0    0
       ^^^
       memcg_idx points here in list_lru_count_node()

Then the count returned by list_lru_count_node() will be 0, although
there are still 10 entries on the list.

To avoid this race, we could keep list_lru_node->lock held while walking
over list_lru_node->memcg_lrus, but that's too heavy. I'd prefer adding
list_lru_node->nr_items, which would hold the total number of list_lru
entries on the node, i.e. the sum of list_lru_node->lru.nr_items and all
of the list_lru_node->memcg_lrus->lru[]->nr_items.
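
Roughly something like this (just a sketch against the current list_lru
code to illustrate the idea, not a tested patch; it assumes a new
'long nr_items' field in struct list_lru_node, and list_lru_del() /
list_lru_isolate() would decrement the counter in the same way):

	bool list_lru_add(struct list_lru *lru, struct list_head *item)
	{
		int nid = page_to_nid(virt_to_page(item));
		struct list_lru_node *nlru = &lru->node[nid];
		struct list_lru_one *l;

		spin_lock(&nlru->lock);
		if (list_empty(item)) {
			l = list_lru_from_kmem(nlru, item);
			list_add_tail(item, &l->list);
			l->nr_items++;
			nlru->nr_items++;	/* new per-node total */
			spin_unlock(&nlru->lock);
			return true;
		}
		spin_unlock(&nlru->lock);
		return false;
	}

	unsigned long list_lru_count_node(struct list_lru *lru, int nid)
	{
		struct list_lru_node *nlru = &lru->node[nid];

		/*
		 * memcg_drain_all_list_lrus() only moves entries between
		 * per-cgroup lists on the same node, so the per-node total
		 * is not affected and no per-memcg iteration is needed.
		 */
		return nlru->nr_items;
	}

Since nlru->nr_items is only ever changed under nlru->lock, a reader can
at worst see a slightly stale value, not the inconsistent one you get by
summing the per-cgroup counts while they are being drained.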

Thread overview:

2017-06-12  0:47 [PATCH] mm/list_lru.c: use cond_resched_lock() for nlru->lock Sahitya Tummala
2017-06-12 13:11 ` Jan Kara
2017-06-15 21:05 ` Andrew Morton
2017-06-16 14:44   ` Sahitya Tummala
2017-06-17 11:14   ` Vladimir Davydov
2017-06-20  2:52     ` Sahitya Tummala
2017-06-21  6:39       ` [PATCH v2] fs/dcache.c: fix spin lockup issue on nlru->lock Sahitya Tummala
2017-06-21 16:31         ` Vladimir Davydov
2017-06-22 16:31           ` Sahitya Tummala
2017-06-22 17:49             ` Vladimir Davydov [this message]
2017-06-28  6:07               ` [PATCH v3 1/2] mm/list_lru.c: fix list_lru_count_node() to be race free Sahitya Tummala
2017-06-28  6:07                 ` [PATCH v3 2/2] fs/dcache.c: fix spin lockup issue on nlru->lock Sahitya Tummala
2017-06-28 17:18                 ` [PATCH v3 1/2] mm/list_lru.c: fix list_lru_count_node() to be race free Vladimir Davydov
2017-06-29  3:39                   ` [PATCH v4 1/2] mm/list_lru.c: fix list_lru_count_node() to be race free Sahitya Tummala
2017-07-01 16:28                     ` Vladimir Davydov
2017-06-29  3:39                   ` [PATCH v4 2/2] fs/dcache.c: fix spin lockup issue on nlru->lock Sahitya Tummala
2017-06-29 22:48                     ` Andrew Morton
2017-06-30  3:16                       ` Sahitya Tummala
2017-07-01 16:28                     ` Vladimir Davydov