linux-kernel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andi Kleen <andi@firstfloor.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Bob Liu <bob.liu@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	Greg Thelen <gthelen@google.com>, Hugh Dickins <hughd@google.com>,
	Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Luigi Semenzato <semenzato@google.com>,
	Mel Gorman <mgorman@suse.de>, Metin Doslu <metin@citusdata.com>,
	Michel Lespinasse <walken@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Ozgun Erdogan <ozgun@citusdata.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Roman Gushchin <klamm@yandex-team.ru>,
	Ryan Mallon <rmallon@gmail.com>, Tejun Heo <tj@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 9/9] mm: keep page cache radix tree nodes in check
Date: Tue, 21 Jan 2014 14:03:58 +1100	[thread overview]
Message-ID: <20140121030358.GN18112@dastard> (raw)
In-Reply-To: <20140120231737.GS6963@cmpxchg.org>

On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > +	/* Only shadow entries in there, keep track of this node */
> > > +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > +	    list_empty(&node->private_list)) {
> > > +		node->private_data = mapping;
> > > +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > +	}
> > 
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - the only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate whether the
> > object was added to/removed from the list - they do this list_empty()
> > check internally. i.e. the correct, safe way to conditionally update
> > state iff the object was added to the LRU is:
> > 
> > 	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > 		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > 			node->private_data = mapping;
> > 	}
> > 
> > > +	radix_tree_replace_slot(slot, page);
> > > +	mapping->nrpages++;
> > > +	if (node) {
> > > +		node->count++;
> > > +		/* Installed page, can't be shadow-only anymore */
> > > +		if (!list_empty(&node->private_list))
> > > +			list_lru_del(&workingset_shadow_nodes,
> > > +				     &node->private_list);
> > > +	}
> > 
> > Same issue here:
> > 
> > 	if (node) {
> > 		node->count++;
> > 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > 	}
> 
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.

Can you please add that as a comment somewhere explaining why it is
safe to do this?

> > > +		case LRU_REMOVED_RETRY:
> > >  			if (--nlru->nr_items == 0)
> > >  				node_clear(nid, lru->active_nodes);
> > >  			WARN_ON_ONCE(nlru->nr_items < 0);
> > >  			isolated++;
> > > +			/*
> > > +			 * If the lru lock has been dropped, our list
> > > +			 * traversal is now invalid and so we have to
> > > +			 * restart from scratch.
> > > +			 */
> > > +			if (ret == LRU_REMOVED_RETRY)
> > > +				goto restart;
> > >  			break;
> > >  		case LRU_ROTATE:
> > >  			list_move_tail(item, &nlru->list);
> > 
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> > 
> > 		case LRU_REMOVED_RETRY:
> > 			assert_spin_locked(&nlru->lock);
> > 		case LRU_REMOVED:
> 
> Ah, good idea.  How about adding it to LRU_RETRY as well?

Yup, good idea.

> > > +static struct shrinker workingset_shadow_shrinker = {
> > > +	.count_objects = count_shadow_nodes,
> > > +	.scan_objects = scan_shadow_nodes,
> > > +	.seeks = DEFAULT_SEEKS * 4,
> > > +	.flags = SHRINKER_NUMA_AWARE,
> > > +};
> > 
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weightings/importance
> > we give to slab reclaim so we can determine if it's actually
> > achieving the desired balance under different loads...
> 
> This is not an exact science, to say the least.

I know, that's why I asked it be documented rather than be something
kept in your head.

> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
> 
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
> 
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
> 
> "Seeks" currently encodes 3 ratios:
> 
>   1. the cost of creating an object vs. a page
> 
>   2. the expected number of objects vs. pages

It doesn't encode that at all. If it did, then the default value
wouldn't be "2".

>   3. the cost of reclaiming an object vs. a page

When you consider #3 in conjunction with #1, the actual intended
meaning of .seeks is "the cost of replacing this object in the
cache compared to the cost of replacing a page cache page."

> but they are not necessarily correlated.  How I would like to
> configure the shadow shrinker instead is:
> 
>   o scan objects when reclaim efficiency is down to 75%, because they
>     are more valuable than use-once cache but less than workingset
> 
>   o scan objects when the ratio between them and the number of pages
>     exceeds 1/32 (one shadow entry for each resident page, up to 64
>     entries per shrinkable object, assume 50% packing for robustness)
> 
>   o as the expected balance between objects and lru pages is 1:32,
>     reclaim one object for every 32 reclaimed LRU pages, instead of
>     assuming that number of scanned pages corresponds meaningfully to
>     number of objects to scan.

You're assuming that every radix tree node has a full population of
pages. This only occurs on sequential read and write workloads, and
so isn't going to be true for things like mapped executables or any
semi-randomly accessed data set...

> "4" just doesn't have the same ring to it.

Right, but you still haven't explained how you came to the value of
"4"....

> It would be great if we could eliminate the reclaim cost assumption by
> turning the nr_to_scan into a nr_to_reclaim, and then set the other
> two ratios independently.

That doesn't work for caches that are full of objects that can't (or
won't) be reclaimed immediately. The CPU cost of repeatedly scanning
to find N reclaimable objects when you have millions of objects in
the cache is prohibitive.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 58+ messages
2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
2014-01-13  1:17   ` Minchan Kim
2014-01-10 18:10 ` [patch 2/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2014-01-10 18:10 ` [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2014-01-10 18:25   ` Rik van Riel
2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2014-01-10 19:22   ` Rik van Riel
2014-01-13  1:25   ` Minchan Kim
2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2014-01-10 19:39   ` Rik van Riel
2014-01-13  2:01   ` Minchan Kim
2014-01-22 17:47     ` Johannes Weiner
2014-01-23  5:07       ` Minchan Kim
2014-02-12 14:00   ` Mel Gorman
2014-03-12  1:15     ` Johannes Weiner
2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
2014-01-10 22:30   ` Rik van Riel
2014-01-13  2:18   ` Minchan Kim
2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
2014-01-10 22:51   ` Rik van Riel
2014-01-13  2:42   ` Minchan Kim
2014-01-14  1:01   ` Bob Liu
2014-01-14 19:16     ` Johannes Weiner
2014-01-15  2:57       ` Bob Liu
2014-01-15  3:52         ` Zhang Yanfei
2014-01-16 21:17         ` Johannes Weiner
2014-01-10 18:10 ` [patch 8/9] lib: radix_tree: tree node interface Johannes Weiner
2014-01-10 22:57   ` Rik van Riel
2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2014-01-10 23:09   ` Rik van Riel
2014-01-13  7:39   ` Minchan Kim
2014-01-14  5:40     ` Minchan Kim
2014-01-22 18:42     ` Johannes Weiner
2014-01-23  5:20       ` Minchan Kim
2014-01-23 19:22         ` Johannes Weiner
2014-01-27  2:31           ` Minchan Kim
2014-01-15  5:55   ` Bob Liu
2014-01-16 22:09     ` Johannes Weiner
2014-01-17  0:05   ` Dave Chinner
2014-01-20 23:17     ` Johannes Weiner
2014-01-21  3:03       ` Dave Chinner [this message]
2014-01-21  5:50         ` Johannes Weiner
2014-01-22  3:06           ` Dave Chinner
2014-01-22  6:57             ` Johannes Weiner
2014-01-22 18:48               ` Johannes Weiner
2014-01-23  5:57       ` Minchan Kim
  -- strict thread matches above, loose matches on Subject: below --
2013-12-02 19:21 [patch 0/9] mm: thrash detection-based file cache sizing v7 Johannes Weiner
2013-12-02 19:21 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2013-12-02 22:10   ` Dave Chinner
2013-12-02 22:46     ` Johannes Weiner
2013-11-24 23:38 [patch 0/9] mm: thrash detection-based file cache sizing v6 Johannes Weiner
2013-11-24 23:38 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2013-11-25 23:49   ` Dave Chinner
2013-11-26 21:27     ` Johannes Weiner
2013-11-26 22:29       ` Dave Chinner
2013-11-26 23:00         ` Johannes Weiner
2013-11-27  0:59           ` Dave Chinner
2013-11-26  0:13   ` Andrew Morton
2013-11-26 22:05     ` Johannes Weiner
