From: Nick Piggin <npiggin@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <npiggin@kernel.dk>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [patch 31/35] fs: icache per-zone inode LRU
Date: Wed, 20 Oct 2010 21:41:21 +1100
Message-ID: <20101020104121.GA5999@amd>
In-Reply-To: <20101020101906.GM32255@dastard>

On Wed, Oct 20, 2010 at 09:19:06PM +1100, Dave Chinner wrote:
> On Wed, Oct 20, 2010 at 02:20:24PM +1100, Nick Piggin wrote:
> > 
> > Well if XFS were to use per-ZONE shrinkers, it would remain with a
> > single shrinker context per-sb like it has now, but it would divide
> > its object management into per-zone structures.
> 
> <sigh>
> 
> I don't think anyone wants per-ag X per-zone reclaim lists on a 1024
> node machine with a 1,000 AG (1PB) filesystem.

Maybe not, but a 1024 node machine will *definitely* need to minimise
interconnect traffic and remote memory access. So if each node can't
spare enough memory for a couple of thousand LRU list heads, then
XFS's per-ag LRUs may need rethinking (they may provide reasonable
scalability on well partitioned workloads, but they cannot help the
reclaimers do the right thing -- remote memory accesses will still
dominate inode LRU scanning there).
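
To be concrete about the kind of per-zone structure I mean, here is a
minimal sketch (the struct, the helpers and the i_lru link are
illustrative names, not the exact code from this series):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/*
 * One LRU per zone: list head, lock and count all belong to the
 * zone, so reclaim on behalf of a zone only ever touches memory
 * local to that zone's node.
 */
struct zone_inode_lru {
	spinlock_t	 lock;		/* protects list and nr_items */
	struct list_head list;		/* reclaimable inodes in this zone */
	unsigned long	 nr_items;
};

/* An inode belongs to the zone its memory was allocated from. */
static struct zone *inode_zone(struct inode *inode)
{
	return page_zone(virt_to_page(inode));
}

static void zone_inode_lru_add(struct zone_inode_lru *lru,
			       struct inode *inode)
{
	spin_lock(&lru->lock);
	list_add(&inode->i_lru, &lru->list);	/* i_lru: the inode's LRU link */
	lru->nr_items++;
	spin_unlock(&lru->lock);
}

The whole point is that both the list heads and the inodes they link
are node-local to the reclaimer that will scan them.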


> As I have already said, the XFS inode caches are optimised in
> structure to minimise IO and maximise internal filesystem
> parallelism. They are not optimised for per-cpu or NUMA scalability
> because if you don't have filesystem level parallelism, you can't
> scale to large numbers of concurrent operations across large numbers
> of CPUs in the first place.

And as I have already said, nothing in my patches changes that.
What they provide is the *opportunity* for shrinkers to take
advantage of per-zone scalability and improved reclaim patterns.
Nothing forces it, though.
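
To make "opportunity, not force" concrete, the shape is roughly this
(a sketch only -- the exact signature is in patch 03/35, and
scan_zone_lru()/scan_global_lru() are hypothetical helpers):

static int example_shrink(struct shrinker *shrink, struct zone *zone,
			  unsigned long nr_to_scan, gfp_t gfp_mask)
{
	/* Zone-aware path: free objects from the zone under pressure. */
	if (zone)
		return scan_zone_lru(zone, nr_to_scan);

	/* Legacy path: ignore the zone, keep one global structure. */
	return scan_global_lru(nr_to_scan);
}

A subsystem that doesn't care simply takes the second branch and
behaves exactly as it does today.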

 
> In the case of XFS, per-allocation group is the way we scale
> internal parallelism and as long as you have more AGs than you have
> CPUs, there is very good per-CPU scalability through the filesystem
> because most operations are isolated to a single AG.  That is how we
> scale parallelism in XFS, and it has proven to scale pretty well for
> even the largest of NUMA machines. 
> 
> This is what I mean about there being an impedance mismatch between
> the way the VM and the VFS/filesystem caches scale. Fundamentally,
> the way filesystems want their caches to operate for optimal
> performance can be vastly different to the way you want shrinkers to
> operate for VM scalability. Forcing the MM way of doing stuff down
> into the LRUs and shrinkers is not a good way of solving this
> problem.

It isn't forcing anything. You keep repeating this, so maybe the
patch hasn't come across clearly: the zone is an extra input that a
shrinker may use or ignore as it sees fit.

 
> > For subsystems that aren't important, don't take much memory or have
> > much reclaim throughput, they are free to ignore the zone argument
> > and keep using the global input to the shrinker.
> 
> Having a global lock in a shrinker is already a major point of
> contention because shrinkers have unbound parallelism.  Hence all
> shrinkers need to be converted to use scalable structures. What we
> need _first_ is the infrastructure to do this in a sane manner, not
> tie a couple of shrinkers tightly into the mm structures and then
> walk away.

Per zone is the way to do it. Shrinkers and the reclaim concept are
already tightly coupled with the mm. Memory pressure, and the need to
reclaim, occurs solely as a function of a zone (or zones). Adding the
zone argument to the shrinker does nothing more than supply that
previously missing input.

"I have a memory shortage in this zone, so I need to free reclaimable
objects from this zone"

This is a pretty core memory-management idea. If you "decouple"
shrinkers from the mm any further, you end up with something that no
longer gives shrinkers the information they need.
 

> And FWIW, most subsystems that use shrinkers can be compiled in as
> modules or not compiled in at all. That'll probably leave #ifdef
> CONFIG_ crap all through the struct zone definition as they are
> converted to use your current method....

I haven't thought about how random drivers will do per-zone things.
Obviously not an all out dumping ground in struct zone, but it does
fit for critical central caches like page, inode, and dentry.

Even if they aren't compiled out, we don't want their size bloating
things too much if they aren't loaded or in use. Probably dynamic
allocation would be the best way to go for them. Pretty simple really.
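
Something along these lines (entirely illustrative -- my_lru,
my_lru_of() and the init function are made-up names, not anything in
the patches):

#include <linux/init.h>
#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_lru {
	spinlock_t	 lock;
	struct list_head list;
};

static struct my_lru *my_lrus;		/* one slot per possible zone */

/* Index by (node, zone type) -- a multiply and an add. */
static struct my_lru *my_lru_of(struct zone *zone)
{
	return &my_lrus[zone_to_nid(zone) * MAX_NR_ZONES + zone_idx(zone)];
}

static int __init my_module_init(void)
{
	int nr = num_possible_nodes() * MAX_NR_ZONES;
	int i;

	my_lrus = kcalloc(nr, sizeof(*my_lrus), GFP_KERNEL);
	if (!my_lrus)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		spin_lock_init(&my_lrus[i].lock);
		INIT_LIST_HEAD(&my_lrus[i].list);
	}
	return 0;
}

struct zone stays untouched, the memory exists only while the module
is loaded, and no #ifdef CONFIG_ crap is needed anywhere.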


