All of lore.kernel.org
 help / color / mirror / Atom feed
* [v3 00/39] faster tree-based sysctl implementation
@ 2011-05-23  1:56 Lucian Adrian Grijincu
  2011-05-23  4:27 ` Eric W. Biederman
  0 siblings, 1 reply; 6+ messages in thread
From: Lucian Adrian Grijincu @ 2011-05-23  1:56 UTC (permalink / raw)
  To: Eric W . Biederman, linux-kernel
  Cc: netdev, Lucian Adrian Grijincu, Alexey Dobriyan,
	Octavian Purdila, David S . Miller

Hi

This is version 3 of a patch series that introduces a faster/leaner
sysctl internal implementation.

Due to high number of patches and low general interest I'll just point
you to the tree/branch:

  git://github.com/luciang/linux-2.6-new-sysctl.git   v3-new-sysctl-alg


Patches are on top of v2.6.39. I did not pick a more recent (random)
point in Linus' tree to rebase these onto to not mess up testing.



Changes since v2
(http://thread.gmane.org/gmane.linux.kernel/1137032/focus=194748):
- added a compatibility layer to support old registering complex
  sysctl trees. This layer will be deleted once all users of the old
  are changed.
- subdirectories and netns correspondent dirs are now held in rbtrees
- split of from the patches that make changes in the rest of the tree
- rebased on top of 2.6.39




Changes since v1 (http://thread.gmane.org/gmane.linux.kernel/1133667):
- rebased on top of 2.6.39-rc6
- split the patch that adds the new algorithm and data structures.
- fixed a few bugs lingering in the old code
- shrinked a reference counter
- added a new reference counter to maintain ownership information
- added method to register an empty sysctl dir and converted some users
- added checks enforcing the rule that a non-netns specific directory may
 not be registered after a netns specific one has already been registered.
- added cookie support: register a piece of data with the header to be
 used to make simple conversions on the ctl_table. This saves memory where
 we need to register sysctl tables with the same content affecting
 different pieces of data.
- enforced sysctl checks



$ time modprobe dummy numdummies=N


NOTE: these stats are from v2. v3 should be a bit slower due to:
- the compatibility layer
- the old stats used cookies to prevent kmemdups() on ctl_table arrays
- the old patches had an optimisation for directories with many
subdirs that was replaced (in v3) with rbtrees

Without this patch series :(
- ipv4 only
 -  N=1000  time= 0m 06s
 -  N=2000  time= 0m 30s
 -  N=4000  time= 2m 35s
- ipv4 and ipv6
 -  N=1000  time= 0m 24s
 -  N=2000  time= 2m 14s
 -  N=4000  time=10m 16s
 -  N=5000  time=16m  3s

With this patch series    :)
- ipv4 only
 -  N=1000  time= 0m  0.33s
 -  N=2000  time= 0m  1.25s
 -  N=4000  time= 0m  5.31s
- ipv4 and ipv6
 -  N=1000  time= 0m  0.41s
 -  N=2000  time= 0m  1.62s
 -  N=4000  time= 0m  7.64s
 -  N=5000  time= 0m 12.35s
 -  N=8000  time= 0m 36.95s




Patches marked with RFC: are patches where reviewers should pay more
attention as I may have missed something.


Part 1: introduce compatibility layer:
  sysctl: introduce temporary sysctl wrappers
  sysctl: register only tables of sysctl files


Part 2: minimal changes to sysctl users:
  sysctl: call sysctl_init before the first sysctl registration
  sysctl: no-child: manually register kernel/random
  sysctl: no-child: manually register kernel/keys
  sysctl: no-child: manually register fs/inotify
  sysctl: no-child: manually register fs/epoll
  sysctl: no-child: manually register root tables


Part 3: cleanups simplifying the new algorithm (created when asked to
break up the new algorithm patch:
  sysctl: faster reimplementation of sysctl_check_table
  sysctl: remove useless ctl_table->parent field
  sysctl: simplify find_in_table
  sysctl: sysctl_head_grab defaults to root header on NULL
  sysctl: delete useless grab_header function
  sysctl: rename ->used to ->ctl_use_refs
  sysctl: rename sysctl_head_grab/finish to sysctl_use_header/unuse
  sysctl: rename sysctl_head_next to sysctl_use_next_header
  sysctl: split ->count into ctl_procfs_refs and ctl_header_refs
  sysctl: rename sysctl_head_get/put to sysctl_proc_inode_get/put
  sysctl: rename (un)use_table to __sysctl_(un)use_header
  sysctl: simplify ->permissions hook
  sysctl: group root-specific operations
  sysctl: introduce ctl_table_group
  sysctl: move removal from list out of start_unregistering


Part 4: new algorithm/data structures:
  sysctl: faster tree-based sysctl implementation


Part 5: checks/warns requested during review:
  sysctl: add duplicate entry and sanity ctl_table checks
  sysctl: alloc ctl_table_header with kmem_cache
  RFC: sysctl: change type of ctl_procfs_refs to u8
  sysctl: check netns-specific registration order respected
  sysctl: warn if registration/unregistration order is not respected
  RFC: sysctl: always perform sysctl checks


Part 6: Eric requested rbtrees for subdirs. I also used rbtrees for
netns correspondent dirs. Hope that adding rbtrees after using the old
list-based implementation is good enough. The rbtree code complicates
things a bit and would uglify the patch adding the new algorithm.
  sysctl: reorder members of ctl_table_header (cleanup)
  sysctl: add ctl_type member
  RFC: sysctl: replace subdirectory list with rbtree
  RFC: sysctl: replace netns corresp list with rbtree
  sysctl: union-ize some ctl_table_header fields


Part 7: Eric requested ability to register an empty dir:
  sysctl: add register_sysctl_dir: register an empty sysctl directory


Part 8: unrequested feature I'd like to piggy back :)
  sysctl: add ctl_cookie and ctl_cookie_handler
  sysctl: add cookie to __register_sysctl_paths
  sysctl: add register_net_sysctl_table_net_cookie


 drivers/char/random.c            |   27 +-
 fs/eventpoll.c                   |   22 +-
 fs/notify/inotify/inotify_user.c |   22 +-
 fs/proc/inode.c                  |    2 +-
 fs/proc/proc_sysctl.c            |  233 +++++---
 include/linux/inotify.h          |    2 -
 include/linux/key.h              |    3 -
 include/linux/poll.h             |    2 -
 include/linux/sysctl.h           |  221 +++++---
 include/net/net_namespace.h      |    4 +-
 init/main.c                      |    1 +
 kernel/Makefile                  |    5 +-
 kernel/sysctl.c                  | 1161 ++++++++++++++++++++++++++++----------
 kernel/sysctl_check.c            |  325 +++++++----
 lib/Kconfig.debug                |    8 -
 net/sysctl_net.c                 |   86 ++--
 security/keys/key.c              |    7 +
 security/keys/sysctl.c           |   18 +-
 18 files changed, 1495 insertions(+), 654 deletions(-)


-- 
 .
..: Lucian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v3 00/39] faster tree-based sysctl implementation
  2011-05-23  1:56 [v3 00/39] faster tree-based sysctl implementation Lucian Adrian Grijincu
@ 2011-05-23  4:27 ` Eric W. Biederman
  2011-05-23  5:59   ` Lucian Adrian Grijincu
  2011-05-23  6:37   ` Lucian Adrian Grijincu
  0 siblings, 2 replies; 6+ messages in thread
From: Eric W. Biederman @ 2011-05-23  4:27 UTC (permalink / raw)
  To: Lucian Adrian Grijincu
  Cc: linux-kernel, netdev, Alexey Dobriyan, Octavian Purdila,
	David S . Miller

Lucian Adrian Grijincu <lucian.grijincu@gmail.com> writes:

> Hi
>
> This is version 3 of a patch series that introduces a faster/leaner
> sysctl internal implementation.
>
> Due to high number of patches and low general interest I'll just point
> you to the tree/branch:
>
>   git://github.com/luciang/linux-2.6-new-sysctl.git   v3-new-sysctl-alg
>
>
> Patches are on top of v2.6.39. I did not pick a more recent (random)
> point in Linus' tree to rebase these onto to not mess up testing.


Thanks for keeping going on this.

This patchset looks like it is deserving of some close scrutiny, and
not just the high level design overview I have given the previous
patches.  This is going to be a busy week for me so I probably won't
get through all of the patches for a while.

I will mention a couple of nits I noticed while I was skimming through
your patches.
- There can be multiple proc superblocks and thus multiple inodes
  referring to the same /proc/sys file, if there are multiple pid
  namespaces. 

- I have a hope to move /proc/sys into /proc/<pid>/sys so we don't have
  to look at current to determine the namespace we want to display.
  That would allow the deeply magic sysctl_is_seen check to be removed
  from proc_sys_compare.  That is not your problem, but of an
  explanation why the namespaces are passed through.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v3 00/39] faster tree-based sysctl implementation
  2011-05-23  4:27 ` Eric W. Biederman
@ 2011-05-23  5:59   ` Lucian Adrian Grijincu
  2011-05-23  6:37   ` Lucian Adrian Grijincu
  1 sibling, 0 replies; 6+ messages in thread
From: Lucian Adrian Grijincu @ 2011-05-23  5:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, netdev, Alexey Dobriyan, Octavian Purdila,
	David S . Miller

On Mon, May 23, 2011 at 7:27 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> I will mention a couple of nits I noticed while I was skimming through
> your patches.
> - There can be multiple proc superblocks and thus multiple inodes
>  referring to the same /proc/sys file, if there are multiple pid
>  namespaces.


OK. I'll revert the patch that make converts an int counter into an u8
and post an update. I guess this is what you were referring to.
https://github.com/luciang/linux-2.6-new-sysctl/commit/b7a547b8ce7484ae22eea34a860f674fcb19c11b


> - I have a hope to move /proc/sys into /proc/<pid>/sys so we don't have
>  to look at current to determine the namespace we want to display.
>  That would allow the deeply magic sysctl_is_seen check to be removed
>  from proc_sys_compare.  That is not your problem, but of an
>  explanation why the namespaces are passed through.


OK, I'll go an change things to pass the current namespace as it were
before my changes.


Thank you.
-- 
 .
..: Lucian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v3 00/39] faster tree-based sysctl implementation
  2011-05-23  4:27 ` Eric W. Biederman
  2011-05-23  5:59   ` Lucian Adrian Grijincu
@ 2011-05-23  6:37   ` Lucian Adrian Grijincu
  2011-05-23  9:32     ` Eric W. Biederman
  1 sibling, 1 reply; 6+ messages in thread
From: Lucian Adrian Grijincu @ 2011-05-23  6:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, netdev, Alexey Dobriyan, Octavian Purdila,
	David S . Miller

On Mon, May 23, 2011 at 7:27 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> This patchset looks like it is deserving of some close scrutiny, and
> not just the high level design overview I have given the previous
> patches.  This is going to be a busy week for me so I probably won't
> get through all of the patches for a while.


I have one more question. The current implementation uses a single
sysctl_lock to synchronize all changes to the data structures.

In my algorithm I change a few places to use a per-header read-write
lock. Even though the code is organized to handle a per-header rwlock,
the implementation uses a single global rwlock. In v2 I got rid of the
rwlock and replaced the subdirs/files regular lists with rcu-protected
lists and that's why I did not bother giving each header a rwlock.


I have no idea how to use rcu with rbtree. Should I now give each
header it's own lock to reduce contention?


I'm asking this because I don't know why the only is a global sysctl
spin lock, when multiple locks could have been used, each to protect
it's own domain of values.

If you'd like to keep locking as simple as possible (to reduce all the
potential problems brought on by too many locks), or if in general
contention is low enough, then global lock is better. If not, then
I'll change the code to support per-header rwlocks (increasing the
ctl_table_header structure size).

-- 
 .
..: Lucian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v3 00/39] faster tree-based sysctl implementation
  2011-05-23  6:37   ` Lucian Adrian Grijincu
@ 2011-05-23  9:32     ` Eric W. Biederman
  2011-05-23 13:26       ` Lucian Adrian Grijincu
  0 siblings, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2011-05-23  9:32 UTC (permalink / raw)
  To: Lucian Adrian Grijincu
  Cc: linux-kernel, netdev, Alexey Dobriyan, Octavian Purdila,
	David S . Miller

Lucian Adrian Grijincu <lucian.grijincu@gmail.com> writes:

> On Mon, May 23, 2011 at 7:27 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> This patchset looks like it is deserving of some close scrutiny, and
>> not just the high level design overview I have given the previous
>> patches.  This is going to be a busy week for me so I probably won't
>> get through all of the patches for a while.
>
>
> I have one more question. The current implementation uses a single
> sysctl_lock to synchronize all changes to the data structures.
>
> In my algorithm I change a few places to use a per-header read-write
> lock. Even though the code is organized to handle a per-header rwlock,
> the implementation uses a single global rwlock. In v2 I got rid of the
> rwlock and replaced the subdirs/files regular lists with rcu-protected
> lists and that's why I did not bother giving each header a rwlock.
>
>
> I have no idea how to use rcu with rbtree. Should I now give each
> header it's own lock to reduce contention?

I would only walk down that path if we can find some profile data
showing that the lock is where we are hot.

> I'm asking this because I don't know why the only is a global sysctl
> spin lock, when multiple locks could have been used, each to protect
> it's own domain of values.

Mostly it is simplicity.  There is also the fact that the spin lock is
used in the implementation of something that is essentially a
reader/writer lock already.

With the help of the reference counts we block when we are unregistering
until there are no more users.

In that context I'm not certain I am comfortable with separating proc
inode usage from other proc usage. But I haven't read through that
section of your code well enough yet to tell if you are making sense.

One of the things that would be very nice to do is add lockdep
annotations like I have to sysfs_activate and sysfs_deactivate, so we
can catch the all too common case of someone unregistering a sysctl
table when there are problems.

Personally I'm not happy with the state of the locking abstractions in
sysctl today.  It is all much too obscure, and there are too few
warnings.  However for your set of changes I think the thing to focus
on is getting sysctl to better data structures so that it can scale.

Once the data structures are simple enough any remaining issues should
be fixable with small straight forward patches.

Eric

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v3 00/39] faster tree-based sysctl implementation
  2011-05-23  9:32     ` Eric W. Biederman
@ 2011-05-23 13:26       ` Lucian Adrian Grijincu
  0 siblings, 0 replies; 6+ messages in thread
From: Lucian Adrian Grijincu @ 2011-05-23 13:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, netdev, Alexey Dobriyan, Octavian Purdila,
	David S . Miller

On Mon, May 23, 2011 at 12:32 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Mostly it is simplicity.  There is also the fact that the spin lock is
> used in the implementation of something that is essentially a
> reader/writer lock already.


The amount of time in which the spin lock is held in the current
implementation can be quite large: in __register_sysctl_paths:

https://github.com/mirrors/linux-2.6/blob/v2.6.39/kernel/sysctl.c#L1887

	spin_lock(&sysctl_lock);
	for (set = header->set; set; set = set->parent)
		list_for_each_entry(p, &set->list, ctl_entry)
			try_attach(p, header);
	spin_unlock(&sysctl_lock);


For N=10^5 headers and try_attach=O(N) it's not a very good locking mechanism.

That's why I opted for a rwlock for each dir's subdirs/tables.


> In that context I'm not certain I am comfortable with separating proc
> inode usage from other proc usage. But I haven't read through that
> section of your code well enough yet to tell if you are making sense.


Proc inode usage (->count) was already separate from other proc usage (->use).
It was not separate from other header references (shared in ->count).

I separated the two because when I call unregister on a header I need
to decide whether to really unregister it (->unregistering=true and no
one can see this header and anything under it any more) or just
decrement a reference.

In the current implementation a header is only created by a
__register_sysctl_paths call and it's clear that at unregister we have
to set ->unregistering.
In my implementation headers are created dynamically to create new
directory elements. I need to know when to unregister such a header
regardless of any possible procfs inode references.

https://github.com/luciang/linux-2.6-new-sysctl/blob/v4-new-sysctl-alg/kernel/sysctl.c#L2390



I pushed a new version:
git://github.com/luciang/linux-2.6-new-sysctl.git v4-new-sysctl-alg

I undid int->u8 for ctl_procfs_refs.

I left the ->permissions hook get it's namespace form current->
because rewriting history for that change trips on too many patches
and a new parameter can be very easily added later when needed. Hope
this is ok with you.


I'd like to send patches for review to archs/drivers/etc. that
register only tables of files, not whole sysctl trees.
The patches don't depend on anything from this series.

Examples:
* http://thread.gmane.org/gmane.linux.kernel/1137032/focus=1137089
* http://thread.gmane.org/gmane.linux.kernel/1137032/focus=1137087


I'd like an OK-GO from you.

-- 
 .
..: Lucian

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-05-23 13:27 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-23  1:56 [v3 00/39] faster tree-based sysctl implementation Lucian Adrian Grijincu
2011-05-23  4:27 ` Eric W. Biederman
2011-05-23  5:59   ` Lucian Adrian Grijincu
2011-05-23  6:37   ` Lucian Adrian Grijincu
2011-05-23  9:32     ` Eric W. Biederman
2011-05-23 13:26       ` Lucian Adrian Grijincu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.