All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 net-next] rhashtable: Notify on resize to allow signaling interrupted dumps
@ 2015-01-20 13:20 Thomas Graf
  2015-01-20 13:20 ` [PATCH 1/3] rhashtable: Provide notifier for deferred resizes Thomas Graf
                   ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 13:20 UTC (permalink / raw)
  To: davem, kaber, herbert, paulmck, ying.xue; +Cc: netdev, netfilter-devel

Netlink dumps have a traditional weakness of being interruptible and
non-atomic. Once the message buffer has been filled, all protective
locks must be released and the buffer is handed over to the user to
copy the data. The dumping then continues from the previously stored
offsets in the data structures. Due to dropping of the locks, mutations
can occur.

This first series introduces a new notification mechanism which allows
users of rhashtable to bump a sequence number on resizes and report
an inconsistent dump to user space.

As discussed in the netdev thread ("[PATCH 7/9] rhashtable: Per bucket
locks & deferred expansion/shrinking").

Thomas Graf (3):
  rhashtable: Provide notifier for deferred resizes
  netlink: Mark dumps as inconsistent which have been interrupted by a
    resize
  netlink: Lock out table resizes while dumping Netlink sockets

 include/linux/rhashtable.h | 10 +++++++++-
 lib/rhashtable.c           | 13 ++++++++++---
 net/netlink/af_netlink.c   | 10 ++++++++++
 net/netlink/af_netlink.h   |  1 +
 net/netlink/diag.c         |  4 ++++
 5 files changed, 34 insertions(+), 4 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 1/3] rhashtable: Provide notifier for deferred resizes
  2015-01-20 13:20 [PATCH 0/3 net-next] rhashtable: Notify on resize to allow signaling interrupted dumps Thomas Graf
@ 2015-01-20 13:20 ` Thomas Graf
  2015-01-20 13:20 ` [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize Thomas Graf
  2015-01-20 13:20 ` [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets Thomas Graf
  2 siblings, 0 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 13:20 UTC (permalink / raw)
  To: davem, kaber, herbert, paulmck, ying.xue; +Cc: netdev, netfilter-devel

Allowing users of rhashtable to bump a sequence counter and invalidate
Netlink dumps which have been interrupted by a deferred resize.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/rhashtable.h | 10 +++++++++-
 lib/rhashtable.c           | 13 ++++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index a2562ed..b711d16 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1,7 +1,7 @@
 /*
  * Resizable, Scalable, Concurrent Hash Table
  *
- * Copyright (c) 2014 Thomas Graf <tgraf@suug.ch>
+ * Copyright (c) 2014-2015 Thomas Graf <tgraf@suug.ch>
  * Copyright (c) 2008-2014 Patrick McHardy <kaber@trash.net>
  *
  * Based on the following paper by Josh Triplett, Paul E. McKenney
@@ -64,6 +64,11 @@ typedef u32 (*rht_obj_hashfn_t)(const void *data, u32 seed);
 
 struct rhashtable;
 
+enum rht_resize_op {
+	RHT_GROWING,
+	RHT_SHRINKING,
+};
+
 /**
  * struct rhashtable_params - Hash table construction parameters
  * @nelem_hint: Hint on number of elements, should be 75% of desired size
@@ -79,6 +84,7 @@ struct rhashtable;
  * @obj_hashfn: Function to hash object
  * @grow_decision: If defined, may return true if table should expand
  * @shrink_decision: If defined, may return true if table should shrink
+ * @resize_notify: Called when a table resize is scheduled.
  *
  * Note: when implementing the grow and shrink decision function, min/max
  * shift must be enforced, otherwise, resizing watermarks they set may be
@@ -100,6 +106,8 @@ struct rhashtable_params {
 						 size_t new_size);
 	bool			(*shrink_decision)(const struct rhashtable *ht,
 						   size_t new_size);
+	void			(*resize_notify)(const struct rhashtable *ht,
+						 enum rht_resize_op op);
 };
 
 /**
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 84a78e3..a4449c4 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1,7 +1,7 @@
 /*
  * Resizable, Scalable, Concurrent Hash Table
  *
- * Copyright (c) 2014 Thomas Graf <tgraf@suug.ch>
+ * Copyright (c) 2014-2015 Thomas Graf <tgraf@suug.ch>
  * Copyright (c) 2008-2014 Patrick McHardy <kaber@trash.net>
  *
  * Based on the following paper:
@@ -489,10 +489,17 @@ static void rht_deferred_worker(struct work_struct *work)
 	mutex_lock(&ht->mutex);
 	tbl = rht_dereference(ht->tbl, ht);
 
-	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
+	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size)) {
+		if (ht->p.resize_notify)
+			ht->p.resize_notify(ht, RHT_GROWING);
+
 		rhashtable_expand(ht);
-	else if (ht->p.shrink_decision && ht->p.shrink_decision(ht, tbl->size))
+	} else if (ht->p.shrink_decision && ht->p.shrink_decision(ht, tbl->size)) {
+		if (ht->p.resize_notify)
+			ht->p.resize_notify(ht, RHT_SHRINKING);
+
 		rhashtable_shrink(ht);
+	}
 
 	mutex_unlock(&ht->mutex);
 }
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-20 13:20 [PATCH 0/3 net-next] rhashtable: Notify on resize to allow signaling interrupted dumps Thomas Graf
  2015-01-20 13:20 ` [PATCH 1/3] rhashtable: Provide notifier for deferred resizes Thomas Graf
@ 2015-01-20 13:20 ` Thomas Graf
  2015-01-21  8:13   ` Ying Xue
  2015-01-20 13:20 ` [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets Thomas Graf
  2 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 13:20 UTC (permalink / raw)
  To: davem, kaber, herbert, paulmck, ying.xue; +Cc: netdev, netfilter-devel

A deferred resize of nl_table causes the offsets that Netlink diag keeps
to become inaccurate. Mark the dump as inconsistent and have user space
request a new dump.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/netlink/af_netlink.c | 10 ++++++++++
 net/netlink/af_netlink.h |  1 +
 net/netlink/diag.c       |  1 +
 3 files changed, 12 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7a94185..e214557 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -91,6 +91,9 @@ static inline int netlink_is_kernel(struct sock *sk)
 struct netlink_table *nl_table;
 EXPORT_SYMBOL_GPL(nl_table);
 
+atomic_t nl_table_seq;
+EXPORT_SYMBOL_GPL(nl_table_seq);
+
 static DECLARE_WAIT_QUEUE_HEAD(nl_table_wait);
 
 static int netlink_dump(struct sock *sk);
@@ -3071,6 +3074,12 @@ static const struct net_proto_family netlink_family_ops = {
 	.owner	= THIS_MODULE,	/* for consistency 8) */
 };
 
+static void nl_table_resize_notify(const struct rhashtable *ht,
+				   enum rht_resize_op op)
+{
+	atomic_inc(&nl_table_seq);
+}
+
 static int __net_init netlink_net_init(struct net *net)
 {
 #ifdef CONFIG_PROC_FS
@@ -3124,6 +3133,7 @@ static int __init netlink_proto_init(void)
 		.max_shift = 16, /* 64K */
 		.grow_decision = rht_grow_above_75,
 		.shrink_decision = rht_shrink_below_30,
+		.resize_notify = nl_table_resize_notify,
 	};
 
 	if (err != 0)
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 7518375..51c23c0 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -74,5 +74,6 @@ struct netlink_table {
 
 extern struct netlink_table *nl_table;
 extern rwlock_t nl_table_lock;
+extern atomic_t nl_table_seq;
 
 #endif
diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index 3ee63a3..50aa385 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -112,6 +112,7 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	int ret = 0, num = 0, i;
 
 	req = nlmsg_data(cb->nlh);
+	cb->seq = atomic_read(&nl_table_seq);
 
 	for (i = 0; i < htbl->size; i++) {
 		struct rhash_head *pos;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 13:20 [PATCH 0/3 net-next] rhashtable: Notify on resize to allow signaling interrupted dumps Thomas Graf
  2015-01-20 13:20 ` [PATCH 1/3] rhashtable: Provide notifier for deferred resizes Thomas Graf
  2015-01-20 13:20 ` [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize Thomas Graf
@ 2015-01-20 13:20 ` Thomas Graf
  2015-01-20 14:31   ` Patrick McHardy
  2 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 13:20 UTC (permalink / raw)
  To: davem, kaber, herbert, paulmck, ying.xue; +Cc: netdev, netfilter-devel

Lock out table resizes while dumping Netlink sockets to user space.
This keeps disruptions to a minimum for readers which don't handle
the NLM_F_DUMP_INTR flag.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/netlink/diag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index 50aa385..be4ea6e 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -114,6 +114,8 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	req = nlmsg_data(cb->nlh);
 	cb->seq = atomic_read(&nl_table_seq);
 
+	mutex_lock(&ht->mutex);
+
 	for (i = 0; i < htbl->size; i++) {
 		struct rhash_head *pos;
 
@@ -161,6 +163,7 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 		num++;
 	}
 done:
+	mutex_unlock(&ht->mutex);
 	cb->args[0] = num;
 	cb->args[1] = protocol;
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 13:20 ` [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets Thomas Graf
@ 2015-01-20 14:31   ` Patrick McHardy
  2015-01-20 14:55     ` Thomas Graf
                       ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Patrick McHardy @ 2015-01-20 14:31 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, herbert, paulmck, ying.xue, netdev, netfilter-devel

On 20.01, Thomas Graf wrote:
> Lock out table resizes while dumping Netlink sockets to user space.
> This keeps disruptions to a minimum for readers which don't handle
> the NLM_F_DUMP_INTR flag.

This doesn't lock them out for the duration of the entire dump of
course, so the benefit seems rather small. Still with this patch,
they will need to handle NLM_F_DUMP_INTR or will get unpredictable
behaviour, in which case I'd think it makes more sense to not even
try this, all it does is hide parts of the brokenness.

An alternative would be to set a flag in ht when a dump begins that
indicates to skip resizing operations and on the end of the dump
perform any resizing operations that might be necessary. Herbert
disagrees though and he might be right.

> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  net/netlink/diag.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/netlink/diag.c b/net/netlink/diag.c
> index 50aa385..be4ea6e 100644
> --- a/net/netlink/diag.c
> +++ b/net/netlink/diag.c
> @@ -114,6 +114,8 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
>  	req = nlmsg_data(cb->nlh);
>  	cb->seq = atomic_read(&nl_table_seq);
>  
> +	mutex_lock(&ht->mutex);
> +
>  	for (i = 0; i < htbl->size; i++) {
>  		struct rhash_head *pos;
>  
> @@ -161,6 +163,7 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
>  		num++;
>  	}
>  done:
> +	mutex_unlock(&ht->mutex);
>  	cb->args[0] = num;
>  	cb->args[1] = protocol;
>  
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 14:31   ` Patrick McHardy
@ 2015-01-20 14:55     ` Thomas Graf
  2015-01-20 15:21       ` Patrick McHardy
  2015-01-20 15:00     ` David Laight
  2015-01-21  5:11     ` Herbert Xu
  2 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 14:55 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: davem, herbert, paulmck, ying.xue, netdev, netfilter-devel

On 01/20/15 at 02:31pm, Patrick McHardy wrote:
> On 20.01, Thomas Graf wrote:
> > Lock out table resizes while dumping Netlink sockets to user space.
> > This keeps disruptions to a minimum for readers which don't handle
> > the NLM_F_DUMP_INTR flag.
> 
> This doesn't lock them out for the duration of the entire dump of
> course, so the benefit seems rather small. Still with this patch,
> they will need to handle NLM_F_DUMP_INTR or will get unpredictable
> behaviour, in which case I'd think it makes more sense to not even
> try this, all it does is hide parts of the brokenness.

If it would lock out the resize for the entire dump I would not have
done patches 1 and 2 ;-)

I does provide better behaviour if the whole dump fits into a single
buffer or if it fits into 2 buffers and we are already dumping into
the 2nd buffer when the resize occurs. Otherwise we will see resizes
and thus tons of duplicates even in those scenarios even if no insert
or removal occurs in parallel.

In the case of Netlink diag that should be typical case. Most systems
will not have 1000s of Netlink sockets in parallel.

> An alternative would be to set a flag in ht when a dump begins that
> indicates to skip resizing operations and on the end of the dump
> perform any resizing operations that might be necessary. Herbert
> disagrees though and he might be right.

I don't like the flag as it prevents resizes (and possibly rehashes
further down the road) for a long period of time. The hashtable
becomes attackable.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 14:31   ` Patrick McHardy
  2015-01-20 14:55     ` Thomas Graf
@ 2015-01-20 15:00     ` David Laight
  2015-01-20 15:05       ` Thomas Graf
  2015-01-21  5:11     ` Herbert Xu
  2 siblings, 1 reply; 79+ messages in thread
From: David Laight @ 2015-01-20 15:00 UTC (permalink / raw)
  To: 'Patrick McHardy', Thomas Graf
  Cc: davem, herbert, paulmck, ying.xue, netdev, netfilter-devel

From: Patrick McHardy
> On 20.01, Thomas Graf wrote:
> > Lock out table resizes while dumping Netlink sockets to user space.
> > This keeps disruptions to a minimum for readers which don't handle
> > the NLM_F_DUMP_INTR flag.
> 
> This doesn't lock them out for the duration of the entire dump of
> course, so the benefit seems rather small. Still with this patch,
> they will need to handle NLM_F_DUMP_INTR or will get unpredictable
> behaviour, in which case I'd think it makes more sense to not even
> try this, all it does is hide parts of the brokenness.
> 
> An alternative would be to set a flag in ht when a dump begins that
> indicates to skip resizing operations and on the end of the dump
> perform any resizing operations that might be necessary. Herbert
> disagrees though and he might be right.

Can you rely on being told when the dump completes?
If the program is killed in the middle then it can't tell you.

I suspect you'd have to suppress resize for some time interval
after a partial dump.
Unfortunately two continuous dumps would be likely to suppress
it forever. Maybe sleep user space dump requests for the first
block while any resize (esp. grow) is pending.

What is passed to userspace as the 'continue from here' marker?
Even without resize there are likely to be issues if something
nearer the head of a hash chain being processed is deleted.

	David


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 15:00     ` David Laight
@ 2015-01-20 15:05       ` Thomas Graf
  0 siblings, 0 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 15:05 UTC (permalink / raw)
  To: David Laight
  Cc: 'Patrick McHardy',
	davem, herbert, paulmck, ying.xue, netdev, netfilter-devel

On 01/20/15 at 03:00pm, David Laight wrote:
> From: Patrick McHardy
> > An alternative would be to set a flag in ht when a dump begins that
> > indicates to skip resizing operations and on the end of the dump
> > perform any resizing operations that might be necessary. Herbert
> > disagrees though and he might be right.
> 
> Can you rely on being told when the dump completes?
> If the program is killed in the middle then it can't tell you.
> 
> I suspect you'd have to suppress resize for some time interval
> after a partial dump.
> Unfortunately two continuous dumps would be likely to suppress
> it forever. Maybe sleep user space dump requests for the first
> block while any resize (esp. grow) is pending.
> 
> What is passed to userspace as the 'continue from here' marker?
> Even without resize there are likely to be issues if something
> nearer the head of a hash chain being processed is deleted.

I assumed that the proposal would include a timer in the Netlink
cb which fires after 2s or so and cancels the dump. 2s is still
a long time and a lot of inserts can happen in the meantime.

This may become more viable once insert()s can fail above a
certain watermark as Herbert suggests.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 14:55     ` Thomas Graf
@ 2015-01-20 15:21       ` Patrick McHardy
  2015-01-20 15:35         ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Patrick McHardy @ 2015-01-20 15:21 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, herbert, paulmck, ying.xue, netdev, netfilter-devel

On 20.01, Thomas Graf wrote:
> On 01/20/15 at 02:31pm, Patrick McHardy wrote:
> > On 20.01, Thomas Graf wrote:
> > > Lock out table resizes while dumping Netlink sockets to user space.
> > > This keeps disruptions to a minimum for readers which don't handle
> > > the NLM_F_DUMP_INTR flag.
> > 
> > This doesn't lock them out for the duration of the entire dump of
> > course, so the benefit seems rather small. Still with this patch,
> > they will need to handle NLM_F_DUMP_INTR or will get unpredictable
> > behaviour, in which case I'd think it makes more sense to not even
> > try this, all it does is hide parts of the brokenness.
> 
> If it would lock out the resize for the entire dump I would not have
> done patches 1 and 2 ;-)
> 
> I does provide better behaviour if the whole dump fits into a single
> buffer or if it fits into 2 buffers and we are already dumping into
> the 2nd buffer when the resize occurs. Otherwise we will see resizes
> and thus tons of duplicates even in those scenarios even if no insert
> or removal occurs in parallel.
> 
> In the case of Netlink diag that should be typical case. Most systems
> will not have 1000s of Netlink sockets in parallel.

I think its preferrable to make the need to handle NETLINK_F_DUMP_INTR
as noticable as possible and not hide it. Silent failure is the worst
kind of failure.

> > An alternative would be to set a flag in ht when a dump begins that
> > indicates to skip resizing operations and on the end of the dump
> > perform any resizing operations that might be necessary. Herbert
> > disagrees though and he might be right.
> 
> I don't like the flag as it prevents resizes (and possibly rehashes
> further down the road) for a long period of time. The hashtable
> becomes attackable.

Yeah. The point could be made that this is a regression though. We didn't
require userspace to deal with interruptions before, and the behaviour
was well defined and acceptable for most cases, its not anymore.

So I think it should be handled by the kernel, without changes to
userspace.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 15:21       ` Patrick McHardy
@ 2015-01-20 15:35         ` Thomas Graf
  2015-01-21  5:08           ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-20 15:35 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: davem, herbert, paulmck, ying.xue, netdev, netfilter-devel

On 01/20/15 at 03:21pm, Patrick McHardy wrote:
> I think its preferrable to make the need to handle NETLINK_F_DUMP_INTR
> as noticable as possible and not hide it. Silent failure is the worst
> kind of failure.

I agree to that. The point here is to avoid unnecessary use of
NETLINK_F_DUMP_INTR if all entries fit into a single message buffer.
 
> Yeah. The point could be made that this is a regression though. We didn't
> require userspace to deal with interruptions before, and the behaviour
> was well defined and acceptable for most cases, its not anymore.
> 
> So I think it should be handled by the kernel, without changes to
> userspace.

nl_table_lock was released between individual messages just like
ht->mutex is released with this change.

What changed is that inserts and removal can now occur *while* the
message is being constructed whereas previously they could only
occur between message construction periods. In either case, the dump
would end up missing entries or showing duplicates.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 15:35         ` Thomas Graf
@ 2015-01-21  5:08           ` Herbert Xu
  2015-01-21  5:15             ` Herbert Xu
  2015-01-21  9:37             ` Thomas Graf
  0 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  5:08 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On Tue, Jan 20, 2015 at 03:35:56PM +0000, Thomas Graf wrote:
> On 01/20/15 at 03:21pm, Patrick McHardy wrote:
> > I think its preferrable to make the need to handle NETLINK_F_DUMP_INTR
> > as noticable as possible and not hide it. Silent failure is the worst
> > kind of failure.
> 
> I agree to that. The point here is to avoid unnecessary use of
> NETLINK_F_DUMP_INTR if all entries fit into a single message buffer.

OK I think I have a solution for you guys.  But first you'll need to
wait for me to undo the nulls stuff so I can steal that bit which
is central to my solution.

Essentially I need a bit to indicate an entry in the bucket chain
should be skipped, either because it has just been removed or that
it is a walker entry (see xfrm_state_walk).

The way it'll work then is exactly the same as xfrm_state_walk,
except that the linked list is broken up into individual buckets.

Of course we'll still need to postpone resizes (and rehashes which
is what my work is about) during a walk but I think that's a fair
price to pay.

This also means handling insertion failures but I think that
should be acceptable if we make it based on a configurable maximum
chain length along with forced resize/rehash where possible.

Note that this can be made optional, i.e., if the user can afford
memory to do their own walking (e.g., xfrm_state), then none of
this needs to happen and it'll just work as it does now.  IOW if
you don't use this special rhashtable walk function then you're
not affected.

Thoughts?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-20 14:31   ` Patrick McHardy
  2015-01-20 14:55     ` Thomas Graf
  2015-01-20 15:00     ` David Laight
@ 2015-01-21  5:11     ` Herbert Xu
  2 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  5:11 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Thomas Graf, davem, paulmck, ying.xue, netdev, netfilter-devel

On Tue, Jan 20, 2015 at 02:31:54PM +0000, Patrick McHardy wrote:
> 
> An alternative would be to set a flag in ht when a dump begins that
> indicates to skip resizing operations and on the end of the dump
> perform any resizing operations that might be necessary. Herbert
> disagrees though and he might be right.

It's a bit more complex than just a disagreement :)

See the email I just sent.  But the crucial thing is that postponed
resizing/rehashing during dumping is only possible if the only
entity that can perform dumping is trusted.  Otherwise the whole
thing falls apart really quickly.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  5:08           ` Herbert Xu
@ 2015-01-21  5:15             ` Herbert Xu
  2015-01-21  9:14               ` Herbert Xu
  2015-01-21  9:37             ` Thomas Graf
  1 sibling, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  5:15 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On Wed, Jan 21, 2015 at 04:08:19PM +1100, Herbert Xu wrote:
> 
> OK I think I have a solution for you guys.  But first you'll need to
> wait for me to undo the nulls stuff so I can steal that bit which
> is central to my solution.
> 
> Essentially I need a bit to indicate an entry in the bucket chain
> should be skipped, either because it has just been removed or that
> it is a walker entry (see xfrm_state_walk).
> 
> The way it'll work then is exactly the same as xfrm_state_walk,
> except that the linked list is broken up into individual buckets.
> 
> Of course we'll still need to postpone resizes (and rehashes which
> is what my work is about) during a walk but I think that's a fair
> price to pay.

I failed to address the security aspect of this approach.  Obviously
this can only work if the only entites doing the walk are trusted.
Which means that the vast majority (if not all) hash table users
would be excluded, in particular, netlink.

Even with trusted cases such as netfilter I'm uneasy about this
so I don't think I'm going to go forward with this approach.

At this point I think the best thing that'll give you stable dumps
is a separate list like xfrm_state_walk.  If you did that then
we don't need to touch rhashtable at all.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-20 13:20 ` [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize Thomas Graf
@ 2015-01-21  8:13   ` Ying Xue
  2015-01-21 12:17     ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Ying Xue @ 2015-01-21  8:13 UTC (permalink / raw)
  To: Thomas Graf, davem, kaber, herbert, paulmck; +Cc: netdev, netfilter-devel

On 01/20/2015 09:20 PM, Thomas Graf wrote:
> A deferred resize of nl_table causes the offsets that Netlink diag keeps
> to become inaccurate. Mark the dump as inconsistent and have user space
> request a new dump.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  net/netlink/af_netlink.c | 10 ++++++++++
>  net/netlink/af_netlink.h |  1 +
>  net/netlink/diag.c       |  1 +
>  3 files changed, 12 insertions(+)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 7a94185..e214557 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -91,6 +91,9 @@ static inline int netlink_is_kernel(struct sock *sk)
>  struct netlink_table *nl_table;
>  EXPORT_SYMBOL_GPL(nl_table);
>  
> +atomic_t nl_table_seq;

It sounds like the atomic variable is not initialized.

Regards,
Ying

> +EXPORT_SYMBOL_GPL(nl_table_seq);
> +
>  static DECLARE_WAIT_QUEUE_HEAD(nl_table_wait);
>  
>  static int netlink_dump(struct sock *sk);
> @@ -3071,6 +3074,12 @@ static const struct net_proto_family netlink_family_ops = {
>  	.owner	= THIS_MODULE,	/* for consistency 8) */
>  };
>  
> +static void nl_table_resize_notify(const struct rhashtable *ht,
> +				   enum rht_resize_op op)
> +{
> +	atomic_inc(&nl_table_seq);
> +}
> +
>  static int __net_init netlink_net_init(struct net *net)
>  {
>  #ifdef CONFIG_PROC_FS
> @@ -3124,6 +3133,7 @@ static int __init netlink_proto_init(void)
>  		.max_shift = 16, /* 64K */
>  		.grow_decision = rht_grow_above_75,
>  		.shrink_decision = rht_shrink_below_30,
> +		.resize_notify = nl_table_resize_notify,
>  	};
>  
>  	if (err != 0)
> diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
> index 7518375..51c23c0 100644
> --- a/net/netlink/af_netlink.h
> +++ b/net/netlink/af_netlink.h
> @@ -74,5 +74,6 @@ struct netlink_table {
>  
>  extern struct netlink_table *nl_table;
>  extern rwlock_t nl_table_lock;
> +extern atomic_t nl_table_seq;
>  
>  #endif
> diff --git a/net/netlink/diag.c b/net/netlink/diag.c
> index 3ee63a3..50aa385 100644
> --- a/net/netlink/diag.c
> +++ b/net/netlink/diag.c
> @@ -112,6 +112,7 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
>  	int ret = 0, num = 0, i;
>  
>  	req = nlmsg_data(cb->nlh);
> +	cb->seq = atomic_read(&nl_table_seq);
>  
>  	for (i = 0; i < htbl->size; i++) {
>  		struct rhash_head *pos;
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  5:15             ` Herbert Xu
@ 2015-01-21  9:14               ` Herbert Xu
  2015-01-21  9:56                 ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  9:14 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On Wed, Jan 21, 2015 at 04:15:55PM +1100, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 04:08:19PM +1100, Herbert Xu wrote:
> > 
> > OK I think I have a solution for you guys.  But first you'll need to
> > wait for me to undo the nulls stuff so I can steal that bit which
> > is central to my solution.
> > 
> > Essentially I need a bit to indicate an entry in the bucket chain
> > should be skipped, either because it has just been removed or that
> > it is a walker entry (see xfrm_state_walk).
> > 
> > The way it'll work then is exactly the same as xfrm_state_walk,
> > except that the linked list is broken up into individual buckets.
> > 
> > Of course we'll still need to postpone resizes (and rehashes which
> > is what my work is about) during a walk but I think that's a fair
> > price to pay.
> 
> I failed to address the security aspect of this approach.  Obviously
> this can only work if the only entites doing the walk are trusted.
> Which means that the vast majority (if not all) hash table users
> would be excluded, in particular, netlink.

OK maybe we can get around this.  So we will postpone the resize
or rehash when a walk is in place, however, when we hit a hard
limit, i.e., when insert would otherwise fail, then we do the
rehash regardless of any outstanding walks.  The walks will just
behave erratically or we could force them to start from scratch
again.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  5:08           ` Herbert Xu
  2015-01-21  5:15             ` Herbert Xu
@ 2015-01-21  9:37             ` Thomas Graf
  2015-01-21  9:38               ` Herbert Xu
  2015-01-21 10:36               ` David Laight
  1 sibling, 2 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-21  9:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 04:08pm, Herbert Xu wrote:
> On Tue, Jan 20, 2015 at 03:35:56PM +0000, Thomas Graf wrote:
> > On 01/20/15 at 03:21pm, Patrick McHardy wrote:
> > > I think its preferrable to make the need to handle NETLINK_F_DUMP_INTR
> > > as noticable as possible and not hide it. Silent failure is the worst
> > > kind of failure.
> > 
> > I agree to that. The point here is to avoid unnecessary use of
> > NETLINK_F_DUMP_INTR if all entries fit into a single message buffer.
> 
> OK I think I have a solution for you guys.  But first you'll need to
> wait for me to undo the nulls stuff so I can steal that bit which
> is central to my solution.

Without having seen your code, can we make it configurable on what
the bit is used for? Use of nulls marker is a strict requirement for
some targeted users of rhashtable.

> Essentially I need a bit to indicate an entry in the bucket chain
> should be skipped, either because it has just been removed or that
> it is a walker entry (see xfrm_state_walk).
> 
> The way it'll work then is exactly the same as xfrm_state_walk,
> except that the linked list is broken up into individual buckets.
> 
> Of course we'll still need to postpone resizes (and rehashes which
> is what my work is about) during a walk but I think that's a fair
> price to pay.

If I understand this correctly we also need to block out parallel
walkers and we need to start taking bucket locks while walking to
modify the walker mark bit in peace.

> This also means handling insertion failures but I think that
> should be acceptable if we make it based on a configurable maximum
> chain length along with forced resize/rehash where possible.
> 
> Note that this can be made optional, i.e., if the user can afford
> memory to do their own walking (e.g., xfrm_state), then none of
> this needs to happen and it'll just work as it does now.  IOW if
> you don't use this special rhashtable walk function then you're
> not affected.

That sounds like the best option to me.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:37             ` Thomas Graf
@ 2015-01-21  9:38               ` Herbert Xu
  2015-01-21  9:49                 ` Thomas Graf
  2015-01-21 10:36               ` David Laight
  1 sibling, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  9:38 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On Wed, Jan 21, 2015 at 09:37:22AM +0000, Thomas Graf wrote:
>
> Without having seen your code, can we make it configurable on what
> the bit is used for? Use of nulls marker is a strict requirement for
> some targeted users of rhashtable.

What do they need this for?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:38               ` Herbert Xu
@ 2015-01-21  9:49                 ` Thomas Graf
  2015-01-21  9:58                   ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-21  9:49 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 08:38pm, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 09:37:22AM +0000, Thomas Graf wrote:
> >
> > Without having seen your code, can we make it configurable on what
> > the bit is used for? Use of nulls marker is a strict requirement for
> > some targeted users of rhashtable.
> 
> What do they need this for?

An entry can move between different tables and thus chains need to be
marked to identify what list a lookup ended up searching in. It's not
the nulls marker itself that is needed, it's the bits in the last next
pointer identifying the list that the nulls marker allows to be used
which are essential.

This is on my plate next. Most of the work in rhashtable was done in
preparation of translating the TCP established table over to
rhashtable.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:14               ` Herbert Xu
@ 2015-01-21  9:56                 ` Thomas Graf
  2015-01-21  9:59                   ` Herbert Xu
  2015-01-21 10:00                   ` Patrick McHardy
  0 siblings, 2 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-21  9:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 08:14pm, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 04:15:55PM +1100, Herbert Xu wrote:
> > I failed to address the security aspect of this approach.  Obviously
> > this can only work if the only entites doing the walk are trusted.
> > Which means that the vast majority (if not all) hash table users
> > would be excluded, in particular, netlink.
> 
> OK maybe we can get around this.  So we will postpone the resize
> or rehash when a walk is in place, however, when we hit a hard
> limit, i.e., when insert would otherwise fail, then we do the
> rehash regardless of any outstanding walks.  The walks will just
> behave erratically or we could force them to start from scratch
> again.

Exactly. I think we also need a timer to abort walks because if
only a single walker is allowed, an attacker could start a walk and
not complete it to block out everybody else.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:49                 ` Thomas Graf
@ 2015-01-21  9:58                   ` Herbert Xu
  2015-01-21 10:23                     ` Thomas Graf
  2015-01-21 10:34                     ` Thomas Graf
  0 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  9:58 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On Wed, Jan 21, 2015 at 09:49:28AM +0000, Thomas Graf wrote:
>
> An entry can move between different tables and thus chains need to be
> marked to identify what list a lookup ended up searching in. It's not
> the nulls marker itself that is needed, it's the bits in the last next
> pointer identifying the list that the nulls marker allows to be used
> which are essential.

Can you describe in more detail how it's going to be used? I don't
see how I could use the bit if you need it to indicate the end of
the list.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:56                 ` Thomas Graf
@ 2015-01-21  9:59                   ` Herbert Xu
  2015-01-21 10:00                   ` Patrick McHardy
  1 sibling, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-21  9:59 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On Wed, Jan 21, 2015 at 09:56:34AM +0000, Thomas Graf wrote:
>
> Exactly. I think we also need a timer to abort walks because if
> only a single walker is allowed, an attacker could start a walk and
> not complete it to block out everybody else.

My scheme should support an arbitrary number of walks.  See how
xfrm_state_walk is implemented.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:56                 ` Thomas Graf
  2015-01-21  9:59                   ` Herbert Xu
@ 2015-01-21 10:00                   ` Patrick McHardy
  1 sibling, 0 replies; 79+ messages in thread
From: Patrick McHardy @ 2015-01-21 10:00 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Herbert Xu, davem, paulmck, ying.xue, netdev, netfilter-devel

On 21.01, Thomas Graf wrote:
> On 01/21/15 at 08:14pm, Herbert Xu wrote:
> > On Wed, Jan 21, 2015 at 04:15:55PM +1100, Herbert Xu wrote:
> > > I failed to address the security aspect of this approach.  Obviously
> > > this can only work if the only entites doing the walk are trusted.
> > > Which means that the vast majority (if not all) hash table users
> > > would be excluded, in particular, netlink.
> > 
> > OK maybe we can get around this.  So we will postpone the resize
> > or rehash when a walk is in place, however, when we hit a hard
> > limit, i.e., when insert would otherwise fail, then we do the
> > rehash regardless of any outstanding walks.  The walks will just
> > behave erratically or we could force them to start from scratch
> > again.
> 
> Exactly. I think we also need a timer to abort walks because if
> only a single walker is allowed, an attacker could start a walk and
> not complete it to block out everybody else.

Restarting them automatically sounds reasonable, duplicate entries have
always been possible. That makes me think, we could have userspace
indicate support for NLM_F_DUMP_INTR and otherwise always restart.
That would solve the problem very easily.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:58                   ` Herbert Xu
@ 2015-01-21 10:23                     ` Thomas Graf
  2015-01-22  6:35                       ` Herbert Xu
  2015-01-21 10:34                     ` Thomas Graf
  1 sibling, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-21 10:23 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 08:58pm, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 09:49:28AM +0000, Thomas Graf wrote:
> >
> > An entry can move between different tables and thus chains need to be
> > marked to identify what list a lookup ended up searching in. It's not
> > the nulls marker itself that is needed, it's the bits in the last next
> > pointer identifying the list that the nulls marker allows to be used
> > which are essential.
> 
> Can you describe in more detail how it's going to be used? I don't
> see how I could use the bit if you need it to indicate the end of
> the list.

The usage will be identical to how __inet_lookup_listener() uses it.
If at the end of the lookup, we ended up in a different table than
we started, the lookup is restarted as an entry has moved to another
table while we were moving over it.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:58                   ` Herbert Xu
  2015-01-21 10:23                     ` Thomas Graf
@ 2015-01-21 10:34                     ` Thomas Graf
  2015-01-21 10:40                       ` Patrick McHardy
  1 sibling, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-21 10:34 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 08:58pm, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 09:49:28AM +0000, Thomas Graf wrote:
> >
> > An entry can move between different tables and thus chains need to be
> > marked to identify what list a lookup ended up searching in. It's not
> > the nulls marker itself that is needed, it's the bits in the last next
> > pointer identifying the list that the nulls marker allows to be used
> > which are essential.
> 
> Can you describe in more detail how it's going to be used? I don't
> see how I could use the bit if you need it to indicate the end of
> the list.

Another thought: Instead of storing that bit in the next pointer, we
could require the user to store this bit, i.e. two new function
pointers to rhashtable_params, set_walk_bit() and get_walk_bit(),
which take the hashed object as argument and if a rhashtable user
requires consistent dumps he can provide these functions to store
the flag.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21  9:37             ` Thomas Graf
  2015-01-21  9:38               ` Herbert Xu
@ 2015-01-21 10:36               ` David Laight
  1 sibling, 0 replies; 79+ messages in thread
From: David Laight @ 2015-01-21 10:36 UTC (permalink / raw)
  To: 'Thomas Graf', Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev, netfilter-devel

From: Thomas Graf
> On 01/21/15 at 04:08pm, Herbert Xu wrote:
...
> > Essentially I need a bit to indicate an entry in the bucket chain
> > should be skipped, either because it has just been removed or that
> > it is a walker entry (see xfrm_state_walk).
> >
> > The way it'll work then is exactly the same as xfrm_state_walk,
> > except that the linked list is broken up into individual buckets.
> >
> > Of course we'll still need to postpone resizes (and rehashes which
> > is what my work is about) during a walk but I think that's a fair
> > price to pay.
> 
> If I understand this correctly we also need to block out parallel
> walkers and we need to start taking bucket locks while walking to
> modify the walker mark bit in peace.

Running 'walker | grep foo | less' really shouldn't stop any activity,
including further requests by the same person to dump the same table in a
different window.
I would also expect the above request to be able to continue correctly
when restarted hours later (unless something unusual - like a resize -
happens).

Suppressing 'table shrink' because you think a walker might be active
if probably a good idea, and just requires a timestamp of the last
walk action.

Thought...
It is possible to break walking only on hash chain boundaries?
At least in the case where the response buffer is large enough
for a full 'chain' of entries.

Also what happens if there is a page fault on any userpage?
You don't want to be holding a mutex.

	David



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21 10:34                     ` Thomas Graf
@ 2015-01-21 10:40                       ` Patrick McHardy
  2015-01-21 11:37                         ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Patrick McHardy @ 2015-01-21 10:40 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Herbert Xu, davem, paulmck, ying.xue, netdev, netfilter-devel

On 21.01, Thomas Graf wrote:
> On 01/21/15 at 08:58pm, Herbert Xu wrote:
> > On Wed, Jan 21, 2015 at 09:49:28AM +0000, Thomas Graf wrote:
> > >
> > > An entry can move between different tables and thus chains need to be
> > > marked to identify what list a lookup ended up searching in. It's not
> > > the nulls marker itself that is needed, it's the bits in the last next
> > > pointer identifying the list that the nulls marker allows to be used
> > > which are essential.
> > 
> > Can you describe in more detail how it's going to be used? I don't
> > see how I could use the bit if you need it to indicate the end of
> > the list.
> 
> Another thought: Instead of storing that bit in the next pointer, we
> could require the user to store this bit, i.e. two new function
> pointers to rhashtable_params, set_walk_bit() and get_walk_bit(),
> which take the hashed object as argument and if a rhashtable user
> requires consistent dumps he can provide these functions to store
> the flag.

On the danger of repeating myself, every (converted) user requires
that we at least keep the existing semantics since it is exposed to
userspace. My opinion is that NLM_F_DUMP_INTR is fine if userspace
indicates support, but without that, we have to take care of that
in the kernel.

An automatic restart handles this well. Userspace always had to
expect duplicates.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21 10:40                       ` Patrick McHardy
@ 2015-01-21 11:37                         ` Thomas Graf
  2015-01-21 11:59                           ` Patrick McHardy
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-21 11:37 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Herbert Xu, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 10:40am, Patrick McHardy wrote:
> On the danger of repeating myself, every (converted) user requires
> that we at least keep the existing semantics since it is exposed to
> userspace. My opinion is that NLM_F_DUMP_INTR is fine if userspace
> indicates support, but without that, we have to take care of that
> in the kernel.

Absolutely agreed. I think this is an excellent low cost path for
future users where dumps are rare.

> An automatic restart handles this well. Userspace always had to
> expect duplicates.

Maybe I don't understand the restart yet. How do you restart if the
dump was already started and the user has read part of the dump?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21 11:37                         ` Thomas Graf
@ 2015-01-21 11:59                           ` Patrick McHardy
  2015-01-21 12:07                             ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Patrick McHardy @ 2015-01-21 11:59 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Herbert Xu, davem, paulmck, ying.xue, netdev, netfilter-devel

On 21.01, Thomas Graf wrote:
> On 01/21/15 at 10:40am, Patrick McHardy wrote:
> > On the danger of repeating myself, every (converted) user requires
> > that we at least keep the existing semantics since it is exposed to
> > userspace. My opinion is that NLM_F_DUMP_INTR is fine if userspace
> > indicates support, but without that, we have to take care of that
> > in the kernel.
> 
> Absolutely agreed. I think this is an excellent low cost path for
> future users where dumps are rare.
> 
> > An automatic restart handles this well. Userspace always had to
> > expect duplicates.
> 
> Maybe I don't understand the restart yet. How do you restart if the
> dump was already started and the user has read part of the dump?

You use the mutex to prevent concurrent resizes and dumps, for dumps
that have left the kernel you simply reset the iterator state after
a resize operation.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21 11:59                           ` Patrick McHardy
@ 2015-01-21 12:07                             ` Thomas Graf
  2015-01-21 12:09                               ` Patrick McHardy
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-21 12:07 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Herbert Xu, davem, paulmck, ying.xue, netdev, netfilter-devel

On 01/21/15 at 11:59am, Patrick McHardy wrote:
> On 21.01, Thomas Graf wrote:
> > On 01/21/15 at 10:40am, Patrick McHardy wrote:
> > > An automatic restart handles this well. Userspace always had to
> > > expect duplicates.
> > 
> > Maybe I don't understand the restart yet. How do you restart if the
> > dump was already started and the user has read part of the dump?
> 
> You use the mutex to prevent concurrent resizes and dumps, for dumps
> that have left the kernel you simply reset the iterator state after
> a resize operation.

I see, so we possibly see a lot of duplicates but we will never miss
any entries.

So in summary we do:

2) We allow rhashtable users to opt in to Herbert's consistent walker
   relying on the use of an additional bit.

2) If user supports NLM_F_DUMP_INTR we block out resizes during
   message construction and set NLM_F_DUMP_INTR if a resize/rehash
   interrupted us while the user read.

3) If (1) and (2) are not available we do (2) but restart the dump
   on interruption thus causing duplicates but never missing any
   entries.

I think this offers good flexibility for new users and compatibility
for existing users.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21 12:07                             ` Thomas Graf
@ 2015-01-21 12:09                               ` Patrick McHardy
  0 siblings, 0 replies; 79+ messages in thread
From: Patrick McHardy @ 2015-01-21 12:09 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Herbert Xu, davem, paulmck, ying.xue, netdev, netfilter-devel

On 21.01, Thomas Graf wrote:
> On 01/21/15 at 11:59am, Patrick McHardy wrote:
> > On 21.01, Thomas Graf wrote:
> > > On 01/21/15 at 10:40am, Patrick McHardy wrote:
> > > > An automatic restart handles this well. Userspace always had to
> > > > expect duplicates.
> > > 
> > > Maybe I don't understand the restart yet. How do you restart if the
> > > dump was already started and the user has read part of the dump?
> > 
> > You use the mutex to prevent concurrent resizes and dumps, for dumps
> > that have left the kernel you simply reset the iterator state after
> > a resize operation.
> 
> I see, so we possibly see a lot of duplicates but we will never miss
> any entries.
> 
> So in summary we do:
> 
> 2) We allow rhashtable users to opt in to Herbert's consistent walker
>    relying on the use of an additional bit.
> 
> 2) If user supports NLM_F_DUMP_INTR we block out resizes during
>    message construction and set NLM_F_DUMP_INTR if a resize/rehash
>    interrupted us while the user read.
> 
> 3) If (1) and (2) are not available we do (2) but restart the dump
>    on interruption thus causing duplicates but never missing any
>    entries.
> 
> I think this offers good flexibility for new users and compatibility
> for existing users.

Yep, sounds good to me.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-21  8:13   ` Ying Xue
@ 2015-01-21 12:17     ` Thomas Graf
  2015-01-22  8:49       ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-21 12:17 UTC (permalink / raw)
  To: Ying Xue; +Cc: davem, kaber, herbert, paulmck, netdev, netfilter-devel

On 01/21/15 at 04:13pm, Ying Xue wrote:
> On 01/20/2015 09:20 PM, Thomas Graf wrote:
> > A deferred resize of nl_table causes the offsets that Netlink diag keeps
> > to become inaccurate. Mark the dump as inconsistent and have user space
> > request a new dump.
> > 
> > Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > ---
> >  net/netlink/af_netlink.c | 10 ++++++++++
> >  net/netlink/af_netlink.h |  1 +
> >  net/netlink/diag.c       |  1 +
> >  3 files changed, 12 insertions(+)
> > 
> > diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> > index 7a94185..e214557 100644
> > --- a/net/netlink/af_netlink.c
> > +++ b/net/netlink/af_netlink.c
> > @@ -91,6 +91,9 @@ static inline int netlink_is_kernel(struct sock *sk)
> >  struct netlink_table *nl_table;
> >  EXPORT_SYMBOL_GPL(nl_table);
> >  
> > +atomic_t nl_table_seq;
> 
> It sounds like the atomic variable is not initialized.

Thanks for the review. We also need to avoid hitting 0 when we overflow
on a seq increment.  The netfilter code is doing this correctly but several
other users are suffering from this as well.

I'll address this in v2 together with the other discussed changes.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-21 10:23                     ` Thomas Graf
@ 2015-01-22  6:35                       ` Herbert Xu
  2015-01-22  7:20                         ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-22  6:35 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev,
	netfilter-devel, Eric Dumazet

On Wed, Jan 21, 2015 at 10:23:46AM +0000, Thomas Graf wrote:
>
> The usage will be identical to how __inet_lookup_listener() uses it.
> If at the end of the lookup, we ended up in a different table than
> we started, the lookup is restarted as an entry has moved to another
> table while we were moving over it.

Who uses this stuff apart from ip_dynaddr?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-22  6:35                       ` Herbert Xu
@ 2015-01-22  7:20                         ` Herbert Xu
  2015-01-22  9:05                           ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-22  7:20 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev,
	netfilter-devel, Eric Dumazet

On Thu, Jan 22, 2015 at 05:35:01PM +1100, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 10:23:46AM +0000, Thomas Graf wrote:
> >
> > The usage will be identical to how __inet_lookup_listener() uses it.
> > If at the end of the lookup, we ended up in a different table than
> > we started, the lookup is restarted as an entry has moved to another
> > table while we were moving over it.
> 
> Who uses this stuff apart from ip_dynaddr?

OK it's there for fast socket recycling.  Given that and the fact
that everyone seems to be happy with restarting the dump after a
resize, I think we should just go with that.

Anybody who wants a better walk can always implement their own
data structure outside of rhashtable.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-21 12:17     ` Thomas Graf
@ 2015-01-22  8:49       ` Herbert Xu
  2015-01-22  8:56         ` Patrick McHardy
  2015-01-25 23:20         ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  0 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-22  8:49 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On Wed, Jan 21, 2015 at 12:17:48PM +0000, Thomas Graf wrote:
>
> Thanks for the review. We also need to avoid hitting 0 when we overflow
> on a seq increment.  The netfilter code is doing this correctly but several
> other users are suffering from this as well.
> 
> I'll address this in v2 together with the other discussed changes.

Could you hold off for a bit? I've got some changes that touch
this area that I'd like to push out.  Basically I'm trying to
eliminate direct access of rhashtable internals from the existing
users.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-22  8:49       ` Herbert Xu
@ 2015-01-22  8:56         ` Patrick McHardy
  2015-01-22  9:22           ` Herbert Xu
  2015-01-25 23:20         ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  1 sibling, 1 reply; 79+ messages in thread
From: Patrick McHardy @ 2015-01-22  8:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Ying Xue, davem, paulmck, netdev, netfilter-devel

On 22.01, Herbert Xu wrote:
> On Wed, Jan 21, 2015 at 12:17:48PM +0000, Thomas Graf wrote:
> >
> > Thanks for the review. We also need to avoid hitting 0 when we overflow
> > on a seq increment.  The netfilter code is doing this correctly but several
> > other users are suffering from this as well.
> > 
> > I'll address this in v2 together with the other discussed changes.
> 
> Could you hold off for a bit? I've got some changes that touch
> this area that I'd like to push out.  Basically I'm trying to
> eliminate direct access of rhashtable internals from the existing
> users.

Hope it will still be possible, I need it for GC in timeout support for
nftables sets.

Current patch attached for reference.


commit 91a334436d2f8eaa8261251d7d98cb700138f8b6
Author: Patrick McHardy <kaber@trash.net>
Date:   Fri Jan 16 18:12:31 2015 +0000

    netfilter: nft_hash: add support for timeouts
    
    Signed-off-by: Patrick McHardy <kaber@trash.net>

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index cba0ad2..e7cf886 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -15,6 +15,7 @@
 #include <linux/log2.h>
 #include <linux/jhash.h>
 #include <linux/netlink.h>
+#include <linux/workqueue.h>
 #include <linux/rhashtable.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
@@ -23,19 +24,47 @@
 /* We target a hash table size of 4, element hint is 75% of final size */
 #define NFT_HASH_ELEMENT_HINT 3
 
+struct nft_hash {
+	struct rhashtable		ht;
+	struct delayed_work		gc_work;
+};
+
 struct nft_hash_elem {
 	struct rhash_head		node;
 	struct nft_set_ext		ext;
 };
 
+struct nft_hash_compare_arg {
+	const struct nft_data		*key;
+	unsigned int			len;
+};
+
+static bool nft_hash_compare(void *ptr, void *arg)
+{
+	const struct nft_hash_elem *he = ptr;
+	struct nft_hash_compare_arg *x = arg;
+
+	if (nft_data_cmp(nft_set_ext_key(&he->ext), x->key, x->len))
+		return false;
+	if (nft_set_ext_exists(&he->ext, NFT_SET_EXT_TIMEOUT) &&
+	    time_after_eq(jiffies, *nft_set_ext_timeout(&he->ext)))
+		return false;
+
+	return true;
+}
+
 static bool nft_hash_lookup(const struct nft_set *set,
 			    const struct nft_data *key,
 			    const struct nft_set_ext **ext)
 {
-	struct rhashtable *priv = nft_set_priv(set);
+	struct nft_hash *priv = nft_set_priv(set);
 	const struct nft_hash_elem *he;
+	struct nft_hash_compare_arg arg = {
+		.key	= key,
+		.len	= set->klen,
+	};
 
-	he = rhashtable_lookup(priv, key);
+	he = rhashtable_lookup_compare(&priv->ht, key, nft_hash_compare, &arg);
 	if (he != NULL)
 		*ext = &he->ext;
 
@@ -45,10 +74,10 @@ static bool nft_hash_lookup(const struct nft_set *set,
 static int nft_hash_insert(const struct nft_set *set,
 			   const struct nft_set_elem *elem)
 {
-	struct rhashtable *priv = nft_set_priv(set);
+	struct nft_hash *priv = nft_set_priv(set);
 	struct nft_hash_elem *he = elem->priv;
 
-	rhashtable_insert(priv, &he->node);
+	rhashtable_insert(&priv->ht, &he->node);
 	return 0;
 }
 
@@ -64,58 +93,43 @@ static void nft_hash_elem_destroy(const struct nft_set *set,
 static void nft_hash_remove(const struct nft_set *set,
 			    const struct nft_set_elem *elem)
 {
-	struct rhashtable *priv = nft_set_priv(set);
+	struct nft_hash *priv = nft_set_priv(set);
+	struct nft_hash_elem *he = elem->cookie;
 
-	rhashtable_remove(priv, elem->cookie);
+	rhashtable_remove(&priv->ht, &he->node);
 	synchronize_rcu();
 	kfree(elem->cookie);
 }
 
-struct nft_compare_arg {
-	const struct nft_set *set;
-	struct nft_set_elem *elem;
-};
-
-static bool nft_hash_compare(void *ptr, void *arg)
-{
-	struct nft_hash_elem *he = ptr;
-	struct nft_compare_arg *x = arg;
-
-	if (!nft_data_cmp(nft_set_ext_key(&he->ext), &x->elem->key,
-			  x->set->klen)) {
-		x->elem->cookie = he;
-		x->elem->priv  = he;
-		return true;
-	}
-
-	return false;
-}
-
 static int nft_hash_get(const struct nft_set *set, struct nft_set_elem *elem)
 {
-	struct rhashtable *priv = nft_set_priv(set);
-	struct nft_compare_arg arg = {
-		.set = set,
-		.elem = elem,
+	struct nft_hash *priv = nft_set_priv(set);
+	struct nft_hash_elem *he;
+	struct nft_hash_compare_arg arg = {
+		.key	= &elem->key,
+		.len	= set->klen,
 	};
 
-	if (rhashtable_lookup_compare(priv, &elem->key,
-				      &nft_hash_compare, &arg))
+	he = rhashtable_lookup_compare(&priv->ht, &elem->key,
+				       nft_hash_compare, &arg);
+	if (he != NULL) {
+		elem->cookie = he;
+		elem->priv   = he;
 		return 0;
-
+	}
 	return -ENOENT;
 }
 
 static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,
 			  struct nft_set_iter *iter)
 {
-	struct rhashtable *priv = nft_set_priv(set);
+	struct nft_hash *priv = nft_set_priv(set);
 	const struct bucket_table *tbl;
 	struct nft_hash_elem *he;
 	struct nft_set_elem elem;
 	unsigned int i;
 
-	tbl = rht_dereference_rcu(priv->tbl, priv);
+	tbl = rht_dereference_rcu(priv->ht.tbl, &priv->ht);
 	for (i = 0; i < tbl->size; i++) {
 		struct rhash_head *pos;
 
@@ -134,16 +148,48 @@ cont:
 	}
 }
 
+static void nft_hash_gc(struct work_struct *work)
+{
+	const struct nft_set *set;
+	const struct bucket_table *tbl;
+	struct rhash_head *pos, *next;
+	struct nft_hash_elem *he;
+	struct nft_hash *priv;
+	unsigned long timeout;
+	unsigned int i;
+
+	priv = container_of(work, struct nft_hash, gc_work.work);
+	set  = (void *)priv - offsetof(struct nft_set, data);
+
+	mutex_lock(&priv->ht.mutex);
+	tbl = rht_dereference(priv->ht.tbl, &priv->ht);
+	for (i = 0; i < tbl->size; i++) {
+		rht_for_each_entry_safe(he, pos, next, tbl, i, node) {
+			if (!nft_set_ext_exists(&he->ext, NFT_SET_EXT_TIMEOUT))
+				continue;
+			timeout = *nft_set_ext_timeout(&he->ext);
+			if (time_before(jiffies, timeout))
+				continue;
+
+			rhashtable_remove(&priv->ht, &he->node);
+			nft_hash_elem_destroy(set, he);
+		}
+	}
+	mutex_unlock(&priv->ht.mutex);
+
+	queue_delayed_work(system_power_efficient_wq, &priv->gc_work, HZ);
+}
+
 static unsigned int nft_hash_privsize(const struct nlattr * const nla[])
 {
-	return sizeof(struct rhashtable);
+	return sizeof(struct nft_hash);
 }
 
 static int nft_hash_init(const struct nft_set *set,
 			 const struct nft_set_desc *desc,
 			 const struct nlattr * const tb[])
 {
-	struct rhashtable *priv = nft_set_priv(set);
+	struct nft_hash *priv = nft_set_priv(set);
 	struct rhashtable_params params = {
 		.nelem_hint = desc->size ? : NFT_HASH_ELEMENT_HINT,
 		.head_offset = offsetof(struct nft_hash_elem, node),
@@ -153,30 +199,42 @@ static int nft_hash_init(const struct nft_set *set,
 		.grow_decision = rht_grow_above_75,
 		.shrink_decision = rht_shrink_below_30,
 	};
+	int err;
 
-	return rhashtable_init(priv, &params);
+	err = rhashtable_init(&priv->ht, &params);
+	if (err < 0)
+		return err;
+
+	INIT_DEFERRABLE_WORK(&priv->gc_work, nft_hash_gc);
+	if (set->flags & NFT_SET_TIMEOUT)
+		queue_delayed_work(system_power_efficient_wq,
+				   &priv->gc_work, HZ);
+
+	return 0;
 }
 
 static void nft_hash_destroy(const struct nft_set *set)
 {
-	struct rhashtable *priv = nft_set_priv(set);
+	struct nft_hash *priv = nft_set_priv(set);
 	const struct bucket_table *tbl;
 	struct nft_hash_elem *he;
 	struct rhash_head *pos, *next;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&priv->gc_work);
+
 	/* Stop an eventual async resizing */
-	priv->being_destroyed = true;
-	mutex_lock(&priv->mutex);
+	priv->ht.being_destroyed = true;
+	mutex_lock(&priv->ht.mutex);
 
-	tbl = rht_dereference(priv->tbl, priv);
+	tbl = rht_dereference(priv->ht.tbl, &priv->ht);
 	for (i = 0; i < tbl->size; i++) {
 		rht_for_each_entry_safe(he, pos, next, tbl, i, node)
 			nft_hash_elem_destroy(set, he);
 	}
-	mutex_unlock(&priv->mutex);
+	mutex_unlock(&priv->ht.mutex);
 
-	rhashtable_destroy(priv);
+	rhashtable_destroy(&priv->ht);
 }
 
 static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features,
@@ -187,9 +245,11 @@ static bool nft_hash_estimate(const struct nft_set_desc *desc, u32 features,
 	esize = sizeof(struct nft_hash_elem);
 	if (features & NFT_SET_MAP)
 		esize += sizeof(struct nft_data);
+	if (features & NFT_SET_TIMEOUT)
+		esize += sizeof(unsigned long);
 
 	if (desc->size) {
-		est->size = sizeof(struct rhashtable) +
+		est->size = sizeof(struct nft_hash) +
 			    roundup_pow_of_two(desc->size * 4 / 3) *
 			    sizeof(struct nft_hash_elem *) +
 			    desc->size * esize;
@@ -218,7 +278,7 @@ static struct nft_set_ops nft_hash_ops __read_mostly = {
 	.remove		= nft_hash_remove,
 	.lookup		= nft_hash_lookup,
 	.walk		= nft_hash_walk,
-	.features	= NFT_SET_MAP,
+	.features	= NFT_SET_MAP | NFT_SET_TIMEOUT,
 	.owner		= THIS_MODULE,
 };
 
@@ -238,3 +298,4 @@ module_exit(nft_hash_module_exit);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
 MODULE_ALIAS_NFT_SET();
+

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-22  7:20                         ` Herbert Xu
@ 2015-01-22  9:05                           ` Thomas Graf
  2015-01-22  9:50                             ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-22  9:05 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev,
	netfilter-devel, Eric Dumazet

On 01/22/15 at 06:20pm, Herbert Xu wrote:
> On Thu, Jan 22, 2015 at 05:35:01PM +1100, Herbert Xu wrote:
> > On Wed, Jan 21, 2015 at 10:23:46AM +0000, Thomas Graf wrote:
> > >
> > > The usage will be identical to how __inet_lookup_listener() uses it.
> > > If at the end of the lookup, we ended up in a different table than
> > > we started, the lookup is restarted as an entry has moved to another
> > > table while we were moving over it.
> > 
> > Who uses this stuff apart from ip_dynaddr?
> 
> OK it's there for fast socket recycling.  Given that and the fact
> that everyone seems to be happy with restarting the dump after a
> resize, I think we should just go with that.
> 
> Anybody who wants a better walk can always implement their own
> data structure outside of rhashtable.

What did you think of the idea to let the user store the walker
bit for rhashtable?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-22  8:56         ` Patrick McHardy
@ 2015-01-22  9:22           ` Herbert Xu
  2015-01-22 10:07             ` Patrick McHardy
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-22  9:22 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Thomas Graf, Ying Xue, davem, paulmck, netdev, netfilter-devel

On Thu, Jan 22, 2015 at 08:56:44AM +0000, Patrick McHardy wrote:
>
> Hope it will still be possible, I need it for GC in timeout support for
> nftables sets.

This should be able to use the same walk that you're currently
using for dump once we add the restart, no?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets
  2015-01-22  9:05                           ` Thomas Graf
@ 2015-01-22  9:50                             ` Herbert Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-22  9:50 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Patrick McHardy, davem, paulmck, ying.xue, netdev,
	netfilter-devel, Eric Dumazet

On Thu, Jan 22, 2015 at 09:05:18AM +0000, Thomas Graf wrote:
>
> What did you think of the idea to let the user store the walker
> bit for rhashtable?

The problem with that is the bit is needed to indicate special
walker objects which the user won't be able to decipher (see
xfrm_state_walk).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize
  2015-01-22  9:22           ` Herbert Xu
@ 2015-01-22 10:07             ` Patrick McHardy
  0 siblings, 0 replies; 79+ messages in thread
From: Patrick McHardy @ 2015-01-22 10:07 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Thomas Graf, Ying Xue, davem, paulmck, netdev, netfilter-devel

On 22.01, Herbert Xu wrote:
> On Thu, Jan 22, 2015 at 08:56:44AM +0000, Patrick McHardy wrote:
> >
> > Hope it will still be possible, I need it for GC in timeout support for
> > nftables sets.
> 
> This should be able to use the same walk that you're currently
> using for dump once we add the restart, no?

Yes, should even be possible to use the walk function with some small
adjustments.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink
  2015-01-22  8:49       ` Herbert Xu
  2015-01-22  8:56         ` Patrick McHardy
@ 2015-01-25 23:20         ` Herbert Xu
  2015-01-25 23:21           ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
                             ` (2 more replies)
  1 sibling, 3 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-25 23:20 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On Thu, Jan 22, 2015 at 07:49:24PM +1100, Herbert Xu wrote:
>
> Could you hold off for a bit? I've got some changes that touch
> this area that I'd like to push out.  Basically I'm trying to
> eliminate direct access of rhashtable internals from the existing
> users.

Here are the first two patches, one to add the primitives and one
to demonstrate its use in netlink.  In fact while testing this I
found that the existing netlink walking code is totally broken.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-25 23:20         ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
@ 2015-01-25 23:21           ` Herbert Xu
  2015-01-26  8:20             ` Thomas Graf
  2015-01-26 10:09             ` David Laight
  2015-01-25 23:21           ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
  2015-01-27 23:19           ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  2 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-25 23:21 UTC (permalink / raw)
  To: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

Some existing rhashtable users get too intimate with it by walking
the buckets directly.  This prevents us from easily changing the
internals of rhashtable.

This patch adds the helpers rhashtable_walk_init/next/end which
will replace these custom walkers.

They are meant to be usable for both procfs seq_file walks as well
as walking by a netlink dump.  The iterator structure should fit
inside a netlink dump cb structure, with at least one element to
spare.
  
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |   44 +++++++++++++++++++++
 lib/rhashtable.c           |   91 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 6d7e840..b03b375 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -18,6 +18,7 @@
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
+#include <linux/compiler.h>
 #include <linux/list_nulls.h>
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
@@ -123,6 +124,22 @@ struct rhashtable {
 	bool                            being_destroyed;
 };
 
+/**
+ * struct rhashtable_iter - Hash table iterator
+ * @ht: Table to iterate through
+ * @p: Current pointer
+ * @lock: Slot lock
+ * @slot: Current slot
+ * @skip: Number of entries to skip in slot
+ */
+struct rhashtable_iter {
+	struct rhashtable *ht;
+	struct rhash_head *p;
+	spinlock_t *lock;
+	unsigned int slot;
+	unsigned int skip;
+};
+
 static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
 {
 	return NULLS_MARKER(ht->p.nulls_base + hash);
@@ -178,6 +195,33 @@ bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
 				      bool (*compare)(void *, void *),
 				      void *arg);
 
+/**
+ * rhashtable_walk_init - Initialise an iterator
+ * @ht:		Table to walk over
+ * @iter:	Hash table Iterator
+ *
+ * This function prepares a hash table walk.
+ * Note that if you restart a walk after rhashtable_walk_stop you
+ * may see the same object twice.  Also, you may miss objects if
+ * there are removals in between rhashtable_walk_stop and the next
+ * call to rhashtable_walk_start.
+ *
+ * For a completely stable walk you should construct your own data
+ * structure outside the hash table.
+ */
+static inline void rhashtable_walk_init(struct rhashtable *ht,
+					struct rhashtable_iter *iter)
+{
+	iter->ht = ht;
+	iter->p = NULL;
+	iter->slot = 0;
+	iter->skip = 0;
+}
+
+int rhashtable_walk_start(struct rhashtable_iter *iter) __acquires(RCU);
+void *rhashtable_walk_next(struct rhashtable_iter *iter);
+void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
+
 void rhashtable_destroy(struct rhashtable *ht);
 
 #define rht_dereference(p, ht) \
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 71c6aa1..d51fb06 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -813,6 +813,97 @@ exit:
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare_insert);
 
+/**
+ * rhashtable_walk_start - Start a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Start a hash table walk.
+ *
+ * Returns zero if successful.  Returns -EINTR if we couldn't
+ * obtain the resize lock.
+ */
+int rhashtable_walk_start(struct rhashtable_iter *iter)
+{
+	struct rhashtable *ht = iter->ht;
+	int err;
+
+	err = mutex_lock_interruptible(&ht->mutex);
+	rcu_read_lock();
+
+	if (!err)
+		mutex_unlock(&ht->mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_start);
+
+/**
+ * rhashtable_walk_next - Return the next object and advance the iterator
+ * @iter:	Hash table iterator
+ *
+ * Note that you must call rhashtable_walk_stop when you are finished
+ * with the walk.
+ *
+ * Returns the next object or NULL when the end of the table is reached.
+ */
+void *rhashtable_walk_next(struct rhashtable_iter *iter)
+{
+	const struct bucket_table *tbl;
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+
+	if (p) {
+		p = rht_dereference_bucket(p->next, tbl, iter->slot);
+		goto next;
+	}
+
+	for (; iter->slot < tbl->size; iter->slot++) {
+		int skip = iter->skip;
+
+		iter->lock = bucket_lock(tbl, iter->slot);
+		spin_lock_bh(iter->lock);
+
+		rht_for_each(p, tbl, iter->slot) {
+			if (!skip)
+				break;
+			skip--;
+		}
+
+next:
+		if (!rht_is_a_nulls(p)) {
+			iter->skip++;
+			iter->p = p;
+			return rht_obj(ht, p);
+		}
+		spin_unlock_bh(iter->lock);
+
+		iter->skip = 0;
+	}
+
+	iter->p = NULL;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_next);
+
+/**
+ * rhashtable_walk_stop - Finish a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Finish a hash table walk.
+ */
+void rhashtable_walk_stop(struct rhashtable_iter *iter)
+{
+	if (iter->p)
+		spin_unlock_bh(iter->lock);
+
+	rcu_read_unlock();
+
+	iter->p = NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
+
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 2/2] netlink: Use rhashtable walk iterator
  2015-01-25 23:20         ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  2015-01-25 23:21           ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
@ 2015-01-25 23:21           ` Herbert Xu
  2015-01-27 23:19           ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  2 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-25 23:21 UTC (permalink / raw)
  To: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

This patch gets rid of the manual rhashtable walk in netlink
which touches rhashtable internals that should not be exposed.
It does so by using the rhashtable iterator primitives.

In fact the existing code was very buggy.  Some sockets weren't
shown at all while others were shown more than once.
 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |  113 +++++++++++++++++------------------------------
 1 file changed, 41 insertions(+), 72 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d77b346..6f8549e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2884,99 +2884,68 @@ EXPORT_SYMBOL(nlmsg_notify);
 #ifdef CONFIG_PROC_FS
 struct nl_seq_iter {
 	struct seq_net_private p;
+	struct rhashtable_iter hti;
 	int link;
-	int hash_idx;
 };
 
-static struct sock *netlink_seq_socket_idx(struct seq_file *seq, loff_t pos)
+static int netlink_walk_start(struct nl_seq_iter *iter)
 {
-	struct nl_seq_iter *iter = seq->private;
-	int i, j;
-	struct netlink_sock *nlk;
-	struct sock *s;
-	loff_t off = 0;
-
-	for (i = 0; i < MAX_LINKS; i++) {
-		struct rhashtable *ht = &nl_table[i].hash;
-		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
-
-		for (j = 0; j < tbl->size; j++) {
-			struct rhash_head *node;
-
-			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
-				s = (struct sock *)nlk;
-
-				if (sock_net(s) != seq_file_net(seq))
-					continue;
-				if (off == pos) {
-					iter->link = i;
-					iter->hash_idx = j;
-					return s;
-				}
-				++off;
-			}
-		}
-	}
-	return NULL;
-}
-
-static void *netlink_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
-{
-	rcu_read_lock();
-	return *pos ? netlink_seq_socket_idx(seq, *pos - 1) : SEQ_START_TOKEN;
+	rhashtable_walk_init(&nl_table[iter->link].hash, &iter->hti);
+	return rhashtable_walk_start(&iter->hti);
 }
 
-static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+static void *__netlink_seq_next(struct seq_file *seq)
 {
-	struct rhashtable *ht;
-	const struct bucket_table *tbl;
-	struct rhash_head *node;
+	struct nl_seq_iter *iter = seq->private;
 	struct netlink_sock *nlk;
-	struct nl_seq_iter *iter;
-	struct net *net;
-	int i, j;
 
-	++*pos;
+	do {
+		while (!(nlk = rhashtable_walk_next(&iter->hti))) {
+			int err;
 
-	if (v == SEQ_START_TOKEN)
-		return netlink_seq_socket_idx(seq, 0);
+			rhashtable_walk_stop(&iter->hti);
+			if (++iter->link >= MAX_LINKS)
+				return NULL;
 
-	net = seq_file_net(seq);
-	iter = seq->private;
-	nlk = v;
+			err = netlink_walk_start(iter);
+			if (err)
+				return ERR_PTR(err);
+		}
+	} while (sock_net(&nlk->sk) != seq_file_net(seq));
 
-	i = iter->link;
-	ht = &nl_table[i].hash;
-	tbl = rht_dereference_rcu(ht->tbl, ht);
-	rht_for_each_entry_rcu_continue(nlk, node, nlk->node.next, tbl, iter->hash_idx, node)
-		if (net_eq(sock_net((struct sock *)nlk), net))
-			return nlk;
+	return nlk;
+}
 
-	j = iter->hash_idx + 1;
+static void *netlink_seq_start(struct seq_file *seq, loff_t *posp)
+{
+	struct nl_seq_iter *iter = seq->private;
+	void *obj = SEQ_START_TOKEN;
+	loff_t pos;
+	int err;
 
-	do {
+	iter->link = 0;
+	err = netlink_walk_start(iter);
+	if (err)
+		return ERR_PTR(err);
 
-		for (; j < tbl->size; j++) {
-			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
-				if (net_eq(sock_net((struct sock *)nlk), net)) {
-					iter->link = i;
-					iter->hash_idx = j;
-					return nlk;
-				}
-			}
-		}
+	for (pos = *posp; pos && obj; pos--)
+		obj = __netlink_seq_next(seq);
 
-		j = 0;
-	} while (++i < MAX_LINKS);
+	return obj;
+}
 
-	return NULL;
+static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return __netlink_seq_next(seq);
 }
 
 static void netlink_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
 {
-	rcu_read_unlock();
+	struct nl_seq_iter *iter = seq->private;
+
+	if (iter->link < MAX_LINKS)
+		rhashtable_walk_stop(&iter->hti);
 }
 
 

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-25 23:21           ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
@ 2015-01-26  8:20             ` Thomas Graf
  2015-01-26 22:21               ` Herbert Xu
  2015-01-26 10:09             ` David Laight
  1 sibling, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-26  8:20 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On 01/26/15 at 10:21am, Herbert Xu wrote:
> +int rhashtable_walk_start(struct rhashtable_iter *iter)
> +{
> +	struct rhashtable *ht = iter->ht;
> +	int err;
> +
> +	err = mutex_lock_interruptible(&ht->mutex);
> +	rcu_read_lock();
> +
> +	if (!err)
> +		mutex_unlock(&ht->mutex);
> +
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(rhashtable_walk_start);

This doesn't seem to work for Netlink dumps as it would define
a RCU read side section across user read()s.

I assume you relying on RCU to defer the initial table relinking
in the resizing to be postponed. Wouldn't it be better to hold
the mutex instead, we could sleep during the dump.

> +/**
> + * rhashtable_walk_next - Return the next object and advance the iterator
> + * @iter:	Hash table iterator
> + *
> + * Note that you must call rhashtable_walk_stop when you are finished
> + * with the walk.
> + *
> + * Returns the next object or NULL when the end of the table is reached.
> + */
> +void *rhashtable_walk_next(struct rhashtable_iter *iter)
> +{
> +	const struct bucket_table *tbl;
> +	struct rhashtable *ht = iter->ht;
> +	struct rhash_head *p = iter->p;
> +
> +	tbl = rht_dereference_rcu(ht->tbl, ht);
> +
> +	if (p) {
> +		p = rht_dereference_bucket(p->next, tbl, iter->slot);
> +		goto next;
> +	}
> +
> +	for (; iter->slot < tbl->size; iter->slot++) {
> +		int skip = iter->skip;
> +
> +		iter->lock = bucket_lock(tbl, iter->slot);
> +		spin_lock_bh(iter->lock);
> +
> +		rht_for_each(p, tbl, iter->slot) {
> +			if (!skip)
> +				break;
> +			skip--;
> +		}
> +
> +next:
> +		if (!rht_is_a_nulls(p)) {
> +			iter->skip++;
> +			iter->p = p;
> +			return rht_obj(ht, p);

Same here, we leave the iterator with the bucket lock held.

> +		}
> +		spin_unlock_bh(iter->lock);
> +
> +		iter->skip = 0;
> +	}
> +
> +	iter->p = NULL;
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(rhashtable_walk_next);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-25 23:21           ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
  2015-01-26  8:20             ` Thomas Graf
@ 2015-01-26 10:09             ` David Laight
  2015-01-26 22:23               ` Herbert Xu
  1 sibling, 1 reply; 79+ messages in thread
From: David Laight @ 2015-01-26 10:09 UTC (permalink / raw)
  To: 'Herbert Xu',
	Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev,
	netfilter-devel

From: Herbert Xu
> Some existing rhashtable users get too intimate with it by walking
> the buckets directly.  This prevents us from easily changing the
> internals of rhashtable.
> 
> This patch adds the helpers rhashtable_walk_init/next/end which
> will replace these custom walkers.
> 
> They are meant to be usable for both procfs seq_file walks as well
> as walking by a netlink dump.  The iterator structure should fit
> inside a netlink dump cb structure, with at least one element to
> spare.
...
> +/**
> + * rhashtable_walk_start - Start a hash table walk
> + * @iter:	Hash table iterator
> + *
> + * Start a hash table walk.
> + *
> + * Returns zero if successful.  Returns -EINTR if we couldn't
> + * obtain the resize lock.
> + */
> +int rhashtable_walk_start(struct rhashtable_iter *iter)
> +{
> +	struct rhashtable *ht = iter->ht;
> +	int err;
> +
> +	err = mutex_lock_interruptible(&ht->mutex);
> +	rcu_read_lock();
> +
> +	if (!err)
> +		mutex_unlock(&ht->mutex);
> +
> +	return err;
> +}

That doesn't look right to me.
Surely you shouldn't be calling rcu_read_lock() when the mutex
request is interrupted.

So maybe:
	err = mutex_lock_interruptible(&ht->mutex);
	if (err)
		return err;
	rcu_read_lock();
	return 0;
}

	David

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26  8:20             ` Thomas Graf
@ 2015-01-26 22:21               ` Herbert Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-26 22:21 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On Mon, Jan 26, 2015 at 08:20:00AM +0000, Thomas Graf wrote:
> On 01/26/15 at 10:21am, Herbert Xu wrote:
> > +int rhashtable_walk_start(struct rhashtable_iter *iter)
> > +{
> > +	struct rhashtable *ht = iter->ht;
> > +	int err;
> > +
> > +	err = mutex_lock_interruptible(&ht->mutex);
> > +	rcu_read_lock();
> > +
> > +	if (!err)
> > +		mutex_unlock(&ht->mutex);
> > +
> > +	return err;
> > +}
> > +EXPORT_SYMBOL_GPL(rhashtable_walk_start);
> 
> This doesn't seem to work for Netlink dumps as it would define
> a RCU read side section across user read()s.

No it does not.  The start call roughly corresponds to the start
call in seq_file which does not span across reads.

> I assume you relying on RCU to defer the initial table relinking
> in the resizing to be postponed. Wouldn't it be better to hold
> the mutex instead, we could sleep during the dump.

Note that this is not the final solution but just the first step.
So right now it does not prevent/postpone resizes completely.  It's
only meant to be a strict improvement over what we currently have.

So indeed resizes will be stopped within a single read but not
across reads for seq_file.  Similarly for netlink dump resizes
will only be stopped for a single skb.

I will do the rest of what we discussed once I implement rehashing
since there will be dependencies on it.
 
> Same here, we leave the iterator with the bucket lock held.

This is intentional, it is what udp_diag does.  We'll have to do
this if you ever want to convert UDP sockets over to rhashtable
so we might as well do it now.

If you don't hold the bucket lock then anyone that recycles objects
like UDP will end up walking into the ether.

As before this is not held across reads but only within a read.
Please see patch #2.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26 10:09             ` David Laight
@ 2015-01-26 22:23               ` Herbert Xu
  2015-01-26 22:36                 ` David Miller
  2015-01-27 10:09                 ` David Laight
  0 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-26 22:23 UTC (permalink / raw)
  To: David Laight
  Cc: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On Mon, Jan 26, 2015 at 10:09:24AM +0000, David Laight wrote:
>
> That doesn't look right to me.
> Surely you shouldn't be calling rcu_read_lock() when the mutex
> request is interrupted.
> 
> So maybe:
> 	err = mutex_lock_interruptible(&ht->mutex);
> 	if (err)
> 		return err;
> 	rcu_read_lock();

No, we need to grab the RCU read lock while holding the mutex
in order to prevent future resizes from happening once we release
the mutex.

We don't want to hold the mutex which would stop other walks from
starting.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26 22:23               ` Herbert Xu
@ 2015-01-26 22:36                 ` David Miller
  2015-01-26 22:42                   ` Herbert Xu
  2015-01-27 10:09                 ` David Laight
  1 sibling, 1 reply; 79+ messages in thread
From: David Miller @ 2015-01-26 22:36 UTC (permalink / raw)
  To: herbert
  Cc: David.Laight, tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 27 Jan 2015 09:23:00 +1100

> On Mon, Jan 26, 2015 at 10:09:24AM +0000, David Laight wrote:
>>
>> That doesn't look right to me.
>> Surely you shouldn't be calling rcu_read_lock() when the mutex
>> request is interrupted.
>> 
>> So maybe:
>> 	err = mutex_lock_interruptible(&ht->mutex);
>> 	if (err)
>> 		return err;
>> 	rcu_read_lock();
> 
> No, we need to grab the RCU read lock while holding the mutex
> in order to prevent future resizes from happening once we release
> the mutex.
> 
> We don't want to hold the mutex which would stop other walks from
> starting.

I really think the amount of time and effort being put into making
the walker function properly is excessive.

Let's just say that if you want to use rhashtable, you have to use a
linked list, or similar separate mechanism, to have stable walks of
the entire set of objects.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26 22:36                 ` David Miller
@ 2015-01-26 22:42                   ` Herbert Xu
  2015-01-26 23:31                     ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-26 22:42 UTC (permalink / raw)
  To: David Miller
  Cc: David.Laight, tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

On Mon, Jan 26, 2015 at 02:36:13PM -0800, David Miller wrote:
>
> I really think the amount of time and effort being put into making
> the walker function properly is excessive.
> 
> Let's just say that if you want to use rhashtable, you have to use a
> linked list, or similar separate mechanism, to have stable walks of
> the entire set of objects.

I definitely agree with that :)

However, I'd really like to get these primitives in first so that
at least I can convert the existing rhashtable walkers over and
get on with implementing rhashtable rehashing which is what I set
out to do.

As otherwise we'd be looking at fixing the existing users (in
particular, nft_hash) properly which might take a lot more time.

If and when we finally fix the existing users we can always remove
these primitives.

FWIW when I first wrote this patch I called it
rhashtable_walk_dangerously, I can readopt that name if you think
that will dissuade people from using it.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26 22:42                   ` Herbert Xu
@ 2015-01-26 23:31                     ` Herbert Xu
  2015-01-27  9:45                       ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-26 23:31 UTC (permalink / raw)
  To: David Miller
  Cc: David.Laight, tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

On Tue, Jan 27, 2015 at 09:42:16AM +1100, Herbert Xu wrote:
> 
> As otherwise we'd be looking at fixing the existing users (in
> particular, nft_hash) properly which might take a lot more time.

I take that back.  Looks like my walkers would be no good for
netfilter anyway since it wants to do nested walks, meaning no
locks can be taken.

So I think I'll just have to tackle netfilter first in which case
the walk function would be no use anyway since I could just fix
netlink by itself.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26 23:31                     ` Herbert Xu
@ 2015-01-27  9:45                       ` Thomas Graf
  2015-01-27  9:54                         ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-27  9:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On 01/27/15 at 10:31am, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 09:42:16AM +1100, Herbert Xu wrote:
> > 
> > As otherwise we'd be looking at fixing the existing users (in
> > particular, nft_hash) properly which might take a lot more time.
> 
> I take that back.  Looks like my walkers would be no good for
> netfilter anyway since it wants to do nested walks, meaning no
> locks can be taken.
> 
> So I think I'll just have to tackle netfilter first in which case
> the walk function would be no use anyway since I could just fix
> netlink by itself.

OK. I'll wait for your netfilter approach before posting NLM_F_INTR
patches.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27  9:45                       ` Thomas Graf
@ 2015-01-27  9:54                         ` Herbert Xu
  2015-01-27 10:15                           ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-27  9:54 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On Tue, Jan 27, 2015 at 09:45:41AM +0000, Thomas Graf wrote:
>
> OK. I'll wait for your netfilter approach before posting NLM_F_INTR
> patches.

Why would we still need the INTR stuff given that Dave has asked
for this to be done outside of rhashtable?

What I'm planning on doing right now is to divide rhashtable users
into two camps, lockless insert/removal vs. locked insert/removal.
netfilter will obviously fall into the latter camp.

For those with locked insert/removal I will make resizing/rehashing
synchronous since there is zero benefit from doing it in a separate
thread.

Once that is done you can then easily walk the table by just holding
the lock, which is what netfilter does today.

I'll probably make netlink locked too since it's much simpler that
way.  Only really high performance users like TCP/UDP sockets need
lockless insert/removal and the complexity that comes with it.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-26 22:23               ` Herbert Xu
  2015-01-26 22:36                 ` David Miller
@ 2015-01-27 10:09                 ` David Laight
  2015-01-27 10:12                   ` Herbert Xu
  1 sibling, 1 reply; 79+ messages in thread
From: David Laight @ 2015-01-27 10:09 UTC (permalink / raw)
  To: 'Herbert Xu'
  Cc: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

From: Herbert Xu 
> On Mon, Jan 26, 2015 at 10:09:24AM +0000, David Laight wrote:
> >
> > That doesn't look right to me.
> > Surely you shouldn't be calling rcu_read_lock() when the mutex
> > request is interrupted.
> >
> > So maybe:
> > 	err = mutex_lock_interruptible(&ht->mutex);
> > 	if (err)
> > 		return err;
> > 	rcu_read_lock();
> 
> No, we need to grab the RCU read lock while holding the mutex
> in order to prevent future resizes from happening once we release
> the mutex.

But if err is non-zero you don't hold the mutex.
Presumably the calling code also errors out immediately,
so the RCU lock isn't needed at all.

	David

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 10:09                 ` David Laight
@ 2015-01-27 10:12                   ` Herbert Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 10:12 UTC (permalink / raw)
  To: David Laight
  Cc: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On Tue, Jan 27, 2015 at 10:09:18AM +0000, David Laight wrote:
>
> But if err is non-zero you don't hold the mutex.
> Presumably the calling code also errors out immediately,
> so the RCU lock isn't needed at all.

No the calling code expects the RCU lock to be held because it
will always call stop.  Look at how seq_file works.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27  9:54                         ` Herbert Xu
@ 2015-01-27 10:15                           ` Thomas Graf
  2015-01-27 10:24                             ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-27 10:15 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On 01/27/15 at 08:54pm, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 09:45:41AM +0000, Thomas Graf wrote:
> >
> > OK. I'll wait for your netfilter approach before posting NLM_F_INTR
> > patches.
> 
> Why would we still need the INTR stuff given that Dave has asked
> for this to be done outside of rhashtable?

Because libraries like libnl already handle this transparently for
various user space apps. Those apps are not even aware of the problem
and always get a consistent dump if the kernel implements INTR.

> What I'm planning on doing right now is to divide rhashtable users
> into two camps, lockless insert/removal vs. locked insert/removal.
> netfilter will obviously fall into the latter camp.
> 
> For those with locked insert/removal I will make resizing/rehashing
> synchronous since there is zero benefit from doing it in a separate
> thread.

Because making it synchronous does not help the Netlink dump problem
at all, see below.

Also, a sync resize causes a random inserts/removal to take a lot
longer. This leads to less predictable insert/removal times. I have
no objection to offering both sync and async resizing as long as the
choice of balance is configurable.

> Once that is done you can then easily walk the table by just holding
> the lock, which is what netfilter does today.

How? Do you want to hold the mutex while the user is reading? I don't
see how to achieve consistency with locking without introducing a ton
of complexity and attackable behaviour like user space being able to
delay inserts. If at all this only seems acceptable for privileged
users.

My vote is to rely on the proven INTR flag and optionally restart the
dump if user space did not indicate that INTR is supported as Patrick
suggested.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 10:15                           ` Thomas Graf
@ 2015-01-27 10:24                             ` Herbert Xu
  2015-01-27 11:16                               ` Thomas Graf
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 10:24 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On Tue, Jan 27, 2015 at 10:15:12AM +0000, Thomas Graf wrote:
>
> Because libraries like libnl already handle this transparently for
> various user space apps. Those apps are not even aware of the problem
> and always get a consistent dump if the kernel implements INTR.

Did you see Dave's message? He said those that dump need to create
their own data structure so there is no point in proceeding with
any of this work within rhashtable.
 
> Because making it synchronous does not help the Netlink dump problem
> at all, see below.
> 
> Also, a sync resize causes a random inserts/removal to take a lot
> longer. This leads to less predictable insert/removal times. I have
> no objection to offering both sync and async resizing as long as the
> choice of balance is configurable.

So what?  They can switch over to lockless mode if they care.

The point is that for netfilter as it stands it makes no sense
to go lockless because its modus operandi is to hold a single
lock over everything, including walking/dumping.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 10:24                             ` Herbert Xu
@ 2015-01-27 11:16                               ` Thomas Graf
  2015-01-27 11:23                                 ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-27 11:16 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On 01/27/15 at 09:24pm, Herbert Xu wrote:
> The point is that for netfilter as it stands it makes no sense
> to go lockless because its modus operandi is to hold a single
> lock over everything, including walking/dumping.

No objection. I have a patch prepared which allows the user to
provide ht->mutex himself so nfset can provide its own existing
mutex to rhashtable and lock out the resizes from inserts,
removals and dump iterations automatically That would restore the
old behaviour of the nfset API without major surgery.

I was just waiting on your walker as you requested a hold off.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 11:16                               ` Thomas Graf
@ 2015-01-27 11:23                                 ` Herbert Xu
  2015-01-27 11:40                                   ` Thomas Graf
  2015-01-27 13:09                                   ` Patrick McHardy
  0 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 11:23 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On Tue, Jan 27, 2015 at 11:16:04AM +0000, Thomas Graf wrote:
> 
> No objection. I have a patch prepared which allows the user to
> provide ht->mutex himself so nfset can provide its own existing
> mutex to rhashtable and lock out the resizes from inserts,
> removals and dump iterations automatically That would restore the
> old behaviour of the nfset API without major surgery.

If you take the mutex you might as well just make it synchronous.
There is zero difference.

Maybe you misunderstood my email.  I'm not making it synchronous
for everybody, just those that always take a lock on inserts/removals
and therefore don't need per-bucket locks.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 11:23                                 ` Herbert Xu
@ 2015-01-27 11:40                                   ` Thomas Graf
  2015-01-27 20:39                                     ` Herbert Xu
  2015-01-27 13:09                                   ` Patrick McHardy
  1 sibling, 1 reply; 79+ messages in thread
From: Thomas Graf @ 2015-01-27 11:40 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On 01/27/15 at 10:23pm, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 11:16:04AM +0000, Thomas Graf wrote:
> > 
> > No objection. I have a patch prepared which allows the user to
> > provide ht->mutex himself so nfset can provide its own existing
> > mutex to rhashtable and lock out the resizes from inserts,
> > removals and dump iterations automatically That would restore the
> > old behaviour of the nfset API without major surgery.
> 
> If you take the mutex you might as well just make it synchronous.
> There is zero difference.
> 
> Maybe you misunderstood my email.  I'm not making it synchronous
> for everybody, just those that always take a lock on inserts/removals
> and therefore don't need per-bucket locks.

Understood. No objection, happy to review patches. I initially did
keep sync/async separate and it lead to considerable code duplication
and complexity. I figured that if a user needs sync insert they could
provide their own locking. I missed to allow controlling the async
resize though. Again, feel free to give a shot, no objections.

This is unrelated to resize run control though, the reason is that
I'm converting tcp_hashinfo et al and they require a hybrid approach.
The tables may be too big to construct a parallel data structure, we
don't want to hold off inserts or deletes while the expensive dump
is underway. Even though we can't build a shadow structure while
locking everybody else out, we still want to provide a way to somehow
achieve consistent information. I think that NLM_F_INTR with fallback
to restarting the dump is a good option and very easy to implement. In
that case, we want to lock out resize from dumping iterations but
still allow parallel insert/delete.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 11:23                                 ` Herbert Xu
  2015-01-27 11:40                                   ` Thomas Graf
@ 2015-01-27 13:09                                   ` Patrick McHardy
  2015-01-27 20:36                                     ` Herbert Xu
  1 sibling, 1 reply; 79+ messages in thread
From: Patrick McHardy @ 2015-01-27 13:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, David.Laight, ying.xue, paulmck,
	netdev, netfilter-devel

On 27.01, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 11:16:04AM +0000, Thomas Graf wrote:
> > 
> > No objection. I have a patch prepared which allows the user to
> > provide ht->mutex himself so nfset can provide its own existing
> > mutex to rhashtable and lock out the resizes from inserts,
> > removals and dump iterations automatically That would restore the
> > old behaviour of the nfset API without major surgery.
> 
> If you take the mutex you might as well just make it synchronous.
> There is zero difference.
> 
> Maybe you misunderstood my email.  I'm not making it synchronous
> for everybody, just those that always take a lock on inserts/removals
> and therefore don't need per-bucket locks.

Actually I have a patchset queued that adds runtime additions and
removals, both active and timeout based. So netfilter won't have
pure synchronous behaviour anymore.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 13:09                                   ` Patrick McHardy
@ 2015-01-27 20:36                                     ` Herbert Xu
  2015-01-28 19:07                                       ` Patrick McHardy
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 20:36 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Thomas Graf, David Miller, David.Laight, ying.xue, paulmck,
	netdev, netfilter-devel

On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
>
> Actually I have a patchset queued that adds runtime additions and
> removals, both active and timeout based. So netfilter won't have
> pure synchronous behaviour anymore.

Can you show us the patchset?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 11:40                                   ` Thomas Graf
@ 2015-01-27 20:39                                     ` Herbert Xu
  2015-01-27 22:10                                       ` David Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 20:39 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, David.Laight, ying.xue, kaber, paulmck, netdev,
	netfilter-devel

On Tue, Jan 27, 2015 at 11:40:28AM +0000, Thomas Graf wrote:
> 
> This is unrelated to resize run control though, the reason is that
> I'm converting tcp_hashinfo et al and they require a hybrid approach.
> The tables may be too big to construct a parallel data structure, we
> don't want to hold off inserts or deletes while the expensive dump
> is underway. Even though we can't build a shadow structure while
> locking everybody else out, we still want to provide a way to somehow
> achieve consistent information. I think that NLM_F_INTR with fallback
> to restarting the dump is a good option and very easy to implement. In
> that case, we want to lock out resize from dumping iterations but
> still allow parallel insert/delete.

Well I guess Dave needs to make the call.  Do we want to allow
lockless walks over the hash table or not?

Personally I don't think a linked list is that big a deal.  But then
you guys were agonsing over a single pointer so who knows.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 20:39                                     ` Herbert Xu
@ 2015-01-27 22:10                                       ` David Miller
  2015-01-27 23:16                                         ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: David Miller @ 2015-01-27 22:10 UTC (permalink / raw)
  To: herbert
  Cc: tgraf, David.Laight, ying.xue, kaber, paulmck, netdev, netfilter-devel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jan 2015 07:39:24 +1100

> On Tue, Jan 27, 2015 at 11:40:28AM +0000, Thomas Graf wrote:
>> 
>> This is unrelated to resize run control though, the reason is that
>> I'm converting tcp_hashinfo et al and they require a hybrid approach.
>> The tables may be too big to construct a parallel data structure, we
>> don't want to hold off inserts or deletes while the expensive dump
>> is underway. Even though we can't build a shadow structure while
>> locking everybody else out, we still want to provide a way to somehow
>> achieve consistent information. I think that NLM_F_INTR with fallback
>> to restarting the dump is a good option and very easy to implement. In
>> that case, we want to lock out resize from dumping iterations but
>> still allow parallel insert/delete.
> 
> Well I guess Dave needs to make the call.  Do we want to allow
> lockless walks over the hash table or not?
> 
> Personally I don't think a linked list is that big a deal.  But then
> you guys were agonsing over a single pointer so who knows.

For netlink a linked list is no big deal, but for something like TCP
sockets it really is.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 22:10                                       ` David Miller
@ 2015-01-27 23:16                                         ` Herbert Xu
  0 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 23:16 UTC (permalink / raw)
  To: David Miller
  Cc: tgraf, David.Laight, ying.xue, kaber, paulmck, netdev, netfilter-devel

On Tue, Jan 27, 2015 at 02:10:18PM -0800, David Miller wrote:
>
> For netlink a linked list is no big deal, but for something like TCP
> sockets it really is.

OK then I'll resubmit my iterators.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink
  2015-01-25 23:20         ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  2015-01-25 23:21           ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
  2015-01-25 23:21           ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
@ 2015-01-27 23:19           ` Herbert Xu
  2015-01-27 23:20             ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
                               ` (2 more replies)
  2 siblings, 3 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 23:19 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On Mon, Jan 26, 2015 at 10:20:40AM +1100, Herbert Xu wrote:
> 
> Here are the first two patches, one to add the primitives and one
> to demonstrate its use in netlink.  In fact while testing this I
> found that the existing netlink walking code is totally broken.

So this is a repost of the same thing.  These lockless iterators
won't be usable by netfilter as it stands because it wants to do
nested walks and also isn't currently lockless.

However, I'll wait for Patrick to post his pending patches to see
what exactly he needs before implementing helpers for that case.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 23:19           ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
@ 2015-01-27 23:20             ` Herbert Xu
  2015-01-29 22:26               ` Thomas Graf
  2015-01-27 23:20             ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
  2015-01-29 22:42             ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
  2 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 23:20 UTC (permalink / raw)
  To: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

Some existing rhashtable users get too intimate with it by walking
the buckets directly.  This prevents us from easily changing the
internals of rhashtable.

This patch adds the helpers rhashtable_walk_init/next/end which
will replace these custom walkers.

They are meant to be usable for both procfs seq_file walks as well
as walking by a netlink dump.  The iterator structure should fit
inside a netlink dump cb structure, with at least one element to
spare.
  
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |   44 +++++++++++++++++++++
 lib/rhashtable.c           |   91 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 6d7e840..b03b375 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -18,6 +18,7 @@
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
+#include <linux/compiler.h>
 #include <linux/list_nulls.h>
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
@@ -123,6 +124,22 @@ struct rhashtable {
 	bool                            being_destroyed;
 };
 
+/**
+ * struct rhashtable_iter - Hash table iterator
+ * @ht: Table to iterate through
+ * @p: Current pointer
+ * @lock: Slot lock
+ * @slot: Current slot
+ * @skip: Number of entries to skip in slot
+ */
+struct rhashtable_iter {
+	struct rhashtable *ht;
+	struct rhash_head *p;
+	spinlock_t *lock;
+	unsigned int slot;
+	unsigned int skip;
+};
+
 static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
 {
 	return NULLS_MARKER(ht->p.nulls_base + hash);
@@ -178,6 +195,33 @@ bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
 				      bool (*compare)(void *, void *),
 				      void *arg);
 
+/**
+ * rhashtable_walk_init - Initialise an iterator
+ * @ht:		Table to walk over
+ * @iter:	Hash table Iterator
+ *
+ * This function prepares a hash table walk.
+ * Note that if you restart a walk after rhashtable_walk_stop you
+ * may see the same object twice.  Also, you may miss objects if
+ * there are removals in between rhashtable_walk_stop and the next
+ * call to rhashtable_walk_start.
+ *
+ * For a completely stable walk you should construct your own data
+ * structure outside the hash table.
+ */
+static inline void rhashtable_walk_init(struct rhashtable *ht,
+					struct rhashtable_iter *iter)
+{
+	iter->ht = ht;
+	iter->p = NULL;
+	iter->slot = 0;
+	iter->skip = 0;
+}
+
+int rhashtable_walk_start(struct rhashtable_iter *iter) __acquires(RCU);
+void *rhashtable_walk_next(struct rhashtable_iter *iter);
+void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
+
 void rhashtable_destroy(struct rhashtable *ht);
 
 #define rht_dereference(p, ht) \
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 71c6aa1..d51fb06 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -813,6 +813,97 @@ exit:
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare_insert);
 
+/**
+ * rhashtable_walk_start - Start a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Start a hash table walk.
+ *
+ * Returns zero if successful.  Returns -EINTR if we couldn't
+ * obtain the resize lock.
+ */
+int rhashtable_walk_start(struct rhashtable_iter *iter)
+{
+	struct rhashtable *ht = iter->ht;
+	int err;
+
+	err = mutex_lock_interruptible(&ht->mutex);
+	rcu_read_lock();
+
+	if (!err)
+		mutex_unlock(&ht->mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_start);
+
+/**
+ * rhashtable_walk_next - Return the next object and advance the iterator
+ * @iter:	Hash table iterator
+ *
+ * Note that you must call rhashtable_walk_stop when you are finished
+ * with the walk.
+ *
+ * Returns the next object or NULL when the end of the table is reached.
+ */
+void *rhashtable_walk_next(struct rhashtable_iter *iter)
+{
+	const struct bucket_table *tbl;
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+
+	if (p) {
+		p = rht_dereference_bucket(p->next, tbl, iter->slot);
+		goto next;
+	}
+
+	for (; iter->slot < tbl->size; iter->slot++) {
+		int skip = iter->skip;
+
+		iter->lock = bucket_lock(tbl, iter->slot);
+		spin_lock_bh(iter->lock);
+
+		rht_for_each(p, tbl, iter->slot) {
+			if (!skip)
+				break;
+			skip--;
+		}
+
+next:
+		if (!rht_is_a_nulls(p)) {
+			iter->skip++;
+			iter->p = p;
+			return rht_obj(ht, p);
+		}
+		spin_unlock_bh(iter->lock);
+
+		iter->skip = 0;
+	}
+
+	iter->p = NULL;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_next);
+
+/**
+ * rhashtable_walk_stop - Finish a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Finish a hash table walk.
+ */
+void rhashtable_walk_stop(struct rhashtable_iter *iter)
+{
+	if (iter->p)
+		spin_unlock_bh(iter->lock);
+
+	rcu_read_unlock();
+
+	iter->p = NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
+
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 2/2] netlink: Use rhashtable walk iterator
  2015-01-27 23:19           ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  2015-01-27 23:20             ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
@ 2015-01-27 23:20             ` Herbert Xu
  2015-01-29 22:27               ` Thomas Graf
  2015-01-29 22:42             ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
  2 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-27 23:20 UTC (permalink / raw)
  To: Thomas Graf, Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

This patch gets rid of the manual rhashtable walk in netlink
which touches rhashtable internals that should not be exposed.
It does so by using the rhashtable iterator primitives.

In fact the existing code was very buggy.  Some sockets weren't
shown at all while others were shown more than once.
 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |  113 +++++++++++++++++------------------------------
 1 file changed, 41 insertions(+), 72 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d77b346..6f8549e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2884,99 +2884,68 @@ EXPORT_SYMBOL(nlmsg_notify);
 #ifdef CONFIG_PROC_FS
 struct nl_seq_iter {
 	struct seq_net_private p;
+	struct rhashtable_iter hti;
 	int link;
-	int hash_idx;
 };
 
-static struct sock *netlink_seq_socket_idx(struct seq_file *seq, loff_t pos)
+static int netlink_walk_start(struct nl_seq_iter *iter)
 {
-	struct nl_seq_iter *iter = seq->private;
-	int i, j;
-	struct netlink_sock *nlk;
-	struct sock *s;
-	loff_t off = 0;
-
-	for (i = 0; i < MAX_LINKS; i++) {
-		struct rhashtable *ht = &nl_table[i].hash;
-		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
-
-		for (j = 0; j < tbl->size; j++) {
-			struct rhash_head *node;
-
-			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
-				s = (struct sock *)nlk;
-
-				if (sock_net(s) != seq_file_net(seq))
-					continue;
-				if (off == pos) {
-					iter->link = i;
-					iter->hash_idx = j;
-					return s;
-				}
-				++off;
-			}
-		}
-	}
-	return NULL;
-}
-
-static void *netlink_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
-{
-	rcu_read_lock();
-	return *pos ? netlink_seq_socket_idx(seq, *pos - 1) : SEQ_START_TOKEN;
+	rhashtable_walk_init(&nl_table[iter->link].hash, &iter->hti);
+	return rhashtable_walk_start(&iter->hti);
 }
 
-static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+static void *__netlink_seq_next(struct seq_file *seq)
 {
-	struct rhashtable *ht;
-	const struct bucket_table *tbl;
-	struct rhash_head *node;
+	struct nl_seq_iter *iter = seq->private;
 	struct netlink_sock *nlk;
-	struct nl_seq_iter *iter;
-	struct net *net;
-	int i, j;
 
-	++*pos;
+	do {
+		while (!(nlk = rhashtable_walk_next(&iter->hti))) {
+			int err;
 
-	if (v == SEQ_START_TOKEN)
-		return netlink_seq_socket_idx(seq, 0);
+			rhashtable_walk_stop(&iter->hti);
+			if (++iter->link >= MAX_LINKS)
+				return NULL;
 
-	net = seq_file_net(seq);
-	iter = seq->private;
-	nlk = v;
+			err = netlink_walk_start(iter);
+			if (err)
+				return ERR_PTR(err);
+		}
+	} while (sock_net(&nlk->sk) != seq_file_net(seq));
 
-	i = iter->link;
-	ht = &nl_table[i].hash;
-	tbl = rht_dereference_rcu(ht->tbl, ht);
-	rht_for_each_entry_rcu_continue(nlk, node, nlk->node.next, tbl, iter->hash_idx, node)
-		if (net_eq(sock_net((struct sock *)nlk), net))
-			return nlk;
+	return nlk;
+}
 
-	j = iter->hash_idx + 1;
+static void *netlink_seq_start(struct seq_file *seq, loff_t *posp)
+{
+	struct nl_seq_iter *iter = seq->private;
+	void *obj = SEQ_START_TOKEN;
+	loff_t pos;
+	int err;
 
-	do {
+	iter->link = 0;
+	err = netlink_walk_start(iter);
+	if (err)
+		return ERR_PTR(err);
 
-		for (; j < tbl->size; j++) {
-			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
-				if (net_eq(sock_net((struct sock *)nlk), net)) {
-					iter->link = i;
-					iter->hash_idx = j;
-					return nlk;
-				}
-			}
-		}
+	for (pos = *posp; pos && obj; pos--)
+		obj = __netlink_seq_next(seq);
 
-		j = 0;
-	} while (++i < MAX_LINKS);
+	return obj;
+}
 
-	return NULL;
+static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return __netlink_seq_next(seq);
 }
 
 static void netlink_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
 {
-	rcu_read_unlock();
+	struct nl_seq_iter *iter = seq->private;
+
+	if (iter->link < MAX_LINKS)
+		rhashtable_walk_stop(&iter->hti);
 }
 
 

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 20:36                                     ` Herbert Xu
@ 2015-01-28 19:07                                       ` Patrick McHardy
  2015-01-30  5:58                                         ` Herbert Xu
  0 siblings, 1 reply; 79+ messages in thread
From: Patrick McHardy @ 2015-01-28 19:07 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, David.Laight, ying.xue, paulmck,
	netdev, netfilter-devel

On 28.01, Herbert Xu wrote:
> On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
> >
> > Actually I have a patchset queued that adds runtime additions and
> > removals, both active and timeout based. So netfilter won't have
> > pure synchronous behaviour anymore.
> 
> Can you show us the patchset?

Sure, will send it over tomorrow.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-27 23:20             ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
@ 2015-01-29 22:26               ` Thomas Graf
  0 siblings, 0 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-29 22:26 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On 01/28/15 at 10:20am, Herbert Xu wrote:
> Some existing rhashtable users get too intimate with it by walking
> the buckets directly.  This prevents us from easily changing the
> internals of rhashtable.
> 
> This patch adds the helpers rhashtable_walk_init/next/end which
> will replace these custom walkers.
> 
> They are meant to be usable for both procfs seq_file walks as well
> as walking by a netlink dump.  The iterator structure should fit
> inside a netlink dump cb structure, with at least one element to
> spare.
>   
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 2/2] netlink: Use rhashtable walk iterator
  2015-01-27 23:20             ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
@ 2015-01-29 22:27               ` Thomas Graf
  0 siblings, 0 replies; 79+ messages in thread
From: Thomas Graf @ 2015-01-29 22:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Ying Xue, davem, kaber, paulmck, netdev, netfilter-devel

On 01/28/15 at 10:20am, Herbert Xu wrote:
> This patch gets rid of the manual rhashtable walk in netlink
> which touches rhashtable internals that should not be exposed.
> It does so by using the rhashtable iterator primitives.
> 
> In fact the existing code was very buggy.  Some sockets weren't
> shown at all while others were shown more than once.
>  
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink
  2015-01-27 23:19           ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
  2015-01-27 23:20             ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
  2015-01-27 23:20             ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
@ 2015-01-29 22:42             ` David Miller
  2015-01-31  3:13               ` Herbert Xu
  2 siblings, 1 reply; 79+ messages in thread
From: David Miller @ 2015-01-29 22:42 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jan 2015 10:19:50 +1100

> On Mon, Jan 26, 2015 at 10:20:40AM +1100, Herbert Xu wrote:
>> 
>> Here are the first two patches, one to add the primitives and one
>> to demonstrate its use in netlink.  In fact while testing this I
>> found that the existing netlink walking code is totally broken.
> 
> So this is a repost of the same thing.  These lockless iterators
> won't be usable by netfilter as it stands because it wants to do
> nested walks and also isn't currently lockless.
> 
> However, I'll wait for Patrick to post his pending patches to see
> what exactly he needs before implementing helpers for that case.

I'm holding off on this series for now.

Let me know if you want me to do something different.

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-28 19:07                                       ` Patrick McHardy
@ 2015-01-30  5:58                                         ` Herbert Xu
  2015-01-30  8:10                                           ` Patrick McHardy
  0 siblings, 1 reply; 79+ messages in thread
From: Herbert Xu @ 2015-01-30  5:58 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Thomas Graf, David Miller, David.Laight, ying.xue, paulmck,
	netdev, netfilter-devel

On Wed, Jan 28, 2015 at 07:07:16PM +0000, Patrick McHardy wrote:
> On 28.01, Herbert Xu wrote:
> > On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
> > >
> > > Actually I have a patchset queued that adds runtime additions and
> > > removals, both active and timeout based. So netfilter won't have
> > > pure synchronous behaviour anymore.
> > 
> > Can you show us the patchset?
> 
> Sure, will send it over tomorrow.

So how are you handling the locking in these cases?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-30  5:58                                         ` Herbert Xu
@ 2015-01-30  8:10                                           ` Patrick McHardy
  0 siblings, 0 replies; 79+ messages in thread
From: Patrick McHardy @ 2015-01-30  8:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, David.Laight, ying.xue, paulmck,
	netdev, netfilter-devel

On 30.01, Herbert Xu wrote:
> On Wed, Jan 28, 2015 at 07:07:16PM +0000, Patrick McHardy wrote:
> > On 28.01, Herbert Xu wrote:
> > > On Tue, Jan 27, 2015 at 01:09:50PM +0000, Patrick McHardy wrote:
> > > >
> > > > Actually I have a patchset queued that adds runtime additions and
> > > > removals, both active and timeout based. So netfilter won't have
> > > > pure synchronous behaviour anymore.
> > > 
> > > Can you show us the patchset?
> > 
> > Sure, will send it over tomorrow.
> 
> So how are you handling the locking in these cases?

Insertion is handled using the rhashtable internal locks. GC happens
in work context and takes the rhashtable mutex, then walks the table
similar to nft_hash_walk and removes stale entries.

Still need to work out something to reduce RCU grace periods
(currently ignored) and make sure we don't race between ->get/->remove
in the netlink API and GC.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink
  2015-01-29 22:42             ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
@ 2015-01-31  3:13               ` Herbert Xu
  2015-01-31  3:14                 ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
                                   ` (3 more replies)
  0 siblings, 4 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-31  3:13 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

On Thu, Jan 29, 2015 at 02:42:46PM -0800, David Miller wrote:
> 
> I'm holding off on this series for now.
> 
> Let me know if you want me to do something different.

No problems.  Here is a new version of these two patches which
hopefully should work on netfilter as well.

Note that the major functional difference compared to the previous
one is that resizes are now handled properly and will cause the
iterator to restart from scratch.

However, as with all existing hash table walker implementations,
if you remove an element from the chain that we're walking over
while we're waiting for user-space to give us a new buffer, then
the walk may miss elements that are still on that chain.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 1/2] rhashtable: Introduce rhashtable_walk_*
  2015-01-31  3:13               ` Herbert Xu
@ 2015-01-31  3:14                 ` Herbert Xu
  2015-01-31  3:14                 ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-31  3:14 UTC (permalink / raw)
  To: David Miller, tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

Some existing rhashtable users get too intimate with it by walking
the buckets directly.  This prevents us from easily changing the
internals of rhashtable.

This patch adds the helpers rhashtable_walk_init/exit/start/next/stop
which will replace these custom walkers.

They are meant to be usable for both procfs seq_file walks as well
as walking by a netlink dump.  The iterator structure should fit
inside a netlink dump cb structure, with at least one element to
spare.
  
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/rhashtable.h |   35 +++++++++
 lib/rhashtable.c           |  162 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 197 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 6d7e840..a0e5b08 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -18,6 +18,7 @@
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
+#include <linux/compiler.h>
 #include <linux/list_nulls.h>
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
@@ -110,6 +111,7 @@ struct rhashtable_params {
  * @p: Configuration parameters
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
+ * @walkers: List of active walkers
  * @being_destroyed: True if table is set up for destruction
  */
 struct rhashtable {
@@ -120,9 +122,36 @@ struct rhashtable {
 	struct rhashtable_params	p;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
+	struct list_head		walkers;
 	bool                            being_destroyed;
 };
 
+/**
+ * struct rhashtable_walker - Hash table walker
+ * @list: List entry on list of walkers
+ * @resize: Resize event occured
+ */
+struct rhashtable_walker {
+	struct list_head list;
+	bool resize;
+};
+
+/**
+ * struct rhashtable_iter - Hash table iterator, fits into netlink cb
+ * @ht: Table to iterate through
+ * @p: Current pointer
+ * @walker: Associated rhashtable walker
+ * @slot: Current slot
+ * @skip: Number of entries to skip in slot
+ */
+struct rhashtable_iter {
+	struct rhashtable *ht;
+	struct rhash_head *p;
+	struct rhashtable_walker *walker;
+	unsigned int slot;
+	unsigned int skip;
+};
+
 static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
 {
 	return NULLS_MARKER(ht->p.nulls_base + hash);
@@ -178,6 +207,12 @@ bool rhashtable_lookup_compare_insert(struct rhashtable *ht,
 				      bool (*compare)(void *, void *),
 				      void *arg);
 
+int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter);
+void rhashtable_walk_exit(struct rhashtable_iter *iter);
+int rhashtable_walk_start(struct rhashtable_iter *iter) __acquires(RCU);
+void *rhashtable_walk_next(struct rhashtable_iter *iter);
+void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
+
 void rhashtable_destroy(struct rhashtable *ht);
 
 #define rht_dereference(p, ht) \
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 71c6aa1..69a4eb0 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -485,11 +485,15 @@ static void rht_deferred_worker(struct work_struct *work)
 {
 	struct rhashtable *ht;
 	struct bucket_table *tbl;
+	struct rhashtable_walker *walker;
 
 	ht = container_of(work, struct rhashtable, run_work);
 	mutex_lock(&ht->mutex);
 	tbl = rht_dereference(ht->tbl, ht);
 
+	list_for_each_entry(walker, &ht->walkers, list)
+		walker->resize = true;
+
 	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
 		rhashtable_expand(ht);
 	else if (ht->p.shrink_decision && ht->p.shrink_decision(ht, tbl->size))
@@ -813,6 +817,163 @@ exit:
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare_insert);
 
+/**
+ * rhashtable_walk_init - Initialise an iterator
+ * @ht:		Table to walk over
+ * @iter:	Hash table Iterator
+ *
+ * This function prepares a hash table walk.
+ *
+ * Note that if you restart a walk after rhashtable_walk_stop you
+ * may see the same object twice.  Also, you may miss objects if
+ * there are removals in between rhashtable_walk_stop and the next
+ * call to rhashtable_walk_start.
+ *
+ * For a completely stable walk you should construct your own data
+ * structure outside the hash table.
+ *
+ * This function may sleep so you must not call it from interrupt
+ * context or with spin locks held.
+ *
+ * You must call rhashtable_walk_exit if this function returns
+ * successfully.
+ */
+int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter)
+{
+	iter->ht = ht;
+	iter->p = NULL;
+	iter->slot = 0;
+	iter->skip = 0;
+
+	iter->walker = kmalloc(sizeof(*iter->walker), GFP_KERNEL);
+	if (!iter->walker)
+		return -ENOMEM;
+
+	mutex_lock(&ht->mutex);
+	list_add(&iter->walker->list, &ht->walkers);
+	mutex_unlock(&ht->mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_init);
+
+/**
+ * rhashtable_walk_exit - Free an iterator
+ * @iter:	Hash table Iterator
+ *
+ * This function frees resources allocated by rhashtable_walk_init.
+ */
+void rhashtable_walk_exit(struct rhashtable_iter *iter)
+{
+	mutex_lock(&iter->ht->mutex);
+	list_del(&iter->walker->list);
+	mutex_unlock(&iter->ht->mutex);
+	kfree(iter->walker);
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_exit);
+
+/**
+ * rhashtable_walk_start - Start a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Start a hash table walk.  Note that we take the RCU lock in all
+ * cases including when we return an error.  So you must always call
+ * rhashtable_walk_stop to clean up.
+ *
+ * Returns zero if successful.
+ *
+ * Returns -EAGAIN if resize event occured.  Note that the iterator
+ * will rewind back to the beginning and you may use it immediately
+ * by calling rhashtable_walk_next.
+ */
+int rhashtable_walk_start(struct rhashtable_iter *iter)
+{
+	rcu_read_lock();
+
+	if (iter->walker->resize) {
+		iter->slot = 0;
+		iter->skip = 0;
+		iter->walker->resize = false;
+		return -EAGAIN;
+	}
+
+	return 0;
+}
+
+/**
+ * rhashtable_walk_next - Return the next object and advance the iterator
+ * @iter:	Hash table iterator
+ *
+ * Note that you must call rhashtable_walk_stop when you are finished
+ * with the walk.
+ *
+ * Returns the next object or NULL when the end of the table is reached.
+ *
+ * Returns -EAGAIN if resize event occured.  Note that the iterator
+ * will rewind back to the beginning and you may continue to use it.
+ */
+void *rhashtable_walk_next(struct rhashtable_iter *iter)
+{
+	const struct bucket_table *tbl;
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+	void *obj = NULL;
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+
+	if (p) {
+		p = rht_dereference_bucket_rcu(p->next, tbl, iter->slot);
+		goto next;
+	}
+
+	for (; iter->slot < tbl->size; iter->slot++) {
+		int skip = iter->skip;
+
+		rht_for_each_rcu(p, tbl, iter->slot) {
+			if (!skip)
+				break;
+			skip--;
+		}
+
+next:
+		if (!rht_is_a_nulls(p)) {
+			iter->skip++;
+			iter->p = p;
+			obj = rht_obj(ht, p);
+			goto out;
+		}
+
+		iter->skip = 0;
+	}
+
+	iter->p = NULL;
+
+out:
+	if (iter->walker->resize) {
+		iter->p = NULL;
+		iter->slot = 0;
+		iter->skip = 0;
+		iter->walker->resize = false;
+		return ERR_PTR(-EAGAIN);
+	}
+
+	return obj;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_next);
+
+/**
+ * rhashtable_walk_stop - Finish a hash table walk
+ * @iter:	Hash table iterator
+ *
+ * Finish a hash table walk.
+ */
+void rhashtable_walk_stop(struct rhashtable_iter *iter)
+{
+	rcu_read_unlock();
+	iter->p = NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_stop);
+
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
@@ -885,6 +1046,7 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	memset(ht, 0, sizeof(*ht));
 	mutex_init(&ht->mutex);
 	memcpy(&ht->p, params, sizeof(*params));
+	INIT_LIST_HEAD(&ht->walkers);
 
 	if (params->locks_mul)
 		ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [PATCH 2/2] netlink: Use rhashtable walk iterator
  2015-01-31  3:13               ` Herbert Xu
  2015-01-31  3:14                 ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
@ 2015-01-31  3:14                 ` Herbert Xu
  2015-01-31  4:31                 ` netfilter: " Herbert Xu
  2015-02-03  3:19                 ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
  3 siblings, 0 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-31  3:14 UTC (permalink / raw)
  To: David Miller, tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

This patch gets rid of the manual rhashtable walk in netlink
which touches rhashtable internals that should not be exposed.
It does so by using the rhashtable iterator primitives.

In fact the existing code was very buggy.  Some sockets weren't
shown at all while others were shown more than once.
 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/netlink/af_netlink.c |  130 +++++++++++++++++++++++------------------------
 1 file changed, 64 insertions(+), 66 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index d77b346..2c7a95a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2884,99 +2884,97 @@ EXPORT_SYMBOL(nlmsg_notify);
 #ifdef CONFIG_PROC_FS
 struct nl_seq_iter {
 	struct seq_net_private p;
+	struct rhashtable_iter hti;
 	int link;
-	int hash_idx;
 };
 
-static struct sock *netlink_seq_socket_idx(struct seq_file *seq, loff_t pos)
+static int netlink_walk_start(struct nl_seq_iter *iter)
 {
-	struct nl_seq_iter *iter = seq->private;
-	int i, j;
-	struct netlink_sock *nlk;
-	struct sock *s;
-	loff_t off = 0;
-
-	for (i = 0; i < MAX_LINKS; i++) {
-		struct rhashtable *ht = &nl_table[i].hash;
-		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
-
-		for (j = 0; j < tbl->size; j++) {
-			struct rhash_head *node;
-
-			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
-				s = (struct sock *)nlk;
+	int err;
 
-				if (sock_net(s) != seq_file_net(seq))
-					continue;
-				if (off == pos) {
-					iter->link = i;
-					iter->hash_idx = j;
-					return s;
-				}
-				++off;
-			}
-		}
+	err = rhashtable_walk_init(&nl_table[iter->link].hash, &iter->hti);
+	if (err) {
+		iter->link = MAX_LINKS;
+		return err;
 	}
-	return NULL;
+
+	err = rhashtable_walk_start(&iter->hti);
+	return err == -EAGAIN ? 0 : err;
 }
 
-static void *netlink_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
+static void netlink_walk_stop(struct nl_seq_iter *iter)
 {
-	rcu_read_lock();
-	return *pos ? netlink_seq_socket_idx(seq, *pos - 1) : SEQ_START_TOKEN;
+	rhashtable_walk_stop(&iter->hti);
+	rhashtable_walk_exit(&iter->hti);
 }
 
-static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+static void *__netlink_seq_next(struct seq_file *seq)
 {
-	struct rhashtable *ht;
-	const struct bucket_table *tbl;
-	struct rhash_head *node;
+	struct nl_seq_iter *iter = seq->private;
 	struct netlink_sock *nlk;
-	struct nl_seq_iter *iter;
-	struct net *net;
-	int i, j;
 
-	++*pos;
+	do {
+		for (;;) {
+			int err;
 
-	if (v == SEQ_START_TOKEN)
-		return netlink_seq_socket_idx(seq, 0);
+			nlk = rhashtable_walk_next(&iter->hti);
 
-	net = seq_file_net(seq);
-	iter = seq->private;
-	nlk = v;
+			if (IS_ERR(nlk)) {
+				if (PTR_ERR(nlk) == -EAGAIN)
+					continue;
 
-	i = iter->link;
-	ht = &nl_table[i].hash;
-	tbl = rht_dereference_rcu(ht->tbl, ht);
-	rht_for_each_entry_rcu_continue(nlk, node, nlk->node.next, tbl, iter->hash_idx, node)
-		if (net_eq(sock_net((struct sock *)nlk), net))
-			return nlk;
+				return nlk;
+			}
 
-	j = iter->hash_idx + 1;
+			if (nlk)
+				break;
 
-	do {
+			netlink_walk_stop(iter);
+			if (++iter->link >= MAX_LINKS)
+				return NULL;
 
-		for (; j < tbl->size; j++) {
-			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
-				if (net_eq(sock_net((struct sock *)nlk), net)) {
-					iter->link = i;
-					iter->hash_idx = j;
-					return nlk;
-				}
-			}
+			err = netlink_walk_start(iter);
+			if (err)
+				return ERR_PTR(err);
 		}
+	} while (sock_net(&nlk->sk) != seq_file_net(seq));
 
-		j = 0;
-	} while (++i < MAX_LINKS);
+	return nlk;
+}
 
-	return NULL;
+static void *netlink_seq_start(struct seq_file *seq, loff_t *posp)
+{
+	struct nl_seq_iter *iter = seq->private;
+	void *obj = SEQ_START_TOKEN;
+	loff_t pos;
+	int err;
+
+	iter->link = 0;
+
+	err = netlink_walk_start(iter);
+	if (err)
+		return ERR_PTR(err);
+
+	for (pos = *posp; pos && obj && !IS_ERR(obj); pos--)
+		obj = __netlink_seq_next(seq);
+
+	return obj;
+}
+
+static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return __netlink_seq_next(seq);
 }
 
 static void netlink_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
 {
-	rcu_read_unlock();
+	struct nl_seq_iter *iter = seq->private;
+
+	if (iter->link >= MAX_LINKS)
+		return;
+
+	netlink_walk_stop(iter);
 }
 
 

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* netfilter: Use rhashtable walk iterator
  2015-01-31  3:13               ` Herbert Xu
  2015-01-31  3:14                 ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
  2015-01-31  3:14                 ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
@ 2015-01-31  4:31                 ` Herbert Xu
  2015-02-01  7:45                   ` Patrick McHardy
  2015-02-03  3:19                   ` David Miller
  2015-02-03  3:19                 ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
  3 siblings, 2 replies; 79+ messages in thread
From: Herbert Xu @ 2015-01-31  4:31 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

On Sat, Jan 31, 2015 at 02:13:56PM +1100, Herbert Xu wrote:
> 
> No problems.  Here is a new version of these two patches which
> hopefully should work on netfilter as well.

And here is the patch for netfilter.

-- >8 --
This patch gets rid of the manual rhashtable walk in nft_hash
which touches rhashtable internals that should not be exposed.
It does so by using the rhashtable iterator primitives.
    
Note that I'm leaving nft_hash_destroy alone since it's only
invoked on shutdown and it shouldn't be affected by changes
to rhashtable internals (or at least not what I'm planning to
change).
    
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 75887d7..61e6c40 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -130,31 +130,50 @@ static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,
 			  struct nft_set_iter *iter)
 {
 	struct rhashtable *priv = nft_set_priv(set);
-	const struct bucket_table *tbl;
 	const struct nft_hash_elem *he;
+	struct rhashtable_iter hti;
 	struct nft_set_elem elem;
-	unsigned int i;
+	int err;
 
-	tbl = rht_dereference_rcu(priv->tbl, priv);
-	for (i = 0; i < tbl->size; i++) {
-		struct rhash_head *pos;
+	err = rhashtable_walk_init(priv, &hti);
+	iter->err = err;
+	if (err)
+		return;
+
+	err = rhashtable_walk_start(&hti);
+	if (err && err != -EAGAIN) {
+		iter->err = err;
+		goto out;
+	}
 
-		rht_for_each_entry_rcu(he, pos, tbl, i, node) {
-			if (iter->count < iter->skip)
-				goto cont;
+	while ((he = rhashtable_walk_next(&hti))) {
+		if (IS_ERR(he)) {
+			err = PTR_ERR(he);
+			if (err != -EAGAIN) {
+				iter->err = err;
+				goto out;
+			}
+		}
+
+		if (iter->count < iter->skip)
+			goto cont;
+
+		memcpy(&elem.key, &he->key, sizeof(elem.key));
+		if (set->flags & NFT_SET_MAP)
+			memcpy(&elem.data, he->data, sizeof(elem.data));
+		elem.flags = 0;
 
-			memcpy(&elem.key, &he->key, sizeof(elem.key));
-			if (set->flags & NFT_SET_MAP)
-				memcpy(&elem.data, he->data, sizeof(elem.data));
-			elem.flags = 0;
+		iter->err = iter->fn(ctx, set, iter, &elem);
+		if (iter->err < 0)
+			goto out;
 
-			iter->err = iter->fn(ctx, set, iter, &elem);
-			if (iter->err < 0)
-				return;
 cont:
-			iter->count++;
-		}
+		iter->count++;
 	}
+
+out:
+	rhashtable_walk_stop(&hti);
+	rhashtable_walk_exit(&hti);
 }
 
 static unsigned int nft_hash_privsize(const struct nlattr * const nla[])

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: netfilter: Use rhashtable walk iterator
  2015-01-31  4:31                 ` netfilter: " Herbert Xu
@ 2015-02-01  7:45                   ` Patrick McHardy
  2015-02-03  3:19                   ` David Miller
  1 sibling, 0 replies; 79+ messages in thread
From: Patrick McHardy @ 2015-02-01  7:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, tgraf, ying.xue, paulmck, netdev, netfilter-devel

On 31.01, Herbert Xu wrote:
> On Sat, Jan 31, 2015 at 02:13:56PM +1100, Herbert Xu wrote:
> > 
> > No problems.  Here is a new version of these two patches which
> > hopefully should work on netfilter as well.
> 
> And here is the patch for netfilter.
> 
> -- >8 --
> This patch gets rid of the manual rhashtable walk in nft_hash
> which touches rhashtable internals that should not be exposed.
> It does so by using the rhashtable iterator primitives.
>     
> Note that I'm leaving nft_hash_destroy alone since it's only
> invoked on shutdown and it shouldn't be affected by changes
> to rhashtable internals (or at least not what I'm planning to
> change).
>     
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Patrick McHardy <kaber@trash.net>

Thanks Herbert!

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink
  2015-01-31  3:13               ` Herbert Xu
                                   ` (2 preceding siblings ...)
  2015-01-31  4:31                 ` netfilter: " Herbert Xu
@ 2015-02-03  3:19                 ` David Miller
  3 siblings, 0 replies; 79+ messages in thread
From: David Miller @ 2015-02-03  3:19 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 31 Jan 2015 14:13:56 +1100

> On Thu, Jan 29, 2015 at 02:42:46PM -0800, David Miller wrote:
>> 
>> I'm holding off on this series for now.
>> 
>> Let me know if you want me to do something different.
> 
> No problems.  Here is a new version of these two patches which
> hopefully should work on netfilter as well.
> 
> Note that the major functional difference compared to the previous
> one is that resizes are now handled properly and will cause the
> iterator to restart from scratch.
> 
> However, as with all existing hash table walker implementations,
> if you remove an element from the chain that we're walking over
> while we're waiting for user-space to give us a new buffer, then
> the walk may miss elements that are still on that chain.

Series applied to net-next.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: netfilter: Use rhashtable walk iterator
  2015-01-31  4:31                 ` netfilter: " Herbert Xu
  2015-02-01  7:45                   ` Patrick McHardy
@ 2015-02-03  3:19                   ` David Miller
  1 sibling, 0 replies; 79+ messages in thread
From: David Miller @ 2015-02-03  3:19 UTC (permalink / raw)
  To: herbert; +Cc: tgraf, ying.xue, kaber, paulmck, netdev, netfilter-devel

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 31 Jan 2015 15:31:51 +1100

> On Sat, Jan 31, 2015 at 02:13:56PM +1100, Herbert Xu wrote:
>> 
>> No problems.  Here is a new version of these two patches which
>> hopefully should work on netfilter as well.
> 
> And here is the patch for netfilter.
> 
> -- >8 --
> This patch gets rid of the manual rhashtable walk in nft_hash
> which touches rhashtable internals that should not be exposed.
> It does so by using the rhashtable iterator primitives.
>     
> Note that I'm leaving nft_hash_destroy alone since it's only
> invoked on shutdown and it shouldn't be affected by changes
> to rhashtable internals (or at least not what I'm planning to
> change).
>     
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied to net-next.

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2015-02-03  3:19 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-20 13:20 [PATCH 0/3 net-next] rhashtable: Notify on resize to allow signaling interrupted dumps Thomas Graf
2015-01-20 13:20 ` [PATCH 1/3] rhashtable: Provide notifier for deferred resizes Thomas Graf
2015-01-20 13:20 ` [PATCH 2/3] netlink: Mark dumps as inconsistent which have been interrupted by a resize Thomas Graf
2015-01-21  8:13   ` Ying Xue
2015-01-21 12:17     ` Thomas Graf
2015-01-22  8:49       ` Herbert Xu
2015-01-22  8:56         ` Patrick McHardy
2015-01-22  9:22           ` Herbert Xu
2015-01-22 10:07             ` Patrick McHardy
2015-01-25 23:20         ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
2015-01-25 23:21           ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
2015-01-26  8:20             ` Thomas Graf
2015-01-26 22:21               ` Herbert Xu
2015-01-26 10:09             ` David Laight
2015-01-26 22:23               ` Herbert Xu
2015-01-26 22:36                 ` David Miller
2015-01-26 22:42                   ` Herbert Xu
2015-01-26 23:31                     ` Herbert Xu
2015-01-27  9:45                       ` Thomas Graf
2015-01-27  9:54                         ` Herbert Xu
2015-01-27 10:15                           ` Thomas Graf
2015-01-27 10:24                             ` Herbert Xu
2015-01-27 11:16                               ` Thomas Graf
2015-01-27 11:23                                 ` Herbert Xu
2015-01-27 11:40                                   ` Thomas Graf
2015-01-27 20:39                                     ` Herbert Xu
2015-01-27 22:10                                       ` David Miller
2015-01-27 23:16                                         ` Herbert Xu
2015-01-27 13:09                                   ` Patrick McHardy
2015-01-27 20:36                                     ` Herbert Xu
2015-01-28 19:07                                       ` Patrick McHardy
2015-01-30  5:58                                         ` Herbert Xu
2015-01-30  8:10                                           ` Patrick McHardy
2015-01-27 10:09                 ` David Laight
2015-01-27 10:12                   ` Herbert Xu
2015-01-25 23:21           ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
2015-01-27 23:19           ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink Herbert Xu
2015-01-27 23:20             ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
2015-01-29 22:26               ` Thomas Graf
2015-01-27 23:20             ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
2015-01-29 22:27               ` Thomas Graf
2015-01-29 22:42             ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
2015-01-31  3:13               ` Herbert Xu
2015-01-31  3:14                 ` [PATCH 1/2] rhashtable: Introduce rhashtable_walk_* Herbert Xu
2015-01-31  3:14                 ` [PATCH 2/2] netlink: Use rhashtable walk iterator Herbert Xu
2015-01-31  4:31                 ` netfilter: " Herbert Xu
2015-02-01  7:45                   ` Patrick McHardy
2015-02-03  3:19                   ` David Miller
2015-02-03  3:19                 ` [PATCH 0/2] rhashtable: Add walk iterator primitives and use them in netlink David Miller
2015-01-20 13:20 ` [PATCH 3/3] netlink: Lock out table resizes while dumping Netlink sockets Thomas Graf
2015-01-20 14:31   ` Patrick McHardy
2015-01-20 14:55     ` Thomas Graf
2015-01-20 15:21       ` Patrick McHardy
2015-01-20 15:35         ` Thomas Graf
2015-01-21  5:08           ` Herbert Xu
2015-01-21  5:15             ` Herbert Xu
2015-01-21  9:14               ` Herbert Xu
2015-01-21  9:56                 ` Thomas Graf
2015-01-21  9:59                   ` Herbert Xu
2015-01-21 10:00                   ` Patrick McHardy
2015-01-21  9:37             ` Thomas Graf
2015-01-21  9:38               ` Herbert Xu
2015-01-21  9:49                 ` Thomas Graf
2015-01-21  9:58                   ` Herbert Xu
2015-01-21 10:23                     ` Thomas Graf
2015-01-22  6:35                       ` Herbert Xu
2015-01-22  7:20                         ` Herbert Xu
2015-01-22  9:05                           ` Thomas Graf
2015-01-22  9:50                             ` Herbert Xu
2015-01-21 10:34                     ` Thomas Graf
2015-01-21 10:40                       ` Patrick McHardy
2015-01-21 11:37                         ` Thomas Graf
2015-01-21 11:59                           ` Patrick McHardy
2015-01-21 12:07                             ` Thomas Graf
2015-01-21 12:09                               ` Patrick McHardy
2015-01-21 10:36               ` David Laight
2015-01-20 15:00     ` David Laight
2015-01-20 15:05       ` Thomas Graf
2015-01-21  5:11     ` Herbert Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.