* [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
@ 2016-05-19 12:35 Eric Dumazet
  2016-05-19 16:08 ` Alexei Starovoitov
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Eric Dumazet @ 2016-05-19 12:35 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

From: Eric Dumazet <edumazet@google.com>

Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by the Google BwE host
agent [1] are problematic at scale:
    
For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure.

These stats use u64 or u32 fields, so reading integral values
should not prevent writers from doing concurrent updates, as long as the
kernel is a 64-bit one.

Being able to atomically fetch all counters (like packets and bytes sent)
at the expense of interfering with the fast path (packet enqueue and
dequeue) is simply not worth the pain, as the values are generally stale
after 1 usec.

These lock acquisitions slow down the fast path by 10 to 20%.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats().

A gnet_dump_force_lock() call is added there, and could be added to other
qdisc stat handlers if needed (see the usage sketch after the diff).

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
---
 include/net/sch_generic.h |   23 +++++++++++++++++++++++
 net/core/gen_stats.c      |   16 ++++++++++------
 net/sched/sch_api.c       |    4 ++--
 net/sched/sch_fq_codel.c  |   13 +++++++++----
 net/sched/sch_mqprio.c    |    6 ++++--
 5 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a1fd76c22a59..87f39fafb770 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -321,6 +321,29 @@ static inline spinlock_t *qdisc_root_sleeping_lock(const struct Qdisc *qdisc)
 	return qdisc_lock(root);
 }
 
+static inline spinlock_t *qdisc_stats_lock(const struct Qdisc *qdisc)
+{
+	ASSERT_RTNL();
+#if defined(CONFIG_64BIT) && !defined(CONFIG_LOCKDEP)
+	/* With u32/u64 bytes counter, there is no real issue on 64bit arches */
+	return NULL;
+#else
+	return qdisc_lock(qdisc_root_sleeping(qdisc));
+#endif
+}
+
+/* Some qdisc dump_stat() methods might need to run while qdisc lock is held.
+ * They can call gnet_dump_force_lock() in case qdisc_stats_lock()
+ * returned NULL and qdisc spinlock was not acquired.
+ */
+static inline void gnet_dump_force_lock(struct Qdisc *sch, struct gnet_dump *d)
+{
+	if (!d->lock) {
+		d->lock = qdisc_root_sleeping_lock(sch);
+		spin_lock_bh(d->lock);
+	}
+}
+
 static inline struct net_device *qdisc_dev(const struct Qdisc *qdisc)
 {
 	return qdisc->dev_queue->dev;
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index f96ee8b9478d..c5469d0377c8 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -32,10 +32,11 @@ gnet_stats_copy(struct gnet_dump *d, int type, void *buf, int size, int padattr)
 	return 0;
 
 nla_put_failure:
+	if (d->lock)
+		spin_unlock_bh(d->lock);
 	kfree(d->xstats);
 	d->xstats = NULL;
 	d->xstats_len = 0;
-	spin_unlock_bh(d->lock);
 	return -1;
 }
 
@@ -65,15 +66,16 @@ gnet_stats_start_copy_compat(struct sk_buff *skb, int type, int tc_stats_type,
 {
 	memset(d, 0, sizeof(*d));
 
-	spin_lock_bh(lock);
-	d->lock = lock;
 	if (type)
 		d->tail = (struct nlattr *)skb_tail_pointer(skb);
 	d->skb = skb;
 	d->compat_tc_stats = tc_stats_type;
 	d->compat_xstats = xstats_type;
 	d->padattr = padattr;
-
+	if (lock) {
+		d->lock = lock;
+		spin_lock_bh(lock);
+	}
 	if (d->tail)
 		return gnet_stats_copy(d, type, NULL, 0, padattr);
 
@@ -328,8 +330,9 @@ gnet_stats_copy_app(struct gnet_dump *d, void *st, int len)
 	return 0;
 
 err_out:
+	if (d->lock)
+		spin_unlock_bh(d->lock);
 	d->xstats_len = 0;
-	spin_unlock_bh(d->lock);
 	return -1;
 }
 EXPORT_SYMBOL(gnet_stats_copy_app);
@@ -363,10 +366,11 @@ gnet_stats_finish_copy(struct gnet_dump *d)
 			return -1;
 	}
 
+	if (d->lock)
+		spin_unlock_bh(d->lock);
 	kfree(d->xstats);
 	d->xstats = NULL;
 	d->xstats_len = 0;
-	spin_unlock_bh(d->lock);
 	return 0;
 }
 EXPORT_SYMBOL(gnet_stats_finish_copy);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 64f71a2155f3..466fe32f2862 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1365,7 +1365,7 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 		goto nla_put_failure;
 
 	if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, TCA_XSTATS,
-					 qdisc_root_sleeping_lock(q), &d,
+					 qdisc_stats_lock(q), &d,
 					 TCA_PAD) < 0)
 		goto nla_put_failure;
 
@@ -1680,7 +1680,7 @@ static int tc_fill_tclass(struct sk_buff *skb, struct Qdisc *q,
 		goto nla_put_failure;
 
 	if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, TCA_XSTATS,
-					 qdisc_root_sleeping_lock(q), &d,
+					 qdisc_stats_lock(q), &d,
 					 TCA_PAD) < 0)
 		goto nla_put_failure;
 
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 6883a8971562..9e887df9650b 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -566,6 +566,7 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	st.qdisc_stats.memory_usage  = q->memory_usage;
 	st.qdisc_stats.drop_overmemory = q->drop_overmemory;
 
+	gnet_dump_force_lock(sch, d);
 	list_for_each(pos, &q->new_flows)
 		st.qdisc_stats.new_flows_len++;
 
@@ -624,7 +625,7 @@ static int fq_codel_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 
 	if (idx < q->flows_cnt) {
 		const struct fq_codel_flow *flow = &q->flows[idx];
-		const struct sk_buff *skb = flow->head;
+		const struct sk_buff *skb;
 
 		memset(&xstats, 0, sizeof(xstats));
 		xstats.type = TCA_FQ_CODEL_XSTATS_CLASS;
@@ -642,9 +643,13 @@ static int fq_codel_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 				codel_time_to_us(delta) :
 				-codel_time_to_us(-delta);
 		}
-		while (skb) {
-			qs.qlen++;
-			skb = skb->next;
+		if (flow->head) {
+			gnet_dump_force_lock(sch, d);
+			skb = flow->head;
+			while (skb) {
+				qs.qlen++;
+				skb = skb->next;
+			}
 		}
 		qs.backlog = q->backlogs[idx];
 		qs.drops = flow->dropped;
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index b8002ce3d010..15d0210f5747 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -342,7 +342,8 @@ static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 		 * hold here is the look on dev_queue->qdisc_sleeping
 		 * also acquired below.
 		 */
-		spin_unlock_bh(d->lock);
+		if (d->lock)
+			spin_unlock_bh(d->lock);
 
 		for (i = tc.offset; i < tc.offset + tc.count; i++) {
 			struct netdev_queue *q = netdev_get_tx_queue(dev, i);
@@ -359,7 +360,8 @@ static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 			spin_unlock_bh(qdisc_lock(qdisc));
 		}
 		/* Reclaim root sleeping lock before completing stats */
-		spin_lock_bh(d->lock);
+		if (d->lock)
+			spin_lock_bh(d->lock);
 		if (gnet_stats_copy_basic(d, NULL, &bstats) < 0 ||
 		    gnet_stats_copy_queue(d, NULL, &qstats, qlen) < 0)
 			return -1;
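
As an illustration only (not part of the patch), a hypothetical qdisc whose
dump_stats() handler has to walk per-flow state would follow the same pattern
as the fq_codel hunks above: copy plain word-sized counters without the lock,
and call gnet_dump_force_lock() only before touching structures the fast path
also modifies. All foo_* / tc_foo_* names below are made up for the sketch:

/* Sketch only: struct foo_sched_data, struct tc_foo_xstats and their
 * fields are hypothetical; gnet_dump_force_lock() is the helper added
 * by this patch.
 */
static int foo_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
{
	struct foo_sched_data *q = qdisc_priv(sch);
	struct tc_foo_xstats st = { 0 };
	struct list_head *pos;

	/* Plain u32/u64 loads: per the changelog, these can be copied
	 * without the root lock on 64-bit builds.
	 */
	st.drop_overlimit = q->drop_overlimit;
	st.memory_usage   = q->memory_usage;

	/* Walking a list that enqueue/dequeue also modify does need the
	 * lock; gnet_dump_force_lock() takes it only if qdisc_stats_lock()
	 * returned NULL and the lock was therefore skipped earlier.
	 */
	gnet_dump_force_lock(sch, d);
	list_for_each(pos, &q->active_flows)
		st.active_flows++;

	return gnet_stats_copy_app(d, &st, sizeof(st));
}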


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-19 12:35 [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump Eric Dumazet
@ 2016-05-19 16:08 ` Alexei Starovoitov
  2016-05-19 17:02   ` Eric Dumazet
  2016-05-20  1:50 ` Cong Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2016-05-19 16:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, May 19, 2016 at 05:35:20AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
> agent [1] are problematic at scale :
>     
> For each qdisc/class found in the dump, we currently lock the root qdisc
> spinlock in order to get stats. Sampling stats every 5 seconds from
> thousands of HTB classes is a challenge when the root qdisc spinlock is
> under high pressure.
> 
> These stats are using u64 or u32 fields, so reading integral values
> should not prevent writers from doing concurrent updates if the kernel
> arch is a 64bit one.
> 
> Being able to atomically fetch all counters like packets and bytes sent
> at the expense of interfering in fast path (queue and dequeue packets)
> is simply not worth the pain, as the values are generally stale after 1
> usec.
> 
> These lock acquisitions slow down the fast path by 10 to 20 %
> 
> An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
> that might need the qdisc lock in fq_codel_dump_stats() and
> fq_codel_dump_class_stats()
> 
> gnet_dump_force_lock() call is added there and could be added to other
> qdisc stat handlers if needed.
> 
> [1]
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Kevin Athey <kda@google.com>
> Cc: Xiaotian Pei <xiaotian@google.com>

Good optimization.
Acked-by: Alexei Starovoitov <ast@kernel.org>

> +static inline spinlock_t *qdisc_stats_lock(const struct Qdisc *qdisc)
> +{
> +	ASSERT_RTNL();
> +#if defined(CONFIG_64BIT) && !defined(CONFIG_LOCKDEP)
> +	/* With u32/u64 bytes counter, there is no real issue on 64bit arches */
> +	return NULL;
> +#else
> +	return qdisc_lock(qdisc_root_sleeping(qdisc));

Initially I thought that the above line could have been
qdisc_root_sleeping_lock(), but then realized that moving ASSERT_RTNL() out
makes more sense. Good call.
The only thing not clear to me is why '!defined(CONFIG_LOCKDEP)'?
Just extra caution? I think it should be fine without it.


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-19 16:08 ` Alexei Starovoitov
@ 2016-05-19 17:02   ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2016-05-19 17:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, 2016-05-19 at 09:08 -0700, Alexei Starovoitov wrote:

> initially I thought that above line could have been qdisc_root_sleeping_lock()
> but then realized that moving out ASSERT_RTNL makes more sense. Good call.
> The only thing not clear to me is why '!defined(CONFIG_LOCKDEP)' ?
> Just extra caution? I think should be fine without it.

I believe it is better to exercise this code path (using the spinlock) even
on 64-bit kernels when LOCKDEP is enabled.

This could avoid some hard-to-debug bugs that would otherwise only show up
on 32-bit builds.


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-19 12:35 [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump Eric Dumazet
  2016-05-19 16:08 ` Alexei Starovoitov
@ 2016-05-20  1:50 ` Cong Wang
  2016-05-20  2:45   ` Eric Dumazet
  2016-05-20 23:38 ` David Miller
  2016-05-22 16:30 ` Jamal Hadi Salim
  3 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2016-05-20  1:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, May 19, 2016 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> These stats are using u64 or u32 fields, so reading integral values
> should not prevent writers from doing concurrent updates if the kernel
> arch is a 64bit one.
>
> Being able to atomically fetch all counters like packets and bytes sent
> at the expense of interfering in fast path (queue and dequeue packets)
> is simply not worth the pain, as the values are generally stale after 1
> usec.

I think one purpose of this lock is to make sure we have an atomic
snapshot of these counters as a whole. IOW, we may need another
lock rather than the qdisc root lock to guarantee this.


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20  1:50 ` Cong Wang
@ 2016-05-20  2:45   ` Eric Dumazet
  2016-05-20  5:23     ` Cong Wang
  2016-05-20 12:44     ` Eric Dumazet
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2016-05-20  2:45 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, 2016-05-19 at 18:50 -0700, Cong Wang wrote:
> On Thu, May 19, 2016 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > These stats are using u64 or u32 fields, so reading integral values
> > should not prevent writers from doing concurrent updates if the kernel
> > arch is a 64bit one.
> >
> > Being able to atomically fetch all counters like packets and bytes sent
> > at the expense of interfering in fast path (queue and dequeue packets)
> > is simply not worth the pain, as the values are generally stale after 1
> > usec.
> 
> I think one purpose of this lock is to make sure we have an atomic
> snapshot of these counters as a whole. IOW, we may need another
> lock rather than the qdisc root lock to guarantee this.

Right, this was stated in the changelog.

I played a bit with changing qdisc->__state to a seqcount.

But this would add 2 additional smp_wmb() barriers.

Not sure if any application would actually care if it reads

Sent 22954160104777 bytes 1808585171 pkt (dropped 0, overlimits 0
requeues 26021)

Instead of 

Sent 22954160104777 bytes 1808585172 pkt (dropped 0, overlimits 0
requeues 26021)

?


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20  2:45   ` Eric Dumazet
@ 2016-05-20  5:23     ` Cong Wang
  2016-05-20  5:50       ` Eric Dumazet
  2016-05-20 12:44     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Cong Wang @ 2016-05-20  5:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, May 19, 2016 at 7:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-05-19 at 18:50 -0700, Cong Wang wrote:
>> On Thu, May 19, 2016 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > These stats are using u64 or u32 fields, so reading integral values
>> > should not prevent writers from doing concurrent updates if the kernel
>> > arch is a 64bit one.
>> >
>> > Being able to atomically fetch all counters like packets and bytes sent
>> > at the expense of interfering in fast path (queue and dequeue packets)
>> > is simply not worth the pain, as the values are generally stale after 1
>> > usec.
>>
>> I think one purpose of this lock is to make sure we have an atomic
>> snapshot of these counters as a whole. IOW, we may need another
>> lock rather than the qdisc root lock to guarantee this.
>
> Right, this was stated in the changelog.
>
> I played a bit at changing qdisc->__state to a seqcount.
>
> But this would add 2 additional smp_wmb() barriers.
>
> Not sure if any application would actually care if it reads
>
> Sent 22954160104777 bytes 1808585171 pkt (dropped 0, overlimits 0
> requeues 26021)
>
> Instead of
>
> Sent 22954160104777 bytes 1808585172 pkt (dropped 0, overlimits 0
> requeues 26021)
>
> ?

Sometimes I use bytes/pkts to estimate the average size of the packets.
;) It doesn't have to be that accurate, but do we know how inaccurate
it is without the lock?


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20  5:23     ` Cong Wang
@ 2016-05-20  5:50       ` Eric Dumazet
  2016-05-22 16:35         ` Jamal Hadi Salim
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2016-05-20  5:50 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, 2016-05-19 at 22:23 -0700, Cong Wang wrote:
> On Thu, May 19, 2016 at 7:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thu, 2016-05-19 at 18:50 -0700, Cong Wang wrote:
> >> On Thu, May 19, 2016 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> >
> >> > These stats are using u64 or u32 fields, so reading integral values
> >> > should not prevent writers from doing concurrent updates if the kernel
> >> > arch is a 64bit one.
> >> >
> >> > Being able to atomically fetch all counters like packets and bytes sent
> >> > at the expense of interfering in fast path (queue and dequeue packets)
> >> > is simply not worth the pain, as the values are generally stale after 1
> >> > usec.
> >>
> >> I think one purpose of this lock is to make sure we have an atomic
> >> snapshot of these counters as a whole. IOW, we may need another
> >> lock rather than the qdisc root lock to guarantee this.
> >
> > Right, this was stated in the changelog.
> >
> > I played a bit at changing qdisc->__state to a seqcount.
> >
> > But this would add 2 additional smp_wmb() barriers.
> >
> > Not sure if any application would actually care if it reads
> >
> > Sent 22954160104777 bytes 1808585171 pkt (dropped 0, overlimits 0
> > requeues 26021)
> >
> > Instead of
> >
> > Sent 22954160104777 bytes 1808585172 pkt (dropped 0, overlimits 0
> > requeues 26021)
> >
> > ?
> 
> Sometimes I use bytes/pkts to valuate the average size of the packets.
> ;) It doesn't have to be that accurate but we don't know how inaccurate
> it is without lock?

It won't be inaccurate (no drift), as the writers hold the qdisc
spinlock.

It is the same with, say, IP/TCP SNMP counters:

When an incoming frame is handled, it might change many SNMP counters,
but an SNMP agent might fetch the whole set of SNMP values in the middle
of the changes.

IpInReceives
IpInDelivers 
TcpInSegs
IpExtInOctets
IpExtInNoECTPkts
...

The only 'problem' can be an off-by-one (or off-by-bytes-in-the-packet)
transient error, which all SNMP agents normally handle just fine.


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20  2:45   ` Eric Dumazet
  2016-05-20  5:23     ` Cong Wang
@ 2016-05-20 12:44     ` Eric Dumazet
  2016-05-20 13:01       ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2016-05-20 12:44 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Thu, 2016-05-19 at 19:45 -0700, Eric Dumazet wrote:
> On Thu, 2016-05-19 at 18:50 -0700, Cong Wang wrote:
> > On Thu, May 19, 2016 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > > These stats are using u64 or u32 fields, so reading integral values
> > > should not prevent writers from doing concurrent updates if the kernel
> > > arch is a 64bit one.
> > >
> > > Being able to atomically fetch all counters like packets and bytes sent
> > > at the expense of interfering in fast path (queue and dequeue packets)
> > > is simply not worth the pain, as the values are generally stale after 1
> > > usec.
> > 
> > I think one purpose of this lock is to make sure we have an atomic
> > snapshot of these counters as a whole. IOW, we may need another
> > lock rather than the qdisc root lock to guarantee this.
> 
> Right, this was stated in the changelog.
> 
> I played a bit at changing qdisc->__state to a seqcount.
> 
> But this would add 2 additional smp_wmb() barriers.

Although this would allow the mechanism to be used on both 32-bit and
64-bit kernels.

This would also add LOCKDEP annotations, which can be nice for debugging.

Also, the seqcount value >> 1 would give us the number of __qdisc_run()
rounds, and we could compute packets/(seqcount>>1) to get the average
number of packets processed per round.

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 941ec99cd3b6..471095beca09 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4610,6 +4610,7 @@ static int bond_check_params(struct bond_params *params)
 static struct lock_class_key bonding_netdev_xmit_lock_key;
 static struct lock_class_key bonding_netdev_addr_lock_key;
 static struct lock_class_key bonding_tx_busylock_key;
+static struct lock_class_key bonding_qdisc_running;
 
 static void bond_set_lockdep_class_one(struct net_device *dev,
 				       struct netdev_queue *txq,
@@ -4625,6 +4626,7 @@ static void bond_set_lockdep_class(struct net_device *dev)
 			  &bonding_netdev_addr_lock_key);
 	netdev_for_each_tx_queue(dev, bond_set_lockdep_class_one, NULL);
 	dev->qdisc_tx_busylock = &bonding_tx_busylock_key;
+	dev->qdisc_running = &bonding_qdisc_running;
 }
 
 /* Called from registration process */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c148edfe4965..e06646b69b06 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1862,6 +1862,7 @@ struct net_device {
 #endif
 	struct phy_device	*phydev;
 	struct lock_class_key	*qdisc_tx_busylock;
+	struct lock_class_key	*qdisc_running;
 	bool			proto_down;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a1fd76c22a59..bff8d895ef8a 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -29,13 +29,6 @@ enum qdisc_state_t {
 	__QDISC_STATE_THROTTLED,
 };
 
-/*
- * following bits are only changed while qdisc lock is held
- */
-enum qdisc___state_t {
-	__QDISC___STATE_RUNNING = 1,
-};
-
 struct qdisc_size_table {
 	struct rcu_head		rcu;
 	struct list_head	list;
@@ -93,7 +86,7 @@ struct Qdisc {
 	unsigned long		state;
 	struct sk_buff_head	q;
 	struct gnet_stats_basic_packed bstats;
-	unsigned int		__state;
+	seqcount_t		running;
 	struct gnet_stats_queue	qstats;
 	struct rcu_head		rcu_head;
 	int			padded;
@@ -104,20 +97,20 @@ struct Qdisc {
 
 static inline bool qdisc_is_running(const struct Qdisc *qdisc)
 {
-	return (qdisc->__state & __QDISC___STATE_RUNNING) ? true : false;
+	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
 }
 
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc_is_running(qdisc))
 		return false;
-	qdisc->__state |= __QDISC___STATE_RUNNING;
+	write_seqcount_begin(&qdisc->running);
 	return true;
 }
 
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
-	qdisc->__state &= ~__QDISC___STATE_RUNNING;
+	write_seqcount_end(&qdisc->running);
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/core/dev.c b/net/core/dev.c
index 904ff431d570..55b414dead29 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3075,7 +3075,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
 	 * separate lock before trying to get qdisc main lock.
-	 * This permits __QDISC___STATE_RUNNING owner to get the lock more
+	 * This permits qdisc->running owner to get the lock more
 	 * often and dequeue packets faster.
 	 */
 	contended = qdisc_is_running(q);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 269dd71b3828..d25412364c07 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -110,7 +110,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 
 /*
  * Transmit possibly several skbs, and handle the return status as
- * required. Holding the __QDISC___STATE_RUNNING bit guarantees that
+ * required. Owning running seqcount bit guarantees that
  * only one CPU can execute this function.
  *
  * Returns to the caller:
@@ -163,7 +163,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 /*
  * NOTE: Called under qdisc_lock(q) with locally disabled BH.
  *
- * __QDISC___STATE_RUNNING guarantees only one CPU can process
+ * running seqcount guarantees only one CPU can process (dequeue packets)
  * this qdisc at a time. qdisc_lock(q) serializes queue accesses for
  * this queue.
  *
@@ -379,6 +379,7 @@ struct Qdisc noop_qdisc = {
 	.list		=	LIST_HEAD_INIT(noop_qdisc.list),
 	.q.lock		=	__SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
 	.dev_queue	=	&noop_netdev_queue,
+	.running	=	SEQCNT_ZERO(noop_qdisc.running),
 	.busylock	=	__SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
 };
 EXPORT_SYMBOL(noop_qdisc);
@@ -537,6 +538,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 EXPORT_SYMBOL(pfifo_fast_ops);
 
 static struct lock_class_key qdisc_tx_busylock;
+static struct lock_class_key qdisc_running_class;
 
 struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 			  const struct Qdisc_ops *ops)
@@ -570,6 +572,10 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	lockdep_set_class(&sch->busylock,
 			  dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
 
+	seqcount_init(&sch->running);
+	lockdep_set_class(&sch->running,
+			  dev->qdisc_running ?: &qdisc_running_class);
+
 	sch->ops = ops;
 	sch->enqueue = ops->enqueue;
 	sch->dequeue = ops->dequeue;
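
On the reader side (the stats dump path), a consistent bytes/packets snapshot
could then be taken against qdisc->running with the usual seqcount retry loop.
A minimal sketch assuming the diff above; the helper name is made up:

/* Sketch only: copy bstats consistently w.r.t. qdisc->running. */
static void qdisc_bstats_snapshot(const struct Qdisc *q,
				  struct gnet_stats_basic_packed *snap)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&q->running);
		snap->bytes   = q->bstats.bytes;
		snap->packets = q->bstats.packets;
	} while (read_seqcount_retry(&q->running, seq));
}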


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20 12:44     ` Eric Dumazet
@ 2016-05-20 13:01       ` Eric Dumazet
  2016-05-20 13:09         ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2016-05-20 13:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Fri, 2016-05-20 at 05:44 -0700, Eric Dumazet wrote:
> On Thu, 2016-05-19 at 19:45 -0700, Eric Dumazet wrote:
> > On Thu, 2016-05-19 at 18:50 -0700, Cong Wang wrote:
> > > On Thu, May 19, 2016 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >
> > > > These stats are using u64 or u32 fields, so reading integral values
> > > > should not prevent writers from doing concurrent updates if the kernel
> > > > arch is a 64bit one.
> > > >
> > > > Being able to atomically fetch all counters like packets and bytes sent
> > > > at the expense of interfering in fast path (queue and dequeue packets)
> > > > is simply not worth the pain, as the values are generally stale after 1
> > > > usec.
> > > 
> > > I think one purpose of this lock is to make sure we have an atomic
> > > snapshot of these counters as a whole. IOW, we may need another
> > > lock rather than the qdisc root lock to guarantee this.
> > 
> > Right, this was stated in the changelog.
> > 
> > I played a bit at changing qdisc->__state to a seqcount.
> > 
> > But this would add 2 additional smp_wmb() barriers.
> 
> Although this would allow the mechanism to be used both on 32bit an
> 64bit kernels.
> 
> This would also add LOCKDEP annotations which can be nice for debugging.
> 
> Also the seqcount value >> 1 would give us the number of __qdisc_run()
> and we could compute packets/(seqcount>>1) to get the average number of
> packets processed per round.

Tricky, since sch_direct_xmit() releases the qdisc spinlock and grabs it
again, while owning the 'running' seqcount.

Needs more LOCKDEP tricks ;)


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20 13:01       ` Eric Dumazet
@ 2016-05-20 13:09         ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2016-05-20 13:09 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Jamal Hadi Salim, John Fastabend, Kevin Athey, Xiaotian Pei

On Fri, 2016-05-20 at 06:01 -0700, Eric Dumazet wrote:

> Tricky, since sch_direct_xmit() releases the qdisc spinlock and grabs it
> again, while owning the ' running seqcount'
> 
> Needs more LOCKDEP tricks ;)
> 

That would be:

@@ -137,10 +137,10 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
                HARD_TX_UNLOCK(dev, txq);
        } else {
-               spin_lock(root_lock);
+               spin_lock_nested(root_lock, SINGLE_DEPTH_NESTING);
                return qdisc_qlen(q);
        }
-       spin_lock(root_lock);
+       spin_lock_nested(root_lock, SINGLE_DEPTH_NESTING);
 
        if (dev_xmit_complete(ret)) {
                /* Driver sent out skb successfully or skb was consumed */


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-19 12:35 [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump Eric Dumazet
  2016-05-19 16:08 ` Alexei Starovoitov
  2016-05-20  1:50 ` Cong Wang
@ 2016-05-20 23:38 ` David Miller
  2016-05-22 16:30 ` Jamal Hadi Salim
  3 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2016-05-20 23:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jhs, john.fastabend, kda, xiaotian

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 19 May 2016 05:35:20 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
> agent [1] are problematic at scale :
>     
> For each qdisc/class found in the dump, we currently lock the root qdisc
> spinlock in order to get stats. Sampling stats every 5 seconds from
> thousands of HTB classes is a challenge when the root qdisc spinlock is
> under high pressure.
> 
> These stats are using u64 or u32 fields, so reading integral values
> should not prevent writers from doing concurrent updates if the kernel
> arch is a 64bit one.
> 
> Being able to atomically fetch all counters like packets and bytes sent
> at the expense of interfering in fast path (queue and dequeue packets)
> is simply not worth the pain, as the values are generally stale after 1
> usec.
> 
> These lock acquisitions slow down the fast path by 10 to 20 %
> 
> An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
> that might need the qdisc lock in fq_codel_dump_stats() and
> fq_codel_dump_class_stats()
> 
> gnet_dump_force_lock() call is added there and could be added to other
> qdisc stat handlers if needed.
> 
> [1]
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I guess the off-by-one situations are not a big enough deal to code new
locking or memory barriers for, so I'm fine with this.

Please resubmit when I open net-next back up, thanks.


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-19 12:35 [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump Eric Dumazet
                   ` (2 preceding siblings ...)
  2016-05-20 23:38 ` David Miller
@ 2016-05-22 16:30 ` Jamal Hadi Salim
  3 siblings, 0 replies; 13+ messages in thread
From: Jamal Hadi Salim @ 2016-05-22 16:30 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: John Fastabend, Kevin Athey, Xiaotian Pei

On 16-05-19 08:35 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
> agent [1] are problematic at scale :
>
> For each qdisc/class found in the dump, we currently lock the root qdisc
> spinlock in order to get stats. Sampling stats every 5 seconds from
> thousands of HTB classes is a challenge when the root qdisc spinlock is
> under high pressure.
>

Good stuff.
There are other optimizations we could do in such large-scale dumps
(such as not dumping something that hasn't been updated).
Could we have changed it to be RCU?

> These stats are using u64 or u32 fields, so reading integral values
> should not prevent writers from doing concurrent updates if the kernel
> arch is a 64bit one.
>

Meaning it won't work on other archs? Is an atomic read not dependable
on other setups?


> Being able to atomically fetch all counters like packets and bytes sent
> at the expense of interfering in fast path (queue and dequeue packets)
> is simply not worth the pain, as the values are generally stale after 1
> usec.
>
> These lock acquisitions slow down the fast path by 10 to 20 %
>


Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal


* Re: [RFC net-next] net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
  2016-05-20  5:50       ` Eric Dumazet
@ 2016-05-22 16:35         ` Jamal Hadi Salim
  0 siblings, 0 replies; 13+ messages in thread
From: Jamal Hadi Salim @ 2016-05-22 16:35 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang; +Cc: netdev, John Fastabend, Kevin Athey, Xiaotian Pei

On 16-05-20 01:50 AM, Eric Dumazet wrote:

> It wont be inaccurate (no drift), as the writers hold the qdisc
> spinlock.
>
> It it the same with say IP/TCP SNMP counters :
>
> When an incoming frame is handled, it might change many SNMP counters,
> but an SNMP agent might fetch the whole set of SNMP values in the middle
> of the changes.
>
> IpInReceives
> IpInDelivers
> TcpInSegs
> IpExtInOctets
> IpExtInNoECTPkts
> ...
>
> The only 'problem' can be a off-by-one (or off-by-bytes-in-the-packet)
> transient error, that all SNMP agents are normally handling just fine.
>


I think qdisc stats being off by a bit is OK - hence the RCU
suggestion. I would have a problem with action stats being off
(where they are used for billing).

cheers,
jamal

