* [RFC PATCH net-next] qfq: handle the case that front slot is empty
@ 2012-10-23  4:15 Cong Wang
  2012-10-23  7:09 ` Paolo Valente
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2012-10-23  4:15 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Valente, Stephen Hemminger, Eric Dumazet, David S. Miller,
	Cong Wang

I am not sure if this patch fixes the real problem or just works around
it. At least, after this patch I don't see the crash I reported any more.

The problem is that in some cases the front slot can become empty,
therefore qfq_slot_head() returns an invalid pointer (not NULL!).
This patch just makes it return NULL in that case.
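
To see why the pointer is invalid rather than NULL: hlist_entry() is just
container_of(), so applying it to an empty bucket's NULL ->first yields
NULL minus the offset of the linkage member, i.e. a small, bogus, non-NULL
address. A minimal userspace sketch (the struct layout is made up for
illustration; it is not the real struct qfq_class):

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

struct fake_class {
	unsigned int refcnt;	/* anything placed before the linkage */
	struct hlist_node next;	/* linkage sits at a non-zero offset */
};

int main(void)
{
	struct hlist_head bucket = { .first = NULL }; /* empty front slot */
	struct fake_class *cl =
		container_of(bucket.first, struct fake_class, next);

	/* cl is NULL minus offsetof(struct fake_class, next): non-NULL
	 * garbage, so a !cl check never fires and callers dereference it. */
	printf("cl = %p\n", (void *)cl);
	return 0;
}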

What's more, qfq_front_slot_remove() always clears
bit 0 in full_slots, so we should adjust the bucket list with
qfq_slot_scan() before that.

Cc: Paolo Valente <paolo.valente@unimore.it>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>

---
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index f0dd83c..8dfdd89 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -680,24 +680,13 @@ static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl,
 /* Maybe introduce hlist_first_entry?? */
 static struct qfq_class *qfq_slot_head(struct qfq_group *grp)
 {
+	if (hlist_empty(&grp->slots[grp->front]))
+		return NULL;
 	return hlist_entry(grp->slots[grp->front].first,
 			   struct qfq_class, next);
 }
 
 /*
- * remove the entry from the slot
- */
-static void qfq_front_slot_remove(struct qfq_group *grp)
-{
-	struct qfq_class *cl = qfq_slot_head(grp);
-
-	BUG_ON(!cl);
-	hlist_del(&cl->next);
-	if (hlist_empty(&grp->slots[grp->front]))
-		__clear_bit(0, &grp->full_slots);
-}
-
-/*
  * Returns the first full queue in a group. As a side effect,
  * adjust the bucket list so the first non-empty bucket is at
  * position 0 in full_slots.
@@ -722,6 +711,19 @@ static struct qfq_class *qfq_slot_scan(struct qfq_group *grp)
 }
 
 /*
+ * remove the entry from the front slot
+ */
+static void qfq_front_slot_remove(struct qfq_group *grp)
+{
+	struct qfq_class *cl = qfq_slot_scan(grp);
+
+	BUG_ON(!cl);
+	hlist_del(&cl->next);
+	if (hlist_empty(&grp->slots[grp->front]))
+		__clear_bit(0, &grp->full_slots);
+}
+
+/*
  * adjust the bucket list. When the start time of a group decreases,
  * we move the index down (modulo QFQ_MAX_SLOTS) so we don't need to
  * move the objects. The mask of occupied slots must be shifted
@@ -794,6 +796,8 @@ static struct sk_buff *qfq_dequeue(struct Qdisc *sch)
 	grp = qfq_ffs(q, q->bitmaps[ER]);
 
 	cl = qfq_slot_head(grp);
+	if (!cl)
+		return NULL;
 	skb = qdisc_dequeue_peeked(cl->qdisc);
 	if (!skb) {
 		WARN_ONCE(1, "qfq_dequeue: non-workconserving leaf\n");
@@ -1018,16 +1022,23 @@ static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl)
 		__clear_bit(grp->index, &q->bitmaps[ER]);
 	} else if (hlist_empty(&grp->slots[grp->front])) {
 		cl = qfq_slot_scan(grp);
-		roundedS = qfq_round_down(cl->S, grp->slot_shift);
-		if (grp->S != roundedS) {
+		if (!cl) {
 			__clear_bit(grp->index, &q->bitmaps[ER]);
 			__clear_bit(grp->index, &q->bitmaps[IR]);
 			__clear_bit(grp->index, &q->bitmaps[EB]);
 			__clear_bit(grp->index, &q->bitmaps[IB]);
-			grp->S = roundedS;
-			grp->F = roundedS + (2ULL << grp->slot_shift);
-			s = qfq_calc_state(q, grp);
-			__set_bit(grp->index, &q->bitmaps[s]);
+		} else {
+			roundedS = qfq_round_down(cl->S, grp->slot_shift);
+			if (grp->S != roundedS) {
+				__clear_bit(grp->index, &q->bitmaps[ER]);
+				__clear_bit(grp->index, &q->bitmaps[IR]);
+				__clear_bit(grp->index, &q->bitmaps[EB]);
+				__clear_bit(grp->index, &q->bitmaps[IB]);
+				grp->S = roundedS;
+				grp->F = roundedS + (2ULL << grp->slot_shift);
+				s = qfq_calc_state(q, grp);
+				__set_bit(grp->index, &q->bitmaps[s]);
+			}
 		}
 	}
 


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-23  4:15 [RFC PATCH net-next] qfq: handle the case that front slot is empty Cong Wang
@ 2012-10-23  7:09 ` Paolo Valente
  2012-10-23  8:53   ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Valente @ 2012-10-23  7:09 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Stephen Hemminger, Eric Dumazet, David S. Miller

The crash you reported is one of the problems I tried to solve with my last fixes.
After those fixes I could not reproduce this crash (and other crashes) any more, but of course I am still missing something.

On 23 Oct 2012, at 06:15, Cong Wang wrote:

> I am not sure if this patch fixes the real problem or just works around
> it. At least, after this patch I don't see the crash I reported any more.

It is actually a workaround: if the condition that triggers your workaround holds true, then the group data structure is already inconsistent, and qfq is likely not to schedule classes correctly.
I will try to reproduce the crash with the steps you suggest, and try to understand what is still wrong as soon as I can.

Paolo

> 
> The problem is that in some cases the front slot can become empty,
> therefore qfq_slot_head() returns an invalid pointer (not NULL!).
> This patch just makes it return NULL in that case.
> 
> What's more, qfq_front_slot_remove() always clears
> bit 0 in full_slots, so we should adjust the bucket list with
> qfq_slot_scan() before that.
> 
> Cc: Paolo Valente <paolo.valente@unimore.it>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <amwang@redhat.com>
> 
> ---
> diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
> index f0dd83c..8dfdd89 100644
> --- a/net/sched/sch_qfq.c
> +++ b/net/sched/sch_qfq.c
> @@ -680,24 +680,13 @@ static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl,
> /* Maybe introduce hlist_first_entry?? */
> static struct qfq_class *qfq_slot_head(struct qfq_group *grp)
> {
> +	if (hlist_empty(&grp->slots[grp->front]))
> +		return NULL;
> 	return hlist_entry(grp->slots[grp->front].first,
> 			   struct qfq_class, next);
> }
> 
> /*
> - * remove the entry from the slot
> - */
> -static void qfq_front_slot_remove(struct qfq_group *grp)
> -{
> -	struct qfq_class *cl = qfq_slot_head(grp);
> -
> -	BUG_ON(!cl);
> -	hlist_del(&cl->next);
> -	if (hlist_empty(&grp->slots[grp->front]))
> -		__clear_bit(0, &grp->full_slots);
> -}
> -
> -/*
>  * Returns the first full queue in a group. As a side effect,
>  * adjust the bucket list so the first non-empty bucket is at
>  * position 0 in full_slots.
> @@ -722,6 +711,19 @@ static struct qfq_class *qfq_slot_scan(struct qfq_group *grp)
> }
> 
> /*
> + * remove the entry from the front slot
> + */
> +static void qfq_front_slot_remove(struct qfq_group *grp)
> +{
> +	struct qfq_class *cl = qfq_slot_scan(grp);
> +
> +	BUG_ON(!cl);
> +	hlist_del(&cl->next);
> +	if (hlist_empty(&grp->slots[grp->front]))
> +		__clear_bit(0, &grp->full_slots);
> +}
> +
> +/*
>  * adjust the bucket list. When the start time of a group decreases,
>  * we move the index down (modulo QFQ_MAX_SLOTS) so we don't need to
>  * move the objects. The mask of occupied slots must be shifted
> @@ -794,6 +796,8 @@ static struct sk_buff *qfq_dequeue(struct Qdisc *sch)
> 	grp = qfq_ffs(q, q->bitmaps[ER]);
> 
> 	cl = qfq_slot_head(grp);
> +	if (!cl)
> +		return NULL;
> 	skb = qdisc_dequeue_peeked(cl->qdisc);
> 	if (!skb) {
> 		WARN_ONCE(1, "qfq_dequeue: non-workconserving leaf\n");
> @@ -1018,16 +1022,23 @@ static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl)
> 		__clear_bit(grp->index, &q->bitmaps[ER]);
> 	} else if (hlist_empty(&grp->slots[grp->front])) {
> 		cl = qfq_slot_scan(grp);
> -		roundedS = qfq_round_down(cl->S, grp->slot_shift);
> -		if (grp->S != roundedS) {
> +		if (!cl) {
> 			__clear_bit(grp->index, &q->bitmaps[ER]);
> 			__clear_bit(grp->index, &q->bitmaps[IR]);
> 			__clear_bit(grp->index, &q->bitmaps[EB]);
> 			__clear_bit(grp->index, &q->bitmaps[IB]);
> -			grp->S = roundedS;
> -			grp->F = roundedS + (2ULL << grp->slot_shift);
> -			s = qfq_calc_state(q, grp);
> -			__set_bit(grp->index, &q->bitmaps[s]);
> +		} else {
> +			roundedS = qfq_round_down(cl->S, grp->slot_shift);
> +			if (grp->S != roundedS) {
> +				__clear_bit(grp->index, &q->bitmaps[ER]);
> +				__clear_bit(grp->index, &q->bitmaps[IR]);
> +				__clear_bit(grp->index, &q->bitmaps[EB]);
> +				__clear_bit(grp->index, &q->bitmaps[IB]);
> +				grp->S = roundedS;
> +				grp->F = roundedS + (2ULL << grp->slot_shift);
> +				s = qfq_calc_state(q, grp);
> +				__set_bit(grp->index, &q->bitmaps[s]);
> +			}
> 		}
> 	}
> 


--
Paolo Valente                                                 
Algogroup
Dipartimento di Ingegneria dell'Informazione		
Via Vignolese 905/b
41125 Modena - Italy        				  
homepage:  http://algo.ing.unimo.it/people/paolo/


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-23  7:09 ` Paolo Valente
@ 2012-10-23  8:53   ` Cong Wang
  2012-10-26  7:51     ` Paolo Valente
  2012-10-26 16:51     ` [RFC PATCH net-next] qfq: handle the case that front slot is empty Paolo Valente
  0 siblings, 2 replies; 12+ messages in thread
From: Cong Wang @ 2012-10-23  8:53 UTC (permalink / raw)
  To: Paolo Valente; +Cc: netdev, Stephen Hemminger, Eric Dumazet, David S. Miller

On Tue, 2012-10-23 at 09:09 +0200, Paolo Valente wrote:
> The crash you reported is one of the problems I tried to solve with my last fixes.
> After those fixes I could not reproduce this crash (and other crashes) any more, but of course I am still missing something.

I am using the latest net-next, so if your patches are in net-next,
then the problem of course still exists.

> 
> On 23 Oct 2012, at 06:15, Cong Wang wrote:
> 
> > I am not sure if this patch fixes the real problem or just works around
> > it. At least, after this patch I don't see the crash I reported any more.
> 
> It is actually a workaround: if the condition that triggers your workaround holds true, then the group data structure is already inconsistent, and qfq is likely not to schedule classes correctly.
> I will try to reproduce the crash with the steps you suggest, and try to understand what is still wrong as soon as I can.
> 

OK, I don't pretend I understand qfq. And I can help you to test
patches.

Thanks!


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-23  8:53   ` Cong Wang
@ 2012-10-26  7:51     ` Paolo Valente
  2012-10-26  8:10       ` Eric Dumazet
  2012-10-26 16:51     ` [RFC PATCH net-next] qfq: handle the case that front slot is empty Paolo Valente
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Valente @ 2012-10-26  7:51 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Stephen Hemminger, Eric Dumazet, David S. Miller


On 23 Oct 2012, at 10:53, Cong Wang wrote:

> On Tue, 2012-10-23 at 09:09 +0200, Paolo Valente wrote:
>> The crash you reported is one of the problems I tried to solve with my last fixes.
>> After those fixes I could not reproduce this crash (and other crashes) any more, but of course I am still missing something.
> 
> I am using the latest net-next, so if your patches are in net-next,
> then the problem of course still exists.
> 
>> 
>> On 23 Oct 2012, at 06:15, Cong Wang wrote:
>> 
>>> I am not sure if this patch fixes the real problem or just works around
>>> it. At least, after this patch I don't see the crash I reported any more.
>> 
>> It is actually a workaround: if the condition that triggers your workaround holds true, then the group data structure is already inconsistent, and qfq is likely not to schedule classes correctly.
>> I will try to reproduce the crash with the steps you suggest, and try to understand what is still wrong as soon as I can.
>> 
> 
> OK, I don't pretend I understand qfq.
The problem is that I should :)
> And I can help you to test
> patches.
> 
I think I will ask for your help soon, thanks.

The cause of the failure is TCP segment offloading, which lets qfq receive packets much larger than the MTU of the device.
In this respect, under qfq the default max packet size lmax for each class (2KB) is only slightly higher than the MTU. Violating the lmax constraint causes the corruption of the data structure that implements the bucket lists of the groups. In fact, the failure that you found is only one of the consequences of this corruption. I am sorry I did not discover it before, but, foolishly, I had run only UDP tests.

I am thinking about the best ways for addressing this issue.

BTW, I think that the behavior of all the other schedulers should be checked as well. For example, with segment offloading, drr must increment the deficit of a class for at most (64K/quantum) times, i.e., rounds, before it can serve the next packet of the class. The number of instructions per packet dequeue therefore becomes up to (64K/quantum) times higher than without segment offloading.
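
A rough sketch of that cost, with assumed numbers (quantum set to one
MTU-sized frame; the values are illustrative only):

#include <stdio.h>

int main(void)
{
	unsigned int quantum = 1514;	/* MTU-sized quantum */
	unsigned int gso_len = 65226;	/* ~64K GSO super-packet */
	/* rounds spent topping up the deficit before the packet fits */
	unsigned int rounds = (gso_len + quantum - 1) / quantum;

	printf("%u deficit rounds for one dequeue\n", rounds);	/* 44 */
	return 0;
}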

> Thanks!
> 


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-26  7:51     ` Paolo Valente
@ 2012-10-26  8:10       ` Eric Dumazet
  2012-10-26  9:32         ` [PATCH net-next] net_sched: more precise pkt_len computation Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2012-10-26  8:10 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Cong Wang, netdev, Stephen Hemminger, David S. Miller

On Fri, 2012-10-26 at 09:51 +0200, Paolo Valente wrote:

> I think I will ask for your help soon, thanks.
> 
> The cause of the failure is TCP segment offloading, which lets qfq
> receive packets much larger than the MTU of the device.
> In this respect, under qfq the default max packet size lmax for each
> class (2KB) is only slightly higher than the MTU. Violating the lmax
> constraint causes the corruption of the data structure that implements
> the bucket lists of the groups. In fact, the failure that you found is
> only one of the consequences of this corruption. I am sorry I did not
> discover it before, but, foolishly, I had run only UDP tests.
> 
> I am thinking about the best ways for addressing this issue.
> 
> BTW, I think that the behavior of all the other schedulers should be
> checked as well. For example, with segment offloading, drr must
> increment the deficit of a class for at most (64K/quantum) times,
> i.e., rounds, before it can serve the next packet of the class. The
> number of instructions per packet dequeue therefore becomes up to
> (64K/quantum) times higher than without segment offloading.
> 

OK, good to know you found the problem.

Normally, TSO is supported by the other qdiscs, maybe not optimally, but
supported.

For example, one known problem with TSO is that skb->len is an
underestimate of the real number of bytes sent on the wire.

If MSS is a bit small, TBF/HTB/CBQ can really overdrive links.

We probably should have a more precise estimation.
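
To put a number on it, a quick userspace sketch with assumed header and
MSS values:

#include <stdio.h>

int main(void)
{
	unsigned int mss  = 1448;	/* typical TCP MSS on a 1500-byte MTU */
	unsigned int hdr  = 66;		/* Ethernet + IPv4 + TCP w/ timestamps */
	unsigned int segs = 45;		/* ~64K worth of segments */

	unsigned int skb_len = hdr + segs * mss;	/* headers counted once */
	unsigned int wire    = segs * (hdr + mss);	/* headers on each segment */

	printf("skb->len = %u, on wire = %u, missing %u bytes (%.1f%%)\n",
	       skb_len, wire, wire - skb_len,
	       100.0 * (wire - skb_len) / wire);
	return 0;
}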


* [PATCH net-next] net_sched: more precise pkt_len computation
  2012-10-26  8:10       ` Eric Dumazet
@ 2012-10-26  9:32         ` Eric Dumazet
  2012-10-26 11:11           ` Eric Dumazet
  2013-01-10 22:36           ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2012-10-26  9:32 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Stephen Hemminger, Patrick McHardy, Paolo Valente,
	Jamal Hadi Salim, Herbert Xu

From: Eric Dumazet <edumazet@google.com>

One long standing problem with TSO/GSO/GRO packets is that skb->len
doesn't represent a precise number of bytes on the wire.

Headers are only accounted for the first segment.
For TCP, that's typically 66 bytes missing per 1448-byte segment,
an error of 4.5%.

As consequences:

1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
2) Device stats are slightly underestimated as well.

Fix this by taking account of headers in qdisc_skb_cb(skb)->pkt_len
computation.

Packet schedulers should use pkt_len instead of skb->len for their
bandwidth limitations, and TSO-enabled device drivers could use pkt_len
if their statistics are not hardware assisted, and if they don't scratch
skb->cb[]'s first word.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Patrick McHardy <kaber@trash.net>
---
 net/core/dev.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b4978e2..cf2b5f4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2442,7 +2442,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	bool contended;
 	int rc;
 
-	qdisc_skb_cb(skb)->pkt_len = skb->len;
 	qdisc_calculate_pkt_len(skb, q);
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
@@ -2579,6 +2578,16 @@ int dev_queue_xmit(struct sk_buff *skb)
 	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
 #endif
 	trace_net_dev_queue(skb);
+	qdisc_skb_cb(skb)->pkt_len = skb->len;
+	if (skb_is_gso(skb)) {
+		unsigned int hdr_len = skb_transport_offset(skb);
+
+		if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
+			hdr_len += tcp_hdrlen(skb);
+		else
+			hdr_len += sizeof(struct udphdr);
+		qdisc_skb_cb(skb)->pkt_len += (skb_shinfo(skb)->gso_segs - 1) * hdr_len;
+	}
 	if (q->enqueue) {
 		rc = __dev_xmit_skb(skb, q, dev, txq);
 		goto out;
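
The adjustment above can be restated in userspace to check that it
recovers the wire total (assumed TCP GSO values):

#include <stdio.h>

int main(void)
{
	unsigned int hdr_len  = 66;	/* transport offset + tcp_hdrlen() */
	unsigned int mss      = 1448;
	unsigned int gso_segs = 45;

	unsigned int skb_len = hdr_len + gso_segs * mss;
	unsigned int pkt_len = skb_len + (gso_segs - 1) * hdr_len;

	/* pkt_len equals gso_segs * (hdr_len + mss), the wire total */
	printf("pkt_len = %u, wire = %u\n",
	       pkt_len, gso_segs * (hdr_len + mss));
	return 0;
}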


* Re: [PATCH net-next] net_sched: more precise pkt_len computation
  2012-10-26  9:32         ` [PATCH net-next] net_sched: more precise pkt_len computation Eric Dumazet
@ 2012-10-26 11:11           ` Eric Dumazet
  2013-01-10 22:36           ` [PATCH v2 " Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2012-10-26 11:11 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Stephen Hemminger, Patrick McHardy, Paolo Valente,
	Jamal Hadi Salim, Herbert Xu

On Fri, 2012-10-26 at 11:32 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> One long standing problem with TSO/GSO/GRO packets is that skb->len
> doesn't represent a precise number of bytes on the wire.
> 
> Headers are only accounted for the first segment.
> For TCP, that's typically 66 bytes missing per 1448-byte segment,
> an error of 4.5%.
> 
> As consequences:
> 
> 1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
> 2) Device stats are slightly underestimated as well.
> 
> Fix this by taking account of headers in qdisc_skb_cb(skb)->pkt_len
> computation.
> 
> Packet schedulers should use pkt_len instead of skb->len for their
> bandwidth limitations, and TSO-enabled device drivers could use pkt_len
> if their statistics are not hardware assisted, and if they don't scratch
> skb->cb[]'s first word.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Paolo Valente <paolo.valente@unimore.it>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Patrick McHardy <kaber@trash.net>
> ---
>  net/core/dev.c |   11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b4978e2..cf2b5f4 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2442,7 +2442,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	bool contended;
>  	int rc;
>  
> -	qdisc_skb_cb(skb)->pkt_len = skb->len;
>  	qdisc_calculate_pkt_len(skb, q);
>  	/*
>  	 * Heuristic to force contended enqueues to serialize on a
> @@ -2579,6 +2578,16 @@ int dev_queue_xmit(struct sk_buff *skb)
>  	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
>  #endif
>  	trace_net_dev_queue(skb);
> +	qdisc_skb_cb(skb)->pkt_len = skb->len;
> +	if (skb_is_gso(skb)) {
> +		unsigned int hdr_len = skb_transport_offset(skb);
> +
> +		if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
> +			hdr_len += tcp_hdrlen(skb);
> +		else
> +			hdr_len += sizeof(struct udphdr);
> +		qdisc_skb_cb(skb)->pkt_len += (skb_shinfo(skb)->gso_segs - 1) * hdr_len;
> +	}
>  	if (q->enqueue) {
>  		rc = __dev_xmit_skb(skb, q, dev, txq);
>  		goto out;
> 

Hmm, this doesn't quite work for GRO (ingress), as we call
skb_reset_transport_header(skb); in __netif_receive_skb() before
handle_ing()

So skb_transport_offset(skb) is 14 here, instead of 14 + (IP header length)

This skb_reset_transport_header(skb) call looks like defensive
programming for the !NET_SKBUFF_DATA_USES_OFFSET case?

We could either :

1) remove this skb_reset_transport_header(skb) call

or

2) use the following helper instead :

#ifdef NET_SKBUFF_DATA_USES_OFFSET
static inline void skb_sanitize_transport_header(struct sk_buff *skb)
{
	skb->transport_header = max_t(sk_buff_data_t,
				      skb->data - skb->head,
				      skb->transport_header);
}
#else
static inline void skb_sanitize_transport_header(struct sk_buff *skb)
{
	skb->transport_header = max_t(sk_buff_data_t,
				      skb->data,
				      skb->transport_header);
}
#endif


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-23  8:53   ` Cong Wang
  2012-10-26  7:51     ` Paolo Valente
@ 2012-10-26 16:51     ` Paolo Valente
  2012-10-28 12:45       ` Cong Wang
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Valente @ 2012-10-26 16:51 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Stephen Hemminger, Eric Dumazet, Fabio Checconi

On 23/10/2012 10:53, Cong Wang wrote:
> On Tue, 2012-10-23 at 09:09 +0200, Paolo Valente wrote:
>> The crash you reported is one of the problems I tried to solve with my last fixes.
>> After those fixes I could not reproduce this crash (and other crashes) any more, but of course I am still missing something.
> 
> I am using the latest net-next, so if your patches are in net-next,
> then the problem of course still exists.
> 
>>
>> On 23 Oct 2012, at 06:15, Cong Wang wrote:
>>
>>> I am not sure if this patch fixes the real problem or just works around
>>> it. At least, after this patch I don't see the crash I reported any more.
>>
>> It is actually a workaround: if the condition that triggers your workaround holds true, then the group data structure is already inconsistent, and qfq is likely not to schedule classes correctly.
>> I will try to reproduce the crash with the steps you suggest, and try to understand what is still wrong as soon as I can.
>>
> 
> OK, I don't pretend I understand qfq. And I can help you to test
> patches.
Here is a possible patch. Could you please give me feedback?

If this patch actually works, there are some issues related to it that I would
like to point out after your (and/or anyone else's) tests.
> 
> Thanks!
> 
> 
---
 net/sched/sch_qfq.c |   84 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 55 insertions(+), 29 deletions(-)

diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index f0dd83c..1fef61a 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -84,18 +84,19 @@
  * grp->index is the index of the group; and grp->slot_shift
  * is the shift for the corresponding (scaled) sigma_i.
  */
-#define QFQ_MAX_INDEX		19
-#define QFQ_MAX_WSHIFT		16
+#define QFQ_MAX_INDEX		24
+#define QFQ_MAX_WSHIFT		12
 
 #define	QFQ_MAX_WEIGHT		(1<<QFQ_MAX_WSHIFT)
-#define QFQ_MAX_WSUM		(2*QFQ_MAX_WEIGHT)
+#define QFQ_MAX_WSUM		(16*QFQ_MAX_WEIGHT)
 
 #define FRAC_BITS		30	/* fixed point arithmetic */
 #define ONE_FP			(1UL << FRAC_BITS)
 #define IWSUM			(ONE_FP/QFQ_MAX_WSUM)
 
-#define QFQ_MTU_SHIFT		11
+#define QFQ_MTU_SHIFT		16	/* because of TSO/GSO */
 #define QFQ_MIN_SLOT_SHIFT	(FRAC_BITS + QFQ_MTU_SHIFT - QFQ_MAX_INDEX)
+#define QFQ_MIN_LMAX		256	/* min possible lmax for a class */
 
 /*
  * Possible group states.  These values are used as indexes for the bitmaps
@@ -231,6 +232,32 @@ static void qfq_update_class_params(struct qfq_sched *q, struct qfq_class *cl,
 	q->wsum += delta_w;
 }
 
+static void qfq_update_reactivate_class(struct qfq_sched *q,
+					struct qfq_class *cl,
+					u32 inv_w, u32 lmax, int delta_w)
+{
+	bool need_reactivation = false;
+	int i = qfq_calc_index(inv_w, lmax);
+
+	if (&q->groups[i] != cl->grp && cl->qdisc->q.qlen > 0) {
+		/*
+		 * shift cl->F back, to not charge the
+		 * class for the not-yet-served head
+		 * packet
+		 */
+		cl->F = cl->S;
+		/* remove class from its slot in the old group */
+		qfq_deactivate_class(q, cl);
+		need_reactivation = true;
+	}
+
+	qfq_update_class_params(q, cl, lmax, inv_w, delta_w);
+
+	if (need_reactivation) /* activate in new group */
+		qfq_activate_class(q, cl, qdisc_peek_len(cl->qdisc));
+}
+
+
 static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 			    struct nlattr **tca, unsigned long *arg)
 {
@@ -238,7 +265,7 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 	struct qfq_class *cl = (struct qfq_class *)*arg;
 	struct nlattr *tb[TCA_QFQ_MAX + 1];
 	u32 weight, lmax, inv_w;
-	int i, err;
+	int err;
 	int delta_w;
 
 	if (tca[TCA_OPTIONS] == NULL) {
@@ -270,16 +297,14 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 
 	if (tb[TCA_QFQ_LMAX]) {
 		lmax = nla_get_u32(tb[TCA_QFQ_LMAX]);
-		if (!lmax || lmax > (1UL << QFQ_MTU_SHIFT)) {
+		if (lmax < QFQ_MIN_LMAX || lmax > (1UL << QFQ_MTU_SHIFT)) {
 			pr_notice("qfq: invalid max length %u\n", lmax);
 			return -EINVAL;
 		}
 	} else
-		lmax = 1UL << QFQ_MTU_SHIFT;
+		lmax = psched_mtu(qdisc_dev(sch));
 
 	if (cl != NULL) {
-		bool need_reactivation = false;
-
 		if (tca[TCA_RATE]) {
 			err = gen_replace_estimator(&cl->bstats, &cl->rate_est,
 						    qdisc_root_sleeping_lock(sch),
@@ -291,24 +316,8 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 		if (lmax == cl->lmax && inv_w == cl->inv_w)
 			return 0; /* nothing to update */
 
-		i = qfq_calc_index(inv_w, lmax);
 		sch_tree_lock(sch);
-		if (&q->groups[i] != cl->grp && cl->qdisc->q.qlen > 0) {
-			/*
-			 * shift cl->F back, to not charge the
-			 * class for the not-yet-served head
-			 * packet
-			 */
-			cl->F = cl->S;
-			/* remove class from its slot in the old group */
-			qfq_deactivate_class(q, cl);
-			need_reactivation = true;
-		}
-
-		qfq_update_class_params(q, cl, lmax, inv_w, delta_w);
-
-		if (need_reactivation) /* activate in new group */
-			qfq_activate_class(q, cl, qdisc_peek_len(cl->qdisc));
+		qfq_update_reactivate_class(q, cl, inv_w, lmax, delta_w);
 		sch_tree_unlock(sch);
 
 		return 0;
@@ -663,9 +672,17 @@ static void qfq_make_eligible(struct qfq_sched *q, u64 old_V)
 
 
 /*
- * XXX we should make sure that slot becomes less than 32.
- * This is guaranteed by the input values.
- * roundedS is always cl->S rounded on grp->slot_shift bits.
+ * If the weight and lmax (max_pkt_size) of the classes do not change,
+ * then QFQ guarantees that the slot index is never higher than 2 +
+ * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) * (QFQ_MAX_WEIGHT/QFQ_MAX_WSUM).
+ *
+ * With the current values of the above constants, the index is
+ * then guaranteed to never be higher than 2 + 256 * (1 / 16) = 18.
+ *
+ * Even if the weight and/or lmax of some class change, the index
+ * should not, however, happen to be higher than 30, which is the
+ * critical threshold above which the full_slots bitmap may get
+ * corrupted.
  */
 static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl,
 			    u64 roundedS)
@@ -673,6 +690,8 @@ static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl,
 	u64 slot = (roundedS - grp->S) >> grp->slot_shift;
 	unsigned int i = (grp->front + slot) % QFQ_MAX_SLOTS;
 
+	BUG_ON(slot > 30);
+
 	hlist_add_head(&cl->next, &grp->slots[i]);
 	__set_bit(slot, &grp->full_slots);
 }
@@ -892,6 +911,13 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	}
 	pr_debug("qfq_enqueue: cl = %x\n", cl->common.classid);
 
+	if (unlikely(cl->lmax < qdisc_pkt_len(skb))) {
+		pr_notice("qfq: increasing class max_pkt_size from %u to %u\n",
+			  cl->lmax, qdisc_pkt_len(skb));
+		qfq_update_reactivate_class(q, cl, cl->inv_w,
+					    qdisc_pkt_len(skb), 0);
+	}
+
 	err = qdisc_enqueue(skb, cl->qdisc);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		pr_debug("qfq_enqueue: enqueue failed %d\n", err);
-- 
1.7.9.5
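
As a sanity check of the slot-index bound claimed in the comment above
qfq_slot_insert(), the arithmetic with the new constants (a standalone
restatement; note the multiplication is done before the division to
avoid truncation):

#include <stdio.h>

int main(void)
{
	unsigned int mtu_shift  = 16;		/* QFQ_MTU_SHIFT */
	unsigned int min_lmax   = 256;		/* QFQ_MIN_LMAX */
	unsigned int max_weight = 1 << 12;	/* QFQ_MAX_WEIGHT */
	unsigned int max_wsum   = 16 * max_weight; /* QFQ_MAX_WSUM */

	unsigned int bound = 2 + ((1u << mtu_shift) / min_lmax)
				 * max_weight / max_wsum;

	printf("max slot index = %u\n", bound);	/* 18, under the 30 limit */
	return 0;
}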


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-26 16:51     ` [RFC PATCH net-next] qfq: handle the case that front slot is empty Paolo Valente
@ 2012-10-28 12:45       ` Cong Wang
  2012-10-28 16:07         ` Paolo Valente
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2012-10-28 12:45 UTC (permalink / raw)
  To: Paolo Valente; +Cc: netdev, Stephen Hemminger, Eric Dumazet, Fabio Checconi

On Fri, 2012-10-26 at 18:51 +0200, Paolo Valente wrote:
> Here is a possible patch. Could you please give me feedback?
> 
> If this patch actually works, there are some issues related to it that I would
> like to point out after your (and/or anyone else's) tests.

With this patch applied, I can't reproduce the crash any more.

And the following messages appear in the log:

[  430.956400] qfq: increasing class max_pkt_size from 1514 to 26130
[  430.981887] qfq: increasing class max_pkt_size from 26130 to 31922
[  431.011844] qfq: increasing class max_pkt_size from 1514 to 2962
[  431.060129] qfq: increasing class max_pkt_size from 31922 to 46402
[  431.096241] qfq: increasing class max_pkt_size from 2962 to 4410
[  431.123689] qfq: increasing class max_pkt_size from 4410 to 18890
[  431.134624] qfq: increasing class max_pkt_size from 18890 to 20338
[  431.149573] qfq: increasing class max_pkt_size from 20338 to 39162
[  431.328157] qfq: increasing class max_pkt_size from 46402 to 56538
[  431.339731] qfq: increasing class max_pkt_size from 56538 to 65226
[  431.363699] qfq: increasing class max_pkt_size from 39162 to 44954
[  431.376303] qfq: increasing class max_pkt_size from 44954 to 46402
[  431.482082] qfq: increasing class max_pkt_size from 46402 to 62330
[  431.520339] qfq: increasing class max_pkt_size from 62330 to 65226


* Re: [RFC PATCH net-next] qfq: handle the case that front slot is empty
  2012-10-28 12:45       ` Cong Wang
@ 2012-10-28 16:07         ` Paolo Valente
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Valente @ 2012-10-28 16:07 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Stephen Hemminger, Eric Dumazet, Fabio Checconi


On 28 Oct 2012, at 13:45, Cong Wang wrote:

> On Fri, 2012-10-26 at 18:51 +0200, Paolo Valente wrote:
>> Here is a possible patch. Could you please give me feedback?
>> 
>> If this patch actually works, there are some issues related to it that I would
>> like to point out after your (and/or anyone else's) tests.
> 
> With this patch applied, I can't reproduce the crash any more.
Thank you for the test.
> 
> And the following messages appear in the log:
> 
> [  430.956400] qfq: increasing class max_pkt_size from 1514 to 26130
> [  430.981887] qfq: increasing class max_pkt_size from 26130 to 31922
> [  431.011844] qfq: increasing class max_pkt_size from 1514 to 2962
> [  431.060129] qfq: increasing class max_pkt_size from 31922 to 46402
> [  431.096241] qfq: increasing class max_pkt_size from 2962 to 4410
> [  431.123689] qfq: increasing class max_pkt_size from 4410 to 18890
> [  431.134624] qfq: increasing class max_pkt_size from 18890 to 20338
> [  431.149573] qfq: increasing class max_pkt_size from 20338 to 39162
> [  431.328157] qfq: increasing class max_pkt_size from 46402 to 56538
> [  431.339731] qfq: increasing class max_pkt_size from 56538 to 65226
> [  431.363699] qfq: increasing class max_pkt_size from 39162 to 44954
> [  431.376303] qfq: increasing class max_pkt_size from 44954 to 46402
> [  431.482082] qfq: increasing class max_pkt_size from 46402 to 62330
> [  431.520339] qfq: increasing class max_pkt_size from 62330 to 65226
> 
> 
With each of these messages, qfq now informs the user that the configured max_pkt_size is too low with respect to the actual size of the last-arrived packet, and that max_pkt_size has been increased automatically to handle the larger packet correctly. In the log that you report, the large packets that cause qfq to increase the max_pkt_size are due to TSO/GSO.
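
For what it is worth, with this patch applied the messages can also be
avoided up front by configuring each class with an lmax that already
covers GSO-sized packets, along these lines (the device name is
hypothetical; weight/maxpkt are iproute2's qfq class options):

tc qdisc add dev eth0 root handle 1: qfq
tc class add dev eth0 parent 1: classid 1:1 qfq weight 10 maxpkt 65226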

Any feedback about these notifications is welcome. Meanwhile I will prepare a description of the patch and of the limitations it imposes.


* [PATCH v2 net-next] net_sched: more precise pkt_len computation
  2012-10-26  9:32         ` [PATCH net-next] net_sched: more precise pkt_len computation Eric Dumazet
  2012-10-26 11:11           ` Eric Dumazet
@ 2013-01-10 22:36           ` Eric Dumazet
  2013-01-10 22:58             ` David Miller
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-01-10 22:36 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Stephen Hemminger, Patrick McHardy, Paolo Valente,
	Jamal Hadi Salim, Herbert Xu

From: Eric Dumazet <edumazet@google.com>

One long standing problem with TSO/GSO/GRO packets is that skb->len
doesn't represent a precise number of bytes on the wire.

Headers are only accounted for the first segment.
For TCP, that's typically 66 bytes missing per 1448-byte segment,
an error of 4.5% for a normal MSS value.

As consequences:

1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
2) Device stats are slightly underestimated as well.

Fix this by taking account of headers in qdisc_skb_cb(skb)->pkt_len
computation.

Packet schedulers should use qdisc pkt_len instead of skb->len for their
bandwidth limitations, and TSO-enabled device drivers could use pkt_len
if their statistics are not hardware assisted, and if they don't scratch
skb->cb[]'s first word.

Both egress and ingress paths work, thanks to commit fda55eca5a
(net: introduce skb_transport_header_was_set()): if GRO built
a GSO packet, it also set the transport header for us.


Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Patrick McHardy <kaber@trash.net>
---
 net/core/dev.c |   22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 594830e..3625f97 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2532,6 +2532,26 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 	return netdev_get_tx_queue(dev, queue_index);
 }
 
+static void qdisc_pkt_len_init(struct sk_buff *skb)
+{
+	const struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	qdisc_skb_cb(skb)->pkt_len = skb->len;
+
+	/* To get more precise estimation of bytes sent on wire,
+	 * we add to pkt_len the headers size of all segments
+	 */
+	if (shinfo->gso_size)  {
+		unsigned int hdr_len = skb_transport_offset(skb);
+
+		if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
+			hdr_len += tcp_hdrlen(skb);
+		else
+			hdr_len += sizeof(struct udphdr);
+		qdisc_skb_cb(skb)->pkt_len += (shinfo->gso_segs - 1) * hdr_len;
+	}
+}
+
 static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				 struct net_device *dev,
 				 struct netdev_queue *txq)
@@ -2540,7 +2560,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	bool contended;
 	int rc;
 
-	qdisc_skb_cb(skb)->pkt_len = skb->len;
+	qdisc_pkt_len_init(skb);
 	qdisc_calculate_pkt_len(skb, q);
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
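
To see why schedulers should charge pkt_len rather than skb->len, a toy
userspace token-bucket comparison (assumed values; in the kernel the
accessor for qdisc_skb_cb(skb)->pkt_len is qdisc_pkt_len()):

#include <stdio.h>

int main(void)
{
	long budget = 1000000;		/* bytes allowed per interval */
	unsigned int skb_len = 65226;	/* TSO packet, headers counted once */
	unsigned int pkt_len = 68130;	/* same packet, real wire bytes */
	unsigned int n_skb = 0, n_pkt = 0;

	for (long t = budget; t >= (long)skb_len; t -= skb_len)
		n_skb++;		/* charging skb->len */
	for (long t = budget; t >= (long)pkt_len; t -= pkt_len)
		n_pkt++;		/* charging pkt_len */

	/* skb->len accounting admits more real wire bytes than the budget */
	printf("packets admitted: skb->len=%u, pkt_len=%u\n", n_skb, n_pkt);
	return 0;
}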


* Re: [PATCH v2 net-next] net_sched: more precise pkt_len computation
  2013-01-10 22:36           ` [PATCH v2 " Eric Dumazet
@ 2013-01-10 22:58             ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2013-01-10 22:58 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, shemminger, kaber, paolo.valente, jhs, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 10 Jan 2013 14:36:42 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> One long standing problem with TSO/GSO/GRO packets is that skb->len
> doesn't represent a precise number of bytes on the wire.
> 
> Headers are only accounted for the first segment.
> For TCP, that's typically 66 bytes missing per 1448-byte segment,
> an error of 4.5% for a normal MSS value.
> 
> As consequences:
> 
> 1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
> 2) Device stats are slightly underestimated as well.
> 
> Fix this by taking account of headers in qdisc_skb_cb(skb)->pkt_len
> computation.
> 
> Packet schedulers should use qdisc pkt_len instead of skb->len for their
> bandwidth limitations, and TSO-enabled device drivers could use pkt_len
> if their statistics are not hardware assisted, and if they don't scratch
> skb->cb[]'s first word.
> 
> Both egress and ingress paths work, thanks to commit fda55eca5a
> (net: introduce skb_transport_header_was_set()): if GRO built
> a GSO packet, it also set the transport header for us.
> 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Looks good, applied, thanks Eric.



Thread overview: 12+ messages
2012-10-23  4:15 [RFC PATCH net-next] qfq: handle the case that front slot is empty Cong Wang
2012-10-23  7:09 ` Paolo Valente
2012-10-23  8:53   ` Cong Wang
2012-10-26  7:51     ` Paolo Valente
2012-10-26  8:10       ` Eric Dumazet
2012-10-26  9:32         ` [PATCH net-next] net_sched: more precise pkt_len computation Eric Dumazet
2012-10-26 11:11           ` Eric Dumazet
2013-01-10 22:36           ` [PATCH v2 " Eric Dumazet
2013-01-10 22:58             ` David Miller
2012-10-26 16:51     ` [RFC PATCH net-next] qfq: handle the case that front slot is empty Paolo Valente
2012-10-28 12:45       ` Cong Wang
2012-10-28 16:07         ` Paolo Valente
