linux-kernel.vger.kernel.org archive mirror
* include/linux/pcounter.h
@ 2008-02-04  9:44 Andrew Morton
  2008-02-05  0:20 ` include/linux/pcounter.h David Miller
  2008-02-16  3:37 ` include/linux/pcounter.h Andrew Morton
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2008-02-04  9:44 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller; +Cc: netdev, linux-kernel


e7d0362dd41e760f340c1b500646cc92522bd9d5 should have been folded into
de4d1db369785c29d68915edfee0cb70e8199f4c prior to merging.  We now and
forever have a window of breakage which screws up git bisection.  Which I
just hit.  Which is the only reason I discovered the file's existence.


Please do not merge pieces of generic kernel infrastructure while keeping
it all secret on the netdev list.  Ever.  That code has a number of deficiencies
which I and probably others would have noted had it been offered to us
for review.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-04  9:44 include/linux/pcounter.h Andrew Morton
@ 2008-02-05  0:20 ` David Miller
  2008-02-05  0:43   ` include/linux/pcounter.h Andrew Morton
  2008-02-16  3:37 ` include/linux/pcounter.h Andrew Morton
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2008-02-05  0:20 UTC (permalink / raw)
  To: akpm; +Cc: herbert, netdev, linux-kernel

From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 4 Feb 2008 01:44:02 -0800

> Please do not merge pieces of generic kernel infrastructure while
> keeping it all secret on the netdev list.  Ever.

It was so damn secret that it sat in your -mm tree for months.

Don't be ridiculous, Andrew.


* Re: include/linux/pcounter.h
  2008-02-05  0:20 ` include/linux/pcounter.h David Miller
@ 2008-02-05  0:43   ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2008-02-05  0:43 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, linux-kernel

On Mon, 04 Feb 2008 16:20:35 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Mon, 4 Feb 2008 01:44:02 -0800
> 
> > Please do not merge pieces of generic kernel infrastructure while
> > keeping it all secret on the netdev list.  Ever.
> 
> It was so damn secret that it sat in your -mm tree for months.

I never noticed it and I doubt if anyone else did.  How was I (or anyone
else) to have known?

> Don't be ridiculous, Andrew.

There is nothing ridiculous about requiring that new generic kernel
infrastructure patches be appropriately submitted and reviewed.



* Re: include/linux/pcounter.h
  2008-02-04  9:44 include/linux/pcounter.h Andrew Morton
  2008-02-05  0:20 ` include/linux/pcounter.h David Miller
@ 2008-02-16  3:37 ` Andrew Morton
  2008-02-16 10:07   ` include/linux/pcounter.h Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-02-16  3:37 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, netdev, linux-kernel


- First up, why was this added at all?  We have percpu_counter.h which
  has several years development invested in it.  afaict it would suit the
  present applications of pcounters.

  If some deficiency in percpu_counters has been identified, is it
  possible to correct that deficiency rather than implementing a whole new
  counter type?  That would be much better.

- The comments in pcounter.h appear to indicate that there is a
  performance advantage (and we infer that the advantage is when the
  statically-allocated flavour of pcounters is used).  When compared with
  percpu_counters the number of data-reference indirections is the same as
  with percpu_counters, so no advantage there.

  And, bizarrely, because of a quite inappropriate abstraction toy, both
  flavours of pcounters add an indirect function call which I believe is
  significantly more expensive than a plain old pointer indirection.

  So it's quite possible that DEFINE_PCOUNTER-style counters consume more
  memory and are slower than percpu_counters.  They will surely be much
  slower on the read side.  More below.

  If we really want to put some helper wrappers around
  DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a
  standalone thing and not attempt to graft the same interface onto two
  quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu)

- The comment "2)" in pcounter.h (which overflows 80 cols and probably
  wasn't passed through checkpatch) indicates that some other
  implementation (presumably plain old DEFINE_PER_CPU) will use
  NR_CPUS*(32+sizeof(void *)) bytes of storage.  But DEFINE_PCOUNTER will
  use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style
  pcounters and percpu_counters use
  num_possible_cpus()*sizeof(s32)+epsilon.

- The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to
  several well-known compilation risks which I always forget.  Should be
  converted to regular static inlines.
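
  The static-inline point can be illustrated outside the kernel.  The
  following is a hedged userspace sketch (the names mirror the pcounter
  API but are illustrative, not the actual header contents): an empty
  macro stub never evaluates or type-checks its arguments, so broken
  callers compile silently on UP builds, whereas a static inline costs
  the same zero instructions after inlining but keeps the compiler's
  checking.

```c
#include <assert.h>

struct pcounter { int val; };

/* Macro-style stub (the kind being criticized): arguments are never
 * evaluated or type-checked, so broken callers compile silently. */
#define pcounter_add_macro(self, inc) do { } while (0)

/* Static-inline stub: same zero cost after inlining, but the compiler
 * still type-checks the arguments and flags unused variables. */
static inline void pcounter_add_inline(struct pcounter *self, int inc)
{
	(void)self;
	(void)inc;
}

/* UP "implementation": a plain integer behind the same interface. */
static inline void pcounter_add_up(struct pcounter *self, int inc)
{
	self->val += inc;
}

static inline int pcounter_getval_up(const struct pcounter *self)
{
	return self->val;
}
```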

- the comment in lib/pcounter.c needlessly exceeds 80 cols.

- pcounter_dyn_add() will spew a
  use-of-smp_processor_id()-in-preemptible-code warning if used in places
  where one could reasonably use it.  The interface could do with a bit of
  a rethink.  Or at least some justification and documentation.

- pcounter_getval() will be disastrously inefficient if
  num_possible_cpus() is much greater than num_online_cpus().  It should
  use for_each_online_cpu() (as does percpu_counter), and implement a CPU
  hotplug notifier (as does percpu_counter).

  It will remain grossly inefficient at high CPU counts, unlike
  percpu_counters, which solved this problem by doing a batched spill into
  a central counter at add/sub time.

  The danger here is that someone will actually use this interface in new
  code.  Six months later (when it's too late to fix it) the big-NUMA guys
  come up and say "whaa, when our user does <this> it disabled interrupts
  for N milliseconds".
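
  For reference, the batched-spill scheme percpu_counters use can be
  sketched in plain userspace C.  NCPU, BATCH and all names below are
  illustrative; the real code keeps the deltas in per-cpu storage and
  takes a spinlock around the central update:

```c
#include <assert.h>
#include <stdint.h>

#define NCPU  4
#define BATCH 32

struct batched_counter {
	int64_t total;          /* central counter, updated rarely */
	int32_t local[NCPU];    /* per-cpu deltas, updated on every add */
};

static void counter_add(struct batched_counter *c, int cpu, int32_t amount)
{
	int32_t v = c->local[cpu] + amount;

	if (v >= BATCH || v <= -BATCH) {  /* spill: rare slow path */
		c->total += v;
		c->local[cpu] = 0;
	} else {                          /* common fast path */
		c->local[cpu] = v;
	}
}

/* Read side: fold in the not-yet-spilled deltas.  The kernel version
 * walks for_each_online_cpu() here, so the read cost is bounded by the
 * batch size times the online CPU count, not by the add rate. */
static int64_t counter_sum(const struct batched_counter *c)
{
	int64_t sum = c->total;
	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += c->local[cpu];
	return sum;
}
```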

- pcounter_getval() can return incorrect negative numbers.  This can
  cause caller malfunctions in very rare situations because callers just
  don't expect the things which they're counting to go negative.

  We experienced this during the evolution of percpu_counter.  See
  percpu_counter_sum_positive() and friends.
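
  A hedged illustration of the failure mode (the struct and names are
  invented for the example, not taken from either implementation): a
  decrement can spill into the central total before the matching
  increments leave another CPU's local delta, so a raw read is
  transiently negative even though the counted quantity never was.
  The percpu_counter_sum_positive()-style fix is simply to clamp.

```c
#include <assert.h>
#include <stdint.h>

struct counter_snapshot {
	int64_t central;         /* e.g. -3 after a -32 spill raced ahead */
	int32_t not_yet_spilled; /* e.g. +5 still pending in a local delta */
};

static int64_t raw_read(const struct counter_snapshot *s)
{
	return s->central;               /* what a naive reader sees */
}

/* Never report a negative count to callers that can't handle one. */
static int64_t read_positive(const struct counter_snapshot *s)
{
	return s->central < 0 ? 0 : s->central;
}
```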

- pcounter_alloc() should return -ENOMEM on allocation error, not "1".

- pcounter_free() perhaps shouldn't test for (self->per_cpu_values !=
  NULL), because callers shouldn't be calling it if pcounter_alloc() failed
  (arguable).



afaict the whole implementation can and should be removed and replaced with
percpu_counters.  I don't think there's much point in its ability to manage
DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and
can return negative values) and quite a bit of new code will need to be put
in place to address that.

But perhaps there are plans to evolve it into something further in the
future, I don't know.  But I would suggest that the people who have worked
upon percpu_counters (principally Gautham, Peter Z, clameter and me) be
involved in that work.



* Re: include/linux/pcounter.h
  2008-02-16  3:37 ` include/linux/pcounter.h Andrew Morton
@ 2008-02-16 10:07   ` Eric Dumazet
  2008-02-16 10:50     ` include/linux/pcounter.h Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2008-02-16 10:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel

Andrew Morton wrote:
> - First up, why was this added at all?  We have percpu_counter.h which
>   has several years development invested in it.  afaict it would suit the
>   present applications of pcounters.
> 
>   If some deficiency in percpu_counters has been identified, is it
>   possible to correct that deficiency rather than implementing a whole new
>   counter type?  That would be much better.
> 
> - The comments in pcounter.h appear to indicate that there is a
>   performance advantage (and we infer that the advantage is when the
>   statically-allocated flavour of pcounters is used).  When compared with
>   percpu_counters the number of data-reference indirections is the same as
>   with percpu_counters, so no advantage there.
> 
>   And, bizarrely, because of a quite inappropriate abstraction toy, both
>   flavours of pcounters add an indirect function call which I believe is
>   significantly more expensive than a plain old pointer indirection.
> 
>   So it's quite possible that DEFINE_PCOUNTER-style counters consume more
>   memory and are slower than percpu_counters.  They will surely be much
>   slower on the read side.  More below.
> 
>   If we really want to put some helper wrappers around
>   DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a
>   standalone thing and not attempt to graft the same interface onto two
>   quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu)
> 
> - The comment "2)" in pcounter.h (which overflows 80 cols and probably
>   wasn't passed through checkpatch) indicates that some other
>   implementation (presumably plain old DEFINE_PER_CPU) will use
>   NR_CPUS*(32+sizeof(void *)) bytes of storage.  But DEFINE_PCOUNTER will
>   use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style
>   pcounters and percpu_counters use
>   num_possible_cpus()*sizeof(s32)+epsilon.
> 
> - The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to
>   several well-known compilation risks which I always forget.  Should be
>   converted to regular static inlines.
> 
> - the comment in lib/pcounter.c needlessly exceeds 80 cols.
> 
> - pcounter_dyn_add() will spew a
>   use-of-smp_processor_id()-in-preemptible-code warning if used in places
>   where one could reasonably use it.  The interface could do with a bit of
>   a rethink.  Or at least some justification and documentation.
> 
> - pcounter_getval() will be disastrously inefficient if
>   num_possible_cpus() is much greater than num_online_cpus().  It should
>   use for_each_online_cpu() (as does percpu_counter), and implement a CPU
>   hotplug notifier (as does percpu_counter).
> 
>   It will remain grossly inefficient at high CPU counts, unlike
>   percpu_counters, which solved this problem by doing a batched spill into
>   a central counter at add/sub time.
> 
>   The danger here is that someone will actually use this interface in new
>   code.  Six months later (when it's too late to fix it) the big-NUMA guys
>   come up and say "whaa, when our user does <this> it disabled interrupts
>   for N milliseconds".
> 
> - pcounter_getval() can return incorrect negative numbers.  This can
>   cause caller malfunctions in very rare situations because callers just
>   don't expect the things which they're counting to go negative.
> 
>   We experienced this during the evolution of percpu_counter.  See
>   percpu_counter_sum_positive() and friends.
> 
> - pcounter_alloc() should return -ENOMEM on allocation error, not "1".
> 
> - pcounter_free() perhaps shouldn't test for (self->per_cpu_values !=
>   NULL), because callers shouldn't be calling it if pcounter_alloc() failed
>   (arguable).
> 
> 
> 
> afaict the whole implementation can and should be removed and replaced with
> percpu_counters.  I don't think there's much point in its ability to manage
> DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and
> can return negative values) and quite a bit of new code will need to be put
> in place to address that.
> 
> But perhaps there are plans to evolve it into something further in the
> future, I don't know.  But I would suggest that the people who have worked
> upon percpu_counters (principally Gautham, Peter Z, clameter and me) be
> involved in that work.
> 

Andrew, pcounter is a temporary abstraction.

It is temporary because it will vanish as soon as Christoph Lameter (or 
somebody else) provides a really cheap per-cpu counter implementation.

At the time I introduced it in the network tree (locally, not meant to invade 
kernel land and make you unhappy :) ), the goals were:

Some counters (the total sockets count) were a single integer that was doing 
ping-pong between cpus (SMP/NUMA).  As they are basically lazy values (we 
don't really need to read their value), using a plain atomic_t was overkill.

Using plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) 
instead of num_possible_cpus()*4).

Using 'online' instead of 'possible' stuff is not really needed for a 
temporary thing.

- We don't care about the read side.  We want a really fast write side.  Really fast.

The read side is when you do a "cat /proc/net/sockstat".
That is... once in a while...

Now when we allocate a new socket, the code to increment the "socket count" is:

c03a74a8 <tcp_pcounter_add>:
c03a74a8:   b8 90 26 5f c0          mov    $0xc05f2690,%eax
c03a74ad:   64 8b 0d 10 f1 5e c0    mov    %fs:0xc05ef110,%ecx
c03a74b4:   01 14 01                add    %edx,(%ecx,%eax,1)
c03a74b7:   c3                      ret

That is 4 instructions.  It could be two in the future, thanks to current work 
on fs/gs-based percpu variables.
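
For readers without a disassembler handy, a rough userspace analogy for the
%fs-relative add above is a __thread (TLS) variable, which is also addressed
through a segment register on x86 and so also compiles to a handful of
lock-free instructions.  This is only an analogy sketched by the editor --
per-cpu data is per-CPU, not per-thread, and the names are invented:

```c
#include <assert.h>

static __thread int sockets_inuse;  /* TLS: %fs/%gs-relative, like per-cpu */

static void sockets_add(int amount)
{
	sockets_inuse += amount;    /* one segment-relative add on x86 */
}

static int sockets_get(void)
{
	return sockets_inuse;
}
```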

The current percpu_counters implementation is more expensive:

c021467b <__percpu_counter_add>:
c021467b:       55                      push   %ebp
c021467c:       57                      push   %edi
c021467d:       89 c7                   mov    %eax,%edi
c021467f:       56                      push   %esi
c0214680:       53                      push   %ebx
c0214681:       83 ec 04                sub    $0x4,%esp
c0214684:       8b 40 14                mov    0x14(%eax),%eax
c0214687:       64 8b 1d 08 f0 5e c0    mov    %fs:0xc05ef008,%ebx
c021468e:       8b 6c 24 18             mov    0x18(%esp),%ebp
c0214692:       f7 d0                   not    %eax
c0214694:       8b 1c 98                mov    (%eax,%ebx,4),%ebx
c0214697:       89 1c 24                mov    %ebx,(%esp)
c021469a:       8b 03                   mov    (%ebx),%eax
c021469c:       89 c3                   mov    %eax,%ebx
c021469e:       89 c6                   mov    %eax,%esi
c02146a0:       c1 fe 1f                sar    $0x1f,%esi
c02146a3:       89 e8                   mov    %ebp,%eax
c02146a5:       01 d3                   add    %edx,%ebx
c02146a7:       11 ce                   adc    %ecx,%esi
c02146a9:       99                      cltd
c02146aa:       39 d6                   cmp    %edx,%esi
c02146ac:       7f 15                   jg     c02146c3 <__percpu_counter_add+0x48>
c02146ae:       7c 04                   jl     c02146b4 <__percpu_counter_add+0x39>
c02146b0:       39 eb                   cmp    %ebp,%ebx
c02146b2:       73 0f                   jae    c02146c3 <__percpu_counter_add+0x48>
c02146b4:       f7 dd                   neg    %ebp
c02146b6:       89 e8                   mov    %ebp,%eax
c02146b8:       99                      cltd
c02146b9:       39 d6                   cmp    %edx,%esi
c02146bb:       7f 20                   jg     c02146dd <__percpu_counter_add+0x62>
c02146bd:       7c 04                   jl     c02146c3 <__percpu_counter_add+0x48>
c02146bf:       39 eb                   cmp    %ebp,%ebx
c02146c1:       77 1a                   ja     c02146dd <__percpu_counter_add+0x62>
c02146c3:       89 f8                   mov    %edi,%eax
c02146c5:       e8 04 cc 1f 00          call   c04112ce <_spin_lock>
c02146ca:       01 5f 04                add    %ebx,0x4(%edi)
c02146cd:       11 77 08                adc    %esi,0x8(%edi)
c02146d0:       8b 04 24                mov    (%esp),%eax
c02146d3:       c7 00 00 00 00 00       movl   $0x0,(%eax)
c02146d9:       fe 07                   incb   (%edi)
c02146db:       eb 05                   jmp    c02146e2 <__percpu_counter_add+0x67>
c02146dd:       8b 04 24                mov    (%esp),%eax
c02146e0:       89 18                   mov    %ebx,(%eax)
c02146e2:       58                      pop    %eax
c02146e3:       5b                      pop    %ebx
c02146e4:       5e                      pop    %esi
c02146e5:       5f                      pop    %edi
c02146e6:       5d                      pop    %ebp
c02146e7:       c3                      ret


Once it is better, just make pcounter vanish.

It is even clearly stated at the top of include/linux/pcounter.h

/*
  * Using a dynamic percpu 'int' variable has a cost :
  * 1) Extra dereference
  * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
  * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4
  *
  * This pcounter implementation is an abstraction to be able to use
  * either a static or a dynamic per cpu variable.
  * One dynamic per cpu variable gets a fast & cheap implementation, we can
  * change pcounter implementation too.
  */


We all agree.

Thank you


* Re: include/linux/pcounter.h
  2008-02-16 10:07   ` include/linux/pcounter.h Eric Dumazet
@ 2008-02-16 10:50     ` Andrew Morton
  2008-02-16 12:03       ` include/linux/pcounter.h Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-02-16 10:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel

On Sat, 16 Feb 2008 11:07:29 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:

> 
> Andrew, pcounter is a temporary abstraction.

It's buggy!  Main problems are a) possible return of negative numbers b)
some of the API can't be called from preemptible code c) excessive interrupt-off
time on some machines if used from irq-disabled sections.

> It is temporary because it will vanish as soon as Christoph Lameter (or 
> somebody else) provides a really cheap per-cpu counter implementation.

numbers?

most of percpu_counter_add() is only executed once per FBC_BATCH calls.

> At the time I introduced it in the network tree (locally, not meant to invade 
> kernel land and make you unhappy :) ), the goals were:

Well maybe as a temporary networking-only thing OK, based upon
performance-tested results.  But I don't think the present code is suitable
as part of the kernel-wide toolkit.

> Some counters (the total sockets count) were a single integer that was doing 
> ping-pong between cpus (SMP/NUMA).  As they are basically lazy values (we 
> don't really need to read their value), using a plain atomic_t was overkill.
> 
> Using plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) 
> instead of num_possible_cpus()*4).

No, percpu_counters use alloc_percpu(), which is O(num_possible_cpus), not
O(NR_CPUS).

> Using 'online' instead of 'possible' stuff is not really needed for a 
> temporary thing.

This was put in ./lib/!

> - We don't care about the read side.

Well the present single caller in networking might not care.  But this was
put in ./lib/ and was exported to modules.  That is an invitation to all
kernel developers to use it in new code.  Which may result in truly awful
performance on high-cpu-count machines.

>  We want a really fast write side.  Really fast.

eh?  It's called on a per-connection basis, not on a per-packet basis?

> Read side is when you do a "cat /proc/net/sockstat".
> That is ... once in a while...

For the current single caller.  But it's in ./lib/.

And there's always someone out there who does whatever we don't expect them
to do.

> Now when we allocate a new socket, code to increment the "socket count" is :
> 
> c03a74a8 <tcp_pcounter_add>:
> c03a74a8:   b8 90 26 5f c0          mov    $0xc05f2690,%eax
> c03a74ad:   64 8b 0d 10 f1 5e c0    mov    %fs:0xc05ef110,%ecx
> c03a74b4:   01 14 01                add    %edx,(%ecx,%eax,1)
> c03a74b7:   c3                      ret

I can't find that code.  I suspect that's the DEFINE_PER_CPU flavour, which
isn't used anywhere afaict.  Plus this omits the local_irq_save/restore (or
preempt_disable/enable) and the indirect function call, which can be
expensive.

> That is 4 instructions.  It could be two in the future, thanks to current work 
> on fs/gs-based percpu variables.
> 
> Current percpu_counters implementation is more expensive :
> 
> c021467b <__percpu_counter_add>:
> c021467b:       55                      push   %ebp
> c021467c:       57                      push   %edi
> c021467d:       89 c7                   mov    %eax,%edi
> c021467f:       56                      push   %esi
> c0214680:       53                      push   %ebx
> c0214681:       83 ec 04                sub    $0x4,%esp
> c0214684:       8b 40 14                mov    0x14(%eax),%eax
> c0214687:       64 8b 1d 08 f0 5e c0    mov    %fs:0xc05ef008,%ebx
> c021468e:       8b 6c 24 18             mov    0x18(%esp),%ebp
> c0214692:       f7 d0                   not    %eax
> c0214694:       8b 1c 98                mov    (%eax,%ebx,4),%ebx
> c0214697:       89 1c 24                mov    %ebx,(%esp)
> c021469a:       8b 03                   mov    (%ebx),%eax
> c021469c:       89 c3                   mov    %eax,%ebx
> c021469e:       89 c6                   mov    %eax,%esi
> c02146a0:       c1 fe 1f                sar    $0x1f,%esi
> c02146a3:       89 e8                   mov    %ebp,%eax
> c02146a5:       01 d3                   add    %edx,%ebx
> c02146a7:       11 ce                   adc    %ecx,%esi
> c02146a9:       99                      cltd
> c02146aa:       39 d6                   cmp    %edx,%esi
> c02146ac:       7f 15                   jg     c02146c3 <__percpu_counter_add+0x48>
> c02146ae:       7c 04                   jl     c02146b4 <__percpu_counter_add+0x39>

One of the above two branches is taken ((FBC_BATCH-1)/FBC_BATCH)ths of the
time.

> c02146b0:       39 eb                   cmp    %ebp,%ebx
> c02146b2:       73 0f                   jae    c02146c3 <__percpu_counter_add+0x48>
> c02146b4:       f7 dd                   neg    %ebp
> c02146b6:       89 e8                   mov    %ebp,%eax
> c02146b8:       99                      cltd
> c02146b9:       39 d6                   cmp    %edx,%esi
> c02146bb:       7f 20                   jg     c02146dd <__percpu_counter_add+0x62>
> c02146bd:       7c 04                   jl     c02146c3 <__percpu_counter_add+0x48>
> c02146bf:       39 eb                   cmp    %ebp,%ebx
> c02146c1:       77 1a                   ja     c02146dd <__percpu_counter_add+0x62>
> c02146c3:       89 f8                   mov    %edi,%eax
> c02146c5:       e8 04 cc 1f 00          call   c04112ce <_spin_lock>
> c02146ca:       01 5f 04                add    %ebx,0x4(%edi)
> c02146cd:       11 77 08                adc    %esi,0x8(%edi)
> c02146d0:       8b 04 24                mov    (%esp),%eax
> c02146d3:       c7 00 00 00 00 00       movl   $0x0,(%eax)
> c02146d9:       fe 07                   incb   (%edi)
> c02146db:       eb 05                   jmp    c02146e2 <__percpu_counter_add+0x67>
> c02146dd:       8b 04 24                mov    (%esp),%eax
> c02146e0:       89 18                   mov    %ebx,(%eax)
> c02146e2:       58                      pop    %eax
> c02146e3:       5b                      pop    %ebx
> c02146e4:       5e                      pop    %esi
> c02146e5:       5f                      pop    %edi
> c02146e6:       5d                      pop    %ebp
> c02146e7:       c3                      ret
> 
> 
> Once it is better, just make pcounter vanish.

Some of the stuff in there is from the __percpu_disguise() thing which we
probably can live without.

But I'd be surprised if benchmarking reveals that the pcounter code is
justifiable in its present networking application or indeed in any future
ones.

> It is even clearly stated at the top of include/linux/pcounter.h
> 
> /*
>   * Using a dynamic percpu 'int' variable has a cost :
>   * 1) Extra dereference
>   * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
>   * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4
>   *
>   * This pcounter implementation is an abstraction to be able to use
>   * either a static or a dynamic per cpu variable.
>   * One dynamic per cpu variable gets a fast & cheap implementation, we can
>   * change pcounter implementation too.
>   */
> 
> 
> We all agree.
> 

No we don't.  That comment is afaict wrong about the memory consumption and
the abstraction *isn't useful*.

Why do we want some abstraction which makes alloc_percpu() storage and
DEFINE_PER_CPU storage "look the same"?  What use is there in that?  One is
per-object storage and one is singleton storage - they're quite different
things and they are used in quite different situations and they are
basically never interchangeable.  Yet we add this pretend-they're-the-same
wrapper around them which costs us an indirect function call on the
fastpath.
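
The objected-to shape can be sketched in a few lines (the names are
illustrative, not the actual pcounter layout): the unified interface stores
an `add` function pointer chosen at setup time, so every fast-path increment
pays an indirect call that a direct per-cpu add would not.

```c
#include <assert.h>

struct pcounter_like {
	/* chosen once, depending on static vs dynamic storage */
	void (*add)(struct pcounter_like *self, int inc);
	int value;      /* stand-in for the per-cpu storage */
};

static void pcounter_like_add_impl(struct pcounter_like *self, int inc)
{
	self->value += inc;
}

static inline void pcounter_like_add(struct pcounter_like *self, int inc)
{
	self->add(self, inc);   /* the indirect call on the fast path */
}
```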


* Re: include/linux/pcounter.h
  2008-02-16 10:50     ` include/linux/pcounter.h Andrew Morton
@ 2008-02-16 12:03       ` Eric Dumazet
  2008-02-16 19:26         ` include/linux/pcounter.h Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2008-02-16 12:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel

Andrew Morton wrote:
> On Sat, 16 Feb 2008 11:07:29 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Andrew, pcounter is a temporary abstraction.
> 
> It's buggy!  Main problems are a) possible return of negative numbers b)
> some of the API can't be called from preemptible code c) excessive interrupt-off
> time on some machines if used from irq-disabled sections.

a) We don't care about possibly-off values when reading /proc/net/sockstat.
    The same argument applies to percpu_counters.

b) It is called from network parts where preemption is disabled.

net/ipv4/inet_timewait_sock.c:94:       sock_prot_inuse_add(sk->sk_prot, -1);
net/ipv4/inet_hashtables.c:291: sock_prot_inuse_add(sk->sk_prot, 1);
net/ipv4/inet_hashtables.c:335: sock_prot_inuse_add(sk->sk_prot, 1);
net/ipv4/inet_hashtables.c:357: sock_prot_inuse_add(sk->sk_prot, 1);
net/ipv4/inet_hashtables.c:390:         sock_prot_inuse_add(sk->sk_prot, -1);
net/ipv4/raw.c:95:      sock_prot_inuse_add(sk->sk_prot, 1);
net/ipv4/raw.c:104:             sock_prot_inuse_add(sk->sk_prot, -1);
net/ipv4/udp.c:235:             sock_prot_inuse_add(sk->sk_prot, 1);
net/ipv6/ipv6_sockglue.c:271:   sock_prot_inuse_add(sk->sk_prot, -1);
net/ipv6/ipv6_sockglue.c:272:   sock_prot_inuse_add(&tcp_prot, 1);
net/ipv6/ipv6_sockglue.c:285:   sock_prot_inuse_add(sk->sk_prot, -1);
net/ipv6/ipv6_sockglue.c:286:   sock_prot_inuse_add(prot, 1);
net/ipv6/inet6_hashtables.c:46: sock_prot_inuse_add(sk->sk_prot, 1);
net/ipv6/inet6_hashtables.c:207:        sock_prot_inuse_add(sk->sk_prot, 1);

c) No need to play with irq games here.

> 
>> It is temporary because it will vanish as soon as Christoph Lameter (or 
>> somebody else) provides a really cheap per-cpu counter implementation.
> 
> numbers?
> 
> most of percpu_counter_add() is only executed once per FBC_BATCH calls.



> 
>> At the time I introduced it in the network tree (locally, not meant to invade 
>> kernel land and make you unhappy :) ), the goals were:
> 
> Well maybe as a temporary networking-only thing OK, based upon
> performance-tested results.  But I don't think the present code is suitable
> as part of the kernel-wide toolkit.
> 
>> Some counters (the total sockets count) were a single integer that was doing 
>> ping-pong between cpus (SMP/NUMA).  As they are basically lazy values (we 
>> don't really need to read their value), using a plain atomic_t was overkill.
>>
>> Using plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) 
>> instead of num_possible_cpus()*4).
> 
> No, percpu_counters use alloc_percpu(), which is O(num_possible_cpus), not
> O(NR_CPUS).

You are playing with words.

The array was exactly sizeof(void *) * NR_CPUS (at least when pcounters were 
added in the network tree; commit b3242151906372f30f57feaa43b4cac96a23edb1 was 
done only ten days ago).  Then it needed 32 bytes per possible cpu (maybe less 
now with SLUB).


> 
>> Using 'online' instead of 'possible' stuff is not really needed for a 
>> temporary thing.
> 
> This was put in ./lib/!
> 
>> - We don't care about the read side.
> 
> Well the present single caller in networking might not care.  But this was
> put in ./lib/ and was exported to modules.  That is an invitation to all
> kernel developers to use it in new code.  Which may result in truly awful
> performance on high-cpu-count machines.

> 
>>  We want a really fast write side.  Really fast.
> 
> eh?  It's called on a per-connection basis, not on a per-packet basis?

Yes, on a per-connection basis.  Some workloads want to open/close more than 
1000 sockets per second.

You are right, we also need to reduce all the atomic inc/dec done per packet :)

A pcounter/per_cpu for the device refcounter would be a very nice win, but as 
the number of devices in the system might be very big, David said no to a 
percpu conversion.  We will reconsider this when the new percpu stuff goes in, 
so that the memory cost will be minimal (4 bytes per cpu per device).

> 
>> Read side is when you do a "cat /proc/net/sockstat".
>> That is ... once in a while...
> 
> For the current single caller.  But it's in ./lib/.
> 
> And there's always someone out there who does whatever we don't expect them
> to do.
> 
>> Now when we allocate a new socket, code to increment the "socket count" is :
>>
>> c03a74a8 <tcp_pcounter_add>:
>> c03a74a8:   b8 90 26 5f c0          mov    $0xc05f2690,%eax
>> c03a74ad:   64 8b 0d 10 f1 5e c0    mov    %fs:0xc05ef110,%ecx
>> c03a74b4:   01 14 01                add    %edx,(%ecx,%eax,1)
>> c03a74b7:   c3                      ret
> 
> I can't find that code.  I suspect that's the DEFINE_PER_CPU flavour, which
> isn't used anywhere afaict.  Plus this omits the local_irq_save/restore (or
> preempt_disable/enable) and the indirect function call, which can be
> expensive.

Please do: nm vmlinux | grep tcp_pcounter_add

c03a74a8 t tcp_pcounter_add

It should be there, even if it's a static function.


> 
>> That is 4 instructions.  It could be two in the future, thanks to current work 
>> on fs/gs-based percpu variables.
>>
>> Current percpu_counters implementation is more expensive :
>>
>> c021467b <__percpu_counter_add>:
>> c021467b:       55                      push   %ebp
>> c021467c:       57                      push   %edi
>> c021467d:       89 c7                   mov    %eax,%edi
>> c021467f:       56                      push   %esi
>> c0214680:       53                      push   %ebx
>> c0214681:       83 ec 04                sub    $0x4,%esp
>> c0214684:       8b 40 14                mov    0x14(%eax),%eax
>> c0214687:       64 8b 1d 08 f0 5e c0    mov    %fs:0xc05ef008,%ebx
>> c021468e:       8b 6c 24 18             mov    0x18(%esp),%ebp
>> c0214692:       f7 d0                   not    %eax
>> c0214694:       8b 1c 98                mov    (%eax,%ebx,4),%ebx
>> c0214697:       89 1c 24                mov    %ebx,(%esp)
>> c021469a:       8b 03                   mov    (%ebx),%eax
>> c021469c:       89 c3                   mov    %eax,%ebx
>> c021469e:       89 c6                   mov    %eax,%esi
>> c02146a0:       c1 fe 1f                sar    $0x1f,%esi
>> c02146a3:       89 e8                   mov    %ebp,%eax
>> c02146a5:       01 d3                   add    %edx,%ebx
>> c02146a7:       11 ce                   adc    %ecx,%esi
>> c02146a9:       99                      cltd
>> c02146aa:       39 d6                   cmp    %edx,%esi
>> c02146ac:       7f 15                   jg     c02146c3 <__percpu_counter_add+0x48>
>> c02146ae:       7c 04                   jl     c02146b4 <__percpu_counter_add+0x39>
> 
> One of the above two branches is taken ((FBC_BATCH-1)/FBC_BATCH)ths of the
> time.
> 
> 
>> c02146b0:       39 eb                   cmp    %ebp,%ebx
>> c02146b2:       73 0f                   jae    c02146c3 
>> <__percpu_counter_add+0x48>
>> c02146b4:       f7 dd                   neg    %ebp
>> c02146b6:       89 e8                   mov    %ebp,%eax
>> c02146b8:       99                      cltd
>> c02146b9:       39 d6                   cmp    %edx,%esi
>> c02146bb:       7f 20                   jg     c02146dd 
>> <__percpu_counter_add+0x62>
>> c02146bd:       7c 04                   jl     c02146c3 
>> <__percpu_counter_add+0x48>
>> c02146bf:       39 eb                   cmp    %ebp,%ebx
>> c02146c1:       77 1a                   ja     c02146dd 
>> <__percpu_counter_add+0x62>
>> c02146c3:       89 f8                   mov    %edi,%eax
>> c02146c5:       e8 04 cc 1f 00          call   c04112ce <_spin_lock>
>> c02146ca:       01 5f 04                add    %ebx,0x4(%edi)
>> c02146cd:       11 77 08                adc    %esi,0x8(%edi)
>> c02146d0:       8b 04 24                mov    (%esp),%eax
>> c02146d3:       c7 00 00 00 00 00       movl   $0x0,(%eax)
>> c02146d9:       fe 07                   incb   (%edi)
>> c02146db:       eb 05                   jmp    c02146e2 
>> <__percpu_counter_add+0x67>
>> c02146dd:       8b 04 24                mov    (%esp),%eax
>> c02146e0:       89 18                   mov    %ebx,(%eax)
>> c02146e2:       58                      pop    %eax
>> c02146e3:       5b                      pop    %ebx
>> c02146e4:       5e                      pop    %esi
>> c02146e5:       5f                      pop    %edi
>> c02146e6:       5d                      pop    %ebp
>> c02146e7:       c3                      ret
>>
>>
>> Once it is better, just make pcounter vanish.
> 
> Some of the stuff in there is from the __percpu_disguise() thing which we
> probably can live without.
> 
> But I'd be surprised if benchmarking reveals that the pcounter code is
> justifiable in its present networking application or indeed in any future
> ones.

I have no benchmarks, but I do have real workloads where it matters, and where 
userland eats icache/dcache all the time.

> 
>> It is even clearly stated at the top of include/linux/pcounter.h
>>
>> /*
>>   * Using a dynamic percpu 'int' variable has a cost :
>>   * 1) Extra dereference
>>   * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
>>   * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4
>>   *
>>   * This pcounter implementation is an abstraction to be able to use
>>   * either a static or a dynamic per cpu variable.
>>   * One dynamic per cpu variable gets a fast & cheap implementation, we can
>>   * change pcounter implementation too.
>>   */
>>
>>
>> We all agree.
>>
> 
> No we don't.  That comment is afaict wrong about the memory consumption and
> the abstraction *isn't useful*.

The fact is that we need percpu 32-bit counters, and we need to have pointers to them.

Current percpu_counters cannot cope with that.

> 
> Why do we want some abstraction which makes alloc_percpu() storage and
> DEFINE_PERCPU storage "look the same"?  What use is there in that?  One is
> per-object storage and one is singleton storage - they're quite different
> things and they are used in quite different situations and they are
> basically never interchangeable.  Yet we add this pretend-they're-the-same
> wrapper around them which costs us an indirect function call on the
> fastpath.

I believe all 'pcounter' instances are in fact statically allocated, one per 
'struct proto', to track the inuse count (search for DEFINE_PROTO_INUSE() uses):

# find net include|xargs grep -n DEFINE_PROTO_INUSE
net/dccp/ipv6.c:1105:DEFINE_PROTO_INUSE(dccp_v6)
net/dccp/ipv4.c:920:DEFINE_PROTO_INUSE(dccp_v4)
net/ipv6/udp.c:1001:DEFINE_PROTO_INUSE(udpv6)
net/ipv6/udplite.c:43:DEFINE_PROTO_INUSE(udplitev6)
net/ipv6/raw.c:1187:DEFINE_PROTO_INUSE(rawv6)
net/ipv6/tcp_ipv6.c:2108:DEFINE_PROTO_INUSE(tcpv6)
net/ipv4/udp.c:1477:DEFINE_PROTO_INUSE(udp)
net/ipv4/udplite.c:47:DEFINE_PROTO_INUSE(udplite)
net/ipv4/tcp_ipv4.c:2406:DEFINE_PROTO_INUSE(tcp)
net/ipv4/raw.c:828:DEFINE_PROTO_INUSE(raw)
net/sctp/socket.c:6461:DEFINE_PROTO_INUSE(sctp)
net/sctp/socket.c:6495:DEFINE_PROTO_INUSE(sctpv6)
include/net/sock.h:624:# define DEFINE_PROTO_INUSE(NAME) DEFINE_PCOUNTER(NAME)
include/net/sock.h:644:# define DEFINE_PROTO_INUSE(NAME)

So pcounter_alloc()/pcounter_free() could just be deleted from pcounter.h

Indirect function calls are everywhere in the kernel: network, fs, everywhere.

As soon as we can put the address of a percpu variable in 'struct pcounter', 
we won't need pointers to the pcounter_add()/getval() functions anymore:

mov  (pcounter),%eax   # get the address of the percpu variable
add  %edx,%fs:(%eax)

This just needs the per-cpu work being done by the SGI guys to be finished.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-16 12:03       ` include/linux/pcounter.h Eric Dumazet
@ 2008-02-16 19:26         ` Andrew Morton
  2008-02-16 19:39           ` include/linux/pcounter.h Arjan van de Ven
  2008-02-17  5:54           ` include/linux/pcounter.h David Miller
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2008-02-16 19:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel

On Sat, 16 Feb 2008 13:03:54 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:

> Andrew Morton a écrit :
> > On Sat, 16 Feb 2008 11:07:29 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> > 
> >> Andrew, pcounter is a temporary abstraction.
> > 
> > It's buggy!  Main problems are a) possible return of negative numbers b)
> > some of the API can't be from preemptible code c) excessive interrupt-off
> > time on some machines if used from irq-disabled sections.

This is starting to get tiresome.  Please do not waste my time and everyone
else's by defending the indefensible.

> a) We don't care about possibly-off values when reading /proc/net/sockstat

So take the damn thing out of ./lib/ and hide it down in networking
somewhere.

>     The same arguments apply to percpu_counters.

No it does not.  Not at all.  Callers regularly use percpu_counters to count
the number of some object.  These counts never go negative!

And how the heck can you say that we don't care if /proc/net/sockstat goes
negative?  You don't know that.  Someone's system-monitoring app might very
well take drastic action if it sees connection-related statistics go
negative, or around 4 gigaconnections.

> b) It is called from network parts where preemption is disabled.

It's in ./lib/

> net/ipv4/inet_timewait_sock.c:94:       sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv4/inet_hashtables.c:291: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/inet_hashtables.c:335: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/inet_hashtables.c:357: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/inet_hashtables.c:390:         sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv4/raw.c:95:      sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv4/raw.c:104:             sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv4/udp.c:235:             sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv6/ipv6_sockglue.c:271:   sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv6/ipv6_sockglue.c:272:   sock_prot_inuse_add(&tcp_prot, 1);
> net/ipv6/ipv6_sockglue.c:285:   sock_prot_inuse_add(sk->sk_prot, -1);
> net/ipv6/ipv6_sockglue.c:286:   sock_prot_inuse_add(prot, 1);
> net/ipv6/inet6_hashtables.c:46: sock_prot_inuse_add(sk->sk_prot, 1);
> net/ipv6/inet6_hashtables.c:207:        sock_prot_inuse_add(sk->sk_prot, 1);
> 
> c) No need to play with irq games here.

It's in ./lib/

> > 
> >> It is temporary because it will vanish as soon as Christoph Lameter (or 
> >> somebody else) provides a really cheap per-cpu counter implementation.
> > 
> > numbers?
> > 
> > most of percpu_counter_add() is only executed once per FBC_BATCH calls.
> 
> 
> 
> > 
> >> At the time I introduced it in the network tree (locally, not meant to 
> >> invade kernel land and make you unhappy :) ), the goals were:
> > 
> > Well maybe as a temporary networking-only thing OK, based upon
> > performance-tested results.  But I don't think the present code is suitable
> > as part of the kernel-wide toolkit.
> > 
> >> Some counters (total socket count) were a single integer that was 
> >> ping-ponging between cpus (SMP/NUMA). As they are basically lazy values 
> >> (we don't really need to read their value often), using a plain atomic_t 
> >> was overkill.
> >>
> >> Using plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) 
> >> instead of num_possible_cpus()*4).
> > 
> > No, percpu_counters use alloc_percpu(), which is O(num_possible_cpus), not
> > O(NR_CPUS).
> 
> You are playing with words.

I assumed the comment was comparing with percpu_counters.

Seems that it was comparing with some hand-rolled static array of NR_CPUS
entries or something.

> > 
> >>  We want really fast write side. Real fast.
> > 
> > eh?  It's called on a per-connection basis, not on a per-packet basis?
> 
> Yes, per connection basis. Some workloads want to open/close more than 1000 
> sockets per second.

ie: slowpath

> You are right we also need to reduce all atomic inc/dec done per packet :)
> 
> A pcounter/per_cpu conversion for the device refcounter would be a very nice 
> win, but as the number of devices in the system might be very big, David said 
> no to a percpu conversion. We will reconsider this when the new percpu stuff 
> can go in, so that the memory cost will be minimal (4 bytes per cpu per device).
> 
> > 
> >> Read side is when you do a "cat /proc/net/sockstat".
> >> That is ... once in a while...
> > 
> > For the current single caller.  But it's in ./lib/.
> > 
> > And there's always someone out there who does whatever we don't expect them
> > to do.
> > 
> >> Now when we allocate a new socket, code to increment the "socket count" is :
> >>
> >> c03a74a8 <tcp_pcounter_add>:
> >> c03a74a8:   b8 90 26 5f c0          mov    $0xc05f2690,%eax
> >> c03a74ad:   64 8b 0d 10 f1 5e c0    mov    %fs:0xc05ef110,%ecx
> >> c03a74b4:   01 14 01                add    %edx,(%ecx,%eax,1)
> >> c03a74b7:   c3                      ret
> > 
> > I can't find that code.  I suspect that's the DEFINE_PER_CPU flavour, which
> > isn't used anywhere afaict.  Plus this omits the local_irq_save/restore (or
> > preempt_disable/enable) and the indirect function call, which can be
> > expensive.
> 
> Please do: nm vmlinux | grep tcp_pcounter_add
> 
> c03a74a8 t tcp_pcounter_add
> 
> It should be there, even if it's a static function.
> 

Probably - the macros hide it well.  And the function itself doesn't tell
the whole story: callers must still do preempt_disable or local_irq_disable
and must wear the cost of an indirect call.

> > 
> >> It is even clearly stated at the top of include/linux/pcounter.h
> >>
> >> /*
> >>   * Using a dynamic percpu 'int' variable has a cost :
> >>   * 1) Extra dereference
> >>   * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
> >>   * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of num_possible_cpus()*4
> >>   *
> >>   * This pcounter implementation is an abstraction to be able to use
> >>   * either a static or a dynamic per cpu variable.
> >>   * One dynamic per cpu variable gets a fast & cheap implementation, we can
> >>   * change pcounter implementation too.
> >>   */
> >>
> >>
> >> We all agree.
> >>
> > 
> > No we don't.  That comment is afaict wrong about the memory consumption and
> > the abstraction *isn't useful*.
> 
> The fact is that we need percpu 32-bit counters, and we need to have pointers to them.
> 
> Current percpu_counters cannot cope with that.

yes they can.

> > 
> > Why do we want some abstraction which makes alloc_percpu() storage and
> > DEFINE_PERCPU storage "look the same"?  What use is there in that?  One is
> > per-object storage and one is singleton storage - they're quite different
> > things and they are used in quite different situations and they are
> > basically never interchangeable.  Yet we add this pretend-they're-the-same
> > wrapper around them which costs us an indirect function call on the
> > fastpath.
> 
> I believe all 'pcounter' instances are in fact statically allocated, one per 
> 'struct proto', to track the inuse count (search for DEFINE_PROTO_INUSE() uses):
> 
> # find net include|xargs grep -n DEFINE_PROTO_INUSE
> net/dccp/ipv6.c:1105:DEFINE_PROTO_INUSE(dccp_v6)
> net/dccp/ipv4.c:920:DEFINE_PROTO_INUSE(dccp_v4)
> net/ipv6/udp.c:1001:DEFINE_PROTO_INUSE(udpv6)
> net/ipv6/udplite.c:43:DEFINE_PROTO_INUSE(udplitev6)
> net/ipv6/raw.c:1187:DEFINE_PROTO_INUSE(rawv6)
> net/ipv6/tcp_ipv6.c:2108:DEFINE_PROTO_INUSE(tcpv6)
> net/ipv4/udp.c:1477:DEFINE_PROTO_INUSE(udp)
> net/ipv4/udplite.c:47:DEFINE_PROTO_INUSE(udplite)
> net/ipv4/tcp_ipv4.c:2406:DEFINE_PROTO_INUSE(tcp)
> net/ipv4/raw.c:828:DEFINE_PROTO_INUSE(raw)
> net/sctp/socket.c:6461:DEFINE_PROTO_INUSE(sctp)
> net/sctp/socket.c:6495:DEFINE_PROTO_INUSE(sctpv6)
> include/net/sock.h:624:# define DEFINE_PROTO_INUSE(NAME) DEFINE_PCOUNTER(NAME)
> include/net/sock.h:644:# define DEFINE_PROTO_INUSE(NAME)
> 
> So pcounter_alloc()/pcounter_free() could just be deleted from pcounter.h

Well if you do that and if pcounters become helper functions around
DEFINE_PERCPU memory then things are starting to get a bit more sensible. 
We can then remove the indirect function call.

Yes, a general wrapper around a DEFINE_PERCPU'ed counter could be a useful
thing to have.  It could use raw_smp_processor_id() and local_add().

Problem is the read side will still be far too inefficient for it to be
presentable as a general-purpose library - it really will need to do all
the batching which experience told us was needed in percpu_counters.

The read side should also present two interfaces: one which returns a +ve
or -ve value and one which returns a never-negative value.

> Indirect function calls are everywhere in the kernel: network, fs, everywhere.

That doesn't make them fast.

> As soon as we can put the address of a percpu variable in 'struct pcounter', 
> we won't need pointers to the pcounter_add()/getval() functions anymore:
> 
> mov  (pcounter),%eax   # get the address of the percpu variable
> add  %edx,%fs:(%eax)
> 
> This just needs the per-cpu work being done by the SGI guys to be finished.

So...  I need to keep watching the tree to make sure that nobody actually
uses this code in ./lib/?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-16 19:26         ` include/linux/pcounter.h Andrew Morton
@ 2008-02-16 19:39           ` Arjan van de Ven
  2008-02-17  5:54           ` include/linux/pcounter.h David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: Arjan van de Ven @ 2008-02-16 19:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Herbert Xu, David S. Miller, netdev, linux-kernel

On Sat, 16 Feb 2008 11:26:18 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> > Indirect function calls are everywhere in the kernel: network, fs,
> > everywhere.
> 
> That doesn't make them fast.

Just to emphasize this: an indirect function call is at least as expensive as 
an atomic operation on x86 CPUs... (if not more).

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-16 19:26         ` include/linux/pcounter.h Andrew Morton
  2008-02-16 19:39           ` include/linux/pcounter.h Arjan van de Ven
@ 2008-02-17  5:54           ` David Miller
  2008-02-26  9:24             ` include/linux/pcounter.h Ingo Molnar
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2008-02-17  5:54 UTC (permalink / raw)
  To: akpm; +Cc: dada1, herbert, netdev, linux-kernel

From: Andrew Morton <akpm@linux-foundation.org>
Date: Sat, 16 Feb 2008 11:26:18 -0800

> On Sat, 16 Feb 2008 13:03:54 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> > Yes, per connection basis. Some workloads want to open/close more than 1000 
> > sockets per second.
> 
> ie: slowpath

Definitely not a slow path in networking.

Connection rates are definitely as, or more, important than packet
rates for certain workloads.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-17  5:54           ` include/linux/pcounter.h David Miller
@ 2008-02-26  9:24             ` Ingo Molnar
  2008-02-26 21:35               ` include/linux/pcounter.h David Miller
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2008-02-26  9:24 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, dada1, herbert, netdev, linux-kernel


* David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Sat, 16 Feb 2008 11:26:18 -0800
> 
> > On Sat, 16 Feb 2008 13:03:54 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote:
> > 
> > > Yes, per connection basis. Some workloads want to open/close more 
> > > than 1000 sockets per second.
> > 
> > ie: slowpath
> 
> Definitely not slow path in the networking.
> 
> Connection rates are definitely as, or more, important than packet 
> rates for certain workloads.

but the main and fundamental question still remains unanswered (more 
than 3 weeks after Andrew asked that question): why was this piece of 
general infrastructure merged via net.git and not submitted to lkml 
ever? The code touching -mm does _not_ count as "review".

Now that there was review of it and there is clearly controversy, the 
code should be reverted/undone and resubmitted after all review 
observations have been addressed. Just sitting around and ignoring 
objections, hoping for the code to hit v2.6.25 is rather un-nice ...

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-26  9:24             ` include/linux/pcounter.h Ingo Molnar
@ 2008-02-26 21:35               ` David Miller
  2008-02-27  7:36                 ` include/linux/pcounter.h Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2008-02-26 21:35 UTC (permalink / raw)
  To: mingo; +Cc: akpm, dada1, herbert, netdev, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 26 Feb 2008 10:24:54 +0100

> but the main and fundamental question still remains unanswered (more 
> than 3 weeks after Andrew asked that question): why was this piece of 
> general infrastructure merged via net.git and not submitted to lkml 
> ever? The code touching -mm does _not_ count as "review".

I already stated this was a mistake and it won't happen
again in the future.

And if you ask Andrew, he'll tell you I've been hard on
my network developers about this lately.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-26 21:35               ` include/linux/pcounter.h David Miller
@ 2008-02-27  7:36                 ` Ingo Molnar
  2008-02-28  1:27                   ` include/linux/pcounter.h Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2008-02-27  7:36 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, dada1, herbert, netdev, linux-kernel


* David Miller <davem@davemloft.net> wrote:

> > but the main and fundamental question still remains unanswered (more 
> > than 3 weeks after Andrew asked that question): why was this piece 
> > of general infrastructure merged via net.git and not submitted to 
> > lkml ever? The code touching -mm does _not_ count as "review".
> 
> I already stated this was a mistake and it won't happen again in the 
> future.

Sorry - that bit of the thread didn't seem to make it to lkml. I just saw 
this incomplete discussion with a denial and with no resolution.

And you did the right thing anyway by thinking in terms of a generic 
piece of infrastructure instead of hiding it away in, say, 
include/net/pcounter.h (which nobody could have objected to).

I sometimes think that the forced isolation of subsystems (rather 
strongly enforced both by -mm and by linux-next) and their hiding away 
on non-lkml lists will eventually hurt the core kernel, because fewer 
and fewer people will be willing to go to the trouble of doing proper 
cross-subsystem development. That results in duplicated or specialized 
infrastructure, increased code size and, longer term, ultimately less 
performance. (By the time we notice _that_, it will probably be too late 
to do anything about it.)

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: include/linux/pcounter.h
  2008-02-27  7:36                 ` include/linux/pcounter.h Ingo Molnar
@ 2008-02-28  1:27                   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2008-02-28  1:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, akpm, dada1, herbert, netdev, linux-kernel

Em Wed, Feb 27, 2008 at 08:36:50AM +0100, Ingo Molnar escreveu:
> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > > but the main and fundamental question still remains unanswered (more 
> > > than 3 weeks after Andrew asked that question): why was this piece 
> > > of general infrastructure merged via net.git and not submitted to 
> > > lkml ever? The code touching -mm does _not_ count as "review".
> > 
> > I already stated this was a mistake and it won't happen again in the 
> > future.
> 
> sorry - that bit of the thread didnt seem to make it to lkml. I just saw 
> this incomplete discussion with a denial and with no resolution.
> 
> And you did the right thing anyway by thinking in terms of a generic 
> piece of infrastructure instead of hiding it away into say 
> include/net/pcounter.h (which nobody could have objected against).
> 
> I sometimes think that the forced isolation of subsystems (rather 
> strongly enforced both by -mm and by linux-next) and their hiding away 
> on non-lkml lists will eventually hurt the core kernel because less and 
> less people will be willing to go the trouble of doing proper 
> cross-subsystem development. That results in duplicated or specialistic 
> infrastructure, increased code size and longer term, ultimately less 
> performance. (by the time we notice _that_ it will probably be too late 
> to do anything about it)

This friendlier wording actually makes me happy to come out of my 
hiding place and tell you that I did see the percpu_counters code: just 
after I made Eric's code generic, as I thought it should have been from 
the start, I found out about what was already in lib/.

I just got too lazy to do what at the time looked like the right thing 
to do: read through the existing lib/ code and use it where pcounter was 
being used. But at least it got exposed 8-)

Anyway, progress was made, and I do not feel too bad about it even now.

- Arnaldo

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-02-28  1:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-04  9:44 include/linux/pcounter.h Andrew Morton
2008-02-05  0:20 ` include/linux/pcounter.h David Miller
2008-02-05  0:43   ` include/linux/pcounter.h Andrew Morton
2008-02-16  3:37 ` include/linux/pcounter.h Andrew Morton
2008-02-16 10:07   ` include/linux/pcounter.h Eric Dumazet
2008-02-16 10:50     ` include/linux/pcounter.h Andrew Morton
2008-02-16 12:03       ` include/linux/pcounter.h Eric Dumazet
2008-02-16 19:26         ` include/linux/pcounter.h Andrew Morton
2008-02-16 19:39           ` include/linux/pcounter.h Arjan van de Ven
2008-02-17  5:54           ` include/linux/pcounter.h David Miller
2008-02-26  9:24             ` include/linux/pcounter.h Ingo Molnar
2008-02-26 21:35               ` include/linux/pcounter.h David Miller
2008-02-27  7:36                 ` include/linux/pcounter.h Ingo Molnar
2008-02-28  1:27                   ` include/linux/pcounter.h Arnaldo Carvalho de Melo
