linux-kernel.vger.kernel.org archive mirror
* Re: Fire Engine??
       [not found] ` <20031125183035.1c17185a.davem@redhat.com.suse.lists.linux.kernel>
@ 2003-11-26  9:53   ` Andi Kleen
  2003-11-26 11:35     ` John Bradford
                       ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Andi Kleen @ 2003-11-26  9:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

"David S. Miller" <davem@redhat.com> writes:
> 
> So his claim is that, in their measurements, "CPU utilization"
> was lower in their stack.  Was he using 2.6.x and TSO capable
> cards on the Linux side?  If not, it's not apples to apples
> against our current upcoming technology.

Maybe they just have a better copy_to_user(). That eats most time anyways.

I think there are definitely areas of improvements left in current TCP.
It has gotten quite fat over the last years.

Some issues just off the top of my head. I have not done detailed profiling
recently and don't know if any of this would help significantly. It is 
just what I remember right now.

- Window computation for incoming packets is quite dumbly coded right now
and could be optimized
- I suspect the copy/process-in-user-context setup needs to be rethought/
rebenchmarked in Gigabit setups.  There was at least one test case
where tcp_low_latency=1 helped. It just adds latency that might hurt
and is not very useful when you have hardware checksums anyways
- If they tested TCP-over-NFS then I'm pretty sure Linux lost badly because
the current paths for that are just awfully inefficient.
- Overall IP/TCP could probably have some more instructions and hopefully
cache misses shaved off with some careful going over the fast paths.
- There are too many locks. That hurts when you have slow atomic operations
(like on P4) and together with the next issue. 
- We do most things one packet at a time. This means locking and multiple
layer overhead multiplies. Most network operations come in packet bursts
and it would be much more efficient to batch operations: always process
lists of packets instead of single packets. This could probably lower
locking overhead a lot.
- On TX we are inefficient for the same reason. TCP builds one packet
at a time and then goes down through all layers taking all locks (queue,
device driver etc.) and submits the single packet. Then repeats that for 
lots of packets because many TCP writes are > MTU. Batching that would 
likely help a lot, like it was done in the 2.6 VFS. I think it could 
also make hard_start_xmit in many drivers significantly faster.
- The hash tables are too big. This causes unnecessary cache misses all the 
time.
- Doing gettimeofday on each incoming packet is just dumb, especially
when you have gettimeofday backed with a slow southbridge timer.
This shows quite badly on many profile logs.
I still think the right solution for that would be to only take time stamps
when there is any user for it (= no timestamps in 99% of all systems)
- user copy and checksum could probably also be done faster if they were
batched for multiple packets. It is hard to optimize properly for
<= 1.5K copies.
This is especially true for 4/4 split kernels which will eat a
page table look up + lock for each individual copy, but also for others.
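
To make the batching items above concrete, here is a minimal userspace
sketch, not kernel code. Every name in it (struct pkt, struct queue,
queue_lock() and so on) is hypothetical, and the "lock" is only a counter
standing in for a real spinlock. It just shows that per-packet processing
takes the lock once per packet, while list processing takes it once per
burst.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet and queue types, for illustration only. */
struct pkt {
    struct pkt *next;
    int len;
};

struct queue {
    unsigned long lock_acquisitions; /* how often the "lock" was taken */
    unsigned long pkts_handled;
    unsigned long bytes;
};

/* Stand-ins for a real lock: they only count acquisitions. */
static void queue_lock(struct queue *q)   { q->lock_acquisitions++; }
static void queue_unlock(struct queue *q) { (void)q; }

/* One packet at a time: lock and unlock around every packet. */
void process_one_at_a_time(struct queue *q, const struct pkt *list)
{
    const struct pkt *p;

    assert(q != NULL);
    for (p = list; p != NULL; p = p->next) {
        queue_lock(q);
        q->pkts_handled++;
        q->bytes += (unsigned long)p->len;
        queue_unlock(q);
    }
}

/* Batched: take the lock once and walk the whole list under it. */
void process_batch(struct queue *q, const struct pkt *list)
{
    const struct pkt *p;

    assert(q != NULL);
    queue_lock(q);
    for (p = list; p != NULL; p = p->next) {
        q->pkts_handled++;
        q->bytes += (unsigned long)p->len;
    }
    queue_unlock(q);
}
```

The work done per packet is identical in both functions; only the number
of lock round trips differs. Whether that saving matters in practice
would still have to be measured.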

-Andi


* Re: Fire Engine??
  2003-11-26  9:53   ` Fire Engine?? Andi Kleen
@ 2003-11-26 11:35     ` John Bradford
  2003-11-26 18:50       ` Mike Fedyk
  2003-11-26 15:00     ` Trond Myklebust
  2003-11-26 19:30     ` David S. Miller
  2 siblings, 1 reply; 38+ messages in thread
From: John Bradford @ 2003-11-26 11:35 UTC (permalink / raw)
  To: Andi Kleen, David S. Miller; +Cc: linux-kernel

Quote from Andi Kleen <ak@suse.de>:
> "David S. Miller" <davem@redhat.com> writes:
> > 
> > So his claim is that, in their measurements, "CPU utilization"
> > was lower in their stack.  Was he using 2.6.x and TSO capable
> > cards on the Linux side?  If not, it's not apples to apples
> > against our current upcoming technology.
> 
> Maybe they just have a better copy_to_user(). That eats most time anyways.
> 
> I think there are definitely areas of improvements left in current TCP.
> It has gotten quite fat over the last years.

On the subject of general networking performance in Linux, I thought
this set of benchmarks was quite interesting:

http://bulk.fefe.de/scalability/

particularly the 2.4 -> 2.6 comparisons.

John.


* Re: Fire Engine??
  2003-11-26  9:53   ` Fire Engine?? Andi Kleen
  2003-11-26 11:35     ` John Bradford
@ 2003-11-26 15:00     ` Trond Myklebust
  2003-11-26 23:01       ` Andi Kleen
  2003-11-26 19:30     ` David S. Miller
  2 siblings, 1 reply; 38+ messages in thread
From: Trond Myklebust @ 2003-11-26 15:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, linux-kernel

>>>>> " " == Andi Kleen <ak@suse.de> writes:

     > - If they tested TCP-over-NFS then I'm pretty sure Linux lost
                        ^^^^^^^^^^^^ That would be inefficient 8-)
     > badly because the current paths for that are just awfully
     > inefficient.

...mind elaborating?

Cheers,
  Trond


* Re: Fire Engine??
  2003-11-26 11:35     ` John Bradford
@ 2003-11-26 18:50       ` Mike Fedyk
  2003-11-26 19:19         ` Diego Calleja García
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Fedyk @ 2003-11-26 18:50 UTC (permalink / raw)
  To: John Bradford; +Cc: Andi Kleen, David S. Miller, linux-kernel

On Wed, Nov 26, 2003 at 11:35:03AM +0000, John Bradford wrote:
> Quote from Andi Kleen <ak@suse.de>:
> > "David S. Miller" <davem@redhat.com> writes:
> > > 
> > > So his claim is that, in their measurements, "CPU utilization"
> > > was lower in their stack.  Was he using 2.6.x and TSO capable
> > > cards on the Linux side?  If not, it's not apples to apples
> > > against our current upcoming technology.
> > 
> > Maybe they just have a better copy_to_user(). That eats most time anyways.
> > 
> > I think there are definitely areas of improvements left in current TCP.
> > It has gotten quite fat over the last years.
> 
> On the subject of general networking performance in Linux, I thought
> this set of benchmarks was quite interesting:
> 
> http://bulk.fefe.de/scalability/

No such file or directory.


* Re: Fire Engine??
  2003-11-26 18:50       ` Mike Fedyk
@ 2003-11-26 19:19         ` Diego Calleja García
  2003-11-26 19:59           ` Mike Fedyk
  2003-11-27  3:54           ` Bill Huey
  0 siblings, 2 replies; 38+ messages in thread
From: Diego Calleja García @ 2003-11-26 19:19 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: john, ak, davem, linux-kernel

On Wed, 26 Nov 2003 10:50:28 -0800 Mike Fedyk <mfedyk@matchmail.com> wrote:

> > http://bulk.fefe.de/scalability/
> 
> No such file or directory.

It works here. I don't know if those numbers represent anything for networking.
Some of the benchmarks look more like "vm benchmarking". And are the ones
measuring latency valid, considering that the BSDs lack "preempt"?
(shooting in the dark)

Diego Calleja.


* Re: Fire Engine??
  2003-11-26  9:53   ` Fire Engine?? Andi Kleen
  2003-11-26 11:35     ` John Bradford
  2003-11-26 15:00     ` Trond Myklebust
@ 2003-11-26 19:30     ` David S. Miller
  2003-11-26 19:58       ` Paul Menage
                         ` (4 more replies)
  2 siblings, 5 replies; 38+ messages in thread
From: David S. Miller @ 2003-11-26 19:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On 26 Nov 2003 10:53:21 +0100
Andi Kleen <ak@suse.de> wrote:

> Some issues just off the top of my head. I have not done detailed profiling
> recently and don't know if any of this would help significantly. It is 
> just what I remember right now.

Thanks for the list Andi, I'll keep it around.  I'd like
to comment on one entry though.

> - On TX we are inefficient for the same reason. TCP builds one packet
> at a time and then goes down through all layers taking all locks (queue,
> device driver etc.) and submits the single packet. Then repeats that for 
> lots of packets because many TCP writes are > MTU. Batching that would 
> likely help a lot, like it was done in the 2.6 VFS. I think it could 
> also make hard_start_xmit in many drivers significantly faster.

This is tricky, because of getting all of the queueing stuff right.
All of the packet scheduler APIs would need to change, as would
the classification stuff, not to mention netfilter et al.

You're talking about basically redoing the whole TX path if you
want to really support this.

I'm not saying "don't do this", just that we should be sure we know
what we're getting if we invest the time into this.

> - The hash tables are too big. This causes unnecessary cache misses all the 
> time.

I agree.  See my comments on this topic in another recent linux-kernel
thread wrt. huge hash tables on numa systems.

> - Doing gettimeofday on each incoming packet is just dumb, especially
> when you have gettimeofday backed with a slow southbridge timer.
> This shows quite badly on many profile logs.
> I still think the right solution for that would be to only take time stamps
> when there is any user for it (= no timestamps in 99% of all systems) 

Andi, I know this is a problem, but for the millionth time your idea
does not work because we don't know if the user asked for the timestamp
until we are deep within the recvmsg() processing, which is long after
the packet has arrived.

> - user copy and checksum could probably also be done faster if they were
> batched for multiple packets. It is hard to optimize properly for
> <= 1.5K copies.
> This is especially true for 4/4 split kernels which will eat a
> page table look up + lock for each individual copy, but also for others.

I disagree partially, especially in the presence of a chip that provides
proper implementations of software initiated prefetching.


* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
@ 2003-11-26 19:58       ` Paul Menage
  2003-11-26 20:03         ` David S. Miller
  2003-11-26 20:01       ` Fire Engine?? Jamie Lokier
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Paul Menage @ 2003-11-26 19:58 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

David S. Miller wrote:
  >
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

How about tracking the number of current sockets that have had timestamp 
requests for them? If this number is zero, don't bother with the 
timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket, 
bump the count and set a flag; decrement the count when the socket is 
destroyed if the flag is set.

The drawback is that the first SIOCGSTAMP on any particular socket will 
have to return a bogus value (maybe just the current time?). Ways to 
mitigate that are:

- have a /proc option to let the sysadmin enforce timestamps on all 
packets (just bump the counter)

- bump the counter whenever an interface is in promiscuous mode (I 
imagine that tcpdump et al are the main users of the timestamps?)
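
A rough sketch of the counting scheme, assuming a single global counter
and a per-socket flag. All names here (struct sock_sketch,
timestamp_users, rx_should_timestamp() and friends) are made up for
illustration and are not existing kernel symbols. Real code would need
an atomic counter or a lock around the updates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: number of live sockets that have asked for timestamps. */
static int timestamp_users;

/* Hypothetical per-socket state; a real socket has far more in it. */
struct sock_sketch {
    bool wants_timestamp;
};

/* Called on the first timestamp request for a socket (e.g. SIOCGSTAMP).
 * Repeated requests on the same socket do not bump the count again. */
void sock_request_timestamp(struct sock_sketch *sk)
{
    if (!sk->wants_timestamp) {
        sk->wants_timestamp = true;
        timestamp_users++;
    }
}

/* Called when the socket goes away: drop its reference if it held one. */
void sock_destroy(struct sock_sketch *sk)
{
    if (sk->wants_timestamp) {
        sk->wants_timestamp = false;
        timestamp_users--;
        assert(timestamp_users >= 0);
    }
}

/* Receive path check: only pay for the clock read if somebody cares. */
bool rx_should_timestamp(void)
{
    return timestamp_users > 0;
}
```

The /proc override and the promiscuous-mode case would both reduce to
holding one extra reference on the same counter.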

Paul



* Re: Fire Engine??
  2003-11-26 19:19         ` Diego Calleja García
@ 2003-11-26 19:59           ` Mike Fedyk
  2003-11-27  3:54           ` Bill Huey
  1 sibling, 0 replies; 38+ messages in thread
From: Mike Fedyk @ 2003-11-26 19:59 UTC (permalink / raw)
  To: Diego Calleja García; +Cc: john, ak, davem, linux-kernel

On Wed, Nov 26, 2003 at 08:19:03PM +0100, Diego Calleja García wrote:
> On Wed, 26 Nov 2003 10:50:28 -0800 Mike Fedyk <mfedyk@matchmail.com> wrote:
> 
> > > http://bulk.fefe.de/scalability/
> > 
> > No such file or directory.
> 
> It works here. I don't know if those numbers represent anything for networking.
> Some of the benchmarks look more like "vm benchmarking". And are the ones
> measuring latency valid, considering that the BSDs lack "preempt"?
> (shooting in the dark)

Grr, that trailing "/" made the difference. :-/


* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
  2003-11-26 19:58       ` Paul Menage
@ 2003-11-26 20:01       ` Jamie Lokier
  2003-11-26 20:04         ` David S. Miller
  2003-11-26 21:54         ` Pekka Pietikainen
  2003-11-26 20:22       ` Theodore Ts'o
                         ` (2 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Jamie Lokier @ 2003-11-26 20:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

David S. Miller wrote:
> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems) 
> 
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

Do the timestamps need to be precise and accurately reflect the
arrival time in the irq handler?  Or, for TCP timestamps, would it be
good enough to use the time when the protocol handlers are run, and
only read the hardware clock once for a bunch of received packets?  Or
even use jiffies?

Apart from TCP, precise timestamps are only used for packet capture,
and it's easy to keep track globally of whether anyone has packet
sockets open.

-- Jamie


* Re: Fire Engine??
  2003-11-26 19:58       ` Paul Menage
@ 2003-11-26 20:03         ` David S. Miller
  2003-11-26 22:29           ` Andi Kleen
  0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-11-26 20:03 UTC (permalink / raw)
  To: Paul Menage; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 11:58:44 -0800
Paul Menage <menage@google.com> wrote:

> How about tracking the number of current sockets that have had timestamp 
> requests for them? If this number is zero, don't bother with the 
> timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket, 
> bump the count and set a flag; decrement the count when the socket is 
> destroyed if the flag is set.

Reread what I said please, the user can ask for timestamps using CMSG
objects via the recvmsg() system call, there are no ioctls or socket
controls done on the socket.  It is completely dynamic and
unpredictable.


* Re: Fire Engine??
  2003-11-26 20:01       ` Fire Engine?? Jamie Lokier
@ 2003-11-26 20:04         ` David S. Miller
  2003-11-26 21:54         ` Pekka Pietikainen
  1 sibling, 0 replies; 38+ messages in thread
From: David S. Miller @ 2003-11-26 20:04 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 20:01:53 +0000
Jamie Lokier <jamie@shareable.org> wrote:

> Do the timestamps need to be precise and accurately reflect the
> arrival time in the irq handler?

It would be a regression to make the timestamps less accurate
than those provided now.

> Or, for TCP timestamps,

The timestamps we are talking about are not used for TCP.

> Apart from TCP, precise timestamps are only used for packet capture,
> and it's easy to keep track globally of whether anyone has packet
> sockets open.

We have no knowledge of what an application's requirements are,
that is why we provide as accurate a timestamp as possible.

If we were writing this stuff for the first time now, sure we could
specify things however conveniently we like, but how this stuff behaves
is already well defined.


* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
  2003-11-26 19:58       ` Paul Menage
  2003-11-26 20:01       ` Fire Engine?? Jamie Lokier
@ 2003-11-26 20:22       ` Theodore Ts'o
  2003-11-26 21:02         ` David S. Miller
  2003-11-26 21:34       ` Arjan van de Ven
  2003-11-26 22:39       ` Andi Kleen
  4 siblings, 1 reply; 38+ messages in thread
From: Theodore Ts'o @ 2003-11-26 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

On Wed, Nov 26, 2003 at 11:30:40AM -0800, David S. Miller wrote:
> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems) 
> 
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

I believe what Andi was suggesting was if there were **no** processes
that are currently requesting timestamps, then we can dispense with
taking the timestamp.  If a single user asks for the timestamp, then
we would still end up taking timestamps on all packets.  Is this worth
the overhead to keep track of that factor?  It's arguable, but on some
platforms, probably yes.

						- Ted


* Re: Fire Engine??
  2003-11-26 20:22       ` Theodore Ts'o
@ 2003-11-26 21:02         ` David S. Miller
  2003-11-26 21:24           ` Jamie Lokier
  0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-11-26 21:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 15:22:16 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> I believe what Andi was suggesting was if there were **no** processes
> that are currently requesting timestamps, then we can dispense with
> taking the timestamp.

You can predict what the arguments will be for the user's
recvmsg() system call at the time of packet reception?  Wow,
show me how :)


* Re: Fire Engine??
  2003-11-26 21:02         ` David S. Miller
@ 2003-11-26 21:24           ` Jamie Lokier
  2003-11-26 21:38             ` David S. Miller
  0 siblings, 1 reply; 38+ messages in thread
From: Jamie Lokier @ 2003-11-26 21:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: Theodore Ts'o, ak, linux-kernel

David S. Miller wrote:
> > that are currently requesting timestamps, then we can dispense with
> > taking the timestamp.
> 
> You can predict what the arguments will be for the user's
> recvmsg() system call at the time of packet reception?  Wow,
> show me how :)

recvmsg() doesn't return timestamps until they are requested
using setsockopt(...SO_TIMESTAMP...).

See sock_recv_timestamp() in include/net/sock.h.

-- Jamie


* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
                         ` (2 preceding siblings ...)
  2003-11-26 20:22       ` Theodore Ts'o
@ 2003-11-26 21:34       ` Arjan van de Ven
  2003-11-26 22:58         ` Andi Kleen
  2003-11-26 22:39       ` Andi Kleen
  4 siblings, 1 reply; 38+ messages in thread
From: Arjan van de Ven @ 2003-11-26 21:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel


On Wed, 2003-11-26 at 20:30, David S. Miller wrote:

> > - Doing gettimeofday on each incoming packet is just dumb, especially
> > when you have gettimeofday backed with a slow southbridge timer.
> > This shows quite badly on many profile logs.
> > I still think the right solution for that would be to only take time stamps
> > when there is any user for it (= no timestamps in 99% of all systems) 
> 
> Andi, I know this is a problem, but for the millionth time your idea
> does not work because we don't know if the user asked for the timestamp
> until we are deep within the recvmsg() processing, which is long after
> the packet has arrived.

question: do we need a timestamp for every packet or can we do one
timestamp per irq-context entry ? (eg one timestamp at irq entry time we
do anyway and keep that for all packets processed in the softirq)
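
As a toy model of that question: the sketch below stamps a burst of
packets either with one clock read each or with a single read shared by
the whole burst. Nothing here is kernel code. struct rx_pkt,
read_slow_clock() and the tick counter are hypothetical stand-ins, with
the counter playing the role of a slow hardware timer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical received packet carrying only its timestamp. */
struct rx_pkt {
    unsigned long stamp;
};

/* Stand-in for a slow hardware timer: each read advances one tick. */
static unsigned long fake_clock_now;
unsigned long clock_reads; /* how many times the "timer" was read */

unsigned long read_slow_clock(void)
{
    clock_reads++;
    return ++fake_clock_now;
}

/* Current behaviour in this model: one clock read per packet. */
void stamp_per_packet(struct rx_pkt *pkts, size_t n)
{
    size_t i;

    assert(pkts != NULL || n == 0);
    for (i = 0; i < n; i++)
        pkts[i].stamp = read_slow_clock();
}

/* Proposed variant: read the clock once, reuse it for the whole burst. */
void stamp_per_batch(struct rx_pkt *pkts, size_t n)
{
    unsigned long now;
    size_t i;

    assert(pkts != NULL || n == 0);
    if (n == 0)
        return;
    now = read_slow_clock();
    for (i = 0; i < n; i++)
        pkts[i].stamp = now;
}
```

The cost is visible in the model too: every packet in a burst gets the
same stamp, so ordering and spacing within the burst are lost.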



* Re: Fire Engine??
  2003-11-26 21:24           ` Jamie Lokier
@ 2003-11-26 21:38             ` David S. Miller
  2003-11-26 23:43               ` Jamie Lokier
  0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-11-26 21:38 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: tytso, ak, linux-kernel

On Wed, 26 Nov 2003 21:24:06 +0000
Jamie Lokier <jamie@shareable.org> wrote:

> recvmsg() doesn't return timestamps until they are requested
> using setsockopt(...SO_TIMESTAMP...).
> 
> See sock_recv_timestamp() in include/net/sock.h.

See MSG_ERRQUEUE and net/ipv4/ip_sockglue.c


* Re: Fire Engine??
  2003-11-26 20:01       ` Fire Engine?? Jamie Lokier
  2003-11-26 20:04         ` David S. Miller
@ 2003-11-26 21:54         ` Pekka Pietikainen
  1 sibling, 0 replies; 38+ messages in thread
From: Pekka Pietikainen @ 2003-11-26 21:54 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David S. Miller, Andi Kleen, linux-kernel

On Wed, Nov 26, 2003 at 08:01:53PM +0000, Jamie Lokier wrote:
> > Andi, I know this is a problem, but for the millionth time your idea
> > does not work because we don't know if the user asked for the timestamp
> > until we are deep within the recvmsg() processing, which is long after
> > the packet has arrived.
> 
> Do the timestamps need to be precise and accurately reflect the
> arrival time in the irq handler?  Or, for TCP timestamps, would it be
> good enough to use the time when the protocol handlers are run, and
> only read the hardware clock once for a bunch of received packets?  Or
> even use jiffies?

> Apart from TCP, precise timestamps are only used for packet capture,
> and it's easy to keep track globally of whether anyone has packet
> sockets open.
It should probably be noted that really hardcore timestamp users
have their NICs do it for them, since interrupt coalescing
makes timestamps done in the kernel too inaccurate for them even
if rdtsc is used (http://www-didc.lbl.gov/papers/SCNM-PAM03.pdf).
Not that it's anywhere near a universal solution since more or less only
one brand of NICs supports them.

It would probably be a useful experiment to see whether the performance is
improved in a noticeable way if say jiffies were used. If so, it might be a
reasonable choice for a configurable option, if not then not. 
Isn't stuff like this the reason why the experimental network patches tree
that was announced a while back is out there? ;-)


* Re: Fire Engine??
  2003-11-26 20:03         ` David S. Miller
@ 2003-11-26 22:29           ` Andi Kleen
  2003-11-26 22:36             ` David S. Miller
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 22:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 12:03:16 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On Wed, 26 Nov 2003 11:58:44 -0800
> Paul Menage <menage@google.com> wrote:
> 
> > How about tracking the number of current sockets that have had timestamp 
> > requests for them? If this number is zero, don't bother with the 
> > timestamps. The first time you get a SIOCGSTAMP ioctl on a given socket, 
> > bump the count and set a flag; decrement the count when the socket is 
> > destroyed if the flag is set.
> 
> Reread what I said please, the user can ask for timestamps using CMSG
> objects via the recvmsg() system call, there are no ioctls or socket
> controls done on the socket.  It is completely dynamic and
> unpredictable.

The user sets the SO_TIMESTAMP setsockopt to 1 and then you get the cmsg.
That's per socket state. The other way is to use the SIOCGSTAMP ioctl.
That is a bit more ugly because it has no state, but you can do
a heuristic and assume that a process that does SIOCGSTAMP once
will do it in future too and set a flag in this case.

The first SIOCGSTAMP would be inaccurate, but the following (after
all untimestamped packets have been flushed) would be ok.

Doing it for IP would be relatively easy, the only major users of the
timestamp seem to be DECnet and the bridge, but I suppose those could be
converted to use jiffies too.

-Andi


* Re: Fire Engine??
  2003-11-26 22:29           ` Andi Kleen
@ 2003-11-26 22:36             ` David S. Miller
  2003-11-26 22:56               ` Andi Kleen
  0 siblings, 1 reply; 38+ messages in thread
From: David S. Miller @ 2003-11-26 22:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 26 Nov 2003 23:29:09 +0100
Andi Kleen <ak@suse.de> wrote:

> The first SIOCGSTAMP would be inaccurate, but the following (after
> all untimestamped packets have been flushed) would be ok.

I don't think this is acceptable.  It's important that all
of the timestamps are as accurate as they were before.


* Re: Fire Engine??
  2003-11-26 19:30     ` David S. Miller
                         ` (3 preceding siblings ...)
  2003-11-26 21:34       ` Arjan van de Ven
@ 2003-11-26 22:39       ` Andi Kleen
  2003-11-26 22:46         ` David S. Miller
  4 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 22:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 11:30:40 -0800
"David S. Miller" <davem@redhat.com> wrote:

>
> > - On TX we are inefficient for the same reason. TCP builds one packet
> > at a time and then goes down through all layers taking all locks (queue,
> > device driver etc.) and submits the single packet. Then repeats that for 
> > lots of packets because many TCP writes are > MTU. Batching that would 
> > likely help a lot, like it was done in the 2.6 VFS. I think it could 
> > also make hard_start_xmit in many drivers significantly faster.
> 
> This is tricky, because of getting all of the queueing stuff right.
> All of the packet scheduler APIs would need to change, as would
> the classification stuff, not to mention netfilter et al.

You only need to do a fast path for the default scheduler at the beginning.
Every complicated "slow" API like advanced queuing or netfilter can still fall back to
one packet at a time until cleaned up (a similar strategy to what was done with the
non-linear skbs).
 
> You're talking about basically redoing the whole TX path if you
> want to really support this.
> 
> I'm not saying "don't do this", just that we should be sure we know
> what we're getting if we invest the time into this.

In some profiling I did some time ago queue locks and device driver
locks were the biggest offenders on TX after copy. 

The only tricky part is to get the state machine in tcp_do_sendmsg()
right that decides when to flush.

> > - user copy and checksum could probably also be done faster if they were
> > batched for multiple packets. It is hard to optimize properly for
> > <= 1.5K copies.
> > This is especially true for 4/4 split kernels which will eat a
> > page table look up + lock for each individual copy, but also for others.
> 
> I disagree partially, especially in the presence of a chip that provides
> proper implementations of software initiated prefetching.

Especially for prefetching having a list of packets helps because you
can prefetch the next while you're working on the current one. The CPU
hardware prefetcher cannot do that for you.
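
Something like the following is what I mean, as a userspace sketch only.
struct pkt_buf and the function names are hypothetical, and the byte sum
merely stands in for the real csum-copy work. __builtin_prefetch() is
the GCC/Clang builtin; whether the hint pays off depends on the CPU and
would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet list node pointing at its payload. */
struct pkt_buf {
    struct pkt_buf *next;
    const uint8_t *data;
    size_t len;
};

/* Stand-in for the per-packet work (checksum + copy in the real stack). */
static uint32_t sum_bytes(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* Walk the list; while working on one packet, hint the cache to start
 * fetching the payload of the next one (read access, keep in cache). */
uint32_t process_list_with_prefetch(const struct pkt_buf *head)
{
    const struct pkt_buf *p;
    uint32_t total = 0;

    for (p = head; p != NULL; p = p->next) {
        assert(p->data != NULL || p->len == 0);
        if (p->next != NULL)
            __builtin_prefetch(p->next->data, 0, 3);
        total += sum_bytes(p->data, p->len);
    }
    return total;
}
```

With only a single packet in hand there is no "next" to prefetch, which
is the point about having a list.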

I did look seriously at faster csum-copy/copy-to-user for K8, but the conclusion
was that all the tricks are only worth it when you can work with bigger amounts of data.
1.5K at a time is just too small.

Ah yes:

- Investigate more performance through explicit prefetching 
(e.g. in the device drivers to optimize eth_type_trans() when you can classify the packet 
just by looking at the RX ring state. Instead do a prefetch on the packet data
and hope the data is already in cache when the IP stack gets around to look at it) 

could also be added to the list.

-Andi (who shuts up now because I don't have any time to code on any of this :-( ) 


* Re: Fire Engine??
  2003-11-26 22:39       ` Andi Kleen
@ 2003-11-26 22:46         ` David S. Miller
  0 siblings, 0 replies; 38+ messages in thread
From: David S. Miller @ 2003-11-26 22:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 26 Nov 2003 23:39:18 +0100
Andi Kleen <ak@suse.de> wrote:

> You only need to do a fast path for the default scheduler at the beginning.

In the end we're going to have a design and we're going to do it
right, if we decide to do this.

Sun needs fast paths, not us.

> Especially for prefetching having a list of packets helps because you
> can prefetch the next while you're working on the current one. The CPU
> hardware prefetcher cannot do that for you.

The initial prefetches are consumed by the copy implementation
setup instructions.  By the time the real loads execute, the
data is there or not very far away.

This I have measured on UltraSPARC, I suspect other cpus can
match that if not do better.

> I did look seriously at faster csum-copy/copy-to-user for K8, but the conclusion
> was that all the tricks are only worth it when you can work with bigger amounts of data.
> 1.5K at a time is just too small.

Not true, once you have ~300 or so bytes you have enough inertia
to get a good stream going in the main loop, really look at the
ultrasparc-III stuff I wrote for heuristics.

You really should write the k8 code before coming to conclusions
about what it would or would not be capable of doing :)


* Re: Fire Engine??
  2003-11-26 22:36             ` David S. Miller
@ 2003-11-26 22:56               ` Andi Kleen
  2003-11-26 23:13                 ` David S. Miller
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 22:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 14:36:20 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On Wed, 26 Nov 2003 23:29:09 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > The first SIOCGSTAMP would be inaccurate, but the following (after
> > all untimestamped packets have been flushed) would be ok.
> 
> I don't think this is acceptable.  It's important that all
> of the timestamps are as accurate as they were before.

I disagree on that. The window is small, and slowing down 99.99999% of all
users who never care about this for the sake of this extremely obscure, misdesigned API does
not make much sense to me.

Also if you worry about these you could add an optional sysctl
to always take it, so if anybody really has an application that relies
on the first time stamp being accurate and they cannot use SO_TIMESTAMP
they could set the sysctl.

-Andi


* Re: Fire Engine??
  2003-11-26 21:34       ` Arjan van de Ven
@ 2003-11-26 22:58         ` Andi Kleen
  2003-11-27 12:16           ` Ingo Oeser
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 22:58 UTC (permalink / raw)
  To: arjanv; +Cc: davem, linux-kernel

On Wed, 26 Nov 2003 22:34:10 +0100
Arjan van de Ven <arjanv@redhat.com> wrote:

> On Wed, 2003-11-26 at 20:30, David S. Miller wrote:
> 
> > > - Doing gettimeofday on each incoming packet is just dumb, especially
> > > when you have gettimeofday backed with a slow southbridge timer.
> > > This shows quite badly on many profile logs.
> > > I still think right solution for that would be to only take time stamps
> > > when there is any user for it (= no timestamps in 99% of all systems) 
> > 
> > Andi, I know this is a problem, but for the millionth time your idea
> > does not work because we don't know if the user asked for the timestamp
> > until we are deep within the recvmsg() processing, which is long after
> > the packet has arrived.
> 
> question: do we need a timestamp for every packet or can we do one
> timestamp per irq-context entry ? (eg one timestamp at irq entry time we
> do anyway and keep that for all packets processed in the softirq)

If people want the timestamp they usually want it to be accurate
(e.g. for tcpdump etc.). Of course there is already a lot of jitter
in this information because it is done relatively late in the device
driver (long after the NIC has received the packet).

Just most people never care about this at all.... 

-Andi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 15:00     ` Trond Myklebust
@ 2003-11-26 23:01       ` Andi Kleen
  2003-11-26 23:23         ` Trond Myklebust
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 23:01 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: davem, linux-kernel

On 26 Nov 2003 10:00:09 -0500
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> >>>>> " " == Andi Kleen <ak@suse.de> writes:
> 
>      > - If they tested TCP-over-NFS then I'm pretty sure Linux lost
>                         ^^^^^^^^^^^^ That would be inefficient 8-)

grin. 

>      > badly because the current paths for that are just awfully
>      > inefficient.
> 
> ...mind elaborating?

Current sunrpc does two recvmsgs for each record to first get the record length 
and then the payload.

This means you take all the locks and other overhead twice per packet. 

A special function that peeks directly at the TCP receive queue
(and falls back to normal recvmsg when there is no data waiting)
would be much faster.
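
As a rough illustration of why one pass is enough: with ONC RPC record
marking over TCP, each fragment starts with a 4-byte big-endian header
(top bit: last fragment, low 31 bits: fragment length), so a reader that
already has the queued bytes in hand can find both the length and the
payload without a second receive call. The sketch below is plain
userspace C over a byte buffer; parse_record() is a hypothetical helper,
not a sunrpc function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Look at 'len' queued bytes and locate one RPC fragment.
 * Returns the number of bytes consumed (header + payload), or 0 if the
 * fragment is not complete yet and the caller should fall back to
 * waiting for more data.  The outputs are only meaningful when the
 * return value is nonzero.
 */
static size_t parse_record(const uint8_t *buf, size_t len,
			   const uint8_t **payload, uint32_t *plen,
			   int *last)
{
	uint32_t mark;

	assert(buf && payload && plen && last);
	if (len < 4)
		return 0;

	/* Record mark is big-endian on the wire. */
	mark = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	*last = (mark & 0x80000000u) != 0;
	*plen = mark & 0x7fffffffu;

	if (len - 4 < *plen)
		return 0;
	*payload = buf + 4;
	return 4 + (size_t)*plen;
}
```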

But that's the really obvious case. I think if you got out a profiler
and optimized carefully you could likely make this path much more
efficient. Same for sunrpc TX probably, although that seems to be
in a better shape already.

-Andi 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 22:56               ` Andi Kleen
@ 2003-11-26 23:13                 ` David S. Miller
  2003-11-26 23:29                   ` Andi Kleen
  2003-11-26 23:41                   ` Ben Greear
  0 siblings, 2 replies; 38+ messages in thread
From: David S. Miller @ 2003-11-26 23:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 26 Nov 2003 23:56:41 +0100
Andi Kleen <ak@suse.de> wrote:

> On Wed, 26 Nov 2003 14:36:20 -0800
> "David S. Miller" <davem@redhat.com> wrote:
> 
> > I don't think this is acceptable.  It's important that all
> > of the timestamps are as accurate as they were before.
> 
> I disagree on that. The window is small and slowing down 99.99999% of all 
> users who never care about this for this extremely obscure
> misdesigned API does not make  much sense to me.

We can't change behavior like this.  Every time we've tried to
do it, we've been burnt.  Remember nonlocal-bind?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 23:01       ` Andi Kleen
@ 2003-11-26 23:23         ` Trond Myklebust
  2003-11-26 23:38           ` Andi Kleen
  0 siblings, 1 reply; 38+ messages in thread
From: Trond Myklebust @ 2003-11-26 23:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: davem, linux-kernel

>>>>> " " == Andi Kleen <ak@suse.de> writes:

     > Current sunrpc does two recvmsgs for each record to first get
     > the record length and then the payload.

     > This means you take all the locks and other overhead twice per
     > packet.

     > Having a special function that peeks directly at the TCP
     > receive queue would be much faster (and falls back to normal
     > recvmsg when there is no data waiting)

Oh, right... That would be the server code you are thinking of, then.

The client already does something like this. I've added a function
tcp_read_sock() that is called directly from tcp_data_ready() and
hence fills the page cache directly from within the softirq.

There are still a few inefficiencies with this approach, though. Most
notable is the fact that you need to call kmap_atomic() several times
per page since the socket lower layers will usually be feeding you 1
skb at a time. I thought you might be referring to those (and that you
might have a good solution to propose ;-))

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 23:13                 ` David S. Miller
@ 2003-11-26 23:29                   ` Andi Kleen
  2003-11-26 23:41                   ` Ben Greear
  1 sibling, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 23:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

On Wed, 26 Nov 2003 15:13:52 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On Wed, 26 Nov 2003 23:56:41 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > On Wed, 26 Nov 2003 14:36:20 -0800
> > "David S. Miller" <davem@redhat.com> wrote:
> > 
> > > I don't think this is acceptable.  It's important that all
> > > of the timestamps are as accurate as they were before.
> > 
> > I disagree on that. The window is small and slowing down 99.99999% of all 
> > users who never care about this for this extremely obscure
> > misdesigned API does not make  much sense to me.
> 
> We can't change behavior like this.  Every time we've tried to
> do it, we've been burnt.  Remember nonlocal-bind?

The behaviour is not really changed, just the precision of the timestamp
is temporarily (a few tens of ms on a busy network) worse. 

And the jitter in this timestamp is already higher than this when
you consider queueing delays and interrupt mitigation in the driver.

-Andi


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 23:23         ` Trond Myklebust
@ 2003-11-26 23:38           ` Andi Kleen
  0 siblings, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2003-11-26 23:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andi Kleen, davem, linux-kernel

> There are still a few inefficiencies with this approach, though. Most
> notable is the fact that you need to call kmap_atomic() several times
> per page since the socket lower layers will usually be feeding you 1
> skb at a time. I thought you might be referring to those (and that you
> might have a good solution to propose ;-))

For kmap_atomic? Run an x86-64 box ;-) 

In general, doing things with more than one packet at a time would
probably be a good idea, but I don't have any deep thoughts on how
to implement this for TCP RX.

-Andi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 23:13                 ` David S. Miller
  2003-11-26 23:29                   ` Andi Kleen
@ 2003-11-26 23:41                   ` Ben Greear
  2003-11-27  0:01                     ` Fast timestamps David S. Miller
  1 sibling, 1 reply; 38+ messages in thread
From: Ben Greear @ 2003-11-26 23:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, linux-kernel

David S. Miller wrote:
> On Wed, 26 Nov 2003 23:56:41 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> 
>>On Wed, 26 Nov 2003 14:36:20 -0800
>>"David S. Miller" <davem@redhat.com> wrote:
>>
>>
>>>I don't think this is acceptable.  It's important that all
>>>of the timestamps are as accurate as they were before.
>>
>>I disagree on that. The window is small and slowing down 99.99999% of all 
>>users who never care about this for this extremely obscure
>>misdesigned API does not make  much sense to me.
> 
> 
> We can't change behavior like this.  Every time we've tried to
> do it, we've been burnt.  Remember nonlocal-bind?

I'll try to write up a patch that uses the TSC and lazy conversion
to timeval as soon as I get the rx-all and rx-fcs code happily
into the kernel....

Assuming TSC is very fast and the conversion is accurate enough, I think
this can give good results....

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 21:38             ` David S. Miller
@ 2003-11-26 23:43               ` Jamie Lokier
  0 siblings, 0 replies; 38+ messages in thread
From: Jamie Lokier @ 2003-11-26 23:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: tytso, ak, linux-kernel

David S. Miller wrote:
> > recvmsg() doesn't return timestamps until they are requested
> > using setsockopt(...SO_TIMESTAMP...).
> > 
> > See sock_recv_timestamp() in include/net/sock.h.
> 
> See MSG_ERRQUEUE and net/ipv4/ip_sockglue.c

I don't see your point.  The test for the SO_TIMESTAMP socket option
is _inside_ sock_recv_timestamp() (the flag is called sk_rcvtstamp).

The MSG_ERRQUEUE code simply calls sock_recv_timestamp(), which in
turn only reports the timestamp if the flag is set.

There are exactly two places where the timestamp is reported to
userspace, and both are at the request of userspace:

	1. sock_recv_timestamp(), called from many places including
	   ip_sockglue.c.  It _only_ reports it if SO_TIMESTAMP is
	   enabled for the socket.

	2. inet_ioctl(SIOCGSTAMP)

Nowhere else is the timestamp reported to userspace.

-- Jamie


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Fast timestamps
  2003-11-26 23:41                   ` Ben Greear
@ 2003-11-27  0:01                     ` David S. Miller
  2003-11-27  0:30                       ` Mitchell Blank Jr
  2003-11-27  1:57                       ` Ben Greear
  0 siblings, 2 replies; 38+ messages in thread
From: David S. Miller @ 2003-11-27  0:01 UTC (permalink / raw)
  To: Ben Greear; +Cc: ak, linux-kernel

On Wed, 26 Nov 2003 15:41:52 -0800
Ben Greear <greearb@candelatech.com> wrote:

> I'll try to write up a patch that uses the TSC and lazy conversion
> to timeval as soon as I get the rx-all and rx-fcs code happily
> into the kernel....
> 
> Assuming TSC is very fast and the conversion is accurate enough, I think
> this can give good results....

I'm amazed that you will be able to write a fast_timestamp
implementation without even seeing the API I had specified
to the various arch maintainers :-)

====================

But at the base I say we need three things:

1) Some kind of fast_timestamp_t.  Its defining property is that it stores
   enough information at time "T" such that at time "T + something"
   the fast_timestamp_t can be converted to what the timeval was back at
   time "T".

   For networking, make skb->stamp into this type.

2) store_fast_timestamp(fast_timestamp_t *)

   For networking, change do_gettimeofday(&skb->stamp) into
   store_fast_timestamp(&skb->stamp)

3) fast_timestamp_to_timeval(fast_timestamp_t *, struct timeval *)

   For networking, change things that read the skb->stamp value
   into calls to fast_timestamp_to_timeval().

It is defined that the timeval given by fast_timestamp_to_timeval()
needs to be the same thing that do_gettimeofday() would have recorded
at the time store_fast_timestamp() was called.

Here is the default generic implementation that would go into
asm-generic/faststamp.h:

1) fast_timestamp_t is struct timeval
2) store_fast_timestamp() is do_gettimeofday()
3) fast_timestamp_to_timeval() merely copies the fast_timestamp_t
   into the passed in timeval.

And here is how an example implementation could work on sparc64:

1) fast_timestamp_t is a u64

2) store_fast_timestamp() reads the cpu cycle counter

3) fast_timestamp_to_timeval() records the difference between the
   current cpu cycle counter and the one recorded, it takes a sample
   of the current xtime value and adjusts it accordingly to account
   for the cpu cycle counter difference.

This only works because sparc64's cpu cycle counters are synchronized
across all cpus, they increase monotonically, and are guaranteed not
to overflow for at least 10 years.

Alpha, for example, cannot do it this way because its cpu cycle counter
register overflows too quickly to be useful.

Platforms with inter-cpu TSC synchronization issues will have some
trouble doing the same trick too, because one must properly handle
the case where the fast timestamp is converted to a timeval on a different
cpu from the one on which the fast timestamp was recorded.

Regardless, we could put the infrastructure in there now and arch folks
can work on implementations.  The generic implementation code, which is
what everyone will end up with at first, will cancel out to what we have
currently.

This is a pretty powerful idea that could be applied to other places,
not just the networking.
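
A minimal userspace sketch of the cycle-counter variant described
above, under loud assumptions: the "hardware" is simulated by the
variables cycles and wall, the 1 GHz rate in cycles_per_usec is
invented, and advance_usec() exists only to drive the simulation.
None of this is real kernel code; only the three API names come from
the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Simulated state, all hypothetical. */
static uint64_t cycles;				/* cpu cycle counter */
static const uint64_t cycles_per_usec = 1000;	/* assume 1 GHz */
static struct timeval wall;			/* stands in for xtime */

/* 1) The cheap representation: a raw cycle count. */
typedef uint64_t fast_timestamp_t;

/* 2) Store: just sample the cycle counter. */
static void store_fast_timestamp(fast_timestamp_t *ts)
{
	assert(ts);
	*ts = cycles;
}

/* 3) Lazy conversion: wall time now, minus the time elapsed since *ts. */
static void fast_timestamp_to_timeval(fast_timestamp_t *ts,
				      struct timeval *tv)
{
	uint64_t delta_us, now_us, then_us;

	assert(ts && tv && *ts <= cycles);
	delta_us = (cycles - *ts) / cycles_per_usec;
	now_us = (uint64_t)wall.tv_sec * 1000000u + (uint64_t)wall.tv_usec;
	assert(delta_us <= now_us);
	then_us = now_us - delta_us;
	tv->tv_sec = (time_t)(then_us / 1000000u);
	tv->tv_usec = (suseconds_t)(then_us % 1000000u);
}

/* Simulation helper: counter and wall clock advance together. */
static void advance_usec(uint64_t us)
{
	uint64_t t = (uint64_t)wall.tv_usec + us;

	cycles += us * cycles_per_usec;
	wall.tv_sec += (time_t)(t / 1000000u);
	wall.tv_usec = (suseconds_t)(t % 1000000u);
}
```

The sketch models a single, monotonic counter, so it sidesteps exactly
the cases called out above: counters that wrap quickly and counters
that are not synchronized across cpus.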

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fast timestamps
  2003-11-27  0:01                     ` Fast timestamps David S. Miller
@ 2003-11-27  0:30                       ` Mitchell Blank Jr
  2003-11-27  1:57                       ` Ben Greear
  1 sibling, 0 replies; 38+ messages in thread
From: Mitchell Blank Jr @ 2003-11-27  0:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

David S. Miller wrote:
> Ben Greear <greearb@candelatech.com> wrote:
> > I'll try to write up a patch that uses the TSC and lazy conversion
> > to timeval as soon as I get the rx-all and rx-fcs code happily
> > into the kernel....
> > 
> > Assuming TSC is very fast and the conversion is accurate enough, I think
> > this can give good results....
> 
> I'm amazed that you will be able to write a fast_timestamp
> implementation without even seeing the API I had specified
> to the various arch maintainers :-)

Also, anyone interested in doing this should probably re-read the thread
on netdev from a couple months back about this, since we hashed out some
implementation details wrt SMP efficiency:
  http://oss.sgi.com/projects/netdev/archive/2003-10/msg00032.html

Although reading this thread I'm feeling that Andi is probably right -
are there really any apps that couldn't cope with a small inaccuracy of the
first ioctl-fetched timestamp?  I really doubt it.  Basically there are
two common cases:
  1. System under reasonable network load: in this case the tcpdump (or
     whatever) probably will get the packet soon after it arrived, so
     the timestamp we compute for the first packet won't be very far off.
  2. System under heavy network load: the card's hardware rx queues are
     probably pretty full so our timestamps won't be very accurate
     no matter what we do

Given that the timestamp is already inexact it seems like a fine idea to
trade a tiny amount of accuracy for a potentially significant performance
improvement.

-Mitch

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fast timestamps
  2003-11-27  0:01                     ` Fast timestamps David S. Miller
  2003-11-27  0:30                       ` Mitchell Blank Jr
@ 2003-11-27  1:57                       ` Ben Greear
  1 sibling, 0 replies; 38+ messages in thread
From: Ben Greear @ 2003-11-27  1:57 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, linux-kernel

David S. Miller wrote:
> On Wed, 26 Nov 2003 15:41:52 -0800
> Ben Greear <greearb@candelatech.com> wrote:
> 
> 
>>I'll try to write up a patch that uses the TSC and lazy conversion
>>to timeval as soon as I get the rx-all and rx-fcs code happily
>>into the kernel....
>>
>>Assuming TSC is very fast and the conversion is accurate enough, I think
>>this can give good results....
> 
> 
> I'm amazed that you will be able to write a fast_timestamp
> implementation without even seeing the API I had specified
> to the various arch maintainers :-)

Well, I would only aim at x86, with generic code for the
rest of the architectures.  The truth is, I'm sure others would
be better/faster at it than me, but we keep discussing it, and it
never gets done, so unless someone beats me to it, I'll take a stab
at it...  Might be after Christmas though, busy December coming up!

I agree with your approach below.  One thing I was thinking about:
is it possible that two threads ask for the timestamp of a single skb
concurrently?  If so, we may need a lock if we want to cache the conversion
to gettimeofday units....  Of course, the case where multiple readers want
the timestamp for a single skb may be too rare to warrant caching...

Ben

> 
> ====================
> 
> But at the base I say we need three things:
> 
> 1) Some kind of fast_timestamp_t.  Its defining property is that it stores
>    enough information at time "T" such that at time "T + something"
>    the fast_timestamp_t can be converted to what the timeval was back at
>    time "T".
> 
>    For networking, make skb->stamp into this type.
> 
> 2) store_fast_timestamp(fast_timestamp_t *)
> 
>    For networking, change do_gettimeofday(&skb->stamp) into
>    store_fast_timestamp(&skb->stamp)
> 
> 3) fast_timestamp_to_timeval(fast_timestamp_t *, struct timeval *)
> 
>    For networking, change things that read the skb->stamp value
>    into calls to fast_timestamp_to_timeval().
> 
> It is defined that the timeval given by fast_timestamp_to_timeval()
> needs to be the same thing that do_gettimeofday() would have recorded
> at the time store_fast_timestamp() was called.
> 
> Here is the default generic implementation that would go into
> asm-generic/faststamp.h:
> 
> 1) fast_timestamp_t is struct timeval
> 2) store_fast_timestamp() is do_gettimeofday()
> 3) fast_timestamp_to_timeval() merely copies the fast_timestamp_t
>    into the passed in timeval.
> 
> And here is how an example implementation could work on sparc64:
> 
> 1) fast_timestamp_t is a u64
> 
> 2) store_fast_timestamp() reads the cpu cycle counter
> 
> 3) fast_timestamp_to_timeval() records the difference between the
>    current cpu cycle counter and the one recorded, it takes a sample
>    of the current xtime value and adjusts it accordingly to account
>    for the cpu cycle counter difference.
> 
> This only works because sparc64's cpu cycle counters are synchronized
> across all cpus, they increase monotonically, and are guaranteed not
> to overflow for at least 10 years.
> 
> Alpha, for example, cannot do it this way because its cpu cycle counter
> register overflows too quickly to be useful.
> 
> Platforms with inter-cpu TSC synchronization issues will have some
> trouble doing the same trick too, because one must properly handle
> the case where the fast timestamp is converted to a timeval on a different
> cpu from the one on which the fast timestamp was recorded.
> 
> Regardless, we could put the infrastructure in there now and arch folks
> can work on implementations.  The generic implementation code, which is
> what everyone will end up with at first, will cancel out to what we have
> currently.
> 
> This is a pretty powerful idea that could be applied to other places,
> not just the networking.


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 19:19         ` Diego Calleja García
  2003-11-26 19:59           ` Mike Fedyk
@ 2003-11-27  3:54           ` Bill Huey
  1 sibling, 0 replies; 38+ messages in thread
From: Bill Huey @ 2003-11-27  3:54 UTC (permalink / raw)
  To: Diego Calleja García
  Cc: Mike Fedyk, john, ak, davem, linux-kernel, Bill Huey (hui)

On Wed, Nov 26, 2003 at 08:19:03PM +0100, Diego Calleja García wrote:
> It works here. I don't know if those numbers represent anything for networking.
> Some of the benchmarks look more like "vm benchmarking". And are the ones
> which measure latency valid, considering that BSDs are lacking "preempt"?
> (shooting in the dark)

FreeBSD-current is fully preemptive. The preempt patch, which adds
preemption points, is meaningless in that context.

bill


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26 22:58         ` Andi Kleen
@ 2003-11-27 12:16           ` Ingo Oeser
  0 siblings, 0 replies; 38+ messages in thread
From: Ingo Oeser @ 2003-11-27 12:16 UTC (permalink / raw)
  To: Andi Kleen, arjanv; +Cc: davem, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday 26 November 2003 23:58, Andi Kleen wrote:
> On Wed, 26 Nov 2003 22:34:10 +0100
> Arjan van de Ven <arjanv@redhat.com> wrote:
> > question: do we need a timestamp for every packet or can we do one
> > timestamp per irq-context entry ? (eg one timestamp at irq entry time we
> > do anyway and keep that for all packets processed in the softirq)
>
> If people want the timestamp they usually want it to be accurate
> (e.g. for tcpdump etc.). Of course there is already a lot of jitter
> in this information because it is done relatively late in the device
> driver (long after the NIC has received the packet).
>
> Just most people never care about this at all....

Yes, the people who don't care just open a SOCK_STREAM or SOCK_DGRAM. I
don't see any field in msghdr that contains the time.

Other people have packet sockets (or other special stuff) opened, which
are usually bound to a device or to a special RX/TX path. So we know
which device needs it and which does not.

If in doubt, there could be a sysctl option for exact time per device
or for all.

But I'm not really that familiar with the networking code, so please
ignore my ignorance on any issues here.


Regards

Ingo Oeser

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/xesDU56oYWuOrkARAr1sAJ9h/EywUCb9wGVCZiW9GbivMiEVsACghj74
dE4EdzeW84U7QcMi/o+Q9qE=
=70Cm
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26  0:15 Mr. BOFH
  2003-11-26  2:30 ` David S. Miller
@ 2003-11-26  5:41 ` Valdis.Kletnieks
  1 sibling, 0 replies; 38+ messages in thread
From: Valdis.Kletnieks @ 2003-11-26  5:41 UTC (permalink / raw)
  To: Mr. BOFH; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 462 bytes --]

On Tue, 25 Nov 2003 16:15:12 PST, "Mr. BOFH" <icerbofh@hotmail.com>  said:
> 
> Sun has announced that they have redone their TCP/IP stack and are showing,
> in some instances, a 30% improvement over Linux....
> 
> http://www.theregister.co.uk/content/61/33440.html

Hmm.. IBM tried this same idea with their 8232 Ethernet controller
(basically, an 'industrial' PC with a 3Com card and a bus&tag card)
and offload of some TCP/IP functionality back in 1988 or so.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fire Engine??
  2003-11-26  0:15 Mr. BOFH
@ 2003-11-26  2:30 ` David S. Miller
  2003-11-26  5:41 ` Valdis.Kletnieks
  1 sibling, 0 replies; 38+ messages in thread
From: David S. Miller @ 2003-11-26  2:30 UTC (permalink / raw)
  To: Mr. BOFH; +Cc: linux-kernel

On Tue, 25 Nov 2003 16:15:12 -0800
"Mr. BOFH" <icerbofh@hotmail.com> wrote:

> http://www.theregister.co.uk/content/61/33440.html

This was amusing to read.  Let's read the claim carefully,
shall we?

	"We worked hard on efficiency, and we now measure,
	 at a given network workload on identical x86 hardware,
	 we use 30 percent less CPU than Linux."

So his claim is that, in their measurements, "CPU utilization"
was lower in their stack.  Was he using 2.6.x and TSO capable
cards on the Linux side?  If not, it's not apples to apples
against our current and upcoming technology.

And while his CPU utilization claim is interesting (I bet that gain
would go to zero if they'd used Linux TSO in 2.6.x), were the
networking bandwidth and latency any better as a result?  I think it's
not by accident that the claim was phrased the way it was.

In fact, I bet their connection setup/teardown latency will go in the
toilet with this stuff and Solaris was already horrible in this area.
It is a well established fact that TOE technologies have this problem
because of how the socket setup/teardown operation with TOE cards
requires the OS to go over the bus a few times.

I'm not worried at all about Sun's fire engine.  It's preliminary
technology, and they are going to discover all of the problems TOE
stuff has that I've discussed several times on this list.

They even mention that they don't support any current generation
shipping TOE cards yet.  At least I offer a cpu utilization reduction
optimization (TSO in 2.6.x) with multiple implementations on current
generation hardware (e1000, tg3, etc.).

I fully welcome them to put Linux up against their incredible fire
engine crap in a sanctioned specweb run on identical hardware.  :)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Fire Engine??
@ 2003-11-26  0:15 Mr. BOFH
  2003-11-26  2:30 ` David S. Miller
  2003-11-26  5:41 ` Valdis.Kletnieks
  0 siblings, 2 replies; 38+ messages in thread
From: Mr. BOFH @ 2003-11-26  0:15 UTC (permalink / raw)
  To: linux-kernel


Sun has announced that they have redone their TCP/IP stack and are showing,
in some instances, a 30% improvement over Linux....

http://www.theregister.co.uk/content/61/33440.html



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2003-11-27 12:18 UTC | newest]

Thread overview: 38+ messages
     [not found] <BAY1-DAV15JU71pROHD000040e2@hotmail.com.suse.lists.linux.kernel>
     [not found] ` <20031125183035.1c17185a.davem@redhat.com.suse.lists.linux.kernel>
2003-11-26  9:53   ` Fire Engine?? Andi Kleen
2003-11-26 11:35     ` John Bradford
2003-11-26 18:50       ` Mike Fedyk
2003-11-26 19:19         ` Diego Calleja García
2003-11-26 19:59           ` Mike Fedyk
2003-11-27  3:54           ` Bill Huey
2003-11-26 15:00     ` Trond Myklebust
2003-11-26 23:01       ` Andi Kleen
2003-11-26 23:23         ` Trond Myklebust
2003-11-26 23:38           ` Andi Kleen
2003-11-26 19:30     ` David S. Miller
2003-11-26 19:58       ` Paul Menage
2003-11-26 20:03         ` David S. Miller
2003-11-26 22:29           ` Andi Kleen
2003-11-26 22:36             ` David S. Miller
2003-11-26 22:56               ` Andi Kleen
2003-11-26 23:13                 ` David S. Miller
2003-11-26 23:29                   ` Andi Kleen
2003-11-26 23:41                   ` Ben Greear
2003-11-27  0:01                     ` Fast timestamps David S. Miller
2003-11-27  0:30                       ` Mitchell Blank Jr
2003-11-27  1:57                       ` Ben Greear
2003-11-26 20:01       ` Fire Engine?? Jamie Lokier
2003-11-26 20:04         ` David S. Miller
2003-11-26 21:54         ` Pekka Pietikainen
2003-11-26 20:22       ` Theodore Ts'o
2003-11-26 21:02         ` David S. Miller
2003-11-26 21:24           ` Jamie Lokier
2003-11-26 21:38             ` David S. Miller
2003-11-26 23:43               ` Jamie Lokier
2003-11-26 21:34       ` Arjan van de Ven
2003-11-26 22:58         ` Andi Kleen
2003-11-27 12:16           ` Ingo Oeser
2003-11-26 22:39       ` Andi Kleen
2003-11-26 22:46         ` David S. Miller
2003-11-26  0:15 Mr. BOFH
2003-11-26  2:30 ` David S. Miller
2003-11-26  5:41 ` Valdis.Kletnieks
