linux-kernel.vger.kernel.org archive mirror
* CONFIG_PACKET_MMAP revisited
@ 2003-10-29  4:09 odain2
  2003-10-29  4:50 ` Jamie Lokier
  2003-11-06 11:08 ` Gianni Tedesco
  0 siblings, 2 replies; 6+ messages in thread
From: odain2 @ 2003-10-29  4:09 UTC (permalink / raw)
  To: linux-kernel

I've been looking into faster ways to do packet captures and I stumbled on
the following discussion on the Linux Kernel mailing list:

http://www.ussg.iu.edu/hypermail/linux/kernel/0202.2/1173.html

In that discussion Jamie Lokier suggested having a memory buffer that's
shared between user and kernel space and having the NIC do DMA transfers
directly to that buffer as an alternative to using Alexey's shared ring
buffer stuff.  The argument was that this would avoid the memory copy that
the kernel does from the DMA buffer to the memory mapped ring buffer. 
However, Alan Cox pointed out that the main cost of the memory copy is
getting the data from system memory (where the NIC put it via DMA) into the
L1 cache (DMA doesn't do any cache coherence so it can't go there
directly).  The memory copy (presumably from L1 cache to L1 cache) is
insignificant compared to this cost and since you'll need to get the data
into L1 cache to use it anyway, the memory copy is virtually free.

I'm wondering if this takes all of the costs into account.  If I understand
how this works the user space application can't get at the packet without a
context switch so that the kernel can first copy the packet to the shared
buffer.  The cost of the context switch is pretty high and this seems to me
to be the main bottleneck.  I believe that in normal operation each packet
(or with NICs that do interrupt coalescing, every n packets) causes an
interrupt which causes a context switch, the kernel then copies the data
from the DMA buffer to the shared buffer and does a RETI.  That's fairly
expensive.  If, on the other hand, data could be copied directly to
user-space accessible memory the NIC wouldn't need to generate any
interrupts and the kernel wouldn't need to get involved at all (this
assumes a NIC that can be configured not to generate any interrupts).  The
user space application could then poll the shared buffer and process
packets as fast as possible (some synchronization mechanism is clearly
needed here, but I think some of the high-end programmable NICs could do
this).  Would this not be significantly more efficient than the current
implementation?
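
To make the synchronization question concrete, here is a minimal
single-producer/single-consumer sketch in C of the kind of ownership
handoff such a shared buffer would need.  It is not the PACKET_MMAP
implementation or any NIC's interface: the names (struct ring, ring_put,
ring_get) and the sizes are hypothetical, and the producer is ordinary
code standing in for the NIC or the kernel.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 8      /* hypothetical ring size */
#define SLOT_BYTES 2048   /* hypothetical per-packet slot size */

enum { OWNER_PRODUCER = 0, OWNER_CONSUMER = 1 };

struct slot {
    _Atomic int owner;              /* who may touch this slot right now */
    size_t len;                     /* valid bytes in data[] */
    unsigned char data[SLOT_BYTES];
};

struct ring {
    struct slot slots[SLOT_COUNT];
    unsigned head;                  /* next slot the producer fills */
    unsigned tail;                  /* next slot the consumer reads */
};

/* Producer side (stands in for the NIC or kernel).
 * Returns 0 on success, -1 if the slot is still owned by the consumer
 * (ring full) or the packet does not fit in a slot. */
static int ring_put(struct ring *r, const void *pkt, size_t len)
{
    struct slot *s = &r->slots[r->head % SLOT_COUNT];

    if (len > SLOT_BYTES ||
        atomic_load_explicit(&s->owner, memory_order_acquire) != OWNER_PRODUCER)
        return -1;
    memcpy(s->data, pkt, len);
    s->len = len;
    /* Release: the payload must be visible before ownership changes. */
    atomic_store_explicit(&s->owner, OWNER_CONSUMER, memory_order_release);
    r->head++;
    return 0;
}

/* Consumer side (the user space application).
 * Returns the number of bytes copied out, or -1 if no packet is ready. */
static long ring_get(struct ring *r, void *out, size_t cap)
{
    struct slot *s = &r->slots[r->tail % SLOT_COUNT];
    size_t n;

    if (atomic_load_explicit(&s->owner, memory_order_acquire) != OWNER_CONSUMER)
        return -1;
    n = s->len < cap ? s->len : cap;
    memcpy(out, s->data, n);
    /* Hand the slot back so the producer can reuse it. */
    atomic_store_explicit(&s->owner, OWNER_PRODUCER, memory_order_release);
    r->tail++;
    return (long)n;
}
```

A consumer can call ring_get() in a loop without ever entering the
kernel; whether a real NIC could play the producer role this way is
exactly the open question.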

Thanks,
Oliver

PS: I'm not a mailing list subscriber so CCs on responses would be
appreciated.







* Re: CONFIG_PACKET_MMAP revisited
  2003-10-29  4:09 CONFIG_PACKET_MMAP revisited odain2
@ 2003-10-29  4:50 ` Jamie Lokier
  2003-11-06 11:08 ` Gianni Tedesco
  1 sibling, 0 replies; 6+ messages in thread
From: Jamie Lokier @ 2003-10-29  4:50 UTC (permalink / raw)
  To: odain2; +Cc: linux-kernel

odain2@mindspring.com wrote:
> Alan Cox pointed out that the main cost of the memory copy is
> getting the data from system memory (where the NIC put it via DMA)
> into the L1 cache (DMA doesn't do any cache coherence so it can't go
> there directly).  The memory copy (presumably from L1 cache to L1
> cache) is insignificant compared to this cost and since you'll need
> to get the data into L1 cache to use it anyway, the memory copy is
> virtually free.

After the data is copied, it's likely to be written from L1 to L2 (at
least) before that position in the ring buffer is rewritten, so that's
another overhead.

-- Jamie


* Re: CONFIG_PACKET_MMAP revisited
  2003-10-29  4:09 CONFIG_PACKET_MMAP revisited odain2
  2003-10-29  4:50 ` Jamie Lokier
@ 2003-11-06 11:08 ` Gianni Tedesco
  2003-11-06 14:13   ` Oliver Dain
  1 sibling, 1 reply; 6+ messages in thread
From: Gianni Tedesco @ 2003-11-06 11:08 UTC (permalink / raw)
  To: odain2; +Cc: linux-kernel


On Wed, 2003-10-29 at 05:09, odain2@mindspring.com wrote:
> I believe that in normal operation each packet
> (or with NICs that do interrupt coalescing, every n packets) causes an
> interrupt which causes a context switch, the kernel then copies the data
> from the DMA buffer to the shared buffer and does a RETI.  That's fairly
> expensive. 

The cost of handling that interrupt and doing an iret is unavoidable
(ignoring NAPI). The main point you are missing with the ring buffer is
that if packets come in at a fast enough rate, the usermode task never
context switches, because there is always data in the ring buffer, so it
loops in usermode forever.

The problem could be the packets are coming in just too slow to allow
the ring buffer to work properly and causing the application to sleep on
poll(2) every time. This would kill performance at pathological packet
rates I guess.

You could work around this by spinning for a few thousand spins before
calling poll(2) (or even indefinitely for that matter, and allow the
kernel to preempt you if need be).

-- 
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import
8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D




* Re: CONFIG_PACKET_MMAP revisited
  2003-11-06 11:08 ` Gianni Tedesco
@ 2003-11-06 14:13   ` Oliver Dain
  2003-11-06 14:31     ` Gianni Tedesco
  0 siblings, 1 reply; 6+ messages in thread
From: Oliver Dain @ 2003-11-06 14:13 UTC (permalink / raw)
  To: Gianni Tedesco, odain2; +Cc: linux-kernel

On Thursday November 6 2003 6:08 am, Gianni Tedesco wrote:
> On Wed, 2003-10-29 at 05:09, odain2@mindspring.com wrote:
> > I believe that in normal operation each packet
> > (or with NICs that do interrupt coalescing, every n packets) causes an
> > interrupt which causes a context switch, the kernel then copies the data
> > from the DMA buffer to the shared buffer and does a RETI.  That's fairly
> > expensive.
>
> The cost of handling that interrupt and doing an iret is unavoidable
> (ignoring NAPI). The main point you are missing with the ring buffer is
> that if packets come in at a fast enough rate, the usermode task never
> context switches, because there is always data in the ring buffer, so it
> loops in usermode forever.

It seems to me that it can't loop in user mode forever.  There is no way to 
get data into user mode (the ring buffer) without going through the kernel.
My understanding is that the NIC doesn't transfer directly to the user mode 
ring buffer, but rather to a different DMA buffer.  The kernel must copy it 
from the DMA buffer to the ring buffer. Thus once the user mode app has 
processed all the data in the ring buffer the kernel _must_ get involved to
get more data to user space.  Currently the data gets there because the NIC 
produces an interrupt for each packet (or for every few packets) and when the 
kernel handles these the data is copied to user space.  Then, as you point 
out, the cost of the RETI can't be avoided.  

NAPI tries to solve this problem.  I don't know much about NAPI, but as I
understand it, the idea is this: The cost of the RETIs and context switches
(which occur on each interrupt) can be reduced if the NIC doesn't produce an
interrupt for every packet but instead does interrupt coalescing.  But this
only goes so far.  If too many packets are coalesced, several costs remain:
the data copied by the kernel will no longer fit in the L1 cache, so we pay
the price of moving it there twice (once when the kernel copies the data from
main memory to the ring buffer and once when the user mode application reads
it out of the ring); the latency may become a problem; we've still got a
context switch every time the user mode application has processed everything
in the ring buffer (and perhaps more often); and we're still paying the price
of copying data from the DMA buffer to the ring.

However, if the NIC could transfer the data directly to user space it wouldn't 
need to cause an interrupt and the cost of the RETI and the context switch is 
avoided.  The user mode app really could process forever without sleeping at 
that point.

> The problem could be the packets are coming in just too slow to allow
> the ring buffer to work properly and causing the application to sleep on
> poll(2) every time. This would kill performance at pathological packet
> rates I guess.
>
> You could work around this by spinning for a few thousand spins before
> calling poll(2) (or even indefinitely for that matter, and allow the
> kernel to preempt you if need be).




* Re: CONFIG_PACKET_MMAP revisited
  2003-11-06 14:13   ` Oliver Dain
@ 2003-11-06 14:31     ` Gianni Tedesco
  2003-11-06 15:29       ` P
  0 siblings, 1 reply; 6+ messages in thread
From: Gianni Tedesco @ 2003-11-06 14:31 UTC (permalink / raw)
  To: Oliver Dain; +Cc: odain2, linux-kernel


On Thu, 2003-11-06 at 15:13, Oliver Dain wrote:
> It seems to me that it can't loop in user mode forever.  There is no way to 
> get data into user mode (the ring buffer) without going through the kernel.
> My understanding is that the NIC doesn't transfer directly to the user mode 
> ring buffer, but rather to a different DMA buffer.  The kernel must copy it 
> from the DMA buffer to the ring buffer. Thus once the user mode app has 
> processed all the data in the ring buffer the kernel _must_ get involved to
> get more data to user space.  Currently the data gets there because the NIC 
> produces an interrupt for each packet (or for every few packets) and when the 
> kernel handles these the data is copied to user space.  Then, as you point 
> out, the cost of the RETI can't be avoided.  

yes, in interrupt context. My point is that that *task* will never go in
to kernel mode, it will always be running in user mode.

> However, if the NIC could transfer the data directly to user space it wouldn't 
> need to cause an interrupt and the cost of the RETI and the context switch is 
> avoided.  The user mode app really could process forever without sleeping at 
> that point.

It would need to cause an interrupt to notify of the packet, unless the
program communicated directly with the NIC in polling mode. This is the
point of the mmap packet socket: it provides a *portable* ring buffer
structure so that userspace doesn't have to reimplement drivers.
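
For reference, a minimal sketch of how a program asks for that ring,
using the PACKET_RX_RING socket option and struct tpacket_req from
<linux/if_packet.h>.  The helper names make_ring_req and map_rx_ring
are hypothetical, the sizes are left to the caller, and error handling
is reduced to returning NULL.  Opening the socket needs CAP_NET_RAW.

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Describe a ring of block_nr blocks, each holding whole frames.
 * The kernel expects block_size to be a multiple of the page size and
 * frame_size to be a multiple of TPACKET_ALIGNMENT. */
static struct tpacket_req make_ring_req(unsigned block_size,
                                        unsigned block_nr,
                                        unsigned frame_size)
{
    struct tpacket_req req;

    req.tp_block_size = block_size;
    req.tp_block_nr = block_nr;
    req.tp_frame_size = frame_size;
    /* The frame count must match what fits in the blocks. */
    req.tp_frame_nr = frame_size ? (block_size / frame_size) * block_nr : 0;
    return req;
}

/* Open a packet socket, request the receive ring, and map it.
 * Returns the mapping (and the socket in *fd_out), or NULL on failure. */
static void *map_rx_ring(const struct tpacket_req *req, int *fd_out)
{
    size_t len = (size_t)req->tp_block_size * req->tp_block_nr;
    void *ring;
    int fd;

    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return NULL;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof *req) < 0) {
        close(fd);
        return NULL;
    }
    ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return ring;
}
```

The same calls work regardless of which NIC driver sits underneath,
which is the portability argument being made here.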

-- 
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import
8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D




* Re: CONFIG_PACKET_MMAP revisited
  2003-11-06 14:31     ` Gianni Tedesco
@ 2003-11-06 15:29       ` P
  0 siblings, 0 replies; 6+ messages in thread
From: P @ 2003-11-06 15:29 UTC (permalink / raw)
  To: Gianni Tedesco; +Cc: Oliver Dain, linux-kernel

Gianni Tedesco wrote:
> On Thu, 2003-11-06 at 15:13, Oliver Dain wrote:
> 
>>It seems to me that it can't loop in user mode forever.  There is no way to 
>>get data into user mode (the ring buffer) without going through the kernel.
>>My understanding is that the NIC doesn't transfer directly to the user mode 
>>ring buffer, but rather to a different DMA buffer.  The kernel must copy it 
>>from the DMA buffer to the ring buffer. Thus once the user mode app has 
>>processed all the data in the ring buffer the kernel _must_ get involved to
>>get more data to user space.  Currently the data gets there because the NIC 
>>produces an interrupt for each packet (or for every few packets) and when the 
>>kernel handles these the data is copied to user space.  Then, as you point 
>>out, the cost of the RETI can't be avoided.  
> 
> 
> yes, in interrupt context. My point is that that *task* will never go in
> to kernel mode, it will always be running in user mode.

In my experience (PIII 1.2GHz, i815, e100, NAPI), user mode
would read at most 7 packets at a time, even when artificial
busy loops were inserted. The max packet rate achieved was
around 120Kpps, but that was limited at the driver level.
Most of the CPU was consumed while doing this (measured with
cyclesoak, especially required since NAPI was used).
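
As a back-of-the-envelope illustration of those figures: if the whole
1.2GHz CPU were spent on capture at 120Kpps, that is a budget of about
10,000 cycles per packet.  This is only an upper bound, since "most of
the CPU" is not all of it and the clock is taken as exactly 1.2GHz.
The helper name cycles_per_packet is made up for this sketch.

```c
#include <stdint.h>

/* Upper bound on CPU cycles available per packet, assuming the whole
 * CPU is devoted to packet handling. Returns 0 for a zero packet rate. */
static uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t packets_per_sec)
{
    return packets_per_sec ? cpu_hz / packets_per_sec : 0;
}
```

For example, cycles_per_packet(1200000000, 120000) gives 10000.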

Pádraig.


