linuxppc-dev.lists.ozlabs.org archive mirror
* [MPC52xx]Latency issue with DMA on FEC
@ 2010-12-01  8:16 Jean-Michel Hautbois
  2010-12-01  9:59 ` Jean-Michel Hautbois
  2010-12-01 14:52 ` Steven Rostedt
  0 siblings, 2 replies; 7+ messages in thread
From: Jean-Michel Hautbois @ 2010-12-01  8:16 UTC (permalink / raw)
  To: linuxppc-dev, linux-rt-users; +Cc: Eric Dumazet, Steven Rostedt

Hi lists !

I measured the latency and the jitter of the RX and TX ethernet paths
on my MPC5200 board.
The RX path is quite good, but the TX path can be slow.

[ 1218.976762] [mpc52xx_fec_start_xmit]Delay >30us for dma_map_single => 76364 ns
[ 1219.188405] [mpc52xx_fec_tx_interrupt]Delay >30us for dma_unmap_single => 34515 ns
[ 1220.628785] [mpc52xx_fec_start_xmit]Delay >30us for bcom_submit_next_buffer => 97273 ns
[ 1225.776784] [mpc52xx_fec_tx_interrupt]Delay >30us for dma_unmap_single => 95273 ns
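
For readers who want to reproduce this kind of report: the instrumentation that produced these lines was not posted, so the following is only a user-space sketch of the idea (timestamp before and after a call, log when the difference exceeds 30 us). The names now_ns() and report_if_slow() are hypothetical, and in-kernel code would use a kernel clock source such as ktime_get() rather than clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define DELAY_THRESHOLD_NS 30000LL /* 30 us, the threshold used in the log */

/* Hypothetical helper: read a monotonic clock, in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: print a line in the same shape as the log when
 * the measured interval exceeds the threshold.
 * Returns 1 if a line was printed, 0 otherwise.
 */
static int report_if_slow(const char *caller, const char *what,
                          int64_t start_ns, int64_t end_ns)
{
    int64_t delta = end_ns - start_ns;

    if (delta <= DELAY_THRESHOLD_NS)
        return 0;
    printf("[%s]Delay >30us for %s => %lld ns\n",
           caller, what, (long long)delta);
    return 1;
}
```

A call site would take now_ns() before and after the operation of interest and pass both values to report_if_slow(). Note that printing from the measured path itself perturbs later measurements.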

As one can see, this is obviously problematic.
The first function I analyzed is bcom_submit_next_buffer(). This
function doesn't do much, apart from a call to mb().

I have been looking at the "MPC603e RISC Microprocessor User's Manual",
especially the section named "2.3.4.7 Memory Synchronization
Instructions—UISA".

Here is a paragraph which explains a lot:

"The functions performed by the sync instruction normally take a
significant amount of time to complete; as a result, frequent use of
this instruction may adversely affect performance.
In addition, the number of cycles required to complete a sync
instruction depends on system parameters and on the processor's state
when the instruction is issued."

I am using a real-time kernel, so this is a problem: the time this
instruction takes is not deterministic.
Is there a way to avoid it?

I will now focus on the dma_map_single() and dma_unmap_single() functions...

Thanks in advance for your help,
Best Regards,

JM


* Re: [MPC52xx]Latency issue with DMA on FEC
  2010-12-01  8:16 [MPC52xx]Latency issue with DMA on FEC Jean-Michel Hautbois
@ 2010-12-01  9:59 ` Jean-Michel Hautbois
  2010-12-01 14:52 ` Steven Rostedt
  1 sibling, 0 replies; 7+ messages in thread
From: Jean-Michel Hautbois @ 2010-12-01  9:59 UTC (permalink / raw)
  To: linuxppc-dev, linux-rt-users; +Cc: Eric Dumazet, Steven Rostedt

2010/12/1 Jean-Michel Hautbois <jhautbois@gmail.com>:
> Hi lists !
>
> I measured the latency and the jitter of the RX and TX ethernet paths
> on my MPC5200 board.
> The RX path is quite good, but the TX path can be slow.
>
> [ 1218.976762] [mpc52xx_fec_start_xmit]Delay >30us for dma_map_single => 76364 ns
> [ 1219.188405] [mpc52xx_fec_tx_interrupt]Delay >30us for dma_unmap_single => 34515 ns
> [ 1220.628785] [mpc52xx_fec_start_xmit]Delay >30us for bcom_submit_next_buffer => 97273 ns
> [ 1225.776784] [mpc52xx_fec_tx_interrupt]Delay >30us for dma_unmap_single => 95273 ns
>
> As one can see, this is obviously problematic.
> The first function I analyzed is bcom_submit_next_buffer() => This
> function doesn't do lots of things, except a call to mb().
>
> I have been looking to the "MPC603e RISC Microprocessor User's Manual"
> and especially the chapter named "2.3.4.7 Memory Synchronization
> Instructions—UISA".
>
> Here is a paragraph which explains a lot :
>
> "The functions performed by the sync instruction normally take a
> significant amount of time
> to complete; as a result, frequent use of this instruction may
> adversely affect performance.
> In addition, the number of cycles required to complete a sync
> instruction depends on system
> parameters and on the processor's state when the instruction is issued."
>
> I am using a real time kernel, and this is a problem, as it is not
> deterministic to use this instruction.
> Is there a way to avoid this ?
>
> I will now focus on the dma_map_single() and dma_unmap_single functions...
>
> Thanks in advance for your help,
> Best Regards,
>
> JM
>

dma_map_single() and dma_unmap_single() both end up executing the same
instruction (sync), because they clean the cache.
The eieio instruction doesn't seem to be any faster, and since the cache
is not inhibited, I don't think it is a good way to do this anyway.

The delay introduced by these instructions can be really big (about
70-90 µs), whereas in most cases it is relatively good (about 10-20 µs).
This jitter is a problem in my use case, and I think I am not the only
one :).

One other thing to mention: I am using small packets (about 200 bytes).

JM


* Re: [MPC52xx]Latency issue with DMA on FEC
  2010-12-01  8:16 [MPC52xx]Latency issue with DMA on FEC Jean-Michel Hautbois
  2010-12-01  9:59 ` Jean-Michel Hautbois
@ 2010-12-01 14:52 ` Steven Rostedt
  2010-12-01 15:09   ` David Laight
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2010-12-01 14:52 UTC (permalink / raw)
  To: Jean-Michel Hautbois; +Cc: linuxppc-dev, linux-rt-users, Eric Dumazet

On Wed, 2010-12-01 at 09:16 +0100, Jean-Michel Hautbois wrote:
> Hi lists !
> 
> I measured the latency and the jitter of the RX and TX ethernet paths
> on my MPC5200 board.
> The RX path is quite good, but the TX path can be slow.
> 
> [ 1218.976762] [mpc52xx_fec_start_xmit]Delay >30us for dma_map_single
> => 76364 ns
> [ 1219.188405] [mpc52xx_fec_tx_interrupt]Delay >30us for
> dma_unmap_single => 34515 ns
> [ 1220.628785] [mpc52xx_fec_start_xmit]Delay >30us for
> bcom_submit_next_buffer => 97273 ns
> [ 1225.776784] [mpc52xx_fec_tx_interrupt]Delay >30us for
> dma_unmap_single => 95273 ns
> 
> As one can see, this is obviously problematic.
> The first function I analyzed is bcom_submit_next_buffer() => This
> function doesn't do lots of things, except a call to mb().
> 
> I have been looking to the "MPC603e RISC Microprocessor User's Manual"
> and especially the chapter named "2.3.4.7 Memory Synchronization
> Instructions—UISA".
> 
> Here is a paragraph which explains a lot :
> 
> "The functions performed by the sync instruction normally take a
> significant amount of time
> to complete; as a result, frequent use of this instruction may
> adversely affect performance.
> In addition, the number of cycles required to complete a sync
> instruction depends on system
> parameters and on the processor's state when the instruction is issued."
> 
> I am using a real time kernel, and this is a problem, as it is not
> deterministic to use this instruction.
> Is there a way to avoid this ?

Don't use that hardware.

When working with drivers there are times you must sync with the device.
And if the device is nondeterministic, then find another set of hardware
to use. Unfortunately, I think you may not find any.

A mb() is usually used if you do a write to a device and then read from it.
Without it, the CPU could perform the read before the write, which
would give you an incorrect result. There's no other way around that.
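
As an illustration of the write-then-read pattern described above, here is a user-space sketch. It is an analogy only: struct fake_dev, full_barrier() and issue_cmd_and_read_status() are hypothetical names, the "device" is ordinary memory, and the C11 fence merely stands in for the kernel's mb(), which real driver code would use together with the proper I/O accessors.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical register block; a real driver would map this from the device. */
struct fake_dev {
    volatile uint32_t cmd;
    volatile uint32_t status;
};

/* Stand-in for mb(): a full fence between earlier and later accesses. */
static inline void full_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Write a command, then read the status, with a barrier in between. */
static uint32_t issue_cmd_and_read_status(struct fake_dev *dev, uint32_t cmd)
{
    dev->cmd = cmd;     /* write to the device */
    full_barrier();     /* keep the read below from being performed first */
    return dev->status; /* read back from the device */
}
```

The point of the sketch is the placement of the barrier, between the store and the dependent load, not its cost; the cost on the MPC5200 is what the thread is about.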

-- Steve

> 
> I will now focus on the dma_map_single() and dma_unmap_single functions...
> 
> Thanks in advance for your help,
> Best Regards,
> 
> JM


* RE: [MPC52xx]Latency issue with DMA on FEC
  2010-12-01 14:52 ` Steven Rostedt
@ 2010-12-01 15:09   ` David Laight
  2010-12-01 15:15     ` Jean-Michel Hautbois
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: David Laight @ 2010-12-01 15:09 UTC (permalink / raw)
  To: Steven Rostedt, Jean-Michel Hautbois
  Cc: linuxppc-dev, linux-rt-users, Eric Dumazet

> A mb() is usually used if you do a write to device and read from it.
> With out it, the CPU could perform the read before the write, which
> would give you an incorrect result. There's no other way around that.

Possibly the synchronisation functions are doing significantly
more work than is required.

I was looking at the in_le32() and out_le32() functions for the
ppc e300 (and maybe others).

The out_le32() contains a 'sync' instruction - this may only
be needed after a series of writes (eg just before a command).

The iosync() function just adds a 'sync' and can be used as needed.
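
A sketch of the idea in the paragraphs above: do a series of writes without a barrier on each one, then issue a single barrier just before the command write. Everything here is hypothetical and runs on plain memory: raw_store_le32(), io_barrier() and submit() are made-up names, the register layout is invented, the byte-wise stores are only there to keep the sketch portable, and the C11 fence is a stand-in for a real 'sync'.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical: store a 32-bit value in little-endian order, no barrier. */
static inline void raw_store_le32(volatile uint8_t *addr, uint32_t val)
{
    addr[0] = (uint8_t)(val);
    addr[1] = (uint8_t)(val >> 8);
    addr[2] = (uint8_t)(val >> 16);
    addr[3] = (uint8_t)(val >> 24);
}

/* Hypothetical stand-in for an explicit barrier issued only when needed. */
static inline void io_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Fill two parameter words, then one barrier, then the command word. */
static void submit(volatile uint8_t *regs, uint32_t buf_addr, uint32_t len)
{
    raw_store_le32(regs + 0, buf_addr);
    raw_store_le32(regs + 4, len);
    io_barrier();                /* single barrier after the series of writes */
    raw_store_le32(regs + 8, 1); /* command write */
}
```

Whether dropping the per-access barrier is safe depends on the device and on how the registers are mapped, so this shows the shape of the optimisation rather than a drop-in replacement for out_le32().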

The in_le32() not only contains the unwanted 'sync', but also
a 'twi' (trap immediate - NFI exactly what this does) and 'isync'.
The 'isync' is particularly horrid and unnecessary (it aborts
the instruction queue and refetches the opcode bytes).

The very slow in_le32() might be there to give semi-synchronous
traps on address fault - but unless the hardware is being probed
that really isn't necessary.

I did find st_le32() and ld_le32() in arch/powerpc/include/asm/swab.h
but had difficulty #including that version of swab.h!
    #include <../arch/powerpc/include/asm/swab.h>
worked - but isn't that nice.

	David


* Re: [MPC52xx]Latency issue with DMA on FEC
  2010-12-01 15:09   ` David Laight
@ 2010-12-01 15:15     ` Jean-Michel Hautbois
  2010-12-01 20:34     ` Micha Nelissen
  2010-12-01 21:16     ` Scott Wood
  2 siblings, 0 replies; 7+ messages in thread
From: Jean-Michel Hautbois @ 2010-12-01 15:15 UTC (permalink / raw)
  To: David Laight; +Cc: linuxppc-dev, Eric Dumazet, linux-rt-users, Steven Rostedt

2010/12/1 David Laight <David.Laight@aculab.com>:
>
>> A mb() is usually used if you do a write to device and read from it.
>> With out it, the CPU could perform the read before the write, which
>> would give you an incorrect result. There's no other way around that.
>
> Possibly the synchronisation functions are doing significantly
> more work than is required.
>
> I was looking at the in_le32() and out_le32() functions for the
> ppc e300 (and maybe others).
>
> The out_le32() contains a 'sync' instruction - this may only
> be needed after a series of writes (eg just before a command).
>
> The iosync() function just adds a 'sync' and can be used as needed.
>
> The in_le32() not only contains the unwanted 'sync', but also
> a 'twi' (trap immediate - NFI exactly what this does) and 'isync'.
> The 'isync' is particularly horrid and unnecessary (aborts
> the instruction queue and refeches the opcode bytes).
>
> The very slow in_le32() might be there to give semi-synchronous
> traps on address fault - but unless the hardware is being probed
> that really isn't necessary.
>
> I did find st_le32() and ld_le32() in arch/powerpc/include/asm/swab.h
> but had difficulty #including that version of swab.h!
>     #include <../arch/powerpc/include/asm/swab.h>
> worked - but isn't that nice.
>
>         David

Yes, I was also looking at in_be16 and out_be16, and my thoughts were
exactly the same.
I think the HW I am using is not a good one, but this is not
sufficient to explain the behaviour.
These instructions are called, for instance, from bcom_enable_task().


* Re: [MPC52xx]Latency issue with DMA on FEC
  2010-12-01 15:09   ` David Laight
  2010-12-01 15:15     ` Jean-Michel Hautbois
@ 2010-12-01 20:34     ` Micha Nelissen
  2010-12-01 21:16     ` Scott Wood
  2 siblings, 0 replies; 7+ messages in thread
From: Micha Nelissen @ 2010-12-01 20:34 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev, Eric Dumazet, Jean-Michel Hautbois, Steven Rostedt,
	linux-rt-users

David Laight wrote:
> The in_le32() not only contains the unwanted 'sync', but also
> a 'twi' (trap immediate - NFI exactly what this does) and 'isync'.
> The 'isync' is particularly horrid and unnecessary (aborts
> the instruction queue and refeches the opcode bytes).

I also wondered about this some time ago, and this is what I could find:
it's a special sequence that is detected by the bus error handler
(a machine check exception happens on an I/O error, i.e. an aborted PCI
transaction or some such), so that it can 'recover' by continuing at the
next instruction (and setting an error variable).

Perhaps there is no other way to recover reliably from bus errors?

Micha


* Re: [MPC52xx]Latency issue with DMA on FEC
  2010-12-01 15:09   ` David Laight
  2010-12-01 15:15     ` Jean-Michel Hautbois
  2010-12-01 20:34     ` Micha Nelissen
@ 2010-12-01 21:16     ` Scott Wood
  2 siblings, 0 replies; 7+ messages in thread
From: Scott Wood @ 2010-12-01 21:16 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev, Eric Dumazet, Jean-Michel Hautbois, Steven Rostedt,
	linux-rt-users

On Wed, 1 Dec 2010 15:09:54 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> The in_le32() not only contains the unwanted 'sync', but also
> a 'twi' (trap immediate - NFI exactly what this does) and 'isync'.

It turns a data dependency into a flow dependency.  It's basically equivalent to:

lwz	rX, ...
cmpw	rX, rX
bne	1f
1: isync

> The 'isync' is particularly horrid and unnecessary (aborts
> the instruction queue and refeches the opcode bytes)

The isync makes sure that the twi has completed before proceeding.

Note that the guarded, cache-inhibited load itself can be pretty
painful -- the core can't restart it, so it must complete before you
can take an interrupt.

> The very slow in_le32() might be there to give semi-synchronous
> traps on address fault - but unless the hardware is being probed
> that really isn't necessary.

There are times when you really want to be sure that the I/O is
finished before proceeding with something that isn't a load/store and
thus can't be serialized with normal barriers.

E.g. you're about to execute instructions in a physical address window
that you just set up (or even just create a non-guarded mapping to it
-- could get speculative accesses any time), or you just masked an
interrupt at the PIC (with a readback to flush) and are about to enable
MSR[EE].

Most of the time, though, it's overkill.  It should probably be an
alternate accessor form, or maybe a wait_for_io() wrapper -- if it can
be shown to make a real performance difference.

-Scott

