All of lore.kernel.org
 help / color / mirror / Atom feed
* PL310 errata workarounds
@ 2014-03-14 14:48 Russell King - ARM Linux
  2014-03-14 15:01 ` Russell King - ARM Linux
  2014-03-16 15:29 ` Russell King - ARM Linux
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-14 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

While looking at the now rather horrid L2C-2x0 code, there's some really
serious questions that have to be asked about some of these errata.

588369 affects PL310 up to and including R1P0, but is fixed in R2P0.
	This erratum requires a clean+invalidate request to be split into
	a separate clean request followed by an invalidate request.  This
	must be done with the debug register set to 0x03.

	If the debug register is not set to 0x03, then it's effectively
	a separate clean operation followed by a separate invalidate
	operation.

727915 affects PL310 up to and including R3P0, but is fixed in R3P1.
	This erratum concerns writes which may be discarded as a result
	of a clean+invalidate request, particularly when a write hits
	the cache line being requested between the clean stage and the
	invalidate stage.

Now, let's say that we have a PL310 R3P3 cache.  Linux is configured with
both workarounds enabled.  This has the effect that we never use the
clean+invalidate request.

However, Rob's patch in 74ddcdb868a8 (ARM: 7608/1: l2x0: Only set
.set_debug on PL310 r3p0 and earlier) has the effect that we no longer
to the debug dance on such a cache.

This means we're still splitting the clean+invalidate sequence into two
separate operations, so there's room for a write to hit the cache line
between the two operations and, therefore, be lost.

So, do we actually care about 727915, because through applying Rob's
change, we've effectively re-exposed the problem which this erratum
refers to on all L2 cache controllers where this is actually fixed,
and it's been like this for 14 months without any reported issue.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 14:48 PL310 errata workarounds Russell King - ARM Linux
@ 2014-03-14 15:01 ` Russell King - ARM Linux
  2014-03-14 16:02   ` Rob Herring
  2014-03-16 15:29 ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-14 15:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 14, 2014 at 02:48:35PM +0000, Russell King - ARM Linux wrote:
> While looking at the now rather horrid L2C-2x0 code, there's some really
> serious questions that have to be asked about some of these errata.
> 
> 588369 affects PL310 up to and including R1P0, but is fixed in R2P0.
> 	This erratum requires a clean+invalidate request to be split into
> 	a separate clean request followed by an invalidate request.  This
> 	must be done with the debug register set to 0x03.
> 
> 	If the debug register is not set to 0x03, then it's effectively
> 	a separate clean operation followed by a separate invalidate
> 	operation.
> 
> 727915 affects PL310 up to and including R3P0, but is fixed in R3P1.
> 	This erratum concerns writes which may be discarded as a result
> 	of a clean+invalidate request, particularly when a write hits
> 	the cache line being requested between the clean stage and the
> 	invalidate stage.
> 
> Now, let's say that we have a PL310 R3P3 cache.  Linux is configured with
> both workarounds enabled.  This has the effect that we never use the
> clean+invalidate request.
> 
> However, Rob's patch in 74ddcdb868a8 (ARM: 7608/1: l2x0: Only set
> .set_debug on PL310 r3p0 and earlier) has the effect that we no longer
> to the debug dance on such a cache.
> 
> This means we're still splitting the clean+invalidate sequence into two
> separate operations, so there's room for a write to hit the cache line
> between the two operations and, therefore, be lost.
> 
> So, do we actually care about 727915, because through applying Rob's
> change, we've effectively re-exposed the problem which this erratum
> refers to on all L2 cache controllers where this is actually fixed,
> and it's been like this for 14 months without any reported issue.

Another point here is... we don't even need 727915 at all: 727915 is
about the clean+invalidate by WAY request, not clean+invalidate by PA.
So really we should rip out references to this erratum because it's
not applicable.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 15:01 ` Russell King - ARM Linux
@ 2014-03-14 16:02   ` Rob Herring
  2014-03-14 17:57     ` Russell King - ARM Linux
  2014-03-16 11:52     ` Russell King - ARM Linux
  0 siblings, 2 replies; 76+ messages in thread
From: Rob Herring @ 2014-03-14 16:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 14, 2014 at 10:01 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Mar 14, 2014 at 02:48:35PM +0000, Russell King - ARM Linux wrote:
>> While looking at the now rather horrid L2C-2x0 code, there's some really
>> serious questions that have to be asked about some of these errata.
>>
>> 588369 affects PL310 up to and including R1P0, but is fixed in R2P0.
>>       This erratum requires a clean+invalidate request to be split into
>>       a separate clean request followed by an invalidate request.  This
>>       must be done with the debug register set to 0x03.
>>
>>       If the debug register is not set to 0x03, then it's effectively
>>       a separate clean operation followed by a separate invalidate
>>       operation.
>>
>> 727915 affects PL310 up to and including R3P0, but is fixed in R3P1.
>>       This erratum concerns writes which may be discarded as a result
>>       of a clean+invalidate request, particularly when a write hits
>>       the cache line being requested between the clean stage and the
>>       invalidate stage.
>>
>> Now, let's say that we have a PL310 R3P3 cache.  Linux is configured with
>> both workarounds enabled.  This has the effect that we never use the
>> clean+invalidate request.
>>
>> However, Rob's patch in 74ddcdb868a8 (ARM: 7608/1: l2x0: Only set
>> .set_debug on PL310 r3p0 and earlier) has the effect that we no longer
>> to the debug dance on such a cache.
>>
>> This means we're still splitting the clean+invalidate sequence into two
>> separate operations, so there's room for a write to hit the cache line
>> between the two operations and, therefore, be lost.
>>
>> So, do we actually care about 727915, because through applying Rob's
>> change, we've effectively re-exposed the problem which this erratum
>> refers to on all L2 cache controllers where this is actually fixed,
>> and it's been like this for 14 months without any reported issue.
>
> Another point here is... we don't even need 727915 at all: 727915 is
> about the clean+invalidate by WAY request, not clean+invalidate by PA.
> So really we should rip out references to this erratum because it's
> not applicable.

Is there really no possibility of a write occurring during a
clean+invalidate by WAY? There is the case where large clean+inv by PA
uses a clean+inv by WAY. I'd guess this path is fairly rare. We would
never hit it on highbank because we'd never see a 4MB range and DMA is
normally coherent. But do the cpu suspend or kexec paths have an
issue?

Back to 588369, I agree there is a problem here as a cache line write
could occur between the clean and invalidate. However, as long as DMA
buffers are not sharing a cache line with some other data we would not
see any problems. This is not guaranteed as l2x0_inv_range clearly
tries to handle this case, so we may just be getting lucky.

The obvious fix is to convert l2x0_flush_line to a function ptr which
just further complicates the code.

Rob

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 16:02   ` Rob Herring
@ 2014-03-14 17:57     ` Russell King - ARM Linux
  2014-03-14 19:14       ` Rob Herring
  2014-03-16 11:52     ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-14 17:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
> The obvious fix is to convert l2x0_flush_line to a function ptr which
> just further complicates the code.

I currently regard this code as unmaintainable, so right now I'm in the
process of completely rewriting it, going back to the specs and checking
what is required.  Sharing code between the L210, L220 and L310 looks
good from the point of view of keeping the line count down, but actually
they're quite different and have various different requirements.

Out of those, L220 is the odd one out: all operations are background
operations there, which means they have to be waited for.  Currently,
if you have one of these, and you build a multiplatform kernel which
includes PL310 support, you don't wait for any of these and you probably
see a lot of data corruption as a result.

PL210 and PL310 are very similar, but PL310 has all the nasty errata
against it.  Both of these have the nice behaviour that provided you
stick to the "atomic" operations (physical addressed operations, and
cache sync) and you don't need to do the errata workaround dances,
then you don't need the spinlocks... which means less contention on
a SMP system.

Besides, the L2 mess extends outside this file:

static int sirfsoc_pm_enter(suspend_state_t state)
{
...
                outer_flush_all();
                outer_disable();

So, outer_disable() is defined to disable the L2 cache in a way that
doesn't result in data loss to the system (which would crash it) so
why is there that outer_flush_all() call there...  *REMOVED*.

static int spear13xx_boot_secondary(unsigned int cpu, struct task_struct *idle)
{
...
        pen_release = cpu;
        flush_cache_all();
        outer_flush_all();

This is a very big hammer, and quite unnecessary - there's other
solutions to writing to "pen_release" and getting it visible which
are much more effective and don't require such a big hammer.  I've
committed a patch to remove this.

The last one is in the dcscb code, which seems to be a real unknown:

                /*
                 * This is a harmless no-op.  On platforms with a real
                 * outer cache this might either be needed or not,
                 * depending on where the outer cache sits.
                 */
                outer_flush_all();

hardly "harmless" and hardly a "no-op" either... this one I've left
but as I've no hope of running this code as no one seems willing to
assist me with getting Versatile Express running with anything but the
Cortex-A9 tile, I've a good mind to just delete the above.

The reason outer_flush_all() matters is that if we avoid the index/way
operations on the L2 cache, we can remove the locking on much of the L2
cache operations, which means less SMP contention on this stuff.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 17:57     ` Russell King - ARM Linux
@ 2014-03-14 19:14       ` Rob Herring
  2014-03-14 20:32         ` Russell King - ARM Linux
  2014-03-19 21:22         ` Marek Vasut
  0 siblings, 2 replies; 76+ messages in thread
From: Rob Herring @ 2014-03-14 19:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 14, 2014 at 12:57 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
>> The obvious fix is to convert l2x0_flush_line to a function ptr which
>> just further complicates the code.
>
> I currently regard this code as unmaintainable, so right now I'm in the
> process of completely rewriting it, going back to the specs and checking
> what is required.  Sharing code between the L210, L220 and L310 looks
> good from the point of view of keeping the line count down, but actually
> they're quite different and have various different requirements.
>
> Out of those, L220 is the odd one out: all operations are background
> operations there, which means they have to be waited for.  Currently,
> if you have one of these, and you build a multiplatform kernel which
> includes PL310 support, you don't wait for any of these and you probably
> see a lot of data corruption as a result.

Oops. I guess there are not many users as the L220 was primarily used
with the 1176 (and 11MP?) IIRC. I guess one of the primary 1176 based
platforms currently used is RPi, but it appears the BCM2835 doesn't
have an L2 (other than some custom L2 for the GPU).

> PL210 and PL310 are very similar, but PL310 has all the nasty errata
> against it.  Both of these have the nice behaviour that provided you
> stick to the "atomic" operations (physical addressed operations, and
> cache sync) and you don't need to do the errata workaround dances,
> then you don't need the spinlocks... which means less contention on
> a SMP system.

I would definitely like to see removing the spinlock. readl/writel can
be a hotspot because of this if you have them in any critical path.

Really, if you are flushing such large buffers that you need the
optimization, you should fix how the buffers are mapped rather than
optimizing the flush.

>
> Besides, the L2 mess extends outside this file:
>
> static int sirfsoc_pm_enter(suspend_state_t state)
> {
> ...
>                 outer_flush_all();
>                 outer_disable();
>
> So, outer_disable() is defined to disable the L2 cache in a way that
> doesn't result in data loss to the system (which would crash it) so
> why is there that outer_flush_all() call there...  *REMOVED*.

Isn't l2x0_disable racy in SMP? What happens to a write between the
flush and the disable? If it is only usable when running single core,
then we can remove the spinlock there. I don't think any of the custom
disable functions like highbank take the lock.

Rob

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 19:14       ` Rob Herring
@ 2014-03-14 20:32         ` Russell King - ARM Linux
  2014-03-19 21:22         ` Marek Vasut
  1 sibling, 0 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-14 20:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 14, 2014 at 02:14:06PM -0500, Rob Herring wrote:
> On Fri, Mar 14, 2014 at 12:57 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
> >> The obvious fix is to convert l2x0_flush_line to a function ptr which
> >> just further complicates the code.
> >
> > Out of those, L220 is the odd one out: all operations are background
> > operations there, which means they have to be waited for.  Currently,
> > if you have one of these, and you build a multiplatform kernel which
> > includes PL310 support, you don't wait for any of these and you probably
> > see a lot of data corruption as a result.
> 
> Oops. I guess there are not many users as the L220 was primarily used
> with the 1176 (and 11MP?) IIRC. I guess one of the primary 1176 based
> platforms currently used is RPi, but it appears the BCM2835 doesn't
> have an L2 (other than some custom L2 for the GPU).

I guess not as there hasn't been any reports about breakage there.

> > PL210 and PL310 are very similar, but PL310 has all the nasty errata
> > against it.  Both of these have the nice behaviour that provided you
> > stick to the "atomic" operations (physical addressed operations, and
> > cache sync) and you don't need to do the errata workaround dances,
> > then you don't need the spinlocks... which means less contention on
> > a SMP system.
> 
> I would definitely like to see removing the spinlock. readl/writel can
> be a hotspot because of this if you have them in any critical path.
> 
> Really, if you are flushing such large buffers that you need the
> optimization, you should fix how the buffers are mapped rather than
> optimizing the flush.

Exactly.

> >
> > Besides, the L2 mess extends outside this file:
> >
> > static int sirfsoc_pm_enter(suspend_state_t state)
> > {
> > ...
> >                 outer_flush_all();
> >                 outer_disable();
> >
> > So, outer_disable() is defined to disable the L2 cache in a way that
> > doesn't result in data loss to the system (which would crash it) so
> > why is there that outer_flush_all() call there...  *REMOVED*.
> 
> Isn't l2x0_disable racy in SMP? What happens to a write between the
> flush and the disable? If it is only usable when running single core,
> then we can remove the spinlock there. I don't think any of the custom
> disable functions like highbank take the lock.

Yes it is - which is why it should only ever be called as part of
turning the L2 cache off, which is only done when we're sure that we
have complete control over the system and we're single threaded.  So
yes, you're right that the spinlocks aren't needed there.

I've got most of it done now, but there's a few stumbling blocks: the
Aurora and Broadcom caches in there need to have their addiction to
the old ARM code removed.

Broadcom appears to be L2C-310, but it would be useful to know which
revision this is based upon (and therefore which errata we carry are
applicable to it.)  This makes a big difference because we now end up
with different function pointers for the two errata I mentioned (each
errata currently affect two different functions and don't overlap on
native ARM implementations.)

Aurora is completely opaque as to which ARM L2C it's based upon, and
this is quite a blocker for killing the old code.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 16:02   ` Rob Herring
  2014-03-14 17:57     ` Russell King - ARM Linux
@ 2014-03-16 11:52     ` Russell King - ARM Linux
  2014-03-17 15:04       ` Rob Herring
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-16 11:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
> On Fri, Mar 14, 2014 at 10:01 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 14, 2014 at 02:48:35PM +0000, Russell King - ARM Linux wrote:
> >> While looking at the now rather horrid L2C-2x0 code, there's some really
> >> serious questions that have to be asked about some of these errata.
> >>
> >> 588369 affects PL310 up to and including R1P0, but is fixed in R2P0.
> >>       This erratum requires a clean+invalidate request to be split into
> >>       a separate clean request followed by an invalidate request.  This
> >>       must be done with the debug register set to 0x03.
> >>
> >>       If the debug register is not set to 0x03, then it's effectively
> >>       a separate clean operation followed by a separate invalidate
> >>       operation.
> >>
> >> 727915 affects PL310 up to and including R3P0, but is fixed in R3P1.
> >>       This erratum concerns writes which may be discarded as a result
> >>       of a clean+invalidate request, particularly when a write hits
> >>       the cache line being requested between the clean stage and the
> >>       invalidate stage.
> >>
> >> Now, let's say that we have a PL310 R3P3 cache.  Linux is configured with
> >> both workarounds enabled.  This has the effect that we never use the
> >> clean+invalidate request.
> >>
> >> However, Rob's patch in 74ddcdb868a8 (ARM: 7608/1: l2x0: Only set
> >> .set_debug on PL310 r3p0 and earlier) has the effect that we no longer
> >> to the debug dance on such a cache.
> >>
> >> This means we're still splitting the clean+invalidate sequence into two
> >> separate operations, so there's room for a write to hit the cache line
> >> between the two operations and, therefore, be lost.
> >>
> >> So, do we actually care about 727915, because through applying Rob's
> >> change, we've effectively re-exposed the problem which this erratum
> >> refers to on all L2 cache controllers where this is actually fixed,
> >> and it's been like this for 14 months without any reported issue.
> >
> > Another point here is... we don't even need 727915 at all: 727915 is
> > about the clean+invalidate by WAY request, not clean+invalidate by PA.
> > So really we should rip out references to this erratum because it's
> > not applicable.
> 
> Is there really no possibility of a write occurring during a
> clean+invalidate by WAY? There is the case where large clean+inv by PA
> uses a clean+inv by WAY. I'd guess this path is fairly rare. We would
> never hit it on highbank because we'd never see a 4MB range and DMA is
> normally coherent. But do the cpu suspend or kexec paths have an
> issue?

To answer this point more fully: clean+invalidate by way is used in
two circumstances:

1. when disabling the cache.  Here, we must be single threaded, with
   interrupts disabled, and any other masters for the other cache
   under control so they don't spit out spurious writes while we're
   disabling it.

   The suspend paths are careful enough: other CPUs have been hot-
   unplugged, devices suspended, and interrupts are disabled at this
   point, so that's safe.

   kexec is slightly different - interrupts will be disabled, we will
   only call outer_cache_disable() if we are the only CPU running at
   soft_restart() time.  However, devices may or may not be suspended
   (which would be an issue for the L2 cache if they're on the coherent
   bus.)  So I think this is as safe as it can be.

   The MCPM stuff is another issue: what the conditions are there I've
   no idea, but it looks like other CPUs will be running when it calls
   outer_cache_flush().  MCPM commentry claims that this will be
   "harmless" and I just had to laugh at that - even with this workaround
   enabled, it doesn't fix the problem on L2C-310 R2P0 as the workaround
   implementation only works on R3P0!

2. on large (>= cache size) coherent DMA allocations via outer_flush_range()

Consider using CMA for framebuffers, and allocating a 1080p framebuffer.
That's going to be around 8MB in size, and that's going to need the L2
purged of any cache lines associated with it.  As the L2 cache is
normally around 512kB or maybe 1MB, walking every 32-byte cache line is
extremely wasteful (that's 259200 clean+invalidate by PA operations, of
which a maximum of 32768 could possibly hit a cache line), so we do want
to preserve the ability to use this operation.

However, one of the issues with background operations is that while the
background operation is in progress, you can't allow another operation
(atomic or background) to be issued - and that includes the sync operation.
L2C-310 issues a slave error in response to this, which will be reported
as an imprecise data abort.

This means that if we do want to avoid this waste, we need some kind of
locking: using a spinlock here is too heavy, I'm thinking about using
a read-write spinlock.  However, that brings with it all the lockdep
baggage with it which we already know we need to avoid, and there's no
raw_ variants of it.

The best idea I can come up with is to directly use the arch_ rw spinlock
variants - the read lock taken for the atomic operations so that many
of them can interleve, and the write lock taken for background operations
as only one can be active at any one time.

Thoughts?

(Note that L2C-210 is slightly more fun: if you try another operation with
a background operation already in progress, it's documented to be ignored -
it's unspecified whether it issues a slave error response.  So you don't
know if you're hitting this condition.)

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 14:48 PL310 errata workarounds Russell King - ARM Linux
  2014-03-14 15:01 ` Russell King - ARM Linux
@ 2014-03-16 15:29 ` Russell King - ARM Linux
  2014-03-17 14:00   ` Rob Herring
  2014-03-17 16:32   ` Russell King - ARM Linux
  1 sibling, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-16 15:29 UTC (permalink / raw)
  To: linux-arm-kernel

Okay, I've been going through the various initialisations of the L2x0
cache code, and there's some really fun stuff going on here:

arch/arm/mach-imx/system.c:
	/* Configure the L2 PREFETCH and POWER registers */
	val = readl_relaxed(l2x0_base + L2X0_PREFETCH_CTRL);
	val |= 0x70800000;
	/*
	 * The L2 cache controller(PL310) version on the i.MX6D/Q is r3p1-50rel0
	 * The L2 cache controller(PL310) version on the i.MX6DL/SOLO/SL is r3p2
	 * But according to ARM PL310 errata: 752271
	 * ID: 752271: Double linefill feature can cause data corruption
	 * Fault Status: Present in: r3p0, r3p1, r3p1-50rel0. Fixed in r3p2
	 * Workaround: The only workaround to this erratum is to disable the
	 * double linefill feature. This is the default behavior.
	 */
	if (cpu_is_imx6q())
		val &= ~(1 << 30 | 1 << 23);
	writel_relaxed(val, l2x0_base + L2X0_PREFETCH_CTRL);
        val = L2X0_DYNAMIC_CLK_GATING_EN | L2X0_STNDBY_MODE_EN;
        writel_relaxed(val, l2x0_base + L2X0_POWER_CTRL);

You can wonder why we don't have 752271 implemented by the core code.

arch/arm/mach-prima2/l2x0.c:
struct l2x0_aux
{
        u32 val;
        u32 mask;
};

static struct l2x0_aux prima2_l2x0_aux __initconst = {
        .val = 2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT,
        .mask = 0,
};

static struct l2x0_aux marco_l2x0_aux __initconst = {
        .val = (2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT) |
                (1 << L2X0_AUX_CTRL_ASSOCIATIVITY_SHIFT),
        .mask = L2X0_AUX_CTRL_MASK,
};

static struct of_device_id sirf_l2x0_ids[] __initconst = {
        { .compatible = "sirf,prima2-pl310-cache", .data = &prima2_l2x0_aux, },
        { .compatible = "sirf,marco-pl310-cache", .data = &marco_l2x0_aux, },
        {},
};

static int __init sirfsoc_l2x0_init(void)
{
        struct device_node *np;
        const struct l2x0_aux *aux;

        np = of_find_matching_node(NULL, sirf_l2x0_ids);
        if (np) {
                aux = of_match_node(sirf_l2x0_ids, np)->data;
                return l2x0_of_init(aux->val, aux->mask);
        }

        return 0;
}
early_initcall(sirfsoc_l2x0_init);

This shows everything that is /BAD/ about DT.  People go running around
stuffing crap like this in rather than thinking about the generic problem.
The sole function of the above compatible strings is to select between
different cache property configurations.  Now go back and look at the
iMX code above, which is doing the same thing but at runtime.

Now think: why don't we _already_ have a dt property or two to specify
this information?

The reason why we don't have that is because people do not think beyond
their nose when they encounter a problem.  They will not think whether
it should be fixed in a generic way.  They will not look to see whether
other people are suffering the same problem.

The big problem is that _now_ we have to support this fuckage for ever,
and all because people can't look beyond their immediate problem.

arch/arm/mach-zynq/common.c:
static void __init zynq_init_machine(void)
{
        /*
         * 64KB way size, 8-way associativity, parity disabled
         */
        l2x0_of_init(0x02060000, 0xF0F0FFFF);

Oh look, same problem again.

arch/arm/mach-ux500/cache-l2x0.c:
        /* DBx540's L2 has 128KB way size */
        if (cpu_is_ux540_family())
                /* 128KB way size */
                aux_val |= (0x4 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT);
        else
                /* 64KB way size */
                aux_val |= (0x3 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT);

        /* 64KB way size, 8 way associativity, force WA */
        if (of_have_populated_dt())
                l2x0_of_init(aux_val, 0xc0000fff);
        else
                l2x0_init(l2x0_base, aux_val, 0xc0000fff);

Okay, this has to cope with non-DT as well, but for the DT case, it is
again the same damned problem.

arch/arm/mach-nomadik/cpu-8815.c:
#ifdef CONFIG_CACHE_L2X0
        /* At full speed latency must be >=2, so 0x249 in low bits */
        l2x0_of_init(0x00730249, 0xfe000fff);
#endif

Do we actually end up with 0x249 in the low bits - because the fff in
the mask field preserves the values already in the aux register.
Moreover, don't we have DT properties to specify this information already,
so why isn't nomadik using them?  Nomadik is using a DT compatible string
of "arm,l210-cache", which has the properties "arm,tag-latency",
"arm,data-latency", "arm,dirty-latency" which allow that 0x249 to be
specified.

What the real painful bit here is that we have all these DT files out
there now which make the assumption that this code is going to always be
there, and there's not a /damned/ thing we can do with this crap now.
The best we can do is to try and stop this crap spreading further than
it already has.

Yes, I'm pissed off with DT.  I'm pissed off with people not thinking
before creating this crap as well.  I'm pissed off with the people who
allegedly reviewed this crap.  I'm pissed off with the people who
allegedly reviewed the creation of these SoC specific compatible
strings just to be able to specify configuration data.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-16 15:29 ` Russell King - ARM Linux
@ 2014-03-17 14:00   ` Rob Herring
  2014-03-17 14:40     ` Russell King - ARM Linux
                       ` (2 more replies)
  2014-03-17 16:32   ` Russell King - ARM Linux
  1 sibling, 3 replies; 76+ messages in thread
From: Rob Herring @ 2014-03-17 14:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Mar 16, 2014 at 10:29 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> Okay, I've been going through the various initialisations of the L2x0
> cache code, and there's some really fun stuff going on here:
>
> arch/arm/mach-imx/system.c:
>         /* Configure the L2 PREFETCH and POWER registers */
>         val = readl_relaxed(l2x0_base + L2X0_PREFETCH_CTRL);
>         val |= 0x70800000;

I wonder what is the desired setting for "Force write allocate" since
this fails to mask bit 24. They are probably really wanting to set bit
22.

Setting prefetch enables and early BRESP could all be done
unconditionally in the core code.

>         /*
>          * The L2 cache controller(PL310) version on the i.MX6D/Q is r3p1-50rel0
>          * The L2 cache controller(PL310) version on the i.MX6DL/SOLO/SL is r3p2
>          * But according to ARM PL310 errata: 752271
>          * ID: 752271: Double linefill feature can cause data corruption
>          * Fault Status: Present in: r3p0, r3p1, r3p1-50rel0. Fixed in r3p2
>          * Workaround: The only workaround to this erratum is to disable the
>          * double linefill feature. This is the default behavior.
>          */
>         if (cpu_is_imx6q())
>                 val &= ~(1 << 30 | 1 << 23);
>         writel_relaxed(val, l2x0_base + L2X0_PREFETCH_CTRL);
>         val = L2X0_DYNAMIC_CLK_GATING_EN | L2X0_STNDBY_MODE_EN;
>         writel_relaxed(val, l2x0_base + L2X0_POWER_CTRL);
>
> You can wonder why we don't have 752271 implemented by the core code.
>
> arch/arm/mach-prima2/l2x0.c:
> struct l2x0_aux
> {
>         u32 val;
>         u32 mask;
> };
>
> static struct l2x0_aux prima2_l2x0_aux __initconst = {
>         .val = 2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT,
>         .mask = 0,
> };
>
> static struct l2x0_aux marco_l2x0_aux __initconst = {
>         .val = (2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT) |
>                 (1 << L2X0_AUX_CTRL_ASSOCIATIVITY_SHIFT),
>         .mask = L2X0_AUX_CTRL_MASK,

What I like about this is all the bits they are clearing and failing
to set. Shared override, prefetch, etc.

> };
>
> static struct of_device_id sirf_l2x0_ids[] __initconst = {
>         { .compatible = "sirf,prima2-pl310-cache", .data = &prima2_l2x0_aux, },
>         { .compatible = "sirf,marco-pl310-cache", .data = &marco_l2x0_aux, },
>         {},
> };
>
> static int __init sirfsoc_l2x0_init(void)
> {
>         struct device_node *np;
>         const struct l2x0_aux *aux;
>
>         np = of_find_matching_node(NULL, sirf_l2x0_ids);
>         if (np) {
>                 aux = of_match_node(sirf_l2x0_ids, np)->data;
>                 return l2x0_of_init(aux->val, aux->mask);
>         }
>
>         return 0;
> }
> early_initcall(sirfsoc_l2x0_init);
>
> This shows everything that is /BAD/ about DT.  People go running around
> stuffing crap like this in rather than thinking about the generic problem.
> The sole function of the above compatible strings is to select between
> different cache property configurations.  Now go back and look at the
> iMX code above, which is doing the same thing but at runtime.
>
> Now think: why don't we _already_ have a dt property or two to specify
> this information?
>
> The reason why we don't have that is because people do not think beyond
> their nose when they encounter a problem.  They will not think whether
> it should be fixed in a generic way.  They will not look to see whether
> other people are suffering the same problem.
>
> The big problem is that _now_ we have to support this fuckage for ever,
> and all because people can't look beyond their immediate problem.
>
> arch/arm/mach-zynq/common.c:
> static void __init zynq_init_machine(void)
> {
>         /*
>          * 64KB way size, 8-way associativity, parity disabled
>          */
>         l2x0_of_init(0x02060000, 0xF0F0FFFF);
>
> Oh look, same problem again.

This all seems unnecessary. All these values should be correct at
reset and this should be a nop unless the h/w is configured
incorrectly. Surely h/w folks can tie off WAYSIZE correctly... It is
also keeping shared attrib override disabled which I believe is
generally wrong, and is masking 1 bit of force write allocate, but not
the other.

ePAPR already defines standard properties for cache layout which could
be used here if really needed.

> arch/arm/mach-ux500/cache-l2x0.c:
>         /* DBx540's L2 has 128KB way size */
>         if (cpu_is_ux540_family())
>                 /* 128KB way size */
>                 aux_val |= (0x4 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT);
>         else
>                 /* 64KB way size */
>                 aux_val |= (0x3 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT);
>
>         /* 64KB way size, 8 way associativity, force WA */
>         if (of_have_populated_dt())
>                 l2x0_of_init(aux_val, 0xc0000fff);
>         else
>                 l2x0_init(l2x0_base, aux_val, 0xc0000fff);
>
> Okay, this has to cope with non-DT as well, but for the DT case, it is
> again the same damned problem.

This one is interesting in that it has to undo way locking done by the
bootloader. So the bootloader has setup the L2 to some extent, but
didn't configure aux ctrl already?

The way unlocking could be done unconditionally in the core code
perhaps. There could be issues on non-secure platforms though.

>
> arch/arm/mach-nomadik/cpu-8815.c:
> #ifdef CONFIG_CACHE_L2X0
>         /* At full speed latency must be >=2, so 0x249 in low bits */
>         l2x0_of_init(0x00730249, 0xfe000fff);
> #endif
>
> Do we actually end up with 0x249 in the low bits - because the fff in
> the mask field preserves the values already in the aux register.
> Moreover, don't we have DT properties to specify this information already,
> so why isn't nomadik using them?  Nomadik is using a DT compatible string
> of "arm,l210-cache", which has the properties "arm,tag-latency",
> "arm,data-latency", "arm,dirty-latency" which allow that 0x249 to be
> specified.

l2x0_of_setup clears those mask bits if the DT properties are present,
so it will work in that case. But otherwise, yes this is broken.

> What the real painful bit here is that we have all these DT files out
> there now which make the assumption that this code is going to always be
> there, and there's not a /damned/ thing we can do with this crap now.
> The best we can do is to try and stop this crap spreading further than
> it already has.

I think we really should have pushed that all aux_ctrl setup go into
the bootloader. It's really not that different from errata work-around
setup. It doesn't have the same race condition as enabling errata
work-arounds may have, but does have the same secure vs. non-secure
setup issue.

Keeping the aux val/mask parameters in the DT case was definitely a
mistake. DT or not, it is a fragile and error prone interface of magic
values and we should have forced its removal when we had the chance.

Rob

> Yes, I'm pissed off with DT.  I'm pissed off with people not thinking
> before creating this crap as well.  I'm pissed off with the people who
> allegedly reviewed this crap.  I'm pissed off with the people who
> allegedly reviewed the creation of these SoC specific compatible
> strings just to be able to specify configuration data.
>
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 14:00   ` Rob Herring
@ 2014-03-17 14:40     ` Russell King - ARM Linux
  2014-03-18 11:22     ` *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings Russell King - ARM Linux
  2014-03-18 17:26     ` PL310 errata workarounds Russell King - ARM Linux
  2 siblings, 0 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-17 14:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> On Sun, Mar 16, 2014 at 10:29 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > Okay, I've been going through the various initialisations of the L2x0
> > cache code, and there's some really fun stuff going on here:
> >
> > arch/arm/mach-imx/system.c:
> >         /* Configure the L2 PREFETCH and POWER registers */
> >         val = readl_relaxed(l2x0_base + L2X0_PREFETCH_CTRL);
> >         val |= 0x70800000;
> 
> I wonder what is the desired setting for "Force write allocate" since
> this fails to mask bit 24. They are probably really wanting to set bit
> 22.

No, this is the prefetch control register, not the auxillary control
register.  Bits 30 and 23 here are to do with double line fills, and
29 and 28 are to do with I/D prefetching.

> > arch/arm/mach-zynq/common.c:
> > static void __init zynq_init_machine(void)
> > {
> >         /*
> >          * 64KB way size, 8-way associativity, parity disabled
> >          */
> >         l2x0_of_init(0x02060000, 0xF0F0FFFF);
> >
> > Oh look, same problem again.
> 
> This all seems unnecessary. All these values should be correct at
> reset and this should be a nop unless the h/w is configured
> incorrectly. Surely h/w folks can tie off WAYSIZE correctly... It is
> also keeping shared attrib override disabled which I believe is
> generally wrong, and is masking 1 bit of force write allocate, but not
> the other.

Yes, they /should/ so that the register starts of appropriately configured,
but have they actually done that... the fact this stuff exists suggests
that they haven't, because if the values were correct, calling
l2x0_of_init() with "0UL, ~0UL" would result in the reset value being used.
That's surely the path of least resistance, because you don't have to go
digging through the documentation to find the values to then write code
to mask/set these values appropriately.

> ePAPR already defines standard properties for cache layout which could
> be used here if really needed.

I'm adding those right now to this code - there's no excuse not to given
this mess - at least we can prevent any future additions of stuff like
this.
> 
> > arch/arm/mach-ux500/cache-l2x0.c:
> >         /* DBx540's L2 has 128KB way size */
> >         if (cpu_is_ux540_family())
> >                 /* 128KB way size */
> >                 aux_val |= (0x4 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT);
> >         else
> >                 /* 64KB way size */
> >                 aux_val |= (0x3 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT);
> >
> >         /* 64KB way size, 8 way associativity, force WA */
> >         if (of_have_populated_dt())
> >                 l2x0_of_init(aux_val, 0xc0000fff);
> >         else
> >                 l2x0_init(l2x0_base, aux_val, 0xc0000fff);
> >
> > Okay, this has to cope with non-DT as well, but for the DT case, it is
> > again the same damned problem.
> 
> This one is interesting in that it has to undo way locking done by the
> bootloader. So the bootloader has setup the L2 to some extent, but
> didn't configure aux ctrl already?
> 
> The way unlocking could be done unconditionally in the core code
> perhaps. There could be issues on non-secure platforms though.

What's more interesting here is that it boots in non-secure mode, and the
platform doesn't have access to the secure monitor to be able to change
any of the L2 configuration registers - this includes the auxillary
control register.  So the above is just trying (for some odd reason) to
supply details to the L2 code that it could just as well retrieve from
the readable auxillary control register (and would use given the correct
masks.)

As for the lock registers, as this code is already writing to those
registers, we can assume that the aux control register is set to allow
non-secure access to these - so we can kill this if we make the L2
code clear these registers if the NS lock access bit is set, even if
the cache is enabled.

> > arch/arm/mach-nomadik/cpu-8815.c:
> > #ifdef CONFIG_CACHE_L2X0
> >         /* At full speed latency must be >=2, so 0x249 in low bits */
> >         l2x0_of_init(0x00730249, 0xfe000fff);
> > #endif
> >
> > Do we actually end up with 0x249 in the low bits - because the fff in
> > the mask field preserves the values already in the aux register.
> > Moreover, don't we have DT properties to specify this information already,
> > so why isn't nomadik using them?  Nomadik is using a DT compatible string
> > of "arm,l210-cache", which has the properties "arm,tag-latency",
> > "arm,data-latency", "arm,dirty-latency" which allow that 0x249 to be
> > specified.
> 
> l2x0_of_setup clears those mask bits if the DT properties are present,
> so it will work in that case. But otherwise, yes this is broken.

"if those properties are present" is the key thing here.  They aren't:

        L2: l2-cache {
                compatible = "arm,l210-cache";
                reg = <0x10210000 0x1000>;
                interrupt-parent = <&vica>;
                interrupts = <30>;
                cache-unified;
                cache-level = <2>;
        };

So what you end up with is an OR of these values and whatever happened
to be in the register at the time.  In other words... a total mess, just
like much of the other code in this area.

> I think we really should have pushed that all aux_ctrl setup go into
> the bootloader. It's really not that different from errata work-around
> setup. It doesn't have the same race condition as enabling errata
> work-arounds may have, but does have the same secure vs. non-secure
> setup issue.
> 
> Keeping the aux val/mask parameters in the DT case was definitely a
> mistake. DT or not, it is a fragile and error prone interface of magic
> values and we should have forced its removal when we had the chance.

Given that caches have standard ePAR parameters, we should have at least
insisted that these parameters were part of the DT specification.

We say in the DT binding docs that cache-unified is required.  However,
these omit it:
- arch/arm/boot/dts/marco.dtsi
- arch/arm/boot/dts/vexpress-v2p-ca5s.dts
- arch/arm/boot/dts/prima2.dtsi (omits cache-level as well)

What good is it having binding documentation when we don't check (by
some means, either by review or by code checking) that the DT files are
actually correct?  The fact that we have _one_ which gets this wrong
now means that "cache-unified" is completely useless should we ever
need to parse it in the future.  We might have decided to omit it
entirely.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-16 11:52     ` Russell King - ARM Linux
@ 2014-03-17 15:04       ` Rob Herring
  2014-03-17 15:37         ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Rob Herring @ 2014-03-17 15:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Mar 16, 2014 at 6:52 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
>> On Fri, Mar 14, 2014 at 10:01 AM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > On Fri, Mar 14, 2014 at 02:48:35PM +0000, Russell King - ARM Linux wrote:
>> >> While looking at the now rather horrid L2C-2x0 code, there's some really
>> >> serious questions that have to be asked about some of these errata.
>> >>
>> >> 588369 affects PL310 up to and including R1P0, but is fixed in R2P0.
>> >>       This erratum requires a clean+invalidate request to be split into
>> >>       a separate clean request followed by an invalidate request.  This
>> >>       must be done with the debug register set to 0x03.
>> >>
>> >>       If the debug register is not set to 0x03, then it's effectively
>> >>       a separate clean operation followed by a separate invalidate
>> >>       operation.
>> >>
>> >> 727915 affects PL310 up to and including R3P0, but is fixed in R3P1.
>> >>       This erratum concerns writes which may be discarded as a result
>> >>       of a clean+invalidate request, particularly when a write hits
>> >>       the cache line being requested between the clean stage and the
>> >>       invalidate stage.
>> >>
>> >> Now, let's say that we have a PL310 R3P3 cache.  Linux is configured with
>> >> both workarounds enabled.  This has the effect that we never use the
>> >> clean+invalidate request.
>> >>
>> >> However, Rob's patch in 74ddcdb868a8 (ARM: 7608/1: l2x0: Only set
>> >> .set_debug on PL310 r3p0 and earlier) has the effect that we no longer
>> >> to the debug dance on such a cache.
>> >>
>> >> This means we're still splitting the clean+invalidate sequence into two
>> >> separate operations, so there's room for a write to hit the cache line
>> >> between the two operations and, therefore, be lost.
>> >>
>> >> So, do we actually care about 727915, because through applying Rob's
>> >> change, we've effectively re-exposed the problem which this erratum
>> >> refers to on all L2 cache controllers where this is actually fixed,
>> >> and it's been like this for 14 months without any reported issue.
>> >
>> > Another point here is... we don't even need 727915 at all: 727915 is
>> > about the clean+invalidate by WAY request, not clean+invalidate by PA.
>> > So really we should rip out references to this erratum because it's
>> > not applicable.
>>
>> Is there really no possibility of a write occurring during a
>> clean+invalidate by WAY? There is the case where large clean+inv by PA
>> uses a clean+inv by WAY. I'd guess this path is fairly rare. We would
>> never hit it on highbank because we'd never see a 4MB range and DMA is
>> normally coherent. But do the cpu suspend or kexec paths have an
>> issue?
>
> To answer this point more fully: clean+invalidate by way is used in
> two circumstances:
>
> 1. when disabling the cache.  Here, we must be single threaded, with
>    interrupts disabled, and any other masters for the other cache
>    under control so they don't spit out spurious writes while we're
>    disabling it.
>
>    The suspend paths are careful enough: other CPUs have been hot-
>    unplugged, devices suspended, and interrupts are disabled at this
>    point, so that's safe.
>
>    kexec is slightly different - interrupts will be disabled, we will
>    only call outer_cache_disable() if we are the only CPU running at
>    soft_restart() time.  However, devices may or may not be suspended
>    (which would be an issue for the L2 cache if they're on the coherent
>    bus.)  So I think this is as safe as it can be.
>
>    The MCPM stuff is another issue: what the conditions are there I've
>    no idea, but it looks like other CPUs will be running when it calls
>    outer_cache_flush().  MCPM commentry claims that this will be
>    "harmless" and I just had to laugh at that - even with this workaround
>    enabled, it doesn't fix the problem on L2C-310 R2P0 as the workaround
>    implementation only works on R3P0!

For MCPM, is there even a platform that has a PL310 used as an L3? I
suppose architecturally it is possible, but in reality it is probably
not something that's ever been tested.

> 2. on large (>= cache size) coherent DMA allocations via outer_flush_range()
>
> Consider using CMA for framebuffers, and allocating a 1080p framebuffer.
> That's going to be around 8MB in size, and that's going to need the L2
> purged of any cache lines associated with it.  As the L2 cache is
> normally around 512kB or maybe 1MB, walking every 32-byte cache line is
> extremely wasteful (that's 259200 clean+invalidate by PA operations, of
> which a maximum of 32768 could possibly hit a cache line), so we do want
> to preserve the ability to use this operation.

Aren't CMA buffers mapped coherently so the flush is not needed?

> However, one of the issues with background operations is that while the
> background operation is in progress, you can't allow another operation
> (atomic or background) to be issued - and that includes the sync operation.
> L2C-310 issues a slave error in response to this, which will be reported
> as an imprecise data abort.
>
> This means that if we do want to avoid this waste, we need some kind of
> locking: using a spinlock here is too heavy, I'm thinking about using
> a read-write spinlock.  However, that brings with it all the lockdep
> baggage with it which we already know we need to avoid, and there's no
> raw_ variants of it.
>
> The best idea I can come up with is to directly use the arch_ rw spinlock
> variants - the read lock taken for the atomic operations so that many
> of them can interleve, and the write lock taken for background operations
> as only one can be active at any one time.
>
> Thoughts?

This would help with contention in readl/writel, but you still have
most all the overhead of a spinlock. I'm not sure which is the bigger
component: lock contention or all the loads, stores and dsb/dmbs
associated with the lock.

Isn't using by way ops potentially broken if you are running a secure
OS? If linux is doing a by way operation and the secure OS does range
operations, someone is going to crash on an abort. I suppose no one
sees this due to the limited function of secure OSs.

Rob

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 15:04       ` Rob Herring
@ 2014-03-17 15:37         ` Russell King - ARM Linux
  2014-03-17 17:29           ` Catalin Marinas
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-17 15:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 10:04:20AM -0500, Rob Herring wrote:
> On Sun, Mar 16, 2014 at 6:52 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> >    The MCPM stuff is another issue: what the conditions are there I've
> >    no idea, but it looks like other CPUs will be running when it calls
> >    outer_cache_flush().  MCPM commentry claims that this will be
> >    "harmless" and I just had to laugh at that - even with this workaround
> >    enabled, it doesn't fix the problem on L2C-310 R2P0 as the workaround
> >    implementation only works on R3P0!
> 
> For MCPM, is there even a platform that has a PL310 used as an L3? I
> suppose architecturally it is possible, but in reality it is probably
> not something that's ever been tested.

I have no idea about MCPM - it's something I've never used, even though
I in theory have hardware which supports it.  From what I can tell,
getting Linux running on the Versatile Express CA15 tile is something of
a black art, there's only a relatively small number of people who have
managed that feat.

> > 2. on large (>= cache size) coherent DMA allocations via outer_flush_range()
> >
> > Consider using CMA for framebuffers, and allocating a 1080p framebuffer.
> > That's going to be around 8MB in size, and that's going to need the L2
> > purged of any cache lines associated with it.  As the L2 cache is
> > normally around 512kB or maybe 1MB, walking every 32-byte cache line is
> > extremely wasteful (that's 259200 clean+invalidate by PA operations, of
> > which a maximum of 32768 could possibly hit a cache line), so we do want
> > to preserve the ability to use this operation.
> 
> Aren't CMA buffers mapped coherently so the flush is not needed?

Not initially, not when they're free, not when they're being re-used for
non-CMA purposes.  Hence, they can contain dirty cache lines which need
to be flushed out of their mappings to avoid interference with the
coherent mapping.

> This would help with contention in readl/writel, but you still have
> most all the overhead of a spinlock. I'm not sure which is the bigger
> component: lock contention or all the loads, stores and dsb/dmbs
> associated with the lock.

Using the arch r/w locks is not that heavy, and doesn't have the problem
that interrupts are locked out during much of the L2 maintanence.  Even
with arch r/w locks, the L2 cache ops don't show up much in perf, compared
to the existing implementation where they show quite highly.

The only issue is we'd only be able to use this optimisation when we
aren't running in IRQ context anyway, which I think isn't that great a
restriction on it.

> Isn't using by way ops potentially broken if you are running a secure
> OS? If linux is doing a by way operation and the secure OS does range
> operations, someone is going to crash on an abort. I suppose no one
> sees this due to the limited function of secure OSs.

Let's cover that should it happen - a secure OS should check the status
of the L2 hardware before issuing any cache operation anyway for exactly
this reason (you can always read from the L2 registers to check whether
any operation is in progress.)

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-16 15:29 ` Russell King - ARM Linux
  2014-03-17 14:00   ` Rob Herring
@ 2014-03-17 16:32   ` Russell King - ARM Linux
  2014-03-17 16:51     ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-17 16:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Mar 16, 2014 at 03:29:53PM +0000, Russell King - ARM Linux wrote:
> Okay, I've been going through the various initialisations of the L2x0
> cache code, and there's some really fun stuff going on here:

Here's OMAP4 - thankfully I have an OMAP4430 SDP to be able to investigate
what goes on here.  We have this code:

        /*
         * 16-way associativity, parity disabled
         * Way size - 32KB (es1.0)
         * Way size - 64KB (es2.0 +)
         */
        aux_ctrl = L310_AUX_CTRL_ASSOCIATIVITY_16 |
                   L310_AUX_CTRL_CACHE_REPLACE_RR |
                   L310_AUX_CTRL_NS_LOCKDOWN |
                   L310_AUX_CTRL_NS_INT_CTRL;

        if (omap_rev() == OMAP4430_REV_ES1_0) {
                aux_ctrl |= L310_AUX_CTRL_WAY_SIZE(2);
        } else {
                aux_ctrl |= L310_AUX_CTRL_WAY_SIZE(3) |
                            L310_AUX_CTRL_SHARED_OVERRIDE |
                            L310_AUX_CTRL_DATA_PREFETCH |
                            L310_AUX_CTRL_INSTR_PREFETCH |
                            L310_AUX_CTRL_EARLY_BRESP;
        }
        if (omap_rev() != OMAP4430_REV_ES1_0)
                omap_smc1(0x109, aux_ctrl);

        /* Enable PL310 L2 Cache controller */
        omap_smc1(0x102, 0x1);

        if (of_have_populated_dt())
                l2x0_of_init(aux_ctrl, L2X0_AUX_CTRL_MASK);
        else
                l2x0_init(l2cache_base, aux_ctrl, L2X0_AUX_CTRL_MASK);

Now, with my changes to add a "write_sec" hook into the code (to allow
secure-only registers to be written) we appear to be able to kill all
the above code on OMAP4430 ES2.2 - the following messages are from my
replacement L2x0 code with write_sec implemented for OMAP, and the
explicit SMC ops above commented out.

L2C: platform provided aux values match the hardware, so have no effect.  Please remove them.
L2C: platform provided aux values permit register corruption.
L2C-310 errata 727915 769419 enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c4, AUX_CTRL 0x7e470000

The first message is produced by:
        old_aux = readl_relaxed(l2x0_base + L2X0_AUX_CTRL);
        if (old_aux != ((old_aux & aux_mask) | aux_val)) {
		pr_err("L2C: platform modifies aux control register: 0x%08x -> 0x%08x\n",
                       old_aux, (old_aux & aux_mask) | aux_val);
        } else if (aux_mask != ~0U && aux_val != 0) {
                pr_alert("L2C: platform provided aux values match the hardware, so have no effect.  Please remove them.\n");
        }

Later on I have:
        if (aux_val & aux_mask)
                pr_alert("L2C: platform provided aux values permit register corruption.\n");
...
        aux = readl_relaxed(l2x0_base + L2X0_AUX_CTRL);
        if (aux != ((aux & aux_mask) | aux_val))
                pr_err("L2C: DT/platform modifies aux control register: 0x%08x -> 0x%08x\n",
                       aux, (aux & aux_mask) | aux_val);

The two pr_alert()s are to provide sufficient motivation to people to
remove aux value configuration which actually matches the hardware.

Annoyingly I can't check the actual configuration registers because it
seems that someone along the way has broken ethernet (KSZ8851) in
recent kernels on this board... so I can't mount the fs with devmem2 on
to read the L2's registers.

I would try and quote some of the kernel messages for that, but: (a)
busybox grep doesn't understand the -<number> argument to give lines of
context, and (b) busybox less doesn't allow searching the kernel log...
So all I can say is that these kernel messages exist:

ks8851 spi1.0: message enable is 0
ks8851 spi1.0 eth0: revision 0, MAC 08:00:28:01:4d:c6, IRQ 194, has EEPROM

and dhcp fails - the ethernet interface doesn't even establish a link with
the connected switch... it just gets better and better.  Why is the modern
ARM kernel a perpetual struggle to make everything work.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 16:32   ` Russell King - ARM Linux
@ 2014-03-17 16:51     ` Russell King - ARM Linux
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-17 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 04:32:04PM +0000, Russell King - ARM Linux wrote:
> On Sun, Mar 16, 2014 at 03:29:53PM +0000, Russell King - ARM Linux wrote:
> > Okay, I've been going through the various initialisations of the L2x0
> > cache code, and there's some really fun stuff going on here:
> 
> Here's OMAP4 - thankfully I have an OMAP4430 SDP to be able to investigate
> what goes on here.  We have this code:
> 
>         /*
>          * 16-way associativity, parity disabled
>          * Way size - 32KB (es1.0)
>          * Way size - 64KB (es2.0 +)
>          */
>         aux_ctrl = L310_AUX_CTRL_ASSOCIATIVITY_16 |
>                    L310_AUX_CTRL_CACHE_REPLACE_RR |
>                    L310_AUX_CTRL_NS_LOCKDOWN |
>                    L310_AUX_CTRL_NS_INT_CTRL;
> 
>         if (omap_rev() == OMAP4430_REV_ES1_0) {
>                 aux_ctrl |= L310_AUX_CTRL_WAY_SIZE(2);
>         } else {
>                 aux_ctrl |= L310_AUX_CTRL_WAY_SIZE(3) |
>                             L310_AUX_CTRL_SHARED_OVERRIDE |
>                             L310_AUX_CTRL_DATA_PREFETCH |
>                             L310_AUX_CTRL_INSTR_PREFETCH |
>                             L310_AUX_CTRL_EARLY_BRESP;
>         }
>         if (omap_rev() != OMAP4430_REV_ES1_0)
>                 omap_smc1(0x109, aux_ctrl);
> 
>         /* Enable PL310 L2 Cache controller */
>         omap_smc1(0x102, 0x1);
> 
>         if (of_have_populated_dt())
>                 l2x0_of_init(aux_ctrl, L2X0_AUX_CTRL_MASK);
>         else
>                 l2x0_init(l2cache_base, aux_ctrl, L2X0_AUX_CTRL_MASK);
> 
> Now, with my changes to add a "write_sec" hook into the code (to allow
> secure-only registers to be written) we appear to be able to kill all
> the above code on OMAP4430 ES2.2 - the following messages are from my
> replacement L2x0 code with write_sec implemented for OMAP, and the
> explicit SMC ops above commented out.
> 
> L2C: platform provided aux values match the hardware, so have no effect.  Please remove them.
> L2C: platform provided aux values permit register corruption.
> L2C-310 errata 727915 769419 enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c4, AUX_CTRL 0x7e470000

Damn it, I keep forgetting to build then install - the change to make
zinstall not do a build keeps biting me.

When the kernel boots, AUX_CTRL is set in hardware to 0x0e070000, which
is basically as close as the hardware strapping can get it.  The rest of
the bits must be modified by software.  So we /do/ need to modify this
register.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 15:37         ` Russell King - ARM Linux
@ 2014-03-17 17:29           ` Catalin Marinas
  2014-03-17 19:44             ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Catalin Marinas @ 2014-03-17 17:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 03:37:38PM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 17, 2014 at 10:04:20AM -0500, Rob Herring wrote:
> > On Sun, Mar 16, 2014 at 6:52 AM, Russell King - ARM Linux
> > <linux@arm.linux.org.uk> wrote:
> > >    The MCPM stuff is another issue: what the conditions are there I've
> > >    no idea, but it looks like other CPUs will be running when it calls
> > >    outer_cache_flush().  MCPM commentry claims that this will be
> > >    "harmless" and I just had to laugh at that - even with this workaround
> > >    enabled, it doesn't fix the problem on L2C-310 R2P0 as the workaround
> > >    implementation only works on R3P0!
> > 
> > For MCPM, is there even a platform that has a PL310 used as an L3? I
> > suppose architecturally it is possible, but in reality it is probably
> > not something that's ever been tested.
> 
> I have no idea about MCPM

I assume that the MCPM comment about outer_cache_flush() being harmless
is because it is assumed to be a no-op. In the mach-vexpress/dcscb.c
file, there is a v7_exit_coherency_flush() prior to outer_flush_all().
While it looks like the right way, the comment for
v7_exit_coherency_flush() states that ldrex/strex no longer work after
the call.

You could do with a lock-less outer_flush_all(), even though it is a
background operation assuming there is no race (single CPU running). I
think your big clean-up series is already hiding outer_flush_all() under
the L2 disable function.

> > This would help with contention in readl/writel, but you still have
> > most all the overhead of a spinlock. I'm not sure which is the bigger
> > component: lock contention or all the loads, stores and dsb/dmbs
> > associated with the lock.
> 
> Using the arch r/w locks is not that heavy, and doesn't have the problem
> that interrupts are locked out during much of the L2 maintanence.  Even
> with arch r/w locks, the L2 cache ops don't show up much in perf, compared
> to the existing implementation where they show quite highly.
> 
> The only issue is we'd only be able to use this optimisation when we
> aren't running in IRQ context anyway, which I think isn't that great a
> restriction on it.

IIRC Will had a patch for this but I don't remember whether it showed
any improvements (I guess not since the patch hasn't been pushed).

> > Isn't using by way ops potentially broken if you are running a secure
> > OS? If linux is doing a by way operation and the secure OS does range
> > operations, someone is going to crash on an abort. I suppose no one
> > sees this due to the limited function of secure OSs.
> 
> Let's cover that should it happen - a secure OS should check the status
> of the L2 hardware before issuing any cache operation anyway for exactly
> this reason (you can always read from the L2 registers to check whether
> any operation is in progress.)

But between the secure OS check and the actual operation, Linux could
start a background one (the SMP case). If this happens, the hardware
generates an error signal, I guess this translates to an external abort
(I haven't tried it but it doesn't look to safe for the secure world).

-- 
Catalin

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 17:29           ` Catalin Marinas
@ 2014-03-17 19:44             ` Russell King - ARM Linux
  2014-03-17 21:09               ` Nicolas Pitre
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-17 19:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 05:29:44PM +0000, Catalin Marinas wrote:
> On Mon, Mar 17, 2014 at 03:37:38PM +0000, Russell King - ARM Linux wrote:
> > On Mon, Mar 17, 2014 at 10:04:20AM -0500, Rob Herring wrote:
> > > On Sun, Mar 16, 2014 at 6:52 AM, Russell King - ARM Linux
> > > <linux@arm.linux.org.uk> wrote:
> > > >    The MCPM stuff is another issue: what the conditions are there I've
> > > >    no idea, but it looks like other CPUs will be running when it calls
> > > >    outer_cache_flush().  MCPM commentry claims that this will be
> > > >    "harmless" and I just had to laugh at that - even with this workaround
> > > >    enabled, it doesn't fix the problem on L2C-310 R2P0 as the workaround
> > > >    implementation only works on R3P0!
> > > 
> > > For MCPM, is there even a platform that has a PL310 used as an L3? I
> > > suppose architecturally it is possible, but in reality it is probably
> > > not something that's ever been tested.
> > 
> > I have no idea about MCPM
> 
> I assume that the MCPM comment about outer_cache_flush() being harmless
> is because it is assumed to be a no-op. In the mach-vexpress/dcscb.c
> file, there is a v7_exit_coherency_flush() prior to outer_flush_all().
> While it looks like the right way, the comment for
> v7_exit_coherency_flush() states that ldrex/strex no longer work after
> the call.
> 
> You could do with a lock-less outer_flush_all(), even though it is a
> background operation assuming there is no race (single CPU running). I
> think your big clean-up series is already hiding outer_flush_all() under
> the L2 disable function.

Almost but not quite - here's the last matches by git grep:

$ git grep outer_flush_all
arch/arm/include/asm/outercache.h: * outer_flush_all: clean and invalidate all c
arch/arm/include/asm/outercache.h:static inline void outer_flush_all(void)
arch/arm/include/asm/outercache.h: * conditions above concerning outer_flush_all
arch/arm/include/asm/outercache.h:static inline void outer_flush_all(void) { }
arch/arm/mach-exynos/pm.c:      outer_flush_all();
arch/arm/mach-vexpress/dcscb.c:         outer_flush_all();

My worry about MCPM is that it talks about the function containing the
outer_flush_all() potentially racing with a different CPU coming online.
It's not clear to me whether that other CPU would be using the same L2
controller or not:

/*
 * We can't use regular spinlocks. In the switcher case, it is possible
 * for an outbound CPU to call power_down() while its inbound counterpart
 * is already live using the same logical CPU number which trips lockdep
 * debugging.
 */

Since this is Nicocode, I think Nicolas has to be the one to sort this
out.  (Added Nico.)

> But between the secure OS check and the actual operation, Linux could
> start a background one (the SMP case). If this happens, the hardware
> generates an error signal, I guess this translates to an external abort
> (I haven't tried it but it doesn't look to safe for the secure world).

Yes, that's true.  However, the only way to be safe from that would be
for the non-secure world to totally avoid the background operations
altogether.  In the case of PL310, that means flushing by set/way or
by PA line, which can get very expensive if you need to flush the
entire cache (eg, because it's going to be powered down.)

The more I think about it, the more I come to the conclusion that the
maintainence side of the L2 cache hasn't been properly thought out in
trustzone context at hardware design stage.

Even so, we've been performing these background operations for years
from Linux already, so I'm not about to try and "fix" this problem
when we haven't had any reports of issues.  Yes, it makes sense to
drop it from the invalidate/clean functions, because these will only
ever be used with the streaming DMA APIs, but the flush function can
be called on some very big areas.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 19:44             ` Russell King - ARM Linux
@ 2014-03-17 21:09               ` Nicolas Pitre
  2014-03-17 21:14                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Nicolas Pitre @ 2014-03-17 21:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 17 Mar 2014, Russell King - ARM Linux wrote:

> On Mon, Mar 17, 2014 at 05:29:44PM +0000, Catalin Marinas wrote:
> > On Mon, Mar 17, 2014 at 03:37:38PM +0000, Russell King - ARM Linux wrote:
> > > On Mon, Mar 17, 2014 at 10:04:20AM -0500, Rob Herring wrote:
> > > > On Sun, Mar 16, 2014 at 6:52 AM, Russell King - ARM Linux
> > > > <linux@arm.linux.org.uk> wrote:
> > > > >    The MCPM stuff is another issue: what the conditions are there I've
> > > > >    no idea, but it looks like other CPUs will be running when it calls
> > > > >    outer_cache_flush().  MCPM commentry claims that this will be
> > > > >    "harmless" and I just had to laugh at that - even with this workaround
> > > > >    enabled, it doesn't fix the problem on L2C-310 R2P0 as the workaround
> > > > >    implementation only works on R3P0!
> > > > 
> > > > For MCPM, is there even a platform that has a PL310 used as an L3? I
> > > > suppose architecturally it is possible, but in reality it is probably
> > > > not something that's ever been tested.
> > > 
> > > I have no idea about MCPM
> > 
> > I assume that the MCPM comment about outer_cache_flush() being harmless
> > is because it is assumed to be a no-op. In the mach-vexpress/dcscb.c
> > file, there is a v7_exit_coherency_flush() prior to outer_flush_all().
> > While it looks like the right way, the comment for
> > v7_exit_coherency_flush() states that ldrex/strex no longer work after
> > the call.

[...]

> My worry about MCPM is that it talks about the function containing the
> outer_flush_all() potentially racing with a different CPU coming online.
> It's not clear to me whether that other CPU would be using the same L2
> controller or not:
>
> /*
>  * We can't use regular spinlocks. In the switcher case, it is possible
>  * for an outbound CPU to call power_down() while its inbound counterpart
>  * is already live using the same logical CPU number which trips lockdep
>  * debugging.
>  */

L2 is normally a per cluster resource.  It is flushed by the last man 
standing when no other CPUs might contend for the L2 controller.  And if 
the outer cache is shared by multiple clusters then some additional 
handling (such as "last cluster standing") would need to be implemented.

Clearly this outer_cache_flush() call is just a hint if someone were to 
copy that file to write their own backend.  If it is causing problems 
then it should just be removed altogether.  No platforms with MCPM that 
I know of have an actual outer cache at the moment.  And certainly not 
the platform where dcscb.c is used.


Nicolas

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 21:09               ` Nicolas Pitre
@ 2014-03-17 21:14                 ` Russell King - ARM Linux
  2014-03-17 21:54                   ` Nicolas Pitre
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-17 21:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 05:09:46PM -0400, Nicolas Pitre wrote:
> L2 is normally a per cluster resource.  It is flushed by the last man 
> standing when no other CPUs might contend for the L2 controller.  And if 
> the outer cache is shared by multiple clusters then some additional 
> handling (such as "last cluster standing") would need to be implemented.
> 
> Clearly this outer_cache_flush() call is just a hint if someone were to 
> copy that file to write their own backend.  If it is causing problems 
> then it should just be removed altogether.  No platforms with MCPM that 
> I know of have an actual outer cache at the moment.  And certainly not 
> the platform where dcscb.c is used.

This sounds to me like an invitation to kill it :)  Killing it off
would be good, though maybe a comment should be left behind at this
site?

As you're the most familiar with this code, I'd prefer to commit a
patch from you rather than just deleting the reference myself.

Thanks.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 21:14                 ` Russell King - ARM Linux
@ 2014-03-17 21:54                   ` Nicolas Pitre
  0 siblings, 0 replies; 76+ messages in thread
From: Nicolas Pitre @ 2014-03-17 21:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 17 Mar 2014, Russell King - ARM Linux wrote:

> On Mon, Mar 17, 2014 at 05:09:46PM -0400, Nicolas Pitre wrote:
> > L2 is normally a per cluster resource.  It is flushed by the last man 
> > standing when no other CPUs might contend for the L2 controller.  And if 
> > the outer cache is shared by multiple clusters then some additional 
> > handling (such as "last cluster standing") would need to be implemented.
> > 
> > Clearly this outer_cache_flush() call is just a hint if someone were to 
> > copy that file to write their own backend.  If it is causing problems 
> > then it should just be removed altogether.  No platforms with MCPM that 
> > I know of have an actual outer cache at the moment.  And certainly not 
> > the platform where dcscb.c is used.
> 
> This sounds to me like an invitation to kill it :)  Killing it off
> would be good, though maybe a comment should be left behind at this
> site?
> 
> As you're the most familiar with this code, I'd prefer to commit a
> patch from you rather than just deleting the reference myself.

OK, what about the following:

----- >8
Subject: ARM: dcscb.c: remove actual call to outer_flush_all()

Strictly speaking this call is a no-op on the platform where dcscb.c is 
used since it only has architected caches.  The call was there as a hint 
to people who would be inspired by this code to write their own backend 
but the hint might not always be correct.

For example, if a PL310 were to be used this wouldn't be safe to call 
the regular outer_flush_all() anyway as it makes use of atomic 
instructions for locking which cannot be assumed to still be operational 
after v7_exit_coherency_flush() has returned.  Given no other CPUs (in 
the cluster) should be running at that point then standard concurrency 
concerns wouldn't apply.

So let's simply kill this call for now and enhance the existing comment.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 14d4996887..e717515999 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -137,11 +137,10 @@ static void dcscb_power_down(void)
 		v7_exit_coherency_flush(all);
 
 		/*
-		 * This is a harmless no-op.  On platforms with a real
-		 * outer cache this might either be needed or not,
-		 * depending on where the outer cache sits.
+		 * A full outer cache flush could be needed at this point
+		 * on platforms with a real outer cache. This might either
+		 * be needed or not depending on where the outer cache sits.
 		 */
-		outer_flush_all();
 
 		/*
 		 * Disable cluster-level coherency by masking

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings
  2014-03-17 14:00   ` Rob Herring
  2014-03-17 14:40     ` Russell King - ARM Linux
@ 2014-03-18 11:22     ` Russell King - ARM Linux
  2014-03-28 12:51       ` Maxime Coquelin
  2014-03-18 17:26     ` PL310 errata workarounds Russell King - ARM Linux
  2 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-18 11:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> On Sun, Mar 16, 2014 at 10:29 AM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > static struct l2x0_aux prima2_l2x0_aux __initconst = {
> >         .val = 2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT,
> >         .mask = 0,
> > };
> >
> > static struct l2x0_aux marco_l2x0_aux __initconst = {
> >         .val = (2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT) |
> >                 (1 << L2X0_AUX_CTRL_ASSOCIATIVITY_SHIFT),
> >         .mask = L2X0_AUX_CTRL_MASK,
> 
> What I like about this is all the bits they are clearing and failing
> to set. Shared override, prefetch, etc.

It gets more fun when you see the full picture.  Here's a table of the
auxctrl settings that everyone uses.  The "good" ones which leave the
auxctrl register value alone are listed at the top.

                                                        DT val      mask
arch/arm/mach-bcm/board_bcm281xx.c:                     Y  00000000 ffffffff
arch/arm/mach-highbank/highbank.c:                      Y  00000000 ffffffff
arch/arm/mach-imx/mach-vf610.c:                         Y  00000000 ffffffff
arch/arm/mach-imx/system.c:                             Y  00000000 ffffffff
arch/arm/mach-mvebu/armada-370-xp.c:                    Y  00000000 ffffffff
arch/arm/mach-rockchip/rockchip.c:                      Y  00000000 ffffffff
arch/arm/mach-socfpga/socfpga.c:                        Y  00000000 ffffffff
arch/arm/mach-berlin/berlin.c:                          Y  70c00000 feffffff
arch/arm/mach-cns3xxx/core.c:                           N  00540000 fe000fff
arch/arm/mach-exynos/common.c:                          Y  7c470001 c200ffff
arch/arm/mach-imx/mm-imx3.c:                            N  00030024 00000000
arch/arm/mach-nomadik/cpu-8815.c:                       Y  00730249 fe000fff
arch/arm/mach-omap2/omap4-common.c:                     Y  aux_ctrl aux_mask
arch/arm/mach-omap2/omap4-common.c:                     N  aux_ctrl aux_mask
arch/arm/mach-prima2/l2x0.c:                            Y  aux->val aux->mask
arch/arm/mach-realview/realview_pbx.c:                  N  02520000 c0000fff
arch/arm/mach-realview/realview_pb1176.c:               N  00730000 fe000fff
arch/arm/mach-realview/realview_eb.c:                   N  00790000 fe000fff
arch/arm/mach-realview/realview_pb11mp.c:               N  00790000 fe000fff
arch/arm/mach-shmobile/setup-r8a7779.c:                 N  40470000 82000fff
arch/arm/mach-shmobile/board-kzm9g.c:                   N  40460000 82000fff
arch/arm/mach-shmobile/board-kzm9g-reference.c:         N  40460000 82000fff
arch/arm/mach-shmobile/board-armadillo800eva.c:         N  40440000 82000fff
arch/arm/mach-shmobile/board-armadillo800eva-reference.c:N 40440000 82000fff
arch/arm/mach-shmobile/setup-r8a7778.c:                 N  40470000 82000fff
arch/arm/mach-spear/spear13xx.c:                        N  70a60001 fe00ffff
arch/arm/mach-sti/board-dt.c:                           Y  30480000 c0000fff
arch/arm/mach-tegra/tegra.c:                            Y  aux_ctrl 8200c3fe
arch/arm/mach-ux500/cache-l2x0.c:                       Y  aux_val  c0000fff
arch/arm/mach-vexpress/v2m.c:                           Y  00400000 fe0fffff
arch/arm/mach-vexpress/ct-ca9x4.c:                      N  00400000 fe0fffff
arch/arm/mach-zynq/common.c:                            Y  02060000 f0f0ffff

Bits 19:17 are the way size, bit 16 (for the PL310) is the associativity
(8 or 16 ways.)  It seems many people have the impression that they have
to set these values according to the size of the cache and they can't
rely on the value already initialised by hardware from synthesis the
options.

tegra:
        cache_type = readl(p + L2X0_CACHE_TYPE);
        aux_ctrl = (cache_type & 0x700) << (17-8);
        aux_ctrl |= 0x7C400001;

Fun, since that's just reading the cache type to generate the auxctrl value
which should already be set to the values in the cache type register.

ux500:
        aux_val = 0x3e000000;
        if (cpu_is_ux540_family())
                aux_val |= L310_AUX_CTRL_WAY_SIZE(4);   /* 128KB way size */
        else
                aux_val |= L310_AUX_CTRL_WAY_SIZE(3);   /* 64KB way size */

omap4:
        if (omap_rev() == OMAP4430_REV_ES1_0) {
                aux_ctrl = L310_AUX_CTRL_ASSOCIATIVITY_16 |
                           L310_AUX_CTRL_CACHE_REPLACE_RR |
                           L310_AUX_CTRL_NS_LOCKDOWN |
                           L310_AUX_CTRL_NS_INT_CTRL |
                           L310_AUX_CTRL_WAY_SIZE(2);
                aux_mask = L2X0_AUX_CTRL_MASK;
        } else {
                aux_ctrl = L310_AUX_CTRL_CACHE_REPLACE_RR |
                           L310_AUX_CTRL_NS_LOCKDOWN |
                           L310_AUX_CTRL_NS_INT_CTRL |
                           L310_AUX_CTRL_SHARED_OVERRIDE |
                           L310_AUX_CTRL_DATA_PREFETCH |
                           L310_AUX_CTRL_INSTR_PREFETCH |
                           L310_AUX_CTRL_EARLY_BRESP;
                aux_mask = ~aux_ctrl;
        }
(I killed its setting of the cache sizing for ES2.)

So, the question is: how much of the messing with way size/associativity
can we kill off?  Can we kill all that off and ensure that aux_val always
has bits 19:16 clear and aux_mask always has them set?

-- 
FTTC broadband for 0.8mile line: now@9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-17 14:00   ` Rob Herring
  2014-03-17 14:40     ` Russell King - ARM Linux
  2014-03-18 11:22     ` *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings Russell King - ARM Linux
@ 2014-03-18 17:26     ` Russell King - ARM Linux
  2014-03-19 21:52       ` Marek Vasut
  2 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-18 17:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> Setting prefetch enables and early BRESP could all be done
> unconditionally in the core code.

I think we can do a few things here, if we know that the CPUs we're
connected to are all Cortex-A9:

1. Enable BRESP.

2. Enable I+D prefetching - but we really need to tune the prefetch offset
   for this to be worthwhile.  The value depends on the L3 memory system
   latency, so isn't something that should be specified at the SoC level.
   It may also change with different operating points.

3. Full line of zeros - I think this is a difficult one to achieve properly.
   The required sequence:

   - enable FLZ in L2 cache
   - enable L2 cache
   - enable FLZ in Cortex A9

   I'd also assume that when we turn the L2 cache off, we need the reverse
   sequence too.  So this sequence can't be done entirely by the boot loader.

With (1) enabled and (2) properly tuned, I see a performance increase of
around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
very much (and still impressively exceeds the Freescale stated maximum
total bandwidth of the gigabit interface.)

(3) is going to be harder to have the kernel to sort out - because we'd
need to enable FLZ in the L2 cache, enable the cache (that's the easy bit)
and then turn on FLZ in each of the Cortex-A9 AUXCR registers.  That's the
hard bit for the L2 code to do for two reasons:
a) it means SMC stuff for our three non-secure SoC
b) the random placement of where the L2 cache initialisation makes setting
   stuff up at the appropriate time more difficult than it really need be.
   We can't rely on the scheduler running but at the same time, we can't
   rely on other SMP CPUs not running.

I think (3) is going to require a round of L2 cache initialisation
unification across all platforms before we can sanely have the kernel
enabling that feature.

Other features fall under the same problem that (3) does.  If you have a
Cortex-A9 connected to a PL310, then there are performance options you
can enable in the Cortex-A9, such as the prefetch hint enable, which
allow the Cortex-A9 to issue hints to the connected PL310 to bring cache
lines into the L2 without needing to return data to the CA9.

Looking at some of this stuff, there's quite a large number of performance
features that we just plain ignore so far - uboot doesn't seem to touch
them, and from what I can see, not many people have been particularly
interested in evaluating them (people seem to prefer to optimise memcpy
code, or switch to using neon, or some other solution like that.)

I wonder how much system performance we're losing because we've been
ignoring these configuration options.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-14 19:14       ` Rob Herring
  2014-03-14 20:32         ` Russell King - ARM Linux
@ 2014-03-19 21:22         ` Marek Vasut
  2014-03-19 21:35           ` Rob Herring
  1 sibling, 1 reply; 76+ messages in thread
From: Marek Vasut @ 2014-03-19 21:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday, March 14, 2014 at 08:14:06 PM, Rob Herring wrote:
> On Fri, Mar 14, 2014 at 12:57 PM, Russell King - ARM Linux
> 
> <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
> >> The obvious fix is to convert l2x0_flush_line to a function ptr which
> >> just further complicates the code.
> > 
> > I currently regard this code as unmaintainable, so right now I'm in the
> > process of completely rewriting it, going back to the specs and checking
> > what is required.  Sharing code between the L210, L220 and L310 looks
> > good from the point of view of keeping the line count down, but actually
> > they're quite different and have various different requirements.
> > 
> > Out of those, L220 is the odd one out: all operations are background
> > operations there, which means they have to be waited for.  Currently,
> > if you have one of these, and you build a multiplatform kernel which
> > includes PL310 support, you don't wait for any of these and you probably
> > see a lot of data corruption as a result.
> 
> Oops. I guess there are not many users as the L220 was primarily used
> with the 1176 (and 11MP?) IIRC. I guess one of the primary 1176 based
> platforms currently used is RPi, but it appears the BCM2835 doesn't
> have an L2 (other than some custom L2 for the GPU).

Won't Freescale i.MX35 help here ? It's 1136JF-S core and has some L2 cache.

Best regards,
Marek Vasut

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-19 21:22         ` Marek Vasut
@ 2014-03-19 21:35           ` Rob Herring
  2014-03-19 21:46             ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Rob Herring @ 2014-03-19 21:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 19, 2014 at 4:22 PM, Marek Vasut <marex@denx.de> wrote:
> On Friday, March 14, 2014 at 08:14:06 PM, Rob Herring wrote:
>> On Fri, Mar 14, 2014 at 12:57 PM, Russell King - ARM Linux
>>
>> <linux@arm.linux.org.uk> wrote:
>> > On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
>> >> The obvious fix is to convert l2x0_flush_line to a function ptr which
>> >> just further complicates the code.
>> >
>> > I currently regard this code as unmaintainable, so right now I'm in the
>> > process of completely rewriting it, going back to the specs and checking
>> > what is required.  Sharing code between the L210, L220 and L310 looks
>> > good from the point of view of keeping the line count down, but actually
>> > they're quite different and have various different requirements.
>> >
>> > Out of those, L220 is the odd one out: all operations are background
>> > operations there, which means they have to be waited for.  Currently,
>> > if you have one of these, and you build a multiplatform kernel which
>> > includes PL310 support, you don't wait for any of these and you probably
>> > see a lot of data corruption as a result.
>>
>> Oops. I guess there are not many users as the L220 was primarily used
>> with the 1176 (and 11MP?) IIRC. I guess one of the primary 1176 based
>> platforms currently used is RPi, but it appears the BCM2835 doesn't
>> have an L2 (other than some custom L2 for the GPU).
>
> Won't Freescale i.MX35 help here ? It's 1136JF-S core and has some L2 cache.

No. 1136 uses L210. 1176 uses L220. There was the i.MX37 which had an
1176, but it was pretty short lived and never saw upstream support.

Rob

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-19 21:35           ` Rob Herring
@ 2014-03-19 21:46             ` Russell King - ARM Linux
  2014-03-19 21:54               ` Marek Vasut
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-19 21:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 19, 2014 at 04:35:58PM -0500, Rob Herring wrote:
> On Wed, Mar 19, 2014 at 4:22 PM, Marek Vasut <marex@denx.de> wrote:
> > On Friday, March 14, 2014 at 08:14:06 PM, Rob Herring wrote:
> >> On Fri, Mar 14, 2014 at 12:57 PM, Russell King - ARM Linux
> >>
> >> <linux@arm.linux.org.uk> wrote:
> >> > On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
> >> >> The obvious fix is to convert l2x0_flush_line to a function ptr which
> >> >> just further complicates the code.
> >> >
> >> > I currently regard this code as unmaintainable, so right now I'm in the
> >> > process of completely rewriting it, going back to the specs and checking
> >> > what is required.  Sharing code between the L210, L220 and L310 looks
> >> > good from the point of view of keeping the line count down, but actually
> >> > they're quite different and have various different requirements.
> >> >
> >> > Out of those, L220 is the odd one out: all operations are background
> >> > operations there, which means they have to be waited for.  Currently,
> >> > if you have one of these, and you build a multiplatform kernel which
> >> > includes PL310 support, you don't wait for any of these and you probably
> >> > see a lot of data corruption as a result.
> >>
> >> Oops. I guess there are not many users as the L220 was primarily used
> >> with the 1176 (and 11MP?) IIRC. I guess one of the primary 1176 based
> >> platforms currently used is RPi, but it appears the BCM2835 doesn't
> >> have an L2 (other than some custom L2 for the GPU).
> >
> > Won't Freescale i.MX35 help here ? It's 1136JF-S core and has some L2 cache.
> 
> No. 1136 uses L210. 1176 uses L220. There was the i.MX37 which had an
> 1176, but it was pretty short lived and never saw upstream support.

Don't worry too much about the L220 - the Realview boards have one, and
I have one Realview EB board:

Linux version 3.14.0-rc7+ (rmk at rmk-PC.arm.linux.org.uk) (gcc version 4.5.4 (GCC) ) #664 SMP Wed Mar 19 18:06:11 GMT 2014
CPU: ARMv6-compatible processor [410fb020] revision 0 (ARMv7), cr=00c5387d
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Machine: ARM-RealView EB
...
CPU0: thread -1, cpu 0, socket -1, mpidr 0
Setting up static identity map for 0x7036d130 - 0x7036d1a0
CPU1: Booted secondary processor
CPU1: thread -1, cpu 0, socket -1, mpidr 1
CPU2: Booted secondary processor
CPU2: thread -1, cpu 0, socket -1, mpidr 2
CPU3: Booted secondary processor
CPU3: thread -1, cpu 0, socket -1, mpidr 3
Brought up 4 CPUs
...
L2C: DT/platform modifies aux control register: 0x00020fff -> 0x00790fff
L2C-220 cache controller enabled, 8 ways, 1024 kB
L2C-220: CACHE_ID 0x41000081, AUX_CTRL 0x00790fff

These need to be configured by the board code for their cache size because
the register isn't correctly configured at reset.  Same goes for PL210.
Only with PL310 does the register contain appropriate sizing information
on reset.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-18 17:26     ` PL310 errata workarounds Russell King - ARM Linux
@ 2014-03-19 21:52       ` Marek Vasut
  2014-03-19 22:51         ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Marek Vasut @ 2014-03-19 21:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
> On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> > Setting prefetch enables and early BRESP could all be done
> > unconditionally in the core code.
> 
> I think we can do a few things here, if we know that the CPUs we're
> connected to are all Cortex-A9:
> 
> 1. Enable BRESP.
> 
> 2. Enable I+D prefetching - but we really need to tune the prefetch offset
>    for this to be worthwhile.  The value depends on the L3 memory system
>    latency, so isn't something that should be specified at the SoC level.
>    It may also change with different operating points.
> 
> 3. Full line of zeros - I think this is a difficult one to achieve
> properly. The required sequence:
> 
>    - enable FLZ in L2 cache
>    - enable L2 cache
>    - enable FLZ in Cortex A9
> 
>    I'd also assume that when we turn the L2 cache off, we need the reverse
>    sequence too.  So this sequence can't be done entirely by the boot
> loader.
> 
> With (1) enabled and (2) properly tuned, I see a performance increase of
> around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
> 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
> With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
> receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
> very much (and still impressively exceeds the Freescale stated maximum
> total bandwidth of the gigabit interface.)

Speaking of FEC and slightly off-topic, have you ever seen this on your box 
[1]/[2]/[3] ? I wonder if this might be cache-related as well, since I saw 
similar issue on MX6 with PCIe-connected ethernet. I cannot put a finger on this 
though.

[1] http://www.spinics.net/lists/netdev/msg275354.html
[2] http://www.spinics.net/lists/netdev/msg268190.html
[3] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-
October/202519.html

Best regards,
Marek Vasut

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-19 21:46             ` Russell King - ARM Linux
@ 2014-03-19 21:54               ` Marek Vasut
  0 siblings, 0 replies; 76+ messages in thread
From: Marek Vasut @ 2014-03-19 21:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday, March 19, 2014 at 10:46:32 PM, Russell King - ARM Linux wrote:
> On Wed, Mar 19, 2014 at 04:35:58PM -0500, Rob Herring wrote:
> > On Wed, Mar 19, 2014 at 4:22 PM, Marek Vasut <marex@denx.de> wrote:
> > > On Friday, March 14, 2014 at 08:14:06 PM, Rob Herring wrote:
> > >> On Fri, Mar 14, 2014 at 12:57 PM, Russell King - ARM Linux
> > >> 
> > >> <linux@arm.linux.org.uk> wrote:
> > >> > On Fri, Mar 14, 2014 at 11:02:17AM -0500, Rob Herring wrote:
> > >> >> The obvious fix is to convert l2x0_flush_line to a function ptr
> > >> >> which just further complicates the code.
> > >> > 
> > >> > I currently regard this code as unmaintainable, so right now I'm in
> > >> > the process of completely rewriting it, going back to the specs and
> > >> > checking what is required.  Sharing code between the L210, L220 and
> > >> > L310 looks good from the point of view of keeping the line count
> > >> > down, but actually they're quite different and have various
> > >> > different requirements.
> > >> > 
> > >> > Out of those, L220 is the odd one out: all operations are background
> > >> > operations there, which means they have to be waited for. 
> > >> > Currently, if you have one of these, and you build a multiplatform
> > >> > kernel which includes PL310 support, you don't wait for any of
> > >> > these and you probably see a lot of data corruption as a result.
> > >> 
> > >> Oops. I guess there are not many users as the L220 was primarily used
> > >> with the 1176 (and 11MP?) IIRC. I guess one of the primary 1176 based
> > >> platforms currently used is RPi, but it appears the BCM2835 doesn't
> > >> have an L2 (other than some custom L2 for the GPU).
> > > 
> > > Won't Freescale i.MX35 help here ? It's 1136JF-S core and has some L2
> > > cache.
> > 
> > No. 1136 uses L210. 1176 uses L220. There was the i.MX37 which had an
> > 1176, but it was pretty short lived and never saw upstream support.
> 
> Don't worry too much about the L220 - the Realview boards have one, and
> I have one Realview EB board:

OK, gotcha.

Best regards,
Marek Vasut

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-19 21:52       ` Marek Vasut
@ 2014-03-19 22:51         ` Russell King - ARM Linux
  2014-03-19 23:05           ` FEC ethernet issues [Was: PL310 errata workarounds] Marek Vasut
  2014-03-21 15:50           ` PL310 errata workarounds Frank Li
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-19 22:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:
> On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
> > On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
> > > Setting prefetch enables and early BRESP could all be done
> > > unconditionally in the core code.
> > 
> > I think we can do a few things here, if we know that the CPUs we're
> > connected to are all Cortex-A9:
> > 
> > 1. Enable BRESP.
> > 
> > 2. Enable I+D prefetching - but we really need to tune the prefetch offset
> >    for this to be worthwhile.  The value depends on the L3 memory system
> >    latency, so isn't something that should be specified at the SoC level.
> >    It may also change with different operating points.
> > 
> > 3. Full line of zeros - I think this is a difficult one to achieve
> > properly. The required sequence:
> > 
> >    - enable FLZ in L2 cache
> >    - enable L2 cache
> >    - enable FLZ in Cortex A9
> > 
> >    I'd also assume that when we turn the L2 cache off, we need the reverse
> >    sequence too.  So this sequence can't be done entirely by the boot
> > loader.
> > 
> > With (1) enabled and (2) properly tuned, I see a performance increase of
> > around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
> > 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
> > With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
> > receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
> > very much (and still impressively exceeds the Freescale stated maximum
> > total bandwidth of the gigabit interface.)
> 
> Speaking of FEC and slightly off-topic, have you ever seen this on your box 
> [1]/[2]/[3] ? I wonder if this might be cache-related as well, since I saw 
> similar issue on MX6 with PCIe-connected ethernet. I cannot put a finger on
> this though.

I think I've seen something similar once or twice.

Let me just pull up the transmit function.  This is the code which
writes to the descriptors (which frankly is pretty horrid):

        bdp->cbd_datlen = skb->len;
        bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
                        skb->len, DMA_TO_DEVICE); <-- barrier inside
                ebdp->cbd_bdu = 0;
                        ebdp->cbd_esc = BD_ENET_TX_INT;
                                ebdp->cbd_esc |= BD_ENET_TX_PINS;
        bdp->cbd_sc = status;
        writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write

A couple of points here:

1. The hardware operates in a ring - it can read the next ring if it
   sees the BD_ENET_TX_READY bit set if has just finished processing
   the previous descriptor - this can happen before the write to
   FEC_X_DES_ACTIVE.

2. The ARM can re-order writes.  The writes it can re-order are:

	bdp->cbd_bufaddr
	ebdp->cbd_bdu
	ebdp->cbd_esc
	ebdp->cbd_sc

Hence, it's entirely possible for the FEC to see the updated descriptor
status before the rest of the descriptor has been written.  What's missing
is a barrier between the descriptor writes, and the final write to
bdp->cbd_sc.

Had I not got distracted by the L2 issues, I'd have posted my FEC patches
by now... in any case, my current xmit function looks a little different -
I've organised it such that:

1. We don't modify anything until we're past the point where things can
   error out.

2. Writes to the descriptor are localised in one area.

3. There's a wmb() barrier between cbd_sc and the previous writes - I
   discussed this with Will Deacon, which resulted in this documentation
   for the barrier:

        /*
         * We need the preceding stores to the descriptor to complete
         * before updating the status field, which hands it over to the
         * hardware.  The corresponding rmb() is "in the hardware".
         */

The second thing that causes the transmit timeouts is the horrid way
NAPI has been added to the driver - it's racy.  NAPI itself isn't the
problem, it's this (compressed a bit to show only the relevant bits):

        do {
                int_events = readl(fep->hwp + FEC_IEVENT);
                writel(int_events, fep->hwp + FEC_IEVENT);
                if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
                        if (napi_schedule_prep(&fep->napi)) {
                                writel(FEC_RX_DISABLED_IMASK,
                                        fep->hwp + FEC_IMASK);
                                __napi_schedule(&fep->napi);
                        }
                }
                if (int_events & FEC_ENET_MII) {
                        complete(&fep->mdio_done);
                }
        } while (int_events);

Consider what happens here if:
- we talk in the MII bus and receive a MII interrupt
- we're just finishing NAPI processing but haven't quite got around to
  calling napi_complete()
- the ethernet has sent all packets, and has also raised a transmit
  interrupt

The result is the handler is entered, FEC_IEVENT contains TXF and MII
events.  Both these events are cleared down, (and thus no longer exist
as interrupt-causing events.)  napi_schedule_prep() returns false as
the NAPI rx function is still running, and doesn't mark it for a re-run.
We then do the MII interrupt.  Loop again, and int_events is zero,
we exit.

Meanwhile, the NAPI rx function calls napi_complete() and re-enables
the receive interrupt.  If you're unlucky enough that the RX ring is
also full... no RXF interrupt.  So no further interrupts except maybe
MII interrupts.

NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
gets reaped.  The result is a timeout with a completely full TX ring.

I think I've seen both cases: I've seen the case where the TX ring is
completely empty, but it hasn't been reaped.   I've also seen the case
where the TX ring contains packets to be transmitted but the hardware
isn't sending them.

That all said - with the patch below I haven't seen problems since.
(which is the quickest way I can think of to get you a copy of what
I'm presently running - I've killed a number of debug bits denoted by
all the blank lines - and this is against -rc7.)  You may notice that
I added some TX ring dumping code to the driver - always useful in
these situations. ;-)

This patch is of course the consolidated version: individually, this
would be at least 19 patches with nice commit messages describing
each change...

As far as the 600Mbps receive - you need the right conditions for that.
I select the performance cpufreq governor after boot, and let the boot
quiesce.  It doesn't take much for it to drop back to 460Mbps - another
running process other than iperf -s is sufficient to do that.

Let me know how you get on with this.

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 3b8d6d19ff05..510580eeae4b 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -170,6 +170,11 @@ struct bufdesc_ex {
 	unsigned short res0[4];
 };
 
+union bufdesc_u {
+	struct bufdesc bd;
+	struct bufdesc_ex ebd;
+};
+
 /*
  *	The following definitions courtesy of commproc.h, which where
  *	Copyright (c) 1997 Dan Malek (dmalek at jlc.net).
@@ -240,14 +245,14 @@ struct bufdesc_ex {
  * the skbuffer directly.
  */
 
-#define FEC_ENET_RX_PAGES	8
+#define FEC_ENET_RX_PAGES	32
 #define FEC_ENET_RX_FRSIZE	2048
 #define FEC_ENET_RX_FRPPG	(PAGE_SIZE / FEC_ENET_RX_FRSIZE)
 #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
 #define FEC_ENET_TX_FRSIZE	2048
 #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
-#define TX_RING_SIZE		16	/* Must be power of two */
-#define TX_RING_MOD_MASK	15	/*   for this to work */
+#define TX_RING_SIZE		64	/* Must be power of two */
+#define TX_RING_MOD_MASK	63	/*   for this to work */
 
 #define BD_ENET_RX_INT          0x00800000
 #define BD_ENET_RX_PTP          ((ushort)0x0400)
@@ -289,12 +294,12 @@ struct fec_enet_private {
 	/* CPM dual port RAM relative addresses */
 	dma_addr_t	bd_dma;
 	/* Address of Rx and Tx buffers */
-	struct bufdesc	*rx_bd_base;
-	struct bufdesc	*tx_bd_base;
+	union bufdesc_u	*rx_bd_base;
+	union bufdesc_u	*tx_bd_base;
 	/* The next free ring entry */
-	struct bufdesc	*cur_rx, *cur_tx;
-	/* The ring entries to be free()ed */
-	struct bufdesc	*dirty_tx;
+	unsigned short tx_next;
+	unsigned short tx_dirty;
+	unsigned short rx_next;
 
 	unsigned short tx_ring_size;
 	unsigned short rx_ring_size;
@@ -335,6 +340,9 @@ struct fec_enet_private {
 	struct timer_list time_keep;
 	struct fec_enet_delayed_work delay_work;
 	struct regulator *reg_phy;
+	unsigned long quirks;
+
+
 };
 
 void fec_ptp_init(struct platform_device *pdev);
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 03a351300013..8105697d5a99 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -101,6 +101,8 @@ static void set_multicast_list(struct net_device *ndev);
  * ENET_TDAR[TDAR].
  */
 #define FEC_QUIRK_ERR006358            (1 << 7)
+/* Controller has ability to offset rx packets */
+#define FEC_QUIRK_RX_SHIFT16           (1 << 8)
 
 static struct platform_device_id fec_devtype[] = {
 	{
@@ -120,7 +122,8 @@ static struct platform_device_id fec_devtype[] = {
 		.name = "imx6q-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
 				FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
-				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358,
+				FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358 |
+				FEC_QUIRK_RX_SHIFT16,
 	}, {
 		.name = "mvf600-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC,
@@ -200,6 +203,7 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 /* FEC receive acceleration */
 #define FEC_RACC_IPDIS		(1 << 1)
 #define FEC_RACC_PRODIS		(1 << 2)
+#define FEC_RACC_SHIFT16	BIT(7)
 #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
 
 /*
@@ -233,57 +237,54 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 
 static int mii_cnt;
 
-static inline
-struct bufdesc *fec_enet_get_nextdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
+static unsigned copybreak = 200;
+module_param(copybreak, uint, 0644);
+MODULE_PARM_DESC(copybreak,
+		 "Maximum size of packet that is copied to a new buffer on receive");
+
+
+
+
+
+static bool fec_enet_rx_zerocopy(struct fec_enet_private *fep, unsigned pktlen)
 {
-	struct bufdesc *new_bd = bdp + 1;
-	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp + 1;
-	struct bufdesc_ex *ex_base;
-	struct bufdesc *base;
-	int ring_size;
-
-	if (bdp >= fep->tx_bd_base) {
-		base = fep->tx_bd_base;
-		ring_size = fep->tx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
-	} else {
-		base = fep->rx_bd_base;
-		ring_size = fep->rx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
-	}
+#ifndef CONFIG_M5272
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16 && pktlen >= copybreak)
+		return true;
+#endif
+	return false;
+}
+
+static union bufdesc_u *
+fec_enet_tx_get(unsigned index, struct fec_enet_private *fep)
+{
+	union bufdesc_u *base = fep->tx_bd_base;
+	union bufdesc_u *bdp;
+
+	index &= fep->tx_ring_size - 1;
 
 	if (fep->bufdesc_ex)
-		return (struct bufdesc *)((ex_new_bd >= (ex_base + ring_size)) ?
-			ex_base : ex_new_bd);
+		bdp = (union bufdesc_u *)(&base->ebd + index);
 	else
-		return (new_bd >= (base + ring_size)) ?
-			base : new_bd;
+		bdp = (union bufdesc_u *)(&base->bd + index);
+
+	return bdp;
 }
 
-static inline
-struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
+static union bufdesc_u *
+fec_enet_rx_get(unsigned index, struct fec_enet_private *fep)
 {
-	struct bufdesc *new_bd = bdp - 1;
-	struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp - 1;
-	struct bufdesc_ex *ex_base;
-	struct bufdesc *base;
-	int ring_size;
-
-	if (bdp >= fep->tx_bd_base) {
-		base = fep->tx_bd_base;
-		ring_size = fep->tx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
-	} else {
-		base = fep->rx_bd_base;
-		ring_size = fep->rx_ring_size;
-		ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
-	}
+	union bufdesc_u *base = fep->rx_bd_base;
+	union bufdesc_u *bdp;
+
+	index &= fep->rx_ring_size - 1;
 
 	if (fep->bufdesc_ex)
-		return (struct bufdesc *)((ex_new_bd < ex_base) ?
-			(ex_new_bd + ring_size) : ex_new_bd);
+		bdp = (union bufdesc_u *)(&base->ebd + index);
 	else
-		return (new_bd < base) ? (new_bd + ring_size) : new_bd;
+		bdp = (union bufdesc_u *)(&base->bd + index);
+
+	return bdp;
 }
 
 static void *swap_buffer(void *bufaddr, int len)
@@ -297,6 +298,26 @@ static void *swap_buffer(void *bufaddr, int len)
 	return bufaddr;
 }
 
+static void fec_dump(struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned index = 0;
+
+	netdev_info(ndev, "TX ring dump\n");
+	pr_info("Nr    SC     addr       len  SKB\n");
+
+	for (index = 0; index < fep->tx_ring_size; index++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(index, fep);
+
+		pr_info("%2u %c%c 0x%04x 0x%08lx %4u %p\n",
+			index,
+			index == fep->tx_next ? 'S' : ' ',
+			index == fep->tx_dirty ? 'H' : ' ',
+			bdp->bd.cbd_sc, bdp->bd.cbd_bufaddr, bdp->bd.cbd_datlen,
+			fep->tx_skbuff[index]);
+	}
+}
+
 static int
 fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 {
@@ -312,21 +333,42 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 	return 0;
 }
 
+static void
+fec_enet_tx_unmap(struct bufdesc *bdp, struct fec_enet_private *fep)
+{
+	dma_addr_t addr = bdp->cbd_bufaddr;
+	unsigned length = bdp->cbd_datlen;
+
+	bdp->cbd_bufaddr = 0;
+
+	dma_unmap_single(&fep->pdev->dev, addr, length, DMA_TO_DEVICE);
+}
+
 static netdev_tx_t
 fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *bdp, *bdp_pre;
+	union bufdesc_u *bdp, *bdp_pre;
 	void *bufaddr;
 	unsigned short	status;
-	unsigned int index;
+	unsigned index;
+	unsigned length;
+	dma_addr_t addr;
+
+
+
+
+
+
+
+
+
 
 	/* Fill in a Tx ring entry */
-	bdp = fep->cur_tx;
+	index = fep->tx_next;
 
-	status = bdp->cbd_sc;
+	bdp = fec_enet_tx_get(index, fep);
+	status = bdp->bd.cbd_sc;
 
 	if (status & BD_ENET_TX_READY) {
 		/* Ooops.  All transmit buffers are full.  Bail out.
@@ -347,21 +389,15 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
 	/* Set buffer length and buffer pointer */
 	bufaddr = skb->data;
-	bdp->cbd_datlen = skb->len;
+	length = skb->len;
 
 	/*
 	 * On some FEC implementations data must be aligned on
 	 * 4-byte boundaries. Use bounce buffers to copy data
 	 * and get it aligned. Ugh.
 	 */
-	if (fep->bufdesc_ex)
-		index = (struct bufdesc_ex *)bdp -
-			(struct bufdesc_ex *)fep->tx_bd_base;
-	else
-		index = bdp - fep->tx_bd_base;
-
 	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
-		memcpy(fep->tx_bounce[index], skb->data, skb->len);
+		memcpy(fep->tx_bounce[index], skb->data, length);
 		bufaddr = fep->tx_bounce[index];
 	}
 
@@ -370,70 +406,72 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	 * the system that it's running on. As the result, driver has to
 	 * swap every frame going to and coming from the controller.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
-		swap_buffer(bufaddr, skb->len);
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(bufaddr, length);
 
-	/* Save skb pointer */
-	fep->tx_skbuff[index] = skb;
-
-	/* Push the data cache so the CPM does not get stale memory
-	 * data.
-	 */
-	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
-			skb->len, DMA_TO_DEVICE);
-	if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
-		bdp->cbd_bufaddr = 0;
-		fep->tx_skbuff[index] = NULL;
+	/* Push the data cache so the CPM does not get stale memory data. */
+	addr = dma_map_single(&fep->pdev->dev, bufaddr, length, DMA_TO_DEVICE);
+	if (dma_mapping_error(&fep->pdev->dev, addr)) {
 		dev_kfree_skb_any(skb);
 		if (net_ratelimit())
 			netdev_err(ndev, "Tx DMA memory map failed\n");
 		return NETDEV_TX_OK;
 	}
 
-	if (fep->bufdesc_ex) {
+	/* Save skb pointer */
+	fep->tx_skbuff[index] = skb;
 
-		struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-		ebdp->cbd_bdu = 0;
+	bdp->bd.cbd_datlen = length;
+	bdp->bd.cbd_bufaddr = addr;
+
+	if (fep->bufdesc_ex) {
+		bdp->ebd.cbd_bdu = 0;
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
 			fep->hwts_tx_en)) {
-			ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
+			bdp->ebd.cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
 			skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
 		} else {
-			ebdp->cbd_esc = BD_ENET_TX_INT;
+			bdp->ebd.cbd_esc = BD_ENET_TX_INT;
 
 			/* Enable protocol checksum flags
 			 * We do not bother with the IP Checksum bits as they
 			 * are done by the kernel
 			 */
 			if (skb->ip_summed == CHECKSUM_PARTIAL)
-				ebdp->cbd_esc |= BD_ENET_TX_PINS;
+				bdp->ebd.cbd_esc |= BD_ENET_TX_PINS;
 		}
 	}
 
+	/*
+	 * We need the preceding stores to the descriptor to complete
+	 * before updating the status field, which hands it over to the
+	 * hardware.  The corresponding rmb() is "in the hardware".
+	 */
+	wmb();
+
 	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
 	 * it's the last BD of the frame, and to put the CRC on the end.
 	 */
 	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
 			| BD_ENET_TX_LAST | BD_ENET_TX_TC);
-	bdp->cbd_sc = status;
+	bdp->bd.cbd_sc = status;
 
-	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
-	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
-	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
+	bdp_pre = fec_enet_tx_get(index - 1, fep);
+	if ((fep->quirks & FEC_QUIRK_ERR006358) &&
+	    !(bdp_pre->bd.cbd_sc & BD_ENET_TX_READY)) {
 		fep->delay_work.trig_tx = true;
 		schedule_delayed_work(&(fep->delay_work.delay_work),
 					msecs_to_jiffies(1));
 	}
 
-	/* If this was the last BD in the ring, start at the beginning again. */
-	bdp = fec_enet_get_nextdesc(bdp, fep);
-
 	skb_tx_timestamp(skb);
 
-	fep->cur_tx = bdp;
+	fep->tx_next = (index + 1) & (fep->tx_ring_size - 1);
 
-	if (fep->cur_tx == fep->dirty_tx)
+	if (fep->tx_next == fep->tx_dirty) {
+
 		netif_stop_queue(ndev);
+	}
 
 	/* Trigger transmission start */
 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
@@ -446,46 +484,43 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 static void fec_enet_bd_init(struct net_device *dev)
 {
 	struct fec_enet_private *fep = netdev_priv(dev);
-	struct bufdesc *bdp;
 	unsigned int i;
 
 	/* Initialize the receive buffer descriptors. */
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
 
 		/* Initialize the BD for every fragment in the page. */
-		if (bdp->cbd_bufaddr)
-			bdp->cbd_sc = BD_ENET_RX_EMPTY;
+		if (bdp->bd.cbd_bufaddr)
+			bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
 		else
-			bdp->cbd_sc = 0;
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+			bdp->bd.cbd_sc = 0;
+		if (i == fep->rx_ring_size - 1)
+			bdp->bd.cbd_sc |= BD_SC_WRAP;
 	}
 
-	/* Set the last buffer to wrap */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-
-	fep->cur_rx = fep->rx_bd_base;
+	fep->rx_next = 0;
 
 	/* ...and the same for transmit */
-	bdp = fep->tx_bd_base;
-	fep->cur_tx = bdp;
 	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
 
 		/* Initialize the BD for every fragment in the page. */
-		bdp->cbd_sc = 0;
-		if (bdp->cbd_bufaddr && fep->tx_skbuff[i]) {
+		/* Set the last buffer to wrap */
+		if (i == fep->tx_ring_size - 1)
+			bdp->bd.cbd_sc = BD_SC_WRAP;
+		else
+			bdp->bd.cbd_sc = 0;
+		if (bdp->bd.cbd_bufaddr)
+			fec_enet_tx_unmap(&bdp->bd, fep);
+		if (fep->tx_skbuff[i]) {
 			dev_kfree_skb_any(fep->tx_skbuff[i]);
 			fep->tx_skbuff[i] = NULL;
 		}
-		bdp->cbd_bufaddr = 0;
-		bdp = fec_enet_get_nextdesc(bdp, fep);
 	}
 
-	/* Set the last buffer to wrap */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-	fep->dirty_tx = bdp;
+	fep->tx_next = 0;
+	fep->tx_dirty = fep->tx_ring_size - 1;
 }
 
 /* This function is called to start or restart the FEC during a link
@@ -496,8 +531,6 @@ static void
 fec_restart(struct net_device *ndev, int duplex)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	int i;
 	u32 val;
 	u32 temp_mac[2];
@@ -519,7 +552,7 @@ fec_restart(struct net_device *ndev, int duplex)
 	 * enet-mac reset will reset mac address registers too,
 	 * so need to reconfigure it.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
 		writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW);
 		writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH);
@@ -568,6 +601,8 @@ fec_restart(struct net_device *ndev, int duplex)
 #if !defined(CONFIG_M5272)
 	/* set RX checksum */
 	val = readl(fep->hwp + FEC_RACC);
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16)
+		val |= FEC_RACC_SHIFT16;
 	if (fep->csum_flags & FLAG_RX_CSUM_ENABLED)
 		val |= FEC_RACC_OPTIONS;
 	else
@@ -579,7 +614,7 @@ fec_restart(struct net_device *ndev, int duplex)
 	 * The phy interface and speed need to get configured
 	 * differently on enet-mac.
 	 */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		/* Enable flow control and length check */
 		rcntl |= 0x40000000 | 0x00000020;
 
@@ -602,7 +637,7 @@ fec_restart(struct net_device *ndev, int duplex)
 		}
 	} else {
 #ifdef FEC_MIIGSK_ENR
-		if (id_entry->driver_data & FEC_QUIRK_USE_GASKET) {
+		if (fep->quirks & FEC_QUIRK_USE_GASKET) {
 			u32 cfgr;
 			/* disable the gasket and wait */
 			writel(0, fep->hwp + FEC_MIIGSK_ENR);
@@ -655,7 +690,7 @@ fec_restart(struct net_device *ndev, int duplex)
 	writel(0, fep->hwp + FEC_HASH_TABLE_LOW);
 #endif
 
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		/* enable ENET endian swap */
 		ecntl |= (1 << 8);
 		/* enable ENET store and forward mode */
@@ -692,8 +727,6 @@ static void
 fec_stop(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & (1 << 8);
 
 	/* We cannot expect a graceful transmit stop without link !!! */
@@ -711,7 +744,7 @@ fec_stop(struct net_device *ndev)
 	writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
 
 	/* We have to keep ENET enabled to have MII interrupt stay working */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
+	if (fep->quirks & FEC_QUIRK_ENET_MAC) {
 		writel(2, fep->hwp + FEC_ECNTRL);
 		writel(rmii_mode, fep->hwp + FEC_R_CNTRL);
 	}
@@ -723,6 +756,8 @@ fec_timeout(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
 
+	fec_dump(ndev);
+
 	ndev->stats.tx_errors++;
 
 	fep->delay_work.timeout = true;
@@ -751,34 +786,28 @@ static void fec_enet_work(struct work_struct *work)
 static void
 fec_enet_tx(struct net_device *ndev)
 {
-	struct	fec_enet_private *fep;
-	struct bufdesc *bdp;
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	union bufdesc_u *bdp;
 	unsigned short status;
 	struct	sk_buff	*skb;
-	int	index = 0;
-
-	fep = netdev_priv(ndev);
-	bdp = fep->dirty_tx;
+	unsigned index = fep->tx_dirty;
 
-	/* get next bdp of dirty_tx */
-	bdp = fec_enet_get_nextdesc(bdp, fep);
+	do {
+		index = (index + 1) & (fep->tx_ring_size - 1);
+		bdp = fec_enet_tx_get(index, fep);
 
-	while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
+		status = bdp->bd.cbd_sc;
+		if (status & BD_ENET_TX_READY)
+			break;
 
 		/* current queue is empty */
-		if (bdp == fep->cur_tx)
+		if (index == fep->tx_next)
 			break;
 
-		if (fep->bufdesc_ex)
-			index = (struct bufdesc_ex *)bdp -
-				(struct bufdesc_ex *)fep->tx_bd_base;
-		else
-			index = bdp - fep->tx_bd_base;
+		fec_enet_tx_unmap(&bdp->bd, fep);
 
 		skb = fep->tx_skbuff[index];
-		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
-				DMA_TO_DEVICE);
-		bdp->cbd_bufaddr = 0;
+		fep->tx_skbuff[index] = NULL;
 
 		/* Check for errors. */
 		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
@@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev)
 				ndev->stats.tx_carrier_errors++;
 		} else {
 			ndev->stats.tx_packets++;
-			ndev->stats.tx_bytes += bdp->cbd_datlen;
+			ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
 		}
 
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
 			fep->bufdesc_ex) {
 			struct skb_shared_hwtstamps shhwtstamps;
 			unsigned long flags;
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
 
 			memset(&shhwtstamps, 0, sizeof(shhwtstamps));
 			spin_lock_irqsave(&fep->tmreg_lock, flags);
 			shhwtstamps.hwtstamp = ns_to_ktime(
-				timecounter_cyc2time(&fep->tc, ebdp->ts));
+				timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
 			spin_unlock_irqrestore(&fep->tmreg_lock, flags);
 			skb_tstamp_tx(skb, &shhwtstamps);
 		}
@@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev)
 
 		/* Free the sk buffer associated with this last transmit */
 		dev_kfree_skb_any(skb);
-		fep->tx_skbuff[index] = NULL;
-
-		fep->dirty_tx = bdp;
-
-		/* Update pointer to next buffer descriptor to be transmitted */
-		bdp = fec_enet_get_nextdesc(bdp, fep);
 
 		/* Since we have freed up a buffer, the ring is no longer full
 		 */
-		if (fep->dirty_tx != fep->cur_tx) {
-			if (netif_queue_stopped(ndev))
-				netif_wake_queue(ndev);
+		if (netif_queue_stopped(ndev)) {
+
+
+
+
+
+			netif_wake_queue(ndev);
+
 		}
-	}
+
+		fep->tx_dirty = index;
+	} while (1);
 	return;
 }
 
 
-/* During a receive, the cur_rx points to the current incoming buffer.
+static void
+fec_enet_receive(struct sk_buff *skb, union bufdesc_u *bdp, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+
+	skb->protocol = eth_type_trans(skb, ndev);
+
+	/* Get receive timestamp from the skb */
+	if (fep->hwts_rx_en && fep->bufdesc_ex) {
+		struct skb_shared_hwtstamps *shhwtstamps =
+						    skb_hwtstamps(skb);
+		unsigned long flags;
+
+		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+
+		spin_lock_irqsave(&fep->tmreg_lock, flags);
+		shhwtstamps->hwtstamp = ns_to_ktime(
+		    timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
+		spin_unlock_irqrestore(&fep->tmreg_lock, flags);
+	}
+
+	if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) {
+		if (!(bdp->ebd.cbd_esc & FLAG_RX_CSUM_ERROR)) {
+			/* don't check it */
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+		} else {
+			skb_checksum_none_assert(skb);
+		}
+	}
+
+	napi_gro_receive(&fep->napi, skb);
+}
+
+static void noinline
+fec_enet_receive_copy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	struct sk_buff *skb;
+	unsigned char *data;
+	bool vlan_packet_rcvd = false;
+
+	/*
+	 * Detect the presence of the VLAN tag, and adjust
+	 * the packet length appropriately.
+	 */
+	if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
+	    bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
+		pkt_len -= VLAN_HLEN;
+		vlan_packet_rcvd = true;
+	}
+
+	/* This does 16 byte alignment, exactly what we need. */
+	skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN);
+	if (unlikely(!skb)) {
+		ndev->stats.rx_dropped++;
+		return;
+	}
+
+	dma_sync_single_for_cpu(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
+				FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+
+	data = fep->rx_skbuff[index]->data;
+
+#ifndef CONFIG_M5272
+	/*
+	 * If we have enabled this feature, we need to discard
+	 * the two bytes at the beginning of the packet before
+	 * copying it.
+	 */
+	if (fep->quirks & FEC_QUIRK_RX_SHIFT16) {
+		pkt_len -= 2;
+		data += 2;
+	}
+#endif
+
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(data, pkt_len);
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, pkt_len);	/* Make room */
+
+	/* If this is a VLAN packet remove the VLAN Tag */
+	if (vlan_packet_rcvd) {
+		struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+				       ntohs(vlan->h_vlan_TCI));
+
+		/* Extract the frame data without the VLAN header. */
+		skb_copy_to_linear_data(skb, data, 2 * ETH_ALEN);
+		skb_copy_to_linear_data_offset(skb, 2 * ETH_ALEN,
+					       data + 2 * ETH_ALEN + VLAN_HLEN,
+					       pkt_len - 2 * ETH_ALEN);
+	} else {
+		skb_copy_to_linear_data(skb, data, pkt_len);
+	}
+
+	dma_sync_single_for_device(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
+				   FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+
+	fec_enet_receive(skb, bdp, ndev);
+}
+
+static void noinline
+fec_enet_receive_nocopy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	struct sk_buff *skb, *skb_new;
+	unsigned char *data;
+	dma_addr_t addr;
+
+#if 0
+	skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
+	if (!skb_new) {
+		ndev->stats.rx_dropped++;
+		return;
+	}
+
+	addr = dma_map_single(&fep->pdev->dev, skb_new->data,
+			      FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+	if (dma_mapping_error(&fep->pdev->dev, addr)) {
+		dev_kfree_skb(skb_new);
+		ndev->stats.rx_dropped++;
+		return;
+	}
+#else
+	skb_new = NULL;
+	addr = 0;
+#endif
+
+	/*
+	 * We have the new skb, so proceed to deal with the
+	 * received data.
+	 */
+	dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
+			 FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+
+	skb = fep->rx_skbuff[index];
+
+	/* Now subsitute in the new skb */
+	fep->rx_skbuff[index] = skb_new;
+	bdp->bd.cbd_bufaddr = addr;
+
+	/*
+	 * Update the skb length according to the raw packet
+	 * length.  Then remove the two bytes of additional
+	 * padding.
+	 */
+	skb_put(skb, pkt_len);
+	data = skb_pull_inline(skb, 2);
+
+	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(data, skb->len);
+
+	/*
+	 * Now juggle things for the VLAN tag - if the hardware
+	 * flags this as present, we need to read the tag, and
+	 * then shuffle the ethernet addresses up.
+	 */
+	if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
+	    bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
+		struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+				       ntohs(vlan->h_vlan_TCI));
+
+		memmove(data + VLAN_HLEN, data, 2 * ETH_ALEN);
+		skb_pull_inline(skb, VLAN_HLEN);
+	}
+
+	fec_enet_receive(skb, bdp, ndev);
+}
+
+static int
+fec_enet_refill_ring(unsigned first, unsigned last, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned i = first;
+
+	do {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
+		struct sk_buff *skb;
+		dma_addr_t addr;
+
+		if (!fep->rx_skbuff[i]) {
+			skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
+			if (!skb)
+				return -ENOMEM;
+
+			addr = dma_map_single(&fep->pdev->dev, skb->data,
+					      FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+			if (dma_mapping_error(&fep->pdev->dev, addr)) {
+				dev_kfree_skb(skb);
+				return -ENOMEM;
+			}
+
+			fep->rx_skbuff[i] = skb;
+			bdp->bd.cbd_bufaddr = addr;
+		}
+
+		bdp->bd.cbd_sc = (bdp->bd.cbd_sc & BD_SC_WRAP) |
+				 BD_ENET_RX_EMPTY;
+
+		if (fep->bufdesc_ex) {
+			bdp->ebd.cbd_esc = BD_ENET_RX_INT;
+			bdp->ebd.cbd_prot = 0;
+			bdp->ebd.cbd_bdu = 0;
+		}
+		i = (i + 1) & (fep->rx_ring_size - 1);
+	} while (i != last);
+
+	return 0;
+}
+
+/* During a receive, the rx_next points to the current incoming buffer.
  * When we update through the ring, if the next incoming buffer has
  * not been given to the system, we just set the empty indicator,
  * effectively tossing the packet.
  */
-static int
+static int noinline
 fec_enet_rx(struct net_device *ndev, int budget)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *bdp;
 	unsigned short status;
-	struct	sk_buff	*skb;
-	ushort	pkt_len;
-	__u8 *data;
+	unsigned pkt_len;
 	int	pkt_received = 0;
-	struct	bufdesc_ex *ebdp = NULL;
-	bool	vlan_packet_rcvd = false;
-	u16	vlan_tag;
-	int	index = 0;
+	unsigned index = fep->rx_next;
 
 #ifdef CONFIG_M532x
 	flush_cache_all();
@@ -872,12 +1107,16 @@ fec_enet_rx(struct net_device *ndev, int budget)
 	/* First, grab all of the stats for the incoming packet.
 	 * These get messed up if we get called due to a busy condition.
 	 */
-	bdp = fep->cur_rx;
+	do {
+		union bufdesc_u *bdp = fec_enet_rx_get(index, fep);
 
-	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
+		status = bdp->bd.cbd_sc;
+		if (status & BD_ENET_RX_EMPTY)
+			break;
 
 		if (pkt_received >= budget)
 			break;
+
 		pkt_received++;
 
 		/* Since we have allocated space to hold a complete frame,
@@ -917,124 +1156,33 @@ fec_enet_rx(struct net_device *ndev, int budget)
 
 		/* Process the incoming frame. */
 		ndev->stats.rx_packets++;
-		pkt_len = bdp->cbd_datlen;
-		ndev->stats.rx_bytes += pkt_len;
-
-		if (fep->bufdesc_ex)
-			index = (struct bufdesc_ex *)bdp -
-				(struct bufdesc_ex *)fep->rx_bd_base;
-		else
-			index = bdp - fep->rx_bd_base;
-		data = fep->rx_skbuff[index]->data;
-		dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr,
-					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-
-		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
-			swap_buffer(data, pkt_len);
-
-		/* Extract the enhanced buffer descriptor */
-		ebdp = NULL;
-		if (fep->bufdesc_ex)
-			ebdp = (struct bufdesc_ex *)bdp;
-
-		/* If this is a VLAN packet remove the VLAN Tag */
-		vlan_packet_rcvd = false;
-		if ((ndev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
-		    fep->bufdesc_ex && (ebdp->cbd_esc & BD_ENET_RX_VLAN)) {
-			/* Push and remove the vlan tag */
-			struct vlan_hdr *vlan_header =
-					(struct vlan_hdr *) (data + ETH_HLEN);
-			vlan_tag = ntohs(vlan_header->h_vlan_TCI);
-			pkt_len -= VLAN_HLEN;
-
-			vlan_packet_rcvd = true;
-		}
 
-		/* This does 16 byte alignment, exactly what we need.
-		 * The packet length includes FCS, but we don't want to
-		 * include that when passing upstream as it messes up
-		 * bridging applications.
+		/*
+		 * The packet length includes FCS, but we don't want
+		 * to include that when passing upstream as it messes
+		 * up bridging applications.
 		 */
-		skb = netdev_alloc_skb(ndev, pkt_len - 4 + NET_IP_ALIGN);
+		pkt_len = bdp->bd.cbd_datlen - 4;
+		ndev->stats.rx_bytes += pkt_len;
 
-		if (unlikely(!skb)) {
-			ndev->stats.rx_dropped++;
+		if (fec_enet_rx_zerocopy(fep, pkt_len)) {
+			fec_enet_receive_nocopy(pkt_len, index, bdp, ndev);
 		} else {
-			int payload_offset = (2 * ETH_ALEN);
-			skb_reserve(skb, NET_IP_ALIGN);
-			skb_put(skb, pkt_len - 4);	/* Make room */
-
-			/* Extract the frame data without the VLAN header. */
-			skb_copy_to_linear_data(skb, data, (2 * ETH_ALEN));
-			if (vlan_packet_rcvd)
-				payload_offset = (2 * ETH_ALEN) + VLAN_HLEN;
-			skb_copy_to_linear_data_offset(skb, (2 * ETH_ALEN),
-						       data + payload_offset,
-						       pkt_len - 4 - (2 * ETH_ALEN));
-
-			skb->protocol = eth_type_trans(skb, ndev);
-
-			/* Get receive timestamp from the skb */
-			if (fep->hwts_rx_en && fep->bufdesc_ex) {
-				struct skb_shared_hwtstamps *shhwtstamps =
-							    skb_hwtstamps(skb);
-				unsigned long flags;
-
-				memset(shhwtstamps, 0, sizeof(*shhwtstamps));
-
-				spin_lock_irqsave(&fep->tmreg_lock, flags);
-				shhwtstamps->hwtstamp = ns_to_ktime(
-				    timecounter_cyc2time(&fep->tc, ebdp->ts));
-				spin_unlock_irqrestore(&fep->tmreg_lock, flags);
-			}
-
-			if (fep->bufdesc_ex &&
-			    (fep->csum_flags & FLAG_RX_CSUM_ENABLED)) {
-				if (!(ebdp->cbd_esc & FLAG_RX_CSUM_ERROR)) {
-					/* don't check it */
-					skb->ip_summed = CHECKSUM_UNNECESSARY;
-				} else {
-					skb_checksum_none_assert(skb);
-				}
-			}
-
-			/* Handle received VLAN packets */
-			if (vlan_packet_rcvd)
-				__vlan_hwaccel_put_tag(skb,
-						       htons(ETH_P_8021Q),
-						       vlan_tag);
-
-			napi_gro_receive(&fep->napi, skb);
+			fec_enet_receive_copy(pkt_len, index, bdp, ndev);
 		}
 
-		dma_sync_single_for_device(&fep->pdev->dev, bdp->cbd_bufaddr,
-					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
 rx_processing_done:
-		/* Clear the status flags for this buffer */
-		status &= ~BD_ENET_RX_STATS;
-
-		/* Mark the buffer empty */
-		status |= BD_ENET_RX_EMPTY;
-		bdp->cbd_sc = status;
-
-		if (fep->bufdesc_ex) {
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-
-			ebdp->cbd_esc = BD_ENET_RX_INT;
-			ebdp->cbd_prot = 0;
-			ebdp->cbd_bdu = 0;
-		}
-
-		/* Update BD pointer to next entry */
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+		index = (index + 1) & (fep->rx_ring_size - 1);
+		if (index == fep->rx_next)
+			break;
+	} while (1);
 
-		/* Doing this here will keep the FEC running while we process
-		 * incoming frames.  On a heavily loaded network, we should be
-		 * able to keep up at the expense of system resources.
-		 */
+	if (pkt_received) {
+		fec_enet_refill_ring(fep->rx_next, index, ndev);
 		writel(0, fep->hwp + FEC_R_DES_ACTIVE);
 	}
-	fep->cur_rx = bdp;
+
+	fep->rx_next = index;
 
 	return pkt_received;
 }
@@ -1044,29 +1192,25 @@ fec_enet_interrupt(int irq, void *dev_id)
 {
 	struct net_device *ndev = dev_id;
 	struct fec_enet_private *fep = netdev_priv(ndev);
+	const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF;
 	uint int_events;
 	irqreturn_t ret = IRQ_NONE;
 
-	do {
-		int_events = readl(fep->hwp + FEC_IEVENT);
-		writel(int_events, fep->hwp + FEC_IEVENT);
+	int_events = readl(fep->hwp + FEC_IEVENT);
+	writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT);
 
-		if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
-			ret = IRQ_HANDLED;
+	if (int_events & napi_mask) {
+		ret = IRQ_HANDLED;
 
-			/* Disable the RX interrupt */
-			if (napi_schedule_prep(&fep->napi)) {
-				writel(FEC_RX_DISABLED_IMASK,
-					fep->hwp + FEC_IMASK);
-				__napi_schedule(&fep->napi);
-			}
-		}
+		/* Disable the NAPI interrupts */
+		writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
+		napi_schedule(&fep->napi);
+	}
 
-		if (int_events & FEC_ENET_MII) {
-			ret = IRQ_HANDLED;
-			complete(&fep->mdio_done);
-		}
-	} while (int_events);
+	if (int_events & FEC_ENET_MII) {
+		ret = IRQ_HANDLED;
+		complete(&fep->mdio_done);
+	}
 
 	return ret;
 }
@@ -1074,10 +1218,24 @@ fec_enet_interrupt(int irq, void *dev_id)
 static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 {
 	struct net_device *ndev = napi->dev;
-	int pkts = fec_enet_rx(ndev, budget);
 	struct fec_enet_private *fep = netdev_priv(ndev);
+	unsigned status;
+	int pkts = 0;
+
+	status = readl(fep->hwp + FEC_IEVENT) & (FEC_ENET_RXF | FEC_ENET_TXF);
+	if (status) {
+		/*
+		 * Clear any pending transmit or receive interrupts before
+		 * processing the rings to avoid racing with the hardware.
+		 */
+		writel(status, fep->hwp + FEC_IEVENT);
 
-	fec_enet_tx(ndev);
+		if (status & FEC_ENET_RXF)
+			pkts = fec_enet_rx(ndev, budget);
+
+		if (status & FEC_ENET_TXF)
+			fec_enet_tx(ndev);
+	}
 
 	if (pkts < budget) {
 		napi_complete(napi);
@@ -1263,8 +1421,6 @@ static int fec_enet_mdio_reset(struct mii_bus *bus)
 static int fec_enet_mii_probe(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	struct phy_device *phy_dev = NULL;
 	char mdio_bus_id[MII_BUS_ID_SIZE];
 	char phy_name[MII_BUS_ID_SIZE + 3];
@@ -1302,7 +1458,7 @@ static int fec_enet_mii_probe(struct net_device *ndev)
 	}
 
 	/* mask with MAC supported features */
-	if (id_entry->driver_data & FEC_QUIRK_HAS_GBIT) {
+	if (fep->quirks & FEC_QUIRK_HAS_GBIT) {
 		phy_dev->supported &= PHY_GBIT_FEATURES;
 #if !defined(CONFIG_M5272)
 		phy_dev->supported |= SUPPORTED_Pause;
@@ -1329,8 +1485,6 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	static struct mii_bus *fec0_mii_bus;
 	struct net_device *ndev = platform_get_drvdata(pdev);
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
 	int err = -ENXIO, i;
 
 	/*
@@ -1349,7 +1503,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	 * mdio interface in board design, and need to be configured by
 	 * fec0 mii_bus.
 	 */
-	if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
+	if ((fep->quirks & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
 		/* fec1 uses fec0 mii_bus */
 		if (mii_cnt && fec0_mii_bus) {
 			fep->mii_bus = fec0_mii_bus;
@@ -1370,7 +1524,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	 * document.
 	 */
 	fep->phy_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ahb), 5000000);
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
+	if (fep->quirks & FEC_QUIRK_ENET_MAC)
 		fep->phy_speed--;
 	fep->phy_speed <<= 1;
 	writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
@@ -1405,7 +1559,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	mii_cnt++;
 
 	/* save fec0 mii_bus */
-	if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
+	if (fep->quirks & FEC_QUIRK_ENET_MAC)
 		fec0_mii_bus = fep->mii_bus;
 
 	return 0;
@@ -1694,23 +1848,24 @@ static void fec_enet_free_buffers(struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	unsigned int i;
 	struct sk_buff *skb;
-	struct bufdesc	*bdp;
 
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
-		skb = fep->rx_skbuff[i];
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
 
-		if (bdp->cbd_bufaddr)
-			dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
+		skb = fep->rx_skbuff[i];
+		if (skb) {
+			dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
 					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-		if (skb)
 			dev_kfree_skb(skb);
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+		}
 	}
 
-	bdp = fep->tx_bd_base;
-	for (i = 0; i < fep->tx_ring_size; i++)
+	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
+		if (bdp->bd.cbd_bufaddr)
+			fec_enet_tx_unmap(&bdp->bd, fep);
 		kfree(fep->tx_bounce[i]);
+	}
 }
 
 static int fec_enet_alloc_buffers(struct net_device *ndev)
@@ -1718,58 +1873,54 @@ static int fec_enet_alloc_buffers(struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	unsigned int i;
 	struct sk_buff *skb;
-	struct bufdesc	*bdp;
 
-	bdp = fep->rx_bd_base;
 	for (i = 0; i < fep->rx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
+		dma_addr_t addr;
+
 		skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
 		if (!skb) {
 			fec_enet_free_buffers(ndev);
 			return -ENOMEM;
 		}
-		fep->rx_skbuff[i] = skb;
 
-		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data,
-				FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
+		addr = dma_map_single(&fep->pdev->dev, skb->data,
+				      FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
+		if (dma_mapping_error(&fep->pdev->dev, addr)) {
+			dev_kfree_skb(skb);
 			fec_enet_free_buffers(ndev);
 			if (net_ratelimit())
 				netdev_err(ndev, "Rx DMA memory map failed\n");
 			return -ENOMEM;
 		}
-		bdp->cbd_sc = BD_ENET_RX_EMPTY;
 
-		if (fep->bufdesc_ex) {
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-			ebdp->cbd_esc = BD_ENET_RX_INT;
-		}
+		fep->rx_skbuff[i] = skb;
 
-		bdp = fec_enet_get_nextdesc(bdp, fep);
-	}
+		bdp->bd.cbd_bufaddr = addr;
+		bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
+		/* Set the last buffer to wrap. */
+		if (i == fep->rx_ring_size - 1)
+			bdp->bd.cbd_sc |= BD_SC_WRAP;
 
-	/* Set the last buffer to wrap. */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
+		if (fep->bufdesc_ex)
+			bdp->ebd.cbd_esc = BD_ENET_RX_INT;
+	}
 
-	bdp = fep->tx_bd_base;
 	for (i = 0; i < fep->tx_ring_size; i++) {
+		union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
 		fep->tx_bounce[i] = kmalloc(FEC_ENET_TX_FRSIZE, GFP_KERNEL);
 
-		bdp->cbd_sc = 0;
-		bdp->cbd_bufaddr = 0;
-
-		if (fep->bufdesc_ex) {
-			struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-			ebdp->cbd_esc = BD_ENET_TX_INT;
-		}
+		/* Set the last buffer to wrap. */
+		if (i == fep->tx_ring_size - 1)
+			bdp->bd.cbd_sc = BD_SC_WRAP;
+		else
+			bdp->bd.cbd_sc = 0;
+		bdp->bd.cbd_bufaddr = 0;
 
-		bdp = fec_enet_get_nextdesc(bdp, fep);
+		if (fep->bufdesc_ex)
+			bdp->ebd.cbd_esc = BD_ENET_TX_INT;
 	}
 
-	/* Set the last buffer to wrap. */
-	bdp = fec_enet_get_prevdesc(bdp, fep);
-	bdp->cbd_sc |= BD_SC_WRAP;
-
 	return 0;
 }
 
@@ -1990,9 +2141,7 @@ static const struct net_device_ops fec_netdev_ops = {
 static int fec_enet_init(struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
-	const struct platform_device_id *id_entry =
-				platform_get_device_id(fep->pdev);
-	struct bufdesc *cbd_base;
+	union bufdesc_u *cbd_base;
 
 	/* Allocate memory for buffer descriptors. */
 	cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma,
@@ -2014,10 +2163,11 @@ static int fec_enet_init(struct net_device *ndev)
 	/* Set receive and transmit descriptor base. */
 	fep->rx_bd_base = cbd_base;
 	if (fep->bufdesc_ex)
-		fep->tx_bd_base = (struct bufdesc *)
-			(((struct bufdesc_ex *)cbd_base) + fep->rx_ring_size);
+		fep->tx_bd_base = (union bufdesc_u *)
+			(&cbd_base->ebd + fep->rx_ring_size);
 	else
-		fep->tx_bd_base = cbd_base + fep->rx_ring_size;
+		fep->tx_bd_base = (union bufdesc_u *)
+			(&cbd_base->bd + fep->rx_ring_size);
 
 	/* The FEC Ethernet specific entries in the device structure */
 	ndev->watchdog_timeo = TX_TIMEOUT;
@@ -2027,19 +2177,24 @@ static int fec_enet_init(struct net_device *ndev)
 	writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK);
 	netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
 
-	if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) {
-		/* enable hw VLAN support */
-		ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
-		ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
-	}
+	if (fep->bufdesc_ex) {
+		/* Features which require the enhanced buffer descriptors */
+		netdev_features_t features = 0;
 
-	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
-		/* enable hw accelerator */
-		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM);
-		ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM);
-		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
+		if (fep->quirks & FEC_QUIRK_HAS_VLAN) {
+			/* enable hw VLAN support */
+			features |= NETIF_F_HW_VLAN_CTAG_RX;
+		}
+
+		if (fep->quirks & FEC_QUIRK_HAS_CSUM) {
+			/* enable hw accelerator */
+			features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+				    NETIF_F_RXCSUM;
+			fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
+		}
+
+		ndev->hw_features |= features;
+		ndev->features |= features;
 	}
 
 	fec_restart(ndev, 0);
@@ -2110,13 +2265,6 @@ fec_probe(struct platform_device *pdev)
 	/* setup board info structure */
 	fep = netdev_priv(ndev);
 
-#if !defined(CONFIG_M5272)
-	/* default enable pause frame auto negotiation */
-	if (pdev->id_entry &&
-	    (pdev->id_entry->driver_data & FEC_QUIRK_HAS_GBIT))
-		fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
-#endif
-
 	r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	fep->hwp = devm_ioremap_resource(&pdev->dev, r);
 	if (IS_ERR(fep->hwp)) {
@@ -2126,6 +2274,14 @@ fec_probe(struct platform_device *pdev)
 
 	fep->pdev = pdev;
 	fep->dev_id = dev_id++;
+	if (pdev->id_entry)
+		fep->quirks = pdev->id_entry->driver_data;
+
+#if !defined(CONFIG_M5272)
+	/* default enable pause frame auto negotiation */
+	if (fep->quirks & FEC_QUIRK_HAS_GBIT)
+		fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
+#endif
 
 	fep->bufdesc_ex = 0;
 
@@ -2160,8 +2316,7 @@ fec_probe(struct platform_device *pdev)
 		fep->clk_enet_out = NULL;
 
 	fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp");
-	fep->bufdesc_ex =
-		pdev->id_entry->driver_data & FEC_QUIRK_HAS_BUFDESC_EX;
+	fep->bufdesc_ex = fep->quirks & FEC_QUIRK_HAS_BUFDESC_EX;
 	if (IS_ERR(fep->clk_ptp)) {
 		fep->clk_ptp = NULL;
 		fep->bufdesc_ex = 0;


-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-19 22:51         ` Russell King - ARM Linux
@ 2014-03-19 23:05           ` Marek Vasut
  2014-03-20  4:01             ` Marek Vasut
  2014-03-21 15:50           ` PL310 errata workarounds Frank Li
  1 sibling, 1 reply; 76+ messages in thread
From: Marek Vasut @ 2014-03-19 23:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:
> On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:

[...]

> > Speaking of FEC and slightly off-topic, have you ever seen this on your
> > box [1]/[2]/[3] ? I wonder if this might be cache-related as well, since
> > I saw similar issue on MX6 with PCIe-connected ethernet. I cannot put a
> > finger on this though.
> 
> I think I've seen something similar once or twice.

Nice, thanks for confirming this.

> Let me just pull up the transmit function.  This is the code which
> writes to the descriptors (which frankly is pretty horrid):

Yes.

>         bdp->cbd_datlen = skb->len;
>         bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
>                         skb->len, DMA_TO_DEVICE); <-- barrier inside
>                 ebdp->cbd_bdu = 0;
>                         ebdp->cbd_esc = BD_ENET_TX_INT;
>                                 ebdp->cbd_esc |= BD_ENET_TX_PINS;
>         bdp->cbd_sc = status;
>         writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write
> 
> A couple of points here:
> 
> 1. The hardware operates in a ring - it can read the next ring if it
>    sees the BD_ENET_TX_READY bit set if has just finished processing
>    the previous descriptor - this can happen before the write to
>    FEC_X_DES_ACTIVE.
> 
> 2. The ARM can re-order writes.  The writes it can re-order are:
> 
> 	bdp->cbd_bufaddr
> 	ebdp->cbd_bdu
> 	ebdp->cbd_esc
> 	ebdp->cbd_sc
> 
> Hence, it's entirely possible for the FEC to see the updated descriptor
> status before the rest of the descriptor has been written.  What's missing
> is a barrier between the descriptor writes, and the final write to
> bdp->cbd_sc.

I think there might even be another thing at play here, but please correct me if 
I am wrong. If you look at fb8ef788680d48523321e5f150b23700a1caf980 , this patch 
actually has no functional impact on the driver , right ?

It does no functional change at all, but it changed something for Fugang. This 
makes me also believe there might clearly be something afoot with the DMA 
descriptors. Reading Documentation/DMA-API.txt , I noticed the ring with the 
descriptors is allocated with dma_alloc_coherent() . Yet, DMA-API.txt , section 
I, states:

"
 29 Consistent memory is memory for which a write by either the device or
 30 the processor can immediately be read by the processor or device
 31 without having to worry about caching effects.  (You may however need
 32 to make sure to flush the processor's write buffers before telling
 33 devices to read that memory.)
"

So the writes into the descriptor from the CPU might even be stuck in the CPU's 
write buffer when this code below from fec_main.c is called:

 438         /* Trigger transmission start */
 439         writel(0, fep->hwp + FEC_X_DES_ACTIVE);

Does this also make sense or is my idea completely flawed please ?

> Had I not got distracted by the L2 issues, I'd have posted my FEC patches
> by now... in any case, my current xmit function looks a little different -
> I've organised it such that:
> 
> 1. We don't modify anything until we're past the point where things can
>    error out.
> 
> 2. Writes to the descriptor are localised in one area.
> 
> 3. There's a wmb() barrier between cbd_sc and the previous writes - I
>    discussed this with Will Deacon, which resulted in this documentation
>    for the barrier:
> 
>         /*
>          * We need the preceding stores to the descriptor to complete
>          * before updating the status field, which hands it over to the
>          * hardware.  The corresponding rmb() is "in the hardware".
>          */
> 
> The second thing that causes the transmit timeouts is the horrid way
> NAPI has been added to the driver - it's racy.  NAPI itself isn't the
> problem, it's this (compressed a bit to show only the relevant bits):

I fully agree there are multiple race conditions in the driver as it is now.

[...]

> That all said - with the patch below I haven't seen problems since.
> (which is the quickest way I can think of to get you a copy of what
> I'm presently running - I've killed a number of debug bits denoted by
> all the blank lines - and this is against -rc7.)  You may notice that
> I added some TX ring dumping code to the driver - always useful in
> these situations. ;-)

I will test this ASAP on multiple MX6es and let you know of the result.

> This patch is of course the consolidated version: individually, this
> would be at least 19 patches with nice commit messages describing
> each change...

Please keep me on CC of those, I'd be very interested to test them.

> As far as the 600Mbps receive - you need the right conditions for that.
> I select the performance cpufreq governor after boot, and let the boot
> quiesce.  It doesn't take much for it to drop back to 460Mbps - another
> running process other than iperf -s is sufficient to do that.
> 
> Let me know how you get on with this.

Roger that, will do, thank you very much !

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-19 23:05           ` FEC ethernet issues [Was: PL310 errata workarounds] Marek Vasut
@ 2014-03-20  4:01             ` Marek Vasut
  2014-03-20 22:27               ` Fabio Estevam
  0 siblings, 1 reply; 76+ messages in thread
From: Marek Vasut @ 2014-03-20  4:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday, March 20, 2014 at 12:05:56 AM, Marek Vasut wrote:
> On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:

[...]

> > As far as the 600Mbps receive - you need the right conditions for that.
> > I select the performance cpufreq governor after boot, and let the boot
> > quiesce.  It doesn't take much for it to drop back to 460Mbps - another
> > running process other than iperf -s is sufficient to do that.
> > 
> > Let me know how you get on with this.
> 
> Roger that, will do, thank you very much !

I did some test and I observe no dips in the speed, nice work. Thanks !

I will try to get feedback from more boards and get back to you with the 
results.

Best regards,
Marek Vasut

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-20  4:01             ` Marek Vasut
@ 2014-03-20 22:27               ` Fabio Estevam
  2014-03-20 23:06                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Fabio Estevam @ 2014-03-20 22:27 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On Thu, Mar 20, 2014 at 1:01 AM, Marek Vasut <marex@denx.de> wrote:
> On Thursday, March 20, 2014 at 12:05:56 AM, Marek Vasut wrote:
>> On Wednesday, March 19, 2014 at 11:51:38 PM, Russell King - ARM Linux wrote:
>
> [...]
>
>> > As far as the 600Mbps receive - you need the right conditions for that.
>> > I select the performance cpufreq governor after boot, and let the boot
>> > quiesce.  It doesn't take much for it to drop back to 460Mbps - another
>> > running process other than iperf -s is sufficient to do that.
>> >
>> > Let me know how you get on with this.
>>
>> Roger that, will do, thank you very much !
>
> I did some test and I observe no dips in the speed, nice work. Thanks !
>
> I will try to get feedback from more boards and get back to you with the
> results.

Thanks for the patch.

Robert Daniels on Cc tried it on mx53 and sent the feedback below.

Thanks

"I tried the patch and it seems a little better but still has problems.
When I first started the test I saw dropped packets.  After about 2 minutes
I got the pause again with the accompanying kernel backtrace.  It looks
like I got some more debug information from the backtrace however (I think
that was part of the patch.)

This is what it looks like now:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0
at /home/robertd/Development/IC/
Dev/BoardSupport/ic-ii/linux-mainline/net/sched/sch_generic.c:264
 dev_watchdo
g+0x288/0x2ac()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc6+ #5
Backtrace:
[<800121bc>] (dump_backtrace) from [<800124a0>] (show_stack+0x18/0x1c)
 r6:8051f29c r5:00000000 r4:808edb9c r3:00000000
[<80012488>] (show_stack) from [<80652098>] (dump_stack+0x84/0x9c)
[<80652014>] (dump_stack) from [<80027d1c>] (warn_slowpath_common
+0x70/0x94)
 r5:00000009 r4:808c9d60
[<80027cac>] (warn_slowpath_common) from [<80027d78>] (warn_slowpath_fmt
+0x38/0x40)
 r8:ded37b40 r7:808c8000 r6:ded37b00 r5:dec34000 r4:00000000
[<80027d44>] (warn_slowpath_fmt) from [<8051f29c>] (dev_watchdog
+0x288/0x2ac)
 r3:dec34000 r2:808335f0
[<8051f014>] (dev_watchdog) from [<80031e68>] (call_timer_fn+0x74/0xf4)
 r10:80031df4 r9:dec34000 r8:8051f014 r7:808c8000 r6:00000100 r5:808c8000
 r4:808c9dd0
[<80031df4>] (call_timer_fn) from [<80032598>] (run_timer_softirq
+0x19c/0x234)
 r10:8051f014 r9:dec34000 r8:00200200 r7:00000000 r6:808c9e20 r5:80929fc0
 r4:dec34284
[<800323fc>] (run_timer_softirq) from [<8002c2f4>] (__do_softirq
+0x110/0x2b4)
 r10:00000100 r9:00000001 r8:40000001 r7:808c8000 r6:808ca080 r5:808ca084
 r4:00000000
[<8002c1e4>] (__do_softirq) from [<8002c7ac>] (irq_exit+0xb8/0x10c)
 r10:8065c4cc r9:00000001 r8:00000000 r7:00000037 r6:808c8000 r5:808c5fe8
 r4:808c8028
[<8002c6f4>] (irq_exit) from [<8000f2a0>] (handle_IRQ+0x5c/0xbc)
 r5:808c5fe8 r4:808d0d24
[<8000f244>] (handle_IRQ) from [<80008590>] (tzic_handle_irq+0x78/0xa8)
 r8:808c9f10 r7:00000001 r6:00000020 r5:80928fd8 r4:00000000 r3:00000080
[<80008518>] (tzic_handle_irq) from [<800130a4>] (__irq_svc+0x44/0x5c)
Exception stack(0x808c9f10 to 0x808c9f58)
9f00:                                     00000001 00000001 00000000
808d3e70
9f20: 808c8000 808d099c 808d0938 8092837d 00000000 808c8000 8065c4cc
808c9f64
9f40: 808c9f28 808c9f58 800638e0 8000f674 20000013 ffffffff
 r9:808c8000 r8:00000000 r7:808c9f44 r6:ffffffff r5:20000013 r4:8000f674
[<8000f64c>] (arch_cpu_idle) from [<8006e874>] (cpu_startup_entry
+0x108/0x160)
[<8006e76c>] (cpu_startup_entry) from [<8064ce8c>] (rest_init+0xb4/0xdc)
 r7:808b8360
[<8064cdd8>] (rest_init) from [<80879b58>] (start_kernel+0x328/0x38c)
 r6:ffffffff r5:808d0880 r4:808d0a30
[<80879830>] (start_kernel) from [<70008074>] (0x70008074)
---[ end trace 18ea3643c6be04df ]---
fec 63fec000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0xce485000  106 de5db480
 1    0x1c00 0xce485800  106 de5b7540
 2    0x1c00 0xce486000  106 de5b7300
 3    0x1c00 0xce486800  106 de5b73c0
 4    0x1c00 0xce487000  106 de5b7e40
 5    0x1c00 0xce487800  106 de5b7840
 6    0x1c00 0xce500000  106 de5b7b40
 7    0x1c00 0xce500800  106 de5b7600
 8    0x1c00 0xce501000  106 de5b7d80
 9    0x1c00 0xce501800  106 de5b7180
10    0x1c00 0xce502000  106 de5b7a80
11    0x1c00 0xce502800  106 de5b7240
12    0x1c00 0xce503000  106 de5b7000
13 SH 0x1c00 0x00000000   66   (null)
14    0x9c00 0xce504000   66 de5b90c0
15    0x1c00 0xce504800   66 de5b9a80
16    0x1c00 0xce505000   66 de5b9840
17    0x1c00 0xce505800   66 de5b9000
18    0x1c00 0xce506000   66 de5b9480
19    0x1c00 0xce506800   66 de5b9b40
20    0x1c00 0xce507000   66 de5b9600
21    0x1c00 0xce507800   66 de5b9900
22    0x1c00 0xce508000   66 de5b9180
23    0x1c00 0xce508800   66 de5b9240
24    0x1c00 0xce509000   66 de5b93c0
25    0x1c00 0xce509800   66 de5b9540
26    0x1c00 0xce50a000  106 de5b9780
27    0x1c00 0xce50a800   66 de5b9d80
28    0x1c00 0xce50b000  106 de5b9cc0
29    0x1c00 0xce50b800   66 de5b99c0
30    0x1c00 0xce50c000  106 de5b9300
31    0x1c00 0xce50c800  106 de7e9900
32    0x1c00 0xce50d000   66 de7e9e40
33    0x1c00 0xce50d800  106 de7e96c0
34    0x1c00 0xce50e000   66 de7e9c00
35    0x1c00 0xce50e800  106 de7e9d80
36    0x1c00 0xce50f000  106 de7e9cc0
37    0x1c00 0xce50f800   66 de7e9f00
38    0x1c00 0xce510000  106 de7e9a80
39    0x1c00 0xce510800  106 de7e90c0
40    0x1c00 0xce511000   66 de7e9240
41    0x1c00 0xce511800  106 de7e9000
42    0x1c00 0xce512000  106 de7e9780
43    0x1c00 0xce512800   66 de7e9300
44    0x1c00 0xce513000  106 de7e9840
45    0x1c00 0xce513800  106 de5dbd80
46    0x1c00 0xce514000   66 de5dbb40
47    0x1c00 0xce514800  106 de5db780
48    0x1c00 0xce515000  106 de5db9c0
49    0x1c00 0xce515800   66 de5db600
50    0x1c00 0xce516000  106 de5db540
51    0x1c00 0xce516800  106 de5db000
52    0x1c00 0xce517000   66 de5dbc00
53    0x1c00 0xce517800  106 de5db6c0
54    0x1c00 0xce518000   66 de5db3c0
55    0x1c00 0xce518800  106 de5db300
56    0x1c00 0xce519000  106 de5dbf00
57    0x1c00 0xce519800   66 de5db0c0
58    0x1c00 0xce51a000  106 de5db900
59    0x1c00 0xce51a800  106 de5db240
60    0x1c00 0xce51b000   66 de5dbe40
61    0x1c00 0xce51b800  106 de5db840
62    0x1c00 0xce51c000  106 de5dbcc0
63    0x3c00 0xce51c800  106 de5db180
"

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-20 22:27               ` Fabio Estevam
@ 2014-03-20 23:06                 ` Russell King - ARM Linux
  2014-03-21  0:24                   ` Troy Kisky
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-20 23:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 20, 2014 at 07:27:43PM -0300, Fabio Estevam wrote:
> "I tried the patch and it seems a little better but still has problems.
> When I first started the test I saw dropped packets.  After about 2 minutes
> I got the pause again with the accompanying kernel backtrace.  It looks
> like I got some more debug information from the backtrace however (I think
> that was part of the patch.)

Thanks.  Now that we can see the ring, we can see what's going on.
Every entry with a non-NULL skb is a packet which has been queued up
for transmission, but which hasn't been reaped.  Any entry with a "SC"
value of 0x1c00 has been transmitted.

S marks the ring entry where the next packet to be transmitted would
be placed into the ring.  H points at the previous entry which was
reaped - in other words, the next packet which would be reaped if it
were transmitted would be ring entry 14.

However, this hasn't happened, because that packet is still marked for
the FEC to transmit it.  The really odd thing is, it has transmitted
the subsequent packets from 15 to 63, and 0 to 12.

The obvious question is: was the packet at 14 actually transmitted,
and has the update to the descriptor been lost somehow - that would
require a log of the last 64-ish transmitted packets, and knowledge
of what those packets should contain to really confirm - just wondering
whether we could transmit an additional byte containing the ring index
on these small packets which we could then capture.  We would have to
be certain that the capture itself hasn't dropped any packets.

I'll look into some further patches which would allow that to be
confirmed - I can't do that immediately as I've rather a lot to deal
with at the moment.

> fec 63fec000.ethernet eth0: TX ring dump
> Nr    SC     addr       len  SKB
>  0    0x1c00 0xce485000  106 de5db480
>  1    0x1c00 0xce485800  106 de5b7540
>  2    0x1c00 0xce486000  106 de5b7300
>  3    0x1c00 0xce486800  106 de5b73c0
>  4    0x1c00 0xce487000  106 de5b7e40
>  5    0x1c00 0xce487800  106 de5b7840
>  6    0x1c00 0xce500000  106 de5b7b40
>  7    0x1c00 0xce500800  106 de5b7600
>  8    0x1c00 0xce501000  106 de5b7d80
>  9    0x1c00 0xce501800  106 de5b7180
> 10    0x1c00 0xce502000  106 de5b7a80
> 11    0x1c00 0xce502800  106 de5b7240
> 12    0x1c00 0xce503000  106 de5b7000
> 13 SH 0x1c00 0x00000000   66   (null)
> 14    0x9c00 0xce504000   66 de5b90c0
> 15    0x1c00 0xce504800   66 de5b9a80
> 16    0x1c00 0xce505000   66 de5b9840
> 17    0x1c00 0xce505800   66 de5b9000
> 18    0x1c00 0xce506000   66 de5b9480
> 19    0x1c00 0xce506800   66 de5b9b40
> 20    0x1c00 0xce507000   66 de5b9600
> 21    0x1c00 0xce507800   66 de5b9900
> 22    0x1c00 0xce508000   66 de5b9180
> 23    0x1c00 0xce508800   66 de5b9240
> 24    0x1c00 0xce509000   66 de5b93c0
> 25    0x1c00 0xce509800   66 de5b9540
> 26    0x1c00 0xce50a000  106 de5b9780
> 27    0x1c00 0xce50a800   66 de5b9d80
> 28    0x1c00 0xce50b000  106 de5b9cc0
> 29    0x1c00 0xce50b800   66 de5b99c0
> 30    0x1c00 0xce50c000  106 de5b9300
> 31    0x1c00 0xce50c800  106 de7e9900
> 32    0x1c00 0xce50d000   66 de7e9e40
> 33    0x1c00 0xce50d800  106 de7e96c0
> 34    0x1c00 0xce50e000   66 de7e9c00
> 35    0x1c00 0xce50e800  106 de7e9d80
> 36    0x1c00 0xce50f000  106 de7e9cc0
> 37    0x1c00 0xce50f800   66 de7e9f00
> 38    0x1c00 0xce510000  106 de7e9a80
> 39    0x1c00 0xce510800  106 de7e90c0
> 40    0x1c00 0xce511000   66 de7e9240
> 41    0x1c00 0xce511800  106 de7e9000
> 42    0x1c00 0xce512000  106 de7e9780
> 43    0x1c00 0xce512800   66 de7e9300
> 44    0x1c00 0xce513000  106 de7e9840
> 45    0x1c00 0xce513800  106 de5dbd80
> 46    0x1c00 0xce514000   66 de5dbb40
> 47    0x1c00 0xce514800  106 de5db780
> 48    0x1c00 0xce515000  106 de5db9c0
> 49    0x1c00 0xce515800   66 de5db600
> 50    0x1c00 0xce516000  106 de5db540
> 51    0x1c00 0xce516800  106 de5db000
> 52    0x1c00 0xce517000   66 de5dbc00
> 53    0x1c00 0xce517800  106 de5db6c0
> 54    0x1c00 0xce518000   66 de5db3c0
> 55    0x1c00 0xce518800  106 de5db300
> 56    0x1c00 0xce519000  106 de5dbf00
> 57    0x1c00 0xce519800   66 de5db0c0
> 58    0x1c00 0xce51a000  106 de5db900
> 59    0x1c00 0xce51a800  106 de5db240
> 60    0x1c00 0xce51b000   66 de5dbe40
> 61    0x1c00 0xce51b800  106 de5db840
> 62    0x1c00 0xce51c000  106 de5dbcc0
> 63    0x3c00 0xce51c800  106 de5db180
> "

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-20 23:06                 ` Russell King - ARM Linux
@ 2014-03-21  0:24                   ` Troy Kisky
  2014-03-21  1:18                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Troy Kisky @ 2014-03-21  0:24 UTC (permalink / raw)
  To: linux-arm-kernel

On 3/20/2014 4:06 PM, Russell King - ARM Linux wrote:
> On Thu, Mar 20, 2014 at 07:27:43PM -0300, Fabio Estevam wrote:
>> "I tried the patch and it seems a little better but still has problems.
>> When I first started the test I saw dropped packets.  After about 2 minutes
>> I got the pause again with the accompanying kernel backtrace.  It looks
>> like I got some more debug information from the backtrace however (I think
>> that was part of the patch.)
> 
> Thanks.  Now that we can see the ring, we can see what's going on.
> Every entry with a non-NULL skb is a packet which has been queued up
> for transmission, but which hasn't been reaped.  Any entry with a "SC"
> value of 0x1c00 has been transmitted.
> 
> S marks the ring entry where the next packet to be transmitted would
> be placed into the ring.  H points at the previous entry which was
> reaped - in other words, the next packet which would be reaped if it
> were transmitted would be ring entry 14.
> 
> However, this hasn't happened, because that packet is still marked for
> the FEC to transmit it.  The really odd thing is, it has transmitted
> the subsequent packets from 15 to 63, and 0 to 12.
> 
> The obvious question is: was the packet at 14 actually transmitted,
> and has the update to the descriptor been lost somehow - that would
> require a log of the last 64-ish transmitted packets, and knowledge
> of what those packets should contain to really confirm - just wondering
> whether we could transmit an additional byte containing the ring index
> on these small packets which we could then capture.  We would have to
> be certain that the capture itself hasn't dropped any packets.

My question is, on a non-cacheable, bufferable write, can an entire 64
byte line be written ?  Since our descriptors are only 32 bytes is there
contention between the cpu queuing  the next packet and the ENET completing
the one before ?

13 SH 0x1c00 0x00000000   66   (null)
14    0x9c00 0xce504000   66 de5b90c0

Note that the problematic descriptor(14) is even. That would fit with
the above. In your testing, is it always even?


Troy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-21  0:24                   ` Troy Kisky
@ 2014-03-21  1:18                     ` Russell King - ARM Linux
  2014-03-21  1:36                       ` Fabio Estevam
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-21  1:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 20, 2014 at 05:24:07PM -0700, Troy Kisky wrote:
> My question is, on a non-cacheable, bufferable write, can an entire 64
> byte line be written ?  Since our descriptors are only 32 bytes is there
> contention between the cpu queuing  the next packet and the ENET completing
> the one before ?

Well, 64-byte lines would be covering descriptors 0,1 2,3 4,5 ... 12,13
14,15 etc.

However, it wouldn't really make sense because the store buffers in the
L310 are 32-bytes in size, and there's three of them.  Given that with
my patch the descriptor updates go through two draining of that buffer
per update - once after the address, length and other parameters are
written, then again after the status bits have been written, I think
it's unlikely that what you're suggesting could be a possibility,
unless there's some bug we don't know about.

Remember, DMA memory is "normal, non-cacheable" so there shouldn't be
any cache effects going on here - only store buffer effects which may
delay, merge and/or reorder writes, and that's why we have barriers.

> 13 SH 0x1c00 0x00000000   66   (null)
> 14    0x9c00 0xce504000   66 de5b90c0
> 
> Note that the problematic descriptor(14) is even. That would fit with
> the above. In your testing, is it always even?

Since running with my patch on aan iMX6Q, I haven't seen any timeouts,
and I've run many iperf sessions in both directions (even multiple
machines hammering the poor iMX6Q, running iperf and/or flood pings
with various packet sizes.)

What may be useful to know is which iMX device these are still happening
on, with my patch applied.  I have both iMX6Q and iMX6S, the 6Q has been
the focus of my hammering so far...

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-21  1:18                     ` Russell King - ARM Linux
@ 2014-03-21  1:36                       ` Fabio Estevam
  2014-03-21  1:43                         ` Fabio Estevam
  0 siblings, 1 reply; 76+ messages in thread
From: Fabio Estevam @ 2014-03-21  1:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 20, 2014 at 10:18 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:

> Well, 64-byte lines would be covering descriptors 0,1 2,3 4,5 ... 12,13
> 14,15 etc.
>
> However, it wouldn't really make sense because the store buffers in the
> L310 are 32-bytes in size, and there's three of them.  Given that with
> my patch the descriptor updates go through two draining of that buffer
> per update - once after the address, length and other parameters are
> written, then again after the status bits have been written, I think
> it's unlikely that what you're suggesting could be a possibility,
> unless there's some bug we don't know about.
>
> Remember, DMA memory is "normal, non-cacheable" so there shouldn't be
> any cache effects going on here - only store buffer effects which may
> delay, merge and/or reorder writes, and that's why we have barriers.
>
>> 13 SH 0x1c00 0x00000000   66   (null)
>> 14    0x9c00 0xce504000   66 de5b90c0
>>
>> Note that the problematic descriptor(14) is even. That would fit with
>> the above. In your testing, is it always even?
>
> Since running with my patch on aan iMX6Q, I haven't seen any timeouts,
> and I've run many iperf sessions in both directions (even multiple
> machines hammering the poor iMX6Q, running iperf and/or flood pings
> with various packet sizes.)
>
> What may be useful to know is which iMX device these are still happening
> on, with my patch applied.  I have both iMX6Q and iMX6S, the 6Q has been
> the focus of my hammering so far...

Robert's tests were made on a mx53 (single CortexA9), and its cache
controller is not the L310.

Regards,

Fabio Estevam

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-21  1:36                       ` Fabio Estevam
@ 2014-03-21  1:43                         ` Fabio Estevam
  2014-03-21 16:37                           ` Bernd Faust
       [not found]                           ` <CANBf9eNZB+BVBDkgWNxxGs-ndQ-mYCc6+bfVdS-8T-QLpSZ3GA@mail.gmail.com>
  0 siblings, 2 replies; 76+ messages in thread
From: Fabio Estevam @ 2014-03-21  1:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 20, 2014 at 10:36 PM, Fabio Estevam <festevam@gmail.com> wrote:

> Robert's tests were made on a mx53 (single CortexA9), and its cache
> controller is not the L310.

Ops, I meant CortexA8.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-19 22:51         ` Russell King - ARM Linux
  2014-03-19 23:05           ` FEC ethernet issues [Was: PL310 errata workarounds] Marek Vasut
@ 2014-03-21 15:50           ` Frank Li
  2014-03-21 17:12             ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: Frank Li @ 2014-03-21 15:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 19, 2014 at 5:51 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Mar 19, 2014 at 10:52:32PM +0100, Marek Vasut wrote:
>> On Tuesday, March 18, 2014 at 06:26:15 PM, Russell King - ARM Linux wrote:
>> > On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
>> > > Setting prefetch enables and early BRESP could all be done
>> > > unconditionally in the core code.
>> >
>> > I think we can do a few things here, if we know that the CPUs we're
>> > connected to are all Cortex-A9:
>> >
>> > 1. Enable BRESP.
>> >
>> > 2. Enable I+D prefetching - but we really need to tune the prefetch offset
>> >    for this to be worthwhile.  The value depends on the L3 memory system
>> >    latency, so isn't something that should be specified at the SoC level.
>> >    It may also change with different operating points.
>> >
>> > 3. Full line of zeros - I think this is a difficult one to achieve
>> > properly. The required sequence:
>> >
>> >    - enable FLZ in L2 cache
>> >    - enable L2 cache
>> >    - enable FLZ in Cortex A9
>> >
>> >    I'd also assume that when we turn the L2 cache off, we need the reverse
>> >    sequence too.  So this sequence can't be done entirely by the boot
>> > loader.
>> >
>> > With (1) enabled and (2) properly tuned, I see a performance increase of
>> > around 60Mbps on transmission, bringing the Cubox-i4 up from 250Mbps to
>> > 315Mbps transmit on its gigabit interface with cpufreq ondemand enabled.
>> > With "performance", this goes up to [323, 323, 321, 325, 322]Mbps.  On
>> > receive [446, 603, 605, 605, 601]Mbps, which hasn't really changed
>> > very much (and still impressively exceeds the Freescale stated maximum
>> > total bandwidth of the gigabit interface.)
>>
>> Speaking of FEC and slightly off-topic, have you ever seen this on your box
>> [1]/[2]/[3] ? I wonder if this might be cache-related as well, since I saw
>> similar issue on MX6 with PCIe-connected ethernet. I cannot put a finger on
>> this though.
>
> I think I've seen something similar once or twice.
>
> Let me just pull up the transmit function.  This is the code which
> writes to the descriptors (which frankly is pretty horrid):
>
>         bdp->cbd_datlen = skb->len;
>         bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
>                         skb->len, DMA_TO_DEVICE); <-- barrier inside
>                 ebdp->cbd_bdu = 0;
>                         ebdp->cbd_esc = BD_ENET_TX_INT;
>                                 ebdp->cbd_esc |= BD_ENET_TX_PINS;
>         bdp->cbd_sc = status;
>         writel(0, fep->hwp + FEC_X_DES_ACTIVE); <-- barrier before write
>
> A couple of points here:
>
> 1. The hardware operates in a ring - it can read the next ring if it
>    sees the BD_ENET_TX_READY bit set if has just finished processing
>    the previous descriptor - this can happen before the write to
>    FEC_X_DES_ACTIVE.

Correct. uDMA will start transfer when D_ENET_TX_READY bit.
I think other network MAC work similar with this one.

>
> 2. The ARM can re-order writes.  The writes it can re-order are:
>
>         bdp->cbd_bufaddr
>         ebdp->cbd_bdu
>         ebdp->cbd_esc
>         ebdp->cbd_sc
>
> Hence, it's entirely possible for the FEC to see the updated descriptor
> status before the rest of the descriptor has been written.  What's missing
> is a barrier between the descriptor writes, and the final write to
> bdp->cbd_sc.

Possible. I also consider this when debug the other issues.
Just bdp->bufaddr and bdp->buflen is important.
bdu, hardware don't care it.
esc is the same for each BD.  When software go though BD ring once,
esc will be not changed again, even though we write.

but if sc write before buffaddr and bullen, there will be issue.
Add memory barre here is better before  ebdp->cbd_sc.

I also want to add it before, But I have not found a test case to
verify it is necessary.

>
> Had I not got distracted by the L2 issues, I'd have posted my FEC patches
> by now... in any case, my current xmit function looks a little different -
> I've organised it such that:
>
> 1. We don't modify anything until we're past the point where things can
>    error out.
>
> 2. Writes to the descriptor are localised in one area.
>
> 3. There's a wmb() barrier between cbd_sc and the previous writes - I
>    discussed this with Will Deacon, which resulted in this documentation
>    for the barrier:
>
>         /*
>          * We need the preceding stores to the descriptor to complete
>          * before updating the status field, which hands it over to the
>          * hardware.  The corresponding rmb() is "in the hardware".
>          */

It is better.
>
> The second thing that causes the transmit timeouts is the horrid way
> NAPI has been added to the driver - it's racy.  NAPI itself isn't the
> problem, it's this (compressed a bit to show only the relevant bits):
>
>         do {
>                 int_events = readl(fep->hwp + FEC_IEVENT);
>                 writel(int_events, fep->hwp + FEC_IEVENT);
>                 if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
>                         if (napi_schedule_prep(&fep->napi)) {
>                                 writel(FEC_RX_DISABLED_IMASK,
>                                         fep->hwp + FEC_IMASK);
>                                 __napi_schedule(&fep->napi);
>                         }
>                 }
>                 if (int_events & FEC_ENET_MII) {
>                         complete(&fep->mdio_done);
>                 }
>         } while (int_events);
>
> Consider what happens here if:
> - we talk in the MII bus and receive a MII interrupt
> - we're just finishing NAPI processing but haven't quite got around to
>   calling napi_complete()
> - the ethernet has sent all packets, and has also raised a transmit
>   interrupt
>
> The result is the handler is entered, FEC_IEVENT contains TXF and MII
> events.  Both these events are cleared down, (and thus no longer exist
> as interrupt-causing events.)  napi_schedule_prep() returns false as
> the NAPI rx function is still running, and doesn't mark it for a re-run.
> We then do the MII interrupt.  Loop again, and int_events is zero,
> we exit.
>
> Meanwhile, the NAPI rx function calls napi_complete() and re-enables
> the receive interrupt.  If you're unlucky enough that the RX ring is
> also full... no RXF interrupt.  So no further interrupts except maybe
> MII interrupts.
>
> NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
> gets reaped.  The result is a timeout with a completely full TX ring.

Do you see RX ring full?

>
> I think I've seen both cases: I've seen the case where the TX ring is
> completely empty, but it hasn't been reaped.   I've also seen the case
> where the TX ring contains packets to be transmitted but the hardware
> isn't sending them.
>
> That all said - with the patch below I haven't seen problems since.
> (which is the quickest way I can think of to get you a copy of what
> I'm presently running - I've killed a number of debug bits denoted by
> all the blank lines - and this is against -rc7.)  You may notice that
> I added some TX ring dumping code to the driver - always useful in
> these situations. ;-)
>
> This patch is of course the consolidated version: individually, this
> would be at least 19 patches with nice commit messages describing
> each change...
>
> As far as the 600Mbps receive - you need the right conditions for that.
> I select the performance cpufreq governor after boot, and let the boot
> quiesce.  It doesn't take much for it to drop back to 460Mbps - another
> running process other than iperf -s is sufficient to do that.
>
> Let me know how you get on with this.
>
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 3b8d6d19ff05..510580eeae4b 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -170,6 +170,11 @@ struct bufdesc_ex {
>         unsigned short res0[4];
>  };
>
> +union bufdesc_u {
> +       struct bufdesc bd;
> +       struct bufdesc_ex ebd;
> +};
> +
>  /*
>   *     The following definitions courtesy of commproc.h, which where
>   *     Copyright (c) 1997 Dan Malek (dmalek at jlc.net).
> @@ -240,14 +245,14 @@ struct bufdesc_ex {
>   * the skbuffer directly.
>   */
>
> -#define FEC_ENET_RX_PAGES      8
> +#define FEC_ENET_RX_PAGES      32
>  #define FEC_ENET_RX_FRSIZE     2048
>  #define FEC_ENET_RX_FRPPG      (PAGE_SIZE / FEC_ENET_RX_FRSIZE)
>  #define RX_RING_SIZE           (FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
>  #define FEC_ENET_TX_FRSIZE     2048
>  #define FEC_ENET_TX_FRPPG      (PAGE_SIZE / FEC_ENET_TX_FRSIZE)
> -#define TX_RING_SIZE           16      /* Must be power of two */
> -#define TX_RING_MOD_MASK       15      /*   for this to work */
> +#define TX_RING_SIZE           64      /* Must be power of two */
> +#define TX_RING_MOD_MASK       63      /*   for this to work */
>
>  #define BD_ENET_RX_INT          0x00800000
>  #define BD_ENET_RX_PTP          ((ushort)0x0400)
> @@ -289,12 +294,12 @@ struct fec_enet_private {
>         /* CPM dual port RAM relative addresses */
>         dma_addr_t      bd_dma;
>         /* Address of Rx and Tx buffers */
> -       struct bufdesc  *rx_bd_base;
> -       struct bufdesc  *tx_bd_base;
> +       union bufdesc_u *rx_bd_base;
> +       union bufdesc_u *tx_bd_base;
>         /* The next free ring entry */
> -       struct bufdesc  *cur_rx, *cur_tx;
> -       /* The ring entries to be free()ed */
> -       struct bufdesc  *dirty_tx;
> +       unsigned short tx_next;
> +       unsigned short tx_dirty;
> +       unsigned short rx_next;
>
>         unsigned short tx_ring_size;
>         unsigned short rx_ring_size;
> @@ -335,6 +340,9 @@ struct fec_enet_private {
>         struct timer_list time_keep;
>         struct fec_enet_delayed_work delay_work;
>         struct regulator *reg_phy;
> +       unsigned long quirks;
> +
> +
>  };
>
>  void fec_ptp_init(struct platform_device *pdev);
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 03a351300013..8105697d5a99 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -101,6 +101,8 @@ static void set_multicast_list(struct net_device *ndev);
>   * ENET_TDAR[TDAR].
>   */
>  #define FEC_QUIRK_ERR006358            (1 << 7)
> +/* Controller has ability to offset rx packets */
> +#define FEC_QUIRK_RX_SHIFT16           (1 << 8)
>
>  static struct platform_device_id fec_devtype[] = {
>         {
> @@ -120,7 +122,8 @@ static struct platform_device_id fec_devtype[] = {
>                 .name = "imx6q-fec",
>                 .driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
>                                 FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
> -                               FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358,
> +                               FEC_QUIRK_HAS_VLAN | FEC_QUIRK_ERR006358 |
> +                               FEC_QUIRK_RX_SHIFT16,
>         }, {
>                 .name = "mvf600-fec",
>                 .driver_data = FEC_QUIRK_ENET_MAC,
> @@ -200,6 +203,7 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
>  /* FEC receive acceleration */
>  #define FEC_RACC_IPDIS         (1 << 1)
>  #define FEC_RACC_PRODIS                (1 << 2)
> +#define FEC_RACC_SHIFT16       BIT(7)
>  #define FEC_RACC_OPTIONS       (FEC_RACC_IPDIS | FEC_RACC_PRODIS)
>
>  /*
> @@ -233,57 +237,54 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
>
>  static int mii_cnt;
>
> -static inline
> -struct bufdesc *fec_enet_get_nextdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
> +static unsigned copybreak = 200;
> +module_param(copybreak, uint, 0644);
> +MODULE_PARM_DESC(copybreak,
> +                "Maximum size of packet that is copied to a new buffer on receive");
> +
> +
> +
> +
> +
> +static bool fec_enet_rx_zerocopy(struct fec_enet_private *fep, unsigned pktlen)
>  {
> -       struct bufdesc *new_bd = bdp + 1;
> -       struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp + 1;
> -       struct bufdesc_ex *ex_base;
> -       struct bufdesc *base;
> -       int ring_size;
> -
> -       if (bdp >= fep->tx_bd_base) {
> -               base = fep->tx_bd_base;
> -               ring_size = fep->tx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
> -       } else {
> -               base = fep->rx_bd_base;
> -               ring_size = fep->rx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
> -       }
> +#ifndef CONFIG_M5272
> +       if (fep->quirks & FEC_QUIRK_RX_SHIFT16 && pktlen >= copybreak)
> +               return true;
> +#endif
> +       return false;
> +}
> +
> +static union bufdesc_u *
> +fec_enet_tx_get(unsigned index, struct fec_enet_private *fep)
> +{
> +       union bufdesc_u *base = fep->tx_bd_base;
> +       union bufdesc_u *bdp;
> +
> +       index &= fep->tx_ring_size - 1;
>
>         if (fep->bufdesc_ex)
> -               return (struct bufdesc *)((ex_new_bd >= (ex_base + ring_size)) ?
> -                       ex_base : ex_new_bd);
> +               bdp = (union bufdesc_u *)(&base->ebd + index);
>         else
> -               return (new_bd >= (base + ring_size)) ?
> -                       base : new_bd;
> +               bdp = (union bufdesc_u *)(&base->bd + index);
> +
> +       return bdp;
>  }
>
> -static inline
> -struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_private *fep)
> +static union bufdesc_u *
> +fec_enet_rx_get(unsigned index, struct fec_enet_private *fep)
>  {
> -       struct bufdesc *new_bd = bdp - 1;
> -       struct bufdesc_ex *ex_new_bd = (struct bufdesc_ex *)bdp - 1;
> -       struct bufdesc_ex *ex_base;
> -       struct bufdesc *base;
> -       int ring_size;
> -
> -       if (bdp >= fep->tx_bd_base) {
> -               base = fep->tx_bd_base;
> -               ring_size = fep->tx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->tx_bd_base;
> -       } else {
> -               base = fep->rx_bd_base;
> -               ring_size = fep->rx_ring_size;
> -               ex_base = (struct bufdesc_ex *)fep->rx_bd_base;
> -       }
> +       union bufdesc_u *base = fep->rx_bd_base;
> +       union bufdesc_u *bdp;
> +
> +       index &= fep->rx_ring_size - 1;
>
>         if (fep->bufdesc_ex)
> -               return (struct bufdesc *)((ex_new_bd < ex_base) ?
> -                       (ex_new_bd + ring_size) : ex_new_bd);
> +               bdp = (union bufdesc_u *)(&base->ebd + index);
>         else
> -               return (new_bd < base) ? (new_bd + ring_size) : new_bd;
> +               bdp = (union bufdesc_u *)(&base->bd + index);
> +
> +       return bdp;
>  }
>
>  static void *swap_buffer(void *bufaddr, int len)
> @@ -297,6 +298,26 @@ static void *swap_buffer(void *bufaddr, int len)
>         return bufaddr;
>  }
>
> +static void fec_dump(struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       unsigned index = 0;
> +
> +       netdev_info(ndev, "TX ring dump\n");
> +       pr_info("Nr    SC     addr       len  SKB\n");
> +
> +       for (index = 0; index < fep->tx_ring_size; index++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(index, fep);
> +
> +               pr_info("%2u %c%c 0x%04x 0x%08lx %4u %p\n",
> +                       index,
> +                       index == fep->tx_next ? 'S' : ' ',
> +                       index == fep->tx_dirty ? 'H' : ' ',
> +                       bdp->bd.cbd_sc, bdp->bd.cbd_bufaddr, bdp->bd.cbd_datlen,
> +                       fep->tx_skbuff[index]);
> +       }
> +}
> +
>  static int
>  fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>  {
> @@ -312,21 +333,42 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>         return 0;
>  }
>
> +static void
> +fec_enet_tx_unmap(struct bufdesc *bdp, struct fec_enet_private *fep)
> +{
> +       dma_addr_t addr = bdp->cbd_bufaddr;
> +       unsigned length = bdp->cbd_datlen;
> +
> +       bdp->cbd_bufaddr = 0;
> +
> +       dma_unmap_single(&fep->pdev->dev, addr, length, DMA_TO_DEVICE);
> +}
> +
>  static netdev_tx_t
>  fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
> -       struct bufdesc *bdp, *bdp_pre;
> +       union bufdesc_u *bdp, *bdp_pre;
>         void *bufaddr;
>         unsigned short  status;
> -       unsigned int index;
> +       unsigned index;
> +       unsigned length;
> +       dma_addr_t addr;
> +
> +
> +
> +
> +
> +
> +
> +
> +
>
>         /* Fill in a Tx ring entry */
> -       bdp = fep->cur_tx;
> +       index = fep->tx_next;
>
> -       status = bdp->cbd_sc;
> +       bdp = fec_enet_tx_get(index, fep);
> +       status = bdp->bd.cbd_sc;
>
>         if (status & BD_ENET_TX_READY) {
>                 /* Ooops.  All transmit buffers are full.  Bail out.
> @@ -347,21 +389,15 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>
>         /* Set buffer length and buffer pointer */
>         bufaddr = skb->data;
> -       bdp->cbd_datlen = skb->len;
> +       length = skb->len;
>
>         /*
>          * On some FEC implementations data must be aligned on
>          * 4-byte boundaries. Use bounce buffers to copy data
>          * and get it aligned. Ugh.
>          */
> -       if (fep->bufdesc_ex)
> -               index = (struct bufdesc_ex *)bdp -
> -                       (struct bufdesc_ex *)fep->tx_bd_base;
> -       else
> -               index = bdp - fep->tx_bd_base;
> -
>         if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> -               memcpy(fep->tx_bounce[index], skb->data, skb->len);
> +               memcpy(fep->tx_bounce[index], skb->data, length);
>                 bufaddr = fep->tx_bounce[index];
>         }
>
> @@ -370,70 +406,72 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>          * the system that it's running on. As the result, driver has to
>          * swap every frame going to and coming from the controller.
>          */
> -       if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> -               swap_buffer(bufaddr, skb->len);
> +       if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +               swap_buffer(bufaddr, length);
>
> -       /* Save skb pointer */
> -       fep->tx_skbuff[index] = skb;
> -
> -       /* Push the data cache so the CPM does not get stale memory
> -        * data.
> -        */
> -       bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> -                       skb->len, DMA_TO_DEVICE);
> -       if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> -               bdp->cbd_bufaddr = 0;
> -               fep->tx_skbuff[index] = NULL;
> +       /* Push the data cache so the CPM does not get stale memory data. */
> +       addr = dma_map_single(&fep->pdev->dev, bufaddr, length, DMA_TO_DEVICE);
> +       if (dma_mapping_error(&fep->pdev->dev, addr)) {
>                 dev_kfree_skb_any(skb);
>                 if (net_ratelimit())
>                         netdev_err(ndev, "Tx DMA memory map failed\n");
>                 return NETDEV_TX_OK;
>         }
>
> -       if (fep->bufdesc_ex) {
> +       /* Save skb pointer */
> +       fep->tx_skbuff[index] = skb;
>
> -               struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -               ebdp->cbd_bdu = 0;
> +       bdp->bd.cbd_datlen = length;
> +       bdp->bd.cbd_bufaddr = addr;
> +
> +       if (fep->bufdesc_ex) {
> +               bdp->ebd.cbd_bdu = 0;
>                 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
>                         fep->hwts_tx_en)) {
> -                       ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
> +                       bdp->ebd.cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
>                         skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
>                 } else {
> -                       ebdp->cbd_esc = BD_ENET_TX_INT;
> +                       bdp->ebd.cbd_esc = BD_ENET_TX_INT;
>
>                         /* Enable protocol checksum flags
>                          * We do not bother with the IP Checksum bits as they
>                          * are done by the kernel
>                          */
>                         if (skb->ip_summed == CHECKSUM_PARTIAL)
> -                               ebdp->cbd_esc |= BD_ENET_TX_PINS;
> +                               bdp->ebd.cbd_esc |= BD_ENET_TX_PINS;
>                 }
>         }
>
> +       /*
> +        * We need the preceding stores to the descriptor to complete
> +        * before updating the status field, which hands it over to the
> +        * hardware.  The corresponding rmb() is "in the hardware".
> +        */
> +       wmb();
> +
>         /* Send it on its way.  Tell FEC it's ready, interrupt when done,
>          * it's the last BD of the frame, and to put the CRC on the end.
>          */
>         status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
>                         | BD_ENET_TX_LAST | BD_ENET_TX_TC);
> -       bdp->cbd_sc = status;
> +       bdp->bd.cbd_sc = status;
>
> -       bdp_pre = fec_enet_get_prevdesc(bdp, fep);
> -       if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
> -           !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
> +       bdp_pre = fec_enet_tx_get(index - 1, fep);
> +       if ((fep->quirks & FEC_QUIRK_ERR006358) &&
> +           !(bdp_pre->bd.cbd_sc & BD_ENET_TX_READY)) {
>                 fep->delay_work.trig_tx = true;
>                 schedule_delayed_work(&(fep->delay_work.delay_work),
>                                         msecs_to_jiffies(1));
>         }
>
> -       /* If this was the last BD in the ring, start at the beginning again. */
> -       bdp = fec_enet_get_nextdesc(bdp, fep);
> -
>         skb_tx_timestamp(skb);
>
> -       fep->cur_tx = bdp;
> +       fep->tx_next = (index + 1) & (fep->tx_ring_size - 1);
>
> -       if (fep->cur_tx == fep->dirty_tx)
> +       if (fep->tx_next == fep->tx_dirty) {
> +
>                 netif_stop_queue(ndev);
> +       }
>
>         /* Trigger transmission start */
>         writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> @@ -446,46 +484,43 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  static void fec_enet_bd_init(struct net_device *dev)
>  {
>         struct fec_enet_private *fep = netdev_priv(dev);
> -       struct bufdesc *bdp;
>         unsigned int i;
>
>         /* Initialize the receive buffer descriptors. */
> -       bdp = fep->rx_bd_base;
>         for (i = 0; i < fep->rx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
>
>                 /* Initialize the BD for every fragment in the page. */
> -               if (bdp->cbd_bufaddr)
> -                       bdp->cbd_sc = BD_ENET_RX_EMPTY;
> +               if (bdp->bd.cbd_bufaddr)
> +                       bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
>                 else
> -                       bdp->cbd_sc = 0;
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +                       bdp->bd.cbd_sc = 0;
> +               if (i == fep->rx_ring_size - 1)
> +                       bdp->bd.cbd_sc |= BD_SC_WRAP;
>         }
>
> -       /* Set the last buffer to wrap */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> -
> -       fep->cur_rx = fep->rx_bd_base;
> +       fep->rx_next = 0;
>
>         /* ...and the same for transmit */
> -       bdp = fep->tx_bd_base;
> -       fep->cur_tx = bdp;
>         for (i = 0; i < fep->tx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
>
>                 /* Initialize the BD for every fragment in the page. */
> -               bdp->cbd_sc = 0;
> -               if (bdp->cbd_bufaddr && fep->tx_skbuff[i]) {
> +               /* Set the last buffer to wrap */
> +               if (i == fep->tx_ring_size - 1)
> +                       bdp->bd.cbd_sc = BD_SC_WRAP;
> +               else
> +                       bdp->bd.cbd_sc = 0;
> +               if (bdp->bd.cbd_bufaddr)
> +                       fec_enet_tx_unmap(&bdp->bd, fep);
> +               if (fep->tx_skbuff[i]) {
>                         dev_kfree_skb_any(fep->tx_skbuff[i]);
>                         fep->tx_skbuff[i] = NULL;
>                 }
> -               bdp->cbd_bufaddr = 0;
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
>         }
>
> -       /* Set the last buffer to wrap */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> -       fep->dirty_tx = bdp;
> +       fep->tx_next = 0;
> +       fep->tx_dirty = fep->tx_ring_size - 1;
>  }
>
>  /* This function is called to start or restart the FEC during a link
> @@ -496,8 +531,6 @@ static void
>  fec_restart(struct net_device *ndev, int duplex)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         int i;
>         u32 val;
>         u32 temp_mac[2];
> @@ -519,7 +552,7 @@ fec_restart(struct net_device *ndev, int duplex)
>          * enet-mac reset will reset mac address registers too,
>          * so need to reconfigure it.
>          */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
>                 writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW);
>                 writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH);
> @@ -568,6 +601,8 @@ fec_restart(struct net_device *ndev, int duplex)
>  #if !defined(CONFIG_M5272)
>         /* set RX checksum */
>         val = readl(fep->hwp + FEC_RACC);
> +       if (fep->quirks & FEC_QUIRK_RX_SHIFT16)
> +               val |= FEC_RACC_SHIFT16;
>         if (fep->csum_flags & FLAG_RX_CSUM_ENABLED)
>                 val |= FEC_RACC_OPTIONS;
>         else
> @@ -579,7 +614,7 @@ fec_restart(struct net_device *ndev, int duplex)
>          * The phy interface and speed need to get configured
>          * differently on enet-mac.
>          */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 /* Enable flow control and length check */
>                 rcntl |= 0x40000000 | 0x00000020;
>
> @@ -602,7 +637,7 @@ fec_restart(struct net_device *ndev, int duplex)
>                 }
>         } else {
>  #ifdef FEC_MIIGSK_ENR
> -               if (id_entry->driver_data & FEC_QUIRK_USE_GASKET) {
> +               if (fep->quirks & FEC_QUIRK_USE_GASKET) {
>                         u32 cfgr;
>                         /* disable the gasket and wait */
>                         writel(0, fep->hwp + FEC_MIIGSK_ENR);
> @@ -655,7 +690,7 @@ fec_restart(struct net_device *ndev, int duplex)
>         writel(0, fep->hwp + FEC_HASH_TABLE_LOW);
>  #endif
>
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 /* enable ENET endian swap */
>                 ecntl |= (1 << 8);
>                 /* enable ENET store and forward mode */
> @@ -692,8 +727,6 @@ static void
>  fec_stop(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         u32 rmii_mode = readl(fep->hwp + FEC_R_CNTRL) & (1 << 8);
>
>         /* We cannot expect a graceful transmit stop without link !!! */
> @@ -711,7 +744,7 @@ fec_stop(struct net_device *ndev)
>         writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
>
>         /* We have to keep ENET enabled to have MII interrupt stay working */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC) {
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC) {
>                 writel(2, fep->hwp + FEC_ECNTRL);
>                 writel(rmii_mode, fep->hwp + FEC_R_CNTRL);
>         }
> @@ -723,6 +756,8 @@ fec_timeout(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
>
> +       fec_dump(ndev);
> +
>         ndev->stats.tx_errors++;
>
>         fep->delay_work.timeout = true;
> @@ -751,34 +786,28 @@ static void fec_enet_work(struct work_struct *work)
>  static void
>  fec_enet_tx(struct net_device *ndev)
>  {
> -       struct  fec_enet_private *fep;
> -       struct bufdesc *bdp;
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       union bufdesc_u *bdp;
>         unsigned short status;
>         struct  sk_buff *skb;
> -       int     index = 0;
> -
> -       fep = netdev_priv(ndev);
> -       bdp = fep->dirty_tx;
> +       unsigned index = fep->tx_dirty;
>
> -       /* get next bdp of dirty_tx */
> -       bdp = fec_enet_get_nextdesc(bdp, fep);
> +       do {
> +               index = (index + 1) & (fep->tx_ring_size - 1);
> +               bdp = fec_enet_tx_get(index, fep);
>
> -       while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> +               status = bdp->bd.cbd_sc;
> +               if (status & BD_ENET_TX_READY)
> +                       break;
>
>                 /* current queue is empty */
> -               if (bdp == fep->cur_tx)
> +               if (index == fep->tx_next)
>                         break;
>
> -               if (fep->bufdesc_ex)
> -                       index = (struct bufdesc_ex *)bdp -
> -                               (struct bufdesc_ex *)fep->tx_bd_base;
> -               else
> -                       index = bdp - fep->tx_bd_base;
> +               fec_enet_tx_unmap(&bdp->bd, fep);
>
>                 skb = fep->tx_skbuff[index];
> -               dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
> -                               DMA_TO_DEVICE);
> -               bdp->cbd_bufaddr = 0;
> +               fep->tx_skbuff[index] = NULL;
>
>                 /* Check for errors. */
>                 if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
> @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev)
>                                 ndev->stats.tx_carrier_errors++;
>                 } else {
>                         ndev->stats.tx_packets++;
> -                       ndev->stats.tx_bytes += bdp->cbd_datlen;
> +                       ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
>                 }
>
>                 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
>                         fep->bufdesc_ex) {
>                         struct skb_shared_hwtstamps shhwtstamps;
>                         unsigned long flags;
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
>
>                         memset(&shhwtstamps, 0, sizeof(shhwtstamps));
>                         spin_lock_irqsave(&fep->tmreg_lock, flags);
>                         shhwtstamps.hwtstamp = ns_to_ktime(
> -                               timecounter_cyc2time(&fep->tc, ebdp->ts));
> +                               timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
>                         spin_unlock_irqrestore(&fep->tmreg_lock, flags);
>                         skb_tstamp_tx(skb, &shhwtstamps);
>                 }
> @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev)
>
>                 /* Free the sk buffer associated with this last transmit */
>                 dev_kfree_skb_any(skb);
> -               fep->tx_skbuff[index] = NULL;
> -
> -               fep->dirty_tx = bdp;
> -
> -               /* Update pointer to next buffer descriptor to be transmitted */
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
>
>                 /* Since we have freed up a buffer, the ring is no longer full
>                  */
> -               if (fep->dirty_tx != fep->cur_tx) {
> -                       if (netif_queue_stopped(ndev))
> -                               netif_wake_queue(ndev);
> +               if (netif_queue_stopped(ndev)) {
> +
> +
> +
> +
> +
> +                       netif_wake_queue(ndev);
> +
>                 }
> -       }
> +
> +               fep->tx_dirty = index;
> +       } while (1);

where break this while(1);

>         return;
>  }
>
>
> -/* During a receive, the cur_rx points to the current incoming buffer.
> +static void
> +fec_enet_receive(struct sk_buff *skb, union bufdesc_u *bdp, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +
> +       skb->protocol = eth_type_trans(skb, ndev);
> +
> +       /* Get receive timestamp from the skb */
> +       if (fep->hwts_rx_en && fep->bufdesc_ex) {
> +               struct skb_shared_hwtstamps *shhwtstamps =
> +                                                   skb_hwtstamps(skb);
> +               unsigned long flags;
> +
> +               memset(shhwtstamps, 0, sizeof(*shhwtstamps));
> +
> +               spin_lock_irqsave(&fep->tmreg_lock, flags);
> +               shhwtstamps->hwtstamp = ns_to_ktime(
> +                   timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
> +               spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> +       }
> +
> +       if (fep->csum_flags & FLAG_RX_CSUM_ENABLED) {
> +               if (!(bdp->ebd.cbd_esc & FLAG_RX_CSUM_ERROR)) {
> +                       /* don't check it */
> +                       skb->ip_summed = CHECKSUM_UNNECESSARY;
> +               } else {
> +                       skb_checksum_none_assert(skb);
> +               }
> +       }
> +
> +       napi_gro_receive(&fep->napi, skb);
> +}
> +
> +static void noinline
> +fec_enet_receive_copy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       struct sk_buff *skb;
> +       unsigned char *data;
> +       bool vlan_packet_rcvd = false;
> +
> +       /*
> +        * Detect the presence of the VLAN tag, and adjust
> +        * the packet length appropriately.
> +        */
> +       if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
> +           bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
> +               pkt_len -= VLAN_HLEN;
> +               vlan_packet_rcvd = true;
> +       }
> +
> +       /* This does 16 byte alignment, exactly what we need. */
> +       skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN);
> +       if (unlikely(!skb)) {
> +               ndev->stats.rx_dropped++;
> +               return;
> +       }
> +
> +       dma_sync_single_for_cpu(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
> +                               FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +
> +       data = fep->rx_skbuff[index]->data;
> +
> +#ifndef CONFIG_M5272
> +       /*
> +        * If we have enabled this feature, we need to discard
> +        * the two bytes at the beginning of the packet before
> +        * copying it.
> +        */
> +       if (fep->quirks & FEC_QUIRK_RX_SHIFT16) {
> +               pkt_len -= 2;
> +               data += 2;
> +       }
> +#endif
> +
> +       if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +               swap_buffer(data, pkt_len);
> +
> +       skb_reserve(skb, NET_IP_ALIGN);
> +       skb_put(skb, pkt_len);  /* Make room */
> +
> +       /* If this is a VLAN packet remove the VLAN Tag */
> +       if (vlan_packet_rcvd) {
> +               struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
> +
> +               __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
> +                                      ntohs(vlan->h_vlan_TCI));
> +
> +               /* Extract the frame data without the VLAN header. */
> +               skb_copy_to_linear_data(skb, data, 2 * ETH_ALEN);
> +               skb_copy_to_linear_data_offset(skb, 2 * ETH_ALEN,
> +                                              data + 2 * ETH_ALEN + VLAN_HLEN,
> +                                              pkt_len - 2 * ETH_ALEN);
> +       } else {
> +               skb_copy_to_linear_data(skb, data, pkt_len);
> +       }
> +
> +       dma_sync_single_for_device(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
> +                                  FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +
> +       fec_enet_receive(skb, bdp, ndev);
> +}
> +
> +static void noinline
> +fec_enet_receive_nocopy(unsigned pkt_len, unsigned index, union bufdesc_u *bdp, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       struct sk_buff *skb, *skb_new;
> +       unsigned char *data;
> +       dma_addr_t addr;
> +
> +#if 0
> +       skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
> +       if (!skb_new) {
> +               ndev->stats.rx_dropped++;
> +               return;
> +       }
> +
> +       addr = dma_map_single(&fep->pdev->dev, skb_new->data,
> +                             FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +       if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +               dev_kfree_skb(skb_new);
> +               ndev->stats.rx_dropped++;
> +               return;
> +       }
> +#else
> +       skb_new = NULL;
> +       addr = 0;
> +#endif
> +
> +       /*
> +        * We have the new skb, so proceed to deal with the
> +        * received data.
> +        */
> +       dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
> +                        FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +
> +       skb = fep->rx_skbuff[index];
> +
> +       /* Now subsitute in the new skb */
> +       fep->rx_skbuff[index] = skb_new;
> +       bdp->bd.cbd_bufaddr = addr;
> +
> +       /*
> +        * Update the skb length according to the raw packet
> +        * length.  Then remove the two bytes of additional
> +        * padding.
> +        */
> +       skb_put(skb, pkt_len);
> +       data = skb_pull_inline(skb, 2);
> +
> +       if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> +               swap_buffer(data, skb->len);
> +
> +       /*
> +        * Now juggle things for the VLAN tag - if the hardware
> +        * flags this as present, we need to read the tag, and
> +        * then shuffle the ethernet addresses up.
> +        */
> +       if (ndev->features & NETIF_F_HW_VLAN_CTAG_RX &&
> +           bdp->ebd.cbd_esc & BD_ENET_RX_VLAN) {
> +               struct vlan_hdr *vlan = (struct vlan_hdr *)(data + ETH_HLEN);
> +
> +               __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
> +                                      ntohs(vlan->h_vlan_TCI));
> +
> +               memmove(data + VLAN_HLEN, data, 2 * ETH_ALEN);
> +               skb_pull_inline(skb, VLAN_HLEN);
> +       }
> +
> +       fec_enet_receive(skb, bdp, ndev);
> +}
> +
> +static int
> +fec_enet_refill_ring(unsigned first, unsigned last, struct net_device *ndev)
> +{
> +       struct fec_enet_private *fep = netdev_priv(ndev);
> +       unsigned i = first;
> +
> +       do {
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
> +               struct sk_buff *skb;
> +               dma_addr_t addr;
> +
> +               if (!fep->rx_skbuff[i]) {
> +                       skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
> +                       if (!skb)
> +                               return -ENOMEM;
> +
> +                       addr = dma_map_single(&fep->pdev->dev, skb->data,
> +                                             FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +                       if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +                               dev_kfree_skb(skb);
> +                               return -ENOMEM;
> +                       }
> +
> +                       fep->rx_skbuff[i] = skb;
> +                       bdp->bd.cbd_bufaddr = addr;
> +               }
> +
> +               bdp->bd.cbd_sc = (bdp->bd.cbd_sc & BD_SC_WRAP) |
> +                                BD_ENET_RX_EMPTY;
> +
> +               if (fep->bufdesc_ex) {
> +                       bdp->ebd.cbd_esc = BD_ENET_RX_INT;
> +                       bdp->ebd.cbd_prot = 0;
> +                       bdp->ebd.cbd_bdu = 0;
> +               }
> +               i = (i + 1) & (fep->rx_ring_size - 1);
> +       } while (i != last);
> +
> +       return 0;
> +}
> +
> +/* During a receive, the rx_next points to the current incoming buffer.
>   * When we update through the ring, if the next incoming buffer has
>   * not been given to the system, we just set the empty indicator,
>   * effectively tossing the packet.
>   */
> -static int
> +static int noinline
>  fec_enet_rx(struct net_device *ndev, int budget)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
> -       struct bufdesc *bdp;
>         unsigned short status;
> -       struct  sk_buff *skb;
> -       ushort  pkt_len;
> -       __u8 *data;
> +       unsigned pkt_len;
>         int     pkt_received = 0;
> -       struct  bufdesc_ex *ebdp = NULL;
> -       bool    vlan_packet_rcvd = false;
> -       u16     vlan_tag;
> -       int     index = 0;
> +       unsigned index = fep->rx_next;
>
>  #ifdef CONFIG_M532x
>         flush_cache_all();
> @@ -872,12 +1107,16 @@ fec_enet_rx(struct net_device *ndev, int budget)
>         /* First, grab all of the stats for the incoming packet.
>          * These get messed up if we get called due to a busy condition.
>          */
> -       bdp = fep->cur_rx;
> +       do {
> +               union bufdesc_u *bdp = fec_enet_rx_get(index, fep);
>
> -       while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> +               status = bdp->bd.cbd_sc;
> +               if (status & BD_ENET_RX_EMPTY)
> +                       break;
>
>                 if (pkt_received >= budget)
>                         break;
> +
>                 pkt_received++;
>
>                 /* Since we have allocated space to hold a complete frame,
> @@ -917,124 +1156,33 @@ fec_enet_rx(struct net_device *ndev, int budget)
>
>                 /* Process the incoming frame. */
>                 ndev->stats.rx_packets++;
> -               pkt_len = bdp->cbd_datlen;
> -               ndev->stats.rx_bytes += pkt_len;
> -
> -               if (fep->bufdesc_ex)
> -                       index = (struct bufdesc_ex *)bdp -
> -                               (struct bufdesc_ex *)fep->rx_bd_base;
> -               else
> -                       index = bdp - fep->rx_bd_base;
> -               data = fep->rx_skbuff[index]->data;
> -               dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr,
> -                                       FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -
> -               if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> -                       swap_buffer(data, pkt_len);
> -
> -               /* Extract the enhanced buffer descriptor */
> -               ebdp = NULL;
> -               if (fep->bufdesc_ex)
> -                       ebdp = (struct bufdesc_ex *)bdp;
> -
> -               /* If this is a VLAN packet remove the VLAN Tag */
> -               vlan_packet_rcvd = false;
> -               if ((ndev->features & NETIF_F_HW_VLAN_CTAG_RX) &&
> -                   fep->bufdesc_ex && (ebdp->cbd_esc & BD_ENET_RX_VLAN)) {
> -                       /* Push and remove the vlan tag */
> -                       struct vlan_hdr *vlan_header =
> -                                       (struct vlan_hdr *) (data + ETH_HLEN);
> -                       vlan_tag = ntohs(vlan_header->h_vlan_TCI);
> -                       pkt_len -= VLAN_HLEN;
> -
> -                       vlan_packet_rcvd = true;
> -               }
>
> -               /* This does 16 byte alignment, exactly what we need.
> -                * The packet length includes FCS, but we don't want to
> -                * include that when passing upstream as it messes up
> -                * bridging applications.
> +               /*
> +                * The packet length includes FCS, but we don't want
> +                * to include that when passing upstream as it messes
> +                * up bridging applications.
>                  */
> -               skb = netdev_alloc_skb(ndev, pkt_len - 4 + NET_IP_ALIGN);
> +               pkt_len = bdp->bd.cbd_datlen - 4;
> +               ndev->stats.rx_bytes += pkt_len;
>
> -               if (unlikely(!skb)) {
> -                       ndev->stats.rx_dropped++;
> +               if (fec_enet_rx_zerocopy(fep, pkt_len)) {
> +                       fec_enet_receive_nocopy(pkt_len, index, bdp, ndev);
>                 } else {
> -                       int payload_offset = (2 * ETH_ALEN);
> -                       skb_reserve(skb, NET_IP_ALIGN);
> -                       skb_put(skb, pkt_len - 4);      /* Make room */
> -
> -                       /* Extract the frame data without the VLAN header. */
> -                       skb_copy_to_linear_data(skb, data, (2 * ETH_ALEN));
> -                       if (vlan_packet_rcvd)
> -                               payload_offset = (2 * ETH_ALEN) + VLAN_HLEN;
> -                       skb_copy_to_linear_data_offset(skb, (2 * ETH_ALEN),
> -                                                      data + payload_offset,
> -                                                      pkt_len - 4 - (2 * ETH_ALEN));
> -
> -                       skb->protocol = eth_type_trans(skb, ndev);
> -
> -                       /* Get receive timestamp from the skb */
> -                       if (fep->hwts_rx_en && fep->bufdesc_ex) {
> -                               struct skb_shared_hwtstamps *shhwtstamps =
> -                                                           skb_hwtstamps(skb);
> -                               unsigned long flags;
> -
> -                               memset(shhwtstamps, 0, sizeof(*shhwtstamps));
> -
> -                               spin_lock_irqsave(&fep->tmreg_lock, flags);
> -                               shhwtstamps->hwtstamp = ns_to_ktime(
> -                                   timecounter_cyc2time(&fep->tc, ebdp->ts));
> -                               spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> -                       }
> -
> -                       if (fep->bufdesc_ex &&
> -                           (fep->csum_flags & FLAG_RX_CSUM_ENABLED)) {
> -                               if (!(ebdp->cbd_esc & FLAG_RX_CSUM_ERROR)) {
> -                                       /* don't check it */
> -                                       skb->ip_summed = CHECKSUM_UNNECESSARY;
> -                               } else {
> -                                       skb_checksum_none_assert(skb);
> -                               }
> -                       }
> -
> -                       /* Handle received VLAN packets */
> -                       if (vlan_packet_rcvd)
> -                               __vlan_hwaccel_put_tag(skb,
> -                                                      htons(ETH_P_8021Q),
> -                                                      vlan_tag);
> -
> -                       napi_gro_receive(&fep->napi, skb);
> +                       fec_enet_receive_copy(pkt_len, index, bdp, ndev);
>                 }
>
> -               dma_sync_single_for_device(&fep->pdev->dev, bdp->cbd_bufaddr,
> -                                       FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
>  rx_processing_done:
> -               /* Clear the status flags for this buffer */
> -               status &= ~BD_ENET_RX_STATS;
> -
> -               /* Mark the buffer empty */
> -               status |= BD_ENET_RX_EMPTY;
> -               bdp->cbd_sc = status;
> -
> -               if (fep->bufdesc_ex) {
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -
> -                       ebdp->cbd_esc = BD_ENET_RX_INT;
> -                       ebdp->cbd_prot = 0;
> -                       ebdp->cbd_bdu = 0;
> -               }
> -
> -               /* Update BD pointer to next entry */
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +               index = (index + 1) & (fep->rx_ring_size - 1);
> +               if (index == fep->rx_next)
> +                       break;
> +       } while (1);
>
> -               /* Doing this here will keep the FEC running while we process
> -                * incoming frames.  On a heavily loaded network, we should be
> -                * able to keep up at the expense of system resources.
> -                */
> +       if (pkt_received) {
> +               fec_enet_refill_ring(fep->rx_next, index, ndev);
>                 writel(0, fep->hwp + FEC_R_DES_ACTIVE);
>         }
> -       fep->cur_rx = bdp;
> +
> +       fep->rx_next = index;
>
>         return pkt_received;
>  }
> @@ -1044,29 +1192,25 @@ fec_enet_interrupt(int irq, void *dev_id)
>  {
>         struct net_device *ndev = dev_id;
>         struct fec_enet_private *fep = netdev_priv(ndev);
> +       const unsigned napi_mask = FEC_ENET_RXF | FEC_ENET_TXF;
>         uint int_events;
>         irqreturn_t ret = IRQ_NONE;
>
> -       do {
> -               int_events = readl(fep->hwp + FEC_IEVENT);
> -               writel(int_events, fep->hwp + FEC_IEVENT);
> +       int_events = readl(fep->hwp + FEC_IEVENT);
> +       writel(int_events & ~napi_mask, fep->hwp + FEC_IEVENT);
>
> -               if (int_events & (FEC_ENET_RXF | FEC_ENET_TXF)) {
> -                       ret = IRQ_HANDLED;
> +       if (int_events & napi_mask) {
> +               ret = IRQ_HANDLED;
>
> -                       /* Disable the RX interrupt */
> -                       if (napi_schedule_prep(&fep->napi)) {
> -                               writel(FEC_RX_DISABLED_IMASK,
> -                                       fep->hwp + FEC_IMASK);
> -                               __napi_schedule(&fep->napi);
> -                       }
> -               }
> +               /* Disable the NAPI interrupts */
> +               writel(FEC_ENET_MII, fep->hwp + FEC_IMASK);
> +               napi_schedule(&fep->napi);
> +       }
>
> -               if (int_events & FEC_ENET_MII) {
> -                       ret = IRQ_HANDLED;
> -                       complete(&fep->mdio_done);
> -               }
> -       } while (int_events);
> +       if (int_events & FEC_ENET_MII) {
> +               ret = IRQ_HANDLED;
> +               complete(&fep->mdio_done);
> +       }
>
>         return ret;
>  }
> @@ -1074,10 +1218,24 @@ fec_enet_interrupt(int irq, void *dev_id)
>  static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
>  {
>         struct net_device *ndev = napi->dev;
> -       int pkts = fec_enet_rx(ndev, budget);
>         struct fec_enet_private *fep = netdev_priv(ndev);
> +       unsigned status;
> +       int pkts = 0;
> +
> +       status = readl(fep->hwp + FEC_IEVENT) & (FEC_ENET_RXF | FEC_ENET_TXF);
> +       if (status) {
> +               /*
> +                * Clear any pending transmit or receive interrupts before
> +                * processing the rings to avoid racing with the hardware.
> +                */
> +               writel(status, fep->hwp + FEC_IEVENT);
>
> -       fec_enet_tx(ndev);
> +               if (status & FEC_ENET_RXF)
> +                       pkts = fec_enet_rx(ndev, budget);
> +
> +               if (status & FEC_ENET_TXF)
> +                       fec_enet_tx(ndev);
> +       }
>
>         if (pkts < budget) {
>                 napi_complete(napi);
> @@ -1263,8 +1421,6 @@ static int fec_enet_mdio_reset(struct mii_bus *bus)
>  static int fec_enet_mii_probe(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         struct phy_device *phy_dev = NULL;
>         char mdio_bus_id[MII_BUS_ID_SIZE];
>         char phy_name[MII_BUS_ID_SIZE + 3];
> @@ -1302,7 +1458,7 @@ static int fec_enet_mii_probe(struct net_device *ndev)
>         }
>
>         /* mask with MAC supported features */
> -       if (id_entry->driver_data & FEC_QUIRK_HAS_GBIT) {
> +       if (fep->quirks & FEC_QUIRK_HAS_GBIT) {
>                 phy_dev->supported &= PHY_GBIT_FEATURES;
>  #if !defined(CONFIG_M5272)
>                 phy_dev->supported |= SUPPORTED_Pause;
> @@ -1329,8 +1485,6 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>         static struct mii_bus *fec0_mii_bus;
>         struct net_device *ndev = platform_get_drvdata(pdev);
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
>         int err = -ENXIO, i;
>
>         /*
> @@ -1349,7 +1503,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>          * mdio interface in board design, and need to be configured by
>          * fec0 mii_bus.
>          */
> -       if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
> +       if ((fep->quirks & FEC_QUIRK_ENET_MAC) && fep->dev_id > 0) {
>                 /* fec1 uses fec0 mii_bus */
>                 if (mii_cnt && fec0_mii_bus) {
>                         fep->mii_bus = fec0_mii_bus;
> @@ -1370,7 +1524,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>          * document.
>          */
>         fep->phy_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ahb), 5000000);
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC)
>                 fep->phy_speed--;
>         fep->phy_speed <<= 1;
>         writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED);
> @@ -1405,7 +1559,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>         mii_cnt++;
>
>         /* save fec0 mii_bus */
> -       if (id_entry->driver_data & FEC_QUIRK_ENET_MAC)
> +       if (fep->quirks & FEC_QUIRK_ENET_MAC)
>                 fec0_mii_bus = fep->mii_bus;
>
>         return 0;
> @@ -1694,23 +1848,24 @@ static void fec_enet_free_buffers(struct net_device *ndev)
>         struct fec_enet_private *fep = netdev_priv(ndev);
>         unsigned int i;
>         struct sk_buff *skb;
> -       struct bufdesc  *bdp;
>
> -       bdp = fep->rx_bd_base;
>         for (i = 0; i < fep->rx_ring_size; i++) {
> -               skb = fep->rx_skbuff[i];
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
>
> -               if (bdp->cbd_bufaddr)
> -                       dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> +               skb = fep->rx_skbuff[i];
> +               if (skb) {
> +                       dma_unmap_single(&fep->pdev->dev, bdp->bd.cbd_bufaddr,
>                                         FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -               if (skb)
>                         dev_kfree_skb(skb);
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +               }
>         }
>
> -       bdp = fep->tx_bd_base;
> -       for (i = 0; i < fep->tx_ring_size; i++)
> +       for (i = 0; i < fep->tx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
> +               if (bdp->bd.cbd_bufaddr)
> +                       fec_enet_tx_unmap(&bdp->bd, fep);
>                 kfree(fep->tx_bounce[i]);
> +       }
>  }
>
>  static int fec_enet_alloc_buffers(struct net_device *ndev)
> @@ -1718,58 +1873,54 @@ static int fec_enet_alloc_buffers(struct net_device *ndev)
>         struct fec_enet_private *fep = netdev_priv(ndev);
>         unsigned int i;
>         struct sk_buff *skb;
> -       struct bufdesc  *bdp;
>
> -       bdp = fep->rx_bd_base;
>         for (i = 0; i < fep->rx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_rx_get(i, fep);
> +               dma_addr_t addr;
> +
>                 skb = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
>                 if (!skb) {
>                         fec_enet_free_buffers(ndev);
>                         return -ENOMEM;
>                 }
> -               fep->rx_skbuff[i] = skb;
>
> -               bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, skb->data,
> -                               FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> -               if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> +               addr = dma_map_single(&fep->pdev->dev, skb->data,
> +                                     FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
> +               if (dma_mapping_error(&fep->pdev->dev, addr)) {
> +                       dev_kfree_skb(skb);
>                         fec_enet_free_buffers(ndev);
>                         if (net_ratelimit())
>                                 netdev_err(ndev, "Rx DMA memory map failed\n");
>                         return -ENOMEM;
>                 }
> -               bdp->cbd_sc = BD_ENET_RX_EMPTY;
>
> -               if (fep->bufdesc_ex) {
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -                       ebdp->cbd_esc = BD_ENET_RX_INT;
> -               }
> +               fep->rx_skbuff[i] = skb;
>
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> -       }
> +               bdp->bd.cbd_bufaddr = addr;
> +               bdp->bd.cbd_sc = BD_ENET_RX_EMPTY;
> +               /* Set the last buffer to wrap. */
> +               if (i == fep->rx_ring_size - 1)
> +                       bdp->bd.cbd_sc |= BD_SC_WRAP;
>
> -       /* Set the last buffer to wrap. */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> +               if (fep->bufdesc_ex)
> +                       bdp->ebd.cbd_esc = BD_ENET_RX_INT;
> +       }
>
> -       bdp = fep->tx_bd_base;
>         for (i = 0; i < fep->tx_ring_size; i++) {
> +               union bufdesc_u *bdp = fec_enet_tx_get(i, fep);
>                 fep->tx_bounce[i] = kmalloc(FEC_ENET_TX_FRSIZE, GFP_KERNEL);
>
> -               bdp->cbd_sc = 0;
> -               bdp->cbd_bufaddr = 0;
> -
> -               if (fep->bufdesc_ex) {
> -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -                       ebdp->cbd_esc = BD_ENET_TX_INT;
> -               }
> +               /* Set the last buffer to wrap. */
> +               if (i == fep->tx_ring_size - 1)
> +                       bdp->bd.cbd_sc = BD_SC_WRAP;
> +               else
> +                       bdp->bd.cbd_sc = 0;
> +               bdp->bd.cbd_bufaddr = 0;
>
> -               bdp = fec_enet_get_nextdesc(bdp, fep);
> +               if (fep->bufdesc_ex)
> +                       bdp->ebd.cbd_esc = BD_ENET_TX_INT;
>         }
>
> -       /* Set the last buffer to wrap. */
> -       bdp = fec_enet_get_prevdesc(bdp, fep);
> -       bdp->cbd_sc |= BD_SC_WRAP;
> -
>         return 0;
>  }
>
> @@ -1990,9 +2141,7 @@ static const struct net_device_ops fec_netdev_ops = {
>  static int fec_enet_init(struct net_device *ndev)
>  {
>         struct fec_enet_private *fep = netdev_priv(ndev);
> -       const struct platform_device_id *id_entry =
> -                               platform_get_device_id(fep->pdev);
> -       struct bufdesc *cbd_base;
> +       union bufdesc_u *cbd_base;
>
>         /* Allocate memory for buffer descriptors. */
>         cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma,
> @@ -2014,10 +2163,11 @@ static int fec_enet_init(struct net_device *ndev)
>         /* Set receive and transmit descriptor base. */
>         fep->rx_bd_base = cbd_base;
>         if (fep->bufdesc_ex)
> -               fep->tx_bd_base = (struct bufdesc *)
> -                       (((struct bufdesc_ex *)cbd_base) + fep->rx_ring_size);
> +               fep->tx_bd_base = (union bufdesc_u *)
> +                       (&cbd_base->ebd + fep->rx_ring_size);
>         else
> -               fep->tx_bd_base = cbd_base + fep->rx_ring_size;
> +               fep->tx_bd_base = (union bufdesc_u *)
> +                       (&cbd_base->bd + fep->rx_ring_size);
>
>         /* The FEC Ethernet specific entries in the device structure */
>         ndev->watchdog_timeo = TX_TIMEOUT;
> @@ -2027,19 +2177,24 @@ static int fec_enet_init(struct net_device *ndev)
>         writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK);
>         netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
>
> -       if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) {
> -               /* enable hw VLAN support */
> -               ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
> -               ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
> -       }
> +       if (fep->bufdesc_ex) {
> +               /* Features which require the enhanced buffer descriptors */
> +               netdev_features_t features = 0;
>
> -       if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
> -               /* enable hw accelerator */
> -               ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -                               | NETIF_F_RXCSUM);
> -               ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -                               | NETIF_F_RXCSUM);
> -               fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
> +               if (fep->quirks & FEC_QUIRK_HAS_VLAN) {
> +                       /* enable hw VLAN support */
> +                       features |= NETIF_F_HW_VLAN_CTAG_RX;
> +               }
> +
> +               if (fep->quirks & FEC_QUIRK_HAS_CSUM) {
> +                       /* enable hw accelerator */
> +                       features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> +                                   NETIF_F_RXCSUM;
> +                       fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
> +               }
> +
> +               ndev->hw_features |= features;
> +               ndev->features |= features;
>         }
>
>         fec_restart(ndev, 0);
> @@ -2110,13 +2265,6 @@ fec_probe(struct platform_device *pdev)
>         /* setup board info structure */
>         fep = netdev_priv(ndev);
>
> -#if !defined(CONFIG_M5272)
> -       /* default enable pause frame auto negotiation */
> -       if (pdev->id_entry &&
> -           (pdev->id_entry->driver_data & FEC_QUIRK_HAS_GBIT))
> -               fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
> -#endif
> -
>         r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>         fep->hwp = devm_ioremap_resource(&pdev->dev, r);
>         if (IS_ERR(fep->hwp)) {
> @@ -2126,6 +2274,14 @@ fec_probe(struct platform_device *pdev)
>
>         fep->pdev = pdev;
>         fep->dev_id = dev_id++;
> +       if (pdev->id_entry)
> +               fep->quirks = pdev->id_entry->driver_data;
> +
> +#if !defined(CONFIG_M5272)
> +       /* default enable pause frame auto negotiation */
> +       if (fep->quirks & FEC_QUIRK_HAS_GBIT)
> +               fep->pause_flag |= FEC_PAUSE_FLAG_AUTONEG;
> +#endif
>
>         fep->bufdesc_ex = 0;
>
> @@ -2160,8 +2316,7 @@ fec_probe(struct platform_device *pdev)
>                 fep->clk_enet_out = NULL;
>
>         fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp");
> -       fep->bufdesc_ex =
> -               pdev->id_entry->driver_data & FEC_QUIRK_HAS_BUFDESC_EX;
> +       fep->bufdesc_ex = fep->quirks & FEC_QUIRK_HAS_BUFDESC_EX;
>         if (IS_ERR(fep->clk_ptp)) {
>                 fep->clk_ptp = NULL;
>                 fep->bufdesc_ex = 0;
>
>
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-21  1:43                         ` Fabio Estevam
@ 2014-03-21 16:37                           ` Bernd Faust
       [not found]                           ` <CANBf9eNZB+BVBDkgWNxxGs-ndQ-mYCc6+bfVdS-8T-QLpSZ3GA@mail.gmail.com>
  1 sibling, 0 replies; 76+ messages in thread
From: Bernd Faust @ 2014-03-21 16:37 UTC (permalink / raw)
  To: linux-arm-kernel

I also tried the patch, but unfortunately it doesn't work for me either.
I get the same error as Robert Daniels. My target is a i.MX6 (Cortex
A9, single core) running Linux Kernel version 3.10.9-rt5+

Test done:
  Start iperf server on target
    $ iperf -s -u &

   Let iperf client run for 10 minutes on the host
    $ iperf -c <ip_target> -u -b 100m -t 600 -l 256

  Also start a ping from host to target, to easily detect when the error occurs
    $ ping <ip_target>


Without the patch applied I see the following behavior:
After some time the ping from host to target stopped.
As soon as iperf was ready I started a ping from target to host.
By doing that the network came back to life: the ping from host to
target resumed.

With the patch applied the following happens:
After some amount of time the ping stops, but I've seen large ping
times (up to 9000ms) before it does that.
After iperf is done, the connection was dead. (just like without the patch)
The difference now is that the ping from host to target started
reporting "Destination Host Unreachable" after iperf has finished
and the ping from target to host didn't bring the connection back to
life: it reported 100% packet loss.

dmesg output on target shows the following:

------------[ cut here ]------------
WARNING: at /home/bfa/linux/net/sched/sch_generic.c:255
dev_watchdog+0x380/0x390()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G           O 3.10.9-rt5+ #3
Backtrace:
[<80012cf0>] (dump_backtrace+0x0/0x118) from [<800130f0>] (show_stack+0x20/0x24)
 r6:000000ff r5:8046caf8 r4:9f8d1d80 r3:00000000
[<800130d0>] (show_stack+0x0/0x24) from [<8054c9f0>] (dump_stack+0x24/0x28)
[<8054c9cc>] (dump_stack+0x0/0x28) from [<80020d84>]
(warn_slowpath_common+0x64/0x78)
[<80020d20>] (warn_slowpath_common+0x0/0x78) from [<80020e54>]
(warn_slowpath_fmt+0x40/0x48)
 r8:8076ea10 r7:8072c2c0 r6:00000000 r5:9f837214 r4:9f837000
r3:00000009
[<80020e14>] (warn_slowpath_fmt+0x0/0x48) from [<8046caf8>]
(dev_watchdog+0x380/0x390)
 r3:9f837000 r2:8069ab98
[<8046c778>] (dev_watchdog+0x0/0x390) from [<800326c8>]
(call_timer_fn+0x54/0x1ac)
[<80032674>] (call_timer_fn+0x0/0x1ac) from [<80032a58>]
(run_timer_softirq+0x238/0x2ec)
[<80032820>] (run_timer_softirq+0x0/0x2ec) from [<8002a180>]
(handle_softirq+0x84/0x1e8)
[<8002a0fc>] (handle_softirq+0x0/0x1e8) from [<8002a344>]
(do_single_softirq+0x60/0x90)
[<8002a2e4>] (do_single_softirq+0x0/0x90) from [<8002a4e4>]
(do_current_softirqs+0x170/0x1a8)
 r7:80735620 r6:807b2440 r5:9f8d0000 r4:00000001
[<8002a374>] (do_current_softirqs+0x0/0x1a8) from [<8002a55c>]
(run_ksoftirqd+0x40/0x60)
[<8002a51c>] (run_ksoftirqd+0x0/0x60) from [<800516a8>]
(smpboot_thread_fn+0x23c/0x2b0)
 r5:9f8d0000 r4:9f827540
[<8005146c>] (smpboot_thread_fn+0x0/0x2b0) from [<80047684>] (kthread+0xb4/0xb8)
[<800475d0>] (kthread+0x0/0xb8) from [<8000edd8>] (ret_from_fork+0x14/0x20)
 r7:00000000 r6:00000000 r5:800475d0 r4:9f8c3e5c
---[ end trace 0000000000000002 ]---
fec 2188000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0x2e853000   98 9e8b6000
 1    0x1c00 0x2e853800   98 9e9bb6c0
 2    0x1c00 0x2e9c4000   98 9e841a80
 3    0x1c00 0x2e9c4800   98 9e841000
 4    0x1c00 0x2e9c5000   98 9e8c3480
 5    0x1c00 0x2e9c5800   98 9e9bb540
 6    0x1c00 0x2e9c6000   98 9e8b4f00
 7    0x1c00 0x2e9c6800   98 9e941e40
 8    0x1c00 0x2e9c7000   98 9e941300
 9    0x1c00 0x2e9c7800   98 9e941780
10    0x1c00 0x2e9cc000   98 9e8b4d80
11    0x1c00 0x2e9cc800   98 9e8b4b40
12    0x1c00 0x2e9cd000   98 9e8b4a80
13    0x1c00 0x2e9cd800   98 9e8b4600
14    0x1c00 0x2e9ce000   98 9e941600
15    0x1c00 0x2e9ce800   98 9e8b4540
16    0x1c00 0x2e9cf000   98 9e8b4180
17    0x1c00 0x2e9cf800   98 9e9419c0
18    0x1c00 0x2e9f8000   98 9e8b4300
19    0x1c00 0x2e9f8800   98 9e941840
20    0x1c00 0x2e9f9000   98 9e8b4cc0
21    0x1c00 0x2e9f9800   98 9e8b49c0
22    0x1c00 0x2e9fa000   98 9e8416c0
23    0x1c00 0x2e9fa800   98 9e8c1b40
24    0x1c00 0x2e9fb000   98 9e9ad900
25    0x1c00 0x2e9fb800   98 9e9ad300
26    0x1c00 0x2e9fc000   98 9e9ad840
27    0x1c00 0x2e9fc800   98 9e9ad000
28    0x1c00 0x2e9fd000   98 9e941b40
29    0x1c00 0x2e9fd800   98 9e8b6240
30    0x1c00 0x2e9fe000   98 9e941f00
31    0x1c00 0x2e9fe800   98 9e941000
32    0x1c00 0x2e9ff000   98 9e941a80
33    0x1c00 0x2e9ff800   98 9e8b4e40
34    0x1c00 0x2ea00000   98 9e9ad600
35    0x1c00 0x2ea00800   98 9e8413c0
36    0x1c00 0x2ea01000   98 9e841c00
37    0x1c00 0x2ea01800   98 9e8b6300
38    0x1c00 0x2ea02000   98 9e8b6c00
39    0x1c00 0x2ea02800   98 9e8b40c0
40    0x1c00 0x2ea03000   98 9e841e40
41    0x1c00 0x2ea03800   98 9e8c1600
42    0x1c00 0x2ea04000   98 9e8c1c00
43    0x1c00 0x2ea04800   98 9e9ad6c0
44    0x1c00 0x2ea05000   98 9e9bb0c0
45    0x1c00 0x2ea05800   98 9e9ad480
46    0x1c00 0x2ea06000   98 9e841900
47    0x1c00 0x2ea06800   98 9e8b4240
48    0x1c00 0x2ea07000   98 9e8b4000
49    0x1c00 0x2ea07800   98 9e8b4840
50    0x1c00 0x2ea08000   98 9e9ad240
51    0x1c00 0x2ea08800   98 9e9b19c0
52    0x1c00 0x2ea09000   98 9e941240
53    0x1c00 0x2ea09800   98 9e841600
54    0x1c00 0x2ea0a000   98 9e9bb780
55    0x1c00 0x2ea0a800   98 9e9b1480
56    0x1c00 0x2ea0b000   98 9e9bb600
57    0x1c00 0x2ea0b800   98 9e9b13c0
58    0x1c00 0x2ea0c000   98 9e9ada80
59    0x1c00 0x2ea0c800   98 9e9ad0c0
60 SH 0x1c00 0x00000000   98   (null)
61    0x9c00 0x2ea0d800   98 9e841b40
62    0x1c00 0x2ea0e000   98 9e9bb240
63    0x3c00 0x2ea0e800   98 9e8c3f00


both situations are very reproducible.
I've you have suggestions on what I can do or test, please let me know.

regards,
Bernd Faust

On 21 March 2014 02:43, Fabio Estevam <festevam@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 10:36 PM, Fabio Estevam <festevam@gmail.com> wrote:
>
>> Robert's tests were made on a mx53 (single CortexA9), and its cache
>> controller is not the L310.
>
> Ops, I meant CortexA8.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* PL310 errata workarounds
  2014-03-21 15:50           ` PL310 errata workarounds Frank Li
@ 2014-03-21 17:12             ` Russell King - ARM Linux
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-21 17:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 21, 2014 at 10:50:41AM -0500, Frank Li wrote:
> On Wed, Mar 19, 2014 at 5:51 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> > 2. The ARM can re-order writes.  The writes it can re-order are:
> >
> >         bdp->cbd_bufaddr
> >         ebdp->cbd_bdu
> >         ebdp->cbd_esc
> >         ebdp->cbd_sc
> >
> > Hence, it's entirely possible for the FEC to see the updated descriptor
> > status before the rest of the descriptor has been written.  What's missing
> > is a barrier between the descriptor writes, and the final write to
> > bdp->cbd_sc.
> 
> Possible. I also consider this when debug the other issues.
> Just bdp->bufaddr and bdp->buflen is important.
> bdu, hardware don't care it.
> esc is the same for each BD.  When software go though BD ring once,
> esc will be not changed again, even though we write.

If you assume that all you're sending is traffic which needs the headers
checksummed (in other words, only IP traffic and nothing else) then you're
correct.  Reality is that other traffic gets included, such as ARPs which
have skb->ip_summed = CHECKSUM_NONE.

In any case, you are wrong about cbd_esc.  For this code:

			ebdp->cbd_esc = BD_ENET_TX_INT;
 
			/* Enable protocol checksum flags
			 * We do not bother with the IP Checksum bits as they
			 * are done by the kernel
			 */
			if (skb->ip_summed == CHECKSUM_PARTIAL)
				ebdp->cbd_esc |= BD_ENET_TX_PINS;

this is the assembly which the compiler will generate:

     820:       e3a03101        mov     r3, #1073741824 ; 0x40000000
     824:       e5863008        str     r3, [r6, #8]
     828:       e5d53064        ldrb    r3, [r5, #100]  ; 0x64
     82c:       e203300c        and     r3, r3, #12
     830:       e353000c        cmp     r3, #12
     834:       03a03205        moveq   r3, #1342177280 ; 0x50000000
     838:       05863008        streq   r3, [r6, #8]

That's a write to cbd_esc with BD_ENET_TX_INT set and BD_ENET_TX_PINS
clear, followed by another write with both set if the packet has been
partially checksummed.

Depending on the CPU and timing, that could be visible to the device.
Thankfully, fb8ef788680 fixed it for CPUs which are strongly ordered.
However, with a weakly ordered CPU, all it would take is an interrupt
between those two writes to make them visible as separate writes, and
without the following barrier, the order of cbd_esc and cbd_sc
becoming visible is uncertain.

> but if sc write before buffaddr and bullen, there will be issue.
> Add memory barre here is better before  ebdp->cbd_sc.
> 
> I also want to add it before, But I have not found a test case to
> verify it is necessary.

In these situations, knowledge and theory is better than test-and-verify,
especially when we're talking about trying to hit a small window with
timings.

We know ARMv6 and ARMv7 are weakly ordered.  We know we see problems
with transmit on these devices.  We know that the specification calls
for the TX ready bit to be set last.  We know on these CPUs that
writes can be reordered.  Therefore... a barrier is definitely
required.

> > The result is the handler is entered, FEC_IEVENT contains TXF and MII
> > events.  Both these events are cleared down, (and thus no longer exist
> > as interrupt-causing events.)  napi_schedule_prep() returns false as
> > the NAPI rx function is still running, and doesn't mark it for a re-run.
> > We then do the MII interrupt.  Loop again, and int_events is zero,
> > we exit.
> >
> > Meanwhile, the NAPI rx function calls napi_complete() and re-enables
> > the receive interrupt.  If you're unlucky enough that the RX ring is
> > also full... no RXF interrupt.  So no further interrupts except maybe
> > MII interrupts.
> >
> > NAPI never gets scheduled.  RX ring never gets emptied.  TX ring never
> > gets reaped.  The result is a timeout with a completely full TX ring.
> 
> Do you see RX ring full?

I haven't added any debug for the RX ring, so I don't know for certain.
What I do know is that I've had the situation where the TX ring is
completely full of packets which have been sent and the TX ring hasn't
been reaped, which leads to the timeout, and that is down to the NAPI
functions not being scheduled.

When that happens, the three sets of flood pings thumping away at the
iMX6Q don't get any response until the timeout occurs.

In that scenario, if a packet were to be received (and there's many
packets hitting the interface) then it _would_ trigger reaping of the
TX ring.

> > +       do {
> > +               index = (index + 1) & (fep->tx_ring_size - 1);
> > +               bdp = fec_enet_tx_get(index, fep);
> >
> > -       while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
> > +               status = bdp->bd.cbd_sc;
> > +               if (status & BD_ENET_TX_READY)
> > +                       break;

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In answer to your question below, here.

> >
> >                 /* current queue is empty */
> > -               if (bdp == fep->cur_tx)
> > +               if (index == fep->tx_next)
> >                         break;

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and here.

> >
> > -               if (fep->bufdesc_ex)
> > -                       index = (struct bufdesc_ex *)bdp -
> > -                               (struct bufdesc_ex *)fep->tx_bd_base;
> > -               else
> > -                       index = bdp - fep->tx_bd_base;
> > +               fec_enet_tx_unmap(&bdp->bd, fep);
> >
> >                 skb = fep->tx_skbuff[index];
> > -               dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
> > -                               DMA_TO_DEVICE);
> > -               bdp->cbd_bufaddr = 0;
> > +               fep->tx_skbuff[index] = NULL;
> >
> >                 /* Check for errors. */
> >                 if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
> > @@ -797,19 +826,18 @@ fec_enet_tx(struct net_device *ndev)
> >                                 ndev->stats.tx_carrier_errors++;
> >                 } else {
> >                         ndev->stats.tx_packets++;
> > -                       ndev->stats.tx_bytes += bdp->cbd_datlen;
> > +                       ndev->stats.tx_bytes += bdp->bd.cbd_datlen;
> >                 }
> >
> >                 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
> >                         fep->bufdesc_ex) {
> >                         struct skb_shared_hwtstamps shhwtstamps;
> >                         unsigned long flags;
> > -                       struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> >
> >                         memset(&shhwtstamps, 0, sizeof(shhwtstamps));
> >                         spin_lock_irqsave(&fep->tmreg_lock, flags);
> >                         shhwtstamps.hwtstamp = ns_to_ktime(
> > -                               timecounter_cyc2time(&fep->tc, ebdp->ts));
> > +                               timecounter_cyc2time(&fep->tc, bdp->ebd.ts));
> >                         spin_unlock_irqrestore(&fep->tmreg_lock, flags);
> >                         skb_tstamp_tx(skb, &shhwtstamps);
> >                 }
> > @@ -825,45 +853,252 @@ fec_enet_tx(struct net_device *ndev)
> >
> >                 /* Free the sk buffer associated with this last transmit */
> >                 dev_kfree_skb_any(skb);
> > -               fep->tx_skbuff[index] = NULL;
> > -
> > -               fep->dirty_tx = bdp;
> > -
> > -               /* Update pointer to next buffer descriptor to be transmitted */
> > -               bdp = fec_enet_get_nextdesc(bdp, fep);
> >
> >                 /* Since we have freed up a buffer, the ring is no longer full
> >                  */
> > -               if (fep->dirty_tx != fep->cur_tx) {
> > -                       if (netif_queue_stopped(ndev))
> > -                               netif_wake_queue(ndev);
> > +               if (netif_queue_stopped(ndev)) {
> > +
> > +
> > +
> > +
> > +
> > +                       netif_wake_queue(ndev);
> > +
> >                 }
> > -       }
> > +
> > +               fep->tx_dirty = index;
> > +       } while (1);
> 
> where break this while(1);

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
       [not found]                           ` <CANBf9eNZB+BVBDkgWNxxGs-ndQ-mYCc6+bfVdS-8T-QLpSZ3GA@mail.gmail.com>
@ 2014-03-21 17:32                             ` Russell King - ARM Linux
  2014-03-23 11:38                               ` Bernd Faust
  2014-03-24 17:57                               ` robert.daniels at vantagecontrols.com
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-21 17:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
> I also tried the patch, but unfortunately it doesn't work for me either.
> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
> single core) running Linux Kernel version 3.10.9-rt5+

Hmm, rt kernels.  Does this happen without the rt patches applied?

> Test done:
>   Start iperf server on target
>     $ iperf -s -u &
> 
>    Let iperf client run for 10 minutes on the host
>     $ iperf -c <ip_target> -u -b 100m -t 600 -l 256
> 
>   Also start a ping from host to target, to easily detect when the error
> occurs
>     $ ping <ip_target>

Right, I've just done this here, the results aren't that good:

On the iMX6S:
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  160 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.181 port 5001 connected with 192.168.1.11 port 45248
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0-600.2 sec   109 MBytes  1.52 Mbits/sec  13.726 ms 29469735/29914906 (99%)

On my x86 laptop:

--- 192.168.1.181 ping statistics ---
617 packets transmitted, 616 received, 0% packet loss, time 616014ms
rtt min/avg/max/mdev = 0.167/0.294/8.934/0.512 ms

------------------------------------------------------------
Client connecting to 192.168.1.181, UDP port 5001
Sending 256 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.11 port 45248 connected with 192.168.1.181 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-600.0 sec  7.13 GBytes   102 Mbits/sec
[  3] Sent 29914976 datagrams
[  3] Server Report:
[  3]  0.0-600.2 sec   109 MBytes  1.52 Mbits/sec  13.726 ms 29469735/29914906 (99%)

The loss is disappointing and needs further investigation, but as you
can see, none of the symptoms you have reported are present - no lockup
and no 9000ms ping times (the longest was 9ms).

This is a plain 3.14-rc7 kernel configured for preempt, without the
-rt patches.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-21 17:32                             ` Russell King - ARM Linux
@ 2014-03-23 11:38                               ` Bernd Faust
  2014-03-23 13:44                                 ` Russell King - ARM Linux
  2014-03-24 17:57                               ` robert.daniels at vantagecontrols.com
  1 sibling, 1 reply; 76+ messages in thread
From: Bernd Faust @ 2014-03-23 11:38 UTC (permalink / raw)
  To: linux-arm-kernel

On 21 March 2014 18:32, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
>> I also tried the patch, but unfortunately it doesn't work for me either.
>> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
>> single core) running Linux Kernel version 3.10.9-rt5+
>
> Hmm, rt kernels.  Does this happen without the rt patches applied?
>

On a kernel without RT patches the problem does not occur, so in my
case the problem seems to be introduced/triggered by the RT patches.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-23 11:38                               ` Bernd Faust
@ 2014-03-23 13:44                                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-23 13:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Mar 23, 2014 at 12:38:31PM +0100, Bernd Faust wrote:
> On 21 March 2014 18:32, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> > On Fri, Mar 21, 2014 at 05:33:02PM +0100, Bernd Faust wrote:
> >> I also tried the patch, but unfortunately it doesn't work for me either.
> >> I get the same error as Robert Daniels. My target is a i.MX6 (Cortex A9,
> >> single core) running Linux Kernel version 3.10.9-rt5+
> >
> > Hmm, rt kernels.  Does this happen without the rt patches applied?
> >
> 
> On a kernel without RT patches the problem does not occur, so in my
> case the problem seems to be introduced/triggered by the RT patches.

Yes, this suggests it's some kind of race introduced by the RT patches.
I've never looked at the RT patches, so I can't help there.

What I can say is that there are a number of other problems with this
driver which are all races of one form or another, but they should not
affect normal operation.

For example:
- the FEC_QUIRK_ERR006358 workaround (for iMX6 only) can read the previous
  descriptor before the write to new descriptor's TX status bit has become
  visible to the hardware.
- FEC_QUIRK_ERR006358 workaround not quite having the desired effect
  (a scheduled delayed work can't be re-scheduled until after it has
  already fired.)
- the number of places that call fec_stop() without first stopping the
  transmit queue or dealing with the above work-around before
  reconfiguring the hardware and re-initialising the rings.
- the rx path having a similar problem to the above (writing descriptor
  entries after the descriptor has been handed over to uDMA.)

I haven't addressed any of these yet, but they're on my list.  I'm pretty
sure there's other problems in there too...

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-21 17:32                             ` Russell King - ARM Linux
  2014-03-23 11:38                               ` Bernd Faust
@ 2014-03-24 17:57                               ` robert.daniels at vantagecontrols.com
  2014-03-24 20:21                                 ` Marek Vasut
  1 sibling, 1 reply; 76+ messages in thread
From: robert.daniels at vantagecontrols.com @ 2014-03-24 17:57 UTC (permalink / raw)
  To: linux-arm-kernel



Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 03/21/2014
11:32:53 AM:

> Hmm, rt kernels.  Does this happen without the rt patches applied?

I am not using the rt patches and I still see the problem.  I'm using
3.14.0-rc6+ with
fec patches on the i.MX53 Quick Start Board with the latest U-Boot.

When I run my test I will immediately see packet loss with the eventual tx
timeout issue.

These are the steps I use to reproduce the problem:

Setup: Desktop (Ubuntu 12.04: webfs, iperf3, /srv/ftp/test.bmp ~27 MB)
       i.MX53 Quick Start Board (linux 3.14.0-rc6+ with fec patches:
iperf3, wget)

Test:

  Desktop:
    iperf3 -s -V

  i.MX53 QSB:
    ssh 1> iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000
    ssh 2> cd /tmp; while true; do date; wget
http://192.168.1.101:8000/test.bmp; rm -fv /tmp/test.bmp; done

Now for the verbose explanation, I installed webfs on my Ubuntu desktop
machine and put
a ~27 MB bmp file in the web_root directory for the i.MX53 QSB to access.
I built iperf3
and ran the server on my Ubuntu desktop machine in verbose mode.

On the i.MX53 QSB I connect with ssh and run iperf3 with the above options.
I then open
another ssh session and run the wget command from above which continuously
transfers the
test.bmp from my desktop.  As soon as I start the wget I start seeing
packet loss
from iperf3.  After a while I will also see the tx timeout.

The following is what I see from iperf3 on the desktop:

-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Time: Mon, 24 Mar 2014 16:42:13 GMT
Accepted connection from 192.168.1.169, port 48609
      Cookie: ic-ii-0.4872.965116.71cd138c70e6e36e
[  5] local 192.168.1.101 port 5201 connected to 192.168.1.169 port 56375
Starting Test: protocol: UDP, 1 streams, 64 byte blocks, omitting 0
seconds, 1000 second test
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total
Datagrams
[  5]   0.00-1.00   sec   879 KBytes  7.20 Mbits/sec  0.087 ms  0/14064
(0%)
[  5]   1.00-2.00   sec   910 KBytes  7.46 Mbits/sec  0.088 ms  0/14568
(0%)
[  5]   2.00-3.00   sec   910 KBytes  7.46 Mbits/sec  0.086 ms  0/14562
(0%)
[  5]   3.00-4.00   sec   891 KBytes  7.30 Mbits/sec  0.086 ms  0/14259
(0%)
[  5]   4.00-5.00   sec   891 KBytes  7.30 Mbits/sec  0.084 ms  0/14250
(0%)
[  5]   5.00-6.00   sec   913 KBytes  7.48 Mbits/sec  0.084 ms  0/14612
(0%)
[  5]   6.00-7.00   sec   913 KBytes  7.48 Mbits/sec  0.085 ms  0/14615
(0%)
[  5]   7.00-8.00   sec   806 KBytes  6.60 Mbits/sec  0.246 ms  0/12893
(0%)
[  5]   8.00-9.00   sec   327 KBytes  2.68 Mbits/sec  2.748 ms  1/5230
(0.019%)   <----- Start of wget transfers
[  5]   9.00-10.00  sec   294 KBytes  2.41 Mbits/sec  0.191 ms  13/4715
(0.28%)
[  5]  10.00-11.00  sec   301 KBytes  2.46 Mbits/sec  0.184 ms  11/4820
(0.23%)
[  5]  11.00-12.00  sec   351 KBytes  2.88 Mbits/sec  0.373 ms  0/5616 (0%)
[  5]  12.00-13.00  sec   289 KBytes  2.36 Mbits/sec  0.086 ms  4/4622
(0.087%)
[  5]  13.00-14.00  sec   303 KBytes  2.48 Mbits/sec  0.084 ms  18/4859
(0.37%)
[  5]  14.00-15.00  sec   369 KBytes  3.03 Mbits/sec  0.094 ms  0/5911 (0%)
[  5]  15.00-16.00  sec   283 KBytes  2.32 Mbits/sec  0.193 ms  6/4537
(0.13%)
[  5]  16.00-17.00  sec   277 KBytes  2.27 Mbits/sec  0.146 ms  12/4441
(0.27%)
[  5]  17.00-18.00  sec   366 KBytes  3.00 Mbits/sec  0.054 ms  0/5850 (0%)
[  5]  18.00-19.00  sec   287 KBytes  2.35 Mbits/sec  0.085 ms  5/4604
(0.11%)
[  5]  19.00-20.00  sec   276 KBytes  2.26 Mbits/sec  0.081 ms  10/4429
(0.23%)
[  5]  20.00-21.00  sec   362 KBytes  2.97 Mbits/sec  0.417 ms  1/5798
(0.017%)
[  5]  21.00-22.00  sec   301 KBytes  2.47 Mbits/sec  0.089 ms  6/4826
(0.12%)
[  5]  22.00-23.00  sec   278 KBytes  2.28 Mbits/sec  0.228 ms  15/4471
(0.34%)
[  5]  23.00-24.00  sec   379 KBytes  3.10 Mbits/sec  0.130 ms  7/6068
(0.12%)
[  5]  24.00-25.00  sec   318 KBytes  2.60 Mbits/sec  0.083 ms  2/5087
(0.039%)
[  5]  25.00-26.00  sec   279 KBytes  2.29 Mbits/sec  0.301 ms  11/4476
(0.25%)
[  5]  26.00-27.00  sec   350 KBytes  2.86 Mbits/sec  0.059 ms  11/5605
(0.2%)
[  5]  27.00-28.00  sec   324 KBytes  2.65 Mbits/sec  0.597 ms  1/5179
(0.019%)
[  5]  28.00-29.00  sec   284 KBytes  2.32 Mbits/sec  0.088 ms  13/4549
(0.29%)
[  5]  29.00-30.00  sec   339 KBytes  2.77 Mbits/sec  0.442 ms  12/5431
(0.22%)
[  5]  30.00-31.00  sec   317 KBytes  2.60 Mbits/sec  0.122 ms  1/5074
(0.02%)
[  5]  31.00-32.00  sec   278 KBytes  2.28 Mbits/sec  0.099 ms  10/4458
(0.22%)
[  5]  32.00-33.00  sec   328 KBytes  2.68 Mbits/sec  0.043 ms  12/5252
(0.23%)
[  5]  33.00-34.00  sec   316 KBytes  2.59 Mbits/sec  0.445 ms  1/5060
(0.02%)
[  5]  34.00-35.00  sec   277 KBytes  2.27 Mbits/sec  1.432 ms  13/4445
(0.29%)
[  5]  35.00-36.00  sec   345 KBytes  2.83 Mbits/sec  0.089 ms  12/5535
(0.22%)
[  5]  36.00-37.00  sec   360 KBytes  2.95 Mbits/sec  0.143 ms  0/5768 (0%)
[  5]  37.00-38.00  sec   292 KBytes  2.39 Mbits/sec  0.460 ms  3/4671
(0.064%)
[  5]  38.00-39.00  sec   290 KBytes  2.37 Mbits/sec  0.095 ms  14/4649
(0.3%)
[  5]  39.00-40.00  sec   353 KBytes  2.89 Mbits/sec  0.318 ms  1/5643
(0.018%)
[  5]  40.00-41.00  sec   297 KBytes  2.44 Mbits/sec  0.707 ms  3/4761
(0.063%)
[  5]  41.00-42.00  sec   270 KBytes  2.21 Mbits/sec  0.173 ms  18/4335
(0.42%)
[  5]  42.00-43.00  sec   359 KBytes  2.94 Mbits/sec  0.452 ms  1/5750
(0.017%)
[  5]  43.00-44.00  sec   294 KBytes  2.41 Mbits/sec  0.112 ms  4/4703
(0.085%)
[  5]  44.00-45.00  sec   272 KBytes  2.23 Mbits/sec  0.134 ms  18/4371
(0.41%)
[  5]  45.00-46.00  sec   385 KBytes  3.16 Mbits/sec  0.090 ms  3/6170
(0.049%)
[  5]  46.00-47.00  sec   309 KBytes  2.53 Mbits/sec  0.735 ms  2/4949
(0.04%)
[  5]  47.00-48.00  sec   288 KBytes  2.36 Mbits/sec  0.206 ms  13/4622
(0.28%)
[  5]  48.00-49.00  sec   356 KBytes  2.92 Mbits/sec  0.107 ms  4/5707
(0.07%)
[  5]  49.00-50.00  sec   292 KBytes  2.39 Mbits/sec  1.680 ms  4/4670
(0.086%)
[  5]  50.00-51.00  sec   262 KBytes  2.15 Mbits/sec  0.646 ms  12/4202
(0.29%)
[  5]  51.00-52.00  sec   349 KBytes  2.86 Mbits/sec  0.151 ms  11/5597
(0.2%)
[  5]  52.00-53.00  sec   313 KBytes  2.56 Mbits/sec  0.076 ms  3/5011
(0.06%)
[  5]  53.00-54.00  sec   278 KBytes  2.27 Mbits/sec  0.436 ms  16/4456
(0.36%)
[  5]  54.00-55.00  sec   361 KBytes  2.95 Mbits/sec  0.086 ms  3/5772
(0.052%)
[  5]  55.00-56.00  sec   312 KBytes  2.56 Mbits/sec  0.422 ms  1/4992
(0.02%)
[  5]  56.00-57.00  sec   292 KBytes  2.39 Mbits/sec  0.134 ms  14/4683
(0.3%)
[  5]  57.00-58.00  sec   344 KBytes  2.82 Mbits/sec  0.090 ms  6/5513
(0.11%)
[  5]  58.00-59.00  sec   200 KBytes  1.64 Mbits/sec  0.811 ms  1/3199
(0.031%)
[  5]  59.00-60.00  sec  0.00 Bytes  0.00 bits/sec  0.811 ms  0/0 (-nan%)
<----- Coincides with the tx timeout
[  5]  60.00-61.00  sec  0.00 Bytes  0.00 bits/sec  0.811 ms  0/0 (-nan%)
[  5]  61.00-62.00  sec  84.2 KBytes   690 Kbits/sec  0.032 ms  0/1347 (0%)
[  5]  62.00-63.00  sec   307 KBytes  2.52 Mbits/sec  0.243 ms  0/4915 (0%)
[  5]  63.00-64.00  sec   301 KBytes  2.47 Mbits/sec  0.044 ms  0/4820 (0%)
[  5]  64.00-65.00  sec   391 KBytes  3.21 Mbits/sec  0.053 ms  1/6263
(0.016%)
[  5]  65.00-66.00  sec   364 KBytes  2.98 Mbits/sec  0.431 ms  0/5828 (0%)

Here is another tx timeout dump:

Jan  1fec 63fec000.ethernet eth0: TX ring dump
Nr    SC     addr       len  SKB
 0    0x1c00 0xce485000  106 dec7bf00
 1    0x1c00 0xce485800  106 dec7bcc0
 2    0x1c00 0xce486000  106 dec7b480
 3    0x1c00 0xce486800  106 dec7b840
 4    0x1c00 0xce487000  106 dec7b3c0
 5    0x1c00 0xce487800  106 dec7b600
 6    0x1c00 0xce500000  106 dec7b780
 7    0x1c00 0xce500800  106 dec7be40
 8    0x1c00 0xce501000  106 dec7b9c0
 9    0x1c00 0xce501800  106 dec7bb40
10    0x1c00 0xce502000  106 dec7b180
11    0x1c00 0xce502800  106 de7b2540
12    0x1c00 0xce503000  106 de7b2480
13    0x1c00 0xce503800  106 de7b2900
14    0x1c00 0xce504000  106 de7b2d80
15    0x1c00 0xce504800  106 de7b23c0
16    0x1c00 0xce505000  106 de7b2cc0
17    0x1c00 0xce505800  106 de7b2000
18    0x1c00 0xce506000  106 de7b2b40
19    0x1c00 0xce506800  106 de7b2840
20    0x1c00 0xce507000  106 de7b2780
21    0x1c00 0xce507800  106 de7b2300
22    0x1c00 0xce508000  106 de7b2e40
23    0x1c00 0xce508800  106 de7b2180
24    0x1c00 0xce509000  106 de7b2c00
25    0x1c00 0xce509800  106 de7b29c0
26    0x1c00 0xce50a000  106 de7b2240
27    0x1c00 0xce50a800  106 de7b2a80
28    0x1c00 0xce50b000  106 de7b2600
29    0x1c00 0xce50b800  106 de7b20c0
30    0x1c00 0xce50c000  106 de7ea780
31    0x1c00 0xce50c800  106 de7ea480
32    0x1c00 0xce50d000  106 de7ea900
33 SH 0x1c00 0x00000000  106   (null)
34    0x9c00 0xce50e000  106 de120e40
35    0x1c00 0xce50e800  106 de120f00
36    0x1c00 0xce50f000  106 de120840
37    0x1c00 0xce50f800  106 de120600
38    0x1c00 0xce510000   66 de1209c0
39    0x1c00 0xce510800  106 de120a80
40    0x1c00 0xce511000  106 de120cc0
41    0x1c00 0xce511800   66 de120000
42    0x1c00 0xce512000  106 de1200c0
43    0x1c00 0xce512800  106 de120780
44    0x1c00 0xce513000  106 de120c00
45    0x1c00 0xce513800   66 de120b40
46    0x1c00 0xce514000  106 de120540
47    0x1c00 0xce514800  106 de120900
48    0x1c00 0xce515000   66 de120240
49    0x1c00 0xce515800  106 de120d80
50    0x1c00 0xce516000  106 de1203c0
51    0x1c00 0xce516800   66 de120480
52    0x1c00 0xce517000  106 de120300
53    0x1c00 0xce517800  106 de120180
54    0x1c00 0xce518000   66 dec7b000
55    0x1c00 0xce518800  106 dec7b300
56    0x1c00 0xce519000  106 dec7b0c0
57    0x1c00 0xce519800   66 dec7b540
58    0x1c00 0xce51a000  106 dec7bc00
59    0x1c00 0xce51a800  106 dec7ba80
60    0x1c00 0xce51b000   66 dec7b900
61    0x1c00 0xce51b800  106 dec7bd80
62    0x1c00 0xce51c000  106 dec7b240
63    0x3c00 0xce51c800  106 dec7b6c0


As the test continues to run, the tx timeout will continue to happen.  I
monitored the sequence of tx timeouts to see
if there was anything interesting about which buffer descriptors were
having problems - here is the sequence:

15, 60, 57, 54, 53, 6, 18, 47, 24, 54, 23, 9, 39, 14


This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-24 17:57                               ` robert.daniels at vantagecontrols.com
@ 2014-03-24 20:21                                 ` Marek Vasut
  2014-03-24 22:37                                   ` robert.daniels at vantagecontrols.com
  0 siblings, 1 reply; 76+ messages in thread
From: Marek Vasut @ 2014-03-24 20:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday, March 24, 2014 at 06:57:18 PM, robert.daniels at vantagecontrols.com 
wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 03/21/2014
> 
> 11:32:53 AM:
> > Hmm, rt kernels.  Does this happen without the rt patches applied?
> 
> I am not using the rt patches and I still see the problem.  I'm using
> 3.14.0-rc6+ with
> fec patches on the i.MX53 Quick Start Board with the latest U-Boot.
> 
> When I run my test I will immediately see packet loss with the eventual tx
> timeout issue.
> 
> These are the steps I use to reproduce the problem:
> 
> Setup: Desktop (Ubuntu 12.04: webfs, iperf3, /srv/ftp/test.bmp ~27 MB)
>        i.MX53 Quick Start Board (linux 3.14.0-rc6+ with fec patches:
> iperf3, wget)
> 
> Test:
> 
>   Desktop:
>     iperf3 -s -V
> 
>   i.MX53 QSB:
>     ssh 1> iperf3 -c 192.168.1.101 -u -l 64 -b 55M -V -t 1000

I think the UDP test might be lossy, can you try with TCP test please ?

Best regards,
Marek Vasut

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-24 20:21                                 ` Marek Vasut
@ 2014-03-24 22:37                                   ` robert.daniels at vantagecontrols.com
  2014-03-24 23:44                                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: robert.daniels at vantagecontrols.com @ 2014-03-24 22:37 UTC (permalink / raw)
  To: linux-arm-kernel



Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:

>
> I think the UDP test might be lossy, can you try with TCP test please ?
>
> Best regards,
> Marek Vasut

Yes, I tried with TCP and there were no problems.
I then tried my same test with a crossover cable: no problems.
Then I started testing some different network configurations:

1. Desktop & i.MX53 both connected to the same router: no problems.
2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
timeout
3. Desktop connected to router, hub connected to router, i.MX53 connected
to hub: packet loss + tx timeout

So, the packet loss is due to the hub, not the fec driver.
I'm not sure why I'm getting the tx timeout.  My router is a Cisco
ValetPlus and the hub is a Netgear DS108.
The router and desktop are gigabit and the hub is 10/100.

I assumed that the tx timeout should not be happening - is this an
incorrect assumption based on my setup?

Thanks,
- Robert Daniels

This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-24 22:37                                   ` robert.daniels at vantagecontrols.com
@ 2014-03-24 23:44                                     ` Russell King - ARM Linux
  2014-03-25  1:02                                       ` Marek Vasut
  2014-03-26  0:11                                       ` Russell King - ARM Linux
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-24 23:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 24, 2014 at 04:37:20PM -0600, robert.daniels at vantagecontrols.com wrote:
> Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> > I think the UDP test might be lossy, can you try with TCP test please ?
> >
> > Best regards,
> > Marek Vasut
> 
> Yes, I tried with TCP and there were no problems.
> I then tried my same test with a crossover cable: no problems.
> Then I started testing some different network configurations:
> 
> 1. Desktop & i.MX53 both connected to the same router: no problems.
> 2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
> timeout
> 3. Desktop connected to router, hub connected to router, i.MX53 connected
> to hub: packet loss + tx timeout
> 
> So, the packet loss is due to the hub, not the fec driver.
> I'm not sure why I'm getting the tx timeout.  My router is a Cisco
> ValetPlus and the hub is a Netgear DS108.
> The router and desktop are gigabit and the hub is 10/100.
> 
> I assumed that the tx timeout should not be happening - is this an
> incorrect assumption based on my setup?

The first point is that the timeout should not be happening, because
when the link is up and running, packets should be transmitted.

The second point is that I think you have a setup which tickles a bug
in the FEC hardware.  What I think is going on is that the driver is
filling the transmit ring correctly.  For some reason, the FEC is
either failing to update the header after transmission leaving the
buffer owned by the FEC (and therefore unable to be reaped), or the
FEC is skipping over a ring descriptor.

As I hinted in one of my previous replies, I think the only way we're
truely going to work out what's going on here is to come up with some
way to mark each packet that is transmitted with where it was in the
ring, so a tcpdump session on another machine can be used to inspect
which packets from the ring were actually transmitted on the wire.

Now, you mention packet loss above.  Even on half-duplex networks, I
would not expect there to be much packet loss because of the
retransmissions after a collision, and 100Mbit networks handle this
much better than 10Mbit.  It takes repeated collisions to cause a
packet to actually be dropped (and a packet being dropped because
of collisions should be logged by the transmitting host in its
interface statistics.)

Another possibility is that the FEC isn't properly negotiating with
the hub, and the two ends are ending up configured differently, or
that it is negotiating properly but the FEC isn't being configured
correctly.

Now that we know a little more about your setup, including that it
seems to be one specific setup which is causing the problems, I'll
do some more testing - I have an old Allied Telesys 10/100 hub here
which I'll try putting the iMX6 on and re-run your tests.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-24 23:44                                     ` Russell King - ARM Linux
@ 2014-03-25  1:02                                       ` Marek Vasut
  2014-03-25 23:10                                         ` Russell King - ARM Linux
  2014-03-26  0:11                                       ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: Marek Vasut @ 2014-03-25  1:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday, March 25, 2014 at 12:44:43 AM, Russell King - ARM Linux wrote:
> On Mon, Mar 24, 2014 at 04:37:20PM -0600, robert.daniels at vantagecontrols.com 
wrote:
> > Marek Vasut <marex@denx.de> wrote on 03/24/2014 02:21:58 PM:
> > > I think the UDP test might be lossy, can you try with TCP test please ?
> > > 
> > > Best regards,
> > > Marek Vasut
> > 
> > Yes, I tried with TCP and there were no problems.
> > I then tried my same test with a crossover cable: no problems.
> > Then I started testing some different network configurations:
> > 
> > 1. Desktop & i.MX53 both connected to the same router: no problems.
> > 2. Desktop & i.MX53 both connected to the same hub: packet loss but no tx
> > timeout
> > 3. Desktop connected to router, hub connected to router, i.MX53 connected
> > to hub: packet loss + tx timeout
> > 
> > So, the packet loss is due to the hub, not the fec driver.
> > I'm not sure why I'm getting the tx timeout.  My router is a Cisco
> > ValetPlus and the hub is a Netgear DS108.
> > The router and desktop are gigabit and the hub is 10/100.
> > 
> > I assumed that the tx timeout should not be happening - is this an
> > incorrect assumption based on my setup?
> 
> The first point is that the timeout should not be happening, because
> when the link is up and running, packets should be transmitted.
> 
> The second point is that I think you have a setup which tickles a bug
> in the FEC hardware.  What I think is going on is that the driver is
> filling the transmit ring correctly.  For some reason, the FEC is
> either failing to update the header after transmission leaving the
> buffer owned by the FEC (and therefore unable to be reaped), or the
> FEC is skipping over a ring descriptor.

I think we might be seeing something similar on i.MX28 in U-Boot too, but I dare 
not to correlate this and the in-kernel FEC issue. We also sporadically see the 
FEC not transmitting a packet. I am just raising this since I want to update you 
all there is something like that as well.

[...]

Best regards,
Marek Vasut

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-25  1:02                                       ` Marek Vasut
@ 2014-03-25 23:10                                         ` Russell King - ARM Linux
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-25 23:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 25, 2014 at 02:02:42AM +0100, Marek Vasut wrote:
> I think we might be seeing something similar on i.MX28 in U-Boot too,
> but I dare not to correlate this and the in-kernel FEC issue. We also
> sporadically see the FEC not transmitting a packet.

That could be the ERR006358 problem.

This occurs when you've transmitted a packet previously, and the
indicator in FEC_X_DES_ACTIVE remains set at the point in time when
you load the next packet.  The subsequent write to FEC_X_DES_ACTIVE
has no effect on the TX side, which has already shut down.

The work-around for that which is in the kernel is to look at the
previous packet status: if that indicates it has been sent immediately
after setting everything up for the next packet (remembering to take
care of the store buffers between the CPU and memory), then you need to
re-ping the transmitter after a short delay.

(I've cocked that work-around up today... it can happen quite frequently
depending on your workload.)

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-24 23:44                                     ` Russell King - ARM Linux
  2014-03-25  1:02                                       ` Marek Vasut
@ 2014-03-26  0:11                                       ` Russell King - ARM Linux
  2014-04-01  9:26                                         ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-26  0:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
> Now, you mention packet loss above.  Even on half-duplex networks, I
> would not expect there to be much packet loss because of the
> retransmissions after a collision, and 100Mbit networks handle this
> much better than 10Mbit.  It takes repeated collisions to cause a
> packet to actually be dropped (and a packet being dropped because
> of collisions should be logged by the transmitting host in its
> interface statistics.)
> 
> Another possibility is that the FEC isn't properly negotiating with
> the hub, and the two ends are ending up configured differently, or
> that it is negotiating properly but the FEC isn't being configured
> correctly.
> 
> Now that we know a little more about your setup, including that it
> seems to be one specific setup which is causing the problems, I'll
> do some more testing - I have an old Allied Telesys 10/100 hub here
> which I'll try putting the iMX6 on and re-run your tests.

PC --- 100Mbit hub --- iMX6S

iperf (2.0.5) -c PC -r reports:

iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec

However, there's a marked difference when the iMX6 sends - the
collision LED is almost permanently on at full brightness, whereas
when the PC is sending, it's almost always off.

I'm seeing lots of late collisions and CRC errors in the statistics
registers.

It looks to me like the iMX6's transmitter is blathering at any moment,
not taking any notice whether there's already a carrier on the RX side.
I'm not yet sure what to make of this yet.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings
  2014-03-18 11:22     ` *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings Russell King - ARM Linux
@ 2014-03-28 12:51       ` Maxime Coquelin
  2014-03-28 13:02         ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Maxime Coquelin @ 2014-03-28 12:51 UTC (permalink / raw)
  To: linux-arm-kernel



On 03/18/2014 12:22 PM, Russell King - ARM Linux wrote:
> On Mon, Mar 17, 2014 at 09:00:03AM -0500, Rob Herring wrote:
>> On Sun, Mar 16, 2014 at 10:29 AM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>>> static struct l2x0_aux prima2_l2x0_aux __initconst = {
>>>          .val = 2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT,
>>>          .mask = 0,
>>> };
>>>
>>> static struct l2x0_aux marco_l2x0_aux __initconst = {
>>>          .val = (2 << L2X0_AUX_CTRL_WAY_SIZE_SHIFT) |
>>>                  (1 << L2X0_AUX_CTRL_ASSOCIATIVITY_SHIFT),
>>>          .mask = L2X0_AUX_CTRL_MASK,
>>
>> What I like about this is all the bits they are clearing and failing
>> to set. Shared override, prefetch, etc.
>
> It gets more fun when you see the full picture.  Here's a table of the
> auxctrl settings that everyone uses.  The "good" ones which leave the
> auxctrl register value alone are listed at the top.
>
>                                                          DT val      mask
> arch/arm/mach-bcm/board_bcm281xx.c:                     Y  00000000 ffffffff
> arch/arm/mach-highbank/highbank.c:                      Y  00000000 ffffffff
> arch/arm/mach-imx/mach-vf610.c:                         Y  00000000 ffffffff
> arch/arm/mach-imx/system.c:                             Y  00000000 ffffffff
> arch/arm/mach-mvebu/armada-370-xp.c:                    Y  00000000 ffffffff
> arch/arm/mach-rockchip/rockchip.c:                      Y  00000000 ffffffff
> arch/arm/mach-socfpga/socfpga.c:                        Y  00000000 ffffffff
> arch/arm/mach-berlin/berlin.c:                          Y  70c00000 feffffff
> arch/arm/mach-cns3xxx/core.c:                           N  00540000 fe000fff
> arch/arm/mach-exynos/common.c:                          Y  7c470001 c200ffff
> arch/arm/mach-imx/mm-imx3.c:                            N  00030024 00000000
> arch/arm/mach-nomadik/cpu-8815.c:                       Y  00730249 fe000fff
> arch/arm/mach-omap2/omap4-common.c:                     Y  aux_ctrl aux_mask
> arch/arm/mach-omap2/omap4-common.c:                     N  aux_ctrl aux_mask
> arch/arm/mach-prima2/l2x0.c:                            Y  aux->val aux->mask
> arch/arm/mach-realview/realview_pbx.c:                  N  02520000 c0000fff
> arch/arm/mach-realview/realview_pb1176.c:               N  00730000 fe000fff
> arch/arm/mach-realview/realview_eb.c:                   N  00790000 fe000fff
> arch/arm/mach-realview/realview_pb11mp.c:               N  00790000 fe000fff
> arch/arm/mach-shmobile/setup-r8a7779.c:                 N  40470000 82000fff
> arch/arm/mach-shmobile/board-kzm9g.c:                   N  40460000 82000fff
> arch/arm/mach-shmobile/board-kzm9g-reference.c:         N  40460000 82000fff
> arch/arm/mach-shmobile/board-armadillo800eva.c:         N  40440000 82000fff
> arch/arm/mach-shmobile/board-armadillo800eva-reference.c:N 40440000 82000fff
> arch/arm/mach-shmobile/setup-r8a7778.c:                 N  40470000 82000fff
> arch/arm/mach-spear/spear13xx.c:                        N  70a60001 fe00ffff
> arch/arm/mach-sti/board-dt.c:                           Y  30480000 c0000fff

For STiH416 SoC, the reset value of the AUX_CTRL register is 0x02000000.
So bits 19:17 = 0, whereas the expected value is bits 19:17 = 4.

IIRC, for the other STi SoCs, the reset value is the good one.

Note that on STi platforms, the kernel runs in Secure mode, so it can 
write this register directly, without having to issue SMC call.

Regards,
Maxime

^ permalink raw reply	[flat|nested] 76+ messages in thread

* *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings
  2014-03-28 12:51       ` Maxime Coquelin
@ 2014-03-28 13:02         ` Russell King - ARM Linux
  2014-03-28 13:32           ` Maxime Coquelin
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-03-28 13:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 28, 2014 at 01:51:27PM +0100, Maxime Coquelin wrote:
> On 03/18/2014 12:22 PM, Russell King - ARM Linux wrote:
>> arch/arm/mach-sti/board-dt.c:                           Y  30480000 c0000fff
>
> For STiH416 SoC, the reset value of the AUX_CTRL register is 0x02000000.
> So bits 19:17 = 0, whereas the expected value is bits 19:17 = 4.

Which L2 cache is it?  210/220/310 ?

Thanks.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings
  2014-03-28 13:02         ` Russell King - ARM Linux
@ 2014-03-28 13:32           ` Maxime Coquelin
  0 siblings, 0 replies; 76+ messages in thread
From: Maxime Coquelin @ 2014-03-28 13:32 UTC (permalink / raw)
  To: linux-arm-kernel



On 03/28/2014 02:02 PM, Russell King - ARM Linux wrote:
> On Fri, Mar 28, 2014 at 01:51:27PM +0100, Maxime Coquelin wrote:
>> On 03/18/2014 12:22 PM, Russell King - ARM Linux wrote:
>>> arch/arm/mach-sti/board-dt.c:                           Y  30480000 c0000fff
>>
>> For STiH416 SoC, the reset value of the AUX_CTRL register is 0x02000000.
>> So bits 19:17 = 0, whereas the expected value is bits 19:17 = 4.
>
> Which L2 cache is it?  210/220/310 ?
It is the 310.

Regards,
Maxime

>
> Thanks.
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-03-26  0:11                                       ` Russell King - ARM Linux
@ 2014-04-01  9:26                                         ` Russell King - ARM Linux
  2014-04-01 14:00                                           ` Eric Nelson
  2014-04-01 19:38                                           ` robert.daniels at vantagecontrols.com
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-01  9:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 26, 2014 at 12:11:35AM +0000, Russell King - ARM Linux wrote:
> On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
> > Now, you mention packet loss above.  Even on half-duplex networks, I
> > would not expect there to be much packet loss because of the
> > retransmissions after a collision, and 100Mbit networks handle this
> > much better than 10Mbit.  It takes repeated collisions to cause a
> > packet to actually be dropped (and a packet being dropped because
> > of collisions should be logged by the transmitting host in its
> > interface statistics.)
> > 
> > Another possibility is that the FEC isn't properly negotiating with
> > the hub, and the two ends are ending up configured differently, or
> > that it is negotiating properly but the FEC isn't being configured
> > correctly.
> > 
> > Now that we know a little more about your setup, including that it
> > seems to be one specific setup which is causing the problems, I'll
> > do some more testing - I have an old Allied Telesys 10/100 hub here
> > which I'll try putting the iMX6 on and re-run your tests.
> 
> PC --- 100Mbit hub --- iMX6S
> 
> iperf (2.0.5) -c PC -r reports:
> 
> iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
> iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec
> 
> However, there's a marked difference when the iMX6 sends - the
> collision LED is almost permanently on at full brightness, whereas
> when the PC is sending, it's almost always off.
> 
> I'm seeing lots of late collisions and CRC errors in the statistics
> registers.
> 
> It looks to me like the iMX6's transmitter is blathering at any moment,
> not taking any notice whether there's already a carrier on the RX side.
> I'm not yet sure what to make of this yet.

Last night, I performed a different test:

PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
           |
  (the rest of my network)

This shows that all the (late) collisions occur in the 10Mbit domain,
and very few in the 100Mbit domain, which puts the blame fairly
squarely on the iMX6 side.

I can see nothing wrong with the setup of the iMX6, nor of the AR8035
phy which my board has.  Both the phy and the FEC appear to correctly
indicate that they are configured for half-duplex with flow control
disabled, and I've been through the iomux settings for the RGMII
interface.

So, I'm left with three possibilities:
- the AR8035 doesn't work with half-duplex
- there is something wrong with the signalling for carrier sense between
  the AR8035 and the FEC.
- the iMX6 FEC doesn't work with half-duplex

It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
them, so I can't really say what's at fault at the moment, but from what
I can see, my hardware fails to work correctly with half-duplex
connections.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01  9:26                                         ` Russell King - ARM Linux
@ 2014-04-01 14:00                                           ` Eric Nelson
  2014-04-01 17:29                                             ` Russell King - ARM Linux
  2014-04-01 19:38                                           ` robert.daniels at vantagecontrols.com
  1 sibling, 1 reply; 76+ messages in thread
From: Eric Nelson @ 2014-04-01 14:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On 04/01/2014 02:26 AM, Russell King - ARM Linux wrote:
> On Wed, Mar 26, 2014 at 12:11:35AM +0000, Russell King - ARM Linux wrote:
>> On Mon, Mar 24, 2014 at 11:44:43PM +0000, Russell King - ARM Linux wrote:
>>> Now, you mention packet loss above.  Even on half-duplex networks, I
>>> would not expect there to be much packet loss because of the
>>> retransmissions after a collision, and 100Mbit networks handle this
>>> much better than 10Mbit.  It takes repeated collisions to cause a
>>> packet to actually be dropped (and a packet being dropped because
>>> of collisions should be logged by the transmitting host in its
>>> interface statistics.)
>>>
>>> Another possibility is that the FEC isn't properly negotiating with
>>> the hub, and the two ends are ending up configured differently, or
>>> that it is negotiating properly but the FEC isn't being configured
>>> correctly.
>>>
>>> Now that we know a little more about your setup, including that it
>>> seems to be one specific setup which is causing the problems, I'll
>>> do some more testing - I have an old Allied Telesys 10/100 hub here
>>> which I'll try putting the iMX6 on and re-run your tests.
>>
>> PC --- 100Mbit hub --- iMX6S
>>
>> iperf (2.0.5) -c PC -r reports:
>>
>> iMX6 Tx: [  5]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
>> iMX6 Rx: [  3]  0.0-10.0 sec   103 MBytes  86.2 Mbits/sec
>>
>> However, there's a marked difference when the iMX6 sends - the
>> collision LED is almost permanently on at full brightness, whereas
>> when the PC is sending, it's almost always off.
>>
>> I'm seeing lots of late collisions and CRC errors in the statistics
>> registers.
>>
>> It looks to me like the iMX6's transmitter is blathering at any moment,
>> not taking any notice whether there's already a carrier on the RX side.
>> I'm not yet sure what to make of this yet.
>
> Last night, I performed a different test:
>
> PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
>             |
>    (the rest of my network)
>
> This shows that all the (late) collisions occur in the 10Mbit domain,
> and very few in the 100Mbit domain, which puts the blame fairly
> squarely on the iMX6 side.
>
> I can see nothing wrong with the setup of the iMX6, nor of the AR8035
> phy which my board has.  Both the phy and the FEC appear to correctly
> indicate that they are configured for half-duplex with flow control
> disabled, and I've been through the iomux settings for the RGMII
> interface.
>
> So, I'm left with three possibilities:
> - the AR8035 doesn't work with half-duplex
> - there is something wrong with the signalling for carrier sense between
>    the AR8035 and the FEC.
> - the iMX6 FEC doesn't work with half-duplex
>
> It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
> them, so I can't really say what's at fault at the moment, but from what
> I can see, my hardware fails to work correctly with half-duplex
> connections.
>

What value do you have for pad control register
IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?

I notice in some of the DTS files that this has a value of 0x100b0,
which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 14:00                                           ` Eric Nelson
@ 2014-04-01 17:29                                             ` Russell King - ARM Linux
  2014-04-01 18:11                                               ` Eric Nelson
  2014-04-02  2:26                                               ` fugang.duan at freescale.com
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-01 17:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
> Hi Russell,
>
> What value do you have for pad control register
> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?

You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's
0x20e06bc.  I have value 0x1b030.

> I notice in some of the DTS files that this has a value of 0x100b0,
> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).

Yes, bit 7 should not be set as it's reserved, but this is ignored.
Setting the hysteresis bit affects the input thresholds, and this signal
is only ever used as an output.  While selecting HYS mode for an output
may seem odd, that should not have any effect here.

In any case, TX_CTL doesn't have much to do with it.  I've fully proved
that the interface works fine under full duplex conditions, achieving
500Mbps transmit and 570-600Mbps receive with TCP connections (which is
nicely beyond what freescale claim the hardware is capable of.)

The phy needs TX_CTL asserted in order to transmit anything, so as it
works in full duplex mode, this must be happening correctly.

The FEC needs RX_CTL asserted to indicate that valid data is being
received by the phy - if this was not being recognised by the FEC, full
duplex mode wouldn't work.

The last scenario is the error signals encoded onto TX_CTL/RX_CTL.  This
is one area I can't comment on yet - monitoring these signals is not
exactly easy (and requires two steady hands.)  However, I have briefly
monitored these signals when running at 100mbit half-duplex. With the
scope RX_CTL and triggering on it, I do see TX_CTL being raised by the
FEC.

Whether TX_CTL is raised by the FEC just before or after RX_CTL has been
asserted I can't say yet.

However, TX_CTL raised while RX_CTL is raised or toggling means that a
collision has occurred.  TX_CTL should never be raised while RX_CTL
indicates that there is a "carrier" present in half-duplex mode (and
therefore there is already another ethernet station transmitting on
the wire.)

Remember, I'm seeing late collisions.  The only way that happens is if
something starts transmitting after 64 bytes into an ethernet frame, and
it implies that something is not noticing that the link is busy before it
starts blurting out.

I know that my hubs are fine - the 10mbit hub has been used over _many_
years (it's more than 20 years old now) and has always been rock solid...
unlike the 10/100 hub with a dodgy switch between the two segments which
I can lock out the 10 mbit segment with enough 100mbit traffic.

Let's also not forget that these signals run at 125MHz at gigabit speeds,
and 2.5MHz at 10mbit...  if they were marginal at 10mbit, gigabit
certainly would not work.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 17:29                                             ` Russell King - ARM Linux
@ 2014-04-01 18:11                                               ` Eric Nelson
  2014-04-02  2:26                                               ` fugang.duan at freescale.com
  1 sibling, 0 replies; 76+ messages in thread
From: Eric Nelson @ 2014-04-01 18:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/01/2014 10:29 AM, Russell King - ARM Linux wrote:
> On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
>> Hi Russell,
>>
>> What value do you have for pad control register
>> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?
>
> You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's
> 0x20e06bc.  I have value 0x1b030.
>

Right. I missed that you were running on Solo.

>> I notice in some of the DTS files that this has a value of 0x100b0,
>> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
>> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
>
> Yes, bit 7 should not be set as it's reserved, but this is ignored.
> Setting the hysteresis bit affects the input thresholds, and this signal
> is only ever used as an output.  While selecting HYS mode for an output
> may seem odd, that should not have any effect here.
>
> In any case, TX_CTL doesn't have much to do with it.  I've fully proved
> that the interface works fine under full duplex conditions, achieving
> 500Mbps transmit and 570-600Mbps receive with TCP connections (which is
> nicely beyond what freescale claim the hardware is capable of.)
>
> The phy needs TX_CTL asserted in order to transmit anything, so as it
> works in full duplex mode, this must be happening correctly.
>
It appears I'm gender-challenged and confused TX_CTL and RX_CTL.

My thought (and question) was whether the pad setup for "transmit
enable" to the i.MX is either being missed entirely, or arriving
late because of an internal pullup/pull down or other pad
misconfiguration.

The pad settings have every indication of being cut/pasted, which
made me question them.

> The FEC needs RX_CTL asserted to indicate that valid data is being
> received by the phy - if this was not being recognised by the FEC, full
> duplex mode wouldn't work.
>
> The last scenario is the error signals encoded onto TX_CTL/RX_CTL.  This
> is one area I can't comment on yet - monitoring these signals is not
> exactly easy (and requires two steady hands.)  However, I have briefly
> monitored these signals when running at 100mbit half-duplex. With the
> scope RX_CTL and triggering on it, I do see TX_CTL being raised by the
> FEC.
>
> Whether TX_CTL is raised by the FEC just before or after RX_CTL has been
> asserted I can't say yet.
>
> However, TX_CTL raised while RX_CTL is raised or toggling means that a
> collision has occurred.  TX_CTL should never be raised while RX_CTL
> indicates that there is a "carrier" present in half-duplex mode (and
> therefore there is already another ethernet station transmitting on
> the wire.)
>
> Remember, I'm seeing late collisions.  The only way that happens is if
> something starts transmitting after 64 bytes into an ethernet frame, and
> it implies that something is not noticing that the link is busy before it
> starts blurting out.
>
I guess we can rule out slew rate.

> I know that my hubs are fine - the 10mbit hub has been used over _many_
> years (it's more than 20 years old now) and has always been rock solid...
> unlike the 10/100 hub with a dodgy switch between the two segments which
> I can lock out the 10 mbit segment with enough 100mbit traffic.
>
> Let's also not forget that these signals run at 125MHz at gigabit speeds,
> and 2.5MHz at 10mbit...  if they were marginal at 10mbit, gigabit
> certainly would not work.
>

Based on this, I don't see how a badly-configured pad could produce the
symptoms you're seeing.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01  9:26                                         ` Russell King - ARM Linux
  2014-04-01 14:00                                           ` Eric Nelson
@ 2014-04-01 19:38                                           ` robert.daniels at vantagecontrols.com
  2014-04-01 22:51                                             ` Russell King - ARM Linux
  1 sibling, 1 reply; 76+ messages in thread
From: robert.daniels at vantagecontrols.com @ 2014-04-01 19:38 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
03:26:38 AM:

>
> Last night, I performed a different test:
>
> PC --- Gigabit switch --- 10/100Mbit hub --- 10Mbit hub --- iMX6S
>            |
>   (the rest of my network)
>
> This shows that all the (late) collisions occur in the 10Mbit domain,
> and very few in the 100Mbit domain, which puts the blame fairly
> squarely on the iMX6 side.
>
> I can see nothing wrong with the setup of the iMX6, nor of the AR8035
> phy which my board has.  Both the phy and the FEC appear to correctly
> indicate that they are configured for half-duplex with flow control
> disabled, and I've been through the iomux settings for the RGMII
> interface.
>
> So, I'm left with three possibilities:
> - the AR8035 doesn't work with half-duplex
> - there is something wrong with the signalling for carrier sense between
>   the AR8035 and the FEC.
> - the iMX6 FEC doesn't work with half-duplex
>
> It's not easy to monitor the TX_CTL and RX_CTL signals and make sense of
> them, so I can't really say what's at fault at the moment, but from what
> I can see, my hardware fails to work correctly with half-duplex
> connections.

I'm not sure where this factors in, but I originally saw this problem using
the
Freescale 2.6.35 kernel.  The driver there exhibits this problem
differently,
although it could very well be a different problem.  What I observed was
that
when the FEC got into this bad state the driver would attempt to transmit a
socket buffer but for some reason the buffer would not actually get
transmitted.
The driver would continue transmitting packets until it got all the way
around in the ring buffer to the buffer descriptor right before the one
that
was never transmitted.  When this buffer descriptor was set to transmit
you'd
get a double transmit - the new packet and the previously untransmitted
buffer.
This results in out-of-order packets being sent directly from the i.MX53.

To illustrate this, I got my i.MX53 into this bad state and then ran ping
with the
following results:

PING 192.168.1.101 (192.168.1.101) 56(84) bytes of data.
64 bytes from 192.168.1.101: icmp_seq=1 ttl=64 time=6.69 ms
64 bytes from 192.168.1.101: icmp_seq=2 ttl=64 time=0.306 ms
64 bytes from 192.168.1.101: icmp_seq=4 ttl=64 time=0.314 ms
64 bytes from 192.168.1.101: icmp_seq=5 ttl=64 time=0.325 ms
64 bytes from 192.168.1.101: icmp_seq=6 ttl=64 time=0.351 ms
64 bytes from 192.168.1.101: icmp_seq=7 ttl=64 time=0.323 ms
64 bytes from 192.168.1.101: icmp_seq=9 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=10 ttl=64 time=0.319 ms
64 bytes from 192.168.1.101: icmp_seq=11 ttl=64 time=0.353 ms
64 bytes from 192.168.1.101: icmp_seq=12 ttl=64 time=0.321 ms
64 bytes from 192.168.1.101: icmp_seq=13 ttl=64 time=0.308 ms
64 bytes from 192.168.1.101: icmp_seq=14 ttl=64 time=0.329 ms
64 bytes from 192.168.1.101: icmp_seq=15 ttl=64 time=0.335 ms
64 bytes from 192.168.1.101: icmp_seq=16 ttl=64 time=0.316 ms
64 bytes from 192.168.1.101: icmp_seq=17 ttl=64 time=0.326 ms
64 bytes from 192.168.1.101: icmp_seq=3 ttl=64 time=14006 ms
64 bytes from 192.168.1.101: icmp_seq=19 ttl=64 time=0.323 ms
64 bytes from 192.168.1.101: icmp_seq=20 ttl=64 time=0.314 ms
64 bytes from 192.168.1.101: icmp_seq=21 ttl=64 time=0.310 ms
64 bytes from 192.168.1.101: icmp_seq=22 ttl=64 time=0.317 ms
64 bytes from 192.168.1.101: icmp_seq=23 ttl=64 time=0.331 ms
64 bytes from 192.168.1.101: icmp_seq=8 ttl=64 time=15000 ms
64 bytes from 192.168.1.101: icmp_seq=25 ttl=64 time=0.322 ms
64 bytes from 192.168.1.101: icmp_seq=26 ttl=64 time=0.333 ms
64 bytes from 192.168.1.101: icmp_seq=27 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=28 ttl=64 time=0.337 ms
64 bytes from 192.168.1.101: icmp_seq=29 ttl=64 time=0.335 ms
64 bytes from 192.168.1.101: icmp_seq=30 ttl=64 time=0.325 ms
64 bytes from 192.168.1.101: icmp_seq=31 ttl=64 time=0.307 ms
64 bytes from 192.168.1.101: icmp_seq=32 ttl=64 time=0.333 ms
64 bytes from 192.168.1.101: icmp_seq=18 ttl=64 time=14006 ms
64 bytes from 192.168.1.101: icmp_seq=34 ttl=64 time=0.330 ms
64 bytes from 192.168.1.101: icmp_seq=35 ttl=64 time=0.342 ms

Here you can see that icmp_seq=3 wasn't replied to until after
icmp_seq=17 was sent.

Once these buffer descriptors get into this state, they stay
that way until the FEC is reset.

I don't see this exact behavior when I run the test with the 3.14
kernel but I'm starting to wonder if it's because the 3.14 kernel
is receiving the transmit timeout which allows the driver to reset
itself and start working again whereas the 2.6.35 driver is not.

I hope this additional information is useful, I don't know enough
about these low-level networking details to contribute much but
it's possible that what I've seen in the 2.6.35 kernel is actually
the same issue that I'm seeing in the 3.14 kernel but handled
better.

Thanks,

Robert Daniels

This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 19:38                                           ` robert.daniels at vantagecontrols.com
@ 2014-04-01 22:51                                             ` Russell King - ARM Linux
  2014-04-02  0:37                                               ` robert.daniels at vantagecontrols.com
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-01 22:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 01, 2014 at 01:38:37PM -0600, robert.daniels at vantagecontrols.com wrote:
> I'm not sure where this factors in, but I originally saw this problem using
> the Freescale 2.6.35 kernel.  The driver there exhibits this problem
> differently, although it could very well be a different problem. What
> I observed was that when the FEC got into this bad state the driver would
> attempt to transmit a socket buffer but for some reason the buffer would
> not actually get transmitted.
>
> The driver would continue transmitting packets until it got all the way
> around in the ring buffer to the buffer descriptor right before the one
> that was never transmitted.  When this buffer descriptor was set to
> transmit you'd get a double transmit - the new packet and the previously
> untransmitted buffer.
>
> This results in out-of-order packets being sent directly from the i.MX53.

At initial glance, this is coherent with my idea of the FEC skipping a
ring entry on the initial pass around.  Then when a new entry is loaded,

Let's say that the problem entry is number 12 that has been skipped.
When we get back around to entry 11, the FEC will transmit entries 11
and 12, as you rightly point out, and it will then look at entry 13
for the next packet.

However, the driver loads the next packet into entry 12, and hits the
FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
does nothing.

Then the next packet is submitted to the driver, and it enters it into
entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
meanwhile the entry at 12 is still pending.

> I hope this additional information is useful, I don't know enough
> about these low-level networking details to contribute much but
> it's possible that what I've seen in the 2.6.35 kernel is actually
> the same issue that I'm seeing in the 3.14 kernel but handled
> better.

It confirms the theory, but doesn't really provide much clues for a
solution at the moment.

However, I've had something of a breakthrough with iMX6 and half-duplex.
I think much of the problem comes down to this ERR006358 workaround
implemented in the driver (this apparantly doesn't affect your device.)
The delayed work implementation, and my delayed timer implementation of
the same are fundamentally wrong to the erratum documentation - as is
the version implemented in the Freescale BSP.

Implementing what the erratum says as an acceptable workaround improves
things tremendously - I see iperf on a 10Mbit hub go from 1-2Mbps up to
8Mbps, though still with loads of collisions.  That said, I'm not that
trusting of the error bits indicated from the FEC.

The reason I mention it here is that I wonder if less wacking of the
FEC_X_DES_ACTIVE register may help your problem.

In 3.14, in the fec_enet_start_xmit function, find the
"writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:

	wmb();

        /* Trigger transmission start */
        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
                writel(0, fep->hwp + FEC_X_DES_ACTIVE);

and see whether that helps your problem(s).

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 22:51                                             ` Russell King - ARM Linux
@ 2014-04-02  0:37                                               ` robert.daniels at vantagecontrols.com
  2014-04-02  3:19                                               ` fugang.duan at freescale.com
  2014-04-04 20:21                                               ` robert.daniels at vantagecontrols.com
  2 siblings, 0 replies; 76+ messages in thread
From: robert.daniels at vantagecontrols.com @ 2014-04-02  0:37 UTC (permalink / raw)
  To: linux-arm-kernel


Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
04:51:49 PM:

> At initial glance, this is coherent with my idea of the FEC skipping a
> ring entry on the initial pass around.  Then when a new entry is loaded,
>
> Let's say that the problem entry is number 12 that has been skipped.
> When we get back around to entry 11, the FEC will transmit entries 11
> and 12, as you rightly point out, and it will then look at entry 13
> for the next packet.
>
> However, the driver loads the next packet into entry 12, and hits the
> FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
> does nothing.
>
> Then the next packet is submitted to the driver, and it enters it into
> entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
> meanwhile the entry at 12 is still pending.
>
> > I hope this additional information is useful, I don't know enough
> > about these low-level networking details to contribute much but
> > it's possible that what I've seen in the 2.6.35 kernel is actually
> > the same issue that I'm seeing in the 3.14 kernel but handled
> > better.
>
> It confirms the theory, but doesn't really provide much clues for a
> solution at the moment.

So according to this theory, if we can detect this situation we should be
able to skip entry 12 and then everything should get back into sink?  I did
a little test where I marked '12' as needing to be skipped and then the
next
time it was to be used I skipped it... and I got an out of order packet but
then things got back on track - that's the good news.

Unfortunately, I think there must be something wrong with my hack because
after running for a while longer ethernet stops working entirely.  Not sure
why but I'll see if I can figure that out.  This was on my 2.6.35 kernel.

> However, I've had something of a breakthrough with iMX6 and half-duplex.
> I think much of the problem comes down to this ERR006358 workaround
> implemented in the driver (this apparantly doesn't affect your device.)
> The delayed work implementation, and my delayed timer implementation of
> the same are fundamentally wrong to the erratum documentation - as is
> the version implemented in the Freescale BSP.
>
> Implementing what the erratum says as an acceptable workaround improves
> things tremendously - I see iperf on a 10Mbit hub go from 1-2Mbps up to
> 8Mbps, though still with loads of collisions.  That said, I'm not that
> trusting of the error bits indicated from the FEC.
>
> The reason I mention it here is that I wonder if less wacking of the
> FEC_X_DES_ACTIVE register may help your problem.
>
> In 3.14, in the fec_enet_start_xmit function, find the
> "writel(0, fep->hwp + FEC_X_DES_ACTIVE);" and change it to:
>
>    wmb();
>
>         /* Trigger transmission start */
>         if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>                 writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>
> and see whether that helps your problem(s).

I tried less wacking and I still have the same tx transmit timeout issue.

Thanks,

Robert Daniels


This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 17:29                                             ` Russell King - ARM Linux
  2014-04-01 18:11                                               ` Eric Nelson
@ 2014-04-02  2:26                                               ` fugang.duan at freescale.com
  1 sibling, 0 replies; 76+ messages in thread
From: fugang.duan at freescale.com @ 2014-04-02  2:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/02/2014 1:30 AM, Russell King - ARM Linux wrote:
>On Tue, Apr 01, 2014 at 07:00:29AM -0700, Eric Nelson wrote:
>> Hi Russell,
>>
>> What value do you have for pad control register
>> IOMUXC_SW_PAD_CTL_PAD_RGMII_TX_CTL (0x020E0388)?
>
>You're referring to the iMX6Q, the Hummingboard is iMX6S, so that's 0x20e06bc.
>I have value 0x1b030.
>
>> I notice in some of the DTS files that this has a value of 0x100b0,
>> which seems wrong. Bit 16 seems to be a HYSteresis bit, and enabling
>> that seems wrong (also bit 7 seems wrong, as it's undefined in the RM).
>
>Yes, bit 7 should not be set as it's reserved, but this is ignored.
>Setting the hysteresis bit affects the input thresholds, and this signal is
>only ever used as an output.  While selecting HYS mode for an output may seem
>odd, that should not have any effect here.
>
>In any case, TX_CTL doesn't have much to do with it.  I've fully proved that
>the interface works fine under full duplex conditions, achieving 500Mbps
>transmit and 570-600Mbps receive with TCP connections (which is nicely beyond
>what freescale claim the hardware is capable of.)
>
Hi, Russell,

Since imx6q/dl/s sillicon enet have bandwidth issue, the bandwidth is 400 ~ 700 Mbps which is supplied by soc arichitecturer.
So we claim our enet performance is 460Mbps.
In fact, for our kernel 3.0.35 serial release,  kernel 3.10.17 release:
TCP tx: 470Mbps (stable)
TCP rx: 600Mbps (stable)
 
[...]

Thanks,
Andy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 22:51                                             ` Russell King - ARM Linux
  2014-04-02  0:37                                               ` robert.daniels at vantagecontrols.com
@ 2014-04-02  3:19                                               ` fugang.duan at freescale.com
  2014-04-02  8:59                                                 ` Russell King - ARM Linux
  2014-04-04 20:21                                               ` robert.daniels at vantagecontrols.com
  2 siblings, 1 reply; 76+ messages in thread
From: fugang.duan at freescale.com @ 2014-04-02  3:19 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King wrote on 04/01/2014 04:51:49 PM:
>On Tue, Apr 01, 2014 at 01:38:37PM -0600, robert.daniels at vantagecontrols.com
>wrote:
>> I'm not sure where this factors in, but I originally saw this problem
>> using the Freescale 2.6.35 kernel.  The driver there exhibits this
>> problem differently, although it could very well be a different
>> problem. What I observed was that when the FEC got into this bad state
>> the driver would attempt to transmit a socket buffer but for some
>> reason the buffer would not actually get transmitted.
>>
>> The driver would continue transmitting packets until it got all the
>> way around in the ring buffer to the buffer descriptor right before
>> the one that was never transmitted.  When this buffer descriptor was
>> set to transmit you'd get a double transmit - the new packet and the
>> previously untransmitted buffer.
>>
>> This results in out-of-order packets being sent directly from the i.MX53.
>
>At initial glance, this is coherent with my idea of the FEC skipping a ring
>entry on the initial pass around.  Then when a new entry is loaded,
>
>Let's say that the problem entry is number 12 that has been skipped.
>When we get back around to entry 11, the FEC will transmit entries 11 and 12,
>as you rightly point out, and it will then look at entry 13 for the next packet.
>
>However, the driver loads the next packet into entry 12, and hits the FEC to
>transmit it.  The FEC re-reads entry 13, finds no packet, so does nothing.
>
>Then the next packet is submitted to the driver, and it enters it into entry 13,
>again hitting the FEC.  The FEC now sees the entry at 13, meanwhile the entry
>at 12 is still pending.
>
>> I hope this additional information is useful, I don't know enough
>> about these low-level networking details to contribute much but it's
>> possible that what I've seen in the 2.6.35 kernel is actually the same
>> issue that I'm seeing in the 3.14 kernel but handled better.
>
>It confirms the theory, but doesn't really provide much clues for a solution at
>the moment.
>
>However, I've had something of a breakthrough with iMX6 and half-duplex.
>I think much of the problem comes down to this ERR006358 workaround implemented
>in the driver (this apparantly doesn't affect your device.) The delayed work
>implementation, and my delayed timer implementation of the same are
>fundamentally wrong to the erratum documentation - as is the version
>implemented in the Freescale BSP.
>
>Implementing what the erratum says as an acceptable workaround improves things
>tremendously - I see iperf on a 10Mbit hub go from 1-2Mbps up to 8Mbps, though
>still with loads of collisions.  That said, I'm not that trusting of the error
>bits indicated from the FEC.
I don't have 10Mbit hub, but use ethtool to change the phy speed/duplex mode in imx6q sabresd platform:
- 100M half: work fine
- 10M half: work fine, iperf test, the tx bandwidth is 6.2 ~ 11 Mbps.

Log:
root at freescale /data/ptp-debug$ ./ethtool -s eth0 autoneg off speed 10 duplex half
root at freescale /data/ptp-debug$ PHY: 1:01 - Link is Down
root at freescale /data/ptp-debug$ PHY: 1:01 - Link is Up - 10/Half

root at freescale /data/ptp-debug$ iperf -c 10.192.242.202 -t 100 -i 1
------------------------------------------------------------
Client connecting to 10.192.242.202, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.192.242.124 port 52725 connected with 10.192.242.202 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  1.0- 2.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3]  2.0- 3.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  3.0- 4.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  4.0- 5.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  5.0- 6.0 sec  1.38 MBytes  11.5 Mbits/sec
[  3]  6.0- 7.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3]  7.0- 8.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3]  8.0- 9.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  9.0-10.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 10.0-11.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 11.0-12.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 12.0-13.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 13.0-14.0 sec  1.38 MBytes  11.5 Mbits/sec
[  3] 14.0-15.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 15.0-16.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 16.0-17.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 17.0-18.0 sec   512 KBytes  4.19 Mbits/sec
[  3] 18.0-19.0 sec  1.62 MBytes  13.6 Mbits/sec
[  3] 19.0-20.0 sec   640 KBytes  5.24 Mbits/sec
[  3] 20.0-21.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3] 21.0-22.0 sec   640 KBytes  5.24 Mbits/sec
[  3] 22.0-23.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 23.0-24.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3] 24.0-25.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 25.0-26.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 26.0-27.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 27.0-28.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 28.0-29.0 sec  1.00 MBytes  8.39 Mbits/sec
....

>
>The reason I mention it here is that I wonder if less wacking of the
>FEC_X_DES_ACTIVE register may help your problem.
>
>In 3.14, in the fec_enet_start_xmit function, find the "writel(0, fep->hwp +
>FEC_X_DES_ACTIVE);" and change it to:
>
>	wmb();
>
>        /* Trigger transmission start */
>        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>                writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>
>and see whether that helps your problem(s).
>
So far, I don't see wmb() bring any effort on FEC.  Some years ago, we want to add wmb() before set BD "Ready" bit, but we don't find any issue without wmb(), on the contrary, it introduce more
System overhead special for Gbps networking. And now, imx6sx sillicon (sigle core), add wmb() cause tx performance drop much since cpu loading is the bottleneck. In our bsp release, we don't find
Tx watchdog timeout issue, don't find ping order reversal issue after overnight stress test.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-02  3:19                                               ` fugang.duan at freescale.com
@ 2014-04-02  8:59                                                 ` Russell King - ARM Linux
  2014-04-02  9:40                                                   ` fugang.duan at freescale.com
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-02  8:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Apr 02, 2014 at 03:19:58AM +0000, fugang.duan at freescale.com wrote:
> >	wmb();
> >
> >        /* Trigger transmission start */
> >        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
> >                writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> >
> >and see whether that helps your problem(s).
> >
> So far, I don't see wmb() bring any effort on FEC.  Some years ago, we
> want to add wmb() before set BD "Ready" bit, but we don't find any issue
> without wmb(), on the contrary, it introduce more System overhead special
> for Gbps networking.

I wonder whether you understand what is going on here, and why it is
required.  I doubt it somehow from your comments.  Maybe if you were
to read about the operation of the store buffer in the PL310, it may
open your eyes to why it would be necessary for reliable operation.

> And now, imx6sx sillicon (sigle core), add wmb() cause tx performance
> drop much since cpu loading is the bottleneck.

I don't find any measurable performance drop from adding it on either
iMX6Q or iMX6S with the L2 cache code fixed up.

The reality is, with the mainline ethernet driver with no changes, the
best it can do is around 300Mbps transmit and 400Mbps receive.  With
both my L2 and FEC changes, this has increased to 500Mbps transmit and
570Mbps receive.  Plus it now works reasonably (though with lots of
collisions) on half-duplex giving a 4 to 8 fold increase in speed there.

I really don't care how your private BSP kernels perform.  It's not
mainline, so it's not of any interest.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-02  8:59                                                 ` Russell King - ARM Linux
@ 2014-04-02  9:40                                                   ` fugang.duan at freescale.com
  2014-04-02 10:46                                                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: fugang.duan at freescale.com @ 2014-04-02  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Data: Wednesday, April 02, 2014 4:59 PM

>To: Duan Fugang-B38611
>Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Wed, Apr 02, 2014 at 03:19:58AM +0000, fugang.duan at freescale.com wrote:
>> >	wmb();
>> >
>> >        /* Trigger transmission start */
>> >        if (readl(fep->hwp + FEC_X_DES_ACTIVE) == 0)
>> >                writel(0, fep->hwp + FEC_X_DES_ACTIVE);
>> >
>> >and see whether that helps your problem(s).
>> >
>> So far, I don't see wmb() bring any effort on FEC.  Some years ago, we
>> want to add wmb() before set BD "Ready" bit, but we don't find any
>> issue without wmb(), on the contrary, it introduce more System
>> overhead special for Gbps networking.
>
>I wonder whether you understand what is going on here, and why it is required.
>I doubt it somehow from your comments.  Maybe if you were to read about the
>operation of the store buffer in the PL310, it may open your eyes to why it
>would be necessary for reliable operation.
>
In kernel 3.0.35 internal BSP,  BD memory is non-cacheable, non-bufferable (we add new api to support it: dma_alloc_noncacheable()),
So wmb() is not necessary. We don't find net watchdog timeout/ ping order issue.

>> And now, imx6sx sillicon (sigle core), add wmb() cause tx performance
>> drop much since cpu loading is the bottleneck.
>
>I don't find any measurable performance drop from adding it on either iMX6Q or
>iMX6S with the L2 cache code fixed up.
>
Yes, it don't impact imx6q since cpu loading is not bottleneck due rx/tx bandwidth is slow and multi-cores.
But for imx6sx, enet rx can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core. Enet IP don't support
TSO feaure, cpu loading is the bottleneck. Wmb() is very expensive which cause tx performance drop much.

>The reality is, with the mainline ethernet driver with no changes, the best it
>can do is around 300Mbps transmit and 400Mbps receive.  With both my L2 and FEC
>changes, this has increased to 500Mbps transmit and 570Mbps receive.  Plus it
>now works reasonably (though with lots of
>collisions) on half-duplex giving a 4 to 8 fold increase in speed there.
>
>I really don't care how your private BSP kernels perform.  It's not mainline,
>so it's not of any interest.
>
Yes, I agree. There have some arch/driver patches need to upstream to align the performance with internal bsp.

>--
>FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
>improving, and getting towards what was expected from it.
>

Thanks,
Andy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-02  9:40                                                   ` fugang.duan at freescale.com
@ 2014-04-02 10:46                                                     ` Russell King - ARM Linux
  2014-04-02 11:33                                                       ` fugang.duan at freescale.com
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-02 10:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Apr 02, 2014 at 09:40:53AM +0000, fugang.duan at freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Data: Wednesday, April 02, 2014 4:59 PM
> >I wonder whether you understand what is going on here, and why it is required.
> >I doubt it somehow from your comments.  Maybe if you were to read about the
> >operation of the store buffer in the PL310, it may open your eyes to why it
> >would be necessary for reliable operation.
> 
> In kernel 3.0.35 internal BSP,  BD memory is non-cacheable, non-bufferable
> (we add new api to support it: dma_alloc_noncacheable()),

As is the memory you get from dma_alloc_coherent().  So, why did you
invent a new API which does something which the mainline kernel APIs
already do?

Maybe yours is doing something different but you haven't explained it
in correct terminology.

> So wmb() is not necessary.

Even on non-cacheable normal memory, the wmb() is required.  Please read
up in the ARM architecture reference manual about memory types and their
various attributes, followed by the memory ordering chapters.

> Yes, it don't impact imx6q since cpu loading is not bottleneck due
> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx can
> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.

What netdev features do you support to achieve that?

> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
> is very expensive which cause tx performance drop much.

wmb() is very expensive because of the L2 cache code using a sledge hammer
with it - particularly the spinlock, which has a large overhead if lockdep
or spinlock debugging is enabled.

> Yes, I agree. There have some arch/driver patches need to upstream to
> align the performance with internal bsp.

Well, post these patches so that people can test them.  If your patches
are indeed the worlds best thing since sliced bread, I've just wasted
over two months of solid work on the iMX6 ethernet driver.

However, hacks like dma_alloc_noncacheable will not be acceptable for
mainline.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-02 10:46                                                     ` Russell King - ARM Linux
@ 2014-04-02 11:33                                                       ` fugang.duan at freescale.com
  2014-04-02 16:51                                                         ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: fugang.duan at freescale.com @ 2014-04-02 11:33 UTC (permalink / raw)
  To: linux-arm-kernel

From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Data: Wednesday, April 02, 2014 6:47 PM

>To: Duan Fugang-B38611
>Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Wed, Apr 02, 2014 at 09:40:53AM +0000, fugang.duan at freescale.com wrote:
>> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
>> Data: Wednesday, April 02, 2014 4:59 PM
>> >I wonder whether you understand what is going on here, and why it is
>required.
>> >I doubt it somehow from your comments.  Maybe if you were to read
>> >about the operation of the store buffer in the PL310, it may open
>> >your eyes to why it would be necessary for reliable operation.
>>
>> In kernel 3.0.35 internal BSP,  BD memory is non-cacheable,
>> non-bufferable (we add new api to support it:
>> dma_alloc_noncacheable()),
>
>As is the memory you get from dma_alloc_coherent().  So, why did you invent a
>new API which does something which the mainline kernel APIs already do?
>
>Maybe yours is doing something different but you haven't explained it in
>correct terminology.

In kernel 3.0.35, there have no CMA memory allocate mechanism.
Below Kernel configs are enabled:
CONFIG_ARM_DMA_MEM_BUFFERABLE
CONFIG_SMP

If use dma_alloc_coherent() allocate memory, it must be non-cacheable, but bufferable.
The new invented api "dma_alloc_noncacheable()" allocate memory is non-cacheable, non-bufferable, the memory type is Strongly ordered.

>
>> So wmb() is not necessary.
>
>Even on non-cacheable normal memory, the wmb() is required.  Please read up in
>the ARM architecture reference manual about memory types and their various
>attributes, followed by the memory ordering chapters.
>
>> Yes, it don't impact imx6q since cpu loading is not bottleneck due
>> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx can
>> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
>
>What netdev features do you support to achieve that?
>
Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
So we enable the two features.

>> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
>> is very expensive which cause tx performance drop much.
>
>wmb() is very expensive because of the L2 cache code using a sledge hammer with
>it - particularly the spinlock, which has a large overhead if lockdep or
>spinlock debugging is enabled.
>
Yes, if add wmb() to xmit(), imx6sx enet performance will drop more than 100Mbps.

[...]

Thanks,
Andy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-02 11:33                                                       ` fugang.duan at freescale.com
@ 2014-04-02 16:51                                                         ` Russell King - ARM Linux
  2014-04-03  2:41                                                           ` fugang.duan at freescale.com
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-02 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan at freescale.com wrote:
> In kernel 3.0.35, there have no CMA memory allocate mechanism.
> Below Kernel configs are enabled:
> CONFIG_ARM_DMA_MEM_BUFFERABLE
> CONFIG_SMP
> 
> If use dma_alloc_coherent() allocate memory, it must be non-cacheable,
> but bufferable.  The new invented api "dma_alloc_noncacheable()"
> allocate memory is non-cacheable, non-bufferable, the memory type is
> Strongly ordered.

Right, so what you've just said is that it's fine to violate the
requirements of the architecture L1 memory model by setting up a
strongly ordered memory mapping for the same physical addresses as
an existing mapping which is mapped as normal memory.

Sorry, I'm not going to listen to you anymore, you just lost any kind
of authority on this matter.

> >> So wmb() is not necessary.
> >
> >Even on non-cacheable normal memory, the wmb() is required.  Please read up in
> >the ARM architecture reference manual about memory types and their various
> >attributes, followed by the memory ordering chapters.
> >
> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due
> >> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx can
> >> reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
> >
> >What netdev features do you support to achieve that?
> >
> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
> So we enable the two features.

Checksum and... presumably you're referring to NAPI don't get you to that
kind of speed.  Even on x86, you can't get close to wire speed without
GSO, which you need scatter-gather for, and you don't support that.  So
I don't believe your 900Mbps figure.

Plus, as you're memcpy'ing every packet received, I don't believe you can
reach 940Mbps receive either.

> >> Enet IP don't support TSO feaure, cpu loading is the bottleneck. Wmb()
> >> is very expensive which cause tx performance drop much.
> >
> >wmb() is very expensive because of the L2 cache code using a sledge hammer with
> >it - particularly the spinlock, which has a large overhead if lockdep or
> >spinlock debugging is enabled.
>
> Yes, if add wmb() to xmit(), imx6sx enet performance will drop more
> than 100Mbps.

In any case, I suspect that isn't directly attributable to wmb() itself.
What I've noticed is that even changing an unsigned short to an unsigned
int *can* result in a substantial performance drop.  Although the
unsigned int results in fewer instructions, it's lower performance because
they're placed differently, and the efficiency of the instruction cache
changes, resulting in different throughput.

What this means is that even changing compiler versions can get you
significantly different performance figures.  So I don't attribute very
much creedence to "wmb() causes performance to drop 100Mbps".  It may
very well go back up with some other changes which result in a slightly
different placement of the instructions.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-02 16:51                                                         ` Russell King - ARM Linux
@ 2014-04-03  2:41                                                           ` fugang.duan at freescale.com
  2014-04-03  8:56                                                             ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: fugang.duan at freescale.com @ 2014-04-03  2:41 UTC (permalink / raw)
  To: linux-arm-kernel

From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Data: Thursday, April 03, 2014 12:51 AM

>To: Duan Fugang-B38611
>Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan at freescale.com wrote:
>> In kernel 3.0.35, there have no CMA memory allocate mechanism.
>> Below Kernel configs are enabled:
>> CONFIG_ARM_DMA_MEM_BUFFERABLE
>> CONFIG_SMP
>>
>> If use dma_alloc_coherent() allocate memory, it must be non-cacheable,
>> but bufferable.  The new invented api "dma_alloc_noncacheable()"
>> allocate memory is non-cacheable, non-bufferable, the memory type is
>> Strongly ordered.
>
>Right, so what you've just said is that it's fine to violate the requirements
>of the architecture L1 memory model by setting up a strongly ordered memory
>mapping for the same physical addresses as an existing mapping which is mapped
>as normal memory.
>
>Sorry, I'm not going to listen to you anymore, you just lost any kind of
>authority on this matter.
>
>> >> So wmb() is not necessary.
>> >
>> >Even on non-cacheable normal memory, the wmb() is required.  Please
>> >read up in the ARM architecture reference manual about memory types
>> >and their various attributes, followed by the memory ordering chapters.
>> >
>> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due
>> >> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx
>> >> can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
>> >
>> >What netdev features do you support to achieve that?
>> >
>> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
>> So we enable the two features.
>
>Checksum and... presumably you're referring to NAPI don't get you to that kind
>of speed.  Even on x86, you can't get close to wire speed without GSO, which
>you need scatter-gather for, and you don't support that.  So I don't believe
>your 900Mbps figure.
>
>Plus, as you're memcpy'ing every packet received, I don't believe you can reach
>940Mbps receive either.
>
Since Imx6sx enet still don't support TSO and Jumbo packet, scatter-gather cannot improve ethernet performance in
Most cases special for iperf test. 

Imx6sx: sigle core, cpu frequency is 996Mhz, cpu government is performance.
Kernel config: disable SMP config
For rx path: 
	- hw accleration: crc checksum, interrupt coalescing.
	- software part: napi, new skb allocation instead of memory copy.
	- Test result: 940Mbps, 8% cpu idle
For tx path:
	- hw accleration: crc checksum, interrupt coalescing.
	- software part: napi, no memory copy in driver since tx DMA support data buffer byte alignment.
	- Test result: 900Mbps, cpu loading near to 100%

>> >> Enet IP don't support TSO feaure, cpu loading is the bottleneck.
>> >> Wmb() is very expensive which cause tx performance drop much.
>> >
>> >wmb() is very expensive because of the L2 cache code using a sledge
>> >hammer with it - particularly the spinlock, which has a large
>> >overhead if lockdep or spinlock debugging is enabled.
>>
>> Yes, if add wmb() to xmit(), imx6sx enet performance will drop more
>> than 100Mbps.
>
>In any case, I suspect that isn't directly attributable to wmb() itself.
>What I've noticed is that even changing an unsigned short to an unsigned int
>*can* result in a substantial performance drop.  Although the unsigned int
>results in fewer instructions, it's lower performance because they're placed
>differently, and the efficiency of the instruction cache changes, resulting in
>different throughput.
>
It is interesting, I will try it.

>What this means is that even changing compiler versions can get you
>significantly different performance figures.  So I don't attribute very much
>creedence to "wmb() causes performance to drop 100Mbps".  It may very well go
>back up with some other changes which result in a slightly different placement
>of the instructions.
>
I test the performance with three compiler:
gcc-4.4.4-glibc-2.11.1-multilib-1.0,
4.7
4.8.1 

The test result is similar.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-03  2:41                                                           ` fugang.duan at freescale.com
@ 2014-04-03  8:56                                                             ` Russell King - ARM Linux
  2014-04-03  9:55                                                               ` fugang.duan at freescale.com
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-03  8:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan at freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Data: Thursday, April 03, 2014 12:51 AM
> 
> >To: Duan Fugang-B38611
> >Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
> >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
> >
> >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan at freescale.com wrote:
> >> In kernel 3.0.35, there have no CMA memory allocate mechanism.
> >> Below Kernel configs are enabled:
> >> CONFIG_ARM_DMA_MEM_BUFFERABLE
> >> CONFIG_SMP
> >>
> >> If use dma_alloc_coherent() allocate memory, it must be non-cacheable,
> >> but bufferable.  The new invented api "dma_alloc_noncacheable()"
> >> allocate memory is non-cacheable, non-bufferable, the memory type is
> >> Strongly ordered.
> >
> >Right, so what you've just said is that it's fine to violate the requirements
> >of the architecture L1 memory model by setting up a strongly ordered memory
> >mapping for the same physical addresses as an existing mapping which is mapped
> >as normal memory.
> >
> >Sorry, I'm not going to listen to you anymore, you just lost any kind of
> >authority on this matter.
> >
> >> >> So wmb() is not necessary.
> >> >
> >> >Even on non-cacheable normal memory, the wmb() is required.  Please
> >> >read up in the ARM architecture reference manual about memory types
> >> >and their various attributes, followed by the memory ordering chapters.
> >> >
> >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck due
> >> >> rx/tx bandwidth is slow and multi-cores.  But for imx6sx, enet rx
> >> >> can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle core.
> >> >
> >> >What netdev features do you support to achieve that?
> >> >
> >> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
> >> So we enable the two features.
> >
> >Checksum and... presumably you're referring to NAPI don't get you to that kind
> >of speed.  Even on x86, you can't get close to wire speed without GSO, which
> >you need scatter-gather for, and you don't support that.  So I don't believe
> >your 900Mbps figure.
> >
> >Plus, as you're memcpy'ing every packet received, I don't believe you can reach
> >940Mbps receive either.
> >
> Since Imx6sx enet still don't support TSO and Jumbo packet, scatter-gather
> cannot improve ethernet performance in
> Most cases special for iperf test. 

Again, you are losing credibility every time you deny stuff like this.
I'm now at the point of just not listening to you anymore because you're
contradicting what I know to be solid fact through my own measurements.

This seems to be Freescale's overall attitude - as I've read on
Freescale's forums.  Your customers/users are always wrong, you're always
right.  Eg, any performance issues are not the fault of Freescale stuff,
it's tarnished connectors or similar.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-03  8:56                                                             ` Russell King - ARM Linux
@ 2014-04-03  9:55                                                               ` fugang.duan at freescale.com
  2014-04-03 10:32                                                                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: fugang.duan at freescale.com @ 2014-04-03  9:55 UTC (permalink / raw)
  To: linux-arm-kernel

From: Russell King - ARM Linux <linux@arm.linux.org.uk>
Data: Thursday, April 03, 2014 4:57 PM

>To: Duan Fugang-B38611
>Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
>Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
>Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>
>On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan at freescale.com wrote:
>> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
>> Data: Thursday, April 03, 2014 12:51 AM
>>
>> >To: Duan Fugang-B38611
>> >Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel;
>> >Troy Kisky; Grant Likely; Bernd Faust; Fabio Estevam;
>> >linux-arm-kernel at lists.infradead.org
>> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
>> >
>> >On Wed, Apr 02, 2014 at 11:33:22AM +0000, fugang.duan at freescale.com wrote:
>> >> In kernel 3.0.35, there have no CMA memory allocate mechanism.
>> >> Below Kernel configs are enabled:
>> >> CONFIG_ARM_DMA_MEM_BUFFERABLE
>> >> CONFIG_SMP
>> >>
>> >> If use dma_alloc_coherent() allocate memory, it must be
>> >> non-cacheable, but bufferable.  The new invented api
>"dma_alloc_noncacheable()"
>> >> allocate memory is non-cacheable, non-bufferable, the memory type
>> >> is Strongly ordered.
>> >
>> >Right, so what you've just said is that it's fine to violate the
>> >requirements of the architecture L1 memory model by setting up a
>> >strongly ordered memory mapping for the same physical addresses as an
>> >existing mapping which is mapped as normal memory.
>> >
>> >Sorry, I'm not going to listen to you anymore, you just lost any kind
>> >of authority on this matter.
>> >
>> >> >> So wmb() is not necessary.
>> >> >
>> >> >Even on non-cacheable normal memory, the wmb() is required.
>> >> >Please read up in the ARM architecture reference manual about
>> >> >memory types and their various attributes, followed by the memory
>ordering chapters.
>> >> >
>> >> >> Yes, it don't impact imx6q since cpu loading is not bottleneck
>> >> >> due rx/tx bandwidth is slow and multi-cores.  But for imx6sx,
>> >> >> enet rx can reach at 940Mbps, tx can reach at 900Mbps, imx6sx is sigle
>core.
>> >> >
>> >> >What netdev features do you support to achieve that?
>> >> >
>> >> Imx6sx enet accleration feature support crc checksum, interrupt coalescing.
>> >> So we enable the two features.
>> >
>> >Checksum and... presumably you're referring to NAPI don't get you to
>> >that kind of speed.  Even on x86, you can't get close to wire speed
>> >without GSO, which you need scatter-gather for, and you don't support
>> >that.  So I don't believe your 900Mbps figure.
>> >
>> >Plus, as you're memcpy'ing every packet received, I don't believe you
>> >can reach 940Mbps receive either.
>> >
>> Since Imx6sx enet still don't support TSO and Jumbo packet,
>> scatter-gather cannot improve ethernet performance in Most cases
>> special for iperf test.
>
>Again, you are losing credibility every time you deny stuff like this.
>I'm now at the point of just not listening to you anymore because you're
>contradicting what I know to be solid fact through my own measurements.
>
>This seems to be Freescale's overall attitude - as I've read on Freescale's
>forums.  Your customers/users are always wrong, you're always right.  Eg, any
>performance issues are not the fault of Freescale stuff, it's tarnished
>connectors or similar.
>
Hi, Russell,

I don't contradict your thinking/solution and measurements.  You are expert on arm/modules, we keep study attitude to dicuss with you.
For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you did great job on performance tuning, and the test result is similar
To our internal test result. Your suggestion for the optimiztion is meaningful. Pls understand my thinking.  

Thanks,
Andy

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-03  9:55                                                               ` fugang.duan at freescale.com
@ 2014-04-03 10:32                                                                 ` Russell King - ARM Linux
  2014-04-03 13:36                                                                   ` Shawn Guo
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-03 10:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 03, 2014 at 09:55:06AM +0000, fugang.duan at freescale.com wrote:
> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> Data: Thursday, April 03, 2014 4:57 PM
> 
> >To: Duan Fugang-B38611
> >Cc: robert.daniels at vantagecontrols.com; Marek Vasut; Detlev Zundel; Troy Kisky;
> >Grant Likely; Bernd Faust; Fabio Estevam; linux-arm-kernel at lists.infradead.org
> >Subject: Re: FEC ethernet issues [Was: PL310 errata workarounds]
> >
> >On Thu, Apr 03, 2014 at 02:41:46AM +0000, fugang.duan at freescale.com wrote:
> >> From: Russell King - ARM Linux <linux@arm.linux.org.uk>
> >> Data: Thursday, April 03, 2014 12:51 AM
> >>
> >> >Checksum and... presumably you're referring to NAPI don't get you to
> >> >that kind of speed.  Even on x86, you can't get close to wire speed
> >> >without GSO, which you need scatter-gather for, and you don't support
> >> >that.  So I don't believe your 900Mbps figure.
> >> >
> >> >Plus, as you're memcpy'ing every packet received, I don't believe you
> >> >can reach 940Mbps receive either.
> >> >
> >> Since Imx6sx enet still don't support TSO and Jumbo packet,
> >> scatter-gather cannot improve ethernet performance in Most cases
> >> special for iperf test.
> >
> >Again, you are losing credibility every time you deny stuff like this.
> >I'm now at the point of just not listening to you anymore because you're
> >contradicting what I know to be solid fact through my own measurements.
> >
> >This seems to be Freescale's overall attitude - as I've read on Freescale's
> >forums.  Your customers/users are always wrong, you're always right.  Eg, any
> >performance issues are not the fault of Freescale stuff, it's tarnished
> >connectors or similar.
> >
> Hi, Russell,
> 
> I don't contradict your thinking/solution and measurements.  You are
> expert on arm/modules, we keep study attitude to dicuss with you.
> For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you
> did great job on performance tuning, and the test result is similar
> To our internal test result. Your suggestion for the optimiztion is
> meaningful. Pls understand my thinking.  

The reason I said what I said is because I'm not talking about TSO.  I'm
talking about GSO.  They're similar features, but are done in totally
different ways.

TSO requires either hardware support, or driver cooperation to segment
the data.  GSO does not.

There's several points here which make me discount your figures:

1. The overhead of routing packets is not insigificant.

   GSO is a feature where the higher levels (such as TCP) can submit
   large socket buffers to the lower levels, all the way through to
   packet queues to just before the device.

   At the very last moment, just before the buffer is handed over to
   the driver's start_xmit function, the buffer is carved into
   appropriately sized chunks, and new skbuffs allocated.  Each skbuff
   "head" contains the protocol headers, and the associated fragments
   contain pointers/lengths into the large buffer.  (This is why SG is
   required for GSO.)

   What this means is that all the overhead from the layers such as
   TCP / IP / routing / packet queuing are all reduced because rather
   than those code paths having to be run for every single 1500 byte
   packet, they're run once for maybe 16 to 32K of data to be sent.
   This avoids the overhead from a lot of code.

   I have monitored the RGMII TX_CTL signal while performing iperf
   tests without GSO, and I can see from that the gap between packets,
   it is very rare that it gets to the point of transmitting two
   packets back to back.

2. At 500Mbps transmit on the iMX6S, performance testing shows the
   majority of CPU time is spent in the cache cleaning/flushing code.
   This overhead is necessary and can't be removed without risking
   data corruption.

3. I discount your receive figure because even with all my work, if
   I set my x86 box to perform a UDP iperf test with x86 sending to
   iMX6S, the result remains extremely poor, with 95% (modified) to 99%
   (unmodified) of UDP packets lost.  This is not entirely the fault
   of the FEC.  Around 70% of the packet loss is in the UDP receive
   path - iperf seems to be unable to read the packets from UDP fast
   enough.  (This confirmed by checking the statistics in /proc/net/snmp.)
   This means 30% of packet loss is due to the FEC not keeping up with
   the on-wire packet rate (which is around 810Mbps.)

   Therefore, I place the maximum receive packet rate of the FEC with
   the current packet-copy strategy in the receive path and 128 entry
   receive ring size at around 550Mbps, and again, much of the overhead
   comes from the cache handling code according to perf.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-03 10:32                                                                 ` Russell King - ARM Linux
@ 2014-04-03 13:36                                                                   ` Shawn Guo
  2014-04-03 13:45                                                                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 76+ messages in thread
From: Shawn Guo @ 2014-04-03 13:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 03, 2014 at 11:32:06AM +0100, Russell King - ARM Linux wrote:
> > Hi, Russell,
> > 
> > I don't contradict your thinking/solution and measurements.  You are
> > expert on arm/modules, we keep study attitude to dicuss with you.
> > For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you
> > did great job on performance tuning, and the test result is similar
> > To our internal test result. Your suggestion for the optimiztion is
> > meaningful. Pls understand my thinking.  
> 
> The reason I said what I said is because I'm not talking about TSO.  I'm
> talking about GSO.  They're similar features, but are done in totally
> different ways.
> 
> TSO requires either hardware support, or driver cooperation to segment
> the data.  GSO does not.
> 
> There's several points here which make me discount your figures:

Russell,

In case there is a confusion.  The 900Mbps figure that Fugang said is
not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite
(imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on
a new member of i.MX6 family - i.MX6SoloX (imx6sx).  This new SoC hasn't
been announced by Freescale yet.  One major improvement of this new SoC
over its ancestors is the FEC throughput.  It claims 1Gbps throughput
support.  So it's really a hardware optimization instead of anything
that software can do.

Shawn

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-03 13:36                                                                   ` Shawn Guo
@ 2014-04-03 13:45                                                                     ` Russell King - ARM Linux
  2014-04-03 14:00                                                                       ` Shawn Guo
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-04-03 13:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 03, 2014 at 09:36:41PM +0800, Shawn Guo wrote:
> On Thu, Apr 03, 2014 at 11:32:06AM +0100, Russell King - ARM Linux wrote:
> > > Hi, Russell,
> > > 
> > > I don't contradict your thinking/solution and measurements.  You are
> > > expert on arm/modules, we keep study attitude to dicuss with you.
> > > For imx6sx, we indeed get the result. For imx6q/dl linux upstream, you
> > > did great job on performance tuning, and the test result is similar
> > > To our internal test result. Your suggestion for the optimiztion is
> > > meaningful. Pls understand my thinking.  
> > 
> > The reason I said what I said is because I'm not talking about TSO.  I'm
> > talking about GSO.  They're similar features, but are done in totally
> > different ways.
> > 
> > TSO requires either hardware support, or driver cooperation to segment
> > the data.  GSO does not.
> > 
> > There's several points here which make me discount your figures:
> 
> Russell,
> 
> In case there is a confusion.  The 900Mbps figure that Fugang said is
> not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite
> (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on
> a new member of i.MX6 family - i.MX6SoloX (imx6sx).  This new SoC hasn't
> been announced by Freescale yet.  One major improvement of this new SoC
> over its ancestors is the FEC throughput.  It claims 1Gbps throughput
> support.  So it's really a hardware optimization instead of anything
> that software can do.

That means it's irrelevant to this discussion because it's different
hardware, with who knows what changes to the memory subsystem and
CPU/cache implementation.  Hence, it can't be compared in any way to
the performance I see on iMX6Q and iMX6S.

What matters for here is fixing the issues with the mainline driver on
hardware that people have today - not only the bugs and instability
that people are seeing, but also the low performance.

If some of that involves doing stuff "The Right Way" by using memory
barriers rather than violating the architecture specification and
introducing new kernel APIs in order to do so, we're going to do it
the right way, even if that means that those changes may not be in the
best interest of this unreleased part.  There is no room for compromise
on that.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-03 13:45                                                                     ` Russell King - ARM Linux
@ 2014-04-03 14:00                                                                       ` Shawn Guo
  0 siblings, 0 replies; 76+ messages in thread
From: Shawn Guo @ 2014-04-03 14:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Apr 03, 2014 at 02:45:10PM +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 03, 2014 at 09:36:41PM +0800, Shawn Guo wrote:
> > Russell,
> > 
> > In case there is a confusion.  The 900Mbps figure that Fugang said is
> > not on any of i.MX6 SoCs that are publicly available - i.MX6SoloLite
> > (imx6sl), i.MX6Solo/DualLite (imx6dl), i.MX6Dual/Quad (imx6q), but on
> > a new member of i.MX6 family - i.MX6SoloX (imx6sx).  This new SoC hasn't
> > been announced by Freescale yet.  One major improvement of this new SoC
> > over its ancestors is the FEC throughput.  It claims 1Gbps throughput
> > support.  So it's really a hardware optimization instead of anything
> > that software can do.
> 
> That means it's irrelevant to this discussion because it's different
> hardware, with who knows what changes to the memory subsystem and
> CPU/cache implementation.  Hence, it can't be compared in any way to
> the performance I see on iMX6Q and iMX6S.

Exactly.  I don't think that Fugang should bring i.MX6SX figure into the
discussion either, and I see that's where the confusion comes, from what
I read the thread today.

Shawn

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-01 22:51                                             ` Russell King - ARM Linux
  2014-04-02  0:37                                               ` robert.daniels at vantagecontrols.com
  2014-04-02  3:19                                               ` fugang.duan at freescale.com
@ 2014-04-04 20:21                                               ` robert.daniels at vantagecontrols.com
  2014-04-29  9:26                                                 ` Jaccon Bastiaansen
       [not found]                                                 ` <CAGzjT4d8H6YE6P6A1E4aHiPAenJFvZH01LHXaNzVwhF2MRBvWQ@mail.gmail.com>
  2 siblings, 2 replies; 76+ messages in thread
From: robert.daniels at vantagecontrols.com @ 2014-04-04 20:21 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
04:51:49 PM:

> At initial glance, this is coherent with my idea of the FEC skipping a
> ring entry on the initial pass around.  Then when a new entry is loaded,
>
> Let's say that the problem entry is number 12 that has been skipped.
> When we get back around to entry 11, the FEC will transmit entries 11
> and 12, as you rightly point out, and it will then look at entry 13
> for the next packet.
>
> However, the driver loads the next packet into entry 12, and hits the
> FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
> does nothing.
>
> Then the next packet is submitted to the driver, and it enters it into
> entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
> meanwhile the entry at 12 is still pending.

I've explored the option of providing a work around for this observed
problem.
In the 2.6.35 kernel, I've used the BD_ENET_TX_INTR flag which is marked as
TO2 in the RM and which was being set but never used by the software to
help
detect the skip.  In the interrupt handler I clear this bit which allows me
to use it to know when the bd goes from ready -> clean.  In the error
situation
above, you would end up with 12 marked with both READY and INTR and at some
point
you would have bd 13 marked as INTR only (as it would have been transmitted
but
not cleaned up.)  This allows me to clean up bd 12 despite not actually
being
transmitted.  This gets the driver back in sync with the FEC and things
continue
on normally... until it happens again.

Of course, this results in the occasional dropped packet but I feel like
for now
(until Freescale figures out what's going on) this is better than nothing.
At
least the driver is able to recover somewhat from the situation.

I'm not sure if the mainline driver could benefit from a strategy like this
or not,
especially since it manifested this problem differently (tx transmit
timeout).
Also, in my opinion this is an undesirable hack to make things work
acceptably.
There could also be some inherent problem with this strategy that I'm
unaware of,
since dealing with linux kernel ethernet drivers is not exactly my area of
expertise.

Any thoughts or new insights?

Thanks,

Robert Daniels

This email, and any document attached hereto, may contain
confidential and/or privileged information.  If you are not the
intended recipient (or have received this email in error) please
notify the sender immediately and destroy this email.  Any
unauthorized, direct or indirect, copying, disclosure, distribution
or other use of the material or parts thereof is strictly
forbidden.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-04-04 20:21                                               ` robert.daniels at vantagecontrols.com
@ 2014-04-29  9:26                                                 ` Jaccon Bastiaansen
       [not found]                                                 ` <CAGzjT4d8H6YE6P6A1E4aHiPAenJFvZH01LHXaNzVwhF2MRBvWQ@mail.gmail.com>
  1 sibling, 0 replies; 76+ messages in thread
From: Jaccon Bastiaansen @ 2014-04-29  9:26 UTC (permalink / raw)
  To: linux-arm-kernel

 Hello all,

I tried the FEC patches from Russel
(http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing),
but with the following test the FEC complety stops receiving frames:

     Linux PC <----->  Gigabit Ethernet switch  <------> SabreSD board

Both ethernet links run at 1 Gigabit, full duplex.

The SabreSD board runs the fec-testing kernel from Russel.

On the SabreSD board we run "iperf -s -u"

On the Linux PC we
- run the iperf client (command used: ?while [ 1 ]; do iperf ?c ?IP
address of SabreSD? -u -b 100m -t 300 -l 256;sleep 1;done?)
- ping the SabreSD board every second

After a while (sometimes 10 seconds, sometimes a couple of minutes),
we see that the SabreSD board stops replying to the pings from the
Linux PC. Closer inspection shows that the FEC doesn't generate
receive frame interrupts anymore. Analysis with a debugger shows that
the RXF interrupt is enabled and that the FEC is still receiving
ethernet frames (because the IEEE_R_FRAME_OK event counter is
increasing), but no receive frame interrupts are occurring. We only
see MII interrupts occurring.

The RDAR register has the value 0 meaning that the FEC cannot write
the received frame in main memory because of a lack of available free
receive descriptors.

When the FEC has stopped generating receive interrupts, we used the
ping command on the Sabre board to send some frames from the Sabre
board to the Linux PC. The FEC then starts to generate receive
interrupts again. The driver source code shows that the receive buffer
is also emptied when a transmit interrupt occurs. So it seems that the
FEC completely stops generating receive interrupts when frames are
received and there are no more empty descriptors.

Any ideas or insights?

Regards,
  Jaccon

2014-04-04 22:21 GMT+02:00  <robert.daniels@vantagecontrols.com>:
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote on 04/01/2014
> 04:51:49 PM:
>
>> At initial glance, this is coherent with my idea of the FEC skipping a
>> ring entry on the initial pass around.  Then when a new entry is loaded,
>>
>> Let's say that the problem entry is number 12 that has been skipped.
>> When we get back around to entry 11, the FEC will transmit entries 11
>> and 12, as you rightly point out, and it will then look at entry 13
>> for the next packet.
>>
>> However, the driver loads the next packet into entry 12, and hits the
>> FEC to transmit it.  The FEC re-reads entry 13, finds no packet, so
>> does nothing.
>>
>> Then the next packet is submitted to the driver, and it enters it into
>> entry 13, again hitting the FEC.  The FEC now sees the entry at 13,
>> meanwhile the entry at 12 is still pending.
>
> I've explored the option of providing a work around for this observed
> problem.
> In the 2.6.35 kernel, I've used the BD_ENET_TX_INTR flag which is marked as
> TO2 in the RM and which was being set but never used by the software to
> help
> detect the skip.  In the interrupt handler I clear this bit which allows me
> to use it to know when the bd goes from ready -> clean.  In the error
> situation
> above, you would end up with 12 marked with both READY and INTR and at some
> point
> you would have bd 13 marked as INTR only (as it would have been transmitted
> but
> not cleaned up.)  This allows me to clean up bd 12 despite not actually
> being
> transmitted.  This gets the driver back in sync with the FEC and things
> continue
> on normally... until it happens again.
>
> Of course, this results in the occasional dropped packet but I feel like
> for now
> (until Freescale figures out what's going on) this is better than nothing.
> At
> least the driver is able to recover somewhat from the situation.
>
> I'm not sure if the mainline driver could benefit from a strategy like this
> or not,
> especially since it manifested this problem differently (tx transmit
> timeout).
> Also, in my opinion this is an undesirable hack to make things work
> acceptably.
> There could also be some inherent problem with this strategy that I'm
> unaware of,
> since dealing with linux kernel ethernet drivers is not exactly my area of
> expertise.
>
> Any thoughts or new insights?
>
> Thanks,
>
> Robert Daniels
>
> This email, and any document attached hereto, may contain
> confidential and/or privileged information.  If you are not the
> intended recipient (or have received this email in error) please
> notify the sender immediately and destroy this email.  Any
> unauthorized, direct or indirect, copying, disclosure, distribution
> or other use of the material or parts thereof is strictly
> forbidden.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
       [not found]                                                 ` <CAGzjT4d8H6YE6P6A1E4aHiPAenJFvZH01LHXaNzVwhF2MRBvWQ@mail.gmail.com>
@ 2014-05-02 11:41                                                   ` Russell King - ARM Linux
  2014-05-08  9:23                                                     ` Jaccon Bastiaansen
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King - ARM Linux @ 2014-05-02 11:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Apr 29, 2014 at 11:05:04AM +0200, Jaccon Bastiaansen wrote:
> Hello all,
> 
> I tried the FEC patches from Russel
> (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing),
> but with the following test the FEC complety stops receiving frames:
> 
> 
>  Linux PC <----->  Gigabit Ethernet switch  <------> SabreSD board
> 
> Both ethernet links run at 1 Gigabit, full duplex.
> 
> The SabreSD board runs the fec-testing kernel from Russel.
> 
> On the SabreSD board we run "iperf -s -u"
> 
> On the Linux PC we
> - run the iperf client (command used: ?while [ 1 ]; do iperf ?c ?IP
> address of SabreSD? -u -b 100m -t 300 -l 256;sleep 1;done?)
> - ping the SabreSD board every second
> 
> After a while (sometimes 10 seconds, sometimes a couple of minutes),
> we see that the SabreSD board stops replying to the pings from the
> Linux PC. Closer inspection shows that the FEC doesn't generate
> receive frame interrupts anymore. The attached debugger screenshots
> show the FEC registers when it is in this state. The RXF interrupt is
> enabled and we know that the FEC is still receiving ethernet frames
> (because the IEEE_R_FRAME_OK event counter is increasing), but no
> receive frame interrupts are occurring. We only see MII interrupts
> occurring.

If the RXF interrupt is enabled, it means that we must have received
less than NAPI_POLL_WEIGHT (64) frames from the ring during the previous
NAPI poll - this is the only circumstance when we will re-enable
interrupts.

There are only two conditions where we stop receiving frames from the
receive ring:
1. if we reach the NAPI poll weight number of frames received.
2. if we encounter a packet descriptor marked empty.

It can't be (1) because we would have left the TXF/RXF interrupts
disabled and waited for the next NAPI poll.  So, it can only be (2).

(2) implies that there is a ring descriptor which is marked as being
owned by the FEC, which means that there are descriptors free.  Whether
it's the one which the FEC is expecting to be free or not is something
that's impossible to tell (the FEC hardware doesn't tell us where it
is in its ring, which is a big minus point against debugging.)

It would be useful to see the state of the receive ring, as well as
the rx_next index.  That may be difficult to get though.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* FEC ethernet issues [Was: PL310 errata workarounds]
  2014-05-02 11:41                                                   ` Russell King - ARM Linux
@ 2014-05-08  9:23                                                     ` Jaccon Bastiaansen
  0 siblings, 0 replies; 76+ messages in thread
From: Jaccon Bastiaansen @ 2014-05-08  9:23 UTC (permalink / raw)
  To: linux-arm-kernel

Hello all,

2014-05-02 13:41 GMT+02:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
> On Tue, Apr 29, 2014 at 11:05:04AM +0200, Jaccon Bastiaansen wrote:
>> Hello all,
>>
>> I tried the FEC patches from Russel
>> (http://ftp.arm.linux.org.uk/cgit/linux-arm.git/log/?h=fec-testing),
>> but with the following test the FEC complety stops receiving frames:
>>
>>
>>  Linux PC <----->  Gigabit Ethernet switch  <------> SabreSD board
>>
>> Both ethernet links run at 1 Gigabit, full duplex.
>>
>> The SabreSD board runs the fec-testing kernel from Russel.
>>
>> On the SabreSD board we run "iperf -s -u"
>>
>> On the Linux PC we
>> - run the iperf client (command used: ?while [ 1 ]; do iperf ?c ?IP
>> address of SabreSD? -u -b 100m -t 300 -l 256;sleep 1;done?)
>> - ping the SabreSD board every second
>>
>> After a while (sometimes 10 seconds, sometimes a couple of minutes),
>> we see that the SabreSD board stops replying to the pings from the
>> Linux PC. Closer inspection shows that the FEC doesn't generate
>> receive frame interrupts anymore. The attached debugger screenshots
>> show the FEC registers when it is in this state. The RXF interrupt is
>> enabled and we know that the FEC is still receiving ethernet frames
>> (because the IEEE_R_FRAME_OK event counter is increasing), but no
>> receive frame interrupts are occurring. We only see MII interrupts
>> occurring.
>
> If the RXF interrupt is enabled, it means that we must have received
> less than NAPI_POLL_WEIGHT (64) frames from the ring during the previous
> NAPI poll - this is the only circumstance when we will re-enable
> interrupts.
>
> There are only two conditions where we stop receiving frames from the
> receive ring:
> 1. if we reach the NAPI poll weight number of frames received.
> 2. if we encounter a packet descriptor marked empty.
>
> It can't be (1) because we would have left the TXF/RXF interrupts
> disabled and waited for the next NAPI poll.  So, it can only be (2).
>
> (2) implies that there is a ring descriptor which is marked as being
> owned by the FEC, which means that there are descriptors free.  Whether
> it's the one which the FEC is expecting to be free or not is something
> that's impossible to tell (the FEC hardware doesn't tell us where it
> is in its ring, which is a big minus point against debugging.)
>
> It would be useful to see the state of the receive ring, as well as
> the rx_next index.  That may be difficult to get though.
>

I added some code to show the RX ring when the reception of frames has
stopped. See the patch below. Reading the fec_rx_ring file gives the
following output (I stripped the bit decoding here to reduce the
number of lines):

0: c0a0d000 0x0800 (== rx_bd_base)
1: c0a0d020 0x0800
2: c0a0d040 0x0800
3: c0a0d060 0x0800
4: c0a0d080 0x0800
5: c0a0d0a0 0x0800
6: c0a0d0c0 0x0800 (== rx_next)
7: c0a0d0e0 0x0800
8: c0a0d100 0x0800
9: c0a0d120 0x0800
10: c0a0d140 0x0800
11: c0a0d160 0x0800
12: c0a0d180 0x0800
13: c0a0d1a0 0x0800
14: c0a0d1c0 0x0800
15: c0a0d1e0 0x0800
16: c0a0d200 0x0800
17: c0a0d220 0x0800
18: c0a0d240 0x0800
19: c0a0d260 0x0800
20: c0a0d280 0x0800
21: c0a0d2a0 0x0800
22: c0a0d2c0 0x0800
23: c0a0d2e0 0x0800
24: c0a0d300 0x0800
25: c0a0d320 0x0800
26: c0a0d340 0x0800
27: c0a0d360 0x0800
28: c0a0d380 0x0800
29: c0a0d3a0 0x0800
30: c0a0d3c0 0x0800
31: c0a0d3e0 0x0800
32: c0a0d400 0x0800
33: c0a0d420 0x0800
34: c0a0d440 0x0800
35: c0a0d460 0x0800
36: c0a0d480 0x0800
37: c0a0d4a0 0x0800
38: c0a0d4c0 0x0800
39: c0a0d4e0 0x0800
40: c0a0d500 0x0800
41: c0a0d520 0x0800
42: c0a0d540 0x0800
43: c0a0d560 0x0800
44: c0a0d580 0x0800
45: c0a0d5a0 0x0800
46: c0a0d5c0 0x0800
47: c0a0d5e0 0x0800
48: c0a0d600 0x0800
49: c0a0d620 0x0800
50: c0a0d640 0x0800
51: c0a0d660 0x0800
52: c0a0d680 0x0800
53: c0a0d6a0 0x0800
54: c0a0d6c0 0x0800
55: c0a0d6e0 0x0800
56: c0a0d700 0x0800
57: c0a0d720 0x0800
58: c0a0d740 0x0800
59: c0a0d760 0x0800
60: c0a0d780 0x0800
61: c0a0d7a0 0x0800
62: c0a0d7c0 0x0800
63: c0a0d7e0 0x0800
64: c0a0d800 0x0800
65: c0a0d820 0x0800
66: c0a0d840 0x0800
67: c0a0d860 0x0800
68: c0a0d880 0x0800
69: c0a0d8a0 0x0800
70: c0a0d8c0 0x0800
71: c0a0d8e0 0x0800
72: c0a0d900 0x0800
73: c0a0d920 0x0800
74: c0a0d940 0x0800
75: c0a0d960 0x0800
76: c0a0d980 0x0800
77: c0a0d9a0 0x0800
78: c0a0d9c0 0x0800
79: c0a0d9e0 0x0800
80: c0a0da00 0x0800
81: c0a0da20 0x0800
82: c0a0da40 0x0800
83: c0a0da60 0x0800
84: c0a0da80 0x0800
85: c0a0daa0 0x0800
86: c0a0dac0 0x0800
87: c0a0dae0 0x0800
88: c0a0db00 0x0800
89: c0a0db20 0x0800
90: c0a0db40 0x0800
91: c0a0db60 0x0800
92: c0a0db80 0x0800
93: c0a0dba0 0x0800
94: c0a0dbc0 0x0800
95: c0a0dbe0 0x0800
96: c0a0dc00 0x0800
97: c0a0dc20 0x0800
98: c0a0dc40 0x0800
99: c0a0dc60 0x0800
100: c0a0dc80 0x0800
101: c0a0dca0 0x0800
102: c0a0dcc0 0x0800
103: c0a0dce0 0x0800
104: c0a0dd00 0x0800
105: c0a0dd20 0x0800
106: c0a0dd40 0x0800
107: c0a0dd60 0x0800
108: c0a0dd80 0x0800
109: c0a0dda0 0x0800
110: c0a0ddc0 0x0800
111: c0a0dde0 0x0800
112: c0a0de00 0x0800
113: c0a0de20 0x0800
114: c0a0de40 0x0800
115: c0a0de60 0x0800
116: c0a0de80 0x0800
117: c0a0dea0 0x0800
118: c0a0dec0 0x0800
119: c0a0dee0 0x0800
120: c0a0df00 0x0800
121: c0a0df20 0x0800
122: c0a0df40 0x0800
123: c0a0df60 0x0800
124: c0a0df80 0x0800
125: c0a0dfa0 0x0800
126: c0a0dfc0 0x0800
127: c0a0dfe0 0x2800

This shows that the complete receive ring is filled with received
frames, which matches with the value 0 in the RDAR register.

The patched FEC driver also logs "FEC receive ring full" when the
receive ring is full after the napi_complete() call and before the
enabling of the RXF interrupt. The kernel logging shows this message a
couple of times. So the RX ring has been filled completely between the
fec_enet_rx() call and the napi_complete() call. In this testcase
(100Megabit, 256 byte frames) it takes less than 3 milliseconds to
completely fill the RX ring. So a preemption between the fec_enet_rx()
call and the napi_complete() call or a long execution time of the
fec_enet_tx() function (which is called between fec_enet_rx() and
napi_complete() ) results in the enabling of the RXF interrupt when
the RX ring is completely full. It seems that in that case the RXF
interrupt will not be generated anymore.



@@ -23,17 +23,20 @@

 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/ptrace.h>
+#include <linux/debugfs.h>
 #include <linux/errno.h>
 #include <linux/ioport.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
 #include <linux/skbuff.h>
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
 #include <linux/bitops.h>
 #include <linux/io.h>
@@ -1164,11 +1167,17 @@ static int fec_enet_rx_napi(struct napi_struct
*napi, int budget)

     fec_enet_tx(ndev);

     if (pkts < budget) {
         napi_complete(napi);
-        writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+        if (readl(fep->hwp + FEC_R_DES_ACTIVE) == 0) {
+            writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+            printk(KERN_ERR "%llu FEC: receive ring full!!!\n",
+                sched_clock());
+        } else {
+            writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+        }
     }
     return pkts;
 }

 /* ------------------------------------------------------------------------- */
@@ -2404,10 +2413,79 @@ static void fec_reset_phy(struct platform_device *pdev)
      * by machine code.
      */
 }
 #endif /* CONFIG_OF */

+#define BDRX_EMPTY    (1 << 15)
+#define BDRX_RO1    (1 << 14)
+#define BDRX_WRAP    (1 << 13)
+#define BDRX_RO2    (1 << 12)
+#define BDRX_LAST    (1 << 11)
+#define BDRX_UNDEF10    (1 << 10)
+#define BDRX_UNDEF9    (1 << 9)
+#define BDRX_MISS    (1 << 8)
+#define BDRX_BCAST    (1 << 7)
+#define BDRX_MCAST    (1 << 6)
+#define BDRX_LENERR    (1 << 5)
+#define BDRX_ALGNERR    (1 << 4)
+#define BDRX_UNDEF3    (1 << 3)
+#define BDRX_CRCERR    (1 << 2)
+#define BDRX_OVERRUN    (1 << 1)
+#define BDRX_TRUNC    (1 << 0)
+
+
+static int fec_rx_ring_show(struct seq_file *s, void *p)
+{
+    struct fec_enet_private *fep = s->private;
+    int nbds;
+    union bufdesc_u *bd;
+
+    bd = fep->rx_bd_base;
+    for (nbds = 0;nbds < fep->rx_ring_size;nbds++) {
+        seq_printf(s, "%d: %p 0x%.4x", nbds, bd, bd->ebd.desc.cbd_sc);
+        if (bd == fep->rx_bd_base)
+            seq_printf(s, " (== rx_bd_base)");
+        if (nbds == fep->rx_next)
+            seq_printf(s, " (== rx_next)");
+        seq_printf(s, "\n");
+        seq_printf(s,
+               "  [15]%s [14]%s [13]%s [12]%s [11]%s [10]%s [ 9]%s [ 8]%s"
+               "  [ 7]%s [ 6]%s [ 5]%s [ 4]%s [ 3]%s [ 2]%s [ 1]%s [ 0]%s\n",
+               bd->ebd.desc.cbd_sc & BDRX_EMPTY   ? "EMP" : "emp",
+               bd->ebd.desc.cbd_sc & BDRX_RO1     ? "RO1" : "ro1",
+               bd->ebd.desc.cbd_sc & BDRX_WRAP    ? "WRP" : "wrp",
+               bd->ebd.desc.cbd_sc & BDRX_RO2     ? "RO2" : "ro2",
+               bd->ebd.desc.cbd_sc & BDRX_LAST    ? "LST" : "lst",
+               bd->ebd.desc.cbd_sc & BDRX_UNDEF10 ? "UNA" : "una",
+               bd->ebd.desc.cbd_sc & BDRX_UNDEF9  ? "UN9" : "un9",
+               bd->ebd.desc.cbd_sc & BDRX_MISS    ? "MIS" : "mis",
+               bd->ebd.desc.cbd_sc & BDRX_BCAST   ? "BCS" : "bcs",
+               bd->ebd.desc.cbd_sc & BDRX_MCAST   ? "MCS" : "mcs",
+               bd->ebd.desc.cbd_sc & BDRX_LENERR  ? "LEN" : "len",
+               bd->ebd.desc.cbd_sc & BDRX_ALGNERR ? "ALN" : "aln",
+               bd->ebd.desc.cbd_sc & BDRX_UNDEF3  ? "UN3" : "un3",
+               bd->ebd.desc.cbd_sc & BDRX_CRCERR  ? "CRC" : "crc",
+               bd->ebd.desc.cbd_sc & BDRX_OVERRUN ? "OVR" : "ovr",
+               bd->ebd.desc.cbd_sc & BDRX_TRUNC   ? "TRC" : "trc");
+        bd += 1;
+    }
+    return 0;
+}
+
+static int fec_rx_ring_open(struct inode *inode, struct file *file)
+{
+    struct fec_enet_private *fep = inode->i_private;
+    return single_open(file, fec_rx_ring_show, fep);
+}
+
+static const struct file_operations fec_rx_ring_fops = {
+    .open        = fec_rx_ring_open,
+    .read        = seq_read,
+    .llseek        = seq_lseek,
+    .release    = single_release,
+};
+
 static int
 fec_probe(struct platform_device *pdev)
 {
     struct fec_enet_private *fep;
     struct fec_platform_data *pdata;
@@ -2558,10 +2636,18 @@ fec_probe(struct platform_device *pdev)

     if (fep->flags & FEC_FLAG_BUFDESC_EX && fep->ptp_clock)
         netdev_info(ndev, "registered PHC device %d\n", fep->dev_id);

     INIT_WORK(&fep->tx_timeout_work, fec_enet_timeout_work);
+
+    if (!debugfs_create_file("fec_rx_ring",
+                 S_IFREG | S_IRUGO,
+                 NULL,
+                 fep,
+                 &fec_rx_ring_fops))
+        goto failed_register;
+
     return 0;

 failed_register:
     fec_enet_mii_remove(fep);
 failed_mii_init:



Regards,
  Jaccon

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2014-05-08  9:23 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-14 14:48 PL310 errata workarounds Russell King - ARM Linux
2014-03-14 15:01 ` Russell King - ARM Linux
2014-03-14 16:02   ` Rob Herring
2014-03-14 17:57     ` Russell King - ARM Linux
2014-03-14 19:14       ` Rob Herring
2014-03-14 20:32         ` Russell King - ARM Linux
2014-03-19 21:22         ` Marek Vasut
2014-03-19 21:35           ` Rob Herring
2014-03-19 21:46             ` Russell King - ARM Linux
2014-03-19 21:54               ` Marek Vasut
2014-03-16 11:52     ` Russell King - ARM Linux
2014-03-17 15:04       ` Rob Herring
2014-03-17 15:37         ` Russell King - ARM Linux
2014-03-17 17:29           ` Catalin Marinas
2014-03-17 19:44             ` Russell King - ARM Linux
2014-03-17 21:09               ` Nicolas Pitre
2014-03-17 21:14                 ` Russell King - ARM Linux
2014-03-17 21:54                   ` Nicolas Pitre
2014-03-16 15:29 ` Russell King - ARM Linux
2014-03-17 14:00   ` Rob Herring
2014-03-17 14:40     ` Russell King - ARM Linux
2014-03-18 11:22     ` *READ THIS IF YOUR SOC HAS A L2 CACHE* PL310 auxctrl settings Russell King - ARM Linux
2014-03-28 12:51       ` Maxime Coquelin
2014-03-28 13:02         ` Russell King - ARM Linux
2014-03-28 13:32           ` Maxime Coquelin
2014-03-18 17:26     ` PL310 errata workarounds Russell King - ARM Linux
2014-03-19 21:52       ` Marek Vasut
2014-03-19 22:51         ` Russell King - ARM Linux
2014-03-19 23:05           ` FEC ethernet issues [Was: PL310 errata workarounds] Marek Vasut
2014-03-20  4:01             ` Marek Vasut
2014-03-20 22:27               ` Fabio Estevam
2014-03-20 23:06                 ` Russell King - ARM Linux
2014-03-21  0:24                   ` Troy Kisky
2014-03-21  1:18                     ` Russell King - ARM Linux
2014-03-21  1:36                       ` Fabio Estevam
2014-03-21  1:43                         ` Fabio Estevam
2014-03-21 16:37                           ` Bernd Faust
     [not found]                           ` <CANBf9eNZB+BVBDkgWNxxGs-ndQ-mYCc6+bfVdS-8T-QLpSZ3GA@mail.gmail.com>
2014-03-21 17:32                             ` Russell King - ARM Linux
2014-03-23 11:38                               ` Bernd Faust
2014-03-23 13:44                                 ` Russell King - ARM Linux
2014-03-24 17:57                               ` robert.daniels at vantagecontrols.com
2014-03-24 20:21                                 ` Marek Vasut
2014-03-24 22:37                                   ` robert.daniels at vantagecontrols.com
2014-03-24 23:44                                     ` Russell King - ARM Linux
2014-03-25  1:02                                       ` Marek Vasut
2014-03-25 23:10                                         ` Russell King - ARM Linux
2014-03-26  0:11                                       ` Russell King - ARM Linux
2014-04-01  9:26                                         ` Russell King - ARM Linux
2014-04-01 14:00                                           ` Eric Nelson
2014-04-01 17:29                                             ` Russell King - ARM Linux
2014-04-01 18:11                                               ` Eric Nelson
2014-04-02  2:26                                               ` fugang.duan at freescale.com
2014-04-01 19:38                                           ` robert.daniels at vantagecontrols.com
2014-04-01 22:51                                             ` Russell King - ARM Linux
2014-04-02  0:37                                               ` robert.daniels at vantagecontrols.com
2014-04-02  3:19                                               ` fugang.duan at freescale.com
2014-04-02  8:59                                                 ` Russell King - ARM Linux
2014-04-02  9:40                                                   ` fugang.duan at freescale.com
2014-04-02 10:46                                                     ` Russell King - ARM Linux
2014-04-02 11:33                                                       ` fugang.duan at freescale.com
2014-04-02 16:51                                                         ` Russell King - ARM Linux
2014-04-03  2:41                                                           ` fugang.duan at freescale.com
2014-04-03  8:56                                                             ` Russell King - ARM Linux
2014-04-03  9:55                                                               ` fugang.duan at freescale.com
2014-04-03 10:32                                                                 ` Russell King - ARM Linux
2014-04-03 13:36                                                                   ` Shawn Guo
2014-04-03 13:45                                                                     ` Russell King - ARM Linux
2014-04-03 14:00                                                                       ` Shawn Guo
2014-04-04 20:21                                               ` robert.daniels at vantagecontrols.com
2014-04-29  9:26                                                 ` Jaccon Bastiaansen
     [not found]                                                 ` <CAGzjT4d8H6YE6P6A1E4aHiPAenJFvZH01LHXaNzVwhF2MRBvWQ@mail.gmail.com>
2014-05-02 11:41                                                   ` Russell King - ARM Linux
2014-05-08  9:23                                                     ` Jaccon Bastiaansen
2014-03-21 15:50           ` PL310 errata workarounds Frank Li
2014-03-21 17:12             ` Russell King - ARM Linux
2014-03-17 16:32   ` Russell King - ARM Linux
2014-03-17 16:51     ` Russell King - ARM Linux

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.