linux-kernel.vger.kernel.org archive mirror
* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
       [not found] <20030829053510.GA12663@mail.jlokier.co.uk.suse.lists.linux.kernel>
@ 2003-08-29 11:08 ` Andi Kleen
  2003-08-29 11:17   ` Russell King
  2003-09-01  5:03   ` Jamie Lokier
  0 siblings, 2 replies; 106+ messages in thread
From: Andi Kleen @ 2003-08-29 11:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent.  The L1

Most x86 and probably most other modern CPUs have virtually addressed L1 caches.
It's just too slow to wait for the MMU for an L1 access which is really critical.

So such artifacts are expected

> data cache is 64k.  (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).

On x86 L2 is usually physically tagged.

Mostly only ARM, MIPS et al. have virtually tagged L2.

-Andi

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 11:08 ` x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this Andi Kleen
@ 2003-08-29 11:17   ` Russell King
  2003-09-01  5:03   ` Jamie Lokier
  1 sibling, 0 replies; 106+ messages in thread
From: Russell King @ 2003-08-29 11:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Jamie Lokier, linux-kernel

On Fri, Aug 29, 2003 at 01:08:51PM +0200, Andi Kleen wrote:
> Jamie Lokier <jamie@shareable.org> writes:
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
> 
> On x86 L2 is usually physically tagged.
> 
> Mostly only ARM, MIPS et al. have virtually tagged L2.

Correction: ARM L1 is mostly VIVT.  L2 cache isn't mandated by the
architecture, and therefore generally doesn't exist.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 11:08 ` x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this Andi Kleen
  2003-08-29 11:17   ` Russell King
@ 2003-09-01  5:03   ` Jamie Lokier
  1 sibling, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  5:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Andi Kleen wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent.  The L1
> 
> Most x86 and probably most other modern CPUs have virtually
> addressed L1 caches.  It's just too slow to wait for the MMU for an
> L1 access which is really critical.
> 
> So such artifacts are expected

I hadn't thought so at first, because there's no artefact at all (not even
a small one) on my Celeron, but you're right.  They don't appear on
any Intels(*), but they do on all AMDs that I have results for.

(*) With the possible exception of one P4 that reports varying results.

> 
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
> 
> On x86 L2 is usually physically tagged.

I'm speculating that L1 is physically tagged, and when there's a
virtual alias the CPU moves data from one L1 line to another.  L2 only
comes into it because the line transfer is slow enough that a
MESI-style transfer through L2 (as if another CPU or device requested
the line) would account for the slowness.

-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-07 13:53                                       ` Jamie Lokier
@ 2003-09-07 17:56                                         ` Alan Cox
  0 siblings, 0 replies; 106+ messages in thread
From: Alan Cox @ 2003-09-07 17:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Pavel Machek, nagendra_tomar, Geert Uytterhoeven, Roman Zippel,
	Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

On Sun, 2003-09-07 at 14:53, Jamie Lokier wrote:
> Pavel Machek wrote:
> > Perhaps weak ordering matters when you are writing to the MMIO, too?
> 
> Perhaps, but the code in arch/i386/kernel/cpu/centaur.c seems to try
> hard to set weak ordering for RAM, not the whole address space.

There are three cases I know of where you get weak store ordering that
is visible in some way

#1 Pentium Pro, due to an erratum, hence the need for lock in the
spin_unlock

#2 Centaur Winchip (where OOSTORE off is worth 10-30% performance on
common tasks). A lot of that has to do with the nature of the CPU and
the old socket 7 bus stuff. It's not SMP, but we have to care about it
for MMIO, not because MMIO is itself out of order (we leave it in order)
but because of DMA. We must ensure that our writes to RAM finish
-before- we kick off the hardware copying the data...

#3 Weak store ordering via SSE-type instructions, where it's intentional
and an sfence is needed eventually



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-07 13:40                                     ` Pavel Machek
@ 2003-09-07 13:53                                       ` Jamie Lokier
  2003-09-07 17:56                                         ` Alan Cox
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-07 13:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel,
	Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Pavel Machek wrote:
> Perhaps weak ordering matters when you are writing to the MMIO, too?

Perhaps, but the code in arch/i386/kernel/cpu/centaur.c seems to try
hard to set weak ordering for RAM, not the whole address space.

-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-07 13:35                                   ` Jamie Lokier
@ 2003-09-07 13:40                                     ` Pavel Machek
  2003-09-07 13:53                                       ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Pavel Machek @ 2003-09-07 13:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Pavel Machek, Alan Cox, nagendra_tomar, Geert Uytterhoeven,
	Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Hi!

Perhaps weak ordering matters when you are writing to the MMIO, too?


> > Wow, seems interesting, how much performance does it buy? [Maybe AMD
> > and Intel just threw a lot of silicon at the problem and it went
> > away. Centaur solution might be nicer, though -- spin_unlock is so
> > uncommon that this seems like a nice optimization.]
> 
> I didn't realise Centaur SMP systems existed, but I guess they must do
> for weak memory writes to mean anything.
> 
> -- Jamie

-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-07 13:10                                 ` Pavel Machek
@ 2003-09-07 13:35                                   ` Jamie Lokier
  2003-09-07 13:40                                     ` Pavel Machek
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-07 13:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel,
	Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Pavel Machek wrote:
> Wow, seems interesting, how much performance does it buy? [Maybe AMD
> and Intel just threw a lot of silicon at the problem and it went
> away. Centaur solution might be nicer, though -- spin_unlock is so
> uncommon that this seems like a nice optimization.]

I didn't realise Centaur SMP systems existed, but I guess they must do
for weak memory writes to mean anything.

-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-06 23:09                               ` Jamie Lokier
@ 2003-09-07 13:10                                 ` Pavel Machek
  2003-09-07 13:35                                   ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Pavel Machek @ 2003-09-07 13:10 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Pavel Machek, Alan Cox, nagendra_tomar, Geert Uytterhoeven,
	Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Hi!

> > > x86 gives you coherency and store ordering (barring errata and special
> > > CPU modes)
> > 
> > Special CPU modes? You mean some special SSE stores?
> 
> Take a look at arch/i386/kernel/cpu/centaur.c, and CONFIG_X86_OOSTORE.
> 
> You can change the memory settings to weakly ordered writes, which
> means that a plain write isn't suitable for spin_unlock.  Presumably
> this mode is faster (though I don't see why, if Intel, AMD et al. can
> manage good memory performance without weak writes).

Wow, seems interesting, how much performance does it buy? [Maybe AMD
and Intel just threw a lot of silicon at the problem and it went
away. Centaur solution might be nicer, though -- spin_unlock is so
uncommon that this seems like a nice optimization.]

-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-05 21:24                             ` Pavel Machek
@ 2003-09-06 23:09                               ` Jamie Lokier
  2003-09-07 13:10                                 ` Pavel Machek
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-06 23:09 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, nagendra_tomar, Geert Uytterhoeven, Roman Zippel,
	Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Pavel Machek wrote:
> > x86 gives you coherency and store ordering (barring errata and special
> > CPU modes)
> 
> Special CPU modes? You mean some special SSE stores?

Take a look at arch/i386/kernel/cpu/centaur.c, and CONFIG_X86_OOSTORE.

You can change the memory settings to weakly ordered writes, which
means that a plain write isn't suitable for spin_unlock.  Presumably
this mode is faster (though I don't see why, if Intel, AMD et al. can
manage good memory performance without weak writes).
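
To illustrate why a plain write stops being enough, here is a minimal
sketch in portable C11 atomics.  It is not the kernel's spin_unlock, and
toy_spinlock, toy_spin_lock and toy_spin_unlock are hypothetical names.
The point is the release store in the unlock path: on an x86 with
ordered stores it can typically be an ordinary store, whereas a CPU
running with weakly ordered writes needs something stronger so that the
writes made inside the critical section are visible before the lock
appears free.

```c
#include <stdatomic.h>

/* Hypothetical user-space lock, for illustration only. */
struct toy_spinlock {
    atomic_int locked;   /* 0 = free, 1 = held */
};

static void toy_spin_lock(struct toy_spinlock *l)
{
    /* atomic_exchange returns the previous value: spin until it was 0. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    /* Release ordering: earlier writes must not be reordered after
     * this store.  A plain "l->locked = 0" gives no such guarantee
     * once the hardware is allowed to reorder stores. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Whether the release store costs anything extra depends on the target;
the sketch only shows where the ordering requirement sits.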

-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-04 11:19                           ` Alan Cox
@ 2003-09-05 21:24                             ` Pavel Machek
  2003-09-06 23:09                               ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Pavel Machek @ 2003-09-05 21:24 UTC (permalink / raw)
  To: Alan Cox
  Cc: nagendra_tomar, Jamie Lokier, Geert Uytterhoeven, Roman Zippel,
	Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Hi!

> > In x86 store buffer is not snooped which leads to all these serialization 
> > issues (other CPUs looking at stale value of data which is in the store 
> > buffer of some other CPU).
> 
> x86 gives you coherency and store ordering (barring errata and special
> CPU modes)

Special CPU modes? You mean some special SSE stores?
								Pavel

-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 17:36                   ` bill davidsen
@ 2003-09-04 22:50                     ` Jamie Lokier
  0 siblings, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-04 22:50 UTC (permalink / raw)
  To: bill davidsen; +Cc: linux-kernel

bill davidsen wrote:
> | Why do you need the same piece of data mapped to multiple places
> | in the first place, and why at specific addresses?  It's purely an
> | optimization of some sort, right?
> 
> I think he said he was doing DSP... there's a trick of double mapping
> the same memory to save one subscript calculation in FFT (or maybe DFT)
> inner loop.

It is for DSP, but nothing to do with FFT.  I hadn't ever thought of
using this technique for FFT, and it would probably make little
difference on a modern CPU given the form of FFT algorithms.

No, I use it to make a circular buffer, in which the data always
appears as a contiguous block - no split.  This is useful for
operations on streams of data, such as FIR & IIR filters, equalisers,
upconverters, downconverters, etc.  Many DSP algorithms fall into this
category.

A characteristic of these algorithms is that they consist of a long,
tight sequence of streaming memory accesses with calculations at each
step.

DSP chips often implement circular buffers by masking the offset into
the buffer's memory.

On a CPU, I prefer to avoid the masking operation which happens for
each address calculation.  This saves a couple of registers, as I can
just use an incrementing pointer into the buffer, rather than a base
address, offset and mask value.  Especially on x86, a couple of
registers saved is good.
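
A minimal sketch of the two addressing styles, to make the register
argument concrete.  RING_SIZE, sum_masked and sum_linear are
hypothetical names, and the power-of-two size is an assumption that
makes the mask valid.

```c
#include <stddef.h>

/* Hypothetical ring size; must be a power of two so that
 * "& RING_MASK" is equivalent to "% RING_SIZE". */
#define RING_SIZE 8u
#define RING_MASK (RING_SIZE - 1u)

/* Sum `taps` samples ending at index `newest`, wrapping with a mask.
 * The loop keeps a base address, an offset and a mask live. */
static int sum_masked(const int *base, size_t newest, size_t taps)
{
    int acc = 0;
    for (size_t k = 0; k < taps; k++)
        acc += base[(newest - k) & RING_MASK];
    return acc;
}

/* The same sum when the window is contiguous in memory (for example
 * through a duplicate mapping): one incrementing pointer suffices. */
static int sum_linear(const int *window, size_t taps)
{
    int acc = 0;
    for (size_t k = 0; k < taps; k++)
        acc += *window++;
    return acc;
}
```

With a real filter the loop body would multiply by coefficients; the
addressing pattern is the part that matters here.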

It's possible to write DSP algorithms which avoid address masking;
after all, a circular buffer in an ordinary array is just two separate
regions.  But that complicates the algorithms especially with corner
cases, and some of them are complicated enough already.

Using the duplicate mappings, I can use the most straightforward
streaming DSP code, and it runs as fast as possible if the mappings
don't incur a penalty.
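
A minimal sketch of such a duplicate mapping, following the three-step
mmap strategy quoted elsewhere in this thread (reserve an anonymous
area, then map the same object with MAP_FIXED at its start and again at
a larger address).  ring_map is a hypothetical helper, backing the
buffer with tmpfile() is an assumption, and the sketch ignores the
SHMLBA alignment that some architectures require for aliases.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `size` bytes of a temporary file twice, back to back, so that
 * base[i] and base[i + size] refer to the same memory.  `size` must be
 * a multiple of the page size.  Returns NULL on failure. */
static void *ring_map(size_t size)
{
    FILE *f = tmpfile();
    if (!f)
        return NULL;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)size) != 0) {
        fclose(f);
        return NULL;
    }

    /* Step 1: let the kernel pick a free area of 2 * size bytes. */
    char *base = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        fclose(f);
        return NULL;
    }

    /* Steps 2 and 3: overlay the area with two shared mappings of
     * the same file offset. */
    void *lo = mmap(base, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + size, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    fclose(f);   /* the mappings keep the file contents alive */

    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * size);
        return NULL;
    }
    return base;
}
```

A reader can then take any window of up to `size` bytes starting
anywhere in the first half and treat it as one contiguous block.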

When mappings aren't available or are too slow, then I just copy the
contents of the buffer backwards whenever the write pointer will cross
the end of the array.  That costs some, but keeps the DSP code simple.
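
A minimal sketch of that copy-backwards fallback.  The names (struct
stream, stream_push, stream_window) and the sizes are hypothetical; the
only requirement assumed is that the array is comfortably larger than
the history a filter needs.

```c
#include <stddef.h>
#include <string.h>

#define HISTORY  4u    /* samples a filter needs to see at once */
#define CAPACITY 16u   /* array length, larger than HISTORY */

struct stream {
    int    data[CAPACITY];
    size_t wr;             /* index of the next write */
};

static void stream_push(struct stream *s, int sample)
{
    if (s->wr == CAPACITY) {
        /* The write pointer is about to cross the end: move the newest
         * HISTORY - 1 samples to the front and continue from there. */
        memmove(s->data, s->data + CAPACITY - (HISTORY - 1),
                (HISTORY - 1) * sizeof s->data[0]);
        s->wr = HISTORY - 1;
    }
    s->data[s->wr++] = sample;
}

/* Newest HISTORY samples, oldest first, always contiguous.
 * Valid once at least HISTORY samples have been pushed. */
static const int *stream_window(const struct stream *s)
{
    return s->data + s->wr - HISTORY;
}
```

The copy happens once per CAPACITY - (HISTORY - 1) pushes, so a larger
array trades memory for fewer copies.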

Fwiw, the test program assesses whether there's a cost to using
duplicate mappings and whether they work.  However, for the above kind
of DSP buffer, the measurement isn't the best it could be (although
it's what I'm using).  There's a balance of factors.  For a large
buffer, it's ok even if page faults were to be needed as we switch
between alias pages, because the access pattern doesn't do that very
often.  Then the occasional page faults are just a potentially faster
version of the copy backwards.  On the other hand, if aliased pages
are made coherent by making them uncacheable (such as on the ARM port),
even though that's much faster than faulting, it isn't good for the
DSP algorithms.

Fwiw#2, in the DSP I'm working on it's better to use the copying
method for most of my buffers even on x86, because they aren't that
large and fit better into L1 cache without the mappings.  Maybe for a
different project, it will get used for more of the buffers.  Mainly,
having developed the testing code, I wanted to know if it worked
properly on the different architectures.  It's nice to see some
spin-offs, such as finding the ARM write buffer bug.

So thanks to everyone who responded.  I'll post a table of the results
soon.

Thanks,
-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 18:05                           ` Russell King
@ 2003-09-04 22:20                             ` Jamie Lokier
  0 siblings, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-04 22:20 UTC (permalink / raw)
  To: Larry McVoy, Paul J.Y. Lahaie, linux-kernel

Russell King wrote:
> > Larry means that it's perfectly normal for libc to map the same file
> > more than once: you have the code section and the data section.
> 
> Code is read-only, data is read-write and is copy on write.  Therefore
> it's a different scenario.

Yes, a thinko on my part :)

-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:12     ` Jamie Lokier
  2003-09-01 11:30       ` Geert Uytterhoeven
  2003-09-01 14:17       ` Russell King
@ 2003-09-04 17:37       ` Maciej W. Rozycki
  2 siblings, 0 replies; 106+ messages in thread
From: Maciej W. Rozycki @ 2003-09-04 17:37 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Mon, 1 Sep 2003, Jamie Lokier wrote:

> Please try the program below, which is the same as before but with
> test_l1_only hopefully improved, and it prints some more helpful
> numbers.

 A few MIPS systems:

1. An R3400-based DECstation 5000/240 -- the CPU has a 64kB I-cache and a
64kB D-cache, both are direct mapped, PIPT:

$ uname -a
Linux 3maxp 2.4.21 #3 Thu Aug 14 04:14:33 CEST 2003 mips unknown unknown GNU/Linux
$ time ./test
(256) [155,155,7] Test separation: 4096 bytes: pass
(256) [155,155,7] Test separation: 8192 bytes: pass
(256) [155,155,7] Test separation: 16384 bytes: pass
(256) [155,155,7] Test separation: 32768 bytes: pass
(256) [155,155,7] Test separation: 65536 bytes: pass
(256) [155,155,7] Test separation: 131072 bytes: pass
(256) [155,155,7] Test separation: 262144 bytes: pass
(256) [155,155,7] Test separation: 524288 bytes: pass
(256) [155,155,7] Test separation: 1048576 bytes: pass
(256) [155,155,7] Test separation: 2097152 bytes: pass
(256) [155,155,7] Test separation: 4194304 bytes: pass
(256) [155,155,7] Test separation: 8388608 bytes: pass
(256) [155,155,7] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
1.01user 0.27system 0:01.33elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (135major+44minor)pagefaults 0swaps
$ cat /proc/cpuinfo
system type		: Digital DECstation 5000/2x0
processor		: 0
cpu model		: R3000A V3.0  FPU V4.0
BogoMIPS		: 39.90
wait instruction	: no
microsecond timers	: no
tlb_entries		: 64
extra interrupt vector	: no
hardware watchpoint	: no
VCED exceptions		: not available
VCEI exceptions		: not available

2. An R4400SC-based DECstation 5000/260 -- the CPU has a 16kB primary
I-cache and a 16kB primary D-cache, both are direct mapped, VIPT, and a
1024kB secondary joint (I+D) cache, direct mapped, PIPT:

$ uname -a
Linux 4maxp64 2.4.21 #19 Mon Aug 25 00:16:25 CEST 2003 mips64 unknown unknown GNU/Linux
$ time ./test
(64) [331,17,3] Test separation: 4096 bytes: FAIL - too slow
(64) [331,17,3] Test separation: 8192 bytes: FAIL - too slow
(128) [38,63,3] Test separation: 16384 bytes: pass
(128) [38,63,3] Test separation: 32768 bytes: pass
(128) [38,63,3] Test separation: 65536 bytes: pass
(128) [38,63,3] Test separation: 131072 bytes: pass
(128) [38,63,3] Test separation: 262144 bytes: pass
(128) [38,63,3] Test separation: 524288 bytes: pass
(128) [38,63,3] Test separation: 1048576 bytes: pass
(128) [38,63,3] Test separation: 2097152 bytes: pass
(128) [38,63,3] Test separation: 4194304 bytes: pass
(128) [38,63,3] Test separation: 8388608 bytes: pass
(128) [38,63,3] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
0.34user 0.14system 0:00.53elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (135major+250minor)pagefaults 0swaps
$ cat /proc/cpuinfo
system type		: Digital DECstation 5000/2x0
processor		: 0
cpu model		: R4400SC V4.0  FPU V0.0
BogoMIPS		: 59.86
wait instruction	: no
microsecond timers	: yes
tlb_entries		: 48
extra interrupt vector	: no
hardware watchpoint	: yes
VCED exceptions		: 464662
VCEI exceptions		: 667534

3. A MIPS 5Kc-based Malta -- the CPU has a 16kB I-cache and a 16kB
D-cache, both are 4-way set associative, VIPT: 

$ uname -a
Linux malta 2.4.21 #5 Sun Aug 3 21:51:32 CEST 2003 mips unknown unknown GNU/Linux
$ time ./test
(128) [25,23,1] Test separation: 4096 bytes: pass
(128) [25,23,1] Test separation: 8192 bytes: pass
(128) [25,23,1] Test separation: 16384 bytes: pass
(128) [25,23,1] Test separation: 32768 bytes: pass
(256) [49,46,1] Test separation: 65536 bytes: pass
(128) [25,23,1] Test separation: 131072 bytes: pass
(128) [25,23,1] Test separation: 262144 bytes: pass
(256) [49,46,1] Test separation: 524288 bytes: pass
(256) [49,46,1] Test separation: 1048576 bytes: pass
(256) [49,46,1] Test separation: 2097152 bytes: pass
(256) [48,45,2] Test separation: 4194304 bytes: pass
(256) [49,46,1] Test separation: 8388608 bytes: pass
(128) [25,23,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
0.22user 0.06system 0:00.30elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (135major+44minor)pagefaults 0swaps
$ cat /proc/cpuinfo
system type		: MIPS Malta
processor		: 0
cpu model		: MIPS 5Kc V0.1
BogoMIPS		: 159.74
wait instruction	: yes
microsecond timers	: yes
tlb_entries		: 32
extra interrupt vector	: yes
hardware watchpoint	: yes
VCED exceptions		: not available
VCEI exceptions		: not available

 The slowdown for the R4400SC processor is surely the result of Virtual
Coherency Exceptions (reported in cpuinfo for both primary caches) -- the
secondary cache (S-cache) remembers a few bits of the virtual address (VA)
and if there is a hit in the S-cache, but the VA bits don't match, an
exception is taken to write back and invalidate the old entry from the
respective primary cache (P-cache) and reset the VA bits to the new value.
Then a reexecution of the faulting instruction does a refill to the
P-cache from the S-cache.  This problem doesn't happen for the two other
processors as neither has an S-cache and also the R3400's P-cache is PIPT. 

 We avoid the hit resulting from cache aliasing for MIPS by aligning maps
appropriately. 
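
The spacing reported by the test fits a simple rule of thumb for
virtually indexed caches, sketched below.  This is arithmetic only and
does not model the S-cache mechanism described above; alias_spacing and
aliases_same_colour are hypothetical names, and 4096-byte pages are
assumed (consistent with "16384 (4 pages)" in the output).

```c
#include <stddef.h>

/* Smallest separation at which two virtual aliases of one physical
 * page index the same cache lines: the size of one cache way, or the
 * page size if a way is no larger than a page. */
static size_t alias_spacing(size_t cache_bytes, size_t ways,
                            size_t page_bytes)
{
    size_t way = cache_bytes / ways;
    return way > page_bytes ? way : page_bytes;
}

/* Non-zero if two page-aligned virtual addresses share a cache colour,
 * i.e. agree in the index bits above the page offset. */
static int aliases_same_colour(size_t va1, size_t va2, size_t cache_bytes,
                               size_t ways, size_t page_bytes)
{
    size_t spacing = alias_spacing(cache_bytes, ways, page_bytes);
    return ((va1 ^ va2) & (spacing - 1) & ~(page_bytes - 1)) == 0;
}
```

For the R4400SC's 16kB direct-mapped P-cache this gives a spacing of
16384 bytes, and for the 5Kc's 16kB 4-way cache it gives 4096 bytes, so
every page-aligned alias has the same colour there.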

  Maciej

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 16:07                         ` Nagendra Singh Tomar
  2003-09-04  5:03                           ` Davide Libenzi
@ 2003-09-04 11:19                           ` Alan Cox
  2003-09-05 21:24                             ` Pavel Machek
  1 sibling, 1 reply; 106+ messages in thread
From: Alan Cox @ 2003-09-04 11:19 UTC (permalink / raw)
  To: nagendra_tomar
  Cc: Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong,
	Linux/m68k kernel mailing list, Linux Kernel Development

On Wed, 2003-09-03 at 17:07, Nagendra Singh Tomar wrote:
> In x86 store buffer is not snooped which leads to all these serialization 
> issues (other CPUs looking at stale value of data which is in the store 
> buffer of some other CPU).

x86 gives you coherency and store ordering (barring errata and special
CPU modes)



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 18:03                             ` Nagendra Singh Tomar
@ 2003-09-04  6:38                               ` Davide Libenzi
  0 siblings, 0 replies; 106+ messages in thread
From: Davide Libenzi @ 2003-09-04  6:38 UTC (permalink / raw)
  To: Nagendra Singh Tomar
  Cc: Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong,
	Linux/m68k kernel mailing list, Linux Kernel Development

On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote:

> I meant to ask if the store buffer is snooped by *other CPUs*. To maintain
> self coherence the local store buffer has to be anyway consulted by local
> loads to give the latest stored value.

There are CPUs (at least some versions of Alpha, 21064 IIRC) that flush
upon an L1 read miss, so they do not snoop their local WB. IIRC P5 has
internal and external snooping while P6, using a write-allocate L1, does
not have external snooping.



- Davide



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 16:07                         ` Nagendra Singh Tomar
@ 2003-09-04  5:03                           ` Davide Libenzi
  2003-09-03 18:03                             ` Nagendra Singh Tomar
  2003-09-04 11:19                           ` Alan Cox
  1 sibling, 1 reply; 106+ messages in thread
From: Davide Libenzi @ 2003-09-04  5:03 UTC (permalink / raw)
  To: Nagendra Singh Tomar
  Cc: Jamie Lokier, Geert Uytterhoeven, Roman Zippel, Kars de Jong,
	Linux/m68k kernel mailing list, Linux Kernel Development

On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote:

> Jamie,
> 	Just wondered if the store buffer is snooped in some
> architectures. In that case I believe the OS need not do anything for
> serialization (except for aliases, if they do not hit the same cache line).
> In x86 store buffer is not snooped which leads to all these serialization
> issues (other CPUs looking at stale value of data which is in the store
> buffer of some other CPU).
> Please correct me if I have got anything wrong.

To avoid the so called 'load hazard' (that, BTW, triggers read over
writes, that are not allowed in x86) you have two options. Snoop the write
buffer or flush it upon L1 miss. Otherwise you might end up getting stale
data from L2.
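
A toy model of the two options, as a sketch only.  It is not a
description of any real CPU: the `mem` array stands in for the cache
hierarchy behind the buffer, the flush variant drains on every load
rather than only on an L1 miss, and all names (store_buffer, sb_store,
load_snoop, load_flush, load_naive) are hypothetical.

```c
#include <stddef.h>

#define SB_SLOTS 4   /* hypothetical store-buffer depth */

struct store_buffer {
    size_t addr[SB_SLOTS];
    int    val[SB_SLOTS];
    size_t n;              /* pending stores, oldest first */
};

/* Write all pending stores to memory in program order. */
static void sb_drain(struct store_buffer *sb, int *mem)
{
    for (size_t i = 0; i < sb->n; i++)
        mem[sb->addr[i]] = sb->val[i];
    sb->n = 0;
}

/* Queue a store, draining first if the buffer is full. */
static void sb_store(struct store_buffer *sb, int *mem, size_t addr, int val)
{
    if (sb->n == SB_SLOTS)
        sb_drain(sb, mem);
    sb->addr[sb->n] = addr;
    sb->val[sb->n] = val;
    sb->n++;
}

/* Option 1: snoop the buffer, newest entry first, before memory. */
static int load_snoop(const struct store_buffer *sb, const int *mem,
                      size_t addr)
{
    for (size_t i = sb->n; i-- > 0; )
        if (sb->addr[i] == addr)
            return sb->val[i];
    return mem[addr];
}

/* Option 2: flush the buffer, then read memory. */
static int load_flush(struct store_buffer *sb, int *mem, size_t addr)
{
    sb_drain(sb, mem);
    return mem[addr];
}

/* Neither option: the load bypasses the buffer and can see stale data. */
static int load_naive(const int *mem, size_t addr)
{
    return mem[addr];
}
```

After a buffered store, load_naive still returns the old value while
both load_snoop and load_flush return the new one.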



- Davide



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  7:41                         ` Jamie Lokier
@ 2003-09-03 18:05                           ` Russell King
  2003-09-04 22:20                             ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-03 18:05 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Larry McVoy, Paul J.Y. Lahaie, linux-kernel

On Wed, Sep 03, 2003 at 08:41:34AM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > > > Multiple mappings of the same object rarely occur in my experience, so
> > > > the resulting performance loss caused by working around the cache and
> > > > writebuffer is something we can live with.
> > > 
> > > Multiple *writable* mappings.   Don't forget about libc et al.
> > 
> > I mean in the same group of threads with the same struct mm, not the whole
> > system.
> 
> Larry means that it's perfectly normal for libc to map the same file
> more than once: you have the code section and the data section.

Code is read-only, data is read-write and is copy on write.  Therefore
it's a different scenario.

Practical tests indicate that the vast majority of applications do not
trip the test.

You're right in theory, but I don't particularly care about theory when
it's real life which matters.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-04  5:03                           ` Davide Libenzi
@ 2003-09-03 18:03                             ` Nagendra Singh Tomar
  2003-09-04  6:38                               ` Davide Libenzi
  0 siblings, 1 reply; 106+ messages in thread
From: Nagendra Singh Tomar @ 2003-09-03 18:03 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Tomar, Nagendra, Jamie Lokier, Geert Uytterhoeven, Roman Zippel,
	Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development


On Thu, 4 Sep 2003, Davide Libenzi wrote:

> On Wed, 3 Sep 2003, Nagendra Singh Tomar wrote:
> 
> > Jamie,
> > 	Just wondered if the store buffer is snooped in some
> > architectures. In that case I believe the OS need not do anything for
> > serialization (except for aliases, if they do not hit the same cache line).
> > In x86 store buffer is not snooped which leads to all these serialization
> > issues (other CPUs looking at stale value of data which is in the store
> > buffer of some other CPU).
> > Please correct me if I have got anything wrong.
> 
> To avoid the so called 'load hazard' (that, BTW, triggers read over
> writes, that are not allowed in x86) you have two options. Snoop the write
> buffer or flush it upon L1 miss. Otherwise you might end up getting stale
> data from L2.
> 

I meant to ask if the store buffer is snooped by *other CPUs*. To maintain 
self coherence the local store buffer has to be anyway consulted by local 
loads to give the latest stored value. 

Thanx,

tomar
> 
> 
> - Davide
> 



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  9:02                 ` David S. Miller
  2003-09-01 10:04                   ` Jamie Lokier
@ 2003-09-03 17:36                   ` bill davidsen
  2003-09-04 22:50                     ` Jamie Lokier
  1 sibling, 1 reply; 106+ messages in thread
From: bill davidsen @ 2003-09-03 17:36 UTC (permalink / raw)
  To: linux-kernel

In article <20030901020203.1779efe8.davem@redhat.com>,
David S. Miller <davem@redhat.com> wrote:

| > This is my strategy:
| > 
| > 	mmap MAP_ANON without MAP_FIXED to find a free area
| > 	mmap MAP_FIXED over the anon area at same address
| > 	mmap MAP_FIXED over the anon area at larger address
| > 
| > I don't see any strategy that lets me establish this kind of circular
| > mapping on Sparc without either (a) knowing the value of SHMLBA, or
| > (b) risking clobbering another thread's mmap.
| 
| Why do you need the same piece of data mapped to multiple places
| in the first place, and why at specific addresses?  It's purely an
| optimization of some sort, right?

I think he said he was doing DSP... there's a trick of double mapping
the same memory to save one subscript calculation in FFT (or maybe DFT)
inner loop. The only reason I know this is that a friend did a master's
thesis on DSP about 20 years ago, and I absorbed some info I hope to
never need. He also coded an FFT instruction in the LCS (programmable
firmware) of a VAX.

I am only speculating, of course.
-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 13:29                       ` Jamie Lokier
@ 2003-09-03 16:07                         ` Nagendra Singh Tomar
  2003-09-04  5:03                           ` Davide Libenzi
  2003-09-04 11:19                           ` Alan Cox
  0 siblings, 2 replies; 106+ messages in thread
From: Nagendra Singh Tomar @ 2003-09-03 16:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Geert Uytterhoeven, Roman Zippel, Kars de Jong,
	Linux/m68k kernel mailing list, Linux Kernel Development

Jamie,
	Just wondered if the store buffer is snooped in some 
architectures. In that case I believe the OS need not do anything for 
serialization (except for aliases, if they do not hit the same cache line). 
In x86 store buffer is not snooped which leads to all these serialization 
issues (other CPUs looking at stale value of data which is in the store 
buffer of some other CPU).
Please correct me if I have got anything wrong.

Thanx,
tomar



 On Wed, 3 Sep 2003, Jamie Lokier wrote:

> Geert Uytterhoeven wrote:
> > > BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060
> > > caches are PIPT.
> > 
> > That explains a bit. But the '060 stores are coherent, while the '040 stores
> > aren't.
> 
> The L1 cache is coherent on the '040 according to the results.  It's
> the store buffer snooping which fails.  Presumably the CPU core is
> looking ahead at recent writes comparing just virtual addresses.
> 
> -- Jamie


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 12:36                     ` Geert Uytterhoeven
@ 2003-09-03 13:29                       ` Jamie Lokier
  2003-09-03 16:07                         ` Nagendra Singh Tomar
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-03 13:29 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Roman Zippel, Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Geert Uytterhoeven wrote:
> > BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 
> > caches are PIPT.
> 
> That explains a bit. But the '060 stores are coherent, while the '040 stores
> aren't.

The L1 cache is coherent on the '040 according to the results.  It's
the store buffer snooping which fails.  Presumably the CPU core is
looking ahead at recent writes comparing just virtual addresses.
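
To illustrate the guess above, here is a toy C model of a store buffer
that forwards loads by comparing addresses.  It is only a sketch: the
entry layout, the eight-entry size and both lookup functions are made up
for illustration and are not documented '040 behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy store buffer.  Everything here is hypothetical and exists only to
 * show the difference between matching on virtual and physical address. */
struct sb_entry {
	uintptr_t vaddr;	/* virtual address the store was issued to */
	uintptr_t paddr;	/* physical address after translation */
	uint32_t value;
};

struct store_buffer {
	struct sb_entry e[8];
	size_t n;		/* number of pending stores, oldest first */
};

void sb_store(struct store_buffer *sb, uintptr_t va, uintptr_t pa, uint32_t v)
{
	assert(sb->n < 8);
	sb->e[sb->n].vaddr = va;
	sb->e[sb->n].paddr = pa;
	sb->e[sb->n].value = v;
	sb->n++;
}

/* Forward the youngest pending store whose virtual address matches.
 * Returns 1 and fills *out on a hit, 0 on a miss. */
int sb_load_by_vaddr(const struct store_buffer *sb, uintptr_t va, uint32_t *out)
{
	for (size_t i = sb->n; i-- > 0; ) {
		if (sb->e[i].vaddr == va) {
			*out = sb->e[i].value;
			return 1;
		}
	}
	return 0;
}

/* Same search, but matching on the physical address. */
int sb_load_by_paddr(const struct store_buffer *sb, uintptr_t pa, uint32_t *out)
{
	for (size_t i = sb->n; i-- > 0; ) {
		if (sb->e[i].paddr == pa) {
			*out = sb->e[i].value;
			return 1;
		}
	}
	return 0;
}
```

With two pending stores to one physical address through two different
virtual addresses, sb_load_by_vaddr() on the first virtual address hands
back the older value, while sb_load_by_paddr() hands back the newer one.
That is the kind of stale read the test would report.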

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03 12:17                   ` Roman Zippel
@ 2003-09-03 12:36                     ` Geert Uytterhoeven
  2003-09-03 13:29                       ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-09-03 12:36 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Jamie Lokier, Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

On Wed, 3 Sep 2003, Roman Zippel wrote:
> On Wed, 3 Sep 2003, Geert Uytterhoeven wrote:
> > > Does the 68020 even _have_ the equivalent of a store buffer?
> > 
> > Good question :-)
> > 
> > After I sent the previous mail, I realized the '030 has 256 bytes I cache and
> > 256 bytes D cache, while the '020 has 256 bytes I cache only.
> 
> BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 
> caches are PIPT.

That explains a bit. But the '060 stores are coherent, while the '040 stores
aren't.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  9:26                 ` Geert Uytterhoeven
@ 2003-09-03 12:17                   ` Roman Zippel
  2003-09-03 12:36                     ` Geert Uytterhoeven
  0 siblings, 1 reply; 106+ messages in thread
From: Roman Zippel @ 2003-09-03 12:17 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Jamie Lokier, Kars de Jong, Linux/m68k kernel mailing list,
	Linux Kernel Development

Hi,

On Wed, 3 Sep 2003, Geert Uytterhoeven wrote:

> > Does the 68020 even _have_ the equivalent of a store buffer?
> 
> Good question :-)
> 
> After I sent the previous mail, I realized the '030 has 256 bytes I cache and
> 256 bytes D cache, while the '020 has 256 bytes I cache only.

BTW the 020/030 caches are VIVT (and also only writethrough), the 040/060 
caches are PIPT.

bye, Roman


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  7:59             ` Geert Uytterhoeven
  2003-09-03  9:13               ` Jamie Lokier
@ 2003-09-03 12:13               ` Jan-Benedict Glaw
  1 sibling, 0 replies; 106+ messages in thread
From: Jan-Benedict Glaw @ 2003-09-03 12:13 UTC (permalink / raw)
  To: Linux/m68k kernel mailing list, Linux Kernel Development

[-- Attachment #1: Type: text/plain, Size: 635 bytes --]

On Wed, 2003-09-03 09:59:02 +0200, Geert Uytterhoeven <geert@linux-m68k.org>
wrote in message <Pine.GSO.4.21.0309030958130.6985-100000@waterleaf.sonytel.be>:
> On 2 Sep 2003, Kars de Jong wrote:
> Now all that's left is the 68030.

Maybe I'll get my Amiga 3000 installed one of these days... I think it has
got a 68030.

MfG, JBG

-- 
   Jan-Benedict Glaw       jbglaw@lug-owl.de    . +49-172-7608481
   "Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg
    fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!
      ret = do_actions((curr | FREE_SPEECH) & ~(IRAQ_WAR_2 | DRM | TCPA));

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  9:13               ` Jamie Lokier
@ 2003-09-03  9:26                 ` Geert Uytterhoeven
  2003-09-03 12:17                   ` Roman Zippel
  0 siblings, 1 reply; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-09-03  9:26 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development

On Wed, 3 Sep 2003, Jamie Lokier wrote:
> Geert Uytterhoeven wrote:
> > So the store buffer is coherent on 68020 with external MMU, while it
> > isn't on 68040 with internal MMU...
> 
> Does the 68020 even _have_ the equivalent of a store buffer?

Good question :-)

After I sent the previous mail, I realized the '030 has 256 bytes I cache and
256 bytes D cache, while the '020 has 256 bytes I cache only.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  8:05         ` Geert Uytterhoeven
@ 2003-09-03  9:24           ` Kars de Jong
  0 siblings, 0 replies; 106+ messages in thread
From: Kars de Jong @ 2003-09-03  9:24 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development

On Wed, 2003-09-03 at 10:05, Geert Uytterhoeven wrote:
> On 3 Sep 2003, Kars de Jong wrote:
> > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > > got a 68040, that leaves us with:
> > >   - 68030
> > 
> > Ah, I forgot, I've got one of these here too, a Motorola MVME147 board:
> > 
> > sasscm:/tmp# time ./jamie_test2
> > Test separation: 4096 bytes: FAIL - cache not coherent
> 
> I guess the Plessey PME 68-22 didn't have cache, since the test passed?

No, no cache. Well. A very tiny instruction cache in the 68020 itself.

Regards,

Kars.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  7:59             ` Geert Uytterhoeven
@ 2003-09-03  9:13               ` Jamie Lokier
  2003-09-03  9:26                 ` Geert Uytterhoeven
  2003-09-03 12:13               ` Jan-Benedict Glaw
  1 sibling, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-03  9:13 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Kars de Jong, Linux/m68k kernel mailing list, Linux Kernel Development

Geert Uytterhoeven wrote:
> So the store buffer is coherent on 68020 with external MMU, while it
> isn't on 68040 with internal MMU...

Does the 68020 even _have_ the equivalent of a store buffer?

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  8:00       ` Kars de Jong
@ 2003-09-03  8:05         ` Geert Uytterhoeven
  2003-09-03  9:24           ` Kars de Jong
  0 siblings, 1 reply; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-09-03  8:05 UTC (permalink / raw)
  To: Kars de Jong
  Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development

On 3 Sep 2003, Kars de Jong wrote:
> On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > got a 68040, that leaves us with:
> >   - 68030
> 
> Ah, I forgot, I've got one of these here too, a Motorola MVME147 board:
> 
> sasscm:/tmp# time ./jamie_test2
> Test separation: 4096 bytes: FAIL - cache not coherent

I guess the Plessey PME 68-22 didn't have cache, since the test passed?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  8:34     ` Geert Uytterhoeven
  2003-09-01  9:09       ` Kars de Jong
  2003-09-01 10:35       ` Sam Creasey
@ 2003-09-03  8:00       ` Kars de Jong
  2003-09-03  8:05         ` Geert Uytterhoeven
  2 siblings, 1 reply; 106+ messages in thread
From: Kars de Jong @ 2003-09-03  8:00 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development

On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> BTW, probably you want us to run your test program on other m68k boxes? Mine
> got a 68040, that leaves us with:
>   - 68030

Ah, I forgot, I've got one of these here too, a Motorola MVME147 board:

sasscm:/tmp# time ./jamie_test2
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead
 
real    0m1.149s
user    0m0.240s
sys     0m0.670s
sasscm:/tmp# cat /proc/cpuinfo
CPU:            68030
MMU:            68030
FPU:            68882
Clocking:       19.6MHz
BogoMips:       4.90
Calibration:    24512 loops

Regards,

Kars.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02 20:42           ` Kars de Jong
  2003-09-02 21:39             ` Jamie Lokier
@ 2003-09-03  7:59             ` Geert Uytterhoeven
  2003-09-03  9:13               ` Jamie Lokier
  2003-09-03 12:13               ` Jan-Benedict Glaw
  1 sibling, 2 replies; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-09-03  7:59 UTC (permalink / raw)
  To: Kars de Jong
  Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development

On 2 Sep 2003, Kars de Jong wrote:
> fikkie:/tmp# ./jamie_test
> Test separation: 4096 bytes: pass
> Test separation: 8192 bytes: pass
> Test separation: 16384 bytes: pass
> Test separation: 32768 bytes: pass
> Test separation: 65536 bytes: pass
> Test separation: 131072 bytes: pass
> Test separation: 262144 bytes: pass
> Test separation: 524288 bytes: pass
> Test separation: 1048576 bytes: pass
> Test separation: 2097152 bytes: pass
> Test separation: 4194304 bytes: pass
> Test separation: 8388608 bytes: pass
> Test separation: 16777216 bytes: pass
> VM page alias coherency test: all sizes passed
> 
> New program:
> 
> fikkie:/tmp# time ./jamie_test2
> (2048) [10000,10000,0] Test separation: 4096 bytes: pass
> (2048) [10000,10000,0] Test separation: 8192 bytes: pass
> (2048) [10000,10000,0] Test separation: 16384 bytes: pass
> (2048) [10000,10000,0] Test separation: 32768 bytes: pass
> (2048) [10000,10000,0] Test separation: 65536 bytes: pass
> (2048) [10000,10000,0] Test separation: 131072 bytes: pass
> (2048) [10000,10000,0] Test separation: 262144 bytes: pass
> (2048) [10000,10000,0] Test separation: 524288 bytes: pass
> (2048) [10000,10000,0] Test separation: 1048576 bytes: pass
> (2048) [10000,10000,0] Test separation: 2097152 bytes: pass
> (2048) [10000,10000,0] Test separation: 4194304 bytes: pass
> (2048) [10000,10000,0] Test separation: 8388608 bytes: pass
> (2048) [10000,10000,0] Test separation: 16777216 bytes: pass
> VM page alias coherency test: all sizes passed
>                                                                                 
> real    1m51.210s
> user    1m44.950s
> sys     0m4.930s
> fikkie:/tmp# cat /proc/cpuinfo
> CPU:            68020
> MMU:            68851
> FPU:            68881
> Clocking:       15.6MHz
> BogoMips:       3.90
> Calibration:    19520 loops
> fikkie:/tmp#

So the store buffer is coherent on 68020 with external MMU, while it isn't on
68040 with internal MMU...

Now all that's left is the 68030.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-03  7:31                       ` Russell King
@ 2003-09-03  7:41                         ` Jamie Lokier
  2003-09-03 18:05                           ` Russell King
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-03  7:41 UTC (permalink / raw)
  To: Larry McVoy, Paul J.Y. Lahaie, linux-kernel

Russell King wrote:
> > > Multiple mappings of the same object rarely occur in my experience, so
> > > the resulting performance loss caused by working around the cache and
> > > writebuffer is something we can live with.
> > 
> > Multiple *writable* mappings.   Don't forget about libc et al.
> 
> I mean in the same group of threads with the same struct mm, not the whole
> system.

Larry means that it's perfectly normal for libc to map the same file
more than once: you have the code section and the data section.

I don't know if ARM's ELF is like the x86, but on the x86 the final
partial page of code or read-only data will be mapped twice, as the
latter part of the page can contain writable data.  This avoids
wasting up to a page's worth of bytes in the ELF file.

-- Jamie


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02 23:59                     ` Larry McVoy
@ 2003-09-03  7:31                       ` Russell King
  2003-09-03  7:41                         ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-03  7:31 UTC (permalink / raw)
  To: Larry McVoy, Jamie Lokier, Paul J.Y. Lahaie, linux-kernel

On Tue, Sep 02, 2003 at 04:59:00PM -0700, Larry McVoy wrote:
> On Tue, Sep 02, 2003 at 07:52:22PM +0100, Russell King wrote:
> > Multiple mappings of the same object rarely occur in my experience, so
> > the resulting performance loss caused by working around the cache and
> > writebuffer is something we can live with.
> 
> Multiple *writable* mappings.   Don't forget about libc et al.

I mean in the same group of threads with the same struct mm, not the whole
system.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02 18:52                   ` Russell King
@ 2003-09-02 23:59                     ` Larry McVoy
  2003-09-03  7:31                       ` Russell King
  0 siblings, 1 reply; 106+ messages in thread
From: Larry McVoy @ 2003-09-02 23:59 UTC (permalink / raw)
  To: Jamie Lokier, Paul J.Y. Lahaie, linux-kernel

On Tue, Sep 02, 2003 at 07:52:22PM +0100, Russell King wrote:
> Multiple mappings of the same object rarely occur in my experience, so
> the resulting performance loss caused by working around the cache and
> writebuffer is something we can live with.

Multiple *writable* mappings.   Don't forget about libc et al.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02 20:42           ` Kars de Jong
@ 2003-09-02 21:39             ` Jamie Lokier
  2003-09-03  7:59             ` Geert Uytterhoeven
  1 sibling, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-02 21:39 UTC (permalink / raw)
  To: Kars de Jong
  Cc: Geert Uytterhoeven, Linux/m68k kernel mailing list,
	Linux Kernel Development

Kars de Jong wrote:
> And no, this board has no way of getting a better time resolution than
> the 100 Hz tick timer either ;-)

The coherency test is fine.  That's just logic.

The clock granularity got me wondering whether the timing measurement
is meaningful on these machines.  It's possible for the shared test to
take 2000 microseconds and the unshared test to take 10 microseconds,
and they can still show up as 10ms if they both cross a clock tick
boundary.

The minimum of 128 tests of each type is likely to report 0 until
timing_loops is large enough to make all 128 consistently take almost
10ms, depending on when each test starts.  Then, as we only care
whether there is a ratio of approximately 2:1 or more, it is fine.

That depends on the timing of each test not being synchronised with
the clock ticks, or when they are, that not affecting the result.

I'm not sure, but I have a feeling that the random shuffle makes it ok.
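
The tick arithmetic above can be sketched in a few lines of C.  This is
a simplified model, not code from the test program: it assumes a clock
that only advances on a tick boundary, and it stands in for
unsynchronised starts by spreading the start times evenly over one tick.
The function names are hypothetical.

```c
#include <assert.h>

/* Time reported by a clock that only advances once per tick, for an
 * interval that really starts at start_us and lasts duration_us. */
long ticked_elapsed_us(long start_us, long duration_us, long tick_us)
{
	assert(start_us >= 0 && duration_us >= 0 && tick_us > 0);
	long first = start_us / tick_us;
	long last = (start_us + duration_us) / tick_us;
	return (last - first) * tick_us;
}

/* Minimum reported time over `runs` runs whose start times are spread
 * evenly across one tick period. */
long min_ticked_elapsed_us(long duration_us, long tick_us, int runs)
{
	assert(runs > 0);
	long best = ticked_elapsed_us(0, duration_us, tick_us);
	for (int i = 1; i < runs; i++) {
		long start = (tick_us * i) / runs;
		long t = ticked_elapsed_us(start, duration_us, tick_us);
		if (t < best)
			best = t;
	}
	return best;
}
```

In this model a 10 microsecond test starting 5 microseconds before a
tick and a 2000 microsecond test starting 1000 microseconds before it
both report one full 10ms tick, and the minimum over 128 runs stays at 0
until a single run lasts at least one whole tick.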

Hmm.

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:08         ` Jamie Lokier
  2003-09-01 11:13           ` Roman Zippel
@ 2003-09-02 20:42           ` Kars de Jong
  2003-09-02 21:39             ` Jamie Lokier
  2003-09-03  7:59             ` Geert Uytterhoeven
  1 sibling, 2 replies; 106+ messages in thread
From: Kars de Jong @ 2003-09-02 20:42 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Geert Uytterhoeven, Linux/m68k kernel mailing list,
	Linux Kernel Development

On Mon, 2003-09-01 at 12:08, Jamie Lokier wrote:
> Kars de Jong wrote:
> > On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > > got a 68040, that leaves us with:
> > >   - 68020+68851
> > >   - 68060
> > 
> > I can run it on these boxes if no-one else has done it yet before I come
> > home tonight. I'm sure there are more people with a 68060 out there, not
> > too sure about the 68020+68851.
> 
> I would prefer that you run the attached program.  It fixes a bug in
> the function which tests whether the problem is in the L1 cache or
> store buffer.  The bug probably didn't affect the test, but it might
> have.
> 
> Ideally you could run the program Geert linked to as well?
> Please remember to compile both with optimisation.

OK, here are my results (I'll skip the 68060 because Roman has already
run the program on that one):

This is on a Plessey PME 68-22. It's sooooo fast... Sam, is there a Sun
slower than this?

Original program:

fikkie:/tmp# ./jamie_test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

New program:

fikkie:/tmp# time ./jamie_test2
(2048) [10000,10000,0] Test separation: 4096 bytes: pass
(2048) [10000,10000,0] Test separation: 8192 bytes: pass
(2048) [10000,10000,0] Test separation: 16384 bytes: pass
(2048) [10000,10000,0] Test separation: 32768 bytes: pass
(2048) [10000,10000,0] Test separation: 65536 bytes: pass
(2048) [10000,10000,0] Test separation: 131072 bytes: pass
(2048) [10000,10000,0] Test separation: 262144 bytes: pass
(2048) [10000,10000,0] Test separation: 524288 bytes: pass
(2048) [10000,10000,0] Test separation: 1048576 bytes: pass
(2048) [10000,10000,0] Test separation: 2097152 bytes: pass
(2048) [10000,10000,0] Test separation: 4194304 bytes: pass
(2048) [10000,10000,0] Test separation: 8388608 bytes: pass
(2048) [10000,10000,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
                                                                                
real    1m51.210s
user    1m44.950s
sys     0m4.930s
fikkie:/tmp# cat /proc/cpuinfo
CPU:            68020
MMU:            68851
FPU:            68881
Clocking:       15.6MHz
BogoMips:       3.90
Calibration:    19520 loops
fikkie:/tmp#

And no, this board has no way of getting a better time resolution than
the 100 Hz tick timer either ;-)

Regards,

Kars.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 14:43     ` Larry McVoy
  2003-09-01 16:33       ` Jamie Lokier
@ 2003-09-02 20:29       ` Jamie Lokier
  1 sibling, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-02 20:29 UTC (permalink / raw)
  To: Larry McVoy, Larry McVoy, linux-kernel

Larry McVoy wrote:
> Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC, s390
> on Linux and hpux/parisc, {freebsd, netbsd, openbsd}/x86, sco/x86, 
> solaris/sparc, solaris/x86, irix/mips, osx/ppc, aix/ppc, tru64/alpha.

It's interesting to see all the free unixes, Solaris and SCO have no
trouble mapping files.  But AIX, HPUX and whatever environment you
have on Windows XP couldn't even do the mmaps.

Would you be able to try the aix/ppc, hpux/parisc and Windows XP (or
any Windows) tests again, but this time try each of these:

	1. Compile with -DHAVE_SHM_OPEN
	2. Compile with -DHAVE_SYSV_SHM
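
For reference, this is roughly what a System V shared memory fallback
can look like.  It is a minimal sketch and not the code the test program
selects with HAVE_SYSV_SHM (that code is not shown here); the function
name and the fixed 4096-byte segment size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Attach one SysV segment at two addresses, write through the first,
 * write through the second, then read back through the first.
 * Returns 1 if the read sees the second write, 0 if it sees stale data,
 * -1 if the segment could not be created or attached. */
int sysv_alias_check(void)
{
	int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	if (id < 0)
		return -1;

	volatile int *a = shmat(id, NULL, 0);
	volatile int *b = shmat(id, NULL, 0);
	/* Mark for removal now; the segment goes away after the last detach. */
	shmctl(id, IPC_RMID, NULL);

	if (a == (void *)-1 || b == (void *)-1) {
		if (a != (void *)-1)
			shmdt((const void *)a);
		if (b != (void *)-1)
			shmdt((const void *)b);
		return -1;
	}

	*a = 1;			/* first write, via mapping A */
	*b = 2;			/* second write, via mapping B */
	int seen = *a;		/* read back via mapping A */

	shmdt((const void *)a);
	shmdt((const void *)b);
	return seen == 2;
}
```

A result of -1 only says that SysV shared memory is unavailable on the
system, which is different from a coherency failure.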

Thanks again,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02 11:57                 ` Jamie Lokier
@ 2003-09-02 18:52                   ` Russell King
  2003-09-02 23:59                     ` Larry McVoy
  0 siblings, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-02 18:52 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel

On Tue, Sep 02, 2003 at 12:57:31PM +0100, Jamie Lokier wrote:
> You say that "reading from the first mapping _should_ return the
> second write value no matter what", but that there's a bug in the
> write buffer and it isn't doing that.
> 
> I'm saying that the bug can't be that, because such a bug would affect
> normal applications.

I know of no other explanation which fits with the information I have
available to me here.  If you'd care to speculate further, you may,
but I see further speculation as being rather academic, unless it comes
from one of the people who designed the chip.

All this is, however, immaterial - the facts are that the write buffer
is buggy, this test detects it, and we can take fairly easy measures
to ensure we fix it up.

Multiple mappings of the same object rarely occur in my experience, so
the resulting performance loss caused by working around the cache and
writebuffer is something we can live with.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02  8:15               ` Russell King
@ 2003-09-02 11:57                 ` Jamie Lokier
  2003-09-02 18:52                   ` Russell King
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-02 11:57 UTC (permalink / raw)
  To: Paul J.Y. Lahaie, linux-kernel

Russell King wrote:
> > > If you take a moment to think about what should be going on -
> > > 
> > > - first write gets translated to physical address, and the address with
> > >   the data is placed in the write buffer.
> > > - second write gets translated to the same physical address, and the
> > >   address and data is placed into the write buffer such that we store
> > >   the first write then the second write to the same physical memory.
> > > - reading from the first mapping should return the second write's value
> > >   no matter what.
> > 
> > That is an incomplete explanation, because it should never be possible
> > for reads to access data from the write buffer which isn't the most
> > recent.
> 
> Umm, that's what I said.

You say that "reading from the first mapping _should_ return the
second write value no matter what", but that there's a bug in the
write buffer and it isn't doing that.

I'm saying that the bug can't be that, because such a bug would affect
normal applications.
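
For anyone who wants to poke at this from user space, the
write/write/read sequence quoted above can be sketched with two shared
mappings of one file page.  This is a minimal illustration, not the test
program from this thread: the function name and the temporary file
template are hypothetical, and a single pass like this may well miss a
timing-dependent write buffer bug.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page of a temporary file twice, write via the first mapping,
 * write via the second, then read back via the first.
 * Returns 1 if the read sees the second write, 0 if it sees stale data,
 * -1 if the mappings could not be set up. */
int alias_write_write_read(void)
{
	char path[] = "/tmp/aliasXXXXXX";	/* hypothetical template */
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;

	int fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0) {
		close(fd);
		return -1;
	}

	volatile int *a = mmap(NULL, page, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
	volatile int *b = mmap(NULL, page, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
	close(fd);
	if (a == MAP_FAILED || b == MAP_FAILED) {
		if (a != MAP_FAILED)
			munmap((void *)a, page);
		if (b != MAP_FAILED)
			munmap((void *)b, page);
		return -1;
	}

	*a = 1;			/* first write, via mapping A */
	*b = 2;			/* second write, via mapping B */
	int seen = *a;		/* read back via mapping A */

	munmap((void *)a, page);
	munmap((void *)b, page);
	return seen == 2;
}
```

On hardware with coherent aliases this should return 1.  A return of 0
would mean the read through the first mapping did not observe the store
made through the second.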

> > Don't some of the ARMs execute two instructions concurrently, like
> > the original Pentium?
> 
> Nope - they're all single issue CPUs, and, if non-buggy, they guarantee
> that stores never bypass loads.  (In a later architecture revision, this
> is controllable.)
>
> Remember - ARM CPUs aren't high spec desktop CPUs.  They're embedded
> CPUs where power consumption matters.  Superscalar/multiple issue/high
> performance isn't viable in many such embedded environments.

Fair enough.  I recall someone mentioning a dual issue ARM once upon a
time, that's all.

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (21 preceding siblings ...)
  2003-09-01  1:13 ` dean gaudet
@ 2003-09-02 10:08 ` Jan Rychter
  22 siblings, 0 replies; 106+ messages in thread
From: Jan Rychter @ 2003-09-02 10:08 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1609 bytes --]

> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
> 
> It searches for that address multiple which an application can use to
> get coherent multiple mappings of shared memory, with good performance.

From a Sharp Zaurus C-760. Not very interesting, I'm afraid:

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: FAIL - too slow
Test separation: 65536 bytes: FAIL - too slow
Test separation: 131072 bytes: FAIL - too slow
Test separation: 262144 bytes: FAIL - too slow
Test separation: 524288 bytes: FAIL - too slow
Test separation: 1048576 bytes: FAIL - too slow
Test separation: 2097152 bytes: FAIL - too slow
Test separation: 4194304 bytes: FAIL - too slow
Test separation: 8388608 bytes: FAIL - too slow
Test separation: 16777216 bytes: FAIL - too slow
VM page alias coherency test: failed; will use copy buffers instead


Processor	: Intel XScale-PXA255 rev 6 (v5l)
BogoMIPS	: 397.31
Features	: swp half thumb fastmult edsp 
CPU implementor	: 0x69
CPU architecture: 5TE
CPU variant	: 0x0
CPU part	: 0x2d0
CPU revision	: 6
Cache type	: undefined 5
Cache clean	: undefined 5
Cache lockdown	: undefined 5
Cache unified	: harvard
I size		: 16384
I assoc		: 16
I line length	: 32
I sets		: 32
D size		: 16384
D assoc		: 16
D line length	: 32
D sets		: 32

Hardware	: SHARP Shepherd
Revision	: 0000
Serial		: 0000000000000000

--J.

[-- Attachment #2: Type: application/pgp-signature, Size: 188 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02  5:34             ` Jamie Lokier
@ 2003-09-02  8:15               ` Russell King
  2003-09-02 11:57                 ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-02  8:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel

On Tue, Sep 02, 2003 at 06:34:15AM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > If you take a moment to think about what should be going on -
> > 
> > - first write gets translated to physical address, and the address with
> >   the data is placed in the write buffer.
> > - second write gets translated to the same physical address, and the
> >   address and data is placed into the write buffer such that we store
> >   the first write then the second write to the same physical memory.
> > - reading from the first mapping should return the second write's value
> >   no matter what.
> 
> That is an incomplete explanation, because it should never be possible
> for reads to access data from the write buffer which isn't the most
> recent.

Umm, that's what I said.

> > ARM doesn't do any of those tricks.
> 
> Don't some of the ARMs execute two instructions concurrently, like
> the original Pentium?

Nope - they're all single issue CPUs, and, if non-buggy, they guarantee
that stores never bypass loads.  (In a later architecture revision, this
is controllable.)

Remember - ARM CPUs aren't high spec desktop CPUs.  They're embedded
CPUs where power consumption matters.  Superscalar/multiple issue/high
performance isn't viable in many such embedded environments.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-02  2:16       ` Matt Porter
@ 2003-09-02  5:40         ` Jamie Lokier
  0 siblings, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-02  5:40 UTC (permalink / raw)
  To: Matt Porter; +Cc: Roland Dreier, linux-kernel

Matt Porter wrote:
> Exactly.  After reading some other subthreads I see the other version of
> "cache coherency" that Jamie is interested in.

Indeed, quite a lot of systems don't offer cache coherence with
peripherals, other CPUs (if any) and in some cases even with other
tasks on the same CPU.  Isn't memory fun? :)

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 17:11           ` Russell King
@ 2003-09-02  5:34             ` Jamie Lokier
  2003-09-02  8:15               ` Russell King
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-02  5:34 UTC (permalink / raw)
  To: Paul J.Y. Lahaie, linux-kernel

Russell King wrote:
> >    1. That's not necessary when the virtual addresses are separated
> >       by some multiple, is it?
> 
> Incorrect - with a VIVT, you have alias hell.  There is no multiple
> which makes it safe.

Ok.  I guess I was thinking of VIPT, but by now I am just guessing :)

> > > I've tested on several silicon revisions of StrongARM-110's:
> > > - H appears buggy (reports as rev. 2)
> > > - K appears fine (reports as rev. 2)
> > > - S appears buggy (reports as rev. 3)
> > 
> > It's possible that all of them are buggy, but the write buffer test
> > doesn't manage to get writes into the buffer with the exact timing
> > needed to trigger it.
> 
> Well, I've just generated a kernel test which does more or less the
> same thing (write to one mapping, write to other, read from first.)
> This indicates the same result.
> 
> If you take a moment to think about what should be going on -
> 
> - first write gets translated to physical address, and the address with
>   the data is placed in the write buffer.
> - second write gets translated to the same physical address, and the
>   address and data is placed into the write buffer such that we store
>   the first write then the second write to the same physical memory.
> - reading from the first mapping should return the second write's value
>   no matter what.

That is an incomplete explanation, because it should never be possible
for reads to access data from the write buffer which isn't the most
recent.  That would break ordinary programs which don't have alias mappings.

> > Unfortunately, while the write buffer test does
> > pretty much guarantee a store/store/load instruction sequence, because
> > it's generic it can't guarantee how those are executed in a
> > superscalar or out of order pipeline.
> 
> ARM doesn't do any of those tricks.

Don't some of the ARMs execute two instructions concurrently, like
the original Pentium?  The simple test is only valid if a
store/store/load sequence is guaranteed to pass through the buggy part
of the pipeline in exactly the same way, no matter which programs it
appears in.

> > > So it seems your test program finds problems which DaveM's aliastest
> > > program fails to detect...  Gah. ;(
> > 
> > Well, it's good to know it was useful :/
> 
> Well, we now have a kernel test to detect the problem, which alters our
> behaviour appropriately.  Thanks.

Fwiw, PA-RISC shows a similar problem.

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 17:22     ` Roland Dreier
@ 2003-09-02  2:16       ` Matt Porter
  2003-09-02  5:40         ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Matt Porter @ 2003-09-02  2:16 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Jamie Lokier, Matt Porter, linux-kernel

On Mon, Sep 01, 2003 at 10:22:02AM -0700, Roland Dreier wrote:
>     Matt> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache
>     Matt> is PTPI
> 
>     Jamie> The cache looks very coherent to me.
> 
> Matt (like me) is probably just used to thinking of the IBM PPC 440
> chips as non-coherent because they are not cache coherent with respect
> to external bus masters (eg they don't snoop the PCI bus).  Of course,
> this is a different type of coherency from what you are measuring.

Exactly.  After reading some other subthreads I see the other version of
"cache coherency" that Jamie is interested in.

-Matt

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 14:51         ` Russell King
@ 2003-09-01 19:09           ` Guennadi Liakhovetski
  0 siblings, 0 replies; 106+ messages in thread
From: Guennadi Liakhovetski @ 2003-09-01 19:09 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel, Jamie Lokier, Paul J.Y. Lahaie

On

Processor       : Intel XScale-PXA250 rev 3 (v5l)
BogoMIPS        : 397.31
Features        : swp half thumb fastmult edsp
CPU implementor : 0x69
CPU architecture: 5TE
CPU variant     : 0x0
CPU part        : 0x290
CPU revision    : 3
Cache type      : undefined 5
Cache clean     : undefined 5
Cache lockdown  : undefined 5
Cache unified   : Harvard
I size          : 32768
I assoc         : 32
I line length   : 32
I sets          : 32
D size          : 32768
D assoc         : 32
D line length   : 32
D sets          : 32

and

Processor       : StrongARM-1100 rev 9 (v4l)
BogoMIPS        : 127.38
Features        : swp half 26bit fastmult

version 3 of the test consistently reports "Too slow".

Guennadi
---
Guennadi Liakhovetski




^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  6:00   ` Jamie Lokier
  2003-09-01 11:17     ` Alan Cox
@ 2003-09-01 17:22     ` Roland Dreier
  2003-09-02  2:16       ` Matt Porter
  1 sibling, 1 reply; 106+ messages in thread
From: Roland Dreier @ 2003-09-01 17:22 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Matt Porter, linux-kernel

    Matt> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache
    Matt> is PTPI

    Jamie> The cache looks very coherent to me.

Matt (like me) is probably just used to thinking of the IBM PPC 440
chips as non-coherent because they are not cache coherent with respect
to external bus masters (eg they don't snoop the PCI bus).  Of course,
this is a different type of coherency from what you are measuring.

 - Roland

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 16:52         ` Jamie Lokier
@ 2003-09-01 17:11           ` Russell King
  2003-09-02  5:34             ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-01 17:11 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel

On Mon, Sep 01, 2003 at 05:52:39PM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > By looking at the mappings present in the process.  If a process maps the
> > same file using MAP_SHARED _and_ we fault the same page of data into two
> > or more mappings, we turn off the cache for those pages.
> 
>    1. That's not necessary when the virtual addresses are separated
>       by some multiple, is it?

Incorrect - with a VIVT, you have alias hell.  There is no multiple
which makes it safe.

> > I've tested on several silicon revisions of StrongARM-110's:
> > - H appears buggy (reports as rev. 2)
> > - K appears fine (reports as rev. 2)
> > - S appears buggy (reports as rev. 3)
> 
> It's possible that all of them are buggy, but the write buffer test
> doesn't manage to get writes into the buffer with the exact timing
> needed to trigger it.

Well, I've just generated a kernel test which does more or less the
same thing (write to one mapping, write to other, read from first.)
This indicates the same result.

If you take a moment to think about what should be going on -

- first write gets translated to physical address, and the address with
  the data is placed in the write buffer.
- second write gets translated to the same physical address, and the
  address and data is placed into the write buffer such that we store
  the first write then the second write to the same physical memory.
- reading from the first mapping should return the second write's value
  no matter what.

But it doesn't in some cases.
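
The store/store/load sequence described above can be sketched in user
space roughly as follows.  This is only an illustration, not the kernel
test mentioned here and not Jamie's test program: the function name is
made up, the kernel chooses both virtual addresses, and nothing
controls how the stores reach the write buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page of a temporary file twice, write
 * through the first mapping, write through the second, then read
 * through the first.  Returns the value read back, or -1 if the
 * mappings could not be set up. */
int alias_store_store_load(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)page) != 0) {
        fclose(f);
        return -1;
    }

    /* Two MAP_SHARED views of the same file page: virtual aliases. */
    void *m1 = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    void *m2 = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    int seen = -1;

    if (m1 != MAP_FAILED && m2 != MAP_FAILED) {
        /* volatile keeps the compiler from merging or dropping accesses */
        volatile int *first = m1;
        volatile int *second = m2;

        *first = 1;     /* first write, via the first mapping   */
        *second = 2;    /* second write, via the second mapping */
        seen = *first;  /* read back via the first mapping      */
    }

    if (m1 != MAP_FAILED)
        munmap(m1, (size_t)page);
    if (m2 != MAP_FAILED)
        munmap(m2, (size_t)page);
    fclose(f);
    return seen;
}
```

On coherent hardware the function returns 2; a return of 1 would mean
the read saw the stale first write.  A single run like this proves
little, since a timing-dependent bug may need many iterations to show.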

> Unfortunately, while the write buffer test does
> pretty much guarantee a store/store/load instruction sequence, because
> it's generic it can't guarantee how those are executed in a
> superscalar or out of order pipeline.

ARM doesn't do any of those tricks.

> > So it seems your test program finds problems which DaveM's aliastest
> > program fails to detect...  Gah. ;(
> 
> Well, it's good to know it was useful :/

Well, we now have a kernel test to detect the problem, which alters our
behaviour appropriately.  Thanks.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 16:33       ` Jamie Lokier
@ 2003-09-01 16:58         ` Larry McVoy
  0 siblings, 0 replies; 106+ messages in thread
From: Larry McVoy @ 2003-09-01 16:58 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Larry McVoy, linux-kernel

On Mon, Sep 01, 2003 at 05:33:54PM +0100, Jamie Lokier wrote:
> Your freebsds don't say what CPU they are, but let me guess..
> 
>      freebsd isn't an AMD
>      freebsd3 and freebsd4 are both AMD K6, and freebsd3 is the faster

Right you are on all points.  

freebsd:
    CPU: Unknown 80686 (400.91-MHz 686-class CPU)
    Origin = "GenuineIntel"  Id = 0x660  Stepping=0
    Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,<b16>,<b17>,MMX,<b24>>

freebsd3
    CPU: AMD-K6(tm) 3D processor (451.03-MHz 586-class CPU)
    Origin = "AuthenticAMD"  Id = 0x58c  Stepping=12
    Features=0x8021bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,PGE,MMX>

freebsd4
    CPU: AMD-K6tm w/ multimedia extensions (233.87-MHz 586-class CPU)
    Origin = "AuthenticAMD"  Id = 0x562  Stepping = 2
    Features=0x8001bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,MMX>
    AMD Features=0x400<<b10>>
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 14:17       ` Russell King
  2003-09-01 14:51         ` Russell King
@ 2003-09-01 16:52         ` Jamie Lokier
  2003-09-01 17:11           ` Russell King
  1 sibling, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01 16:52 UTC (permalink / raw)
  To: Paul J.Y. Lahaie, linux-kernel

Russell King wrote:
> On Mon, Sep 01, 2003 at 11:12:24AM +0100, Jamie Lokier wrote:
> > Russell King wrote:
> > > This looks like an old kernel on your NetWinder.  Later 2.4 kernels
> > > should get this right (by marking the pages uncacheable in user space.)
> > 
> > How do they know which pages to mark uncacheable?  Surely not all
> > MAP_SHARED|MAP_FIXED mappings are uncacheable?
> 
> By looking at the mappings present in the process.  If a process maps the
> same file using MAP_SHARED _and_ we fault the same page of data into two
> or more mappings, we turn off the cache for those pages.

   1. That's not necessary when the virtual addresses are separated
      by some multiple, is it?

   2. The other architectures with incoherent caches set SHMLBA to the
      multiple, and they don't do anything special in
      update_mmu_cache(), so MAP_FIXED can create incoherent mappings.

      Is there any special reason why ARM is different?
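
      As a rough illustration of the "multiple" in point 2, user space
      can check whether two addresses are separated by a whole multiple
      of SHMLBA.  The helper name below is made up, and whether such a
      separation is enough for coherence depends on the cache (it is
      not on VIVT).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <sys/shm.h>    /* SHMLBA */

/* Hypothetical helper: nonzero if the two virtual addresses differ by
 * a whole multiple of SHMLBA, the alignment shmat() rounds to when
 * SHM_RND is given. */
int same_shmlba_multiple(uintptr_t a, uintptr_t b)
{
    uintptr_t lba = (uintptr_t)SHMLBA;
    assert(lba > 0);

    uintptr_t distance = a > b ? a - b : b - a;
    return distance % lba == 0;
}
```

      A mapping placed with MAP_FIXED at an address that fails this
      check is the case the question above is about.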

> I've tested on several silicon revisions of StrongARM-110's:
> - H appears buggy (reports as rev. 2)
> - K appears fine (reports as rev. 2)
> - S appears buggy (reports as rev. 3)

It's possible that all of them are buggy, but the write buffer test
doesn't manage to get writes into the buffer with the exact timing
needed to trigger it.  Unfortunately, while the write buffer test does
pretty much guarantee a store/store/load instruction sequence, because
it's generic it can't guarantee how those are executed in a
superscalar or out of order pipeline.

> So it seems your test program finds problems which DaveM's aliastest
> program fails to detect...  Gah. ;(

Well, it's good to know it was useful :/

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 14:43     ` Larry McVoy
@ 2003-09-01 16:33       ` Jamie Lokier
  2003-09-01 16:58         ` Larry McVoy
  2003-09-02 20:29       ` Jamie Lokier
  1 sibling, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01 16:33 UTC (permalink / raw)
  To: Larry McVoy, Larry McVoy, linux-kernel

Larry McVoy wrote:
> I'm a little concerned I have the wrong test, why would a 2.1Ghz Athlon 
> say it is too slow?

It's the right test.  "too slow" means that where shared memory is
mapped at a certain separation, alternating accesses between the
different virtual addresses are much slower (10-20 times) than if the
underlying mapped memory is not shared.

All Athlons show this slowdown for any virtual address separation
which is not a multiple of 32k.  No Intels do, with the possible
exception of a P4 which showed inconsistent results and needs further
investigation.
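
The "minimum fast spacing" summary line in the results can be
reproduced from the per-separation pass/fail lines along these lines.
This is a guess at the rule from the output alone, not code taken from
test.c, and the function name is invented.

```c
#include <stddef.h>

/* Hypothetical reconstruction: given the tested separations in
 * ascending order and a pass flag for each, return the smallest
 * separation from which every larger tested separation also passed.
 * Returns 0 if the largest separation failed (or nothing was tested);
 * returns sep[0] when every size passed. */
size_t minimum_fast_spacing(const size_t *sep, const int *passed, size_t n)
{
    size_t i = n;

    /* Walk down from the largest separation while tests still pass. */
    while (i > 0 && passed[i - 1])
        i--;

    return i == n ? 0 : sep[i];
}
```

For an Athlon run that fails at 4096, 8192 and 16384 and passes from
32768 upwards, this yields 32768, i.e. 8 pages of 4096 bytes.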

Your freebsds don't say what CPU they are, but let me guess..

     freebsd isn't an AMD
     freebsd3 and freebsd4 are both AMD K6, and freebsd3 is the faster

-- Jamie

> ==== freebsd ====
> (512) [32,32,1] Test separation: 4096 bytes: pass
...
> FreeBSD freebsd.bitmover.com 2.2.8-RELEASE FreeBSD 2.2.8-RELEASE #0: Mon Nov 30 06:34:08 GMT 1998     jkh@time.cdrom.com:/usr/src/sys/compile/GENERIC  i386

> ==== freebsd3 ====
> (64) [33,3,1] Test separation: 4096 bytes: FAIL - too slow
> (64) [33,3,1] Test separation: 8192 bytes: FAIL - too slow
> (512) [19,26,1] Test separation: 16384 bytes: pass
> VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
> 
> FreeBSD freebsd3.bitmover.com 3.2-RELEASE FreeBSD 3.2-RELEASE #0: Fri Jun  2 11:34:52 PDT 2000     root@freebsd3.bitmover.com:/usr/src/sys/compile/DAVICOM  i386
> 
> ==== freebsd4 ====
> (256) [92,26,5] Test separation: 4096 bytes: FAIL - too slow
> (256) [92,26,5] Test separation: 8192 bytes: FAIL - too slow
> (1024) [75,101,5] Test separation: 16384 bytes: pass
> VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
> 
> FreeBSD freebsd4.bitmover.com 4.1-RELEASE FreeBSD 4.1-RELEASE #0: Fri Jul 28 14:30:31 GMT 2000     jkh@ref4.freebsd.org:/usr/src/sys/compile/GENERIC  i386

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 14:17       ` Russell King
@ 2003-09-01 14:51         ` Russell King
  2003-09-01 19:09           ` Guennadi Liakhovetski
  2003-09-01 16:52         ` Jamie Lokier
  1 sibling, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-01 14:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jamie Lokier, Paul J.Y. Lahaie

Ok, here are the results for a SA1110 machine (ie, with non-broken
write buffer):

Linux assabet2 2.6.0-test4 #1313 Thu Aug 28 21:05:05 BST 2003 armv4l unknown

Processor       : StrongARM-1110 rev 8 (v4l)
BogoMIPS        : 147.04
Features        : swp half 26bit fastmult 
CPU implementer : 0x69
CPU architecture: 4
CPU variant     : 0x0
CPU part        : 0xb11
CPU revision    : 8

Hardware        : Intel-Assabet
Revision        : 0000
Serial          : 0000000000000000

(64) [21,6,1] Test separation: 4096 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 8192 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 16384 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 32768 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 65536 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 131072 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 262144 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 524288 bytes: FAIL - too slow
(64) [21,7,1] Test separation: 1048576 bytes: FAIL - too slow
(64) [21,7,1] Test separation: 2097152 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 4194304 bytes: FAIL - too slow
(64) [21,6,1] Test separation: 8388608 bytes: FAIL - too slow
(64) [21,7,1] Test separation: 16777216 bytes: FAIL - too slow
VM page alias coherency test: failed; will use copy buffers instead


-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  5:44   ` Jamie Lokier
@ 2003-09-01 14:43     ` Larry McVoy
  2003-09-01 16:33       ` Jamie Lokier
  2003-09-02 20:29       ` Jamie Lokier
  0 siblings, 2 replies; 106+ messages in thread
From: Larry McVoy @ 2003-09-01 14:43 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Larry McVoy, linux-kernel

Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC, s390
on Linux and hpux/parisc, {freebsd, netbsd, openbsd}/x86, sco/x86, 
solaris/sparc, solaris/x86, irix/mips, osx/ppc, aix/ppc, tru64/alpha.

This is most of our test machines.  It doesn't include all the Windows
boxes, but I figured you didn't care.

The version of test.c is the one you posted later.  If I got it wrong
send me the latest.

work ~/jamie wc test.c
    773    3726   25064 test.c
work ~/jamie md5sum test.c
1e7b9e6fa525c21211abbb8986d7b2e7  test.c

I'm a little concerned I have the wrong test.  Why would a 2.1GHz Athlon
say it is too slow?

Format: 
    ==== host name ====
    Notes (may be blank)

    Results

    uname -a output
    /proc/cpuinfo (if there)

==== aix ====
332Mhz 604e 7043-150

Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead

AIX aix 1 4 004376804C00

==== alpha ====
PC something-164, that really common cheapo motherboard/test kit.

(512) [14,14,0] Test separation: 8192 bytes: pass
(512) [14,14,0] Test separation: 16384 bytes: pass
(512) [14,14,0] Test separation: 32768 bytes: pass
(512) [14,14,0] Test separation: 65536 bytes: pass
(512) [14,14,0] Test separation: 131072 bytes: pass
(512) [14,14,0] Test separation: 262144 bytes: pass
(512) [14,14,0] Test separation: 524288 bytes: pass
(512) [14,14,0] Test separation: 1048576 bytes: pass
(512) [14,14,0] Test separation: 2097152 bytes: pass
(512) [14,14,0] Test separation: 4194304 bytes: pass
(512) [14,14,0] Test separation: 8388608 bytes: pass
(512) [14,14,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux alpha.bitmover.com 2.4.21-pre5 #2 Thu Mar 20 07:54:03 PST 2003 alpha unknown
cpu			: Alpha
cpu model		: EV56
cpu variation		: 7
cpu revision		: 0
cpu serial number	: 
system type		: EB164
system variation	: PC164
system revision		: 0
system serial number	: 
cycle frequency [Hz]	: 500000000 
timer frequency [Hz]	: 1024.00
page size [bytes]	: 8192
phys. address bits	: 40
max. addr. space #	: 127
BogoMIPS		: 992.88
kernel unaligned acc	: 0 (pc=0,va=0)
user unaligned acc	: 0 (pc=0,va=0)
platform string		: Digital AlphaPC 164 500 MHz
cpus detected		: 1

==== disks ====
(128) [17,1,0] Test separation: 4096 bytes: FAIL - too slow
(128) [17,1,0] Test separation: 8192 bytes: FAIL - too slow
(128) [17,1,0] Test separation: 16384 bytes: FAIL - too slow
(1024) [10,13,0] Test separation: 32768 bytes: pass
(1024) [10,13,0] Test separation: 65536 bytes: pass
(1024) [10,13,0] Test separation: 131072 bytes: pass
(1024) [10,13,0] Test separation: 262144 bytes: pass
(1024) [10,13,0] Test separation: 524288 bytes: pass
(1024) [10,13,0] Test separation: 1048576 bytes: pass
(1024) [10,13,0] Test separation: 2097152 bytes: pass
(1024) [10,13,0] Test separation: 4194304 bytes: pass
(1024) [10,13,0] Test separation: 8388608 bytes: pass
(1024) [10,13,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

Linux disks.bitmover.com 2.4.18-14 #1 Wed Sep 4 12:13:11 EDT 2002 i686 athlon i386 GNU/Linux
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(tm) XP 1900+
stepping	: 2
cpu MHz		: 1593.143
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 3172.64

==== freebsd ====
(512) [32,32,1] Test separation: 4096 bytes: pass
(512) [32,32,1] Test separation: 8192 bytes: pass
(512) [32,32,1] Test separation: 16384 bytes: pass
(512) [32,32,1] Test separation: 32768 bytes: pass
(512) [32,32,1] Test separation: 65536 bytes: pass
(512) [32,32,1] Test separation: 131072 bytes: pass
(512) [32,32,1] Test separation: 262144 bytes: pass
(512) [32,32,1] Test separation: 524288 bytes: pass
(512) [32,32,1] Test separation: 1048576 bytes: pass
(512) [32,32,1] Test separation: 2097152 bytes: pass
(512) [32,32,1] Test separation: 4194304 bytes: pass
(512) [32,32,1] Test separation: 8388608 bytes: pass
(512) [32,32,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

FreeBSD freebsd.bitmover.com 2.2.8-RELEASE FreeBSD 2.2.8-RELEASE #0: Mon Nov 30 06:34:08 GMT 1998     jkh@time.cdrom.com:/usr/src/sys/compile/GENERIC  i386

==== freebsd3 ====
(64) [33,3,1] Test separation: 4096 bytes: FAIL - too slow
(64) [33,3,1] Test separation: 8192 bytes: FAIL - too slow
(512) [19,26,1] Test separation: 16384 bytes: pass
(512) [19,26,1] Test separation: 32768 bytes: pass
(512) [19,26,1] Test separation: 65536 bytes: pass
(512) [19,26,1] Test separation: 131072 bytes: pass
(512) [19,26,1] Test separation: 262144 bytes: pass
(512) [19,26,1] Test separation: 524288 bytes: pass
(512) [19,26,1] Test separation: 1048576 bytes: pass
(512) [19,26,1] Test separation: 2097152 bytes: pass
(512) [19,26,1] Test separation: 4194304 bytes: pass
(512) [19,26,1] Test separation: 8388608 bytes: pass
(512) [19,26,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

FreeBSD freebsd3.bitmover.com 3.2-RELEASE FreeBSD 3.2-RELEASE #0: Fri Jun  2 11:34:52 PDT 2000     root@freebsd3.bitmover.com:/usr/src/sys/compile/DAVICOM  i386

==== freebsd4 ====
(256) [92,26,5] Test separation: 4096 bytes: FAIL - too slow
(256) [92,26,5] Test separation: 8192 bytes: FAIL - too slow
(1024) [75,101,5] Test separation: 16384 bytes: pass
(1024) [75,101,5] Test separation: 32768 bytes: pass
(1024) [75,101,5] Test separation: 65536 bytes: pass
(1024) [75,101,5] Test separation: 131072 bytes: pass
(1024) [75,101,5] Test separation: 262144 bytes: pass
(1024) [75,101,5] Test separation: 524288 bytes: pass
(1024) [75,101,5] Test separation: 1048576 bytes: pass
(1024) [75,101,5] Test separation: 2097152 bytes: pass
(1024) [75,101,5] Test separation: 4194304 bytes: pass
(1024) [75,101,5] Test separation: 8388608 bytes: pass
(1024) [75,101,5] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

FreeBSD freebsd4.bitmover.com 4.1-RELEASE FreeBSD 4.1-RELEASE #0: Fri Jul 28 14:30:31 GMT 2000     jkh@ref4.freebsd.org:/usr/src/sys/compile/GENERIC  i386

==== hp ====
C360, HPUX 10.20

Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead

HP-UX hp B.10.20 A 9000/785 2004452144 two-user license

==== ia64 ====
(512) [17,17,0] Test separation: 16384 bytes: pass
(512) [17,17,0] Test separation: 32768 bytes: pass
(512) [17,17,0] Test separation: 65536 bytes: pass
(512) [17,17,0] Test separation: 131072 bytes: pass
(512) [17,17,0] Test separation: 262144 bytes: pass
(512) [17,17,0] Test separation: 524288 bytes: pass
(512) [17,17,0] Test separation: 1048576 bytes: pass
(512) [17,17,0] Test separation: 2097152 bytes: pass
(512) [17,17,0] Test separation: 4194304 bytes: pass
(512) [17,17,0] Test separation: 8388608 bytes: pass
(512) [17,17,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux ia64.bitmover.com 2.4.9-18smp #1 SMP Tue Dec 11 12:59:00 EST 2001 ia64 unknown
processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium
model      : 0
revision   : 7
archrev    : 0
features   : standard
cpu number : 0
cpu regs   : 4
cpu MHz    : 799.486992
itc MHz    : 799.486992
BogoMIPS   : 796.91

processor  : 1
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium
model      : 0
revision   : 7
archrev    : 0
features   : standard
cpu number : 0
cpu regs   : 4
cpu MHz    : 799.486992
itc MHz    : 799.486992
BogoMIPS   : 796.91

==== macos ====
Imac, OS X 10.2

(2048) [67,67,3] Test separation: 4096 bytes: pass
(2048) [67,67,3] Test separation: 8192 bytes: pass
(2048) [67,67,3] Test separation: 16384 bytes: pass
(2048) [67,67,3] Test separation: 32768 bytes: pass
(2048) [67,67,3] Test separation: 65536 bytes: pass
(2048) [67,67,3] Test separation: 131072 bytes: pass
(2048) [67,67,3] Test separation: 262144 bytes: pass
(2048) [67,67,3] Test separation: 524288 bytes: pass
(2048) [67,67,3] Test separation: 1048576 bytes: pass
(2048) [67,67,3] Test separation: 2097152 bytes: pass
(2048) [67,67,3] Test separation: 4194304 bytes: pass
(2048) [67,67,3] Test separation: 8388608 bytes: pass
(2048) [67,67,3] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Darwin macos.bitmover.com 6.6 Darwin Kernel Version 6.6: Thu May  1 21:48:54 PDT 2003; root:xnu/xnu-344.34.obj~1/RELEASE_PPC  Power Macintosh powerpc

==== mips ====
(64) [276,11,2] Test separation: 4096 bytes: FAIL - too slow
(64) [276,11,2] Test separation: 8192 bytes: FAIL - too slow
(128) [26,43,2] Test separation: 16384 bytes: pass
(128) [26,43,2] Test separation: 32768 bytes: pass
(128) [26,43,2] Test separation: 65536 bytes: pass
(128) [26,43,2] Test separation: 131072 bytes: pass
(128) [26,43,2] Test separation: 262144 bytes: pass
(128) [26,43,2] Test separation: 524288 bytes: pass
(128) [26,43,2] Test separation: 1048576 bytes: pass
(128) [26,43,2] Test separation: 2097152 bytes: pass
(128) [26,43,2] Test separation: 4194304 bytes: pass
(128) [26,43,2] Test separation: 8388608 bytes: pass
(128) [26,43,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

Linux mips 2.4.18-r4k-ip22 #1 Sun Jun 23 15:30:50 CEST 2002 mips unknown
system type		: SGI Indy
processor		: 0
cpu model		: R4000SC V6.0  FPU V0.0
BogoMIPS		: 86.83
byteorder		: big endian
wait instruction	: no
microsecond timers	: yes
tlb_entries		: 48
extra interrupt vector	: no
hardware watchpoint	: yes
VCED exceptions		: 8055726
VCEI exceptions		: 0

==== netbsd ====
(1024) [53,53,4] Test separation: 4096 bytes: pass
(2048) [106,106,4] Test separation: 8192 bytes: pass
(2048) [104,105,5] Test separation: 16384 bytes: pass
(2048) [105,104,5] Test separation: 32768 bytes: pass
(2048) [105,104,5] Test separation: 65536 bytes: pass
(2048) [104,104,5] Test separation: 131072 bytes: pass
(2048) [105,105,5] Test separation: 262144 bytes: pass
(2048) [105,105,5] Test separation: 524288 bytes: pass
(1024) [53,53,4] Test separation: 1048576 bytes: pass
(2048) [104,104,5] Test separation: 2097152 bytes: pass
(2048) [106,106,4] Test separation: 4194304 bytes: pass
(2048) [105,106,4] Test separation: 8388608 bytes: pass
(2048) [104,105,5] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

NetBSD netbsd.bitmover.com 1.5 NetBSD 1.5 (GENERIC) #1: Sun Nov 19 21:42:11 MET 2000     fvdl@sushi:/work/trees/netbsd-1-5/sys/arch/i386/compile/GENERIC i386

==== netwinder ====
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead

Linux netwinder 2.2.12-19991020 #1 Wed Oct 20 13:09:07 EDT 1999 armv4l unknown
Processor	: Intel sa110 rev 3
BogoMips	: 262.14
Hardware	: Rebel-NetWinder
Serial #	: 3464
Revision	: 52ff

==== openbsd ====
(512) [27,27,1] Test separation: 4096 bytes: pass
(512) [27,27,1] Test separation: 8192 bytes: pass
(512) [27,27,1] Test separation: 16384 bytes: pass
(512) [27,27,1] Test separation: 32768 bytes: pass
(512) [27,27,1] Test separation: 65536 bytes: pass
(512) [27,27,1] Test separation: 131072 bytes: pass
(512) [27,27,1] Test separation: 262144 bytes: pass
(512) [27,27,1] Test separation: 524288 bytes: pass
(512) [27,27,1] Test separation: 1048576 bytes: pass
(512) [27,27,1] Test separation: 2097152 bytes: pass
(512) [27,27,1] Test separation: 4194304 bytes: pass
(512) [27,27,1] Test separation: 8388608 bytes: pass
(512) [27,27,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

OpenBSD openbsd 3.0 GENERIC#94 i386

==== parisc ====
A500
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
(2048) [41,41,2] Test separation: 4194304 bytes: pass
(2048) [41,41,2] Test separation: 8388608 bytes: pass
(2048) [41,41,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 4194304 (1024 pages)

Linux parisc 2.4.17-64 #1 Sat Mar 16 17:31:44 MST 2002 parisc64 unknown
processor	: 0
cpu family	: PA-RISC 2.0
cpu		: PA8600 (PCX-W+)
cpu MHz		: 550.000000
model		: 9000/800/A500-5X
model name	: Crescendo 550
hversion	: 0x00005d50
sversion	: 0x00000491
I-cache		: 512 KB
D-cache		: 1024 KB (WB)
ITLB entries	: 160
DTLB entries	: 160 - shared with ITLB
bogomips	: 1097.72
software id	: 580790518

==== ppc ====
(1024) [40,40,1] Test separation: 4096 bytes: pass
(1024) [40,40,1] Test separation: 8192 bytes: pass
(1024) [40,40,1] Test separation: 16384 bytes: pass
(1024) [40,40,1] Test separation: 32768 bytes: pass
(1024) [40,40,1] Test separation: 65536 bytes: pass
(1024) [40,40,1] Test separation: 131072 bytes: pass
(1024) [40,40,1] Test separation: 262144 bytes: pass
(1024) [40,40,1] Test separation: 524288 bytes: pass
(1024) [40,40,1] Test separation: 1048576 bytes: pass
(1024) [40,40,1] Test separation: 2097152 bytes: pass
(1024) [40,40,1] Test separation: 4194304 bytes: pass
(1024) [40,40,1] Test separation: 8388608 bytes: pass
(1024) [40,40,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux ppc.bitmover.com 2.4.6-pre2 #2 Sun Jun 10 20:21:17 PDT 2001 ppc unknown
processor	: 0
cpu		: 750
temperature 	: 0 C
clock		: 333MHz
revision	: 2.2
bogomips	: 665.69
zero pages	: total: 0 (0Kb) current: 0 (0Kb) hits: 0/0 (0%)
machine		: iMac,1
motherboard	: iMac MacRISC Power Macintosh
L2 cache	: 512K unified
memory		: 160MB
pmac-generation	: NewWorld

==== qube ====
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
(512) [47,47,2] Test separation: 16384 bytes: pass
(512) [47,47,2] Test separation: 32768 bytes: pass
(512) [47,47,2] Test separation: 65536 bytes: pass
(512) [47,47,2] Test separation: 131072 bytes: pass
(512) [47,47,2] Test separation: 262144 bytes: pass
(512) [47,47,2] Test separation: 524288 bytes: pass
(512) [47,47,2] Test separation: 1048576 bytes: pass
(512) [47,47,2] Test separation: 2097152 bytes: pass
(512) [47,47,2] Test separation: 4194304 bytes: pass
(512) [47,47,2] Test separation: 8388608 bytes: pass
(512) [47,47,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

Linux qube.bitmover.com 2.0.34 #1 Thu Jan 28 03:03:03 PST 1999 mips unknown
cpu			: MIPS
cpu model		: Nevada V10.0
system type		: Cobalt Microserver 27
BogoMIPS		: 249.86
byteorder		: little endian
unaligned accesses	: 16
wait instruction	: yes
microsecond timers	: yes
extra interrupt vector	: yes
hardware watchpoint	: no

==== redhat52 ====
(256) [12,12,0] Test separation: 4096 bytes: pass
(256) [12,12,0] Test separation: 8192 bytes: pass
(256) [12,12,0] Test separation: 16384 bytes: pass
(256) [12,12,0] Test separation: 32768 bytes: pass
(256) [12,12,0] Test separation: 65536 bytes: pass
(256) [12,12,0] Test separation: 131072 bytes: pass
(256) [12,12,0] Test separation: 262144 bytes: pass
(256) [12,12,0] Test separation: 524288 bytes: pass
(256) [12,12,0] Test separation: 1048576 bytes: pass
(256) [12,12,0] Test separation: 2097152 bytes: pass
(256) [12,12,0] Test separation: 4194304 bytes: pass
(256) [12,12,0] Test separation: 8388608 bytes: pass
(256) [12,12,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux redhat52.bitmover.com 2.2.15pre9 #10 Sat Apr 8 17:59:35 PDT 2000 i686 unknown
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 6
model name	: Celeron (Mendocino)
stepping	: 5
cpu MHz		: 534.561273
cache size	: 128 KB
fdiv_bug	: no
hlt_bug		: no
sep_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 532.48

==== redhat62 ====
(256) [12,12,0] Test separation: 4096 bytes: pass
(256) [12,12,0] Test separation: 8192 bytes: pass
(256) [12,12,0] Test separation: 16384 bytes: pass
(256) [12,12,0] Test separation: 32768 bytes: pass
(256) [12,12,0] Test separation: 65536 bytes: pass
(256) [12,12,0] Test separation: 131072 bytes: pass
(256) [12,12,0] Test separation: 262144 bytes: pass
(256) [12,12,0] Test separation: 524288 bytes: pass
(256) [12,12,0] Test separation: 1048576 bytes: pass
(256) [12,12,0] Test separation: 2097152 bytes: pass
(256) [12,12,0] Test separation: 4194304 bytes: pass
(256) [12,12,0] Test separation: 8388608 bytes: pass
(256) [12,12,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux redhat62.bitmover.com 2.2.14-5.0 #1 Tue Mar 7 21:07:39 EST 2000 i686 unknown
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 6
model name	: Celeron (Mendocino)
stepping	: 5
cpu MHz		: 534.552424
cache size	: 128 KB
fdiv_bug	: no
hlt_bug		: no
sep_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 532.48

==== redhat71 ====
(256) [14,14,0] Test separation: 4096 bytes: pass
(256) [14,14,0] Test separation: 8192 bytes: pass
(256) [14,14,0] Test separation: 16384 bytes: pass
(256) [14,14,0] Test separation: 32768 bytes: pass
(256) [14,14,0] Test separation: 65536 bytes: pass
(256) [14,14,0] Test separation: 131072 bytes: pass
(256) [14,14,0] Test separation: 262144 bytes: pass
(256) [14,14,0] Test separation: 524288 bytes: pass
(256) [14,14,0] Test separation: 1048576 bytes: pass
(256) [14,14,0] Test separation: 2097152 bytes: pass
(256) [14,14,0] Test separation: 4194304 bytes: pass
(256) [14,14,0] Test separation: 8388608 bytes: pass
(256) [14,14,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux redhat71.bitmover.com 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 6
model name	: Celeron (Mendocino)
stepping	: 5
cpu MHz		: 467.739
cache size	: 128 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 933.88

==== sco ====
(1024) [48,48,2] Test separation: 4096 bytes: pass
(1024) [48,48,2] Test separation: 8192 bytes: pass
(1024) [48,48,2] Test separation: 16384 bytes: pass
(1024) [48,48,2] Test separation: 32768 bytes: pass
(1024) [48,48,2] Test separation: 65536 bytes: pass
(1024) [48,48,2] Test separation: 131072 bytes: pass
(1024) [48,48,1] Test separation: 262144 bytes: pass
(1024) [49,49,1] Test separation: 524288 bytes: pass
(1024) [48,48,2] Test separation: 1048576 bytes: pass
(1024) [48,48,2] Test separation: 2097152 bytes: pass
(1024) [48,48,2] Test separation: 4194304 bytes: pass
(1024) [48,48,2] Test separation: 8388608 bytes: pass
(1024) [48,48,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

SCO_SV sco 3.2 5.0.7 i386

==== sgi ====
FPU: MIPS R10010 Floating Point Chip Revision: 0.0
CPU: MIPS R10000 Processor Chip Revision: 2.6
1 195 MHZ IP28 Processor
Main memory size: 192 Mbytes
Secondary unified instruction/data cache size: 1 Mbyte
Instruction cache size: 32 Kbytes
Data cache size: 32 Kbytes

(1024) [103,103,5] Test separation: 16384 bytes: pass
(1024) [103,103,5] Test separation: 32768 bytes: pass
(1024) [103,103,5] Test separation: 65536 bytes: pass
(1024) [103,103,5] Test separation: 131072 bytes: pass
(1024) [103,103,5] Test separation: 262144 bytes: pass
(1024) [103,103,5] Test separation: 524288 bytes: pass
(1024) [103,103,5] Test separation: 1048576 bytes: pass
(1024) [103,103,5] Test separation: 2097152 bytes: pass
(1024) [103,103,5] Test separation: 4194304 bytes: pass
(1024) [103,103,5] Test separation: 8388608 bytes: pass
(1024) [103,103,5] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

IRIX64 sgi 6.5 10120105 IP28

==== slovax ====
(128) [12,1,0] Test separation: 4096 bytes: FAIL - too slow
(128) [12,1,0] Test separation: 8192 bytes: FAIL - too slow
(128) [12,1,0] Test separation: 16384 bytes: FAIL - too slow
(2048) [15,16,0] Test separation: 32768 bytes: pass
(2048) [13,16,0] Test separation: 65536 bytes: pass
(2048) [13,16,0] Test separation: 131072 bytes: pass
(2048) [15,16,0] Test separation: 262144 bytes: pass
(2048) [15,16,0] Test separation: 524288 bytes: pass
(2048) [15,16,0] Test separation: 1048576 bytes: pass
(2048) [15,16,0] Test separation: 2097152 bytes: pass
(2048) [15,16,0] Test separation: 4194304 bytes: pass
(2048) [15,16,0] Test separation: 8388608 bytes: pass
(2048) [13,16,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

Linux slovax.bitmover.com 2.4.18-14 #1 Wed Sep 4 12:13:11 EDT 2002 i686 athlon i386 GNU/Linux

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 2700+
stepping        : 1
cpu MHz         : 2162.685
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 4297.33


==== sparc ====
Test separation: 8192 bytes: FAIL - cache not coherent
(1024) [65,71,2] Test separation: 16384 bytes: pass
(1024) [65,68,2] Test separation: 32768 bytes: pass
(512) [2,50,2] Test separation: 65536 bytes: pass
(512) [33,19,2] Test separation: 131072 bytes: pass
(512) [33,20,2] Test separation: 262144 bytes: pass
(512) [33,50,2] Test separation: 524288 bytes: pass
(512) [33,19,2] Test separation: 1048576 bytes: pass
(1024) [35,68,2] Test separation: 2097152 bytes: pass
(512) [33,42,2] Test separation: 4194304 bytes: pass
(512) [2,50,2] Test separation: 8388608 bytes: pass
(512) [5,50,2] Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)

Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown
cpu		: TI UltraSparc IIi
fpu		: UltraSparc IIi integrated FPU
promlib		: Version 3 Revision 11
prom		: 3.11.12
type		: sun4u
ncpus probed	: 1
ncpus active	: 1
BogoMips	: 539.03
MMU Type	: Spitfire

==== sun ====
cpu0: SUNW,UltraSPARC-II (upaid 0 impl 0x11 ver 0x20 clock 296 MHz)
cpu1: SUNW,UltraSPARC-II (upaid 1 impl 0x11 ver 0x20 clock 296 MHz)
SunOS Release 5.6 Version Generic_105181-05 [UNIX(R) System V Release 4.0]

(128) [11,7,0] Test separation: 8192 bytes: pass
(256) [15,21,0] Test separation: 16384 bytes: pass
(256) [15,21,0] Test separation: 32768 bytes: pass
(256) [15,21,0] Test separation: 65536 bytes: pass
(256) [15,21,0] Test separation: 131072 bytes: pass
(256) [15,21,0] Test separation: 262144 bytes: pass
(256) [15,21,0] Test separation: 524288 bytes: pass
(256) [15,21,0] Test separation: 1048576 bytes: pass
(256) [15,21,0] Test separation: 2097152 bytes: pass
(256) [15,21,0] Test separation: 4194304 bytes: pass
(256) [15,21,0] Test separation: 8388608 bytes: pass
(256) [15,21,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

SunOS sun 5.6 Generic_105181-05 sun4u sparc SUNW,Ultra-2

==== sunx86 ====
2x 450MHz Xeons

(512) [29,29,1] Test separation: 4096 bytes: pass
(512) [29,29,1] Test separation: 8192 bytes: pass
(512) [29,29,1] Test separation: 16384 bytes: pass
(512) [29,29,1] Test separation: 32768 bytes: pass
(512) [29,29,1] Test separation: 65536 bytes: pass
(512) [29,29,1] Test separation: 131072 bytes: pass
(512) [29,29,1] Test separation: 262144 bytes: pass
(512) [29,29,1] Test separation: 524288 bytes: pass
(512) [29,29,1] Test separation: 1048576 bytes: pass
(512) [29,29,1] Test separation: 2097152 bytes: pass
(512) [29,29,1] Test separation: 4194304 bytes: pass
(512) [29,29,1] Test separation: 8388608 bytes: pass
(512) [29,29,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

SunOS sunx86.bitmover.com 5.7 Generic_106542-18 i86pc i386 i86pc

==== tru64 ====
600AU (nicely made machine)

(65536) [976,976,0] Test separation: 8192 bytes: pass
(65536) [976,976,0] Test separation: 16384 bytes: pass
(65536) [976,976,0] Test separation: 32768 bytes: pass
(65536) [976,976,0] Test separation: 65536 bytes: pass
(65536) [976,976,0] Test separation: 131072 bytes: pass
(65536) [976,976,0] Test separation: 262144 bytes: pass
(65536) [976,976,0] Test separation: 524288 bytes: pass
(65536) [976,976,0] Test separation: 1048576 bytes: pass
(65536) [976,976,0] Test separation: 2097152 bytes: pass
(65536) [976,976,0] Test separation: 4194304 bytes: pass
(65536) [976,976,0] Test separation: 8388608 bytes: pass
(65536) [976,976,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

OSF1 tru64.bitmover.com V5.1 2650 alpha

==== winxp ====
I just ran gcc on this system; I have no idea what that did, but it didn't
complain, so it did something.

win32-xp /build/jamie ./a.exe
Test separation: 4096 bytes: FAIL - alias map failed
Test separation: 8192 bytes: FAIL - alias map failed
Test separation: 16384 bytes: FAIL - alias map failed
Test separation: 32768 bytes: FAIL - alias map failed
Test separation: 65536 bytes: FAIL - alias map failed
Test separation: 131072 bytes: FAIL - alias map failed
Test separation: 262144 bytes: FAIL - alias map failed
Test separation: 524288 bytes: FAIL - alias map failed
Test separation: 1048576 bytes: FAIL - alias map failed
Test separation: 2097152 bytes: FAIL - alias map failed
Test separation: 4194304 bytes: FAIL - alias map failed
Test separation: 8388608 bytes: FAIL - alias map failed
Test separation: 16777216 bytes: FAIL - alias map failed
VM page alias coherency test: failed; will use copy buffers instead

==== zseries/RedHat ====
(256) [11,11,0] Test separation: 4096 bytes: pass
(256) [11,11,0] Test separation: 8192 bytes: pass
(256) [11,11,0] Test separation: 16384 bytes: pass
(256) [11,11,0] Test separation: 32768 bytes: pass
(256) [11,11,0] Test separation: 65536 bytes: pass
(256) [11,11,0] Test separation: 131072 bytes: pass
(256) [11,11,0] Test separation: 262144 bytes: pass
(256) [11,11,0] Test separation: 524288 bytes: pass
(256) [11,13,0] Test separation: 1048576 bytes: pass
(256) [11,13,0] Test separation: 2097152 bytes: pass
(256) [11,13,0] Test separation: 4194304 bytes: pass
(256) [11,13,0] Test separation: 8388608 bytes: pass
(256) [11,13,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux l006034.zseriespenguins.ihost.com 2.4.9-38 #1 SMP Tue Sep 10 00:16:26 CEST 2002 s390 unknown

vendor_id       : IBM/S390
# processors    : 1
bogomips per cpu: 612.76
processor 0: version = FF,  identification = 049321,  machine = 9672

==== zseries/SuSE ====
(512) [21,21,1] Test separation: 4096 bytes: pass
(256) [11,11,0] Test separation: 8192 bytes: pass
(512) [21,21,1] Test separation: 16384 bytes: pass
(512) [21,21,1] Test separation: 32768 bytes: pass
(512) [21,21,1] Test separation: 65536 bytes: pass
(512) [22,22,0] Test separation: 131072 bytes: pass
(512) [22,22,0] Test separation: 262144 bytes: pass
(512) [21,21,1] Test separation: 524288 bytes: pass
(512) [21,25,1] Test separation: 1048576 bytes: pass
(512) [22,26,0] Test separation: 2097152 bytes: pass
(256) [11,13,0] Test separation: 4194304 bytes: pass
(512) [22,26,0] Test separation: 8388608 bytes: pass
(512) [21,25,1] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

Linux lh003022 2.2.16 #6 SMP Wed May 23 16:39:31 EDT 2001 s390 unknown

vendor_id       : IBM/S390
# processors    : 1
bogomips per cpu: 581.63
processor 0: version = FF,  identification = 049321,  machine = 9672


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:12     ` Jamie Lokier
  2003-09-01 11:30       ` Geert Uytterhoeven
@ 2003-09-01 14:17       ` Russell King
  2003-09-01 14:51         ` Russell King
  2003-09-01 16:52         ` Jamie Lokier
  2003-09-04 17:37       ` Maciej W. Rozycki
  2 siblings, 2 replies; 106+ messages in thread
From: Russell King @ 2003-09-01 14:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Paul J.Y. Lahaie, linux-kernel

On Mon, Sep 01, 2003 at 11:12:24AM +0100, Jamie Lokier wrote:
> Russell King wrote:
> > This looks like an old kernel on your NetWinder.  Later 2.4 kernels
> > should get this right (by marking the pages uncacheable in user space.)
> 
> How do they know which pages to mark uncacheable?  Surely not all
> MAP_SHARED|MAP_FIXED mappings are uncacheable?

By looking at the mappings present in the process.  If a process maps the
same file using MAP_SHARED _and_ we fault the same page of data into two
or more mappings, we turn off the cache for those pages.

We actually only turn off the cache and leave the write buffer (aka your
store buffer) turned on for these regions, which should be sufficient for
it to remain coherent between different virtual addresses.
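
The situation described above (one page of a file faulted into two
MAP_SHARED mappings in the same process) can be reproduced from user
space with a minimal sketch like the one below.  It is not the kernel
code and not Jamie's test program; the function name and the temporary
file template are made up for illustration, and it lets the kernel pick
both addresses rather than using MAP_FIXED.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Map the same file page at two virtual addresses and check that a
   store through one mapping is visible through the other.
   Returns 1 if both directions are visible, 0 if not, -1 if the
   setup failed.  The file name template is a placeholder. */
int
alias_write_visible (void)
{
  char name[] = "/tmp/alias-sketch-XXXXXX";
  long page = sysconf (_SC_PAGESIZE);
  void *pa, *pb;
  volatile int *a, *b;
  int fd, ok;

  fd = mkstemp (name);
  if (fd < 0)
    return -1;
  unlink (name);                /* the file lives on until close */
  if (page <= 0 || ftruncate (fd, page) != 0)
    {
      close (fd);
      return -1;
    }

  pa = mmap (NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (pa == MAP_FAILED)
    {
      close (fd);
      return -1;
    }
  pb = mmap (NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);
  if (pb == MAP_FAILED)
    {
      munmap (pa, page);
      return -1;
    }

  a = pa;
  b = pb;
  *a = 0x1234;                  /* store through the first alias */
  ok = (*b == 0x1234);          /* load through the second alias */
  *b = 0x5678;                  /* and the other way round */
  ok = ok && (*a == 0x5678);

  munmap (pa, page);
  munmap (pb, page);
  return ok;
}
```

On a machine whose caches are coherent across virtual aliases, or where
the kernel has made such pages uncacheable, a single call should
return 1.  A single check like this is far weaker than the repeated,
timed test in the posted program.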

I've been doing some further investigation, and I'm now of the opinion
that "SA110" StrongARM chips have buggy write buffers, because:

- if I turn off the cache, leaving the write buffer on, this program
  works on StrongARM-1110 CPUs but not some StrongARM-110 CPUs.
- if I turn off the cache and write buffer on these twice-mapped pages,
  StrongARM-110 behaves as expected.

I've tested on several silicon revisions of StrongARM-110s:
- H appears buggy (reports as rev. 2)
- K appears fine (reports as rev. 2)
- S appears buggy (reports as rev. 3)

Unfortunately, the written documentation makes zero mention of the exact
write buffer behaviour.  The best that I have to go on for the
StrongARM-110 is a block diagram which indicates that the write buffer
uses physical addresses, and that the D-cache contains the physical
address which the line was fetched from for writeback (via the write
buffer.)

So it seems your test program finds problems which DaveM's aliastest
program fails to detect...  Gah. ;(

I guess it's time to devise a kernel test and alter our behaviour on ARM
accordingly.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:48         ` Jamie Lokier
@ 2003-09-01 12:23           ` Sam Creasey
  0 siblings, 0 replies; 106+ messages in thread
From: Sam Creasey @ 2003-09-01 12:23 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Geert Uytterhoeven, Linux/m68k, Linux Kernel Development



On Mon, 1 Sep 2003, Jamie Lokier wrote:

> Sam Creasey wrote:
>
> > bash-2.03# time ./jamie-test2
> > (2048) [10000,10000,0] Test separation: 8192 bytes: pass
>
> Mighty suspicious gettimeofday() you have there.
>
> > real    1m34.330s
> > user    1m30.030s
> > sys     0m4.070s
>
> Indeed, on other systems the test completes in a few seconds at most,
> not because of CPU speed, but because gettimeofday() returns high
> resolution time on them.
>
> Isn't there a way to read high resolution time on the 68020 Sun-3?

AFAICT, no.  I've dug through the datasheets for the intersil RTC used, as
well as the NetBSD code, and SunOS headers, and it seems that we're stuck
with 1/100th second accuracy.  Bummer.
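
For anyone wanting to check the clock granularity on their own box, a
rough sketch follows.  It is not part of Jamie's program; the function
name is made up for illustration.  It spins until gettimeofday()
changes and keeps the smallest step it sees, so a clock that only ticks
at 100Hz would be expected to report about 10000 microseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Return the smallest nonzero step, in microseconds, observed between
   successive gettimeofday() readings, over `samples' clock changes. */
long
gettimeofday_step_usec (int samples)
{
  struct timeval prev, now;
  long best = 0;
  int seen = 0;

  gettimeofday (&prev, NULL);
  while (seen < samples)
    {
      long delta;

      gettimeofday (&now, NULL);
      delta = (now.tv_sec - prev.tv_sec) * 1000000L
              + (now.tv_usec - prev.tv_usec);
      if (delta < 0)
        {
          prev = now;           /* clock stepped backwards; restart */
          continue;
        }
      if (delta == 0)
        continue;               /* clock has not ticked yet */
      if (best == 0 || delta < best)
        best = delta;
      prev = now;
      seen++;
    }
  return best;
}
```

Calling gettimeofday_step_usec (5) and printing the result is enough
to tell a microsecond-resolution clock from a tick-based one.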

-- Sam



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:12     ` Jamie Lokier
@ 2003-09-01 11:30       ` Geert Uytterhoeven
  2003-09-01 14:17       ` Russell King
  2003-09-04 17:37       ` Maciej W. Rozycki
  2 siblings, 0 replies; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-09-01 11:30 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Russell King, Paul J.Y. Lahaie, Linux Kernel Development, Linux/m68k

On Mon, 1 Sep 2003, Jamie Lokier wrote:
> There is a bug in test_l1_only which I just noticed.  It's unlikely,
> but if `dummy' happens to have the same L1 cache address as both words
> being tested, and it's a 2-way (or less) set-associative cache, then
> it will inadvertently flush the cache and say "store buffer not
> coherent" when it means to say "cache not coherent".
> 
> Please try the program below, which is the same as before but with
> test_l1_only hopefully improved, and it prints some more helpful
> numbers.

Results for 68040 with the new version:

cassandra:/tmp# time ./test2
Test separation: 4096 bytes: FAIL - store buffer not coherent
Test separation: 8192 bytes: FAIL - store buffer not coherent
Test separation: 16384 bytes: FAIL - store buffer not coherent
Test separation: 32768 bytes: FAIL - store buffer not coherent
Test separation: 65536 bytes: FAIL - store buffer not coherent
Test separation: 131072 bytes: FAIL - store buffer not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
Test separation: 4194304 bytes: FAIL - store buffer not coherent
Test separation: 8388608 bytes: FAIL - store buffer not coherent
Test separation: 16777216 bytes: FAIL - store buffer not coherent
VM page alias coherency test: failed; will use copy buffers instead

real	0m0.454s
user	0m0.090s
sys	0m0.210s
cassandra:/tmp# cat /proc/cpuinfo 
CPU:		68040
MMU:		68040
FPU:		68040
Clocking:	24.8MHz
BogoMips:	16.53
Calibration:	82688 loops
cassandra:/tmp# 

New m68k binary at http://home.tvd.be/cr26864/Linux/m68k/jamie_test2.gz

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds




* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  6:00   ` Jamie Lokier
@ 2003-09-01 11:17     ` Alan Cox
  2003-09-01 17:22     ` Roland Dreier
  1 sibling, 0 replies; 106+ messages in thread
From: Alan Cox @ 2003-09-01 11:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Matt Porter, Linux Kernel Mailing List

On Llu, 2003-09-01 at 07:00, Jamie Lokier wrote:
> Matt Porter wrote:
> > PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI
> 
> The cache looks very coherent to me.

The only x86 which will show the user non cache coherent behaviour (and
then only in a really weird situation) is SMP Pentium Pro due to the
store fence errata.

The Winchip is non-SMP, so you won't see CPU<->CPU store ordering changes,
although I guess mmap of mmio space might show you stuff if you really
tried hard.

The Geode has bus-level magic, so it's out of order, but if you ask then
you get the right answer (kind of the zen question about falling trees
implemented in silicon).




* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:08         ` Jamie Lokier
@ 2003-09-01 11:13           ` Roman Zippel
  2003-09-02 20:42           ` Kars de Jong
  1 sibling, 0 replies; 106+ messages in thread
From: Roman Zippel @ 2003-09-01 11:13 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Kars de Jong, Geert Uytterhoeven, Linux/m68k kernel mailing list,
	Linux Kernel Development

Hi,

On Mon, 1 Sep 2003, Jamie Lokier wrote:

> I would prefer that you run the attached program.  It fixes a bug in
> the function which tests whether the problem is in the L1 cache or
> store buffer.  The bug probably didn't affect the test, but it might
> have.

This is the result for a 060:

$ ./a.out
(256) [175,175,11] Test separation: 4096 bytes: pass
(256) [173,175,11] Test separation: 8192 bytes: pass
(256) [176,175,10] Test separation: 16384 bytes: pass
(256) [174,173,11] Test separation: 32768 bytes: pass
(256) [174,175,11] Test separation: 65536 bytes: pass
(256) [175,175,10] Test separation: 131072 bytes: pass
(256) [176,176,10] Test separation: 262144 bytes: pass
(256) [175,175,11] Test separation: 524288 bytes: pass
(256) [173,175,11] Test separation: 1048576 bytes: pass
(256) [174,174,11] Test separation: 2097152 bytes: pass
(256) [176,176,10] Test separation: 4194304 bytes: pass
(256) [177,177,9] Test separation: 8388608 bytes: pass
(256) [175,176,10] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
$ cat /proc/cpuinfo
CPU:            68060
MMU:            68060
FPU:            68060
Clocking:       49.7MHz
BogoMips:       99.53
Calibration:    497664 loops

bye, Roman



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:35       ` Sam Creasey
@ 2003-09-01 10:48         ` Jamie Lokier
  2003-09-01 12:23           ` Sam Creasey
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01 10:48 UTC (permalink / raw)
  To: Sam Creasey; +Cc: Geert Uytterhoeven, Linux/m68k, Linux Kernel Development

Sam Creasey wrote:
> 68020+Sun-3 MMU results attached below.  This is for a 3/60, and it's not
> surprising that it passes, as there's no real cache in this configuration
> (the sun3/2xx did have external cache, but the onboard ethernet in my
> 3/210 is on the fritz, and it's not booting at the moment).  Note that
> this is the newer version of the program which Jamie just posted.

Thanks.

> bash-2.03# time ./jamie-test2
> (2048) [10000,10000,0] Test separation: 8192 bytes: pass

Mighty suspicious gettimeofday() you have there.

> real    1m34.330s
> user    1m30.030s
> sys     0m4.070s

Indeed, on other systems the test completes in a few seconds at most,
not because of CPU speed, but because gettimeofday() returns high
resolution time on them.

Isn't there a way to read high resolution time on the 68020 Sun-3?

-- Jamie


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  8:34     ` Geert Uytterhoeven
  2003-09-01  9:09       ` Kars de Jong
@ 2003-09-01 10:35       ` Sam Creasey
  2003-09-01 10:48         ` Jamie Lokier
  2003-09-03  8:00       ` Kars de Jong
  2 siblings, 1 reply; 106+ messages in thread
From: Sam Creasey @ 2003-09-01 10:35 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Jamie Lokier, Linux/m68k, Linux Kernel Development



On Mon, 1 Sep 2003, Geert Uytterhoeven wrote:

> As you probably know the 68020 had an external MMU (68851, or Sun-3 or Apollo
> MMU). Probably Motorola didn't bother to change the behavior when the MMU got
> integrated in later generations (68030 and up).
>
> BTW, probably you want us to run your test program on other m68k boxes? Mine
> got a 68040, that leaves us with:

>   - 68020+Sun-3 MMU

68020+Sun-3 MMU results attached below.  This is for a 3/60, and it's not
surprising that it passes, as there's no real cache in this configuration
(the sun3/2xx did have external cache, but the onboard ethernet in my
3/210 is on the fritz, and it's not booting at the moment).  Note that
this is the newer version of the program which Jamie just posted.

bash-2.03# time ./jamie-test2
(2048) [10000,10000,0] Test separation: 8192 bytes: pass
(2048) [10000,10000,0] Test separation: 16384 bytes: pass
(2048) [10000,10000,0] Test separation: 32768 bytes: pass
(2048) [10000,10000,0] Test separation: 65536 bytes: pass
(2048) [10000,10000,0] Test separation: 131072 bytes: pass
(2048) [10000,10000,0] Test separation: 262144 bytes: pass
(2048) [10000,10000,0] Test separation: 524288 bytes: pass
(2048) [10000,10000,0] Test separation: 1048576 bytes: pass
(2048) [10000,10000,0] Test separation: 2097152 bytes: pass
(2048) [10000,10000,0] Test separation: 4194304 bytes: pass
(2048) [10000,10000,0] Test separation: 8388608 bytes: pass
(2048) [10000,10000,0] Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    1m34.330s
user    1m30.030s
sys     0m4.070s
bash-2.03# cat /proc/cpuinfo
CPU:            68020
MMU:            Sun-3
FPU:            68881
Clocking:       19.9MHz
BogoMips:       4.97
Calibration:    24896 loops


-- Sam






* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  8:15   ` Russell King
@ 2003-09-01 10:12     ` Jamie Lokier
  2003-09-01 11:30       ` Geert Uytterhoeven
                         ` (2 more replies)
  0 siblings, 3 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01 10:12 UTC (permalink / raw)
  To: Russell King, Paul J.Y. Lahaie, linux-kernel

Russell King wrote:
> This looks like an old kernel on your NetWinder.  Later 2.4 kernels
> should get this right (by marking the pages uncacheable in user space.)

How do they know which pages to mark uncacheable?  Surely not all
MAP_SHARED|MAP_FIXED mappings are uncacheable?

> However, when I tried this program, it seemed to have some unexpected
> results, sometimes claiming that it's too slow, sometimes that the
> store buffer isn't coherent, and sometimes saying that the cache
> isn't coherent.

If it says the store buffer isn't coherent, that means the main test
for coherence failed (test_page_alias), but a second test
(test_l1_only), which is designed to allow any CPU delayed stores to
drain, is showing the same memory to be coherent.

There is a bug in test_l1_only which I just noticed.  It's unlikely,
but if `dummy' happens to have the same L1 cache address as both words
being tested, and it's a 2-way (or less) set-associative cache, then
it will inadvertently flush the cache and say "store buffer not
coherent" when it means to say "cache not coherent".
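
To see why `dummy' can interfere, here is a small sketch of how an
address selects a set in a set-associative cache.  The helper names and
the geometry in the example are illustrative assumptions, not taken
from the test program or from any CPU in this thread.  With two ways
per set, three lines that land in the same set cannot all stay
resident, so touching the third evicts one of the other two.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets in a cache of `cache_size' bytes with `line_size'
   byte lines and `ways' lines per set. */
unsigned long
cache_num_sets (unsigned long cache_size, unsigned long line_size,
                unsigned long ways)
{
  return cache_size / (line_size * ways);
}

/* Set selected by `addr': drop the offset within the line, then take
   the line number modulo the number of sets. */
unsigned long
cache_set_index (uintptr_t addr, unsigned long cache_size,
                 unsigned long line_size, unsigned long ways)
{
  return (unsigned long) ((addr / line_size)
                          % cache_num_sets (cache_size, line_size, ways));
}

/* Nonzero if `a' and `b' compete for the same set. */
int
same_cache_set (uintptr_t a, uintptr_t b, unsigned long cache_size,
                unsigned long line_size, unsigned long ways)
{
  return cache_set_index (a, cache_size, line_size, ways)
         == cache_set_index (b, cache_size, line_size, ways);
}
```

For a hypothetical 16k, 2-way cache with 32-byte lines there are 256
sets, so addresses 8192 bytes apart (0x1000 and 0x3000, say) share a
set while 0x1000 and 0x2000 do not.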

If the duplicate mapping is uncacheable, it should always say it's too
slow.  However, if _all_ MAP_FIXED|MAP_SHARED mappings are uncacheable,
then it compares the timings and will think there is no penalty for
the duplicate mapping.
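
The summary lines people have been posting ("all sizes passed",
"minimum fast spacing: N", "failed; will use copy buffers instead") look
consistent with taking the smallest separation from which every larger
separation passes, with "too slow" counting as a failure.  The sketch
below shows that roll-up; it is inferred from the posted output, not
copied from the program, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* pass[i] is nonzero if separation (first << i) passed.
   Returns the smallest separation such that it and every larger
   tested separation passed, or 0 if the largest one failed. */
unsigned long
minimum_fast_spacing (const int *pass, size_t count, unsigned long first)
{
  size_t i = count;

  /* Walk down from the largest separation while results pass. */
  while (i > 0 && pass[i - 1])
    i--;
  if (i == count)
    return 0;                   /* largest failed, or nothing tested */
  return first << i;
}
```

A result equal to the smallest tested separation corresponds to "all
sizes passed", and 0 corresponds to falling back to copy buffers.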

> On Fri, Aug 29, 2003 at 04:26:28PM -0400, Paul J.Y. Lahaie wrote:
> > Corel NetWinder (275MHz StrongARM)
> > Test separation: 4096 bytes: FAIL - cache not coherent

All three results I have for ARM say the caches are incoherent.
Those results are all for SA-110s of different speeds.

Please try the program below, which is the same as before but with
test_l1_only hopefully improved, and it prints some more helpful
numbers.

Thanks,
-- Jamie

         ==========================================

/* Version 3!  This code maps shared memory to multiple addresses and
   tests it for cache coherency and performance.

   Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or (at
   your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals.  These are used for when a
   race condition might leave a temporary file that should have been
   deleted -- we do our best to prevent this possibility. */

static void
block_signals (sigset_t * save_state)
{
  sigset_t all_signals;
  sigfillset (&all_signals);
  sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
  sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
   function, or using a regular temporary file in /tmp.  Immediately
   after opening the file, it is unlinked from the global namespace
   using `shm_unlink' or `unlink'.

   On success, the value returned is a file descriptor.  Otherwise, -1
   is returned and `errno' is set.

   The descriptor can be closed using simply `close'. */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
   On GNU/Linux with Glibc, it requires `-lrt'.  Unfortunately, Glibc's
   -lrt insists on linking to pthreads, which we may not want to use
   because that enables thread locking overhead in other functions.  So
   we implement a direct method of opening shm on Linux. */

/* If this is changed, change the size of `buffer' below too. */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/"      /* `shm_open' arg needs "/" for portability. */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef  SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
  char * ptr, buffer [19];
  int fd, i;
  unsigned long number;
  sigset_t save_signals;
  struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
  struct statfs sfs;
  if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0
			|| sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
    {
      errno = ENOSYS;
      return -1;
    }
#endif

 loop:
  /* Print a randomised path name into `buffer'.  The string depends on
     the directory and whether we are using POSIX.4 shared memory or a
     regular temporary file.  RANDOM is a 5-digit, base-62
     representation of a pseudo-random number.  The string is used as a
     candidate in the search for an unused shared segment or file name. */
#ifdef SHM_DIR_PREFIX
  strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
  strcpy (buffer, "/tmp/shm-");
#endif
  ptr = buffer + strlen (buffer);
  gettimeofday (&tv, (struct timezone *) 0);
  number = (unsigned long) random ();
  number += (unsigned long) getpid ();
  number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
  for (i = 0; i < 5; i++)
    {
      /* Don't use character arithmetic, as not all systems are ASCII. */
      *ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62];
      number /= 62;
    }
  *ptr = '\0';

  /* Block signals between the open and unlink, to really minimise
     the chance of accidentally leaving an unwanted file around. */
  block_signals (&save_signals);
#if HAVE_SHM_OPEN
  if (!use_tmp_file)
    {
      fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	shm_unlink (buffer);
    }
  else
#endif /* HAVE_SHM_OPEN */
    {
      fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	unlink (buffer);
    }
  unblock_signals (&save_signals);

  /* If we failed due to a name collision or a signal, try again. */
  if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
    goto loop;

  return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
   region will not be allocated for any other purpose.  It is freed with
   `munmap'.

   Returns the mapped base address on success.  Otherwise, MAP_FAILED is
   returned and `errno' is set. */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS	MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE	0
#endif
#ifndef MAP_FILE
#define MAP_FILE	0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE	0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED	((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE	PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
  void * addr;
#ifdef MAP_ANONYMOUS
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_ANONYMOUS
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0);
#else  /* not defined MAP_ANONYMOUS */
  int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
  if (zero_fd == -1)
    return MAP_FAILED;
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_FILE
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0);
  save_errno = errno;
  close (zero_fd);
  errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
  return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
   a temporary regular file.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set. */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
  void * base_addr, * addr;
  int fd, i, save_errno;
  struct stat st;

  fd = open_shared_memory_file (use_tmp_file);
  if (fd == -1)
    goto fail;

  /* First, resize the shared memory file to the desired size. */
  if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
    goto close_fail;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map
     this with the aliased buffer. */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto close_fail;

  /* Map the same shared memory repeatedly, at different addresses. */
  for (i = 0; i < 2; i++)
    {
      addr = mmap ((char *) base_addr + (i ? separation : 0), size,
		   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED,
		   fd, (off_t) 0);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen. */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }
  if (close (fd) != 0)
    goto unmap_fail;

  /* Success! */
  return base_addr;

  /* Failure. */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 close_fail:
  save_errno = errno;
  close (fd);
  errno = save_errno;
 fail:
  return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set. */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
  void * base_addr, * addr;
  sigset_t save_signals;
  int shmid, i, save_errno;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map
     this with the aliased buffer. */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Block signals between the shmget() and IPC_RMID, to minimise the chance
     of accidentally leaving an unwanted shared segment around. */
  block_signals (&save_signals);

  shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
  if (shmid == -1)
    goto unmap_fail;

  /* Map the same shared memory repeatedly, at different addresses. */
  for (i = 0; i < 2; i++)
    {
      /* `shmat' is tried twice.  The first time it can fail if the local
	 implementation of `shmat' refuses to map over a region mapped
	 with `mmap'.  In that case, we punch a hole using `munmap' and
	 do it again.

	 If the local `shmat' has this property, the `shmat' calls
	 to fixed addresses might collide with a concurrent thread
	 which is also doing mappings, and will fail.  At least it
	 is a safe failure.

	 On the other hand, if the local `shmat' can map over
	 already-mapped regions (in the same way that `mmap' does), it
	 is essential that we do actually use an already-mapped region,
	 so that collisions with a concurrent thread can't possibly
	 result in both of us grabbing the same address range with no
	 indication of error. */
      addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
      if (addr == (void *) -1 && errno == EINVAL)
	{
	  munmap ((char *) base_addr + (i ? separation : 0), size);
	  addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
	}

      /* Check for errors. */
      if (addr == (void *) -1)
	{
	  save_errno = errno;
	  if (i == 1)
	    shmdt (base_addr);
	  goto remove_shm_fail_se;
	}
      else if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `shmat' ignored the requested address! */
	  if (i == 1)
	    shmdt (base_addr);
	  save_errno = EINVAL;
	  goto remove_shm_fail_se;
	}
    }
		    
  if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
    goto remove_shm_fail;
  unblock_signals (&save_signals);

  /* Success! */
  return base_addr;

  /* Failure. */
 remove_shm_fail:
  save_errno = errno;
 remove_shm_fail_se:
  while (--i >= 0)
    shmdt ((char *) base_addr + (i ? separation : 0));
  shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
  errno = save_errno;
 unmap_fail:
  save_errno = errno;
  unblock_signals (&save_signals);
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer.  Shared memory of size `size' is
   mapped twice, with the difference between the two addresses being
   `separation', which must be at least `size'.  The total address range
   used is `separation + size' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `__page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set. */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
  void * addr;
  if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
    {
      errno = EINVAL;
      return 0;
    }

  /* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */
#ifdef SHM_DIR_PREFIX
  *method = 0;
  if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
    return addr;
#endif
#if HAVE_SYSV_SHM
  *method = 1;
  if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
    return addr;
#endif
  *method = 2;
  return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `__page_alias_map'.  `address' is the base address, and `size' and
   `separation' are the arguments previously passed to
   `__page_alias_map'.  `method' is the value previously stored in *METHOD.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set. */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
  if (method == 1)
    {
      shmdt (address);
      /* Cast to `char *': arithmetic on `void *' is a GCC extension. */
      shmdt ((char *) address + separation);
      if (separation > size)
	munmap ((char *) address + size, separation - size);
      return 0;
    }
#endif

  return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer.  `size' is the size of the buffer to
   create; it will be mapped twice to cover a total address range
   `size * 2' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set. */

void *
page_alias_map (size_t size, int * method)
{
  return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `page_alias_map'.  `address' is the base address, and `size' is the
   size of the buffer (which is half of the total mapped address range).
   `method' is a value previously stored in *METHOD by `page_alias_map'.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set. */

int
page_alias_unmap (void * address, size_t size, int method)
{
  return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
   aliased pages.  We use a combination of mappings similar to
   page_alias_*(), in case there are resource limitations which would
   prevent malloc() or a single mmap() working for the larger address
   range tests. */

static void *
page_no_alias (size_t size, size_t separation)
{
  void * base_addr, * addr;
  int i, save_errno;

  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Map anonymous memory at the different addresses. */
  for (i = 0; i < 2; i++)
    {
      addr = map_address_space ((char *) base_addr + (i ? separation : 0),
				size, 1);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen. */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }

  /* Success! */
  return base_addr;

  /* Failure. */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

/* This should be a word size that the architecture can read and write
   fast in a single instruction.  In principle, C's `int' is the natural
   word size, but in practice it isn't on 64-bit machines. */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
   also act as compiler memory barriers.  These are used to force a
   group of write/write/read instructions as close together as possible,
   to maximise the detection of store buffer conditions.  Despite being
   asm statements, these will work with any of GCC's target architectures,
   provided they have >= 4 registers. */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
  __asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
  __asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
	   : "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
   reads within a few instructions, and ignores virtual to physical
   address translations when doing that.  In principle a CPU might do
   this even if its L1 cache is physically tagged or indexed, although
   I have not seen such a system.  (A CPU which uses store buffer
   snooping and with an off-board MMU, which the CPU is unaware of,
   could have this property).

   It isn't possible to do this test perfectly; we do our best.  The
   `force_into_register' macros ensure that the write/write/read
   sequence is as compact as the compiler can make it. */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
  register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
  register WORD __reg1 = 1, __reg2 = 0;
  force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
  *__regptr1 = __reg1;
  *__regptr2 = __reg2;
  __reg1 = *__regptr1;
  force_into_register (__reg1);
  return __reg1;
}

/* This function tests whether writes to one page are seen in another
   page at a different virtual address, and whether they are nearly as
   fast as normal writes.

   The accesses are timed by the caller of this function.
   Alternate writes go to alternate pages, so that if aliasing is
   implemented using page faults, it will clearly show up in the
   timings. */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
  WORD fail = 0;
  while (--timing_loops >= 0)
    fail |= test_store_buffer_snoop (ptr1, ptr2);
  return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
   buffer snoop coherency.  To do this, we add enough stores that the
   writes to *PTR1 are flushed (or drain due to the time delay) from the
   store buffer before we read from *PTR1.  The result of this function
   is not important: it is only used in a diagnostic message. */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
  int i, j;
  WORD fail = 0;
  for (i = 0; i < 10; i++)
    {
      *ptr1 = 1;
      /* This loop of volatile writes creates a short time delay.  The
	 delay gives the store to *PTR1 time to flush from the store
	 buffer and/or the many writes flush the store buffer.  The loop
	 writes to *PTR2 because if we pick another fixed address and
	 write to it, that would be testing 3 cache lines (PTR1, PTR2
	 and the fixed address) and the fixed address _might_ happen to
	 collide with PTR1 or PTR2 in the L1 cache.  If the L1 cache is
	 2-way set-associative, that would flush it every time, possibly
	 making it appear coherent when it isn't. */
      for (j = 0; j < 1000; j++)
	*ptr2 = 0;
      fail |= *ptr1;
    }
  return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
   separation, to see if they really behave like memory appearing at two
   locations, and efficiently.  We search through different values of
   `separation' searching for a suitable "cache colour" on this machine. */

static inline const char *
test_one_separation (size_t separation)
{
  void * buffers [2];
  long timings [3];
  int i, method, timing_loops = 64;

  /* We measure timings of 3 different tests, each 128 times to find the
     minimum.  0: Writes and reads to aliased pages.  1: Writes and
     reads to non-aliased pages, to compare with 0.  2: Doing nothing,
     to measure the time for `gettimeofday' itself.

     The measurements are done in a mixed up order.  If we did 128
     measurements of type 0, then 128 of type 1, then 128 of type 2, the
     results could be misleading due to synchronisation with other
     processes occurring on the machine. */

  /* A previously generated random shuffle of bit-pairs.  Each pair is a
     number from the set {0,1,2}.  Each number occurs exactly 128 times. */
  static const unsigned char pattern [96] =
    {
      0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
      0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
      0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
      0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
      0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
      0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
      0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
      0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
    };

  buffers [0] = __page_alias_map (system_page_size, separation, &method);
  if (buffers [0] == 0)
    return "alias map failed";
  buffers [1] = page_no_alias (system_page_size, separation);
  if (buffers [1] == 0)
    {
      __page_alias_unmap (buffers [0], system_page_size, separation, method);
      return "non-alias map failed";
    }

 retry:
  timings [2] = timings [1] = timings [0] = LONG_MAX;
  for (i = 0; i < 384; i++)
    {
      struct timeval time_before, time_after;
      long time_delta;
      int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
      volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
      volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation);

      /* Test whether writes to one page appear immediately in the other,
	 and time how long the memory accesses take. */
      gettimeofday (&time_before, (struct timezone *) 0);
      if (which_test < 2)
	fail = test_page_alias (ptr1, ptr2, timing_loops);
      gettimeofday (&time_after, (struct timezone *) 0);
	      
      if (fail && which_test == 0)
	{
	  /* Test whether the failure is due to a store buffer bypass
	     which ignores virtual address translation. */
	  int l1_fail = test_l1_only (ptr1, ptr2);
	  __page_alias_unmap (buffers [0], system_page_size, separation,
			      method);
	  munmap (buffers [1], separation + system_page_size);
	  return l1_fail ? "cache not coherent" : "store buffer not coherent";
	}

      time_delta = ((time_after.tv_usec - time_before.tv_usec)
		    + 1000000 * (time_after.tv_sec - time_before.tv_sec));

      /* Find the smallest time taken for each test.  Ignore negative
	 glitches due to Linux's tendency to jump the clock backwards. */
      if (time_delta >= 0 && time_delta < timings [which_test])
	timings [which_test] = time_delta;
    }

  /* Remove the cost of `gettimeofday()' itself from measurements. */
  timings [0] -= timings [2];
  timings [1] -= timings [2];

  /* Keep looping until at least one measurement becomes significant.  A
     very fast CPU will show measurements of zero microseconds for
     smaller values of `timing_loops'.  Also loop until the cost of
     `gettimeofday()' becomes insignificant.  When the program is run
     under `strace', the latter is a big cost, and this is needed to stabilise
     the results. */
  if (timings [0] <= 10 * (1 + timings [2])
      && timings [1] <= 10 * (1 + timings [2]))
    {
      timing_loops <<= 1;
      goto retry;
    }

  __page_alias_unmap (buffers [0], system_page_size, separation, method);
  munmap (buffers [1], separation + system_page_size);

  printf ("(%d) [%ld,%ld,%ld] ",
	  timing_loops, timings [0], timings [1], timings [2]);

  /* Reject page aliasing if it is much slower than accessing a single,
     definitely cached page directly. */
  if (timings [0] > 2 * timings [1])
    return "too slow";

  /* Success!  Passed all tests for these parameters. */
  return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
  size_t size;

#ifdef _SC_PAGESIZE
  system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
  system_page_size = sysconf (_SC_PAGE_SIZE);
#else
  system_page_size = getpagesize ();
#endif

  for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
    {
      const char * reason = test_one_separation (size);

      printf ("Test separation: %lu bytes: %s%s\n",
	      (unsigned long) size, reason ? "FAIL - " : "pass",
	      reason ? reason : "");

      /* This logic searches for the smallest _contiguous_ range
	 of page sizes for which `test_one_separation' passes. */
      if (reason == 0 && page_alias_smallest_size == 0)
	page_alias_smallest_size = size;
      else if (reason != 0 && page_alias_smallest_size != 0)
	{
	  /* Fail, indicating that page-aliasing is not reliable,
	     because there's a maximum size.  We don't support that as
	     it seems quite unlikely given our model of cache colouring. */
	  page_alias_smallest_size = 0;
	  break;
 	}
    }

  printf ("VM page alias coherency test: ");

  if (page_alias_smallest_size == 0)
    printf ("failed; will use copy buffers instead\n");
  else if (page_alias_smallest_size == system_page_size)
    printf ("all sizes passed\n");
  else
    printf ("minimum fast spacing: %lu (%lu page%s)\n",
	    (unsigned long) page_alias_smallest_size,
	    (unsigned long) (page_alias_smallest_size / system_page_size),
	    (page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main ()
{
  page_alias_init ();
  return 0;
}
//#endif
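
For readers who only want to see the aliasing itself, without the timing and fallback logic, here is a stripped-down sketch. It is not part of the program above: the function name and the temporary file name are hypothetical, it assumes a POSIX system with a writable /tmp, and it lets the kernel choose both addresses, so it does not control the separation the way the real test does.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical demo.  Maps one page of a temporary file at two
   kernel-chosen addresses, writes through the first mapping and reads
   through the second.  Returns 1 if the write is visible, 0 if it is
   not, and -1 if the mappings could not be set up. */
static int
alias_demo (void)
{
  char path[] = "/tmp/alias-demo-XXXXXX";
  long page = sysconf (_SC_PAGESIZE);
  volatile long * a, * b;
  void * m1, * m2;
  int fd, seen;

  if (page <= 0 || (fd = mkstemp (path)) == -1)
    return -1;
  unlink (path);                /* The descriptor keeps the file alive. */
  if (ftruncate (fd, (off_t) page) != 0)
    {
      close (fd);
      return -1;
    }

  m1 = mmap (0, (size_t) page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  m2 = mmap (0, (size_t) page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);                   /* Mappings stay valid after close. */
  if (m1 == MAP_FAILED || m2 == MAP_FAILED)
    {
      if (m1 != MAP_FAILED)
        munmap (m1, (size_t) page);
      if (m2 != MAP_FAILED)
        munmap (m2, (size_t) page);
      return -1;
    }

  a = (volatile long *) m1;
  b = (volatile long *) m2;
  *a = 42;                      /* Store through the first alias... */
  seen = (*b == 42);            /* ...and load through the second. */

  munmap (m1, (size_t) page);
  munmap (m2, (size_t) page);
  return seen;
}
```

On a machine whose caches are coherent for the pair of addresses the kernel happens to pick, alias_demo () should return 1.  A 0 here is only a hint; the full program is still needed to find which separations are safe and fast.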

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  9:09       ` Kars de Jong
@ 2003-09-01 10:08         ` Jamie Lokier
  2003-09-01 11:13           ` Roman Zippel
  2003-09-02 20:42           ` Kars de Jong
  0 siblings, 2 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01 10:08 UTC (permalink / raw)
  To: Kars de Jong
  Cc: Geert Uytterhoeven, Linux/m68k kernel mailing list,
	Linux Kernel Development

Kars de Jong wrote:
> On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:
> > BTW, probably you want us to run your test program on other m68k boxes? Mine
> > got a 68040, that leaves us with:
> >   - 68020+68851
> >   - 68060
> 
> I can run it on these boxes if no-one else has done it yet before I come
> home tonight. I'm sure there are more people with a 68060 out there, not
> too sure about the 68020+68851.

I would prefer that you run the attached program.  It fixes a bug in
the function which tests whether the problem is in the L1 cache or
store buffer.  The bug probably didn't affect the test, but it might
have.
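
As a rough sketch of what that function decides, the diagnosis comes down to two observations. The helper below is hypothetical and only mirrors the two messages the program prints; the parameter names are assumptions.

```c
/* Hypothetical helper mirroring the diagnosis the test program prints.
   `immediate_fail': the write/write/read sequence read back a stale value.
   `delayed_fail': the read was still stale after a delay long enough for
   the store buffer to drain.  Returns 0 when there is no coherency
   failure to report. */
static const char *
diagnose_alias_failure (int immediate_fail, int delayed_fail)
{
  if (!immediate_fail)
    return 0;
  return delayed_fail ? "cache not coherent" : "store buffer not coherent";
}
```

The delayed check only matters once the immediate check has failed, which is why a bug in it could change the message printed but not whether a separation passes.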

Ideally you could run the program Geert linked to as well?
Please remember to compile both with optimisation.

Thanks,
-- Jamie

/* This code maps shared memory to multiple addresses and tests it
   for cache coherency and performance.

   Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or (at
   your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals.  These are used for when a
   race condition might leave a temporary file that should have been
   deleted -- we do our best to prevent this possibility. */

static void
block_signals (sigset_t * save_state)
{
  sigset_t all_signals;
  sigfillset (&all_signals);
  sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
  sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
   function, or using a regular temporary file in /tmp.  Immediately
   after opening the file, it is unlinked from the global namespace
   using `shm_unlink' or `unlink'.

   On success, the value returned is a file descriptor.  Otherwise, -1
   is returned and `errno' is set.

   The descriptor can be closed using simply `close'. */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
   On GNU/Linux with Glibc, it requires `-lrt'.  Unfortunately, Glibc's
   -lrt insists on linking to pthreads, which we may not want to use
   because that enables thread locking overhead in other functions.  So
   we implement a direct method of opening shm on Linux. */

/* If this is changed, change the size of `buffer' below too. */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/"      /* `shm_open' arg needs "/" for portability. */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef  SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
  char * ptr, buffer [19];
  int fd, i;
  unsigned long number;
  sigset_t save_signals;
  struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
  struct statfs sfs;
  if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0
			|| sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
    {
      errno = ENOSYS;
      return -1;
    }
#endif

 loop:
  /* Print a randomised path name into `buffer'.  The string depends on
     the directory and whether we are using POSIX.4 shared memory or a
     regular temporary file.  RANDOM is a 5-digit, base-62
     representation of a pseudo-random number.  The string is used as a
     candidate in the search for an unused shared segment or file name. */
#ifdef SHM_DIR_PREFIX
  strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
  strcpy (buffer, "/tmp/shm-");
#endif
  ptr = buffer + strlen (buffer);
  gettimeofday (&tv, (struct timezone *) 0);
  number = (unsigned long) random ();
  number += (unsigned long) getpid ();
  number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
  for (i = 0; i < 5; i++)
    {
      /* Don't use character arithmetic, as not all systems are ASCII. */
      *ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62];
      number /= 62;
    }
  *ptr = '\0';

  /* Block signals between the open and unlink, to really minimise
     the chance of accidentally leaving an unwanted file around. */
  block_signals (&save_signals);
#if HAVE_SHM_OPEN
  if (!use_tmp_file)
    {
      fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	shm_unlink (buffer);
    }
  else
#endif /* HAVE_SHM_OPEN */
    {
      fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	unlink (buffer);
    }
  unblock_signals (&save_signals);

  /* If we failed due to a name collision or a signal, try again. */
  if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
    goto loop;

  return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
   region will not be allocated for any other purpose.  It is freed with
   `munmap'.

   Returns the mapped base address on success.  Otherwise, MAP_FAILED is
   returned and `errno' is set. */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS	MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE	0
#endif
#ifndef MAP_FILE
#define MAP_FILE	0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE	0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED	((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE	PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
  void * addr;
#ifdef MAP_ANONYMOUS
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_ANONYMOUS
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0);
#else  /* not defined MAP_ANONYMOUS */
  int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
  if (zero_fd == -1)
    return MAP_FAILED;
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_FILE
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0);
  save_errno = errno;
  close (zero_fd);
  errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
  return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
   a temporary regular file.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set. */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
  void * base_addr, * addr;
  int fd, i, save_errno;
  struct stat st;

  fd = open_shared_memory_file (use_tmp_file);
  if (fd == -1)
    goto fail;

  /* First, resize the shared memory file to the desired size. */
  if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
    goto close_fail;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map
     this with the aliased buffer. */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto close_fail;

  /* Map the same shared memory repeatedly, at different addresses. */
  for (i = 0; i < 2; i++)
    {
      addr = mmap ((char *) base_addr + (i ? separation : 0), size,
		   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED,
		   fd, (off_t) 0);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen. */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }
  if (close (fd) != 0)
    goto unmap_fail;

  /* Success! */
  return base_addr;

  /* Failure. */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 close_fail:
  save_errno = errno;
  close (fd);
  errno = save_errno;
 fail:
  return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set. */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
  void * base_addr, * addr;
  sigset_t save_signals;
  int shmid, i, save_errno;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map
     this with the aliased buffer. */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Block signals between the shmget() and IPC_RMID, to minimise the chance
     of accidentally leaving an unwanted shared segment around. */
  block_signals (&save_signals);

  shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
  if (shmid == -1)
    goto unmap_fail;

  /* Map the same shared memory repeatedly, at different addresses. */
  for (i = 0; i < 2; i++)
    {
      /* `shmat' is tried twice.  The first time it can fail if the local
	 implementation of `shmat' refuses to map over a region mapped
	 with `mmap'.  In that case, we punch a hole using `munmap' and
	 do it again.

	 If the local `shmat' has this property, the `shmat' calls
	 to fixed addresses might collide with a concurrent thread
	 which is also doing mappings, and will fail.  At least it
	 is a safe failure.

	 On the other hand, if the local `shmat' can map over
	 already-mapped regions (in the same way that `mmap' does), it
	 is essential that we do actually use an already-mapped region,
	 so that collisions with a concurrent thread can't possibly
	 result in both of us grabbing the same address range with no
	 indication of error. */
      addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
      if (addr == (void *) -1 && errno == EINVAL)
	{
	  munmap ((char *) base_addr + (i ? separation : 0), size);
	  addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
	}

      /* Check for errors. */
      if (addr == (void *) -1)
	{
	  save_errno = errno;
	  if (i == 1)
	    shmdt (base_addr);
	  goto remove_shm_fail_se;
	}
      else if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `shmat' ignored the requested address! */
	  if (i == 1)
	    shmdt (base_addr);
	  save_errno = EINVAL;
	  goto remove_shm_fail_se;
	}
    }
		    
  if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
    goto remove_shm_fail;
  unblock_signals (&save_signals);

  /* Success! */
  return base_addr;

  /* Failure. */
 remove_shm_fail:
  save_errno = errno;
 remove_shm_fail_se:
  while (--i >= 0)
    shmdt ((char *) base_addr + (i ? separation : 0));
  shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
  errno = save_errno;
 unmap_fail:
  save_errno = errno;
  unblock_signals (&save_signals);
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer.  Shared memory of size `size' is
   mapped twice, with the difference between the two addresses being
   `separation', which must be at least `size'.  The total address range
   used is `separation + size' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `__page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set. */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
  void * addr;
  if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
    {
      errno = EINVAL;
      return 0;
    }

  /* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */
#ifdef SHM_DIR_PREFIX
  *method = 0;
  if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
    return addr;
#endif
#if HAVE_SYSV_SHM
  *method = 1;
  if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
    return addr;
#endif
  *method = 2;
  return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `__page_alias_map'.  `address' is the base address, and `size' and
   `separation' are the arguments previously passed to
   `__page_alias_map'.  `method' is the value previously stored in *METHOD.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set. */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
  if (method == 1)
    {
      shmdt (address);
      shmdt ((char *) address + separation);
      if (separation > size)
	munmap ((char *) address + size, separation - size);
      return 0;
    }
#endif

  return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer.  `size' is the size of the buffer to
   create; it will be mapped twice to cover a total address range
   `size * 2' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set. */

void *
page_alias_map (size_t size, int * method)
{
  return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `page_alias_map'.  `address' is the base address, and `size' is the
   size of the buffer (which is half of the total mapped address range).
   `method' is a value previously stored in *METHOD by `page_alias_map'.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set. */

int
page_alias_unmap (void * address, size_t size, int method)
{
  return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
   aliased pages.  We use a combination of mappings similar to
   page_alias_*(), in case there are resource limitations which would
   prevent malloc() or a single mmap() working for the larger address
   range tests. */

static void *
page_no_alias (size_t size, size_t separation)
{
  void * base_addr, * addr;
  int i, save_errno;

  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Map anonymous memory at the different addresses. */
  for (i = 0; i < 2; i++)
    {
      addr = map_address_space ((char *) base_addr + (i ? separation : 0),
				size, 1);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen. */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }

  /* Success! */
  return base_addr;

  /* Failure. */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

/* This should be a word size that the architecture can read and write
   fast in a single instruction.  In principle, C's `int' is the natural
   word size, but in practice it isn't on 64-bit machines. */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
   also act as compiler memory barriers.  These are used to force a
   group of write/write/read instructions as close together as possible,
   to maximise the detection of store buffer conditions.  Despite being
   asm statements, these will work with any of GCC's target architectures,
   provided they have >= 4 registers. */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
  __asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
  __asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
	   : "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
   reads within a few instructions, and ignores virtual to physical
   address translations when doing that.  In principle a CPU might do
   this even if its L1 cache is physically tagged or indexed, although
   I have not seen such a system.  (A CPU which uses store buffer
   snooping and with an off-board MMU, which the CPU is unaware of,
   could have this property).

   It isn't possible to do this test perfectly; we do our best.  The
   `force_into_register' macros ensure that the write/write/read
   sequence is as compact as the compiler can make it. */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
  register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
  register WORD __reg1 = 1, __reg2 = 0;
  force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
  *__regptr1 = __reg1;
  *__regptr2 = __reg2;
  __reg1 = *__regptr1;
  force_into_register (__reg1);
  return __reg1;
}

/* This function tests whether writes to one page are seen in another
   page at a different virtual address, and whether they are nearly as
   fast as normal writes.

   The accesses are timed by the caller of this function.
   Alternate writes go to alternate pages, so that if aliasing is
   implemented using page faults, it will clearly show up in the
   timings. */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
  WORD fail = 0;
  while (--timing_loops >= 0)
    fail |= test_store_buffer_snoop (ptr1, ptr2);
  return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
   buffer snoop coherency.  To do this, we add enough stores that the
   writes to *PTR1 are flushed (or drain due to the time delay) from the
   store buffer before we read from *PTR1.  The result of this function
   is not important: it is only used in a diagnostic message. */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
  int i, j;
  WORD fail = 0;
  for (i = 0; i < 10; i++)
    {
      *ptr1 = 1;
      /* This loop of volatile writes creates a short time delay.  The
	 delay gives the store to *PTR1 time to flush from the store
	 buffer and/or the many writes flush the store buffer.  The loop
	 writes to *PTR2 because if we pick another fixed address and
	 write to it, that would be testing 3 cache lines (PTR1, PTR2
	 and the fixed address) and the fixed address _might_ happen to
	 collide with PTR1 or PTR2 in the L1 cache.  If the L1 cache is
	 2-way set-associative, that would flush it every time, possibly
	 making it appear coherent when it isn't. */
      for (j = 0; j < 1000; j++)
	*ptr2 = 0;
      fail |= *ptr1;
    }
  return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
   separation, to see if they really behave like memory appearing at two
   locations, and efficiently.  We search through different values of
   `separation' searching for a suitable "cache colour" on this machine. */

static inline const char *
test_one_separation (size_t separation)
{
  void * buffers [2];
  long timings [3];
  int i, method, timing_loops = 64;

  /* We measure timings of 3 different tests, each 128 times to find the
     minimum.  0: Writes and reads to aliased pages.  1: Writes and
     reads to non-aliased pages, to compare with 0.  2: Doing nothing,
     to measure the time for `gettimeofday' itself.

     The measurements are done in a mixed up order.  If we did 128
     measurements of type 0, then 128 of type 1, then 128 of type 2, the
     results could be misleading due to synchronisation with other
     processes occurring on the machine. */

  /* A previously generated random shuffle of bit-pairs.  Each pair is a
     number from the set {0,1,2}.  Each number occurs exactly 128 times. */
  static const unsigned char pattern [96] =
    {
      0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
      0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
      0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
      0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
      0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
      0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
      0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
      0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
    };

  buffers [0] = __page_alias_map (system_page_size, separation, &method);
  if (buffers [0] == 0)
    return "alias map failed";
  buffers [1] = page_no_alias (system_page_size, separation);
  if (buffers [1] == 0)
    {
      __page_alias_unmap (buffers [0], system_page_size, separation, method);
      return "non-alias map failed";
    }

 retry:
  timings [2] = timings [1] = timings [0] = LONG_MAX;
  for (i = 0; i < 384; i++)
    {
      struct timeval time_before, time_after;
      long time_delta;
      int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
      volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
      volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation);

      /* Test whether writes to one page appear immediately in the other,
	 and time how long the memory accesses take. */
      gettimeofday (&time_before, (struct timezone *) 0);
      if (which_test < 2)
	fail = test_page_alias (ptr1, ptr2, timing_loops);
      gettimeofday (&time_after, (struct timezone *) 0);
	      
      if (fail && which_test == 0)
	{
	  /* Test whether the failure is due to a store buffer bypass
	     which ignores virtual address translation. */
	  int l1_fail = test_l1_only (ptr1, ptr2);
	  __page_alias_unmap (buffers [0], system_page_size, separation,
			      method);
	  munmap (buffers [1], separation + system_page_size);
	  return l1_fail ? "cache not coherent" : "store buffer not coherent";
	}

      time_delta = ((time_after.tv_usec - time_before.tv_usec)
		    + 1000000 * (time_after.tv_sec - time_before.tv_sec));

      /* Find the smallest time taken for each test.  Ignore negative
	 glitches due to Linux's tendency to jump the clock backwards. */
      if (time_delta >= 0 && time_delta < timings [which_test])
	timings [which_test] = time_delta;
    }

  /* Remove the cost of `gettimeofday()' itself from measurements. */
  timings [0] -= timings [2];
  timings [1] -= timings [2];

  /* Keep looping until at least one measurement becomes significant.  A
     very fast CPU will show measurements of zero microseconds for
     smaller values of `timing_loops'.  Also loop until the cost of
     `gettimeofday()' becomes insignificant.  When the program is run
     under `strace', the latter is large, and this is needed to stabilise
     the results. */
  if (timings [0] <= 10 * (1 + timings [2])
      && timings [1] <= 10 * (1 + timings [2]))
    {
      timing_loops <<= 1;
      goto retry;
    }

  __page_alias_unmap (buffers [0], system_page_size, separation, method);
  munmap (buffers [1], separation + system_page_size);

  printf ("(%d) [%ld,%ld,%ld] ",
	  timing_loops, timings [0], timings [1], timings [2]);

  /* Reject page aliasing if it is much slower than accessing a single,
     definitely cached page directly. */
  if (timings [0] > 2 * timings [1])
    return "too slow";

  /* Success!  Passed all tests for these parameters. */
  return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
  size_t size;

#ifdef _SC_PAGESIZE
  system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
  system_page_size = sysconf (_SC_PAGE_SIZE);
#else
  system_page_size = getpagesize ();
#endif

  for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
    {
      const char * reason = test_one_separation (size);

      printf ("Test separation: %lu bytes: %s%s\n",
	      (unsigned long) size, reason ? "FAIL - " : "pass",
	      reason ? reason : "");

      /* This logic searches for the smallest _contiguous_ range
	 of page sizes for which `test_one_separation' passes. */
      if (reason == 0 && page_alias_smallest_size == 0)
	page_alias_smallest_size = size;
      else if (reason != 0 && page_alias_smallest_size != 0)
	{
	  /* Fail, indicating that page-aliasing is not reliable,
	     because there's a maximum size.  We don't support that as
	     it seems quite unlikely given our model of cache colouring. */
	  page_alias_smallest_size = 0;
	  break;
 	}
    }

  printf ("VM page alias coherency test: ");

  if (page_alias_smallest_size == 0)
    printf ("failed; will use copy buffers instead\n");
  else if (page_alias_smallest_size == system_page_size)
    printf ("all sizes passed\n");
  else
    printf ("minimum fast spacing: %lu (%lu page%s)\n",
	    (unsigned long) page_alias_smallest_size,
	    (unsigned long) (page_alias_smallest_size / system_page_size),
	    (page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main ()
{
  page_alias_init ();
  return 0;
}
//#endif
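
A small standalone check, not part of the program above, for anyone who
wants to convince themselves about the `pattern' table in
test_one_separation().  It decodes the bit-pairs with the same expression
the timing loop uses and counts how often each test number comes up.  The
helper name `count_pattern' is made up for this illustration.

```c
#include <stdio.h>

/* Copy of the shuffle table from test_one_separation(). */
static const unsigned char pattern [96] =
  {
    0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
    0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
    0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
    0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
    0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
    0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
    0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
    0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
  };

/* Hypothetical helper: decode all 384 bit-pairs exactly as the timing
   loop does, and count how often each of 0, 1 and 2 occurs.  Returns
   the number of pairs that fall outside {0,1,2}. */
static int
count_pattern (int counts [3])
{
  int i, bad = 0;

  counts [0] = counts [1] = counts [2] = 0;
  for (i = 0; i < 384; i++)
    {
      /* Four 2-bit fields per byte, lowest bits first. */
      int which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
      if (which_test < 3)
	counts [which_test]++;
      else
	bad++;
    }
  return bad;
}

/* Print the three counts, for calling from a throwaway main(). */
static void
print_pattern_counts (void)
{
  int counts [3];
  int bad = count_pattern (counts);
  printf ("0: %d  1: %d  2: %d  invalid: %d\n",
	  counts [0], counts [1], counts [2], bad);
}
```

Calling print_pattern_counts() should report 128 for each of the three
tests and no invalid pairs, which is what the 384-iteration loop relies on.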


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  9:02                 ` David S. Miller
@ 2003-09-01 10:04                   ` Jamie Lokier
  2003-09-01 10:02                     ` David S. Miller
  2003-09-03 17:36                   ` bill davidsen
  1 sibling, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01 10:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: mfedyk, lm, linux-kernel

David S. Miller wrote:
> Why do you need the same piece of data mapped to multiple places
> in the first place, and why at specific addresses?  It's purely an
> optimization of some sort, right?

Right.  It's a circular buffer for signal processing: DSP code sees
contiguous ascending addresses.  The multiple maps mean we don't have
to copy the contents of the buffer back to the start periodically, nor
mask the offset into the array on each memory access, nor write
extra-complicated DSP code which can handle split regions.

It's an optimisation, it works well on some architectures and on
others it's not worth it.  On those, I just copy - it keeps the DSP
code fast and simple.
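
For comparison, a minimal sketch of the two conventional approaches that
the double mapping avoids: masking the index on every access, and
gathering a wrapped region into a linear scratch buffer.  The names
(`struct ring', `ring_get', `ring_linearise') and the tiny buffer size are
made up for illustration and are not taken from the DSP code discussed
here.

```c
#include <stddef.h>
#include <string.h>

#define RING_SIZE 8u		/* must be a power of two; tiny for illustration */

struct ring
{
  float data [RING_SIZE];
};

/* Masked access: every single read pays for an AND on the index. */
static float
ring_get (const struct ring * r, size_t pos)
{
  return r->data [pos & (RING_SIZE - 1)];
}

/* Split copy: gather `count' samples (count <= RING_SIZE) starting at
   `pos' into `out', so that the consumer sees ascending addresses even
   when the region wraps around the end of the ring. */
static void
ring_linearise (const struct ring * r, size_t pos, size_t count, float * out)
{
  size_t start = pos & (RING_SIZE - 1);
  size_t first = RING_SIZE - start;	/* samples before the wrap point */

  if (first > count)
    first = count;
  memcpy (out, r->data + start, first * sizeof *out);
  memcpy (out, r->data + start, first * sizeof *out);
  memcpy (out + first, r->data, (count - first) * sizeof *out);
}
```

With the aliased mapping, neither helper is needed: a pointer to
`data + start' can be handed straight to the DSP routine, because the
second mapping continues where the first one ends.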

> > Well, my code has no bug because I do run-time tests to see what
> > rubbish the architecture gave me.  As we see, they work :)
> 
> It doesn't work in just the right set of circumstances, if interrupts
> arrive at just the right moment it might flush the bad aliases out
> of the cache via displacement during your 'check' phase.
> 
> Then during your actual computation you can hit the aliasing problem
> silently.

To fool the coherence test, interrupts would need to arrive in a 2
instruction window, at least 8192 times.  It is possible, but unlikely
except in pathological situations.

Of course if you make mmap() return EINVAL then it cannot possibly fail :)

> I'd suggest instead to hardcode the SHMLBA stuff into your sources.

How?  SHMLBA is a run time value on the Sparc; I have no idea how
to work it out.

-- Jamie


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01 10:04                   ` Jamie Lokier
@ 2003-09-01 10:02                     ` David S. Miller
  0 siblings, 0 replies; 106+ messages in thread
From: David S. Miller @ 2003-09-01 10:02 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Mon, 1 Sep 2003 11:04:58 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> Of course if you make mmap() return EINVAL then it cannot possibly fail :)

Right :-)

> > I'd suggest instead to hardcode the SHMLBA stuff into your sources.
> 
> How?  SHMLBA is a run time value on the Sparc; I have no idea how
> to work it out.

You're talking about 32-bit sparc, on sparc64 it's a constant
16K.

For sparc 32-bit, just use 4MB, that's the largest possible value.

And you have to check this with uname() results, not with ifdefs
as 32-bit Sparc binaries run on sparc64 systems just fine.

I also would not object at all to a kernel patch that exported the
SHMLBA value via some sysctl value.
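
A rough sketch of what that advice could look like in user space.  The
values 16K and 4MB are the ones given above; the machine strings "sparc64"
and "sparc" are an assumption about what uname() reports on those
systems, and both function names are made up for illustration.  On
anything else it falls back to the compile-time SHMLBA from <sys/shm.h>.

```c
#include <string.h>
#include <sys/shm.h>		/* SHMLBA */
#include <sys/utsname.h>

/* Hypothetical helper: pick an alias alignment from a uname() machine
   string.  The strings compared against are assumptions. */
static unsigned long
alias_alignment_for_machine (const char * machine)
{
  if (strcmp (machine, "sparc64") == 0)
    return 16UL * 1024;			/* constant on sparc64 */
  if (strncmp (machine, "sparc", 5) == 0)
    return 4UL * 1024 * 1024;		/* largest possible 32-bit value */
  return (unsigned long) SHMLBA;	/* everything else: trust the header */
}

/* Decide at run time, so that a 32-bit Sparc binary running on a
   sparc64 kernel gets the sparc64 answer. */
static unsigned long
alias_alignment (void)
{
  struct utsname u;

  if (uname (&u) != 0)
    return 4UL * 1024 * 1024;		/* be conservative on failure */
  return alias_alignment_for_machine (u.machine);
}
```

The point of going through uname() rather than #ifdef is the one made
above: the binary's build target does not tell you which kernel it ends up
running on.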

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  8:34     ` Geert Uytterhoeven
@ 2003-09-01  9:09       ` Kars de Jong
  2003-09-01 10:08         ` Jamie Lokier
  2003-09-01 10:35       ` Sam Creasey
  2003-09-03  8:00       ` Kars de Jong
  2 siblings, 1 reply; 106+ messages in thread
From: Kars de Jong @ 2003-09-01  9:09 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Jamie Lokier, Linux/m68k kernel mailing list, Linux Kernel Development

On Mon, 2003-09-01 at 10:34, Geert Uytterhoeven wrote:

> BTW, probably you want us to run your test program on other m68k boxes? Mine
> got a 68040, that leaves us with:
>   - 68020+68551
>   - 68060

I can run it on these boxes if no-one else has done it yet before I come
home tonight. I'm sure there are more people with a 68060 out there, not
too sure about the 68020+68851.


Regards,

Kars.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  8:29               ` Jamie Lokier
@ 2003-09-01  9:02                 ` David S. Miller
  2003-09-01 10:04                   ` Jamie Lokier
  2003-09-03 17:36                   ` bill davidsen
  0 siblings, 2 replies; 106+ messages in thread
From: David S. Miller @ 2003-09-01  9:02 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Mon, 1 Sep 2003 09:29:11 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> David S. Miller wrote:
> > I disagree, MAP_FIXED means "I know what I am doing don't override
> > this unless the mapping area is not available in my address space."
> > You should never specify MAP_FIXED unless you _REALLY_ know what you
> > are doing.
> 
> So explain this from the Sparc architecture code:
> 
> 	if (flags & MAP_FIXED) {
> 		/* We do not accept a shared mapping if it would violate
> 		 * cache aliasing constraints.
> 		 */
> 		if ((flags & MAP_SHARED) && (addr & (SHMLBA - 1)))
> 			return -EINVAL;
> 		return addr;
> 	}
> 
> Ok, I'll explain it :)  At one time, the code did what the comment says,
> but nowadays linux/mm/mmap.c doesn't call arch_get_unmapped_area() for
> MAP_FIXED, so the above code is redundant and misleading.  It already
> misled me, so please remove it.  sparc and sparc64 both have it.

I take back what I said, I think the -EINVAL behavior is better
and mmap.c should call into this code to verify the requested
mmap() parameters.

> This is my strategy:
> 
> 	mmap MAP_ANON without MAP_FIXED to find a free area
> 	mmap MAP_FIXED over the anon area at same address
> 	mmap MAP_FIXED over the anon area at larger address
> 
> I don't see any strategy that lets me establish this kind of circular
> mapping on Sparc without either (a) knowing the value of SHMLBA, or
> (b) risking clobbering another thread's mmap.

Why do you need the same piece of data mapped to multiple places
in the first place, and why at specific addresses?  It's purely an
optimization of some sort, right?

> Well, my code has no bug because I do run-time tests to see what
> rubbish the architecture gave me.  As we see, they work :)

It doesn't work in just the right set of circumstances, if interrupts
arrive at just the right moment it might flush the bad aliases out
of the cache via displacement during your 'check' phase.

Then during your actual computation you can hit the aliasing problem
silently.

That's just a bad way to do this.

> I don't see any real alternative to doing that.

I'd suggest instead to hardcode the SHMLBA stuff into your sources.

> But that's ok, it seems robust and portable.

Unfortunately, it is anything but robust.

> > There is no efficient way to do this from userspace, only the
> > kernel has access to the more efficient cache flushing instructions.
> > You'd need to flush via loads to displace the aliasing cache lines.
> 
> Will msync() do it?

No.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  5:58   ` Jamie Lokier
@ 2003-09-01  8:34     ` Geert Uytterhoeven
  2003-09-01  9:09       ` Kars de Jong
                         ` (2 more replies)
  0 siblings, 3 replies; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-09-01  8:34 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux/m68k, Linux Kernel Development

On Mon, 1 Sep 2003, Jamie Lokier wrote:
> Geert Uytterhoeven wrote:
> > Are you also interested in m68k? ;-)
> > 
> > cassandra:/tmp# time ./test
> > Test separation: 4096 bytes: FAIL - store buffer not coherent
> 
> Especially!  I hadn't expected to see any machine that would print
> "store buffer not coherent".  It means that if there's an L1 cache, it
> is coherent, but any store-then-load bypass in the CPU pipeline is
> using the virtual address with no rollback after MMU translation.
> 
> I had thought it would only be the case with chips using an external
> MMU, but now that I think about it, the older simpler chips aren't
> going to bother with things like pipeline rollback wherever they can
> get away without it!

As you probably know, the 68020 had an external MMU (68851, or Sun-3 or Apollo
MMU). Probably Motorola didn't bother to change the behavior when the MMU got
integrated in later generations (68030 and up).

BTW, probably you want us to run your test program on other m68k boxes? Mine
got a 68040, that leaves us with:
  - 68020+68851
  - 68020+Sun-3 MMU
  - 68030
  - 68060

For linux-m68k: You can find the test program source in Jamie's original
posting on lkml. For your convenience, I put a binary for m68k at
http://home.tvd.be/cr26864/Linux/m68k/jamie_test.gz. Just tell us the
program's output and give us a copy of your /proc/cpuinfo. Thanks!

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  7:06             ` David S. Miller
@ 2003-09-01  8:29               ` Jamie Lokier
  2003-09-01  9:02                 ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  8:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: mfedyk, lm, linux-kernel

David S. Miller wrote:
> I disagree, MAP_FIXED means "I know what I am doing don't override
> this unless the mapping area is not available in my address space."
> You should never specify MAP_FIXED unless you _REALLY_ know what you
> are doing.

So explain this from the Sparc architecture code:

	if (flags & MAP_FIXED) {
		/* We do not accept a shared mapping if it would violate
		 * cache aliasing constraints.
		 */
		if ((flags & MAP_SHARED) && (addr & (SHMLBA - 1)))
			return -EINVAL;
		return addr;
	}

Ok, I'll explain it :)  At one time, the code did what the comment says,
but nowadays linux/mm/mmap.c doesn't call arch_get_unmapped_area() for
MAP_FIXED, so the above code is redundant and misleading.  It already
misled me, so please remove it.  sparc and sparc64 both have it.
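
For reference, a user-space mirror of the test in that kernel snippet,
plus the matching round-up.  This is only a sketch: the function names
are made up, `shmlba' is assumed to be a power of two, and it copies the
quoted check literally (address only), without claiming that this is the
complete aliasing rule on any particular machine.

```c
#include <stdint.h>

/* Same test as the quoted kernel code: non-zero if `addr' is not a
   multiple of `shmlba'.  `shmlba' must be a power of two. */
static int
violates_alias_alignment (const void * addr, uintptr_t shmlba)
{
  return ((uintptr_t) addr & (shmlba - 1)) != 0;
}

/* Round `addr' up to the next multiple of `shmlba' (a power of two). */
static uintptr_t
round_up_to_alignment (uintptr_t addr, uintptr_t shmlba)
{
  return (addr + shmlba - 1) & ~(shmlba - 1);
}
```

A caller that knew the run-time SHMLBA could reserve `shmlba' extra bytes
and round the base address up before placing the MAP_FIXED mappings.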

> > Thus I have three Sparc-specific questions:
> > 
> > 	1. How does userspace find out the value of SHMLBA?
> > 	   On Sparc, it is not a compile-time constant.
> 
> Don't specify MAP_FIXED for MAP_SHARED mapping if you want
> proper coherency, that's my answer for this one.

I can't safely set up this kind of mapping without MAP_FIXED, unless I
know SHMLBA.

This is my strategy:

	mmap MAP_ANON without MAP_FIXED to find a free area
	mmap MAP_FIXED over the anon area at same address
	mmap MAP_FIXED over the anon area at larger address

I don't see any strategy that lets me establish this kind of circular
mapping on Sparc without either (a) knowing the value of SHMLBA, or
(b) risking clobbering another thread's mmap.
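
The three steps can be sketched as below.  This is a cut-down
illustration, not the test program itself: `alias_twice' is a made-up
name, it uses an unlinked temporary file under /tmp (assumed writable)
instead of POSIX or SYSV shared memory, and it places the second mapping
exactly `size' bytes after the first, which ignores any SHMLBA
constraint.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* MAP_ANONYMOUS and mkstemp() on glibc */
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `size' bytes of a temporary file twice, back
   to back.  `size' must be a multiple of the page size.  Returns the
   base address, or NULL on failure. */
static void *
alias_twice (size_t size)
{
  char name [] = "/tmp/alias-XXXXXX";
  char * base;
  void * lo, * hi;
  int fd = mkstemp (name);

  if (fd == -1)
    return NULL;
  unlink (name);			/* the file lives on via `fd' only */
  if (ftruncate (fd, (off_t) size) != 0)
    {
      close (fd);
      return NULL;
    }

  /* Step 1: anonymous mapping without MAP_FIXED, so the kernel picks a
     free range that no other thread is using. */
  base = mmap (NULL, 2 * size, PROT_NONE,
	       MAP_PRIVATE | MAP_ANONYMOUS, -1, (off_t) 0);
  if (base == MAP_FAILED)
    {
      close (fd);
      return NULL;
    }

  /* Steps 2 and 3: MAP_FIXED over the range we already own, first at
     the same address and then at the larger address. */
  lo = mmap (base, size, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, fd, (off_t) 0);
  hi = mmap (base + size, size, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, fd, (off_t) 0);
  close (fd);
  if (lo == MAP_FAILED || hi == MAP_FAILED)
    {
      munmap (base, 2 * size);
      return NULL;
    }
  return base;
}
```

Because the MAP_FIXED calls only ever land inside the range returned by
step 1, they cannot clobber a mapping made concurrently by another
thread.  What the sketch cannot do is choose a spacing that satisfies
SHMLBA without knowing its value, which is the problem described above.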

> > 	3. Is there a kernel bug on Sparc, because the test program
> > 	   is either getting mappings that aren't aligned to run time
> > 	   SHMLBA, or the kernel's run time SHMLBA value is not correct.
> 
> No, the user is allowed to hang himself with MAP_FIXED.
> The bug is in your code :)

Well, my code has no bug because I do run-time tests to see what
rubbish the architecture gave me.  As we see, they work :)

I don't see any real alternative to doing that.  But that's ok, it
seems robust and portable.  It's a shame about the slow cache flush,
because I can sometimes use fast cache flushing to improve my DSP
buffering algorithms.

> > 	2. Is flushing part of the data cache something I can do from
> > 	   userspace?  (I'll figure out the exact machine instructions
> > 	   myself if I need to do this, but it'd be nice to know if
> > 	   it's possible before I have a go).
> 
> There is no efficient way to do this from userspace, only the
> kernel has access to the more efficient cache flushing instructions.
> You'd need to flush via loads to displace the aliasing cache lines.

Will msync() do it?

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 20:26 ` Paul J.Y. Lahaie
@ 2003-09-01  8:15   ` Russell King
  2003-09-01 10:12     ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Russell King @ 2003-09-01  8:15 UTC (permalink / raw)
  To: Paul J.Y. Lahaie; +Cc: Jamie Lokier, linux-kernel

This looks like an old kernel on your NetWinder.  Later 2.4 kernels
should get this right (by marking the pages uncacheable in user space.)

However, when I tried this program, it seemed to have some unexpected
results, sometimes claiming that it's too slow, sometimes that the
store buffer isn't coherent, and sometimes saying that the cache
isn't coherent.

Oddly, davem's cache aliasing test program works every time.

It's something which I need to look into, but I don't know when I'm
going to find the time to delve into the memory management stuff.

On Fri, Aug 29, 2003 at 04:26:28PM -0400, Paul J.Y. Lahaie wrote:
> Corel NetWinder (275MHz StrongARM)
> Test separation: 4096 bytes: FAIL - cache not coherent
> Test separation: 8192 bytes: FAIL - cache not coherent
> Test separation: 16384 bytes: FAIL - cache not coherent
> Test separation: 32768 bytes: FAIL - cache not coherent
> Test separation: 65536 bytes: FAIL - cache not coherent
> Test separation: 131072 bytes: FAIL - cache not coherent
> Test separation: 262144 bytes: FAIL - cache not coherent
> Test separation: 524288 bytes: FAIL - cache not coherent
> Test separation: 1048576 bytes: FAIL - cache not coherent
> Test separation: 2097152 bytes: FAIL - cache not coherent
> Test separation: 4194304 bytes: FAIL - cache not coherent
> Test separation: 8388608 bytes: FAIL - cache not coherent
> Test separation: 16777216 bytes: FAIL - cache not coherent
> VM page alias coherency test: failed; will use copy buffers instead
> 
> cat /proc/cpuinfo
> Processor       : StrongARM-110 rev 3 (v4l)
> BogoMIPS        : 185.95
> Features        : swp half 26bit fastmult
>  
> Hardware        : Rebel-NetWinder
> Revision        : 52ff
> Serial          : 00000000000008bf

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  6:42           ` Jamie Lokier
@ 2003-09-01  7:06             ` David S. Miller
  2003-09-01  8:29               ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2003-09-01  7:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Mon, 1 Sep 2003 07:42:31 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> David S. Miller wrote:
> > On Sun, 31 Aug 2003 23:49:37 +0100
> > Jamie Lokier <jamie@shareable.org> wrote:
> > 
> > > It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> > > doesn't constrain the mapping alignment.
> > 
> > That's wrong.  If a platform needs to, it should properly
> > align the mapping when MAP_SHARED is used on a file.
> > 
> > If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
> > that when we're mmap()'ing a file and MAP_SHARED is specified,
> > we align things to SHMLBA.
> 
> Then you have a bug in the Sparc code.  It looks like it should return
> -EINVAL when a misaligned mapping is used with MAP_FIXED|MAP_SHARED,
> but the test program is clearly getting mappings that aren't aligned
> to SHMLBA.

I disagree.  MAP_FIXED means "I know what I am doing; don't override
this unless the mapping area is not available in my address space."
You should never specify MAP_FIXED unless you _REALLY_ know what you
are doing.

> Thus I have three Sparc-specific questions:
> 
> 	1. How does userspace find out the value of SHMLBA?
> 	   On Sparc, it is not a compile-time constant.

Don't specify MAP_FIXED for MAP_SHARED mapping if you want
proper coherency, that's my answer for this one.

> 	2. Is flushing part of the data cache something I can do from
> 	   userspace?  (I'll figure out the exact machine instructions
> 	   myself if I need to do this, but it'd be nice to know if
> 	   it's possible before I have a go).

There is no efficient way to do this from userspace, only the
kernel has access to the more efficient cache flushing instructions.
You'd need to flush via loads to displace the aliasing cache lines.

> 	3. Is there a kernel bug on Sparc, because the test program
> 	   is either getting mappings that aren't aligned to run time
> 	   SHMLBA, or the kernel's run time SHMLBA value is not correct.

No, the user is allowed to hang himself with MAP_FIXED.

The bug is in your code :)
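
For anyone following question 1: a minimal sketch, not taken from this
thread, of how userspace might read SHMLBA and choose a separation that
is a multiple of it.  It assumes the C library exposes SHMLBA through
<sys/shm.h> (where it may expand to a run-time call rather than a
constant).  The helper names shmlba_value, aliases_congruent and
aligned_separation are illustrative only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/shm.h>

/* SHMLBA as the C library reports it; on some ports this is a call. */
size_t shmlba_value(void)
{
    return (size_t)SHMLBA;
}

/* Non-zero if both addresses sit at the same offset within an
 * lba-sized window, i.e. they would index the cache identically. */
int aliases_congruent(uintptr_t a, uintptr_t b, size_t lba)
{
    assert(lba != 0);
    return a % lba == b % lba;
}

/* Round a wanted separation up to the next multiple of lba. */
size_t aligned_separation(size_t wanted, size_t lba)
{
    assert(lba != 0);
    return (wanted + lba - 1) / lba * lba;
}
```

Whether a multiple of SHMLBA is enough for coherency on a given
machine is up to the architecture; the sketch only does the arithmetic.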

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  5:31         ` David S. Miller
@ 2003-09-01  6:42           ` Jamie Lokier
  2003-09-01  7:06             ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  6:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: mfedyk, lm, linux-kernel

David S. Miller wrote:
> On Sun, 31 Aug 2003 23:49:37 +0100
> Jamie Lokier <jamie@shareable.org> wrote:
> 
> > It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> > doesn't constrain the mapping alignment.
> 
> That's wrong.  If a platform needs to, it should properly
> align the mapping when MAP_SHARED is used on a file.
> 
> If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
> that when we're mmap()'ing a file and MAP_SHARED is specified,
> we align things to SHMLBA.

Then you have a bug in the Sparc code.  It looks like it should return
-EINVAL when a misaligned mapping is used with MAP_FIXED|MAP_SHARED,
but the test program is clearly getting mappings that aren't aligned
to SHMLBA.

> If userspace purposefully violates this alignment attempt,
> then it's at its own peril to keep the mappings coherent,
> there is simply nothing the kernel should be doing to help
> out that case.

I understand that userspace needs to keep it coherent, or map to a
multiple of SHMLBA.  I don't mind whether the kernel constrains the
mapping address or not, with a slight preference for userspace
flexibility.

Thus I have three Sparc-specific questions:

	1. How does userspace find out the value of SHMLBA?
	   On Sparc, it is not a compile-time constant.

	2. Is flushing part of the data cache something I can do from
	   userspace?  (I'll figure out the exact machine instructions
	   myself if I need to do this, but it'd be nice to know if
	   it's possible before I have a go).

	3. Is there a kernel bug on Sparc, because the test program
	   is either getting mappings that aren't aligned to run time
	   SHMLBA, or the kernel's run time SHMLBA value is not correct.

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 17:39 ` Matt Porter
@ 2003-09-01  6:00   ` Jamie Lokier
  2003-09-01 11:17     ` Alan Cox
  2003-09-01 17:22     ` Roland Dreier
  0 siblings, 2 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  6:00 UTC (permalink / raw)
  To: Matt Porter; +Cc: linux-kernel

Matt Porter wrote:
> PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI

The cache looks very coherent to me.

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 16:27 ` Geert Uytterhoeven
@ 2003-09-01  5:58   ` Jamie Lokier
  2003-09-01  8:34     ` Geert Uytterhoeven
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  5:58 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux Kernel Development

Geert Uytterhoeven wrote:
> Are you also interested in m68k? ;-)
> 
> cassandra:/tmp# time ./test
> Test separation: 4096 bytes: FAIL - store buffer not coherent

Especially!  I hadn't expected to see any machine that would print
"store buffer not coherent".  It means that if there's an L1 cache, it
is coherent, but any store-then-load bypass in the CPU pipeline is
using the virtual address with no rollback after MMU translation.

I had thought it would only be the case with chips using an external
MMU, but now that I think about it, the older simpler chips aren't
going to bother with things like pipeline rollback wherever they can
get away without it!

(The other CPU that is reporting "store buffer not coherent" is
PA-RISC, which is even more of an eye opener.  That has a big 1MiB
coherent L1 cache, and the pipeline bypass is coherent for very large
separations but not others!)
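
To make the store-then-load case concrete, here is a rough sketch of
the kind of check being described: map the same page at two virtual
addresses a chosen distance apart, store through one alias and load
through the other.  This is not the test program from this thread; it
does no timing, and it uses a temporary file rather than POSIX shared
memory.  The function name alias_store_load_ok is made up for
illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if every store through one alias was seen through the
 * other, 0 if a stale value was read, -1 if setup failed. */
int alias_store_load_ok(size_t separation)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (separation == 0 || separation % (size_t)page != 0)
        return -1;

    FILE *f = tmpfile();                 /* unlinked scratch file */
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, page) != 0) {
        fclose(f);
        return -1;
    }

    /* Reserve address space first, so that both MAP_FIXED mappings
     * land inside a range this process already owns. */
    size_t span = separation + (size_t)page;
    char *base = mmap(NULL, span, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    void *pa = mmap(base, (size_t)page, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *pb = mmap(base + separation, (size_t)page,
                    PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);

    int ok = (pa != MAP_FAILED && pb != MAP_FAILED) ? 1 : -1;
    if (ok == 1) {
        volatile int *a = pa;
        volatile int *b = pb;
        for (int i = 1; i <= 1000 && ok == 1; i++) {
            *a = i;                      /* store via first alias  */
            if (*b != i)                 /* load via second alias  */
                ok = 0;
            *b = -i;
            if (*a != -i)
                ok = 0;
        }
    }
    munmap(base, span);
    fclose(f);
    return ok;
}
```

A result of 0 says only that a stale value was read; telling a store
buffer problem apart from a cache problem needs more care than this
sketch takes.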

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 15:41 ` Larry McVoy
  2003-08-29 23:05   ` Mike Fedyk
@ 2003-09-01  5:44   ` Jamie Lokier
  2003-09-01 14:43     ` Larry McVoy
  1 sibling, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  5:44 UTC (permalink / raw)
  To: Larry McVoy, linux-kernel

Larry McVoy wrote:
> On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> > I'd appreciate if folks would run the program below on various
> > machines, especially those whose caches aren't automatically coherent
> > at the hardware level.
> 
> Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC

Thanks Larry.  That's a great range you have!
Collected and will be posted shortly in a table with the others.

> If you care, I also have freebsd (v2, v3, v4), netbsd 1.5, openbsd 3.0 (all
> bsd systems are x86, mostly celerons), hpux 10.20, sco, solaris, solaris/x86,
> Irix, MacOS X, AIX, Tru64 and probably some others.

AIX would be interesting; I don't have an RS6000.  The rest of the
CPUs I have results for, and it sounds like a lot of effort for what's
basically a compile/compatibility test.

However, if it's very little effort for you to run the test on them please do!

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-31 22:49       ` Jamie Lokier
@ 2003-09-01  5:31         ` David S. Miller
  2003-09-01  6:42           ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2003-09-01  5:31 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: mfedyk, lm, linux-kernel

On Sun, 31 Aug 2003 23:49:37 +0100
Jamie Lokier <jamie@shareable.org> wrote:

> It uses POSIX shared memory and (necessarily) MAP_SHARED, which
> doesn't constrain the mapping alignment.

That's wrong.  If a platform needs to, it should properly
align the mapping when MAP_SHARED is used on a file.

If you look in arch/sparc64/kernel/sys_sparc.c, you'll see
that when we're mmap()'ing a file and MAP_SHARED is specified,
we align things to SHMLBA.

If userspace purposefully violates this alignment attempt,
then it's at its own peril to keep the mappings coherent,
there is simply nothing the kernel should be doing to help
out that case.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 10:03 ` J.A. Magallon
  2003-08-29 10:36   ` Alan Cox
@ 2003-09-01  4:49   ` Jamie Lokier
  1 sibling, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  4:49 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: linux-kernel

J.A. Magallon wrote:
> On 08.29, Jamie Lokier wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent.  The L1
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
> > 
> 
> Sorry if this is a stupid question, but have you heard about 64K-aliasing ?
> We have seen it in P3/P4, do not know if Athlons also suffer it.
> In short, x86 is crap. It slows like a dog when accessing two memory
> positions separated by 2^n (address decoder has two 16 bits adders, instead
> of 1 32 bits..., cache is 16 bit tagged, etc...)

I don't know what you mean.  This test doesn't observe any gross
timing effect at 64K.  I have just tried it on a Celeron Coppermine
printing more detailed numbers, and I don't notice anything at all.

So, what exactly do you mean?  What kind of code shows the effect you
are talking about?

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  1:13 ` dean gaudet
@ 2003-09-01  4:29   ` Jamie Lokier
  0 siblings, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  4:29 UTC (permalink / raw)
  To: dean gaudet; +Cc: linux-kernel

dean gaudet wrote:
> On Fri, 29 Aug 2003, Jamie Lokier wrote:
> > I already got a surprise (to me): my Athlon MP is much slower
> > accessing multiple mappings which are within 32k of each other, than
> > mappings which are further apart, although it is coherent.  The L1
> > data cache is 64k.  (The explanation is easy: virtually indexed,
> > physically tagged cache moves data among cache lines, possibly via L2).
> 
> opteron has 64KiB / 2-way L1 which means 15-bits of indexing... which
> totally predicts the 32KiB spacing i saw someone else post about.

Aha, thanks!  All Athlons are the same with 64KiB L1 and 32KiB
threshold, and K6 is the same but with 16KiB threshold instead.

> there's a real oddity i found on p4 just yesterday.  i was doing some
> pointer-chasing experiments, and i set up two 8192B shared mappings to the
> same file, for example:
> 
> 0x50000000 => /var/tmp/foo offset 0
> 0x50002000 => /var/tmp/foo offset 0
> 
> then i set up a 4 element cycle:
> 
> 0x50000000 => 0x50001004 => 0x50002008 => 0x5000300c => 0x50000000
> 
> when i do this it seems to trip up a p4 badly ... i'm seeing 3000 cycles
> per load on a 2.4GHz p4, and 300 cycles per load on a 2.4GHz xeon.  the
> crazy thing is that small variations in the experiment (such as longer
> cycles) make the oddity go away!

I have no idea of the explanation, unless P4 is doing the same as the
Athlon, 3000 cycles is the cost of an L1/L2 miss, and P4 has virtual
aliasing in both L1 and L2.  Hmm.

I would certainly like to detect that if it occurs with typical
instruction streams, otherwise it'll clobber my application's
performance on a P4.  I don't have a P4 to test on, btw.  If you can
investigate further that would be very good.

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  1:00     ` Paul Mundt
@ 2003-09-01  1:58       ` Jamie Lokier
  0 siblings, 0 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  1:58 UTC (permalink / raw)
  To: linux-kernel

Paul Mundt wrote:
> On Mon, Sep 01, 2003 at 01:37:50AM +0100, Jamie Lokier wrote:
> > > sh (VIPT cache):
> > > 
> > > Test separation: 4096 bytes: FAIL - cache not coherent
> > > Test separation: 8192 bytes: FAIL - cache not coherent
> > > Test separation: 16384 bytes: pass
> > 
> > A VIVT cache can do that, but I think a VIPT cache should always be coherent.
> > Do I misunderstand?
> > 
> There's nothing stating that VIPT == automatic coherency,
> as is obviously the case for sh, where we are completely VIPT, but
> are also non coherent.

Ah.  A VIPT cache needn't be coherent with itself if it isn't coherent
w.r.t. external devices.  Thanks.

-- Jamie



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (20 preceding siblings ...)
  2003-09-01  0:24 ` Paul Mundt
@ 2003-09-01  1:13 ` dean gaudet
  2003-09-01  4:29   ` Jamie Lokier
  2003-09-02 10:08 ` Jan Rychter
  22 siblings, 1 reply; 106+ messages in thread
From: dean gaudet @ 2003-09-01  1:13 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, 29 Aug 2003, Jamie Lokier wrote:

> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent.  The L1
> data cache is 64k.  (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).

opteron has 64KiB / 2-way L1 which means 15-bits of indexing... which
totally predicts the 32KiB spacing i saw someone else post about.
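
the arithmetic behind that, written out as a small sketch.  the
function names are illustrative; the only inputs taken from this
thread are the 64KiB / 2-way figures, and 4KiB pages are assumed.
"index bits" here counts the line-offset bits too, as in the 15-bit
figure above.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by one way of a set-associative cache. */
size_t way_size(size_t cache_bytes, unsigned ways)
{
    assert(ways != 0 && cache_bytes % ways == 0);
    return cache_bytes / ways;
}

/* log2 of the way size: address bits (set index plus line offset)
 * that select a location within one way.  Power of two assumed. */
unsigned index_bits(size_t way_bytes)
{
    unsigned bits = 0;
    assert(way_bytes != 0 && (way_bytes & (way_bytes - 1)) == 0);
    while ((way_bytes >>= 1) != 0)
        bits++;
    return bits;
}

/* Number of page "colours": how many pages fit in one way. */
size_t page_colours(size_t way_bytes, size_t page_bytes)
{
    assert(page_bytes != 0);
    return way_bytes > page_bytes ? way_bytes / page_bytes : 1;
}
```

with these, way_size(64 * 1024, 2) is 32768, index_bits(32768) is 15,
and page_colours(32768, 4096) is 8, which lines up with the
"32768 (8 pages)" minimum fast spacing reported for Athlons.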

tm8000 also has some virtual aliasing and your test detects it properly...
but i'm probably not supposed to say anything about that :)

there's a real oddity i found on p4 just yesterday.  i was doing some
pointer-chasing experiments, and i set up two 8192B shared mappings to the
same file, for example:

0x50000000 => /var/tmp/foo offset 0
0x50002000 => /var/tmp/foo offset 0

then i set up a 4 element cycle:

0x50000000 => 0x50001004 => 0x50002008 => 0x5000300c => 0x50000000

when i do this it seems to trip up a p4 badly ... i'm seeing 3000 cycles
per load on a 2.4GHz p4, and 300 cycles per load on a 2.4GHz xeon.  the
crazy thing is that small variations in the experiment (such as longer
cycles) make the oddity go away!

i've placed my hack here <http://arctic.org/~dean/noah/chase.c>.
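
for readers without access to that file, a much smaller sketch of the
pointer-chasing idea follows.  it is not chase.c: it links slots in an
ordinary static buffer instead of two shared mappings of one file, and
it does no timing.  the stride of one page plus one pointer mirrors
the 0x1004 step in the cycle above; all names (build_cycle, chase,
demo_cycle) are made up, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define CYCLE_LEN 4
#define STRIDE (4096 / sizeof(void *) + 1)  /* one page plus one pointer */

static void *demo_slots[CYCLE_LEN * STRIDE];

/* Link `count` slots, `stride` elements apart, into a single cycle
 * and return a pointer to the first slot. */
void **build_cycle(void **slots, size_t count, size_t stride)
{
    assert(count > 0);
    for (size_t i = 0; i < count; i++) {
        size_t next = (i + 1) % count;
        slots[i * stride] = &slots[next * stride];
    }
    return &slots[0];
}

/* Follow the chain for `steps` loads; each load depends on the
 * previous one, so the loads cannot overlap. */
void **chase(void **p, size_t steps)
{
    while (steps--)
        p = (void **)*p;
    return p;
}

/* Build the 4-element cycle in the static buffer. */
void **demo_cycle(void)
{
    return build_cycle(demo_slots, CYCLE_LEN, STRIDE);
}
```

timing a large number of chase() steps and dividing by the step count
gives cycles per load; reproducing the p4 effect would additionally
need the aliased mappings described above.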


> This suggests scope for improving x86 kernel performance in the areas
> of kmap() and shared library / executable mappings, by good choice of
> _virtual_ addresses.  This doesn't require a cache colouring
> page allocator, so maybe it's a new avenue?

i was trying to use wli's pgcl patch to test out larger clustering, but it
still has some perf problems which i never got enough time to dig into
further :)  this approach might be better than just colouring.

here's what i've found tripping up virtual aliasing on processors which
have this "feature":

- shared use empty_zero_page trips up virtual aliasing for things like BSS
  -- especially if the program for some reason doesn't typically have to
  write before reading.  this is pretty easy to fix (there's even an
  example fix in the mips architecture, i believe R4000 or something)

- kernel and user mappings differ in the virtual index bits.  this means
  CoW will trip up virtual aliases amongst other things.  i imagine it
  means network checksum calculation on write(2) data will trip up virtual
  aliases.  this is more of a pain to fix in a way which is nice on SMP.

- physical pages change their virtual index bits each alloc/free.

mind you overall i'm not sure that i'm seeing any perf loss due to this
sort of thing...

-dean

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  0:37   ` Jamie Lokier
@ 2003-09-01  1:00     ` Paul Mundt
  2003-09-01  1:58       ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: Paul Mundt @ 2003-09-01  1:00 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Mon, Sep 01, 2003 at 01:37:50AM +0100, Jamie Lokier wrote:
> > sh (VIPT cache):
> > 
> > Test separation: 4096 bytes: FAIL - cache not coherent
> > Test separation: 8192 bytes: FAIL - cache not coherent
> > Test separation: 16384 bytes: pass
> 
> A VIVT cache can do that, but I think a VIPT cache should always be coherent.
> Do I misunderstand?
> 
There's nothing stating that VIPT == automatic coherency, as is obviously the
case for sh, where we are completely VIPT, but are also non coherent.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-09-01  0:24 ` Paul Mundt
@ 2003-09-01  0:37   ` Jamie Lokier
  2003-09-01  1:00     ` Paul Mundt
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-09-01  0:37 UTC (permalink / raw)
  To: linux-kernel

Paul Mundt wrote:
> sh (VIPT cache):
> 
> Test separation: 4096 bytes: FAIL - cache not coherent
> Test separation: 8192 bytes: FAIL - cache not coherent
> Test separation: 16384 bytes: pass

A VIVT cache can do that, but I think a VIPT cache should always be coherent.
Do I misunderstand?

-- Jamie

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (19 preceding siblings ...)
  2003-08-29 23:47 ` Kurt Wall
@ 2003-09-01  0:24 ` Paul Mundt
  2003-09-01  0:37   ` Jamie Lokier
  2003-09-01  1:13 ` dean gaudet
  2003-09-02 10:08 ` Jan Rychter
  22 siblings, 1 reply; 106+ messages in thread
From: Paul Mundt @ 2003-09-01  0:24 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>
sh (VIPT cache):

Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

$ cat /proc/cpuinfo
machine         : Sega Dreamcast
processor       : 0
cpu family      : sh4
cpu type        : SH7750
cache size      : 8K-bytes/16K-bytes
bogomips        : 199.06
cpu clock       : 199.49MHz
bus clock       : 99.74MHz
module clock    : 49.87MHz

and on sh64 (which is sort of VIPT/VIVT, as it looks at physical tags if
there's no match on virtual):

Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 8192 (2 pages)

-sh-2.05b$ cat /proc/cpuinfo
machine         : Hitachi Cayman
processor       : 0
cpu family      : SH-5
cpu type        : SH5-101
icache size     : 32K-bytes
dcache size     : 32K-bytes
itlb entries    : 64
dtlb entries    : 64
cpu clock       : 314.73MHz
bus clock       : 157.36MHz
module clock    : 26.22MHz
bogomips        : 313.75


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-31  5:10     ` David S. Miller
@ 2003-08-31 22:49       ` Jamie Lokier
  2003-09-01  5:31         ` David S. Miller
  0 siblings, 1 reply; 106+ messages in thread
From: Jamie Lokier @ 2003-08-31 22:49 UTC (permalink / raw)
  To: David S. Miller; +Cc: Mike Fedyk, lm, linux-kernel

David S. Miller wrote:
> On Fri, 29 Aug 2003 16:05:21 -0700
> Mike Fedyk <mfedyk@matchmail.com> wrote:
> 
> > Does this mean that userspace has to take into consideration that the cache isn't
> > coherent for adjacent small memory accesses on sparc?  What could happen if
> > it doesn't, or does it need to at all?
> 
> For shared memory, we enforce the correct mapping alignment
> so that coherency issues don't crop up.
> 
> How does this program work?  I haven't taken a close look
> at it.  Does it use MAP_SHARED or IPC shm?

It uses POSIX shared memory and (necessarily) MAP_SHARED, which
doesn't constrain the mapping alignment.

I had wondered if some kernels used page faults to maintain coherence
between multiple shared mappings of the same file.  It's one of the
things the program checks, and I have seen it mentioned on l-k, which
made me think it might be implemented.  None of the results for any
architecture show it, though.

If userspace does create multiple shared mappings at non-coherent
offsets, what is the recommended method for switching between
accessing one page (or page cluster?) and accessing the other.  Is it
msync(), a special system call to flush parts of the data cache, a
machine instruction, or something else?

Thanks,
-- Jamie



ps. The program has code to try IPC shm instead.  Change "#ifdef
SHM_DIR_PREFIX" in __page_alias_map to "#if 0", and add
-DHAVE_SYSV_SHM to the GCC command line.  It should fail the same test
sizes with a different message.
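
As a rough illustration of the IPC shm route (this is not the code in
the test program), the sketch below attaches one System V segment
twice, letting the kernel pick both addresses, then checks that each
address is a multiple of SHMLBA and that a store through one
attachment is visible through the other.  The function name
sysv_double_attach_ok is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns 1 if both attachments are SHMLBA-aligned and coherent,
 * 0 if not, -1 if the segment could not be set up. */
int sysv_double_attach_ok(size_t bytes)
{
    assert(bytes >= sizeof(int));
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    void *pa = shmat(id, NULL, 0);       /* kernel chooses the address */
    void *pb = shmat(id, NULL, 0);
    shmctl(id, IPC_RMID, NULL);          /* freed after the last detach */

    if (pa == (void *)-1 || pb == (void *)-1) {
        if (pa != (void *)-1)
            shmdt(pa);
        if (pb != (void *)-1)
            shmdt(pb);
        return -1;
    }

    uintptr_t lba = (uintptr_t)SHMLBA;
    int ok = ((uintptr_t)pa % lba == 0) && ((uintptr_t)pb % lba == 0);

    volatile int *a = pa;
    volatile int *b = pb;
    *a = 0x1234;                         /* store via first attachment */
    if (*b != 0x1234)                    /* load via second attachment */
        ok = 0;

    shmdt(pa);
    shmdt(pb);
    return ok;
}
```

Because the kernel picks the addresses here, this exercises the
aligned case rather than the deliberately misaligned separations the
test program tries.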

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 23:05   ` Mike Fedyk
@ 2003-08-31  5:10     ` David S. Miller
  2003-08-31 22:49       ` Jamie Lokier
  0 siblings, 1 reply; 106+ messages in thread
From: David S. Miller @ 2003-08-31  5:10 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: lm, jamie, linux-kernel

On Fri, 29 Aug 2003 16:05:21 -0700
Mike Fedyk <mfedyk@matchmail.com> wrote:

> Does this mean that userspace has to take into consideration that the cache isn't
> coherent for adjacent small memory accesses on sparc?  What could happen if
> it doesn't, or does it need to at all?

For shared memory, we enforce the correct mapping alignment
so that coherency issues don't crop up.

How does this program work?  I haven't taken a close look
at it.  Does it use MAP_SHARED or IPC shm?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 15:47 ` Herbert Poetzl
@ 2003-08-30  1:48   ` Stuart Longland
  0 siblings, 0 replies; 106+ messages in thread
From: Stuart Longland @ 2003-08-30  1:48 UTC (permalink / raw)
  To: jamie; +Cc: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

	I've thrown this at a Gateway Microserver (aka. Sun Cobalt Qube) which
runs an r5k little endian MIPS.  I'd also throw this at a Silicon
Graphics Indy, but I don't feel energetic enough right now to go and
drag the beast out.

	Also attached, is the results from my laptop (Toshiba Protege 7010CT)
and web server (Generic Dual P-Pro).


- --
+-------------------------------------------------------------+
| Stuart Longland           stuartl at longlandclan.hopto.org |
| Brisbane Mesh Node: 719             http://stuartl.cjb.net/ |
| I haven't lost my mind - it's backed up on a tape somewhere |
| Griffith Student No:           Course: Bachelor/IT (Nathan) |
+-------------------------------------------------------------+


- -------------------< From the qube >-----------------------
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

real    0m0.276s
user    0m0.140s
sys     0m0.120s

system type		: MIPS Cobalt
processor		: 0
cpu model		: Nevada V10.0  FPU V10.0
BogoMIPS		: 249.85
wait instruction	: yes
microsecond timers	: yes
tlb_entries		: 48
extra interrupt vector	: yes
hardware watchpoint	: no
VCED exceptions		: not available
VCEI exceptions		: not available
- -------------------< From the qube >-----------------------

- ------------------< From the laptop >----------------------
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.195s
user    0m0.142s
sys     0m0.052s

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 5
model name	: Pentium II (Deschutes)
stepping	: 2
cpu MHz		: 300.026
cache size	: 512 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat
pse36 mmx fxsr
bogomips	: 591.87
- ------------------< From the laptop >----------------------

- ----------------< From the web server >--------------------
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.279s
user    0m0.210s
sys     0m0.060s

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 1
model name	: Pentium Pro
stepping	: 9
cpu MHz		: 199.434
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips	: 398.13

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 1
model name	: Pentium Pro
stepping	: 9
cpu MHz		: 199.434
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips	: 398.13
- ----------------< From the web server >--------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE/UAKFIGJk7gLSDPcRAif8AJ9WKjTGIGYJdHgME/Fkac4cNZKUkACdHwA5
yHQlu/O96H4IUHKGflJncmI=
=yAoq
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (18 preceding siblings ...)
  2003-08-29 22:35 ` Kenneth Johansson
@ 2003-08-29 23:47 ` Kurt Wall
  2003-09-01  0:24 ` Paul Mundt
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Kurt Wall @ 2003-08-29 23:47 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Quoth Jamie Lokier:
> Dear All,
> 
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

[snip]

----- system one ---
$ time ./mmap 
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.475s
user    0m0.250s
sys     0m0.020s
$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 5
model name      : Pentium II (Deschutes)
stepping        : 2
cpu MHz         : 349.200
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips        : 696.32
-----

----- system two ---
[kwall]$ time ./mmap 
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.134s
user    0m0.120s
sys     0m0.010s
]$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 801.830
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1599.07
-----

---- system three -----
$ time ./mmap 
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real    0m0.101s
user    0m0.090s
sys     0m0.010s
root@advent:~# cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1210.825
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2418.27
----- 

Now, that was interesting. The AMD is my fastest machine...

Kurt
-- 
"I have the world's largest collection of seashells.  I keep it
scattered around the beaches of the world ... Perhaps you've seen it.
		-- Steven Wright

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 15:41 ` Larry McVoy
@ 2003-08-29 23:05   ` Mike Fedyk
  2003-08-31  5:10     ` David S. Miller
  2003-09-01  5:44   ` Jamie Lokier
  1 sibling, 1 reply; 106+ messages in thread
From: Mike Fedyk @ 2003-08-29 23:05 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Jamie Lokier, linux-kernel

On Fri, Aug 29, 2003 at 08:41:01AM -0700, Larry McVoy wrote:

> ====== sparc.bitmover.com ======
> Test separation: 8192 bytes: FAIL - cache not coherent

> VM page alias coherency test: minimum fast spacing: 16384 (2 pages)
> 0.29user 0.02system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (107major+36minor)pagefaults 0swaps
> Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown
> cpu		: TI UltraSparc IIi
> fpu		: UltraSparc IIi integrated FPU
> promlib		: Version 3 Revision 11
> prom		: 3.11.12
> type		: sun4u
> ncpus probed	: 1
> ncpus active	: 1
> BogoMips	: 539.03
> MMU Type	: Spitfire

Does this mean that userspace has to take into account that the cache isn't
coherent for adjacent small memory accesses on sparc?  What could happen if
it doesn't, or does it need to at all?

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (17 preceding siblings ...)
  2003-08-29 20:26 ` Paul J.Y. Lahaie
@ 2003-08-29 22:35 ` Kenneth Johansson
  2003-08-29 23:47 ` Kurt Wall
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Kenneth Johansson @ 2003-08-29 22:35 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, 2003-08-29 at 07:35, Jamie Lokier wrote:
> Dear All,
> 
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
 
real    0m0.473s
user    0m0.280s
sys     0m0.100s

>cat /proc/cpuinfo
cpu             : 405CR
clock           : 200MHz
revision        : 1.69 (pvr 4011 0145)
bogomips        : 199.88
machine         : Ericsson ELN 2XX
plb bus clock   : 100MHz





* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (16 preceding siblings ...)
  2003-08-29 20:14 ` Iulian Musat
@ 2003-08-29 20:26 ` Paul J.Y. Lahaie
  2003-09-01  8:15   ` Russell King
  2003-08-29 22:35 ` Kenneth Johansson
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 106+ messages in thread
From: Paul J.Y. Lahaie @ 2003-08-29 20:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Ran it on a few systems here.

Corel NetWinder (275MHz StrongARM)
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead

cat /proc/cpuinfo
Processor       : StrongARM-110 rev 3 (v4l)
BogoMIPS        : 185.95
Features        : swp half 26bit fastmult
 
Hardware        : Rebel-NetWinder
Revision        : 52ff
Serial          : 00000000000008bf



HP zx6000 (2xItanium 2)
time ./test
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
                                                                                
real    0m7.455s
user    0m7.412s
sys     0m0.040s

cat /proc/cpuinfo
processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.000000
itc MHz    : 900.000000
BogoMIPS   : 1346.37






* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (15 preceding siblings ...)
  2003-08-29 20:03 ` Sean Neakums
@ 2003-08-29 20:14 ` Iulian Musat
  2003-08-29 20:26 ` Paul J.Y. Lahaie
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Iulian Musat @ 2003-08-29 20:14 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel



Jamie Lokier wrote:
> Anyway, please lots of people run the program and post the output +
> /proc/cpuinfo.  Compile with optimisation, -O or -O2 is fine.  (You
> can add -DHAVE_SYSV_SHM too if you like):
> 
> 	gcc -o test test.c -O2
> 	time ./test
> 	cat /proc/cpuinfo

2 AMD Athlon
4 Itanium II (on an altix machine)
2 Pentium III
1 AMD XP
1 Pentium IV


2 AMD Athlon :
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real    0m0.088s
user    0m0.080s
sys     0m0.004s

cat /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1526.385
cache size      : 256 KB
Physical processor ID   : -2084402944
Number of siblings      : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3038.00

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1526.385
cache size      : 256 KB
Physical processor ID   : 410321912
Number of siblings      : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3046.10

~~~~~~~~~~~~~~~~~~~~~~~~

4 Itanium II (on an altix machine)
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.095s
user    0m0.065s
sys     0m0.028s

cat /proc/cpuinfo

processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.000000
itc MHz    : 900.000000
BogoMIPS   : 1346.37

processor  : 1
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.000000
itc MHz    : 900.000000
BogoMIPS   : 1346.37

processor  : 2
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.000000
itc MHz    : 900.000000
BogoMIPS   : 1342.17

processor  : 3
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.000000
itc MHz    : 900.000000
BogoMIPS   : 1342.17

~~~~~~~~~~~~~~~~~~~~~~~~

2 Pentium III
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.154s
user    0m0.109s
sys     0m0.020s

cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 846.353
cache size      : 256 KB
Physical processor ID   : 0
Number of siblings      : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1682.99

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 846.353
cache size      : 256 KB
Physical processor ID   : 0
Number of siblings      : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1691.09

~~~~~~~~~~~~~~~~~~~~~~~~

1 AMD XP
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real    0m0.077s
user    0m0.060s
sys     0m0.010s

cat /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) XP 2100+
stepping        : 2
cpu MHz         : 1746.168
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 3486.51


~~~~~~~~~~~~~~~~~~~~~~~~

1 Pentium IV
~~~~~~~~~~~~~~~~~~~~~~~~
gcc -o test test.c -O2

time ./test

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.221s
user    0m0.180s
sys     0m0.025s


cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 0
model name      : Intel(R) Pentium(R) 4 CPU 1700MHz
stepping        : 10
cpu MHz         : 1694.928
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3365.99

~~~~~~~~~~~~~~~~~~~~~~~~



-iulian


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (14 preceding siblings ...)
  2003-08-29 19:37 ` Thorsten Kranzkowski
@ 2003-08-29 20:03 ` Sean Neakums
  2003-08-29 20:14 ` Iulian Musat
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Sean Neakums @ 2003-08-29 20:03 UTC (permalink / raw)
  To: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

2-way Pentium III:

$ time ./va
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.096s
user    0m0.073s
sys     0m0.023s
$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family      1133MHz
stepping        : 1
cpu MHz         : 1129.879
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2220.03

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family      1133MHz
stepping        : 1
cpu MHz         : 1129.879
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 2252.80

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (13 preceding siblings ...)
  2003-08-29 17:39 ` Matt Porter
@ 2003-08-29 19:37 ` Thorsten Kranzkowski
  2003-08-29 20:03 ` Sean Neakums
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Thorsten Kranzkowski @ 2003-08-29 19:37 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> Dear All,
> 
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
 

Dual Alpha ev6:


ds20:~/src/cachetest$ ./doit 
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (4 pages)

real    0m4.148s
user    0m4.029s
sys     0m0.075s
cpu                     : Alpha
cpu model               : EV6
cpu variation           : 7
cpu revision            : 0
cpu serial number       : 
system type             : Tsunami
system variation        : Goldrush
system revision         : 0
system serial number    : ay91560403
cycle frequency [Hz]    : 500000000 
timer frequency [Hz]    : 1024.00
page size [bytes]       : 8192
phys. address bits      : 44
max. addr. space #      : 255
BogoMIPS                : 998.56
kernel unaligned acc    : 0 (pc=0,va=0)
user unaligned acc      : 0 (pc=0,va=0)
platform string         : AlphaServer DS20 500 MHz
cpus detected           : 2
cpus active             : 2
cpu active mask         : 0000000000000003



Single Alpha ev4 (AXPpci33):

Marvin:~/src/cachetest$ ./doit 
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m1.442s
user    0m0.853s
sys     0m0.471s
cpu                     : Alpha
cpu model               : LCA4
cpu variation           : -4294967301
cpu revision            : 0
cpu serial number       : Linux_is_Great!
system type             : Noname
system variation        : 0
system revision         : 0
system serial number    : MILO-2.2-17
cycle frequency [Hz]    : 166868457 
timer frequency [Hz]    : 1024.00
page size [bytes]       : 8192
phys. address bits      : 34
max. addr. space #      : 63
BogoMIPS                : 320.40
kernel unaligned acc    : 56014443 (pc=fffffc0000ab65a4,va=fffffc0000b99105)
user unaligned acc      : 2695 (pc=2000031ff90,va=11fffef26)
platform string         : N/A
cpus detected           : 0




ordinary Pentium II:


bash-2.03$ ./doit           
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.342s
user    0m0.290s
sys     0m0.030s
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 3
model name      : Pentium II (Klamath)
stepping        : 4
cpu MHz         : 300.691
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov mmx
bogomips        : 599.65




bye,
Thorsten

-- 
| Thorsten Kranzkowski        Internet: dl8bcu@dl8bcu.de                      |
| Mobile: ++49 170 1876134       Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8bcu@marvin.dl8bcu.ampr.org [44.130.8.19] |

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (12 preceding siblings ...)
  2003-08-29 16:31 ` Brian Jackson
@ 2003-08-29 17:39 ` Matt Porter
  2003-09-01  6:00   ` Jamie Lokier
  2003-08-29 19:37 ` Thorsten Kranzkowski
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 106+ messages in thread
From: Matt Porter @ 2003-08-29 17:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> Anyway, please lots of people run the program and post the output +
> /proc/cpuinfo.  Compile with optimisation, -O or -O2 is fine.  (You
> can add -DHAVE_SYSV_SHM too if you like):
> 
> 	gcc -o test test.c -O2
> 	time ./test
> 	cat /proc/cpuinfo

PPC440GX, non cache coherent, L1 icache is VTPI, L1 dcache is PTPI

-----

440gx-1:~/cachetest# gcc -o test test.c -O2
440gx-1:~/cachetest# time ./test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real	0m0.193s
user	0m0.140s
sys	0m0.010s
440gx-1:~/cachetest# cat /proc/cpuinfo
cpu		: 440GX Rev. A
revision	: 24.80 (pvr 51b2 1850)
bogomips	: 624.23
vendor		: IBM
machine		: PPC440GX EVB (Ocotea)
440gx-1:~/cachetest# 

-- 
Matt Porter
mporter@kernel.crashing.org

* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (11 preceding siblings ...)
  2003-08-29 16:27 ` Geert Uytterhoeven
@ 2003-08-29 16:31 ` Brian Jackson
  2003-08-29 17:39 ` Matt Porter
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Brian Jackson @ 2003-08-29 16:31 UTC (permalink / raw)
  To: Jamie Lokier, linux-kernel

On Friday 29 August 2003 12:35 am, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
>
<snip>

Didn't see a 512k cache athlon-xp yet

skyline:/share/linux/projects/cachetest # sh go
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 10
model name      : AMD Athlon(tm) XP 2800+
stepping        : 0
cpu MHz         : 2088.111
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 4168.08

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real    0m0.110s
user    0m0.070s
sys     0m0.030s

--Brian Jackson

-- 
OpenGFS -- http://opengfs.sourceforge.net
Gentoo -- http://gentoo.brianandsara.net
Home -- http://www.brianandsara.net


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (10 preceding siblings ...)
  2003-08-29 15:47 ` Herbert Poetzl
@ 2003-08-29 16:27 ` Geert Uytterhoeven
  2003-09-01  5:58   ` Jamie Lokier
  2003-08-29 16:31 ` Brian Jackson
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 106+ messages in thread
From: Geert Uytterhoeven @ 2003-08-29 16:27 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux Kernel Development

On Fri, 29 Aug 2003, Jamie Lokier wrote:
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

Are you also interested in m68k? ;-)

cassandra:/tmp# time ./test
Test separation: 4096 bytes: FAIL - store buffer not coherent
Test separation: 8192 bytes: FAIL - store buffer not coherent
Test separation: 16384 bytes: FAIL - store buffer not coherent
Test separation: 32768 bytes: FAIL - store buffer not coherent
Test separation: 65536 bytes: FAIL - store buffer not coherent
Test separation: 131072 bytes: FAIL - store buffer not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
Test separation: 4194304 bytes: FAIL - store buffer not coherent
Test separation: 8388608 bytes: FAIL - store buffer not coherent
Test separation: 16777216 bytes: FAIL - store buffer not coherent
VM page alias coherency test: failed; will use copy buffers instead

real	0m0.478s
user	0m0.110s
sys	0m0.190s
cassandra:/tmp# cat /proc/cpuinfo 
CPU:		68040
MMU:		68040
FPU:		68040
Clocking:	24.8MHz
BogoMips:	16.53
Calibration:	82688 loops
cassandra:/tmp# 


callisto$ time ./test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real	0m0.329s
user	0m0.270s
sys	0m0.050s
callisto$ cat /proc/cpuinfo 
cpu		: 604r
clock		: 200MHz
revision	: 18.3 (pvr 0009 1203)
bogomips	: 398.13
machine		: CHRP IBM,LongTrail-2
memory bank 0	: 32 MB SDRAM
memory bank 1	: 32 MB SDRAM
memory bank 2	: 32 MB SDRAM
memory bank 3	: 32 MB SDRAM
board l2	: 512 KB Pipelined Synchronous (Write-Through)
callisto$

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (9 preceding siblings ...)
  2003-08-29 15:41 ` Larry McVoy
@ 2003-08-29 15:47 ` Herbert Poetzl
  2003-08-30  1:48   ` Stuart Longland
  2003-08-29 16:27 ` Geert Uytterhoeven
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 106+ messages in thread
From: Herbert Poetzl @ 2003-08-29 15:47 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel


# gcc -o test test.c -O2
# ./test 
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

# cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(tm) MP 1800+
stepping	: 2
cpu MHz		: 1533.425
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 3060.53

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 6
model name	: AMD Athlon(tm) Processor
stepping	: 2
cpu MHz		: 1533.425
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips	: 3060.53



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (8 preceding siblings ...)
  2003-08-29 11:51 ` James Morris
@ 2003-08-29 15:41 ` Larry McVoy
  2003-08-29 23:05   ` Mike Fedyk
  2003-09-01  5:44   ` Jamie Lokier
  2003-08-29 15:47 ` Herbert Poetzl
                   ` (12 subsequent siblings)
  22 siblings, 2 replies; 106+ messages in thread
From: Larry McVoy @ 2003-08-29 15:41 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

Results for Alpha, IA64, MIPS, ARM, PARISC, PPC, MIPSEL, X86, SPARC

If you care, I also have freebsd (v2, v3, v4), netbsd 1.5, openbsd 3.0 (all
bsd systems are x86, mostly celerons), hpux 10.20, sco, solaris, solaris/x86,
Irix, MacOS X, AIX, Tru64 and probably some others.

====== alpha.bitmover.com ======
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux alpha.bitmover.com 2.4.21-pre5 #2 Thu Mar 20 07:54:03 PST 2003 alpha unknown
cpu			: Alpha
cpu model		: EV56
cpu variation		: 7
cpu revision		: 0
cpu serial number	: 
system type		: EB164
system variation	: PC164
system revision		: 0
system serial number	: 
cycle frequency [Hz]	: 500000000 
timer frequency [Hz]	: 1024.00
page size [bytes]	: 8192
phys. address bits	: 40
max. addr. space #	: 127
BogoMIPS		: 992.88
kernel unaligned acc	: 0 (pc=0,va=0)
user unaligned acc	: 0 (pc=0,va=0)
platform string		: Digital AlphaPC 164 500 MHz
cpus detected		: 1

====== ia64.bitmover.com ======
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux ia64.bitmover.com 2.4.9-18smp #1 SMP Tue Dec 11 12:59:00 EST 2001 ia64 unknown
processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium
model      : 0
revision   : 7
archrev    : 0
features   : standard
cpu number : 0
cpu regs   : 4
cpu MHz    : 799.486992
itc MHz    : 799.486992
BogoMIPS   : 796.91

processor  : 1
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium
model      : 0
revision   : 7
archrev    : 0
features   : standard
cpu number : 0
cpu regs   : 4
cpu MHz    : 799.486992
itc MHz    : 799.486992
BogoMIPS   : 796.91


====== mips.bitmover.com ======
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
Linux mips 2.4.18-r4k-ip22 #1 Sun Jun 23 15:30:50 CEST 2002 mips unknown
system type		: SGI Indy
processor		: 0
cpu model		: R4000SC V6.0  FPU V0.0
BogoMIPS		: 86.83
byteorder		: big endian
wait instruction	: no
microsecond timers	: yes
tlb_entries		: 48
extra interrupt vector	: no
hardware watchpoint	: yes
VCED exceptions		: 2955114
VCEI exceptions		: 0

====== netwinder.bitmover.com ======
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - cache not coherent
Test separation: 524288 bytes: FAIL - cache not coherent
Test separation: 1048576 bytes: FAIL - cache not coherent
Test separation: 2097152 bytes: FAIL - cache not coherent
Test separation: 4194304 bytes: FAIL - cache not coherent
Test separation: 8388608 bytes: FAIL - cache not coherent
Test separation: 16777216 bytes: FAIL - cache not coherent
VM page alias coherency test: failed; will use copy buffers instead
Linux netwinder 2.2.12-19991020 #1 Wed Oct 20 13:09:07 EDT 1999 armv4l unknown
Processor	: Intel sa110 rev 3
BogoMips	: 262.14
Hardware	: Rebel-NetWinder
Serial #	: 3464
Revision	: 52ff

====== parisc.bitmover.com ======
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: FAIL - cache not coherent
Test separation: 32768 bytes: FAIL - cache not coherent
Test separation: 65536 bytes: FAIL - cache not coherent
Test separation: 131072 bytes: FAIL - cache not coherent
Test separation: 262144 bytes: FAIL - store buffer not coherent
Test separation: 524288 bytes: FAIL - store buffer not coherent
Test separation: 1048576 bytes: FAIL - store buffer not coherent
Test separation: 2097152 bytes: FAIL - store buffer not coherent
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 4194304 (1024 pages)
Linux parisc 2.4.17-64 #1 Sat Mar 16 17:31:44 MST 2002 parisc64 unknown
processor	: 0
cpu family	: PA-RISC 2.0
cpu		: PA8600 (PCX-W+)
cpu MHz		: 550.000000
model		: 9000/800/A500-5X
model name	: Crescendo 550
hversion	: 0x00005d50
sversion	: 0x00000491
I-cache		: 512 KB
D-cache		: 1024 KB (WB)
ITLB entries	: 160
DTLB entries	: 160 - shared with ITLB
bogomips	: 1097.72
software id	: 580790518


====== ppc.bitmover.com ======
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux ppc.bitmover.com 2.4.6-pre2 #2 Sun Jun 10 20:21:17 PDT 2001 ppc unknown
processor	: 0
cpu		: 750
temperature 	: 0 C
clock		: 333MHz
revision	: 2.2
bogomips	: 665.69
zero pages	: total: 0 (0Kb) current: 0 (0Kb) hits: 0/0 (0%)
machine		: iMac,1
motherboard	: iMac MacRISC Power Macintosh
L2 cache	: 512K unified
memory		: 160MB
pmac-generation	: NewWorld

====== qube.bitmover.com ======
Test separation: 4096 bytes: FAIL - cache not coherent
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)
0.31user 0.10system 0:00.40elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (116major+34minor)pagefaults 0swaps
Linux qube.bitmover.com 2.0.34 #1 Thu Jan 28 03:03:03 PST 1999 mips unknown
cpu			: MIPS
cpu model		: Nevada V10.0
system type		: Cobalt Microserver 27
BogoMIPS		: 249.86
byteorder		: little endian
unaligned accesses	: 16
wait instruction	: yes
microsecond timers	: yes
extra interrupt vector	: yes
hardware watchpoint	: no

====== redhat71.bitmover.com ======
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
Linux redhat71.bitmover.com 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 6
model name	: Celeron (Mendocino)
stepping	: 5
cpu MHz		: 467.739
cache size	: 128 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips	: 933.88


====== sparc.bitmover.com ======
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)
0.29user 0.02system 0:00.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (107major+36minor)pagefaults 0swaps
Linux sparc.bitmover.com 2.2.18 #2 Thu Dec 21 18:53:16 PST 2000 sparc64 unknown
cpu		: TI UltraSparc IIi
fpu		: UltraSparc IIi integrated FPU
promlib		: Version 3 Revision 11
prom		: 3.11.12
type		: sun4u
ncpus probed	: 1
ncpus active	: 1
BogoMips	: 539.03
MMU Type	: Spitfire

-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitmover.com/lm


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (7 preceding siblings ...)
  2003-08-29 11:41 ` Gianni Tedesco
@ 2003-08-29 11:51 ` James Morris
  2003-08-29 15:41 ` Larry McVoy
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: James Morris @ 2003-08-29 11:51 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Here's the result for sparc64 (Ultrasparc II):

$ gcc -o test test.c -O2
$ time ./test
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)

real    0m0.194s
user    0m0.160s
sys     0m0.040s
$ gcc -o test test.c -O2 -DHAVE_SYSV_SHM
$ time ./test
Test separation: 8192 bytes: FAIL - cache not coherent
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (2 pages)

real    0m0.162s
user    0m0.140s
sys     0m0.020s

$ cat /proc/cpuinfo

cpu             : TI UltraSparc II  (BlackBird)
fpu             : UltraSparc II integrated FPU
promlib         : Version 3 Revision 23
prom            : 3.23.1
type            : sun4u
ncpus probed    : 2
ncpus active    : 2
Cpu0Bogo        : 591.46
Cpu0ClkTck      : 0000000011a4f2ed
Cpu2Bogo        : 591.46
Cpu2ClkTck      : 0000000011a4f2ed
MMU Type        : Spitfire
State:
CPU0:           online
CPU2:           online



-- 
James Morris
<jmorris@intercode.com.au>



* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (6 preceding siblings ...)
  2003-08-29 10:49 ` Mikael Pettersson
@ 2003-08-29 11:41 ` Gianni Tedesco
  2003-08-29 11:51 ` James Morris
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Gianni Tedesco @ 2003-08-29 11:41 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel


On Fri, 2003-08-29 at 06:35, Jamie Lokier wrote:
> Dear All,
> 
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.

PPC (G4).

Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

cpu             : 7455, altivec supported
clock           : 667MHz
revision        : 2.1 (pvr 8001 0201)
bogomips        : 665.19
machine         : PowerBook3,4
motherboard     : PowerBook3,4 MacRISC2 MacRISC Power Macintosh
board revision  : 00000000
detected as     : 73 (PowerBook Titanium III)
pmac flags      : 0000000b
L2 cache        : 256K unified
memory          : 512MB
pmac-generation : NewWorld

-- 
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import
8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D




* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (5 preceding siblings ...)
  2003-08-29 10:37 ` CaT
@ 2003-08-29 10:49 ` Mikael Pettersson
  2003-08-29 11:41 ` Gianni Tedesco
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Mikael Pettersson @ 2003-08-29 10:49 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

Jamie Lokier writes:
 > Dear All,
 > 
 > I'd appreciate if folks would run the program below on various
 > machines, especially those whose caches aren't automatically coherent
 > at the hardware level.

From a dual Opteron 244 box:

Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)
0.08user 0.01system 0:00.08elapsed 101%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (131major+38minor)pagefaults 0swaps

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 5
model name	: AMD Opteron(tm) Processor 244
stepping	: 1
cpu MHz		: 1791.569
cache size	: 1024 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips	: 3565.15
TLB size	: 1088 4K pages
clflush size	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts ttp

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 5
model name	: AMD Opteron(tm) Processor 244
stepping	: 1
cpu MHz		: 1791.569
cache size	: 1024 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips	: 3578.26
TLB size	: 1088 4K pages
clflush size	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts ttp


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (4 preceding siblings ...)
  2003-08-29 10:34 ` CaT
@ 2003-08-29 10:37 ` CaT
  2003-08-29 10:49 ` Mikael Pettersson
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: CaT @ 2003-08-29 10:37 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> 	gcc -o test test.c -O2
> 	time ./test
> 	cat /proc/cpuinfo

Forgot about this one. :/

$ time ./coherencytest
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 16384 (4 pages)

real    0m0.543s
user    0m0.230s
sys     0m0.020s
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 5
model           : 8
model name      : AMD-K6(tm) 3D processor
stepping        : 12
cpu MHz         : 300.691
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow k6_mtrr
bogomips        : 599.65

-- 
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
	- http://tinyurl.com/h6fo


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29 10:03 ` J.A. Magallon
@ 2003-08-29 10:36   ` Alan Cox
  2003-09-01  4:49   ` Jamie Lokier
  1 sibling, 0 replies; 106+ messages in thread
From: Alan Cox @ 2003-08-29 10:36 UTC (permalink / raw)
  To: J.A. Magallon; +Cc: Jamie Lokier, Linux Kernel Mailing List

On Fri, 2003-08-29 at 11:03, J.A. Magallon wrote:
> Sorry if this is a stupid question, but have you heard about 64K aliasing?
> We have seen it on P3/P4; I do not know if Athlons also suffer from it.
> In short, x86 is crap. It slows to a crawl when accessing two memory
> positions separated by 2^n (the address decoder has two 16-bit adders instead
> of one 32-bit adder, the cache is 16-bit tagged, etc.).

Pretty much all processors are bad at handling memory accesses that sit
at the same offset within power-of-two sized blocks, because they end up
competing for the same cache sets. That's one of the reasons for slab
and for things like the old kernel code putting skb structs at the end
of the skbuff data.

Grab a copy of "UNIX Systems for Modern Architectures".




* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (3 preceding siblings ...)
  2003-08-29 10:21 ` J.A. Magallon
@ 2003-08-29 10:34 ` CaT
  2003-08-29 10:37 ` CaT
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: CaT @ 2003-08-29 10:34 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, Aug 29, 2003 at 06:35:10AM +0100, Jamie Lokier wrote:
> 	gcc -o test test.c -O2
> 	time ./test
> 	cat /proc/cpuinfo

16 [20:33:33] hogarth@theirongiant:/home/hogarth>> time ./coherencytest
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

real    0m0.206s
user    0m0.135s
sys     0m0.027s
16 [20:33:44] hogarth@theirongiant:/home/hogarth>> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 701.641
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1388.54


-- 
"How can I not love the Americans? They helped me with a flat tire the
other day," he said.
	- http://tinyurl.com/h6fo


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
                   ` (2 preceding siblings ...)
  2003-08-29 10:15 ` J.A. Magallon
@ 2003-08-29 10:21 ` J.A. Magallon
  2003-08-29 10:34 ` CaT
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: J.A. Magallon @ 2003-08-29 10:21 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel


On 08.29, Jamie Lokier wrote:
> Dear All,
> 
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
> 

Dual P4 Xeon

annwn:~> gcc -march=pentium4 -O2 -fomit-frame-pointer -o vm-test vm-test.c
annwn:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
annwn:~> gcc -DHAVE_SYSV_SHM -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c
annwn:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed
annwn:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 1784.328
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3552.05

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 1784.328
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3565.15

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
  2003-08-29 10:03 ` J.A. Magallon
  2003-08-29 10:04 ` Sergey S. Kostyliov
@ 2003-08-29 10:15 ` J.A. Magallon
  2003-08-29 10:21 ` J.A. Magallon
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: J.A. Magallon @ 2003-08-29 10:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel


On 08.29, Jamie Lokier wrote:
> Dear All,
> 
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
> 

Huh? Are my PIIs really that good?

werewolf:~> gcc -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c
werewolf:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

werewolf:~> gcc -DHAVE_SYSV_SHM -march=pentium2 -O2 -fomit-frame-pointer -o vm-test vm-test.c
werewolf:~> vm-test
Test separation: 4096 bytes: pass
Test separation: 8192 bytes: pass
Test separation: 16384 bytes: pass
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: all sizes passed

werewolf:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 5
model name      : Pentium II (Deschutes)
stepping        : 2
cpu MHz         : 400.915
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips        : 799.53

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 5
model name      : Pentium II (Deschutes)
stepping        : 2
cpu MHz         : 400.915
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips        : 801.17


-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
  2003-08-29 10:03 ` J.A. Magallon
@ 2003-08-29 10:04 ` Sergey S. Kostyliov
  2003-08-29 10:15 ` J.A. Magallon
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 106+ messages in thread
From: Sergey S. Kostyliov @ 2003-08-29 10:04 UTC (permalink / raw)
  To: Jamie Lokier, linux-kernel

Hi Jamie,

On Friday 29 August 2003 09:35, Jamie Lokier wrote:
> Dear All,
>
> I'd appreciate if folks would run the program below on various
> machines, especially those whose caches aren't automatically coherent
> at the hardware level.
rathamahata@test rathamahata $ gcc -march=athlon-xp -mcpu=athlon-xp -fomit-frame-pointer -O2 -o test test.c
rathamahata@test rathamahata $ time ./test
Test separation: 4096 bytes: FAIL - too slow
Test separation: 8192 bytes: FAIL - too slow
Test separation: 16384 bytes: FAIL - too slow
Test separation: 32768 bytes: pass
Test separation: 65536 bytes: pass
Test separation: 131072 bytes: pass
Test separation: 262144 bytes: pass
Test separation: 524288 bytes: pass
Test separation: 1048576 bytes: pass
Test separation: 2097152 bytes: pass
Test separation: 4194304 bytes: pass
Test separation: 8388608 bytes: pass
Test separation: 16777216 bytes: pass
VM page alias coherency test: minimum fast spacing: 32768 (8 pages)

real	0m0.097s
user	0m0.091s
sys	0m0.006s
rathamahata@test rathamahata $ cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 8
model name	: AMD Athlon(tm) MP 2200+
stepping	: 0
cpu MHz		: 1800.967
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow
bogomips	: 3538.94

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 8
model name	: AMD Athlon(tm) Processor
stepping	: 0
cpu MHz		: 1800.967
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow
bogomips	: 3596.28


-- 
                   Best regards,
                   Sergey S. Kostyliov <rathamahata@php4.ru>
                   Public PGP key: http://sysadminday.org.ru/rathamahata.asc


* Re: x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
  2003-08-29  5:35 Jamie Lokier
@ 2003-08-29 10:03 ` J.A. Magallon
  2003-08-29 10:36   ` Alan Cox
  2003-09-01  4:49   ` Jamie Lokier
  2003-08-29 10:04 ` Sergey S. Kostyliov
                   ` (21 subsequent siblings)
  22 siblings, 2 replies; 106+ messages in thread
From: J.A. Magallon @ 2003-08-29 10:03 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel


On 08.29, Jamie Lokier wrote:
> Dear All,
[...]
> 
> I already got a surprise (to me): my Athlon MP is much slower
> accessing multiple mappings which are within 32k of each other, than
> mappings which are further apart, although it is coherent.  The L1
> data cache is 64k.  (The explanation is easy: virtually indexed,
> physically tagged cache moves data among cache lines, possibly via L2).
> 

Sorry if this is a stupid question, but have you heard about 64K-aliasing ?
We have seen it in P3/P4, do not know if Athlons also suffer it.
In short, x86 is crap. It slows like a dog when accessing two memory
positions separated by 2^n (address decoder has two 16 bits adders, instead
of 1 32 bits..., cache is 16 bit tagged, etc...)
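
A minimal sketch of what "64K-aliasing" would mean for a pair of addresses, assuming the effect depends only on the low 16 address bits as the name suggests. The helper name and the fixed 64 KiB period are illustrative assumptions, not measurements:

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed aliasing period, taken from the "64K-aliasing" name. */
#define ALIAS_PERIOD ((uintptr_t) 65536)

/* True when two addresses share the same offset within a 64 KiB window,
   i.e. their distance is a whole multiple of 64 KiB. */
static bool
same_64k_offset (uintptr_t a, uintptr_t b)
{
  return ((a ^ b) & (ALIAS_PERIOD - 1)) == 0;
}
```

Two buffers whose hot data gives `same_64k_offset' true would be the candidates for this slowdown, if the CPU really behaves this way.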

-- 
J.A. Magallon <jamagallon@able.es>      \                 Software is like sex:
werewolf.able.es                         \           It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.22-jam1m (gcc 3.3.1 (Mandrake Linux 9.2 3.3.1-1mdk))

^ permalink raw reply	[flat|nested] 106+ messages in thread

* x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this
@ 2003-08-29  5:35 Jamie Lokier
  2003-08-29 10:03 ` J.A. Magallon
                   ` (22 more replies)
  0 siblings, 23 replies; 106+ messages in thread
From: Jamie Lokier @ 2003-08-29  5:35 UTC (permalink / raw)
  To: linux-kernel

Dear All,

I'd appreciate if folks would run the program below on various
machines, especially those whose caches aren't automatically coherent
at the hardware level.

It searches for that address multiple which an application can use to
get coherent multiple mappings of shared memory, with good performance.

I want this information for three reasons:

	1. To check it correctly detects archs which page fault for
	   coherency or aren't coherent.
	2. To check the timing test is robust, both for 1. and for
	   detecting archs where the hardware is coherent but slows
	   down (see Athlon below).
	3. To check this is reliable enough to use at run time in an app.
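
For point 3, here is a rough sketch of how an application might turn the reported "minimum fast spacing" into a separation for its own double mapping. `choose_separation' is a hypothetical helper, not part of the program below, and it assumes that every whole multiple of the minimum spacing is as fast as the minimum itself, which the test only checks for powers of two:

```c
#include <stddef.h>

/* Hypothetical helper.  `min_fast_spacing' is the value the test
   reports; 0 means page aliasing should not be used at all.
   Returns the separation to map with, or 0 to fall back to copying. */
static size_t
choose_separation (size_t size, size_t min_fast_spacing)
{
  if (size == 0 || min_fast_spacing == 0)
    return 0;

  /* Round `size' up to a whole multiple of the spacing, so that the
     separation is at least `size' and keeps the same cache colour. */
  return (size + min_fast_spacing - 1) / min_fast_spacing * min_fast_spacing;
}
```

With a reported spacing of 32768, a 4096-byte buffer would be mapped 32768 bytes apart, and a 36864-byte buffer 65536 bytes apart.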

I already got a surprise (to me): my Athlon MP is much slower
accessing multiple mappings which are within 32k of each other, than
mappings which are further apart, although it is coherent.  The L1
data cache is 64k.  (The explanation is easy: virtually indexed,
physically tagged cache moves data among cache lines, possibly via L2).
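
The 32k figure can be sketched from the cache geometry. The 64k L1 size is from the measurement above; the 2-way associativity and the 4096-byte page size are assumptions added for illustration, and the struct and function names are made up for this example:

```c
#include <stddef.h>

/* Illustrative description of a set-associative cache. */
struct cache_geometry
{
  size_t size_bytes;   /* total capacity */
  unsigned ways;       /* associativity (assumed, see above) */
};

/* Address span covered by one way.  In a virtually indexed cache, two
   virtual addresses select the same set only when they are a multiple
   of this span apart. */
static size_t
way_span (struct cache_geometry c)
{
  return c.size_bytes / c.ways;
}

/* Number of distinct page "colours": how many pages fit in one way. */
static size_t
page_colours (struct cache_geometry c, size_t page_size)
{
  size_t span = way_span (c);
  return span > page_size ? span / page_size : 1;
}
```

Under those assumptions, `way_span' gives 32768 and `page_colours' gives 8, so aliases less than 32k apart land in different sets and the line has to move between them.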

This suggests scope for improving x86 kernel performance in the areas
of kmap() and shared library / executable mappings, by good choice of
_virtual_ addresses.  This doesn't require a cache colouring
page allocator, so maybe it's a new avenue?
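
As a sketch of what "good choice of virtual addresses" could look like: given an address hint, pick the next address that has the same cache colour as an existing mapping of the same page. `match_colour' is a hypothetical helper, and `span' stands for the minimum fast spacing (assumed to be a power of two):

```c
#include <stdint.h>

/* Return the smallest address >= `hint' that is congruent to `like'
   modulo `span'.  `span' must be a power of two. */
static uintptr_t
match_colour (uintptr_t hint, uintptr_t like, uintptr_t span)
{
  uintptr_t mask = span - 1;

  /* Keep the high bits of the hint, take the colour bits from `like'. */
  uintptr_t candidate = (hint & ~mask) | (like & mask);

  /* If that moved us below the hint, step up by one span. */
  if (candidate < hint)
    candidate += span;
  return candidate;
}
```

The result could be passed as the address argument of a fixed mapping, after reserving the range in the usual way.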

Anyway, please lots of people run the program and post the output +
/proc/cpuinfo.  Compile with optimisation, -O or -O2 is fine.  (You
can add -DHAVE_SYSV_SHM too if you like):

	gcc -o test test.c -O2
	time ./test
	cat /proc/cpuinfo

Thanks a lot :)
-- Jamie

==============================================================================

/* This code maps shared memory to multiple addresses and tests it
   for cache coherency and performance.

   Copyright (C) 1999, 2001, 2002, 2003 Jamie Lokier

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or (at
   your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program; if not, write to the Free Software
   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA */

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/time.h>

#if HAVE_SYSV_SHM
#include <sys/ipc.h>
#include <sys/shm.h>
#endif

//#include "pagealias.h"

/* Helpers to temporarily block all signals.  These are used for when a
   race condition might leave a temporary file that should have been
   deleted -- we do our best to prevent this possibility. */

static void
block_signals (sigset_t * save_state)
{
  sigset_t all_signals;
  sigfillset (&all_signals);
  sigprocmask (SIG_BLOCK, &all_signals, save_state);
}

static void
unblock_signals (sigset_t * restore_state)
{
  sigprocmask (SIG_SETMASK, restore_state, (sigset_t *) 0);
}

/* Open a new shared memory file, either using the POSIX.4 `shm_open'
   function, or using a regular temporary file in /tmp.  Immediately
   after opening the file, it is unlinked from the global namespace
   using `shm_unlink' or `unlink'.

   On success, the value returned is a file descriptor.  Otherwise, -1
   is returned and `errno' is set.

   The descriptor can be closed using simply `close'. */

/* Note: `shm_open' requires link argument `-lposix4' on Suns.
   On GNU/Linux with Glibc, it requires `-lrt'.  Unfortunately, Glibc's
   -lrt insists on linking to pthreads, which we may not want to use
   because that enables thread locking overhead in other functions.  So
   we implement a direct method of opening shm on Linux. */

/* If this is changed, change the size of `buffer' below too. */
#if HAVE_SHM_OPEN
#define SHM_DIR_PREFIX "/"      /* `shm_open' arg needs "/" for portability. */
#elif defined (__linux__)
#include <sys/statfs.h>
#define SHM_DIR_PREFIX "/dev/shm/"
#else
#undef  SHM_DIR_PREFIX
#endif

static int
open_shared_memory_file (int use_tmp_file)
{
  char * ptr, buffer [19];
  int fd, i;
  unsigned long number;
  sigset_t save_signals;
  struct timeval tv;

#if !HAVE_SHM_OPEN && defined (__linux__)
  struct statfs sfs;
  if (!use_tmp_file && (statfs (SHM_DIR_PREFIX, &sfs) != 0
			|| sfs.f_type != 0x01021994 /* SHMFS_SUPER_MAGIC */))
    {
      errno = ENOSYS;
      return -1;
    }
#endif

 loop:
  /* Print a randomised path name into `buffer'.  The string depends on
     the directory and whether we are using POSIX.4 shared memory or a
     regular temporary file.  RANDOM is a 5-digit, base-62
     representation of a pseudo-random number.  The string is used as a
     candidate in the search for an unused shared segment or file name. */
#ifdef SHM_DIR_PREFIX
  strcpy (buffer, use_tmp_file ? "/tmp/shm-" : SHM_DIR_PREFIX "shm-");
#else
  strcpy (buffer, "/tmp/shm-");
#endif
  ptr = buffer + strlen (buffer);
  gettimeofday (&tv, (struct timezone *) 0);
  number = (unsigned long) random ();
  number += (unsigned long) getpid ();
  number += (unsigned long) tv.tv_sec + (unsigned long) tv.tv_usec;
  for (i = 0; i < 5; i++)
    {
      /* Don't use character arithmetic, as not all systems are ASCII. */
      *ptr++ = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" [number % 62];
      number /= 62;
    }
  *ptr = '\0';

  /* Block signals between the open and unlink, to really minimise
     the chance of accidentally leaving an unwanted file around. */
  block_signals (&save_signals);
#if HAVE_SHM_OPEN
  if (!use_tmp_file)
    {
      fd = shm_open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	shm_unlink (buffer);
    }
  else
#endif /* HAVE_SHM_OPEN */
    {
      fd = open (buffer, O_RDWR | O_CREAT | O_EXCL, 0600);
      if (fd != -1)
	unlink (buffer);
    }
  unblock_signals (&save_signals);

  /* If we failed due to a name collision or a signal, try again. */
  if (fd == -1 && (errno == EEXIST || errno == EINTR || errno == EISDIR))
    goto loop;

  return fd;
}

/* Allocate a region of address space `size' bytes long, so that the
   region will not be allocated for any other purpose.  It is freed with
   `munmap'.

   Returns the mapped base address on success.  Otherwise, MAP_FAILED is
   returned and `errno' is set. */

static size_t system_page_size;

#if !defined (MAP_ANONYMOUS) && defined (MAP_ANON)
#define MAP_ANONYMOUS	MAP_ANON
#endif
#ifndef MAP_NORESERVE
#define MAP_NORESERVE	0
#endif
#ifndef MAP_FILE
#define MAP_FILE	0
#endif
#ifndef MAP_VARIABLE
#define MAP_VARIABLE	0
#endif
#ifndef MAP_FAILED
#define MAP_FAILED	((void *) -1)
#endif
#ifndef PROT_NONE
#define PROT_NONE	PROT_READ
#endif

static void *
map_address_space (void * optional_address, size_t size, int access)
{
  void * addr;
#ifdef MAP_ANONYMOUS
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_ANONYMOUS
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)), -1, (off_t) 0);
#else  /* not defined MAP_ANONYMOUS */
  int save_errno, zero_fd = open ("/dev/zero", O_RDONLY);
  if (zero_fd == -1)
    return MAP_FAILED;
  addr = mmap (optional_address, size,
	       access ? (PROT_READ | PROT_WRITE) : PROT_NONE,
	       (MAP_PRIVATE | MAP_FILE
		| (optional_address ? MAP_FIXED : MAP_VARIABLE)
		| (access ? 0 : MAP_NORESERVE)), zero_fd, (off_t) 0);
  save_errno = errno;
  close (zero_fd);
  errno = save_errno;
#endif /* not defined MAP_ANONYMOUS */
  return addr;
}

/* Set up a page alias mapping using mmap() on POSIX shared memory or on
   a temporary regular file.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set. */

static void *
page_alias_using_mmap (size_t size, size_t separation, int use_tmp_file)
{
  void * base_addr, * addr;
  int fd, i, save_errno;
  struct stat st;

  fd = open_shared_memory_file (use_tmp_file);
  if (fd == -1)
    goto fail;

  /* First, resize the shared memory file to the desired size. */
  if (ftruncate (fd, size) != 0 || fstat (fd, &st) != 0 || st.st_size != size)
    goto close_fail;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map
     this with the aliased buffer. */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto close_fail;

  /* Map the same shared memory repeatedly, at different addresses. */
  for (i = 0; i < 2; i++)
    {
      addr = mmap ((char *) base_addr + (i ? separation : 0), size,
		   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FILE | MAP_FIXED,
		   fd, (off_t) 0);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen. */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }
  if (close (fd) != 0)
    goto unmap_fail;

  /* Success! */
  return base_addr;

  /* Failure. */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 close_fail:
  save_errno = errno;
  close (fd);
  errno = save_errno;
 fail:
  return 0;
}

/* Set up a page alias mapping using SYSV IPC shared memory.

   Returns the mapped base address on success.  Otherwise, 0 is returned
   and `errno' is set. */

#if HAVE_SYSV_SHM

static void *
page_alias_using_sysv_shm (size_t size, size_t separation)
{
  void * base_addr, * addr;
  sigset_t save_signals;
  int shmid, i, save_errno;

  /* Map an anonymous region `separation + size' bytes long.  This is how
     we allocate sufficient contiguous address space.  We over-map
     this with the aliased buffer. */
  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Block signals between the shmget() and IPC_RMID, to minimise the chance
     of accidentally leaving an unwanted shared segment around. */
  block_signals (&save_signals);

  shmid = shmget (IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
  if (shmid == -1)
    goto unmap_fail;

  /* Map the same shared memory repeatedly, at different addresses. */
  for (i = 0; i < 2; i++)
    {
      /* `shmat' is tried twice.  The first time it can fail if the local
	 implementation of `shmat' refuses to map over a region mapped
	 with `mmap'.  In that case, we punch a hole using `munmap' and
	 do it again.

	 If the local `shmat' has this property, the `shmat' calls
	 to fixed addresses might collide with a concurrent thread
	 which is also doing mappings, and will fail.  At least it
	 is a safe failure.

	 On the other hand, if the local `shmat' can map over
	 already-mapped regions (in the same way that `mmap' does), it
	 is essential that we do actually use an already-mapped region,
	 so that collisions with a concurrent thread can't possibly
	 result in both of us grabbing the same address range with no
	 indication of error. */
      addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
      if (addr == (void *) -1 && errno == EINVAL)
	{
	  munmap ((char *) base_addr + (i ? separation : 0), size);
	  addr = shmat (shmid, (char *) base_addr + (i ? separation : 0), 0);
	}

      /* Check for errors. */
      if (addr == (void *) -1)
	{
	  save_errno = errno;
	  if (i == 1)
	    shmdt (base_addr);
	  goto remove_shm_fail_se;
	}
      else if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `shmat' ignored the requested address! */
	  if (i == 1)
	    shmdt (base_addr);
	  save_errno = EINVAL;
	  goto remove_shm_fail_se;
	}
    }
		    
  if (shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0) != 0)
    goto remove_shm_fail;
  unblock_signals (&save_signals);

  /* Success! */
  return base_addr;

  /* Failure. */
 remove_shm_fail:
  save_errno = errno;
 remove_shm_fail_se:
  while (--i >= 0)
    shmdt ((char *) base_addr + (i ? separation : 0));
  shmctl (shmid, IPC_RMID, (struct shmid_ds *) 0);
  errno = save_errno;
 unmap_fail:
  save_errno = errno;
  unblock_signals (&save_signals);
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

#endif /* HAVE_SYSV_SHM */

/* Map a page-aliased ring buffer.  Shared memory of size `size' is
   mapped twice, with the difference between the two addresses being
   `separation', which must be at least `size'.  The total address range
   used is `separation + size' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `__page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set. */

static void *
__page_alias_map (size_t size, size_t separation, int * method)
{
  void * addr;
  if (((size | separation) & (system_page_size - 1)) != 0 || size > separation)
    {
      errno = EINVAL;
      return 0;
    }

  /* Try these strategies in turn: POSIX shm_open(), SYSV IPC, regular file. */
#ifdef SHM_DIR_PREFIX
  *method = 0;
  if ((addr = page_alias_using_mmap (size, separation, 0)) != 0)
    return addr;
#endif
#if HAVE_SYSV_SHM
  *method = 1;
  if ((addr = page_alias_using_sysv_shm (size, separation)) != 0)
    return addr;
#endif
  *method = 2;
  return page_alias_using_mmap (size, separation, 1);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `__page_alias_map'.  `address' is the base address, and `size' and
   `separation' are the arguments previously passed to
   `__page_alias_map'.  `method' is the value previously stored in *METHOD.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set. */

static int
__page_alias_unmap (void * address, size_t size, size_t separation, int method)
{
#if HAVE_SYSV_SHM
  if (method == 1)
    {
      shmdt (address);
      shmdt ((char *) address + separation);
      if (separation > size)
	munmap ((char *) address + size, separation - size);
      return 0;
    }
#endif

  return munmap (address, separation + size);
}

/* Map a page-aliased ring buffer.  `size' is the size of the buffer to
   create; it will be mapped twice to cover a total address range
   `size * 2' bytes long.

   On success, *METHOD is filled with a number which must be passed to
   `page_alias_unmap', and the mapped base address is returned.
   Otherwise, 0 is returned and `errno' is set. */

void *
page_alias_map (size_t size, int * method)
{
  return __page_alias_map (size, size, method);
}

/* Unmap a page-aliased ring buffer previously allocated by
   `page_alias_map'.  `address' is the base address, and `size' is the
   size of the buffer (which is half of the total mapped address range).
   `method' is a value previously stored in *METHOD by `page_alias_map'.

   Returns 0 on success.  Otherwise, -1 is returned and `errno' is set. */

int
page_alias_unmap (void * address, size_t size, int method)
{
  return __page_alias_unmap (address, size, size, method);
}

/* Map some memory which is not aliased, for timing comparisons against
   aliased pages.  We use a combination of mappings similar to
   page_alias_*(), in case there are resource limitations which would
   prevent malloc() or a single mmap() working for the larger address
   range tests. */

static void *
page_no_alias (size_t size, size_t separation)
{
  void * base_addr, * addr;
  int i, save_errno;

  if ((base_addr = map_address_space (0, separation + size, 0)) == MAP_FAILED)
    goto fail;

  /* Map anonymous memory at the different addresses. */
  for (i = 0; i < 2; i++)
    {
      addr = map_address_space ((char *) base_addr + (i ? separation : 0),
				size, 1);
      if (addr == MAP_FAILED)
	goto unmap_fail;
      if (addr != (char *) base_addr + (i ? separation : 0))
	{
	  /* `mmap' ignored MAP_FIXED!  Should never happen. */
	  munmap (addr, size);
	  save_errno = EINVAL;
	  goto unmap_fail_se;
	}
    }

  /* Success! */
  return base_addr;

  /* Failure. */
 unmap_fail:
  save_errno = errno;
 unmap_fail_se:
  munmap (base_addr, separation + size);
  errno = save_errno;
 fail:
  return 0;
}

/* This should be a word size that the architecture can read and write
   fast in a single instruction.  In principle, C's `int' is the natural
   word size, but in practice it isn't on 64-bit machines. */

#define WORD long

/* These GCC-specific asm statements force values into registers, and
   also act as compiler memory barriers.  These are used to force a
   group of write/write/read instructions as close together as possible,
   to maximise the detection of store buffer conditions.  Despite being
   asm statements, these will work with any of GCC's target architectures,
   provided they have >= 4 registers. */

#if __GNUC__ >= 3
#define __noinline __attribute__ ((__noinline__))
#else
#define __noinline
#endif

#ifdef __GNUC__
#define force_into_register(var) \
  __asm__ ("" : "=r" (var) : "0" (var) : "memory")
#define force_into_registers(var1, var2, var3, var4) \
  __asm__ ("" : "=r" (var1), "=r" (var2), "=r" (var3), "=r" (var4) \
	   : "0" (var1), "1" (var2), "2" (var3), "3" (var4) : "memory")
#else
#define force_into_register(var) do {} while (0)
#define force_into_registers(var1, var2, var3, var4) do {} while (0)
#endif

/* This function tries to test whether a CPU snoops its store buffer for
   reads within a few instructions, and ignores virtual to physical
   address translations when doing that.  In principle a CPU might do
   this even if its L1 cache is physically tagged or indexed, although
   I have not seen such a system.  (A CPU which uses store buffer
   snooping and with an off-board MMU, which the CPU is unaware of,
   could have this property).

   It isn't possible to do this test perfectly; we do our best.  The
   `force_into_register' macros ensure that the write/write/read
   sequence is as compact as the compiler can make it. */

static WORD __noinline
test_store_buffer_snoop (volatile WORD * ptr1, volatile WORD * ptr2)
{
  register volatile WORD * __regptr1 = ptr1, * __regptr2 = ptr2;
  register WORD __reg1 = 1, __reg2 = 0;
  force_into_registers (__reg1, __reg2, __regptr1, __regptr2);
  *__regptr1 = __reg1;
  *__regptr2 = __reg2;
  __reg1 = *__regptr1;
  force_into_register (__reg1);
  return __reg1;
}

/* This function tests whether writes to one page are seen in another
   page at a different virtual address, and whether they are nearly as
   fast as normal writes.

   The accesses are timed by the caller of this function.
   Alternate writes go to alternate pages, so that if aliasing is
   implemented using page faults, it will clearly show up in the
   timings. */

static int __noinline
test_page_alias (volatile WORD * ptr1, volatile WORD * ptr2, int timing_loops)
{
  WORD fail = 0;
  while (--timing_loops >= 0)
    fail |= test_store_buffer_snoop (ptr1, ptr2);
  return fail != 0;
}

/* This function tests L1 cache coherency without checking for store
   buffer snoop coherency.  To do this, we add delays after each store
   to allow the store buffer to drain.  The result of this function is
   not important: it is only used in a diagnostic message. */

static int __noinline
test_l1_only (volatile WORD * ptr1, volatile WORD * ptr2)
{
  static volatile WORD dummy;
  int i, j;
  WORD fail = 0;
  for (i = 0; i < 10; i++)
    {
      *ptr1 = 1;
      for (j = 0; j < 1000; j++) /* Dummy volatile writes for delay. */
	dummy = 0;
      *ptr2 = 0;
      for (j = 0; j < 1000; j++) /* Dummy volatile writes for delay. */
	dummy = 0;
      fail |= *ptr1;
    }
  return fail != 0;
}

/* Thoroughly test a pair of aliased pages with a fixed address
   separation, to see if they really behave like memory appearing at two
   locations, and efficiently.  We search through different values of
   `separation' searching for a suitable "cache colour" on this machine. */

static inline const char *
test_one_separation (size_t separation)
{
  void * buffers [2];
  long timings [3];
  int i, method, timing_loops = 64;

  /* We measure timings of 3 different tests, each 128 times to find the
     minimum.  0: Writes and reads to aliased pages.  1: Writes and
     reads to non-aliased pages, to compare with 0.  2: Doing nothing,
     to measure the time for `gettimeofday' itself.

     The measurements are done in a mixed up order.  If we did 128
     measurements of type 0, then 128 of type 1, then 128 of type 2, the
     results could be misleading due to synchronisation with other
     processes occurring on the machine. */

  /* A previously generated random shuffle of bit-pairs.  Each pair is a
     number from the set {0,1,2}.  Each number occurs exactly 128 times. */
  static const unsigned char pattern [96] =
    {
      0x64, 0x68, 0x9a, 0x86, 0x42, 0x10, 0x90, 0x81, 0x58, 0x91, 0x18, 0x56,
      0x12, 0x44, 0x64, 0x89, 0x29, 0xa9, 0x96, 0x05, 0x61, 0x80, 0x82, 0x49,
      0x02, 0x16, 0x89, 0x12, 0x9a, 0x45, 0x41, 0x12, 0xa9, 0xa6, 0x01, 0x99,
      0x88, 0x80, 0x94, 0x20, 0x86, 0x29, 0x29, 0x1a, 0xa5, 0x46, 0x66, 0x25,
      0x42, 0x20, 0xa4, 0x81, 0x20, 0x81, 0x50, 0x44, 0x01, 0x06, 0xa5, 0x19,
      0x4a, 0x56, 0x28, 0x89, 0x88, 0x14, 0x94, 0x88, 0x1a, 0xa4, 0x95, 0x15,
      0x82, 0x99, 0x84, 0x64, 0x52, 0x56, 0x69, 0x64, 0x00, 0x95, 0x9a, 0x89,
      0x48, 0x01, 0x58, 0x88, 0x60, 0xa6, 0x29, 0x06, 0x64, 0xa0, 0x56, 0x85,
    };

  buffers [0] = __page_alias_map (system_page_size, separation, &method);
  if (buffers [0] == 0)
    return "alias map failed";
  buffers [1] = page_no_alias (system_page_size, separation);
  if (buffers [1] == 0)
    {
      __page_alias_unmap (buffers [0], system_page_size, separation, method);
      return "non-alias map failed";
    }

 retry:
  timings [2] = timings [1] = timings [0] = LONG_MAX;
  for (i = 0; i < 384; i++)
    {
      struct timeval time_before, time_after;
      long time_delta;
      int fail = 0, which_test = (pattern [i >> 2] >> ((i & 3) << 1)) & 3;
      volatile WORD * ptr1 = (volatile WORD *) buffers [which_test];
      volatile WORD * ptr2 = (volatile WORD *) ((char *) ptr1 + separation);

      /* Test whether writes to one page appear immediately in the other,
	 and time how long the memory accesses take. */
      gettimeofday (&time_before, (struct timezone *) 0);
      if (which_test < 2)
	fail = test_page_alias (ptr1, ptr2, timing_loops);
      gettimeofday (&time_after, (struct timezone *) 0);
	      
      if (fail && which_test == 0)
	{
	  /* Test whether the failure is due to a store buffer bypass
	     which ignores virtual address translation. */
	  int l1_fail = test_l1_only (ptr1, ptr2);
	  __page_alias_unmap (buffers [0], system_page_size, separation,
			      method);
	  munmap (buffers [1], separation + system_page_size);
	  return l1_fail ? "cache not coherent" : "store buffer not coherent";
	}

      time_delta = ((time_after.tv_usec - time_before.tv_usec)
		    + 1000000 * (time_after.tv_sec - time_before.tv_sec));

      /* Find the smallest time taken for each test.  Ignore negative
	 glitches due to Linux' tendency to jump the clock backwards. */
      if (time_delta >= 0 && time_delta < timings [which_test])
	timings [which_test] = time_delta;
    }

  /* Remove the cost of `gettimeofday()' itself from measurements. */
  timings [0] -= timings [2];
  timings [1] -= timings [2];

  /* Keep looping until at least one measurement becomes significant.  A
     very fast CPU will show measurements of zero microseconds for
     smaller values of `timing_loops'.  Also loop until the cost of
     `gettimeofday()' becomes insignificant.  When the program is run
     under `strace', the latter is large, and this is needed to stabilise
     the results. */
  if (timings [0] <= 10 * (1 + timings [2])
      && timings [1] <= 10 * (1 + timings [2]))
    {
      timing_loops <<= 1;
      goto retry;
    }

  __page_alias_unmap (buffers [0], system_page_size, separation, method);
  munmap (buffers [1], separation + system_page_size);

  /* Reject page aliasing if it is much slower than accessing a single,
     definitely cached page directly. */
  if (timings [0] > 2 * timings [1])
    return "too slow";

  /* Success!  Passed all tests for these parameters. */
  return 0;
}

size_t page_alias_smallest_size;

void
page_alias_init (void)
{
  size_t size;

#ifdef _SC_PAGESIZE
  system_page_size = sysconf (_SC_PAGESIZE);
#elif defined (_SC_PAGE_SIZE)
  system_page_size = sysconf (_SC_PAGE_SIZE);
#else
  system_page_size = getpagesize ();
#endif

  for (size = system_page_size; size <= 16 * 1024 * 1024; size *= 2)
    {
      const char * reason = test_one_separation (size);

      printf ("Test separation: %lu bytes: %s%s\n",
	      (unsigned long) size, reason ? "FAIL - " : "pass",
	      reason ? reason : "");

      /* This logic searches for the smallest _contiguous_ range
	 of page sizes for which `test_one_separation' passes. */
      if (reason == 0 && page_alias_smallest_size == 0)
	page_alias_smallest_size = size;
      else if (reason != 0 && page_alias_smallest_size != 0)
	{
	  /* Fail, indicating that page-aliasing is not reliable,
	     because there's a maximum size.  We don't support that as
	     it seems quite unlikely given our model of cache colouring. */
	  page_alias_smallest_size = 0;
	  break;
 	}
    }

  printf ("VM page alias coherency test: ");

  if (page_alias_smallest_size == 0)
    printf ("failed; will use copy buffers instead\n");
  else if (page_alias_smallest_size == system_page_size)
    printf ("all sizes passed\n");
  else
    printf ("minimum fast spacing: %lu (%lu page%s)\n",
	    (unsigned long) page_alias_smallest_size,
	    (unsigned long) (page_alias_smallest_size / system_page_size),
	    (page_alias_smallest_size == system_page_size) ? "" : "s");
}

//#ifdef TEST_PAGEALIAS
int
main (void)
{
  page_alias_init ();
  return 0;
}
//#endif
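
For readers wondering what the double mapping is for: the comments call it a "page-aliased ring buffer". Below is a minimal sketch of that use, assuming a Linux-like system with a writable /tmp and coherent aliases. It is not the code above: it only uses the temporary-file strategy, fixes the separation to `size' (as `page_alias_map' does), and the names `map_ring_twice' and `ring_wraps' are made up for this example.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Map `size' bytes of an unlinked temporary file twice, back to back.
   `size' must be a multiple of the page size.
   Returns the base address, or NULL on failure. */
static char *
map_ring_twice (size_t size)
{
  char path[] = "/tmp/ring-XXXXXX";
  char *base;
  int fd = mkstemp (path);

  if (fd == -1)
    return NULL;
  unlink (path);
  if (ftruncate (fd, (off_t) size) != 0)
    {
      close (fd);
      return NULL;
    }

  /* Reserve 2 * size of address space, then overlay both halves with
     the same file pages. */
  base = mmap (NULL, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    {
      close (fd);
      return NULL;
    }
  if (mmap (base, size, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED
      || mmap (base + size, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
    {
      munmap (base, 2 * size);
      close (fd);
      return NULL;
    }
  close (fd);
  return base;
}

/* Write a message that straddles the end of the ring with one memcpy,
   then check that its tail shows up at the start of the ring.
   Returns 1 if it wrapped, 0 if not, -1 if the mapping failed. */
static int
ring_wraps (void)
{
  size_t size = (size_t) sysconf (_SC_PAGESIZE);
  char *ring = map_ring_twice (size);
  int ok;

  if (ring == NULL)
    return -1;
  memcpy (ring + size - 3, "abcdef", 6);
  ok = memcmp (ring, "def", 3) == 0
       && memcmp (ring + size - 3, "abc", 3) == 0;
  munmap (ring, 2 * size);
  return ok;
}
```

The point of the design is that a reader or writer never has to split a copy at the wrap point. Whether that is also fast depends on the separation, which is what the test program measures.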

^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2003-09-07 17:57 UTC | newest]

Thread overview: 106+ messages
     [not found] <20030829053510.GA12663@mail.jlokier.co.uk.suse.lists.linux.kernel>
2003-08-29 11:08 ` x86, ARM, PARISC, PPC, MIPS and Sparc folks please run this Andi Kleen
2003-08-29 11:17   ` Russell King
2003-09-01  5:03   ` Jamie Lokier
2003-08-29  5:35 Jamie Lokier
2003-08-29 10:03 ` J.A. Magallon
2003-08-29 10:36   ` Alan Cox
2003-09-01  4:49   ` Jamie Lokier
2003-08-29 10:04 ` Sergey S. Kostyliov
2003-08-29 10:15 ` J.A. Magallon
2003-08-29 10:21 ` J.A. Magallon
2003-08-29 10:34 ` CaT
2003-08-29 10:37 ` CaT
2003-08-29 10:49 ` Mikael Pettersson
2003-08-29 11:41 ` Gianni Tedesco
2003-08-29 11:51 ` James Morris
2003-08-29 15:41 ` Larry McVoy
2003-08-29 23:05   ` Mike Fedyk
2003-08-31  5:10     ` David S. Miller
2003-08-31 22:49       ` Jamie Lokier
2003-09-01  5:31         ` David S. Miller
2003-09-01  6:42           ` Jamie Lokier
2003-09-01  7:06             ` David S. Miller
2003-09-01  8:29               ` Jamie Lokier
2003-09-01  9:02                 ` David S. Miller
2003-09-01 10:04                   ` Jamie Lokier
2003-09-01 10:02                     ` David S. Miller
2003-09-03 17:36                   ` bill davidsen
2003-09-04 22:50                     ` Jamie Lokier
2003-09-01  5:44   ` Jamie Lokier
2003-09-01 14:43     ` Larry McVoy
2003-09-01 16:33       ` Jamie Lokier
2003-09-01 16:58         ` Larry McVoy
2003-09-02 20:29       ` Jamie Lokier
2003-08-29 15:47 ` Herbert Poetzl
2003-08-30  1:48   ` Stuart Longland
2003-08-29 16:27 ` Geert Uytterhoeven
2003-09-01  5:58   ` Jamie Lokier
2003-09-01  8:34     ` Geert Uytterhoeven
2003-09-01  9:09       ` Kars de Jong
2003-09-01 10:08         ` Jamie Lokier
2003-09-01 11:13           ` Roman Zippel
2003-09-02 20:42           ` Kars de Jong
2003-09-02 21:39             ` Jamie Lokier
2003-09-03  7:59             ` Geert Uytterhoeven
2003-09-03  9:13               ` Jamie Lokier
2003-09-03  9:26                 ` Geert Uytterhoeven
2003-09-03 12:17                   ` Roman Zippel
2003-09-03 12:36                     ` Geert Uytterhoeven
2003-09-03 13:29                       ` Jamie Lokier
2003-09-03 16:07                         ` Nagendra Singh Tomar
2003-09-04  5:03                           ` Davide Libenzi
2003-09-03 18:03                             ` Nagendra Singh Tomar
2003-09-04  6:38                               ` Davide Libenzi
2003-09-04 11:19                           ` Alan Cox
2003-09-05 21:24                             ` Pavel Machek
2003-09-06 23:09                               ` Jamie Lokier
2003-09-07 13:10                                 ` Pavel Machek
2003-09-07 13:35                                   ` Jamie Lokier
2003-09-07 13:40                                     ` Pavel Machek
2003-09-07 13:53                                       ` Jamie Lokier
2003-09-07 17:56                                         ` Alan Cox
2003-09-03 12:13               ` Jan-Benedict Glaw
2003-09-01 10:35       ` Sam Creasey
2003-09-01 10:48         ` Jamie Lokier
2003-09-01 12:23           ` Sam Creasey
2003-09-03  8:00       ` Kars de Jong
2003-09-03  8:05         ` Geert Uytterhoeven
2003-09-03  9:24           ` Kars de Jong
2003-08-29 16:31 ` Brian Jackson
2003-08-29 17:39 ` Matt Porter
2003-09-01  6:00   ` Jamie Lokier
2003-09-01 11:17     ` Alan Cox
2003-09-01 17:22     ` Roland Dreier
2003-09-02  2:16       ` Matt Porter
2003-09-02  5:40         ` Jamie Lokier
2003-08-29 19:37 ` Thorsten Kranzkowski
2003-08-29 20:03 ` Sean Neakums
2003-08-29 20:14 ` Iulian Musat
2003-08-29 20:26 ` Paul J.Y. Lahaie
2003-09-01  8:15   ` Russell King
2003-09-01 10:12     ` Jamie Lokier
2003-09-01 11:30       ` Geert Uytterhoeven
2003-09-01 14:17       ` Russell King
2003-09-01 14:51         ` Russell King
2003-09-01 19:09           ` Guennadi Liakhovetski
2003-09-01 16:52         ` Jamie Lokier
2003-09-01 17:11           ` Russell King
2003-09-02  5:34             ` Jamie Lokier
2003-09-02  8:15               ` Russell King
2003-09-02 11:57                 ` Jamie Lokier
2003-09-02 18:52                   ` Russell King
2003-09-02 23:59                     ` Larry McVoy
2003-09-03  7:31                       ` Russell King
2003-09-03  7:41                         ` Jamie Lokier
2003-09-03 18:05                           ` Russell King
2003-09-04 22:20                             ` Jamie Lokier
2003-09-04 17:37       ` Maciej W. Rozycki
2003-08-29 22:35 ` Kenneth Johansson
2003-08-29 23:47 ` Kurt Wall
2003-09-01  0:24 ` Paul Mundt
2003-09-01  0:37   ` Jamie Lokier
2003-09-01  1:00     ` Paul Mundt
2003-09-01  1:58       ` Jamie Lokier
2003-09-01  1:13 ` dean gaudet
2003-09-01  4:29   ` Jamie Lokier
2003-09-02 10:08 ` Jan Rychter
