* Re: C8000 cpu upgrade problem
       [not found] <20101024020337.725094D30@hiauly1.hia.nrc.ca>
@ 2010-10-24  3:03 ` Mikulas Patocka
  2010-10-24  3:43   ` Kyle McMartin
  2010-10-24  4:01   ` C8000 cpu upgrade problem John David Anglin
  0 siblings, 2 replies; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-24  3:03 UTC (permalink / raw)
  To: John David Anglin; +Cc: linux-parisc



On Sat, 23 Oct 2010, John David Anglin wrote:

> > I'm still thinking the processor module shown above is the base
> > model with 3 MB L1 and no L2, and it's not consistent with upgrade
> > module.
> 
> I just noticed the following wording in the online specifications:
> 
> 1 or 2 dual-core PA-8800 or PA-8900 processors
>   (2-way 900 MHz PA-8800 with 3 MB L1 cache,
>    2 or 4-way 900 MHz or 1 GHz PA-8800 3 MB L1 and 32 MB L2 cache or
>    2 or 4-way 1.1 GHz PA-8900 with 3 MB L1 cache and 64 MB L2 cache)
> 
> Note the base model with no L2 appears to be only 2-way.  If it
> can be upgraded, I think you would need a AB665A kit.  Kits with
> L2 cache seem more common.  To do this right, you need to start
> with the specific workstation model number.
> 
> See HP PartSurfer.
> 
> I also found a QuickSpecs document with PA-8900 processors.  It looks
> like the 2-way base is not upgradeable as the second processor options
> only have 64 MB L2.  The document lists a 2nd processor as an after-market
> option (AB675A or AB676A).  There is a note that the second processor
> must be the same as the first.
> 
> Dave
> -- 
> J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
> National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

I tried to measure the cache size. Sequential memory reads showed a cutoff
at 700kB and no cutoff at 32MB: 1.7GB/s below 700kB and 612MB/s above.
Latency measurements (chasing a pointer chain) showed a drastic cutoff at
700kB (from 3ns to 300ns) and no cutoff at 32MB.
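
For readers who want to repeat the latency measurement, here is a minimal
sketch of a pointer-chasing test. It is not the program used above; the
64-byte line size (LINE) and the names build_cycle, cycle_length and
chase_ns are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64 /* assumed cache-line size in bytes; adjust for the real CPU */

/* One node per (assumed) cache line, so every hop touches a new line. */
struct node {
    struct node *next;
    char pad[LINE - sizeof(struct node *)];
};

/* Link n nodes into one random cycle (Sattolo's algorithm), so a lap
 * visits every node exactly once in an order that is hard to prefetch. */
static void build_cycle(struct node *nodes, size_t n)
{
    assert(n >= 1);
    size_t *perm = malloc(n * sizeof *perm);
    assert(perm != NULL);
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i */
        size_t tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[perm[i]];
    free(perm);
}

/* Number of hops needed to get back to the starting node. */
static size_t cycle_length(const struct node *start)
{
    size_t len = 1;
    for (const struct node *p = start->next; p != start; p = p->next)
        len++;
    return len;
}

/* Average nanoseconds per dependent load over a working set of `bytes`. */
static double chase_ns(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(struct node);
    assert(n >= 1 && steps >= 1);
    struct node *nodes = malloc(n * sizeof *nodes);
    assert(nodes != NULL);
    build_cycle(nodes, n);

    struct timespec t0, t1;
    struct node *volatile sink;   /* keeps the loop from being optimized away */
    struct node *p = nodes;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        p = p->next;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = p;
    (void)sink;
    free(nodes);
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
            (double)(t1.tv_nsec - t0.tv_nsec)) / (double)steps;
}
```

Calling chase_ns() for a range of working-set sizes (say 64kB up to 64MB)
and looking for the size at which the per-load time jumps should locate a
cache boundary in the same way as the 700kB cutoff described above.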

It may be that the lack of L2 cache is the reason why the CPUs don't 
support multiprocessing ... I might buy two better CPUs if I actually had a
guarantee that the machine isn't locked (I don't want to waste more money
just to find out that the firmware lock doesn't go away).

Mikulas


* Re: C8000 cpu upgrade problem
  2010-10-24  3:03 ` C8000 cpu upgrade problem Mikulas Patocka
@ 2010-10-24  3:43   ` Kyle McMartin
  2010-10-26  2:16     ` PA caches (was: C8000 cpu upgrade problem) Mikulas Patocka
  2010-10-24  4:01   ` C8000 cpu upgrade problem John David Anglin
  1 sibling, 1 reply; 27+ messages in thread
From: Kyle McMartin @ 2010-10-24  3:43 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: John David Anglin, linux-parisc

On Sun, Oct 24, 2010 at 05:03:25AM +0200, Mikulas Patocka wrote:
> I tried to measure the cache size. Sequential memory reads showed a cutoff
> at 700kB and no cutoff at 32MB: 1.7GB/s below 700kB and 612MB/s above.
> Latency measurements (chasing a pointer chain) showed a drastic cutoff at
> 700kB (from 3ns to 300ns) and no cutoff at 32MB.
> 
> It may be that the lack of L2 cache is the reason why the CPUs don't 
> support multiprocessing ... I might buy two better CPUs if I actually had a
> guarantee that the machine isn't locked (I don't want to waste more money
> just to find out that the firmware lock doesn't go away).
> 

FWIW, I'd recommend running in non-SMP mode on pa8800/8900 anyway, as
our cache flushing is a bit... suboptimal right now (doing whole cache
flushes on fork and such.) Which, coupled with the gigantic caches on
those cpus which must be flushed just tanks performance.

I've been working on cleaning up jejb's patchset from back in the
bitkeeper days to properly do deferred flushing, but time is constantly
against me (sigh, I don't think I've even powered on my C8000 in a few
years now... explains why I didn't catch your e1000 issue there. :)

--Kyle


* Re: C8000 cpu upgrade problem
  2010-10-24  3:03 ` C8000 cpu upgrade problem Mikulas Patocka
  2010-10-24  3:43   ` Kyle McMartin
@ 2010-10-24  4:01   ` John David Anglin
  2010-10-26  2:04     ` Mikulas Patocka
  1 sibling, 1 reply; 27+ messages in thread
From: John David Anglin @ 2010-10-24  4:01 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-parisc

> It may be that the lack of L2 cache is the reason why the CPUs don't 
> support multiprocessing ... I might buy two better CPUs if I actually had a
> guarantee that the machine isn't locked (I don't want to waste more money
> just to find out that the firmware lock doesn't go away).

Are you sure the part numbers for the two processor modules that you
have are the same?  Parts with cache seem much more common.  There
also seem to be quite a few obsolete parts.

It might be that Linux would work better without the L2 cache.  There
are some cache coherency issues that haven't been resolved in SMP.
These problems are aggravated by the L2 cache, which takes a long
time to flush.

It's just not clear that your machine is locked.  The c8000 model
name doesn't change depending on number of processors.  If you search
on rp3410 processor upgrade, you will find that a processor update
license is needed to go from one to two processors.  This is clear
in the documentation.  I couldn't find anything similar for c8000.
Indeed, there are many indications that an after-market processor
update is possible for it.

Good luck,
Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)


* Re: C8000 cpu upgrade problem
  2010-10-24  4:01   ` C8000 cpu upgrade problem John David Anglin
@ 2010-10-26  2:04     ` Mikulas Patocka
  0 siblings, 0 replies; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-26  2:04 UTC (permalink / raw)
  To: John David Anglin; +Cc: linux-parisc

On Sun, 24 Oct 2010, John David Anglin wrote:

> > It may be that the lack of L2 cache is the reason why the CPUs don't 
> > support multiprocessing ... I might buy two better CPUs if I actually had a
> > guarantee that the machine isn't locked (I don't want to waste more money
> > just to find out that the firmware lock doesn't go away).
> 
> Are you sure the part numbers for the two processor modules that you
> have are the same?  Parts with cache seem much more common.  There
> also seem to be quite a few obsolete parts.

They are the same (I posted the version numbers written by PDC). But as 
you noted, the versions without L2 cache may not be smp aware.

> It might be that Linux would work better without the L2 cache.

What is the exact problem with L2 cache? Is it virtually indexed too?

> There are some cache coherency issues that haven't been resolved in
> SMP.

What exactly do you mean?

> These problems are aggravated by the L2 cache, which takes a long
> time to flush.
> 
> It's just not clear that your machine is locked.  The c8000 model
> name doesn't change depending on number of processors.  If you search
> on rp3410 processor upgrade, you will find that a processor update
> license is needed to go from one to two processors.  This is clear
> in the documentation.  I couldn't find anything similar for c8000.
> Indeed, there are many indications that an after-market processor
> update is possible for it.

I don't know. Before I buy two CPUs to get a quad-core system, I'd like to 
make sure it isn't locked. Can the lock be detected somehow?

Mikulas

> Good luck,
> Dave
> -- 
> J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
> National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)
> 


* PA caches (was: C8000 cpu upgrade problem)
  2010-10-24  3:43   ` Kyle McMartin
@ 2010-10-26  2:16     ` Mikulas Patocka
  2010-10-26  3:04       ` Kyle McMartin
  2010-12-18 20:13       ` PA caches (was: C8000 cpu upgrade problem) John David Anglin
  0 siblings, 2 replies; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-26  2:16 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, linux-parisc



On Sat, 23 Oct 2010, Kyle McMartin wrote:

> On Sun, Oct 24, 2010 at 05:03:25AM +0200, Mikulas Patocka wrote:
> > I tried to measure the cache size. Sequential memory reads showed a cutoff
> > at 700kB and no cutoff at 32MB: 1.7GB/s below 700kB and 612MB/s above.
> > Latency measurements (chasing a pointer chain) showed a drastic cutoff at
> > 700kB (from 3ns to 300ns) and no cutoff at 32MB.
> > 
> > It may be that the lack of L2 cache is the reason why the CPUs don't 
> > support multiprocessing ... I might buy two better CPUs if I actually had a
> > guarantee that the machine isn't locked (I don't want to waste more money
> > just to find out that the firmware lock doesn't go away).
> > 
> 
> FWIW, I'd recommend running in non-SMP mode on pa8800/8900 anyway, as

I tried a UP build and it is almost twice as slow when compiling (obviously).
So I don't see any performance advantage in running UP :)

Generally, performance of two-way 900MHz machine is not that bad --- 5 
times faster compile than 440MHz sparc. It suffers only on tests involving 
mostly kernel work, but not so seriously --- 3.5 times faster than said
sparc when doing a "dummy" make of an already compiled project (just 
testing timestamps) and 1.2 times faster than sparc on make clean (ok, it 
sucks when re-calculated to clock-to-clock). Generally, I think it's 
usable for development.

I found that gcc 4.3 from Debian 5 is buggy, it miscompiled the UP kernel. 
Compiling it with -Os worked fine. Could you please recommend a compiler 
to use? (4.4 from Debian 6 ... or some other version?)

> our cache flushing is a bit... suboptimal right now (doing whole cache
> flushes on fork and such.)

What is exactly the problem there? Could you describe it or refer to some 
document that describes it? Why do you need to flush on fork?

Sparc has virtually indexed caches too, but there are not many problems 
with it, basically the only needed thing is to flush the cache when kernel 
touches some user page via its own mapping. (if they ran with 16kB page 
size, they wouldn't have to care about data cache coherency at all).

Another thing I don't understand: the L1 cache is supposed to be 
direct-mapped, but its size is 768kB. I can't imagine how it is
implemented. Does it mean that the processor does a divide-by-3 on every 
cache access?

Or is it a mistake and the cache is 3-way set associative, with set size 
256kB? (that would make much more sense)

> Which, coupled with the gigantic caches on
> those cpus which must be flushed just tanks performance.
> 
> I've been working on cleaning up jejb's patchset from back in the
> bitkeeper days to properly do deferred flushing, but time is constantly
> against me (sigh, I don't think I've even powered on my C8000 in a few
> years now... explains why I didn't catch your e1000 issue there. :)
> 
> --Kyle

Mikulas


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-26  2:16     ` PA caches (was: C8000 cpu upgrade problem) Mikulas Patocka
@ 2010-10-26  3:04       ` Kyle McMartin
  2010-10-26  4:30         ` John David Anglin
  2010-10-26 16:02         ` Mikulas Patocka
  2010-12-18 20:13       ` PA caches (was: C8000 cpu upgrade problem) John David Anglin
  1 sibling, 2 replies; 27+ messages in thread
From: Kyle McMartin @ 2010-10-26  3:04 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Kyle McMartin, John David Anglin, linux-parisc

On Tue, Oct 26, 2010 at 04:16:39AM +0200, Mikulas Patocka wrote:
> I tried a UP build and it is almost twice as slow when compiling (obviously).
> So I don't see any performance advantage in running UP :)
> 
> Generally, performance of two-way 900MHz machine is not that bad --- 5 
> times faster compile than 440MHz sparc. It suffers only on tests involving 
> mostly kernel work, but not so seriously --- 3.5 times faster than said
> sparc when doing a "dummy" make of an already compiled project (just 
> testing timestamps) and 1.2 times faster than sparc on make clean (ok, it 
> sucks when re-calculated to clock-to-clock). Generally, I think it's 
> usable for development.
> 

Heh. I think you may be lucking in here... see below.

> I found that gcc 4.3 from Debian 5 is buggy, it miscompiled the UP kernel. 
> Compiling it with -Os worked fine. Could you please recommend a compiler 
> to use? (4.4 from Debian 6 ... or some other version?)
> 

4.4.5 from sid is what I'm using... I think it's working more or less
for me. I've only been building/booting UP/SMP on an rp3440 these days,
so I'm not sure about 32-bit.

> > our cache flushing is a bit... suboptimal right now (doing whole cache
> > flushes on fork and such.)
> 
> What is exactly the problem there? Could you describe it or refer to some 
> document that describes it? Why do you need to flush on fork?
> 
> Sparc has virtually indexed caches too, but there are not many problems 
> with it, basically the only needed thing is to flush the cache when kernel 
> touches some user page via its own mapping. (if they ran with 16kB page 
> size, they wouldn't have to care about data cache coherency at all).
> 

I can't remember exactly why offhand, I'm sure jejb can remind us.

> Another thing I don't understand: the L1 cache is supposed to be 
> direct-mapped, but its size is 768kB. I can't imagine how it is
> implemented. Does it mean that the processor does a divide-by-3 on every 
> cache access?
> 
> Or is it a mistake and the cache is 3-way set associative, with set size 
> 256kB? (that would make much more sense)
> 

That's the output from one of the firmware queries, which has been lying
to us for a very long time (apparently HP just doesn't test these things
or something.) I believe the pa8800 L1 caches were 4-way associative.

So, on to the interesting bit!

Does your /proc/cpuinfo actually say 768kB? That's... amazingly
interesting. I wonder (out loud, sorry I should go back and look at the
prior emails) if that's the cause of your cpu issues...

processor       : 0
cpu family      : PA-RISC 2.0
cpu             : PA8800 (Mako)
cpu MHz         : 999.995500
capabilities    : os64
model           : 9000/800/rp3440  
model name      : Storm Peak Fast
hversion        : 0x00008890
sversion        : 0x00000491
I-cache         : 32768 KB
D-cache         : 32768 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB
bogomips        : 1998.84
software id     : 4468984695822677774

is what mine says... (with the 32MB L2 cache.)

Anyway, the L1 are usually 2/4-way associative on parisc, iirc, I
believe the L2 is as well.

The main problems we see on the pa8800 are due to the L2, which is
physically indexed, and exclusive. We had some bizarre
corruption due to incorrect evictions there. (And flushing 32MB on
fork is just utterly painful, we really need to fix that someday.)

--Kyle


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-26  3:04       ` Kyle McMartin
@ 2010-10-26  4:30         ` John David Anglin
  2010-10-26 16:02         ` Mikulas Patocka
  1 sibling, 0 replies; 27+ messages in thread
From: John David Anglin @ 2010-10-26  4:30 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: mikulas, kyle, linux-parisc

> > I found that gcc 4.3 from Debian 5 is buggy, it miscompiled the UP kernel. 
> > Compiling it with -Os worked fine. Could you please recommend a compiler 
> > to use? (4.4 from Debian 6 ... or some other version?)
> > 
> 
> 4.4.5 from sid is what I'm using... I think it's working more or less
> for me. I've only been building/booting UP/SMP on an rp3440 these days,
> so I'm not sure about 32-bit.

Almost all the bugs are middle-end problems.  Gradually, things have
been getting better, but resolving wrong code bugs is often difficult,
particularly if the compiler has been miscompiled.

Some of the things that hurt:
a) Strict alignment and weird ABI for passing structs.
b) Callee copies (almost all other archs are caller copy).

Recently, a serious bug in forward propagation was discovered.  It's
hard to tell the magnitude of its impact.

> > > our cache flushing is a bit... suboptimal right now (doing whole cache
> > > flushes on fork and such.)
> > 
> > What is exactly the problem there? Could you describe it or refer to some 
> > document that describes it? Why do you need to flush on fork?
> > 
> > Sparc has virtually indexed caches too, but there are not many problems 
> > with it, basically the only needed thing is to flush the cache when kernel 
> > touches some user page via its own mapping. (if they ran with 16kB page 
> > size, they wouldn't have to care about data cache coherency at all).
> > 
> 
> I can't remember exactly why offhand, I'm sure jejb can remind us.

The fundamental issue is the PA-8800 and PA-8900 implementations
don't support non equivalent aliasing.  So, copies to/from the
kernel mapping are tricky.

> That's the output from one of the firmware queries, which has been lying
> to us for a very long time (apparently HP just doesn't test these things
> or something.) I believe the pa8800 L1 caches were 4-way associative.
> 
> So, on to the interesting bit!
> 
> Does your /proc/cpuinfo actually say 768kB? That's... amazingly
> interesting. I wonder (out loud, sorry I should go back and look at the
> prior emails) if that's the cause of your cpu issues...
> 
> processor       : 0
> cpu family      : PA-RISC 2.0
> cpu             : PA8800 (Mako)
> cpu MHz         : 999.995500
> capabilities    : os64
> model           : 9000/800/rp3440  
> model name      : Storm Peak Fast
> hversion        : 0x00008890
> sversion        : 0x00000491
> I-cache         : 32768 KB
> D-cache         : 32768 KB (WB, direct mapped)
> ITLB entries    : 240
> DTLB entries    : 240 - shared with ITLB
> bogomips        : 1998.84
> software id     : 4468984695822677774
> 
> is what mine says... (with the 32MB L2 cache.)

My inspection of the datasheets for the c8000 indicated that the base
configuration was 2-way with no L2 cache (both PA8800 and PA8900).
It had a total cache size of 3 MB.  The optional PA8800 models were
2 and 4-way PA8800 with 32 MB L2 cache per processor module.  The optional
PA8900 models had 64 MB L2 per processor module.

Also, Mikulas's machine is misidentified as Mako2.  Why does the model
for yours show it is a rp3440?

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-26  3:04       ` Kyle McMartin
  2010-10-26  4:30         ` John David Anglin
@ 2010-10-26 16:02         ` Mikulas Patocka
  2010-10-27  1:29           ` John David Anglin
  1 sibling, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-26 16:02 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: John David Anglin, linux-parisc



On Mon, 25 Oct 2010, Kyle McMartin wrote:

> On Tue, Oct 26, 2010 at 04:16:39AM +0200, Mikulas Patocka wrote:
> > I tried a UP build and it is almost twice as slow when compiling (obviously).
> > So I don't see any performance advantage in running UP :)
> > 
> > Generally, performance of two-way 900MHz machine is not that bad --- 5 
> > times faster compile than 440MHz sparc. It suffers only on tests involving 
> > mostly kernel work, but not so seriously --- 3.5 times faster than said
> > sparc when doing a "dummy" make of an already compiled project (just 
> > testing timestamps) and 1.2 times faster than sparc on make clean (ok, it 
> > sucks when re-calculated to clock-to-clock). Generally, I think it's 
> > usable for development.
> > 
> 
> Heh. I think you may be lucking in here... see below.
> 
> > I found that gcc 4.3 from Debian 5 is buggy, it miscompiled the UP kernel. 
> > Compiling it with -Os worked fine. Could you please recommend a compiler 
> > to use? (4.4 from Debian 6 ... or some other version?)
> > 
> 
> 4.4.5 from sid is what I'm using... I think it's working more or less
> for me. I've only been building/booting UP/SMP on an rp3440 these days,
> so I'm not sure about 32-bit.
> 
> > > our cache flushing is a bit... suboptimal right now (doing whole cache
> > > flushes on fork and such.)
> > 
> > What is exactly the problem there? Could you describe it or refer to some 
> > document that describes it? Why do you need to flush on fork?
> > 
> > Sparc has virtually indexed caches too, but there are not many problems 
> > with it, basically the only needed thing is to flush the cache when kernel 
> > touches some user page via its own mapping. (if they ran with 16kB page 
> > size, they wouldn't have to care about data cache coherency at all).
> > 
> 
> I can't remember exactly why offhand, I'm sure jejb can remind us.
> 
> > Another thing I don't understand: the L1 cache is supposed to be 
> > direct-mapped, but its size is 768kB. I can't imagine how it is
> > implemented. Does it mean that the processor does a divide-by-3 on every 
> > cache access?
> > 
> > Or is it a mistake and the cache is 3-way set associative, with set size 
> > 256kB? (that would make much more sense)
> > 
> 
> That's the output from one of the firmware queries, which has been lying
> to us for a very long time (apparently HP just doesn't test these things
> or something.) I believe the pa8800 L1 caches were 4-way associative.

I'd say 3-way. If there are 768kB, the associativity must be 3*(2^n).
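
The arithmetic behind that claim can be checked with a short sketch. It
assumes the number of sets and the line size are both powers of two, which
is the usual design but is not confirmed anywhere in this thread for the
PA-8800; odd_part and ways_possible are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Strip every factor of two from x. If sets and line size are powers of
 * two, whatever is left over has to come from the number of ways. */
static size_t odd_part(size_t x)
{
    assert(x > 0);
    while ((x & 1) == 0)
        x >>= 1;
    return x;
}

/* Return 1 if `ways` is compatible with a cache of `cache_bytes`, i.e. if
 * one way (sets * line size) comes out as a power of two; 0 otherwise. */
static int ways_possible(size_t cache_bytes, size_t ways)
{
    if (ways == 0 || cache_bytes % ways != 0)
        return 0;
    size_t per_way = cache_bytes / ways;
    return (per_way & (per_way - 1)) == 0;
}
```

With 768kB = 3 * 2^18 bytes, odd_part() yields 3, so 3-way (256kB per way)
and 6-way pass the check, while direct-mapped and 4-way do not.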

> So, on to the interesting bit!
> 
> Does your /proc/cpuinfo actually say 768kB? That's... amazingly
> interesting. I wonder (out loud, sorry I should go back and look at the
> prior emails) if that's the cause of your cpu issues...
> 
> processor       : 0
> cpu family      : PA-RISC 2.0
> cpu             : PA8800 (Mako)
> cpu MHz         : 999.995500
> capabilities    : os64
> model           : 9000/800/rp3440  
> model name      : Storm Peak Fast
> hversion        : 0x00008890
> sversion        : 0x00000491
> I-cache         : 32768 KB
> D-cache         : 32768 KB (WB, direct mapped)
> ITLB entries    : 240
> DTLB entries    : 240 - shared with ITLB
> bogomips        : 1998.84
> software id     : 4468984695822677774
> 
> is what mine says... (with the 32MB L2 cache.)

Mine says:
processor       : 0
cpu family      : PA-RISC 2.0
cpu             : PA8900 (Shortfin)
cpu MHz         : 900.000000
capabilities    : os64
model           : 9000/785/C8000
model name      : Unknown machine
hversion        : 0x00008920
sversion        : 0x00000491
I-cache         : 768 KB
D-cache         : 768 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB
bogomips        : 1795.68
software id     : 6249854628114153565

PA8900 is wrong, direct mapped is wrong.

So, maybe the cache is the reason why it is fast and why it doesn't run on 
SMP?

> Anyway, the L1 are usually 2/4-way associative on parisc, iirc, I
> believe the L2 is as well.
> 
> The main problems we see on the pa8800 are due to the L2, which is
> physically indexed, and exclusive. We had some bizarre
> corruption due to incorrect evictions there. (And flushing 32MB on
> fork is just utterly painful, we really need to fix that someday.)
> 
> --Kyle

When I read the specification, it says that equivalent virtual addresses 
are those that are 16MB (or a multiple thereof) apart. Warning: the PDF is
wrong (it says 1MB); there's an erratum on the HP website that extends it to
16MB.

It also gives an option to hash parts of space-ID to the cache addressing, 
I suppose this is turned off on Linux.

The hardware handles aliasing of equivalent addresses fine (both on UP or 
SMP).

Multiple mappings on non-equivalent addresses are allowed only if all are 
read-only (otherwise it generates machine-check conditions).
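
The equivalence rule above can be written down as a small sketch. It takes
the 16MB figure from the erratum as given and ignores the optional space-ID
hashing; ALIAS_BOUNDARY and equivalent() are illustrative names, not
existing kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed alias boundary: 16MB, per the erratum mentioned above. */
#define ALIAS_BOUNDARY (UINT64_C(16) << 20)

/* Two virtual addresses are equivalent when they differ by a multiple of
 * the alias boundary, i.e. when their low 24 bits are identical. */
static int equivalent(uint64_t va1, uint64_t va2)
{
    return ((va1 ^ va2) & (ALIAS_BOUNDARY - 1)) == 0;
}
```

Under this rule two mappings 16MB or 32MB apart are equivalent, while two
mappings only 4MB apart are not.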



Based on the specification, I suppose that the processor finds the cache
index from the virtual address (optionally with a space-id hashed into it)
and, in parallel, finds the physical address using the TLB. The cache
contains 3 or 4 lines at a given index, each tagged with a full physical
address. The physical addresses are compared with the output from the TLB
and, if a match is found, that cache line is accessed.



So, if we want to implement it correctly, we must allow aliasing only on 
equivalent virtual addresses.

- fork --- no problem, the mappings are equivalent after fork, I see no 
need to flush the cache there; the hardware should cope. If you see such a need,
describe it.

- kmap (accessing user pages from the kernel) --- kmap will work if we 
deliberately select an equivalent kernel address (that matches the user 
address modulo 16M). If we do, no need to flush cache.

- shared memory --- there is the SHMLBA boundary, to which all mappings
are aligned --- it is **WRONG** in the current kernel!!!
It is only 4MB and should be 16MB!!!

- mapped files --- I'd simply map them all so that (mapped_address -
file_offset) is divisible by 16MB. One problem would be MAP_FIXED; this
should simply be rejected with -EINVAL and the userspace linker patched to
use congruent addresses.

Note that aliasing non-equivalent addresses may cause machine-check 
exception according to the specifications, so we simply can't allow the 
userspace to do them. I don't know how many programs will be broken by 
restricting MAP_FIXED, but I don't see any other reasonable way (well, 
you can unmap the other mappings when creating a non-equivalent mapping, 
but what to do with mlock() then?).
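
The placement rule proposed for mapped files (and, with the user address in
place of the file offset, for kmap) could look roughly like the sketch
below. It is user-space illustration code, not a kernel patch: the function
names are hypothetical, the 16MB boundary is taken from the discussion
above, and space-ID hashing is ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed alias boundary: 16MB, as argued above. */
#define ALIAS_BOUNDARY (UINT64_C(16) << 20)

/* MAP_FIXED case: accept the caller's address only if
 * (addr - file_offset) is a multiple of the alias boundary. */
static int fixed_mapping_ok(uint64_t addr, uint64_t file_offset)
{
    return ((addr - file_offset) & (ALIAS_BOUNDARY - 1)) == 0 ? 0 : -EINVAL;
}

/* Non-fixed case: lowest address >= hint that is congruent to
 * file_offset modulo the alias boundary. */
static uint64_t pick_congruent(uint64_t hint, uint64_t file_offset)
{
    uint64_t want = file_offset & (ALIAS_BOUNDARY - 1);
    uint64_t addr = (hint & ~(ALIAS_BOUNDARY - 1)) | want;

    if (addr < hint)
        addr += ALIAS_BOUNDARY;   /* move up to the next 16MB window */
    return addr;
}
```

The cost of this scheme is address-space fragmentation: every mapping may
waste up to 16MB of virtual space below its start, which matters more on
32-bit than on 64-bit.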

How does HP-UX solve MAP_FIXED to non-equivalent addresses? Does it abort 
it with -EINVAL?

If we obey these rules, we can run with no cache flushing in page mapping
or unmapping at all. There is one case where we'd need to flush the cache ---
freeing a page and allocating it to a different virtual address. We'd need
to flush the cache on all page frees or allocations. (This could later be
mitigated with an arch-specific wrapper around the page allocator.)

Mikulas


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-26 16:02         ` Mikulas Patocka
@ 2010-10-27  1:29           ` John David Anglin
  2010-10-27  2:40             ` John David Anglin
  2010-10-27  4:50             ` James Bottomley
  0 siblings, 2 replies; 27+ messages in thread
From: John David Anglin @ 2010-10-27  1:29 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: kyle, linux-parisc

> > Heh. I think you may be lucking in here... see below.
> > 
> > > I found that gcc 4.3 from Debian 5 is buggy, it miscompiled the UP kernel. 
> > > Compiling it with -Os worked fine. Could you please recommend a compiler 
> > > to use? (4.4 from Debian 6 ... or some other version?)
> > > 
> > 
> > 4.4.5 from sid is what I'm using... I think it's working more or less
> > for me. I've only been building/booting UP/SMP on an rp3440 these days,
> > so I'm not sure about 32-bit.
> > 
> > > > our cache flushing is a bit... suboptimal right now (doing whole cache
> > > > flushes on fork and such.)
> > > 
> > > What is exactly the problem there? Could you describe it or refer to some 
> > > document that describes it? Why do you need to flush on fork?
> > > 
> > > Sparc has virtually indexed caches too, but there are not many problems 
> > > with it, basically the only needed thing is to flush the cache when kernel 
> > > touches some user page via its own mapping. (if they ran with 16kB page 
> > > size, they wouldn't have to care about data cache coherency at all).
> I'd say 3-way. If there are 768kB, the associativity must be 3*(2^n).
> 
> > So, on to the interesting bit!
> > 
> > Does your /proc/cpuinfo actually say 768kB? That's... amazingly
> > interesting. I wonder (out loud, sorry I should go back and look at the
> > prior emails) if that's the cause of your cpu issues...
> > 
> > processor       : 0
> > cpu family      : PA-RISC 2.0
> > cpu             : PA8800 (Mako)
> > cpu MHz         : 999.995500
> > capabilities    : os64
> > model           : 9000/800/rp3440  
> > model name      : Storm Peak Fast
> > hversion        : 0x00008890
> > sversion        : 0x00000491
> > I-cache         : 32768 KB
> > D-cache         : 32768 KB (WB, direct mapped)
> > ITLB entries    : 240
> > DTLB entries    : 240 - shared with ITLB
> > bogomips        : 1998.84
> > software id     : 4468984695822677774
> > 
> > is what mine says... (with the 32MB L2 cache.)
> 
> Mine says:
> processor       : 0
> cpu family      : PA-RISC 2.0
> cpu             : PA8900 (Shortfin)
> cpu MHz         : 900.000000
> capabilities    : os64
> model           : 9000/785/C8000
> model name      : Unknown machine
> hversion        : 0x00008920
> sversion        : 0x00000491
> I-cache         : 768 KB
> D-cache         : 768 KB (WB, direct mapped)
> ITLB entries    : 240
> DTLB entries    : 240 - shared with ITLB
> bogomips        : 1795.68
> software id     : 6249854628114153565
> 
> PA8900 is wrong, direct mapped is wrong.

"direct mapped" indicates that the PDC_CACHE call returned a D_loop value
of 1.  According to the documentation, this indicates that FDCE(addr) only
needs to be done once at any given address.  A N way cache may require
N FDCE(addr) executions or just 1, depending on implementation.  Thus, a
value of 1 doesn't provide any information about the details of the
implementation.

Probably, the I_loop and D_loop values should be saved for the cache
flush code.
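
To make the role of the loop count concrete, here is a sketch of a flush
loop driven by base/stride/count/loop parameters of the kind PDC_CACHE
reports. The FDCE instruction is replaced by a function pointer so the code
can run anywhere; the function and parameter names are assumptions, and
this is not the kernel's actual flush routine.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for issuing one FDCE at the given address. */
typedef void (*fdce_fn)(unsigned long addr);

/* Visit `count` addresses starting at `base`, `stride` bytes apart, and
 * issue `loop` flushes at each one. Returns the number of flushes issued. */
static size_t flush_dcache_sketch(unsigned long base, unsigned long stride,
                                  size_t count, size_t loop, fdce_fn fdce)
{
    size_t issued = 0;
    unsigned long addr = base;

    for (size_t i = 0; i < count; i++, addr += stride) {
        for (size_t l = 0; l < loop; l++) {
            fdce(addr);
            issued++;
        }
    }
    return issued;
}

/* Counting stub used in place of the real instruction. */
static size_t fdce_calls;

static void fdce_stub(unsigned long addr)
{
    (void)addr;
    fdce_calls++;
}
```

With a loop value of 1 each address is flushed once, regardless of how many
ways the hardware has, which is why the value says nothing about
associativity.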

> So, maybe the cache is the reason why it is fast and why it doesn't run on 
> SMP?

What happens when you run a SMP kernel?

> > Anyway, the L1 are usually 2/4-way associative on parisc, iirc, I
> > believe the L2 is as well.
> > 
> > The main problems we see on the pa8800 are due to the L2, which is
> > physically indexed, and exclusive. We had some bizarre
> > corruption due to incorrect evictions there. (And flushing 32MB on
> > fork is just utterly painful, we really need to fix that someday.)
> > 
> > --Kyle
> 
> When I read the specification, it says that equivalent virtual addresses 
> are those that are 16MB (or a multiple thereof) apart. Warning: the PDF is
> wrong (it says 1MB); there's an erratum on the HP website that extends it to
> 16MB.
> 
> It also gives an option to hash parts of space-ID to the cache addressing, 
> I suppose this is turned off on Linux.
> 
> The hardware handles aliasing of equivalent addresses fine (both on UP or 
> SMP).
> 
> Multiple mappings on non-equivalent addresses are allowed only if all are 
> read-only (otherwise it generates machine-check conditions).
> 
> 
> 
> Based on the specification, I suppose that the processor finds the cache
> index from the virtual address (optionally with a space-id hashed into it)
> and, in parallel, finds the physical address using the TLB. The cache
> contains 3 or 4 lines at a given index, each tagged with a full physical
> address. The physical addresses are compared with the output from the TLB
> and, if a match is found, that cache line is accessed.
> 
> 
> 
> So, if we want to implement it correctly, we must allow aliasing only on 
> equivalent virtual addresses.
> 
> - fork --- no problem, the mappings are equivalent after fork, I see no 
> need to flush the cache there; the hardware should cope. If you see such a need,
> describe it.
> 
> - kmap (accessing user pages from the kernel) --- kmap will work if we 
> deliberately select an equivalent kernel address (that matches the user 
> address modulo 16M). If we do, no need to flush cache.

I have tried this but haven't reached a fully stable configuration.
Unfortunately, the hard drive on the system that I was testing on
is dying...

See __clear_user_page_asm.  I tried similar implementations for
copy_user_page, etc.

> - shared memory --- there is the SHMLBA boundary, which forces all mappings 
> to be aligned to it --- it is **WRONG** in the current kernel!!! 
> It is only 4MB and should be 16MB!!!

James has said that the max for all PA-RISC implementations is
4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
called for.  The alias boundary can be determined by the alias field
in the D_conf return value.
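
For illustration, a minimal sketch of the equivalence test under
discussion, assuming the alias boundary is a power of two.  The names
ALIAS_4MB, ALIAS_16MB and va_equivalent() are hypothetical and are not
kernel symbols; on real hardware the boundary would have to come from
the PDC_CACHE result rather than from a constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical candidate alias boundaries, in bytes. */
#define ALIAS_4MB  (4UL << 20)
#define ALIAS_16MB (16UL << 20)

/*
 * Two virtual addresses are "equivalent" (they index the same cache
 * location) when they differ by a multiple of the alias boundary,
 * i.e. when all address bits below the boundary are identical.
 */
static bool va_equivalent(unsigned long a, unsigned long b,
			  unsigned long boundary)
{
	/* The mask trick below only works for a power-of-two boundary. */
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

	return ((a ^ b) & (boundary - 1)) == 0;
}
```

With these definitions, two addresses 4 MB apart count as equivalent
under a 4 MB boundary but not under a 16 MB one, which is the case
the quoted text is worried about.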

> - mapped files --- I'd simply map them all so that (mapped_address - 
> file_offset) is divisible by 16MB. One problem would be MAP_FIXED, this 
> should be simply rejected with -EINVAL and the userspace linker be patched 
> to use congruent addresses.
> 
> Note that aliasing non-equivalent addresses may cause machine-check 
> exception according to the specifications, so we simply can't allow the 
> userspace to do them. I don't know how many programs will be broken by 
> restricting MAP_FIXED, but I don't see any other reasonable way (well, 
> you can unmap the other mappings when creating a non-equivalent mapping, 
> but what to do with mlock() then?).
> 
> How does HP-UX solve MAP_FIXED to non-equivalent addresses? Does it abort 
> it with -EINVAL?

I believe that the call fails.  This was a problem in getting PCH to
work on hppa.
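
As a rough sketch of the mapped-file rule proposed above, assuming a
power-of-two alias boundary passed in by the caller: the helpers
pick_congruent_addr() and check_map_fixed() are hypothetical names made
up for this example, not existing kernel functions, and the sketch
ignores address-space limits and overflow.

```c
#include <assert.h>
#include <errno.h>

/*
 * Return the lowest address >= hint such that
 * (address - file_offset) is a multiple of boundary.
 */
static unsigned long pick_congruent_addr(unsigned long hint,
					 unsigned long file_offset,
					 unsigned long boundary)
{
	unsigned long mask, addr;

	/* Power-of-two boundary assumed so that masking is valid. */
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	mask = boundary - 1;

	/* Keep the high bits of the hint, take the low bits from the offset. */
	addr = (hint & ~mask) | (file_offset & mask);
	if (addr < hint)
		addr += boundary;	/* next congruent slot above the hint */
	return addr;
}

/*
 * Policy from the quoted proposal: a MAP_FIXED request is accepted
 * only if the requested address is congruent with the file offset.
 */
static int check_map_fixed(unsigned long addr, unsigned long file_offset,
			   unsigned long boundary)
{
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

	if ((addr - file_offset) & (boundary - 1))
		return -EINVAL;
	return 0;
}
```

Whether rejecting MAP_FIXED this way is acceptable to userspace is
exactly the open question in the thread; the code only shows the
arithmetic.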

> If we obey these rules, we can run with no cache flushing in page mapping 
> or unmapping at all. There is one case where we'd need to flush cache --- 
> freeing a page and allocating it to a different virtual address. We'd need 
> to flush the cache on all page freeings or allocations. (it could be later 
> mitigated with an arch-specific wrapper around the page allocator)
> 
> Mikulas


-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27  1:29           ` John David Anglin
@ 2010-10-27  2:40             ` John David Anglin
  2010-10-27  4:50             ` James Bottomley
  1 sibling, 0 replies; 27+ messages in thread
From: John David Anglin @ 2010-10-27  2:40 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: kyle, linux-parisc

[-- Attachment #1: Type: text/plain, Size: 502 bytes --]

On Tue, 26 Oct 2010, John David Anglin wrote:

> I have tried this but haven't reached a fully stable configuration.
> Unfortunately, the hard drive on the system that I was testing on
> is dying...

Attached is the last diff that I have readily available.  It's
certainly not right, but it points to areas that I thought
might be problems.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

[-- Attachment #2: diff-20100523.d --]
[-- Type: text/plain, Size: 49155 bytes --]

diff --git a/arch/parisc/hpux/wrappers.S b/arch/parisc/hpux/wrappers.S
index 58c53c8..bdcea33 100644
--- a/arch/parisc/hpux/wrappers.S
+++ b/arch/parisc/hpux/wrappers.S
@@ -88,7 +88,7 @@ ENTRY(hpux_fork_wrapper)
 
 	STREG	%r2,-20(%r30)
 	ldo	64(%r30),%r30
-	STREG	%r2,PT_GR19(%r1)	;! save for child
+	STREG	%r2,PT_SYSCALL_RP(%r1)	;! save for child
 	STREG	%r30,PT_GR21(%r1)	;! save for child
 
 	LDREG	PT_GR30(%r1),%r25
@@ -132,7 +132,7 @@ ENTRY(hpux_child_return)
 	bl,n	schedule_tail, %r2
 #endif
 
-	LDREG	TASK_PT_GR19-TASK_SZ_ALGN-128(%r30),%r2
+	LDREG	TASK_PT_SYSCALL_RP-TASK_SZ_ALGN-128(%r30),%r2
 	b fork_return
 	copy %r0,%r28
 ENDPROC(hpux_child_return)
diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index 716634d..ad7df44 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -24,29 +24,46 @@
  * Hash function to index into a different SPINLOCK.
  * Since "a" is usually an address, use one spinlock per cacheline.
  */
-#  define ATOMIC_HASH_SIZE 4
-#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
+#  define ATOMIC_HASH_SIZE (4096/L1_CACHE_BYTES)  /* 4 */
+#  define ATOMIC_HASH(a)      (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
+#  define ATOMIC_USER_HASH(a) (&(__atomic_user_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
 
 extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;
+extern arch_spinlock_t __atomic_user_hash[ATOMIC_HASH_SIZE] __lock_aligned;
 
 /* Can't use raw_spin_lock_irq because of #include problems, so
  * this is the substitute */
-#define _atomic_spin_lock_irqsave(l,f) do {	\
-	arch_spinlock_t *s = ATOMIC_HASH(l);		\
+#define _atomic_spin_lock_irqsave_template(l,f,hash_func) do {	\
+	arch_spinlock_t *s = hash_func;		\
 	local_irq_save(f);			\
 	arch_spin_lock(s);			\
 } while(0)
 
-#define _atomic_spin_unlock_irqrestore(l,f) do {	\
-	arch_spinlock_t *s = ATOMIC_HASH(l);			\
+#define _atomic_spin_unlock_irqrestore_template(l,f,hash_func) do {	\
+	arch_spinlock_t *s = hash_func;			\
 	arch_spin_unlock(s);				\
 	local_irq_restore(f);				\
 } while(0)
 
+/* kernel memory locks */
+#define _atomic_spin_lock_irqsave(l,f)	\
+	_atomic_spin_lock_irqsave_template(l,f,ATOMIC_HASH(l))
+
+#define _atomic_spin_unlock_irqrestore(l,f)	\
+	_atomic_spin_unlock_irqrestore_template(l,f,ATOMIC_HASH(l))
+
+/* userspace memory locks */
+#define _atomic_spin_lock_irqsave_user(l,f)	\
+	_atomic_spin_lock_irqsave_template(l,f,ATOMIC_USER_HASH(l))
+
+#define _atomic_spin_unlock_irqrestore_user(l,f)	\
+	_atomic_spin_unlock_irqrestore_template(l,f,ATOMIC_USER_HASH(l))
 
 #else
 #  define _atomic_spin_lock_irqsave(l,f) do { local_irq_save(f); } while (0)
 #  define _atomic_spin_unlock_irqrestore(l,f) do { local_irq_restore(f); } while (0)
+#  define _atomic_spin_lock_irqsave_user(l,f) _atomic_spin_lock_irqsave(l,f)
+#  define _atomic_spin_unlock_irqrestore_user(l,f) _atomic_spin_unlock_irqrestore(l,f)
 #endif
 
 /* This should get optimized out since it's never called.
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index 7a73b61..b90c895 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -2,6 +2,7 @@
 #define _PARISC_CACHEFLUSH_H
 
 #include <linux/mm.h>
+#include <linux/uaccess.h>
 
 /* The usual comment is "Caches aren't brain-dead on the <architecture>".
  * Unfortunately, that doesn't apply to PA-RISC. */
@@ -104,21 +105,32 @@ void mark_rodata_ro(void);
 #define ARCH_HAS_KMAP
 
 void kunmap_parisc(void *addr);
+void *kmap_parisc(struct page *page);
 
 static inline void *kmap(struct page *page)
 {
 	might_sleep();
-	return page_address(page);
+	return kmap_parisc(page);
 }
 
 #define kunmap(page)			kunmap_parisc(page_address(page))
 
-#define kmap_atomic(page, idx)		page_address(page)
+static inline void *kmap_atomic(struct page *page, enum km_type idx)
+{
+	pagefault_disable();
+	return kmap_parisc(page);
+}
 
-#define kunmap_atomic(addr, idx)	kunmap_parisc(addr)
+static inline void kunmap_atomic(void *addr, enum km_type idx)
+{
+	kunmap_parisc(addr);
+	pagefault_enable();
+}
 
-#define kmap_atomic_pfn(pfn, idx)	page_address(pfn_to_page(pfn))
-#define kmap_atomic_to_page(ptr)	virt_to_page(ptr)
+#define kmap_atomic_prot(page, idx, prot)	kmap_atomic(page, idx)
+#define kmap_atomic_pfn(pfn, idx)	kmap_atomic(pfn_to_page(pfn), (idx))
+#define kmap_atomic_to_page(ptr)	virt_to_page(kmap_atomic(virt_to_page(ptr), (enum km_type) 0))
+#define kmap_flush_unused()	do {} while(0)
 #endif
 
 #endif /* _PARISC_CACHEFLUSH_H */
diff --git a/arch/parisc/include/asm/futex.h b/arch/parisc/include/asm/futex.h
index 0c705c3..7bc963e 100644
--- a/arch/parisc/include/asm/futex.h
+++ b/arch/parisc/include/asm/futex.h
@@ -55,6 +55,7 @@ futex_atomic_cmpxchg_inatomic(int __user *uaddr, int oldval, int newval)
 {
 	int err = 0;
 	int uval;
+	unsigned long flags;
 
 	/* futex.c wants to do a cmpxchg_inatomic on kernel NULL, which is
 	 * our gateway page, and causes no end of trouble...
@@ -65,10 +66,15 @@ futex_atomic_cmpxchg_inatomic(int __user *uaddr, int oldval, int newval)
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
 		return -EFAULT;
 
+	_atomic_spin_lock_irqsave_user(uaddr, flags);
+
 	err = get_user(uval, uaddr);
-	if (err) return -EFAULT;
-	if (uval == oldval)
-		err = put_user(newval, uaddr);
+	if (!err)
+		if (uval == oldval)
+			err = put_user(newval, uaddr);
+
+	_atomic_spin_unlock_irqrestore_user(uaddr, flags);
+
 	if (err) return -EFAULT;
 	return uval;
 }
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index a84cc1f..cca0f53 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -21,15 +21,18 @@
 #include <asm/types.h>
 #include <asm/cache.h>
 
-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
-#define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))
+#define clear_page(page)	clear_page_asm((void *)(page))
+#define copy_page(to,from)      copy_page_asm((void *)(to), (void *)(from))
 
 struct page;
 
-void copy_user_page_asm(void *to, void *from);
-void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
+extern void copy_page_asm(void *to, void *from);
+extern void clear_page_asm(void *page);
+extern void copy_user_page_asm(void *to, void *from, unsigned long vaddr);
+extern void clear_user_page_asm(void *page, unsigned long vaddr);
+extern void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
 			   struct page *pg);
-void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
+extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 
 /*
  * These are used to make use of C type-checking..
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index a27d2e2..8050948 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -14,6 +14,7 @@
 #include <linux/bitops.h>
 #include <asm/processor.h>
 #include <asm/cache.h>
+#include <linux/uaccess.h>
 
 /*
  * kern_addr_valid(ADDR) tests if ADDR is pointing to valid kernel
@@ -30,15 +31,21 @@
  */
 #define kern_addr_valid(addr)	(1)
 
+extern spinlock_t pa_pte_lock;
+extern spinlock_t pa_tlb_lock;
+
 /* Certain architectures need to do special things when PTEs
  * within a page table are directly modified.  Thus, the following
  * hook is made available.
  */
-#define set_pte(pteptr, pteval)                                 \
-        do{                                                     \
+#define set_pte(pteptr, pteval)					\
+        do {							\
+		unsigned long flags;				\
+		spin_lock_irqsave(&pa_pte_lock, flags);		\
                 *(pteptr) = (pteval);                           \
+		spin_unlock_irqrestore(&pa_pte_lock, flags);	\
         } while(0)
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
+#define set_pte_at(mm,addr,ptep,pteval)	set_pte(ptep, pteval)
 
 #endif /* !__ASSEMBLY__ */
 
@@ -262,6 +269,7 @@ extern unsigned long *empty_zero_page;
 #define pte_none(x)     ((pte_val(x) == 0) || (pte_val(x) & _PAGE_FLUSH))
 #define pte_present(x)	(pte_val(x) & _PAGE_PRESENT)
 #define pte_clear(mm,addr,xp)	do { pte_val(*(xp)) = 0; } while (0)
+#define pte_same(A,B)	(pte_val(A) == pte_val(B))
 
 #define pmd_flag(x)	(pmd_val(x) & PxD_FLAG_MASK)
 #define pmd_address(x)	((unsigned long)(pmd_val(x) &~ PxD_FLAG_MASK) << PxD_VALUE_SHIFT)
@@ -410,6 +418,7 @@ extern void paging_init (void);
 
 #define PG_dcache_dirty         PG_arch_1
 
+extern void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn);
 extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t);
 
 /* Encode and de-code a swap entry */
@@ -423,56 +432,83 @@ extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t);
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+static inline void __flush_tlb_page(struct mm_struct *mm, unsigned long addr)
 {
-#ifdef CONFIG_SMP
-	if (!pte_young(*ptep))
-		return 0;
-	return test_and_clear_bit(xlate_pabit(_PAGE_ACCESSED_BIT), &pte_val(*ptep));
-#else
-	pte_t pte = *ptep;
-	if (!pte_young(pte))
-		return 0;
-	set_pte_at(vma->vm_mm, addr, ptep, pte_mkold(pte));
-	return 1;
-#endif
+	unsigned long flags;
+
+	/* For one page, it's not worth testing the split_tlb variable.  */
+	spin_lock_irqsave(&pa_tlb_lock, flags);
+	mtsp(mm->context,1);
+	pdtlb(addr);
+	pitlb(addr);
+	spin_unlock_irqrestore(&pa_tlb_lock, flags);
 }
 
-extern spinlock_t pa_dbit_lock;
+static inline int ptep_set_access_flags(struct vm_area_struct *vma, unsigned
+ long addr, pte_t *ptep, pte_t entry, int dirty)
+{
+	int changed;
+	unsigned long flags;
+	spin_lock_irqsave(&pa_pte_lock, flags);
+	changed = !pte_same(*ptep, entry);
+	if (changed) {
+		*ptep = entry;
+	}
+	spin_unlock_irqrestore(&pa_pte_lock, flags);
+	if (changed) {
+		__flush_tlb_page(vma->vm_mm, addr);
+	}
+	return changed;
+}
+
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
+{
+	pte_t pte;
+	unsigned long flags;
+	int r;
+
+	spin_lock_irqsave(&pa_pte_lock, flags);
+	pte = *ptep;
+	if (pte_young(pte)) {
+		*ptep = pte_mkold(pte);
+		r = 1;
+	} else {
+		r = 0;
+	}
+	spin_unlock_irqrestore(&pa_pte_lock, flags);
+
+	return r;
+}
 
 struct mm_struct;
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-	pte_t old_pte;
-	pte_t pte;
+	pte_t pte, old_pte;
+	unsigned long flags;
 
-	spin_lock(&pa_dbit_lock);
+	spin_lock_irqsave(&pa_pte_lock, flags);
 	pte = old_pte = *ptep;
 	pte_val(pte) &= ~_PAGE_PRESENT;
 	pte_val(pte) |= _PAGE_FLUSH;
-	set_pte_at(mm,addr,ptep,pte);
-	spin_unlock(&pa_dbit_lock);
+	*ptep = pte;
+	spin_unlock_irqrestore(&pa_pte_lock, flags);
 
 	return old_pte;
 }
 
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+static inline void ptep_set_wrprotect(struct vm_area_struct *vma, struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-#ifdef CONFIG_SMP
-	unsigned long new, old;
-
-	do {
-		old = pte_val(*ptep);
-		new = pte_val(pte_wrprotect(__pte (old)));
-	} while (cmpxchg((unsigned long *) ptep, old, new) != old);
-#else
-	pte_t old_pte = *ptep;
-	set_pte_at(mm, addr, ptep, pte_wrprotect(old_pte));
-#endif
+	pte_t old_pte;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pa_pte_lock, flags);
+	old_pte = *ptep;
+	*ptep = pte_wrprotect(old_pte);
+	__flush_tlb_page(mm, addr);
+	flush_cache_page(vma, addr, pte_pfn(old_pte));
+	spin_unlock_irqrestore(&pa_pte_lock, flags);
 }
 
-#define pte_same(A,B)	(pte_val(A) == pte_val(B))
-
 #endif /* !__ASSEMBLY__ */
 
 
@@ -504,6 +540,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 
 #define HAVE_ARCH_UNMAPPED_AREA
 
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
diff --git a/arch/parisc/include/asm/system.h b/arch/parisc/include/asm/system.h
index d91357b..4653c77 100644
--- a/arch/parisc/include/asm/system.h
+++ b/arch/parisc/include/asm/system.h
@@ -160,7 +160,7 @@ static inline void set_eiem(unsigned long val)
    ldcd). */
 
 #define __PA_LDCW_ALIGNMENT	4
-#define __ldcw_align(a) ((volatile unsigned int *)a)
+#define __ldcw_align(a) (&(a)->slock)
 #define __LDCW	"ldcw,co"
 
 #endif /*!CONFIG_PA20*/
diff --git a/arch/parisc/kernel/asm-offsets.c b/arch/parisc/kernel/asm-offsets.c
index ec787b4..b2f35b2 100644
--- a/arch/parisc/kernel/asm-offsets.c
+++ b/arch/parisc/kernel/asm-offsets.c
@@ -137,6 +137,7 @@ int main(void)
 	DEFINE(TASK_PT_IAOQ0, offsetof(struct task_struct, thread.regs.iaoq[0]));
 	DEFINE(TASK_PT_IAOQ1, offsetof(struct task_struct, thread.regs.iaoq[1]));
 	DEFINE(TASK_PT_CR27, offsetof(struct task_struct, thread.regs.cr27));
+	DEFINE(TASK_PT_SYSCALL_RP, offsetof(struct task_struct, thread.regs.pad0));
 	DEFINE(TASK_PT_ORIG_R28, offsetof(struct task_struct, thread.regs.orig_r28));
 	DEFINE(TASK_PT_KSP, offsetof(struct task_struct, thread.regs.ksp));
 	DEFINE(TASK_PT_KPC, offsetof(struct task_struct, thread.regs.kpc));
@@ -225,6 +226,7 @@ int main(void)
 	DEFINE(PT_IAOQ0, offsetof(struct pt_regs, iaoq[0]));
 	DEFINE(PT_IAOQ1, offsetof(struct pt_regs, iaoq[1]));
 	DEFINE(PT_CR27, offsetof(struct pt_regs, cr27));
+	DEFINE(PT_SYSCALL_RP, offsetof(struct pt_regs, pad0));
 	DEFINE(PT_ORIG_R28, offsetof(struct pt_regs, orig_r28));
 	DEFINE(PT_KSP, offsetof(struct pt_regs, ksp));
 	DEFINE(PT_KPC, offsetof(struct pt_regs, kpc));
@@ -290,5 +292,11 @@ int main(void)
 	BLANK();
 	DEFINE(ASM_PDC_RESULT_SIZE, NUM_PDC_RESULT * sizeof(unsigned long));
 	BLANK();
+
+#ifdef CONFIG_SMP
+	DEFINE(ASM_ATOMIC_HASH_SIZE_SHIFT, __builtin_ffs(ATOMIC_HASH_SIZE)-1);
+	DEFINE(ASM_ATOMIC_HASH_ENTRY_SHIFT, __builtin_ffs(sizeof(__atomic_hash[0]))-1);
+#endif
+
 	return 0;
 }
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index b6ed34d..7952ae4 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -336,9 +336,9 @@ __flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr)
 	}
 }
 
-void flush_dcache_page(struct page *page)
+static void flush_user_dcache_page_internal(struct address_space *mapping,
+					    struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
 	struct vm_area_struct *mpnt;
 	struct prio_tree_iter iter;
 	unsigned long offset;
@@ -346,14 +346,6 @@ void flush_dcache_page(struct page *page)
 	pgoff_t pgoff;
 	unsigned long pfn = page_to_pfn(page);
 
-
-	if (mapping && !mapping_mapped(mapping)) {
-		set_bit(PG_dcache_dirty, &page->flags);
-		return;
-	}
-
-	flush_kernel_dcache_page(page);
-
 	if (!mapping)
 		return;
 
@@ -387,6 +379,19 @@ void flush_dcache_page(struct page *page)
 	}
 	flush_dcache_mmap_unlock(mapping);
 }
+
+void flush_dcache_page(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (mapping && !mapping_mapped(mapping)) {
+		set_bit(PG_dcache_dirty, &page->flags);
+		return;
+	}
+
+	flush_kernel_dcache_page(page);
+	flush_user_dcache_page_internal(mapping, page);
+}
 EXPORT_SYMBOL(flush_dcache_page);
 
 /* Defined in arch/parisc/kernel/pacache.S */
@@ -395,17 +400,6 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_asm);
 EXPORT_SYMBOL(flush_data_cache_local);
 EXPORT_SYMBOL(flush_kernel_icache_range_asm);
 
-void clear_user_page_asm(void *page, unsigned long vaddr)
-{
-	unsigned long flags;
-	/* This function is implemented in assembly in pacache.S */
-	extern void __clear_user_page_asm(void *page, unsigned long vaddr);
-
-	purge_tlb_start(flags);
-	__clear_user_page_asm(page, vaddr);
-	purge_tlb_end(flags);
-}
-
 #define FLUSH_THRESHOLD 0x80000 /* 0.5MB */
 int parisc_cache_flush_threshold __read_mostly = FLUSH_THRESHOLD;
 
@@ -440,17 +434,26 @@ void __init parisc_setup_cache_timing(void)
 }
 
 extern void purge_kernel_dcache_page(unsigned long);
-extern void clear_user_page_asm(void *page, unsigned long vaddr);
 
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
+#if 1
+	/* Clear user page using alias region.  */
+#if 0
 	unsigned long flags;
 
 	purge_kernel_dcache_page((unsigned long)page);
 	purge_tlb_start(flags);
 	pdtlb_kernel(page);
 	purge_tlb_end(flags);
+#endif
+
 	clear_user_page_asm(page, vaddr);
+#else
+	/* Clear user page using kernel mapping.  */
+	clear_page_asm(page);
+	flush_kernel_dcache_page_asm(page);
+#endif
 }
 EXPORT_SYMBOL(clear_user_page);
 
@@ -469,22 +472,15 @@ void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
 		    struct page *pg)
 {
 	/* no coherency needed (all in kmap/kunmap) */
-	copy_user_page_asm(vto, vfrom);
-	if (!parisc_requires_coherency())
-		flush_kernel_dcache_page_asm(vto);
+#if 0
+	copy_user_page_asm(vto, vfrom, vaddr);
+#else
+	copy_page_asm(vto, vfrom);
+	flush_kernel_dcache_page_asm(vto);
+#endif
 }
 EXPORT_SYMBOL(copy_user_page);
 
-#ifdef CONFIG_PA8X00
-
-void kunmap_parisc(void *addr)
-{
-	if (parisc_requires_coherency())
-		flush_kernel_dcache_page_addr(addr);
-}
-EXPORT_SYMBOL(kunmap_parisc);
-#endif
-
 void __flush_tlb_range(unsigned long sid, unsigned long start,
 		       unsigned long end)
 {
@@ -577,3 +573,25 @@ flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long
 		__flush_cache_page(vma, vmaddr);
 
 }
+
+void *kmap_parisc(struct page *page)
+{
+	/* this is a killer.  There's no easy way to test quickly if
+	 * this page is dirty in any userspace.  Additionally, for
+	 * kernel alterations of the page, we'd need it invalidated
+	 * here anyway, so currently flush (and invalidate)
+	 * universally */
+	flush_user_dcache_page_internal(page_mapping(page), page);
+	return page_address(page);
+}
+EXPORT_SYMBOL(kmap_parisc);
+
+void kunmap_parisc(void *addr)
+{
+	/* flush and invalidate the kernel mapping.  We need the
+	 * invalidate so we don't have stale data at this cache
+	 * location the next time the page is mapped */
+	flush_kernel_dcache_page_addr(addr);
+}
+EXPORT_SYMBOL(kunmap_parisc);
+
diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
index 3a44f7f..42dbf32 100644
--- a/arch/parisc/kernel/entry.S
+++ b/arch/parisc/kernel/entry.S
@@ -45,7 +45,7 @@
 	.level 2.0
 #endif
 
-	.import         pa_dbit_lock,data
+	.import         pa_pte_lock,data
 
 	/* space_to_prot macro creates a prot id from a space id */
 
@@ -364,32 +364,6 @@
 	.align		32
 	.endm
 
-	/* The following are simple 32 vs 64 bit instruction
-	 * abstractions for the macros */
-	.macro		EXTR	reg1,start,length,reg2
-#ifdef CONFIG_64BIT
-	extrd,u		\reg1,32+(\start),\length,\reg2
-#else
-	extrw,u		\reg1,\start,\length,\reg2
-#endif
-	.endm
-
-	.macro		DEP	reg1,start,length,reg2
-#ifdef CONFIG_64BIT
-	depd		\reg1,32+(\start),\length,\reg2
-#else
-	depw		\reg1,\start,\length,\reg2
-#endif
-	.endm
-
-	.macro		DEPI	val,start,length,reg
-#ifdef CONFIG_64BIT
-	depdi		\val,32+(\start),\length,\reg
-#else
-	depwi		\val,\start,\length,\reg
-#endif
-	.endm
-
 	/* In LP64, the space contains part of the upper 32 bits of the
 	 * fault.  We have to extract this and place it in the va,
 	 * zeroing the corresponding bits in the space register */
@@ -442,19 +416,19 @@
 	 */
 	.macro		L2_ptep	pmd,pte,index,va,fault
 #if PT_NLEVELS == 3
-	EXTR		\va,31-ASM_PMD_SHIFT,ASM_BITS_PER_PMD,\index
+	extru		\va,31-ASM_PMD_SHIFT,ASM_BITS_PER_PMD,\index
 #else
-	EXTR		\va,31-ASM_PGDIR_SHIFT,ASM_BITS_PER_PGD,\index
+	extru		\va,31-ASM_PGDIR_SHIFT,ASM_BITS_PER_PGD,\index
 #endif
-	DEP             %r0,31,PAGE_SHIFT,\pmd  /* clear offset */
+	dep             %r0,31,PAGE_SHIFT,\pmd  /* clear offset */
 	copy		%r0,\pte
 	ldw,s		\index(\pmd),\pmd
 	bb,>=,n		\pmd,_PxD_PRESENT_BIT,\fault
-	DEP		%r0,31,PxD_FLAG_SHIFT,\pmd /* clear flags */
+	dep		%r0,31,PxD_FLAG_SHIFT,\pmd /* clear flags */
 	copy		\pmd,%r9
 	SHLREG		%r9,PxD_VALUE_SHIFT,\pmd
-	EXTR		\va,31-PAGE_SHIFT,ASM_BITS_PER_PTE,\index
-	DEP		%r0,31,PAGE_SHIFT,\pmd  /* clear offset */
+	extru		\va,31-PAGE_SHIFT,ASM_BITS_PER_PTE,\index
+	dep		%r0,31,PAGE_SHIFT,\pmd  /* clear offset */
 	shladd		\index,BITS_PER_PTE_ENTRY,\pmd,\pmd
 	LDREG		%r0(\pmd),\pte		/* pmd is now pte */
 	bb,>=,n		\pte,_PAGE_PRESENT_BIT,\fault
@@ -488,13 +462,46 @@
 	L2_ptep		\pgd,\pte,\index,\va,\fault
 	.endm
 
+	/* SMP lock for consistent PTE updates.  Unlocks and jumps
+	   to FAULT if the page is not present.  Note the preceeding
+	   load of the PTE can't be deleted since we can't fault holding
+	   the lock.  */ 
+	.macro		pte_lock	ptep,pte,spc,tmp,tmp1,fault
+#ifdef CONFIG_SMP
+	cmpib,COND(=),n        0,\spc,2f
+	load32		PA(pa_pte_lock),\tmp1
+1:
+	LDCW		0(\tmp1),\tmp
+	cmpib,COND(=)         0,\tmp,1b
+	nop
+	LDREG		%r0(\ptep),\pte
+	bb,<,n		\pte,_PAGE_PRESENT_BIT,2f
+	ldi             1,\tmp
+	stw             \tmp,0(\tmp1)
+	b,n		\fault
+2:
+#endif
+	.endm
+
+	.macro		pte_unlock	spc,tmp,tmp1
+#ifdef CONFIG_SMP
+	cmpib,COND(=),n        0,\spc,1f
+	ldi             1,\tmp
+	stw             \tmp,0(\tmp1)
+1:
+#endif
+	.endm
+
 	/* Set the _PAGE_ACCESSED bit of the PTE.  Be clever and
 	 * don't needlessly dirty the cache line if it was already set */
-	.macro		update_ptep	ptep,pte,tmp,tmp1
-	ldi		_PAGE_ACCESSED,\tmp1
-	or		\tmp1,\pte,\tmp
-	and,COND(<>)	\tmp1,\pte,%r0
-	STREG		\tmp,0(\ptep)
+	.macro		update_ptep	ptep,pte,spc,tmp,tmp1,fault
+	bb,<,n		\pte,_PAGE_ACCESSED_BIT,3f
+	pte_lock	\ptep,\pte,\spc,\tmp,\tmp1,\fault
+	ldi		_PAGE_ACCESSED,\tmp
+	or		\tmp,\pte,\pte
+	STREG		\pte,0(\ptep)
+	pte_unlock	\spc,\tmp,\tmp1
+3:
 	.endm
 
 	/* Set the dirty bit (and accessed bit).  No need to be
@@ -605,7 +612,7 @@
 	depdi		0,31,32,\tmp
 #endif
 	copy		\va,\tmp1
-	DEPI		0,31,23,\tmp1
+	depi		0,31,23,\tmp1
 	cmpb,COND(<>),n	\tmp,\tmp1,\fault
 	ldi		(_PAGE_DIRTY|_PAGE_WRITE|_PAGE_READ),\prot
 	depd,z		\prot,8,7,\prot
@@ -622,6 +629,39 @@
 	or		%r26,%r0,\pte
 	.endm 
 
+	/* Save PTE for recheck if SMP.  */
+	.macro		save_pte	pte,tmp
+#ifdef CONFIG_SMP
+	copy		\pte,\tmp
+#endif
+	.endm
+
+	/* Reload the PTE and purge the data TLB entry if the new
+	   value is different from the old one.  */
+	.macro		dtlb_recheck	ptep,old_pte,spc,va,tmp
+#ifdef CONFIG_SMP
+	LDREG		%r0(\ptep),\tmp
+	cmpb,COND(=),n	\old_pte,\tmp,1f
+	mfsp		%sr1,\tmp
+	mtsp		\spc,%sr1
+	pdtlb,l		%r0(%sr1,\va)
+	mtsp		\tmp,%sr1
+1:
+#endif
+	.endm
+
+	.macro		itlb_recheck	ptep,old_pte,spc,va,tmp
+#ifdef CONFIG_SMP
+	LDREG		%r0(\ptep),\tmp
+	cmpb,COND(=),n	\old_pte,\tmp,1f
+	mfsp		%sr1,\tmp
+	mtsp		\spc,%sr1
+	pitlb,l		%r0(%sr1,\va)
+	mtsp		\tmp,%sr1
+1:
+#endif
+	.endm
+
 
 	/*
 	 * Align fault_vector_20 on 4K boundary so that both
@@ -758,6 +798,10 @@ ENTRY(__kernel_thread)
 
 	STREG	%r22, PT_GR22(%r1)	/* save r22 (arg5) */
 	copy	%r0, %r22		/* user_tid */
+	copy	%r0, %r21		/* child_tid */
+#else
+	stw	%r0, -52(%r30)	     	/* user_tid */
+	stw	%r0, -56(%r30)	     	/* child_tid */
 #endif
 	STREG	%r26, PT_GR26(%r1)  /* Store function & argument for child */
 	STREG	%r25, PT_GR25(%r1)
@@ -765,7 +809,7 @@ ENTRY(__kernel_thread)
 	ldo	CLONE_VM(%r26), %r26   /* Force CLONE_VM since only init_mm */
 	or	%r26, %r24, %r26      /* will have kernel mappings.	 */
 	ldi	1, %r25			/* stack_start, signals kernel thread */
-	stw	%r0, -52(%r30)	     	/* user_tid */
+	ldi	0, %r23			/* child_stack_size */
 #ifdef CONFIG_64BIT
 	ldo	-16(%r30),%r29		/* Reference param save area */
 #endif
@@ -972,7 +1016,10 @@ intr_check_sig:
 	BL	do_notify_resume,%r2
 	copy	%r16, %r26			/* struct pt_regs *regs */
 
-	b,n	intr_check_sig
+	mfctl   %cr30,%r16		/* Reload */
+	LDREG	TI_TASK(%r16), %r16	/* thread_info -> task_struct */
+	b	intr_check_sig
+	ldo	TASK_REGS(%r16),%r16
 
 intr_restore:
 	copy            %r16,%r29
@@ -997,13 +1044,6 @@ intr_restore:
 
 	rfi
 	nop
-	nop
-	nop
-	nop
-	nop
-	nop
-	nop
-	nop
 
 #ifndef CONFIG_PREEMPT
 # define intr_do_preempt	intr_restore
@@ -1026,14 +1066,12 @@ intr_do_resched:
 	ldo	-16(%r30),%r29		/* Reference param save area */
 #endif
 
-	ldil	L%intr_check_sig, %r2
-#ifndef CONFIG_64BIT
-	b	schedule
-#else
-	load32	schedule, %r20
-	bv	%r0(%r20)
-#endif
-	ldo	R%intr_check_sig(%r2), %r2
+	BL	schedule,%r2
+	nop
+	mfctl   %cr30,%r16		/* Reload */
+	LDREG	TI_TASK(%r16), %r16	/* thread_info -> task_struct */
+	b	intr_check_sig
+	ldo	TASK_REGS(%r16),%r16
 
 	/* preempt the current task on returning to kernel
 	 * mode from an interrupt, iff need_resched is set,
@@ -1214,11 +1252,12 @@ dtlb_miss_20w:
 
 	L3_ptep		ptp,pte,t0,va,dtlb_check_alias_20w
 
-	update_ptep	ptp,pte,t0,t1
+	update_ptep	ptp,pte,spc,t0,t1,dtlb_check_alias_20w
 
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
-	
 	idtlbt          pte,prot
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1238,11 +1277,10 @@ nadtlb_miss_20w:
 
 	L3_ptep		ptp,pte,t0,va,nadtlb_check_flush_20w
 
-	update_ptep	ptp,pte,t0,t1
-
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
-
 	idtlbt          pte,prot
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1272,8 +1310,9 @@ dtlb_miss_11:
 
 	L2_ptep		ptp,pte,t0,va,dtlb_check_alias_11
 
-	update_ptep	ptp,pte,t0,t1
+	update_ptep	ptp,pte,spc,t0,t1,dtlb_check_alias_11
 
+	save_pte	pte,t1
 	make_insert_tlb_11	spc,pte,prot
 
 	mfsp		%sr1,t0  /* Save sr1 so we can use it in tlb inserts */
@@ -1283,6 +1322,7 @@ dtlb_miss_11:
 	idtlbp		prot,(%sr1,va)
 
 	mtsp		t0, %sr1	/* Restore sr1 */
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1321,11 +1361,9 @@ nadtlb_miss_11:
 
 	L2_ptep		ptp,pte,t0,va,nadtlb_check_flush_11
 
-	update_ptep	ptp,pte,t0,t1
-
+	save_pte	pte,t1
 	make_insert_tlb_11	spc,pte,prot
 
-
 	mfsp		%sr1,t0  /* Save sr1 so we can use it in tlb inserts */
 	mtsp		spc,%sr1
 
@@ -1333,6 +1371,7 @@ nadtlb_miss_11:
 	idtlbp		prot,(%sr1,va)
 
 	mtsp		t0, %sr1	/* Restore sr1 */
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1368,13 +1407,15 @@ dtlb_miss_20:
 
 	L2_ptep		ptp,pte,t0,va,dtlb_check_alias_20
 
-	update_ptep	ptp,pte,t0,t1
+	update_ptep	ptp,pte,spc,t0,t1,dtlb_check_alias_20
 
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
 
 	f_extend	pte,t0
 
 	idtlbt          pte,prot
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1394,13 +1435,13 @@ nadtlb_miss_20:
 
 	L2_ptep		ptp,pte,t0,va,nadtlb_check_flush_20
 
-	update_ptep	ptp,pte,t0,t1
-
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
 
 	f_extend	pte,t0
 	
         idtlbt          pte,prot
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1508,11 +1549,12 @@ itlb_miss_20w:
 
 	L3_ptep		ptp,pte,t0,va,itlb_fault
 
-	update_ptep	ptp,pte,t0,t1
+	update_ptep	ptp,pte,spc,t0,t1,itlb_fault
 
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
-	
 	iitlbt          pte,prot
+	itlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1526,8 +1568,9 @@ itlb_miss_11:
 
 	L2_ptep		ptp,pte,t0,va,itlb_fault
 
-	update_ptep	ptp,pte,t0,t1
+	update_ptep	ptp,pte,spc,t0,t1,itlb_fault
 
+	save_pte	pte,t1
 	make_insert_tlb_11	spc,pte,prot
 
 	mfsp		%sr1,t0  /* Save sr1 so we can use it in tlb inserts */
@@ -1537,6 +1580,7 @@ itlb_miss_11:
 	iitlbp		prot,(%sr1,va)
 
 	mtsp		t0, %sr1	/* Restore sr1 */
+	itlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1548,13 +1592,15 @@ itlb_miss_20:
 
 	L2_ptep		ptp,pte,t0,va,itlb_fault
 
-	update_ptep	ptp,pte,t0,t1
+	update_ptep	ptp,pte,spc,t0,t1,itlb_fault
 
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
 
 	f_extend	pte,t0	
 
 	iitlbt          pte,prot
+	itlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1570,29 +1616,14 @@ dbit_trap_20w:
 
 	L3_ptep		ptp,pte,t0,va,dbit_fault
 
-#ifdef CONFIG_SMP
-	cmpib,COND(=),n        0,spc,dbit_nolock_20w
-	load32		PA(pa_dbit_lock),t0
-
-dbit_spin_20w:
-	LDCW		0(t0),t1
-	cmpib,COND(=)         0,t1,dbit_spin_20w
-	nop
-
-dbit_nolock_20w:
-#endif
-	update_dirty	ptp,pte,t1
+	pte_lock	ptp,pte,spc,t0,t1,dbit_fault
+	update_dirty	ptp,pte,t0
+	pte_unlock	spc,t0,t1
 
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
-		
 	idtlbt          pte,prot
-#ifdef CONFIG_SMP
-	cmpib,COND(=),n        0,spc,dbit_nounlock_20w
-	ldi             1,t1
-	stw             t1,0(t0)
-
-dbit_nounlock_20w:
-#endif
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1606,35 +1637,21 @@ dbit_trap_11:
 
 	L2_ptep		ptp,pte,t0,va,dbit_fault
 
-#ifdef CONFIG_SMP
-	cmpib,COND(=),n        0,spc,dbit_nolock_11
-	load32		PA(pa_dbit_lock),t0
-
-dbit_spin_11:
-	LDCW		0(t0),t1
-	cmpib,=         0,t1,dbit_spin_11
-	nop
-
-dbit_nolock_11:
-#endif
-	update_dirty	ptp,pte,t1
+	pte_lock	ptp,pte,spc,t0,t1,dbit_fault
+	update_dirty	ptp,pte,t0
+	pte_unlock	spc,t0,t1
 
+	save_pte	pte,t1
 	make_insert_tlb_11	spc,pte,prot
 
-	mfsp            %sr1,t1  /* Save sr1 so we can use it in tlb inserts */
+	mfsp            %sr1,t0  /* Save sr1 so we can use it in tlb inserts */
 	mtsp		spc,%sr1
 
 	idtlba		pte,(%sr1,va)
 	idtlbp		prot,(%sr1,va)
 
-	mtsp            t1, %sr1     /* Restore sr1 */
-#ifdef CONFIG_SMP
-	cmpib,COND(=),n        0,spc,dbit_nounlock_11
-	ldi             1,t1
-	stw             t1,0(t0)
-
-dbit_nounlock_11:
-#endif
+	mtsp            t0, %sr1     /* Restore sr1 */
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1646,32 +1663,17 @@ dbit_trap_20:
 
 	L2_ptep		ptp,pte,t0,va,dbit_fault
 
-#ifdef CONFIG_SMP
-	cmpib,COND(=),n        0,spc,dbit_nolock_20
-	load32		PA(pa_dbit_lock),t0
-
-dbit_spin_20:
-	LDCW		0(t0),t1
-	cmpib,=         0,t1,dbit_spin_20
-	nop
-
-dbit_nolock_20:
-#endif
-	update_dirty	ptp,pte,t1
+	pte_lock	ptp,pte,spc,t0,t1,dbit_fault
+	update_dirty	ptp,pte,t0
+	pte_unlock	spc,t0,t1
 
+	save_pte	pte,t1
 	make_insert_tlb	spc,pte,prot
 
-	f_extend	pte,t1
+	f_extend	pte,t0
 	
         idtlbt          pte,prot
-
-#ifdef CONFIG_SMP
-	cmpib,COND(=),n        0,spc,dbit_nounlock_20
-	ldi             1,t1
-	stw             t1,0(t0)
-
-dbit_nounlock_20:
-#endif
+	dtlb_recheck	ptp,t1,spc,va,t0
 
 	rfir
 	nop
@@ -1772,9 +1774,9 @@ ENTRY(sys_fork_wrapper)
 	ldo	-16(%r30),%r29		/* Reference param save area */
 #endif
 
-	/* These are call-clobbered registers and therefore
-	   also syscall-clobbered (we hope). */
-	STREG	%r2,PT_GR19(%r1)	/* save for child */
+	STREG	%r2,PT_SYSCALL_RP(%r1)
+
+	/* WARNING - Clobbers r21, userspace must save! */
 	STREG	%r30,PT_GR21(%r1)
 
 	LDREG	PT_GR30(%r1),%r25
@@ -1804,7 +1806,7 @@ ENTRY(child_return)
 	nop
 
 	LDREG	TI_TASK-THREAD_SZ_ALGN-FRAME_SIZE-FRAME_SIZE(%r30), %r1
-	LDREG	TASK_PT_GR19(%r1),%r2
+	LDREG	TASK_PT_SYSCALL_RP(%r1),%r2
 	b	wrapper_exit
 	copy	%r0,%r28
 ENDPROC(child_return)
@@ -1823,8 +1825,9 @@ ENTRY(sys_clone_wrapper)
 	ldo	-16(%r30),%r29		/* Reference param save area */
 #endif
 
-	/* WARNING - Clobbers r19 and r21, userspace must save these! */
-	STREG	%r2,PT_GR19(%r1)	/* save for child */
+	STREG	%r2,PT_SYSCALL_RP(%r1)
+
+	/* WARNING - Clobbers r21, userspace must save! */
 	STREG	%r30,PT_GR21(%r1)
 	BL	sys_clone,%r2
 	copy	%r1,%r24
@@ -1847,7 +1850,9 @@ ENTRY(sys_vfork_wrapper)
 	ldo	-16(%r30),%r29		/* Reference param save area */
 #endif
 
-	STREG	%r2,PT_GR19(%r1)	/* save for child */
+	STREG	%r2,PT_SYSCALL_RP(%r1)
+
+	/* WARNING - Clobbers r21, userspace must save! */
 	STREG	%r30,PT_GR21(%r1)
 
 	BL	sys_vfork,%r2
@@ -2076,9 +2081,10 @@ syscall_restore:
 	LDREG	TASK_PT_GR31(%r1),%r31	   /* restore syscall rp */
 
 	/* NOTE: We use rsm/ssm pair to make this operation atomic */
+	LDREG   TASK_PT_GR30(%r1),%r1              /* Get user sp */
 	rsm     PSW_SM_I, %r0
-	LDREG   TASK_PT_GR30(%r1),%r30             /* restore user sp */
-	mfsp	%sr3,%r1			   /* Get users space id */
+	copy    %r1,%r30                           /* Restore user sp */
+	mfsp    %sr3,%r1                           /* Get user space id */
 	mtsp    %r1,%sr7                           /* Restore sr7 */
 	ssm     PSW_SM_I, %r0
 
diff --git a/arch/parisc/kernel/pacache.S b/arch/parisc/kernel/pacache.S
index 09b77b2..b2f0d3d 100644
--- a/arch/parisc/kernel/pacache.S
+++ b/arch/parisc/kernel/pacache.S
@@ -277,7 +277,7 @@ ENDPROC(flush_data_cache_local)
 
 	.align	16
 
-ENTRY(copy_user_page_asm)
+ENTRY(copy_page_asm)
 	.proc
 	.callinfo NO_CALLS
 	.entry
@@ -288,54 +288,54 @@ ENTRY(copy_user_page_asm)
 	 * GCC probably can do this just as well.
 	 */
 
-	ldd		0(%r25), %r19
+	ldd		0(%r25), %r20
 	ldi		(PAGE_SIZE / 128), %r1
 
 	ldw		64(%r25), %r0		/* prefetch 1 cacheline ahead */
 	ldw		128(%r25), %r0		/* prefetch 2 */
 
-1:	ldd		8(%r25), %r20
+1:	ldd		8(%r25), %r21
 	ldw		192(%r25), %r0		/* prefetch 3 */
 	ldw		256(%r25), %r0		/* prefetch 4 */
 
-	ldd		16(%r25), %r21
-	ldd		24(%r25), %r22
-	std		%r19, 0(%r26)
-	std		%r20, 8(%r26)
-
-	ldd		32(%r25), %r19
-	ldd		40(%r25), %r20
-	std		%r21, 16(%r26)
-	std		%r22, 24(%r26)
-
-	ldd		48(%r25), %r21
-	ldd		56(%r25), %r22
-	std		%r19, 32(%r26)
-	std		%r20, 40(%r26)
-
-	ldd		64(%r25), %r19
-	ldd		72(%r25), %r20
-	std		%r21, 48(%r26)
-	std		%r22, 56(%r26)
-
-	ldd		80(%r25), %r21
-	ldd		88(%r25), %r22
-	std		%r19, 64(%r26)
-	std		%r20, 72(%r26)
-
-	ldd		 96(%r25), %r19
-	ldd		104(%r25), %r20
-	std		%r21, 80(%r26)
-	std		%r22, 88(%r26)
-
-	ldd		112(%r25), %r21
-	ldd		120(%r25), %r22
-	std		%r19, 96(%r26)
-	std		%r20, 104(%r26)
+	ldd		16(%r25), %r22
+	ldd		24(%r25), %r24
+	std		%r20, 0(%r26)
+	std		%r21, 8(%r26)
+
+	ldd		32(%r25), %r20
+	ldd		40(%r25), %r21
+	std		%r22, 16(%r26)
+	std		%r24, 24(%r26)
+
+	ldd		48(%r25), %r22
+	ldd		56(%r25), %r24
+	std		%r20, 32(%r26)
+	std		%r21, 40(%r26)
+
+	ldd		64(%r25), %r20
+	ldd		72(%r25), %r21
+	std		%r22, 48(%r26)
+	std		%r24, 56(%r26)
+
+	ldd		80(%r25), %r22
+	ldd		88(%r25), %r24
+	std		%r20, 64(%r26)
+	std		%r21, 72(%r26)
+
+	ldd		96(%r25), %r20
+	ldd		104(%r25), %r21
+	std		%r22, 80(%r26)
+	std		%r24, 88(%r26)
+
+	ldd		112(%r25), %r22
+	ldd		120(%r25), %r24
+	std		%r20, 96(%r26)
+	std		%r21, 104(%r26)
 
 	ldo		128(%r25), %r25
-	std		%r21, 112(%r26)
-	std		%r22, 120(%r26)
+	std		%r22, 112(%r26)
+	std		%r24, 120(%r26)
 	ldo		128(%r26), %r26
 
 	/* conditional branches nullify on forward taken branch, and on
@@ -343,7 +343,7 @@ ENTRY(copy_user_page_asm)
 	 * The ldd should only get executed if the branch is taken.
 	 */
 	addib,COND(>),n	-1, %r1, 1b		/* bundle 10 */
-	ldd		0(%r25), %r19		/* start next loads */
+	ldd		0(%r25), %r20		/* start next loads */
 
 #else
 
@@ -354,52 +354,116 @@ ENTRY(copy_user_page_asm)
 	 * the full 64 bit register values on interrupt, we can't
 	 * use ldd/std on a 32 bit kernel.
 	 */
-	ldw		0(%r25), %r19
+	ldw		0(%r25), %r20
 	ldi		(PAGE_SIZE / 64), %r1
 
 1:
-	ldw		4(%r25), %r20
-	ldw		8(%r25), %r21
-	ldw		12(%r25), %r22
-	stw		%r19, 0(%r26)
-	stw		%r20, 4(%r26)
-	stw		%r21, 8(%r26)
-	stw		%r22, 12(%r26)
-	ldw		16(%r25), %r19
-	ldw		20(%r25), %r20
-	ldw		24(%r25), %r21
-	ldw		28(%r25), %r22
-	stw		%r19, 16(%r26)
-	stw		%r20, 20(%r26)
-	stw		%r21, 24(%r26)
-	stw		%r22, 28(%r26)
-	ldw		32(%r25), %r19
-	ldw		36(%r25), %r20
-	ldw		40(%r25), %r21
-	ldw		44(%r25), %r22
-	stw		%r19, 32(%r26)
-	stw		%r20, 36(%r26)
-	stw		%r21, 40(%r26)
-	stw		%r22, 44(%r26)
-	ldw		48(%r25), %r19
-	ldw		52(%r25), %r20
-	ldw		56(%r25), %r21
-	ldw		60(%r25), %r22
-	stw		%r19, 48(%r26)
-	stw		%r20, 52(%r26)
+	ldw		4(%r25), %r21
+	ldw		8(%r25), %r22
+	ldw		12(%r25), %r24
+	stw		%r20, 0(%r26)
+	stw		%r21, 4(%r26)
+	stw		%r22, 8(%r26)
+	stw		%r24, 12(%r26)
+	ldw		16(%r25), %r20
+	ldw		20(%r25), %r21
+	ldw		24(%r25), %r22
+	ldw		28(%r25), %r24
+	stw		%r20, 16(%r26)
+	stw		%r21, 20(%r26)
+	stw		%r22, 24(%r26)
+	stw		%r24, 28(%r26)
+	ldw		32(%r25), %r20
+	ldw		36(%r25), %r21
+	ldw		40(%r25), %r22
+	ldw		44(%r25), %r24
+	stw		%r20, 32(%r26)
+	stw		%r21, 36(%r26)
+	stw		%r22, 40(%r26)
+	stw		%r24, 44(%r26)
+	ldw		48(%r25), %r20
+	ldw		52(%r25), %r21
+	ldw		56(%r25), %r22
+	ldw		60(%r25), %r24
+	stw		%r20, 48(%r26)
+	stw		%r21, 52(%r26)
 	ldo		64(%r25), %r25
-	stw		%r21, 56(%r26)
-	stw		%r22, 60(%r26)
+	stw		%r22, 56(%r26)
+	stw		%r24, 60(%r26)
 	ldo		64(%r26), %r26
 	addib,COND(>),n	-1, %r1, 1b
-	ldw		0(%r25), %r19
+	ldw		0(%r25), %r20
 #endif
 	bv		%r0(%r2)
 	nop
 	.exit
 
 	.procend
-ENDPROC(copy_user_page_asm)
+ENDPROC(copy_page_asm)
+
+ENTRY(clear_page_asm)
+	.proc
+	.callinfo NO_CALLS
+	.entry
+
+#ifdef CONFIG_64BIT
+	ldi		(PAGE_SIZE / 128), %r1
+
+1:
+	std		%r0, 0(%r26)
+	std		%r0, 8(%r26)
+	std		%r0, 16(%r26)
+	std		%r0, 24(%r26)
+	std		%r0, 32(%r26)
+	std		%r0, 40(%r26)
+	std		%r0, 48(%r26)
+	std		%r0, 56(%r26)
+	std		%r0, 64(%r26)
+	std		%r0, 72(%r26)
+	std		%r0, 80(%r26)
+	std		%r0, 88(%r26)
+	std		%r0, 96(%r26)
+	std		%r0, 104(%r26)
+	std		%r0, 112(%r26)
+	std		%r0, 120(%r26)
+
+	/* Conditional branches nullify on forward taken branch, and on
+	 * non-taken backward branch. Note that .+4 is a backwards branch.
+	 */
+	addib,COND(>),n	-1, %r1, 1b
+	ldo		128(%r26), %r26
+
+#else
+
+	ldi		(PAGE_SIZE / 64), %r1
+
+1:
+	stw		%r0, 0(%r26)
+	stw		%r0, 4(%r26)
+	stw		%r0, 8(%r26)
+	stw		%r0, 12(%r26)
+	stw		%r0, 16(%r26)
+	stw		%r0, 20(%r26)
+	stw		%r0, 24(%r26)
+	stw		%r0, 28(%r26)
+	stw		%r0, 32(%r26)
+	stw		%r0, 36(%r26)
+	stw		%r0, 40(%r26)
+	stw		%r0, 44(%r26)
+	stw		%r0, 48(%r26)
+	stw		%r0, 52(%r26)
+	stw		%r0, 56(%r26)
+	stw		%r0, 60(%r26)
+	addib,COND(>),n	-1, %r1, 1b
+	ldo		64(%r26), %r26
+#endif
+
+	bv		%r0(%r2)
+	nop
+	.exit
+
+	.procend
+ENDPROC(clear_page_asm)
 
 /*
  * NOTE: Code in clear_user_page has a hard coded dependency on the
@@ -422,7 +486,6 @@ ENDPROC(copy_user_page_asm)
  *          %r23 physical page (shifted for tlb insert) of "from" translation
  */
 
-#if 0
 
 	/*
 	 * We can't do this since copy_user_page is used to bring in
@@ -449,9 +512,9 @@ ENTRY(copy_user_page_asm)
 	ldil		L%(TMPALIAS_MAP_START), %r28
 	/* FIXME for different page sizes != 4k */
 #ifdef CONFIG_64BIT
-	extrd,u		%r26,56,32, %r26		/* convert phys addr to tlb insert format */
-	extrd,u		%r23,56,32, %r23		/* convert phys addr to tlb insert format */
-	depd		%r24,63,22, %r28		/* Form aliased virtual address 'to' */
+	extrd,u		%r26,56,32, %r26	/* convert phys addr to tlb insert format */
+	extrd,u		%r23,56,32, %r23	/* convert phys addr to tlb insert format */
+	depd		%r24,63,22, %r28	/* Form aliased virtual address 'to' */
 	depdi		0, 63,12, %r28		/* Clear any offset bits */
 	copy		%r28, %r29
 	depdi		1, 41,1, %r29		/* Form aliased virtual address 'from' */
@@ -464,12 +527,88 @@ ENTRY(copy_user_page_asm)
 	depwi		1, 9,1, %r29		/* Form aliased virtual address 'from' */
 #endif
 
+#ifdef CONFIG_SMP
+	ldil		L%pa_tlb_lock, %r1
+	ldo		R%pa_tlb_lock(%r1), %r24
+	rsm		PSW_SM_I, %r22
+1:
+	LDCW		0(%r24),%r25
+	cmpib,COND(=)	0,%r25,1b
+	nop
+#endif
+
 	/* Purge any old translations */
 
 	pdtlb		0(%r28)
 	pdtlb		0(%r29)
 
-	ldi		64, %r1
+#ifdef CONFIG_SMP
+	ldi		1,%r25
+	stw		%r25,0(%r24)
+	mtsm		%r22
+#endif
+
+#ifdef CONFIG_64BIT
+
+	ldd		0(%r29), %r20
+	ldi		(PAGE_SIZE / 128), %r1
+
+	ldw		64(%r29), %r0		/* prefetch 1 cacheline ahead */
+	ldw		128(%r29), %r0		/* prefetch 2 */
+
+2:	ldd		8(%r29), %r21
+	ldw		192(%r29), %r0		/* prefetch 3 */
+	ldw		256(%r29), %r0		/* prefetch 4 */
+
+	ldd		16(%r29), %r22
+	ldd		24(%r29), %r24
+	std		%r20, 0(%r28)
+	std		%r21, 8(%r28)
+
+	ldd		32(%r29), %r20
+	ldd		40(%r29), %r21
+	std		%r22, 16(%r28)
+	std		%r24, 24(%r28)
+
+	ldd		48(%r29), %r22
+	ldd		56(%r29), %r24
+	std		%r20, 32(%r28)
+	std		%r21, 40(%r28)
+
+	ldd		64(%r29), %r20
+	ldd		72(%r29), %r21
+	std		%r22, 48(%r28)
+	std		%r24, 56(%r28)
+
+	ldd		80(%r29), %r22
+	ldd		88(%r29), %r24
+	std		%r20, 64(%r28)
+	std		%r21, 72(%r28)
+
+	ldd		96(%r29), %r20
+	ldd		104(%r29), %r21
+	std		%r22, 80(%r28)
+	std		%r24, 88(%r28)
+
+	ldd		112(%r29), %r22
+	ldd		120(%r29), %r24
+	std		%r20, 96(%r28)
+	std		%r21, 104(%r28)
+
+	ldo		128(%r29), %r29
+	std		%r22, 112(%r28)
+	std		%r24, 120(%r28)
+
+	fdc		0(%r28)
+	ldo		64(%r28), %r28
+	fdc		0(%r28)
+	ldo		64(%r28), %r28
+	addib,COND(>),n	-1, %r1, 2b
+	ldd		0(%r29), %r20		/* start next loads */
+
+#else
+
+	ldi		(PAGE_SIZE / 64), %r1
 
 	/*
 	 * This loop is optimized for PCXL/PCXL2 ldw/ldw and stw/stw
@@ -480,53 +619,57 @@ ENTRY(copy_user_page_asm)
 	 * use ldd/std on a 32 bit kernel.
 	 */
 
-
-1:
-	ldw		0(%r29), %r19
-	ldw		4(%r29), %r20
-	ldw		8(%r29), %r21
-	ldw		12(%r29), %r22
-	stw		%r19, 0(%r28)
-	stw		%r20, 4(%r28)
-	stw		%r21, 8(%r28)
-	stw		%r22, 12(%r28)
-	ldw		16(%r29), %r19
-	ldw		20(%r29), %r20
-	ldw		24(%r29), %r21
-	ldw		28(%r29), %r22
-	stw		%r19, 16(%r28)
-	stw		%r20, 20(%r28)
-	stw		%r21, 24(%r28)
-	stw		%r22, 28(%r28)
-	ldw		32(%r29), %r19
-	ldw		36(%r29), %r20
-	ldw		40(%r29), %r21
-	ldw		44(%r29), %r22
-	stw		%r19, 32(%r28)
-	stw		%r20, 36(%r28)
-	stw		%r21, 40(%r28)
-	stw		%r22, 44(%r28)
-	ldw		48(%r29), %r19
-	ldw		52(%r29), %r20
-	ldw		56(%r29), %r21
-	ldw		60(%r29), %r22
-	stw		%r19, 48(%r28)
-	stw		%r20, 52(%r28)
-	stw		%r21, 56(%r28)
-	stw		%r22, 60(%r28)
-	ldo		64(%r28), %r28
-	addib,COND(>)		-1, %r1,1b
+2:
+	ldw		0(%r29), %r20
+	ldw		4(%r29), %r21
+	ldw		8(%r29), %r22
+	ldw		12(%r29), %r24
+	stw		%r20, 0(%r28)
+	stw		%r21, 4(%r28)
+	stw		%r22, 8(%r28)
+	stw		%r24, 12(%r28)
+	ldw		16(%r29), %r20
+	ldw		20(%r29), %r21
+	ldw		24(%r29), %r22
+	ldw		28(%r29), %r24
+	stw		%r20, 16(%r28)
+	stw		%r21, 20(%r28)
+	stw		%r22, 24(%r28)
+	stw		%r24, 28(%r28)
+	ldw		32(%r29), %r20
+	ldw		36(%r29), %r21
+	ldw		40(%r29), %r22
+	ldw		44(%r29), %r24
+	stw		%r20, 32(%r28)
+	stw		%r21, 36(%r28)
+	stw		%r22, 40(%r28)
+	stw		%r24, 44(%r28)
+	ldw		48(%r29), %r20
+	ldw		52(%r29), %r21
+	ldw		56(%r29), %r22
+	ldw		60(%r29), %r24
+	stw		%r20, 48(%r28)
+	stw		%r21, 52(%r28)
+	stw		%r22, 56(%r28)
+	stw		%r24, 60(%r28)
+	fdc		0(%r28)
+	ldo		32(%r28), %r28
+	fdc		0(%r28)
+	ldo		32(%r28), %r28
+	addib,COND(>)		-1, %r1,2b
 	ldo		64(%r29), %r29
 
+#endif
+
+	sync
 	bv		%r0(%r2)
 	nop
 	.exit
 
 	.procend
 ENDPROC(copy_user_page_asm)
-#endif
 
-ENTRY(__clear_user_page_asm)
+ENTRY(clear_user_page_asm)
 	.proc
 	.callinfo NO_CALLS
 	.entry
@@ -548,17 +691,33 @@ ENTRY(__clear_user_page_asm)
 	depwi		0, 31,12, %r28		/* Clear any offset bits */
 #endif
 
+#ifdef CONFIG_SMP
+	ldil		L%pa_tlb_lock, %r1
+	ldo		R%pa_tlb_lock(%r1), %r24
+	rsm		PSW_SM_I, %r22
+1:
+	LDCW		0(%r24),%r25
+	cmpib,COND(=)	0,%r25,1b
+	nop
+#endif
+
 	/* Purge any old translation */
 
 	pdtlb		0(%r28)
 
+#ifdef CONFIG_SMP
+	ldi		1,%r25
+	stw		%r25,0(%r24)
+	mtsm		%r22
+#endif
+
 #ifdef CONFIG_64BIT
 	ldi		(PAGE_SIZE / 128), %r1
 
 	/* PREFETCH (Write) has not (yet) been proven to help here */
 	/* #define	PREFETCHW_OP	ldd		256(%0), %r0 */
 
-1:	std		%r0, 0(%r28)
+2:	std		%r0, 0(%r28)
 	std		%r0, 8(%r28)
 	std		%r0, 16(%r28)
 	std		%r0, 24(%r28)
@@ -574,13 +733,13 @@ ENTRY(__clear_user_page_asm)
 	std		%r0, 104(%r28)
 	std		%r0, 112(%r28)
 	std		%r0, 120(%r28)
-	addib,COND(>)		-1, %r1, 1b
+	addib,COND(>)		-1, %r1, 2b
 	ldo		128(%r28), %r28
 
 #else	/* ! CONFIG_64BIT */
 	ldi		(PAGE_SIZE / 64), %r1
 
-1:
+2:
 	stw		%r0, 0(%r28)
 	stw		%r0, 4(%r28)
 	stw		%r0, 8(%r28)
@@ -597,7 +756,7 @@ ENTRY(__clear_user_page_asm)
 	stw		%r0, 52(%r28)
 	stw		%r0, 56(%r28)
 	stw		%r0, 60(%r28)
-	addib,COND(>)		-1, %r1, 1b
+	addib,COND(>)		-1, %r1, 2b
 	ldo		64(%r28), %r28
 #endif	/* CONFIG_64BIT */
 
@@ -606,7 +765,7 @@ ENTRY(__clear_user_page_asm)
 	.exit
 
 	.procend
-ENDPROC(__clear_user_page_asm)
+ENDPROC(clear_user_page_asm)
 
 ENTRY(flush_kernel_dcache_page_asm)
 	.proc
diff --git a/arch/parisc/kernel/parisc_ksyms.c b/arch/parisc/kernel/parisc_ksyms.c
index df65366..a5314df 100644
--- a/arch/parisc/kernel/parisc_ksyms.c
+++ b/arch/parisc/kernel/parisc_ksyms.c
@@ -159,4 +159,5 @@ EXPORT_SYMBOL(_mcount);
 #endif
 
 /* from pacache.S -- needed for copy_page */
-EXPORT_SYMBOL(copy_user_page_asm);
+EXPORT_SYMBOL(copy_page_asm);
+EXPORT_SYMBOL(clear_page_asm);
diff --git a/arch/parisc/kernel/setup.c b/arch/parisc/kernel/setup.c
index cb71f3d..84b3239 100644
--- a/arch/parisc/kernel/setup.c
+++ b/arch/parisc/kernel/setup.c
@@ -128,6 +128,14 @@ void __init setup_arch(char **cmdline_p)
 	printk(KERN_INFO "The 32-bit Kernel has started...\n");
 #endif
 
+	/* Consistency check on the size and alignments of our spinlocks */
+#ifdef CONFIG_SMP
+	BUILD_BUG_ON(sizeof(arch_spinlock_t) != __PA_LDCW_ALIGNMENT);
+	BUG_ON((unsigned long)&__atomic_hash[0] & (__PA_LDCW_ALIGNMENT-1));
+	BUG_ON((unsigned long)&__atomic_hash[1] & (__PA_LDCW_ALIGNMENT-1));
+#endif
+	BUILD_BUG_ON((1<<L1_CACHE_SHIFT) != L1_CACHE_BYTES);
+
 	pdc_console_init();
 
 #ifdef CONFIG_64BIT
diff --git a/arch/parisc/kernel/syscall.S b/arch/parisc/kernel/syscall.S
index f5f9602..68e75ce 100644
--- a/arch/parisc/kernel/syscall.S
+++ b/arch/parisc/kernel/syscall.S
@@ -47,18 +47,17 @@ ENTRY(linux_gateway_page)
 	KILL_INSN
 	.endr
 
-	/* ADDRESS 0xb0 to 0xb4, lws uses 1 insns for entry */
+	/* ADDRESS 0xb0 to 0xb8, lws uses two insns for entry */
 	/* Light-weight-syscall entry must always be located at 0xb0 */
 	/* WARNING: Keep this number updated with table size changes */
 #define __NR_lws_entries (2)
 
 lws_entry:
-	/* Unconditional branch to lws_start, located on the 
-	   same gateway page */
-	b,n	lws_start
+	gate	lws_start, %r0		/* increase privilege */
+	depi	3, 31, 2, %r31		/* Ensure we return into user mode. */
 
-	/* Fill from 0xb4 to 0xe0 */
-	.rept 11
+	/* Fill from 0xb8 to 0xe0 */
+	.rept 10
 	KILL_INSN
 	.endr
 
@@ -423,9 +422,6 @@ tracesys_sigexit:
 
 	*********************************************************/
 lws_start:
-	/* Gate and ensure we return to userspace */
-	gate	.+8, %r0
-	depi	3, 31, 2, %r31	/* Ensure we return to userspace */
 
 #ifdef CONFIG_64BIT
 	/* FIXME: If we are a 64-bit kernel just
@@ -442,7 +438,7 @@ lws_start:
 #endif	
 
         /* Is the lws entry number valid? */
-	comiclr,>>=	__NR_lws_entries, %r20, %r0
+	comiclr,>>	__NR_lws_entries, %r20, %r0
 	b,n	lws_exit_nosys
 
 	/* WARNING: Trashing sr2 and sr3 */
@@ -473,7 +469,7 @@ lws_exit:
 	/* now reset the lowest bit of sp if it was set */
 	xor	%r30,%r1,%r30
 #endif
-	be,n	0(%sr3, %r31)
+	be,n	0(%sr7, %r31)
 
 
 	
@@ -529,7 +525,6 @@ lws_compare_and_swap32:
 #endif
 
 lws_compare_and_swap:
-#ifdef CONFIG_SMP
 	/* Load start of lock table */
 	ldil	L%lws_lock_start, %r20
 	ldo	R%lws_lock_start(%r20), %r28
@@ -572,8 +567,6 @@ cas_wouldblock:
 	ldo	2(%r0), %r28				/* 2nd case */
 	b	lws_exit				/* Contended... */
 	ldo	-EAGAIN(%r0), %r21			/* Spin in userspace */
-#endif
-/* CONFIG_SMP */
 
 	/*
 		prev = *addr;
@@ -601,13 +594,11 @@ cas_action:
 1:	ldw	0(%sr3,%r26), %r28
 	sub,<>	%r28, %r25, %r0
 2:	stw	%r24, 0(%sr3,%r26)
-#ifdef CONFIG_SMP
 	/* Free lock */
 	stw	%r20, 0(%sr2,%r20)
-# if ENABLE_LWS_DEBUG
+#if ENABLE_LWS_DEBUG
 	/* Clear thread register indicator */
 	stw	%r0, 4(%sr2,%r20)
-# endif
 #endif
 	/* Return to userspace, set no error */
 	b	lws_exit
@@ -615,12 +606,10 @@ cas_action:
 
 3:		
 	/* Error occured on load or store */
-#ifdef CONFIG_SMP
 	/* Free lock */
 	stw	%r20, 0(%sr2,%r20)
-# if ENABLE_LWS_DEBUG
+#if ENABLE_LWS_DEBUG
 	stw	%r0, 4(%sr2,%r20)
-# endif
 #endif
 	b	lws_exit
 	ldo	-EFAULT(%r0),%r21	/* set errno */
@@ -672,7 +661,6 @@ ENTRY(sys_call_table64)
 END(sys_call_table64)
 #endif
 
-#ifdef CONFIG_SMP
 	/*
 		All light-weight-syscall atomic operations 
 		will use this set of locks 
@@ -694,8 +682,6 @@ ENTRY(lws_lock_start)
 	.endr
 END(lws_lock_start)
 	.previous
-#endif
-/* CONFIG_SMP for lws_lock_start */
 
 .end
 
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index 8b58bf0..804b024 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -47,7 +47,7 @@
 			  /*  dumped to the console via printk)          */
 
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-DEFINE_SPINLOCK(pa_dbit_lock);
+DEFINE_SPINLOCK(pa_pte_lock);
 #endif
 
 static void parisc_show_stack(struct task_struct *task, unsigned long *sp,
diff --git a/arch/parisc/lib/bitops.c b/arch/parisc/lib/bitops.c
index 353963d..bae6a86 100644
--- a/arch/parisc/lib/bitops.c
+++ b/arch/parisc/lib/bitops.c
@@ -15,6 +15,9 @@
 arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned = {
 	[0 ... (ATOMIC_HASH_SIZE-1)]  = __ARCH_SPIN_LOCK_UNLOCKED
 };
+arch_spinlock_t __atomic_user_hash[ATOMIC_HASH_SIZE] __lock_aligned = {
+	[0 ... (ATOMIC_HASH_SIZE-1)]  = __ARCH_SPIN_LOCK_UNLOCKED
+};
 #endif
 
 #ifdef CONFIG_64BIT
diff --git a/arch/parisc/math-emu/decode_exc.c b/arch/parisc/math-emu/decode_exc.c
index 3ca1c61..27a7492 100644
--- a/arch/parisc/math-emu/decode_exc.c
+++ b/arch/parisc/math-emu/decode_exc.c
@@ -342,6 +342,7 @@ decode_fpu(unsigned int Fpu_register[], unsigned int trap_counts[])
 		return SIGNALCODE(SIGFPE, FPE_FLTINV);
 	  case DIVISIONBYZEROEXCEPTION:
 		update_trap_counts(Fpu_register, aflags, bflags, trap_counts);
+		Clear_excp_register(exception_index);
 	  	return SIGNALCODE(SIGFPE, FPE_FLTDIV);
 	  case INEXACTEXCEPTION:
 		update_trap_counts(Fpu_register, aflags, bflags, trap_counts);
diff --git a/mm/memory.c b/mm/memory.c
index 09e4b1b..21c2916 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -616,7 +616,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotect(vma, src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27  1:29           ` John David Anglin
  2010-10-27  2:40             ` John David Anglin
@ 2010-10-27  4:50             ` James Bottomley
  2010-10-27  8:06               ` Mikulas Patocka
  2010-10-27  9:04               ` sym53c8xx_2 data corruption Mikulas Patocka
  1 sibling, 2 replies; 27+ messages in thread
From: James Bottomley @ 2010-10-27  4:50 UTC (permalink / raw)
  To: John David Anglin; +Cc: Mikulas Patocka, kyle, linux-parisc

On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > - shared memory --- there is SHMLBA boundary that causes that all
> > mappings are aligned to this boundary --- it is **WRONG** in the
> > current kernel!!!
> > It is only 4MB and should be 16MB!!!
> 
> James has said that the max for all PA-RISC implementations is
> 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> called for.  The alias boundary can be determined by the alias field
> in the D_conf return value.

Why is it I get blamed for everything cache related on parisc?  The
statement in the manuals that the equivalency modulus is 16MB was left
for future expansion.  However, given PA8900 is the last in the series,
there is no future expansion.  John Marvin (I think it was) from the HP
processor group confirmed that the largest equivalency modulus for any
produced parisc processor is 4MB, so that's what we use in the kernel.

James




* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27  4:50             ` James Bottomley
@ 2010-10-27  8:06               ` Mikulas Patocka
  2010-10-27  8:35                 ` Mikulas Patocka
  2010-10-27 14:07                 ` James Bottomley
  2010-10-27  9:04               ` sym53c8xx_2 data corruption Mikulas Patocka
  1 sibling, 2 replies; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-27  8:06 UTC (permalink / raw)
  To: James Bottomley; +Cc: John David Anglin, kyle, linux-parisc



On Tue, 26 Oct 2010, James Bottomley wrote:

> On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > > - shared memory --- there is SHMLBA boundary that causes that all
> > > mappings are aligned to this boundary --- it is **WRONG** in the
> > > current kernel!!!
> > > It is only 4MB and should be 16MB!!!
> > 
> > James has said that the max for all PA-RISC implementations is
> > 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> > called for.  The alias boundary can be determined by the alias field
> > in the D_conf return value.
> 
> Why is it I get blamed for everything cache related on parisc?  The

You don't get blamed, we're just trying to find bugs :)

> statement in the manuals that the equivalency modulus is 16MB was left
> for future expansion.  However, given PA8900 is the last in the series,
> there is no future expansion.  John Marvin (I think it was) from the HP
> processor group confirmed that the largest equivalency modulus for any
> produced parisc processor is 4MB, so that's what we use in the kernel.
> 
> James

The largest L2 cache size is 64MB --- so if the cache is 4-way 
associative, the equivalency distance is 16MB (as the manual says).

If the equivalency distance were 4MB, the L2 cache would have to be 16-way 
(or the cache would have to be physically indexed and you wouldn't have to 
care about its consistency at all).
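
To make the arithmetic concrete, here is a minimal sketch. The helper names
are made up for illustration, and it assumes a virtually indexed,
set-associative cache in which each way is indexed directly by the low
address bits, so the alias distance is simply size / ways:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: alias (equivalency) distance of a virtually
 * indexed, set-associative cache. Each way covers cache_bytes / ways
 * bytes of index space.
 */
uint64_t alias_distance(uint64_t cache_bytes, unsigned ways)
{
	assert(ways != 0 && cache_bytes % ways == 0);
	return cache_bytes / ways;
}

/*
 * Two virtual addresses select the same cache set only if they are
 * congruent modulo the alias distance.
 */
int same_cache_colour(uint64_t va1, uint64_t va2, uint64_t distance)
{
	return (va1 % distance) == (va2 % distance);
}
```

With 64MB and 4 ways this gives 16MB; with 64MB and 16 ways it gives 4MB.
Two mappings that are 4MB apart are then the same colour only in the
16-way case.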

Mikulas


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27  8:06               ` Mikulas Patocka
@ 2010-10-27  8:35                 ` Mikulas Patocka
  2010-10-27 14:18                   ` James Bottomley
  2010-10-27 14:07                 ` James Bottomley
  1 sibling, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-27  8:35 UTC (permalink / raw)
  To: James Bottomley; +Cc: John David Anglin, kyle, linux-parisc



On Wed, 27 Oct 2010, Mikulas Patocka wrote:

> 
> 
> On Tue, 26 Oct 2010, James Bottomley wrote:
> 
> > On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > > > - shared memory --- there is SHMLBA boundary that causes that all
> > > > mappings are aligned to this boundary --- it is **WRONG** in the
> > > > current kernel!!!
> > > > It is only 4MB and should be 16MB!!!
> > > 
> > > James has said that the max for all PA-RISC implementations is
> > > 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> > > called for.  The alias boundary can be determined by the alias field
> > > in the D_conf return value.
> > 
> > Why is it I get blamed for everything cache related on parisc?  The
> 
> You don't get blamed, we're just trying to find bugs :)
> 
> > statement in the manuals that the equivalency modulus is 16MB was left
> > for future expansion.  However, given PA8900 is the last in the series,
> > there is no future expansion.  John Marvin (I think it was) from the HP
> > processor group confirmed that the largest equivalency modulus for any
> > produced parisc processor is 4MB, so that's what we use in the kernel.
> > 
> > James
> 
> The largest L2 cache size is 64MB --- so if the cache is 4-way 
> associative, the equivalency distance is 16MB (as the manual says).
> 
> If the equivalency distance were 4MB, the L2 cache would have to be 16-way 
> (or the cache would have to be physically indexed and you wouldn't have to 
> care about its consistency at all).
> 
> Mikulas

BTW, note that the internal documentation may be wrong. There's a 
whitepaper about the PA8700 on HP's site that describes the dcache as 
375kB x 4 ways and the icache as 188kB x 4 ways. That is hard to implement 
(the CPU would have to do some arithmetic to calculate the cache index); I 
find it much more likely that the cache is really 3-way or 6-way (where 
the CPU could just take some bits of the address as the index).

That 64MB cache may be 16-way with 4MB modulus, but it looks less 
plausible than a 4-way cache with 16MB modulus.

Anyway, if you think that the modulus is 4MB, try it --- take a PA8900 with 
a 64MB cache, map shared memory into two processes at addresses that are 
congruent modulo 4MB (but not modulo 8MB), let one process write to odd 
bytes and the other to even bytes, and let each process check that it reads 
back the bytes it wrote. If the caches don't synchronize, you'll see 
corruption after the test has run for a while.
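
A rough sketch of such a test follows. It is untested on PA-RISC; the
function names, the temporary file path and the one-page size are my own
choices for illustration, and it assumes the kernel accepts MAP_FIXED
shared mappings at these addresses:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MB4 (4UL << 20)
#define MB8 (8UL << 20)
#define LEN 4096UL	/* one page, so both writers share cache lines */

/* Same offset modulo 4MB as `a`, but a different offset modulo 8MB. */
uintptr_t alias_of(uintptr_t a)
{
	return a + MB4;
}

/*
 * Write a changing pattern to every second byte, starting at `parity`,
 * then read it back. Returns the number of mismatches seen.
 */
long hammer(volatile unsigned char *p, size_t len, int parity, int rounds)
{
	long bad = 0;

	for (int r = 0; r < rounds; r++) {
		for (size_t i = parity; i < len; i += 2)
			p[i] = (unsigned char)(r + i);
		for (size_t i = parity; i < len; i += 2)
			bad += p[i] != (unsigned char)(r + i);
	}
	return bad;
}

/*
 * Returns the number of failures, or -1 if the setup failed.
 * Error paths leak the descriptor and mappings for brevity.
 */
long run_alias_test(int rounds)
{
	char path[] = "/tmp/aliasXXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, LEN) < 0)
		return -1;

	/* Reserve address space, then place two views of the file in it. */
	unsigned char *res = mmap(NULL, 2 * MB8, PROT_NONE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (res == MAP_FAILED)
		return -1;
	uintptr_t a = ((uintptr_t)res + MB8 - 1) & ~(MB8 - 1);
	uintptr_t b = alias_of(a);
	void *va = mmap((void *)a, LEN, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_FIXED, fd, 0);
	void *vb = mmap((void *)b, LEN, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_FIXED, fd, 0);
	if (va == MAP_FAILED || vb == MAP_FAILED)
		return -1;

	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)		/* child: odd bytes through the second view */
		_exit(hammer(vb, LEN, 1, rounds) ? 1 : 0);

	long bad = hammer(va, LEN, 0, rounds);	/* parent: even bytes */
	int status;
	waitpid(pid, &status, 0);
	if (!WIFEXITED(status) || WEXITSTATUS(status))
		bad++;
	munmap(res, 2 * MB8);
	close(fd);
	return bad;
}
```

Both views are mapped before the fork, so each process has both mappings
but touches only one of them. On coherent hardware run_alias_test() should
return 0; a positive result with a large round count would point at
aliasing between the two views.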

Mikulas


* sym53c8xx_2 data corruption
  2010-10-27  4:50             ` James Bottomley
  2010-10-27  8:06               ` Mikulas Patocka
@ 2010-10-27  9:04               ` Mikulas Patocka
  2010-10-27 14:46                 ` James Bottomley
  1 sibling, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-27  9:04 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-parisc, linux-scsi, matthew

Hi

I sent this about twice to linux-scsi and got no response, neither from 
the list nor from Matthew. So I'm sending it here. James, you are the 
maintainer of SCSI; could you please look at the patch and incorporate it 
into the kernel in this cycle?

The problem is that if the disk returns QUEUE FULL, the requests are 
aborted with DID_SOFT_ERROR (rather than DID_REQUEUE), which results in 
too few retries and premature errors. The errors happen mostly on writes, 
resulting in data corruption.
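
The difference between the two codes can be shown with a toy model. This
is not the midlayer code; the names are invented, and it assumes only what
is described here: a soft error uses up one of a small number of allowed
retries, while a requeue is resubmitted without touching that budget.

```c
#include <assert.h>

/* Toy stand-ins for the two completion codes discussed above. */
enum toy_result { TOY_SOFT_ERROR, TOY_REQUEUE };

/*
 * Model: the device bounces the command `busy_rounds` times before it
 * accepts it. Returns 1 if the command eventually completes, 0 if the
 * error is passed up to the caller.
 */
int toy_submit(enum toy_result how, int busy_rounds, int allowed)
{
	int retries = 0;

	while (busy_rounds-- > 0) {
		if (how == TOY_SOFT_ERROR && ++retries > allowed)
			return 0;	/* retry budget exhausted: I/O error */
		/* TOY_REQUEUE: back on the queue, budget untouched */
	}
	return 1;
}
```

With a budget of 5 retries, a command bounced 10 times fails in the soft
error case but completes in the requeue case.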

Mikulas

---

sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue.

When the controller encounters an error (including QUEUE FULL or BUSY status),
it aborts all not yet submitted requests in the function
sym_dequeue_from_squeue.

This function aborts them with DID_SOFT_ERROR.

If the disk has a full tag queue, the request that caused the overflow is
aborted with QUEUE FULL status (and the scsi midlayer properly retries it
until it is accepted by the disk), but other requests are aborted with
DID_SOFT_ERROR --- for them, the midlayer does just a few retries and then
signals the error up to sd.

The result is that a disk returning QUEUE FULL causes request failures.

The error was reproduced on a 53c895 with a COMPAQ BD03685A24 disk (rebranded
ST336607LC) with a command queue of 48 or 64 tags. The disk has 64 tags, but
under some access patterns it returns QUEUE FULL when there are fewer than
64 pending tags. The SCSI specification allows returning QUEUE FULL
at any time and it is up to the host to retry.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/scsi/sym53c8xx_2/sym_hipd.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c
===================================================================
--- linux-2.6.36-rc5-fast.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:25:59.000000000 +0200
+++ linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:26:27.000000000 +0200
@@ -3000,7 +3000,11 @@ sym_dequeue_from_squeue(struct sym_hcb *
 		if ((target == -1 || cp->target == target) &&
 		    (lun    == -1 || cp->lun    == lun)    &&
 		    (task   == -1 || cp->tag    == task)) {
+#ifdef SYM_OPT_HANDLE_DEVICE_QUEUEING
 			sym_set_cam_status(cp->cmd, DID_SOFT_ERROR);
+#else
+			sym_set_cam_status(cp->cmd, DID_REQUEUE);
+#endif
 			sym_remque(&cp->link_ccbq);
 			sym_insque_tail(&cp->link_ccbq, &np->comp_ccbq);
 		}



* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27  8:06               ` Mikulas Patocka
  2010-10-27  8:35                 ` Mikulas Patocka
@ 2010-10-27 14:07                 ` James Bottomley
  2010-10-27 16:28                   ` Mikulas Patocka
  1 sibling, 1 reply; 27+ messages in thread
From: James Bottomley @ 2010-10-27 14:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: John David Anglin, kyle, linux-parisc

On Wed, 2010-10-27 at 10:06 +0200, Mikulas Patocka wrote:
> 
> On Tue, 26 Oct 2010, James Bottomley wrote:
> 
> > On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > > > - shared memory --- there is SHMLBA boundary that causes that all
> > > > mappings are aligned to this boundary --- it is **WRONG** in the
> > > > current kernel!!!
> > > > It is only 4MB and should be 16MB!!!
> > > 
> > > James has said that the max for all PA-RISC implementations is
> > > 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> > > called for.  The alias boundary can be determined by the alias field
> > > in the D_conf return value.
> > 
> > Why is it I get blamed for everything cache related on parisc?  The
> 
> You don't get blamed, we're just trying to find bugs :)
> 
> > statement in the manuals that the equivalency modulus is 16MB was left
> > for future expansion.  However, given PA8900 is the last in the series,
> > there is no future expansion.  John Marvin (I think it was) from the HP
> > processor group confirmed that the largest equivalency modulus for any
> > produced parisc processor is 4MB, so that's what we use in the kernel.
> > 
> > James
> 
> The largest L2 cache size is 64MB --- so if the cache is 4-way 
> associative, the equivalency distance is 16MB (as the manual says).
> 
> If the equivalency distance were 4MB, the L2 cache would have to be 16-way 
> (or the cache would have to be physically indexed and you wouldn't have to 
> care about its consistency at all).

The L2 cache has no equivalency modulus ... it's not virtually indexed.

James




* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27  8:35                 ` Mikulas Patocka
@ 2010-10-27 14:18                   ` James Bottomley
  0 siblings, 0 replies; 27+ messages in thread
From: James Bottomley @ 2010-10-27 14:18 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: John David Anglin, kyle, linux-parisc

On Wed, 2010-10-27 at 10:35 +0200, Mikulas Patocka wrote:
> 
> On Wed, 27 Oct 2010, Mikulas Patocka wrote:
> 
> > 
> > 
> > On Tue, 26 Oct 2010, James Bottomley wrote:
> > 
> > > On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > > > > - shared memory --- there is SHMLBA boundary that causes that all
> > > > mappings 
> > > > > are aligned to this boundary --- it is **WRONG** in the current
> > > > kernel!!! 
> > > > > It is only 4MB and should be 16MB!!!
> > > > 
> > > > James has said that the max for all PA-RISC implementations is
> > > > 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> > > > called for.  The alias boundary can be determined by the alias field
> > > > in the D_conf return value.
> > > 
> > > Why is it I get blamed for everything cache related on parisc?  The
> > 
> > You don't get blamed, we're just trying to find bugs :)
> > 
> > > statement in the manuals that the equivalency modulus is 16MB was left
> > > for future expansion.  However, given PA8900 is the last in the series,
> > > there is no future expansion.  John Marvin (I think it was) from the HP
> > > processor group confirmed that the largest equivalency modulus for any
> > > produced parisc processor is 4MB, so that's what we use in the kernel.
> > > 
> > > James
> > 
> > The largest L2 cache size is 64MB --- so if the cache is 4-way 
> > associative, the equivalency distance is 16MB (as the manual says).
> > 
> > If the equivalency distance were 4MB, the L2 cache would have to be 16-way 
> > (or the cache would have to be physically indexed and you wouldn't have to 
> > care about its consistency at all).
> > 
> > Mikulas
> 
> BTW. note that that internal documentation may be wrong. There's a 
> whitepaper about PA8700 on HP site that describes the dcache as 
> 375kBx4ways and icache as 188kBx4ways. It is hard to implement this way 
> (the CPU'd have to do some mathematics to calculate the cache index), I 
> much more believe that the cache is really 3-way or 6-way (where the CPU 
> could just take some bits of the address as the index).

I don't really buy that, but it depends on internals of the chip we
don't know about.  You seem to be assuming some type of CAS arrangement
but low associativity caches frequently use hidden index bits to cycle
through for the content, so their associativity is most often power of
two.

James
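
The associativity question can be made concrete with a little arithmetic. The sketch below assumes the dcache totals 1.5MB (1536KiB), which is roughly what the 4 x 375kB in the quoted whitepaper adds up to; that total and the helper names are assumptions for illustration, and nothing here says how the real chip indexes its cache.

```c
#include <stdbool.h>

#define KIB 1024UL

/* True if x is a non-zero power of two. */
bool is_pow2(unsigned long x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Size in bytes of one way of a cache of total_bytes split into `ways`. */
unsigned long way_bytes(unsigned long total_bytes, unsigned int ways)
{
	return total_bytes / ways;
}
```

Under that assumption, 3 ways give 512KiB per way and 6 ways give 256KiB per way (both powers of two, so plain address bits could serve as the index), while 4 ways give 384KiB per way, which is not a power of two.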

> That 64MB cache may be 16-way with 4MB modulus, but it looks less 
> plausible than a 4-way cache with 16MB modulus.
> 
> Anyway, if you think that the modulus is 4MB, try it --- take PA8900 with 
> 64MB cache, map shared memory to two processes to addresses congruent with 
> 4MB (but not 8MB), let one process write to odd bytes and the other to 
> even bytes and let the processes check that they read the bytes that they 
> wrote. If the caches don't synchronize, you'll get a corruption after some 
> time of running this.
> 
> Mikulas
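
The experiment described in the quoted text could be sketched roughly as below. This is a minimal, untested-on-parisc sketch, not anyone's actual test program: the function names are invented, it attaches one SysV segment at two addresses a given distance apart and forks, so that parent and child write the even and odd bytes through different views. It assumes len <= distance and that distance is a multiple of SHMLBA (otherwise shmat() is expected to refuse the address).

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define MB (1024UL * 1024UL)

/* Write one value to every second byte starting at `first`, then read the
 * same bytes back.  Returns the number of mismatches seen. */
static long hammer(volatile unsigned char *p, size_t len, size_t first, int rounds)
{
	long bad = 0;

	for (int r = 0; r < rounds; r++) {
		unsigned char v = (unsigned char)(r + 1 + first);

		for (size_t i = first; i < len; i += 2)
			p[i] = v;
		for (size_t i = first; i < len; i += 2)
			if (p[i] != v)
				bad++;
	}
	return bad;
}

/* Returns 0 if no corruption was seen, >0 on corruption, -1 on setup failure. */
long alias_test(size_t distance, size_t len, int rounds)
{
	size_t span = 3 * distance;
	void *res = mmap(NULL, span, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (res == MAP_FAILED)
		return -1;
	/* Pick a distance-aligned address inside the reservation, then free
	 * the range again so that shmat() can attach there. */
	uintptr_t a = ((uintptr_t)res + distance - 1) / distance * distance;
	munmap(res, span);

	int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	if (id < 0)
		return -1;
	unsigned char *va = shmat(id, (void *)a, 0);
	unsigned char *vb = shmat(id, (void *)(a + distance), 0);
	shmctl(id, IPC_RMID, NULL);	/* segment is freed after the last detach */
	if (va == (void *)-1 || vb == (void *)-1) {
		if (va != (void *)-1)
			shmdt(va);
		if (vb != (void *)-1)
			shmdt(vb);
		return -1;
	}

	long bad = -1;
	pid_t pid = fork();
	if (pid == 0)			/* child: odd bytes through the second view */
		_exit(hammer(vb, len, 1, rounds) ? 1 : 0);
	if (pid > 0) {			/* parent: even bytes through the first view */
		int status = 0;

		bad = hammer(va, len, 0, rounds);
		waitpid(pid, &status, 0);
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			bad++;
	}
	shmdt(va);
	shmdt(vb);
	return bad;
}
```

A call such as alias_test(4 * MB, 65536, 1000) places the two views 4MB apart, i.e. congruent modulo 4MB but not modulo 8MB. A clean run proves little by itself; as the quoted text says, the test would have to run for some time on the hardware in question.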



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sym53c8xx_2 data corruption
  2010-10-27  9:04               ` sym53c8xx_2 data corruption Mikulas Patocka
@ 2010-10-27 14:46                 ` James Bottomley
  2010-10-27 16:19                   ` Mikulas Patocka
  0 siblings, 1 reply; 27+ messages in thread
From: James Bottomley @ 2010-10-27 14:46 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-parisc, linux-scsi, matthew

On Wed, 2010-10-27 at 11:04 +0200, Mikulas Patocka wrote:
> Hi
> 
> I sent this about twice to linux-scsi and got no response, neither from 
> conference nor from Matthew. So I'm sending it here, James, you are the 
> maintainer of SCSI, could you please look at the patch and incorporate it 
> to the kernel in this cycle?
> 
> The problem is that if the disk returns QUEUE FULL, the requests are 
> aborted with DID_SOFT_ERROR (rather than DID_REQUEUE), which results in 
> too few retries and premature errors. The errors happen mostly on writes, 
> resulting in data corruption.
> 
> Mikulas
> 
> ---
> 
> sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue.
> 
> When the controller encounters an error (including QUEUE FULL or BUSY status),
> it aborts all not yet submitted requests in the function
> sym_dequeue_from_squeue.
> 
> This function aborts them with DID_SOFT_ERROR.
> 
> If the disk has a full tag queue, the request that caused the overflow is
> aborted with QUEUE FULL status (and the scsi midlayer properly retries it
> until it is accepted by the disk), but other requests are aborted with
> DID_SOFT_ERROR --- for them, the midlayer does just a few retries and then
> signals the error up to sd.
> 
> The result is that disk returning QUEUE FULL causes request failures.
> 
> The error was reproduced on 53c895 with COMPAQ BD03685A24 disk (rebranded
> ST336607LC) with command queue 48 or 64 tags. The disk has 64 tags, but
> under some access patterns it returns QUEUE FULL when there are fewer than
> 64 pending tags. The SCSI specification allows returning QUEUE FULL
> anytime and it is up to the host to retry.

So the description isn't really complete.  The function is
sym_dequeue_from_squeue, which is used to requeue all unissued scbs when the
sequencer is restarted.  This doesn't just affect QUEUE_FULL, it affects
everything.  As long as the pushback is done before the status is
returned (which it looks like it is), I think the patch, once fixed,
looks fine.

The problem isn't the actual command which returns queue full ... it's
that the sequencer accepts and queues a pile of commands and then
returns all of them on the first queue full ... that means that deeply
queued commands in the sequencer issue queue can get returned >5 times
on multiple QUEUE_FULL conditions which would cause a failure.
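
The failure mode described above can be illustrated with a toy model. This is not kernel code: the enum, the limit of 5 (taken from the ">5 times" above) and the function name are hypothetical stand-ins, and the model only captures the idea that a bounce reported as a soft error is charged against a retry budget while a requeue is not.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two host status codes discussed here. */
enum toy_status { TOY_SOFT_ERROR, TOY_REQUEUE };

#define TOY_MAX_RETRIES 5	/* assumed retry budget */

/*
 * Returns true if a command that the driver hands back `bounces` times
 * with `status` would end up being failed to the upper layer.
 */
bool toy_command_fails(enum toy_status status, int bounces)
{
	int retries = 0;

	for (int i = 0; i < bounces; i++) {
		if (status == TOY_REQUEUE)
			continue;	/* requeued without charging a retry */
		if (++retries > TOY_MAX_RETRIES)
			return true;	/* budget exhausted: error goes up */
	}
	return false;
}
```

In this model a deeply queued command bounced six times as a soft error fails, while the same command bounced any number of times as a requeue does not.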

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> ---
>  drivers/scsi/sym53c8xx_2/sym_hipd.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> Index: linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c
> ===================================================================
> --- linux-2.6.36-rc5-fast.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:25:59.000000000 +0200
> +++ linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:26:27.000000000 +0200
> @@ -3000,7 +3000,11 @@ sym_dequeue_from_squeue(struct sym_hcb *
>  		if ((target == -1 || cp->target == target) &&
>  		    (lun    == -1 || cp->lun    == lun)    &&
>  		    (task   == -1 || cp->tag    == task)) {
> +#ifdef SYM_OPT_HANDLE_DEVICE_QUEUEING
>  			sym_set_cam_status(cp->cmd, DID_SOFT_ERROR);
> +#else
> +			sym_set_cam_status(cp->cmd, DID_REQUEUE);
> +#endif

So the ifdef is definitely wrong.  SYM_OPT_HANDLE_DEVICE_QUEUEING is a
leftover from when the driver did explicit internal queueing. Just make
this do DID_REQUEUE and I *think* everything will be OK.

There's a danger in that DID_REQUEUE will requeue forever, so this
working depends on the original failing command being returned with the
correct code (which I think it is, but more eyes looking at this would
be helpful).

James



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sym53c8xx_2 data corruption
  2010-10-27 14:46                 ` James Bottomley
@ 2010-10-27 16:19                   ` Mikulas Patocka
  2010-10-27 16:37                     ` James Bottomley
  2010-10-28  5:59                     ` Grant Grundler
  0 siblings, 2 replies; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-27 16:19 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-parisc, linux-scsi, matthew



On Wed, 27 Oct 2010, James Bottomley wrote:

> On Wed, 2010-10-27 at 11:04 +0200, Mikulas Patocka wrote:
> > Hi
> > 
> > I sent this about twice to linux-scsi and got no response, neither from 
> > conference nor from Matthew. So I'm sending it here, James, you are the 
> > maintainer of SCSI, could you please look at the patch and incorporate it 
> > to the kernel in this cycle?
> > 
> > The problem is that if the disk returns QUEUE FULL, the requests are 
> > aborted with DID_SOFT_ERROR (rather than DID_REQUEUE), which results in 
> > too few retries and premature errors. The errors happen mostly on writes, 
> > resulting in data corruption.
> > 
> > Mikulas
> > 
> > ---
> > 
> > sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue.
> > 
> > When the controller encounters an error (including QUEUE FULL or BUSY status),
> > it aborts all not yet submitted requests in the function
> > sym_dequeue_from_squeue.
> > 
> > This function aborts them with DID_SOFT_ERROR.
> > 
> > If the disk has a full tag queue, the request that caused the overflow is
> > aborted with QUEUE FULL status (and the scsi midlayer properly retries it
> > until it is accepted by the disk), but other requests are aborted with
> > DID_SOFT_ERROR --- for them, the midlayer does just a few retries and then
> > signals the error up to sd.
> > 
> > The result is that disk returning QUEUE FULL causes request failures.
> > 
> > The error was reproduced on 53c895 with COMPAQ BD03685A24 disk (rebranded
> > ST336607LC) with command queue 48 or 64 tags. The disk has 64 tags, but
> > under some access patterns it returns QUEUE FULL when there are fewer than
> > 64 pending tags. The SCSI specification allows returning QUEUE FULL
> > anytime and it is up to the host to retry.
> 
> So the description isn't really complete.  the function is
> dequeue_from_squeue which is used to requeue all unissued scbs when the
> sequencer is restarted.  This doesn't just affect QUEUE_FULL, it affects
> everything.  As long as the pushback is done before the status is
> returned (which it looks like it is), I think the patch after fixing
> looks fine.
>
> The problem isn't the actual command which returns queue full ... it's
> that the sequencer accepts and queues a pile of commands and then
> returns all of them on the first queue full ... that means that deeply
> queued commands in the sequencer issue queue can get returned >5 times
> on multiple QUEUE_FULL conditions which would cause a failure.

Sure, that's how I understood it from the code and debug prints. You can 
add this to the description.

That QUEUE_FULL command is actually retried fine; it is the following 
commands that are problematic.

> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > 
> > ---
> >  drivers/scsi/sym53c8xx_2/sym_hipd.c |    4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > Index: linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c
> > ===================================================================
> > --- linux-2.6.36-rc5-fast.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:25:59.000000000 +0200
> > +++ linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:26:27.000000000 +0200
> > @@ -3000,7 +3000,11 @@ sym_dequeue_from_squeue(struct sym_hcb *
> >  		if ((target == -1 || cp->target == target) &&
> >  		    (lun    == -1 || cp->lun    == lun)    &&
> >  		    (task   == -1 || cp->tag    == task)) {
> > +#ifdef SYM_OPT_HANDLE_DEVICE_QUEUEING
> >  			sym_set_cam_status(cp->cmd, DID_SOFT_ERROR);
> > +#else
> > +			sym_set_cam_status(cp->cmd, DID_REQUEUE);
> > +#endif
> 
> So the ifdef is definitely wrong.  SYM_OPT_HANDLE_DEVICE_QUEUEING is a
> leftover from when the driver did explicit internal queueing. Just make
> this do DID_REQUEUE and I *think* everything will be OK.

When I tried to enable SYM_OPT_HANDLE_DEVICE_QUEUEING, it didn't work; it 
crashed on something. It is a leftover from some other operating system 
that didn't handle requeuing in the midlayer.

When looking at the other parts of the code that handle this driver-internal 
requeuing, they expect DID_SOFT_ERROR there. But it doesn't matter, that 
code is useless for Linux and broken anyway.

> There's a danger in that DID_REQUEUE will requeue forever, so this
> working depends on the original failing command being returned with the
> correct code (which I think it is, but more eyes looking at this would
> be helpful).

Requeuing forever is dangerous anyway; a device returning QUEUE_FULL 
constantly could deadlock the system. Question: is it better to risk a 
deadlock with a broken device or to risk a false timeout under high load? 
I don't know. Maybe there are valid cases where the device keeps 
returning QUEUE_FULL for a long time (some RAID reconfiguration?) ... do you 
know about them?

Anyway, if sym_dequeue_from_squeue was called because of some other error that 
causes a limited retry or a command abort, I think it is still valid to use 
DID_REQUEUE for the following commands. It can't deadlock with 
DID_REQUEUE, because on that error the first command is aborted or has 
its retry count decremented, so the first command must eventually 
complete, and the second command (which was being retried with 
DID_REQUEUE) becomes the first. Once it's first, it cannot loop 
forever. So by induction you can prove that every command completes in 
finite time.

Mikulas
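
The induction argument above can be checked on a toy worst case. The sketch below is only a model under stated assumptions, with an invented function name: the head command fails on every attempt and is completed (with an error) after head_limit failed attempts, while everything behind it is requeued at no cost.

```c
/*
 * Toy worst case for the argument above.  Returns the total number of
 * failed attempts before the queue is empty; the point is that it is
 * finite, namely commands * head_limit.
 */
long worst_case_attempts(int commands, int head_limit)
{
	long attempts = 0;

	while (commands > 0) {
		for (int tries = 0; tries < head_limit; tries++)
			attempts++;	/* head fails, the rest are requeued */
		commands--;		/* head completes, next one becomes head */
	}
	return attempts;
}
```

For example, three queued commands with a limit of six attempts each drain after at most 18 failed attempts in this model.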

> James
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27 14:07                 ` James Bottomley
@ 2010-10-27 16:28                   ` Mikulas Patocka
  2010-10-27 16:35                     ` James Bottomley
  0 siblings, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-27 16:28 UTC (permalink / raw)
  To: James Bottomley; +Cc: John David Anglin, kyle, linux-parisc



On Wed, 27 Oct 2010, James Bottomley wrote:

> On Wed, 2010-10-27 at 10:06 +0200, Mikulas Patocka wrote:
> > 
> > On Tue, 26 Oct 2010, James Bottomley wrote:
> > 
> > > On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > > > > - shared memory --- there is SHMLBA boundary that causes that all
> > > > mappings 
> > > > > are aligned to this boundary --- it is **WRONG** in the current
> > > > kernel!!! 
> > > > > It is only 4MB and should be 16MB!!!
> > > > 
> > > > James has said that the max for all PA-RISC implementations is
> > > > 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> > > > called for.  The alias boundary can be determined by the alias field
> > > > in the D_conf return value.
> > > 
> > > Why is it I get blamed for everything cache related on parisc?  The
> > 
> > You don't get blamed, we're just trying to find bugs :)
> > 
> > > statement in the manuals that the equivalency modulus is 16MB was left
> > > for future expansion.  However, given PA8900 is the last in the series,
> > > there is no future expansion.  John Marvin (I think it was) from the HP
> > > processor group confirmed that the largest equivalency modulus for any
> > > produced parisc processor is 4MB, so that's what we use in the kernel.
> > > 
> > > James
> > 
> > The largest L2 cache size is 64MB --- so if the cache is 4-way 
> > associative, the equivalency distance is 16MB (as the manual says).
> > 
> > If the equivalency distance were 4MB, the L2 cache would have to be 16-way 
> > (or the cache would have to be physically indexed and you wouldn't have to 
> > care about its consistency at all).
> 
> The L2 cache has no equivalency modulus ... it's not virtually indexed.
> 
> James

Why is Kyle then suggesting that I am lucky because I have no L2 cache 
(and therefore, Linux runs faster)?

Why are people talking here about flushing 32MB or 64MB L2 on fork()?

Or is it that you need to flush only L1 cache but the architecture forces 
flush of both caches?


I'd still like to see someone with a PA8800 or PA8900 with L2 run that 
shared memory experiment to actually *prove* that L2 is physically indexed 
and that the L1 equivalency modulus is 4MB. I.e. not rely on what you 
heard somewhere, but rely on what you see.

Mikulas

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27 16:28                   ` Mikulas Patocka
@ 2010-10-27 16:35                     ` James Bottomley
  2010-10-27 16:50                       ` Mikulas Patocka
  0 siblings, 1 reply; 27+ messages in thread
From: James Bottomley @ 2010-10-27 16:35 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: John David Anglin, kyle, linux-parisc

On Wed, 2010-10-27 at 18:28 +0200, Mikulas Patocka wrote:
> 
> On Wed, 27 Oct 2010, James Bottomley wrote:
> 
> > On Wed, 2010-10-27 at 10:06 +0200, Mikulas Patocka wrote:
> > > 
> > > On Tue, 26 Oct 2010, James Bottomley wrote:
> > > 
> > > > On Tue, 2010-10-26 at 21:29 -0400, John David Anglin wrote:
> > > > > > - shared memory --- there is SHMLBA boundary that causes that all
> > > > > mappings 
> > > > > > are aligned to this boundary --- it is **WRONG** in the current
> > > > > kernel!!! 
> > > > > > It is only 4MB and should be 16MB!!!
> > > > > 
> > > > > James has said that the max for all PA-RISC implementations is
> > > > > 4 MB.  The value is returned by the PDC_CACHE call.  Maybe a BUG_ON is
> > > > > called for.  The alias boundary can be determined by the alias field
> > > > > in the D_conf return value.
> > > > 
> > > > Why is it I get blamed for everything cache related on parisc?  The
> > > 
> > > You don't get blamed, we're just trying to find bugs :)
> > > 
> > > > statement in the manuals that the equivalency modulus is 16MB was left
> > > > for future expansion.  However, given PA8900 is the last in the series,
> > > > there is no future expansion.  John Marvin (I think it was) from the HP
> > > > processor group confirmed that the largest equivalency modulus for any
> > > > produced parisc processor is 4MB, so that's what we use in the kernel.
> > > > 
> > > > James
> > > 
> > > The largest L2 cache size is 64MB --- so if the cache is 4-way 
> > > associative, the equivalency distance is 16MB (as the manual says).
> > > 
> > > If the equivalency distance were 4MB, the L2 cache would have to be 16-way 
> > > (or the cache would have to be physically indexed and you wouldn't have to 
> > > care about its consistency at all).
> > 
> > The L2 cache has no equivalency modulus ... it's not virtually indexed.
> > 
> > James
> 
> Why is Kyle then suggesting that I am lucky because I have no L2 cache 
> (and therefore, Linux runs faster)?
> 
> Why are people talking here about flushing 32MB or 64MB L2 on fork()?
> 
> Or is it that you need to flush only L1 cache but the architecture forces 
> flush of both caches?

There's only a couple of flush instructions: fic and fdc ... they have
to flush all caches.  We did argue the toss on this with the HP
processor people since aliasing, which is primarily where we need
flushes for control, only occurs in the L1 cache.  However, they pointed
out that if they made fic and fdc L1 specific, we'd have no control over
DMA type ops which have to reach physical memory.

> I'd still like to see if someone with PA8800 or PA8900 with L2 ran that 
> shared memory experiment to actually *prove* that L2 is physically indexed 
> and that the L1 equivalency modulus is 4MB. I.e. not rely on what you 
> heard somewhere, but rely on what you see.

We already did all of that years ago just trying to make the pa8x00
chips work with linux ... they didn't for about 18 months.

James



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sym53c8xx_2 data corruption
  2010-10-27 16:19                   ` Mikulas Patocka
@ 2010-10-27 16:37                     ` James Bottomley
  2010-10-28  5:59                     ` Grant Grundler
  1 sibling, 0 replies; 27+ messages in thread
From: James Bottomley @ 2010-10-27 16:37 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-parisc, linux-scsi, matthew

On Wed, 2010-10-27 at 18:19 +0200, Mikulas Patocka wrote:
> 
> On Wed, 27 Oct 2010, James Bottomley wrote:
> 
> > On Wed, 2010-10-27 at 11:04 +0200, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > I sent this about twice to linux-scsi and got no response, neither from 
> > > conference nor from Matthew. So I'm sending it here, James, you are the 
> > > maintainer of SCSI, could you please look at the patch and incorporate it 
> > > to the kernel in this cycle?
> > > 
> > > The problem is that if the disk returns QUEUE FULL, the requests are 
> > > aborted with DID_SOFT_ERROR (rather than DID_REQUEUE), which results in 
> > > too few retries and premature errors. The errors happen mostly on writes, 
> > > resulting in data corruption.
> > > 
> > > Mikulas
> > > 
> > > ---
> > > 
> > > sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue.
> > > 
> > > When the controller encounters an error (including QUEUE FULL or BUSY status),
> > > it aborts all not yet submitted requests in the function
> > > sym_dequeue_from_squeue.
> > > 
> > > This function aborts them with DID_SOFT_ERROR.
> > > 
> > > If the disk has a full tag queue, the request that caused the overflow is
> > > aborted with QUEUE FULL status (and the scsi midlayer properly retries it
> > > until it is accepted by the disk), but other requests are aborted with
> > > DID_SOFT_ERROR --- for them, the midlayer does just a few retries and then
> > > signals the error up to sd.
> > > 
> > > The result is that disk returning QUEUE FULL causes request failures.
> > > 
> > > The error was reproduced on 53c895 with COMPAQ BD03685A24 disk (rebranded
> > > ST336607LC) with command queue 48 or 64 tags. The disk has 64 tags, but
> > > under some access patterns it returns QUEUE FULL when there are fewer than
> > > 64 pending tags. The SCSI specification allows returning QUEUE FULL
> > > anytime and it is up to the host to retry.
> > 
> > So the description isn't really complete.  the function is
> > dequeue_from_squeue which is used to requeue all unissued scbs when the
> > sequencer is restarted.  This doesn't just affect QUEUE_FULL, it affects
> > everything.  As long as the pushback is done before the status is
> > returned (which it looks like it is), I think the patch after fixing
> > looks fine.
> >
> > The problem isn't the actual command which returns queue full ... it's
> > that the sequencer accepts and queues a pile of commands and then
> > returns all of them on the first queue full ... that means that deeply
> > queued commands in the sequencer issue queue can get returned >5 times
> > on multiple QUEUE_FULL conditions which would cause a failure.
> 
> Sure, that's how I understood it from the code and debug prints. You can 
> add this to the description.
> 
> That QUEUE_FULL command is actually retried fine, the following commands 
> are problematic.
> 
> > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > > 
> > > ---
> > >  drivers/scsi/sym53c8xx_2/sym_hipd.c |    4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > Index: linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c
> > > ===================================================================
> > > --- linux-2.6.36-rc5-fast.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:25:59.000000000 +0200
> > > +++ linux-2.6.36-rc5-fast/drivers/scsi/sym53c8xx_2/sym_hipd.c	2010-09-27 10:26:27.000000000 +0200
> > > @@ -3000,7 +3000,11 @@ sym_dequeue_from_squeue(struct sym_hcb *
> > >  		if ((target == -1 || cp->target == target) &&
> > >  		    (lun    == -1 || cp->lun    == lun)    &&
> > >  		    (task   == -1 || cp->tag    == task)) {
> > > +#ifdef SYM_OPT_HANDLE_DEVICE_QUEUEING
> > >  			sym_set_cam_status(cp->cmd, DID_SOFT_ERROR);
> > > +#else
> > > +			sym_set_cam_status(cp->cmd, DID_REQUEUE);
> > > +#endif
> > 
> > So the ifdef is definitely wrong.  SYM_OPT_HANDLE_DEVICE_QUEUEING is a
> > leftover from when the driver did explicit internal queueing. Just make
> > this do DID_REQUEUE and I *think* everything will be OK.
> 
> When I tried to enable SYM_OPT_HANDLE_DEVICE_QUEUEING, it didn't work, it 
> crashed on something --- it is leftover from some other operating system 
> that didn't handle requeuing in the midlayer.
> 
> When looking at the other parts of code that handles this driver-internal 
> requeueing, it expects DID_SOFT_ERROR there. But it doesn't matter, that 
> code is useless for Linux and broken anyway.
> 
> > There's a danger in that DID_REQUEUE will requeue forever, so this
> > working depends on the original failing command being returned with the
> > correct code (which I think it is, but more eyes looking at this would
> > be helpful).
> 
> Requeuing forever is dangerous anyway, a device returning QUEUE_FULL 
> constantly could deadlock the system. Question: is it better to risk a 
> deadlock with a broken device or to risk a false timeout under high load? 
> --- I don't know --- maybe there are valid cases where the device is 
> returning QUEUE_FULL for long time (some raid reconfiguration?) ... do you 
> know about them?
> 
> Anyway, if sym_dequeue_from_squeue was called from some other error that 
> causes limited retry or command abort, I think it is still valid to use 
> DID_REQUEUE for the following commands --- it can't deadlock with 
> DID_REQUEUE, because on that error, the first command is aborted or has 
> its retry count decremented --- so the first command must be eventually 
> completed, and the second command (which was being retried with 
> DID_REQUEUE) becomes the first --- and once it's first, it cannot loop 
> forever. So with induction you can prove that every command completes in 
> finite time.

As long as we see the QUEUE_FULL return, there's no danger.  The mid
layer has a timeout beyond which it won't allow a QUEUE_FULL return to
be retried.

James



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27 16:35                     ` James Bottomley
@ 2010-10-27 16:50                       ` Mikulas Patocka
  2010-10-27 17:07                         ` James Bottomley
  0 siblings, 1 reply; 27+ messages in thread
From: Mikulas Patocka @ 2010-10-27 16:50 UTC (permalink / raw)
  To: James Bottomley; +Cc: John David Anglin, kyle, linux-parisc

> > Why is Kyle then suggesting that I am lucky because I have no L2 cache 
> > (and therefore, Linux runs faster)?
> > 
> > Why are people talking here about flushing 32MB or 64MB L2 on fork()?
> > 
> > Or is it that you need to flush only L1 cache but the architecture forces 
> > flush of both caches?
> 
> There's only a couple of flush instructions: fic and fdc ... they have
> to flush all caches.  We did argue the toss on this with the HP
> processor people since aliasing, which is primarily where we need
> flushes for control, only occurs in the L1 cache.  However, they pointed
> out that if they made fic and fdc L1 specific, we'd have no control over
> DMA type ops which have to reach physical memory.
> 
> > I'd still like to see if someone with PA8800 or PA8900 with L2 ran that 
> > shared memory experiment to actually *prove* that L2 is physically indexed 
> > and that the L1 equivalency modulus is 4MB. I.e. not rely on what you 
> > heard somewhere, but rely on what you see.
> 
> We already did all of that years ago just trying to make the pa8x00

If it is really proved, OK.

So, the CPU takes a hash of some address bits up to 4MB and uses it to 
calculate an index into a 4-way, not-power-of-two-sized L1 cache?

> chips work with linux ... they didn't for about 18 months.
> 
> James

Unfortunately, I still get some userspace crashes on SMP. I already found 
one reproducible crash (running "make install" on gcc-4.5.1). The crash 
happens with some probability, but the probability is high enough that 
it's reproducible.

Do you have some idea where cache flushing is missing, so that I could 
check whether adding it fixes my case?

BTW, if you flush the cache on kmap, I think it couldn't work in a 
multithreaded environment at all. Suppose the program has "int a, b;" and 
both variables share a cacheline; one thread is accessing "a" via kmap and 
the other thread writes to "b" directly, for example "b = 5". Then cache 
flushing won't help and one of the variables will be trashed. You need the 
kmap address to be congruent with the linear address. But I think that's 
not the reason for my crash, because neither gmake nor bash (which 
crashes) is multithreaded.

Mikulas
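
The congruence requirement mentioned above can be written down as a one-line check. This is a sketch only: the helper name is invented, and the 4MB boundary is the value argued for elsewhere in this thread, not something this snippet establishes.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALIAS_BOUNDARY (4UL * 1024 * 1024)	/* assumed: 4MB, as discussed in this thread */

/*
 * Two virtual addresses are congruent if they agree in all bits below the
 * alias boundary, so that they select the same line of a virtually
 * indexed cache.
 */
bool is_congruent(uintptr_t va1, uintptr_t va2)
{
	return ((va1 ^ va2) & (ALIAS_BOUNDARY - 1)) == 0;
}
```

For example, 0x40000000 and 0x40400000 are congruent under this definition, while 0x40000000 and 0x40001000 are not.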

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27 16:50                       ` Mikulas Patocka
@ 2010-10-27 17:07                         ` James Bottomley
  2010-10-28  6:04                           ` John David Anglin
  0 siblings, 1 reply; 27+ messages in thread
From: James Bottomley @ 2010-10-27 17:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: John David Anglin, kyle, linux-parisc

On Wed, 2010-10-27 at 18:50 +0200, Mikulas Patocka wrote:
> > > Why is Kyle then suggesting that I am lucky because I have no L2 cache 
> > > (and therefore, Linux runs faster)?
> > > 
> > > Why are people talking here about flushing 32MB or 64MB L2 on fork()?
> > > 
> > > Or is it that you need to flush only L1 cache but the architecture forces 
> > > flush of both caches?
> > 
> > There's only a couple of flush instructions: fic and fdc ... they have
> > to flush all caches.  We did argue the toss on this with the HP
> > processor people since aliasing, which is primarily where we need
> > flushes for control, only occurs in the L1 cache.  However, they pointed
> > out that if they made fic and fdc L1 specific, we'd have no control over
> > DMA type ops which have to reach physical memory.
> > 
> > > I'd still like to see if someone with PA8800 or PA8900 with L2 ran that 
> > > shared memory experiment to actually *prove* that L2 is physically indexed 
> > > and that the L1 equivalency modulus is 4MB. I.e. not rely on what you 
> > > heard somewhere, but rely on what you see.
> > 
> > We already did all of that years ago just trying to make the pa8x00
> 
> If it is really proved, OK.
> 
> So, the CPU takes a hash of some address bits up to 4MB and uses it to 
> calculate an index into a 4-way, not-power-of-two-sized L1 cache?
> 
> > chips work with linux ... they didn't for about 18 months.
> > 
> > James
> 
> Unfortunately, I still get some userspace crashes on SMP, I already found 
> one reproducible crash (running "make install" on gcc-4.5.1). The crash 
> happens with some probability, but the probability is high enough so that 
> it's reproducible.
> 
> Do you have some idea where cache flushing is missing so that I could try 
> if it fixes my case?

This is what we know:

http://wiki.parisc-linux.org/TestCases

> BTW. if you flush cache on kmap, I think it couldn't work in multithreaded 
> environment at all --- i.e. the program has "int a, b;" both variables 
> share the cacheline, one thread is accessing "a" via kmap and the other 
> thread writes to b directly, for example "b = 5". Then, cache flushing 
> won't help and one of the variables will be trashed. You need kmap address 
> to be congruent with the linear address. But I think it's not reason for 
> my crash because neither gmake nor bash (that crashes) is multithreaded.

That statement assumes the threads share a data structure but are not
congruent ... which certainly isn't true for userspace.  Our only
incongruency which gives rise to aliasing is between the kernel and user
address spaces and we don't do data sharing between the two without
pretty severe accessor restrictions.

James



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sym53c8xx_2 data corruption
  2010-10-27 16:19                   ` Mikulas Patocka
  2010-10-27 16:37                     ` James Bottomley
@ 2010-10-28  5:59                     ` Grant Grundler
  1 sibling, 0 replies; 27+ messages in thread
From: Grant Grundler @ 2010-10-28  5:59 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: James Bottomley, linux-parisc, linux-scsi, matthew

On Wed, Oct 27, 2010 at 06:19:32PM +0200, Mikulas Patocka wrote:
...
> Requeuing forever is dangerous anyway, a device returning QUEUE_FULL 
> constantly could deadlock the system. Question: is it better to risk a 
> deadlock with a broken device or to risk a false timeout under high load? 
> --- I don't know --- maybe there are valid cases where the device is 
> returning QUEUE_FULL for a long time (some RAID reconfiguration?) ... do you
> know about them?

This was a problem in multi-initiator SCSI systems and I'm guessing also
an issue for FC SAN. Multiple hosts "compete" for filling the device's
available command slots. If every host used its full queue_depth
(say 32 commands) and the device only supported 64 commands at a time,
then the 65th command, from host #3, might get QUEUE_FULL status back.
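
The arithmetic above can be sketched as a toy model. This is only an illustration: the `Target` class, the `simulate` helper and the assumption that the hosts fill their queues strictly one after another are hypothetical and do not describe any real SCSI driver or device.

```python
QUEUE_FULL = "QUEUE_FULL"
GOOD = "GOOD"


class Target:
    """Toy SCSI target with a fixed number of command slots (assumed: 64)."""

    def __init__(self, slots=64):
        self.slots = slots
        self.in_flight = 0

    def submit(self):
        # Reject the command once every slot is occupied.
        if self.in_flight >= self.slots:
            return QUEUE_FULL
        self.in_flight += 1
        return GOOD


def simulate(hosts=3, queue_depth=32, slots=64):
    """Each host in turn queues queue_depth commands; nothing completes."""
    target = Target(slots)
    results = []
    for host in range(1, hosts + 1):
        for n in range(1, queue_depth + 1):
            results.append((host, n, target.submit()))
    return results


results = simulate()
# First command that the target refuses: (host, command number, status).
first_rejected = next(r for r in results if r[2] == QUEUE_FULL)
print(first_rejected)
```

With three hosts at queue_depth 32 and 64 slots, the first two hosts fill the target and the first command from host #3 is the one refused.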

I'm not sure what the difference is to BUSY status. Wikipedia
suggests "QUEUE_FULL" is a hint that the device is already
processing commands from the same initiator.

hth,
grant


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-27 17:07                         ` James Bottomley
@ 2010-10-28  6:04                           ` John David Anglin
  2010-10-28 16:55                             ` James Bottomley
  0 siblings, 1 reply; 27+ messages in thread
From: John David Anglin @ 2010-10-28  6:04 UTC (permalink / raw)
  To: James Bottomley; +Cc: mikulas, kyle, linux-parisc

> On Wed, 2010-10-27 at 18:50 +0200, Mikulas Patocka wrote:
> > > > Why is Kyle then suggesting that I am lucky because I have no L2 cache
> > > > (and therefore, Linux runs faster)?
> > > > 
> > > > Why are people talking here about flushing 32MB or 64MB L2 on fork()?
> > > > 
> > > > Or is it that you need to flush only L1 cache but the architecture forces 
> > > > flush of both caches?
> > > 
> > > There's only a couple of flush instructions: fic and fdc ... they have
> > > to flush all caches.  We did argue the toss on this with the HP
> > > processor people since aliasing, which is primarily where we need
> > > flushes for control, only occurs in the L1 cache.  However, they pointed
> > > out that if they made fic and fdc L1 specific, we'd have no control over
> > > DMA type ops which have to reach physical memory.
> > > 
> > > > I'd still like to see if someone with PA8800 or PA8900 with L2 ran that 
> > > > shared memory experiment to actually *prove* that L2 is physically indexed 
> > > > and that the L1 equivalency modulus is 4MB. I.e. not rely on what you 
> > > > heard somewhere, but rely on what you see.
> > > 
> > > We already did all of that years ago just trying to make the pa8x00
> > 
> > If it is really proved, OK.
> > 
> > So, the CPU takes a hash of some bits up to 4MB and uses them to
> > calculate an index into 4-way not-power-of-two-sized L1 cache?
> > 
> > > chips work with linux ... they didn't for about 18 months.
> > > 
> > > James
> > 
> > Unfortunately, I still get some userspace crashes on SMP; I already found
> > one reproducible crash (running "make install" on gcc-4.5.1). The crash 
> > happens with some probability, but the probability is high enough so that 
> > it's reproducible.
> > 
> > Do you have some idea where cache flushing is missing so that I could try 
> > if it fixes my case?

I also see random userspace segmentation faults on SMP.  This is not
restricted to PA8800/PA8900 processors, although it is less frequent
on earlier processors.  The probability of completing a full GCC build
on an SMP system is relatively low.

Some kernels (e.g., 2.6.19) have been better than others but the reason
is unknown.

> This is what we know
> 
> http://wiki.parisc-linux.org/TestCases

The wiki certainly has a number of testcases that fail with high probability.
The testcases all involve multiple threads and near simultaneous execution
of a clone and a fork.  The parent forks after the clone.  The segvs always
occur in the thread created by the clone syscall.  The stack created for
the thread is corrupted by the fork.  It is allocated by a mmap call.
If the fork is delayed, the segvs don't occur.

However, I'm totally convinced there are more problems than the one above.
They are just harder to reproduce.  In general, GCC builds don't involve
multithreaded applications.

> > BTW. if you flush cache on kmap, I think it couldn't work in multithreaded 
> > environment at all --- i.e. the program has "int a, b;" both variables 
> > share the cacheline, one thread is accessing "a" via kmap and the other 
> > thread writes to b directly, for example "b = 5". Then, cache flushing 
> > won't help and one of the variables will be trashed. You need kmap address 
> > to be congruent with the linear address. But I think it's not the reason for
> > my crash because neither gmake nor bash (that crashes) is multithreaded.

Agree.

> That statement assumes the threads share a data structure but are not
> congruent ... which certainly isn't true for userspace.  Our only
> incongruency which gives rise to aliasing is between the kernel and user
> address spaces and we don't do data sharing between the two without
> pretty severe accessor restrictions.

My sense is the above isn't correct.  Even if userspace is congruent,
it would seem to me that another thread could dirty the cache line after
kmap does its flush.  The design of the Linux memory management system
might prevent this from happening, but it's not obvious.  That's why
I thought the kernel should also use congruent mappings.
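
The "int a, b" scenario discussed above can be sketched with a toy model of a virtually indexed write-back cache. Everything here is an assumption made for illustration: the `AliasingCache` class, the integer "colours" standing in for cache indexes, and the order of flushes in `run` are hypothetical, and the model says nothing about whether the Linux memory management design actually permits this interleaving.

```python
class AliasingCache:
    """Toy write-back cache: one physical line holding (a, b), indexed by colour."""

    def __init__(self):
        self.memory = {"a": 0, "b": 0}  # contents of the single physical line
        self.lines = {}                 # colour -> cached copy of the whole line

    def write(self, colour, field, value):
        # A miss fills the whole line from memory; the write dirties that copy.
        line = self.lines.setdefault(colour, dict(self.memory))
        line[field] = value

    def flush(self, colour):
        # Write the whole line back to memory and invalidate it.
        line = self.lines.pop(colour, None)
        if line is not None:
            self.memory = dict(line)


def run(kernel_colour, user_colour):
    cache = AliasingCache()
    cache.flush(user_colour)            # flush of the user alias at kmap time
    cache.write(kernel_colour, "a", 1)  # one thread writes a through the kmap alias
    cache.write(user_colour, "b", 5)    # another thread writes b directly
    cache.flush(user_colour)            # user line is written back first
    cache.flush(kernel_colour)          # kmap alias is written back last
    return cache.memory


print(run(kernel_colour=0, user_colour=1))  # non-congruent aliases
print(run(kernel_colour=0, user_colour=0))  # congruent aliases
```

In this model the non-congruent run loses the write to b, because the later write-back carries a stale copy of it, while the congruent run keeps both writes since both aliases hit the same cache line.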

My 32-bit UP c3750 is fully stable.  It has run for months and I don't
see any weird segmentation faults in userspace doing gcc builds.  On
the other hand, the testcases on the wiki crash with high probability.
So, I think we are dealing with two different, but possibly related
issues.  One is a MP issue.  The other is a clone/fork race.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)


* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-28  6:04                           ` John David Anglin
@ 2010-10-28 16:55                             ` James Bottomley
  0 siblings, 0 replies; 27+ messages in thread
From: James Bottomley @ 2010-10-28 16:55 UTC (permalink / raw)
  To: John David Anglin; +Cc: mikulas, kyle, linux-parisc

On Thu, 2010-10-28 at 02:04 -0400, John David Anglin wrote:
> > > BTW. if you flush cache on kmap, I think it couldn't work in multithreaded 
> > > environment at all --- i.e. the program has "int a, b;" both variables 
> > > share the cacheline, one thread is accessing "a" via kmap and the other 
> > > thread writes to b directly, for example "b = 5". Then, cache flushing 
> > > won't help and one of the variables will be trashed. You need kmap address 
> > > to be congruent with the linear address. But I think it's not the reason for
> > > my crash because neither gmake nor bash (that crashes) is multithreaded.
> 
> Agree.
> 
> > That statement assumes the threads share a data structure but are not
> > congruent ... which certainly isn't true for userspace.  Our only
> > incongruency which gives rise to aliasing is between the kernel and user
> > address spaces and we don't do data sharing between the two without
> > pretty severe accessor restrictions.
> 
> My sense is the above isn't correct.  Even if userspace is congruent,
> it would seem to me that another thread could dirty the cache line after
> kmap does its flush.  The design of the Linux memory management system
> might prevent this from happening, but it's not obvious.

Look at it this way:  it only happens if the kernel and userspace share
a data structure, which they never do.  If they did, it wouldn't just
show up on parisc, it would be seen on every risc system (since they're
almost all VIPT).

A far more dangerous form of cache line induced incoherence is actually
DMA.  If you DMA into a line which the kernel also touches
simultaneously, only one modification will survive.  Again we fix this
with a similar ownership model: either the kernel uses the region or the
device driver.

>   That's why
> I thought the kernel should also use congruent mappings.

That would be ideal, but unfortunately memory in the kernel is currently
at fixed mappings (the physical to virtual offset is fixed).  If we
could get all mappings congruent, we'd actually be operating the pa88/89
processors within spec instead of having to pull kmap tricks.  Ralf has
some strange mips system that needs full congruency as well ... it's
just I don't think he's managed to get it working yet.
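
What "congruent" means here can be sketched as a check against the equivalency modulus. The 4MB value is the figure quoted earlier in this thread; the function names, the example addresses and the idea of a modulus-aligned mapping window are hypothetical and only illustrate the arithmetic, not the actual kmap code.

```python
EQUIV_MODULUS = 4 * 1024 * 1024  # 4MB, the figure quoted in this thread


def congruent(va1, va2, modulus=EQUIV_MODULUS):
    """Two virtual addresses are congruent if they differ by a multiple of the modulus."""
    return (va1 - va2) % modulus == 0


def congruent_kmap_address(user_va, window_base, modulus=EQUIV_MODULUS):
    """Pick the address inside a modulus-aligned window that matches user_va.

    Assumes window_base is itself a multiple of the modulus (hypothetical window).
    """
    return window_base + (user_va % modulus)


user_va = 0x40123000                  # made-up user address
window_base = 0xF0000000              # made-up, 4MB-aligned kernel window
kernel_va = congruent_kmap_address(user_va, window_base)

print(hex(kernel_va))
print(congruent(user_va, kernel_va))         # chosen mapping is congruent
print(congruent(user_va, user_va + 0x1000))  # next page is not
```

A mapping at a fixed physical-to-virtual offset has no such freedom to choose the address, which is the constraint described above.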

James

> My 32-bit UP c3750 is fully stable.  It has run for months and I don't
> see any weird segmentation faults in userspace doing gcc builds.  On
> the other hand, the testcases on the wiki crash with high probability.
> So, I think we are dealing with two different, but possibly related
> issues.  One is a MP issue.  The other is a clone/fork race.
> 
> Dave




* Re: PA caches (was: C8000 cpu upgrade problem)
  2010-10-26  2:16     ` PA caches (was: C8000 cpu upgrade problem) Mikulas Patocka
  2010-10-26  3:04       ` Kyle McMartin
@ 2010-12-18 20:13       ` John David Anglin
  1 sibling, 0 replies; 27+ messages in thread
From: John David Anglin @ 2010-12-18 20:13 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Kyle McMartin, linux-parisc

On Tue, 26 Oct 2010, Mikulas Patocka wrote:

> > our cache flushing is a bit... suboptimal right now (doing whole cache
> > flushes on fork and such.)
> 
> What is exactly the problem there? Could you describe it or refer to some 
> document that describes it? Why do you need to flush on fork?

Discussed here:
http://lkml.org/lkml/2003/12/15/244

Flush should not be needed according to Lamont.

However COW handling is broken.  See minifail bug on wiki:
http://wiki.parisc-linux.org/TestCases

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)


Thread overview: 27+ messages
     [not found] <20101024020337.725094D30@hiauly1.hia.nrc.ca>
2010-10-24  3:03 ` C8000 cpu upgrade problem Mikulas Patocka
2010-10-24  3:43   ` Kyle McMartin
2010-10-26  2:16     ` PA caches (was: C8000 cpu upgrade problem) Mikulas Patocka
2010-10-26  3:04       ` Kyle McMartin
2010-10-26  4:30         ` John David Anglin
2010-10-26 16:02         ` Mikulas Patocka
2010-10-27  1:29           ` John David Anglin
2010-10-27  2:40             ` John David Anglin
2010-10-27  4:50             ` James Bottomley
2010-10-27  8:06               ` Mikulas Patocka
2010-10-27  8:35                 ` Mikulas Patocka
2010-10-27 14:18                   ` James Bottomley
2010-10-27 14:07                 ` James Bottomley
2010-10-27 16:28                   ` Mikulas Patocka
2010-10-27 16:35                     ` James Bottomley
2010-10-27 16:50                       ` Mikulas Patocka
2010-10-27 17:07                         ` James Bottomley
2010-10-28  6:04                           ` John David Anglin
2010-10-28 16:55                             ` James Bottomley
2010-10-27  9:04               ` sym53c8xx_2 data corruption Mikulas Patocka
2010-10-27 14:46                 ` James Bottomley
2010-10-27 16:19                   ` Mikulas Patocka
2010-10-27 16:37                     ` James Bottomley
2010-10-28  5:59                     ` Grant Grundler
2010-12-18 20:13       ` PA caches (was: C8000 cpu upgrade problem) John David Anglin
2010-10-24  4:01   ` C8000 cpu upgrade problem John David Anglin
2010-10-26  2:04     ` Mikulas Patocka
