linux-kernel.vger.kernel.org archive mirror
* Linux-2.5.16
@ 2002-05-18  7:57 Linus Torvalds
  2002-05-18  8:05 ` Linux-2.5.16 Aschwin Marsman - aYniK Software Solutions
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-18  7:57 UTC (permalink / raw)
  To: Kernel Mailing List


[ Testing the shortlog format, full changelogs on the kernel site ]

Well, I dunno if the short changelog format is wonderfully readable, but 
at least it's small enough that I don't feel bad about mailbombing the 
kernel list with it.

USB and architecture updates, IDE driver updates etc. The one that kept me
personally somewhat busy was the interesting Intel SMP-P4 TLB corruption
bug, which ends up being due to some very funky asynchronous speculative
TLB fill logic, which made the page table invalidation "exciting".

The TLB invalidate rewrite will likely have broken all other architectures 
(at least performance-wise, if not in any other way), so architecture 
maintainers look out!

		Linus

-----

Summary of changes from v2.5.15 to v2.5.16
============================================

<acher@in.tum.de>
	o USB-UHCI-HCD

Anton Altaparmakov <aia21@cantab.net>
	o NTFS 2.0.7: minor cleanup, remove NULL struct initializers
	o NTFS 2.0.7 release: pure cleanups.

Jens Axboe <axboe@suse.de>
	o fix scsi oops on failed sg table allocation

<bunk@fs.tum.de>
	o Include linux/slab.h not linux/malloc.h in pc300 wan driver.

Martin Dalecki <dalecki@evision-ventures.com>
	o 2.5.15 IDE 60
	o 2.5.15 IDE 61
	o 2.5.15 IDE 62a
	o 2.5.15 IDE 63
	o 2.5.15 IDE 64

<davem@nuts.ninka.net>
	o Sparc64 fixes:
	o Sparc64: Delete AOFF_task_fpregs define.
	o tcp_ipv4.c: Do not increment TcpAttemptFails twice.
	o Sparc64: Make pcibios_init return an int.
	o Ingress packet scheduler: Fix compiler error when CONFIG_NET_CLS_POLICE is disabled.
	o Sparc64: Bitops take unsigned long pointer.
	o Sparc64: Fix typos in bitops changes.
	o Sparc64: Missing parts of previous math-emu fixes.

<david-b@pacbell.net>
	o -- ehci misc FIXMEs
	o -- hub/tt error recovery

<david@gibson.dropbear.id.au>
	o Update orinoco driver to 0.11b

<dirk.uffmann@nokia.com>
	o 1127/1: static PCI memory mapping for ARM Integrator reduced
	o 1126/1: Kernel decompression in head.S does not work for ARM 9xx architectures
	o 1130/1: Remove support for prefetchable PCI memory on ARM Integrator

<dwmw2@infradead.org>
	o zlib_inflate return code fix. Again.

<george@mvista.com>
	o 64-bit jiffies, a better solution

<greg@kroah.com>
	o USB storage
	o USB storage
	o USB storage drivers
	o USB storage
	o usb_submit_urb fix for broken usb devices
	o USB device reference counting api cleanup changes
	o USB sddr55 minor to enable a MDSM-B reader
	o Change to the USB core to retry failed devices on startup.
	o USB Config.in and Makefile fixups
	o USB - fix a compiler warning in the core code
	o USB - Host controller Config.in changes

Christoph Hellwig <hch@infradead.org>
	o IPv4 Syncookies: Remove pointless CONFIG_SYN_COOKIES ifdef.

<henrique@cyclades.com>
	o Change maintainer info of PC300 WAN driver.

<hirofumi@mail.parknet.co.jp>
	o Fixed the handling of file name containing 0x05 on vfat

<jdavid@farfalle.com>
	o Add full duplex support to 3c509 net driver.

Jeff Garzik <jgarzik@mandrakesoft.com>
	o Add new pci id to tulip net driver.
	o Merge 2.4.x changes for old OSS ac97_codec driver:
	o via-rhine net driver minor fixes and cleanups:
	o Update MII generic phy driver to properly report link status.
	o Fix phy id masking in 8139too net driver.

<johannes@erdfelt.com>
	o uhci.c FSBR timeout
	o USB device reference counting fix for uhci.c and usb core
	o 2.4.19-pre8 uhci.c incorrect bit operations
	o 2.4.19-pre8 uhci.c incorrect bit operations
	o uhci-hcd for 2.5.15

<jt@hpl.hp.com>
	o Fix four similar off-by-one errors in wireless net drvr core.
	o IrDA update 1/3:
	o IrDA update 2/3, set_bit updates:
	o IrDA update 3/3:

<kai@tp1.ruhr-uni-bochum.de>
	o ISDN: maintain outstanding CAPI messages in the drivers
	o Use standard AS rule.
	o ISDN: AVM CAPI drivers: Common revision parsing
	o ISDN: Usage count for CAPI controllers
	o ISDN: Init ISA AVM CAPI drivers at module load time
	o ISDN: Release AVM CAPI controllers at module unload time

<kasperd@daimi.au.dk>
	o Fix oops-able situation in 3c509 net driver

Manfred Spraul <manfred@colorfullife.com>
	o usb-storage locking fixes

Neil Brown <neilb@cse.unsw.edu.au>
	o - kNFSd in 2.5.15 - Require export operations for exporting a filesystem
	o - kNFSd in 2.5.15 - export_operations support for isofs
	o Micro Memory battery backed RAM card driver

<nico@cam.org>
	o [ARM 1110/1: fixes to the ARM checksum code

<os@emlix.com>
	o cs89x0 net driver minor fixes, SH4 support, and cmd line media support

<paulus@nanango.paulus.ozlabs.org>
	o PPC32: This changeset updates several of the powermac-specific

<quintela@mandrakesoft.com>
	o tulip net driver 2114x phy init fix

<rgooch@atnf.csiro.au>
	o misc.c:
	o Fixed race when devfs lookup()/readdir() triggers partition rescanning.
	o Minor cleanup of fs/devfs/base.c:scan_dir_for_removable().

<rl@hellgate.ch>
	o Cosmetic cleanups, remove unused struct members from via-rhine net driver

Russell King <rmk@flint.arm.linux.org.uk>
	o [ARM] Localise old param_struct to arch/arm/kernel/compat.c.
	o [ARM] Fix signedness of address comparisons, causing boots on some
	o Pass a physical address from the boot loader for the location of the
	o Always allow CONFIG_CMDLINE to be set or edited by the user.
	o Clean up do_undefinstr - it only needs to take the pt_regs pointer
	o A pile of missed kernel stack accessing functions were still using
	o [ARM] Don't write to read-only registers.
	o [ARM] SA1100 cleanups:
	o [ARM] Couple of small fixes:
	o [ARM] ADFS updates/fixes.
	o 2.5.14 updates - for the new memory management pfn() macros.  Also,

<rml@tech9.net>
	o clean up maximum priorities

<rusty@rustcorp.com.au>
	o Hotplug CPU prep

<shaggy@austin.ibm.com>
	o Prevent deadlock in JFS when flushing data during commit

<skyrelighten@yahoo.co.kr>
	o Add to list of supported 8139 net boards.

<tcallawa@redhat.com>
	o Sparc64: Export batten_down_hatches
	o Sparc: Use proper sys_{read,write} prototypes in SunOS
	o drivers/video/aty/mach64_gx.c: Include sched.h

Linus Torvalds <torvalds@transmeta.com>
	o Fix 'export-objs' usage in Makefiles. 
	o Make arm default to little-endian jiffies.
	o This improves on the page table TLB shootdown. Almost there.
	o Fix up some more TLB shootdown issues.
	o Update kernel version
	o Cleanup munmap a lot. Fix Intel P4 TLB corruptions on SMP.
	o Make setresuid/setresgid be more consistent wrt fsuid handling
	o First cut at proper TLB shootdown for page directory entries.

<wstinson@infonie.fr>
	o request_region janitor cleanup for rtc char driver



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux-2.5.16
  2002-05-18  7:57 Linux-2.5.16 Linus Torvalds
@ 2002-05-18  8:05 ` Aschwin Marsman - aYniK Software Solutions
  2002-05-18  8:21 ` Linux-2.5.16 Russell King
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Aschwin Marsman - aYniK Software Solutions @ 2002-05-18  8:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Sat, 18 May 2002, Linus Torvalds wrote:

> [ Testing the shortlog format, full changelogs on the kernel site ]
> 
> Well, I dunno if the short changelog format is wonderfully readable, but 
> at least it's small enough that I don't feel bad about mailbombing the 
> kernel list with it.

I think this is much better, thanks. It gives a short overview of recent
development that can be read quickly.

Have a nice weekend,
 
Aschwin Marsman
 
--
aYniK Software Solutions - all You need is Knowledge
Bedrijvenpark Twente 305 - NL-7602 KL Almelo - the Netherlands
P.O. box 134             - NL-7600 AC Almelo - the Netherlands
telephone: +31 (0)546-581400 fax: +31 (0)546-581401
a.marsman@aYniK.com        http://www.aYniK.com
aschwin@marsman.org        http://www.marsman.org



* Re: Linux-2.5.16
  2002-05-18  7:57 Linux-2.5.16 Linus Torvalds
  2002-05-18  8:05 ` Linux-2.5.16 Aschwin Marsman - aYniK Software Solutions
@ 2002-05-18  8:21 ` Russell King
  2002-05-18  9:51   ` Linux-2.5.16 Tomas Szepe
  2002-05-18  8:52 ` Linux-2.5.16 mikeH
  2002-05-20  0:33 ` Linux-2.5.16 Roman Zippel
  3 siblings, 1 reply; 32+ messages in thread
From: Russell King @ 2002-05-18  8:21 UTC (permalink / raw)
  To: Kernel Mailing List

On Sat, May 18, 2002 at 12:57:01AM -0700, Linus Torvalds wrote:
> <nico@cam.org>
> 	o [ARM 1110/1: fixes to the ARM checksum code

Not quite perfect yet, but I'm not too bothered - that used to be
[ARM PATCH]

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: Linux-2.5.16
  2002-05-18  7:57 Linux-2.5.16 Linus Torvalds
  2002-05-18  8:05 ` Linux-2.5.16 Aschwin Marsman - aYniK Software Solutions
  2002-05-18  8:21 ` Linux-2.5.16 Russell King
@ 2002-05-18  8:52 ` mikeH
  2002-05-18 18:33   ` Linux-2.5.16 Andrew Morton
  2002-05-20  0:33 ` Linux-2.5.16 Roman Zippel
  3 siblings, 1 reply; 32+ messages in thread
From: mikeH @ 2002-05-18  8:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List


What's the state of ext3 in this release? I seem to remember reading 
there were some corruption issues.

Thanks,

mikeH

Linus Torvalds wrote:

>[ Testing the shortlog format, full changelogs on the kernel site ]
>
>Well, I dunno if the short changelog format is wonderfully readable, but 
>at least it's small enough that I don't feel bad about mailbombing the 
>kernel list with it.
>
>USB and architecture updates, IDE driver updates etc. The one that kept me
>personally somewhat busy was the interesting Intel SMP-P4 TLB corruption
>bug, which ends up being due to some very funky asynchronous speculative
>TLB fill logic, which made the page table invalidation "exciting".
>
>The TLB invalidate rewrite will likely have broken all other architectures 
>(at least performance-wise, if not in any other way), so architecture 
>maintainers look out!
>
>		Linus
>  
>




* Re: Linux-2.5.16
  2002-05-18  8:21 ` Linux-2.5.16 Russell King
@ 2002-05-18  9:51   ` Tomas Szepe
  2002-05-18 11:28     ` Linux-2.5.16 Marcus Alanen
  0 siblings, 1 reply; 32+ messages in thread
From: Tomas Szepe @ 2002-05-18  9:51 UTC (permalink / raw)
  To: Russell King; +Cc: Kernel Mailing List, Matthias Andree

> > <nico@cam.org>
> > 	o [ARM 1110/1: fixes to the ARM checksum code
> 
> Not quite perfect yet, but I'm not too bothered - that used to be
> [ARM PATCH]

Now if only we knew which of the scripts Linus used. :)

Matthias, is this regexp broken in the recent version of the
script too?

T.


"when you do things right, people won't be sure you've done anything at all."
- god to bender


* Re: Linux-2.5.16
  2002-05-18  9:51   ` Linux-2.5.16 Tomas Szepe
@ 2002-05-18 11:28     ` Marcus Alanen
  2002-05-18 15:38       ` Linux-2.5.16 Matthias Andree
  0 siblings, 1 reply; 32+ messages in thread
From: Marcus Alanen @ 2002-05-18 11:28 UTC (permalink / raw)
  To: szepe, Russell King; +Cc: Kernel Mailing List, Matthias Andree

>> > <nico@cam.org>
>> > 	o [ARM 1110/1: fixes to the ARM checksum code
>> Not quite perfect yet, but I'm not too bothered - that used to be
>> [ARM PATCH]
>Now if only we knew which of the scripts Linus used. :)
>
>Matthias, is this regexp broken in the recent version of the
>script too?

I guess it still is "$_ =~ s/\[?PATCH\]?\s*//i;", which means
that it still is broken. There certainly are several solutions,
what do people think of "s/\[?[^\]]*PATCH\]?\W*//i;" ?
(Maybe a ^ at the beginning?) 

Marcus

-- 
Marcus Alanen
maalanen@abo.fi


* Re: Linux-2.5.16
  2002-05-18 11:28     ` Linux-2.5.16 Marcus Alanen
@ 2002-05-18 15:38       ` Matthias Andree
  2002-05-18 15:44         ` Linux-2.5.16 Tomas Szepe
  0 siblings, 1 reply; 32+ messages in thread
From: Matthias Andree @ 2002-05-18 15:38 UTC (permalink / raw)
  To: Marcus Alanen; +Cc: szepe, Russell King, Kernel Mailing List, Matthias Andree

On Sat, 18 May 2002, Marcus Alanen wrote:

> I guess it still is "$_ =~ s/\[?PATCH\]?\s*//i;", which means
> that it still is broken. There certainly are several solutions,
> what do people think of "s/\[?[^\]]*PATCH\]?\W*//i;" ?
> (Maybe a ^ at the beginning?) 

Don't guess, look:

    # kill "PATCH" tag
    s/^\s*\[PATCH\]//;
    s/^\s*PATCH//;
    s/^\s*[-:]+\s*//;
    # strip trailing colon
    s/:\s*$//;
    # kill leading and trailing whitespace for consistent indentation
    s/^\s+//; s/\s+$//;

So it should not harm "[ARM PATCH]".

What we would want is to only remove the tag when we have symmetric square
brackets. What we also want is simplicity to allow for easy maintenance
and, last but not least, simple, anchored regexps for speed.


* Re: Linux-2.5.16
  2002-05-18 15:38       ` Linux-2.5.16 Matthias Andree
@ 2002-05-18 15:44         ` Tomas Szepe
  0 siblings, 0 replies; 32+ messages in thread
From: Tomas Szepe @ 2002-05-18 15:44 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

> > I guess it still is "$_ =~ s/\[?PATCH\]?\s*//i;", which means
> > that it still is broken. There certainly are several solutions,
> > what do people think of "s/\[?[^\]]*PATCH\]?\W*//i;" ?
> > (Maybe a ^ at the beginning?) 
> 
> Don't guess, look:
> 
>     # kill "PATCH" tag
>     s/^\s*\[PATCH\]//;
>     s/^\s*PATCH//;
>     s/^\s*[-:]+\s*//;
>     # strip trailing colon
>     s/:\s*$//;
>     # kill leading and trailing whitespace for consistent indentation
>     s/^\s+//; s/\s+$//;
> 
> So it should not harm "[ARM PATCH]".
> 
> What we would want is to only remove the tag when we have symmetric square
> brackets. What we also want is simplicity to allow for easy maintenance
> and, last but not least, simple, anchored regexps for speed.

Good.
Could you repost the latest version *in plaintext* please?


T.


* Re: Linux-2.5.16
  2002-05-18  8:52 ` Linux-2.5.16 mikeH
@ 2002-05-18 18:33   ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2002-05-18 18:33 UTC (permalink / raw)
  To: mikeH; +Cc: Linus Torvalds, Kernel Mailing List

mikeH wrote:
> 
> What's the state of ext3 in this release? I seem to remember reading
> there were some corruption issues.

data=journal is not in a very good state.

data=ordered works for normal use, but it will fail in heavy testing
on SMP due to exposure of preexisting bugs.  I have forward ported a
couple of Stephen's patches which appear to fix that up.

data=writeback should be OK.

I'll get the ext3 fixes out over the next few days.


* Re: Linux-2.5.16
  2002-05-18  7:57 Linux-2.5.16 Linus Torvalds
                   ` (2 preceding siblings ...)
  2002-05-18  8:52 ` Linux-2.5.16 mikeH
@ 2002-05-20  0:33 ` Roman Zippel
  2002-05-20  0:39   ` Linux-2.5.16 Linus Torvalds
  3 siblings, 1 reply; 32+ messages in thread
From: Roman Zippel @ 2002-05-20  0:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Hi,

On Sat, 18 May 2002, Linus Torvalds wrote:

> The TLB invalidate rewrite will likely have broken all other architectures 
> (at least performance-wise, if not in any other way), so architecture 
> maintainers look out!

Two questions about asm-generic/tlb.h:
- freed is never incremented, callers of tlb_remove_page have to do the
  rss update themselves?
- will a non smp version later be added again?

bye, Roman



* Re: Linux-2.5.16
  2002-05-20  0:33 ` Linux-2.5.16 Roman Zippel
@ 2002-05-20  0:39   ` Linus Torvalds
  2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
  2002-05-20  1:10     ` Linux-2.5.16 Roman Zippel
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-20  0:39 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Kernel Mailing List



On Mon, 20 May 2002, Roman Zippel wrote:
>
> Two questions about asm-generic/tlb.h:
> - freed is never incremented, callers of tlb_remove_page have to do the
>   rss update themselves?

No, that's just a missed thing (for a while I thought I could use "nr" for
"freed", so I changed the code and forgot to add back the free'd).

> - will a non smp version later be added again?

Not likely, at least not in the form it was before of having two
completely different paths.

But I was thinking of doing a one-source thing that the compiler can
statically optimize, with something like

	#ifdef CONFIG_SMP
	#define fast_case(tlb) ((tlb)->nr == ~0UL)
	#else
	#define fast_case(tlb) (1)
	#endif

which allows us to have one set of sources for both UP and SMP, but the UP
case gets optimized by the compiler.

Do you want to do the freed and the above and test it?

		Linus
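
A minimal user-space sketch of the one-source idea above, under stated
assumptions: struct gather_demo, demo_remove_page() and the small FREE_PTE_NR
are hypothetical stand-ins, not the code in asm-generic/tlb.h. Only the
fast_case() macro is taken from the mail. When CONFIG_SMP is not defined the
test is the constant 1, so the compiler can discard the batching branch
entirely.

```c
#define FREE_PTE_NR 4	/* hypothetical small batch size for the demo */

/* Hypothetical stand-in for mmu_gather_t; only the fields the demo needs. */
struct gather_demo {
	unsigned long nr;		/* ~0UL means "fast mode" on SMP */
	unsigned long freed_now;	/* pages released immediately */
	void *pages[FREE_PTE_NR];	/* pages deferred until the flush */
};

#ifdef CONFIG_SMP
#define fast_case(tlb) ((tlb)->nr == ~0UL)
#else
#define fast_case(tlb) (1)
#endif

/* Returns 1 if the page was released at once, 0 if it was queued. */
static int demo_remove_page(struct gather_demo *tlb, void *page)
{
	if (fast_case(tlb)) {
		/* UP, or SMP with a single user of the mm: no batching needed. */
		tlb->freed_now++;
		return 1;
	}
	tlb->pages[tlb->nr++] = page;
	if (tlb->nr >= FREE_PTE_NR)
		tlb->nr = 0;	/* real code would flush the TLB and free pages[] here */
	return 0;
}
```

Both configurations share the same function body; the difference is confined
to the macro, which is the point of the proposal.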



* Re: Linux-2.5.16
  2002-05-20  0:39   ` Linux-2.5.16 Linus Torvalds
@ 2002-05-20  0:47     ` Linus Torvalds
  2002-05-20  1:09       ` Linux-2.5.16 Paul Mackerras
                         ` (3 more replies)
  2002-05-20  1:10     ` Linux-2.5.16 Roman Zippel
  1 sibling, 4 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-20  0:47 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Kernel Mailing List



On Sun, 19 May 2002, Linus Torvalds wrote:
>
> No, that's just a missed thing (for a while I thought I could use "nr" for
> "freed", so I changed the code and forgot to add back the free'd).

That reminds me - we should increment the rss for page directories now on
the allocation path, because we will decrement rss for them when we free
them (and because it's the right thing to do anyway, I guess - better
resource tracking).

The other alternative is to make separate versions of "tlb_remove_page()":
one that decrements RSS, one that doesn't (and the latter would be used
for page directories).

Finally, I haven't really heard anything back from the "strange" VM
architectures (ie sparc v8 and PPC) other than Davem's buy-in that the
basic approach should work ok for them.

		Linus



* Re: Linux-2.5.16
  2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
@ 2002-05-20  1:09       ` Paul Mackerras
  2002-05-20  1:25         ` Linux-2.5.16 Linus Torvalds
  2002-05-20  1:15       ` Linux-2.5.16 Roman Zippel
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2002-05-20  1:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:

> Finally, I haven't really heard anything back from the "strange" VM
> architectures (ie sparc v8 and PPC) other than Davem's buy-in that the
> basic approach should work ok for them.

Looking at it now. :)

My only comment at this stage is that I would like to have the address
passed to tlb_remove_page, as it used to be, so that I can find and
clear the PTEs in the MMU hash table efficiently when the buffer in
the mmu_gather_t fills up, before freeing the pages.

Paul.


* Re: Linux-2.5.16
  2002-05-20  0:39   ` Linux-2.5.16 Linus Torvalds
  2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
@ 2002-05-20  1:10     ` Roman Zippel
  2002-05-20 17:57       ` Linux-2.5.16 Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread
From: Roman Zippel @ 2002-05-20  1:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Hi,

On Sun, 19 May 2002, Linus Torvalds wrote:

> > - freed is never incremented, callers of tlb_remove_page have to do the
> >   rss update themselves?
> 
> No, that's just a missed thing (for a while I thought I could use "nr" for
> "freed", so I changed the code and forgot to add back the free'd).

If I see it correctly, tlb_remove_page() can't be called with a swap page.
Does it make sense to use free_page_and_swap_cache()?

> > - will a non smp version later be added again?
> 
> Not likely, at least not in the form it was before of having two
> completely different paths.
> 
> But I was thinking of doing a one-source thing that the compiler can
> statically optimize, with something like
> 
> 	#ifdef CONFIG_SMP
> 	#define fast_case(tlb) ((tlb)->nr == ~0UL)
> 	#else
> 	#define fast_case(tlb) (1)
> 	#endif
> 
> which allows us to have one set of sources for both UP and SMP, but the UP
> case gets optimized by the compiler.

I was thinking about this and I agree, but this is needed as well

#define FREE_PTE_NR 1

otherwise we waste 2KB. :)

> Do you want to do the freed and the above and test it?

Sure.

bye, Roman



* Re: Linux-2.5.16
  2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
  2002-05-20  1:09       ` Linux-2.5.16 Paul Mackerras
@ 2002-05-20  1:15       ` Roman Zippel
  2002-05-20  1:20         ` Linux-2.5.16 Linus Torvalds
  2002-05-20  4:30       ` Linux-2.5.16 David S. Miller
  2002-05-20 22:20       ` Linux-2.5.16 Roman Zippel
  3 siblings, 1 reply; 32+ messages in thread
From: Roman Zippel @ 2002-05-20  1:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Hi,

On Sun, 19 May 2002, Linus Torvalds wrote:

> That reminds me - we should increment the rss for page directories now on
> the allocation path, because we will decrement rss for them when we free
> them (and because it's the right thing to do anyway, I guess - better
> resource tracking).
> 
> The other alternative is to make separate versions of "tlb_remove_page()":
> one that decrements RSS, one that doesn't (and the latter would be used
> for page directories).
> 
> Finally, I haven't really heard anything back from the "strange" VM
> architectures (ie sparc v8 and PPC) other than Davem's buy-in that the
> basic approach should work ok for them.

There is another problem even on rather "normal" systems, a pgd/pmd
directory doesn't have to be of PAGE_SIZE size, e.g. on m68k it's 512
bytes.

bye, Roman



* Re: Linux-2.5.16
  2002-05-20  1:15       ` Linux-2.5.16 Roman Zippel
@ 2002-05-20  1:20         ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-20  1:20 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Kernel Mailing List



On Mon, 20 May 2002, Roman Zippel wrote:
>
> There is another problem even on rather "normal" systems, a pgd/pmd
> directory doesn't have to be of PAGE_SIZE size, e.g. on m68k it's 512
> bytes.

Note that the generic VM code doesn't actually call any of these functions
directly - an architecture can choose to redefine the whole thing for its
own uses if it wants to.

In particular, even if the architecture wants to share everything else in
the generic tlb.h, you can solve the particular problem you mention by
just not defining "pmd_free_tlb()" to be "tlb_remove_page()". In short:
there should be absolutely nothing in the setup that _requires_ you to
consider page directories to be normal pages. It just happens to work out
that way on x86 (and a number of other architectures).

		Linus



* Re: Linux-2.5.16
  2002-05-20  1:09       ` Linux-2.5.16 Paul Mackerras
@ 2002-05-20  1:25         ` Linus Torvalds
  2002-05-20 12:43           ` Linux-2.5.16 Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2002-05-20  1:25 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Kernel Mailing List



On Mon, 20 May 2002, Paul Mackerras wrote:
>
> My only comment at this stage is that I would like to have the address
> passed to tlb_remove_page, as it used to be, so that I can find and
> clear the PTEs in the MMU hash table efficiently when the buffer in
> the mmu_gather_t fills up, before freeing the pages.

Sure enough, but you need to keep in mind that x86 (and others) want to
use the generic support even for pages that don't have virtual addresses,
ie the page directories etc. So that argues for splitting up the existing
"tlb_remove_page()" into something like

	void tlb_remove_tlb_entry(mmu_gather_t *tlb, struct page *page, unsigned long address)
	{
		tlb_flush_mapping(tlb, page, address);
		tlb->freed++;
		tlb_remove_page(tlb, page);
	}

	void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
	{
		.. add the page to the pages[] array ..
	}

where PPC would have the "tlb_flush_mapping()" thing, and something like
x86 or a regular TLB would just define it to be a no-op.

		Linus
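
The split described above can be modelled in user space roughly as follows.
This is a sketch under assumptions, not kernel code: gather_model_t, the
model_* functions, the flushes counter and the tiny FREE_PTE_NR are all
hypothetical, and the flush hook is a no-op as it would be for x86 in this
scheme. What it shows is the accounting consequence: only pages removed
through the "tlb entry" path bump freed, so page directories queued through
the plain path do not affect the later rss adjustment.

```c
#define FREE_PTE_NR 4	/* tiny batch for illustration only */

struct page { int id; };	/* stand-in for the kernel's struct page */

/* Hypothetical model of the gather structure. */
typedef struct {
	unsigned long nr;	/* pages currently queued in pages[] */
	unsigned long freed;	/* user pages removed; later subtracted from rss */
	unsigned long flushes;	/* number of batch flushes (model bookkeeping) */
	struct page *pages[FREE_PTE_NR];
} gather_model_t;

/* Architecture hook: a no-op here; a hash-table MMU would use the address. */
static void model_flush_mapping(gather_model_t *tlb, struct page *page,
				unsigned long address)
{
	(void)tlb;
	(void)page;
	(void)address;
}

/* Pretend to flush the TLB and free everything queued so far. */
static void model_flush(gather_model_t *tlb)
{
	if (tlb->nr) {
		tlb->flushes++;
		tlb->nr = 0;
	}
}

/* Queue any page, with or without a virtual address (e.g. a page directory). */
static void model_remove_page(gather_model_t *tlb, struct page *page)
{
	tlb->pages[tlb->nr++] = page;
	if (tlb->nr >= FREE_PTE_NR)
		model_flush(tlb);
}

/* Queue a page that was mapped at 'address' and count it against rss. */
static void model_remove_tlb_entry(gather_model_t *tlb, struct page *page,
				   unsigned long address)
{
	model_flush_mapping(tlb, page, address);
	tlb->freed++;
	model_remove_page(tlb, page);
}
```

Removing one user page and one page-directory page through this model leaves
freed at 1 with two pages queued.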



* Re: Linux-2.5.16
  2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
  2002-05-20  1:09       ` Linux-2.5.16 Paul Mackerras
  2002-05-20  1:15       ` Linux-2.5.16 Roman Zippel
@ 2002-05-20  4:30       ` David S. Miller
  2002-05-20 22:20       ` Linux-2.5.16 Roman Zippel
  3 siblings, 0 replies; 32+ messages in thread
From: David S. Miller @ 2002-05-20  4:30 UTC (permalink / raw)
  To: torvalds; +Cc: zippel, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Sun, 19 May 2002 17:47:01 -0700 (PDT)
   
   Finally, I haven't really heard anything back from the "strange" VM
   architectures (ie sparc v8 and PPC) other than Davem's buy-in that the
   basic approach should work ok for them.

I haven't had time this weekend to even look at sparc64.
Sparc v8 doesn't even work with the 2.5.x tree before
your VM changes, it will be months before sparc32 is in
any kind of working shape in the 2.5.x tree, if at all.

So expect something wrt. sparc64 soon, and don't hold your breath
on sparc32.


* Re: Linux-2.5.16
  2002-05-20  1:25         ` Linux-2.5.16 Linus Torvalds
@ 2002-05-20 12:43           ` Paul Mackerras
  2002-05-20 16:13             ` Linux-2.5.16 Linus Torvalds
  2002-05-21  5:10             ` Linux-2.5.16 Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Paul Mackerras @ 2002-05-20 12:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus,

This patch splits up the existing tlb_remove_page into
tlb_remove_tlb_entry (for pages that are/were mapped into userspace)
and tlb_remove_page, as you suggested.  It also adds the necessary
stuff for PPC, which has its own include/asm-ppc/tlb.h now.  This
works on at least one PPC machine. :)

Thanks,
Paul.
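
A note on the FREE_PTE_NR value of 507 used in the PPC header below, whose
comment says it makes sizeof(mmu_gather_t) a power of 2. The arithmetic can be
checked with a small sketch, assuming a 32-bit target where pointers and
unsigned long are both 4 bytes. struct gather_layout32 and gather_bytes() are
hypothetical names; every field is written as uint32_t so the result is the
same on any host.

```c
#include <stdint.h>

#define FREE_PTE_NR 507

/* 32-bit model of the PPC mmu_gather_t layout from the patch. */
struct gather_layout32 {
	uint32_t mm;			/* struct mm_struct * */
	uint32_t nr;
	uint32_t freed;
	uint32_t start;
	uint32_t end;
	uint32_t pages[FREE_PTE_NR];	/* struct page * slots */
};

/* 5 header words + 507 page slots = 512 words = 2048 bytes. */
_Static_assert(sizeof(struct gather_layout32) == 2048,
	       "gather structure should be exactly 2 KB in this model");

/* Size in bytes of the same layout for an arbitrary batch size. */
static unsigned long gather_bytes(unsigned long free_pte_nr)
{
	return (5 + free_pte_nr) * sizeof(uint32_t);
}
```

Under the same assumptions, a batch size of 1 would shrink this layout to 24
bytes.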

diff -urN linux-2.5/include/asm-generic/tlb.h pmac-2.5/include/asm-generic/tlb.h
--- linux-2.5/include/asm-generic/tlb.h	Sun May 19 21:04:28 2002
+++ pmac-2.5/include/asm-generic/tlb.h	Mon May 20 11:46:18 2002
@@ -83,11 +83,10 @@
 	tlb_flush_mmu(tlb, start, end);
 }
 
-
-/* void tlb_remove_page(mmu_gather_t *tlb, pte_t *ptep, unsigned long addr)
- *	Must perform the equivalent to __free_pte(pte_get_and_clear(ptep)), while
- *	handling the additional races in SMP caused by other CPUs caching valid
- *	mappings in their TLBs.
+/* void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
+ *	This should free the page given after flushing any reference
+ *	to it from the TLB.  This should be done no later than the
+ *	next call to tlb_finish_mmu for this tlb.
  */
 static inline void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
 {
@@ -99,6 +98,18 @@
 	tlb->pages[tlb->nr++] = page;
 	if (tlb->nr >= FREE_PTE_NR)
 		tlb_flush_mmu(tlb, 0, 0);
+}
+
+/* void tlb_remove_tlb_entry(mmu_gather_t *tlb, struct page *page, unsigned long address)
+ *	This is similar to tlb_remove_page, except that we are given the
+ *	virtual address at which the page was mapped.  The address parameter
+ *	is unused here but is used on some architectures.
+ */
+static inline void tlb_remove_tlb_entry(mmu_gather_t *tlb, struct page *page,
+					unsigned long address)
+{
+	tlb->freed++;
+	tlb_remove_page(tlb, page);
 }
 
 #endif /* _ASM_GENERIC__TLB_H */
diff -urN linux-2.5/mm/memory.c pmac-2.5/mm/memory.c
--- linux-2.5/mm/memory.c	Thu May 16 20:31:42 2002
+++ pmac-2.5/mm/memory.c	Mon May 20 14:06:58 2002
@@ -353,7 +353,7 @@
 				if (!PageReserved(page)) {
 					if (pte_dirty(pte))
 						set_page_dirty(page);
-					tlb_remove_page(tlb, page);
+					tlb_remove_tlb_entry(tlb, page, address+offset);
 				}
 			}
 		} else {
diff -urN linux-2.5/include/asm-ppc/tlb.h pmac-2.5/include/asm-ppc/tlb.h
--- linux-2.5/include/asm-ppc/tlb.h	Tue Feb  5 18:40:23 2002
+++ pmac-2.5/include/asm-ppc/tlb.h	Mon May 20 20:14:20 2002
@@ -1,4 +1,141 @@
 /*
- * BK Id: SCCS/s.tlb.h 1.5 05/17/01 18:14:26 cort
+ * include/asm-ppc/tlb.h
+ *
+ *	TLB shootdown code for PPC
+ *
+ * Based on include/asm-generic/tlb.h, which is
+ * Copyright 2001 Red Hat, Inc.
+ * Based on code from mm/memory.c Copyright Linus Torvalds and others.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
  */
-#include <asm-generic/tlb.h>
+#ifndef _ASM_PPC__TLB_H
+#define _ASM_PPC__TLB_H
+
+#include <linux/config.h>
+#include <asm/tlbflush.h>
+#include <asm/page.h>
+
+/*
+ * This makes sizeof(mmu_gather_t) a power of 2.
+ * We assume we get some advantage from batching up the invalidations.
+ * It would be nice to measure how much ...
+ */
+#define FREE_PTE_NR	507
+
+/* mmu_gather_t is an opaque type used by the mm code for passing around any
+ * data needed by arch specific code for tlb_remove_page.  This structure can
+ * be per-CPU or per-MM as the page table lock is held for the duration of TLB
+ * shootdown.
+ */
+typedef struct free_pte_ctx {
+	struct mm_struct	*mm;
+	unsigned long		nr;	/* set to ~0UL means fast mode */
+	unsigned long		freed;
+	unsigned long		start;
+	unsigned long		end;
+	struct page *		pages[FREE_PTE_NR];
+} mmu_gather_t;
+
+/* Declared in arch/ppc/mm/init.c. */
+extern mmu_gather_t	mmu_gathers[NR_CPUS];
+
+/*
+ * Actually do the flushes that we have gathered up, and
+ * then free the corresponding pages.
+ */
+static inline void tlb_flush_mmu(mmu_gather_t *tlb)
+{
+	unsigned long nr;
+
+	nr = tlb->nr;
+	if (nr != 0) {
+		unsigned long i;
+		flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end + PAGE_SIZE);
+		tlb->nr = 0;
+		for (i = 0; i < nr; i++)
+			free_page_and_swap_cache(tlb->pages[i]);
+	}
+}
+
+/* tlb_gather_mmu
+ *	Return a pointer to an initialized mmu_gather_t.
+ */
+static inline mmu_gather_t *tlb_gather_mmu(struct mm_struct *mm)
+{
+	mmu_gather_t *tlb = &mmu_gathers[smp_processor_id()];
+
+	tlb->mm = mm;
+	tlb->freed = 0;
+	tlb->nr = 0;
+	return tlb;
+}
+
+/* tlb_finish_mmu
+ *	Called at the end of the shootdown operation to free up any resources
+ *	that were required.  The page table lock is still held at this point.
+ */
+static inline void tlb_finish_mmu(mmu_gather_t *tlb, unsigned long start, unsigned long end)
+{
+	int freed = tlb->freed;
+	struct mm_struct *mm = tlb->mm;
+	int rss = mm->rss;
+
+	if (rss < freed)
+		freed = rss;
+	mm->rss = rss - freed;
+}
+
+/* Nothing needed here in fact... */
+#define tlb_start_vma(tlb, vma) do { } while (0)
+
+/*
+ * flush_tlb_mm_range looks at the pte pages for the range of addresses
+ * in order to check the _PAGE_HASHPTE bit.  Thus we can't defer
+ * the tlb_flush_mmu call to tlb_finish_mmu time, since by then the
+ * pointers to the pte pages in the pgdir have been zeroed.
+ * Instead we do the tlb_flush_mmu here.  In future we could possibly
+ * do something cleverer, like keeping our own pointer(s) to the pte
+ * page(s) that we are interested in.
+ */
+static inline void tlb_end_vma(mmu_gather_t *tlb, struct vm_area_struct *vma)
+{
+	tlb_flush_mmu(tlb);
+}
+
+/* void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
+ *
+ *	On PPC this should never be called.
+ */
+static inline void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
+{
+	BUG();
+}
+
+/* void tlb_remove_tlb_entry(mmu_gather_t *tlb, struct page *page, unsigned long address)
+ *	This should free the page given after flushing any reference
+ *	to it from the MMU hash table and TLB.  This should be done no
+ *	later than the next call to tlb_finish_mmu for this tlb.
+ *	We get given the virtual address at which the page was mapped.
+ */
+static inline void tlb_remove_tlb_entry(mmu_gather_t *tlb, struct page *page,
+					unsigned long address)
+{
+	tlb->freed++;
+
+	if (tlb->nr == 0)
+		tlb->start = address;
+	else if (address - tlb->end > 32 * PAGE_SIZE) {
+		tlb_flush_mmu(tlb);
+		tlb->start = address;
+	}
+	tlb->end = address;
+	tlb->pages[tlb->nr++] = page;
+	if (tlb->nr >= FREE_PTE_NR)
+		tlb_flush_mmu(tlb);
+}
+
+#endif /* _ASM_PPC__TLB_H */
diff -urN linux-2.5/include/asm-ppc/tlbflush.h pmac-2.5/include/asm-ppc/tlbflush.h
--- linux-2.5/include/asm-ppc/tlbflush.h	Fri May 10 10:14:59 2002
+++ pmac-2.5/include/asm-ppc/tlbflush.h	Mon May 20 12:00:24 2002
@@ -27,6 +27,9 @@
 	{ __tlbia(); }
 static inline void flush_tlb_mm(struct mm_struct *mm)
 	{ __tlbia(); }
+static inline void flush_tlb_mm_range(struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+	{ __tlbia(); }
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				unsigned long vmaddr)
 	{ _tlbie(vmaddr); }
@@ -45,6 +48,9 @@
 	{ __tlbia(); }
 static inline void flush_tlb_mm(struct mm_struct *mm)
 	{ __tlbia(); }
+static inline void flush_tlb_mm_range(struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+	{ __tlbia(); }
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				unsigned long vmaddr)
 	{ _tlbie(vmaddr); }
@@ -61,6 +67,8 @@
 struct vm_area_struct;
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm(struct mm_struct *mm);
+extern void flush_tlb_mm_range(struct mm_struct *mm,
+			       unsigned long start, unsigned long end);
 extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 extern void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 			    unsigned long end);
diff -urN linux-2.5/include/asm-ppc/pgalloc.h pmac-2.5/include/asm-ppc/pgalloc.h
--- linux-2.5/include/asm-ppc/pgalloc.h	Wed Apr 10 17:56:21 2002
+++ pmac-2.5/include/asm-ppc/pgalloc.h	Mon May 20 11:18:03 2002
@@ -20,6 +20,7 @@
  */
 #define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); })
 #define pmd_free(x)                     do { } while (0)
+#define pmd_free_tlb(tlb,x)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)      BUG()
 
 #define pmd_populate_kernel(mm, pmd, pte)	\
@@ -31,6 +32,8 @@
 extern struct page *pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 extern void pte_free_kernel(pte_t *pte);
 extern void pte_free(struct page *pte);
+
+#define pte_free_tlb(tlb, pte)	pte_free((pte))
 
 #define check_pgt_cache()	do { } while (0)
 
diff -urN linux-2.5/arch/ppc/mm/tlb.c pmac-2.5/arch/ppc/mm/tlb.c
--- linux-2.5/arch/ppc/mm/tlb.c	Mon Apr 15 09:48:49 2002
+++ pmac-2.5/arch/ppc/mm/tlb.c	Mon May 20 22:21:40 2002
@@ -59,7 +59,7 @@
 #define FINISH_FLUSH	do { } while (0)
 #endif
 
-static void flush_range(struct mm_struct *mm, unsigned long start,
+void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 			unsigned long end)
 {
 	pmd_t *pmd;
@@ -110,7 +110,7 @@
 	 */
 	printk(KERN_ERR "flush_tlb_all called from %p\n",
 	       __builtin_return_address(0));
-	flush_range(&init_mm, TASK_SIZE, ~0UL);
+	flush_tlb_mm_range(&init_mm, TASK_SIZE, ~0UL);
 	FINISH_FLUSH;
 }
 
@@ -119,7 +119,7 @@
  */
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	flush_range(&init_mm, start, end);
+	flush_tlb_mm_range(&init_mm, start, end);
 	FINISH_FLUSH;
 }
 
@@ -130,18 +130,15 @@
  */
 void flush_tlb_mm(struct mm_struct *mm)
 {
+	struct vm_area_struct *mp;
+
 	if (Hash == 0) {
 		_tlbia();
 		return;
 	}
 
-	if (mm->map_count) {
-		struct vm_area_struct *mp;
-		for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
-			flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
-	} else {
-		flush_range(mm, 0, TASK_SIZE);
-	}
+	for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
+		flush_tlb_mm_range(mp->vm_mm, mp->vm_start, mp->vm_end);
 	FINISH_FLUSH;
 }
 
@@ -170,6 +167,6 @@
 void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end)
 {
-	flush_range(vma->vm_mm, start, end);
+	flush_tlb_mm_range(vma->vm_mm, start, end);
 	FINISH_FLUSH;
 }


* Re: Linux-2.5.16
  2002-05-20 12:43           ` Linux-2.5.16 Paul Mackerras
@ 2002-05-20 16:13             ` Linus Torvalds
  2002-05-20 23:30               ` Linux-2.5.16 David S. Miller
  2002-05-20 23:55               ` Linux-2.5.16 Paul Mackerras
  2002-05-21  5:10             ` Linux-2.5.16 Linus Torvalds
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-20 16:13 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Kernel Mailing List



On Mon, 20 May 2002, Paul Mackerras wrote:
>
> This patch splits up the existing tlb_remove_page into
> tlb_remove_tlb_entry (for pages that are/were mapped into userspace)
> and tlb_remove_page, as you suggested.  It also adds the necessary
> stuff for PPC, which has its own include/asm-ppc/tlb.h now.  This
> works on at least one PPC machine. :)

Hmm.. The PPC <asm/tlb.h> seems to be largely a simplified version of
the asm-generic one, with no support for the UP optimization, for example.

And that UP optimization should be perfectly correct even on PPC, so you
apparently lost something in the translation.

I'd actually rather try to share more of the code, if possible.

That does involve putting some of the helper functions in the native
asm/tlb.h file, so I would suggest something along the lines of

 - asm-i386/tlb.h:

	/*
	 * x86 doesn't need to do any per-TLB work,
	 * or care about VMA ranges
	 */
	#define tlb_flush_one_page(tlb,page,address) do { } while (0)
	#define tlb_start_vma(tlb,vma) do { } while (0)
	#define tlb_end_vma(tlb,vma) do { } while (0)

	#include <asm-generic/tlb.h>

 - asm-ppc/tlb.h:


	static inline void tlb_flush_one_page(mmu_gather_t *tlb, struct page *page,
					      unsigned long address)
	{
		if (tlb->nr == 0)
			tlb->start = address;
		else if (address - tlb->end > 32 * PAGE_SIZE) {
			tlb_flush_mmu(tlb);
			tlb->start = address;
		}
		tlb->end = address;
	}
	#define tlb_start_vma(tlb,vma) do { } while (0)
	#define tlb_end_vma(tlb,vma) tlb_flush_mmu(tlb)

	#include <asm-generic/tlb.h>

See what I mean? You can share all the generic stuff, and only differ in
the details.

I think.

		Linus



* Re: Linux-2.5.16
  2002-05-20  1:10     ` Linux-2.5.16 Roman Zippel
@ 2002-05-20 17:57       ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-20 17:57 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.21.0205200245381.23394-100000@serv>,
Roman Zippel  <zippel@linux-m68k.org> wrote:
>On Sun, 19 May 2002, Linus Torvalds wrote:
>>
>> No, that's just a missed thing (for a while I thought I could use "nr" for
>> "freed", so I changed the code and forgot to add back the free'd).
>
>If I see it correctly, tlb_remove_page() can't be called with a swap page,
>does it make sense to use free_page_and_swap_cache()?

tlb_remove_page() cannot be called with a swap _entry_, but it
absolutely can be (and often is) called with a page that is a swap
backing store page. So it does need free_page_and_swap_cache().

		Linus


* Re: Linux-2.5.16
  2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
                         ` (2 preceding siblings ...)
  2002-05-20  4:30       ` Linux-2.5.16 David S. Miller
@ 2002-05-20 22:20       ` Roman Zippel
  2002-05-20 23:36         ` [PATCH] Fix rss accounting Roman Zippel
  3 siblings, 1 reply; 32+ messages in thread
From: Roman Zippel @ 2002-05-20 22:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Hi,

On Sun, 19 May 2002, Linus Torvalds wrote:

> That reminds me - we should increment the rss for page directories now on
> the allocation path, because we will decrement rss for them when we free
> them (and because it's the right thing to do anyway, I guess - better
> resource tracking).

The patch does this as well. It seems to get the rss tracking right (at
least I couldn't trigger the print with some basic tests here), and it
allows tlb_finish_mmu() to be simplified.
Changes:

asm-generic/tlb.h: 
- introduce tlb_fast_mode() and reduce table size to optimize for UP only mode
- fix and simplify rss handling
linux/mm.h:
- __free_pte() doesn't exist anymore
binfmt_{aout,elf}.c:
- initializing of mm->rss/mm->mmap is redundant
pte_alloc_one()/pte_free():
- add rss accounting (pte_free needs mm arg for this)
do_no_page():
- fix rss accounting
exit_mmap():
- check mm->rss on exit
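
The effect of the rss simplification can be modelled in user space. This is
a minimal sketch, not kernel code: plain longs stand in for mm->rss and
tlb->freed, and the names rss_sub_clamped(), rss_sub_exact() and
rss_is_balanced() are hypothetical. It only illustrates why dropping the
clamp makes an accounting mismatch visible to the exit-time check.

```c
#include <assert.h>

/* Old behaviour: clamp so rss never goes negative, which hides mismatches. */
static long rss_sub_clamped(long rss, long freed)
{
	assert(rss >= 0 && freed >= 0);
	if (rss < freed)
		freed = rss;
	return rss - freed;
}

/* Simplified behaviour: subtract exactly, so a mismatch stays visible. */
static long rss_sub_exact(long rss, long freed)
{
	return rss - freed;
}

/* Exit-time check in the spirit of the exit_mmap() hunk: nonzero means
 * the increments and decrements did not pair up. */
static int rss_is_balanced(long rss_at_exit)
{
	return rss_at_exit == 0;
}
```

With rss = 3 and freed = 5, the clamped version reports 0 and looks
balanced, while the exact version reports -2 and fails the check.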

bye, Roman

Index: arch/i386/mm/init.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/i386/mm/init.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 init.c
--- arch/i386/mm/init.c	6 May 2002 17:56:30 -0000	1.1.1.7
+++ arch/i386/mm/init.c	20 May 2002 20:46:17 -0000
@@ -656,9 +656,10 @@ struct page *pte_alloc_one(struct mm_str
 #else
 		pte = alloc_pages(GFP_KERNEL, 0);
 #endif
-		if (pte)
+		if (pte) {
+			mm->rss++;
 			clear_highpage(pte);
-		else {
+		} else {
 			current->state = TASK_UNINTERRUPTIBLE;
 			schedule_timeout(HZ);
 		}
Index: fs/binfmt_aout.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/fs/binfmt_aout.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 binfmt_aout.c
--- fs/binfmt_aout.c	14 Apr 2002 20:01:10 -0000	1.1.1.3
+++ fs/binfmt_aout.c	20 May 2002 20:46:17 -0000
@@ -308,8 +308,6 @@ static int load_aout_binary(struct linux
 	current->mm->brk = ex.a_bss +
 		(current->mm->start_brk = N_BSSADDR(ex));
 
-	current->mm->rss = 0;
-	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
 #ifdef __sparc__
Index: fs/binfmt_elf.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/fs/binfmt_elf.c,v
retrieving revision 1.1.1.6
diff -u -p -r1.1.1.6 binfmt_elf.c
--- fs/binfmt_elf.c	14 Apr 2002 20:01:09 -0000	1.1.1.6
+++ fs/binfmt_elf.c	20 May 2002 20:46:18 -0000
@@ -600,13 +600,11 @@ static int load_elf_binary(struct linux_
 	current->mm->start_data = 0;
 	current->mm->end_data = 0;
 	current->mm->end_code = 0;
-	current->mm->mmap = NULL;
 	current->flags &= ~PF_FORKNOEXEC;
 	elf_entry = (unsigned long) elf_ex.e_entry;
 
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
 	setup_arg_pages(bprm); /* XXX: check error */
 	current->mm->start_stack = bprm->p;
 
Index: include/asm-generic/tlb.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-generic/tlb.h,v
retrieving revision 1.1.1.4
diff -u -p -r1.1.1.4 tlb.h
--- include/asm-generic/tlb.h	18 May 2002 13:39:29 -0000	1.1.1.4
+++ include/asm-generic/tlb.h	20 May 2002 20:46:18 -0000
@@ -14,10 +14,17 @@
 #define _ASM_GENERIC__TLB_H
 
 #include <linux/config.h>
+#include <linux/swap.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_SMP
+#define tlb_fast_mode(tlb) ((tlb)->nr == ~0UL)	
 /* aim for something that fits in the L1 cache */
 #define FREE_PTE_NR	508
+#else
+#define tlb_fast_mode(tlb) (1)
+#define FREE_PTE_NR	1
+#endif
 
 /* mmu_gather_t is an opaque type used by the mm code for passing around any
  * data needed by arch specific code for tlb_remove_page.  This structure can
@@ -55,12 +62,10 @@ static inline mmu_gather_t *tlb_gather_m
 
 static inline void tlb_flush_mmu(mmu_gather_t *tlb, unsigned long start, unsigned long end)
 {
-	unsigned long nr;
-
 	flush_tlb_mm(tlb->mm);
-	nr = tlb->nr;
-	if (nr != ~0UL) {
-		unsigned long i;
+	if (!tlb_fast_mode(tlb)) {
+		unsigned long nr, i;
+		nr = tlb->nr;
 		tlb->nr = 0;
 		for (i=0; i < nr; i++)
 			free_page_and_swap_cache(tlb->pages[i]);
@@ -73,13 +78,7 @@ static inline void tlb_flush_mmu(mmu_gat
  */
 static inline void tlb_finish_mmu(mmu_gather_t *tlb, unsigned long start, unsigned long end)
 {
-	int freed = tlb->freed;
-	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
-
-	if (rss < freed)
-		freed = rss;
-	mm->rss = rss - freed;
+	tlb->mm->rss -= tlb->freed;
 	tlb_flush_mmu(tlb, start, end);
 }
 
@@ -91,8 +90,9 @@ static inline void tlb_finish_mmu(mmu_ga
  */
 static inline void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
 {
-	/* Handle the common case fast, first. */\
-	if (tlb->nr == ~0UL) {
+	/* Handle the common case fast, first. */
+	tlb->freed++;
+	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
 		return;
 	}
Index: include/asm-i386/pgalloc.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgalloc.h,v
retrieving revision 1.1.1.9
diff -u -p -r1.1.1.9 pgalloc.h
--- include/asm-i386/pgalloc.h	18 May 2002 13:39:30 -0000	1.1.1.9
+++ include/asm-i386/pgalloc.h	20 May 2002 20:50:25 -0000
@@ -30,8 +30,9 @@ static inline void pte_free_kernel(pte_t
 	free_page((unsigned long)pte);
 }
 
-static inline void pte_free(struct page *pte)
+static inline void pte_free(struct mm_struct *mm, struct page *pte)
 {
+	mm->rss--;
 	__free_page(pte);
 }
 
Index: include/linux/mm.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/linux/mm.h,v
retrieving revision 1.1.1.11
diff -u -p -r1.1.1.11 mm.h
--- include/linux/mm.h	18 May 2002 13:39:19 -0000	1.1.1.11
+++ include/linux/mm.h	20 May 2002 20:50:19 -0000
@@ -383,8 +383,6 @@ extern void swapin_readahead(swp_entry_t
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
 
-extern void __free_pte(pte_t);
-
 /* mmap.c */
 extern void lock_vma_mappings(struct vm_area_struct *);
 extern void unlock_vma_mappings(struct vm_area_struct *);
Index: mm/memory.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/memory.c,v
retrieving revision 1.1.1.12
diff -u -p -r1.1.1.12 memory.c
--- mm/memory.c	18 May 2002 13:39:18 -0000	1.1.1.12
+++ mm/memory.c	20 May 2002 20:46:18 -0000
@@ -148,7 +148,7 @@ pte_t * pte_alloc_map(struct mm_struct *
 		 * entry, as somebody else could have populated it..
 		 */
 		if (pmd_present(*pmd)) {
-			pte_free(new);
+			pte_free(mm, new);
 			goto out;
 		}
 		pmd_populate(mm, pmd, new);
@@ -1326,7 +1326,8 @@ static int do_no_page(struct mm_struct *
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		++mm->rss;
+		if (!PageReserved(new_page))
+			++mm->rss;
 		flush_page_to_ram(new_page);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
Index: mm/mmap.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/mmap.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 mmap.c
--- mm/mmap.c	18 May 2002 13:39:18 -0000	1.1.1.7
+++ mm/mmap.c	20 May 2002 20:46:18 -0000
@@ -1128,6 +1128,9 @@ void exit_mmap(struct mm_struct * mm)
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
 	tlb_finish_mmu(tlb, FIRST_USER_PGD_NR*PGDIR_SIZE, USER_PTRS_PER_PGD*PGDIR_SIZE);
 
+	if (mm->rss)
+		printk("mm %p has nonzero rss (%ld) (%d,%s)\n", mm, mm->rss, current->pid, current->comm);
+
 	mpnt = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;





* Re: Linux-2.5.16
  2002-05-20 16:13             ` Linux-2.5.16 Linus Torvalds
@ 2002-05-20 23:30               ` David S. Miller
  2002-05-20 23:37                 ` Linux-2.5.16 David S. Miller
  2002-05-20 23:55               ` Linux-2.5.16 Paul Mackerras
  1 sibling, 1 reply; 32+ messages in thread
From: David S. Miller @ 2002-05-20 23:30 UTC (permalink / raw)
  To: torvalds; +Cc: paulus, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Mon, 20 May 2002 09:13:22 -0700 (PDT)

   See what I mean? You can share all the generic stuff, and only differ in
   the details.

I think it is easier to make tlb_{start,end}_vma do the cache/tlb
flushing, and then change tlb_flush_mmu() to look something like:

static inline void tlb_flush_mmu(mmu_gather_t *tlb, unsigned long start, unsigned long end)
{
        unsigned long nr;

-       flush_tlb_mm(tlb->mm);
+       tlb_flush_mm(tlb->mm);
        nr = tlb->nr;
        if (nr != ~0UL) {
                unsigned long i;
                tlb->nr = 0;
                for (i=0; i < nr; i++)
                        free_page_and_swap_cache(tlb->pages[i]);
        }
}

Architectures define tlb_flush_mm() as appropriate, on x86 it would
be just flush_tlb_mm(mm), on Sparc/PPC/etc. which uses the VMA
flushing it would just be a NOP.

This allows sharing all of the infrastructure, with just a few
overrides for the arch-specific bits.


* [PATCH] Fix rss accounting
  2002-05-20 22:20       ` Linux-2.5.16 Roman Zippel
@ 2002-05-20 23:36         ` Roman Zippel
  0 siblings, 0 replies; 32+ messages in thread
From: Roman Zippel @ 2002-05-20 23:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Hi,

On Tue, 21 May 2002, I wrote:

> The patch does this as well. It seems to get the rss tracking right (at
> least I couldn't trigger the print with some basic tests here), and it
> allows tlb_finish_mmu() to be simplified.

Here is a slightly simpler patch: instead of modifying
pte_alloc/pte_free, it simply increments rss in pmd_populate.

Changes:

asm-generic/tlb.h: 
- introduce tlb_fast_mode() and reduce table size to optimize for UP only mode
- fix and simplify rss handling
linux/mm.h:
- __free_pte() doesn't exist anymore
binfmt_{aout,elf}.c:
- initializing of mm->rss/mm->mmap is redundant
pmd_populate():
- add rss accounting (pte_free needs mm arg for this)
do_no_page():
- fix rss accounting
exit_mmap():
- check mm->rss on exit

bye, Roman

Index: fs/binfmt_aout.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/fs/binfmt_aout.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 binfmt_aout.c
--- fs/binfmt_aout.c	14 Apr 2002 20:01:10 -0000	1.1.1.3
+++ fs/binfmt_aout.c	20 May 2002 20:46:17 -0000
@@ -308,8 +308,6 @@ static int load_aout_binary(struct linux
 	current->mm->brk = ex.a_bss +
 		(current->mm->start_brk = N_BSSADDR(ex));
 
-	current->mm->rss = 0;
-	current->mm->mmap = NULL;
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
 #ifdef __sparc__
Index: fs/binfmt_elf.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/fs/binfmt_elf.c,v
retrieving revision 1.1.1.6
diff -u -p -r1.1.1.6 binfmt_elf.c
--- fs/binfmt_elf.c	14 Apr 2002 20:01:09 -0000	1.1.1.6
+++ fs/binfmt_elf.c	20 May 2002 20:46:18 -0000
@@ -600,13 +600,11 @@ static int load_elf_binary(struct linux_
 	current->mm->start_data = 0;
 	current->mm->end_data = 0;
 	current->mm->end_code = 0;
-	current->mm->mmap = NULL;
 	current->flags &= ~PF_FORKNOEXEC;
 	elf_entry = (unsigned long) elf_ex.e_entry;
 
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
-	current->mm->rss = 0;
 	setup_arg_pages(bprm); /* XXX: check error */
 	current->mm->start_stack = bprm->p;
 
Index: include/asm-generic/tlb.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-generic/tlb.h,v
retrieving revision 1.1.1.4
diff -u -p -r1.1.1.4 tlb.h
--- include/asm-generic/tlb.h	18 May 2002 13:39:29 -0000	1.1.1.4
+++ include/asm-generic/tlb.h	20 May 2002 20:46:18 -0000
@@ -14,10 +14,17 @@
 #define _ASM_GENERIC__TLB_H
 
 #include <linux/config.h>
+#include <linux/swap.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_SMP
+#define tlb_fast_mode(tlb) ((tlb)->nr == ~0UL)	
 /* aim for something that fits in the L1 cache */
 #define FREE_PTE_NR	508
+#else
+#define tlb_fast_mode(tlb) (1)
+#define FREE_PTE_NR	1
+#endif
 
 /* mmu_gather_t is an opaque type used by the mm code for passing around any
  * data needed by arch specific code for tlb_remove_page.  This structure can
@@ -55,12 +62,10 @@ static inline mmu_gather_t *tlb_gather_m
 
 static inline void tlb_flush_mmu(mmu_gather_t *tlb, unsigned long start, unsigned long end)
 {
-	unsigned long nr;
-
 	flush_tlb_mm(tlb->mm);
-	nr = tlb->nr;
-	if (nr != ~0UL) {
-		unsigned long i;
+	if (!tlb_fast_mode(tlb)) {
+		unsigned long nr, i;
+		nr = tlb->nr;
 		tlb->nr = 0;
 		for (i=0; i < nr; i++)
 			free_page_and_swap_cache(tlb->pages[i]);
@@ -73,13 +78,7 @@ static inline void tlb_flush_mmu(mmu_gat
  */
 static inline void tlb_finish_mmu(mmu_gather_t *tlb, unsigned long start, unsigned long end)
 {
-	int freed = tlb->freed;
-	struct mm_struct *mm = tlb->mm;
-	int rss = mm->rss;
-
-	if (rss < freed)
-		freed = rss;
-	mm->rss = rss - freed;
+	tlb->mm->rss -= tlb->freed;
 	tlb_flush_mmu(tlb, start, end);
 }
 
@@ -91,8 +90,9 @@ static inline void tlb_finish_mmu(mmu_ga
  */
 static inline void tlb_remove_page(mmu_gather_t *tlb, struct page *page)
 {
-	/* Handle the common case fast, first. */\
-	if (tlb->nr == ~0UL) {
+	/* Handle the common case fast, first. */
+	tlb->freed++;
+	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
 		return;
 	}
Index: include/asm-i386/pgalloc.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgalloc.h,v
retrieving revision 1.1.1.9
diff -u -p -r1.1.1.9 pgalloc.h
--- include/asm-i386/pgalloc.h	18 May 2002 13:39:30 -0000	1.1.1.9
+++ include/asm-i386/pgalloc.h	20 May 2002 23:27:24 -0000
@@ -14,6 +14,7 @@ static inline void pmd_populate(struct m
 	set_pmd(pmd, __pmd(_PAGE_TABLE +
 		((unsigned long long)(pte - mem_map) <<
 			(unsigned long long) PAGE_SHIFT)));
+	mm->rss++;
 }
 /*
  * Allocate and free page tables.
Index: include/linux/mm.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/linux/mm.h,v
retrieving revision 1.1.1.11
diff -u -p -r1.1.1.11 mm.h
--- include/linux/mm.h	18 May 2002 13:39:19 -0000	1.1.1.11
+++ include/linux/mm.h	20 May 2002 20:50:19 -0000
@@ -383,8 +383,6 @@ extern void swapin_readahead(swp_entry_t
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
 
-extern void __free_pte(pte_t);
-
 /* mmap.c */
 extern void lock_vma_mappings(struct vm_area_struct *);
 extern void unlock_vma_mappings(struct vm_area_struct *);
Index: mm/memory.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/memory.c,v
retrieving revision 1.1.1.12
diff -u -p -r1.1.1.12 memory.c
--- mm/memory.c	18 May 2002 13:39:18 -0000	1.1.1.12
+++ mm/memory.c	20 May 2002 23:21:26 -0000
@@ -1326,7 +1326,8 @@ static int do_no_page(struct mm_struct *
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		++mm->rss;
+		if (!PageReserved(new_page))
+			++mm->rss;
 		flush_page_to_ram(new_page);
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
Index: mm/mmap.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/mmap.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 mmap.c
--- mm/mmap.c	18 May 2002 13:39:18 -0000	1.1.1.7
+++ mm/mmap.c	20 May 2002 20:46:18 -0000
@@ -1128,6 +1128,9 @@ void exit_mmap(struct mm_struct * mm)
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
 	tlb_finish_mmu(tlb, FIRST_USER_PGD_NR*PGDIR_SIZE, USER_PTRS_PER_PGD*PGDIR_SIZE);
 
+	if (mm->rss)
+		printk("mm %p has nonzero rss (%ld) (%d,%s)\n", mm, mm->rss, current->pid, current->comm);
+
 	mpnt = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;
 	mm->mm_rb = RB_ROOT;



* Re: Linux-2.5.16
  2002-05-20 23:30               ` Linux-2.5.16 David S. Miller
@ 2002-05-20 23:37                 ` David S. Miller
  2002-05-21  1:02                   ` [PATCH] TLB changes (was Re: Linux-2.5.16) David S. Miller
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2002-05-20 23:37 UTC (permalink / raw)
  To: torvalds; +Cc: paulus, linux-kernel

   From: "David S. Miller" <davem@redhat.com>
   Date: Mon, 20 May 2002 16:30:26 -0700 (PDT)
   
   Architectures define tlb_flush_mm() as appropriate, on x86 it would
   be just flush_tlb_mm(mm), on Sparc/PPC/etc. which uses the VMA
   flushing it would just be a NOP.

Actually, there are some issues with my suggestion.

We are trying to do two things:

1) Flush all VMAs

2) Flush some unmapped area (1 or a few VMAs)

In the #1 case we'd like that to turn into something like:

	flush_cache_mm()
	for each vma {
		unmap_page_range();
	}
	tlb_finish_mmu();
	flush_tlb_mm();

Whereas in the #2 case it should look like:

	for each vma {
		tlb_start_vma(vma...);
		tlb_end_vma(vma...);
	}
	tlb_finish_mmu();

We have to reposition that tlb.h:flush_tlb_mm() call somehow to
make this a reality.

The next issue is how to make it so that this infrastructure
can allow us to kill off the buggy flush_tlb_pgtables() thing.


* Re: Linux-2.5.16
  2002-05-20 16:13             ` Linux-2.5.16 Linus Torvalds
  2002-05-20 23:30               ` Linux-2.5.16 David S. Miller
@ 2002-05-20 23:55               ` Paul Mackerras
  2002-05-21  0:18                 ` Linux-2.5.16 Paul Mackerras
  1 sibling, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2002-05-20 23:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus Torvalds writes:

> Hmm.. The PPC <asm/tlb.h> seems to be largely a simplified version of
> the asm-generic one, with no support for the UP optimization, for example.
> 
> And that UP optimization should be perfectly correct even on PPC, so you
> apparently lost something in the translation.

The UP optimization would be slower on "classic" PPCs because you end
up doing a flush_tlb_mm rather than flushing the ranges of addresses
where there are actually PTEs present.

By "classic" PPCs I mean those that use a hashed page table, as
distinct from the embedded PPCs that have software-loaded TLBs.
Certainly for the embedded PPCs the UP optimization is useful and
desirable.  More ifdefs... :(

Anyway, on classic PPCs, the cost of TLB flushes goes up with the
amount of address space being flushed, and flush_tlb_mm is essentially
a flush of the whole space from 0 to TASK_SIZE.  As an optimization,
I currently have flush_tlb_mm look at the list of vma's and flush the
address range for each vma.  That works because flush_tlb_mm currently
only gets called from dup_mmap() and the list of vma's is valid in
that case.  If we were calling flush_tlb_mm from tlb_flush_mmu, we
could not use that optimization (since all the vma's have been removed
from the mm->mmap list at that point) so we would have to flush the
entire 0 .. TASK_SIZE range.

There is a further optimization that we do that we would not be able
to use if flush_tlb_mm were called from tlb_flush_mmu.  We have a bit
in each PTE, the _PAGE_HASHPTE bit, that tells us if a hardware PTE
for that virtual address has been put into the hash table.  (This bit
is present even when the PTE is a swap entry, and set_pte et al. don't
modify this bit.)  Then, when we are flushing a range of addresses, we
look at the linux PTE for each page and only do the hash table search
if the bit is set.  However, by the time tlb_flush_mmu is called, the
page tables have been freed, so we would have to do the hash table
search for every page from 0 to TASK_SIZE.  (This is why I call
tlb_flush_mmu from tlb_end_vma rather than tlb_finish_mmu.)
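
The cost difference described above can be sketched in user space. This is
only an illustrative model under stated assumptions: MODEL_HASHPTE is a
hypothetical stand-in for _PAGE_HASHPTE, the array of unsigned longs stands
in for the Linux PTEs of a range, and both function names are made up. The
functions count how many hash table searches a range flush would perform.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HASHPTE 0x1UL	/* hypothetical stand-in for _PAGE_HASHPTE */

/* Searches needed while the Linux PTEs for the range are still readable. */
static size_t searches_with_ptes(const unsigned long *ptes, size_t npages)
{
	size_t i, searches = 0;

	assert(ptes != NULL || npages == 0);
	for (i = 0; i < npages; i++)
		if (ptes[i] & MODEL_HASHPTE)
			searches++;	/* only these can have a hash table entry */
	return searches;
}

/* Once the PTE pages are gone, every page in the range must be searched. */
static size_t searches_without_ptes(size_t npages)
{
	return npages;
}
```

For a range of 8 pages of which 2 have the bit set, the first function
returns 2 and the second returns 8, which is the gap the text attributes to
flushing after the page tables have been freed.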

Therefore I concluded that the UP optimization was actually a
pessimization for classic PPC.

The fact that flushes cost in proportion to the size of the range
being flushed is the reason behind the code that flushes and starts a
new range if address - tlb->end > 32 * PAGE_SIZE.  The 32 is a number
that could use some tuning; we get some advantage from batching up the
flushes but that advantage will be lost if we have to look through
large ranges of addresses where there were no valid PTEs.
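
The gap heuristic can be modelled in user space as follows. This is a
minimal sketch, not the kernel code: struct gather, gather_add(),
gather_flush(), the 4096-byte PAGE_SIZE and the two counters are all
assumptions made for illustration, and the FREE_PTE_NR cap is left out. It
counts how many page slots the range flushes would have to walk.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL	/* assumed page size for the model */
#define GAP_PAGES 32UL		/* the tunable threshold from the text */

/* Hypothetical user-space stand-in for the PPC mmu_gather_t. */
struct gather {
	unsigned long nr;		/* pages batched since the last flush */
	unsigned long start, end;	/* first and last address in the batch */
	unsigned long flushes;		/* number of range flushes issued */
	unsigned long pages_scanned;	/* page slots walked by those flushes */
};

/* Flush the pending range [start, end] and account for its cost. */
static void gather_flush(struct gather *g)
{
	if (g->nr == 0)
		return;
	assert(g->end >= g->start);
	g->flushes++;
	g->pages_scanned += (g->end - g->start) / PAGE_SIZE + 1;
	g->nr = 0;
}

/* Add one address; ascending order is assumed, as in an unmap walk. */
static void gather_add(struct gather *g, unsigned long address)
{
	if (g->nr == 0) {
		g->start = address;
	} else if (address - g->end > GAP_PAGES * PAGE_SIZE) {
		/* Large hole: flush what we have rather than scan the hole. */
		gather_flush(g);
		g->start = address;
	}
	g->end = address;
	g->nr++;
}
```

Adding two adjacent pages and then a page 1000 pages further on yields two
flushes that walk 3 page slots in total; a single range covering all three
would have walked 1001.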

> I'd actually rather try to share more of the code, if possible.

I'll see what I can do along those lines...

Paul.


* Re: Linux-2.5.16
  2002-05-20 23:55               ` Linux-2.5.16 Paul Mackerras
@ 2002-05-21  0:18                 ` Paul Mackerras
  0 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2002-05-21  0:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

I wrote:

> The UP optimization would be slower on "classic" PPCs because you end
> up doing a flush_tlb_mm rather than flushing the ranges of addresses
> where there are actually PTEs present.

Of course, the part of the UP optimization where we free the page
immediately would still apply on classic PPC.

Paul.


* [PATCH] TLB changes (was Re: Linux-2.5.16)
  2002-05-20 23:37                 ` Linux-2.5.16 David S. Miller
@ 2002-05-21  1:02                   ` David S. Miller
  0 siblings, 0 replies; 32+ messages in thread
From: David S. Miller @ 2002-05-21  1:02 UTC (permalink / raw)
  To: torvalds; +Cc: paulus, linux-kernel


Ok, here is what I'm playing with now on sparc64.  It seems to
work so far and I'm stressing it out with a 64-bit gcc-3.1 bootstrap.

1) We differentiate between unmapping for munmap() type operations
   and flushing out the entire address space.

   tlb->full_mm_flush keeps track of that, initialized via
   tlb_gather_mmu().

   In this way a platform that wants to do the per-VMA flushes
   can do so, and still get the:

	flush_cache_mm(mm);
	.. flush all VMAs ...
	flush_tlb_mm(mm);

   when clearing out the entire address space.

2) The {pmd,pte}_free_tlb stuff needs to know which part of the
   address space that pmd/pte came from in order to flush it
   properly.

   Basically, it needs to have the same information that
   flush_tlb_pgtables() had access to.

   So I made the page table clearing keep track of this.

   This is an area that undoubtedly can be optimized further.

   For example, if we keep track of the first pte_page_nr fully
   freed and also the last one fully freed, we can just do a single
   flush at the end.

   If we move in that direction, I don't think it makes sense to
   provide two different routines anymore.  Just one:

	flush_page_tables(mm, first_pte_page_nr, last_pte_page_nr);

   Actually, I'm not so sure this meshes well with what the PPC
   folks want to accomplish to flush the hash tables efficiently.
   Paul, comments?

3) As a consequence of #2 being able to do the page table flushing
   we can totally kill off flush_tlb_pgtables.  It is buggy and
   the work is to be done by the TLB infrastructure.
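
The intent of point 1 can be modelled in user space. This is a rough sketch
under loud assumptions: the struct and function names are hypothetical, the
counters stand in for flush_tlb_range() and flush_tlb_mm(), and skipping
the per-VMA flush in full mode is an assumption about the stated goal (the
sparc64 hunk as posted does not itself skip it).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for the real flush primitives. */
struct flush_stats {
	unsigned long range_flushes;	/* per-VMA range flushes */
	unsigned long mm_flushes;	/* whole-mm flushes */
};

/* Hypothetical model of the gather state; only the flag matters here. */
struct gather_model {
	int full_mm_flush;		/* non-zero: whole address space goes */
	struct flush_stats *stats;
};

/* Per-VMA hook: in this model, flush the range only for partial unmaps. */
static void model_end_vma(struct gather_model *g)
{
	if (!g->full_mm_flush)
		g->stats->range_flushes++;
}

/* Final hook: one whole-mm flush when the entire address space goes. */
static void model_finish(struct gather_model *g)
{
	if (g->full_mm_flush)
		g->stats->mm_flushes++;
}

/* Unmap nr_vmas VMAs and return the total number of flush operations. */
static unsigned long model_unmap(struct flush_stats *stats, int full,
				 unsigned int nr_vmas)
{
	struct gather_model g = { full, stats };
	unsigned int i;

	assert(stats != NULL);
	for (i = 0; i < nr_vmas; i++)
		model_end_vma(&g);
	model_finish(&g);
	return stats->range_flushes + stats->mm_flushes;
}
```

In this model, tearing down five VMAs with the flag set costs one whole-mm
flush, while unmapping the same five VMAs without it costs five range
flushes.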

Comments?

--- ./include/asm-generic/tlb.h.~1~	Mon May 20 16:31:23 2002
+++ ./include/asm-generic/tlb.h	Mon May 20 17:16:56 2002
@@ -28,6 +28,7 @@ typedef struct free_pte_ctx {
 	struct mm_struct	*mm;
 	unsigned long		nr;	/* set to ~0UL means fast mode */
 	unsigned long		freed;
+	int			full_mm_flush;	/* non-zero means full address space flush */
 	struct page *		pages[FREE_PTE_NR];
 } mmu_gather_t;
 
@@ -35,18 +36,19 @@ typedef struct free_pte_ctx {
 extern mmu_gather_t	mmu_gathers[NR_CPUS];
 
 /* Do me later */
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
+#define tlb_start_vma(tlb, vma, start, end) do { } while (0)
+#define tlb_end_vma(tlb, vma, start, end) do { } while (0)
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized mmu_gather_t.
  */
-static inline mmu_gather_t *tlb_gather_mmu(struct mm_struct *mm)
+static inline mmu_gather_t *tlb_gather_mmu(struct mm_struct *mm, int full_mm_flush)
 {
 	mmu_gather_t *tlb = &mmu_gathers[smp_processor_id()];
 
 	tlb->mm = mm;
 	tlb->freed = 0;
+	tlb->full_mm_flush = full_mm_flush;
 
 	/* Use fast mode if only one CPU is online */
 	tlb->nr = smp_num_cpus > 1 ? 0UL : ~0UL;
@@ -57,7 +59,10 @@ static inline void tlb_flush_mmu(mmu_gat
 {
 	unsigned long nr;
 
-	flush_tlb_mm(tlb->mm);
+	if (tlb->full_mm_flush)
+		flush_tlb_mm(tlb->mm);
+	else
+		tlb_flush_mm(tlb->mm);
 	nr = tlb->nr;
 	if (nr != ~0UL) {
 		unsigned long i;
--- ./include/asm-sparc64/tlb.h.~1~	Mon May 20 16:30:00 2002
+++ ./include/asm-sparc64/tlb.h	Mon May 20 17:21:49 2002
@@ -1 +1,29 @@
+#define tlb_flush_mm(mm)			do { } while (0)
+
 #include <asm-generic/tlb.h>
+
+/* We need to flush at the VMA level.  */
+#undef tlb_start_vma
+#define tlb_start_vma(tlb, vma, start, end) \
+	flush_cache_range(vma, start, end)
+#undef tlb_end_vma
+#define tlb_end_vma(tlb, vma, start, end) \
+	flush_tlb_range(vma, start, end)
+
+#define pmd_free_tlb(tlb, pmd, pmd_page_nr)	do { } while (0)
+
+static __inline__ void pte_free_tlb(mmu_gather_t *tlb, struct page *pte,
+	unsigned long pte_page_nr)
+{
+	tlb_remove_page(tlb, pte);
+
+	if (!tlb->full_mm_flush) {
+		unsigned long vpte_addr;
+
+		vpte_addr = (tlb_type == spitfire ?
+			     VPTE_BASE_SPITFIRE :
+			     VPTE_BASE_CHEETAH);
+		vpte_addr += (pte_page_nr << PAGE_SHIFT);
+		flush_tlb_vpte(tlb->mm, vpte_addr);
+	}
+}
--- ./include/asm-sparc64/tlbflush.h.~1~	Mon May 20 17:09:24 2002
+++ ./include/asm-sparc64/tlbflush.h	Mon May 20 17:15:01 2002
@@ -22,12 +22,12 @@ extern void __flush_tlb_kernel_range(uns
 	__flush_tlb_kernel_range(start,end)
 
 #define flush_tlb_mm(__mm) \
-do { if(CTX_VALID((__mm)->context)) \
+do { if (CTX_VALID((__mm)->context)) \
 	__flush_tlb_mm(CTX_HWBITS((__mm)->context), SECONDARY_CONTEXT); \
 } while(0)
 
 #define flush_tlb_range(__vma, start, end) \
-do { if(CTX_VALID((__vma)->vm_mm->context)) { \
+do { if (CTX_VALID((__vma)->vm_mm->context)) { \
 	unsigned long __start = (start)&PAGE_MASK; \
 	unsigned long __end = PAGE_ALIGN(end); \
 	__flush_tlb_range(CTX_HWBITS((__vma)->vm_mm->context), __start, \
@@ -38,11 +38,18 @@ do { if(CTX_VALID((__vma)->vm_mm->contex
 
 #define flush_tlb_page(vma, page) \
 do { struct mm_struct *__mm = (vma)->vm_mm; \
-     if(CTX_VALID(__mm->context)) \
+     if (CTX_VALID(__mm->context)) \
 	__flush_tlb_page(CTX_HWBITS(__mm->context), (page)&PAGE_MASK, \
 			 SECONDARY_CONTEXT); \
 } while(0)
 
+#define flush_tlb_vpte(mm, addr) \
+do { struct mm_struct *__mm = (mm); \
+     if (CTX_VALID(__mm->context)) \
+	__flush_tlb_page(CTX_HWBITS(__mm->context), (addr)&PAGE_MASK, \
+			 SECONDARY_CONTEXT); \
+} while(0)
+
 #else /* CONFIG_SMP */
 
 extern void smp_flush_tlb_all(void);
@@ -61,33 +68,9 @@ extern void smp_flush_tlb_page(struct mm
 	smp_flush_tlb_kernel_range(start, end)
 #define flush_tlb_page(vma, page) \
 	smp_flush_tlb_page((vma)->vm_mm, page)
+#define flush_tlb_vpte(mm, addr) \
+	smp_flush_tlb_page((mm), addr)
 
 #endif /* ! CONFIG_SMP */
-
-static __inline__ void flush_tlb_pgtables(struct mm_struct *mm, unsigned long start,
-					  unsigned long end)
-{
-	/* Note the signed type.  */
-	long s = start, e = end, vpte_base;
-	if (s > e)
-		/* Nobody should call us with start below VM hole and end above.
-		   See if it is really true.  */
-		BUG();
-#if 0
-	/* Currently free_pgtables guarantees this.  */
-	s &= PMD_MASK;
-	e = (e + PMD_SIZE - 1) & PMD_MASK;
-#endif
-	vpte_base = (tlb_type == spitfire ?
-		     VPTE_BASE_SPITFIRE :
-		     VPTE_BASE_CHEETAH);
-	{
-		struct vm_area_struct vma;
-		vma.vm_mm = mm;
-		flush_tlb_range(&vma,
-				vpte_base + (s >> (PAGE_SHIFT - 3)),
-				vpte_base + (e >> (PAGE_SHIFT - 3)));
-	}
-}
 
 #endif /* _SPARC64_TLBFLUSH_H */
--- ./mm/memory.c.~1~	Mon May 20 16:31:43 2002
+++ ./mm/memory.c	Mon May 20 17:24:43 2002
@@ -75,7 +75,8 @@ mem_map_t * mem_map;
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static inline void free_one_pmd(mmu_gather_t *tlb, pmd_t * dir)
+static inline void free_one_pmd(mmu_gather_t *tlb, pmd_t * dir,
+	unsigned long pte_page_nr)
 {
 	struct page *pte;
 
@@ -88,28 +89,32 @@ static inline void free_one_pmd(mmu_gath
 	}
 	pte = pmd_page(*dir);
 	pmd_clear(dir);
-	pte_free_tlb(tlb, pte);
+	pte_free_tlb(tlb, pte, pte_page_nr);
 }
 
-static inline void free_one_pgd(mmu_gather_t *tlb, pgd_t * dir)
+static inline unsigned long free_one_pgd(mmu_gather_t *tlb, pgd_t * dir,
+	unsigned long pte_page_nr)
 {
 	int j;
 	pmd_t * pmd;
 
 	if (pgd_none(*dir))
-		return;
+		goto out;
 	if (pgd_bad(*dir)) {
 		pgd_ERROR(*dir);
 		pgd_clear(dir);
-		return;
+		goto out;
 	}
 	pmd = pmd_offset(dir, 0);
 	pgd_clear(dir);
 	for (j = 0; j < PTRS_PER_PMD ; j++) {
 		prefetchw(pmd+j+(PREFETCH_STRIDE/16));
-		free_one_pmd(tlb, pmd+j);
+		free_one_pmd(tlb, pmd+j, pte_page_nr+j);
 	}
-	pmd_free_tlb(tlb, pmd);
+	pmd_free_tlb(tlb, pmd, (dir - tlb->mm->pgd));
+
+out:
+	return pte_page_nr + PTRS_PER_PMD;
 }
 
 /*
@@ -121,10 +126,12 @@ static inline void free_one_pgd(mmu_gath
 void clear_page_tables(mmu_gather_t *tlb, unsigned long first, int nr)
 {
 	pgd_t * page_dir = tlb->mm->pgd;
+	unsigned long pte_page_nr;
 
 	page_dir += first;
+	pte_page_nr = first * PTRS_PER_PMD;
 	do {
-		free_one_pgd(tlb, page_dir);
+		pte_page_nr = free_one_pgd(tlb, page_dir, pte_page_nr);
 		page_dir++;
 	} while (--nr);
 
@@ -394,13 +401,11 @@ void unmap_page_range(mmu_gather_t *tlb,
 	if (address >= end)
 		BUG();
 	dir = pgd_offset(vma->vm_mm, address);
-	tlb_start_vma(tlb, vma);
 	do {
 		zap_pmd_range(tlb, dir, address, end - address);
 		address = (address + PGDIR_SIZE) & PGDIR_MASK;
 		dir++;
 	} while (address && (address < end));
-	tlb_end_vma(tlb, vma);
 }
 
 /*
@@ -427,8 +432,10 @@ void zap_page_range(struct vm_area_struc
 	spin_lock(&mm->page_table_lock);
 	flush_cache_range(vma, address, end);
 
-	tlb = tlb_gather_mmu(mm);
+	tlb = tlb_gather_mmu(mm, 0);
+	tlb_start_vma(tlb, vma, address, end);
 	unmap_page_range(tlb, vma, address, end);
+	tlb_end_vma(tlb, vma, address, end);
 	tlb_finish_mmu(tlb, start, end);
 	spin_unlock(&mm->page_table_lock);
 }
--- ./mm/mmap.c.~1~	Mon May 20 16:55:53 2002
+++ ./mm/mmap.c	Mon May 20 17:22:03 2002
@@ -785,10 +785,8 @@ no_mmaps:
 	 */
 	start_index = pgd_index(first);
 	end_index = pgd_index(last);
-	if (end_index > start_index) {
+	if (end_index > start_index)
 		clear_page_tables(tlb, start_index, end_index - start_index);
-		flush_tlb_pgtables(mm, first & PGDIR_MASK, last & PGDIR_MASK);
-	}
 }
 
 /* Normal function to fix up a mapping
@@ -846,7 +844,7 @@ static void unmap_region(struct mm_struc
 {
 	mmu_gather_t *tlb;
 
-	tlb = tlb_gather_mmu(mm);
+	tlb = tlb_gather_mmu(mm, 0);
 
 	do {
 		unsigned long from, to;
@@ -854,7 +852,9 @@ static void unmap_region(struct mm_struc
 		from = start < mpnt->vm_start ? mpnt->vm_start : start;
 		to = end > mpnt->vm_end ? mpnt->vm_end : end;
 
+		tlb_start_vma(tlb, mpnt, from, to);
 		unmap_page_range(tlb, mpnt, from, to);
+		tlb_end_vma(tlb, mpnt, from, to);
 	} while ((mpnt = mpnt->vm_next) != NULL);
 
 	free_pgtables(tlb, prev, start, end);
@@ -1107,7 +1107,7 @@ void exit_mmap(struct mm_struct * mm)
 	release_segments(mm);
 	spin_lock(&mm->page_table_lock);
 
-	tlb = tlb_gather_mmu(mm);
+	tlb = tlb_gather_mmu(mm, 1);
 
 	flush_cache_mm(mm);
 	mpnt = mm->mmap;

* Re: Linux-2.5.16
  2002-05-20 12:43           ` Linux-2.5.16 Paul Mackerras
  2002-05-20 16:13             ` Linux-2.5.16 Linus Torvalds
@ 2002-05-21  5:10             ` Linus Torvalds
  2002-05-21  5:10               ` Linux-2.5.16 David S. Miller
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2002-05-21  5:10 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Kernel Mailing List



On Mon, 20 May 2002, Paul Mackerras wrote:
>
> This patch splits up the existing tlb_remove_page into
> tlb_remove_tlb_entry (for pages that are/were mapped into userspace)
> and tlb_remove_page, as you suggested.  It also adds the necessary
> stuff for PPC, which has its own include/asm-ppc/tlb.h now.  This
> works on at least one PPC machine. :)

Looking into this a bit more, I came to the conclusion that this cannot
actually work on ppc.

The start/end address optimization cannot work the way you intended it
to, because the "tlb_remove_page()" semantics _only_ call it for "real
pages". Thus anything that is unknown to the kernel VM (i.e. reserved or
outside the kernel mem_map[]) will never cause that interface to be
called, since those pages will not be freed.

I solved this by splitting up the interface: the tlb_remove_page() thing
still exists and does the same old thing, but I added a _separate_
"tlb_remove_tlb_entry()" which on x86 is a no-op, and that gets called on
each present pte entry.

So the loop basically looks like

	tlb = tlb_gather_mmu(mm);
	for_each_vma() {
		tlb_start_vma(tlb, vma);
		for_each_pte() {
			tlb_remove_tlb_entry(tlb, pte, address);
			if (pte_valid_page()) {
				tlb->freed++;
				tlb_remove_page(tlb, page);
			}
		}
		tlb_end_vma(tlb, vma);
	}
	tlb_finish_mmu(tlb, start, end);

where the x86 defines tlb_start_vma/tlb_end_vma/tlb_remove_tlb_entry to be
no-ops, because on the x86 we always just do a full TLB flush when we fill
up the free buffer.
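
To make the split concrete, here is a small userspace model of that
loop. It is only a sketch: struct split_pte and the split_* helpers are
made-up stand-ins, not kernel types, and the counters replace the real
TLB and page-freeing work. The point it shows is that the per-entry hook
sees every present pte, while the page hook only sees ptes backed by a
page the VM knows about.

```c
#include <assert.h>
#include <stddef.h>

/* Toy pte, not the kernel's pte_t. */
struct split_pte {
	int present;	/* the pte maps something */
	int has_page;	/* backed by a page in mem_map[] (not reserved) */
};

struct split_gather {
	unsigned long entries_seen;	/* calls to the per-entry hook */
	unsigned long freed;		/* calls to the page hook */
};

/* Models tlb_remove_tlb_entry(): runs for each present pte. */
static void split_remove_tlb_entry(struct split_gather *tlb)
{
	tlb->entries_seen++;
}

/* Models tlb->freed++ followed by tlb_remove_page(). */
static void split_remove_page(struct split_gather *tlb)
{
	tlb->freed++;
}

static void split_unmap(struct split_gather *tlb,
			const struct split_pte *ptes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!ptes[i].present)
			continue;
		split_remove_tlb_entry(tlb);
		if (ptes[i].has_page)
			split_remove_page(tlb);
	}
}
```

With one reserved mapping among three present ptes, the entry hook runs
three times and the page hook twice, which is exactly the case a
start/end tracker hung off tlb_remove_page() alone would miss.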

For other architectures, any of the following may be a good idea:

 - tlb_end_vma() looks at the type of the VMA, and flushes that type only
   (ie if the vma was not executable, you can avoid the ITLB flush)

 - tlb_remove_tlb_entry() can try to remove the entry proactively from MMU
   hash chains like on the PPC: it gets enough information to do so. In
   this case, the tlb_remove_page() thing might decide to only flush the
   on-chip TLB, since it knows that the off-chip TLB has already been
   cleared.

   (For this reason, tlb_finish_mmu() doesn't actually call the old
   "flush_tlb_mm()", but instead calls a "tlb_flush(tlb)", so that the
   architecture can know that it doesn't need to flush the whole mm)

 - For architectures that hide hints in the PTE ("this pte has been loaded
   for an iTLB access"), the tlb_remove_tlb_entry() routine might just OR
   that bit into a tlb "status" word, and then tlb_flush(tlb) might
   decide to only flush the DTLB if it never saw any TLB entries that had
   the iTLB bit set. This should work for ia64 or alpha.
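
The last variant could look roughly like the sketch below. Everything in
it is an assumption made for illustration: HINT_PTE_EXEC is a made-up
bit, struct hint_tlb is not the real mmu_gather_t, and the return value
merely reports which flush a real tlb_flush(tlb) would have chosen.

```c
#include <assert.h>

/* Hypothetical "this pte was loaded for an iTLB access" hint bit. */
#define HINT_PTE_EXEC	0x1UL

enum hint_flush { HINT_FLUSH_NONE, HINT_FLUSH_DTLB_ONLY, HINT_FLUSH_BOTH };

struct hint_tlb {
	unsigned long status;	/* OR of the hint bits of all removed ptes */
	unsigned long entries;	/* present ptes seen since the last flush */
};

/* Models tlb_remove_tlb_entry(): accumulate the hint, count the entry. */
static void hint_remove_tlb_entry(struct hint_tlb *tlb, unsigned long pte_bits)
{
	tlb->status |= pte_bits & HINT_PTE_EXEC;
	tlb->entries++;
}

/* Models tlb_flush(tlb): pick the cheapest flush that is still safe. */
static enum hint_flush hint_tlb_flush(struct hint_tlb *tlb)
{
	enum hint_flush what;

	if (!tlb->entries)
		what = HINT_FLUSH_NONE;
	else if (tlb->status & HINT_PTE_EXEC)
		what = HINT_FLUSH_BOTH;
	else
		what = HINT_FLUSH_DTLB_ONLY;

	/* Start the next batch from a clean slate. */
	tlb->status = 0;
	tlb->entries = 0;
	return what;
}
```

The per-pte cost is a single OR, which is why this kind of hint fits the
"little or no overhead" goal.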

I tried to make the interface fairly generic, so that people could easily
do any (or none, like on x86) of these optimizations with little or no
overhead.

It's there in 2.5.17..

		Linus


* Re: Linux-2.5.16
  2002-05-21  5:10             ` Linux-2.5.16 Linus Torvalds
@ 2002-05-21  5:10               ` David S. Miller
  2002-05-21 16:01                 ` Linux-2.5.16 Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2002-05-21  5:10 UTC (permalink / raw)
  To: torvalds; +Cc: paulus, linux-kernel

   From: Linus Torvalds <torvalds@transmeta.com>
   Date: Mon, 20 May 2002 22:10:31 -0700 (PDT)
   
   I tried to make the interface fairly generic, so that people could easily
   do any (or none, like on x86) of these optimizations with little or no
   overhead.

It still needs to handle the "unmapping entire address space"
vs. "unmapping munmap-like operation".  Currently we'll flush
excessively when doing exit_mmap().

I'll go and hack this into your new stuff.

I guess you didn't read my emails from today for some reason,
I explained all of this.

* Re: Linux-2.5.16
  2002-05-21  5:10               ` Linux-2.5.16 David S. Miller
@ 2002-05-21 16:01                 ` Linus Torvalds
  2002-05-21 16:45                   ` Linux-2.5.16 Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2002-05-21 16:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: paulus, linux-kernel



On Mon, 20 May 2002, David S. Miller wrote:
>
> It still needs to handle the "unmapping entire address space"
> vs. "unmapping munmap-like operation".  Currently we'll flush
> excessively when doing exit_mmap().

I'm considering just re-doing exit_mmap() entirely, so that it shares no
real code.

> I'll go and hack this into your new stuff.

Don't hack it into the existing stuff, the exit_mmap() really is
very different.

For example, in the exit_mmap() case, we should tear down the page tables
in top-to-bottom order, and that makes all the "tlb->pages[]" stuff
entirely unnecessary: we can just remove the _top_ pgd, and once that is
done (and the TLB invalidated), we can remove the pmd's and the pte's at
our leisure without any fear of races.

None of the complicated TLB flushing is needed or wanted.

Also, exit_mmap() has no races with other threads etc, nor does it have
any reason to worry about the RSS of the process any more.
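
As a userspace toy, the top-to-bottom ordering might be sketched like
this. The names are hypothetical (struct teardown_mm is a two-level
stand-in, not a real page table), the TLB invalidate is just a counter,
and malloc()/free() replace the page table allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define TEARDOWN_TOP_ENTRIES 4

struct teardown_mm {
	/* Stands in for the pgd; each slot owns one lower-level table. */
	void *top[TEARDOWN_TOP_ENTRIES];
	int tlb_flushes;	/* pretend TLB invalidations issued */
};

/* Returns the number of lower-level tables freed. */
static int teardown_exit(struct teardown_mm *mm)
{
	void *detached[TEARDOWN_TOP_ENTRIES];
	int i, freed = 0;

	/* 1. Unhook everything at the top level first. */
	for (i = 0; i < TEARDOWN_TOP_ENTRIES; i++) {
		detached[i] = mm->top[i];
		mm->top[i] = NULL;
	}

	/* 2. One invalidate: nothing is reachable from the top any more. */
	mm->tlb_flushes++;

	/* 3. Free the lower levels at leisure, with no batching at all. */
	for (i = 0; i < TEARDOWN_TOP_ENTRIES; i++) {
		if (detached[i]) {
			free(detached[i]);
			freed++;
		}
	}
	return freed;
}
```

Because the lower levels are only freed after the single invalidate,
there is no window in which a stale translation can reach a freed table,
so no tlb->pages[] style deferral is needed in this model.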

		Linus


* Re: Linux-2.5.16
  2002-05-21 16:01                 ` Linux-2.5.16 Linus Torvalds
@ 2002-05-21 16:45                   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2002-05-21 16:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: paulus, linux-kernel



On Tue, 21 May 2002, Linus Torvalds wrote:
>
> For example, in the exit_mmap() case, we should tear down the page tables
> in top-to-bottom order, and that makes all the "tlb->pages[]" stuff
> entirely unnecessary: we can just remove the _top_ pgd, and once that is
> done (and the TLB invalidated), we can remove the pmd's and the pte's at
> our leisure without any fear of races.

Hmm.. We could simplify it even further by moving the exit_mmap() from
mmput() into mmdrop(), at which point we know that we exit the mm only
after nobody is using the thing any more at all, and it has been flushed
from the TLB's.

The only downside of that is that we currently do the mmdrop in the middle
of the context switch, and we'd have to move it to _after_ the context
switch. Which is slightly complicated. The other problem is that with lazy
TLB's, we might delay actually freeing the pages for a longish time
especially on big SMP machines (if the MM ends up being lazy on an idle
CPU for long)..
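
The difference between the two placements can be modelled with two
plain counters. This is a sketch only: struct rc_mm and the rc_* helpers
are made up, the counters are not atomic, and "users" and "count" merely
mirror the roles of mm_users and mm_count (all users together holding
one structure reference, lazy-TLB holders each holding their own).

```c
#include <assert.h>

struct rc_mm {
	int users;		/* like mm_users: users of the address space */
	int count;		/* like mm_count: refs to the structure itself */
	int teardown_in_drop;	/* 0: teardown in mmput, 1: teardown in mmdrop */
	int torn_down;		/* pretend exit_mmap() has run */
	int freed;		/* pretend the structure was released */
};

static void rc_exit_mmap(struct rc_mm *mm)
{
	mm->torn_down = 1;
}

/* Models mmdrop(): the last structure reference goes away. */
static void rc_mmdrop(struct rc_mm *mm)
{
	if (--mm->count == 0) {
		if (mm->teardown_in_drop)
			rc_exit_mmap(mm);
		mm->freed = 1;
	}
}

/* Models mmput(): the last real user of the address space goes away. */
static void rc_mmput(struct rc_mm *mm)
{
	if (--mm->users == 0) {
		if (!mm->teardown_in_drop)
			rc_exit_mmap(mm);
		rc_mmdrop(mm);	/* drop the reference held on behalf of users */
	}
}
```

With one lazy reference still outstanding, the mmput placement tears the
address space down immediately, while the mmdrop placement leaves it in
place until that last reference is dropped, which is the delay described
above.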

So while this approach would be absolutely wonderful from a TLB behaviour
standpoint, it might not be the best approach in some other ways. Ideas?

		Linus


Thread overview: 32+ messages
2002-05-18  7:57 Linux-2.5.16 Linus Torvalds
2002-05-18  8:05 ` Linux-2.5.16 Aschwin Marsman - aYniK Software Solutions
2002-05-18  8:21 ` Linux-2.5.16 Russell King
2002-05-18  9:51   ` Linux-2.5.16 Tomas Szepe
2002-05-18 11:28     ` Linux-2.5.16 Marcus Alanen
2002-05-18 15:38       ` Linux-2.5.16 Matthias Andree
2002-05-18 15:44         ` Linux-2.5.16 Tomas Szepe
2002-05-18  8:52 ` Linux-2.5.16 mikeH
2002-05-18 18:33   ` Linux-2.5.16 Andrew Morton
2002-05-20  0:33 ` Linux-2.5.16 Roman Zippel
2002-05-20  0:39   ` Linux-2.5.16 Linus Torvalds
2002-05-20  0:47     ` Linux-2.5.16 Linus Torvalds
2002-05-20  1:09       ` Linux-2.5.16 Paul Mackerras
2002-05-20  1:25         ` Linux-2.5.16 Linus Torvalds
2002-05-20 12:43           ` Linux-2.5.16 Paul Mackerras
2002-05-20 16:13             ` Linux-2.5.16 Linus Torvalds
2002-05-20 23:30               ` Linux-2.5.16 David S. Miller
2002-05-20 23:37                 ` Linux-2.5.16 David S. Miller
2002-05-21  1:02                   ` [PATCH] TLB changes (was Re: Linux-2.5.16) David S. Miller
2002-05-20 23:55               ` Linux-2.5.16 Paul Mackerras
2002-05-21  0:18                 ` Linux-2.5.16 Paul Mackerras
2002-05-21  5:10             ` Linux-2.5.16 Linus Torvalds
2002-05-21  5:10               ` Linux-2.5.16 David S. Miller
2002-05-21 16:01                 ` Linux-2.5.16 Linus Torvalds
2002-05-21 16:45                   ` Linux-2.5.16 Linus Torvalds
2002-05-20  1:15       ` Linux-2.5.16 Roman Zippel
2002-05-20  1:20         ` Linux-2.5.16 Linus Torvalds
2002-05-20  4:30       ` Linux-2.5.16 David S. Miller
2002-05-20 22:20       ` Linux-2.5.16 Roman Zippel
2002-05-20 23:36         ` [PATCH] Fix rss accounting Roman Zippel
2002-05-20  1:10     ` Linux-2.5.16 Roman Zippel
2002-05-20 17:57       ` Linux-2.5.16 Linus Torvalds
