linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Linux 2.4.10-pre11
  2001-09-18  0:08 Linux 2.4.10-pre11 Linus Torvalds
@ 2001-09-17 23:17 ` Marcelo Tosatti
  2001-09-18  1:08   ` Marcelo Tosatti
       [not found] ` <20010917211834.A31693@redhat.com>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-17 23:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli



On Mon, 17 Sep 2001, Linus Torvalds wrote:

> 
> Ok, the big thing here is continued merging, this time with Andrea.
> 
> I still don't like some of the VM changes, but integrating Andrea's VM
> changes results in (a) better performance and (b) much cleaner inactive
> page handling in particular. Besides, for the 2.4.x tree, the big priority
> is stability, we can re-address my other concerns during 2.5.x.

Andrea, 

Could you please make a resume of your VM changes ? 


Its hard to keep up with VM changes this way. 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Linux 2.4.10-pre11
@ 2001-09-18  0:08 Linus Torvalds
  2001-09-17 23:17 ` Marcelo Tosatti
                   ` (3 more replies)
  0 siblings, 4 replies; 110+ messages in thread
From: Linus Torvalds @ 2001-09-18  0:08 UTC (permalink / raw)
  To: Kernel Mailing List; +Cc: Andrea Arcangeli


Ok, the big thing here is continued merging, this time with Andrea.

I still don't like some of the VM changes, but integrating Andrea's VM
changes results in (a) better performance and (b) much cleaner inactive
page handling in particular. Besides, for the 2.4.x tree, the big priority
is stability, we can re-address my other concerns during 2.5.x.

This also merges the blkdev in page cache patch, and that will hopefully
make it noticeably easier to do the "do bread() with page cache too", at
which point a lot of the current ugly synchronization issues will go away.

Oh, and it gets the direct-IO improvements from Andrea too.

[ Other small patches from Andrea merged just to make future merges
  easier. ]

The console locking merge with Andrew Morton also moves us a bit closer to
the -ac tree..

Full 2.4.10 ChangeLog appended.

		Linus

------
pre11:
 - Neil Brown: md cleanups/fixes
 - Andrew Morton: console locking merge
 - Andrea Arkangeli: major VM merge

pre10:
 - Alan Cox: continued merging
 - Mingming Cao: make msgrcv/shmat check the queue/segment ID's properly
 - Greg KH: USB serial init failure fix, Xircom serial converter driver
 - Neil Brown: nsfd/raid/md/lockd cleanups
 - Ingo Molnar: multipath RAID personality, raid xor update
 - Hugh Dickins/Marcelo Tosatti: swapin read-ahead race fix
 - Vojtech Pavlik: fix up some of the infrastructure for x86-64
 - Robert Love: AMD 761 AGP GART support
 - Jens Axboe: fix SCSI-generic queue handling race
 - me: be sane about page reference bits

pre9:
 - Greg KH: start migration to new "min()/max()"
 - Roman Zippel: move affs over to "min()/max()".
 - Vojtech Pavlik: VIA update (make sure not to IRQ-unmask a vt82c576)
 - Jan Kara: quota bug-fix (don't decrement quota for non-counted inode)
 - Anton Altaparmakov: more NTFS updates
 - Al Viro: make nosuid/noexec/nodev be per-mount flags, not per-filesystem
 - Alan Cox: merge input/joystick layer differences, driver and alpha merge
 - Keith Owens: scsi Makefile cleanup
 - Trond Myklebust: fix oopsable race in locking code
 - Jean Tourrilhes: IrDA update

pre8:
 - Christoph Hellwig: clean up personality handling a bit
 - Robert Love: update sysctl/vm documentation
 - make the three-argument (that everybody hates) "min()" be "min_t()",
   and introduce a type-anal "min()" that complains about arguments of
   different types.

pre7:
 - Alan Cox: big driver/mips sync
 - Andries Brouwer, Christoph Hellwig: more gendisk fixups
 - Tobias Ringstrom: tulip driver workaround for DC21143 erratum

pre6:
 - Jens Axboe: remove trivially dead io_request_lock usage
 - Andrea Arcangeli: softirq cleanup and ARM fixes. Slab cleanups
 - Christoph Hellwig: gendisk handling helper functions/cleanups
 - Nikita Danilov: reiserfs dead code pruning
 - Anton Altaparmakov: NTFS update to 1.1.18
 - firestream network driver: patch reverted on authors request
 - NIIBE Yutaka: SH architecture update
 - Paul Mackerras: PPC cleanups, PPC8xx update.
 - me: reverse broken bootdata allocation patch that went into pre5

pre5:
 - Merge with Alan
 - Trond Myklebust: NFS fixes - kmap and root inode special case
 - Al Viro: more superblock cleanups, inode leak in rd.c, minix
   directories in page cache
 - Paul Mackerras: clean up rubbish from sl82c105.c
 - Neil Brown: md/raid cleanups, NFS filehandles
 - Johannes Erdfelt: USB update (usb-2.0 support, visor fix, Clie fix,
   pl2303 driver update)
 - David Miller: sparc and net update
 - Eric Biederman: simplify and correct bootdata allocation - don't
   overwrite ramdisks
 - Tim Waugh: support multiple SuperIO devices, parport doc updates

pre4:
 - Hugh Dickins: swapoff cleanups and speedups
 - Matthew Dharm: USB storage update
 - Keith Owens: Makefile fixes
 - Tom Rini: MPC8xx build fix
 - Nikita Danilov: reiserfs update
 - Jakub Jelinek: ELF loader fix for ET_DYN
 - Andrew Morton: reparent_to_init() for kernel threads
 - Christoph Hellwig: VxFS and SysV updates, vfs_permission fix

pre3:
 - Johannes Erdfelt, Oliver Neukum: USB printer driver race fix
 - John Byrne: fix stupid i386-SMP irq stack layout bug
 - Andreas Bombe, me: yenta IO window fix
 - Neil Brown: raid1 buffer state fix
 - David Miller, Paul Mackerras: fix up sparc and ppc respectively for kmap/kbd_rate
 - Matija Nalis: umsdos fixes, and make it possible to boot up with umsdos
 - Francois Romieu: fix bugs in dscc4 driver
 - Andy Grover: new PCI config space access functions (eventually for ACPI)
 - Albert Cranford: fix incorrect e2fsprog data from ver_linux script
 - Dave Jones: re-sync x86 setup code, fix macsonic kmalloc use
 - Johannes Erdfelt: remove obsolete plusb USB driver
 - Andries Brouwer: fix USB compact flash version info, add blksize ioctls

pre2:
 - Al Viro: block device cleanups
 - Marcelo Tosatti: make bounce buffer allocations more robust (it's ok
   for them to do IO, just not cause recursive bounce IO. So allow them)
 - Anton Altaparmakov: NTFS update (1.1.17)
 - Paul Mackerras: PPC update (big re-org)
 - Petko Manolov: USB pegasus driver fixes
 - David Miller: networking and sparc updates
 - Trond Myklebust: Export atomic_dec_and_lock
 - OGAWA Hirofumi: find and fix umsdos "filldir" users that were broken
   by the 64-bit-cleanups. Fix msdos warnings.
 - Al Viro: superblock handling cleanups and race fixes
 - Johannes Erdfelt++: USB updates

pre1:
 - Jeff Hartmann: DRM AGP/alpha cleanups
 - Ben LaHaise: highmem user pagecopy/clear optimization
 - Vojtech Pavlik: VIA IDE driver update
 - Herbert Xu: make cramfs work with HIGHMEM pages
 - David Fennell: awe32 ram size detection improvement
 - Istvan Varadi: umsdos EMD filename bug fix
 - Keith Owens: make min/max work for pointers too
 - Jan Kara: quota initialization fix
 - Brad Hards: Kaweth USB driver update (enable, and fix endianness)
 - Ralf Baechle: MIPS updates
 - David Gibson: airport driver update
 - Rogier Wolff: firestream ATM driver multi-phy support
 - Daniel Phillips: swap read page referenced set - avoid swap thrashing


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-17 23:17 ` Marcelo Tosatti
@ 2001-09-18  1:08   ` Marcelo Tosatti
  2001-09-18  3:37     ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  1:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli



On Mon, 17 Sep 2001, Marcelo Tosatti wrote:

> 
> 
> On Mon, 17 Sep 2001, Linus Torvalds wrote:
> 
> > 
> > Ok, the big thing here is continued merging, this time with Andrea.
> > 
> > I still don't like some of the VM changes, but integrating Andrea's VM
> > changes results in (a) better performance and (b) much cleaner inactive
> > page handling in particular. Besides, for the 2.4.x tree, the big priority
> > is stability, we can re-address my other concerns during 2.5.x.
> 
> Andrea, 
> 
> Could you please make a resume of your VM changes ? 
> 
> Its hard to keep up with VM changes this way. 

Andrea, 

I've just read a bit of your new VM code and I have a few comments.

You completly removed the "inactive freeable pages" logic: There is no
more distiction between "freeable inactive" and "free" pages. All VM work
is based on "freepages.high" watermark. I don't like that: it seems to
make page freeing more aggressive over time.

Also, if we have several try_to_free_pages() callers, for different
classzones, I'm right saying that a caller with a "smaller" classzone can
"hide" pages in its "active_local_lru" and/or "inactive_local_lru" (during
shrink_cache) from other processes trying to free pages from those higher
zones ?



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
       [not found]   ` <20010918035055.J698@athlon.random>
@ 2001-09-18  2:02     ` Andrea Arcangeli
       [not found]     ` <20010917221653.B31693@redhat.com>
  1 sibling, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  2:02 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 03:50:55AM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 17, 2001 at 09:18:34PM -0400, Benjamin LaHaise wrote:
> > On Mon, Sep 17, 2001 at 05:08:37PM -0700, Linus Torvalds wrote:
> > > 
> > > Ok, the big thing here is continued merging, this time with Andrea.
> > > 
> > > I still don't like some of the VM changes, but integrating Andrea's VM
> > > changes results in (a) better performance and (b) much cleaner inactive
> > > page handling in particular. Besides, for the 2.4.x tree, the big priority
> > > is stability, we can re-address my other concerns during 2.5.x.
> > 
> > What is the idea behind merging complete and total utter CRAP into the 
> > tree?  Lets take a few completely fubar'd hunks as examples.  Let's start 
> > with fs/buffer.c:
> > 
> > Point 1. It doesn't compile with egcs 1.1.2.  Pretty remarkable for the only 
> > recommended stable compiler.
> 
> Sorry, I use gcc 3.0.2, but I assume fixing the compilation trouble with
> egcs 1.1.2 is trivial. Can you show which is the interested line of
> code so we can rewrite it in a compatible manner?

ah I just got a (verbose) report about one compilation trouble with old
kernels, the 00_rwsem patch that includes this backwards compatibility
hunk wasn't applied. This is a very very minor issue. Please Linus
include this patch to fix the compilation troubles with the old
compilers:

diff -urN 2.4.10pre10/include/linux/compiler.h rwsem/include/linux/compiler.h
--- 2.4.10pre10/include/linux/compiler.h	Thu Jan  1 01:00:00 1970
+++ rwsem/include/linux/compiler.h	Tue Sep 18 02:02:08 2001
@@ -0,0 +1,13 @@
+#ifndef __LINUX_COMPILER_H
+#define __LINUX_COMPILER_H
+
+/* Somewhere in the middle of the GCC 2.96 development cycle, we implemented
+   a mechanism by which the user can annotate likely branch directions and
+   expect the blocks to be reordered appropriately.  Define __builtin_expect
+   to nothing for earlier compilers.  */
+
+#if __GNUC__ == 2 && __GNUC_MINOR__ < 96
+#define __builtin_expect(x, expected_value) (x)
+#endif
+
+#endif /* __LINUX_COMPILER_H */

Then you need to simply include <linux/compile.h> from the files that
doesn't compile, this second patch is untested but it should do the
trick (00_rwsem-?? was including compile.h from the rwsem includes that
were in turn included by those files implicitly so I didn't need to add
compile.h in each file):

--- 2.4.10pre11/mm/slab.c.~1~	Tue Sep 18 02:43:04 2001
+++ 2.4.10pre11/mm/slab.c	Tue Sep 18 04:00:30 2001
@@ -72,6 +72,7 @@
 #include	<linux/slab.h>
 #include	<linux/interrupt.h>
 #include	<linux/init.h>
+#include	<linux/compile.h>
 #include	<asm/uaccess.h>
 
 /*
--- 2.4.10pre11/mm/page_alloc.c.~1~	Tue Sep 18 02:43:04 2001
+++ 2.4.10pre11/mm/page_alloc.c	Tue Sep 18 04:00:20 2001
@@ -17,6 +17,7 @@
 #include <linux/pagemap.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/compile.h>
 
 int nr_swap_pages;
 int nr_active_pages;
--- 2.4.10pre11/mm/vmscan.c.~1~	Tue Sep 18 02:43:04 2001
+++ 2.4.10pre11/mm/vmscan.c	Tue Sep 18 04:00:47 2001
@@ -21,6 +21,7 @@
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/file.h>
+#include <linux/compile.h>
 
 #include <asm/pgalloc.h>
 

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:37     ` Andrea Arcangeli
@ 2001-09-18  2:25       ` Marcelo Tosatti
  2001-09-18  3:58         ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  2:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Mon, Sep 17, 2001 at 10:08:22PM -0300, Marcelo Tosatti wrote:
> > 
> > 
> > On Mon, 17 Sep 2001, Marcelo Tosatti wrote:
> > 
> > > 
> > > 
> > > On Mon, 17 Sep 2001, Linus Torvalds wrote:
> > > 
> > > > 
> > > > Ok, the big thing here is continued merging, this time with Andrea.
> > > > 
> > > > I still don't like some of the VM changes, but integrating Andrea's VM
> > > > changes results in (a) better performance and (b) much cleaner inactive
> > > > page handling in particular. Besides, for the 2.4.x tree, the big priority
> > > > is stability, we can re-address my other concerns during 2.5.x.
> > > 
> > > Andrea, 
> > > 
> > > Could you please make a resume of your VM changes ? 
> > > 
> > > Its hard to keep up with VM changes this way. 
> > 
> > Andrea, 
> > 
> > I've just read a bit of your new VM code and I have a few comments.
> > 
> > You completly removed the "inactive freeable pages" logic: There is no
> 
> yes, it wasn't relly useful to keep this list lazily, you either keep it
> enforced with locking overhead or such information isn't valuable.
> 
> > more distiction between "freeable inactive" and "free" pages. All VM work
> > is based on "freepages.high" watermark. I don't like that: it seems to
> 
> hardly on freepages.high:
> 
> diff -urN vm-ref/mm/swap.c vm/mm/swap.c
> --- vm-ref/mm/swap.c	Tue Sep 18 00:18:17 2001
> +++ vm/mm/swap.c	Tue Sep 18 00:18:35 2001
> @@ -24,50 +24,13 @@
>  #include <asm/uaccess.h> /* for copy_to/from_user */
>  #include <asm/pgtable.h>
>  
> -/*
> - * We identify three levels of free memory.  We never let free mem
> - * fall below the freepages.min except for atomic allocations.  We
> - * start background swapping if we fall below freepages.high free
> - * pages, and we begin intensive swapping below freepages.low.
> - *
> - * Actual initialization is done in mm/page_alloc.c
> - */
> -freepages_t freepages = {
> -	0,	/* freepages.min */
> -	0,	/* freepages.low */
> -	0	/* freepages.high */
> -};
> -
> 
> > make page freeing more aggressive over time.
> 
> I don't see your point with "page freeing more aggressive over time".

I meant "zone->freepages.high" (or something like that... you get the
idea). 

My point is: we're not going to start aging pte's until we have a
zone shortage, right ?

With the old scheme, we would differentiate "inactive shortage" from "free
shortage".

> 
> > Also, if we have several try_to_free_pages() callers, for different
> > classzones, I'm right saying that a caller with a "smaller" classzone can
> > "hide" pages in its "active_local_lru" and/or "inactive_local_lru" (during
> > shrink_cache) from other processes trying to free pages from those higher
> > zones ?
> 
> I'm deeply impressed, you seem to have understood the rewrite greatly
> well, congrats, this "hiding" was infact my main concern I had on the
> memclass check during shrink_cache, but I don't think this will ever
> give us troubles.  In such there are suprious swapouts with HIGHMEM
> we'll just need to waste some cpu by lefting those pages visible with a
> few changes in shrink_cache, but again I'm almost sure there won't be
> problems, we do multiple scans before failing so those pages will return
> visible before the other task has a chance to fail the allocation.

I really think this will cause problems in practice, Andrea.

Moreover, the whole active_local_lru has a _bad_ effect on writeout
clustering:

shrink_cache() callers on low classzone can "hide" pages from higher
classzone callers and avoid clustering. Look:

We have a nice big sequentially ordered list of pages to writeout. 

lowzone shrink_cache() moves higher zones pages (from this ordered list)
to its inactive/active_local_lru until it calls schedule() (due to
need_resched()). (for example)

Now we have a higher classzone caller which finds "half of the block" and
writes it out.

>From this point on, we breaked in half an ordered list of pages which
could be nicely merged together at ll_rw_block().

The example I gave which splits one sequentially ordered list of pages in
two blocks is a simple and "stupid" one. But think how badly can we have
pages ordered during a long time of VM activity.




^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
       [not found]     ` <20010917221653.B31693@redhat.com>
@ 2001-09-18  2:27       ` Linus Torvalds
  2001-09-18  3:14         ` Alan Cox
  2001-09-18  3:26         ` Andrea Arcangeli
       [not found]       ` <20010918052201.N698@athlon.random>
  1 sibling, 2 replies; 110+ messages in thread
From: Linus Torvalds @ 2001-09-18  2:27 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andrea Arcangeli, Kernel Mailing List


On Mon, 17 Sep 2001, Benjamin LaHaise wrote:
>
> __builtin_expect.  It just goes to show that you haven't bothered to get any
> feedback on this patch which is significantly invasive before merging it.
> This is completely unreasonable behaviour for a kernel series which is
> supposed to be stable.

No, it goes to show that I didn't actually apply all of Andrea's patches.
I left out a number of patches that I didn't see the point to, or that I
just plain disagreed with.

And some of the patches had some non-obvious dependencies, especially as I
have a reasonably recent compiler..

> The code in question does not attempt to explain itself at all.  Your release
> notes did not explain what changes you did at all.  Nor are your patches
> split up into reasonable chunks of functionality that can be evaluated based
> on their individual merit.  All or nothing is *not* the approach that should
> be taken at present.  (Hint, stability is acheived gradually.)

Actually, many of them _are_ split up, much more so than Alan's public
patches are (now, Alan in private splits up the patches really well, so
for me it tends to be even easier to apply Alans patches than Andreas, but
as with Alan, my one-liner "merged with xxxx" doesn't go into the detail
that Andrea and others actually do have).

Now, the VM part of the patch was certainly fairly big. I doubt much of it
could have been reasonably split up, though.

> No, what I'm deeply concerned with is the nature of patch merging.  There was
> no obvious public testing of the changes before merging, no warning of it, the
> patch contained obvious bogus chunks and many unrelated changes.  I've never
> seen as invasive a patch merged that ran the risk of completely torpedoing
> stability merged into a STABLE KERNEL SERIES, nor would I ever consider
> submitting such a patch.  There are bug fixes that are outstanding in -ac
> that aren't being merged to -linus, yet this completely untested pile of messy
> code is merged without hesitation?

Without hesitation? Hardly.

The bug fixes in -ac that aren't merged into -linus are that way BECAUSE
NOBODY HAS SENT ME MERGES.

Alan works on it quite intensively, but the fact is, that for the -ac
merge, Alan seems to be able to merge it slower than -ac grows. Which is
why I actually started asking people to merge their parts from -ac into
-linus to help Alan. That's how the other merge in -pre11 happened.

The aa tree has existed for a long time, and is actually mainted in better
chunks than the -ac tree, which makes it easier to merge. Admittedly my
and Alans tastes are often closer than mine and Andreas, which means that
the -aa tree merges are more painful for _me_ ;)

		Linus


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:58         ` Andrea Arcangeli
@ 2001-09-18  2:53           ` Marcelo Tosatti
  2001-09-18  4:54             ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  2:53 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Mon, Sep 17, 2001 at 11:25:09PM -0300, Marcelo Tosatti wrote:
> > My point is: we're not going to start aging pte's until we have a
> > zone shortage, right ?
> 
> correct, so we don't waste time if there's only filesystem cache
> pressure during the whole kernel uptime (not an unlikely scenario if you
> have enough ram like on servers).
> 
> > I really think this will cause problems in practice, Andrea.
> 
> yes, this hiding could infact explain the report you posted. I'll try to
> reproduce with highmem emulation and to fix it as soon as I can
> reproduce, also if you are interested to hack on it too feel free to
> send patches of course. thanks.

Andrea,

Personally I think it is too late in the 2.4.x series to integrate your
code.

I have nothing against the code itself (the "old" code also had bugs), but
a major VM rewrite at this point seems to be dangerous if we want a stable
VM.

Don't you agree that your code can introduce new stability bugs ?

Linus, please, if you want to integrate Andrea's code do it in 2.5.x, but
not 2.4.x. (yes, I'm expecting you to scream at me now :))


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  2:27       ` Linus Torvalds
@ 2001-09-18  3:14         ` Alan Cox
  2001-09-18  3:26         ` Andrea Arcangeli
  1 sibling, 0 replies; 110+ messages in thread
From: Alan Cox @ 2001-09-18  3:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Benjamin LaHaise, Andrea Arcangeli, Kernel Mailing List

> The bug fixes in -ac that aren't merged into -linus are that way BECAUSE
> NOBODY HAS SENT ME MERGES.

Actually some of them I've sent you many times. The tlb fix I have on 
my list of "dont bother" for example - which means I got bored of sending
it.

> Alan works on it quite intensively, but the fact is, that for the -ac
> merge, Alan seems to be able to merge it slower than -ac grows. Which is
> why I actually started asking people to merge their parts from -ac into
> -linus to help Alan. That's how the other merge in -pre11 happened.

I've been trying to ensure I feed stuff in testable chunks. For example
I dont want vfs and scsi changes both in a Linus merge because someone is
bound to go "hey I got corruption" and then with both in one merge we
are screwed

Alan

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  2:27       ` Linus Torvalds
  2001-09-18  3:14         ` Alan Cox
@ 2001-09-18  3:26         ` Andrea Arcangeli
  1 sibling, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  3:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Benjamin LaHaise, Kernel Mailing List

On Mon, Sep 17, 2001 at 07:27:31PM -0700, Linus Torvalds wrote:
> Now, the VM part of the patch was certainly fairly big. I doubt much of it
> could have been reasonably split up, though.

Yes, I considered doing that but it would been quite a pain to develop
it incrementally (so I thought if needed I would have splitted it
later).

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  4:54             ` Andrea Arcangeli
@ 2001-09-18  3:33               ` Marcelo Tosatti
  2001-09-18  5:06                 ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  3:33 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List


On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> > Don't you agree that your code can introduce new stability bugs ?
> 
> not anything that can corrupt randomly your hd.

Sure, the old code did not corrupt hd's randomly, did it?

Let me redo the question: Don't you think the old stinky and slow code was
reasonably stable ? :) 



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  1:08   ` Marcelo Tosatti
@ 2001-09-18  3:37     ` Andrea Arcangeli
  2001-09-18  2:25       ` Marcelo Tosatti
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  3:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Mon, Sep 17, 2001 at 10:08:22PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Mon, 17 Sep 2001, Marcelo Tosatti wrote:
> 
> > 
> > 
> > On Mon, 17 Sep 2001, Linus Torvalds wrote:
> > 
> > > 
> > > Ok, the big thing here is continued merging, this time with Andrea.
> > > 
> > > I still don't like some of the VM changes, but integrating Andrea's VM
> > > changes results in (a) better performance and (b) much cleaner inactive
> > > page handling in particular. Besides, for the 2.4.x tree, the big priority
> > > is stability, we can re-address my other concerns during 2.5.x.
> > 
> > Andrea, 
> > 
> > Could you please make a resume of your VM changes ? 
> > 
> > Its hard to keep up with VM changes this way. 
> 
> Andrea, 
> 
> I've just read a bit of your new VM code and I have a few comments.
> 
> You completly removed the "inactive freeable pages" logic: There is no

yes, it wasn't relly useful to keep this list lazily, you either keep it
enforced with locking overhead or such information isn't valuable.

> more distiction between "freeable inactive" and "free" pages. All VM work
> is based on "freepages.high" watermark. I don't like that: it seems to

hardly on freepages.high:

diff -urN vm-ref/mm/swap.c vm/mm/swap.c
--- vm-ref/mm/swap.c	Tue Sep 18 00:18:17 2001
+++ vm/mm/swap.c	Tue Sep 18 00:18:35 2001
@@ -24,50 +24,13 @@
 #include <asm/uaccess.h> /* for copy_to/from_user */
 #include <asm/pgtable.h>
 
-/*
- * We identify three levels of free memory.  We never let free mem
- * fall below the freepages.min except for atomic allocations.  We
- * start background swapping if we fall below freepages.high free
- * pages, and we begin intensive swapping below freepages.low.
- *
- * Actual initialization is done in mm/page_alloc.c
- */
-freepages_t freepages = {
-	0,	/* freepages.min */
-	0,	/* freepages.low */
-	0	/* freepages.high */
-};
-

> make page freeing more aggressive over time.

I don't see your point with "page freeing more aggressive over time".

> Also, if we have several try_to_free_pages() callers, for different
> classzones, I'm right saying that a caller with a "smaller" classzone can
> "hide" pages in its "active_local_lru" and/or "inactive_local_lru" (during
> shrink_cache) from other processes trying to free pages from those higher
> zones ?

I'm deeply impressed, you seem to have understood the rewrite greatly
well, congrats, this "hiding" was infact my main concern I had on the
memclass check during shrink_cache, but I don't think this will ever
give us troubles.  In such there are suprious swapouts with HIGHMEM
we'll just need to waste some cpu by lefting those pages visible with a
few changes in shrink_cache, but again I'm almost sure there won't be
problems, we do multiple scans before failing so those pages will return
visible before the other task has a chance to fail the allocation.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:06                 ` Andrea Arcangeli
@ 2001-09-18  3:55                   ` Marcelo Tosatti
  2001-09-18  5:32                     ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  3:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote:
> > 
> > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > 
> > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> > > > Don't you agree that your code can introduce new stability bugs ?
> > > 
> > > not anything that can corrupt randomly your hd.
> > 
> > Sure, the old code did not corrupt hd's randomly, did it?
> > 
> > Let me redo the question: Don't you think the old stinky and slow code was
> > reasonably stable ? :) 
> 
> As said in the other email, just check 2.4 l-k reports of this week,
> last week etc.., I've lots of private reports too. While for everybody
> 2.2.19 is working fine.

Have you seen any problem report which does not happen with anon intensive
workloads ? 

As far as I've noted, people usually report performance problems when
running anon intensive workloads. For those cases, I'm pretty sure the
swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_
for the current VM.




^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  2:25       ` Marcelo Tosatti
@ 2001-09-18  3:58         ` Andrea Arcangeli
  2001-09-18  2:53           ` Marcelo Tosatti
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  3:58 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Mon, Sep 17, 2001 at 11:25:09PM -0300, Marcelo Tosatti wrote:
> My point is: we're not going to start aging pte's until we have a
> zone shortage, right ?

correct, so we don't waste time if there's only filesystem cache
pressure during the whole kernel uptime (not an unlikely scenario if you
have enough ram like on servers).

> I really think this will cause problems in practice, Andrea.

yes, this hiding could infact explain the report you posted. I'll try to
reproduce with highmem emulation and to fix it as soon as I can
reproduce, also if you are interested to hack on it too feel free to
send patches of course. thanks.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
       [not found]       ` <20010918052201.N698@athlon.random>
@ 2001-09-18  4:01         ` Benjamin LaHaise
  2001-09-18  4:39           ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Benjamin LaHaise @ 2001-09-18  4:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 05:22:01AM +0200, Andrea Arcangeli wrote:
> try compiling 2.4.10pre10aa1 with egcs 1.1.2 before complaing about lack
> of testing, the tests were run on 2.4.10pre10aa1, not on 2.4.10pre11
> that didn't even existed at that time.

If I spent the time testing every random kernel patch that comes across, then 
I wouldn't have any time left to develop other useful things.  What I see is 
that what was merged didn't deal with things sanely.

> So I'm not surprised by the fact you weren't horrified by the previous
> VM that was looping forever in such a case.  unfortunately not everybody
> out there (mainly simulations) knows how much memory they will alloc.

Every single kernel since the dawn of 1.0 has died under OOM.  Optimizing for 
it seems rather stupid when a well maintained box WON'T OOM.

> > Then why wasn't it added to the generic code?
> 
> see above.

So merging crap is now acceptable?  Great, I'll write messy code from now on.

> > on their individual merit.  All or nothing is *not* the approach that should 
> > be taken at present.  (Hint, stability is acheived gradually.)
> 
> You can find a very big split if you check into the 2.4.10pre10aa1/
> directory in my ftp area, of course if you look at 2.4.10pre11 it is all
> at once.

Yes, all of the vm patches are in one big 3300 line patch, which is about the 
same size as the entire aio patch which adds a new subsystem.

> For the vm changes properly splitting them it's quite a pain, mainly due
> the way they're developed. For everything that is not a pain to keep
> separated I try to do so.

Bull shit.  People do it every day because Linus dictates it so.  There is 
nothing magical in you patch that can't be split up into human-readable 
chunks.

> > > again and you'll see, as said 2.2.19 does the same write throttling.
> > 
> > And it does so in a fashion which is completely broken.
> 
> Then send Alan the fix against 2.2.20pre10 if it's completly broken. I
> never got a complain about it yet and I look forward to see your fix.

2.2 doesn't matter any more.  Any work I'm doing now is 2.4 based.

> 
> > Then make it a seperate patch.  It shouldn't be part of the do-everything 
> 
> Please get real, it is a 3 thousand lines patch, if for every single
> change like this (orthogonal with the rest) I should make 1 patch it
> would take hours just to run the diffs. Not to tell if then I make a
> controversial changes.

I am being real.  I don't expect single massive patches to ever be applied, 
and am shocked I've even had to comment on this.

> What kind of workload is applied during the measurement, and what
> NULL/MEM/FILE_IO/PROCESS means? Is this benchmark available somewhere?
> What fields are you looking at exactly? I can see some field going up
> and low and I cannot evaluate those numbers very well without more info.

Go do the work yourself.  If you can't be bothered to develop benchmarks that 
simulate users on the vm subsystem, how can you claim your vm patches are 
correct?

> Except for the vm rewrite there was extensive testing. Probably not from
> RedHat Inc if this is the point that you are making. And the vm rewrite
> is so obviously better and much cleaner that I think this was a fine
> integration. I'm pretty sure no stability problem can arise because of
> the vm integration, if there could be problems as worse they could be
> specific to slower performance in some workload which we can address
> over the time without any pain. anyways as said dbench runs exactly
> twice faster so it cannot be obviously wrong in terms of performance
> either.

So the vm rewrite wasn't well tested, but it should be merged into a stable 
kernel?  Please say it ain't so.

> Linus merged everything without asking me and I think his move was very
> good IMHO. Face it: users cannot live anymore with workloads slowing
> down at every runs of benchmarks, with kswapd looping forever, with
> swapout storms, with swap filled by unused swapcache etc... What I did
> is the less intrusive change to the code that could at least address
> those critical problems and I'm pretty sure it did, as worse as said we
> may need some little tweak over the time but the new infrastructure
> should be robust enough now.

I want robust and not likely to corrupt my data randomly.  The latter is more 
important to me and everyone else in the world, so I think you should play 
by rules that enforce that.


> Ben, the true mess is the 2.4 VM before 2.4.10pre11. Period.

Sure.

> If you
> check the new code you'll see how much cleaner it is.

Hardly.  You don't even define your basic terms.  WTF is a classzone?  A 
comment somewhere would be nice.

> Those locking changes are bugfixes. Try to growsdown on a file backed
> vma with 2.4.9 using two threads and you'll run into troubles eventually
> (see the thread on l-k).

That isn't the one I'm talking about.  You changed the swapcache code.  That 
code is fragile.  These changes aren't documented.

> Just in case it wasn't obvious I do things gradually and in public,
> check ftp.kernel.org. If something you're the one that doesn't care
> about what I do in public.

The vm rewrite was not posted in public, nor described in public.  It just 
appeared and got merged.  Could you at least describe *ALL* of the changes?

> I recall I also described you some of the O_DIRECT
> stuff and blkdev pagecache work on Ottawa a few months ago. I get lots
> of feedback and also bugfixes from many places, to mention one company
> that cares about what I do in public there's IBM, they did auditing and
> I got several O_DIRECT small fixes from them incrementally to my very
> open patches. I think -aa I'm even more open than -ac since I keep all
> the stuff separated so it's much easier to understand and pick the
> interesting parts. I do my best effort to make every patch self
> contained just for the purpose of making merging and auditing easier,
> and it payed off so far. Furthmore it's not just in public but also in
> production, for istsance all SuSE server users happily runs things like
> O_DIRECT and blkdev in pagecache for several months now. The latter was
> needed from a few I/O intensive applications using blkdevs as files and
> needing heavy readhead for exploiting the full bandwith of many-ways
> raid arrays for example. The only other way would been to basically
> duplicate the readahead of filemap.c in block_dev.c, the hack was as
> complex as the long term fix.

And we agreed that this is 2.5 material.

> > Not at the expense of stability.  Do it in 2.5, or do it gradually.
> 
> Ben, instead of writing complains could you grab 2.4.10pre11, put it on
> your desktop while doing your daily work and let me know when you run
> into stability troubles? I could have said the same thing of the ext2
> directory metadata in pagecache, I grabbed it, put it on my desktop, it
> never given me a problem so far so I didn't complained. And the directory
> in pagecache wasn't fixing _any_ problem for the end user, here we're
> instead fixing _real_life_ showstopper problems with mysql, postgres (I
> wonder you care about postrgres these days?) and the other db not using
> rawio+lvm, so this was kind of needed in 2.4 eventually anyways if not
> today (I was flooded by bugreports of such kind, I'm sure you got them
> too). So the earlier the better. I think users would been not happy to
> wait until 2.6.

I'm sorry, but I'm not going to run it right now since I have other 
priorities.  I'm sure users will post their pass/fails over the next few 
days, but until I see some evidence that it doesn't eat my data, I'll hold 
back.

		-ben

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:32                     ` Andrea Arcangeli
@ 2001-09-18  4:14                       ` Marcelo Tosatti
  2001-09-18  5:59                         ` Andrea Arcangeli
  2001-09-18  5:00                       ` Marcelo Tosatti
  1 sibling, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  4:14 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Tue, Sep 18, 2001 at 12:55:46AM -0300, Marcelo Tosatti wrote:
> > 
> > 
> > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > 
> > > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote:
> > > > 
> > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > > > 
> > > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> > > > > > Don't you agree that your code can introduce new stability bugs ?
> > > > > 
> > > > > not anything that can corrupt randomly your hd.
> > > > 
> > > > Sure, the old code did not corrupt hd's randomly, did it?
> > > > 
> > > > Let me redo the question: Don't you think the old stinky and slow code was
> > > > reasonably stable ? :) 
> > > 
> > > As said in the other email, just check 2.4 l-k reports of this week,
> > > last week etc.., I've lots of private reports too. While for everybody
> > > 2.2.19 is working fine.
> > 
> > Have you seen any problem report which does not happen with anon intensive
> > workloads ? 
> 
> of course, all the mysql/postgres db reports I got were non anon
> intensive I assume, I assume they had enough ram, they all said 2.2 was
> fine.
> 
> > As far as I've noted, people usually report performance problems when
> > running anon intensive workloads. For those cases, I'm pretty sure the
> > swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_
> > for the current VM.
> 
> I don't think that was the case, 2.2 has the same swap_out loop.

Note that in 2.4 we scan pte's even if there is no free pages
shortage, while in 2.2 we only scan pte's if there is a free page
shortage.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  4:01         ` Benjamin LaHaise
@ 2001-09-18  4:39           ` Andrea Arcangeli
  2001-09-18  5:04             ` Alan Cox
  2001-09-18  5:22             ` Benjamin LaHaise
  0 siblings, 2 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  4:39 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 12:01:32AM -0400, Benjamin LaHaise wrote:
> Every single kernel since the dawn of 1.0 has died under OOM.  Optimizing for 

try 2.2 once.

> 2.2 doesn't matter any more.  Any work I'm doing now is 2.4 based.

It still matters for me. Critical servers with very high vm loads still
have to run 2.2 to be stable and fast unfortunately.

> I am being real.  I don't expect single massive patches to ever be applied, 
> and am shocked I've even had to comment on this.

Your aio patch is massive too.

andrea@athlon:~ > wc -l aio-v2.4.0-20010123.diff 
   2951 aio-v2.4.0-20010123.diff

Now if you think I'm unreal and you are real, feed me the aio patch in
self contained pieces of 10 lines each as you expect from me. And note
that if they're not self contained they will just make my life harder.

I'd be glad to be proved wrong and to get aio from you in small self
contained pieces really, I planned to look into aio as one of the next
things to merge in -aa but as usual the size of the patch makes things
harder to merge due the larger implications. feel free to cc l-k, I'm
sure other people is interested in aio too.

> I want robust and not likely to corrupt my data randomly.  The latter is more 

Forget the corruption. So far the only scary report I had is from
Marcelo's 2G machine which is nothin compared to corruption, I don't
have x86 machines with more than 1G, I tested alpha with 3G (but it has
only 1 zone). I think Marcelo identified the problematic part before
even testing it, so the fix should be fairly immediate, I'll address it
ASAP unless he beats me on it (at the moment I'm still resynching).

> That isn't the one I'm talking about.  You changed the swapcache code.  That 
> code is fragile.  These changes aren't documented.

I didn't changed the swapcache locking rules. I only fixed the VM to
properly clear the dirty bit before freeing a page. Anybody freeing a
page that is dirty was a plain vm bug. That was quite strightforward and
correct change. Infact I was horrified by seeing __free_pages_ok
clearing the dirty bit (not to talk about the referenced bit which was
useless to change).

> The vm rewrite was not posted in public, nor described in public.  It just 

It obviously was. How do you think Linus got it? I said I didn't sent it
to Linus privately.

> appeared and got merged.  Could you at least describe *ALL* of the changes?

I'll be glad to do that over the time, right now I'm strict in time and
I also needed to go to sleep a few hours ago so I won't inline the reply
to this email right now, sorry.

> And we agreed that this is 2.5 material.

the O_DIRECT and blkdev in pagecache yes but definitely not the VM one
but people needed those features in production anyways so that was good
and they were well tested.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  2:53           ` Marcelo Tosatti
@ 2001-09-18  4:54             ` Andrea Arcangeli
  2001-09-18  3:33               ` Marcelo Tosatti
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  4:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> Don't you agree that your code can introduce new stability bugs ?

not anything that can corrupt randomly your hd.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:32                     ` Andrea Arcangeli
  2001-09-18  4:14                       ` Marcelo Tosatti
@ 2001-09-18  5:00                       ` Marcelo Tosatti
  1 sibling, 0 replies; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  5:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Tue, Sep 18, 2001 at 12:55:46AM -0300, Marcelo Tosatti wrote:
> > 
> > 
> > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > 
> > > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote:
> > > > 
> > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > > > 
> > > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> > > > > > Don't you agree that your code can introduce new stability bugs ?
> > > > > 
> > > > > not anything that can corrupt randomly your hd.
> > > > 
> > > > Sure, the old code did not corrupt hd's randomly, did it?
> > > > 
> > > > Let me redo the question: Don't you think the old stinky and slow code was
> > > > reasonably stable ? :) 
> > > 
> > > As said in the other email, just check 2.4 l-k reports of this week,
> > > last week etc.., I've lots of private reports too. While for everybody
> > > 2.2.19 is working fine.
> > 
> > Have you seen any problem report which does not happen with anon intensive
> > workloads ? 
> 
> of course, all the mysql/postgres db reports I got were non anon
> intensive I assume, I assume they had enough ram, they all said 2.2 was
> fine.
> 
> > As far as I've noted, people usually report performance problems when
> > running anon intensive workloads. For those cases, I'm pretty sure the
> > swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_
> > for the current VM.
> 
> I don't think that was the case, 2.2 has the same swap_out loop.

Ok, back to the current allocation failure problem.

In addition to the zone specific page hiding problem, I'm afraid threads
can hide pages from themselves up to a point there is nothing to block on
(even if we have just one zone).

There is nothing which avoids that from happening in theory.

Well, I'm going to sleep now. 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  6:11   ` Andrea Arcangeli
@ 2001-09-18  5:02     ` Marcelo Tosatti
  2001-09-18  6:40       ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18  5:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Mon, Sep 17, 2001 at 10:48:34PM -0700, Andrew Morton wrote:
> > Linus Torvalds wrote:
> > > 
> > > Ok, the big thing here is continued merging, this time with Andrea.
> > > 
> > 
> > In one test here the VM changes seem fragile, and slower.
> > 
> > Dual x86, 512 megs RAM, 512 megs swap.  No highmem.
> > 
> > The workload is:
> > 
> > 	while true
> > 	do
> > 		/usr/src/ext3/tools/usemem 300
> > 	done
> > 
> > 	(This just mallocs 300 megs, touches it then exits)
> > 
> > in parallel with
> > 
> > 	time /usr/src/ext3/tools/bash-shared-mapping -n 5 -t 3 foo 300000000
> > 
> > on ext2.
> > 
> > (bash-shared-mapping is a tool which I wrote for ext3.  It's one of the
> >  most aggressive VM/MM stress testers around, and has found a number of
> >  kernel bugs).
> > 
> > On 2.4.9-ac10, the b-s-m run took 294 seconds.  On 2.4.10-pre11 it
> > took 330 seconds DESPITE the fact that one of the b-s-m instances
> > was oom-killed quite early in the test.
> > 
> > `vmstat' took about thirty seconds to start (this is usual), but
> > was promptly killed, despite having (presumably) a small RSS.  Instances
> > of `usemem' were oom-killed quite frequently.  In 2.4.9-ac10, nothing
> > was oom-killed.
> 
> should be the very same problem identified by Marcelo. I'm wondering why
> I didn't reproduced here during testing, 512mbytes is not highmem and my
> desktop has 512mbytes too and it didn't killed anything yet. As for the
> slowdown there are a few localized places to look at. but let's fix the
> oom first.

Try to run several memory hungry threads (thus hiding more pages).


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  4:39           ` Andrea Arcangeli
@ 2001-09-18  5:04             ` Alan Cox
  2001-09-18  5:09               ` Andrea Arcangeli
  2001-09-18  5:22             ` Benjamin LaHaise
  1 sibling, 1 reply; 110+ messages in thread
From: Alan Cox @ 2001-09-18  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Benjamin LaHaise, Linus Torvalds, Kernel Mailing List

> > And we agreed that this is 2.5 material.
> 
> the O_DIRECT and blkdev in pagecache yes but definitely not the VM one
> but people needed those features in production anyways so that was good
> and they were well tested.

The O_DIRECT stuff is very clean - its definitely a feature that should
have gone into 2.5 first and then back, but its one that really doesn't
bother me too much. blkdev in page cache needs some locking thinking but
looks ok.

Alan

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:33               ` Marcelo Tosatti
@ 2001-09-18  5:06                 ` Andrea Arcangeli
  2001-09-18  3:55                   ` Marcelo Tosatti
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  5:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote:
> 
> On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> 
> > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> > > Don't you agree that your code can introduce new stability bugs ?
> > 
> > not anything that can corrupt randomly your hd.
> 
> Sure, the old code did not corrupt hd's randomly, did it?
> 
> Let me redo the question: Don't you think the old stinky and slow code was
> reasonably stable ? :) 

As said in the other email, just check 2.4 l-k reports of this week,
last week etc.., I've lots of private reports too. While for everybody
2.2.19 is working fine.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:04             ` Alan Cox
@ 2001-09-18  5:09               ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  5:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 06:04:28AM +0100, Alan Cox wrote:
> The O_DIRECT stuff is very clean - its definitely a feature that should
> have gone into 2.5 first and then back, but its one that really doesn't
> bother me too much. blkdev in page cache needs some locking thinking but
> looks ok.

agreed.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  4:39           ` Andrea Arcangeli
  2001-09-18  5:04             ` Alan Cox
@ 2001-09-18  5:22             ` Benjamin LaHaise
  2001-09-18  5:48               ` Andrea Arcangeli
  1 sibling, 1 reply; 110+ messages in thread
From: Benjamin LaHaise @ 2001-09-18  5:22 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 06:39:10AM +0200, Andrea Arcangeli wrote:
> On Tue, Sep 18, 2001 at 12:01:32AM -0400, Benjamin LaHaise wrote:
> > Every single kernel since the dawn of 1.0 has died under OOM.  Optimizing for 
> 
> try 2.2 once.

There are still loads that 2.2 can die under (think fast network cards and 
you'll realise that you can't protect against it).  Yes, 2.2 is in far better 
state than anything else ever has been, but it's not perfect.

> > 2.2 doesn't matter any more.  Any work I'm doing now is 2.4 based.
> 
> It still matters for me. Critical servers with very high vm loads still
> have to run 2.2 to be stable and fast unfortunately.

That's where I want 2.4 to get to.

> > I am being real.  I don't expect single massive patches to ever be applied, 
> > and am shocked I've even had to comment on this.
> 
> Your aio patch is massive too.
> 
> andrea@athlon:~ > wc -l aio-v2.4.0-20010123.diff 
>    2951 aio-v2.4.0-20010123.diff
> 
> Now if you think I'm unreal and you are real, feed me the aio patch in
> self contained pieces of 10 lines each as you expect from me. And note
> that if they're not self contained they will just make my life harder.

Not 10 lines, but several hundred here and there.  Aio actually splits up 
very well since thing like the wait_queue changes are all isolated bits 
of functionality.  Even the brw_kiovec_async bit is standalone since it 
only matters to itself and brw_kiovec.  Yes, the current patch is not 
seperated, but that was my plan from the beginning on merging.

> I'd be glad to be proved wrong and to get aio from you in small self
> contained pieces really, I planned to look into aio as one of the next
> things to merge in -aa but as usual the size of the patch makes things
> harder to merge due the larger implications. feel free to cc l-k, I'm
> sure other people is interested in aio too.

It's not ready yet.  Most of the development has been on hold waiting for 
2.5 to start for cementing the ABI in stone.

> > I want robust and not likely to corrupt my data randomly.  The latter is more 
> 
> Forget the corruption. So far the only scary report I had is from
> Marcelo's 2G machine which is nothin compared to corruption, I don't
> have x86 machines with more than 1G, I tested alpha with 3G (but it has
> only 1 zone). I think Marcelo identified the problematic part before
> even testing it, so the fix should be fairly immediate, I'll address it
> ASAP unless he beats me on it (at the moment I'm still resynching).

Sure.  That's why it should remain seperate for at least a little bit.

> 
> > That isn't the one I'm talking about.  You changed the swapcache code.  That 
> > code is fragile.  These changes aren't documented.
> 
> I didn't changed the swapcache locking rules. I only fixed the VM to
> properly clear the dirty bit before freeing a page. Anybody freeing a
> page that is dirty was a plain vm bug. That was quite strightforward and
> correct change. Infact I was horrified by seeing __free_pages_ok
> clearing the dirty bit (not to talk about the referenced bit which was
> useless to change).

So why not split that fix out as a seperate patch that can then be applied 
alone?

> 
> > The vm rewrite was not posted in public, nor described in public.  It just 
> 
> It obviously was. How do you think Linus got it? I said I didn't sent it
> to Linus privately.

*nod*

> > appeared and got merged.  Could you at least describe *ALL* of the changes?
> 
> I'll be glad to do that over the time, right now I'm strict in time and
> I also needed to go to sleep a few hours ago so I won't inline the reply
> to this email right now, sorry.

Thanks!  I think we'll all be a bit calmer in the morning and better able to 
think clearly.  I'll even try to break things down a bit as there are bits 
in your patches which I think are very good.

Cheers,

		-ben

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:55                   ` Marcelo Tosatti
@ 2001-09-18  5:32                     ` Andrea Arcangeli
  2001-09-18  4:14                       ` Marcelo Tosatti
  2001-09-18  5:00                       ` Marcelo Tosatti
  0 siblings, 2 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  5:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 12:55:46AM -0300, Marcelo Tosatti wrote:
> 
> 
> On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> 
> > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote:
> > > 
> > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> > > 
> > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote:
> > > > > Don't you agree that your code can introduce new stability bugs ?
> > > > 
> > > > not anything that can corrupt randomly your hd.
> > > 
> > > Sure, the old code did not corrupt hd's randomly, did it?
> > > 
> > > Let me redo the question: Don't you think the old stinky and slow code was
> > > reasonably stable ? :) 
> > 
> > As said in the other email, just check 2.4 l-k reports of this week,
> > last week etc.., I've lots of private reports too. While for everybody
> > 2.2.19 is working fine.
> 
> Have you seen any problem report which does not happen with anon intensive
> workloads ? 

of course, all the mysql/postgres db reports I got were non anon
intensive I assume, I assume they had enough ram, they all said 2.2 was
fine.

> As far as I've noted, people usually report performance problems when
> running anon intensive workloads. For those cases, I'm pretty sure the
> swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_
> for the current VM.

I don't think that was the case, 2.2 has the same swap_out loop.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  0:08 Linux 2.4.10-pre11 Linus Torvalds
  2001-09-17 23:17 ` Marcelo Tosatti
       [not found] ` <20010917211834.A31693@redhat.com>
@ 2001-09-18  5:48 ` Andrew Morton
  2001-09-18  6:11   ` Andrea Arcangeli
  2001-09-18 10:58   ` Martin Dalecki
  2001-09-18  9:31 ` Alexander Viro
  3 siblings, 2 replies; 110+ messages in thread
From: Andrew Morton @ 2001-09-18  5:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli

Linus Torvalds wrote:
> 
> Ok, the big thing here is continued merging, this time with Andrea.
> 

In one test here the VM changes seem fragile, and slower.

Dual x86, 512 megs RAM, 512 megs swap.  No highmem.

The workload is:

	while true
	do
		/usr/src/ext3/tools/usemem 300
	done

	(This just mallocs 300 megs, touches it then exits)

in parallel with

	time /usr/src/ext3/tools/bash-shared-mapping -n 5 -t 3 foo 300000000

on ext2.

(bash-shared-mapping is a tool which I wrote for ext3.  It's one of the
 most aggressive VM/MM stress testers around, and has found a number of
 kernel bugs).

On 2.4.9-ac10, the b-s-m run took 294 seconds.  On 2.4.10-pre11 it
took 330 seconds DESPITE the fact that one of the b-s-m instances
was oom-killed quite early in the test.

`vmstat' took about thirty seconds to start (this is usual), but
was promptly killed, despite having (presumably) a small RSS.  Instances
of `usemem' were oom-killed quite frequently.  In 2.4.9-ac10, nothing
was oom-killed.

With a gig of VM and a 600 meg working set I don't see why it's necessary
to kill processes?   Each oom-kill was associated with a 0-order allocation
failure.

These tools are available at http://www.uow.edu.au/~andrewm/ext3-tools.tar.gz

OK, so this was a seriously loony workload.  But suddenly, the number of people
who understand the Linux VM has gone from maybe 10 down to just one-and-a-bit.
A large number of comments have been removed, and a year's worth of
discussion has been invalidated.

Andrea, it would be most useful if you were to spend some time (say, four hours)
commenting the code and telling us how it works.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:22             ` Benjamin LaHaise
@ 2001-09-18  5:48               ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  5:48 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 01:22:16AM -0400, Benjamin LaHaise wrote:
> That's where I want 2.4 to get to.

yes, me too, we agree. And what I faced is that I couldn't do that
without first removing the broken mess. It wasn't possible to just make
a two liner to fix it.

> So why not split that fix out as a seperate patch that can then be applied 
> alone?

It's not as simple as it seems, I noticed such a bug in the middle of
the developement and so it was controversial to fix, I should have fixed
the same bug first in the new VM then in the old vm, only the
free_pages_ok change would been non controversial but such one alone
would been useless since it wouldn't be self contained. so I did it for
my vm and I never ended splitting it off because it was non trivial work.

> Thanks!  I think we'll all be a bit calmer in the morning and better able to 

never mind.

> think clearly.  I'll even try to break things down a bit as there are bits 
> in your patches which I think are very good.

Great, thanks!

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  4:14                       ` Marcelo Tosatti
@ 2001-09-18  5:59                         ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  5:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 01:14:36AM -0300, Marcelo Tosatti wrote:
> Note that in 2.4 we scan pte's even if there is no free pages
> shortage, while in 2.2 we only scan pte's if there is a free page
> shortage.

That was most certainly a problem which is now fixed. Mainly the
preallication of swap was a waste. But I think there was a kind of
physical (not pte) overaging too that forbidden the vm to react
properly.

I believe when the storm of swap_out startups it won't matter any longer
what we aged and whatever pte scanning we did previously. At least with
the current swap_out.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:48 ` Andrew Morton
@ 2001-09-18  6:11   ` Andrea Arcangeli
  2001-09-18  5:02     ` Marcelo Tosatti
  2001-09-18 10:58   ` Martin Dalecki
  1 sibling, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  6:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Kernel Mailing List

On Mon, Sep 17, 2001 at 10:48:34PM -0700, Andrew Morton wrote:
> Linus Torvalds wrote:
> > 
> > Ok, the big thing here is continued merging, this time with Andrea.
> > 
> 
> In one test here the VM changes seem fragile, and slower.
> 
> Dual x86, 512 megs RAM, 512 megs swap.  No highmem.
> 
> The workload is:
> 
> 	while true
> 	do
> 		/usr/src/ext3/tools/usemem 300
> 	done
> 
> 	(This just mallocs 300 megs, touches it then exits)
> 
> in parallel with
> 
> 	time /usr/src/ext3/tools/bash-shared-mapping -n 5 -t 3 foo 300000000
> 
> on ext2.
> 
> (bash-shared-mapping is a tool which I wrote for ext3.  It's one of the
>  most aggressive VM/MM stress testers around, and has found a number of
>  kernel bugs).
> 
> On 2.4.9-ac10, the b-s-m run took 294 seconds.  On 2.4.10-pre11 it
> took 330 seconds DESPITE the fact that one of the b-s-m instances
> was oom-killed quite early in the test.
> 
> `vmstat' took about thirty seconds to start (this is usual), but
> was promptly killed, despite having (presumably) a small RSS.  Instances
> of `usemem' were oom-killed quite frequently.  In 2.4.9-ac10, nothing
> was oom-killed.

should be the very same problem identified by Marcelo. I'm wondering why
I didn't reproduced here during testing, 512mbytes is not highmem and my
desktop has 512mbytes too and it didn't killed anything yet. As for the
slowdown there are a few localized places to look at. but let's fix the
oom first.

> Andrea, it would be most useful if you were to spend some time (say, four hours)
> commenting the code and telling us how it works.

I will do that and I'll answer to any question ASAP, no panic. We may
even want to change part of the core logic over time if it doesn't
perform wonderfully. But first prio at the moment is to fix the oom you
also noticed (the "hiding" logic mentioned by Marcelo), the rest of the
changes should be a very solid base for a very solid and fast 2.4 vm.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:02     ` Marcelo Tosatti
@ 2001-09-18  6:40       ` Andrea Arcangeli
  2001-09-18 16:06         ` Marcelo Tosatti
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  6:40 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 02:02:37AM -0300, Marcelo Tosatti wrote:
> Try to run several memory hungry threads (thus hiding more pages).

I did that indeed, not sure why I didn't reproduced (I guess the hogs
were sligthly different). (and of course they were killed but only after
the box was truly oom)

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  0:08 Linux 2.4.10-pre11 Linus Torvalds
                   ` (2 preceding siblings ...)
  2001-09-18  5:48 ` Andrew Morton
@ 2001-09-18  9:31 ` Alexander Viro
  2001-09-18  9:39   ` Andrea Arcangeli
  2001-09-18 16:45   ` Linus Torvalds
  3 siblings, 2 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-18  9:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli



On Mon, 17 Sep 2001, Linus Torvalds wrote:

> This also merges the blkdev in page cache patch, and that will hopefully
> make it noticeably easier to do the "do bread() with page cache too", at
> which point a lot of the current ugly synchronization issues will go away.

Umm...  Linus, had you actually read through the fs/block_device.c part
of that?  It's not just ugly as hell, it's (AFAICS) not hard to oops
if you have several inodes sharing major:minor.  ->bd_inode and its
treatment are bogus.  Please, read it through and consider reverting -
in its current state code is an ugly mess.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  9:31 ` Alexander Viro
@ 2001-09-18  9:39   ` Andrea Arcangeli
  2001-09-18  9:44     ` Alexander Viro
  2001-09-18 16:45   ` Linus Torvalds
  1 sibling, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  9:39 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 05:31:48AM -0400, Alexander Viro wrote:
> 
> 
> On Mon, 17 Sep 2001, Linus Torvalds wrote:
> 
> > This also merges the blkdev in page cache patch, and that will hopefully
> > make it noticeably easier to do the "do bread() with page cache too", at
> > which point a lot of the current ugly synchronization issues will go away.
> 
> Umm...  Linus, had you actually read through the fs/block_device.c part
> of that?  It's not just ugly as hell, it's (AFAICS) not hard to oops
> if you have several inodes sharing major:minor.  ->bd_inode and its

can you show an exploit? I cannot reproduce any problem here:

root@athlon:/tmp > cp -a /dev/hda hda.1
root@athlon:/tmp > cp -a /dev/hda hda.2
root@athlon:/tmp > cp hda.1 /dev/null & cp hda.2 /dev/null &
[1] 24981
[2] 24982
root@athlon:/tmp > fg
cp hda.2 /dev/null

root@athlon:/tmp > fg
cp hda.1 /dev/null

root@athlon:/tmp > 

> treatment are bogus.  Please, read it through and consider reverting -
> in its current state code is an ugly mess.

what other design solution do you propose rather both inodes sharing the
i_mapping across the different inodes like I did?

I found more handy to just bump the i_count of the first inode and
referencing it from the bd_inode, rather than dynamically allocating the
i_mapping and have a bd_mapping, but if you prefer to dynamically
allocate the i_mapping rather than using the i_data of the fist inode we
can change that of course. Not sure what's the mess in the patch you're
talking about, could you elaborate?

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  9:39   ` Andrea Arcangeli
@ 2001-09-18  9:44     ` Alexander Viro
  2001-09-18  9:57       ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-18  9:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> what other design solution do you propose rather both inodes sharing the
> i_mapping across the different inodes like I did?
> 
> I found more handy to just bump the i_count of the first inode and
> referencing it from the bd_inode, rather than dynamically allocating the
> i_mapping and have a bd_mapping, but if you prefer to dynamically
> allocate the i_mapping rather than using the i_data of the fist inode we
> can change that of course. Not sure what's the mess in the patch you're
> talking about, could you elaborate?

Bumping ->i_count on inode is _not_ an option - think what it does if
you umount the first fs.

_If_ you need an inode for block_device - allocate a new one instead of
reusing the inode that had been passed to ->open().


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  9:44     ` Alexander Viro
@ 2001-09-18  9:57       ` Andrea Arcangeli
  2001-09-18 10:02         ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  9:57 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 05:44:18AM -0400, Alexander Viro wrote:
> Bumping ->i_count on inode is _not_ an option - think what it does if
> you umount the first fs.

what it does? Unless I'm missing something the fs never cares and never
sees the bd_inode. the fs just does a bdget and then it works only on
the bdev. What should I run to get the oops exactly?

> _If_ you need an inode for block_device - allocate a new one instead of
> reusing the inode that had been passed to ->open().

If we need to avoid the bumping of i_count and to allocate something
dynamically that will be the bd_mapping address space, we don't need a
new fake_inode there too, we just need to share the new physical
pagecahce address space. Such physical i_mapping address space is the
same thing that the buffer cache will have to use to map its legacy
buffer cache buffer headers on top of it (then we can cleanup away the
few lines in blkdev_close that do the update_buffers() and checks the
MS_RDONLY bit).

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  9:57       ` Andrea Arcangeli
@ 2001-09-18 10:02         ` Alexander Viro
  2001-09-18 10:17           ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 10:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Tue, Sep 18, 2001 at 05:44:18AM -0400, Alexander Viro wrote:
> > Bumping ->i_count on inode is _not_ an option - think what it does if
> > you umount the first fs.
> 
> what it does? Unless I'm missing something the fs never cares and never
> sees the bd_inode. the fs just does a bdget and then it works only on
> the bdev. What should I run to get the oops exactly?

It sees an active inode for superblock we are destroying.  _Not_ a good
thing, for very obvious reasons.  There is a reason for "Self-destruct in
5 seconds" printk...
 
> If we need to avoid the bumping of i_count and to allocate something
> dynamically that will be the bd_mapping address space, we don't need a
> new fake_inode there too, we just need to share the new physical
> pagecahce address space. Such physical i_mapping address space is the

What are you going to use as mapping->host for it?


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 10:02         ` Alexander Viro
@ 2001-09-18 10:17           ` Andrea Arcangeli
  2001-09-18 10:28             ` Alexander Viro
                               ` (2 more replies)
  0 siblings, 3 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18 10:17 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 06:02:27AM -0400, Alexander Viro wrote:
> 
> 
> On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> 
> > On Tue, Sep 18, 2001 at 05:44:18AM -0400, Alexander Viro wrote:
> > > Bumping ->i_count on inode is _not_ an option - think what it does if
> > > you umount the first fs.
> > 
> > what it does? Unless I'm missing something the fs never cares and never
> > sees the bd_inode. the fs just does a bdget and then it works only on
> > the bdev. What should I run to get the oops exactly?
> 
> It sees an active inode for superblock we are destroying.  _Not_ a good
> thing, for very obvious reasons.  There is a reason for "Self-destruct in
> 5 seconds" printk...
>  

very clear now, thanks. I though the fs you were talking about was
mounted on the blkdev, while it is instead the one where the blkdev
inode cames from. Usually people keeps bldkev in /dev and nobody
unmounts /dev that's why I didn't noticed and thought about it, sorry.

> > If we need to avoid the bumping of i_count and to allocate something
> > dynamically that will be the bd_mapping address space, we don't need a
> > new fake_inode there too, we just need to share the new physical
> > pagecahce address space. Such physical i_mapping address space is the
> 
> What are you going to use as mapping->host for it?

the only info we'd need from the host is the host->i_rdev, so why can't
we get it from the file->f_dentry->d_inode->i_rdev? In general I don't
like very much the fake inodes with the zillon of unused fields (they
just looks confusing due their unused fields), but well if you for
whatever reason prefer to keep doing page->mapping->host->i_rdev or if
file->f_dentry->d_inode->i_rdev has a problem I'm pretty much ok with
the fake_inode there too.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 10:17           ` Andrea Arcangeli
@ 2001-09-18 10:28             ` Alexander Viro
  2001-09-18 10:35               ` Andrea Arcangeli
  2001-09-18 11:05             ` Helge Hafting
  2001-09-18 17:02             ` Linus Torvalds
  2 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 10:28 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> > > If we need to avoid the bumping of i_count and to allocate something
> > > dynamically that will be the bd_mapping address space, we don't need a
> > > new fake_inode there too, we just need to share the new physical
> > > pagecahce address space. Such physical i_mapping address space is the
> > 
> > What are you going to use as mapping->host for it?
> 
> the only info we'd need from the host is the host->i_rdev, so why can't
> we get it from the file->f_dentry->d_inode->i_rdev? In general I don't

In ->writepage()?  Good luck.  BTW, at some point use of ->i_rdev will have
to go - we need to be able to use ->i_bdev instead.

> like very much the fake inodes with the zillon of unused fields (they
> just looks confusing due their unused fields), but well if you for
> whatever reason prefer to keep doing page->mapping->host->i_rdev or if
> file->f_dentry->d_inode->i_rdev has a problem I'm pretty much ok with
> the fake_inode there too.

It doesn't have to be fake. See how it's done for sockets or pipes.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 10:28             ` Alexander Viro
@ 2001-09-18 10:35               ` Andrea Arcangeli
  2001-09-18 10:52                 ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18 10:35 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 06:28:11AM -0400, Alexander Viro wrote:
> 
> 
> On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> 
> > > > If we need to avoid the bumping of i_count and to allocate something
> > > > dynamically that will be the bd_mapping address space, we don't need a
> > > > new fake_inode there too, we just need to share the new physical
> > > > pagecahce address space. Such physical i_mapping address space is the
> > > 
> > > What are you going to use as mapping->host for it?
> > 
> > the only info we'd need from the host is the host->i_rdev, so why can't
> > we get it from the file->f_dentry->d_inode->i_rdev? In general I don't
> 
> In ->writepage()?  Good luck.  BTW, at some point use of ->i_rdev will have

I would have noticed if I actually wrote the code ;)

static int blkdev_writepage(struct page * page)
{

no file...

> It doesn't have to be fake. See how it's done for sockets or pipes.

here it's really completly private to the bdev. I mean we could be
tricky and force a cast on mapping->host to point to bdev and we
wouldn't need the fake inode. But casts are probably uglier and more
risky than using the fake_inode (unless we really consdier the host a
cookie rather than an inode pointer). Comments?

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 10:35               ` Andrea Arcangeli
@ 2001-09-18 10:52                 ` Alexander Viro
  0 siblings, 0 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 10:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> > It doesn't have to be fake. See how it's done for sockets or pipes.
> 
> here it's really completly private to the bdev. I mean we could be
> tricky and force a cast on mapping->host to point to bdev and we
> wouldn't need the fake inode. But casts are probably uglier and more
> risky than using the fake_inode (unless we really consdier the host a
> cookie rather than an inode pointer). Comments?

Well, seeing that I had added ->host in the first place... yes, they were
intended to be cookies rather than inode pointers.  Precisely because of
block device applications.  Overridden by Linus several months down the road
and yes, I still think that it was bogus.

Anyway, _if_ we have to have an inode somewhere, we'd better make sure
that it's not a reuse of inode passed by open().  Otherwise we are in for
a lot of fun.

I could live with something similar to the net/socket.c and fs/pipe.c schemes
- minimal superblock and inodes allocated on it.  But then we'd better be
consistent and do that for _all_ uses of blkdev_get().


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  5:48 ` Andrew Morton
  2001-09-18  6:11   ` Andrea Arcangeli
@ 2001-09-18 10:58   ` Martin Dalecki
  1 sibling, 0 replies; 110+ messages in thread
From: Martin Dalecki @ 2001-09-18 10:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Kernel Mailing List, Andrea Arcangeli

Andrew Morton wrote:

> OK, so this was a seriously loony workload.  But suddenly, the number of people
> who understand the Linux VM has gone from maybe 10 down to just one-and-a-bit.
> A large number of comments have been removed, and a year's worth of
> discussion has been invalidated.

No I disagree if I look at the patches.
The number of people went down from 10 fumbeling around without
a clue to two knowing what they are doing I think ;-). IMHO this
is increasing the chance that the problems will finally get sorted out.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 10:17           ` Andrea Arcangeli
  2001-09-18 10:28             ` Alexander Viro
@ 2001-09-18 11:05             ` Helge Hafting
  2001-09-18 12:40               ` Andrea Arcangeli
  2001-09-18 17:02             ` Linus Torvalds
  2 siblings, 1 reply; 110+ messages in thread
From: Helge Hafting @ 2001-09-18 11:05 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-kernel

Andrea Arcangeli wrote:

> very clear now, thanks. I though the fs you were talking about was
> mounted on the blkdev, while it is instead the one where the blkdev
> inode cames from. Usually people keeps bldkev in /dev and nobody
> unmounts /dev that's why I didn't noticed and thought about it, sorry.
> 
Don't assume "nobody does..."  There is always someone that do.

This particular one will easily come up for anybody who
develop boot floppies:

1.  Mount the boot floppy at /mnt
2.  Test the device files in /mnt/dev, _perhaps by using them
    in every way imaginable_
3.  Umount the boot floppy. - There goes /mnt/dev and a bunch
    of block devices.  

Umounting some dev/ is common enough.

Helge Hafting

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 11:05             ` Helge Hafting
@ 2001-09-18 12:40               ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18 12:40 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

On Tue, Sep 18, 2001 at 01:05:51PM +0200, Helge Hafting wrote:
> Andrea Arcangeli wrote:
> 
> > very clear now, thanks. I though the fs you were talking about was
> > mounted on the blkdev, while it is instead the one where the blkdev
> > inode cames from. Usually people keeps bldkev in /dev and nobody
> > unmounts /dev that's why I didn't noticed and thought about it, sorry.
> > 
> Don't assume "nobody does..."  There is always someone that do.
> 
> This particular one will easily come up for anybody who
> develop boot floppies:
> 
> 1.  Mount the boot floppy at /mnt
> 2.  Test the device files in /mnt/dev, _perhaps by using them
>     in every way imaginable_
> 3.  Umount the boot floppy. - There goes /mnt/dev and a bunch
>     of block devices.  
> 
> Umounting some dev/ is common enough.

Of course, that's a bug and it must be fixed.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  6:40       ` Andrea Arcangeli
@ 2001-09-18 16:06         ` Marcelo Tosatti
  2001-09-18 19:18           ` Marcelo Tosatti
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18 16:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Andrea Arcangeli wrote:

> On Tue, Sep 18, 2001 at 02:02:37AM -0300, Marcelo Tosatti wrote:
> > Try to run several memory hungry threads (thus hiding more pages).
> 
> I did that indeed, not sure why I didn't reproduced (I guess the hogs
> were sligthly different). (and of course they were killed but only after
> the box was truly oom)

Easy reproducible way: 

Boot with 64M, start 4 setiathome processes, then start mtest (from
memtest suite) with a lot of threads (I'm using 40 readers and 4 writers
in this case, and a 100MB heap)

[root@matrix memtest-0.0.4]# ./mtest -m 100 -r 40 -w 4
Starting test run with 100 megabyte heap.
Setting up 25600 4096kB pages for test...
 done.
Child 00 started with pid 00935, readonly
Child 01 started with pid 00936, readonly
Child 02 started with pid 00937, readonly
Child 03 started with pid 00938, readonly
Child 04 started with pid 00939, readonly


Sep 18 14:28:31 matrix kernel: __alloc_pages: 0-order allocation failed
(gfp=0x1d2/0) from c012c3be
Sep 18 14:28:31 matrix kernel: VM: killing process setiathome


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  9:31 ` Alexander Viro
  2001-09-18  9:39   ` Andrea Arcangeli
@ 2001-09-18 16:45   ` Linus Torvalds
  2001-09-18 18:19     ` Alexander Viro
  1 sibling, 1 reply; 110+ messages in thread
From: Linus Torvalds @ 2001-09-18 16:45 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Kernel Mailing List, Andrea Arcangeli


On Tue, 18 Sep 2001, Alexander Viro wrote:
>
> Umm...  Linus, had you actually read through the fs/block_device.c part
> of that?  It's not just ugly as hell, it's (AFAICS) not hard to oops
> if you have several inodes sharing major:minor.  ->bd_inode and its
> treatment are bogus.  Please, read it through and consider reverting -
> in its current state code is an ugly mess.

Funny that you mention it, because I actually have a cunning plan, and
you're an unwitting part of it.

Or actually, I hope you're a "witting" part of it, because it's going to
be your code.

Take your "struct block_device" code, add a "struct address_space" to it,
and whenever a block device inode is opened, make the inode->i_mapping
point to &bdev->b_data, and voila..

You already get all the reference counting right, and it's the only
sensible place to do it anyway, wouldn't you agree?

I thought you'd be thrilled. It seems to match your lazy allocation patch
very well..

		Linus


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 10:17           ` Andrea Arcangeli
  2001-09-18 10:28             ` Alexander Viro
  2001-09-18 11:05             ` Helge Hafting
@ 2001-09-18 17:02             ` Linus Torvalds
  2 siblings, 0 replies; 110+ messages in thread
From: Linus Torvalds @ 2001-09-18 17:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alexander Viro, Kernel Mailing List


On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> >
> > What are you going to use as mapping->host for it?
>
> the only info we'd need from the host is the host->i_rdev, so why can't
> we get it from the file->f_dentry->d_inode->i_rdev?

No, just allocate a real inode as part of "struct block_device"
allocation. This is basically what socket etc stuff does - and it's
actually easiest to do it the other way, exactly like sockets do (ie
_embed_ "struct block_device"  inside a struct inode as if "block_device"
was just another filesystem), and then just allocate one inode and get the
"struct block_device"  allocation for free.

This works really well for sockets, and has worked for a long time. See
how the socket is hidden in "inode->u.socket_i", and how it's trivial to
convert from one to the other.

And the fact is, a block_device _is_ a simple filesystem if you squint
just the right way. Admittedly it's a fairly flat and _boring_ filesystem,
but hey, that's the point of them, after all ;)

			Linus


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 16:45   ` Linus Torvalds
@ 2001-09-18 18:19     ` Alexander Viro
  2001-09-18 18:27       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 18:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List



On Tue, 18 Sep 2001, Linus Torvalds wrote:

> 
> On Tue, 18 Sep 2001, Alexander Viro wrote:
> >
> > Umm...  Linus, had you actually read through the fs/block_device.c part
> > of that?  It's not just ugly as hell, it's (AFAICS) not hard to oops
> > if you have several inodes sharing major:minor.  ->bd_inode and its
> > treatment are bogus.  Please, read it through and consider reverting -
> > in its current state code is an ugly mess.
> 
> Funny that you mention it, because I actually have a cunning plan, and
> you're an unwitting part of it.

/me swears

Yes, we can make it work that way (see downthread).  Yes, combined with
lazy allocation (and pipefs-like scheme) it can turn into nice, _working_
code.

But right now we have
	a) broken patch applied to the tree
	b) somewhat tested patch that may (after being modified so that it
would _apply_ to -pre11) fix the breakage.  Once it's tested enough to
be consider as a candidate for inclusion, that is.

IMNSHO it's somewhat less than ideal situation.  I've already talked to
with Andrea and once he gets back to life (no, 'e's just sleepin') we'll
try to do something usable.  Life would be much simpler if aforementioned
cunning plan included sending mail to participants (me and Andrea) before
putting the patch into the tree, though.  Oh, well...

> Or actually, I hope you're a "witting" part of it, because it's going to
> be your code.
> 
> Take your "struct block_device" code, add a "struct address_space" to it,
> and whenever a block device inode is opened, make the inode->i_mapping
> point to &bdev->b_data, and voila..
> 
> You already get all the reference counting right, and it's the only
> sensible place to do it anyway, wouldn't you agree?
> 
> I thought you'd be thrilled. It seems to match your lazy allocation patch
> very well..

It can be modified so that combination with lazy-bdev and pipefs-like tree
would work.  And yes, most of the ugliness would just go away.

The problem being, I'd rather do it in a different order -
	1) finish testing of lazy-bdev
	2) apply well-tested patch
	3) test modified bedv-in-pagecache patch
	4) apply it, once it got public testing.

As it is, we'll do combined patch, diff it against -pre11 and submit
for inclusion - new hole in the main tree needs fixing...


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:19     ` Alexander Viro
@ 2001-09-18 18:27       ` Linus Torvalds
  2001-09-18 19:14         ` Andreas Dilger
                           ` (4 more replies)
  2001-09-19 13:38       ` Rik van Riel
  2001-09-19 16:35       ` Andrea Arcangeli
  2 siblings, 5 replies; 110+ messages in thread
From: Linus Torvalds @ 2001-09-18 18:27 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Kernel Mailing List


On Tue, 18 Sep 2001, Alexander Viro wrote:
>
> It can be modified so that combination with lazy-bdev and pipefs-like tree
> would work.  And yes, most of the ugliness would just go away.

That's the part I like about the page-cache bdev patch. It has a lot of
fairly ugly warts, but all of them seem to be really fixable with _other_
cleanups, at which point only the good parts remain.

I agree that the timing may leave something to be desired. But we had the
discussion about fixing pagecache-bdev consistency wrt the regular buffer
cache filesystem accesses a week or so ago, and the fact is that nobody
really seems to have started working on it - because everybody felt that
you have to get everything done at once.

I don't have that feeling. I'm happy with having partial merge with ugly
warts, if it means that you can get to the final stage _without_ having to
have all the problems fixed at one time.

So now we have two _smaller_ merges that will fix two other issues, and
remove all the horridness from the original merge.

		Linus


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:27       ` Linus Torvalds
@ 2001-09-18 19:14         ` Andreas Dilger
  2001-09-18 19:41           ` Alexander Viro
  2001-09-18 20:33           ` Richard Gooch
  2001-09-18 19:29         ` Benjamin LaHaise
                           ` (3 subsequent siblings)
  4 siblings, 2 replies; 110+ messages in thread
From: Andreas Dilger @ 2001-09-18 19:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Viro, Kernel Mailing List

On Sep 18, 2001  11:27 -0700, Linus Torvalds wrote:
> I agree that the timing may leave something to be desired. But we had the
> discussion about fixing pagecache-bdev consistency wrt the regular buffer
> cache filesystem accesses a week or so ago, and the fact is that nobody
> really seems to have started working on it - because everybody felt that
> you have to get everything done at once.

The real question is why can't we just open 2.5 and only fix the VM to
start with?  Leave the kernel at 2.4.10-pre10, and possibly use the -ac
VM code (which has diverged from mainline), and leave people (Alan, Ben,
Marcello, et. al.) who want to tinker with it in small increments and
do the drastic stuff in 2.5.

Make it clear that 2.5 is still ONLY for VM and other bug fixes at this
point, and not all of the long-awaited 2.5 rework YET.  If it turns
out that the VM fixes are done quickly, then they can be back-ported
to 2.4.  If it takes longer than expected you can open 2.5 up to general
development.

With the right "management" it will be not much different than the current
situation, except that people won't have fits about such huge changes
going into 2.4.  I think this is your subconcious telling you that you
really wanted to start 2.5 a month ago.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 16:06         ` Marcelo Tosatti
@ 2001-09-18 19:18           ` Marcelo Tosatti
  2001-09-18 21:05             ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Marcelo Tosatti @ 2001-09-18 19:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List



On Tue, 18 Sep 2001, Marcelo Tosatti wrote:

> 
> 
> On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> 
> > On Tue, Sep 18, 2001 at 02:02:37AM -0300, Marcelo Tosatti wrote:
> > > Try to run several memory hungry threads (thus hiding more pages).
> > 
> > I did that indeed, not sure why I didn't reproduced (I guess the hogs
> > were sligthly different). (and of course they were killed but only after
> > the box was truly oom)
> 
> Easy reproducible way: 
> 
> Boot with 64M, start 4 setiathome processes, then start mtest (from
> memtest suite) with a lot of threads (I'm using 40 readers and 4 writers
> in this case, and a 100MB heap)
> 
> [root@matrix memtest-0.0.4]# ./mtest -m 100 -r 40 -w 4
> Starting test run with 100 megabyte heap.
> Setting up 25600 4096kB pages for test...
>  done.
> Child 00 started with pid 00935, readonly
> Child 01 started with pid 00936, readonly
> Child 02 started with pid 00937, readonly
> Child 03 started with pid 00938, readonly
> Child 04 started with pid 00939, readonly
> 
> 
> Sep 18 14:28:31 matrix kernel: __alloc_pages: 0-order allocation failed
> (gfp=0x1d2/0) from c012c3be
> Sep 18 14:28:31 matrix kernel: VM: killing process setiathome

Andrea, 

I still can reproduce the alloc pages failures with the following patch.

It should remove the "hidden pages" problem WRT ZONES. That is, a lot of
threads can still hide pages from each other independantly of the zone
hiding problem. 

--- linux.orig/mm/vmscan.c      Tue Sep 18 15:43:14 2001
+++ linux/mm/vmscan.c   Tue Sep 18 16:37:52 2001
@@ -361,13 +361,19 @@
                }
 
                deactivate_page_nolock(page);
+
                list_del(entry);
-               list_add_tail(entry, &inactive_local_lru);
 
-               if (__builtin_expect(!memclass(page->zone, classzone), 0))
+               if (__builtin_expect(!memclass(page->zone, classzone),
0)) {
+                       list_add_tail(entry, &inactive_list);
+                       __max_scan--;
                        continue;
+               }
 
                __max_scan--;
+
+               list_add_tail(entry, &inactive_local_lru);
+
 
                /* Racy check to avoid trylocking when not worthwhile */
                if (!page->buffers && page_count(page) != 1) {



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:27       ` Linus Torvalds
  2001-09-18 19:14         ` Andreas Dilger
@ 2001-09-18 19:29         ` Benjamin LaHaise
  2001-09-18 20:17         ` Stephan von Krawczynski
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Benjamin LaHaise @ 2001-09-18 19:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Viro, Kernel Mailing List

On Tue, Sep 18, 2001 at 11:27:27AM -0700, Linus Torvalds wrote:
> I don't have that feeling. I'm happy with having partial merge with ugly
> warts, if it means that you can get to the final stage _without_ having to
> have all the problems fixed at one time.
> 
> So now we have two _smaller_ merges that will fix two other issues, and
> remove all the horridness from the original merge.

A lot of us would be much happier if you just renamed 2.4.10pre11 to 2.5.1. =)
Then we can backport things as they become stable.

		-ben

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 19:14         ` Andreas Dilger
@ 2001-09-18 19:41           ` Alexander Viro
  2001-09-18 20:33           ` Richard Gooch
  1 sibling, 0 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 19:41 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Kernel Mailing List



On Tue, 18 Sep 2001, Andreas Dilger wrote:

> The real question is why can't we just open 2.5 and only fix the VM to
> start with?  Leave the kernel at 2.4.10-pre10, and possibly use the -ac
> VM code (which has diverged from mainline), and leave people (Alan, Ben,
> Marcello, et. al.) who want to tinker with it in small increments and
> do the drastic stuff in 2.5.

I'd rather get a list of API changes planned for 2.5 and DO ONLY THEM.
IOW, start 2.5 with a sequence of patches that do nothing but a global
search-and-replace.  Then treat it as -STABLE.  It _is_ doable - check
what had happened to superblock handling in 2.4.  Yes, it takes extra
work on splitup and doing things in compatible way, but it's not a
rocket science - BTDT.

"I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its charm
years ago - just look at the devfs mess...


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:27       ` Linus Torvalds
  2001-09-18 19:14         ` Andreas Dilger
  2001-09-18 19:29         ` Benjamin LaHaise
@ 2001-09-18 20:17         ` Stephan von Krawczynski
  2001-09-18 20:33           ` Alan Cox
  2001-09-19 13:42           ` Rik van Riel
  2001-09-19  2:59         ` Michael Peddemors
  2001-09-19 16:11         ` Alexander Viro
  4 siblings, 2 replies; 110+ messages in thread
From: Stephan von Krawczynski @ 2001-09-18 20:17 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: torvalds, viro, linux-kernel

On Tue, 18 Sep 2001 13:14:19 -0600 Andreas Dilger <adilger@turbolabs.com>
wrote:

> On Sep 18, 2001  11:27 -0700, Linus Torvalds wrote:
> [...]
> The real question is why can't we just open 2.5 and only fix the VM to
> start with?

Hm, I guess if anybody would be capable of _really_ fixing vm in upto-pre10
state, he would have done it already. It's not that people would not have
tried, but it looks like nobody is able to get the _whole_ picture of this.
Surely there are people that know a lot about certain parts of the (old) vm
code, and thats exactly what leads to this Linus' statements here sounding like
"I had a look at part xyz of it and it's a mess. Patch appended." (see pre10 vm
patch).
You have to keep in mind that a lot of people on the planet try to use 2.4 in a
unbelievably broad variety of setups - and quite a number of them relies on
released kernels. If you take a close look at 2.4.7, 2.4.8, 2.4.9 you may well
find out, that they're unusable compared to 2.2.19. You cannot go on like this.
I really do back Andrea and Linus for this step, because I can _see_ a great
win in performance _and_ things got less complex in this code. So there is a
much bigger chance to tilt the last remaining problems in short time (hopefully
before 2.4.10 or .11). 
So I guess the right thing to do now is give Andrea good input of leftover,
reproducible problems to give him a chance to fix them. A major discussion
about "doing it all the way round" makes only sense, if someone comes up with
something _at least_ as good as Andrea's code. 
Then its "Lonely Linus"' decision to choose. Whatever he will choose, a good
percentage LKML will be against it. This is a normal thing, the guy that has to
make the decision is alone most of the time. The only way for you to find out
if he is right is to _try_ it. If he is, tell him. If he isn't, send a patch.
He couldn't have been that wrong the last years, though.

</end sermon>

Just _one_ opinion from an unimportant guy,

Stephan


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 19:14         ` Andreas Dilger
  2001-09-18 19:41           ` Alexander Viro
@ 2001-09-18 20:33           ` Richard Gooch
  2001-09-18 20:53             ` Alexander Viro
  2001-09-18 21:06             ` Richard Gooch
  1 sibling, 2 replies; 110+ messages in thread
From: Richard Gooch @ 2001-09-18 20:33 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Andreas Dilger, Kernel Mailing List

Alexander Viro writes:
> "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its
> charm years ago - just look at the devfs mess...

Al, are you ever going to stop bitching and moaning about devfs?
If you have a specific, technical, reasoned complaint to make, do it
coherently. Otherwise, stop sniping.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 20:17         ` Stephan von Krawczynski
@ 2001-09-18 20:33           ` Alan Cox
  2001-09-19 13:42           ` Rik van Riel
  1 sibling, 0 replies; 110+ messages in thread
From: Alan Cox @ 2001-09-18 20:33 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Andreas Dilger, torvalds, viro, linux-kernel

> Hm, I guess if anybody would be capable of _really_ fixing vm in upto-pre10
> state, he would have done it already. It's not that people would not have
> tried, but it looks like nobody is able to get the _whole_ picture of this.

They have been, but slowly and steadily - thats why the -ac mm worked fairly
well. Its also why merging it with Linus would have been pointless at this
time since its just backing out some of the less apparently useful
experiments

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 20:33           ` Richard Gooch
@ 2001-09-18 20:53             ` Alexander Viro
  2001-09-18 21:06             ` Richard Gooch
  1 sibling, 0 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 20:53 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Andreas Dilger, Kernel Mailing List



On Tue, 18 Sep 2001, Richard Gooch wrote:

> Alexander Viro writes:
> > "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its
> > charm years ago - just look at the devfs mess...
> 
> Al, are you ever going to stop bitching and moaning about devfs?
> If you have a specific, technical, reasoned complaint to make, do it
> coherently. Otherwise, stop sniping.

What?  I'm saying that devfs was a huge patch which was not understood
(let alone audited) by anyone except you when it was included and as
the direct result we have a long history of problems in that area.
You want to argue against that?

If you still feel that feeding devfs into the tree in one huge chunk was
not a mistake - you are much dumber than I thought.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 19:18           ` Marcelo Tosatti
@ 2001-09-18 21:05             ` Andrea Arcangeli
  2001-09-19 13:57               ` Rik van Riel
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-18 21:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 04:18:34PM -0300, Marcelo Tosatti wrote:
> I still can reproduce the alloc pages failures with the following patch.

Could you try the very last patch I posted?

> --- linux.orig/mm/vmscan.c      Tue Sep 18 15:43:14 2001
> +++ linux/mm/vmscan.c   Tue Sep 18 16:37:52 2001
> @@ -361,13 +361,19 @@
>                 }
>  
>                 deactivate_page_nolock(page);
> +
>                 list_del(entry);
> -               list_add_tail(entry, &inactive_local_lru);
>  
> -               if (__builtin_expect(!memclass(page->zone, classzone), 0))
> +               if (__builtin_expect(!memclass(page->zone, classzone),
> 0)) {
> +                       list_add_tail(entry, &inactive_list);
> +                       __max_scan--;
>                         continue;
> +               }
>  
>                 __max_scan--;
> +
> +               list_add_tail(entry, &inactive_local_lru);
> +

I actually used a much more aggressive approch, I always left all the
page visibles now, not just for the memclass check. After all the same
issue that can arise with the memclass check can also arise in a smaller
scale with the other failed checks. But the oom weren't just because of
the "hiding", I suspect that infact.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 20:33           ` Richard Gooch
  2001-09-18 20:53             ` Alexander Viro
@ 2001-09-18 21:06             ` Richard Gooch
  2001-09-18 21:27               ` Alexander Viro
  1 sibling, 1 reply; 110+ messages in thread
From: Richard Gooch @ 2001-09-18 21:06 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Andreas Dilger, Kernel Mailing List

Alexander Viro writes:
> 
> 
> On Tue, 18 Sep 2001, Richard Gooch wrote:
> 
> > Alexander Viro writes:
> > > "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its
> > > charm years ago - just look at the devfs mess...
> > 
> > Al, are you ever going to stop bitching and moaning about devfs?
> > If you have a specific, technical, reasoned complaint to make, do it
> > coherently. Otherwise, stop sniping.
> 
> What?  I'm saying that devfs was a huge patch which was not understood
> (let alone audited) by anyone except you when it was included and as
> the direct result we have a long history of problems in that area.
> You want to argue against that?
> 
> If you still feel that feeding devfs into the tree in one huge chunk
> was not a mistake - you are much dumber than I thought.

Actually, many times I fed Linus smaller patches. I tried to get the
FS core itself in separately from the drivers, which was 100% safe
(since no existing code was touched), but he wasn't interested. If the
devfs core *had* been accepted on it's own, I could then have
reasonably split up the driver patches.

However, I disagree with your "long history of problems" comment. You
make it sound like devfs hasn't performed reliably, whereas in real
life (i.e. normal use, not constructed exploits) it's done quite well.

In any case, if your "I can't be arsed to split my patch" comment is
directed at me, I take offense. I did actually take the trouble to
split up my patch.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 21:06             ` Richard Gooch
@ 2001-09-18 21:27               ` Alexander Viro
  0 siblings, 0 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-18 21:27 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Andreas Dilger, Kernel Mailing List



On Tue, 18 Sep 2001, Richard Gooch wrote:

> Actually, many times I fed Linus smaller patches. I tried to get the
> FS core itself in separately from the drivers, which was 100% safe
> (since no existing code was touched), but he wasn't interested. If the
> devfs core *had* been accepted on it's own, I could then have
> reasonably split up the driver patches.
> 
> However, I disagree with your "long history of problems" comment. You
> make it sound like devfs hasn't performed reliably, whereas in real
> life (i.e. normal use, not constructed exploits) it's done quite well.
> 
> In any case, if your "I can't be arsed to split my patch" comment is
> directed at me, I take offense. I did actually take the trouble to
> split up my patch.

Sheesh... Statement: "non-priveleged user can crash any released version
of kernel if devfs is compiled in and mounted".  Provably true.  It _does_
qualify as a problem and yes, I would say that 20 months _is_ long.

Regardless of the reasons why the thing went in once chunk, presence of
these problems is a direct result.  Anyone who wants to push a large
patch into the tree is inviting the same result.

BTW, potentially useful observation: if a preliminary chunk of patch
makes sense on its own, it gets much better chance to get applied.
Accepting devfs core in one step would be completely useless - it's
still too large (~100Kb) and if it would be applied first, it would
get zero testing until the rest went in.

Anyway, Zen And Art Of Feeding Patches Into Tree is a topic for a different
thread...  AFAICS devfs could be split in meaningful steps (ones that
would get testing immediately and would add functionality by pieces),
but by now it's hardly interesting.  If you want details - let's take
it to private mail, preferably when I'll have some spare time.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:27       ` Linus Torvalds
                           ` (2 preceding siblings ...)
  2001-09-18 20:17         ` Stephan von Krawczynski
@ 2001-09-19  2:59         ` Michael Peddemors
  2001-09-19 16:11         ` Alexander Viro
  4 siblings, 0 replies; 110+ messages in thread
From: Michael Peddemors @ 2001-09-19  2:59 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Kernel Mailing List

Ummmm.. because the 2.4 series is supposed to be the stable series.. and
a lot of people's comments on virtually every 2.4 kernel have expressed
dissatisfaction with it's stability..

So a lot of people haven't left the 2.2 series yet, and are waiting for
a 2.4 series that everyone is happy with..  Even in your note, you
haven't indicated a release 2.4 version that makes you happy.

On Tue, 2001-09-18 at 12:14, Andreas Dilger wrote:
> On Sep 18, 2001  11:27 -0700, Linus Torvalds wrote:
> > I agree that the timing may leave something to be desired. But we had the
> > discussion about fixing pagecache-bdev consistency wrt the regular buffer
> > cache filesystem accesses a week or so ago, and the fact is that nobody
> > really seems to have started working on it - because everybody felt that
> > you have to get everything done at once.
> 
> The real question is why can't we just open 2.5 and only fix the VM to
> start with?  Leave the kernel at 2.4.10-pre10, and possibly use the -ac
> VM code (which has diverged from mainline), and leave people (Alan, Ben,
> Marcello, et. al.) who want to tinker with it in small increments and
> do the drastic stuff in 2.5.

> 
-- 
"Catch the Magic of Linux..."
--------------------------------------------------------
Michael Peddemors - Senior Consultant
LinuxAdministration - Internet Services
NetworkServices - Programming - Security
Wizard IT Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604)589-0037 Beautiful British Columbia, Canada


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:19     ` Alexander Viro
  2001-09-18 18:27       ` Linus Torvalds
@ 2001-09-19 13:38       ` Rik van Riel
  2001-09-19 16:35       ` Andrea Arcangeli
  2 siblings, 0 replies; 110+ messages in thread
From: Rik van Riel @ 2001-09-19 13:38 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, 18 Sep 2001, Alexander Viro wrote:
> On Tue, 18 Sep 2001, Linus Torvalds wrote:
>
> > Funny that you mention it, because I actually have a cunning plan, and
> > you're an unwitting part of it.
>
> /me swears

No need to get stressed, if you don't like it you don't
do it and Linus will have to do things himself ;)

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 20:17         ` Stephan von Krawczynski
  2001-09-18 20:33           ` Alan Cox
@ 2001-09-19 13:42           ` Rik van Riel
  2001-09-19 14:27             ` Alexander Viro
  1 sibling, 1 reply; 110+ messages in thread
From: Rik van Riel @ 2001-09-19 13:42 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Andreas Dilger, torvalds, viro, linux-kernel

On Tue, 18 Sep 2001, Stephan von Krawczynski wrote:

> Hm, I guess if anybody would be capable of _really_ fixing vm in
> upto-pre10 state, he would have done it already. It's not that people
> would not have tried, but it looks like nobody is able to get the
> _whole_ picture of this.

Look, the problem is that Linus is being an asshole and
integrating conflicting ideas into both the VM and the
VFS, without giving anybody prior notice and later blame
others.

Just look at how he's now trying to force Al Viro into
implementing his ideas yesterday because he broke stuff
again...

If you want a stable kernel, use Alan's kernel.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 21:05             ` Andrea Arcangeli
@ 2001-09-19 13:57               ` Rik van Riel
  0 siblings, 0 replies; 110+ messages in thread
From: Rik van Riel @ 2001-09-19 13:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Andrew Morton, Linus Torvalds, Kernel Mailing List

On Tue, 18 Sep 2001, Andrea Arcangeli wrote:
> On Tue, Sep 18, 2001 at 04:18:34PM -0300, Marcelo Tosatti wrote:
> > I still can reproduce the alloc pages failures with the following patch.
>
> Could you try the very last patch I posted?

Marcelo told you how to reproduce the bug, is there any
particular reason you're not testing yourself? ;)

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 13:42           ` Rik van Riel
@ 2001-09-19 14:27             ` Alexander Viro
  0 siblings, 0 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-19 14:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephan von Krawczynski, Andreas Dilger, torvalds, linux-kernel



On Wed, 19 Sep 2001, Rik van Riel wrote:

> Just look at how he's now trying to force Al Viro into
> implementing his ideas yesterday because he broke stuff
> again...

Rik, in case you've missed that, I can and do speak for myself.  I had
spent ten years dodging the draft; when I decide to get enlisted into
something it will happen on _my_ decision and _my_ conditions.  When I
decide that I'm being forced into something I do not accept - you'll know
it from posting with URL of forked tree.

FWIW, I'm less than thrilled by the Andrea's patch, but it is salvagable.
I'm also less than thrilled by the whole situation with VM - all sides of
it.  I seriously suspect that we need a limited multi-way fork in that
area, so that you guys would stop stepping on each others' toes.  I'm taking
no part in your merry 5-way clusterfuck - sort that mess out between
yourselves.

Again, when I decide that situation is unacceptable for me - I'll simply
fork the tree.  I do _not_ appreciate being enlisted into anyone's holy
wars, so unless you really want to go _way_ up in my personal shitlist
(several positions below .ru DoD) - don't play politics in my vicinity.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:27       ` Linus Torvalds
                           ` (3 preceding siblings ...)
  2001-09-19  2:59         ` Michael Peddemors
@ 2001-09-19 16:11         ` Alexander Viro
  2001-09-19 18:25           ` Andrea Arcangeli
  4 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-19 16:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List



On Tue, 18 Sep 2001, Linus Torvalds wrote:

> 
> On Tue, 18 Sep 2001, Alexander Viro wrote:
> >
> > It can be modified so that combination with lazy-bdev and pipefs-like tree
> > would work.  And yes, most of the ugliness would just go away.
> 
> That's the part I like about the page-cache bdev patch. It has a lot of
> fairly ugly warts, but all of them seem to be really fixable with _other_
> cleanups, at which point only the good parts remain.

It's actually quite broken in several areas (== bunch of panicable rmmod
races, broken wrt umount(), trivially oopsable in ioctl() on ramdisk,
very suspicious in swapoff(2), etc.).  It _might_ be fixable, but I
would really like to see the patch that went into -pre11 separately from
the rest.  Andrea, could you send it to me?  In particular, I'm deeply
suspicious about changes in blkdev_put() in case of BDEV_FILE.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18 18:19     ` Alexander Viro
  2001-09-18 18:27       ` Linus Torvalds
  2001-09-19 13:38       ` Rik van Riel
@ 2001-09-19 16:35       ` Andrea Arcangeli
  2 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 16:35 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Tue, Sep 18, 2001 at 02:19:42PM -0400, Alexander Viro wrote:
> with Andrea and once he gets back to life (no, 'e's just sleepin') we'll

yes, I'm back to life but I had a surprise: but my ISP is down so I'm
now masqueraded my dekstop over a wirless GPRS link via irda on the
laptop [GRPS is flat rate in Italy :) ] so it's quite dogslow at the
moment, I cannot read emails and RTT over GSM is of the order of 1
second (but I can send emails and slowly browse the web). I still hope
the ISP returns alive before tomorrow... (I'm looking into Andrew's
report at the moment and I don't need good connectivity for debugging it
luckily :)

I think the bug we have in the inode pinning isn't a showstopper (I
probably could live with it forever without ever have chance to notice it),
so no user should be hurted by it [but I understand it can hurt you very
much], and I am _very_ concerned in addressing it ASAP and I'm very
happy you spoteed it immediatly. I'll also check your worries with the
ramdisk BLKFLSBUF (the logic should be just like before the changes, the
reason I backed out the rd_bdev changes is because I was always working
with inodes for everything related to the physical address space, not
with bdevs so I didn't see a value in the change, but after the bugfix
[bdev->bd_mapping] probably this will change), for the other issue with the
->writepage as said it is fine and intentional, except we can do
something more efficient for the ramdisk adding some special case to the
memory balancing code (for istance we could always put the ramdisk stuff
over the active list rather than rolling it over in the inactive list
again, but quite frankly that's low prio unless somebody has a real
patological case and of course that applies and it always applied to
ramfs too [I did it with the argument: it works for ramfs so it will
obviously work work for ramdisk too]). In the next days I'll also be
very busy with some non linux-kernel developement issue, but I'll try to
be still as much responsive as possible for anything urgent (I'll
certainly be 100% responsive again next week starting from Wed 26).

At the moment my very first prio is to address the BUG() reported by
Andrew (if it is just been fixed please somebody at SuSE send me an SMS
so I don't waste time since as said I cannot read emails at the moment).

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 16:11         ` Alexander Viro
@ 2001-09-19 18:25           ` Andrea Arcangeli
  2001-09-19 19:21             ` Alexander Viro
  2001-09-19 20:41             ` Richard Gooch
  0 siblings, 2 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 18:25 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 19, 2001 at 12:11:56PM -0400, Alexander Viro wrote:
> the rest.  Andrea, could you send it to me?  In particular, I'm deeply
> suspicious about changes in blkdev_put() in case of BDEV_FILE.

of course, for the record you can also find it in the ftp area all
splitted out, but I've no problem to send it via email too.

Quite frankly the BDEV_* handling was and is a total mess IMHO, even if
it was written by you ;), there was no difference at all from many of
them, I didn't fixed that but I had to check all them on the differences
until I realized there was none. I also think the other things you
mentioned (besides the inode pinning bug, non critical) are not buggy
(infact I never had a single report), but well we'll verify that in
detail ASAP.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 18:25           ` Andrea Arcangeli
@ 2001-09-19 19:21             ` Alexander Viro
  2001-09-19 20:55               ` Andrea Arcangeli
  2001-09-20  2:34               ` Andrea Arcangeli
  2001-09-19 20:41             ` Richard Gooch
  1 sibling, 2 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-19 19:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Wed, 19 Sep 2001, Andrea Arcangeli wrote:

> Quite frankly the BDEV_* handling was and is a total mess IMHO, even if
> it was written by you ;), there was no difference at all from many of
> them, I didn't fixed that but I had to check all them on the differences
> until I realized there was none. I also think the other things you

There certainly _are_ differences  (e.g. in handling the moment
when you close them).

> mentioned (besides the inode pinning bug, non critical) are not buggy

_What_?

int fd = open("/dev/ram0", O_RDWR);
ioctl(fd, BLKFLSBUF);
ioctl(fd, BLKFLSBUF);

and you claim that resulting oops is not a bug?

> (infact I never had a single report), but well we'll verify that in

Richard, is that you?  What had you done with real Andrea?


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 18:25           ` Andrea Arcangeli
  2001-09-19 19:21             ` Alexander Viro
@ 2001-09-19 20:41             ` Richard Gooch
  1 sibling, 0 replies; 110+ messages in thread
From: Richard Gooch @ 2001-09-19 20:41 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List

Alexander Viro writes:
> On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> > (infact I never had a single report), but well we'll verify that in
> 
> Richard, is that you?  What had you done with real Andrea?

To throw your own crap back at you: "don't drag me into your political
crap".

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 19:21             ` Alexander Viro
@ 2001-09-19 20:55               ` Andrea Arcangeli
  2001-09-19 21:17                 ` Alexander Viro
  2001-09-19 22:15                 ` Linux 2.4.10-pre11 Richard Gooch
  2001-09-20  2:34               ` Andrea Arcangeli
  1 sibling, 2 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 20:55 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote:
> 
> 
> On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> 
> > Quite frankly the BDEV_* handling was and is a total mess IMHO, even if
> > it was written by you ;), there was no difference at all from many of
> > them, I didn't fixed that but I had to check all them on the differences
> > until I realized there was none. I also think the other things you
> 
> There certainly _are_ differences  (e.g. in handling the moment
> when you close them).

there aren't difference, only thing that matters is: "is that an fs
or a blkdev". SWAP/RAW/FILE is useless.

> > mentioned (besides the inode pinning bug, non critical) are not buggy
> 
> _What_?
> 
> int fd = open("/dev/ram0", O_RDWR);
> ioctl(fd, BLKFLSBUF);
> ioctl(fd, BLKFLSBUF);
> 
> and you claim that resulting oops is not a bug?

that is a bug.

> > (infact I never had a single report), but well we'll verify that in
> 
> Richard, is that you?  What had you done with real Andrea?

You also screwup things sometime (think the few liner you posts to l-k
after your cleanups).  Those are minor bugs, so I'm not going to panic
on them (ramdisk works not by luck), this is what I meant, and they will
be fixed shortly somehow, and many thanks for the further auditing.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 20:55               ` Andrea Arcangeli
@ 2001-09-19 21:17                 ` Alexander Viro
  2001-09-19 23:01                   ` Andrea Arcangeli
  2001-09-19 23:03                   ` Andrea Arcangeli
  2001-09-19 22:15                 ` Linux 2.4.10-pre11 Richard Gooch
  1 sibling, 2 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-19 21:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> > There certainly _are_ differences  (e.g. in handling the moment
> > when you close them).
> 
> there aren't difference, only thing that matters is: "is that an fs
> or a blkdev". SWAP/RAW/FILE is useless.

fsync_dev() is not needed for raw devices or swap.  It _is_ needed for
file access.

> > > (infact I never had a single report), but well we'll verify that in
> > 
> > Richard, is that you?  What had you done with real Andrea?
> 
> You also screwup things sometime (think the few liner you posts to l-k
> after your cleanups).  Those are minor bugs, so I'm not going to panic

Certainly.

> on them (ramdisk works not by luck), this is what I meant, and they will

Sorry, it just sounded so..., well, familiar... Couldn't resist ;-) (BTW,
Richard, _what_ political whatever could be found in that?)

> be fixed shortly somehow, and many thanks for the further auditing.

Andrea, had you seen the off-list mail (cc: to you and Linus)?  The main
problem I have right now is that I don't see how you manage to guarantee
that during the last ->release() no requests are going in.  Old code
did unconditional invalidate_buffers() to wipe out all buffer_heads when
device is finally closed.  Absense of pagecache sources was guaranteed
by umount() - by the time when we release ->s_bdev all pages are gone.
How do you deal with that in the current code?


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 20:55               ` Andrea Arcangeli
  2001-09-19 21:17                 ` Alexander Viro
@ 2001-09-19 22:15                 ` Richard Gooch
  1 sibling, 0 replies; 110+ messages in thread
From: Richard Gooch @ 2001-09-19 22:15 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List

Alexander Viro writes:
> On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> > > > (infact I never had a single report), but well we'll verify that in
> > > 
> > > Richard, is that you?  What had you done with real Andrea?
> > 
> > You also screwup things sometime (think the few liner you posts to l-k
> > after your cleanups).  Those are minor bugs, so I'm not going to panic
> 
> Certainly.
> 
> > on them (ramdisk works not by luck), this is what I meant, and they will
> 
> Sorry, it just sounded so..., well, familiar... Couldn't resist ;-)

Oh, yeah. So it seems. "Resistance is futile", eh?

> (BTW, Richard, _what_ political whatever could be found in that?)

Gee, do I have to spell it out? Just today, I saw you complain to Rik
that he'd dragged you into some issue. And now you take my name in
vain regarding another issue which I'd not previously been involved
in.

No, not exactly the same. But I couldn't help but notice a certain
parallel...

Anyway, this is going nowhere.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 21:17                 ` Alexander Viro
@ 2001-09-19 23:01                   ` Andrea Arcangeli
  2001-09-19 23:03                   ` Andrea Arcangeli
  1 sibling, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:01 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 19, 2001 at 05:17:23PM -0400, Alexander Viro wrote:
> Andrea, had you seen the off-list mail (cc: to you and Linus)?  The main
> problem I have right now is that I don't see how you manage to guarantee
> that during the last ->release() no requests are going in.  Old code

bdev->bd_sem + bdev->bd_cache_openers.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 21:17                 ` Alexander Viro
  2001-09-19 23:01                   ` Andrea Arcangeli
@ 2001-09-19 23:03                   ` Andrea Arcangeli
  2001-09-19 23:30                     ` Alexander Viro
  1 sibling, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:03 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 19, 2001 at 05:17:23PM -0400, Alexander Viro wrote:
> fsync_dev() is not needed for raw devices or swap.  It _is_ needed for
> file access.

then what's the difference between raw devices and swap.

And there's reason we should we avoid the fsync_dev with the raw devices
and swap.

I just found it an useless complication.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 23:03                   ` Andrea Arcangeli
@ 2001-09-19 23:30                     ` Alexander Viro
  2001-09-19 23:40                       ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-19 23:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Andrea Arcangeli wrote:

> On Wed, Sep 19, 2001 at 05:17:23PM -0400, Alexander Viro wrote:
> > fsync_dev() is not needed for raw devices or swap.  It _is_ needed for
> > file access.
> 
> then what's the difference between raw devices and swap.

Keep in mind that at some point we may want to add exclusions - e.g.
"swapon() messing with block_size when accidentially called for mounted
fs" problem can be easily solved that way - that's probably the
simplest way to deal with it.  Ditto for RAID vs. filesystem - current
code for that is ugly and not too reliable.
 
> And there's reason we should we avoid the fsync_dev with the raw devices
> and swap.

Umm... Not doing unnecessary work?  Semantics of releasing a block device
depends on the kind of use.  BTW, I'm less than sure that fsync_dev() is
the right thing for file access now that you've got that in pagecache -
__block_fsync() seems to be more correct thing to do.

> I just found it an useless complication.

<shrug> in the eye of beholder...

/me goes to get some sleep.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 23:30                     ` Alexander Viro
@ 2001-09-19 23:40                       ` Andrea Arcangeli
  2001-09-20 13:56                         ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:40 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 19, 2001 at 07:30:55PM -0400, Alexander Viro wrote:
> "swapon() messing with block_size when accidentially called for mounted

swap should never change softblocksize etc.. the concept of
softblocksize will die as soon as we make the buffercache - physically
address space backed.

> Umm... Not doing unnecessary work?  Semantics of releasing a block device
> depends on the kind of use.  BTW, I'm less than sure that fsync_dev() is
> the right thing for file access now that you've got that in pagecache -
> __block_fsync() seems to be more correct thing to do.

Not really, blkdev isn't a filesystem. It will never have a superblock
and its own inodes and we also need to filemap_fdatasync/wait the
physical address space.

> /me goes to get some sleep.

night.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 19:21             ` Alexander Viro
  2001-09-19 20:55               ` Andrea Arcangeli
@ 2001-09-20  2:34               ` Andrea Arcangeli
  2001-09-20 10:52                 ` Alexander Viro
  1 sibling, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20  2:34 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote:
> int fd = open("/dev/ram0", O_RDWR);
> ioctl(fd, BLKFLSBUF);
> ioctl(fd, BLKFLSBUF);

here it is the fix below.

actually it also notably fixes a subtle security problem with the
ramdisk. The fact is that we are asked to read/write from the same
pagecache page that also was the destination of the I/O (ramdisk is
zerocopy), so we actually need a new bitflag to know if this a page
secure, this way we know if we need to bother hiding its contents or
not, if it's a secure page we don't need to do anything. It is a zero
cost bitflag for the nonramdisk case so it shouldn't be a problem and I
cannot imagine a better fix.

Now the only bit left in the ramdisk is a race condition in rmmod, I can
now see why you were converting rd_inode to rd_bdev :).

--- 2.4.10pre11aa1/include/linux/mm.h.~1~	Thu Sep 20 01:20:38 2001
+++ 2.4.10pre11aa1/include/linux/mm.h	Thu Sep 20 03:57:25 2001
@@ -282,6 +282,7 @@
 #define PG_checked		12	/* kill me in 2.5.<early>. */
 #define PG_arch_1		13
 #define PG_reserved		14
+#define PG_secure		15
 
 /* Make it prettier to test the above... */
 #define Page_Uptodate(page)	test_bit(PG_uptodate, &(page)->flags)
@@ -295,6 +296,9 @@
 #define TryLockPage(page)	test_and_set_bit(PG_locked, &(page)->flags)
 #define PageChecked(page)	test_bit(PG_checked, &(page)->flags)
 #define SetPageChecked(page)	set_bit(PG_checked, &(page)->flags)
+
+#define PageSecure(page)	test_bit(PG_secure, &(page)->flags)
+#define SetPageSecure(page)	set_bit(PG_secure, &(page)->flags)
 
 extern void __set_page_dirty(struct page *);
 
--- 2.4.10pre11aa1/mm/filemap.c.~1~	Tue Sep 18 15:39:51 2001
+++ 2.4.10pre11aa1/mm/filemap.c	Thu Sep 20 03:59:00 2001
@@ -621,7 +621,7 @@
 	if (PageLocked(page))
 		BUG();
 
-	flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked);
+	flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked | 1 << PG_secure);
 	page->flags = flags | (1 << PG_locked);
 	page_cache_get(page);
 	page->index = offset;
--- 2.4.10pre11aa1/mm/shmem.c.~1~	Tue Sep 18 02:43:04 2001
+++ 2.4.10pre11aa1/mm/shmem.c	Thu Sep 20 03:59:09 2001
@@ -353,7 +353,7 @@
 		swap_free(*entry);
 		*entry = (swp_entry_t) {0};
 		delete_from_swap_cache_nolock(page);
-		flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_referenced | 1 << PG_arch_1);
+		flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_secure);
 		page->flags = flags | (1 << PG_dirty);
 		add_to_page_cache_locked(page, mapping, idx);
 		info->swapped--;
--- 2.4.10pre11aa1/mm/swap_state.c.~1~	Tue Sep 18 02:43:04 2001
+++ 2.4.10pre11aa1/mm/swap_state.c	Thu Sep 20 03:59:20 2001
@@ -70,7 +70,7 @@
 		BUG();
 
 	/* clear PG_dirty so a subsequent set_page_dirty takes effect */
-	flags = page->flags & ~(1 << PG_error | 1 << PG_dirty | 1 << PG_arch_1 | 1 << PG_referenced);
+	flags = page->flags & ~(1 << PG_error | 1 << PG_dirty | 1 << PG_arch_1 | 1 << PG_referenced | 1 << PG_secure);
 	page->flags = flags | (1 << PG_uptodate);
 	add_to_page_cache_locked(page, &swapper_space, entry.val);
 }
--- 2.4.10pre11/drivers/block/rd.c	Tue Sep 18 02:42:23 2001
+++ 2.4.10pre11aa1/drivers/block/rd.c	Thu Sep 20 04:22:44 2001
@@ -188,11 +188,23 @@
 
 static int rd_blkdev_pagecache_IO(int rw, struct buffer_head * sbh, int minor)
 {
-	struct address_space * mapping = rd_inode[minor]->i_mapping;
+	struct address_space * mapping;
 	unsigned long index;
-	int offset, size, err = 0;
+	int offset, size, err;
 
-	if (sbh->b_page->mapping == mapping) {
+	err = -EIO;
+	/*
+	 * If we did BLKFLSBUF just before doing read/write,
+	 * we could see a rd_inode before the opener had a chance
+	 * to return from rd_open(), that's ok, as soon as we
+	 * see the inode we can use its i_mapping, and the inode
+	 * cannot go away from under us.
+	 */
+	if (!rd_inode[minor])
+		goto out;
+	err = 0;
+	mapping = rd_inode[minor]->i_mapping;
+	if (sbh->b_page->mapping == mapping && PageSecure(sbh->b_page)) {
 		if (rw != READ)
 			SetPageDirty(sbh->b_page);
 		goto out;
@@ -217,7 +229,7 @@
 
 		hash = page_hash(mapping, index);
 		page = __find_get_page(mapping, index, hash);
-		if (!page && rw != READ) {
+		if (!page) {
 			page = grab_cache_page(mapping, index);
 			err = -ENOMEM;
 			if (!page)
@@ -227,18 +239,40 @@
 		}
 
 		index++;
-		if (!page) {
-			offset = 0;
-			continue;
-		}
 
 		if (rw == READ) {
 			src = kmap(page);
+
+			if (!PageSecure(page)) {
+				clear_page(src);
+				SetPageSecure(page);
+			}
+
 			src += offset;
 			dst = bh_kmap(sbh);
 		} else {
 			dst = kmap(page);
 			dst += offset;
+
+			if (!PageSecure(page)) {
+				unsigned long addr, start, end;
+
+				/* clear around */
+				addr = (unsigned long) dst;
+				if (addr & ~PAGE_CACHE_MASK) {
+					start = addr & PAGE_CACHE_MASK;
+					end = addr;
+					memset((char *) start, 0, end - start);
+				}
+				addr = (unsigned long) dst + count;
+				if (addr & ~PAGE_CACHE_MASK) {
+					start = addr;
+					end = (addr + PAGE_CACHE_SIZE - 1) & PAGE_CACHE_MASK;
+					memset((char *) start, 0, end - start);
+				}
+				SetPageSecure(page);
+			}
+
 			src = bh_kmap(sbh);
 		}
 		offset = 0;
@@ -321,6 +355,13 @@
 				struct block_device * bdev = inode->i_bdev;
 
 				down(&bdev->bd_sem);
+				/*
+				 * We're the only users here, protected by the
+				 * bd_sem, but first check if we didn't just
+				 * flushed the ramdisk.
+				 */
+				if (!rd_inode[minor])
+					goto unlock;
 				if (bdev->bd_openers > 2) {
 					up(&bdev->bd_sem);
 					return -EBUSY;
@@ -328,8 +369,16 @@
 				bdev->bd_openers--;
 				bdev->bd_cache_openers--;
 				iput(rd_inode[minor]);
+				/*
+				 * Make sure the cache is flushed from here
+				 * and not from close(2), somebody
+				 * could reopen the device before we have a
+				 * chance to close it ourself.
+				 */
+				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
 				rd_inode[minor] = NULL;
 				rd_blocksizes[minor] = rd_blocksize;
+			unlock:
 				up(&bdev->bd_sem);
 			}
 			break;



Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20  2:34               ` Andrea Arcangeli
@ 2001-09-20 10:52                 ` Alexander Viro
  2001-09-20 18:18                   ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 10:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Andrea Arcangeli wrote:

> On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote:
> > int fd = open("/dev/ram0", O_RDWR);
> > ioctl(fd, BLKFLSBUF);
> > ioctl(fd, BLKFLSBUF);
> 
> here it is the fix below.
[snip]
> @@ -328,8 +369,16 @@
>  				bdev->bd_openers--;
>  				bdev->bd_cache_openers--;
>  				iput(rd_inode[minor]);
> +				/*
> +				 * Make sure the cache is flushed from here
> +				 * and not from close(2), somebody
> +				 * could reopen the device before we have a
> +				 * chance to close it ourself.
> +				 */
> +				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
>  				rd_inode[minor] = NULL;
>  				rd_blocksizes[minor] = rd_blocksize;
> +			unlock:
>  				up(&bdev->bd_sem);

Now think what happens if you go through that code twice.  What argument will
be passed to iput() the second time you call it?


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-19 23:40                       ` Andrea Arcangeli
@ 2001-09-20 13:56                         ` Alexander Viro
  2001-09-20 14:38                           ` Chris Mason
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 13:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Andrea Arcangeli wrote:

> > Umm... Not doing unnecessary work?  Semantics of releasing a block device
> > depends on the kind of use.  BTW, I'm less than sure that fsync_dev() is
> > the right thing for file access now that you've got that in pagecache -
> > __block_fsync() seems to be more correct thing to do.
> 
> Not really, blkdev isn't a filesystem. It will never have a superblock
> and its own inodes and we also need to filemap_fdatasync/wait the
> physical address space.

Had you actually read the fsync_dev()?  Let me make it clear: you are
flushing _buffer_ cache upon blkdev_put(bdev, BDEV_FILE).  It was
the right thing when file access went through buffer cache.  It's
blatantly wrong with page cache.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 13:56                         ` Alexander Viro
@ 2001-09-20 14:38                           ` Chris Mason
  2001-09-20 14:50                             ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Chris Mason @ 2001-09-20 14:38 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thursday, September 20, 2001 09:56:22 AM -0400 Alexander Viro <viro@math.psu.edu> wrote:

> Had you actually read the fsync_dev()?  Let me make it clear: you are
> flushing _buffer_ cache upon blkdev_put(bdev, BDEV_FILE).  It was
> the right thing when file access went through buffer cache.  It's
> blatantly wrong with page cache.

Well, fsync_dev will flush anything on the dirty list.  Since 
blkdev_commit_write puts things on the dirty list, andrea's current 
code will flush changes through the page cache.

The biggest exception is blkdev_writepage directly submits the io instead
of marking the buffers dirty.  This means the buffers won't be on
the locked/dirty list, and they won't get waited on.  Similar problem
for direct io.

The fix shouldn't be too bad, I'll code it up...

-chris


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 14:38                           ` Chris Mason
@ 2001-09-20 14:50                             ` Alexander Viro
  2001-09-20 15:44                               ` Chris Mason
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 14:50 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Chris Mason wrote:

> 
> 
> On Thursday, September 20, 2001 09:56:22 AM -0400 Alexander Viro <viro@math.psu.edu> wrote:
> 
> > Had you actually read the fsync_dev()?  Let me make it clear: you are
> > flushing _buffer_ cache upon blkdev_put(bdev, BDEV_FILE).  It was
> > the right thing when file access went through buffer cache.  It's
> > blatantly wrong with page cache.
> 
> Well, fsync_dev will flush anything on the dirty list.  Since 
> blkdev_commit_write puts things on the dirty list, andrea's current 
> code will flush changes through the page cache.
> 
> The biggest exception is blkdev_writepage directly submits the io instead
> of marking the buffers dirty.  This means the buffers won't be on
> the locked/dirty list, and they won't get waited on.  Similar problem
> for direct io.

<nod>  And if you add Andrea's (perfectly valid) observation re having no
need to sync any fs structures we might have for that device, you get
__block_fsync().  After that it's easy to merge blkdev_close() code into
blkdev_put().


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 14:50                             ` Alexander Viro
@ 2001-09-20 15:44                               ` Chris Mason
  2001-09-20 16:43                                 ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Chris Mason @ 2001-09-20 15:44 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List



On Thursday, September 20, 2001 10:50:16 AM -0400 Alexander Viro
<viro@math.psu.edu> wrote:

>> The biggest exception is blkdev_writepage directly submits the io instead
>> of marking the buffers dirty.  This means the buffers won't be on
>> the locked/dirty list, and they won't get waited on.  Similar problem
>> for direct io.
> 
> <nod>  And if you add Andrea's (perfectly valid) observation re having no
> need to sync any fs structures we might have for that device, you get
> __block_fsync().  After that it's easy to merge blkdev_close() code into
> blkdev_put().
> 
>

Ok, __block_fsync is much better than just fsync_dev.

Are there other parts of blkdev_close you want merged into 
blkdev_put? Without changing the reread blocks on last close 
semantics, I think this is all we can do.

As far as I can tell, bdev->bd_inode is valid to send 
to __block_fsync, am I missing something?

--- linux/fs/block_dev.c	Mon Sep 17 11:28:56 2001
+++ linux/fs/block_dev.c	Thu Sep 20 11:21:39 2001
@@ -704,10 +704,9 @@
 	kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */
 	down(&bdev->bd_sem);
 	lock_kernel();
-	if (kind == BDEV_FILE)
-		fsync_dev(rdev);
-	else if (kind == BDEV_FS)
-		fsync_no_super(rdev);
+
+	__block_fsync(bdev->bd_inode) ;
+
 	/* only filesystems uses buffer cache for the metadata these days */
 	if (kind == BDEV_FS)
 		invalidate_buffers(rdev);


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 15:44                               ` Chris Mason
@ 2001-09-20 16:43                                 ` Alexander Viro
  2001-09-20 20:54                                   ` [PATCH] fs/block_dev.c cleanup Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 16:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Chris Mason wrote:

> > <nod>  And if you add Andrea's (perfectly valid) observation re having no
> > need to sync any fs structures we might have for that device, you get
> > __block_fsync().  After that it's easy to merge blkdev_close() code into
> > blkdev_put().
> > 
> >
> 
> Ok, __block_fsync is much better than just fsync_dev.
> 
> Are there other parts of blkdev_close you want merged into 
> blkdev_put? Without changing the reread blocks on last close 
> semantics, I think this is all we can do.
> 
> As far as I can tell, bdev->bd_inode is valid to send 
> to __block_fsync, am I missing something?

Eventually that will be the right thing, but only after we allocate
bd_inode upon blkdev_get()/blkdev_open() instead of trying to cannibalize
the inode passed to blkdev_open().

I'm testing that chunk right now (it also kills all the fake_inode crap in
block_dev.c).

When we cut the lifetime of block_device down we'll be able to get it
even simpler - allocate and free ->bd_inode at the same time as block_device,
but that's several chunks later.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 10:52                 ` Alexander Viro
@ 2001-09-20 18:18                   ` Andrea Arcangeli
  2001-09-20 18:33                     ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 18:18 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 06:52:07AM -0400, Alexander Viro wrote:
> 
> 
> On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> 
> > On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote:
> > > int fd = open("/dev/ram0", O_RDWR);
> > > ioctl(fd, BLKFLSBUF);
> > > ioctl(fd, BLKFLSBUF);
> > 
> > here it is the fix below.
> [snip]
> > @@ -328,8 +369,16 @@
> >  				bdev->bd_openers--;
> >  				bdev->bd_cache_openers--;
> >  				iput(rd_inode[minor]);
> > +				/*
> > +				 * Make sure the cache is flushed from here
> > +				 * and not from close(2), somebody
> > +				 * could reopen the device before we have a
> > +				 * chance to close it ourself.
> > +				 */
> > +				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
> >  				rd_inode[minor] = NULL;
> >  				rd_blocksizes[minor] = rd_blocksize;
> > +			unlock:
> >  				up(&bdev->bd_sem);
> 
> Now think what happens if you go through that code twice.  What argument will
> be passed to iput() the second time you call it?

the second time we won't go through that code.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 18:18                   ` Andrea Arcangeli
@ 2001-09-20 18:33                     ` Alexander Viro
  2001-09-20 18:59                       ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 18:33 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Andrea Arcangeli wrote:

> > > +				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
> > >  				rd_inode[minor] = NULL;
> > >  				rd_blocksizes[minor] = rd_blocksize;
> > > +			unlock:
> > >  				up(&bdev->bd_sem);
> > 
> > Now think what happens if you go through that code twice.  What argument will
> > be passed to iput() the second time you call it?
> 
> the second time we won't go through that code.

IOW, subsequent calls of ioctl(fd, BLKFLSBUF) will not work.  Which is
better than oopsing, but doesn't look right.

Question: why the hell do we bother with iput() and decrementing counters
at all?



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 18:33                     ` Alexander Viro
@ 2001-09-20 18:59                       ` Andrea Arcangeli
  2001-09-20 20:41                         ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 18:59 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 02:33:34PM -0400, Alexander Viro wrote:
> 
> 
> On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> 
> > > > +				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
> > > >  				rd_inode[minor] = NULL;
> > > >  				rd_blocksizes[minor] = rd_blocksize;
> > > > +			unlock:
> > > >  				up(&bdev->bd_sem);
> > > 
> > > Now think what happens if you go through that code twice.  What argument will
> > > be passed to iput() the second time you call it?
> > 
> > the second time we won't go through that code.
> 
> IOW, subsequent calls of ioctl(fd, BLKFLSBUF) will not work.  Which is
> better than oopsing, but doesn't look right.

The second call will do the work if there's something to do. If somebody
did an open/read/writes/close in the middle, it should do the work as
usual. I actually can see a problem if you use the different inodes, but
that will be sorted out too automatically as soon as we stop pinning the
inodes.

> Question: why the hell do we bother with iput() and decrementing counters
> at all?

Yes, that part should be rewritten using the bdev as you did recently,
the ipinning of the ramdisk also in the previous 2.4 kernels (before
your recent patch) had the same problem of my ipinning in the blkdev
highlevel (it looked only a cleanup but it wasn't).

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 18:59                       ` Andrea Arcangeli
@ 2001-09-20 20:41                         ` Alexander Viro
  2001-09-20 21:18                           ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 20:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Andrea Arcangeli wrote:

> The second call will do the work if there's something to do. If somebody
> did an open/read/writes/close in the middle, it should do the work as
> usual. I actually can see a problem if you use the different inodes, but
> that will be sorted out too automatically as soon as we stop pinning the
> inodes.


Sigh... Try BLKFLSBUF + write() + BLKFLSBUF.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH] fs/block_dev.c cleanup
  2001-09-20 16:43                                 ` Alexander Viro
@ 2001-09-20 20:54                                   ` Alexander Viro
  0 siblings, 0 replies; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 20:54 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Alexander Viro wrote:

> On Thu, 20 Sep 2001, Chris Mason wrote:
> 
> > > <nod>  And if you add Andrea's (perfectly valid) observation re having no
> > > need to sync any fs structures we might have for that device, you get
> > > __block_fsync().  After that it's easy to merge blkdev_close() code into
> > > blkdev_put().
> > > 
> > >
> > 
> > Ok, __block_fsync is much better than just fsync_dev.
> > 
> > Are there other parts of blkdev_close you want merged into 
> > blkdev_put? Without changing the reread blocks on last close 
> > semantics, I think this is all we can do.
> > 
> > As far as I can tell, bdev->bd_inode is valid to send 
> > to __block_fsync, am I missing something?
> 
> Eventually that will be the right thing, but only after we allocate
> bd_inode upon blkdev_get()/blkdev_open() instead of trying to cannibalize
> the inode passed to blkdev_open().
> 
> I'm testing that chunk right now (it also kills all the fake_inode crap in
> block_dev.c).

OK, it seems to be working here.  It doesn't fix anything in rd.c - all
it does is crapectomy in block_dev.c.   Pseudo-fs added, inodes are
allocated there upon blkdev_open() and blkdev_get(), ->bd_inode is
always from that pseudo-fs and never equal to inode passed into blkdev_open().

Crap with fake inodes is gone - we simply use ->bd_inode.  ->a_ops for
block device inodes is not set at all - neither in devices.c nor in devfs.
We set it for inodes on pseudo-fs and setting ->i_mapping upon open()
does the right thing.

All mess with keeping the first inode alive is gone - no need to do that
anymore.

diff -urN S10-pre12/drivers/block/rd.c linux/drivers/block/rd.c
--- S10-pre12/drivers/block/rd.c	Thu Sep 20 15:27:18 2001
+++ linux/drivers/block/rd.c	Thu Sep 20 15:26:09 2001
@@ -420,7 +420,6 @@
 			/* bdev->bd_sem is held by caller */
 			bdev->bd_openers++;
 			bdev->bd_cache_openers++;
-			bdev->bd_inode = inode;
 		}
 	}
 
diff -urN S10-pre12/fs/block_dev.c linux/fs/block_dev.c
--- S10-pre12/fs/block_dev.c	Thu Sep 20 15:27:24 2001
+++ linux/fs/block_dev.c	Thu Sep 20 15:40:04 2001
@@ -18,6 +18,7 @@
 #include <linux/iobuf.h>
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
+#include <linux/module.h>
 
 #include <asm/uaccess.h>
 
@@ -365,6 +366,53 @@
 }
 
 /*
+ * pseudo-fs
+ */
+
+static struct super_block *bd_read_super(struct super_block *sb, void *data, int silent)
+{
+	static struct super_operations sops = {};
+	struct inode *root = new_inode(sb);
+	if (!root)
+		return NULL;
+	root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+	root->i_uid = root->i_gid = 0;
+	root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+	sb->s_blocksize = 1024;
+	sb->s_blocksize_bits = 10;
+	sb->s_magic = 0x62646576;
+	sb->s_op = &sops;
+	sb->s_root = d_alloc(NULL, &(const struct qstr) { "bdev:", 5, 0 });
+	if (!sb->s_root) {
+		iput(root);
+		return NULL;
+	}
+	sb->s_root->d_sb = sb;
+	sb->s_root->d_parent = sb->s_root;
+	d_instantiate(sb->s_root, root);
+	return sb;
+}
+
+static DECLARE_FSTYPE(bd_type, "bdev", bd_read_super, FS_NOMOUNT);
+
+static struct vfsmount *bd_mnt;
+
+static int get_inode(struct block_device *bdev)
+{
+	if (!bdev->bd_inode) {
+		struct inode *inode = new_inode(bd_mnt->mnt_sb);
+		if (!inode)
+			return -ENOMEM;
+		inode->i_rdev = to_kdev_t(bdev->bd_dev);
+		atomic_inc(&bdev->bd_count);	/* will go away */
+		inode->i_bdev = bdev;
+		inode->i_data.a_ops = &def_blk_aops;
+		bdev->bd_inode = inode;
+	}
+	return 0;
+}
+
+/*
  * bdev cache handling - shamelessly stolen from inode.c
  * We use smaller hashtable, though.
  */
@@ -394,7 +442,7 @@
 
 void __init bdev_cache_init(void)
 {
-	int i;
+	int i, err;
 	struct list_head *head = bdev_hashtable;
 
 	i = HASH_SIZE;
@@ -410,6 +458,13 @@
 					 NULL);
 	if (!bdev_cachep)
 		panic("Cannot create bdev_cache SLAB cache");
+	err = register_filesystem(&bd_type);
+	if (err)
+		panic("Cannot register bdev pseudo-fs");
+	bd_mnt = kern_mount(&bd_type);
+	err = PTR_ERR(bd_mnt);
+	if (IS_ERR(bd_mnt))
+		panic("Cannot create bdev pseudo-fs");
 }
 
 /*
@@ -598,18 +653,13 @@
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
 {
-	struct inode inode_fake;
 	int res;
 	mm_segment_t old_fs = get_fs();
 
 	if (!bdev->bd_op->ioctl)
 		return -EINVAL;
-	memset(&inode_fake, 0, sizeof(inode_fake));
-	inode_fake.i_rdev = to_kdev_t(bdev->bd_dev);
-	inode_fake.i_bdev = bdev;
-	init_waitqueue_head(&inode_fake.i_wait);
 	set_fs(KERNEL_DS);
-	res = bdev->bd_op->ioctl(&inode_fake, NULL, cmd, arg);
+	res = bdev->bd_op->ioctl(bdev->bd_inode, NULL, cmd, arg);
 	set_fs(old_fs);
 	return res;
 }
@@ -619,6 +669,12 @@
 	int ret = -ENODEV;
 	kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */
 	down(&bdev->bd_sem);
+
+	if (get_inode(bdev)) {
+		up(&bdev->bd_sem);
+		return -ENOMEM;
+	}
+
 	lock_kernel();
 	if (!bdev->bd_op)
 		bdev->bd_op = get_blkfops(MAJOR(rdev));
@@ -631,23 +687,22 @@
 		 */
 		struct file fake_file = {};
 		struct dentry fake_dentry = {};
-		struct inode *fake_inode = get_empty_inode();
 		ret = -ENOMEM;
-		if (fake_inode) {
-			fake_file.f_mode = mode;
-			fake_file.f_flags = flags;
-			fake_file.f_dentry = &fake_dentry;
-			fake_dentry.d_inode = fake_inode;
-			fake_inode->i_rdev = rdev;
-			ret = 0;
-			if (bdev->bd_op->open)
-				ret = bdev->bd_op->open(fake_inode, &fake_file);
-			if (!ret) {
-				bdev->bd_openers++;
-				atomic_inc(&bdev->bd_count);
-			} else if (!bdev->bd_openers)
-				bdev->bd_op = NULL;
-			iput(fake_inode);
+		fake_file.f_mode = mode;
+		fake_file.f_flags = flags;
+		fake_file.f_dentry = &fake_dentry;
+		fake_dentry.d_inode = bdev->bd_inode;
+		ret = 0;
+		if (bdev->bd_op->open)
+			ret = bdev->bd_op->open(bdev->bd_inode, &fake_file);
+		if (!ret) {
+			bdev->bd_openers++;
+			atomic_inc(&bdev->bd_count);
+		} else if (!bdev->bd_openers) {
+			struct inode *bd_inode = bdev->bd_inode;
+			bdev->bd_op = NULL;
+			bdev->bd_inode = NULL;
+			iput(bd_inode);
 		}
 	}
 	unlock_kernel();
@@ -669,6 +724,12 @@
 	filp->f_flags |= O_LARGEFILE;
 
 	down(&bdev->bd_sem);
+
+	if (get_inode(bdev)) {
+		up(&bdev->bd_sem);
+		return -ENOMEM;
+	}
+
 	lock_kernel();
 	if (!bdev->bd_op)
 		bdev->bd_op = get_blkfops(MAJOR(inode->i_rdev));
@@ -678,20 +739,15 @@
 			ret = bdev->bd_op->open(inode,filp);
 		if (!ret) {
 			bdev->bd_openers++;
-			if (!bdev->bd_cache_openers && bdev->bd_inode)
-				BUG();
-			if (bdev->bd_cache_openers && !bdev->bd_inode)
-				BUG();
-			if (!bdev->bd_cache_openers++)
-				bdev->bd_inode = inode;
-			else {
-				if (bdev->bd_inode != inode && !inode->i_mapping_overload++) {
-					inode->i_mapping = bdev->bd_inode->i_mapping;
-					atomic_inc(&bdev->bd_inode->i_count);
-				}
-			}
-		} else if (!bdev->bd_openers)
+			bdev->bd_cache_openers++;
+			inode->i_mapping = bdev->bd_inode->i_mapping;
+			inode->i_mapping_overload++;
+		} else if (!bdev->bd_openers) {
+			struct inode *bd_inode = bdev->bd_inode;
 			bdev->bd_op = NULL;
+			bdev->bd_inode = NULL;
+			iput(bd_inode);
+		}
 	}	
 	unlock_kernel();
 	up(&bdev->bd_sem);
@@ -702,28 +758,24 @@
 {
 	int ret = 0;
 	kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */
+	struct inode *bd_inode = bdev->bd_inode;
+
 	down(&bdev->bd_sem);
 	lock_kernel();
 	if (kind == BDEV_FILE)
-		fsync_dev(rdev);
+		__block_fsync(bd_inode);
 	else if (kind == BDEV_FS)
 		fsync_no_super(rdev);
 	/* only filesystems uses buffer cache for the metadata these days */
 	if (kind == BDEV_FS)
 		invalidate_buffers(rdev);
-	if (bdev->bd_op->release) {
-		struct inode * fake_inode = get_empty_inode();
-		ret = -ENOMEM;
-		if (fake_inode) {
-			fake_inode->i_rdev = rdev;
-			ret = bdev->bd_op->release(fake_inode, NULL);
-			iput(fake_inode);
-		} else
-			printk(KERN_WARNING "blkdev_put: ->release couldn't be run due -ENOMEM\n");
-	}
-	if (!--bdev->bd_openers)
-		bdev->bd_op = NULL;	/* we can't rely on driver being */
-					/* kind to stay around. */
+	if (bdev->bd_op->release)
+		ret = bdev->bd_op->release(bd_inode, NULL);
+	if (!--bdev->bd_openers) {
+		bdev->bd_op = NULL;
+		bdev->bd_inode = NULL;
+		iput(bd_inode);
+	}
 	unlock_kernel();
 	up(&bdev->bd_sem);
 	bdput(bdev);
@@ -736,8 +788,6 @@
 	int ret = 0;
 	struct inode * bd_inode = bdev->bd_inode;
 
-	if (bd_inode->i_mapping != inode->i_mapping)
-		BUG();
 	down(&bdev->bd_sem);
 	lock_kernel();
 	/* cache coherency protocol */
@@ -745,11 +795,9 @@
 		struct super_block * sb;
 
 		/* flush the pagecache to disk */
-		__block_fsync(inode);
+		__block_fsync(bd_inode);
 		/* drop the pagecache, uptodate info is on disk by now */
 		truncate_inode_pages(inode->i_mapping, 0);
-		/* forget the bdev pagecache address space */
-		bdev->bd_inode = NULL;
 
 		/* if the fs was mounted ro just throw away most of its caches */
 		sb = get_super(inode->i_rdev);
@@ -782,16 +830,17 @@
 			drop_super(sb);
 		}
 	}
-	if (inode != bd_inode && !--inode->i_mapping_overload) {
+	if (!--inode->i_mapping_overload)
 		inode->i_mapping = &inode->i_data;
-		iput(bd_inode);
-	}
 
 	/* release the device driver */
 	if (bdev->bd_op->release)
 		ret = bdev->bd_op->release(inode, NULL);
-	if (!--bdev->bd_openers)
+	if (!--bdev->bd_openers) {
 		bdev->bd_op = NULL;
+		bdev->bd_inode = NULL;
+		iput(bd_inode);
+	}
 	unlock_kernel();
 	up(&bdev->bd_sem);
 
diff -urN S10-pre12/fs/devices.c linux/fs/devices.c
--- S10-pre12/fs/devices.c	Thu Sep 20 15:27:24 2001
+++ linux/fs/devices.c	Thu Sep 20 15:26:09 2001
@@ -206,7 +206,6 @@
 		inode->i_cdev = cdget(rdev);
 	} else if (S_ISBLK(mode)) {
 		inode->i_fop = &def_blk_fops;
-		inode->i_mapping->a_ops = &def_blk_aops;
 		inode->i_rdev = to_kdev_t(rdev);
 		inode->i_bdev = bdget(rdev);
 	} else if (S_ISFIFO(mode))


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 20:41                         ` Alexander Viro
@ 2001-09-20 21:18                           ` Andrea Arcangeli
  2001-09-20 21:40                             ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 21:18 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 04:41:54PM -0400, Alexander Viro wrote:
> 
> 
> On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> 
> > The second call will do the work if there's something to do. If somebody
> > did an open/read/writes/close in the middle, it should do the work as
> > usual. I actually can see a problem if you use the different inodes, but
> > that will be sorted out too automatically as soon as we stop pinning the
> > inodes.
> 
> 
> Sigh... Try BLKFLSBUF + write() + BLKFLSBUF.

write will return -EIO and second BLKFLSBUF will do nothing.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 21:18                           ` Andrea Arcangeli
@ 2001-09-20 21:40                             ` Alexander Viro
  2001-09-20 22:13                               ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 21:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Thu, 20 Sep 2001, Andrea Arcangeli wrote:

> > Sigh... Try BLKFLSBUF + write() + BLKFLSBUF.
> 
> write will return -EIO and second BLKFLSBUF will do nothing.

Now compare that with behaviour of -pre10 (not to mention the
(in)sanity of "this ioctl() will make all IO on the fd fail until
somebody opens the same file" semantics).


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 21:40                             ` Alexander Viro
@ 2001-09-20 22:13                               ` Andrea Arcangeli
  2001-09-20 22:20                                 ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 22:13 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 05:40:00PM -0400, Alexander Viro wrote:
> 
> 
> On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> 
> > > Sigh... Try BLKFLSBUF + write() + BLKFLSBUF.
> > 
> > write will return -EIO and second BLKFLSBUF will do nothing.
> 
> Now compare that with behaviour of -pre10 (not to mention the
> (in)sanity of "this ioctl() will make all IO on the fd fail until
> somebody opens the same file" semantics).

It's not that insane: the address space is allocated at open time.
After you drop it with BLKFLSBUF you will have to open the device again
to reallocate a new address space. I could just truncate the physical
address space, there are no other users, but then the inode would remain
pinned forever, and so until we include your ipinning fix this looked an
acceptable two liner band-aid I guess (again, real fix is yours, all I'm
saying is that it can't oops any longer ;).

thanks,
Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 22:13                               ` Andrea Arcangeli
@ 2001-09-20 22:20                                 ` Alexander Viro
  2001-09-20 22:31                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 22:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Fri, 21 Sep 2001, Andrea Arcangeli wrote:

> It's not that insane: the address space is allocated at open time.
> After you drop it with BLKFLSBUF you will have to open the device again
> to reallocate a new address space. I could just truncate the physical
> address space, there are no other users, but then the inode would remain
> pinned forever, and so until we include your ipinning fix this looked an
> acceptable two liner band-aid I guess (again, real fix is yours, all I'm
> saying is that it can't oops any longer ;).

OK.  Could you resend your patch (or just the page initialization parts
of it)?  I'm getting to the point where I'll start seriously touching
rd.c, so...


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 22:20                                 ` Alexander Viro
@ 2001-09-20 22:31                                   ` Andrea Arcangeli
  2001-09-20 22:44                                     ` Alexander Viro
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 22:31 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 06:20:39PM -0400, Alexander Viro wrote:
> OK.  Could you resend your patch (or just the page initialization parts
> of it)?  I'm getting to the point where I'll start seriously touching
> rd.c, so...

Of course, however if you want I can first fix initrd (I was just
looking into it in the last minutes), the security fix broke initrd
badly unfortunately [didn't tested initrd but just the ramdisks before
posting it] (not sure why initrd broke at the moment but I believe the
design of the fix was the right one, so it is probably an implementation
detail). So unless you need it urgently I will try to fix initrd first,
then I will send to you so you can go ahead without risk of future rejects.

thanks!

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 22:31                                   ` Andrea Arcangeli
@ 2001-09-20 22:44                                     ` Alexander Viro
  2001-09-20 23:03                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 22:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Fri, 21 Sep 2001, Andrea Arcangeli wrote:

> Of course, however if you want I can first fix initrd (I was just
> looking into it in the last minutes), the security fix broke initrd
> badly unfortunately [didn't tested initrd but just the ramdisks before
> posting it] (not sure why initrd broke at the moment but I believe the
> design of the fix was the right one, so it is probably an implementation
> detail). So unless you need it urgently I will try to fix initrd first,
> then I will send to you so you can go ahead without risk of future rejects.

I'd suggest to stop treating initrd as block device.  It has to keep
S_IFBLK in mode bits, indeed, but I'd rather set ->f_op upon open() and
stop worrying about it.

We don't need buffer-cache access to it anyway - if you look carefully
you will see that we actually copy its contents to normal ramdisk
first.  So just having ->read() (essentially copy_to_user()) is
perfectly OK. 

Check how old code was dealing with it - it's really the best way to
treat that sucker.  We probably will be better off if we make it a
character device at some point in 2.5 and move the thing to char/mem.c,
but anyway, it had never been a proper block device (== one with
requests queue, etc.) and there's no point in making it such now.

Again, the proper way to treat it is to convert it into character
device at some point.  Userland won't actually care - after the
boot it's gone.  And kernel is using it only via ->read(), so
there's no point trying to make it similar to real ramdisks.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 22:44                                     ` Alexander Viro
@ 2001-09-20 23:03                                       ` Andrea Arcangeli
  2001-09-20 23:11                                         ` Alexander Viro
  2001-09-21  3:47                                         ` Andrea Arcangeli
  0 siblings, 2 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 23:03 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 06:44:18PM -0400, Alexander Viro wrote:
> 
> 
> On Fri, 21 Sep 2001, Andrea Arcangeli wrote:
> 
> > Of course, however if you want I can first fix initrd (I was just
> > looking into it in the last minutes), the security fix broke initrd
> > badly unfortunately [didn't tested initrd but just the ramdisks before
> > posting it] (not sure why initrd broke at the moment but I believe the
> > design of the fix was the right one, so it is probably an implementation
> > detail). So unless you need it urgently I will try to fix initrd first,
> > then I will send to you so you can go ahead without risk of future rejects.
> 
> I'd suggest to stop treating initrd as block device.  It has to keep
> S_IFBLK in mode bits, indeed, but I'd rather set ->f_op upon open() and
> stop worrying about it.
> 
> We don't need buffer-cache access to it anyway - if you look carefully
> you will see that we actually copy its contents to normal ramdisk
> first.  So just having ->read() (essentially copy_to_user()) is
> perfectly OK. 

of course, we never used initrd blkdev, we just use the ramdisk always
from userspace.  Initrd doesn't need to be a blockdev, we could just
copy the ram ourself and just call ->write on the ramdisk. userspace
just sees /dev/ram0, /dev/initrd is a kernel internal thing that doesn't
need to be visible, and if anybody uses /dev/initrd somehow that just
insane (and it doesn't even look possible to make anything useful out of
/dev/initrd anyways).

> Check how old code was dealing with it - it's really the best way to
> treat that sucker.  We probably will be better off if we make it a
> character device at some point in 2.5 and move the thing to char/mem.c,
> but anyway, it had never been a proper block device (== one with
> requests queue, etc.) and there's no point in making it such now.
> 
> Again, the proper way to treat it is to convert it into character
> device at some point.  Userland won't actually care - after the
> boot it's gone.  And kernel is using it only via ->read(), so
> there's no point trying to make it similar to real ramdisks.

1) I am tentated to fix the initrd bug by just killing initrd blkdev,
completly instead of going to mark PageSecure all the initrd pages.

2) Otherwise I can make a brute hack one liner approch with a global
field that ignores the PageSecure bit on reads until I finished to fill
in the /dev/ram0 :) (slowdown production paths for a boot thing but
doesn't matter, but ok we aren't writing proprietary software here and
this isn't the right thing ;)

3) I can just mark the initrd pages as PageSecure, then I can safely
forget about them.

What do you prefer for the next 2.4.10 mainline? I'd like to have this
fixed _fast_ and to be included in mainline, since the security problem
is ugly (not too bad though since /dev/ram0 is readable only by root by
default) and because people really needs initrd.

Suggestions are very welcome.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 23:03                                       ` Andrea Arcangeli
@ 2001-09-20 23:11                                         ` Alexander Viro
  2001-09-21  1:50                                           ` Alexander Viro
  2001-09-21  3:47                                         ` Andrea Arcangeli
  1 sibling, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-20 23:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Fri, 21 Sep 2001, Andrea Arcangeli wrote:

> 1) I am tentated to fix the initrd bug by just killing initrd blkdev,
> completly instead of going to mark PageSecure all the initrd pages.

Just put initrd_read(), initrd_release() and initrd_fops back and
switch file->f_op on rd_open() as it was in -pre10.  Maybe you
want to replace blkdev_put(inode->i_bdev, BDEV_FILE) in initrd_release()
with blkdev_close(inode, file), but that will be the same pretty soon.

It's really the simplest way to deal with that and it has a benefit
of being both straightforward and well-tested.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 23:11                                         ` Alexander Viro
@ 2001-09-21  1:50                                           ` Alexander Viro
  2001-09-21  2:42                                             ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-21  1:50 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List

	Andrea, what's the point of making blkdev_get() bump ->bd_count
in case of success and blkdev_put() - drop it?  We _do_ grab a reference
before calling blkdev_get() - any place where we don't is an immediately
oopsable hole both in the old an in the new tree.  Notice that you
do down(&bdev->bd_sem) before that increment of refcount, so if caller
doesn't hold a reference we are toast.  What's going on there?


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-21  1:50                                           ` Alexander Viro
@ 2001-09-21  2:42                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-21  2:42 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Thu, Sep 20, 2001 at 09:50:52PM -0400, Alexander Viro wrote:
> 	Andrea, what's the point of making blkdev_get() bump ->bd_count
> in case of success and blkdev_put() - drop it?  We _do_ grab a reference
> before calling blkdev_get() - any place where we don't is an immediately
> oopsable hole both in the old an in the new tree.  Notice that you
> do down(&bdev->bd_sem) before that increment of refcount, so if caller
> doesn't hold a reference we are toast.  What's going on there?

I just wanted to make sure the bdev couldn't be released under us by
owning a reference for the whole duration of the blkdev_get/put. But
requiring the caller to hold the reference for us seems saner since the
caller will have to pass the bdev as parameter anyways, so yes it seems
superflous.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-20 23:03                                       ` Andrea Arcangeli
  2001-09-20 23:11                                         ` Alexander Viro
@ 2001-09-21  3:47                                         ` Andrea Arcangeli
  2001-09-21  4:00                                           ` Alexander Viro
  2001-09-21  4:06                                           ` Andrea Arcangeli
  1 sibling, 2 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-21  3:47 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Fri, Sep 21, 2001 at 01:03:40AM +0200, Andrea Arcangeli wrote:
> What do you prefer for the next 2.4.10 mainline? I'd like to have this

Famous last words, after a few hours of debugging mixed with vm patches
and emails, I finally got around finding the real bug. The fact is that
the secure-ramdisk logic was totally broken, not just for initrd, oh 
well, so please don't apply such patch (code in mainline has the
security issue if you allow an luser to read from /dev/ram0, but it
isn't buggy). and the issue is quite unfixable with just a PageSecure
set inside rd.c.  The fact is that I cannot just clear-around the
written "bh", around there could be the source for the next block to
write and I cannot zero it out. It is getting harder to fix this one
just inside the ->make_request callback... Hints?

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-21  3:47                                         ` Andrea Arcangeli
@ 2001-09-21  4:00                                           ` Alexander Viro
  2001-09-21  4:06                                             ` Andrea Arcangeli
  2001-09-21  4:06                                           ` Andrea Arcangeli
  1 sibling, 1 reply; 110+ messages in thread
From: Alexander Viro @ 2001-09-21  4:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List



On Fri, 21 Sep 2001, Andrea Arcangeli wrote:

> Famous last words, after a few hours of debugging mixed with vm patches
> and emails, I finally got around finding the real bug. The fact is that
> the secure-ramdisk logic was totally broken, not just for initrd, oh 
> well, so please don't apply such patch (code in mainline has the
> security issue if you allow an luser to read from /dev/ram0, but it
> isn't buggy). and the issue is quite unfixable with just a PageSecure
> set inside rd.c.  The fact is that I cannot just clear-around the
> written "bh", around there could be the source for the next block to
> write and I cannot zero it out. It is getting harder to fix this one
> just inside the ->make_request callback... Hints?

Well, taking a file on ramfs and doing losetup on it should be equivalent
to ramdisk.  Turning relevant pieces into a driver shouldn't be too hard.
It won't be pretty, though - you'll probably want different
address_space_operations, so that read()/write() wouldn't bother with
requests at all.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-21  3:47                                         ` Andrea Arcangeli
  2001-09-21  4:00                                           ` Alexander Viro
@ 2001-09-21  4:06                                           ` Andrea Arcangeli
  2001-09-21  4:46                                             ` Andrea Arcangeli
  1 sibling, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-21  4:06 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Fri, Sep 21, 2001 at 05:47:49AM +0200, Andrea Arcangeli wrote:
> write and I cannot zero it out. It is getting harder to fix this one
> just inside the ->make_request callback... Hints?

I think the best fix is to have the ramdisk using the same aops of ramfs
and replace "Secure" with "Uptodate". We need to trap the security issue
at the higher layer and also this will avoid us having to map useless
bh, so it should be an improvement, only the filesystem will end
triggering the ->make_request callback of the ramdisk. Then if the fs
does I/O on stuff out of the physical address space we'll just clear it
out.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-21  4:00                                           ` Alexander Viro
@ 2001-09-21  4:06                                             ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-21  4:06 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Fri, Sep 21, 2001 at 12:00:35AM -0400, Alexander Viro wrote:
> Well, taking a file on ramfs and doing losetup on it should be equivalent
> to ramdisk.  Turning relevant pieces into a driver shouldn't be too hard.
> It won't be pretty, though - you'll probably want different
> address_space_operations, so that read()/write() wouldn't bother with
> requests at all.

My same idea, agreed.

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-21  4:06                                           ` Andrea Arcangeli
@ 2001-09-21  4:46                                             ` Andrea Arcangeli
  2001-09-21  7:09                                               ` Andrea Arcangeli
  0 siblings, 1 reply; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-21  4:46 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Fri, Sep 21, 2001 at 06:06:25AM +0200, Andrea Arcangeli wrote:
> On Fri, Sep 21, 2001 at 05:47:49AM +0200, Andrea Arcangeli wrote:
> > write and I cannot zero it out. It is getting harder to fix this one
> > just inside the ->make_request callback... Hints?
> 
> I think the best fix is to have the ramdisk using the same aops of ramfs
> and replace "Secure" with "Uptodate". We need to trap the security issue
> at the higher layer and also this will avoid us having to map useless
> bh, so it should be an improvement, only the filesystem will end
> triggering the ->make_request callback of the ramdisk. Then if the fs
> does I/O on stuff out of the physical address space we'll just clear it
> out.

ok, here it is, this obsoletes the previous patch (so again against
2.4.10pre12), initrd works again.

Al, feel free to expand over rd.c now.

diff -urN ramdisk-ref/drivers/block/rd.c ramdisk/drivers/block/rd.c
--- ramdisk-ref/drivers/block/rd.c	Fri Sep 21 06:39:28 2001
+++ ramdisk/drivers/block/rd.c	Fri Sep 21 06:26:37 2001
@@ -186,17 +186,77 @@
 
 #endif
 
+/*
+ * Copyright (C) 2000 Linus Torvalds.
+ *               2000 Transmeta Corp.
+ * aops copied from ramfs.
+ */
+static int ramdisk_readpage(struct file *file, struct page * page)
+{
+	if (!Page_Uptodate(page)) {
+		memset(kmap(page), 0, PAGE_CACHE_SIZE);
+		kunmap(page);
+		flush_dcache_page(page);
+		SetPageUptodate(page);
+	}
+	UnlockPage(page);
+	return 0;
+}
+
+/*
+ * Writing: just make sure the page gets marked dirty, so that
+ * the page stealer won't grab it.
+ */
+static int ramdisk_writepage(struct page *page)
+{
+	SetPageDirty(page);
+	UnlockPage(page);
+	return 0;
+}
+
+static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+{
+	void *addr = kmap(page);
+	if (!Page_Uptodate(page)) {
+		memset(addr, 0, PAGE_CACHE_SIZE);
+		flush_dcache_page(page);
+		SetPageUptodate(page);
+	}
+	SetPageDirty(page);
+	return 0;
+}
+
+static int ramdisk_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+{
+	kunmap(page);
+	return 0;
+}
+
+struct address_space_operations ramdisk_aops = {
+	readpage: ramdisk_readpage,
+	writepage: ramdisk_writepage,
+	prepare_write: ramdisk_prepare_write,
+	commit_write: ramdisk_commit_write,
+};
+
 static int rd_blkdev_pagecache_IO(int rw, struct buffer_head * sbh, int minor)
 {
-	struct address_space * mapping = rd_inode[minor]->i_mapping;
+	struct address_space * mapping;
 	unsigned long index;
-	int offset, size, err = 0;
+	int offset, size, err;
 
-	if (sbh->b_page->mapping == mapping) {
-		if (rw != READ)
-			SetPageDirty(sbh->b_page);
+	err = -EIO;
+	/*
+	 * If we did BLKFLSBUF just before doing read/write,
+	 * we could see a rd_inode before the opener had a chance
+	 * to return from rd_open(), that's ok, as soon as we
+	 * see the inode we can use its i_mapping, and the inode
+	 * cannot go away from under us.
+	 */
+	if (!rd_inode[minor])
 		goto out;
-	}
+	err = 0;
+	mapping = rd_inode[minor]->i_mapping;
 
 	index = sbh->b_rsector >> (PAGE_CACHE_SHIFT - 9);
 	offset = (sbh->b_rsector << 9) & ~PAGE_CACHE_MASK;
@@ -206,8 +266,7 @@
 		int count;
 		struct page ** hash;
 		struct page * page;
-		const char * src;
-		char * dst;
+		char * src, * dst;
 		int unlock = 0;
 
 		count = PAGE_CACHE_SIZE - offset;
@@ -217,20 +276,24 @@
 
 		hash = page_hash(mapping, index);
 		page = __find_get_page(mapping, index, hash);
-		if (!page && rw != READ) {
+		if (!page) {
 			page = grab_cache_page(mapping, index);
 			err = -ENOMEM;
 			if (!page)
 				goto out;
 			err = 0;
+
+			if (!Page_Uptodate(page)) {
+				memset(kmap(page), 0, PAGE_CACHE_SIZE);
+				kunmap(page);
+				flush_dcache_page(page);
+				SetPageUptodate(page);
+			}
+
 			unlock = 1;
 		}
 
 		index++;
-		if (!page) {
-			offset = 0;
-			continue;
-		}
 
 		if (rw == READ) {
 			src = kmap(page);
@@ -321,6 +384,13 @@
 				struct block_device * bdev = inode->i_bdev;
 
 				down(&bdev->bd_sem);
+				/*
+				 * We're the only users here, protected by the
+				 * bd_sem, but first check if we didn't just
+				 * flushed the ramdisk.
+				 */
+				if (!rd_inode[minor])
+					goto unlock;
 				if (bdev->bd_openers > 2) {
 					up(&bdev->bd_sem);
 					return -EBUSY;
@@ -328,8 +398,16 @@
 				bdev->bd_openers--;
 				bdev->bd_cache_openers--;
 				iput(rd_inode[minor]);
+				/*
+				 * Make sure the cache is flushed from here
+				 * and not from close(2), somebody
+				 * could reopen the device before we have a
+				 * chance to close it ourself.
+				 */
+				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
 				rd_inode[minor] = NULL;
 				rd_blocksizes[minor] = rd_blocksize;
+			unlock:
 				up(&bdev->bd_sem);
 			}
 			break;
@@ -421,6 +499,8 @@
 			bdev->bd_openers++;
 			bdev->bd_cache_openers++;
 			bdev->bd_inode = inode;
+
+			inode->i_mapping->a_ops = &ramdisk_aops;
 		}
 	}
 

Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-21  4:46                                             ` Andrea Arcangeli
@ 2001-09-21  7:09                                               ` Andrea Arcangeli
  0 siblings, 0 replies; 110+ messages in thread
From: Andrea Arcangeli @ 2001-09-21  7:09 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List

On Fri, Sep 21, 2001 at 06:46:21AM +0200, Andrea Arcangeli wrote:
> On Fri, Sep 21, 2001 at 06:06:25AM +0200, Andrea Arcangeli wrote:
> > On Fri, Sep 21, 2001 at 05:47:49AM +0200, Andrea Arcangeli wrote:
> > > write and I cannot zero it out. It is getting harder to fix this one
> > > just inside the ->make_request callback... Hints?
> > 
> > I think the best fix is to have the ramdisk using the same aops of ramfs
> > and replace "Secure" with "Uptodate". We need to trap the security issue
> > at the higher layer and also this will avoid us having to map useless
> > bh, so it should be an improvement, only the filesystem will end
> > triggering the ->make_request callback of the ramdisk. Then if the fs
> > does I/O on stuff out of the physical address space we'll just clear it
> > out.
> 
> ok, here it is, this obsoletes the previous patch (so again against
> 2.4.10pre12), initrd works again.

rediffed against pre13 (I'm passing the bd_inode to the ->open/release,
so we can modify the a_ops of the physical address space, ramdisk needs
that in rd_open())

diff -urN ramdisk-ref/drivers/block/rd.c ramdisk/drivers/block/rd.c
--- ramdisk-ref/drivers/block/rd.c	Fri Sep 21 08:28:04 2001
+++ ramdisk/drivers/block/rd.c	Fri Sep 21 08:06:02 2001
@@ -186,17 +186,77 @@
 
 #endif
 
+/*
+ * Copyright (C) 2000 Linus Torvalds.
+ *               2000 Transmeta Corp.
+ * aops copied from ramfs.
+ */
+static int ramdisk_readpage(struct file *file, struct page * page)
+{
+	if (!Page_Uptodate(page)) {
+		memset(kmap(page), 0, PAGE_CACHE_SIZE);
+		kunmap(page);
+		flush_dcache_page(page);
+		SetPageUptodate(page);
+	}
+	UnlockPage(page);
+	return 0;
+}
+
+/*
+ * Writing: just make sure the page gets marked dirty, so that
+ * the page stealer won't grab it.
+ */
+static int ramdisk_writepage(struct page *page)
+{
+	SetPageDirty(page);
+	UnlockPage(page);
+	return 0;
+}
+
+static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+{
+	void *addr = kmap(page);
+	if (!Page_Uptodate(page)) {
+		memset(addr, 0, PAGE_CACHE_SIZE);
+		flush_dcache_page(page);
+		SetPageUptodate(page);
+	}
+	SetPageDirty(page);
+	return 0;
+}
+
+static int ramdisk_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+{
+	kunmap(page);
+	return 0;
+}
+
+struct address_space_operations ramdisk_aops = {
+	readpage: ramdisk_readpage,
+	writepage: ramdisk_writepage,
+	prepare_write: ramdisk_prepare_write,
+	commit_write: ramdisk_commit_write,
+};
+
 static int rd_blkdev_pagecache_IO(int rw, struct buffer_head * sbh, int minor)
 {
-	struct address_space * mapping = rd_inode[minor]->i_mapping;
+	struct address_space * mapping;
 	unsigned long index;
-	int offset, size, err = 0;
+	int offset, size, err;
 
-	if (sbh->b_page->mapping == mapping) {
-		if (rw != READ)
-			SetPageDirty(sbh->b_page);
+	err = -EIO;
+	/*
+	 * If we did BLKFLSBUF just before doing read/write,
+	 * we could see a rd_inode before the opener had a chance
+	 * to return from rd_open(), that's ok, as soon as we
+	 * see the inode we can use its i_mapping, and the inode
+	 * cannot go away from under us.
+	 */
+	if (!rd_inode[minor])
 		goto out;
-	}
+	err = 0;
+	mapping = rd_inode[minor]->i_mapping;
 
 	index = sbh->b_rsector >> (PAGE_CACHE_SHIFT - 9);
 	offset = (sbh->b_rsector << 9) & ~PAGE_CACHE_MASK;
@@ -206,8 +266,7 @@
 		int count;
 		struct page ** hash;
 		struct page * page;
-		const char * src;
-		char * dst;
+		char * src, * dst;
 		int unlock = 0;
 
 		count = PAGE_CACHE_SIZE - offset;
@@ -217,20 +276,24 @@
 
 		hash = page_hash(mapping, index);
 		page = __find_get_page(mapping, index, hash);
-		if (!page && rw != READ) {
+		if (!page) {
 			page = grab_cache_page(mapping, index);
 			err = -ENOMEM;
 			if (!page)
 				goto out;
 			err = 0;
+
+			if (!Page_Uptodate(page)) {
+				memset(kmap(page), 0, PAGE_CACHE_SIZE);
+				kunmap(page);
+				flush_dcache_page(page);
+				SetPageUptodate(page);
+			}
+
 			unlock = 1;
 		}
 
 		index++;
-		if (!page) {
-			offset = 0;
-			continue;
-		}
 
 		if (rw == READ) {
 			src = kmap(page);
@@ -321,6 +384,13 @@
 				struct block_device * bdev = inode->i_bdev;
 
 				down(&bdev->bd_sem);
+				/*
+				 * We're the only users here, protected by the
+				 * bd_sem, but first check if we didn't just
+				 * flushed the ramdisk.
+				 */
+				if (!rd_inode[minor])
+					goto unlock;
 				if (bdev->bd_openers > 2) {
 					up(&bdev->bd_sem);
 					return -EBUSY;
@@ -328,8 +398,16 @@
 				bdev->bd_openers--;
 				bdev->bd_cache_openers--;
 				iput(rd_inode[minor]);
+				/*
+				 * Make sure the cache is flushed from here
+				 * and not from close(2), somebody
+				 * could reopen the device before we have a
+				 * chance to close it ourself.
+				 */
+				truncate_inode_pages(rd_inode[minor]->i_mapping, 0);
 				rd_inode[minor] = NULL;
 				rd_blocksizes[minor] = rd_blocksize;
+			unlock:
 				up(&bdev->bd_sem);
 			}
 			break;
@@ -420,6 +498,8 @@
 			/* bdev->bd_sem is held by caller */
 			bdev->bd_openers++;
 			bdev->bd_cache_openers++;
+
+			bdev->bd_inode->i_mapping->a_ops = &ramdisk_aops;
 		}
 	}
 
diff -urN ramdisk-ref/fs/block_dev.c ramdisk/fs/block_dev.c
--- ramdisk-ref/fs/block_dev.c	Fri Sep 21 07:48:12 2001
+++ ramdisk/fs/block_dev.c	Fri Sep 21 08:25:29 2001
@@ -736,7 +736,7 @@
 	if (bdev->bd_op) {
 		ret = 0;
 		if (bdev->bd_op->open)
-			ret = bdev->bd_op->open(inode,filp);
+			ret = bdev->bd_op->open(bdev->bd_inode,filp);
 		if (!ret) {
 			bdev->bd_openers++;
 			bdev->bd_cache_openers++;
@@ -835,7 +835,7 @@
 
 	/* release the device driver */
 	if (bdev->bd_op->release)
-		ret = bdev->bd_op->release(inode, NULL);
+		ret = bdev->bd_op->release(bd_inode, NULL);
 	if (!--bdev->bd_openers) {
 		bdev->bd_op = NULL;
 		bdev->bd_inode = NULL;


Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  6:14 Alexei Podtelezhnikov
@ 2001-09-18 13:51 ` Rik van Riel
  0 siblings, 0 replies; 110+ messages in thread
From: Rik van Riel @ 2001-09-18 13:51 UTC (permalink / raw)
  To: Alexei Podtelezhnikov; +Cc: linux-kernel

On Mon, 17 Sep 2001, Alexei Podtelezhnikov wrote:

> Since 9-10 months ago Rik's scheme didn't det enough attention to get
> fixed. Out of the main tree it has better chances, just like it did
> before inclusion.

Umm, we _have_ been fixing the VM in 2.4, but Linus has
been randomly ignoring patches and introducing stuff
we warned him not to apply which broke stuff.

With maintainance like that, I'm sure Andrea's code
will also have a better chance out of the main tree ;)

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
@ 2001-09-18  6:14 Alexei Podtelezhnikov
  2001-09-18 13:51 ` Rik van Riel
  0 siblings, 1 reply; 110+ messages in thread
From: Alexei Podtelezhnikov @ 2001-09-18  6:14 UTC (permalink / raw)
  To: linux-kernel

I praise Linus for this step.
Since 9-10 months ago Rik's scheme didn't det enough attention
to get fixed. Out of the main tree it has better chances, just like it did
before inclusion.

Alexei


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:15 ` Alan Cox
@ 2001-09-18  4:41   ` H. Peter Anvin
  0 siblings, 0 replies; 110+ messages in thread
From: H. Peter Anvin @ 2001-09-18  4:41 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <E15jBLy-0008UF-00@the-village.bc.nu>
By author:    Alan Cox <alan@lxorguk.ukuu.org.uk>
In newsgroup: linux.dev.kernel
> 
> You need gcc 2.96 or higher to build the pre11 tree. I doubt that was
> intentional. Basically rip out all use of __builtin_expect
> 

Perhaps we should have a header which does #define __builtin_expect(X)
if your gcc version is 2.91-95?

   -hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  2:31 ` Magnus Naeslund(f)
  2001-09-18  2:49   ` David B. Stevens
@ 2001-09-18  3:38   ` Ed Tomlinson
  1 sibling, 0 replies; 110+ messages in thread
From: Ed Tomlinson @ 2001-09-18  3:38 UTC (permalink / raw)
  To: Magnus Naeslund(f); +Cc: linux-kernel

On September 17, 2001 10:31 pm, Magnus Naeslund(f) wrote:
> From: "Ed Tomlinson" <tomlins@cam.org>
>
> > Hi,
> >
> > You are fast.  Was just going report this one...
> > Using debian sid with gcc 2.95.4.  Both before and after
> > appling the patch below I get:
>
> The patch seems wrong...
> The files include "compile.h", but should include "compiler.h", no?

I should have caught that...  It works changeing compile.h to compiler.h

Thanks
Ed Tomlinson

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
@ 2001-09-18  3:18 Ed Tomlinson
  2001-09-18  2:31 ` Magnus Naeslund(f)
  2001-09-18  3:15 ` Alan Cox
  0 siblings, 2 replies; 110+ messages in thread
From: Ed Tomlinson @ 2001-09-18  3:18 UTC (permalink / raw)
  To: linux-kernel

Hi,

You are fast.  Was just going report this one...
Using debian sid with gcc 2.95.4.  Both before and after
appling the patch below I get:

ld -m elf_i386 -T /usr/src/linux/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \
        --start-group \
        arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \         drivers/char/char.o drivers/block/block.o drivers/misc/misc.o drivers/net/net.o drivers/media/media.o drivers/char/drm/drm.o drivers/ide/idedriver.o drivers/cdrom/driver.o drivers/pci/driver.o drivers/video/video.o drivers/usb/usbdrv.o drivers/md/mddev.o \
        net/network.o \
        /usr/src/linux/arch/i386/lib/lib.a /usr/src/linux/lib/lib.a /usr/src/linux/arch/i386/lib/lib.a \
        --end-group \
        -o vmlinux
mm/mm.o: In function `kmem_cache_alloc':
mm/mm.o(.text+0x8154): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x8173): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x81cc): undefined reference to `__builtin_expect'
mm/mm.o: In function `kmalloc':
mm/mm.o(.text+0x8294): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x82b3): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x830c): more undefined references to `__builtin_expect' follow
make: *** [vmlinux] Error 1

Impling the patch below is incomplete?

TIA
Ed Tomlinson

Andrea Arcangeli wrote:
 
> ah I just got a (verbose) report about one compilation trouble with old
> kernels, the 00_rwsem patch that includes this backwards compatibility
> hunk wasn't applied. This is a very very minor issue. Please Linus
> include this patch to fix the compilation troubles with the old
> compilers:
> 
> diff -urN 2.4.10pre10/include/linux/compiler.h rwsem/include/linux/compiler.h
> --- 2.4.10pre10/include/linux/compiler.h	Thu Jan  1 01:00:00 1970
> +++ rwsem/include/linux/compiler.h	Tue Sep 18 02:02:08 2001
> @@ -0,0 +1,13 @@
> +#ifndef __LINUX_COMPILER_H
> +#define __LINUX_COMPILER_H
> +
> +/* Somewhere in the middle of the GCC 2.96 development cycle, we implemented
> +   a mechanism by which the user can annotate likely branch directions and
> +   expect the blocks to be reordered appropriately.  Define __builtin_expect
> +   to nothing for earlier compilers.  */
> +
> +#if __GNUC__ == 2 && __GNUC_MINOR__ < 96ld -m elf_i386 -T /usr/src/linux/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \
        --start-group \
        arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \         drivers/char/char.o drivers/block/block.o drivers/misc/misc.o drivers/net/net.o drivers/media/media.o drivers/char/drm/drm.o drivers/ide/idedriver.o drivers/cdrom/driver.o drivers/pci/driver.o drivers/video/video.o drivers/usb/usbdrv.o drivers/md/mddev.o \
        net/network.o \
        /usr/src/linux/arch/i386/lib/lib.a /usr/src/linux/lib/lib.a /usr/src/linux/arch/i386/lib/lib.a \
        --end-group \
        -o vmlinux
mm/mm.o: In function `kmem_cache_alloc':
mm/mm.o(.text+0x8154): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x8173): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x81cc): undefined reference to `__builtin_expect'
mm/mm.o: In function `kmalloc':
mm/mm.o(.text+0x8294): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x82b3): undefined reference to `__builtin_expect'
mm/mm.o(.text+0x830c): more undefined references to `__builtin_expect' follow
make: *** [vmlinux] Error 1
> +#define __builtin_expect(x, expected_value) (x)
> +#endif
> +
> +#endif /* __LINUX_COMPILER_H */
> 
> Then you need to simply include <linux/compile.h> from the files that
> doesn't compile, this second patch is untested but it should do the
> trick (00_rwsem-?? was including compile.h from the rwsem includes that
> were in turn included by those files implicitly so I didn't need to add
> compile.h in each file):
> 
> --- 2.4.10pre11/mm/slab.c.~1~	Tue Sep 18 02:43:04 2001
> +++ 2.4.10pre11/mm/slab.c	Tue Sep 18 04:00:30 2001
> @@ -72,6 +72,7 @@
>  #include	<linux/slab.h>
>  #include	<linux/interrupt.h>
>  #include	<linux/init.h>
> +#include	<linux/compile.h>
>  #include	<asm/uaccess.h>
>  
>  /*
> --- 2.4.10pre11/mm/page_alloc.c.~1~	Tue Sep 18 02:43:04 2001
> +++ 2.4.10pre11/mm/page_alloc.c	Tue Sep 18 04:00:20 2001
> @@ -17,6 +17,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/bootmem.h>
>  #include <linux/slab.h>
> +#include <linux/compile.h>
>  
>  int nr_swap_pages;
>  int nr_active_pages;
> --- 2.4.10pre11/mm/vmscan.c.~1~	Tue Sep 18 02:43:04 2001
> +++ 2.4.10pre11/mm/vmscan.c	Tue Sep 18 04:00:47 2001
> @@ -21,6 +21,7 @@
>  #include <linux/init.h>
>  #include <linux/highmem.h>
>  #include <linux/file.h>
> +#include <linux/compile.h>
>  
>  #include <asm/pgalloc.h>
>  
> 
> Andrea

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:18 Ed Tomlinson
  2001-09-18  2:31 ` Magnus Naeslund(f)
@ 2001-09-18  3:15 ` Alan Cox
  2001-09-18  4:41   ` H. Peter Anvin
  1 sibling, 1 reply; 110+ messages in thread
From: Alan Cox @ 2001-09-18  3:15 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-kernel

> You are fast.  Was just going report this one...
> Using debian sid with gcc 2.95.4.  Both before and after
> appling the patch below I get:

You need gcc 2.96 or higher to build the pre11 tree. I doubt that was
intentional. Basically rip out all use of __builtin_expect


Alan

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  2:31 ` Magnus Naeslund(f)
@ 2001-09-18  2:49   ` David B. Stevens
  2001-09-18  3:38   ` Ed Tomlinson
  1 sibling, 0 replies; 110+ messages in thread
From: David B. Stevens @ 2001-09-18  2:49 UTC (permalink / raw)
  To: Magnus Naeslund(f); +Cc: Ed Tomlinson, linux-kernel

Greetings,

"Magnus Naeslund(f)" wrote:
> 
> From: "Ed Tomlinson" <tomlins@cam.org>
> > Hi,
> >
> > You are fast.  Was just going report this one...
> > Using debian sid with gcc 2.95.4.  Both before and after
> > appling the patch below I get:
> >
> 
> The patch seems wrong...
> The files include "compile.h", but should include "compiler.h", no?
> 
> Magnus

You are correct Magnus, it should be compiler.h in the three mm modules.

Cheers,
  Dave

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Linux 2.4.10-pre11
  2001-09-18  3:18 Ed Tomlinson
@ 2001-09-18  2:31 ` Magnus Naeslund(f)
  2001-09-18  2:49   ` David B. Stevens
  2001-09-18  3:38   ` Ed Tomlinson
  2001-09-18  3:15 ` Alan Cox
  1 sibling, 2 replies; 110+ messages in thread
From: Magnus Naeslund(f) @ 2001-09-18  2:31 UTC (permalink / raw)
  To: Ed Tomlinson; +Cc: linux-kernel

From: "Ed Tomlinson" <tomlins@cam.org>
> Hi,
> 
> You are fast.  Was just going report this one...
> Using debian sid with gcc 2.95.4.  Both before and after
> appling the patch below I get:
> 


The patch seems wrong...
The files include "compile.h", but should include "compiler.h", no?

Magnus



^ permalink raw reply	[flat|nested] 110+ messages in thread

end of thread, other threads:[~2001-09-21  7:09 UTC | newest]

Thread overview: 110+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-18  0:08 Linux 2.4.10-pre11 Linus Torvalds
2001-09-17 23:17 ` Marcelo Tosatti
2001-09-18  1:08   ` Marcelo Tosatti
2001-09-18  3:37     ` Andrea Arcangeli
2001-09-18  2:25       ` Marcelo Tosatti
2001-09-18  3:58         ` Andrea Arcangeli
2001-09-18  2:53           ` Marcelo Tosatti
2001-09-18  4:54             ` Andrea Arcangeli
2001-09-18  3:33               ` Marcelo Tosatti
2001-09-18  5:06                 ` Andrea Arcangeli
2001-09-18  3:55                   ` Marcelo Tosatti
2001-09-18  5:32                     ` Andrea Arcangeli
2001-09-18  4:14                       ` Marcelo Tosatti
2001-09-18  5:59                         ` Andrea Arcangeli
2001-09-18  5:00                       ` Marcelo Tosatti
     [not found] ` <20010917211834.A31693@redhat.com>
     [not found]   ` <20010918035055.J698@athlon.random>
2001-09-18  2:02     ` Andrea Arcangeli
     [not found]     ` <20010917221653.B31693@redhat.com>
2001-09-18  2:27       ` Linus Torvalds
2001-09-18  3:14         ` Alan Cox
2001-09-18  3:26         ` Andrea Arcangeli
     [not found]       ` <20010918052201.N698@athlon.random>
2001-09-18  4:01         ` Benjamin LaHaise
2001-09-18  4:39           ` Andrea Arcangeli
2001-09-18  5:04             ` Alan Cox
2001-09-18  5:09               ` Andrea Arcangeli
2001-09-18  5:22             ` Benjamin LaHaise
2001-09-18  5:48               ` Andrea Arcangeli
2001-09-18  5:48 ` Andrew Morton
2001-09-18  6:11   ` Andrea Arcangeli
2001-09-18  5:02     ` Marcelo Tosatti
2001-09-18  6:40       ` Andrea Arcangeli
2001-09-18 16:06         ` Marcelo Tosatti
2001-09-18 19:18           ` Marcelo Tosatti
2001-09-18 21:05             ` Andrea Arcangeli
2001-09-19 13:57               ` Rik van Riel
2001-09-18 10:58   ` Martin Dalecki
2001-09-18  9:31 ` Alexander Viro
2001-09-18  9:39   ` Andrea Arcangeli
2001-09-18  9:44     ` Alexander Viro
2001-09-18  9:57       ` Andrea Arcangeli
2001-09-18 10:02         ` Alexander Viro
2001-09-18 10:17           ` Andrea Arcangeli
2001-09-18 10:28             ` Alexander Viro
2001-09-18 10:35               ` Andrea Arcangeli
2001-09-18 10:52                 ` Alexander Viro
2001-09-18 11:05             ` Helge Hafting
2001-09-18 12:40               ` Andrea Arcangeli
2001-09-18 17:02             ` Linus Torvalds
2001-09-18 16:45   ` Linus Torvalds
2001-09-18 18:19     ` Alexander Viro
2001-09-18 18:27       ` Linus Torvalds
2001-09-18 19:14         ` Andreas Dilger
2001-09-18 19:41           ` Alexander Viro
2001-09-18 20:33           ` Richard Gooch
2001-09-18 20:53             ` Alexander Viro
2001-09-18 21:06             ` Richard Gooch
2001-09-18 21:27               ` Alexander Viro
2001-09-18 19:29         ` Benjamin LaHaise
2001-09-18 20:17         ` Stephan von Krawczynski
2001-09-18 20:33           ` Alan Cox
2001-09-19 13:42           ` Rik van Riel
2001-09-19 14:27             ` Alexander Viro
2001-09-19  2:59         ` Michael Peddemors
2001-09-19 16:11         ` Alexander Viro
2001-09-19 18:25           ` Andrea Arcangeli
2001-09-19 19:21             ` Alexander Viro
2001-09-19 20:55               ` Andrea Arcangeli
2001-09-19 21:17                 ` Alexander Viro
2001-09-19 23:01                   ` Andrea Arcangeli
2001-09-19 23:03                   ` Andrea Arcangeli
2001-09-19 23:30                     ` Alexander Viro
2001-09-19 23:40                       ` Andrea Arcangeli
2001-09-20 13:56                         ` Alexander Viro
2001-09-20 14:38                           ` Chris Mason
2001-09-20 14:50                             ` Alexander Viro
2001-09-20 15:44                               ` Chris Mason
2001-09-20 16:43                                 ` Alexander Viro
2001-09-20 20:54                                   ` [PATCH] fs/block_dev.c cleanup Alexander Viro
2001-09-19 22:15                 ` Linux 2.4.10-pre11 Richard Gooch
2001-09-20  2:34               ` Andrea Arcangeli
2001-09-20 10:52                 ` Alexander Viro
2001-09-20 18:18                   ` Andrea Arcangeli
2001-09-20 18:33                     ` Alexander Viro
2001-09-20 18:59                       ` Andrea Arcangeli
2001-09-20 20:41                         ` Alexander Viro
2001-09-20 21:18                           ` Andrea Arcangeli
2001-09-20 21:40                             ` Alexander Viro
2001-09-20 22:13                               ` Andrea Arcangeli
2001-09-20 22:20                                 ` Alexander Viro
2001-09-20 22:31                                   ` Andrea Arcangeli
2001-09-20 22:44                                     ` Alexander Viro
2001-09-20 23:03                                       ` Andrea Arcangeli
2001-09-20 23:11                                         ` Alexander Viro
2001-09-21  1:50                                           ` Alexander Viro
2001-09-21  2:42                                             ` Andrea Arcangeli
2001-09-21  3:47                                         ` Andrea Arcangeli
2001-09-21  4:00                                           ` Alexander Viro
2001-09-21  4:06                                             ` Andrea Arcangeli
2001-09-21  4:06                                           ` Andrea Arcangeli
2001-09-21  4:46                                             ` Andrea Arcangeli
2001-09-21  7:09                                               ` Andrea Arcangeli
2001-09-19 20:41             ` Richard Gooch
2001-09-19 13:38       ` Rik van Riel
2001-09-19 16:35       ` Andrea Arcangeli
2001-09-18  3:18 Ed Tomlinson
2001-09-18  2:31 ` Magnus Naeslund(f)
2001-09-18  2:49   ` David B. Stevens
2001-09-18  3:38   ` Ed Tomlinson
2001-09-18  3:15 ` Alan Cox
2001-09-18  4:41   ` H. Peter Anvin
2001-09-18  6:14 Alexei Podtelezhnikov
2001-09-18 13:51 ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).