* Linux 2.4.10-pre11 @ 2001-09-18 0:08 Linus Torvalds 2001-09-17 23:17 ` Marcelo Tosatti ` (3 more replies) 0 siblings, 4 replies; 110+ messages in thread From: Linus Torvalds @ 2001-09-18 0:08 UTC (permalink / raw) To: Kernel Mailing List; +Cc: Andrea Arcangeli Ok, the big thing here is continued merging, this time with Andrea. I still don't like some of the VM changes, but integrating Andrea's VM changes results in (a) better performance and (b) much cleaner inactive page handling in particular. Besides, for the 2.4.x tree, the big priority is stability, we can re-address my other concerns during 2.5.x. This also merges the blkdev in page cache patch, and that will hopefully make it noticeably easier to do the "do bread() with page cache too", at which point a lot of the current ugly synchronization issues will go away. Oh, and it gets the direct-IO improvements from Andrea too. [ Other small patches from Andrea merged just to make future merges easier. ] The console locking merge with Andrew Morton also moves us a bit closer to the -ac tree.. Full 2.4.10 ChangeLog appended. Linus ------ pre11: - Neil Brown: md cleanups/fixes - Andrew Morton: console locking merge - Andrea Arkangeli: major VM merge pre10: - Alan Cox: continued merging - Mingming Cao: make msgrcv/shmat check the queue/segment ID's properly - Greg KH: USB serial init failure fix, Xircom serial converter driver - Neil Brown: nsfd/raid/md/lockd cleanups - Ingo Molnar: multipath RAID personality, raid xor update - Hugh Dickins/Marcelo Tosatti: swapin read-ahead race fix - Vojtech Pavlik: fix up some of the infrastructure for x86-64 - Robert Love: AMD 761 AGP GART support - Jens Axboe: fix SCSI-generic queue handling race - me: be sane about page reference bits pre9: - Greg KH: start migration to new "min()/max()" - Roman Zippel: move affs over to "min()/max()". - Vojtech Pavlik: VIA update (make sure not to IRQ-unmask a vt82c576) - Jan Kara: quota bug-fix (don't decrement quota for non-counted inode) - Anton Altaparmakov: more NTFS updates - Al Viro: make nosuid/noexec/nodev be per-mount flags, not per-filesystem - Alan Cox: merge input/joystick layer differences, driver and alpha merge - Keith Owens: scsi Makefile cleanup - Trond Myklebust: fix oopsable race in locking code - Jean Tourrilhes: IrDA update pre8: - Christoph Hellwig: clean up personality handling a bit - Robert Love: update sysctl/vm documentation - make the three-argument (that everybody hates) "min()" be "min_t()", and introduce a type-anal "min()" that complains about arguments of different types. pre7: - Alan Cox: big driver/mips sync - Andries Brouwer, Christoph Hellwig: more gendisk fixups - Tobias Ringstrom: tulip driver workaround for DC21143 erratum pre6: - Jens Axboe: remove trivially dead io_request_lock usage - Andrea Arcangeli: softirq cleanup and ARM fixes. Slab cleanups - Christoph Hellwig: gendisk handling helper functions/cleanups - Nikita Danilov: reiserfs dead code pruning - Anton Altaparmakov: NTFS update to 1.1.18 - firestream network driver: patch reverted on authors request - NIIBE Yutaka: SH architecture update - Paul Mackerras: PPC cleanups, PPC8xx update. - me: reverse broken bootdata allocation patch that went into pre5 pre5: - Merge with Alan - Trond Myklebust: NFS fixes - kmap and root inode special case - Al Viro: more superblock cleanups, inode leak in rd.c, minix directories in page cache - Paul Mackerras: clean up rubbish from sl82c105.c - Neil Brown: md/raid cleanups, NFS filehandles - Johannes Erdfelt: USB update (usb-2.0 support, visor fix, Clie fix, pl2303 driver update) - David Miller: sparc and net update - Eric Biederman: simplify and correct bootdata allocation - don't overwrite ramdisks - Tim Waugh: support multiple SuperIO devices, parport doc updates pre4: - Hugh Dickins: swapoff cleanups and speedups - Matthew Dharm: USB storage update - Keith Owens: Makefile fixes - Tom Rini: MPC8xx build fix - Nikita Danilov: reiserfs update - Jakub Jelinek: ELF loader fix for ET_DYN - Andrew Morton: reparent_to_init() for kernel threads - Christoph Hellwig: VxFS and SysV updates, vfs_permission fix pre3: - Johannes Erdfelt, Oliver Neukum: USB printer driver race fix - John Byrne: fix stupid i386-SMP irq stack layout bug - Andreas Bombe, me: yenta IO window fix - Neil Brown: raid1 buffer state fix - David Miller, Paul Mackerras: fix up sparc and ppc respectively for kmap/kbd_rate - Matija Nalis: umsdos fixes, and make it possible to boot up with umsdos - Francois Romieu: fix bugs in dscc4 driver - Andy Grover: new PCI config space access functions (eventually for ACPI) - Albert Cranford: fix incorrect e2fsprog data from ver_linux script - Dave Jones: re-sync x86 setup code, fix macsonic kmalloc use - Johannes Erdfelt: remove obsolete plusb USB driver - Andries Brouwer: fix USB compact flash version info, add blksize ioctls pre2: - Al Viro: block device cleanups - Marcelo Tosatti: make bounce buffer allocations more robust (it's ok for them to do IO, just not cause recursive bounce IO. So allow them) - Anton Altaparmakov: NTFS update (1.1.17) - Paul Mackerras: PPC update (big re-org) - Petko Manolov: USB pegasus driver fixes - David Miller: networking and sparc updates - Trond Myklebust: Export atomic_dec_and_lock - OGAWA Hirofumi: find and fix umsdos "filldir" users that were broken by the 64-bit-cleanups. Fix msdos warnings. - Al Viro: superblock handling cleanups and race fixes - Johannes Erdfelt++: USB updates pre1: - Jeff Hartmann: DRM AGP/alpha cleanups - Ben LaHaise: highmem user pagecopy/clear optimization - Vojtech Pavlik: VIA IDE driver update - Herbert Xu: make cramfs work with HIGHMEM pages - David Fennell: awe32 ram size detection improvement - Istvan Varadi: umsdos EMD filename bug fix - Keith Owens: make min/max work for pointers too - Jan Kara: quota initialization fix - Brad Hards: Kaweth USB driver update (enable, and fix endianness) - Ralf Baechle: MIPS updates - David Gibson: airport driver update - Rogier Wolff: firestream ATM driver multi-phy support - Daniel Phillips: swap read page referenced set - avoid swap thrashing ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 0:08 Linux 2.4.10-pre11 Linus Torvalds @ 2001-09-17 23:17 ` Marcelo Tosatti 2001-09-18 1:08 ` Marcelo Tosatti [not found] ` <20010917211834.A31693@redhat.com> ` (2 subsequent siblings) 3 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-17 23:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli On Mon, 17 Sep 2001, Linus Torvalds wrote: > > Ok, the big thing here is continued merging, this time with Andrea. > > I still don't like some of the VM changes, but integrating Andrea's VM > changes results in (a) better performance and (b) much cleaner inactive > page handling in particular. Besides, for the 2.4.x tree, the big priority > is stability, we can re-address my other concerns during 2.5.x. Andrea, Could you please make a resume of your VM changes ? Its hard to keep up with VM changes this way. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-17 23:17 ` Marcelo Tosatti @ 2001-09-18 1:08 ` Marcelo Tosatti 2001-09-18 3:37 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 1:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli On Mon, 17 Sep 2001, Marcelo Tosatti wrote: > > > On Mon, 17 Sep 2001, Linus Torvalds wrote: > > > > > Ok, the big thing here is continued merging, this time with Andrea. > > > > I still don't like some of the VM changes, but integrating Andrea's VM > > changes results in (a) better performance and (b) much cleaner inactive > > page handling in particular. Besides, for the 2.4.x tree, the big priority > > is stability, we can re-address my other concerns during 2.5.x. > > Andrea, > > Could you please make a resume of your VM changes ? > > Its hard to keep up with VM changes this way. Andrea, I've just read a bit of your new VM code and I have a few comments. You completly removed the "inactive freeable pages" logic: There is no more distiction between "freeable inactive" and "free" pages. All VM work is based on "freepages.high" watermark. I don't like that: it seems to make page freeing more aggressive over time. Also, if we have several try_to_free_pages() callers, for different classzones, I'm right saying that a caller with a "smaller" classzone can "hide" pages in its "active_local_lru" and/or "inactive_local_lru" (during shrink_cache) from other processes trying to free pages from those higher zones ? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 1:08 ` Marcelo Tosatti @ 2001-09-18 3:37 ` Andrea Arcangeli 2001-09-18 2:25 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 3:37 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List On Mon, Sep 17, 2001 at 10:08:22PM -0300, Marcelo Tosatti wrote: > > > On Mon, 17 Sep 2001, Marcelo Tosatti wrote: > > > > > > > On Mon, 17 Sep 2001, Linus Torvalds wrote: > > > > > > > > Ok, the big thing here is continued merging, this time with Andrea. > > > > > > I still don't like some of the VM changes, but integrating Andrea's VM > > > changes results in (a) better performance and (b) much cleaner inactive > > > page handling in particular. Besides, for the 2.4.x tree, the big priority > > > is stability, we can re-address my other concerns during 2.5.x. > > > > Andrea, > > > > Could you please make a resume of your VM changes ? > > > > Its hard to keep up with VM changes this way. > > Andrea, > > I've just read a bit of your new VM code and I have a few comments. > > You completly removed the "inactive freeable pages" logic: There is no yes, it wasn't relly useful to keep this list lazily, you either keep it enforced with locking overhead or such information isn't valuable. > more distiction between "freeable inactive" and "free" pages. All VM work > is based on "freepages.high" watermark. I don't like that: it seems to hardly on freepages.high: diff -urN vm-ref/mm/swap.c vm/mm/swap.c --- vm-ref/mm/swap.c Tue Sep 18 00:18:17 2001 +++ vm/mm/swap.c Tue Sep 18 00:18:35 2001 @@ -24,50 +24,13 @@ #include <asm/uaccess.h> /* for copy_to/from_user */ #include <asm/pgtable.h> -/* - * We identify three levels of free memory. We never let free mem - * fall below the freepages.min except for atomic allocations. We - * start background swapping if we fall below freepages.high free - * pages, and we begin intensive swapping below freepages.low. - * - * Actual initialization is done in mm/page_alloc.c - */ -freepages_t freepages = { - 0, /* freepages.min */ - 0, /* freepages.low */ - 0 /* freepages.high */ -}; - > make page freeing more aggressive over time. I don't see your point with "page freeing more aggressive over time". > Also, if we have several try_to_free_pages() callers, for different > classzones, I'm right saying that a caller with a "smaller" classzone can > "hide" pages in its "active_local_lru" and/or "inactive_local_lru" (during > shrink_cache) from other processes trying to free pages from those higher > zones ? I'm deeply impressed, you seem to have understood the rewrite greatly well, congrats, this "hiding" was infact my main concern I had on the memclass check during shrink_cache, but I don't think this will ever give us troubles. In such there are suprious swapouts with HIGHMEM we'll just need to waste some cpu by lefting those pages visible with a few changes in shrink_cache, but again I'm almost sure there won't be problems, we do multiple scans before failing so those pages will return visible before the other task has a chance to fail the allocation. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:37 ` Andrea Arcangeli @ 2001-09-18 2:25 ` Marcelo Tosatti 2001-09-18 3:58 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 2:25 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Mon, Sep 17, 2001 at 10:08:22PM -0300, Marcelo Tosatti wrote: > > > > > > On Mon, 17 Sep 2001, Marcelo Tosatti wrote: > > > > > > > > > > > On Mon, 17 Sep 2001, Linus Torvalds wrote: > > > > > > > > > > > Ok, the big thing here is continued merging, this time with Andrea. > > > > > > > > I still don't like some of the VM changes, but integrating Andrea's VM > > > > changes results in (a) better performance and (b) much cleaner inactive > > > > page handling in particular. Besides, for the 2.4.x tree, the big priority > > > > is stability, we can re-address my other concerns during 2.5.x. > > > > > > Andrea, > > > > > > Could you please make a resume of your VM changes ? > > > > > > Its hard to keep up with VM changes this way. > > > > Andrea, > > > > I've just read a bit of your new VM code and I have a few comments. > > > > You completly removed the "inactive freeable pages" logic: There is no > > yes, it wasn't relly useful to keep this list lazily, you either keep it > enforced with locking overhead or such information isn't valuable. > > > more distiction between "freeable inactive" and "free" pages. All VM work > > is based on "freepages.high" watermark. I don't like that: it seems to > > hardly on freepages.high: > > diff -urN vm-ref/mm/swap.c vm/mm/swap.c > --- vm-ref/mm/swap.c Tue Sep 18 00:18:17 2001 > +++ vm/mm/swap.c Tue Sep 18 00:18:35 2001 > @@ -24,50 +24,13 @@ > #include <asm/uaccess.h> /* for copy_to/from_user */ > #include <asm/pgtable.h> > > -/* > - * We identify three levels of free memory. We never let free mem > - * fall below the freepages.min except for atomic allocations. We > - * start background swapping if we fall below freepages.high free > - * pages, and we begin intensive swapping below freepages.low. > - * > - * Actual initialization is done in mm/page_alloc.c > - */ > -freepages_t freepages = { > - 0, /* freepages.min */ > - 0, /* freepages.low */ > - 0 /* freepages.high */ > -}; > - > > > make page freeing more aggressive over time. > > I don't see your point with "page freeing more aggressive over time". I meant "zone->freepages.high" (or something like that... you get the idea). My point is: we're not going to start aging pte's until we have a zone shortage, right ? With the old scheme, we would differentiate "inactive shortage" from "free shortage". > > > Also, if we have several try_to_free_pages() callers, for different > > classzones, I'm right saying that a caller with a "smaller" classzone can > > "hide" pages in its "active_local_lru" and/or "inactive_local_lru" (during > > shrink_cache) from other processes trying to free pages from those higher > > zones ? > > I'm deeply impressed, you seem to have understood the rewrite greatly > well, congrats, this "hiding" was infact my main concern I had on the > memclass check during shrink_cache, but I don't think this will ever > give us troubles. In such there are suprious swapouts with HIGHMEM > we'll just need to waste some cpu by lefting those pages visible with a > few changes in shrink_cache, but again I'm almost sure there won't be > problems, we do multiple scans before failing so those pages will return > visible before the other task has a chance to fail the allocation. I really think this will cause problems in practice, Andrea. Moreover, the whole active_local_lru has a _bad_ effect on writeout clustering: shrink_cache() callers on low classzone can "hide" pages from higher classzone callers and avoid clustering. Look: We have a nice big sequentially ordered list of pages to writeout. lowzone shrink_cache() moves higher zones pages (from this ordered list) to its inactive/active_local_lru until it calls schedule() (due to need_resched()). (for example) Now we have a higher classzone caller which finds "half of the block" and writes it out. >From this point on, we breaked in half an ordered list of pages which could be nicely merged together at ll_rw_block(). The example I gave which splits one sequentially ordered list of pages in two blocks is a simple and "stupid" one. But think how badly can we have pages ordered during a long time of VM activity. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 2:25 ` Marcelo Tosatti @ 2001-09-18 3:58 ` Andrea Arcangeli 2001-09-18 2:53 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 3:58 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List On Mon, Sep 17, 2001 at 11:25:09PM -0300, Marcelo Tosatti wrote: > My point is: we're not going to start aging pte's until we have a > zone shortage, right ? correct, so we don't waste time if there's only filesystem cache pressure during the whole kernel uptime (not an unlikely scenario if you have enough ram like on servers). > I really think this will cause problems in practice, Andrea. yes, this hiding could infact explain the report you posted. I'll try to reproduce with highmem emulation and to fix it as soon as I can reproduce, also if you are interested to hack on it too feel free to send patches of course. thanks. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:58 ` Andrea Arcangeli @ 2001-09-18 2:53 ` Marcelo Tosatti 2001-09-18 4:54 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 2:53 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Mon, Sep 17, 2001 at 11:25:09PM -0300, Marcelo Tosatti wrote: > > My point is: we're not going to start aging pte's until we have a > > zone shortage, right ? > > correct, so we don't waste time if there's only filesystem cache > pressure during the whole kernel uptime (not an unlikely scenario if you > have enough ram like on servers). > > > I really think this will cause problems in practice, Andrea. > > yes, this hiding could infact explain the report you posted. I'll try to > reproduce with highmem emulation and to fix it as soon as I can > reproduce, also if you are interested to hack on it too feel free to > send patches of course. thanks. Andrea, Personally I think it is too late in the 2.4.x series to integrate your code. I have nothing against the code itself (the "old" code also had bugs), but a major VM rewrite at this point seems to be dangerous if we want a stable VM. Don't you agree that your code can introduce new stability bugs ? Linus, please, if you want to integrate Andrea's code do it in 2.5.x, but not 2.4.x. (yes, I'm expecting you to scream at me now :)) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 2:53 ` Marcelo Tosatti @ 2001-09-18 4:54 ` Andrea Arcangeli 2001-09-18 3:33 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 4:54 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > Don't you agree that your code can introduce new stability bugs ? not anything that can corrupt randomly your hd. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 4:54 ` Andrea Arcangeli @ 2001-09-18 3:33 ` Marcelo Tosatti 2001-09-18 5:06 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 3:33 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > > Don't you agree that your code can introduce new stability bugs ? > > not anything that can corrupt randomly your hd. Sure, the old code did not corrupt hd's randomly, did it? Let me redo the question: Don't you think the old stinky and slow code was reasonably stable ? :) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:33 ` Marcelo Tosatti @ 2001-09-18 5:06 ` Andrea Arcangeli 2001-09-18 3:55 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 5:06 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote: > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > > > Don't you agree that your code can introduce new stability bugs ? > > > > not anything that can corrupt randomly your hd. > > Sure, the old code did not corrupt hd's randomly, did it? > > Let me redo the question: Don't you think the old stinky and slow code was > reasonably stable ? :) As said in the other email, just check 2.4 l-k reports of this week, last week etc.., I've lots of private reports too. While for everybody 2.2.19 is working fine. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:06 ` Andrea Arcangeli @ 2001-09-18 3:55 ` Marcelo Tosatti 2001-09-18 5:32 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 3:55 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote: > > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > > > > Don't you agree that your code can introduce new stability bugs ? > > > > > > not anything that can corrupt randomly your hd. > > > > Sure, the old code did not corrupt hd's randomly, did it? > > > > Let me redo the question: Don't you think the old stinky and slow code was > > reasonably stable ? :) > > As said in the other email, just check 2.4 l-k reports of this week, > last week etc.., I've lots of private reports too. While for everybody > 2.2.19 is working fine. Have you seen any problem report which does not happen with anon intensive workloads ? As far as I've noted, people usually report performance problems when running anon intensive workloads. For those cases, I'm pretty sure the swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_ for the current VM. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:55 ` Marcelo Tosatti @ 2001-09-18 5:32 ` Andrea Arcangeli 2001-09-18 4:14 ` Marcelo Tosatti 2001-09-18 5:00 ` Marcelo Tosatti 0 siblings, 2 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 5:32 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 12:55:46AM -0300, Marcelo Tosatti wrote: > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote: > > > > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > > > > > Don't you agree that your code can introduce new stability bugs ? > > > > > > > > not anything that can corrupt randomly your hd. > > > > > > Sure, the old code did not corrupt hd's randomly, did it? > > > > > > Let me redo the question: Don't you think the old stinky and slow code was > > > reasonably stable ? :) > > > > As said in the other email, just check 2.4 l-k reports of this week, > > last week etc.., I've lots of private reports too. While for everybody > > 2.2.19 is working fine. > > Have you seen any problem report which does not happen with anon intensive > workloads ? of course, all the mysql/postgres db reports I got were non anon intensive I assume, I assume they had enough ram, they all said 2.2 was fine. > As far as I've noted, people usually report performance problems when > running anon intensive workloads. For those cases, I'm pretty sure the > swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_ > for the current VM. I don't think that was the case, 2.2 has the same swap_out loop. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:32 ` Andrea Arcangeli @ 2001-09-18 4:14 ` Marcelo Tosatti 2001-09-18 5:59 ` Andrea Arcangeli 2001-09-18 5:00 ` Marcelo Tosatti 1 sibling, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 4:14 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 12:55:46AM -0300, Marcelo Tosatti wrote: > > > > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote: > > > > > > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > > > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > > > > > > Don't you agree that your code can introduce new stability bugs ? > > > > > > > > > > not anything that can corrupt randomly your hd. > > > > > > > > Sure, the old code did not corrupt hd's randomly, did it? > > > > > > > > Let me redo the question: Don't you think the old stinky and slow code was > > > > reasonably stable ? :) > > > > > > As said in the other email, just check 2.4 l-k reports of this week, > > > last week etc.., I've lots of private reports too. While for everybody > > > 2.2.19 is working fine. > > > > Have you seen any problem report which does not happen with anon intensive > > workloads ? > > of course, all the mysql/postgres db reports I got were non anon > intensive I assume, I assume they had enough ram, they all said 2.2 was > fine. > > > As far as I've noted, people usually report performance problems when > > running anon intensive workloads. For those cases, I'm pretty sure the > > swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_ > > for the current VM. > > I don't think that was the case, 2.2 has the same swap_out loop. Note that in 2.4 we scan pte's even if there is no free pages shortage, while in 2.2 we only scan pte's if there is a free page shortage. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 4:14 ` Marcelo Tosatti @ 2001-09-18 5:59 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 5:59 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 01:14:36AM -0300, Marcelo Tosatti wrote: > Note that in 2.4 we scan pte's even if there is no free pages > shortage, while in 2.2 we only scan pte's if there is a free page > shortage. That was most certainly a problem which is now fixed. Mainly the preallication of swap was a waste. But I think there was a kind of physical (not pte) overaging too that forbidden the vm to react properly. I believe when the storm of swap_out startups it won't matter any longer what we aged and whatever pte scanning we did previously. At least with the current swap_out. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:32 ` Andrea Arcangeli 2001-09-18 4:14 ` Marcelo Tosatti @ 2001-09-18 5:00 ` Marcelo Tosatti 1 sibling, 0 replies; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 5:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 12:55:46AM -0300, Marcelo Tosatti wrote: > > > > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > On Tue, Sep 18, 2001 at 12:33:15AM -0300, Marcelo Tosatti wrote: > > > > > > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > > > > > On Mon, Sep 17, 2001 at 11:53:10PM -0300, Marcelo Tosatti wrote: > > > > > > Don't you agree that your code can introduce new stability bugs ? > > > > > > > > > > not anything that can corrupt randomly your hd. > > > > > > > > Sure, the old code did not corrupt hd's randomly, did it? > > > > > > > > Let me redo the question: Don't you think the old stinky and slow code was > > > > reasonably stable ? :) > > > > > > As said in the other email, just check 2.4 l-k reports of this week, > > > last week etc.., I've lots of private reports too. While for everybody > > > 2.2.19 is working fine. > > > > Have you seen any problem report which does not happen with anon intensive > > workloads ? > > of course, all the mysql/postgres db reports I got were non anon > intensive I assume, I assume they had enough ram, they all said 2.2 was > fine. > > > As far as I've noted, people usually report performance problems when > > running anon intensive workloads. For those cases, I'm pretty sure the > > swap_out() loop is the fuckup: the swap allocation code is really a _CRAP_ > > for the current VM. > > I don't think that was the case, 2.2 has the same swap_out loop. Ok, back to the current allocation failure problem. In addition to the zone specific page hiding problem, I'm afraid threads can hide pages from themselves up to a point there is nothing to block on (even if we have just one zone). There is nothing which avoids that from happening in theory. Well, I'm going to sleep now. ^ permalink raw reply [flat|nested] 110+ messages in thread
[parent not found: <20010917211834.A31693@redhat.com>]
[parent not found: <20010918035055.J698@athlon.random>]
* Re: Linux 2.4.10-pre11 [not found] ` <20010918035055.J698@athlon.random> @ 2001-09-18 2:02 ` Andrea Arcangeli [not found] ` <20010917221653.B31693@redhat.com> 1 sibling, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 2:02 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 03:50:55AM +0200, Andrea Arcangeli wrote: > On Mon, Sep 17, 2001 at 09:18:34PM -0400, Benjamin LaHaise wrote: > > On Mon, Sep 17, 2001 at 05:08:37PM -0700, Linus Torvalds wrote: > > > > > > Ok, the big thing here is continued merging, this time with Andrea. > > > > > > I still don't like some of the VM changes, but integrating Andrea's VM > > > changes results in (a) better performance and (b) much cleaner inactive > > > page handling in particular. Besides, for the 2.4.x tree, the big priority > > > is stability, we can re-address my other concerns during 2.5.x. > > > > What is the idea behind merging complete and total utter CRAP into the > > tree? Lets take a few completely fubar'd hunks as examples. Let's start > > with fs/buffer.c: > > > > Point 1. It doesn't compile with egcs 1.1.2. Pretty remarkable for the only > > recommended stable compiler. > > Sorry, I use gcc 3.0.2, but I assume fixing the compilation trouble with > egcs 1.1.2 is trivial. Can you show which is the interested line of > code so we can rewrite it in a compatible manner? ah I just got a (verbose) report about one compilation trouble with old kernels, the 00_rwsem patch that includes this backwards compatibility hunk wasn't applied. This is a very very minor issue. Please Linus include this patch to fix the compilation troubles with the old compilers: diff -urN 2.4.10pre10/include/linux/compiler.h rwsem/include/linux/compiler.h --- 2.4.10pre10/include/linux/compiler.h Thu Jan 1 01:00:00 1970 +++ rwsem/include/linux/compiler.h Tue Sep 18 02:02:08 2001 @@ -0,0 +1,13 @@ +#ifndef __LINUX_COMPILER_H +#define __LINUX_COMPILER_H + +/* Somewhere in the middle of the GCC 2.96 development cycle, we implemented + a mechanism by which the user can annotate likely branch directions and + expect the blocks to be reordered appropriately. Define __builtin_expect + to nothing for earlier compilers. */ + +#if __GNUC__ == 2 && __GNUC_MINOR__ < 96 +#define __builtin_expect(x, expected_value) (x) +#endif + +#endif /* __LINUX_COMPILER_H */ Then you need to simply include <linux/compile.h> from the files that doesn't compile, this second patch is untested but it should do the trick (00_rwsem-?? was including compile.h from the rwsem includes that were in turn included by those files implicitly so I didn't need to add compile.h in each file): --- 2.4.10pre11/mm/slab.c.~1~ Tue Sep 18 02:43:04 2001 +++ 2.4.10pre11/mm/slab.c Tue Sep 18 04:00:30 2001 @@ -72,6 +72,7 @@ #include <linux/slab.h> #include <linux/interrupt.h> #include <linux/init.h> +#include <linux/compile.h> #include <asm/uaccess.h> /* --- 2.4.10pre11/mm/page_alloc.c.~1~ Tue Sep 18 02:43:04 2001 +++ 2.4.10pre11/mm/page_alloc.c Tue Sep 18 04:00:20 2001 @@ -17,6 +17,7 @@ #include <linux/pagemap.h> #include <linux/bootmem.h> #include <linux/slab.h> +#include <linux/compile.h> int nr_swap_pages; int nr_active_pages; --- 2.4.10pre11/mm/vmscan.c.~1~ Tue Sep 18 02:43:04 2001 +++ 2.4.10pre11/mm/vmscan.c Tue Sep 18 04:00:47 2001 @@ -21,6 +21,7 @@ #include <linux/init.h> #include <linux/highmem.h> #include <linux/file.h> +#include <linux/compile.h> #include <asm/pgalloc.h> Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
[parent not found: <20010917221653.B31693@redhat.com>]
* Re: Linux 2.4.10-pre11 [not found] ` <20010917221653.B31693@redhat.com> @ 2001-09-18 2:27 ` Linus Torvalds 2001-09-18 3:14 ` Alan Cox 2001-09-18 3:26 ` Andrea Arcangeli [not found] ` <20010918052201.N698@athlon.random> 1 sibling, 2 replies; 110+ messages in thread From: Linus Torvalds @ 2001-09-18 2:27 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Andrea Arcangeli, Kernel Mailing List On Mon, 17 Sep 2001, Benjamin LaHaise wrote: > > __builtin_expect. It just goes to show that you haven't bothered to get any > feedback on this patch which is significantly invasive before merging it. > This is completely unreasonable behaviour for a kernel series which is > supposed to be stable. No, it goes to show that I didn't actually apply all of Andrea's patches. I left out a number of patches that I didn't see the point to, or that I just plain disagreed with. And some of the patches had some non-obvious dependencies, especially as I have a reasonably recent compiler.. > The code in question does not attempt to explain itself at all. Your release > notes did not explain what changes you did at all. Nor are your patches > split up into reasonable chunks of functionality that can be evaluated based > on their individual merit. All or nothing is *not* the approach that should > be taken at present. (Hint, stability is acheived gradually.) Actually, many of them _are_ split up, much more so than Alan's public patches are (now, Alan in private splits up the patches really well, so for me it tends to be even easier to apply Alans patches than Andreas, but as with Alan, my one-liner "merged with xxxx" doesn't go into the detail that Andrea and others actually do have). Now, the VM part of the patch was certainly fairly big. I doubt much of it could have been reasonably split up, though. > No, what I'm deeply concerned with is the nature of patch merging. There was > no obvious public testing of the changes before merging, no warning of it, the > patch contained obvious bogus chunks and many unrelated changes. I've never > seen as invasive a patch merged that ran the risk of completely torpedoing > stability merged into a STABLE KERNEL SERIES, nor would I ever consider > submitting such a patch. There are bug fixes that are outstanding in -ac > that aren't being merged to -linus, yet this completely untested pile of messy > code is merged without hesitation? Without hesitation? Hardly. The bug fixes in -ac that aren't merged into -linus are that way BECAUSE NOBODY HAS SENT ME MERGES. Alan works on it quite intensively, but the fact is, that for the -ac merge, Alan seems to be able to merge it slower than -ac grows. Which is why I actually started asking people to merge their parts from -ac into -linus to help Alan. That's how the other merge in -pre11 happened. The aa tree has existed for a long time, and is actually mainted in better chunks than the -ac tree, which makes it easier to merge. Admittedly my and Alans tastes are often closer than mine and Andreas, which means that the -aa tree merges are more painful for _me_ ;) Linus ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 2:27 ` Linus Torvalds @ 2001-09-18 3:14 ` Alan Cox 2001-09-18 3:26 ` Andrea Arcangeli 1 sibling, 0 replies; 110+ messages in thread From: Alan Cox @ 2001-09-18 3:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Benjamin LaHaise, Andrea Arcangeli, Kernel Mailing List > The bug fixes in -ac that aren't merged into -linus are that way BECAUSE > NOBODY HAS SENT ME MERGES. Actually some of them I've sent you many times. The tlb fix I have on my list of "dont bother" for example - which means I got bored of sending it. > Alan works on it quite intensively, but the fact is, that for the -ac > merge, Alan seems to be able to merge it slower than -ac grows. Which is > why I actually started asking people to merge their parts from -ac into > -linus to help Alan. That's how the other merge in -pre11 happened. I've been trying to ensure I feed stuff in testable chunks. For example I dont want vfs and scsi changes both in a Linus merge because someone is bound to go "hey I got corruption" and then with both in one merge we are screwed Alan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 2:27 ` Linus Torvalds 2001-09-18 3:14 ` Alan Cox @ 2001-09-18 3:26 ` Andrea Arcangeli 1 sibling, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 3:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Benjamin LaHaise, Kernel Mailing List On Mon, Sep 17, 2001 at 07:27:31PM -0700, Linus Torvalds wrote: > Now, the VM part of the patch was certainly fairly big. I doubt much of it > could have been reasonably split up, though. Yes, I considered doing that but it would been quite a pain to develop it incrementally (so I thought if needed I would have splitted it later). Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
[parent not found: <20010918052201.N698@athlon.random>]
* Re: Linux 2.4.10-pre11 [not found] ` <20010918052201.N698@athlon.random> @ 2001-09-18 4:01 ` Benjamin LaHaise 2001-09-18 4:39 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Benjamin LaHaise @ 2001-09-18 4:01 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 05:22:01AM +0200, Andrea Arcangeli wrote: > try compiling 2.4.10pre10aa1 with egcs 1.1.2 before complaing about lack > of testing, the tests were run on 2.4.10pre10aa1, not on 2.4.10pre11 > that didn't even existed at that time. If I spent the time testing every random kernel patch that comes across, then I wouldn't have any time left to develop other useful things. What I see is that what was merged didn't deal with things sanely. > So I'm not surprised by the fact you weren't horrified by the previous > VM that was looping forever in such a case. unfortunately not everybody > out there (mainly simulations) knows how much memory they will alloc. Every single kernel since the dawn of 1.0 has died under OOM. Optimizing for it seems rather stupid when a well maintained box WON'T OOM. > > Then why wasn't it added to the generic code? > > see above. So merging crap is now acceptable? Great, I'll write messy code from now on. > > on their individual merit. All or nothing is *not* the approach that should > > be taken at present. (Hint, stability is acheived gradually.) > > You can find a very big split if you check into the 2.4.10pre10aa1/ > directory in my ftp area, of course if you look at 2.4.10pre11 it is all > at once. Yes, all of the vm patches are in one big 3300 line patch, which is about the same size as the entire aio patch which adds a new subsystem. > For the vm changes properly splitting them it's quite a pain, mainly due > the way they're developed. For everything that is not a pain to keep > separated I try to do so. Bull shit. People do it every day because Linus dictates it so. There is nothing magical in you patch that can't be split up into human-readable chunks. > > > again and you'll see, as said 2.2.19 does the same write throttling. > > > > And it does so in a fashion which is completely broken. > > Then send Alan the fix against 2.2.20pre10 if it's completly broken. I > never got a complain about it yet and I look forward to see your fix. 2.2 doesn't matter any more. Any work I'm doing now is 2.4 based. > > > Then make it a seperate patch. It shouldn't be part of the do-everything > > Please get real, it is a 3 thousand lines patch, if for every single > change like this (orthogonal with the rest) I should make 1 patch it > would take hours just to run the diffs. Not to tell if then I make a > controversial changes. I am being real. I don't expect single massive patches to ever be applied, and am shocked I've even had to comment on this. > What kind of workload is applied during the measurement, and what > NULL/MEM/FILE_IO/PROCESS means? Is this benchmark available somewhere? > What fields are you looking at exactly? I can see some field going up > and low and I cannot evaluate those numbers very well without more info. Go do the work yourself. If you can't be bothered to develop benchmarks that simulate users on the vm subsystem, how can you claim your vm patches are correct? > Except for the vm rewrite there was extensive testing. Probably not from > RedHat Inc if this is the point that you are making. And the vm rewrite > is so obviously better and much cleaner that I think this was a fine > integration. I'm pretty sure no stability problem can arise because of > the vm integration, if there could be problems as worse they could be > specific to slower performance in some workload which we can address > over the time without any pain. anyways as said dbench runs exactly > twice faster so it cannot be obviously wrong in terms of performance > either. So the vm rewrite wasn't well tested, but it should be merged into a stable kernel? Please say it ain't so. > Linus merged everything without asking me and I think his move was very > good IMHO. Face it: users cannot live anymore with workloads slowing > down at every runs of benchmarks, with kswapd looping forever, with > swapout storms, with swap filled by unused swapcache etc... What I did > is the less intrusive change to the code that could at least address > those critical problems and I'm pretty sure it did, as worse as said we > may need some little tweak over the time but the new infrastructure > should be robust enough now. I want robust and not likely to corrupt my data randomly. The latter is more important to me and everyone else in the world, so I think you should play by rules that enforce that. > Ben, the true mess is the 2.4 VM before 2.4.10pre11. Period. Sure. > If you > check the new code you'll see how much cleaner it is. Hardly. You don't even define your basic terms. WTF is a classzone? A comment somewhere would be nice. > Those locking changes are bugfixes. Try to growsdown on a file backed > vma with 2.4.9 using two threads and you'll run into troubles eventually > (see the thread on l-k). That isn't the one I'm talking about. You changed the swapcache code. That code is fragile. These changes aren't documented. > Just in case it wasn't obvious I do things gradually and in public, > check ftp.kernel.org. If something you're the one that doesn't care > about what I do in public. The vm rewrite was not posted in public, nor described in public. It just appeared and got merged. Could you at least describe *ALL* of the changes? > I recall I also described you some of the O_DIRECT > stuff and blkdev pagecache work on Ottawa a few months ago. I get lots > of feedback and also bugfixes from many places, to mention one company > that cares about what I do in public there's IBM, they did auditing and > I got several O_DIRECT small fixes from them incrementally to my very > open patches. I think -aa I'm even more open than -ac since I keep all > the stuff separated so it's much easier to understand and pick the > interesting parts. I do my best effort to make every patch self > contained just for the purpose of making merging and auditing easier, > and it payed off so far. Furthmore it's not just in public but also in > production, for istsance all SuSE server users happily runs things like > O_DIRECT and blkdev in pagecache for several months now. The latter was > needed from a few I/O intensive applications using blkdevs as files and > needing heavy readhead for exploiting the full bandwith of many-ways > raid arrays for example. The only other way would been to basically > duplicate the readahead of filemap.c in block_dev.c, the hack was as > complex as the long term fix. And we agreed that this is 2.5 material. > > Not at the expense of stability. Do it in 2.5, or do it gradually. > > Ben, instead of writing complains could you grab 2.4.10pre11, put it on > your desktop while doing your daily work and let me know when you run > into stability troubles? I could have said the same thing of the ext2 > directory metadata in pagecache, I grabbed it, put it on my desktop, it > never given me a problem so far so I didn't complained. And the directory > in pagecache wasn't fixing _any_ problem for the end user, here we're > instead fixing _real_life_ showstopper problems with mysql, postgres (I > wonder you care about postrgres these days?) and the other db not using > rawio+lvm, so this was kind of needed in 2.4 eventually anyways if not > today (I was flooded by bugreports of such kind, I'm sure you got them > too). So the earlier the better. I think users would been not happy to > wait until 2.6. I'm sorry, but I'm not going to run it right now since I have other priorities. I'm sure users will post their pass/fails over the next few days, but until I see some evidence that it doesn't eat my data, I'll hold back. -ben ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 4:01 ` Benjamin LaHaise @ 2001-09-18 4:39 ` Andrea Arcangeli 2001-09-18 5:04 ` Alan Cox 2001-09-18 5:22 ` Benjamin LaHaise 0 siblings, 2 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 4:39 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 12:01:32AM -0400, Benjamin LaHaise wrote: > Every single kernel since the dawn of 1.0 has died under OOM. Optimizing for try 2.2 once. > 2.2 doesn't matter any more. Any work I'm doing now is 2.4 based. It still matters for me. Critical servers with very high vm loads still have to run 2.2 to be stable and fast unfortunately. > I am being real. I don't expect single massive patches to ever be applied, > and am shocked I've even had to comment on this. Your aio patch is massive too. andrea@athlon:~ > wc -l aio-v2.4.0-20010123.diff 2951 aio-v2.4.0-20010123.diff Now if you think I'm unreal and you are real, feed me the aio patch in self contained pieces of 10 lines each as you expect from me. And note that if they're not self contained they will just make my life harder. I'd be glad to be proved wrong and to get aio from you in small self contained pieces really, I planned to look into aio as one of the next things to merge in -aa but as usual the size of the patch makes things harder to merge due the larger implications. feel free to cc l-k, I'm sure other people is interested in aio too. > I want robust and not likely to corrupt my data randomly. The latter is more Forget the corruption. So far the only scary report I had is from Marcelo's 2G machine which is nothin compared to corruption, I don't have x86 machines with more than 1G, I tested alpha with 3G (but it has only 1 zone). I think Marcelo identified the problematic part before even testing it, so the fix should be fairly immediate, I'll address it ASAP unless he beats me on it (at the moment I'm still resynching). > That isn't the one I'm talking about. You changed the swapcache code. That > code is fragile. These changes aren't documented. I didn't changed the swapcache locking rules. I only fixed the VM to properly clear the dirty bit before freeing a page. Anybody freeing a page that is dirty was a plain vm bug. That was quite strightforward and correct change. Infact I was horrified by seeing __free_pages_ok clearing the dirty bit (not to talk about the referenced bit which was useless to change). > The vm rewrite was not posted in public, nor described in public. It just It obviously was. How do you think Linus got it? I said I didn't sent it to Linus privately. > appeared and got merged. Could you at least describe *ALL* of the changes? I'll be glad to do that over the time, right now I'm strict in time and I also needed to go to sleep a few hours ago so I won't inline the reply to this email right now, sorry. > And we agreed that this is 2.5 material. the O_DIRECT and blkdev in pagecache yes but definitely not the VM one but people needed those features in production anyways so that was good and they were well tested. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 4:39 ` Andrea Arcangeli @ 2001-09-18 5:04 ` Alan Cox 2001-09-18 5:09 ` Andrea Arcangeli 2001-09-18 5:22 ` Benjamin LaHaise 1 sibling, 1 reply; 110+ messages in thread From: Alan Cox @ 2001-09-18 5:04 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Benjamin LaHaise, Linus Torvalds, Kernel Mailing List > > And we agreed that this is 2.5 material. > > the O_DIRECT and blkdev in pagecache yes but definitely not the VM one > but people needed those features in production anyways so that was good > and they were well tested. The O_DIRECT stuff is very clean - its definitely a feature that should have gone into 2.5 first and then back, but its one that really doesn't bother me too much. blkdev in page cache needs some locking thinking but looks ok. Alan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:04 ` Alan Cox @ 2001-09-18 5:09 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 5:09 UTC (permalink / raw) To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 06:04:28AM +0100, Alan Cox wrote: > The O_DIRECT stuff is very clean - its definitely a feature that should > have gone into 2.5 first and then back, but its one that really doesn't > bother me too much. blkdev in page cache needs some locking thinking but > looks ok. agreed. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 4:39 ` Andrea Arcangeli 2001-09-18 5:04 ` Alan Cox @ 2001-09-18 5:22 ` Benjamin LaHaise 2001-09-18 5:48 ` Andrea Arcangeli 1 sibling, 1 reply; 110+ messages in thread From: Benjamin LaHaise @ 2001-09-18 5:22 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 06:39:10AM +0200, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 12:01:32AM -0400, Benjamin LaHaise wrote: > > Every single kernel since the dawn of 1.0 has died under OOM. Optimizing for > > try 2.2 once. There are still loads that 2.2 can die under (think fast network cards and you'll realise that you can't protect against it). Yes, 2.2 is in far better state than anything else ever has been, but it's not perfect. > > 2.2 doesn't matter any more. Any work I'm doing now is 2.4 based. > > It still matters for me. Critical servers with very high vm loads still > have to run 2.2 to be stable and fast unfortunately. That's where I want 2.4 to get to. > > I am being real. I don't expect single massive patches to ever be applied, > > and am shocked I've even had to comment on this. > > Your aio patch is massive too. > > andrea@athlon:~ > wc -l aio-v2.4.0-20010123.diff > 2951 aio-v2.4.0-20010123.diff > > Now if you think I'm unreal and you are real, feed me the aio patch in > self contained pieces of 10 lines each as you expect from me. And note > that if they're not self contained they will just make my life harder. Not 10 lines, but several hundred here and there. Aio actually splits up very well since thing like the wait_queue changes are all isolated bits of functionality. Even the brw_kiovec_async bit is standalone since it only matters to itself and brw_kiovec. Yes, the current patch is not seperated, but that was my plan from the beginning on merging. > I'd be glad to be proved wrong and to get aio from you in small self > contained pieces really, I planned to look into aio as one of the next > things to merge in -aa but as usual the size of the patch makes things > harder to merge due the larger implications. feel free to cc l-k, I'm > sure other people is interested in aio too. It's not ready yet. Most of the development has been on hold waiting for 2.5 to start for cementing the ABI in stone. > > I want robust and not likely to corrupt my data randomly. The latter is more > > Forget the corruption. So far the only scary report I had is from > Marcelo's 2G machine which is nothin compared to corruption, I don't > have x86 machines with more than 1G, I tested alpha with 3G (but it has > only 1 zone). I think Marcelo identified the problematic part before > even testing it, so the fix should be fairly immediate, I'll address it > ASAP unless he beats me on it (at the moment I'm still resynching). Sure. That's why it should remain seperate for at least a little bit. > > > That isn't the one I'm talking about. You changed the swapcache code. That > > code is fragile. These changes aren't documented. > > I didn't changed the swapcache locking rules. I only fixed the VM to > properly clear the dirty bit before freeing a page. Anybody freeing a > page that is dirty was a plain vm bug. That was quite strightforward and > correct change. Infact I was horrified by seeing __free_pages_ok > clearing the dirty bit (not to talk about the referenced bit which was > useless to change). So why not split that fix out as a seperate patch that can then be applied alone? > > > The vm rewrite was not posted in public, nor described in public. It just > > It obviously was. How do you think Linus got it? I said I didn't sent it > to Linus privately. *nod* > > appeared and got merged. Could you at least describe *ALL* of the changes? > > I'll be glad to do that over the time, right now I'm strict in time and > I also needed to go to sleep a few hours ago so I won't inline the reply > to this email right now, sorry. Thanks! I think we'll all be a bit calmer in the morning and better able to think clearly. I'll even try to break things down a bit as there are bits in your patches which I think are very good. Cheers, -ben ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:22 ` Benjamin LaHaise @ 2001-09-18 5:48 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 5:48 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 01:22:16AM -0400, Benjamin LaHaise wrote: > That's where I want 2.4 to get to. yes, me too, we agree. And what I faced is that I couldn't do that without first removing the broken mess. It wasn't possible to just make a two liner to fix it. > So why not split that fix out as a seperate patch that can then be applied > alone? It's not as simple as it seems, I noticed such a bug in the middle of the developement and so it was controversial to fix, I should have fixed the same bug first in the new VM then in the old vm, only the free_pages_ok change would been non controversial but such one alone would been useless since it wouldn't be self contained. so I did it for my vm and I never ended splitting it off because it was non trivial work. > Thanks! I think we'll all be a bit calmer in the morning and better able to never mind. > think clearly. I'll even try to break things down a bit as there are bits > in your patches which I think are very good. Great, thanks! Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 0:08 Linux 2.4.10-pre11 Linus Torvalds 2001-09-17 23:17 ` Marcelo Tosatti [not found] ` <20010917211834.A31693@redhat.com> @ 2001-09-18 5:48 ` Andrew Morton 2001-09-18 6:11 ` Andrea Arcangeli 2001-09-18 10:58 ` Martin Dalecki 2001-09-18 9:31 ` Alexander Viro 3 siblings, 2 replies; 110+ messages in thread From: Andrew Morton @ 2001-09-18 5:48 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli Linus Torvalds wrote: > > Ok, the big thing here is continued merging, this time with Andrea. > In one test here the VM changes seem fragile, and slower. Dual x86, 512 megs RAM, 512 megs swap. No highmem. The workload is: while true do /usr/src/ext3/tools/usemem 300 done (This just mallocs 300 megs, touches it then exits) in parallel with time /usr/src/ext3/tools/bash-shared-mapping -n 5 -t 3 foo 300000000 on ext2. (bash-shared-mapping is a tool which I wrote for ext3. It's one of the most aggressive VM/MM stress testers around, and has found a number of kernel bugs). On 2.4.9-ac10, the b-s-m run took 294 seconds. On 2.4.10-pre11 it took 330 seconds DESPITE the fact that one of the b-s-m instances was oom-killed quite early in the test. `vmstat' took about thirty seconds to start (this is usual), but was promptly killed, despite having (presumably) a small RSS. Instances of `usemem' were oom-killed quite frequently. In 2.4.9-ac10, nothing was oom-killed. With a gig of VM and a 600 meg working set I don't see why it's necessary to kill processes? Each oom-kill was associated with a 0-order allocation failure. These tools are available at http://www.uow.edu.au/~andrewm/ext3-tools.tar.gz OK, so this was a seriously loony workload. But suddenly, the number of people who understand the Linux VM has gone from maybe 10 down to just one-and-a-bit. A large number of comments have been removed, and a year's worth of discussion has been invalidated. Andrea, it would be most useful if you were to spend some time (say, four hours) commenting the code and telling us how it works. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:48 ` Andrew Morton @ 2001-09-18 6:11 ` Andrea Arcangeli 2001-09-18 5:02 ` Marcelo Tosatti 2001-09-18 10:58 ` Martin Dalecki 1 sibling, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 6:11 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, Kernel Mailing List On Mon, Sep 17, 2001 at 10:48:34PM -0700, Andrew Morton wrote: > Linus Torvalds wrote: > > > > Ok, the big thing here is continued merging, this time with Andrea. > > > > In one test here the VM changes seem fragile, and slower. > > Dual x86, 512 megs RAM, 512 megs swap. No highmem. > > The workload is: > > while true > do > /usr/src/ext3/tools/usemem 300 > done > > (This just mallocs 300 megs, touches it then exits) > > in parallel with > > time /usr/src/ext3/tools/bash-shared-mapping -n 5 -t 3 foo 300000000 > > on ext2. > > (bash-shared-mapping is a tool which I wrote for ext3. It's one of the > most aggressive VM/MM stress testers around, and has found a number of > kernel bugs). > > On 2.4.9-ac10, the b-s-m run took 294 seconds. On 2.4.10-pre11 it > took 330 seconds DESPITE the fact that one of the b-s-m instances > was oom-killed quite early in the test. > > `vmstat' took about thirty seconds to start (this is usual), but > was promptly killed, despite having (presumably) a small RSS. Instances > of `usemem' were oom-killed quite frequently. In 2.4.9-ac10, nothing > was oom-killed. should be the very same problem identified by Marcelo. I'm wondering why I didn't reproduced here during testing, 512mbytes is not highmem and my desktop has 512mbytes too and it didn't killed anything yet. As for the slowdown there are a few localized places to look at. but let's fix the oom first. > Andrea, it would be most useful if you were to spend some time (say, four hours) > commenting the code and telling us how it works. I will do that and I'll answer to any question ASAP, no panic. We may even want to change part of the core logic over time if it doesn't perform wonderfully. But first prio at the moment is to fix the oom you also noticed (the "hiding" logic mentioned by Marcelo), the rest of the changes should be a very solid base for a very solid and fast 2.4 vm. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 6:11 ` Andrea Arcangeli @ 2001-09-18 5:02 ` Marcelo Tosatti 2001-09-18 6:40 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 5:02 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Mon, Sep 17, 2001 at 10:48:34PM -0700, Andrew Morton wrote: > > Linus Torvalds wrote: > > > > > > Ok, the big thing here is continued merging, this time with Andrea. > > > > > > > In one test here the VM changes seem fragile, and slower. > > > > Dual x86, 512 megs RAM, 512 megs swap. No highmem. > > > > The workload is: > > > > while true > > do > > /usr/src/ext3/tools/usemem 300 > > done > > > > (This just mallocs 300 megs, touches it then exits) > > > > in parallel with > > > > time /usr/src/ext3/tools/bash-shared-mapping -n 5 -t 3 foo 300000000 > > > > on ext2. > > > > (bash-shared-mapping is a tool which I wrote for ext3. It's one of the > > most aggressive VM/MM stress testers around, and has found a number of > > kernel bugs). > > > > On 2.4.9-ac10, the b-s-m run took 294 seconds. On 2.4.10-pre11 it > > took 330 seconds DESPITE the fact that one of the b-s-m instances > > was oom-killed quite early in the test. > > > > `vmstat' took about thirty seconds to start (this is usual), but > > was promptly killed, despite having (presumably) a small RSS. Instances > > of `usemem' were oom-killed quite frequently. In 2.4.9-ac10, nothing > > was oom-killed. > > should be the very same problem identified by Marcelo. I'm wondering why > I didn't reproduced here during testing, 512mbytes is not highmem and my > desktop has 512mbytes too and it didn't killed anything yet. As for the > slowdown there are a few localized places to look at. but let's fix the > oom first. Try to run several memory hungry threads (thus hiding more pages). ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:02 ` Marcelo Tosatti @ 2001-09-18 6:40 ` Andrea Arcangeli 2001-09-18 16:06 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 6:40 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 02:02:37AM -0300, Marcelo Tosatti wrote: > Try to run several memory hungry threads (thus hiding more pages). I did that indeed, not sure why I didn't reproduced (I guess the hogs were sligthly different). (and of course they were killed but only after the box was truly oom) Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 6:40 ` Andrea Arcangeli @ 2001-09-18 16:06 ` Marcelo Tosatti 2001-09-18 19:18 ` Marcelo Tosatti 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 16:06 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 02:02:37AM -0300, Marcelo Tosatti wrote: > > Try to run several memory hungry threads (thus hiding more pages). > > I did that indeed, not sure why I didn't reproduced (I guess the hogs > were sligthly different). (and of course they were killed but only after > the box was truly oom) Easy reproducible way: Boot with 64M, start 4 setiathome processes, then start mtest (from memtest suite) with a lot of threads (I'm using 40 readers and 4 writers in this case, and a 100MB heap) [root@matrix memtest-0.0.4]# ./mtest -m 100 -r 40 -w 4 Starting test run with 100 megabyte heap. Setting up 25600 4096kB pages for test... done. Child 00 started with pid 00935, readonly Child 01 started with pid 00936, readonly Child 02 started with pid 00937, readonly Child 03 started with pid 00938, readonly Child 04 started with pid 00939, readonly Sep 18 14:28:31 matrix kernel: __alloc_pages: 0-order allocation failed (gfp=0x1d2/0) from c012c3be Sep 18 14:28:31 matrix kernel: VM: killing process setiathome ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 16:06 ` Marcelo Tosatti @ 2001-09-18 19:18 ` Marcelo Tosatti 2001-09-18 21:05 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Marcelo Tosatti @ 2001-09-18 19:18 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Marcelo Tosatti wrote: > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > On Tue, Sep 18, 2001 at 02:02:37AM -0300, Marcelo Tosatti wrote: > > > Try to run several memory hungry threads (thus hiding more pages). > > > > I did that indeed, not sure why I didn't reproduced (I guess the hogs > > were sligthly different). (and of course they were killed but only after > > the box was truly oom) > > Easy reproducible way: > > Boot with 64M, start 4 setiathome processes, then start mtest (from > memtest suite) with a lot of threads (I'm using 40 readers and 4 writers > in this case, and a 100MB heap) > > [root@matrix memtest-0.0.4]# ./mtest -m 100 -r 40 -w 4 > Starting test run with 100 megabyte heap. > Setting up 25600 4096kB pages for test... > done. > Child 00 started with pid 00935, readonly > Child 01 started with pid 00936, readonly > Child 02 started with pid 00937, readonly > Child 03 started with pid 00938, readonly > Child 04 started with pid 00939, readonly > > > Sep 18 14:28:31 matrix kernel: __alloc_pages: 0-order allocation failed > (gfp=0x1d2/0) from c012c3be > Sep 18 14:28:31 matrix kernel: VM: killing process setiathome Andrea, I still can reproduce the alloc pages failures with the following patch. It should remove the "hidden pages" problem WRT ZONES. That is, a lot of threads can still hide pages from each other independantly of the zone hiding problem. --- linux.orig/mm/vmscan.c Tue Sep 18 15:43:14 2001 +++ linux/mm/vmscan.c Tue Sep 18 16:37:52 2001 @@ -361,13 +361,19 @@ } deactivate_page_nolock(page); + list_del(entry); - list_add_tail(entry, &inactive_local_lru); - if (__builtin_expect(!memclass(page->zone, classzone), 0)) + if (__builtin_expect(!memclass(page->zone, classzone), 0)) { + list_add_tail(entry, &inactive_list); + __max_scan--; continue; + } __max_scan--; + + list_add_tail(entry, &inactive_local_lru); + /* Racy check to avoid trylocking when not worthwhile */ if (!page->buffers && page_count(page) != 1) { ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 19:18 ` Marcelo Tosatti @ 2001-09-18 21:05 ` Andrea Arcangeli 2001-09-19 13:57 ` Rik van Riel 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 21:05 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Andrew Morton, Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 04:18:34PM -0300, Marcelo Tosatti wrote: > I still can reproduce the alloc pages failures with the following patch. Could you try the very last patch I posted? > --- linux.orig/mm/vmscan.c Tue Sep 18 15:43:14 2001 > +++ linux/mm/vmscan.c Tue Sep 18 16:37:52 2001 > @@ -361,13 +361,19 @@ > } > > deactivate_page_nolock(page); > + > list_del(entry); > - list_add_tail(entry, &inactive_local_lru); > > - if (__builtin_expect(!memclass(page->zone, classzone), 0)) > + if (__builtin_expect(!memclass(page->zone, classzone), > 0)) { > + list_add_tail(entry, &inactive_list); > + __max_scan--; > continue; > + } > > __max_scan--; > + > + list_add_tail(entry, &inactive_local_lru); > + I actually used a much more aggressive approch, I always left all the page visibles now, not just for the memclass check. After all the same issue that can arise with the memclass check can also arise in a smaller scale with the other failed checks. But the oom weren't just because of the "hiding", I suspect that infact. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 21:05 ` Andrea Arcangeli @ 2001-09-19 13:57 ` Rik van Riel 0 siblings, 0 replies; 110+ messages in thread From: Rik van Riel @ 2001-09-19 13:57 UTC (permalink / raw) To: Andrea Arcangeli Cc: Marcelo Tosatti, Andrew Morton, Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 04:18:34PM -0300, Marcelo Tosatti wrote: > > I still can reproduce the alloc pages failures with the following patch. > > Could you try the very last patch I posted? Marcelo told you how to reproduce the bug, is there any particular reason you're not testing yourself? ;) Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 5:48 ` Andrew Morton 2001-09-18 6:11 ` Andrea Arcangeli @ 2001-09-18 10:58 ` Martin Dalecki 1 sibling, 0 replies; 110+ messages in thread From: Martin Dalecki @ 2001-09-18 10:58 UTC (permalink / raw) To: Andrew Morton; +Cc: Linus Torvalds, Kernel Mailing List, Andrea Arcangeli Andrew Morton wrote: > OK, so this was a seriously loony workload. But suddenly, the number of people > who understand the Linux VM has gone from maybe 10 down to just one-and-a-bit. > A large number of comments have been removed, and a year's worth of > discussion has been invalidated. No I disagree if I look at the patches. The number of people went down from 10 fumbeling around without a clue to two knowing what they are doing I think ;-). IMHO this is increasing the chance that the problems will finally get sorted out. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 0:08 Linux 2.4.10-pre11 Linus Torvalds ` (2 preceding siblings ...) 2001-09-18 5:48 ` Andrew Morton @ 2001-09-18 9:31 ` Alexander Viro 2001-09-18 9:39 ` Andrea Arcangeli 2001-09-18 16:45 ` Linus Torvalds 3 siblings, 2 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-18 9:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kernel Mailing List, Andrea Arcangeli On Mon, 17 Sep 2001, Linus Torvalds wrote: > This also merges the blkdev in page cache patch, and that will hopefully > make it noticeably easier to do the "do bread() with page cache too", at > which point a lot of the current ugly synchronization issues will go away. Umm... Linus, had you actually read through the fs/block_device.c part of that? It's not just ugly as hell, it's (AFAICS) not hard to oops if you have several inodes sharing major:minor. ->bd_inode and its treatment are bogus. Please, read it through and consider reverting - in its current state code is an ugly mess. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 9:31 ` Alexander Viro @ 2001-09-18 9:39 ` Andrea Arcangeli 2001-09-18 9:44 ` Alexander Viro 2001-09-18 16:45 ` Linus Torvalds 1 sibling, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 9:39 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 05:31:48AM -0400, Alexander Viro wrote: > > > On Mon, 17 Sep 2001, Linus Torvalds wrote: > > > This also merges the blkdev in page cache patch, and that will hopefully > > make it noticeably easier to do the "do bread() with page cache too", at > > which point a lot of the current ugly synchronization issues will go away. > > Umm... Linus, had you actually read through the fs/block_device.c part > of that? It's not just ugly as hell, it's (AFAICS) not hard to oops > if you have several inodes sharing major:minor. ->bd_inode and its can you show an exploit? I cannot reproduce any problem here: root@athlon:/tmp > cp -a /dev/hda hda.1 root@athlon:/tmp > cp -a /dev/hda hda.2 root@athlon:/tmp > cp hda.1 /dev/null & cp hda.2 /dev/null & [1] 24981 [2] 24982 root@athlon:/tmp > fg cp hda.2 /dev/null root@athlon:/tmp > fg cp hda.1 /dev/null root@athlon:/tmp > > treatment are bogus. Please, read it through and consider reverting - > in its current state code is an ugly mess. what other design solution do you propose rather both inodes sharing the i_mapping across the different inodes like I did? I found more handy to just bump the i_count of the first inode and referencing it from the bd_inode, rather than dynamically allocating the i_mapping and have a bd_mapping, but if you prefer to dynamically allocate the i_mapping rather than using the i_data of the fist inode we can change that of course. Not sure what's the mess in the patch you're talking about, could you elaborate? Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 9:39 ` Andrea Arcangeli @ 2001-09-18 9:44 ` Alexander Viro 2001-09-18 9:57 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-18 9:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > what other design solution do you propose rather both inodes sharing the > i_mapping across the different inodes like I did? > > I found more handy to just bump the i_count of the first inode and > referencing it from the bd_inode, rather than dynamically allocating the > i_mapping and have a bd_mapping, but if you prefer to dynamically > allocate the i_mapping rather than using the i_data of the fist inode we > can change that of course. Not sure what's the mess in the patch you're > talking about, could you elaborate? Bumping ->i_count on inode is _not_ an option - think what it does if you umount the first fs. _If_ you need an inode for block_device - allocate a new one instead of reusing the inode that had been passed to ->open(). ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 9:44 ` Alexander Viro @ 2001-09-18 9:57 ` Andrea Arcangeli 2001-09-18 10:02 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 9:57 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 05:44:18AM -0400, Alexander Viro wrote: > Bumping ->i_count on inode is _not_ an option - think what it does if > you umount the first fs. what it does? Unless I'm missing something the fs never cares and never sees the bd_inode. the fs just does a bdget and then it works only on the bdev. What should I run to get the oops exactly? > _If_ you need an inode for block_device - allocate a new one instead of > reusing the inode that had been passed to ->open(). If we need to avoid the bumping of i_count and to allocate something dynamically that will be the bd_mapping address space, we don't need a new fake_inode there too, we just need to share the new physical pagecahce address space. Such physical i_mapping address space is the same thing that the buffer cache will have to use to map its legacy buffer cache buffer headers on top of it (then we can cleanup away the few lines in blkdev_close that do the update_buffers() and checks the MS_RDONLY bit). Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 9:57 ` Andrea Arcangeli @ 2001-09-18 10:02 ` Alexander Viro 2001-09-18 10:17 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-18 10:02 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > On Tue, Sep 18, 2001 at 05:44:18AM -0400, Alexander Viro wrote: > > Bumping ->i_count on inode is _not_ an option - think what it does if > > you umount the first fs. > > what it does? Unless I'm missing something the fs never cares and never > sees the bd_inode. the fs just does a bdget and then it works only on > the bdev. What should I run to get the oops exactly? It sees an active inode for superblock we are destroying. _Not_ a good thing, for very obvious reasons. There is a reason for "Self-destruct in 5 seconds" printk... > If we need to avoid the bumping of i_count and to allocate something > dynamically that will be the bd_mapping address space, we don't need a > new fake_inode there too, we just need to share the new physical > pagecahce address space. Such physical i_mapping address space is the What are you going to use as mapping->host for it? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 10:02 ` Alexander Viro @ 2001-09-18 10:17 ` Andrea Arcangeli 2001-09-18 10:28 ` Alexander Viro ` (2 more replies) 0 siblings, 3 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 10:17 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 06:02:27AM -0400, Alexander Viro wrote: > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > On Tue, Sep 18, 2001 at 05:44:18AM -0400, Alexander Viro wrote: > > > Bumping ->i_count on inode is _not_ an option - think what it does if > > > you umount the first fs. > > > > what it does? Unless I'm missing something the fs never cares and never > > sees the bd_inode. the fs just does a bdget and then it works only on > > the bdev. What should I run to get the oops exactly? > > It sees an active inode for superblock we are destroying. _Not_ a good > thing, for very obvious reasons. There is a reason for "Self-destruct in > 5 seconds" printk... > very clear now, thanks. I though the fs you were talking about was mounted on the blkdev, while it is instead the one where the blkdev inode cames from. Usually people keeps bldkev in /dev and nobody unmounts /dev that's why I didn't noticed and thought about it, sorry. > > If we need to avoid the bumping of i_count and to allocate something > > dynamically that will be the bd_mapping address space, we don't need a > > new fake_inode there too, we just need to share the new physical > > pagecahce address space. Such physical i_mapping address space is the > > What are you going to use as mapping->host for it? the only info we'd need from the host is the host->i_rdev, so why can't we get it from the file->f_dentry->d_inode->i_rdev? In general I don't like very much the fake inodes with the zillon of unused fields (they just looks confusing due their unused fields), but well if you for whatever reason prefer to keep doing page->mapping->host->i_rdev or if file->f_dentry->d_inode->i_rdev has a problem I'm pretty much ok with the fake_inode there too. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 10:17 ` Andrea Arcangeli @ 2001-09-18 10:28 ` Alexander Viro 2001-09-18 10:35 ` Andrea Arcangeli 2001-09-18 11:05 ` Helge Hafting 2001-09-18 17:02 ` Linus Torvalds 2 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-18 10:28 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > If we need to avoid the bumping of i_count and to allocate something > > > dynamically that will be the bd_mapping address space, we don't need a > > > new fake_inode there too, we just need to share the new physical > > > pagecahce address space. Such physical i_mapping address space is the > > > > What are you going to use as mapping->host for it? > > the only info we'd need from the host is the host->i_rdev, so why can't > we get it from the file->f_dentry->d_inode->i_rdev? In general I don't In ->writepage()? Good luck. BTW, at some point use of ->i_rdev will have to go - we need to be able to use ->i_bdev instead. > like very much the fake inodes with the zillon of unused fields (they > just looks confusing due their unused fields), but well if you for > whatever reason prefer to keep doing page->mapping->host->i_rdev or if > file->f_dentry->d_inode->i_rdev has a problem I'm pretty much ok with > the fake_inode there too. It doesn't have to be fake. See how it's done for sockets or pipes. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 10:28 ` Alexander Viro @ 2001-09-18 10:35 ` Andrea Arcangeli 2001-09-18 10:52 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 10:35 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 06:28:11AM -0400, Alexander Viro wrote: > > > On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > > If we need to avoid the bumping of i_count and to allocate something > > > > dynamically that will be the bd_mapping address space, we don't need a > > > > new fake_inode there too, we just need to share the new physical > > > > pagecahce address space. Such physical i_mapping address space is the > > > > > > What are you going to use as mapping->host for it? > > > > the only info we'd need from the host is the host->i_rdev, so why can't > > we get it from the file->f_dentry->d_inode->i_rdev? In general I don't > > In ->writepage()? Good luck. BTW, at some point use of ->i_rdev will have I would have noticed if I actually wrote the code ;) static int blkdev_writepage(struct page * page) { no file... > It doesn't have to be fake. See how it's done for sockets or pipes. here it's really completly private to the bdev. I mean we could be tricky and force a cast on mapping->host to point to bdev and we wouldn't need the fake inode. But casts are probably uglier and more risky than using the fake_inode (unless we really consdier the host a cookie rather than an inode pointer). Comments? Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 10:35 ` Andrea Arcangeli @ 2001-09-18 10:52 ` Alexander Viro 0 siblings, 0 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-18 10:52 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > It doesn't have to be fake. See how it's done for sockets or pipes. > > here it's really completly private to the bdev. I mean we could be > tricky and force a cast on mapping->host to point to bdev and we > wouldn't need the fake inode. But casts are probably uglier and more > risky than using the fake_inode (unless we really consdier the host a > cookie rather than an inode pointer). Comments? Well, seeing that I had added ->host in the first place... yes, they were intended to be cookies rather than inode pointers. Precisely because of block device applications. Overridden by Linus several months down the road and yes, I still think that it was bogus. Anyway, _if_ we have to have an inode somewhere, we'd better make sure that it's not a reuse of inode passed by open(). Otherwise we are in for a lot of fun. I could live with something similar to the net/socket.c and fs/pipe.c schemes - minimal superblock and inodes allocated on it. But then we'd better be consistent and do that for _all_ uses of blkdev_get(). ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 10:17 ` Andrea Arcangeli 2001-09-18 10:28 ` Alexander Viro @ 2001-09-18 11:05 ` Helge Hafting 2001-09-18 12:40 ` Andrea Arcangeli 2001-09-18 17:02 ` Linus Torvalds 2 siblings, 1 reply; 110+ messages in thread From: Helge Hafting @ 2001-09-18 11:05 UTC (permalink / raw) To: Andrea Arcangeli, linux-kernel Andrea Arcangeli wrote: > very clear now, thanks. I though the fs you were talking about was > mounted on the blkdev, while it is instead the one where the blkdev > inode cames from. Usually people keeps bldkev in /dev and nobody > unmounts /dev that's why I didn't noticed and thought about it, sorry. > Don't assume "nobody does..." There is always someone that do. This particular one will easily come up for anybody who develop boot floppies: 1. Mount the boot floppy at /mnt 2. Test the device files in /mnt/dev, _perhaps by using them in every way imaginable_ 3. Umount the boot floppy. - There goes /mnt/dev and a bunch of block devices. Umounting some dev/ is common enough. Helge Hafting ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 11:05 ` Helge Hafting @ 2001-09-18 12:40 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-18 12:40 UTC (permalink / raw) To: Helge Hafting; +Cc: linux-kernel On Tue, Sep 18, 2001 at 01:05:51PM +0200, Helge Hafting wrote: > Andrea Arcangeli wrote: > > > very clear now, thanks. I though the fs you were talking about was > > mounted on the blkdev, while it is instead the one where the blkdev > > inode cames from. Usually people keeps bldkev in /dev and nobody > > unmounts /dev that's why I didn't noticed and thought about it, sorry. > > > Don't assume "nobody does..." There is always someone that do. > > This particular one will easily come up for anybody who > develop boot floppies: > > 1. Mount the boot floppy at /mnt > 2. Test the device files in /mnt/dev, _perhaps by using them > in every way imaginable_ > 3. Umount the boot floppy. - There goes /mnt/dev and a bunch > of block devices. > > Umounting some dev/ is common enough. Of course, that's a bug and it must be fixed. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 10:17 ` Andrea Arcangeli 2001-09-18 10:28 ` Alexander Viro 2001-09-18 11:05 ` Helge Hafting @ 2001-09-18 17:02 ` Linus Torvalds 2 siblings, 0 replies; 110+ messages in thread From: Linus Torvalds @ 2001-09-18 17:02 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Alexander Viro, Kernel Mailing List On Tue, 18 Sep 2001, Andrea Arcangeli wrote: > > > > What are you going to use as mapping->host for it? > > the only info we'd need from the host is the host->i_rdev, so why can't > we get it from the file->f_dentry->d_inode->i_rdev? No, just allocate a real inode as part of "struct block_device" allocation. This is basically what socket etc stuff does - and it's actually easiest to do it the other way, exactly like sockets do (ie _embed_ "struct block_device" inside a struct inode as if "block_device" was just another filesystem), and then just allocate one inode and get the "struct block_device" allocation for free. This works really well for sockets, and has worked for a long time. See how the socket is hidden in "inode->u.socket_i", and how it's trivial to convert from one to the other. And the fact is, a block_device _is_ a simple filesystem if you squint just the right way. Admittedly it's a fairly flat and _boring_ filesystem, but hey, that's the point of them, after all ;) Linus ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 9:31 ` Alexander Viro 2001-09-18 9:39 ` Andrea Arcangeli @ 2001-09-18 16:45 ` Linus Torvalds 2001-09-18 18:19 ` Alexander Viro 1 sibling, 1 reply; 110+ messages in thread From: Linus Torvalds @ 2001-09-18 16:45 UTC (permalink / raw) To: Alexander Viro; +Cc: Kernel Mailing List, Andrea Arcangeli On Tue, 18 Sep 2001, Alexander Viro wrote: > > Umm... Linus, had you actually read through the fs/block_device.c part > of that? It's not just ugly as hell, it's (AFAICS) not hard to oops > if you have several inodes sharing major:minor. ->bd_inode and its > treatment are bogus. Please, read it through and consider reverting - > in its current state code is an ugly mess. Funny that you mention it, because I actually have a cunning plan, and you're an unwitting part of it. Or actually, I hope you're a "witting" part of it, because it's going to be your code. Take your "struct block_device" code, add a "struct address_space" to it, and whenever a block device inode is opened, make the inode->i_mapping point to &bdev->b_data, and voila.. You already get all the reference counting right, and it's the only sensible place to do it anyway, wouldn't you agree? I thought you'd be thrilled. It seems to match your lazy allocation patch very well.. Linus ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 16:45 ` Linus Torvalds @ 2001-09-18 18:19 ` Alexander Viro 2001-09-18 18:27 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-18 18:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kernel Mailing List On Tue, 18 Sep 2001, Linus Torvalds wrote: > > On Tue, 18 Sep 2001, Alexander Viro wrote: > > > > Umm... Linus, had you actually read through the fs/block_device.c part > > of that? It's not just ugly as hell, it's (AFAICS) not hard to oops > > if you have several inodes sharing major:minor. ->bd_inode and its > > treatment are bogus. Please, read it through and consider reverting - > > in its current state code is an ugly mess. > > Funny that you mention it, because I actually have a cunning plan, and > you're an unwitting part of it. /me swears Yes, we can make it work that way (see downthread). Yes, combined with lazy allocation (and pipefs-like scheme) it can turn into nice, _working_ code. But right now we have a) broken patch applied to the tree b) somewhat tested patch that may (after being modified so that it would _apply_ to -pre11) fix the breakage. Once it's tested enough to be consider as a candidate for inclusion, that is. IMNSHO it's somewhat less than ideal situation. I've already talked to with Andrea and once he gets back to life (no, 'e's just sleepin') we'll try to do something usable. Life would be much simpler if aforementioned cunning plan included sending mail to participants (me and Andrea) before putting the patch into the tree, though. Oh, well... > Or actually, I hope you're a "witting" part of it, because it's going to > be your code. > > Take your "struct block_device" code, add a "struct address_space" to it, > and whenever a block device inode is opened, make the inode->i_mapping > point to &bdev->b_data, and voila.. > > You already get all the reference counting right, and it's the only > sensible place to do it anyway, wouldn't you agree? > > I thought you'd be thrilled. It seems to match your lazy allocation patch > very well.. It can be modified so that combination with lazy-bdev and pipefs-like tree would work. And yes, most of the ugliness would just go away. The problem being, I'd rather do it in a different order - 1) finish testing of lazy-bdev 2) apply well-tested patch 3) test modified bedv-in-pagecache patch 4) apply it, once it got public testing. As it is, we'll do combined patch, diff it against -pre11 and submit for inclusion - new hole in the main tree needs fixing... ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:19 ` Alexander Viro @ 2001-09-18 18:27 ` Linus Torvalds 2001-09-18 19:14 ` Andreas Dilger ` (4 more replies) 2001-09-19 13:38 ` Rik van Riel 2001-09-19 16:35 ` Andrea Arcangeli 2 siblings, 5 replies; 110+ messages in thread From: Linus Torvalds @ 2001-09-18 18:27 UTC (permalink / raw) To: Alexander Viro; +Cc: Kernel Mailing List On Tue, 18 Sep 2001, Alexander Viro wrote: > > It can be modified so that combination with lazy-bdev and pipefs-like tree > would work. And yes, most of the ugliness would just go away. That's the part I like about the page-cache bdev patch. It has a lot of fairly ugly warts, but all of them seem to be really fixable with _other_ cleanups, at which point only the good parts remain. I agree that the timing may leave something to be desired. But we had the discussion about fixing pagecache-bdev consistency wrt the regular buffer cache filesystem accesses a week or so ago, and the fact is that nobody really seems to have started working on it - because everybody felt that you have to get everything done at once. I don't have that feeling. I'm happy with having partial merge with ugly warts, if it means that you can get to the final stage _without_ having to have all the problems fixed at one time. So now we have two _smaller_ merges that will fix two other issues, and remove all the horridness from the original merge. Linus ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:27 ` Linus Torvalds @ 2001-09-18 19:14 ` Andreas Dilger 2001-09-18 19:41 ` Alexander Viro 2001-09-18 20:33 ` Richard Gooch 2001-09-18 19:29 ` Benjamin LaHaise ` (3 subsequent siblings) 4 siblings, 2 replies; 110+ messages in thread From: Andreas Dilger @ 2001-09-18 19:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alexander Viro, Kernel Mailing List On Sep 18, 2001 11:27 -0700, Linus Torvalds wrote: > I agree that the timing may leave something to be desired. But we had the > discussion about fixing pagecache-bdev consistency wrt the regular buffer > cache filesystem accesses a week or so ago, and the fact is that nobody > really seems to have started working on it - because everybody felt that > you have to get everything done at once. The real question is why can't we just open 2.5 and only fix the VM to start with? Leave the kernel at 2.4.10-pre10, and possibly use the -ac VM code (which has diverged from mainline), and leave people (Alan, Ben, Marcello, et. al.) who want to tinker with it in small increments and do the drastic stuff in 2.5. Make it clear that 2.5 is still ONLY for VM and other bug fixes at this point, and not all of the long-awaited 2.5 rework YET. If it turns out that the VM fixes are done quickly, then they can be back-ported to 2.4. If it takes longer than expected you can open 2.5 up to general development. With the right "management" it will be not much different than the current situation, except that people won't have fits about such huge changes going into 2.4. I think this is your subconcious telling you that you really wanted to start 2.5 a month ago. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 19:14 ` Andreas Dilger @ 2001-09-18 19:41 ` Alexander Viro 2001-09-18 20:33 ` Richard Gooch 1 sibling, 0 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-18 19:41 UTC (permalink / raw) To: Andreas Dilger; +Cc: Kernel Mailing List On Tue, 18 Sep 2001, Andreas Dilger wrote: > The real question is why can't we just open 2.5 and only fix the VM to > start with? Leave the kernel at 2.4.10-pre10, and possibly use the -ac > VM code (which has diverged from mainline), and leave people (Alan, Ben, > Marcello, et. al.) who want to tinker with it in small increments and > do the drastic stuff in 2.5. I'd rather get a list of API changes planned for 2.5 and DO ONLY THEM. IOW, start 2.5 with a sequence of patches that do nothing but a global search-and-replace. Then treat it as -STABLE. It _is_ doable - check what had happened to superblock handling in 2.4. Yes, it takes extra work on splitup and doing things in compatible way, but it's not a rocket science - BTDT. "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its charm years ago - just look at the devfs mess... ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 19:14 ` Andreas Dilger 2001-09-18 19:41 ` Alexander Viro @ 2001-09-18 20:33 ` Richard Gooch 2001-09-18 20:53 ` Alexander Viro 2001-09-18 21:06 ` Richard Gooch 1 sibling, 2 replies; 110+ messages in thread From: Richard Gooch @ 2001-09-18 20:33 UTC (permalink / raw) To: Alexander Viro; +Cc: Andreas Dilger, Kernel Mailing List Alexander Viro writes: > "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its > charm years ago - just look at the devfs mess... Al, are you ever going to stop bitching and moaning about devfs? If you have a specific, technical, reasoned complaint to make, do it coherently. Otherwise, stop sniping. Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 20:33 ` Richard Gooch @ 2001-09-18 20:53 ` Alexander Viro 2001-09-18 21:06 ` Richard Gooch 1 sibling, 0 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-18 20:53 UTC (permalink / raw) To: Richard Gooch; +Cc: Andreas Dilger, Kernel Mailing List On Tue, 18 Sep 2001, Richard Gooch wrote: > Alexander Viro writes: > > "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its > > charm years ago - just look at the devfs mess... > > Al, are you ever going to stop bitching and moaning about devfs? > If you have a specific, technical, reasoned complaint to make, do it > coherently. Otherwise, stop sniping. What? I'm saying that devfs was a huge patch which was not understood (let alone audited) by anyone except you when it was included and as the direct result we have a long history of problems in that area. You want to argue against that? If you still feel that feeding devfs into the tree in one huge chunk was not a mistake - you are much dumber than I thought. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 20:33 ` Richard Gooch 2001-09-18 20:53 ` Alexander Viro @ 2001-09-18 21:06 ` Richard Gooch 2001-09-18 21:27 ` Alexander Viro 1 sibling, 1 reply; 110+ messages in thread From: Richard Gooch @ 2001-09-18 21:06 UTC (permalink / raw) To: Alexander Viro; +Cc: Andreas Dilger, Kernel Mailing List Alexander Viro writes: > > > On Tue, 18 Sep 2001, Richard Gooch wrote: > > > Alexander Viro writes: > > > "I can't be arsed to split my K'R4D 3133t 200Kb p47cH" had lost its > > > charm years ago - just look at the devfs mess... > > > > Al, are you ever going to stop bitching and moaning about devfs? > > If you have a specific, technical, reasoned complaint to make, do it > > coherently. Otherwise, stop sniping. > > What? I'm saying that devfs was a huge patch which was not understood > (let alone audited) by anyone except you when it was included and as > the direct result we have a long history of problems in that area. > You want to argue against that? > > If you still feel that feeding devfs into the tree in one huge chunk > was not a mistake - you are much dumber than I thought. Actually, many times I fed Linus smaller patches. I tried to get the FS core itself in separately from the drivers, which was 100% safe (since no existing code was touched), but he wasn't interested. If the devfs core *had* been accepted on it's own, I could then have reasonably split up the driver patches. However, I disagree with your "long history of problems" comment. You make it sound like devfs hasn't performed reliably, whereas in real life (i.e. normal use, not constructed exploits) it's done quite well. In any case, if your "I can't be arsed to split my patch" comment is directed at me, I take offense. I did actually take the trouble to split up my patch. Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 21:06 ` Richard Gooch @ 2001-09-18 21:27 ` Alexander Viro 0 siblings, 0 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-18 21:27 UTC (permalink / raw) To: Richard Gooch; +Cc: Andreas Dilger, Kernel Mailing List On Tue, 18 Sep 2001, Richard Gooch wrote: > Actually, many times I fed Linus smaller patches. I tried to get the > FS core itself in separately from the drivers, which was 100% safe > (since no existing code was touched), but he wasn't interested. If the > devfs core *had* been accepted on it's own, I could then have > reasonably split up the driver patches. > > However, I disagree with your "long history of problems" comment. You > make it sound like devfs hasn't performed reliably, whereas in real > life (i.e. normal use, not constructed exploits) it's done quite well. > > In any case, if your "I can't be arsed to split my patch" comment is > directed at me, I take offense. I did actually take the trouble to > split up my patch. Sheesh... Statement: "non-priveleged user can crash any released version of kernel if devfs is compiled in and mounted". Provably true. It _does_ qualify as a problem and yes, I would say that 20 months _is_ long. Regardless of the reasons why the thing went in once chunk, presence of these problems is a direct result. Anyone who wants to push a large patch into the tree is inviting the same result. BTW, potentially useful observation: if a preliminary chunk of patch makes sense on its own, it gets much better chance to get applied. Accepting devfs core in one step would be completely useless - it's still too large (~100Kb) and if it would be applied first, it would get zero testing until the rest went in. Anyway, Zen And Art Of Feeding Patches Into Tree is a topic for a different thread... AFAICS devfs could be split in meaningful steps (ones that would get testing immediately and would add functionality by pieces), but by now it's hardly interesting. If you want details - let's take it to private mail, preferably when I'll have some spare time. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:27 ` Linus Torvalds 2001-09-18 19:14 ` Andreas Dilger @ 2001-09-18 19:29 ` Benjamin LaHaise 2001-09-18 20:17 ` Stephan von Krawczynski ` (2 subsequent siblings) 4 siblings, 0 replies; 110+ messages in thread From: Benjamin LaHaise @ 2001-09-18 19:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alexander Viro, Kernel Mailing List On Tue, Sep 18, 2001 at 11:27:27AM -0700, Linus Torvalds wrote: > I don't have that feeling. I'm happy with having partial merge with ugly > warts, if it means that you can get to the final stage _without_ having to > have all the problems fixed at one time. > > So now we have two _smaller_ merges that will fix two other issues, and > remove all the horridness from the original merge. A lot of us would be much happier if you just renamed 2.4.10pre11 to 2.5.1. =) Then we can backport things as they become stable. -ben ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:27 ` Linus Torvalds 2001-09-18 19:14 ` Andreas Dilger 2001-09-18 19:29 ` Benjamin LaHaise @ 2001-09-18 20:17 ` Stephan von Krawczynski 2001-09-18 20:33 ` Alan Cox 2001-09-19 13:42 ` Rik van Riel 2001-09-19 2:59 ` Michael Peddemors 2001-09-19 16:11 ` Alexander Viro 4 siblings, 2 replies; 110+ messages in thread From: Stephan von Krawczynski @ 2001-09-18 20:17 UTC (permalink / raw) To: Andreas Dilger; +Cc: torvalds, viro, linux-kernel On Tue, 18 Sep 2001 13:14:19 -0600 Andreas Dilger <adilger@turbolabs.com> wrote: > On Sep 18, 2001 11:27 -0700, Linus Torvalds wrote: > [...] > The real question is why can't we just open 2.5 and only fix the VM to > start with? Hm, I guess if anybody would be capable of _really_ fixing vm in upto-pre10 state, he would have done it already. It's not that people would not have tried, but it looks like nobody is able to get the _whole_ picture of this. Surely there are people that know a lot about certain parts of the (old) vm code, and thats exactly what leads to this Linus' statements here sounding like "I had a look at part xyz of it and it's a mess. Patch appended." (see pre10 vm patch). You have to keep in mind that a lot of people on the planet try to use 2.4 in a unbelievably broad variety of setups - and quite a number of them relies on released kernels. If you take a close look at 2.4.7, 2.4.8, 2.4.9 you may well find out, that they're unusable compared to 2.2.19. You cannot go on like this. I really do back Andrea and Linus for this step, because I can _see_ a great win in performance _and_ things got less complex in this code. So there is a much bigger chance to tilt the last remaining problems in short time (hopefully before 2.4.10 or .11). So I guess the right thing to do now is give Andrea good input of leftover, reproducible problems to give him a chance to fix them. A major discussion about "doing it all the way round" makes only sense, if someone comes up with something _at least_ as good as Andrea's code. Then its "Lonely Linus"' decision to choose. Whatever he will choose, a good percentage LKML will be against it. This is a normal thing, the guy that has to make the decision is alone most of the time. The only way for you to find out if he is right is to _try_ it. If he is, tell him. If he isn't, send a patch. He couldn't have been that wrong the last years, though. </end sermon> Just _one_ opinion from an unimportant guy, Stephan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 20:17 ` Stephan von Krawczynski @ 2001-09-18 20:33 ` Alan Cox 2001-09-19 13:42 ` Rik van Riel 1 sibling, 0 replies; 110+ messages in thread From: Alan Cox @ 2001-09-18 20:33 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Andreas Dilger, torvalds, viro, linux-kernel > Hm, I guess if anybody would be capable of _really_ fixing vm in upto-pre10 > state, he would have done it already. It's not that people would not have > tried, but it looks like nobody is able to get the _whole_ picture of this. They have been, but slowly and steadily - thats why the -ac mm worked fairly well. Its also why merging it with Linus would have been pointless at this time since its just backing out some of the less apparently useful experiments ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 20:17 ` Stephan von Krawczynski 2001-09-18 20:33 ` Alan Cox @ 2001-09-19 13:42 ` Rik van Riel 2001-09-19 14:27 ` Alexander Viro 1 sibling, 1 reply; 110+ messages in thread From: Rik van Riel @ 2001-09-19 13:42 UTC (permalink / raw) To: Stephan von Krawczynski; +Cc: Andreas Dilger, torvalds, viro, linux-kernel On Tue, 18 Sep 2001, Stephan von Krawczynski wrote: > Hm, I guess if anybody would be capable of _really_ fixing vm in > upto-pre10 state, he would have done it already. It's not that people > would not have tried, but it looks like nobody is able to get the > _whole_ picture of this. Look, the problem is that Linus is being an asshole and integrating conflicting ideas into both the VM and the VFS, without giving anybody prior notice and later blame others. Just look at how he's now trying to force Al Viro into implementing his ideas yesterday because he broke stuff again... If you want a stable kernel, use Alan's kernel. regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 13:42 ` Rik van Riel @ 2001-09-19 14:27 ` Alexander Viro 0 siblings, 0 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-19 14:27 UTC (permalink / raw) To: Rik van Riel Cc: Stephan von Krawczynski, Andreas Dilger, torvalds, linux-kernel On Wed, 19 Sep 2001, Rik van Riel wrote: > Just look at how he's now trying to force Al Viro into > implementing his ideas yesterday because he broke stuff > again... Rik, in case you've missed that, I can and do speak for myself. I had spent ten years dodging the draft; when I decide to get enlisted into something it will happen on _my_ decision and _my_ conditions. When I decide that I'm being forced into something I do not accept - you'll know it from posting with URL of forked tree. FWIW, I'm less than thrilled by the Andrea's patch, but it is salvagable. I'm also less than thrilled by the whole situation with VM - all sides of it. I seriously suspect that we need a limited multi-way fork in that area, so that you guys would stop stepping on each others' toes. I'm taking no part in your merry 5-way clusterfuck - sort that mess out between yourselves. Again, when I decide that situation is unacceptable for me - I'll simply fork the tree. I do _not_ appreciate being enlisted into anyone's holy wars, so unless you really want to go _way_ up in my personal shitlist (several positions below .ru DoD) - don't play politics in my vicinity. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:27 ` Linus Torvalds ` (2 preceding siblings ...) 2001-09-18 20:17 ` Stephan von Krawczynski @ 2001-09-19 2:59 ` Michael Peddemors 2001-09-19 16:11 ` Alexander Viro 4 siblings, 0 replies; 110+ messages in thread From: Michael Peddemors @ 2001-09-19 2:59 UTC (permalink / raw) To: Andreas Dilger; +Cc: Kernel Mailing List Ummmm.. because the 2.4 series is supposed to be the stable series.. and a lot of people's comments on virtually every 2.4 kernel have expressed dissatisfaction with it's stability.. So a lot of people haven't left the 2.2 series yet, and are waiting for a 2.4 series that everyone is happy with.. Even in your note, you haven't indicated a release 2.4 version that makes you happy. On Tue, 2001-09-18 at 12:14, Andreas Dilger wrote: > On Sep 18, 2001 11:27 -0700, Linus Torvalds wrote: > > I agree that the timing may leave something to be desired. But we had the > > discussion about fixing pagecache-bdev consistency wrt the regular buffer > > cache filesystem accesses a week or so ago, and the fact is that nobody > > really seems to have started working on it - because everybody felt that > > you have to get everything done at once. > > The real question is why can't we just open 2.5 and only fix the VM to > start with? Leave the kernel at 2.4.10-pre10, and possibly use the -ac > VM code (which has diverged from mainline), and leave people (Alan, Ben, > Marcello, et. al.) who want to tinker with it in small increments and > do the drastic stuff in 2.5. > -- "Catch the Magic of Linux..." -------------------------------------------------------- Michael Peddemors - Senior Consultant LinuxAdministration - Internet Services NetworkServices - Programming - Security Wizard IT Services http://www.wizard.ca Linux Support Specialist - http://www.linuxmagic.com -------------------------------------------------------- (604)589-0037 Beautiful British Columbia, Canada ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:27 ` Linus Torvalds ` (3 preceding siblings ...) 2001-09-19 2:59 ` Michael Peddemors @ 2001-09-19 16:11 ` Alexander Viro 2001-09-19 18:25 ` Andrea Arcangeli 4 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-19 16:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrea Arcangeli, Kernel Mailing List On Tue, 18 Sep 2001, Linus Torvalds wrote: > > On Tue, 18 Sep 2001, Alexander Viro wrote: > > > > It can be modified so that combination with lazy-bdev and pipefs-like tree > > would work. And yes, most of the ugliness would just go away. > > That's the part I like about the page-cache bdev patch. It has a lot of > fairly ugly warts, but all of them seem to be really fixable with _other_ > cleanups, at which point only the good parts remain. It's actually quite broken in several areas (== bunch of panicable rmmod races, broken wrt umount(), trivially oopsable in ioctl() on ramdisk, very suspicious in swapoff(2), etc.). It _might_ be fixable, but I would really like to see the patch that went into -pre11 separately from the rest. Andrea, could you send it to me? In particular, I'm deeply suspicious about changes in blkdev_put() in case of BDEV_FILE. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 16:11 ` Alexander Viro @ 2001-09-19 18:25 ` Andrea Arcangeli 2001-09-19 19:21 ` Alexander Viro 2001-09-19 20:41 ` Richard Gooch 0 siblings, 2 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-19 18:25 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Wed, Sep 19, 2001 at 12:11:56PM -0400, Alexander Viro wrote: > the rest. Andrea, could you send it to me? In particular, I'm deeply > suspicious about changes in blkdev_put() in case of BDEV_FILE. of course, for the record you can also find it in the ftp area all splitted out, but I've no problem to send it via email too. Quite frankly the BDEV_* handling was and is a total mess IMHO, even if it was written by you ;), there was no difference at all from many of them, I didn't fixed that but I had to check all them on the differences until I realized there was none. I also think the other things you mentioned (besides the inode pinning bug, non critical) are not buggy (infact I never had a single report), but well we'll verify that in detail ASAP. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 18:25 ` Andrea Arcangeli @ 2001-09-19 19:21 ` Alexander Viro 2001-09-19 20:55 ` Andrea Arcangeli 2001-09-20 2:34 ` Andrea Arcangeli 2001-09-19 20:41 ` Richard Gooch 1 sibling, 2 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-19 19:21 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Wed, 19 Sep 2001, Andrea Arcangeli wrote: > Quite frankly the BDEV_* handling was and is a total mess IMHO, even if > it was written by you ;), there was no difference at all from many of > them, I didn't fixed that but I had to check all them on the differences > until I realized there was none. I also think the other things you There certainly _are_ differences (e.g. in handling the moment when you close them). > mentioned (besides the inode pinning bug, non critical) are not buggy _What_? int fd = open("/dev/ram0", O_RDWR); ioctl(fd, BLKFLSBUF); ioctl(fd, BLKFLSBUF); and you claim that resulting oops is not a bug? > (infact I never had a single report), but well we'll verify that in Richard, is that you? What had you done with real Andrea? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 19:21 ` Alexander Viro @ 2001-09-19 20:55 ` Andrea Arcangeli 2001-09-19 21:17 ` Alexander Viro 2001-09-19 22:15 ` Linux 2.4.10-pre11 Richard Gooch 2001-09-20 2:34 ` Andrea Arcangeli 1 sibling, 2 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-19 20:55 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote: > > > On Wed, 19 Sep 2001, Andrea Arcangeli wrote: > > > Quite frankly the BDEV_* handling was and is a total mess IMHO, even if > > it was written by you ;), there was no difference at all from many of > > them, I didn't fixed that but I had to check all them on the differences > > until I realized there was none. I also think the other things you > > There certainly _are_ differences (e.g. in handling the moment > when you close them). there aren't difference, only thing that matters is: "is that an fs or a blkdev". SWAP/RAW/FILE is useless. > > mentioned (besides the inode pinning bug, non critical) are not buggy > > _What_? > > int fd = open("/dev/ram0", O_RDWR); > ioctl(fd, BLKFLSBUF); > ioctl(fd, BLKFLSBUF); > > and you claim that resulting oops is not a bug? that is a bug. > > (infact I never had a single report), but well we'll verify that in > > Richard, is that you? What had you done with real Andrea? You also screwup things sometime (think the few liner you posts to l-k after your cleanups). Those are minor bugs, so I'm not going to panic on them (ramdisk works not by luck), this is what I meant, and they will be fixed shortly somehow, and many thanks for the further auditing. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 20:55 ` Andrea Arcangeli @ 2001-09-19 21:17 ` Alexander Viro 2001-09-19 23:01 ` Andrea Arcangeli 2001-09-19 23:03 ` Andrea Arcangeli 2001-09-19 22:15 ` Linux 2.4.10-pre11 Richard Gooch 1 sibling, 2 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-19 21:17 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Wed, 19 Sep 2001, Andrea Arcangeli wrote: > > There certainly _are_ differences (e.g. in handling the moment > > when you close them). > > there aren't difference, only thing that matters is: "is that an fs > or a blkdev". SWAP/RAW/FILE is useless. fsync_dev() is not needed for raw devices or swap. It _is_ needed for file access. > > > (infact I never had a single report), but well we'll verify that in > > > > Richard, is that you? What had you done with real Andrea? > > You also screwup things sometime (think the few liner you posts to l-k > after your cleanups). Those are minor bugs, so I'm not going to panic Certainly. > on them (ramdisk works not by luck), this is what I meant, and they will Sorry, it just sounded so..., well, familiar... Couldn't resist ;-) (BTW, Richard, _what_ political whatever could be found in that?) > be fixed shortly somehow, and many thanks for the further auditing. Andrea, had you seen the off-list mail (cc: to you and Linus)? The main problem I have right now is that I don't see how you manage to guarantee that during the last ->release() no requests are going in. Old code did unconditional invalidate_buffers() to wipe out all buffer_heads when device is finally closed. Absense of pagecache sources was guaranteed by umount() - by the time when we release ->s_bdev all pages are gone. How do you deal with that in the current code? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 21:17 ` Alexander Viro @ 2001-09-19 23:01 ` Andrea Arcangeli 2001-09-19 23:03 ` Andrea Arcangeli 1 sibling, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-19 23:01 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Wed, Sep 19, 2001 at 05:17:23PM -0400, Alexander Viro wrote: > Andrea, had you seen the off-list mail (cc: to you and Linus)? The main > problem I have right now is that I don't see how you manage to guarantee > that during the last ->release() no requests are going in. Old code bdev->bd_sem + bdev->bd_cache_openers. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 21:17 ` Alexander Viro 2001-09-19 23:01 ` Andrea Arcangeli @ 2001-09-19 23:03 ` Andrea Arcangeli 2001-09-19 23:30 ` Alexander Viro 1 sibling, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-19 23:03 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Wed, Sep 19, 2001 at 05:17:23PM -0400, Alexander Viro wrote: > fsync_dev() is not needed for raw devices or swap. It _is_ needed for > file access. then what's the difference between raw devices and swap. And there's reason we should we avoid the fsync_dev with the raw devices and swap. I just found it an useless complication. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 23:03 ` Andrea Arcangeli @ 2001-09-19 23:30 ` Alexander Viro 2001-09-19 23:40 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-19 23:30 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > On Wed, Sep 19, 2001 at 05:17:23PM -0400, Alexander Viro wrote: > > fsync_dev() is not needed for raw devices or swap. It _is_ needed for > > file access. > > then what's the difference between raw devices and swap. Keep in mind that at some point we may want to add exclusions - e.g. "swapon() messing with block_size when accidentially called for mounted fs" problem can be easily solved that way - that's probably the simplest way to deal with it. Ditto for RAID vs. filesystem - current code for that is ugly and not too reliable. > And there's reason we should we avoid the fsync_dev with the raw devices > and swap. Umm... Not doing unnecessary work? Semantics of releasing a block device depends on the kind of use. BTW, I'm less than sure that fsync_dev() is the right thing for file access now that you've got that in pagecache - __block_fsync() seems to be more correct thing to do. > I just found it an useless complication. <shrug> in the eye of beholder... /me goes to get some sleep. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 23:30 ` Alexander Viro @ 2001-09-19 23:40 ` Andrea Arcangeli 2001-09-20 13:56 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-19 23:40 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Wed, Sep 19, 2001 at 07:30:55PM -0400, Alexander Viro wrote: > "swapon() messing with block_size when accidentially called for mounted swap should never change softblocksize etc.. the concept of softblocksize will die as soon as we make the buffercache - physically address space backed. > Umm... Not doing unnecessary work? Semantics of releasing a block device > depends on the kind of use. BTW, I'm less than sure that fsync_dev() is > the right thing for file access now that you've got that in pagecache - > __block_fsync() seems to be more correct thing to do. Not really, blkdev isn't a filesystem. It will never have a superblock and its own inodes and we also need to filemap_fdatasync/wait the physical address space. > /me goes to get some sleep. night. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 23:40 ` Andrea Arcangeli @ 2001-09-20 13:56 ` Alexander Viro 2001-09-20 14:38 ` Chris Mason 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 13:56 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > Umm... Not doing unnecessary work? Semantics of releasing a block device > > depends on the kind of use. BTW, I'm less than sure that fsync_dev() is > > the right thing for file access now that you've got that in pagecache - > > __block_fsync() seems to be more correct thing to do. > > Not really, blkdev isn't a filesystem. It will never have a superblock > and its own inodes and we also need to filemap_fdatasync/wait the > physical address space. Had you actually read the fsync_dev()? Let me make it clear: you are flushing _buffer_ cache upon blkdev_put(bdev, BDEV_FILE). It was the right thing when file access went through buffer cache. It's blatantly wrong with page cache. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 13:56 ` Alexander Viro @ 2001-09-20 14:38 ` Chris Mason 2001-09-20 14:50 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Chris Mason @ 2001-09-20 14:38 UTC (permalink / raw) To: Alexander Viro, Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thursday, September 20, 2001 09:56:22 AM -0400 Alexander Viro <viro@math.psu.edu> wrote: > Had you actually read the fsync_dev()? Let me make it clear: you are > flushing _buffer_ cache upon blkdev_put(bdev, BDEV_FILE). It was > the right thing when file access went through buffer cache. It's > blatantly wrong with page cache. Well, fsync_dev will flush anything on the dirty list. Since blkdev_commit_write puts things on the dirty list, andrea's current code will flush changes through the page cache. The biggest exception is blkdev_writepage directly submits the io instead of marking the buffers dirty. This means the buffers won't be on the locked/dirty list, and they won't get waited on. Similar problem for direct io. The fix shouldn't be too bad, I'll code it up... -chris ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 14:38 ` Chris Mason @ 2001-09-20 14:50 ` Alexander Viro 2001-09-20 15:44 ` Chris Mason 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 14:50 UTC (permalink / raw) To: Chris Mason; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Chris Mason wrote: > > > On Thursday, September 20, 2001 09:56:22 AM -0400 Alexander Viro <viro@math.psu.edu> wrote: > > > Had you actually read the fsync_dev()? Let me make it clear: you are > > flushing _buffer_ cache upon blkdev_put(bdev, BDEV_FILE). It was > > the right thing when file access went through buffer cache. It's > > blatantly wrong with page cache. > > Well, fsync_dev will flush anything on the dirty list. Since > blkdev_commit_write puts things on the dirty list, andrea's current > code will flush changes through the page cache. > > The biggest exception is blkdev_writepage directly submits the io instead > of marking the buffers dirty. This means the buffers won't be on > the locked/dirty list, and they won't get waited on. Similar problem > for direct io. <nod> And if you add Andrea's (perfectly valid) observation re having no need to sync any fs structures we might have for that device, you get __block_fsync(). After that it's easy to merge blkdev_close() code into blkdev_put(). ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 14:50 ` Alexander Viro @ 2001-09-20 15:44 ` Chris Mason 2001-09-20 16:43 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Chris Mason @ 2001-09-20 15:44 UTC (permalink / raw) To: Alexander Viro; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List On Thursday, September 20, 2001 10:50:16 AM -0400 Alexander Viro <viro@math.psu.edu> wrote: >> The biggest exception is blkdev_writepage directly submits the io instead >> of marking the buffers dirty. This means the buffers won't be on >> the locked/dirty list, and they won't get waited on. Similar problem >> for direct io. > > <nod> And if you add Andrea's (perfectly valid) observation re having no > need to sync any fs structures we might have for that device, you get > __block_fsync(). After that it's easy to merge blkdev_close() code into > blkdev_put(). > > Ok, __block_fsync is much better than just fsync_dev. Are there other parts of blkdev_close you want merged into blkdev_put? Without changing the reread blocks on last close semantics, I think this is all we can do. As far as I can tell, bdev->bd_inode is valid to send to __block_fsync, am I missing something? --- linux/fs/block_dev.c Mon Sep 17 11:28:56 2001 +++ linux/fs/block_dev.c Thu Sep 20 11:21:39 2001 @@ -704,10 +704,9 @@ kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */ down(&bdev->bd_sem); lock_kernel(); - if (kind == BDEV_FILE) - fsync_dev(rdev); - else if (kind == BDEV_FS) - fsync_no_super(rdev); + + __block_fsync(bdev->bd_inode) ; + /* only filesystems uses buffer cache for the metadata these days */ if (kind == BDEV_FS) invalidate_buffers(rdev); ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 15:44 ` Chris Mason @ 2001-09-20 16:43 ` Alexander Viro 2001-09-20 20:54 ` [PATCH] fs/block_dev.c cleanup Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 16:43 UTC (permalink / raw) To: Chris Mason; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Chris Mason wrote: > > <nod> And if you add Andrea's (perfectly valid) observation re having no > > need to sync any fs structures we might have for that device, you get > > __block_fsync(). After that it's easy to merge blkdev_close() code into > > blkdev_put(). > > > > > > Ok, __block_fsync is much better than just fsync_dev. > > Are there other parts of blkdev_close you want merged into > blkdev_put? Without changing the reread blocks on last close > semantics, I think this is all we can do. > > As far as I can tell, bdev->bd_inode is valid to send > to __block_fsync, am I missing something? Eventually that will be the right thing, but only after we allocate bd_inode upon blkdev_get()/blkdev_open() instead of trying to cannibalize the inode passed to blkdev_open(). I'm testing that chunk right now (it also kills all the fake_inode crap in block_dev.c). When we cut the lifetime of block_device down we'll be able to get it even simpler - allocate and free ->bd_inode at the same time as block_device, but that's several chunks later. ^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH] fs/block_dev.c cleanup 2001-09-20 16:43 ` Alexander Viro @ 2001-09-20 20:54 ` Alexander Viro 0 siblings, 0 replies; 110+ messages in thread From: Alexander Viro @ 2001-09-20 20:54 UTC (permalink / raw) To: Chris Mason; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Alexander Viro wrote: > On Thu, 20 Sep 2001, Chris Mason wrote: > > > > <nod> And if you add Andrea's (perfectly valid) observation re having no > > > need to sync any fs structures we might have for that device, you get > > > __block_fsync(). After that it's easy to merge blkdev_close() code into > > > blkdev_put(). > > > > > > > > > > Ok, __block_fsync is much better than just fsync_dev. > > > > Are there other parts of blkdev_close you want merged into > > blkdev_put? Without changing the reread blocks on last close > > semantics, I think this is all we can do. > > > > As far as I can tell, bdev->bd_inode is valid to send > > to __block_fsync, am I missing something? > > Eventually that will be the right thing, but only after we allocate > bd_inode upon blkdev_get()/blkdev_open() instead of trying to cannibalize > the inode passed to blkdev_open(). > > I'm testing that chunk right now (it also kills all the fake_inode crap in > block_dev.c). OK, it seems to be working here. It doesn't fix anything in rd.c - all it does is crapectomy in block_dev.c. Pseudo-fs added, inodes are allocated there upon blkdev_open() and blkdev_get(), ->bd_inode is always from that pseudo-fs and never equal to inode passed into blkdev_open(). Crap with fake inodes is gone - we simply use ->bd_inode. ->a_ops for block device inodes is not set at all - neither in devices.c nor in devfs. We set it for inodes on pseudo-fs and setting ->i_mapping upon open() does the right thing. All mess with keeping the first inode alive is gone - no need to do that anymore. diff -urN S10-pre12/drivers/block/rd.c linux/drivers/block/rd.c --- S10-pre12/drivers/block/rd.c Thu Sep 20 15:27:18 2001 +++ linux/drivers/block/rd.c Thu Sep 20 15:26:09 2001 @@ -420,7 +420,6 @@ /* bdev->bd_sem is held by caller */ bdev->bd_openers++; bdev->bd_cache_openers++; - bdev->bd_inode = inode; } } diff -urN S10-pre12/fs/block_dev.c linux/fs/block_dev.c --- S10-pre12/fs/block_dev.c Thu Sep 20 15:27:24 2001 +++ linux/fs/block_dev.c Thu Sep 20 15:40:04 2001 @@ -18,6 +18,7 @@ #include <linux/iobuf.h> #include <linux/highmem.h> #include <linux/blkdev.h> +#include <linux/module.h> #include <asm/uaccess.h> @@ -365,6 +366,53 @@ } /* + * pseudo-fs + */ + +static struct super_block *bd_read_super(struct super_block *sb, void *data, int silent) +{ + static struct super_operations sops = {}; + struct inode *root = new_inode(sb); + if (!root) + return NULL; + root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR; + root->i_uid = root->i_gid = 0; + root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME; + sb->s_blocksize = 1024; + sb->s_blocksize_bits = 10; + sb->s_magic = 0x62646576; + sb->s_op = &sops; + sb->s_root = d_alloc(NULL, &(const struct qstr) { "bdev:", 5, 0 }); + if (!sb->s_root) { + iput(root); + return NULL; + } + sb->s_root->d_sb = sb; + sb->s_root->d_parent = sb->s_root; + d_instantiate(sb->s_root, root); + return sb; +} + +static DECLARE_FSTYPE(bd_type, "bdev", bd_read_super, FS_NOMOUNT); + +static struct vfsmount *bd_mnt; + +static int get_inode(struct block_device *bdev) +{ + if (!bdev->bd_inode) { + struct inode *inode = new_inode(bd_mnt->mnt_sb); + if (!inode) + return -ENOMEM; + inode->i_rdev = to_kdev_t(bdev->bd_dev); + atomic_inc(&bdev->bd_count); /* will go away */ + inode->i_bdev = bdev; + inode->i_data.a_ops = &def_blk_aops; + bdev->bd_inode = inode; + } + return 0; +} + +/* * bdev cache handling - shamelessly stolen from inode.c * We use smaller hashtable, though. */ @@ -394,7 +442,7 @@ void __init bdev_cache_init(void) { - int i; + int i, err; struct list_head *head = bdev_hashtable; i = HASH_SIZE; @@ -410,6 +458,13 @@ NULL); if (!bdev_cachep) panic("Cannot create bdev_cache SLAB cache"); + err = register_filesystem(&bd_type); + if (err) + panic("Cannot register bdev pseudo-fs"); + bd_mnt = kern_mount(&bd_type); + err = PTR_ERR(bd_mnt); + if (IS_ERR(bd_mnt)) + panic("Cannot create bdev pseudo-fs"); } /* @@ -598,18 +653,13 @@ int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg) { - struct inode inode_fake; int res; mm_segment_t old_fs = get_fs(); if (!bdev->bd_op->ioctl) return -EINVAL; - memset(&inode_fake, 0, sizeof(inode_fake)); - inode_fake.i_rdev = to_kdev_t(bdev->bd_dev); - inode_fake.i_bdev = bdev; - init_waitqueue_head(&inode_fake.i_wait); set_fs(KERNEL_DS); - res = bdev->bd_op->ioctl(&inode_fake, NULL, cmd, arg); + res = bdev->bd_op->ioctl(bdev->bd_inode, NULL, cmd, arg); set_fs(old_fs); return res; } @@ -619,6 +669,12 @@ int ret = -ENODEV; kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */ down(&bdev->bd_sem); + + if (get_inode(bdev)) { + up(&bdev->bd_sem); + return -ENOMEM; + } + lock_kernel(); if (!bdev->bd_op) bdev->bd_op = get_blkfops(MAJOR(rdev)); @@ -631,23 +687,22 @@ */ struct file fake_file = {}; struct dentry fake_dentry = {}; - struct inode *fake_inode = get_empty_inode(); ret = -ENOMEM; - if (fake_inode) { - fake_file.f_mode = mode; - fake_file.f_flags = flags; - fake_file.f_dentry = &fake_dentry; - fake_dentry.d_inode = fake_inode; - fake_inode->i_rdev = rdev; - ret = 0; - if (bdev->bd_op->open) - ret = bdev->bd_op->open(fake_inode, &fake_file); - if (!ret) { - bdev->bd_openers++; - atomic_inc(&bdev->bd_count); - } else if (!bdev->bd_openers) - bdev->bd_op = NULL; - iput(fake_inode); + fake_file.f_mode = mode; + fake_file.f_flags = flags; + fake_file.f_dentry = &fake_dentry; + fake_dentry.d_inode = bdev->bd_inode; + ret = 0; + if (bdev->bd_op->open) + ret = bdev->bd_op->open(bdev->bd_inode, &fake_file); + if (!ret) { + bdev->bd_openers++; + atomic_inc(&bdev->bd_count); + } else if (!bdev->bd_openers) { + struct inode *bd_inode = bdev->bd_inode; + bdev->bd_op = NULL; + bdev->bd_inode = NULL; + iput(bd_inode); } } unlock_kernel(); @@ -669,6 +724,12 @@ filp->f_flags |= O_LARGEFILE; down(&bdev->bd_sem); + + if (get_inode(bdev)) { + up(&bdev->bd_sem); + return -ENOMEM; + } + lock_kernel(); if (!bdev->bd_op) bdev->bd_op = get_blkfops(MAJOR(inode->i_rdev)); @@ -678,20 +739,15 @@ ret = bdev->bd_op->open(inode,filp); if (!ret) { bdev->bd_openers++; - if (!bdev->bd_cache_openers && bdev->bd_inode) - BUG(); - if (bdev->bd_cache_openers && !bdev->bd_inode) - BUG(); - if (!bdev->bd_cache_openers++) - bdev->bd_inode = inode; - else { - if (bdev->bd_inode != inode && !inode->i_mapping_overload++) { - inode->i_mapping = bdev->bd_inode->i_mapping; - atomic_inc(&bdev->bd_inode->i_count); - } - } - } else if (!bdev->bd_openers) + bdev->bd_cache_openers++; + inode->i_mapping = bdev->bd_inode->i_mapping; + inode->i_mapping_overload++; + } else if (!bdev->bd_openers) { + struct inode *bd_inode = bdev->bd_inode; bdev->bd_op = NULL; + bdev->bd_inode = NULL; + iput(bd_inode); + } } unlock_kernel(); up(&bdev->bd_sem); @@ -702,28 +758,24 @@ { int ret = 0; kdev_t rdev = to_kdev_t(bdev->bd_dev); /* this should become bdev */ + struct inode *bd_inode = bdev->bd_inode; + down(&bdev->bd_sem); lock_kernel(); if (kind == BDEV_FILE) - fsync_dev(rdev); + __block_fsync(bd_inode); else if (kind == BDEV_FS) fsync_no_super(rdev); /* only filesystems uses buffer cache for the metadata these days */ if (kind == BDEV_FS) invalidate_buffers(rdev); - if (bdev->bd_op->release) { - struct inode * fake_inode = get_empty_inode(); - ret = -ENOMEM; - if (fake_inode) { - fake_inode->i_rdev = rdev; - ret = bdev->bd_op->release(fake_inode, NULL); - iput(fake_inode); - } else - printk(KERN_WARNING "blkdev_put: ->release couldn't be run due -ENOMEM\n"); - } - if (!--bdev->bd_openers) - bdev->bd_op = NULL; /* we can't rely on driver being */ - /* kind to stay around. */ + if (bdev->bd_op->release) + ret = bdev->bd_op->release(bd_inode, NULL); + if (!--bdev->bd_openers) { + bdev->bd_op = NULL; + bdev->bd_inode = NULL; + iput(bd_inode); + } unlock_kernel(); up(&bdev->bd_sem); bdput(bdev); @@ -736,8 +788,6 @@ int ret = 0; struct inode * bd_inode = bdev->bd_inode; - if (bd_inode->i_mapping != inode->i_mapping) - BUG(); down(&bdev->bd_sem); lock_kernel(); /* cache coherency protocol */ @@ -745,11 +795,9 @@ struct super_block * sb; /* flush the pagecache to disk */ - __block_fsync(inode); + __block_fsync(bd_inode); /* drop the pagecache, uptodate info is on disk by now */ truncate_inode_pages(inode->i_mapping, 0); - /* forget the bdev pagecache address space */ - bdev->bd_inode = NULL; /* if the fs was mounted ro just throw away most of its caches */ sb = get_super(inode->i_rdev); @@ -782,16 +830,17 @@ drop_super(sb); } } - if (inode != bd_inode && !--inode->i_mapping_overload) { + if (!--inode->i_mapping_overload) inode->i_mapping = &inode->i_data; - iput(bd_inode); - } /* release the device driver */ if (bdev->bd_op->release) ret = bdev->bd_op->release(inode, NULL); - if (!--bdev->bd_openers) + if (!--bdev->bd_openers) { bdev->bd_op = NULL; + bdev->bd_inode = NULL; + iput(bd_inode); + } unlock_kernel(); up(&bdev->bd_sem); diff -urN S10-pre12/fs/devices.c linux/fs/devices.c --- S10-pre12/fs/devices.c Thu Sep 20 15:27:24 2001 +++ linux/fs/devices.c Thu Sep 20 15:26:09 2001 @@ -206,7 +206,6 @@ inode->i_cdev = cdget(rdev); } else if (S_ISBLK(mode)) { inode->i_fop = &def_blk_fops; - inode->i_mapping->a_ops = &def_blk_aops; inode->i_rdev = to_kdev_t(rdev); inode->i_bdev = bdget(rdev); } else if (S_ISFIFO(mode)) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 20:55 ` Andrea Arcangeli 2001-09-19 21:17 ` Alexander Viro @ 2001-09-19 22:15 ` Richard Gooch 1 sibling, 0 replies; 110+ messages in thread From: Richard Gooch @ 2001-09-19 22:15 UTC (permalink / raw) To: Alexander Viro; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List Alexander Viro writes: > On Wed, 19 Sep 2001, Andrea Arcangeli wrote: > > > > (infact I never had a single report), but well we'll verify that in > > > > > > Richard, is that you? What had you done with real Andrea? > > > > You also screwup things sometime (think the few liner you posts to l-k > > after your cleanups). Those are minor bugs, so I'm not going to panic > > Certainly. > > > on them (ramdisk works not by luck), this is what I meant, and they will > > Sorry, it just sounded so..., well, familiar... Couldn't resist ;-) Oh, yeah. So it seems. "Resistance is futile", eh? > (BTW, Richard, _what_ political whatever could be found in that?) Gee, do I have to spell it out? Just today, I saw you complain to Rik that he'd dragged you into some issue. And now you take my name in vain regarding another issue which I'd not previously been involved in. No, not exactly the same. But I couldn't help but notice a certain parallel... Anyway, this is going nowhere. Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 19:21 ` Alexander Viro 2001-09-19 20:55 ` Andrea Arcangeli @ 2001-09-20 2:34 ` Andrea Arcangeli 2001-09-20 10:52 ` Alexander Viro 1 sibling, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 2:34 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote: > int fd = open("/dev/ram0", O_RDWR); > ioctl(fd, BLKFLSBUF); > ioctl(fd, BLKFLSBUF); here it is the fix below. actually it also notably fixes a subtle security problem with the ramdisk. The fact is that we are asked to read/write from the same pagecache page that also was the destination of the I/O (ramdisk is zerocopy), so we actually need a new bitflag to know if this a page secure, this way we know if we need to bother hiding its contents or not, if it's a secure page we don't need to do anything. It is a zero cost bitflag for the nonramdisk case so it shouldn't be a problem and I cannot imagine a better fix. Now the only bit left in the ramdisk is a race condition in rmmod, I can now see why you were converting rd_inode to rd_bdev :). --- 2.4.10pre11aa1/include/linux/mm.h.~1~ Thu Sep 20 01:20:38 2001 +++ 2.4.10pre11aa1/include/linux/mm.h Thu Sep 20 03:57:25 2001 @@ -282,6 +282,7 @@ #define PG_checked 12 /* kill me in 2.5.<early>. */ #define PG_arch_1 13 #define PG_reserved 14 +#define PG_secure 15 /* Make it prettier to test the above... */ #define Page_Uptodate(page) test_bit(PG_uptodate, &(page)->flags) @@ -295,6 +296,9 @@ #define TryLockPage(page) test_and_set_bit(PG_locked, &(page)->flags) #define PageChecked(page) test_bit(PG_checked, &(page)->flags) #define SetPageChecked(page) set_bit(PG_checked, &(page)->flags) + +#define PageSecure(page) test_bit(PG_secure, &(page)->flags) +#define SetPageSecure(page) set_bit(PG_secure, &(page)->flags) extern void __set_page_dirty(struct page *); --- 2.4.10pre11aa1/mm/filemap.c.~1~ Tue Sep 18 15:39:51 2001 +++ 2.4.10pre11aa1/mm/filemap.c Thu Sep 20 03:59:00 2001 @@ -621,7 +621,7 @@ if (PageLocked(page)) BUG(); - flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked); + flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked | 1 << PG_secure); page->flags = flags | (1 << PG_locked); page_cache_get(page); page->index = offset; --- 2.4.10pre11aa1/mm/shmem.c.~1~ Tue Sep 18 02:43:04 2001 +++ 2.4.10pre11aa1/mm/shmem.c Thu Sep 20 03:59:09 2001 @@ -353,7 +353,7 @@ swap_free(*entry); *entry = (swp_entry_t) {0}; delete_from_swap_cache_nolock(page); - flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_referenced | 1 << PG_arch_1); + flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_secure); page->flags = flags | (1 << PG_dirty); add_to_page_cache_locked(page, mapping, idx); info->swapped--; --- 2.4.10pre11aa1/mm/swap_state.c.~1~ Tue Sep 18 02:43:04 2001 +++ 2.4.10pre11aa1/mm/swap_state.c Thu Sep 20 03:59:20 2001 @@ -70,7 +70,7 @@ BUG(); /* clear PG_dirty so a subsequent set_page_dirty takes effect */ - flags = page->flags & ~(1 << PG_error | 1 << PG_dirty | 1 << PG_arch_1 | 1 << PG_referenced); + flags = page->flags & ~(1 << PG_error | 1 << PG_dirty | 1 << PG_arch_1 | 1 << PG_referenced | 1 << PG_secure); page->flags = flags | (1 << PG_uptodate); add_to_page_cache_locked(page, &swapper_space, entry.val); } --- 2.4.10pre11/drivers/block/rd.c Tue Sep 18 02:42:23 2001 +++ 2.4.10pre11aa1/drivers/block/rd.c Thu Sep 20 04:22:44 2001 @@ -188,11 +188,23 @@ static int rd_blkdev_pagecache_IO(int rw, struct buffer_head * sbh, int minor) { - struct address_space * mapping = rd_inode[minor]->i_mapping; + struct address_space * mapping; unsigned long index; - int offset, size, err = 0; + int offset, size, err; - if (sbh->b_page->mapping == mapping) { + err = -EIO; + /* + * If we did BLKFLSBUF just before doing read/write, + * we could see a rd_inode before the opener had a chance + * to return from rd_open(), that's ok, as soon as we + * see the inode we can use its i_mapping, and the inode + * cannot go away from under us. + */ + if (!rd_inode[minor]) + goto out; + err = 0; + mapping = rd_inode[minor]->i_mapping; + if (sbh->b_page->mapping == mapping && PageSecure(sbh->b_page)) { if (rw != READ) SetPageDirty(sbh->b_page); goto out; @@ -217,7 +229,7 @@ hash = page_hash(mapping, index); page = __find_get_page(mapping, index, hash); - if (!page && rw != READ) { + if (!page) { page = grab_cache_page(mapping, index); err = -ENOMEM; if (!page) @@ -227,18 +239,40 @@ } index++; - if (!page) { - offset = 0; - continue; - } if (rw == READ) { src = kmap(page); + + if (!PageSecure(page)) { + clear_page(src); + SetPageSecure(page); + } + src += offset; dst = bh_kmap(sbh); } else { dst = kmap(page); dst += offset; + + if (!PageSecure(page)) { + unsigned long addr, start, end; + + /* clear around */ + addr = (unsigned long) dst; + if (addr & ~PAGE_CACHE_MASK) { + start = addr & PAGE_CACHE_MASK; + end = addr; + memset((char *) start, 0, end - start); + } + addr = (unsigned long) dst + count; + if (addr & ~PAGE_CACHE_MASK) { + start = addr; + end = (addr + PAGE_CACHE_SIZE - 1) & PAGE_CACHE_MASK; + memset((char *) start, 0, end - start); + } + SetPageSecure(page); + } + src = bh_kmap(sbh); } offset = 0; @@ -321,6 +355,13 @@ struct block_device * bdev = inode->i_bdev; down(&bdev->bd_sem); + /* + * We're the only users here, protected by the + * bd_sem, but first check if we didn't just + * flushed the ramdisk. + */ + if (!rd_inode[minor]) + goto unlock; if (bdev->bd_openers > 2) { up(&bdev->bd_sem); return -EBUSY; @@ -328,8 +369,16 @@ bdev->bd_openers--; bdev->bd_cache_openers--; iput(rd_inode[minor]); + /* + * Make sure the cache is flushed from here + * and not from close(2), somebody + * could reopen the device before we have a + * chance to close it ourself. + */ + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); rd_inode[minor] = NULL; rd_blocksizes[minor] = rd_blocksize; + unlock: up(&bdev->bd_sem); } break; Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 2:34 ` Andrea Arcangeli @ 2001-09-20 10:52 ` Alexander Viro 2001-09-20 18:18 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 10:52 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote: > > int fd = open("/dev/ram0", O_RDWR); > > ioctl(fd, BLKFLSBUF); > > ioctl(fd, BLKFLSBUF); > > here it is the fix below. [snip] > @@ -328,8 +369,16 @@ > bdev->bd_openers--; > bdev->bd_cache_openers--; > iput(rd_inode[minor]); > + /* > + * Make sure the cache is flushed from here > + * and not from close(2), somebody > + * could reopen the device before we have a > + * chance to close it ourself. > + */ > + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); > rd_inode[minor] = NULL; > rd_blocksizes[minor] = rd_blocksize; > + unlock: > up(&bdev->bd_sem); Now think what happens if you go through that code twice. What argument will be passed to iput() the second time you call it? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 10:52 ` Alexander Viro @ 2001-09-20 18:18 ` Andrea Arcangeli 2001-09-20 18:33 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 18:18 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 06:52:07AM -0400, Alexander Viro wrote: > > > On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > > On Wed, Sep 19, 2001 at 03:21:09PM -0400, Alexander Viro wrote: > > > int fd = open("/dev/ram0", O_RDWR); > > > ioctl(fd, BLKFLSBUF); > > > ioctl(fd, BLKFLSBUF); > > > > here it is the fix below. > [snip] > > @@ -328,8 +369,16 @@ > > bdev->bd_openers--; > > bdev->bd_cache_openers--; > > iput(rd_inode[minor]); > > + /* > > + * Make sure the cache is flushed from here > > + * and not from close(2), somebody > > + * could reopen the device before we have a > > + * chance to close it ourself. > > + */ > > + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); > > rd_inode[minor] = NULL; > > rd_blocksizes[minor] = rd_blocksize; > > + unlock: > > up(&bdev->bd_sem); > > Now think what happens if you go through that code twice. What argument will > be passed to iput() the second time you call it? the second time we won't go through that code. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 18:18 ` Andrea Arcangeli @ 2001-09-20 18:33 ` Alexander Viro 2001-09-20 18:59 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 18:33 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > > + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); > > > rd_inode[minor] = NULL; > > > rd_blocksizes[minor] = rd_blocksize; > > > + unlock: > > > up(&bdev->bd_sem); > > > > Now think what happens if you go through that code twice. What argument will > > be passed to iput() the second time you call it? > > the second time we won't go through that code. IOW, subsequent calls of ioctl(fd, BLKFLSBUF) will not work. Which is better than oopsing, but doesn't look right. Question: why the hell do we bother with iput() and decrementing counters at all? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 18:33 ` Alexander Viro @ 2001-09-20 18:59 ` Andrea Arcangeli 2001-09-20 20:41 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 18:59 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 02:33:34PM -0400, Alexander Viro wrote: > > > On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > > > > + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); > > > > rd_inode[minor] = NULL; > > > > rd_blocksizes[minor] = rd_blocksize; > > > > + unlock: > > > > up(&bdev->bd_sem); > > > > > > Now think what happens if you go through that code twice. What argument will > > > be passed to iput() the second time you call it? > > > > the second time we won't go through that code. > > IOW, subsequent calls of ioctl(fd, BLKFLSBUF) will not work. Which is > better than oopsing, but doesn't look right. The second call will do the work if there's something to do. If somebody did an open/read/writes/close in the middle, it should do the work as usual. I actually can see a problem if you use the different inodes, but that will be sorted out too automatically as soon as we stop pinning the inodes. > Question: why the hell do we bother with iput() and decrementing counters > at all? Yes, that part should be rewritten using the bdev as you did recently, the ipinning of the ramdisk also in the previous 2.4 kernels (before your recent patch) had the same problem of my ipinning in the blkdev highlevel (it looked only a cleanup but it wasn't). Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 18:59 ` Andrea Arcangeli @ 2001-09-20 20:41 ` Alexander Viro 2001-09-20 21:18 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 20:41 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > The second call will do the work if there's something to do. If somebody > did an open/read/writes/close in the middle, it should do the work as > usual. I actually can see a problem if you use the different inodes, but > that will be sorted out too automatically as soon as we stop pinning the > inodes. Sigh... Try BLKFLSBUF + write() + BLKFLSBUF. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 20:41 ` Alexander Viro @ 2001-09-20 21:18 ` Andrea Arcangeli 2001-09-20 21:40 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 21:18 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 04:41:54PM -0400, Alexander Viro wrote: > > > On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > > The second call will do the work if there's something to do. If somebody > > did an open/read/writes/close in the middle, it should do the work as > > usual. I actually can see a problem if you use the different inodes, but > > that will be sorted out too automatically as soon as we stop pinning the > > inodes. > > > Sigh... Try BLKFLSBUF + write() + BLKFLSBUF. write will return -EIO and second BLKFLSBUF will do nothing. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 21:18 ` Andrea Arcangeli @ 2001-09-20 21:40 ` Alexander Viro 2001-09-20 22:13 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 21:40 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > Sigh... Try BLKFLSBUF + write() + BLKFLSBUF. > > write will return -EIO and second BLKFLSBUF will do nothing. Now compare that with behaviour of -pre10 (not to mention the (in)sanity of "this ioctl() will make all IO on the fd fail until somebody opens the same file" semantics). ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 21:40 ` Alexander Viro @ 2001-09-20 22:13 ` Andrea Arcangeli 2001-09-20 22:20 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 22:13 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 05:40:00PM -0400, Alexander Viro wrote: > > > On Thu, 20 Sep 2001, Andrea Arcangeli wrote: > > > > Sigh... Try BLKFLSBUF + write() + BLKFLSBUF. > > > > write will return -EIO and second BLKFLSBUF will do nothing. > > Now compare that with behaviour of -pre10 (not to mention the > (in)sanity of "this ioctl() will make all IO on the fd fail until > somebody opens the same file" semantics). It's not that insane: the address space is allocated at open time. After you drop it with BLKFLSBUF you will have to open the device again to reallocate a new address space. I could just truncate the physical address space, there are no other users, but then the inode would remain pinned forever, and so until we include your ipinning fix this looked an acceptable two liner band-aid I guess (again, real fix is yours, all I'm saying is that it can't oops any longer ;). thanks, Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 22:13 ` Andrea Arcangeli @ 2001-09-20 22:20 ` Alexander Viro 2001-09-20 22:31 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 22:20 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Fri, 21 Sep 2001, Andrea Arcangeli wrote: > It's not that insane: the address space is allocated at open time. > After you drop it with BLKFLSBUF you will have to open the device again > to reallocate a new address space. I could just truncate the physical > address space, there are no other users, but then the inode would remain > pinned forever, and so until we include your ipinning fix this looked an > acceptable two liner band-aid I guess (again, real fix is yours, all I'm > saying is that it can't oops any longer ;). OK. Could you resend your patch (or just the page initialization parts of it)? I'm getting to the point where I'll start seriously touching rd.c, so... ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 22:20 ` Alexander Viro @ 2001-09-20 22:31 ` Andrea Arcangeli 2001-09-20 22:44 ` Alexander Viro 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 22:31 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 06:20:39PM -0400, Alexander Viro wrote: > OK. Could you resend your patch (or just the page initialization parts > of it)? I'm getting to the point where I'll start seriously touching > rd.c, so... Of course, however if you want I can first fix initrd (I was just looking into it in the last minutes), the security fix broke initrd badly unfortunately [didn't tested initrd but just the ramdisks before posting it] (not sure why initrd broke at the moment but I believe the design of the fix was the right one, so it is probably an implementation detail). So unless you need it urgently I will try to fix initrd first, then I will send to you so you can go ahead without risk of future rejects. thanks! Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 22:31 ` Andrea Arcangeli @ 2001-09-20 22:44 ` Alexander Viro 2001-09-20 23:03 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 22:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Fri, 21 Sep 2001, Andrea Arcangeli wrote: > Of course, however if you want I can first fix initrd (I was just > looking into it in the last minutes), the security fix broke initrd > badly unfortunately [didn't tested initrd but just the ramdisks before > posting it] (not sure why initrd broke at the moment but I believe the > design of the fix was the right one, so it is probably an implementation > detail). So unless you need it urgently I will try to fix initrd first, > then I will send to you so you can go ahead without risk of future rejects. I'd suggest to stop treating initrd as block device. It has to keep S_IFBLK in mode bits, indeed, but I'd rather set ->f_op upon open() and stop worrying about it. We don't need buffer-cache access to it anyway - if you look carefully you will see that we actually copy its contents to normal ramdisk first. So just having ->read() (essentially copy_to_user()) is perfectly OK. Check how old code was dealing with it - it's really the best way to treat that sucker. We probably will be better off if we make it a character device at some point in 2.5 and move the thing to char/mem.c, but anyway, it had never been a proper block device (== one with requests queue, etc.) and there's no point in making it such now. Again, the proper way to treat it is to convert it into character device at some point. Userland won't actually care - after the boot it's gone. And kernel is using it only via ->read(), so there's no point trying to make it similar to real ramdisks. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 22:44 ` Alexander Viro @ 2001-09-20 23:03 ` Andrea Arcangeli 2001-09-20 23:11 ` Alexander Viro 2001-09-21 3:47 ` Andrea Arcangeli 0 siblings, 2 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-20 23:03 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 06:44:18PM -0400, Alexander Viro wrote: > > > On Fri, 21 Sep 2001, Andrea Arcangeli wrote: > > > Of course, however if you want I can first fix initrd (I was just > > looking into it in the last minutes), the security fix broke initrd > > badly unfortunately [didn't tested initrd but just the ramdisks before > > posting it] (not sure why initrd broke at the moment but I believe the > > design of the fix was the right one, so it is probably an implementation > > detail). So unless you need it urgently I will try to fix initrd first, > > then I will send to you so you can go ahead without risk of future rejects. > > I'd suggest to stop treating initrd as block device. It has to keep > S_IFBLK in mode bits, indeed, but I'd rather set ->f_op upon open() and > stop worrying about it. > > We don't need buffer-cache access to it anyway - if you look carefully > you will see that we actually copy its contents to normal ramdisk > first. So just having ->read() (essentially copy_to_user()) is > perfectly OK. of course, we never used initrd blkdev, we just use the ramdisk always from userspace. Initrd doesn't need to be a blockdev, we could just copy the ram ourself and just call ->write on the ramdisk. userspace just sees /dev/ram0, /dev/initrd is a kernel internal thing that doesn't need to be visible, and if anybody uses /dev/initrd somehow that just insane (and it doesn't even look possible to make anything useful out of /dev/initrd anyways). > Check how old code was dealing with it - it's really the best way to > treat that sucker. We probably will be better off if we make it a > character device at some point in 2.5 and move the thing to char/mem.c, > but anyway, it had never been a proper block device (== one with > requests queue, etc.) and there's no point in making it such now. > > Again, the proper way to treat it is to convert it into character > device at some point. Userland won't actually care - after the > boot it's gone. And kernel is using it only via ->read(), so > there's no point trying to make it similar to real ramdisks. 1) I am tentated to fix the initrd bug by just killing initrd blkdev, completly instead of going to mark PageSecure all the initrd pages. 2) Otherwise I can make a brute hack one liner approch with a global field that ignores the PageSecure bit on reads until I finished to fill in the /dev/ram0 :) (slowdown production paths for a boot thing but doesn't matter, but ok we aren't writing proprietary software here and this isn't the right thing ;) 3) I can just mark the initrd pages as PageSecure, then I can safely forget about them. What do you prefer for the next 2.4.10 mainline? I'd like to have this fixed _fast_ and to be included in mainline, since the security problem is ugly (not too bad though since /dev/ram0 is readable only by root by default) and because people really needs initrd. Suggestions are very welcome. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 23:03 ` Andrea Arcangeli @ 2001-09-20 23:11 ` Alexander Viro 2001-09-21 1:50 ` Alexander Viro 2001-09-21 3:47 ` Andrea Arcangeli 1 sibling, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-20 23:11 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Fri, 21 Sep 2001, Andrea Arcangeli wrote: > 1) I am tentated to fix the initrd bug by just killing initrd blkdev, > completly instead of going to mark PageSecure all the initrd pages. Just put initrd_read(), initrd_release() and initrd_fops back and switch file->f_op on rd_open() as it was in -pre10. Maybe you want to replace blkdev_put(inode->i_bdev, BDEV_FILE) in initrd_release() with blkdev_close(inode, file), but that will be the same pretty soon. It's really the simplest way to deal with that and it has a benefit of being both straightforward and well-tested. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 23:11 ` Alexander Viro @ 2001-09-21 1:50 ` Alexander Viro 2001-09-21 2:42 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-21 1:50 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List Andrea, what's the point of making blkdev_get() bump ->bd_count in case of success and blkdev_put() - drop it? We _do_ grab a reference before calling blkdev_get() - any place where we don't is an immediately oopsable hole both in the old an in the new tree. Notice that you do down(&bdev->bd_sem) before that increment of refcount, so if caller doesn't hold a reference we are toast. What's going on there? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-21 1:50 ` Alexander Viro @ 2001-09-21 2:42 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-21 2:42 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Thu, Sep 20, 2001 at 09:50:52PM -0400, Alexander Viro wrote: > Andrea, what's the point of making blkdev_get() bump ->bd_count > in case of success and blkdev_put() - drop it? We _do_ grab a reference > before calling blkdev_get() - any place where we don't is an immediately > oopsable hole both in the old an in the new tree. Notice that you > do down(&bdev->bd_sem) before that increment of refcount, so if caller > doesn't hold a reference we are toast. What's going on there? I just wanted to make sure the bdev couldn't be released under us by owning a reference for the whole duration of the blkdev_get/put. But requiring the caller to hold the reference for us seems saner since the caller will have to pass the bdev as parameter anyways, so yes it seems superflous. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-20 23:03 ` Andrea Arcangeli 2001-09-20 23:11 ` Alexander Viro @ 2001-09-21 3:47 ` Andrea Arcangeli 2001-09-21 4:00 ` Alexander Viro 2001-09-21 4:06 ` Andrea Arcangeli 1 sibling, 2 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-21 3:47 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Fri, Sep 21, 2001 at 01:03:40AM +0200, Andrea Arcangeli wrote: > What do you prefer for the next 2.4.10 mainline? I'd like to have this Famous last words, after a few hours of debugging mixed with vm patches and emails, I finally got around finding the real bug. The fact is that the secure-ramdisk logic was totally broken, not just for initrd, oh well, so please don't apply such patch (code in mainline has the security issue if you allow an luser to read from /dev/ram0, but it isn't buggy). and the issue is quite unfixable with just a PageSecure set inside rd.c. The fact is that I cannot just clear-around the written "bh", around there could be the source for the next block to write and I cannot zero it out. It is getting harder to fix this one just inside the ->make_request callback... Hints? Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-21 3:47 ` Andrea Arcangeli @ 2001-09-21 4:00 ` Alexander Viro 2001-09-21 4:06 ` Andrea Arcangeli 2001-09-21 4:06 ` Andrea Arcangeli 1 sibling, 1 reply; 110+ messages in thread From: Alexander Viro @ 2001-09-21 4:00 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Kernel Mailing List On Fri, 21 Sep 2001, Andrea Arcangeli wrote: > Famous last words, after a few hours of debugging mixed with vm patches > and emails, I finally got around finding the real bug. The fact is that > the secure-ramdisk logic was totally broken, not just for initrd, oh > well, so please don't apply such patch (code in mainline has the > security issue if you allow an luser to read from /dev/ram0, but it > isn't buggy). and the issue is quite unfixable with just a PageSecure > set inside rd.c. The fact is that I cannot just clear-around the > written "bh", around there could be the source for the next block to > write and I cannot zero it out. It is getting harder to fix this one > just inside the ->make_request callback... Hints? Well, taking a file on ramfs and doing losetup on it should be equivalent to ramdisk. Turning relevant pieces into a driver shouldn't be too hard. It won't be pretty, though - you'll probably want different address_space_operations, so that read()/write() wouldn't bother with requests at all. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-21 4:00 ` Alexander Viro @ 2001-09-21 4:06 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-21 4:06 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Fri, Sep 21, 2001 at 12:00:35AM -0400, Alexander Viro wrote: > Well, taking a file on ramfs and doing losetup on it should be equivalent > to ramdisk. Turning relevant pieces into a driver shouldn't be too hard. > It won't be pretty, though - you'll probably want different > address_space_operations, so that read()/write() wouldn't bother with > requests at all. My same idea, agreed. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-21 3:47 ` Andrea Arcangeli 2001-09-21 4:00 ` Alexander Viro @ 2001-09-21 4:06 ` Andrea Arcangeli 2001-09-21 4:46 ` Andrea Arcangeli 1 sibling, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-21 4:06 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Fri, Sep 21, 2001 at 05:47:49AM +0200, Andrea Arcangeli wrote: > write and I cannot zero it out. It is getting harder to fix this one > just inside the ->make_request callback... Hints? I think the best fix is to have the ramdisk using the same aops of ramfs and replace "Secure" with "Uptodate". We need to trap the security issue at the higher layer and also this will avoid us having to map useless bh, so it should be an improvement, only the filesystem will end triggering the ->make_request callback of the ramdisk. Then if the fs does I/O on stuff out of the physical address space we'll just clear it out. Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-21 4:06 ` Andrea Arcangeli @ 2001-09-21 4:46 ` Andrea Arcangeli 2001-09-21 7:09 ` Andrea Arcangeli 0 siblings, 1 reply; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-21 4:46 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Fri, Sep 21, 2001 at 06:06:25AM +0200, Andrea Arcangeli wrote: > On Fri, Sep 21, 2001 at 05:47:49AM +0200, Andrea Arcangeli wrote: > > write and I cannot zero it out. It is getting harder to fix this one > > just inside the ->make_request callback... Hints? > > I think the best fix is to have the ramdisk using the same aops of ramfs > and replace "Secure" with "Uptodate". We need to trap the security issue > at the higher layer and also this will avoid us having to map useless > bh, so it should be an improvement, only the filesystem will end > triggering the ->make_request callback of the ramdisk. Then if the fs > does I/O on stuff out of the physical address space we'll just clear it > out. ok, here it is, this obsoletes the previous patch (so again against 2.4.10pre12), initrd works again. Al, feel free to expand over rd.c now. diff -urN ramdisk-ref/drivers/block/rd.c ramdisk/drivers/block/rd.c --- ramdisk-ref/drivers/block/rd.c Fri Sep 21 06:39:28 2001 +++ ramdisk/drivers/block/rd.c Fri Sep 21 06:26:37 2001 @@ -186,17 +186,77 @@ #endif +/* + * Copyright (C) 2000 Linus Torvalds. + * 2000 Transmeta Corp. + * aops copied from ramfs. + */ +static int ramdisk_readpage(struct file *file, struct page * page) +{ + if (!Page_Uptodate(page)) { + memset(kmap(page), 0, PAGE_CACHE_SIZE); + kunmap(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + UnlockPage(page); + return 0; +} + +/* + * Writing: just make sure the page gets marked dirty, so that + * the page stealer won't grab it. + */ +static int ramdisk_writepage(struct page *page) +{ + SetPageDirty(page); + UnlockPage(page); + return 0; +} + +static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) +{ + void *addr = kmap(page); + if (!Page_Uptodate(page)) { + memset(addr, 0, PAGE_CACHE_SIZE); + flush_dcache_page(page); + SetPageUptodate(page); + } + SetPageDirty(page); + return 0; +} + +static int ramdisk_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to) +{ + kunmap(page); + return 0; +} + +struct address_space_operations ramdisk_aops = { + readpage: ramdisk_readpage, + writepage: ramdisk_writepage, + prepare_write: ramdisk_prepare_write, + commit_write: ramdisk_commit_write, +}; + static int rd_blkdev_pagecache_IO(int rw, struct buffer_head * sbh, int minor) { - struct address_space * mapping = rd_inode[minor]->i_mapping; + struct address_space * mapping; unsigned long index; - int offset, size, err = 0; + int offset, size, err; - if (sbh->b_page->mapping == mapping) { - if (rw != READ) - SetPageDirty(sbh->b_page); + err = -EIO; + /* + * If we did BLKFLSBUF just before doing read/write, + * we could see a rd_inode before the opener had a chance + * to return from rd_open(), that's ok, as soon as we + * see the inode we can use its i_mapping, and the inode + * cannot go away from under us. + */ + if (!rd_inode[minor]) goto out; - } + err = 0; + mapping = rd_inode[minor]->i_mapping; index = sbh->b_rsector >> (PAGE_CACHE_SHIFT - 9); offset = (sbh->b_rsector << 9) & ~PAGE_CACHE_MASK; @@ -206,8 +266,7 @@ int count; struct page ** hash; struct page * page; - const char * src; - char * dst; + char * src, * dst; int unlock = 0; count = PAGE_CACHE_SIZE - offset; @@ -217,20 +276,24 @@ hash = page_hash(mapping, index); page = __find_get_page(mapping, index, hash); - if (!page && rw != READ) { + if (!page) { page = grab_cache_page(mapping, index); err = -ENOMEM; if (!page) goto out; err = 0; + + if (!Page_Uptodate(page)) { + memset(kmap(page), 0, PAGE_CACHE_SIZE); + kunmap(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + unlock = 1; } index++; - if (!page) { - offset = 0; - continue; - } if (rw == READ) { src = kmap(page); @@ -321,6 +384,13 @@ struct block_device * bdev = inode->i_bdev; down(&bdev->bd_sem); + /* + * We're the only users here, protected by the + * bd_sem, but first check if we didn't just + * flushed the ramdisk. + */ + if (!rd_inode[minor]) + goto unlock; if (bdev->bd_openers > 2) { up(&bdev->bd_sem); return -EBUSY; @@ -328,8 +398,16 @@ bdev->bd_openers--; bdev->bd_cache_openers--; iput(rd_inode[minor]); + /* + * Make sure the cache is flushed from here + * and not from close(2), somebody + * could reopen the device before we have a + * chance to close it ourself. + */ + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); rd_inode[minor] = NULL; rd_blocksizes[minor] = rd_blocksize; + unlock: up(&bdev->bd_sem); } break; @@ -421,6 +499,8 @@ bdev->bd_openers++; bdev->bd_cache_openers++; bdev->bd_inode = inode; + + inode->i_mapping->a_ops = &ramdisk_aops; } } Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-21 4:46 ` Andrea Arcangeli @ 2001-09-21 7:09 ` Andrea Arcangeli 0 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-21 7:09 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Fri, Sep 21, 2001 at 06:46:21AM +0200, Andrea Arcangeli wrote: > On Fri, Sep 21, 2001 at 06:06:25AM +0200, Andrea Arcangeli wrote: > > On Fri, Sep 21, 2001 at 05:47:49AM +0200, Andrea Arcangeli wrote: > > > write and I cannot zero it out. It is getting harder to fix this one > > > just inside the ->make_request callback... Hints? > > > > I think the best fix is to have the ramdisk using the same aops of ramfs > > and replace "Secure" with "Uptodate". We need to trap the security issue > > at the higher layer and also this will avoid us having to map useless > > bh, so it should be an improvement, only the filesystem will end > > triggering the ->make_request callback of the ramdisk. Then if the fs > > does I/O on stuff out of the physical address space we'll just clear it > > out. > > ok, here it is, this obsoletes the previous patch (so again against > 2.4.10pre12), initrd works again. rediffed against pre13 (I'm passing the bd_inode to the ->open/release, so we can modify the a_ops of the physical address space, ramdisk needs that in rd_open()) diff -urN ramdisk-ref/drivers/block/rd.c ramdisk/drivers/block/rd.c --- ramdisk-ref/drivers/block/rd.c Fri Sep 21 08:28:04 2001 +++ ramdisk/drivers/block/rd.c Fri Sep 21 08:06:02 2001 @@ -186,17 +186,77 @@ #endif +/* + * Copyright (C) 2000 Linus Torvalds. + * 2000 Transmeta Corp. + * aops copied from ramfs. + */ +static int ramdisk_readpage(struct file *file, struct page * page) +{ + if (!Page_Uptodate(page)) { + memset(kmap(page), 0, PAGE_CACHE_SIZE); + kunmap(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + UnlockPage(page); + return 0; +} + +/* + * Writing: just make sure the page gets marked dirty, so that + * the page stealer won't grab it. + */ +static int ramdisk_writepage(struct page *page) +{ + SetPageDirty(page); + UnlockPage(page); + return 0; +} + +static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to) +{ + void *addr = kmap(page); + if (!Page_Uptodate(page)) { + memset(addr, 0, PAGE_CACHE_SIZE); + flush_dcache_page(page); + SetPageUptodate(page); + } + SetPageDirty(page); + return 0; +} + +static int ramdisk_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to) +{ + kunmap(page); + return 0; +} + +struct address_space_operations ramdisk_aops = { + readpage: ramdisk_readpage, + writepage: ramdisk_writepage, + prepare_write: ramdisk_prepare_write, + commit_write: ramdisk_commit_write, +}; + static int rd_blkdev_pagecache_IO(int rw, struct buffer_head * sbh, int minor) { - struct address_space * mapping = rd_inode[minor]->i_mapping; + struct address_space * mapping; unsigned long index; - int offset, size, err = 0; + int offset, size, err; - if (sbh->b_page->mapping == mapping) { - if (rw != READ) - SetPageDirty(sbh->b_page); + err = -EIO; + /* + * If we did BLKFLSBUF just before doing read/write, + * we could see a rd_inode before the opener had a chance + * to return from rd_open(), that's ok, as soon as we + * see the inode we can use its i_mapping, and the inode + * cannot go away from under us. + */ + if (!rd_inode[minor]) goto out; - } + err = 0; + mapping = rd_inode[minor]->i_mapping; index = sbh->b_rsector >> (PAGE_CACHE_SHIFT - 9); offset = (sbh->b_rsector << 9) & ~PAGE_CACHE_MASK; @@ -206,8 +266,7 @@ int count; struct page ** hash; struct page * page; - const char * src; - char * dst; + char * src, * dst; int unlock = 0; count = PAGE_CACHE_SIZE - offset; @@ -217,20 +276,24 @@ hash = page_hash(mapping, index); page = __find_get_page(mapping, index, hash); - if (!page && rw != READ) { + if (!page) { page = grab_cache_page(mapping, index); err = -ENOMEM; if (!page) goto out; err = 0; + + if (!Page_Uptodate(page)) { + memset(kmap(page), 0, PAGE_CACHE_SIZE); + kunmap(page); + flush_dcache_page(page); + SetPageUptodate(page); + } + unlock = 1; } index++; - if (!page) { - offset = 0; - continue; - } if (rw == READ) { src = kmap(page); @@ -321,6 +384,13 @@ struct block_device * bdev = inode->i_bdev; down(&bdev->bd_sem); + /* + * We're the only users here, protected by the + * bd_sem, but first check if we didn't just + * flushed the ramdisk. + */ + if (!rd_inode[minor]) + goto unlock; if (bdev->bd_openers > 2) { up(&bdev->bd_sem); return -EBUSY; @@ -328,8 +398,16 @@ bdev->bd_openers--; bdev->bd_cache_openers--; iput(rd_inode[minor]); + /* + * Make sure the cache is flushed from here + * and not from close(2), somebody + * could reopen the device before we have a + * chance to close it ourself. + */ + truncate_inode_pages(rd_inode[minor]->i_mapping, 0); rd_inode[minor] = NULL; rd_blocksizes[minor] = rd_blocksize; + unlock: up(&bdev->bd_sem); } break; @@ -420,6 +498,8 @@ /* bdev->bd_sem is held by caller */ bdev->bd_openers++; bdev->bd_cache_openers++; + + bdev->bd_inode->i_mapping->a_ops = &ramdisk_aops; } } diff -urN ramdisk-ref/fs/block_dev.c ramdisk/fs/block_dev.c --- ramdisk-ref/fs/block_dev.c Fri Sep 21 07:48:12 2001 +++ ramdisk/fs/block_dev.c Fri Sep 21 08:25:29 2001 @@ -736,7 +736,7 @@ if (bdev->bd_op) { ret = 0; if (bdev->bd_op->open) - ret = bdev->bd_op->open(inode,filp); + ret = bdev->bd_op->open(bdev->bd_inode,filp); if (!ret) { bdev->bd_openers++; bdev->bd_cache_openers++; @@ -835,7 +835,7 @@ /* release the device driver */ if (bdev->bd_op->release) - ret = bdev->bd_op->release(inode, NULL); + ret = bdev->bd_op->release(bd_inode, NULL); if (!--bdev->bd_openers) { bdev->bd_op = NULL; bdev->bd_inode = NULL; Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-19 18:25 ` Andrea Arcangeli 2001-09-19 19:21 ` Alexander Viro @ 2001-09-19 20:41 ` Richard Gooch 1 sibling, 0 replies; 110+ messages in thread From: Richard Gooch @ 2001-09-19 20:41 UTC (permalink / raw) To: Alexander Viro; +Cc: Andrea Arcangeli, Linus Torvalds, Kernel Mailing List Alexander Viro writes: > On Wed, 19 Sep 2001, Andrea Arcangeli wrote: > > (infact I never had a single report), but well we'll verify that in > > Richard, is that you? What had you done with real Andrea? To throw your own crap back at you: "don't drag me into your political crap". Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:19 ` Alexander Viro 2001-09-18 18:27 ` Linus Torvalds @ 2001-09-19 13:38 ` Rik van Riel 2001-09-19 16:35 ` Andrea Arcangeli 2 siblings, 0 replies; 110+ messages in thread From: Rik van Riel @ 2001-09-19 13:38 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Tue, 18 Sep 2001, Alexander Viro wrote: > On Tue, 18 Sep 2001, Linus Torvalds wrote: > > > Funny that you mention it, because I actually have a cunning plan, and > > you're an unwitting part of it. > > /me swears No need to get stressed, if you don't like it you don't do it and Linus will have to do things himself ;) Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 18:19 ` Alexander Viro 2001-09-18 18:27 ` Linus Torvalds 2001-09-19 13:38 ` Rik van Riel @ 2001-09-19 16:35 ` Andrea Arcangeli 2 siblings, 0 replies; 110+ messages in thread From: Andrea Arcangeli @ 2001-09-19 16:35 UTC (permalink / raw) To: Alexander Viro; +Cc: Linus Torvalds, Kernel Mailing List On Tue, Sep 18, 2001 at 02:19:42PM -0400, Alexander Viro wrote: > with Andrea and once he gets back to life (no, 'e's just sleepin') we'll yes, I'm back to life but I had a surprise: but my ISP is down so I'm now masqueraded my dekstop over a wirless GPRS link via irda on the laptop [GRPS is flat rate in Italy :) ] so it's quite dogslow at the moment, I cannot read emails and RTT over GSM is of the order of 1 second (but I can send emails and slowly browse the web). I still hope the ISP returns alive before tomorrow... (I'm looking into Andrew's report at the moment and I don't need good connectivity for debugging it luckily :) I think the bug we have in the inode pinning isn't a showstopper (I probably could live with it forever without ever have chance to notice it), so no user should be hurted by it [but I understand it can hurt you very much], and I am _very_ concerned in addressing it ASAP and I'm very happy you spoteed it immediatly. I'll also check your worries with the ramdisk BLKFLSBUF (the logic should be just like before the changes, the reason I backed out the rd_bdev changes is because I was always working with inodes for everything related to the physical address space, not with bdevs so I didn't see a value in the change, but after the bugfix [bdev->bd_mapping] probably this will change), for the other issue with the ->writepage as said it is fine and intentional, except we can do something more efficient for the ramdisk adding some special case to the memory balancing code (for istance we could always put the ramdisk stuff over the active list rather than rolling it over in the inactive list again, but quite frankly that's low prio unless somebody has a real patological case and of course that applies and it always applied to ramfs too [I did it with the argument: it works for ramfs so it will obviously work work for ramdisk too]). In the next days I'll also be very busy with some non linux-kernel developement issue, but I'll try to be still as much responsive as possible for anything urgent (I'll certainly be 100% responsive again next week starting from Wed 26). At the moment my very first prio is to address the BUG() reported by Andrew (if it is just been fixed please somebody at SuSE send me an SMS so I don't waste time since as said I cannot read emails at the moment). Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 @ 2001-09-18 3:18 Ed Tomlinson 2001-09-18 2:31 ` Magnus Naeslund(f) 2001-09-18 3:15 ` Alan Cox 0 siblings, 2 replies; 110+ messages in thread From: Ed Tomlinson @ 2001-09-18 3:18 UTC (permalink / raw) To: linux-kernel Hi, You are fast. Was just going report this one... Using debian sid with gcc 2.95.4. Both before and after appling the patch below I get: ld -m elf_i386 -T /usr/src/linux/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \ --start-group \ arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \ drivers/char/char.o drivers/block/block.o drivers/misc/misc.o drivers/net/net.o drivers/media/media.o drivers/char/drm/drm.o drivers/ide/idedriver.o drivers/cdrom/driver.o drivers/pci/driver.o drivers/video/video.o drivers/usb/usbdrv.o drivers/md/mddev.o \ net/network.o \ /usr/src/linux/arch/i386/lib/lib.a /usr/src/linux/lib/lib.a /usr/src/linux/arch/i386/lib/lib.a \ --end-group \ -o vmlinux mm/mm.o: In function `kmem_cache_alloc': mm/mm.o(.text+0x8154): undefined reference to `__builtin_expect' mm/mm.o(.text+0x8173): undefined reference to `__builtin_expect' mm/mm.o(.text+0x81cc): undefined reference to `__builtin_expect' mm/mm.o: In function `kmalloc': mm/mm.o(.text+0x8294): undefined reference to `__builtin_expect' mm/mm.o(.text+0x82b3): undefined reference to `__builtin_expect' mm/mm.o(.text+0x830c): more undefined references to `__builtin_expect' follow make: *** [vmlinux] Error 1 Impling the patch below is incomplete? TIA Ed Tomlinson Andrea Arcangeli wrote: > ah I just got a (verbose) report about one compilation trouble with old > kernels, the 00_rwsem patch that includes this backwards compatibility > hunk wasn't applied. This is a very very minor issue. Please Linus > include this patch to fix the compilation troubles with the old > compilers: > > diff -urN 2.4.10pre10/include/linux/compiler.h rwsem/include/linux/compiler.h > --- 2.4.10pre10/include/linux/compiler.h Thu Jan 1 01:00:00 1970 > +++ rwsem/include/linux/compiler.h Tue Sep 18 02:02:08 2001 > @@ -0,0 +1,13 @@ > +#ifndef __LINUX_COMPILER_H > +#define __LINUX_COMPILER_H > + > +/* Somewhere in the middle of the GCC 2.96 development cycle, we implemented > + a mechanism by which the user can annotate likely branch directions and > + expect the blocks to be reordered appropriately. Define __builtin_expect > + to nothing for earlier compilers. */ > + > +#if __GNUC__ == 2 && __GNUC_MINOR__ < 96ld -m elf_i386 -T /usr/src/linux/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \ --start-group \ arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \ drivers/char/char.o drivers/block/block.o drivers/misc/misc.o drivers/net/net.o drivers/media/media.o drivers/char/drm/drm.o drivers/ide/idedriver.o drivers/cdrom/driver.o drivers/pci/driver.o drivers/video/video.o drivers/usb/usbdrv.o drivers/md/mddev.o \ net/network.o \ /usr/src/linux/arch/i386/lib/lib.a /usr/src/linux/lib/lib.a /usr/src/linux/arch/i386/lib/lib.a \ --end-group \ -o vmlinux mm/mm.o: In function `kmem_cache_alloc': mm/mm.o(.text+0x8154): undefined reference to `__builtin_expect' mm/mm.o(.text+0x8173): undefined reference to `__builtin_expect' mm/mm.o(.text+0x81cc): undefined reference to `__builtin_expect' mm/mm.o: In function `kmalloc': mm/mm.o(.text+0x8294): undefined reference to `__builtin_expect' mm/mm.o(.text+0x82b3): undefined reference to `__builtin_expect' mm/mm.o(.text+0x830c): more undefined references to `__builtin_expect' follow make: *** [vmlinux] Error 1 > +#define __builtin_expect(x, expected_value) (x) > +#endif > + > +#endif /* __LINUX_COMPILER_H */ > > Then you need to simply include <linux/compile.h> from the files that > doesn't compile, this second patch is untested but it should do the > trick (00_rwsem-?? was including compile.h from the rwsem includes that > were in turn included by those files implicitly so I didn't need to add > compile.h in each file): > > --- 2.4.10pre11/mm/slab.c.~1~ Tue Sep 18 02:43:04 2001 > +++ 2.4.10pre11/mm/slab.c Tue Sep 18 04:00:30 2001 > @@ -72,6 +72,7 @@ > #include <linux/slab.h> > #include <linux/interrupt.h> > #include <linux/init.h> > +#include <linux/compile.h> > #include <asm/uaccess.h> > > /* > --- 2.4.10pre11/mm/page_alloc.c.~1~ Tue Sep 18 02:43:04 2001 > +++ 2.4.10pre11/mm/page_alloc.c Tue Sep 18 04:00:20 2001 > @@ -17,6 +17,7 @@ > #include <linux/pagemap.h> > #include <linux/bootmem.h> > #include <linux/slab.h> > +#include <linux/compile.h> > > int nr_swap_pages; > int nr_active_pages; > --- 2.4.10pre11/mm/vmscan.c.~1~ Tue Sep 18 02:43:04 2001 > +++ 2.4.10pre11/mm/vmscan.c Tue Sep 18 04:00:47 2001 > @@ -21,6 +21,7 @@ > #include <linux/init.h> > #include <linux/highmem.h> > #include <linux/file.h> > +#include <linux/compile.h> > > #include <asm/pgalloc.h> > > > Andrea ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:18 Ed Tomlinson @ 2001-09-18 2:31 ` Magnus Naeslund(f) 2001-09-18 2:49 ` David B. Stevens 2001-09-18 3:38 ` Ed Tomlinson 2001-09-18 3:15 ` Alan Cox 1 sibling, 2 replies; 110+ messages in thread From: Magnus Naeslund(f) @ 2001-09-18 2:31 UTC (permalink / raw) To: Ed Tomlinson; +Cc: linux-kernel From: "Ed Tomlinson" <tomlins@cam.org> > Hi, > > You are fast. Was just going report this one... > Using debian sid with gcc 2.95.4. Both before and after > appling the patch below I get: > The patch seems wrong... The files include "compile.h", but should include "compiler.h", no? Magnus ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 2:31 ` Magnus Naeslund(f) @ 2001-09-18 2:49 ` David B. Stevens 2001-09-18 3:38 ` Ed Tomlinson 1 sibling, 0 replies; 110+ messages in thread From: David B. Stevens @ 2001-09-18 2:49 UTC (permalink / raw) To: Magnus Naeslund(f); +Cc: Ed Tomlinson, linux-kernel Greetings, "Magnus Naeslund(f)" wrote: > > From: "Ed Tomlinson" <tomlins@cam.org> > > Hi, > > > > You are fast. Was just going report this one... > > Using debian sid with gcc 2.95.4. Both before and after > > appling the patch below I get: > > > > The patch seems wrong... > The files include "compile.h", but should include "compiler.h", no? > > Magnus You are correct Magnus, it should be compiler.h in the three mm modules. Cheers, Dave ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 2:31 ` Magnus Naeslund(f) 2001-09-18 2:49 ` David B. Stevens @ 2001-09-18 3:38 ` Ed Tomlinson 1 sibling, 0 replies; 110+ messages in thread From: Ed Tomlinson @ 2001-09-18 3:38 UTC (permalink / raw) To: Magnus Naeslund(f); +Cc: linux-kernel On September 17, 2001 10:31 pm, Magnus Naeslund(f) wrote: > From: "Ed Tomlinson" <tomlins@cam.org> > > > Hi, > > > > You are fast. Was just going report this one... > > Using debian sid with gcc 2.95.4. Both before and after > > appling the patch below I get: > > The patch seems wrong... > The files include "compile.h", but should include "compiler.h", no? I should have caught that... It works changeing compile.h to compiler.h Thanks Ed Tomlinson ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:18 Ed Tomlinson 2001-09-18 2:31 ` Magnus Naeslund(f) @ 2001-09-18 3:15 ` Alan Cox 2001-09-18 4:41 ` H. Peter Anvin 1 sibling, 1 reply; 110+ messages in thread From: Alan Cox @ 2001-09-18 3:15 UTC (permalink / raw) To: Ed Tomlinson; +Cc: linux-kernel > You are fast. Was just going report this one... > Using debian sid with gcc 2.95.4. Both before and after > appling the patch below I get: You need gcc 2.96 or higher to build the pre11 tree. I doubt that was intentional. Basically rip out all use of __builtin_expect Alan ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 3:15 ` Alan Cox @ 2001-09-18 4:41 ` H. Peter Anvin 0 siblings, 0 replies; 110+ messages in thread From: H. Peter Anvin @ 2001-09-18 4:41 UTC (permalink / raw) To: linux-kernel Followup to: <E15jBLy-0008UF-00@the-village.bc.nu> By author: Alan Cox <alan@lxorguk.ukuu.org.uk> In newsgroup: linux.dev.kernel > > You need gcc 2.96 or higher to build the pre11 tree. I doubt that was > intentional. Basically rip out all use of __builtin_expect > Perhaps we should have a header which does #define __builtin_expect(X) if your gcc version is 2.91-95? -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt <amsp@zytor.com> ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 @ 2001-09-18 6:14 Alexei Podtelezhnikov 2001-09-18 13:51 ` Rik van Riel 0 siblings, 1 reply; 110+ messages in thread From: Alexei Podtelezhnikov @ 2001-09-18 6:14 UTC (permalink / raw) To: linux-kernel I praise Linus for this step. Since 9-10 months ago Rik's scheme didn't det enough attention to get fixed. Out of the main tree it has better chances, just like it did before inclusion. Alexei ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: Linux 2.4.10-pre11 2001-09-18 6:14 Alexei Podtelezhnikov @ 2001-09-18 13:51 ` Rik van Riel 0 siblings, 0 replies; 110+ messages in thread From: Rik van Riel @ 2001-09-18 13:51 UTC (permalink / raw) To: Alexei Podtelezhnikov; +Cc: linux-kernel On Mon, 17 Sep 2001, Alexei Podtelezhnikov wrote: > Since 9-10 months ago Rik's scheme didn't det enough attention to get > fixed. Out of the main tree it has better chances, just like it did > before inclusion. Umm, we _have_ been fixing the VM in 2.4, but Linus has been randomly ignoring patches and introducing stuff we warned him not to apply which broke stuff. With maintainance like that, I'm sure Andrea's code will also have a better chance out of the main tree ;) cheers, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 110+ messages in thread
end of thread, other threads:[~2001-09-21 7:09 UTC | newest] Thread overview: 110+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2001-09-18 0:08 Linux 2.4.10-pre11 Linus Torvalds 2001-09-17 23:17 ` Marcelo Tosatti 2001-09-18 1:08 ` Marcelo Tosatti 2001-09-18 3:37 ` Andrea Arcangeli 2001-09-18 2:25 ` Marcelo Tosatti 2001-09-18 3:58 ` Andrea Arcangeli 2001-09-18 2:53 ` Marcelo Tosatti 2001-09-18 4:54 ` Andrea Arcangeli 2001-09-18 3:33 ` Marcelo Tosatti 2001-09-18 5:06 ` Andrea Arcangeli 2001-09-18 3:55 ` Marcelo Tosatti 2001-09-18 5:32 ` Andrea Arcangeli 2001-09-18 4:14 ` Marcelo Tosatti 2001-09-18 5:59 ` Andrea Arcangeli 2001-09-18 5:00 ` Marcelo Tosatti [not found] ` <20010917211834.A31693@redhat.com> [not found] ` <20010918035055.J698@athlon.random> 2001-09-18 2:02 ` Andrea Arcangeli [not found] ` <20010917221653.B31693@redhat.com> 2001-09-18 2:27 ` Linus Torvalds 2001-09-18 3:14 ` Alan Cox 2001-09-18 3:26 ` Andrea Arcangeli [not found] ` <20010918052201.N698@athlon.random> 2001-09-18 4:01 ` Benjamin LaHaise 2001-09-18 4:39 ` Andrea Arcangeli 2001-09-18 5:04 ` Alan Cox 2001-09-18 5:09 ` Andrea Arcangeli 2001-09-18 5:22 ` Benjamin LaHaise 2001-09-18 5:48 ` Andrea Arcangeli 2001-09-18 5:48 ` Andrew Morton 2001-09-18 6:11 ` Andrea Arcangeli 2001-09-18 5:02 ` Marcelo Tosatti 2001-09-18 6:40 ` Andrea Arcangeli 2001-09-18 16:06 ` Marcelo Tosatti 2001-09-18 19:18 ` Marcelo Tosatti 2001-09-18 21:05 ` Andrea Arcangeli 2001-09-19 13:57 ` Rik van Riel 2001-09-18 10:58 ` Martin Dalecki 2001-09-18 9:31 ` Alexander Viro 2001-09-18 9:39 ` Andrea Arcangeli 2001-09-18 9:44 ` Alexander Viro 2001-09-18 9:57 ` Andrea Arcangeli 2001-09-18 10:02 ` Alexander Viro 2001-09-18 10:17 ` Andrea Arcangeli 2001-09-18 10:28 ` Alexander Viro 2001-09-18 10:35 ` Andrea Arcangeli 2001-09-18 10:52 ` Alexander Viro 2001-09-18 11:05 ` Helge Hafting 2001-09-18 12:40 ` Andrea Arcangeli 2001-09-18 17:02 ` Linus Torvalds 2001-09-18 16:45 ` Linus Torvalds 2001-09-18 18:19 ` Alexander Viro 2001-09-18 18:27 ` Linus Torvalds 2001-09-18 19:14 ` Andreas Dilger 2001-09-18 19:41 ` Alexander Viro 2001-09-18 20:33 ` Richard Gooch 2001-09-18 20:53 ` Alexander Viro 2001-09-18 21:06 ` Richard Gooch 2001-09-18 21:27 ` Alexander Viro 2001-09-18 19:29 ` Benjamin LaHaise 2001-09-18 20:17 ` Stephan von Krawczynski 2001-09-18 20:33 ` Alan Cox 2001-09-19 13:42 ` Rik van Riel 2001-09-19 14:27 ` Alexander Viro 2001-09-19 2:59 ` Michael Peddemors 2001-09-19 16:11 ` Alexander Viro 2001-09-19 18:25 ` Andrea Arcangeli 2001-09-19 19:21 ` Alexander Viro 2001-09-19 20:55 ` Andrea Arcangeli 2001-09-19 21:17 ` Alexander Viro 2001-09-19 23:01 ` Andrea Arcangeli 2001-09-19 23:03 ` Andrea Arcangeli 2001-09-19 23:30 ` Alexander Viro 2001-09-19 23:40 ` Andrea Arcangeli 2001-09-20 13:56 ` Alexander Viro 2001-09-20 14:38 ` Chris Mason 2001-09-20 14:50 ` Alexander Viro 2001-09-20 15:44 ` Chris Mason 2001-09-20 16:43 ` Alexander Viro 2001-09-20 20:54 ` [PATCH] fs/block_dev.c cleanup Alexander Viro 2001-09-19 22:15 ` Linux 2.4.10-pre11 Richard Gooch 2001-09-20 2:34 ` Andrea Arcangeli 2001-09-20 10:52 ` Alexander Viro 2001-09-20 18:18 ` Andrea Arcangeli 2001-09-20 18:33 ` Alexander Viro 2001-09-20 18:59 ` Andrea Arcangeli 2001-09-20 20:41 ` Alexander Viro 2001-09-20 21:18 ` Andrea Arcangeli 2001-09-20 21:40 ` Alexander Viro 2001-09-20 22:13 ` Andrea Arcangeli 2001-09-20 22:20 ` Alexander Viro 2001-09-20 22:31 ` Andrea Arcangeli 2001-09-20 22:44 ` Alexander Viro 2001-09-20 23:03 ` Andrea Arcangeli 2001-09-20 23:11 ` Alexander Viro 2001-09-21 1:50 ` Alexander Viro 2001-09-21 2:42 ` Andrea Arcangeli 2001-09-21 3:47 ` Andrea Arcangeli 2001-09-21 4:00 ` Alexander Viro 2001-09-21 4:06 ` Andrea Arcangeli 2001-09-21 4:06 ` Andrea Arcangeli 2001-09-21 4:46 ` Andrea Arcangeli 2001-09-21 7:09 ` Andrea Arcangeli 2001-09-19 20:41 ` Richard Gooch 2001-09-19 13:38 ` Rik van Riel 2001-09-19 16:35 ` Andrea Arcangeli 2001-09-18 3:18 Ed Tomlinson 2001-09-18 2:31 ` Magnus Naeslund(f) 2001-09-18 2:49 ` David B. Stevens 2001-09-18 3:38 ` Ed Tomlinson 2001-09-18 3:15 ` Alan Cox 2001-09-18 4:41 ` H. Peter Anvin 2001-09-18 6:14 Alexei Podtelezhnikov 2001-09-18 13:51 ` Rik van Riel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).