* [CFT] delayed allocation and multipage I/O patches for 2.5.6.
@ 2002-03-12  6:00 Andrew Morton
  2002-03-12 11:18 ` Daniel Phillips
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Andrew Morton @ 2002-03-12  6:00 UTC (permalink / raw)
  To: lkml


[  Does anyone know what "CFT" means?  It means "call for testers".  It
   doesn't mean "woo-hoo, it'll be neat when that's merged <delete>".  It means
   "help, help - there's no point in just one guy testing this" (thanks Randy). ]


This is an update of the delayed-allocation and multipage pagecache I/O
patches.  I'm calling this a beta, because it all works, and I have
other stuff to do for a while.

Of the thirteen patches, seven (dallocbase-* and tuning-*) are
applicable to the base 2.5.6 kernel.

You need to mount an ext2 filesystem with the `-o delalloc' mount
option to turn on most of the functionality.

The rolled up patch is at

	http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6/everything.patch.gz


These patches do a ton of stuff.  Generally, the CPU overhead for filesystem
operations is decreased by about 40%.  Note "overhead": this is after factoring
out the constant copy_*_user overhead.  This translates to a 15-25% reduction
in CPU use for most workloads.

All benchmark results improve, to varying degrees.  The best case is two
instances of `dbench 64' against different disks, which went from 7
megabytes/sec to 25.  This is due to better write layout patterns, avoidance of
synchronous reads in the writeback path, better memory management and better
management of writeback threads.


The patch breakdown is:

dallocbase-10-readahead

  Unifies the current three readahead functions (mmap reads, read(2) and
  sys_readahead) into a single implementation.

  More aggressive in building up the readahead windows.

  More conservative in tearing them down.

  Special start-of-file heuristics.

  Preallocates the readahead pages, to avoid the (never demonstrated, but
  potentially catastrophic) scenario where allocation of readahead pages causes
  the allocator to perform VM writeout.

  (Hidden agenda): Gets all the readahead pages gathered together in one
  spot, so they can be marshalled into big BIOs.

  Reinstates the readahead tuning ioctls, so hdparm(8) and blockdev(8) are
  working again.  The readahead settings are now per-request-queue, and the
  drivers never have to know about it.

  Big code cleanup.

  Identifies readahead thrashing.

    Currently, it just performs a shrink on the readahead window when thrashing
    occurs.  This greatly reduces the amount of pointless I/O which we perform,
    and will reduce the CPU load.  The idea is that the readahead window
    dynamically adjusts to a sustainable size.  It improves things, but not
    hugely, experimentally.

    We really need drop-behind for read and write streams.  Or O_STREAMING,
    indeed.
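
  To make the thrashing handling concrete, here is a toy model of the window
  scaling -- illustrative only; the names and constants are made up, not the
  identifiers used in the patch:

	/* Toy model of the readahead window behaviour described above. */
	struct ra_state {
		unsigned long window;	/* current readahead window, in pages */
		unsigned long max;	/* per-request-queue maximum (tunable) */
	};

	static void ra_sequential_hit(struct ra_state *ra)
	{
		/* build the window up fairly aggressively */
		ra->window = ra->window ? ra->window * 2 : 4;
		if (ra->window > ra->max)
			ra->window = ra->max;
	}

	static void ra_thrash_detected(struct ra_state *ra)
	{
		/*
		 * A page we read ahead was reclaimed before it was used:
		 * the window is unsustainable, so shrink it rather than
		 * keep issuing pointless I/O.
		 */
		ra->window /= 2;
		if (ra->window < 4)
			ra->window = 4;
	}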

dallocbase-15-pageprivate

  page->buffers is a bit of a layering violation.  Not all address_spaces
  have pages which are backed by buffers.

  The exclusive use of page->buffers for buffers means that a piece of prime
  real estate in struct page is unavailable to other forms of address_space.

  This patch turns page->buffers into `unsigned long page->private' and sets
  in place all the infrastructure which is needed to allow other address_spaces
  to use this storage.

  With this change in place, the multipage-bio no-buffer_head code can use
  page->private to cache the results of an earlier get_block(), so repeated
  calls into the filesystem are not needed in the case of file overwriting.
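
  For illustration only (the flag and helper names here are invented, not the
  patch's): once a fully-mapped page's disk location is known, an overwrite
  can skip get_block() entirely by consulting the value stashed in ->private.

	/* Illustrative sketch, not the patch's code. */
	#define TOY_PG_BLOCK_VALID	0x1	/* "->private holds a block number" */

	struct toy_page {
		unsigned long flags;
		unsigned long private;		/* was: struct buffer_head *buffers */
	};

	static long cached_first_block(struct toy_page *page)
	{
		if (page->flags & TOY_PG_BLOCK_VALID)
			return (long)page->private;	/* no call into the fs needed */
		return -1;				/* must ask get_block() */
	}

	static void cache_first_block(struct toy_page *page, unsigned long block)
	{
		page->private = block;
		page->flags |= TOY_PG_BLOCK_VALID;
	}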

dallocbase-20-page_accounting

  This patch provides global accounting of locked and dirty pages.  It does
  this via lightweight per-CPU data structures.  The page_cache_size accounting
  has been changed to use this facility as well.

  Locked and dirty page accounting is needed for making writeback and
  throttling decisions in the delayed-allocation code.
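
  The per-CPU idea, in sketch form (names invented): each CPU bumps its own
  counter with no shared lock, and anyone needing a global figure sums the
  per-CPU values -- slightly stale, but plenty good enough for throttling.

	#define TOY_NR_CPUS	8

	struct toy_page_state {
		long nr_dirty;
		long nr_locked;
		long pad[14];	/* padding so CPUs don't share a cache line */
	};

	static struct toy_page_state toy_page_states[TOY_NR_CPUS];

	static void toy_account_dirty(int cpu, int delta)
	{
		toy_page_states[cpu].nr_dirty += delta;	/* no global lock */
	}

	static long toy_approx_nr_dirty(void)
	{
		long sum = 0;
		int i;

		for (i = 0; i < TOY_NR_CPUS; i++)
			sum += toy_page_states[i].nr_dirty;
		return sum;
	}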

dallocbase-30-pdflush

  This patch creates an adaptively-sized pool of writeback threads, called
  `pdflush'.  A simple algorithm is used to determine when new threads are
  needed, and when excess threads should be reaped.

  The kupdate and bdflush kernel threads are removed - the pdflush pool is
  used instead.

  The (ab)use of keventd for writing back unused inodes has been removed -
  the pdflush pool is now used for that operation.
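
  The sort of sizing heuristic such a pool can use looks roughly like this
  (a sketch with invented names and numbers, not the patch's algorithm):
  spawn another thread when every thread has been busy for a while, and reap
  one when a thread has sat idle for a long time.

	#define MIN_FLUSH_THREADS	2
	#define MAX_FLUSH_THREADS	8
	#define BUSY_SPAWN_SECS		1
	#define IDLE_REAP_SECS		60

	struct toy_flush_pool {
		int nr_threads;
		int nr_idle;
		long all_busy_since;	/* when nr_idle last dropped to zero */
		long idle_since;	/* when a thread last went idle */
	};

	static void toy_pool_balance(struct toy_flush_pool *p, long now)
	{
		if (p->nr_idle == 0 && p->nr_threads < MAX_FLUSH_THREADS &&
		    now - p->all_busy_since >= BUSY_SPAWN_SECS)
			p->nr_threads++;		/* all busy: add a worker */
		else if (p->nr_idle > 0 && p->nr_threads > MIN_FLUSH_THREADS &&
			 now - p->idle_since >= IDLE_REAP_SECS)
			p->nr_threads--;		/* long-idle: reap a worker */
	}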

dalloc-10-core

  The core delayed allocation code.  There's a description in the
  dalloc-10-core.patch file (all the patches have descriptions).

dalloc-20-ext2

  Implements delayed allocation for ext2.

dalloc-30-ratcache

  The radix-tree pagecache patch.

mpage-10-biobits

  Little API extensions in the BIO layer which were needed for building the
  pagecache BIOs.

mpage-20-core

  The core multipage I/O layer.

  This now implements multipage BIO reads into the pagecache.  Also caching
  of get_block() results at page->private.

  The get_block() result caching currently only applies if all of a page's
  blocks are laid out contiguously on disk.  Caching of a discontiguous list of
  blocks at page->private is easy enough to do, but would require a memory
  allocation, and the requirement is so rare that I didn't bother.
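
  The contiguity test is simple to state; a sketch (invented names, with
  map_block() standing in for the filesystem's get_block()):

	static int page_blocks_contiguous(unsigned long first_file_block,
					  int blocks_per_page,
					  long (*map_block)(unsigned long))
	{
		long prev = map_block(first_file_block);
		int i;

		if (prev < 0)
			return 0;			/* hole: don't cache */

		for (i = 1; i < blocks_per_page; i++) {
			long blk = map_block(first_file_block + i);

			if (blk != prev + 1)
				return 0;		/* discontiguous: fall back */
			prev = blk;
		}
		return 1;	/* OK to cache the first block in page->private */
	}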

mpage-30-ext2

  Implements multipage I/O for ext2.

tuning-10-request

  get_request() fairness for 2.5.x.  Avoids the situation where a thread
  sleeps for ages on the request queue while other threads whizz in and steal
  requests which they didn't wait for.

tuning-20-ext2-preread-inode

  When we create a new inode, preread its backing block.

  Without this patch, many-inode writeout gets seriously stalled by having to
  read many individual inode table blocks.

tuning-30-read_latency

  read-latency2, ported from 2.4.  Intelligently promotes reads ahead of
  writes on the request queue, to prevent reads from being stalled for very
  long periods of time.

  Also reinstates the BLKELVGET and BLKELVSET ioctls, so `elvtune' may be
  used in 2.5.

  Also increases the size of the request queues, which allows better request
  merging.  This is acceptable now that reads are not heavily penalised by a
  large queue.


-

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12  6:00 [CFT] delayed allocation and multipage I/O patches for 2.5.6 Andrew Morton
@ 2002-03-12 11:18 ` Daniel Phillips
  2002-03-12 20:29   ` Andrew Morton
  2002-03-12 11:39 ` Daniel Phillips
  2002-03-18 19:16 ` Hanna Linder
  2 siblings, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2002-03-12 11:18 UTC (permalink / raw)
  To: Andrew Morton, lkml

On March 12, 2002 07:00 am, Andrew Morton wrote:
>   Identifies readahead thrashing.
> 
>     Currently, it just performs a shrink on the readahead window when thrashing
>     occurs.  This greatly reduces the amount of pointless I/O which we perform,
>     and will reduce the CPU load.  The idea is that the readahead window
>     dynamically adjusts to a sustainable size.  It improves things, but not
>     hugely, experimentally.

The question is, does it wipe out a nasty corner case?  If so then the improvement
for the average case is just a nice fringe benefit.  A carefully constructed test
that triggers the corner case would be most interesting.

-- 
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12  6:00 [CFT] delayed allocation and multipage I/O patches for 2.5.6 Andrew Morton
  2002-03-12 11:18 ` Daniel Phillips
@ 2002-03-12 11:39 ` Daniel Phillips
  2002-03-12 21:00   ` Andrew Morton
  2002-03-13  0:42   ` David Woodhouse
  2002-03-18 19:16 ` Hanna Linder
  2 siblings, 2 replies; 16+ messages in thread
From: Daniel Phillips @ 2002-03-12 11:39 UTC (permalink / raw)
  To: Andrew Morton, lkml

On March 12, 2002 07:00 am, Andrew Morton wrote:
> dallocbase-15-pageprivate
> 
>   page->buffers is a bit of a layering violation.  Not all address_spaces
>   have pages which are backed by buffers.
> 
>   The exclusive use of page->buffers for buffers means that a piece of prime
>   real estate in struct page is unavailable to other forms of address_space.
> 
>   This patch turns page->buffers into `unsigned long page->private' and sets
>   in place all the infrastructure which is needed to allow other address_spaces
>   to use this storage.
> 
>   With this change in place, the multipage-bio no-buffer_head code can use
>   page->private to cache the results of an earlier get_block(), so repeated
>   calls into the filesystem are not needed in the case of file overwriting.

That's pragmatic, a good short term solution.  Getting rid of page->buffers 
entirely will be nicer, and in that case you want to cache the physical block
only for those pages that have one, e.g., not for swap-backed pages, which
keep that information in the page table.

I've been playing with the idea of caching the physical block in the radix
tree, which imposes the cost only on cache pages.  This forces you to do a
tree probe at IO time, but that cost is probably insignificant against the
cost of the IO.  This arrangement could make it quite convenient for the
filesystem to exploit the structure by doing opportunistic map-ahead, i.e.,
when ->get_block consults the metadata to fill in one physical address, why
not fill in several more, if it's convenient?
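
Something like this, perhaps -- just a sketch, all names invented:

	struct inode;				/* opaque here */

	struct blockmap {
		unsigned long file_block;	/* first logical block to map */
		int wanted;			/* caller actually needs this many */
		int max;			/* room available in phys[] */
		int mapped;			/* how many the fs filled in */
		long phys[16];			/* physical blocks (-1 for a hole) */
	};

	/*
	 * The filesystem maps the block(s) the caller asked for and, while it
	 * has the metadata in hand, as many following blocks as is convenient;
	 * the extras get cached (e.g. in the radix tree) for later I/O.
	 */
	int get_blocks_mapahead(struct inode *inode, struct blockmap *map, int create);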

-- 
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12 11:18 ` Daniel Phillips
@ 2002-03-12 20:29   ` Andrew Morton
  2002-03-12 20:40     ` Daniel Phillips
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2002-03-12 20:29 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: lkml

Daniel Phillips wrote:
> 
> On March 12, 2002 07:00 am, Andrew Morton wrote:
> >   Identifies readahead thrashing.
> >
> >     Currently, it just performs a shrink on the readahead window when thrashing
> >     occurs.  This greatly reduces the amount of pointless I/O which we perform,
> >     and will reduce the CPU load.  The idea is that the readahead window
> >     dynamically adjusts to a sustainable size.  It improves things, but not
> >     hugely, experimentally.
> 
> The question is, does it wipe out a nasty corner case?  If so then the improvement
> for the average case is just a nice fringe benefit.  A carefully constructed test
> that triggers the corner case would be most interesting.
> 

There are many test scenarios.  The one I use is:

- 64 megs of memory.

- Process A loops across N 10-megabyte files, reading 4k from each one
  and terminates when all N files are fully read.

- Process B loops, repeatedly reading a one gig file off another disk.

The total wallclock time for process A exhibits *massive* step jumps
as you vary N.  In stock 2.5.6 the runtime jumps from 40 seconds to
ten minutes when N is increased from 40 to 60.

With my changes, the rate of increase of runtime-versus-N is lower,
and happens at later N.  But it's still very sudden and very bad.

Yes, it's a known-and-nasty corner case.  Worth fixing if the
fix is clean.  But IMO the problem is not common enough to
justify significantly compromising the common case.

-

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12 20:29   ` Andrew Morton
@ 2002-03-12 20:40     ` Daniel Phillips
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2002-03-12 20:40 UTC (permalink / raw)
  To: Andrew Morton, Daniel Phillips; +Cc: lkml

On March 12, 2002 09:29 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > On March 12, 2002 07:00 am, Andrew Morton wrote:
> > >   Identifies readahead thrashing.
> > >
> > >     Currently, it just performs a shrink on the readahead window when thrashing
> > >     occurs.  This greatly reduces the amount of pointless I/O which we perform,
> > >     and will reduce the CPU load.  The idea is that the readahead window
> > >     dynamically adjusts to a sustainable size.  It improves things, but not
> > >     hugely, experimentally.
> > 
> > The question is, does it wipe out a nasty corner case?  If so then the improvement
> > for the average case is just a nice fringe benefit.  A carefully constructed test
> > that triggers the corner case would be most interesting.
> 
> There are many test scenarios.  The one I use is:
> 
> - 64 megs of memory.
> 
> - Process A loops across N 10-megabyte files, reading 4k from each one
>   and terminates when all N files are fully read.
> 
> - Process B loops, repeatedly reading a one gig file off another disk.
> 
> The total wallclock time for process A exhibits *massive* step jumps
> as you vary N.  In stock 2.5.6 the runtime jumps from 40 seconds to
> ten minutes when N is increased from 40 to 60.
> 
> With my changes, the rate of increase of runtime-versus-N is lower,
> and happens at later N.  But it's still very sudden and very bad.
> 
> Yes, it's a known-and-nasty corner case.  Worth fixing if the
> fix is clean.  But IMO the problem is not common enough to
> justify significantly compromising the common case.

It's a given the common case should be optimal.  I'm sure there's an algorithm that
fixes up your test case, which by the way isn't that uncommon - it's Rik's '100 ftp
processes' case.  I'll buy the suggestion it isn't common enough to drop everything
right now and go fix it.

-- 
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12 11:39 ` Daniel Phillips
@ 2002-03-12 21:00   ` Andrew Morton
  2002-03-13 11:58     ` Daniel Phillips
  2002-03-13  0:42   ` David Woodhouse
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2002-03-12 21:00 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: lkml

Daniel Phillips wrote:
> 
> On March 12, 2002 07:00 am, Andrew Morton wrote:
> > dallocbase-15-pageprivate
> >
> >   page->buffers is a bit of a layering violation.  Not all address_spaces
> >   have pages which are backed by buffers.
> >
> >   The exclusive use of page->buffers for buffers means that a piece of prime
> >   real estate in struct page is unavailable to other forms of address_space.
> >
> >   This patch turns page->buffers into `unsigned long page->private' and sets
> >   in place all the infrastructure which is needed to allow other address_spaces
> >   to use this storage.
> >
> >   With this change in place, the multipage-bio no-buffer_head code can use
> >   page->private to cache the results of an earlier get_block(), so repeated
> >   calls into the filesystem are not needed in the case of file overwriting.
> 
> That's pragmatic, a good short term solution.  Getting rid of page->buffers
> entirely will be nicer, and in that case you want to cache the physical block
> only for those pages that have one, e.g., not for swap-backed pages, which
> keep that information in the page table.

Really, I don't think we can lose page->buffers for *enough* users
of address_spaces to make it worthwhile.

If it was only being used for, say, blockdev inodes then we could
perhaps take it out and hash for it, but there are a ton of
filesystems out there...


The main problem I see with this patch series is that it introduces
a new way of performing writeback while leaving the old way in place.
The new way is better, I think - it's just a_ops->write_many_pages().
But at present, there are some address_spaces which support write_many_pages(),
and others which still use ->writepage() and sync_page_buffers().

This will make VM development harder, because the VM now needs to cope
with the nice, uniform, does-clustering-for-you writeback as well as
the crufty old write-little-bits-of-crap-all-over-the-disk writeback :)

I need to give the VM a uniform way of performing writeback for
all address_spaces.  My current thinking there is that all
address_spaces (even the non-delalloc, buffer_head-backed ones)
need to be taught to perform multipage clustered writeback
based on the address_space, not the dirty buffer LRU.

This is pretty deep surgery.  If it can be made to work, it'll
be nice - it will heavily deprecate the buffer_head layer and will
unify the current two-or-three different ways of performing
writeback (I've already unified all ways of performing writeback
for delalloc filesystems - my version of kupdate writeback, bdflush
writeback, vm-writeback and write(2) writeback are all unified).

> I've been playing with the idea of caching the physical block in the radix
> tree, which imposes the cost only on cache pages.  This forces you to do a
> tree probe at IO time, but that cost is probably insignificant against the
> cost of the IO.  This arrangement could make it quite convenient for the
> filesystem to exploit the structure by doing opportunistic map-ahead, i.e.,
> when ->get_block consults the metadata to fill in one physical address, why
> not fill in several more, if it's convenient?

That would be fairly easy to do.  My current writeback interface
into the filesytem is, basically, "write back N pages from your
mapping->dirty_pages list" [1].  The address_space could quite simply
whizz through that list and map all the required pages in a batched
manner.
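
In sketch form (the struct names here are invented; write_many_pages and the
mapping->dirty_pages list are as described above):

	struct toy_mapping;			/* stands in for address_space */

	struct toy_address_space_ops {
		/*
		 * Write back up to nr_pages from mapping->dirty_pages.  The
		 * address_space walks its own dirty list, so it is free to
		 * map the pages in one batched pass and cluster the I/O.
		 */
		int (*write_many_pages)(struct toy_mapping *mapping, int nr_pages);
	};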

[1] Problem with the current implementation is that I've taken
    out the guarantee that the page which the VM wanted to free
    actually has I/O started against it.  So if the VM wants to
    free something from ZONE_NORMAL, the address_space may just
    go and start writeback against 1000 ZONE_HIGHMEM pages instead.
    In practice, I suspect this doesn't matter much.  But it needs
    fixing.

    (Our current behaviour in this scenario is terrible.  Suppose
    a mapping has a mixture of dirty pages from two or more zones,
    and the VM is trying to free up a particular zone: the VM will
    *selectively* perform writepage against *some* of the dirty
    pages, and will skip writeback of pages from other zones.

    This means that we're submitting great chunks of discontiguous
    I/O.  It'll fragment the layout of sparse files and will
    greatly decrease writeout bandwidth.  We should be opportunistically
    submitting writeback against disk-contiguous and file-offset-contiguous
    pages from other zones at the same time!  I'm doing that now, but
    with the present VM design [2] I do need to provide a way to
    ensure that writeback has commenced against the target page).

[2] The more I think about it, the less I like it.  I have a feeling
    that I'll end up having to, umm, redesign the VM.  Damn.

-

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12 11:39 ` Daniel Phillips
  2002-03-12 21:00   ` Andrew Morton
@ 2002-03-13  0:42   ` David Woodhouse
  1 sibling, 0 replies; 16+ messages in thread
From: David Woodhouse @ 2002-03-13  0:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, lkml


akpm@zip.com.au said:
>  Really, I don't think we can lose page->buffers for *enough* users of
> address_spaces to make it worthwhile.

> If it was only being used for, say, blockdev inodes then we could
> perhaps take it out and hash for it, but there are a ton of
> filesystems out there... 

I have plenty of boxes which never have any use for page->buffers. Ever.

--
dwmw2



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12 21:00   ` Andrew Morton
@ 2002-03-13 11:58     ` Daniel Phillips
  2002-03-13 19:50       ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Phillips @ 2002-03-13 11:58 UTC (permalink / raw)
  To: Andrew Morton, Daniel Phillips; +Cc: lkml

On March 12, 2002 10:00 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On March 12, 2002 07:00 am, Andrew Morton wrote:
> > >   With this change in place, the multipage-bio no-buffer_head code can use
> > >   page->private to cache the results of an earlier get_block(), so repeated
> > >   calls into the filesystem are not needed in the case of file overwriting.
> > 
> > That's pragmatic, a good short term solution.  Getting rid of page->buffers
> > entirely will be nicer, and in that case you want to cache the physical block
> > only for those pages that have one, e.g., not for swap-backed pages, which
> > keep that information in the page table.
> 
> Really, I don't think we can lose page->buffers for *enough* users
> of address_spaces to make it worthwhile.
> 
> If it was only being used for, say, blockdev inodes then we could
> perhaps take it out and hash for it, but there are a ton of
> filesystems out there...

That's the thrust of my current work - massaging things into a form where 
struct page can be substituted for buffer_head as the block data handle for 
the mass of filesystems that use it.
 
> The main problem I see with this patch series is that it introduces
> a new way of performing writeback while leaving the old way in place.
> The new way is better, I think - it's just a_ops->write_many_pages().
> But at present, there are some address_spaces which support write_many_pages(),
> and others which still use ->writepage() and sync_page_buffers().
> 
> This will make VM development harder, because the VM now needs to cope
> with the nice, uniform, does-clustering-for-you writeback as well as
> the crufty old write-little-bits-of-crap-all-over-the-disk writeback :)
> 
> I need to give the VM a uniform way of performing writeback for
> all address_spaces.  My current thinking there is that all
> address_spaces (even the non-delalloc, buffer_head-backed ones)
> need to be taught to perform multipage clustered writeback
> based on the address_space, not the dirty buffer LRU.
>
> This is pretty deep surgery.  If it can be made to work, it'll
> be nice - it will heavily deprecate the buffer_head layer and will
> unify the current two-or-three different ways of performing
> writeback (I've already unified all ways of performing writeback
> for delalloc filesystems - my version of kupdate writeback, bdflush
> writeback, vm-writeback and write(2) writeback are all unified).

For me, the missing piece of the puzzle is how to recover the semantics of 
->b_flushtime.  The crude solution is just to put that in struct page for 
now.  At least that's a wash in terms of size because ->buffers goes out.

It's not right for the long term though, because flushtime is only needed for
cache-under-io.  It's annoying to bloat up all pages for the sake of the 
block cache.  Some kind of external structure is needed that can capture both 
flushtime ordering and physical ordering, and be readily accessible from 
struct page without costing a field in struct page.  Right now that structure 
is the buffer dirty list, though that fails the requirement of not bloating 
the struct page.  It falls short by other measures as well, such as offering 
no good way to search for physical adjacency for the purpose of clustering.

I don't know what kind of thing I'm searching for here, I thought I'd dump 
the problem on you and see what happens.

-- 
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-13 11:58     ` Daniel Phillips
@ 2002-03-13 19:50       ` Andrew Morton
  2002-03-13 21:51         ` Mike Fedyk
  2002-03-14 11:59         ` Daniel Phillips
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2002-03-13 19:50 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: lkml

Daniel Phillips wrote:
> 
> That's the thrust of my current work - massaging things into a form where
> struct page can be substituted for buffer_head as the block data handle for
> the mass of filesystems that use it.
> 
> ...
> 
> For me, the missing piece of the puzzle is how to recover the semantics of
> ->b_flushtime.  The crude solution is just to put that in struct page for
> now.  At least that's a wash in terms of size because ->buffers goes out.

I'm currently doing that in struct address_space(!).  Maybe struct inode
would make more sense...

So the mapping records the time at which it was first dirtied.  So the
`kupdate' function simply writes back all files which had their
first-dirtying time between 30 and 35 seconds ago.

That works OK, but it also needs to cope with the case of a single
huge dirty file.  For that case, periodic writeback also terminates
when it has written back 1/6th of all the dirty pages in the machine.

This is all fairly arbitrary, and is basically designed to map onto
the time-honoured behaviour.  I haven't observed any surprises from
it, nor any reason to change it.
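
In sketch form, the periodic-writeback decision described above (names and
constants invented, apart from the 30-35 second window and the 1/6th limit):

	#define DIRTY_EXPIRE_MIN	30	/* seconds */
	#define DIRTY_EXPIRE_MAX	35

	static int should_kupdate_write(long first_dirtied, long now)
	{
		long age = now - first_dirtied;

		return age >= DIRTY_EXPIRE_MIN && age < DIRTY_EXPIRE_MAX;
	}

	static long periodic_writeback_budget(long total_dirty_pages)
	{
		/* bail out after 1/6th of the dirty pages, even for one huge file */
		return total_dirty_pages / 6;
	}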

> It's not right for the long term though, because flushtime is only needed for
> cache-under-io.  It's annoying to bloat up all pages for the sake of the
> block cache.  Some kind of external structure is needed that can capture both
> flushtime ordering and physical ordering, and be readily accessible from
> struct page without costing a field in struct page.  Right now that structure
> is the buffer dirty list, though that fails the requirement of not bloating
> the struct page.  It falls short by other measures as well, such as offering
> no good way to search for physical adjacency for the purpose of clustering.
> 
> I don't know what kind of thing I'm searching for here, I thought I'd dump
> the problem on you and see what happens.

See above...

We also need to discuss writeback of metadata.  For delayed allocate
files, indirect blocks are not a problem, because I start I/O against
them *immediately*, as soon as they're dirtied.  This is because we
know that the indirect's associated data blocks are also under I/O.

Which leaves bitmaps and inode blocks.  These I am leaving on the
dirty buffer LRU, so nothing has changed there.

Now, I think it's fair to say that the ext2/ext3 inter-file fragmentation
issue is one of the three biggest performance problems in Linux.  (The
other two being excessive latency in the page allocator due to VM writeback
and read latency in the I/O scheduler).

The fix for interfile fragmentation lies inside ext2/ext3, not inside
any generic layers of the kernel.    And this really is a must-fix,
because the completion time for writeback is approximately proportional
to the size of the filesystem.  So we're getting, what? Fifty percent
slower per year?

The `tar xfz linux.tar.gz ; sync' workload can be sped up 4x-5x by
using find_group_other() for directories.  I spent a week or so
poking at this when it first came up.  Basically, *everything*
which I did to address the rapid-growth problem ended up penalising
the slow-growth fragmentation - long-term intra-file fragmentation
suffered at the expense of short-term inter-file fragmentation.

So I ended up concluding that we need online defrag to fix the
intra-file fragmentation.  Then this would *enable* the death
of find_group_dir().  I do have fully-journalled, cache-coherent,
crash-proof block relocation coded for ext3.  It's just an ioctl:

	int try_to_relocate_page(int fd, long blocks[]);

But I haven't sat down and thought about the userspace application
yet.  Which is kinda dumb, because the payback from this will be
considerable.

So.  Conclusion:  periodic writeback is based on inode-dirty-time,
with a limit on the number of pages.  Buffer LRU writeback for
bitmaps and inodes is OK, but we need to fix the directory
allocator.

-

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-13 19:50       ` Andrew Morton
@ 2002-03-13 21:51         ` Mike Fedyk
  2002-03-14 11:59         ` Daniel Phillips
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Fedyk @ 2002-03-13 21:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Daniel Phillips, lkml

On Wed, Mar 13, 2002 at 11:50:32AM -0800, Andrew Morton wrote:
> Now, I think it's fair to say that the ext2/ext3 inter-file fragmentation
> issue is one of the three biggest performance problems in Linux.  (The
> other two being excessive latency in the page allocator due to VM writeback
> and read latency in the I/O scheduler).
> 
> The fix for interfile fragmentation lies inside ext2/ext3, not inside
> any generic layers of the kernel.    And this really is a must-fix,
> because the completion time for writeback is approximately proportional
> to the size of the filesystem.  So we're getting, what? Fifty percent
> slower per year?
> 
> The `tar xfz linux.tar.gz ; sync' workload can be sped up 4x-5x by
> using find_group_other() for directories.  I spent a week or so
> poking at this when it first came up.  Basically, *everything*
> which I did to address the rapid-growth problem ended up penalising
> the slow-growth fragmentation - long-term intra-file fragmentation
> suffered at the expense of short-term inter-file fragmentation.

I know ReiserFS has similar problems.

Can anyone say whether JFS or XFS has this problem as well?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-13 19:50       ` Andrew Morton
  2002-03-13 21:51         ` Mike Fedyk
@ 2002-03-14 11:59         ` Daniel Phillips
  1 sibling, 0 replies; 16+ messages in thread
From: Daniel Phillips @ 2002-03-14 11:59 UTC (permalink / raw)
  To: Andrew Morton, Daniel Phillips; +Cc: lkml

On March 13, 2002 08:50 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> > 
> > That's the thrust of my current work - massaging things into a form where
> > struct page can be substituted for buffer_head as the block data handle 
> > for the mass of filesystems that use it.
> > ...
> > 
> > For me, the missing piece of the puzzle is how to recover the semantics of
> > ->b_flushtime.  The crude solution is just to put that in struct page for
> > now.  At least that's a wash in terms of size because ->buffers goes out.
> 
> I'm currently doing that in struct address_space(!).  Maybe struct inode
> would make more sense...

I don't know, when do we ever use an address_space that's not attached to an 
inode?  Hmm, swapper_space.  Though nobody does it now, it's also possible to 
have more than one address_space per inode.  This confuses the issue because 
you don't know whether to flush them together, separately, or what.  By the 
time things get this murky, it's time for the VM to step back and let the 
filesystem itself establish the flushing policy.  No, we don't have any model 
for how to express that, and we need one.  What you're developing here is 
a generic_flush, mainly for use by dumb filesystems, and by lucky accident, 
also suitable for Ext3.

So, hrm, the sky won't fall in either place you put it.

> So the mapping records the time at which it was first dirtied.  So the
> `kupdate' function simply writes back all files which had their
> first-dirtying time between 30 and 35 seconds ago.

I guess you have to be careful to set the first-dirtying time again after
submitting all the IO, in case the inode got dirtied again while you were
busy submitting.

> That works OK, but it also needs to cope with the case of a single
> huge dirty file.  For that case, periodic writeback also terminates
> when it has written back 1/6th of all the dirty pages in the machine.

You need a way of cycling through all the inodes on the system reliably, 
otherwise you'll get nasty situations where repeated dirtying starves some 
inodes of updates.  This has to have file offset resolution, otherwise more 
flush starvation corner cases will start crawling out of the woodwork.

The 1/6th rule is an oversimplification; it should at least be based on how 
much flush IO is already in flight, and other more difficult measures we 
haven't even started to address yet, such as how much and what kind of other 
IO is competing for the same bandwidth, and how much bandwidth is available.

> This is all fairly arbitrary, and is basically designed to map onto
> the time-honoured behaviour.  I haven't observed any surprises from
> it, nor any reason to change it.

It's surprisingly resistant to flaming.  The starvation problem is going to 
get ugly; it's just as hard as the elevator starvation question, and the 
crude, inode-level resolution of the flushtime makes it tricky.  But I think 
it can be beaten into some kind of shape where the corner cases are seen to 
be bounded.

I don't know, I need to think about it more.  It's both convenient and 
strange not to maintain per-block flushtime.

> We also need to discuss writeback of metadata.  For delayed allocate
> files, indirect blocks are not a problem, because I start I/O against
> them *immediately*, as soon as they're dirtied.  This is because we
> know that the indirect's associated data blocks are also under I/O.
> 
> Which leaves bitmaps and inode blocks.  These I am leaving on the
> dirty buffer LRU, so nothing has changed there.

Easy, give them an address_space.

[snip fascinating/timesucking sortie into online defrag]

-- 
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-12  6:00 [CFT] delayed allocation and multipage I/O patches for 2.5.6 Andrew Morton
  2002-03-12 11:18 ` Daniel Phillips
  2002-03-12 11:39 ` Daniel Phillips
@ 2002-03-18 19:16 ` Hanna Linder
  2002-03-18 20:14   ` Andrew Morton
  2 siblings, 1 reply; 16+ messages in thread
From: Hanna Linder @ 2002-03-18 19:16 UTC (permalink / raw)
  To: Andrew Morton, lkml; +Cc: hannal


--On Monday, March 11, 2002 22:00:57 -0800 Andrew Morton <akpm@zip.com.au> wrote:

>    "help, help - there's no point in just one guy testing this" (thanks Randy). 

	Will you accept the testing of a gal? ;)
> 
> This is an update of the delayed-allocation and multipage pagecache I/O
> patches.  I'm calling this a beta, because it all works, and I have
> other stuff to do for a while.
> 

	Here are the dbench throughput results on an 8-way SMP with 2GB memory.
These are runs with 64 and then 128 clients, 15 times each, averaged.  It
looked pretty good.
	Running with more than 180 clients caused the system to hang; after
a reset there was much filesystem corruption.  This happened twice, probably
related to filling up disk space.  There are no ServerRaid drivers for 2.5
yet, so the biggest disks on this system are unusable.  lockmeter results
are forthcoming (in a day or two).

Running dbench on an 8-way SMP 15 times each.

2.5.6 clean

Clients		Avg

64		37.9821
128		29.8258

2.5.6 with everything.patch

Clients		Avg

64		41.0204
128		30.6431

Hanna 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-18 19:16 ` Hanna Linder
@ 2002-03-18 20:14   ` Andrew Morton
  2002-03-18 20:22     ` Hanna Linder
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2002-03-18 20:14 UTC (permalink / raw)
  To: Hanna Linder; +Cc: lkml

Hanna Linder wrote:
> 
> --On Monday, March 11, 2002 22:00:57 -0800 Andrew Morton <akpm@zip.com.au> wrote:
> 
> >    "help, help - there's no point in just one guy testing this" (thanks Randy).
> 
>         Will you accept the testing of a gal? ;)

With alacrity :)  Thanks.

> >
> > This is an update of the delayed-allocation and multipage pagecache I/O
> > patches.  I'm calling this a beta, because it all works, and I have
> > other stuff to do for a while.
> >
> 
>         Here are the dbench throughput results on an 8-way SMP with 2GB memory.
> These are run with 64 then 128 clients 15 times each averaged. It looked
> pretty good.
>         Running with more than 180 clients caused the system to hang, after
> a reset there was much filesystem corruption. This happened twice. Probably
> related to filling up disk space.

It could be space-related.  A couple of gigs should have been plenty..

One other possible explanation is to do with radix-tree pagecache.
It has to allocate memory to add nodes to the tree.  When these
allocations start failing due to out-of-memory, the VM will keep
on calling swap_out() a trillion times without noticing that it
didn't work out.  But if this happened, you would have seen a huge
number of "0-order allocation failed" messages.

> There are no ServerRaid drivers for 2.5 yet
> so the biggest disks on this system are unusable. lockmeter results are forthcoming (day or two).
> 
> Running dbench on an 8-way SMP 15 times each.
> 
> 2.5.6 clean
> 
> Clients         Avg
> 
> 64              37.9821
> 128             29.8258
> 
> 2.5.6 with everything.patch
> 
> Clients         Avg
> 
> 64              41.0204
> 128             30.6431
> 

That's odd.  I'm showing 50% increases in dbench throughput.  Not
that there's anything particularly clever about that - these patches
allow the kernel to just throw more memory in dbench's direction, and
it likes that.  But it does indicate that something funny is up.
I'll take a closer look - thanks again.

-

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-18 20:14   ` Andrew Morton
@ 2002-03-18 20:22     ` Hanna Linder
  2002-03-18 20:49       ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Hanna Linder @ 2002-03-18 20:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hanna Linder, lkml


--On Monday, March 18, 2002 12:14:27 -0800 Andrew Morton <akpm@zip.com.au> wrote:

> One other possible explanation is to do with radix-tree pagecache.
> It has to allocate memory to add nodes to the tree.  When these
> allocations start failing due to out-of-memory, the VM will keep
> on calling swap_out() a trillion times without noticing that it
> didn't work out.  But if this happened, you would have seen a huge
> number of "0-order allocation failed" messages.

	Yes, I did see a huge number of those messages. It also died
	on 2.5.6 clean though. I chalked it up to 2.5 instability.
	Will test again when things calm down. Any chance you will
	backport to 2.4?
	
	Glad to help.

	Hanna




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
  2002-03-18 20:22     ` Hanna Linder
@ 2002-03-18 20:49       ` Andrew Morton
  0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2002-03-18 20:49 UTC (permalink / raw)
  To: Hanna Linder; +Cc: lkml

Hanna Linder wrote:
> 
> --On Monday, March 18, 2002 12:14:27 -0800 Andrew Morton <akpm@zip.com.au> wrote:
> 
> > One other possible explanation is to do with radix-tree pagecache.
> > It has to allocate memory to add nodes to the tree.  When these
> > allocations start failing due to out-of-memory, the VM will keep
> > on calling swap_out() a trillion times without noticing that it
> > didn't work out.  But if this happened, you would have seen a huge
> > number of "0-order allocation failed" messages.
> 
>         Yes, I did see a huge number of those messages.

OK.  Probably it would have eventually recovered, because there
would have been *some* I/O queued up somewhere.  Of course, it
would recover a damn sight faster if I hadn't added those
printk's in the page allocator :)

>         It also died
>         on 2.5.6 clean though. I chalked it up to 2.5 instability.

mm.  2.5.6 is stable in my testing.  PIIX4 IDE and aic7xxx SCSI.
So maybe a driver problem, maybe a highmem problem?

>         Will test again when things calm down. Any chance you will
>         backport to 2.4?

I don't really plan to do that.  There's still quite a lot more stuff
needs doing, and it'll end up a huuuuge patch.  Plus a fair bit of
the value isn't there, because 2.4 doesn't have BIOs.

-

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [CFT] delayed allocation and multipage I/O patches for 2.5.6.
@ 2002-03-19  0:41 rwhron
  0 siblings, 0 replies; 16+ messages in thread
From: rwhron @ 2002-03-19  0:41 UTC (permalink / raw)
  To: linux-kernel

2.5.6 with Andrew's everything patch on ext2 
filesystem mounted with delalloc came up with
these MB/second on k6-2/475 with IDE disk:

dbench 128	
2.5.6		2.5.6-akpme	akpm % faster
8.4 		12.5 		48

tiobench seq reads (8 - 128 threads avg 3 runs)
2.5.6		2.5.6-akpme	%
9.36 		12.97		38

tiobench seq writes (8 - 128 threads avg 3 runs)
2.5.6		2.5.6-akpme	%
15.3		19.29		26

Both kernels needed reiserfs patches for 2.5.6, but
above tests are on ext2.  

More on these tests and a few other akpm patches at:
http://home.earthlink.net/~rwhron/kernel/akpm.html

-- 
Randy Hron


^ permalink raw reply	[flat|nested] 16+ messages in thread
