linux-kernel.vger.kernel.org archive mirror
* [prepatch] address_space-based writeback
@ 2002-04-10 11:21 Andrew Morton
  2002-04-10 11:34 ` Alexander Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-10 11:21 UTC (permalink / raw)
  To: lkml


This is a largish patch which makes some fairly deep changes.  It's
currently at the "wow, it worked" stage.  Most of it is fairly
mature code, but some conceptual changes were recently made.
Hopefully it'll be in respectable state in a few days, but I'd
like people to take a look.

The idea is: all writeback is against address_spaces.  All dirty data
has the dirty bit set against its page.  So all dirty data is
accessible by

	superblock list
		-> superblock dirty inode list
			-> inode mapping's dirty page list.
				-> page_buffers(page) (maybe)
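
In toy form, the descent over these structures looks like the sketch below (the struct layouts and field names are illustrative stand-ins, not the real kernel types):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real kernel structures. */
struct page          { struct page *next_dirty; };
struct address_space { struct page *dirty_pages; };
struct inode         { struct inode *next_dirty; struct address_space *i_mapping; };
struct super_block   { struct super_block *next; struct inode *s_dirty; };

/* Walk superblocks -> dirty inodes -> dirty pages, visiting every
 * piece of dirty data reachable through the hierarchy. */
static int count_reachable_dirty_pages(struct super_block *sbs)
{
	int n = 0;
	for (struct super_block *sb = sbs; sb; sb = sb->next)
		for (struct inode *in = sb->s_dirty; in; in = in->next_dirty)
			for (struct page *pg = in->i_mapping->dirty_pages;
			     pg; pg = pg->next_dirty)
				n++;
	return n;
}
```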

pdflush threads are used to implement periodic writeback (kupdate) by
descending the data structures described above.  address_spaces now
carry a timestamp (when it was first dirtied) to permit the traditional
"write back stuff which is older than thirty seconds" behaviour.

pdflush threads are used for global writeback (bdflush analogue).

New address_space operations are introduced for whole-file writeback
and for VM writeback.  The VM writeback function is designed so that
large chunks of data (I grabbed 4 megs out of the air) will be sent
down to the I/O layers in a single operation.
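
For a sense of scale (the 4 MB figure is the one grabbed out of the air above; a 4 KB page size is an assumption here), one VM writeback call covers on the order of a thousand pages rather than one:

```c
#include <assert.h>

/* Illustrative arithmetic only: one vm-writeback call over a ~4 MB
 * chunk replaces ~1024 single-page writepage calls (assuming 4 KB pages). */
enum { TOY_PAGE_SIZE = 4096, TOY_WB_CHUNK = 4 * 1024 * 1024 };

static int pages_per_vm_writeback(void)
{
	return TOY_WB_CHUNK / TOY_PAGE_SIZE;
}
```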

Aside: Remember, although there's a lot of cleanup and uniformity being
introduced here, one goal is to get rid of the whole practice of
chunking data up into tiny pieces, passing them to the request layer,
adding tiny BIOs to them, sorting them, stringing them onto requests,
then feeding them to the device driver to puzzle over.  A key objective
is to deal with maximal-sized chunks of data in an efficient manner. 
Assemble large scatter/gather arrays directly against pagecache at the
filemap level.  No single-buffer I/O.  No single page I/O.


New rules are introduced for the relationship between buffers and their
page:

- If a page has one or more dirty buffers, the page is marked dirty.

- If a page is marked dirty, it *may* have dirty buffers.

- If a page is marked dirty but has no buffers, it is entirely dirty.
  So if buffers are later attached to that page, they are all set
  dirty.

So block_write_full_page will now only write the dirty buffers.
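
The rules can be captured in a toy model (the types and helpers here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define NBUFS 4

/* Toy page with optional buffers; illustrative only. */
struct tpage {
	bool dirty;
	bool has_buffers;
	bool buf_dirty[NBUFS];
};

/* Rule: attaching buffers to a dirty, buffer-less page marks them
 * all dirty, since such a page is entirely dirty. */
static void attach_buffers(struct tpage *p)
{
	p->has_buffers = true;
	if (p->dirty)
		for (int i = 0; i < NBUFS; i++)
			p->buf_dirty[i] = true;
}

/* block_write_full_page analogue: write only the dirty buffers. */
static int write_full_page(struct tpage *p)
{
	int written = 0;
	for (int i = 0; i < NBUFS; i++)
		if (p->buf_dirty[i]) {
			p->buf_dirty[i] = false;
			written++;
		}
	p->dirty = false;
	return written;
}
```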

All this is designed to support file metadata.  Metadata is written back
via its blockdevice's address_space.  So mark_buffer_dirty sets the
buffer dirty, sets the page dirty, attaches the page to the blockdev's
address_space's dirty_pages, attaches the blockdev's inode to its dummy
superblock's dirty inodes list and there we are.  The blockdev mapping
is treated in an identical manner to other address_spaces.
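
That propagation chain, in toy form (all of the types below are illustrative stand-ins for the real structures):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins: just enough structure to show the dirtying chain. */
struct tsb    { bool has_dirty_inodes; };
struct tinode { struct tsb *i_sb; bool on_dirty_list; };
struct tmap   { struct tinode *host; bool has_dirty_pages; };
struct tpage  { struct tmap *mapping; bool dirty; };
struct tbh    { struct tpage *page; bool dirty; };

/* mark_buffer_dirty now propagates dirtiness all the way up:
 * buffer -> page -> mapping's dirty_pages -> host inode -> its
 * superblock's dirty-inode list (the dummy blockdev superblock,
 * in the metadata case). */
static void toy_mark_buffer_dirty(struct tbh *bh)
{
	struct tpage *page = bh->page;

	bh->dirty = true;
	page->dirty = true;
	page->mapping->has_dirty_pages = true;
	page->mapping->host->on_dirty_list = true;
	page->mapping->host->i_sb->has_dirty_inodes = true;
}
```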

A consequence of this is that if someone cleans a page by directly
writing back its buffers with ll_rw_block or submit_bh, the page will
still be marked dirty, but its buffers will be clean.  That's fine -
block_write_full_page will perform no I/O.  PG_dirty is advisory if
the page has buffers.

We need to be careful not to free buffers against a dirty page,
because when they are reattached, *all* the buffers will be marked
dirty - which, in the case of the blockdev mapping, would corrupt file
data.

Numerous changes in fs/inode.c make it the means by which we perform
most sorts of writeback.  I'll be splitting a new file out from
fs/inode.c for this.


As a consequence of all the above, some adjustments were possible at
the buffer layer.

The patch deletes the buffer LRUs, lru_list_lock, write_locked_buffers,
write_some_buffers, wait_for_buffers, sync_buffers, etc.  The bdflush
tunables have been removed.  All the sync functions operate at the
address_space level only.  The kupdate and bdflush functions have been
removed.

The buffer hash has been removed - hash_table_lock, etc.  Buffercache
lookups use the page hash and then a walk across the page's buffer list.
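
A lookup now amounts to something like this sketch: find the page (via the page hash, not modelled here), then walk the page's circular buffer ring. The struct is a toy stand-in for buffer_head:

```c
#include <assert.h>
#include <stddef.h>

/* Toy buffer_head: b_this_page links a page's buffers in a ring. */
struct tbh {
	long b_blocknr;
	struct tbh *b_this_page;
};

/* Walk the page's buffer ring looking for a given block number. */
static struct tbh *find_buffer(struct tbh *head, long blocknr)
{
	struct tbh *bh = head;

	if (!bh)
		return NULL;
	do {
		if (bh->b_blocknr == blocknr)
			return bh;
		bh = bh->b_this_page;
	} while (bh != head);
	return NULL;
}
```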

Per-inode locking for i_dirty_buffers and i_dirty_data_buffers has been
introduced.  That same lock is held across try_to_free_buffers so that
spinlocking may be used to get exclusion against try_to_free_buffers.
This enables the new __get_hash_table() to be non-blocking, which is what
some filesystems appear to expect.

sync_page_buffers() has been removed.  This is because all VM writeback
occurs at the address_space level, and all pages which contain dirty
data are marked dirty.  VM throttling occurs purely at the page level -
wait_on_page().

buffer_head.b_inode, unused_list and address_space.i_dirty_data_buffers
have not been removed yet.

The diff (along with another eight patches) is at

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.8-pre3/dallocbase-70-writeback.patch

As I say, it's still a few days away from being presentable.  I
definitely need to test a lot of other filesystems because it
did find a few warts in ext2.


-

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-10 11:21 [prepatch] address_space-based writeback Andrew Morton
@ 2002-04-10 11:34 ` Alexander Viro
  2002-04-10 19:16   ` Andrew Morton
  2002-04-10 19:29 ` Jeremy Jackson
  2002-04-15  8:47 ` Andrew Morton
  2 siblings, 1 reply; 48+ messages in thread
From: Alexander Viro @ 2002-04-10 11:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml



On Wed, 10 Apr 2002, Andrew Morton wrote:

> 
> This is a largish patch which makes some fairly deep changes.  It's
> currently at the "wow, it worked" stage.  Most of it is fairly
> mature code, but some conceptual changes were recently made.
> Hopefully it'll be in respectable state in a few days, but I'd
> like people to take a look.
> 
> The idea is: all writeback is against address_spaces.  All dirty data
> has the dirty bit set against its page.  So all dirty data is
> accessible by
> 
> 	superblock list
> 		-> superblock dirty inode list
> 			-> inode mapping's dirty page list.
> 				-> page_buffers(page) (maybe)

Wait.  You are assuming that all address_spaces happen to be ->i_mapping of
some inode.  Which is less than obvious - e.g. putting indirect blocks into
private per-inode address_space is not unreasonable.

What's more, I wonder how well your scheme works with ->i_mapping
pointing to a different inode's ->i_data (CODA et al., file access to block devices).

BTW, CODA-like beasts can have several local filesystems for cache - i.e.
->i_mapping for dirty inodes from the same superblock can ultimately go
to different queues.  Again, the same goes for stuff like
dd if=foo of=/dev/floppy - you get dirty inode of /dev/floppy with ->i_mapping
pointing to bdev filesystem and queue we end up with having nothing in common
with that of root fs' device.

I'd really like to see where you are going with all that stuff - if you
expect some correspondence between superblocks and devices, you're
walking straight into major PITA.



* Re: [prepatch] address_space-based writeback
  2002-04-10 11:34 ` Alexander Viro
@ 2002-04-10 19:16   ` Andrew Morton
  2002-04-10 20:53     ` Alexander Viro
  2002-04-10 22:12     ` Jan Harkes
  0 siblings, 2 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-10 19:16 UTC (permalink / raw)
  To: Alexander Viro; +Cc: lkml

Alexander Viro wrote:
> 
> On Wed, 10 Apr 2002, Andrew Morton wrote:
> 
> >
> > This is a largish patch which makes some fairly deep changes.  It's
> > currently at the "wow, it worked" stage.  Most of it is fairly
> > mature code, but some conceptual changes were recently made.
> > Hopefully it'll be in respectable state in a few days, but I'd
> > like people to take a look.
> >
> > The idea is: all writeback is against address_spaces.  All dirty data
> > has the dirty bit set against its page.  So all dirty data is
> > accessible by
> >
> >       superblock list
> >               -> superblock dirty inode list
> >                       -> inode mapping's dirty page list.
> >                               -> page_buffers(page) (maybe)
> 
> Wait.

Hi, Al.  Nothing has really changed wrt the things to which you
refer.  ie: they would already be a problem.  The relationships
between dirty pages, address_spaces, inodes and superblocks
are unchanged, except for one thing:  __mark_inode_dirty will
now attach blockdev inodes to the dummy blockdev's dirty
inodes list.

The main thing which is being changed is buffers. The assumption is
that buffers can be hunted down via
superblocklist->superblock->dirty_inode->i_mapping->writeback_mapping,
not via the global LRU.

>  You are assuming that all address_spaces happen to be ->i_mapping of
> some inode.

Sorry, the above diagram is not really accurate.  The sync/kupdate/bdflush
writeback path is really:

	superblock list
		-> superblock dirty inode list
			->inode->i_mapping->a_ops->writeback_mapping(mapping)

So core kernel does not actually assume that the to-be-written
pages are accessible via inode->i_mapping->dirty_pages.

I believe that the object relationship you're describing is
that the inode->i_mapping points to the main address_space,
and the `host' field of both the main and private address_spaces
both point at the same inode?  That the inode owns two
address_spaces?

That's OK.  When a page is dirtied, the kernel will attach
that page to the private address_space's dirty pages list and
will attach the common inode to its superblock's dirty inodes list.

For writeback, core kernel will perform

	inode->i_mapping->writeback_mapping(mapping, nr_pages)

which will invoke the inode's main address_space's writeback_mapping()
method, which can do:

my_writeback_mapping(mapping, nr_pages)
{
	generic_writeback_mapping(mapping, nr_pages);
	mapping->host->private_mapping->a_ops->writeback_mapping(
		mapping->host->private_mapping, nr_pages);
}

> ...
> What's more, I wonder how well does your scheme work with ->i_mapping
> to a different inode's ->i_data (CODA et.al., file access to block devices).

Presumably, those different inodes have a superblock?  In that
case set_page_dirty will mark that inode dirty wrt its own
superblock.  set_page_dirty() is currently an optional a_op,
but it's not obvious that there will be a need for that.

The one thing which does worry me a bit is why __mark_inode_dirty
tests for a null ->i_sb.  If the inode doesn't have a superblock
then its pages are hidden from the writeback functions.

This is not fatal per se.  The pages are still visible to the VM
via the LRU, and presumably the filesystem knows how to sync
its own stuff.  But for memory-balancing and throttling purposes,
I'd prefer that the null ->i_sb not be allowed.  Where can this
occur?

> BTW, CODA-like beasts can have several local filesystems for cache - i.e.
> ->i_mapping for dirty inodes from the same superblock can ultimately go
> to different queues.

When a page is dirtied, those inodes will be attached to their
->i_sb's dirty inode list; I haven't changed any of that.

>  Again, the same goes for stuff like
> dd if=foo of=/dev/floppy - you get dirty inode of /dev/floppy with ->i_mapping
> pointing to bdev filesystem and queue we end up with having nothing in common
> with that of root fs' device.

OK.  What has changed here is that the resulting mark_buffer_dirty
calls will now set the page dirty, and attach the page to its
mapping's dirty_pages, and attach its mapping's host to
mapping->host->i_sb->s_dirty.

So writeback for the floppy device is no longer via the
buffer LRU.  It's via

	dummy blockdev superblock
		-> blockdev's inode
			->i_mapping
				->writeback_mapping

> I'd really like to see where are you going with all that stuff - if you
> expect some correspondence between superblocks and devices, you've are
> walking straight into major PITA.

Hopefully, none of that has changed.  It's just the null inode->i_sb
which is a potential problem.

-


* Re: [prepatch] address_space-based writeback
  2002-04-10 11:21 [prepatch] address_space-based writeback Andrew Morton
  2002-04-10 11:34 ` Alexander Viro
@ 2002-04-10 19:29 ` Jeremy Jackson
  2002-04-10 19:41   ` Andrew Morton
  2002-04-15  8:47 ` Andrew Morton
  2 siblings, 1 reply; 48+ messages in thread
From: Jeremy Jackson @ 2002-04-10 19:29 UTC (permalink / raw)
  To: Andrew Morton, lkml

This sounds like a wonderful piece of work.
I'm also inspired by the rmap stuff coming down 
the pipe.   I wonder if there would be any interference
between the two, or could they leverage each other?

Jeremy

----- Original Message ----- 
From: "Andrew Morton" <akpm@zip.com.au>
Sent: Wednesday, April 10, 2002 4:21 AM


> 
> This is a largish patch which makes some fairly deep changes.  It's
> currently at the "wow, it worked" stage.  Most of it is fairly
> mature code, but some conceptual changes were recently made.
> Hopefully it'll be in respectable state in a few days, but I'd
> like people to take a look.
> 
> The idea is: all writeback is against address_spaces.  All dirty data
> has the dirty bit set against its page.  So all dirty data is
> accessible by
 (snip)


* Re: [prepatch] address_space-based writeback
  2002-04-10 19:29 ` Jeremy Jackson
@ 2002-04-10 19:41   ` Andrew Morton
  0 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-10 19:41 UTC (permalink / raw)
  To: Jeremy Jackson; +Cc: lkml

Jeremy Jackson wrote:
> 
> This sounds like a wonderful piece of work.
> I'm also inspired by the rmap stuff coming down
> the pipe.   I wonder if there would be any interference
> between the two, or could they leverage each other?
> 

Well, the theory is that rmap doesn't need to know. It just
calls writepage when it sees dirty pages. That's a minimal
approach.  But I'd like the VM to aggressively use the
"write out lots of pages in the same call" APIs, rather
than the current "send lots of individual pages" approach.

For a number of reasons:

- For delalloc filesystems, and for sparse mappings
  against "syncalloc" filesystems, disk allocation is
  performed at ->writepage time.  It's important that
  the writes be clustered effectively.  Otherwise
  the file gets fragmented on-disk.

- There's a reasonable chance that the pages on the
  LRU lists get themselves out-of-order as the aging
  process proceeds.  So calling ->writepage in lru_list
  order has the potential to result in fragmented write
  patterns, and inefficient I/O.

- If the VM is trying to free pages from, say, ZONE_NORMAL
  then it will only perform writeout against pages from
  ZONE_NORMAL and ZONE_DMA.  But there may be writable pages
  from ZONE_HIGHMEM sprinkled amongst those pages within the
  file. It would be really bad if we miss out on opportunistically
  slotting those other pages into the same disk request.

- The current VM writeback is an enormous code path.  For each
  tiny little page, we send it off into writepage, pass it
  to the filesystem, give it a disk mapping, give it a buffer_head,
  submit the buffer_head, attach a tiny BIO to the buffer_head,
  submit the BIO, merge that onto a request structure which
  contains contiguous BIOs, feed that list-of-single-page-BIOs
  to the device driver.

  That's rather painful.   So the intent is to batch this work
  up.  Instead, the VM says "write up to four megs from this
  page's mapping, including this page".

  That request passes through the filesystem and we wrap 64k
  or larger BIOs around the pagecache data and put those into
  the request queue.  For some filesystems the buffer_heads
  are ignored altogether.  For others, the buffer_head
  can be used at "build the big BIO" time to work out how to
  segment the BIOs across sub-page boundaries.

  The "including this page" requirement of the vm_writeback
  method is there because the VM may be trying to free pages
  from a specific zone, so it would not be useful if the
  filesystem went and submitted I/O for a ton of pages which
  are all from the wrong zone.  This is a bit of an ugly
  back-compatible placeholder to keep the VM happy before
  we move on to that part of it.
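
One way to picture that "including this page" contract is the sketch below; this is a guess at the windowing semantics, not the actual method, which operates on the page's mapping:

```c
#include <assert.h>

/* Sketch: choose a writeback window of up to max_pages page indices
 * around the target, clamped to [0, file_pages).  The target index is
 * always inside the window, so the zone the VM cares about gets its
 * page written while neighbouring dirty pages ride along. */
static void wb_window(long target, long file_pages, long max_pages,
		      long *start, long *end)
{
	*start = target - max_pages / 2;
	if (*start < 0)
		*start = 0;
	*end = *start + max_pages;
	if (*end > file_pages)
		*end = file_pages;
}
```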

-


* Re: [prepatch] address_space-based writeback
  2002-04-10 19:16   ` Andrew Morton
@ 2002-04-10 20:53     ` Alexander Viro
  2002-04-10 22:12     ` Jan Harkes
  1 sibling, 0 replies; 48+ messages in thread
From: Alexander Viro @ 2002-04-10 20:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml



On Wed, 10 Apr 2002, Andrew Morton wrote:

[I'll answer to the rest when I get some sleep]

> The one thing which does worry me a bit is why __mark_inode_dirty
> tests for a null ->i_sb.  If the inode doesn't have a superblock
> then its pages are hidden from the writeback functions.
> 
> This is not fatal per-se.  The pages are still visible to the VM
> via the LRU, and presumably the filesystem knows how to sync
> its own stuff.  But for memory-balancing and throttling purposes,
> I'd prefer that the null ->i_sb not be allowed.  Where can this
> occur?

In rather old kernels.  IOW, these checks are atavisms - these days
_all_ inodes have (constant) ->i_sb.  The only way to create an
in-core inode is alloc_inode(superblock) and requires superblock
to be non-NULL.  After that ->i_sb stays unchanged (and non-NULL)
until struct inode is freed.



* Re: [prepatch] address_space-based writeback
  2002-04-10 22:12     ` Jan Harkes
@ 2002-04-10 21:44       ` Andrew Morton
  2002-04-10 22:56         ` Anton Altaparmakov
  2002-04-10 23:02         ` Jan Harkes
  0 siblings, 2 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-10 21:44 UTC (permalink / raw)
  To: Jan Harkes; +Cc: linux-kernel

Jan Harkes wrote:
> 
> On Wed, Apr 10, 2002 at 12:16:26PM -0700, Andrew Morton wrote:
> > I believe that the object relationship you're describing is
> > that the inode->i_mapping points to the main address_space,
> > and the `host' field of both the main and private address_spaces
> > both point at the same inode?  That the inode owns two
> > address_spaces?
> 
> Actually with Coda we've got 2 inodes that have an identical i_mapping.
> The Coda inode's i_mapping is set to point to the hosting inode's
> i_data.

I see.  So this is all your fault :)

> ...
> 
> But Coda has 2 inodes, which one are you connecting to whose superblock.
> My guess is that it would be correct to add inode->i_mapping->host to
> inode->i_mapping->host->i_sb, which will be the underlying inode in
> Coda's case, but host isn't guaranteed to be an inode, it just happens
> to be an inode in all existing situations.

When a page is marked dirty, the path which is followed
is page->mapping->host->i_sb.  So in this case the page will
be attached to its page->mapping->dirty_pages, and
page->mapping->host will be attached to page->mapping->host->i_sb->s_dirty.

This is as it always was - I didn't change any of this.

> > > What's more, I wonder how well does your scheme work with ->i_mapping
> > > to a different inode's ->i_data (CODA et.al., file access to block devices).
> >
> > Presumably, those different inodes have a superblock?  In that
> > case set_page_dirty will mark that inode dirty wrt its own
> > superblock.  set_page_dirty() is currently an optional a_op,
> > but it's not obvious that there will be a need for that.
> 
> Coda's inodes don't have to get dirtied because we never write them out,
> although the associated dirty pages do need to hit the disk eventually :)
> 

Right.  Presumably, the pages hit the disk via the hosting inode's
filesystem's mechanics.

And it remains the case that Coda inodes will not be marked DIRTY_PAGES
because set_page_dirty()'s page->mapping->host walk will arrive at
the hosting inode.

-


* Re: [prepatch] address_space-based writeback
  2002-04-10 19:16   ` Andrew Morton
  2002-04-10 20:53     ` Alexander Viro
@ 2002-04-10 22:12     ` Jan Harkes
  2002-04-10 21:44       ` Andrew Morton
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Harkes @ 2002-04-10 22:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Wed, Apr 10, 2002 at 12:16:26PM -0700, Andrew Morton wrote:
> I believe that the object relationship you're describing is
> that the inode->i_mapping points to the main address_space,
> and the `host' field of both the main and private address_spaces
> both point at the same inode?  That the inode owns two
> address_spaces?

Actually with Coda we've got 2 inodes that have an identical i_mapping.
The Coda inode's i_mapping is set to point to the hosting inode's
i_data. Hmm, I should probably set it to the i_mapping of the host
inode; that way I can run Coda over a Coda(-like) filesystem.

> That's OK.  When a page is dirtied, the kernel will attach
> that page to the private address_space's dirty pages list and
> will attach the common inode to its superblock's dirty inodes list.

But Coda has 2 inodes; which one are you connecting to whose superblock?
My guess is that it would be correct to add inode->i_mapping->host to
inode->i_mapping->host->i_sb, which will be the underlying inode in
Coda's case, but host isn't guaranteed to be an inode, it just happens
to be an inode in all existing situations.

> > What's more, I wonder how well does your scheme work with ->i_mapping
> > to a different inode's ->i_data (CODA et.al., file access to block devices).
> 
> Presumably, those different inodes have a superblock?  In that
> case set_page_dirty will mark that inode dirty wrt its own
> superblock.  set_page_dirty() is currently an optional a_op,
> but it's not obvious that there will be a need for that.

Coda's inodes don't have to get dirtied because we never write them out,
although the associated dirty pages do need to hit the disk eventually :)

Jan



* Re: [prepatch] address_space-based writeback
  2002-04-10 22:56         ` Anton Altaparmakov
@ 2002-04-10 22:31           ` Andrew Morton
  2002-04-11 20:20           ` Linus Torvalds
  1 sibling, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-10 22:31 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Jan Harkes, linux-kernel

Anton Altaparmakov wrote:
> 
> At 22:44 10/04/02, Andrew Morton wrote:
> >When a page is marked dirty, the path which is followed
> >is page->mapping->host->i_sb.  So in this case the page will
> >be attached to its page->mapping.dirty_pages, and
> >page->mapping->host will be attached to page->mapping->host->i_sb.s_dirty
> >
> >This is as it always was - I didn't change any of this.
> 
> Um, NTFS uses address spaces for things where ->host is not an inode at all
> so doing host->i_sb will give you god knows what but certainly not a super
> block!

But it's a `struct inode *' :(

What happens when someone runs set_page_dirty against one of
the address_space's pages?  I guess that doesn't happen, because
it would explode.  Do these address_spaces not support writable
mappings?

I like to think in terms of "top down" and "bottom up".

set_page_dirty is the core "bottom up" function which propagates
dirtiness information from the bottom of the superblock/inode/page
tree up to the top.

writeback is top-down.  It goes from the superblock list down
to pages.

The assumption about page->mapping->host being an inode
only occurs in the bottom-up path, at set_page_dirty().

> As long as your patches don't break that is possible to have I am happy...
> But from what you are saying above I have a bad feeling you are somehow
> assuming that a mapping's host is an inode...

Well the default implementation of __set_page_dirty() will
make that assumption.  (It always has).

But the address_space may implement its own a_ops->set_page_dirty(page),
so you can do whatever you need to do there, yes?

I currently have:

static inline int set_page_dirty(struct page *page)
{
        if (page->mapping) {
                int (*spd)(struct page *, int reserve_page);

                spd = page->mapping->a_ops->set_page_dirty;
                if (spd)
                        return (*spd)(page, 1);
        }
        return __set_page_dirty_buffers(page, 1);
}

Where __set_page_dirty_buffers() will dirty the buffers if
they exist.  And non-buffer_head-backed filesystems which
use page->private MUST implement set_page_dirty().

The reserve_page stuff is for delayed-allocate, the priority
and timing of which has been pushed waaay back by this.  I'm
keeping the reserve_page infrastructure around at present
because of vague thoughts that it may be useful to fix the
data-loss bug which occurs when a shared mapping of a sparse
file has insufficient disk space to satisfy new page instantiations.
Dunno yet.

(Sometime I need to go through and spell out all the new a_ops
methods in all the filesystems, and take out the fall-through-
to-default-handler stuff here, and in do_flushpage() and
try_to_release_page() and others.  But not now).


-


* Re: [prepatch] address_space-based writeback
  2002-04-10 21:44       ` Andrew Morton
@ 2002-04-10 22:56         ` Anton Altaparmakov
  2002-04-10 22:31           ` Andrew Morton
  2002-04-11 20:20           ` Linus Torvalds
  2002-04-10 23:02         ` Jan Harkes
  1 sibling, 2 replies; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-10 22:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Harkes, linux-kernel

At 22:44 10/04/02, Andrew Morton wrote:
>When a page is marked dirty, the path which is followed
>is page->mapping->host->i_sb.  So in this case the page will
>be attached to its page->mapping.dirty_pages, and
>page->mapping->host will be attached to page->mapping->host->i_sb.s_dirty
>
>This is as it always was - I didn't change any of this.

Um, NTFS uses address spaces for things where ->host is not an inode at all 
so doing host->i_sb will give you god knows what but certainly not a super 
block!

As long as your patches don't break what is currently possible to have, I am
happy...  But from what you are saying above I have a bad feeling you are
somehow assuming that a mapping's host is an inode...

Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [prepatch] address_space-based writeback
  2002-04-10 21:44       ` Andrew Morton
  2002-04-10 22:56         ` Anton Altaparmakov
@ 2002-04-10 23:02         ` Jan Harkes
  1 sibling, 0 replies; 48+ messages in thread
From: Jan Harkes @ 2002-04-10 23:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Wed, Apr 10, 2002 at 02:44:40PM -0700, Andrew Morton wrote:
> Jan Harkes wrote:
> > But Coda has 2 inodes, which one are you connecting to whose superblock.
> > My guess is that it would be correct to add inode->i_mapping->host to
> > inode->i_mapping->host->i_sb, which will be the underlying inode in
> > Coda's case, but host isn't guaranteed to be an inode, it just happens
> > to be an inode in all existing situations.
> 
> When a page is marked dirty, the path which is followed
> is page->mapping->host->i_sb.  So in this case the page will
> be attached to its page->mapping.dirty_pages, and
> page->mapping->host will be attached to page->mapping->host->i_sb.s_dirty
> 
> This is as it always was - I didn't change any of this.

As far as I can see, it should work just fine.

> > Coda's inodes don't have to get dirtied because we never write them out,
> > although the associated dirty pages do need to hit the disk eventually :)
> 
> Right.  Presumably, the pages hit the disk via the hosting inode's
> filesystem's mechanics.
>
> And it remains the case that Coda inodes will not be marked DIRTY_PAGES
> because set_page_dirty()'s page->mapping->host walk will arrive at
> the hosting inode.

Correct, we rely completely on the hosting (inode's) filesystem to
implement the actual file operations.  So the hosting inode is the
one that becomes dirty and is pushed to disk by bdflush.

Jan



* Re: [prepatch] address_space-based writeback
  2002-04-10 22:56         ` Anton Altaparmakov
  2002-04-10 22:31           ` Andrew Morton
@ 2002-04-11 20:20           ` Linus Torvalds
  2002-04-11 20:41             ` Alexander Viro
  2002-04-12  1:15             ` Anton Altaparmakov
  1 sibling, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2002-04-11 20:20 UTC (permalink / raw)
  To: linux-kernel

In article <5.1.0.14.2.20020410235415.03d41d00@pop.cus.cam.ac.uk>,
Anton Altaparmakov  <aia21@cam.ac.uk> wrote:
>
>Um, NTFS uses address spaces for things where ->host is not an inode at all 
>so doing host->i_sb will give you god knows what but certainly not a super 
>block!

Then that should be fixed in NTFS.

The original meaning of "->host" was that it could be anything (and it
was a "void *"), but the fact is that all the generic VM code etc needed
to know about host things like size, locking etc, so for over a year now
"host" has been a "struct inode", and if you need to have something
else, then that something else has to embed a proper inode.

>As long as your patches don't break that is possible to have I am happy... 
>But from what you are saying above I have a bad feeling you are somehow 
>assuming that a mapping's host is an inode...

It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
and notice the

	struct inode            *host;

part.

		Linus


* Re: [prepatch] address_space-based writeback
  2002-04-11 20:20           ` Linus Torvalds
@ 2002-04-11 20:41             ` Alexander Viro
  2002-04-11 21:27               ` Andrew Morton
  2002-04-12  1:15             ` Anton Altaparmakov
  1 sibling, 1 reply; 48+ messages in thread
From: Alexander Viro @ 2002-04-11 20:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel



On Thu, 11 Apr 2002, Linus Torvalds wrote:

> >As long as your patches don't break that is possible to have I am happy... 
> >But from what you are saying above I have a bad feeling you are somehow 
> >assuming that a mapping's host is an inode...
> 
> It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
> and notice the
> 
> 	struct inode            *host;

True.  However, Andrew _is_ assuming that you can get from inode->i_mapping to 
&FOOFS_I(inode)->secondary_address_space, which is simply not true.

Consider a filesystem that uses an address_space to store, say, the EAs of an inode.
Now look at device node on such fs.  What we have is

inode_1: sits on our fs, has i_mapping pointing to inode_2->i_data and
EA_address_space in fs-private part of inode;

inode_2: block device inode, sits on bdev.

inode_1->i_mapping == &inode_2->i_data
inode_1->i_mapping->host == inode2
FOOFS_I(inode_1)->EA_address_space.host == inode_1

Looks OK?  Now look at Andrew's proposal - he suggests having the
method of inode_1->i_mapping be responsible for writing out
&FOOFS_I(inode_1)->EA_address_space.

See where it breaks?  Once we've followed ->i_mapping we lose information
about the private parts of the inode.  And that's OK - that's precisely what we
want for, say, block devices.  There ->i_mapping should be the same
for _all_ inodes with this major:minor.  However, that makes "scan all
superblocks, for each of them scan dirty inodes, for each of them do
some work on ->i_mapping" hopeless as a way to reach all address_spaces
in the system.

FWIW, correct solution might be to put dirty address_spaces on a list -
per-superblock or global.
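A minimal userspace sketch of that idea (every structure and function name
here is hypothetical, not from the patch): each address_space carries a
list_head, and dirtying a mapping for the first time links it onto its
superblock's list, which the writeback thread can then walk directly.

```c
#include <assert.h>

/* Minimal stand-ins for the kernel's list primitives. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}
static int list_empty(const struct list_head *h) { return h->next == h; }

struct address_space {
    struct list_head dirty_list;   /* links into sb->s_dirty_mappings */
    int dirty;
};

struct super_block {
    struct list_head s_dirty_mappings; /* all dirty address_spaces on this sb */
};

/* Called when a mapping's first page is dirtied; idempotent. */
static void mark_mapping_dirty(struct super_block *sb, struct address_space *m)
{
    if (!m->dirty) {
        m->dirty = 1;
        list_add(&m->dirty_list, &sb->s_dirty_mappings);
    }
}
```

With this, "scan all superblocks, for each scan dirty mappings" reaches
every dirty address_space without going through an intermediate inode.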


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-11 20:41             ` Alexander Viro
@ 2002-04-11 21:27               ` Andrew Morton
  2002-04-11 22:55                 ` Andreas Dilger
  2002-04-11 23:22                 ` Anton Altaparmakov
  0 siblings, 2 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-11 21:27 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, linux-kernel

Alexander Viro wrote:
> 
> ...
> FWIW, correct solution might be to put dirty address_spaces on a list -
> per-superblock or global.

Another approach may be to implement address_space.private,
which appears to be what NTFS wants.  My initial reaction
to that is fear, because it just makes the graph even more
tangly.

I agree that listing the dirty address_spaces against the
superblock makes sense - really it's what I'm trying to do,
and the intermediate inode is the only means of getting there.

Also, this splitup would clearly separate the concepts
of a dirty-inode and dirty-inode-pages.  These seem to be
coupled in a non-obvious way at present.

AFAIK, the current superblock->inode->mapping approach won't
break any existing filesystems, so I'm inclined to follow 
that for the while, get all the known problems collected
together and then take another pass at it. Maybe do something
to make inode_lock a bit more conventional as well.


This whole trend toward a flowering of tiny address_spaces
worries me a little from the performance POV, because
writeback likes big gobs of linear data to chew on.  With the
global buffer LRU, even though large-scale sorting 
opportunities were never implemented, we at least threw
a decent amount of data at the request queue.

With my current approach, all dirty conventional metadata
(the output from mark_buffer_dirty) is written back via
the blockdev's mapping.  It becomes a bit of a dumping
ground for the output of legacy filesystems, but it does
offer the opportunity to implement a high-level sort before
sending it to disk.  If that's needed.

I guess that as long as the periodic- and memory-pressure
based writeback continues to send a lot of pages down the
pipe we'll still get adequate merging, but it is something
to be borne in mind.  At the end of the day we have to deal
with these funny spinning things.


One thing I'm not clear on with the private metadata address_space
concept: how will it handle blocksize less than PAGE_CACHE_SIZE? 
The only means we have at present of representing sub-page
segments is the buffer_head.  Do we want to generalise the buffer
layer so that it can be applied against private address_spaces?
That wouldn't be a big leap.

Or would the metadata address_spaces send their I/O through the
backing blockdev mapping in some manner?

-


* Re: [prepatch] address_space-based writeback
  2002-04-11 22:55                 ` Andreas Dilger
@ 2002-04-11 22:49                   ` Andrew Morton
  2002-04-12  0:12                     ` Linus Torvalds
  2002-04-11 23:10                   ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2002-04-11 22:49 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

Andreas Dilger wrote:
> 
> On Apr 11, 2002  14:27 -0700, Andrew Morton wrote:
> > One thing I'm not clear on with the private metadata address_space
> > concept: how will it handle blocksize less than PAGE_CACHE_SIZE?
> > The only means we have at present of representing sub-page
> > segments is the buffer_head.  Do we want to generalise the buffer
> > layer so that it can be applied against private address_spaces?
> > That wouldn't be a big leap.
> 
> I was going to send you an email on this previously, but I (think I)
> didn't in the end...
> 
> At one time Linus proposed having an array of dirty bits for a page,
> which would allow us to mark only parts of a page dirty (say down to
> the sector level).  I believe this was in the discussion about moving
> the block devices to the page cache around 2.4.10.

What Christoph said :)

One problem with having a bitmap at page->private is that
we don't know how big the blocks are.  You could have
0x00000003 but you don't know if that's two 1k blocks
or two 2k blocks.  Not hard to work around I guess.

But you will, I suspect, end up needing additional
information.  Whether that part of the page has a
disk mapping.  If so, what block is it.  Maybe not a
lot more.
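One hypothetical workaround, sketched in userspace C (the encoding and
helper names are invented, not from any patch): keep log2(blocksize) in
the low bits of the page->private word, next to the dirty bitmap, so a
value like 0x3 becomes unambiguous.

```c
/*
 * Pack log2(blocksize) into the low 4 bits of page->private and the
 * per-block dirty bitmap into the remaining bits.  Hypothetical.
 */
static unsigned long pack_private(unsigned int blkbits, unsigned long dirty_map)
{
    return (dirty_map << 4) | (blkbits & 0xfUL);
}

static unsigned int private_blocksize(unsigned long priv)
{
    return 1u << (priv & 0xf);
}

static unsigned long private_dirty_map(unsigned long priv)
{
    return priv >> 4;
}
```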

In the delalloc no-buffer_head code I'm just sticking a block
number at page->private, to cache the disk mapping, to save
additional get_block calls when the file is overwritten.  That
works if all the page's blocks are disk-contiguous.  If they're
not, repeated get_block calls are needed.   That all works OK.
Even when I wasn't caching get_block results at all I didn't
observe any particular performance problems from the need for
repeated get_block calls.

We could also just stick a dynamically-allocated array
of next-gen buffer_heads at page->private.  There's really
no need to allocate each buffer separately.  This would
save memory, save CPU and with a bit of pointer arithmetic,
liberate b_this_page and b_data.
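A hypothetical userspace model of that layout (all names invented): with
one contiguous array of slimmed-down buffers per page, b_this_page
reduces to index arithmetic and b_data to an offset computation.

```c
#include <stdlib.h>

#define MY_PAGE_SIZE 4096

/* Slimmed-down "next-gen" buffer: b_this_page and b_data are gone --
 * both fall out of the buffer's index in a per-page array. */
struct ngbuf {
    unsigned long b_blocknr;
    unsigned int  b_state;
};

struct page {
    struct ngbuf *private;     /* one ngbuf[nr_blocks] array, not a ring */
    char data[MY_PAGE_SIZE];
};

/* b_data replacement: the data a buffer covers is just an offset. */
static char *ngbuf_data(struct page *pg, unsigned int i, unsigned int blocksize)
{
    return pg->data + i * blocksize;
}

/* b_this_page replacement: the circular walk becomes modular arithmetic. */
static struct ngbuf *ngbuf_next(struct page *pg, unsigned int i, unsigned int nr)
{
    return &pg->private[(i + 1) % nr];
}
```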


Anyway, I puzzled it out.  Private address_spaces will "just work"
at present.  There are some wrinkles.  Take the example of ext2
indirect blocks.   Suppose they are in an address_space which is
contained in the private part of the inode.  Presumably, that
address_space's ->host points back at the S_ISREG inode.  But even if
it doesn't, the indirect block's address_space needs to make damn
sure that it passes the main file's inode into the block allocator
so that the indirects are laid out in the right place.    The cost
of not placing the first indirect immediately behind block 11 is
30%.  Been there, done that :)

Now, if it's done right, that indirect block address_space is
disk-fragmented all over the shop.  Each of its blocks are four
megabytes apart.  This has the potential to come significantly
unstuck with address_space-based writeback, because the walk
across the file's pages will be in a separate pass from the
walk across the indirects.

In the 2.4-way of doing things, the regular data and the indirects
are nicely sorted on the global buffer LRU.

It is easy to solve this with delayed-allocate.  Because here,
you *know* that when an indirect block is allocated, I/O
is underway right now against its neighbouring data blocks.
So you just start I/O against the indirect as soon as you
allocate it.

But for synchronous-allocate filesystems I have a problem.  This
could significantly suck for writeout of very large files, when
the request queue is at its current teeny size.  Maybe writepage
of regular data at an indirect boundary needs to hunt down its
associated indirect block and start low-level I/O against that.
That will work.

Hopefully when Jens gets the new I/O scheduler in place we can
comfortably extend the request queue to 1024 slots or so.

-


* Re: [prepatch] address_space-based writeback
  2002-04-11 21:27               ` Andrew Morton
@ 2002-04-11 22:55                 ` Andreas Dilger
  2002-04-11 22:49                   ` Andrew Morton
  2002-04-11 23:10                   ` Christoph Hellwig
  2002-04-11 23:22                 ` Anton Altaparmakov
  1 sibling, 2 replies; 48+ messages in thread
From: Andreas Dilger @ 2002-04-11 22:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

On Apr 11, 2002  14:27 -0700, Andrew Morton wrote:
> One thing I'm not clear on with the private metadata address_space
> concept: how will it handle blocksize less than PAGE_CACHE_SIZE? 
> The only means we have at present of representing sub-page
> segments is the buffer_head.  Do we want to generalise the buffer
> layer so that it can be applied against private address_spaces?
> That wouldn't be a big leap.

I was going to send you an email on this previously, but I (think I)
didn't in the end...

At one time Linus proposed having an array of dirty bits for a page,
which would allow us to mark only parts of a page dirty (say down to
the sector level).  I believe this was in the discussion about moving
the block devices to the page cache around 2.4.10.

While that isn't a huge win in most cases (it costs about the same to
write 512 bytes as 4096 bytes to disk because of disk latencies) it may
be more important if/when we ever have larger pages.  This also becomes
more important if you are working with a network filesystem where you
have to send all of the dirty data over a much smaller pipe, so sending
512 bytes takes 1/8 as long as sending 4096 bytes.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: [prepatch] address_space-based writeback
  2002-04-11 23:22                 ` Anton Altaparmakov
@ 2002-04-11 23:03                   ` Andrew Morton
  2002-04-12  4:19                   ` Bill Davidsen
  1 sibling, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-11 23:03 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

Anton Altaparmakov wrote:
> 
> ...
> It would be great to be able to submit variable size "io entities" even
> greater than PAGE_CACHE_SIZE (by giving a list of pages, starting offset
> in first page and total request size for example) and saying write that to
> the device starting at offset xyz. That would suit ntfs perfectly. (-:
> 

Yes, I'll be implementing that.  Writeback will hit the filesystem
via a_ops->writeback_mapping(mapping, nr_pages).  The filesystem
will then call in to generic_writeback_mapping(), which will walk
the pages, assemble BIOs and send them off.

The filesystem needs to pass a little state structure into the
generic writeback function.  An important part of that is a
semaphore which is held while writeback is locking multiple
pages.  To avoid ab/ba deadlocks.
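As a toy userspace model of that control flow (the method name follows the
description above; the structures and everything else are invented for
illustration, not the real patch): the VM hands the mapping's method a
page budget, and the filesystem usually just forwards to the generic
walker, which "submits" dirty pages until the budget runs out.

```c
#define NR_PAGES 8

struct address_space {
    int dirty[NR_PAGES];
    int (*writeback_mapping)(struct address_space *mapping, int *nr_pages);
};

/* Generic helper: walk the mapping's pages and write back dirty ones,
 * decrementing the caller's budget as it goes. */
static int generic_writeback_mapping(struct address_space *m, int *nr_pages)
{
    for (int i = 0; i < NR_PAGES && *nr_pages > 0; i++) {
        if (m->dirty[i]) {
            m->dirty[i] = 0;    /* stands in for assembling a BIO */
            (*nr_pages)--;
        }
    }
    return 0;
}

/* A filesystem's method typically just calls into the generic helper. */
static int foofs_writeback_mapping(struct address_space *m, int *nr_pages)
{
    return generic_writeback_mapping(m, nr_pages);
}
```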

The current implementation of this bio-assembler is for no-buffer
(delalloc) filesystems only.  It needs to be enhanced (or forked)
to also cope with buffer-backed pages. It will need to peek
inside the buffer-heads to detect clean buffers (maybe.  It
definitely needs to skip unmapped ones).  When such a buffer
is encountered the BIO will be sent off and a new one will be started.
The code for this is going to be quite horrendous.  I suspect
the comment/code ratio will exceed 1.0, which is a bad sign :)

One thing upon which I am undecided:  for syncalloc filesystems
like ext2, do we attach buffers at ->readpages() time, or do
we just leave the page bufferless?

That's a hard call.  It helps the common case, but in the uncommon
case (we overwrite the file after reading it), we need to run
get_block again.

-


* Re: [prepatch] address_space-based writeback
  2002-04-11 22:55                 ` Andreas Dilger
  2002-04-11 22:49                   ` Andrew Morton
@ 2002-04-11 23:10                   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2002-04-11 23:10 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Linus Torvalds, linux-kernel

On Thu, Apr 11, 2002 at 04:55:36PM -0600, Andreas Dilger wrote:
> At one time Linus proposed having an array of dirty bits for a page,
> which would allow us to mark only parts of a page dirty (say down to
> the sector level).  I believe this was in the discussion about moving
> the block devices to the page cache around 2.4.10.

The early XFS code used to do this for kiobuf-based block I/O, but it
got dropped around 2.4.8 IIRC.  The new page->private handling from akpm
which seems to be merged in Linus' BK tree (Linus: please push it to
bkbits.net, thanks :)) can be used to do the same if and only if we don't
need the buffer_heads attached to the page.



* Re: [prepatch] address_space-based writeback
  2002-04-11 21:27               ` Andrew Morton
  2002-04-11 22:55                 ` Andreas Dilger
@ 2002-04-11 23:22                 ` Anton Altaparmakov
  2002-04-11 23:03                   ` Andrew Morton
  2002-04-12  4:19                   ` Bill Davidsen
  1 sibling, 2 replies; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-11 23:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

On Thu, 11 Apr 2002, Andrew Morton wrote:
> Alexander Viro wrote:
> > ...
> > FWIW, correct solution might be to put dirty address_spaces on a list -
> > per-superblock or global.
> 
> Another approach may be to implement address_space.private,
> which appears to be what NTFS wants.  My initial reaction
> to that is fear, because it just makes the graph even more
> tangly.
> 
> I agree that listing the dirty address_spaces against the
> superblock makes sense - really it's what I'm trying to do,
> and the intermediate inode is the only means of getting there.

If you take Al's approach then the intermediate inode becomes irrelevant.
I like the idea of per super block dirty mappings list a lot. In fact I
was planning to do that for ntfs and attach the list to the ntfs_volume
and then have a kntfsd thread walk those every 5 seconds and flush to
disk... (I was also thinking of throwing in barriers and
flush_to_this_barrier_now functionality for journalling purposes but I
was going to leave that for later...)

It would be great if the VFS implemented that in general instead. (-:

> Also, this splitup would clearly separate the concepts
> of a dirty-inode and dirty-inode-pages.  These seem to be
> coupled in a non-obvious way at present.

Indeed. That would be very nice.

> AFAIK, the current superblock->inode->mapping approach won't
> break any existing filesystems, so I'm inclined to follow 
> that for the while, get all the known problems collected
> together and then take another pass at it. Maybe do something
> to make inode_lock a bit more conventional as well.
> 
> This whole trend toward a flowering of tiny address_spaces
> worries me a little from the performance POV,

Why tiny address_spaces? I have seen a 4.5GiB $MFT/$DATA (inode table)
on a 40GiB ntfs partition, and that was a year ago now... That's not what
I call tiny. (-;

The cluster allocation bitmap on a 40GiB partition is 10MiB in size, that
is not huge but not what I would call tiny either...

> because writeback likes big gobs of linear data to chew on.  With the
> global buffer LRU, even though large-scale sorting opportunities were
> never implemented, we at least threw a decent amount of data at the
> request queue. 
> 
> With my current approach, all dirty conventional metadata
> (the output from mark_buffer_dirty) is written back via
> the blockdev's mapping.  It becomes a bit of a dumping
> ground for the output of legacy filesystems, but it does
> offer the opportunity to implement a high-level sort before
> sending it to disk.  If that's needed.

Considering modern harddrives have their own intelligent write back and
sorting of write requests and ever growing hd buffers the need for doing
this at the OS level is going to become less and less would be my guess, I
may be wrong of course...

> I guess that as long as the periodic- and memory-pressure
> based writeback continues to send a lot of pages down the
> pipe we'll still get adequate merging, but it is something
> to be borne in mind.  At the end of the day we have to deal
> with these funny spinning things.
> 
> One thing I'm not clear on with the private metadata address_space
> concept: how will it handle blocksize less than PAGE_CACHE_SIZE? 

The new ntfs uses 512 byte blocksize ONLY (the hd sector size) and remaps
internally between the actual partition cluster size (block size) and the
512 byte block size set in s_blocksize.

This is the only way to support cluster sizes above PAGE_CACHE_SIZE -
NTFS 3.1 can go up to 512kiB clusters (and bigger, but Windows doesn't
allow you to format partitions with bigger cluster sizes AFAIK). 

At the moment I am using a buffer head for each 512 byte disk sector, which
is very counterproductive for performance. 

It would be great to be able to submit variable size "io entities" even
greater than PAGE_CACHE_SIZE (by giving a list of pages, starting offset
in first page and total request size for example) and saying write that to
the device starting at offset xyz. That would suit ntfs perfectly. (-:
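A hypothetical shape for such an "io entity", sketched in userspace C
(all names and fields are invented, just to make the request format
concrete): a run of page-cache pages, a byte offset into the first one, a
total length that may far exceed PAGE_CACHE_SIZE, and a device offset.

```c
#include <stddef.h>

#define PAGE_CACHE_SIZE 4096

/* Purely illustrative descriptor for a variable-sized write request. */
struct io_entity {
    void    **pages;        /* the list of page-cache pages */
    size_t    first_offset; /* starting offset within pages[0] */
    size_t    length;       /* total bytes; may exceed PAGE_CACHE_SIZE */
    long long dev_offset;   /* where on the device to write it */
};

/* How many pages such a request spans. */
static size_t io_entity_nr_pages(const struct io_entity *io)
{
    return (io->first_offset + io->length + PAGE_CACHE_SIZE - 1)
           / PAGE_CACHE_SIZE;
}
```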

> The only means we have at present of representing sub-page
> segments is the buffer_head.  Do we want to generalise the buffer
> layer so that it can be applied against private address_spaces?
> That wouldn't be a big leap.

It is already generalised. I use it that way at least at the moment...

> Or would the metadata address_spaces send their I/O through the
> backing blockdev mapping in some manner?

Not sure what you mean...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [prepatch] address_space-based writeback
  2002-04-11 22:49                   ` Andrew Morton
@ 2002-04-12  0:12                     ` Linus Torvalds
  0 siblings, 0 replies; 48+ messages in thread
From: Linus Torvalds @ 2002-04-12  0:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andreas Dilger, Alexander Viro, linux-kernel



On Thu, 11 Apr 2002, Andrew Morton wrote:
>
> One problem with having a bitmap at page->private is that
> we don't know how big the blocks are.  You could have
> 0x00000003 but you don't know if that's two 1k blocks
> or two 2k blocks.  Not hard to work around I guess.

I think the more fundamental problem is that we want to have a generic
mechanism, and for other filesystems the writeback data is a lot different
from "this sector is dirty".

That, in the end, convinced me that there's no point in trying to keep
per-sector dirty data around in a generic form - it just wasn't generic
enough.

So now it's up to the writeback mechanism itself to keep track of which
part of the page is dirty, be it with the NFS kind of "nfs_page"
structures, or with things like "struct buffer_head", and I certainly no
longer have any plans at all to try to keep anything like a "dirty bitmap"
in the generic data structures.

		Linus



* Re: [prepatch] address_space-based writeback
  2002-04-11 20:20           ` Linus Torvalds
  2002-04-11 20:41             ` Alexander Viro
@ 2002-04-12  1:15             ` Anton Altaparmakov
  2002-04-12  1:37               ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-12  1:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

At 21:20 11/04/02, Linus Torvalds wrote:
>In article <5.1.0.14.2.20020410235415.03d41d00@pop.cus.cam.ac.uk>,
>Anton Altaparmakov  <aia21@cam.ac.uk> wrote:
> >
> >Um, NTFS uses address spaces for things where ->host is not an inode at all
> >so doing host->i_sb will give you god knows what but certainly not a super
> >block!
>
>Then that should be fixed in NTFS.
>
>The original meaning of "->host" was that it could be anything (and it
>was a "void *", but the fact is that all the generic VM code etc needed
>to know about host things like size, locking etc, so for over a year now
>"host" has been a "struct inode", and if you need to have something
>else, then that something else has to embed a proper inode.
>
> >As long as your patches don't break that is possible to have I am happy...
> >But from what you are saying above I have a bad feeling you are somehow
> >assuming that a mapping's host is an inode...
>
>It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
>and notice the
>
>         struct inode            *host;
>
>part.

Yes I know that. Why not extend address spaces beyond being just inode 
caches? An address space with its mapping is a very useful generic data 
cache object.

Requiring a whole struct inode seems silly. I suspect (without in depth 
knowledge of the VM yet...) the actual number of fields in the struct inode 
being used by the generic VM code is rather small... They could be split 
away from the inode and moved into struct address_space instead if that is 
where they actually belong...

Just a very basic example: Inode 0 on ntfs is a file named $MFT. It 
contains two data parts: one containing the actual on-disk inode metadata 
and one containing the inode allocation bitmap. At the moment I have placed 
the inode metadata in the "normal" inode address_space mapping and of 
course ->host points back to inode 0. But for the inode allocation bitmap I 
set ->host to the current ntfs_volume structure instead and I am at present 
using a special readpage function to access this address space. Making 
->host point back to the original inode 0 makes no sense as i_size for 
example would be the size of the other address space - the inode data one. 
(And faking an inode is not fun: then I would have to keep track of fake 
inodes, etc, and the kernel already restricts us to only 32 bits worth of 
inode numbers while ntfs uses s64 for that... and inode locking then becomes 
even more interesting than it already is...)

The mere fact that the VM requires to know the size of the data in an 
address space just indicates to me that struct address_space ought to 
contain a size field in it.

So while something should be fixed I think it is the kernel and not ntfs. (-;

Best regards,

Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [prepatch] address_space-based writeback
  2002-04-12  1:15             ` Anton Altaparmakov
@ 2002-04-12  1:37               ` Linus Torvalds
  2002-04-12  7:57                 ` Anton Altaparmakov
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2002-04-12  1:37 UTC (permalink / raw)
  To: linux-kernel

In article <5.1.0.14.2.20020412015633.01f1f3c0@pop.cus.cam.ac.uk>,
Anton Altaparmakov  <aia21@cam.ac.uk> wrote:
>At 21:20 11/04/02, Linus Torvalds wrote:
>>
>>It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
>>and notice the
>>
>>         struct inode            *host;
>>
>>part.
>
>Yes I know that. Why not extend address spaces beyond being just inode 
>caches? An address space with its mapping is a very useful generic data 
>cache object.

Sure it is. And at worst you'll just have to have a dummy inode to
attach to it.

Could we split up just the important fields and create a new level of
abstraction? Sure.  But since there are no major users, and since it
would make the common case less readable, there simply isn't any point. 

Over-abstraction is a real problem - it makes the code less obvious, and
if there is no real gain from it then over-abstracting things is a bad
thing. 

>Just a very basic example: Inode 0 on ntfs is a file named $MFT. It 
>contains two data parts: one containing the actual on-disk inode metadata 
>and one containing the inode allocation bitmap. At the moment I have placed 
>the inode metadata in the "normal" inode address_space mapping and of 
>course ->host points back to inode 0. But for the inode allocation bitmap I 
>set ->host to the current ntfs_volume structure instead and I am at present 
>using a special readpage function to access this address space.

Why not just allocate a separate per-volume "bitmap" inode? It makes
sense, and means that you can then use all the generic routines to let
it be automatically written back to disk etc. 

Also, don't you get the "ntfs_volume" simply through "inode->i_sb"? It
looks like you're just trying to save off the same piece of information
that you already have in another place. Redundancy isn't a virtue.

>The mere fact that the VM requires to know the size of the data in an 
>address space just indicates to me that struct address_space ought to 
>contain a size field in it.

It doesn't "require" it - in the sense that you can certainly use
address spaces without using those routines that assume that it has an
inode.  It just means that you don't get the advantage of the
higher-level concepts. 

>So while something should be fixed I think it is the kernel and not ntfs. (-;

There are a few fundamental concepts in UNIX, and one of them is
"everything is a file".

And like it or not, the Linux concept of a file _requires_ an inode (it
also requires a dentry to describe the path to that inode, and it
requires the "struct file" itself). 

The notion of "struct inode" as the lowest level of generic IO entity is
very basic in the whole design of Linux.  If you try to avoid it, you're
most likely doing something wrong.  You think your special case is
somehow independent and more important than one of the basic building
blocks of the whole system. 

			Linus


* Re: [prepatch] address_space-based writeback
  2002-04-11 23:22                 ` Anton Altaparmakov
  2002-04-11 23:03                   ` Andrew Morton
@ 2002-04-12  4:19                   ` Bill Davidsen
  1 sibling, 0 replies; 48+ messages in thread
From: Bill Davidsen @ 2002-04-12  4:19 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Linux Kernel Mailing List

On Fri, 12 Apr 2002, Anton Altaparmakov wrote:

> Considering modern harddrives have their own intelligent write back and
> sorting of write requests and ever growing hd buffers the need for doing
> this at the OS level is going to become less and less would be my guess, I
> may be wrong of course...

Clearly this is true. However, while the benefits of o/s effort are down
on top end hardware, one of the strengths of Linux is that it runs well on
small, or embedded, or old, or totally obsolete hardware. So there is a PR
and social benefit from the efforts already invested in making the code
use the hardware carefully.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: [prepatch] address_space-based writeback
  2002-04-12  1:37               ` Linus Torvalds
@ 2002-04-12  7:57                 ` Anton Altaparmakov
  2002-04-27 15:53                   ` Jan Harkes
  0 siblings, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-12  7:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

At 02:37 12/04/02, Linus Torvalds wrote:
>In article <5.1.0.14.2.20020412015633.01f1f3c0@pop.cus.cam.ac.uk>,
>Anton Altaparmakov  <aia21@cam.ac.uk> wrote:
> >At 21:20 11/04/02, Linus Torvalds wrote:
> >>
> >>It's not Andrew who is assuming anything: it _is_. Look at <linux/fs.h>,
> >>and notice the
> >>
> >>         struct inode            *host;
> >>
> >>part.
> >
> >Yes I know that. Why not extend address spaces beyond being just inode
> >caches? An address space with its mapping is a very useful generic data
> >cache object.
>
>Sure it is. And at worst you'll just have to have a dummy inode to
>attach to it.
>
>Could we split up just the important fields and create a new level of
>abstraction? Sure.  But since there are no major users, and since it
>would make the common case less readable, there simply isn't any point.

Ok, fair enough. While ntfs would not be the only fs to benefit, major fs 
like ext2/3 don't have named streams, EAs, bitmap attributes, or other 
data-holding constructs beyond the traditional "file contents data stream", so 
they probably don't have the need for a new level of abstraction.

But that then leads to the question which fields in the dummy inode must be 
initialized. My biggest concern is whether i_ino needs to be unique or 
whether it can be copied from the real inode (or perhaps even better 
whether I can just set it to a value like minus one to signify to ntfs 
"treat as special").

ntfs will need to distinguish the dummies from the real inodes and not only 
that, it will need to know which ntfs attribute they signify. Admittedly 
that part is trivial as the ntfs specific part of the inode can contain a 
NI_Dummy flag in the ->state bits and that can then change the meaning of 
the remainder of the ntfs_inode structure.

A locking problem also arises. To write out a real inode, you need to lock 
down all the dummy inodes associated with it and vice versa, so you get 
into ab/ba deadlock country. Further, dirty inodes become interesting: what 
happens if the real inode is evicted from ram but the attached dummies are 
still around? Probably would lead to dereferencing random memory. )-: So 
one would need to extend ntfs_clear_inode to kill off all dummy inodes 
belonging to a real inode when the real inode is evicted. And that again 
brings the problems along of some of the dummies may be dirty and/or 
locked, so potential deadlocks again...

>Over-abstraction is a real problem - it makes the code less obvious, and
>if there is no real gain from it then over-abstracting things is a bad
>thing.

Fair comment, but the current VM isn't exactly obvious either. </rant about 
lack of vm documentation> Understanding readpage and its helpers was easy, but 
the write code paths which I am trying to get my head round are not 
funny once you get into the details...

> >Just a very basic example: Inode 0 on ntfs is a file named $MFT. It
> >contains two data parts: one containing the actual on-disk inode metadata
> >and one containing the inode allocation bitmap. At the moment I have placed
> >the inode metadata in the "normal" inode address_space mapping and of
> >course ->host points back to inode 0. But for the inode allocation bitmap I
> >set ->host to the current ntfs_volume structure instead and I am at present
> >using a special readpage function to access this address space.
>
>Why not just allocate a separate per-volume "bitmap" inode? It makes
>sense, and means that you can then use all the generic routines to let
>it be automatically written back to disk etc.

Yes, I could do. I guess I can grab one from fs/inode.c::new_inode(). Do I 
need to provide unique i_ino though?

>Also, don't you get the "ntfs_volume" simply through "inode->i_sb"? It
>looks like you're just trying to save off the same piece of information
>that you already have in another place. Redundancy isn't a virtue.

Yes, I can get it from there but I didn't want to use the original inode as 
->host as that would have had potential for breaking things in odd ways in 
the VM. Using "ntfs_volume" was more likely to give more obvious "BANG" 
type of errors instead... (-;

> >The mere fact that the VM requires to know the size of the data in an
> >address space just indicates to me that struct address_space ought to
> >contain a size field in it.
>
>It doesn't "require" it - in the sense that you can certainly use
>address spaces without using those routines that assume that it has an
>inode.  It just means that you don't get the advantage of the
>higher-level concepts.

Indeed. And that in turn means I have to duplicate a lot of functionality 
from the core kernel in ntfs which is always a Bad Thing.

For file access I can use block_read_full_page with only ONE line changed, 
and that is the setting of the async I/O completion handler. NTFS requires 
its own handler because it has 3 different versions of i_size which need to 
be interpreted after the read from the device in order to determine whether 
any parts of the buffer head need to be zeroed out after the read (and 
there are 4 i_size versions for compressed files). I don't see any way to 
do this at an earlier stage than I/O completion, as some partial buffers 
need to be zeroed AFTER the I/O from disk has completed.
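To make the zeroing step concrete, here is a minimal userspace sketch (the function name and sizes are invented for illustration, this is not the actual NTFS completion handler): after a read completes, any part of a block lying beyond the applicable initialized size is zeroed, because the on-disk bytes past that point are stale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the post-read fixup: zero the part of a block
 * buffer that lies beyond the applicable "initialized size". */
static void zero_tail_after_read(unsigned char *buf, size_t block_start,
                                 size_t block_size, size_t init_size)
{
    if (init_size <= block_start) {
        /* the whole block lies beyond the initialized size */
        memset(buf, 0, block_size);
    } else if (init_size < block_start + block_size) {
        /* partial block: keep the valid head, zero the tail */
        size_t valid = init_size - block_start;
        memset(buf + valid, 0, block_size - valid);
    }
    /* else: block is fully within the initialized size, leave it alone */
}
```

In the real code this decision would have to run per buffer head at I/O completion, once the correct i_size variant for the attribute is known.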

If there were a way to provide my custom handler to block_read_full_page, 
or at least if end_buffer_io_async allowed a hook to be placed which would 
get run for each buffer head it is called with, that would decrease the 
amount of code duplication quite a lot... But I guess this falls under the 
"there are no major users" category... )-:

> >So while something should be fixed I think it is the kernel and not 
> ntfs. (-;
>
>There are a few fundamental concepts in UNIX, and one of them is
>"everything is a file".
>
>And like it or not, the Linux concept of a file _requires_ an inode (it
>also requires a dentry to describe the path to that inode, and it
>requires the "struct file" itself).
>
>The notion of "struct inode" as the lowest level of generic IO entity is
>very basic in the whole design of Linux.  If you try to avoid it, you're
>most likely doing something wrong.  You think your special case is
>somehow independent and more important than one of the basic building
>blocks of the whole system.

Thanks for the reminder. I have obviously been digging around Windows/NTFS 
for too long so I had distanced myself from the concept. )-:

Yet, this really begs the question of defining the concept of a file. I am 
quite happy with files being the io entity in ntfs. It is just that each 
file in ntfs can contain loads of different data-holding attributes which 
are all worth placing in address spaces. Granted, a dummy inode could be 
set up for each of those, which just means a lot of wasted ram, but ntfs is 
not that important so I have to take the penalty there. But if I also need 
unique inode numbers in those dummy inodes then the overhead is becoming 
very, very high...

Best regards,

         Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
IRC: #ntfs on irc.openprojects.net / ICQ: 8561279
WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-10 11:21 [prepatch] address_space-based writeback Andrew Morton
  2002-04-10 11:34 ` Alexander Viro
  2002-04-10 19:29 ` Jeremy Jackson
@ 2002-04-15  8:47 ` Andrew Morton
  2 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2002-04-15  8:47 UTC (permalink / raw)
  To: lkml

An update to these patches is at

	http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.8/

All work at this time is on

dallocbase-10-page_alloc_fail.patch
dallocbase-35-pageprivate_fixes.patch
dallocbase-55-ext2_dir.patch
dallocbase-60-page_accounting.patch
ratcache-pf_memalloc.patch
mempool-node.patch
dallocbase-70-writeback.patch
ttd		(my current 2.5 things-to-do-list, just for fun)

None of the patches beyond these has even been tested in the last week...

The new buffer<->page relationship rules seem to be working
out OK.  In a six-hour stress test on a quad Xeon with
1k blocksize ext3 in ordered-data mode there was one failure:
a block in a file which came up with wrong data.  There appears
to be a race in ext3 or the patch or 2.5 or something somewhere.
Still hunting this one.

It is all relatively stable now.  ramdisk, loop, reiserfs, ext2,
ext3-ordered, ext3-journalled, JFS and vfat have been tested.

minixfs and sysvfs are broken with these patches.  They rely
on preservation of directory data outside i_size.   Will fix.

-

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-12  7:57                 ` Anton Altaparmakov
@ 2002-04-27 15:53                   ` Jan Harkes
  2002-04-28  3:03                     ` Anton Altaparmakov
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Harkes @ 2002-04-27 15:53 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-kernel

On Fri, Apr 12, 2002 at 08:57:17AM +0100, Anton Altaparmakov wrote:
> Yet, this really begs the question of defining the concept of a file. I am 
> quite happy with files being the io entity in ntfs. It is just that each 
> file in ntfs can contain loads of different data-holding attributes which 
> are all worth placing in address spaces. Granted, a dummy inode could be 
> set up for each of those, which just means a lot of wasted ram, but ntfs is 
> not that important so I have to take the penalty there. But if I also need 
> unique inode numbers in those dummy inodes then the overhead is becoming 
> very, very high...

You could have all additional IO streams use the same inode number and
use iget4. Several inodes can have the same i_ino and the additional
argument would be a stream identifier that selects the correct 'IO
identity'.
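A userspace sketch of what Jan describes (all names here are made up, not the kernel's API): an iget4()-style lookup takes a find_actor callback plus an opaque pointer, here a stream identifier, so several in-core inodes sharing one i_ino can be told apart.

```c
#include <assert.h>
#include <stddef.h>

/* Toy in-core inode cache; in the kernel this would be the inode hash. */
struct sim_inode { unsigned long i_ino; int stream; };

typedef int (*find_actor_t)(struct sim_inode *, unsigned long, void *);

static struct sim_inode cache[8];
static int cache_used;

/* Return the cached inode matching (ino, opaque), or allocate one. */
static struct sim_inode *sim_iget4(unsigned long ino, find_actor_t actor,
                                   void *opaque)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].i_ino == ino && actor(&cache[i], ino, opaque))
            return &cache[i];                 /* hash hit */
    /* miss: new in-core inode for this (i_ino, stream) pair */
    struct sim_inode *ni = &cache[cache_used++];
    ni->i_ino = ino;
    ni->stream = *(int *)opaque;
    return ni;
}

/* The find_actor: the opaque pointer selects the "IO identity". */
static int match_stream(struct sim_inode *inode, unsigned long ino,
                        void *opaque)
{
    (void)ino;
    return inode->stream == *(int *)opaque;
}
```

With this, the data part and the allocation-bitmap part of NTFS inode 0 would get distinct in-core inodes despite sharing i_ino.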

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-27 15:53                   ` Jan Harkes
@ 2002-04-28  3:03                     ` Anton Altaparmakov
  2002-04-29  9:03                       ` Nikita Danilov
  0 siblings, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-28  3:03 UTC (permalink / raw)
  To: Jan Harkes; +Cc: linux-kernel

At 16:53 27/04/02, Jan Harkes wrote:
>On Fri, Apr 12, 2002 at 08:57:17AM +0100, Anton Altaparmakov wrote:
> > Yet, this really begs the question of defining the concept of a file. I am
> > quite happy with files being the io entity in ntfs. It is just that each
> > file in ntfs can contain loads of different data-holding attributes which
> > are all worth placing in address spaces. Granted, a dummy inode could be
> > set up for each of those, which just means a lot of wasted ram, but ntfs is
> > not that important so I have to take the penalty there. But if I also need
> > unique inode numbers in those dummy inodes then the overhead is becoming
> > very, very high...
>
>You could have all additional IO streams use the same inode number and
>use iget4. Several inodes can have the same i_ino and the additional
>argument would be a stream identifier that selects the correct 'IO
>identity'.

Great idea! I quickly looked into the implementation details and using 
iget4/read_inode2 perfectly reconciles my ideas of using an address space 
mapping for each ntfs attribute with the kernel's requirement of using 
inodes as the i/o entity by allowing a clean and unique mapping between 
multiple inodes with the same inode numbers and their attributes and 
address spaces.

I need to work out exactly how to do it but I will definitely go that way. 
That will make everything nice and clean and get rid of the existing 
kludges of passing around other types of objects instead of struct file * 
to my various readpage functions. Also I will be able to have fewer 
readpage functions... (-:

Thanks for the suggestion!

Best regards,

Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-28  3:03                     ` Anton Altaparmakov
@ 2002-04-29  9:03                       ` Nikita Danilov
  2002-04-29 11:11                         ` Anton Altaparmakov
  0 siblings, 1 reply; 48+ messages in thread
From: Nikita Danilov @ 2002-04-29  9:03 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Jan Harkes, linux-kernel

Anton Altaparmakov writes:
 > At 16:53 27/04/02, Jan Harkes wrote:
 > >On Fri, Apr 12, 2002 at 08:57:17AM +0100, Anton Altaparmakov wrote:
 > > > Yet, this really begs the question of defining the concept of a file. I am
 > > > quite happy with files being the io entity in ntfs. It is just that each
 > > > file in ntfs can contain loads of different data-holding attributes which
 > > > are all worth placing in address spaces. Granted, a dummy inode could be
 > > > set up for each of those, which just means a lot of wasted ram, but ntfs is
 > > > not that important so I have to take the penalty there. But if I also need
 > > > unique inode numbers in those dummy inodes then the overhead is becoming
 > > > very, very high...
 > >
 > >You could have all additional IO streams use the same inode number and
 > >use iget4. Several inodes can have the same i_ino and the additional
 > >argument would be a stream identifier that selects the correct 'IO
 > >identity'.
 > 
 > Great idea! I quickly looked into the implementation details and using 
 > iget4/read_inode2 perfectly reconciles my ideas of using an address space 
 > mapping for each ntfs attribute with the kernel's requirement of using 
 > inodes as the i/o entity by allowing a clean and unique mapping between 
 > multiple inodes with the same inode numbers and their attributes and 
 > address spaces.
 > 

Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
users for it would make it permanent. The only reason for ->read_inode2's
existence was that iget() was called by code external to the file system;
knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
iget() is never called from outside the file system, you can set the ntfs
->read_inode to a no-op and write your own function ntfs_iget(...) to be
called from ntfs_lookup() and ntfs_fill_super().

ntfs_iget() calls iget() (->read_inode is a no-op, hence iget doesn't
access the disk) and, if a new inode was allocated, reads data from the
disk, initializes the inode, etc.

I guess coda_iget() is an example of this.

 > I need to work out exactly how to do it but I will definitely go that way. 
 > That will make everything nice and clean and get rid of the existing 
 > kludges of passing around other types of objects instead of struct file * 
 > to my various readpage functions. Also I will be able to have fewer 
 > readpage functions... (-:
 > 
 > Thanks for the suggestion!
 > 
 > Best regards,
 > 
 > Anton
 > 

Nikita.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-29  9:03                       ` Nikita Danilov
@ 2002-04-29 11:11                         ` Anton Altaparmakov
  2002-04-29 11:59                           ` Nikita Danilov
  0 siblings, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-29 11:11 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: viro, Jan Harkes, linux-kernel

Hi,

I am cc:-ing Al Viro; perhaps Al could comment on the approach, as this 
would affect the future of read_inode2 in the VFS?

At 10:03 29/04/02, Nikita Danilov wrote:
>Anton Altaparmakov writes:
>  > At 16:53 27/04/02, Jan Harkes wrote:
>  > >You could have all additional IO streams use the same inode number and
>  > >use iget4. Several inodes can have the same i_ino and the additional
>  > >argument would be a stream identifier that selects the correct 'IO
>  > >identity'.
>  >
>  > Great idea! I quickly looked into the implementation details and using
>  > iget4/read_inode2 perfectly reconciles my ideas of using an address space
>  > mapping for each ntfs attribute with the kernel's requirement of using
>  > inodes as the i/o entity by allowing a clean and unique mapping between
>  > multiple inodes with the same inode numbers and their attributes and
>  > address spaces.
>
>Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
>users for it would make it permanent. The only reason for ->read_inode2's
>existence was that iget() was called by code external to the file system;
>knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
>iget() is never called from outside the file system, you can set the ntfs
>->read_inode to a no-op and write your own function ntfs_iget(...) to be
>called from ntfs_lookup() and ntfs_fill_super().
>
>ntfs_iget() calls iget() (->read_inode is a no-op, hence iget doesn't
>access the disk) and, if a new inode was allocated, reads data from the
>disk, initializes the inode, etc.
>
>I guess coda_iget() is an example of this.

This will not work AFAICS.

coda_iget() -> iget4() -> get_new_inode(), which calls ->read_inode or 
->read_inode2, and then unlocks the inode and wakes up the waiting tasks.

If ->read_inode and ->read_inode2 are NULL as you suggest for NTFS it means 
that as soon as ntfs_iget() has called iget4() there will be an 
uninitialized yet unlocked inode in memory which is guaranteed to cause 
NTFS to oops... (And any other fs using this approach.)

Before the inode is unlocked it MUST be initialized. And the only way to do 
this in the framework of the current VFS is to use ->read_inode or 
->read_inode2.

Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?

Best regards,

Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-29 11:11                         ` Anton Altaparmakov
@ 2002-04-29 11:59                           ` Nikita Danilov
  2002-04-29 12:34                             ` Anton Altaparmakov
  2002-04-30 17:19                             ` Denis Vlasenko
  0 siblings, 2 replies; 48+ messages in thread
From: Nikita Danilov @ 2002-04-29 11:59 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: viro, Jan Harkes, linux-kernel

Anton Altaparmakov writes:
 > Hi,
 > 
 > I am cc:-ing Al Viro, perhaps Al could comment on approach as this would 
 > affect the future of read_inode2 in VFS?
 > 
 > At 10:03 29/04/02, Nikita Danilov wrote:
 > >Anton Altaparmakov writes:
 > >  > At 16:53 27/04/02, Jan Harkes wrote:
 > >  > >You could have all additional IO streams use the same inode number and
 > >  > >use iget4. Several inodes can have the same i_ino and the additional
 > >  > >argument would be a stream identifier that selects the correct 'IO
 > >  > >identity'.
 > >  >
 > >  > Great idea! I quickly looked into the implementation details and using
 > >  > iget4/read_inode2 perfectly reconciles my ideas of using an address space
 > >  > mapping for each ntfs attribute with the kernel's requirement of using
 > >  > inodes as the i/o entity by allowing a clean and unique mapping between
 > >  > multiple inodes with the same inode numbers and their attributes and
 > >  > address spaces.
 > >
 > >Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
 > >users for it would make it permanent. The only reason for ->read_inode2's
 > >existence was that iget() was called by code external to the file system;
 > >knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
 > >iget() is never called from outside the file system, you can set the ntfs
 > >->read_inode to a no-op and write your own function ntfs_iget(...) to be
 > >called from ntfs_lookup() and ntfs_fill_super().
 > >
 > >ntfs_iget() calls iget() (->read_inode is a no-op, hence iget doesn't
 > >access the disk) and, if a new inode was allocated, reads data from the
 > >disk, initializes the inode, etc.
 > >
 > >I guess coda_iget() is an example of this.
 > 
 > This will not work AFAICS.
 > 
 > coda_iget() -> iget4() -> get_new_inode(), which calls ->read_inode or 
 > ->read_inode2, and then unlocks the inode and wakes up the waiting tasks.
 > 
 > If ->read_inode and ->read_inode2 are NULL as you suggest for NTFS it means 
 > that as soon as ntfs_iget() has called iget4() there will be an 
 > uninitialized yet unlocked inode in memory which is guaranteed to cause 
 > NTFS to oops... (And any other fs using this approach.)

I see. While this could be worked around by adding a flag set after inode
initialization, it would quickly become ugly.

 > 
 > Before the inode is unlocked it MUST be initialized. And the only way
 > to do this in the framework of the current VFS is to use ->read_inode
 > or ->read_inode2.
 > 
 > Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?
 > 

->read_inode2 is a hack, and especially so is having both ->read_inode
and ->read_inode2. The iget() interface was based on the assumption that
inodes can be located (and identified) by inode number. That is not the
case, at least for reiserfs, and ->read_inode2 works around this by
passing a "cookie" with information sufficient for the file system to
locate the inode.

I am concerned that (ab)using this cookie and ->read_inode2 to bypass the
rigid iget() is not the right way to go. What about the VFS exporting a
function that checks the hash table, creates a new inode if it is not
there, and returns it still locked? This way each file system would be
able to locate and load inodes in a way it likes, without
encoding/decoding information in the cookie.
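A userspace sketch of the interface proposed above (all names are invented for illustration, this is not an existing VFS API): the lookup either returns a ready inode or a new, still-locked one; the file system initializes the new inode however it likes and only then unlocks it, so a half-initialized inode is never visible unlocked.

```c
#include <assert.h>

/* Toy states standing in for the kernel's inode locking. */
enum istate { I_FREE, I_LOCKED_NEW, I_READY };

struct sim_inode { unsigned long key; enum istate state; int fs_data; };

static struct sim_inode table[8];
static int used;

/* Check the hash table; create a new, still-locked inode if not there.
 * *is_new tells the caller it must initialize and call unlock_new(). */
static struct sim_inode *get_locked_inode(unsigned long key, int *is_new)
{
    for (int i = 0; i < used; i++)
        if (table[i].key == key) { *is_new = 0; return &table[i]; }
    struct sim_inode *ni = &table[used++];
    ni->key = key;
    ni->state = I_LOCKED_NEW;
    *is_new = 1;
    return ni;
}

/* Called by the file system after its own initialization is complete. */
static void unlock_new(struct sim_inode *inode)
{
    inode->state = I_READY;
}
```

The point of the design is that the fs-specific loading step happens between the lookup and the unlock, with no cookie decoding inside the VFS.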

 > Best regards,
 > 
 > Anton
 > 

Nikita.

 > 
 > -- 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-29 11:59                           ` Nikita Danilov
@ 2002-04-29 12:34                             ` Anton Altaparmakov
  2002-04-29 13:01                               ` Christoph Hellwig
  2002-04-30 17:19                             ` Denis Vlasenko
  1 sibling, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-04-29 12:34 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: viro, Jan Harkes, linux-kernel

At 12:59 29/04/02, Nikita Danilov wrote:
>[snip]
>  > >Please note that ->read_inode2() is a reiserfs-specific hack. Adding more
>  > >users for it would make it permanent. The only reason for ->read_inode2's
>  > >existence was that iget() was called by code external to the file system;
>  > >knfsd used to do this, but now it can call ->fh_to_dentry() instead. As
>  > >iget() is never called from outside the file system, you can set the ntfs
>  > >->read_inode to a no-op and write your own function ntfs_iget(...) to be
>  > >called from ntfs_lookup() and ntfs_fill_super().
>  > >
>  > >ntfs_iget() calls iget() (->read_inode is a no-op, hence iget doesn't
>  > >access the disk) and, if a new inode was allocated, reads data from the
>  > >disk, initializes the inode, etc.
>  > >
>  > >I guess coda_iget() is an example of this.
>  >
>  > This will not work AFAICS.
>  >
>  > coda_iget() -> iget4() -> get_new_inode(), which calls ->read_inode or
>  > ->read_inode2, and then unlocks the inode and wakes up the waiting tasks.
>  >
>  > If ->read_inode and ->read_inode2 are NULL as you suggest for NTFS it 
> means
>  > that as soon as ntfs_iget() has called iget4() there will be an
>  > uninitialized yet unlocked inode in memory which is guaranteed to cause
>  > NTFS to oops... (And any other fs using this approach.)
>
>I see. While this can be worked around by adding flag set up after inode
>initialization, this would become ugly shortly.
>
>  > Before the inode is unlocked it MUST be initialized. And the only way
>  > to do this in the framework of the current VFS is to use ->read_inode
>  > or ->read_inode2.
>  >
>  > Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?
>
>->read_inode2 is a hack.
>
>And especially so is having both ->read_inode and ->read_inode2. The 
>iget() interface was based on the assumption that inodes can be located 
>(and identified) by inode number. That is not the case, at least for 
>reiserfs, and ->read_inode2 works around this by passing a "cookie" with 
>information sufficient for the file system to locate the inode.
>
>I am concerned that (ab)using this cookie and ->read_inode2 to bypass the
>rigid iget() is not the right way to go.

Perhaps. But it works now, and it can easily be modified later if a better 
approach comes along.

>What about the VFS exporting a function that checks the hash table, 
>creates a new inode if it is not there, and returns it still locked? This 
>way each file system would be able to locate and load inodes in a way it 
>likes, without encoding/decoding information in the cookie.

Yes, that would work fine. But it would still need the iget4-like find 
actor and an opaque pointer to do the "checks hash table" part...

Basically one could just change iget4 into two functions: iget, calling 
fs->read_inode (with read_inode2 removed), and iget4, without 
->read_inode and ->read_inode2, returning a locked inode instead.

That would make it fs-specific.

If we wanted to make it generic we would need a special method in the 
operations, but wait, we already have read_inode2! So perhaps it isn't as 
much of a hack after all...

If you wanted to get rid of the hackish nature of the beast, one could just 
remove ->read_inode and rename ->read_inode2 to ->read_inode. Then change 
all file systems to accept two dummy parameters in their ->read_inode 
declaration...

If that would be an acceptable approach I would be happy to do a patch to 
convert all in kernel file systems in 2.5.x.

Best regards,

Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-29 12:34                             ` Anton Altaparmakov
@ 2002-04-29 13:01                               ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2002-04-29 13:01 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Nikita Danilov, Jan Harkes, linux-kernel

On Mon, Apr 29, 2002 at 01:34:02PM +0100, Anton Altaparmakov wrote:
> Basically one could just change iget4 into two functions: iget calling 
> fs->read_inode (with read_inode2 removed) and iget4 without the 
> ->read_inode and ->read_inode2 and returning a locked inode instead.
> 
> That would make it fs specific.
> 
> If we wanted to make it generic we do need a special method in the 
> operations, but wait, we already have read_inode2! So perhaps it isn't as 
> much of a hack after all...
> 
> If you wanted to get rid of the hackish nature of the beast, one could just 
> remove ->read_inode and rename ->read_inode2 to ->read_inode. Then change 
> all fs to accept two dummy parameters in their ->read_inode declaration...
> 
> If that would be an acceptable approach I would be happy to do a patch to 
> convert all in kernel file systems in 2.5.x.

Please take a look at icreate & surrounding code in the 2.5 XFS tree.
The code needs some cleanup, but I think the API is exactly what we want.

	Christoph


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-30 17:19                             ` Denis Vlasenko
@ 2002-04-30 13:15                               ` john slee
  2002-04-30 13:24                                 ` Billy O'Connor
                                                   ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: john slee @ 2002-04-30 13:15 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

[ cc list trimmed ]

On Tue, Apr 30, 2002 at 03:19:17PM -0200, Denis Vlasenko wrote:
> Why do we have to stick to the concept of inode *numbers*?
> Because there are inode numbers in traditional Unix filesystems?

probably because there is software out there relying on them being
numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
as appropriate.  i can't think of a use for such a construct other than
preserving hardlinks in archives (does tar do this?) but i'm sure there
are others

like much of unix it's been there forever and has become such a natural
concept in people's heads that to change it now seems unthinkable.

much like the missing e in creat().

j.

-- 
R N G G   "Well, there it goes again... And we just sit 
 I G G G   here without opposable thumbs." -- gary larson

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-30 13:15                               ` john slee
@ 2002-04-30 13:24                                 ` Billy O'Connor
  2002-04-30 13:36                                   ` jlnance
  2002-04-30 13:40                                 ` Keith Owens
  2002-04-30 16:12                                 ` Peter Wächtler
  2 siblings, 1 reply; 48+ messages in thread
From: Billy O'Connor @ 2002-04-30 13:24 UTC (permalink / raw)
  To: indigoid; +Cc: vda, linux-kernel


   On Tue, Apr 30, 2002 at 03:19:17PM -0200, Denis Vlasenko wrote:
   > Why do we have to stick to the concept of inode *numbers*?
   > Because there are inode numbers in traditional Unix filesystems?

   like much of unix it's been there forever and has become such a natural
   concept in people's heads that to change it now seems unthinkable.

   much like the missing e in creat().

Wasn't that the one thing Ken Thompson said he would do differently
if he had it to do all over (unix)?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-30 13:24                                 ` Billy O'Connor
@ 2002-04-30 13:36                                   ` jlnance
  0 siblings, 0 replies; 48+ messages in thread
From: jlnance @ 2002-04-30 13:36 UTC (permalink / raw)
  To: linux-kernel

On Tue, Apr 30, 2002 at 08:24:26AM -0500, Billy O'Connor wrote:

>    much like the missing e in creat().
> 
> Wasn't that the one thing Ken Thompson said he would do differently
> if he had it to do all over (unix)?

Actually that was the only thing he said he would do differently, which
puts a little different spin on that statement :-)

Jim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-30 13:15                               ` john slee
  2002-04-30 13:24                                 ` Billy O'Connor
@ 2002-04-30 13:40                                 ` Keith Owens
  2002-05-01 19:18                                   ` Denis Vlasenko
  2002-04-30 16:12                                 ` Peter Wächtler
  2 siblings, 1 reply; 48+ messages in thread
From: Keith Owens @ 2002-04-30 13:40 UTC (permalink / raw)
  To: linux-kernel

On Tue, 30 Apr 2002 23:15:23 +1000, 
john slee <indigoid@higherplane.net> wrote:
>probably because there is software out there relying on them being
>numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
>as appropriate.  i can't think of a use for such a construct other than
>preserving hardlinks in archives (does tar do this?) but i'm sure there
>are others

Any program that tries to preserve or detect hard links.  cp, mv (files
a and b are the same file).  tar, cpio, rsync -H, du, etc.

The assumption that inode numbers are unique within a mount point is
one of the reasons that NFS export does not cross mount points by
default.  man exports, look for 'nohide'.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-30 13:15                               ` john slee
  2002-04-30 13:24                                 ` Billy O'Connor
  2002-04-30 13:40                                 ` Keith Owens
@ 2002-04-30 16:12                                 ` Peter Wächtler
  2 siblings, 0 replies; 48+ messages in thread
From: Peter Wächtler @ 2002-04-30 16:12 UTC (permalink / raw)
  To: john slee; +Cc: Denis Vlasenko, linux-kernel

john slee wrote:
> [ cc list trimmed ]
> 
> On Tue, Apr 30, 2002 at 03:19:17PM -0200, Denis Vlasenko wrote:
> 
>>Why do we have to stick to the concept of inode *numbers*?
>>Because there are inode numbers in traditional Unix filesystems?
>>
> 
> probably because there is software out there relying on them being
> numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
> as appropriate.  i can't think of a use for such a construct other than
> preserving hardlinks in archives (does tar do this?) but i'm sure there
> are others
> 
> like much of unix it's been there forever and has become such a natural
> concept in people's heads that to change it now seems unthinkable.
> 
> much like the missing e in creat().
> 

No. Not supplying inode numbers would break unix semantics.
The kernel (and binary loader) depends on a unique key:

major:minor device number + inode number


Otherwise: how would you decide whether a shared object is the same one?
Checksumming? ;-)

But what would be different with character strings? Apart from more complexity?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-29 11:59                           ` Nikita Danilov
  2002-04-29 12:34                             ` Anton Altaparmakov
@ 2002-04-30 17:19                             ` Denis Vlasenko
  2002-04-30 13:15                               ` john slee
  1 sibling, 1 reply; 48+ messages in thread
From: Denis Vlasenko @ 2002-04-30 17:19 UTC (permalink / raw)
  To: Nikita Danilov, Anton Altaparmakov; +Cc: viro, Jan Harkes, linux-kernel

On 29 April 2002 09:59, Nikita Danilov wrote:
> Anton Altaparmakov writes:
>  > Al, would you agree with NTFS using ->read_inode2 as well as ReiserFS?
>
> ->read_inode2 is a hack, and especially so is having both ->read_inode
> and ->read_inode2. The iget() interface was based on the assumption that
> inodes can be located (and identified) by inode number. That is not the
> case, at least for reiserfs, and ->read_inode2 works around this by
> passing a "cookie" with information sufficient for the file system to
> locate the inode.

Why do we have to stick to the concept of inode *numbers*?
Because there are inode numbers in traditional Unix filesystems?

What about reiserfs? NTFS? Even plain old FAT has trouble simulating
inode numbers for zero-length files.

Why? Because inode numbers (or the lack of them) are an fs implementation 
detail which unfortunately leaked into the Linux VFS API.

Or maybe I am just stupid.
--
vda

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [prepatch] address_space-based writeback
  2002-04-30 13:40                                 ` Keith Owens
@ 2002-05-01 19:18                                   ` Denis Vlasenko
  2002-05-02  8:49                                     ` Anton Altaparmakov
                                                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Denis Vlasenko @ 2002-05-01 19:18 UTC (permalink / raw)
  To: Keith Owens, linux-kernel

On 30 April 2002 11:40, Keith Owens wrote:
> On Tue, 30 Apr 2002 23:15:23 +1000,
> john slee <indigoid@higherplane.net> wrote:
> >probably because there is software out there relying on them being
> >numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
> >as appropriate.  i can't think of a use for such a construct other than
> >preserving hardlinks in archives (does tar do this?) but i'm sure there
> >are others
>
> Any program that tries to preserve or detect hard links.  cp, mv (files
> a and b are the same file).  tar, cpio, rsync -H, du, etc.
>
> The assumption that inode numbers are unique within a mount point is
> one of the reasons that NFS export does not cross mount points by
> default.  man exports, look for 'nohide'.

And I recently moved my /usr/src to a separate partition.
That is, /usr/src is now a mount point.
I have to export it in the NFS exports *and* mount it *on every workstation*
(potentially thousands of workstations!).

I'll repeat myself: what if some advanced fs has no sensible way of 
generating inode numbers? Does it have to 'fake' them, just like [v]fat does now?
(Yes, vfat is not an 'advanced' fs, let's not discuss it...)

The fact that minix, ext[23], etc. have inode #s is an *implementation detail*,
historically entrenched in Unix.

Bad:
inum_a = inode_num(file1);
inum_b = inode_num(file2);
if(inum_a == inum_b) { same_file(); }

Better:
if(is_hardlinked(file1,file2)) { same_file(); }

Yes, new syscall, blah, blah, blah... Not worth the effort, etc...
let's start a flamewar...
--
vda


* Re: [prepatch] address_space-based writeback
  2002-05-01 19:18                                   ` Denis Vlasenko
@ 2002-05-02  8:49                                     ` Anton Altaparmakov
  2002-05-03 15:35                                       ` Denis Vlasenko
  2002-05-03  7:56                                     ` Pavel Machek
  2002-05-03 14:48                                     ` Rob Landley
  2 siblings, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-05-02  8:49 UTC (permalink / raw)
  To: vda; +Cc: Keith Owens, linux-kernel

At 20:18 01/05/02, Denis Vlasenko wrote:
>On 30 April 2002 11:40, Keith Owens wrote:
> > On Tue, 30 Apr 2002 23:15:23 +1000,
> > john slee <indigoid@higherplane.net> wrote:
> > >probably because there is software out there relying on them being
> > >numbers and being able to do 'if(inum_a == inum_b) { same_file(); }'
> > >as appropriate.  i can't think of a use for such a construct other than
> > >preserving hardlinks in archives (does tar do this?) but i'm sure there
> > >are others
> >
> > Any program that tries to preserve or detect hard links.  cp, mv (files
> > a and b are the same file).  tar, cpio, rsync -H, du, etc.
> >
> > The assumption that inode numbers are unique within a mount point is
> > one of the reasons that NFS export does not cross mount points by
> > default.  man exports, look for 'nohide'.
>
>And I recently moved my /usr/src to separate partition.
>That is, /usr/src is now a mount point.
>I have to export it in NFS exports *and* mount it *on every workstation*
>(potentially thousands of wks!).

Yes, edit /etc/fstab. My file server has loads of partitions and it exports 
them all and /etc/fstab on all clients just mounts them all. Problem being?

>I'll repeat myself. What if some advanced fs has no sensible way of
>generating inode? Does it have to 'fake' it, just like [v]fat does it now?
>(Yes, vfat is not 'advanced' fs, let's not discuss it...)

They have to fake it, yes. Otherwise all existing userspace utilities will
break. And no, they cannot be changed, otherwise they would no longer work
on non-Linux platforms, and most utilities are UNIX utilities which work on
everything including Linux. You don't want to break that.

>The fact that minix,ext[23],etc has inode #s is an *implementation detail*.
>Historically entrenched in Unix.
>
>Bad:
>inum_a = inode_num(file1);
>inum_b = inode_num(file2);
>if(inum_a == inum_b) { same_file(); }
>
>Better:
>if(is_hardlinked(file1,file2)) { same_file(); }
>
>Yes, new syscall, blah, blah, blah... Not worth the effort, etc...
>let's start a flamewar...

That would break UNIX semantics. Which it seems is exactly what you want to 
do... I don't think you will find many supporters of that idea... As Linus 
pointed out to me the inode is the basic i/o entity in UNIX and hence 
Linux. And that is not going to change...

Best regards,

Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [prepatch] address_space-based writeback
  2002-05-01 19:18                                   ` Denis Vlasenko
  2002-05-02  8:49                                     ` Anton Altaparmakov
@ 2002-05-03  7:56                                     ` Pavel Machek
  2002-05-03 14:48                                     ` Rob Landley
  2 siblings, 0 replies; 48+ messages in thread
From: Pavel Machek @ 2002-05-03  7:56 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Keith Owens, linux-kernel

Hi!

> I'll repeat myself. What if some advanced fs has no sensible way of 
> generating inode? Does it have to 'fake' it, just like [v]fat does it now?
> (Yes, vfat is not 'advanced' fs, let's not discuss it...)
> 
> The fact that minix,ext[23],etc has inode #s is an *implementation detail*.
> Historically entrenched in Unix.
> 
> Bad:
> inum_a = inode_num(file1);
> inum_b = inode_num(file2);
> if(inum_a == inum_b) { same_file(); }
> 
> Better:
> if(is_hardlinked(file1,file2)) { same_file(); }
> 
> Yes, new syscall, blah, blah, blah... Not worth the effort, etc...
> let's start a flamewar...

It's worse: you have 1000 files with the same size; how do you find
which are hardlinked? With inode_num() it is a hashtable, doable in
O(n). With the syscall we are talking O(n^2).
									Pavel
-- 
(about SSSCA) "I don't say this lightly.  However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa


* Re: [prepatch] address_space-based writeback
  2002-05-03 15:35                                       ` Denis Vlasenko
@ 2002-05-03 12:49                                         ` Helge Hafting
  2002-05-03 22:47                                           ` Denis Vlasenko
  0 siblings, 1 reply; 48+ messages in thread
From: Helge Hafting @ 2002-05-03 12:49 UTC (permalink / raw)
  To: vda, linux-kernel

Denis Vlasenko wrote:
> 
> On 2 May 2002 06:49, Anton Altaparmakov wrote:
> > >And I recently moved my /usr/src to separate partition.
> > >That is, /usr/src is now a mount point.
> > >I have to export it in NFS exports *and* mount it *on every workstation*
> > >(potentially thousands of wks!).
> >
> > Yes, edit /etc/fstab. My file server has loads of partitions and it exports
> > them all and /etc/fstab on all clients just mounts them all. Problem being?
> 
> Problem is that I have to modify /etc/fstab on every workstation.
> 
So _automate_ that then.  If you have so many workstations, make
a program/script that fixes /etc/fstab.  Perhaps as simple as appending
a new line with the new fs to mount.  Put the program on some fs
already mounted on the clients, then ssh to each and run it.
The ssh part may be automated too, of course.

[...]
> It seems to me like the Bad Thing which is too old and traditional to change.
> :-(

Most ways have their own disadvantages.  Can you invent a better concept
than the inode that works as well in every existing way, and better for
this case?  Your new syscall isn't it, as Pavel Machek demonstrated.

Changing unix is doable _if_ you can show a significant benefit.
The more utilities you want to break, the more benefit you need to show.
I don't think you can send the inode to the land of 
"8-char limited passwords" by pushing "simpler management of fstabs"
though.

Helge Hafting


* Re: [prepatch] address_space-based writeback
  2002-05-01 19:18                                   ` Denis Vlasenko
  2002-05-02  8:49                                     ` Anton Altaparmakov
  2002-05-03  7:56                                     ` Pavel Machek
@ 2002-05-03 14:48                                     ` Rob Landley
  2002-05-05  0:42                                       ` Denis Vlasenko
  2 siblings, 1 reply; 48+ messages in thread
From: Rob Landley @ 2002-05-03 14:48 UTC (permalink / raw)
  To: vda, Keith Owens, linux-kernel

On Wednesday 01 May 2002 03:18 pm, Denis Vlasenko wrote:

> The fact that minix,ext[23],etc has inode #s is an *implementation detail*.
> Historically entrenched in Unix.
>
> Bad:
> inum_a = inode_num(file1);
> inum_b = inode_num(file2);
> if(inum_a == inum_b) { same_file(); }
>
> Better:
> if(is_hardlinked(file1,file2)) { same_file(); }
>
> Yes, new syscall, blah, blah, blah... Not worth the effort, etc...
> let's start a flamewar...

If I'm backing up a million files off of a big server, I don't want an 
enormous loop checking each and every one of them against each and every 
other one of them via some system call (potentially through the network) to 
go looking for dupes.  I want some kind of index I can hash against on MY 
side of the wire to go "Have I seen this guy before?".

That's EXACTLY what an inode is: a unique index for each file that can be 
compared to see if two directory entries refer to the same actual file.  
(Anything ELSE an inode is is an implementation detail, sure.)

These kinds of numeric identifiers show up all over the place: process
IDs, user IDs, filehandles...  It's not an implementation detail, it's a
sane API.

Having them be persistent across reboots is only really needed for network 
exported filesystems (things like "tar" don't care).  In theory, the clients 
could be informed of server reboots and resync when necessary (about like 
samba does).  Of course there's a certain three-letter network server 
(originally from another three letter word) that tries to maintain no state 
whatsoever about its clients, when the entire JOB of a filesystem is 
basically to maintain persistent state...

But we won't go there.  And calculating whatever the heck your unique hash is 
entirely from your persistent data, in a reproducible way, generally isn't 
brain surgery... 

Rob


* Re: [prepatch] address_space-based writeback
  2002-05-02  8:49                                     ` Anton Altaparmakov
@ 2002-05-03 15:35                                       ` Denis Vlasenko
  2002-05-03 12:49                                         ` Helge Hafting
  0 siblings, 1 reply; 48+ messages in thread
From: Denis Vlasenko @ 2002-05-03 15:35 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-kernel

On 2 May 2002 06:49, Anton Altaparmakov wrote:
> >And I recently moved my /usr/src to separate partition.
> >That is, /usr/src is now a mount point.
> >I have to export it in NFS exports *and* mount it *on every workstation*
> >(potentially thousands of wks!).
>
> Yes, edit /etc/fstab. My file server has loads of partitions and it exports
> them all and /etc/fstab on all clients just mounts them all. Problem being?

Problem is that I have to modify /etc/fstab on every workstation.

> >I'll repeat myself. What if some advanced fs has no sensible way of
> >generating inode? Does it have to 'fake' it, just like [v]fat does it now?
> >(Yes, vfat is not 'advanced' fs, let's not discuss it...)
>
> They have to fake it yes. Otherwise all existing userspace utilities will
> break. And no they cannot be changed otherwise they would no longer work
> on non-Linux platforms and most utilities are UNIX utilities which work on
> everything including Linux. You don't want to break that.

It seems to me like a Bad Thing which is too old and traditional to change.
:-(

> That would break UNIX semantics. Which it seems is exactly what you want to
> do... I don't think you will find many supporters of that idea... As Linus
> pointed out to me the inode is the basic i/o entity in UNIX and hence
> Linux. And that is not going to change...

Yes, it isn't going to change.
--
vda


* Re: [prepatch] address_space-based writeback
  2002-05-03 22:47                                           ` Denis Vlasenko
@ 2002-05-03 21:50                                             ` Anton Altaparmakov
  2002-05-05  0:46                                               ` Denis Vlasenko
  0 siblings, 1 reply; 48+ messages in thread
From: Anton Altaparmakov @ 2002-05-03 21:50 UTC (permalink / raw)
  To: vda; +Cc: Helge Hafting, linux-kernel

At 23:47 03/05/02, Denis Vlasenko wrote:
>On 3 May 2002 10:49, Helge Hafting wrote:
> > Changing unix is doable _if_ you can show a significant benefit.
> > The more utilities you want to break, the more benefit you need to show.
> > I don't think you can send the inode to the land of
> > "8-char limited passwords" by pushing "simpler management of fstabs"
> > though.
>
>I'm afraid I can't present benefits big enough.
>
>I was thinking of fs driver (NFS,reiser,NTFS,FAT,...) developers'
>pain, not about my /etc/fstab editing.

NTFS has native inode numbers which are persistent across reboots, so this
is a non-issue. The only thing is that inode numbers on NTFS are 64-bit and
not 32-bit, but that is as much a user space issue as a kernel issue...

Best regards,

         Anton


-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/



* Re: [prepatch] address_space-based writeback
  2002-05-03 12:49                                         ` Helge Hafting
@ 2002-05-03 22:47                                           ` Denis Vlasenko
  2002-05-03 21:50                                             ` Anton Altaparmakov
  0 siblings, 1 reply; 48+ messages in thread
From: Denis Vlasenko @ 2002-05-03 22:47 UTC (permalink / raw)
  To: Helge Hafting, linux-kernel

On 3 May 2002 10:49, Helge Hafting wrote:
> > > Yes, edit /etc/fstab. My file server has loads of partitions and it
> > > exports them all and /etc/fstab on all clients just mounts them all.
> > > Problem being?
> >
> > Problem is that I have to modify /etc/fstab on every workstation.
>
> So _automate_ that then.  If you have so many workstations, make...

Yes, I can do that easily. I meant that it is somewhat silly that clients
have to be tweaked when a normal directory on the server becomes a mount
point. It should be invisible to the client.

(Before we start: I know about nohide)

> > It seems to me like the Bad Thing which is too old and traditional to
> > change. :-(
>
> Most ways have their own disadvantages.  Can you invent a better concept
> than the inode that works as well in every existing way, and better for
> this case?  Your new syscall isn't it, as Pavel Machek demonstrated.

Pavel presented a corner case (tarring up thousands of files, all with
exactly the *same size*). It's like constructing pathological cases for
VM behavior. I don't take that seriously, sorry. Another example?

> Changing unix is doable _if_ you can show a significant benefit.
> The more utilities you want to break, the more benefit you need to show.
> I don't think you can send the inode to the land of
> "8-char limited passwords" by pushing "simpler management of fstabs"
> though.

I'm afraid I can't present benefits big enough.

I was thinking of fs driver (NFS,reiser,NTFS,FAT,...) developers'
pain, not about my /etc/fstab editing.
--
vda


* Re: [prepatch] address_space-based writeback
  2002-05-03 14:48                                     ` Rob Landley
@ 2002-05-05  0:42                                       ` Denis Vlasenko
  0 siblings, 0 replies; 48+ messages in thread
From: Denis Vlasenko @ 2002-05-05  0:42 UTC (permalink / raw)
  To: Rob Landley, Keith Owens, linux-kernel

On 3 May 2002 12:48, Rob Landley wrote:
> > The fact that minix,ext[23],etc has inode #s is an *implementation
> > detail*. Historically entrenched in Unix.
> > Bad:
> > inum_a = inode_num(file1);
> > inum_b = inode_num(file2);
> > if(inum_a == inum_b) { same_file(); }
> > Better:
> > if(is_hardlinked(file1,file2)) { same_file(); }
> >
> > Yes, new syscall, blah, blah, blah... Not worth the effort, etc...
> > let's start a flamewar...
>
> If I'm backing up a million files off of a big server, I don't want an
> enormous loop checking each and every one of them against each and every
> other one of them via some system call (potentially through the network) to
> go looking for dupes.  I want some kind of index I can hash against on MY
> side of the wire to go "Have I seen this guy before?".

You can check pairs of files with the same size, mode, etc. and with a
hardlink count > 1. That will dramatically reduce the number of
is_hardlinked() calls (unless you construct a pathological case of
1000000 hardlinks to a single file).

> That's EXACTLY what an inode is: a unique index for each file that can be
> compared to see if two directory entries refer to the same actual file.
> (Anything ELSE an inode is is an implementation detail, sure.)
>
> These kind of numeric identifiers show up all over the place.  Process ids,
> user ids, filehandles...  It's not an implementation detail, it's a sane
> API.
[snip]

I agree. The API is not insane, but I wouldn't call it wonderful either.
--
vda


* Re: [prepatch] address_space-based writeback
  2002-05-03 21:50                                             ` Anton Altaparmakov
@ 2002-05-05  0:46                                               ` Denis Vlasenko
  0 siblings, 0 replies; 48+ messages in thread
From: Denis Vlasenko @ 2002-05-05  0:46 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Helge Hafting, linux-kernel

On 3 May 2002 19:50, Anton Altaparmakov wrote:
> > > Changing unix is doable _if_ you can show a significant benefit.
> > > The more utilities you want to break, the more benefit you need to
> > > show. I don't think you can send the inode to the land of
> > > "8-char limited passwords" by pushing "simpler management of fstabs"
> > > though.
> >
> >I'm afraid I can't present benefits big enough.
> >
> >I was thinking of fs driver (NFS,reiser,NTFS,FAT,...) developers'
> >pain, not about my /etc/fstab editing.
>
> NTFS has native inode numbers which are persistent across reboot so this is
> a non-issue. The only thing is that inode numbers on ntfs are 64 bit and
> not 32 bit but that is much a user space issue as a kernel issue...

Sure, it is fixable; we can slowly drift to 64-bit inodes in libc.

OTOH, why do I have this subtle feeling that there is (or will be) a
SuperHyperDuperFS with 128-bit inodes?

That is one reason why I don't like inode numbers.
--
vda


end of thread, other threads:[~2002-05-04 19:44 UTC | newest]

Thread overview: 48+ messages
2002-04-10 11:21 [prepatch] address_space-based writeback Andrew Morton
2002-04-10 11:34 ` Alexander Viro
2002-04-10 19:16   ` Andrew Morton
2002-04-10 20:53     ` Alexander Viro
2002-04-10 22:12     ` Jan Harkes
2002-04-10 21:44       ` Andrew Morton
2002-04-10 22:56         ` Anton Altaparmakov
2002-04-10 22:31           ` Andrew Morton
2002-04-11 20:20           ` Linus Torvalds
2002-04-11 20:41             ` Alexander Viro
2002-04-11 21:27               ` Andrew Morton
2002-04-11 22:55                 ` Andreas Dilger
2002-04-11 22:49                   ` Andrew Morton
2002-04-12  0:12                     ` Linus Torvalds
2002-04-11 23:10                   ` Christoph Hellwig
2002-04-11 23:22                 ` Anton Altaparmakov
2002-04-11 23:03                   ` Andrew Morton
2002-04-12  4:19                   ` Bill Davidsen
2002-04-12  1:15             ` Anton Altaparmakov
2002-04-12  1:37               ` Linus Torvalds
2002-04-12  7:57                 ` Anton Altaparmakov
2002-04-27 15:53                   ` Jan Harkes
2002-04-28  3:03                     ` Anton Altaparmakov
2002-04-29  9:03                       ` Nikita Danilov
2002-04-29 11:11                         ` Anton Altaparmakov
2002-04-29 11:59                           ` Nikita Danilov
2002-04-29 12:34                             ` Anton Altaparmakov
2002-04-29 13:01                               ` Christoph Hellwig
2002-04-30 17:19                             ` Denis Vlasenko
2002-04-30 13:15                               ` john slee
2002-04-30 13:24                                 ` Billy O'Connor
2002-04-30 13:36                                   ` jlnance
2002-04-30 13:40                                 ` Keith Owens
2002-05-01 19:18                                   ` Denis Vlasenko
2002-05-02  8:49                                     ` Anton Altaparmakov
2002-05-03 15:35                                       ` Denis Vlasenko
2002-05-03 12:49                                         ` Helge Hafting
2002-05-03 22:47                                           ` Denis Vlasenko
2002-05-03 21:50                                             ` Anton Altaparmakov
2002-05-05  0:46                                               ` Denis Vlasenko
2002-05-03  7:56                                     ` Pavel Machek
2002-05-03 14:48                                     ` Rob Landley
2002-05-05  0:42                                       ` Denis Vlasenko
2002-04-30 16:12                                 ` Peter Wächtler
2002-04-10 23:02         ` Jan Harkes
2002-04-10 19:29 ` Jeremy Jackson
2002-04-10 19:41   ` Andrew Morton
2002-04-15  8:47 ` Andrew Morton
