* e2fsck performance for large lost+found dirs
@ 2015-11-02 22:49 Andreas Dilger
  2015-11-02 23:42 ` Darrick J. Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Andreas Dilger @ 2015-11-02 22:49 UTC (permalink / raw)
  To: Theodore Ts'o, Darrick J. Wong; +Cc: linux-ext4


Running e2fsck on a filesystem with a large number of disconnected inodes
can take a long time to complete.  This can happen with Lustre, since there
may be hundreds of thousands of inodes in one directory, and extent
corruption can wipe out the whole directory (more on that in a separate
email).

Reattaching all of the disconnected inodes to lost+found is an O(n^2)
operation, which becomes very slow when there are millions of them.


It would be much more efficient to keep a cursor pointing to the last leaf
block in the directory in which an entry was inserted.  Since e2fsck is not
deleting entries from lost+found, and because the entry names are always
getting longer (due to scanning in increasing inode number order), there is
no value in searching earlier blocks again.  This would make lost+found
insertion O(1) and significantly improve e2fsck performance in this case.
It could be very fast, since there is only the inode bitmap to traverse and
the filenames are "#ino", so only leaf block allocation and writes are
needed.

For generic libext2fs usage (e.g. Darrick's FUSE interface) where entries
may be deleted from a directory, the cursor could be reset to the block of
any deleted entry, if it was at a lower offset.

Darrick, at one time I thought you had a patch to fix this behaviour, but
I couldn't find it.  Maybe your patch was related to a similar O(n^2) search
problem with block allocation?


Any thoughts about how to fix this?  I was originally thinking that I could
just cache this in a "file pointer", but no such thing exists in the
ext2fs_link() interface: only the directory inode number is passed, and
there is an additional level of indirection in that it calls the block
iterator with a callback to process each leaf block separately.  There is
also the problem that any cache would be local to ext2fs_link() and not
visible to ext2fs_unlink().

I was thinking of starting by having ext2fs_link() call
ext2fs_dir_iterate2() directly, just to avoid the first level of confusion
in this code.  That still doesn't allow passing a starting offset for the
iteration, however.

Next, cache the block number in the link_struct state and skip the leaf
block search if the block number < cursor, but that still requires
iterating over all the blocks just to skip to the last one.

Should I just call ext2fs_block_iterate2() in e2fsck_reconnect_file()
and keep the mechanics local to e2fsck?  I thought it might be a good
generic optimization, but the interfaces make this difficult.

Cheers, Andreas








* Re: e2fsck performance for large lost+found dirs
  2015-11-02 22:49 e2fsck performance for large lost+found dirs Andreas Dilger
@ 2015-11-02 23:42 ` Darrick J. Wong
  2015-11-03  9:02   ` Andreas Dilger
  0 siblings, 1 reply; 3+ messages in thread
From: Darrick J. Wong @ 2015-11-02 23:42 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Ts'o, linux-ext4

On Mon, Nov 02, 2015 at 03:49:38PM -0700, Andreas Dilger wrote:
> Running e2fsck on a filesystem with a large number of disconnected inodes
> can take a long time to complete.   This can happen with Lustre, since there
> may be hundreds of thousands of inodes in one directory, and extent corruption
> can wipe out the whole directory (more on that in a separate email).

Or the case I saw a few months ago where a client had a 35-million-entry
directory which was corrupt and had to be fixed... the reason for it being
that large was that it was not possible to give each department in this large
bureaucracy its own directory "for reasons of political expediency". :)

(XFS, but still...)

> It is a very lengthy O(n^2) operation to reattach all disconnected inodes
> to lost+found, which can take a long time if there are millions of them.
> 
> 
> It would be much more efficient to keep a cursor pointing to the last leaf
> block in the directory in which an entry was inserted. Since e2fsck is not
> deleting entries from lost+found, and because the entry names are always
> getting longer (due to scanning in increasing inode number order) there is
> no value to search earlier blocks again. This would make lost+found insertion
> O(1) and significantly improve e2fsck performance in this case. This could be
> very fast since there is only the inode bitmap to traverse, and filenames are
> "#ino" so only leaf block allocation and writes needed.

Agreed.

> For generic libext2fs usage (e.g. Darrick's FUSE interface) where entries
> may be deleted from a directory, the cursor could be reset to the block of
> any deleted entry, if it was at a lower offset.
> 
> Darrick, at one time I thought you had a patch to fix this behaviour, but
> I couldn't find it.  Maybe your patch was related to a similar O(n^2) search
> problem with block allocation?

Hmm... sorry, I don't recall ever doing that.

(Bad memory, however.  Too many patches.)

> Any thoughts about how to fix this?  I was originally thinking that I could
> just cache this into the "file pointer", but no such thing exists in the
> ext2fs_link() interface, only the directory inode number is passed, and it
> has an additional level of indirection in that it calls the block iterator
> with a callback to process each leaf block separately.  It also has the
> problem that any cache would be local to ext2fs_link() and not visible to
> ext2fs_unlink().
> 
> I was thinking to start with ext2fs_link() calling ext2fs_dir_iterate2()
> directly, just to avoid the first level of confusion in this code.  That
> still doesn't allow passing a starting offset for the iteration, however.
> 
> Next, cache the block number into the link_struct state and skip the leaf
> block searches if block number < cursor, but that still needs iterating
> over all the blocks to skip to the last one.
> 
> Should I just call ext2fs_block_iterate2() in e2fsck_reconnect_file()
> and keep the mechanics local to e2fsck?  I thought it might be a good
> generic optimization, but the interfaces make this difficult.

Uggghhh.... :)

So I started to look at ext2fs_link -> ext2fs_dir_iterate2 ->
ext2fs_block_iterate3, to see if there was a way to skip to the last block.
There isn't any such thing, currently.

How about this: create a new flags value for ext2fs_link which means "only
search the last block".  Weirdly, the flags argument at the moment seems to
provide the ftype entry for the directory, nothing more.  So this new flag
would have to be filtered out of flags before it gets used as ftype.

Next we create a similar "last block only" flag for dir_iterate2 so that we
can convey the append-only intent to the directory iterator.  It could then
take a shortcut: instead of calling the block iterator, it would call
ext2fs_bmap2() to find the pblk of the last directory block and invoke
ext2fs_process_dir_block() on that last block only.

This isn't /quite/ as efficient as simply storing a cursor, but I don't know
of a non-messy way for a client program to stuff state into a libext2 call.

<shrug>

--D

> 
> Cheers, Andreas
> 
> 
> 
> 
> 




* Re: e2fsck performance for large lost+found dirs
  2015-11-02 23:42 ` Darrick J. Wong
@ 2015-11-03  9:02   ` Andreas Dilger
  0 siblings, 0 replies; 3+ messages in thread
From: Andreas Dilger @ 2015-11-03  9:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-ext4


On Nov 2, 2015, at 4:42 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> On Mon, Nov 02, 2015 at 03:49:38PM -0700, Andreas Dilger wrote:
>> Running e2fsck on a filesystem with a large number of disconnected inodes
>> can take a long time to complete.   This can happen with Lustre, since
>> there may be hundreds of thousands of inodes in one directory, and extent
>> corruption can wipe out the whole directory (more on that in a separate
>> email).
> 
> Or the case I saw a few months ago where a client had a 35-million-entry
> directory which was corrupt and had to be fixed... the reason for it being
> that large was that it was not possible to give each department in this large
> bureaucracy its own directory "for reasons of political expediency". :)
> 
> (XFS, but still...)
> 
>> It is a very lengthy O(n^2) operation to reattach all disconnected inodes
>> to lost+found, which can take a long time if there are millions of them.
>> 
>> 
>> It would be much more efficient to keep a cursor pointing to the last leaf
>> block in the directory in which an entry was inserted. Since e2fsck is not
>> deleting entries from lost+found, and because the entry names are always
>> getting longer (due to scanning in increasing inode number order) there is
>> no value to search earlier blocks again. This would make lost+found insertion
>> O(1) and significantly improve e2fsck performance in this case. This could be
>> very fast since there is only the inode bitmap to traverse, and filenames are
>> "#ino" so only leaf block allocation and writes needed.
> 
> Agreed.
> 
>> For generic libext2fs usage (e.g. Darrick's FUSE interface) where entries
>> may be deleted from a directory, the cursor could be reset to the block of
>> any deleted entry, if it was at a lower offset.
>> 
>> Darrick, at one time I thought you had a patch to fix this behaviour, but
>> I couldn't find it.  Maybe your patch was related to a similar O(n^2) search
>> problem with block allocation?
> 
> Hmm... sorry, I don't recall ever doing that.
> 
> (Bad memory, however.  Too many patches.)
> 
>> Any thoughts about how to fix this?  I was originally thinking that I could
>> just cache this into the "file pointer", but no such thing exists in the
>> ext2fs_link() interface, only the directory inode number is passed, and it
>> has an additional level of indirection in that it calls the block iterator
>> with a callback to process each leaf block separately.  It also has the
>> problem that any cache would be local to ext2fs_link() and not visible to
>> ext2fs_unlink().
>> 
>> I was thinking to start with ext2fs_link() calling ext2fs_dir_iterate2()
>> directly, just to avoid the first level of confusion in this code.  That
>> still doesn't allow passing a starting offset for the iteration, however.
>> 
>> Next, cache the block number into the link_struct state and skip the leaf
>> block searches if block number < cursor, but that still needs iterating
>> over all the blocks to skip to the last one.
>> 
>> Should I just call ext2fs_block_iterate2() in e2fsck_reconnect_file()
>> and keep the mechanics local to e2fsck?  I thought it might be a good
>> generic optimization, but the interfaces make this difficult.
> 
> Uggghhh.... :)
> 
> So I started to look at ext2fs_link -> ext2fs_dir_iterate2 ->
> ext2fs_block_iterate3, to see if there was a way to skip to the last block.
> There isn't any such thing, currently.
> 
> How about this: create a new flags value for ext2fs_link which means "only
> search the last block".  Weirdly, the flags argument at the moment seems to
> provide the ftype entry for the directory, nothing more.  So this new flag
> would have to be filtered out of flags before it gets used as ftype.

Hmm, but I don't want to skip to the last block, but rather "skip to the
last block that had free space".  We have a tool to re-link Lustre OST
objects from lost+found back into their correct locations using information
stored in an xattr, which leaves behind a large empty directory (maybe
150MB in size).  I don't want inserts only at the end of that directory,
but rather starting at the next free block, scanning each block only once.

Unfortunately, that preferred solution needs some ability to share state
across calls to ext2fs_link(), and ideally also with ext2fs_unlink().

Cheers, Andreas
> Next we create a similar "last block only" flag for dir_iterate2 so that we
> can convey the append-only intent to the directory iterator, which could take
> a shortcut in which instead of calling the block iterator, it would instead
> ext2fs_bmap2() to find the pblk of the last directory block and call the
> function ext2fs_process_dir_block on that last block.
> 
> This isn't /quite/ as efficient as simply storing a cursor, but I don't know
> of a non-messy way for a client program to stuff state into a libext2 call.
> 
> <shrug>
> 
> --D
> 
>> 
>> Cheers, Andreas
>> 
>> 
>> 
>> 
>> 
> 
> 










