linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Locality of extent status tree traversal
@ 2019-05-07 16:12 Probir Roy
  2019-05-07 17:59 ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: Probir Roy @ 2019-05-07 16:12 UTC (permalink / raw)
  To: linux-ext4

Hi,

I am running Phoronix-fio benchmark on Linux kernel 4.18.0-rc5. I
observe the same nodes have been traversed on the extent status tree
in "ext4_es_lookup_extent" function when ext4 write begins. What's the
locality signature of "ext4_es_lookup_extent" in general? Is it
possible that same logical block being looked up repeatedly
(Temporal)? Is it possible that co-located logical blocks are searched
by ext4_es_lookup_extent (spatial)?  Or is it totally random?

-- 
Probir Roy

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Locality of extent status tree traversal
  2019-05-07 16:12 Locality of extent status tree traversal Probir Roy
@ 2019-05-07 17:59 ` Theodore Ts'o
  2019-05-07 18:56   ` Probir Roy
  0 siblings, 1 reply; 4+ messages in thread
From: Theodore Ts'o @ 2019-05-07 17:59 UTC (permalink / raw)
  To: Probir Roy; +Cc: linux-ext4

On Tue, May 07, 2019 at 11:12:07AM -0500, Probir Roy wrote:
> 
> I am running Phoronix-fio benchmark on Linux kernel 4.18.0-rc5. I
> observe the same nodes have been traversed on the extent status tree
> in "ext4_es_lookup_extent" function when ext4 write begins. What's the
> locality signature of "ext4_es_lookup_extent" in general? Is it
> possible that same logical block being looked up repeatedly
> (Temporal)? Is it possible that co-located logical blocks are searched
> by ext4_es_lookup_extent (spatial)?  Or is it totally random?

I'm not sure what you are asking.  The ext4_es_lookup_extent() is used
as a fast map of an inode's logical block number to find the physical
block number (e.g., the location on disk).  It's a cache; lookups are
fast, and is an in-memory lookup.  Well, it's a little more than a
cache, it also stores some information for delayed allocation buffered
writes.

If the workload is a random read or random write workload, then
accesses to look up logical to physical block maps will be random.  If
the workload is mostly a sequential read or sequential write access,
then the logical blocks looked up via ext4_es_lookup_extent() will
largely be sequential.

						- Ted

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Locality of extent status tree traversal
  2019-05-07 17:59 ` Theodore Ts'o
@ 2019-05-07 18:56   ` Probir Roy
  2019-05-07 21:40     ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: Probir Roy @ 2019-05-07 18:56 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

> block number (e.g., the location on disk).  It's a cache; lookups are
> fast, and is an in-memory lookup.  Well, it's a little more than a
> cache, it also stores some information for delayed allocation buffered
> writes.

For sequential access, it will traverse almost the same path of the
tree. How deep the extent status tree be in general? If the tree is
much deeper, the sequential accesses would have many repeated nodes
traversal on the tree for the lookup. Have you observed significant
bottleneck on "ext4_es_lookup_extent"? Can it be removed by caching
the parent node?

-- 
Probir Roy

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Locality of extent status tree traversal
  2019-05-07 18:56   ` Probir Roy
@ 2019-05-07 21:40     ` Theodore Ts'o
  0 siblings, 0 replies; 4+ messages in thread
From: Theodore Ts'o @ 2019-05-07 21:40 UTC (permalink / raw)
  To: Probir Roy; +Cc: linux-ext4

On Tue, May 07, 2019 at 01:56:27PM -0500, Probir Roy wrote:
> > block number (e.g., the location on disk).  It's a cache; lookups are
> > fast, and is an in-memory lookup.  Well, it's a little more than a
> > cache, it also stores some information for delayed allocation buffered
> > writes.
> 
> For sequential access, it will traverse almost the same path of the
> tree. How deep the extent status tree be in general? If the tree is
> much deeper, the sequential accesses would have many repeated nodes
> traversal on the tree for the lookup. Have you observed significant
> bottleneck on "ext4_es_lookup_extent"? Can it be removed by caching
> the parent node?

You do realize that the extent status tree is separate from the
on-disk ext4 extent tree, right?

The extent status tree is an in-memory cache, and it caches logical
extents.  Which is to say, the on-disk physical extents are limited
(for historical reasons) to 32,767 blocks in an on-disk extent entry.
If you have a contiguous range of 128,000 blocks, it will require 4
on-disk extents in the ext4 extent tree.

The extent status tree, being an in-memory data structure, will
collapse those 4 on-disk extents (assuming they are physically and
logically contiguous) into a single in-memory entry in the extent
status tree.  So the depth of the extent status tree very much depends
on how fragmented the data blocks are for the file in question.  If
the file is 100% contiguous, and pre-allocated in advance, it could be
12TB long, and it would only take a single entry in the extent status
tree.  If the file is very highly fragmented, then of course, the size
of the extent status tree in memory and the on-disk extent tree can be
quite large.

It's true that if the tree is very deep, then you might have to do
many traversals of the red-black tree.  But that's because file is
super fragmented.  If we didn't have the extent status cache, then
we'd have to read in portions of the on-disk extent tree.  That tree
has a larger fanout, so it's wider, as well as being less deep.  But
if you are talking about the number of memory accesses needed to
traverse the extent tree, it's going to be roughly the same as the
read-block tree.  In either case, it's going to be O(log N) of the number
of extents in the highly fragmented file.

So let's back up.  Why are you so concerned about potential
bottle-necks on the ext4_es_lookup_extent()?  What is your workload?
How badly fragmented is your file?  And have you considered taking
measures (such as preallocating the file using fallocate, and possibly
pre-initializing the file) to improve things?  If you are doing
something nasty such as a random write workload using buffered writes,
without preallocating the file, then yeah, things can get pretty
nasty.  But the problem isn't going to be in the extent status tree;
it's going to be in many positions as well.

						- Ted

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-05-07 21:40 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-07 16:12 Locality of extent status tree traversal Probir Roy
2019-05-07 17:59 ` Theodore Ts'o
2019-05-07 18:56   ` Probir Roy
2019-05-07 21:40     ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).