* Question about ext4 extents and file fragmentation
@ 2019-03-20 22:44 Mikhail Morfikov
  2019-03-21  3:18 ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Morfikov @ 2019-03-20 22:44 UTC (permalink / raw)
  To: linux-ext4


When we have a big file on an ext4 partition, and filefrag shows
the following:

filefrag -ve /bigfile
Filesystem type is: ef53
File size of /bigfile is 1439201280 (351368 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32767:      34816..     67583:  32768:            
   1:    32768..   63487:      67584..     98303:  30720:            
   2:    63488..   96255:     100352..    133119:  32768:      98304:
   3:    96256..  126975:     133120..    163839:  30720:            
   4:   126976..  159743:     165888..    198655:  32768:     163840:
   5:   159744..  190463:     198656..    229375:  30720:            
   6:   190464..  223231:     231424..    264191:  32768:     229376:
   7:   223232..  253951:     264192..    294911:  30720:            
   8:   253952..  286719:     296960..    329727:  32768:     294912:
   9:   286720..  319487:     329728..    362495:  32768:            
  10:   319488..  351367:     362496..    394375:  31880:             last,eof
/bigfile: 5 extents found

1. How many fragments does this file really have? 11 or 5? 
2. Should the extents 0 and 1 be treated as one fragment or two 
   separate ones? I know they could be one from the human 
   perspective, but is it really one for ext4 filesystem?
3. What does actually happen during the read in the case of 
   some HDD and its magnetic heads? If the head finishes reading 
   the whole extent (ext 0), will it be able to read the data of 
   the next extent (ext 1) without any delays like in the case of
   raw read (for instance dd if=/dev/sda ...), or will it be 
   delayed because of the filesystem layer, and the head will 
   have to spend some time to be positioned again in order to 
   read the next extent?


* Re: Question about ext4 extents and file fragmentation
  2019-03-20 22:44 Question about ext4 extents and file fragmentation Mikhail Morfikov
@ 2019-03-21  3:18 ` Theodore Ts'o
  2019-03-21  9:29   ` Mikhail Morfikov
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2019-03-21  3:18 UTC (permalink / raw)
  To: Mikhail Morfikov; +Cc: linux-ext4

On Wed, Mar 20, 2019 at 11:44:19PM +0100, Mikhail Morfikov wrote:
> When we have a big file on an ext4 partition, and filefrag shows
> the following:
> 
> filefrag -ve /bigfile
> Filesystem type is: ef53
> File size of /bigfile is 1439201280 (351368 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32767:      34816..     67583:  32768:            
>    1:    32768..   63487:      67584..     98303:  30720:            
>    2:    63488..   96255:     100352..    133119:  32768:      98304:
>    3:    96256..  126975:     133120..    163839:  30720:            
>    4:   126976..  159743:     165888..    198655:  32768:     163840:
>    5:   159744..  190463:     198656..    229375:  30720:            
>    6:   190464..  223231:     231424..    264191:  32768:     229376:
>    7:   223232..  253951:     264192..    294911:  30720:            
>    8:   253952..  286719:     296960..    329727:  32768:     294912:
>    9:   286720..  319487:     329728..    362495:  32768:            
>   10:   319488..  351367:     362496..    394375:  31880:             last,eof
> /bigfile: 5 extents found
> 
> 1. How many fragments does this file really have? 11 or 5? 
> 2. Should the extents 0 and 1 be treated as one fragment or two 
>    separate ones? I know they could be one from the human 
>    perspective, but is it really one for ext4 filesystem?

They are encoded as two separate physical extents on disk.  Logically,
extents 0, 1, and 2 are contiguous regions on disk.
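
One way to see where the "5 extents found" figure can come from is to
merge rows of the listing whose logical and physical ranges both
continue the previous row. The sketch below does that on the numbers
copied from the filefrag output; it is an illustration, not filefrag's
actual code, and the name merge_contiguous is made up here. Going by
the physical offsets, rows 0 and 1 join into one run, and row 2 starts
a new run 2048 blocks later (its "expected" column is 98304).

```python
# Rows from the filefrag listing: (logical_start, physical_start, length),
# all counted in 4096-byte blocks.
EXTENTS = [
    (0, 34816, 32768),
    (32768, 67584, 30720),
    (63488, 100352, 32768),
    (96256, 133120, 30720),
    (126976, 165888, 32768),
    (159744, 198656, 30720),
    (190464, 231424, 32768),
    (223232, 264192, 30720),
    (253952, 296960, 32768),
    (286720, 329728, 32768),
    (319488, 362496, 31880),
]


def merge_contiguous(extents):
    """Merge rows that continue the previous row both logically and physically."""
    runs = []
    for logical, physical, length in extents:
        if runs:
            run_logical, run_physical, run_length = runs[-1]
            if (run_logical + run_length == logical
                    and run_physical + run_length == physical):
                # Same run: just extend its length.
                runs[-1] = (run_logical, run_physical, run_length + length)
                continue
        runs.append((logical, physical, length))
    return runs


runs = merge_contiguous(EXTENTS)
for logical, physical, length in runs:
    print(f"logical {logical:>6}  physical {physical:>6}  length {length:>6}")
print(len(EXTENTS), "rows in the listing,", len(runs), "contiguous runs")
```

No row is longer than 32768 blocks because an initialized ext4 extent
cannot describe more than that, so a long contiguous region is always
split into several on-disk extents.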

> 3. What does actually happen during the read in the case of 
>    some HDD and its magnetic heads? If the head finishes reading 
>    the whole extent (ext 0), will it be able to read the data of 
>    the next extent (ext 1) without any delays like in the case of
>    raw read (for instance dd if=/dev/sda ...), or will it be 
>    delayed because of the filesystem layer, and the head will 
>    have to spend some time to be positioned again in order to 
>    read the next extent?

The delay won't be because of the file system layer, as the
information about these first three extents will all be stored on the
same block on disk.  In addition, ext4 has an in-memory "extent cache"
which stores the logical->physical block mapping, and in memory, it
will be stored as a single entry in the extent cache.
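
A rough capacity calculation shows why all of these extents can sit in
one metadata block. It is a sketch based on the ext4 on-disk structure
sizes (12-byte header, 12-byte extent entries, 60 bytes of i_block in
the inode, and a 4-byte checksum tail when metadata checksums are
enabled); the helper name entries_in is invented for the example.

```python
EXTENT_HEADER_BYTES = 12    # struct ext4_extent_header
EXTENT_ENTRY_BYTES = 12     # struct ext4_extent
INODE_I_BLOCK_BYTES = 60    # space in the inode that holds the tree root
TAIL_CHECKSUM_BYTES = 4     # struct ext4_extent_tail, if metadata_csum is on
BLOCK_SIZE = 4096           # block size reported by filefrag
MAX_EXTENT_BLOCKS = 32768   # largest initialized extent


def entries_in(nbytes, tail=0):
    """Number of 12-byte entries that fit after the header (and optional tail)."""
    return (nbytes - EXTENT_HEADER_BYTES - tail) // EXTENT_ENTRY_BYTES


in_inode = entries_in(INODE_I_BLOCK_BYTES)
per_leaf = entries_in(BLOCK_SIZE, TAIL_CHECKSUM_BYTES)
gib_per_leaf = per_leaf * MAX_EXTENT_BLOCKS * BLOCK_SIZE / 2**30

print("entries that fit in the inode:", in_inode)
print("entries that fit in one 4 KiB leaf block:", per_leaf)
print("data one full leaf can map (GiB):", gib_per_leaf)
print("11 extents fit in one leaf:", 11 <= per_leaf)
```

So a file with 11 extents is too big for the in-inode root, but its
extents can all fit in a single leaf block.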

It takes *time* to read 128 megabytes (32768 4k blocks), and from a
hard drive perspective, you are doing a streaming sequential read; how
the file system metadata is stored is not going to be the limiting
factor.  In fact, it's likely that the reads won't be issued to the hard
drive as a single I/O request anyway.  But that doesn't matter; the
hard drive has an I/O request queue, and so the right thing will
happen.

					- Ted

* Re: Question about ext4 extents and file fragmentation
  2019-03-21  3:18 ` Theodore Ts'o
@ 2019-03-21  9:29   ` Mikhail Morfikov
  2019-03-21 15:05     ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Morfikov @ 2019-03-21  9:29 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4


On 21/03/2019 04:18, Theodore Ts'o wrote:
> On Wed, Mar 20, 2019 at 11:44:19PM +0100, Mikhail Morfikov wrote:
>> When we have a big file on an ext4 partition, and filefrag shows
>> the following:
>>
>> filefrag -ve /bigfile
>> Filesystem type is: ef53
>> File size of /bigfile is 1439201280 (351368 blocks of 4096 bytes)
>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>    0:        0..   32767:      34816..     67583:  32768:            
>>    1:    32768..   63487:      67584..     98303:  30720:            
>>    2:    63488..   96255:     100352..    133119:  32768:      98304:
>>    3:    96256..  126975:     133120..    163839:  30720:            
>>    4:   126976..  159743:     165888..    198655:  32768:     163840:
>>    5:   159744..  190463:     198656..    229375:  30720:            
>>    6:   190464..  223231:     231424..    264191:  32768:     229376:
>>    7:   223232..  253951:     264192..    294911:  30720:            
>>    8:   253952..  286719:     296960..    329727:  32768:     294912:
>>    9:   286720..  319487:     329728..    362495:  32768:            
>>   10:   319488..  351367:     362496..    394375:  31880:             last,eof
>> /bigfile: 5 extents found
>>
>> 1. How many fragments does this file really have? 11 or 5? 
>> 2. Should the extents 0 and 1 be treated as one fragment or two 
>>    separate ones? I know they could be one from the human 
>>    perspective, but is it really one for ext4 filesystem?
> 
> They are encoded as two separate physical extents on disk.  Logically,
> extents 0, 1, and 2 are contiguous regions on disk.

So 5 fragments then?

>> 3. What does actually happen during the read in the case of 
>>    some HDD and its magnetic heads? If the head finishes reading 
>>    the whole extent (ext 0), will it be able to read the data of 
>>    the next extent (ext 1) without any delays like in the case of
>>    raw read (for instance dd if=/dev/sda ...), or will it be 
>>    delayed because of the filesystem layer, and the head will 
>>    have to spend some time to be positioned again in order to 
>>    read the next extent?
> 
> The delay won't be because of the file system layer, as the
> information about these first three extents will all be stored on the
> same block on disk.  In addition, ext4 has an in-memory "extent cache"
> which stores the logical->physical block mapping, and in memory, it
> will be stored as a single entry in the extent cache.
> 
> It takes *time* to read 128 megabytes (32768 4k blocks), and from a
> hard drive perspective, you are doing a streaming sequential read, how
> the file system metadata is stored is not going to be the limiting
> factor.  In fact, it's likely that they won't be issued to the hard
> drive as a single I/O request anyway.  But that doesn't matter; the
> hard drive has an I/O request queue, and so the right thing will
> happen.
> 

Yes, I know that many things can happen during the 128M read. But we
can assume that we have some simplified environment, where we have
only one disk, one file we want to read at the moment, and we have
time to do it without any external interference.

If I understood correctly, as long as the extents reside on a contiguous 
region, they will be read sequentially without any delays, right? So if 
the file in question was one big contiguous region, would it be read 
sequentially from the beginning of the file to its end? 

Also I have a question concerning the following sentence[1]: 
  "When there are more than four extents to a file, the rest of the 
  extents are indexed in a tree." 
Does this mean that only four extents can be read sequentially in a
file that has only contiguous blocks of data, or, because of the
extent cache, can the whole file be read sequentially anyway?

[1] https://en.wikipedia.org/wiki/Ext4#Features



* Re: Question about ext4 extents and file fragmentation
  2019-03-21  9:29   ` Mikhail Morfikov
@ 2019-03-21 15:05     ` Theodore Ts'o
  2019-03-21 15:59       ` Mikhail Morfikov
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2019-03-21 15:05 UTC (permalink / raw)
  To: Mikhail Morfikov; +Cc: linux-ext4

On Thu, Mar 21, 2019 at 10:29:23AM +0100, Mikhail Morfikov wrote:
> 
> Yes, I know that many things can happen during the 128M read. But we
> can assume that we have some simplified environment, where we have
> only one disk, one file we want to read at the moment, and we have
> time to do it without any external interference.
> 
> If I understood correctly, as long as the extents reside on a contiguous 
> region, they will be read sequentially without any delays, right? So if 
> the file in question was one big contiguous region, would it be read 
> sequentially from the beginning of the file to its end?

It *could* be read sequentially from the beginning of the file to the
end.  There are many things that might cause that not to happen, that
have nothing to do with how we store the logical to physical map.  For
example, some other process might be requesting disk reads that might
be interleaved with the reads for that file.  If you try to read too
quickly, and the system stalls due to lack of space in the page cache,
that might force some writeback that will interrupt the contiguous
read.  The possibilities are endless.

I hesitate to make a categorical statement, because I don't understand
why you are being monomaniacal about this.

> Also I have a question concerning the following sentence[1]: 
>   "When there are more than four extents to a file, the rest of the 
>   extents are indexed in a tree." 
> Does this mean that only four extents can be read sequentially in a
> file that has only contiguous blocks of data, or, because of the
> extent cache, can the whole file be read sequentially anyway?

If you really care about this, it's possible to use the ioctl
EXT4_IOC_PRECACHE_EXTENTS which will read the extent tree and cache it
in the extent status cache.  The main use for this has been people who
want to make a really big file --- for example, it's possible to
create a single 10 TB file which is contiguous, and while the on-disk
extent tree might require a number of 4k blocks, it can be cached in a
single 12 byte extent status cache entry.

The primary use case for this ioctl is for a *random* read workload if
there is a requirement for tail latencies.  For certain workloads,
such as a distributed query of hundreds of disks to satisfy a single
search query, if a single read is slow, it will slow down the ability
to satisfy the entire search query.  To avoid that, people will worry
about the 99th or even 99.9th percentile random read latency.  And so
precaching the extent tree makes sense:

   3. Fast is better than slow.
   We know your time is valuable, so when you’re seeking an answer on
   the web you want it right away–and we aim to please. We may be the
   only people in the world who can say our goal is to have people
   leave our website as quickly as possible. By shaving excess bits
   and bytes from our pages and increasing the efficiency of our
   serving environment, we’ve broken our own speed records many times
   over, so that the average response time on a search result is a
   fraction of a second....
   	          - https://www.google.com/about/philosophy.html

But for a sequential read workload --- it really makes no sense to be
worried about this.  For example, if you are doing a streaming video
read, the need to seek to read from the extent status tree is not
going to be noticed at all.  An HD video stream is roughly 100MB /
minute.  So once the system realizes that you are doing a sequential
read, read-ahead will automatically start pulling in new blocks ahead
of the video stream, and the need to seek to read the extent status
tree will be invisible.  And if you are copying the file, the
percentage increase for periodically seeking to read in the extent
status tree is going to be so small it might not even be measurable.
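
To put rough numbers on that, here is a back-of-the-envelope sketch.
It assumes a perfectly contiguous file made of maximum-length extents,
4 KiB blocks, and 340 extent entries per leaf block; the disk
throughput and the seek time are hypothetical values picked only for
the illustration.

```python
BLOCK_SIZE = 4096             # bytes
MAX_EXTENT_BLOCKS = 32768     # largest initialized ext4 extent
EXTENTS_PER_LEAF = 340        # 12-byte entries in one 4 KiB leaf block

# Hypothetical figures, chosen only for illustration:
VIDEO_BYTES_PER_MIN = 100 * 10**6   # "roughly 100MB / minute", decimal MB
DISK_BYTES_PER_SEC = 150 * 10**6    # assumed sequential throughput
SEEK_SECONDS = 0.010                # assumed cost of one extra seek

bytes_per_extent = BLOCK_SIZE * MAX_EXTENT_BLOCKS
bytes_per_leaf = bytes_per_extent * EXTENTS_PER_LEAF

# Streaming: how much playback time one extent / one leaf block covers.
video_minutes_per_extent = bytes_per_extent / VIDEO_BYTES_PER_MIN
video_hours_per_leaf = bytes_per_leaf / VIDEO_BYTES_PER_MIN / 60

# Copying: one extra seek per leaf block, relative to the time spent reading
# the data that leaf block maps.
copy_seconds_per_leaf = bytes_per_leaf / DISK_BYTES_PER_SEC
seek_overhead = SEEK_SECONDS / copy_seconds_per_leaf

print(f"video minutes covered by one extent: {video_minutes_per_extent:.2f}")
print(f"video hours covered by one leaf block: {video_hours_per_leaf:.1f}")
print(f"seek overhead when copying: {seek_overhead:.4%}")
```

Under these assumptions a metadata read is needed only once per several
hours of playback, and when copying it adds a tiny fraction of a
percent to the total time.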

Which is why I'm really puzzled why you care.

						- Ted


* Re: Question about ext4 extents and file fragmentation
  2019-03-21 15:05     ` Theodore Ts'o
@ 2019-03-21 15:59       ` Mikhail Morfikov
  0 siblings, 0 replies; 5+ messages in thread
From: Mikhail Morfikov @ 2019-03-21 15:59 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4


On 21/03/2019 16:05, Theodore Ts'o wrote:
> It *could* be read sequentially from the beginning of the file to the
> end.  There are many things that might cause that not to happen, that
> have nothing to do with how we store the logical to physical map.

And this is what I wanted to know, because some people say that if you 
store a file in a filesystem, it can't be read sequentially as a whole 
because of the filesystem layer (compared to "dd if=/dev/sda ..."). So 
the filesystem layer doesn't really matter and doesn't really add any 
additional delays compared to the raw read of a device when we deal with 
data that is stored in contiguous blocks. I know that many things can 
prevent the sequential read from happening, but I just wanted it to be 
clarified. 

Thank you for the answer, I really appreciate it.


