* btrfs sequential 8K read()s from compressed files are not merging
@ 2023-07-10 18:56 Dimitrios Apostolou
  2023-07-17 14:11 ` Dimitrios Apostolou
  0 siblings, 1 reply; 9+ messages in thread
From: Dimitrios Apostolou @ 2023-07-10 18:56 UTC (permalink / raw)
  To: linux-btrfs

Hello list,

I discovered this issue because of very slow sequential read speed in
PostgreSQL, which performs all reads using blocking pread() calls of 8192
bytes (postgres' default page size). I verified that reads are similarly
slow when I read the files using dd bs=8k. Here are my measurements:

Reading a 1GB postgres file using dd (which uses read() internally) in 8K
and 32K chunks:

     # dd if=4156889.4 of=/dev/null bs=8k
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.18829 s, 174 MB/s

     # dd if=4156889.4 of=/dev/null bs=8k    # 2nd run, data is cached
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.287623 s, 3.7 GB/s

     # dd if=4156889.8 of=/dev/null bs=32k
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.02688 s, 1.0 GB/s

     # dd if=4156889.8 of=/dev/null bs=32k    # 2nd run, data is cached
     1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.264049 s, 4.1 GB/s

Notice that the read rate (after transparent decompression) with bs=8k is
174MB/s (I see ~20MB/s on the device), slow and similar to what PostgreSQL
achieves. With bs=32k the rate increases to 1GB/s (I see ~80MB/s on the
device, but the run is too short for it to register properly). The device
limit is 1GB/s; of course I'm not expecting to reach that while
decompressing. The cached reads are fast in both cases; I'm guessing the
kernel buffer cache contains the decompressed blocks.

The above results have been verified with multiple runs. The kernel is
5.15 Ubuntu LTS and the block device is an LVM logical volume on a high
performance DAS system, but I verified the same behaviour on a separate
system with kernel 6.3.9 and btrfs directly on a local spinning disk.
Btrfs filesystem is mounted with compress=zstd:3 and the files have been
defragmented prior to running the commands.
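
For reference, that setup amounts to something like the following (device
name, mountpoint and paths are placeholders for my actual ones):

     # mount -o compress=zstd:3 /dev/mapper/vg-data /mnt/data
     # btrfs filesystem defragment -r /mnt/data/pgdata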

Focusing on the cold cache cases, iostat gives interesting insight: for
both postgres doing a sequential scan and for dd with bs=8k, the kernel
block layer does not appear to merge the I/O requests. `iostat -x` shows
an average read request size of 16 (sectors?), 0 merged requests, and a
very high reads/s (IOPS) number.

The dd runs with bs=32k show fewer IOPS in `iostat -x`, higher speed, a
larger average request size and a high number of merged requests.  To me
it appears as if btrfs does read-ahead only when the read block is large.

Example output for some random second out of dd bs=8k:

     Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz
     sdc           1313.00     20.93     2.00   0.15    0.53    16.32

with dd bs=32k:

     Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz
     sdc            290.00     76.44  4528.00  93.98    1.71   269.92
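
For reference, the iostat snippets above come from watching the extended
statistics with something like the following (1-second interval; the
interesting columns are r/s, rMB/s, rrqm/s, %rrqm and rareq-sz):

     # iostat -x 1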

*On the same filesystem, doing dd bs=8k reads from a file that has not
been compressed by the filesystem, I get 1GB/s throughput, which is the
limit of my device. This is what makes me believe it's an issue with btrfs
compression.*

Is this a bug or known behaviour?

Thanks in advance,
Dimitris



* Re: btrfs sequential 8K read()s from compressed files are not merging
  2023-07-10 18:56 btrfs sequential 8K read()s from compressed files are not merging Dimitrios Apostolou
@ 2023-07-17 14:11 ` Dimitrios Apostolou
  2023-07-26 10:59   ` (PING) " Dimitrios Apostolou
  0 siblings, 1 reply; 9+ messages in thread
From: Dimitrios Apostolou @ 2023-07-17 14:11 UTC (permalink / raw)
  To: linux-btrfs

Ping, any feedback on this issue?

Sorry if I was not clear: the problem here is that the filesystem is very
slow (10-20 MB/s on the device) at sequential reads from compressed
files when the read block size is 8K.

It looks like a bug to me (read requests are not merging, i.e. no
read-ahead is happening). Any opinions?



* (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-07-17 14:11 ` Dimitrios Apostolou
@ 2023-07-26 10:59   ` Dimitrios Apostolou
  2023-07-26 12:54     ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Dimitrios Apostolou @ 2023-07-26 10:59 UTC (permalink / raw)
  To: linux-btrfs

Any feedback? Is this a bug? I verified that others see the same slow read
speeds from compressed files when the block size is small.

P.S. Is there a bugtracker to report btrfs bugs? My understanding is that
     neither the kernel's bugzilla nor the GitHub issues are endorsed.


* Re: (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-07-26 10:59   ` (PING) " Dimitrios Apostolou
@ 2023-07-26 12:54     ` Christoph Hellwig
  2023-07-26 13:44       ` Dimitrios Apostolou
  2023-08-29 13:02       ` Dimitrios Apostolou
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Hellwig @ 2023-07-26 12:54 UTC (permalink / raw)
  To: Dimitrios Apostolou; +Cc: linux-btrfs

FYI, I can reproduce similar findings to yours.  I'm somewhere between
dealing with regressions and travel and don't actually have time to
fully root cause it.

The most likely scenario is some interaction between the read-ahead
window, which is sized based on the actual I/O size, and the btrfs
compressed extent design, which always compresses a fixed-size chunk
of data.



* Re: (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-07-26 12:54     ` Christoph Hellwig
@ 2023-07-26 13:44       ` Dimitrios Apostolou
  2023-08-29 13:02       ` Dimitrios Apostolou
  1 sibling, 0 replies; 9+ messages in thread
From: Dimitrios Apostolou @ 2023-07-26 13:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

Thanks for responding while travelling! :-)

On Wed, 26 Jul 2023, Christoph Hellwig wrote:

> FYI, I can reproduce similar findings to yours.  I'm somewhere between
> dealing with regressions and travel and don't actually have time to
> fully root cause it.
>
> The most likely scenario is some interaction between the read-ahead
> window, which is sized based on the actual I/O size, and the btrfs
> compressed extent design, which always compresses a fixed-size chunk
> of data.

AFAIK the compressed extents are 128KB in size. I would expect btrfs to
decompress each one as a whole, so no clever read-ahead would be needed:
btrfs should read 128KB chunks from disk and not the 8KB application
block size. But the data shows otherwise. Any idea how btrfs reads and
decompresses the 128KB extents?

Also do you know if btrfs keeps the full decompressed chunk cached, or
does it re-decompress it every time the application reads 8KB?
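
One userspace check I can think of for the on-disk layout, assuming
filefrag's FIEMAP output is reliable for btrfs compressed extents (they
should be reported with the "encoded" flag and cover at most 128KB of
file data each), is to look at the extent list of the file from my first
mail:

# filefrag -v 4156889.4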

Dimitris




* Re: (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-07-26 12:54     ` Christoph Hellwig
  2023-07-26 13:44       ` Dimitrios Apostolou
@ 2023-08-29 13:02       ` Dimitrios Apostolou
  2023-08-30 11:54         ` Qu Wenruo
  1 sibling, 1 reply; 9+ messages in thread
From: Dimitrios Apostolou @ 2023-08-29 13:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

On Wed, 26 Jul 2023, Christoph Hellwig wrote:

> FYI, I can reproduce similar findings to yours.  I'm somewhere between
> dealing with regressions and travel and don't actually have time to
> fully root cause it.
>
> The most likely scenario is some interaction between the read-ahead
> window, which is sized based on the actual I/O size, and the btrfs
> compressed extent design, which always compresses a fixed-size chunk
> of data.

So the issue is still an issue (btrfs being unreasonably slow when reading
sequentially 8K blocks from a compressed file) and I'm trying to figure
out the reasons.

I'm wondering, when an application read()s an 8K block from a big
btrfs-compressed file, apparently the full 128KB compressed chunk has to
be decompressed. But what does btrfs store in the kernel buffercache?

a. Does it store only the specific 8K block of decompressed data that was
    requested?

b. Does it store the full compressed block (128KB AFAIK) and will be
    re-decompressed upon read() from any application?

c. Or does it store the full de-compressed block, which might even be 1MB
    in size?

I guess it's doing [a], because of the performance issue I'm facing. Both
[b] and [c] would work as some kind of automatic read-ahead. But any kind
of verification would be helpful to nail the problem, as I can't see this
level of detail exposed in any way, from a userspace point of view.
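
The only userspace probe I can think of, assuming fincore from util-linux
reports page cache residency for btrfs like for any other filesystem, is
to read a single 8K block from a cold cache and then see how much of the
file became resident (4156889.4 being the compressed file from my first
mail; any big compressed file would do):

# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=4156889.4 of=/dev/null bs=8k count=1
# fincore 4156889.4    # RES around 8K would point to [a], around 128K or more to [c]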


Thanks,
Dimitris



* Re: (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-08-29 13:02       ` Dimitrios Apostolou
@ 2023-08-30 11:54         ` Qu Wenruo
  2023-08-30 18:18           ` Dimitrios Apostolou
  0 siblings, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2023-08-30 11:54 UTC (permalink / raw)
  To: Dimitrios Apostolou, Christoph Hellwig; +Cc: linux-btrfs



On 2023/8/29 21:02, Dimitrios Apostolou wrote:
> On Wed, 26 Jul 2023, Christoph Hellwig wrote:
>
>> FYI, I can reproduce similar findings to yours.  I'm somewhere between
>> dealing with regressions and travel and don't actually have time to
>> fully root cause it.
>>
>> The most likely scenario is some interaction between the read-ahead
>> window, which is sized based on the actual I/O size, and the btrfs
>> compressed extent design, which always compresses a fixed-size chunk
>> of data.
>
> So the issue is still an issue (btrfs being unreasonably slow when reading
> sequentially 8K blocks from a compressed file) and I'm trying to figure
> out the reasons.
>
> I'm wondering, when an application read()s an 8K block from a big
> btrfs-compressed file, apparently the full 128KB compressed chunk has to
> be decompressed. But what does btrfs store in the kernel buffercache?

The kernel page cache holds the inode's (file) data, i.e. the decompressed
data.

As long as you're doing cached reads, the decompressed data will be cached.

But there is another catch: if the file extent only points to a very
small part of the decompressed range, we still need to read the full
compressed extent, do the decompression, and only copy the small range
into the page cache.

>
> a. Does it store only the specific 8K block of decompressed data that was
>     requested?

For a buffered read, the read can be merged with other blocks, and we
also have readahead; in that case we can still submit a much larger read.

But mostly it's case a), since dd waits for each read to finish.

Meanwhile, if it's direct IO, there would be no merging, nor any caching.
(That's expected though.)
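
Just to illustrate the difference (the file name here is only an example),
one can compare a cold-cache buffered read with a direct IO read:

# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=8k                 # buffered: page cache and readahead can help
# dd if=testfile of=/dev/null bs=8k iflag=direct    # direct IO: no page cache, no merging expected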

>
> b. Does it store the full compressed block (128KB AFAIK) and will be
>     re-decompressed upon read() from any application?
>
> c. Or does it store the full de-compressed block, which might even be 1MB
>     in size?
>
> I guess it's doing [a], because of the performance issue I'm facing. Both
> [b] and [c] would work as some kind of automatic read-ahead. But any kind
> of verification would be helpful to nail the problem, as I can't see this
> level of detail exposed in any way, from a userspace point of view.

Although there are other factors that can be involved, like fragmentation
(especially damaging to performance for compressed extents).

One thing I want to verify is: could you create a big file with all
compressed extents (dd writes; the blocksize doesn't matter that much, as
by default it's a buffered write), other than the postgres databases?

Then do the same reads with 32K and 512K and see if there is still the
same slow performance.
(The compressed extent size limit is 128K, thus 512K would cover 4 file
extents, and hopefully increase the performance.)
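
A minimal sketch of what I mean (file name and sizes are only examples;
dropping caches between runs keeps the reads cold):

# dd if=/dev/zero of=testfile bs=1G count=16
# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=8k
# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=32k
# sync; echo 3 > /proc/sys/vm/drop_caches
# dd if=testfile of=/dev/null bs=512k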

I'm afraid the postgres data may be fragmented due to the database
workload, and that this contributes to the slowdown.

Thanks,
Qu
>
>
> Thanks,
> Dimitris
>


* Re: (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-08-30 11:54         ` Qu Wenruo
@ 2023-08-30 18:18           ` Dimitrios Apostolou
  2023-08-31  0:22             ` Anand Jain
  0 siblings, 1 reply; 9+ messages in thread
From: Dimitrios Apostolou @ 2023-08-30 18:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs


Thanks for the feedback!

On Wed, 30 Aug 2023, Qu Wenruo wrote:
>
> On 2023/8/29 21:02, Dimitrios Apostolou wrote:
>
>>  a. Does it store only the specific 8K block of decompressed data that was
>>      requested?
>
> For a buffered read, the read can be merged with other blocks, and we
> also have readahead; in that case we can still submit a much larger read.
>
> But mostly it's case a), since dd waits for each read to finish.

This is definitely not the case in other filesystems, where I see blocking
8K buffered reads going much faster. But I understand better now, and I
think I have expressed the problem wrong in the subject. The problem is
not that IOs are not *merging*, but that there is no read-ahead/pre-fetch
happening. This sounds more accurate to me.

But this raises the question: shouldn't read-ahead/prefetch happen at the
block layer? I remember having seen some configurable knobs at the
elevator level, or even at the device driver level. Is btrfs circumventing
those?
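
For the record, the knobs I had in mind are along these lines (sdc is the
disk on this system; I'm not sure which of them, if any, apply to btrfs
compressed reads):

# blockdev --getra /dev/sdc                 # readahead setting in 512-byte sectors
# cat /sys/block/sdc/queue/read_ahead_kb    # the same setting in KB
# cat /sys/class/bdi/*/read_ahead_kb        # per backing device (filesystem) readahead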


For the sake of completeness: all of the read()s in my (previous and
current) measurements are buffered and blocking, and so are the ones
from postgres.


> One thing I want to verify is: could you create a big file with all
> compressed extents (dd writes; the blocksize doesn't matter that much, as
> by default it's a buffered write), other than the postgres databases?
>
> Then do the same reads with 32K and 512K and see if there is still the
> same slow performance.

I assume you also want to see 8KB reads here, which is the main problem I
reported.

> (The compressed extent size limit is 128K, thus 512K would cover 4 file
> extents, and hopefully increase the performance.)
>
> I'm afraid the postgres data may be fragmented due to the database
> workload, and that this contributes to the slowdown.


==== Measurements

I created a zero-filled file with the size of the host's RAM to avoid
caching issues. I did many re-runs of every dd command and verified there
is no variation. I should also mention that the filesystem is 85% free, so
there shouldn't be any fragmentation issues.

# dd if=/dev/zero of=blah bs=1G count=16
16+0 records in
16+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 14.2627 s, 1.2 GB/s

I verified the file is well compressed:

# compsize blah
Processed 1 file, 131073 regular extents (131073 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL        3%      512M          16G          16G
zstd         3%      512M          16G          16G

I'm surprised that such a file needed 128K extents and required 512MB of
disk space (the filesystem is mounted with compress=zstd:3), but it is what
it is. On to reading the file:

# dd if=blah of=/dev/null bs=512k
32768+0 records in
32768+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 7.40493 s, 2.3 GB/s
### iostat showed 30MB/s to 100MB/s device read speed

# dd if=blah of=/dev/null bs=32k
524288+0 records in
524288+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 8.34762 s, 2.1 GB/s
### iostat showed 30MB/s to 90MB/s device read speed

# dd if=blah of=/dev/null bs=8k
2097152+0 records in
2097152+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 18.7143 s, 918 MB/s
### iostat showed very variable 8MB/s to 60MB/s device read speed
### average maybe around 40MB/s


Also worth noting is the IO request size that iostat is reporting. For
bs=8k it reports a request size of about 4 (KB?), while it's orders of
magnitude higher for all the other measurements in this email.


==== Same test with an incompressible file

I performed the same experiments with a urandom-filled file. I assume that
btrfs detects that the file can't be compressed, so it treats it
differently. That is what the measurements show: the device speed limit is
reached in all cases (this host has an HDD with a limit of 200MB/s).

# dd if=/dev/urandom of=blah-random bs=1G count=16
16+0 records in
16+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 84.0045 s, 205 MB/s

# compsize blah-random
Processed 1 file, 133 regular extents (133 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%       15G          15G          15G
none       100%       15G          15G          15G

# dd if=blah-random of=/dev/null bs=512k
32768+0 records in
32768+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 87.82 s, 196 MB/s
### iostat showed 180-205MB/s device read speed

# dd if=blah-random of=/dev/null bs=32k
524288+0 records in
524288+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 88.3785 s, 194 MB/s
### iostat showed 180-205MB/s device read speed

# dd if=blah-random of=/dev/null bs=8k
2097152+0 records in
2097152+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 88.7887 s, 193 MB/s
### iostat showed 180-205MB/s device read speed





Thanks,
Dimitris


* Re: (PING) btrfs sequential 8K read()s from compressed files are not merging
  2023-08-30 18:18           ` Dimitrios Apostolou
@ 2023-08-31  0:22             ` Anand Jain
  0 siblings, 0 replies; 9+ messages in thread
From: Anand Jain @ 2023-08-31  0:22 UTC (permalink / raw)
  To: Dimitrios Apostolou, Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs


> # dd if=/dev/zero of=blah bs=1G count=16
> 16+0 records in
> 16+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 14.2627 s, 1.2 GB/s
> 
> I verified the file is well compressed:
> 
> # compsize blah
> Processed 1 file, 131073 regular extents (131073 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL        3%      512M          16G          16G
> zstd         3%      512M          16G          16G
> 
> I'm surprised that such a file needed 128K extents and required 512MB of
> disk space (the filesystem is mounted with compress=zstd:3), but it is what
> it is. On to reading the file:
> 
> # dd if=blah of=/dev/null bs=512k
> 32768+0 records in
> 32768+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 7.40493 s, 2.3 GB/s
> ### iostat showed 30MB/s to 100MB/s device read speed
> 
> # dd if=blah of=/dev/null bs=32k
> 524288+0 records in
> 524288+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 8.34762 s, 2.1 GB/s
> ### iostat showed 30MB/s to 90MB/s device read speed
> 
> # dd if=blah of=/dev/null bs=8k
> 2097152+0 records in
> 2097152+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 18.7143 s, 918 MB/s
> ### iostat showed very variable 8MB/s to 60MB/s device read speed
> ### average maybe around 40MB/s
> 
> 
> Also worth noting is the IO request size that iostat is reporting. For
> bs=8k it reports a request size of about 4 (KB?), while it's orders of
> magnitude higher for all the other measurements in this email.
> 

The sector size is 4k, and the compression block size is 128k. With
smaller read block sizes there will be a lot more read IO, which may not
be mergeable.
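
(For the zero-filled file above the numbers add up: 16 GiB / 128 KiB gives
131,072 compressed extents, and at a minimum of one 4 KiB sector each that
is the ~512 MiB of disk usage compsize reported.)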

> 
> ==== Same test with an incompressible file
> 
> I performed the same experiments with a urandom-filled file. I assume that
> btrfs detects that the file can't be compressed, so it treats it
> differently. That is what the measurements show: the device speed limit is
> reached in all cases (this host has an HDD with a limit of 200MB/s).
> 
> # dd if=/dev/urandom of=blah-random bs=1G count=16
> 16+0 records in
> 16+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 84.0045 s, 205 MB/s
> 
> # compsize blah-random
> Processed 1 file, 133 regular extents (133 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL      100%       15G          15G          15G
> none       100%       15G          15G          15G
> 
> # dd if=blah-random of=/dev/null bs=512k
> 32768+0 records in
> 32768+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 87.82 s, 196 MB/s
> ### iostat showed 180-205MB/s device read speed
> 
> # dd if=blah-random of=/dev/null bs=32k
> 524288+0 records in
> 524288+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 88.3785 s, 194 MB/s
> ### iostat showed 180-205MB/s device read speed
> 
> # dd if=blah-random of=/dev/null bs=8k
> 2097152+0 records in
> 2097152+0 records out
> 17179869184 bytes (17 GB, 16 GiB) copied, 88.7887 s, 193 MB/s
> ### iostat showed 180-205MB/s device read speed


The heuristic will disable compression on the file if the data is 
incompressible, such as that from /dev/urandom.

Generally, to test compression in fstests, we use the 'dd' command as below.

od /dev/urandom | dd iflag=fullblock of=.. bs=.. count=..

Thanks, Anand



