linux-ext4.vger.kernel.org archive mirror
* improved performance in case of data journaling
@ 2020-12-03  7:28 lokesh jaliminche
  2020-12-03  8:20 ` Martin Steigerwald
  0 siblings, 1 reply; 6+ messages in thread
From: lokesh jaliminche @ 2020-12-03  7:28 UTC (permalink / raw)
  To: linux-ext4

Hello everyone,

I have been doing experiments to analyze the impact of data journaling
on IO latencies. Theoretically, data journaling should show long
latencies as compared to metadata journaling. However, I observed that
when I enable data journaling I see improved performance. Is there any
specific optimization for data journaling in the write path?
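
For anyone reproducing the setup: the two modes correspond to the standard
ext4 journaling mount options (the device and mount point below are only
illustrative, not necessarily my exact setup):

  # full data journaling (data and metadata both go through the journal)
  mount -o data=journal /dev/nvme0n1 /mnt/ext4

  # metadata-only journaling (data=ordered, the ext4 default)
  mount -o data=ordered /dev/nvme0n1 /mnt/ext4

  # for the no-journal comparison: drop the journal while unmounted
  tune2fs -O ^has_journal /dev/nvme0n1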

fio Logs:
------------
metadata journaling enabled
========================================================================

Actual data written (calculated using iostat): 5820352 bytes

writer_2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B,
(T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
writer_2: Laying out IO file (1 file / 102400MiB)

writer_2: (groupid=0, jobs=1): err= 0: pid=26021: Thu Dec  3 06:51:23 2020
  write: IOPS=24.0k, BW=97.7MiB/s (102MB/s)(2930MiB/30001msec); 0 zone resets
    clat (usec): min=12, max=3144, avg=16.63, stdev=22.33
     lat (usec): min=12, max=3144, avg=16.70, stdev=22.33
    clat percentiles (usec):
     |  1.00th=[   14],  5.00th=[   15], 10.00th=[   15], 20.00th=[   15],
     | 30.00th=[   16], 40.00th=[   16], 50.00th=[   16], 60.00th=[   16],
     | 70.00th=[   16], 80.00th=[   17], 90.00th=[   17], 95.00th=[   18],
     | 99.00th=[   34], 99.50th=[   44], 99.90th=[  424], 99.95th=[  562],
     | 99.99th=[  791]
   bw (  KiB/s): min=99856, max=100168, per=100.00%, avg=99992.14,
stdev=99.10, samples=59
   iops        : min=24964, max=25042, avg=24998.03, stdev=24.78, samples=59
  lat (usec)   : 20=96.72%, 50=2.95%, 100=0.16%, 250=0.02%, 500=0.08%
  lat (usec)   : 750=0.06%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=34.96%, sys=44.69%, ctx=750093, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,750001,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=97.7MiB/s (102MB/s), 97.7MiB/s-97.7MiB/s
(102MB/s-102MB/s), io=2930MiB (3072MB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=0/757110, merge=0/23753, ticks=0/12769, in_queue=116, util=99.74%
-------------------------------------------------------------------------------------------


data journaling enabled
========================================================================
Actual data written (calculated using iostat): 10070880 bytes

writer_2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B,
(T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
writer_2: Laying out IO file (1 file / 102400MiB)

writer_2: (groupid=0, jobs=1): err= 0: pid=26103: Thu Dec  3 06:52:15 2020
  write: IOPS=24.0k, BW=97.7MiB/s (102MB/s)(2930MiB/30001msec); 0 zone resets
    clat (usec): min=3, max=283709, avg=15.16, stdev=946.43
     lat (usec): min=3, max=283709, avg=15.21, stdev=946.43
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[    9],
     | 70.00th=[    9], 80.00th=[   11], 90.00th=[   24], 95.00th=[   28],
     | 99.00th=[   34], 99.50th=[   44], 99.90th=[   81], 99.95th=[  676],
     | 99.99th=[  947]
   bw (  KiB/s): min=48488, max=151408, per=99.87%, avg=99861.02,
stdev=13105.09, samples=59
   iops        : min=12122, max=37852, avg=24965.25, stdev=3276.27, samples=59
  lat (usec)   : 4=0.02%, 10=79.65%, 20=5.92%, 50=14.15%, 100=0.17%
  lat (usec)   : 250=0.01%, 500=0.02%, 750=0.03%, 1000=0.03%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 250=0.01%, 500=0.01%
  cpu          : usr=61.73%, sys=22.41%, ctx=115437, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,750001,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=97.7MiB/s (102MB/s), 97.7MiB/s-97.7MiB/s
(102MB/s-102MB/s), io=2930MiB (3072MB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=0/766273, merge=0/766195, ticks=0/941966,
in_queue=407464, util=95.92%

Thanks & Regards,
Lokesh


* Re: improved performance in case of data journaling
  2020-12-03  7:28 improved performance in case of data journaling lokesh jaliminche
@ 2020-12-03  8:20 ` Martin Steigerwald
  2020-12-03  9:07   ` lokesh jaliminche
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Steigerwald @ 2020-12-03  8:20 UTC (permalink / raw)
  To: Ext4; +Cc: lokesh jaliminche, Andrew Morton

lokesh jaliminche - 03.12.20, 08:28:49 CET:
> I have been doing experiments to analyze the impact of data journaling
> on IO latencies. Theoretically, data journaling should show long
> latencies as compared to metadata journaling. However, I observed
> that when I enable data journaling I see improved performance. Is
> there any specific optimization for data journaling in the write
> path?

This has been discussed before as Andrew Morton found that data 
journalling would be surprisingly fast with interactive write workloads. 
I would need to look it up in my performance training slides or use 
internet search to find the reference to that discussion again.

AFAIR even Andrew had no explanation for that. So I thought, why would I
have one? However, an idea came to my mind: the journal is a sequential
area on the disk. This could help with hard disks, I thought, at least if
the I/O goes mostly to the same, not too big, location/file – as you did not
post it, I don't know exactly what your fio job file is doing. However, the
latencies you posted as well as the device name certainly point to fast
flash storage :).

Another idea that just came to my mind: AFAIK ext4 uses quite some
delayed logging and relogging. That means if a block in the journal is
changed again within a certain time frame, ext4 changes it in
memory before the journal block is written out to disk. Thus if the same
block is overwritten again and again in a short time, at least some of the
updates would only happen in RAM. That might help latencies even with
NVMe flash, as RAM usually is still faster.
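
If you want to see how much of that batching actually happens, I think the
kernel exposes some jbd2 transaction statistics – something along these
lines (path quoted from memory, it may differ between kernel versions):

  # per-journal transaction stats: blocks per transaction,
  # average commit time, and so on
  cat /proc/fs/jbd2/*/info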

Of course I bet that Ext4 maintainers have a more accurate or detailed 
explanation than I do. But that was at least my idea about this.

Best,
-- 
Martin




* Re: improved performance in case of data journaling
  2020-12-03  8:20 ` Martin Steigerwald
@ 2020-12-03  9:07   ` lokesh jaliminche
  2020-12-22 17:47     ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: lokesh jaliminche @ 2020-12-03  9:07 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Ext4, Andrew Morton

Hi Martin,

thanks for the quick response,

Apologies from my side, I should have posted my fio job description
along with the fio logs. Anyway, here is my fio workload:

[global]
filename=/mnt/ext4/test
direct=1
runtime=30s
time_based
size=100G
group_reporting

[writer]
new_group
rate_iops=250000
bs=4k
iodepth=1
ioengine=sync
rw=randwrite
numjobs=1

I am using Intel Optane SSD so it's certainly very fast.

I agree that delayed logging could help to hide the performance
degradation due to actual writes to the SSD. However, as per the iostat
output, data is definitely crossing the block layer, and since
data journaling logs both data and metadata, I am wondering why
or how IO requests see reduced latencies compared to metadata
journaling or even no journaling.

Also, I am using direct IO mode so ideally, it should not be using any type
of caching. I am not sure if it's applicable to journal writes but the whole
point of journaling is to prevent data loss in case of abrupt failures. So
caching journal writes may result in data loss unless we are using NVRAM.

So the questions that come to my mind are:
1. Why do writes without journaling have longer latencies than write
   requests with metadata or data journaling?
2. Since metadata journaling issues relatively fewer journal writes than data
   journaling, why are writes with data journaling faster than with no
   journaling or metadata journaling?
3. If there is an optimization that allows data journaling to be this fast
   without any risk of data loss, why is the same optimization not used for
   metadata journaling?

On Thu, Dec 3, 2020 at 12:20 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
>
> lokesh jaliminche - 03.12.20, 08:28:49 CET:
> > I have been doing experiments to analyze the impact of data journaling
> > on IO latencies. Theoretically, data journaling should show long
> > latencies as compared to metadata journaling. However, I observed
> > that when I enable data journaling I see improved performance. Is
> > there any specific optimization for data journaling in the write
> > path?
>
> This has been discussed before as Andrew Morton found that data
> journalling would be surprisingly fast with interactive write workloads.
> I would need to look it up in my performance training slides or use
> internet search to find the reference to that discussion again.
>
> AFAIR even Andrew had no explanation for that. So I thought, why would I
> have one? However, an idea came to my mind: the journal is a sequential
> area on the disk. This could help with hard disks, I thought, at least if
> the I/O goes mostly to the same, not too big, location/file – as you did not
> post it, I don't know exactly what your fio job file is doing. However, the
> latencies you posted as well as the device name certainly point to fast
> flash storage :).
>
> Another idea that just came to my mind: AFAIK ext4 uses quite some
> delayed logging and relogging. That means if a block in the journal is
> changed again within a certain time frame, ext4 changes it in
> memory before the journal block is written out to disk. Thus if the same
> block is overwritten again and again in a short time, at least some of the
> updates would only happen in RAM. That might help latencies even with
> NVMe flash, as RAM usually is still faster.
>
> Of course I bet that Ext4 maintainers have a more accurate or detailed
> explanation than I do. But that was at least my idea about this.
>
> Best,
> --
> Martin
>
>


* Re: improved performance in case of data journaling
  2020-12-03  9:07   ` lokesh jaliminche
@ 2020-12-22 17:47     ` Jan Kara
  2020-12-22 22:24       ` Andreas Dilger
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2020-12-22 17:47 UTC (permalink / raw)
  To: lokesh jaliminche; +Cc: Martin Steigerwald, Ext4, Andrew Morton

Hi!

On Thu 03-12-20 01:07:51, lokesh jaliminche wrote:
> Hi Martin,
> 
> thanks for the quick response,
> 
> Apologies from my side, I should have posted my fio job description
> along with the fio logs. Anyway, here is my fio workload:
> 
> [global]
> filename=/mnt/ext4/test
> direct=1
> runtime=30s
> time_based
> size=100G
> group_reporting
> 
> [writer]
> new_group
> rate_iops=250000
> bs=4k
> iodepth=1
> ioengine=sync
> rw=randwrite
> numjobs=1
> 
> I am using Intel Optane SSD so it's certainly very fast.
> 
> I agree that delayed logging could help to hide the performance
> degradation due to actual writes to the SSD. However, as per the iostat
> output, data is definitely crossing the block layer, and since
> data journaling logs both data and metadata, I am wondering why
> or how IO requests see reduced latencies compared to metadata
> journaling or even no journaling.
> 
> Also, I am using direct IO mode so ideally, it should not be using any type
> of caching. I am not sure if it's applicable to journal writes but the whole
> point of journaling is to prevent data loss in case of abrupt failures. So
> caching journal writes may result in data loss unless we are using NVRAM.

Well, first bear in mind that in data=journal mode, ext4 does not support
direct IO, so all the IO is in fact buffered. So your random-write workload
will be transformed into semilinear writeback of the page cache pages. Now
I think that, given your SSD storage, this performs much better because the
journalling thread committing data will drive large IOs (IO to the journal
will be sequential), and even when the journal is filled and we have to
checkpoint, we will run many IOs in parallel, which is beneficial for SSDs.
Whereas without data journalling, your fio job will just run one IO at a
time, which is far from utilizing the full SSD bandwidth.

So to summarize, you see better results with data journalling because you in
fact do buffered IO under the hood :).

								Honza

> So the questions that come to my mind are:
> 1. Why do writes without journaling have longer latencies than write
>    requests with metadata or data journaling?
> 2. Since metadata journaling issues relatively fewer journal writes than data
>    journaling, why are writes with data journaling faster than with no
>    journaling or metadata journaling?
> 3. If there is an optimization that allows data journaling to be this fast
>    without any risk of data loss, why is the same optimization not used for
>    metadata journaling?
> 
> On Thu, Dec 3, 2020 at 12:20 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
> >
> > lokesh jaliminche - 03.12.20, 08:28:49 CET:
> > > I have been doing experiments to analyze the impact of data journaling
> > > on IO latencies. Theoretically, data journaling should show long
> > > latencies as compared to metadata journaling. However, I observed
> > > that when I enable data journaling I see improved performance. Is
> > > there any specific optimization for data journaling in the write
> > > path?
> >
> > This has been discussed before as Andrew Morton found that data
> > journalling would be surprisingly fast with interactive write workloads.
> > I would need to look it up in my performance training slides or use
> > internet search to find the reference to that discussion again.
> >
> > AFAIR even Andrew had no explanation for that. So I thought, why would I
> > have one? However, an idea came to my mind: the journal is a sequential
> > area on the disk. This could help with hard disks, I thought, at least if
> > the I/O goes mostly to the same, not too big, location/file – as you did not
> > post it, I don't know exactly what your fio job file is doing. However, the
> > latencies you posted as well as the device name certainly point to fast
> > flash storage :).
> >
> > Another idea that just came to my mind: AFAIK ext4 uses quite some
> > delayed logging and relogging. That means if a block in the journal is
> > changed again within a certain time frame, ext4 changes it in
> > memory before the journal block is written out to disk. Thus if the same
> > block is overwritten again and again in a short time, at least some of the
> > updates would only happen in RAM. That might help latencies even with
> > NVMe flash, as RAM usually is still faster.
> >
> > Of course I bet that Ext4 maintainers have a more accurate or detailed
> > explanation than I do. But that was at least my idea about this.
> >
> > Best,
> > --
> > Martin
> >
> >
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: improved performance in case of data journaling
  2020-12-22 17:47     ` Jan Kara
@ 2020-12-22 22:24       ` Andreas Dilger
  2020-12-28  4:06         ` lokesh jaliminche
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Dilger @ 2020-12-22 22:24 UTC (permalink / raw)
  To: Jan Kara, lokesh jaliminche
  Cc: Martin Steigerwald, Ext4, Andrew Morton, Mauricio Faria de Oliveira


On Dec 22, 2020, at 10:47 AM, Jan Kara <jack@suse.cz> wrote:
> 
> Hi!
> 
> On Thu 03-12-20 01:07:51, lokesh jaliminche wrote:
>> Hi Martin,
>> 
>> thanks for the quick response,
>> 
>> Apologies from my side, I should have posted my fio job description
>> along with the fio logs. Anyway, here is my fio workload:
>> 
>> [global]
>> filename=/mnt/ext4/test
>> direct=1
>> runtime=30s
>> time_based
>> size=100G
>> group_reporting
>> 
>> [writer]
>> new_group
>> rate_iops=250000
>> bs=4k
>> iodepth=1
>> ioengine=sync
>> rw=randwrite
>> numjobs=1
>> 
>> I am using Intel Optane SSD so it's certainly very fast.
>> 
>> I agree that delayed logging could help to hide the performance
>> degradation due to actual writes to the SSD. However, as per the iostat
>> output, data is definitely crossing the block layer, and since
>> data journaling logs both data and metadata, I am wondering why
>> or how IO requests see reduced latencies compared to metadata
>> journaling or even no journaling.
>> 
>> Also, I am using direct IO mode so ideally, it should not be using any type
>> of caching. I am not sure if it's applicable to journal writes but the whole
>> point of journaling is to prevent data loss in case of abrupt failures. So
>> caching journal writes may result in data loss unless we are using NVRAM.
> 
> Well, first bear in mind that in data=journal mode, ext4 does not support
> direct IO, so all the IO is in fact buffered. So your random-write workload
> will be transformed into semilinear writeback of the page cache pages. Now
> I think that, given your SSD storage, this performs much better because the
> journalling thread committing data will drive large IOs (IO to the journal
> will be sequential), and even when the journal is filled and we have to
> checkpoint, we will run many IOs in parallel, which is beneficial for SSDs.
> Whereas without data journalling, your fio job will just run one IO at a
> time, which is far from utilizing the full SSD bandwidth.
>
> So to summarize, you see better results with data journalling because you in
> fact do buffered IO under the hood :).

IMHO that is one of the benefits of data=journal in the first place, regardless
of whether the journal is NVMe or HDD - that it linearizes what would otherwise
be a random small-block IO workload into something much friendlier to the
storage.  As long as it maintains the "written to stable storage" semantic for
O_DIRECT, I don't think it is a problem whether the data is copied or not.
Even without the use of data=journal, there are still some code paths that
copy O_DIRECT writes.

Ideally, being able to dynamically/automatically change between data=journal
and data=ordered depending on the IO workload (e.g. large writes go straight
to their allocated blocks, small writes go into the journal) would be the best
of both worlds.  High "IOPS" for workloads that need it (even on HDD), without
overwhelming the journal device bandwidth with large streaming writes.
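
(A crude per-file version of that exists today via the journal-data file
attribute, for what it's worth – the file name below is just an example:

  # journal data as well as metadata for this one file, while the
  # rest of the filesystem stays in the default data=ordered mode
  chattr +j /mnt/ext4/small-random-writes.db
  lsattr /mnt/ext4/small-random-writes.db

but that is a static, manually set flag, not the adaptive switching
described above.)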

This would tie in well with the proposed SMR patches, which allow a very large
journal device to (essentially) transform ext4 into a log-structured filesystem
by allowing journal shadow buffers to be dropped from memory rather than being
pinned in RAM:

https://github.com/tytso/ext4-patch-queue/blob/master/series
https://github.com/tytso/ext4-patch-queue/blob/master/jbd2-dont-double-bump-transaction-number
https://github.com/tytso/ext4-patch-queue/blob/master/journal-superblock-changes
https://github.com/tytso/ext4-patch-queue/blob/master/add-journal-no-cleanup-option
https://github.com/tytso/ext4-patch-queue/blob/master/add-support-for-log-metadata-block-tracking-in-log
https://github.com/tytso/ext4-patch-queue/blob/master/add-indirection-to-metadata-block-read-paths
https://github.com/tytso/ext4-patch-queue/blob/master/cleaner
https://github.com/tytso/ext4-patch-queue/blob/master/load-jmap-from-journal
https://github.com/tytso/ext4-patch-queue/blob/master/disable-writeback
https://github.com/tytso/ext4-patch-queue/blob/master/add-ext4-journal-lazy-mount-option


Having a 64GB-256GB NVMe device for the journal, handling most of the small
IO directly to the journal, and only periodically flushing to the filesystem
on HDD would really make those SMR disks more usable, since they are starting
to creep into consumer/NAS devices, even when users aren't really aware of it:

https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
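
For anyone who wants to experiment with that split today, an external journal
on a fast device can already be set up, roughly like this (device names are
only illustrative, and both devices must use the same block size):

  # turn the NVMe partition into a dedicated journal device
  mke2fs -O journal_dev -b 4096 /dev/nvme0n1p1

  # create the ext4 filesystem on the HDD/SMR disk, pointing at that journal
  mkfs.ext4 -b 4096 -J device=/dev/nvme0n1p1 /dev/sdb1

That only relocates the journal, of course; the lazy-journal/cleaner
behaviour is what the patch series above would add on top.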

>> So the questions that come to my mind are:
>> 1. Why do writes without journaling have longer latencies than write
>>    requests with metadata or data journaling?
>> 2. Since metadata journaling issues relatively fewer journal writes than data
>>    journaling, why are writes with data journaling faster than with no
>>    journaling or metadata journaling?
>> 3. If there is an optimization that allows data journaling to be this fast
>>    without any risk of data loss, why is the same optimization not used for
>>    metadata journaling?
>> 
>> On Thu, Dec 3, 2020 at 12:20 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
>>> 
>>> lokesh jaliminche - 03.12.20, 08:28:49 CET:
>>>> I have been doing experiments to analyze the impact of data journaling
>>>> on IO latencies. Theoretically, data journaling should show long
>>>> latencies as compared to metadata journaling. However, I observed
>>>> that when I enable data journaling I see improved performance. Is
>>>> there any specific optimization for data journaling in the write
>>>> path?
>>> 
>>> This has been discussed before as Andrew Morton found that data
>>> journalling would be surprisingly fast with interactive write workloads.
>>> I would need to look it up in my performance training slides or use
>>> internet search to find the reference to that discussion again.
>>> 
>>> AFAIR even Andrew had no explanation for that. So I thought, why would I
>>> have one? However, an idea came to my mind: the journal is a sequential
>>> area on the disk. This could help with hard disks, I thought, at least if
>>> the I/O goes mostly to the same, not too big, location/file – as you did not
>>> post it, I don't know exactly what your fio job file is doing. However, the
>>> latencies you posted as well as the device name certainly point to fast
>>> flash storage :).
>>>
>>> Another idea that just came to my mind: AFAIK ext4 uses quite some
>>> delayed logging and relogging. That means if a block in the journal is
>>> changed again within a certain time frame, ext4 changes it in
>>> memory before the journal block is written out to disk. Thus if the same
>>> block is overwritten again and again in a short time, at least some of the
>>> updates would only happen in RAM. That might help latencies even with
>>> NVMe flash, as RAM usually is still faster.
>>> 
>>> Of course I bet that Ext4 maintainers have a more accurate or detailed
>>> explanation than I do. But that was at least my idea about this.
>>> 
>>> Best,
>>> --
>>> Martin
>>> 
>>> 
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


Cheers, Andreas








* Re: improved performance in case of data journaling
  2020-12-22 22:24       ` Andreas Dilger
@ 2020-12-28  4:06         ` lokesh jaliminche
  0 siblings, 0 replies; 6+ messages in thread
From: lokesh jaliminche @ 2020-12-28  4:06 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jan Kara, Martin Steigerwald, Ext4, Andrew Morton,
	Mauricio Faria de Oliveira

On Tue, Dec 22, 2020 at 2:24 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Dec 22, 2020, at 10:47 AM, Jan Kara <jack@suse.cz> wrote:
> >
> > Hi!
> >
> > On Thu 03-12-20 01:07:51, lokesh jaliminche wrote:
> >> Hi Martin,
> >>
> >> thanks for the quick response,
> >>
> >> Apologies from my side, I should have posted my fio job description
> >> along with the fio logs. Anyway, here is my fio workload:
> >>
> >> [global]
> >> filename=/mnt/ext4/test
> >> direct=1
> >> runtime=30s
> >> time_based
> >> size=100G
> >> group_reporting
> >>
> >> [writer]
> >> new_group
> >> rate_iops=250000
> >> bs=4k
> >> iodepth=1
> >> ioengine=sync
> >> rw=randwrite
> >> numjobs=1
> >>
> >> I am using Intel Optane SSD so it's certainly very fast.
> >>
> >> I agree that delayed logging could help to hide the performance
> >> degradation due to actual writes to the SSD. However, as per the iostat
> >> output, data is definitely crossing the block layer, and since
> >> data journaling logs both data and metadata, I am wondering why
> >> or how IO requests see reduced latencies compared to metadata
> >> journaling or even no journaling.
> >>
> >> Also, I am using direct IO mode so ideally, it should not be using any type
> >> of caching. I am not sure if it's applicable to journal writes but the whole
> >> point of journaling is to prevent data loss in case of abrupt failures. So
> >> caching journal writes may result in data loss unless we are using NVRAM.
> >
> > Well, first bear in mind that in data=journal mode, ext4 does not support
> > direct IO, so all the IO is in fact buffered. So your random-write workload
> > will be transformed into semilinear writeback of the page cache pages. Now
> > I think that, given your SSD storage, this performs much better because the
> > journalling thread committing data will drive large IOs (IO to the journal
> > will be sequential), and even when the journal is filled and we have to
> > checkpoint, we will run many IOs in parallel, which is beneficial for SSDs.
> > Whereas without data journalling, your fio job will just run one IO at a
> > time, which is far from utilizing the full SSD bandwidth.
> >
> > So to summarize, you see better results with data journalling because you in
> > fact do buffered IO under the hood :).

That makes sense, thank you!
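
One way I can sanity-check that explanation on my side is to rerun the
data=ordered case with buffered IO instead of O_DIRECT, so the page cache
and writeback are in play in both configurations – roughly this, mirroring
my job file above (command line untested as written):

  fio --name=writer --filename=/mnt/ext4/test --size=100G \
      --rw=randwrite --bs=4k --ioengine=sync --iodepth=1 \
      --direct=0 --runtime=30s --time_based --group_reporting

If the buffered data=ordered numbers come out similarly low, that would
support the "buffered IO under the hood" explanation rather than anything
specific to the journal itself.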
>
> IMHO that is one of the benefits of data=journal in the first place, regardless
> of whether the journal is NVMe or HDD - that it linearizes what would otherwise
> be a random small-block IO workload into something much friendlier to the
> storage.  As long as it maintains the "written to stable storage" semantic for
> O_DIRECT, I don't think it is a problem whether the data is copied or not.
> Even without the use of data=journal, there are still some code paths that
> copy O_DIRECT writes.
>
> Ideally, being able to dynamically/automatically change between data=journal
> and data=ordered depending on the IO workload (e.g. large writes go straight
> to their allocated blocks, small writes go into the journal) would be the best
> of both worlds.  High "IOPS" for workloads that need it (even on HDD), without
> overwhelming the journal device bandwidth with large streaming writes.
>
> This would tie in well with the proposed SMR patches, which allow a very large
> journal device to (essentially) transform ext4 into a log-structured filesystem
> by allowing journal shadow buffers to be dropped from memory rather than being
> pinned in RAM:
>
> https://github.com/tytso/ext4-patch-queue/blob/master/series
> https://github.com/tytso/ext4-patch-queue/blob/master/jbd2-dont-double-bump-transaction-number
> https://github.com/tytso/ext4-patch-queue/blob/master/journal-superblock-changes
> https://github.com/tytso/ext4-patch-queue/blob/master/add-journal-no-cleanup-option
> https://github.com/tytso/ext4-patch-queue/blob/master/add-support-for-log-metadata-block-tracking-in-log
> https://github.com/tytso/ext4-patch-queue/blob/master/add-indirection-to-metadata-block-read-paths
> https://github.com/tytso/ext4-patch-queue/blob/master/cleaner
> https://github.com/tytso/ext4-patch-queue/blob/master/load-jmap-from-journal
> https://github.com/tytso/ext4-patch-queue/blob/master/disable-writeback
> https://github.com/tytso/ext4-patch-queue/blob/master/add-ext4-journal-lazy-mount-option
>
>
> Having a 64GB-256GB NVMe device for the journal, handling most of the small
> IO directly to the journal, and only periodically flushing to the filesystem
> on HDD would really make those SMR disks more usable, since they are starting
> to creep into consumer/NAS devices, even when users aren't really aware of it:
>
> https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
>
> >> So the questions that come to my mind are:
> >> 1. Why do writes without journaling have longer latencies than write
> >>    requests with metadata or data journaling?
> >> 2. Since metadata journaling issues relatively fewer journal writes than data
> >>    journaling, why are writes with data journaling faster than with no
> >>    journaling or metadata journaling?
> >> 3. If there is an optimization that allows data journaling to be this fast
> >>    without any risk of data loss, why is the same optimization not used for
> >>    metadata journaling?
> >>
> >> On Thu, Dec 3, 2020 at 12:20 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
> >>>
> >>> lokesh jaliminche - 03.12.20, 08:28:49 CET:
> >>>> I have been doing experiments to analyze the impact of data journaling
> >>>> on IO latencies. Theoretically, data journaling should show long
> >>>> latencies as compared to metadata journaling. However, I observed
> >>>> that when I enable data journaling I see improved performance. Is
> >>>> there any specific optimization for data journaling in the write
> >>>> path?
> >>>
> >>> This has been discussed before as Andrew Morton found that data
> >>> journalling would be surprisingly fast with interactive write workloads.
> >>> I would need to look it up in my performance training slides or use
> >>> internet search to find the reference to that discussion again.
> >>>
> >>> AFAIR even Andrew had no explanation for that. So I thought, why would I
> >>> have one? However, an idea came to my mind: the journal is a sequential
> >>> area on the disk. This could help with hard disks, I thought, at least if
> >>> the I/O goes mostly to the same, not too big, location/file – as you did not
> >>> post it, I don't know exactly what your fio job file is doing. However, the
> >>> latencies you posted as well as the device name certainly point to fast
> >>> flash storage :).
> >>>
> >>> Another idea that just came to my mind: AFAIK ext4 uses quite some
> >>> delayed logging and relogging. That means if a block in the journal is
> >>> changed again within a certain time frame, ext4 changes it in
> >>> memory before the journal block is written out to disk. Thus if the same
> >>> block is overwritten again and again in a short time, at least some of the
> >>> updates would only happen in RAM. That might help latencies even with
> >>> NVMe flash, as RAM usually is still faster.
> >>>
> >>> Of course I bet that Ext4 maintainers have a more accurate or detailed
> >>> explanation than I do. But that was at least my idea about this.
> >>>
> >>> Best,
> >>> --
> >>> Martin
> >>>
> >>>
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
>
>
> Cheers, Andreas
>
>
>
>
>


end of thread

Thread overview: 6+ messages
2020-12-03  7:28 improved performance in case of data journaling lokesh jaliminche
2020-12-03  8:20 ` Martin Steigerwald
2020-12-03  9:07   ` lokesh jaliminche
2020-12-22 17:47     ` Jan Kara
2020-12-22 22:24       ` Andreas Dilger
2020-12-28  4:06         ` lokesh jaliminche
