* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: Chang, Cunyin @ 2017-11-14  8:07 UTC
  To: spdk


Could you please also try the sequential write test:

This is my test result with P3700:
O_DIRECT
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/445.7MB/0KB /s] [0/114K/0 iops]
O_DIRECT + O_DSYNC
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/676.6MB/0KB /s] [0/173K/0 iops]

-Cunyin

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 3:42 PM
To: 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Subject: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi,

This may be related to the issue Anirudh found that is currently being fixed.
The current BDEV AIO uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC ensures that IO is written to persistent storage.

SPDK is for storage, so I think setting both O_DIRECT and O_DSYNC would be better for SPDK BDEV AIO.
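
As a rough sketch of what this proposal means at the open(2) level (illustrative only, not the actual bdev_aio.c code; the helper name is made up):

#define _GNU_SOURCE              /* O_DIRECT needs _GNU_SOURCE on Linux */
#include <fcntl.h>

/* Illustrative sketch only, not the actual SPDK bdev_aio open path. */
static int
open_backing_file_proposed(const char *path)
{
    /* current bdev_aio behavior is O_RDWR | O_DIRECT (bypass the page cache);
     * the proposal adds O_DSYNC so that a completed write implies durable data */
    return open(path, O_RDWR | O_DIRECT | O_DSYNC);
}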

Cunyin understood the purpose of O_DSYNC but asked me about the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team had evaluated only functionality, I ran a simple performance test with FIO and a single NVMe SSD (P3700).
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuilt it.

Due to lack of time I have not yet been able to bring the NVMe SSD to a complete steady state to get real-world performance numbers.
I did not notice a major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks like the NVMe SSD is saturated, and CPU utilization was less than 10% in all cases as far as I could tell from mpstat.
(Our performance team created the steady state and got stable write performance of 100K IOPS, but I have not reproduced that yet.
If I can create the complete steady state by taking more time, I should be able to get 100K IOPS for writes.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



Regarding the difference in the IO command sequence:
for a SCSI disk, O_DSYNC issues an extra IO command, which may affect IO performance, but
for an NVMe SSD, O_DSYNC issues no extra IO command.
Hence I believe this is why I saw no performance difference.

I would appreciate any feedback.

Thank you,
Shuhei Matsumoto


* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2017-11-17  0:23 UTC
  To: spdk


> > I really would like to avoid adding O_DSYNC if we can find a way to
> > make the code strictly correct in terms of data integrity. That option
> > should have a large negative impact on performance for devices that have a volatile write cache
> (HDDs, consumer SSDs).

I agree with you about this, and it would be better if an option were added to the kernel that enables only IO completion detection.

Thank you,
Shuhei

> -----Original Message-----
> From: 松本周平 / MATSUMOTO,SHUUHEI
> Sent: Friday, November 17, 2017 9:08 AM
> To: spdk(a)lists.01.org
> Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> Hi Jim, Ben,
> 
> At first about performance, current FIO does not support O_DSYNC mode and our performance team
> have measured only O_DIRECT.
> I'm not against your opinion that O_DSYNC should not improve performance.
> But AIO implementation is complex and Cunyin's result may be reliable and hence I think it's probable.
> I would like to found the logic but have not done yet.
> 
> I would like to do double check but unfortunately system is not available for me now and I wish
> you will be able to do.
> 
> 
> > Is this experience based on experiments with HDDs? I assume those HDDs
> > have a volatile write cache, since most do. What if another
> > explanation is that AIO reported the I/O complete after the write to the HDD completed successfully,
> then the drive was hot removed prior to a flush being sent?
> > In that case, the data would be lost even though the write I/O
> > completed successfully. This would also explain why O_DIRECT plus
> > O_DSYNC fixes the problem - it first sends the write which succeeds,
> > but then tries to send a flush immediately which fails because the device was hot removed between
> the two commands.
> 
> I want to revise the following:
> > >     - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
> > >     - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.
> 
> - If only O_DIRECT is set, when HDD was hot removed read failed but write succeeded.
> - If both O_DIRECT and O_DSYNC are set, both read and write failed.
> 
> After receiving this result someone (maybe me...) looked into the kernel code, found the logic
> that may cause this result and also found the patch that added the logic.
> 
> Thank you,
> Shuhei
> 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
> > Benjamin
> > Sent: Friday, November 17, 2017 3:00 AM
> > To: spdk(a)lists.01.org
> > Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > BDEV_AIO
> >
> > On Thu, 2017-11-16 at 15:33 +0000, Harris, James R wrote:
> > > It still seems like O_DSYNC should be a nop on NVMe SSDs that do not
> > > have a volatile write cache.  Certainly, adding O_DSYNC should not
> > > improve performance as Cunyin showed in his latest data.  I suspect
> > > some of these differences must still be related to preconditioning.
> >
> > I agree - there has to be something else going on here. Sending
> > flushes after every I/O should make the drive slower (or have no impact), not faster.
> >
> >
> > > On 11/15/17, 5:39 PM, "SPDK on behalf of 松本周平 / MATSUMOTO,SHUUHEI"
> > > <spdk-bounc es(a)lists.01.org on behalf of shuhei.matsumoto.xt(a)hitachi.com> wrote:
> > >     I want to add our team's experience about AIO + only O_DIRECT.
> > >     We have observed that
> > >     - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
> > >     - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.
> >
> > Is this experience based on experiments with HDDs? I assume those HDDs
> > have a volatile write cache, since most do. What if another
> > explanation is that AIO reported the I/O complete after the write to the HDD completed successfully,
> then the drive was hot removed prior to a flush being sent?
> > In that case, the data would be lost even though the write I/O
> > completed successfully. This would also explain why O_DIRECT plus
> > O_DSYNC fixes the problem - it first sends the write which succeeds,
> > but then tries to send a flush immediately which fails because the device was hot removed between
> the two commands.
> >
> > I really would like to avoid adding O_DSYNC if we can find a way to
> > make the code strictly correct in terms of data integrity. That option
> > should have a large negative impact on performance for devices that have a volatile write cache
> (HDDs, consumer SSDs).



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2017-11-17  0:08 UTC
  To: spdk


Hi Jim, Ben,

First, about performance: the current FIO does not support an O_DSYNC mode, and our performance team has measured only O_DIRECT.
I'm not against your opinion that O_DSYNC should not improve performance.
But the AIO implementation is complex, and Cunyin's result may be reliable, so I think it is plausible.
I would like to find the logic behind it but have not done so yet.

I would like to double-check, but unfortunately the system is not available to me right now, and I hope you will be able to do so.


> Is this experience based on experiments with HDDs? I assume those HDDs have a volatile write cache,
> since most do. What if another explanation is that AIO reported the I/O complete after the write
> to the HDD completed successfully, then the drive was hot removed prior to a flush being sent?
> In that case, the data would be lost even though the write I/O completed successfully. This would
> also explain why O_DIRECT plus O_DSYNC fixes the problem - it first sends the write which succeeds,
> but then tries to send a flush immediately which fails because the device was hot removed between
> the two commands.

I want to revise the following:
> >     - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
> >     - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.

- If only O_DIRECT is set, when the HDD was hot removed reads failed but writes succeeded.
- If both O_DIRECT and O_DSYNC are set, both reads and writes failed.

After receiving this result, someone (maybe me...) looked into the kernel code, found the logic that may cause this behavior, and also found the patch that added that logic.
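
(For reference, the failure shows up in the io_getevents() result: with libaio, the res field of struct io_event holds the byte count on success or a negative errno on failure. A minimal sketch of that check, assuming the iocb was already submitted with io_submit():)

#include <libaio.h>
#include <stdio.h>

/* Minimal sketch: reap one completion and check whether the IO failed,
 * e.g. because the disk was hot removed. Assumes 'ctx' was initialized
 * with io_setup() and an iocb was already submitted with io_submit(). */
static void
check_one_completion(io_context_t ctx)
{
    struct io_event ev;

    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
        if ((long)ev.res < 0) {
            /* negative errno, e.g. -EIO after the device disappeared */
            fprintf(stderr, "aio failed: %ld\n", (long)ev.res);
        }
        /* otherwise ev.res is the byte count; with O_DIRECT only, this can be
         * reported before the data is actually durable on media */
    }
}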

Thank you,
Shuhei

> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Friday, November 17, 2017 3:00 AM
> To: spdk(a)lists.01.org
> Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> On Thu, 2017-11-16 at 15:33 +0000, Harris, James R wrote:
> > It still seems like O_DSYNC should be a nop on NVMe SSDs that do not
> > have a volatile write cache.  Certainly, adding O_DSYNC should not
> > improve performance as Cunyin showed in his latest data.  I suspect
> > some of these differences must still be related to preconditioning.
> 
> I agree - there has to be something else going on here. Sending flushes after every I/O should
> make the drive slower (or have no impact), not faster.
> 
> 
> > On 11/15/17, 5:39 PM, "SPDK on behalf of 松本周平 / MATSUMOTO,SHUUHEI"
> > <spdk-bounc es(a)lists.01.org on behalf of shuhei.matsumoto.xt(a)hitachi.com> wrote:
> >     I want to add our team's experience about AIO + only O_DIRECT.
> >     We have observed that
> >     - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
> >     - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.
> 
> Is this experience based on experiments with HDDs? I assume those HDDs have a volatile write cache,
> since most do. What if another explanation is that AIO reported the I/O complete after the write
> to the HDD completed successfully, then the drive was hot removed prior to a flush being sent?
> In that case, the data would be lost even though the write I/O completed successfully. This would
> also explain why O_DIRECT plus O_DSYNC fixes the problem - it first sends the write which succeeds,
> but then tries to send a flush immediately which fails because the device was hot removed between
> the two commands.
> 
> I really would like to avoid adding O_DSYNC if we can find a way to make the code strictly correct
> in terms of data integrity. That option should have a large negative impact on performance for
> devices that have a volatile write cache (HDDs, consumer SSDs).



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: Walker, Benjamin @ 2017-11-16 17:59 UTC
  To: spdk


On Thu, 2017-11-16 at 15:33 +0000, Harris, James R wrote:
> It still seems like O_DSYNC should be a nop on NVMe SSDs that do not have a
> volatile write cache.  Certainly, adding O_DSYNC should not improve
> performance as Cunyin showed in his latest data.  I suspect some of these
> differences must still be related to preconditioning.

I agree - there has to be something else going on here. Sending flushes after
every I/O should make the drive slower (or have no impact), not faster.


> On 11/15/17, 5:39 PM, "SPDK on behalf of 松本周平 / MATSUMOTO,SHUUHEI" <spdk-bounc
> es(a)lists.01.org on behalf of shuhei.matsumoto.xt(a)hitachi.com> wrote:
>     I want to add our team's experience about AIO + only O_DIRECT.
>     We have observed that
>     - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
>     - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.

Is this experience based on experiments with HDDs? I assume those HDDs have a
volatile write cache, since most do. What if another explanation is that AIO
reported the I/O complete after the write to the HDD completed successfully,
then the drive was hot removed prior to a flush being sent? In that case, the
data would be lost even though the write I/O completed successfully. This would
also explain why O_DIRECT plus O_DSYNC fixes the problem - it first sends the
write which succeeds, but then tries to send a flush immediately which fails
because the device was hot removed between the two commands.

I really would like to avoid adding O_DSYNC if we can find a way to make the
code strictly correct in terms of data integrity. That option should have a
large negative impact on performance for devices that have a volatile write
cache (HDDs, consumer SSDs).



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: Harris, James R @ 2017-11-16 15:33 UTC
  To: spdk


Hi Shuhei,

This behavior seems crazy to me. This must be a kernel bug then, if an O_DIRECT write to a disk that has been removed completes without error?

But I agree that we should add O_DSYNC to the bdev aio driver to ensure FUA is set or a flush occurs after any writes.

It still seems like O_DSYNC should be a nop on NVMe SSDs that do not have a volatile write cache.  Certainly, adding O_DSYNC should not improve performance as Cunyin showed in his latest data.  I suspect some of these differences must still be related to preconditioning.

-Jim


On 11/15/17, 5:39 PM, "SPDK on behalf of 松本周平 / MATSUMOTO,SHUUHEI" <spdk-bounces(a)lists.01.org on behalf of shuhei.matsumoto.xt(a)hitachi.com> wrote:

    Hi Ben, Jim, Cunyin,
    
    I want to add our team's experience about AIO + only O_DIRECT.
    We have observed that
    - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
    - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.
    
    I'm not sure if this implementation will be changed.
    
    I hope this will be any help for you to consider.
    
    Thank you,
    Shuhei
    
    > -----Original Message-----
    > From: 松本周平 / MATSUMOTO,SHUUHEI
    > Sent: Wednesday, November 15, 2017 3:03 PM
    > To: spdk(a)lists.01.org
    > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
    > 
    > Hi Ben,
    > 
    > I did not expect but thanks to your very well explanation, my understanding improved for not only
    > AIO but also hardware and SPDK.
    > I started SPDK BDEV layer mentioned in the following and your explanation is very helpful.
    > 
    > Thank you,
    > Shuhei
    > 
    > > -----Original Message-----
    > > From: 松本周平 / MATSUMOTO,SHUUHEI
    > > Sent: Wednesday, November 15, 2017 10:07 AM
    > > To: spdk(a)lists.01.org
    > > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
    > >
    > > Hi Ben, Jim, Cunyin,
    > >
    > > Thank you for your feedback.
    > >
    > > I talked with my colleague and considered again.
    > > Let me clarify my concern.
    > >
    > > The issue is caused by the current implementation of Linux AIO.
    > > O_DSYNC controls both of two controls.
    > >
    > > 1) Assurance of transaction completion on the software layer
    > > O_DSYNC=YES: Linux kernel informs completion to the application after confirming queued IO is
    > done.
    > > O_DSYNC=NO : Linux kernel informs completion to the application
    > > speculatively without confirming queued IO is done.
    > >
    > > 2) Avoid using volatile cache
    > > O_DSYNC=YES: Linux kernel set the FUA bit O_DSYNC=NO : Linux kernel do
    > > not set the FUA bit
    > >
    > > Your concern is 2).
    > >
    > > My concern is 1) and SPDK API spdk_bdev_flush() is for hardware and
    > > cannot assure that queued IO is done in Linux kernel.
    > > I think 1) and 2) should be controlled by different parameters but currently not.
    > >
    > > Hence O_DSYNC=YES/NO should be controllable at least.
    > >
    > >
    > > About performance evaluation, our performance team usually
    > > - take 5 hours for preconditioning
    > > - take only 10 seconds for test run but they observe stable data due to enough preconditioning.
    > >
    > > However I took only 10 minutes for preconditioning and 10 seconds for
    > > test run respectively due to availability.
    > > Unfortunately I will not be able to use the machine and I wish Cunyin will have good data.
    > >
    > > Thank you,
    > > Shuhei
    > >
    > > > -----Original Message-----
    > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
    > > > Benjamin
    > > > Sent: Wednesday, November 15, 2017 3:01 AM
    > > > To: spdk(a)lists.01.org
    > > > Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to
    > > > BDEV_AIO
    > > >
    > > > On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
    > > > > Hi,
    > > > >
    > > > > Current BDEV AIO uses O_DIRECT to avoid IO cache effects but does
    > > > > not use O_DSYNC.
    > > > > O_DSYNC assures that IO is written to persistent storage.
    > > > >
    > > > > About the difference of IO command sequence, for SCSI disk O_DSYNC
    > > > > issues extra IO command and it may affect IO performance but for
    > > > > NVMe-SSD O_DSYNC issues no extra IO command.
    > > > > Hence I estimate this is the reason of indifference of performance.
    > > > >
    > > >
    > > > O_DIRECT avoids using operating system caches in system memory.
    > > > O_DSYNC avoids using volatile caches in the SSD itself by setting
    > > > the FUA bit on I/O (Force Unit Access) and additionally by issuing a
    > > > SCSI synchronize cache command after each I/O on devices that report
    > > > having a volatile write cache. This is why you see an extra command
    > > > for SCSI devices - they are reporting
    > > that they support a volatile write cache. That's very common for a SAS/SATA HDD.
    > > >
    > > > The SPDK API provides the user with a mechanism to query whether a
    > > > block device has a volatile write cache (spdk_bdev_has_write_cache)
    > > > and an API to instruct the device to make data in its volatile
    > > > caches persistent (spdk_bdev_flush). The existence of volatile write
    > > > caches and the semantics around flushing are well established in
    > > > traditional block stacks and provide significant performance
    > > > benefits to some types of devices (particularly lower end consumer
    > > > grade devices), so we've chosen to provide those same semantics in
    > > > SPDK. Altering the spdk bdev aio module to always specify O_DSYNC
    > > > will greatly reduce the performance on these types
    > > of devices, so I think choosing just O_DIRECT and not O_DSYNC is the correct choice for the flag.
    > > This provides the user the traditional semantics of flushing block devices that they expect.
    > > >
    > > > Note that the Intel P3700 does not have a volatile write cache, so
    > > > sending flush requests does nothing (it has a write cache, it just
    > > > isn't volatile). The reason Jim was asking about preconditioning in
    > > > another branch of this thread is because I don't think either of us
    > > > expect to see any performance difference on the Intel
    > > > P3700 when the O_DSYNC flag is added, regardless of workload. If
    > > > there is a difference even after preconditioning, then it certainly warrants investigation.
    > > >
    > > > Thanks Shuhei!
    > > >
    > > > Ben
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2017-11-16  3:43 UTC
  To: spdk


Hi Cunyin,

Thank you so much for rerunning the performance test with the official measurement procedure.
I would like to confirm one thing just in case.
I modified the FIO code slightly (O_SYNC -> O_DSYNC) and rebuilt it.
Did you do the same preparation?

Thank you,
Shuhei Matsumoto

> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Chang, Cunyin
> Sent: Thursday, November 16, 2017 11:42 AM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> For the performance, here is the test result in 20 mins:
> -bs=4K --iodepth=128 --rw=write
> 
> O_DIRECT:
> Jobs: 1 (f=0): [f(1)] [100.0% done] [0KB/253.5MB/0KB /s] [0/64.9K/0 iops]
> Jobs: 16 (f=16): [W(16)] [100.0% done] [0KB/987.8MB/0KB /s] [0/253K/0 iops]
> 
> O_DIRECT + O_DSYNC:
> Jobs: 1 (f=0): [f(1)] [100.0% done] [0KB/626.6MB/0KB /s] [0/160K/0 iops]
> Jobs: 16 (f=16): [W(16)] [100.0% done] [0KB/1378MB/0KB /s] [0/353K/0 iops]
> 
> 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> > MATSUMOTO,SHUUHEI
> > Sent: Thursday, November 16, 2017 8:40 AM
> > To: spdk(a)lists.01.org
> > Subject: Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> >
> > Hi Ben, Jim, Cunyin,
> >
> > I want to add our team's experience about AIO + only O_DIRECT.
> > We have observed that
> > - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
> > - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.
> >
> > I'm not sure if this implementation will be changed.
> >
> > I hope this will be any help for you to consider.
> >
> > Thank you,
> > Shuhei
> >
> > > -----Original Message-----
> > > From: 松本周平 / MATSUMOTO,SHUUHEI
> > > Sent: Wednesday, November 15, 2017 3:03 PM
> > > To: spdk(a)lists.01.org
> > > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> > >
> > > Hi Ben,
> > >
> > > I did not expect but thanks to your very well explanation, my
> > > understanding improved for not only AIO but also hardware and SPDK.
> > > I started SPDK BDEV layer mentioned in the following and your explanation
> > is very helpful.
> > >
> > > Thank you,
> > > Shuhei
> > >
> > > > -----Original Message-----
> > > > From: 松本周平 / MATSUMOTO,SHUUHEI
> > > > Sent: Wednesday, November 15, 2017 10:07 AM
> > > > To: spdk(a)lists.01.org
> > > > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > > > BDEV_AIO
> > > >
> > > > Hi Ben, Jim, Cunyin,
> > > >
> > > > Thank you for your feedback.
> > > >
> > > > I talked with my colleague and considered again.
> > > > Let me clarify my concern.
> > > >
> > > > The issue is caused by the current implementation of Linux AIO.
> > > > O_DSYNC controls both of two controls.
> > > >
> > > > 1) Assurance of transaction completion on the software layer
> > > > O_DSYNC=YES: Linux kernel informs completion to the application
> > > > after confirming queued IO is
> > > done.
> > > > O_DSYNC=NO : Linux kernel informs completion to the application
> > > > speculatively without confirming queued IO is done.
> > > >
> > > > 2) Avoid using volatile cache
> > > > O_DSYNC=YES: Linux kernel set the FUA bit O_DSYNC=NO : Linux kernel
> > > > do not set the FUA bit
> > > >
> > > > Your concern is 2).
> > > >
> > > > My concern is 1) and SPDK API spdk_bdev_flush() is for hardware and
> > > > cannot assure that queued IO is done in Linux kernel.
> > > > I think 1) and 2) should be controlled by different parameters but
> > currently not.
> > > >
> > > > Hence O_DSYNC=YES/NO should be controllable at least.
> > > >
> > > >
> > > > About performance evaluation, our performance team usually
> > > > - take 5 hours for preconditioning
> > > > - take only 10 seconds for test run but they observe stable data due to
> > enough preconditioning.
> > > >
> > > > However I took only 10 minutes for preconditioning and 10 seconds
> > > > for test run respectively due to availability.
> > > > Unfortunately I will not be able to use the machine and I wish Cunyin will
> > have good data.
> > > >
> > > > Thank you,
> > > > Shuhei
> > > >
> > > > > -----Original Message-----
> > > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
> > > > > Benjamin
> > > > > Sent: Wednesday, November 15, 2017 3:01 AM
> > > > > To: spdk(a)lists.01.org
> > > > > Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > > > > BDEV_AIO
> > > > >
> > > > > On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,
> > SHUUHEI wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Current BDEV AIO uses O_DIRECT to avoid IO cache effects but
> > > > > > does not use O_DSYNC.
> > > > > > O_DSYNC assures that IO is written to persistent storage.
> > > > > >
> > > > > > About the difference of IO command sequence, for SCSI disk
> > > > > > O_DSYNC issues extra IO command and it may affect IO performance
> > > > > > but for NVMe-SSD O_DSYNC issues no extra IO command.
> > > > > > Hence I estimate this is the reason of indifference of performance.
> > > > > >
> > > > >
> > > > > O_DIRECT avoids using operating system caches in system memory.
> > > > > O_DSYNC avoids using volatile caches in the SSD itself by setting
> > > > > the FUA bit on I/O (Force Unit Access) and additionally by issuing
> > > > > a SCSI synchronize cache command after each I/O on devices that
> > > > > report having a volatile write cache. This is why you see an extra
> > > > > command for SCSI devices - they are reporting
> > > > that they support a volatile write cache. That's very common for a
> > SAS/SATA HDD.
> > > > >
> > > > > The SPDK API provides the user with a mechanism to query whether a
> > > > > block device has a volatile write cache
> > > > > (spdk_bdev_has_write_cache) and an API to instruct the device to
> > > > > make data in its volatile caches persistent (spdk_bdev_flush). The
> > > > > existence of volatile write caches and the semantics around
> > > > > flushing are well established in traditional block stacks and
> > > > > provide significant performance benefits to some types of devices
> > > > > (particularly lower end consumer grade devices), so we've chosen
> > > > > to provide those same semantics in SPDK. Altering the spdk bdev
> > > > > aio module to always specify O_DSYNC will greatly reduce the
> > > > > performance on these types
> > > > of devices, so I think choosing just O_DIRECT and not O_DSYNC is the
> > correct choice for the flag.
> > > > This provides the user the traditional semantics of flushing block devices
> > that they expect.
> > > > >
> > > > > Note that the Intel P3700 does not have a volatile write cache, so
> > > > > sending flush requests does nothing (it has a write cache, it just
> > > > > isn't volatile). The reason Jim was asking about preconditioning
> > > > > in another branch of this thread is because I don't think either
> > > > > of us expect to see any performance difference on the Intel
> > > > > P3700 when the O_DSYNC flag is added, regardless of workload. If
> > > > > there is a difference even after preconditioning, then it certainly
> > warrants investigation.
> > > > >
> > > > > Thanks Shuhei!
> > > > >
> > > > > Ben
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk


* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: Chang, Cunyin @ 2017-11-16  2:41 UTC
  To: spdk


Regarding performance, here are the test results from a 20-minute run:
-bs=4K --iodepth=128 --rw=write

O_DIRECT:
Jobs: 1 (f=0): [f(1)] [100.0% done] [0KB/253.5MB/0KB /s] [0/64.9K/0 iops]
Jobs: 16 (f=16): [W(16)] [100.0% done] [0KB/987.8MB/0KB /s] [0/253K/0 iops]

O_DIRECT + O_DSYNC:
Jobs: 1 (f=0): [f(1)] [100.0% done] [0KB/626.6MB/0KB /s] [0/160K/0 iops] 
Jobs: 16 (f=16): [W(16)] [100.0% done] [0KB/1378MB/0KB /s] [0/353K/0 iops]


> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 /
> MATSUMOTO,SHUUHEI
> Sent: Thursday, November 16, 2017 8:40 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> Hi Ben, Jim, Cunyin,
> 
> I want to add our team's experience about AIO + only O_DIRECT.
> We have observed that
> - if only O_DIRECT is set, any IO succeeded even if HDD was hot removed.
> - if both O_DIRECT and O_DSYNC, any IO failed if HDD was hot removed.
> 
> I'm not sure if this implementation will be changed.
> 
> I hope this will be any help for you to consider.
> 
> Thank you,
> Shuhei
> 
> > -----Original Message-----
> > From: 松本周平 / MATSUMOTO,SHUUHEI
> > Sent: Wednesday, November 15, 2017 3:03 PM
> > To: spdk(a)lists.01.org
> > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> >
> > Hi Ben,
> >
> > I did not expect but thanks to your very well explanation, my
> > understanding improved for not only AIO but also hardware and SPDK.
> > I started SPDK BDEV layer mentioned in the following and your explanation
> is very helpful.
> >
> > Thank you,
> > Shuhei
> >
> > > -----Original Message-----
> > > From: 松本周平 / MATSUMOTO,SHUUHEI
> > > Sent: Wednesday, November 15, 2017 10:07 AM
> > > To: spdk(a)lists.01.org
> > > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > > BDEV_AIO
> > >
> > > Hi Ben, Jim, Cunyin,
> > >
> > > Thank you for your feedback.
> > >
> > > I talked with my colleague and considered again.
> > > Let me clarify my concern.
> > >
> > > The issue is caused by the current implementation of Linux AIO.
> > > O_DSYNC controls both of two controls.
> > >
> > > 1) Assurance of transaction completion on the software layer
> > > O_DSYNC=YES: Linux kernel informs completion to the application
> > > after confirming queued IO is
> > done.
> > > O_DSYNC=NO : Linux kernel informs completion to the application
> > > speculatively without confirming queued IO is done.
> > >
> > > 2) Avoid using volatile cache
> > > O_DSYNC=YES: Linux kernel set the FUA bit O_DSYNC=NO : Linux kernel
> > > do not set the FUA bit
> > >
> > > Your concern is 2).
> > >
> > > My concern is 1) and SPDK API spdk_bdev_flush() is for hardware and
> > > cannot assure that queued IO is done in Linux kernel.
> > > I think 1) and 2) should be controlled by different parameters but
> currently not.
> > >
> > > Hence O_DSYNC=YES/NO should be controllable at least.
> > >
> > >
> > > About performance evaluation, our performance team usually
> > > - take 5 hours for preconditioning
> > > - take only 10 seconds for test run but they observe stable data due to
> enough preconditioning.
> > >
> > > However I took only 10 minutes for preconditioning and 10 seconds
> > > for test run respectively due to availability.
> > > Unfortunately I will not be able to use the machine and I wish Cunyin will
> have good data.
> > >
> > > Thank you,
> > > Shuhei
> > >
> > > > -----Original Message-----
> > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
> > > > Benjamin
> > > > Sent: Wednesday, November 15, 2017 3:01 AM
> > > > To: spdk(a)lists.01.org
> > > > Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > > > BDEV_AIO
> > > >
> > > > On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,
> SHUUHEI wrote:
> > > > > Hi,
> > > > >
> > > > > Current BDEV AIO uses O_DIRECT to avoid IO cache effects but
> > > > > does not use O_DSYNC.
> > > > > O_DSYNC assures that IO is written to persistent storage.
> > > > >
> > > > > About the difference of IO command sequence, for SCSI disk
> > > > > O_DSYNC issues extra IO command and it may affect IO performance
> > > > > but for NVMe-SSD O_DSYNC issues no extra IO command.
> > > > > Hence I estimate this is the reason of indifference of performance.
> > > > >
> > > >
> > > > O_DIRECT avoids using operating system caches in system memory.
> > > > O_DSYNC avoids using volatile caches in the SSD itself by setting
> > > > the FUA bit on I/O (Force Unit Access) and additionally by issuing
> > > > a SCSI synchronize cache command after each I/O on devices that
> > > > report having a volatile write cache. This is why you see an extra
> > > > command for SCSI devices - they are reporting
> > > that they support a volatile write cache. That's very common for a
> SAS/SATA HDD.
> > > >
> > > > The SPDK API provides the user with a mechanism to query whether a
> > > > block device has a volatile write cache
> > > > (spdk_bdev_has_write_cache) and an API to instruct the device to
> > > > make data in its volatile caches persistent (spdk_bdev_flush). The
> > > > existence of volatile write caches and the semantics around
> > > > flushing are well established in traditional block stacks and
> > > > provide significant performance benefits to some types of devices
> > > > (particularly lower end consumer grade devices), so we've chosen
> > > > to provide those same semantics in SPDK. Altering the spdk bdev
> > > > aio module to always specify O_DSYNC will greatly reduce the
> > > > performance on these types
> > > of devices, so I think choosing just O_DIRECT and not O_DSYNC is the
> correct choice for the flag.
> > > This provides the user the traditional semantics of flushing block devices
> that they expect.
> > > >
> > > > Note that the Intel P3700 does not have a volatile write cache, so
> > > > sending flush requests does nothing (it has a write cache, it just
> > > > isn't volatile). The reason Jim was asking about preconditioning
> > > > in another branch of this thread is because I don't think either
> > > > of us expect to see any performance difference on the Intel
> > > > P3700 when the O_DSYNC flag is added, regardless of workload. If
> > > > there is a difference even after preconditioning, then it certainly
> warrants investigation.
> > > >
> > > > Thanks Shuhei!
> > > >
> > > > Ben
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk


* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2017-11-16  0:39 UTC
  To: spdk


Hi Ben, Jim, Cunyin,

I want to add our team's experience with AIO + only O_DIRECT.
We have observed that:
- if only O_DIRECT is set, all IO succeeded even if the HDD was hot removed.
- if both O_DIRECT and O_DSYNC are set, all IO failed if the HDD was hot removed.

I'm not sure whether this implementation will be changed.

I hope this is of some help for your consideration.

Thank you,
Shuhei

> -----Original Message-----
> From: 松本周平 / MATSUMOTO,SHUUHEI
> Sent: Wednesday, November 15, 2017 3:03 PM
> To: spdk(a)lists.01.org
> Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> Hi Ben,
> 
> I did not expect but thanks to your very well explanation, my understanding improved for not only
> AIO but also hardware and SPDK.
> I started SPDK BDEV layer mentioned in the following and your explanation is very helpful.
> 
> Thank you,
> Shuhei
> 
> > -----Original Message-----
> > From: 松本周平 / MATSUMOTO,SHUUHEI
> > Sent: Wednesday, November 15, 2017 10:07 AM
> > To: spdk(a)lists.01.org
> > Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> >
> > Hi Ben, Jim, Cunyin,
> >
> > Thank you for your feedback.
> >
> > I talked with my colleague and considered again.
> > Let me clarify my concern.
> >
> > The issue is caused by the current implementation of Linux AIO.
> > O_DSYNC controls both of two controls.
> >
> > 1) Assurance of transaction completion on the software layer
> > O_DSYNC=YES: Linux kernel informs completion to the application after confirming queued IO is
> done.
> > O_DSYNC=NO : Linux kernel informs completion to the application
> > speculatively without confirming queued IO is done.
> >
> > 2) Avoid using volatile cache
> > O_DSYNC=YES: Linux kernel set the FUA bit O_DSYNC=NO : Linux kernel do
> > not set the FUA bit
> >
> > Your concern is 2).
> >
> > My concern is 1) and SPDK API spdk_bdev_flush() is for hardware and
> > cannot assure that queued IO is done in Linux kernel.
> > I think 1) and 2) should be controlled by different parameters but currently not.
> >
> > Hence O_DSYNC=YES/NO should be controllable at least.
> >
> >
> > About performance evaluation, our performance team usually
> > - take 5 hours for preconditioning
> > - take only 10 seconds for test run but they observe stable data due to enough preconditioning.
> >
> > However I took only 10 minutes for preconditioning and 10 seconds for
> > test run respectively due to availability.
> > Unfortunately I will not be able to use the machine and I wish Cunyin will have good data.
> >
> > Thank you,
> > Shuhei
> >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
> > > Benjamin
> > > Sent: Wednesday, November 15, 2017 3:01 AM
> > > To: spdk(a)lists.01.org
> > > Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > > BDEV_AIO
> > >
> > > On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > > > Hi,
> > > >
> > > > Current BDEV AIO uses O_DIRECT to avoid IO cache effects but does
> > > > not use O_DSYNC.
> > > > O_DSYNC assures that IO is written to persistent storage.
> > > >
> > > > About the difference of IO command sequence, for SCSI disk O_DSYNC
> > > > issues extra IO command and it may affect IO performance but for
> > > > NVMe-SSD O_DSYNC issues no extra IO command.
> > > > Hence I estimate this is the reason of indifference of performance.
> > > >
> > >
> > > O_DIRECT avoids using operating system caches in system memory.
> > > O_DSYNC avoids using volatile caches in the SSD itself by setting
> > > the FUA bit on I/O (Force Unit Access) and additionally by issuing a
> > > SCSI synchronize cache command after each I/O on devices that report
> > > having a volatile write cache. This is why you see an extra command
> > > for SCSI devices - they are reporting
> > that they support a volatile write cache. That's very common for a SAS/SATA HDD.
> > >
> > > The SPDK API provides the user with a mechanism to query whether a
> > > block device has a volatile write cache (spdk_bdev_has_write_cache)
> > > and an API to instruct the device to make data in its volatile
> > > caches persistent (spdk_bdev_flush). The existence of volatile write
> > > caches and the semantics around flushing are well established in
> > > traditional block stacks and provide significant performance
> > > benefits to some types of devices (particularly lower end consumer
> > > grade devices), so we've chosen to provide those same semantics in
> > > SPDK. Altering the spdk bdev aio module to always specify O_DSYNC
> > > will greatly reduce the performance on these types
> > of devices, so I think choosing just O_DIRECT and not O_DSYNC is the correct choice for the flag.
> > This provides the user the traditional semantics of flushing block devices that they expect.
> > >
> > > Note that the Intel P3700 does not have a volatile write cache, so
> > > sending flush requests does nothing (it has a write cache, it just
> > > isn't volatile). The reason Jim was asking about preconditioning in
> > > another branch of this thread is because I don't think either of us
> > > expect to see any performance difference on the Intel
> > > P3700 when the O_DSYNC flag is added, regardless of workload. If
> > > there is a difference even after preconditioning, then it certainly warrants investigation.
> > >
> > > Thanks Shuhei!
> > >
> > > Ben



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2017-11-15  6:02 UTC
  To: spdk


Hi Ben,

I did not expect it, but thanks to your very clear explanation my understanding has improved, not only of AIO but also of the hardware and SPDK.
I have just started working on the SPDK BDEV layer mentioned below, and your explanation is very helpful.

Thank you,
Shuhei

> -----Original Message-----
> From: 松本周平 / MATSUMOTO,SHUUHEI
> Sent: Wednesday, November 15, 2017 10:07 AM
> To: spdk(a)lists.01.org
> Subject: RE: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> Hi Ben, Jim, Cunyin,
> 
> Thank you for your feedback.
> 
> I talked with my colleague and considered again.
> Let me clarify my concern.
> 
> The issue is caused by the current implementation of Linux AIO.
> O_DSYNC controls both of two controls.
> 
> 1) Assurance of transaction completion on the software layer
> O_DSYNC=YES: Linux kernel informs completion to the application after confirming queued IO is done.
> O_DSYNC=NO : Linux kernel informs completion to the application speculatively without confirming
> queued IO is done.
> 
> 2) Avoid using volatile cache
> O_DSYNC=YES: Linux kernel set the FUA bit O_DSYNC=NO : Linux kernel do not set the FUA bit
> 
> Your concern is 2).
> 
> My concern is 1) and SPDK API spdk_bdev_flush() is for hardware and cannot assure that queued IO
> is done in Linux kernel.
> I think 1) and 2) should be controlled by different parameters but currently not.
> 
> Hence O_DSYNC=YES/NO should be controllable at least.
> 
> 
> About performance evaluation, our performance team usually
> - take 5 hours for preconditioning
> - take only 10 seconds for test run but they observe stable data due to enough preconditioning.
> 
> However I took only 10 minutes for preconditioning and 10 seconds for test run respectively due
> to availability.
> Unfortunately I will not be able to use the machine and I wish Cunyin will have good data.
> 
> Thank you,
> Shuhei
> 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker,
> > Benjamin
> > Sent: Wednesday, November 15, 2017 3:01 AM
> > To: spdk(a)lists.01.org
> > Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to
> > BDEV_AIO
> >
> > On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > > Hi,
> > >
> > > Current BDEV AIO uses O_DIRECT to avoid IO cache effects but does
> > > not use O_DSYNC.
> > > O_DSYNC assures that IO is written to persistent storage.
> > >
> > > About the difference of IO command sequence, for SCSI disk O_DSYNC
> > > issues extra IO command and it may affect IO performance but for
> > > NVMe-SSD O_DSYNC issues no extra IO command.
> > > Hence I estimate this is the reason of indifference of performance.
> > >
> >
> > O_DIRECT avoids using operating system caches in system memory.
> > O_DSYNC avoids using volatile caches in the SSD itself by setting the
> > FUA bit on I/O (Force Unit Access) and additionally by issuing a SCSI
> > synchronize cache command after each I/O on devices that report having
> > a volatile write cache. This is why you see an extra command for SCSI devices - they are reporting
> that they support a volatile write cache. That's very common for a SAS/SATA HDD.
> >
> > The SPDK API provides the user with a mechanism to query whether a
> > block device has a volatile write cache (spdk_bdev_has_write_cache)
> > and an API to instruct the device to make data in its volatile caches
> > persistent (spdk_bdev_flush). The existence of volatile write caches
> > and the semantics around flushing are well established in traditional
> > block stacks and provide significant performance benefits to some
> > types of devices (particularly lower end consumer grade devices), so
> > we've chosen to provide those same semantics in SPDK. Altering the
> > spdk bdev aio module to always specify O_DSYNC will greatly reduce the performance on these types
> of devices, so I think choosing just O_DIRECT and not O_DSYNC is the correct choice for the flag.
> This provides the user the traditional semantics of flushing block devices that they expect.
> >
> > Note that the Intel P3700 does not have a volatile write cache, so
> > sending flush requests does nothing (it has a write cache, it just
> > isn't volatile). The reason Jim was asking about preconditioning in
> > another branch of this thread is because I don't think either of us
> > expect to see any performance difference on the Intel
> > P3700 when the O_DSYNC flag is added, regardless of workload. If there
> > is a difference even after preconditioning, then it certainly warrants investigation.
> >
> > Thanks Shuhei!
> >
> > Ben



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: 松本周平 / MATSUMOTO,SHUUHEI @ 2017-11-15  1:06 UTC
  To: spdk


Hi Ben, Jim, Cunyin,

Thank you for your feedback.

I talked with my colleague and considered again.
Let me clarify my concern.

The issue is caused by the current implementation of Linux AIO.
O_DSYNC controls two separate things.

1) Assurance of IO completion at the software layer
O_DSYNC=YES: the Linux kernel reports completion to the application only after confirming the queued IO is done.
O_DSYNC=NO : the Linux kernel reports completion to the application speculatively, without confirming the queued IO is done.

2) Avoidance of the volatile write cache
O_DSYNC=YES: the Linux kernel sets the FUA bit.
O_DSYNC=NO : the Linux kernel does not set the FUA bit.

Your concern is 2).

My concern is 1): the SPDK API spdk_bdev_flush() targets the hardware and cannot ensure that IO queued in the Linux kernel is done.
I think 1) and 2) should be controlled by different parameters, but currently they are not.

Hence O_DSYNC=YES/NO should at least be made configurable.
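
(As a purely hypothetical sketch of what "configurable" could look like - this is not an existing SPDK option, and the parameter name is made up:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical sketch: let the user choose per AIO bdev whether O_DSYNC is
 * added. 'enable_dsync' is an imagined config parameter, not an existing
 * SPDK option. */
static int
open_aio_backing_file(const char *path, bool enable_dsync)
{
    int flags = O_RDWR | O_DIRECT;

    if (enable_dsync) {
        flags |= O_DSYNC;   /* completion only after the queued IO is done,
                             * plus FUA/flush on devices with a volatile cache */
    }

    return open(path, flags);
}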


Regarding performance evaluation, our performance team usually
- takes 5 hours for preconditioning
- takes only 10 seconds per test run, but observes stable data thanks to the long preconditioning.

However, due to machine availability, I took only 10 minutes for preconditioning and 10 seconds per test run.
Unfortunately I will not be able to use the machine any longer, and I hope Cunyin will get good data.

Thank you,
Shuhei

> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Wednesday, November 15, 2017 3:01 AM
> To: spdk(a)lists.01.org
> Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
> 
> On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> > Hi,
> >
> > Current BDEV AIO uses O_DIRECT to avoid IO cache effects but does not
> > use O_DSYNC.
> > O_DSYNC assures that IO is written to persistent storage.
> >
> > About the difference of IO command sequence, for SCSI disk O_DSYNC
> > issues extra IO command and it may affect IO performance but for
> > NVMe-SSD O_DSYNC issues no extra IO command.
> > Hence I estimate this is the reason of indifference of performance.
> >
> 
> O_DIRECT avoids using operating system caches in system memory. O_DSYNC avoids using volatile caches
> in the SSD itself by setting the FUA bit on I/O (Force Unit Access) and additionally by issuing
> a SCSI synchronize cache command after each I/O on devices that report having a volatile write
> cache. This is why you see an extra command for SCSI devices - they are reporting that they support
> a volatile write cache. That's very common for a SAS/SATA HDD.
> 
> The SPDK API provides the user with a mechanism to query whether a block device has a volatile
> write cache (spdk_bdev_has_write_cache) and an API to instruct the device to make data in its
> volatile caches persistent (spdk_bdev_flush). The existence of volatile write caches and the
> semantics around flushing are well established in traditional block stacks and provide significant
> performance benefits to some types of devices (particularly lower end consumer grade devices),
> so we've chosen to provide those same semantics in SPDK. Altering the spdk bdev aio module to always
> specify O_DSYNC will greatly reduce the performance on these types of devices, so I think choosing
> just O_DIRECT and not O_DSYNC is the correct choice for the flag. This provides the user the
> traditional semantics of flushing block devices that they expect.
> 
> Note that the Intel P3700 does not have a volatile write cache, so sending flush requests does
> nothing (it has a write cache, it just isn't volatile). The reason Jim was asking about
> preconditioning in another branch of this thread is because I don't think either of us expect to
> see any performance difference on the Intel
> P3700 when the O_DSYNC flag is added, regardless of workload. If there is a difference even after
> preconditioning, then it certainly warrants investigation.
> 
> Thanks Shuhei!
> 
> Ben



* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: Chang, Cunyin @ 2017-11-15  0:38 UTC
  To: spdk


Hi Jim,

All the SSDs have been preconditioned, but I did not run such a long test for the write case; I tested for 5 minutes and saw the performance drop
after about 10 seconds. I can make the test time longer to see if we get a different result.

-Cunyin

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 11:56 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi Shuhei, Cunyin, Ziye,

Do you see a difference between sequential write and random write?  Sequential writes with 40 jobs should behave similarly to random writes – even though each job is issuing writes sequentially, the 40 jobs combined will look like a random-like workload to the SSD.

A few additional questions:


1)      How long were these tests run for?  Especially for random write tests, they should be quite long (20-30 minutes) to ensure steady state.

2)      How much preconditioning was done on the SSD before starting the tests?  Intel recommends 90 minutes before running any type of random write workload.

Thanks,

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com<mailto:shuhei.matsumoto.xt(a)hitachi.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 1:26 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi Cunyin, Ziye,

The following is the sequential IO test.
The queue depth is 64 (not 128), and the NVMe SSD is not in a complete steady state either.
But both results suggest that O_DIRECT plus O_DSYNC is better.

Thank you,
Shuhei

O_DIRECT
4K sequential read, 40jobs,
read: IOPS=473k, BW=1850MiB/s (1939MB/s)(18.1GiB/10007msec)
read: IOPS=485k, BW=1896MiB/s (1988MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=60.2k, BW=236MiB/s (248MB/s)(2380MiB/10080msec)
write: IOPS=69.4k, BW=272MiB/s (285MB/s)(2741MiB/10077msec)

O_DIRECT|O_DSYNC
4K sequential read, 40jobs,
read: IOPS=469k, BW=1834MiB/s (1923MB/s)(17.9GiB/10006msec)
read: IOPS=485k, BW=1895MiB/s (1987MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=160k, BW=627MiB/s (657MB/s)(6280MiB/10019msec)
write: IOPS=118k, BW=461MiB/s (484MB/s)(4624MiB/10025msec)

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Chang, Cunyin
Sent: Tuesday, November 14, 2017 5:07 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Could you please also try the sequential write test:

This is my test result with P3700:
O_DIRECT
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/445.7MB/0KB /s] [0/114K/0 iops]
O_DIRECT + O_DSYNC
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/676.6MB/0KB /s] [0/173K/0 iops]

-Cunyin

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 3:42 PM
To: 'spdk(a)lists.01.org' <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi,

This may be related with the current being fixed issue Anirudh found.
Current BDEV AIO uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC assures that IO is written to persistent storage.

SPDK is for storage and hence I think O_DIRECT and O_DSYNC will be better for SPDK BDEV AIO.

Cunyin understood O_DSYNC but asked me the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team evaluated only functionality, I did simple performance test by FIO + NVMe-SSD (P3700) x 1.
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuild it.

I cannot create the complete steady state of NVMe-SSD to get the real world performance yet due to lack of time,
I did not notice major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks that NVMe-SSD is saturated and CPU utilization was less than 10% for all cases as long as I checked mpstat.
(Our performance team created the steady state and got the stable IO write performance 100K but I have not got it yet.
If I can create the complete steady state by taking more time I will be able to get 100K for write.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



About the difference of IO command sequence,
for SCSI disk O_DSYNC issues extra IO command and it may affect IO performance but
for NVMe-SSD O_DSYNC issues no extra IO command.
Hence I estimate this is the reason of indifference of performance.

I would appreciate your any feedback.

Thank you,
Shuhei Matsumoto


* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
From: Walker, Benjamin @ 2017-11-14 18:01 UTC
  To: spdk


On Tue, 2017-11-14 at 07:41 +0000, 松本周平 / MATSUMOTO,SHUUHEI wrote:
> Hi,
>  
> Current BDEV AIO uses O_DIRECT to avoid IO cache effects but does not use
> O_DSYNC.
> O_DSYNC assures that IO is written to persistent storage.
>  
> About the difference of IO command sequence,
> for SCSI disk O_DSYNC issues extra IO command and it may affect IO performance
> but
> for NVMe-SSD O_DSYNC issues no extra IO command.
> Hence I estimate this is the reason of indifference of performance.
>  

O_DIRECT avoids using operating system caches in system memory. O_DSYNC avoids
using volatile caches in the SSD itself by setting the FUA bit on I/O (Force
Unit Access) and additionally by issuing a SCSI synchronize cache command after
each I/O on devices that report having a volatile write cache. This is why you
see an extra command for SCSI devices - they are reporting that they support a
volatile write cache. That's very common for a SAS/SATA HDD.
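
To make that concrete: a write through an O_DSYNC descriptor carries roughly the same data-integrity guarantee as issuing an explicit fdatasync() after the write. A minimal userspace sketch of the two variants (the file name and the 4 KiB size are placeholder values, not anything taken from the bdev aio module):

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd;

        /* 4 KiB buffer aligned as required by O_DIRECT */
        if (posix_memalign(&buf, 4096, 4096) != 0)
                return 1;
        memset(buf, 0xA5, 4096);

        /* Variant 1: O_DSYNC - write() returns only once the data has
         * reached stable storage (the kernel uses FUA/flush as needed). */
        fd = open("testfile.bin", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0 || write(fd, buf, 4096) != 4096)
                return 1;
        close(fd);

        /* Variant 2: the same data-integrity point without O_DSYNC -
         * an explicit fdatasync() after the write. */
        fd = open("testfile.bin", O_WRONLY | O_DIRECT);
        if (fd < 0 || write(fd, buf, 4096) != 4096 || fdatasync(fd) != 0)
                return 1;
        close(fd);

        free(buf);
        return 0;
}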

The SPDK API provides the user with a mechanism to query whether a block device
has a volatile write cache (spdk_bdev_has_write_cache) and an API to instruct
the device to make data in its volatile caches persistent (spdk_bdev_flush). The
existence of volatile write caches and the semantics around flushing are well
established in traditional block stacks and provide significant performance
benefits to some types of devices (particularly lower end consumer grade
devices), so we've chosen to provide those same semantics in SPDK. Altering the
spdk bdev aio module to always specify O_DSYNC will greatly reduce the
performance on these types of devices, so I think choosing just O_DIRECT and not
O_DSYNC is the correct choice for the flag. This provides the user the
traditional semantics of flushing block devices that they expect.
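
As a reference, here is a minimal sketch of how a bdev consumer would use those two calls; the surrounding setup (opening the bdev and getting an I/O channel) is assumed and omitted:

#include "spdk/bdev.h"

static void
flush_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
        /* On success, previously completed writes in the flushed range
         * are now persistent. */
        spdk_bdev_free_io(bdev_io);
}

/* Make previously completed writes in [offset, offset + length) persistent. */
static int
make_writes_persistent(struct spdk_bdev *bdev, struct spdk_bdev_desc *desc,
                       struct spdk_io_channel *ch,
                       uint64_t offset, uint64_t length)
{
        if (!spdk_bdev_has_write_cache(bdev)) {
                /* No volatile write cache (e.g. the P3700): completed
                 * writes are already persistent, nothing to flush. */
                return 0;
        }

        return spdk_bdev_flush(desc, ch, offset, length, flush_done, NULL);
}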

Note that the Intel P3700 does not have a volatile write cache, so sending flush
requests does nothing (it has a write cache, it just isn't volatile). The reason
Jim was asking about preconditioning in another branch of this thread is because
I don't think either of us expect to see any performance difference on the Intel
P3700 when the O_DSYNC flag is added, regardless of workload. If there is a
difference even after preconditioning, then it certainly warrants investigation.

Thanks Shuhei!

Ben


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3274 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
@ 2017-11-14 15:56 Harris, James R
  0 siblings, 0 replies; 17+ messages in thread
From: Harris, James R @ 2017-11-14 15:56 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4653 bytes --]

Hi Shuhei, Cunyin, Ziye,

Do you see a difference between sequential write and random write?  Sequential writes with 40 jobs should behave similarly to random writes – even though each job is issuing writes sequentially, the 40 jobs combined will look like a random-like workload to the SSD.

A few additional questions:


1) How long were these tests run for?  Especially for random write tests, they should be quite long (20-30 minutes) to ensure steady state.

2) How much preconditioning was done on the SSD before starting the tests?  Intel recommends 90 minutes before running any type of random write workload.

Thanks,

-Jim


From: SPDK <spdk-bounces(a)lists.01.org> on behalf of 松本周平 / MATSUMOTO,SHUUHEI <shuhei.matsumoto.xt(a)hitachi.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, November 14, 2017 at 1:26 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi Cunyin, Ziye,

The following is the sequential IO test.
The queue depth is 64 (not 128), and the NVMe SSD is not in a complete steady state either.
But both results suggest that O_DIRECT with O_DSYNC is better.

Thank you,
Shuhei

O_DIRECT
4K sequential read, 40jobs,
read: IOPS=473k, BW=1850MiB/s (1939MB/s)(18.1GiB/10007msec)
read: IOPS=485k, BW=1896MiB/s (1988MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=60.2k, BW=236MiB/s (248MB/s)(2380MiB/10080msec)
write: IOPS=69.4k, BW=272MiB/s (285MB/s)(2741MiB/10077msec)

O_DIRECT|O_DSYNC
4K sequential read, 40jobs,
read: IOPS=469k, BW=1834MiB/s (1923MB/s)(17.9GiB/10006msec)
read: IOPS=485k, BW=1895MiB/s (1987MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=160k, BW=627MiB/s (657MB/s)(6280MiB/10019msec)
write: IOPS=118k, BW=461MiB/s (484MB/s)(4624MiB/10025msec)

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Chang, Cunyin
Sent: Tuesday, November 14, 2017 5:07 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Could you please also try the sequential write test:

This is my test result with P3700:
O_DIRECT
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/445.7MB/0KB /s] [0/114K/0 iops]
O_DIRECT + O_DSYNC
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/676.6MB/0KB /s] [0/173K/0 iops]

-Cunyin

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 3:42 PM
To: 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Subject: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi,

This may be related to the issue Anirudh found, which is currently being fixed.
The current BDEV AIO module uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC ensures that IO is written to persistent storage.

SPDK is for storage, so I think O_DIRECT together with O_DSYNC would be better for SPDK BDEV AIO.

Cunyin understood the purpose of O_DSYNC but asked me about the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team had evaluated only functionality, I ran a simple performance test with FIO and a single NVMe SSD (P3700).
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuilt it.

Due to lack of time I have not yet been able to bring the NVMe SSD to a complete steady state for real-world performance numbers,
but I did not notice a major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks like the NVMe SSD was saturated, and CPU utilization was less than 10% in all cases as far as I could tell from mpstat.
(Our performance team reached steady state and measured a stable write performance of 100K IOPS, but I have not reproduced that yet.
If I take more time to reach a complete steady state, I expect to reach 100K IOPS for writes as well.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



Regarding the difference in the IO command sequence:
for a SCSI disk, O_DSYNC issues an extra IO command, which may affect IO performance,
but for an NVMe SSD, O_DSYNC issues no extra IO command.
Hence I believe this is why there is no performance difference.

I would appreciate any feedback.

Thank you,
Shuhei Matsumoto

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 26513 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
@ 2017-11-14  9:31 
  0 siblings, 0 replies; 17+ messages in thread
From:  @ 2017-11-14  9:31 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3944 bytes --]

I did not expect this improvement. Thank you for covering these patterns and sharing the results.

Thank you,
Shuhei

From: 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 6:26 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: RE: Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi Cunyin, Ziye,

The following is the sequential IO test.
The queue depth is 64 (not 128), and the NVMe SSD is not in a complete steady state either.
But both results suggest that O_DIRECT with O_DSYNC is better.

Thank you,
Shuhei

O_DIRECT
4K sequential read, 40jobs,
read: IOPS=473k, BW=1850MiB/s (1939MB/s)(18.1GiB/10007msec)
read: IOPS=485k, BW=1896MiB/s (1988MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=60.2k, BW=236MiB/s (248MB/s)(2380MiB/10080msec)
write: IOPS=69.4k, BW=272MiB/s (285MB/s)(2741MiB/10077msec)

O_DIRECT|O_DSYNC
4K sequential read, 40jobs,
read: IOPS=469k, BW=1834MiB/s (1923MB/s)(17.9GiB/10006msec)
read: IOPS=485k, BW=1895MiB/s (1987MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=160k, BW=627MiB/s (657MB/s)(6280MiB/10019msec)
write: IOPS=118k, BW=461MiB/s (484MB/s)(4624MiB/10025msec)

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Chang, Cunyin
Sent: Tuesday, November 14, 2017 5:07 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Could you please also try the sequential write test:

This is my test result with P3700:
O_DIRECT
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/445.7MB/0KB /s] [0/114K/0 iops]
O_DIRECT + O_DSYNC
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/676.6MB/0KB /s] [0/173K/0 iops]

-Cunyin

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 3:42 PM
To: 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Subject: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi,

This may be related to the issue Anirudh found, which is currently being fixed.
The current BDEV AIO module uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC ensures that IO is written to persistent storage.

SPDK is for storage, so I think O_DIRECT together with O_DSYNC would be better for SPDK BDEV AIO.

Cunyin understood the purpose of O_DSYNC but asked me about the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team had evaluated only functionality, I ran a simple performance test with FIO and a single NVMe SSD (P3700).
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuilt it.

Due to lack of time I have not yet been able to bring the NVMe SSD to a complete steady state for real-world performance numbers,
but I did not notice a major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks like the NVMe SSD was saturated, and CPU utilization was less than 10% in all cases as far as I could tell from mpstat.
(Our performance team reached steady state and measured a stable write performance of 100K IOPS, but I have not reproduced that yet.
If I take more time to reach a complete steady state, I expect to reach 100K IOPS for writes as well.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



Regarding the difference in the IO command sequence:
for a SCSI disk, O_DSYNC issues an extra IO command, which may affect IO performance,
but for an NVMe SSD, O_DSYNC issues no extra IO command.
Hence I believe this is why there is no performance difference.

I would appreciate any feedback.

Thank you,
Shuhei Matsumoto

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 22241 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
@ 2017-11-14  9:26 
  0 siblings, 0 replies; 17+ messages in thread
From:  @ 2017-11-14  9:26 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3611 bytes --]

Hi Cunyin, Ziye,

The following is the sequential IO test.
The queue depth is 64 (not 128), and the NVMe SSD is not in a complete steady state either.
But both results suggest that O_DIRECT with O_DSYNC is better.

Thank you,
Shuhei

O_DIRECT
4K sequential read, 40jobs,
read: IOPS=473k, BW=1850MiB/s (1939MB/s)(18.1GiB/10007msec)
read: IOPS=485k, BW=1896MiB/s (1988MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=60.2k, BW=236MiB/s (248MB/s)(2380MiB/10080msec)
write: IOPS=69.4k, BW=272MiB/s (285MB/s)(2741MiB/10077msec)

O_DIRECT|O_DSYNC
4K sequential read, 40jobs,
read: IOPS=469k, BW=1834MiB/s (1923MB/s)(17.9GiB/10006msec)
read: IOPS=485k, BW=1895MiB/s (1987MB/s)(18.5GiB/10006msec)

4K sequential write, 40jobs
write: IOPS=160k, BW=627MiB/s (657MB/s)(6280MiB/10019msec)
write: IOPS=118k, BW=461MiB/s (484MB/s)(4624MiB/10025msec)

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Chang, Cunyin
Sent: Tuesday, November 14, 2017 5:07 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [!]Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Could you please also try the sequential write test:

This is my test result with P3700:
O_DIRECT
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/445.7MB/0KB /s] [0/114K/0 iops]
O_DIRECT + O_DSYNC
-bs=4K --iodepth=128 --rw=write
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/676.6MB/0KB /s] [0/173K/0 iops]

-Cunyin

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 3:42 PM
To: 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Subject: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi,

This may be related to the issue Anirudh found, which is currently being fixed.
The current BDEV AIO module uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC ensures that IO is written to persistent storage.

SPDK is for storage, so I think O_DIRECT together with O_DSYNC would be better for SPDK BDEV AIO.

Cunyin understood the purpose of O_DSYNC but asked me about the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team had evaluated only functionality, I ran a simple performance test with FIO and a single NVMe SSD (P3700).
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuilt it.

Due to lack of time I have not yet been able to bring the NVMe SSD to a complete steady state for real-world performance numbers,
but I did not notice a major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks like the NVMe SSD was saturated, and CPU utilization was less than 10% in all cases as far as I could tell from mpstat.
(Our performance team reached steady state and measured a stable write performance of 100K IOPS, but I have not reproduced that yet.
If I take more time to reach a complete steady state, I expect to reach 100K IOPS for writes as well.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



Regarding the difference in the IO command sequence:
for a SCSI disk, O_DSYNC issues an extra IO command, which may affect IO performance,
but for an NVMe SSD, O_DSYNC issues no extra IO command.
Hence I believe this is why there is no performance difference.

I would appreciate any feedback.

Thank you,
Shuhei Matsumoto

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 19831 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
@ 2017-11-14  7:47 Yang, Ziye
  0 siblings, 0 replies; 17+ messages in thread
From: Yang, Ziye @ 2017-11-14  7:47 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2308 bytes --]

Hi Shuhei,

Thanks for the performance test.  From your test results, it seems that there is nearly no performance difference when testing NVMe SSDs via the AIO bdev.

Thanks.

Best Regards,
Ziye Yang

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of 松本周平 / MATSUMOTO,SHUUHEI
Sent: Tuesday, November 14, 2017 3:42 PM
To: 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Subject: [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO

Hi,

This may be related to the issue Anirudh found, which is currently being fixed.
The current BDEV AIO module uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC ensures that IO is written to persistent storage.

SPDK is for storage, so I think O_DIRECT together with O_DSYNC would be better for SPDK BDEV AIO.

Cunyin understood the purpose of O_DSYNC but asked me about the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team had evaluated only functionality, I ran a simple performance test with FIO and a single NVMe SSD (P3700).
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuilt it.

Due to lack of time I have not yet been able to bring the NVMe SSD to a complete steady state for real-world performance numbers,
but I did not notice a major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks like the NVMe SSD was saturated, and CPU utilization was less than 10% in all cases as far as I could tell from mpstat.
(Our performance team reached steady state and measured a stable write performance of 100K IOPS, but I have not reproduced that yet.
If I take more time to reach a complete steady state, I expect to reach 100K IOPS for writes as well.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



Regarding the difference in the IO command sequence:
for a SCSI disk, O_DSYNC issues an extra IO command, which may affect IO performance,
but for an NVMe SSD, O_DSYNC issues no extra IO command.
Hence I believe this is why there is no performance difference.

I would appreciate any feedback.

Thank you,
Shuhei Matsumoto

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11589 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO
@ 2017-11-14  7:41 
  0 siblings, 0 replies; 17+ messages in thread
From:  @ 2017-11-14  7:41 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 1845 bytes --]

Hi,

This may be related to the issue Anirudh found, which is currently being fixed.
The current BDEV AIO module uses O_DIRECT to avoid IO cache effects but does not use O_DSYNC.
O_DSYNC ensures that IO is written to persistent storage.

SPDK is for storage, so I think O_DIRECT together with O_DSYNC would be better for SPDK BDEV AIO.
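
For concreteness, the proposal amounts to adding one flag where the aio bdev opens its backing file. The sketch below only illustrates that flag change; the helper name is hypothetical and this is not the actual bdev_aio source:

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>

static int
bdev_aio_open_file(const char *fname)   /* hypothetical helper name */
{
        /* Today: O_RDWR | O_DIRECT (bypass the page cache only).
         * Proposed: also set O_DSYNC so each completed write is persistent. */
        return open(fname, O_RDWR | O_DIRECT | O_DSYNC);
}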

Cunyin understood the purpose of O_DSYNC but asked me about the performance difference between O_DIRECT and (O_DIRECT|O_DSYNC).
Since our team had evaluated only functionality, I ran a simple performance test with FIO and a single NVMe SSD (P3700).
I modified FIO slightly (O_SYNC -> O_DSYNC) and rebuilt it.
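
The FIO change was of that kind as well; the snippet below is only a hypothetical illustration of such a one-line edit, not a quote of FIO's source:

#define _GNU_SOURCE
#include <fcntl.h>

/* Hypothetical illustration: wherever FIO ORs O_SYNC into the flags it
 * passes to open(), use O_DSYNC instead. */
static int
sync_open_flags(int base_flags)
{
        return base_flags | O_DSYNC;    /* was: base_flags | O_SYNC */
}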

Due to lack of time I have not yet been able to bring the NVMe SSD to a complete steady state for real-world performance numbers,
but I did not notice a major difference between O_DIRECT and (O_DIRECT|O_DSYNC).
It looks like the NVMe SSD was saturated, and CPU utilization was less than 10% in all cases as far as I could tell from mpstat.
(Our performance team reached steady state and measured a stable write performance of 100K IOPS, but I have not reproduced that yet.
If I take more time to reach a complete steady state, I expect to reach 100K IOPS for writes as well.)

O_DIRECT
4K random read, 40jobs,
   read: IOPS=478k, BW=1869MiB/s (1960MB/s)(18.3GiB/10007msec)

4K random write, 40jobs
  write: IOPS=68.4k, BW=268MiB/s (281MB/s)(2689MiB/10031msec)

O_DIRECT|O_DSYNC
4K random read, 40jobs,
   read: IOPS=477k, BW=1864MiB/s (1954MB/s)(18.2GiB/10007msec)

4K random write, 40jobs
  write: IOPS=72.0k, BW=286MiB/s (300MB/s)(2871MiB/10038msec)



Regarding the difference in the IO command sequence:
for a SCSI disk, O_DSYNC issues an extra IO command, which may affect IO performance,
but for an NVMe SSD, O_DSYNC issues no extra IO command.
Hence I believe this is why there is no performance difference.

I would appreciate any feedback.

Thank you,
Shuhei Matsumoto

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 9771 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-11-17  0:23 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-14  8:07 [SPDK] Set not only O_DIRECT but also O_DSYNC to BDEV_AIO Chang, Cunyin
  -- strict thread matches above, loose matches on Subject: below --
2017-11-17  0:23 
2017-11-17  0:08 
2017-11-16 17:59 Walker, Benjamin
2017-11-16 15:33 Harris, James R
2017-11-16  3:43 
2017-11-16  2:41 Chang, Cunyin
2017-11-16  0:39 
2017-11-15  6:02 
2017-11-15  1:06 
2017-11-15  0:38 Chang, Cunyin
2017-11-14 18:01 Walker, Benjamin
2017-11-14 15:56 Harris, James R
2017-11-14  9:31 
2017-11-14  9:26 
2017-11-14  7:47 Yang, Ziye
2017-11-14  7:41 
