Linux-ext4 Archive on lore.kernel.org
* Discussion: is it time to remove dioread_nolock?
@ 2019-12-26 15:31 Theodore Y. Ts'o
  2019-12-27 13:10 ` Joseph Qi
  2019-12-29 15:03 ` Xiaoguang Wang
  0 siblings, 2 replies; 16+ messages in thread
From: Theodore Y. Ts'o @ 2019-12-26 15:31 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: joseph.qi, Liu Bo

With the inclusion of Ritesh's inode lock scalability patches[1], the
traditional performance reasons for dioread_nolock --- namely,
removing the need to take an exclusive lock for Direct I/O read
operations --- have been removed.

[1] https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com

So... is it time to remove the code which supports dioread_nolock?
Doing so would simplify the code base, and reduce the test matrix.
This would also make it easier to restructure the write path when
allocating blocks so that the extent tree is updated after writing out
the data blocks, by clearing away the underbrush of dioread nolock
first.

If we do this, we'd leave the dioread_nolock mount option for
backwards compatibility, but it would be a no-op and not actually do
anything.

Any objections before I look into ripping out dioread_nolock?

The one possible concern that I considered was for Alibaba, which was
doing something interesting with dioread_nolock plus nodelalloc.  But
looking at Liu Bo's explanation[2], I believe that their workload
would be satisfied simply by using the standard ext4 mount options
(that is, the default mode has the performance benefits when doing
parallel DIO reads, and so the need for nodelalloc to mitigate the
tail latency concerns which Alibaba was seeing in their workload would
not be needed).  Could Liu or someone from Alibaba confirm, perhaps
with some benchmarks using their workload?

[2] https://lore.kernel.org/linux-ext4/20181121013035.ab4xp7evjyschecy@US-160370MP2.local/

					- Ted



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Discussion: is it time to remove dioread_nolock?
  2019-12-26 15:31 Discussion: is it time to remove dioread_nolock? Theodore Y. Ts'o
@ 2019-12-27 13:10 ` Joseph Qi
  2019-12-29 15:03 ` Xiaoguang Wang
  1 sibling, 0 replies; 16+ messages in thread
From: Joseph Qi @ 2019-12-27 13:10 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Ext4 Developers List; +Cc: Liu Bo, Xiaoguang Wang

Hi Ted,
After applying Ritesh's patches, my test results show no obvious
performance difference between the default mount options and
dioread_nolock (with either delalloc or nodelalloc).
I'm not sure whether dioread_nolock was used for some other purpose in
the scenario reported by Liu Bo. Maybe Xiaoguang could give some input.

Thanks,
Joseph


On 19/12/26 23:31, Theodore Y. Ts'o wrote:
> With inclusion of Ritesh's inode lock scalability patches[1], the
> traditional performance reasons for dioread_nolock --- namely,
> removing the need to take an exclusive lock for Direct I/O read
> operations --- has been removed.
> 
> [1] https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com
> 
> So... is it time to remove the code which supports dioread_nolock?
> Doing so would simplify the code base, and reduce the test matrix.
> This would also make it easier to restructure the write path when
> allocating blocks so that the extent tree is updated after writing out
> the data blocks, by clearing away the underbrush of dioread nolock
> first.
> 
> If we do this, we'd leave the dioread_nolock mount option for
> backwards compatibility, but it would be a no-op and not actually do
> anything.
> 
> Any objections before I look into ripping out dioread_nolock?
> 
> The one possible concern that I considered was for Alibaba, which was
> doing something interesting with dioread_nolock plus nodelalloc.  But
> looking at Liu Bo's explanation[2], I believe that their workload
> would be satisfied simply by using the standard ext4 mount options
> (that is, the default mode has the performance benefits when doing
> parallel DIO reads, and so the need for nodelalloc to mitigate the
> tail latency concerns which Alibaba was seeing in their workload would
> not be needed).  Could Liu or someone from Alibaba confirm, perhaps
> with some benchmarks using their workload?
> 
> [2] https://lore.kernel.org/linux-ext4/20181121013035.ab4xp7evjyschecy@US-160370MP2.local/
> 
>     	  	     	      	   	- Ted
> 


* Re: Discussion: is it time to remove dioread_nolock?
  2019-12-26 15:31 Discussion: is it time to remove dioread_nolock? Theodore Y. Ts'o
  2019-12-27 13:10 ` Joseph Qi
@ 2019-12-29 15:03 ` Xiaoguang Wang
  2020-01-06 12:24   ` Ritesh Harjani
  1 sibling, 1 reply; 16+ messages in thread
From: Xiaoguang Wang @ 2019-12-29 15:03 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Ext4 Developers List; +Cc: joseph.qi, Liu Bo

hi,

> With inclusion of Ritesh's inode lock scalability patches[1], the
> traditional performance reasons for dioread_nolock --- namely,
> removing the need to take an exclusive lock for Direct I/O read
> operations --- has been removed.
> 
> [1] https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com
> 
> So... is it time to remove the code which supports dioread_nolock?
> Doing so would simplify the code base, and reduce the test matrix.
> This would also make it easier to restructure the write path when
> allocating blocks so that the extent tree is updated after writing out
> the data blocks, by clearing away the underbrush of dioread nolock
> first.
> 
> If we do this, we'd leave the dioread_nolock mount option for
> backwards compatibility, but it would be a no-op and not actually do
> anything.
> 
> Any objections before I look into ripping out dioread_nolock?
> 
> The one possible concern that I considered was for Alibaba, which was
> doing something interesting with dioread_nolock plus nodelalloc.  But
> looking at Liu Bo's explanation[2], I believe that their workload
> would be satisfied simply by using the standard ext4 mount options
> (that is, the default mode has the performance benefits when doing
> parallel DIO reads, and so the need for nodelalloc to mitigate the
> tail latency concerns which Alibaba was seeing in their workload would
> not be needed).  Could Liu or someone from Alibaba confirm, perhaps
> with some benchmarks using their workload?
Currently we don't use dioread_nolock & nodelalloc on our internal
servers; we use dioread_nolock & delalloc widely, and it works well.

The initial reason we use dioread_nolock is that it also allocates
unwritten extents for buffered writes, so normally the corresponding
inode won't be added to the jbd2 transaction's t_inode_list.  While
committing a transaction, jbd2 then doesn't need to flush the inode's
dirty pages, so the transaction commits quickly.  Otherwise, in extreme
cases, the time taken to flush dirty inodes can be very long, especially
when cgroup writeback is enabled.  A previous discussion: https://marc.info/?l=linux-fsdevel&m=151799957104768&w=2
I think this semantic hidden behind dioread_nolock is also important,
so if we plan to remove the mount option, we should keep the same semantics.

Regards,
Xiaoguang Wang

> 
> [2] https://lore.kernel.org/linux-ext4/20181121013035.ab4xp7evjyschecy@US-160370MP2.local/
> 
>      	  	     	      	   	- Ted
> 


* Re: Discussion: is it time to remove dioread_nolock?
  2019-12-29 15:03 ` Xiaoguang Wang
@ 2020-01-06 12:24   ` Ritesh Harjani
  2020-01-07  0:43     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-06 12:24 UTC (permalink / raw)
  To: Xiaoguang Wang, Theodore Y. Ts'o, Ext4 Developers List, Jan Kara
  Cc: joseph.qi, Liu Bo



On 12/29/19 8:33 PM, Xiaoguang Wang wrote:
> hi,
> 
>> With inclusion of Ritesh's inode lock scalability patches[1], the
>> traditional performance reasons for dioread_nolock --- namely,
>> removing the need to take an exclusive lock for Direct I/O read
>> operations --- has been removed.
>>
>> [1] 
>> https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com
>>
>> So... is it time to remove the code which supports dioread_nolock?
>> Doing so would simplify the code base, and reduce the test matrix.
>> This would also make it easier to restructure the write path when
>> allocating blocks so that the extent tree is updated after writing out
>> the data blocks, by clearing away the underbrush of dioread nolock
>> first.
>>
>> If we do this, we'd leave the dioread_nolock mount option for
>> backwards compatibility, but it would be a no-op and not actually do
>> anything.
>>
>> Any objections before I look into ripping out dioread_nolock?
>>
>> The one possible concern that I considered was for Alibaba, which was
>> doing something interesting with dioread_nolock plus nodelalloc.  But
>> looking at Liu Bo's explanation[2], I believe that their workload
>> would be satisfied simply by using the standard ext4 mount options
>> (that is, the default mode has the performance benefits when doing
>> parallel DIO reads, and so the need for nodelalloc to mitigate the
>> tail latency concerns which Alibaba was seeing in their workload would
>> not be needed).  Could Liu or someone from Alibaba confirm, perhaps
>> with some benchmarks using their workload?
> Currently we don't use dioread_nolock & nodelalloc in our internal
> servers, and we use dioread_nolock & delalloc widely, it works well.
> 
> The initial reason we use dioread_nolock is that it'll also allocate
> unwritten extents for buffered write, and normally the corresponding
> inode won't be added to jbd2 transaction's t_inode_list, so while
> commiting transaction, it won't flush inodes' dirty pages, then
> transaction will commit quickly, otherwise in extream case, the time

I do notice this in ext4_map_blocks(): we add the inode to t_inode_list
only when we allocate written blocks.  I guess this was done to avoid
the stale data exposure problem.  So, due to ordered mode, we may end up
flushing all dirty data pages in the committing transaction before the
metadata is flushed.

Do you have any benchmarks or workloads where we could see this problem?
And could this actually be a problem with any real-world workload too?

Strictly speaking, the dioread_nolock mount option was not meant to
solve the problem you mentioned.
At least per the documentation, it is meant to address the inode lock
scalability issue in the DIO read case, which should mostly be fixed
by the recent changes Ted pointed to in the link above.



> taking to flush dirty inodes will be very big, especially cgroup writeback
> is enabled. A previous discussion: 
> https://marc.info/?l=linux-fsdevel&m=151799957104768&w=2
> I think this semantics hidden behind diread_nolock is also important,
> so if planning to remove this mount option, we should keep same semantics.


Jan/Ted, your opinion on this pls?

I do see that there was a proposal by Ted @ [1] which should
also solve this problem. I do have plans to work on Ted's proposal, but
meanwhile, should we preserve this mount option for the above-mentioned
use case? Or should we make it a no-op now?

Also, as an update, I am working on a patch which should fix the
stale data exposure issue that we currently have between ext4_page_mkwrite
& ext4 DIO reads. With that fixed, as discussed @ [2], we don't
need dioread_nolock, so I was also planning to make this mount option a
no-op. I will soon post the RFC version of that patch too.


[1] - https://marc.info/?l=linux-ext4&m=157244559501734&w=2

[2] - 
https://lore.kernel.org/linux-ext4/20191120143257.GE9509@quack2.suse.cz/


-ritesh


> 
> Regards,
> Xiaoguang Wang
> 
>>
>> [2] 
>> https://lore.kernel.org/linux-ext4/20181121013035.ab4xp7evjyschecy@US-160370MP2.local/ 
>>
>>
>>                                          - Ted
>>



* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-06 12:24   ` Ritesh Harjani
@ 2020-01-07  0:43     ` Theodore Y. Ts'o
  2020-01-07  8:22       ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-07  0:43 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Xiaoguang Wang, Ext4 Developers List, Jan Kara, joseph.qi, Liu Bo

On Mon, Jan 06, 2020 at 05:54:56PM +0530, Ritesh Harjani wrote:
> > The initial reason we use dioread_nolock is that it'll also allocate
> > unwritten extents for buffered write, and normally the corresponding
> > inode won't be added to jbd2 transaction's t_inode_list, so while
> > commiting transaction, it won't flush inodes' dirty pages, then
> > transaction will commit quickly, otherwise in extream case, the time
> 
> I do notice this in ext4_map_blocks(). We add inode to t_inode_list only
> in case if we allocate written blocks. I guess this was done to avoid
> stale data exposure problem. So now due to ordered mode, we may end up
> flushing all dirty data pages in committing transaction before the
> metadata is flushed.
> 
> Do you have any benchmarks or workload where we could see this problem?
> And could this actually be a problem with any real world workload too?

After thinking about this some more, I've changed my mind.

I think this is something which *can* be very noticeable in some real
world workloads.  If the workload is doing a lot of allocating,
buffered writes to an inode, and the writeback thread starts doing the
writeback for that inode right before a commit starts, then the commit
can take a long time.  The problem is that if the storage device is
particularly slow --- for example, a slow USB drive, or a 32 GiB
Standard Persistent Disk in Google Compute Engine (which has a
max sustained throughput of 3 MiB/s) --- it doesn't take a lot of queued
writeback I/O to trigger a hung task warning.  Even if hung task panic
isn't enabled, if the commit thread is busied out for a minute or two,
anything that is blocked on a commit completing --- such as an fsync(2)
system call --- could end up getting blocked for a long time, and that
could easily make a userspace application sad.

> Jan/Ted, your opinion on this pls?
> 
> I do see that there was a proposal by Ted @ [1] which should
> also solve this problem. I do have plans to work on Ted's proposal, but
> meanwhile, should we preserve this mount option for above mentioned use
> case? Or should we make it a no-op now?

> [1] - https://marc.info/?l=linux-ext4&m=157244559501734&w=2

I agree that this was not the original intent of dioread_nolock, but
until we can implement [1], dioread_nolock is the only real workaround
we have today.  (Well, data=writeback also works, but that
has other problems.)

If dropping dioread_nolock makes it easier to implement [1], we can
certainly make that one of the first patches in a patch series which
changes how ext4_writepages() works so that it writes the data blocks
before it updates the metadata blocks.  But unless there are some real
downsides to keeping the code around in the kernel until then, I'm not
sure it's worth it to take away the dioread_nolock functionality until
we have a good replacement --- even if that wasn't the original
purpose of the code.

What do other folks think?

					- Ted


* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-07  0:43     ` Theodore Y. Ts'o
@ 2020-01-07  8:22       ` Jan Kara
  2020-01-07 17:11         ` Theodore Y. Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-01-07  8:22 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Ritesh Harjani, Xiaoguang Wang, Ext4 Developers List, Jan Kara,
	joseph.qi, Liu Bo

On Mon 06-01-20 19:43:38, Theodore Y. Ts'o wrote:
> On Mon, Jan 06, 2020 at 05:54:56PM +0530, Ritesh Harjani wrote:
> > > The initial reason we use dioread_nolock is that it'll also allocate
> > > unwritten extents for buffered write, and normally the corresponding
> > > inode won't be added to jbd2 transaction's t_inode_list, so while
> > > commiting transaction, it won't flush inodes' dirty pages, then
> > > transaction will commit quickly, otherwise in extream case, the time
> > 
> > I do notice this in ext4_map_blocks(). We add inode to t_inode_list only
> > in case if we allocate written blocks. I guess this was done to avoid
> > stale data exposure problem. So now due to ordered mode, we may end up
> > flushing all dirty data pages in committing transaction before the
> > metadata is flushed.
> > 
> > Do you have any benchmarks or workload where we could see this problem?
> > And could this actually be a problem with any real world workload too?
> 
> After thinking about this some more, I've changed my mind.
> 
> I think this is something which *can* be very noticeable in some real
> world workloads.  If the workload is doing a lot of allocating,
> buffered writes to an inode, and the writeback thread starts doing the
> writeback for that inode right before a commit starts, then the commit
> can take a long time.  The problem is that if the storage device is
> particularly slow --- for example, a slow USB drive, or a 32 GiB
> Standard Persistent Disk in a Google Compute Environment (which has a
> max sustained throughput of 3 MiB/s), it doesn't take a lot of queued
> writeback I/O to trigger a hung task warning.  Even if hung task panic
> isn't enabled, if the commit thread is busied out for a minute or two,
> anything that is blocked on a commit completing --- such a fsync(2)
> system call, could end up getting blocked for a long time, and that
> could easily make a userspace application sad.

Yes, stalls on flushing inode data during transaction commits are
definitely real in some setups / for some workloads. Priority inversion
issues when heavily using cgroups to constrain IO are one of the things
Facebook people complained about.

> > Jan/Ted, your opinion on this pls?
> > 
> > I do see that there was a proposal by Ted @ [1] which should
> > also solve this problem. I do have plans to work on Ted's proposal, but
> > meanwhile, should we preserve this mount option for above mentioned use
> > case? Or should we make it a no-op now?
> 
> > [1] - https://marc.info/?l=linux-ext4&m=157244559501734&w=2
> 
> I agree that this was not the original intent of dioread_nolock, but I
> until we can implement [1], dioread_nolock is the only workaround real
> workaround we have today.  (Well, data=writeback also works, but that
> has other problems.)
> 
> If dropping dioread_nolock makes it easier to implement [1], we can
> certainly make that one of the first patches in a patch series which
> changes how we ext4_writepages() works so it writes the data blocks
> before it updates the metadata blocks.  But unless there are some real
> downsides to keeping the code around in the kernel until then, I'm not
> sure it's worth it to take away the diorad_nolock functionality until
> we have a good replacement --- even if that wasn't the original
> purpose of the code.
> 
> What do other folks think?

Since there are users relying on dioread_nolock to work around that
data=ordered limitation of ext4, I think it's fair to keep the mount option
in until we have a better workaround implemented. We can leave the option
with just the meaning "do data writeback using unwritten extents".

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-07  8:22       ` Jan Kara
@ 2020-01-07 17:11         ` Theodore Y. Ts'o
  2020-01-07 17:22           ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-07 17:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

Hmm..... There's actually an even more radical option we could use,
given that Ritesh has made dioread_nolock work on block sizes < page
size.  We could make dioread_nolock the default, until we can revamp
ext4_writepages() to write the data blocks first....

					- Ted


* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-07 17:11         ` Theodore Y. Ts'o
@ 2020-01-07 17:22           ` Jan Kara
  2020-01-08 10:45             ` Ritesh Harjani
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-01-07 17:22 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jan Kara, Ritesh Harjani, Xiaoguang Wang, Ext4 Developers List,
	joseph.qi, Liu Bo

On Tue 07-01-20 12:11:09, Theodore Y. Ts'o wrote:
> Hmm..... There's actually an even more radical option we could use,
> given that Ritesh has made dioread_nolock work on block sizes < page
> size.  We could make dioread_nolock the default, until we can revamp
> ext4_writepages() to write the data blocks first....

Yes, that's a good point. And I'm not opposed to that if it makes life
simpler. But I'd like to see some performance numbers showing how much
slower writeback using unwritten extents is, so that we don't introduce
too big a regression with this...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-07 17:22           ` Jan Kara
@ 2020-01-08 10:45             ` Ritesh Harjani
  2020-01-08 17:42               ` Theodore Y. Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-08 10:45 UTC (permalink / raw)
  To: Jan Kara, Theodore Y. Ts'o
  Cc: Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

Hello Ted/Jan,

On 1/7/20 10:52 PM, Jan Kara wrote:
> On Tue 07-01-20 12:11:09, Theodore Y. Ts'o wrote:
>> Hmm..... There's actually an even more radical option we could use,
>> given that Ritesh has made dioread_nolock work on block sizes < page
>> size.  We could make dioread_nolock the default, until we can revamp
>> ext4_writepages() to write the data blocks first....

Agreed. I guess it should be a straightforward change to make.
It should just be a matter of removing the test_opt(inode->i_sb,
DIOREAD_NOLOCK) check from ext4_should_dioread_nolock().

> 
> Yes, that's a good point. And I'm not opposed to that if it makes the life
> simpler. But I'd like to see some performance numbers showing how much is
> writeback using unwritten extents slower so that we don't introduce too big
> regression with this...
> 

Yes, let me try to get some performance numbers on my setup with
dioread_nolock as the default option for buffered writes.

AFAIU this should also fix the stale data exposure race between DIO
read and ext4_page_mkwrite, since we will be using unwritten extents
by default.
Currently I am testing the patch to fix this race, based on our
previous discussion, and will post it in any case. After that I can also
collect the performance numbers for this suggested option (making
dioread_nolock the default).


-ritesh



* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-08 10:45             ` Ritesh Harjani
@ 2020-01-08 17:42               ` Theodore Y. Ts'o
  2020-01-09  9:21                 ` Ritesh Harjani
  2020-01-09 12:34                 ` Jan Kara
  0 siblings, 2 replies; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-08 17:42 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

[-- Attachment #1: Type: text/plain, Size: 2268 bytes --]

On Wed, Jan 08, 2020 at 04:15:13PM +0530, Ritesh Harjani wrote:
> Hello Ted/Jan,
> 
> On 1/7/20 10:52 PM, Jan Kara wrote:
> > On Tue 07-01-20 12:11:09, Theodore Y. Ts'o wrote:
> > > Hmm..... There's actually an even more radical option we could use,
> > > given that Ritesh has made dioread_nolock work on block sizes < page
> > > size.  We could make dioread_nolock the default, until we can revamp
> > > ext4_writepages() to write the data blocks first....
> 
> Agreed. I guess it should be a straight forward change to make.
> It should be just removing test_opt(inode->i_sb, DIOREAD_NOLOCK) condition
> from ext4_should_dioread_nolock().

Actually, it's simpler than that.  In fs/ext4/super.c, around line
3730, after the comment:

	/* Set defaults before we parse the mount options */

Just add:

     set_opt(sb, DIOREAD_NOLOCK);

This will allow system administrators to revert back to the original
method using the somewhat confusingly named mount option
"dioread_lock".  (Maybe we can add an alias for that mount option so
it's less confusing.)

> > Yes, that's a good point. And I'm not opposed to that if it makes the life
> > simpler. But I'd like to see some performance numbers showing how much is
> > writeback using unwritten extents slower so that we don't introduce too big
> > regression with this...
> > 
> 
> Yes, let me try to get some performance numbers with dioread_nolock as
> the default option for buffered write on my setup.

I started running some performance runs last night, and the
interesting thing that I found was that fs_mark actually *improved*
with dioread_nolock (with fsync enabled).  That may be an example of
where fixing the commit latency caused by writeback can actually show
up in a measurable way in benchmarks.

Dbench was slightly impacted; I didn't see any real differences with
compilebench or postmark.  dioread_nolock did improve fio with
sequential reads, which is interesting, since I would have expected
that with the inode_lock improvements, there shouldn't have been any
difference.  So that may be a bit of weirdness that we should try to
understand.

See the attached tar file; open ext4-modes/index.html in a browser to
see the pretty graphs.  The raw numbers are in ext4/composite.xml.

Cheers,

						- Ted


[-- Attachment #2: pts.tar.xz --]
[-- Type: application/x-xz, Size: 27892 bytes --]


* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-08 17:42               ` Theodore Y. Ts'o
@ 2020-01-09  9:21                 ` Ritesh Harjani
  2020-01-09 16:38                   ` Theodore Y. Ts'o
  2020-01-09 12:34                 ` Jan Kara
  1 sibling, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-09  9:21 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo



On 1/8/20 11:12 PM, Theodore Y. Ts'o wrote:
> On Wed, Jan 08, 2020 at 04:15:13PM +0530, Ritesh Harjani wrote:
>> Hello Ted/Jan,
>>
>> On 1/7/20 10:52 PM, Jan Kara wrote:
>>> On Tue 07-01-20 12:11:09, Theodore Y. Ts'o wrote:
>>>> Hmm..... There's actually an even more radical option we could use,
>>>> given that Ritesh has made dioread_nolock work on block sizes < page
>>>> size.  We could make dioread_nolock the default, until we can revamp
>>>> ext4_writepages() to write the data blocks first....
>>
>> Agreed. I guess it should be a straight forward change to make.
>> It should be just removing test_opt(inode->i_sb, DIOREAD_NOLOCK) condition
>> from ext4_should_dioread_nolock().
> 
> Actually, it's simpler than that.  In fs/ext4/super.c, around line
> 3730, after the comment:
> 
> 	/* Set defaults before we parse the mount options */
> 
> Just add:
> 
>       set_opt(sb, DIOREAD_NOLOCK);

Yes, silly me.


> 
> This will allow system administrators to revert back to the original
> method using the someone confusingly named mount option,
> "dioread_lock".  (Maybe we can add a alias for that mount option so
> it's less confusing).
> 
>>> Yes, that's a good point. And I'm not opposed to that if it makes the life
>>> simpler. But I'd like to see some performance numbers showing how much is
>>> writeback using unwritten extents slower so that we don't introduce too big
>>> regression with this...
>>>
>>
>> Yes, let me try to get some performance numbers with dioread_nolock as
>> the default option for buffered write on my setup.
> 
> I started running some performance runs last night, and the
> interesting thing that I found was that fs_mark actually *improved*
> with dioread_nolock (with fsync enabled).  That may be an example of
> where fixing the commit latency caused by writeback can actually show
> up in a measurable way with benchmarks.
> 
> Dbench was slightly impacted; I didn't see any real differences with
> compilebench or postmark.  dioread_nolock did improve fio with
> sequential reads; which is interesting, since I would have expected

IIUC, these seq. read numbers are with --direct=1, bs=2MB, and
ioengine=libaio, correct?
So essentially it will do DIO AIO sequential reads.

Yes, it *does show* a big delta in the numbers. I also noticed a higher
deviation between the two runs with dioread_nolock.


> with the inode_lock improvements, there shouldn't have been any
> difference.  So that may be a bit of wierdness that we should try to
> understand.

So the inode_lock patches give an improvement in mixed read/write
workloads where inode exclusive locking was causing the bottleneck
earlier.


In this run, was encryption or fsverity enabled?
If so, I see that ext4_dio_supported() will return false and the I/O
will fall back to a buffered read.
Though even that can't explain the delta from merely enabling
dioread_nolock.


> 
> See the attached tar file; open ext4-modes/index.html in a browser to
> see the pretty graphs.  The raw numbers are in ext4/composite.xml.

The graphs and overview look really good. I will also look into PTS
sometime; it would be good to capture such reports.


ritesh



* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-08 17:42               ` Theodore Y. Ts'o
  2020-01-09  9:21                 ` Ritesh Harjani
@ 2020-01-09 12:34                 ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-01-09 12:34 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Ritesh Harjani, Jan Kara, Xiaoguang Wang, Ext4 Developers List,
	joseph.qi, Liu Bo

On Wed 08-01-20 12:42:59, Theodore Y. Ts'o wrote:
> On Wed, Jan 08, 2020 at 04:15:13PM +0530, Ritesh Harjani wrote:
> > > Yes, that's a good point. And I'm not opposed to that if it makes the life
> > > simpler. But I'd like to see some performance numbers showing how much is
> > > writeback using unwritten extents slower so that we don't introduce too big
> > > regression with this...
> > > 
> > 
> > Yes, let me try to get some performance numbers with dioread_nolock as
> > the default option for buffered write on my setup.
> 
> I started running some performance runs last night, and the

Thanks for the numbers! What is the difference between 'default-1' and
'default-2' configurations (and similarly between dioread_nolock-1 and -2
configurations)?

> interesting thing that I found was that fs_mark actually *improved*
> with dioread_nolock (with fsync enabled).  That may be an example of
> where fixing the commit latency caused by writeback can actually show
> up in a measurable way with benchmarks.

Yeah, that could be.

> Dbench was slightly impacted; I didn't see any real differences with
> compilebench or postmark.

Interestingly, dbench is also an fsync(2)-bound workload (because the
working set is too small for anything else to actually reach the disk on
contemporary systems). But file sizes with dbench are smaller (under 100k)
than with fs_mark (1MB), so probably that's what makes the difference.

>  dioread_nolock did improve fio with
> sequential reads; which is interesting, since I would have expected
> with the inode_lock improvements, there shouldn't have been any
> difference.  So that may be a bit of wierdness that we should try to
> understand.

Yes, this is indeed strange. I don't see anything in the DIO read path
where dioread_nolock would actually still matter.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-09  9:21                 ` Ritesh Harjani
@ 2020-01-09 16:38                   ` Theodore Y. Ts'o
  2020-01-14 23:30                     ` Ritesh Harjani
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-09 16:38 UTC (permalink / raw)
  To: Ritesh Harjani, Jan Kara
  Cc: Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On Thu, Jan 09, 2020 at 02:51:42PM +0530, Ritesh Harjani wrote:
> > Dbench was slightly impacted; I didn't see any real differences with
> > compilebench or postmark.  dioread_nolock did improve fio with
> > sequential reads; which is interesting, since I would have expected
> 
> > IIUC, these Seq. read numbers are with --direct=1 & bs=2MB & ioengine=libaio,
> correct?
> So essentially it will do a DIO AIO sequential read.

Correct.

>In this run, was encryption or fsverity enabled?
>If yes, then I see that ext4_dio_supported() will return
>false and it will fall back to buffered reads.

No, there was no encryption or fsverity enabled.  These runs were pure
stock ext4, with no additional features or mount options other than
the defaults --- and in the case of the dioread_nolock runs, just the
dioread_nolock mount option was added.


On Thu, Jan 09, 2020 at 01:34:27PM +0100, Jan Kara wrote:
> > 
> > I started running some performance runs last night, and the
> 
> Thanks for the numbers! What is the difference between 'default-1' and
> 'default-2' configurations (and similarly between dioread_nolock-1 and -2
> configurations)?

default-1 and default-2 are two runs using the default mke2fs and ext4
mount options.  So 4k blocks, no encryption, no fs-verity, delayed
allocation, and the kernel version includes the inode locking patches,
but not the dio overwrite patch (that's for the next merge window, and
this was 5.5-rc3).

dioread-1 and dioread-2 are two runs with the exact same file system,
the same VM and PD.  The file system would have been slightly aged
since I did not recreate the file system between each of the four
runs.

File system aging might explain some of the differences; the other
possibility is some kind of difference caused by differing amounts of
CPU and network availability on the host system (which could have
affected the Persistent Disk performance).  GCE does try to provide
enough isolation and throttling, however, and the outliers are
generally in the positive, not negative, direction, so I don't *think*
it's caused by host loading issues, but I can't 100% rule it out.

One of the things PTS does is to increase the number of runs until
the standard deviation drops below some percentage (4 or 5%, I think).
In particular, I've noticed that fs-mark sometimes requires quite a few
runs before the results fall within that criterion.  So it's quite
possible that the fs-mark variations might have been luck.  That's one
of the reasons I performed multiple runs: to see whether that was what
was going on.
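The repeat-until-stable criterion described above can be sketched as a small harness. This is a hypothetical stand-in for what PTS does internally; the threshold and run limits here are illustrative assumptions, not PTS's actual values:

```python
import statistics

def run_until_stable(run_benchmark, cv_threshold=0.04, min_runs=3, max_runs=15):
    """Repeat a benchmark until the coefficient of variation
    (stddev / mean) drops below cv_threshold, in the spirit of
    how the Phoronix Test Suite adds runs for noisy results.
    cv_threshold=0.04 is an assumed value, not PTS's real one."""
    results = []
    while len(results) < max_runs:
        results.append(run_benchmark())
        if len(results) >= min_runs:
            cv = statistics.stdev(results) / statistics.mean(results)
            if cv < cv_threshold:
                break
    return statistics.mean(results), results
```

A benchmark like fio, whose results are stable, would exit after the minimum number of runs; a noisy one like fs-mark would keep accumulating runs until the spread tightens or max_runs is hit.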

> >  dioread_nolock did improve fio with
> > sequential reads; which is interesting, since I would have expected
> > with the inode_lock improvements, there shouldn't have been any
> > difference.  So that may be a bit of weirdness that we should try to
> > understand.
> 
> Yes, this is indeed strange. I don't see anything in the DIO read path where
> dioread_nolock would actually still matter.

The fio runs do tend to be pretty stable, and generally only require
PTS to perform 3 runs to get stable results.  That's why the
variations and differences found when doing runs with and without
dioread_nolock were *very* unexpected.

						- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-09 16:38                   ` Theodore Y. Ts'o
@ 2020-01-14 23:30                     ` Ritesh Harjani
  2020-01-15 16:48                       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-14 23:30 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Jan Kara
  Cc: Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo



On 1/9/20 10:08 PM, Theodore Y. Ts'o wrote:
> On Thu, Jan 09, 2020 at 02:51:42PM +0530, Ritesh Harjani wrote:
>>> Dbench was slightly impacted; I didn't see any real differences with
>>> compilebench or postmark.  dioread_nolock did improve fio with
>>> sequential reads; which is interesting, since I would have expected
>>
>> IIUC, these Seq. read numbers are with --direct=1 & bs=2MB & ioengine=libaio,
>> correct?
>> So essentially it will do a DIO AIO sequential read.
> 
> Correct.

I too collected some performance numbers on my x86 box with
--direct=1, bs=4K/1M & ioengine=libaio, with default opt v/s 
dioread_nolock opt on latest ext4 git tree.

I found the delta to be within +/- 6% in all of the runs, which include
seq read, mixed rw & mixed randrw.
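For reference, the delta being compared is just the relative difference between the two averaged bandwidths. A throwaway helper like this (not part of any script actually posted in this thread) makes the convention explicit:

```python
def pct_delta(default_bw, dioread_bw):
    """Relative difference of the dioread_nolock result vs. the
    default-mount-options result, in percent; positive means
    dioread_nolock was faster."""
    return (dioread_bw - default_bw) / default_bw * 100.0

# e.g. the 1-thread 4K sequential-read averages reported later in
# this thread (138.93 vs 138.82 MB/s) differ by well under 1%.
```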

-ritesh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-14 23:30                     ` Ritesh Harjani
@ 2020-01-15 16:48                       ` Theodore Y. Ts'o
  2020-01-16  9:46                         ` Ritesh Harjani
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-15 16:48 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On Wed, Jan 15, 2020 at 05:00:53AM +0530, Ritesh Harjani wrote:
> 
> I too collected some performance numbers on my x86 box with
> --direct=1, bs=4K/1M & ioengine=libaio, with default opt v/s dioread_nolock
> opt on latest ext4 git tree.
> 
> I found the delta to be within +/- 6% in all of the runs, which include seq
> read, mixed rw & mixed randrw.

Thanks for taking the performance measurements!

Are you able to release more detail of what the deltas were for those
tests?  And how stable were those results across repeated runs?

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-15 16:48                       ` Theodore Y. Ts'o
@ 2020-01-16  9:46                         ` Ritesh Harjani
  0 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-16  9:46 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

[-- Attachment #1: Type: text/plain, Size: 5058 bytes --]

Hello Ted,

On 1/15/20 10:18 PM, Theodore Y. Ts'o wrote:
> On Wed, Jan 15, 2020 at 05:00:53AM +0530, Ritesh Harjani wrote:
>>
>> I too collected some performance numbers on my x86 box with
>> --direct=1, bs=4K/1M & ioengine=libaio, with default opt v/s dioread_nolock
>> opt on latest ext4 git tree.
>>
>> I found the delta to be within +/- 6% in all of the runs, which include seq
>> read, mixed rw & mixed randrw.
> 
> Thanks for taking the performance measurements!
> 
> Are you able to release more detail of what the deltas were for those
> tests?  And how stable were those results across repeated runs?
> 

I collected these numbers using fio. All of the column values pasted
below are averages of 3 runs. I checked the individual runs too and saw
that even with run-to-run variation the delta stayed within +/- 5%.

The naming of the individual runs' *.json files is a bit weird, and it
will take some time if I have to publish the individual runs. But do
let me know if that is required as well; I can change the scripts to
print the individual runs' numbers too.

I have also pasted the fio job files I used for my testing.
I used libaio as the ioengine (with iodepth=16) in the SeqRead case and
psync in the others.


Performance numbers:
===================

(seq) read (4K) - libaio
========================
threads     default_opt_read(MB/s)   dioread_nolock_opt_read(MB/s)
----------  ----------------------   -----------------------------
1           138.928059895833         138.824869791667
8           129.345052083333         124.472005208333
24          71.6555989583333         72.275390625

(Not sure why the 24-thread case has lower perf numbers, but on
increasing iodepth to 32, I could see an increase in bw with the
24-thread case too.)


(seq) read (1M) - libaio
========================
threads     default_opt_read(MB/s)   dioread_nolock_opt_read(MB/s)
----------  ----------------------   -----------------------------
1           138.905598958333         138.832682291667
8           111.263997395833         109.301106770833
24          121.895182291667         127.75390625


randrw(4K) - read (psync)
=========================
threads     default_opt_read [KB/s]  dioread_opt_read [KB/s]
----------  -----------------------  -----------------------
1           414.666666666667         410.0
8           780.0                    792.333333333333
24          967.333333333333         991.666666666667


randrw(4K) - write (psync)
==========================
threads     default_opt_write [KB/s]  dioread_opt_write [KB/s]
----------  ------------------------  ------------------------
1           418.0                     413.666666666667
8           796.666666666667          809.666666666667
24          981.333333333333          1007.66666666667


randrw(1M) - read (psync)
=========================
threads     default_opt_read(MB/s)   dioread_nolock_opt_read(MB/s)
----------  ----------------------   -----------------------------
1           39.5693359375            39.7288411458333
8           44.5179036458333         47.9098307291667
24          50.2861328125            51.720703125


randrw(1M) - write (psync)
==========================
threads     default_opt_write(MB/s)  dioread_nolock_opt_write(MB/s)
----------  -----------------------  ------------------------------
1           41.4583333333333         41.5068359375
8           46.0768229166667         49.568359375
24          49.5947265625            50.7083333333333


rw(1M) - read (psync)
=====================
threads     default_opt_read(MB/s)   dioread_nolock_opt_read(MB/s)
----------  ----------------------   -----------------------------
1           43.6458333333333         43.6770833333333
8           48.1178385416667         49.2718098958333
24          50.5703125               53.7890625


rw(1M) - write (psync)
======================
threads     default_opt_write(MB/s)  dioread_nolock_opt_write(MB/s)
----------  -----------------------  ------------------------------
1           45.5065104166667         45.5654296875
8           49.7431640625            51.0690104166667
24          50.2493489583333         53.3463541666667



FIO FILES
=========

dio_read.fio
============
[global]
         ioengine=libaio
         rw=read
         runtime=60
         iodepth=16
         direct=1
         size=10G
         filename=./testfile
         group_reporting=1
         thread=1
         ;bs=$bs
         ;numjobs=24

[fio-run]


dio_randrw.fio
==============
[global]
         ioengine=psync
         rw=randrw
         runtime=60
         iodepth=1
         direct=1
         size=10G
         filename=./testfile
         group_reporting=1
         thread=1
         ;bs=$bs
         ;numjobs=24

[fio-run]

dio_rw.fio
==========
[global]
         ioengine=psync
         rw=rw
         runtime=60
         iodepth=1
         direct=1
         size=10G
         filename=./testfile
         group_reporting=1
         thread=1
         ;bs=$bs
         ;numjobs=24

[fio-run]
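The commented `;bs=$bs` and `;numjobs=24` placeholders in the job files above suggest they were driven by an outer sweep over block sizes and thread counts. A wrapper along these lines (hypothetical, not Ritesh's actual script) could generate the fio command lines for that grid, overriding the placeholders via fio's command-line options:

```python
def fio_cmdline(jobfile, bs, numjobs):
    """Build the fio invocation for one (block size, thread count)
    cell of the sweep; fio command-line options override the
    commented placeholders in the job file."""
    return ["fio", jobfile,
            "--bs={}".format(bs),
            "--numjobs={}".format(numjobs),
            "--output-format=json"]

def sweep(jobfile, block_sizes=("4k", "1M"), thread_counts=(1, 8, 24)):
    """Yield one command line per cell, matching the 4K/1M x
    1/8/24-thread grid reported earlier in this thread."""
    for bs in block_sizes:
        for numjobs in thread_counts:
            yield fio_cmdline(jobfile, bs, numjobs)
```

Each generated command line would then be run (e.g. via subprocess) once per mount configuration, with the JSON output collected for averaging.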


Thanks
-ritesh




^ permalink raw reply	[flat|nested] 16+ messages in thread


Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-26 15:31 Discussion: is it time to remove dioread_nolock? Theodore Y. Ts'o
2019-12-27 13:10 ` Joseph Qi
2019-12-29 15:03 ` Xiaoguang Wang
2020-01-06 12:24   ` Ritesh Harjani
2020-01-07  0:43     ` Theodore Y. Ts'o
2020-01-07  8:22       ` Jan Kara
2020-01-07 17:11         ` Theodore Y. Ts'o
2020-01-07 17:22           ` Jan Kara
2020-01-08 10:45             ` Ritesh Harjani
2020-01-08 17:42               ` Theodore Y. Ts'o
2020-01-09  9:21                 ` Ritesh Harjani
2020-01-09 16:38                   ` Theodore Y. Ts'o
2020-01-14 23:30                     ` Ritesh Harjani
2020-01-15 16:48                       ` Theodore Y. Ts'o
2020-01-16  9:46                         ` Ritesh Harjani
2020-01-09 12:34                 ` Jan Kara
