* Discussion: is it time to remove dioread_nolock?

From: Theodore Y. Ts'o @ 2019-12-26 15:31 UTC
To: Ext4 Developers List; +Cc: joseph.qi, Liu Bo

With the inclusion of Ritesh's inode lock scalability patches [1], the
traditional performance reason for dioread_nolock --- namely, removing
the need to take an exclusive lock for Direct I/O read operations ---
has been removed.

[1] https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com

So... is it time to remove the code which supports dioread_nolock?
Doing so would simplify the code base and reduce the test matrix.
Clearing away the underbrush of dioread_nolock first would also make
it easier to restructure the write path when allocating blocks, so
that the extent tree is updated only after the data blocks have been
written out.

If we do this, we'd leave the dioread_nolock mount option in place for
backwards compatibility, but it would be a no-op and not actually do
anything.

Any objections before I look into ripping out dioread_nolock?

The one possible concern I considered was for Alibaba, which was doing
something interesting with dioread_nolock plus nodelalloc.  But
looking at Liu Bo's explanation [2], I believe their workload would be
satisfied simply by using the standard ext4 mount options: the default
mode now has the performance benefits when doing parallel DIO reads,
so nodelalloc would no longer be needed to mitigate the tail latency
Alibaba was seeing in their workload.  Could Liu or someone from
Alibaba confirm, perhaps with some benchmarks using their workload?

[2] https://lore.kernel.org/linux-ext4/20181121013035.ab4xp7evjyschecy@US-160370MP2.local/

- Ted

^ permalink raw reply [flat|nested] 16+ messages in thread
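[For readers unfamiliar with the option: dioread_nolock is enabled, or
explicitly disabled, at mount time.  A minimal administrative sketch;
the device and mount point below are placeholders, not anything from
this thread:]

```shell
# Mount with the option under discussion
# (device and mount point are hypothetical)
mount -t ext4 -o dioread_nolock /dev/vdb /mnt/data

# Confirm which options the kernel actually applied
grep /mnt/data /proc/mounts
```

[Since this requires a real block device, it is shown only as an
illustrative fragment.]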
* Re: Discussion: is it time to remove dioread_nolock?

From: Joseph Qi @ 2019-12-27 13:10 UTC
To: Theodore Y. Ts'o, Ext4 Developers List; +Cc: Liu Bo, Xiaoguang Wang

Hi Ted,

After applying Ritesh's patches, my test results show no obvious
performance difference between the default mount options and
dioread_nolock (with either delalloc or nodelalloc).

I'm not sure whether dioread_nolock was used for some other purpose in
the scenario Liu Bo reported.  Maybe Xiaoguang can give some input.

Thanks,
Joseph

On 19/12/26 23:31, Theodore Y. Ts'o wrote:
> With the inclusion of Ritesh's inode lock scalability patches [1], the
> traditional performance reason for dioread_nolock --- namely, removing
> the need to take an exclusive lock for Direct I/O read operations ---
> has been removed.
>
> [...]
>
> Could Liu or someone from Alibaba confirm, perhaps with some
> benchmarks using their workload?
>
> - Ted
* Re: Discussion: is it time to remove dioread_nolock?

From: Xiaoguang Wang @ 2019-12-29 15:03 UTC
To: Theodore Y. Ts'o, Ext4 Developers List; +Cc: joseph.qi, Liu Bo

hi,

> So... is it time to remove the code which supports dioread_nolock?
> Doing so would simplify the code base and reduce the test matrix.
>
> [...]
>
> Could Liu or someone from Alibaba confirm, perhaps with some
> benchmarks using their workload?

Currently we don't use dioread_nolock & nodelalloc on our internal
servers; we use dioread_nolock & delalloc widely, and it works well.

The initial reason we use dioread_nolock is that it also allocates
unwritten extents for buffered writes, so the corresponding inode
normally won't be added to the jbd2 transaction's t_inode_list.  While
committing a transaction, jbd2 then won't have to flush that inode's
dirty pages, and the transaction commits quickly.  Otherwise, in
extreme cases, the time taken to flush dirty inodes can be very long,
especially when cgroup writeback is enabled.  A previous discussion:
https://marc.info/?l=linux-fsdevel&m=151799957104768&w=2

I think the semantics hidden behind dioread_nolock are also important,
so if we plan to remove this mount option, we should keep the same
semantics.

Regards,
Xiaoguang Wang
* Re: Discussion: is it time to remove dioread_nolock?

From: Ritesh Harjani @ 2020-01-06 12:24 UTC
To: Xiaoguang Wang, Theodore Y. Ts'o, Ext4 Developers List, Jan Kara
Cc: joseph.qi, Liu Bo

On 12/29/19 8:33 PM, Xiaoguang Wang wrote:
>> So... is it time to remove the code which supports dioread_nolock?
>> [...]
>
> Currently we don't use dioread_nolock & nodelalloc in our internal
> servers, and we use dioread_nolock & delalloc widely, it works well.
>
> The initial reason we use dioread_nolock is that it'll also allocate
> unwritten extents for buffered write, and normally the corresponding
> inode won't be added to jbd2 transaction's t_inode_list, so while
> committing the transaction, it won't flush the inodes' dirty pages,
> then the transaction will commit quickly, otherwise in extreme cases,
> the time

I do notice this in ext4_map_blocks().  We add the inode to
t_inode_list only when we allocate written blocks.  I guess this was
done to avoid the stale data exposure problem.  So in ordered mode we
may end up flushing all dirty data pages in the committing transaction
before the metadata is flushed.

Do you have any benchmarks or workloads where we could see this
problem?  And could it actually be a problem for any real-world
workload too?

Strictly speaking, the dioread_nolock mount option was not meant to
solve the problem you mention.  At least per the documentation, it is
meant to address the inode lock scalability issue in the DIO read
case, which should mostly be fixed by the recent changes Ted pointed
to in the link above.

> taken to flush dirty inodes will be very long, especially when cgroup
> writeback is enabled.  A previous discussion:
> https://marc.info/?l=linux-fsdevel&m=151799957104768&w=2
>
> I think the semantics hidden behind dioread_nolock are also important,
> so if we plan to remove this mount option, we should keep the same
> semantics.

Jan/Ted, your opinion on this please?

I do see that there was a proposal by Ted at [1] which should also
solve this problem.  I have plans to work on Ted's proposal, but
meanwhile, should we preserve this mount option for the use case
mentioned above, or should we make it a no-op now?

Also, as an update, I am working on a patch which should fix the stale
data exposure issue we currently have with ext4_page_mkwrite and ext4
DIO reads.  With that fixed, it was discussed at [2] that we don't
need dioread_nolock, so I was also planning to make this mount option
a no-op.  Will soon post the RFC version of that patch too.

[1] https://marc.info/?l=linux-ext4&m=157244559501734&w=2
[2] https://lore.kernel.org/linux-ext4/20191120143257.GE9509@quack2.suse.cz/

-ritesh
* Re: Discussion: is it time to remove dioread_nolock?

From: Theodore Y. Ts'o @ 2020-01-07 0:43 UTC
To: Ritesh Harjani
Cc: Xiaoguang Wang, Ext4 Developers List, Jan Kara, joseph.qi, Liu Bo

On Mon, Jan 06, 2020 at 05:54:56PM +0530, Ritesh Harjani wrote:
> I do notice this in ext4_map_blocks().  We add the inode to
> t_inode_list only when we allocate written blocks.  I guess this was
> done to avoid the stale data exposure problem.  So in ordered mode we
> may end up flushing all dirty data pages in the committing transaction
> before the metadata is flushed.
>
> Do you have any benchmarks or workloads where we could see this
> problem?  And could it actually be a problem for any real-world
> workload too?

After thinking about this some more, I've changed my mind.

I think this is something which *can* be very noticeable in some
real-world workloads.  If the workload is doing a lot of allocating
buffered writes to an inode, and the writeback thread starts doing
writeback for that inode right before a commit starts, the commit can
take a long time.  The problem is that if the storage device is
particularly slow --- for example, a slow USB drive, or a 32 GiB
Standard Persistent Disk in Google Compute Engine (which has a maximum
sustained throughput of 3 MiB/s) --- it doesn't take a lot of queued
writeback I/O to trigger a hung task warning.  Even if the hung task
panic isn't enabled, if the commit thread is busied out for a minute
or two, anything blocked on a commit completing --- such as an
fsync(2) system call --- could end up blocked for a long time, and
that could easily make a userspace application sad.

> Jan/Ted, your opinion on this please?
>
> I do see that there was a proposal by Ted at [1] which should also
> solve this problem.  I have plans to work on Ted's proposal, but
> meanwhile, should we preserve this mount option for the use case
> mentioned above, or should we make it a no-op now?
>
> [1] https://marc.info/?l=linux-ext4&m=157244559501734&w=2

I agree that this was not the original intent of dioread_nolock, but
until we can implement [1], dioread_nolock is the only real workaround
we have today.  (Well, data=writeback also works, but that has other
problems.)

If dropping dioread_nolock makes it easier to implement [1], we can
certainly make that one of the first patches in a patch series which
changes how ext4_writepages() works so that it writes the data blocks
before it updates the metadata blocks.  But unless there are some real
downsides to keeping the code around in the kernel until then, I'm not
sure it's worth taking away the dioread_nolock functionality until we
have a good replacement --- even if that wasn't the original purpose
of the code.

What do other folks think?

- Ted
* Re: Discussion: is it time to remove dioread_nolock?

From: Jan Kara @ 2020-01-07 8:22 UTC
To: Theodore Y. Ts'o
Cc: Ritesh Harjani, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On Mon 06-01-20 19:43:38, Theodore Y. Ts'o wrote:
> After thinking about this some more, I've changed my mind.
>
> I think this is something which *can* be very noticeable in some
> real-world workloads.  If the workload is doing a lot of allocating
> buffered writes to an inode, and the writeback thread starts doing
> writeback for that inode right before a commit starts, the commit can
> take a long time.
>
> [...]

Yes, stalls on flushing inode data during transaction commits are
definitely real in some setups and for some workloads.  Priority
inversion when cgroups are heavily used to constrain IO is one of the
things Facebook people have complained about.

> I agree that this was not the original intent of dioread_nolock, but
> until we can implement [1], dioread_nolock is the only real workaround
> we have today.  (Well, data=writeback also works, but that has other
> problems.)
>
> [...]
>
> What do other folks think?

As long as there are users who rely on dioread_nolock to work around
that data=ordered limitation of ext4, I think it's fair to keep the
mount option until we have a better workaround implemented.  We can
keep the option just with the meaning "do data writeback using
unwritten extents".

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: Discussion: is it time to remove dioread_nolock?

From: Theodore Y. Ts'o @ 2020-01-07 17:11 UTC
To: Jan Kara
Cc: Ritesh Harjani, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

Hmm..... There's actually an even more radical option we could use,
given that Ritesh has made dioread_nolock work on block sizes < page
size.  We could make dioread_nolock the default, until we can revamp
ext4_writepages() to write the data blocks first....

- Ted
* Re: Discussion: is it time to remove dioread_nolock?

From: Jan Kara @ 2020-01-07 17:22 UTC
To: Theodore Y. Ts'o
Cc: Jan Kara, Ritesh Harjani, Xiaoguang Wang, Ext4 Developers List,
joseph.qi, Liu Bo

On Tue 07-01-20 12:11:09, Theodore Y. Ts'o wrote:
> Hmm..... There's actually an even more radical option we could use,
> given that Ritesh has made dioread_nolock work on block sizes < page
> size.  We could make dioread_nolock the default, until we can revamp
> ext4_writepages() to write the data blocks first....

Yes, that's a good point, and I'm not opposed to it if it makes life
simpler.  But I'd like to see some performance numbers showing how
much slower writeback using unwritten extents is, so that we don't
introduce too big a regression with this...

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: Discussion: is it time to remove dioread_nolock?

From: Ritesh Harjani @ 2020-01-08 10:45 UTC
To: Jan Kara, Theodore Y. Ts'o
Cc: Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

Hello Ted/Jan,

On 1/7/20 10:52 PM, Jan Kara wrote:
> On Tue 07-01-20 12:11:09, Theodore Y. Ts'o wrote:
>> Hmm..... There's actually an even more radical option we could use,
>> given that Ritesh has made dioread_nolock work on block sizes < page
>> size.  We could make dioread_nolock the default, until we can revamp
>> ext4_writepages() to write the data blocks first....

Agreed.  I guess it should be a straightforward change to make: just
remove the test_opt(inode->i_sb, DIOREAD_NOLOCK) condition from
ext4_should_dioread_nolock().

> Yes, that's a good point.  And I'm not opposed to that if it makes
> life simpler.  But I'd like to see some performance numbers showing
> how much slower writeback using unwritten extents is, so that we
> don't introduce too big a regression with this...

Yes, let me try to get some performance numbers with dioread_nolock as
the default option for buffered writes on my setup.

AFAIU this should also fix the stale data exposure race between DIO
reads and ext4_page_mkwrite, since we will by default be using
unwritten extents.  Currently I am testing the patch to fix this race,
which is based on our previous discussion; I will post that anyway.
After that I can also collect the performance numbers for this
suggested option (making dioread_nolock the default).

-ritesh
* Re: Discussion: is it time to remove dioread_nolock?

From: Theodore Y. Ts'o @ 2020-01-08 17:42 UTC
To: Ritesh Harjani
Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On Wed, Jan 08, 2020 at 04:15:13PM +0530, Ritesh Harjani wrote:
> Agreed.  I guess it should be a straightforward change to make: just
> remove the test_opt(inode->i_sb, DIOREAD_NOLOCK) condition from
> ext4_should_dioread_nolock().

Actually, it's simpler than that.  In fs/ext4/super.c, around line
3730, after the comment:

	/* Set defaults before we parse the mount options */

just add:

	set_opt(sb, DIOREAD_NOLOCK);

This will allow system administrators to revert back to the original
behavior using the somewhat confusingly named mount option
"dioread_lock".  (Maybe we can add an alias for that mount option so
it's less confusing.)

> Yes, let me try to get some performance numbers with dioread_nolock as
> the default option for buffered writes on my setup.

I started running some performance runs last night, and the
interesting thing I found was that fs_mark actually *improved* with
dioread_nolock (with fsync enabled).  That may be an example of where
fixing the commit latency caused by writeback can actually show up in
a measurable way in benchmarks.

Dbench was slightly impacted; I didn't see any real differences with
compilebench or postmark.  dioread_nolock did improve fio with
sequential reads, which is interesting, since I would have expected
that with the inode_lock improvements there shouldn't have been any
difference.  So that may be a bit of weirdness that we should try to
understand.

See the attached tar file; open ext4-modes/index.html in a browser to
see the pretty graphs.  The raw numbers are in ext4/composite.xml.

Cheers,

- Ted

[-- Attachment #2: pts.tar.xz --]
[-- Type: application/x-xz, Size: 27892 bytes --]
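[The sequential-read fio job being compared can be approximated from
the parameters mentioned later in the thread.  A hedged
reconstruction: only --direct=1, a 2MB block size, and
ioengine=libaio are stated in the thread; the queue depth, file size,
runtime, and filename below are assumptions:]

```shell
# Hypothetical reconstruction of the sequential-read fio invocation
# (iodepth, size, runtime, and filename are assumed, not from the thread)
fio --name=seq-read --rw=read --bs=2m --direct=1 --ioengine=libaio \
    --iodepth=16 --size=4g --runtime=60 --time_based \
    --filename=/mnt/test/fio.dat
```

[The same invocation with dioread_nolock v/s default mount options on
the target filesystem would reproduce the comparison discussed here.]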
* Re: Discussion: is it time to remove dioread_nolock?

From: Ritesh Harjani @ 2020-01-09 9:21 UTC
To: Theodore Y. Ts'o
Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On 1/8/20 11:12 PM, Theodore Y. Ts'o wrote:
> Actually, it's simpler than that.  In fs/ext4/super.c, around line
> 3730, after the comment:
>
> 	/* Set defaults before we parse the mount options */
>
> just add:
>
> 	set_opt(sb, DIOREAD_NOLOCK);

Yes, silly me.

> Dbench was slightly impacted; I didn't see any real differences with
> compilebench or postmark.  dioread_nolock did improve fio with
> sequential reads, which is interesting, since I would have expected

IIUC, these sequential read numbers are with --direct=1, bs=2MB and
ioengine=libaio, correct?  So essentially it will do DIO AIO
sequential reads.  Yes, it *does show* a big delta in the numbers.  I
also noticed a higher deviation between the two runs with
dioread_nolock.

> that with the inode_lock improvements there shouldn't have been any
> difference.  So that may be a bit of weirdness that we should try to
> understand.

The inode_lock patches give an improvement in mixed read/write
workloads, where the inode's exclusive locking was the bottleneck
earlier.

In this run, was encryption or fsverity enabled?  If yes, then I see
that ext4_dio_supported() will return false and the read will fall
back to a buffered read.  Though even that can't explain the delta
from enabling dioread_nolock alone.

> See the attached tar file; open ext4-modes/index.html in a browser to
> see the pretty graphs.  The raw numbers are in ext4/composite.xml.

The graphs and overview look really good.  I will also check out PTS
sometime; it will be good to capture such reports.

ritesh
* Re: Discussion: is it time to remove dioread_nolock?

From: Theodore Y. Ts'o @ 2020-01-09 16:38 UTC
To: Ritesh Harjani, Jan Kara
Cc: Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On Thu, Jan 09, 2020 at 02:51:42PM +0530, Ritesh Harjani wrote:
> IIUC, these sequential read numbers are with --direct=1, bs=2MB and
> ioengine=libaio, correct?  So essentially it will do DIO AIO
> sequential reads.

Correct.

> In this run, was encryption or fsverity enabled?  If yes, then I see
> that ext4_dio_supported() will return false and the read will fall
> back to a buffered read.

No, there was no encryption or fsverity enabled.  These runs were pure
stock ext4, with no additional features or mount options other than
the defaults --- and in the case of the dioread_nolock runs, just the
dioread_nolock mount option was added.

On Thu, Jan 09, 2020 at 01:34:27PM +0100, Jan Kara wrote:
> Thanks for the numbers!  What is the difference between 'default-1'
> and 'default-2' configurations (and similarly between dioread_nolock-1
> and -2 configurations)?

default-1 and default-2 are two runs using the default mke2fs and ext4
mount options: 4k blocks, no encryption, no fs-verity, delayed
allocation, and a kernel that includes the inode locking patches but
not the dio overwrite patch (that's for the next merge window, and
this was 5.5-rc3).  dioread-1 and dioread-2 are two runs with the
exact same file system, the same VM, and the same PD.

The file system would have been slightly aged, since I did not
recreate it between each of the four runs.  File system aging might
explain some of the differences; the other possibility is some kind of
difference caused by varying amounts of CPU and network availability
on the host system (which could have affected the Persistent Disk
performance).  GCE does try to provide enough isolation and
throttling, however, and the outliers are generally in the positive
rather than negative direction, so I don't *think* host loading is the
cause, but I can't 100% rule it out.

One of the things PTS does is to increase the number of runs until the
standard deviation drops below some percentage (4 or 5%, I think).  In
particular, I've noticed that fs_mark sometimes requires quite a few
runs before the results fall within that criterion.  So it's quite
possible that the fs_mark variations were luck.  That's one of the
reasons I performed multiple runs: to see whether that was what was
going on.

> > dioread_nolock did improve fio with sequential reads, which is
> > interesting, since I would have expected that with the inode_lock
> > improvements there shouldn't have been any difference.  So that may
> > be a bit of weirdness that we should try to understand.
>
> Yes, this is indeed strange.  I don't see anything in the DIO read
> path where dioread_nolock would still matter.

The fio runs do tend to be pretty stable, and generally only require
PTS to perform 3 runs to get stable results.  That's why the
variations and differences found when running with and without
dioread_nolock were *very* unexpected.

- Ted
* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-09 16:38 ` Theodore Y. Ts'o
@ 2020-01-14 23:30   ` Ritesh Harjani
  2020-01-15 16:48     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-14 23:30 UTC (permalink / raw)
To: Theodore Y. Ts'o, Jan Kara
Cc: Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On 1/9/20 10:08 PM, Theodore Y. Ts'o wrote:
> On Thu, Jan 09, 2020 at 02:51:42PM +0530, Ritesh Harjani wrote:
>>> Dbench was slightly impacted; I didn't see any real differences with
>>> compilebench or postmark.  dioread_nolock did improve fio with
>>> sequential reads; which is interesting, since I would have expected
>>
>> IIUC, these Seq. read numbers are with --direct=1 & bs=2MB & ioengine=libaio,
>> correct?
>> So essentially it will do a DIO AIO sequential read.
>
> Correct.

I too collected some performance numbers on my x86 box with
--direct=1, bs=4K/1M & ioengine=libaio, with the default options v/s
the dioread_nolock option on the latest ext4 git tree.

I found the delta to be within +/- 6% in all of the runs, which
include seq read, mixed rw & mixed randrw.

-ritesh

^ permalink raw reply	[flat|nested] 16+ messages in thread
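The +/- 6% criterion Ritesh applies is a simple relative delta between the two configurations. A minimal sketch --- the helper name is made up, and the sample inputs are the 8-thread 4K sequential-read averages he posts later in the thread:

```python
def pct_delta(baseline, candidate):
    """Percentage change of the candidate result relative to the baseline."""
    return (candidate - baseline) / baseline * 100.0

# 8-thread 4K sequential read, MB/s: default options vs. dioread_nolock
delta = pct_delta(129.345052083333, 124.472005208333)
print(f"dioread_nolock vs default: {delta:+.2f}%")
```

For this pair the delta works out to roughly -3.8%, comfortably inside the reported +/- 6% band.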
* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-14 23:30 ` Ritesh Harjani
@ 2020-01-15 16:48   ` Theodore Y. Ts'o
  2020-01-16  9:46     ` Ritesh Harjani
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-15 16:48 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

On Wed, Jan 15, 2020 at 05:00:53AM +0530, Ritesh Harjani wrote:
>
> I too collected some performance numbers on my x86 box with
> --direct=1, bs=4K/1M & ioengine=libaio, with default opt v/s dioread_nolock
> opt on latest ext4 git tree.
>
> I found the delta to be within +/- 6% in all of the runs which includes, seq
> read, mixed rw & mixed randrw.

Thanks for taking the performance measurements!

Are you able to release more detail of what the deltas were for those
tests?  And how stable were those results across repeated runs?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-15 16:48 ` Theodore Y. Ts'o
@ 2020-01-16  9:46   ` Ritesh Harjani
  0 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2020-01-16 9:46 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Jan Kara, Xiaoguang Wang, Ext4 Developers List, joseph.qi, Liu Bo

[-- Attachment #1: Type: text/plain, Size: 5058 bytes --]

Hello Ted,

On 1/15/20 10:18 PM, Theodore Y. Ts'o wrote:
> On Wed, Jan 15, 2020 at 05:00:53AM +0530, Ritesh Harjani wrote:
>>
>> I too collected some performance numbers on my x86 box with
>> --direct=1, bs=4K/1M & ioengine=libaio, with default opt v/s dioread_nolock
>> opt on latest ext4 git tree.
>>
>> I found the delta to be within +/- 6% in all of the runs which includes, seq
>> read, mixed rw & mixed randrw.
>
> Thanks for taking the performance measurements!
>
> Are you able to release more detail of what the deltas were for those
> tests?  And how stable were those results across repeated runs?

I have collected these numbers using fio.  All of the column values
shown in the attachment & pasted below are averages of 3 runs.  I
checked all those individual runs too and saw that even with the
run-to-run variations the delta was within +/- 5% only.

The naming of those individual runs' *.json files is a bit weird, and
it will take some time if I have to publish the individual runs.  But
do let me know if that as well is required; I can make some changes in
the scripts to print the individual runs' numbers too.

I have pasted the individual fio files too, which I used for my
testing.  I have used libaio as the ioengine (with iodepth=16) in the
SeqRead case and psync in the others.
Performance numbers:
====================

(seq) read (4K) - libaio
========================

threads    default_opt_read(MB/s)    dioread_nolock_opt_read(MB/s)
-------    ----------------------    -----------------------------
1          138.928059895833          138.824869791667
8          129.345052083333          124.472005208333
24         71.6555989583333          72.275390625

(Not sure why the 24 thread case has lower perf numbers, but on
increasing iodepth to 32, I could see an increase in bw with the 24
threads case too.)

(seq) read (1M) - libaio
========================

threads    default_opt_read(MB/s)    dioread_nolock_opt_read(MB/s)
-------    ----------------------    -----------------------------
1          138.905598958333          138.832682291667
8          111.263997395833          109.301106770833
24         121.895182291667          127.75390625

randrw(4K) - read (psync)
=========================

threads    default_opt_read [KB/s]    dioread_opt_read [KB/s]
-------    -----------------------    -----------------------
1          414.666666666667           410.0
8          780.0                      792.333333333333
24         967.333333333333           991.666666666667

randrw(4K) - write (psync)
==========================

threads    default_opt_write [KB/s]    dioread_opt_write [KB/s]
-------    ------------------------    ------------------------
1          418.0                       413.666666666667
8          796.666666666667            809.666666666667
24         981.333333333333            1007.66666666667

randrw(1M) - read (psync)
=========================

threads    default_opt_read(MB/s)    dioread_nolock_opt_read(MB/s)
-------    ----------------------    -----------------------------
1          39.5693359375             39.7288411458333
8          44.5179036458333          47.9098307291667
24         50.2861328125             51.720703125

randrw(1M) - write (psync)
==========================

threads    default_opt_write(MB/s)    dioread_nolock_opt_write(MB/s)
-------    -----------------------    ------------------------------
1          41.4583333333333           41.5068359375
8          46.0768229166667           49.568359375
24         49.5947265625              50.7083333333333

rw(1M) - read (psync)
=====================

threads    default_opt_read(MB/s)    dioread_nolock_opt_read(MB/s)
-------    ----------------------    -----------------------------
1          43.6458333333333          43.6770833333333
8          48.1178385416667          49.2718098958333
24         50.5703125                53.7890625

rw(1M) - write (psync)
======================

threads    default_opt_write(MB/s)    dioread_nolock_opt_write(MB/s)
-------    -----------------------    ------------------------------
1          45.5065104166667           45.5654296875
8          49.7431640625              51.0690104166667
24         50.2493489583333           53.3463541666667

FIO FILES
=========

dio_read.fio
============

[global]
ioengine=libaio
rw=read
runtime=60
iodepth=16
direct=1
size=10G
filename=./testfile
group_reporting=1
thread=1
;bs=$bs
;numjobs=24

[fio-run]

dio_randrw.fio
==============

[global]
ioengine=psync
rw=randrw
runtime=60
iodepth=1
direct=1
size=10G
filename=./testfile
group_reporting=1
thread=1
;bs=$bs
;numjobs=24

[fio-run]

dio_rw.fio
==========

[global]
ioengine=psync
rw=rw
runtime=60
iodepth=1
direct=1
size=10G
filename=./testfile
group_reporting=1
thread=1
;bs=$bs
;numjobs=24

[fio-run]

Thanks
-ritesh

[-- Attachment #2: datafile.txt --]
[-- Type: text/plain, Size: 3311 bytes --]

SeqRead (4K) - libaio
=====================

threads    default_opt[read_4K(MB/s)]    dioread_nolock_opt[read_4K(MB/s)]
-------    --------------------------    ---------------------------------
1          138.928059895833              138.824869791667
8          129.345052083333              124.472005208333
24         71.6555989583333              72.275390625

SeqRead(1M) - libaio
====================

threads    default_opt[read_1M(MB/s)]    dioread_nolock_opt[read_1M(MB/s)]
-------    --------------------------    ---------------------------------
1          138.905598958333              138.832682291667
8          111.263997395833              109.301106770833
24         121.895182291667              127.75390625

randrw(4K) - read (psync)
=========================

threads    default_opt_read [KB/s]    dioread_opt_read [KB/s]
-------    -----------------------    -----------------------
1          414.666666666667           410.0
8          780.0                      792.333333333333
24         967.333333333333           991.666666666667

randrw(4K) - write (psync)
==========================

threads    default_opt_write [KB/s]    dioread_opt_write [KB/s]
-------    ------------------------    ------------------------
1          418.0                       413.666666666667
8          796.666666666667            809.666666666667
24         981.333333333333            1007.66666666667

randrw(1M) - read (psync)
=========================

threads    default_opt[read_1M(MB/s)]    dioread_nolock_opt[read_1M(MB/s)]
-------    --------------------------    ---------------------------------
1          39.5693359375                 39.7288411458333
8          44.5179036458333              47.9098307291667
24         50.2861328125                 51.720703125

randrw(1M) - write (psync)
==========================

threads    default_opt[write_1M(MB/s)]    dioread_nolock_opt[write_1M(MB/s)]
-------    ---------------------------    ----------------------------------
1          41.4583333333333               41.5068359375
8          46.0768229166667               49.568359375
24         49.5947265625                  50.7083333333333

rw(1M) - read (psync)
=====================

threads    default_opt[read_1M(MB/s)]    dioread_nolock_opt[read_1M(MB/s)]
-------    --------------------------    ---------------------------------
1          43.6458333333333              43.6770833333333
8          48.1178385416667              49.2718098958333
24         50.5703125                    53.7890625

rw(1M) - write (psync)
======================

threads    default_opt[write_1M(MB/s)]    dioread_nolock_opt[write_1M(MB/s)]
-------    ---------------------------    ----------------------------------
1          45.5065104166667               45.5654296875
8          49.7431640625                  51.0690104166667
24         50.2493489583333               53.3463541666667

^ permalink raw reply	[flat|nested] 16+ messages in thread
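The commented-out `;bs=$bs` and `;numjobs=24` lines in the job files above suggest that block size and thread count were swept from the fio command line rather than edited into the file for each run. A hypothetical sketch of such a sweep --- the fio flags are standard, but the loop and the output-file naming are assumptions about Ritesh's scripts:

```python
# Build one fio invocation per (block size, thread count) combination,
# overriding the commented-out ;bs and ;numjobs settings via the CLI and
# collecting JSON results (the *.json files mentioned above).
def fio_cmd(bs, numjobs, jobfile="dio_read.fio"):
    return ["fio", f"--bs={bs}", f"--numjobs={numjobs}",
            "--output-format=json", f"--output=read-{bs}-{numjobs}.json",
            jobfile]

cmds = [fio_cmd(bs, jobs) for bs in ("4k", "1M") for jobs in (1, 8, 24)]
for cmd in cmds:
    print(" ".join(cmd))
```

Each command could then be handed to `subprocess.run` in turn; the averages of 3 such runs would reproduce the table layout shown in the attachment.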
* Re: Discussion: is it time to remove dioread_nolock?
  2020-01-08 17:42 ` Theodore Y. Ts'o
  2020-01-09  9:21   ` Ritesh Harjani
@ 2020-01-09 12:34   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-01-09 12:34 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Ritesh Harjani, Jan Kara, Xiaoguang Wang, Ext4 Developers List,
	joseph.qi, Liu Bo

On Wed 08-01-20 12:42:59, Theodore Y. Ts'o wrote:
> On Wed, Jan 08, 2020 at 04:15:13PM +0530, Ritesh Harjani wrote:
> > > Yes, that's a good point. And I'm not opposed to that if it makes life
> > > simpler. But I'd like to see some performance numbers showing how much
> > > slower writeback using unwritten extents is, so that we don't introduce
> > > too big a regression with this...
> >
> > Yes, let me try to get some performance numbers with dioread_nolock as
> > the default option for buffered write on my setup.
>
> I started running some performance runs last night, and the

Thanks for the numbers! What is the difference between the 'default-1'
and 'default-2' configurations (and similarly between the
dioread_nolock-1 and -2 configurations)?

> interesting thing that I found was that fs_mark actually *improved*
> with dioread_nolock (with fsync enabled).  That may be an example of
> where fixing the commit latency caused by writeback can actually show
> up in a measurable way with benchmarks.

Yeah, that could be.

> Dbench was slightly impacted; I didn't see any real differences with
> compilebench or postmark.

Interestingly, dbench is also an fsync(2)-bound workload (because the
working set is too small for anything else to actually reach the disk
in contemporary systems). But file sizes with dbench are smaller
(under 100k) than with fs_mark (1MB), so probably that's what makes
the difference.

> dioread_nolock did improve fio with
> sequential reads; which is interesting, since I would have expected
> with the inode_lock improvements, there shouldn't have been any
> difference.
> So that may be a bit of weirdness that we should try to
> understand.

Yes, this is indeed strange. I don't see anything in the DIO read path
where dioread_nolock would actually still matter.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2020-01-16 9:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-26 15:31 Discussion: is it time to remove dioread_nolock? Theodore Y. Ts'o
2019-12-27 13:10 ` Joseph Qi
2019-12-29 15:03 ` Xiaoguang Wang
2020-01-06 12:24   ` Ritesh Harjani
2020-01-07  0:43     ` Theodore Y. Ts'o
2020-01-07  8:22       ` Jan Kara
2020-01-07 17:11         ` Theodore Y. Ts'o
2020-01-07 17:22           ` Jan Kara
2020-01-08 10:45         ` Ritesh Harjani
2020-01-08 17:42           ` Theodore Y. Ts'o
2020-01-09  9:21             ` Ritesh Harjani
2020-01-09 16:38               ` Theodore Y. Ts'o
2020-01-14 23:30                 ` Ritesh Harjani
2020-01-15 16:48                   ` Theodore Y. Ts'o
2020-01-16  9:46                     ` Ritesh Harjani
2020-01-09 12:34             ` Jan Kara