* dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
@ 2019-10-26 1:56 zhangxiaoxu (A)
2019-10-27 2:56 ` Dmitry Fomichev
0 siblings, 1 reply; 4+ messages in thread
From: zhangxiaoxu (A) @ 2019-10-26 1:56 UTC (permalink / raw)
To: dmitry.fomichev, damien.lemoal, snitzer, dm-devel, Alasdair G Kergon
Hi all, when I run 'mkfs.ext4' on a dmz device backed by a 10TB SMR disk,
it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
properly handle backing device failure").
After deleting the 'check_events' call in 'dmz_bdev_is_dying', it takes
less than 12 minutes.
I tested this on the 4.19 branch.
Must we do the 'check_events' in the mapping path, reclaim, and metadata I/O?
Thanks.
* Re: dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
2019-10-26 1:56 dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure") zhangxiaoxu (A)
@ 2019-10-27 2:56 ` Dmitry Fomichev
2019-10-31 8:20 ` zhangxiaoxu (A)
0 siblings, 1 reply; 4+ messages in thread
From: Dmitry Fomichev @ 2019-10-27 2:56 UTC (permalink / raw)
To: dm-devel, agk, Damien Le Moal, zhangxiaoxu5, snitzer
Zhang,
I just did some testing of this scenario with a recent kernel that includes this patch.
The log below is a run in QEMU with 8 CPUs and it took 18.5 minutes to create the FS on a
14TB ATA drive. Doing the same thing on bare metal with 32 CPUs takes 10.5 minutes in my
environment. However, when doing the same test with a SAS drive, the run takes 43 minutes.
This is not quite the degradation you are observing, but still a big performance hit.
Is the disk that you are using SAS or SATA?
My current guess is that the sd driver may generate TEST UNIT READY commands to check whether
the drive is really online as part of check_events() processing. For ATA drives, this is
nearly a NOP since all TURs are completed internally in libata. But in the SCSI case, these
blocking TURs are issued to the drive and can certainly degrade performance.
The check_events() call was added to dmz_bdev_is_dying() because simply calling
blk_queue_dying() doesn't cover the situation where the drive gets offlined in the SCSI layer.
It might be possible to call check_events() only once before every reclaim run and avoid
calling it in the I/O mapping path. If that works, the overhead would likely be acceptable.
I am going to take a look into this.
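That once-per-interval approach can be sketched in plain userspace C (an illustration only, not actual kernel code; the struct, the field names, the interval value, and the stubbed sd_check_events() are all invented for this sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stub for the expensive event check (which issues a blocking TUR on SAS drives). */
static int tur_count;
static bool sd_check_events_stub(void)
{
    tur_count++;
    return true; /* pretend the device is still online */
}

/*
 * Rate-limited liveness check: invoke the expensive event check at most
 * once per interval; in between, report the cached state.
 */
struct dying_state {
    uint64_t last_check_ms;  /* timestamp of last real check */
    uint64_t interval_ms;    /* minimum gap between real checks */
    bool dying;              /* cached result */
};

static bool bdev_is_dying(struct dying_state *s, uint64_t now_ms)
{
    if (now_ms - s->last_check_ms >= s->interval_ms) {
        s->last_check_ms = now_ms;
        s->dying = !sd_check_events_stub();
    }
    return s->dying;
}
```

With a 1000ms interval, a mapping path issuing thousands of bios per second would trigger one real check per second instead of one per bio, while an offlined device would still be noticed within a second.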
Regards,
Dmitry
[root@xxx dmz]# uname -a
Linux xxx 5.4.0-rc1-DMZ+ #1 SMP Fri Oct 11 11:23:13 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@xxx dmz]# lsscsi
[0:0:0:0] disk QEMU QEMU HARDDISK 2.5+ /dev/sda
[1:0:0:0] zbc ATA HGST HSH721415AL T240 /dev/sdb
[root@xxx dmz]# ./setup-dmz test /dev/sdb
[root@xxx dmz]# cat /proc/kallsyms | grep dmz_bdev_is_dying
(standard input):90782:ffffffffc070a401 t dmz_bdev_is_dying.cold [dm_zoned]
(standard input):90849:ffffffffc0706e10 t dmz_bdev_is_dying [dm_zoned]
[root@xxx dmz]# time mkfs.ext4 /dev/mapper/test
mke2fs 1.44.6 (5-Mar-2019)
Discarding device blocks: done
Creating filesystem with 3660840960 4k blocks and 457605120 inodes
Filesystem UUID: 4536bacd-cfb5-41b2-b0bf-c2513e6e3360
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
real 18m30.867s
user 0m0.172s
sys 0m11.198s
On Sat, 2019-10-26 at 09:56 +0800, zhangxiaoxu (A) wrote:
> Hi all, when I run 'mkfs.ext4' on a dmz device backed by a 10TB SMR disk,
> it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
> properly handle backing device failure").
>
> After deleting the 'check_events' call in 'dmz_bdev_is_dying', it takes
> less than 12 minutes.
>
> I tested this on the 4.19 branch.
> Must we do the 'check_events' in the mapping path, reclaim, and metadata I/O?
>
> Thanks.
>
* Re: dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
2019-10-27 2:56 ` Dmitry Fomichev
@ 2019-10-31 8:20 ` zhangxiaoxu (A)
2019-11-06 23:00 ` Dmitry Fomichev
0 siblings, 1 reply; 4+ messages in thread
From: zhangxiaoxu (A) @ 2019-10-31 8:20 UTC (permalink / raw)
To: Dmitry Fomichev, dm-devel, agk, Damien Le Moal, snitzer
Hi Dmitry, thanks for your reply.
I also tested it on mainline; it also takes more than 1 hour.
My machine has 64 CPU cores and the disk is SATA.
During mkfs.ext4, I found that 'scsi_test_unit_ready' runs more than 1000 times
per second from different kworkers.
Each 'scsi_test_unit_ready' takes more than 200us, and the interval between them is
less than 20us.
So, I think your guess is right.
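As a rough sanity check on those figures (back-of-envelope arithmetic only; the helper below is hypothetical), more than 1000 TURs per second at more than 200us each means the drive spends at least 20% of wall-clock time servicing TURs instead of real I/O:

```c
#include <assert.h>

/* Fraction of wall-clock time spent servicing TURs, assuming they are
 * serialized at the device (hypothetical helper for the estimate above). */
static double tur_busy_fraction(double turs_per_sec, double tur_latency_us)
{
    return turs_per_sec * tur_latency_us / 1e6; /* busy us per second -> fraction */
}
```

With only a ~20us gap between back-to-back TURs, the worst case is close to 200/(200+20), i.e. roughly 90% of the device's time.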
But there is another question: why does the 4.19 branch take more than 10 hours?
I will keep working on it; if I find anything, I will reply to you.
Thanks.
my script:
dmzadm --format /dev/sdi
echo "0 21485322240 zoned /dev/sdi" | dmsetup create dmz-sdi
date; mkfs.ext4 /dev/mapper/dmz-sdi; date
mainline:
[root@localhost ~]# uname -a
Linux localhost 5.4.0-rc5 #1 SMP Thu Oct 31 11:41:20 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
Thu Oct 31 13:58:55 CST 2019
mke2fs 1.43.6 (29-Aug-2017)
Discarding device blocks: done
Creating filesystem with 2684354560 4k blocks and 335544320 inodes
Filesystem UUID: e0d8e01e-efa8-47fd-a019-b184e66f65b0
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
Thu Oct 31 15:01:01 CST 2019
after delete the 'check_events' on mainline:
[root@localhost ~]# uname -a
Linux localhost 5.4.0-rc5+ #2 SMP Thu Oct 31 15:07:36 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
Thu Oct 31 15:19:56 CST 2019
mke2fs 1.43.6 (29-Aug-2017)
Discarding device blocks: done
Creating filesystem with 2684354560 4k blocks and 335544320 inodes
Filesystem UUID: 735198e8-9df0-49fc-aaa8-23b0869dfa05
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
Thu Oct 31 15:30:51 CST 2019
On 2019/10/27 10:56, Dmitry Fomichev wrote:
> Zhang,
>
> I just did some testing of this scenario with a recent kernel that includes this patch.
>
> The log below is a run in QEMU with 8 CPUs and it took 18.5 minutes to create the FS on a
> 14TB ATA drive. Doing the same thing on bare metal with 32 CPUs takes 10.5 minutes in my
> environment. However, when doing the same test with a SAS drive, the run takes 43 minutes.
> This is not quite the degradation you are observing, but still a big performance hit.
>
> Is the disk that you are using SAS or SATA?
>
> My current guess is that the sd driver may generate TEST UNIT READY commands to check whether
> the drive is really online as part of check_events() processing. For ATA drives, this is
> nearly a NOP since all TURs are completed internally in libata. But in the SCSI case, these
> blocking TURs are issued to the drive and can certainly degrade performance.
>
> The check_events() call was added to dmz_bdev_is_dying() because simply calling
> blk_queue_dying() doesn't cover the situation where the drive gets offlined in the SCSI layer.
> It might be possible to call check_events() only once before every reclaim run and avoid
> calling it in the I/O mapping path. If that works, the overhead would likely be acceptable.
> I am going to take a look into this.
>
> Regards,
> Dmitry
>
> [root@xxx dmz]# uname -a
> Linux xxx 5.4.0-rc1-DMZ+ #1 SMP Fri Oct 11 11:23:13 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
> [root@xxx dmz]# lsscsi
> [0:0:0:0] disk QEMU QEMU HARDDISK 2.5+ /dev/sda
> [1:0:0:0] zbc ATA HGST HSH721415AL T240 /dev/sdb
> [root@xxx dmz]# ./setup-dmz test /dev/sdb
> [root@xxx dmz]# cat /proc/kallsyms | grep dmz_bdev_is_dying
> (standard input):90782:ffffffffc070a401 t dmz_bdev_is_dying.cold [dm_zoned]
> (standard input):90849:ffffffffc0706e10 t dmz_bdev_is_dying [dm_zoned]
> [root@xxx dmz]# time mkfs.ext4 /dev/mapper/test
> mke2fs 1.44.6 (5-Mar-2019)
> Discarding device blocks: done
> Creating filesystem with 3660840960 4k blocks and 457605120 inodes
> Filesystem UUID: 4536bacd-cfb5-41b2-b0bf-c2513e6e3360
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 2560000000
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (262144 blocks): done
> Writing superblocks and filesystem accounting information: done
>
>
> real 18m30.867s
> user 0m0.172s
> sys 0m11.198s
>
>
> On Sat, 2019-10-26 at 09:56 +0800, zhangxiaoxu (A) wrote:
>> Hi all, when I run 'mkfs.ext4' on a dmz device backed by a 10TB SMR disk,
>> it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
>> properly handle backing device failure").
>>
>> After deleting the 'check_events' call in 'dmz_bdev_is_dying', it takes
>> less than 12 minutes.
>>
>> I tested this on the 4.19 branch.
>> Must we do the 'check_events' in the mapping path, reclaim, and metadata I/O?
>>
>> Thanks.
>>
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
* Re: dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
2019-10-31 8:20 ` zhangxiaoxu (A)
@ 2019-11-06 23:00 ` Dmitry Fomichev
0 siblings, 0 replies; 4+ messages in thread
From: Dmitry Fomichev @ 2019-11-06 23:00 UTC (permalink / raw)
To: dm-devel, agk, Damien Le Moal, zhangxiaoxu5, snitzer
Hi Zhang,
I just posted the patch that fixes this issue. Could you please try it and let
me know how this patch works for you? In my testing, I don't see any excessive
TURs issued with this patch in place. It takes around 12 minutes to run
mkfs.ext4 on a freshly created dm-zoned device on top of a 14TB SCSI drive.
The same test on top of a 14TB SATA drive takes around 10 minutes. These are
direct attached drives on a physical server.
I didn't test this patch on the 4.19 kernel. If you have any findings about how
it behaves there, do let me know.
Regards,
Dmitry
On Thu, 2019-10-31 at 16:20 +0800, zhangxiaoxu (A) wrote:
> Hi Dmitry, thanks for your reply.
>
> I also tested it on mainline; it also takes more than 1 hour.
> My machine has 64 CPU cores and the disk is SATA.
>
> During mkfs.ext4, I found that 'scsi_test_unit_ready' runs more than 1000 times
> per second from different kworkers.
> Each 'scsi_test_unit_ready' takes more than 200us, and the interval between them is
> less than 20us.
> So, I think your guess is right.
>
> But there is another question: why does the 4.19 branch take more than 10 hours?
> I will keep working on it; if I find anything, I will reply to you.
>
> Thanks.
>
> my script:
> dmzadm --format /dev/sdi
> echo "0 21485322240 zoned /dev/sdi" | dmsetup create dmz-sdi
> date; mkfs.ext4 /dev/mapper/dmz-sdi; date
>
> mainline:
> [root@localhost ~]# uname -a
> Linux localhost 5.4.0-rc5 #1 SMP Thu Oct 31 11:41:20 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
>
> Thu Oct 31 13:58:55 CST 2019
> mke2fs 1.43.6 (29-Aug-2017)
> Discarding device blocks: done
> Creating filesystem with 2684354560 4k blocks and 335544320 inodes
> Filesystem UUID: e0d8e01e-efa8-47fd-a019-b184e66f65b0
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 2560000000
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (262144 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> Thu Oct 31 15:01:01 CST 2019
>
> after delete the 'check_events' on mainline:
> [root@localhost ~]# uname -a
> Linux localhost 5.4.0-rc5+ #2 SMP Thu Oct 31 15:07:36 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
> Thu Oct 31 15:19:56 CST 2019
> mke2fs 1.43.6 (29-Aug-2017)
> Discarding device blocks: done
> Creating filesystem with 2684354560 4k blocks and 335544320 inodes
> Filesystem UUID: 735198e8-9df0-49fc-aaa8-23b0869dfa05
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> 2560000000
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (262144 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> Thu Oct 31 15:30:51 CST 2019
>
> On 2019/10/27 10:56, Dmitry Fomichev wrote:
> > Zhang,
> >
> > I just did some testing of this scenario with a recent kernel that includes this patch.
> >
> > The log below is a run in QEMU with 8 CPUs and it took 18.5 minutes to create the FS on a
> > 14TB ATA drive. Doing the same thing on bare metal with 32 CPUs takes 10.5 minutes in my
> > environment. However, when doing the same test with a SAS drive, the run takes 43 minutes.
> > This is not quite the degradation you are observing, but still a big performance hit.
> >
> > Is the disk that you are using SAS or SATA?
> >
> > My current guess is that the sd driver may generate TEST UNIT READY commands to check whether
> > the drive is really online as part of check_events() processing. For ATA drives, this is
> > nearly a NOP since all TURs are completed internally in libata. But in the SCSI case, these
> > blocking TURs are issued to the drive and can certainly degrade performance.
> >
> > The check_events() call was added to dmz_bdev_is_dying() because simply calling
> > blk_queue_dying() doesn't cover the situation where the drive gets offlined in the SCSI layer.
> > It might be possible to call check_events() only once before every reclaim run and avoid
> > calling it in the I/O mapping path. If that works, the overhead would likely be acceptable.
> > I am going to take a look into this.
> >
> > Regards,
> > Dmitry
> >
> > [root@xxx dmz]# uname -a
> > Linux xxx 5.4.0-rc1-DMZ+ #1 SMP Fri Oct 11 11:23:13 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
> > [root@xxx dmz]# lsscsi
> > [0:0:0:0] disk QEMU QEMU HARDDISK 2.5+ /dev/sda
> > [1:0:0:0] zbc ATA HGST HSH721415AL T240 /dev/sdb
> > [root@xxx dmz]# ./setup-dmz test /dev/sdb
> > [root@xxx dmz]# cat /proc/kallsyms | grep dmz_bdev_is_dying
> > (standard input):90782:ffffffffc070a401 t dmz_bdev_is_dying.cold [dm_zoned]
> > (standard input):90849:ffffffffc0706e10 t dmz_bdev_is_dying [dm_zoned]
> > [root@xxx dmz]# time mkfs.ext4 /dev/mapper/test
> > mke2fs 1.44.6 (5-Mar-2019)
> > Discarding device blocks: done
> > Creating filesystem with 3660840960 4k blocks and 457605120 inodes
> > Filesystem UUID: 4536bacd-cfb5-41b2-b0bf-c2513e6e3360
> > Superblock backups stored on blocks:
> > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> > 2560000000
> >
> > Allocating group tables: done
> > Writing inode tables: done
> > Creating journal (262144 blocks): done
> > Writing superblocks and filesystem accounting information: done
> >
> >
> > real 18m30.867s
> > user 0m0.172s
> > sys 0m11.198s
> >
> >
> > On Sat, 2019-10-26 at 09:56 +0800, zhangxiaoxu (A) wrote:
> > > Hi all, when I run 'mkfs.ext4' on a dmz device backed by a 10TB SMR disk,
> > > it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
> > > properly handle backing device failure").
> > >
> > > After deleting the 'check_events' call in 'dmz_bdev_is_dying', it takes
> > > less than 12 minutes.
> > >
> > > I tested this on the 4.19 branch.
> > > Must we do the 'check_events' in the mapping path, reclaim, and metadata I/O?
> > >
> > > Thanks.
> > >
end of thread, other threads:[~2019-11-06 23:00 UTC | newest]
Thread overview: 4+ messages
2019-10-26 1:56 dm-zoned performance degradation after applying 75d66ffb48efb3 ("dm zoned: properly handle backing device failure") zhangxiaoxu (A)
2019-10-27 2:56 ` Dmitry Fomichev
2019-10-31 8:20 ` zhangxiaoxu (A)
2019-11-06 23:00 ` Dmitry Fomichev