Date: Fri, 12 Feb 2016 06:59:15 +0000
From: Sitsofe Wheeler
To: Jens Rosenboom
Cc: Fio, "linux-kernel@vger.kernel.org", parted-devel@lists.alioth.debian.org,
 linux-block@vger.kernel.org, Jens Axboe
Subject: Re: Small writes being split with fdatasync based on non-aligned partition ending

CC'ing Jens Axboe.

On 11 February 2016 at 09:54, Jens Rosenboom wrote:
> 2016-02-11 4:48 GMT+01:00 Sitsofe Wheeler:
>> Trying to cc the GNU parted and linux-block mailing lists.
>>
>> On 9 February 2016 at 13:02, Jens Rosenboom wrote:
>>> While trying to reproduce some performance issues I have been seeing
>>> with Ceph, I have come across a strange behaviour which is seemingly
>>> affected only by the end point (and thereby the size) of a partition
>>> being an odd number of sectors. Since all documentation about
>>> alignment only refers to the starting point of the partition, this was
>>> pretty surprising and I would like to know whether this is expected
>>> behaviour or maybe a kernel issue.
>>>
>>> The command I am using is pretty simple:
>>>
>>> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k \
>>>     --filename=/dev/sdb2 --runtime=10 --name=test
>>>
>>> The difference shows itself when the partition is created either by
>>> sgdisk or by parted:
>>>
>>> sgdisk --new=2:6000M: /dev/sdb
>>>
>>> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>>>
>>> The difference in the partition table looks like this:
>>>
>>> < 2   6291456000B   1600320962559B   1594029506560B   osd-device-1-block
>>> ---
>>>> 2   6291456000B   1600321297919B   1594029841920B   osd-device-1-block
>>
>> Looks like parted took you at your word when you asked for your
>> partition to end at 100%. Just out of curiosity, if you try and make
>> the same partition interactively with parted do you get any warnings
>> after making it and after running align-check?
>
> No warnings and everything fine for align-check. I found out that I
> can get the same effect if I step the partition ending manually in
> parted in 1s increments. The sequence of write sizes is 8, 1, 2, 1, 4,
> 1, 2, 1, 8, ... which corresponds to the size (unit s) of the
> resulting partition mod 8.

OK. Could you add the output of

grep . /sys/block/nvme0n1/queue/*size
sgdisk -D /dev/sdb

and could you post the whole partition table as well?

Does sgdisk create a similar problem ending if you use
sgdisk --new=2:0 /dev/sdb ?

It seems strange that the end of the disk (and thus a 100% sized
partition) wouldn't end on a multiple of 4k...
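While you're at it, a quick way to double check the alignment of both the
start and the size of the partition straight from sysfs (a rough sketch only;
it assumes the devices are sdb/sdb2 as in your example and relies on sysfs
reporting start/size in 512-byte sectors, so anything not divisible by 8 is
not 4k aligned):

# sketch: assumes sdb/sdb2; sysfs values are in 512-byte sectors
for f in /sys/block/sdb/sdb2/start /sys/block/sdb/sdb2/size; do
    echo "$f: $(cat $f) (mod 8 = $(( $(cat $f) % 8 )))"
done

A non-zero "mod 8" for the size would line up with the 8, 1, 2, 1, ... pattern
of write sizes you described above.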
>>> So this is really only the end of the partition that is different.
>>> However, in the first case, the 4k writes all get broken up into 512b
>>> writes somewhere in the kernel, as can be seen with btrace:
>>>
>>> 8,16   3   36   0.000102666  8184  A  WS 12353985 + 1 <- (8,18) 65985
>>> 8,16   3   37   0.000102739  8184  Q  WS 12353985 + 1 [fio]
>>> 8,16   3   38   0.000102875  8184  M  WS 12353985 + 1 [fio]
>>> 8,16   3   39   0.000103038  8184  A  WS 12353986 + 1 <- (8,18) 65986
>>> 8,16   3   40   0.000103109  8184  Q  WS 12353986 + 1 [fio]
>>> 8,16   3   41   0.000103196  8184  M  WS 12353986 + 1 [fio]
>>> 8,16   3   42   0.000103335  8184  A  WS 12353987 + 1 <- (8,18) 65987
>>> 8,16   3   43   0.000103403  8184  Q  WS 12353987 + 1 [fio]
>>> 8,16   3   44   0.000103489  8184  M  WS 12353987 + 1 [fio]
>>> 8,16   3   45   0.000103609  8184  A  WS 12353988 + 1 <- (8,18) 65988
>>> 8,16   3   46   0.000103678  8184  Q  WS 12353988 + 1 [fio]
>>> 8,16   3   47   0.000103767  8184  M  WS 12353988 + 1 [fio]
>>> 8,16   3   48   0.000103879  8184  A  WS 12353989 + 1 <- (8,18) 65989
>>> 8,16   3   49   0.000103947  8184  Q  WS 12353989 + 1 [fio]
>>> 8,16   3   50   0.000104035  8184  M  WS 12353989 + 1 [fio]
>>> 8,16   3   51   0.000104150  8184  A  WS 12353990 + 1 <- (8,18) 65990
>>> 8,16   3   52   0.000104219  8184  Q  WS 12353990 + 1 [fio]
>>> 8,16   3   53   0.000104307  8184  M  WS 12353990 + 1 [fio]
>>> 8,16   3   54   0.000104452  8184  A  WS 12353991 + 1 <- (8,18) 65991
>>> 8,16   3   55   0.000104520  8184  Q  WS 12353991 + 1 [fio]
>>> 8,16   3   56   0.000104609  8184  M  WS 12353991 + 1 [fio]
>>> 8,16   3   57   0.000104885  8184  I  WS 12353984 + 8 [fio]
>>>
>>> whereas in the second case, I'm getting the expected 4k writes:
>>>
>>> 8,16   6   42   1266874889.659842036  8409  A  WS 12340232 + 8 <- (8,18) 52232
>>> 8,16   6   43   1266874889.659842167  8409  Q  WS 12340232 + 8 [fio]
>>> 8,16   6   44   1266874889.659842393  8409  G  WS 12340232 + 8 [fio]
>>
>> This is weird because --size=1G should mean that fio is "seeing" an
>> aligned end. Does direct=1 with a sequential job at iodepth=1 show the
>> problem too?
>
> IIUC fio only uses the size to work out where to write; it opens
> the block device and passes the resulting fd to the fdatasync call, so
> the kernel does not know what size fio thinks the device has. In
> fact, the effect is the same without the size=1G option, I used it
> just to make sure that the writes do not go anywhere near the badly
> aligned partition ending.
>
> direct=1 kills the effect, i.e. all writes are 4k sized again.
> Astonishingly though, sequential writes are also affected, i.e.
> changing to rw=write in my sample above behaves the same as randwrite.

Do you get this style of behaviour without fdatasync (or with larger
values of fdatasync) too?
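For example, something along these lines (rough sketches based on your
original command line; the job names are arbitrary):

# sketch: same job as before but only syncing every 32 writes...
fio --rw=randwrite --size=1G --bs=4k --fdatasync=32 \
    --filename=/dev/sdb2 --runtime=10 --name=test-fdatasync-32

# ...and with no per-write sync at all, just one sync when the job ends
fio --rw=randwrite --size=1G --bs=4k --end_fsync=1 \
    --filename=/dev/sdb2 --runtime=10 --name=test-no-fdatasync

That should show whether the splitting is tied to syncing after every
write or happens with plain buffered writes too.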
>>> The above examples are from running with an SSD, where the small
>>> writes get merged together again before hitting the block device,
>>> which is still pretty OK performance-wise. But when I run the same
>>> test on some NVMe device, the writes do not get merged; instead the
>>> performance drops to less than 10% of what I get in the second case.
>>
>> Perhaps the ioscheduler doesn't have the opportunity with the NVMe device...
>
> Yes, there is no scheduler available in this case:
>
> $ cat /sys/block/nvme0n1/queue/scheduler
> none
>
> This is just to show that the argument "Don't bother, the writes get
> merged back together anyway" doesn't hold true in all cases.
>
>>> If this is indeed expected behaviour from the kernel pov, it might
>>> need some better documentation and probably sgdisk should also be
>>> enhanced to align the end of the partition as well. FWIW, this happens
>>> on a stock 4.4.0 kernel as well as recent Ubuntu and CentOS kernels.
>>
>> Do you mean parted?
>
> No, as I am currently assuming that the issue is caused by some effect
> happening inside the kernel during the fdatasync call, there was the
> idea that only certain kernels might be affected. But I don't have a
> clue yet how far back I would have to go in order to find a kernel
> that behaves differently.

--
Sitsofe | http://sucs.org/~sits/