* Re: How can btrfs take 23sec to stat 23K files from an SSD?
@ 2012-07-24  6:18 Marc MERLIN
  0 siblings, 0 replies; 22+ messages in thread
From: Marc MERLIN @ 2012-07-24 6:18 UTC (permalink / raw)
To: linux-btrfs

On Sun, Jul 22, 2012 at 11:42:03PM -0700, Marc MERLIN wrote:
> I just realized that the older thread got a bit confusing, so I'll keep
> problems separate and make things simpler :)

Since yesterday I have tried other 3.4.4 kernels (no-preempt, voluntary
preempt, and full preempt), as well as a stock Debian 3.2.0 kernel (all
amd64), but that did not help: I'm still seeing close to 25 seconds to
scan 15K files.

How can it possibly be so slow, and more importantly, how can I provide
useful debug information?

- I don't think it's a kernel problem, since I tried 4 kernels,
  including a stock Debian one.
- Alignment seems OK; I made sure the partition start is divisible
  by 512:
  /dev/sda2    502272   52930559   26214144   83  Linux
- I tried another brand-new btrfs, and things are even slower now:

gandalfthegreat:/mnt/mnt2# mount -o ssd,discard,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# reset_cache
gandalfthegreat:/mnt/mnt2# time du -sh src/
514M    src/
real    0m29.584s
gandalfthegreat:/mnt/mnt2# find src/ | wc -l
15261

This is bad enough that there ought to be a way to debug it, right?
Can you suggest something?

Thanks,
Marc

> On an _unencrypted_ partition on the SSD, running du -sh on a directory
> with 15K files takes 23 seconds on the unencrypted SSD and 4 seconds on
> the encrypted spinning drive, both with a similar btrfs filesystem and
> the same kernel (3.4.4).
>
> Unencrypted btrfs on SSD:
> gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M    src
> real    0m22.667s
>
> Encrypted btrfs on spinning drive, same src directory:
> gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M    src
> real    0m3.881s
>
> I've run this many times and get the same numbers.
> I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.
>
> I also tried:
> - space_cache and nospace_cache
> - ssd and nossd
> - noatime, which didn't seem to help even though I was hopeful about
>   this one.
>
> In all cases, I get:
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M    src
> real    0m22.537s
>
> I'm seeing the same slow speed on 2 btrfs filesystems on the same SSD.
> One is encrypted, the other one isn't:
> Label: 'btrfs_pool1'  uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d
>         Total devices 1 FS bytes used 144.74GB
>         devid    1 size 441.70GB used 195.04GB path /dev/dm-0
>
> Label: 'boot'  uuid: 84199644-3542-430a-8f18-a5aa58959662
>         Total devices 1 FS bytes used 2.33GB
>         devid    1 size 25.00GB used 5.04GB path /dev/sda2
>
> If instead of stating a bunch of files I try reading a big file, I do
> get speeds that are quite fast (253MB/s and 423MB/s).
>
> 22 seconds for 15K files on an SSD is super slow: 5 times slower than
> a spinning disk with the same data.
> What's going on?
>
> Thanks,
> Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread
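The transcripts above call a `reset_cache` helper that the thread never shows. A minimal sketch of what it presumably does, based on the `echo 3 > /proc/sys/vm/drop_caches` lines used elsewhere in the thread (the definition itself is an assumption):

```shell
# Presumed definition of the reset_cache helper used in the transcripts
# above (an assumption; the thread never shows it): flush dirty data,
# then drop the page cache plus dentry/inode caches so the next du or
# find starts cold. Needs root; prints a note and carries on otherwise.
reset_cache() {
    sync
    if [ -w /proc/sys/vm/drop_caches ]; then
        echo 3 > /proc/sys/vm/drop_caches
    else
        echo "reset_cache: need root to drop caches" >&2
    fi
}
```

The value 3 matters for this benchmark: 1 would drop only the page cache, while 3 also drops dentries and inodes, which is what a metadata walk like du actually exercises.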
* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
@ 2012-02-02 12:42 Chris Mason
2012-07-18 22:04 ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
0 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2012-02-02 12:42 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs
On Wed, Feb 01, 2012 at 07:23:45PM -0800, Marc MERLIN wrote:
> On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > > Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> > > and what the performance is like with write cache turned off (I'm actually
> > > not too sure what the impact is for SSDs considering that writing to flash
> > > can actually be slower than writing to a hard drive).
> >
> > Performance without the cache on is going to vary wildly from one SSD to
> > another. Some really need it to give them nice fat writes while others
> > do better on smaller writes. It's best to just test yours and see.
> >
> > With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> > are doing the right thing for barriers.
>
> Thanks for the answer.
> Can you confirm that I still must disable write cache on the SSD to avoid
> corruption with btrfs on top of dmcrypt, or is there a chance that it just
> works now?
No, with 3.2 or higher it is expected to work. dm-crypt is doing the
barriers correctly and as of 3.2 btrfs is sending them down correctly.
-chris
* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
@ 2012-07-18 22:04 ` Marc MERLIN
  2012-07-22 18:58   ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
  0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-07-18 22:04 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Wed, Jul 18, 2012 at 11:49:36PM +0200, Martin Steigerwald wrote:
> I am still not convinced that dm-crypt is the best way to go about
> encryption, especially for SSDs. But it's more of a gut feeling than
> anything that I can explain easily.

I agree that dmcrypt is not great; it even makes some SSDs slower than
hard drives, as per some reports I just posted in another mail. But:

> I use ecryptfs, formerly encfs, but encfs is much slower.

The problem with ecryptfs is:
1) It is painfully slow compared to dmcrypt in my tests. As in so slow
   that I don't even need to benchmark it.
2) It is unable to encrypt very long filenames, so when I copy my
   archive onto an ecryptfs volume, some files won't copy unless I
   rename them.

I would love for ecryptfs to have the performance of dmcrypt, because it
would be easier for me to use, but it didn't even come close.

> > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what
> > caused its garbage collection algorithm to fail (killing the drive and
> > all its data), and it was also causing BTRFS to hang badly when I was
> > getting within 10GB of the drive getting full.
>
> How did you know that it was its garbage collection algorithm?

I don't, hence "most likely". The failure of the drive was likely
garbage-collection related, from what I got from the techs I talked to.

> > From talking to their techs and other folks, it seems clear that TRIM
> > is greatly encouraged, and I'm pretty sure that had I used TRIM, I
> > would not have hit the problems that caused my drive to fail and suck
> > so much when it was getting full.
>
> I still think that telling the SSD about free space is helping it to
> balance its wear leveling process.

Yes.

> > Any objections and/or comments?
>
> I still only use fstrim from time to time: about once a week, or after
> lots of drive churning or removing lots of data. I also have a logical
> volume of about 20 GiB that I keep free most of the time. The other
> filesystems are quite full, but together they still have some free
> space of about 20-30 GiB. So it should be about 40-50 GiB free most of
> the time.

I'm curious: if your filesystem supports trim (i.e. ext4 and btrfs), is
there ever a reason to turn off trim in the FS and use fstrim instead?

> The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after
> about 1 year and 2-3 months. I do not see any performance degradation
> whatsoever so far. Last time I looked, the SMART data also looked fine,
> but I do not have much experience with SMART on SSDs so far.

My experience, and what I read online, is that SMART on SSDs doesn't
seem to help much in many cases. I've seen too many reports of SSDs
dying very suddenly with absolutely no warning. Hard drives, if you look
at SMART data over time, typically give you plenty of warning before
they die (as long as you don't drop them off a table without parking
their heads).
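The periodic-fstrim approach described above is easy to automate. A sketch of a weekly cron drop-in (the path and mount points are illustrative examples, not from the thread; fstrim needs root and a TRIM-capable device under each filesystem):

```shell
#!/bin/sh
# /etc/cron.weekly/fstrim -- batched TRIM instead of the mount-time
# discard option. The mount points below are examples; list your own.
for fs in / /home; do
    # -v reports how many bytes were discarded on each filesystem
    fstrim -v "$fs" || echo "fstrim skipped for $fs" >&2
done
```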
If you're curious, here's the last dump of my SMART data on the SSD that died:

=== START OF INFORMATION SECTION ===
Model Family:     Crucial RealSSD C300
Device Model:     C300-CTFDDAC256MAG
Serial Number:    00000000111003044A9C
LU WWN Device Id: 5 00a075 103044a9c
Firmware Version: 0007
User Capacity:    256,060,514,304 bytes [256 GB]

  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2977
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       472
170 Grown_Failing_Block_Ct  0x0033   100   100   000    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Wear_Levelling_Count    0x0033   100   100   000    Pre-fail  Always       -       35
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       1
181 Non4k_Aligned_Access    0x0022   100   100   000    Old_age   Always       -       2757376737803
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   000    Old_age   Always       -       422
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   100   100   000    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         2958       -
# 2  Extended offline    Completed without error       00%         2937       -
# 3  Short offline       Completed without error       00%         2936       -
# 4  Short offline       Completed without error       00%         2915       -
# 5  Short offline       Completed without error       00%         2892       -
# 6  Short offline       Completed without error       00%         2881       -
# 7  Short offline       Completed without error       00%         2859       -
# 8  Vendor (0xff)       Completed without error       00%         2852       -

Marc
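Attribute tables like the one above can be scanned mechanically. A sketch (`check_prefail` is an illustrative helper name, not a real tool) that flags any Pre-fail attribute with a nonzero raw value, assuming the standard ten-column smartctl -A layout shown above; it is fed two sample rows here, but in practice you would pipe in `smartctl -A /dev/sda`:

```shell
# Flag Pre-fail SMART attributes whose raw value (last column) is
# nonzero; column 7 is the attribute TYPE in smartctl -A output.
check_prefail() {
    awk '$7 == "Pre-fail" && $NF + 0 != 0 { print "WARN:", $2, "raw =", $NF }'
}

# Two rows lifted from the dump above, as sample input:
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0' \
  '173 Wear_Levelling_Count    0x0033   100   100   000    Pre-fail  Always       -       35' \
  | check_prefail
```

Run against the full dump above, it would flag only the wear-leveling count, which fits Marc's point: this drive's SMART data gave essentially no warning before it died.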
* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-18 22:04 ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
@ 2012-07-22 18:58 ` Marc MERLIN
  2012-07-22 19:35   ` Martin Steigerwald
  0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-07-22 18:58 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

I'm still gathering a bit more data before updating the btrfs wiki with
my best recommendations for today.

First, everything I've read so far says that the ssd btrfs mount option
makes btrfs slower in benchmarks. What gives?
Is anyone using it, or does anyone know of a reason not to mount my SSD
with nossd?

Next, I got a new Samsung 830 512GB SSD, which is supposed to be very
high performance, and the raw device does seem fast enough on a quick
hdparm test. But once I encrypt it, it drops to 5 times slower than the
1TB spinning disk in the same laptop:

gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt
/dev/mapper/ssdcrypt:
 Timing cached reads:   15412 MB in  2.00 seconds = 7715.37 MB/sec
 Timing buffered disk reads:   70 MB in  3.06 seconds =  22.91 MB/sec  <<<<

gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot   (spinning disk)
/dev/mapper/cryptroot:
 Timing cached reads:   16222 MB in  2.00 seconds = 8121.03 MB/sec
 Timing buffered disk reads:  308 MB in  3.01 seconds = 102.24 MB/sec  <<<<

The non-encrypted SSD device gives me:
/dev/sda4:
 Timing cached reads:   14258 MB in  2.00 seconds = 7136.70 MB/sec
 Timing buffered disk reads: 1392 MB in  3.00 seconds = 463.45 MB/sec

which is 4x faster than my non-encrypted spinning disk, as expected.

I used aes-xts-plain as recommended on
http://www.mayrhofer.eu.org/ssd-linux-benchmark

gandalfthegreat:~# cryptsetup status /dev/mapper/ssdcrypt
/dev/mapper/ssdcrypt is active.
  type:    LUKS1
  cipher:  aes-xts-plain
  keysize: 256 bits
  device:  /dev/sda4
  offset:  4096 sectors
  size:    926308752 sectors
  mode:    read/write

gandalfthegreat:~# lsmod | grep -e aes
aesni_intel            50443  66
cryptd                 14517  18 ghash_clmulni_intel,aesni_intel
aes_x86_64             16796  1 aesni_intel

I realize this is not btrfs' fault :) although I'm a bit stuck as to how
to benchmark btrfs if the underlying encrypted device is so slow :)

Thanks,
Marc
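A 23MB/s dm-crypt read despite aesni_intel showing up in lsmod is surprising, so before digging deeper it is worth double-checking that hardware AES really is available. A small sketch of the two checks (probing only, no benchmark):

```shell
# Sanity-check that hardware AES is available; without the aesni_intel
# driver the kernel falls back to much slower software AES for dm-crypt.
aes_check() {
    if grep -m1 -qw aes /proc/cpuinfo; then
        echo "CPU advertises AES-NI"
    else
        echo "no aes flag in /proc/cpuinfo"
    fi
    if grep -q '^aesni_intel' /proc/modules 2>/dev/null; then
        echo "aesni_intel loaded"
    else
        echo "aesni_intel not listed (missing, or built in)"
    fi
}
aes_check
```

Later cryptsetup releases (1.6 and newer, which postdate this thread) also ship a `cryptsetup benchmark` command that measures raw cipher throughput, which would cleanly separate cipher speed from block-layer problems.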
* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-22 18:58 ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
@ 2012-07-22 19:35 ` Martin Steigerwald
  2012-07-22 20:44   ` Marc MERLIN
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2012-07-22 19:35 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

Hi Marc,

Am Sonntag, 22. Juli 2012 schrieb Marc MERLIN:
> I'm still getting a bit more data before updating the btrfs wiki with
> my best recommendations for today.
>
> First, everything I've read so far says that the ssd btrfs mount option
> makes btrfs slower in benchmarks.
> What gives?
> Anyone using it or know of a reason not to mount my ssd with nossd?
>
> Next, I got a new Samsung 830 512GB SSD which is supposed to be very
> high performance.
> The raw device seems fast enough on a quick hdparm test:
>
> But once I encrypt it, it drops to 5 times slower than my 1TB spinning
> disk in the same laptop:
> gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt
> /dev/mapper/ssdcrypt:
>  Timing cached reads:   15412 MB in  2.00 seconds = 7715.37 MB/sec
>  Timing buffered disk reads:   70 MB in  3.06 seconds =  22.91 MB/sec  <<<<
>
> gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot   (spinning disk)
> /dev/mapper/cryptroot:
>  Timing cached reads:   16222 MB in  2.00 seconds = 8121.03 MB/sec
>  Timing buffered disk reads:  308 MB in  3.01 seconds = 102.24 MB/sec  <<<<

Have you looked at whether certain kernel threads are eating CPU? I
would check that, or use atop to get a complete system overview during
the hdparm run. You may want to use its default 10-second delay.

Anyway, hdparm is only a very rough measurement (a test time of 2-3
seconds is really short). Did you repeat the tests three or five times
and look at the deviation?

For what it is worth, I can beat that with ecryptfs on top of ext4 on
top of an Intel SSD 320 (SATA 300 based):

martin@merkaba:~> su - ms
Password:
ms@merkaba:~> df -hT .
Filesystem     Type      Size  Used Avail Use% Mounted on
/home/.ms      ecryptfs  224G  211G   11G  96% /home/ms
ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.1466 s, 52.0 MB/s
ms@merkaba:~> rm testfile
ms@merkaba:~> sudo fstrim /home
[sudo] password for ms:
ms@merkaba:~>

That's way slower than a dd without encryption, but it's way faster than
your hdparm figures. The SSD was underutilized according to the hard
disk LED of this ThinkPad T520 with an Intel i5 Sandybridge 2.5 GHz
dual-core; I started atop too late to see what was going on.

(I have not yet tested ecryptfs on top of BTRFS, but you didn't test a
filesystem with hdparm anyway.)

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
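The suggestion above to repeat the measurement and look at the spread is easy to script. A sketch (the helper name is made up) that times any command a few times and reports each run plus the mean:

```shell
# repeat_timed CMD...: run CMD three times, print the wall time of each
# run in ms plus the mean, to smooth out short hdparm/dd-style tests.
repeat_timed() {
    runs=3 total=0 i=1
    while [ "$i" -le "$runs" ]; do
        t0=$(date +%s%N)               # GNU date: epoch nanoseconds
        "$@" > /dev/null 2>&1
        t1=$(date +%s%N)
        ms=$(( (t1 - t0) / 1000000 ))
        echo "run $i: ${ms} ms"
        total=$(( total + ms ))
        i=$(( i + 1 ))
    done
    echo "mean: $(( total / runs )) ms"
}
```

For example, `repeat_timed dd if=/dev/mapper/ssdcrypt of=/dev/null bs=1M count=1000`; for cold-cache numbers you would still need to drop caches between runs.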
* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-22 19:35 ` Martin Steigerwald
@ 2012-07-22 20:44 ` Marc MERLIN
  2012-07-22 22:41   ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN
  0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-07-22 20:44 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Sun, Jul 22, 2012 at 09:35:10PM +0200, Martin Steigerwald wrote:
> Or use atop to have a complete system overview during the hdparm run.
> You may want to use its default 10 seconds delay.

atop looked totally fine and showed that a long dd took 6% CPU of one core.

> Anyway, hdparm is only a very rough measurement. (Test time 2 / 3
> seconds is really short.)
>
> Did you repeat tests three or five times and looked at the deviation?

Yes. I also ran dd with 1GB, and got the exact same number.

Thanks for confirming that it is indeed way faster for you, and you're
going through the filesystem layer, which makes it slower compared to my
doing dd against the raw device.

But anyway: while my question about using 'nossd' is on topic here, the
speed of the raw device falling dramatically once encrypted isn't really
relevant here, so I'll continue that on the dm-crypt mailing list.

Thanks,
Marc
* Re: brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk
  2012-07-22 20:44 ` Marc MERLIN
@ 2012-07-22 22:41 ` Marc MERLIN
  2012-07-23  6:42   ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-07-22 22:41 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Sun, Jul 22, 2012 at 01:44:28PM -0700, Marc MERLIN wrote:
> On Sun, Jul 22, 2012 at 09:35:10PM +0200, Martin Steigerwald wrote:
> > Or use atop to have a complete system overview during the hdparm run.
> > You may want to use its default 10 seconds delay.
>
> atop looked totally fine and showed that a long dd took 6% CPU of one core.
>
> > Anyway, hdparm is only a very rough measurement. (Test time 2 / 3
> > seconds is really short.)
> >
> > Did you repeat tests three or five times and looked at the deviation?
>
> Yes. I also ran dd with 1GB, and got the exact same number.

Oh my, the plot thickens. I'm bringing this back to this list because:

1) I've confirmed that I can get 500MB/s unencrypted reads (2GB file)
   with btrfs on the SSD.
2) Despite getting a measly 23MB/s reading from /dev/mapper/ssdcrypt,
   once I mount it as a btrfs drive and read a file from it, I now get
   267MB/s.
3) Stating a bunch of files on the SSD is very slow (5-6x slower than
   stating the same files from disk). Using noatime did not help.

On an _unencrypted_ partition on the SSD, running du -sh on a directory
with 15K files takes 23 seconds, versus 4 seconds on the encrypted
spinning drive, both with a similar btrfs filesystem and the same
kernel (3.4.4).

Since I'm getting some kind of "seek problem" on the SSD (yes, I know
there are no real seeks on an SSD), it seems that I do have a problem
with btrfs that is at least not related to encryption.

Unencrypted btrfs on SSD:
gandalfthegreat:/mnt/mnt2# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m22.667s

Encrypted btrfs on spinning drive:
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m3.881s

I've run this many times and get the same numbers.
I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.

I also tried:
- space_cache and nospace_cache
- ssd and nossd
- noatime, which didn't seem to help even though I was hopeful about
  this one.

In all cases, I get:
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m22.537s

22 seconds for 15K files on an SSD: that is 5 times slower than a
spinning disk with the same data. What's going on?

More details below for anyone interested:
----------------------------------------------------------------------------
The test below shows that I can access the encrypted SSD 10x faster by
talking to it through a btrfs filesystem than by doing a dd read from
the device node. So I could almost ignore the dd device-node result,
except not really. Let's take a directory with some source files inside.

Let's start with the hard drive in my laptop (dmcrypt with btrfs):
gandalfthegreat:/var/local# find src | wc -l
15261
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m4.068s

So on an encrypted spinning disk, it takes 4 seconds.

On my SSD in the same machine, with the same encryption and the same
btrfs filesystem, I get 5-6 times slower:
gandalfthegreat:/mnt/btrfs_pool1/var/local# time du -sh src
514M    src
real    0m24.937s

Incidentally, 5x is also the speed difference between my encrypted HD
and encrypted SSD with dd. Now, why du is 5x slower while dd of a file
from the filesystem is 2.5x faster, I have no idea :-/ See below:

1) drop caches:
gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=w2k-s001.vmdk of=/dev/null
2146631680 bytes (2.1 GB) copied, 8.03898 s, 267 MB/s
-> 267MB/s reading the file through the encrypted filesystem. That's good.

For comparison:
gandalfthegreat:/mnt/mnt2# dd if=w2k-s001.vmdk of=/dev/null
2146631680 bytes (2.1 GB) copied, 4.33393 s, 495 MB/s
-> almost 500MB/s reading through another, unencrypted filesystem on the
   same SSD.

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=/dev/mapper/ssdcrypt of=/dev/null bs=1M count=1000
1048576000 bytes (1.0 GB) copied, 45.1234 s, 23.2 MB/s
-> 23MB/s reading from the block device my FS is mounted from. WTF?

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches; dd if=w2k-s001.vmdk of=test
2146631680 bytes (2.1 GB) copied, 17.9129 s, 120 MB/s
-> 120MB/s copying a file from the SSD to itself. That's not bad.

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches; dd if=test of=/dev/null
2146631680 bytes (2.1 GB) copied, 8.4907 s, 253 MB/s
-> reading the newly copied file still shows 253MB/s, good.

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=test of=/dev/null
2146631680 bytes (2.1 GB) copied, 2.11001 s, 1.0 GB/s
-> reading without dropping the cache shows 1GB/s.

I'm very lost now. Any idea what's going on?
- Big file copies from the encrypted SSD are fast, as expected.
- Raw device access is inexplicably slow (10x slower than btrfs on top
  of the same raw device).
- Stating 15K files on the encrypted SSD with btrfs is super slow.

Marc
* How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-22 22:41 ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN
@ 2012-07-23  6:42 ` Marc MERLIN
  2012-07-24  7:56   ` Martin Steigerwald
  2012-07-27 11:08   ` Chris Mason
  0 siblings, 2 replies; 22+ messages in thread
From: Marc MERLIN @ 2012-07-23 6:42 UTC (permalink / raw)
To: linux-btrfs

I just realized that the older thread got a bit confusing, so I'll keep
problems separate and make things simpler :)

On an _unencrypted_ partition on the SSD, running du -sh on a directory
with 15K files takes 23 seconds, versus 4 seconds on the encrypted
spinning drive, both with a similar btrfs filesystem and the same
kernel (3.4.4).

Unencrypted btrfs on SSD:
gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m22.667s

Encrypted btrfs on spinning drive, same src directory:
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m3.881s

I've run this many times and get the same numbers.
I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.

I also tried:
- space_cache and nospace_cache
- ssd and nossd
- noatime, which didn't seem to help even though I was hopeful about
  this one.

In all cases, I get:
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M    src
real    0m22.537s

I'm seeing the same slow speed on 2 btrfs filesystems on the same SSD.
One is encrypted, the other one isn't:
Label: 'btrfs_pool1'  uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d
        Total devices 1 FS bytes used 144.74GB
        devid    1 size 441.70GB used 195.04GB path /dev/dm-0

Label: 'boot'  uuid: 84199644-3542-430a-8f18-a5aa58959662
        Total devices 1 FS bytes used 2.33GB
        devid    1 size 25.00GB used 5.04GB path /dev/sda2

If instead of stating a bunch of files I try reading a big file, I do
get speeds that are quite fast (253MB/s and 423MB/s).

22 seconds for 15K files on an SSD is super slow: 5 times slower than a
spinning disk with the same data. What's going on?

Thanks,
Marc
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-23  6:42 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
@ 2012-07-24  7:56 ` Martin Steigerwald
  2012-07-27  4:40   ` Marc MERLIN
  2012-07-27 11:08 ` Chris Mason
  1 sibling, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2012-07-24 7:56 UTC (permalink / raw)
To: linux-btrfs; +Cc: Marc MERLIN

Am Montag, 23. Juli 2012 schrieb Marc MERLIN:
> I just realized that the older thread got a bit confusing, so I'll keep
> problems separate and make things simpler :)
>
> On an _unencrypted_ partition on the SSD, running du -sh on a directory
> with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on
> encrypted spinning drive, both with a similar btrfs filesystem, and
> the same kernel (3.4.4).
>
> Unencrypted btrfs on SSD:
> gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M    src
> real    0m22.667s
>
> Encrypted btrfs on spinning drive of the same src directory:
> gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
> 514M    src
> real    0m3.881s

find is fast, du is much slower:

merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l )
404166
( find /usr | wc -l; )  0.03s user 0.07s system 1% cpu 9.212 total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr )
11G     /usr
( du -sh /usr; )  1.00s user 19.07s system 41% cpu 48.886 total

Now let me try something with fewer files:

merkaba:~> find /usr/share/doc | wc -l
50715
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr/share/doc | wc -l )
50715
( find /usr/share/doc | wc -l; )  0.00s user 0.02s system 1% cpu 1.398 total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr/share/doc )
606M    /usr/share/doc
( du -sh /usr/share/doc; )  0.20s user 3.63s system 35% cpu 10.691 total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time du -sh /usr/share/doc
606M    /usr/share/doc
du -sh /usr/share/doc  0.19s user 3.54s system 35% cpu 10.386 total

Anyway, that is still much faster than your measurements.

merkaba:~> df -hT /usr
Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/dm-0      btrfs   19G   11G  5.6G  67% /
merkaba:~> btrfs fi sh
failed to read /dev/sr0
Label: 'debian'  uuid: […]
        Total devices 1 FS bytes used 10.25GB
        devid    1 size 18.62GB used 18.62GB path /dev/dm-0

Btrfs Btrfs v0.19
merkaba:~> btrfs fi df /
Data: total=15.10GB, used=9.59GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.75GB, used=670.43MB
Metadata: total=8.00MB, used=0.00
merkaba:~> grep btrfs /proc/mounts
/dev/dm-0 / btrfs rw,noatime,compress=lzo,ssd,space_cache,inode_cache 0 0

This is a somewhat aged BTRFS filesystem on a ThinkPad T520 with an
Intel SSD 320, kernel 3.5.

Ciao,
Martin
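The find-vs-du gap above is easy to reproduce on any tree: find only needs the directory entries (getdents), while du must additionally stat() every file to sum sizes, so it touches far more filesystem metadata. A sketch (the helper name is made up) that times both walks on the same directory, dropping caches first when possible:

```shell
# Compare a getdents-only walk (find) against a walk that also stat()s
# every entry (du) on the same tree. Caches are dropped between runs
# when running as root, so both start cold; the interesting signal is
# the wall-time difference between the two lines.
cmp_walk() {
    dir=$1
    for cmd in "find" "du -s"; do
        sync
        if [ -w /proc/sys/vm/drop_caches ]; then
            echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true
        fi
        t0=$(date +%s%N)               # GNU date: epoch nanoseconds
        $cmd "$dir" > /dev/null
        t1=$(date +%s%N)
        echo "$cmd: $(( (t1 - t0) / 1000000 )) ms"
    done
}
```

On a healthy SSD the two times should be of the same order; a du that is several times slower than find, as in the numbers above, points at the per-inode lookups rather than the readdir pass.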
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-24  7:56 ` Martin Steigerwald
@ 2012-07-27  4:40 ` Marc MERLIN
  0 siblings, 0 replies; 22+ messages in thread
From: Marc MERLIN @ 2012-07-27 4:40 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs

On Tue, Jul 24, 2012 at 09:56:26AM +0200, Martin Steigerwald wrote:
> find is fast, du is much slower:
>
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l )
> 404166
> ( find /usr | wc -l; )  0.03s user 0.07s system 1% cpu 9.212 total
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr )
> 11G     /usr
> ( du -sh /usr; )  1.00s user 19.07s system 41% cpu 48.886 total

You're right:

gandalfthegreat [mc]# time du -sh src
514M    src
real    0m25.159s
gandalfthegreat [mc]# reset_cache
gandalfthegreat [mc]# time bash -c "find src | wc -l"
15261
real    0m7.614s

But find is still slower than du on my spinning disk for the same tree.

> Anyway thats still much faster than your measurements.

Right, your numbers look reasonable for an SSD.

Since then I have done some more tests, and I'm also getting
slower-than-normal speeds with ext4, an indication that it's a problem
with the block layer. I'm working with some folks to pin down the core
problem, but it looks like it's not an easy one.

Thanks for your data,
Marc
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-23  6:42 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  2012-07-24  7:56 ` Martin Steigerwald
@ 2012-07-27 11:08 ` Chris Mason
  2012-07-27 18:42   ` Marc MERLIN
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Mason @ 2012-07-27 11:08 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs

On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote:
>
> 22 seconds for 15K files on an SSD is super slow and being 5 times
> slower than a spinning disk with the same data.
> What's going on?

Hi Marc,

The easiest way to figure it out is with latencytop. I'd either run the
latencytop GUI or use the latencytop -c patch, which sends a text dump
to the console. This is assuming that you're not pegged at 100% CPU...

https://oss.oracle.com/~mason/latencytop.patch

-chris
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-07-27 11:08 ` Chris Mason @ 2012-07-27 18:42 ` Marc MERLIN 2012-08-01 5:30 ` du -s src is a lot slower on SSD than spinning disk in the same laptop Marc MERLIN 0 siblings, 1 reply; 22+ messages in thread From: Marc MERLIN @ 2012-07-27 18:42 UTC (permalink / raw) To: Chris Mason, linux-btrfs On Fri, Jul 27, 2012 at 07:08:35AM -0400, Chris Mason wrote: > On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote: > > > > 22 seconds for 15K files on an SSD is super slow and being 5 times > > slower than a spinning disk with the same data. > > What's going on? > > Hi Marc, > > The easiest way to figure out is with latencytop. I'd either run the > latencytop gui or use the latencytop -c patch which sends a text dump to > the console. > > This is assuming that you're not pegged at 100% CPU... > > https://oss.oracle.com/~mason/latencytop.patch Thanks for the patch, and yes I can confirm I'm definitely not pegged on CPU (not even close and I get the same problem with unencrypted filesystem, actually du -sh is exactly the same speed on encrypted and unecrypted). Here's the result I think you were looking for. 
I'm not good at reading this, but hopefully it tells you something
useful :) The full run is here if that helps:
http://marc.merlins.org/tmp/latencytop.txt

Process du (6748)  Total: 4280.5 msec
Reading directory content        15.2 msec         11.0 %
    sleep_on_page wait_on_page_bit read_extent_buffer_pages
    btree_read_extent_buffer_pages.constprop.110 read_tree_block
    read_block_for_search.isra.32 btrfs_next_leaf btrfs_real_readdir
    vfs_readdir sys_getdents system_call_fastpath
[sleep_on_page]                  13.5 msec         88.2 %
    sleep_on_page wait_on_page_bit read_extent_buffer_pages
    btree_read_extent_buffer_pages.constprop.110 read_tree_block
    read_block_for_search.isra.32 btrfs_search_slot btrfs_lookup_csum
    __btrfs_lookup_bio_sums btrfs_lookup_bio_sums
    btrfs_submit_compressed_read btrfs_submit_bio_hook
Page fault                       12.9 msec          0.6 %
    sleep_on_page_killable wait_on_page_bit_killable __lock_page_or_retry
    filemap_fault __do_fault handle_pte_fault handle_mm_fault
    do_page_fault page_fault
Executing a program               7.1 msec          0.2 %
    sleep_on_page_killable __lock_page_killable generic_file_aio_read
    do_sync_read vfs_read kernel_read prepare_binprm
    do_execve_common.isra.27 do_execve sys_execve stub_execve

Process du (6748)  Total: 9517.4 msec
[sleep_on_page]                  23.0 msec         82.8 %
    sleep_on_page wait_on_page_bit read_extent_buffer_pages
    btree_read_extent_buffer_pages.constprop.110 read_tree_block
    read_block_for_search.isra.32 btrfs_search_slot btrfs_lookup_inode
    btrfs_iget btrfs_lookup_dentry btrfs_lookup __lookup_hash
Reading directory content        13.2 msec         17.2 %
    sleep_on_page wait_on_page_bit read_extent_buffer_pages
    btree_read_extent_buffer_pages.constprop.110 read_tree_block
    read_block_for_search.isra.32 btrfs_search_slot btrfs_real_readdir
    vfs_readdir sys_getdents system_call_fastpath

Process du (6748)  Total: 9524.0 msec
[sleep_on_page]                  17.1 msec         88.5 %
    sleep_on_page wait_on_page_bit read_extent_buffer_pages
    btree_read_extent_buffer_pages.constprop.110 read_tree_block
    read_block_for_search.isra.32 btrfs_search_slot btrfs_lookup_inode
    btrfs_iget btrfs_lookup_dentry btrfs_lookup __lookup_hash
Reading directory content        16.0 msec         11.5 %
    sleep_on_page wait_on_page_bit read_extent_buffer_pages
    btree_read_extent_buffer_pages.constprop.110 read_tree_block
    read_block_for_search.isra.32 btrfs_search_slot btrfs_real_readdir
    vfs_readdir sys_getdents system_call_fastpath

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread
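Every heavyweight entry in the dump funnels through read_tree_block, so du is mostly sleeping on btrfs metadata page reads. A small sketch (my own helper, not part of latencytop) that ranks causes in a latencytop-style dump by their share of wait time; the sample values are copied from the second "Process du" block above:

```shell
# Rank latencytop-style entries (cause / max msec / percent) by percent,
# biggest wait share first. Fields are separated by runs of spaces.
awk -F'  +' '{ print $3, $1 }' <<'EOF' | sort -rn | head -1
[sleep_on_page]      23.0 msec    82.8
Reading directory content    13.2 msec    17.2
EOF
```

Run against the full dump, this would show the inode-lookup path ([sleep_on_page] via btrfs_lookup_inode) dominating at ~83-88%.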
* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
@ 2012-08-01  5:30 ` Marc MERLIN
  2012-08-01  6:01   ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  0 siblings, 1 reply; 22+ messages in thread

From: Marc MERLIN @ 2012-08-01 5:30 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Wed, Jul 25, 2012 at 11:54:12PM -0700, Marc MERLIN wrote:
> and one on the HDD:
> 
> gandalfthegreat:/boot# find /boot/src | wc -l
> 6261
> gandalfthegreat:/boot# reset_cache; time du -sh /boot/src ; reset_cache; time du -sh /boot2/src
> 169M	/boot/src
> real	0m5.248s
> 
> 157M	/boot2/src
> real	0m0.698s
> 
> gandalfthegreat:/boot# df -h /boot /boot2
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1       241M  226M  2.5M  99% /boot
> /dev/sdb2       298M  273M  9.3M  97% /boot2
> gandalfthegreat:/boot# grep boot /proc/mounts
> /dev/sdb2 /boot2 ext4 rw,relatime,stripe=4,data=ordered 0 0
> /dev/sda1 /boot ext4 rw,relatime,discard,stripe=32,data=ordered 0 0
> 
> The only sizeable difference is that sda1 has a 4K block size I forced
> to help with faster writes on the SSD.
> 
> In the example above, the SSD is almost 10 times slower than the HDD.
> 
> blktrace -d /dev/sda -o - | blkparse -i - during du is at
> http://marc.merlins.org/tmp/blktrace_du_sda1.txt
> 
> Does that give enough info on what's going on?

(TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x
faster than btrfs, but 3x slower than the same filesystem on spinning
disk :( )

Ok, just to help with debugging this,

1) I put my Samsung 830 SSD into another thinkpad and it wasn't faster
   or slower.

2) Then I put in a Crucial 256 C300 SSD (the replacement for the one I
   had that just died and killed all my data), and du took 0.3 seconds
   on both my old and new thinkpads. The old thinkpad is running Ubuntu
   32bit, the new one Debian testing 64bit, both with kernel 3.4.4.
So, clearly, there is something wrong with the Samsung 830 SSD under
linux, but I have no clue what :(

In raw speed (dd) the Samsung is faster than the Crucial (500MB/s vs
350MB/s). If it were a random crappy SSD from a random vendor, I'd
blame the SSD, but I have a hard time believing that Samsung is selling
SSDs that are slower than hard drives at random IO and 'seeks'.

Mmmh, and to do more tests, I eventually got a 2nd SSD from Samsung
(same kind), just to make sure the one I had wasn't bad. Unfortunately
the results are similar. I upgraded to 3.5.0 in the meantime:

First: btrfs is the slowest:
gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
514M	src/
real	0m25.741s

Second: ext4, with mkfs.ext4 -O extent -b 4096 /dev/sda3
gandalfthegreat:/mnt/mnt3# reset_cache
gandalfthegreat:/mnt/mnt3# time du -sh src/
519M	src/
real	0m12.459s
gandalfthegreat:~# grep mnt3 /proc/mounts
/dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0

Third: a freshly made ntfs filesystem through fuse is actually FASTER!
gandalfthegreat:/mnt/mnt2# reset_cache
gandalfthegreat:/mnt/mnt2# time du -sh src/
506M	src/
real	0m8.928s
gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
/dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0

How can ntfs via fuse be the fastest?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread
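A note for anyone reproducing these timings: reset_cache is a local helper that the thread never defines. Judging from the drop_caches invocations shown elsewhere in the thread, it is presumably equivalent to the following sketch (an assumption on my part; requires root):

```shell
# Hypothetical reconstruction of the reset_cache helper used above:
# flush dirty pages, then drop the page/dentry/inode caches so the
# next du has to go to disk (cold-cache measurement).
reset_cache() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}
```

Without the drop_caches step, a second du run would be served entirely from the page cache and tell you nothing about the device.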
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  5:30 ` du -s src is a lot slower on SSD than spinning disk in the same laptop Marc MERLIN
@ 2012-08-01  6:01 ` Marc MERLIN
  2012-08-01  6:08   ` Fajar A. Nugraha
  2012-08-01  6:36   ` Chris Samuel
  0 siblings, 2 replies; 22+ messages in thread

From: Marc MERLIN @ 2012-08-01 6:01 UTC (permalink / raw)
To: Chris Mason, linux-btrfs, Ted Ts'o
Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Fri, Jul 27, 2012 at 11:42:39AM -0700, Marc MERLIN wrote:
> > https://oss.oracle.com/~mason/latencytop.patch
> 
> Thanks for the patch, and yes I can confirm I'm definitely not pegged on CPU
> (not even close and I get the same problem with an unencrypted filesystem,
> actually du -sh is exactly the same speed on encrypted and unencrypted).
> 
> Here's the result I think you were looking for. I'm not good at reading this,
> but hopefully it tells you something useful :)
> 
> The full run is here if that helps:
> http://marc.merlins.org/tmp/latencytop.txt

I did some other tests since last week, since my laptop is hard to use
considering how slow the SSD is.

(TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x
faster than btrfs, but 3x slower than the same filesystem on spinning
disk :( )

Ok, just to help with debugging this,

1) I put my Samsung 830 SSD into another thinkpad and it wasn't faster
   or slower.

2) Then I put in a Crucial 256 C300 SSD (the replacement for the one I
   had that just died and killed all my data), and du took 0.3 seconds
   on both my old and new thinkpads. The old thinkpad is running Ubuntu
   32bit, the new one Debian testing 64bit, both with kernel 3.4.4.

So, clearly, there is something wrong with the Samsung 830 SSD under
linux, but I have no clue what :( In raw speed (dd) the Samsung is
faster than the Crucial (500MB/s vs 350MB/s). If it were a random
crappy SSD from a random vendor, I'd blame the SSD, but I have a hard
time believing that Samsung is selling SSDs that are slower than hard
drives at random IO and 'seeks'.

3) I just got a 2nd SSD from Samsung (same kind), just to make sure the
   one I had wasn't bad. It's brand new, and I formatted it carefully
   on 512 boundaries:
/dev/sda1            2048      502271      250112   83  Linux
/dev/sda2          502272    52930559    26214144    7  HPFS/NTFS/exFAT
/dev/sda3        52930560    73902079    10485760   82  Linux swap / Solaris
/dev/sda4        73902080  1000215215   463156568   83  Linux

I also upgraded to 3.5.0 in the meantime, but unfortunately the results
are similar.

First: btrfs is the slowest:
gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
514M	src/
real	0m25.741s
gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
/dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0

Second: ext4 is 2x faster than btrfs, with mkfs.ext4 -O extent -b 4096 /dev/sda3
gandalfthegreat:/mnt/mnt3# reset_cache
gandalfthegreat:/mnt/mnt3# time du -sh src/
519M	src/
real	0m12.459s
gandalfthegreat:~# grep mnt3 /proc/mounts
/dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0

Third: a freshly made ntfs filesystem through fuse is actually FASTER!
gandalfthegreat:/mnt/mnt2# reset_cache
gandalfthegreat:/mnt/mnt2# time du -sh src/
506M	src/
real	0m8.928s
gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
/dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0

How can ntfs via fuse be the fastest and btrfs so slow? Of course, all
3 are slower than the same filesystem on spinning disk too, but I'm
wondering if there is a scheduling issue that is somehow causing the
extreme slowness I'm seeing.

Did the latencytop trace I got help in any way?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread
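To put the three cold-cache du timings above on one scale, assuming the same ~15261-file src tree counted earlier in the thread, they work out to roughly the following stat() throughput (a back-of-the-envelope sketch):

```shell
# Convert each cold-cache "time du -sh src/" result above into an
# approximate stats-per-second figure, using the 15261-file count
# reported earlier in the thread for the same tree.
files=15261
for fs_time in btrfs:25.741 ext4:12.459 ntfs-fuse:8.928; do
    fs=${fs_time%%:*}
    t=${fs_time##*:}
    awk -v fs="$fs" -v n="$files" -v t="$t" \
        'BEGIN { printf "%-10s %4.0f stats/s\n", fs, n / t }'
done
```

Even the fastest case (ntfs via fuse, ~1700 stats/s) is far below what an SSD serving sub-millisecond metadata reads should sustain, which is consistent with the per-request latency numbers that show up later in the thread.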
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:01 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
@ 2012-08-01  6:08 ` Fajar A. Nugraha
  2012-08-01  6:21   ` Marc MERLIN
  2012-08-01  6:36 ` Chris Samuel
  1 sibling, 1 reply; 22+ messages in thread

From: Fajar A. Nugraha @ 2012-08-01 6:08 UTC (permalink / raw)
To: Marc MERLIN; +Cc: linux-btrfs

On Wed, Aug 1, 2012 at 1:01 PM, Marc MERLIN <marc@merlins.org> wrote:
> So, clearly, there is something wrong with the samsung 830 SSD with linux
> If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
> I have a hard time believing that samsung is selling SSDs that are slower
> than hard drives at random IO and 'seeks'.

You'd be surprised at how badly some vendors can screw up :)

> First: btrfs is the slowest:
> gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
> /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0

Just checking: did you explicitly activate "discard"? Because on my
setup (with a Corsair SSD) it made things MUCH slower. Also, try adding
"noatime" (just in case the slowdown was because "du" causes many
access time updates).

-- 
Fajar

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:08 ` Fajar A. Nugraha
@ 2012-08-01  6:21 ` Marc MERLIN
  2012-08-01 21:57   ` Martin Steigerwald
  0 siblings, 1 reply; 22+ messages in thread

From: Marc MERLIN @ 2012-08-01 6:21 UTC (permalink / raw)
To: Fajar A. Nugraha; +Cc: linux-btrfs

On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
> > I have a hard time believing that samsung is selling SSDs that are slower
> > than hard drives at random IO and 'seeks'.
> 
> You'd be surprised at how badly some vendors can screw up :)

At some point, it may come down to that indeed :-/
I'm still hopeful that Samsung didn't, but we'll see.

> > First: btrfs is the slowest:
> > gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
> > /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0
> 
> Just checking: did you explicitly activate "discard"? Because on my

Yes. Note that it should be a no-op when all you're doing is stating
inodes and not writing (I'm using noatime).

> setup (with a Corsair SSD) it made things MUCH slower. Also, try adding
> "noatime" (just in case the slowdown was because "du" causes many
> access time updates).

I have noatime in there already :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:21 ` Marc MERLIN
@ 2012-08-01 21:57 ` Martin Steigerwald
  2012-08-02  5:07   ` Marc MERLIN
  0 siblings, 1 reply; 22+ messages in thread

From: Martin Steigerwald @ 2012-08-01 21:57 UTC (permalink / raw)
To: linux-btrfs; +Cc: Marc MERLIN, Fajar A. Nugraha

Hi Marc,

On Wednesday, 1 August 2012, Marc MERLIN wrote:
> On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > > If it were a random crappy SSD from a random vendor, I'd blame the
> > > SSD, but I have a hard time believing that samsung is selling SSDs
> > > that are slower than hard drives at random IO and 'seeks'.
> > 
> > You'd be surprised at how badly some vendors can screw up :)
> 
> At some point, it may come down to that indeed :-/
> I'm still hopeful that Samsung didn't, but we'll see.

It's getting quite strange.

I lost track of whether you did that already or not, but if you didn't,
please post some

vmstat 1
iostat -xd 1

output on the device while it is being slow. I am interested in wait
I/O, latencies and disk utilization.
Comparison data of an Intel SSD 320 in a ThinkPad T520 during

merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; du -sch /usr

on BTRFS with kernel 3.5:

martin@merkaba:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 2  1  21556 4442668   2056 502352    0    0   194    85   247   120 11  2 87  0
 1  2  21556 4408888   2448 514884    0    0 11684   328  4975 24585  5 16 65 14
 1  0  21556 4389880   2448 528060    0    0 13400     0  4574 23452  2 16 68 14
 3  1  21556 4370068   2448 545052    0    0 18132     0  5499 27220  1 18 64 16
 1  0  21556 4350228   2448 555580    0    0 10856     0  4122 25339  3 16 67 14
 1  1  21556 4315604   2448 569756    0    0 12648     0  4647 31153  5 14 66 15
 0  1  21556 4295652   2456 581480    0    0 11548    56  4093 24618  2 13 69 16
 0  1  21556 4286720   2456 591580    0    0 10824     0  3750 21445  1 12 71 16
 0  1  21556 4266308   2456 603620    0    0 12932     0  4841 26447  4 12 68 17
 1  0  21556 4248228   2456 613808    0    0 10264     4  3703 22108  1 13 71 15
 5  1  21556 4231976   2456 624356    0    0 10540     0  3581 20436  1 10 72 17
 0  1  21556 4197168   2456 639108    0    0 12952     0  4738 28223  4 15 66 15
 4  1  21556 4178456   2456 650552    0    0 11656     0  4234 23480  2 14 68 16
 0  1  21556 4163616   2456 662992    0    0 13652     0  4619 26580  1 16 70 13
 4  1  21556 4138288   2456 675696    0    0 13352     0  4422 22254  1 16 70 13
 1  0  21556 4113204   2456 689060    0    0 13232     0  4312 21936  1 15 70 14
 0  1  21556 4085532   2456 704160    0    0 14972     0  4820 24238  1 16 69 14
 2  0  21556 4055740   2456 719644    0    0 15736     0  5099 25513  3 17 66 14
 0  1  21556 4028612   2456 734380    0    0 14504     0  4795 25052  3 15 68 14
 2  0  21556 3999108   2456 749040    0    0 14656    16  4672 21878  1 17 69 13
 1  1  21556 3972732   2456 762108    0    0 12972     0  4717 22411  1 17 70 13
 5  0  21556 3949684   2584 773484    0    0 11528    52  4837 24107  3 15 67 15
 1  0  21556 3912504   2584 787420    0    0 12156     0  4883 25201  4 15 67 14

martin@merkaba:~> iostat -xd 1 /dev/sda
Linux 3.5.0-tp520 (merkaba) 	01.08.2012 	_x86_64_	(4 CPU)

Device:  rrqm/s  wrqm/s      r/s    w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        1,29    1,44    11,58  12,78    684,74   299,75    80,81     0,24    9,86    0,95   17,93   0,29   0,71
sda        0,00    0,00  2808,00   0,00  11232,00     0,00     8,00     0,57    0,21    0,21    0,00   0,19  54,50
sda        0,00    0,00  2967,00   0,00  11868,00     0,00     8,00     0,63    0,21    0,21    0,00   0,21  60,90
sda        0,00   11,00  2992,00   4,00  11968,00    56,00     8,03     0,64    0,22    0,22    0,25   0,21  62,00
sda        0,00    0,00  2680,00   0,00  10720,00     0,00     8,00     0,70    0,26    0,26    0,00   0,25  66,70
sda        0,00    0,00  3153,00   0,00  12612,00     0,00     8,00     0,72    0,23    0,23    0,00   0,22  69,30
sda        0,00    0,00  2769,00   0,00  11076,00     0,00     8,00     0,63    0,23    0,23    0,00   0,21  58,00
sda        0,00    0,00  2523,00   1,00  10092,00     4,00     8,00     0,74    0,29    0,29    0,00   0,28  71,30
sda        0,00    0,00  3026,00   0,00  12104,00     0,00     8,00     0,73    0,24    0,24    0,00   0,21  64,00
sda        0,00    0,00  3069,00   0,00  12276,00     0,00     8,00     0,67    0,22    0,22    0,00   0,20  62,00
sda        0,00    0,00  3346,00   0,00  13384,00     0,00     8,00     0,64    0,19    0,19    0,00   0,18  59,90
sda        0,00    0,00  3188,00   0,00  12752,00     0,00     8,00     0,80    0,25    0,25    0,00   0,17  54,00
sda        0,00    0,00  3433,00   0,00  13732,00     0,00     8,00     1,03    0,30    0,30    0,00   0,17  57,00
sda        0,00    0,00  3565,00   0,00  14260,00     0,00     8,00     0,92    0,26    0,26    0,00   0,16  57,30
sda        0,00    0,00  3972,00   0,00  15888,00     0,00     8,00     1,13    0,29    0,29    0,00   0,16  62,90
sda        0,00    0,00  3743,00   0,00  14972,00     0,00     8,00     1,03    0,28    0,28    0,00   0,16  59,40
sda        0,00    0,00  3408,00   0,00  13632,00     0,00     8,00     1,08    0,32    0,32    0,00   0,17  56,70
sda        0,00    0,00  3730,00   3,00  14920,00    16,00     8,00     1,14    0,31    0,31    0,00   0,15  56,30

I also suggest using fio with the ssd-test example on the SSD. I have
some comparison data available for my setup; heck, it should be
publicly available in my ADMIN magazine article about fio. I used
slightly different fio jobs with block sizes of 2k to 16k, but it's
similar enough, and I might even have some 4k examples at hand or can
easily create one. I also raised size and duration a bit.

An example based on what's in my article:

[global]
ioengine=libaio
direct=1
iodepth=64
runtime=60
filename=testfile
size=2G
bsrange=2k-16k
refill_buffers=1

[randomwrite]
stonewall
rw=randwrite

[sequentialwrite]
stonewall
rw=write

[randomread]
stonewall
rw=randread

[sequentialread]
stonewall
rw=read

Remove bsrange if you want 4k blocks only. I put the writes above the
reads because the writes, with refill_buffers=1, initialize the test
file with random data. It would otherwise contain only zeros, which
would compress nicely on modern SandForce controllers. The above job is
untested, but it should do.
Please remove the test file and fstrim your disk after having run any
write tests, unless you have the discard mount option on. (Not
necessarily after each test ;)

Also I am interested in

merkaba:~> hdparm -I /dev/sda | grep -i queue
	   Queue depth: 32
	   *	Native Command Queueing (NCQ)

output for your SSD.

Try to load the SSD with different iodepths, up to twice the amount
displayed by hdparm. But note: a du -sch will not reach that high an
iodepth. I expect it to get as low as an iodepth of one in that case.
You see in my comparison output that a single du -sch is not able to
load the Intel SSD 320 fully; load is just about 50-70%, and iowait is
just about 10-20%.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 22+ messages in thread
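Martin's point about iodepth can be made quantitative with Little's law: in-flight requests = IOPS x service time. A single-threaded du keeps roughly one read outstanding, so its throughput is bounded by 1/latency no matter how deep NCQ is. A sketch with illustrative latencies (the ~0.2 ms figure matches the Intel iostat await output above; the ~9 ms figure is what Marc's iostat shows later in the thread):

```shell
# Little's law: max IOPS ~= in-flight requests / per-request latency.
# A serial stat() workload keeps ~1 request in flight.
max_iops() {  # usage: max_iops <queue_depth> <latency_seconds>
    awk -v qd="$1" -v lat="$2" 'BEGIN { printf "%.0f\n", qd / lat }'
}
max_iops 1 0.0002   # ~0.2 ms per metadata read (healthy SSD)
max_iops 1 0.009    # ~9 ms per metadata read (what Marc's r_await implies)
```

At 9 ms per read, even a fully "busy" device (%util near 100) can only complete about a hundred serial requests per second, which is exactly the 23-25 second du behavior being reported.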
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01 21:57 ` Martin Steigerwald
@ 2012-08-02  5:07 ` Marc MERLIN
  2012-08-02 11:18   ` Martin Steigerwald
  2012-08-02 11:25   ` Martin Steigerwald
  0 siblings, 2 replies; 22+ messages in thread

From: Marc MERLIN @ 2012-08-02 5:07 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> It's getting quite strange.

I would agree :)

Before I paste a bunch of things, I wanted to thank you for not giving
up on me and offering your time to help me figure this out :)

> I lost track of whether you did that already or not, but if you didn't,
> please post some
> 
> vmstat 1
> iostat -xd 1
> 
> output on the device while it is being slow.

Sure thing, here's the 24 second du -s run:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 2  1      0 2747264     44 348388    0    0    28    50   247   184 19  6 74  1
 1  0      0 2744128     44 351700    0    0   144     0  2758 32115 30  5 61  4
 2  1      0 2743100     44 351992    0    0   792     0  2616 30613 28  4 50 18
 1  1      0 2741592     44 352668    0    0   776     0  2574 31551 29  4 45 21
 1  1      0 2740720     44 353432    0    0   692     0  2734 32891 30  4 45 22
 1  1      0 2740104     44 354284    0    0   460     0  2639 31585 30  4 45 21
 3  1      0 2738520     44 354692    0    0   544   264  2834 30302 32  5 42 21
 1  1      0 2736936     44 355476    0    0  1064  2012  2867 31172 28  4 45 23
 3  1      0 2734892     44 356840    0    0  1236     0  2865 31100 30  3 46 21
 3  1      0 2733840     44 357432    0    0   636    12  2930 31362 30  4 45 21
 3  1      0 2735152     44 357892    0    0   612     0  2546 31641 28  4 46 22
 3  1      0 2733608     44 358680    0    0   668     0  2846 29747 30  3 46 22
 2  1      0 2731480     44 359556    0    0   712   248  2744 27939 31  3 44 22
 2  1      0 2732688     44 360736    0    0  1204     0  2774 30979 30  3 46 21
 4  1      0 2730276     44 361872    0    0   872   256  3174 31657 30  5 45 20
 1  1      0 2728372     44 363008    0    0   912   132  2752 31811 30  3 46 22
 1  1      0 2727060     44 363240    0    0   476     0  2595 30313 33  4 43 21
 1  1      0 2725856     44 364000    0    0   580     0  2547 30803 29  4 46 21
 1  1      0 2723628     44 364724    0    0   756   272  2766 30652 31  4 44 22
 2  1      0 2722128     44 365616    0    0   744     8  2771 31760 30  4 45 21
 2  1      0 2721024     44 366284    0    0   788   196  2695 31819 31  4 44 22
 3  1      0 2719312     44 367084    0    0   732     0  2737 32038 30  5 45 20
 1  1      0 2717436     44 367880    0    0   728     0  2570 30810 29  3 46 22
 3  1      0 2716820     44 368500    0    0   608     0  2614 30949 29  4 45 22
 2  1      0 2715576     44 369240    0    0   628   264  2715 31180 34  4 40 22
 1  1      0 2714308     44 369952    0    0   624     0  2533 31155 28  4 46 22

Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 	08/01/2012 	_x86_64_	(4 CPU)

rrqm/s wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
  2.18   0.68   6.45  17.77   78.12  153.39    19.12     0.18   7.51    8.52    7.15   4.46  10.81
  0.00   0.00 118.00   0.00  540.00    0.00     9.15     1.18   9.93    9.93    0.00   4.98  58.80
  0.00   0.00 217.00   0.00  868.00    0.00     8.00     1.90   8.77    8.77    0.00   4.44  96.40
  0.00   0.00 192.00   0.00  768.00    0.00     8.00     1.63   8.44    8.44    0.00   5.10  98.00
  0.00   0.00 119.00   0.00  476.00    0.00     8.00     1.06   9.01    9.01    0.00   8.20  97.60
  0.00   0.00 125.00   0.00  500.00    0.00     8.00     1.08   8.67    8.67    0.00   7.55  94.40
  0.00   0.00 165.00   0.00  660.00    0.00     8.00     1.50   9.12    9.12    0.00   5.87  96.80
  0.00   0.00 195.00  13.00  780.00  272.00    10.12     1.68   8.10    7.94   10.46   4.65  96.80
  0.00   0.00 173.00   0.00  692.00    0.00     8.00     1.72   9.87    9.87    0.00   5.71  98.80
  0.00   0.00 171.00   0.00  684.00    0.00     8.00     1.62   9.33    9.33    0.00   5.75  98.40
  0.00   0.00 161.00   0.00  644.00    0.00     8.00     1.52   9.57    9.57    0.00   6.14  98.80
  0.00   0.00 136.00   0.00  544.00    0.00     8.00     1.26   9.29    9.29    0.00   7.24  98.40
  0.00   0.00 199.00   0.00  796.00    0.00     8.00     1.94   9.73    9.73    0.00   4.94  98.40
  0.00   0.00 201.00   0.00  804.00    0.00     8.00     1.70   8.54    8.54    0.00   4.80  96.40
  0.00   0.00 272.00  15.00 1088.00  272.00     9.48     2.35   8.21    8.46    3.73   3.39  97.20
  0.00   0.00 208.00   1.00  832.00    4.00     8.00     1.90   9.01    9.04    4.00   4.67  97.60
  0.00   0.00 228.00   0.00  912.00    0.00     8.00     1.94   8.56    8.56    0.00   4.23  96.40
  0.00   0.00 107.00   0.00  428.00    0.00     8.00     0.98   9.05    9.05    0.00   9.12  97.60
  0.00   0.00 163.00   0.00  652.00    0.00     8.00     1.60   9.89    9.89    0.00   6.01  98.00
  0.00   0.00 162.00   0.00  648.00    0.00     8.00     1.83  10.20   10.20    0.00   6.07  98.40
  1.00  10.00 224.00  71.00  900.00 2157.50    20.73     2.82  10.14    9.50   12.17   3.35  98.80
 41.00   8.00 166.00 176.00  832.00  333.50     6.82     2.48   7.24    8.48    6.07   2.92 100.00
 18.00   0.00  78.00  72.00  388.00   36.00     5.65     2.30  15.28   16.56   13.89   6.67 100.00
 13.00   0.00 190.00  25.00  824.00   12.50     7.78     1.95   9.17    9.43    7.20   4.58  98.40
  0.00   0.00 156.00   0.00  624.00    0.00     8.00     1.36   8.72    8.72    0.00   6.26  97.60
  0.00   0.00 158.00  19.00  632.00   64.00     7.86     1.50   8.45    8.61    7.16   5.60  99.20
  0.00   0.00 176.00   0.00  704.00    0.00     8.00     1.65   9.39    9.39    0.00   5.61  98.80
  0.00   0.00  44.00   0.00  176.00    0.00     8.00     0.38   8.55    8.55    0.00   8.27  36.40

> I am interested in wait I/O, latencies and disk utilization.

Cool tool, I didn't know about iostat. My r_await numbers don't look
good, obviously, and yet %util is pretty much 100% the entire time.

Does that show that it's indeed the device that is unable to deliver
the requests any quicker, despite being an SSD, or are you reading this
differently?

> Also I am interested in
> merkaba:~> hdparm -I /dev/sda | grep -i queue
> 	   Queue depth: 32
> 	   *	Native Command Queueing (NCQ)
> output for your SSD.

gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue
	   Queue depth: 32
	   *	Native Command Queueing (NCQ)
gandalfthegreat:/var/local#

I've run the fio tests in:
/dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
(discard is there, so fstrim shouldn't be needed)

> I also suggest using fio with the ssd-test example on the SSD. I have
> some comparison data available for my setup. Heck, it should be
> publicly available in my ADMIN magazine article about fio. I used
> slightly different fio jobs with block sizes of 2k to 16k, but it's
> similar enough, and I might even have some 4k examples at hand or can
> easily create one. I also raised size and duration a bit.
> 
> An example based on what's in my article:

Thanks, here's the output. I see bw jumping from bw=1700.8KB/s all the
way to bw=474684KB/s depending on the IO size.
That's "interesting", to say the least. So, doctor, is it bad? :)

randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s]

randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
  write: io=102048KB, bw=1700.8KB/s, iops=189, runt= 60003msec
    slat (usec): min=21, max=219834, avg=5250.91, stdev=5936.55
    clat (usec): min=25, max=738932, avg=329339.45, stdev=106004.63
     lat (msec): min=4, max=751, avg=334.59, stdev=107.57
    clat percentiles (msec):
     |  1.00th=[  225],  5.00th=[  241], 10.00th=[  247], 20.00th=[  260],
     | 30.00th=[  269], 40.00th=[  281], 50.00th=[  293], 60.00th=[  306],
     | 70.00th=[  322], 80.00th=[  351], 90.00th=[  545], 95.00th=[  570],
     | 99.00th=[  627], 99.50th=[  644], 99.90th=[  709], 99.95th=[  725],
     | 99.99th=[  742]
    bw (KB/s): min=11, max=2591, per=99.83%, avg=1697.13, stdev=491.48
    lat (usec): 50=0.01%
    lat (msec): 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
    lat (msec): 500=72.44%, 750=14.43%
  cpu          : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=11396/d=0, short=r=0/w=0/d=0

sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218
  write: io=80994KB, bw=1349.7KB/s, iops=149, runt= 60011msec
    slat (usec): min=15, max=296798, avg=6658.06, stdev=7414.10
    clat (usec): min=15, max=962319, avg=418984.76, stdev=128626.12
     lat (msec): min=12, max=967, avg=425.64, stdev=130.36
    clat percentiles (msec):
     |  1.00th=[  269],  5.00th=[  306], 10.00th=[  322], 20.00th=[  338],
     | 30.00th=[  355], 40.00th=[  371], 50.00th=[  388], 60.00th=[  404],
     | 70.00th=[  420], 80.00th=[  445], 90.00th=[  603], 95.00th=[  766],
     | 99.00th=[  881], 99.50th=[  906], 99.90th=[  938], 99.95th=[  947],
     | 99.99th=[  963]
    bw (KB/s): min=418, max=1952, per=99.31%, avg=1339.72, stdev=354.39
    lat (usec): 20=0.01%
    lat (msec): 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93%
    lat (msec): 750=6.87%, 1000=5.51%
  cpu          : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=8991/d=0, short=r=0/w=0/d=0

randomread: (groupid=2, jobs=1): err= 0: pid=7234
  read : io=473982KB, bw=7899.5KB/s, iops=934, runt= 60005msec
    slat (usec): min=2, max=31048, avg=1065.57, stdev=3519.63
    clat (usec): min=21, max=246981, avg=67367.84, stdev=33459.21
     lat (msec): min=1, max=263, avg=68.43, stdev=33.81
    clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   25], 10.00th=[   33], 20.00th=[   39],
     | 30.00th=[   50], 40.00th=[   57], 50.00th=[   61], 60.00th=[   70],
     | 70.00th=[   76], 80.00th=[   89], 90.00th=[  111], 95.00th=[  139],
     | 99.00th=[  176], 99.50th=[  190], 99.90th=[  217], 99.95th=[  229],
     | 99.99th=[  247]
    bw (KB/s): min=2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68
    lat (usec): 50=0.01%
    lat (msec): 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44%
    lat (msec): 100=55.05%, 250=14.34%
  cpu          : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=56073/w=0/d=0, short=r=0/w=0/d=0

sequentialread: (groupid=3, jobs=1): err= 0: pid=7249
  read : io=2048.0MB, bw=474684KB/s, iops=52754, runt= 4418msec
    slat (usec): min=1, max=44869, avg=15.35, stdev=320.96
    clat (usec): min=0, max=67787, avg=1145.60, stdev=3487.25
     lat (usec): min=2, max=67801, avg=1161.14, stdev=3508.06
    clat percentiles (usec):
     |  1.00th=[   10],  5.00th=[  187], 10.00th=[  217], 20.00th=[  249],
     | 30.00th=[  278], 40.00th=[  302], 50.00th=[  330], 60.00th=[  362],
     | 70.00th=[  398], 80.00th=[  450], 90.00th=[  716], 95.00th=[ 6624],
     | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696],
     | 99.99th=[67072]
    bw (KB/s): min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18
    lat (usec): 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36%
    lat (usec): 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52%
    lat (msec): 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45%
    lat (msec): 100=0.05%
  cpu          : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=233071/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec

Run status group 1 (all jobs):
  WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec

Run status group 2 (all jobs):
  READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec

Run status group 3 (all jobs):
  READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread
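One quick sanity check on the fio summary above: bandwidth divided by IOPS gives the average request size, which should land inside the job's 2k-16k bsrange if the figures are self-consistent. A sketch using the numbers from this run:

```shell
# Average request size in KB = (bandwidth in KB/s) / IOPS,
# using the bw/iops pairs from the fio output above.
avg_req() {  # usage: avg_req <bw_KB_per_s> <iops>
    awk -v bw="$1" -v iops="$2" 'BEGIN { printf "%.1f\n", bw / iops }'
}
avg_req 1700.8 189       # randomwrite
avg_req 1349.7 149       # sequentialwrite
avg_req 7899.5 934       # randomread
avg_req 474684 52754     # sequentialread
```

All four come out around 8-9 KB, mid-bsrange, so the numbers hang together. The telling figure is then the 934 random-read IOPS at iodepth 64: SATA SSDs of that generation are typically capable of tens of thousands of random-read IOPS, so this is consistent with the du slowness being device-side read latency rather than a filesystem problem.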
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02  5:07             ` Marc MERLIN
@ 2012-08-02 11:18               ` Martin Steigerwald
  2012-08-02 17:39                 ` Marc MERLIN
  2012-08-02 11:25               ` Martin Steigerwald
  1 sibling, 1 reply; 22+ messages in thread
From: Martin Steigerwald @ 2012-08-02 11:18 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha

On Thursday, 2 August 2012, Marc MERLIN wrote:
> On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> > Its getting quite strange.
> 
> I would agree :)
> 
> Before I paste a bunch of things, I wanted to thank you for not giving up
> on me and offering your time to help me figure this out :)

You are welcome. Well, I hold Linux performance analysis & tuning
trainings and I am really interested in issues like this ;) I will take
care of myself: I take my time to respond, and may stop responding
altogether if I run out of ideas ;)

> > I lost track of whether you did that already or not, but if you didn't
> > please post some
> > 
> > vmstat 1
> > iostat -xd 1
> > 
> > on the device while it is being slow.
> 
> Sure thing, here's the 24 second du -s run:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free  buff  cache   si   so    bi    bo   in    cs us sy id wa
>  2  1      0 2747264    44 348388    0    0    28    50  242   184 19  6 74  1
>  1  0      0 2744128    44 351700    0    0   144     0 2758 32115 30  5 61  4
>  2  1      0 2743100    44 351992    0    0   792     0 2616 30613 28  4 50 18
>  1  1      0 2741592    44 352668    0    0   776     0 2574 31551 29  4 45 21
>  1  1      0 2740720    44 353432    0    0   692     0 2734 32891 30  4 45 22
>  1  1      0 2740104    44 354284    0    0   460     0 2639 31585 30  4 45 21
>  3  1      0 2738520    44 354692    0    0   544   264 2834 30302 32  5 42 21
>  1  1      0 2736936    44 355476    0    0  1064  2012 2867 31172 28  4 45 23

A bit more wait I/O, with not even 10% of the throughput compared to the
Intel SSD 320 figures. It seems that Intel SSD is running circles around
your Samsung SSD, while – as expected for this use case – your SSD is not
even being fully utilized.
> Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 08/01/2012 _x86_64_ (4 CPU)
> rrqm/s wrqm/s    r/s   w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>   2.18   0.68   6.45 17.77   78.12  153.39    19.12     0.18   7.51    8.52    7.15   4.46  10.81
>   0.00   0.00 118.00  0.00  540.00    0.00     9.15     1.18   9.93    9.93    0.00   4.98  58.80
>   0.00   0.00 217.00  0.00  868.00    0.00     8.00     1.90   8.77    8.77    0.00   4.44  96.40
>   0.00   0.00 192.00  0.00  768.00    0.00     8.00     1.63   8.44    8.44    0.00   5.10  98.00
>   0.00   0.00 119.00  0.00  476.00    0.00     8.00     1.06   9.01    9.01    0.00   8.20  97.60
>   0.00   0.00 125.00  0.00  500.00    0.00     8.00     1.08   8.67    8.67    0.00   7.55  94.40
>   0.00   0.00 165.00  0.00  660.00    0.00     8.00     1.50   9.12    9.12    0.00   5.87  96.80
>   0.00   0.00 195.00 13.00  780.00  272.00    10.12     1.68   8.10    7.94   10.46   4.65  96.80
>   0.00   0.00 173.00  0.00  692.00    0.00     8.00     1.72   9.87    9.87    0.00   5.71  98.80
>   0.00   0.00 171.00  0.00  684.00    0.00     8.00     1.62   9.33    9.33    0.00   5.75  98.40
>   0.00   0.00 161.00  0.00  644.00    0.00     8.00     1.52   9.57    9.57    0.00   6.14  98.80
>   0.00   0.00 136.00  0.00  544.00    0.00     8.00     1.26   9.29    9.29    0.00   7.24  98.40
>   0.00   0.00 199.00  0.00  796.00    0.00     8.00     1.94   9.73    9.73    0.00   4.94  98.40
>   0.00   0.00 201.00  0.00  804.00    0.00     8.00     1.70   8.54    8.54    0.00   4.80  96.40
>   0.00   0.00 272.00 15.00 1088.00  272.00     9.48     2.35   8.21    8.46    3.73   3.39  97.20
[…]

> > I am interested in wait I/O and latencies and disk utilization.
> 
> Cool tool, I didn't know about iostat.
> My r_await numbers don't look good obviously, and yet %util is pretty
> much 100% the entire time.
> 
> Does that show that it's indeed the device that is unable to deliver the
> requests any quicker, despite being an SSD, or are you reading this
> differently?

That, or…

> > Also I am interested in
> > merkaba:~> hdparm -I /dev/sda | grep -i queue
> >         Queue depth: 32
> >    *    Native Command Queueing (NCQ)
> > output for your SSD.
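The iostat columns above can be cross-checked with a little arithmetic. avgrq-sz is reported in 512-byte sectors, so the steady value of 8 means pure 4 KiB requests (btrfs metadata reads), and roughly 170 such reads per second at near-100% utilization pins down the per-request service time. This sketch is my own addition, with the numbers taken from the rows above:

```shell
# avgrq-sz is in 512-byte sectors; 8 sectors per request:
echo "$((8 * 512 / 1024)) KiB per request"
# ~170 reads/s at ~100% utilization -> rough service time per read
# (the await column shows ~9 ms because it also includes queueing):
echo "$((1000 / 170)) ms per read, minimum"
```

In other words, the device is completing one small read at a time, each taking milliseconds, which is exactly the pattern that turns a 15K-file stat workload into a 20+ second wait.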
> 
> gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue
> 	Queue depth: 32
> 	   *	Native Command Queueing (NCQ)
> gandalfthegreat:/var/local#
> 
> I've run the fio tests in:
> /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0

… you are still using dm_crypt? Please test without dm_crypt. My figures
are from within LVM, but no dm_crypt. It's good to have a comparable base
for the measurements.

> (discard is there, so fstrim shouldn't be needed)

I can't imagine why it should matter, but maybe it's worth having some
tests without "discard".

> > I also suggest to use fio with the ssd-test example on the SSD. I
> > have some comparison data available for my setup. Heck, it should be
> > publicly available in my ADMIN magazine article about fio. I used a bit
> > different fio jobs with block sizes of 2k to 16k, but it's similar enough
> > and I might even have some 4k examples at hand or can easily create one.
> > I also raised size and duration a bit.
> > 
> > An example based on what's in my article:
> 
> Thanks, here's the output.
> I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending
> on the io size. That's "interesting" to say the least.
> 
> So, doctor, is it bad? :)

Well, I do not engineer those SSDs, but to me it looks like either
dm_crypt is wildly hogging performance or the SSD is way slower than what
I would expect. You did some tests without dm_crypt AFAIR, but it would be
interesting to repeat these tests without dm_crypt as well.
> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s]
> randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
>   write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
>     slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
>     clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
>      lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
>     clat percentiles (msec):
>      |  1.00th=[  225],  5.00th=[  241], 10.00th=[  247], 20.00th=[  260],
>      | 30.00th=[  269], 40.00th=[  281], 50.00th=[  293], 60.00th=[  306],
>      | 70.00th=[  322], 80.00th=[  351], 90.00th=[  545], 95.00th=[  570],
>      | 99.00th=[  627], 99.50th=[  644], 99.90th=[  709], 99.95th=[  725],
>      | 99.99th=[  742]
>     bw (KB/s)  : min=   11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48
>     lat (usec) : 50=0.01%
>     lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
>     lat (msec) : 500=72.44%, 750=14.43%

Gosh, look at these latencies!

72.44% of all requests above 500 (in words: five hundred) milliseconds!
And 14.43% above 750 msecs. The percentage of requests served in 100 msecs
or less was below one percent! Hey, is this an SSD or what?

Please really test without dm_crypt so that we can see whether it
introduces any kind of latency and if so in what amount.

But then let me compare. You are using iodepth=64, which fills the device
with requests at maximum speed. So this will likely increase latencies.
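The point that iodepth=64 inflates latencies can be sanity-checked with Little's law: with 64 requests permanently in flight, the mean completion latency is roughly the queue depth divided by the throughput in IOPS. This arithmetic sketch is my own addition, using the 189 IOPS from the randomwrite group above:

```shell
# Little's law: mean latency ~= outstanding requests / IOPS.
# randomwrite above: iodepth=64 at 189 IOPS.
echo "$((64 * 1000 / 189)) ms expected mean latency"
```

That lands in the same ballpark as the observed ~334 ms clat average, so the huge latencies here are mostly a consequence of the tiny IOPS figure, not an additional pathology on top of it.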
martin@merkaba:~[…]> fio iops-iodepth64.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
fio 1.57
Starting 4 processes
[…]
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=13103
  write: io=2048.0MB, bw=76940KB/s, iops=14032 , runt= 27257msec
    slat (usec): min=3 , max=6426 , avg=14.35, stdev=22.33
    clat (usec): min=82 , max=414190 , avg=4540.81, stdev=4996.47
     lat (usec): min=97 , max=414205 , avg=4555.53, stdev=4996.42
    bw (KB/s) : min=12890, max=112728, per=100.39%, avg=77240.48, stdev=20310.79
  cpu          : usr=9.67%, sys=29.47%, ctx=72053, majf=0, minf=532
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/382480/0, short=0/0/0
     lat (usec): 100=0.01%, 250=0.14%, 500=0.66%, 750=0.82%, 1000=1.52%
     lat (msec): 2=10.86%, 4=27.88%, 10=57.18%, 20=0.79%, 50=0.03%
     lat (msec): 100=0.07%, 250=0.04%, 500=0.01%

Still, even with iodepth 64, a totally different picture. And look at the
IOPS and throughput.

For reference, this refers to

[global]
ioengine=libaio
direct=1
iodepth=64
# To get random data over the complete length of the file,
# run the sequential job first
filename=testdatei
size=2G
bsrange=2k-16k
refill_buffers=1

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

[zufälligschreiben]
stonewall
rw=randwrite
runtime=60

[sequentiellschreiben]
stonewall
rw=write
runtime=60

but on Ext4 instead of BTRFS.

This could be another good test: test with Ext4 on a plain logical volume
without dm_crypt.
>   cpu          : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=11396/d=0, short=r=0/w=0/d=0
> sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218
>   write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec
>     slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10
>     clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12
>      lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36
>     clat percentiles (msec):
>      |  1.00th=[  269],  5.00th=[  306], 10.00th=[  322], 20.00th=[  338],
>      | 30.00th=[  355], 40.00th=[  371], 50.00th=[  388], 60.00th=[  404],
>      | 70.00th=[  420], 80.00th=[  445], 90.00th=[  603], 95.00th=[  766],
>      | 99.00th=[  881], 99.50th=[  906], 99.90th=[  938], 99.95th=[  947],
>      | 99.99th=[  963]
>     bw (KB/s)  : min=  418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39
>     lat (usec) : 20=0.01%
>     lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93%
>     lat (msec) : 750=6.87%, 1000=5.51%

Sequential write latencies are abysmal as well.
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=13105 write: io=2048.0MB, bw=155333KB/s, iops=17290 , runt= 13501msec slat (usec): min=2 , max=4299 , avg=13.51, stdev=18.91 clat (usec): min=334 , max=201706 , avg=3680.94, stdev=2088.53 lat (usec): min=340 , max=201718 , avg=3694.83, stdev=2088.28 bw (KB/s) : min=144952, max=162856, per=100.04%, avg=155394.92, stdev=5720.43 cpu : usr=16.77%, sys=31.17%, ctx=47790, majf=0, minf=535 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=0/233444/0, short=0/0/0 lat (usec): 500=0.01%, 750=0.02%, 1000=0.03% lat (msec): 2=0.55%, 4=77.52%, 10=21.86%, 20=0.01%, 100=0.01% lat (msec): 250=0.01% > cpu : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=0/w=8991/d=0, short=r=0/w=0/d=0 > randomread: (groupid=2, jobs=1): err= 0: pid=7234 > read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec > slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63 > clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21 > lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81 > clat percentiles (msec): > | 1.00th=[ 15], 5.00th=[ 25], 10.00th=[ 33], 20.00th=[ 39], > | 30.00th=[ 50], 40.00th=[ 57], 50.00th=[ 61], 60.00th=[ 70], > | 70.00th=[ 76], 80.00th=[ 89], 90.00th=[ 111], 95.00th=[ 139], > | 99.00th=[ 176], 99.50th=[ 190], 99.90th=[ 217], 99.95th=[ 229], > | 99.99th=[ 247] > bw (KB/s) : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68 > lat (usec) : 50=0.01% > lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44% > lat (msec) : 100=55.05%, 250=14.34% Even read latencies are 
really high. zufälliglesen: (groupid=0, jobs=1): err= 0: pid=13101 read : io=2048.0MB, bw=162545KB/s, iops=29719 , runt= 12902msec slat (usec): min=2 , max=1485 , avg=10.58, stdev=10.81 clat (usec): min=246 , max=18706 , avg=2140.05, stdev=1073.28 lat (usec): min=267 , max=18714 , avg=2150.97, stdev=1074.42 bw (KB/s) : min=108000, max=205060, per=101.08%, avg=164305.12, stdev=32494.26 cpu : usr=12.74%, sys=49.83%, ctx=82018, majf=0, minf=276 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=383439/0/0, short=0/0/0 lat (usec): 250=0.01%, 500=0.03%, 750=0.30%, 1000=3.21% lat (msec): 2=48.26%, 4=44.68%, 10=3.35%, 20=0.18% Here for comparison that 500 GB 2.5 inch eSATA Hitachi disk (5400rpm): merkaba:[…]> fio --readonly iops-gerät-lesend-iodepth64.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 fio 1.57 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [79411K/0K /s] [8617 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=32578 read : io=58290KB, bw=984592 B/s, iops=106 , runt= 60623msec slat (usec): min=3 , max=82 , avg=24.63, stdev= 6.64 clat (msec): min=40 , max=2825 , avg=602.78, stdev=374.10 lat (msec): min=40 , max=2825 , avg=602.80, stdev=374.10 bw (KB/s) : min= 0, max= 1172, per=65.52%, avg=629.66, stdev=466.50 cpu : usr=0.24%, sys=0.64%, ctx=6443, majf=0, minf=275 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.5%, >=64=99.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued r/w/d: total=6435/0/0, short=0/0/0 lat (msec): 50=0.20%, 100=2.61%, 250=15.37%, 500=26.64%, 750=25.07% lat (msec): 1000=16.13%, 2000=13.44%, 
>=2000=0.54% Okay, so at least your SSD has shorter random read latencies than this slow harddisk. ;) Has a queue depth of 32 as well. > cpu : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=56073/w=0/d=0, short=r=0/w=0/d=0 > sequentialread: (groupid=3, jobs=1): err= 0: pid=7249 > read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt= 4418msec > slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96 > clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25 > lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06 > clat percentiles (usec): > | 1.00th=[ 10], 5.00th=[ 187], 10.00th=[ 217], 20.00th=[ 249], > | 30.00th=[ 278], 40.00th=[ 302], 50.00th=[ 330], 60.00th=[ 362], > | 70.00th=[ 398], 80.00th=[ 450], 90.00th=[ 716], 95.00th=[ 6624], > | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696], > | 99.99th=[67072] > bw (KB/s) : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18 > lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36% > lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52% > lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45% > lat (msec) : 100=0.05% > cpu : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=233071/w=0/d=0, short=r=0/w=0/d=0 Intel SSD 320: sequentielllesen: (groupid=1, jobs=1): err= 0: pid=13102 read : io=2048.0MB, bw=252699KB/s, iops=28043 , runt= 8299msec slat (usec): min=2 , max=1416 , avg=10.75, stdev=10.95 clat (usec): min=305 , max=201105 , avg=2268.82, stdev=2066.69 lat (usec): 
min=319 , max=201114 , avg=2279.94, stdev=2066.61
    bw (KB/s) : min=249424, max=254472, per=100.03%, avg=252776.50, stdev=1416.63
  cpu          : usr=12.29%, sys=43.62%, ctx=43913, majf=0, minf=278
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=232729/0/0, short=0/0/0
     lat (usec): 500=0.01%, 750=0.06%, 1000=0.10%
     lat (msec): 2=11.00%, 4=88.32%, 10=0.50%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%

Harddisk again:

sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32580
  read : io=4510.3MB, bw=76964KB/s, iops=8551 , runt= 60008msec
    slat (usec): min=1 , max=3373 , avg=10.03, stdev= 8.96
    clat (msec): min=1 , max=49 , avg= 7.47, stdev= 1.14
     lat (msec): min=1 , max=49 , avg= 7.48, stdev= 1.14
    bw (KB/s) : min=72608, max=79588, per=100.05%, avg=77003.96, stdev=1626.66
  cpu          : usr=6.89%, sys=16.96%, ctx=229486, majf=0, minf=277
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=513184/0/0, short=0/0/0
     lat (msec): 2=0.01%, 4=0.02%, 10=99.67%, 20=0.29%, 50=0.01%

Maximum latencies on our SSD are longer, minimum latencies shorter. Well,
sequential data might not be located sequentially on the SSD; the SSD
controller might put the contents of a big file into totally different
places.
> Run status group 0 (all jobs):
>   WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec
> 
> Run status group 2 (all jobs):
>    READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec
> 
> Run status group 3 (all jobs):
>    READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec

That's pathetic, except for the last job group (each group has one job
here because of the stonewall commands), which reads sequentially.
Strangely, sequential write is also abysmally slow.

This is how I expect it to look with iodepth 64:

Run status group 0 (all jobs):
   READ: io=2048.0MB, aggrb=162544KB/s, minb=166445KB/s, maxb=166445KB/s, mint=12902msec, maxt=12902msec

Run status group 1 (all jobs):
   READ: io=2048.0MB, aggrb=252699KB/s, minb=258764KB/s, maxb=258764KB/s, mint=8299msec, maxt=8299msec

Run status group 2 (all jobs):
  WRITE: io=2048.0MB, aggrb=76939KB/s, minb=78786KB/s, maxb=78786KB/s, mint=27257msec, maxt=27257msec

Run status group 3 (all jobs):
  WRITE: io=2048.0MB, aggrb=155333KB/s, minb=159061KB/s, maxb=159061KB/s, mint=13501msec, maxt=13501msec

Can you also post the last lines:

Disk stats (read/write):
  dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
    sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>

It gives information on how good the I/O scheduler was able to merge
requests. I didn't see much of a difference between CFQ and noop, so it
may not matter much, but since it also gives a number on total disk
utilization it's still quite nice to have.
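One more cross-check the run status numbers allow: dividing the bytes moved by the requests issued recovers the mean request size, which should sit near the middle of the 2k–16k bsrange if fio's size distribution is uniform. This sketch is my own addition, using the group 0 (randomwrite) figures quoted above:

```shell
# Group 0 above: io=102048KB over w=11396 issued write requests.
# Integer division; the true mean is ~8.95 KB, mid 2k-16k bsrange.
echo "$((102048 / 11396)) KB mean request size"
```

Since the request sizes look as configured, the abysmal bandwidth really does come down to request latency rather than undersized I/Os.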
So my recommendation for now:

Remove as many factors as possible and, in order to compare results with
what I posted, try with a plain logical volume with Ext4. If the values
are still quite slow, I think it's good to ask Linux block layer experts –
for example by posting on the fio mailing list, where there are people
subscribed that may be able to provide other test results – and SSD
experts. There might be a Linux block layer mailing list, or use libata or
fsdevel, I don't know.

If your Samsung SSD turns out to be this slow or almost this slow on such
a basic level, I am out of ideas. Except: have the IOPS test run on the
device itself. That will remove any filesystem layer. But only the
read-only tests – to make sure, I suggest to use fio with the --readonly
option as a safety guard. Unless you have a spare SSD that you can afford
to use for write testing, which will likely destroy every filesystem on
it. Or let it run on just one logical volume.

If it's then slow, I'd tend to use a different SSD and test from there. If
it's then faster, I'd tend to believe that it's either a hardware or,
probably more likely, a compatibility issue.

What does your SSD report as discard alignment – well, even that should
not matter for reads:

merkaba:/sys/block/sda> cat discard_alignment
0

Seems Intel tells us not to care at all. Or the value for some reason
cannot be read.

merkaba:/sys/block/sda> cd queue
merkaba:/sys/block/sda/queue> grep . *
add_random:1
discard_granularity:512
discard_max_bytes:2147450880
discard_zeroes_data:1
hw_sector_size:512

I would be interested whether these above differ.

grep: iosched: Is a directory
iostats:1
logical_block_size:512
max_hw_sectors_kb:32767
max_integrity_segments:0
max_sectors_kb:512
max_segments:168
max_segment_size:65536
minimum_io_size:512
nomerges:0
nr_requests:128
optimal_io_size:0
physical_block_size:512
read_ahead_kb:128
rotational:0
rq_affinity:1
scheduler:noop deadline [cfq]

And whether there is any optimal io size.
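The suggestion to run the read-only IOPS test against the raw device can be written as a small fio job file. This is a sketch, not a file from the thread: the job names and the /dev/sda target are assumptions, and both jobs are read-only so fio's --readonly guard will accept them (run as root with `fio --readonly raw-read.job`):

```ini
; raw-device read-only test -- /dev/sda is an assumed target,
; point it at the SSD under test
[global]
ioengine=libaio
direct=1
filename=/dev/sda
bsrange=2k-16k
iodepth=64
runtime=60

[rawrandomread]
rw=randread

[rawsequentialread]
stonewall
rw=read
```

If the random-read latencies are just as bad here, the filesystem, dm_crypt, and the page cache are all ruled out, and only the block layer and the device itself remain.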
One note regarding this: I created the Ext4 filesystem I ran the above
tests on last year, aligned to what I think could be good for the SSD. I
do not know whether this matters much.

merkaba:~> tune2fs -l /dev/merkaba/home
RAID stripe width:        128

Which should be in blocks and align to 512 KiB. Hmmm, I am a bit puzzled
about this value at the moment. Either 1 MiB or 128 KiB would make sense
to me. But then I do not know the erase block size of that SSD.

Then add BTRFS and then dm_crypt and look at how the numbers change. I
think this is a plan to find out whether it's really the hardware or some
weird happening in the low-level parts of Linux, e.g. the block layer,
dm_crypt, or the filesystem. Reduce it to the most basic level and then
work from there.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
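The stripe-width puzzle above is largely a units question: ext4 counts stride and stripe-width in filesystem blocks, so with 4 KiB blocks a value of 128 corresponds to 512 KiB. Whether 512 KiB matches the SSD's actual erase block size is an assumption; erase block sizes vary by model and are rarely published. A quick check of the arithmetic, added here for clarity:

```shell
# ext4 stride/stripe-width are counted in filesystem blocks (4 KiB here).
echo "$((128 * 4)) KiB alignment"   # stripe width 128 blocks
echo "$((512 / 4)) blocks"          # 512 KiB erase block -> stripe-width value
```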
* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 11:18               ` Martin Steigerwald
@ 2012-08-02 17:39                 ` Marc MERLIN
  2012-08-02 20:20                   ` Martin Steigerwald
  0 siblings, 1 reply; 22+ messages in thread
From: Marc MERLIN @ 2012-08-02 17:39 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > I've run the fio tests in:
> > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
> 
> … you are still using dm_crypt?

That was my biggest partition, and so far I've found no performance impact
on file access between unencrypted and dm_crypt.
I just took out my swap partition and made a smaller btrfs there:
/dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0
I mounted without discard.

> > lat (usec) : 50=0.01%
> > lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
> > lat (msec) : 500=72.44%, 750=14.43%
> 
> Gosh, look at these latencies!
> 
> 72.44% of all requests above 500 (in words: five hundred) milliseconds!
> And 14.43% above 750 msecs. The percentage of requests served in 100 msecs
> or less was below one percent! Hey, is this an SSD or what?

Yeah, that's kind of what I've been complaining about since the beginning :)
Once I'm reading sequentially, it goes fast, but random access/latency is
indeed abysmal.

> Still, even with iodepth 64, a totally different picture. And look at
> the IOPS and throughput.

Yep. I know mine are bad :(

> For reference, this refers to
> 
> [global]
> ioengine=libaio
> direct=1
> iodepth=64

Since it's slightly different from the first job file you gave me, I
re-ran with this one this time.
gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2 gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2 gandalfthegreat:~# grep sda2 /proc/mounts /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0 here's the btrfs test (ext4 is lower down): gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 4 processes zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318 read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec slat (usec): min=3 , max=37432 , avg=7252.52, stdev=5717.70 clat (usec): min=13 , max=981927 , avg=454046.13, stdev=110527.92 lat (msec): min=5 , max=999 , avg=461.30, stdev=112.00 clat percentiles (msec): | 1.00th=[ 145], 5.00th=[ 269], 10.00th=[ 371], 20.00th=[ 408], | 30.00th=[ 424], 40.00th=[ 437], 50.00th=[ 449], 60.00th=[ 457], | 70.00th=[ 474], 80.00th=[ 490], 90.00th=[ 570], 95.00th=[ 644], | 99.00th=[ 865], 99.50th=[ 922], 99.90th=[ 963], 99.95th=[ 979], | 99.99th=[ 979] bw (KB/s) : min= 8, max= 2807, per=100.00%, avg=1227.75, stdev=317.57 lat (usec) : 20=0.01% lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82% lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58% cpu : usr=0.12%, sys=1.13%, ctx=12186, majf=0, minf=276 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 
0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=8262/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30340 read : io=2048.0MB, bw=211257KB/s, iops=23473 , runt= 9927msec slat (usec): min=1 , max=56321 , avg=20.51, stdev=424.44 clat (usec): min=0 , max=57987 , avg=2695.98, stdev=6624.00 lat (usec): min=1 , max=58015 , avg=2716.75, stdev=6642.09 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 10], 10.00th=[ 30], 20.00th=[ 100], | 30.00th=[ 217], 40.00th=[ 362], 50.00th=[ 494], 60.00th=[ 636], | 70.00th=[ 892], 80.00th=[ 1656], 90.00th=[ 7392], 95.00th=[21632], | 99.00th=[29056], 99.50th=[29568], 99.90th=[43776], 99.95th=[46848], | 99.99th=[57600] bw (KB/s) : min=166675, max=260984, per=99.83%, avg=210892.26, stdev=22433.65 lat (usec) : 2=2.16%, 4=0.43%, 10=2.33%, 20=2.80%, 50=5.72% lat (usec) : 100=6.47%, 250=12.35%, 500=18.29%, 750=15.52%, 1000=5.44% lat (msec) : 2=13.04%, 4=3.59%, 10=3.83%, 20=1.88%, 50=6.11% lat (msec) : 100=0.04% cpu : usr=4.51%, sys=35.70%, ctx=11480, majf=0, minf=278 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=233025/w=0/d=0, short=r=0/w=0/d=0 zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30348 write: io=110768KB, bw=1845.1KB/s, iops=208 , runt= 60007msec slat (usec): min=26 , max=50160 , avg=4789.12, stdev=4968.75 clat (usec): min=29 , max=752494 , avg=301858.86, stdev=80422.24 lat (msec): min=12 , max=757 , avg=306.65, stdev=81.52 clat percentiles (msec): | 1.00th=[ 208], 5.00th=[ 241], 10.00th=[ 245], 20.00th=[ 255], | 30.00th=[ 262], 40.00th=[ 269], 50.00th=[ 277], 60.00th=[ 289], | 70.00th=[ 306], 80.00th=[ 326], 90.00th=[ 363], 95.00th=[ 519], | 99.00th=[ 627], 99.50th=[ 652], 99.90th=[ 734], 99.95th=[ 742], | 99.99th=[ 750] bw (KB/s) : min= 616, max= 2688, per=99.72%, avg=1839.80, 
stdev=399.60 lat (usec) : 50=0.01% lat (msec) : 20=0.02%, 50=0.04%, 100=0.07%, 250=14.31%, 500=79.62% lat (msec) : 750=5.91%, 1000=0.01% cpu : usr=0.28%, sys=2.60%, ctx=10167, majf=0, minf=20 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=12495/d=0, short=r=0/w=0/d=0 sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30364 write: io=84902KB, bw=1414.1KB/s, iops=156 , runt= 60005msec slat (usec): min=18 , max=40825 , avg=6389.99, stdev=6072.56 clat (usec): min=22 , max=887097 , avg=401897.77, stdev=108015.23 lat (msec): min=11 , max=899 , avg=408.29, stdev=109.47 clat percentiles (msec): | 1.00th=[ 262], 5.00th=[ 302], 10.00th=[ 318], 20.00th=[ 338], | 30.00th=[ 351], 40.00th=[ 363], 50.00th=[ 375], 60.00th=[ 388], | 70.00th=[ 404], 80.00th=[ 433], 90.00th=[ 502], 95.00th=[ 693], | 99.00th=[ 783], 99.50th=[ 824], 99.90th=[ 873], 99.95th=[ 881], | 99.99th=[ 889] bw (KB/s) : min= 346, max= 2115, per=99.44%, avg=1406.11, stdev=333.58 lat (usec) : 50=0.01% lat (msec) : 20=0.02%, 50=0.03%, 100=0.07%, 250=0.59%, 500=89.12% lat (msec) : 750=7.91%, 1000=2.24% cpu : usr=0.28%, sys=2.21%, ctx=13224, majf=0, minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.3% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=9369/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=73682KB, aggrb=1227KB/s, minb=1227KB/s, maxb=1227KB/s, mint=60004msec, maxt=60004msec Run status group 1 (all jobs): READ: io=2048.0MB, aggrb=211257KB/s, minb=211257KB/s, maxb=211257KB/s, mint=9927msec, maxt=9927msec Run status group 2 (all jobs): WRITE: io=110768KB, aggrb=1845KB/s, minb=1845KB/s, maxb=1845KB/s, mint=60007msec, maxt=60007msec Run status group 3 (all 
jobs): WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec
gandalfthegreat:/mnt/mnt2#
(no lines beyond this, fio 2.0.8)

> This could be another good test. Test with Ext4 on a plain logical
> volume without dm_crypt.

> Can you also post the last lines:
> 
> Disk stats (read/write):
>   dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
>     sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
> martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>

I didn't get these lines.

> It gives information on how good the I/O scheduler was able to merge
> requests.
> 
> I didn't see much of a difference between CFQ and noop, so it may not
> matter much, but since it also gives a number on total disk utilization
> it's still quite nice to have.

I tried deadline and noop, and indeed I'm not seeing much of a difference
for my basic tests. For now I have deadline.

> So my recommendation for now:
> 
> Remove as many factors as possible and, in order to compare results with
> what I posted, try with a plain logical volume with Ext4.
gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2 /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0 gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 4 processes zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077 read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01 clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63 lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36 clat percentiles (usec): | 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724], | 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096], | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040], | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168], | 99.99th=[33024] bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29 lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70% lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17% lat (msec) : 100=0.01% cpu : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : 
total=r=383190/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081 read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83 clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26 lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77 clat percentiles (usec): | 1.00th=[ 2608], 5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376], | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920], | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704], | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296], | 99.99th=[16512] bw (KB/s) : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35 lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25% cpu : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=232601/w=0/d=0, short=r=0/w=0/d=0 zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088 write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec slat (usec): min=7 , max=51044 , avg=4757.60, stdev=4816.78 clat (usec): min=958 , max=582205 , avg=299854.73, stdev=56663.66 lat (msec): min=12 , max=588 , avg=304.61, stdev=57.22 clat percentiles (msec): | 1.00th=[ 73], 5.00th=[ 225], 10.00th=[ 251], 20.00th=[ 273], | 30.00th=[ 285], 40.00th=[ 297], 50.00th=[ 306], 60.00th=[ 314], | 70.00th=[ 318], 80.00th=[ 330], 90.00th=[ 343], 95.00th=[ 359], | 99.00th=[ 482], 99.50th=[ 519], 99.90th=[ 553], 99.95th=[ 570], | 99.99th=[ 570] bw (KB/s) : min= 1033, max= 3430, per=99.67%, avg=1856.90, stdev=307.24 lat (usec) : 1000=0.01% lat (msec) : 20=0.06%, 50=0.72%, 100=0.44%, 250=8.50%, 500=89.52% lat (msec) : 750=0.75% cpu : usr=0.27%, sys=1.27%, ctx=9505, majf=0, 
minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=12580/d=0, short=r=0/w=0/d=0 sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102 write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec slat (usec): min=5 , max=40817 , avg=3911.57, stdev=4785.23 clat (msec): min=4 , max=682 , avg=246.61, stdev=77.35 lat (msec): min=5 , max=688 , avg=250.53, stdev=78.39 clat percentiles (msec): | 1.00th=[ 73], 5.00th=[ 143], 10.00th=[ 186], 20.00th=[ 206], | 30.00th=[ 221], 40.00th=[ 229], 50.00th=[ 239], 60.00th=[ 245], | 70.00th=[ 258], 80.00th=[ 273], 90.00th=[ 318], 95.00th=[ 416], | 99.00th=[ 529], 99.50th=[ 562], 99.90th=[ 660], 99.95th=[ 668], | 99.99th=[ 676] bw (KB/s) : min= 811, max= 4243, per=99.65%, avg=2279.92, stdev=610.32 lat (msec) : 10=0.26%, 20=0.03%, 50=0.22%, 100=1.97%, 250=62.01% lat (msec) : 500=33.43%, 750=2.07% cpu : usr=0.35%, sys=1.46%, ctx=10993, majf=0, minf=21 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=0/w=15293/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec Run status group 1 (all jobs): READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec Run status group 2 (all jobs): WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec Run status group 3 (all jobs): WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec Disk stats (read/write): sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, 
util=98.50% gandalfthegreat:/mnt/mnt2# > If the values are still quite slow, I think it's good to ask Linux > block layer experts – for example by posting on fio mailing list where > there are people subscribed that may be able to provide other test > results – and SSD experts. There might be a Linux block layer mailing > list or use libata or fsdevel, I don´t know. I'll try there next. I think after this mail, and your help which is very much appreciated, it's fair to take things out of this list to stop spamming the wrong people :) > Have the IOPS run on the device itself. That will remove any filesystem > layer. But only the read only tests, to make sure I suggest to use fio > with the --readonly option as safety guard. Unless you have a spare SSD > that you can afford to use for write testing which will likely destroy > every filesystem on it. Or let it run on just one logical volume. Can you send me a recommended job config you'd like me to run if the runs above haven't already answered your questions? > What does your SSD report as discard alignment – well but even that should > not matter for reads: > > merkaba:/sys/block/sda> cat discard_alignment > 0 gandalfthegreat:/sys/block/sda# cat discard_alignment 0 > Seems Intel tells us to not care at all. Or the value for some reason > cannot be read. > > merkaba:/sys/block/sda> cd queue > merkaba:/sys/block/sda/queue> grep . * > add_random:1 > discard_granularity:512 > discard_max_bytes:2147450880 > discard_zeroes_data:1 > hw_sector_size:512 > > I would be interested whether these above differ. I have the exact same. > And whether there is any optimal io size. I also have optimal_io_size:0 > One note regarding this: Back then I made the Ext4 filesystem on which I > ran the above tests last year aligned to what I think could be good for > the SSD. > I do not know whether this matters much.
> > merkaba:~> tune2fs -l /dev/merkaba/home > RAID stripe width: 128 > I think this is a plan to find out whether it's really the hardware > or something weird happening in the low-level parts of Linux, e.g. the block > layer, dm_crypt, or the filesystem. > > Reduce it to the most basic level and then work from there. So, I went back to basics. On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote: > Heck, I didn´t look at the IOPS figure! > > 189 IOPS for a SATA-600 SSD. That's pathetic. Yeah, and the new tests above without dmcrypt are no better. > A really fast 15000 rpm SAS harddisk might top that. My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than the SSD right now :) I read your comments about the multiple things you had in mind. Given the new data, should I just go to an io list now, or even save myself some time and just return the damned SSDs and buy from another vendor? :) Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 22+ messages in thread
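For reference, the stride=128 / stripe-width=128 figures that appear in the mkfs.ext4 call and the tune2fs output above can be derived from an assumed erase-block (chunk) size. The 512 KiB chunk below is my assumption for illustration, not something either drive reported:

```shell
# stride = chunk size / filesystem block size; on a single device the
# stripe-width is the same value (no RAID multiplier). Chunk size assumed.
chunk_kib=512   # hypothetical SSD erase-block size, not reported by the drive
block_kib=4     # mkfs.ext4 -b 4096
stride=$((chunk_kib / block_kib))
stripe_width=$stride
echo "stride=${stride} stripe-width=${stripe_width}"
```

With a different assumed erase-block size the figures change proportionally, which is why this kind of alignment tuning is guesswork without vendor data.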
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-02 17:39 ` Marc MERLIN @ 2012-08-02 20:20 ` Martin Steigerwald 2012-08-02 20:44 ` Marc MERLIN 0 siblings, 1 reply; 22+ messages in thread From: Martin Steigerwald @ 2012-08-02 20:20 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha On Thursday, 2 August 2012, Marc MERLIN wrote: > On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote: > > > I've run the fio tests in: > > > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 > > > > … you are still using dm_crypt? […] > I just took out my swap partition and made a smaller btrfs there: > /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0 > > I mounted without discard. […] > > For reference, this refers to > > > > [global] > > ioengine=libaio > > direct=1 > > iodepth=64 > Since it's slightly different from the first job file you gave me, I re-ran > with this one this time. > > gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2 > gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2 > gandalfthegreat:~# grep sda2 /proc/mounts > /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0 > > here's the btrfs test (ext4 is lower down): > gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 Still abysmal except for sequential reads.
> Starting 4 processes > zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) > sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) > zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318 > read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec […] > lat (usec) : 20=0.01% > lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82% > lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58% 1 second latency? Oh well… > Run status group 3 (all jobs): > WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec > gandalfthegreat:/mnt/mnt2# > (no lines beyond this, fio 2.0.8) […] > > Can you also post the last lines: > > > > Disk stats (read/write): > > dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78% > > sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78% > > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba> > > I didn't get these lines. Hmmm, back then I guess this was fio 1.5.9 or so. > I tried deadline and noop, and indeed I'm not seeing much of a difference for my basic tests. > For now I have deadline. > > > So my recommendation as of now: > > > > Remove as many factors as possible and, in order to compare results with > > what I posted, try with a plain logical volume with Ext4.
> > gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2 > /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0 > > gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 > Starting 4 processes > zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) > sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) > zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077 > read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec > slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01 > clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63 > lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36 > clat percentiles (usec): > | 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724], > | 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096], > | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040], > | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168], > | 99.99th=[33024] > bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29 > lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70% > lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17% > lat (msec) : 100=0.01% > cpu : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 
4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=383190/w=0/d=0, short=r=0/w=0/d=0 Hey, what's this? With Ext4 you have really good random read performance now! Way better than the Intel SSD 320 and… > sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081 > read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec > slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83 > clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26 > lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77 > clat percentiles (usec): > | 1.00th=[ 2608], 5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376], > | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920], > | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704], > | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296], > | 99.99th=[16512] > bw (KB/s) : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35 > lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% > lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25% > cpu : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% > issued : total=r=232601/w=0/d=0, short=r=0/w=0/d=0 … even faster than sequential reads. That just doesn´t make any sense to me. The performance of your Samsung SSD seems to be quite erratic. It seems that the device is capable of being fast, but only sometimes shows this capability.
> zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088 > write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec […] > sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102 > write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec […] > Run status group 0 (all jobs): > READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec > > Run status group 1 (all jobs): > READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec > > Run status group 2 (all jobs): > WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec > > Run status group 3 (all jobs): > WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec Writes are slow again. > Disk stats (read/write): > sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50% Well, there we have that line. You don´t seem to have the other line, possibly due to not using LVM. > gandalfthegreat:/mnt/mnt2# > > > If the values are still quite slow, I think it's good to ask Linux > > block layer experts – for example by posting on fio mailing list where > > there are people subscribed that may be able to provide other test > > results – and SSD experts. There might be a Linux block layer mailing > > list or use libata or fsdevel, I don´t know. > > I'll try there next. I think after this mail, and your help which is very > much appreciated, it's fair to take things out of this list to stop spamming > the wrong people :) Well, the only thing interesting regarding BTRFS is the whopping difference in random read performance between BTRFS and Ext4 on this SSD. > > Have the IOPS run on the device itself. That will remove any filesystem > > layer. But only the read only tests, to make sure I suggest to use fio > > with the --readonly option as safety guard.
Unless you have a spare SSD > > that you can afford to use for write testing which will likely destroy > > every filesystem on it. Or let it run on just one logical volume. > > Can you send me a recommended job config you'd like me to run if the runs > above haven't already answered your questions? [global] ioengine=libaio direct=1 filename=/dev/sdb size=476940m bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 I won´t expect much of a difference, but then the random read performance is quite different between Ext4 and BTRFS on this disk. That would make it interesting to test without any filesystem in between and over the whole device. It might make sense to test around with different alignment, I think the option is called bsalign or so… > On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote: > > Heck, I didn´t look at the IOPS figure! > > > > 189 IOPS for a SATA-600 SSD. That's pathetic. > > Yeah, and the new tests above without dmcrypt are no better. > > > A really fast 15000 rpm SAS harddisk might top that. > > My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than > the SSD right now :) > > I read your comments about the multiple things you had in mind. Given the > new data, should I just go to an io list now, or even save myself some time > and just return the damned SSDs and buy from another vendor? :) … or get yourself another SSD. It's your decision. I admire your endurance. ;) Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 22+ messages in thread
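One way to see why the 2k-16k mixed-block runs report such low IOPS: the mean request size implied by the btrfs random-read figures quoted earlier in the thread (bw=1227.1KB/s at iops=137) sits right at the ~9 KiB midpoint of bsrange=2k-16k. So the numbers are self-consistent, and the bottleneck is per-request latency rather than any block-size accounting quirk. A quick sketch:

```shell
# Mean request size = bandwidth / IOPS, using the btrfs randread figures
# quoted earlier in the thread (bw=1227.1KB/s, iops=137).
bw_kb_s=1227.1
iops=137
mean_kb=$(awk -v b="$bw_kb_s" -v i="$iops" 'BEGIN { printf "%.1f", b / i }')
echo "mean request size ~ ${mean_kb} KB"
```

That is roughly the expected average for a uniform 2k-16k range, so each individual small read really is taking several milliseconds.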
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-02 20:20 ` Martin Steigerwald @ 2012-08-02 20:44 ` Marc MERLIN 2012-08-02 21:21 ` Martin Steigerwald 0 siblings, 1 reply; 22+ messages in thread From: Marc MERLIN @ 2012-08-02 20:44 UTC (permalink / raw) To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote: > Hey, what's this? With Ext4 you have really good random read performance > now! Way better than the Intel SSD 320 and… Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. Obviously it's sending IO in a way that either the IO subsystem, Linux driver, or drive prefers. > The performance of your Samsung SSD seems to be quite erratic. It seems > that the device is capable of being fast, but only sometimes shows this > capability. That's indeed exactly what I'm seeing in real life :) > > > Have the IOPS run on the device itself. That will remove any filesystem > > > layer. But only the read only tests, to make sure I suggest to use fio > > > with the --readonly option as safety guard. Unless you have a spare SSD > > > that you can afford to use for write testing which will likely destroy > > > every filesystem on it. Or let it run on just one logical volume. > > > > Can you send me a recommended job config you'd like me to run if the runs > > above haven't already answered your questions? > [global] (...) I used this and just changed filename to /dev/sda. Since I'm reading from the beginning of the drive, reads have to be aligned. > I won´t expect much of a difference, but then the random read performance > is quite different between Ext4 and BTRFS on this disk. That would make > it interesting to test without any filesystem in between and over the > whole device.
Here is the output: gandalfthegreat:~# fio --readonly ./fio.job3 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172 read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64 clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66 lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88 clat percentiles (usec): | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920], | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480], | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768], | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656], | 99.99th=[27264] bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68 lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02% lat (usec) : 1000=0.02% lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18% cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=6532/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199 read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91 clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32 lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06 clat percentiles (usec): | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920], | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608], | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 
95.00th=[17536], | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568], | 99.99th=[34048] bw (KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62 lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02% lat (usec) : 1000=0.03% lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36% cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=6101/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec Run status group 1 (all jobs): READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec Disk stats (read/write): sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30% > … or get yourself another SSD. It's your decision. > > I admire your endurance. ;) Since I've gotten 2 SSDs to make sure I didn't get one bad one, and the company is getting great reviews for them, I'm now pretty sure that it's likely a problem with a Linux driver, which is interesting for us all to debug :) If I go buy another brand, the next guy will have the same problems as me. But yes, if we figure this out, Samsung owes you and me some money :) I'll try plugging this SSD into a totally different PC and see what happens. This may tell us whether it's an AHCI/Intel SATA driver problem. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 22+ messages in thread
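The iodepth=1 raw-device numbers above also hang together: with only one outstanding request, per-request service time is simply runtime divided by completed requests, and that matches the reported average latency. A quick cross-check using the randread group quoted above (6532 reads in 60002 ms):

```shell
# At iodepth=1 the drive serves one request at a time, so
# service time per request ≈ runtime / completed requests.
runtime_ms=60002
reads=6532          # from "issued : total=r=6532/w=0/d=0"
per_req_ms=$(awk -v t="$runtime_ms" -v n="$reads" 'BEGIN { printf "%.1f", t / n }')
echo "~${per_req_ms} ms per read"
```

That agrees with the reported avg lat of ~9.18 ms, i.e. roughly 9 ms per small random read: spinning-disk territory, on an SSD.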
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-02 20:44 ` Marc MERLIN @ 2012-08-02 21:21 ` Martin Steigerwald 2012-08-02 21:49 ` Marc MERLIN 0 siblings, 1 reply; 22+ messages in thread From: Martin Steigerwald @ 2012-08-02 21:21 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha On Thursday, 2 August 2012, Marc MERLIN wrote: > On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote: > > Hey, what's this? With Ext4 you have really good random read performance > > now! Way better than the Intel SSD 320 and… > > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > Obviously it's sending IO in a way that either the IO subsystem, Linux > driver, or drive prefers. But only on reads. > > > > Have the IOPS run on the device itself. That will remove any filesystem > > > > layer. But only the read only tests, to make sure I suggest to use fio > > > > with the --readonly option as safety guard. Unless you have a spare SSD > > > > that you can afford to use for write testing which will likely destroy > > > > every filesystem on it. Or let it run on just one logical volume. > > > > > > Can you send me a recommended job config you'd like me to run if the runs > > > above haven't already answered your questions? > > > [global] > (...) > > I used this and just changed filename to /dev/sda. Since I'm reading > from the beginning of the drive, reads have to be aligned. > > > I won´t expect much of a difference, but then the random read performance > > is quite different between Ext4 and BTRFS on this disk. That would make > > it interesting to test without any filesystem in between and over the > > whole device.
> > Here is the output: > gandalfthegreat:~# fio --readonly ./fio.job3 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 > 2.0.8 > Starting 2 processes > Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172 > read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec WTF? Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely without it. Also I noticed that I had iodepth 1 in there to circumvent any in-drive cache / optimization. > slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64 > clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66 > lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88 > clat percentiles (usec): > | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920], > | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480], > | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768], > | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656], > | 99.99th=[27264] > bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68 > lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02% > lat (usec) : 1000=0.02% > lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18% > cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued : total=r=6532/w=0/d=0, short=r=0/w=0/d=0 Latency is still way too high even with iodepth 1: 10 milliseconds for 43% of requests. And the throughput and IOPS are still abysmal even for iodepth 1 (see below for Intel SSD 320 values). Okay, one further idea: Remove the bsrange to just test with 4k blocks.
Additionally, test whether these are aligned, with: blockalign=int[,int], ba=int[,int] At what boundary to align random IO offsets. Defaults to the same as 'blocksize', the minimum blocksize given. Minimum alignment is typically 512b for using direct IO, though it usually depends on the hardware block size. This option is mutually exclusive with using a random map for files, so it will turn off that option. I would first test with 4k blocks as is. And then do something like: blocksize=4k blockalign=4k And then raise blockalign to some values that may matter like 8k, 128k, 512k, 1m or so. But that's just guesswork. I do not even know exactly whether it works this way in fio. There is something pretty weird going on. But I am not sure what it is. Maybe an alignment issue, since Ext4 with stripe alignment was able to do so much faster on reads. > sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199 > read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec Hey, what's this? > slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91 > clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32 > lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06 > clat percentiles (usec): > | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920], > | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608], > | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536], > | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568], > | 99.99th=[34048] > bw (KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62 > lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02% > lat (usec) : 1000=0.03% > lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36% > cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued :
total=r=6101/w=0/d=0, short=r=0/w=0/d=0 > > Run status group 0 (all jobs): > READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec > > Run status group 1 (all jobs): > READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec > > Disk stats (read/write): > sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30% What on earth is this? You are testing the raw device and are getting these fun values out of it? Let me repeat that here: merkaba:/tmp> cat iops-read.job [global] ioengine=libaio direct=1 filename=/dev/sda bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 merkaba:/tmp> fio iops-read.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [57784K/0K /s] [6471 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31915 read : io=1681.2MB, bw=28692KB/s, iops=3193 , runt= 60001msec slat (usec): min=5 , max=991 , avg=31.18, stdev=10.07 clat (usec): min=2 , max=3312 , avg=276.67, stdev=124.41 lat (usec): min=48 , max=3337 , avg=308.54, stdev=125.51 clat percentiles (usec): | 1.00th=[ 61], 5.00th=[ 82], 10.00th=[ 103], 20.00th=[ 159], | 30.00th=[ 201], 40.00th=[ 237], 50.00th=[ 270], 60.00th=[ 306], | 70.00th=[ 354], 80.00th=[ 402], 90.00th=[ 446], 95.00th=[ 478], | 99.00th=[ 524], 99.50th=[ 548], 99.90th=[ 644], 99.95th=[ 716], | 99.99th=[ 1020] bw (KB/s) : min=27760, max=31784, per=100.00%, avg=28694.49, stdev=625.77 lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.15%, 100=8.76% lat (usec) : 250=34.69%, 500=54.13%, 750=2.23%, 1000=0.03% lat (msec) : 2=0.01%, 4=0.01% cpu : usr=2.91%, sys=12.49%, ctx=199678, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 
16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=191587/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=31921 read : io=3895.9MB, bw=66487KB/s, iops=7384 , runt= 60001msec slat (usec): min=5 , max=518 , avg=30.18, stdev= 8.16 clat (usec): min=1 , max=3394 , avg=100.21, stdev=83.51 lat (usec): min=44 , max=3429 , avg=131.69, stdev=84.44 clat percentiles (usec): | 1.00th=[ 54], 5.00th=[ 60], 10.00th=[ 61], 20.00th=[ 68], | 30.00th=[ 72], 40.00th=[ 79], 50.00th=[ 86], 60.00th=[ 92], | 70.00th=[ 99], 80.00th=[ 105], 90.00th=[ 112], 95.00th=[ 163], | 99.00th=[ 556], 99.50th=[ 716], 99.90th=[ 900], 99.95th=[ 940], | 99.99th=[ 1080] bw (KB/s) : min=30276, max=81176, per=99.95%, avg=66453.19, stdev=10893.90 lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.57% lat (usec) : 100=69.52%, 250=26.59%, 500=2.01%, 750=0.89%, 1000=0.39% lat (msec) : 2=0.02%, 4=0.01% cpu : usr=7.12%, sys=27.34%, ctx=461537, majf=0, minf=27 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=443106/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=1681.2MB, aggrb=28691KB/s, minb=28691KB/s, maxb=28691KB/s, mint=60001msec, maxt=60001msec Run status group 1 (all jobs): READ: io=3895.9MB, aggrb=66487KB/s, minb=66487KB/s, maxb=66487KB/s, mint=60001msec, maxt=60001msec Disk stats (read/write): sda: ios=632851/228, merge=0/140, ticks=106392/351, in_queue=105682, util=87.78% I have seen about 4000 IOPS with pure 4k blocks from that drive, so that seems to be fine. And look at the latencies: barely any requests in the low millisecond range.
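For the blockalign experiment suggested a few messages up, a job file might look like the following. This is my sketch, not something posted in the thread; blocksize and blockalign are documented fio options, but the filename and values are the guesses mentioned above (raise blockalign to 8k, 128k, 512k, 1m between runs):

```ini
; Hypothetical read-only alignment test; run as: fio --readonly align.job
[global]
ioengine=libaio
direct=1
filename=/dev/sda
blocksize=4k
blockalign=4k
iodepth=1

[randread-aligned]
rw=randread
runtime=60
```

Keeping iodepth at 1 and the block size fixed isolates alignment as the only variable between runs.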
Now just for the sake of it with iodepth 64: merkaba:/tmp> fio iops-read.job zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [217.8M/0K /s] [24.1K/0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31945 read : io=12412MB, bw=211819KB/s, iops=23795 , runt= 60003msec slat (usec): min=2 , max=2158 , avg=15.69, stdev= 8.96 clat (usec): min=138 , max=19349 , avg=2670.01, stdev=861.85 lat (usec): min=189 , max=19366 , avg=2686.29, stdev=861.67 clat percentiles (usec): | 1.00th=[ 1256], 5.00th=[ 1448], 10.00th=[ 1592], 20.00th=[ 1864], | 30.00th=[ 2128], 40.00th=[ 2384], 50.00th=[ 2640], 60.00th=[ 2896], | 70.00th=[ 3152], 80.00th=[ 3408], 90.00th=[ 3728], 95.00th=[ 3952], | 99.00th=[ 4448], 99.50th=[ 4896], 99.90th=[ 8768], 99.95th=[10048], | 99.99th=[12096] bw (KB/s) : min=116464, max=216400, per=100.00%, avg=211823.16, stdev=9787.89 lat (usec) : 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.05% lat (msec) : 2=25.18%, 4=70.40%, 10=4.28%, 20=0.05% cpu : usr=14.39%, sys=44.38%, ctx=624900, majf=0, minf=278 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=1427811/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32301 read : io=12499MB, bw=213307KB/s, iops=23691 , runt= 60002msec slat (usec): min=1 , max=1509 , avg=12.42, stdev= 8.12 clat (usec): min=306 , max=201523 , avg=2685.75, stdev=1220.14 lat (usec): min=316 , max=201536 , avg=2698.70, stdev=1220.06 clat percentiles (usec): | 1.00th=[ 1816], 5.00th=[ 2024], 10.00th=[ 2096], 20.00th=[ 2192], | 30.00th=[ 2256], 40.00th=[ 2352], 50.00th=[ 2416], 60.00th=[ 2544], | 70.00th=[ 2736], 80.00th=[ 2992], 90.00th=[ 3632], 95.00th=[ 
4320], | 99.00th=[ 5664], 99.50th=[ 6112], 99.90th=[ 7136], 99.95th=[ 7712], | 99.99th=[26240] bw (KB/s) : min=144720, max=256568, per=99.96%, avg=213210.08, stdev=23175.58 lat (usec) : 500=0.01%, 750=0.02%, 1000=0.06% lat (msec) : 2=4.10%, 4=88.72%, 10=7.09%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=11.77%, sys=34.79%, ctx=440558, majf=0, minf=280 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=1421534/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=12412MB, aggrb=211818KB/s, minb=211818KB/s, maxb=211818KB/s, mint=60003msec, maxt=60003msec Run status group 1 (all jobs): READ: io=12499MB, aggrb=213306KB/s, minb=213306KB/s, maxb=213306KB/s, mint=60002msec, maxt=60002msec Disk stats (read/write): sda: ios=2151624/243, merge=692022/115, ticks=5719203/651132, in_queue=6372962, util=99.87% (Seems that extra line is gone in fio 2; I dimly remember that Jens Axboe removed some output he found to be superfluous, but the disk stats are still there.) In that case random versus sequential do not even seem to make much of a difference. > > … or get yourself another SSD. It's your decision. > > > > I admire your endurance. ;) > > Since I've gotten 2 SSDs to make sure I didn't get one bad one, and that the > company is getting great reviews for them, I'm now pretty sure that > it's either a problem with a linux driver, which is interesting for us > all to debug :) Either that or there possibly is some firmware update? But then those reviews would mention it, I bet. I´d still look whether there is some firmware update available labeled: "This will raise performance on lots of Linux based workloads by factors of 10x to 250x" ;) Anyway, I think now we are pretty near to the hardware.
So the issue has to be somewhere in the block layer, the controller, the SSD firmware or the SSD itself IMHO. > If I go buy another brand, the next guy will have the same problems as > me. > > But yes, if we figure this out, Samsung owes you and me some money :) ;) > I'll try plugging this SSD in a totally different PC and see what happens. > This may tell if it's an AHCI/intel sata driver problem. Seems we will continue until someone starts to complain here. Maybe another list will be more appropriate? But then this thread has it all in one ;). Adding a CC with some introductory note might be appropriate. It's your problem, so you decide ;). I´d suggest the fio mailing list, there are other performance people who may want to chime in. Another idea: Is there anything funny in the SMART values (smartctl -a and possibly even -x)? Well, I gather these SSDs are new, but still. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 22+ messages in thread
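[Editorial aside: the jump from ~3,200 to ~23,800 random-read IOPS between iodepth 1 and iodepth 64 in the runs above is roughly what Little's law predicts from the completion latencies fio reports: sustained IOPS ~ queue depth / mean latency, assuming the device pipeline is the only bottleneck. A small illustrative sketch, with the numbers taken from the fio output above:]

```python
# Little's law applied to a device queue:
# sustained IOPS ~= in-flight requests / mean completion latency.
def predicted_iops(queue_depth, mean_latency_us):
    return queue_depth / (mean_latency_us / 1_000_000)

# iodepth=1 random read: mean clat ~276.67 us gives a ~3.6k IOPS ceiling,
# close to the ~3193 IOPS fio measured (slat and syscall overhead
# account for the gap).
shallow = predicted_iops(1, 276.67)

# iodepth=64 random read: mean clat ~2670 us gives ~24k IOPS,
# matching the ~23795 IOPS fio measured almost exactly.
deep = predicted_iops(64, 2670.01)

print(round(shallow), round(deep))
```

[Per-request latency rises ~10x at the deeper queue, but with 64 requests in flight throughput still climbs ~7x, which supports Martin's point that the raw device looks healthy here.]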
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-02 21:21 ` Martin Steigerwald @ 2012-08-02 21:49 ` Marc MERLIN 2012-08-03 18:45 ` Martin Steigerwald 0 siblings, 1 reply; 22+ messages in thread From: Marc MERLIN @ 2012-08-02 21:49 UTC (permalink / raw) To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote: > > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > > Obviously it's sending IO in a way that either the IO subsystem, linux > > driver, or drive prefer. > > But only on reads. Yes. I was focussing on the simple case. SSDs can be slower than hard drives on writes, and now you're getting into deep firmware magic for corner cases, so I didn't want to touch that here :) > WTF? > > Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely > without it. I left it as is: [global] ioengine=libaio direct=1 filename=/dev/sda size=476940m bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 I just removed it now. > Okay, one further idea: Remove the bsrange to just test with 4k blocks. 
Ok, there you go: [global] ioengine=libaio direct=1 filename=/dev/sda blocksize=4k blockalign=4k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 gandalfthegreat:~# fio --readonly ./fio.job4 zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614 read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05 clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69 lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99 clat percentiles (usec): | 1.00th=[ 41], 5.00th=[ 46], 10.00th=[ 51], 20.00th=[ 66], | 30.00th=[ 113], 40.00th=[ 163], 50.00th=[ 205], 60.00th=[ 4960], | 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120], | 99.99th=[22912] bw (KB/s) : min= 253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74 lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=19.94% lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09% lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05% cpu : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=15208/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636 read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23 clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76 lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31 clat percentiles (usec): | 1.00th=[
27], 5.00th=[ 29], 10.00th=[ 31], 20.00th=[ 35], | 30.00th=[ 47], 40.00th=[ 61], 50.00th=[ 78], 60.00th=[ 107], | 70.00th=[ 270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608], | 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584], | 99.99th=[21120] bw (KB/s) : min= 282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36 lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61% lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03% lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03% cpu : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=22117/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, mint=60012msec, maxt=60012msec Run status group 1 (all jobs): READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, mint=60009msec, maxt=60009msec Disk stats (read/write): sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, util=99.51% and the same test with blockalign=512k: gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k fio: Any use of blockalign= turns off randommap zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679 read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51 clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32 lat (usec): min=44 , max=38980 , avg=4272.70, stdev=5154.68 clat percentiles (usec): | 1.00th=[ 42], 5.00th=[ 46], 10.00th=[ 53], 20.00th=[ 69], | 
30.00th=[ 120], 40.00th=[ 169], 50.00th=[ 211], 60.00th=[ 5024], | 70.00th=[ 5920], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[16768], 99.50th=[17536], 99.90th=[18560], 99.95th=[20096], | 99.99th=[37632] bw (KB/s) : min= 263, max= 3133, per=100.00%, avg=935.09, stdev=392.66 lat (usec) : 4=0.05%, 10=0.03%, 20=0.01%, 50=7.84%, 100=19.12% lat (usec) : 250=25.84%, 500=1.61%, 750=0.08%, 1000=0.05% lat (msec) : 2=0.06%, 4=0.08%, 10=21.99%, 20=23.17%, 50=0.06% cpu : usr=0.29%, sys=0.94%, ctx=14069, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=14024/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3703 read : io=99592KB, bw=1659.7KB/s, iops=414 , runt= 60008msec slat (usec): min=4 , max=186 , avg=15.83, stdev= 9.95 clat (usec): min=1 , max=25530 , avg=2390.79, stdev=4340.79 lat (usec): min=28 , max=25539 , avg=2407.01, stdev=4342.90 clat percentiles (usec): | 1.00th=[ 26], 5.00th=[ 28], 10.00th=[ 30], 20.00th=[ 34], | 30.00th=[ 43], 40.00th=[ 57], 50.00th=[ 72], 60.00th=[ 91], | 70.00th=[ 155], 80.00th=[ 5152], 90.00th=[11712], 95.00th=[12480], | 99.00th=[12992], 99.50th=[16768], 99.90th=[18560], 99.95th=[18816], | 99.99th=[21632] bw (KB/s) : min= 269, max= 8465, per=100.00%, avg=1663.84, stdev=1367.82 lat (usec) : 2=0.01%, 4=0.02%, 20=0.01%, 50=35.42%, 100=26.00% lat (usec) : 250=11.29%, 500=1.31%, 750=0.05%, 1000=0.01% lat (msec) : 2=0.04%, 4=0.02%, 10=12.55%, 20=13.25%, 50=0.01% cpu : usr=0.31%, sys=0.99%, ctx=24963, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=24898/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 
(all jobs): READ: io=56096KB, aggrb=934KB/s, minb=934KB/s, maxb=934KB/s, mint=60001msec, maxt=60001msec Run status group 1 (all jobs): READ: io=99592KB, aggrb=1659KB/s, minb=1659KB/s, maxb=1659KB/s, mint=60008msec, maxt=60008msec Disk stats (read/write): sda: ios=55021/1637, merge=72/39, ticks=240896/12080, in_queue=252912, util=99.59% > > Since I've gotten 2 SSDs to make sure I didn't get one bad one, and that the > > company is getting great reviews for them, I'm now pretty sure that > > it's either a problem with a linux driver, which is interesting for us > > all to debug :) > > Either that or there possibly is some firmware update? But then those > reviews would mention it, I bet. So far, there are no firmware updates for mine, I checked right after unboxing it :) > > I'll try plugging this SSD in a totally different PC and see what happens. > > This may tell if it's an AHCI/intel sata driver problem. > > Seems we will continue until someone starts to complain here. Maybe > another list will be more appropriate? But then this thread has it all > in one ;). Adding a CC with some introductory note might be appropriate. > It's your problem, so you decide ;). I´d suggest the fio mailing list, > there are other performance people who may want to chime in. Actually you know the lists and people involved better than I do. I'd be happy if you added a Cc to another list you think is best, and we can move there. > Another idea: Is there anything funny in the SMART values (smartctl -a > and possibly even -x)? Well, I gather these SSDs are new, but still.
smartctl -a /dev/sda (snipped) SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 36 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 7 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 9 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0032 058 040 000 Old_age Always - 42 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x003e 253 253 000 Old_age Always - 0 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 4 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 433339112 smartctl -x /dev/sda smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.5.0-amd64-preempt-noide-20120410] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: SAMSUNG SSD 830 Series Serial Number: S0WGNEAC400484 LU WWN Device Id: 5 002538 043584d30 Firmware Version: CXM03B1Q User Capacity: 512,110,190,592 bytes [512 GB] Sector Size: 512 bytes logical/physical Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ACS-2 revision 2 Local Time is: Thu Aug 2 14:42:08 2012 PDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x83) Offline data collection activity is in a Reserved state. 
Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 1980) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 33) minutes. SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0 9 Power_On_Hours -O--CK 099 099 000 - 36 12 Power_Cycle_Count -O--CK 099 099 000 - 7 177 Wear_Leveling_Count PO--C- 099 099 000 - 9 179 Used_Rsvd_Blk_Cnt_Tot PO--C- 100 100 010 - 0 181 Program_Fail_Cnt_Total -O--CK 100 100 010 - 0 182 Erase_Fail_Count_Total -O--CK 100 100 010 - 0 183 Runtime_Bad_Block PO--C- 100 100 010 - 0 187 Reported_Uncorrect -O--CK 100 100 000 - 0 190 Airflow_Temperature_Cel -O--CK 058 040 000 - 42 195 Hardware_ECC_Recovered -O-RC- 200 200 000 - 0 199 UDMA_CRC_Error_Count -OSRCK 253 253 000 - 0 235 Unknown_Attribute -O--C- 099 099 000 - 4 241 Total_LBAs_Written -O--CK 099 099 000 - 433352008 ||||||_ K auto-keep |||||__ C event count ||||___ R error rate |||____ S speed/performance ||_____ O updated online |______ P prefailure warning General Purpose Log Directory Version 1 SMART Log Directory Version 1 [multi-sector log support] GP/S Log at address 0x00 has 1 sectors [Log Directory] GP/S Log at address 
0x01 has 1 sectors [Summary SMART error log] GP/S Log at address 0x02 has 2 sectors [Comprehensive SMART error log] GP/S Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log] GP/S Log at address 0x06 has 1 sectors [SMART self-test log] GP/S Log at address 0x07 has 2 sectors [Extended self-test log] GP/S Log at address 0x09 has 1 sectors [Selective self-test log] GP/S Log at address 0x10 has 1 sectors [NCQ Command Error] GP/S Log at address 0x11 has 1 sectors [SATA Phy Event Counters] GP/S Log at address 0x80 has 16 sectors [Host vendor specific log] GP/S Log at address 0x81 has 16 sectors [Host vendor specific log] GP/S Log at address 0x82 has 16 sectors [Host vendor specific log] GP/S Log at address 0x83 has 16 sectors [Host vendor specific log] (snip) SMART Extended Comprehensive Error Log Version: 1 (2 sectors) No Errors Logged SMART Extended Self-test Log Version: 1 (2 sectors) Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 27 - # 2 Short offline Completed without error 00% 8 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 
Warning: device does not support SCT Commands SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x0001 2 0 Command failed due to ICRC error 0x0002 2 49152 R_ERR response for data FIS 0x0003 2 53248 R_ERR response for device-to-host data FIS 0x0004 2 57344 R_ERR response for host-to-device data FIS 0x0005 2 7952 R_ERR response for non-data FIS 0x0006 2 2562 R_ERR response for device-to-host non-data FIS 0x0007 2 4096 R_ERR response for host-to-device non-data FIS 0x0008 2 7952 Device-to-host non-data FIS retries 0x0009 2 7953 Transition from drive PhyRdy to drive PhyNRdy 0x000a 2 25 Device-to-host register FISes sent due to a COMRESET 0x000b 2 6145 CRC errors within host-to-device FIS 0x000d 2 7953 Non-CRC errors within host-to-device FIS 0x000f 2 8176 R_ERR response for host-to-device data FIS, CRC 0x0010 2 15 R_ERR response for host-to-device data FIS, non-CRC 0x0012 2 49153 R_ERR response for host-to-device non-data FIS, CRC 0x0013 2 12112 R_ERR response for host-to-device non-data FIS, non-CRC -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 22+ messages in thread
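[Editorial aside: the attribute table in the `smartctl -a` output above is fixed-column text, so a few lines of scripting are enough to pull out the values worth watching (reallocated sectors, wear leveling, CRC errors). A minimal illustrative parser for that table layout; for real tooling, newer smartmontools versions can emit JSON with `smartctl -j` instead:]

```python
# Parse one row of the `smartctl -a` vendor attribute table quoted above.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
def parse_attr(line):
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized current value
        "worst": int(f[4]),
        "thresh": int(f[5]),  # failure threshold
        "raw": f[-1],         # vendor raw value (here: temperature in C)
    }

row = parse_attr("190 Airflow_Temperature_Cel 0x0032 058 040 000 "
                 "Old_age Always - 42")
# The attribute has not tripped as long as VALUE stays above THRESH.
ok = row["value"] > row["thresh"]
print(row["name"], row["raw"], ok)
```

[Applied to the table above, nothing looks alarming: zero reallocated sectors, zero reported CRC errors, and single-digit wear leveling on a drive with 36 power-on hours, which fits the eventual conclusion that the slowness was not dying flash.]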
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-02 21:49 ` Marc MERLIN @ 2012-08-03 18:45 ` Martin Steigerwald 2012-08-16 7:45 ` Marc MERLIN 0 siblings, 1 reply; 22+ messages in thread From: Martin Steigerwald @ 2012-08-03 18:45 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha On Thursday, 2 August 2012, Marc MERLIN wrote: > > > I'll try plugging this SSD in a totally different PC and see what > > > happens. This may tell if it's an AHCI/intel sata driver problem. > > > > > > > > Seems we will continue until someone starts to complain here. Maybe > > another list will be more appropriate? But then this thread has it all > > in one ;). Adding a CC with some introductory note might be > > appropriate. It's your problem, so you decide ;). I´d suggest the fio > > mailing list, there are other performance people who may want to > > chime in. > > > Actually you know the lists and people involved better than I do. I'd be > happy if you added a Cc to another list you think is best, and we can > move there. I have no more ideas for the moment. I suggest you try the fio mailing list at fio lalala vger.kernel.org. I currently have no ideas for other lists. I bet there is some block layer / libata / scsi list that might match. Just browse the lists from kernel.org and pick the one that sounds most suitable to you¹. Probably fsdevel, but the thing does not seem to be filesystem related beside some alignment effect (stripe in Ext4). Hmmm, probably linux-scsi then, but look there first whether block layer and libata related things are discussed there. I would start a new thread. Tell the problem, summarize what you have tried and what the outcome was, and ask for advice. Add a link to the original thread here. (It might be a good idea to have one more post here with a link to the new thread so that other curious people can follow there – in case there are any. Otherwise I would end it on this thread.)
The thread view already looks ridiculous in KMail here ;-) And put me on Cc ;). I´d really like to know what's going on here. As written, I have no more ideas right now. It seems to be getting too low level for me. [1] http://vger.kernel.org/vger-lists.html Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-03 18:45 ` Martin Steigerwald @ 2012-08-16 7:45 ` Marc MERLIN 0 siblings, 0 replies; 22+ messages in thread From: Marc MERLIN @ 2012-08-16 7:45 UTC (permalink / raw) To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha To close this thread: I gave up on the drive after it showed abysmal benchmarking results in Windows too. I owed everyone an update, which I just finished typing: http://marc.merlins.org/perso/linux/post_2012-08-15_The-tale-of-SSDs_-Crucial-C300-early-Death_-Samsung-830-extreme-random-IO-slowness_-and-settling-with-OCZ-Vertex-4.html I'm not sure how I could have gotten 2 bad drives from Samsung in 2 different shipments, so I'm afraid the entire line may be bad. At least, it was for me after extensive benchmarking, even using their own Windows benchmarking tool. In the end, I got an OCZ Vertex 4 and it's superfast as per the benchmarks I posted in the link above. What is interesting is that ext4 is somewhat faster than btrfs in the tests I posted above, but not in ways that should be truly worrisome. I'm still happy to have COW and snapshots, even if that costs me performance. Sorry for spamming the list with what apparently just ended up being a crappy drive (well, 2, since I got two just to make sure) from a vendor who shouldn't have been selling crappy SSDs. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-02 5:07 ` Marc MERLIN 2012-08-02 11:18 ` Martin Steigerwald @ 2012-08-02 11:25 ` Martin Steigerwald 1 sibling, 0 replies; 22+ messages in thread From: Martin Steigerwald @ 2012-08-02 11:25 UTC (permalink / raw) To: linux-btrfs; +Cc: Marc MERLIN, Fajar A. Nugraha On Thursday, 2 August 2012, Marc MERLIN wrote: > So, doctor, is it bad? :) > > randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 > Starting 4 processes > randomwrite: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta 00m:00s] > randomwrite: (groupid=0, jobs=1): err= 0: pid=7193 > write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec > slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55 > clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63 > lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57 > clat percentiles (msec): Heck, I didn´t look at the IOPS figure! 189 IOPS for a SATA-600 SSD. That's pathetic. So again, please test this without dm_crypt. I can´t believe that this is the maximum the hardware is able to achieve. A really fast 15000 rpm SAS hard disk might top that. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 22+ messages in thread
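[Editorial aside: that 189 IOPS figure can be inverted to see what the drive is actually doing per request. A back-of-the-envelope sketch using the numbers quoted above: with 64 writes in flight, 189 IOPS means each write effectively takes about a third of a second to complete, consistent with the ~334 ms mean latency fio reported.]

```python
# queue_depth / IOPS = effective per-request completion time.
depth, iops = 64, 189
latency_ms = depth / iops * 1000
# About 339 ms per write at queue depth 64, consistent with fio's
# reported mean lat of ~334.59 msec, and glacial for an SSD.
print(f"{latency_ms:.0f} ms per write")
```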
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-01 6:01 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN 2012-08-01 6:08 ` Fajar A. Nugraha @ 2012-08-01 6:36 ` Chris Samuel 2012-08-01 6:40 ` Marc MERLIN 1 sibling, 1 reply; 22+ messages in thread From: Chris Samuel @ 2012-08-01 6:36 UTC (permalink / raw) To: Marc MERLIN Cc: Chris Mason, linux-btrfs, Ted Ts'o, Lukáš Czerner, linux-ext4, axboe, Milan Broz On 01/08/12 16:01, Marc MERLIN wrote: > Third, A freshly made ntfs filesystem through fuse is actually FASTER! Could it be that Samsung's FTL has optimisations in it for NTFS? A horrible thought, but not impossible.. -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How can btrfs take 23sec to stat 23K files from an SSD? 2012-08-01 6:36 ` Chris Samuel @ 2012-08-01 6:40 ` Marc MERLIN 0 siblings, 0 replies; 22+ messages in thread From: Marc MERLIN @ 2012-08-01 6:40 UTC (permalink / raw) To: Chris Samuel Cc: Chris Mason, linux-btrfs, Ted Ts'o, Lukáš Czerner, linux-ext4, axboe, Milan Broz On Wed, Aug 01, 2012 at 04:36:22PM +1000, Chris Samuel wrote: > On 01/08/12 16:01, Marc MERLIN wrote: > > > Third, A freshly made ntfs filesystem through fuse is actually FASTER! > > Could it be that Samsung's FTL has optimisations in it for NTFS? > > A horrible thought, but not impossible.. Not impossible, but it can't be the main reason, since it still clocks 2x slower with ntfs than a spinning disk with encrypted btrfs. Since SSDs should "seek" 10-100x faster than spinning disks, that can't be the only reason. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2012-08-16 7:45 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-07-24 6:18 How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN -- strict thread matches above, loose matches on Subject: below -- 2012-02-02 12:42 brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Chris Mason 2012-07-18 22:04 ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN 2012-07-22 18:58 ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN 2012-07-22 19:35 ` Martin Steigerwald 2012-07-22 20:44 ` Marc MERLIN 2012-07-22 22:41 ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN 2012-07-23 6:42 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN 2012-07-24 7:56 ` Martin Steigerwald 2012-07-27 4:40 ` Marc MERLIN 2012-07-27 11:08 ` Chris Mason 2012-07-27 18:42 ` Marc MERLIN 2012-08-01 5:30 ` du -s src is a lot slower on SSD than spinning disk in the same laptop Marc MERLIN 2012-08-01 6:01 ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN 2012-08-01 6:08 ` Fajar A. Nugraha 2012-08-01 6:21 ` Marc MERLIN 2012-08-01 21:57 ` Martin Steigerwald 2012-08-02 5:07 ` Marc MERLIN 2012-08-02 11:18 ` Martin Steigerwald 2012-08-02 17:39 ` Marc MERLIN 2012-08-02 20:20 ` Martin Steigerwald 2012-08-02 20:44 ` Marc MERLIN 2012-08-02 21:21 ` Martin Steigerwald 2012-08-02 21:49 ` Marc MERLIN 2012-08-03 18:45 ` Martin Steigerwald 2012-08-16 7:45 ` Marc MERLIN 2012-08-02 11:25 ` Martin Steigerwald 2012-08-01 6:36 ` Chris Samuel 2012-08-01 6:40 ` Marc MERLIN