From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from magic.merlins.org ([209.81.13.136]:60835 "EHLO mail1.merlins.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753090Ab2HBVtt (ORCPT ); Thu, 2 Aug 2012 17:49:49 -0400 Date: Thu, 2 Aug 2012 14:49:46 -0700 From: Marc MERLIN To: Martin Steigerwald Cc: linux-btrfs@vger.kernel.org, "Fajar A. Nugraha" Subject: Re: How can btrfs take 23sec to stat 23K files from an SSD? Message-ID: <20120802214946.GB1834@merlins.org> References: <20120722185848.GA10089@merlins.org> <201208022220.08217.Martin@lichtvoll.de> <20120802204414.GA1834@merlins.org> <201208022321.51795.Martin@lichtvoll.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 In-Reply-To: <201208022321.51795.Martin@lichtvoll.de> Sender: linux-btrfs-owner@vger.kernel.org List-ID: On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote: > > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > > Obviously it's sending IO in a way that either the IO subsystem, linux > > driver, or drive prefer. > > But only on reads. Yes. I was focussing on the simple case. SSDs can be slower than hard drives on writes, and now you're getting into deep firmware magic for corner cases, so I didn't want to touch that here :) > WTF? > > Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely > without it. I left it as is: [global] ioengine=libaio direct=1 filename=/dev/sda size=476940m bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 I just removed it now. > Okay, one further idea: Remove the bsrange to just test with 4k blocks. Ok, there you go: [global] ioengine=libaio direct=1 filename=/dev/sda blocksize=4k blockalign=4k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 gandalfthegreat:~# fio --readonly ./fio.job4 zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614 read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05 clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69 lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99 clat percentiles (usec): | 1.00th=[ 41], 5.00th=[ 46], 10.00th=[ 51], 20.00th=[ 66], | 30.00th=[ 113], 40.00th=[ 163], 50.00th=[ 205], 60.00th=[ 4960], | 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120], | 99.99th=[22912] bw (KB/s) : min= 253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74 lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=1iiiiiiii9.94% lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09% lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05% cpu : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=15208/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636 read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23 clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76 lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31 clat percentiles (usec): | 1.00th=[ 27], 5.00th=[ 29], 10.00th=[ 31], 20.00th=[ 35], | 30.00th=[ 47], 40.00th=[ 61], 50.00th=[ 78], 60.00th=[ 107], | 70.00th=[ 270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608], | 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584], | 99.99th=[21120] bw (KB/s) : min= 282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36 lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61% lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03% lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03% cpu : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=22117/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, mint=60012msec, maxt=60012msec Run status group 1 (all jobs): READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, mint=60009msec, maxt=60009msec Disk stats (read/write): sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, util=99.51% and the same test with blockalign=512k: gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k fio: Any use of blockalign= turns off randommap zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679 read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51 clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32 lat (usec): min=44 , max=38980 , avg=4272.70, stdev=5154.68 clat percentiles (usec): | 1.00th=[ 42], 5.00th=[ 46], 10.00th=[ 53], 20.00th=[ 69], | 30.00th=[ 120], 40.00th=[ 169], 50.00th=[ 211], 60.00th=[ 5024], | 70.00th=[ 5920], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[16768], 99.50th=[17536], 99.90th=[18560], 99.95th=[20096], | 99.99th=[37632] bw (KB/s) : min= 263, max= 3133, per=100.00%, avg=935.09, stdev=392.66 lat (usec) : 4=0.05%, 10=0.03%, 20=0.01%, 50=7.84%, 100=19.12% lat (usec) : 250=25.84%, 500=1.61%, 750=0.08%, 1000=0.05% lat (msec) : 2=0.06%, 4=0.08%, 10=21.99%, 20=23.17%, 50=0.06% cpu : usr=0.29%, sys=0.94%, ctx=14069, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=14024/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3703 read : io=99592KB, bw=1659.7KB/s, iops=414 , runt= 60008msec slat (usec): min=4 , max=186 , avg=15.83, stdev= 9.95 clat (usec): min=1 , max=25530 , avg=2390.79, stdev=4340.79 lat (usec): min=28 , max=25539 , avg=2407.01, stdev=4342.90 clat percentiles (usec): | 1.00th=[ 26], 5.00th=[ 28], 10.00th=[ 30], 20.00th=[ 34], | 30.00th=[ 43], 40.00th=[ 57], 50.00th=[ 72], 60.00th=[ 91], | 70.00th=[ 155], 80.00th=[ 5152], 90.00th=[11712], 95.00th=[12480], | 99.00th=[12992], 99.50th=[16768], 99.90th=[18560], 99.95th=[18816], | 99.99th=[21632] bw (KB/s) : min= 269, max= 8465, per=100.00%, avg=1663.84, stdev=1367.82 lat (usec) : 2=0.01%, 4=0.02%, 20=0.01%, 50=35.42%, 100=26.00% lat (usec) : 250=11.29%, 500=1.31%, 750=0.05%, 1000=0.01% lat (msec) : 2=0.04%, 4=0.02%, 10=12.55%, 20=13.25%, 50=0.01% cpu : usr=0.31%, sys=0.99%, ctx=24963, majf=0, minf=25 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=24898/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=56096KB, aggrb=934KB/s, minb=934KB/s, maxb=934KB/s, mint=60001msec, maxt=60001msec Run status group 1 (all jobs): READ: io=99592KB, aggrb=1659KB/s, minb=1659KB/s, maxb=1659KB/s, mint=60008msec, maxt=60008msec Disk stats (read/write): sda: ios=55021/1637, merge=72/39, ticks=240896/12080, in_queue=252912, util=99.59% > > Since I've gotten 2 SSDs to make sure I didn't get one bad one, and that the > > company is greating great reviews for them, I'm now pretty sure that > > it's either a problem with a linux driver, which is interesting for us > > all to debug :) > > Either that or there possible is some firmware update? But then those > reviews would mention it I bet. So far, there are no firmware updates from mine, I checked right after unboxing it :) > > I'll try plugging this SSD in a totally different PC and see what happens. > > This may say if it's an AHCI/intel sata driver problem. > > Seems we will continue until someone starts to complain here. Maybe > another list will be more approbiate? But then this thread has it all > in one ;). Adding a CC with some introductory note might be approbiate. > Its your problem, so you decide ;). I´d suggest the fio mailing list, > there are other performance people how may want to chime in. Actually you know the lists and people involved more than me. I'd be happy if you added a Cc to another list you think is best, and we can move there. > Another idea: Is there anything funny in SMART values (smartctl -a > and possibly even -x)? Well I gather these SSDs are new, but still. smartctl -a /dev/sda (snipped) SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 36 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 7 177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 9 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0032 058 040 000 Old_age Always - 42 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0 199 UDMA_CRC_Error_Count 0x003e 253 253 000 Old_age Always - 0 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 4 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 433339112 smartctl -x /dev/sda smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.5.0-amd64-preempt-noide-20120410] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: SAMSUNG SSD 830 Series Serial Number: S0WGNEAC400484 LU WWN Device Id: 5 002538 043584d30 Firmware Version: CXM03B1Q User Capacity: 512,110,190,592 bytes [512 GB] Sector Size: 512 bytes logical/physical Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: ACS-2 revision 2 Local Time is: Thu Aug 2 14:42:08 2012 PDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x83) Offline data collection activity is in a Reserved state. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 1980) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 33) minutes. SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0 9 Power_On_Hours -O--CK 099 099 000 - 36 12 Power_Cycle_Count -O--CK 099 099 000 - 7 177 Wear_Leveling_Count PO--C- 099 099 000 - 9 179 Used_Rsvd_Blk_Cnt_Tot PO--C- 100 100 010 - 0 181 Program_Fail_Cnt_Total -O--CK 100 100 010 - 0 182 Erase_Fail_Count_Total -O--CK 100 100 010 - 0 183 Runtime_Bad_Block PO--C- 100 100 010 - 0 187 Reported_Uncorrect -O--CK 100 100 000 - 0 190 Airflow_Temperature_Cel -O--CK 058 040 000 - 42 195 Hardware_ECC_Recovered -O-RC- 200 200 000 - 0 199 UDMA_CRC_Error_Count -OSRCK 253 253 000 - 0 235 Unknown_Attribute -O--C- 099 099 000 - 4 241 Total_LBAs_Written -O--CK 099 099 000 - 433352008 ||||||_ K auto-keep |||||__ C event count ||||___ R error rate |||____ S speed/performance ||_____ O updated online |______ P prefailure warning General Purpose Log Directory Version 1 SMART Log Directory Version 1 [multi-sector log support] GP/S Log at address 0x00 has 1 sectors [Log Directory] GP/S Log at address 0x01 has 1 sectors [Summary SMART error log] GP/S Log at address 0x02 has 2 sectors [Comprehensive SMART error log] GP/S Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log] GP/S Log at address 0x06 has 1 sectors [SMART self-test log] GP/S Log at address 0x07 has 2 sectors [Extended self-test log] GP/S Log at address 0x09 has 1 sectors [Selective self-test log] GP/S Log at address 0x10 has 1 sectors [NCQ Command Error] GP/S Log at address 0x11 has 1 sectors [SATA Phy Event Counters] GP/S Log at address 0x80 has 16 sectors [Host vendor specific log] GP/S Log at address 0x81 has 16 sectors [Host vendor specific log] GP/S Log at address 0x82 has 16 sectors [Host vendor specific log] GP/S Log at address 0x83 has 16 sectors [Host vendor specific log] (snip) SMART Extended Comprehensive Error Log Version: 1 (2 sectors) No Errors Logged SMART Extended Self-test Log Version: 1 (2 sectors) Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 27 - # 2 Short offline Completed without error 00% 8 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. Warning: device does not support SCT Commands SATA Phy Event Counters (GP Log 0x11) ID Size Value Description 0x0001 2 0 Command failed due to ICRC error 0x0002 2 49152 R_ERR response for data FIS 0x0003 2 53248 R_ERR response for device-to-host data FIS 0x0004 2 57344 R_ERR response for host-to-device data FIS 0x0005 2 7952 R_ERR response for non-data FIS 0x0006 2 2562 R_ERR response for device-to-host non-data FIS 0x0007 2 4096 R_ERR response for host-to-device non-data FIS 0x0008 2 7952 Device-to-host non-data FIS retries 0x0009 2 7953 Transition from drive PhyRdy to drive PhyNRdy 0x000a 2 25 Device-to-host register FISes sent due to a COMRESET 0x000b 2 6145 CRC errors within host-to-device FIS 0x000d 2 7953 Non-CRC errors within host-to-device FIS 0x000f 2 8176 R_ERR response for host-to-device data FIS, CRC 0x0010 2 15 R_ERR response for host-to-device data FIS, non-CRC 0x0012 2 49153 R_ERR response for host-to-device non-data FIS, CRC 0x0013 2 12112 R_ERR response for host-to-device non-data FIS, non-CRC -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/