From: Jeff Furlong
Subject: RE: fio signal 11
Date: Wed, 3 Aug 2016 15:29:07 +0000
To: Jens Axboe <axboe@kernel.dk>, Jan Kara
Cc: Sitsofe Wheeler, fio@vger.kernel.org

Indeed. With fio-2.13-35-gf5a5 I ran this workload and the usual combos of log_avg_msec and ramp_time, and all output seems right. Thanks!

Regards,
Jeff

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk]
Sent: Wednesday, August 3, 2016 8:15 AM
To: Jeff Furlong; Jan Kara
Cc: Sitsofe Wheeler; fio@vger.kernel.org
Subject: Re: fio signal 11

OK, try current -git, I believe it should be fixed.

On 08/03/2016 08:03 AM, Jens Axboe wrote:
> I can reproduce it now. Seems to happen a bit randomly for me, and not
> all 4 slat logs are big; some of them are averaged fine. So it smells
> like a race in updating after going out of ramp time.
>
>
> On 08/03/2016 08:02 AM, Jeff Furlong wrote:
>> # fio -version
>> fio-2.13-28-g059b6
>>
>> # fio --name=test_job --ioengine=libaio --direct=1 --rw=randread
>> --iodepth=256 --size=100% --numjobs=4 --bs=4k --filename=/dev/nvme0n1
>> --group_reporting --write_bw_log=test_job --write_iops_log=test_job
>> --write_lat_log=test_job --log_avg_msec=1000 --disable_lat=0
>> --disable_clat=0 --disable_slat=0 --runtime=10s --time_based
>> --output=test_job --ramp_time=1s
>>
>> # ls -l
>> -rw-r--r-- 1 root root     1748 Aug 3 07:57 test_job
>> -rw-r--r-- 1 root root      201 Aug 3 07:57 test_job_bw.1.log
>> -rw-r--r-- 1 root root      196 Aug 3 07:57 test_job_bw.2.log
>> -rw-r--r-- 1 root root      192 Aug 3 07:57 test_job_bw.3.log
>> -rw-r--r-- 1 root root      201 Aug 3 07:57 test_job_bw.4.log
>> -rw-r--r-- 1 root root      179 Aug 3 07:57 test_job_clat.1.log
>> -rw-r--r-- 1 root root      181 Aug 3 07:57 test_job_clat.2.log
>> -rw-r--r-- 1 root root      186 Aug 3 07:57 test_job_clat.3.log
>> -rw-r--r-- 1 root root      178 Aug 3 07:57 test_job_clat.4.log
>> -rw-r--r-- 1 root root      191 Aug 3 07:57 test_job_iops.1.log
>> -rw-r--r-- 1 root root      187 Aug 3 07:57 test_job_iops.2.log
>> -rw-r--r-- 1 root root      184 Aug 3 07:57 test_job_iops.3.log
>> -rw-r--r-- 1 root root      191 Aug 3 07:57 test_job_iops.4.log
>> -rw-r--r-- 1 root root      179 Aug 3 07:57 test_job_lat.1.log
>> -rw-r--r-- 1 root root      181 Aug 3 07:57 test_job_lat.2.log
>> -rw-r--r-- 1 root root      186 Aug 3 07:57 test_job_lat.3.log
>> -rw-r--r-- 1 root root      178 Aug 3 07:57 test_job_lat.4.log
>> -rw-r--r-- 1 root root 49597737 Aug 3 07:57 test_job_slat.1.log
>> -rw-r--r-- 1 root root 42065243 Aug 3 07:57 test_job_slat.2.log
>> -rw-r--r-- 1 root root 24407670 Aug 3 07:57 test_job_slat.3.log
>> -rw-r--r-- 1 root root 47090233 Aug 3 07:57 test_job_slat.4.log
>>
>> Above we can see the slat log files are huge for a 10s runtime. Not
>> sure if it's a factor of total runtime or IOs during that runtime.
>> I can also create the issue when I reduce the runtime to 1s.
>>
>> Regards,
>> Jeff
>>
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Wednesday, August 3, 2016 7:56 AM
>> To: Jeff Furlong; Jan Kara
>> Cc: Sitsofe Wheeler; fio@vger.kernel.org
>> Subject: Re: fio signal 11
>>
>> What version did you test? Works fine for me.
>>
>>
>> On 08/01/2016 03:57 PM, Jeff Furlong wrote:
>>> Sorry to open this item back up, but it appears that when we add
>>> the ramp_time option, we break the logging. Specifically, slat will
>>> log every entry, regardless of log_avg_msec.
>>>
>>> This example works as intended:
>>> # fio --name=test_job --ioengine=libaio --direct=1 --rw=randread
>>> --iodepth=256 --size=100% --numjobs=4 --bs=4k
>>> --filename=/dev/nvme0n1 --group_reporting --write_bw_log=test_job
>>> --write_iops_log=test_job --write_lat_log=test_job
>>> --log_avg_msec=1000 --disable_lat=0
>>> --disable_clat=0 --disable_slat=0 --runtime=10s --time_based
>>> --output=test_job
>>>
>>> This example is the same but adds a ramp_time; now the slat log is
>>> full of individual entries:
>>> # fio --name=test_job --ioengine=libaio --direct=1 --rw=randread
>>> --iodepth=256 --size=100% --numjobs=4 --bs=4k
>>> --filename=/dev/nvme0n1 --group_reporting --write_bw_log=test_job
>>> --write_iops_log=test_job --write_lat_log=test_job
>>> --log_avg_msec=1000 --disable_lat=0
>>> --disable_clat=0 --disable_slat=0 --runtime=10s --time_based
>>> --output=test_job --ramp_time=1s
>>>
>>> Regards,
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Jens Axboe [mailto:axboe@kernel.dk]
>>> Sent: Tuesday, July 26, 2016 11:35 AM
>>> To: Jeff Furlong; Jan Kara
>>> Cc: Sitsofe Wheeler; fio@vger.kernel.org
>>> Subject: Re: fio signal 11
>>>
>>> Perfect, thanks for testing!
>>>
>>>
>>> On 07/26/2016 12:33 PM, Jeff Furlong wrote:
>>>> FYI, with the patch back in version fio-2.13-1-gce8b, I re-ran my
>>>> prior workload that caused the signal 11. The workload now
>>>> completes without issue.
>>>>
>>>> Regards,
>>>> Jeff
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org]
>>>> On Behalf Of Jens Axboe
>>>> Sent: Tuesday, July 26, 2016 7:17 AM
>>>> To: Jan Kara
>>>> Cc: Jeff Furlong; Sitsofe Wheeler; fio@vger.kernel.org
>>>> Subject: Re: fio signal 11
>>>>
>>>> On 07/26/2016 02:43 AM, Jan Kara wrote:
>>>>> On Mon 25-07-16 09:21:00, Jens Axboe wrote:
>>>>>> On 07/19/2016 11:08 PM, Jan Kara wrote:
>>>>>>> On Thu 16-06-16 09:06:51, Jens Axboe wrote:
>>>>>>>> On 06/15/2016 04:45 PM, Jan Kara wrote:
>>>>>>>>> On Sat 11-06-16 21:30:00, Jens Axboe wrote:
>>>>>>>>>> On 06/11/2016 08:56 PM, Jens Axboe wrote:
>>>>>>>>>>> On 06/10/2016 12:42 PM, Jeff Furlong wrote:
>>>>>>>>>>>> Good point. Here is the trace:
>>>>>>>>>>>>
>>>>>>>>>>>> [New LWP 59231]
>>>>>>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>>>>>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>>>>>>>>>> Core was generated by `fio --name=test_job --ioengine=libaio
>>>>>>>>>>>> --direct=1 --rw=write --iodepth=32'.
>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>> #0  0x0000000000421e39 in regrow_log (iolog=0x7f828c0c5ad0)
>>>>>>>>>>>>     at stat.c:1909
>>>>>>>>>>>> 1909            if (!cur_log) {
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>> #0  0x0000000000421e39 in regrow_log (iolog=0x7f828c0c5ad0)
>>>>>>>>>>>>     at stat.c:1909
>>>>>>>>>>>> #1  0x000000000042d4df in regrow_logs (td=td@entry=0x7f8277de0000)
>>>>>>>>>>>>     at stat.c:1965
>>>>>>>>>>>> #2  0x000000000040ca90 in wait_for_completions
>>>>>>>>>>>>     (td=td@entry=0x7f8277de0000, time=time@entry=0x7fffcfb6b300)
>>>>>>>>>>>>     at backend.c:446
>>>>>>>>>>>> #3  0x000000000045ade7 in do_io (bytes_done=<synthetic pointer>,
>>>>>>>>>>>>     td=0x7f8277de0000) at backend.c:991
>>>>>>>>>>>> #4  thread_main (data=data@entry=0x264d450) at backend.c:1667
>>>>>>>>>>>> #5  0x000000000045cfec in run_threads (sk_out=sk_out@entry=0x0)
>>>>>>>>>>>>     at backend.c:2217
>>>>>>>>>>>> #6  0x000000000045d2cd in fio_backend (sk_out=sk_out@entry=0x0)
>>>>>>>>>>>>     at backend.c:2349
>>>>>>>>>>>> #7  0x000000000040d09c in main (argc=22, argv=0x7fffcfb6f638,
>>>>>>>>>>>>     envp=<optimized out>) at fio.c:63
>>>>>>>>>>>
>>>>>>>>>>> That looks odd, thanks for reporting this. I'll see if I can
>>>>>>>>>>> get to this on Monday; if not, it'll have to wait until
>>>>>>>>>>> after my vacation. So while I appreciate people running
>>>>>>>>>>> -git and finding issues like these before they show up in a
>>>>>>>>>>> release, it might be best to revert back to 2.11 until I can
>>>>>>>>>>> get this debugged.
>>>>>>>>>>
>>>>>>>>>> I take that back - continue using -git! Just pull a fresh
>>>>>>>>>> copy; it should be fixed now.
>>>>>>>>>>
>>>>>>>>>> Jan, the reporter is right: 2.11 works and -git does not. So
>>>>>>>>>> I just ran a quick bisect, changing the logging from every
>>>>>>>>>> second to every 100ms to make it reproduce faster.
>>>>>>>>>> I don't have time to look into why yet, so I just reverted
>>>>>>>>>> the commit.
>>>>>>>>>>
>>>>>>>>>> commit d7982dd0ab2a1a315b5f9859c67a02414ce6274f
>>>>>>>>>> Author: Jan Kara
>>>>>>>>>> Date:   Tue May 24 17:03:21 2016 +0200
>>>>>>>>>>
>>>>>>>>>>     fio: Simplify forking of processes
>>>>>>>>>
>>>>>>>>> Hum, I've tried reproducing this but failed (I've tried using
>>>>>>>>> /dev/ram0 and /dev/sda4 as devices for fio). Is it somehow
>>>>>>>>> dependent on the device fio works with? I have used commit
>>>>>>>>> 54d0a3150d44adca3ee4047fabd85651c6ea2db1 (just before you
>>>>>>>>> reverted my patch) for testing.
>>>>>>>>
>>>>>>>> On vacation right now; I'll check when I get back. It is
>>>>>>>> possible that it was just a fluke, since there was another bug
>>>>>>>> there related to shared memory, but it was predictably crashing
>>>>>>>> at the same time for the bisect.
>>>>>>>>
>>>>>>>> It doesn't make a lot of sense, however.
>>>>>>>
>>>>>>> Did you have a chance to look into this?
>>>>>>
>>>>>> I have not, unfortunately, but I suspect the original patch was
>>>>>> fine and that the later change to allocate cur_log out of the
>>>>>> shared pool was the real fix.
>>>>>
>>>>> That's what I'd suspect as well, but I'm not able to reproduce
>>>>> even the original crash, so I cannot verify this theory. What's
>>>>> the plan going forward? Will you re-apply the patch? Frankly, I
>>>>> don't care much; it was just a small cleanup. I'm just curious
>>>>> whether it was really that other bug or whether I'm missing
>>>>> something.
>>>>
>>>> Yes, I think re-applying would be the best way forward, especially
>>>> since 2.13 was just released, so we'll have a while to iron out any
>>>> issues. But I really don't see how it could be the reason for the
>>>> issue; I'm guessing it just exacerbated it somehow.
>>>>
>>>> --
>>>> Jens Axboe
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe fio" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>> Western Digital Corporation (and its subsidiaries) E-mail
>>>> Confidentiality Notice & Disclaimer:
>>>>
>>>> This e-mail and any files transmitted with it may contain
>>>> confidential or legally privileged information of WDC and/or its
>>>> affiliates, and are intended solely for the use of the individual
>>>> or entity to which they are addressed. If you are not the intended
>>>> recipient, any disclosure, copying, distribution or any action
>>>> taken or omitted to be taken in reliance on it, is prohibited. If
>>>> you have received this e-mail in error, please notify the sender
>>>> immediately and delete the e-mail in its entirety from your system.
>>>
>>>
>>> --
>>> Jens Axboe
>>> >> >> >> -- >> Jens Axboe >> >> Western Digital Corporation (and its subsidiaries) E-mail = >> Confidentiality Notice & Disclaimer: >> >> This e-mail and any files transmitted with it may contain = >> confidential or legally privileged information of WDC and/or its = >> affiliates, and are intended solely for the use of the individual or = >> entity to which they are addressed. If you are not the intended = >> recipient, any disclosure, copying, distribution or any action taken = >> or omitted to be taken in reliance on it, is prohibited. If you have = >> received this e-mail in error, please notify the sender immediately = >> and delete the e-mail in its entirety from your system. >> > > -- Jens Axboe Western Digital Corporation (and its subsidiaries) E-mail Confidentiality N= otice & Disclaimer: This e-mail and any files transmitted with it may contain confidential or l= egally privileged information of WDC and/or its affiliates, and are intende= d solely for the use of the individual or entity to which they are addresse= d. If you are not the intended recipient, any disclosure, copying, distribu= tion or any action taken or omitted to be taken in reliance on it, is prohi= bited. If you have received this e-mail in error, please notify the sender = immediately and delete the e-mail in its entirety from your system.