* segfault running fio against 2048 jobs
@ 2012-04-17 21:05 Roger Sibert
  2012-04-18  7:23 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Roger Sibert @ 2012-04-17 21:05 UTC (permalink / raw)
  To: fio

Hello Everyone,

I am using a 2.0x variant and ran across a couple of things, one of which looks to be as designed; the other was a segfault in fio.

My original job file had 4800 entries, which exceeds the max limit (error: maximum number of jobs (2048) reached). The question I have here is: is there a reason the limit can't be raised to handle larger job files?

Reducing the job file to the max and re-running it, fio jumped straight to the initial print screen and then to a segfault (Segmentation fault (core dumped)).

A quick look with gdb gave me:

[root@localhost std-testing]# gdb fio core.9582
GNU gdb (GDB) CentOS (7.0.1-42.el5.centos)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/fio-test/std-testing/fio...done.
[New Thread 9583]
[New Thread 9582]

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff213fd000
Core was generated by `./fio --output=1.log 1.inp'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004167b0 in display_thread_status (je=<value optimized out>) at eta.c:416
416     eta.c: No such file or directory.
        in eta.c
(gdb) quit

I reduced the job count down to about 33 and re-started the run, which I am waiting to finish so I can re-compile fio with whatever extra flags, and at whatever code level, are requested. Currently file gives me:
fio: ELF 64-bit LSB executable, AMD x86-64, version 1 (GNU/Linux), for GNU/Linux 2.6.15, statically linked, not stripped
This is running on a CentOS box:
Linux localhost.localdomain 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Roger






* Re: segfault running fio against 2048 jobs
  2012-04-17 21:05 segfault running fio against 2048 jobs Roger Sibert
@ 2012-04-18  7:23 ` Jens Axboe
  2012-04-18  9:02   ` Roger Sibert
  2012-04-18 17:27   ` Roger Sibert
  0 siblings, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2012-04-18  7:23 UTC (permalink / raw)
  To: Roger Sibert; +Cc: fio

On 04/17/2012 11:05 PM, Roger Sibert wrote:
> Hello Everyone,
> 
> I am using a 2.0x variant and ran across a couple of things, one of which
> looks to be as designed and the other was a segfault in fio.
> 
> My original job file had 4800 entries which exceeds the max limit.
> (error: maximum number of jobs (2048) reached)  The question I have
> here is, is there a reason the limit can't be raised to handle larger
> job files?

There's no inherent limit in fio that causes this, it was done to avoid
errors on platforms where shared memory segments were more limited. A
check now reveals that thread_data is around 15KB, which means that the
segment is around 30MB in total. You should be safe to bump the

#define REAL_MAX_JOBS           2048

in fio.h to something bigger. In fact I should just make it bigger, we
scale it down these days if we see errors.

> Reducing the job file to the max and re-running it jumped straight to the initial print screen and then to a segfault. (Segmentation fault (core dumped))
> 
> Doing a quick look gave me 
> 
> [root@localhost std-testing]# gdb fio core.9582
> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/fio-test/std-testing/fio...done.
> [New Thread 9583]
> [New Thread 9582]
> 
> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff213fd000
> Core was generated by `./fio --output=1.log 1.inp'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000004167b0 in display_thread_status (je=<value optimized out>) at eta.c:416
> 416     eta.c: No such file or directory.
>         in eta.c
> (gdb) quit
> 
> I reduced the job count down to about 33 and re-started the run which I am waiting to finish so I can re-compile fio with whatever extra flags and to whatever code level are requested.  Currently file gives me:
> fio: ELF 64-bit LSB executable, AMD x86-64, version 1 (GNU/Linux), for GNU/Linux 2.6.15, statically linked, not stripped
> Which is running on a CentOS box
> Linux localhost.localdomain 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

There's not enough information here to help you out, I'm afraid. What
fio version are you running? What job did you run that caused this
failure?

-- 
Jens Axboe



* RE: segfault running fio against 2048 jobs
  2012-04-18  7:23 ` Jens Axboe
@ 2012-04-18  9:02   ` Roger Sibert
  2012-04-18 17:27   ` Roger Sibert
  1 sibling, 0 replies; 13+ messages in thread
From: Roger Sibert @ 2012-04-18  9:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

[-- Attachment #1: Type: text/html, Size: 5868 bytes --]


* RE: segfault running fio against 2048 jobs
  2012-04-18  7:23 ` Jens Axboe
  2012-04-18  9:02   ` Roger Sibert
@ 2012-04-18 17:27   ` Roger Sibert
  2012-04-18 18:16     ` Roger Sibert
  2012-04-18 18:39     ` Jens Axboe
  1 sibling, 2 replies; 13+ messages in thread
From: Roger Sibert @ 2012-04-18 17:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

Here's hoping Outlook doesn't inject HTML into the message again.

[global]
direct=1
ioengine=libaio
zonesize=1g
randrepeat=1
write_bw_log
write_lat_log
time_based
ramp_time=15s
runtime=15s
;
[sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
description=[sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
stonewall
filename=/dev/sdf
iodepth=1
rw=rw
rwmixread=50
rwmixwrite=50
bs=128k

Running just the 2048 job on its own doesn't cause any issues.

I did a fresh git clone and ended up with fio-2.0.7-10-g8430 (compiled on
the local system without making any changes to the code) and re-ran the
test using the full 2048 jobs to verify that the segfault still occurs,
which it does.  I also noted that the segfault is near-immediate once the
Jobs: 1 (f=xxx) line prints, and stays that way until you reduce the job
count to about 535.  At about 535 it runs for roughly 15 seconds before
segfaulting; 500 is still running after about 3 minutes.

Thanks,
Roger

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, April 18, 2012 3:24 AM
To: Roger Sibert
Cc: fio@vger.kernel.org
Subject: Re: segfault running fio against 2048 jobs

On 04/17/2012 11:05 PM, Roger Sibert wrote:
> Hello Everyone,
> 
> I am using a 2.0x variant and ran across a couple of things, one of which
> looks to be as designed and the other was a segfault in fio.
> 
> My original job file had 4800 entries which exceeds the max limit.
> (error: maximum number of jobs (2048) reached)  The question I have
> here is, is there a reason the limit can't be raised to handle larger
> job files?

There's no inherent limit in fio that causes this, it was done to avoid
errors on platforms where shared memory segments were more limited. A
check now reveals that thread_data is around 15KB, which means that the
segment is around 30MB in total. You should be safe to bump the

#define REAL_MAX_JOBS           2048

in fio.h to something bigger. In fact I should just make it bigger, we
scale it down these days if we see errors.

> Reducing the job file to the max re-running it jumped straight to the
> initial print screen and then to a segfault. (Segmentation fault (core dumped))
> 
> Doing a quick look gave me 
> 
> [root@localhost std-testing]# gdb fio core.9582
> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/fio-test/std-testing/fio...done.
> [New Thread 9583]
> [New Thread 9582]
> 
> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff213fd000
> Core was generated by `./fio --output=1.log 1.inp'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000004167b0 in display_thread_status (je=<value optimized out>) at eta.c:416
> 416     eta.c: No such file or directory.
>         in eta.c
> (gdb) quit
> 
> I reduced the job count down to about 33 and re-started the run which
> I am waiting to finish so I can re-compile fio with whatever extra flags
> and to whatever code level are requested.  Currently file gives me:
> fio: ELF 64-bit LSB executable, AMD x86-64, version 1 (GNU/Linux), for
> GNU/Linux 2.6.15, statically linked, not stripped
> Which is running on a CentOS box
> Linux localhost.localdomain 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

There's not enough information here to help you out, I'm afraid. What
fio version are you running? What job did you run that caused this
failure?

-- 
Jens Axboe

______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 




* RE: segfault running fio against 2048 jobs
  2012-04-18 17:27   ` Roger Sibert
@ 2012-04-18 18:16     ` Roger Sibert
  2012-04-18 18:42       ` Jens Axboe
  2012-04-18 18:39     ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Roger Sibert @ 2012-04-18 18:16 UTC (permalink / raw)
  To: Roger Sibert, Jens Axboe; +Cc: fio

Hello Jens,

Not sure if this is a red herring or not.

I did a quick check using valgrind's memcheck on the 1-job sample and
noted that there appears to be a small memory leak, which gets noticeably
worse when you run against a larger job configuration.

All the leaks appear to originate from the same line of code, so just a
snippet of the valgrind output is included below.

1 job configuration file
==19277== 168 bytes in 1 blocks are definitely lost in loss record 9 of
10
==19277==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==19277==    by 0x408A44: load_ioengine (ioengines.c:148)
==19277==    by 0x409BE2: ioengine_load (init.c:694)
==19277==    by 0x409F79: add_job (init.c:765)
==19277==    by 0x40BD26: parse_jobs_ini (init.c:1135)
==19277==    by 0x40C059: parse_options (init.c:1602)
==19277==    by 0x4082F3: main (fio.c:104)
==19277==
.
.
==19277==
==19277== LEAK SUMMARY:
==19277==    definitely lost: 211 bytes in 6 blocks
==19277==    indirectly lost: 0 bytes in 0 blocks
==19277==      possibly lost: 272 bytes in 1 blocks
==19277==    still reachable: 12 bytes in 3 blocks
==19277==         suppressed: 0 bytes in 0 blocks
==19277== Reachable blocks (those to which a pointer was found) are not
shown.
==19277== To see them, rerun with: --leak-check=full
--show-reachable=yes


2048 job configuration file.
==19365== 50,618,216 (311,144 direct, 50,307,072 indirect) bytes in
2,047 blocks are definitely lost in loss record 22 of 22
==19365==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
==19365==    by 0x42DA03: setup_log (iolog.c:499)
==19365==    by 0x40A9DD: add_job (init.c:846)
==19365==    by 0x40BD26: parse_jobs_ini (init.c:1135)
==19365==    by 0x40C059: parse_options (init.c:1602)
==19365==    by 0x4082F3: main (fio.c:104)
==19365==
==19365== LEAK SUMMARY:
==19365==    definitely lost: 1,843,954 bytes in 22,523 blocks
==19365==    indirectly lost: 201,154,560 bytes in 8,185 blocks
==19365==      possibly lost: 73,728 bytes in 3 blocks
==19365==    still reachable: 580 bytes in 4 blocks
==19365==         suppressed: 0 bytes in 0 blocks

Thanks,
Roger
-----Original Message-----
From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On
Behalf Of Roger Sibert
Sent: Wednesday, April 18, 2012 1:27 PM
To: Jens Axboe
Cc: fio@vger.kernel.org
Subject: RE: segfault running fio against 2048 jobs

Here's hoping Outlook doesn't inject HTML into the message again.

[global]
direct=1
ioengine=libaio
zonesize=1g
randrepeat=1
write_bw_log
write_lat_log
time_based
ramp_time=15s
runtime=15s
;
[sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
description=[sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
stonewall
filename=/dev/sdf
iodepth=1
rw=rw
rwmixread=50
rwmixwrite=50
bs=128k

Running just the 2048 job on its own doesn't cause any issues.

I did a fresh git clone and ended up with fio-2.0.7-10-g8430 (which was
compiled on the local system without making any changes to the code) and
re-ran the test using the full 2048 to verify that the segfault still
occurs, which it does.  I also noted that the segfault is about
immediate once seeing Jobs: 1 (f=xxx) print and stays that way until you
reduce it down to 535.  At about 535 it runs for about 15 or so seconds
before segfaulting, 500 is still running after about 3 minutes.

Thanks,
Roger

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, April 18, 2012 3:24 AM
To: Roger Sibert
Cc: fio@vger.kernel.org
Subject: Re: segfault running fio against 2048 jobs

On 04/17/2012 11:05 PM, Roger Sibert wrote:
> Hello Everyone,
> 
> I am using a 2.0x variant and ran across a couple of things, one of which
> looks to be as designed and the other was a segfault in fio.
> 
> My original job file had 4800 entries which exceeds the max limit.
> (error: maximum number of jobs (2048) reached)  The question I have
> here is, is there a reason the limit can't be raised to handle larger
> job files?

There's no inherent limit in fio that causes this, it was done to avoid
errors on platforms where shared memory segments were more limited. A
check now reveals that thread_data is around 15KB, which means that the
segment is around 30MB in total. You should be safe to bump the

#define REAL_MAX_JOBS           2048

in fio.h to something bigger. In fact I should just make it bigger, we
scale it down these days if we see errors.

> Reducing the job file to the max re-running it jumped straight to the
> initial print screen and then to a segfault. (Segmentation fault (core dumped))
> 
> Doing a quick look gave me 
> 
> [root@localhost std-testing]# gdb fio core.9582
> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/fio-test/std-testing/fio...done.
> [New Thread 9583]
> [New Thread 9582]
> 
> warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff213fd000
> Core was generated by `./fio --output=1.log 1.inp'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000004167b0 in display_thread_status (je=<value optimized out>) at eta.c:416
> 416     eta.c: No such file or directory.
>         in eta.c
> (gdb) quit
> 
> I reduced the job count down to about 33 and re-started the run which
> I am waiting to finish so I can re-compile fio with whatever extra flags
> and to whatever code level are requested.  Currently file gives me:
> fio: ELF 64-bit LSB executable, AMD x86-64, version 1 (GNU/Linux), for
> GNU/Linux 2.6.15, statically linked, not stripped
> Which is running on a CentOS box
> Linux localhost.localdomain 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

There's not enough information here to help you out, I'm afraid. What
fio version are you running? What job did you run that caused this
failure?

-- 
Jens Axboe





* Re: segfault running fio against 2048 jobs
  2012-04-18 17:27   ` Roger Sibert
  2012-04-18 18:16     ` Roger Sibert
@ 2012-04-18 18:39     ` Jens Axboe
  2012-04-18 19:46       ` Roger Sibert
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2012-04-18 18:39 UTC (permalink / raw)
  To: Roger Sibert; +Cc: fio

On 2012-04-18 19:27, Roger Sibert wrote:
> Heres hoping Outlook doesn't inject html into the message again.
> 
> [global]
> direct=1
> ioengine=libaio
> zonesize=1g
> randrepeat=1
> write_bw_log
> write_lat_log
> time_based
> ramp_time=15s
> runtime=15s
> ;
> [sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
> description=[sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
> stonewall
> filename=/dev/sdf
> iodepth=1
> rw=rw
> rwmixread=50
> rwmixwrite=50
> bs=128k
> 
> Running just the 2048 job on its own doesn't cause any issues.
> 
> I did a fresh git clone and ended up with fio-2.0.7-10-g8430 (which was
> compiled on the local system without making any changes to the code) and
> re-ran the test using the full 2048 to verify that the segfault still
> occurs, which it does.  I also noted that the segfault is about
> immediate once seeing Jobs: 1 (f=xxx) print and stays that way until you
> reduce it down to 535.  At about 535 it runs for about 15 or so seconds
> before segfaulting, 500 is still running after about 3 minutes.

OK, pretty silly error. Guess not that many people use more than ~500
jobs. What you run into is a simple stack smash. The below should help;
I'm committing it now.

diff --git a/eta.c b/eta.c
index 7e837ba..4679a21 100644
--- a/eta.c
+++ b/eta.c
@@ -360,7 +360,7 @@ void display_thread_status(struct jobs_eta *je)
 {
 	static int linelen_last;
 	static int eta_good;
-	char output[512], *p = output;
+	char output[REAL_MAX_JOBS + 512], *p = output;
 	char eta_str[128];
 	double perc = 0.0;
 	int i2p = 0;
@@ -385,6 +385,7 @@ void display_thread_status(struct jobs_eta *je)
 		char perc_str[32];
 		char *iops_str[2];
 		char *rate_str[2];
+		size_t left;
 		int l;
 
 		if ((!je->eta_sec && !eta_good) || je->nr_ramp == je->nr_running)
@@ -401,7 +402,9 @@ void display_thread_status(struct jobs_eta *je)
 		iops_str[0] = num2str(je->iops[0], 4, 1, 0);
 		iops_str[1] = num2str(je->iops[1], 4, 1, 0);
 
-		l = sprintf(p, ": [%s] [%s] [%s/%s /s] [%s/%s iops] [eta %s]",
+		left = sizeof(output) - (p - output) - 1;
+
+		l = snprintf(p, left, ": [%s] [%s] [%s/%s /s] [%s/%s iops] [eta %s]",
 				je->run_str, perc_str, rate_str[0],
 				rate_str[1], iops_str[0], iops_str[1], eta_str);
 		p += l;

-- 
Jens Axboe



* Re: segfault running fio against 2048 jobs
  2012-04-18 18:16     ` Roger Sibert
@ 2012-04-18 18:42       ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2012-04-18 18:42 UTC (permalink / raw)
  To: Roger Sibert; +Cc: fio

On 2012-04-18 20:16, Roger Sibert wrote:
> Hello Jens,
> 
> Not sure if this is a red herring or not.
> 
> I did a quick check using valgrind with its memcheck on the 1 job sample
> and noted that there appears to be a small memory leak which gets
> noticeably worse when you run against a larger job configuration.
> 
> All the leaks appear to be the same originating line of code so just a
> snippet of the valgrind output is included below.
> 
> 1 job configuration file
> ==19277== 168 bytes in 1 blocks are definitely lost in loss record 9 of
> 10
> ==19277==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
> ==19277==    by 0x408A44: load_ioengine (ioengines.c:148)
> ==19277==    by 0x409BE2: ioengine_load (init.c:694)
> ==19277==    by 0x409F79: add_job (init.c:765)
> ==19277==    by 0x40BD26: parse_jobs_ini (init.c:1135)
> ==19277==    by 0x40C059: parse_options (init.c:1602)
> ==19277==    by 0x4082F3: main (fio.c:104)
> ==19277==
> .
> .
> ==19277==
> ==19277== LEAK SUMMARY:
> ==19277==    definitely lost: 211 bytes in 6 blocks
> ==19277==    indirectly lost: 0 bytes in 0 blocks
> ==19277==      possibly lost: 272 bytes in 1 blocks
> ==19277==    still reachable: 12 bytes in 3 blocks
> ==19277==         suppressed: 0 bytes in 0 blocks
> ==19277== Reachable blocks (those to which a pointer was found) are not
> shown.
> ==19277== To see them, rerun with: --leak-check=full
> --show-reachable=yes
> 
> 
> 2048 job configuration file.
> ==19365== 50,618,216 (311,144 direct, 50,307,072 indirect) bytes in
> 2,047 blocks are definitely lost in loss record 22 of 22
> ==19365==    at 0x4A0610C: malloc (vg_replace_malloc.c:195)
> ==19365==    by 0x42DA03: setup_log (iolog.c:499)
> ==19365==    by 0x40A9DD: add_job (init.c:846)
> ==19365==    by 0x40BD26: parse_jobs_ini (init.c:1135)
> ==19365==    by 0x40C059: parse_options (init.c:1602)
> ==19365==    by 0x4082F3: main (fio.c:104)
> ==19365==
> ==19365== LEAK SUMMARY:
> ==19365==    definitely lost: 1,843,954 bytes in 22,523 blocks
> ==19365==    indirectly lost: 201,154,560 bytes in 8,185 blocks
> ==19365==      possibly lost: 73,728 bytes in 3 blocks
> ==19365==    still reachable: 580 bytes in 4 blocks
> ==19365==         suppressed: 0 bytes in 0 blocks

Yes, there are a few minor leaks that could grow based on the number of
jobs. But it's only really a concern if you run fio as a server backend;
otherwise it's nicely freed when the job is done. And it's not leaking
while a job is running either; it's "just" some of the initialization
memory that isn't freed explicitly on exit.


-- 
Jens Axboe



* RE: segfault running fio against 2048 jobs
  2012-04-18 18:39     ` Jens Axboe
@ 2012-04-18 19:46       ` Roger Sibert
  2012-04-20  6:40         ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Roger Sibert @ 2012-04-18 19:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

I verified the patch in fio-2.0.7-11-g7907 and that does indeed look to
take care of the issue.  (Many thanks for that)

Also in follow up I changed the max job limit to 5120 and it seems to
run properly against that as well.

Question though: is there any reason you have a REAL_MAX_JOBS in fio.h
and then a FIO_MAX_JOBS in os.h? A first glance shows that the init.c
code uses FIO_MAX_JOBS for the thread check and later uses REAL_MAX_JOBS
for the job check, except that max_jobs is set equal to FIO_MAX_JOBS. It
may be that the answer to my question is the os-mac.h file, which means
you have a smaller thread count ... maybe then the result is just a
small adjustment in the error print to show you have exceeded the max #
of jobs and/or max # of threads.
 
Thanks,
Roger



-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Wednesday, April 18, 2012 2:40 PM
To: Roger Sibert
Cc: fio@vger.kernel.org
Subject: Re: segfault running fio against 2048 jobs

On 2012-04-18 19:27, Roger Sibert wrote:
> Here's hoping Outlook doesn't inject HTML into the message again.
> 
> [global]
> direct=1
> ioengine=libaio
> zonesize=1g
> randrepeat=1
> write_bw_log
> write_lat_log
> time_based
> ramp_time=15s
> runtime=15s
> ;
> [sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
> description=[sdf-iodepth1-rw-readwrite_mix_5050-bs128k-2048]
> stonewall
> filename=/dev/sdf
> iodepth=1
> rw=rw
> rwmixread=50
> rwmixwrite=50
> bs=128k
> 
> Running just the 2048 job on its own doesn't cause any issues.
> 
> I did a fresh git clone and ended up with fio-2.0.7-10-g8430 (which was
> compiled on the local system without making any changes to the code) and
> re-ran the test using the full 2048 to verify that the segfault still
> occurs, which it does.  I also noted that the segfault is about
> immediate once seeing Jobs: 1 (f=xxx) print and stays that way until you
> reduce it down to 535.  At about 535 it runs for about 15 or so seconds
> before segfaulting, 500 is still running after about 3 minutes.

OK, pretty silly error. Guess not that many people use more than ~500
jobs. What you run into is a simple stack smash. The below should help,
I'm committing it now.

diff --git a/eta.c b/eta.c
index 7e837ba..4679a21 100644
--- a/eta.c
+++ b/eta.c
@@ -360,7 +360,7 @@ void display_thread_status(struct jobs_eta *je)
 {
 	static int linelen_last;
 	static int eta_good;
-	char output[512], *p = output;
+	char output[REAL_MAX_JOBS + 512], *p = output;
 	char eta_str[128];
 	double perc = 0.0;
 	int i2p = 0;
@@ -385,6 +385,7 @@ void display_thread_status(struct jobs_eta *je)
 		char perc_str[32];
 		char *iops_str[2];
 		char *rate_str[2];
+		size_t left;
 		int l;
 
 		if ((!je->eta_sec && !eta_good) || je->nr_ramp == je->nr_running)
@@ -401,7 +402,9 @@ void display_thread_status(struct jobs_eta *je)
 		iops_str[0] = num2str(je->iops[0], 4, 1, 0);
 		iops_str[1] = num2str(je->iops[1], 4, 1, 0);
 
-		l = sprintf(p, ": [%s] [%s] [%s/%s /s] [%s/%s iops] [eta %s]",
+		left = sizeof(output) - (p - output) - 1;
+
+		l = snprintf(p, left, ": [%s] [%s] [%s/%s /s] [%s/%s iops] [eta %s]",
 				je->run_str, perc_str, rate_str[0],
 				rate_str[1], iops_str[0], iops_str[1], eta_str);
 		p += l;

-- 
Jens Axboe





* Re: segfault running fio against 2048 jobs
  2012-04-18 19:46       ` Roger Sibert
@ 2012-04-20  6:40         ` Jens Axboe
  2012-04-20 14:21           ` Roger Sibert
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2012-04-20  6:40 UTC (permalink / raw)
  To: Roger Sibert; +Cc: fio

On 04/18/2012 09:46 PM, Roger Sibert wrote:
> I verified the patch in fio-2.0.7-11-g7907 and that does indeed look to
> take care of the issue.  (Many thanks for that)
> 
> Also in follow up I changed the max job limit to 5120 and it seems to
> run properly against that as well.
> 
> Question though, is there any reason you have a REAL_MAX_JOBS in fio.h
> and then a FIO_MAX_JOBS in os.h.  First glance at it just shows that the
> init.c code uses FIO_MAX_JOBS in for the thread check and then later on
> it uses REAL_MAX_JOBS for the job check except that max_jobs is set
> equal to FIO_MAX_JOBS.  It may be that the answer to my question is the
> os-mac.h file which means you have a smaller thread count ... maybe then
> the result is just a small adjustment in the error print to show you
> have exceeded the max # of jobs and or max # of threads.

OSX has a seriously small max segment by default, hence the split and
smaller value there.

I should probably make the thread_data array be segmented, so that fio
could support an arbitrary number of jobs regardless of the max shm
segment size. So far it hasn't been a huge problem.

-- 
Jens Axboe



* RE: segfault running fio against 2048 jobs
  2012-04-20  6:40         ` Jens Axboe
@ 2012-04-20 14:21           ` Roger Sibert
  2012-04-20 14:27             ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Roger Sibert @ 2012-04-20 14:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

I was thinking along the lines of adding a command, job_size_allowed =
default, (l)arge, (xl)arge, (j)umbo, using those where you do the
FIO_MAX_JOBS or REAL_MAX_JOBS check.

Default=1
Large=1.5
XLarge=2
Jumbo=3

char output[(REAL_MAX_JOBS*job_size_allowed) + 512], *p = output;

Sorry if the code is off/wrong, I have spent the last 5 days doing
nothing but perl and bash scripting along with a twist of SQL so my
brain is mush :P

I used to test RAID code for a living so I wasn't about to start digging
since I don't know the code well enough which means that if I push in on
one side something else will more than likely pop out on the other.

Looking at what you're describing vs. what I was thinking, it sounds like
your approach of setting it up to allow for a more dynamic range would
be more elegant and would serve better in the long run.

Thanks,
Roger

-----Original Message-----
From: Jens Axboe [mailto:axboe@kernel.dk] 
Sent: Friday, April 20, 2012 2:41 AM
To: Roger Sibert
Cc: fio@vger.kernel.org
Subject: Re: segfault runninng fio against 2048 jobs

On 04/18/2012 09:46 PM, Roger Sibert wrote:
> I verified the patch in fio-2.0.7-11-g7907 and that does indeed look to
> take care of the issue.  (Many thanks for that)
> 
> Also in follow up I changed the max job limit to 5120 and it seems to
> run properly against that as well.
> 
> Question though, is there any reason you have a REAL_MAX_JOBS in fio.h
> and then a FIO_MAX_JOBS in os.h.  First glance at it just shows that the
> init.c code uses FIO_MAX_JOBS for the thread check and then later on
> it uses REAL_MAX_JOBS for the job check except that max_jobs is set
> equal to FIO_MAX_JOBS.  It may be that the answer to my question is the
> os-mac.h file which means you have a smaller thread count ... maybe then
> the result is just a small adjustment in the error print to show you
> have exceeded the max # of jobs and/or max # of threads.

OSX has a seriously small max segment by default, hence the split and
smaller value there.

I should probably make the thread_data array be segmented, so that fio
could support an arbitrary number of jobs regardless of the max shm
segment size. So far it hasn't been a huge problem.

-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: segfault runninng fio against 2048 jobs
  2012-04-20 14:21           ` Roger Sibert
@ 2012-04-20 14:27             ` Jens Axboe
  2012-04-20 16:22               ` Steven Lang
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2012-04-20 14:27 UTC (permalink / raw)
  To: Roger Sibert; +Cc: fio

On 04/20/2012 04:21 PM, Roger Sibert wrote:
> I was thinking along the lines of adding a command, job_size_allowed =
> default, (l)arge, (xl)arge, (j)umbo, using those where you do the
> FIO_MAX_JOBS or REAL_MAX_JOBS check.
> 
> Default=1
> Large=1.5
> XLarge=2
> Jumbo=3
> 
> char output[(REAL_MAX_JOBS*job_size_allowed) + 512], *p = output;
> 
> Sorry if the code is off/wrong, I have spent the last 5 days doing
> nothing but perl and bash scripting along with a twist of SQL so my
> brain is mush :P
> 
> I used to test RAID code for a living so I wasn't about to start digging
> since I don't know the code well enough which means that if I push in on
> one side something else will more than likely pop out on the other.
> 
> Looking at what you're describing vs. what I was thinking, it sounds like
> your approach of setting it up to allow for a more dynamic range would
> be more elegant and would serve better in the long run.

Yes, the point of doing segmented thread_data arrays would be to get rid
of any fio-imposed constraint on the number of jobs that could be
supported, and to do so without requiring tweaking of the shm segment
size on the OS in question.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: segfault runninng fio against 2048 jobs
  2012-04-20 14:27             ` Jens Axboe
@ 2012-04-20 16:22               ` Steven Lang
  2012-04-20 17:22                 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Steven Lang @ 2012-04-20 16:22 UTC (permalink / raw)
  To: fio

It seems like a lot of what is in the thread_data structure does not
need to be in shared memory; the configuration information is static
(and in fact some of it is just pointers into process memory) and much
of it is just used for the running job, such as anything referencing
files or io_u.  If, instead of the whole structure, just the necessarily
shared parts were put in the shared segment, even OSes with limited
shared segment sizes could make better use of shared memory and run
more jobs.

Not to mention that any job which runs in a thread rather than a
process doesn't need to be in shared memory at all.

On Fri, Apr 20, 2012 at 7:27 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 04/20/2012 04:21 PM, Roger Sibert wrote:
>> I was thinking along the lines of adding a command, job_size_allowed =
>> default, (l)arge, (xl)arge, (j)umbo, using those where you do the
>> FIO_MAX_JOBS or REAL_MAX_JOBS check.
>>
>> Default=1
>> Large=1.5
>> XLarge=2
>> Jumbo=3
>>
>> char output[(REAL_MAX_JOBS*job_size_allowed) + 512], *p = output;
>>
>> Sorry if the code is off/wrong, I have spent the last 5 days doing
>> nothing but perl and bash scripting along with a twist of SQL so my
>> brain is mush :P
>>
>> I used to test RAID code for a living so I wasn't about to start digging
>> since I don't know the code well enough which means that if I push in on
>> one side something else will more than likely pop out on the other.
>>
>> Looking at what you're describing vs. what I was thinking, it sounds like
>> your approach of setting it up to allow for a more dynamic range would
>> be more elegant and would serve better in the long run.
>
> Yes, the point of doing segmented thread_data arrays would be to get rid
> of any fio-imposed constraint on the number of jobs that could be
> supported, and to do so without requiring tweaking of the shm segment
> size on the OS in question.
>
> --
> Jens Axboe
>
> --
> To unsubscribe from this list: send the line "unsubscribe fio" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: segfault runninng fio against 2048 jobs
  2012-04-20 16:22               ` Steven Lang
@ 2012-04-20 17:22                 ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2012-04-20 17:22 UTC (permalink / raw)
  To: Steven Lang; +Cc: fio

On 2012-04-20 18:22, Steven Lang wrote:
> It seems like a lot of what is in the thread_data structure does not
> need to be in shared memory; the configuration information is static
> (and in fact some of it is just pointers into process memory) and much
> of it is just used for the running job, such as anything referencing
> files or io_u.  If, instead of the whole structure, just the necessarily
> shared parts were put in the shared segment, even OSes with limited
> shared segment sizes could make better use of shared memory and run
> more jobs.
> 
> Not to mention that any job which runs in a thread rather than a
> process doesn't need to be in shared memory at all.

That is completely true, but that would require a much more invasive
change. Given that fio isn't _that_ heavy on the shm side (14KB per
process), my lazy side just thought it would be easier to have a few
segments for the unlikely cases where somebody did want to run more
than 2000 processes.

The options are around ~13% of the thread_data, so while moving just
that would be a bit easier (and mechanical), it would not be worth it
alone.

And yes, it's not needed for threads. The threads don't attach to it as
it is, so if you only run threads, it need not even be set up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-04-20 17:22 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17 21:05 segfault runninng fio against 2048 jobs Roger Sibert
2012-04-18  7:23 ` Jens Axboe
2012-04-18  9:02   ` Roger Sibert
2012-04-18 17:27   ` Roger Sibert
2012-04-18 18:16     ` Roger Sibert
2012-04-18 18:42       ` Jens Axboe
2012-04-18 18:39     ` Jens Axboe
2012-04-18 19:46       ` Roger Sibert
2012-04-20  6:40         ` Jens Axboe
2012-04-20 14:21           ` Roger Sibert
2012-04-20 14:27             ` Jens Axboe
2012-04-20 16:22               ` Steven Lang
2012-04-20 17:22                 ` Jens Axboe
