* [PATCH 0/4] FIO libnuma integration and 3 patches
@ 2012-10-20  3:11 Yufei Ren
  2012-10-20  3:11 ` [PATCH 1/4] cpuio engine cpuload bug fix Yufei Ren
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Yufei Ren @ 2012-10-20  3:11 UTC (permalink / raw)
  To: fio; +Cc: Yufei Ren

From: Yufei Ren <renyufei83@gmail.com>

Three patches and a NUMA integration are presented as follows.

Yufei Ren (4):
  cpuio engine cpuload bug fix
  thread cpu resource statistics bug fix
  rdma ioengine improvement
  Fine-grained job level numa control

 HOWTO          |   18 +++++++
 README         |   14 ++++--
 backend.c      |   45 ++++++++++++++++++-
 engines/cpu.c  |    5 ++
 engines/rdma.c |   28 ++++++++---
 examples/cpuio |    8 +++
 examples/numa  |   21 +++++++++
 fio.1          |   22 +++++++++
 fio.h          |   18 +++++++
 options.c      |  138 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 os/os-linux.h  |    5 ++
 stat.c         |    4 ++
 12 files changed, 313 insertions(+), 13 deletions(-)
 create mode 100644 examples/cpuio
 create mode 100644 examples/numa

-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] cpuio engine cpuload bug fix
  2012-10-20  3:11 [PATCH 0/4] FIO libnuma integration and 3 patches Yufei Ren
@ 2012-10-20  3:11 ` Yufei Ren
  2012-10-22  8:03   ` Jens Axboe
  2012-10-20  3:11 ` [PATCH 2/4] thread cpu resource statistics " Yufei Ren
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Yufei Ren @ 2012-10-20  3:11 UTC (permalink / raw)
  To: fio; +Cc: Yufei Ren

From: Yufei Ren <renyufei83@gmail.com>

The current cpu ioengine always burns 100 percent of CPU cycles
no matter what the cpuload value is. Since no data is transferred
with cpuio, bytes_done would be zero. Consequently, think_time
is omitted and the loop keeps running.

A cpuio example is added as well.
---
 engines/cpu.c  |    5 +++++
 examples/cpuio |    8 ++++++++
 2 files changed, 13 insertions(+), 0 deletions(-)
 create mode 100644 examples/cpuio

diff --git a/engines/cpu.c b/engines/cpu.c
index 8bc9fd5..322dfde 100644
--- a/engines/cpu.c
+++ b/engines/cpu.c
@@ -10,6 +10,11 @@
 static int fio_cpuio_queue(struct thread_data *td, struct io_u fio_unused *io_u)
 {
 	usec_spin(td->o.cpucycle);
+
+	if (io_u->buflen == 0)
+		io_u->buflen = 1;
+	io_u->resid = 0;
+
 	return FIO_Q_COMPLETED;
 }
 
diff --git a/examples/cpuio b/examples/cpuio
new file mode 100644
index 0000000..577e072
--- /dev/null
+++ b/examples/cpuio
@@ -0,0 +1,8 @@
+[global]
+ioengine=cpuio
+time_based
+runtime=10
+
+[burn50percent]
+cpuload=50
+
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/4] thread cpu resource statistics bug fix
  2012-10-20  3:11 [PATCH 0/4] FIO libnuma integration and 3 patches Yufei Ren
  2012-10-20  3:11 ` [PATCH 1/4] cpuio engine cpuload bug fix Yufei Ren
@ 2012-10-20  3:11 ` Yufei Ren
  2012-10-22  8:04   ` Jens Axboe
  2012-10-20  3:11 ` [PATCH 3/4] rdma ioengine improvement Yufei Ren
  2012-10-20  3:11 ` [PATCH 4/4] Fine-grained job level numa control Yufei Ren
  3 siblings, 1 reply; 12+ messages in thread
From: Yufei Ren @ 2012-10-20  3:11 UTC (permalink / raw)
  To: fio; +Cc: Yufei Ren

From: Yufei Ren <renyufei83@gmail.com>

If the `thread' option is enabled, resource usage should be thread
based instead of process based. For the following job,

fio --ioengine=cpuio --cpuload=50 --time_based --runtime=10 --name=j0 --numjobs=4 --thread

before patch, each thread CPU statistics:
...
  cpu          : usr=199.67%, sys=0.14%, ctx=1475, majf=0, minf=24
...

after patch:
...
  cpu          : usr=49.80%, sys=0.00%, ctx=79, majf=0, minf=18446744073709538943
...
---
 os/os-linux.h |    5 +++++
 stat.c        |    4 ++++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/os/os-linux.h b/os/os-linux.h
index 9b7ff29..2b35f34 100644
--- a/os/os-linux.h
+++ b/os/os-linux.h
@@ -14,6 +14,7 @@
 #include <linux/unistd.h>
 #include <linux/raw.h>
 #include <linux/major.h>
+#include <linux/version.h>
 #include <endian.h>
 
 #include "indirect.h"
@@ -62,6 +63,10 @@
 #define FIO_HAVE_FALLOC_ENG
 #endif
 
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26)
+#define FIO_HAVE_RUSAGE_THREAD
+#endif
+
 #ifdef SYNC_FILE_RANGE_WAIT_BEFORE
 #define FIO_HAVE_SYNC_FILE_RANGE
 #endif
diff --git a/stat.c b/stat.c
index d041ef3..af6e1f2 100644
--- a/stat.c
+++ b/stat.c
@@ -16,7 +16,11 @@ void update_rusage_stat(struct thread_data *td)
 {
 	struct thread_stat *ts = &td->ts;
 
+#ifdef FIO_HAVE_RUSAGE_THREAD
+	getrusage(RUSAGE_THREAD, &td->ru_end);
+#else
 	getrusage(RUSAGE_SELF, &td->ru_end);
+#endif
 
 	ts->usr_time += mtime_since(&td->ru_start.ru_utime,
 					&td->ru_end.ru_utime);
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/4] rdma ioengine improvement
  2012-10-20  3:11 [PATCH 0/4] FIO libnuma integration and 3 patches Yufei Ren
  2012-10-20  3:11 ` [PATCH 1/4] cpuio engine cpuload bug fix Yufei Ren
  2012-10-20  3:11 ` [PATCH 2/4] thread cpu resource statistics " Yufei Ren
@ 2012-10-20  3:11 ` Yufei Ren
  2012-10-22  8:06   ` Jens Axboe
  2012-10-20  3:11 ` [PATCH 4/4] Fine-grained job level numa control Yufei Ren
  3 siblings, 1 reply; 12+ messages in thread
From: Yufei Ren @ 2012-10-20  3:11 UTC (permalink / raw)
  To: fio; +Cc: Yufei Ren

From: Yufei Ren <renyufei83@gmail.com>

---
 backend.c      |    2 +-
 engines/rdma.c |   28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/backend.c b/backend.c
index 4e3a3ed..85ec196 100644
--- a/backend.c
+++ b/backend.c
@@ -591,7 +591,7 @@ static void do_io(struct thread_data *td)
 		int ret2, full;
 		enum fio_ddir ddir;
 
-		if (td->terminate)
+		if (td->terminate || td->done)
 			break;
 
 		update_tv_cache(td);
diff --git a/engines/rdma.c b/engines/rdma.c
index 79d72d2..9a51e4f 100644
--- a/engines/rdma.c
+++ b/engines/rdma.c
@@ -7,8 +7,8 @@
  *
  * This I/O engine is disabled by default. To enable it, execute:
  *
- * $ export EXTFLAGS="-DFIO_HAVE_RDMA"
- * $ export EXTLIBS="-libverbs -lrdmacm"
+ * $ export EXTFLAGS+=" -DFIO_HAVE_RDMA "
+ * $ export EXTLIBS+=" -libverbs -lrdmacm "
  *
  * before running make. You will need the Linux RDMA software as well, either
  * from your Linux distributor or directly from openfabrics.org:
@@ -41,7 +41,7 @@
 #include <rdma/rdma_cma.h>
 #include <infiniband/arch.h>
 
-#define FIO_RDMA_MAX_IO_DEPTH    128
+#define FIO_RDMA_MAX_IO_DEPTH    512
 
 enum rdma_io_mode {
 	FIO_RDMA_UNKNOWN = 0,
@@ -110,6 +110,8 @@ struct rdmaio_data {
 	int io_u_completed_nr;
 };
 
+static unsigned int Junk;
+
 static int client_recv(struct thread_data *td, struct ibv_wc *wc)
 {
 	struct rdmaio_data *rd = td->io_ops->data;
@@ -602,7 +604,7 @@ static int fio_rdmaio_send(struct thread_data *td, struct io_u **io_us,
 		case FIO_RDMA_MEM_WRITE:
 			/* compose work request */
 			r_io_u_d = io_us[i]->engine_data;
-			index = rand() % rd->rmt_nr;
+			index = rand_r(&Junk) % rd->rmt_nr;
 			r_io_u_d->sq_wr.opcode = IBV_WR_RDMA_WRITE;
 			r_io_u_d->sq_wr.wr.rdma.rkey = rd->rmt_us[index].rkey;
 			r_io_u_d->sq_wr.wr.rdma.remote_addr = \
@@ -612,7 +614,7 @@ static int fio_rdmaio_send(struct thread_data *td, struct io_u **io_us,
 		case FIO_RDMA_MEM_READ:
 			/* compose work request */
 			r_io_u_d = io_us[i]->engine_data;
-			index = rand() % rd->rmt_nr;
+			index = rand_r(&Junk) % rd->rmt_nr;
 			r_io_u_d->sq_wr.opcode = IBV_WR_RDMA_READ;
 			r_io_u_d->sq_wr.wr.rdma.rkey = rd->rmt_us[index].rkey;
 			r_io_u_d->sq_wr.wr.rdma.remote_addr = \
@@ -790,6 +792,13 @@ static int fio_rdmaio_connect(struct thread_data *td, struct fio_file *f)
 	/* wait for remote MR info from server side */
 	rdma_poll_wait(td, IBV_WC_RECV);
 
+	/* In SEND/RECV test, iodepth in RECV side is deeper
+	 * in SEND side. RECV needs more time to construct the
+	 * buffer blocks, so the server side may need to stop
+	 * some time before transfer data.
+	 */
+	usleep(500000);
+
 	return 0;
 }
 
@@ -872,8 +881,8 @@ static int fio_rdmaio_close_file(struct thread_data *td, struct fio_file *f)
         return 1;
     }*/
 
-	ibv_destroy_qp(rd->qp);
 	ibv_destroy_cq(rd->cq);
+	ibv_destroy_qp(rd->qp);
 
 	if (rd->is_client == 1)
 		rdma_destroy_id(rd->cm_id);
@@ -1150,6 +1159,9 @@ static int fio_rdmaio_init(struct thread_data *td)
 		i++;
 	}
 
+	Junk = getpid();
+	rand_r(&Junk);
+
 	rd->send_buf.nr = htonl(i);
 
 	return ret;
@@ -1229,8 +1241,8 @@ static int fio_rdmaio_init(struct thread_data fio_unused * td)
 	log_err("     make sure OFED is installed,\n");
 	log_err("     $ ofed_info\n");
 	log_err("     then try to make fio as follows:\n");
-	log_err("     $ export EXTFLAGS=\"-DFIO_HAVE_RDMA\"\n");
-	log_err("     $ export EXTLIBS=\"-libverbs -lrdmacm\"\n");
+	log_err("     $ export EXTFLAGS+=\" -DFIO_HAVE_RDMA \"\n");
+	log_err("     $ export EXTLIBS+=\" -libverbs -lrdmacm \"\n");
 	log_err("     $ make clean && make\n");
 	return 1;
 }
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/4] Fine-grained job level numa control
  2012-10-20  3:11 [PATCH 0/4] FIO libnuma integration and 3 patches Yufei Ren
                   ` (2 preceding siblings ...)
  2012-10-20  3:11 ` [PATCH 3/4] rdma ioengine improvement Yufei Ren
@ 2012-10-20  3:11 ` Yufei Ren
  2012-10-22  8:07   ` Jens Axboe
  3 siblings, 1 reply; 12+ messages in thread
From: Yufei Ren @ 2012-10-20  3:11 UTC (permalink / raw)
  To: fio; +Cc: Yufei Ren

From: Yufei Ren <renyufei83@gmail.com>

Two new options, numa_cpu_nodes and numa_mem_policy, are created
for fine-grained job level numa control. Please refer to HOWTO and
README for a detailed description.
An example job, examples/numa, is added as well.

---
 HOWTO         |   18 +++++++
 README        |   14 ++++--
 backend.c     |   43 ++++++++++++++++++
 examples/numa |   21 +++++++++
 fio.1         |   22 +++++++++
 fio.h         |   18 +++++++
 options.c     |  138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 270 insertions(+), 4 deletions(-)
 create mode 100644 examples/numa

diff --git a/HOWTO b/HOWTO
index ee9680a..8eda99d 100644
--- a/HOWTO
+++ b/HOWTO
@@ -799,6 +799,24 @@ cpus_allowed=str Controls the same options as cpumask, but it allows a text
 		allows a range of CPUs. Say you wanted a binding to CPUs
 		1, 5, and 8-15, you would set cpus_allowed=1,5,8-15.
 
+numa_cpu_nodes=str Set this job running on the specified NUMA nodes'
+		CPUs. The argument allows a comma delimited list of cpu
+		numbers, A-B ranges, or 'all'. Note, to enable numa options
+		support, export the following environment variables,
+			export EXTFLAGS+=" -DFIO_HAVE_LIBNUMA "
+			export EXTLIBS+=" -lnuma "
+
+numa_mem_policy=str Set this job's memory policy and corresponding NUMA
+		nodes. Format of the argument:
+			<mode>[:<nodelist>]
+		`mode' is one of the following memory policies:
+			default, prefer, bind, interleave, local
+		For the `default' and `local' memory policies, no node
+		needs to be specified.
+		For `prefer', only one node is allowed.
+		For `bind' and `interleave', the nodelist allows a comma
+		delimited list of numbers, A-B ranges, or 'all'.
+
 startdelay=time	Start this job the specified number of seconds after fio
 		has started. Only useful if the job file contains several
 		jobs, and you want to delay starting some jobs to a certain
diff --git a/README b/README
index 535b077..ceac385 100644
--- a/README
+++ b/README
@@ -233,10 +233,11 @@ The job file parameters are:
 			readv/writev (with queuing emulation) mmap for mmap'ed
 			io, syslet-rw for syslet driven read/write, splice for
 			using splice/vmsplice, sg for direct SG_IO io, net
-			for network io, or cpuio for a cycler burner load. sg
-			only works on Linux on SCSI (or SCSI-like devices, such
-			as usb-storage or sata/libata driven) devices. Fio also
-			has a null io engine, which is mainly used for testing
+			for network io, rdma for RDMA io, or cpuio for a
+			cycler burner load. sg only works on Linux on
+			SCSI (or SCSI-like devices, such as usb-storage or
+			sata/libata driven) devices. Fio also has a null
+			io engine, which is mainly used for testing
 			fio itself.
 
 	iodepth=x	For async io, allow 'x' ios in flight
@@ -255,6 +256,11 @@ The job file parameters are:
 	ratecycle=x	ratemin averaged over x msecs
 	cpumask=x	Only allow job to run on CPUs defined by mask.
 	cpus_allowed=x	Like 'cpumask', but allow text setting of CPU affinity.
	numa_cpu_nodes=x,y-z  Allow job to run on specified NUMA nodes' CPUs.
	numa_mem_policy=m:x,y-z  Setup numa memory allocation policy.
			'm' stands for policy, such as default, prefer,
			bind, interleave, local. 'x, y-z' are numa node(s) for
			memory allocation according to policy.
 	fsync=x		If writing with buffered IO, fsync after every
 			'x' blocks have been written.
 	end_fsync=x	If 'x', run fsync() after end-of-job.
diff --git a/backend.c b/backend.c
index 85ec196..157fd52 100644
--- a/backend.c
+++ b/backend.c
@@ -1052,6 +1052,49 @@ static void *thread_main(void *data)
 		goto err;
 	}
 
+#ifdef FIO_HAVE_LIBNUMA
+	/* numa node setup */
+	if (td->o.numa_cpumask_set || td->o.numa_memmask_set) {
+		int ret;
+
+		if (numa_available() < 0) {
+			td_verror(td, errno, "Does not support NUMA API\n");
+			goto err;
+		}
+
+		if (td->o.numa_cpumask_set) {
+			ret = numa_run_on_node_mask(td->o.numa_cpunodesmask);
+			if (ret == -1) {
+				td_verror(td, errno, \
+					"numa_run_on_node_mask failed\n");
+				goto err;
+			}
+		}
+
+		if (td->o.numa_memmask_set) {
+
+			switch (td->o.numa_mem_mode) {
+			case MPOL_INTERLEAVE:
+				numa_set_interleave_mask(td->o.numa_memnodesmask);
+				break;
+			case MPOL_BIND:
+				numa_set_membind(td->o.numa_memnodesmask);
+				break;
+			case MPOL_LOCAL:
+				numa_set_localalloc();
+				break;
+			case MPOL_PREFERRED:
+				numa_set_preferred(td->o.numa_mem_prefer_node);
+				break;
+			case MPOL_DEFAULT:
+			default:
+				break;
+			}
+
+		}
+	}
+#endif
+
 	/*
 	 * May alter parameters that init_io_u() will use, so we need to
 	 * do this first.
diff --git a/examples/numa b/examples/numa
new file mode 100644
index 0000000..b81964f
--- /dev/null
+++ b/examples/numa
@@ -0,0 +1,21 @@
+; setup numa policy for each thread
+; 'numactl --show' to determine the maximum numa nodes
+[global]
+ioengine=libaio
+buffered=0
+rw=randread
+bs=512K
+iodepth=16
+size=512m
+filename=/dev/sdb1
+
+; Fix memory blocks (512K * 16) in numa node 0
+[job1]
+numa_cpu_nodes=0
+numa_mem_policy=bind:0
+
+; Interleave memory blocks (512K * 16) in numa node 0 and 1
+[job2]
+numa_cpu_nodes=0-1
+numa_mem_policy=interleave:0-1
+
diff --git a/fio.1 b/fio.1
index fad0ae4..bf65551 100644
--- a/fio.1
+++ b/fio.1
@@ -646,6 +646,28 @@ may run on.  See \fBsched_setaffinity\fR\|(2).
 .BI cpus_allowed \fR=\fPstr
 Same as \fBcpumask\fR, but allows a comma-delimited list of CPU numbers.
 .TP
+.BI numa_cpu_nodes \fR=\fPstr
+Set this job running on the specified NUMA nodes' CPUs. The argument
+allows a comma delimited list of cpu numbers, A-B ranges, or 'all'.
+.TP
+.BI numa_mem_policy \fR=\fPstr
+Set this job's memory policy and corresponding NUMA nodes. Format of
+the argument:
+.RS
+.TP
+.B <mode>[:<nodelist>]
+.TP
+.B mode
+is one of the following memory policies:
+.TP
+.B default, prefer, bind, interleave, local
+.TP
+.RE
+For the \fBdefault\fR and \fBlocal\fR memory policies, no \fBnodelist\fR
+needs to be specified. For \fBprefer\fR, only one node is
+allowed. For \fBbind\fR and \fBinterleave\fR, \fBnodelist\fR allows a
+comma delimited list of numbers, A-B ranges, or 'all'.
+.TP
 .BI startdelay \fR=\fPint
 Delay start of job for the specified number of seconds.
 .TP
diff --git a/fio.h b/fio.h
index 8bb5b03..03e4da1 100644
--- a/fio.h
+++ b/fio.h
@@ -48,6 +48,16 @@ struct thread_data;
 #include <sys/asynch.h>
 #endif
 
+#ifdef FIO_HAVE_LIBNUMA
+#include <linux/mempolicy.h>
+#include <numa.h>
+
+/*
+ * "local" is pseudo-policy
+ */
+#define MPOL_LOCAL MPOL_MAX
+#endif
+
 /*
  * What type of allocation to use for io buffers
  */
@@ -195,6 +205,14 @@ struct thread_options {
 	unsigned int cpumask_set;
 	os_cpu_mask_t verify_cpumask;
 	unsigned int verify_cpumask_set;
+#ifdef FIO_HAVE_LIBNUMA
+	struct bitmask *numa_cpunodesmask;
+	unsigned int numa_cpumask_set;
+	unsigned short numa_mem_mode;
+	unsigned int numa_mem_prefer_node;
+	struct bitmask *numa_memnodesmask;
+	unsigned int numa_memmask_set;
+#endif
 	unsigned int iolog;
 	unsigned int rwmixcycle;
 	unsigned int rwmix[2];
diff --git a/options.c b/options.c
index 84101d1..e0b7fec 100644
--- a/options.c
+++ b/options.c
@@ -564,6 +564,130 @@ static int str_verify_cpus_allowed_cb(void *data, const char *input)
 }
 #endif
 
+#ifdef FIO_HAVE_LIBNUMA
+static int str_numa_cpunodes_cb(void *data, char *input)
+{
+	struct thread_data *td = data;
+
+	/* numa_parse_nodestring() parses a character string list
+	 * of nodes into a bit mask. The bit mask is allocated by
+	 * numa_allocate_nodemask(), so it should be freed by
+	 * numa_free_nodemask().
+	 */
+	td->o.numa_cpunodesmask = numa_parse_nodestring(input);
+	if (td->o.numa_cpunodesmask == NULL) {
+		log_err("fio: numa_parse_nodestring failed\n");
+		td_verror(td, 1, "str_numa_cpunodes_cb");
+		return 1;
+	}
+
+	td->o.numa_cpumask_set = 1;
+	return 0;
+}
+
+static int str_numa_mpol_cb(void *data, char *input)
+{
+	struct thread_data *td = data;
+	const char * const policy_types[] =
+		{ "default", "prefer", "bind", "interleave", "local" };
+	int i;
+
+	char *nodelist = strchr(input, ':');
+	if (nodelist) {
+		/* NUL-terminate mode */
+		*nodelist++ = '\0';
+	}
+
+	for (i = 0; i <= MPOL_LOCAL; i++) {
+		if (!strcmp(input, policy_types[i])) {
+			td->o.numa_mem_mode = i;
+			break;
+		}
+	}
+	if (i > MPOL_LOCAL) {
+		log_err("fio: memory policy should be: default, prefer, bind, interleave, local\n");
+		goto out;
+	}
+
+	switch (td->o.numa_mem_mode) {
+	case MPOL_PREFERRED:
+		/*
+		 * Insist on a nodelist of one node only
+		 */
+		if (nodelist) {
+			char *rest = nodelist;
+			while (isdigit(*rest))
+				rest++;
+			if (*rest) {
+				log_err("fio: one node only for \'prefer\'\n");
+				goto out;
+			}
+		} else {
+			log_err("fio: one node is needed for \'prefer\'\n");
+			goto out;
+		}
+		break;
+	case MPOL_INTERLEAVE:
+		/*
+		 * Default to online nodes with memory if no nodelist
+		 */
+		if (!nodelist)
+			nodelist = strdup("all");
+		break;
+	case MPOL_LOCAL:
+	case MPOL_DEFAULT:
+		/*
+		 * Don't allow a nodelist
+		 */
+		if (nodelist) {
+			log_err("fio: NO nodelist for \'local\'\n");
+			goto out;
+		}
+		break;
+	case MPOL_BIND:
+		/*
+		 * Insist on a nodelist
+		 */
+		if (!nodelist) {
+			log_err("fio: a nodelist is needed for \'bind\'\n");
+			goto out;
+		}
+		break;
+	}
+
+
+	/* numa_parse_nodestring() parses a character string list
+	 * of nodes into a bit mask. The bit mask is allocated by
+	 * numa_allocate_nodemask(), so it should be freed by
+	 * numa_free_nodemask().
+	 */
+	switch (td->o.numa_mem_mode) {
+	case MPOL_PREFERRED:
+		td->o.numa_mem_prefer_node = atoi(nodelist);
+		break;
+	case MPOL_INTERLEAVE:
+	case MPOL_BIND:
+		td->o.numa_memnodesmask = numa_parse_nodestring(nodelist);
+		if (td->o.numa_memnodesmask == NULL) {
+			log_err("fio: numa_parse_nodestring failed\n");
+			td_verror(td, 1, "str_numa_memnodes_cb");
+			return 1;
+		}
+		break;
+	case MPOL_LOCAL:
+	case MPOL_DEFAULT:
+	default:
+		break;
+	}
+
+	td->o.numa_memmask_set = 1;
+	return 0;
+
+out:
+	return 1;
+}
+#endif
+
 #ifdef FIO_HAVE_TRIM
 static int str_verify_trim_cb(void *data, unsigned long long *val)
 {
@@ -2069,6 +2193,20 @@ static struct fio_option options[FIO_MAX_OPTS] = {
 		.help	= "Set CPUs allowed",
 	},
 #endif
+#ifdef FIO_HAVE_LIBNUMA
+	{
+		.name	= "numa_cpu_nodes",
+		.type	= FIO_OPT_STR,
+		.cb	= str_numa_cpunodes_cb,
+		.help	= "NUMA CPU nodes bind",
+	},
+	{
+		.name	= "numa_mem_policy",
+		.type	= FIO_OPT_STR,
+		.cb	= str_numa_mpol_cb,
+		.help	= "NUMA memory policy setup",
+	},
+#endif
 	{
 		.name	= "end_fsync",
 		.type	= FIO_OPT_BOOL,
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] cpuio engine cpuload bug fix
  2012-10-20  3:11 ` [PATCH 1/4] cpuio engine cpuload bug fix Yufei Ren
@ 2012-10-22  8:03   ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2012-10-22  8:03 UTC (permalink / raw)
  To: Yufei Ren; +Cc: fio, Yufei Ren

On 2012-10-20 05:11, Yufei Ren wrote:
> From: Yufei Ren <renyufei83@gmail.com>
> 
> The current cpu ioengine always burns 100 percent of CPU cycles
> no matter what the cpuload value is. Since no data is transferred
> with cpuio, bytes_done would be zero. Consequently, think_time
> is omitted and the loop keeps running.
> 
> A cpuio example is added as well.

Applied, thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] thread cpu resource statistics bug fix
  2012-10-20  3:11 ` [PATCH 2/4] thread cpu resource statistics " Yufei Ren
@ 2012-10-22  8:04   ` Jens Axboe
  2012-10-22 16:33     ` Yufei Ren
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2012-10-22  8:04 UTC (permalink / raw)
  To: Yufei Ren; +Cc: fio, Yufei Ren

On 2012-10-20 05:11, Yufei Ren wrote:
> From: Yufei Ren <renyufei83@gmail.com>
> 
> If the `thread' option is enabled, resource usage should be thread
> based instead of process based. For the following job,
> 
> fio --ioengine=cpuio --cpuload=50 --time_based --runtime=10 --name=j0 --numjobs=4 --thread
> 
> before patch, each thread CPU statistics:
> ...
>   cpu          : usr=199.67%, sys=0.14%, ctx=1475, majf=0, minf=24
> ...
> 
> after patch:
> ...
>   cpu          : usr=49.80%, sys=0.00%, ctx=79, majf=0, minf=18446744073709538943
> ...
> ---
>  os/os-linux.h |    5 +++++
>  stat.c        |    4 ++++
>  2 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/os/os-linux.h b/os/os-linux.h
> index 9b7ff29..2b35f34 100644
> --- a/os/os-linux.h
> +++ b/os/os-linux.h
> @@ -14,6 +14,7 @@
>  #include <linux/unistd.h>
>  #include <linux/raw.h>
>  #include <linux/major.h>
> +#include <linux/version.h>
>  #include <endian.h>
>  
>  #include "indirect.h"
> @@ -62,6 +63,10 @@
>  #define FIO_HAVE_FALLOC_ENG
>  #endif
>  
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26)
> +#define FIO_HAVE_RUSAGE_THREAD
> +#endif

I applied this, but I wonder if we should not just make this dependent
on

#ifdef RUSAGE_THREAD
...
#endif

instead?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] rdma ioengine improvement
  2012-10-20  3:11 ` [PATCH 3/4] rdma ioengine improvement Yufei Ren
@ 2012-10-22  8:06   ` Jens Axboe
  2012-10-22 18:39     ` Yufei Ren
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2012-10-22  8:06 UTC (permalink / raw)
  To: Yufei Ren; +Cc: fio, Yufei Ren

On 2012-10-20 05:11, Yufei Ren wrote:
> From: Yufei Ren <renyufei83@gmail.com>

This one needs a bit of explaining. What is the point of using a
re-entrant variant of rand(), if you are using a shared static variable
anyway?

Would it be better to use the fio shipped rand, now we're in there
anyway?

> @@ -790,6 +792,13 @@ static int fio_rdmaio_connect(struct thread_data *td, struct fio_file *f)
>  	/* wait for remote MR info from server side */
>  	rdma_poll_wait(td, IBV_WC_RECV);
>  
> +	/* In SEND/RECV test, iodepth in RECV side is deeper
> +	 * in SEND side. RECV needs more time to construct the
> +	 * buffer blocks, so the server side may need to stop
> +	 * some time before transfer data.
> +	 */
> +	usleep(500000);
> +
>  	return 0;

Hmm?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/4] Fine-grained job level numa control
  2012-10-20  3:11 ` [PATCH 4/4] Fine-grained job level numa control Yufei Ren
@ 2012-10-22  8:07   ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2012-10-22  8:07 UTC (permalink / raw)
  To: Yufei Ren; +Cc: fio, Yufei Ren

On 2012-10-20 05:11, Yufei Ren wrote:
> From: Yufei Ren <renyufei83@gmail.com>
> 
> Two new options, numa_cpu_nodes and numa_mem_policy, are created
> for fine-grained job level numa control. Please refer to HOWTO and
> README for a detailed description.
> An example job, examples/numa, is added as well.

Good addition, added! It provides separate control of the job and memory
allocation policies.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] thread cpu resource statistics bug fix
  2012-10-22  8:04   ` Jens Axboe
@ 2012-10-22 16:33     ` Yufei Ren
  2012-10-22 17:10       ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Yufei Ren @ 2012-10-22 16:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

Hi Jens,

>
> I applied this, but I wonder if we should not just make this dependent
> on
>
> #ifdef RUSAGE_THREAD
> ...
> #endif
>
> instead?
>

Yes, that would be clearer. Should I send you another patch instead of this one?

Thanks

Yufei

> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] thread cpu resource statistics bug fix
  2012-10-22 16:33     ` Yufei Ren
@ 2012-10-22 17:10       ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2012-10-22 17:10 UTC (permalink / raw)
  To: Yufei Ren; +Cc: fio

On 2012-10-22 18:33, Yufei Ren wrote:
> Hi Jens,
> 
>>
>> I applied this, but I wonder if we should not just make this dependent
>> on
>>
>> #ifdef RUSAGE_THREAD
>> ...
>> #endif
>>
>> instead?
>>
> 
> Yes, that would more clear. Should I send you another patch instead of this?

Since it's already applied, please send one off the current -git tree
(so incremental to the previous).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] rdma ioengine improvement
  2012-10-22  8:06   ` Jens Axboe
@ 2012-10-22 18:39     ` Yufei Ren
  0 siblings, 0 replies; 12+ messages in thread
From: Yufei Ren @ 2012-10-22 18:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: fio

On Mon, Oct 22, 2012 at 4:06 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2012-10-20 05:11, Yufei Ren wrote:
>> From: Yufei Ren <renyufei83@gmail.com>
>
> This one needs a bit of explaining. What is the point of using a
> re-entrant variant of rand(), if you are using a shared static variable
> anyway?
>

That's my bug :-)

> Would it be better to use the fio shipped rand, now we're in there
> anyway?
>

Thanks for reminding me of the fio shipped rand. Let me send you another patch.

>> @@ -790,6 +792,13 @@ static int fio_rdmaio_connect(struct thread_data *td, struct fio_file *f)
>>       /* wait for remote MR info from server side */
>>       rdma_poll_wait(td, IBV_WC_RECV);
>>
>> +     /* In SEND/RECV test, iodepth in RECV side is deeper
>> +      * in SEND side. RECV needs more time to construct the
>> +      * buffer blocks, so the server side may need to stop
                                                   ~~~~~~~~~ should be SEND side
>> +      * some time before transfer data.
>> +      */
>> +     usleep(500000);
>> +
>>       return 0;
>
> Hmm?

After the rdma connection (queue pair) is established, the data
source and the data sink commit data blocks into the send queue and
the receive queue, respectively. If the data source posts more io
requests than the receive queue can hold, a Receiver Not Ready (RNR)
error comes out. To avoid this, it's better that the data source
waits until the data sink has posted a sufficient number of recv
buffers into its recv queue.

Maybe a better way to address this problem is to add a `receiver
ready notification' from the sink to the source.

Anyway, I will send you another patch with clear comments.

>
> --
> Jens Axboe
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-10-22 18:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-20  3:11 [PATCH 0/4] FIO libnuma integration and 3 patches Yufei Ren
2012-10-20  3:11 ` [PATCH 1/4] cpuio engine cpuload bug fix Yufei Ren
2012-10-22  8:03   ` Jens Axboe
2012-10-20  3:11 ` [PATCH 2/4] thread cpu resource statistics " Yufei Ren
2012-10-22  8:04   ` Jens Axboe
2012-10-22 16:33     ` Yufei Ren
2012-10-22 17:10       ` Jens Axboe
2012-10-20  3:11 ` [PATCH 3/4] rdma ioengine improvement Yufei Ren
2012-10-22  8:06   ` Jens Axboe
2012-10-22 18:39     ` Yufei Ren
2012-10-20  3:11 ` [PATCH 4/4] Fine-grained job level numa control Yufei Ren
2012-10-22  8:07   ` Jens Axboe
