From: Ming Lei <ming.lei@redhat.com>
To: Brian King <brking@linux.vnet.ibm.com>
Cc: Ming Lei <tom.leiming@gmail.com>, Jens Axboe <axboe@kernel.dk>,
	linux-block <linux-block@vger.kernel.org>,
	"open list:DEVICE-MAPPER (LVM)" <dm-devel@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>,
	Alasdair Kergon <agk@redhat.com>
Subject: Re: [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
Date: Fri, 30 Jun 2017 09:08:01 +0800	[thread overview]
Message-ID: <20170630010759.GB8305@ming.t460p> (raw)
In-Reply-To: <dc7cf367-534d-d9fc-27b1-7cfd5edcb74c@linux.vnet.ibm.com>

On Thu, Jun 29, 2017 at 12:31:05PM -0500, Brian King wrote:
> On 06/29/2017 11:25 AM, Ming Lei wrote:
> > On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe <axboe@kernel.dk> wrote:
> >> On 06/29/2017 02:40 AM, Ming Lei wrote:
> >>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 06/28/2017 03:12 PM, Brian King wrote:
> >>>>> This patch converts the in_flight counter in struct hd_struct from a
> >>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
> >>>>> of atomics from the hot path. When running this on a Power system, to
> >>>>> a single null_blk device with 80 submission queues, irq mode 0, with
> >>>>> 80 fio jobs, I saw IOPS go from 1.5M IO/s to 11.4M IO/s.
> >>>>
> >>>> This has been done before, but I've never really liked it. The reason is
> >>>> that it means that reading the part stat inflight count now has to
> >>>> iterate over every possible CPU. Did you use partitions in your testing?
> >>>> How many CPUs were configured? When I last tested this a few years ago
> >>>> on even a quad core nehalem (which is notoriously shitty for cross-node
> >>>> latencies), it was a net loss.
> >>>
> >>> One year ago, I saw null_blk's IOPS drop to 10% of the
> >>> non-RQF_IO_STAT case on a dual-socket ARM64 (each CPU has
> >>> 96 cores, and dual NUMA nodes) too. The performance could
> >>> basically be recovered if a per-NUMA-node counter was introduced
> >>> and used in this case, but the patch was never posted.
> >>> If anyone is interested in that, I can rebase the patch on the current
> >>> block tree and post it. I guess the performance issue might be
> >>> related to the system's cache coherency implementation.
> >>> This issue on ARM64 can also be observed with the following userspace
> >>> atomic counting test:
> >>>
> >>>        http://kernel.ubuntu.com/~ming/test/cache/
> >>
> >> How well did the per-node thing work? Doesn't seem to me like it would
> > 
> > Last time, on ARM64, I remember that the IOPS was basically recovered,
> > but now I don't have such a machine to test on. Could Brian test the attached patch
> > to see if it works on a big Power machine?
> > 
> > And the idea is simple, just make the atomic counter per-node.
> 
> I tried loading the patch and got an oops right away on boot. Haven't been able to
> debug anything yet. This is on top of an older kernel, so I'm not sure if that is
> the issue or not. I can try upstream and see if I have different results...
> 
> Ubuntu 16.04 .  .  .  . Unable to handle kernel paging request for data at address 0xc00031313a333532
> Faulting instruction address: 0xc0000000002552c4
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=1024 NUMA 
> PowerNV
> Modules linked in: dm_round_robin vmx_crypto powernv_rng leds_powernv powernv_op_panel led_class rng_core dm_multipath autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath dm_mirror dm_region_hash dm_log cxlflash bnx2x mdio libcrc32c nvme nvme_core lpfc cxl crc_t10dif crct10dif_generic crct10dif_common
> CPU: 9 PID: 3485 Comm: multipathd Not tainted 4.9.10-dirty #7
> task: c000000fba0d0000 task.stack: c000000fb8a1c000
> NIP: c0000000002552c4 LR: c000000000255274 CTR: 0000000000000000
> REGS: c000000fb8a1f350 TRAP: 0300   Not tainted  (4.9.10-dirty)
> MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24028848  XER: 00000000
> CFAR: c000000000008a60 DAR: c00031313a333532 DSISR: 40000000 SOFTE: 1 
> GPR00: c0000000001f8980 c000000fb8a1f5d0 c000000000f24300 c000000fc4007c00 
> GPR04: 00000000024000c0 c0000000002fc0ac 000000000000025d c000000fc5d346e0 
> GPR08: 0000000fc50d6000 0000000000000000 c00031313a333532 c000000fbc836240 
> GPR12: 0000000000002200 c00000000ff02400 00003fff79cfebf0 0000000000000000 
> GPR16: 00000000c138fd03 000001003a12bf70 00003fff79ae75e0 00003fff79d269e8 
> GPR20: 0000000000000001 00003fff79cfebf0 000001003a393aa0 00003fff79d067b8 
> GPR24: 0000000000000000 c000000000b04948 000000000000a1ff c0000000002fc0ac 
> GPR28: 00000000024000c0 0000000000000007 c0000000002fc0ac c000000fc4007c00 
> NIP [c0000000002552c4] __kmalloc_track_caller+0x94/0x2f0
> LR [c000000000255274] __kmalloc_track_caller+0x44/0x2f0
> Call Trace:
> [c000000fb8a1f5d0] [c000000fb8a1f610] 0xc000000fb8a1f610 (unreliable)
> [c000000fb8a1f620] [c0000000001f8980] kstrdup+0x50/0xb0
> [c000000fb8a1f660] [c0000000002fc0ac] __kernfs_new_node+0x4c/0x140
> [c000000fb8a1f6b0] [c0000000002fd9f4] kernfs_new_node+0x34/0x80
> [c000000fb8a1f6e0] [c000000000300708] kernfs_create_link+0x38/0xf0
> [c000000fb8a1f720] [c000000000301cb8] sysfs_do_create_link_sd.isra.0+0xa8/0x160
> [c000000fb8a1f770] [c00000000054a658] device_add+0x2b8/0x740
> [c000000fb8a1f830] [c00000000054ae58] device_create_groups_vargs+0x178/0x190
> [c000000fb8a1f890] [c0000000001fcd70] bdi_register+0x80/0x1d0
> [c000000fb8a1f8c0] [c0000000001fd244] bdi_register_owner+0x44/0x80
> [c000000fb8a1f940] [c000000000453bbc] device_add_disk+0x1cc/0x500
> [c000000fb8a1f9f0] [c000000000764dec] dm_create+0x33c/0x5f0
> [c000000fb8a1fa90] [c00000000076d9ac] dev_create+0x8c/0x3e0
> [c000000fb8a1fb40] [c00000000076d1fc] ctl_ioctl+0x38c/0x580
> [c000000fb8a1fd20] [c00000000076d410] dm_ctl_ioctl+0x20/0x30
> [c000000fb8a1fd40] [c0000000002799ac] do_vfs_ioctl+0xcc/0x8f0
> [c000000fb8a1fde0] [c00000000027a230] SyS_ioctl+0x60/0xc0
> [c000000fb8a1fe30] [c00000000000bfe0] system_call+0x38/0xfc
> Instruction dump:
> 39290008 7cc8482a e94d0030 e9230000 7ce95214 7d49502a e9270010 2faa0000 
> 419e007c 2fa90000 419e0074 e93f0022 <7f4a482a> 39200000 88ad023a 992d023a 
> ---[ end trace dcdac2d3f668d033 ]---
>

Thanks for the test!

Looks like there is a double free in the patch; could you test the new one below?
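
For context: each per-CPU slot in pnc->counter is only an alias into the
per-node array, so every node_counter must be freed exactly once, through
pnc->nodes. My guess (I haven't compared against the previous version) is
that the old cleanup walked the per-CPU pointers, roughly like this
hypothetical sketch:

	/* BROKEN (hypothetical reconstruction of the old deinit): all CPUs
	 * in a node share one node_counter, so this frees each counter
	 * once per CPU in that node.
	 */
	int cpu;

	for_each_possible_cpu(cpu)
		kfree(*per_cpu_ptr(pnc->counter, cpu));

The version below only ever frees through pnc->nodes.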

---
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..bd6644bf9643 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8u %8u\n",
+		pnode_counter_read_all(&p->in_flight[0]),
+		pnode_counter_read_all(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 96bd13e581cd..aac0b2235410 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -521,7 +521,7 @@ static void start_io_acct(struct dm_io *io)
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
 	part_stat_unlock();
-	atomic_set(&dm_disk(md)->part0.in_flight[rw],
+	pnode_counter_set(&dm_disk(md)->part0.in_flight[rw],
 		atomic_inc_return(&md->pending[rw]));
 
 	if (unlikely(dm_stats_used(&md->stats)))
@@ -550,7 +550,7 @@ static void end_io_acct(struct dm_io *io)
 	 * a flush.
 	 */
 	pending = atomic_dec_return(&md->pending[rw]);
-	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
+	pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], pending);
 	pending += atomic_read(&md->pending[rw^0x1]);
 
 	/* nudge anyone waiting on suspend queue */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e619fae2f037..40c9bc74a120 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/percpu-refcount.h>
 #include <linux/uuid.h>
+#include <linux/pernode_counter.h>
 
 #ifdef CONFIG_BLOCK
 
@@ -120,7 +121,7 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
+	struct pnode_counter in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -364,21 +365,22 @@ static inline void free_part_stats(struct hd_struct *part)
 
 static inline void part_inc_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_inc(&part->in_flight[rw]);
+	pnode_counter_inc(&part->in_flight[rw]);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		pnode_counter_inc(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline void part_dec_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_dec(&part->in_flight[rw]);
+	pnode_counter_dec(&part->in_flight[rw]);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		pnode_counter_dec(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline int part_in_flight(struct hd_struct *part)
 {
-	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+	return pnode_counter_read_all(&part->in_flight[0]) +
+		pnode_counter_read_all(&part->in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
@@ -627,11 +629,34 @@ extern ssize_t part_fail_store(struct device *dev,
 			       const char *buf, size_t count);
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+static inline int hd_counter_init(struct hd_struct *part)
+{
+	if (pnode_counter_init(&part->in_flight[0], GFP_KERNEL))
+		return -ENOMEM;
+	if (pnode_counter_init(&part->in_flight[1], GFP_KERNEL)) {
+		pnode_counter_deinit(&part->in_flight[0]);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline void hd_counter_deinit(struct hd_struct *part)
+{
+	pnode_counter_deinit(&part->in_flight[0]);
+	pnode_counter_deinit(&part->in_flight[1]);
+}
+
 static inline int hd_ref_init(struct hd_struct *part)
 {
+	if (hd_counter_init(part))
+		return -ENOMEM;
+
 	if (percpu_ref_init(&part->ref, __delete_partition, 0,
-				GFP_KERNEL))
+				GFP_KERNEL)) {
+		hd_counter_deinit(part);
 		return -ENOMEM;
+	}
 	return 0;
 }
 
@@ -659,6 +684,7 @@ static inline void hd_free_part(struct hd_struct *part)
 {
 	free_part_stats(part);
 	free_part_info(part);
+	hd_counter_deinit(part);
 	percpu_ref_exit(&part->ref);
 }
 
diff --git a/include/linux/pernode_counter.h b/include/linux/pernode_counter.h
new file mode 100644
index 000000000000..e95b5f2d1d9c
--- /dev/null
+++ b/include/linux/pernode_counter.h
@@ -0,0 +1,119 @@
+#ifndef _LINUX_PERNODE_COUNTER_H
+#define _LINUX_PERNODE_COUNTER_H
+/*
+ * A simple per node atomic counter for use in block io accounting.
+ */
+
+#include <linux/smp.h>
+#include <linux/percpu.h>
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>		/* kzalloc()/kfree() */
+#include <linux/topology.h>	/* cpu_to_node(), nr_node_ids */
+
+struct node_counter {
+	atomic_t counter_in_node;
+};
+
+struct pnode_counter {
+	struct node_counter * __percpu *counter;
+	struct node_counter **nodes;
+};
+
+static inline int pnode_counter_init(struct pnode_counter *pnc, gfp_t gfp)
+{
+	struct node_counter **nodes;
+	int i, j, cpu;
+
+	nodes = kzalloc(nr_node_ids * sizeof(struct node_counter *), gfp);
+	if (!nodes)
+		goto err_nodes;
+
+	for_each_node(i) {
+		nodes[i] = kzalloc_node(sizeof(struct node_counter), gfp, i);
+		if (!nodes[i])
+			goto err_node_counter;
+	}
+
+	pnc->counter = alloc_percpu_gfp(struct node_counter *, gfp);
+	if (!pnc->counter)
+		goto err_node_counter;
+
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(pnc->counter, cpu) = nodes[cpu_to_node(cpu)];
+
+	pnc->nodes = nodes;
+
+	return 0;
+
+ err_node_counter:
+	for (j = 0; j < i; j++)
+		kfree(nodes[j]);
+	kfree(nodes);
+ err_nodes:
+	return -ENOMEM;
+}
+
+static inline void pnode_counter_deinit(struct pnode_counter *pnc)
+{
+	int i;
+
+	for_each_node(i) {
+		struct node_counter *node = pnc->nodes[i];
+
+		kfree(node);
+	}
+	free_percpu(pnc->counter);
+	kfree(pnc->nodes);
+}
+
+static inline void pnode_counter_inc(struct pnode_counter *pnc)
+{
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_inc_cpu(struct pnode_counter *pnc, int cpu)
+{
+	struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+	atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec(struct pnode_counter *pnc)
+{
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec_cpu(struct pnode_counter *pnc, int cpu)
+{
+	struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+	atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_set(struct pnode_counter *pnc, int val)
+{
+	int i;
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	for_each_node(i)
+		atomic_set(&pnc->nodes[i]->counter_in_node, 0);
+	atomic_set(&node->counter_in_node, val);
+}
+
+static inline long pnode_counter_read_all(struct pnode_counter *pnc)
+{
+	int i;
+	long val = 0;
+
+	for_each_node(i)
+		val += atomic_read(&pnc->nodes[i]->counter_in_node);
+
+	return val;
+}
+
+#endif /* _LINUX_PERNODE_COUNTER_H */
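
In case it helps review, a minimal sketch of how a (hypothetical) caller
would use the new API -- an init/deinit pair plus hot-path inc/dec, with
the read side summing one counter per node instead of per CPU:

	struct pnode_counter in_flight;

	if (pnode_counter_init(&in_flight, GFP_KERNEL))
		return -ENOMEM;

	pnode_counter_inc(&in_flight);	/* I/O issued on this CPU */
	pnode_counter_dec(&in_flight);	/* I/O completed */

	/* Readers sum nr_node_ids counters, not nr_cpu_ids. */
	pr_debug("in flight: %ld\n", pnode_counter_read_all(&in_flight));

	pnode_counter_deinit(&in_flight);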


Thanks,
Ming

