Subject: Re: [PATCH 04/10] block: initial patch for on-stack per-task plugging
From: Shaohua Li
To: Jens Axboe
Cc: Vivek Goyal, linux-kernel@vger.kernel.org, hch@infradead.org,
	jmoyer@redhat.com
In-Reply-To: <4D83639A.3060407@fusionio.com>
References: <1295659049-2688-1-git-send-email-jaxboe@fusionio.com>
	 <1295659049-2688-5-git-send-email-jaxboe@fusionio.com>
	 <20110316173148.GC13562@redhat.com>
	 <1300323609.2337.117.camel@sli10-conroe>
	 <4D81D7D5.7000806@fusionio.com>
	 <1300430212.2337.141.camel@sli10-conroe>
	 <4D83560E.8000002@fusionio.com>
	 <4D83639A.3060407@fusionio.com>
Date: Mon, 21 Mar 2011 14:52:48 +0800
Message-ID: <1300690368.2337.148.camel@sli10-conroe>

On Fri, 2011-03-18 at 21:52 +0800, Jens Axboe wrote:
> On 2011-03-18 13:54, Jens Axboe wrote:
> > On 2011-03-18 07:36, Shaohua Li wrote:
> >> On Thu, 2011-03-17 at 17:43 +0800, Jens Axboe wrote:
> >>> On 2011-03-17 02:00, Shaohua Li wrote:
> >>>> On Thu, 2011-03-17 at 01:31 +0800, Vivek Goyal wrote:
> >>>>> On Wed, Mar 16, 2011 at 04:18:30PM +0800, Shaohua Li wrote:
> >>>>>> 2011/1/22 Jens Axboe:
> >>>>>>> Signed-off-by: Jens Axboe
> >>>>>>> ---
> >>>>>>>  block/blk-core.c          |  357 ++++++++++++++++++++++++++++++++------------
> >>>>>>>  block/elevator.c          |    6 +-
> >>>>>>>  include/linux/blk_types.h |    2 +
> >>>>>>>  include/linux/blkdev.h    |   30 ++++
> >>>>>>>  include/linux/elevator.h  |    1 +
> >>>>>>>  include/linux/sched.h     |    6 +
> >>>>>>>  kernel/exit.c             |    1 +
> >>>>>>>  kernel/fork.c             |    3 +
> >>>>>>>  kernel/sched.c            |   11 ++-
> >>>>>>>  9 files changed, 317 insertions(+), 100 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>>>> index 960f12c..42dbfcc 100644
> >>>>>>> --- a/block/blk-core.c
> >>>>>>> +++ b/block/blk-core.c
> >>>>>>> @@ -27,6 +27,7 @@
> >>>>>>>  #include
> >>>>>>>  #include
> >>>>>>>  #include
> >>>>>>> +#include
> >>>>>>>
> >>>>>>>  #define CREATE_TRACE_POINTS
> >>>>>>>  #include
> >>>>>>> @@ -213,7 +214,7 @@ static void blk_delay_work(struct work_struct *work)
> >>>>>>>
> >>>>>>>  	q = container_of(work, struct request_queue, delay_work.work);
> >>>>>>>  	spin_lock_irq(q->queue_lock);
> >>>>>>> -	q->request_fn(q);
> >>>>>>> +	__blk_run_queue(q);
> >>>>>>>  	spin_unlock_irq(q->queue_lock);
> >>>>>>>  }
> >>>>>> Hi Jens,
> >>>>>> I have some questions about the per-task plugging. Since the request
> >>>>>> list is per-task, and each task delivers its requests at finish flush
> >>>>>> or schedule. But when one cpu delivers requests to global queue, other
> >>>>>> cpus don't know. This seems to have problem. For example:
> >>>>>> 1. get_request_wait() can only flush current task's request list,
> >>>>>> other cpus/tasks might still have a lot of requests, which aren't sent
> >>>>>> to request_queue.
> >>>>>
> >>>>> But very soon these requests will be sent to request queue as soon task
> >>>>> is either scheduled out or task explicitly flushes the plug? So we might
> >>>>> wait a bit longer but that might not matter in general, i guess.
> >>>> Yes, I understand there is just a bit delay. I don't know how severe it
> >>>> is, but this still could be a problem, especially for fast storage or
> >>>> random I/O. My current tests show slight regression (3% or so) with
> >>>> Jens's for 2.6.39/core branch. I'm still checking if it's caused by the
> >>>> per-task plug, but the per-task plug is highly suspected.
> >>>
> >>> To check this particular case, you can always just bump the request
> >>> limit. What test is showing a slowdown?
> >> this is a simple multi-threaded seq read. The issue tends to be request
> >> merge related (not verified yet). The merge reduces about 60% with stack
> >> plug from fio reported data. From trace, without stack plug, requests
> >> from different threads get merged. But with it, such merge is impossible
> >> because flush_plug doesn't check merge, I thought we need add it again.
> >
> > What we could try is have the plug flush insert be
> > ELEVATOR_INSERT_SORT_MERGE and have it lookup potential backmerges.
> >
> > Here's a quick hack that does that, I have not tested it at all.
>
> Gave it a quick test spin, as suspected it had a few issues. This one
> seems to work. Can you toss it through that workload and see if it fares
> better?

Yes, this fully fixes the regression I saw. But I have an accounting issue:
1. The merged request is already accounted for when it is added to the
   plug list.
2. drive_stat_acct() is called without any protection in __make_request(),
   so there is a race in the in_flight accounting. The race appears only
   after the stack plug was added, so it is not caused by this merge patch.

Below is the extra patch I needed to run the test.
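(Aside, before the patch: a minimal, untested sketch of the usage pattern
being discussed above, so the plug-flush timing is easier to follow. The
struct blk_plug, blk_start_plug() and blk_finish_plug() names come from
Jens's series; submit_batch() and its arguments are made up purely for
illustration.)

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Illustrative only: bios submitted between blk_start_plug() and
 * blk_finish_plug() collect on the submitting task's plug list and only
 * reach the request_queue when that task flushes the plug or schedules,
 * which is why another task's get_request_wait() cannot see (or merge
 * with) these requests any earlier.
 */
static void submit_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);			/* requests now queue per-task */
	for (i = 0; i < nr; i++)
		submit_bio(READ, bios[i]);	/* lands on the task's plug list */
	blk_finish_plug(&plug);			/* flush: requests hit the queue */
}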
---
 block/blk-merge.c     |   12 +++++-------
 block/elevator.c      |    9 ++++++---
 drivers/md/dm.c       |    7 ++++---
 fs/partitions/check.c |    3 ++-
 include/linux/genhd.h |   12 ++++++------
 5 files changed, 23 insertions(+), 20 deletions(-)

Index: linux-2.6/block/blk-merge.c
===================================================================
--- linux-2.6.orig/block/blk-merge.c
+++ linux-2.6/block/blk-merge.c
@@ -429,14 +429,12 @@ static int attempt_merge(struct request_
 
 	req->__data_len += blk_rq_bytes(next);
 
-	if (next->cmd_flags & REQ_SORTED) {
-		elv_merge_requests(q, req, next);
+	elv_merge_requests(q, req, next);
 
-		/*
-		 * 'next' is going away, so update stats accordingly
-		 */
-		blk_account_io_merge(next);
-	}
+	/*
+	 * 'next' is going away, so update stats accordingly
+	 */
+	blk_account_io_merge(next);
 
 	req->ioprio = ioprio_best(req->ioprio, next->ioprio);
 	if (blk_rq_cpu_valid(next))
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -566,13 +566,16 @@ void elv_merge_requests(struct request_q
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_merge_req_fn)
+	if ((next->cmd_flags & REQ_SORTED) && e->ops->elevator_merge_req_fn)
 		e->ops->elevator_merge_req_fn(q, rq, next);
 
 	elv_rqhash_reposition(q, rq);
-	elv_rqhash_del(q, next);
 
-	q->nr_sorted--;
+	if (next->cmd_flags & REQ_SORTED) {
+		elv_rqhash_del(q, next);
+		q->nr_sorted--;
+	}
+
 	q->last_merge = rq;
 }
 
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c
+++ linux-2.6/drivers/md/dm.c
@@ -477,7 +477,8 @@ static void start_io_acct(struct dm_io *
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
 	part_stat_unlock();
-	dm_disk(md)->part0.in_flight[rw] = atomic_inc_return(&md->pending[rw]);
+	atomic_set(&dm_disk(md)->part0.in_flight[rw],
+		atomic_inc_return(&md->pending[rw]));
 }
 
 static void end_io_acct(struct dm_io *io)
@@ -497,8 +498,8 @@ static void end_io_acct(struct dm_io *io
 	 * After this is decremented the bio must not be touched if it is
 	 * a flush.
 	 */
-	dm_disk(md)->part0.in_flight[rw] = pending =
-		atomic_dec_return(&md->pending[rw]);
+	pending = atomic_dec_return(&md->pending[rw]);
+	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
 	pending += atomic_read(&md->pending[rw^0x1]);
 
 	/* nudge anyone waiting on suspend queue */
Index: linux-2.6/fs/partitions/check.c
===================================================================
--- linux-2.6.orig/fs/partitions/check.c
+++ linux-2.6/fs/partitions/check.c
@@ -290,7 +290,8 @@ ssize_t part_inflight_show(struct device
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", p->in_flight[0], p->in_flight[1]);
+	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
+		atomic_read(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h
+++ linux-2.6/include/linux/genhd.h
@@ -109,7 +109,7 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	int in_flight[2];
+	atomic_t in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -370,21 +370,21 @@ static inline void free_part_stats(struc
 
 static inline void part_inc_in_flight(struct hd_struct *part, int rw)
 {
-	part->in_flight[rw]++;
+	atomic_inc(&part->in_flight[rw]);
 	if (part->partno)
-		part_to_disk(part)->part0.in_flight[rw]++;
+		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline void part_dec_in_flight(struct hd_struct *part, int rw)
 {
-	part->in_flight[rw]--;
+	atomic_dec(&part->in_flight[rw]);
 	if (part->partno)
-		part_to_disk(part)->part0.in_flight[rw]--;
+		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline int part_in_flight(struct hd_struct *part)
 {
-	return part->in_flight[0] + part->in_flight[1];
+	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
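For reference, the reason plain int in_flight[2] counters are not enough
once drive_stat_acct() can run without the queue lock: two CPUs doing
in_flight[rw]++ concurrently can lose updates, while atomic operations
cannot. Below is a stand-alone user-space illustration of that lost-update
effect (not kernel code, purely to show the race the patch is avoiding):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int plain_in_flight;		/* like the old "int in_flight[2]" */
static atomic_int atomic_in_flight;	/* like the new atomic_t version */

static void *submitter(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		plain_in_flight++;			/* read-modify-write, can race */
		atomic_fetch_add(&atomic_in_flight, 1);	/* safe without a lock */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, submitter, NULL);
	pthread_create(&t2, NULL, submitter, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* the plain counter typically ends up short of 2000000, the atomic one does not */
	printf("plain=%d atomic=%d\n", plain_in_flight,
	       atomic_load(&atomic_in_flight));
	return 0;
}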