From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: 
References: <1461805789-3632-1-git-send-email-ming.lei@canonical.com>
 <1461805789-3632-3-git-send-email-ming.lei@canonical.com>
Date: Fri, 29 Apr 2016 17:33:11 +0800
Subject: Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively
From: Ming Lei
To: Mikulas Patocka
Cc: Jens Axboe, Linux Kernel Mailing List, linux-block@vger.kernel.org,
 Christoph Hellwig, Btrfs BTRFS, Shaun Tancheff, Alan Cox, Neil Brown,
 Liu Bo, Jens Axboe
Content-Type: text/plain; charset=UTF-8
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 29, 2016 at 4:39 PM, Mikulas Patocka wrote:
>
>
> On Fri, 29 Apr 2016, Ming Lei wrote:
>
>> On Fri, Apr 29, 2016 at 12:59 AM, Mikulas Patocka wrote:
>> >
>> >
>> > On Fri, 29 Apr 2016, Ming Lei wrote:
>> >
>> >> On Thu, Apr 28, 2016 at 11:58 PM, Mikulas Patocka wrote:
>> >> >
>> >> >
>> >> > On Thu, 28 Apr 2016, Ming Lei wrote:
>> >> >
>> >> >> Hi Mikulas,
>> >> >>
>> >> >> On Thu, Apr 28, 2016 at 11:29 PM, Mikulas Patocka wrote:
>> >> >> >
>> >> >> >
>> >> >> > On Thu, 28 Apr 2016, Ming Lei wrote:
>> >> >> >
>> >> >> >> There have been reports of heavy stack use caused by calling
>> >> >> >> .bi_end_io() recursively ([1][2][3]). For example, more than
>> >> >> >> 16K of stack is consumed in a single bio completion path [3],
>> >> >> >> and in [2] a stack overflow can be triggered when 20 nested
>> >> >> >> dm-crypt devices are used.
>> >> >> >>
>> >> >> >> Patches [1], [2] and [3] were posted to address the issue, but
>> >> >> >> were never merged. The idea in those patches is basically the
>> >> >> >> same: serialize the recursive calls to .bi_end_io() via a
>> >> >> >> percpu list.
>> >> >> >>
>> >> >> >> This patch takes the same idea, but uses a bio_list to
>> >> >> >> implement it, which turns out simpler, and the code becomes
>> >> >> >> more readable at the same time.
>> >> >> >>
>> >> >> >> One corner case that wasn't covered before is that .bi_end_io()
>> >> >> >> may be scheduled to run in process context (as in btrfs). This
>> >> >> >> patch simply bypasses the optimization in that case, because a
>> >> >> >> new context should have enough stack space, and this approach
>> >> >> >> cannot cover it anyway: there is no easy way to get a per-task
>> >> >> >> linked list head.
>> >> >> >
>> >> >> > Hi
>> >> >> >
>> >> >> > You could use preempt_disable() and then you could use a per-cpu
>> >> >> > list even in process context.
>> >> >>
>> >> >> Consider why .bi_end_io() is scheduled to process context in the
>> >> >> first place: the only workable and simple way I can think of is a
>> >> >> per-task list, because the callback may sleep.
>> >> >
>> >> > The bi_end_io callback should not sleep, even if it is called from
>> >> > process context.
>> >>
>> >> If it shouldn't sleep, why is it scheduled to run in process context,
>> >> paying the extra cost of a context switch?
>> >
>> > Some device mapper (and other) drivers use a worker thread to process
>> > bios. So the bio may be finished from the worker thread. It would be
>> > advantageous to prevent stack overflow even in this case.
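For reference, the serialization idea under discussion looks roughly like
the sketch below. This is a minimal illustration, not the patch as posted:
the per-CPU variable name is made up, and the real patch also has to
protect the list against hardirq-over-softirq nesting (e.g. with
local_irq_save()), which is omitted here.

	#include <linux/bio.h>
	#include <linux/percpu.h>
	#include <linux/preempt.h>

	/* one list per CPU; non-NULL means a completion is already running */
	static DEFINE_PER_CPU(struct bio_list *, bio_end_list);

	void bio_endio(struct bio *bio)
	{
		struct bio_list list, *bl;

		/* process context (e.g. a btrfs workqueue) has a fresh
		 * stack and the callback may sleep: call it directly
		 */
		if (!in_interrupt()) {
			bio->bi_end_io(bio);
			return;
		}

		bl = this_cpu_read(bio_end_list);
		if (bl) {
			/* nested completion: queue it and unwind the stack */
			bio_list_add(bl, bio);
			return;
		}

		/* outermost completion on this CPU: drain iteratively, so
		 * the stack depth stays constant no matter how deep the
		 * device stack is
		 */
		bio_list_init(&list);
		this_cpu_write(bio_end_list, &list);
		do {
			bio->bi_end_io(bio);
		} while ((bio = bio_list_pop(&list)) != NULL);
		this_cpu_write(bio_end_list, NULL);
	}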
>>
>> If .bi_end_io didn't sleep, it could be moved back into interrupt
>> context for the sake of performance once this patch is merged. The
>> cost of a context switch in a high-IOPS case isn't cheap.
>
> If some block device driver in a process context finds out that it needs
> to terminate a bio, it calls bio_endio in a process context. Why would it
> need to trigger an interrupt just to call bio_endio? (And how could it
> trigger an interrupt if that driver perhaps doesn't use interrupts at
> all?)

I don't know what you are trying to suggest. That should be the failure
path, while this patch aims at the normal bio completion path, which runs
99.999% of the time. Also, an irq handler often borrows the current
process's stack, which causes the huge stack use.

>
>> It isn't easy to avoid the recursive calls in process context unless
>> you add something to 'task_struct' or introduce 'alloca()' in the
>> kernel. Or do you have better ideas?
>
> preempt_disable around the bi_endio callback should be sufficient.

How can a sleepable function be run in a preempt-disabled context? It
seems you still don't believe that a .bi_end_io() scheduled to process
context can sleep, do you?

>
>> >
>> >> And you can find that btrfs_subio_endio_read() does sleep for the
>> >> checksum work.
>> >
>> > I'm not an expert on btrfs. What happens if it is called from an
>> > interrupt? Do you have an actual stacktrace when this function is
>> > called
>>
>> What do you expect if a sleepable function is called in a softirq or
>> hardirq handler? :-)
>>
>> > from bio_endio and when it sleeps?
>>
>> The problem was observed in xfstests generic/323 with v1 of this patch.
>> Sometimes the test hangs, and sometimes a kernel oops is triggered. The
>> issue is fixed by introducing the 'if (!in_interrupt())' block for
>> handling .bi_end_io() run from process context.
>>
>> If the 'if (!in_interrupt())' block is removed and
>> preempt_disable()/preempt_enable() is added around bio->bi_end_io(),
>> the following kernel warning can be seen easily:
>>
>> [ 51.086303] BUG: sleeping function called from invalid context at mm/slab.h:388
>> [ 51.087442] in_atomic(): 1, irqs_disabled(): 0, pid: 633, name: kworker/u8:4
>> [ 51.088575] CPU: 3 PID: 633 Comm: kworker/u8:4 Not tainted 4.6.0-rc3+ #2017
>> [ 51.088578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.9.0-0-g01a84be-prebuilt.qemu-project.org 04/01/2014
>> [ 51.088637] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
>> [ 51.088640]  0000000000000000 ffff88007bbebc00 ffffffff813d92d3 ffff88007ba6ce00
>> [ 51.088643]  0000000000000184 ffff88007bbebc18 ffffffff810a38bb ffffffff81a35310
>> [ 51.088645]  ffff88007bbebc40 ffffffff810a3949 0000000002400040 0000000002400040
>> [ 51.088648] Call Trace:
>> [ 51.088651]  [] dump_stack+0x63/0x90
>> [ 51.088655]  [] ___might_sleep+0xdb/0x120
>> [ 51.088657]  [] __might_sleep+0x49/0x80
>> [ 51.088659]  [] kmem_cache_alloc+0x1a7/0x210
>> [ 51.088670]  [] ? alloc_extent_state+0x21/0xe0 [btrfs]
>> [ 51.088680]  [] alloc_extent_state+0x21/0xe0 [btrfs]
>> [ 51.088689]  [] __clear_extent_bit+0x2ae/0x3d0 [btrfs]
>> [ 51.088698]  [] clear_extent_bit+0x2a/0x30 [btrfs]
>> [ 51.088708]  [] btrfs_endio_direct_read+0x70/0xf0 [btrfs]
>> [ 51.088711]  [] bio_endio+0xf7/0x140
>> [ 51.088718]  [] end_workqueue_fn+0x3c/0x40 [btrfs]
>> [ 51.088728]  [] normal_work_helper+0xc7/0x310 [btrfs]
>> [ 51.088737]  [] btrfs_endio_helper+0x12/0x20 [btrfs]
>> [ 51.088740]  [] process_one_work+0x157/0x420
>> [ 51.088742]  [] worker_thread+0x12b/0x4d0
>> [ 51.088744]  [] ? __schedule+0x368/0x950
>> [ 51.088746]  [] ? rescuer_thread+0x380/0x380
>> [ 51.088748]  [] kthread+0xd4/0xf0
>> [ 51.088750]  [] ret_from_fork+0x22/0x40
>> [ 51.088752]  [] ? kthread_park+0x60/0x60
>>
>>
>> Not to mention the wait_for_completion() in both
>> __btrfs_correct_data_nocsum() and __btrfs_subio_endio_read(), which
>> are called by btrfs_subio_endio_read() in btrfs_endio_direct_read().
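To make that concrete, the proposal on the table amounts to something
like the hypothetical sketch below (bio_endio_atomic() is not a real
kernel function). In the trace above, bio->bi_end_io is
btrfs_endio_direct_read(), which reaches kmem_cache_alloc() with a
sleepable GFP mask via clear_extent_bit(), so the might_sleep() check
fires exactly as shown.

	/* hypothetical: run the completion with preemption off so that
	 * a per-cpu list would stay usable even in process context
	 */
	static void bio_endio_atomic(struct bio *bio)
	{
		preempt_disable();	/* in_atomic() is now true ... */
		bio->bi_end_io(bio);	/* ... but btrfs_endio_direct_read()
					 *   -> clear_extent_bit()
					 *   -> alloc_extent_state()
					 *   -> kmem_cache_alloc()
					 * may sleep, triggering the BUG
					 * splat above
					 */
		preempt_enable();
	}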
> btrfs is calling bio_endio on a bio that it created. So, bio_endio in
> btrfs can be replaced with:
>
> 	if (bio->bi_end_io == btrfs_endio_direct_read)
> 		btrfs_endio_direct_read(bio);
> 	else
> 		bio_endio(bio);
>
> ... or maybe just with this:
>
> 	bio->bi_end_io(bio);

That is really ugly and should be avoided most of the time.

>
> This could be fixed easily.

Even supposing it could be done that way, there may be other .bi_end_io
callbacks that can sleep too; you would need to make sure no such usage
exists before running them with preemption disabled.

Anyway, it looks like you are worried about the seldom-run failure path,
which this patch does not address. That can be optimized in the future,
and it doesn't conflict with this patch.

Thanks,
Ming

>
> Mikulas
>
>> Thanks,
>> Ming
>>
>> >
>> >> Thanks,
>> >> Ming
>> >
>> > Mikulas