Date: Thu, 24 Apr 2003 10:22:22 +0530
From: Suparna Bhattacharya <suparna@in.ibm.com>
To: bcrl@redhat.com, akpm@digeo.com
Cc: linux-kernel@vger.kernel.org, linux-aio@kvack.org,
    linux-fsdevel@vger.kernel.org
Subject: Filesystem AIO read-write patches
Message-ID: <20030424102221.A2166@in.ibm.com>
Reply-To: suparna@in.ibm.com

Here is a revised version of the filesystem AIO patches for 2.5.68.
It is built on a variation of the simple retry-based scheme originally
suggested by Ben LaHaise.

Why ?
-----

Because 2.5 is still missing real support for regular filesystem AIO
(except for O_DIRECT). ext2, jfs and nfs define the fops aio
interfaces aio_read and aio_write to default to
generic_file_aio_read/write. However, these routines behave fully
synchronously unless the file was opened with O_DIRECT. This means
that an io_submit could merrily block for regular aio read/write
operations, while the application thinks it's doing async i/o.

How ?
-----

The approach we took was to identify and focus on the most significant
blocking points (seconded by observations from initial experimentation
and profiling results), and convert them into retry exit points.
Retries start at a very high level, driven directly by the aio
infrastructure (in future, if the in-kernel fs APIs change, retries
could instead happen one level below, i.e. at the API level). They are
kicked off via async wait queue functions; in synchronous i/o context
the default wait queue entries are synchronous, and hence don't cause
an exit at a retry point.

One of the considerations was to take a careful and less intrusive
route, with minimal changes to the existing synchronous i/o paths. The
intent was to achieve a reasonable level of asynchrony in a way that
could then be further optimized and tuned for workloads of relevance.
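To make this concrete, here is a minimal sketch of what such an async
wait queue function could look like. This is illustrative rather than
the patch code itself: the ki_wait field, the kick_iocb() helper, and
the exact wait_queue_func_t signature are assumptions here.

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/wait.h>
#include <linux/aio.h>

/*
 * Sketch: an async wait queue function which, instead of waking a
 * sleeping task (as the default synchronous entries do), unhooks the
 * waiter and hands the iocb back to the aio infrastructure for a
 * retry.  ki_wait and kick_iocb() are assumed names.
 */
static int aio_wake_function(wait_queue_t *wait, unsigned mode, int sync)
{
	struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);

	list_del_init(&wait->task_list);	/* stop waiting here ... */
	kick_iocb(iocb);			/* ... retry the iocb instead */
	return 1;
}

In synchronous i/o context the wait queue entry is the usual one that
wakes the blocked task, so the same code path sleeps exactly as it
does today.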
The Patches:
------------
(which I'll be mailing out as responses to this note)

01aioretry.patch      : Base aio retry infrastructure
02aiord.patch         : Filesystem aio read
03aiowr.patch         : Minimal filesystem aio write (for all archs
                        and all filesystems using
                        generic_file_aio_write)
04down_wq-86.patch    : An asynchronous semaphore down implementation
                        (currently x86 only)
05aiowrdown_wq.patch  : Uses async down for aio write
06bread_wq.patch      : Async bread implementation
07ext2getblk_wq.patch : Async get block support for the ext2
                        filesystem

Observations
------------

As a quick check to find out if this really works, I could observe a
decent reduction in the time spent in io_submit (especially for large
reads) when the file is not already cached (e.g. first-time access).
For the write case, I found that I had to add the async get block
support to get a perceptible benefit. For the cached case there wasn't
any observable difference, which is expected. The patch didn't seem to
hurt synchronous read/write performance in a simple test.

Another thing I tried was to temporarily move the retries into
io_getevents rather than the worker threads, just as a sanity check
for any gross impact on cpu utilization. That seemed OK too. Of
course, thorough performance testing is needed; it would show up the
places where there is scope for tuning, and how the patches affect
overall system performance numbers. I have been playing with this for
a while now, and so far it has been running OK for me. I would welcome
feedback, bug reports, test results etc.

Full diffstat:
--------------

aiordwr-rollup.patch:

 arch/i386/kernel/i386_ksyms.c |    2
 arch/i386/kernel/semaphore.c  |   30 ++-
 drivers/block/ll_rw_blk.c     |   21 +-
 fs/aio.c                      |  371 +++++++++++++++++++++++++++++++++---------
 fs/buffer.c                   |   54 +++++-
 fs/ext2/inode.c               |   44 +++-
 include/asm-i386/semaphore.h  |   27 ++-
 include/linux/aio.h           |   32 +++
 include/linux/blkdev.h        |    1
 include/linux/buffer_head.h   |   30 +++
 include/linux/errno.h         |    1
 include/linux/init_task.h     |    1
 include/linux/pagemap.h       |   19 ++
 include/linux/sched.h         |    2
 include/linux/wait.h          |    2
 include/linux/writeback.h     |    4
 kernel/fork.c                 |    9 -
 mm/filemap.c                  |   97 +++++++++-
 mm/page-writeback.c           |   17 +
 19 files changed, 616 insertions(+), 148 deletions(-)

[The patches are also available for download from the Linux
Scalability Effort project site (http://sourceforge.net/projects/lse),
categorized under the "aio" release in the IO Scalability section:
http://sourceforge.net/project/showfiles.php?group_id=8875]

A rollup version containing all 7 patches (aiordwr-rollup.patch) will
be made available as well.

Major additions/changes since previous versions posted:
-------------------------------------------------------

- Introduced _wq versions of low-level routines like lock_page_wq,
  wait_on_page_bit_wq etc, which take the wait queue entry as a
  parameter (thanks to Christoph Hellwig for suggesting the new and
  much better names :)).
- Reorganized code to avoid having to use the do_sync_op() wrapper
  (the forced emulation of the i/o wait context seemed an overhead,
  and not very elegant).
- (New) Implementation of an asynchronous semaphore down operation
  for x86 (down_wq).
- Dropped the async block allocation portions from the async
  ext2_get_block patch after a discussion with Stephen Tweedie (the
  i/o patterns we anticipate are less likely to extend file sizes).
- Fixed use_mm() to clear the lazy tlb setting (I traced some of the
  strange hangs I was seeing with large reads to this).
- Removed the aio_run_iocbs() acceleration from io_getevents, now
  that the above problem is gone.

Todos/TBDs:
-----------

- Support down_wq on other archs, or provide compatibility
  definitions for archs where it is not implemented (need feedback on
  this)
- Should the cond_resched() calls in read/write be converted to retry
  points (would need ctx-specific worker threads)?
- Look at async get block implementations for other filesystems
  (e.g. jfs)?
- Optional: check if it makes sense to use the retry model for
  O_DIRECT (or change sync O_DIRECT to wait for completion of async
  O_DIRECT)?
- Upgrade to Ben's aio API changes (collapse of the api parameters
  into an rw_iocb) if and when they get merged

A few comments on low level implementation details:
---------------------------------------------------

io_wait context
---------------

The task->io_wait field reflects the wait context in which a task is
executing its i/o operations. For synchronous i/o, task->io_wait is
NULL and the wait context is local, on stack; for threads doing io
submits or retries on behalf of async i/o callers, tsk->io_wait is the
wait queue function entry to be notified on completion of a condition
required for the i/o to progress.

Low-level _wq routines take a wait queue parameter, so they can be
invoked in either async or sync mode, even when running in async
context (e.g. servicing a page fault during an async retry). Routines
which are simply expected to be async whenever they run in async
context, and sync whenever they run in sync context, do not need to
take a wait queue parameter.
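To illustrate the convention, here is a rough sketch of what a routine
like lock_page_wq (named in the changelog above) could look like. The
body is an assumption, not the patch code: in particular, the use of
-EIOCBRETRY as the retry-exit return value and the direct use of
page_waitqueue() here are illustrative.

#include <linux/pagemap.h>
#include <linux/wait.h>

/*
 * Illustrative sketch of a _wq routine.  A NULL wait queue entry asks
 * for the old synchronous behaviour; a non-NULL (async) entry makes
 * the routine register for wakeup and return -EIOCBRETRY instead of
 * sleeping, turning the call site into a retry exit point.
 */
int lock_page_wq(struct page *page, wait_queue_t *wait)
{
	if (!wait) {
		lock_page(page);	/* sync context: block as before */
		return 0;
	}

	/* async context: queue the wakeup first, then try the lock */
	add_wait_queue(page_waitqueue(page), wait);
	if (!TestSetPageLocked(page)) {
		remove_wait_queue(page_waitqueue(page), wait);
		return 0;		/* got the lock without blocking */
	}
	return -EIOCBRETRY;		/* retry is kicked on unlock */
}

On -EIOCBRETRY the caller unwinds back to the aio core, which retries
the operation when the wakeup fires.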
do_sync_op()
------------

The do_sync_op() wrappers are typically not needed anymore for the
sync versions of the operations; passing NULL to the corresponding _wq
functions (as in the sketch above) suffices. However, there may be
unusual cases with several levels of nesting, like:

	A()->B()->C()->D()->F()->iowait()

It may seem unnatural to pass a wait queue argument all the way
through; but if we need to force sync behaviour in one such case even
when called in async context, yet retain async behaviour in another,
then we may need to resort to using do_sync_op() (e.g. if we had kept
the ext2 async block allocation modifications).

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India