From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753738AbbCPKOw (ORCPT ); Mon, 16 Mar 2015 06:14:52 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50766 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751799AbbCPKOt (ORCPT ); Mon, 16 Mar 2015 06:14:49 -0400 Date: Mon, 16 Mar 2015 11:14:43 +0100 From: Jan Kara To: Josef Bacik Cc: linux-fsdevel@vger.kernel.org, david@fromorbit.com, viro@zeniv.linux.org.uk, jack@suse.cz, linux-kernel@vger.kernel.org, Dave Chinner Subject: Re: [PATCH 6/9] bdi: add a new writeback list for sync Message-ID: <20150316101443.GF4934@quack.suse.cz> References: <1426016724-23912-1-git-send-email-jbacik@fb.com> <1426016724-23912-7-git-send-email-jbacik@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1426016724-23912-7-git-send-email-jbacik@fb.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 10-03-15 15:45:21, Josef Bacik wrote: > From: Dave Chinner > > wait_sb_inodes() current does a walk of all inodes in the filesystem > to find dirty one to wait on during sync. This is highly > inefficient and wastes a lot of CPU when there are lots of clean > cached inodes that we don't need to wait on. > > To avoid this "all inode" walk, we need to track inodes that are > currently under writeback that we need to wait for. We do this by > adding inodes to a writeback list on the bdi when the mapping is > first tagged as having pages under writeback. wait_sb_inodes() can > then walk this list of "inodes under IO" and wait specifically just > for the inodes that the current sync(2) needs to wait for. > > To avoid needing all the realted locking to be safe against > interrupts, Jan Kara suggested that we be lazy about removal from > the writeback list. That is, we don't remove inodes from the > writeback list on IO completion, but do it directly during a > wait_sb_inodes() walk. > > This means that the a rare sync(2) call will have some work to do > skipping clean inodes However, in the current problem case of > concurrent sync workloads, concurrent wait_sb_inodes() calls only > walk the very recently dispatched inodes and hence should have very > little work to do. > > This also means that we have to remove the inodes from the writeback > list during eviction. Do this in inode_wait_for_writeback() once > all writeback on the inode is complete. > > Signed-off-by: Dave Chinner The code looks mostly fine. Two comments below. ... > /* > - * Wait for writeback on an inode to complete. Caller must have inode pinned. > + * Wait for writeback on an inode to complete during eviction. Caller must have > + * inode pinned. > */ > void inode_wait_for_writeback(struct inode *inode) > { > + BUG_ON(!(inode->i_state & I_FREEING)); > + > spin_lock(&inode->i_lock); > __inode_wait_for_writeback(inode); > spin_unlock(&inode->i_lock); > + > + /* > + * bd_inode's will have already had the bdev free'd so don't bother > + * doing the bdi_clear_inode_writeback. > + */ > + if (!sb_is_blkdev_sb(inode->i_sb)) > + bdi_clear_inode_writeback(inode_to_bdi(inode), inode); Umm, I don't get this even though I've read the comment several times. Why don't we want to remove bdev inode from writeback list of blockdev_super? Is the comment suggesting that bdev inode cannot be on writeback list? > @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb) > */ > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - mutex_lock(&sb->s_sync_lock); > - spin_lock(&sb->s_inode_list_lock); > - > /* > - * Data integrity sync. Must wait for all pages under writeback, > - * because there may have been pages dirtied before our sync > - * call, but which had writeout started before we write it out. > - * In which case, the inode may not be on the dirty list, but > - * we still have to wait for that writeout. > + * Data integrity sync. Must wait for all pages under writeback, because > + * there may have been pages dirtied before our sync call, but which had > + * writeout started before we write it out. In which case, the inode > + * may not be on the dirty list, but we still have to wait for that > + * writeout. > + * > + * To avoid syncing inodes put under IO after we have started here, > + * splice the io list to a temporary list head and walk that. Newly > + * dirtied inodes will go onto the primary list so we won't wait for > + * them. This is safe to do as we are serialised by the s_sync_lock, > + * so we'll complete processing the complete list before the next > + * sync operation repeats the splice-and-walk process. > + * > + * Inodes that have pages under writeback after we've finished the wait > + * may or may not be on the primary list. They had pages under put IO > + * after we started our wait, so we need to make sure the next sync or > + * IO completion treats them correctly, Move them back to the primary > + * list and restart the walk. > + * > + * Inodes that are clean after we have waited for them don't belong on > + * any list, so simply remove them from the sync list and move onwards. > */ > - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > + mutex_lock(&sb->s_sync_lock); > + spin_lock(&bdi->wb.list_lock); > + list_splice_init(&bdi->wb.b_writeback, &sync_list); > + > + while (!list_empty(&sync_list)) { > + struct inode *inode = list_first_entry(&sync_list, struct inode, > + i_wb_list); > struct address_space *mapping = inode->i_mapping; > > + /* > + * We are lazy on IO completion and don't remove inodes from the > + * list when they are clean. Detect that immediately and skip > + * inodes we don't ahve to wait on. > + */ > + if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) { > + list_del_init(&inode->i_wb_list); > + cond_resched_lock(&bdi->wb.list_lock); > + continue; > + } > + > spin_lock(&inode->i_lock); > - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || > - (mapping->nrpages == 0)) { > + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { > + list_move(&inode->i_wb_list, &bdi->wb.b_writeback); Indenting looks damaged here... > spin_unlock(&inode->i_lock); > + cond_resched_lock(&bdi->wb.list_lock); > continue; Why do we put freeing inodes back to b_writeback list? For I_FREEING and I_WILL_FREE we could as well just delete them. For I_NEW we would start waiting for them once I_NEW gets cleared but still I_NEW inodes under writeback look somewhat worrying (flusher thread skips them so the semantics isn't clear there). Anyway, probably lets just keep code as is for I_NEW but moving back to b_writeback for I_FREEING | I_WILL_FREE looks just dumb. Honza -- Jan Kara SUSE Labs, CR