From: Tejun Heo @ 2015-04-06 19:57 UTC
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, gthelen

Subject: [PATCHSET 1/3 v3 block/for-4.1/core] writeback: cgroup writeback support

Hello,

This is v3 of the cgroup writeback support patchset.  Changes from
the last take[L] are:

* mapping_congested() is now inode_congested() and always operates on
  the inode regardless of the current task.  This was broken in the
  last take due to the logic carried over from multi-wb dirtying in
  the first take.  Noticed by Vivek.

* Fengguang noticed a crash from the split congestion code when there's
  no blkcg policy active.  This is caused by blkcg pointlessly trying
  to micro-optimize the allocation of the root blkcg_gq and has caused
  similar bugs before.
  0004-blkcg-always-create-the-blkcg_gq-for-the-root-blkcg.patch
  added.

* Some minor bug fixes and cosmetic updates.  Nothing too interesting.

blkio cgroup (blkcg) is severely crippled in that it can only control
read and direct write IOs.  blkcg can't tell which cgroup should be
held responsible for a given writeback IO and charges all of them to
the root cgroup - all normal write traffic ends up in the root cgroup.
Although the problem was identified years ago, it hasn't been solved
yet, mainly because it interacts with so many subsystems.

This patchset finally implements cgroup writeback support so that
writeback of a page is attributed to the corresponding blkcg of the
memcg that the page belongs to.

Overall design
--------------

* This requires cooperation between memcg and blkcg.  Each inode is
  assigned to the blkcg mapped to the memcg being dirtied.

* struct bdi_writeback (wb) was always embedded in struct
  backing_dev_info (bdi) and the distinction between the two wasn't
  clear.  This patchset makes wb operate as an independent writeback
  execution domain.  bdi->wb is still embedded and serves the root
  cgroup but there can be other wb's for other cgroups.

* Each wb is associated with a memcg.  As memcg is implicitly enabled
  by blkcg on the unified hierarchy, this gives a unique wb for each
  memcg-blkcg combination.  When the memcg-blkcg mapping changes, a
  new wb is created and the existing wb is unlinked and drained.

* An inode is associated with the matching wb when it gets dirtied for
  the first time and is written back by that wb.  A later patchset
  will implement dynamic wb switching.

* All writeback operations are made per-wb instead of per-bdi.
  bdi-wide operations are split across all member wb's.  If some
  finite amount needs to be distributed, be it the number of pages to
  write back or bdi->min/max_ratio, it's distributed according to the
  bandwidth proportion a wb has in the bdi.

* cgroup writeback support adds one pointer to struct inode.
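
To illustrate the bandwidth-proportional distribution described above,
here is a minimal user-space sketch in C.  It is not code from the
patchset: struct toy_wb, toy_split_pages() and the field names are
made up for illustration, and the kernel's actual arithmetic and
rounding may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-cgroup wb; not the kernel struct. */
struct toy_wb {
	uint64_t write_bandwidth;	/* estimated bandwidth, arbitrary units */
	uint64_t nr_pages;		/* share of the work, filled in below */
};

/*
 * Split @total pages across @n wb's in proportion to each wb's share
 * of the summed bandwidth.  Shares are rounded down, so a few pages
 * may stay unassigned.  Returns the number of pages handed out, or 0
 * if no wb reports any bandwidth.  No overflow protection: the caller
 * is assumed to keep total * bandwidth within 64 bits.
 */
static uint64_t toy_split_pages(struct toy_wb *wbs, size_t n, uint64_t total)
{
	uint64_t tot_bw = 0, assigned = 0;
	size_t i;

	assert(wbs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		tot_bw += wbs[i].write_bandwidth;
		wbs[i].nr_pages = 0;
	}
	if (tot_bw == 0)
		return 0;

	for (i = 0; i < n; i++) {
		wbs[i].nr_pages = total * wbs[i].write_bandwidth / tot_bw;
		assigned += wbs[i].nr_pages;
	}
	return assigned;
}
```

With two wb's whose bandwidths are 300 and 100, splitting 1000 pages
this way hands out 750 and 250 pages respectively.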


Missing pieces
--------------

* It requires some cooperation from the filesystem and currently only
  works with ext2.  The changes necessary on the filesystem side are
  almost trivial.  I'll write up documentation on it.

* blk-throttle works but cfq-iosched isn't ready for writebacks coming
  down with different cgroups.  cfq-iosched should be updated to have
  a writeback ioc per cgroup and route writeback IOs through it.


How to test
-----------

* Boot with kernel option "cgroup__DEVEL__legacy_files_on_dfl".

* umount /sys/fs/cgroup/memory
  umount /sys/fs/cgroup/blkio
  mkdir /sys/fs/cgroup/unified
  mount -t cgroup -o __DEVEL__sane_behavior cgroup /sys/fs/cgroup/unified
  echo +blkio > /sys/fs/cgroup/unified/cgroup.subtree_control

* Build the cgroup hierarchy (don't forget to enable blkio using
  subtree_control), put processes in cgroups, and run tests using ext2
  filesystems and the blkio.throttle.* knobs.

This patchset contains the following 49 patches.

 0001-memcg-add-per-cgroup-dirty-page-accounting.patch
 0002-blkcg-move-block-blk-cgroup.h-to-include-linux-blk-c.patch
 0003-update-CONFIG_BLK_CGROUP-dummies-in-include-linux-bl.patch
 0004-blkcg-always-create-the-blkcg_gq-for-the-root-blkcg.patch
 0005-memcg-add-mem_cgroup_root_css.patch
 0006-blkcg-add-blkcg_root_css.patch
 0007-cgroup-block-implement-task_get_css-and-use-it-in-bi.patch
 0008-blkcg-implement-task_get_blkcg_css.patch
 0009-blkcg-implement-bio_associate_blkcg.patch
 0010-memcg-implement-mem_cgroup_css_from_page.patch
 0011-writeback-move-backing_dev_info-state-into-bdi_write.patch
 0012-writeback-move-backing_dev_info-bdi_stat-into-bdi_wr.patch
 0013-writeback-move-bandwidth-related-fields-from-backing.patch
 0014-writeback-s-bdi-wb-in-mm-page-writeback.c.patch
 0015-writeback-move-backing_dev_info-wb_lock-and-worklist.patch
 0016-writeback-reorganize-mm-backing-dev.c.patch
 0017-writeback-separate-out-include-linux-backing-dev-def.patch
 0018-bdi-make-inode_to_bdi-inline.patch
 0019-writeback-add-gfp-to-wb_init.patch
 0020-bdi-separate-out-congested-state-into-a-separate-str.patch
 0021-writeback-add-CONFIG-BDI_CAP-FS-_CGROUP_WRITEBACK.patch
 0022-writeback-make-backing_dev_info-host-cgroup-specific.patch
 0023-writeback-blkcg-associate-each-blkcg_gq-with-the-cor.patch
 0024-writeback-attribute-stats-to-the-matching-per-cgroup.patch
 0025-writeback-let-balance_dirty_pages-work-on-the-matchi.patch
 0026-writeback-make-congestion-functions-per-bdi_writebac.patch
 0027-writeback-blkcg-restructure-blk_-set-clear-_queue_co.patch
 0028-writeback-blkcg-propagate-non-root-blkcg-congestion-.patch
 0029-writeback-implement-and-use-inode_congested.patch
 0030-writeback-implement-WB_has_dirty_io-wb_state-flag.patch
 0031-writeback-implement-backing_dev_info-tot_write_bandw.patch
 0032-writeback-make-bdi_has_dirty_io-take-multiple-bdi_wr.patch
 0033-writeback-don-t-issue-wb_writeback_work-if-clean.patch
 0034-writeback-make-bdi-min-max_ratio-handling-cgroup-wri.patch
 0035-writeback-implement-bdi_for_each_wb.patch
 0036-writeback-remove-bdi_start_writeback.patch
 0037-writeback-make-laptop_mode_timer_fn-handle-multiple-.patch
 0038-writeback-make-writeback_in_progress-take-bdi_writeb.patch
 0039-writeback-make-bdi_start_background_writeback-take-b.patch
 0040-writeback-make-wakeup_flusher_threads-handle-multipl.patch
 0041-writeback-add-wb_writeback_work-auto_free.patch
 0042-writeback-implement-bdi_wait_for_completion.patch
 0043-writeback-implement-wb_wait_for_single_work.patch
 0044-writeback-restructure-try_writeback_inodes_sb-_nr.patch
 0045-writeback-make-writeback-initiation-functions-handle.patch
 0046-writeback-dirty-inodes-against-their-matching-cgroup.patch
 0047-buffer-writeback-make-__block_write_full_page-honor-.patch
 0048-mpage-make-__mpage_writepage-honor-cgroup-writeback.patch
 0049-ext2-enable-cgroup-writeback-support.patch

0001-0020 are preps.

0021-0046 gradually convert writeback code so that wb (bdi_writeback)
operates as an independent writeback domain instead of bdi
(backing_dev_info), a single bdi can have multiple per-cgroup wb's
working for it, and per-bdi operations are translated and distributed
to all its member wb's.
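
As a rough picture of what "translated and distributed to all its
member wb's" can mean, the following user-space C sketch models a bdi
that hosts several wb's.  Everything here (struct demo_wb, struct
demo_bdi, the function names) is hypothetical and only loosely
inspired by the patch titles above; the real kernel code uses
different structures, locking and work items.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel structs. */
struct demo_wb {
	bool has_dirty_io;		/* does this wb hold dirty inodes? */
	unsigned int nr_queued;		/* work items queued on this wb */
};

struct demo_bdi {
	struct demo_wb *wbs;		/* wbs[0] stands in for the root wb */
	size_t nr_wbs;
};

/* A bdi counts as dirty if any of its member wb's is dirty. */
static bool demo_bdi_has_dirty_io(const struct demo_bdi *bdi)
{
	size_t i;

	for (i = 0; i < bdi->nr_wbs; i++)
		if (bdi->wbs[i].has_dirty_io)
			return true;
	return false;
}

/*
 * Turn one bdi-wide "start writeback" request into per-wb work,
 * skipping wb's that are clean.  Returns how many wb's got a work
 * item queued.
 */
static size_t demo_bdi_start_writeback(struct demo_bdi *bdi)
{
	size_t i, nr_issued = 0;

	assert(bdi != NULL);

	for (i = 0; i < bdi->nr_wbs; i++) {
		struct demo_wb *wb = &bdi->wbs[i];

		if (!wb->has_dirty_io)
			continue;
		wb->nr_queued++;
		nr_issued++;
	}
	return nr_issued;
}
```

The point of the sketch is only the shape of the conversion: a query
on the bdi becomes an "any member wb" check, and a request on the bdi
becomes one work item per wb that actually has something to do.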

0047-0049 make the lower layers properly propagate the cgroup
association from the writeback layer and enable cgroup writeback on
ext2.

This patchset is on top of

  block/for-4.1/core bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
+ [1] [PATCH] writeback: fix possible underflow in write bandwidth calculation

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-20150406

diffstat follows.  Thanks.

 Documentation/cgroups/memory.txt |    1 
 block/bio.c                      |   35 +-
 block/blk-cgroup.c               |  124 +++----
 block/blk-cgroup.h               |  603 -----------------------------------
 block/blk-core.c                 |   70 ++--
 block/blk-integrity.c            |    1 
 block/blk-sysfs.c                |    3 
 block/blk-throttle.c             |    2 
 block/bounce.c                   |    1 
 block/cfq-iosched.c              |    2 
 block/elevator.c                 |    2 
 block/genhd.c                    |    1 
 drivers/block/drbd/drbd_int.h    |    1 
 drivers/block/drbd/drbd_main.c   |   10 
 drivers/block/pktcdvd.c          |    1 
 drivers/char/raw.c               |    1 
 drivers/md/bcache/request.c      |    1 
 drivers/md/dm.c                  |    2 
 drivers/md/dm.h                  |    1 
 drivers/md/md.h                  |    1 
 drivers/md/raid1.c               |    4 
 drivers/md/raid10.c              |    2 
 drivers/mtd/devices/block2mtd.c  |    1 
 fs/block_dev.c                   |    9 
 fs/buffer.c                      |   60 ++-
 fs/ext2/super.c                  |    2 
 fs/ext4/extents.c                |    1 
 fs/ext4/mballoc.c                |    1 
 fs/ext4/super.c                  |    1 
 fs/f2fs/node.c                   |    4 
 fs/f2fs/segment.h                |    3 
 fs/fs-writeback.c                |  599 +++++++++++++++++++++++++----------
 fs/fuse/file.c                   |   12 
 fs/gfs2/super.c                  |    2 
 fs/hfs/super.c                   |    1 
 fs/hfsplus/super.c               |    1 
 fs/inode.c                       |    1 
 fs/mpage.c                       |    2 
 fs/nfs/filelayout/filelayout.c   |    1 
 fs/nfs/internal.h                |    2 
 fs/nfs/write.c                   |    3 
 fs/ocfs2/file.c                  |    1 
 fs/reiserfs/super.c              |    1 
 fs/ufs/super.c                   |    1 
 fs/xfs/xfs_aops.c                |   12 
 fs/xfs/xfs_file.c                |    1 
 include/linux/backing-dev-defs.h |  188 +++++++++++
 include/linux/backing-dev.h      |  567 ++++++++++++++++++++++++---------
 include/linux/bio.h              |    3 
 include/linux/blk-cgroup.h       |  631 ++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h           |   21 -
 include/linux/cgroup.h           |   25 +
 include/linux/fs.h               |   13 
 include/linux/memcontrol.h       |   10 
 include/linux/mm.h               |    3 
 include/linux/pagemap.h          |    3 
 include/linux/writeback.h        |   25 -
 include/trace/events/writeback.h |    8 
 init/Kconfig                     |    5 
 mm/backing-dev.c                 |  666 +++++++++++++++++++++++++++++++--------
 mm/fadvise.c                     |    2 
 mm/filemap.c                     |   32 +
 mm/madvise.c                     |    1 
 mm/memcontrol.c                  |   59 +++
 mm/page-writeback.c              |  649 +++++++++++++++++++++-----------------
 mm/readahead.c                   |    2 
 mm/rmap.c                        |    2 
 mm/truncate.c                    |   25 +
 mm/vmscan.c                      |   28 -
 69 files changed, 3003 insertions(+), 1556 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1427086499-15657-1-git-send-email-tj@kernel.org
[1] http://lkml.kernel.org/g/20150323041848.GA8991@htj.duckdns.org

