linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 00/14] Present useful limits to user (v2)
@ 2016-07-15 10:35 Topi Miettinen
  2016-07-15 10:35 ` [PATCH 03/14] resource limits: track highwater mark of file sizes Topi Miettinen
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 10:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Balbir Singh, Markus Elfring, David S. Miller, Nicolas Dichtel,
	Andrew Morton, Konstantin Khlebnikov, Jiri Slaby,
	Cyrill Gorcunov, Michal Hocko, Vlastimil Babka, Dave Hansen,
	Greg Kroah-Hartman, Dan Carpenter, Michael Kerrisk,
	Kirill A. Shutemov, Marcus Gelderie, Vladimir Davydov,
	Joe Perches, Frederic Weisbecker, Andrea Arcangeli,
	Eric W. Biederman, Andi Kleen, Oleg Nesterov, Stas Sergeev,
	Amanieu d'Antras, Richard Weinberger, Wang Xiaoqiang,
	Helge Deller, Mateusz Guzik, Alex Thorlton, Ben Segall,
	John Stultz, Rik van Riel, Eric B Munson, Alexey Klimov,
	Chen Gang, Andrey Ryabinin, David Rientjes, Hugh Dickins,
	Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

Hello,

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find out
useful values for the limits, except blind trial and error.

This patch series attempts to fix that by giving at least a nice starting
point from the highwater mark values of the resources in question.
I looked where each limit is checked and added a call to update the mark
nearby.
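
To illustrate the idea, here is a minimal userspace sketch of a highwater
mark. It is not the code from patch 01/14: struct highwater_marks and
highwater_update() are hypothetical names, and only the RLIMIT_* constants
and RLIM_NLIMITS come from <sys/resource.h>.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical per-task record: one mark per RLIMIT_* index. */
struct highwater_marks {
	unsigned long mark[RLIM_NLIMITS];
};

/* Raise the mark for `resource` if `value` exceeds it; never lower it. */
static void highwater_update(struct highwater_marks *hw, int resource,
			     unsigned long value)
{
	assert(resource >= 0 && resource < RLIM_NLIMITS);
	if (value > hw->mark[resource])
		hw->mark[resource] = value;
}
```

A call such as highwater_update(&hw, RLIMIT_NOFILE, fd) placed next to the
limit check is then enough to record the peak for that resource.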

Example run of program from Documentation/accounting/getdelays.c:

./getdelays -R -p `pidof smartd`
printing resource accounting
RLIMIT_CPU=0
RLIMIT_FSIZE=0
RLIMIT_DATA=18198528
RLIMIT_STACK=135168
RLIMIT_CORE=0
RLIMIT_RSS=0
RLIMIT_NPROC=1
RLIMIT_NOFILE=55
RLIMIT_MEMLOCK=0
RLIMIT_AS=130879488
RLIMIT_LOCKS=0
RLIMIT_SIGPENDING=0
RLIMIT_MSGQUEUE=0
RLIMIT_NICE=0
RLIMIT_RTPRIO=0
RLIMIT_RTTIME=0

./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
printing resource accounting
sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
RLIMIT_CPU=0
RLIMIT_FSIZE=0
RLIMIT_DATA=18198528
RLIMIT_STACK=135168
RLIMIT_CORE=0
RLIMIT_RSS=0
RLIMIT_NPROC=1
RLIMIT_NOFILE=55
RLIMIT_MEMLOCK=0
RLIMIT_AS=130879488
RLIMIT_LOCKS=0
RLIMIT_SIGPENDING=0
RLIMIT_MSGQUEUE=0
RLIMIT_NICE=0
RLIMIT_RTPRIO=0
RLIMIT_RTTIME=0

In this example, smartd is running as a non-root user. The presented
values can be used as a starting point for giving new limits to the
service.
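
One possible way to turn a reported mark into a limit, sketched under
assumptions: the headroom policy and the names limit_from_highwater() and
set_soft_limit() are hypothetical, while getrlimit() and setrlimit() are
the standard calls.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Hypothetical policy: observed peak plus headroom_pct percent.
 * The mark is split into quotient and remainder so that the
 * intermediate product stays small for large marks.
 */
static rlim_t limit_from_highwater(rlim_t mark, unsigned int headroom_pct)
{
	assert(mark != RLIM_INFINITY);	/* a recorded peak is always finite */
	return mark + mark / 100 * headroom_pct +
	       mark % 100 * headroom_pct / 100;
}

/*
 * Set the soft limit of `resource` to `value`, clamped to the current
 * hard limit. Returns 0 on success, -1 with errno set on failure.
 */
static int set_soft_limit(int resource, rlim_t value)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) != 0)
		return -1;
	if (rl.rlim_max != RLIM_INFINITY && value > rl.rlim_max)
		value = rl.rlim_max;
	rl.rlim_cur = value;
	return setrlimit(resource, &rl);
}
```

With the RLIMIT_NOFILE value above, limit_from_highwater(55, 50) gives 82.
How much headroom is appropriate depends on the service; a mark of 0 only
says the resource was not used during the observation period.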

There's one problem with patch 07/13: kernel initialization calls
create_worker(), which seems to use a different locking model or something:

[    0.145410] =========================================================
[    0.148000] [ INFO: possible irq lock inversion dependency detected ]
[    0.148000] 4.7.0-rc7+ #155 Not tainted
[    0.148000] ---------------------------------------------------------
[    0.148000] swapper/0/1 just changed the state of lock:
[    0.148000]  (&(&(&sig->stats_lock)->lock)->rlock){+.....}, at: [<ffffffff810bf769>] __sched_setscheduler+0x339/0xbd0
[    0.148000] but this lock was taken by another, HARDIRQ-safe lock in the past:
[    0.148000]  (&rq->lock){-.....}

and interrupts could create inverse lock ordering between them.

[    0.148000] 
[    0.148000] other info that might help us debug this:
[    0.148000]  Possible interrupt unsafe locking scenario:
[    0.148000] 
[    0.148000]        CPU0                    CPU1
[    0.148000]        ----                    ----
[    0.148000]   lock(&(&(&sig->stats_lock)->lock)->rlock);
[    0.148000]                                local_irq_disable();
[    0.148000]                                lock(&rq->lock);
[    0.148000]                                lock(&(&(&sig->stats_lock)->lock)->rlock);
[    0.148000]   <Interrupt>
[    0.148000]     lock(&rq->lock);
[    0.148000] 
[    0.148000]  *** DEADLOCK ***
[    0.148000] 
[    0.148000] 2 locks held by swapper/0/1:
[    0.148000]  #0:  (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff81092824>] get_online_cpus+0x24/0x70
[    0.148000]  #1:  (smpboot_threads_lock){+.+.+.}, at: [<ffffffff810ba517>] smpboot_register_percpu_thread_cpumask+0x37/0xf0
[    0.148000] 
[    0.148000] the shortest dependencies between 2nd lock and 1st lock:
[    0.148000]  -> (&rq->lock){-.....} ops: 181 {
[    0.148000]     IN-HARDIRQ-W at:
[    0.148000]                       [<ffffffff810e8439>] __lock_acquire+0x6e9/0x1440
[    0.148000]                       [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]                       [<ffffffff818cf661>] _raw_spin_lock+0x31/0x40
[    0.148000]                       [<ffffffff810c3a41>] scheduler_tick+0x41/0xd0
[    0.148000]                       [<ffffffff81110471>] update_process_times+0x51/0x60
[    0.148000]                       [<ffffffff8111fa4f>] tick_periodic+0x2f/0xc0
[    0.148000]                       [<ffffffff8111fb05>] tick_handle_periodic+0x25/0x70
[    0.148000]                       [<ffffffff8101ebf5>] timer_interrupt+0x15/0x20
[    0.148000]                       [<ffffffff810fc731>] handle_irq_event_percpu+0x41/0x320
[    0.148000]                       [<ffffffff810fca49>] handle_irq_event+0x39/0x60
[    0.148000]                       [<ffffffff810ffe08>] handle_level_irq+0x88/0x110
[    0.148000]                       [<ffffffff8101e58a>] handle_irq+0x1a/0x30
[    0.148000]                       [<ffffffff818d2281>] do_IRQ+0x61/0x120
[    0.148000]                       [<ffffffff818d0949>] ret_from_intr+0x0/0x19
[    0.148000]                       [<ffffffff810fe969>] __setup_irq+0x3f9/0x5e0
[    0.148000]                       [<ffffffff810feb96>] setup_irq+0x46/0xa0
[    0.148000]                       [<ffffffff821878e2>] setup_default_timer_irq+0x1e/0x20
[    0.148000]                       [<ffffffff821878fb>] hpet_time_init+0x17/0x19
[    0.148000]                       [<ffffffff821878bd>] x86_late_time_init+0xa/0x11
[    0.148000]                       [<ffffffff82181ef9>] start_kernel+0x39d/0x465
[    0.148000]                       [<ffffffff82181294>] x86_64_start_reservations+0x2f/0x31
[    0.148000]                       [<ffffffff8218140e>] x86_64_start_kernel+0x178/0x18b
[    0.148000]     INITIAL USE at:
[    0.148000]                      [<ffffffff810e7f90>] __lock_acquire+0x240/0x1440
[    0.148000]                      [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]                      [<ffffffff818cf82c>] _raw_spin_lock_irqsave+0x3c/0x50
[    0.148000]                      [<ffffffff810bdc9d>] rq_attach_root+0x1d/0x100
[    0.148000]                      [<ffffffff8219deab>] sched_init+0x2f5/0x44c
[    0.148000]                      [<ffffffff82181d9d>] start_kernel+0x241/0x465
[    0.148000]                      [<ffffffff82181294>] x86_64_start_reservations+0x2f/0x31
[    0.148000]                      [<ffffffff8218140e>] x86_64_start_kernel+0x178/0x18b
[    0.148000]   }
[    0.148000]   ... key      at: [<ffffffff822f3ad0>] __key.60059+0x0/0x8
[    0.148000]   ... acquired at:
[    0.148000]    [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]    [<ffffffff818cf661>] _raw_spin_lock+0x31/0x40
[    0.148000]    [<ffffffff810c0514>] set_user_nice.part.92+0xf4/0x270
[    0.148000]    [<ffffffff810c06b6>] set_user_nice+0x26/0x30
[    0.148000]    [<ffffffff810aee10>] create_worker+0xf0/0x1a0
[    0.148000]    [<ffffffff8219c195>] init_workqueues+0x317/0x51e
[    0.148000]    [<ffffffff81000450>] do_one_initcall+0x50/0x180
[    0.148000]    [<ffffffff821820d2>] kernel_init_freeable+0x111/0x25d
[    0.148000]    [<ffffffff818c206e>] kernel_init+0xe/0x100
[    0.148000]    [<ffffffff818d01ff>] ret_from_fork+0x1f/0x40
[    0.148000] 
[    0.148000] -> (&(&(&sig->stats_lock)->lock)->rlock){+.....} ops: 2 {
[    0.148000]    HARDIRQ-ON-W at:
[    0.148000]                     [<ffffffff810e82e0>] __lock_acquire+0x590/0x1440
[    0.148000]                     [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]                     [<ffffffff818cf661>] _raw_spin_lock+0x31/0x40
[    0.148000]                     [<ffffffff810bf769>] __sched_setscheduler+0x339/0xbd0
[    0.148000]                     [<ffffffff810c0076>] _sched_setscheduler+0x76/0x90
[    0.148000]                     [<ffffffff810c1012>] sched_set_stop_task+0x62/0xb0
[    0.148000]                     [<ffffffff81143983>] cpu_stop_create+0x23/0x30
[    0.148000]                     [<ffffffff810ba48d>] __smpboot_create_thread.part.2+0xad/0x100
[    0.148000]                     [<ffffffff810ba57f>] smpboot_register_percpu_thread_cpumask+0x9f/0xf0
[    0.148000]                     [<ffffffff821a1708>] cpu_stop_init+0x7d/0xb8
[    0.148000]                     [<ffffffff81000450>] do_one_initcall+0x50/0x180
[    0.148000]                     [<ffffffff821820d2>] kernel_init_freeable+0x111/0x25d
[    0.148000]                     [<ffffffff818c206e>] kernel_init+0xe/0x100
[    0.148000]                     [<ffffffff818d01ff>] ret_from_fork+0x1f/0x40
[    0.148000]    INITIAL USE at:
[    0.148000]                    [<ffffffff810e7f90>] __lock_acquire+0x240/0x1440
[    0.148000]                    [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]                    [<ffffffff818cf661>] _raw_spin_lock+0x31/0x40
[    0.148000]                    [<ffffffff810c0514>] set_user_nice.part.92+0xf4/0x270
[    0.148000]                    [<ffffffff810c06b6>] set_user_nice+0x26/0x30
[    0.148000]                    [<ffffffff810aee10>] create_worker+0xf0/0x1a0
[    0.148000]                    [<ffffffff8219c195>] init_workqueues+0x317/0x51e
[    0.148000]                    [<ffffffff81000450>] do_one_initcall+0x50/0x180
[    0.148000]                    [<ffffffff821820d2>] kernel_init_freeable+0x111/0x25d
[    0.148000]                    [<ffffffff818c206e>] kernel_init+0xe/0x100
[    0.148000]                    [<ffffffff818d01ff>] ret_from_fork+0x1f/0x40
[    0.148000]  }
[    0.148000]  ... key      at: [<ffffffff822f2190>] __key.55894+0x0/0x8
[    0.148000]  ... acquired at:
[    0.148000]    [<ffffffff810e6885>] check_usage_backwards+0x155/0x160
[    0.148000]    [<ffffffff810e7533>] mark_lock+0x333/0x610
[    0.148000]    [<ffffffff810e82e0>] __lock_acquire+0x590/0x1440
[    0.148000]    [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]    [<ffffffff818cf661>] _raw_spin_lock+0x31/0x40
[    0.148000]    [<ffffffff810bf769>] __sched_setscheduler+0x339/0xbd0
[    0.148000]    [<ffffffff810c0076>] _sched_setscheduler+0x76/0x90
[    0.148000]    [<ffffffff810c1012>] sched_set_stop_task+0x62/0xb0
[    0.148000]    [<ffffffff81143983>] cpu_stop_create+0x23/0x30
[    0.148000]    [<ffffffff810ba48d>] __smpboot_create_thread.part.2+0xad/0x100
[    0.148000]    [<ffffffff810ba57f>] smpboot_register_percpu_thread_cpumask+0x9f/0xf0
[    0.148000]    [<ffffffff821a1708>] cpu_stop_init+0x7d/0xb8
[    0.148000]    [<ffffffff81000450>] do_one_initcall+0x50/0x180
[    0.148000]    [<ffffffff821820d2>] kernel_init_freeable+0x111/0x25d
[    0.148000]    [<ffffffff818c206e>] kernel_init+0xe/0x100
[    0.148000]    [<ffffffff818d01ff>] ret_from_fork+0x1f/0x40
[    0.148000] 
[    0.148000] 
[    0.148000] stack backtrace:
[    0.148000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc7+ #155
[    0.148000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[    0.148000]  0000000000000086 00000000aea03eae ffff88003de6ba60 ffffffff813cb2d5
[    0.148000]  ffffffff82d48e60 ffff88003de6bac0 ffff88003de6baa0 ffffffff811a6b05
[    0.148000]  ffff88003de647d8 ffff88003de647d8 ffff88003de64040 ffffffff81d531a7
[    0.148000] Call Trace:
[    0.148000]  [<ffffffff813cb2d5>] dump_stack+0x67/0x92
[    0.148000]  [<ffffffff811a6b05>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
[    0.148000]  [<ffffffff810e6885>] check_usage_backwards+0x155/0x160
[    0.148000]  [<ffffffff810e7533>] mark_lock+0x333/0x610
[    0.148000]  [<ffffffff810e6730>] ? check_usage_forwards+0x160/0x160
[    0.148000]  [<ffffffff810e82e0>] __lock_acquire+0x590/0x1440
[    0.148000]  [<ffffffff810e7a6d>] ? trace_hardirqs_on+0xd/0x10
[    0.148000]  [<ffffffff81104aad>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[    0.148000]  [<ffffffff810e95d3>] lock_acquire+0xe3/0x1c0
[    0.148000]  [<ffffffff810bf769>] ? __sched_setscheduler+0x339/0xbd0
[    0.148000]  [<ffffffff818cf661>] _raw_spin_lock+0x31/0x40
[    0.148000]  [<ffffffff810bf769>] ? __sched_setscheduler+0x339/0xbd0
[    0.148000]  [<ffffffff810bf769>] __sched_setscheduler+0x339/0xbd0
[    0.148000]  [<ffffffff810c0076>] _sched_setscheduler+0x76/0x90
[    0.148000]  [<ffffffff810c1012>] sched_set_stop_task+0x62/0xb0
[    0.148000]  [<ffffffff81143983>] cpu_stop_create+0x23/0x30
[    0.148000]  [<ffffffff810ba48d>] __smpboot_create_thread.part.2+0xad/0x100
[    0.148000]  [<ffffffff810ba57f>] smpboot_register_percpu_thread_cpumask+0x9f/0xf0
[    0.148000]  [<ffffffff821a1708>] cpu_stop_init+0x7d/0xb8
[    0.148000]  [<ffffffff821a168b>] ? pid_namespaces_init+0x40/0x40
[    0.148000]  [<ffffffff81000450>] do_one_initcall+0x50/0x180
[    0.148000]  [<ffffffff8102c24d>] ? print_cpu_info+0x7d/0xe0
[    0.148000]  [<ffffffff821820d2>] kernel_init_freeable+0x111/0x25d
[    0.148000]  [<ffffffff818c206e>] kernel_init+0xe/0x100
[    0.148000]  [<ffffffff818d01ff>] ret_from_fork+0x1f/0x40
[    0.148000]  [<ffffffff818c2060>] ? rest_init+0x130/0x130
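
My reading of the report, expressed as a simplified userspace model: it is
not lockdep's implementation, and struct lock_class, the usage bits and
irq_inversion_possible() are hypothetical names. rq->lock is taken from the
timer interrupt, set_user_nice() takes stats_lock while holding rq->lock,
and __sched_setscheduler() takes stats_lock with interrupts enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Usage bits, loosely modelled on the lock states shown in the report. */
enum {
	USED_IN_HARDIRQ = 1 << 0,	/* lock is taken from hard-IRQ context */
	TAKEN_IRQS_ON   = 1 << 1,	/* lock is taken with hard IRQs enabled */
};

struct lock_class {
	const char *name;
	unsigned int usage;
};

/*
 * `inner` is acquired while `outer` is held. The pair can deadlock when
 * `outer` is also taken from a hard-IRQ handler and `inner` is ever held
 * with IRQs enabled: an interrupt arriving while `inner` is held can spin
 * on `outer`, whose holder on another CPU is waiting for `inner`.
 */
static bool irq_inversion_possible(const struct lock_class *outer,
				   const struct lock_class *inner)
{
	assert(outer && inner);
	return (outer->usage & USED_IN_HARDIRQ) &&
	       (inner->usage & TAKEN_IRQS_ON);
}
```

If this reading is right, taking stats_lock only with interrupts disabled
would avoid the warning, but I have not verified that.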

In this v2, I tried to address all review comments. Thanks for the reviews.

-Topi

Topi Miettinen (14):
  resource limits: foundation for resource highwater tracking
  resource limits: aggregate task highwater marks to cgroup level
  resource limits: track highwater mark of file sizes
  resource limits: track highwater mark of VM data segment
  resource limits: track highwater mark of stack size
  resource limits: track highwater mark of cores dumped
  resource limits: track highwater mark of user processes
  resource limits: track highwater mark of number of files
  resource limits: track highwater mark of locked memory
  resource limits: track highwater mark of address space size
  resource limits: track highwater mark of number of pending signals
  resource limits: track highwater mark of size of message queues
  resource limits: track highwater mark of niceness
  resource limits: track highwater mark of RT priority

 Documentation/accounting/getdelays.c       | 62 ++++++++++++++++++++++--
 arch/ia64/kernel/perfmon.c                 |  1 +
 arch/powerpc/kvm/book3s_64_vio.c           |  2 +
 arch/powerpc/mm/mmu_context_iommu.c        |  2 +
 arch/x86/ia32/ia32_aout.c                  |  2 +
 drivers/infiniband/core/umem.c             |  1 +
 drivers/infiniband/hw/hfi1/user_pages.c    |  2 +
 drivers/infiniband/hw/qib/qib_user_pages.c |  2 +
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +
 drivers/misc/mic/scif/scif_rma.c           |  1 +
 drivers/vfio/vfio_iommu_spapr_tce.c        |  2 +
 drivers/vfio/vfio_iommu_type1.c            |  5 ++
 fs/attr.c                                  |  2 +
 fs/binfmt_aout.c                           |  2 +
 fs/binfmt_flat.c                           |  2 +
 fs/coredump.c                              | 11 +++--
 fs/file.c                                  |  4 ++
 include/linux/cgroup-defs.h                |  5 ++
 include/linux/sched.h                      | 61 +++++++++++++++++++++++
 include/linux/tsacct_kern.h                |  3 ++
 include/uapi/linux/cgroupstats.h           |  3 ++
 include/uapi/linux/taskstats.h             | 10 +++-
 ipc/mqueue.c                               |  1 +
 kernel/bpf/syscall.c                       |  8 +++
 kernel/cgroup.c                            | 78 ++++++++++++++++++++++++++++++
 kernel/cred.c                              |  1 +
 kernel/events/core.c                       |  1 +
 kernel/fork.c                              |  2 +
 kernel/sched/core.c                        |  6 +++
 kernel/signal.c                            |  2 +
 kernel/sys.c                               |  5 ++
 kernel/taskstats.c                         |  4 ++
 kernel/tsacct.c                            | 47 ++++++++++++++++++
 mm/mlock.c                                 |  8 +++
 mm/mmap.c                                  | 17 ++++++-
 mm/mremap.c                                |  7 +++
 36 files changed, 365 insertions(+), 9 deletions(-)

-- 
2.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 03/14] resource limits: track highwater mark of file sizes
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
@ 2016-07-15 10:35 ` Topi Miettinen
  2016-07-15 10:35 ` [PATCH 04/14] resource limits: track highwater mark of VM data segment Topi Miettinen
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 10:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro,
	open list:FILESYSTEMS (VFS and infrastructure)

Track maximum size of files created, to be able to configure
RLIMIT_FSIZE resource limits. The information is available
via the taskstats and cgroupstats netlink interfaces.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/attr.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0..546f4f9 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -116,6 +116,8 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 			return -ETXTBSY;
 	}
 
+	update_resource_highwatermark(RLIMIT_FSIZE, offset);
+
 	return 0;
 out_sig:
 	send_sig(SIGXFSZ, current, 0);
-- 
2.8.1



* [PATCH 04/14] resource limits: track highwater mark of VM data segment
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
  2016-07-15 10:35 ` [PATCH 03/14] resource limits: track highwater mark of file sizes Topi Miettinen
@ 2016-07-15 10:35 ` Topi Miettinen
  2016-07-15 10:35 ` [PATCH 06/14] resource limits: track highwater mark of cores dumped Topi Miettinen
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 10:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Alexander Viro, Andrew Morton, Michal Hocko, Vlastimil Babka,
	Ben Segall, Alex Thorlton, Mateusz Guzik, John Stultz,
	Kirill A. Shutemov, Oleg Nesterov, Chen Gang,
	Konstantin Khlebnikov, Andrea Arcangeli, Andrey Ryabinin,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:MEMORY MANAGEMENT

Track maximum size of data VM, to be able to configure
RLIMIT_DATA resource limits. The information is available
via the taskstats and cgroupstats netlink interfaces.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 arch/x86/ia32/ia32_aout.c | 2 ++
 fs/binfmt_aout.c          | 2 ++
 fs/binfmt_flat.c          | 2 ++
 kernel/sys.c              | 3 +++
 mm/mmap.c                 | 7 ++++++-
 5 files changed, 15 insertions(+), 1 deletion(-)
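
For reference, the value recorded in the brk hunk below is the size of the
brk area plus the size of the initialized data segment. A minimal sketch of
that arithmetic, where struct mm_layout and data_usage_at_brk() are
hypothetical stand-ins for the relevant mm_struct fields:

```c
#include <assert.h>

/* Hypothetical subset of the mm_struct fields used by the brk hunk. */
struct mm_layout {
	unsigned long start_data;
	unsigned long end_data;
	unsigned long start_brk;
};

/* Bytes counted for RLIMIT_DATA when the program break is at `brk`. */
static unsigned long data_usage_at_brk(const struct mm_layout *mm,
				       unsigned long brk)
{
	assert(brk >= mm->start_brk);
	return (brk - mm->start_brk) + (mm->end_data - mm->start_data);
}
```

The vm_stat_account() hunk reports data_vm << PAGE_SHIFT instead, so the
two call sites measure slightly different things and the mark keeps
whichever value is larger.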

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index cb26f18..9236254 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -26,6 +26,7 @@
 #include <linux/init.h>
 #include <linux/jiffies.h>
 #include <linux/perf_event.h>
+#include <linux/sched.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgalloc.h>
@@ -398,6 +399,7 @@ beyond_if:
 	regs->r8 = regs->r9 = regs->r10 = regs->r11 =
 	regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0;
 	set_fs(USER_DS);
+	update_resource_highwatermark(RLIMIT_DATA, ex.a_data + ex.a_bss);
 	return 0;
 }
 
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index ae1b540..49216f4 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -25,6 +25,7 @@
 #include <linux/init.h>
 #include <linux/coredump.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -330,6 +331,7 @@ beyond_if:
 	regs->gp = ex.a_gpvalue;
 #endif
 	start_thread(regs, ex.a_entry, current->mm->start_stack);
+	update_resource_highwatermark(RLIMIT_DATA, ex.a_data + ex.a_bss);
 	return 0;
 }
 
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index caf9e39..19c2212 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -35,6 +35,7 @@
 #include <linux/init.h>
 #include <linux/flat.h>
 #include <linux/syscalls.h>
+#include <linux/sched.h>
 
 #include <asm/byteorder.h>
 #include <asm/uaccess.h>
@@ -792,6 +793,7 @@ static int load_flat_file(struct linux_binprm * bprm,
 			libinfo->lib_list[id].start_brk) +	/* start brk */
 			stack_len);
 
+	update_resource_highwatermark(RLIMIT_DATA, data_len + bss_len);
 	return 0;
 err:
 	return ret;
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..d84c87e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1896,6 +1896,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	if (prctl_map.auxv_size)
 		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
 
+	update_resource_highwatermark(RLIMIT_DATA, mm->end_data -
+				      mm->start_data);
+
 	up_write(&mm->mmap_sem);
 	return 0;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index de2c176..0b10f56 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -228,6 +228,8 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 		goto out;
 
 set_brk:
+	update_resource_highwatermark(RLIMIT_DATA, (brk - mm->start_brk) +
+				      (mm->end_data - mm->start_data));
 	mm->brk = brk;
 	populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;
 	up_write(&mm->mmap_sem);
@@ -2924,8 +2926,11 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 		mm->exec_vm += npages;
 	else if (is_stack_mapping(flags))
 		mm->stack_vm += npages;
-	else if (is_data_mapping(flags))
+	else if (is_data_mapping(flags)) {
 		mm->data_vm += npages;
+		update_resource_highwatermark(RLIMIT_DATA,
+					      mm->data_vm << PAGE_SHIFT);
+	}
 }
 
 static int special_mapping_fault(struct vm_area_struct *vma,
-- 
2.8.1



* [PATCH 06/14] resource limits: track highwater mark of cores dumped
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
  2016-07-15 10:35 ` [PATCH 03/14] resource limits: track highwater mark of file sizes Topi Miettinen
  2016-07-15 10:35 ` [PATCH 04/14] resource limits: track highwater mark of VM data segment Topi Miettinen
@ 2016-07-15 10:35 ` Topi Miettinen
  2016-07-15 10:35 ` [PATCH 08/14] resource limits: track highwater mark of number of files Topi Miettinen
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 10:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro,
	open list:FILESYSTEMS (VFS and infrastructure)

Track maximum size of core dump written, to be able to configure
RLIMIT_CORE resource limits. The information is available
via the taskstats and cgroupstats netlink interfaces.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/coredump.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
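
The restructuring below funnels the error paths and the success path
through one exit, so that a partially written dump is still recorded. A
userspace sketch of the same control flow, with hypothetical names
(struct emit_state, emit(), and the budget_write() stub standing in for
__kernel_write()):

```c
#include <assert.h>
#include <stddef.h>

struct emit_state {
	size_t written;		/* bytes emitted so far */
	size_t limit;		/* size limit for the whole dump */
	size_t highwater;	/* largest `written` seen */
};

/* Writer callback: returns bytes written, or <= 0 on failure. */
typedef long (*write_fn)(const void *buf, size_t len, void *ctx);

/* Example stub: accepts at most `left` bytes in total, then fails. */
struct budget {
	size_t left;
};

static long budget_write(const void *buf, size_t len, void *ctx)
{
	struct budget *b = ctx;
	size_t n = len < b->left ? len : b->left;

	(void)buf;
	b->left -= n;
	return n ? (long)n : -1;
}

/* Returns 1 if all `nr` bytes were written, 0 otherwise. */
static int emit(struct emit_state *st, const char *buf, size_t nr,
		write_fn wr, void *ctx)
{
	int ok = 0;

	assert(st && wr);
	if (st->written + nr > st->limit)
		return 0;	/* nothing written, mark unchanged */
	while (nr) {
		long n = wr(buf, nr, ctx);

		if (n <= 0)
			goto out;
		st->written += (size_t)n;
		buf += n;
		nr -= (size_t)n;
	}
	ok = 1;
out:
	/* Single exit: record progress on success and on failure alike. */
	if (st->written > st->highwater)
		st->highwater = st->written;
	return ok;
}
```

With a writer that accepts only 5 of 10 bytes, emit() returns 0 but the
mark still ends up at 5.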

diff --git a/fs/coredump.c b/fs/coredump.c
index 281b768..a0ace88 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -784,20 +784,25 @@ int dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	struct file *file = cprm->file;
 	loff_t pos = file->f_pos;
 	ssize_t n;
+	int r = 0;
+
 	if (cprm->written + nr > cprm->limit)
 		return 0;
 	while (nr) {
 		if (dump_interrupted())
-			return 0;
+			goto err;
 		n = __kernel_write(file, addr, nr, &pos);
 		if (n <= 0)
-			return 0;
+			goto err;
 		file->f_pos = pos;
 		cprm->written += n;
 		cprm->pos += n;
 		nr -= n;
 	}
-	return 1;
+	r = 1;
+ err:
+	update_resource_highwatermark(RLIMIT_CORE, cprm->written);
+	return r;
 }
 EXPORT_SYMBOL(dump_emit);
 
-- 
2.8.1



* [PATCH 08/14] resource limits: track highwater mark of number of files
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
                   ` (2 preceding siblings ...)
  2016-07-15 10:35 ` [PATCH 06/14] resource limits: track highwater mark of cores dumped Topi Miettinen
@ 2016-07-15 10:35 ` Topi Miettinen
  2016-07-15 12:43 ` [PATCH 00/14] Present useful limits to user (v2) Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 10:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Topi Miettinen, Alexander Viro,
	open list:FILESYSTEMS (VFS and infrastructure)

Track maximum number of files for the process, to be able to configure
RLIMIT_NOFILE resource limits. The information is available
via the taskstats and cgroupstats netlink interfaces.

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 fs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 6b1acdf..9de37c9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -547,6 +547,8 @@ repeat:
 	}
 #endif
 
+	update_resource_highwatermark(RLIMIT_NOFILE, fd);
+
 out:
 	spin_unlock(&files->file_lock);
 	return error;
@@ -857,6 +859,8 @@ __releases(&files->file_lock)
 	if (tofree)
 		filp_close(tofree, files);
 
+	update_resource_highwatermark(RLIMIT_NOFILE, fd);
+
 	return fd;
 
 Ebusy:
-- 
2.8.1


* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
                   ` (3 preceding siblings ...)
  2016-07-15 10:35 ` [PATCH 08/14] resource limits: track highwater mark of number of files Topi Miettinen
@ 2016-07-15 12:43 ` Peter Zijlstra
  2016-07-15 13:52   ` Topi Miettinen
  2016-07-15 13:04 ` Balbir Singh
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-07-15 12:43 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Balbir Singh,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
> Hello,
> 
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find out
> useful values for the limits, except blind trial and error.
> 
> This patch series attempts to fix that by giving at least a nice starting
> point from the highwater mark values of the resources in question.
> I looked where each limit is checked and added a call to update the mark
> nearby.

And how is that useful? Setting things to the high watermark is
basically the same as not setting the limit at all.




* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
                   ` (4 preceding siblings ...)
  2016-07-15 12:43 ` [PATCH 00/14] Present useful limits to user (v2) Peter Zijlstra
@ 2016-07-15 13:04 ` Balbir Singh
  2016-07-15 16:35   ` Topi Miettinen
  2016-07-15 14:19 ` Richard Weinberger
  2016-08-03 18:20 ` Topi Miettinen
  7 siblings, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2016-07-15 13:04 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Balbir Singh, Markus Elfring, David S. Miller, Nicolas Dichtel,
	Andrew Morton, Konstantin Khlebnikov, Jiri Slaby,
	Cyrill Gorcunov, Michal Hocko, Vlastimil Babka, Dave Hansen,
	Greg Kroah-Hartman, Dan Carpenter, Michael Kerrisk,
	Kirill A. Shutemov, Marcus Gelderie, Vladimir Davydov,
	Joe Perches, Frederic Weisbecker, Andrea Arcangeli,
	Eric W. Biederman, Andi Kleen, Oleg Nesterov, Stas Sergeev,
	Amanieu d'Antras, Richard Weinberger, Wang Xiaoqiang,
	Helge Deller, Mateusz Guzik, Alex Thorlton, Ben Segall,
	John Stultz, Rik van Riel, Eric B Munson, Alexey Klimov,
	Chen Gang, Andrey Ryabinin, David Rientjes, Hugh Dickins,
	Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
> Hello,
> 
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find out
> useful values for the limits, except blind trial and error.
> 
> This patch series attempts to fix that by giving at least a nice starting
> point from the highwater mark values of the resources in question.
> I looked where each limit is checked and added a call to update the mark
> nearby.
> 
> Example run of program from Documentation/accounting/getdelays.c:
> 
> ./getdelays -R -p `pidof smartd`
> printing resource accounting
> RLIMIT_CPU=0
> RLIMIT_FSIZE=0
> RLIMIT_DATA=18198528
> RLIMIT_STACK=135168
> RLIMIT_CORE=0
> RLIMIT_RSS=0
> RLIMIT_NPROC=1
> RLIMIT_NOFILE=55
> RLIMIT_MEMLOCK=0
> RLIMIT_AS=130879488
> RLIMIT_LOCKS=0
> RLIMIT_SIGPENDING=0
> RLIMIT_MSGQUEUE=0
> RLIMIT_NICE=0
> RLIMIT_RTPRIO=0
> RLIMIT_RTTIME=0
> 
> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
> printing resource accounting
> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
> RLIMIT_CPU=0
> RLIMIT_FSIZE=0
> RLIMIT_DATA=18198528
> RLIMIT_STACK=135168
> RLIMIT_CORE=0
> RLIMIT_RSS=0
> RLIMIT_NPROC=1
> RLIMIT_NOFILE=55
> RLIMIT_MEMLOCK=0
> RLIMIT_AS=130879488
> RLIMIT_LOCKS=0
> RLIMIT_SIGPENDING=0
> RLIMIT_MSGQUEUE=0
> RLIMIT_NICE=0
> RLIMIT_RTPRIO=0
> RLIMIT_RTTIME=0

Does this mean that rlimit_data and rlimit_stack should be set to the
values as specified by the data above?

Do we expect a smart user space daemon to then tweak the RLIMIT values?

Balbir Singh.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 12:43 ` [PATCH 00/14] Present useful limits to user (v2) Peter Zijlstra
@ 2016-07-15 13:52   ` Topi Miettinen
  2016-07-15 13:59     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Balbir Singh,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On 07/15/16 12:43, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>> Hello,
>>
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the highwater mark values of the resources in question.
>> I looked where each limit is checked and added a call to update the mark
>> nearby.
> 
> And how is that useful? Setting things to the high watermark is
> basically the same as not setting the limit at all.

What else would you use, too small limits?

-Topi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 13:52   ` Topi Miettinen
@ 2016-07-15 13:59     ` Peter Zijlstra
  2016-07-15 16:57       ` Topi Miettinen
  2016-07-15 20:54       ` H. Peter Anvin
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-07-15 13:59 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Balbir Singh,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On Fri, Jul 15, 2016 at 01:52:48PM +0000, Topi Miettinen wrote:
> On 07/15/16 12:43, Peter Zijlstra wrote:
> > On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
> >> Hello,
> >>
> >> There are many basic ways to control processes, including capabilities,
> >> cgroups and resource limits. However, there are far fewer ways to find out
> >> useful values for the limits, except blind trial and error.
> >>
> >> This patch series attempts to fix that by giving at least a nice starting
> >> point from the highwater mark values of the resources in question.
> >> I looked where each limit is checked and added a call to update the mark
> >> nearby.
> > 
> > And how is that useful? Setting things to the high watermark is
> > basically the same as not setting the limit at all.
> 
> What else would you use, too small limits?

That question doesn't make sense.

What's the point of setting a limit if it ends up being the same as
no limit (i.e. unlimited)?

If you cannot explain what use these values are, and so far you have
not, why would we look at the patches?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
                   ` (5 preceding siblings ...)
  2016-07-15 13:04 ` Balbir Singh
@ 2016-07-15 14:19 ` Richard Weinberger
  2016-07-15 17:19   ` Topi Miettinen
  2016-07-18 21:25   ` Doug Ledford
  2016-08-03 18:20 ` Topi Miettinen
  7 siblings, 2 replies; 18+ messages in thread
From: Richard Weinberger @ 2016-07-15 14:19 UTC (permalink / raw)
  To: Topi Miettinen, linux-kernel
  Cc: Jonathan Corbet, Tony Luck, Fenghua Yu, Alexander Graf,
	Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Balbir Singh, Markus Elfring, David S. Miller, Nicolas Dichtel,
	Andrew Morton, Konstantin Khlebnikov, Jiri Slaby,
	Cyrill Gorcunov, Michal Hocko, Vlastimil Babka, Dave Hansen,
	Greg Kroah-Hartman, Dan Carpenter, Michael Kerrisk,
	Kirill A. Shutemov, Marcus Gelderie, Vladimir Davydov,
	Joe Perches, Frederic Weisbecker, Andrea Arcangeli,
	Eric W. Biederman, Andi Kleen, Oleg Nesterov, Stas Sergeev,
	Amanieu d'Antras, Wang Xiaoqiang, Helge Deller,
	Mateusz Guzik, Alex Thorlton, Ben Segall, John Stultz,
	Rik van Riel, Eric B Munson, Alexey Klimov, Chen Gang,
	Andrey Ryabinin, David Rientjes, Hugh Dickins,
	Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

Hi!

Am 15.07.2016 um 12:35 schrieb Topi Miettinen:
> Hello,
> 
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find out
> useful values for the limits, except blind trial and error.
> 
> This patch series attempts to fix that by giving at least a nice starting
> point from the highwater mark values of the resources in question.
> I looked where each limit is checked and added a call to update the mark
> nearby.
> 
> Example run of program from Documentation/accounting/getdelays.c:
> 
> ./getdelays -R -p `pidof smartd`
> printing resource accounting
> RLIMIT_CPU=0
> RLIMIT_FSIZE=0
> RLIMIT_DATA=18198528
> RLIMIT_STACK=135168
> RLIMIT_CORE=0
> RLIMIT_RSS=0
> RLIMIT_NPROC=1
> RLIMIT_NOFILE=55
> RLIMIT_MEMLOCK=0
> RLIMIT_AS=130879488
> RLIMIT_LOCKS=0
> RLIMIT_SIGPENDING=0
> RLIMIT_MSGQUEUE=0
> RLIMIT_NICE=0
> RLIMIT_RTPRIO=0
> RLIMIT_RTTIME=0
> 
> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
> printing resource accounting
> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
> RLIMIT_CPU=0
> RLIMIT_FSIZE=0
> RLIMIT_DATA=18198528
> RLIMIT_STACK=135168
> RLIMIT_CORE=0
> RLIMIT_RSS=0
> RLIMIT_NPROC=1
> RLIMIT_NOFILE=55
> RLIMIT_MEMLOCK=0
> RLIMIT_AS=130879488
> RLIMIT_LOCKS=0
> RLIMIT_SIGPENDING=0
> RLIMIT_MSGQUEUE=0
> RLIMIT_NICE=0
> RLIMIT_RTPRIO=0
> RLIMIT_RTTIME=0
> 
> In this example, smartd is running as a non-root user. The presented
> values can be used as a starting point for giving new limits to the
> service.

I don't think it is worth sprinkling the kernel with update_resource_highwatermark()
calls just to get these metrics.

Can't we teach the existing perf infrastructure to collect these highwatermarks for us?

Thanks,
//richard

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 13:04 ` Balbir Singh
@ 2016-07-15 16:35   ` Topi Miettinen
  2016-07-18 22:05     ` Doug Ledford
  0 siblings, 1 reply; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 16:35 UTC (permalink / raw)
  To: bsingharora
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On 07/15/16 13:04, Balbir Singh wrote:
> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>> Hello,
>>
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the highwater mark values of the resources in question.
>> I looked where each limit is checked and added a call to update the mark
>> nearby.
>>
>> Example run of program from Documentation/accounting/getdelays.c:
>>
>> ./getdelays -R -p `pidof smartd`
>> printing resource accounting
>> RLIMIT_CPU=0
>> RLIMIT_FSIZE=0
>> RLIMIT_DATA=18198528
>> RLIMIT_STACK=135168
>> RLIMIT_CORE=0
>> RLIMIT_RSS=0
>> RLIMIT_NPROC=1
>> RLIMIT_NOFILE=55
>> RLIMIT_MEMLOCK=0
>> RLIMIT_AS=130879488
>> RLIMIT_LOCKS=0
>> RLIMIT_SIGPENDING=0
>> RLIMIT_MSGQUEUE=0
>> RLIMIT_NICE=0
>> RLIMIT_RTPRIO=0
>> RLIMIT_RTTIME=0
>>
>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>> printing resource accounting
>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>> RLIMIT_CPU=0
>> RLIMIT_FSIZE=0
>> RLIMIT_DATA=18198528
>> RLIMIT_STACK=135168
>> RLIMIT_CORE=0
>> RLIMIT_RSS=0
>> RLIMIT_NPROC=1
>> RLIMIT_NOFILE=55
>> RLIMIT_MEMLOCK=0
>> RLIMIT_AS=130879488
>> RLIMIT_LOCKS=0
>> RLIMIT_SIGPENDING=0
>> RLIMIT_MSGQUEUE=0
>> RLIMIT_NICE=0
>> RLIMIT_RTPRIO=0
>> RLIMIT_RTTIME=0
> 
> Does this mean that rlimit_data and rlimit_stack should be set to the
> values as specified by the data above?

My plan is that a system administrator, a distro maintainer or even an
upstream developer can get reasonable values for the limits. The values
may still be wrong, but that is better than having no help at all in
configuring the system.
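
As a rough sketch of how such a starting value might be turned into a
limit (this is not part of the patch series): the helper names
limit_with_headroom() and set_soft_limit() and the idea of padding the
observed value by a percentage are assumptions made for illustration
only. The sketch relies on RLIM_INFINITY being the largest rlim_t
value, as it is on Linux.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: pad an observed high-water mark by `percent`.
 * Returns RLIM_INFINITY (no limit) if the padded value would overflow. */
static rlim_t limit_with_headroom(rlim_t highwater, unsigned int percent)
{
	rlim_t extra;

	/* An observed high-water mark is always a finite number. */
	assert(highwater != RLIM_INFINITY);

	if (percent != 0 && highwater > RLIM_INFINITY / percent)
		return RLIM_INFINITY;
	extra = highwater * percent / 100;
	if (extra > RLIM_INFINITY - highwater)
		return RLIM_INFINITY;
	return highwater + extra;
}

/* Hypothetical helper: set the soft limit of `resource` for the calling
 * process, clamped to the current hard limit. Returns 0 on success. */
static int set_soft_limit(int resource, rlim_t soft)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) != 0)
		return -1;
	if (soft > rl.rlim_max)
		soft = rl.rlim_max;	/* cannot exceed the hard limit */
	rl.rlim_cur = soft;
	return setrlimit(resource, &rl);
}
```

A launcher could call set_soft_limit(RLIMIT_NOFILE,
limit_with_headroom(55, 20)) before exec'ing the service, which turns
the RLIMIT_NOFILE=55 reading from the example into a soft limit of 66.
The 20% figure is arbitrary.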

> 
> Do we expect a smart user space daemon to then tweak the RLIMIT values?

Someone could write an autotuning daemon that checks whether the system
has changed (for example due to an upgrade) and then runs some tests to
reconfigure the system. But the limits are a bit too fragile, or rather,
applications can't handle failure, so I don't know if that would really
work.

-Topi


> 
> Balbir Singh.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 13:59     ` Peter Zijlstra
@ 2016-07-15 16:57       ` Topi Miettinen
  2016-07-15 20:54       ` H. Peter Anvin
  1 sibling, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Balbir Singh,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On 07/15/16 13:59, Peter Zijlstra wrote:
> On Fri, Jul 15, 2016 at 01:52:48PM +0000, Topi Miettinen wrote:
>> On 07/15/16 12:43, Peter Zijlstra wrote:
>>> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>>>> Hello,
>>>>
>>>> There are many basic ways to control processes, including capabilities,
>>>> cgroups and resource limits. However, there are far fewer ways to find out
>>>> useful values for the limits, except blind trial and error.
>>>>
>>>> This patch series attempts to fix that by giving at least a nice starting
>>>> point from the highwater mark values of the resources in question.
>>>> I looked where each limit is checked and added a call to update the mark
>>>> nearby.
>>>
>>> And how is that useful? Setting things to the high watermark is
>>> basically the same as not setting the limit at all.
>>
>> What else would you use, too small limits?
> 
> That question doesn't make sense.
> 
> What's the point of setting a limit if it ends up being the same as
> no-limit (aka unlimited).

Having a limit is not the same as having no limits at all. You are
right, in a way, that good limits don't affect the program in normal
operation. But they can make a difference when the flow is not normal.
For example, a successful exploit or a memory leak bug could cause
RLIMIT_AS to trigger.
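
To make that failure mode concrete, here is a minimal sketch (not part
of the patch series). The helper names cap_address_space() and
mapping_refused() are made up for this example and the sizes are
arbitrary. Once the soft RLIMIT_AS is lowered, a mapping that would
push the address space past it is refused instead of growing without
bound.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: lower the soft RLIMIT_AS of the calling process
 * to `bytes`, keeping the hard limit. Returns 0 on success, -1 on error
 * (for instance when `bytes` is above the hard limit). */
static int cap_address_space(rlim_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_AS, &rl) != 0)
		return -1;
	rl.rlim_cur = bytes;
	return setrlimit(RLIMIT_AS, &rl);
}

/* Hypothetical helper: stand-in for runaway growth. Tries to map
 * `bytes` of anonymous memory; returns 1 if refused, 0 otherwise. */
static int mapping_refused(size_t bytes)
{
	void *p;

	/* A zero-length mmap() fails for an unrelated reason (EINVAL). */
	assert(bytes > 0);

	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	munmap(p, bytes);
	return 0;
}
```

With the soft limit capped at 1 GiB, a 2 GiB request should be refused
while a 1 MiB request still goes through, so a leak hits a wall long
before it can exhaust the machine.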

> 
> If you cannot explain; and you have not so far; what use these values
> are, why would we look at the patches.
> 

The use case is to allow system administrators, distro maintainers and
developers to configure systems to use the resource limits. The limits
are not very useful right now, as there is no way to figure out what
values to use. There are a few /proc files to look at; for example, the
current number of file descriptors (for RLIMIT_NOFILE) can be counted
via /proc/pid/fd. But there is currently no way to know whether more
were in use at some earlier point. Likewise, a program can use more
address space when you are not looking. The source code does not tell
these things explicitly.
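
The sampling problem can be sketched as follows (an illustration only,
not code from the series; the function names are hypothetical). It
counts the calling process's descriptors through /proc/self/fd, opens
one more for a moment, and shows that a later sample carries no trace
of the short-lived peak.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for dirfd() on older C libraries */
#endif
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Count the descriptors the calling process has open right now by
 * listing /proc/self/fd. Returns -1 if the directory cannot be read. */
static int count_own_open_fds(void)
{
	DIR *dir = opendir("/proc/self/fd");
	struct dirent *ent;
	int count = 0;

	if (dir == NULL)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, ".") == 0 ||
		    strcmp(ent->d_name, "..") == 0)
			continue;
		if (atoi(ent->d_name) == dirfd(dir))
			continue;	/* descriptor held by opendir() itself */
		count++;
	}
	closedir(dir);
	return count;
}

/* Open one extra descriptor, sample, close it, then sample again. */
static void demo_transient_peak(void)
{
	int before = count_own_open_fds();
	int fd = open("/dev/null", O_RDONLY);
	int during, after;

	assert(before >= 0 && fd >= 0);
	during = count_own_open_fds();
	close(fd);
	after = count_own_open_fds();

	assert(during == before + 1);	/* visible only while it lasts */
	assert(after == before);	/* the peak has left no trace */
}
```

An observer polling /proc/pid/fd from outside sees only whatever count
happens to hold at the moment it looks, which is why a mark maintained
at the point where the limit is checked gives a more reliable answer.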

-Topi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 14:19 ` Richard Weinberger
@ 2016-07-15 17:19   ` Topi Miettinen
  2016-07-18 21:25   ` Doug Ledford
  1 sibling, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-15 17:19 UTC (permalink / raw)
  To: Richard Weinberger, linux-kernel
  Cc: Jonathan Corbet, Tony Luck, Fenghua Yu, Alexander Graf,
	Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Balbir Singh, Markus Elfring, David S. Miller, Nicolas Dichtel,
	Andrew Morton, Konstantin Khlebnikov, Jiri Slaby,
	Cyrill Gorcunov, Michal Hocko, Vlastimil Babka, Dave Hansen,
	Greg Kroah-Hartman, Dan Carpenter, Michael Kerrisk,
	Kirill A. Shutemov, Marcus Gelderie, Vladimir Davydov,
	Joe Perches, Frederic Weisbecker, Andrea Arcangeli,
	Eric W. Biederman, Andi Kleen, Oleg Nesterov, Stas Sergeev,
	Amanieu d'Antras, Wang Xiaoqiang, Helge Deller,
	Mateusz Guzik, Alex Thorlton, Ben Segall, John Stultz,
	Rik van Riel, Eric B Munson, Alexey Klimov, Chen Gang,
	Andrey Ryabinin, David Rientjes, Hugh Dickins,
	Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On 07/15/16 14:19, Richard Weinberger wrote:
> Hi!
> 
> Am 15.07.2016 um 12:35 schrieb Topi Miettinen:
>> Hello,
>>
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the highwater mark values of the resources in question.
>> I looked where each limit is checked and added a call to update the mark
>> nearby.
>>
>> Example run of program from Documentation/accounting/getdelays.c:
>>
>> ./getdelays -R -p `pidof smartd`
>> printing resource accounting
>> RLIMIT_CPU=0
>> RLIMIT_FSIZE=0
>> RLIMIT_DATA=18198528
>> RLIMIT_STACK=135168
>> RLIMIT_CORE=0
>> RLIMIT_RSS=0
>> RLIMIT_NPROC=1
>> RLIMIT_NOFILE=55
>> RLIMIT_MEMLOCK=0
>> RLIMIT_AS=130879488
>> RLIMIT_LOCKS=0
>> RLIMIT_SIGPENDING=0
>> RLIMIT_MSGQUEUE=0
>> RLIMIT_NICE=0
>> RLIMIT_RTPRIO=0
>> RLIMIT_RTTIME=0
>>
>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>> printing resource accounting
>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>> RLIMIT_CPU=0
>> RLIMIT_FSIZE=0
>> RLIMIT_DATA=18198528
>> RLIMIT_STACK=135168
>> RLIMIT_CORE=0
>> RLIMIT_RSS=0
>> RLIMIT_NPROC=1
>> RLIMIT_NOFILE=55
>> RLIMIT_MEMLOCK=0
>> RLIMIT_AS=130879488
>> RLIMIT_LOCKS=0
>> RLIMIT_SIGPENDING=0
>> RLIMIT_MSGQUEUE=0
>> RLIMIT_NICE=0
>> RLIMIT_RTPRIO=0
>> RLIMIT_RTTIME=0
>>
>> In this example, smartd is running as a non-root user. The presented
>> values can be used as a starting point for giving new limits to the
>> service.
> 
> I don't think it is worth sprinkling the kernel with update_resource_highwatermark()
> calls just to get these metrics.
> 
> Can't we teach the existing perf infrastructure to collect these highwatermarks for us?

I don't know. What kind of changes do you think would be needed?

-Topi

> 
> Thanks,
> //richard
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 13:59     ` Peter Zijlstra
  2016-07-15 16:57       ` Topi Miettinen
@ 2016-07-15 20:54       ` H. Peter Anvin
  1 sibling, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2016-07-15 20:54 UTC (permalink / raw)
  To: Peter Zijlstra, Topi Miettinen
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex

	Li Zefan, Johannes Weiner, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Balbir Singh,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT
Message-ID: <D79806FE-E6B9-481B-8AA2-A1800419D9B5@zytor.com>

On July 15, 2016 6:59:56 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Fri, Jul 15, 2016 at 01:52:48PM +0000, Topi Miettinen wrote:
>> On 07/15/16 12:43, Peter Zijlstra wrote:
>> > On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>> >> Hello,
>> >>
>> >> There are many basic ways to control processes, including capabilities,
>> >> cgroups and resource limits. However, there are far fewer ways to find out
>> >> useful values for the limits, except blind trial and error.
>> >>
>> >> This patch series attempts to fix that by giving at least a nice starting
>> >> point from the highwater mark values of the resources in question.
>> >> I looked where each limit is checked and added a call to update the mark
>> >> nearby.
>> > 
>> > And how is that useful? Setting things to the high watermark is
>> > basically the same as not setting the limit at all.
>> 
>> What else would you use, too small limits?
>
>That question doesn't make sense.
>
>What's the point of setting a limit if it ends up being the same as
>no-limit (aka unlimited).
>
>If you cannot explain; and you have not so far; what use these values
>are, why would we look at the patches.

One reason is to catch a malfunctioning process before it drags the
whole system down with it.  It could also be useful for development.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 14:19 ` Richard Weinberger
  2016-07-15 17:19   ` Topi Miettinen
@ 2016-07-18 21:25   ` Doug Ledford
  1 sibling, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2016-07-18 21:25 UTC (permalink / raw)
  To: Richard Weinberger, Topi Miettinen, linux-kernel
  Cc: Jonathan Corbet, Tony Luck, Fenghua Yu, Alexander Graf,
	Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Hefty, Hal Rosenstock, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti, Dave Goodell, Sudeep Dutt, Ashutosh Dixit,
	Alex Williamson, Alexander Viro, Tejun Heo, Li Zefan,
	Johannes Weiner, Peter Zijlstra, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Balbir Singh,
	Markus Elfring, David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Wang Xiaoqiang, Helge Deller,
	Mateusz Guzik, Alex Thorlton, Ben Segall, John Stultz,
	Rik van Riel, Eric B Munson, Alexey Klimov, Chen Gang,
	Andrey Ryabinin, David Rientjes, Hugh Dickins,
	Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT


On 7/15/2016 10:19 AM, Richard Weinberger wrote:
> Hi!
> 
> Am 15.07.2016 um 12:35 schrieb Topi Miettinen:
>> Hello,
>>
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find out
>> useful values for the limits, except blind trial and error.
>>
>> This patch series attempts to fix that by giving at least a nice starting
>> point from the highwater mark values of the resources in question.
>> I looked where each limit is checked and added a call to update the mark
>> nearby.
>>
>> Example run of program from Documentation/accounting/getdelays.c:
>>
>> ./getdelays -R -p `pidof smartd`
>> printing resource accounting
>> RLIMIT_CPU=0
>> RLIMIT_FSIZE=0
>> RLIMIT_DATA=18198528
>> RLIMIT_STACK=135168
>> RLIMIT_CORE=0
>> RLIMIT_RSS=0
>> RLIMIT_NPROC=1
>> RLIMIT_NOFILE=55
>> RLIMIT_MEMLOCK=0
>> RLIMIT_AS=130879488
>> RLIMIT_LOCKS=0
>> RLIMIT_SIGPENDING=0
>> RLIMIT_MSGQUEUE=0
>> RLIMIT_NICE=0
>> RLIMIT_RTPRIO=0
>> RLIMIT_RTTIME=0
>>
>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>> printing resource accounting
>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>> RLIMIT_CPU=0
>> RLIMIT_FSIZE=0
>> RLIMIT_DATA=18198528
>> RLIMIT_STACK=135168
>> RLIMIT_CORE=0
>> RLIMIT_RSS=0
>> RLIMIT_NPROC=1
>> RLIMIT_NOFILE=55
>> RLIMIT_MEMLOCK=0
>> RLIMIT_AS=130879488
>> RLIMIT_LOCKS=0
>> RLIMIT_SIGPENDING=0
>> RLIMIT_MSGQUEUE=0
>> RLIMIT_NICE=0
>> RLIMIT_RTPRIO=0
>> RLIMIT_RTTIME=0
>>
>> In this example, smartd is running as a non-root user. The presented
>> values can be used as a starting point for giving new limits to the
>> service.
> 
> I don't think it is worth sprinkling the kernel with update_resource_highwatermark()
> calls just to get these metrics.
> 
> Can't we teach the existing perf infrastructure to collect these highwatermarks for us?

I'm not sure about perf (I don't know the internals of perf well enough
to comment), but I'm sure the systemtap infrastructure could do this,
and a preconfigured systemtap script could be shipped with the package
that would allow this.


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD



* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 16:35   ` Topi Miettinen
@ 2016-07-18 22:05     ` Doug Ledford
  2016-07-19 16:53       ` Topi Miettinen
  0 siblings, 1 reply; 18+ messages in thread
From: Doug Ledford @ 2016-07-18 22:05 UTC (permalink / raw)
  To: Topi Miettinen, bsingharora
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Hefty, Hal Rosenstock, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti, Dave Goodell, Sudeep Dutt, Ashutosh Dixit,
	Alex Williamson, Alexander Viro, Tejun Heo, Li Zefan,
	Johannes Weiner, Peter Zijlstra, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Markus Elfring,
	David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT


[-- Attachment #1.1: Type: text/plain, Size: 8959 bytes --]

On 7/15/2016 12:35 PM, Topi Miettinen wrote:
> On 07/15/16 13:04, Balbir Singh wrote:
>> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>>> Hello,
>>>
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find out
>>> useful values for the limits, except blind trial and error.
>>>
>>> This patch series attempts to fix that by giving at least a nice starting
>>> point from the highwater mark values of the resources in question.
>>> I looked where each limit is checked and added a call to update the mark
>>> nearby.
>>>
>>> Example run of program from Documentation/accounting/getdelays.c:
>>>
>>> ./getdelays -R -p `pidof smartd`
>>> printing resource accounting
>>> RLIMIT_CPU=0
>>> RLIMIT_FSIZE=0
>>> RLIMIT_DATA=18198528
>>> RLIMIT_STACK=135168
>>> RLIMIT_CORE=0
>>> RLIMIT_RSS=0
>>> RLIMIT_NPROC=1
>>> RLIMIT_NOFILE=55
>>> RLIMIT_MEMLOCK=0
>>> RLIMIT_AS=130879488
>>> RLIMIT_LOCKS=0
>>> RLIMIT_SIGPENDING=0
>>> RLIMIT_MSGQUEUE=0
>>> RLIMIT_NICE=0
>>> RLIMIT_RTPRIO=0
>>> RLIMIT_RTTIME=0
>>>
>>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>>> printing resource accounting
>>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>>> RLIMIT_CPU=0
>>> RLIMIT_FSIZE=0
>>> RLIMIT_DATA=18198528
>>> RLIMIT_STACK=135168
>>> RLIMIT_CORE=0
>>> RLIMIT_RSS=0
>>> RLIMIT_NPROC=1
>>> RLIMIT_NOFILE=55
>>> RLIMIT_MEMLOCK=0
>>> RLIMIT_AS=130879488
>>> RLIMIT_LOCKS=0
>>> RLIMIT_SIGPENDING=0
>>> RLIMIT_MSGQUEUE=0
>>> RLIMIT_NICE=0
>>> RLIMIT_RTPRIO=0
>>> RLIMIT_RTTIME=0
>>
>> Does this mean that rlimit_data and rlimit_stack should be set to the
>> values as specified by the data above?
> 
> My plan is that either system administrator, distro maintainer or even
> upstream developer can get reasonable values for the limits. They may
> still be wrong, but things would be better than without any help to
> configure the system.

This is not necessarily true.  It seems like there is a disconnect
between what these various values are for and what you are positioning
them as.  Most of these limits are meant to protect the system from
resource starvation crashes.  They aren't meant to be any sort of double
check on a specific application.  The vast majority of applications can
have bugs, leak resources, and do all sorts of other bad things and
still not hit these limits.  A program that leaks a file handle an hour
but only normally has 50 handles in use would take 950 hours of constant
leaking before these limits would kick in to bring the program under
control.  That's over a month.  What's more though, the kernel couldn't
really care less that a single application leaked files until it got to
1000 open.  The real point of the limit on file handles (since they are
cheap) is just not to let the system get brought down.  Someone could
maliciously fire up 1000 processes, and they could all attempt to open
up as many files as possible in order to drown the system in open
inodes.  The combination of the limit on maximum user processes and
maximum files per process are intended to prevent this.  They are not
intended to prevent a single, properly running application from
operating.  In fact, there are very few applications that are likely to
break the 1000 file per process limit.  It is outrageously high for most
applications.  They will leak files and do all sorts of bad things
without this ever stopping them.  But it does stop malicious programs.
And the process limit stops malicious users too.  The max locked memory
is used by almost no processes, and for the very few that use it, the
default is more than enough.  The major exception is the RDMA stack,
which uses it so much that we just disable it on large systems because
it's impossible to predict how much we'll need and we don't want a job
to get killed because it couldn't get the memory it needs for buffers.
The limit on POSIX message queues is another one where it's more than
enough for most applications which don't use this feature at all, and
the few systems that use this feature adjust the limit to something sane
on their system (we can't make the default sane for these special
systems or else it becomes an avenue for Denial of Service attack, so
the default must stay low and servers that make extensive use of this
feature must up their limit on a case by case basis).
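
The file-handle arithmetic earlier in this paragraph can be checked with
a short sketch. The function name is made up for illustration; the
figures (50 handles in use, a limit of 1000, one leak per hour) are the
ones used above.

```python
def hours_until_limit(in_use: int, limit: int, leaks_per_hour: float) -> float:
    """Hours of constant leaking before `in_use` handles grow to `limit`."""
    if leaks_per_hour <= 0:
        raise ValueError("leaks_per_hour must be positive")
    # A process already at or over the limit has no time left.
    return max(limit - in_use, 0) / leaks_per_hour


hours = hours_until_limit(in_use=50, limit=1000, leaks_per_hour=1)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days")
```

At one leak per hour the limit is reached after 950 hours, a little
under 40 days, which is the "over a month" mentioned above.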

>>
>> Do we expect a smart user space daemon to then tweak the RLIMIT values?
> 
> Someone could write an autotuning daemon that checks if the system has
> changed (for example due to upgrade) and then run some tests to
> reconfigure the system. But the limits are a bit too fragile, or rather,
> applications can't handle failure, so I don't know if that would really
> work.

This misses the point of most of these limits.  They aren't there to
keep normal processes and normal users in check.  They are there to stop
runaway use.  This runaway situation might be accidental, or it might be
a nefarious user.  The limits are generally set exceedingly high for those
things every application uses, and fairly low for those things that
almost no application uses but which could be abused by the nefarious
user crowd.

Moreover, for a large percentage of applications, the highwatermark is a
source of great trickery.  For instance, if you have a web server that
is hosting web pages written in python, and therefore are using
mod_python in the httpd server (assuming apache here), then your
highwatermark will never be a reliable, stable thing.  If you get 1000
web requests in a minute, all utilizing the mod_python resource in the
web server, and you don't have your httpd configured to restart after
every few hundred requests handled, then mod_python in your httpd
process will grow seemingly without limit.  It will consume tons of
memory.  And the only limit on how much memory it will consume is
determined by how many web requests it handles in between its garbage
collection intervals * how much memory it allocates per request.  If you
don't happen to catch the absolute highest amount while you are
gathering your watermarks, then when you actually switch the system to
enforcing the limits you learned from all your highwatermarks (you are
planning on doing that aren't you?....I didn't see a copy of the patch
1/14, so I don't know if this infrastructure ever goes back to enforcing
the limits or not, but I would assume so, what point is there in
learning what the limits should be if you then never turn around and
enforce them?), load spikes will cause random program failures.
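
To make the point concrete, here is a rough sketch of the product
described above (requests handled between garbage collections times
memory allocated per request). Every number except the 1000 requests is
a hypothetical value picked for illustration, and the function name is
not from any real tool.

```python
def peak_bytes(requests_between_gc: int, bytes_per_request: int) -> int:
    """Upper bound on memory held between two garbage collection passes."""
    return requests_between_gc * bytes_per_request


# Hypothetical figures, chosen only for illustration.
BYTES_PER_REQUEST = 512 * 1024   # memory allocated per request
QUIET_REQUESTS = 200             # requests per GC interval while measuring
SPIKE_REQUESTS = 1000            # requests per GC interval during a load spike

measured_highwater = peak_bytes(QUIET_REQUESTS, BYTES_PER_REQUEST)
spike_demand = peak_bytes(SPIKE_REQUESTS, BYTES_PER_REQUEST)

# Enforcing the measured highwater mark as the limit makes the spike fail.
limit = measured_highwater
print("limit:", limit, "spike demand:", spike_demand,
      "exceeds limit:", spike_demand > limit)
```

With these made-up numbers the spike needs five times the memory that
was seen while measuring, so a limit taken from the measurement would
be hit as soon as the load arrives.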

Really, this looks like a solution in search of a problem.  Right now,
the limits are set where they are because they do two things:

1) Stay out of the way of the vast majority of applications.  Those
applications that get tripped up by the defaults (like RDMA applications
getting stopped by memlock settings) have setup guides that spell out
which limits need to be changed and hints on what to change them to.

2) Stop nefarious users or errant applications from a total runaway
situation on a machine.

If your applications run without fail unless they have already failed,
and the whole machine doesn't go down with your failed application, then
the limits are working as designed.  If your typical machine
configuration includes 256GB of RAM, then you could probably stand to
increase some of the limits safely if you wanted to.  But unless you
have applications getting killed because of these limits, why would you?

Right now, I'm inclined to NAK the patch set.  I've only seen patch 9/14
since you didn't Cc: everyone on the patch 1/14 that added the
infrastructure.  But, as I mentioned in another email, I think this can
be accomplished via a systemtap script instead so we keep the clutter
out of the kernel.  And more importantly, these patches seem to be
thinking about these limits as though they are supposed to be some sort
of tight fitting container around applications that catch an errant
application as soon as it steps out of bounds.  Nothing could be further
from the truth, and if we actually implemented something of that sort,
programs susceptible to high resource usage during load spikes would
suddenly start failing on a frequent basis.  The proof that these limits
are working is given by the fact that we rarely hear from users about
their programs being killed for resource consumption, and yet we also
don't hear from users about their systems going down due to runaway
applications.  From what I can tell from these patches, I would suspect
complaints from one of those two issues to increase once these patches
are in place and put in use, and that doesn't seem like a good thing.

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD



* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-18 22:05     ` Doug Ledford
@ 2016-07-19 16:53       ` Topi Miettinen
  0 siblings, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-07-19 16:53 UTC (permalink / raw)
  To: Doug Ledford, bsingharora
  Cc: linux-kernel, Jonathan Corbet, Tony Luck, Fenghua Yu,
	Alexander Graf, Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Hefty, Hal Rosenstock, Mike Marciniszyn, Dennis Dalessandro,
	Christian Benvenuti, Dave Goodell, Sudeep Dutt, Ashutosh Dixit,
	Alex Williamson, Alexander Viro, Tejun Heo, Li Zefan,
	Johannes Weiner, Peter Zijlstra, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Markus Elfring,
	David S. Miller, Nicolas Dichtel, Andrew Morton,
	Konstantin Khlebnikov, Jiri Slaby, Cyrill Gorcunov, Michal Hocko,
	Vlastimil Babka, Dave Hansen, Greg Kroah-Hartman, Dan Carpenter,
	Michael Kerrisk, Kirill A. Shutemov, Marcus Gelderie,
	Vladimir Davydov, Joe Perches, Frederic Weisbecker,
	Andrea Arcangeli, Eric W. Biederman, Andi Kleen, Oleg Nesterov,
	Stas Sergeev, Amanieu d'Antras, Richard Weinberger,
	Wang Xiaoqiang, Helge Deller, Mateusz Guzik, Alex Thorlton,
	Ben Segall, John Stultz, Rik van Riel, Eric B Munson,
	Alexey Klimov, Chen Gang, Andrey Ryabinin, David Rientjes,
	Hugh Dickins, Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

On 07/18/16 22:05, Doug Ledford wrote:
> On 7/15/2016 12:35 PM, Topi Miettinen wrote:
>> On 07/15/16 13:04, Balbir Singh wrote:
>>> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>>>> Hello,
>>>>
>>>> There are many basic ways to control processes, including capabilities,
>>>> cgroups and resource limits. However, there are far fewer ways to find out
>>>> useful values for the limits, except blind trial and error.
>>>>
>>>> This patch series attempts to fix that by giving at least a nice starting
>>>> point from the highwater mark values of the resources in question.
>>>> I looked where each limit is checked and added a call to update the mark
>>>> nearby.
>>>>
>>>> Example run of program from Documentation/accounting/getdelays.c:
>>>>
>>>> ./getdelays -R -p `pidof smartd`
>>>> printing resource accounting
>>>> RLIMIT_CPU=0
>>>> RLIMIT_FSIZE=0
>>>> RLIMIT_DATA=18198528
>>>> RLIMIT_STACK=135168
>>>> RLIMIT_CORE=0
>>>> RLIMIT_RSS=0
>>>> RLIMIT_NPROC=1
>>>> RLIMIT_NOFILE=55
>>>> RLIMIT_MEMLOCK=0
>>>> RLIMIT_AS=130879488
>>>> RLIMIT_LOCKS=0
>>>> RLIMIT_SIGPENDING=0
>>>> RLIMIT_MSGQUEUE=0
>>>> RLIMIT_NICE=0
>>>> RLIMIT_RTPRIO=0
>>>> RLIMIT_RTTIME=0
>>>>
>>>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>>>> printing resource accounting
>>>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>>>> RLIMIT_CPU=0
>>>> RLIMIT_FSIZE=0
>>>> RLIMIT_DATA=18198528
>>>> RLIMIT_STACK=135168
>>>> RLIMIT_CORE=0
>>>> RLIMIT_RSS=0
>>>> RLIMIT_NPROC=1
>>>> RLIMIT_NOFILE=55
>>>> RLIMIT_MEMLOCK=0
>>>> RLIMIT_AS=130879488
>>>> RLIMIT_LOCKS=0
>>>> RLIMIT_SIGPENDING=0
>>>> RLIMIT_MSGQUEUE=0
>>>> RLIMIT_NICE=0
>>>> RLIMIT_RTPRIO=0
>>>> RLIMIT_RTTIME=0
>>>
>>> Does this mean that rlimit_data and rlimit_stack should be set to the
>>> values as specified by the data above?
>>
>> My plan is that either system administrator, distro maintainer or even
>> upstream developer can get reasonable values for the limits. They may
>> still be wrong, but things would be better than without any help to
>> configure the system.
> 
> This is not necessarily true.  It seems like there is a disconnect
> between what these various values are for and what you are positioning
> them as.  Most of these limits are meant to protect the system from
> resource starvation crashes.  They aren't meant to be any sort of double
> check on a specific application.  The vast majority of applications can
> have bugs, leak resources, and do all sorts of other bad things and
> still not hit these limits.  A program that leaks a file handle an hour
> but only normally has 50 handles in use would take 950 hours of constant
> leaking before these limits would kick in to bring the program under
> control.  That's over a month.  What's more though, the kernel couldn't
> really care less that a single application leaked files until it got to
> 1000 open.  The real point of the limit on file handles (since they are
> cheap) is just not to let the system get brought down.  Someone could
> maliciously fire up 1000 processes, and they could all attempt to open
> up as many files as possible in order to drown the system in open
> inodes.  The combination of the limit on maximum user processes and
> maximum files per process are intended to prevent this.  They are not
> intended to prevent a single, properly running application from
> operating.  In fact, there are very few applications that are likely to
> break the 1000 file per process limit.  It is outrageously high for most
> applications.  They will leak files and do all sorts of bad things
> without this ever stopping them.  But it does stop malicious programs.
> And the process limit stops malicious users too.  The max locked memory
> is used by almost no processes, and for the very few that use it, the
> default is more than enough.  The major exception is the RDMA stack,
> which uses it so much that we just disable it on large systems because
> it's impossible to predict how much we'll need and we don't want a job
> to get killed because it couldn't get the memory it needs for buffers.
> The limit on POSIX message queues is another one where it's more than
> enough for most applications which don't use this feature at all, and
> the few systems that use this feature adjust the limit to something sane
> on their system (we can't make the default sane for these special
> systems or else it becomes an avenue for Denial of Service attack, so
> the default must stay low and servers that make extensive use of this
> feature must up their limit on a case by case basis).
> 
>>>
>>> Do we expect a smart user space daemon to then tweak the RLIMIT values?
>>
>> Someone could write an autotuning daemon that checks if the system has
>> changed (for example due to upgrade) and then run some tests to
>> reconfigure the system. But the limits are a bit too fragile, or rather,
>> applications can't handle failure, so I don't know if that would really
>> work.
> 
> This misses the point of most of these limits.  They aren't there to
> keep normal processes and normal users in check.  They are there to stop
> runaway use.  This runaway situation might be accidental, or it might be
> a nefarious user.  The limits are generally set exceedingly high for those
> things every application uses, and fairly low for those things that
> almost no application uses but which could be abused by the nefarious
> user crowd.
> 
> Moreover, for a large percentage of applications, the highwatermark is a
> source of great trickery.  For instance, if you have a web server that
> is hosting web pages written in python, and therefore are using
> mod_python in the httpd server (assuming apache here), then your
> highwatermark will never be a reliable, stable thing.  If you get 1000
> web requests in a minute, all utilizing the mod_python resource in the
> web server, and you don't have your httpd configured to restart after
> every few hundred requests handled, then mod_python in your httpd
> process will grow seemingly without limit.  It will consume tons of
> memory.  And the only limit on how much memory it will consume is
> determined by how many web requests it handles in between its garbage
> collection intervals * how much memory it allocates per request.  If you
> don't happen to catch the absolute highest amount while you are
> gathering your watermarks, then when you actually switch the system to
> enforcing the limits you learned from all your highwatermarks (you are
> planning on doing that aren't you?....I didn't see a copy of the patch
> 1/14, so I don't know if this infrastructure ever goes back to enforcing
> the limits or not, but I would assume so, what point is there in
> learning what the limits should be if you then never turn around and
> enforce them?), load spikes will cause random program failures.
> 
> Really, this looks like a solution in search of a problem.  Right now,
> the limits are set where they are because they do two things:
> 
> 1) Stay out of the way of the vast majority of applications.  Those
> applications that get tripped up by the defaults (like RDMA applications
> getting stopped by memlock settings) have setup guides that spell out
> which limits need to be changed and hints on what to change them to.
> 
> 2) Stop nefarious users or errant applications from a total runaway
> situation on a machine.
> 
> If your applications run without fail unless they have already failed,
> and the whole machine doesn't go down with your failed application, then
> the limits are working as designed.  If your typical machine
> configuration includes 256GB of RAM, then you could probably stand to
> increase some of the limits safely if you wanted to.  But unless you
> have applications getting killed because of these limits, why would you?

Thanks for the long explanation. I'd suppose loose limits are also used
because it's hard to know good tighter values. I was thinking of using
tighter settings to make things less easy for exploit writers. With
tight limits for RLIMIT_AS, RLIMIT_DATA and RLIMIT_STACK (also
RLIMIT_FSIZE in case a daemon is not supposed to create new files) it
would not be so easy for the initial exploit to mmap() a large next stage
payload.

But there could be more direct ways to prevent that. For example, if
there was a way for seccomp filters to access shared state, they could
implement a state machine that could switch to a stricter mode after the
application has entered the event loop. Most of the limits, and the
denial of service case, are not interesting to me anyway.

> Right now, I'm inclined to NAK the patch set.  I've only seen patch 9/14
> since you didn't Cc: everyone on the patch 1/14 that added the
> infrastructure.  But, as I mentioned in another email, I think this can
> be accomplished via a systemtap script instead so we keep the clutter
> out of the kernel.  And more importantly, these patches seem to be
> thinking about these limits as though they are supposed to be some sort
> of tight fitting container around applications that catch an errant
> application as soon as it steps out of bounds.  Nothing could be further
> from the truth, and if we actually implemented something of that sort,
> programs susceptible to high resource usage during load spikes would
> suddenly start failing on a frequent basis.  The proof that these limits
> are working is given by the fact that we rarely hear from users about
> their programs being killed for resource consumption, and yet we also
> don't hear from users about their systems going down due to runaway
> applications.  From what I can tell from these patches, I would suspect
> complaints from one of those two issues to increase once these patches
> are in place and put in use, and that doesn't seem like a good thing.
> 

Those complaints could increase either way if users want to use tight
limits, whether with kernel assistance or with systemtap. Again, the
lack of complaints could also be because users are unaware of how to
find tight limits, so they have no option but to either ignore the
limits or use loose settings.
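
As a sketch of how a measured highwater mark might be turned into a
limit with some slack for load spikes: the 25% headroom, the 4096-byte
page size and the function name are all assumptions for illustration,
not part of the patch set. The observed values are the ones from the
getdelays example quoted earlier in the thread.

```python
def suggest_limit(highwater: int, headroom_percent: int = 25,
                  page_size: int = 4096) -> int:
    """Pad a highwater mark and round it up to a whole number of pages."""
    if highwater < 0 or headroom_percent < 0 or page_size <= 0:
        raise ValueError("arguments must be non-negative, page_size positive")
    # -(-a // b) is integer ceiling division.
    padded = -(-highwater * (100 + headroom_percent) // 100)
    pages = -(-padded // page_size)
    return pages * page_size


# Highwater marks reported for smartd in the getdelays example.
observed = {"DATA": 18198528, "STACK": 135168, "AS": 130879488}
for name, value in observed.items():
    print(f"RLIMIT_{name}: observed {value}, suggested {suggest_limit(value)}")
```

How much headroom is enough depends on the workload; a fixed percentage
is only a starting point and will not cover every load spike.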

-Topi


* Re: [PATCH 00/14] Present useful limits to user (v2)
  2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
                   ` (6 preceding siblings ...)
  2016-07-15 14:19 ` Richard Weinberger
@ 2016-08-03 18:20 ` Topi Miettinen
  7 siblings, 0 replies; 18+ messages in thread
From: Topi Miettinen @ 2016-08-03 18:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jonathan Corbet, Tony Luck, Fenghua Yu, Alexander Graf,
	Paolo Bonzini, Radim Krčmář,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Doug Ledford, Sean Hefty, Hal Rosenstock, Mike Marciniszyn,
	Dennis Dalessandro, Christian Benvenuti, Dave Goodell,
	Sudeep Dutt, Ashutosh Dixit, Alex Williamson, Alexander Viro,
	Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Balbir Singh, Markus Elfring, David S. Miller, Nicolas Dichtel,
	Andrew Morton, Konstantin Khlebnikov, Jiri Slaby,
	Cyrill Gorcunov, Michal Hocko, Vlastimil Babka, Dave Hansen,
	Greg Kroah-Hartman, Dan Carpenter, Michael Kerrisk,
	Kirill A. Shutemov, Marcus Gelderie, Vladimir Davydov,
	Joe Perches, Frederic Weisbecker, Andrea Arcangeli,
	Eric W. Biederman, Andi Kleen, Oleg Nesterov, Stas Sergeev,
	Amanieu d'Antras, Richard Weinberger, Wang Xiaoqiang,
	Helge Deller, Mateusz Guzik, Alex Thorlton, Ben Segall,
	John Stultz, Rik van Riel, Eric B Munson, Alexey Klimov,
	Chen Gang, Andrey Ryabinin, David Rientjes, Hugh Dickins,
	Alexander Kuleshov, open list:DOCUMENTATION,
	open list:IA64 (Itanium) PLATFORM,
	open list:KERNEL VIRTUAL MACHINE (KVM) FOR POWERPC,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	open list:INFINIBAND SUBSYSTEM,
	open list:FILESYSTEMS (VFS and infrastructure),
	open list:CONTROL GROUP (CGROUP),
	open list:BPF (Safe dynamic programs and tools),
	open list:MEMORY MANAGEMENT

[-- Attachment #1: Type: text/plain, Size: 2742 bytes --]

Hello,

I'm trying the systemtap approach and it looks promising. The script
annotates strace-like output with capability, device access and RLIMIT
information. At the end there's a summary. Here's sample output from a
wpa_supplicant run:

mprotect(0x7efebf140000, 16384, PROT_READ) = 0 [DATA 548864 -> 573440] [AS 44986368 -> 45002752]
brk(0x55d9611f8000) = 94392125718528 missing [Capabilities=CAP_SYS_ADMIN] [AS 45002752 -> 45010944]
open(0x55d960716462, O_RDWR) = 3 [DeviceAllow=/dev/char/1:3 rw ]
open("/dev/random", O_RDONLY|O_NONBLOCK) = 3 [DeviceAllow=/dev/char/1:8 r ]
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 4 [RestrictAddressFamilies=AF_UNIX] [NOFILE 3 -> 4]
open("/etc/wpa_supplicant.conf", O_RDONLY) = 5 [NOFILE 4 -> 5]
socket(PF_NETLINK, SOCK_RAW, 0) = 5 [RestrictAddressFamilies=AF_NETLINK]
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 16) = 6 [RestrictAddressFamilies=AF_NETLINK] [NOFILE 5 -> 6]
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 16) = 7 [RestrictAddressFamilies=AF_NETLINK] [NOFILE 6 -> 7]
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 8 [RestrictAddressFamilies=AF_INET] [NOFILE 7 -> 8]
open("/dev/rfkill", O_RDONLY) = 9 [DeviceAllow=/dev/char/10:58 r ] [NOFILE 8 -> 9]
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 10 [RestrictAddressFamilies=AF_UNIX] [NOFILE 9 -> 10]
sendmsg(6, 0x7ffc778f35b0, 0x0) = 36 [Capabilities=CAP_NET_ADMIN]

Summary:
CapabilityBoundingSet=CAP_NET_ADMIN CAP_NET_RAW
Consider also missing CapabilityBoundingSet=CAP_SYS_ADMIN
DeviceAllow=/dev/char/1:3 rw
DeviceAllow=/dev/char/1:8 r
DeviceAllow=/dev/char/10:58 r
DeviceAllow=/dev/char/1:9 r
LimitFSIZE=0
LimitDATA=577536
LimitSTACK=139264
LimitCORE=0
LimitNOFILE=15
LimitAS=45146112
LimitNPROC=171
LimitMEMLOCK=0
LimitSIGPENDING=0
LimitMSGQUEUE=0
LimitNICE=0
LimitRTPRIO=0
RestrictAddressFamilies=AF_UNIX AF_INET AF_NETLINK AF_PACKET
MemoryDenyWriteExecute=true

Some values are not correct. NPROC is wrong because staprun needs to be
run as root instead of the separate privileged user for wpa_supplicant,
and that skews the user process count. DATA/AS/STACK seem to be a bit off.
Otherwise, I can easily use this as a systemd service configuration drop-in.
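
For illustration, turning a summary like the one above into drop-in text
could look roughly like this. The helper name is hypothetical and only a
subset of the summary is shown; the directive names and values are
copied from the summary.

```python
def render_dropin(settings):
    """Render (directive, value) pairs as the [Service] section of a drop-in."""
    lines = ["[Service]"]
    for key, value in settings:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# A subset of the summary; pairs rather than a dict, since directives
# such as DeviceAllow= appear more than once.
summary = [
    ("CapabilityBoundingSet", "CAP_NET_ADMIN CAP_NET_RAW"),
    ("DeviceAllow", "/dev/char/1:3 rw"),
    ("DeviceAllow", "/dev/char/10:58 r"),
    ("LimitNOFILE", "15"),
    ("LimitAS", "45146112"),
    ("RestrictAddressFamilies", "AF_UNIX AF_INET AF_NETLINK AF_PACKET"),
    ("MemoryDenyWriteExecute", "true"),
]

text = render_dropin(summary)
print(text, end="")
```

The result would go into a file under the unit's drop-in directory, for
example a hypothetical wpa_supplicant.service.d/limits.conf.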

Now, the relevant part for the kernel is that I'd like to analyze error
paths better, so that system calls would also be annotated when they
fail because an RLIMIT is too tight. It would be easier to insert probes
if there was only one path for RLIMIT checks. Would it be OK to make the
function task_rlimit() a full check against the limit and also make it a
non-inlined function, just for improved probing purposes?

There's already error analysis for the capabilities, but there are some
false positive hits (like brk() complaining about missing CAP_SYS_ADMIN
above).

-Topi


[-- Attachment #2: strace.stp --]
[-- Type: text/plain, Size: 19078 bytes --]

#! /bin/sh

# suppress some run-time errors here for cleaner output
//bin/true && exec stap --suppress-handler-errors --skip-badvars $0 ${1+"$@"}

/*
 * Compile:
 * stap -p4 -DSTP_NO_OVERLOAD -m strace
 * Run:
 * /usr/bin/staprun -R -c "/sbin/wpa_supplicant -u -O /run/wpa_supplicant -c /etc/wpa_supplicant.conf -i wlan0" -w /root/strace.ko only_capability_use=1 timestamp=0
 */

/* configuration options; set these with stap -G */
global follow_fork = 0   /* -Gfollow_fork=1 means trace descendant processes too */
global timestamp = 1     /* -Gtimestamp=0 means don't print a syscall timestamp */
global elapsed_time = 0  /* -Gelapsed_time=1 means print a syscall duration too */
global only_capability_use = 0 /* -Gonly_capability_use=1 means print only when capabilities are used */
global thread_argstr%
global thread_time%

global syscalls_nonreturn[2]
global capnames[64]
global used_caps
global missing_caps
global all_used_caps
global all_missing_caps
global accessed_devices[1000]
global all_accessed_devices[1000]
global highwatermark_fsize
global highwatermark_data
global highwatermark_stack
global highwatermark_core
global highwatermark_nproc
global highwatermark_nofile
global highwatermark_memlock
global highwatermark_as
global highwatermark_sigpending
global highwatermark_msgqueue
global highwatermark_nice
global highwatermark_rtprio
global old_highwatermark_fsize
global old_highwatermark_data
global old_highwatermark_stack
global old_highwatermark_core
global old_highwatermark_nproc
global old_highwatermark_nofile
global old_highwatermark_memlock
global old_highwatermark_as
global old_highwatermark_sigpending
global old_highwatermark_msgqueue
global old_highwatermark_nice
global old_highwatermark_rtprio
global afnames[64]
global used_afs
global missing_afs
global all_used_afs
global all_missing_afs
global no_memory_deny_write_execute
global all_memory_deny_write_execute = "true"
global print_syscall


probe begin 
  {
    /* list those syscalls that never .return */
    syscalls_nonreturn["exit"]=1
    syscalls_nonreturn["exit_group"]=1

    // grep '#define CAP_.*[0-9]+$' /usr/src/linux-headers*/include/uapi/linux/capability.h | awk '{ print "capnames[" $3 "] = \"" $2 "\";" }'
    capnames[0] = "CAP_CHOWN";
    capnames[1] = "CAP_DAC_OVERRIDE";
    capnames[2] = "CAP_DAC_READ_SEARCH";
    capnames[3] = "CAP_FOWNER";
    capnames[4] = "CAP_FSETID";
    capnames[5] = "CAP_KILL";
    capnames[6] = "CAP_SETGID";
    capnames[7] = "CAP_SETUID";
    capnames[8] = "CAP_SETPCAP";
    capnames[9] = "CAP_LINUX_IMMUTABLE";
    capnames[10] = "CAP_NET_BIND_SERVICE";
    capnames[11] = "CAP_NET_BROADCAST";
    capnames[12] = "CAP_NET_ADMIN";
    capnames[13] = "CAP_NET_RAW";
    capnames[14] = "CAP_IPC_LOCK";
    capnames[15] = "CAP_IPC_OWNER";
    capnames[16] = "CAP_SYS_MODULE";
    capnames[17] = "CAP_SYS_RAWIO";
    capnames[18] = "CAP_SYS_CHROOT";
    capnames[19] = "CAP_SYS_PTRACE";
    capnames[20] = "CAP_SYS_PACCT";
    capnames[21] = "CAP_SYS_ADMIN";
    capnames[22] = "CAP_SYS_BOOT";
    capnames[23] = "CAP_SYS_NICE";
    capnames[24] = "CAP_SYS_RESOURCE";
    capnames[25] = "CAP_SYS_TIME";
    capnames[26] = "CAP_SYS_TTY_CONFIG";
    capnames[27] = "CAP_MKNOD";
    capnames[28] = "CAP_LEASE";
    capnames[29] = "CAP_AUDIT_WRITE";
    capnames[30] = "CAP_AUDIT_CONTROL";
    capnames[31] = "CAP_SETFCAP";
    capnames[32] = "CAP_MAC_OVERRIDE";
    capnames[33] = "CAP_MAC_ADMIN";
    capnames[34] = "CAP_SYSLOG";
    capnames[35] = "CAP_WAKE_ALARM";
    capnames[36] = "CAP_BLOCK_SUSPEND";
    capnames[37] = "CAP_AUDIT_READ";

    //grep '#define AF_.*' /usr/src/linux-headers-*/include/linux/socket.h | awk '{ print "afnames[" $3 "] = \"" $2 "\"" }'
    afnames[0] = "AF_UNSPEC"
    afnames[1] = "AF_UNIX"
    afnames[2] = "AF_INET"
    afnames[3] = "AF_AX25"
    afnames[4] = "AF_IPX"
    afnames[5] = "AF_APPLETALK"
    afnames[6] = "AF_NETROM"
    afnames[7] = "AF_BRIDGE"
    afnames[8] = "AF_ATMPVC"
    afnames[9] = "AF_X25"
    afnames[10] = "AF_INET6"
    afnames[11] = "AF_ROSE"
    afnames[12] = "AF_DECnet"
    afnames[13] = "AF_NETBEUI"
    afnames[14] = "AF_SECURITY"
    afnames[15] = "AF_KEY"
    afnames[16] = "AF_NETLINK"
    afnames[17] = "AF_PACKET"
    afnames[18] = "AF_ASH"
    afnames[19] = "AF_ECONET"
    afnames[20] = "AF_ATMSVC"
    afnames[21] = "AF_RDS"
    afnames[22] = "AF_SNA"
    afnames[23] = "AF_IRDA"
    afnames[24] = "AF_PPPOX"
    afnames[25] = "AF_WANPIPE"
    afnames[26] = "AF_LLC"
    afnames[27] = "AF_IB"
    afnames[28] = "AF_MPLS"
    afnames[29] = "AF_CAN"
    afnames[30] = "AF_TIPC"
    afnames[31] = "AF_BLUETOOTH"
    afnames[32] = "AF_IUCV"
    afnames[33] = "AF_RXRPC"
    afnames[34] = "AF_ISDN"
    afnames[35] = "AF_PHONET"
    afnames[36] = "AF_IEEE802154"
    afnames[37] = "AF_CAIF"
    afnames[38] = "AF_ALG"
    afnames[39] = "AF_NFC"
    afnames[40] = "AF_VSOCK"
    afnames[41] = "AF_KCM"
  }



function filter_p()
  {
    if (target() == 0) return 0; /* system-wide */
    if (!follow_fork && pid() != target()) return 1; /* single-process */
    if (follow_fork && !target_set_pid(pid())) return 1; /* multi-process */
    return 0;
  }

function caps_to_str(caps)
  {
    str = ""
    for (i = 0; i <= 37; i++) # CAP_LAST_CAP is 37 (CAP_AUDIT_READ), so include it
      if (caps & (1 << i)) {
        str .= capnames[i]
	if ((caps & ~((1 << (i + 1)) - 1)) != 0)
	  str .= " "
      }
    return str
  }

function dev_to_str(type, dev, access)
  {
    devs = "/dev/"
    if (type == 1) # DEV_BLOCK
      devs .= "block"
    else
      devs .= "char"
    devs .= sprintf("/%d:%d ", dev >> 32, dev & 0xffffffff)
    if (access & 2) # ACC_READ
      devs .= "r"
    if (access & 4) # ACC_WRITE
      devs .= "w"
    if (access & 1) # ACC_MKNOD
      devs .= "m"
    return devs
  }

function afs_to_str(afs)
  {
    str = ""
    for (i = 0; i < 42; i++) # MAX_AF
      if (afs & (1 << i)) {
        str .= afnames[i]
	if ((afs & ~((1 << (i + 1)) - 1)) != 0)
	  str .= " "
      }
    return str
  }

/* Capabilities */
probe kernel.function("cap_capable@security/commoncap.c").return
  {
    if (filter_p()) next;

    if ($return == 0 && $audit)
      used_caps |= 1 << $cap;
    else
      missing_caps |= 1 << $cap;
  }

/* Devices */
probe kernel.function("__devcgroup_check_permission@security/device_cgroup.c").return
  {
    if (filter_p()) next;

    if ($return == 0)
      accessed_devices[$type, $major << 32 | $minor] |= $access
  }

/* RLIMIT_FSIZE */
probe kernel.function("inode_newsize_ok@fs/attr.c").return
  {
    if (filter_p()) next;

    if ($return == 0 && highwatermark_fsize < $offset)
      highwatermark_fsize = $offset
  }

/* RLIMIT_DATA */
probe kernel.function("prctl_set_mm@kernel/sys.c").return
  {
    if (filter_p()) next;

    if ($return == 0 && highwatermark_data < $prctl_map->end_data - $prctl_map->start_data) {
      highwatermark_data = $prctl_map->end_data - $prctl_map->start_data
      print_syscall = 1
    }
  }

probe kernel.function("do_brk@mm/mmap.c").return
  {
    if (filter_p()) next;

    task = task_current()
    if ($return > 0 && highwatermark_data < task->mm->data_vm << 12) { # PAGE_SHIFT
      highwatermark_data = task->mm->data_vm << 12
      print_syscall = 1
    }
    if ($return > 0 && highwatermark_as < task->mm->total_vm << 12) {
      highwatermark_as = task->mm->total_vm << 12
      print_syscall = 1
    }
  }

/* also RLIMIT_STACK and RLIMIT_MEMLOCK */
probe kernel.function("vm_stat_account@mm/mmap.c").return
  {
    if (filter_p()) next;

    if (highwatermark_data < $mm->data_vm << 12) { # PAGE_SHIFT
      highwatermark_data = $mm->data_vm << 12
      print_syscall = 1
    }
    if (highwatermark_stack < $mm->stack_vm << 12) {
      highwatermark_stack = $mm->stack_vm << 12
      print_syscall = 1
    }
    if (highwatermark_memlock < atomic_long_read(&$mm->locked_vm) << 12) {
      highwatermark_memlock = atomic_long_read(&$mm->locked_vm) << 12
      print_syscall = 1
    }
    if (highwatermark_as < $mm->total_vm << 12) {
      highwatermark_as = $mm->total_vm << 12
      print_syscall = 1
    }
  }

/* RLIMIT_CORE */
probe kernel.function("dump_emit@fs/coredump.c").return
  {
    if (filter_p()) next;

    if (highwatermark_core < $cprm->written) {
      highwatermark_core = $cprm->written
      print_syscall = 1
    }
  }

/* RLIMIT_NPROC */
probe kernel.function("commit_creds@kernel/cred.c").return
  {
    if (filter_p()) next;

    if (highwatermark_nproc < atomic_read(&$new->user->processes)) {
      highwatermark_nproc = atomic_read(&$new->user->processes)
      print_syscall = 1
    }
  }

probe kernel.function("copy_process@kernel/fork.c").return
  {
    if (filter_p()) next;
    printf("return %d\n", $return);
    try {
      /* $return is a task_struct pointer or an ERR_PTR() value; skip error returns */
      if (($return > 0 || $return < -1000) && $return->real_cred && $return->real_cred->user) {
        printf("good return %d\n", $return);
        if (highwatermark_nproc < atomic_read(&$return->real_cred->user->processes)) {
          highwatermark_nproc = atomic_read(&$return->real_cred->user->processes)
          print_syscall = 1
        }
      }
    } catch {}
  }

/* RLIMIT_NOFILE */
probe kernel.function("__alloc_fd@fs/file.c").return
  {
    if (filter_p()) next;

    if (($return >= 0 || $return < -1000) && highwatermark_nofile < $return) {
      highwatermark_nofile = $return
      print_syscall = 1
    }
  }

probe kernel.function("do_dup2@fs/file.c").return
  {
    if (filter_p()) next;

    if (($return >= 0 || $return < -1000) && highwatermark_nofile < $return) {
      highwatermark_nofile = $return
      print_syscall = 1
    }
  }

/* RLIMIT_MEMLOCK */
probe kernel.function("sys_bpf@kernel/bpf/syscall.c").return
  {
    if (filter_p()) next;

    task = task_current()
    user = task->real_cred->user
    if ($return == 0 && highwatermark_memlock < atomic_long_read(&user->locked_vm) << 12) { # PAGE_SHIFT
      highwatermark_memlock = atomic_long_read(&user->locked_vm) << 12
      print_syscall = 1
    }
  }

probe kernel.function("perf_mmap@kernel/events/core.c").return
  {
    if (filter_p()) next;

    task = task_current()
    if ($return == 0 && highwatermark_memlock < task->mm->pinned_vm << 12) { # PAGE_SHIFT
      highwatermark_memlock = task->mm->pinned_vm << 12
      print_syscall = 1
    }
  }

probe kernel.function("do_mlock@mm/mlock.c").return
  {
    if (filter_p()) next;

    task = task_current()
    if ($return == 0 && highwatermark_memlock < task->mm->locked_vm << 12) { # PAGE_SHIFT
      highwatermark_memlock = task->mm->locked_vm << 12
      print_syscall = 1
    }
  }

probe kernel.function("sys_mlockall@mm/mlock.c").return
  {
    if (filter_p()) next;

    task = task_current()
    if ($return == 0 && highwatermark_memlock < task->mm->total_vm << 12) { # PAGE_SHIFT
      highwatermark_memlock = task->mm->total_vm << 12
      print_syscall = 1
    }
  }

/* RLIMIT_SIGPENDING */
probe kernel.function("__sigqueue_alloc@kernel/signal.c").return
  {
    if (filter_p()) next;

    task = task_current()
    user = task->real_cred->user
    if ($return != 0 && highwatermark_sigpending < atomic_read(&user->sigpending)) { # returns a pointer, NULL on failure
      highwatermark_sigpending = atomic_read(&user->sigpending)
      print_syscall = 1
    }
  }

/* RLIMIT_MSGQUEUE */
probe kernel.function("mqueue_get_inode@ipc/mqueue.c").return
  {
    if (filter_p()) next;

    task = task_current()
    user = task->real_cred->user
    if (($return > 0 || $return < -1000) && highwatermark_msgqueue < user->mq_bytes) { # returns an inode pointer or ERR_PTR()
      highwatermark_msgqueue = user->mq_bytes
      print_syscall = 1
    }
  }

/* RLIMIT_NICE */
probe kernel.function("set_user_nice@kernel/sched/core.c").return
  {
    if (filter_p()) next;

    if (highwatermark_nice < $nice) {
      highwatermark_nice = $nice
      print_syscall = 1
    }
  }

/* RLIMIT_RTPRIO */
probe kernel.function("__sched_setscheduler@kernel/sched/core.c").return
  {
    if (filter_p()) next;

    if (highwatermark_rtprio < $attr->sched_priority) {
      highwatermark_rtprio = $attr->sched_priority
      print_syscall = 1
    }
  }

/* socket address families */
probe kernel.function("__sock_create@net/socket.c").return
  {
    if (filter_p()) next;

    if ($return == 0) {
      used_afs |= 1 << $family
      print_syscall = 1
    } else if ($return == -93) { # -EPROTONOSUPPORT (kernel functions return negative errno)
      missing_afs |= 1 << $family
      print_syscall = 1
    }
  }

/* mmap flags */
probe kernel.function("do_mmap@mm/mmap.c").return
  {
    if (filter_p()) next;

    if (($return >= 0 || $return < -1000) && ($prot & (2 | 4)) == (2 | 4)) { # PROT_WRITE | PROT_EXEC are in prot, not flags
      no_memory_deny_write_execute = 1
      print_syscall = 1
    }
  }

/* system call printing */
probe nd_syscall.* 
  {
    # TODO: filter out apparently-nested syscalls (that are implemented
    # in terms of each other within the kernel); PR6762

    if (filter_p()) next;

    thread_argstr[tid()]=argstr
    if (timestamp || elapsed_time)
      thread_time[tid()]=gettimeofday_us()

    if (name in syscalls_nonreturn)
      report(name,argstr,"")
  }

probe nd_syscall.*.return
  {
    if (filter_p()) next;

    report(name,thread_argstr[tid()],retstr)
  }

function report(syscall_name, syscall_argstr, syscall_retstr)
  {
    if (timestamp || elapsed_time)
      {
        now = gettimeofday_us()
        then = thread_time[tid()]

        if (timestamp)
          prefix=sprintf("%s.%06d ", ctime(then/1000000), then%1000000)

        if (elapsed_time && (now>then)) {
          diff = now-then
          suffix=sprintf(" <%d.%06d>", diff/1000000, diff%1000000)
        }

        delete thread_time[tid()]
      }

    /* add a thread-id string in lots of cases, except if
       stap strace.stp -c SINGLE_THREADED_CMD */
    if (tid() != target()) {
      prefix .= sprintf("%s[%d] ", execname(), tid())
    }

    if (used_caps) {
       suffix .= " [Capabilities=" . caps_to_str(used_caps) . "]"
       all_used_caps |= used_caps
       print_syscall = 1
    }		       
    if (missing_caps) {
       suffix .= " missing [Capabilities=" . caps_to_str(missing_caps) . "]"
       all_missing_caps |= missing_caps
       print_syscall = 1
    }		       

    foreach ([type, dev] in accessed_devices) {
      devs .= dev_to_str(type, dev, accessed_devices[type, dev]) . " "
      if (has_devs == 0) {
        has_devs = 1
	print_syscall = 1
	devs = " [DeviceAllow=" . devs
      }
      all_accessed_devices[type, dev] = accessed_devices[type, dev];
    }
    if (has_devs) {
      devs .= "]"
      suffix .= devs
    }

    if (used_afs) {
      suffix .= " [RestrictAddressFamilies=" . afs_to_str(used_afs) . "]"
      all_used_afs |= used_afs
      print_syscall = 1
    }		       
    if (missing_afs) {
      suffix .= " missing [RestrictAddressFamilies=" . afs_to_str(missing_afs) . "]"
      all_missing_afs |= missing_afs
      print_syscall = 1
    }		       

    if (no_memory_deny_write_execute) {
      suffix .= " [MemoryDenyWriteExecute=false]"
      all_memory_deny_write_execute = "false"
    }		       

    if (highwatermark_fsize > old_highwatermark_fsize) {
      suffix .= sprintf(" [FSIZE %d -> %d]", old_highwatermark_fsize, highwatermark_fsize)
      old_highwatermark_fsize = highwatermark_fsize
    }
    if (highwatermark_data > old_highwatermark_data) {
      suffix .= sprintf(" [DATA %d -> %d]", old_highwatermark_data, highwatermark_data)
      old_highwatermark_data = highwatermark_data
    }
    if (highwatermark_stack > old_highwatermark_stack) {
      suffix .= sprintf(" [STACK %d -> %d]", old_highwatermark_stack, highwatermark_stack)
      old_highwatermark_stack = highwatermark_stack
    }
    if (highwatermark_core > old_highwatermark_core) {
      suffix .= sprintf(" [CORE %d -> %d]", old_highwatermark_core, highwatermark_core)
      old_highwatermark_core = highwatermark_core
    }
    if (highwatermark_nofile > old_highwatermark_nofile) {
      suffix .= sprintf(" [NOFILE %d -> %d]", old_highwatermark_nofile, highwatermark_nofile)
      old_highwatermark_nofile = highwatermark_nofile
    }
    if (highwatermark_as > old_highwatermark_as) {
      suffix .= sprintf(" [AS %d -> %d]", old_highwatermark_as, highwatermark_as)
      old_highwatermark_as = highwatermark_as
    }
    if (highwatermark_nproc > old_highwatermark_nproc) {
      suffix .= sprintf(" [NPROC %d -> %d]", old_highwatermark_nproc, highwatermark_nproc)
      old_highwatermark_nproc = highwatermark_nproc
    }
    if (highwatermark_memlock > old_highwatermark_memlock) {
      suffix .= sprintf(" [MEMLOCK %d -> %d]", old_highwatermark_memlock, highwatermark_memlock)
      old_highwatermark_memlock = highwatermark_memlock
    }
    if (highwatermark_sigpending > old_highwatermark_sigpending) {
      suffix .= sprintf(" [SIGPENDING %d -> %d]", old_highwatermark_sigpending, highwatermark_sigpending)
      old_highwatermark_sigpending = highwatermark_sigpending
    }
    if (highwatermark_msgqueue > old_highwatermark_msgqueue) {
      suffix .= sprintf(" [MSGQUEUE %d -> %d]", old_highwatermark_msgqueue, highwatermark_msgqueue)
      old_highwatermark_msgqueue = highwatermark_msgqueue
    }
    if (highwatermark_nice > old_highwatermark_nice) {
      suffix .= sprintf(" [NICE %d -> %d]", old_highwatermark_nice, highwatermark_nice)
      old_highwatermark_nice = highwatermark_nice
    }
    if (highwatermark_rtprio > old_highwatermark_rtprio) {
      suffix .= sprintf(" [RTPRIO %d -> %d]", old_highwatermark_rtprio, highwatermark_rtprio)
      old_highwatermark_rtprio = highwatermark_rtprio
    }

    if (!only_capability_use || print_syscall)
        printf("%s%s(%s) = %s%s\n",
             prefix, 
             syscall_name, syscall_argstr, syscall_retstr,
	     suffix)

    used_caps = 0
    missing_caps = 0
    used_afs = 0
    missing_afs = 0
    print_syscall = 0
    no_memory_deny_write_execute = 0
    delete accessed_devices

    delete thread_argstr[tid()]
  }

probe end
  {
    printf("\nSummary:\n")
    printf("CapabilityBoundingSet=%s\n", caps_to_str(all_used_caps))
    if (all_missing_caps)
	    printf("Consider also missing CapabilityBoundingSet=%s\n", caps_to_str(all_missing_caps))
    foreach ([type, dev] in all_accessed_devices)
      printf("DeviceAllow=%s\n", dev_to_str(type, dev, all_accessed_devices[type, dev]))
    printf("LimitFSIZE=%d\n", highwatermark_fsize)
    printf("LimitDATA=%d\n", highwatermark_data)
    printf("LimitSTACK=%d\n", highwatermark_stack)
    printf("LimitCORE=%d\n", highwatermark_core)
    printf("LimitNOFILE=%d\n", highwatermark_nofile)
    printf("LimitAS=%d\n", highwatermark_as)
    printf("LimitNPROC=%d\n", highwatermark_nproc)
    printf("LimitMEMLOCK=%d\n", highwatermark_memlock)
    printf("LimitSIGPENDING=%d\n", highwatermark_sigpending)
    printf("LimitMSGQUEUE=%d\n", highwatermark_msgqueue)
    printf("LimitNICE=%d\n", highwatermark_nice)
    printf("LimitRTPRIO=%d\n", highwatermark_rtprio)
    printf("RestrictAddressFamilies=%s\n", afs_to_str(all_used_afs))
    if (all_missing_afs)
	    printf("Consider also missing RestrictAddressFamilies=%s\n", afs_to_str(all_missing_afs))
    printf("MemoryDenyWriteExecute=%s\n", all_memory_deny_write_execute)
  }
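
The high-water-mark bookkeeping that report() and the end probe perform can be
sketched outside SystemTap as follows. This is only an illustration in Python
under assumed names (`HighWater`, `observe`, `annotate`, `summary` are
hypothetical and not part of the script); it shows the same idea of keeping a
running maximum per limit, printing "[NAME old -> new]" only when the maximum
has grown since the last report, and emitting Limit*= lines at the end.

```python
class HighWater:
    """Track per-resource maxima and report each increase once."""

    def __init__(self):
        self.current = {}   # highest value observed so far, per resource
        self.reported = {}  # value last included in an annotation

    def observe(self, name, value):
        # Keep only the maximum, like the highwatermark_* globals.
        if value > self.current.get(name, 0):
            self.current[name] = value

    def annotate(self):
        # Build the per-syscall suffix, like the old_highwatermark_* checks.
        parts = []
        for name, value in self.current.items():
            old = self.reported.get(name, 0)
            if value > old:
                parts.append(" [%s %d -> %d]" % (name, old, value))
                self.reported[name] = value
        return "".join(parts)

    def summary(self):
        # Final lines in the style of the end probe.
        return ["Limit%s=%d" % (n, v) for n, v in self.current.items()]


hw = HighWater()
hw.observe("NOFILE", 3)
hw.observe("NOFILE", 2)       # lower value, ignored
hw.observe("STACK", 135168)
print(hw.annotate())          # both resources grew from 0
print(repr(hw.annotate()))    # empty: nothing grew since the last report
hw.observe("NOFILE", 7)
print(hw.annotate())          # only NOFILE is reported again
print(hw.summary())
```

Keeping the "last reported" value separate from the maximum is what lets the
per-syscall output stay quiet until a limit is actually exceeded again.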

end of thread, other threads:[~2016-08-03 18:20 UTC | newest]

Thread overview: 18+ messages
2016-07-15 10:35 [PATCH 00/14] Present useful limits to user (v2) Topi Miettinen
2016-07-15 10:35 ` [PATCH 03/14] resource limits: track highwater mark of file sizes Topi Miettinen
2016-07-15 10:35 ` [PATCH 04/14] resource limits: track highwater mark of VM data segment Topi Miettinen
2016-07-15 10:35 ` [PATCH 06/14] resource limits: track highwater mark of cores dumped Topi Miettinen
2016-07-15 10:35 ` [PATCH 08/14] resource limits: track highwater mark of number of files Topi Miettinen
2016-07-15 12:43 ` [PATCH 00/14] Present useful limits to user (v2) Peter Zijlstra
2016-07-15 13:52   ` Topi Miettinen
2016-07-15 13:59     ` Peter Zijlstra
2016-07-15 16:57       ` Topi Miettinen
2016-07-15 20:54       ` H. Peter Anvin
2016-07-15 13:04 ` Balbir Singh
2016-07-15 16:35   ` Topi Miettinen
2016-07-18 22:05     ` Doug Ledford
2016-07-19 16:53       ` Topi Miettinen
2016-07-15 14:19 ` Richard Weinberger
2016-07-15 17:19   ` Topi Miettinen
2016-07-18 21:25   ` Doug Ledford
2016-08-03 18:20 ` Topi Miettinen
