Hi Michal,

On Thu, 9 Apr 2020 17:25:40 Michal Hocko wrote:
> Your earlier stat snapshot doesn't indicate a big problem with the
> reclaim though:
>
> memory.stat:pgscan 47519855
> memory.stat:pgsteal 44933838
>
> This tells the overall reclaim effectiveness was 94%. Could you try to
> gather snapshots with a 1s granularity starting before your run your
> backup to see how those numbers evolve? Ideally with timestamps to
> compare with the actual stall information.

Attached is a long collection of samples, one per line, in the form

  date  memory.current  memory.stat[pgscan]  memory.stat[pgsteal]

Collection started while the backup was running more or less smoothly
with its memory.high set to 4294967296 (4G instead of 2G) and continued
until the backup finished around 20:22.

The system memory pressure RRD graph shows pressure (around 60) from
about 19:50 to 20:10, and very little the rest of the time (below 1).

This morning I started a new backup run with the cgroup's memory.high
back at the 2G limit, grabbing full info snapshots of the backup cgroup
at 1s intervals in order to get a more complete picture.

My impression is that reclaim is not triggered often enough, or not
strongly enough, relative to the IO performed within the cgroup (a
complete backup covers 130G of data, read in blocks of 128kB at a rate
of ~7MiB/s when running smoothly).

> Another option would be to enable vmscan tracepoints but let's try with
> stats first.

Bruno
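
P.S. For anyone who wants to gather comparable data: below is a minimal
Python sketch of a 1s sampler that prints the same four columns as the
attachment (date, memory.current, pgscan, pgsteal). It is not the script
used for the attached collection. The cgroup path and all function names
are assumptions made for illustration; only the cgroup v2 files
memory.current and memory.stat are taken from the thread.

```python
import time
from datetime import datetime
from pathlib import Path

# Hypothetical cgroup v2 location; adjust to the real backup cgroup.
CGROUP = Path("/sys/fs/cgroup/backup")


def parse_stat(text):
    """Parse the flat 'key value' lines of memory.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def reclaim_efficiency(pgscan, pgsteal):
    """Fraction of scanned pages that were reclaimed (0.0 if nothing scanned)."""
    return pgsteal / pgscan if pgscan else 0.0


def sample(cgroup):
    """Return (memory.current, pgscan, pgsteal) for one cgroup directory."""
    current = int((cgroup / "memory.current").read_text())
    stats = parse_stat((cgroup / "memory.stat").read_text())
    return current, stats.get("pgscan", 0), stats.get("pgsteal", 0)


def run(cgroup=CGROUP, interval=1.0, count=None):
    """Print one timestamped sample per interval; loop forever if count is None."""
    taken = 0
    while count is None or taken < count:
        current, pgscan, pgsteal = sample(cgroup)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(stamp, current, pgscan, pgsteal, flush=True)
        taken += 1
        time.sleep(interval)
```

Calling run() starts the loop; redirecting its output to a file gives a
log that can be lined up against the stall timestamps. pgscan and pgsteal
are cumulative counters, so the efficiency over a given interval should
be computed from the differences between two samples rather than from
the raw values.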