* OOM problems
@ 2010-11-13  7:57 John Weekes
  2010-11-13  8:14 ` Ian Pratt
  2010-11-13 18:15 ` George Shuklin
  0 siblings, 2 replies; 23+ messages in thread
From: John Weekes @ 2010-11-13  7:57 UTC (permalink / raw)
  To: xen-devel

On machines running many HVM (stubdom-based) domains, I often see errors 
like this:

[77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
[77176.524102] Pid: 7478, comm: qemu-dm Not tainted 2.6.32.25-g80f7e08 #2
[77176.524109] Call Trace:
[77176.524123]  [<ffffffff810897fd>] ? T.413+0xcd/0x290
[77176.524129]  [<ffffffff81089ad3>] ? __out_of_memory+0x113/0x180
[77176.524133]  [<ffffffff81089b9e>] ? out_of_memory+0x5e/0xc0
[77176.524140]  [<ffffffff8108d1cb>] ? __alloc_pages_nodemask+0x69b/0x6b0
[77176.524144]  [<ffffffff8108d1f2>] ? __get_free_pages+0x12/0x60
[77176.524152]  [<ffffffff810c94e7>] ? __pollwait+0xb7/0x110
[77176.524161]  [<ffffffff81262b93>] ? n_tty_poll+0x183/0x1d0
[77176.524165]  [<ffffffff8125ea42>] ? tty_poll+0x92/0xa0
[77176.524169]  [<ffffffff810c8a92>] ? do_select+0x362/0x670
[77176.524173]  [<ffffffff810c9430>] ? __pollwait+0x0/0x110
[77176.524178]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524183]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524188]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524193]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524197]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524202]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524207]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524212]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524217]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524222]  [<ffffffff810c8fb5>] ? core_sys_select+0x215/0x350
[77176.524231]  [<ffffffff810100af>] ? xen_restore_fl_direct_end+0x0/0x1
[77176.524236]  [<ffffffff8100c48d>] ? xen_mc_flush+0x8d/0x1b0
[77176.524243]  [<ffffffff81014ffb>] ? xen_hypervisor_callback+0x1b/0x20
[77176.524251]  [<ffffffff814b0f5a>] ? error_exit+0x2a/0x60
[77176.524255]  [<ffffffff8101485d>] ? retint_restore_args+0x5/0x6
[77176.524263]  [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
[77176.524268]  [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
[77176.524276]  [<ffffffff810663d1>] ? ktime_get_ts+0x61/0xd0
[77176.524281]  [<ffffffff810c9354>] ? sys_select+0x44/0x120
[77176.524286]  [<ffffffff81013f02>] ? system_call_fastpath+0x16/0x1b
[77176.524290] Mem-Info:
[77176.524293] DMA per-cpu:
[77176.524296] CPU    0: hi:    0, btch:   1 usd:   0
[77176.524300] CPU    1: hi:    0, btch:   1 usd:   0
[77176.524303] CPU    2: hi:    0, btch:   1 usd:   0
[77176.524306] CPU    3: hi:    0, btch:   1 usd:   0
[77176.524310] CPU    4: hi:    0, btch:   1 usd:   0
[77176.524313] CPU    5: hi:    0, btch:   1 usd:   0
[77176.524316] CPU    6: hi:    0, btch:   1 usd:   0
[77176.524318] CPU    7: hi:    0, btch:   1 usd:   0
[77176.524322] CPU    8: hi:    0, btch:   1 usd:   0
[77176.524324] CPU    9: hi:    0, btch:   1 usd:   0
[77176.524327] CPU   10: hi:    0, btch:   1 usd:   0
[77176.524330] CPU   11: hi:    0, btch:   1 usd:   0
[77176.524333] CPU   12: hi:    0, btch:   1 usd:   0
[77176.524336] CPU   13: hi:    0, btch:   1 usd:   0
[77176.524339] CPU   14: hi:    0, btch:   1 usd:   0
[77176.524342] CPU   15: hi:    0, btch:   1 usd:   0
[77176.524345] CPU   16: hi:    0, btch:   1 usd:   0
[77176.524348] CPU   17: hi:    0, btch:   1 usd:   0
[77176.524351] CPU   18: hi:    0, btch:   1 usd:   0
[77176.524354] CPU   19: hi:    0, btch:   1 usd:   0
[77176.524358] CPU   20: hi:    0, btch:   1 usd:   0
[77176.524364] CPU   21: hi:    0, btch:   1 usd:   0
[77176.524367] CPU   22: hi:    0, btch:   1 usd:   0
[77176.524370] CPU   23: hi:    0, btch:   1 usd:   0
[77176.524372] DMA32 per-cpu:
[77176.524374] CPU    0: hi:  186, btch:  31 usd:  81
[77176.524377] CPU    1: hi:  186, btch:  31 usd:  66
[77176.524380] CPU    2: hi:  186, btch:  31 usd:  49
[77176.524385] CPU    3: hi:  186, btch:  31 usd:  67
[77176.524387] CPU    4: hi:  186, btch:  31 usd:  93
[77176.524390] CPU    5: hi:  186, btch:  31 usd:  73
[77176.524393] CPU    6: hi:  186, btch:  31 usd:  50
[77176.524396] CPU    7: hi:  186, btch:  31 usd:  79
[77176.524399] CPU    8: hi:  186, btch:  31 usd:  21
[77176.524402] CPU    9: hi:  186, btch:  31 usd:  38
[77176.524406] CPU   10: hi:  186, btch:  31 usd:   0
[77176.524409] CPU   11: hi:  186, btch:  31 usd:  75
[77176.524412] CPU   12: hi:  186, btch:  31 usd:   1
[77176.524414] CPU   13: hi:  186, btch:  31 usd:   4
[77176.524417] CPU   14: hi:  186, btch:  31 usd:   9
[77176.524420] CPU   15: hi:  186, btch:  31 usd:   0
[77176.524423] CPU   16: hi:  186, btch:  31 usd:  56
[77176.524426] CPU   17: hi:  186, btch:  31 usd:  35
[77176.524429] CPU   18: hi:  186, btch:  31 usd:  32
[77176.524432] CPU   19: hi:  186, btch:  31 usd:  39
[77176.524435] CPU   20: hi:  186, btch:  31 usd:  24
[77176.524438] CPU   21: hi:  186, btch:  31 usd:   0
[77176.524441] CPU   22: hi:  186, btch:  31 usd:  35
[77176.524444] CPU   23: hi:  186, btch:  31 usd:  51
[77176.524447] Normal per-cpu:
[77176.524449] CPU    0: hi:  186, btch:  31 usd:  29
[77176.524453] CPU    1: hi:  186, btch:  31 usd:   1
[77176.524456] CPU    2: hi:  186, btch:  31 usd:  30
[77176.524459] CPU    3: hi:  186, btch:  31 usd:  30
[77176.524463] CPU    4: hi:  186, btch:  31 usd:  30
[77176.524466] CPU    5: hi:  186, btch:  31 usd:  31
[77176.524469] CPU    6: hi:  186, btch:  31 usd:   0
[77176.524471] CPU    7: hi:  186, btch:  31 usd:   0
[77176.524474] CPU    8: hi:  186, btch:  31 usd:  30
[77176.524477] CPU    9: hi:  186, btch:  31 usd:  28
[77176.524480] CPU   10: hi:  186, btch:  31 usd:   0
[77176.524483] CPU   11: hi:  186, btch:  31 usd:  30
[77176.524486] CPU   12: hi:  186, btch:  31 usd:   0
[77176.524489] CPU   13: hi:  186, btch:  31 usd:   0
[77176.524492] CPU   14: hi:  186, btch:  31 usd:   0
[77176.524495] CPU   15: hi:  186, btch:  31 usd:   0
[77176.524498] CPU   16: hi:  186, btch:  31 usd:   0
[77176.524501] CPU   17: hi:  186, btch:  31 usd:   0
[77176.524504] CPU   18: hi:  186, btch:  31 usd:   0
[77176.524507] CPU   19: hi:  186, btch:  31 usd:   0
[77176.524510] CPU   20: hi:  186, btch:  31 usd:   0
[77176.524513] CPU   21: hi:  186, btch:  31 usd:   0
[77176.524516] CPU   22: hi:  186, btch:  31 usd:   0
[77176.524518] CPU   23: hi:  186, btch:  31 usd:   0
[77176.524524] active_anon:5675 inactive_anon:4676 isolated_anon:0
[77176.524526]  active_file:146373 inactive_file:153543 isolated_file:480
[77176.524527]  unevictable:0 dirty:167539 writeback:322 unstable:0
[77176.524528]  free:5017 slab_reclaimable:15640 slab_unreclaimable:8972
[77176.524529]  mapped:1114 shmem:7 pagetables:1908 bounce:0
[77176.524536] DMA free:9820kB min:32kB low:40kB high:48kB 
active_anon:4kB inactive_anon:0kB active_file:616kB inactive_file:2212kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12740kB 
mlocked:0kB dirty:2292kB writeback:0kB mapped:0kB shmem:0kB 
slab_reclaimable:72kB slab_unreclaimable:108kB kernel_stack:0kB 
pagetables:12kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:3040 all_unreclaimable? no
[77176.524541] lowmem_reserve[]: 0 1428 2452 2452
[77176.524551] DMA32 free:7768kB min:3680kB low:4600kB high:5520kB 
active_anon:22696kB inactive_anon:18704kB active_file:584580kB 
inactive_file:608508kB unevictable:0kB isolated(anon):0kB 
isolated(file):1920kB present:1462496kB mlocked:0kB dirty:664128kB 
writeback:1276kB mapped:4456kB shmem:28kB slab_reclaimable:62076kB 
slab_unreclaimable:32292kB kernel_stack:5120kB pagetables:7620kB 
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1971808 
all_unreclaimable? yes
[77176.524556] lowmem_reserve[]: 0 0 1024 1024
[77176.524564] Normal free:2480kB min:2636kB low:3292kB high:3952kB 
active_anon:0kB inactive_anon:0kB active_file:296kB inactive_file:3452kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1048700kB 
mlocked:0kB dirty:3736kB writeback:12kB mapped:0kB shmem:0kB 
slab_reclaimable:412kB slab_unreclaimable:3488kB kernel_stack:80kB 
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:8192 all_unreclaimable? yes
[77176.524569] lowmem_reserve[]: 0 0 0 0
[77176.524574] DMA: 4*4kB 25*8kB 11*16kB 7*32kB 8*64kB 8*128kB 8*256kB 
3*512kB 0*1024kB 0*2048kB 1*4096kB = 9832kB
[77176.524587] DMA32: 742*4kB 118*8kB 3*16kB 3*32kB 2*64kB 0*128kB 
0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7768kB
[77176.524600] Normal: 1*4kB 1*8kB 2*16kB 13*32kB 14*64kB 2*128kB 
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB
[77176.524613] 302308 total pagecache pages
[77176.524615] 1619 pages in swap cache
[77176.524617] Swap cache stats: add 40686, delete 39067, find 24687/26036
[77176.524619] Free swap  = 10141956kB
[77176.524621] Total swap = 10239992kB
[77176.577607] 793456 pages RAM
[77176.577611] 436254 pages reserved
[77176.577613] 308627 pages shared
[77176.577615] 49249 pages non-shared
[77176.577620] Out of memory: kill process 5755 (python2.6) score 110492 
or a child
[77176.577623] Killed process 5757 (python2.6)

Depending on what gets nuked by the OOM-killer, I am frequently left 
with an unusable system that needs to be rebooted.

The machine always has plenty of memory available (1.5 GB devoted to 
dom0, of which >1 GB is always just in "cached" state). For instance, 
right now, on this same machine:

# free
              total       used       free     shared    buffers     cached
Mem:       1536512    1493112      43400          0      10284    1144904
-/+ buffers/cache:     337924    1198588
Swap:     10239992      74444   10165548

I have seen this OOM problem on a wide range of Xen versions, stretching 
as far back as I can remember, including the most recent 4.1-unstable 
and 2.6.32 pvops kernel (from yesterday, tested in the hope that they 
would fix this).  I haven't found a way to reliably reproduce it yet, 
but I suspect that the problem relates to reasonably heavy disk or 
network activity -- during this last one, I see that a domain was 
briefly doing ~200 Mbps of downloads.

Anyone have any ideas on what this could be? Is RAM getting 
spontaneously filled because a buffer somewhere grows too quickly, or 
something like that? What can I try here?

-John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-13  7:57 OOM problems John Weekes
@ 2010-11-13  8:14 ` Ian Pratt
  2010-11-13  8:27   ` John Weekes
  2010-11-13 18:15 ` George Shuklin
  1 sibling, 1 reply; 23+ messages in thread
From: Ian Pratt @ 2010-11-13  8:14 UTC (permalink / raw)
  To: John Weekes, xen-devel; +Cc: Ian Pratt

> On machines running many HVM (stubdom-based) domains, I often see errors
> like this:
> 
> [77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0,
> oom_adj=0

What do the guests use for storage? (e.g. "blktap2 for VHD files on an iscsi mounted ext3 volume")

It might be worth looking at /proc/slabinfo to see if there's anything suspicious.

BTW: 24 vCPUs in dom0 seems excessive, especially if you're using stubdoms. You may get better performance by dropping that to e.g. 2 or 3.
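
For reference, that is usually done on the Xen command line in the boot loader entry, roughly like this (the exact entry varies per install; the values are only an example):

    kernel /boot/xen.gz dom0_mem=1536M dom0_max_vcpus=3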

Ian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-13  8:14 ` Ian Pratt
@ 2010-11-13  8:27   ` John Weekes
  2010-11-13  9:13     ` Ian Pratt
  0 siblings, 1 reply; 23+ messages in thread
From: John Weekes @ 2010-11-13  8:27 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

 > What do the guests use for storage? (e.g. "blktap2 for VHD files on 
an iscsi mounted ext3 volume")

Simple sparse .img files on a local ext4 RAID volume, using "file:".
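
For reference, a "file:" disk line of that sort in the domain config looks roughly like this (path and device name are placeholders, not my actual layout):

    disk = [ 'file:/var/xen/images/guest.img,hda,w' ]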

 > It might be worth looking at /proc/kernel/slabinfo to see if there's 
anything suspicious.

I didn't see anything suspicious in there, but I'm not sure what I'm 
looking for.

Here is the first page of slabtop as it currently stands, if that helps. 
It looks a bit easier to read.

  Active / Total Objects (% used)    : 274753 / 507903 (54.1%)
  Active / Total Slabs (% used)      : 27573 / 27582 (100.0%)
  Active / Total Caches (% used)     : 85 / 160 (53.1%)
  Active / Total Size (% used)       : 75385.52K / 107127.41K (70.4%)
  Minimum / Average / Maximum Object : 0.02K / 0.21K / 4096.00K

   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
306397 110621  36%    0.10K   8281       37     33124K buffer_head
  37324  26606  71%    0.54K   5332        7     21328K radix_tree_node
  25640  25517  99%    0.19K   1282       20      5128K size-192
  23472  23155  98%    0.08K    489       48      1956K sysfs_dir_cache
  19964  19186  96%    0.95K   4991        4     19964K ext4_inode_cache
  17860  13026  72%    0.19K    893       20      3572K dentry
  14896  13057  87%    0.03K    133      112       532K size-32
   8316   6171  74%    0.17K    378       22      1512K vm_area_struct
   8142   5053  62%    0.06K    138       59       552K size-64
   4320   3389  78%    0.12K    144       30       576K size-128
   3760   2226  59%    0.19K    188       20       752K filp
   3456   1875  54%    0.02K     24      144        96K anon_vma
   3380   3001  88%    1.00K    845        4      3380K size-1024
   3380   3365  99%    0.76K    676        5      2704K shmem_inode_cache
   2736   2484  90%    0.50K    342        8      1368K size-512
   2597   2507  96%    0.07K     49       53       196K Acpi-Operand
   2100   1095  52%    0.25K    140       15       560K skbuff_head_cache
   1920    819  42%    0.12K     64       30       256K cred_jar
   1361   1356  99%    4.00K   1361        1      5444K size-4096
   1230    628  51%    0.12K     41       30       164K pid
   1008    907  89%    0.03K      9      112        36K Acpi-Namespace
    959    496  51%    0.57K    137        7       548K inode_cache
    891    554  62%    0.81K     99        9       792K signal_cache
    888    115  12%    0.10K     24       37        96K ext4_prealloc_space
    885    122  13%    0.06K     15       59        60K fs_cache
    850    642  75%    1.45K    170        5      1360K task_struct
    820    769  93%    0.19K     41       20       164K bio-0
    666    550  82%    2.06K    222        3      1776K sighand_cache
    576    211  36%    0.50K     72        8       288K task_xstate
    529    379  71%    0.16K     23       23        92K cfq_queue
    518    472  91%    2.00K    259        2      1036K size-2048
    506    375  74%    0.16K     22       23        88K cfq_io_context
    495    353  71%    0.33K     45       11       180K blkdev_requests
    465    422  90%    0.25K     31       15       124K size-256
    418    123  29%    0.69K     38       11       304K files_cache
    360    207  57%    0.69K     72        5       288K sock_inode_cache
    360    251  69%    0.12K     12       30        48K scsi_sense_cache
    336    115  34%    0.08K      7       48        28K blkdev_ioc
    285    236  82%    0.25K     19       15        76K scsi_cmd_cache


 > BTW: 24 vCPUs in dom0 seems a excessive, especially if you're using 
stubdoms. You may get better performance by dropping that to e.g. 2 or 3.

I will test that. Do you think it will make a difference in this case?

-John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-13  8:27   ` John Weekes
@ 2010-11-13  9:13     ` Ian Pratt
  2010-11-13  9:43       ` John Weekes
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Ian Pratt @ 2010-11-13  9:13 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Pratt, xen-devel

>  > What do the guests use for storage? (e.g. "blktap2 for VHD files on
> an iscsi mounted ext3 volume")
> 
> Simple sparse .img files on a local ext4 RAID volume, using "file:".

Ah, if you're using loop it may be that you're just filling memory with dirty pages. Older kernels certainly did this, not sure about newer ones.

I'd be inclined to use blktap2 in raw file mode, with "aio:".

Ian

 
>  > It might be worth looking at /proc/kernel/slabinfo to see if there's
> anything suspicious.
> 
> I didn't see anything suspicious in there, but I'm not sure what I'm
> looking for.
> 
> Here is the first page of slabtop as it currently stands, if that helps.
> It looks a bit easier to read.
> 
>   Active / Total Objects (% used)    : 274753 / 507903 (54.1%)
>   Active / Total Slabs (% used)      : 27573 / 27582 (100.0%)
>   Active / Total Caches (% used)     : 85 / 160 (53.1%)
>   Active / Total Size (% used)       : 75385.52K / 107127.41K (70.4%)
>   Minimum / Average / Maximum Object : 0.02K / 0.21K / 4096.00K
> 
>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 306397 110621  36%    0.10K   8281       37     33124K buffer_head
>   37324  26606  71%    0.54K   5332        7     21328K radix_tree_node
>   25640  25517  99%    0.19K   1282       20      5128K size-192
>   23472  23155  98%    0.08K    489       48      1956K sysfs_dir_cache
>   19964  19186  96%    0.95K   4991        4     19964K ext4_inode_cache
>   17860  13026  72%    0.19K    893       20      3572K dentry
>   14896  13057  87%    0.03K    133      112       532K size-32
>    8316   6171  74%    0.17K    378       22      1512K vm_area_struct
>    8142   5053  62%    0.06K    138       59       552K size-64
>    4320   3389  78%    0.12K    144       30       576K size-128
>    3760   2226  59%    0.19K    188       20       752K filp
>    3456   1875  54%    0.02K     24      144        96K anon_vma
>    3380   3001  88%    1.00K    845        4      3380K size-1024
>    3380   3365  99%    0.76K    676        5      2704K shmem_inode_cache
>    2736   2484  90%    0.50K    342        8      1368K size-512
>    2597   2507  96%    0.07K     49       53       196K Acpi-Operand
>    2100   1095  52%    0.25K    140       15       560K skbuff_head_cache
>    1920    819  42%    0.12K     64       30       256K cred_jar
>    1361   1356  99%    4.00K   1361        1      5444K size-4096
>    1230    628  51%    0.12K     41       30       164K pid
>    1008    907  89%    0.03K      9      112        36K Acpi-Namespace
>     959    496  51%    0.57K    137        7       548K inode_cache
>     891    554  62%    0.81K     99        9       792K signal_cache
>     888    115  12%    0.10K     24       37        96K
> ext4_prealloc_space
>     885    122  13%    0.06K     15       59        60K fs_cache
>     850    642  75%    1.45K    170        5      1360K task_struct
>     820    769  93%    0.19K     41       20       164K bio-0
>     666    550  82%    2.06K    222        3      1776K sighand_cache
>     576    211  36%    0.50K     72        8       288K task_xstate
>     529    379  71%    0.16K     23       23        92K cfq_queue
>     518    472  91%    2.00K    259        2      1036K size-2048
>     506    375  74%    0.16K     22       23        88K cfq_io_context
>     495    353  71%    0.33K     45       11       180K blkdev_requests
>     465    422  90%    0.25K     31       15       124K size-256
>     418    123  29%    0.69K     38       11       304K files_cache
>     360    207  57%    0.69K     72        5       288K sock_inode_cache
>     360    251  69%    0.12K     12       30        48K scsi_sense_cache
>     336    115  34%    0.08K      7       48        28K blkdev_ioc
>     285    236  82%    0.25K     19       15        76K scsi_cmd_cache
> 
> 
>  > BTW: 24 vCPUs in dom0 seems a excessive, especially if you're using
> stubdoms. You may get better performance by dropping that to e.g. 2 or 3.
> 
> I will test that. Do you think it will make a difference in this case?
> 
> -John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-13  9:13     ` Ian Pratt
@ 2010-11-13  9:43       ` John Weekes
  2010-11-13 10:19       ` John Weekes
  2010-11-15  8:55       ` Jan Beulich
  2 siblings, 0 replies; 23+ messages in thread
From: John Weekes @ 2010-11-13  9:43 UTC (permalink / raw)
  To: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 697 bytes --]

On 11/13/2010 1:13 AM, Ian Pratt wrote:
>> >  What do the guests use for storage? (e.g. "blktap2 for VHD files on
>> an iscsi mounted ext3 volume")
>>
>> Simple sparse .img files on a local ext4 RAID volume, using "file:".
> Ah, if you're using loop it may be that you're just filling memory 
> with dirty pages. Older kernels certainly did this, not sure about 
> newer ones.
>
> I'd be inclined to use blktap2 in raw file mode, with "aio:".
>
> Ian

That makes sense. tap/tap2 didn't work for me in prior releases, so I 
had to stick to file. It seems to work now (well, tap2:tapdisk:aio does; 
tap:tapdisk:aio still doesn't), so I'll switch everything over to it, 
and cross my fingers.
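
Concretely, the change amounts to rewriting the disk lines along these lines (paths are placeholders):

    # before: loop-backed file, goes through the dom0 buffer cache
    disk = [ 'file:/var/xen/images/guest.img,hda,w' ]
    # after: blktap2 in raw mode with aio (direct I/O)
    disk = [ 'tap2:tapdisk:aio:/var/xen/images/guest.img,hda,w' ]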

-John

[-- Attachment #1.2: Type: text/html, Size: 1498 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-13  9:13     ` Ian Pratt
  2010-11-13  9:43       ` John Weekes
@ 2010-11-13 10:19       ` John Weekes
  2010-11-14  9:53         ` Daniel Stodden
  2010-11-15  8:55       ` Jan Beulich
  2 siblings, 1 reply; 23+ messages in thread
From: John Weekes @ 2010-11-13 10:19 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

On 11/13/2010 1:13 AM, Ian Pratt wrote:
 > Ah, if you're using loop it may be that you're just filling memory 
with dirty pages. Older kernels certainly did this, not sure about newer 
ones.
 > I'd be inclined to use blktap2 in raw file mode, with "aio:".

With blktap2, is free RAM in dom0 still used for a disk cache at all? I 
have this dom0 set to 1.5 GB mainly to help with caching; if that RAM is 
not needed, I'll retool it down to a smaller number.

Thanks,
John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-13  7:57 OOM problems John Weekes
  2010-11-13  8:14 ` Ian Pratt
@ 2010-11-13 18:15 ` George Shuklin
  1 sibling, 0 replies; 23+ messages in thread
From: George Shuklin @ 2010-11-13 18:15 UTC (permalink / raw)
  To: xen-devel

This kind of bug seems most visible in the Debian kernel, but I was able
to reproduce it in every kernel available to me (SUSE 2.6.34 and RHEL 2.6.18).

The only solution I have found to stop the OOM killer from going after
innocent processes is to disable memory overcommitment:

1) Set up swap equal to 50% of RAM or more
2) Set vm.overcommit_memory = 2

Under these conditions only the Debian Lenny kernel still misbehaves
(forget it and throw it away); all other kernels work fine: they NEVER
reach an OOM state (though allocations can still fail with MemoryError
when memory really is exhausted).

If you disable the swap file, all overcommitted memory has to come from
real memory, so allocations fail with MemoryError before real memory
actually runs out.
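
As a rough sketch of that setup on dom0 (pick your own swap size, at least 50% of RAM as above):

    # apply immediately
    sysctl -w vm.overcommit_memory=2
    # make it persistent across reboots
    echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
    # with mode 2 the commit limit is swap + vm.overcommit_ratio% of RAM (default 50)
    free -m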


On Fri, 2010-11-12 at 23:57 -0800, John Weekes wrote:
> On machines running many HVM (stubdom-based) domains, I often see errors 
> like this:
> 
> [...]
> 
> Depending on what gets nuked by the OOM-killer, I am frequently left 
> with an unusable system that needs to be rebooted.
> 
> The machine always has plenty of memory available (1.5 GB devoted to 
> dom0, of which >1 GB is always just in "cached" state). For instance, 
> right now, on this same machine:
> 
> # free
>               total       used       free     shared    buffers     cached
> Mem:       1536512    1493112      43400          0      10284    1144904
> -/+ buffers/cache:     337924    1198588
> Swap:     10239992      74444   10165548
> 
> I have seen this OOM problem on a wide range of Xen versions, stretching 
> as far back as I can remember, including the most recent 4.1-unstable 
> and 2.6.32 pvops kernel (from yesterday, tested in the hope that they 
> would fix this).  I haven't found a way to reliably reproduce it yet, 
> but I suspect that the problem relates to reasonably heavy disk or 
> network activity -- during this last one, I see that a domain was 
> briefly doing ~200 Mbps of downloads.
> 
> Anyone have any ideas on what this could be? Is RAM getting 
> spontaneously filled because a buffer somewhere grows too quickly, or 
> something like that? What can I try here?
> 
> -John
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-13 10:19       ` John Weekes
@ 2010-11-14  9:53         ` Daniel Stodden
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Stodden @ 2010-11-14  9:53 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Pratt, xen-devel

On Sat, 2010-11-13 at 05:19 -0500, John Weekes wrote:
> On 11/13/2010 1:13 AM, Ian Pratt wrote:
>  > Ah, if you're using loop it may be that you're just filling memory 
> with dirty pages. Older kernels certainly did this, not sure about newer 
> ones.
>  > I'd be inclined to use blktap2 in raw file mode, with "aio:".
> 
> With blktap2, is free RAM in dom0 still used for a disk cache at all? I 
> have this dom0 set to 1.5 GB mainly to help with caching; if that RAM is 
> not needed, I'll retool it down to a smaller number.

If you're not using cloned images derived from a shared parent image,
that caching won't buy anyone much. The memory is better spent on the
guests themselves, and thereby on their own caches.

Keep an eye on /proc/meminfo; it largely depends on the number and type
of guests, but it's probably safe to reassign ~800M straight away.
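
Roughly, assuming the xm toolstack and dom0 ballooning enabled (the 768M
target is only an example):

    grep -E 'MemTotal|MemFree|Cached|Dirty' /proc/meminfo
    xm mem-set Domain-0 768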

blktap2 with aio moves the datapath to direct I/O. Compared to buffered
loop devices, that also brings a notable benefit in crash consistency.

Daniel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-13  9:13     ` Ian Pratt
  2010-11-13  9:43       ` John Weekes
  2010-11-13 10:19       ` John Weekes
@ 2010-11-15  8:55       ` Jan Beulich
  2010-11-15  9:40         ` Daniel Stodden
  2010-11-15 14:17         ` Stefano Stabellini
  2 siblings, 2 replies; 23+ messages in thread
From: Jan Beulich @ 2010-11-15  8:55 UTC (permalink / raw)
  To: Ian Pratt, John Weekes; +Cc: xen-devel

>>> On 13.11.10 at 10:13, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
>>   > What do the guests use for storage? (e.g. "blktap2 for VHD files on
>> an iscsi mounted ext3 volume")
>> 
>> Simple sparse .img files on a local ext4 RAID volume, using "file:".
> 
> Ah, if you're using loop it may be that you're just filling memory with 
> dirty pages. Older kernels certainly did this, not sure about newer ones.

Shouldn't this lead to the calling process being throttled, instead of
the system running into OOM?

Further, having received reports of similar problems lately too, we have
indications that using PV drivers also gets around the issue, which
makes me think that it's rather qemu-dm misbehaving (and not being
stopped by the kernel for whatever reason - possibly just a missing
non-infinite rlimit setting).
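
For what it's worth, a quick way to check the limits of a running
qemu-dm (the pid lookup here is only illustrative):

    cat /proc/$(pidof qemu-dm | awk '{print $1}')/limits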

Not knowing much about the workings of stubdom, one thing I
don't really understand is how qemu-dm in Dom0 would be
heavily resource consuming here (actually I would have expected
no qemu-dm in Dom0 at all in this case). Aren't the main I/O paths
going from qemu-stubdom directly to the backends?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-15  8:55       ` Jan Beulich
@ 2010-11-15  9:40         ` Daniel Stodden
  2010-11-15  9:57           ` Jan Beulich
  2010-11-15 17:59           ` John Weekes
  2010-11-15 14:17         ` Stefano Stabellini
  1 sibling, 2 replies; 23+ messages in thread
From: Daniel Stodden @ 2010-11-15  9:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Pratt, xen-devel, John Weekes

On Mon, 2010-11-15 at 03:55 -0500, Jan Beulich wrote:
> >>> On 13.11.10 at 10:13, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
> >>   > What do the guests use for storage? (e.g. "blktap2 for VHD files on
> >> an iscsi mounted ext3 volume")
> >> 
> >> Simple sparse .img files on a local ext4 RAID volume, using "file:".
> > 
> > Ah, if you're using loop it may be that you're just filling memory with 
> > dirty pages. Older kernels certainly did this, not sure about newer ones.
> 
> Shouldn't this lead to the calling process being throttled, instead of
> the system running into OOM?

They are throttled, but the single control I'm aware of
is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays), which is only
per process, not a global limit. It could well be that's part of the
problem -- outwitting the mm with just too many writers on too many cores?

We had a bit of trouble when switching dom0 to 2.6.32; buffered writes
made it much easier than with e.g. 2.6.27 to drive everybody else into
costly reclaims.

The OOM shown here reports ~650M in dirty pages. The fact alone
that this counts as an OOM condition doesn't sound quite right in
itself. Qemu might just have dared to ask at the wrong point in
time.

Just to get an idea -- how many guests did this box carry?

> Further, having got reports of similar problems lately, too, we have
> indications that using pv drivers also gets us around the issue,
> which makes me think that it's rather qemu-dm misbehaving (and
> not getting stopped doing so by the kernel for whatever reason -
> possibly just missing some non-infinite rlimit setting).
> 
> Not knowing much about the workings of stubdom, one thing I
> don't really understand is how qemu-dm in Dom0 would be
> heavily resource consuming here (actually I would have expected
> no qemu-dm in Dom0 at all in this case). Aren't the main I/O paths
> going from qemu-stubdom directly to the backends?
> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-15  9:40         ` Daniel Stodden
@ 2010-11-15  9:57           ` Jan Beulich
  2010-11-15 17:59           ` John Weekes
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2010-11-15  9:57 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ian Pratt, xen-devel, John Weekes

>>> On 15.11.10 at 10:40, Daniel Stodden <daniel.stodden@citrix.com> wrote:
> On Mon, 2010-11-15 at 03:55 -0500, Jan Beulich wrote:
>> >>> On 13.11.10 at 10:13, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
>> >>   > What do the guests use for storage? (e.g. "blktap2 for VHD files on
>> >> an iscsi mounted ext3 volume")
>> >> 
>> >> Simple sparse .img files on a local ext4 RAID volume, using "file:".
>> > 
>> > Ah, if you're using loop it may be that you're just filling memory with 
>> > dirty pages. Older kernels certainly did this, not sure about newer ones.
>> 
>> Shouldn't this lead to the calling process being throttled, instead of
>> the system running into OOM?
> 
> They are throttled, but the single control I'm aware of
> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> per process, not a global limit. Could well be that's part of the
> problem -- outwitting mm with just too many writers on too many cores?
> 
> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
> made it much easier than with e.g. 2.6.27 to drive everybody else into
> costly reclaims.
> 
> The Oom shown here reports about ~650M in dirty pages. The fact alone
> that this counts as on oom condition doesn't sound quite right in
> itself. That qemu might just have dared to ask at the wrong point in
> time.

Indeed - dirty pages alone shouldn't resolve to OOM.

> Just to get an idea -- how many guests did this box carry?

From what we know this requires just a single (Windows 7 or some
such) guest, provided the guest has more memory than Dom0.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-15  8:55       ` Jan Beulich
  2010-11-15  9:40         ` Daniel Stodden
@ 2010-11-15 14:17         ` Stefano Stabellini
  1 sibling, 0 replies; 23+ messages in thread
From: Stefano Stabellini @ 2010-11-15 14:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ian Pratt, xen-devel, John Weekes

On Mon, 15 Nov 2010, Jan Beulich wrote:
> >>> On 13.11.10 at 10:13, Ian Pratt <Ian.Pratt@eu.citrix.com> wrote:
> >>   > What do the guests use for storage? (e.g. "blktap2 for VHD files on
> >> an iscsi mounted ext3 volume")
> >> 
> >> Simple sparse .img files on a local ext4 RAID volume, using "file:".
> > 
> > Ah, if you're using loop it may be that you're just filling memory with 
> > dirty pages. Older kernels certainly did this, not sure about newer ones.
> 
> Shouldn't this lead to the calling process being throttled, instead of
> the system running into OOM?
> 
> Further, having got reports of similar problems lately, too, we have
> indications that using pv drivers also gets us around the issue,
> which makes me think that it's rather qemu-dm misbehaving (and
> not getting stopped doing so by the kernel for whatever reason -
> possibly just missing some non-infinite rlimit setting).
> 
> Not knowing much about the workings of stubdom, one thing I
> don't really understand is how qemu-dm in Dom0 would be
> heavily resource consuming here (actually I would have expected
> no qemu-dm in Dom0 at all in this case). Aren't the main I/O paths
> going from qemu-stubdom directly to the backends?
> 

Qemu-dm in a stubdom uses the blkfront and netfront drivers in
Minios to communicate with the backends in dom0.
In a stubdom-only scenario qemu-dm in dom0 only provides the xenfb
backend for the vesa framebuffer.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-15  9:40         ` Daniel Stodden
  2010-11-15  9:57           ` Jan Beulich
@ 2010-11-15 17:59           ` John Weekes
  2010-11-16 19:54             ` John Weekes
  1 sibling, 1 reply; 23+ messages in thread
From: John Weekes @ 2010-11-15 17:59 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ian Pratt, xen-devel, Jan Beulich


> They are throttled, but the single control I'm aware of
> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> per process, not a global limit. Could well be that's part of the
> problem -- outwitting mm with just too many writers on too many cores?
>
> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
> made it much easier than with e.g. 2.6.27 to drive everybody else into
> costly reclaims.
>
> The Oom shown here reports about ~650M in dirty pages. The fact alone
> that this counts as on oom condition doesn't sound quite right in
> itself. That qemu might just have dared to ask at the wrong point in
> time.
>
> Just to get an idea -- how many guests did this box carry?

It carries about two dozen guests, with a mix of mostly HVMs (all 
stubdom-based, some with PV-on-HVM drivers) and some PV.

This problem occurred more often for me under 2.6.32 than 2.6.31, I 
noticed. Since I made the switch to aio, I haven't seen a crash, but it 
hasn't been long enough for that to mean much.

Having extra caching in the dom0 is nice because it allows for domUs to 
get away with having small amounts of free memory, while still having 
very good (much faster than hardware) write performance. If you have a 
large number of domUs that are all memory-constrained and use the disk 
in infrequent, large bursts, this can work out pretty well, since the 
big communal pool provides a better value proposition than giving each 
domU a few more megabytes of RAM.

If the OOM problem isn't something that can be fixed, it might be a good 
idea to print out a warning to the user when a domain using "file:" is 
started. Or, to go a step further and automatically run "file" based 
domains as though "aio" was specified, possibly with a warning and a way 
to override that behavior. It's not really intuitive that "file" would 
cause crashes.

-John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-15 17:59           ` John Weekes
@ 2010-11-16 19:54             ` John Weekes
  2010-11-17 20:10               ` Ian Pratt
  0 siblings, 1 reply; 23+ messages in thread
From: John Weekes @ 2010-11-16 19:54 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ian Pratt, xen-devel, Jan Beulich

Performance is noticeably lower with aio on these bursty write 
workloads; I've been getting a number of complaints.

I see that 2.6.36 has some page_writeback changes:
http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.6%2Fpatch-2.6.36.bz2;z=8379 
. Any thoughts on whether these would make a difference for the problems 
with "file:"? I'm still trying to find a way to reproduce the issue in 
the lab, so I'd have to test the patch in production -- that's not a 
tantalizing prospect, unless there is a real chance that it will affect it.

-John

On 11/15/2010 9:59 AM, John Weekes wrote:
>
>> They are throttled, but the single control I'm aware of
>> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
>> per process, not a global limit. Could well be that's part of the
>> problem -- outwitting mm with just too many writers on too many cores?
>>
>> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
>> made it much easier than with e.g. 2.6.27 to drive everybody else into
>> costly reclaims.
>>
>> The Oom shown here reports about ~650M in dirty pages. The fact alone
>> that this counts as on oom condition doesn't sound quite right in
>> itself. That qemu might just have dared to ask at the wrong point in
>> time.
>>
>> Just to get an idea -- how many guests did this box carry?
>
> It carries about two dozen guests, with a mix of mostly HVMs (all 
> stubdom-based, some with PV-on-HVM drivers) and some PV.
>
> This problem occurred more often for me under 2.6.32 than 2.6.31, I 
> noticed. Since I made the switch to aio, I haven't seen a crash, but 
> it hasn't been long enough for that to mean much.
>
> Having extra caching in the dom0 is nice because it allows for domUs 
> to get away with having small amounts of free memory, while still 
> having very good (much faster than hardware) write performance. If you 
> have a large number of domUs that are all memory-constrained and use 
> the disk in infrequent, large bursts, this can work out pretty well, 
> since the big communal pool provides a better value proposition than 
> giving each domU a few more megabytes of RAM.
>
> If the OOM problem isn't something that can be fixed, it might be a 
> good idea to print out a warning to the user when a domain using 
> "file:" is started. Or, to go a step further and automatically run 
> "file" based domains as though "aio" was specified, possibly with a 
> warning and a way to override that behavior. It's not really intuitive 
> that "file" would cause crashes.
>
> -John
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-16 19:54             ` John Weekes
@ 2010-11-17 20:10               ` Ian Pratt
  2010-11-17 22:02                 ` John Weekes
  0 siblings, 1 reply; 23+ messages in thread
From: Ian Pratt @ 2010-11-17 20:10 UTC (permalink / raw)
  To: John Weekes, Daniel Stodden; +Cc: Jan Beulich, Ian Pratt, xen-devel

[-- Attachment #1: Type: text/plain, Size: 3549 bytes --]

> Performance is noticeably lower with aio on these bursty write
> workloads; I've been getting a number of complaints.

That's the cost of having guest data safely committed to disk before being ACK'ed. The users will presumably be happier when a host failure doesn't trash their filesystems through the total loss of the write ordering the filesystem implementer intended.

Personally, I wouldn't want any data of mine stored on such a system, but I guess others' mileage may vary.

If unsafe write buffering is desired, I'd be inclined to implement it explicitly in tapdisk rather than rely on the total vagaries of the Linux buffer cache. It would thus be possible to bound the amount of outstanding data, continue to respect ordering, and still respect explicit flushes.

Ian
 
> I see that 2.6.36 has some page_writeback changes:
> http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.
> 6%2Fpatch-2.6.36.bz2;z=8379
> . Any thoughts on whether these would make a difference for the problems
> with "file:"? I'm still trying to find a way to reproduce the issue in
> the lab, so I'd have to test the patch in production -- that's not a
> tantalizing prospect, unless there is a real chance that it will affect
> it.
> 
> -John
> 
> On 11/15/2010 9:59 AM, John Weekes wrote:
> >
> >> They are throttled, but the single control I'm aware of
> >> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> >> per process, not a global limit. Could well be that's part of the
> >> problem -- outwitting mm with just too many writers on too many cores?
> >>
> >> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
> >> made it much easier than with e.g. 2.6.27 to drive everybody else into
> >> costly reclaims.
> >>
> >> The Oom shown here reports about ~650M in dirty pages. The fact alone
> >> that this counts as on oom condition doesn't sound quite right in
> >> itself. That qemu might just have dared to ask at the wrong point in
> >> time.
> >>
> >> Just to get an idea -- how many guests did this box carry?
> >
> > It carries about two dozen guests, with a mix of mostly HVMs (all
> > stubdom-based, some with PV-on-HVM drivers) and some PV.
> >
> > This problem occurred more often for me under 2.6.32 than 2.6.31, I
> > noticed. Since I made the switch to aio, I haven't seen a crash, but
> > it hasn't been long enough for that to mean much.
> >
> > Having extra caching in the dom0 is nice because it allows for domUs
> > to get away with having small amounts of free memory, while still
> > having very good (much faster than hardware) write performance. If you
> > have a large number of domUs that are all memory-constrained and use
> > the disk in infrequent, large bursts, this can work out pretty well,
> > since the big communal pool provides a better value proposition than
> > giving each domU a few more megabytes of RAM.
> >
> > If the OOM problem isn't something that can be fixed, it might be a
> > good idea to print out a warning to the user when a domain using
> > "file:" is started. Or, to go a step further and automatically run
> > "file" based domains as though "aio" was specified, possibly with a
> > warning and a way to override that behavior. It's not really intuitive
> > that "file" would cause crashes.
> >
> > -John
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-17 20:10               ` Ian Pratt
@ 2010-11-17 22:02                 ` John Weekes
  2010-11-18  0:56                   ` Ian Pratt
  2010-11-18  1:23                   ` Daniel Stodden
  0 siblings, 2 replies; 23+ messages in thread
From: John Weekes @ 2010-11-17 22:02 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Jan Beulich, Daniel Stodden

There is certainly a trade-off, and historically, we've had problems 
with stability under Xen, so crashes are definitely a concern.

Implementation in tapdisk would be great.

I found today that tapdisk2 (at least on the latest 4.0-testing/unstable 
and latest pv_ops) is causing data corruption for Windows guests; I can 
see this by copying a few thousand files to another folder inside the 
guest, totalling a bit more than a GB, then running "fc" to check for 
differences (I tried with and without GPLPV). That's obviously a huge 
deal in production (and an even bigger deal than crashes), so in the 
short term, I may have to switch back to the uglier, crashier file: 
setup. I've been trying to find a workaround for the corruption all day 
without much luck.

-John

On 11/17/2010 12:10 PM, Ian Pratt wrote:
>> Performance is noticeably lower with aio on these bursty write
>> workloads; I've been getting a number of complaints.
> That's the cost of having guest data safely committed to disk before being ACK'ed.   The users will presumably be happier when a host failure doesn't trash their filesystems due to the total loss of any of the write ordering the filesystem implementer intended.
>
> Personally, I wouldn't want any data of mine stored on such a system, but I guess others mileage may vary.
>
> If unsafe write buffering is desired, I'd be inclined to implement it explicitly in tapdisk rather than rely on the total vagaries of the linux buffer cache. It would thus be possible to bound the amount of outstanding data, continue to respect ordering, and still respect explicit flushes.
>
> Ian
>
>> I see that 2.6.36 has some page_writeback changes:
>> http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.
>> 6%2Fpatch-2.6.36.bz2;z=8379
>> . Any thoughts on whether these would make a difference for the problems
>> with "file:"? I'm still trying to find a way to reproduce the issue in
>> the lab, so I'd have to test the patch in production -- that's not a
>> tantalizing prospect, unless there is a real chance that it will affect
>> it.
>>
>> -John
>>
>> On 11/15/2010 9:59 AM, John Weekes wrote:
>>>> They are throttled, but the single control I'm aware of
>>>> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
>>>> per process, not a global limit. Could well be that's part of the
>>>> problem -- outwitting mm with just too many writers on too many cores?
>>>>
>>>> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
>>>> made it much easier than with e.g. 2.6.27 to drive everybody else into
>>>> costly reclaims.
>>>>
>>>> The Oom shown here reports about ~650M in dirty pages. The fact alone
>>>> that this counts as on oom condition doesn't sound quite right in
>>>> itself. That qemu might just have dared to ask at the wrong point in
>>>> time.
>>>>
>>>> Just to get an idea -- how many guests did this box carry?
>>> It carries about two dozen guests, with a mix of mostly HVMs (all
>>> stubdom-based, some with PV-on-HVM drivers) and some PV.
>>>
>>> This problem occurred more often for me under 2.6.32 than 2.6.31, I
>>> noticed. Since I made the switch to aio, I haven't seen a crash, but
>>> it hasn't been long enough for that to mean much.
>>>
>>> Having extra caching in the dom0 is nice because it allows for domUs
>>> to get away with having small amounts of free memory, while still
>>> having very good (much faster than hardware) write performance. If you
>>> have a large number of domUs that are all memory-constrained and use
>>> the disk in infrequent, large bursts, this can work out pretty well,
>>> since the big communal pool provides a better value proposition than
>>> giving each domU a few more megabytes of RAM.
>>>
>>> If the OOM problem isn't something that can be fixed, it might be a
>>> good idea to print out a warning to the user when a domain using
>>> "file:" is started. Or, to go a step further and automatically run
>>> "file" based domains as though "aio" was specified, possibly with a
>>> warning and a way to override that behavior. It's not really intuitive
>>> that "file" would cause crashes.
>>>
>>> -John
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: OOM problems
  2010-11-17 22:02                 ` John Weekes
@ 2010-11-18  0:56                   ` Ian Pratt
  2010-11-18  1:23                   ` Daniel Stodden
  1 sibling, 0 replies; 23+ messages in thread
From: Ian Pratt @ 2010-11-18  0:56 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Pratt, xen-devel, Jan Beulich, Daniel Stodden

[-- Attachment #1: Type: text/plain, Size: 825 bytes --]

> I found today that tapdisk2 (at least on the latest 4.0-testing/unstable
> and latest pv_ops) is causing data corruption for Windows guests; I can
> see this by copying a few thousand files to another folder inside the
> guest, totalling a bit more than a GB, then running "fc" to check for
> differences (I tried with and without GPLPV). That's obviously a huge
> deal in production (and an even bigger deal than crashes), so in the
> short term, I may have to switch back to the uglier, crashier file:
> setup. I've been trying to find a workaround for the corruption all day
> without much luck.

That's disturbing. It might be worth trying to drop the number of VCPUs in dom0 to 1 and then try to repro.

BTW: for production use I'd currently be strongly inclined to use the XCP 2.6.32 kernel. 

Ian



[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-17 22:02                 ` John Weekes
  2010-11-18  0:56                   ` Ian Pratt
@ 2010-11-18  1:23                   ` Daniel Stodden
  2010-11-18  3:29                     ` John Weekes
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Stodden @ 2010-11-18  1:23 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Pratt, xen-devel, Jan Beulich

On Wed, 2010-11-17 at 17:02 -0500, John Weekes wrote:
> There is certainly a trade-off, and historically, we've had problems 
> with stability under Xen, so crashes are definitely a concern.
> 
> Implementation in tapdisk would be great.
> 
> I found today that tapdisk2 (at least on the latest 4.0-testing/unstable 
> and latest pv_ops) is causing data corruption for Windows guests; I can 
> see this by copying a few thousand files to another folder inside the 
> guest, totalling a bit more than a GB, then running "fc" to check for 
> differences (I tried with and without GPLPV). That's obviously a huge 
> deal in production (and an even bigger deal than crashes), so in the 
> short term, I may have to switch back to the uglier, crashier file: 
> setup. I've been trying to find a workaround for the corruption all day 
> without much luck.

Which branch/revision does latest pvops mean?

Would you be willing to try and reproduce that again with the XCP blktap
(userspace, not kernel) sources? Just to further isolate the problem.
Those see a lot of testing. I certainly can't come up with a single fix
to the aio layer, in ages. But I'm never sure about other stuff
potentially broken in userland.

If dio is definitely not what you feel you need, let's get back your
original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite
aggressive, to say the least.

If that alone doesn't help, I'd definitely try and check vm.dirty_ratio.
There must be a tradeoff which doesn't imply scribbling the better half
of 1.5GB main memory.
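
To see where things stand before touching anything, something like

  sysctl vm.dirty_ratio vm.dirty_background_ratio

(both also live under /proc/sys/vm/) is enough to check.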

Daniel

> -John
> 
> On 11/17/2010 12:10 PM, Ian Pratt wrote:
> >> Performance is noticeably lower with aio on these bursty write
> >> workloads; I've been getting a number of complaints.
> > That's the cost of having guest data safely committed to disk before being ACK'ed. The users will presumably be happier when a host failure doesn't trash their filesystems due to the total loss of whatever write ordering the filesystem implementer intended.
> >
> > Personally, I wouldn't want any data of mine stored on such a system, but I guess others' mileage may vary.
> >
> > If unsafe write buffering is desired, I'd be inclined to implement it explicitly in tapdisk rather than rely on the total vagaries of the linux buffer cache. It would thus be possible to bound the amount of outstanding data, continue to respect ordering, and still respect explicit flushes.
> >
> > Ian
> >
> >> I see that 2.6.36 has some page_writeback changes:
> >> http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.
> >> 6%2Fpatch-2.6.36.bz2;z=8379
> >> . Any thoughts on whether these would make a difference for the problems
> >> with "file:"? I'm still trying to find a way to reproduce the issue in
> >> the lab, so I'd have to test the patch in production -- that's not a
> >> tantalizing prospect, unless there is a real chance that it will affect
> >> it.
> >>
> >> -John
> >>
> >> On 11/15/2010 9:59 AM, John Weekes wrote:
> >>>> They are throttled, but the single control I'm aware of
> >>>> is /proc/sys/vm/dirty_ratio (or dirty_bytes, nowadays). Which is only
> >>>> per process, not a global limit. Could well be that's part of the
> >>>> problem -- outwitting mm with just too many writers on too many cores?
> >>>>
> >>>> We had a bit of trouble when switching dom0 to 2.6.32, buffered writes
> >>>> made it much easier than with e.g. 2.6.27 to drive everybody else into
> >>>> costly reclaims.
> >>>>
> >>>> The Oom shown here reports about ~650M in dirty pages. The fact alone
> >>>> that this counts as an oom condition doesn't sound quite right in
> >>>> itself. That qemu might just have dared to ask at the wrong point in
> >>>> time.
> >>>>
> >>>> Just to get an idea -- how many guests did this box carry?
> >>> It carries about two dozen guests, with a mix of mostly HVMs (all
> >>> stubdom-based, some with PV-on-HVM drivers) and some PV.
> >>>
> >>> This problem occurred more often for me under 2.6.32 than 2.6.31, I
> >>> noticed. Since I made the switch to aio, I haven't seen a crash, but
> >>> it hasn't been long enough for that to mean much.
> >>>
> >>> Having extra caching in the dom0 is nice because it allows for domUs
> >>> to get away with having small amounts of free memory, while still
> >>> having very good (much faster than hardware) write performance. If you
> >>> have a large number of domUs that are all memory-constrained and use
> >>> the disk in infrequent, large bursts, this can work out pretty well,
> >>> since the big communal pool provides a better value proposition than
> >>> giving each domU a few more megabytes of RAM.
> >>>
> >>> If the OOM problem isn't something that can be fixed, it might be a
> >>> good idea to print out a warning to the user when a domain using
> >>> "file:" is started. Or, to go a step further and automatically run
> >>> "file" based domains as though "aio" was specified, possibly with a
> >>> warning and a way to override that behavior. It's not really intuitive
> >>> that "file" would cause crashes.
> >>>
> >>> -John
> >>>
> >>> _______________________________________________
> >>> Xen-devel mailing list
> >>> Xen-devel@lists.xensource.com
> >>> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-18  1:23                   ` Daniel Stodden
@ 2010-11-18  3:29                     ` John Weekes
  2010-11-18  4:08                       ` Daniel Stodden
  0 siblings, 1 reply; 23+ messages in thread
From: John Weekes @ 2010-11-18  3:29 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ian Pratt, xen-devel, Jan Beulich

Daniel:

 > Which branch/revision does latest pvops mean?

stable-2.6.32, using the latest pull as of today. (I also tried 
next-2.6.37, but it wouldn't boot for me.)
> Would you be willing to try and reproduce that again with the XCP blktap
> (userspace, not kernel) sources? Just to further isolate the problem.
> Those see a lot of testing. I certainly can't come up with a single fix
> to the aio layer, in ages. But I'm never sure about other stuff
> potentially broken in userland.

I'll have to give it a try. Normal blktap still isn't working with 
pv_ops, though, so I hope this is a drop-in for blktap2.

In my last bit of troubleshooting, I took O_DIRECT out of the open call 
in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates 
that this might have eliminated the problem with corruption. I'm testing 
further now, but could there be an issue with alignment (since the 
kernel is apparently very strict about it with direct I/O)? (Removing 
this flag also brings back in use of the page cache, of course.)
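
For anyone who wants to try the same experiment, the change boils down
to dropping O_DIRECT from the flags passed to open(2). A minimal
standalone sketch of the idea (open_image is just an illustrative
stand-in, not the literal block-aio.c code, which wraps this in the
tapdisk driver structure):

  /* illustration only: the open-flag difference under discussion */
  #define _GNU_SOURCE
  #include <fcntl.h>

  int open_image(const char *path, int use_odirect)
  {
          int flags = O_RDWR | O_LARGEFILE;

          if (use_odirect)
                  flags |= O_DIRECT;  /* bypass the dom0 page cache;
                                         buffers must then be aligned */

          return open(path, flags);
  }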

> If dio is definitely not what you feel you need, let's get back your
> original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite
> aggressive, to say the least.

When I switched to aio, I reduced the vcpus to 2 (I needed to do this 
with dom0_max_vcpus, rather than through xend-config.sxp -- the latter 
wouldn't always boot). I haven't separately tried cached I/O with 
reduced CPUs yet, except in the lab; and unfortunately I still can't get 
the problem to happen in the lab, no matter what I try.
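
For reference, dom0_max_vcpus is a hypervisor boot parameter, so it
has to go on the xen.gz line of the bootloader entry rather than into
xend-config.sxp -- GRUB legacy syntax, purely as an illustration, with
the existing options left in place:

  kernel /boot/xen.gz dom0_max_vcpus=2 [existing options]
  module /boot/vmlinuz-2.6.32.x [existing options]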

> If that alone doesn't help, I'd definitely try and check vm.dirty_ratio.
> There must be a tradeoff which doesn't imply scribbling the better half
> of 1.5GB main memory.

The default for dirty_ratio is 20. I tried halving that to 10, but it 
didn't help. I could try lower, but I like the thought of keeping this 
in user space, if possible, so I've been pursuing the blktap2 path most 
aggressively.

Ian:

>  That's disturbing. It might be worth trying to drop the number of VCPUs in dom0 to 1 and then try to repro.
>  BTW: for production use I'd currently be strongly inclined to use the XCP 2.6.32 kernel.

Interesting, ok.

-John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-18  3:29                     ` John Weekes
@ 2010-11-18  4:08                       ` Daniel Stodden
  2010-11-18  7:15                         ` John Weekes
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Stodden @ 2010-11-18  4:08 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Pratt, xen-devel, Jan Beulich

On Wed, 2010-11-17 at 22:29 -0500, John Weekes wrote:
> Daniel:
> 
>  > Which branch/revision does latest pvops mean?
> 
> stable-2.6.32, using the latest pull as of today. (I also tried 
> next-2.6.37, but it wouldn't boot for me.)
> > Would you be willing to try and reproduce that again with the XCP blktap
> > (userspace, not kernel) sources? Just to further isolate the problem.
> > Those see a lot of testing. I certainly can't come up with a single fix
> > to the aio layer, in ages. But I'm never sure about other stuff
> > potentially broken in userland.
> 
> I'll have to give it a try. Normal blktap still isn't working with 
> pv_ops, though, so I hope this is a drop-in for blktap2.

I think it should work fine, or wouldn't ask. If not, lemme know.

> In my last bit of troubleshooting, I took O_DIRECT out of the open call 
> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates 
> that this might have eliminated the problem with corruption. I'm testing 
> further now, but could there be an issue with alignment (since the 
> kernel is apparently very strict about it with direct I/O)? 

Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
yelling quite miserably in cases like that. Keeping an eye on syslog
(the daemon and kern facilities) is a generally good idea btw.

> (Removing 
> this flag also brings back in use of the page cache, of course.)

I/O-wise it's not much different from the file:-path. Meaning it should
have carried you directly back into the Oom realm.

> > If dio is definitely not what you feel you need, let's get back your
> > original OOM problem. Did reducing dom0 vcpus help? 24 of them is quite
> > aggressive, to say the least.
> 
> When I switched to aio, I reduced the vcpus to 2 (I needed to do this 
> with dom0_max_vcpus, rather than through xend-config.sxp -- the latter 
> wouldn't always boot). I haven't separately tried cached I/O with 
> reduced CPUs yet, except in the lab; and unfortunately I still can't get 
> the problem to happen in the lab, no matter what I try.

Just reducing the cpu count alone sounds like sth worth trying even on a
production box, if the current state of things already tends to take the
system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
at runtime.
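
At runtime that's just

  sysctl -w vm.dirty_ratio=<percent>

(or an echo into /proc/sys/vm/dirty_ratio); it takes effect
immediately, without a reboot.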

> > If that alone doesn't help, I'd definitely try and check vm.dirty_ratio.
> > There must be a tradeoff which doesn't imply scribbling the better half
> > of 1.5GB main memory.
> 
> The default for dirty_ratio is 20. I tried halving that to 10, but it 
> didn't help. 

Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
a decent 30M write cache and should block all out of 24 disks after some
700M, worst case. Or so I think...
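
Spelling that estimate out, assuming the ~1.5G in dom0:

  2% of 1.5G        ~= 30M of dirty pages per writing task
  24 writers x 30M  ~= 720M dirtied, worst case, before everyone
                       gets throttled

which is where the "some 700M" comes from.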

> I could try lower, but I like the thought of keeping this 
> in user space, if possible, so I've been pursuing the blktap2 path most 
> aggressively.

Okay. I'm sending you a tbz to try.

Daniel

> Ian:
> 
> >  That's disturbing. It might be worth trying to drop the number of VCPUs in dom0 to 1 and then try to repro.
> >  BTW: for production use I'd currently be strongly inclined to use the XCP 2.6.32 kernel.
> 
> Interesting, ok.
> 
> -John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-18  4:08                       ` Daniel Stodden
@ 2010-11-18  7:15                         ` John Weekes
  2010-11-18 10:41                           ` Daniel Stodden
  0 siblings, 1 reply; 23+ messages in thread
From: John Weekes @ 2010-11-18  7:15 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ian Pratt, xen-devel, Jan Beulich


> I think [XCP blktap] should work fine, or wouldn't ask. If not, lemme know.

k.

>> In my last bit of troubleshooting, I took O_DIRECT out of the open call
>> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates
>> that this might have eliminated the problem with corruption. I'm testing
>> further now, but could there be an issue with alignment (since the
>> kernel is apparently very strict about it with direct I/O)?
> Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
> yelling quite miserably in cases like that. Keeping an eye on syslog
> (the daemon and kern facilities) is a generally good idea btw.

I've been doing that and haven't seen any unusual output so far, which I 
guess is good.

>> (Removing
>> this flag also brings back in use of the page cache, of course.)
> I/O-wise it's not much different from the file:-path. Meaning it should
> have carried you directly back into the Oom realm.

Does it make a difference that it's not using "loop" and instead the CPU 
usage (and presumably some blocking) occurs in user-space? There's not 
too much information on this out there, but it seems as though the OOM
issue might be at least somewhat loop device-specific. One document that 
references loop OOM problems that I found is this one: 
http://sources.redhat.com/lvm2/wiki/DMLoop. My initial take on it was 
that it might be saying that it mattered when these things were being 
done in the kernel, but now I'm not so certain --

".. [their method and loop] submit[s] [I/O requests] via a kernel thread 
to the VFS layer using traditional I/O calls (read, write etc.). This 
has the advantage that it should work with any file system type 
supported by the Linux VFS (including networked file systems), but has 
some drawbacks that may affect performance and scalability. This is 
because it is hard to predict what a file system may attempt to do when 
an I/O request is submitted; for example, it may need to allocate memory 
to handle the request and the loopback driver has no control over this. 
Particularly under low-memory or intensive I/O scenarios this can lead 
to out of memory (OOM) problems or deadlocks as the kernel tries to make 
memory available to the VFS layer while satisfying a request from the 
block layer. "

Would there be an advantage to using blktap/blktap2 over loop, if I 
leave off O_DIRECT? Would it be faster, or anything like that?

> Just reducing the cpu count alone sounds like sth worth trying even on a
> production box, if the current state of things already tends to take the
> system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
> at runtime.

That's good to hear.

>> The default for dirty_ratio is 20. I tried halving that to 10, but it
>> didn't help.
> Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
> a decent 30M write cache and should block all out of 24 disks after some
> 700M, worst case. Or so I think...

Ah, ok. I was thinking that it was global. With a small per-process 
cache like that, it becomes much closer to AIO for writes, but at least 
the leftover memory could still be used for the read cache.

-John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-18  7:15                         ` John Weekes
@ 2010-11-18 10:41                           ` Daniel Stodden
  2010-11-19  7:27                             ` John Weekes
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Stodden @ 2010-11-18 10:41 UTC (permalink / raw)
  To: John Weekes; +Cc: Ian Pratt, xen-devel, Jan Beulich

On Thu, 2010-11-18 at 02:15 -0500, John Weekes wrote:
> > I think [XCP blktap] should work fine, or wouldn't ask. If not, lemme know.
> 
> k.
> 
> >> In my last bit of troubleshooting, I took O_DIRECT out of the open call
> >> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates
> >> that this might have eliminated the problem with corruption. I'm testing
> >> further now, but could there be an issue with alignment (since the
> >> kernel is apparently very strict about it with direct I/O)?
> > Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
> > yelling quite miserably in cases like that. Keeping an eye on syslog
> > (the daemon and kern facilities) is a generally good idea btw.
> 
> I've been doing that and haven't seen any unusual output so far, which I 
> guess is good.
> 
> >> (Removing
> >> this flag also brings back in use of the page cache, of course.)
> > I/O-wise it's not much different from the file:-path. Meaning it should
> > have carried you directly back into the Oom realm.
> 
> Does it make a difference that it's not using "loop" and instead the CPU 
> usage (and presumably some blocking) occurs in user-space?

It's certainly a different path taken. I just meant to say file access
has about the same properties, so you're likely back to the original
issue.

>  There's not 
> too much information on this out there, but it seems as though the OOM
> issue might be at least somewhat loop device-specific. One document that 
> references loop OOM problems that I found is this one: 
> http://sources.redhat.com/lvm2/wiki/DMLoop.

>  My initial take on it was 
> that it might be saying that it mattered when these things were being 
> done in the kernel, but now I'm not so certain --
> 
> ".. [their method and loop] submit[s] [I/O requests] via a kernel thread 
> to the VFS layer using traditional I/O calls (read, write etc.). This 
> has the advantage that it should work with any file system type 
> supported by the Linux VFS (including networked file systems), but has 
> some drawbacks that may affect performance and scalability. This is 
> because it is hard to predict what a file system may attempt to do when 
> an I/O request is submitted; for example, it may need to allocate memory 
> to handle the request and the loopback driver has no control over this. 
> Particularly under low-memory or intensive I/O scenarios this can lead 
> to out of memory (OOM) problems or deadlocks as the kernel tries to make 
> memory available to the VFS layer while satisfying a request from the 
> block layer. "
> 
> Would there be an advantage to using blktap/blktap2 over loop, if I 
> leave off O_DIRECT? Would it be faster, or anything like that?

No, it's essentially the same thing. Both blktap and loopdevs sit on the
vfs in a similar fashion, without O_DIRECT even more so. The deadlocking
and OOM hazards are also the same, btw.

Deadlocks are a fairly general problem whenever you layer two subsystems
depending on the same resource on top of each other. Both in the blktap
and loopback case the system has several opportunities to hang itself,
because there's even more stuff stacked than normal. The layers are, top
to bottom

 (1) potential caching of {tap/loop}dev writes (Xen doesn't do that) 
 (2) The block device, which needs some minimum amount of memory to run 
     its request queue
 (3) Cached writes on the file layer
 (4) The filesystem needs memory to launder those pages
 (5) The disk's block device, equivalent to 2.
 (6) The device driver running the data transfers.

The shared resource is memory. Now consider what happens when upper
layers in combination grab everything the lower layers need to make
progress. The upper layers can't roll back, so they won't release their
memory until that progress has been made. So we're stuck.

It shouldn't happen, the kernel has a bunch of mechanisms to prevent
that. It obviously doesn't quite work here.

That's why I'm suggesting that the most obvious fix for your case is to
limit the cache dirtying rate.

> > Just reducing the cpu count alone sounds like sth worth trying even on a
> > production box, if the current state of things already tends to take the
> > system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
> > at runtime.
> 
> That's good to hear.
> 
> >> The default for dirty_ratio is 20. I tried halving that to 10, but it
> >> didn't help.
> > Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
> > a decent 30M write cache and should block all out of 24 disks after some
> > 700M, worst case. Or so I think...
> 
> Ah, ok. I was thinking that it was global. With a small per-process 
> cache like that, it becomes much closer to AIO for writes, but at least 
> the leftover memory could still be used for the read cache.

I agree it doesn't do what you want. I have no idea why there's no
global limit, seriously.

Note that in theory, 24*2% would still approach the oom state you were
in with the log you sent. I think it's going to be less likely though.
With all guests going mad at the same time, it may still not be low
enough. In case that happens, you could resort to pumping even more
memory into dom0.

Daniel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: OOM problems
  2010-11-18 10:41                           ` Daniel Stodden
@ 2010-11-19  7:27                             ` John Weekes
  0 siblings, 0 replies; 23+ messages in thread
From: John Weekes @ 2010-11-19  7:27 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Ian Pratt, xen-devel, Jan Beulich

Daniel, thank you for the help and in-depth information, as well as the 
test code off-list. The corruption problem with blktap2 O_DIRECT is 
easily reproducible for me on multiple machines, so I hope that we'll be 
able to nail this one down pretty quickly.

To follow up on my question about the potential performance difference 
between blktap2 without O_DIRECT and loop (both of which use the page 
cache), I did some tests inside a sparse file-backed domU by timing 
copying a folder containing 7419 files and folders totalling 1.6 GB (of 
mixed sizes), and found that loop returned this:

real    1m18.257s
user    0m0.050s
sys     0m6.550s

While tapdisk2 aio w/o O_DIRECT clocked in at:

real    0m55.373s
user    0m0.050s
sys     0m6.690s

With each, I saw a few more seconds of disk activity on dom0, since 
dirty_ratio was set to 2. I ran the tests several times and dropped 
caches on dom0 between each one; all of the results were within a second 
or two of each other.
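
(For anyone repeating the test: dropping the dom0 caches between runs
amounts to something like

  sync
  echo 3 > /proc/sys/vm/drop_caches

so each run starts with a cold dom0 page cache.)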

This represents a significant ~41% performance bump for that particular 
workload. In light of this, I would recommend to anyone who is using 
"file:" that they try out tapdisk2 aio with a modified block-aio.c to 
remove O_DIRECT, and see how it goes. If you find results similar to 
mine, it might be worth modifying this into another blktap2 driver.

-John

On 11/18/2010 2:41 AM, Daniel Stodden wrote:
> On Thu, 2010-11-18 at 02:15 -0500, John Weekes wrote:
>>> I think [XCP blktap] should work fine, or wouldn't ask. If not, lemme know.
>> k.
>>
>>>> In my last bit of troubleshooting, I took O_DIRECT out of the open call
>>>> in tools/blktap2/drivers/block-aio.c, and preliminary testing indicates
>>>> that this might have eliminated the problem with corruption. I'm testing
>>>> further now, but could there be an issue with alignment (since the
>>>> kernel is apparently very strict about it with direct I/O)?
>>> Nope. It is, but they're 4k-aligned all over the place. You'd see syslog
>>> yelling quite miserably in cases like that. Keeping an eye on syslog
>>> (the daemon and kern facilities) is a generally good idea btw.
>> I've been doing that and haven't seen any unusual output so far, which I
>> guess is good.
>>
>>>> (Removing
>>>> this flag also brings back in use of the page cache, of course.)
>>> I/O-wise it's not much different from the file:-path. Meaning it should
>>> have carried you directly back into the Oom realm.
>> Does it make a difference that it's not using "loop" and instead the CPU
>> usage (and presumably some blocking) occurs in user-space?
> It's certainly a different path taken. I just meant to say file access
> has about the same properties, so you're likely back to the original
> issue.
>
>>   There's not
>> too much information on this out there, but it seems as though the OOM
>> issue might be at least somewhat loop device-specific. One document that
>> references loop OOM problems that I found is this one:
>> http://sources.redhat.com/lvm2/wiki/DMLoop.
>>   My initial take on it was
>> that it might be saying that it mattered when these things were being
>> done in the kernel, but now I'm not so certain --
>>
>> ".. [their method and loop] submit[s] [I/O requests] via a kernel thread
>> to the VFS layer using traditional I/O calls (read, write etc.). This
>> has the advantage that it should work with any file system type
>> supported by the Linux VFS (including networked file systems), but has
>> some drawbacks that may affect performance and scalability. This is
>> because it is hard to predict what a file system may attempt to do when
>> an I/O request is submitted; for example, it may need to allocate memory
>> to handle the request and the loopback driver has no control over this.
>> Particularly under low-memory or intensive I/O scenarios this can lead
>> to out of memory (OOM) problems or deadlocks as the kernel tries to make
>> memory available to the VFS layer while satisfying a request from the
>> block layer. "
>>
>> Would there be an advantage to using blktap/blktap2 over loop, if I
>> leave off O_DIRECT? Would it be faster, or anything like that?
> No, it's essentially the same thing. Both blktap and loopdevs sit on the
> vfs in a similar fashion, without O_DIRECT even more so. The deadlocking
> and OOM hazards are also the same, btw.
>
> Deadlocks are a fairly general problem whenever you layer two subsystems
> depending on the same resource on top of each other. Both in the blktap
> and loopback case the system has several opportunities to hang itself,
> because there's even more stuff stacked than normal. The layers are, top
> to bottom
>
>   (1) potential caching of {tap/loop}dev writes (Xen doesn't do that)
>   (2) The block device, which needs some minimum amount of memory to run
>       its request queue
>   (3) Cached writes on the file layer
>   (4) The filesystem needs memory to launder those pages
>   (5) The disk's block device, equivalent to 2.
>   (6) The device driver running the data transfers.
>
> The shared resource is memory. Now consider what happens when upper
> layers in combination grab everything the lower layers need to make
> progress. The upper layers can't roll back, so they won't release their
> memory until that progress has been made. So we're stuck.
>
> It shouldn't happen, the kernel has a bunch of mechanisms to prevent
> that. It obviously doesn't quite work here.
>
> That's why I'm suggesting that the most obvious fix for your case is to
> limit the cache dirtying rate.
>
>>> Just reducing the cpu count alone sounds like sth worth trying even on a
>>> production box, if the current state of things already tends to take the
>>> system down. Also, the dirty_ratio sysctl should be pretty safe to tweak
>>> at runtime.
>> That's good to hear.
>>
>>>> The default for dirty_ratio is 20. I tried halving that to 10, but it
>>>> didn't help.
>>> Still too much. That's meant to be %/task. Try 2, with 1.5G that's still
>>> a decent 30M write cache and should block all out of 24 disks after some
>>> 700M, worst case. Or so I think...
>> Ah, ok. I was thinking that it was global. With a small per-process
>> cache like that, it becomes much closer to AIO for writes, but at least
>> the leftover memory could still be used for the read cache.
> I agree it doesn't do what you want. I have no idea why there's no
> global limit, seriously.
>
> Note that in theory, 24*2% would still approach the oom state you were
> in with the log you sent. I think it's going to be less likely though.
> With all guests going mad at the same time, it may still not be low
> enough. In case that happens, you could resort to pumping even more
> memory into dom0.
>
> Daniel
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-11-19  7:27 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-13  7:57 OOM problems John Weekes
2010-11-13  8:14 ` Ian Pratt
2010-11-13  8:27   ` John Weekes
2010-11-13  9:13     ` Ian Pratt
2010-11-13  9:43       ` John Weekes
2010-11-13 10:19       ` John Weekes
2010-11-14  9:53         ` Daniel Stodden
2010-11-15  8:55       ` Jan Beulich
2010-11-15  9:40         ` Daniel Stodden
2010-11-15  9:57           ` Jan Beulich
2010-11-15 17:59           ` John Weekes
2010-11-16 19:54             ` John Weekes
2010-11-17 20:10               ` Ian Pratt
2010-11-17 22:02                 ` John Weekes
2010-11-18  0:56                   ` Ian Pratt
2010-11-18  1:23                   ` Daniel Stodden
2010-11-18  3:29                     ` John Weekes
2010-11-18  4:08                       ` Daniel Stodden
2010-11-18  7:15                         ` John Weekes
2010-11-18 10:41                           ` Daniel Stodden
2010-11-19  7:27                             ` John Weekes
2010-11-15 14:17         ` Stefano Stabellini
2010-11-13 18:15 ` George Shuklin
