linux-kernel.vger.kernel.org archive mirror
* [BUG][REGRESSION] i915 gpu hangs under load
@ 2017-03-22  8:38 Martin Kepplinger
  2017-03-22 10:36 ` [Intel-gfx] " Jani Nikula
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Kepplinger @ 2017-03-22  8:38 UTC (permalink / raw)
  To: daniel.vetter, airlied; +Cc: intel-gfx, dri-devel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1910 bytes --]

Hi

I know something similar is here: 
https://bugs.freedesktop.org/show_bug.cgi?id=100110 too.

But this is rc3 and my machine is totally *not usable*. Let me be 
annoying :) I hope I can help:

Since rc1 I get GPU hangs and resets under load. This is almost 
certainly a kernel issue: 4.10 is fine.
I keep a Debian stable userspace; nouveau is running on this machine 
too.

Mar 22 09:17:01 martin-laptop kernel: [ 2409.538706] [drm] GPU HANG: 
ecode 7:0:0xf3cffffe, in gnome-shell [1869], reason: Hang on render 
ring, action: reset
Mar 22 09:17:01 martin-laptop kernel: [ 2409.538711] [drm] GPU hangs can 
indicate a bug anywhere in the entire gfx stack, including userspace.
Mar 22 09:17:01 martin-laptop kernel: [ 2409.538713] [drm] Please file a 
_new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Mar 22 09:17:01 martin-laptop kernel: [ 2409.538714] [drm] drm/i915 
developers can then reassign to the right component if it's not a kernel 
issue.
Mar 22 09:17:01 martin-laptop kernel: [ 2409.538715] [drm] The gpu crash 
dump is required to analyze gpu hangs, so please always attach it.
Mar 22 09:17:01 martin-laptop kernel: [ 2409.538716] [drm] GPU crash 
dump saved to /sys/class/drm/card0/error
Mar 22 09:17:01 martin-laptop kernel: [ 2409.538768] drm/i915: Resetting 
chip after gpu hang
Mar 22 09:17:09 martin-laptop kernel: [ 2417.537886] drm/i915: Resetting 
chip after gpu hang
Mar 22 09:17:17 martin-laptop kernel: [ 2425.537152] drm/i915: Resetting 
chip after gpu hang
Mar 22 09:17:25 martin-laptop kernel: [ 2433.536407] drm/i915: Resetting 
chip after gpu hang
Mar 22 09:17:33 martin-laptop kernel: [ 2441.539674] drm/i915: Resetting 
chip after gpu hang


Furthermore, there are weird, small display distortions occurring. I 
don't get any log output about them and
don't have a screenshot. Well, never mind. Please fix 4.11, and CC anyone 
I forgot.


thanks

              martin

[-- Attachment #2: gpu_crash.txt --]
[-- Type: text/plain, Size: 10815 bytes --]

GPU HANG: ecode 7:0:0xf3cffffe, in gnome-shell [1869], reason: Hang on render ring, action: reset
Kernel: 4.11.0-rc3-00003-gbc61cd2
Time: 1490170621 s 524489 us
Boottime: 2409 s 756155 us
Uptime: 2395 s 323536 us
is_mobile: no
is_lp: no
is_alpha_support: no
has_64bit_reloc: no
has_aliasing_ppgtt: yes
has_csr: no
has_ddi: yes
has_decoupled_mmio: no
has_dp_mst: yes
has_fbc: yes
has_fpga_dbg: yes
has_full_ppgtt: yes
has_full_48bit_ppgtt: no
has_gmbus_irq: yes
has_gmch_display: no
has_guc: no
has_hotplug: yes
has_hw_contexts: yes
has_l3_dpf: yes
has_llc: yes
has_logical_ring_contexts: no
has_overlay: no
has_pipe_cxsr: no
has_pooled_eu: no
has_psr: yes
has_rc6: yes
has_rc6p: no
has_resource_streamer: yes
has_runtime_pm: yes
has_snoop: no
cursor_needs_physical: no
hws_needs_physical: no
overlay_needs_physical: no
supports_tv: no
Active process (on ring render): gnome-shell [1869], context bans 0
Reset count: 0
Suspend count: 0
Platform: HASWELL
PCI ID: 0x0416
PCI Revision: 0x06
PCI Subsystem: 10cf:17ac
IOMMU enabled?: 0
EIR: 0x00000000
IER: 0xfc002529
GTIER: 0x00401821
PGTBL_ER: 0x00000000
FORCEWAKE: 0x00000001
DERRMR: 0xffffffff
CCID: 0x00ef410d
Missed interrupts: 0x00000000
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 00000000
  fence[4] = 00000000
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 00000000
  fence[13] = 00000000
  fence[14] = 00000000
  fence[15] = 00000000
  fence[16] = 00000000
  fence[17] = 00000000
  fence[18] = 4b530770374a001
  fence[19] = 00000000
  fence[20] = 00000000
  fence[21] = 00000000
  fence[22] = 00000000
  fence[23] = 00000000
  fence[24] = 00000000
  fence[25] = 00000000
  fence[26] = 00000000
  fence[27] = 00000000
  fence[28] = 00000000
  fence[29] = 00000000
  fence[30] = 00000000
  fence[31] = 00000000
ERROR: 0x00000109
DONE_REG: 0xffffffff
ERR_INT: 0x00000000
render command stream:
  START: 0x007ea000
  HEAD:  0x07a1f6dc [0x0001f648]
  TAIL:  0x0001f8f8 [0x0001f728, 0x0001f760]
  CTL:   0x0001f001
  MODE:  0x00004000
  HWS:   0x7fff0000
  ACTHD: 0x00000000 07a1f6dc
  IPEIR: 0x00000000
  IPEHR: 0x0c000000
  INSTDONE: 0xffcffffe
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  BBADDR: 0x00000000_7fa48330
  BB_STATE: 0x00000000
  INSTPS: 0x00000500
  INSTPM: 0x00006080
  FADDR: 0x00000000 008096d8
  RC PSMI: 0x00000010
  FAULT_REG: 0x000000c5
  SYNC_0: 0x00000000
  SYNC_1: 0x0001c2a1
  SYNC_2: 0x00000000
  GFX_MODE: 0x00002a00
  PP_DIR_BASE: 0x7fdf0000
  seqno: 0x0001c29a
  last_seqno: 0x0001c2a2
  waiting: yes
  ring->head: 0x00016e60
  ring->tail: 0x0001f8f8
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4295493232, 204600 ms ago
blt command stream:
  START: 0x0080a000
  HEAD:  0x07e0e8d0 [0x00000000]
  TAIL:  0x0000e8d0 [0x00000000, 0x00000000]
  CTL:   0x0001f001
  MODE:  0x00000200
  HWS:   0x7fff1000
  ACTHD: 0x00000000 07e0e8d0
  IPEIR: 0x00000000
  IPEHR: 0x01000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000_7fff4028
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 008188d0
  RC PSMI: 0x00000011
  FAULT_REG: 0x00000000
  SYNC_0: 0x0001c29a
  SYNC_1: 0x00000000
  SYNC_2: 0x00000000
  GFX_MODE: 0x00000200
  PP_DIR_BASE: 0x7fdf0000
  seqno: 0x0001c2a1
  last_seqno: 0x0001c2a1
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck stall: no
  hangcheck action: idle
  hangcheck action timestamp: 4295494736, 198584 ms ago
bsd command stream:
  START: 0x0082a000
  HEAD:  0x00000000 [0x00000000]
  TAIL:  0x00000000 [0x00000000, 0x00000000]
  CTL:   0x0001f001
  MODE:  0x00000200
  HWS:   0x7fff2000
  ACTHD: 0x00000000 00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000_00000000
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 0082a000
  RC PSMI: 0x00000011
  FAULT_REG: 0x00000000
  SYNC_0: 0x0001c2a1
  SYNC_1: 0x0001c29a
  SYNC_2: 0x00000000
  GFX_MODE: 0x00000200
  PP_DIR_BASE: 0x00000000
  seqno: 0x00000000
  last_seqno: 0x00000000
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck stall: no
  hangcheck action: idle
  hangcheck action timestamp: 4295494736, 198584 ms ago
vebox command stream:
  START: 0x0084a000
  HEAD:  0x00000000 [0x00000000]
  TAIL:  0x00000000 [0x00000000, 0x00000000]
  CTL:   0x0001f001
  MODE:  0x00000200
  HWS:   0x7fff3000
  ACTHD: 0x00000000 00000000
  IPEIR: 0x00000000
  IPEHR: 0x00000000
  INSTDONE: 0xfffffffe
  BBADDR: 0x00000000_00000000
  BB_STATE: 0x00000000
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 0084a000
  RC PSMI: 0x00000011
  FAULT_REG: 0x00000000
  SYNC_0: 0x0001c2a1
  SYNC_1: 0x0001c29a
  SYNC_2: 0x00000000
  GFX_MODE: 0x00000200
  PP_DIR_BASE: 0x00000000
  seqno: 0x00000000
  last_seqno: 0x00000000
  waiting: no
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck stall: no
  hangcheck action: idle
  hangcheck action timestamp: 4295494736, 198584 ms ago
Active (render ring) [40]:
    00000000_7fff8000     8192 37 00 [ 1c29e 00 00 00 00 ] 00 LLC
    00000000_7e9df000 20971520 36 00 [ 1c29e 00 00 00 00 ] 00 X uncached (name: 2)
    00000000_7fff7000     4096 36 00 [ 1c29e 00 00 00 00 ] 00 LLC
    00000000_7d5df000 20971520 36 00 [ 1c29e 00 00 00 00 ] 00 Y LLC
    00000000_7c1df000 20971520 36 00 [ 1c29e 00 00 00 00 ] 00 Y LLC
    00000000_7bcdf000  5242880 36 00 [ 1c29e 00 00 00 00 ] 00 LLC
    00000000_7fff6000     4096 37 00 [ 1c29e 00 00 00 00 ] 00 LLC
    00000000_7b2df000 10485760 37 00 [ 1c29e 00 00 00 00 ] 00 X LLC (name: 10)
    00000000_7b2d5000    40960 37 00 [ 1c29e 00 00 00 00 ] 00 LLC
    00000000_7a8d5000 10485760 37 00 [ 1c29e 00 00 00 00 ] 00 X LLC (name: 8)
    00000000_7a655000  2621440 37 00 [ 1c29e 00 00 00 00 ] 00 Y LLC
    00000000_7fff5000     4096 37 00 [ 1c29e 00 00 00 00 ] 00 LLC
    00000000_7a575000   917504 37 00 [ 1c29e 00 00 00 00 ] 00 X LLC (name: 11)
    00000000_7a535000   262144 37 00 [ 1c29e 00 00 00 00 ] 00 Y LLC
    00000000_79b35000 10485760 37 00 [ 1c29e 00 00 00 00 ] 00 X LLC (name: 5)
    00000000_79535000  6291456 37 00 [ 1c29e 00 00 00 00 ] 00 X LLC (name: 12)
    00000000_793b5000  1572864 37 00 [ 1c29e 00 00 00 00 ] 00 Y LLC
    00000000_793ad000    32768 37 00 [ 1c29e 00 00 00 00 ] 00 dirty LLC
    00000000_00ef3000     4096 09 00 [ 1c29e 00 00 00 00 ] 00 dirty purgeable LLC
    00000000_77f9d000     4096 37 00 [ 1c2a0 00 00 00 00 ] 00 dirty LLC
    00000000_77f9c000     4096 37 00 [ 1c2a0 00 00 00 00 ] 00 LLC
    00000000_77f98000    16384 37 00 [ 1c2a0 00 00 00 00 ] 00 purgeable LLC
    00000000_77f97000     4096 37 00 [ 1c2a0 00 00 00 00 ] 00 LLC
    00000000_77e97000  1048576 37 00 [ 1c2a0 00 00 00 00 ] 00 X LLC
    00000000_77e93000    16384 37 00 [ 1c2a0 00 00 00 00 ] 00 dirty purgeable LLC
    00000000_7fffa000    16384 37 00 [ 1c2a0 00 00 00 00 ] 00 dirty LLC
    00000000_00f06000     4096 09 00 [ 1c2a0 00 00 00 00 ] 00 dirty purgeable LLC
    00000000_77fa8000     4096 37 00 [ 1c2a2 00 00 00 00 ] 00 LLC
    00000000_77fa1000    28672 37 00 [ 1c2a2 00 00 00 00 ] 00 LLC
    00000000_77fa0000     4096 37 00 [ 1c2a2 00 00 00 00 ] 00 LLC
    00000000_77f9f000     4096 37 00 [ 1c2a2 00 00 00 00 ] 00 LLC
    00000000_77f9e000     4096 37 00 [ 1c2a2 00 00 00 00 ] 00 LLC
    00000000_77e56000     4096 37 00 [ 1c2a2 00 00 00 00 ] 00 dirty LLC
    00000000_77e55000     4096 37 00 [ 1c2a2 00 00 00 00 ] 00 LLC
    00000000_77e51000    16384 37 00 [ 1c2a2 00 00 00 00 ] 00 purgeable LLC
    00000000_77e5b000   229376 36 00 [ 1c2a2 00 00 00 00 ] 00 X LLC
    00000000_789ad000 10485760 36 00 [ 1c2a2 00 00 00 00 ] 00 X LLC
    00000000_77e4d000    16384 37 00 [ 1c2a2 00 00 00 00 ] 00 purgeable LLC
    00000000_77fa9000    16384 37 00 [ 1c2a2 00 00 00 00 ] 00 dirty LLC
    00000000_00f07000     4096 09 00 [ 1c2a2 00 00 00 00 ] 00 dirty purgeable LLC
Pinned (global) [15]:
    00000000_7fddf000    69632 41 00 [ 00 00 00 00 00 ] 00 LLC
    00000000_7fff0000     4096 01 01 [ 00 00 00 00 00 ] 00 purgeable LLC
    00000000_007ea000   131072 40 40 [ 00 00 00 00 00 ] 00 dirty LLC
    00000000_7fffe000     4096 41 00 [ 00 00 00 00 00 ] 00 LLC
    00000000_7fff1000     4096 01 01 [ 00 00 00 00 00 ] 00 purgeable LLC
    00000000_0080a000   131072 40 40 [ 00 00 00 00 00 ] 00 dirty LLC
    00000000_7fff2000     4096 01 01 [ 00 00 00 00 00 ] 00 purgeable LLC
    00000000_0082a000   131072 40 40 [ 00 00 00 00 00 ] 00 dirty LLC
    00000000_7fff3000     4096 01 01 [ 00 00 00 00 00 ] 00 purgeable LLC
    00000000_0084a000   131072 40 40 [ 00 00 00 00 00 ] 00 dirty LLC
    00000000_00000000  8294400 41 00 [ 00 00 00 00 00 ] 00 uncached
    00000000_00f49000    16384 40 00 [ 00 00 00 00 00 ] 00 dirty uncached
    00000000_00ee2000    69632 41 00 [ 00 00 00 00 00 ] 00 LLC
    00000000_00ef4000    69632 41 00 [ 00 00 00 00 00 ] 00 LLC
    00000000_0374a000 20971520 36 00 [ 00 00 00 00 00 ] 00 X dirty uncached (name: 3) (fence: 18)
render ring --- 3 requests
  pid 1869, ban score 0, seqno        4:0001c29e, emitted 207960ms ago, head 0001f648, tail 0001f760
  pid 934, ban score 0, seqno        1:0001c2a0, emitted 207924ms ago, head 0001f760, tail 0001f878
  pid 934, ban score 0, seqno        1:0001c2a2, emitted 207908ms ago, head 0001f878, tail 0001f8f8
render ring --- 2 waiters
 seqno 0x0001c29e for gnome-shell [1869]
 seqno 0x0001c2a0 for Xorg [934]
Num Pipes: 3
PWR_WELL_CTL2: c0000000
Pipe [0]:
  Power: on
  SRC: 077f04af
  STAT: 00000000
Plane [0]:
  CNTR: d9000400
  STRIDE: 00003c00
  SURF: 0374a000
  TILEOFF: 00000000
Cursor [0]:
  CNTR: 05000027
  POS: 02740205
  BASE: 00f49000
Pipe [1]:
  Power: on
  SRC: 077f0437
  STAT: 00000000
Plane [1]:
  CNTR: d9000400
  STRIDE: 00003c00
  SURF: 03759000
  TILEOFF: 00000000
Cursor [1]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
Pipe [2]:
  Power: on
  SRC: 00000000
  STAT: 00000000
Plane [2]:
  CNTR: 00000000
  STRIDE: 00000000
  SURF: 00000000
  TILEOFF: 00000000
Cursor [2]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
CPU transcoder: A
  Power: on
  CONF: c0000000
  HTOTAL: 081f077f
  HBLANK: 081f077f
  HSYNC: 07cf07af
  VTOTAL: 04d204af
  VBLANK: 04d204af
  VSYNC: 04b804b2
CPU transcoder: B
  Power: on
  CONF: 00000000
  HTOTAL: 072f068f
  HBLANK: 072f068f
  HSYNC: 06df06bf
  VTOTAL: 04560437
  VBLANK: 04370419
  VSYNC: 0422041c
CPU transcoder: C
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: EDP
  Power: on
  CONF: c0000000
  HTOTAL: 081f077f
  HBLANK: 081f077f
  HSYNC: 07cf07af
  VTOTAL: 04560437
  VBLANK: 04560437
  VSYNC: 043f043a

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load
  2017-03-22  8:38 [BUG][REGRESSION] i915 gpu hangs under load Martin Kepplinger
@ 2017-03-22 10:36 ` Jani Nikula
  2017-04-02 11:50   ` Thorsten Leemhuis
  0 siblings, 1 reply; 21+ messages in thread
From: Jani Nikula @ 2017-03-22 10:36 UTC (permalink / raw)
  To: Martin Kepplinger, daniel.vetter, airlied
  Cc: intel-gfx, linux-kernel, dri-devel

On Wed, 22 Mar 2017, Martin Kepplinger <martink@posteo.de> wrote:
> I know something similar is here: 
> https://bugs.freedesktop.org/show_bug.cgi?id=100110 too.
>
> But this is rc3 and my machine is totally *not usable*. Let me be 
> annoying :) I hope I can help:

Please file a bug over at [1].

Thanks,
Jani.


[1] https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel


-- 
Jani Nikula, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load
  2017-03-22 10:36 ` [Intel-gfx] " Jani Nikula
@ 2017-04-02 11:50   ` Thorsten Leemhuis
  2017-04-02 12:13     ` Martin Kepplinger
  0 siblings, 1 reply; 21+ messages in thread
From: Thorsten Leemhuis @ 2017-04-02 11:50 UTC (permalink / raw)
  To: Jani Nikula, Martin Kepplinger, daniel.vetter, airlied
  Cc: intel-gfx, linux-kernel, dri-devel

Lo! On 22.03.2017 11:36, Jani Nikula wrote:
> On Wed, 22 Mar 2017, Martin Kepplinger <martink@posteo.de> wrote:
>> I know something similar is here: 
>> https://bugs.freedesktop.org/show_bug.cgi?id=100110 too.
>> But this is rc3 and my machine is totally *not usable*. Let me be 
>> annoying :) I hope I can help:
> Please file a bug over at [1].
> […]
> [1] https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel

@Martin: did you file that bug? I could not find one :-/

@Jani: In similar situations could you do me a favour and ask people to
send one more reply to the public list containing the link to the bug
they filed? Regression tracking is quite hard already; searching various
bug trackers for follow-up bug entries makes it even harder :-(

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load
  2017-04-02 11:50   ` Thorsten Leemhuis
@ 2017-04-02 12:13     ` Martin Kepplinger
  2017-04-03 15:09       ` Jani Nikula
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Kepplinger @ 2017-04-02 12:13 UTC (permalink / raw)
  To: Thorsten Leemhuis, Jani Nikula, daniel.vetter, airlied
  Cc: intel-gfx, linux-kernel, dri-devel



On 2 April 2017 at 13:50:26 CEST, Thorsten Leemhuis <regressions@leemhuis.info> wrote:
>Lo! On 22.03.2017 11:36, Jani Nikula wrote:
>> On Wed, 22 Mar 2017, Martin Kepplinger <martink@posteo.de> wrote:
>>> I know something similar is here: 
>>> https://bugs.freedesktop.org/show_bug.cgi?id=100110 too.
>>> But this is rc3 and my machine is totally *not usable*. Let me be 
>>> annoying :) I hope I can help:
>> Please file a bug over at [1].
>> […]
>> [1]
>https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel
>
>@Martin: did you file that bug? I could not find one :-/

I did. Got marked as duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=100181 and there's a fix out there. I don't know if it's in rc5 though.

>
>@Jani: In similar situations could you do me a favour and ask people to
>send one more reply to the public list which contains the link to the
>bug filed? Regression tracking is quite hard already; searching various
>bug tracker for follow up bug entries makes it even harder :-(
>
>Ciao, Thorsten

-- 
Martin Kepplinger
http://martinkepplinger.com
sent from mobile

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load
  2017-04-02 12:13     ` Martin Kepplinger
@ 2017-04-03 15:09       ` Jani Nikula
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Jani Nikula @ 2017-04-03 15:09 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, airlied
  Cc: intel-gfx, linux-kernel, dri-devel

On Sun, 02 Apr 2017, Martin Kepplinger <martink@posteo.de> wrote:
> Am 2. April 2017 13:50:26 MESZ schrieb Thorsten Leemhuis <regressions@leemhuis.info>:
>>Lo! On 22.03.2017 11:36, Jani Nikula wrote:
>>> On Wed, 22 Mar 2017, Martin Kepplinger <martink@posteo.de> wrote:
>>>> I know something similar is here: 
>>>> https://bugs.freedesktop.org/show_bug.cgi?id=100110 too.
>>>> But this is rc3 and my machine is totally *not usable*. Let me be 
>>>> annoying :) I hope I can help:
>>> Please file a bug over at [1].
>>> […]
>>> [1]
>>https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel
>>
>>@Martin: did you file that bug? I could not find one :-/
>
> I did. Got marked as duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=100181 and there's a fix out there. I don't know if it's in rc5 though.

Should be fixed in v4.11-rc5 by

commit 0abfe7e2570d7c729a7662e82c09a23f00f29346
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 22 20:59:30 2017 +0000

    drm/i915: Restore marking context objects as dirty on pinning

>>@Jani: In similar situations could you do me a favour and ask people to
>>send one more reply to the public list which contains the link to the
>>bug filed? Regression tracking is quite hard already; searching various
>>bug tracker for follow up bug entries makes it even harder :-(

I'll try, thanks for the feedback.

BR,
Jani.




-- 
Jani Nikula, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 0/5] Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load
  2017-04-03 15:09       ` Jani Nikula
@ 2017-04-06 23:23         ` Andrea Arcangeli
  2017-04-06 23:23           ` [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock Andrea Arcangeli
                             ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-06 23:23 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

I'm also getting kernel hangs every couple of days. For me it's still
not fixed in 4.11-rc5. It's hard to reproduce; the best reproducer is
to build LineageOS 14.1 on the host while running LTP in a guest to
stress the guest VM.

Initially I thought it was related to the fact that I upgraded the
xf86 intel driver just a few weeks ago (I had deferred any upgrade of
the userland intel driver since last July because of a regression that
never got fixed and broke xterm for me). After I found a workaround
for the userland regression (appended at the end for reference) I
started getting kernel hangs, but they are separate issues as far as I
can tell.

It's not well tested so beware... (it survived a couple of builds and
some VM reclaim but that's it).

The first patch 1/5 is the potential fix for the i915 kernel hang. The
rest are incremental improvements.

And I have no great solution for when the shrinker is invoked with the
struct_mutex held and recurses on the lock. I don't think we can
possibly wait in that case (other than the flush_work that the second
patch does), but in practice it shouldn't be a big deal: the big RAM
eater is unlikely to be i915 when the system is low on memory.

Andrea Arcangeli (5):
  i915: avoid kernel hang caused by synchronize rcu struct_mutex
    deadlock
  i915: flush gem obj freeing workqueues to add accuracy to the i915
    shrinker
  i915: initialize the free_list of the fencing atomic_helper
  i915: schedule while freeing the lists of gem objects
  i915: fence workqueue optimization

 drivers/gpu/drm/i915/i915_gem.c          | 15 +++++++++++++++
 drivers/gpu/drm/i915/i915_gem_shrinker.c | 15 +++++++++++----
 drivers/gpu/drm/i915/intel_display.c     |  7 ++++---
 3 files changed, 30 insertions(+), 7 deletions(-)

===
Userland workaround for unusable xterm after commit
3d3d18f086cdda72ee18a454db70ca72c6e3246c (unrelated to this kernel
issue, just for reference of what I'm running in userland).

diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index 11beb90..d349203 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -17430,11 +17430,15 @@ sna_flush_callback(CallbackListPtr *list, pointer user_data, pointer call_data)
 {
 	struct sna *sna = user_data;
 
+#if 0
 	if (!sna->needs_dri_flush)
 		return;
+#endif
 
 	sna_accel_flush(sna);
+#if 0
 	sna->needs_dri_flush = false;
+#endif
 }
 
 static void

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
@ 2017-04-06 23:23           ` Andrea Arcangeli
  2017-04-07  9:05             ` [Intel-gfx] " Joonas Lahtinen
  2017-04-06 23:23           ` [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker Andrea Arcangeli
                             ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-06 23:23 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

synchronize_rcu/synchronize_sched/synchronize_rcu_expedited() will
hang until their own workqueue items have run. The i915 gem workqueues
will wait on the struct_mutex to be released. So we cannot wait for a
quiescent state using those RCU primitives while holding the
struct_mutex, or we create a circular lock dependency resulting in
kernel hangs (reproducible, but undetected by lockdep).

This started with commit 3d3d18f086cdda72ee18a454db70ca72c6e3246c;
apparently lockdep didn't detect it.
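
To make the cycle concrete, a rough pseudo-C sketch (the snippets below
are made up for illustration; the real call paths are in the hung task
traces that follow):

	/* task A: reclaim/shrinker path, under the lock */
	mutex_lock(&dev_priv->drm.struct_mutex);
	...
	/* queues its grace-period work on the system_wq and waits */
	synchronize_rcu_expedited();

	/* kworker: already running __i915_gem_free_work() */
	mutex_lock(&dev_priv->drm.struct_mutex);	/* blocks on task A */

	/* task A waits for the kworker, the kworker waits for task A */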

kswapd0         D    0   700      2 0x00000000
Call Trace:
? __schedule+0x1a5/0x660
? schedule+0x36/0x80
? _synchronize_rcu_expedited.constprop.65+0x2ef/0x300
? wake_up_bit+0x20/0x20
? rcu_stall_kick_kthreads.part.54+0xc0/0xc0
? rcu_exp_wait_wake+0x530/0x530
? i915_gem_shrink+0x34b/0x4b0
? i915_gem_shrinker_scan+0x7c/0x90
? i915_gem_shrinker_scan+0x7c/0x90
? shrink_slab.part.61.constprop.72+0x1c1/0x3a0
? shrink_zone+0x154/0x160
? kswapd+0x40a/0x720
? kthread+0xf4/0x130
? try_to_free_pages+0x450/0x450
? kthread_create_on_node+0x40/0x40
? ret_from_fork+0x23/0x30
plasmashell     D    0  4657   4614 0x00000000
Call Trace:
? __schedule+0x1a5/0x660
? schedule+0x36/0x80
? schedule_preempt_disabled+0xe/0x10
? __mutex_lock.isra.4+0x1c9/0x790
? i915_gem_close_object+0x26/0xc0
? i915_gem_close_object+0x26/0xc0
? drm_gem_object_release_handle+0x48/0x90
? drm_gem_handle_delete+0x50/0x80
? drm_ioctl+0x1fa/0x420
? drm_gem_handle_create+0x40/0x40
? pipe_write+0x391/0x410
? __vfs_write+0xc6/0x120
? do_vfs_ioctl+0x8b/0x5d0
? SyS_ioctl+0x3b/0x70
? entry_SYSCALL_64_fastpath+0x13/0x94
kworker/0:0     D    0 29186      2 0x00000000
Workqueue: events __i915_gem_free_work
Call Trace:
? __schedule+0x1a5/0x660
? schedule+0x36/0x80
? schedule_preempt_disabled+0xe/0x10
? __mutex_lock.isra.4+0x1c9/0x790
? del_timer_sync+0x44/0x50
? update_curr+0x57/0x110
? __i915_gem_free_objects+0x31/0x300
? __i915_gem_free_objects+0x31/0x300
? __i915_gem_free_work+0x2d/0x40
? process_one_work+0x13a/0x3b0
? worker_thread+0x4a/0x460
? kthread+0xf4/0x130
? process_one_work+0x3b0/0x3b0
? kthread_create_on_node+0x40/0x40
? ret_from_fork+0x23/0x30

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem.c          |  9 +++++++++
 drivers/gpu/drm/i915/i915_gem_shrinker.c | 14 ++++++++++----
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 67b1fc5..3982489 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4742,6 +4742,13 @@ int i915_gem_freeze(struct drm_i915_private *dev_priv)
 	i915_gem_shrink_all(dev_priv);
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
+	/*
+	 * Cannot call synchronize_rcu() inside the struct_mutex
+	 * because it may block until workqueues complete, and the
+	 * running workqueue may wait on the struct_mutex.
+	 */
+	synchronize_rcu(); /* wait for our earlier RCU delayed slab frees */
+
 	intel_runtime_pm_put(dev_priv);
 
 	return 0;
@@ -4781,6 +4788,8 @@ int i915_gem_freeze_late(struct drm_i915_private *dev_priv)
 	}
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
+	synchronize_rcu_expedited();
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
index d5d2b4c..fea1454 100644
--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -235,9 +235,6 @@ i915_gem_shrink(struct drm_i915_private *dev_priv,
 	if (unlock)
 		mutex_unlock(&dev_priv->drm.struct_mutex);
 
-	/* expedite the RCU grace period to free some request slabs */
-	synchronize_rcu_expedited();
-
 	return count;
 }
 
@@ -263,7 +260,6 @@ unsigned long i915_gem_shrink_all(struct drm_i915_private *dev_priv)
 				I915_SHRINK_BOUND |
 				I915_SHRINK_UNBOUND |
 				I915_SHRINK_ACTIVE);
-	synchronize_rcu(); /* wait for our earlier RCU delayed slab frees */
 
 	return freed;
 }
@@ -324,6 +320,16 @@ i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 	if (unlock)
 		mutex_unlock(&dev->struct_mutex);
 
+	if (likely(__mutex_owner(&dev->struct_mutex) != current))
+		/*
+		 * If reclaim was invoked by an allocation done while
+		 * holding the struct mutex, we cannot call
+		 * synchronize_rcu_expedited() as it depends on
+		 * workqueues to run but the running workqueue may be
+		 * blocked waiting on us to release struct_mutex.
+		 */
+		synchronize_rcu_expedited();
+
 	return freed;
 }
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
  2017-04-06 23:23           ` [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock Andrea Arcangeli
@ 2017-04-06 23:23           ` Andrea Arcangeli
  2017-04-07 10:02             ` Chris Wilson
  2017-04-06 23:23           ` [PATCH 3/5] i915: initialize the free_list of the fencing atomic_helper Andrea Arcangeli
                             ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-06 23:23 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

Waiting for an RCU grace period only guarantees the freeing work gets
queued; until the queued work has run and returned, there is no
guarantee the memory was actually freed. So flush the work to provide
better guarantees to the reclaim code, in addition to waiting for an
RCU grace period to pass.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem.c          | 2 ++
 drivers/gpu/drm/i915/i915_gem_shrinker.c | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 3982489..612fde3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4748,6 +4748,7 @@ int i915_gem_freeze(struct drm_i915_private *dev_priv)
 	 * running workqueue may wait on the struct_mutex.
 	 */
 	synchronize_rcu(); /* wait for our earlier RCU delayed slab frees */
+	flush_work(&dev_priv->mm.free_work);
 
 	intel_runtime_pm_put(dev_priv);
 
@@ -4789,6 +4790,7 @@ int i915_gem_freeze_late(struct drm_i915_private *dev_priv)
 	mutex_unlock(&dev_priv->drm.struct_mutex);
 
 	synchronize_rcu_expedited();
+	flush_work(&dev_priv->mm.free_work);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
index fea1454..30f79af 100644
--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -329,6 +329,7 @@ i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 		 * blocked waiting on us to release struct_mutex.
 		 */
 		synchronize_rcu_expedited();
+	flush_work(&dev_priv->mm.free_work);
 
 	return freed;
 }

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] i915: initialize the free_list of the fencing atomic_helper
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
  2017-04-06 23:23           ` [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock Andrea Arcangeli
  2017-04-06 23:23           ` [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker Andrea Arcangeli
@ 2017-04-06 23:23           ` Andrea Arcangeli
  2017-04-07 10:35             ` Chris Wilson
  2017-04-06 23:23           ` [PATCH 4/5] i915: schedule while freeing the lists of gem objects Andrea Arcangeli
                             ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-06 23:23 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

Just in case the llist model changes and NULL isn't valid
initialization.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/i915/intel_display.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index ed1f4f2..24f303e 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -16630,6 +16630,7 @@ int intel_modeset_init(struct drm_device *dev)
 
 	dev->mode_config.funcs = &intel_mode_funcs;
 
+	init_llist_head(&dev_priv->atomic_helper.free_list);
 	INIT_WORK(&dev_priv->atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] i915: schedule while freeing the lists of gem objects
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
                             ` (2 preceding siblings ...)
  2017-04-06 23:23           ` [PATCH 3/5] i915: initialize the free_list of the fencing atomic_helper Andrea Arcangeli
@ 2017-04-06 23:23           ` Andrea Arcangeli
  2017-04-06 23:23           ` [PATCH 5/5] i915: fence workqueue optimization Andrea Arcangeli
  2017-04-10 10:15           ` [PATCH 0/5] Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load Martin Kepplinger
  5 siblings, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-06 23:23 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

Add cond_resched().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 612fde3..c81baeb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4205,6 +4205,8 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		GEM_BUG_ON(!RB_EMPTY_ROOT(&obj->vma_tree));
 
 		list_del(&obj->global_link);
+
+		cond_resched();
 	}
 	intel_runtime_pm_put(i915);
 	mutex_unlock(&i915->drm.struct_mutex);
@@ -4230,6 +4232,8 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 
 		kfree(obj->bit_17);
 		i915_gem_object_free(obj);
+
+		cond_resched();
 	}
 }
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] i915: fence workqueue optimization
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
                             ` (3 preceding siblings ...)
  2017-04-06 23:23           ` [PATCH 4/5] i915: schedule while freeing the lists of gem objects Andrea Arcangeli
@ 2017-04-06 23:23           ` Andrea Arcangeli
  2017-04-07  9:58             ` Chris Wilson
  2017-04-10 10:15           ` [PATCH 0/5] Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load Martin Kepplinger
  5 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-06 23:23 UTC (permalink / raw)
  To: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

Keep running llist_del_all() until the free_list is found empty; this
may avoid having to schedule more work.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/i915/intel_display.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 24f303e..931f0c7 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -14374,9 +14374,9 @@ static void intel_atomic_helper_free_state(struct drm_i915_private *dev_priv)
 	struct intel_atomic_state *state, *next;
 	struct llist_node *freed;
 
-	freed = llist_del_all(&dev_priv->atomic_helper.free_list);
-	llist_for_each_entry_safe(state, next, freed, freed)
-		drm_atomic_state_put(&state->base);
+	while ((freed = llist_del_all(&dev_priv->atomic_helper.free_list)))
+		llist_for_each_entry_safe(state, next, freed, freed)
+			drm_atomic_state_put(&state->base);
 }
 
 static void intel_atomic_helper_free_state_worker(struct work_struct *work)

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Intel-gfx] [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock
  2017-04-06 23:23           ` [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock Andrea Arcangeli
@ 2017-04-07  9:05             ` Joonas Lahtinen
  0 siblings, 0 replies; 21+ messages in thread
From: Joonas Lahtinen @ 2017-04-07  9:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Martin Kepplinger, Thorsten Leemhuis,
	daniel.vetter, Dave Airlie, Chris Wilson
  Cc: intel-gfx, linux-kernel, dri-devel

On pe, 2017-04-07 at 01:23 +0200, Andrea Arcangeli wrote:
> synchronize_rcu/synchronize_sched/synchronize_rcu_expedited() will
> hang until its own workqueues are run. The i915 gem workqueues will
> wait on the struct_mutex to be released. So we cannot wait for a
> quiescent state using those rcu primitives while holding the
> struct_mutex or it creates a circular lock dependency resulting in
> kernel hangs (which is reproducible but goes undetected by lockdep).
> 
> This started in commit 3d3d18f086cdda72ee18a454db70ca72c6e3246c and
> lockdep didn't detect it apparently.

The right format is:

Fixes: 3d3d18f086cd ("drm/i915: Avoid rcu_barrier() from reclaim paths (shrinker)")

> @@ -324,6 +320,16 @@ i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>  	if (unlock)
>  		mutex_unlock(&dev->struct_mutex);
>  
> +	if (likely(__mutex_owner(&dev->struct_mutex) != current))

This check can be dropped and synchronize_rcu_expedited() should be
embedded directly in the if (unlock) branch, as that is functionally
equivalent. This can be applied to all the unlock cases, not just this
one. That should be the correct way to avoid the deadlock. I've sent a
patch to do this (Cc'd you); can you verify that it gets rid of the
problem for you?
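
I.e. something along these lines (untested sketch):

	if (unlock) {
		mutex_unlock(&dev->struct_mutex);

		/*
		 * We know we are not the struct_mutex holder here, so the
		 * gem freeing work can make progress while we wait.
		 */
		synchronize_rcu_expedited();
	}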

> +		/*
> +		 * If reclaim was invoked by an allocation done while
> +		 * holding the struct mutex, we cannot call
> +		 * synchronize_rcu_expedited() as it depends on
> +		 * workqueues to run but the running workqueue may be
> +		 * blocked waiting on us to release struct_mutex.
> +		 */
> +		synchronize_rcu_expedited();
> +
>  	return freed;
>  }
>  
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] i915: fence workqueue optimization
  2017-04-06 23:23           ` [PATCH 5/5] i915: fence workqueue optimization Andrea Arcangeli
@ 2017-04-07  9:58             ` Chris Wilson
  2017-04-07 13:13               ` Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2017-04-07  9:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 01:23:47AM +0200, Andrea Arcangeli wrote:
> Insist to run llist_del_all() until the free_list is found empty, this
> may avoid having to schedule more workqueues.

The work will already be scheduled (every time we add the first element,
the work is scheduled, and the scheduled bit is cleared before the work
is executed). So we aren't saving the kworker from having to process
another work item, but we may leave that work item with nothing to do.
The question is whether we want to trap the kworker here, and presumably
you will also want to add a cond_resched() between passes.
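
Roughly, the free path looks like this (simplified sketch of the
current code, so take the details with a grain of salt):

	static void __i915_gem_free_object_rcu(struct rcu_head *head)
	{
		struct drm_i915_gem_object *obj =
			container_of(head, typeof(*obj), rcu);
		struct drm_i915_private *i915 = to_i915(obj->base.dev);

		/*
		 * llist_add() returns true only when it inserts the first
		 * element into an empty list, so the work is scheduled
		 * exactly once per batch of frees.
		 */
		if (llist_add(&obj->freed, &i915->mm.free_list))
			schedule_work(&i915->mm.free_work);
	}
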
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker
  2017-04-06 23:23           ` [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker Andrea Arcangeli
@ 2017-04-07 10:02             ` Chris Wilson
  2017-04-07 13:06               ` Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2017-04-07 10:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> Waiting a RCU grace period only guarantees the work gets queued, but
> until after the queued workqueue returns, there's no guarantee the
> memory was actually freed. So flush the work to provide better
> guarantees to the reclaim code in addition of waiting a RCU grace
> period to pass.

We are not allowed to call flush_work() from the shrinker; the workqueue
doesn't have, and can't have, the right reclaim flags.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] i915: initialize the free_list of the fencing atomic_helper
  2017-04-06 23:23           ` [PATCH 3/5] i915: initialize the free_list of the fencing atomic_helper Andrea Arcangeli
@ 2017-04-07 10:35             ` Chris Wilson
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Wilson @ 2017-04-07 10:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 01:23:45AM +0200, Andrea Arcangeli wrote:
> Just in case the llist model changes and NULL isn't valid
> initialization.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Applied, thanks.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker
  2017-04-07 10:02             ` Chris Wilson
@ 2017-04-07 13:06               ` Andrea Arcangeli
  2017-04-07 15:30                 ` Chris Wilson
  0 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-07 13:06 UTC (permalink / raw)
  To: Chris Wilson, Martin Kepplinger, Thorsten Leemhuis,
	daniel.vetter, Dave Airlie, intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 11:02:11AM +0100, Chris Wilson wrote:
> On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> > Waiting a RCU grace period only guarantees the work gets queued, but
> > until after the queued workqueue returns, there's no guarantee the
> > memory was actually freed. So flush the work to provide better
> > guarantees to the reclaim code in addition of waiting a RCU grace
> > period to pass.
> 
> We are not allowed to call flush_work() from the shrinker, the workqueue
> doesn't have and can't have the right reclaim flags.

I figured the flush_work had to be conditional on "unlock" being true
too in the i915 shrinker (not only synchronize_rcu_expedited()), and I
already fixed that bit, but I didn't think it would be a problem to
wait for the workqueue as long as reclaim didn't recurse on the
struct_mutex (it is a problem if unlock is false, of course, as we
would be back to square one). I didn't get further hangs, and I assume
I've been through a couple of synchronize_rcu_expedited() and
flush_work calls (I should add dynamic tracing to be sure).

Also note that I didn't get any lockdep warning when I reproduced the
workqueue hang in 4.11-rc5, so at least as far as lockdep is concerned
there's no problem with calling synchronize_rcu_expedited; it couldn't
notice we were holding the struct_mutex while waiting for the queued
work to run.

Also note that recursing on the lock (the unlock == false case) is
something nothing else does. I'm not sure it's worth the risk, and
whether you shouldn't just call mutex_trylock in the shrinker instead
of mutex_trylock_recursive. Recursing on the lock internally in the
same context is one thing, but recursing through the whole reclaim
path is much harder to prove safe.
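
For reference, the locking pattern I mean is roughly this (sketch from
memory, not the exact shrinker code):

	switch (mutex_trylock_recursive(&dev->struct_mutex)) {
	case MUTEX_TRYLOCK_FAILED:
		return SHRINK_STOP;	/* somebody else holds the lock */
	case MUTEX_TRYLOCK_SUCCESS:
		unlock = true;		/* normal case: we took the lock */
		break;
	case MUTEX_TRYLOCK_RECURSIVE:
		unlock = false;		/* reclaim recursed from under our
					 * own struct_mutex */
		break;
	}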

You could start dropping objects and wiping vmas and stuff in the
middle of some kmalloc/alloc_pages that doesn't expect it and then
crash for other reasons. So this reclaim recursion model of the
shrinker is quite unique and quite challenging to prove safe if you
keep using mutex_trylock_recursive in i915_gem_shrinker_scan.

Lock recursion in all other places could be dropped without runtime
downsides; the only place where mutex_trylock_recursive makes a design
difference and makes sense to be used is i915_gem_shrinker_scan. The
rest are implementation issues, not fundamental shrinker design, and
it would be nice if those other mutex_trylock_recursive calls were all
removed so the only one left is in i915_gem_shrinker_scan and nowhere
else (or to drop it from i915_gem_shrinker_scan as well).

mutex_trylock_recursive() should also be patched to use
READ_ONCE(__mutex_owner(lock)) because currently it breaks C.

In fact, in the whole kernel, i915 and msm drm are the only two users
of that function.

Another thing is what value to return from i915_gem_shrinker_scan when
unlock is false and we can't possibly wait for the memory to be freed,
let alone for an RCU grace period. For various reasons I think it's
safer to return the current "freed" count, even though we could also
return "0" in that case. There are different tradeoffs: returning
"freed" is less likely to trigger an early OOM, as the VM thinks it's
still making progress and in fact it will get more free memory shortly;
returning SHRINK_STOP would also be an option, and it would make the VM
insist more on the other slabs, so it would be more reliable at freeing
memory in a timely way, but more at risk of an early OOM. I think
returning "freed" is the better tradeoff of the two, but I suggest
adding a comment, as it's not exactly obvious which is better.
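
Concretely, at the end of i915_gem_shrinker_scan() the choice would be
something like this (sketch only):

	if (!unlock)
		/*
		 * We recursed on the struct_mutex and could not wait for
		 * the RCU grace period nor flush the freeing work.
		 * Returning "freed" claims progress (the memory does show
		 * up shortly); returning SHRINK_STOP instead would make
		 * the VM insist on the other slabs, which is more timely
		 * but more at risk of an early OOM.
		 */
		return freed;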

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] i915: fence workqueue optimization
  2017-04-07  9:58             ` Chris Wilson
@ 2017-04-07 13:13               ` Andrea Arcangeli
  0 siblings, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-07 13:13 UTC (permalink / raw)
  To: Chris Wilson, Martin Kepplinger, Thorsten Leemhuis,
	daniel.vetter, Dave Airlie, intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 10:58:38AM +0100, Chris Wilson wrote:
> On Fri, Apr 07, 2017 at 01:23:47AM +0200, Andrea Arcangeli wrote:
> > Insist to run llist_del_all() until the free_list is found empty, this
> > may avoid having to schedule more workqueues.
> 
> The work will already be scheduled (everytime we add the first element,
> the work is scheduled, and the scheduled bit is cleared before the work
> is executed). So we aren't saving the kworker from having to process
> another work, but we may make that having nothing to do. The question is
> whether we want to trap the kworker here, and presumably you will also want
> to add a cond_resched() between passes.

Yes it is somewhat dubious in the two event only case, but it will
save kworker in case of more events if there is a flood of
llist_add. It just looked fast enough but it's up to you, it's a
cmpxchg more for each intel_atomic_helper_free_state. If it's unlikely
more work is added, it's better to drop it. Agree about
cond_resched() if we keep it.

The same issue exists in __i915_gem_free_work, but I guess it's more
likely there that by the time __i915_gem_free_objects returns the
free_list isn't empty anymore because __i915_gem_free_objects has a
longer runtime but then you may want to re-evaluate that too as it's
slower for the two llist_add in a row case and only pays off from the
third.

	while ((freed = llist_del_all(&i915->mm.free_list)))
		__i915_gem_free_objects(i915, freed);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker
  2017-04-07 13:06               ` Andrea Arcangeli
@ 2017-04-07 15:30                 ` Chris Wilson
  2017-04-07 16:48                   ` Andrea Arcangeli
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Wilson @ 2017-04-07 15:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin Kepplinger, Thorsten Leemhuis, daniel.vetter, Dave Airlie,
	intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 03:06:00PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 07, 2017 at 11:02:11AM +0100, Chris Wilson wrote:
> > On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> > > Waiting a RCU grace period only guarantees the work gets queued, but
> > > until after the queued workqueue returns, there's no guarantee the
> > > memory was actually freed. So flush the work to provide better
> > > guarantees to the reclaim code in addition of waiting a RCU grace
> > > period to pass.
> > 
> > We are not allowed to call flush_work() from the shrinker, the workqueue
> > doesn't have and can't have the right reclaim flags.
> 
> I figured the flush_work had to be conditional to "unlock" being true
> too in the i915 shrinker (not only synchronize_rcu_expedited()), and I
> already fixed that bit, but I didn't think it would be a problem to
> wait for the workqueue as long as reclaim didn't recurse on the
> struct_mutex (it is a problem if unlock is false of course as we would
> be back to square one). I didn't get further hangs and I assume I've
> been running a couple of synchronize_rcu_expedited() and flush_work (I
> should add dynamic tracing to be sure).

Not getting hangs is a good sign, but lockdep doesn't like it:

[  460.684901] WARNING: CPU: 1 PID: 172 at kernel/workqueue.c:2418 check_flush_dependency+0x92/0x130
[  460.684924] workqueue: PF_MEMALLOC task 172(kworker/1:1H) is flushing !WQ_MEM_RECLAIM events:__i915_gem_free_work [i915]

If I allocated the workqueue with WQ_MEM_RECLAIM, it complains bitterly
as well.

> Also note, I didn't get any lockdep warning when I reproduced the
> workqueue hang in 4.11-rc5 so at least as far as lockdep is concerned
> there's no problem to call synchronize_rcu_expedited and it couldn't
> notice we were holding the struct_mutex while waiting for the new
> workqueue to run.

Yes, that is concerning. I think it's due to the coupling via the
struct completion not being picked up by lockdep, and I hope the
"crossrelease" patches will fix the lack of warnings.

> Also note recursing on the lock (unlock false case) is something
> nothing else does, I'm not sure if it's worth the risk and if you
> shouldn't just call mutex_trylock in the shrinker instead of
> mutex_trylock_recursive. One thing was to recurse on the lock
> internally in the same context, but recursing through the whole
> reclaim is more dubious as safe.

We know. We don't use trylock in order to reduce the frequency of
users' OOMs. Peter added mutex_trylock_recursive() because we were
already doing recursive locking in the shrinker, and although we know
we shouldn't, getting rid of the recursion is something we are doing,
but slowly.

> You could start dropping objects and wiping vmas and stuff in the
> middle of some kmalloc/alloc_pages that doesn't expect it and then
> crash for other reasons. So this reclaim recursion model of the
> shinker is quite unique and quite challenging to proof as safe if you
> keep using mutex_trylock_recursive in i915_gem_shrinker_scan.

I know. Trying to stay on top of all the kmallocs under the struct_mutex,
and being aware that the shrinker can and will undo your objects as you
work, is a continual battle, and it catches everyone working on i915.ko
by surprise. Our policy to avoid surprises is based around pin before alloc.
 
> Lock recursion in all other places could be dropped without runtime
> downsides, the only place mutex_trylock_recursive makes a design
> difference and makes sense to be used is in i915_gem_shrinker_scan,
> the rest are implementation issues not fundamental shrinker design and
> it'd be nice if those other mutex_trylock_recursive would all be
> removed and the only one that is left is in i915_gem_shrinker_scan and
> nowhere else (or to drop it also from i915_gem_shrinker_scan).

We do need it for shrinker_count as well: if we just report 0 objects,
the shrinker_scan callback will be skipped, iirc. We also need it for
direct calls to i915_gem_shrink(), which themselves may or may not be
underneath the struct_mutex at the time.

> mutex_trylock_recursive() should also be patched to use
> READ_ONCE(__mutex_owner(lock)) because currently it breaks C.

I don't follow,

static inline struct task_struct *__mutex_owner(struct mutex *lock)
{
        return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x07);
}

The atomic read is equivalent to READ_ONCE(). What's broken here? (I
guess strict aliasing and pointer cast?)

> In the whole kernel i915 and msm drm are the only two users of such
> function in fact.

Yes, Peter will continue to remind us to fix our code and complain until
it is.

> Another thing is what value return from i915_gem_shrinker_scan when
> unlock is false, and we can't possibly wait for the memory to be freed
> let alone for a rcu grace period. For various reasons I think it's
> safer to return the current "free" even if we could also return "0" in
> such case. There are different tradeoffs, returning "free" is less
> likely to trigger an early OOM as the VM thinks it's still making
> progress and in fact it will get more free memory shortly, while
> returning SHRINK_STOP would also be an option and it would insist more
> on the other slabs so it would be more reliable at freeing memory
> timely, but it would be more at risk of early OOM. I think returning
> "free" is the better tradeoff of the two, but I suggest to add a
> comment as it's not exactly obvious what is better.

Ah. The RCU freeing is only for the small fry, the slabs from which
requests and objects are allocated. It's the gigabytes of pages we have
pinned that can be released by i915_gem_shrink() that we count as
freed, even though we only return them to the system and hope the anon
LRU page scanner will make them available for reuse (if they are dirty,
they will need to be swapped out, but some will be returned to the
system directly by truncating the shmemfs filp backing an object).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker
  2017-04-07 15:30                 ` Chris Wilson
@ 2017-04-07 16:48                   ` Andrea Arcangeli
  2017-04-10  9:39                     ` Chris Wilson
  0 siblings, 1 reply; 21+ messages in thread
From: Andrea Arcangeli @ 2017-04-07 16:48 UTC (permalink / raw)
  To: Chris Wilson, Martin Kepplinger, Thorsten Leemhuis,
	daniel.vetter, Dave Airlie, intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 04:30:11PM +0100, Chris Wilson wrote:
> Not getting hangs is a good sign, but lockdep doesn't like it:
> 
> [  460.684901] WARNING: CPU: 1 PID: 172 at kernel/workqueue.c:2418 check_flush_dependency+0x92/0x130
> [  460.684924] workqueue: PF_MEMALLOC task 172(kworker/1:1H) is flushing !WQ_MEM_RECLAIM events:__i915_gem_free_work [i915]
> 
> If I allocated the workqueue with WQ_MEM_RELCAIM, it complains bitterly
> as well.

So in PF_MEMALLOC context we can't flush a workqueue with
!WQ_MEM_RECLAIM.

	system_wq = alloc_workqueue("events", 0, 0);

My point is that synchronize_rcu_expedited will still push its work
into the same system_wq workqueue...

		/* Marshall arguments & schedule the expedited grace period. */
		rew.rew_func = func;
		rew.rew_rsp = rsp;
		rew.rew_s = s;
		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
		schedule_work(&rew.rew_work);

It's also using schedule_work, so either the above is a false
positive, or we still have a problem with synchronize_rcu_expedited;
it's just not a reproducible issue anymore after we stop running it
under the struct_mutex.

Even synchronize_sched will wait on the system_wq if
synchronize_rcu_expedited has been issued in parallel by some other
code; it's just that there's no check for it because it's not invoking
flush_work to wait.

The deadlock happens if we flush_work() while holding any lock that
may be taken by any of the work items that could be queued there. i915
makes sure to flush_work only if the struct_mutex was released (unlike
my initial version), but we're not checking whether any of the other
system_wq work items could possibly be taking a lock that may have been
held during the allocation that invoked reclaim. I suppose that is the
problem left, but I don't see how flush_work is special about it;
synchronize_rcu_expedited would still have the same issue, no? (despite
no lockdep warning)

I suspect this means synchronize_rcu_expedited() is not usable in
reclaim context and lockdep should warn if PF_MEMALLOC is set when
synchronize_rcu_expedited/synchronize_sched/synchronize_rcu are
called.

Probably to fix this we should create a private workqueue for both RCU
and i915 and stop sharing the system_wq with the rest of the system
(and of course set WQ_MEM_RECLAIM in both workqueues). This makes sure
that when we call synchronize_rcu_expedited(); flush_work() from the
shrinker, we won't risk waiting on other random work that may be taking
locks held by the code that invoked reclaim during an allocation.
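
Something along these lines (untested sketch; "free_wq" is a made up
name, and the RCU side would need an equivalent change in the RCU core):

	/* at driver load, instead of relying on the shared system_wq */
	dev_priv->mm.free_wq = alloc_workqueue("i915-free",
					       WQ_MEM_RECLAIM, 0);
	if (!dev_priv->mm.free_wq)
		return -ENOMEM;

	/* ...and queue the object freeing work on it */
	queue_work(dev_priv->mm.free_wq, &dev_priv->mm.free_work);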

The macro bug of waiting on system_wq 100% of the time while always
holding the struct_mutex is gone, but we need to perfect this further
and stop using the system_wq for RCU and i915 shrinker work.

> We do need it for shrinker_count as well. If we just report 0 objects,

Yes the shrinker_count too.

> the shrinker_scan callback will be skipped, iirc. All we do need it for
> direct calls to i915_gem_shrink() which themselves may or may not be
> underneath the struct_mutex at the time.

Yes.

> I don't follow,
> 
> static inline struct task_struct *__mutex_owner(struct mutex *lock)
> {
>         return (struct task_struct *)(atomic_long_read(&lock->owner) & ~0x07);
> }
> 
> The atomic read is equivalent to READ_ONCE(). What's broken here? (I
> guess strict aliasing and pointer cast?)

That was an oversight; atomic64_read does READ_ONCE internally (as it
should, of course, being an atomic read). I didn't recall that it uses
atomic_long_read.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker
  2017-04-07 16:48                   ` Andrea Arcangeli
@ 2017-04-10  9:39                     ` Chris Wilson
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Wilson @ 2017-04-10  9:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin Kepplinger, Joonas Lahtinen, Thorsten Leemhuis,
	daniel.vetter, Dave Airlie, intel-gfx, linux-kernel, dri-devel

On Fri, Apr 07, 2017 at 06:48:58PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 07, 2017 at 04:30:11PM +0100, Chris Wilson wrote:
> > Not getting hangs is a good sign, but lockdep doesn't like it:
> > 
> > [  460.684901] WARNING: CPU: 1 PID: 172 at kernel/workqueue.c:2418 check_flush_dependency+0x92/0x130
> > [  460.684924] workqueue: PF_MEMALLOC task 172(kworker/1:1H) is flushing !WQ_MEM_RECLAIM events:__i915_gem_free_work [i915]
> > 
> > If I allocate the workqueue with WQ_MEM_RECLAIM, it complains bitterly
> > as well.
> 
> So in PF_MEMALLOC context we can't flush a workqueue with
> !WQ_MEM_RECLAIM.
> 
> 	system_wq = alloc_workqueue("events", 0, 0);
> 
> My point is that synchronize_rcu_expedited will still push its work
> onto the same system_wq workqueue...
> 
> 		/* Marshall arguments & schedule the expedited grace period. */
> 		rew.rew_func = func;
> 		rew.rew_rsp = rsp;
> 		rew.rew_s = s;
> 		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> 		schedule_work(&rew.rew_work);
> 
> It's also using schedule_work, so either the above is a false
> positive, or we still have a problem with synchronize_rcu_expedited,
> just not a reproducible issue anymore after we stop running it under
> the struct_mutex.

We still do have a problem with using synchronize_rcu_expedited() from
the shrinker, as we may be under someone else's mutex that is involved
in its own RCU dance.

> Even synchronize_sched will wait on the system_wq if
> synchronize_rcu_expedited has been issued in parallel by some other
> code; it's just that there's no check for it because it's not invoking
> flush_work to wait.

Right.
 
> The deadlock happens if we flush_work() while holding any lock that
> may be taken by any of the work items that could be queued there. i915
> makes sure to flush_work only if the struct_mutex was released (not
> my initial version), but we're not checking whether any of the other
> work items queued on system_wq could be taking a lock that may have
> been held during the allocation that invoked reclaim. I suppose that
> is the problem left, but I don't see how flush_work is special about
> it; synchronize_rcu_expedited would still have the same issue, no?
> (despite no lockdep warning)
> 
> I suspect this means synchronize_rcu_expedited() is not usable in
> reclaim context and lockdep should warn if PF_MEMALLOC is set when
> synchronize_rcu_expedited/synchronize_sched/synchronize_rcu are
> called.

Yes.

> Probably to fix this we should create a private workqueue for both RCU
> and i915 and stop sharing the system_wq with the rest of the system
> (and of course set WQ_MEM_RECLAIM in both workqueues). That makes sure
> that when we call synchronize_rcu_expedited or flush_work from the
> shrinker, we won't risk waiting on other random work that may be taking
> locks held by the code that invoked reclaim during an allocation.

We simply do not need to do our own synchronize_rcu* -- it's only used
to flush our slab frees on the off chance that (a) we have any and (b)
we do manage to free a whole slab. It is not the bulk of the memory that
we return to the system from the shrinker.

In the other thread, I stated that we should simply remove it. The
kswapd reclaim path should try to reclaim RCU slabs (by doing a
synchronize_sched or equivalent).
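
Roughly, that direction would look like this (a hypothetical sketch
with made-up names, not the actual i915 shrinker):

	static unsigned long example_shrink_scan(struct shrinker *shrinker,
						 struct shrink_control *sc)
	{
		/* example_drop_objects() is a made-up helper standing in
		 * for the real object freeing */
		unsigned long freed = example_drop_objects(sc->nr_to_scan);

		/*
		 * No synchronize_rcu_expedited() here: the RCU-deferred
		 * slab frees are left to the core reclaim path instead of
		 * being forced from shrinker context.
		 */
		return freed;
	}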

> The major bug of waiting on the system_wq 100% of the time while always
> holding the struct_mutex is gone, but we need to refine this further
> and stop using the system_wq for RCU and i915 shrinker work.

Agreed. My preference is to simply not do it and leave the dangling RCU
to the core reclaim paths.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load
  2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
                             ` (4 preceding siblings ...)
  2017-04-06 23:23           ` [PATCH 5/5] i915: fence workqueue optimization Andrea Arcangeli
@ 2017-04-10 10:15           ` Martin Kepplinger
  5 siblings, 0 replies; 21+ messages in thread
From: Martin Kepplinger @ 2017-04-10 10:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thorsten Leemhuis, daniel.vetter, Dave Airlie, Chris Wilson,
	intel-gfx, linux-kernel, dri-devel

Am 07.04.2017 01:23 schrieb Andrea Arcangeli:
> I'm also getting kernel hangs every couple of days. For me it's still
> not fixed here in 4.11-rc5. It's hard to reproduce; the best
> reproducer is to build lineageos 14.1 on the host while running LTP in
> a guest to stress the guest VM.
> 
> Initially I thought it was related to the fact that I upgraded the xf86
> intel driver just a few weeks ago (I had deferred any upgrade of the
> userland intel driver since last July because of a regression that
> never got fixed and broke xterm for me). After I found a workaround
> for the userland regression (appended at the end for reference) I
> started getting kernel hangs, but they are separate issues as far as I
> can tell.
> 
> It's not well tested so beware... (it survived a couple of builds and
> some VM reclaim but that's it).
> 
> The first patch 1/5 is the potential fix for the i915 kernel hang. The
> rest are incremental improvements.
> 
> And I've no great solution for when the shrinker is invoked with the
> struct_mutex held and recurses on the lock. I don't think we can
> possibly wait in such a case (other than flushing work, which the
> second patch does), but in practice it shouldn't be a big deal; the
> big RAM eater is unlikely to be i915 when the system is low on memory.
> 

FWIW, without having insight here, -rc6 seems to be good.
No disturbing gpu hangs under load so far.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-04-10 10:15 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-22  8:38 [BUG][REGRESSION] i915 gpu hangs under load Martin Kepplinger
2017-03-22 10:36 ` [Intel-gfx] " Jani Nikula
2017-04-02 11:50   ` Thorsten Leemhuis
2017-04-02 12:13     ` Martin Kepplinger
2017-04-03 15:09       ` Jani Nikula
2017-04-06 23:23         ` [PATCH 0/5] " Andrea Arcangeli
2017-04-06 23:23           ` [PATCH 1/5] i915: avoid kernel hang caused by synchronize rcu struct_mutex deadlock Andrea Arcangeli
2017-04-07  9:05             ` [Intel-gfx] " Joonas Lahtinen
2017-04-06 23:23           ` [PATCH 2/5] i915: flush gem obj freeing workqueues to add accuracy to the i915 shrinker Andrea Arcangeli
2017-04-07 10:02             ` Chris Wilson
2017-04-07 13:06               ` Andrea Arcangeli
2017-04-07 15:30                 ` Chris Wilson
2017-04-07 16:48                   ` Andrea Arcangeli
2017-04-10  9:39                     ` Chris Wilson
2017-04-06 23:23           ` [PATCH 3/5] i915: initialize the free_list of the fencing atomic_helper Andrea Arcangeli
2017-04-07 10:35             ` Chris Wilson
2017-04-06 23:23           ` [PATCH 4/5] i915: schedule while freeing the lists of gem objects Andrea Arcangeli
2017-04-06 23:23           ` [PATCH 5/5] i915: fence workqueue optimization Andrea Arcangeli
2017-04-07  9:58             ` Chris Wilson
2017-04-07 13:13               ` Andrea Arcangeli
2017-04-10 10:15           ` [PATCH 0/5] Re: [Intel-gfx] [BUG][REGRESSION] i915 gpu hangs under load Martin Kepplinger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).