* [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
@ 2014-08-22  3:11 Ben Widawsky
  2014-08-22  3:11 ` [PATCH 01/68] drm/i915: Split up do_switch Ben Widawsky
                   ` (72 more replies)
  0 siblings, 73 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Anthony Bernecky, Ben Widawsky, mesa-dev

The primary goal of these patches is to introduce what I've started
calling "prelocations" on Broadwell. A prelocation is like a
relocation, except not. When a GPU client specifies a prelocation, it is
instructing the kernel where in the GPU address space the buffer should
be mapped. The mechanism works very similarly to a relocation, except it
uses the execbuffer object to obtain the offset, and binds if needed. If
a GPU client uses only prelocations, the relocation process can be
entirely skipped. This sounds like a big win initially, but
realistically, with full PPGTT and a 48b address space, it's unlikely to
noticeably improve anything. Doing this work leaves the address space
allocation up to libc/malloc [1] instead of drm_mm, which I believe has
some upside due to the overhead of creating new VMAs. Not specific to
prelocations, dynamic page table allocations by themselves can save
measurable memory on systems running multiple GPU clients. As previously
mentioned, this kind of thing is needed for OCL 2.0 SVM. One other
advantage I've discussed with Ken... [2].
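
To make this concrete, here is a rough sketch of what a prelocation
could look like from the client side. This is illustration only: the
EXEC_OBJECT_PINNED name/value and the prelocate() helper are my
assumptions, not the final ABI; the kernel side is the soft_pin hook at
the end of the series.

  #include <stdint.h>
  #include <string.h>
  #include <drm/i915_drm.h>

  /* Assumed flag for illustration; the real name and value would come
   * from the soft_pin patch. */
  #ifndef EXEC_OBJECT_PINNED
  #define EXEC_OBJECT_PINNED (1 << 4)
  #endif

  /* Ask the kernel to bind 'handle' at the malloc'd CPU address 'va'
   * in the process' GPU address space. */
  static void prelocate(struct drm_i915_gem_exec_object2 *obj,
                        uint32_t handle, void *va)
  {
          memset(obj, 0, sizeof(*obj));
          obj->handle = handle;            /* BO wrapping 'va' via userptr */
          obj->offset = (uintptr_t)va;     /* requested GPU virtual address */
          obj->flags = EXEC_OBJECT_PINNED; /* offset is a binding request,
                                            * not a relocation result */
          /* relocation_count stays 0; if every object in the execbuf is
           * prelocated, relocation processing is skipped entirely. */
  }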

The difficult part of enabling this [on 64b platforms] is supporting the
48b address space. As mentioned in previous versions of this cover
letter, and in my blog post [3], it's not feasible to allocate page
tables for the entire 48b address space up front. Dynamic page table
allocation and teardown required a lot of plumbing and rework, and to
make the interfaces as neat as possible, I also had to put a good deal
of work into the GEN7 PPGTT code as well. The other really difficult
part is taking the malloc'd memory and turning it into GPU-usable pages.
Luckily, Chris already did that for me with userptr, so I simply reused
his work.
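
For reference, the userptr interface that turns CPU pages into a GEM
object is already in place; roughly, a client would do something like
the following (a minimal sketch, error handling elided; note that
userptr wants a page-aligned pointer and size, so plain malloc isn't
always enough):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <drm/i915_drm.h>

  /* Wrap an existing (page-aligned) CPU allocation in a GEM handle so
   * its pages can be mapped into the PPGTT. */
  static uint32_t wrap_user_pages(int fd, void *ptr, uint64_t size)
  {
          struct drm_i915_gem_userptr arg = {
                  .user_ptr = (uintptr_t)ptr,
                  .user_size = size,
                  .flags = 0,
          };

          if (ioctl(fd, DRM_IOCTL_I915_GEM_USERPTR, &arg))
                  return 0;       /* 0 is never a valid GEM handle */

          return arg.handle;
  }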

The kernel patches are lightly tested at best. Previous iterations of
this series were more thoroughly tested, but enough has changed since
then that I would assume the code is unstable. If, miraculously, it is
almost stable, there are still a lot of cosmetic things to clean up, and
a performance optimization to reduce re-mapping of already mapped
objects. I started on a patch to do this but ran into too many stability
problems (see "Optimize PDP loads" from previous posts). It's likely
that memory leaks are introduced with the dynamic page tables; plugging
those would be nice. One could also implement the reaper I refer to in
the comments.

Accompanying the kernel prelocation support are the libdrm patches, an
intel-gpu-tools test, and a mesa patch. Some parts of the code are in
rough shape, and were meant for demonstration only. The userspace
components in particular were mostly meant as sample code. [4]

The series is fundamentally 5 parts, with some bleeding between parts
2-3 and 3-4.

1. [00-18] Provide fixes to make a stable branch for testing with full
PPGTT. I've previously posted these as a separate series. In the
meantime, many similar fixes have gone in, and some of these may be
dropped. So this is mostly here for completeness.

2. [19-42] Rework code to avoid as much future churn
as possible.  Nothing special here. Some of this is arguably #3.

3. [43-46] Make page table allocations dynamic. I tried to keep this
generic, but since the current code supported only very specific page
table depths, it's really mostly GEN7.

4. [47-67] GEN8 dynamic page table support with 64b page table support.
This was very hard to split up, and is definitely the majority of the
work.

5. [68] A basic SVM interface. I opted not to use the create2 IOCTL
since there are patches for that already, and I wanted to have something
that's as reusable as possible.

X. The rest: the workaround/libdrm/mesa/igt patches.

Kernel:
http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=prelocate
libdrm:
http://cgit.freedesktop.org/~bwidawsk/drm/log/?h=prelocate
mesa:
http://cgit.freedesktop.org/~bwidawsk/mesa/log/?h=prelocate
IGT:
http://cgit.freedesktop.org/~bwidawsk/intel-gpu-tools/log/?h=prelocate

Final thoughts:
* Due to time pressure, the ability to go back and test on GEN7 was lost.
The original patches I posted back in March did work fine on GEN7, but I
cannot speak to the quality now. That said, I did the work, so I figured
I may as well provide it. For the sake of progress, someone should
test/fix GEN7, or simply drop the GEN7 support.

* Broadwell is currently hanging with this patch series when I run piglit.
I have gone through plenty of software bugs, and this current hang is
baffling. Therefore I think it makes sense to either put dynamic page
table allocations behind a module parameter or a CONFIG_ option until
that's solved.

* Again on stability: there are a lot of extra flushes introduced as a
result of this series. I believe if we can figure out the cause of some
of these issues, we can remove some of the flushes.

* I haven't tested aliasing-PPGTT-only mode in a while. Someone should
do that.

* I'll bet 32b is broken.

* A lot of the issues I had were related to the complexities of dealing
with legacy contexts. It's possible, and I am hopeful, that with
execlists these issues go away, and so do the hangs.

* The patches have been rebased SOOOOO many times that they really need to
be reviewed closely to make sure they're bisectable. They were at one
time, but I doubt it's the case now.



[1] We have to use mmap in certain situations due to a hardware
limitation. I'm not sure how libc manages these things together. I hope
it's efficient...

[2] We can potentially always set the state base to be 0, and rely on HW
contexts to save/restore this information, thus eliminating this
non-pipelined state upload. It turns out this is not possible in all
cases because of hardware limitations, but it's a neat idea that someone
can possibly turn into something useful. It's also probably a premature
optimization given how many PIPE_CONTROL stalls we have.

[3] https://bwidawsk.net/blog/index.php/2014/07/future-ppgtt-part-4-dynamic-page-table-allocations-64-bit-address-space-gpu-mirroring-and-yeah-something-about-relocs-too/

[4] This was the best I could do on short notice. I won't be improving,
rebasing, or fixing these patches any longer, but someone is welcome to take
them over. Consider this my parting gift before I go on sabbatical [tomorrow].

--

Ben Widawsky (68):
  drm/i915: Split up do_switch
  drm/i915: Extract l3 remapping out of ctx switch
  drm/i915/ppgtt: Load address space after mi_set_context
  drm/i915: Fix another another use-after-free in do_switch
  drm/i915/ctx: Return earlier on failure
  drm/i915/error: vma error capture prettyify
  drm/i915/error: Do a better job of disambiguating VMAs
  drm/i915/error: Capture vmas instead of BOs
  drm/i915: Add some extra guards in evict_vm
  drm/i915: Make an uninterruptible evict
  drm/i915: More correct (slower) ppgtt cleanup
  drm/i915: Defer PPGTT cleanup
  drm/i915/bdw: Enable full PPGTT
  drm/i915: Get the error state over the wire (HACKish)
  drm/i915/gen8: Invalidate TLBs before PDP reload
  drm/i915: Remove false assertion in ppgtt_release
  Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw"
  drm/i915/trace: Fix offsets for 64b
  drm/i915: Wrap VMA binding
  drm/i915: Make pin global flags explicit
  drm/i915: Split out aliasing binds
  drm/i915: fix gtt_total_entries()
  drm/i915: Rename to GEN8_LEGACY_PDPES
  drm/i915: Split out verbose PPGTT dumping
  drm/i915: s/pd/pdpe, s/pt/pde
  drm/i915: rename map/unmap to dma_map/unmap
  drm/i915: Setup less PPGTT on failed pagedir
  drm/i915: clean up PPGTT init error path
  drm/i915: Un-hardcode number of page directories
  drm/i915: Make gen6_write_pdes gen6_map_page_tables
  drm/i915: Range clearing is PPGTT agnostic
  drm/i915: Page table helpers, and define renames
  drm/i915: construct page table abstractions
  drm/i915: Complete page table structures
  drm/i915: Create page table allocators
  drm/i915: Generalize GEN6 mapping
  drm/i915: Clean up pagetable DMA map & unmap
  drm/i915: Always dma map page table allocations
  drm/i915: Consolidate dma mappings
  drm/i915: Always dma map page directory allocations
  drm/i915: Track GEN6 page table usage
  drm/i915: Extract context switch skip logic
  drm/i915: Track page table reload need
  drm/i915: Initialize all contexts
  drm/i915: Finish gen6/7 dynamic page table allocation
  drm/i915/bdw: Use dynamic allocation idioms on free
  drm/i915/bdw: pagedirs rework allocation
  drm/i915/bdw: pagetable allocation rework
  drm/i915/bdw: Make the pdp switch a bit less hacky
  drm/i915: num_pd_pages/num_pd_entries isn't useful
  drm/i915: Extract PPGTT param from pagedir alloc
  drm/i915/bdw: Split out mappings
  drm/i915/bdw: begin bitmap tracking
  drm/i915/bdw: Dynamic page table allocations
  drm/i915/bdw: Make pdp allocation more dynamic
  drm/i915/bdw: Abstract PDP usage
  drm/i915/bdw: Add dynamic page trace events
  drm/i915/bdw: Add ppgtt info for dynamic pages
  drm/i915/bdw: implement alloc/teardown for 4lvl
  drm/i915/bdw: Add 4 level switching infrastructure
  drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT
  drm/i915: Plumb sg_iter through va allocation ->maps
  drm/i915: Introduce map and unmap for VMAs
  drm/i915: Depend exclusively on map and unmap_vma
  drm/i915: Expand error state's address width to 64b
  drm/i915/bdw: Flip the 48b switch
  drm/i915: Provide a soft_pin hook
  XXX: drm/i915: Unexplained workarounds

 drivers/gpu/drm/i915/i915_debugfs.c        |  114 +-
 drivers/gpu/drm/i915/i915_drv.h            |   61 +-
 drivers/gpu/drm/i915/i915_gem.c            |  231 +++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  276 ++++-
 drivers/gpu/drm/i915/i915_gem_evict.c      |   39 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   27 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 1838 +++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  379 +++++-
 drivers/gpu/drm/i915/i915_gem_stolen.c     |    2 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c    |    7 +-
 drivers/gpu/drm/i915/i915_gpu_error.c      |  171 ++-
 drivers/gpu/drm/i915/i915_reg.h            |    1 +
 drivers/gpu/drm/i915/i915_sysfs.c          |    2 +-
 drivers/gpu/drm/i915/i915_trace.h          |  156 ++-
 drivers/gpu/drm/i915/intel_pm.c            |   16 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +-
 include/uapi/drm/i915_drm.h                |    3 +-
 17 files changed, 2588 insertions(+), 737 deletions(-)

-- 
2.0.4


* [PATCH 01/68] drm/i915: Split up do_switch
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 02/68] drm/i915: Extract l3 remapping out of ctx switch Ben Widawsky
                   ` (71 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

There are two important reasons for this patch. It should make the
existing code a lot more readable. It also makes the next patch much
easier to understand, in my opinion. There are 2 main variables that
affect this function, giving 4 permutations:
ring: RCS vs !RCS
PPGTT: full or not

I didn't find extracting the full PPGTT usage to be very beneficial at
this point, but it may be in the future.

This was originally recommended by Daniel Vetter, and in this case, I
agree. There was no intentional behavioral change.

v2: Change the pin assertion to be GGTT only. This is more accurate.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 76 +++++++++++++++++++++------------
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 3b99390..16aebc6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -609,31 +609,57 @@ mi_set_context(struct intel_engine_cs *ring,
 	return ret;
 }
 
-static int do_switch(struct intel_engine_cs *ring,
-		     struct intel_context *to)
+static void do_switch_fini_common(struct intel_engine_cs *ring,
+				  struct intel_context *from,
+				  struct intel_context *to)
+{
+	if (likely(from))
+		i915_gem_context_unreference(from);
+	i915_gem_context_reference(to);
+	ring->last_context = to;
+}
+
+static int do_switch_xcs(struct intel_engine_cs *ring,
+			 struct intel_context *from,
+			 struct intel_context *to)
+{
+	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
+	int ret;
+
+	BUG_ON(from && from->legacy_hw_ctx.rcs_state != NULL);
+
+	if (USES_FULL_PPGTT(dev)) {
+		ret = ppgtt->switch_mm(ppgtt, ring, false);
+		if (ret)
+			return ret;
+	}
+
+	if (from)
+		do_switch_fini_common(ring, from, to);
+
+	return 0;
+}
+
+static int do_switch_rcs(struct intel_engine_cs *ring,
+			 struct intel_context *from,
+			 struct intel_context *to)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
-	struct intel_context *from = ring->last_context;
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
 	int ret, i;
 
-	if (from != NULL && ring == &dev_priv->ring[RCS]) {
+	if (from != NULL) {
 		BUG_ON(from->legacy_hw_ctx.rcs_state == NULL);
 		BUG_ON(!i915_gem_obj_is_pinned(from->legacy_hw_ctx.rcs_state));
 	}
 
-	if (from == to && !to->remap_slice)
-		return 0;
-
 	/* Trying to pin first makes error handling easier. */
-	if (ring == &dev_priv->ring[RCS]) {
-		ret = i915_gem_obj_ggtt_pin(to->legacy_hw_ctx.rcs_state,
-					    get_context_alignment(ring->dev), 0);
-		if (ret)
-			return ret;
-	}
+	ret = i915_gem_obj_ggtt_pin(to->legacy_hw_ctx.rcs_state,
+				    get_context_alignment(ring->dev), 0);
+	if (ret)
+		return ret;
 
 	/*
 	 * Pin can switch back to the default context if we end up calling into
@@ -648,12 +674,6 @@ static int do_switch(struct intel_engine_cs *ring,
 			goto unpin_out;
 	}
 
-	if (ring != &dev_priv->ring[RCS]) {
-		if (from)
-			i915_gem_context_unreference(from);
-		goto done;
-	}
-
 	/*
 	 * Clear this page out of any CPU caches for coherent swap-in/out. Note
 	 * that thanks to write = false in this call and us not setting any gpu
@@ -712,15 +732,11 @@ static int do_switch(struct intel_engine_cs *ring,
 
 		/* obj is kept alive until the next request by its active ref */
 		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
-		i915_gem_context_unreference(from);
 	}
 
 	uninitialized = !to->legacy_hw_ctx.initialized && from == NULL;
 	to->legacy_hw_ctx.initialized = true;
-
-done:
-	i915_gem_context_reference(to);
-	ring->last_context = to;
+	do_switch_fini_common(ring, from, to);
 
 	if (uninitialized) {
 		ret = i915_gem_render_state_init(ring);
@@ -731,8 +747,7 @@ done:
 	return 0;
 
 unpin_out:
-	if (ring->id == RCS)
-		i915_gem_object_ggtt_unpin(to->legacy_hw_ctx.rcs_state);
+	i915_gem_object_ggtt_unpin(to->legacy_hw_ctx.rcs_state);
 	return ret;
 }
 
@@ -750,6 +765,7 @@ int i915_switch_context(struct intel_engine_cs *ring,
 			struct intel_context *to)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct intel_context *from = ring->last_context;
 
 	WARN_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
 
@@ -763,7 +779,13 @@ int i915_switch_context(struct intel_engine_cs *ring,
 		return 0;
 	}
 
-	return do_switch(ring, to);
+	if (from == to && !to->remap_slice)
+		return 0;
+
+	if (ring->id == RCS)
+		return do_switch_rcs(ring, from, to);
+	else
+		return do_switch_xcs(ring, from, to);
 }
 
 static bool hw_context_enabled(struct drm_device *dev)
-- 
2.0.4


* [PATCH 02/68] drm/i915: Extract l3 remapping out of ctx switch
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
  2014-08-22  3:11 ` [PATCH 01/68] drm/i915: Split up do_switch Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 03/68] drm/i915/ppgtt: Load address space after mi_set_context Ben Widawsky
                   ` (70 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This is just a cosmetic change to try to put do_switch_rcs on a diet. As
it stands, the function is quite complex and error prone.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 16aebc6..5a46ae3 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -640,6 +640,24 @@ static int do_switch_xcs(struct intel_engine_cs *ring,
 	return 0;
 }
 
+static void remap_l3(struct intel_engine_cs *ring,
+		     struct intel_context *ctx)
+{
+	int ret, i;
+
+	for (i = 0; i < MAX_L3_SLICES; i++) {
+		if (!(ctx->remap_slice & (1<<i)))
+			continue;
+
+		ret = i915_gem_l3_remap(ring, i);
+		/* If it failed, try again next round */
+		if (ret)
+			DRM_DEBUG_DRIVER("L3 remapping failed\n");
+		else
+			ctx->remap_slice &= ~(1<<i);
+	}
+}
+
 static int do_switch_rcs(struct intel_engine_cs *ring,
 			 struct intel_context *from,
 			 struct intel_context *to)
@@ -648,7 +666,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
-	int ret, i;
+	int ret;
 
 	if (from != NULL) {
 		BUG_ON(from->legacy_hw_ctx.rcs_state == NULL);
@@ -699,17 +717,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	if (ret)
 		goto unpin_out;
 
-	for (i = 0; i < MAX_L3_SLICES; i++) {
-		if (!(to->remap_slice & (1<<i)))
-			continue;
-
-		ret = i915_gem_l3_remap(ring, i);
-		/* If it failed, try again next round */
-		if (ret)
-			DRM_DEBUG_DRIVER("L3 remapping failed\n");
-		else
-			to->remap_slice &= ~(1<<i);
-	}
+	remap_l3(ring, to);
 
 	/* The backing object for the context is done after switching to the
 	 * *next* context. Therefore we cannot retire the previous context until
-- 
2.0.4


* [PATCH 03/68] drm/i915/ppgtt: Load address space after mi_set_context
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
  2014-08-22  3:11 ` [PATCH 01/68] drm/i915: Split up do_switch Ben Widawsky
  2014-08-22  3:11 ` [PATCH 02/68] drm/i915: Extract l3 remapping out of ctx switch Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 04/68] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
                   ` (69 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The simple explanation is that the docs say to do this for GEN8. Perhaps
we want to do this for GEN7 too; I am not certain.

PDPs are saved and restored with the context. Contexts (without
execlists) only exist on the render ring. The docs say that PDPs are not
power context saved/restored. I've learned that this actually means
something which SW doesn't care about, so pretend the statement doesn't
exist. For non-RCS rings, nothing changes.

All this patch now does is change the ordering of LRI vs MI_SET_CONTEXT
for the initialization of the context. I do this because the docs say to
do it, and frankly, I cannot see why it is necessary. I've thought about
it a lot, and tried, without success, to get a reason from design. The
answer I got more or less says, "gen7 is different than gen8." I've
given up, and am adding this little bit of code to bring it in sync with
the docs.

v2: Completely rewritten commit message that addresses the requests
Ville made for v1
Only load PDPs for initial context load (Ville)

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 5a46ae3..c9aa3e6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -666,6 +666,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
+	bool needs_pd_load = (INTEL_INFO(ring->dev)->gen < 8) && USES_FULL_PPGTT(ring->dev);
 	int ret;
 
 	if (from != NULL) {
@@ -686,7 +687,10 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	 */
 	from = ring->last_context;
 
-	if (USES_FULL_PPGTT(ring->dev)) {
+	if (needs_pd_load) {
+		/* Older GENs still want the load first, "PP_DCLV followed by
+		 * PP_DIR_BASE register through Load Register Immediate commands
+		 * in Ring Buffer before submitting a context."*/
 		ret = ppgtt->switch_mm(ppgtt, ring, false);
 		if (ret)
 			goto unpin_out;
@@ -710,13 +714,34 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		vma->bind_vma(vma, to->legacy_hw_ctx.rcs_state->cache_level, GLOBAL_BIND);
 	}
 
-	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to))
+	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
 		hw_flags |= MI_RESTORE_INHIBIT;
+		needs_pd_load = USES_FULL_PPGTT(ring->dev) && IS_GEN8(ring->dev);
+	}
 
 	ret = mi_set_context(ring, to, hw_flags);
 	if (ret)
 		goto unpin_out;
 
+	/* GEN8 does *not* require an explicit reload if the PDPs have been
+	 * setup, and we do not wish to move them.
+	 *
+	 * XXX: If we implemented page directory eviction code, this
+	 * optimization needs to be removed.
+	 */
+	if (needs_pd_load) {
+		ret = ppgtt->switch_mm(ppgtt, ring, false);
+		/* The hardware context switch is emitted, but we haven't
+		 * actually changed the state - so it's probably safe to bail
+		 * here. Still, let the user know something dangerous has
+		 * happened.
+		 */
+		if (ret) {
+			DRM_ERROR("Failed to change address space on context switch\n");
+			goto unpin_out;
+		}
+	}
+
 	remap_l3(ring, to);
 
 	/* The backing object for the context is done after switching to the
-- 
2.0.4


* [PATCH 04/68] drm/i915: Fix another another use-after-free in do_switch
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (2 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 03/68] drm/i915/ppgtt: Load address space after mi_set_context Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 05/68] drm/i915/ctx: Return earlier on failure Ben Widawsky
                   ` (68 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

See the following for many more details.

commit acc240d41ea1ab9c488a79219fb313b5b46265ae
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Dec 5 15:42:34 2013 +0100

    drm/i915: Fix use-after-free in do_switch

In this case, the issue is only for full PPGTT:

do_switch
  context_unref
    ppgtt_release
      i915_gpu_idle
        switch_to_default
          from changes to default context

This could be backported to before the do_switch cleanup I did earlier
in this series. However, it's much cleaner and more obvious as a patch
on top, so I'd really like to do this as a post-cleanup patch.

v2: There was a bug in the original patch where the ring->last_context
was set too early. I am not sure how this wasn't being hit when I sent
this previously.

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index c9aa3e6..0ce8fc9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -609,14 +609,18 @@ mi_set_context(struct intel_engine_cs *ring,
 	return ret;
 }
 
-static void do_switch_fini_common(struct intel_engine_cs *ring,
-				  struct intel_context *from,
-				  struct intel_context *to)
+static struct intel_context *do_switch_fini_common(struct intel_engine_cs *ring,
+						   struct intel_context *from,
+						   struct intel_context *to)
 {
+	struct intel_context *ret;
 	if (likely(from))
 		i915_gem_context_unreference(from);
 	i915_gem_context_reference(to);
+	ret = ring->last_context;
 	ring->last_context = to;
+
+	return ret;
 }
 
 static int do_switch_xcs(struct intel_engine_cs *ring,
@@ -762,14 +766,20 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		 */
 		from->legacy_hw_ctx.rcs_state->dirty = 1;
 		BUG_ON(from->legacy_hw_ctx.rcs_state->ring != ring);
-
-		/* obj is kept alive until the next request by its active ref */
-		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
 	}
 
 	uninitialized = !to->legacy_hw_ctx.initialized && from == NULL;
 	to->legacy_hw_ctx.initialized = true;
-	do_switch_fini_common(ring, from, to);
+	/* From may have disappeared again after the context unref */
+	from = do_switch_fini_common(ring, from, to);
+	if (from != NULL) {
+		/* obj is kept alive until the next request by its active ref.
+		 * XXX: The context needs to be unpinned last, or else we risk
+		 * hitting evict/idle on the ppgtt free, which will call back
+		 * into this, and we'll get a double unpin on this context
+		 */
+		i915_gem_object_ggtt_unpin(from->legacy_hw_ctx.rcs_state);
+	}
 
 	if (uninitialized) {
 		ret = i915_gem_render_state_init(ring);
-- 
2.0.4


* [PATCH 05/68] drm/i915/ctx: Return earlier on failure
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (3 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 04/68] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 06/68] drm/i915/error: vma error capture prettyify Ben Widawsky
                   ` (67 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

As was correctly debugged here:
commit acc240d41ea1ab9c488a79219fb313b5b46265ae
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Dec 5 15:42:34 2013 +0100

    drm/i915: Fix use-after-free in do_switch

It then becomes apparent that the default context cannot be the context
being switched to during a context switch, because it is always bound.
It follows that if ring->last_context (from) has changed after the
bind_to_gtt, it will always be the default context - this is commented
in the code block.

This assertion will help catch issues with our logic sooner, rather than
letting the system move along (which it could do for some time).

I really want this to be a BUG(), but I also want the patch to get
merged. I think the fact that none of the ERRNOs make any sense at all
is just more evidence that this shouldn't be a WARN.

//Cc: Ian Lister (don't have current email address)
Cc: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 0ce8fc9..34bf177 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -691,6 +691,15 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	 */
 	from = ring->last_context;
 
+	/* The only context which 'from' can be, if it was changed, is the default
+	 * context. The default context cannot end up in evict everything (as
+	 * commented above) because it is always pinned.
+	 */
+	if (WARN_ON(from == to)) {
+		ret = -EPERM;
+		goto unpin_out;
+	}
+
 	if (needs_pd_load) {
 		/* Older GENs still want the load first, "PP_DCLV followed by
 		 * PP_DIR_BASE register through Load Register Immediate commands
-- 
2.0.4


* [PATCH 06/68] drm/i915/error: vma error capture prettyify
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (4 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 05/68] drm/i915/ctx: Return earlier on failure Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 07/68] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
                   ` (66 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Rename some variables, and clean up the code a bit to make things
clearer in our error capture.

There isn't an intentional functional change here.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 55 ++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index eab41f9..470cc97 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1032,63 +1032,72 @@ static void i915_gem_record_rings(struct drm_device *dev,
 	}
 }
 
-/* FIXME: Since pin count/bound list is global, we duplicate what we capture per
+/**
+ * i915_gem_capture_vm() - Capture a VM's error state.
+ * @error:	The main error structure
+ * @vm:		The address space we're capturing.
+ * @vm_ndx:	Which vm within the buffer array
+ *
+ * FIXME: Since pin count/bound list is global, we duplicate what we capture per
  * VM.
  */
 static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 				struct drm_i915_error_state *error,
 				struct i915_address_space *vm,
-				const int ndx)
+				const int vm_ndx)
 {
 	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
 	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
-	int i;
+	int active_vma_count = 0;
 
-	i = 0;
 	list_for_each_entry(vma, &vm->active_list, mm_list)
-		i++;
-	error->active_bo_count[ndx] = i;
+		active_vma_count++;
+
+	error->active_bo_count[vm_ndx] = active_vma_count;
+
 	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list)
 		if (i915_gem_obj_is_pinned(obj))
-			i++;
-	error->pinned_bo_count[ndx] = i - error->active_bo_count[ndx];
+			active_vma_count++;
+
+	/* XXX: this is an incorrect measurement of pinned BOs */
+	error->pinned_bo_count[vm_ndx] = active_vma_count - error->active_bo_count[vm_ndx];
 
-	if (i) {
-		active_bo = kcalloc(i, sizeof(*active_bo), GFP_ATOMIC);
+	if (active_vma_count) {
+		active_bo = kcalloc(active_vma_count, sizeof(*active_bo), GFP_ATOMIC);
 		if (active_bo)
-			pinned_bo = active_bo + error->active_bo_count[ndx];
+			pinned_bo = active_bo + error->active_bo_count[vm_ndx];
 	}
 
 	if (active_bo)
-		error->active_bo_count[ndx] =
+		error->active_bo_count[vm_ndx] =
 			capture_active_bo(active_bo,
-					  error->active_bo_count[ndx],
+					  error->active_bo_count[vm_ndx],
 					  &vm->active_list);
 
 	if (pinned_bo)
-		error->pinned_bo_count[ndx] =
+		error->pinned_bo_count[vm_ndx] =
 			capture_pinned_bo(pinned_bo,
-					  error->pinned_bo_count[ndx],
+					  error->pinned_bo_count[vm_ndx],
 					  &dev_priv->mm.bound_list);
-	error->active_bo[ndx] = active_bo;
-	error->pinned_bo[ndx] = pinned_bo;
+	error->active_bo[vm_ndx] = active_bo;
+	error->pinned_bo[vm_ndx] = pinned_bo;
 }
 
 static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
 				     struct drm_i915_error_state *error)
 {
 	struct i915_address_space *vm;
-	int cnt = 0, i = 0;
+	int vm_count = 0, i = 0;
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-		cnt++;
+		vm_count++;
 
-	error->active_bo = kcalloc(cnt, sizeof(*error->active_bo), GFP_ATOMIC);
-	error->pinned_bo = kcalloc(cnt, sizeof(*error->pinned_bo), GFP_ATOMIC);
-	error->active_bo_count = kcalloc(cnt, sizeof(*error->active_bo_count),
+	error->active_bo = kcalloc(vm_count, sizeof(*error->active_bo), GFP_ATOMIC);
+	error->pinned_bo = kcalloc(vm_count, sizeof(*error->pinned_bo), GFP_ATOMIC);
+	error->active_bo_count = kcalloc(vm_count, sizeof(*error->active_bo_count),
 					 GFP_ATOMIC);
-	error->pinned_bo_count = kcalloc(cnt, sizeof(*error->pinned_bo_count),
+	error->pinned_bo_count = kcalloc(vm_count, sizeof(*error->pinned_bo_count),
 					 GFP_ATOMIC);
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-- 
2.0.4


* [PATCH 07/68] drm/i915/error: Do a better job of disambiguating VMAs
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (5 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 06/68] drm/i915/error: vma error capture prettyify Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 08/68] drm/i915/error: Capture vmas instead of BOs Ben Widawsky
                   ` (65 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Some of the original PPGTT patches in this area were left unmerged, and
this caused a lot of confusion in our error capture with regard to which
vm/obj we want to capture. There have been at least a couple of patches,
from Chris and myself, trying to fix this up; so here is another shot.
Nobody running without full PPGTT is affected by this, and that is
probably why nobody has bothered to fix it yet.

Instead of using any of the global lists to find the VMAs we want to
capture, we use the union of the active and the inactive lists in the
VM. This allows us to replace our capture_bo with capture_vma, and know
that all the VMAs we want to capture are valid.

I could have probably figured out a way to reuse mm_list. As we've had
bugs here before in the shrinker, I think the best way forward is to get
it working, and then optimize it later.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c   |  1 +
 drivers/gpu/drm/i915/i915_gem_gtt.h   |  2 ++
 drivers/gpu/drm/i915/i915_gpu_error.c | 39 ++++++++++++++++++++++-------------
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index b4b7cfd..f2ece5f 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2105,6 +2105,7 @@ static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&vma->vma_link);
+	INIT_LIST_HEAD(&vma->pin_capture_link);
 	INIT_LIST_HEAD(&vma->mm_list);
 	INIT_LIST_HEAD(&vma->exec_list);
 	vma->vm = vm;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 666c938..379cf16ea 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -126,6 +126,8 @@ struct i915_vma {
 
 	struct list_head vma_link; /* Link in the object's VMA list */
 
+	struct list_head pin_capture_link; /* Link in the error capture */
+
 	/** This vma's place in the batchbuffer or on the eviction list */
 	struct list_head exec_list;
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 470cc97..5827915 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -685,14 +685,14 @@ static u32 capture_active_bo(struct drm_i915_error_buffer *err,
 static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
 			     int count, struct list_head *head)
 {
-	struct drm_i915_gem_object *obj;
+	struct i915_vma *vma;
 	int i = 0;
 
-	list_for_each_entry(obj, head, global_list) {
-		if (!i915_gem_obj_is_pinned(obj))
+	list_for_each_entry(vma, head, pin_capture_link) {
+		if (!i915_gem_obj_is_pinned(vma->obj))
 			continue;
 
-		capture_bo(err++, obj);
+		capture_bo(err++, vma->obj);
 		if (++i == count)
 			break;
 	}
@@ -1047,21 +1047,32 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 				const int vm_ndx)
 {
 	struct drm_i915_error_buffer *active_bo = NULL, *pinned_bo = NULL;
-	struct drm_i915_gem_object *obj;
 	struct i915_vma *vma;
 	int active_vma_count = 0;
+	int vma_pin_count = 0;
+	LIST_HEAD(pinned_vma);
 
-	list_for_each_entry(vma, &vm->active_list, mm_list)
+	list_for_each_entry(vma, &vm->active_list, mm_list) {
 		active_vma_count++;
+		if (vma->pin_count) {
+			vma_pin_count++;
+			list_move_tail(&vma->pin_capture_link, &pinned_vma);
+		}
+	}
 
-	error->active_bo_count[vm_ndx] = active_vma_count;
-
-	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list)
-		if (i915_gem_obj_is_pinned(obj))
-			active_vma_count++;
+	list_for_each_entry(vma, &vm->inactive_list, mm_list) {
+		/* Certain objects may be on the inactive list, but pinned, when
+		 * in the global GGTT. */
+		if (WARN_ON(!i915_is_ggtt(vm) &&
+			    vma->pin_count &&
+			    !(vma->exec_entry->flags & (1<<31)))) { /* FIXME: need the actual flag */
+			vma_pin_count++;
+			list_move_tail(&vma->pin_capture_link, &pinned_vma);
+		}
+	}
 
-	/* XXX: this is an incorrect measurement of pinned BOs */
-	error->pinned_bo_count[vm_ndx] = active_vma_count - error->active_bo_count[vm_ndx];
+	error->active_bo_count[vm_ndx] = active_vma_count;
+	error->pinned_bo_count[vm_ndx] = vma_pin_count;
 
 	if (active_vma_count) {
 		active_bo = kcalloc(active_vma_count, sizeof(*active_bo), GFP_ATOMIC);
@@ -1079,7 +1090,7 @@ static void i915_gem_capture_vm(struct drm_i915_private *dev_priv,
 		error->pinned_bo_count[vm_ndx] =
 			capture_pinned_bo(pinned_bo,
 					  error->pinned_bo_count[vm_ndx],
-					  &dev_priv->mm.bound_list);
+					  &pinned_vma);
 	error->active_bo[vm_ndx] = active_bo;
 	error->pinned_bo[vm_ndx] = pinned_bo;
 }
-- 
2.0.4


* [PATCH 08/68] drm/i915/error: Capture vmas instead of BOs
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (6 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 07/68] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 09/68] drm/i915: Add some extra guards in evict_vm Ben Widawsky
                   ` (64 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

To follow up on the last patch, we can now capture the VMAs instead of
the BOs. The hope is that we get more accurate error capture while
debugging PPGTT.

Note that this does not impact the previous argument about whether to
capture all VMAs, or just the guilty VMA. This merely allows the code to
do whatever we choose later.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h       |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c | 53 +++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index fab97bc..4f217f3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -317,6 +317,7 @@ struct drm_i915_error_state {
 	char error_msg[128];
 	u32 reset_count;
 	u32 suspend_count;
+	u32 vm_count;
 
 	/* Generic register state */
 	u32 eir;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 5827915..82508dd 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -393,15 +393,19 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 		i915_ring_error_state(m, dev, &error->ring[i]);
 	}
 
-	if (error->active_bo)
-		print_error_buffers(m, "Active",
-				    error->active_bo[0],
-				    error->active_bo_count[0]);
+	for (i = 0; i < error->vm_count; i++) {
+		if (error->active_bo[i])
+			print_error_buffers(m, "Active",
+					    error->active_bo[i],
+					    error->active_bo_count[i]);
+	}
 
-	if (error->pinned_bo)
-		print_error_buffers(m, "Pinned",
-				    error->pinned_bo[0],
-				    error->pinned_bo_count[0]);
+	for (i = 0; i < error->vm_count; i++) {
+		if (error->pinned_bo[i])
+			print_error_buffers(m, "Pinned",
+					    error->pinned_bo[i],
+					    error->pinned_bo_count[i]);
+	}
 
 	for (i = 0; i < ARRAY_SIZE(error->ring); i++) {
 		obj = error->ring[i].batchbuffer;
@@ -643,22 +647,23 @@ unwind:
 	i915_error_object_create_sized((dev_priv), (src), &(dev_priv)->gtt.base, \
 				       (src)->base.size>>PAGE_SHIFT)
 
-static void capture_bo(struct drm_i915_error_buffer *err,
-		       struct drm_i915_gem_object *obj)
+static void capture_vma(struct drm_i915_error_buffer *err, struct i915_vma *vma)
 {
+	struct drm_i915_gem_object *obj = vma->obj;
+
 	err->size = obj->base.size;
 	err->name = obj->base.name;
 	err->rseqno = obj->last_read_seqno;
 	err->wseqno = obj->last_write_seqno;
-	err->gtt_offset = i915_gem_obj_ggtt_offset(obj);
+	err->gtt_offset = vma->node.start;
 	err->read_domains = obj->base.read_domains;
 	err->write_domain = obj->base.write_domain;
 	err->fence_reg = obj->fence_reg;
-	err->pinned = 0;
-	if (i915_gem_obj_is_pinned(obj))
-		err->pinned = 1;
-	if (obj->user_pin_count > 0)
+	if (obj->user_pin_count > 0) {
+		WARN_ON(i915_is_ggtt(vma->vm));
 		err->pinned = -1;
+	} else
+		err->pinned = !!vma->pin_count;
 	err->tiling = obj->tiling_mode;
 	err->dirty = obj->dirty;
 	err->purgeable = obj->madv != I915_MADV_WILLNEED;
@@ -674,7 +679,7 @@ static u32 capture_active_bo(struct drm_i915_error_buffer *err,
 	int i = 0;
 
 	list_for_each_entry(vma, head, mm_list) {
-		capture_bo(err++, vma->obj);
+		capture_vma(err++, vma);
 		if (++i == count)
 			break;
 	}
@@ -689,10 +694,10 @@ static u32 capture_pinned_bo(struct drm_i915_error_buffer *err,
 	int i = 0;
 
 	list_for_each_entry(vma, head, pin_capture_link) {
-		if (!i915_gem_obj_is_pinned(vma->obj))
+		if (!vma->pin_count)
 			continue;
 
-		capture_bo(err++, vma->obj);
+		capture_vma(err++, vma);
 		if (++i == count)
 			break;
 	}
@@ -1099,16 +1104,16 @@ static void i915_gem_capture_buffers(struct drm_i915_private *dev_priv,
 				     struct drm_i915_error_state *error)
 {
 	struct i915_address_space *vm;
-	int vm_count = 0, i = 0;
+	int i = 0;
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-		vm_count++;
+		error->vm_count++;
 
-	error->active_bo = kcalloc(vm_count, sizeof(*error->active_bo), GFP_ATOMIC);
-	error->pinned_bo = kcalloc(vm_count, sizeof(*error->pinned_bo), GFP_ATOMIC);
-	error->active_bo_count = kcalloc(vm_count, sizeof(*error->active_bo_count),
+	error->active_bo = kcalloc(error->vm_count, sizeof(*error->active_bo), GFP_ATOMIC);
+	error->pinned_bo = kcalloc(error->vm_count, sizeof(*error->pinned_bo), GFP_ATOMIC);
+	error->active_bo_count = kcalloc(error->vm_count, sizeof(*error->active_bo_count),
 					 GFP_ATOMIC);
-	error->pinned_bo_count = kcalloc(vm_count, sizeof(*error->pinned_bo_count),
+	error->pinned_bo_count = kcalloc(error->vm_count, sizeof(*error->pinned_bo_count),
 					 GFP_ATOMIC);
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-- 
2.0.4


* [PATCH 09/68] drm/i915: Add some extra guards in evict_vm
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (7 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 08/68] drm/i915/error: Capture vmas instead of BOs Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 10/68] drm/i915: Make an uninterruptible evict Ben Widawsky
                   ` (63 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_evict.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index bbf4b12..38297d3 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -214,6 +214,7 @@ int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
 	struct i915_vma *vma, *next;
 	int ret;
 
+	BUG_ON(!mutex_is_locked(&vm->dev->struct_mutex));
 	trace_i915_gem_evict_vm(vm);
 
 	if (do_idle) {
@@ -222,11 +223,15 @@ int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
 			return ret;
 
 		i915_gem_retire_requests(vm->dev);
+
+		WARN_ON(!list_empty(&vm->active_list));
 	}
 
-	list_for_each_entry_safe(vma, next, &vm->inactive_list, mm_list)
+	list_for_each_entry_safe(vma, next, &vm->inactive_list, mm_list) {
+		WARN_ON(!i915_is_ggtt(vm) && vma->pin_count);
 		if (vma->pin_count == 0)
 			WARN_ON(i915_vma_unbind(vma));
+	}
 
 	return 0;
 }
-- 
2.0.4


* [PATCH 10/68] drm/i915: Make an uninterruptible evict
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (8 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 09/68] drm/i915: Add some extra guards in evict_vm Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 11/68] drm/i915: More correct (slower) ppgtt cleanup Ben Widawsky
                   ` (62 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

There are no users of this yet, but the idea is presented and split out
separately to make bugs easier to find.

Also, while here, return -ERESTARTSYS to the caller, in case they want
to do something with it.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            |  2 +-
 drivers/gpu/drm/i915/i915_gem_context.c    |  5 ++---
 drivers/gpu/drm/i915/i915_gem_evict.c      | 32 ++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  2 +-
 4 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4f217f3..c5c8753 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2553,7 +2553,7 @@ int __must_check i915_gem_evict_something(struct drm_device *dev,
 					  unsigned long start,
 					  unsigned long end,
 					  unsigned flags);
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle);
+int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle, bool interruptible);
 int i915_gem_evict_everything(struct drm_device *dev);
 
 /* belongs in i915_gem_gtt.h */
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 34bf177..4d47bcb 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -121,11 +121,10 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 			if (WARN_ON(list_empty(&vma->vma_link) ||
 				    list_is_singular(&vma->vma_link)))
 				break;
-
-		i915_gem_evict_vm(&ppgtt->base, true);
+		i915_gem_evict_vm(&ppgtt->base, true, true);
 	} else {
 		i915_gem_retire_requests(dev);
-		i915_gem_evict_vm(&ppgtt->base, false);
+		i915_gem_evict_vm(&ppgtt->base, false, true);
 	}
 
 	ppgtt->base.cleanup(&ppgtt->base);
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
index 38297d3..ac8d90f 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -197,8 +197,9 @@ found:
 /**
  * i915_gem_evict_vm - Evict all idle vmas from a vm
  *
- * @vm: Address space to cleanse
- * @do_idle: Boolean directing whether to idle first.
+ * @vm:			Address space to cleanse
+ * @do_idle:		Boolean directing whether to idle first.
+ * @interruptible:	How to wait
  *
  * This function evicts all idles vmas from a vm. If all unpinned vmas should be
  * evicted the @do_idle needs to be set to true.
@@ -209,18 +210,24 @@ found:
  * To clarify: This is for freeing up virtual address space, not for freeing
  * memory in e.g. the shrinker.
  */
-int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
+int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle, bool interruptible)
 {
+	struct drm_i915_private *dev_priv = to_i915(vm->dev);
+
 	struct i915_vma *vma, *next;
+	bool was_intr = dev_priv->mm.interruptible;
 	int ret;
 
 	BUG_ON(!mutex_is_locked(&vm->dev->struct_mutex));
 	trace_i915_gem_evict_vm(vm);
 
+	if (!interruptible)
+		dev_priv->mm.interruptible = false;
+
 	if (do_idle) {
 		ret = i915_gpu_idle(vm->dev);
 		if (ret)
-			return ret;
+			goto out;
 
 		i915_gem_retire_requests(vm->dev);
 
@@ -229,11 +236,20 @@ int i915_gem_evict_vm(struct i915_address_space *vm, bool do_idle)
 
 	list_for_each_entry_safe(vma, next, &vm->inactive_list, mm_list) {
 		WARN_ON(!i915_is_ggtt(vm) && vma->pin_count);
-		if (vma->pin_count == 0)
-			WARN_ON(i915_vma_unbind(vma));
+		if (vma->pin_count == 0) {
+			ret = i915_vma_unbind(vma);
+			if (ret == -ERESTARTSYS) {
+				BUG_ON(!interruptible);
+				goto out;
+			} else if (ret)
+				WARN(1, "Failed to unbind vma %d\n", ret);
+		}
 	}
+	ret = 0;
 
-	return 0;
+out:
+	dev_priv->mm.interruptible = was_intr;
+	return ret;
 }
 
 /**
@@ -276,7 +292,7 @@ i915_gem_evict_everything(struct drm_device *dev)
 
 	/* Having flushed everything, unbind() should never raise an error */
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link)
-		WARN_ON(i915_gem_evict_vm(vm, false));
+		WARN_ON(i915_gem_evict_vm(vm, false, true));
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 60998fc..1420aeb 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -722,7 +722,7 @@ err:
 		list_for_each_entry(vma, vmas, exec_list)
 			i915_gem_execbuffer_unreserve_vma(vma);
 
-		ret = i915_gem_evict_vm(vm, true);
+		ret = i915_gem_evict_vm(vm, true, true);
 		if (ret)
 			return ret;
 	} while (1);
-- 
2.0.4


* [PATCH 11/68] drm/i915: More correct (slower) ppgtt cleanup
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (9 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 10/68] drm/i915: Make an uninterruptible evict Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 12/68] drm/i915: Defer PPGTT cleanup Ben Widawsky
                   ` (61 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

If a VM still has objects which are bound (exactly: have a node
reserved in the drm_mm), and we are in the middle of a reset, we have no
hope of the standard methods fixing the situation (ring idle won't
work). We must therefore let the reset handler take its course, and
then we can resume tearing down the VM.

This logic very much duplicates^Wresembles the logic in our wait for
error code. I've decided to leave it open coded because I expect this
bit of code to require tweaks and changes over time.

Interruption via signal causes a really similar problem.

This should obviate the need for the yet unmerged patch from Chris
(and an identical patch from me, which was first!!):
drm/i915: Prevent signals from interrupting close()

I have a followup patch to implement deferred free, before you complain.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 51 +++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 4d47bcb..e48a3bb 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -101,6 +101,32 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	struct drm_device *dev = ppgtt->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct i915_address_space *vm = &ppgtt->base;
+	bool do_idle = false;
+	int ret;
+
+	/* If we get here while in reset, we need to let the reset handler run
+	 * first, or else our VM teardown isn't going to go smoothly. There are
+	 * a could of options at this point, but letting the reset handler do
+	 * it's thing is the most desirable. The reset handler will take care of
+	 * retiring the stuck requests.
+	 */
+	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
+		mutex_unlock(&dev->struct_mutex);
+#define EXIT_COND (!i915_reset_in_progress(&dev_priv->gpu_error) || \
+		   i915_terminally_wedged(&dev_priv->gpu_error))
+		ret = wait_event_timeout(dev_priv->gpu_error.reset_queue,
+					 EXIT_COND,
+					 10 * HZ);
+		if (!ret) {
+			/* it's unlikely idling will solve anything, but it
+			 * shouldn't hurt to try. */
+			do_idle = true;
+			/* TODO: go down kicking and screaming harder */
+		}
+#undef EXIT_COND
+
+		mutex_lock(&dev->struct_mutex);
+	}
 
 	if (ppgtt == dev_priv->mm.aliasing_ppgtt ||
 	    (list_empty(&vm->active_list) && list_empty(&vm->inactive_list))) {
@@ -117,14 +143,33 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	if (!list_empty(&vm->active_list)) {
 		struct i915_vma *vma;
 
+		do_idle = true;
 		list_for_each_entry(vma, &vm->active_list, mm_list)
 			if (WARN_ON(list_empty(&vma->vma_link) ||
 				    list_is_singular(&vma->vma_link)))
 				break;
-		i915_gem_evict_vm(&ppgtt->base, true, true);
-	} else {
+	} else
 		i915_gem_retire_requests(dev);
-		i915_gem_evict_vm(&ppgtt->base, false, true);
+
+	/* We have a problem here where VM teardown cannot be interrupted, or
+	 * else the ppgtt cleanup will fail. As an example, a precisely timed
+	 * SIGKILL could lead to an OOPS, or worse. There are two options:
+	 * 1. Make the eviction uninterruptible
+	 * 2. Defer the eviction if it was interrupted.
+	 *
+	 * Option #1 is not the friendliest, but it's the easiest to implement,
+	 * and least error prone.
+	 * TODO: Implement option 2
+	 */
+	ret = i915_gem_evict_vm(&ppgtt->base, do_idle, !do_idle);
+	if (ret == -ERESTARTSYS)
+		ret = i915_gem_evict_vm(&ppgtt->base, do_idle, false);
+	WARN_ON(ret);
+	WARN_ON(!list_empty(&vm->active_list));
+
+	/* This is going to blow up badly if the mm is unclean */
+	if (WARN_ON(!list_empty(&ppgtt->base.mm.head_node.node_list))) {
+		/* TODO: go down kicking and screaming harder++ */
 	}
 
 	ppgtt->base.cleanup(&ppgtt->base);
-- 
2.0.4


* [PATCH 12/68] drm/i915: Defer PPGTT cleanup
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (10 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 11/68] drm/i915: More correct (slower) ppgtt cleanup Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 13/68] drm/i915/bdw: Enable full PPGTT Ben Widawsky
                   ` (60 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The last patch made the PPGTT free cases correct. It left a major
problem, though, where in many cases it was possible to have to idle the
GPU in order to destroy a VM. This is really unfortunate, as it stalls
an active GPU process on behalf of the dying GPU process.

The workqueue grew very tricky. I left my original wait-based version in
as #if 0, and used Chris' recommendation with the seqno check. I haven't
measured one vs the other, but I am in favor of the code as it is. I am
just leaving the old code for people to observe.

NOTE: I don't expect this patch to be merged as the thing it fixes will
be superseded by the PPGTT refcounting. However, at this time I do not
have a working and tested solution, other than mine. So it's still here.
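
For reference, the check the worker relies on boils down to this (a
simplified sketch using the names the patch below introduces):

	/* A deferred VM may be torn down once the ring has passed the
	 * newest seqno recorded on its VMAs at context destruction. */
	static bool vma_retired(struct i915_vma *vma)
	{
		u32 completed =
			vma->retire_ring->get_seqno(vma->retire_ring, true);

		return i915_seqno_passed(completed, vma->retire_seqno);
	}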

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h         |  10 +++
 drivers/gpu/drm/i915/i915_gem.c         | 110 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c |  40 ++++++++----
 drivers/gpu/drm/i915/i915_gem_gtt.c     |   2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.h     |   2 +
 5 files changed, 150 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c5c8753..56d193e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1129,6 +1129,15 @@ struct i915_gem_mm {
 	struct delayed_work idle_work;
 
 	/**
+	 * PPGTT freeing often happens during interruptible times (fd close,
+	 * execbuf, busy_ioctl). It therefore becomes difficult to clean up the
+	 * PPGTT when the refcount reaches 0 if a signal comes in. This
+	 * workqueue defers the cleanup of the VM to a later time, and allows
+	 * userspace to continue on.
+	 */
+	struct delayed_work ppgtt_work;
+
+	/**
 	 * Are we in a non-interruptible section of code like
 	 * modesetting?
 	 */
@@ -1520,6 +1529,7 @@ struct drm_i915_private {
 	struct mutex modeset_restore_lock;
 
 	struct list_head vm_list; /* Global list of all address spaces */
+	struct list_head ppgtt_free_list;
 	struct i915_gtt gtt; /* VM representing the global address space */
 
 	struct i915_gem_mm mm;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 20743bd..4ad2205 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2707,6 +2707,112 @@ i915_gem_idle_work_handler(struct work_struct *work)
 	intel_mark_idle(dev_priv->dev);
 }
 
+static void
+i915_gem_ppgtt_work_handler(struct work_struct *work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(work, typeof(*dev_priv), mm.ppgtt_work.work);
+	struct i915_address_space *vm, *v;
+#if 0
+	unsigned reset_counter;
+	int ret;
+#endif
+
+	if (!mutex_trylock(&dev_priv->dev->struct_mutex)) {
+		queue_delayed_work(dev_priv->wq, &dev_priv->mm.ppgtt_work,
+				   round_jiffies_up_relative(HZ));
+		return;
+	}
+
+	list_for_each_entry_safe(vm, v, &dev_priv->ppgtt_free_list, global_link) {
+		struct i915_hw_ppgtt *ppgtt = container_of(vm, struct i915_hw_ppgtt, base);
+		struct i915_vma *vma = NULL, *vma_active = NULL, *vma_inactive = NULL;
+
+		/* The following attempts to find the newest (most recently
+		 * activated/inactivated) VMA.
+		 */
+		if (!list_empty(&ppgtt->base.active_list)) {
+			vma_active = list_last_entry(&ppgtt->base.active_list,
+						     typeof(*vma), mm_list);
+		}
+		if (!list_empty(&ppgtt->base.inactive_list)) {
+			vma_inactive = list_last_entry(&ppgtt->base.inactive_list,
+						       typeof(*vma), mm_list);
+		}
+
+		if (vma_active)
+			vma = vma_active;
+		else if (vma_inactive)
+			vma = vma_inactive;
+
+		/* Sanity check */
+		if (vma_active && vma_inactive) {
+			if (WARN_ON(vma_active->retire_seqno <= vma_inactive->retire_seqno))
+				vma = vma_inactive;
+		}
+
+		if (!vma)
+			goto finish;
+
+		/* Another sanity check */
+		if (WARN_ON(!vma->retire_seqno))
+			continue;
+
+		/* NOTE: We could wait here for the seqno, but that makes things
+		 * significantly more complex. */
+		if (!i915_seqno_passed(vma->retire_ring->get_seqno(vma->retire_ring, true), vma->retire_seqno))
+			break;
+
+#if 0
+		reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
+		mutex_unlock(&dev_priv->dev->struct_mutex);
+		ret = __wait_seqno(vma->retire_ring, vma->retire_seqno,
+				   reset_counter, true, NULL, NULL);
+
+		if (i915_mutex_lock_interruptible(dev_priv->dev)) {
+			queue_delayed_work(dev_priv->wq, &dev_priv->mm.ppgtt_work,
+					   round_jiffies_up_relative(HZ));
+			return;
+		}
+
+		if (ret)
+			continue;
+
+#endif
+finish:
+		/* If the object was destroyed while our VMA was dangling, the
+		 * object free code did the synchronous cleanup for us.
+		 * Therefore every node on this list should have a living object
+		 * (and VMA), but let's try not to use it in case future code
+		 * allows a VMA to outlive an object.
+		 */
+		while (!list_empty(&ppgtt->base.mm.head_node.node_list)) {
+			struct drm_mm_node *node;
+			struct i915_vma *vma;
+			struct drm_i915_gem_object *obj;
+
+			node = list_first_entry(&ppgtt->base.mm.head_node.node_list,
+						struct drm_mm_node,
+						node_list);
+			vma = container_of(node, struct i915_vma, node);
+			obj = vma->obj;
+
+			drm_mm_remove_node(node);
+			i915_gem_vma_destroy(vma);
+			i915_gem_object_unpin_pages(obj);
+		}
+
+		ppgtt->base.cleanup(&ppgtt->base);
+		kfree(ppgtt);
+	}
+
+	if (!list_empty(&dev_priv->ppgtt_free_list))
+		queue_delayed_work(dev_priv->wq, &dev_priv->mm.ppgtt_work,
+				   round_jiffies_up_relative(HZ));
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+}
+
 /**
  * Ensures that an object will eventually get non-busy by flushing any required
  * write domains, emitting any outstanding lazy request and retiring and
@@ -4553,6 +4659,7 @@ i915_gem_suspend(struct drm_device *dev)
 	mutex_unlock(&dev->struct_mutex);
 
 	del_timer_sync(&dev_priv->gpu_error.hangcheck_timer);
+	cancel_delayed_work_sync(&dev_priv->mm.ppgtt_work);
 	cancel_delayed_work_sync(&dev_priv->mm.retire_work);
 	flush_delayed_work(&dev_priv->mm.idle_work);
 
@@ -4894,6 +5001,7 @@ i915_gem_load(struct drm_device *dev)
 				  NULL);
 
 	INIT_LIST_HEAD(&dev_priv->vm_list);
+	INIT_LIST_HEAD(&dev_priv->ppgtt_free_list);
 	i915_init_vm(dev_priv, &dev_priv->gtt.base);
 
 	INIT_LIST_HEAD(&dev_priv->context_list);
@@ -4908,6 +5016,8 @@ i915_gem_load(struct drm_device *dev)
 			  i915_gem_retire_work_handler);
 	INIT_DELAYED_WORK(&dev_priv->mm.idle_work,
 			  i915_gem_idle_work_handler);
+	INIT_DELAYED_WORK(&dev_priv->mm.ppgtt_work,
+			  i915_gem_ppgtt_work_handler);
 	init_waitqueue_head(&dev_priv->gpu_error.reset_queue);
 
 	/* On GEN3 we really need to make sure the ARB C3 LP bit is set */
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index e48a3bb..61b36f9 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -96,7 +96,7 @@
 #define GEN6_CONTEXT_ALIGN (64<<10)
 #define GEN7_CONTEXT_ALIGN 4096
 
-static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
+static int do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 {
 	struct drm_device *dev = ppgtt->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
@@ -131,7 +131,7 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	if (ppgtt == dev_priv->mm.aliasing_ppgtt ||
 	    (list_empty(&vm->active_list) && list_empty(&vm->inactive_list))) {
 		ppgtt->base.cleanup(&ppgtt->base);
-		return;
+		return 0;
 	}
 
 	/*
@@ -153,17 +153,30 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 
 	/* We have a problem here where VM teardown cannot be interrupted, or
 	 * else the ppgtt cleanup will fail. As an example, a precisely timed
-	 * SIGKILL could lead to an OOPS, or worse. There are two options:
-	 * 1. Make the eviction uninterruptible
-	 * 2. Defer the eviction if it was interrupted.
-	 *
-	 * Option #1 is not the friendliest, but it's the easiest to implement,
-	 * and least error prone.
-	 * TODO: Implement option 2
+	 * SIGKILL could lead to an OOPS, or worse. The real solution is to
+	 * properly track the VMA <-> OBJ relationship. This temporary bandaid
+	 * will simply defer the free until we know the seqno has passed for
+	 * this VMA.
 	 */
 	ret = i915_gem_evict_vm(&ppgtt->base, do_idle, !do_idle);
-	if (ret == -ERESTARTSYS)
-		ret = i915_gem_evict_vm(&ppgtt->base, do_idle, false);
+	if (ret == -ERESTARTSYS) {
+		struct drm_mm_node *entry;
+		/* First mark all VMAs */
+		drm_mm_for_each_node(entry, &ppgtt->base.mm) {
+			struct i915_vma *vma = container_of(entry, struct i915_vma, node);
+			vma->retire_seqno = vma->obj->last_read_seqno;
+			vma->retire_ring = vma->obj->ring;
+			/* It's okay to lose the object, we just can't lose the
+			 * VMA */
+		}
+
+		list_move_tail(&ppgtt->base.global_link, &dev_priv->ppgtt_free_list);
+		queue_delayed_work(dev_priv->wq,
+				   &dev_priv->mm.ppgtt_work,
+				   round_jiffies_up_relative(HZ));
+		return -EAGAIN;
+	}
+
 	WARN_ON(ret);
 	WARN_ON(!list_empty(&vm->active_list));
 
@@ -173,6 +186,7 @@ static void do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 	}
 
 	ppgtt->base.cleanup(&ppgtt->base);
+	return 0;
 }
 
 static void ppgtt_release(struct kref *kref)
@@ -180,8 +194,8 @@ static void ppgtt_release(struct kref *kref)
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(kref, struct i915_hw_ppgtt, ref);
 
-	do_ppgtt_cleanup(ppgtt);
-	kfree(ppgtt);
+	if (!do_ppgtt_cleanup(ppgtt))
+		kfree(ppgtt);
 }
 
 static size_t get_context_alignment(struct drm_device *dev)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index f2ece5f..254895d 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1019,7 +1019,7 @@ static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
 		container_of(vm, struct i915_hw_ppgtt, base);
 
 	list_del(&vm->global_link);
-	drm_mm_takedown(&ppgtt->base.mm);
+	drm_mm_takedown(&vm->mm);
 	drm_mm_remove_node(&ppgtt->node);
 
 	gen6_ppgtt_unmap_pages(ppgtt);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 379cf16ea..593b5d0 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -137,6 +137,8 @@ struct i915_vma {
 	struct hlist_node exec_node;
 	unsigned long exec_handle;
 	struct drm_i915_gem_exec_object2 *exec_entry;
+	uint32_t retire_seqno; /* Last active seqno at context desruction */
+	struct intel_engine_cs *retire_ring; /* Last ring for retire_seqno */
 
 	/**
 	 * How many users have pinned this object in GTT space. The following
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 13/68] drm/i915/bdw: Enable full PPGTT
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (11 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 12/68] drm/i915: Defer PPGTT cleanup Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 14/68] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
                   ` (59 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Broadwell is perfectly capable of full PPGTT. I've been using it for
some time, and have seen no especially ill effects. (The HAS_PPGTT()
check now excludes only Cherryview, which enables gen8 Broadwell while
still leaving CHV off.)

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Conflicts:
	drivers/gpu/drm/i915/i915_drv.h
---
 drivers/gpu/drm/i915/i915_drv.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 56d193e..02d81b03 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2075,7 +2075,7 @@ struct drm_i915_cmd_table {
 
 #define HAS_HW_CONTEXTS(dev)	(INTEL_INFO(dev)->gen >= 6)
 #define HAS_ALIASING_PPGTT(dev)	(INTEL_INFO(dev)->gen >= 6)
-#define HAS_PPGTT(dev)		(INTEL_INFO(dev)->gen >= 7 && !IS_GEN8(dev))
+#define HAS_PPGTT(dev)		(INTEL_INFO(dev)->gen >= 7 && !IS_CHERRYVIEW(dev))
 #define USES_PPGTT(dev)		(i915.enable_ppgtt)
 #define USES_FULL_PPGTT(dev)	(i915.enable_ppgtt == 2)
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 14/68] drm/i915: Get the error state over the wire (HACKish)
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (12 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 13/68] drm/i915/bdw: Enable full PPGTT Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 15/68] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
                   ` (58 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

I was dealing with a bug recently where the system would hard hang
somewhere between hangcheck and reset. There was time after error
collection to actually get my error state out, but I couldn't get the
reads to work.

This patch is also useful for when reset kills the machine, and you want
to keep reset enabled but still get error state.

Since I found the patch pretty useful, I decided to clean it up and
submit it. It was mostly meant as a one-off hack originally though.

If a maintainer decides it's useful, then here it is.
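
As posted, this is a compile-time-only path. A minimal sketch of turning
it on in a local debug build (the patch itself does not add this define
anywhere):

	/* e.g. at the top of i915_gpu_error.c, above the #ifdef below */
	#define PUSH_TO_WIRE 1

With that defined, the error state is printk'd at capture time, so it
can be recovered over a serial console or netconsole even if the machine
dies immediately afterwards.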

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c   |  2 +-
 drivers/gpu/drm/i915/i915_drv.h       |  3 ++-
 drivers/gpu/drm/i915/i915_gpu_error.c | 31 +++++++++++++++++++++++++------
 drivers/gpu/drm/i915/i915_sysfs.c     |  2 +-
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 330caa1..16ae700 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -931,7 +931,7 @@ static ssize_t i915_error_state_read(struct file *file, char __user *userbuf,
 	if (ret)
 		return ret;
 
-	ret = i915_error_state_to_str(&error_str, error_priv);
+	ret = i915_error_state_to_str(&error_str, error_priv->dev, error_priv->error);
 	if (ret)
 		goto out;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 02d81b03..04c9e2c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2619,7 +2619,8 @@ static inline void intel_display_crc_init(struct drm_device *dev) {}
 __printf(2, 3)
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...);
 int i915_error_state_to_str(struct drm_i915_error_state_buf *estr,
-			    const struct i915_error_state_file_priv *error);
+			    struct drm_device *dev,
+			    const struct drm_i915_error_state *error);
 int i915_error_state_buf_init(struct drm_i915_error_state_buf *eb,
 			      size_t count, loff_t pos);
 static inline void i915_error_state_buf_release(
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 82508dd..c391268 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -184,8 +184,22 @@ static void i915_error_puts(struct drm_i915_error_state_buf *e,
 	__i915_error_advance(e, len);
 }
 
-#define err_printf(e, ...) i915_error_printf(e, __VA_ARGS__)
-#define err_puts(e, s) i915_error_puts(e, s)
+
+static bool wire = false;
+#define err_printf(e, ...) do {				\
+	if (wire) {					\
+		printk(__VA_ARGS__);			\
+	} else {					\
+		i915_error_printf(e, __VA_ARGS__);	\
+	}						\
+} while (0)
+#define err_puts(e, s) do {				\
+	if (wire) {					\
+		printk(s);				\
+	} else {					\
+		i915_error_puts(e, s);			\
+	}						\
+} while (0)
 
 static void print_error_buffers(struct drm_i915_error_state_buf *m,
 				const char *name,
@@ -242,7 +256,7 @@ static const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a)
 
 static void i915_ring_error_state(struct drm_i915_error_state_buf *m,
 				  struct drm_device *dev,
-				  struct drm_i915_error_ring *ring)
+				  const struct drm_i915_error_ring *ring)
 {
 	if (!ring->valid)
 		return;
@@ -324,11 +338,10 @@ static void print_error_obj(struct drm_i915_error_state_buf *m,
 }
 
 int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
-			    const struct i915_error_state_file_priv *error_priv)
+			    struct drm_device *dev,
+			    const struct drm_i915_error_state *error)
 {
-	struct drm_device *dev = error_priv->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	struct drm_i915_error_state *error = error_priv->error;
 	struct drm_i915_error_object *obj;
 	int i, j, offset, elt;
 	int max_hangcheck_score;
@@ -1266,6 +1279,12 @@ void i915_capture_error_state(struct drm_device *dev, bool wedged,
 	spin_lock_irqsave(&dev_priv->gpu_error.lock, flags);
 	if (dev_priv->gpu_error.first_error == NULL) {
 		dev_priv->gpu_error.first_error = error;
+#ifdef PUSH_TO_WIRE
+		/* Probably racy, but this is emergency debug */
+		wire = true;
+		i915_error_state_to_str(NULL, dev, error);
+		wire = false;
+#endif
 		error = NULL;
 	}
 	spin_unlock_irqrestore(&dev_priv->gpu_error.lock, flags);
diff --git a/drivers/gpu/drm/i915/i915_sysfs.c b/drivers/gpu/drm/i915/i915_sysfs.c
index ae7fd8f..b559781 100644
--- a/drivers/gpu/drm/i915/i915_sysfs.c
+++ b/drivers/gpu/drm/i915/i915_sysfs.c
@@ -547,7 +547,7 @@ static ssize_t error_state_read(struct file *filp, struct kobject *kobj,
 	error_priv.dev = dev;
 	i915_error_state_get(dev, &error_priv);
 
-	ret = i915_error_state_to_str(&error_str, &error_priv);
+	ret = i915_error_state_to_str(&error_str, dev, error_priv.error);
 	if (ret)
 		goto out;
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 15/68] drm/i915/gen8: Invalidate TLBs before PDP reload
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (13 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 14/68] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 16/68] drm/i915: Remove false assertion in ppgtt_release Ben Widawsky
                   ` (57 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This is a spec requirement for all rings: the TLBs must be invalidated
before the page directory pointers are reloaded, or stale translations
may be used for the new address space.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 61b36f9..ef256ae 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -892,6 +892,9 @@ int i915_switch_context(struct intel_engine_cs *ring,
 	if (from == to && !to->remap_slice)
 		return 0;
 
+	if (IS_GEN8(ring->dev))
+		WARN_ON(ring->flush(ring, I915_GEM_GPU_DOMAINS, 0));
+
 	if (ring->id == RCS)
 		return do_switch_rcs(ring, from, to);
 	else
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 16/68] drm/i915: Remove false assertion in ppgtt_release
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (14 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 15/68] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw" Ben Widawsky
                   ` (56 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Originally, the thought behind the assertion was that if there are no
real VMAs (they died during execbuf), or there is only one VMA but it is
on the active list, it's a bug. The former case is pretty obvious. The
latter case was simply meant to assert that the context unref/object
retire interactions were working properly.

There is a flaw in the logic of the second case when an object has
multiple VMAs. With multiple VMAs, it's possible that the object
continually had its seqno increased as it was used by another context.
In that case, the context ref will die, but the VMA will not be taken
off the active list because the retire for that VMA's seqno never
happens.

Like some of the other fixes I've submitted recently, this should be
fixed by the eventual work Daniel will do.

This is pretty easy to reproduce whenever mesa uses the blit engine.
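
A sketch of the timeline (hypothetical contexts A and B sharing one
object):

	/*
	 * 1. A and B each create a VMA for the same object.
	 * 2. B keeps submitting with the object, so obj->last_read_seqno
	 *    keeps advancing past anything A ever emitted.
	 * 3. A's context ref dies, but A's VMA stays on the active list
	 *    because the retire for it never happens.
	 * => the singular-VMA check can fire with nothing actually wrong.
	 */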

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index ef256ae..a5c7d5d 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -145,8 +145,7 @@ static int do_ppgtt_cleanup(struct i915_hw_ppgtt *ppgtt)
 
 		do_idle = true;
 		list_for_each_entry(vma, &vm->active_list, mm_list)
-			if (WARN_ON(list_empty(&vma->vma_link) ||
-				    list_is_singular(&vma->vma_link)))
+			if (WARN_ON(list_empty(&vma->vma_link)))
 				break;
 	} else
 		i915_gem_retire_requests(dev);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw"
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (15 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 16/68] drm/i915: Remove false assertion in ppgtt_release Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-10-31 19:45   ` Rodrigo Vivi
  2014-08-22  3:11 ` [PATCH 18/68] drm/i915/trace: Fix offsets for 64b Ben Widawsky
                   ` (55 subsequent siblings)
  72 siblings, 1 reply; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

This reverts commit 0d68b25e9ceb344fe2f93373b1c0311d33814265.

At one time I bisected reset breakage to this patch by using a mesa build that
is guaranteed to generate a hang in the fragment shader, and then running the
following test case:

./bin/shader_runner  tests/shaders/glsl-algebraic-add-zero.shader_test -auto
./bin/shader_runner  tests/shaders/glsl-fs-texture2d.shader_test -auto
./bin/shader_runner  tests/shaders/glsl-algebraic-add-zero.shader_test -auto

The symptom is that after the first GPU hang, all subsequent commands
will hang the GPU.

Oddly at some point I believe this revert stopped fixing the issue, but I am
leaving it in the series to minimize variables.

---
 drivers/gpu/drm/i915/intel_pm.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 41de760..3255c10 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -3658,23 +3658,15 @@ static void gen8_enable_rps(struct drm_device *dev)
 	for_each_ring(ring, dev_priv, unused)
 		I915_WRITE(RING_MAX_IDLE(ring->mmio_base), 10);
 	I915_WRITE(GEN6_RC_SLEEP, 0);
-	if (IS_BROADWELL(dev))
-		I915_WRITE(GEN6_RC6_THRESHOLD, 625); /* 800us/1.28 for TO */
-	else
-		I915_WRITE(GEN6_RC6_THRESHOLD, 50000); /* 50/125ms per EI */
+	I915_WRITE(GEN6_RC6_THRESHOLD, 50000); /* 50/125ms per EI */
 
 	/* 3: Enable RC6 */
 	if (intel_enable_rc6(dev) & INTEL_RC6_ENABLE)
 		rc6_mask = GEN6_RC_CTL_RC6_ENABLE;
 	intel_print_rc6_info(dev, rc6_mask);
-	if (IS_BROADWELL(dev))
-		I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
-				GEN7_RC_CTL_TO_MODE |
-				rc6_mask);
-	else
-		I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
-				GEN6_RC_CTL_EI_MODE(1) |
-				rc6_mask);
+	I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
+				    GEN6_RC_CTL_EI_MODE(1) |
+				    rc6_mask);
 
 	/* 4 Program defaults and thresholds for RPS*/
 	I915_WRITE(GEN6_RPNSWREQ,
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 18/68] drm/i915/trace: Fix offsets for 64b
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (16 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw" Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 19/68] drm/i915: Wrap VMA binding Ben Widawsky
                   ` (54 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky
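
With 48b addressing a VMA offset no longer fits in 32 bits, so the trace
fields silently truncated. A sketch of the failure (hypothetical
offset):

	u64 offset = vma->node.start;	/* e.g. 0x100000000 under 48b */
	u32 old_field = offset;		/* u32 trace field: becomes 0  */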

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_trace.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index f5aa006..cbf5521 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -115,7 +115,7 @@ TRACE_EVENT(i915_vma_bind,
 	    TP_STRUCT__entry(
 			     __field(struct drm_i915_gem_object *, obj)
 			     __field(struct i915_address_space *, vm)
-			     __field(u32, offset)
+			     __field(u64, offset)
 			     __field(u32, size)
 			     __field(unsigned, flags)
 			     ),
@@ -128,7 +128,7 @@ TRACE_EVENT(i915_vma_bind,
 			   __entry->flags = flags;
 			   ),
 
-	    TP_printk("obj=%p, offset=%08x size=%x%s vm=%p",
+	    TP_printk("obj=%p, offset=%016llx size=%x%s vm=%p",
 		      __entry->obj, __entry->offset, __entry->size,
 		      __entry->flags & PIN_MAPPABLE ? ", mappable" : "",
 		      __entry->vm)
@@ -141,7 +141,7 @@ TRACE_EVENT(i915_vma_unbind,
 	    TP_STRUCT__entry(
 			     __field(struct drm_i915_gem_object *, obj)
 			     __field(struct i915_address_space *, vm)
-			     __field(u32, offset)
+			     __field(u64, offset)
 			     __field(u32, size)
 			     ),
 
@@ -152,7 +152,7 @@ TRACE_EVENT(i915_vma_unbind,
 			   __entry->size = vma->node.size;
 			   ),
 
-	    TP_printk("obj=%p, offset=%08x size=%x vm=%p",
+	    TP_printk("obj=%p, offset=%016llx size=%x vm=%p",
 		      __entry->obj, __entry->offset, __entry->size, __entry->vm)
 );
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 19/68] drm/i915: Wrap VMA binding
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (17 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 18/68] drm/i915/trace: Fix offsets for 64b Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 20/68] drm/i915: Make pin global flags explicit Ben Widawsky
                   ` (53 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This will be useful for some upcoming patches which do more
platform-specific work. Having it in one central place just makes things
a bit cleaner and easier.

NOTE: I didn't actually end up using this patch for the intended
purpose, but I thought it was a nice patch to keep around.

v2: s/i915_gem_bind_vma/i915_gem_vma_bind/
s/i915_gem_unbind_vma/i915_gem_vma_unbind/
(Chris)

v3: Missed one spot

v4: Don't change the trace events (Daniel)

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            |  3 +++
 drivers/gpu/drm/i915/i915_gem.c            | 12 ++++++------
 drivers/gpu/drm/i915/i915_gem_context.c    |  2 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 +++--
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 13 ++++++++++++-
 5 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 04c9e2c..d1750d5 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2461,6 +2461,9 @@ bool i915_gem_obj_bound(struct drm_i915_gem_object *o,
 			struct i915_address_space *vm);
 unsigned long i915_gem_obj_size(struct drm_i915_gem_object *o,
 				struct i915_address_space *vm);
+void i915_gem_vma_bind(struct i915_vma *vma, enum i915_cache_level,
+		       unsigned flags);
+void i915_gem_vma_unbind(struct i915_vma *vma);
 struct i915_vma *i915_gem_obj_to_vma(struct drm_i915_gem_object *obj,
 				     struct i915_address_space *vm);
 struct i915_vma *
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4ad2205..5f66939 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3031,7 +3031,7 @@ int i915_vma_unbind(struct i915_vma *vma)
 
 	trace_i915_vma_unbind(vma);
 
-	vma->unbind_vma(vma);
+	i915_gem_vma_unbind(vma);
 
 	list_del_init(&vma->mm_list);
 	/* Avoid an unnecessary call to unbind on rebind. */
@@ -3585,8 +3585,8 @@ search_free:
 	WARN_ON(flags & PIN_MAPPABLE && !obj->map_and_fenceable);
 
 	trace_i915_vma_bind(vma, flags);
-	vma->bind_vma(vma, obj->cache_level,
-		      flags & (PIN_MAPPABLE | PIN_GLOBAL) ? GLOBAL_BIND : 0);
+	i915_gem_vma_bind(vma, obj->cache_level,
+			  flags & (PIN_MAPPABLE | PIN_GLOBAL) ? GLOBAL_BIND : 0);
 
 	i915_gem_verify_gtt(dev);
 	return vma;
@@ -3797,8 +3797,8 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 
 		list_for_each_entry(vma, &obj->vma_list, vma_link)
 			if (drm_mm_node_allocated(&vma->node))
-				vma->bind_vma(vma, cache_level,
-					      obj->has_global_gtt_mapping ? GLOBAL_BIND : 0);
+				i915_gem_vma_bind(vma, cache_level,
+						  obj->has_global_gtt_mapping ? GLOBAL_BIND : 0);
 	}
 
 	list_for_each_entry(vma, &obj->vma_list, vma_link)
@@ -4199,7 +4199,7 @@ i915_gem_object_pin(struct drm_i915_gem_object *obj,
 	}
 
 	if (flags & PIN_GLOBAL && !obj->has_global_gtt_mapping)
-		vma->bind_vma(vma, obj->cache_level, GLOBAL_BIND);
+		i915_gem_vma_bind(vma, obj->cache_level, GLOBAL_BIND);
 
 	vma->pin_count++;
 	if (flags & PIN_MAPPABLE)
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index a5c7d5d..51b517e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -781,7 +781,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	if (!to->legacy_hw_ctx.rcs_state->has_global_gtt_mapping) {
 		struct i915_vma *vma = i915_gem_obj_to_vma(to->legacy_hw_ctx.rcs_state,
 							   &dev_priv->gtt.base);
-		vma->bind_vma(vma, to->legacy_hw_ctx.rcs_state->cache_level, GLOBAL_BIND);
+		i915_gem_vma_bind(vma, to->legacy_hw_ctx.rcs_state->cache_level, GLOBAL_BIND);
 	}
 
 	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1420aeb..884ec39 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -376,7 +376,8 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 		struct i915_vma *vma =
 			list_first_entry(&target_i915_obj->vma_list,
 					 typeof(*vma), vma_link);
-		vma->bind_vma(vma, target_i915_obj->cache_level, GLOBAL_BIND);
+		i915_gem_vma_bind(vma, target_i915_obj->cache_level,
+				  GLOBAL_BIND);
 	}
 
 	/* Validate that the target is in a valid r/w GPU domain */
@@ -1392,7 +1393,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 		 * allocate space first */
 		struct i915_vma *vma = i915_gem_obj_to_ggtt(batch_obj);
 		BUG_ON(!vma);
-		vma->bind_vma(vma, batch_obj->cache_level, GLOBAL_BIND);
+		i915_gem_vma_bind(vma, batch_obj->cache_level, GLOBAL_BIND);
 	}
 
 	if (flags & I915_DISPATCH_SECURE)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 254895d..2a75bce 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1343,7 +1343,7 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
 		 * without telling our object about it. So we need to fake it.
 		 */
 		obj->has_global_gtt_mapping = 0;
-		vma->bind_vma(vma, obj->cache_level, GLOBAL_BIND);
+		i915_gem_vma_bind(vma, obj->cache_level, GLOBAL_BIND);
 	}
 
 
@@ -2097,6 +2097,17 @@ int i915_gem_gtt_init(struct drm_device *dev)
 	return 0;
 }
 
+void i915_gem_vma_bind(struct i915_vma *vma, enum i915_cache_level cache_level,
+		       unsigned flags)
+{
+	vma->bind_vma(vma, cache_level, flags);
+}
+
+void i915_gem_vma_unbind(struct i915_vma *vma)
+{
+	vma->unbind_vma(vma);
+}
+
 static struct i915_vma *__i915_gem_vma_create(struct drm_i915_gem_object *obj,
 					      struct i915_address_space *vm)
 {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 20/68] drm/i915: Make pin global flags explicit
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (18 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 19/68] drm/i915: Wrap VMA binding Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 21/68] drm/i915: Split out aliasing binds Ben Widawsky
                   ` (52 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The driver currently lets callers pin global, and then tries to do the
right thing inside the function. Doing so has two downsides:
1. It's not possible to pin exclusively to the global, or to an
aliasing, address space.
2. It's difficult to read and understand.

The eventual goal, when realized, should fix both of these issues. This
patch, which should have no functional impact, begins to address them
without intentionally breaking things.
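
To illustrate the intended end state (a sketch; only the flags this
patch adds are assumed):

	int ret;

	/* Pin strictly into the global GTT... */
	ret = i915_gem_object_pin(obj, obj_to_ggtt(obj), 4096, PIN_GLOBAL);

	/* ...or explicitly ask for today's combined global + aliasing bind. */
	ret = i915_gem_object_pin(obj, obj_to_ggtt(obj), 4096,
				  PIN_GLOBAL_ALIASED);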

v2: Replace PIN_GLOBAL with PIN_ALIASING in _pin(). Copy paste error

v3: Rebased/reworked with flag conflict from negative relocations

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            | 14 ++++++++------
 drivers/gpu/drm/i915/i915_gem.c            | 31 +++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  8 ++++++--
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 12 ++++++++++--
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  6 +++++-
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d1750d5..0db17f8 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2300,11 +2300,13 @@ void i915_init_vm(struct drm_i915_private *dev_priv,
 void i915_gem_free_object(struct drm_gem_object *obj);
 void i915_gem_vma_destroy(struct i915_vma *vma);
 
-#define PIN_MAPPABLE 0x1
-#define PIN_NONBLOCK 0x2
-#define PIN_GLOBAL 0x4
-#define PIN_OFFSET_BIAS 0x8
-#define PIN_OFFSET_MASK (~4095)
+#define PIN_MAPPABLE	(1<<0)
+#define PIN_NONBLOCK	(1<<1)
+#define PIN_GLOBAL	(1<<2)
+#define PIN_ALIASING	(1<<3)
+#define PIN_GLOBAL_ALIASED (PIN_ALIASING | PIN_GLOBAL)
+#define PIN_OFFSET_BIAS (1<<4)
+#define PIN_OFFSET_MASK (PAGE_MASK)
 int __must_check i915_gem_object_pin(struct drm_i915_gem_object *obj,
 				     struct i915_address_space *vm,
 				     uint32_t alignment,
@@ -2511,7 +2513,7 @@ i915_gem_obj_ggtt_pin(struct drm_i915_gem_object *obj,
 		      uint32_t alignment,
 		      unsigned flags)
 {
-	return i915_gem_object_pin(obj, obj_to_ggtt(obj), alignment, flags | PIN_GLOBAL);
+	return i915_gem_object_pin(obj, obj_to_ggtt(obj), alignment, flags | PIN_GLOBAL_ALIASED);
 }
 
 static inline int
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5f66939..1dd5f43 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3496,8 +3496,12 @@ i915_gem_object_bind_to_vm(struct drm_i915_gem_object *obj,
 	unsigned long end =
 		flags & PIN_MAPPABLE ? dev_priv->gtt.mappable_end : vm->total;
 	struct i915_vma *vma;
+	u32 vma_bind_flags = 0;
 	int ret;
 
+	if (WARN_ON((flags & (PIN_MAPPABLE | PIN_GLOBAL)) == PIN_MAPPABLE))
+		flags |= PIN_GLOBAL;
+
 	fence_size = i915_gem_get_gtt_size(dev,
 					   obj->base.size,
 					   obj->tiling_mode);
@@ -3584,9 +3588,11 @@ search_free:
 
 	WARN_ON(flags & PIN_MAPPABLE && !obj->map_and_fenceable);
 
+	if (flags & PIN_GLOBAL_ALIASED)
+		vma_bind_flags = GLOBAL_BIND | ALIASING_BIND;
+
 	trace_i915_vma_bind(vma, flags);
-	i915_gem_vma_bind(vma, obj->cache_level,
-			  flags & (PIN_MAPPABLE | PIN_GLOBAL) ? GLOBAL_BIND : 0);
+	i915_gem_vma_bind(vma, obj->cache_level, vma_bind_flags);
 
 	i915_gem_verify_gtt(dev);
 	return vma;
@@ -3796,9 +3802,14 @@ int i915_gem_object_set_cache_level(struct drm_i915_gem_object *obj,
 		}
 
 		list_for_each_entry(vma, &obj->vma_list, vma_link)
-			if (drm_mm_node_allocated(&vma->node))
-				i915_gem_vma_bind(vma, cache_level,
-						  obj->has_global_gtt_mapping ? GLOBAL_BIND : 0);
+			if (drm_mm_node_allocated(&vma->node)) {
+				u32 bind_flags = 0;
+				if (obj->has_global_gtt_mapping)
+					bind_flags |= GLOBAL_BIND;
+				if (obj->has_aliasing_ppgtt_mapping)
+					bind_flags |= ALIASING_BIND;
+				i915_gem_vma_bind(vma, cache_level, bind_flags);
+			}
 	}
 
 	list_for_each_entry(vma, &obj->vma_list, vma_link)
@@ -4198,8 +4209,14 @@ i915_gem_object_pin(struct drm_i915_gem_object *obj,
 			return PTR_ERR(vma);
 	}
 
-	if (flags & PIN_GLOBAL && !obj->has_global_gtt_mapping)
-		i915_gem_vma_bind(vma, obj->cache_level, GLOBAL_BIND);
+	if (flags & PIN_GLOBAL_ALIASED) {
+		u32 bind_flags = 0;
+		if (flags & PIN_GLOBAL && !obj->has_global_gtt_mapping)
+			bind_flags |= GLOBAL_BIND;
+		if (flags & PIN_ALIASING && !obj->has_aliasing_ppgtt_mapping)
+			bind_flags |= ALIASING_BIND;
+		i915_gem_vma_bind(vma, obj->cache_level, bind_flags);
+	}
 
 	vma->pin_count++;
 	if (flags & PIN_MAPPABLE)
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 884ec39..0c7adb8 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -377,7 +377,7 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
 			list_first_entry(&target_i915_obj->vma_list,
 					 typeof(*vma), vma_link);
 		i915_gem_vma_bind(vma, target_i915_obj->cache_level,
-				  GLOBAL_BIND);
+				  GLOBAL_BIND | ALIASING_BIND);
 	}
 
 	/* Validate that the target is in a valid r/w GPU domain */
@@ -564,6 +564,7 @@ i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
 	if (need_fence || need_reloc_mappable(vma))
 		flags |= PIN_MAPPABLE;
 
+	/* FIXME: What kind of bind does Chris want? */
 	if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
 		flags |= PIN_GLOBAL;
 	if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS)
@@ -1393,7 +1394,10 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 		 * allocate space first */
 		struct i915_vma *vma = i915_gem_obj_to_ggtt(batch_obj);
 		BUG_ON(!vma);
-		i915_gem_vma_bind(vma, batch_obj->cache_level, GLOBAL_BIND);
+		/* FIXME: Current secure dispatch code actually uses PPGTT. We
+		 * need to fix this eventually */
+		i915_gem_vma_bind(vma, batch_obj->cache_level,
+				  GLOBAL_BIND | ALIASING_BIND);
 	}
 
 	if (flags & I915_DISPATCH_SECURE)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 2a75bce..0fb24b5 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1342,8 +1342,16 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
 		 * Unfortunately above, we've just wiped out the mappings
 		 * without telling our object about it. So we need to fake it.
 		 */
-		obj->has_global_gtt_mapping = 0;
-		i915_gem_vma_bind(vma, obj->cache_level, GLOBAL_BIND);
+		if (obj->has_global_gtt_mapping || obj->has_aliasing_ppgtt_mapping) {
+			u32 bind_flags = 0;
+			if (obj->has_global_gtt_mapping)
+				bind_flags |= GLOBAL_BIND;
+			if (obj->has_aliasing_ppgtt_mapping)
+				bind_flags |= ALIASING_BIND;
+			obj->has_global_gtt_mapping = 0;
+			obj->has_aliasing_ppgtt_mapping = 0;
+			i915_gem_vma_bind(vma, obj->cache_level, bind_flags);
+		}
 	}
 
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 593b5d0..6785060 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -157,8 +157,12 @@ struct i915_vma {
 	 * setting the valid PTE entries to a reserved scratch page. */
 	void (*unbind_vma)(struct i915_vma *vma);
 	/* Map an object into an address space with the given cache flags. */
+
+/* Only use this if you know you want a strictly global binding */
 #define GLOBAL_BIND (1<<0)
-#define PTE_READ_ONLY (1<<1)
+/* Only use this if you know you want a strictly aliased binding */
+#define ALIASING_BIND (1<<1)
+#define PTE_READ_ONLY (1<<2)
 	void (*bind_vma)(struct i915_vma *vma,
 			 enum i915_cache_level cache_level,
 			 u32 flags);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 21/68] drm/i915: Split out aliasing binds
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (19 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 20/68] drm/i915: Make pin global flags explicit Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 22/68] drm/i915: fix gtt_total_entries() Ben Widawsky
                   ` (51 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This patch finishes off actually separating the aliasing and global
binds. Prior to this, all global binds would also be aliased. Now, if
aliasing binds are required, they must be explicitly asked for. So far
we have no users of this outside of execbuf - but Mika has already
submitted a patch requiring just this.

A nice benefit of this is that we should no longer be able to clobber
GTT-only objects from the aliasing PPGTT.
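
The resulting policy in the GGTT bind path looks roughly like this (a
sketch with hypothetical helper names; the real logic is the early
return added to ggtt_bind_vma below):

	static void ggtt_bind_policy(struct i915_vma *vma, u32 flags)
	{
		if (flags & GLOBAL_BIND)
			write_ggtt_ptes(vma);		/* hypothetical helper */
		if (!(flags & ALIASING_BIND))
			return;	/* GGTT-only bind: aliasing PPGTT untouched */
		write_aliasing_ppgtt_ptes(vma);		/* hypothetical helper */
	}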

v2: Only add aliasing binds for the GGTT/Aliasing PPGTT at execbuf

v3: Rebase resolution with changed size of flags

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            | 2 +-
 drivers/gpu/drm/i915/i915_gem.c            | 6 ++++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 5 +++--
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 3 +++
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0db17f8..651ad7f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2513,7 +2513,7 @@ i915_gem_obj_ggtt_pin(struct drm_i915_gem_object *obj,
 		      uint32_t alignment,
 		      unsigned flags)
 {
-	return i915_gem_object_pin(obj, obj_to_ggtt(obj), alignment, flags | PIN_GLOBAL_ALIASED);
+	return i915_gem_object_pin(obj, obj_to_ggtt(obj), alignment, flags | PIN_GLOBAL);
 }
 
 static inline int
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 1dd5f43..6413f3a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3588,8 +3588,10 @@ search_free:
 
 	WARN_ON(flags & PIN_MAPPABLE && !obj->map_and_fenceable);
 
-	if (flags & PIN_GLOBAL_ALIASED)
-		vma_bind_flags = GLOBAL_BIND | ALIASING_BIND;
+	if (flags & PIN_ALIASING)
+		vma_bind_flags = ALIASING_BIND;
+	if (flags & PIN_GLOBAL)
+		vma_bind_flags = GLOBAL_BIND;
 
 	trace_i915_vma_bind(vma, flags);
 	i915_gem_vma_bind(vma, obj->cache_level, vma_bind_flags);
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 0c7adb8..caccee9 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -552,10 +552,11 @@ i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
 	bool has_fenced_gpu_access = INTEL_INFO(ring->dev)->gen < 4;
 	bool need_fence;
-	uint64_t flags;
+	uint64_t flags = 0;
 	int ret;
 
-	flags = 0;
+	if (i915_is_ggtt(vma->vm))
+		flags = PIN_ALIASING;
 
 	need_fence =
 		has_fenced_gpu_access &&
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 0fb24b5..e187dc1 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1603,6 +1603,9 @@ static void ggtt_bind_vma(struct i915_vma *vma,
 		}
 	}
 
+	if (!(flags & ALIASING_BIND))
+		return;
+
 	if (dev_priv->mm.aliasing_ppgtt &&
 	    (!obj->has_aliasing_ppgtt_mapping ||
 	     (cache_level != obj->cache_level))) {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 22/68] drm/i915: fix gtt_total_entries()
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (20 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 21/68] drm/i915: Split out aliasing binds Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 23/68] drm/i915: Rename to GEN8_LEGACY_PDPES Ben Widawsky
                   ` (50 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

It's useful to have this as a function rather than a macro for some
upcoming work. Since we generally try to avoid macros anyway, I think it
doesn't hurt to make this its own patch.
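
What the inline buys is argument type checking, at the cost of callers
now passing a pointer (a sketch of the new usage, as in the hunks
below):

	size_t entries = gtt_total_entries(&dev_priv->gtt);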

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c    | 4 ++--
 drivers/gpu/drm/i915/i915_gem_gtt.h    | 7 +++++--
 drivers/gpu/drm/i915/i915_gem_stolen.c | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index e187dc1..d914ca8 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1495,7 +1495,7 @@ static void gen8_ggtt_clear_range(struct i915_address_space *vm,
 	unsigned num_entries = length >> PAGE_SHIFT;
 	gen8_gtt_pte_t scratch_pte, __iomem *gtt_base =
 		(gen8_gtt_pte_t __iomem *) dev_priv->gtt.gsm + first_entry;
-	const int max_entries = gtt_total_entries(dev_priv->gtt) - first_entry;
+	const int max_entries = gtt_total_entries(&dev_priv->gtt) - first_entry;
 	int i;
 
 	if (WARN(num_entries > max_entries,
@@ -1521,7 +1521,7 @@ static void gen6_ggtt_clear_range(struct i915_address_space *vm,
 	unsigned num_entries = length >> PAGE_SHIFT;
 	gen6_gtt_pte_t scratch_pte, __iomem *gtt_base =
 		(gen6_gtt_pte_t __iomem *) dev_priv->gtt.gsm + first_entry;
-	const int max_entries = gtt_total_entries(dev_priv->gtt) - first_entry;
+	const int max_entries = gtt_total_entries(&dev_priv->gtt) - first_entry;
 	int i;
 
 	if (WARN(num_entries > max_entries,
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 6785060..6b764b8 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -38,8 +38,6 @@ typedef uint32_t gen6_gtt_pte_t;
 typedef uint64_t gen8_gtt_pte_t;
 typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 
-#define gtt_total_entries(gtt) ((gtt).base.total >> PAGE_SHIFT)
-
 #define I915_PPGTT_PT_ENTRIES		(PAGE_SIZE / sizeof(gen6_gtt_pte_t))
 /* gen6-hsw has bit 11-4 for physical addr bit 39-32 */
 #define GEN6_GTT_ADDR_ENCODE(addr)	((addr) | (((addr) >> 28) & 0xff0))
@@ -280,6 +278,11 @@ void i915_gem_init_global_gtt(struct drm_device *dev);
 void i915_gem_setup_global_gtt(struct drm_device *dev, unsigned long start,
 			       unsigned long mappable_end, unsigned long end);
 
+static inline size_t gtt_total_entries(struct i915_gtt *gtt)
+{
+	return gtt->base.total >> PAGE_SHIFT;
+}
+
 int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt);
 
 void i915_check_and_clear_faults(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index 21c025a..c9f916c 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -90,7 +90,7 @@ static unsigned long i915_stolen_to_physical(struct drm_device *dev)
 				(gtt_start & PGTBL_ADDRESS_HI_MASK) << 28;
 		else
 			gtt_start &= PGTBL_ADDRESS_LO_MASK;
-		gtt_end = gtt_start + gtt_total_entries(dev_priv->gtt) * 4;
+		gtt_end = gtt_start + gtt_total_entries(&dev_priv->gtt) * 4;
 
 		if (gtt_start >= stolen[0].start && gtt_start < stolen[0].end)
 			stolen[0].end = gtt_start;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 23/68] drm/i915: Rename to GEN8_LEGACY_PDPES
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (21 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 22/68] drm/i915: fix gtt_total_entries() Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 24/68] drm/i915: Split out verbose PPGTT dumping Ben Widawsky
                   ` (49 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

In gen8, the 32b PPGTT has always had one "pdp" (it doesn't actually
have one, but the layout resembles having one). The #define was
confusing as-is, and counting "PDPEs" (PDP entries) is a much better
description.

sed -i 's/GEN8_LEGACY_PDPS/GEN8_LEGACY_PDPES/' drivers/gpu/drm/i915/*.[ch]

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 6 +++---
 drivers/gpu/drm/i915/i915_gem_gtt.h | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index d914ca8..ab863bb 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -311,7 +311,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	pt_vaddr = NULL;
 
 	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
-		if (WARN_ON(pdpe >= GEN8_LEGACY_PDPS))
+		if (WARN_ON(pdpe >= GEN8_LEGACY_PDPES))
 			break;
 
 		if (pt_vaddr == NULL)
@@ -425,7 +425,7 @@ bail:
 static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt,
 					   const int max_pdp)
 {
-	struct page **pt_pages[GEN8_LEGACY_PDPS];
+	struct page **pt_pages[GEN8_LEGACY_PDPES];
 	int i, ret;
 
 	for (i = 0; i < max_pdp; i++) {
@@ -476,7 +476,7 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 		return -ENOMEM;
 
 	ppgtt->num_pd_pages = 1 << get_order(max_pdp << PAGE_SHIFT);
-	BUG_ON(ppgtt->num_pd_pages > GEN8_LEGACY_PDPS);
+	BUG_ON(ppgtt->num_pd_pages > GEN8_LEGACY_PDPES);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 6b764b8..4af3150 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -84,7 +84,7 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 #define GEN8_PDE_MASK			0x1ff
 #define GEN8_PTE_SHIFT			12
 #define GEN8_PTE_MASK			0x1ff
-#define GEN8_LEGACY_PDPS		4
+#define GEN8_LEGACY_PDPES		4
 #define GEN8_PTES_PER_PAGE		(PAGE_SIZE / sizeof(gen8_gtt_pte_t))
 #define GEN8_PDES_PER_PAGE		(PAGE_SIZE / sizeof(gen8_ppgtt_pde_t))
 
@@ -252,12 +252,12 @@ struct i915_hw_ppgtt {
 	unsigned num_pd_pages; /* gen8+ */
 	union {
 		struct page **pt_pages;
-		struct page **gen8_pt_pages[GEN8_LEGACY_PDPS];
+		struct page **gen8_pt_pages[GEN8_LEGACY_PDPES];
 	};
 	struct page *pd_pages;
 	union {
 		uint32_t pd_offset;
-		dma_addr_t pd_dma_addr[GEN8_LEGACY_PDPS];
+		dma_addr_t pd_dma_addr[GEN8_LEGACY_PDPES];
 	};
 	union {
 		dma_addr_t *pt_dma_addr;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 24/68] drm/i915: Split out verbose PPGTT dumping
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (22 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 23/68] drm/i915: Rename to GEN8_LEGACY_PDPES Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 25/68] drm/i915: s/pd/pdpe, s/pt/pde Ben Widawsky
                   ` (48 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

There often is not enough memory to dump the full contents of the PPGTT.
As a temporary bandage, to keep getting valuable basic PPGTT info, wrap
the dangerous, memory-hungry part inside a new verbose version of the
debugfs file.

Also while here we can split out the PPGTT print function so it's more
reusable.

I'd really like to get PPGTT info into our error state, but I found it too
difficult to make work in the limited time I have. Maybe Mika can find a way.
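
Usage note: assuming debugfs is mounted in the usual place, the cheap
summary remains at /sys/kernel/debug/dri/0/i915_ppgtt_info, while
i915_ppgtt_verbose_info in the same directory additionally invokes
debug_dump for each context (the paths are the conventional ones, not
spelled out in the patch).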

v2: Get the info for the non-default contexts. Merge a patch from Chris
into this patch (Chris). All credit goes to him.

References: 20140320115742.GA4463@nuc-i3427.alporthouse.com
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 52 +++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 16ae700..d2977cf 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1811,22 +1811,12 @@ static int i915_swizzle_info(struct seq_file *m, void *data)
 	return 0;
 }
 
-static int per_file_ctx(int id, void *ptr, void *data)
+static void print_ppgtt(struct seq_file *m, struct i915_hw_ppgtt *ppgtt)
 {
-	struct intel_context *ctx = ptr;
-	struct seq_file *m = data;
-	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(ctx);
-
-	if (i915_gem_context_is_default(ctx))
-		seq_puts(m, "  default context:\n");
-	else
-		seq_printf(m, "  context %d:\n", ctx->user_handle);
-	ppgtt->debug_dump(ppgtt, m);
-
-	return 0;
+	seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd_offset);
 }
 
-static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev)
+static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev, bool verbose)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_engine_cs *ring;
@@ -1850,7 +1840,26 @@ static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev)
 	}
 }
 
-static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev)
+static int per_file_ctx(int id, void *ptr, void *data)
+{
+	struct intel_context *ctx = ptr;
+	/* the low bit of @data smuggles the verbose flag; mask it off */
+	struct seq_file *m = (struct seq_file *)((unsigned long)data & ~1UL);
+	bool verbose = (unsigned long)data & 1;
+	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(ctx);
+
+	if (i915_gem_context_is_default(ctx))
+		seq_puts(m, "  default context:\n");
+	else
+		seq_printf(m, "  context %d:\n", ctx->user_handle);
+
+	print_ppgtt(m, ppgtt);
+	if (verbose)
+		ppgtt->debug_dump(ppgtt, m);
+
+	return 0;
+}
+
+static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev, bool verbose)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_engine_cs *ring;
@@ -1872,9 +1881,9 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev)
 		struct i915_hw_ppgtt *ppgtt = dev_priv->mm.aliasing_ppgtt;
 
 		seq_puts(m, "aliasing PPGTT:\n");
-		seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd_offset);
-
-		ppgtt->debug_dump(ppgtt, m);
+		print_ppgtt(m, ppgtt);
+		if (verbose)
+			ppgtt->debug_dump(ppgtt, m);
 	} else
 		return;
 
@@ -1883,7 +1892,8 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev)
 
 		seq_printf(m, "proc: %s\n",
 			   get_pid_task(file->pid, PIDTYPE_PID)->comm);
-		idr_for_each(&file_priv->context_idr, per_file_ctx, m);
+		idr_for_each(&file_priv->context_idr, per_file_ctx,
+			     (void *)((unsigned long)m | verbose));
 	}
 	seq_printf(m, "ECOCHK: 0x%08x\n", I915_READ(GAM_ECOCHK));
 }
@@ -1893,6 +1903,7 @@ static int i915_ppgtt_info(struct seq_file *m, void *data)
 	struct drm_info_node *node = m->private;
 	struct drm_device *dev = node->minor->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	bool verbose = node->info_ent->data ? true : false;
 
 	int ret = mutex_lock_interruptible(&dev->struct_mutex);
 	if (ret)
@@ -1900,9 +1911,9 @@ static int i915_ppgtt_info(struct seq_file *m, void *data)
 	intel_runtime_pm_get(dev_priv);
 
 	if (INTEL_INFO(dev)->gen >= 8)
-		gen8_ppgtt_info(m, dev);
+		gen8_ppgtt_info(m, dev, verbose);
 	else if (INTEL_INFO(dev)->gen >= 6)
-		gen6_ppgtt_info(m, dev);
+		gen6_ppgtt_info(m, dev, verbose);
 
 	intel_runtime_pm_put(dev_priv);
 	mutex_unlock(&dev->struct_mutex);
@@ -3966,6 +3977,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_gen6_forcewake_count", i915_gen6_forcewake_count_info, 0},
 	{"i915_swizzle_info", i915_swizzle_info, 0},
 	{"i915_ppgtt_info", i915_ppgtt_info, 0},
+	{"i915_ppgtt_verbose_info", i915_ppgtt_info, 0, (void *)1},
 	{"i915_llc", i915_llc, 0},
 	{"i915_edp_psr_status", i915_edp_psr_status, 0},
 	{"i915_sink_crc_eDP1", i915_sink_crc, 0},
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 25/68] drm/i915: s/pd/pdpe, s/pt/pde
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (23 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 24/68] drm/i915: Split out verbose PPGTT dumping Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 26/68] drm/i915: rename map/unmap to dma_map/unmap Ben Widawsky
                   ` (47 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

With the new style of page table data structures, the correct way to
think about these variables is as the index of the entry within the
array being operated on. "pd" and "pt" aren't representative of what
the operation is doing.

The clarity here will improve the readability of future patches.
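
To make the naming concrete, here is a sketch of what each name indexes
(hypothetical helpers for illustration only; the shifts and masks follow
the GEN8 legacy layout defined later in the series, 31:30 PDPE | 29:21
PDE | 20:12 PTE | 11:0 offset):

static inline unsigned ex_pdpe_index(uint64_t addr)
{
        return (addr >> 30) & 0x3;   /* which page directory */
}
static inline unsigned ex_pde_index(uint64_t addr)
{
        return (addr >> 21) & 0x1ff; /* which entry in that directory */
}
static inline unsigned ex_pte_index(uint64_t addr)
{
        return (addr >> 12) & 0x1ff; /* which entry in that page table */
}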

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index ab863bb..861df21 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -506,40 +506,40 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 }
 
 static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
-					     const int pd)
+					     const int pdpe)
 {
 	dma_addr_t pd_addr;
 	int ret;
 
 	pd_addr = pci_map_page(ppgtt->base.dev->pdev,
-			       &ppgtt->pd_pages[pd], 0,
+			       &ppgtt->pd_pages[pdpe], 0,
 			       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 
 	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pd_addr);
 	if (ret)
 		return ret;
 
-	ppgtt->pd_dma_addr[pd] = pd_addr;
+	ppgtt->pd_dma_addr[pdpe] = pd_addr;
 
 	return 0;
 }
 
 static int gen8_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt,
-					const int pd,
-					const int pt)
+					const int pdpe,
+					const int pde)
 {
 	dma_addr_t pt_addr;
 	struct page *p;
 	int ret;
 
-	p = ppgtt->gen8_pt_pages[pd][pt];
+	p = ppgtt->gen8_pt_pages[pdpe][pde];
 	pt_addr = pci_map_page(ppgtt->base.dev->pdev,
 			       p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pt_addr);
 	if (ret)
 		return ret;
 
-	ppgtt->gen8_pt_dma_addr[pd][pt] = pt_addr;
+	ppgtt->gen8_pt_dma_addr[pdpe][pde] = pt_addr;
 
 	return 0;
 }
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 26/68] drm/i915: rename map/unmap to dma_map/unmap
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (24 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 25/68] drm/i915: s/pd/pdpe, s/pt/pde Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 27/68] drm/i915: Setup less PPGTT on failed pagedir Ben Widawsky
                   ` (46 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Upcoming patches will use the terms map and unmap in reference to the
page table entries. Having this distinction will really help with code
clarity at that point.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 861df21..6e15c82 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -364,7 +364,7 @@ static void gen8_ppgtt_free(const struct i915_hw_ppgtt *ppgtt)
 	__free_pages(ppgtt->pd_pages, get_order(ppgtt->num_pd_pages << PAGE_SHIFT));
 }
 
-static void gen8_ppgtt_unmap_pages(struct i915_hw_ppgtt *ppgtt)
+static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 {
 	struct pci_dev *hwdev = ppgtt->base.dev->pdev;
 	int i, j;
@@ -395,7 +395,7 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 	list_del(&vm->global_link);
 	drm_mm_takedown(&vm->mm);
 
-	gen8_ppgtt_unmap_pages(ppgtt);
+	gen8_ppgtt_dma_unmap_pages(ppgtt);
 	gen8_ppgtt_free(ppgtt);
 }
 
@@ -622,7 +622,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	return 0;
 
 bail:
-	gen8_ppgtt_unmap_pages(ppgtt);
+	gen8_ppgtt_dma_unmap_pages(ppgtt);
 	gen8_ppgtt_free(ppgtt);
 	return ret;
 }
@@ -991,7 +991,7 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 		kunmap_atomic(pt_vaddr);
 }
 
-static void gen6_ppgtt_unmap_pages(struct i915_hw_ppgtt *ppgtt)
+static void gen6_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
 
@@ -1022,7 +1022,7 @@ static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
 	drm_mm_takedown(&vm->mm);
 	drm_mm_remove_node(&ppgtt->node);
 
-	gen6_ppgtt_unmap_pages(ppgtt);
+	gen6_ppgtt_dma_unmap_pages(ppgtt);
 	gen6_ppgtt_free(ppgtt);
 }
 
@@ -1122,7 +1122,7 @@ static int gen6_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt)
 				       PCI_DMA_BIDIRECTIONAL);
 
 		if (pci_dma_mapping_error(dev->pdev, pt_addr)) {
-			gen6_ppgtt_unmap_pages(ppgtt);
+			gen6_ppgtt_dma_unmap_pages(ppgtt);
 			return -EIO;
 		}
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 27/68] drm/i915: Setup less PPGTT on failed pagedir
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (25 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 26/68] drm/i915: rename map/unmap to dma_map/unmap Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 28/68] drm/i915: clean up PPGTT init error path Ben Widawsky
                   ` (45 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

On failure, the current code will both potentially print a WARN and set
up part of the PPGTT structure. Neither of these harms anything today;
the change is simply for clarity, and to perhaps prevent later bugs or
weird debug messages.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 6e15c82..46140e8 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1057,11 +1057,14 @@ alloc:
 		goto alloc;
 	}
 
+	if (ret)
+		return ret;
+
 	if (ppgtt->node.start < dev_priv->gtt.mappable_end)
 		DRM_DEBUG("Forced to use aperture for PDEs\n");
 
 	ppgtt->num_pd_entries = GEN6_PPGTT_PD_ENTRIES;
-	return ret;
+	return 0;
 }
 
 static int gen6_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 28/68] drm/i915: clean up PPGTT init error path
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (26 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 27/68] drm/i915: Setup less PPGTT on failed pagedir Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 29/68] drm/i915: Un-hardcode number of page directories Ben Widawsky
                   ` (44 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The old code (I'm having trouble finding the commit) had a reason for
doing things when there was an error, and would continue on, thus the
!ret. For the newer code, however, this looks completely silly.

Follow the normal idiom of if (ret) return ret.

Also, put the pde wiring in the gen-specific init, now that GEN8 exists.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 46140e8..151ec39 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1174,6 +1174,8 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->pd_offset =
 		ppgtt->node.start / PAGE_SIZE * sizeof(gen6_gtt_pte_t);
 
+	gen6_write_pdes(ppgtt);
+
 	ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
 
 	DRM_DEBUG_DRIVER("Allocated pde space (%ldM) at GTT entry: %lx\n",
@@ -1198,20 +1200,14 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt)
 	else
 		BUG();
 
-	if (!ret) {
-		struct drm_i915_private *dev_priv = dev->dev_private;
-		kref_init(&ppgtt->ref);
-		drm_mm_init(&ppgtt->base.mm, ppgtt->base.start,
-			    ppgtt->base.total);
-		i915_init_vm(dev_priv, &ppgtt->base);
-		if (INTEL_INFO(dev)->gen < 8) {
-			gen6_write_pdes(ppgtt);
-			DRM_DEBUG("Adding PPGTT at offset %x\n",
-				  ppgtt->pd_offset << 10);
-		}
-	}
+	if (ret)
+		return ret;
 
-	return ret;
+	kref_init(&ppgtt->ref);
+	drm_mm_init(&ppgtt->base.mm, ppgtt->base.start, ppgtt->base.total);
+	i915_init_vm(dev_priv, &ppgtt->base);
+
+	return 0;
 }
 
 static void
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 29/68] drm/i915: Un-hardcode number of page directories
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (27 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 28/68] drm/i915: clean up PPGTT init error path Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 30/68] drm/i915: Make gen6_write_pdes gen6_map_page_tables Ben Widawsky
                   ` (43 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Trivial: use the GEN8_LEGACY_PDPES define instead of the hardcoded 4.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 4af3150..0199c5a 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -261,7 +261,7 @@ struct i915_hw_ppgtt {
 	};
 	union {
 		dma_addr_t *pt_dma_addr;
-		dma_addr_t *gen8_pt_dma_addr[4];
+		dma_addr_t *gen8_pt_dma_addr[GEN8_LEGACY_PDPES];
 	};
 
 	struct intel_context *ctx;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 30/68] drm/i915: Make gen6_write_pdes gen6_map_page_tables
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (28 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 29/68] drm/i915: Un-hardcode number of page directories Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 31/68] drm/i915: Range clearing is PPGTT agnostic Ben Widawsky
                   ` (42 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Split out single mappings, which will help with upcoming work. Also,
while here, rename the function to something more descriptive - though
the function is going away soon anyway.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 39 ++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 151ec39..ca7ddb6 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -683,26 +683,33 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 	}
 }
 
-static void gen6_write_pdes(struct i915_hw_ppgtt *ppgtt)
+static void gen6_map_single(struct i915_hw_ppgtt *ppgtt,
+			    const unsigned pde_index,
+			    dma_addr_t daddr)
 {
 	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
-	gen6_gtt_pte_t __iomem *pd_addr;
 	uint32_t pd_entry;
+	gen6_gtt_pte_t __iomem *pd_addr =
+		(gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm + ppgtt->pd_offset / sizeof(gen6_gtt_pte_t);
+
+	pd_entry = GEN6_PDE_ADDR_ENCODE(daddr);
+	pd_entry |= GEN6_PDE_VALID;
+
+	writel(pd_entry, pd_addr + pde_index);
+}
+
+/* Map all the page tables found in the ppgtt structure to incrementing page
+ * directories. */
+static void gen6_map_page_tables(struct i915_hw_ppgtt *ppgtt)
+{
+	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
 	int i;
 
 	WARN_ON(ppgtt->pd_offset & 0x3f);
-	pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm +
-		ppgtt->pd_offset / sizeof(gen6_gtt_pte_t);
-	for (i = 0; i < ppgtt->num_pd_entries; i++) {
-		dma_addr_t pt_addr;
-
-		pt_addr = ppgtt->pt_dma_addr[i];
-		pd_entry = GEN6_PDE_ADDR_ENCODE(pt_addr);
-		pd_entry |= GEN6_PDE_VALID;
+	for (i = 0; i < ppgtt->num_pd_entries; i++)
+		gen6_map_single(ppgtt, i, ppgtt->pt_dma_addr[i]);
 
-		writel(pd_entry, pd_addr + i);
-	}
-	readl(pd_addr);
+	readl(dev_priv->gtt.gsm);
 }
 
 static uint32_t get_pd_offset(struct i915_hw_ppgtt *ppgtt)
@@ -1174,7 +1181,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->pd_offset =
 		ppgtt->node.start / PAGE_SIZE * sizeof(gen6_gtt_pte_t);
 
-	gen6_write_pdes(ppgtt);
+	gen6_map_page_tables(ppgtt);
 
 	ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
 
@@ -1367,11 +1374,11 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
 		/* TODO: Perhaps it shouldn't be gen6 specific */
 		if (i915_is_ggtt(vm)) {
 			if (dev_priv->mm.aliasing_ppgtt)
-				gen6_write_pdes(dev_priv->mm.aliasing_ppgtt);
+				gen6_map_page_tables(dev_priv->mm.aliasing_ppgtt);
 			continue;
 		}
 
-		gen6_write_pdes(container_of(vm, struct i915_hw_ppgtt, base));
+		gen6_map_page_tables(container_of(vm, struct i915_hw_ppgtt, base));
 	}
 
 	i915_gem_chipset_flush(dev);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 31/68] drm/i915: Range clearing is PPGTT agnostic
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (29 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 30/68] drm/i915: Make gen6_write_pdes gen6_map_page_tables Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 32/68] drm/i915: Page table helpers, and define renames Ben Widawsky
                   ` (41 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Therefore we can do it from our general init function. Eventually, I
hope to have a lot more commonality like this. It won't arrive yet, but
this was a nice easy one.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index ca7ddb6..886c9c3 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -612,8 +612,6 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.start = 0;
 	ppgtt->base.total = ppgtt->num_pd_entries * GEN8_PTES_PER_PAGE * PAGE_SIZE;
 
-	ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
-
 	DRM_DEBUG_DRIVER("Allocated %d pages for page directories (%d wasted)\n",
 			 ppgtt->num_pd_pages, ppgtt->num_pd_pages - max_pdp);
 	DRM_DEBUG_DRIVER("Allocated %d pages for page tables (%lld wasted)\n",
@@ -1183,8 +1181,6 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 
 	gen6_map_page_tables(ppgtt);
 
-	ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
-
 	DRM_DEBUG_DRIVER("Allocated pde space (%ldM) at GTT entry: %lx\n",
 			 ppgtt->node.size >> 20,
 			 ppgtt->node.start / PAGE_SIZE);
@@ -1212,6 +1208,7 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt)
 
 	kref_init(&ppgtt->ref);
 	drm_mm_init(&ppgtt->base.mm, ppgtt->base.start, ppgtt->base.total);
+	ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
 	i915_init_vm(dev_priv, &ppgtt->base);
 
 	return 0;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 32/68] drm/i915: Page table helpers, and define renames
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (30 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 31/68] drm/i915: Range clearing is PPGTT agnostic Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 33/68] drm/i915: construct page table abstractions Ben Widawsky
                   ` (40 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

These page table helpers make the code much cleaner. There is some
room to use the arch/x86 header files. The reason I've opted not to is
that in several cases the definitions are dictated by CONFIG_ options,
which do not always reflect the restrictions of the GPU. While here,
clean up the defines to have more concise names, and consolidate between
gen6 and gen8 where appropriate.

v2: Use I915_PAGE_SIZE to remove PAGE_SIZE dep in the new code (Jesse)
Fix bugged I915_PTE_MASK define, which was unused (Chris)
BUG_ON bad length/size - taking directly from Chris (Chris)
define NUM_PTE (Chris)

I've made a lot of tiny errors in these helpers. Often I'd correct an
error only to introduce another one. While IGT was capable of catching
them, the tests often took a while to catch, and were hard/slow to
debug in the kernel. As a result, to test this, I compiled
i915_gem_gtt.h in userspace, and ran tests from userspace. What follows
isn't by any means complete, but it was able to catch a lot of bugs. Gen8
is also untested, but since the current code is almost identical, I feel
pretty comfortable with that.
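
As a worked example of the count helper exercised below: with
GEN6_PDE_SHIFT == 22 each PDE covers 4MB (1024 PTEs), so
assert_pte_count(base + (1<<21), 1<<22, 512) holds because the range
[2MB, 6MB) crosses a 4MB page table boundary, and the count is clamped
at the end of the first page table: 1024 - pte_index(2MB) = 1024 - 512
= 512.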

void test_pte(uint32_t base) {
        assert_pte_index((base + 0), 0);
        assert_pte_index((base + 1), 0);
        assert_pte_index((base + 0x1000), 1);
        assert_pte_index((base + (1<<22)), 0);
        assert_pte_index((base + ((1<<22) - 1)), 1023);
        assert_pte_index((base + (1<<21)), 512);

        assert_pte_count(base + 0, 0, 0);
        assert_pte_count(base + 0, 1, 1);
        assert_pte_count(base + 0, 0x1000, 1);
        assert_pte_count(base + 0, 0x1001, 2);
        assert_pte_count(base + 0, 1<<21, 512);

        assert_pte_count(base + 0, 1<<22, 1024);
        assert_pte_count(base + 0, (1<<22) - 1, 1024);
        assert_pte_count(base + (1<<21), 1<<22, 512);
        assert_pte_count(base + (1<<21), (1<<22)+1, 512);
        assert_pte_count(base + (1<<21), 10<<22, 512);
}

void test_pde(uint32_t base) {
        assert(gen6_pde_index(base + 0) == 0);
        assert(gen6_pde_index(base + 1) == 0);
        assert(gen6_pde_index(base + (1<<21)) == 0);
        assert(gen6_pde_index(base + (1<<22)) == 1);
        assert(gen6_pde_index(base + ((256<<22))) == 256);
        assert(gen6_pde_index(base + ((512<<22))) == 0);
        /* This is actually not possible on gen6 */
        assert(gen6_pde_index(base + ((513<<22))) == 1);

        assert(gen6_pde_count(base + 0, 0) == 0);
        assert(gen6_pde_count(base + 0, 1) == 1);
        assert(gen6_pde_count(base + 0, 1<<21) == 1);
        assert(gen6_pde_count(base + 0, 1<<22) == 1);
        assert(gen6_pde_count(base + 0, (1<<22) + 0x1000) == 2);
        assert(gen6_pde_count(base + 0x1000, 1<<22) == 2);
        assert(gen6_pde_count(base + 0, 511<<22) == 511);
        assert(gen6_pde_count(base + 0, 512<<22) == 512);
        assert(gen6_pde_count(base + 0x1000, 512<<22) == 512);
        assert(gen6_pde_count(base + (1<<22), 512<<22) == 511);
}

int main()
{
        test_pde(0);
        while (1)
                test_pte(rand() & ~((1<<22) - 1));

        return 0;
}
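
(To compile this standalone, the test also needs the usual includes
(<assert.h>, <stdlib.h>, <stdint.h>), userspace stubs for BUG_ON() and
offset_in_page(), and the assert_pte_index()/assert_pte_count() helpers,
all of which were elided here; presumably the helpers just assert that
i915_pte_index()/i915_pte_count() return the expected value.)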

v3: Some small rebase conflicts resolved

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c |  89 +++++++++++++-------------
 drivers/gpu/drm/i915/i915_gem_gtt.h | 123 +++++++++++++++++++++++++++++++++---
 2 files changed, 156 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 886c9c3..5a62ef1 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -240,7 +240,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 	int i, ret;
 
 	/* bit of a hack to find the actual last used pd */
-	int used_pd = ppgtt->num_pd_entries / GEN8_PDES_PER_PAGE;
+	int used_pd = ppgtt->num_pd_entries / I915_PDES_PER_PD;
 
 	for (i = used_pd - 1; i >= 0; i--) {
 		dma_addr_t addr = ppgtt->pd_dma_addr[i];
@@ -260,9 +260,9 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	gen8_gtt_pte_t *pt_vaddr, scratch_pte;
-	unsigned pdpe = start >> GEN8_PDPE_SHIFT & GEN8_PDPE_MASK;
-	unsigned pde = start >> GEN8_PDE_SHIFT & GEN8_PDE_MASK;
-	unsigned pte = start >> GEN8_PTE_SHIFT & GEN8_PTE_MASK;
+	unsigned pdpe = gen8_pdpe_index(start);
+	unsigned pde = gen8_pde_index(start);
+	unsigned pte = gen8_pte_index(start);
 	unsigned num_entries = length >> PAGE_SHIFT;
 	unsigned last_pte, i;
 
@@ -273,8 +273,8 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 		struct page *page_table = ppgtt->gen8_pt_pages[pdpe][pde];
 
 		last_pte = pte + num_entries;
-		if (last_pte > GEN8_PTES_PER_PAGE)
-			last_pte = GEN8_PTES_PER_PAGE;
+		if (last_pte > GEN8_PTES_PER_PT)
+			last_pte = GEN8_PTES_PER_PT;
 
 		pt_vaddr = kmap_atomic(page_table);
 
@@ -288,7 +288,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 		kunmap_atomic(pt_vaddr);
 
 		pte = 0;
-		if (++pde == GEN8_PDES_PER_PAGE) {
+		if (++pde == I915_PDES_PER_PD) {
 			pdpe++;
 			pde = 0;
 		}
@@ -303,9 +303,9 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	gen8_gtt_pte_t *pt_vaddr;
-	unsigned pdpe = start >> GEN8_PDPE_SHIFT & GEN8_PDPE_MASK;
-	unsigned pde = start >> GEN8_PDE_SHIFT & GEN8_PDE_MASK;
-	unsigned pte = start >> GEN8_PTE_SHIFT & GEN8_PTE_MASK;
+	unsigned pdpe = gen8_pdpe_index(start);
+	unsigned pde = gen8_pde_index(start);
+	unsigned pte = gen8_pte_index(start);
 	struct sg_page_iter sg_iter;
 
 	pt_vaddr = NULL;
@@ -320,12 +320,12 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 		pt_vaddr[pte] =
 			gen8_pte_encode(sg_page_iter_dma_address(&sg_iter),
 					cache_level, true);
-		if (++pte == GEN8_PTES_PER_PAGE) {
+		if (++pte == GEN8_PTES_PER_PT) {
 			if (!HAS_LLC(ppgtt->base.dev))
 				drm_clflush_virt_range(pt_vaddr, PAGE_SIZE);
 			kunmap_atomic(pt_vaddr);
 			pt_vaddr = NULL;
-			if (++pde == GEN8_PDES_PER_PAGE) {
+			if (++pde == I915_PDES_PER_PD) {
 				pdpe++;
 				pde = 0;
 			}
@@ -346,7 +346,7 @@ static void gen8_free_page_tables(struct page **pt_pages)
 	if (pt_pages == NULL)
 		return;
 
-	for (i = 0; i < GEN8_PDES_PER_PAGE; i++)
+	for (i = 0; i < I915_PDES_PER_PD; i++)
 		if (pt_pages[i])
 			__free_pages(pt_pages[i], 0);
 }
@@ -378,7 +378,7 @@ static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 		pci_unmap_page(hwdev, ppgtt->pd_dma_addr[i], PAGE_SIZE,
 			       PCI_DMA_BIDIRECTIONAL);
 
-		for (j = 0; j < GEN8_PDES_PER_PAGE; j++) {
+		for (j = 0; j < I915_PDES_PER_PD; j++) {
 			dma_addr_t addr = ppgtt->gen8_pt_dma_addr[i][j];
 			if (addr)
 				pci_unmap_page(hwdev, addr, PAGE_SIZE,
@@ -404,11 +404,11 @@ static struct page **__gen8_alloc_page_tables(void)
 	struct page **pt_pages;
 	int i;
 
-	pt_pages = kcalloc(GEN8_PDES_PER_PAGE, sizeof(struct page *), GFP_KERNEL);
+	pt_pages = kcalloc(I915_PDES_PER_PD, sizeof(struct page *), GFP_KERNEL);
 	if (!pt_pages)
 		return ERR_PTR(-ENOMEM);
 
-	for (i = 0; i < GEN8_PDES_PER_PAGE; i++) {
+	for (i = 0; i < I915_PDES_PER_PD; i++) {
 		pt_pages[i] = alloc_page(GFP_KERNEL);
 		if (!pt_pages[i])
 			goto bail;
@@ -458,7 +458,7 @@ static int gen8_ppgtt_allocate_dma(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		ppgtt->gen8_pt_dma_addr[i] = kcalloc(GEN8_PDES_PER_PAGE,
+		ppgtt->gen8_pt_dma_addr[i] = kcalloc(I915_PDES_PER_PD,
 						     sizeof(dma_addr_t),
 						     GFP_KERNEL);
 		if (!ppgtt->gen8_pt_dma_addr[i])
@@ -496,7 +496,7 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 		return ret;
 	}
 
-	ppgtt->num_pd_entries = max_pdp * GEN8_PDES_PER_PAGE;
+	ppgtt->num_pd_entries = max_pdp * I915_PDES_PER_PD;
 
 	ret = gen8_ppgtt_allocate_dma(ppgtt);
 	if (ret)
@@ -557,7 +557,7 @@ static int gen8_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt,
 static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 {
 	const int max_pdp = DIV_ROUND_UP(size, 1 << 30);
-	const int min_pt_pages = GEN8_PDES_PER_PAGE * max_pdp;
+	const int min_pt_pages = I915_PDES_PER_PD * max_pdp;
 	int i, j, ret;
 
 	if (size % (1<<30))
@@ -576,7 +576,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 		if (ret)
 			goto bail;
 
-		for (j = 0; j < GEN8_PDES_PER_PAGE; j++) {
+		for (j = 0; j < I915_PDES_PER_PD; j++) {
 			ret = gen8_ppgtt_setup_page_tables(ppgtt, i, j);
 			if (ret)
 				goto bail;
@@ -594,7 +594,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	for (i = 0; i < max_pdp; i++) {
 		gen8_ppgtt_pde_t *pd_vaddr;
 		pd_vaddr = kmap_atomic(&ppgtt->pd_pages[i]);
-		for (j = 0; j < GEN8_PDES_PER_PAGE; j++) {
+		for (j = 0; j < I915_PDES_PER_PD; j++) {
 			dma_addr_t addr = ppgtt->gen8_pt_dma_addr[i][j];
 			pd_vaddr[j] = gen8_pde_encode(ppgtt->base.dev, addr,
 						      I915_CACHE_LLC);
@@ -610,7 +610,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
 	ppgtt->base.start = 0;
-	ppgtt->base.total = ppgtt->num_pd_entries * GEN8_PTES_PER_PAGE * PAGE_SIZE;
+	ppgtt->base.total = ppgtt->num_pd_entries * GEN8_PTES_PER_PT * PAGE_SIZE;
 
 	DRM_DEBUG_DRIVER("Allocated %d pages for page directories (%d wasted)\n",
 			 ppgtt->num_pd_pages, ppgtt->num_pd_pages - max_pdp);
@@ -656,9 +656,9 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 		seq_printf(m, "\tPDE: %x\n", pd_entry);
 
 		pt_vaddr = kmap_atomic(ppgtt->pt_pages[pde]);
-		for (pte = 0; pte < I915_PPGTT_PT_ENTRIES; pte+=4) {
+		for (pte = 0; pte < GEN6_PTES_PER_PT; pte+=4) {
 			unsigned long va =
-				(pde * PAGE_SIZE * I915_PPGTT_PT_ENTRIES) +
+				(pde * PAGE_SIZE * GEN6_PTES_PER_PT) +
 				(pte * PAGE_SIZE);
 			int i;
 			bool found = false;
@@ -937,29 +937,28 @@ static void gen6_ppgtt_clear_range(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	gen6_gtt_pte_t *pt_vaddr, scratch_pte;
-	unsigned first_entry = start >> PAGE_SHIFT;
+	unsigned pde = gen6_pde_index(start);
 	unsigned num_entries = length >> PAGE_SHIFT;
-	unsigned act_pt = first_entry / I915_PPGTT_PT_ENTRIES;
-	unsigned first_pte = first_entry % I915_PPGTT_PT_ENTRIES;
+	unsigned pte = gen6_pte_index(start);
 	unsigned last_pte, i;
 
 	scratch_pte = vm->pte_encode(vm->scratch.addr, I915_CACHE_LLC, true, 0);
 
 	while (num_entries) {
-		last_pte = first_pte + num_entries;
-		if (last_pte > I915_PPGTT_PT_ENTRIES)
-			last_pte = I915_PPGTT_PT_ENTRIES;
+		last_pte = pte + num_entries;
+		if (last_pte > GEN6_PTES_PER_PT)
+			last_pte = GEN6_PTES_PER_PT;
 
-		pt_vaddr = kmap_atomic(ppgtt->pt_pages[act_pt]);
+		pt_vaddr = kmap_atomic(ppgtt->pt_pages[pde]);
 
-		for (i = first_pte; i < last_pte; i++)
+		for (i = pte; i < last_pte; i++)
 			pt_vaddr[i] = scratch_pte;
 
 		kunmap_atomic(pt_vaddr);
 
-		num_entries -= last_pte - first_pte;
-		first_pte = 0;
-		act_pt++;
+		num_entries -= last_pte - pte;
+		pte = 0;
+		pde++;
 	}
 }
 
@@ -971,25 +970,23 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	gen6_gtt_pte_t *pt_vaddr;
-	unsigned first_entry = start >> PAGE_SHIFT;
-	unsigned act_pt = first_entry / I915_PPGTT_PT_ENTRIES;
-	unsigned act_pte = first_entry % I915_PPGTT_PT_ENTRIES;
+	unsigned pde = gen6_pde_index(start);
+	unsigned pte = gen6_pte_index(start);
 	struct sg_page_iter sg_iter;
 
 	pt_vaddr = NULL;
 	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
 		if (pt_vaddr == NULL)
-			pt_vaddr = kmap_atomic(ppgtt->pt_pages[act_pt]);
+			pt_vaddr = kmap_atomic(ppgtt->pt_pages[pde]);
 
-		pt_vaddr[act_pte] =
+		pt_vaddr[pte] =
 			vm->pte_encode(sg_page_iter_dma_address(&sg_iter),
 				       cache_level, true, flags);
-
-		if (++act_pte == I915_PPGTT_PT_ENTRIES) {
+		if (++pte == GEN6_PTES_PER_PT) {
 			kunmap_atomic(pt_vaddr);
 			pt_vaddr = NULL;
-			act_pt++;
-			act_pte = 0;
+			pde++;
+			pte = 0;
 		}
 	}
 	if (pt_vaddr)
@@ -1068,7 +1065,7 @@ alloc:
 	if (ppgtt->node.start < dev_priv->gtt.mappable_end)
 		DRM_DEBUG("Forced to use aperture for PDEs\n");
 
-	ppgtt->num_pd_entries = GEN6_PPGTT_PD_ENTRIES;
+	ppgtt->num_pd_entries = I915_PDES_PER_PD;
 	return 0;
 }
 
@@ -1173,7 +1170,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
 	ppgtt->base.start = 0;
-	ppgtt->base.total =  ppgtt->num_pd_entries * I915_PPGTT_PT_ENTRIES * PAGE_SIZE;
+	ppgtt->base.total =  ppgtt->num_pd_entries * GEN6_PTES_PER_PT * PAGE_SIZE;
 	ppgtt->debug_dump = gen6_dump_ppgtt;
 
 	ppgtt->pd_offset =
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 0199c5a..465549f 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -38,8 +38,16 @@ typedef uint32_t gen6_gtt_pte_t;
 typedef uint64_t gen8_gtt_pte_t;
 typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 
-#define I915_PPGTT_PT_ENTRIES		(PAGE_SIZE / sizeof(gen6_gtt_pte_t))
-/* gen6-hsw has bit 11-4 for physical addr bit 39-32 */
+/* GEN Agnostic defines */
+#define I915_PAGE_SIZE			4096
+#define I915_PDES_PER_PD		512
+#define I915_PTE_MASK			(I915_PAGE_SIZE-1)
+#define I915_PDE_MASK			(I915_PDES_PER_PD-1)
+
+/* GEN6 PPGTT resembles a 2 level page table:
+ * 31:22 | 21:12 |  11:0
+ *  PDE  |  PTE  | offset
+ */
 #define GEN6_GTT_ADDR_ENCODE(addr)	((addr) | (((addr) >> 28) & 0xff0))
 #define GEN6_PTE_ADDR_ENCODE(addr)	GEN6_GTT_ADDR_ENCODE(addr)
 #define GEN6_PDE_ADDR_ENCODE(addr)	GEN6_GTT_ADDR_ENCODE(addr)
@@ -47,13 +55,16 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 #define GEN6_PTE_UNCACHED		(1 << 1)
 #define GEN6_PTE_VALID			(1 << 0)
 
-#define GEN6_PPGTT_PD_ENTRIES		512
-#define GEN6_PD_SIZE			(GEN6_PPGTT_PD_ENTRIES * PAGE_SIZE)
+#define GEN6_PD_SIZE			(I915_PDES_PER_PD * PAGE_SIZE)
 #define GEN6_PD_ALIGN			(PAGE_SIZE * 16)
 #define GEN6_PDE_VALID			(1 << 0)
 
 #define GEN7_PTE_CACHE_L3_LLC		(3 << 1)
 
+#define GEN6_PDE_SHIFT			22
+#define GEN6_PTES_PER_PT		(PAGE_SIZE / sizeof(gen6_gtt_pte_t))
+#define NUM_PTE(pde_shift)		(1 << (pde_shift - PAGE_SHIFT))
+
 #define BYT_PTE_SNOOPED_BY_CPU_CACHES	(1 << 2)
 #define BYT_PTE_WRITEABLE		(1 << 1)
 
@@ -72,6 +83,14 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 #define HSW_GTT_ADDR_ENCODE(addr)	((addr) | (((addr) >> 28) & 0x7f0))
 #define HSW_PTE_ADDR_ENCODE(addr)	HSW_GTT_ADDR_ENCODE(addr)
 
+#define PPAT_UNCACHED_INDEX		(_PAGE_PWT | _PAGE_PCD)
+#define PPAT_CACHED_PDE_INDEX		0 /* WB LLC */
+#define PPAT_CACHED_INDEX		_PAGE_PAT /* WB LLCeLLC */
+#define PPAT_DISPLAY_ELLC_INDEX		_PAGE_PCD /* WT eLLC */
+
+#define GEN8_LEGACY_PDPES		4
+#define GEN8_PTES_PER_PT		(PAGE_SIZE / sizeof(gen8_gtt_pte_t))
+
 /* GEN8 legacy style address is defined as a 3 level page table:
  * 31:30 | 29:21 | 20:12 |  11:0
  * PDPE  |  PDE  |  PTE  | offset
@@ -81,12 +100,6 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 #define GEN8_PDPE_SHIFT			30
 #define GEN8_PDPE_MASK			0x3
 #define GEN8_PDE_SHIFT			21
-#define GEN8_PDE_MASK			0x1ff
-#define GEN8_PTE_SHIFT			12
-#define GEN8_PTE_MASK			0x1ff
-#define GEN8_LEGACY_PDPES		4
-#define GEN8_PTES_PER_PAGE		(PAGE_SIZE / sizeof(gen8_gtt_pte_t))
-#define GEN8_PDES_PER_PAGE		(PAGE_SIZE / sizeof(gen8_ppgtt_pde_t))
 
 #define PPAT_UNCACHED_INDEX		(_PAGE_PWT | _PAGE_PCD)
 #define PPAT_CACHED_PDE_INDEX		0 /* WB LLC */
@@ -273,6 +286,96 @@ struct i915_hw_ppgtt {
 	void (*debug_dump)(struct i915_hw_ppgtt *ppgtt, struct seq_file *m);
 };
 
+static inline uint32_t i915_pte_index(uint64_t address, uint32_t pde_shift)
+{
+	const uint32_t mask = NUM_PTE(pde_shift) - 1;
+	return (address >> PAGE_SHIFT) & mask;
+}
+
+/* Helper to count the number of PTEs within the given length. This count does
+ * not cross a page table boundary, so the max value would be
+ * GEN6_PTES_PER_PT for GEN6, and GEN8_PTES_PER_PT for GEN8.
+ */
+static inline size_t i915_pte_count(uint64_t addr, size_t length,
+				    uint32_t pde_shift)
+{
+	const uint64_t mask = ~((1 << pde_shift) - 1);
+	uint64_t end;
+
+	BUG_ON(length == 0);
+	BUG_ON(offset_in_page(addr|length));
+
+	end = addr + length;
+
+	if ((addr & mask) != (end & mask))
+		return NUM_PTE(pde_shift) - i915_pte_index(addr, pde_shift);
+
+	return i915_pte_index(end, pde_shift) - i915_pte_index(addr, pde_shift);
+}
+
+static inline uint32_t i915_pde_index(uint64_t addr, uint32_t shift)
+{
+	return (addr >> shift) & I915_PDE_MASK;
+}
+
+static inline size_t i915_pde_count(uint64_t addr, uint64_t length,
+				    uint32_t pde_shift)
+{
+	const uint32_t pdp_shift = pde_shift + 9;
+	const uint64_t mask = ~((1 << pdp_shift) - 1);
+	uint64_t end;
+
+	BUG_ON(length == 0);
+	BUG_ON(offset_in_page(addr|length));
+
+	end = addr + length;
+
+	if ((addr & mask) != (end & mask))
+		return I915_PDES_PER_PD - i915_pde_index(addr, pde_shift);
+
+	return i915_pde_index(end, pde_shift) - i915_pde_index(addr, pde_shift);
+}
+
+static inline uint32_t gen6_pte_index(uint32_t addr)
+{
+	return i915_pte_index(addr, GEN6_PDE_SHIFT);
+}
+
+static inline size_t gen6_pte_count(uint32_t addr, uint32_t length)
+{
+	return i915_pte_count(addr, length, GEN6_PDE_SHIFT);
+}
+
+static inline uint32_t gen6_pde_index(uint32_t addr)
+{
+	return i915_pde_index(addr, GEN6_PDE_SHIFT);
+}
+
+static inline size_t gen6_pde_count(uint32_t addr, uint32_t length)
+{
+	return i915_pde_count(addr, length, GEN6_PDE_SHIFT);
+}
+
+static inline uint32_t gen8_pte_index(uint64_t address)
+{
+	return i915_pte_index(address, GEN8_PDE_SHIFT);
+}
+
+static inline uint32_t gen8_pde_index(uint64_t address)
+{
+	return i915_pde_index(address, GEN8_PDE_SHIFT);
+}
+
+static inline uint32_t gen8_pdpe_index(uint64_t address)
+{
+	return (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
+}
+
+static inline uint32_t gen8_pml4e_index(uint64_t address)
+{
+	BUG();
+}
+
 int i915_gem_gtt_init(struct drm_device *dev);
 void i915_gem_init_global_gtt(struct drm_device *dev);
 void i915_gem_setup_global_gtt(struct drm_device *dev, unsigned long start,
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 33/68] drm/i915: construct page table abstractions
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (31 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 32/68] drm/i915: Page table helpers, and define renames Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 34/68] drm/i915: Complete page table structures Ben Widawsky
                   ` (39 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Thus far we've opted for complex code that requires difficult review. In
the future, the code is only going to become more complex, and as such
we'll take the hit now and start to encapsulate things.

To help the code transition nicely, there is some wasted space in gen6/7.
This will be ameliorated shortly.

NOTE: The pun in the subject was intentional.
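
As a rough sketch, a GPU address now walks down the new structures to
its backing page like this (hypothetical helper for illustration; the
field names match the i915_gem_gtt.h hunk below, and the gen8_*_index()
helpers come from the earlier helper patch):

static struct page *ex_gen8_lookup(struct i915_hw_ppgtt *ppgtt,
                                   uint64_t addr)
{
        struct i915_pagedir *pd = &ppgtt->pdp.pagedir[gen8_pdpe_index(addr)];
        struct i915_pagetab *pt = &pd->page_tables[gen8_pde_index(addr)];

        return pt->page;
}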

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Conflicts:
	drivers/gpu/drm/i915/i915_drv.h
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 174 ++++++++++++++++++------------------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  23 +++--
 2 files changed, 104 insertions(+), 93 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 5a62ef1..b832d01 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -270,7 +270,8 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 				      I915_CACHE_LLC, use_scratch);
 
 	while (num_entries) {
-		struct page *page_table = ppgtt->gen8_pt_pages[pdpe][pde];
+		struct i915_pagedir *pd = &ppgtt->pdp.pagedir[pdpe];
+		struct page *page_table = pd->page_tables[pde].page;
 
 		last_pte = pte + num_entries;
 		if (last_pte > GEN8_PTES_PER_PT)
@@ -314,8 +315,11 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 		if (WARN_ON(pdpe >= GEN8_LEGACY_PDPES))
 			break;
 
-		if (pt_vaddr == NULL)
-			pt_vaddr = kmap_atomic(ppgtt->gen8_pt_pages[pdpe][pde]);
+		if (pt_vaddr == NULL) {
+			struct i915_pagedir *pd = &ppgtt->pdp.pagedir[pdpe];
+			struct page *page_table = pd->page_tables[pde].page;
+			pt_vaddr = kmap_atomic(page_table);
+		}
 
 		pt_vaddr[pte] =
 			gen8_pte_encode(sg_page_iter_dma_address(&sg_iter),
@@ -339,29 +343,33 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	}
 }
 
-static void gen8_free_page_tables(struct page **pt_pages)
+static void gen8_free_page_tables(struct i915_pagedir *pd)
 {
 	int i;
 
-	if (pt_pages == NULL)
+	if (pd->page_tables == NULL)
 		return;
 
 	for (i = 0; i < I915_PDES_PER_PD; i++)
-		if (pt_pages[i])
-			__free_pages(pt_pages[i], 0);
+		if (pd->page_tables[i].page)
+			__free_page(pd->page_tables[i].page);
+}
+
+static void gen8_free_page_directories(struct i915_pagedir *pd)
+{
+	kfree(pd->page_tables);
+	__free_page(pd->page);
 }
 
-static void gen8_ppgtt_free(const struct i915_hw_ppgtt *ppgtt)
+static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		gen8_free_page_tables(ppgtt->gen8_pt_pages[i]);
-		kfree(ppgtt->gen8_pt_pages[i]);
+		gen8_free_page_tables(&ppgtt->pdp.pagedir[i]);
+		gen8_free_page_directories(&ppgtt->pdp.pagedir[i]);
 		kfree(ppgtt->gen8_pt_dma_addr[i]);
 	}
-
-	__free_pages(ppgtt->pd_pages, get_order(ppgtt->num_pd_pages << PAGE_SHIFT));
 }
 
 static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
@@ -399,86 +407,73 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 	gen8_ppgtt_free(ppgtt);
 }
 
-static struct page **__gen8_alloc_page_tables(void)
+static int gen8_ppgtt_allocate_dma(struct i915_hw_ppgtt *ppgtt)
 {
-	struct page **pt_pages;
 	int i;
 
-	pt_pages = kcalloc(I915_PDES_PER_PD, sizeof(struct page *), GFP_KERNEL);
-	if (!pt_pages)
-		return ERR_PTR(-ENOMEM);
-
-	for (i = 0; i < I915_PDES_PER_PD; i++) {
-		pt_pages[i] = alloc_page(GFP_KERNEL);
-		if (!pt_pages[i])
-			goto bail;
+	for (i = 0; i < ppgtt->num_pd_pages; i++) {
+		ppgtt->gen8_pt_dma_addr[i] = kcalloc(I915_PDES_PER_PD,
+						     sizeof(dma_addr_t),
+						     GFP_KERNEL);
+		if (!ppgtt->gen8_pt_dma_addr[i])
+			return -ENOMEM;
 	}
 
-	return pt_pages;
-
-bail:
-	gen8_free_page_tables(pt_pages);
-	kfree(pt_pages);
-	return ERR_PTR(-ENOMEM);
+	return 0;
 }
 
-static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt,
-					   const int max_pdp)
+static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 {
-	struct page **pt_pages[GEN8_LEGACY_PDPES];
-	int i, ret;
+	int i, j;
 
-	for (i = 0; i < max_pdp; i++) {
-		pt_pages[i] = __gen8_alloc_page_tables();
-		if (IS_ERR(pt_pages[i])) {
-			ret = PTR_ERR(pt_pages[i]);
-			goto unwind_out;
+	for (i = 0; i < ppgtt->num_pd_pages; i++) {
+		for (j = 0; j < I915_PDES_PER_PD; j++) {
+			struct i915_pagetab *pt = &ppgtt->pdp.pagedir[i].page_tables[j];
+			pt->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!pt->page)
+				goto unwind_out;
 		}
 	}
 
-	/* NB: Avoid touching gen8_pt_pages until last to keep the allocation,
-	 * "atomic" - for cleanup purposes.
-	 */
-	for (i = 0; i < max_pdp; i++)
-		ppgtt->gen8_pt_pages[i] = pt_pages[i];
-
 	return 0;
 
 unwind_out:
-	while (i--) {
-		gen8_free_page_tables(pt_pages[i]);
-		kfree(pt_pages[i]);
-	}
+	while (i--)
+		gen8_free_page_tables(&ppgtt->pdp.pagedir[i]);
 
-	return ret;
+	return -ENOMEM;
 }
 
-static int gen8_ppgtt_allocate_dma(struct i915_hw_ppgtt *ppgtt)
+static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
+						const int max_pdp)
 {
 	int i;
 
-	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		ppgtt->gen8_pt_dma_addr[i] = kcalloc(I915_PDES_PER_PD,
-						     sizeof(dma_addr_t),
-						     GFP_KERNEL);
-		if (!ppgtt->gen8_pt_dma_addr[i])
-			return -ENOMEM;
-	}
+	for (i = 0; i < max_pdp; i++) {
+		struct i915_pagetab *pt;
+		pt = kcalloc(I915_PDES_PER_PD, sizeof(*pt), GFP_KERNEL);
+		if (!pt)
+			goto unwind_out;
 
-	return 0;
-}
+		ppgtt->pdp.pagedir[i].page = alloc_page(GFP_KERNEL);
+		if (!ppgtt->pdp.pagedir[i].page)
+			goto unwind_out;
 
-static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
-						const int max_pdp)
-{
-	ppgtt->pd_pages = alloc_pages(GFP_KERNEL, get_order(max_pdp << PAGE_SHIFT));
-	if (!ppgtt->pd_pages)
-		return -ENOMEM;
+		ppgtt->pdp.pagedir[i].page_tables = pt;
+	}
 
-	ppgtt->num_pd_pages = 1 << get_order(max_pdp << PAGE_SHIFT);
+	ppgtt->num_pd_pages = max_pdp;
 	BUG_ON(ppgtt->num_pd_pages > GEN8_LEGACY_PDPES);
 
 	return 0;
+
+unwind_out:
+	while (i--) {
+		kfree(ppgtt->pdp.pagedir[i].page_tables);
+		__free_page(ppgtt->pdp.pagedir[i].page);
+	}
+
+	return -ENOMEM;
 }
 
 static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
@@ -490,18 +485,19 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 	if (ret)
 		return ret;
 
-	ret = gen8_ppgtt_allocate_page_tables(ppgtt, max_pdp);
-	if (ret) {
-		__free_pages(ppgtt->pd_pages, get_order(max_pdp << PAGE_SHIFT));
-		return ret;
-	}
+	ret = gen8_ppgtt_allocate_page_tables(ppgtt);
+	if (ret)
+		goto err_out;
 
 	ppgtt->num_pd_entries = max_pdp * I915_PDES_PER_PD;
 
 	ret = gen8_ppgtt_allocate_dma(ppgtt);
-	if (ret)
-		gen8_ppgtt_free(ppgtt);
+	if (!ret)
+		return ret;
 
+	/* TODO: Check this for all cases */
+err_out:
+	gen8_ppgtt_free(ppgtt);
 	return ret;
 }
 
@@ -512,7 +508,7 @@ static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
 	int ret;
 
 	pd_addr = pci_map_page(ppgtt->base.dev->pdev,
-			       &ppgtt->pd_pages[pdpe], 0,
+			       ppgtt->pdp.pagedir[pdpe].page, 0,
 			       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 
 	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pd_addr);
@@ -532,7 +528,7 @@ static int gen8_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt,
 	struct page *p;
 	int ret;
 
-	p = ppgtt->gen8_pt_pages[pdpe][pde];
+	p = ppgtt->pdp.pagedir[pdpe].page_tables[pde].page;
 	pt_addr = pci_map_page(ppgtt->base.dev->pdev,
 			       p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pt_addr);
@@ -593,7 +589,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	 */
 	for (i = 0; i < max_pdp; i++) {
 		gen8_ppgtt_pde_t *pd_vaddr;
-		pd_vaddr = kmap_atomic(&ppgtt->pd_pages[i]);
+		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedir[i].page);
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
 			dma_addr_t addr = ppgtt->gen8_pt_dma_addr[i][j];
 			pd_vaddr[j] = gen8_pde_encode(ppgtt->base.dev, addr,
@@ -655,7 +651,7 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 				   expected);
 		seq_printf(m, "\tPDE: %x\n", pd_entry);
 
-		pt_vaddr = kmap_atomic(ppgtt->pt_pages[pde]);
+		pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde].page);
 		for (pte = 0; pte < GEN6_PTES_PER_PT; pte+=4) {
 			unsigned long va =
 				(pde * PAGE_SIZE * GEN6_PTES_PER_PT) +
@@ -949,7 +945,7 @@ static void gen6_ppgtt_clear_range(struct i915_address_space *vm,
 		if (last_pte > GEN6_PTES_PER_PT)
 			last_pte = GEN6_PTES_PER_PT;
 
-		pt_vaddr = kmap_atomic(ppgtt->pt_pages[pde]);
+		pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde].page);
 
 		for (i = pte; i < last_pte; i++)
 			pt_vaddr[i] = scratch_pte;
@@ -977,7 +973,7 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 	pt_vaddr = NULL;
 	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
 		if (pt_vaddr == NULL)
-			pt_vaddr = kmap_atomic(ppgtt->pt_pages[pde]);
+			pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde].page);
 
 		pt_vaddr[pte] =
 			vm->pte_encode(sg_page_iter_dma_address(&sg_iter),
@@ -1011,8 +1007,8 @@ static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 
 	kfree(ppgtt->pt_dma_addr);
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		__free_page(ppgtt->pt_pages[i]);
-	kfree(ppgtt->pt_pages);
+		__free_page(ppgtt->pd.page_tables[i].page);
+	kfree(ppgtt->pd.page_tables);
 }
 
 static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
@@ -1071,22 +1067,22 @@ alloc:
 
 static int gen6_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 {
+	struct i915_pagetab *pt;
 	int i;
 
-	ppgtt->pt_pages = kcalloc(ppgtt->num_pd_entries, sizeof(struct page *),
-				  GFP_KERNEL);
-
-	if (!ppgtt->pt_pages)
+	pt = kcalloc(ppgtt->num_pd_entries, sizeof(*pt), GFP_KERNEL);
+	if (!pt)
 		return -ENOMEM;
 
 	for (i = 0; i < ppgtt->num_pd_entries; i++) {
-		ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
-		if (!ppgtt->pt_pages[i]) {
+		pt[i].page = alloc_page(GFP_KERNEL);
+		if (!pt->page) {
 			gen6_ppgtt_free(ppgtt);
 			return -ENOMEM;
 		}
 	}
 
+	ppgtt->pd.page_tables = pt;
 	return 0;
 }
 
@@ -1121,9 +1117,11 @@ static int gen6_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_entries; i++) {
+		struct page *page;
 		dma_addr_t pt_addr;
 
-		pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
+		page = ppgtt->pd.page_tables[i].page;
+		pt_addr = pci_map_page(dev->pdev, page, 0, 4096,
 				       PCI_DMA_BIDIRECTIONAL);
 
 		if (pci_dma_mapping_error(dev->pdev, pt_addr)) {
@@ -1170,7 +1168,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
 	ppgtt->base.start = 0;
-	ppgtt->base.total =  ppgtt->num_pd_entries * GEN6_PTES_PER_PT * PAGE_SIZE;
+	ppgtt->base.total = ppgtt->num_pd_entries * GEN6_PTES_PER_PT * PAGE_SIZE;
 	ppgtt->debug_dump = gen6_dump_ppgtt;
 
 	ppgtt->pd_offset =
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 465549f..ee1c317 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -257,6 +257,20 @@ struct i915_gtt {
 			  unsigned long *mappable_end);
 };
 
+struct i915_pagetab {
+	struct page *page;
+};
+
+struct i915_pagedir {
+	struct page *page; /* NULL for GEN6-GEN7 */
+	struct i915_pagetab *page_tables;
+};
+
+struct i915_pagedirpo {
+	/* struct page *page; */
+	struct i915_pagedir pagedir[GEN8_LEGACY_PDPES];
+};
+
 struct i915_hw_ppgtt {
 	struct i915_address_space base;
 	struct kref ref;
@@ -264,11 +278,6 @@ struct i915_hw_ppgtt {
 	unsigned num_pd_entries;
 	unsigned num_pd_pages; /* gen8+ */
 	union {
-		struct page **pt_pages;
-		struct page **gen8_pt_pages[GEN8_LEGACY_PDPES];
-	};
-	struct page *pd_pages;
-	union {
 		uint32_t pd_offset;
 		dma_addr_t pd_dma_addr[GEN8_LEGACY_PDPES];
 	};
@@ -276,6 +285,10 @@ struct i915_hw_ppgtt {
 		dma_addr_t *pt_dma_addr;
 		dma_addr_t *gen8_pt_dma_addr[GEN8_LEGACY_PDPES];
 	};
+	union {
+		struct i915_pagedirpo pdp;
+		struct i915_pagedir pd;
+	};
 
 	struct intel_context *ctx;
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 34/68] drm/i915: Complete page table structures
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (32 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 33/68] drm/i915: construct page table abstractions Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 35/68] drm/i915: Create page table allocators Ben Widawsky
                   ` (38 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Move the remaining members over to the new page table structures.

This can be squashed with the previous commit if desired. The reasoning
is the same as for that patch; I simply felt it is easier to review if split.
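
(For reference, after this patch GEN6/7 reach their page directory via
ppgtt->pd.pd_offset, an offset into the GGTT, while GEN8 uses
ppgtt->pdp.pagedir[i].daddr, the DMA address of each page directory
page; see the union added to i915_pagedir in the i915_gem_gtt.h hunk
below.)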

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Conflicts:
	drivers/gpu/drm/i915/i915_drv.h
	drivers/gpu/drm/i915/i915_gem_gtt.c
---
 drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c | 85 +++++++++++++------------------------
 drivers/gpu/drm/i915/i915_gem_gtt.h | 14 +++---
 3 files changed, 37 insertions(+), 64 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index d2977cf..ccecf00 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1813,7 +1813,7 @@ static int i915_swizzle_info(struct seq_file *m, void *data)
 
 static void print_ppgtt(struct seq_file *m, struct i915_hw_ppgtt *ppgtt)
 {
-	seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd_offset);
+	seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd.pd_offset);
 }
 
 static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev, int verbose)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index b832d01..c9d3d3c 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -243,7 +243,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 	int used_pd = ppgtt->num_pd_entries / I915_PDES_PER_PD;
 
 	for (i = used_pd - 1; i >= 0; i--) {
-		dma_addr_t addr = ppgtt->pd_dma_addr[i];
+		dma_addr_t addr = ppgtt->pdp.pagedir[i].daddr;
 		ret = gen8_write_pdp(ring, i, addr, synchronous);
 		if (ret)
 			return ret;
@@ -368,7 +368,6 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
 		gen8_free_page_tables(&ppgtt->pdp.pagedir[i]);
 		gen8_free_page_directories(&ppgtt->pdp.pagedir[i]);
-		kfree(ppgtt->gen8_pt_dma_addr[i]);
 	}
 }
 
@@ -380,14 +379,14 @@ static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
 		/* TODO: In the future we'll support sparse mappings, so this
 		 * will have to change. */
-		if (!ppgtt->pd_dma_addr[i])
+		if (!ppgtt->pdp.pagedir[i].daddr)
 			continue;
 
-		pci_unmap_page(hwdev, ppgtt->pd_dma_addr[i], PAGE_SIZE,
+		pci_unmap_page(hwdev, ppgtt->pdp.pagedir[i].daddr, PAGE_SIZE,
 			       PCI_DMA_BIDIRECTIONAL);
 
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			dma_addr_t addr = ppgtt->gen8_pt_dma_addr[i][j];
+			dma_addr_t addr = ppgtt->pdp.pagedir[i].page_tables[j].daddr;
 			if (addr)
 				pci_unmap_page(hwdev, addr, PAGE_SIZE,
 					       PCI_DMA_BIDIRECTIONAL);
@@ -407,31 +406,18 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 	gen8_ppgtt_free(ppgtt);
 }
 
-static int gen8_ppgtt_allocate_dma(struct i915_hw_ppgtt *ppgtt)
-{
-	int i;
-
-	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		ppgtt->gen8_pt_dma_addr[i] = kcalloc(I915_PDES_PER_PD,
-						     sizeof(dma_addr_t),
-						     GFP_KERNEL);
-		if (!ppgtt->gen8_pt_dma_addr[i])
-			return -ENOMEM;
-	}
-
-	return 0;
-}
-
 static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 {
 	int i, j;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
+		struct i915_pagedir *pd = &ppgtt->pdp.pagedir[i];
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			struct i915_pagetab *pt = &ppgtt->pdp.pagedir[i].page_tables[j];
+			struct i915_pagetab *pt = &pd->page_tables[j];
 			pt->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 			if (!pt->page)
 				goto unwind_out;
+
 		}
 	}
 
@@ -491,9 +477,7 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 
 	ppgtt->num_pd_entries = max_pdp * I915_PDES_PER_PD;
 
-	ret = gen8_ppgtt_allocate_dma(ppgtt);
-	if (!ret)
-		return ret;
+	return 0;
 
 	/* TODO: Check this for all cases */
 err_out:
@@ -515,7 +499,7 @@ static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
 	if (ret)
 		return ret;
 
-	ppgtt->pd_dma_addr[pdpe] = pd_addr;
+	ppgtt->pdp.pagedir[pdpe].daddr = pd_addr;
 
 	return 0;
 }
@@ -525,17 +509,18 @@ static int gen8_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt,
 					const int pde)
 {
 	dma_addr_t pt_addr;
-	struct page *p;
+	struct i915_pagedir *pd = &ppgtt->pdp.pagedir[pdpe];
+	struct i915_pagetab *pt = &pd->page_tables[pde];
+	struct page *p = pt->page;
 	int ret;
 
-	p = ppgtt->pdp.pagedir[pdpe].page_tables[pde].page;
 	pt_addr = pci_map_page(ppgtt->base.dev->pdev,
 			       p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pt_addr);
 	if (ret)
 		return ret;
 
-	ppgtt->gen8_pt_dma_addr[pdpe][pde] = pt_addr;
+	pt->daddr = pt_addr;
 
 	return 0;
 }
@@ -591,7 +576,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 		gen8_ppgtt_pde_t *pd_vaddr;
 		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedir[i].page);
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			dma_addr_t addr = ppgtt->gen8_pt_dma_addr[i][j];
+			dma_addr_t addr = ppgtt->pdp.pagedir[i].page_tables[j].daddr;
 			pd_vaddr[j] = gen8_pde_encode(ppgtt->base.dev, addr,
 						      I915_CACHE_LLC);
 		}
@@ -633,14 +618,15 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 	scratch_pte = vm->pte_encode(vm->scratch.addr, I915_CACHE_LLC, true, 0);
 
 	pd_addr = (gen6_gtt_pte_t __iomem *)dev_priv->gtt.gsm +
-		ppgtt->pd_offset / sizeof(gen6_gtt_pte_t);
+		ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
 
 	seq_printf(m, "  VM %p (pd_offset %x-%x):\n", vm,
-		   ppgtt->pd_offset, ppgtt->pd_offset + ppgtt->num_pd_entries);
+		   ppgtt->pd.pd_offset,
+		   ppgtt->pd.pd_offset + ppgtt->num_pd_entries);
 	for (pde = 0; pde < ppgtt->num_pd_entries; pde++) {
 		u32 expected;
 		gen6_gtt_pte_t *pt_vaddr;
-		dma_addr_t pt_addr = ppgtt->pt_dma_addr[pde];
+		dma_addr_t pt_addr = ppgtt->pd.page_tables[pde].daddr;
 		pd_entry = readl(pd_addr + pde);
 		expected = (GEN6_PDE_ADDR_ENCODE(pt_addr) | GEN6_PDE_VALID);
 
@@ -683,8 +669,8 @@ static void gen6_map_single(struct i915_hw_ppgtt *ppgtt,
 {
 	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
 	uint32_t pd_entry;
-	gen6_gtt_pte_t __iomem *pd_addr =
-		(gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm + ppgtt->pd_offset / sizeof(gen6_gtt_pte_t);
+	gen6_gtt_pte_t __iomem *pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm;
+	pd_addr	+= ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
 
 	pd_entry = GEN6_PDE_ADDR_ENCODE(daddr);
 	pd_entry |= GEN6_PDE_VALID;
@@ -699,18 +685,18 @@ static void gen6_map_page_tables(struct i915_hw_ppgtt *ppgtt)
 	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
 	int i;
 
-	WARN_ON(ppgtt->pd_offset & 0x3f);
+	WARN_ON(ppgtt->pd.pd_offset & 0x3f);
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		gen6_map_single(ppgtt, i, ppgtt->pt_dma_addr[i]);
+		gen6_map_single(ppgtt, i, ppgtt->pd.page_tables[i].daddr);
 
 	readl(dev_priv->gtt.gsm);
 }
 
 static uint32_t get_pd_offset(struct i915_hw_ppgtt *ppgtt)
 {
-	BUG_ON(ppgtt->pd_offset & 0x3f);
+	BUG_ON(ppgtt->pd.pd_offset & 0x3f);
 
-	return (ppgtt->pd_offset / 64) << 16;
+	return (ppgtt->pd.pd_offset / 64) << 16;
 }
 
 static int hsw_mm_switch(struct i915_hw_ppgtt *ppgtt,
@@ -993,19 +979,16 @@ static void gen6_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
 
-	if (ppgtt->pt_dma_addr) {
-		for (i = 0; i < ppgtt->num_pd_entries; i++)
-			pci_unmap_page(ppgtt->base.dev->pdev,
-				       ppgtt->pt_dma_addr[i],
-				       4096, PCI_DMA_BIDIRECTIONAL);
-	}
+	for (i = 0; i < ppgtt->num_pd_entries; i++)
+		pci_unmap_page(ppgtt->base.dev->pdev,
+			       ppgtt->pd.page_tables[i].daddr,
+			       4096, PCI_DMA_BIDIRECTIONAL);
 }
 
 static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
 
-	kfree(ppgtt->pt_dma_addr);
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
 		__free_page(ppgtt->pd.page_tables[i].page);
 	kfree(ppgtt->pd.page_tables);
@@ -1100,14 +1083,6 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 		return ret;
 	}
 
-	ppgtt->pt_dma_addr = kcalloc(ppgtt->num_pd_entries, sizeof(dma_addr_t),
-				     GFP_KERNEL);
-	if (!ppgtt->pt_dma_addr) {
-		drm_mm_remove_node(&ppgtt->node);
-		gen6_ppgtt_free(ppgtt);
-		return -ENOMEM;
-	}
-
 	return 0;
 }
 
@@ -1129,7 +1104,7 @@ static int gen6_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt)
 			return -EIO;
 		}
 
-		ppgtt->pt_dma_addr[i] = pt_addr;
+		ppgtt->pd.page_tables[i].daddr = pt_addr;
 	}
 
 	return 0;
@@ -1171,7 +1146,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->base.total = ppgtt->num_pd_entries * GEN6_PTES_PER_PT * PAGE_SIZE;
 	ppgtt->debug_dump = gen6_dump_ppgtt;
 
-	ppgtt->pd_offset =
+	ppgtt->pd.pd_offset =
 		ppgtt->node.start / PAGE_SIZE * sizeof(gen6_gtt_pte_t);
 
 	gen6_map_page_tables(ppgtt);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index ee1c317..2e776ee 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -259,10 +259,16 @@ struct i915_gtt {
 
 struct i915_pagetab {
 	struct page *page;
+	dma_addr_t daddr;
 };
 
 struct i915_pagedir {
 	struct page *page; /* NULL for GEN6-GEN7 */
+	union {
+		uint32_t pd_offset;
+		dma_addr_t daddr;
+	};
+
 	struct i915_pagetab *page_tables;
 };
 
@@ -278,14 +284,6 @@ struct i915_hw_ppgtt {
 	unsigned num_pd_entries;
 	unsigned num_pd_pages; /* gen8+ */
 	union {
-		uint32_t pd_offset;
-		dma_addr_t pd_dma_addr[GEN8_LEGACY_PDPES];
-	};
-	union {
-		dma_addr_t *pt_dma_addr;
-		dma_addr_t *gen8_pt_dma_addr[GEN8_LEGACY_PDPES];
-	};
-	union {
 		struct i915_pagedirpo pdp;
 		struct i915_pagedir pd;
 	};
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 35/68] drm/i915: Create page table allocators
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (33 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 34/68] drm/i915: Complete page table structures Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:11 ` [PATCH 36/68] drm/i915: Generalize GEN6 mapping Ben Widawsky
                   ` (37 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

As we move toward dynamic page table allocation, it becomes much easier
to manage our data structures if we do things less coarsely, breaking up
all of our actions into individual tasks.  This makes the code easier to
write, read, and verify.

Aside from the dissection of the allocation functions, the patch
statically allocates the page table structures within a page directory.
This remains the same for all platforms.

The patch itself should not make much functional difference. The primary
noticeable difference is the fact that the array of page tables is no
longer allocated separately, but rather statically declared as part of
the page directory. This has non-zero overhead and adds non-trivial
complexity as a result.

This patch exists for a few reasons:
1. Splitting out the functions allows easily combining GEN6 and GEN8
code. Page tables are no different on GEN8. As we'll see in a future
patch when we add the DMA mappings to the allocations, it requires only
one small change to make this work, and error handling should just fall
into place.

2. Unless we always want to allocate all page tables under a given PDE,
we'll have to eventually break this up into an array of pointers (or
pointer to pointer).

3. Having the discrete functions is easier to review, and understand.
All allocations and frees now take place in just a couple of locations.
Reviewing, and catching leaks should be easy.

4. Less important: the GFP flags are confined to one location, which
makes playing around with such things trivial.

v2: Updated commit message to explain why this patch exists

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
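A sketch of the intended calling pattern (kept below the cut so it stays
out of git; example_fill_pd() is a made-up name for illustration):

static int example_fill_pd(struct i915_pagedir *pd)
{
	int i, ret;

	/* Allocate all 512 page tables for this page directory. On
	 * failure, alloc_pt_range() has already freed whatever it
	 * managed to allocate, so there is nothing to unwind here. */
	ret = alloc_pt_range(pd, 0, I915_PDES_PER_PD);
	if (ret)
		return ret;

	/* ... use pd->page_tables[i]->page ... */

	for (i = 0; i < I915_PDES_PER_PD; i++) {
		free_pt_single(pd->page_tables[i]);
		pd->page_tables[i] = NULL;
	}

	return 0;
}
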
 drivers/gpu/drm/i915/i915_gem_gtt.c | 226 +++++++++++++++++++++++-------------
 drivers/gpu/drm/i915/i915_gem_gtt.h |   4 +-
 2 files changed, 147 insertions(+), 83 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index c9d3d3c..532e560 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -203,6 +203,102 @@ static gen6_gtt_pte_t iris_pte_encode(dma_addr_t addr,
 	return pte;
 }
 
+static void free_pt_single(struct i915_pagetab *pt)
+{
+	if (WARN_ON(!pt->page))
+		return;
+	__free_page(pt->page);
+	kfree(pt);
+}
+
+static struct i915_pagetab *alloc_pt_single(void)
+{
+	struct i915_pagetab *pt;
+
+	pt = kzalloc(sizeof(*pt), GFP_KERNEL);
+	if (!pt)
+		return ERR_PTR(-ENOMEM);
+
+	pt->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!pt->page) {
+		kfree(pt);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return pt;
+}
+
+/**
+ * alloc_pt_range() - Allocate multiple page tables
+ * @pd:		The page directory which will have at least @count entries
+ *		available to point to the allocated page tables.
+ * @pde:	First page directory entry for which we are allocating.
+ * @count:	Number of pages to allocate.
+ *
+ * Allocates multiple page table pages and sets the appropriate entries in the
+ * page table structure within the page directory. Function cleans up after
+ * itself on any failures.
+ *
+ * Return: 0 if allocation succeeded.
+ */
+static int alloc_pt_range(struct i915_pagedir *pd, uint16_t pde, size_t count)
+{
+	int i, ret;
+
+	/* 512 is the max page tables per pagedir on any platform.
+	 * TODO: make WARN after patch series is done
+	 */
+	BUG_ON(pde + count > I915_PDES_PER_PD);
+
+	for (i = pde; i < pde + count; i++) {
+		struct i915_pagetab *pt = alloc_pt_single();
+		if (IS_ERR(pt)) {
+			ret = PTR_ERR(pt);
+			goto err_out;
+		}
+		WARN(pd->page_tables[i],
+		     "Leaking page directory entry %d (%p)\n",
+		     i, pd->page_tables[i]);
+		pd->page_tables[i] = pt;
+	}
+
+	return 0;
+
+err_out:
+	while (i--)
+		free_pt_single(pd->page_tables[i]);
+	return ret;
+}
+
+static void __free_pd_single(struct i915_pagedir *pd)
+{
+	__free_page(pd->page);
+	kfree(pd);
+}
+
+#define free_pd_single(pd) do { \
+	if ((pd)->page) { \
+		__free_pd_single(pd); \
+	} \
+} while (0)
+
+static struct i915_pagedir *alloc_pd_single(void)
+{
+	struct i915_pagedir *pd;
+
+	pd = kzalloc(sizeof(*pd), GFP_KERNEL);
+	if (!pd)
+		return ERR_PTR(-ENOMEM);
+
+	pd->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!pd->page) {
+		kfree(pd);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return pd;
+}
+
 /* Broadwell Page Directory Pointer Descriptors */
 static int gen8_write_pdp(struct intel_engine_cs *ring, unsigned entry,
 			   uint64_t val, bool synchronous)
@@ -243,7 +339,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 	int used_pd = ppgtt->num_pd_entries / I915_PDES_PER_PD;
 
 	for (i = used_pd - 1; i >= 0; i--) {
-		dma_addr_t addr = ppgtt->pdp.pagedir[i].daddr;
+		dma_addr_t addr = ppgtt->pdp.pagedir[i]->daddr;
 		ret = gen8_write_pdp(ring, i, addr, synchronous);
 		if (ret)
 			return ret;
@@ -270,8 +366,9 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 				      I915_CACHE_LLC, use_scratch);
 
 	while (num_entries) {
-		struct i915_pagedir *pd = &ppgtt->pdp.pagedir[pdpe];
-		struct page *page_table = pd->page_tables[pde].page;
+		struct i915_pagedir *pd = ppgtt->pdp.pagedir[pdpe];
+		struct i915_pagetab *pt = pd->page_tables[pde];
+		struct page *page_table = pt->page;
 
 		last_pte = pte + num_entries;
 		if (last_pte > GEN8_PTES_PER_PT)
@@ -316,8 +413,9 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 			break;
 
 		if (pt_vaddr == NULL) {
-			struct i915_pagedir *pd = &ppgtt->pdp.pagedir[pdpe];
-			struct page *page_table = pd->page_tables[pde].page;
+			struct i915_pagedir *pd = ppgtt->pdp.pagedir[pdpe];
+			struct i915_pagetab *pt = pd->page_tables[pde];
+			struct page *page_table = pt->page;
 			pt_vaddr = kmap_atomic(page_table);
 		}
 
@@ -347,18 +445,13 @@ static void gen8_free_page_tables(struct i915_pagedir *pd)
 {
 	int i;
 
-	if (pd->page_tables == NULL)
+	if (!pd->page)
 		return;
 
-	for (i = 0; i < I915_PDES_PER_PD; i++)
-		if (pd->page_tables[i].page)
-			__free_page(pd->page_tables[i].page);
-}
-
-static void gen8_free_page_directories(struct i915_pagedir *pd)
-{
-	kfree(pd->page_tables);
-	__free_page(pd->page);
+	for (i = 0; i < I915_PDES_PER_PD; i++) {
+		free_pt_single(pd->page_tables[i]);
+		pd->page_tables[i] = NULL;
+	}
 }
 
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
@@ -366,8 +459,8 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		gen8_free_page_tables(&ppgtt->pdp.pagedir[i]);
-		gen8_free_page_directories(&ppgtt->pdp.pagedir[i]);
+		gen8_free_page_tables(ppgtt->pdp.pagedir[i]);
+		free_pd_single(ppgtt->pdp.pagedir[i]);
 	}
 }
 
@@ -379,14 +472,16 @@ static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
 		/* TODO: In the future we'll support sparse mappings, so this
 		 * will have to change. */
-		if (!ppgtt->pdp.pagedir[i].daddr)
+		if (!ppgtt->pdp.pagedir[i]->daddr)
 			continue;
 
-		pci_unmap_page(hwdev, ppgtt->pdp.pagedir[i].daddr, PAGE_SIZE,
+		pci_unmap_page(hwdev, ppgtt->pdp.pagedir[i]->daddr, PAGE_SIZE,
 			       PCI_DMA_BIDIRECTIONAL);
 
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			dma_addr_t addr = ppgtt->pdp.pagedir[i].page_tables[j].daddr;
+			struct i915_pagedir *pd = ppgtt->pdp.pagedir[i];
+			struct i915_pagetab *pt =  pd->page_tables[j];
+			dma_addr_t addr = pt->daddr;
 			if (addr)
 				pci_unmap_page(hwdev, addr, PAGE_SIZE,
 					       PCI_DMA_BIDIRECTIONAL);
@@ -408,24 +503,20 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 
 static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 {
-	int i, j;
+	int i, ret;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		struct i915_pagedir *pd = &ppgtt->pdp.pagedir[i];
-		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			struct i915_pagetab *pt = &pd->page_tables[j];
-			pt->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-			if (!pt->page)
-				goto unwind_out;
-
-		}
+		ret = alloc_pt_range(ppgtt->pdp.pagedir[i],
+				     0, I915_PDES_PER_PD);
+		if (ret)
+			goto unwind_out;
 	}
 
 	return 0;
 
 unwind_out:
 	while (i--)
-		gen8_free_page_tables(&ppgtt->pdp.pagedir[i]);
+		gen8_free_page_tables(ppgtt->pdp.pagedir[i]);
 
 	return -ENOMEM;
 }
@@ -436,16 +527,9 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 	int i;
 
 	for (i = 0; i < max_pdp; i++) {
-		struct i915_pagetab *pt;
-		pt = kcalloc(I915_PDES_PER_PD, sizeof(*pt), GFP_KERNEL);
-		if (!pt)
+		ppgtt->pdp.pagedir[i] = alloc_pd_single();
+		if (IS_ERR(ppgtt->pdp.pagedir[i]))
 			goto unwind_out;
-
-		ppgtt->pdp.pagedir[i].page = alloc_page(GFP_KERNEL);
-		if (!ppgtt->pdp.pagedir[i].page)
-			goto unwind_out;
-
-		ppgtt->pdp.pagedir[i].page_tables = pt;
 	}
 
 	ppgtt->num_pd_pages = max_pdp;
@@ -454,10 +538,8 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 	return 0;
 
 unwind_out:
-	while (i--) {
-		kfree(ppgtt->pdp.pagedir[i].page_tables);
-		__free_page(ppgtt->pdp.pagedir[i].page);
-	}
+	while (i--)
+		free_pd_single(ppgtt->pdp.pagedir[i]);
 
 	return -ENOMEM;
 }
@@ -492,14 +574,14 @@ static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
 	int ret;
 
 	pd_addr = pci_map_page(ppgtt->base.dev->pdev,
-			       ppgtt->pdp.pagedir[pdpe].page, 0,
+			       ppgtt->pdp.pagedir[pdpe]->page, 0,
 			       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 
 	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pd_addr);
 	if (ret)
 		return ret;
 
-	ppgtt->pdp.pagedir[pdpe].daddr = pd_addr;
+	ppgtt->pdp.pagedir[pdpe]->daddr = pd_addr;
 
 	return 0;
 }
@@ -509,8 +591,8 @@ static int gen8_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt,
 					const int pde)
 {
 	dma_addr_t pt_addr;
-	struct i915_pagedir *pd = &ppgtt->pdp.pagedir[pdpe];
-	struct i915_pagetab *pt = &pd->page_tables[pde];
+	struct i915_pagedir *pd = ppgtt->pdp.pagedir[pdpe];
+	struct i915_pagetab *pt = pd->page_tables[pde];
 	struct page *p = pt->page;
 	int ret;
 
@@ -573,10 +655,12 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	 * will never need to touch the PDEs again.
 	 */
 	for (i = 0; i < max_pdp; i++) {
+		struct i915_pagedir *pd = ppgtt->pdp.pagedir[i];
 		gen8_ppgtt_pde_t *pd_vaddr;
-		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedir[i].page);
+		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedir[i]->page);
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			dma_addr_t addr = ppgtt->pdp.pagedir[i].page_tables[j].daddr;
+			struct i915_pagetab *pt = pd->page_tables[j];
+			dma_addr_t addr = pt->daddr;
 			pd_vaddr[j] = gen8_pde_encode(ppgtt->base.dev, addr,
 						      I915_CACHE_LLC);
 		}
@@ -626,7 +710,7 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 	for (pde = 0; pde < ppgtt->num_pd_entries; pde++) {
 		u32 expected;
 		gen6_gtt_pte_t *pt_vaddr;
-		dma_addr_t pt_addr = ppgtt->pd.page_tables[pde].daddr;
+		dma_addr_t pt_addr = ppgtt->pd.page_tables[pde]->daddr;
 		pd_entry = readl(pd_addr + pde);
 		expected = (GEN6_PDE_ADDR_ENCODE(pt_addr) | GEN6_PDE_VALID);
 
@@ -637,7 +721,7 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 				   expected);
 		seq_printf(m, "\tPDE: %x\n", pd_entry);
 
-		pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde].page);
+		pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde]->page);
 		for (pte = 0; pte < GEN6_PTES_PER_PT; pte+=4) {
 			unsigned long va =
 				(pde * PAGE_SIZE * GEN6_PTES_PER_PT) +
@@ -687,7 +771,7 @@ static void gen6_map_page_tables(struct i915_hw_ppgtt *ppgtt)
 
 	WARN_ON(ppgtt->pd.pd_offset & 0x3f);
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		gen6_map_single(ppgtt, i, ppgtt->pd.page_tables[i].daddr);
+		gen6_map_single(ppgtt, i, ppgtt->pd.page_tables[i]->daddr);
 
 	readl(dev_priv->gtt.gsm);
 }
@@ -931,7 +1015,7 @@ static void gen6_ppgtt_clear_range(struct i915_address_space *vm,
 		if (last_pte > GEN6_PTES_PER_PT)
 			last_pte = GEN6_PTES_PER_PT;
 
-		pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde].page);
+		pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde]->page);
 
 		for (i = pte; i < last_pte; i++)
 			pt_vaddr[i] = scratch_pte;
@@ -959,7 +1043,7 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 	pt_vaddr = NULL;
 	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
 		if (pt_vaddr == NULL)
-			pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde].page);
+			pt_vaddr = kmap_atomic(ppgtt->pd.page_tables[pde]->page);
 
 		pt_vaddr[pte] =
 			vm->pte_encode(sg_page_iter_dma_address(&sg_iter),
@@ -981,7 +1065,7 @@ static void gen6_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
 		pci_unmap_page(ppgtt->base.dev->pdev,
-			       ppgtt->pd.page_tables[i].daddr,
+			       ppgtt->pd.page_tables[i]->daddr,
 			       4096, PCI_DMA_BIDIRECTIONAL);
 }
 
@@ -990,8 +1074,9 @@ static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		__free_page(ppgtt->pd.page_tables[i].page);
-	kfree(ppgtt->pd.page_tables);
+		free_pt_single(ppgtt->pd.page_tables[i]);
+
+	free_pd_single(&ppgtt->pd);
 }
 
 static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
@@ -1048,27 +1133,6 @@ alloc:
 	return 0;
 }
 
-static int gen6_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
-{
-	struct i915_pagetab *pt;
-	int i;
-
-	pt = kcalloc(ppgtt->num_pd_entries, sizeof(*pt), GFP_KERNEL);
-	if (!pt)
-		return -ENOMEM;
-
-	for (i = 0; i < ppgtt->num_pd_entries; i++) {
-		pt[i].page = alloc_page(GFP_KERNEL);
-		if (!pt->page) {
-			gen6_ppgtt_free(ppgtt);
-			return -ENOMEM;
-		}
-	}
-
-	ppgtt->pd.page_tables = pt;
-	return 0;
-}
-
 static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 {
 	int ret;
@@ -1077,7 +1141,7 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
-	ret = gen6_ppgtt_allocate_page_tables(ppgtt);
+	ret = alloc_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries);
 	if (ret) {
 		drm_mm_remove_node(&ppgtt->node);
 		return ret;
@@ -1095,7 +1159,7 @@ static int gen6_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt)
 		struct page *page;
 		dma_addr_t pt_addr;
 
-		page = ppgtt->pd.page_tables[i].page;
+		page = ppgtt->pd.page_tables[i]->page;
 		pt_addr = pci_map_page(dev->pdev, page, 0, 4096,
 				       PCI_DMA_BIDIRECTIONAL);
 
@@ -1104,7 +1168,7 @@ static int gen6_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt)
 			return -EIO;
 		}
 
-		ppgtt->pd.page_tables[i].daddr = pt_addr;
+		ppgtt->pd.page_tables[i]->daddr = pt_addr;
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 2e776ee..7e5afb7 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -269,12 +269,12 @@ struct i915_pagedir {
 		dma_addr_t daddr;
 	};
 
-	struct i915_pagetab *page_tables;
+	struct i915_pagetab *page_tables[I915_PDES_PER_PD]; /* PDEs */
 };
 
 struct i915_pagedirpo {
 	/* struct page *page; */
-	struct i915_pagedir pagedir[GEN8_LEGACY_PDPES];
+	struct i915_pagedir *pagedir[GEN8_LEGACY_PDPES];
 };
 
 struct i915_hw_ppgtt {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 36/68] drm/i915: Generalize GEN6 mapping
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (34 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 35/68] drm/i915: Create page table allocators Ben Widawsky
@ 2014-08-22  3:11 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 37/68] drm/i915: Clean up pagetable DMA map & unmap Ben Widawsky
                   ` (36 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:11 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Having a more general way of doing mappings makes it easy to map and
unmap a specific page table. Specifically in this case, we pass down the
page directory + entry, and the page table to map. This works similarly
to the x86 code.

The same work will need to happen for GEN8. At that point I will try to
combine functionality.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
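For illustration, a sketch of the calling contract; example_install_pt()
is a made-up name. gen6_map_single() writes the PDE but deliberately
leaves the posting read to the caller:

static void example_install_pt(struct drm_i915_private *dev_priv,
			       struct i915_pagedir *pd,
			       int pde, struct i915_pagetab *pt)
{
	pd->page_tables[pde] = pt;

	/* Writes the PDE through the GGTT; no flush on our behalf */
	gen6_map_single(pd, pde, pt);

	/* Posting read so the GPU never walks a stale PDE */
	readl(dev_priv->gtt.gsm);
}
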
 drivers/gpu/drm/i915/i915_gem_gtt.c | 61 +++++++++++++++++++------------------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  2 ++
 2 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 532e560..8df3b15 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -692,18 +692,13 @@ bail:
 
 static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 {
-	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
 	struct i915_address_space *vm = &ppgtt->base;
-	gen6_gtt_pte_t __iomem *pd_addr;
 	gen6_gtt_pte_t scratch_pte;
 	uint32_t pd_entry;
 	int pte, pde;
 
 	scratch_pte = vm->pte_encode(vm->scratch.addr, I915_CACHE_LLC, true, 0);
 
-	pd_addr = (gen6_gtt_pte_t __iomem *)dev_priv->gtt.gsm +
-		ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
-
 	seq_printf(m, "  VM %p (pd_offset %x-%x):\n", vm,
 		   ppgtt->pd.pd_offset,
 		   ppgtt->pd.pd_offset + ppgtt->num_pd_entries);
@@ -711,7 +706,7 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 		u32 expected;
 		gen6_gtt_pte_t *pt_vaddr;
 		dma_addr_t pt_addr = ppgtt->pd.page_tables[pde]->daddr;
-		pd_entry = readl(pd_addr + pde);
+		pd_entry = readl(ppgtt->pd_addr + pde);
 		expected = (GEN6_PDE_ADDR_ENCODE(pt_addr) | GEN6_PDE_VALID);
 
 		if (pd_entry != expected)
@@ -747,39 +742,43 @@ static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 	}
 }
 
-static void gen6_map_single(struct i915_hw_ppgtt *ppgtt,
-			    const unsigned pde_index,
-			    dma_addr_t daddr)
+/* Map pde (index) from the page directory @pd to the page table @pt */
+static void gen6_map_single(struct i915_pagedir *pd,
+			    const int pde, struct i915_pagetab *pt)
 {
-	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
-	uint32_t pd_entry;
-	gen6_gtt_pte_t __iomem *pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm;
-	pd_addr	+= ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(pd, struct i915_hw_ppgtt, pd);
+	u32 pd_entry;
 
-	pd_entry = GEN6_PDE_ADDR_ENCODE(daddr);
+	pd_entry = GEN6_PDE_ADDR_ENCODE(pt->daddr);
 	pd_entry |= GEN6_PDE_VALID;
 
-	writel(pd_entry, pd_addr + pde_index);
+	writel(pd_entry, ppgtt->pd_addr + pde);
+
+	/* XXX: Caller needs to make sure the write completes if necessary */
 }
 
 /* Map all the page tables found in the ppgtt structure to incrementing page
  * directories. */
-static void gen6_map_page_tables(struct i915_hw_ppgtt *ppgtt)
+static void gen6_map_page_range(struct drm_i915_private *dev_priv,
+				struct i915_pagedir *pd, unsigned pde, size_t n)
 {
-	struct drm_i915_private *dev_priv = ppgtt->base.dev->dev_private;
-	int i;
+	if (WARN_ON(pde + n > I915_PDES_PER_PD))
+		n = I915_PDES_PER_PD - pde;
 
-	WARN_ON(ppgtt->pd.pd_offset & 0x3f);
-	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		gen6_map_single(ppgtt, i, ppgtt->pd.page_tables[i]->daddr);
+	n += pde;
+
+	for (; pde < n; pde++)
+		gen6_map_single(pd, pde, pd->page_tables[pde]);
 
+	/* Make sure write is complete before other code can use this page
+	 * table. Also required for WC mapped PTEs */
 	readl(dev_priv->gtt.gsm);
 }
 
 static uint32_t get_pd_offset(struct i915_hw_ppgtt *ppgtt)
 {
 	BUG_ON(ppgtt->pd.pd_offset & 0x3f);
-
 	return (ppgtt->pd.pd_offset / 64) << 16;
 }
 
@@ -1213,7 +1212,10 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->pd.pd_offset =
 		ppgtt->node.start / PAGE_SIZE * sizeof(gen6_gtt_pte_t);
 
-	gen6_map_page_tables(ppgtt);
+	ppgtt->pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm +
+		ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
+
+	gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->num_pd_entries);
 
 	DRM_DEBUG_DRIVER("Allocated pde space (%ldM) at GTT entry: %lx\n",
 			 ppgtt->node.size >> 20,
@@ -1403,13 +1405,14 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
 
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
 		/* TODO: Perhaps it shouldn't be gen6 specific */
-		if (i915_is_ggtt(vm)) {
-			if (dev_priv->mm.aliasing_ppgtt)
-				gen6_map_page_tables(dev_priv->mm.aliasing_ppgtt);
-			continue;
-		}
 
-		gen6_map_page_tables(container_of(vm, struct i915_hw_ppgtt, base));
+		struct i915_hw_ppgtt *ppgtt =
+			container_of(vm, struct i915_hw_ppgtt, base);
+
+		if (i915_is_ggtt(vm))
+			ppgtt = dev_priv->mm.aliasing_ppgtt;
+
+		gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->num_pd_entries);
 	}
 
 	i915_gem_chipset_flush(dev);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 7e5afb7..adee18c 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -290,6 +290,8 @@ struct i915_hw_ppgtt {
 
 	struct intel_context *ctx;
 
+	gen6_gtt_pte_t __iomem *pd_addr;
+
 	int (*enable)(struct i915_hw_ppgtt *ppgtt);
 	int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
 			 struct intel_engine_cs *ring,
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 37/68] drm/i915: Clean up pagetable DMA map & unmap
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (35 preceding siblings ...)
  2014-08-22  3:11 ` [PATCH 36/68] drm/i915: Generalize GEN6 mapping Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 38/68] drm/i915: Always dma map page table allocations Ben Widawsky
                   ` (35 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Map and unmap are common operations across all generations for
pagetables. With a simple helper, we can get a nice net code reduction
as well as simplified complexity.

There is some room for optimization here; for instance, the multiple
page mappings could be done in one pci_map operation. In that case,
however, the max value we'll ever see there is 512, and so I believe the
simpler code makes this a worthwhile trade-off. Also, the range mapping
functions are placeholders to help transition the code. Eventually,
mapping will only occur during a page allocation, which will always be a
discrete operation.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
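To make the unwind contract explicit, an illustrative fragment of a
caller: if dma_map_pt_range() fails partway through, it unmaps whatever
it had mapped before returning, so the error path needs no cleanup call:

	ret = dma_map_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries,
			       ppgtt->base.dev);
	if (ret)
		return ret;	/* nothing left mapped; no unmap needed */
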
 drivers/gpu/drm/i915/i915_gem_gtt.c | 147 +++++++++++++++++++++---------------
 1 file changed, 85 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 8df3b15..4bd1e07 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -203,6 +203,76 @@ static gen6_gtt_pte_t iris_pte_encode(dma_addr_t addr,
 	return pte;
 }
 
+#define dma_unmap_pt_single(pt, dev) do { \
+	pci_unmap_page((dev)->pdev, (pt)->daddr, 4096, PCI_DMA_BIDIRECTIONAL); \
+} while (0)
+
+
+static void dma_unmap_pt_range(struct i915_pagedir *pd,
+			       unsigned pde, size_t n,
+			       struct drm_device *dev)
+{
+	if (WARN_ON(pde + n > I915_PDES_PER_PD))
+		n = I915_PDES_PER_PD - pde;
+
+	n += pde;
+
+	for (; pde < n; pde++)
+		dma_unmap_pt_single(pd->page_tables[pde], dev);
+}
+
+/**
+ * dma_map_pt_single() - Create a dma mapping for a page table
+ * @pt:		Page table to get a DMA map for
+ * @dev:	drm device
+ *
+ * Page table allocations are unified across all gens. They always require a
+ * single 4k allocation, as well as a DMA mapping.
+ *
+ * Return: 0 if success.
+ */
+static int dma_map_pt_single(struct i915_pagetab *pt, struct drm_device *dev)
+{
+	struct page *page;
+	dma_addr_t pt_addr;
+	int ret;
+
+	page = pt->page;
+	pt_addr = pci_map_page(dev->pdev, page, 0, 4096,
+			       PCI_DMA_BIDIRECTIONAL);
+
+	ret = pci_dma_mapping_error(dev->pdev, pt_addr);
+	if (ret)
+		return ret;
+
+	pt->daddr = pt_addr;
+
+	return 0;
+}
+
+static int dma_map_pt_range(struct i915_pagedir *pd,
+			    unsigned pde, size_t n,
+			    struct drm_device *dev)
+{
+	const int first = pde;
+
+	if (WARN_ON(pde + n > I915_PDES_PER_PD))
+		n = I915_PDES_PER_PD - pde;
+
+	n += pde;
+
+	for (; pde < n; pde++) {
+		int ret;
+		ret = dma_map_pt_single(pd->page_tables[pde], dev);
+		if (ret) {
+			dma_unmap_pt_range(pd, first, pde, dev);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 static void free_pt_single(struct i915_pagetab *pt)
 {
 	if (WARN_ON(!pt->page))
@@ -211,7 +281,7 @@ static void free_pt_single(struct i915_pagetab *pt)
 	kfree(pt);
 }
 
-static struct i915_pagetab *alloc_pt_single(void)
+static struct i915_pagetab *alloc_pt_single(struct drm_device *dev)
 {
 	struct i915_pagetab *pt;
 
@@ -234,6 +304,7 @@ static struct i915_pagetab *alloc_pt_single(void)
  *		available to point to the allocated page tables.
  * @pde:	First page directory entry for which we are allocating.
  * @count:	Number of pages to allocate.
+ * @dev:	DRM device used for DMA mapping.
  *
  * Allocates multiple page table pages and sets the appropriate entries in the
  * page table structure within the page directory. Function cleans up after
@@ -241,7 +312,8 @@ static struct i915_pagetab *alloc_pt_single(void)
  *
  * Return: 0 if allocation succeeded.
  */
-static int alloc_pt_range(struct i915_pagedir *pd, uint16_t pde, size_t count)
+static int alloc_pt_range(struct i915_pagedir *pd, uint16_t pde, size_t count,
+			  struct drm_device *dev)
 {
 	int i, ret;
 
@@ -251,7 +323,7 @@ static int alloc_pt_range(struct i915_pagedir *pd, uint16_t pde, size_t count)
 	BUG_ON(pde + count > I915_PDES_PER_PD);
 
 	for (i = pde; i < pde + count; i++) {
-		struct i915_pagetab *pt = alloc_pt_single();
+		struct i915_pagetab *pt = alloc_pt_single(dev);
 		if (IS_ERR(pt)) {
 			ret = PTR_ERR(pt);
 			goto err_out;
@@ -507,7 +579,7 @@ static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
 		ret = alloc_pt_range(ppgtt->pdp.pagedir[i],
-				     0, I915_PDES_PER_PD);
+				     0, I915_PDES_PER_PD, ppgtt->base.dev);
 		if (ret)
 			goto unwind_out;
 	}
@@ -586,27 +658,6 @@ static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
 	return 0;
 }
 
-static int gen8_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt,
-					const int pdpe,
-					const int pde)
-{
-	dma_addr_t pt_addr;
-	struct i915_pagedir *pd = ppgtt->pdp.pagedir[pdpe];
-	struct i915_pagetab *pt = pd->page_tables[pde];
-	struct page *p = pt->page;
-	int ret;
-
-	pt_addr = pci_map_page(ppgtt->base.dev->pdev,
-			       p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pt_addr);
-	if (ret)
-		return ret;
-
-	pt->daddr = pt_addr;
-
-	return 0;
-}
-
 /**
  * GEN8 legacy ppgtt programming is accomplished through a max 4 PDP registers
  * with a net effect resembling a 2-level page table in normal x86 terms. Each
@@ -635,12 +686,15 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	 * 2. Create DMA mappings for the page directories and page tables.
 	 */
 	for (i = 0; i < max_pdp; i++) {
+		struct i915_pagedir *pd;
 		ret = gen8_ppgtt_setup_page_directories(ppgtt, i);
 		if (ret)
 			goto bail;
 
+		pd = ppgtt->pdp.pagedir[i];
+
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			ret = gen8_ppgtt_setup_page_tables(ppgtt, i, j);
+			ret = dma_map_pt_single(pd->page_tables[j], ppgtt->base.dev);
 			if (ret)
 				goto bail;
 		}
@@ -1058,16 +1112,6 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 		kunmap_atomic(pt_vaddr);
 }
 
-static void gen6_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
-{
-	int i;
-
-	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		pci_unmap_page(ppgtt->base.dev->pdev,
-			       ppgtt->pd.page_tables[i]->daddr,
-			       4096, PCI_DMA_BIDIRECTIONAL);
-}
-
 static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
@@ -1087,7 +1131,7 @@ static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
 	drm_mm_takedown(&vm->mm);
 	drm_mm_remove_node(&ppgtt->node);
 
-	gen6_ppgtt_dma_unmap_pages(ppgtt);
+	dma_unmap_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries, vm->dev);
 	gen6_ppgtt_free(ppgtt);
 }
 
@@ -1140,7 +1184,8 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
-	ret = alloc_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries);
+	ret = alloc_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries,
+			     ppgtt->base.dev);
 	if (ret) {
 		drm_mm_remove_node(&ppgtt->node);
 		return ret;
@@ -1149,29 +1194,6 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 	return 0;
 }
 
-static int gen6_ppgtt_setup_page_tables(struct i915_hw_ppgtt *ppgtt)
-{
-	struct drm_device *dev = ppgtt->base.dev;
-	int i;
-
-	for (i = 0; i < ppgtt->num_pd_entries; i++) {
-		struct page *page;
-		dma_addr_t pt_addr;
-
-		page = ppgtt->pd.page_tables[i]->page;
-		pt_addr = pci_map_page(dev->pdev, page, 0, 4096,
-				       PCI_DMA_BIDIRECTIONAL);
-
-		if (pci_dma_mapping_error(dev->pdev, pt_addr)) {
-			gen6_ppgtt_dma_unmap_pages(ppgtt);
-			return -EIO;
-		}
-
-		ppgtt->pd.page_tables[i]->daddr = pt_addr;
-	}
-
-	return 0;
-}
 
 static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 {
@@ -1196,7 +1218,8 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
-	ret = gen6_ppgtt_setup_page_tables(ppgtt);
+	ret = dma_map_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries,
+			       ppgtt->base.dev);
 	if (ret) {
 		gen6_ppgtt_free(ppgtt);
 		return ret;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 38/68] drm/i915: Always dma map page table allocations
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (36 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 37/68] drm/i915: Clean up pagetable DMA map & unmap Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 39/68] drm/i915: Consolidate dma mappings Ben Widawsky
                   ` (34 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

There is never a case where we don't want to DMA map the page tables we
allocate. Since we've broken up the allocations into nice clean helper
functions, it's both easy and obvious to do the dma mapping at the same
time.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
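The resulting lifecycle, as an illustrative fragment; a page table's DMA
mapping now lives exactly as long as the page table itself:

	struct i915_pagetab *pt;

	pt = alloc_pt_single(dev);	/* allocates and DMA maps */
	if (IS_ERR(pt))
		return PTR_ERR(pt);

	/* pt->daddr is already valid here */

	free_pt_single(pt, dev);	/* unmaps, then frees */
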
 drivers/gpu/drm/i915/i915_gem_gtt.c | 78 ++++++++-----------------------------
 1 file changed, 17 insertions(+), 61 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 4bd1e07..2346cb7 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -207,20 +207,6 @@ static gen6_gtt_pte_t iris_pte_encode(dma_addr_t addr,
 	pci_unmap_page((dev)->pdev, (pt)->daddr, 4096, PCI_DMA_BIDIRECTIONAL); \
 } while (0)
 
-
-static void dma_unmap_pt_range(struct i915_pagedir *pd,
-			       unsigned pde, size_t n,
-			       struct drm_device *dev)
-{
-	if (WARN_ON(pde + n > I915_PDES_PER_PD))
-		n = I915_PDES_PER_PD - pde;
-
-	n += pde;
-
-	for (; pde < n; pde++)
-		dma_unmap_pt_single(pd->page_tables[pde], dev);
-}
-
 /**
  * dma_map_pt_single() - Create a dma mapping for a page table
  * @pt:		Page table to get a DMA map for
@@ -250,33 +236,12 @@ static int dma_map_pt_single(struct i915_pagetab *pt, struct drm_device *dev)
 	return 0;
 }
 
-static int dma_map_pt_range(struct i915_pagedir *pd,
-			    unsigned pde, size_t n,
-			    struct drm_device *dev)
-{
-	const int first = pde;
-
-	if (WARN_ON(pde + n > I915_PDES_PER_PD))
-		n = I915_PDES_PER_PD - pde;
-
-	n += pde;
-
-	for (; pde < n; pde++) {
-		int ret;
-		ret = dma_map_pt_single(pd->page_tables[pde], dev);
-		if (ret) {
-			dma_unmap_pt_range(pd, first, pde, dev);
-			return ret;
-		}
-	}
-
-	return 0;
-}
-
-static void free_pt_single(struct i915_pagetab *pt)
+static void free_pt_single(struct i915_pagetab *pt, struct drm_device *dev)
 {
 	if (WARN_ON(!pt->page))
 		return;
+
+	dma_unmap_pt_single(pt, dev);
 	__free_page(pt->page);
 	kfree(pt);
 }
@@ -284,6 +249,7 @@ static void free_pt_single(struct i915_pagetab *pt)
 static struct i915_pagetab *alloc_pt_single(struct drm_device *dev)
 {
 	struct i915_pagetab *pt;
+	int ret;
 
 	pt = kzalloc(sizeof(*pt), GFP_KERNEL);
 	if (!pt)
@@ -295,6 +261,13 @@ static struct i915_pagetab *alloc_pt_single(struct drm_device *dev)
 		return ERR_PTR(-ENOMEM);
 	}
 
+	ret = dma_map_pt_single(pt, dev);
+	if (ret) {
+		__free_page(pt->page);
+		kfree(pt);
+		return ERR_PTR(ret);
+	}
+
 	return pt;
 }
 
@@ -338,7 +311,7 @@ static int alloc_pt_range(struct i915_pagedir *pd, uint16_t pde, size_t count,
 
 err_out:
 	while (i--)
-		free_pt_single(pd->page_tables[i]);
+		free_pt_single(pd->page_tables[i], dev);
 	return ret;
 }
 
@@ -513,7 +486,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	}
 }
 
-static void gen8_free_page_tables(struct i915_pagedir *pd)
+static void gen8_free_page_tables(struct i915_pagedir *pd, struct drm_device *dev)
 {
 	int i;
 
@@ -521,7 +494,7 @@ static void gen8_free_page_tables(struct i915_pagedir *pd)
 		return;
 
 	for (i = 0; i < I915_PDES_PER_PD; i++) {
-		free_pt_single(pd->page_tables[i]);
+		free_pt_single(pd->page_tables[i], dev);
 		pd->page_tables[i] = NULL;
 	}
 }
@@ -531,7 +504,7 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		gen8_free_page_tables(ppgtt->pdp.pagedir[i]);
+		gen8_free_page_tables(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
 		free_pd_single(ppgtt->pdp.pagedir[i]);
 	}
 }
@@ -588,7 +561,7 @@ static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 
 unwind_out:
 	while (i--)
-		gen8_free_page_tables(ppgtt->pdp.pagedir[i]);
+		gen8_free_page_tables(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
 
 	return -ENOMEM;
 }
@@ -686,18 +659,9 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	 * 2. Create DMA mappings for the page directories and page tables.
 	 */
 	for (i = 0; i < max_pdp; i++) {
-		struct i915_pagedir *pd;
 		ret = gen8_ppgtt_setup_page_directories(ppgtt, i);
 		if (ret)
 			goto bail;
-
-		pd = ppgtt->pdp.pagedir[i];
-
-		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			ret = dma_map_pt_single(pd->page_tables[j], ppgtt->base.dev);
-			if (ret)
-				goto bail;
-		}
 	}
 
 	/*
@@ -1117,7 +1081,7 @@ static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		free_pt_single(ppgtt->pd.page_tables[i]);
+		free_pt_single(ppgtt->pd.page_tables[i], ppgtt->base.dev);
 
 	free_pd_single(&ppgtt->pd);
 }
@@ -1131,7 +1095,6 @@ static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
 	drm_mm_takedown(&vm->mm);
 	drm_mm_remove_node(&ppgtt->node);
 
-	dma_unmap_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries, vm->dev);
 	gen6_ppgtt_free(ppgtt);
 }
 
@@ -1218,13 +1181,6 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
-	ret = dma_map_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries,
-			       ppgtt->base.dev);
-	if (ret) {
-		gen6_ppgtt_free(ppgtt);
-		return ret;
-	}
-
 	ppgtt->base.clear_range = gen6_ppgtt_clear_range;
 	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 39/68] drm/i915: Consolidate dma mappings
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (37 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 38/68] drm/i915: Always dma map page table allocations Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 40/68] drm/i915: Always dma map page directory allocations Ben Widawsky
                   ` (33 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

With a little bit of macro magic, and the fact that every page
table/dir/etc. we wish to map has a page and a daddr member, we can
greatly simplify and reduce the code.

The patch introduces an i915_dma_map/unmap which has the same semantics
as pci_map_page, but is one line and doesn't require newlines or local
variables to make it fit cleanly.

Notice that even the page allocation shares this same attribute. For
now, I am leaving that code untouched because the macro version would be
a bit on the big side - but it's a nice cleanup as well (IMO).

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
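A sketch of why one macro covers every type (example_map_pair() is a
made-up name): both i915_pagetab and i915_pagedir have page and daddr
members, so the same macro body works for either:

static int example_map_pair(struct i915_pagetab *pt,
			    struct i915_pagedir *pd,
			    struct drm_device *dev)
{
	int ret;

	ret = i915_dma_map_px_single(pt, dev);	/* sets pt->daddr */
	if (ret)
		return ret;

	ret = i915_dma_map_px_single(pd, dev);	/* same macro, other type */
	if (ret) {
		i915_dma_unmap_single(pt, dev);
		return ret;
	}

	return 0;
}
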
 drivers/gpu/drm/i915/i915_gem_gtt.c | 56 ++++++++++++-------------------------
 1 file changed, 18 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 2346cb7..205d5c6 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -203,45 +203,33 @@ static gen6_gtt_pte_t iris_pte_encode(dma_addr_t addr,
 	return pte;
 }
 
-#define dma_unmap_pt_single(pt, dev) do { \
-	pci_unmap_page((dev)->pdev, (pt)->daddr, 4096, PCI_DMA_BIDIRECTIONAL); \
+#define i915_dma_unmap_single(px, dev) do { \
+	pci_unmap_page((dev)->pdev, (px)->daddr, 4096, PCI_DMA_BIDIRECTIONAL); \
 } while (0)
 
 /**
- * dma_map_pt_single() - Create a dma mapping for a page table
- * @pt:		Page table to get a DMA map for
+ * i915_dma_map_px_single() - Create a dma mapping for a page table/dir/etc.
+ * @px:		Page table/dir/etc to get a DMA map for
  * @dev:	drm device
  *
  * Page table allocations are unified across all gens. They always require a
- * single 4k allocation, as well as a DMA mapping.
+ * single 4k allocation, as well as a DMA mapping. If we keep the structs
+ * symmetric here, the simple macro covers us for every page table type.
  *
  * Return: 0 if success.
  */
-static int dma_map_pt_single(struct i915_pagetab *pt, struct drm_device *dev)
-{
-	struct page *page;
-	dma_addr_t pt_addr;
-	int ret;
-
-	page = pt->page;
-	pt_addr = pci_map_page(dev->pdev, page, 0, 4096,
-			       PCI_DMA_BIDIRECTIONAL);
-
-	ret = pci_dma_mapping_error(dev->pdev, pt_addr);
-	if (ret)
-		return ret;
-
-	pt->daddr = pt_addr;
-
-	return 0;
-}
+#define i915_dma_map_px_single(px, dev) \
+	pci_dma_mapping_error((dev)->pdev, \
+			      (px)->daddr = pci_map_page((dev)->pdev, \
+							 (px)->page, 0, 4096, \
+							 PCI_DMA_BIDIRECTIONAL))
 
 static void free_pt_single(struct i915_pagetab *pt, struct drm_device *dev)
 {
 	if (WARN_ON(!pt->page))
 		return;
 
-	dma_unmap_pt_single(pt, dev);
+	i915_dma_unmap_single(pt, dev);
 	__free_page(pt->page);
 	kfree(pt);
 }
@@ -261,7 +249,7 @@ static struct i915_pagetab *alloc_pt_single(struct drm_device *dev)
 		return ERR_PTR(-ENOMEM);
 	}
 
-	ret = dma_map_pt_single(pt, dev);
+	ret = i915_dma_map_px_single(pt, dev);
 	if (ret) {
 		__free_page(pt->page);
 		kfree(pt);
@@ -511,7 +499,7 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 
 static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 {
-	struct pci_dev *hwdev = ppgtt->base.dev->pdev;
+	struct drm_device *dev = ppgtt->base.dev;
 	int i, j;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
@@ -520,16 +508,14 @@ static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
 		if (!ppgtt->pdp.pagedir[i]->daddr)
 			continue;
 
-		pci_unmap_page(hwdev, ppgtt->pdp.pagedir[i]->daddr, PAGE_SIZE,
-			       PCI_DMA_BIDIRECTIONAL);
+		i915_dma_unmap_single(ppgtt->pdp.pagedir[i], dev);
 
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
 			struct i915_pagedir *pd = ppgtt->pdp.pagedir[i];
 			struct i915_pagetab *pt =  pd->page_tables[j];
 			dma_addr_t addr = pt->daddr;
 			if (addr)
-				pci_unmap_page(hwdev, addr, PAGE_SIZE,
-					       PCI_DMA_BIDIRECTIONAL);
+				i915_dma_unmap_single(pt, dev);
 		}
 	}
 }
@@ -615,19 +601,13 @@ err_out:
 static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
 					     const int pdpe)
 {
-	dma_addr_t pd_addr;
 	int ret;
 
-	pd_addr = pci_map_page(ppgtt->base.dev->pdev,
-			       ppgtt->pdp.pagedir[pdpe]->page, 0,
-			       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-
-	ret = pci_dma_mapping_error(ppgtt->base.dev->pdev, pd_addr);
+	ret = i915_dma_map_px_single(ppgtt->pdp.pagedir[pdpe],
+				     ppgtt->base.dev);
 	if (ret)
 		return ret;
 
-	ppgtt->pdp.pagedir[pdpe]->daddr = pd_addr;
-
 	return 0;
 }
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 40/68] drm/i915: Always dma map page directory allocations
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (38 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 39/68] drm/i915: Consolidate dma mappings Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 41/68] drm/i915: Track GEN6 page table usage Ben Widawsky
                   ` (32 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Similar to the patch a few back in the series, we can always map and
unmap page directories when we do their allocation and teardown. Page
directory pages only exist on gen8+, so this should only affect behavior
on those platforms.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
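The effect on the gen8 setup path, sketched as an illustrative fragment:
a page directory now comes back either fully usable, DMA mapping
included, or as an ERR_PTR():

	ppgtt->pdp.pagedir[i] = alloc_pd_single(ppgtt->base.dev);
	if (IS_ERR(ppgtt->pdp.pagedir[i]))
		goto unwind_out;	/* nothing from this slot to undo */

	/* pagedir[i]->daddr is ready for gen8_write_pdp() */
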
 drivers/gpu/drm/i915/i915_gem_gtt.c | 79 +++++++++----------------------------
 1 file changed, 19 insertions(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 205d5c6..094a82f 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -303,21 +303,23 @@ err_out:
 	return ret;
 }
 
-static void __free_pd_single(struct i915_pagedir *pd)
+static void __free_pd_single(struct i915_pagedir *pd, struct drm_device *dev)
 {
+	i915_dma_unmap_single(pd, dev);
 	__free_page(pd->page);
 	kfree(pd);
 }
 
-#define free_pd_single(pd) do { \
+#define free_pd_single(pd, dev) do { \
 	if ((pd)->page) { \
-		__free_pd_single(pd); \
+		__free_pd_single(pd, dev); \
 	} \
 } while (0)
 
-static struct i915_pagedir *alloc_pd_single(void)
+static struct i915_pagedir *alloc_pd_single(struct drm_device *dev)
 {
 	struct i915_pagedir *pd;
+	int ret;
 
 	pd = kzalloc(sizeof(*pd), GFP_KERNEL);
 	if (!pd)
@@ -329,6 +331,13 @@ static struct i915_pagedir *alloc_pd_single(void)
 		return ERR_PTR(-ENOMEM);
 	}
 
+	ret = i915_dma_map_px_single(pd, dev);
+	if (ret) {
+		__free_page(pd->page);
+		kfree(pd);
+		return ERR_PTR(ret);
+	}
+
 	return pd;
 }
 
@@ -493,30 +502,7 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
 		gen8_free_page_tables(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
-		free_pd_single(ppgtt->pdp.pagedir[i]);
-	}
-}
-
-static void gen8_ppgtt_dma_unmap_pages(struct i915_hw_ppgtt *ppgtt)
-{
-	struct drm_device *dev = ppgtt->base.dev;
-	int i, j;
-
-	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		/* TODO: In the future we'll support sparse mappings, so this
-		 * will have to change. */
-		if (!ppgtt->pdp.pagedir[i]->daddr)
-			continue;
-
-		i915_dma_unmap_single(ppgtt->pdp.pagedir[i], dev);
-
-		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			struct i915_pagedir *pd = ppgtt->pdp.pagedir[i];
-			struct i915_pagetab *pt =  pd->page_tables[j];
-			dma_addr_t addr = pt->daddr;
-			if (addr)
-				i915_dma_unmap_single(pt, dev);
-		}
+		free_pd_single(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
 	}
 }
 
@@ -528,7 +514,6 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 	list_del(&vm->global_link);
 	drm_mm_takedown(&vm->mm);
 
-	gen8_ppgtt_dma_unmap_pages(ppgtt);
 	gen8_ppgtt_free(ppgtt);
 }
 
@@ -558,7 +543,7 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 	int i;
 
 	for (i = 0; i < max_pdp; i++) {
-		ppgtt->pdp.pagedir[i] = alloc_pd_single();
+		ppgtt->pdp.pagedir[i] = alloc_pd_single(ppgtt->base.dev);
 		if (IS_ERR(ppgtt->pdp.pagedir[i]))
 			goto unwind_out;
 	}
@@ -570,7 +555,8 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 
 unwind_out:
 	while (i--)
-		free_pd_single(ppgtt->pdp.pagedir[i]);
+		free_pd_single(ppgtt->pdp.pagedir[i],
+			       ppgtt->base.dev);
 
 	return -ENOMEM;
 }
@@ -598,19 +584,6 @@ err_out:
 	return ret;
 }
 
-static int gen8_ppgtt_setup_page_directories(struct i915_hw_ppgtt *ppgtt,
-					     const int pdpe)
-{
-	int ret;
-
-	ret = i915_dma_map_px_single(ppgtt->pdp.pagedir[pdpe],
-				     ppgtt->base.dev);
-	if (ret)
-		return ret;
-
-	return 0;
-}
-
 /**
  * GEN8 legacy ppgtt programming is accomplished through a max 4 PDP registers
  * with a net effect resembling a 2-level page table in normal x86 terms. Each
@@ -636,16 +609,7 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 		return ret;
 
 	/*
-	 * 2. Create DMA mappings for the page directories and page tables.
-	 */
-	for (i = 0; i < max_pdp; i++) {
-		ret = gen8_ppgtt_setup_page_directories(ppgtt, i);
-		if (ret)
-			goto bail;
-	}
-
-	/*
-	 * 3. Map all the page directory entries to point to the page tables
+	 * 2. Map all the page directory entries to point to the page tables
 	 * we've allocated.
 	 *
 	 * For now, the PPGTT helper functions all require that the PDEs are
@@ -681,11 +645,6 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 			 ppgtt->num_pd_entries,
 			 (ppgtt->num_pd_entries - min_pt_pages) + size % (1<<30));
 	return 0;
-
-bail:
-	gen8_ppgtt_dma_unmap_pages(ppgtt);
-	gen8_ppgtt_free(ppgtt);
-	return ret;
 }
 
 static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
@@ -1063,7 +1022,7 @@ static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
 		free_pt_single(ppgtt->pd.page_tables[i], ppgtt->base.dev);
 
-	free_pd_single(&ppgtt->pd);
+	free_pd_single(&ppgtt->pd, ppgtt->base.dev);
 }
 
 static void gen6_ppgtt_cleanup(struct i915_address_space *vm)
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 41/68] drm/i915: Track GEN6 page table usage
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (39 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 40/68] drm/i915: Always dma map page directory allocations Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 42/68] drm/i915: Extract context switch skip logic Ben Widawsky
                   ` (31 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Instead of implementing the full tracking + dynamic allocation, this
patch does a bit less than half of the work by tracking and warning on
unexpected conditions. The tracking itself follows which PTEs within a
page table are currently being used for objects. The next patch will
modify this to actually allocate the page tables only when necessary.

With the current patch there isn't much in the way of making a gen
agnostic range allocation function. However, in the next patch we'll add
more specificity, which makes having separate functions a bit easier to
manage.

Notice that aliasing PPGTT is not managed here. The patch which actually
begins dynamic allocation/teardown explains the reasoning for this.

v2: s/pdp.pagedir/pdp.pagedirs
Make a scratch page allocation helper

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
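A sketch of what the tracking records, with made-up example numbers:
reserving 64KiB at offset 1MiB touches a single gen6 page table and sets
16 bits of its used_ptes bitmap; tearing the range down clears them
again:

	/* 64KiB at 1MiB: pde 0, ptes [256, 272) with 4KiB pages */
	ret = ppgtt->base.allocate_va_range(&ppgtt->base,
					    1 << 20, 64 << 10);
	if (ret)
		return ret;

	/* ... bind the object ... */

	ppgtt->base.teardown_va_range(&ppgtt->base, 1 << 20, 64 << 10);
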
 drivers/gpu/drm/i915/i915_gem_gtt.c | 202 ++++++++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_gem_gtt.h | 117 +++++++++++++--------
 2 files changed, 230 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 094a82f..e7adbff 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -62,10 +62,9 @@ static int sanitize_enable_ppgtt(struct drm_device *dev, int enable_ppgtt)
 	return HAS_ALIASING_PPGTT(dev) ? 1 : 0;
 }
 
-
-static void ppgtt_bind_vma(struct i915_vma *vma,
-			   enum i915_cache_level cache_level,
-			   u32 flags);
+static int ppgtt_bind_vma(struct i915_vma *vma,
+			  enum i915_cache_level cache_level,
+			  u32 flags);
 static void ppgtt_unbind_vma(struct i915_vma *vma);
 static int gen8_ppgtt_enable(struct i915_hw_ppgtt *ppgtt);
 
@@ -224,37 +223,78 @@ static gen6_gtt_pte_t iris_pte_encode(dma_addr_t addr,
 							 (px)->page, 0, 4096, \
 							 PCI_DMA_BIDIRECTIONAL))
 
-static void free_pt_single(struct i915_pagetab *pt, struct drm_device *dev)
+static void __free_pt_single(struct i915_pagetab *pt, struct drm_device *dev,
+			     int scratch)
 {
+	if (WARN(scratch ^ pt->scratch,
+		 "Tried to free scratch = %d. Is scratch = %d\n",
+		 scratch, pt->scratch))
+		return;
+
 	if (WARN_ON(!pt->page))
 		return;
 
+	if (!scratch) {
+		const size_t count = INTEL_INFO(dev)->gen >= 8 ?
+			GEN8_PTES_PER_PT : GEN6_PTES_PER_PT;
+		WARN(!bitmap_empty(pt->used_ptes, count),
+		     "Free page table with %d used pages\n",
+		     bitmap_weight(pt->used_ptes, count));
+	}
+
 	i915_dma_unmap_single(pt, dev);
 	__free_page(pt->page);
+	kfree(pt->used_ptes);
 	kfree(pt);
 }
 
+#define free_pt_single(pt, dev) \
+	__free_pt_single(pt, dev, false)
+#define free_pt_scratch(pt, dev) \
+	__free_pt_single(pt, dev, true)
+
 static struct i915_pagetab *alloc_pt_single(struct drm_device *dev)
 {
 	struct i915_pagetab *pt;
-	int ret;
+	const size_t count = INTEL_INFO(dev)->gen >= 8 ?
+		GEN8_PTES_PER_PT : GEN6_PTES_PER_PT;
+	int ret = -ENOMEM;
 
 	pt = kzalloc(sizeof(*pt), GFP_KERNEL);
 	if (!pt)
 		return ERR_PTR(-ENOMEM);
 
+	pt->used_ptes = kcalloc(BITS_TO_LONGS(count), sizeof(*pt->used_ptes),
+				GFP_KERNEL);
+
+	if (!pt->used_ptes)
+		goto fail_bitmap;
+
 	pt->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	if (!pt->page) {
-		kfree(pt);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!pt->page)
+		goto fail_page;
 
 	ret = i915_dma_map_px_single(pt, dev);
-	if (ret) {
-		__free_page(pt->page);
-		kfree(pt);
-		return ERR_PTR(ret);
-	}
+	if (ret)
+		goto fail_dma;
+
+	return pt;
+
+fail_dma:
+	__free_page(pt->page);
+fail_page:
+	kfree(pt->used_ptes);
+fail_bitmap:
+	kfree(pt);
+
+	return ERR_PTR(ret);
+}
+
+static inline struct i915_pagetab *alloc_pt_scratch(struct drm_device *dev)
+{
+	struct i915_pagetab *pt = alloc_pt_single(dev);
+	if (!IS_ERR(pt))
+		pt->scratch = 1;
 
 	return pt;
 }
@@ -381,7 +421,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 	int used_pd = ppgtt->num_pd_entries / I915_PDES_PER_PD;
 
 	for (i = used_pd - 1; i >= 0; i--) {
-		dma_addr_t addr = ppgtt->pdp.pagedir[i]->daddr;
+		dma_addr_t addr = ppgtt->pdp.pagedirs[i]->daddr;
 		ret = gen8_write_pdp(ring, i, addr, synchronous);
 		if (ret)
 			return ret;
@@ -408,7 +448,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 				      I915_CACHE_LLC, use_scratch);
 
 	while (num_entries) {
-		struct i915_pagedir *pd = ppgtt->pdp.pagedir[pdpe];
+		struct i915_pagedir *pd = ppgtt->pdp.pagedirs[pdpe];
 		struct i915_pagetab *pt = pd->page_tables[pde];
 		struct page *page_table = pt->page;
 
@@ -455,7 +495,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 			break;
 
 		if (pt_vaddr == NULL) {
-			struct i915_pagedir *pd = ppgtt->pdp.pagedir[pdpe];
+			struct i915_pagedir *pd = ppgtt->pdp.pagedirs[pdpe];
 			struct i915_pagetab *pt = pd->page_tables[pde];
 			struct page *page_table = pt->page;
 			pt_vaddr = kmap_atomic(page_table);
@@ -501,8 +541,8 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	int i;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		gen8_free_page_tables(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
-		free_pd_single(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
+		gen8_free_page_tables(ppgtt->pdp.pagedirs[i], ppgtt->base.dev);
+		free_pd_single(ppgtt->pdp.pagedirs[i], ppgtt->base.dev);
 	}
 }
 
@@ -522,7 +562,7 @@ static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 	int i, ret;
 
 	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		ret = alloc_pt_range(ppgtt->pdp.pagedir[i],
+		ret = alloc_pt_range(ppgtt->pdp.pagedirs[i],
 				     0, I915_PDES_PER_PD, ppgtt->base.dev);
 		if (ret)
 			goto unwind_out;
@@ -532,7 +572,7 @@ static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 
 unwind_out:
 	while (i--)
-		gen8_free_page_tables(ppgtt->pdp.pagedir[i], ppgtt->base.dev);
+		gen8_free_page_tables(ppgtt->pdp.pagedirs[i], ppgtt->base.dev);
 
 	return -ENOMEM;
 }
@@ -543,8 +583,8 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 	int i;
 
 	for (i = 0; i < max_pdp; i++) {
-		ppgtt->pdp.pagedir[i] = alloc_pd_single(ppgtt->base.dev);
-		if (IS_ERR(ppgtt->pdp.pagedir[i]))
+		ppgtt->pdp.pagedirs[i] = alloc_pd_single(ppgtt->base.dev);
+		if (IS_ERR(ppgtt->pdp.pagedirs[i]))
 			goto unwind_out;
 	}
 
@@ -555,7 +595,7 @@ static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
 
 unwind_out:
 	while (i--)
-		free_pd_single(ppgtt->pdp.pagedir[i],
+		free_pd_single(ppgtt->pdp.pagedirs[i],
 			       ppgtt->base.dev);
 
 	return -ENOMEM;
@@ -617,9 +657,9 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	 * will never need to touch the PDEs again.
 	 */
 	for (i = 0; i < max_pdp; i++) {
-		struct i915_pagedir *pd = ppgtt->pdp.pagedir[i];
+		struct i915_pagedir *pd = ppgtt->pdp.pagedirs[i];
 		gen8_ppgtt_pde_t *pd_vaddr;
-		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedir[i]->page);
+		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedirs[i]->page);
 		for (j = 0; j < I915_PDES_PER_PD; j++) {
 			struct i915_pagetab *pt = pd->page_tables[j];
 			dma_addr_t addr = pt->daddr;
@@ -718,15 +758,13 @@ static void gen6_map_single(struct i915_pagedir *pd,
 /* Map all the page tables found in the ppgtt structure to incrementing page
  * directories. */
 static void gen6_map_page_range(struct drm_i915_private *dev_priv,
-				struct i915_pagedir *pd, unsigned pde, size_t n)
+				struct i915_pagedir *pd, uint32_t start, uint32_t length)
 {
-	if (WARN_ON(pde + n > I915_PDES_PER_PD))
-		n = I915_PDES_PER_PD - pde;
-
-	n += pde;
+	struct i915_pagetab *pt;
+	uint32_t pde, temp;
 
-	for (; pde < n; pde++)
-		gen6_map_single(pd, pde, pd->page_tables[pde]);
+	gen6_for_each_pde(pt, pd, start, length, temp, pde)
+		gen6_map_single(pd, pde, pt);
 
 	/* Make sure write is complete before other code can use this page
 	 * table. Also require for WC mapped PTEs */
@@ -1015,6 +1053,51 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 		kunmap_atomic(pt_vaddr);
 }
 
+static int gen6_alloc_va_range(struct i915_address_space *vm,
+			       uint64_t start, uint64_t length)
+{
+	struct i915_hw_ppgtt *ppgtt =
+		        container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagetab *pt;
+	uint32_t pde, temp;
+
+	gen6_for_each_pde(pt, &ppgtt->pd, start, length, temp, pde) {
+		int j;
+
+		DECLARE_BITMAP(tmp_bitmap, GEN6_PTES_PER_PT);
+		bitmap_zero(tmp_bitmap, GEN6_PTES_PER_PT);
+		bitmap_set(tmp_bitmap, gen6_pte_index(start),
+			   gen6_pte_count(start, length));
+
+		/* TODO: To be done in the next patch. Map the page/insert
+		 * entries here */
+		for_each_set_bit(j, tmp_bitmap, GEN6_PTES_PER_PT) {
+			if (test_bit(j, pt->used_ptes)) {
+				/* Check that we're changing cache levels */
+			}
+		}
+
+		bitmap_or(pt->used_ptes, pt->used_ptes, tmp_bitmap,
+			  GEN6_PTES_PER_PT);
+	}
+
+	return 0;
+}
+
+static void gen6_teardown_va_range(struct i915_address_space *vm,
+				   uint64_t start, uint64_t length)
+{
+	struct i915_hw_ppgtt *ppgtt =
+		        container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagetab *pt;
+	uint32_t pde, temp;
+
+	gen6_for_each_pde(pt, &ppgtt->pd, start, length, temp, pde) {
+		bitmap_clear(pt->used_ptes, gen6_pte_index(start),
+			     gen6_pte_count(start, length));
+	}
+}
+
 static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
@@ -1022,6 +1105,7 @@ static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	for (i = 0; i < ppgtt->num_pd_entries; i++)
 		free_pt_single(ppgtt->pd.page_tables[i], ppgtt->base.dev);
 
+	free_pt_scratch(ppgtt->scratch_pt, ppgtt->base.dev);
 	free_pd_single(&ppgtt->pd, ppgtt->base.dev);
 }
 
@@ -1049,6 +1133,9 @@ static int gen6_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt)
 	 * size. We allocate at the top of the GTT to avoid fragmentation.
 	 */
 	BUG_ON(!drm_mm_initialized(&dev_priv->gtt.base.mm));
+	ppgtt->scratch_pt = alloc_pt_scratch(ppgtt->base.dev);
+	if (IS_ERR(ppgtt->scratch_pt))
+		return PTR_ERR(ppgtt->scratch_pt);
 alloc:
 	ret = drm_mm_insert_node_in_range_generic(&dev_priv->gtt.base.mm,
 						  &ppgtt->node, GEN6_PD_SIZE,
@@ -1062,20 +1149,25 @@ alloc:
 					       0, dev_priv->gtt.base.total,
 					       0);
 		if (ret)
-			return ret;
+			goto err_out;
 
 		retried = true;
 		goto alloc;
 	}
 
 	if (ret)
-		return ret;
+		goto err_out;
+
 
 	if (ppgtt->node.start < dev_priv->gtt.mappable_end)
 		DRM_DEBUG("Forced to use aperture for PDEs\n");
 
 	ppgtt->num_pd_entries = I915_PDES_PER_PD;
 	return 0;
+
+err_out:
+	free_pt_scratch(ppgtt->scratch_pt, ppgtt->base.dev);
+	return ret;
 }
 
 static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
@@ -1120,6 +1212,8 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
+	ppgtt->base.allocate_va_range = gen6_alloc_va_range;
+	ppgtt->base.teardown_va_range = gen6_teardown_va_range;
 	ppgtt->base.clear_range = gen6_ppgtt_clear_range;
 	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
@@ -1133,7 +1227,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm +
 		ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
 
-	gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->num_pd_entries);
+	gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->base.total);
 
 	DRM_DEBUG_DRIVER("Allocated pde space (%ldM) at GTT entry: %lx\n",
 			 ppgtt->node.size >> 20,
@@ -1168,17 +1262,28 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt)
 	return 0;
 }
 
-static void
+static int
 ppgtt_bind_vma(struct i915_vma *vma,
 	       enum i915_cache_level cache_level,
 	       u32 flags)
 {
+	int ret;
+
 	/* Currently applicable only to VLV */
 	if (vma->obj->gt_ro)
 		flags |= PTE_READ_ONLY;
 
+	if (vma->vm->allocate_va_range) {
+		ret = vma->vm->allocate_va_range(vma->vm,
+						 vma->node.start,
+						 vma->node.size);
+		if (ret)
+			return ret;
+	}
+
 	vma->vm->insert_entries(vma->vm, vma->obj->pages, vma->node.start,
 				cache_level, flags);
+	return 0;
 }
 
 static void ppgtt_unbind_vma(struct i915_vma *vma)
@@ -1187,6 +1292,9 @@ static void ppgtt_unbind_vma(struct i915_vma *vma)
 			     vma->node.start,
 			     vma->obj->base.size,
 			     true);
+	if (vma->vm->teardown_va_range)
+		vma->vm->teardown_va_range(vma->vm,
+					   vma->node.start, vma->node.size);
 }
 
 extern int intel_iommu_gfx_mapped;
@@ -1495,9 +1603,9 @@ static void gen6_ggtt_clear_range(struct i915_address_space *vm,
 }
 
 
-static void i915_ggtt_bind_vma(struct i915_vma *vma,
-			       enum i915_cache_level cache_level,
-			       u32 unused)
+static int i915_ggtt_bind_vma(struct i915_vma *vma,
+			      enum i915_cache_level cache_level,
+			      u32 unused)
 {
 	const unsigned long entry = vma->node.start >> PAGE_SHIFT;
 	unsigned int flags = (cache_level == I915_CACHE_NONE) ?
@@ -1506,6 +1614,8 @@ static void i915_ggtt_bind_vma(struct i915_vma *vma,
 	BUG_ON(!i915_is_ggtt(vma->vm));
 	intel_gtt_insert_sg_entries(vma->obj->pages, entry, flags);
 	vma->obj->has_global_gtt_mapping = 1;
+
+	return 0;
 }
 
 static void i915_ggtt_clear_range(struct i915_address_space *vm,
@@ -1528,9 +1638,9 @@ static void i915_ggtt_unbind_vma(struct i915_vma *vma)
 	intel_gtt_clear_range(first, size);
 }
 
-static void ggtt_bind_vma(struct i915_vma *vma,
-			  enum i915_cache_level cache_level,
-			  u32 flags)
+static int ggtt_bind_vma(struct i915_vma *vma,
+			 enum i915_cache_level cache_level,
+			 u32 flags)
 {
 	struct drm_device *dev = vma->vm->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
@@ -1562,7 +1672,7 @@ static void ggtt_bind_vma(struct i915_vma *vma,
 	}
 
 	if (!(flags & ALIASING_BIND))
-		return;
+		return 0;
 
 	if (dev_priv->mm.aliasing_ppgtt &&
 	    (!obj->has_aliasing_ppgtt_mapping ||
@@ -1574,6 +1684,8 @@ static void ggtt_bind_vma(struct i915_vma *vma,
 					    cache_level, flags);
 		vma->obj->has_aliasing_ppgtt_mapping = 1;
 	}
+
+	return 0;
 }
 
 static void ggtt_unbind_vma(struct i915_vma *vma)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index adee18c..3729222 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -174,9 +174,33 @@ struct i915_vma {
 /* Only use this if you know you want a strictly aliased binding */
 #define ALIASING_BIND (1<<1)
 #define PTE_READ_ONLY (1<<2)
-	void (*bind_vma)(struct i915_vma *vma,
-			 enum i915_cache_level cache_level,
-			 u32 flags);
+	int (*bind_vma)(struct i915_vma *vma,
+			enum i915_cache_level cache_level,
+			u32 flags);
+};
+
+
+struct i915_pagetab {
+	struct page *page;
+	dma_addr_t daddr;
+
+	unsigned long *used_ptes;
+	unsigned int scratch:1;
+};
+
+struct i915_pagedir {
+	struct page *page; /* NULL for GEN6-GEN7 */
+	union {
+		uint32_t pd_offset;
+		dma_addr_t daddr;
+	};
+
+	struct i915_pagetab *page_tables[I915_PDES_PER_PD];
+};
+
+struct i915_pagedirpo {
+	/* struct page *page; */
+	struct i915_pagedir *pagedirs[GEN8_LEGACY_PDPES];
 };
 
 struct i915_address_space {
@@ -218,6 +242,12 @@ struct i915_address_space {
 	gen6_gtt_pte_t (*pte_encode)(dma_addr_t addr,
 				     enum i915_cache_level level,
 				     bool valid, u32 flags); /* Create a valid PTE */
+	int (*allocate_va_range)(struct i915_address_space *vm,
+				 uint64_t start,
+				 uint64_t length);
+	void (*teardown_va_range)(struct i915_address_space *vm,
+				  uint64_t start,
+				  uint64_t length);
 	void (*clear_range)(struct i915_address_space *vm,
 			    uint64_t start,
 			    uint64_t length,
@@ -229,6 +259,30 @@ struct i915_address_space {
 	void (*cleanup)(struct i915_address_space *vm);
 };
 
+struct i915_hw_ppgtt {
+	struct i915_address_space base;
+	struct kref ref;
+	struct drm_mm_node node;
+	unsigned num_pd_entries;
+	unsigned num_pd_pages; /* gen8+ */
+	union {
+		struct i915_pagedirpo pdp;
+		struct i915_pagedir pd;
+	};
+
+	struct i915_pagetab *scratch_pt;
+
+	struct intel_context *ctx;
+
+	gen6_gtt_pte_t __iomem *pd_addr;
+
+	int (*enable)(struct i915_hw_ppgtt *ppgtt);
+	int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
+			 struct intel_engine_cs *ring,
+			 bool synchronous);
+	void (*debug_dump)(struct i915_hw_ppgtt *ppgtt, struct seq_file *m);
+};
+
 /* The Graphics Translation Table is the way in which GEN hardware translates a
  * Graphics Virtual Address into a Physical Address. In addition to the normal
  * collateral associated with any va->pa translations GEN hardware also has a
@@ -257,47 +311,22 @@ struct i915_gtt {
 			  unsigned long *mappable_end);
 };
 
-struct i915_pagetab {
-	struct page *page;
-	dma_addr_t daddr;
-};
-
-struct i915_pagedir {
-	struct page *page; /* NULL for GEN6-GEN7 */
-	union {
-		uint32_t pd_offset;
-		dma_addr_t daddr;
-	};
-
-	struct i915_pagetab *page_tables[I915_PDES_PER_PD]; /* PDEs */
-};
-
-struct i915_pagedirpo {
-	/* struct page *page; */
-	struct i915_pagedir *pagedir[GEN8_LEGACY_PDPES];
-};
-
-struct i915_hw_ppgtt {
-	struct i915_address_space base;
-	struct kref ref;
-	struct drm_mm_node node;
-	unsigned num_pd_entries;
-	unsigned num_pd_pages; /* gen8+ */
-	union {
-		struct i915_pagedirpo pdp;
-		struct i915_pagedir pd;
-	};
-
-	struct intel_context *ctx;
-
-	gen6_gtt_pte_t __iomem *pd_addr;
-
-	int (*enable)(struct i915_hw_ppgtt *ppgtt);
-	int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
-			 struct intel_engine_cs *ring,
-			 bool synchronous);
-	void (*debug_dump)(struct i915_hw_ppgtt *ppgtt, struct seq_file *m);
-};
+/* Iterates over every pde from start to start + length. If start and
+ * start + length are not perfectly divisible, the macro rounds down and up
+ * as needed. The macro modifies pde, start, and length. Temp is scratch
+ * space for the loop. On gen6/7, start = 0 and length = 2G effectively
+ * iterates over every PDE in the system. On gen8+ it simply iterates over
+ * every page directory entry in a page directory.
+ *
+ * XXX: temp is not actually needed, but it saves doing the ALIGN operation.
+ */
+#define gen6_for_each_pde(pt, pd, start, length, temp, iter) \
+	for (iter = gen6_pde_index(start), pt = (pd)->page_tables[iter]; \
+	     length > 0 && iter < I915_PDES_PER_PD; \
+	     pt = (pd)->page_tables[++iter], \
+	     temp = ALIGN(start+1, 1 << GEN6_PDE_SHIFT) - start, \
+	     temp = min(temp, (unsigned)length), \
+	     start += temp, length -= temp)
 
 static inline uint32_t i915_pte_index(uint64_t address, uint32_t pde_shift)
 {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread
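
To see how the gen6_for_each_pde iterator introduced above carves a range
into per-PDE chunks, here is a self-contained userspace sketch. PDE_SHIFT
and ALIGN are stand-ins for the driver's GEN6_PDE_SHIFT and the kernel
macro, and the loop is an unrolled rendering of the macro, not a copy of it.

#include <stdint.h>
#include <stdio.h>

#define PDE_SHIFT 22		/* stand-in: 4MB of address space per PDE */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

int main(void)
{
	uint64_t start = 3ULL << 20;	/* 3MB, not PDE aligned */
	uint64_t length = 6ULL << 20;	/* 6MB, spans three PDEs */
	uint32_t pde = start >> PDE_SHIFT;
	uint64_t temp;

	/* Mirrors the macro: clamp each step to the next PDE boundary. */
	for (; length > 0; pde++) {
		temp = ALIGN(start + 1, 1ULL << PDE_SHIFT) - start;
		if (temp > length)
			temp = length;
		printf("pde %u covers 0x%llx +0x%llx\n", pde,
		       (unsigned long long)start, (unsigned long long)temp);
		start += temp;
		length -= temp;
	}
	return 0;
}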

* [PATCH 42/68] drm/i915: Extract context switch skip logic
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (40 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 41/68] drm/i915: Track GEN6 page table usage Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 43/68] drm/i915: Track page table reload need Ben Widawsky
                   ` (30 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

We have some fanciness coming up. This patch just breaks out the logic.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 51b517e..6d341ef 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -860,6 +860,16 @@ unpin_out:
 	return ret;
 }
 
+static inline bool should_skip_switch(struct intel_engine_cs *ring,
+				      struct intel_context *from,
+				      struct intel_context *to)
+{
+	if (from == to && !to->remap_slice)
+		return true;
+
+	return false;
+}
+
 /**
  * i915_switch_context() - perform a GPU context switch.
  * @ring: ring for which we'll execute the context switch
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 43/68] drm/i915: Track page table reload need
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (41 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 42/68] drm/i915: Extract context switch skip logic Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 44/68] drm/i915: Initialize all contexts Ben Widawsky
                   ` (29 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This patch was formerly known as "Force pd restore when PDEs change,
gen6-7." I had to change the name because it is needed for GEN8 too.

The real issue this is trying to solve is when a new object is mapped
into the current address space. The GPU does not snoop the new mapping
so we must take the gen-specific action to reload the page tables.

GEN8 and GEN7 do differ in the way they load page tables for the RCS.
GEN8 does so with the context restore, while GEN7 requires the proper
load commands in the command streamer. Non-render is similar for both.

Caveat for GEN7
The docs say you cannot change the PDEs of a currently running context.
We never map new PDEs of a running context, and expect them to be
present - so I think this is okay. (We can unmap, but this should also
be okay since we only unmap unreferenced objects that the GPU shouldn't
be trying to va->pa xlate.) The MI_SET_CONTEXT command does have a flag
to signal that even if the context is the same, force a reload. It's
unclear exactly what this does, but I have a hunch it's the right thing
to do.

The logic assumes that we always emit a context switch after mapping new
PDEs, and before we submit a batch. This is the case today, and has been
the case since the inception of hardware contexts. A note in the comment
lets the reader know.
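
As a rough illustration of the per-VM reload mask this patch introduces,
here is a self-contained userspace sketch. None of these names are the
driver's; the helpers stand in for the kernel's set_bit()/test_and_clear_bit()
on pd_reload_mask.

#include <stdio.h>

#define NUM_RINGS 2	/* toy ring count */
#define RCS 0
#define BCS 1

struct toy_vm { unsigned long pd_reload_mask; };

/* Mapping new PDEs dirties every ring: each must reload its page
 * tables before the GPU can coherently use the new mapping. */
static void mark_pds_dirty(struct toy_vm *vm)
{
	vm->pd_reload_mask = (1UL << NUM_RINGS) - 1;
}

/* A context switch (or explicit PD load) on a ring consumes that
 * ring's bit; if it was set on gen7, a force restore is needed even
 * when switching to the same context. */
static int consume_reload(struct toy_vm *vm, int ring)
{
	unsigned long bit = 1UL << ring;
	int was_set = !!(vm->pd_reload_mask & bit);

	vm->pd_reload_mask &= ~bit;
	return was_set;
}

int main(void)
{
	struct toy_vm vm = { 0 };

	mark_pds_dirty(&vm);
	printf("RCS force restore: %d\n", consume_reload(&vm, RCS));	/* 1 */
	printf("RCS force restore: %d\n", consume_reload(&vm, RCS));	/* 0 */
	return 0;
}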

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

squash! drm/i915: Force pd restore when PDEs change, gen6-7

It's not just for gen8. If the current context's mappings change, we
need a context reload to switch to the new page tables.
---
 drivers/gpu/drm/i915/i915_gem_context.c    | 55 ++++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  6 ++++
 drivers/gpu/drm/i915/i915_gem_gtt.c        | 16 ++++++++-
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  2 ++
 4 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 6d341ef..dade4e2 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -693,6 +693,8 @@ static int do_switch_xcs(struct intel_engine_cs *ring,
 		ret = ppgtt->switch_mm(ppgtt, ring, false);
 		if (ret)
 			return ret;
+		/* Context switch on GEN8 will reload pds also */
+		clear_bit(ring->id, &to->vm->pd_reload_mask);
 	}
 
 	if (from)
@@ -719,6 +721,25 @@ static void remap_l3(struct intel_engine_cs *ring,
 	}
 }
 
+static inline bool should_skip_switch(struct intel_engine_cs *ring,
+				      struct intel_context *from,
+				      struct intel_context *to)
+{
+	if (to->remap_slice)
+		return false;
+
+	if (from == to && !test_bit(ring->id, &to->vm->pd_reload_mask))
+		return true;
+
+	return false;
+}
+
+static bool
+needs_pd_load(struct intel_engine_cs *ring, struct intel_context *to)
+{
+	return (USES_FULL_PPGTT(ring->dev) || test_bit(ring->id, &to->vm->pd_reload_mask));
+}
+
 static int do_switch_rcs(struct intel_engine_cs *ring,
 			 struct intel_context *from,
 			 struct intel_context *to)
@@ -727,7 +748,6 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	struct i915_hw_ppgtt *ppgtt = ctx_to_ppgtt(to);
 	u32 hw_flags = 0;
 	bool uninitialized = false;
-	bool needs_pd_load = (INTEL_INFO(ring->dev)->gen < 8) && USES_FULL_PPGTT(ring->dev);
 	int ret;
 
 	if (from != NULL) {
@@ -752,18 +772,21 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 	 * context. The default context cannot end up in evict everything (as
 	 * commented above) because it is always pinned.
 	 */
-	if (WARN_ON(from == to)) {
+	if (WARN_ON(from == to && should_skip_switch(ring, from, to))) {
 		ret = -EPERM;
 		goto unpin_out;
 	}
 
-	if (needs_pd_load) {
+	if (INTEL_INFO(ring->dev)->gen < 8 && needs_pd_load(ring, to)) {
 		/* Older GENs still want the load first, "PP_DCLV followed by
 		 * PP_DIR_BASE register through Load Register Immediate commands
 		 * in Ring Buffer before submitting a context."*/
 		ret = ppgtt->switch_mm(ppgtt, ring, false);
 		if (ret)
 			goto unpin_out;
+
+		/* Doing a PD load always reloads the page dirs */
+		clear_bit(ring->id, &to->vm->pd_reload_mask);
 	}
 
 	/*
@@ -784,22 +807,28 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		i915_gem_vma_bind(vma, to->legacy_hw_ctx.rcs_state->cache_level, GLOBAL_BIND);
 	}
 
-	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to)) {
+	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to))
 		hw_flags |= MI_RESTORE_INHIBIT;
-		needs_pd_load = USES_FULL_PPGTT(ring->dev) && IS_GEN8(ring->dev);
-	}
+	else if (test_and_clear_bit(ring->id, &to->vm->pd_reload_mask))
+		hw_flags |= MI_FORCE_RESTORE;
 
 	ret = mi_set_context(ring, to, hw_flags);
 	if (ret)
 		goto unpin_out;
 
+	if (IS_GEN8(ring->dev)) {
+		/* Setting the context is enough to reload pds on gen8 */
+		clear_bit(ring->id, &to->vm->pd_reload_mask);
+	}
+
 	/* GEN8 does *not* require an explicit reload if the PDPs have been
 	 * setup, and we do not wish to move them.
 	 *
 	 * XXX: If we implemented page directory eviction code, this
 	 * optimization needs to be removed.
 	 */
-	if (needs_pd_load) {
+	if (IS_GEN8(ring->dev) && needs_pd_load(ring, to)) {
+		/* Doing a PD load always reloads the page dirs */
 		ret = ppgtt->switch_mm(ppgtt, ring, false);
 		/* The hardware context switch is emitted, but we haven't
 		 * actually changed the state - so it's probably safe to bail
@@ -860,16 +889,6 @@ unpin_out:
 	return ret;
 }
 
-static inline bool should_skip_switch(struct intel_engine_cs *ring,
-				      struct intel_context *from,
-				      struct intel_context *to)
-{
-	if (from == to && !to->remap_slice)
-		return true;
-
-	return false;
-}
-
 /**
  * i915_switch_context() - perform a GPU context switch.
  * @ring: ring for which we'll execute the context switch
@@ -898,7 +917,7 @@ int i915_switch_context(struct intel_engine_cs *ring,
 		return 0;
 	}
 
-	if (from == to && !to->remap_slice)
+	if (should_skip_switch(ring, from, to))
 		return 0;
 
 	if (IS_GEN8(ring->dev))
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index caccee9..fdd68d6 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1096,6 +1096,8 @@ legacy_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
 	if (ret)
 		goto error;
 
+	WARN(ctx->vm->pd_reload_mask & (1<<ring->id), "%s didn't clear reload\n", ring->name);
+
 	instp_mode = args->flags & I915_EXEC_CONSTANTS_MASK;
 	instp_mask = I915_EXEC_CONSTANTS_MASK;
 	switch (instp_mode) {
@@ -1347,6 +1349,10 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (ret)
 		goto err;
 
+	/* XXX: Reserve may have changed PDEs, which means we must do a
+	 * context switch before we can coherently read some of the reserved
+	 * VMAs. */
+
 	/* The objects are in their final locations, apply the relocations. */
 	if (need_relocs)
 		ret = i915_gem_execbuffer_relocate(eb);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index e7adbff..c3df359 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1262,6 +1262,15 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt)
 	return 0;
 }
 
+/* PDE TLBs are a pain to invalidate pre GEN8. It requires a context reload. If we
+ * are switching between contexts with the same LRCA, we also must do a force
+ * restore.
+ */
+#define ppgtt_invalidate_tlbs(vm) do {\
+	/* If current vm != vm, */ \
+	vm->pd_reload_mask = INTEL_INFO(vm->dev)->ring_mask; \
+} while(0)
+
 static int
 ppgtt_bind_vma(struct i915_vma *vma,
 	       enum i915_cache_level cache_level,
@@ -1279,10 +1288,13 @@ ppgtt_bind_vma(struct i915_vma *vma,
 						 vma->node.size);
 		if (ret)
 			return ret;
+
+		ppgtt_invalidate_tlbs(vma->vm);
 	}
 
 	vma->vm->insert_entries(vma->vm, vma->obj->pages, vma->node.start,
 				cache_level, flags);
+
 	return 0;
 }
 
@@ -1292,9 +1304,11 @@ static void ppgtt_unbind_vma(struct i915_vma *vma)
 			     vma->node.start,
 			     vma->obj->base.size,
 			     true);
-	if (vma->vm->teardown_va_range)
+	if (vma->vm->teardown_va_range) {
 		vma->vm->teardown_va_range(vma->vm,
 					   vma->node.start, vma->node.size);
+		ppgtt_invalidate_tlbs(vma->vm);
+	}
 }
 
 extern int intel_iommu_gfx_mapped;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 3729222..c89930d 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -215,6 +215,8 @@ struct i915_address_space {
 		struct page *page;
 	} scratch;
 
+	unsigned long pd_reload_mask;
+
 	/**
 	 * List of objects currently involved in rendering.
 	 *
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 44/68] drm/i915: Initialize all contexts
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (42 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 43/68] drm/i915: Track page table reload need Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 45/68] drm/i915: Finish gen6/7 dynamic page table allocation Ben Widawsky
                   ` (28 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The problem is we're going to switch to a new context, which could be
the default context. The plan was to use restore inhibit, which would be
fine, except if we are using dynamic page tables (which we will). If we
use dynamic page tables and we don't load new page tables, the previous
page tables might go away, and future operations will fault.

CTXA runs.
switch to default, restore inhibit
CTXA dies and has its address space taken away.
Run CTXB, tries to save using the context A's address space - this
fails.

The general solution is to make sure every context has its own state,
and its own address space. For cases when we must restore inhibit, the
first thing we do is load a valid address space. I thought this would be
enough, but apparently there are references within the context itself
which will refer to the old address space - therefore, we also must
reinitialize.
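
Combined with the previous patch's reload tracking, the resulting switch
policy boils down to something like the sketch below. These are
illustrative names, not driver code; "pd_dirty" stands in for the per-ring
reload mask and the flags for MI_RESTORE_INHIBIT/MI_FORCE_RESTORE.

#include <stdio.h>

enum { FLAG_NONE = 0, FLAG_RESTORE_INHIBIT = 1, FLAG_FORCE_RESTORE = 2 };

static int switch_flags(int initialized, int pd_dirty, int *must_load_pd)
{
	if (!initialized) {
		/* An inhibited restore must be paired with loading a
		 * valid address space so the old VM can be freed. */
		*must_load_pd = 1;
		return FLAG_RESTORE_INHIBIT;
	}
	*must_load_pd = 0;
	return pd_dirty ? FLAG_FORCE_RESTORE : FLAG_NONE;
}

int main(void)
{
	int load;

	printf("new ctx -> flags %d, load pd %d\n",
	       switch_flags(0, 0, &load), load);	/* 1, 1 */
	printf("dirty ctx -> flags %d, load pd %d\n",
	       switch_flags(1, 1, &load), load);	/* 2, 0 */
	return 0;
}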

It was tricky to track this down as we don't have much insight into what
happens in a context save.

This is required for the next patch which enables dynamic page tables.

I need to go back and test this pre-GEN8

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index dade4e2..59c5173 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -807,9 +807,13 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		i915_gem_vma_bind(vma, to->legacy_hw_ctx.rcs_state->cache_level, GLOBAL_BIND);
 	}
 
-	if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(to))
+	if (!to->legacy_hw_ctx.initialized) {
 		hw_flags |= MI_RESTORE_INHIBIT;
-	else if (test_and_clear_bit(ring->id, &to->vm->pd_reload_mask))
+		/* NB: If we inhibit the restore, the context is not allowed to
+		 * die because future work may end up depending on valid address
+		 * space. This means we must enforce that a page table load
+		 * occur when this occurs. */
+	} else if (test_and_clear_bit(ring->id, &to->vm->pd_reload_mask))
 		hw_flags |= MI_FORCE_RESTORE;
 
 	ret = mi_set_context(ring, to, hw_flags);
@@ -823,12 +827,11 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 
 	/* GEN8 does *not* require an explicit reload if the PDPs have been
 	 * setup, and we do not wish to move them.
-	 *
-	 * XXX: If we implemented page directory eviction code, this
-	 * optimization needs to be removed.
 	 */
-	if (IS_GEN8(ring->dev) && needs_pd_load(ring, to)) {
-		/* Doing a PD load always reloads the page dirs */
+	if (IS_GEN8(ring->dev) && (hw_flags & MI_RESTORE_INHIBIT)) {
+		/* We have a valid page directory (scratch) to switch to. This
+		 * allows the old VM to be freed. Note that if anything occurs
+		 * between the set context, and here, we are f*cked */
 		ret = ppgtt->switch_mm(ppgtt, ring, false);
 		/* The hardware context switch is emitted, but we haven't
 		 * actually changed the state - so it's probably safe to bail
@@ -863,7 +866,7 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		BUG_ON(from->legacy_hw_ctx.rcs_state->ring != ring);
 	}
 
-	uninitialized = !to->legacy_hw_ctx.initialized && from == NULL;
+	uninitialized = !to->legacy_hw_ctx.initialized;
 	to->legacy_hw_ctx.initialized = true;
 	/* From may have disappeared again after the context unref */
 	from = do_switch_fini_common(ring, from, to);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 45/68] drm/i915: Finish gen6/7 dynamic page table allocation
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (43 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 44/68] drm/i915: Initialize all contexts Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 46/68] drm/i915/bdw: Use dynamic allocation idioms on free Ben Widawsky
                   ` (27 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This patch continues on the idea from the previous patch. From here on,
in the steady state, PDEs are all pointing to the scratch page table (as
recommended in the spec). When an object is allocated in the VA range,
the code will determine if we need to allocate a page for the page
table. Similarly, when the object is destroyed, we will remove and free
the page table, pointing the PDE back at the scratch page.

Following patches will work to unify the code a bit as we bring in GEN8
support. GEN6 and GEN8 are different enough that I had a hard time
getting to this point with as much common code as I did.

The aliasing PPGTT must pre-allocate all of the page tables. There are a
few reasons for this. Two trivial ones: aliasing ppgtt goes through the
ggtt paths, so it's hard to maintain; and we currently do not restore the
default context (assuming the previous force reload is indeed
necessary). Most importantly though, the only way (it seems from
empirical evidence) to invalidate the CS TLBs on non-render ring is to
either use ring sync (which requires actually stopping the rings in
order to synchronize when the sync completes vs. where you are in
execution), or to reload DCLV.  Since without full PPGTT we do not ever
reload the DCLV register, there is no good way to achieve this. The
simplest solution is just to not support dynamic page table
creation/destruction in the aliasing PPGTT.

We could always reload DCLV, but this seems like quite a bit of excess
overhead only to save at most 2MB-4k of memory for the aliasing PPGTT
page tables.
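
For the steady state described above, here is a minimal userspace sketch,
assuming a toy four-entry page directory. A use counter stands in for the
patch's used_ptes bitmap, and none of these names are the driver's.

#include <stdio.h>
#include <stdlib.h>

#define PDES 4	/* toy size; the driver uses I915_PDES_PER_PD */

struct pagetab { int used_ptes; };	/* counter stands in for the bitmap */

static struct pagetab scratch_pt;	/* every PDE starts pointing here */
static struct pagetab *pd[PDES] = {
	&scratch_pt, &scratch_pt, &scratch_pt, &scratch_pt
};

/* Bind: allocate a real page table the first time a PDE is used. */
static int alloc_va(int pde, int nptes)
{
	if (pd[pde] == &scratch_pt) {
		pd[pde] = calloc(1, sizeof(*pd[pde]));
		if (!pd[pde]) {
			pd[pde] = &scratch_pt;
			return -1;
		}
	}
	pd[pde]->used_ptes += nptes;
	return 0;
}

/* Unbind: once a table is empty, point the PDE back at scratch. */
static void teardown_va(int pde, int nptes)
{
	struct pagetab *pt = pd[pde];

	if (pt == &scratch_pt)	/* nothing was ever bound here */
		return;
	pt->used_ptes -= nptes;
	if (pt->used_ptes == 0) {
		free(pt);
		pd[pde] = &scratch_pt;
	}
}

int main(void)
{
	alloc_va(1, 16);
	printf("pde 1 backed: %d\n", pd[1] != &scratch_pt);	/* 1 */
	teardown_va(1, 16);
	printf("pde 1 backed: %d\n", pd[1] != &scratch_pt);	/* 0 */
	return 0;
}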

v2: Make the page table bitmap declared inside the function (Chris)
Simplify the way scratching address space works.
Move the alloc/teardown tracepoints up a level in the call stack so that
both all implementations get the trace.

v3: Updated trace event to spit out a name

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  19 ++++-
 drivers/gpu/drm/i915/i915_drv.h         |   7 ++
 drivers/gpu/drm/i915/i915_gem_context.c |   2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c     | 119 +++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_gem_gtt.h     |   2 +-
 drivers/gpu/drm/i915/i915_trace.h       | 116 +++++++++++++++++++++++++++++++
 6 files changed, 254 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index ccecf00..47baf50 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1811,9 +1811,25 @@ static int i915_swizzle_info(struct seq_file *m, void *data)
 	return 0;
 }
 
+static size_t gen6_ppgtt_count_pt_pages(struct i915_hw_ppgtt *ppgtt)
+{
+	struct i915_pagedir *pd = &ppgtt->pd;
+	struct i915_pagetab **pt = &pd->page_tables[0];
+	size_t cnt = 0;
+	int i;
+
+	for (i = 0; i < ppgtt->num_pd_entries; i++) {
+		if (pt[i] != ppgtt->scratch_pt)
+			cnt++;
+	}
+
+	return cnt;
+}
+
 static void print_ppgtt(struct seq_file *m, struct i915_hw_ppgtt *ppgtt)
 {
 	seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd.pd_offset);
+	seq_printf(m, "\tpd pages: %zu\n", gen6_ppgtt_count_pt_pages(ppgtt));
 }
 
 static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev, int verbose)
@@ -1877,6 +1893,8 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev, bool ver
 		seq_printf(m, "PP_DIR_BASE_READ: 0x%08x\n", I915_READ(RING_PP_DIR_BASE_READ(ring)));
 		seq_printf(m, "PP_DIR_DCLV: 0x%08x\n", I915_READ(RING_PP_DIR_DCLV(ring)));
 	}
+	seq_printf(m, "ECOCHK: 0x%08x\n\n", I915_READ(GAM_ECOCHK));
+
 	if (dev_priv->mm.aliasing_ppgtt) {
 		struct i915_hw_ppgtt *ppgtt = dev_priv->mm.aliasing_ppgtt;
 
@@ -1895,7 +1913,6 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev, bool ver
 		idr_for_each(&file_priv->context_idr, per_file_ctx,
 			     (void *)((unsigned long)m | verbose));
 	}
-	seq_printf(m, "ECOCHK: 0x%08x\n", I915_READ(GAM_ECOCHK));
 }
 
 static int i915_ppgtt_info(struct seq_file *m, void *data)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 651ad7f..beb9a66 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2491,6 +2491,13 @@ static inline bool i915_is_ggtt(struct i915_address_space *vm)
 	return vm == ggtt;
 }
 
+static inline bool i915_is_aliasing_ppgtt(struct i915_address_space *vm)
+{
+	struct i915_address_space *appgtt =
+		&((struct drm_i915_private *)(vm)->dev->dev_private)->mm.aliasing_ppgtt->base;
+	return vm == appgtt;
+}
+
 static inline bool i915_gem_obj_ggtt_bound(struct drm_i915_gem_object *obj)
 {
 	return i915_gem_obj_bound(obj, obj_to_ggtt(obj));
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 59c5173..7d81904 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -293,7 +293,7 @@ create_vm_for_ctx(struct drm_device *dev, struct intel_context *ctx)
 	if (!ppgtt)
 		return ERR_PTR(-ENOMEM);
 
-	ret = i915_gem_init_ppgtt(dev, ppgtt);
+	ret = i915_gem_init_ppgtt(dev, ppgtt, ctx->file_priv == NULL);
 	if (ret) {
 		kfree(ppgtt);
 		return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index c3df359..14fc8f2 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1056,10 +1056,47 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 static int gen6_alloc_va_range(struct i915_address_space *vm,
 			       uint64_t start, uint64_t length)
 {
+	DECLARE_BITMAP(new_page_tables, I915_PDES_PER_PD);
+	struct drm_device *dev = vm->dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct i915_hw_ppgtt *ppgtt =
 		        container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagetab *pt;
+	const uint32_t start_save = start, length_save = length;
 	uint32_t pde, temp;
+	int ret;
+
+	BUG_ON(upper_32_bits(start));
+
+	bitmap_zero(new_page_tables, I915_PDES_PER_PD);
+
+	/* The allocation is done in two stages so that we can bail out with
+	 * minimal amount of pain. The first stage finds new page tables that
+	 * need allocation. The second stage marks use ptes within the page
+	 * tables.
+	 */
+	gen6_for_each_pde(pt, &ppgtt->pd, start, length, temp, pde) {
+		if (pt != ppgtt->scratch_pt) {
+			WARN_ON(bitmap_empty(pt->used_ptes, GEN6_PTES_PER_PT));
+			continue;
+		}
+
+		/* We've already allocated a page table */
+		WARN_ON(!bitmap_empty(pt->used_ptes, GEN6_PTES_PER_PT));
+
+		pt = alloc_pt_single(dev);
+		if (IS_ERR(pt)) {
+			ret = PTR_ERR(pt);
+			goto unwind_out;
+		}
+
+		ppgtt->pd.page_tables[pde] = pt;
+		set_bit(pde, new_page_tables);
+		trace_i915_pagetable_alloc(vm, pde, start, GEN6_PDE_SHIFT);
+	}
+
+	start = start_save;
+	length = length_save;
 
 	gen6_for_each_pde(pt, &ppgtt->pd, start, length, temp, pde) {
 		int j;
@@ -1077,11 +1114,32 @@ static int gen6_alloc_va_range(struct i915_address_space *vm,
 			}
 		}
 
-		bitmap_or(pt->used_ptes, pt->used_ptes, tmp_bitmap,
+		if (test_and_clear_bit(pde, new_page_tables))
+			gen6_map_single(&ppgtt->pd, pde, pt);
+
+		trace_i915_pagetable_map(vm, pde, pt,
+					 gen6_pte_index(start),
+					 gen6_pte_count(start, length),
+					 GEN6_PTES_PER_PT);
+		bitmap_or(pt->used_ptes, tmp_bitmap, pt->used_ptes,
 			  GEN6_PTES_PER_PT);
 	}
 
+	WARN_ON(!bitmap_empty(new_page_tables, I915_PDES_PER_PD));
+
+	/* Make sure write is complete before other code can use this page
+	 * table. Also require for WC mapped PTEs */
+	readl(dev_priv->gtt.gsm);
+
 	return 0;
+
+unwind_out:
+	for_each_set_bit(pde, new_page_tables, I915_PDES_PER_PD) {
+		struct i915_pagetab *pt = ppgtt->pd.page_tables[pde];
+		ppgtt->pd.page_tables[pde] = NULL;
+		free_pt_single(pt, vm->dev);
+	}
+	return ret;
 }
 
 static void gen6_teardown_va_range(struct i915_address_space *vm,
@@ -1093,8 +1151,27 @@ static void gen6_teardown_va_range(struct i915_address_space *vm,
 	uint32_t pde, temp;
 
 	gen6_for_each_pde(pt, &ppgtt->pd, start, length, temp, pde) {
+
+		if (WARN(pt == ppgtt->scratch_pt,
+		    "Tried to teardown scratch page vm %p. pde %u: %llx-%llx\n",
+		    vm, pde, start, start + length))
+			continue;
+
+		trace_i915_pagetable_unmap(vm, pde, pt,
+					   gen6_pte_index(start),
+					   gen6_pte_count(start, length),
+					   GEN6_PTES_PER_PT);
+
 		bitmap_clear(pt->used_ptes, gen6_pte_index(start),
 			     gen6_pte_count(start, length));
+
+		if (bitmap_empty(pt->used_ptes, GEN6_PTES_PER_PT)) {
+			trace_i915_pagetable_destroy(vm, pde,
+						     start & GENMASK_ULL(63, GEN6_PDE_SHIFT),
+						     GEN6_PDE_SHIFT);
+			gen6_map_single(&ppgtt->pd, pde, ppgtt->scratch_pt);
+			ppgtt->pd.page_tables[pde] = ppgtt->scratch_pt;
+		}
 	}
 }
 
@@ -1102,9 +1179,13 @@ static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	int i;
 
-	for (i = 0; i < ppgtt->num_pd_entries; i++)
-		free_pt_single(ppgtt->pd.page_tables[i], ppgtt->base.dev);
+	for (i = 0; i < ppgtt->num_pd_entries; i++) {
+		struct i915_pagetab *pt = ppgtt->pd.page_tables[i];
+		if (pt != ppgtt->scratch_pt)
+			free_pt_single(ppgtt->pd.page_tables[i], ppgtt->base.dev);
+	}
 
+	/* Consider putting this as part of pd free. */
 	free_pt_scratch(ppgtt->scratch_pt, ppgtt->base.dev);
 	free_pd_single(&ppgtt->pd, ppgtt->base.dev);
 }
@@ -1170,7 +1251,7 @@ err_out:
 	return ret;
 }
 
-static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
+static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt, bool preallocate_pt)
 {
 	int ret;
 
@@ -1178,9 +1259,13 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
+	if (!preallocate_pt)
+		return 0;
+
 	ret = alloc_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries,
 			     ppgtt->base.dev);
 	if (ret) {
+		free_pt_scratch(ppgtt->scratch_pt, ppgtt->base.dev);
 		drm_mm_remove_node(&ppgtt->node);
 		return ret;
 	}
@@ -1188,8 +1273,17 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt)
 	return 0;
 }
 
+static void gen6_scratch_va_range(struct i915_hw_ppgtt *ppgtt,
+				  uint64_t start, uint64_t length)
+{
+	struct i915_pagetab *unused;
+	uint32_t pde, temp;
+
+	gen6_for_each_pde(unused, &ppgtt->pd, start, length, temp, pde)
+		ppgtt->pd.page_tables[pde] = ppgtt->scratch_pt;
+}
 
-static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
+static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt, bool aliasing)
 {
 	struct drm_device *dev = ppgtt->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
@@ -1208,7 +1302,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	} else
 		BUG();
 
-	ret = gen6_ppgtt_alloc(ppgtt);
+	ret = gen6_ppgtt_alloc(ppgtt, aliasing);
 	if (ret)
 		return ret;
 
@@ -1227,6 +1321,9 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	ppgtt->pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm +
 		ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
 
+	if (!aliasing)
+		gen6_scratch_va_range(ppgtt, 0, ppgtt->base.total);
+
 	gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->base.total);
 
 	DRM_DEBUG_DRIVER("Allocated pde space (%ldM) at GTT entry: %lx\n",
@@ -1236,7 +1333,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	return 0;
 }
 
-int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt)
+int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt, bool aliasing)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	int ret = 0;
@@ -1245,7 +1342,7 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt)
 	ppgtt->base.scratch = dev_priv->gtt.base.scratch;
 
 	if (INTEL_INFO(dev)->gen < 8)
-		ret = gen6_ppgtt_init(ppgtt);
+		ret = gen6_ppgtt_init(ppgtt, aliasing);
 	else if (IS_GEN8(dev))
 		ret = gen8_ppgtt_init(ppgtt, dev_priv->gtt.base.total);
 	else
@@ -1283,6 +1380,8 @@ ppgtt_bind_vma(struct i915_vma *vma,
 		flags |= PTE_READ_ONLY;
 
 	if (vma->vm->allocate_va_range) {
+		trace_i915_va_alloc(vma->vm, vma->node.start, vma->node.size,
+				    VM_TO_TRACE_NAME(vma->vm));
 		ret = vma->vm->allocate_va_range(vma->vm,
 						 vma->node.start,
 						 vma->node.size);
@@ -1305,6 +1404,10 @@ static void ppgtt_unbind_vma(struct i915_vma *vma)
 			     vma->obj->base.size,
 			     true);
 	if (vma->vm->teardown_va_range) {
+		trace_i915_va_teardown(vma->vm,
+				       vma->node.start, vma->node.size,
+				       VM_TO_TRACE_NAME(vma->vm));
+
 		vma->vm->teardown_va_range(vma->vm,
 					   vma->node.start, vma->node.size);
 		ppgtt_invalidate_tlbs(vma->vm);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index c89930d..73968a3 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -430,7 +430,7 @@ static inline size_t gtt_total_entries(struct i915_gtt *gtt)
 	return gtt->base.total >> PAGE_SHIFT;
 }
 
-int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt);
+int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt, bool aliasing);
 
 void i915_check_and_clear_faults(struct drm_device *dev);
 void i915_gem_suspend_gtt_mappings(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index cbf5521..2d21c54 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -156,6 +156,122 @@ TRACE_EVENT(i915_vma_unbind,
 		      __entry->obj, __entry->offset, __entry->size, __entry->vm)
 );
 
+#define VM_TO_TRACE_NAME(vm) \
+	(i915_is_ggtt(vm) ? "GGTT" : \
+	 i915_is_aliasing_ppgtt(vm) ? "Aliasing PPGTT" : \
+				      "Private VM")
+
+DECLARE_EVENT_CLASS(i915_va,
+	TP_PROTO(struct i915_address_space *vm, u64 start, u64 length, const char *name),
+	TP_ARGS(vm, start, length, name),
+
+	TP_STRUCT__entry(
+		__field(struct i915_address_space *, vm)
+		__field(u64, start)
+		__field(u64, end)
+		__string(name, name)
+	),
+
+	TP_fast_assign(
+		__entry->vm = vm;
+		__entry->start = start;
+		__entry->end = start + length;
+		__assign_str(name, name);
+	),
+
+	TP_printk("vm=%p (%s), 0x%llx-0x%llx",
+		  __entry->vm, __get_str(name),  __entry->start, __entry->end)
+);
+
+DEFINE_EVENT(i915_va, i915_va_alloc,
+	     TP_PROTO(struct i915_address_space *vm, u64 start, u64 length, const char *name),
+	     TP_ARGS(vm, start, length, name)
+);
+
+DEFINE_EVENT(i915_va, i915_va_teardown,
+	     TP_PROTO(struct i915_address_space *vm, u64 start, u64 length, const char *name),
+	     TP_ARGS(vm, start, length, name)
+);
+
+DECLARE_EVENT_CLASS(i915_pagetable,
+	TP_PROTO(struct i915_address_space *vm, u32 pde, u64 start, u64 pde_shift),
+	TP_ARGS(vm, pde, start, pde_shift),
+
+	TP_STRUCT__entry(
+		__field(struct i915_address_space *, vm)
+		__field(u32, pde)
+		__field(u64, start)
+		__field(u64, end)
+	),
+
+	TP_fast_assign(
+		__entry->vm = vm;
+		__entry->pde = pde;
+		__entry->start = start;
+		__entry->end = (start + (1ULL << pde_shift)) & ~((1ULL << pde_shift)-1);
+	),
+
+	TP_printk("vm=%p, pde=%d (0x%llx-0x%llx)",
+		  __entry->vm, __entry->pde, __entry->start, __entry->end)
+);
+
+DEFINE_EVENT(i915_pagetable, i915_pagetable_alloc,
+	     TP_PROTO(struct i915_address_space *vm, u32 pde, u64 start, u64 pde_shift),
+	     TP_ARGS(vm, pde, start, pde_shift)
+);
+
+DEFINE_EVENT(i915_pagetable, i915_pagetable_destroy,
+	     TP_PROTO(struct i915_address_space *vm, u32 pde, u64 start, u64 pde_shift),
+	     TP_ARGS(vm, pde, start, pde_shift)
+);
+
+/* Avoid extra math because we only support two sizes. The format is defined by
+ * bitmap_scnprintf. Each 32 bits is 8 HEX digits followed by comma */
+#define TRACE_PT_SIZE(bits) \
+	((((bits) == 1024) ? 288 : 144) + 1)
+
+DECLARE_EVENT_CLASS(i915_pagetable_update,
+	TP_PROTO(struct i915_address_space *vm, u32 pde,
+		 struct i915_pagetab *pt, u32 first, u32 len, size_t bits),
+	TP_ARGS(vm, pde, pt, first, len, bits),
+
+	TP_STRUCT__entry(
+		__field(struct i915_address_space *, vm)
+		__field(u32, pde)
+		__field(u32, first)
+		__field(u32, last)
+		__dynamic_array(char, cur_ptes, TRACE_PT_SIZE(bits))
+	),
+
+	TP_fast_assign(
+		__entry->vm = vm;
+		__entry->pde = pde;
+		__entry->first = first;
+		__entry->last = first + len;
+
+		bitmap_scnprintf(__get_str(cur_ptes),
+				 TRACE_PT_SIZE(bits),
+				 pt->used_ptes,
+				 bits);
+	),
+
+	TP_printk("vm=%p, pde=%d, updating %u:%u\t%s",
+		  __entry->vm, __entry->pde, __entry->last, __entry->first,
+		  __get_str(cur_ptes))
+);
+
+DEFINE_EVENT(i915_pagetable_update, i915_pagetable_map,
+	TP_PROTO(struct i915_address_space *vm, u32 pde,
+		 struct i915_pagetab *pt, u32 first, u32 len, size_t bits),
+	TP_ARGS(vm, pde, pt, first, len, bits)
+);
+
+DEFINE_EVENT(i915_pagetable_update, i915_pagetable_unmap,
+	TP_PROTO(struct i915_address_space *vm, u32 pde,
+		 struct i915_pagetab *pt, u32 first, u32 len, size_t bits),
+	TP_ARGS(vm, pde, pt, first, len, bits)
+);
+
 TRACE_EVENT(i915_gem_object_change_domain,
 	    TP_PROTO(struct drm_i915_gem_object *obj, u32 old_read, u32 old_write),
 	    TP_ARGS(obj, old_read, old_write),
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 46/68] drm/i915/bdw: Use dynamic allocation idioms on free
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (44 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 45/68] drm/i915: Finish gen6/7 dynamic page table allocation Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 47/68] drm/i915/bdw: pagedirs rework allocation Ben Widawsky
                   ` (26 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The page directory freer is left here for now as it's still useful given
that GEN8 still preallocates. Once the allocation functions are broken
up into more discrete chunks, we'll follow suit and destroy this
leftover piece.
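
Since the new teardown walks pdpes with gen8_clamp_pd, here is a
self-contained sketch of the clamping, assuming a 1GB-per-pagedir
stand-in shift rather than the driver's GEN8_PDPE_SHIFT.

#include <stdint.h>
#include <stdio.h>

#define PDPE_SHIFT 30	/* stand-in: 1GB of address space per pagedir */

/* Clamp a length so one iteration never crosses a pagedir boundary. */
static uint64_t clamp_pd(uint64_t start, uint64_t length)
{
	uint64_t next_pd = (start + 1 + (1ULL << PDPE_SHIFT) - 1) &
			   ~((1ULL << PDPE_SHIFT) - 1);

	if (next_pd > start + length)
		return length;
	return next_pd - start;
}

int main(void)
{
	/* A 2GB range starting 256MB into the first pagedir is split
	 * into a 768MB chunk, then pagedir-sized chunks. */
	printf("%llu MB\n", (unsigned long long)
	       (clamp_pd(256ULL << 20, 2ULL << 30) >> 20));
	return 0;
}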

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 45 ++++++++++++++++++++++++-------------
 drivers/gpu/drm/i915/i915_gem_gtt.h | 26 +++++++++++++++++++++
 2 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 14fc8f2..65b1c58 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -523,27 +523,40 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	}
 }
 
-static void gen8_free_page_tables(struct i915_pagedir *pd, struct drm_device *dev)
+static void gen8_teardown_va_range(struct i915_address_space *vm,
+				   uint64_t start, uint64_t length)
 {
-	int i;
-
-	if (!pd->page)
-		return;
-
-	for (i = 0; i < I915_PDES_PER_PD; i++) {
-		free_pt_single(pd->page_tables[i], dev);
-		pd->page_tables[i] = NULL;
+	struct i915_hw_ppgtt *ppgtt =
+		        container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagedir *pd;
+	struct i915_pagetab *pt;
+	uint64_t temp;
+	uint32_t pdpe, pde;
+
+	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+		uint64_t pd_len = gen8_clamp_pd(start, length);
+		uint64_t pd_start = start;
+		gen8_for_each_pde(pt, pd, pd_start, pd_len, temp, pde) {
+			free_pt_single(pt, vm->dev);
+		}
+		free_pd_single(pd, vm->dev);
 	}
 }
 
-static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
+/* This function will die soon */
+static void gen8_free_full_pagedir(struct i915_hw_ppgtt *ppgtt, int i)
 {
-	int i;
+	gen8_teardown_va_range(&ppgtt->base,
+			       i << GEN8_PDPE_SHIFT,
+			       (1 << GEN8_PDPE_SHIFT));
+}
 
-	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		gen8_free_page_tables(ppgtt->pdp.pagedirs[i], ppgtt->base.dev);
-		free_pd_single(ppgtt->pdp.pagedirs[i], ppgtt->base.dev);
-	}
+static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
+{
+	trace_i915_va_teardown(&ppgtt->base,
+			       ppgtt->base.start, ppgtt->base.total,
+			       VM_TO_TRACE_NAME(&ppgtt->base));
+	gen8_teardown_va_range(&ppgtt->base,
+			       ppgtt->base.start, ppgtt->base.total);
 }
 
 static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
@@ -572,7 +585,7 @@ static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
 
 unwind_out:
 	while (i--)
-		gen8_free_page_tables(ppgtt->pdp.pagedirs[i], ppgtt->base.dev);
+		gen8_free_full_pagedir(ppgtt, i);
 
 	return -ENOMEM;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 73968a3..2cc2e87 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -400,6 +400,32 @@ static inline size_t gen6_pde_count(uint32_t addr, uint32_t length)
 	return i915_pde_count(addr, length, GEN6_PDE_SHIFT);
 }
 
+#define gen8_for_each_pde(pt, pd, start, length, temp, iter)		\
+	for (iter = gen8_pde_index(start), pt = (pd)->page_tables[iter]; \
+	     length > 0 && iter < I915_PDES_PER_PD;			\
+	     pt = (pd)->page_tables[++iter],				\
+	     temp = ALIGN(start+1, 1 << GEN8_PDE_SHIFT) - start,	\
+	     temp = min(temp, length),					\
+	     start += temp, length -= temp)
+
+#define gen8_for_each_pdpe(pd, pdp, start, length, temp, iter)		\
+	for (iter = gen8_pdpe_index(start), pd = (pdp)->pagedirs[iter];	\
+	     length > 0 && iter < GEN8_LEGACY_PDPES;			\
+	     pd = (pdp)->pagedirs[++iter],				\
+	     temp = ALIGN(start+1, 1 << GEN8_PDPE_SHIFT) - start,	\
+	     temp = min(temp, length),					\
+	     start += temp, length -= temp)
+
+/* Clamp length to the next pagedir boundary */
+static inline uint64_t gen8_clamp_pd(uint64_t start, uint64_t length)
+{
+	uint64_t next_pd = ALIGN(start + 1, 1 << GEN8_PDPE_SHIFT);
+	if (next_pd > (start + length))
+		return length;
+
+	return next_pd - start;
+}
+
 static inline uint32_t gen8_pte_index(uint64_t address)
 {
 	return i915_pte_index(address, GEN8_PDE_SHIFT);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 47/68] drm/i915/bdw: pagedirs rework allocation
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (45 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 46/68] drm/i915/bdw: Use dynamic allocation idioms on free Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 48/68] drm/i915/bdw: pagetable allocation rework Ben Widawsky
                   ` (25 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 43 ++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 65b1c58..5447a99 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -538,8 +538,10 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 		uint64_t pd_start = start;
 		gen8_for_each_pde(pt, pd, pd_start, pd_len, temp, pde) {
 			free_pt_single(pt, vm->dev);
+			pd->page_tables[pde] = NULL;
 		}
 		free_pd_single(pd, vm->dev);
+		ppgtt->pdp.pagedirs[pdpe] = NULL;
 	}
 }
 
@@ -590,26 +592,40 @@ unwind_out:
 	return -ENOMEM;
 }
 
-static int gen8_ppgtt_allocate_page_directories(struct i915_hw_ppgtt *ppgtt,
-						const int max_pdp)
+static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
+				     uint64_t start,
+				     uint64_t length)
 {
-	int i;
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(pdp, struct i915_hw_ppgtt, pdp);
+	struct i915_pagedir *unused;
+	uint64_t temp;
+	uint32_t pdpe;
 
-	for (i = 0; i < max_pdp; i++) {
-		ppgtt->pdp.pagedirs[i] = alloc_pd_single(ppgtt->base.dev);
-		if (IS_ERR(ppgtt->pdp.pagedirs[i]))
+	/* FIXME: PPGTT container_of won't work for 64b */
+	BUG_ON((start + length) > 0x800000000ULL);
+
+	gen8_for_each_pdpe(unused, pdp, start, length, temp, pdpe) {
+		BUG_ON(unused);
+		pdp->pagedirs[pdpe] = alloc_pd_single(ppgtt->base.dev);
+		if (IS_ERR(ppgtt->pdp.pagedirs[pdpe]))
 			goto unwind_out;
+
+		ppgtt->num_pd_pages++;
 	}
 
-	ppgtt->num_pd_pages = max_pdp;
 	BUG_ON(ppgtt->num_pd_pages > GEN8_LEGACY_PDPES);
 
 	return 0;
 
 unwind_out:
-	while (i--)
-		free_pd_single(ppgtt->pdp.pagedirs[i],
+	while (pdpe--) {
+		free_pd_single(ppgtt->pdp.pagedirs[pdpe],
 			       ppgtt->base.dev);
+		ppgtt->num_pd_pages--;
+	}
+
+	WARN_ON(ppgtt->num_pd_pages);
 
 	return -ENOMEM;
 }
@@ -619,7 +635,8 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 {
 	int ret;
 
-	ret = gen8_ppgtt_allocate_page_directories(ppgtt, max_pdp);
+	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, ppgtt->base.start,
+					ppgtt->base.total);
 	if (ret)
 		return ret;
 
@@ -656,6 +673,10 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	if (size % (1<<30))
 		DRM_INFO("Pages will be wasted unless GTT size (%llu) is divisible by 1GB\n", size);
 
+	ppgtt->base.start = 0;
+	ppgtt->base.total = size;
+	BUG_ON(ppgtt->base.total == 0);
+
 	/* 1. Do all our allocations for page directories and page tables. */
 	ret = gen8_ppgtt_alloc(ppgtt, max_pdp);
 	if (ret)
@@ -689,8 +710,6 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
 	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
-	ppgtt->base.start = 0;
-	ppgtt->base.total = ppgtt->num_pd_entries * GEN8_PTES_PER_PT * PAGE_SIZE;
 
 	DRM_DEBUG_DRIVER("Allocated %d pages for page directories (%d wasted)\n",
 			 ppgtt->num_pd_pages, ppgtt->num_pd_pages - max_pdp);
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 48/68] drm/i915/bdw: pagetable allocation rework
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (46 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 47/68] drm/i915/bdw: pagedirs rework allocation Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 49/68] drm/i915/bdw: Make the pdp switch a bit less hacky Ben Widawsky
                   ` (24 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 54 ++++++++++++++++++++-----------------
 drivers/gpu/drm/i915/i915_gem_gtt.h | 10 +++++++
 2 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 5447a99..0426222 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -545,14 +545,6 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 	}
 }
 
-/* This function will die soon */
-static void gen8_free_full_pagedir(struct i915_hw_ppgtt *ppgtt, int i)
-{
-	gen8_teardown_va_range(&ppgtt->base,
-			       i << GEN8_PDPE_SHIFT,
-			       (1 << GEN8_PDPE_SHIFT));
-}
-
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	trace_i915_va_teardown(&ppgtt->base,
@@ -572,22 +564,27 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 	gen8_ppgtt_free(ppgtt);
 }
 
-static int gen8_ppgtt_allocate_page_tables(struct i915_hw_ppgtt *ppgtt)
+static int gen8_ppgtt_alloc_pagetabs(struct i915_pagedir *pd,
+				     uint64_t start,
+				     uint64_t length,
+				     struct drm_device *dev)
 {
-	int i, ret;
+	struct i915_pagetab *unused;
+	uint64_t temp;
+	uint32_t pde;
 
-	for (i = 0; i < ppgtt->num_pd_pages; i++) {
-		ret = alloc_pt_range(ppgtt->pdp.pagedirs[i],
-				     0, I915_PDES_PER_PD, ppgtt->base.dev);
-		if (ret)
+	gen8_for_each_pde(unused, pd, start, length, temp, pde) {
+		BUG_ON(unused);
+		pd->page_tables[pde] = alloc_pt_single(dev);
+		if (IS_ERR(pd->page_tables[pde]))
 			goto unwind_out;
 	}
 
 	return 0;
 
 unwind_out:
-	while (i--)
-		gen8_free_full_pagedir(ppgtt, i);
+	while (pde--)
+		free_pt_single(pd->page_tables[pde], dev);
 
 	return -ENOMEM;
 }
@@ -631,20 +628,28 @@ unwind_out:
 }
 
 static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
-			    const int max_pdp)
+			    uint64_t start,
+			    uint64_t length)
 {
+	struct i915_pagedir *pd;
+	uint64_t temp;
+	uint32_t pdpe;
 	int ret;
 
-	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, ppgtt->base.start,
-					ppgtt->base.total);
+	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, start, length);
 	if (ret)
 		return ret;
 
-	ret = gen8_ppgtt_allocate_page_tables(ppgtt);
-	if (ret)
-		goto err_out;
+	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+		ret = gen8_ppgtt_alloc_pagetabs(pd, start, length,
+						ppgtt->base.dev);
+		if (ret)
+			goto err_out;
+
+		ppgtt->num_pd_entries += I915_PDES_PER_PD;
+	}
 
-	ppgtt->num_pd_entries = max_pdp * I915_PDES_PER_PD;
+	BUG_ON(pdpe > ppgtt->num_pd_pages);
 
 	return 0;
 
@@ -675,10 +680,9 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 
 	ppgtt->base.start = 0;
 	ppgtt->base.total = size;
-	BUG_ON(ppgtt->base.total == 0);
 
 	/* 1. Do all our allocations for page directories and page tables. */
-	ret = gen8_ppgtt_alloc(ppgtt, max_pdp);
+	ret = gen8_ppgtt_alloc(ppgtt, ppgtt->base.start, ppgtt->base.total);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 2cc2e87..d635528 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -416,6 +416,16 @@ static inline size_t gen6_pde_count(uint32_t addr, uint32_t length)
 	     temp = min(temp, length),					\
 	     start += temp, length -= temp)
 
+/* Clamp length to the next pagetab boundary */
+static inline uint64_t gen8_clamp_pt(uint64_t start, uint64_t length)
+{
+	uint64_t next_pt = ALIGN(start + 1, 1 << GEN8_PDE_SHIFT);
+	if (next_pt > (start + length))
+		return length;
+
+	return next_pt - start;
+}
+
 /* Clamp length to the next pagedir boundary */
 static inline uint64_t gen8_clamp_pd(uint64_t start, uint64_t length)
 {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 49/68] drm/i915/bdw: Make the pdp switch a bit less hacky
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (47 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 48/68] drm/i915/bdw: pagetable allocation rework Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 50/68] drm/i915: num_pd_pages/num_pd_entries isn't useful Ben Widawsky
                   ` (23 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

One important part of this patch is that we now write a scratch page
directory into any unused PDP descriptors. This matters for two
reasons: first, it's not clear we're allowed to just use 0 or an
invalid pointer, and second, we must wipe out any previous contents
from the last context.

The latter point only matters with full PPGTT. The former point would
only affect 32b platforms, or platforms with less than 4GB of memory.
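
In code terms, the switch now loads each PDP slot with either the real
page directory's address or the scratch one; a condensed sketch, using
the same names as the diff below:

	struct i915_pagedir *pd = ppgtt->pdp.pagedirs[i];
	dma_addr_t pd_daddr = pd ? pd->daddr : ppgtt->scratch_pd->daddr;
	I915_WRITE(GEN8_RING_PDP_UDW(ring, i), upper_32_bits(pd_daddr));
	I915_WRITE(GEN8_RING_PDP_LDW(ring, i), lower_32_bits(pd_daddr));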

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 34 +++++++++++++++++++++-------------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  5 ++++-
 2 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 0426222..5875071 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -382,8 +382,10 @@ static struct i915_pagedir *alloc_pd_single(struct drm_device *dev)
 }
 
 /* Broadwell Page Directory Pointer Descriptors */
-static int gen8_write_pdp(struct intel_engine_cs *ring, unsigned entry,
-			   uint64_t val, bool synchronous)
+static int gen8_write_pdp(struct intel_engine_cs *ring,
+			  unsigned entry,
+			  dma_addr_t addr,
+			  bool synchronous)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	int ret;
@@ -391,8 +393,8 @@ static int gen8_write_pdp(struct intel_engine_cs *ring, unsigned entry,
 	BUG_ON(entry >= 4);
 
 	if (synchronous) {
-		I915_WRITE(GEN8_RING_PDP_UDW(ring, entry), val >> 32);
-		I915_WRITE(GEN8_RING_PDP_LDW(ring, entry), (u32)val);
+		I915_WRITE(GEN8_RING_PDP_UDW(ring, entry), upper_32_bits(addr));
+		I915_WRITE(GEN8_RING_PDP_LDW(ring, entry), lower_32_bits(addr));
 		return 0;
 	}
 
@@ -402,10 +404,10 @@ static int gen8_write_pdp(struct intel_engine_cs *ring, unsigned entry,
 
 	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
 	intel_ring_emit(ring, GEN8_RING_PDP_UDW(ring, entry));
-	intel_ring_emit(ring, (u32)(val >> 32));
+	intel_ring_emit(ring, upper_32_bits(addr));
 	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
 	intel_ring_emit(ring, GEN8_RING_PDP_LDW(ring, entry));
-	intel_ring_emit(ring, (u32)(val));
+	intel_ring_emit(ring, lower_32_bits(addr));
 	intel_ring_advance(ring);
 
 	return 0;
@@ -417,12 +419,12 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 {
 	int i, ret;
 
-	/* bit of a hack to find the actual last used pd */
-	int used_pd = ppgtt->num_pd_entries / I915_PDES_PER_PD;
-
-	for (i = used_pd - 1; i >= 0; i--) {
-		dma_addr_t addr = ppgtt->pdp.pagedirs[i]->daddr;
-		ret = gen8_write_pdp(ring, i, addr, synchronous);
+	for (i = GEN8_LEGACY_PDPES - 1; i >= 0; i--) {
+		struct i915_pagedir *pd = ppgtt->pdp.pagedirs[i];
+		dma_addr_t pd_daddr = pd ? pd->daddr : ppgtt->scratch_pd->daddr;
+		/* The page directory might be NULL, but we need to clear out
+		 * whatever the previous context might have used. */
+		ret = gen8_write_pdp(ring, i, pd_daddr, synchronous);
 		if (ret)
 			return ret;
 	}
@@ -681,10 +683,16 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.start = 0;
 	ppgtt->base.total = size;
 
+	ppgtt->scratch_pd = alloc_pt_scratch(ppgtt->base.dev);
+	if (IS_ERR(ppgtt->scratch_pd))
+		return PTR_ERR(ppgtt->scratch_pd);
+
 	/* 1. Do all our allocations for page directories and page tables. */
 	ret = gen8_ppgtt_alloc(ppgtt, ppgtt->base.start, ppgtt->base.total);
-	if (ret)
+	if (ret) {
+		free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
 		return ret;
+	}
 
 	/*
 	 * 2. Map all the page directory entires to point to the page tables
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index d635528..3c28365 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -272,7 +272,10 @@ struct i915_hw_ppgtt {
 		struct i915_pagedir pd;
 	};
 
-	struct i915_pagetab *scratch_pt;
+	union {
+		struct i915_pagetab *scratch_pt;
+		struct i915_pagetab *scratch_pd; /* Just need the daddr */
+	};
 
 	struct intel_context *ctx;
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 50/68] drm/i915: num_pd_pages/num_pd_entries isn't useful
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (48 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 49/68] drm/i915/bdw: Make the pdp switch a bit less hacky Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 51/68] drm/i915: Extract PPGTT param from pagedir alloc Ben Widawsky
                   ` (22 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

These values stop being useful once the page tables are allocated
dynamically. Getting rid of them will help prevent later confusion.

TODO: this probably needs to be earlier in the series
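
For illustration, the debugfs count now walks the page tables directly
instead of trusting a counter; a sketch using the macro introduced in
the diff below:

	gen6_for_all_pdes(pt, ppgtt, pde) {
		if (pt != ppgtt->scratch_pt)
			cnt++;	/* count only real page tables */
	}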

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 11 ++++-----
 drivers/gpu/drm/i915/i915_gem_gtt.c | 45 ++++++++++---------------------------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  7 ++++--
 3 files changed, 21 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 47baf50..09a64e6 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1813,13 +1813,12 @@ static int i915_swizzle_info(struct seq_file *m, void *data)
 
 static size_t gen6_ppgtt_count_pt_pages(struct i915_hw_ppgtt *ppgtt)
 {
-	struct i915_pagedir *pd = &ppgtt->pd;
-	struct i915_pagetab **pt = &pd->page_tables[0];
+	struct i915_pagetab *pt;
 	size_t cnt = 0;
-	int i;
+	uint32_t useless;
 
-	for (i = 0; i < ppgtt->num_pd_entries; i++) {
-		if (pt[i] != ppgtt->scratch_pt)
+	gen6_for_all_pdes(pt, ppgtt, useless) {
+		if (pt != ppgtt->scratch_pt)
 			cnt++;
 	}
 
@@ -1842,8 +1841,6 @@ static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev, int verb
 	if (!ppgtt)
 		return;
 
-	seq_printf(m, "Page directories: %d\n", ppgtt->num_pd_pages);
-	seq_printf(m, "Page tables: %d\n", ppgtt->num_pd_entries);
 	for_each_ring(ring, dev_priv, unused) {
 		seq_printf(m, "%s\n", ring->name);
 		for (i = 0; i < 4; i++) {
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 5875071..d8e0a62 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -609,22 +609,14 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
 		pdp->pagedirs[pdpe] = alloc_pd_single(ppgtt->base.dev);
 		if (IS_ERR(ppgtt->pdp.pagedirs[pdpe]))
 			goto unwind_out;
-
-		ppgtt->num_pd_pages++;
 	}
 
-	BUG_ON(ppgtt->num_pd_pages > GEN8_LEGACY_PDPES);
-
 	return 0;
 
 unwind_out:
-	while (pdpe--) {
+	while (pdpe--)
 		free_pd_single(ppgtt->pdp.pagedirs[pdpe],
 			       ppgtt->base.dev);
-		ppgtt->num_pd_pages--;
-	}
-
-	WARN_ON(ppgtt->num_pd_pages);
 
 	return -ENOMEM;
 }
@@ -647,12 +639,8 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 						ppgtt->base.dev);
 		if (ret)
 			goto err_out;
-
-		ppgtt->num_pd_entries += I915_PDES_PER_PD;
 	}
 
-	BUG_ON(pdpe > ppgtt->num_pd_pages);
-
 	return 0;
 
 	/* TODO: Check this for all cases */
@@ -674,7 +662,6 @@ err_out:
 static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 {
 	const int max_pdp = DIV_ROUND_UP(size, 1 << 30);
-	const int min_pt_pages = I915_PDES_PER_PD * max_pdp;
 	int i, j, ret;
 
 	if (size % (1<<30))
@@ -723,27 +710,21 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
 
-	DRM_DEBUG_DRIVER("Allocated %d pages for page directories (%d wasted)\n",
-			 ppgtt->num_pd_pages, ppgtt->num_pd_pages - max_pdp);
-	DRM_DEBUG_DRIVER("Allocated %d pages for page tables (%lld wasted)\n",
-			 ppgtt->num_pd_entries,
-			 (ppgtt->num_pd_entries - min_pt_pages) + size % (1<<30));
 	return 0;
 }
 
 static void gen6_dump_ppgtt(struct i915_hw_ppgtt *ppgtt, struct seq_file *m)
 {
 	struct i915_address_space *vm = &ppgtt->base;
+	struct i915_pagetab *unused;
 	gen6_gtt_pte_t scratch_pte;
 	uint32_t pd_entry;
-	int pte, pde;
+	uint32_t pte, pde, temp;
+	uint32_t start = ppgtt->base.start, length = ppgtt->base.total;
 
 	scratch_pte = vm->pte_encode(vm->scratch.addr, I915_CACHE_LLC, true, 0);
 
-	seq_printf(m, "  VM %p (pd_offset %x-%x):\n", vm,
-		   ppgtt->pd.pd_offset,
-		   ppgtt->pd.pd_offset + ppgtt->num_pd_entries);
-	for (pde = 0; pde < ppgtt->num_pd_entries; pde++) {
+	gen6_for_each_pde(unused, &ppgtt->pd, start, length, temp, pde) {
 		u32 expected;
 		gen6_gtt_pte_t *pt_vaddr;
 		dma_addr_t pt_addr = ppgtt->pd.page_tables[pde]->daddr;
@@ -1221,12 +1202,12 @@ static void gen6_teardown_va_range(struct i915_address_space *vm,
 
 static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
-	int i;
+	struct i915_pagetab *pt;
+	uint32_t pde;
 
-	for (i = 0; i < ppgtt->num_pd_entries; i++) {
-		struct i915_pagetab *pt = ppgtt->pd.page_tables[i];
+	gen6_for_all_pdes(pt, ppgtt, pde) {
 		if (pt != ppgtt->scratch_pt)
-			free_pt_single(ppgtt->pd.page_tables[i], ppgtt->base.dev);
+			free_pt_single(pt, ppgtt->base.dev);
 	}
 
 	/* Consider putting this as part of pd free. */
@@ -1287,7 +1268,6 @@ alloc:
 	if (ppgtt->node.start < dev_priv->gtt.mappable_end)
 		DRM_DEBUG("Forced to use aperture for PDEs\n");
 
-	ppgtt->num_pd_entries = I915_PDES_PER_PD;
 	return 0;
 
 err_out:
@@ -1306,8 +1286,7 @@ static int gen6_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt, bool preallocate_pt)
 	if (!preallocate_pt)
 		return 0;
 
-	ret = alloc_pt_range(&ppgtt->pd, 0, ppgtt->num_pd_entries,
-			     ppgtt->base.dev);
+	ret = alloc_pt_range(&ppgtt->pd, 0, I915_PDES_PER_PD, ppgtt->base.dev);
 	if (ret) {
 		free_pt_scratch(ppgtt->scratch_pt, ppgtt->base.dev);
 		drm_mm_remove_node(&ppgtt->node);
@@ -1356,7 +1335,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt, bool aliasing)
 	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
 	ppgtt->base.start = 0;
-	ppgtt->base.total = ppgtt->num_pd_entries * GEN6_PTES_PER_PT * PAGE_SIZE;
+	ppgtt->base.total = I915_PDES_PER_PD * GEN6_PTES_PER_PT * PAGE_SIZE;
 	ppgtt->debug_dump = gen6_dump_ppgtt;
 
 	ppgtt->pd.pd_offset =
@@ -1599,7 +1578,7 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
 		if (i915_is_ggtt(vm))
 			ppgtt = dev_priv->mm.aliasing_ppgtt;
 
-		gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->num_pd_entries);
+		gen6_map_page_range(dev_priv, &ppgtt->pd, 0, I915_PDES_PER_PD);
 	}
 
 	i915_gem_chipset_flush(dev);
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 3c28365..18a0b68 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -265,8 +265,6 @@ struct i915_hw_ppgtt {
 	struct i915_address_space base;
 	struct kref ref;
 	struct drm_mm_node node;
-	unsigned num_pd_entries;
-	unsigned num_pd_pages; /* gen8+ */
 	union {
 		struct i915_pagedirpo pdp;
 		struct i915_pagedir pd;
@@ -333,6 +331,11 @@ struct i915_gtt {
 	     temp = min(temp, (unsigned)length), \
 	     start += temp, length -= temp)
 
+#define gen6_for_all_pdes(pt, ppgtt, iter)  \
+	for (iter = 0, pt = ppgtt->pd.page_tables[iter];			\
+	     iter < gen6_pde_index(ppgtt->base.total);			\
+	     pt =  ppgtt->pd.page_tables[++iter])
+
 static inline uint32_t i915_pte_index(uint64_t address, uint32_t pde_shift)
 {
 	const uint32_t mask = NUM_PTE(pde_shift) - 1;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 51/68] drm/i915: Extract PPGTT param from pagedir alloc
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (49 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 50/68] drm/i915: num_pd_pages/num_pd_entries isn't useful Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 52/68] drm/i915/bdw: Split out mappings Ben Widawsky
                   ` (21 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Now that we don't need to trace num_pd_pages, we may as well kill all
need for the PPGTT structure in alloc_pagedirs. This is very useful
when we move to 48b addressing, where the PDP is no longer the root of
the page table structure.

The param is replaced with drm_device, which is an unavoidable wart
throughout the series (in other words, this instance is no more
flagrant than the rest).
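
For reference, the construct being removed only works while the pdp is
embedded directly in the ppgtt, which stops being true once a PML4 sits
on top:

	struct i915_hw_ppgtt *ppgtt =
		container_of(pdp, struct i915_hw_ppgtt, pdp);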

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index d8e0a62..731782a 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -593,10 +593,9 @@ unwind_out:
 
 static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
 				     uint64_t start,
-				     uint64_t length)
+				     uint64_t length,
+				     struct drm_device *dev)
 {
-	struct i915_hw_ppgtt *ppgtt =
-		container_of(pdp, struct i915_hw_ppgtt, pdp);
 	struct i915_pagedir *unused;
 	uint64_t temp;
 	uint32_t pdpe;
@@ -606,8 +605,8 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
 
 	gen8_for_each_pdpe(unused, pdp, start, length, temp, pdpe) {
 		BUG_ON(unused);
-		pdp->pagedirs[pdpe] = alloc_pd_single(ppgtt->base.dev);
-		if (IS_ERR(ppgtt->pdp.pagedirs[pdpe]))
+		pdp->pagedirs[pdpe] = alloc_pd_single(dev);
+		if (IS_ERR(pdp->pagedirs[pdpe]))
 			goto unwind_out;
 	}
 
@@ -615,8 +614,7 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
 
 unwind_out:
 	while (pdpe--)
-		free_pd_single(ppgtt->pdp.pagedirs[pdpe],
-			       ppgtt->base.dev);
+		free_pd_single(pdp->pagedirs[pdpe], dev);
 
 	return -ENOMEM;
 }
@@ -630,7 +628,8 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 	uint32_t pdpe;
 	int ret;
 
-	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, start, length);
+	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, start, length,
+					ppgtt->base.dev);
 	if (ret)
 		return ret;
 
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 52/68] drm/i915/bdw: Split out mappings
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (50 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 51/68] drm/i915: Extract PPGTT param from pagedir alloc Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 53/68] drm/i915/bdw: begin bitmap tracking Ben Widawsky
                   ` (20 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

When we do dynamic page table allocations for gen8, we'll need to have
more control over how and when we map page tables, similar to gen6.

This patch adds the functionality and calls it at init, so there
should be no functional change.

The PDPEs are still a special case for now. We'll need a function for
that in the future as well.
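
With the helper split out, init maps each page directory in a simple
loop, roughly as in the diff below:

	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, size, temp, pdpe)
		gen8_map_pagetable_range(pd, start, size, ppgtt->base.dev);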

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 96 ++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 731782a..02ddac4 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -525,6 +525,36 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	}
 }
 
+static void __gen8_do_map_pt(gen8_ppgtt_pde_t *pde,
+			     struct i915_pagetab *pt,
+			     struct drm_device *dev)
+{
+	gen8_ppgtt_pde_t entry =
+		gen8_pde_encode(dev, pt->daddr, I915_CACHE_LLC);
+	*pde = entry;
+}
+
+/* It's likely we'll map more than one pagetable at a time. This function will
+ * save us unnecessary kmap calls, but do no more functionally than multiple
+ * calls to map_pt. */
+static void gen8_map_pagetable_range(struct i915_pagedir *pd,
+				     uint64_t start,
+				     uint64_t length,
+				     struct drm_device *dev)
+{
+	gen8_ppgtt_pde_t *pagedir = kmap_atomic(pd->page);
+	struct i915_pagetab *pt;
+	uint64_t temp, pde;
+
+	gen8_for_each_pde(pt, pd, start, length, temp, pde)
+		__gen8_do_map_pt(pagedir + pde, pt, dev);
+
+	if (!HAS_LLC(dev))
+		drm_clflush_virt_range(pagedir, PAGE_SIZE);
+
+	kunmap_atomic(pagedir);
+}
+
 static void gen8_teardown_va_range(struct i915_address_space *vm,
 				   uint64_t start, uint64_t length)
 {
@@ -549,8 +579,6 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
-	trace_i915_va_teardown(&ppgtt->base,
-			       ppgtt->base.start, ppgtt->base.total);
 	gen8_teardown_va_range(&ppgtt->base,
 			       ppgtt->base.start, ppgtt->base.total);
 }
@@ -619,11 +647,14 @@ unwind_out:
 	return -ENOMEM;
 }
 
-static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
-			    uint64_t start,
-			    uint64_t length)
+static int gen8_alloc_va_range(struct i915_address_space *vm,
+			       uint64_t start,
+			       uint64_t length)
 {
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagedir *pd;
+	const uint64_t orig_start = start;
 	uint64_t temp;
 	uint32_t pdpe;
 	int ret;
@@ -642,9 +673,8 @@ static int gen8_ppgtt_alloc(struct i915_hw_ppgtt *ppgtt,
 
 	return 0;
 
-	/* TODO: Check this for all cases */
 err_out:
-	gen8_ppgtt_free(ppgtt);
+	gen8_teardown_va_range(vm, orig_start, start);
 	return ret;
 }
 
@@ -654,60 +684,38 @@ err_out:
  * PDP represents 1GB of memory 4 * 512 * 512 * 4096 = 4GB legacy 32b address
  * space.
  *
- * FIXME: split allocation into smaller pieces. For now we only ever do this
- * once, but with full PPGTT, the multiple contiguous allocations will be bad.
- * TODO: Do something with the size parameter
  */
 static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 {
-	const int max_pdp = DIV_ROUND_UP(size, 1 << 30);
-	int i, j, ret;
-
-	if (size % (1<<30))
-		DRM_INFO("Pages will be wasted unless GTT size (%llu) is divisible by 1GB\n", size);
+	struct i915_pagedir *pd;
+	uint64_t temp, start = 0;
+	const uint64_t orig_length = size;
+	uint32_t pdpe;
+	int ret;
 
 	ppgtt->base.start = 0;
 	ppgtt->base.total = size;
+	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
+	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
+	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
+	ppgtt->enable = gen8_ppgtt_enable;
+	ppgtt->switch_mm = gen8_mm_switch;
 
 	ppgtt->scratch_pd = alloc_pt_scratch(ppgtt->base.dev);
 	if (IS_ERR(ppgtt->scratch_pd))
 		return PTR_ERR(ppgtt->scratch_pd);
 
-	/* 1. Do all our allocations for page directories and page tables. */
-	ret = gen8_ppgtt_alloc(ppgtt, ppgtt->base.start, ppgtt->base.total);
+	ret = gen8_alloc_va_range(&ppgtt->base, start, size);
 	if (ret) {
 		free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
 		return ret;
 	}
 
-	/*
-	 * 2. Map all the page directory entires to point to the page tables
-	 * we've allocated.
-	 *
-	 * For now, the PPGTT helper functions all require that the PDEs are
-	 * plugged in correctly. So we do that now/here. For aliasing PPGTT, we
-	 * will never need to touch the PDEs again.
-	 */
-	for (i = 0; i < max_pdp; i++) {
-		struct i915_pagedir *pd = ppgtt->pdp.pagedirs[i];
-		gen8_ppgtt_pde_t *pd_vaddr;
-		pd_vaddr = kmap_atomic(ppgtt->pdp.pagedirs[i]->page);
-		for (j = 0; j < I915_PDES_PER_PD; j++) {
-			struct i915_pagetab *pt = pd->page_tables[j];
-			dma_addr_t addr = pt->daddr;
-			pd_vaddr[j] = gen8_pde_encode(ppgtt->base.dev, addr,
-						      I915_CACHE_LLC);
-		}
-		if (!HAS_LLC(ppgtt->base.dev))
-			drm_clflush_virt_range(pd_vaddr, PAGE_SIZE);
-		kunmap_atomic(pd_vaddr);
-	}
+	start = 0;
+	size = orig_length;
 
-	ppgtt->enable = gen8_ppgtt_enable;
-	ppgtt->switch_mm = gen8_mm_switch;
-	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
-	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
-	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
+	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, size, temp, pdpe)
+		gen8_map_pagetable_range(pd, start, size, ppgtt->base.dev);
 
 	return 0;
 }
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 53/68] drm/i915/bdw: begin bitmap tracking
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (51 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 52/68] drm/i915/bdw: Split out mappings Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 54/68] drm/i915/bdw: Dynamic page table allocations Ben Widawsky
                   ` (19 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

As with gen6/7, we can enable bitmap tracking alongside all the
preallocations to make sure things don't actually blow up.
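
The tracking pattern, condensed from the diff below: set bits on
allocation, clear them on teardown, and free a structure only once its
bitmap is empty.

	bitmap_set(pt->used_ptes, gen8_pte_index(start),
		   gen8_pte_count(start, length));
	...
	bitmap_clear(pt->used_ptes, gen8_pte_index(pd_start),
		     gen8_pte_count(pd_start, pd_len));
	if (bitmap_empty(pt->used_ptes, GEN8_PTES_PER_PT))
		free_pt_single(pt, vm->dev);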

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 101 +++++++++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_gtt.h |  12 +++++
 2 files changed, 99 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 02ddac4..3e43875 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -345,8 +345,12 @@ err_out:
 
 static void __free_pd_single(struct i915_pagedir *pd, struct drm_device *dev)
 {
+	WARN(!bitmap_empty(pd->used_pdes, I915_PDES_PER_PD),
+	     "Free page directory with %d used pages\n",
+	     bitmap_weight(pd->used_pdes, I915_PDES_PER_PD));
 	i915_dma_unmap_single(pd, dev);
 	__free_page(pd->page);
+	kfree(pd->used_pdes);
 	kfree(pd);
 }
 
@@ -359,26 +363,35 @@ static void __free_pd_single(struct i915_pagedir *pd, struct drm_device *dev)
 static struct i915_pagedir *alloc_pd_single(struct drm_device *dev)
 {
 	struct i915_pagedir *pd;
-	int ret;
+	int ret = -ENOMEM;
 
 	pd = kzalloc(sizeof(*pd), GFP_KERNEL);
 	if (!pd)
 		return ERR_PTR(-ENOMEM);
 
+	pd->used_pdes = kcalloc(BITS_TO_LONGS(I915_PDES_PER_PD),
+				sizeof(*pd->used_pdes), GFP_KERNEL);
+	if (!pd->used_pdes)
+		goto free_pd;
+
 	pd->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	if (!pd->page) {
-		kfree(pd);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!pd->page)
+		goto free_bitmap;
 
 	ret = i915_dma_map_px_single(pd, dev);
-	if (ret) {
-		__free_page(pd->page);
-		kfree(pd);
-		return ERR_PTR(ret);
-	}
+	if (ret)
+		goto free_page;
 
 	return pd;
+
+free_page:
+	__free_page(pd->page);
+free_bitmap:
+	kfree(pd->used_pdes);
+free_pd:
+	kfree(pd);
+
+	return ERR_PTR(ret);
 }
 
 /* Broadwell Page Directory Pointer Descriptors */
@@ -568,12 +581,48 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
 		uint64_t pd_len = gen8_clamp_pd(start, length);
 		uint64_t pd_start = start;
+
+		/* Page directories might not be present since the macro rounds
+		 * down, and up.
+		 */
+		if (!pd) {
+			WARN(test_bit(pdpe, ppgtt->pdp.used_pdpes),
+			     "PDPE %d is not allocated, but is reserved (%p)\n",
+			     pdpe, vm);
+			continue;
+		} else {
+			WARN(!test_bit(pdpe, ppgtt->pdp.used_pdpes),
+			     "PDPE %d not reserved, but is allocated (%p)",
+			     pdpe, vm);
+		}
+
 		gen8_for_each_pde(pt, pd, pd_start, pd_len, temp, pde) {
-			free_pt_single(pt, vm->dev);
-			pd->page_tables[pde] = NULL;
+			if (!pt) {
+				WARN(test_bit(pde, pd->used_pdes),
+				     "PDE %d is not allocated, but is reserved (%p)\n",
+				     pde, vm);
+				continue;
+			} else
+				WARN(!test_bit(pde, pd->used_pdes),
+				     "PDE %d not reserved, but is allocated (%p)",
+				     pde, vm);
+
+			bitmap_clear(pt->used_ptes,
+				     gen8_pte_index(pd_start),
+				     gen8_pte_count(pd_start, pd_len));
+
+			if (bitmap_empty(pt->used_ptes, GEN8_PTES_PER_PT)) {
+				free_pt_single(pt, vm->dev);
+				pd->page_tables[pde] = NULL;
+				WARN_ON(!test_and_clear_bit(pde, pd->used_pdes));
+			}
+		}
+
+		if (bitmap_empty(pd->used_pdes, I915_PDES_PER_PD)) {
+			free_pd_single(pd, vm->dev);
+			ppgtt->pdp.pagedirs[pdpe] = NULL;
+			WARN_ON(!test_and_clear_bit(pdpe, ppgtt->pdp.used_pdpes));
 		}
-		free_pd_single(pd, vm->dev);
-		ppgtt->pdp.pagedirs[pdpe] = NULL;
 	}
 }
 
@@ -619,6 +668,7 @@ unwind_out:
 	return -ENOMEM;
 }
 
+/* bitmap of new pagedirs */
 static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
 				     uint64_t start,
 				     uint64_t length,
@@ -634,6 +684,7 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
 	gen8_for_each_pdpe(unused, pdp, start, length, temp, pdpe) {
 		BUG_ON(unused);
 		pdp->pagedirs[pdpe] = alloc_pd_single(dev);
+
 		if (IS_ERR(pdp->pagedirs[pdpe]))
 			goto unwind_out;
 	}
@@ -655,10 +706,12 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 		container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagedir *pd;
 	const uint64_t orig_start = start;
+	const uint64_t orig_length = length;
 	uint64_t temp;
 	uint32_t pdpe;
 	int ret;
 
+	/* Do the allocations first so we can easily bail out */
 	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, start, length,
 					ppgtt->base.dev);
 	if (ret)
@@ -671,6 +724,26 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 			goto err_out;
 	}
 
+	/* Now mark everything we've touched as used. This doesn't allow for
+	 * robust error checking, but it makes the code a hell of a lot simpler.
+	 */
+	start = orig_start;
+	length = orig_length;
+
+	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+		struct i915_pagetab *pt;
+		uint64_t pd_len = gen8_clamp_pd(start, length);
+		uint64_t pd_start = start;
+		uint32_t pde;
+		gen8_for_each_pde(pt, &ppgtt->pd, pd_start, pd_len, temp, pde) {
+			bitmap_set(pd->page_tables[pde]->used_ptes,
+				   gen8_pte_index(start),
+				   gen8_pte_count(start, length));
+			set_bit(pde, pd->used_pdes);
+		}
+		set_bit(pdpe, ppgtt->pdp.used_pdpes);
+	}
+
 	return 0;
 
 err_out:
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 18a0b68..b92b1fb 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -195,11 +195,13 @@ struct i915_pagedir {
 		dma_addr_t daddr;
 	};
 
+	unsigned long *used_pdes;
 	struct i915_pagetab *page_tables[I915_PDES_PER_PD];
 };
 
 struct i915_pagedirpo {
 	/* struct page *page; */
+	DECLARE_BITMAP(used_pdpes, GEN8_LEGACY_PDPES);
 	struct i915_pagedir *pagedirs[GEN8_LEGACY_PDPES];
 };
 
@@ -462,6 +464,16 @@ static inline uint32_t gen8_pml4e_index(uint64_t address)
 	BUG();
 }
 
+static inline size_t gen8_pte_count(uint64_t addr, uint64_t length)
+{
+	return i915_pte_count(addr, length, GEN8_PDE_SHIFT);
+}
+
+static inline size_t gen8_pde_count(uint64_t addr, uint64_t length)
+{
+	return i915_pde_count(addr, length, GEN8_PDE_SHIFT);
+}
+
 int i915_gem_gtt_init(struct drm_device *dev);
 void i915_gem_init_global_gtt(struct drm_device *dev);
 void i915_gem_setup_global_gtt(struct drm_device *dev, unsigned long start,
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 54/68] drm/i915/bdw: Dynamic page table allocations
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (52 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 53/68] drm/i915/bdw: begin bitmap tracking Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 55/68] drm/i915/bdw: Make pdp allocation more dynamic Ben Widawsky
                   ` (18 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This finishes off the dynamic page table allocations, in the legacy
3-level style that already exists. Almost everything has already been
set up to this point; the patch finishes off the enabling by setting
the appropriate function pointers.

Zombie tracking:
This could be a separate patch, but I found it helpful for debugging.
Since we write page tables asynchronously with respect to the GPU using
them, we can't actually free the page tables until we know the GPU won't
use them. With this patch, that is always when the context dies.  It
would be possible to write a reaper to go through zombies and clean them
up when under memory pressure. That exercise is left for the reader.

Scratch unused pages:
The object pages can get freed even if a page table still points to
them.  As with the zombie fix, we need to make sure we don't let the
GPU access arbitrary memory once we've unmapped things.
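
The zombie rule in shorthand, matching the teardown path in the diff
below: an empty page table is freed only when the context is dead, and
is otherwise flagged and kept alive until then.

	if (bitmap_empty(pt->used_ptes, GEN8_PTES_PER_PT)) {
		if (!dead) {
			pt->zombie = 1;	/* GPU may still walk it */
			continue;
		}
		free_pt_single(pt, vm->dev);
	}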

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 371 ++++++++++++++++++++++++++++++------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  16 +-
 2 files changed, 324 insertions(+), 63 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 3e43875..84e139d 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -538,7 +538,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	}
 }
 
-static void __gen8_do_map_pt(gen8_ppgtt_pde_t *pde,
+static void __gen8_do_map_pt(gen8_ppgtt_pde_t * const pde,
 			     struct i915_pagetab *pt,
 			     struct drm_device *dev)
 {
@@ -555,7 +555,7 @@ static void gen8_map_pagetable_range(struct i915_pagedir *pd,
 				     uint64_t length,
 				     struct drm_device *dev)
 {
-	gen8_ppgtt_pde_t *pagedir = kmap_atomic(pd->page);
+	gen8_ppgtt_pde_t * const pagedir = kmap_atomic(pd->page);
 	struct i915_pagetab *pt;
 	uint64_t temp, pde;
 
@@ -568,8 +568,9 @@ static void gen8_map_pagetable_range(struct i915_pagedir *pd,
 	kunmap_atomic(pagedir);
 }
 
-static void gen8_teardown_va_range(struct i915_address_space *vm,
-				   uint64_t start, uint64_t length)
+static void __gen8_teardown_va_range(struct i915_address_space *vm,
+				     uint64_t start, uint64_t length,
+				     bool dead)
 {
 	struct i915_hw_ppgtt *ppgtt =
 		        container_of(vm, struct i915_hw_ppgtt, base);
@@ -591,6 +592,13 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 			     pdpe, vm);
 			continue;
 		} else {
+			if (dead && pd->zombie) {
+				WARN_ON(test_bit(pdpe, ppgtt->pdp.used_pdpes));
+				free_pd_single(pd, vm->dev);
+				ppgtt->pdp.pagedirs[pdpe] = NULL;
+				continue;
+			}
+
 			WARN(!test_bit(pdpe, ppgtt->pdp.used_pdpes),
 			     "PDPE %d not reserved, but is allocated (%p)",
 			     pdpe, vm);
@@ -602,34 +610,65 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 				     "PDE %d is not allocated, but is reserved (%p)\n",
 				     pde, vm);
 				continue;
-			} else
+			} else {
+				if (dead && pt->zombie) {
+					WARN_ON(test_bit(pde, pd->used_pdes));
+					free_pt_single(pt, vm->dev);
+					pd->page_tables[pde] = NULL;
+					continue;
+				}
 				WARN(!test_bit(pde, pd->used_pdes),
 				     "PDE %d not reserved, but is allocated (%p)",
 				     pde, vm);
+			}
 
 			bitmap_clear(pt->used_ptes,
 				     gen8_pte_index(pd_start),
 				     gen8_pte_count(pd_start, pd_len));
 
+
 			if (bitmap_empty(pt->used_ptes, GEN8_PTES_PER_PT)) {
+				WARN_ON(!test_and_clear_bit(pde, pd->used_pdes));
+				if (!dead) {
+					pt->zombie = 1;
+					continue;
+				}
 				free_pt_single(pt, vm->dev);
 				pd->page_tables[pde] = NULL;
-				WARN_ON(!test_and_clear_bit(pde, pd->used_pdes));
+
 			}
 		}
 
+		gen8_ppgtt_clear_range(vm, pd_start, pd_len, true);
+
 		if (bitmap_empty(pd->used_pdes, I915_PDES_PER_PD)) {
+			WARN_ON(!test_and_clear_bit(pdpe, ppgtt->pdp.used_pdpes));
+			if (!dead) {
+				/* We've unmapped a possibly live context. Make
+				 * note of it so we can clean it up later. */
+				pd->zombie = 1;
+				continue;
+			}
 			free_pd_single(pd, vm->dev);
 			ppgtt->pdp.pagedirs[pdpe] = NULL;
-			WARN_ON(!test_and_clear_bit(pdpe, ppgtt->pdp.used_pdpes));
 		}
 	}
 }
 
+static void gen8_teardown_va_range(struct i915_address_space *vm,
+				   uint64_t start, uint64_t length)
+{
+	__gen8_teardown_va_range(vm, start, length, false);
+}
+
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
-	gen8_teardown_va_range(&ppgtt->base,
-			       ppgtt->base.start, ppgtt->base.total);
+	trace_i915_va_teardown(&ppgtt->base,
+			       ppgtt->base.start, ppgtt->base.total,
+			       VM_TO_TRACE_NAME(&ppgtt->base));
+	__gen8_teardown_va_range(&ppgtt->base,
+				 ppgtt->base.start, ppgtt->base.total,
+				 true);
 }
 
 static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
@@ -643,58 +682,167 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 	gen8_ppgtt_free(ppgtt);
 }
 
-static int gen8_ppgtt_alloc_pagetabs(struct i915_pagedir *pd,
+/**
+ * gen8_ppgtt_alloc_pagetabs() - Allocate page tables for VA range.
+ * @ppgtt:	Master ppgtt structure.
+ * @pd:		Page directory for this address range.
+ * @start:	Starting virtual address to begin allocations.
+ * @length	Size of the allocations.
+ * @new_pts:	Bitmap set by function with new allocations. Likely used by the
+ *		caller to free on error.
+ *
+ * Allocate the required number of page tables. Extremely similar to
+ * gen8_ppgtt_alloc_pagedirs(). The main difference is here we are limited by
+ * the page directory boundary (instead of the page directory pointer). That
+ * boundary is 1GB virtual. Therefore, unlike gen8_ppgtt_alloc_pagedirs(), it is
+ * possible, and likely that the caller will need to use multiple calls of this
+ * function to achieve the appropriate allocation.
+ *
+ * Return: 0 if success; negative error code otherwise.
+ */
+static int gen8_ppgtt_alloc_pagetabs(struct i915_hw_ppgtt *ppgtt,
+				     struct i915_pagedir *pd,
 				     uint64_t start,
 				     uint64_t length,
-				     struct drm_device *dev)
+				     unsigned long *new_pts)
 {
-	struct i915_pagetab *unused;
+	struct i915_pagetab *pt;
 	uint64_t temp;
 	uint32_t pde;
 
-	gen8_for_each_pde(unused, pd, start, length, temp, pde) {
-		BUG_ON(unused);
-		pd->page_tables[pde] = alloc_pt_single(dev);
-		if (IS_ERR(pd->page_tables[pde]))
+	gen8_for_each_pde(pt, pd, start, length, temp, pde) {
+		/* Don't reallocate page tables */
+		if (pt) {
+			/* Scratch is never allocated this way */
+			WARN_ON(pt->scratch);
+			/* If there is a zombie, we can reuse it and save time
+			 * on the allocation. If we clear the zombie status and
+			 * the caller somehow fails, we'll probably hit some
+			 * assertions, so it's up to them to fix up the bitmaps.
+			 */
+			continue;
+		}
+
+		pt = alloc_pt_single(ppgtt->base.dev);
+		if (IS_ERR(pt))
 			goto unwind_out;
+
+		pd->page_tables[pde] = pt;
+		set_bit(pde, new_pts);
 	}
 
 	return 0;
 
 unwind_out:
-	while (pde--)
-		free_pt_single(pd->page_tables[pde], dev);
+	for_each_set_bit(pde, new_pts, I915_PDES_PER_PD)
+		free_pt_single(pd->page_tables[pde], ppgtt->base.dev);
 
 	return -ENOMEM;
 }
 
-/* bitmap of new pagedirs */
-static int gen8_ppgtt_alloc_pagedirs(struct i915_pagedirpo *pdp,
+/**
+ * gen8_ppgtt_alloc_pagedirs() - Allocate page directories for VA range.
+ * @ppgtt:	Master ppgtt structure.
+ * @pdp:	Page directory pointer for this address range.
+ * @start:	Starting virtual address to begin allocations.
+ * @length:	Size of the allocations.
+ * @new_pds:	Bitmap set by function with new allocations. Likely used by the
+ *		caller to free on error.
+ *
+ * Allocate the required number of page directories starting at the pde index of
+ * @start, and ending at the pde index @start + @length. This function will skip
+ * over already allocated page directories within the range, and only allocate
+ * new ones, setting the appropriate pointer within the pdp as well as the
+ * correct position in the bitmap @new_pds.
+ *
+ * The function will only allocate the pages within the range for a given page
+ * directory pointer. In other words, if @start + @length straddles a virtually
+ * addressed PDP boundary (512GB for 4k pages), there will be more allocations
+ * required by the caller. This is not currently possible, and the BUG in the
+ * code will prevent it.
+ *
+ * Return: 0 if success; negative error code otherwise.
+ */
+static int gen8_ppgtt_alloc_pagedirs(struct i915_hw_ppgtt *ppgtt,
+				     struct i915_pagedirpo *pdp,
 				     uint64_t start,
 				     uint64_t length,
-				     struct drm_device *dev)
+				     unsigned long *new_pds)
 {
-	struct i915_pagedir *unused;
+	struct i915_pagedir *pd;
 	uint64_t temp;
 	uint32_t pdpe;
 
+	BUG_ON(!bitmap_empty(new_pds, GEN8_LEGACY_PDPES));
+
 	/* FIXME: PPGTT container_of won't work for 64b */
 	BUG_ON((start + length) > 0x800000000ULL);
 
-	gen8_for_each_pdpe(unused, pdp, start, length, temp, pdpe) {
-		BUG_ON(unused);
-		pdp->pagedirs[pdpe] = alloc_pd_single(dev);
+	gen8_for_each_pdpe(pd, pdp, start, length, temp, pdpe) {
+		if (pd)
+			continue;
 
-		if (IS_ERR(pdp->pagedirs[pdpe]))
+		pd = alloc_pd_single(ppgtt->base.dev);
+		if (IS_ERR(pd))
 			goto unwind_out;
+
+		pdp->pagedirs[pdpe] = pd;
+		set_bit(pdpe, new_pds);
 	}
 
 	return 0;
 
 unwind_out:
-	while (pdpe--)
-		free_pd_single(pdp->pagedirs[pdpe], dev);
+	for_each_set_bit(pdpe, new_pds, GEN8_LEGACY_PDPES)
+		free_pd_single(pdp->pagedirs[pdpe], ppgtt->base.dev);
+
+	return -ENOMEM;
+}
+
+static inline void
+free_gen8_temp_bitmaps(unsigned long *new_pds, unsigned long **new_pts)
+{
+	int i;
+	for (i = 0; i < GEN8_LEGACY_PDPES; i++)
+		kfree(new_pts[i]);
+	kfree(new_pts);
+	kfree(new_pds);
+}
+
+/* Fills in the page directory bitmap, and the array of page table bitmaps. Both
+ * of these are based on the number of PDPEs in the system.
+ */
+int __must_check alloc_gen8_temp_bitmaps(unsigned long **new_pds,
+					 unsigned long ***new_pts)
+{
+	int i;
+	unsigned long *pds;
+	unsigned long **pts;
+
+	pds = kcalloc(BITS_TO_LONGS(GEN8_LEGACY_PDPES), sizeof(unsigned long), GFP_KERNEL);
+	if (!pds)
+		return -ENOMEM;
+
+	pts = kcalloc(I915_PDES_PER_PD, sizeof(unsigned long *), GFP_KERNEL);
+	if (!pts) {
+		kfree(pds);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < GEN8_LEGACY_PDPES; i++) {
+		pts[i] = kcalloc(BITS_TO_LONGS(I915_PDES_PER_PD),
+				 sizeof(unsigned long), GFP_KERNEL);
+		if (!pts[i])
+			goto err_out;
+	}
 
+	*new_pds = pds;
+	*new_pts = (unsigned long **)pts;
+
+	return 0;
+
+err_out:
+	free_gen8_temp_bitmaps(pds, pts);
 	return -ENOMEM;
 }
 
@@ -704,6 +852,7 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 {
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
+	unsigned long *new_page_dirs, **new_page_tables;
 	struct i915_pagedir *pd;
 	const uint64_t orig_start = start;
 	const uint64_t orig_length = length;
@@ -711,43 +860,103 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 	uint32_t pdpe;
 	int ret;
 
-	/* Do the allocations first so we can easily bail out */
-	ret = gen8_ppgtt_alloc_pagedirs(&ppgtt->pdp, start, length,
-					ppgtt->base.dev);
+#ifndef CONFIG_64BIT
+	/* Disallow 64b address on 32b platforms. Nothing is wrong with doing
+	 * this in hardware, but a lot of the drm code is not prepared to handle
+	 * 64b offset on 32b platforms. */
+	if (start + length > 0x100000000ULL)
+		return -E2BIG;
+#endif
+
+	/* Wrap is never okay since we can only represent 48b, and we don't
+	 * actually use the other side of the canonical address space.
+	 */
+	if (WARN_ON(start + length < start))
+		return -ERANGE;
+
+	ret = alloc_gen8_temp_bitmaps(&new_page_dirs, &new_page_tables);
 	if (ret)
 		return ret;
 
+	/* Do the allocations first so we can easily bail out */
+	ret = gen8_ppgtt_alloc_pagedirs(ppgtt, &ppgtt->pdp, start, length,
+					new_page_dirs);
+	if (ret) {
+		free_gen8_temp_bitmaps(new_page_dirs, new_page_tables);
+		return ret;
+	}
+
+	/* For every page directory referenced, allocate page tables */
 	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
-		ret = gen8_ppgtt_alloc_pagetabs(pd, start, length,
-						ppgtt->base.dev);
+		bitmap_zero(new_page_tables[pdpe], I915_PDES_PER_PD);
+		ret = gen8_ppgtt_alloc_pagetabs(ppgtt, pd, start, length,
+						new_page_tables[pdpe]);
 		if (ret)
 			goto err_out;
 	}
 
-	/* Now mark everything we've touched as used. This doesn't allow for
-	 * robust error checking, but it makes the code a hell of a lot simpler.
-	 */
 	start = orig_start;
 	length = orig_length;
 
+	/* Allocations have completed successfully, so set the bitmaps, and do
+	 * the mappings. */
 	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+		gen8_ppgtt_pde_t *const pagedir = kmap_atomic(pd->page);
 		struct i915_pagetab *pt;
 		uint64_t pd_len = gen8_clamp_pd(start, length);
 		uint64_t pd_start = start;
 		uint32_t pde;
-		gen8_for_each_pde(pt, &ppgtt->pd, pd_start, pd_len, temp, pde) {
-			bitmap_set(pd->page_tables[pde]->used_ptes,
-				   gen8_pte_index(start),
-				   gen8_pte_count(start, length));
+
+		/* Every pd should be allocated, we just did that above. */
+		BUG_ON(!pd);
+
+		gen8_for_each_pde(pt, pd, pd_start, pd_len, temp, pde) {
+			/* Same reasoning as pd */
+			BUG_ON(!pt);
+			BUG_ON(!pd_len);
+			BUG_ON(!gen8_pte_count(pd_start, pd_len));
+
+			/* Set our used ptes within the page table */
+			bitmap_set(pt->used_ptes,
+				   gen8_pte_index(pd_start),
+				   gen8_pte_count(pd_start, pd_len));
+
+			/* Our pde is now pointing to the pagetable, pt */
 			set_bit(pde, pd->used_pdes);
+
+			/* Map the PDE to the page table */
+			__gen8_do_map_pt(pagedir + pde, pt, vm->dev);
+
+			/* NB: We haven't yet mapped ptes to pages. At this
+			 * point we're still relying on insert_entries() */
+
+			/* No longer possible this page table is a zombie */
+			pt->zombie = 0;
 		}
+
+		if (!HAS_LLC(vm->dev))
+			drm_clflush_virt_range(pagedir, PAGE_SIZE);
+
+		kunmap_atomic(pagedir);
+
 		set_bit(pdpe, ppgtt->pdp.used_pdpes);
+		/* This pd is officially not a zombie either */
+		ppgtt->pdp.pagedirs[pdpe]->zombie = 0;
 	}
 
+	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables);
 	return 0;
 
 err_out:
-	gen8_teardown_va_range(vm, orig_start, start);
+	while (pdpe--) {
+		for_each_set_bit(temp, new_page_tables[pdpe], I915_PDES_PER_PD)
+			free_pt_single(pd->page_tables[temp], vm->dev);
+	}
+
+	for_each_set_bit(pdpe, new_page_dirs, GEN8_LEGACY_PDPES)
+		free_pd_single(ppgtt->pdp.pagedirs[pdpe], vm->dev);
+
+	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables);
 	return ret;
 }
 
@@ -758,38 +967,69 @@ err_out:
  * space.
  *
  */
-static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, uint64_t size)
+static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 {
-	struct i915_pagedir *pd;
-	uint64_t temp, start = 0;
-	const uint64_t orig_length = size;
-	uint32_t pdpe;
-	int ret;
+	ppgtt->scratch_pd = alloc_pt_scratch(ppgtt->base.dev);
+	if (IS_ERR(ppgtt->scratch_pd))
+		return PTR_ERR(ppgtt->scratch_pd);
 
 	ppgtt->base.start = 0;
 	ppgtt->base.total = size;
-	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
-	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
+	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
+
 	ppgtt->enable = gen8_ppgtt_enable;
 	ppgtt->switch_mm = gen8_mm_switch;
 
-	ppgtt->scratch_pd = alloc_pt_scratch(ppgtt->base.dev);
-	if (IS_ERR(ppgtt->scratch_pd))
-		return PTR_ERR(ppgtt->scratch_pd);
+	return 0;
+}
+
+static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
+{
+	struct drm_device *dev = ppgtt->base.dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct i915_pagedir *pd;
+	uint64_t temp, start = 0, size = dev_priv->gtt.base.total;
+	uint32_t pdpe;
+	int ret;
 
+	ret = gen8_ppgtt_init_common(ppgtt, dev_priv->gtt.base.total);
+	if (ret)
+		return ret;
+
+	/* Aliasing PPGTT has to always work and be mapped because of the way we
+	 * use RESTORE_INHIBIT in the context switch. This will be fixed
+	 * eventually. */
 	ret = gen8_alloc_va_range(&ppgtt->base, start, size);
 	if (ret) {
 		free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
 		return ret;
 	}
 
-	start = 0;
-	size = orig_length;
-
 	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, size, temp, pdpe)
 		gen8_map_pagetable_range(pd, start, size, ppgtt->base.dev);
 
+	ppgtt->base.allocate_va_range = NULL;
+	ppgtt->base.teardown_va_range = NULL;
+	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
+
+	return 0;
+}
+
+static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
+{
+	struct drm_device *dev = ppgtt->base.dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	int ret;
+
+	ret = gen8_ppgtt_init_common(ppgtt, dev_priv->gtt.base.total);
+	if (ret)
+		return ret;
+
+	ppgtt->base.allocate_va_range = gen8_alloc_va_range;
+	ppgtt->base.teardown_va_range = gen8_teardown_va_range;
+	ppgtt->base.clear_range = NULL;
+
 	return 0;
 }
 
@@ -1446,8 +1686,10 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt, boo
 
 	if (INTEL_INFO(dev)->gen < 8)
 		ret = gen6_ppgtt_init(ppgtt, aliasing);
+	else if (IS_GEN8(dev) && aliasing)
+		ret = gen8_aliasing_ppgtt_init(ppgtt);
 	else if (IS_GEN8(dev))
-		ret = gen8_ppgtt_init(ppgtt, dev_priv->gtt.base.total);
+		ret = gen8_ppgtt_init(ppgtt);
 	else
 		BUG();
 
@@ -1456,7 +1698,8 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt, boo
 
 	kref_init(&ppgtt->ref);
 	drm_mm_init(&ppgtt->base.mm, ppgtt->base.start, ppgtt->base.total);
-	ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
+	if (ppgtt->base.clear_range)
+		ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
 	i915_init_vm(dev_priv, &ppgtt->base);
 
 	return 0;
@@ -1502,10 +1745,7 @@ ppgtt_bind_vma(struct i915_vma *vma,
 
 static void ppgtt_unbind_vma(struct i915_vma *vma)
 {
-	vma->vm->clear_range(vma->vm,
-			     vma->node.start,
-			     vma->obj->base.size,
-			     true);
+	WARN_ON(vma->vm->teardown_va_range && vma->vm->clear_range);
 	if (vma->vm->teardown_va_range) {
 		trace_i915_va_teardown(vma->vm,
 				       vma->node.start, vma->node.size,
@@ -1514,7 +1754,14 @@ static void ppgtt_unbind_vma(struct i915_vma *vma)
 		vma->vm->teardown_va_range(vma->vm,
 					   vma->node.start, vma->node.size);
 		ppgtt_invalidate_tlbs(vma->vm);
-	}
+	} else if (vma->vm->clear_range) {
+		vma->vm->clear_range(vma->vm,
+				     vma->node.start,
+				     vma->obj->base.size,
+				     true);
+	} else
+		BUG();
+
 }
 
 extern int intel_iommu_gfx_mapped;
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index b92b1fb..d9759c7 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -179,13 +179,26 @@ struct i915_vma {
 			u32 flags);
 };
 
-
+/* Zombies. We write page tables with the CPU, and hardware switches them with
+ * the GPU. As such, the only time we can safely remove a page table is when we
+ * know the context is idle. Since we have no good way to do this, we use the
+ * zombie.
+ *
+ * Under memory pressure, if the system is idle, zombies may be reaped.
+ *
+ * There are 3 valid states a page table can be in (not including scratch):
+ *  bitmap = 0, zombie = 0: unallocated
+ *  bitmap = 1, zombie = 0: allocated
+ *  bitmap = 0, zombie = 1: zombie
+ *  bitmap = 1, zombie = 1: invalid
+ */
 struct i915_pagetab {
 	struct page *page;
 	dma_addr_t daddr;
 
 	unsigned long *used_ptes;
 	unsigned int scratch:1;
+	unsigned zombie:1;
 };
 
 struct i915_pagedir {
@@ -197,6 +210,7 @@ struct i915_pagedir {
 
 	unsigned long *used_pdes;
 	struct i915_pagetab *page_tables[I915_PDES_PER_PD];
+	unsigned zombie:1;
 };
 
 struct i915_pagedirpo {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 55/68] drm/i915/bdw: Make pdp allocation more dynamic
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (53 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 54/68] drm/i915/bdw: Dynamic page table allocations Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 56/68] drm/i915/bdw: Abstract PDP usage Ben Widawsky
                   ` (17 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This transitional patch doesn't do much for the existing code. However,
it should make the upcoming patches that use the full 48b address space
a bit easier to swallow. The patch also introduces the PML4, i.e. the
new top-level structure of the page tables.
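
The pdp sizing becomes a per-device question; a one-line sketch of the
macro added in the diff below:

	size_t pdpes = I915_PDPES_PER_PDP(dev); /* 4 legacy, 512 with 48b */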

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h     |   5 ++
 drivers/gpu/drm/i915/i915_gem_gtt.c | 114 +++++++++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  40 +++++++++----
 3 files changed, 128 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index beb9a66..ff921e6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2078,6 +2078,11 @@ struct drm_i915_cmd_table {
 #define HAS_PPGTT(dev)		(INTEL_INFO(dev)->gen >= 7 && !IS_CHERRYVIEW(dev))
 #define USES_PPGTT(dev)		(i915.enable_ppgtt)
 #define USES_FULL_PPGTT(dev)	(i915.enable_ppgtt == 2)
+#ifdef CONFIG_64BIT
+# define HAS_48B_PPGTT(dev)	(IS_BROADWELL(dev) && false)
+#else
+# define HAS_48B_PPGTT(dev)	false
+#endif
 
 #define HAS_OVERLAY(dev)		(INTEL_INFO(dev)->has_overlay)
 #define OVERLAY_NEEDS_PHYSICAL(dev)	(INTEL_INFO(dev)->overlay_needs_physical)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 84e139d..8e15842 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -394,6 +394,45 @@ free_pd:
 	return ERR_PTR(ret);
 }
 
+static void __pdp_fini(struct i915_pagedirpo *pdp)
+{
+	kfree(pdp->used_pdpes);
+	kfree(pdp->pagedirs);
+	/* HACK */
+	pdp->pagedirs = NULL;
+}
+
+static void free_pdp_single(struct i915_pagedirpo *pdp,
+			    struct drm_device *dev)
+{
+	__pdp_fini(pdp);
+	if (HAS_48B_PPGTT(dev))
+		kfree(pdp);
+}
+
+static int __pdp_init(struct i915_pagedirpo *pdp,
+		      struct drm_device *dev)
+{
+	size_t pdpes = I915_PDPES_PER_PDP(dev);
+
+	pdp->used_pdpes = kcalloc(BITS_TO_LONGS(pdpes),
+				  sizeof(unsigned long),
+				  GFP_KERNEL);
+	if (!pdp->used_pdpes)
+		return -ENOMEM;
+
+	pdp->pagedirs = kcalloc(pdpes, sizeof(*pdp->pagedirs), GFP_KERNEL);
+	if (!pdp->pagedirs) {
+		kfree(pdp->used_pdpes);
+		/* the PDP might be the statically allocated top level. Keep it
+		 * as clean as possible */
+		pdp->used_pdpes = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 /* Broadwell Page Directory Pointer Descriptors */
 static int gen8_write_pdp(struct intel_engine_cs *ring,
 			  unsigned entry,
@@ -432,7 +471,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 {
 	int i, ret;
 
-	for (i = GEN8_LEGACY_PDPES - 1; i >= 0; i--) {
+	for (i = 3; i >= 0; i--) {
 		struct i915_pagedir *pd = ppgtt->pdp.pagedirs[i];
 		dma_addr_t pd_daddr = pd ? pd->daddr : ppgtt->scratch_pd->daddr;
 		/* The page directory might be NULL, but we need to clear out
@@ -506,9 +545,6 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	pt_vaddr = NULL;
 
 	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
-		if (WARN_ON(pdpe >= GEN8_LEGACY_PDPES))
-			break;
-
 		if (pt_vaddr == NULL) {
 			struct i915_pagedir *pd = ppgtt->pdp.pagedirs[pdpe];
 			struct i915_pagetab *pt = pd->page_tables[pde];
@@ -574,11 +610,17 @@ static void __gen8_teardown_va_range(struct i915_address_space *vm,
 {
 	struct i915_hw_ppgtt *ppgtt =
 		        container_of(vm, struct i915_hw_ppgtt, base);
+	struct drm_device *dev = vm->dev;
 	struct i915_pagedir *pd;
 	struct i915_pagetab *pt;
 	uint64_t temp;
 	uint32_t pdpe, pde;
 
+	if (!ppgtt->pdp.pagedirs) {
+		/* If pagedirs are already free, there is nothing to do.*/
+		return;
+	}
+
 	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
 		uint64_t pd_len = gen8_clamp_pd(start, length);
 		uint64_t pd_start = start;
@@ -633,9 +675,8 @@ static void __gen8_teardown_va_range(struct i915_address_space *vm,
 					pt->zombie = 1;
 					continue;
 				}
-				free_pt_single(pt, vm->dev);
+				free_pt_single(pt, dev);
 				pd->page_tables[pde] = NULL;
-
 			}
 		}
 
@@ -649,10 +690,15 @@ static void __gen8_teardown_va_range(struct i915_address_space *vm,
 				pd->zombie = 1;
 				continue;
 			}
-			free_pd_single(pd, vm->dev);
+			free_pd_single(pd, dev);
 			ppgtt->pdp.pagedirs[pdpe] = NULL;
 		}
 	}
+
+	if (bitmap_empty(ppgtt->pdp.used_pdpes, I915_PDPES_PER_PDP(dev))) {
+		/* TODO: When pagetables are fully dynamic:
+		free_pdp_single(&ppgtt->pdp, dev); */
+	}
 }
 
 static void gen8_teardown_va_range(struct i915_address_space *vm,
@@ -669,6 +715,9 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	__gen8_teardown_va_range(&ppgtt->base,
 				 ppgtt->base.start, ppgtt->base.total,
 				 true);
+	WARN_ON(!bitmap_empty(ppgtt->pdp.used_pdpes,
+			      I915_PDPES_PER_PDP(ppgtt->base.dev)));
+	free_pdp_single(&ppgtt->pdp, ppgtt->base.dev);
 }
 
 static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
@@ -769,11 +818,13 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_hw_ppgtt *ppgtt,
 				     uint64_t length,
 				     unsigned long *new_pds)
 {
+	struct drm_device *dev = ppgtt->base.dev;
 	struct i915_pagedir *pd;
 	uint64_t temp;
 	uint32_t pdpe;
+	size_t pdpes = I915_PDPES_PER_PDP(ppgtt->base.dev);
 
-	BUG_ON(!bitmap_empty(new_pds, GEN8_LEGACY_PDPES));
+	BUG_ON(!bitmap_empty(new_pds, pdpes));
 
 	/* FIXME: PPGTT container_of won't work for 64b */
 	BUG_ON((start + length) > 0x800000000ULL);
@@ -793,17 +844,18 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_hw_ppgtt *ppgtt,
 	return 0;
 
 unwind_out:
-	for_each_set_bit(pdpe, new_pds, GEN8_LEGACY_PDPES)
+	for_each_set_bit(pdpe, new_pds, pdpes)
 		free_pd_single(pdp->pagedirs[pdpe], ppgtt->base.dev);
 
 	return -ENOMEM;
 }
 
 static inline void
-free_gen8_temp_bitmaps(unsigned long *new_pds, unsigned long **new_pts)
+free_gen8_temp_bitmaps(unsigned long *new_pds, unsigned long **new_pts,
+		       size_t pdpes)
 {
 	int i;
-	for (i = 0; i < GEN8_LEGACY_PDPES; i++)
+	for (i = 0; i < pdpes; i++)
 		kfree(new_pts[i]);
 	kfree(new_pts);
 	kfree(new_pds);
@@ -813,13 +865,14 @@ free_gen8_temp_bitmaps(unsigned long *new_pds, unsigned long **new_pts)
  * of these are based on the number of PDPEs in the system.
  */
 int __must_check alloc_gen8_temp_bitmaps(unsigned long **new_pds,
-					 unsigned long ***new_pts)
+					 unsigned long ***new_pts,
+					 size_t pdpes)
 {
 	int i;
 	unsigned long *pds;
 	unsigned long **pts;
 
-	pds = kcalloc(BITS_TO_LONGS(GEN8_LEGACY_PDPES), sizeof(unsigned long), GFP_KERNEL);
+	pds = kcalloc(BITS_TO_LONGS(pdpes), sizeof(unsigned long), GFP_KERNEL);
 	if (!pds)
 		return -ENOMEM;
 
@@ -829,7 +882,7 @@ int __must_check alloc_gen8_temp_bitmaps(unsigned long **new_pds,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < GEN8_LEGACY_PDPES; i++) {
+	for (i = 0; i < pdpes; i++) {
 		pts[i] = kcalloc(BITS_TO_LONGS(I915_PDES_PER_PD),
 				 sizeof(unsigned long), GFP_KERNEL);
 		if (!pts[i])
@@ -842,7 +895,7 @@ int __must_check alloc_gen8_temp_bitmaps(unsigned long **new_pds,
 	return 0;
 
 err_out:
-	free_gen8_temp_bitmaps(pds, pts);
+	free_gen8_temp_bitmaps(pds, pts, pdpes);
 	return -ENOMEM;
 }
 
@@ -853,11 +906,13 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	unsigned long *new_page_dirs, **new_page_tables;
+	struct drm_device *dev = vm->dev;
 	struct i915_pagedir *pd;
 	const uint64_t orig_start = start;
 	const uint64_t orig_length = length;
 	uint64_t temp;
 	uint32_t pdpe;
+	size_t pdpes = I915_PDPES_PER_PDP(dev);
 	int ret;
 
 #ifndef CONFIG_64BIT
@@ -874,7 +929,7 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 	if (WARN_ON(start + length < start))
 		return -ERANGE;
 
-	ret = alloc_gen8_temp_bitmaps(&new_page_dirs, &new_page_tables);
+	ret = alloc_gen8_temp_bitmaps(&new_page_dirs, &new_page_tables, pdpes);
 	if (ret)
 		return ret;
 
@@ -882,7 +937,7 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 	ret = gen8_ppgtt_alloc_pagedirs(ppgtt, &ppgtt->pdp, start, length,
 					new_page_dirs);
 	if (ret) {
-		free_gen8_temp_bitmaps(new_page_dirs, new_page_tables);
+		free_gen8_temp_bitmaps(new_page_dirs, new_page_tables, pdpes);
 		return ret;
 	}
 
@@ -944,7 +999,7 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 		ppgtt->pdp.pagedirs[pdpe]->zombie = 0;
 	}
 
-	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables);
+	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables, pdpes);
 	return 0;
 
 err_out:
@@ -953,13 +1008,19 @@ err_out:
 			free_pt_single(pd->page_tables[temp], vm->dev);
 	}
 
-	for_each_set_bit(pdpe, new_page_dirs, GEN8_LEGACY_PDPES)
+	for_each_set_bit(pdpe, new_page_dirs, pdpes)
 		free_pd_single(ppgtt->pdp.pagedirs[pdpe], vm->dev);
 
-	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables);
+	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables, pdpes);
 	return ret;
 }
 
+static void gen8_ppgtt_fini_common(struct i915_hw_ppgtt *ppgtt)
+{
+	free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
+	free_pdp_single(&ppgtt->pdp, ppgtt->base.dev);
+}
+
 /**
  * GEN8 legacy ppgtt programming is accomplished through a max 4 PDP registers
  * with a net effect resembling a 2-level page table in normal x86 terms. Each
@@ -981,6 +1042,17 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->enable = gen8_ppgtt_enable;
 	ppgtt->switch_mm = gen8_mm_switch;
 
+	if (!HAS_48B_PPGTT(ppgtt->base.dev)) {
+		int ret = __pdp_init(&ppgtt->pdp, false);
+		if (ret) {
+			free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
+			return ret;
+		}
+
+		ppgtt->switch_mm = gen8_mm_switch;
+	} else
+		BUG(); /* Not yet implemented */
+
 	return 0;
 }
 
@@ -1002,7 +1074,7 @@ static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	 * eventually. */
 	ret = gen8_alloc_va_range(&ppgtt->base, start, size);
 	if (ret) {
-		free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
+		gen8_ppgtt_fini_common(ppgtt);
 		return ret;
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index d9759c7..95b5d16 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -88,7 +88,6 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 #define PPAT_CACHED_INDEX		_PAGE_PAT /* WB LLCeLLC */
 #define PPAT_DISPLAY_ELLC_INDEX		_PAGE_PCD /* WT eLLC */
 
-#define GEN8_LEGACY_PDPES		4
 #define GEN8_PTES_PER_PT		(PAGE_SIZE / sizeof(gen8_gtt_pte_t))
 
 /* GEN8 legacy style address is defined as a 3 level page table:
@@ -97,8 +96,17 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
  * The difference as compared to normal x86 3 level page table is the PDPEs are
  * programmed via register.
  */
+#ifdef CONFIG_64BIT
+# define I915_PDPES_PER_PDP(dev) (HAS_48B_PPGTT(dev) ? 512 : 4)
+#else
+# define I915_PDPES_PER_PDP(dev)	4
+#endif
+#define GEN8_PML4ES_PER_PML4		512
+#define GEN8_PML4E_SHIFT		39
 #define GEN8_PDPE_SHIFT			30
-#define GEN8_PDPE_MASK			0x3
+/* NB: GEN8_PDPE_MASK is wider than legacy 32b platforms require, but the
+ * extra bits have no effect on their page tables */
+#define GEN8_PDPE_MASK			0x1ff
 #define GEN8_PDE_SHIFT			21
 
 #define PPAT_UNCACHED_INDEX		(_PAGE_PWT | _PAGE_PCD)
@@ -214,9 +222,17 @@ struct i915_pagedir {
 };
 
 struct i915_pagedirpo {
-	/* struct page *page; */
-	DECLARE_BITMAP(used_pdpes, GEN8_LEGACY_PDPES);
-	struct i915_pagedir *pagedirs[GEN8_LEGACY_PDPES];
+	struct page *page;
+	dma_addr_t daddr;
+	unsigned long *used_pdpes;
+	struct i915_pagedir **pagedirs;
+};
+
+struct i915_pml4 {
+	struct page *page;
+	dma_addr_t daddr;
+	DECLARE_BITMAP(used_pml4es, GEN8_PML4ES_PER_PML4);
+	struct i915_pagedirpo *pdps[GEN8_PML4ES_PER_PML4];
 };
 
 struct i915_address_space {
@@ -282,8 +298,9 @@ struct i915_hw_ppgtt {
 	struct kref ref;
 	struct drm_mm_node node;
 	union {
-		struct i915_pagedirpo pdp;
-		struct i915_pagedir pd;
+		struct i915_pml4 pml4;		/* GEN8+ & 64b PPGTT */
+		struct i915_pagedirpo pdp;	/* GEN8+ */
+		struct i915_pagedir pd;		/* GEN6-7 */
 	};
 
 	union {
@@ -430,14 +447,17 @@ static inline size_t gen6_pde_count(uint32_t addr, uint32_t length)
 	     temp = min(temp, length),					\
 	     start += temp, length -= temp)
 
-#define gen8_for_each_pdpe(pd, pdp, start, length, temp, iter)		\
-	for (iter = gen8_pdpe_index(start), pd = (pdp)->pagedirs[iter];	\
-	     length > 0 && iter < GEN8_LEGACY_PDPES;			\
+#define gen8_for_each_pdpe_e(pd, pdp, start, length, temp, iter, b)	\
+	for (iter = gen8_pdpe_index(start), pd = (pdp)->pagedirs[iter]; \
+	     length > 0 && (iter < b);					\
 	     pd = (pdp)->pagedirs[++iter],				\
 	     temp = ALIGN(start+1, 1 << GEN8_PDPE_SHIFT) - start,	\
 	     temp = min(temp, length),					\
 	     start += temp, length -= temp)
 
+#define gen8_for_each_pdpe(pd, pdp, start, length, temp, iter)		\
+	gen8_for_each_pdpe_e(pd, pdp, start, length, temp, iter, I915_PDPES_PER_PDP(dev))
+
 /* Clamp length to the next pagetab boundary */
 static inline uint64_t gen8_clamp_pt(uint64_t start, uint64_t length)
 {
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 56/68] drm/i915/bdw: Abstract PDP usage
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (54 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 55/68] drm/i915/bdw: Make pdp allocation more dynamic Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 57/68] drm/i915/bdw: Add dynamic page trace events Ben Widawsky
                   ` (16 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Up until now, ppgtt->pdp has always been the root of our page tables.
Legacy 32b addressing acts as if there is a single PDP with 4 PDPEs.

In preparation for 4 level page tables, we need to stop using ppgtt->pdp
directly unless we know it's what we want. The future structure will use
ppgtt->pml4 for the top level, and the pdp is just one of the entries
being pointed to by a pml4e.

This patch cleans up some careless assumptions about the root page
tables that crept in during development.
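
For illustration (an editor's sketch, not part of the patch), the
resulting dispatch reduces to picking the root once and handing it
down:

	/* Editor's sketch: with the root passed in explicitly, the 3lvl
	 * code no longer cares whether its pdp is ppgtt->pdp or one
	 * entry of a pml4. */
	static void sketch_teardown_va(struct i915_address_space *vm,
				       uint64_t start, uint64_t length)
	{
		struct i915_hw_ppgtt *ppgtt =
			container_of(vm, struct i915_hw_ppgtt, base);

		if (HAS_48B_PPGTT(vm->dev))
			gen8_teardown_va_range_4lvl(vm, &ppgtt->pml4,
						    start, length, false);
		else
			gen8_teardown_va_range_3lvl(vm, &ppgtt->pdp,
						    start, length, false);
	}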

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 160 ++++++++++++++++++++----------------
 1 file changed, 88 insertions(+), 72 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 8e15842..7cc6cf9 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -491,6 +491,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 {
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
 	gen8_gtt_pte_t *pt_vaddr, scratch_pte;
 	unsigned pdpe = gen8_pdpe_index(start);
 	unsigned pde = gen8_pde_index(start);
@@ -502,7 +503,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 				      I915_CACHE_LLC, use_scratch);
 
 	while (num_entries) {
-		struct i915_pagedir *pd = ppgtt->pdp.pagedirs[pdpe];
+		struct i915_pagedir *pd = pdp->pagedirs[pdpe];
 		struct i915_pagetab *pt = pd->page_tables[pde];
 		struct page *page_table = pt->page;
 
@@ -536,6 +537,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 {
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
 	gen8_gtt_pte_t *pt_vaddr;
 	unsigned pdpe = gen8_pdpe_index(start);
 	unsigned pde = gen8_pde_index(start);
@@ -546,7 +548,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 
 	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
 		if (pt_vaddr == NULL) {
-			struct i915_pagedir *pd = ppgtt->pdp.pagedirs[pdpe];
+			struct i915_pagedir *pd = pdp->pagedirs[pdpe];
 			struct i915_pagetab *pt = pd->page_tables[pde];
 			struct page *page_table = pt->page;
 			pt_vaddr = kmap_atomic(page_table);
@@ -604,24 +606,26 @@ static void gen8_map_pagetable_range(struct i915_pagedir *pd,
 	kunmap_atomic(pagedir);
 }
 
-static void __gen8_teardown_va_range(struct i915_address_space *vm,
-				     uint64_t start, uint64_t length,
-				     bool dead)
+static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
+					struct i915_pagedirpo *pdp,
+					uint64_t start, uint64_t length,
+					bool dead)
 {
-	struct i915_hw_ppgtt *ppgtt =
-		        container_of(vm, struct i915_hw_ppgtt, base);
 	struct drm_device *dev = vm->dev;
 	struct i915_pagedir *pd;
 	struct i915_pagetab *pt;
 	uint64_t temp;
 	uint32_t pdpe, pde;
 
-	if (!ppgtt->pdp.pagedirs) {
+	BUG_ON(!pdp);
+	if (!pdp->pagedirs) {
+		WARN(!bitmap_empty(pdp->used_pdpes, I915_PDPES_PER_PDP(dev)),
+		     "Page directory leak detected\n");
 		/* If pagedirs are already free, there is nothing to do.*/
 		return;
 	}
 
-	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+	gen8_for_each_pdpe(pd, pdp, start, length, temp, pdpe) {
 		uint64_t pd_len = gen8_clamp_pd(start, length);
 		uint64_t pd_start = start;
 
@@ -629,19 +633,19 @@ static void __gen8_teardown_va_range(struct i915_address_space *vm,
 		 * down, and up.
 		 */
 		if (!pd) {
-			WARN(test_bit(pdpe, ppgtt->pdp.used_pdpes),
+			WARN(test_bit(pdpe, pdp->used_pdpes),
 			     "PDPE %d is not allocated, but is reserved (%p)\n",
 			     pdpe, vm);
 			continue;
 		} else {
 			if (dead && pd->zombie) {
-				WARN_ON(test_bit(pdpe, ppgtt->pdp.used_pdpes));
+				WARN_ON(test_bit(pdpe, pdp->used_pdpes));
 				free_pd_single(pd, vm->dev);
-				ppgtt->pdp.pagedirs[pdpe] = NULL;
+				pdp->pagedirs[pdpe] = NULL;
 				continue;
 			}
 
-			WARN(!test_bit(pdpe, ppgtt->pdp.used_pdpes),
+			WARN(!test_bit(pdpe, pdp->used_pdpes),
 			     "PDPE %d not reserved, but is allocated (%p)",
 			     pdpe, vm);
 		}
@@ -683,7 +687,7 @@ static void __gen8_teardown_va_range(struct i915_address_space *vm,
 		gen8_ppgtt_clear_range(vm, pd_start, pd_len, true);
 
 		if (bitmap_empty(pd->used_pdes, I915_PDES_PER_PD)) {
-			WARN_ON(!test_and_clear_bit(pdpe, ppgtt->pdp.used_pdpes));
+			WARN_ON(!test_and_clear_bit(pdpe, pdp->used_pdpes));
 			if (!dead) {
 				/* We've unmapped a possibly live context. Make
 				 * note of it so we can clean it up later. */
@@ -691,20 +695,32 @@ static void __gen8_teardown_va_range(struct i915_address_space *vm,
 				continue;
 			}
 			free_pd_single(pd, dev);
-			ppgtt->pdp.pagedirs[pdpe] = NULL;
+			pdp->pagedirs[pdpe] = NULL;
 		}
 	}
 
-	if (bitmap_empty(ppgtt->pdp.used_pdpes, I915_PDPES_PER_PDP(dev))) {
-		/* TODO: When pagetables are fully dynamic:
-		free_pdp_single(&ppgtt->pdp, dev); */
-	}
+	if (dead && bitmap_empty(pdp->used_pdpes, I915_PDPES_PER_PDP(dev)))
+		free_pdp_single(pdp, dev);
+}
+
+static void gen8_teardown_va_range_4lvl(struct i915_address_space *vm,
+					struct i915_pml4 *pml4,
+					uint64_t start, uint64_t length,
+					bool dead)
+{
+	BUG();
 }
 
 static void gen8_teardown_va_range(struct i915_address_space *vm,
 				   uint64_t start, uint64_t length)
 {
-	__gen8_teardown_va_range(vm, start, length, false);
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(vm, struct i915_hw_ppgtt, base);
+
+	if (!HAS_48B_PPGTT(vm->dev))
+		gen8_teardown_va_range_3lvl(vm, &ppgtt->pdp, start, length, false);
+	else
+		gen8_teardown_va_range_4lvl(vm, &ppgtt->pml4, start, length, false);
 }
 
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
@@ -712,12 +728,10 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	trace_i915_va_teardown(&ppgtt->base,
 			       ppgtt->base.start, ppgtt->base.total,
 			       VM_TO_TRACE_NAME(&ppgtt->base));
-	__gen8_teardown_va_range(&ppgtt->base,
-				 ppgtt->base.start, ppgtt->base.total,
-				 true);
-	WARN_ON(!bitmap_empty(ppgtt->pdp.used_pdpes,
-			      I915_PDPES_PER_PDP(ppgtt->base.dev)));
-	free_pdp_single(&ppgtt->pdp, ppgtt->base.dev);
+	gen8_teardown_va_range_3lvl(&ppgtt->base, &ppgtt->pdp,
+				    ppgtt->base.start, ppgtt->base.total,
+				    true);
+	BUG_ON(ppgtt->pdp.pagedirs); /* FIXME: 48b */
 }
 
 static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
@@ -733,7 +747,7 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
 
 /**
  * gen8_ppgtt_alloc_pagetabs() - Allocate page tables for VA range.
- * @ppgtt:	Master ppgtt structure.
+ * @vm:		Master vm structure.
  * @pd:		Page directory for this address range.
  * @start:	Starting virtual address to begin allocations.
 * @length:	Size of the allocations.
@@ -749,12 +763,13 @@ static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
  *
  * Return: 0 if success; negative error code otherwise.
  */
-static int gen8_ppgtt_alloc_pagetabs(struct i915_hw_ppgtt *ppgtt,
+static int gen8_ppgtt_alloc_pagetabs(struct i915_address_space *vm,
 				     struct i915_pagedir *pd,
 				     uint64_t start,
 				     uint64_t length,
 				     unsigned long *new_pts)
 {
+	struct drm_device *dev = vm->dev;
 	struct i915_pagetab *pt;
 	uint64_t temp;
 	uint32_t pde;
@@ -772,7 +787,7 @@ static int gen8_ppgtt_alloc_pagetabs(struct i915_hw_ppgtt *ppgtt,
 			continue;
 		}
 
-		pt = alloc_pt_single(ppgtt->base.dev);
+		pt = alloc_pt_single(dev);
 		if (IS_ERR(pt))
 			goto unwind_out;
 
@@ -784,14 +799,14 @@ static int gen8_ppgtt_alloc_pagetabs(struct i915_hw_ppgtt *ppgtt,
 
 unwind_out:
 	for_each_set_bit(pde, new_pts, I915_PDES_PER_PD)
-		free_pt_single(pd->page_tables[pde], ppgtt->base.dev);
+		free_pt_single(pd->page_tables[pde], dev);
 
 	return -ENOMEM;
 }
 
 /**
  * gen8_ppgtt_alloc_pagedirs() - Allocate page directories for VA range.
- * @ppgtt:	Master ppgtt structure.
+ * @vm:		Master vm structure.
  * @pdp:	Page directory pointer for this address range.
  * @start:	Starting virtual address to begin allocations.
 * @length:	Size of the allocations.
@@ -812,17 +827,17 @@ unwind_out:
  *
  * Return: 0 if success; negative error code otherwise.
  */
-static int gen8_ppgtt_alloc_pagedirs(struct i915_hw_ppgtt *ppgtt,
+static int gen8_ppgtt_alloc_pagedirs(struct i915_address_space *vm,
 				     struct i915_pagedirpo *pdp,
 				     uint64_t start,
 				     uint64_t length,
 				     unsigned long *new_pds)
 {
-	struct drm_device *dev = ppgtt->base.dev;
+	struct drm_device *dev = vm->dev;
 	struct i915_pagedir *pd;
 	uint64_t temp;
 	uint32_t pdpe;
-	size_t pdpes = I915_PDPES_PER_PDP(ppgtt->base.dev);
+	size_t pdpes = I915_PDPES_PER_PDP(vm->dev);
 
 	BUG_ON(!bitmap_empty(new_pds, pdpes));
 
@@ -833,7 +848,7 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_hw_ppgtt *ppgtt,
 		if (pd)
 			continue;
 
-		pd = alloc_pd_single(ppgtt->base.dev);
+		pd = alloc_pd_single(dev);
 		if (IS_ERR(pd))
 			goto unwind_out;
 
@@ -845,7 +860,7 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_hw_ppgtt *ppgtt,
 
 unwind_out:
 	for_each_set_bit(pdpe, new_pds, pdpes)
-		free_pd_single(pdp->pagedirs[pdpe], ppgtt->base.dev);
+		free_pd_single(pdp->pagedirs[pdpe], dev);
 
 	return -ENOMEM;
 }
@@ -899,12 +914,11 @@ err_out:
 	return -ENOMEM;
 }
 
-static int gen8_alloc_va_range(struct i915_address_space *vm,
-			       uint64_t start,
-			       uint64_t length)
+static int gen8_alloc_va_range_3lvl(struct i915_address_space *vm,
+				    struct i915_pagedirpo *pdp,
+				    uint64_t start,
+				    uint64_t length)
 {
-	struct i915_hw_ppgtt *ppgtt =
-		container_of(vm, struct i915_hw_ppgtt, base);
 	unsigned long *new_page_dirs, **new_page_tables;
 	struct drm_device *dev = vm->dev;
 	struct i915_pagedir *pd;
@@ -934,18 +948,15 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 		return ret;
 
 	/* Do the allocations first so we can easily bail out */
-	ret = gen8_ppgtt_alloc_pagedirs(ppgtt, &ppgtt->pdp, start, length,
-					new_page_dirs);
+	ret = gen8_ppgtt_alloc_pagedirs(vm, pdp, start, length, new_page_dirs);
 	if (ret) {
 		free_gen8_temp_bitmaps(new_page_dirs, new_page_tables, pdpes);
 		return ret;
 	}
 
-	/* For every page directory referenced, allocate page tables */
-	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+	gen8_for_each_pdpe(pd, pdp, start, length, temp, pdpe) {
 		bitmap_zero(new_page_tables[pdpe], I915_PDES_PER_PD);
-		ret = gen8_ppgtt_alloc_pagetabs(ppgtt, pd, start, length,
-						new_page_tables[pdpe]);
+		ret = gen8_ppgtt_alloc_pagetabs(vm, pd, start, length, new_page_tables[pdpe]);
 		if (ret)
 			goto err_out;
 	}
@@ -953,10 +964,7 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 	start = orig_start;
 	length = orig_length;
 
-	/* Allocations have completed successfully, so set the bitmaps, and do
-	 * the mappings. */
-	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
-		gen8_ppgtt_pde_t *const pagedir = kmap_atomic(pd->page);
+	gen8_for_each_pdpe(pd, pdp, start, length, temp, pdpe) {
 		struct i915_pagetab *pt;
 		uint64_t pd_len = gen8_clamp_pd(start, length);
 		uint64_t pd_start = start;
@@ -978,25 +986,12 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 
 			/* Our pde is now pointing to the pagetable, pt */
 			set_bit(pde, pd->used_pdes);
-
-			/* Map the PDE to the page table */
-			__gen8_do_map_pt(pagedir + pde, pt, vm->dev);
-
-			/* NB: We haven't yet mapped ptes to pages. At this
-			 * point we're still relying on insert_entries() */
-
-			/* No longer possible this page table is a zombie */
 			pt->zombie = 0;
 		}
 
-		if (!HAS_LLC(vm->dev))
-			drm_clflush_virt_range(pagedir, PAGE_SIZE);
-
-		kunmap_atomic(pagedir);
-
-		set_bit(pdpe, ppgtt->pdp.used_pdpes);
-		/* This pd is officially not a zombie either */
-		ppgtt->pdp.pagedirs[pdpe]->zombie = 0;
+		set_bit(pdpe, pdp->used_pdpes);
+		gen8_map_pagetable_range(pd, start, length, dev);
+		pd->zombie = 0;
 	}
 
 	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables, pdpes);
@@ -1005,16 +1000,36 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 err_out:
 	while (pdpe--) {
 		for_each_set_bit(temp, new_page_tables[pdpe], I915_PDES_PER_PD)
-			free_pt_single(pd->page_tables[temp], vm->dev);
+			free_pt_single(pd->page_tables[temp], dev);
 	}
 
 	for_each_set_bit(pdpe, new_page_dirs, pdpes)
-		free_pd_single(ppgtt->pdp.pagedirs[pdpe], vm->dev);
+		free_pd_single(pdp->pagedirs[pdpe], dev);
 
 	free_gen8_temp_bitmaps(new_page_dirs, new_page_tables, pdpes);
 	return ret;
 }
 
+static int __noreturn gen8_alloc_va_range_4lvl(struct i915_address_space *vm,
+					       struct i915_pml4 *pml4,
+					       uint64_t start,
+					       uint64_t length)
+{
+	BUG();
+}
+
+static int gen8_alloc_va_range(struct i915_address_space *vm,
+			       uint64_t start, uint64_t length)
+{
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(vm, struct i915_hw_ppgtt, base);
+
+	if (!HAS_48B_PPGTT(vm->dev))
+		return gen8_alloc_va_range_3lvl(vm, &ppgtt->pdp, start, length);
+	else
+		return gen8_alloc_va_range_4lvl(vm, &ppgtt->pml4, start, length);
+}
+
 static void gen8_ppgtt_fini_common(struct i915_hw_ppgtt *ppgtt)
 {
 	free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
@@ -1060,12 +1075,13 @@ static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 {
 	struct drm_device *dev = ppgtt->base.dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
 	struct i915_pagedir *pd;
 	uint64_t temp, start = 0, size = dev_priv->gtt.base.total;
 	uint32_t pdpe;
 	int ret;
 
-	ret = gen8_ppgtt_init_common(ppgtt, dev_priv->gtt.base.total);
+	ret = gen8_ppgtt_init_common(ppgtt, size);
 	if (ret)
 		return ret;
 
@@ -1078,8 +1094,8 @@ static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 		return ret;
 	}
 
-	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, size, temp, pdpe)
-		gen8_map_pagetable_range(pd, start, size, ppgtt->base.dev);
+	gen8_for_each_pdpe(pd, pdp, start, size, temp, pdpe)
+		gen8_map_pagetable_range(pd, start, size, dev);
 
 	ppgtt->base.allocate_va_range = NULL;
 	ppgtt->base.teardown_va_range = NULL;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 57/68] drm/i915/bdw: Add dynamic page trace events
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (55 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 56/68] drm/i915/bdw: Abstract PDP usage Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 58/68] drm/i915/bdw: Add ppgtt info for dynamic pages Ben Widawsky
                   ` (15 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

This works the same as GEN6.

I was disappointed that I now need to pass vm around, but it's not so
much uglier than the drm_device, and having the vm in trace events is
hugely important.

QUESTION: Now that I've rebased this on the zombie change, we probably want
to call it teardown and track unmaps as opposed to destruction.

v2: Consolidate pagetable/pagedirectory events
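
As an editor's sketch (not part of the patch), the new map/unmap events
record the PTE window touched within a single page table; assuming 4KB
pages and 512 PTEs per table, as this series uses, the reported values
amount to:

	/* Editor's sketch: first PTE index and PTE count for one table. */
	static inline uint32_t sketch_pte_index(uint64_t addr)
	{
		return (addr >> 12) & (512 - 1);
	}

	static inline size_t sketch_pte_count(uint64_t start, uint64_t length)
	{
		uint32_t first = sketch_pte_index(start);
		uint64_t pages = (length + 4095) >> 12;

		/* never report past the end of this page table */
		return min_t(size_t, pages, 512 - first);
	}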

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 35 +++++++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_trace.h   | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 7cc6cf9..679fe0d 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -588,19 +588,24 @@ static void __gen8_do_map_pt(gen8_ppgtt_pde_t * const pde,
 /* It's likely we'll map more than one pagetable at a time. This function will
  * save us unnecessary kmap calls, but do no more functionally than multiple
  * calls to map_pt. */
-static void gen8_map_pagetable_range(struct i915_pagedir *pd,
+static void gen8_map_pagetable_range(struct i915_address_space *vm,
+				     struct i915_pagedir *pd,
 				     uint64_t start,
-				     uint64_t length,
-				     struct drm_device *dev)
+				     uint64_t length)
 {
 	gen8_ppgtt_pde_t * const pagedir = kmap_atomic(pd->page);
 	struct i915_pagetab *pt;
 	uint64_t temp, pde;
 
-	gen8_for_each_pde(pt, pd, start, length, temp, pde)
-		__gen8_do_map_pt(pagedir + pde, pt, dev);
+	gen8_for_each_pde(pt, pd, start, length, temp, pde) {
+		__gen8_do_map_pt(pagedir + pde, pt, vm->dev);
+		trace_i915_pagetable_map(vm, pde, pt,
+					 gen8_pte_index(start),
+					 gen8_pte_count(start, length),
+					 GEN8_PTES_PER_PT);
+	}
 
-	if (!HAS_LLC(dev))
+	if (!HAS_LLC(vm->dev))
 		drm_clflush_virt_range(pagedir, PAGE_SIZE);
 
 	kunmap_atomic(pagedir);
@@ -668,6 +673,11 @@ static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 				     pde, vm);
 			}
 
+			trace_i915_pagetable_unmap(vm, pde, pt,
+						   gen8_pte_index(pd_start),
+						   gen8_pte_count(pd_start, pd_len),
+						   GEN8_PTES_PER_PT);
+
 			bitmap_clear(pt->used_ptes,
 				     gen8_pte_index(pd_start),
 				     gen8_pte_count(pd_start, pd_len));
@@ -680,6 +690,10 @@ static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 					continue;
 				}
 				free_pt_single(pt, dev);
+				trace_i915_pagetable_destroy(vm,
+							     pde,
+							     pd_start & GENMASK_ULL(63, GEN8_PDE_SHIFT),
+							     GEN8_PDE_SHIFT);
 				pd->page_tables[pde] = NULL;
 			}
 		}
@@ -696,6 +710,9 @@ static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 			}
 			free_pd_single(pd, dev);
 			pdp->pagedirs[pdpe] = NULL;
+			trace_i915_pagedirectory_destroy(vm, pdpe,
+							 start & GENMASK_ULL(63, GEN8_PDPE_SHIFT),
+							 GEN8_PDPE_SHIFT);
 		}
 	}
 
@@ -793,6 +810,7 @@ static int gen8_ppgtt_alloc_pagetabs(struct i915_address_space *vm,
 
 		pd->page_tables[pde] = pt;
 		set_bit(pde, new_pts);
+		trace_i915_pagetable_alloc(vm, pde, start, GEN8_PDE_SHIFT);
 	}
 
 	return 0;
@@ -854,6 +872,7 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_address_space *vm,
 
 		pdp->pagedirs[pdpe] = pd;
 		set_bit(pdpe, new_pds);
+		trace_i915_pagedirectory_alloc(vm, pdpe, start, GEN8_PDPE_SHIFT);
 	}
 
 	return 0;
@@ -990,7 +1009,7 @@ static int gen8_alloc_va_range_3lvl(struct i915_address_space *vm,
 		}
 
 		set_bit(pdpe, pdp->used_pdpes);
-		gen8_map_pagetable_range(pd, start, length, dev);
+		gen8_map_pagetable_range(vm, pd, start, length);
 		pd->zombie = 0;
 	}
 
@@ -1095,7 +1114,7 @@ static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	}
 
 	gen8_for_each_pdpe(pd, pdp, start, size, temp, pdpe)
-		gen8_map_pagetable_range(pd, start, size, dev);
+		gen8_map_pagetable_range(&ppgtt->base, pd, start, size);
 
 	ppgtt->base.allocate_va_range = NULL;
 	ppgtt->base.teardown_va_range = NULL;
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 2d21c54..c6137de 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -225,6 +225,38 @@ DEFINE_EVENT(i915_pagetable, i915_pagetable_destroy,
 	     TP_ARGS(vm, pde, start, pde_shift)
 );
 
+DEFINE_EVENT_PRINT(i915_pagetable, i915_pagedirectory_alloc,
+		   TP_PROTO(struct i915_address_space *vm, u32 pdpe, u64 start, u64 pdpe_shift),
+		   TP_ARGS(vm, pdpe, start, pdpe_shift),
+
+		   TP_printk("vm=%p, pdpe=%d (0x%llx-0x%llx)",
+			     __entry->vm, __entry->pde, __entry->start, __entry->end)
+);
+
+DEFINE_EVENT_PRINT(i915_pagetable, i915_pagedirectory_destroy,
+		   TP_PROTO(struct i915_address_space *vm, u32 pdpe, u64 start, u64 pdpe_shift),
+		   TP_ARGS(vm, pdpe, start, pdpe_shift),
+
+		   TP_printk("vm=%p, pdpe=%d (0x%llx-0x%llx)",
+			     __entry->vm, __entry->pde, __entry->start, __entry->end)
+);
+
+DEFINE_EVENT_PRINT(i915_pagetable, i915_pagedirpo_alloc,
+		   TP_PROTO(struct i915_address_space *vm, u32 pml4e, u64 start, u64 pml4e_shift),
+		   TP_ARGS(vm, pml4e, start, pml4e_shift),
+
+		   TP_printk("vm=%p, pml4e=%d (0x%llx-0x%llx)",
+			     __entry->vm, __entry->pde, __entry->start, __entry->end)
+);
+
+DEFINE_EVENT_PRINT(i915_pagetable, i915_pagedirpo_destroy,
+		   TP_PROTO(struct i915_address_space *vm, u32 pml4e, u64 start, u64 pml4e_shift),
+		   TP_ARGS(vm, pml4e, start, pml4e_shift),
+
+		   TP_printk("vm=%p, pml4e=%d (0x%llx-0x%llx)",
+			     __entry->vm, __entry->pde, __entry->start, __entry->end)
+);
+
 /* Avoid extra math because we only support two sizes. The format is defined by
  * bitmap_scnprintf. Each 32 bits is 8 HEX digits followed by comma */
 #define TRACE_PT_SIZE(bits) \
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 58/68] drm/i915/bdw: Add ppgtt info for dynamic pages
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (56 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 57/68] drm/i915/bdw: Add dynamic page trace events Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 59/68] drm/i915/bdw: implement alloc/teardown for 4lvl Ben Widawsky
                   ` (14 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 52 ++++++++++++++++++++++++++++---------
 drivers/gpu/drm/i915/i915_gem_gtt.c | 33 +++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_gtt.h |  9 +++++++
 3 files changed, 82 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 09a64e6..1322718 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1825,10 +1825,38 @@ static size_t gen6_ppgtt_count_pt_pages(struct i915_hw_ppgtt *ppgtt)
 	return cnt;
 }
 
+static void gen8_ppgtt_debugfs_counter(struct i915_pagedirpo *pdp,
+				       struct i915_pagedir *pd,
+				       struct i915_pagetab *pt,
+				       unsigned pdpe,
+				       unsigned pde,
+				       void *data)
+{
+	if (!pd || !pt)
+		return;
+
+	(*(size_t *)data)++;
+}
+
+static size_t gen8_ppgtt_count_pt_pages(struct i915_hw_ppgtt *ppgtt)
+{
+	size_t count = 0;
+
+	gen8_for_every_pdpe_pde(ppgtt, gen8_ppgtt_debugfs_counter, &count);
+
+	return count;
+}
+
 static void print_ppgtt(struct seq_file *m, struct i915_hw_ppgtt *ppgtt)
 {
-	seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd.pd_offset);
-	seq_printf(m, "\tpd pages: %zu\n", gen6_ppgtt_count_pt_pages(ppgtt));
+	struct drm_device *dev = ppgtt->base.dev;
+
+	if (INTEL_INFO(dev)->gen < 8) {
+		seq_printf(m, "\tpd pages: %zu\n", gen6_ppgtt_count_pt_pages(ppgtt));
+		seq_printf(m, "pd gtt offset: 0x%08x\n", ppgtt->pd.pd_offset);
+	} else {
+		seq_printf(m, "\tpage table overhead: %zu pages\n", gen8_ppgtt_count_pt_pages(ppgtt));
+	}
 }
 
 static void gen8_ppgtt_info(struct seq_file *m, struct drm_device *dev, int verbose)
@@ -1876,7 +1904,6 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev, bool ver
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_engine_cs *ring;
-	struct drm_file *file;
 	int i;
 
 	if (INTEL_INFO(dev)->gen == 6)
@@ -1901,15 +1928,6 @@ static void gen6_ppgtt_info(struct seq_file *m, struct drm_device *dev, bool ver
 			ppgtt->debug_dump(ppgtt, m);
 	} else
 		return;
-
-	list_for_each_entry_reverse(file, &dev->filelist, lhead) {
-		struct drm_i915_file_private *file_priv = file->driver_priv;
-
-		seq_printf(m, "proc: %s\n",
-			   get_pid_task(file->pid, PIDTYPE_PID)->comm);
-		idr_for_each(&file_priv->context_idr, per_file_ctx,
-			     (void *)((unsigned long)m | verbose));
-	}
 }
 
 static int i915_ppgtt_info(struct seq_file *m, void *data)
@@ -1918,6 +1936,7 @@ static int i915_ppgtt_info(struct seq_file *m, void *data)
 	struct drm_device *dev = node->minor->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	bool verbose = node->info_ent->data ? true : false;
+	struct drm_file *file;
 
 	int ret = mutex_lock_interruptible(&dev->struct_mutex);
 	if (ret)
@@ -1929,6 +1948,15 @@ static int i915_ppgtt_info(struct seq_file *m, void *data)
 	else if (INTEL_INFO(dev)->gen >= 6)
 		gen6_ppgtt_info(m, dev, verbose);
 
+	list_for_each_entry_reverse(file, &dev->filelist, lhead) {
+		struct drm_i915_file_private *file_priv = file->driver_priv;
+
+		seq_printf(m, "\nproc: %s\n",
+			   get_pid_task(file->pid, PIDTYPE_PID)->comm);
+		idr_for_each(&file_priv->context_idr, per_file_ctx,
+			     (void *)((unsigned long)m | verbose));
+	}
+
 	intel_runtime_pm_put(dev_priv);
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 679fe0d..43df3ee 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2151,6 +2151,39 @@ static void gen8_ggtt_clear_range(struct i915_address_space *vm,
 	readl(gtt_base);
 }
 
+void gen8_for_every_pdpe_pde(struct i915_hw_ppgtt *ppgtt,
+			     void (*callback)(struct i915_pagedirpo *pdp,
+					      struct i915_pagedir *pd,
+					      struct i915_pagetab *pt,
+					      unsigned pdpe,
+					      unsigned pde,
+					      void *data),
+			     void *data)
+{
+	struct drm_device *dev = ppgtt->base.dev;
+	uint64_t start = ppgtt->base.start;
+	uint64_t length = ppgtt->base.total;
+	uint64_t pdpe, pde, temp;
+
+	struct i915_pagedir *pd;
+	struct i915_pagetab *pt;
+
+	gen8_for_each_pdpe(pd, &ppgtt->pdp, start, length, temp, pdpe) {
+		uint64_t pd_start = start, pd_length = length;
+		int i;
+
+		if (pd == NULL) {
+			for (i = 0; i < I915_PDES_PER_PD; i++)
+				callback(&ppgtt->pdp, NULL, NULL, pdpe, i, data);
+			continue;
+		}
+
+		gen8_for_each_pde(pt, pd, pd_start, pd_length, temp, pde) {
+			callback(&ppgtt->pdp, pd, pt, pdpe, pde, data);
+		}
+	}
+}
+
 static void gen6_ggtt_clear_range(struct i915_address_space *vm,
 				  uint64_t start,
 				  uint64_t length,
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 95b5d16..9d90995 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -508,6 +508,15 @@ static inline size_t gen8_pde_count(uint64_t addr, uint64_t length)
 	return i915_pde_count(addr, length, GEN8_PDE_SHIFT);
 }
 
+void gen8_for_every_pdpe_pde(struct i915_hw_ppgtt *ppgtt,
+			     void (*callback)(struct i915_pagedirpo *pdp,
+					      struct i915_pagedir *pd,
+					      struct i915_pagetab *pt,
+					      unsigned pdpe,
+					      unsigned pde,
+					      void *data),
+			     void *data);
+
 int i915_gem_gtt_init(struct drm_device *dev);
 void i915_gem_init_global_gtt(struct drm_device *dev);
 void i915_gem_setup_global_gtt(struct drm_device *dev, unsigned long start,
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 59/68] drm/i915/bdw: implement alloc/teardown for 4lvl
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (57 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 58/68] drm/i915/bdw: Add ppgtt info for dynamic pages Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 60/68] drm/i915/bdw: Add 4 level switching infrastructure Ben Widawsky
                   ` (13 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The code for 4lvl works just as one would expect; conveniently, it is able
to call into the existing 3lvl page table code to handle all of the
lower levels.

PML4 has no special attributes. We do not track its zombie status
because there will always be a PML4. So simply initialize it at
creation, and destroy it at teardown. (A similar argument can be made
for PDPs when not using sparse addresses).

Almost none of the fanciness here will be exercised since the switch isn't
flipped until later.
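
The delegation itself is simple. As a sketch (mirroring the shape of
the patch below; error unwinding omitted), the 4lvl allocator just
walks the pml4 entries and reuses the 3lvl code for everything
underneath each pdp:

	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e) {
		int ret = gen8_alloc_va_range_3lvl(vm, pdp, start, length);
		if (ret)
			goto err_out;
	}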

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 246 +++++++++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_gtt.h |  14 +-
 2 files changed, 229 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 43df3ee..9b3358f 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -405,9 +405,12 @@ static void __pdp_fini(struct i915_pagedirpo *pdp)
 static void free_pdp_single(struct i915_pagedirpo *pdp,
 			    struct drm_device *dev)
 {
-	__pdp_fini(pdp);
-	if (HAS_48B_PPGTT(dev))
+	if (HAS_48B_PPGTT(dev)) {
+		__pdp_fini(pdp);
+		i915_dma_unmap_single(pdp, dev);
+		__free_page(pdp->page);
 		kfree(pdp);
+	}
 }
 
 static int __pdp_init(struct i915_pagedirpo *pdp,
@@ -433,6 +436,60 @@ static int __pdp_init(struct i915_pagedirpo *pdp,
 	return 0;
 }
 
+static struct i915_pagedirpo *alloc_pdp_single(struct i915_hw_ppgtt *ppgtt,
+					       struct i915_pml4 *pml4)
+{
+	struct drm_device *dev = ppgtt->base.dev;
+	struct i915_pagedirpo *pdp;
+	int ret;
+
+	BUG_ON(!HAS_48B_PPGTT(dev));
+
+	pdp = kmalloc(sizeof(*pdp), GFP_KERNEL);
+	if (!pdp)
+		return ERR_PTR(-ENOMEM);
+
+	pdp->page = alloc_page(GFP_KERNEL | GFP_DMA32 | __GFP_ZERO);
+	if (!pdp->page) {
+		kfree(pdp);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ret = __pdp_init(pdp, dev);
+	if (ret) {
+		__free_page(pdp->page);
+		kfree(pdp);
+		return ERR_PTR(ret);
+	}
+
+	i915_dma_map_px_single(pdp, dev);
+
+	return pdp;
+}
+
+static void pml4_fini(struct i915_pml4 *pml4)
+{
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(pml4, struct i915_hw_ppgtt, pml4);
+	i915_dma_unmap_single(pml4, ppgtt->base.dev);
+	__free_page(pml4->page);
+	/* HACK */
+	pml4->page = NULL;
+}
+
+static int pml4_init(struct i915_hw_ppgtt *ppgtt)
+{
+	struct i915_pml4 *pml4 = &ppgtt->pml4;
+
+	pml4->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!pml4->page)
+		return -ENOMEM;
+
+	i915_dma_map_px_single(pml4, ppgtt->base.dev);
+
+	return 0;
+}
+
 /* Broadwell Page Directory Pointer Descriptors */
 static int gen8_write_pdp(struct intel_engine_cs *ring,
 			  unsigned entry,
@@ -611,7 +668,7 @@ static void gen8_map_pagetable_range(struct i915_address_space *vm,
 	kunmap_atomic(pagedir);
 }
 
-static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
+static bool gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 					struct i915_pagedirpo *pdp,
 					uint64_t start, uint64_t length,
 					bool dead)
@@ -620,14 +677,23 @@ static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 	struct i915_pagedir *pd;
 	struct i915_pagetab *pt;
 	uint64_t temp;
-	uint32_t pdpe, pde;
+	uint32_t pdpe, pde, orig_start = start;
 
 	BUG_ON(!pdp);
+
+	if (pdp->zombie) {
+		free_pdp_single(pdp, dev);
+		trace_i915_pagedirpo_destroy(vm, 0,
+					     orig_start & GENMASK_ULL(63, GEN8_PML4E_SHIFT),
+					     GEN8_PML4E_SHIFT);
+		return true;
+	}
+
 	if (!pdp->pagedirs) {
 		WARN(!bitmap_empty(pdp->used_pdpes, I915_PDPES_PER_PDP(dev)),
 		     "Page directory leak detected\n");
 		/* If pagedirs are already free, there is nothing to do.*/
-		return;
+		return false;
 	}
 
 	gen8_for_each_pdpe(pd, pdp, start, length, temp, pdpe) {
@@ -716,8 +782,18 @@ static void gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 		}
 	}
 
-	if (dead && bitmap_empty(pdp->used_pdpes, I915_PDPES_PER_PDP(dev)))
-		free_pdp_single(pdp, dev);
+	if (bitmap_empty(pdp->used_pdpes, I915_PDPES_PER_PDP(dev))) {
+		if (!dead) {
+			pdp->zombie = 1;
+		} else {
+			free_pdp_single(pdp, dev);
+			trace_i915_pagedirpo_destroy(vm, 0,
+						     orig_start & GENMASK_ULL(63, GEN8_PML4E_SHIFT),
+						     GEN8_PML4E_SHIFT);
+		}
+		return true;
+	}
+	return false;
 }
 
 static void gen8_teardown_va_range_4lvl(struct i915_address_space *vm,
@@ -725,19 +801,49 @@ static void gen8_teardown_va_range_4lvl(struct i915_address_space *vm,
 					uint64_t start, uint64_t length,
 					bool dead)
 {
-	BUG();
+	struct i915_pagedirpo *pdp;
+	uint64_t temp, pml4e;
+
+	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e) {
+		if (!pdp)
+			continue;
+
+		if (gen8_teardown_va_range_3lvl(vm, pdp, start, length, dead)) {
+			clear_bit(pml4e, pml4->used_pml4es);
+			pml4->pdps[pml4e] = NULL;
+			/* pdp may already have been freed; don't touch it */
+			continue;
+		}
+
+		WARN_ON(!test_bit(pml4e, pml4->used_pml4es) && !pdp->zombie);
+		WARN_ON(test_bit(pml4e, pml4->used_pml4es) && pdp->zombie);
+	}
 }
 
-static void gen8_teardown_va_range(struct i915_address_space *vm,
-				   uint64_t start, uint64_t length)
+static void __gen8_teardown_va_range(struct i915_address_space *vm,
+				     uint64_t start, uint64_t length,
+				     bool dead)
 {
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 
-	if (!HAS_48B_PPGTT(vm->dev))
-		gen8_teardown_va_range_3lvl(vm, &ppgtt->pdp, start, length, false);
-	else
-		gen8_teardown_va_range_4lvl(vm, &ppgtt->pml4, start, length, false);
+	if (!HAS_48B_PPGTT(vm->dev)) {
+		gen8_teardown_va_range_3lvl(vm, &ppgtt->pdp, start, length, dead);
+		if (dead) {
+			WARN_ON(!bitmap_empty(ppgtt->pdp.used_pdpes, I915_PDPES_PER_PDP(vm->dev)));
+			__pdp_fini(&ppgtt->pdp);
+		}
+	} else {
+		gen8_teardown_va_range_4lvl(vm, &ppgtt->pml4, start, length, dead);
+		if (dead) {
+			WARN_ON(!bitmap_empty(ppgtt->pml4.used_pml4es, GEN8_PML4ES_PER_PML4));
+			pml4_fini(&ppgtt->pml4);
+		}
+	}
+}
+
+static void gen8_teardown_va_range(struct i915_address_space *vm,
+				   uint64_t start, uint64_t length)
+{
+	__gen8_teardown_va_range(vm, start, length, false);
 }
 
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
@@ -745,10 +851,12 @@ static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 	trace_i915_va_teardown(&ppgtt->base,
 			       ppgtt->base.start, ppgtt->base.total,
 			       VM_TO_TRACE_NAME(&ppgtt->base));
-	gen8_teardown_va_range_3lvl(&ppgtt->base, &ppgtt->pdp,
-				    ppgtt->base.start, ppgtt->base.total,
-				    true);
-	BUG_ON(ppgtt->pdp.pagedirs); /* FIXME: 48b */
+	__gen8_teardown_va_range(&ppgtt->base,
+				 ppgtt->base.start, ppgtt->base.total, true);
+	if (!HAS_48B_PPGTT(ppgtt->base.dev))
+		BUG_ON(ppgtt->pdp.pagedirs);
+	else
+		BUG_ON(ppgtt->pml4.page);
 }
 
 static void gen8_ppgtt_cleanup(struct i915_address_space *vm)
@@ -1029,12 +1137,81 @@ err_out:
 	return ret;
 }
 
-static int __noreturn gen8_alloc_va_range_4lvl(struct i915_address_space *vm,
-					       struct i915_pml4 *pml4,
-					       uint64_t start,
-					       uint64_t length)
+static int gen8_alloc_va_range_4lvl(struct i915_address_space *vm,
+				    struct i915_pml4 *pml4,
+				    uint64_t start,
+				    uint64_t length)
 {
-	BUG();
+	DECLARE_BITMAP(new_pdps, GEN8_PML4ES_PER_PML4);
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagedirpo *pdp;
+	const uint64_t orig_start = start;
+	const uint64_t orig_length = length;
+	uint64_t temp, pml4e;
+
+	/* Do the pml4 allocations first, so we don't need to track the newly
+	 * allocated tables below the pdp */
+	bitmap_zero(new_pdps, GEN8_PML4ES_PER_PML4);
+
+	/* The pagedirectory and pagetable allocations are done in the shared 3
+	 * and 4 level code. Just allocate the pdps.
+	 */
+	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e) {
+		if (!pdp) {
+			WARN_ON(test_bit(pml4e, pml4->used_pml4es));
+			pdp = alloc_pdp_single(ppgtt, pml4);
+			if (IS_ERR(pdp))
+				goto err_alloc;
+
+			pml4->pdps[pml4e] = pdp;
+			set_bit(pml4e, new_pdps);
+			trace_i915_pagedirpo_alloc(&ppgtt->base, pml4e,
+						   pml4e << GEN8_PML4E_SHIFT,
+						   GEN8_PML4E_SHIFT);
+
+		} else {
+			WARN(!pdp->zombie &&
+			     !test_bit(pml4e, pml4->used_pml4es), "%lld %p", pml4e, vm);
+		}
+	}
+
+	WARN(bitmap_weight(new_pdps, GEN8_PML4ES_PER_PML4) > 2,
+	     "The allocation has spanned more than 512GB. "
+	     "It is highly likely this is incorrect.");
+
+	start = orig_start;
+	length = orig_length;
+
+	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e) {
+		int ret;
+
+		BUG_ON(!pdp);
+
+		ret = gen8_alloc_va_range_3lvl(vm, pdp, start, length);
+		if (ret)
+			goto err_out;
+	}
+
+	bitmap_or(pml4->used_pml4es, new_pdps, pml4->used_pml4es,
+		  GEN8_PML4ES_PER_PML4);
+
+	for_each_set_bit(pml4e, pml4->used_pml4es, GEN8_PML4ES_PER_PML4)
+		pml4->pdps[pml4e]->zombie = 0;
+
+	return 0;
+
+err_out:
+	/* This will teardown more than we allocated. It should be fine, and
+	 * makes code simpler. */
+	start = orig_start;
+	length = orig_length;
+	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e)
+		gen8_teardown_va_range_3lvl(vm, pdp, start, length, false);
+
+err_alloc:
+	for_each_set_bit(pml4e, new_pdps, GEN8_PML4ES_PER_PML4)
+		free_pdp_single(pml4->pdps[pml4e], vm->dev);
+
+	return -ENOMEM;
+}
 
 static int gen8_alloc_va_range(struct i915_address_space *vm,
@@ -1043,16 +1220,19 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 
-	if (!HAS_48B_PPGTT(vm->dev))
-		return gen8_alloc_va_range_3lvl(vm, &ppgtt->pdp, start, length);
-	else
+	if (HAS_48B_PPGTT(vm->dev))
 		return gen8_alloc_va_range_4lvl(vm, &ppgtt->pml4, start, length);
+	else
+		return gen8_alloc_va_range_3lvl(vm, &ppgtt->pdp, start, length);
 }
 
 static void gen8_ppgtt_fini_common(struct i915_hw_ppgtt *ppgtt)
 {
 	free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
-	free_pdp_single(&ppgtt->pdp, ppgtt->base.dev);
+	if (HAS_48B_PPGTT(ppgtt->base.dev))
+		pml4_fini(&ppgtt->pml4);
+	else
+		free_pdp_single(&ppgtt->pdp, ppgtt->base.dev);
 }
 
 /**
@@ -1076,7 +1256,13 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->enable = gen8_ppgtt_enable;
 	ppgtt->switch_mm = gen8_mm_switch;
 
-	if (!HAS_48B_PPGTT(ppgtt->base.dev)) {
+	if (HAS_48B_PPGTT(ppgtt->base.dev)) {
+		int ret = pml4_init(ppgtt);
+		if (ret) {
+			free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
+			return ret;
+		}
+	} else {
 		int ret = __pdp_init(&ppgtt->pdp, false);
 		if (ret) {
 			free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
@@ -1084,8 +1270,8 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 		}
 
 		ppgtt->switch_mm = gen8_mm_switch;
-	} else
-		BUG(); /* Not yet implemented */
+		trace_i915_pagedirpo_alloc(&ppgtt->base, 0, 0, GEN8_PML4E_SHIFT);
+	}
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 9d90995..ba103bd 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -103,6 +103,7 @@ typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
 #endif
 #define GEN8_PML4ES_PER_PML4		512
 #define GEN8_PML4E_SHIFT		39
+#define GEN8_PML4E_MASK			(GEN8_PML4ES_PER_PML4 - 1)
 #define GEN8_PDPE_SHIFT			30
 /* NB: GEN8_PDPE_MASK is wider than legacy 32b platforms require, but the
  * extra bits have no effect on their page tables */
@@ -226,6 +227,7 @@ struct i915_pagedirpo {
 	dma_addr_t daddr;
 	unsigned long *used_pdpes;
 	struct i915_pagedir **pagedirs;
+	unsigned zombie:1;
 };
 
 struct i915_pml4 {
@@ -233,6 +235,7 @@ struct i915_pml4 {
 	dma_addr_t daddr;
 	DECLARE_BITMAP(used_pml4es, GEN8_PML4ES_PER_PML4);
 	struct i915_pagedirpo *pdps[GEN8_PML4ES_PER_PML4];
+	/* Don't bother tracking zombie. Just always leave it around */
 };
 
 struct i915_address_space {
@@ -455,9 +458,18 @@ static inline size_t gen6_pde_count(uint32_t addr, uint32_t length)
 	     temp = min(temp, length),					\
 	     start += temp, length -= temp)
 
+#define gen8_for_each_pml4e(pdp, pml4, start, length, temp, iter)	\
+	for (iter = gen8_pml4e_index(start), pdp = (pml4)->pdps[iter];	\
+	     length > 0 && iter < GEN8_PML4ES_PER_PML4;			\
+	     pdp = (pml4)->pdps[++iter],				\
+	     temp = ALIGN(start+1, 1ULL << GEN8_PML4E_SHIFT) - start,	\
+	     temp = min(temp, length),					\
+	     start += temp, length -= temp)
+
 #define gen8_for_each_pdpe(pd, pdp, start, length, temp, iter)		\
 	gen8_for_each_pdpe_e(pd, pdp, start, length, temp, iter, I915_PDPES_PER_PDP(dev))
 
 /* Clamp length to the next pagetab boundary */
 static inline uint64_t gen8_clamp_pt(uint64_t start, uint64_t length)
 {
@@ -495,7 +507,7 @@ static inline uint32_t gen8_pdpe_index(uint64_t address)
 
 static inline uint32_t gen8_pml4e_index(uint64_t address)
 {
-	BUG();
+	return (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
 }
 
 static inline size_t gen8_pte_count(uint64_t addr, uint64_t length)
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 60/68] drm/i915/bdw: Add 4 level switching infrastructure
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (58 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 59/68] drm/i915/bdw: implement alloc/teardown for 4lvl Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 61/68] drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT Ben Widawsky
                   ` (12 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Map is easy: it's the same register as the PDP descriptor 0, but it only
has one entry. Also, the mapping code is now trivial thanks to all of
the prep patches.

This is mostly just cleaning up the loose ends of 4lvl before we can
clean up all the code and rip out the legacy page table update
functions, insert and clear entries.
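
For reference (an editor's sketch; the shifts come from the header
changes earlier in this series), a 48b address decomposes into the four
level indices like this:

	/* Editor's sketch: PML4E shift 39, PDPE 30, PDE 21, 4KB pages;
	 * each level holds 512 entries. */
	static inline void sketch_decode_48b(uint64_t addr,
					     unsigned *pml4e, unsigned *pdpe,
					     unsigned *pde, unsigned *pte)
	{
		*pml4e = (addr >> 39) & 0x1ff;
		*pdpe  = (addr >> 30) & 0x1ff;
		*pde   = (addr >> 21) & 0x1ff;
		*pte   = (addr >> 12) & 0x1ff;
	}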

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 60 ++++++++++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_gtt.h |  4 ++-
 drivers/gpu/drm/i915/i915_reg.h     |  1 +
 3 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 9b3358f..12c42ea 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -103,6 +103,9 @@ static inline gen8_ppgtt_pde_t gen8_pde_encode(struct drm_device *dev,
 	return pde;
 }
 
+#define gen8_pdpe_encode gen8_pde_encode
+#define gen8_pml4e_encode gen8_pde_encode
+
 static gen6_gtt_pte_t snb_pte_encode(dma_addr_t addr,
 				     enum i915_cache_level level,
 				     bool valid, u32 unused)
@@ -522,9 +525,9 @@ static int gen8_write_pdp(struct intel_engine_cs *ring,
 	return 0;
 }
 
-static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
-			  struct intel_engine_cs *ring,
-			  bool synchronous)
+static int gen8_legacy_mm_switch(struct i915_hw_ppgtt *ppgtt,
+				 struct intel_engine_cs *ring,
+				 bool synchronous)
 {
 	int i, ret;
 
@@ -541,6 +544,13 @@ static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
 	return 0;
 }
 
+static int gen8_48b_mm_switch(struct i915_hw_ppgtt *ppgtt,
+			      struct intel_engine_cs *ring,
+			      bool synchronous)
+{
+	return gen8_write_pdp(ring, 0, ppgtt->pml4.daddr, synchronous);
+}
+
 static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 				   uint64_t start,
 				   uint64_t length,
@@ -668,6 +678,37 @@ static void gen8_map_pagetable_range(struct i915_address_space *vm,
 	kunmap_atomic(pagedir);
 }
 
+static void gen8_map_page_directory(struct i915_pagedirpo *pdp,
+				    struct i915_pagedir *pd,
+				    int index,
+				    struct drm_device *dev)
+{
+	gen8_ppgtt_pdpe_t *pagedirpo;
+	gen8_ppgtt_pdpe_t pdpe;
+
+	/* We do not need to clflush because no platform requiring flush
+	 * supports 64b pagetables. */
+	if (!HAS_48B_PPGTT(dev))
+		return;
+
+	pagedirpo = kmap_atomic(pdp->page);
+	pdpe = gen8_pdpe_encode(dev, pd->daddr, I915_CACHE_LLC);
+	pagedirpo[index] = pdpe;
+	kunmap_atomic(pagedirpo);
+}
+
+static void gen8_map_page_directory_pointer(struct i915_pml4 *pml4,
+					    struct i915_pagedirpo *pdp,
+					    int index,
+					    struct drm_device *dev)
+{
+	gen8_ppgtt_pml4e_t *pagemap = kmap_atomic(pml4->page);
+	gen8_ppgtt_pml4e_t pml4e = gen8_pml4e_encode(dev, pdp->daddr, I915_CACHE_LLC);
+	BUG_ON(!HAS_48B_PPGTT(dev));
+	pagemap[index] = pml4e;
+	kunmap_atomic(pagemap);
+}
+
 static bool gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 					struct i915_pagedirpo *pdp,
 					uint64_t start, uint64_t length,
@@ -1118,6 +1159,7 @@ static int gen8_alloc_va_range_3lvl(struct i915_address_space *vm,
 
 		set_bit(pdpe, pdp->used_pdpes);
 		gen8_map_pagetable_range(vm, pd, start, length);
+		gen8_map_page_directory(pdp, pd, pdpe, dev);
 		pd->zombie = 0;
 	}
 
@@ -1191,6 +1233,8 @@ static int gen8_alloc_va_range_4lvl(struct i915_address_space *vm,
 		ret = gen8_alloc_va_range_3lvl(vm, pdp, start, length);
 		if (ret)
 			goto err_out;
+
+		gen8_map_page_directory_pointer(pml4, pdp, pml4e, vm->dev);
 	}
 
 	bitmap_or(pml4->used_pml4es, new_pdps, pml4->used_pml4es,
@@ -1252,9 +1296,7 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.total = size;
 	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
 	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
-
 	ppgtt->enable = gen8_ppgtt_enable;
-	ppgtt->switch_mm = gen8_mm_switch;
 
 	if (HAS_48B_PPGTT(ppgtt->base.dev)) {
 		int ret = pml4_init(ppgtt);
@@ -1262,6 +1304,8 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 			free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
 			return ret;
 		}
+
+		ppgtt->switch_mm = gen8_48b_mm_switch;
 	} else {
 		int ret = __pdp_init(&ppgtt->pdp, false);
 		if (ret) {
@@ -1269,7 +1313,7 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 			return ret;
 		}
 
-		ppgtt->switch_mm = gen8_mm_switch;
+		ppgtt->switch_mm = gen8_legacy_mm_switch;
 		trace_i915_pagedirpo_alloc(&ppgtt->base, 0, 0, GEN8_PML4E_SHIFT);
 	}
 
@@ -1299,6 +1343,7 @@ static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 		return ret;
 	}
 
+	/* FIXME: PML4 */
 	gen8_for_each_pdpe(pd, pdp, start, size, temp, pdpe)
 		gen8_map_pagetable_range(&ppgtt->base, pd, start, size);
 
@@ -1536,8 +1581,9 @@ static int gen8_ppgtt_enable(struct i915_hw_ppgtt *ppgtt)
 	int j, ret;
 
 	for_each_ring(ring, dev_priv, j) {
+		u32 four_level = HAS_48B_PPGTT(dev) ? GEN8_GFX_PPGTT_64B : 0;
 		I915_WRITE(RING_MODE_GEN7(ring),
-			   _MASKED_BIT_ENABLE(GFX_PPGTT_ENABLE));
+			   _MASKED_BIT_ENABLE(GFX_PPGTT_ENABLE | four_level));
 
 		/* We promise to do a switch later with FULL PPGTT. If this is
 		 * aliasing, this is the one and only switch we'll do */
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index ba103bd..f4c611e 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -36,7 +36,9 @@
 
 typedef uint32_t gen6_gtt_pte_t;
 typedef uint64_t gen8_gtt_pte_t;
-typedef gen8_gtt_pte_t gen8_ppgtt_pde_t;
+typedef gen8_gtt_pte_t		gen8_ppgtt_pde_t;
+typedef gen8_ppgtt_pde_t	gen8_ppgtt_pdpe_t;
+typedef gen8_ppgtt_pdpe_t	gen8_ppgtt_pml4e_t;
 
 /* GEN Agnostic defines */
 #define I915_PAGE_SIZE			4096
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 7a6cc69..afb2515 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1227,6 +1227,7 @@ enum punit_power_well {
 #define   GFX_REPLAY_MODE		(1<<11)
 #define   GFX_PSMI_GRANULARITY		(1<<10)
 #define   GFX_PPGTT_ENABLE		(1<<9)
+#define   GEN8_GFX_PPGTT_64B		(1<<7)
 
 #define VLV_DISPLAY_BASE 0x180000
 #define VLV_MIPI_BASE VLV_DISPLAY_BASE
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 61/68] drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (59 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 60/68] drm/i915/bdw: Add 4 level switching infrastructure Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 62/68] drm/i915: Plumb sg_iter through va allocation ->maps Ben Widawsky
                   ` (11 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

The insert_entries function was the function used to write PTEs. For the
PPGTT it was "hardcoded" to only understand two level page tables, which
was the case for GEN7. We can reuse this for 4 level page tables, and
remove the concept of insert_entries, which was never viable past 2
level page tables anyway, but it requires some rework to make the
function more generic.

This patch begins the generalization work; it will be built upon
heavily once the 48b code is complete. The series attempts to keep each
function specific to a single page table level, and this patch is no
exception. Extra variables (such as the PPGTT) distract and leave room
for bugs, since the function shouldn't be touching anything in the
higher order page tables.
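
A minimal sketch of the resulting split (editor's illustration,
mirroring the patch below): the vm-aware wrapper shrinks to root
resolution plus scratch-PTE encoding, and the PTE writer speaks only in
pdp terms:

	static void sketch_clear_range(struct i915_address_space *vm,
				       uint64_t start, uint64_t length,
				       bool use_scratch)
	{
		struct i915_hw_ppgtt *ppgtt =
			container_of(vm, struct i915_hw_ppgtt, base);
		gen8_gtt_pte_t scratch_pte =
			gen8_pte_encode(vm->scratch.addr, I915_CACHE_LLC,
					use_scratch);

		gen8_ppgtt_clear_pte_range(&ppgtt->pdp, start, length,
					   scratch_pte, !HAS_LLC(vm->dev));
	}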

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 53 +++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 12c42ea..116d13f 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -551,23 +551,19 @@ static int gen8_48b_mm_switch(struct i915_hw_ppgtt *ppgtt,
 	return gen8_write_pdp(ring, 0, ppgtt->pml4.daddr, synchronous);
 }
 
-static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
-				   uint64_t start,
-				   uint64_t length,
-				   bool use_scratch)
+static void gen8_ppgtt_clear_pte_range(struct i915_pagedirpo *pdp,
+				       uint64_t start,
+				       uint64_t length,
+				       gen8_gtt_pte_t scratch_pte,
+				       const bool flush)
 {
-	struct i915_hw_ppgtt *ppgtt =
-		container_of(vm, struct i915_hw_ppgtt, base);
-	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
-	gen8_gtt_pte_t *pt_vaddr, scratch_pte;
+	gen8_gtt_pte_t *pt_vaddr;
 	unsigned pdpe = gen8_pdpe_index(start);
 	unsigned pde = gen8_pde_index(start);
 	unsigned pte = gen8_pte_index(start);
 	unsigned num_entries = length >> PAGE_SHIFT;
 	unsigned last_pte, i;
 
-	scratch_pte = gen8_pte_encode(ppgtt->base.scratch.addr,
-				      I915_CACHE_LLC, use_scratch);
-
 	while (num_entries) {
 		struct i915_pagedir *pd = pdp->pagedirs[pdpe];
@@ -585,7 +581,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 			num_entries--;
 		}
 
-		if (!HAS_LLC(ppgtt->base.dev))
+		if (flush)
 			drm_clflush_virt_range(pt_vaddr, PAGE_SIZE);
 		kunmap_atomic(pt_vaddr);
 
@@ -597,14 +593,25 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 	}
 }
 
-static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
-				      struct sg_table *pages,
-				      uint64_t start,
-				      enum i915_cache_level cache_level, u32 unused)
+static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
+				   uint64_t start,
+				   uint64_t length,
+				   bool use_scratch)
 {
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
+	gen8_gtt_pte_t scratch_pte = gen8_pte_encode(ppgtt->base.scratch.addr,
+						     I915_CACHE_LLC, use_scratch);
+	gen8_ppgtt_clear_pte_range(pdp, start, length, scratch_pte, !HAS_LLC(vm->dev));
+}
+
+static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
+					  struct sg_table *pages,
+					  uint64_t start,
+					  enum i915_cache_level cache_level,
+					  const bool flush)
+{
 	gen8_gtt_pte_t *pt_vaddr;
 	unsigned pdpe = gen8_pdpe_index(start);
 	unsigned pde = gen8_pde_index(start);
@@ -625,7 +632,7 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 			gen8_pte_encode(sg_page_iter_dma_address(&sg_iter),
 					cache_level, true);
 		if (++pte == GEN8_PTES_PER_PT) {
-			if (!HAS_LLC(ppgtt->base.dev))
+			if (flush)
 				drm_clflush_virt_range(pt_vaddr, PAGE_SIZE);
 			kunmap_atomic(pt_vaddr);
 			pt_vaddr = NULL;
@@ -637,12 +644,24 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 		}
 	}
 	if (pt_vaddr) {
-		if (!HAS_LLC(ppgtt->base.dev))
+		if (flush)
 			drm_clflush_virt_range(pt_vaddr, PAGE_SIZE);
 		kunmap_atomic(pt_vaddr);
 	}
 }
 
+static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
+				      struct sg_table *pages,
+				      uint64_t start,
+				      enum i915_cache_level cache_level,
+				      u32 unused)
+{
+	struct i915_hw_ppgtt *ppgtt = container_of(vm, struct i915_hw_ppgtt, base);
+	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
+
+	gen8_ppgtt_insert_pte_entries(pdp, pages, start, cache_level, !HAS_LLC(vm->dev));
+}
+
 static void __gen8_do_map_pt(gen8_ppgtt_pde_t * const pde,
 			     struct i915_pagetab *pt,
 			     struct drm_device *dev)
-- 
2.0.4

* [PATCH 62/68] drm/i915: Plumb sg_iter through va allocation ->maps
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (60 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 61/68] drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 63/68] drm/i915: Introduce map and unmap for VMAs Ben Widawsky
                   ` (10 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

As a step towards implementing 4 levels, while not discarding the
existing PTE map functions, we need to pass the sg_iter through. The
current function only understands things down to page directory
granularity. An object's pages may span page directories, so using the
iterator directly as we write the PTEs allows it to stay coherent
through a VMA mapping operation that spans multiple page table levels.

This looks rather ugly for now, but once we move toward not separately
allocating and updating PTEs, it will make a lot of sense.
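
As a usage sketch (both iterator helpers already exist in
lib/scatterlist.c; this just shows why the caller owns the iterator:
successive calls resume exactly where the previous one stopped):

    struct sg_page_iter sg_iter;

    __sg_page_iter_start(&sg_iter, pages->sgl, sg_nents(pages->sgl), 0);
    /* Writes PTEs for the first page directory's worth of the range... */
    gen8_ppgtt_insert_pte_entries(pdp, &sg_iter, start, cache_level, flush);
    /* ...and a later call picks up at the first page not yet written. */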

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 46 +++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 116d13f..d73f5f5 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -607,7 +607,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 }
 
 static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
-					  struct sg_table *pages,
+					  struct sg_page_iter *sg_iter,
 					  uint64_t start,
 					  enum i915_cache_level cache_level,
 					  const bool flush)
@@ -616,11 +616,10 @@ static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
 	unsigned pdpe = gen8_pdpe_index(start);
 	unsigned pde = gen8_pde_index(start);
 	unsigned pte = gen8_pte_index(start);
-	struct sg_page_iter sg_iter;
 
 	pt_vaddr = NULL;
 
-	for_each_sg_page(pages->sgl, &sg_iter, pages->nents, 0) {
+	while (__sg_page_iter_next(sg_iter)) {
 		if (pt_vaddr == NULL) {
 			struct i915_pagedir *pd = pdp->pagedirs[pdpe];
 			struct i915_pagetab *pt = pd->page_tables[pde];
@@ -629,7 +628,7 @@ static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
 		}
 
 		pt_vaddr[pte] =
-			gen8_pte_encode(sg_page_iter_dma_address(&sg_iter),
+			gen8_pte_encode(sg_page_iter_dma_address(sg_iter),
 					cache_level, true);
 		if (++pte == GEN8_PTES_PER_PT) {
 			if (flush)
@@ -658,8 +657,10 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 {
 	struct i915_hw_ppgtt *ppgtt = container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
+	struct sg_page_iter sg_iter;
 
-	gen8_ppgtt_insert_pte_entries(pdp, pages, start, cache_level, !HAS_LLC(vm->dev));
+	__sg_page_iter_start(&sg_iter, pages->sgl, sg_nents(pages->sgl), 0);
+	gen8_ppgtt_insert_pte_entries(pdp, &sg_iter, start, cache_level, !HAS_LLC(vm->dev));
 }
 
 static void __gen8_do_map_pt(gen8_ppgtt_pde_t * const pde,
@@ -1101,10 +1102,12 @@ err_out:
 	return -ENOMEM;
 }
 
-static int gen8_alloc_va_range_3lvl(struct i915_address_space *vm,
-				    struct i915_pagedirpo *pdp,
-				    uint64_t start,
-				    uint64_t length)
+static int __gen8_alloc_vma_range_3lvl(struct i915_address_space *vm,
+				       struct i915_pagedirpo *pdp,
+				       struct sg_page_iter *sg_iter,
+				       uint64_t start,
+				       uint64_t length,
+				       u32 flags)
 {
 	unsigned long *new_page_dirs, **new_page_tables;
 	struct drm_device *dev = vm->dev;
@@ -1171,7 +1174,11 @@ static int gen8_alloc_va_range_3lvl(struct i915_address_space *vm,
 				   gen8_pte_index(pd_start),
 				   gen8_pte_count(pd_start, pd_len));
 
-			/* Our pde is now pointing to the pagetable, pt */
+			if (sg_iter) {
+				BUG_ON(!sg_iter->__nents);
+				gen8_ppgtt_insert_pte_entries(pdp, sg_iter, pd_start,
+							      flags, !HAS_LLC(vm->dev));
+			}
 			set_bit(pde, pd->used_pdes);
 			pt->zombie = 0;
 		}
@@ -1198,10 +1205,12 @@ err_out:
 	return ret;
 }
 
-static int gen8_alloc_va_range_4lvl(struct i915_address_space *vm,
-				    struct i915_pml4 *pml4,
-				    uint64_t start,
-				    uint64_t length)
+static int __gen8_alloc_vma_range_4lvl(struct i915_address_space *vm,
+				       struct i915_pml4 *pml4,
+				       struct sg_page_iter *sg_iter,
+				       uint64_t start,
+				       uint64_t length,
+				       u32 flags)
 {
 	DECLARE_BITMAP(new_pdps, GEN8_PML4ES_PER_PML4);
 	struct i915_hw_ppgtt *ppgtt =
@@ -1249,7 +1258,8 @@ static int gen8_alloc_va_range_4lvl(struct i915_address_space *vm,
 
 		BUG_ON(!pdp);
 
-		ret = gen8_alloc_va_range_3lvl(vm, pdp, start, length);
+		ret = __gen8_alloc_vma_range_3lvl(vm, pdp, sg_iter,
+						  start, length, flags);
 		if (ret)
 			goto err_out;
 
@@ -1284,9 +1294,11 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 		container_of(vm, struct i915_hw_ppgtt, base);
 
 	if (HAS_48B_PPGTT(vm->dev))
-		return gen8_alloc_va_range_4lvl(vm, &ppgtt->pml4, start, length);
+		return __gen8_alloc_vma_range_4lvl(vm, &ppgtt->pml4, NULL,
+						   start, length, 0);
 	else
-		return gen8_alloc_va_range_3lvl(vm, &ppgtt->pdp, start, length);
+		return __gen8_alloc_vma_range_3lvl(vm, &ppgtt->pdp, NULL,
+						   start, length, 0);
 }
 
 static void gen8_ppgtt_fini_common(struct i915_hw_ppgtt *ppgtt)
-- 
2.0.4

* [PATCH 63/68] drm/i915: Introduce map and unmap for VMAs
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (61 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 62/68] drm/i915: Plumb sg_iter through va allocation ->maps Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 64/68] drm/i915: Depend exclusively on map and unmap_vma Ben Widawsky
                   ` (9 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

It does what the title says by switching over to a single map function
instead of a VA allocation followed by PTE writes. It still keeps the
insert_entries crutch to keep risk as low as possible; it will be gone
soon. To me, this has always been the most sensible way to treat VMAs,
and it's not the first time I've written this patch.
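
Conceptually, the new pair of hooks looks like this (a sketch of the
shape added to i915_address_space in the diff below):

    /* map = allocate any missing page tables and write PTEs in one walk */
    int (*map_vma)(struct i915_vma *vma, enum i915_cache_level cache_level, u32 flags);
    /* unmap = scratch out the PTEs and tear down the now-empty tables */
    void (*unmap_vma)(struct i915_vma *vma);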

XXX: This patch was never tested pre-GEN8. After rebase it was
compile-tested only on GEN8.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 78 +++++++++++++++++++++++++++++++++----
 drivers/gpu/drm/i915/i915_gem_gtt.h |  2 +
 2 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index d73f5f5..3b3f844 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -609,6 +609,7 @@ static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
 static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
 					  struct sg_page_iter *sg_iter,
 					  uint64_t start,
+					  size_t pages,
 					  enum i915_cache_level cache_level,
 					  const bool flush)
 {
@@ -619,7 +620,7 @@ static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
 
 	pt_vaddr = NULL;
 
-	while (__sg_page_iter_next(sg_iter)) {
+	while (pages-- && __sg_page_iter_next(sg_iter)) {
 		if (pt_vaddr == NULL) {
 			struct i915_pagedir *pd = pdp->pagedirs[pdpe];
 			struct i915_pagetab *pt = pd->page_tables[pde];
@@ -660,7 +661,9 @@ static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
 	struct sg_page_iter sg_iter;
 
 	__sg_page_iter_start(&sg_iter, pages->sgl, sg_nents(pages->sgl), 0);
-	gen8_ppgtt_insert_pte_entries(pdp, &sg_iter, start, cache_level, !HAS_LLC(vm->dev));
+	gen8_ppgtt_insert_pte_entries(pdp, &sg_iter, start,
+				      sg_nents(pages->sgl),
+				      cache_level, !HAS_LLC(vm->dev));
 }
 
 static void __gen8_do_map_pt(gen8_ppgtt_pde_t * const pde,
@@ -907,6 +910,11 @@ static void gen8_teardown_va_range(struct i915_address_space *vm,
 	__gen8_teardown_va_range(vm, start, length, false);
 }
 
+static void gen8_unmap_vma(struct i915_vma *vma)
+{
+	gen8_teardown_va_range(vma->vm, vma->node.start, vma->node.size);
+}
+
 static void gen8_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	trace_i915_va_teardown(&ppgtt->base,
@@ -1177,6 +1185,7 @@ static int __gen8_alloc_vma_range_3lvl(struct i915_address_space *vm,
 			if (sg_iter) {
 				BUG_ON(!sg_iter->__nents);
 				gen8_ppgtt_insert_pte_entries(pdp, sg_iter, pd_start,
+							      gen8_pte_count(pd_start, pd_len),
 							      flags, !HAS_LLC(vm->dev));
 			}
 			set_bit(pde, pd->used_pdes);
@@ -1301,6 +1310,28 @@ static int gen8_alloc_va_range(struct i915_address_space *vm,
 						   start, length, 0);
 }
 
+static int gen8_map_vma(struct i915_vma *vma,
+		       enum i915_cache_level cache_level,
+		       u32 unused)
+
+{
+	struct i915_address_space *vm = vma->vm;
+	struct i915_hw_ppgtt *ppgtt =
+		container_of(vm, struct i915_hw_ppgtt, base);
+	struct sg_page_iter sg_iter;
+
+	__sg_page_iter_start(&sg_iter, vma->obj->pages->sgl,
+			     sg_nents(vma->obj->pages->sgl), 0);
+	if (HAS_48B_PPGTT(vm->dev))
+		return __gen8_alloc_vma_range_4lvl(vm, &ppgtt->pml4, &sg_iter,
+						   vma->node.start,
+						   vma->node.size, 0);
+	else
+		return __gen8_alloc_vma_range_3lvl(vm, &ppgtt->pdp, &sg_iter,
+						   vma->node.start,
+						   vma->node.size, 0);
+}
+
 static void gen8_ppgtt_fini_common(struct i915_hw_ppgtt *ppgtt)
 {
 	free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
@@ -1326,7 +1357,6 @@ static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
 	ppgtt->base.start = 0;
 	ppgtt->base.total = size;
 	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
-	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
 	ppgtt->enable = gen8_ppgtt_enable;
 
 	if (HAS_48B_PPGTT(ppgtt->base.dev)) {
@@ -1380,6 +1410,9 @@ static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 
 	ppgtt->base.allocate_va_range = NULL;
 	ppgtt->base.teardown_va_range = NULL;
+	ppgtt->base.map_vma = NULL;
+	ppgtt->base.unmap_vma = NULL;
+	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
 	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
 
 	return 0;
@@ -1395,9 +1428,12 @@ static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
 	if (ret)
 		return ret;
 
-	ppgtt->base.allocate_va_range = gen8_alloc_va_range;
-	ppgtt->base.teardown_va_range = gen8_teardown_va_range;
-	ppgtt->base.clear_range = NULL;
+	ppgtt->base.allocate_va_range = NULL;
+	ppgtt->base.teardown_va_range = NULL;
+	ppgtt->base.map_vma = gen8_map_vma;
+	ppgtt->base.unmap_vma = gen8_unmap_vma;
+	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
+	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
 
 	return 0;
 }
@@ -1857,6 +1893,13 @@ unwind_out:
 	return ret;
 }
 
+static int gen6_map_vma(struct i915_vma *vma,
+			enum i915_cache_level cache_level,
+			u32 flags)
+{
+	return gen6_alloc_va_range(vma->vm, vma->node.start, vma->node.size);
+}
+
 static void gen6_teardown_va_range(struct i915_address_space *vm,
 				   uint64_t start, uint64_t length)
 {
@@ -1890,6 +1933,11 @@ static void gen6_teardown_va_range(struct i915_address_space *vm,
 	}
 }
 
+static void gen6_unmap_vma(struct i915_vma *vma)
+{
+	gen6_teardown_va_range(vma->vm, vma->node.start, vma->node.size);
+}
+
 static void gen6_ppgtt_free(struct i915_hw_ppgtt *ppgtt)
 {
 	struct i915_pagetab *pt;
@@ -2019,8 +2067,13 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt, bool aliasing)
 	if (ret)
 		return ret;
 
-	ppgtt->base.allocate_va_range = gen6_alloc_va_range;
-	ppgtt->base.teardown_va_range = gen6_teardown_va_range;
+	if (aliasing) {
+		ppgtt->base.allocate_va_range = gen6_alloc_va_range;
+		ppgtt->base.teardown_va_range = gen6_teardown_va_range;
+	} else {
+		ppgtt->base.map_vma = gen6_map_vma;
+		ppgtt->base.unmap_vma = gen6_unmap_vma;
+	}
 	ppgtt->base.clear_range = gen6_ppgtt_clear_range;
 	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
@@ -2095,6 +2148,15 @@ ppgtt_bind_vma(struct i915_vma *vma,
 	if (vma->obj->gt_ro)
 		flags |= PTE_READ_ONLY;
 
+	if (vma->vm->map_vma) {
+		trace_i915_va_alloc(vma->vm, vma->node.start, vma->node.size,
+				    VM_TO_TRACE_NAME(vma->vm));
+		ret = vma->vm->map_vma(vma, cache_level, flags);
+		if (!ret)
+			ppgtt_invalidate_tlbs(vma->vm);
+		return ret;
+	}
+
 	if (vma->vm->allocate_va_range) {
 		trace_i915_va_alloc(vma->vm, vma->node.start, vma->node.size,
 				    VM_TO_TRACE_NAME(vma->vm));
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index f4c611e..d2cd9cc 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -295,6 +295,8 @@ struct i915_address_space {
 			       struct sg_table *st,
 			       uint64_t start,
 			       enum i915_cache_level cache_level, u32 flags);
+	int (*map_vma)(struct i915_vma *vma, enum i915_cache_level cache_level, u32 flags);
+	void (*unmap_vma)(struct i915_vma *vma);
 	void (*cleanup)(struct i915_address_space *vm);
 };
 
-- 
2.0.4

* [PATCH 64/68] drm/i915: Depend exclusively on map and unmap_vma
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (62 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 63/68] drm/i915: Introduce map and unmap for VMAs Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 65/68] drm/i915: Expand error state's address width to 64b Ben Widawsky
                   ` (8 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

Drop both insert_entries and clear_range, as well as the
allocate/teardown of the VA range. The former was short-sighted, and the
latter was never meant to be permanent.
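
One side effect worth noting: the GGTT keeps its raw entry-level ops,
but they move from i915_address_space onto struct i915_gtt, so GGTT
callers now dispatch roughly like this (shape taken from the diff
below):

    struct i915_gtt *gtt = &dev_priv->gtt;

    gtt->insert_entries(gtt, obj->pages, vma->node.start, cache_level, flags);
    gtt->clear_range(gtt, start, length, true);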

XXX: Like the previous few patches, this was never tested pre-GEN8, and
not tested individually on GEN8 post-rebase.

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem_gtt.c | 398 +++++++++++++-----------------------
 drivers/gpu/drm/i915/i915_gem_gtt.h |  23 +--
 2 files changed, 147 insertions(+), 274 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 3b3f844..5c23f5b 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -593,19 +593,6 @@ static void gen8_ppgtt_clear_pte_range(struct i915_pagedirpo *pdp,
 	}
 }
 
-static void gen8_ppgtt_clear_range(struct i915_address_space *vm,
-				   uint64_t start,
-				   uint64_t length,
-				   bool use_scratch)
-{
-	struct i915_hw_ppgtt *ppgtt =
-		container_of(vm, struct i915_hw_ppgtt, base);
-	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
-	gen8_gtt_pte_t scratch_pte = gen8_pte_encode(ppgtt->base.scratch.addr,
-						     I915_CACHE_LLC, use_scratch);
-	gen8_ppgtt_clear_pte_range(pdp, start, length, scratch_pte, !HAS_LLC(vm->dev));
-}
-
 static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
 					  struct sg_page_iter *sg_iter,
 					  uint64_t start,
@@ -650,22 +637,6 @@ static void gen8_ppgtt_insert_pte_entries(struct i915_pagedirpo *pdp,
 	}
 }
 
-static void gen8_ppgtt_insert_entries(struct i915_address_space *vm,
-				      struct sg_table *pages,
-				      uint64_t start,
-				      enum i915_cache_level cache_level,
-				      u32 unused)
-{
-	struct i915_hw_ppgtt *ppgtt = container_of(vm, struct i915_hw_ppgtt, base);
-	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
-	struct sg_page_iter sg_iter;
-
-	__sg_page_iter_start(&sg_iter, pages->sgl, sg_nents(pages->sgl), 0);
-	gen8_ppgtt_insert_pte_entries(pdp, &sg_iter, start,
-				      sg_nents(pages->sgl),
-				      cache_level, !HAS_LLC(vm->dev));
-}
-
 static void __gen8_do_map_pt(gen8_ppgtt_pde_t * const pde,
 			     struct i915_pagetab *pt,
 			     struct drm_device *dev)
@@ -732,6 +703,8 @@ static void gen8_map_page_directory_pointer(struct i915_pml4 *pml4,
 	kunmap_atomic(pagemap);
 }
 
+/* Returns true if the PDP(s) have been freed and the caller can
+ * potentially clean up. */
 static bool gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 					struct i915_pagedirpo *pdp,
 					uint64_t start, uint64_t length,
@@ -742,6 +715,8 @@ static bool gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 	struct i915_pagetab *pt;
 	uint64_t temp;
 	uint32_t pdpe, pde, orig_start = start;
+	gen8_gtt_pte_t scratch = gen8_pte_encode(vm->scratch.addr,
+						 I915_CACHE_LLC, true);
 
 	BUG_ON(!pdp);
 
@@ -826,9 +801,10 @@ static bool gen8_teardown_va_range_3lvl(struct i915_address_space *vm,
 							     GEN8_PDE_SHIFT);
 				pd->page_tables[pde] = NULL;
 			}
+
+			gen8_ppgtt_clear_pte_range(pdp, pd_start, pd_len, scratch, !HAS_LLC(vm->dev));
 		}
 
-		gen8_ppgtt_clear_range(vm, pd_start, pd_len, true);
 
 		if (bitmap_empty(pd->used_pdes, I915_PDES_PER_PD)) {
 			WARN_ON(!test_and_clear_bit(pdpe, pdp->used_pdpes));
@@ -868,6 +844,7 @@ static void gen8_teardown_va_range_4lvl(struct i915_address_space *vm,
 	struct i915_pagedirpo *pdp;
 	uint64_t temp, pml4e;
 
+	BUG_ON(I915_PDPES_PER_PDP(vm->dev) != 512);
 	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e) {
 		if (!pdp)
 			continue;
@@ -1110,14 +1087,15 @@ err_out:
 	return -ENOMEM;
 }
 
-static int __gen8_alloc_vma_range_3lvl(struct i915_address_space *vm,
-				       struct i915_pagedirpo *pdp,
+static int __gen8_alloc_vma_range_3lvl(struct i915_pagedirpo *pdp,
+				       struct i915_vma *vma,
 				       struct sg_page_iter *sg_iter,
 				       uint64_t start,
 				       uint64_t length,
 				       u32 flags)
 {
 	unsigned long *new_page_dirs, **new_page_tables;
+	struct i915_address_space *vm = vma->vm;
 	struct drm_device *dev = vm->dev;
 	struct i915_pagedir *pd;
 	const uint64_t orig_start = start;
@@ -1127,6 +1105,8 @@ static int __gen8_alloc_vma_range_3lvl(struct i915_address_space *vm,
 	size_t pdpes = I915_PDPES_PER_PDP(dev);
 	int ret;
 
+	BUG_ON(!sg_iter->sg);
+
 #ifndef CONFIG_64BIT
 	/* Disallow 64b address on 32b platforms. Nothing is wrong with doing
 	 * this in hardware, but a lot of the drm code is not prepared to handle
@@ -1176,18 +1156,16 @@ static int __gen8_alloc_vma_range_3lvl(struct i915_address_space *vm,
 			BUG_ON(!pt);
 			BUG_ON(!pd_len);
 			BUG_ON(!gen8_pte_count(pd_start, pd_len));
+			BUG_ON(!sg_iter->__nents);
 
 			/* Set our used ptes within the page table */
 			bitmap_set(pt->used_ptes,
 				   gen8_pte_index(pd_start),
 				   gen8_pte_count(pd_start, pd_len));
 
-			if (sg_iter) {
-				BUG_ON(!sg_iter->__nents);
-				gen8_ppgtt_insert_pte_entries(pdp, sg_iter, pd_start,
-							      gen8_pte_count(pd_start, pd_len),
-							      flags, !HAS_LLC(vm->dev));
-			}
+			gen8_ppgtt_insert_pte_entries(pdp, sg_iter, pd_start,
+						      gen8_pte_count(pd_start, pd_len),
+						      flags, !HAS_LLC(vm->dev));
 			set_bit(pde, pd->used_pdes);
 			pt->zombie = 0;
 		}
@@ -1214,20 +1192,21 @@ err_out:
 	return ret;
 }
 
-static int __gen8_alloc_vma_range_4lvl(struct i915_address_space *vm,
-				       struct i915_pml4 *pml4,
+static int __gen8_alloc_vma_range_4lvl(struct i915_pml4 *pml4,
+				       struct i915_vma *vma,
 				       struct sg_page_iter *sg_iter,
-				       uint64_t start,
-				       uint64_t length,
 				       u32 flags)
 {
 	DECLARE_BITMAP(new_pdps, GEN8_PML4ES_PER_PML4);
+	struct i915_address_space *vm = vma->vm;
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagedirpo *pdp;
+	uint64_t start = vma->node.start, length = vma->node.size;
 	const uint64_t orig_start = start;
 	const uint64_t orig_length = length;
 	uint64_t temp, pml4e;
+	int ret;
 
 	/* Do the pml4 allocations first, so we don't need to track the newly
 	 * allocated tables below the pdp */
@@ -1263,11 +1242,9 @@ static int __gen8_alloc_vma_range_4lvl(struct i915_address_space *vm,
 	length = orig_length;
 
 	gen8_for_each_pml4e(pdp, pml4, start, length, temp, pml4e) {
-		int ret;
-
 		BUG_ON(!pdp);
 
-		ret = __gen8_alloc_vma_range_3lvl(vm, pdp, sg_iter,
+		ret = __gen8_alloc_vma_range_3lvl(pdp, vma, sg_iter,
 						  start, length, flags);
 		if (ret)
 			goto err_out;
@@ -1294,146 +1271,80 @@ err_out:
 err_alloc:
 	for_each_set_bit(pml4e, new_pdps, GEN8_PML4ES_PER_PML4)
 		free_pdp_single(pdp, vm->dev);
-}
-
-static int gen8_alloc_va_range(struct i915_address_space *vm,
-			       uint64_t start, uint64_t length)
-{
-	struct i915_hw_ppgtt *ppgtt =
-		container_of(vm, struct i915_hw_ppgtt, base);
 
-	if (HAS_48B_PPGTT(vm->dev))
-		return __gen8_alloc_vma_range_4lvl(vm, &ppgtt->pml4, NULL,
-						   start, length, 0);
-	else
-		return __gen8_alloc_vma_range_3lvl(vm, &ppgtt->pdp, NULL,
-						   start, length, 0);
+	return ret;
 }
 
 static int gen8_map_vma(struct i915_vma *vma,
 		       enum i915_cache_level cache_level,
-		       u32 unused)
-
+		       u32 flags)
 {
 	struct i915_address_space *vm = vma->vm;
 	struct i915_hw_ppgtt *ppgtt =
 		container_of(vm, struct i915_hw_ppgtt, base);
 	struct sg_page_iter sg_iter;
 
-	__sg_page_iter_start(&sg_iter, vma->obj->pages->sgl,
-			     sg_nents(vma->obj->pages->sgl), 0);
-	if (HAS_48B_PPGTT(vm->dev))
-		return __gen8_alloc_vma_range_4lvl(vm, &ppgtt->pml4, &sg_iter,
-						   vma->node.start,
-						   vma->node.size, 0);
+	__sg_page_iter_start(&sg_iter, vma->obj->pages->sgl, sg_nents(vma->obj->pages->sgl), 0);
+	if (HAS_48B_PPGTT(vma->vm->dev))
+		return __gen8_alloc_vma_range_4lvl(&ppgtt->pml4, vma, &sg_iter, flags);
 	else
-		return __gen8_alloc_vma_range_3lvl(vm, &ppgtt->pdp, &sg_iter,
+		return __gen8_alloc_vma_range_3lvl(&ppgtt->pdp, vma, &sg_iter,
 						   vma->node.start,
-						   vma->node.size, 0);
+						   vma->node.size,
+						   flags);
 }
 
-static void gen8_ppgtt_fini_common(struct i915_hw_ppgtt *ppgtt)
-{
-	free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
-	if (HAS_48B_PPGTT(ppgtt->base.dev))
-		pml4_fini(&ppgtt->pml4);
-	else
-		free_pdp_single(&ppgtt->pdp, ppgtt->base.dev);
-}
-
-/**
- * GEN8 legacy ppgtt programming is accomplished through a max 4 PDP registers
- * with a net effect resembling a 2-level page table in normal x86 terms. Each
- * PDP represents 1GB of memory 4 * 512 * 512 * 4096 = 4GB legacy 32b address
- * space.
- *
- */
-static int gen8_ppgtt_init_common(struct i915_hw_ppgtt *ppgtt, uint64_t size)
+static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt, bool aliasing)
 {
 	ppgtt->scratch_pd = alloc_pt_scratch(ppgtt->base.dev);
 	if (IS_ERR(ppgtt->scratch_pd))
 		return PTR_ERR(ppgtt->scratch_pd);
 
 	ppgtt->base.start = 0;
-	ppgtt->base.total = size;
-	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
-	ppgtt->enable = gen8_ppgtt_enable;
 
+	/* In the case of 3 levels we need a page directory scratch page
+	 * (each PDP entry can point to that). 4 levels requires a single
+	 * PML4 page. In the struct definition, they are all the same anyway...
+	 */
 	if (HAS_48B_PPGTT(ppgtt->base.dev)) {
 		int ret = pml4_init(ppgtt);
 		if (ret) {
-			free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
+			free_pt_scratch(ppgtt->scratch_pml4, ppgtt->base.dev);
 			return ret;
 		}
-
+		ppgtt->base.total = (1ULL<<48);
 		ppgtt->switch_mm = gen8_48b_mm_switch;
+		/* NB: Aliasing PPGTT always aliases the GGTT which has a max of
+		 * 4GB. However, if we used 3 level page tables instead, it
+		 * would require changing the GFX_MODE register on a switch
+		 * which adds complexity. Instead we can use a 4 level table,
+		 * and only populate the low 4GB. */
+		if (aliasing) {
+			/* Make it 32b even if GGTT is less to get all 4 PDPs */
+			ppgtt->base.total = (1ULL<<32);
+		}
 	} else {
+		/* PDP doesn't need an actual page */
 		int ret = __pdp_init(&ppgtt->pdp, false);
 		if (ret) {
 			free_pt_scratch(ppgtt->scratch_pd, ppgtt->base.dev);
 			return ret;
 		}
-
 		ppgtt->switch_mm = gen8_legacy_mm_switch;
-		trace_i915_pagedirpo_alloc(&ppgtt->base, 0, 0, GEN8_PML4E_SHIFT);
+		ppgtt->base.total = (1ULL<<32);
 	}
 
-	return 0;
-}
-
-static int gen8_aliasing_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
-{
-	struct drm_device *dev = ppgtt->base.dev;
-	struct drm_i915_private *dev_priv = dev->dev_private;
-	struct i915_pagedirpo *pdp = &ppgtt->pdp; /* FIXME: 48b */
-	struct i915_pagedir *pd;
-	uint64_t temp, start = 0, size = dev_priv->gtt.base.total;
-	uint32_t pdpe;
-	int ret;
-
-	ret = gen8_ppgtt_init_common(ppgtt, size);
-	if (ret)
-		return ret;
-
-	/* Aliasing PPGTT has to always work and be mapped because of the way we
-	 * use RESTORE_INHIBIT in the context switch. This will be fixed
-	 * eventually. */
-	ret = gen8_alloc_va_range(&ppgtt->base, start, size);
-	if (ret) {
-		gen8_ppgtt_fini_common(ppgtt);
-		return ret;
+	if (aliasing) {
+		struct i915_address_space *vm = &ppgtt->base;
+		struct drm_i915_private *dev_priv = to_i915(vm->dev);
+		ppgtt->base.total = dev_priv->gtt.base.total;
+		WARN_ON(dev_priv->gtt.base.start != 0);
 	}
 
-	/* FIXME: PML4 */
-	gen8_for_each_pdpe(pd, pdp, start, size, temp, pdpe)
-		gen8_map_pagetable_range(&ppgtt->base, pd,start, size);
-
-	ppgtt->base.allocate_va_range = NULL;
-	ppgtt->base.teardown_va_range = NULL;
-	ppgtt->base.map_vma = NULL;
-	ppgtt->base.unmap_vma = NULL;
-	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
-	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
-
-	return 0;
-}
-
-static int gen8_ppgtt_init(struct i915_hw_ppgtt *ppgtt)
-{
-	struct drm_device *dev = ppgtt->base.dev;
-	struct drm_i915_private *dev_priv = dev->dev_private;
-	int ret;
-
-	ret = gen8_ppgtt_init_common(ppgtt, dev_priv->gtt.base.total);
-	if (ret)
-		return ret;
-
-	ppgtt->base.allocate_va_range = NULL;
-	ppgtt->base.teardown_va_range = NULL;
+	ppgtt->base.cleanup = gen8_ppgtt_cleanup;
 	ppgtt->base.map_vma = gen8_map_vma;
 	ppgtt->base.unmap_vma = gen8_unmap_vma;
-	ppgtt->base.insert_entries = gen8_ppgtt_insert_entries;
-	ppgtt->base.clear_range = gen8_ppgtt_clear_range;
+	ppgtt->enable = gen8_ppgtt_enable;
 
 	return 0;
 }
@@ -1804,8 +1715,10 @@ static void gen6_ppgtt_insert_entries(struct i915_address_space *vm,
 		kunmap_atomic(pt_vaddr);
 }
 
-static int gen6_alloc_va_range(struct i915_address_space *vm,
-			       uint64_t start, uint64_t length)
+static int _gen6_map_vma(struct i915_address_space *vm,
+			 struct i915_vma *vma,
+			 enum i915_cache_level cache_level,
+			 u32 flags)
 {
 	DECLARE_BITMAP(new_page_tables, I915_PDES_PER_PD);
 	struct drm_device *dev = vm->dev;
@@ -1813,6 +1726,7 @@ static int gen6_alloc_va_range(struct i915_address_space *vm,
 	struct i915_hw_ppgtt *ppgtt =
 		        container_of(vm, struct i915_hw_ppgtt, base);
 	struct i915_pagetab *pt;
+	uint32_t start = vma->node.start, length = vma->node.size;
 	const uint32_t start_save = start, length_save = length;
 	uint32_t pde, temp;
 	int ret;
@@ -1882,6 +1796,9 @@ static int gen6_alloc_va_range(struct i915_address_space *vm,
 	 * table. Also require for WC mapped PTEs */
 	readl(dev_priv->gtt.gsm);
 
+	gen6_ppgtt_insert_entries(vm, vma->obj->pages, vma->node.start,
+				  cache_level, flags);
+
 	return 0;
 
 unwind_out:
@@ -1897,14 +1814,16 @@ static int gen6_map_vma(struct i915_vma *vma,
 			enum i915_cache_level cache_level,
 			u32 flags)
 {
-	return gen6_alloc_va_range(vma->vm, vma->node.start, vma->node.size);
+	return _gen6_map_vma(vma->vm, vma, cache_level, flags);
 }
 
 static void gen6_teardown_va_range(struct i915_address_space *vm,
 				   uint64_t start, uint64_t length)
+
 {
 	struct i915_hw_ppgtt *ppgtt =
 		        container_of(vm, struct i915_hw_ppgtt, base);
+	const uint32_t orig_start = start, orig_length = length;
 	struct i915_pagetab *pt;
 	uint32_t pde, temp;
 
@@ -1931,6 +1850,8 @@ static void gen6_teardown_va_range(struct i915_address_space *vm,
 			ppgtt->pd.page_tables[pde] = ppgtt->scratch_pt;
 		}
 	}
+
+	gen6_ppgtt_clear_range(vm, orig_start, orig_length, true);
 }
 
 static void gen6_unmap_vma(struct i915_vma *vma)
@@ -2067,15 +1988,8 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt, bool aliasing)
 	if (ret)
 		return ret;
 
-	if (aliasing) {
-		ppgtt->base.allocate_va_range = gen6_alloc_va_range;
-		ppgtt->base.teardown_va_range = gen6_teardown_va_range;
-	} else {
-		ppgtt->base.map_vma = gen6_map_vma;
-		ppgtt->base.unmap_vma = gen6_unmap_vma;
-	}
-	ppgtt->base.clear_range = gen6_ppgtt_clear_range;
-	ppgtt->base.insert_entries = gen6_ppgtt_insert_entries;
+	ppgtt->base.map_vma = gen6_map_vma;
+	ppgtt->base.unmap_vma = gen6_unmap_vma;
 	ppgtt->base.cleanup = gen6_ppgtt_cleanup;
 	ppgtt->base.start = 0;
 	ppgtt->base.total = I915_PDES_PER_PD * GEN6_PTES_PER_PT * PAGE_SIZE;
@@ -2087,8 +2001,7 @@ static int gen6_ppgtt_init(struct i915_hw_ppgtt *ppgtt, bool aliasing)
 	ppgtt->pd_addr = (gen6_gtt_pte_t __iomem*)dev_priv->gtt.gsm +
 		ppgtt->pd.pd_offset / sizeof(gen6_gtt_pte_t);
 
-	if (!aliasing)
-		gen6_scratch_va_range(ppgtt, 0, ppgtt->base.total);
+	gen6_scratch_va_range(ppgtt, 0, ppgtt->base.total);
 
 	gen6_map_page_range(dev_priv, &ppgtt->pd, 0, ppgtt->base.total);
 
@@ -2109,20 +2022,20 @@ int i915_gem_init_ppgtt(struct drm_device *dev, struct i915_hw_ppgtt *ppgtt, boo
 
 	if (INTEL_INFO(dev)->gen < 8)
 		ret = gen6_ppgtt_init(ppgtt, aliasing);
-	else if (IS_GEN8(dev) && aliasing)
-		ret = gen8_aliasing_ppgtt_init(ppgtt);
 	else if (IS_GEN8(dev))
-		ret = gen8_ppgtt_init(ppgtt);
+		ret = gen8_ppgtt_init(ppgtt, aliasing);
 	else
 		BUG();
 
 	if (ret)
 		return ret;
 
+	BUG_ON(ppgtt->base.total < dev_priv->gtt.base.total && aliasing);
+	if (aliasing)
+		ppgtt->base.total = dev_priv->gtt.base.total;
+
 	kref_init(&ppgtt->ref);
 	drm_mm_init(&ppgtt->base.mm, ppgtt->base.start, ppgtt->base.total);
-	if (ppgtt->base.clear_range)
-		ppgtt->base.clear_range(&ppgtt->base, 0, ppgtt->base.total, true);
 	i915_init_vm(dev_priv, &ppgtt->base);
 
 	return 0;
@@ -2144,56 +2057,26 @@ ppgtt_bind_vma(struct i915_vma *vma,
 {
 	int ret;
 
+	BUG_ON(!vma->vm->map_vma);
+
 	/* Currently applicable only to VLV */
 	if (vma->obj->gt_ro)
 		flags |= PTE_READ_ONLY;
 
-	if (vma->vm->map_vma) {
-		trace_i915_va_alloc(vma->vm, vma->node.start, vma->node.size,
-				    VM_TO_TRACE_NAME(vma->vm));
-		ret = vma->vm->map_vma(vma, cache_level, flags);
-		if (!ret)
-			ppgtt_invalidate_tlbs(vma->vm);
-		return ret;
-	}
-
-	if (vma->vm->allocate_va_range) {
-		trace_i915_va_alloc(vma->vm, vma->node.start, vma->node.size,
-				    VM_TO_TRACE_NAME(vma->vm));
-		ret = vma->vm->allocate_va_range(vma->vm,
-						 vma->node.start,
-						 vma->node.size);
-		if (ret)
-			return ret;
-
+	trace_i915_va_alloc(vma->vm, vma->node.start, vma->node.size,
+			    VM_TO_TRACE_NAME(vma->vm));
+	ret = vma->vm->map_vma(vma, cache_level, flags);
+	if (!ret)
 		ppgtt_invalidate_tlbs(vma->vm);
-	}
-
-	vma->vm->insert_entries(vma->vm, vma->obj->pages, vma->node.start,
-				cache_level, flags);
-
-	return 0;
+	return ret;
 }
 
 static void ppgtt_unbind_vma(struct i915_vma *vma)
 {
-	WARN_ON(vma->vm->teardown_va_range && vma->vm->clear_range);
-	if (vma->vm->teardown_va_range) {
-		trace_i915_va_teardown(vma->vm,
-				       vma->node.start, vma->node.size,
-				       VM_TO_TRACE_NAME(vma->vm));
-
-		vma->vm->teardown_va_range(vma->vm,
-					   vma->node.start, vma->node.size);
-		ppgtt_invalidate_tlbs(vma->vm);
-	} else if (vma->vm->clear_range) {
-		vma->vm->clear_range(vma->vm,
-				     vma->node.start,
-				     vma->obj->base.size,
-				     true);
-	} else
-		BUG();
-
+	trace_i915_va_teardown(vma->vm, vma->node.start, vma->node.size,
+			       VM_TO_TRACE_NAME(vma->vm));
+	vma->vm->unmap_vma(vma);
+	ppgtt_invalidate_tlbs(vma->vm);
 }
 
 extern int intel_iommu_gfx_mapped;
@@ -2275,10 +2158,10 @@ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
 
 	i915_check_and_clear_faults(dev);
 
-	dev_priv->gtt.base.clear_range(&dev_priv->gtt.base,
-				       dev_priv->gtt.base.start,
-				       dev_priv->gtt.base.total,
-				       true);
+	dev_priv->gtt.clear_range(&dev_priv->gtt,
+				  dev_priv->gtt.base.start,
+				  dev_priv->gtt.base.total,
+				  true);
 }
 
 void i915_gem_restore_gtt_mappings(struct drm_device *dev)
@@ -2290,10 +2173,10 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
 	i915_check_and_clear_faults(dev);
 
 	/* First fill our portion of the GTT with scratch pages */
-	dev_priv->gtt.base.clear_range(&dev_priv->gtt.base,
-				       dev_priv->gtt.base.start,
-				       dev_priv->gtt.base.total,
-				       true);
+	dev_priv->gtt.clear_range(&dev_priv->gtt,
+				  dev_priv->gtt.base.start,
+				  dev_priv->gtt.base.total,
+				  true);
 
 	list_for_each_entry(obj, &dev_priv->mm.bound_list, global_list) {
 		struct i915_vma *vma = i915_gem_obj_to_vma(obj,
@@ -2366,15 +2249,16 @@ static inline void gen8_set_pte(void __iomem *addr, gen8_gtt_pte_t pte)
 #endif
 }
 
-static void gen8_ggtt_insert_entries(struct i915_address_space *vm,
+static void gen8_ggtt_insert_entries(struct i915_gtt *gtt,
 				     struct sg_table *st,
 				     uint64_t start,
 				     enum i915_cache_level level, u32 unused)
 {
-	struct drm_i915_private *dev_priv = vm->dev->dev_private;
+	struct drm_i915_private *dev_priv =
+		container_of(gtt, struct drm_i915_private, gtt);
 	unsigned first_entry = start >> PAGE_SHIFT;
 	gen8_gtt_pte_t __iomem *gtt_entries =
-		(gen8_gtt_pte_t __iomem *)dev_priv->gtt.gsm + first_entry;
+		(gen8_gtt_pte_t __iomem *)gtt->gsm + first_entry;
 	int i = 0;
 	struct sg_page_iter sg_iter;
 	dma_addr_t addr = 0; /* shut up gcc */
@@ -2412,22 +2296,23 @@ static void gen8_ggtt_insert_entries(struct i915_address_space *vm,
  * within the global GTT as well as accessible by the GPU through the GMADR
  * mapped BAR (dev_priv->mm.gtt->gtt).
  */
-static void gen6_ggtt_insert_entries(struct i915_address_space *vm,
+static void gen6_ggtt_insert_entries(struct i915_gtt *gtt,
 				     struct sg_table *st,
 				     uint64_t start,
 				     enum i915_cache_level level, u32 flags)
 {
-	struct drm_i915_private *dev_priv = vm->dev->dev_private;
+	struct drm_i915_private *dev_priv =
+		container_of(gtt, struct drm_i915_private, gtt);
 	unsigned first_entry = start >> PAGE_SHIFT;
 	gen6_gtt_pte_t __iomem *gtt_entries =
-		(gen6_gtt_pte_t __iomem *)dev_priv->gtt.gsm + first_entry;
+		(gen6_gtt_pte_t __iomem *)gtt->gsm + first_entry;
 	int i = 0;
 	struct sg_page_iter sg_iter;
 	dma_addr_t addr = 0;
 
 	for_each_sg_page(st->sgl, &sg_iter, st->nents, 0) {
 		addr = sg_page_iter_dma_address(&sg_iter);
-		iowrite32(vm->pte_encode(addr, level, true, flags), &gtt_entries[i]);
+		iowrite32(gtt->base.pte_encode(addr, level, true, flags), &gtt_entries[i]);
 		i++;
 	}
 
@@ -2438,8 +2323,8 @@ static void gen6_ggtt_insert_entries(struct i915_address_space *vm,
 	 * hardware should work, we must keep this posting read for paranoia.
 	 */
 	if (i != 0) {
-		unsigned long gtt = readl(&gtt_entries[i-1]);
-		WARN_ON(gtt != vm->pte_encode(addr, level, true, flags));
+		unsigned long pte = readl(&gtt_entries[i-1]);
+		WARN_ON(pte != gtt->base.pte_encode(addr, level, true, flags));
 	}
 
 	/* This next bit makes the above posting read even more important. We
@@ -2450,17 +2335,16 @@ static void gen6_ggtt_insert_entries(struct i915_address_space *vm,
 	POSTING_READ(GFX_FLSH_CNTL_GEN6);
 }
 
-static void gen8_ggtt_clear_range(struct i915_address_space *vm,
+static void gen8_ggtt_clear_range(struct i915_gtt *gtt,
 				  uint64_t start,
 				  uint64_t length,
 				  bool use_scratch)
 {
-	struct drm_i915_private *dev_priv = vm->dev->dev_private;
 	unsigned first_entry = start >> PAGE_SHIFT;
 	unsigned num_entries = length >> PAGE_SHIFT;
 	gen8_gtt_pte_t scratch_pte, __iomem *gtt_base =
-		(gen8_gtt_pte_t __iomem *) dev_priv->gtt.gsm + first_entry;
-	const int max_entries = gtt_total_entries(&dev_priv->gtt) - first_entry;
+		(gen8_gtt_pte_t __iomem *) gtt->gsm + first_entry;
+	const int max_entries = gtt_total_entries(gtt) - first_entry;
 	int i;
 
 	if (WARN(num_entries > max_entries,
@@ -2468,7 +2352,7 @@ static void gen8_ggtt_clear_range(struct i915_address_space *vm,
 		 first_entry, num_entries, max_entries))
 		num_entries = max_entries;
 
-	scratch_pte = gen8_pte_encode(vm->scratch.addr,
+	scratch_pte = gen8_pte_encode(gtt->base.scratch.addr,
 				      I915_CACHE_LLC,
 				      use_scratch);
 	for (i = 0; i < num_entries; i++)
@@ -2509,17 +2393,16 @@ void gen8_for_every_pdpe_pde(struct i915_hw_ppgtt *ppgtt,
 	}
 }
 
-static void gen6_ggtt_clear_range(struct i915_address_space *vm,
+static void gen6_ggtt_clear_range(struct i915_gtt *gtt,
 				  uint64_t start,
 				  uint64_t length,
 				  bool use_scratch)
 {
-	struct drm_i915_private *dev_priv = vm->dev->dev_private;
 	unsigned first_entry = start >> PAGE_SHIFT;
 	unsigned num_entries = length >> PAGE_SHIFT;
 	gen6_gtt_pte_t scratch_pte, __iomem *gtt_base =
-		(gen6_gtt_pte_t __iomem *) dev_priv->gtt.gsm + first_entry;
-	const int max_entries = gtt_total_entries(&dev_priv->gtt) - first_entry;
+		(gen6_gtt_pte_t __iomem *) gtt->gsm + first_entry;
+	const int max_entries = gtt_total_entries(gtt) - first_entry;
 	int i;
 
 	if (WARN(num_entries > max_entries,
@@ -2527,7 +2410,8 @@ static void gen6_ggtt_clear_range(struct i915_address_space *vm,
 		 first_entry, num_entries, max_entries))
 		num_entries = max_entries;
 
-	scratch_pte = vm->pte_encode(vm->scratch.addr, I915_CACHE_LLC, use_scratch, 0);
+	scratch_pte = gtt->base.pte_encode(gtt->base.scratch.addr,
+					   I915_CACHE_LLC, use_scratch, 0);
 
 	for (i = 0; i < num_entries; i++)
 		iowrite32(scratch_pte, &gtt_base[i]);
@@ -2550,7 +2434,7 @@ static int i915_ggtt_bind_vma(struct i915_vma *vma,
 	return 0;
 }
 
-static void i915_ggtt_clear_range(struct i915_address_space *vm,
+static void i915_ggtt_clear_range(struct i915_gtt *gunused,
 				  uint64_t start,
 				  uint64_t length,
 				  bool unused)
@@ -2596,9 +2480,10 @@ static int ggtt_bind_vma(struct i915_vma *vma,
 	if (!dev_priv->mm.aliasing_ppgtt || flags & GLOBAL_BIND) {
 		if (!obj->has_global_gtt_mapping ||
 		    (cache_level != obj->cache_level)) {
-			vma->vm->insert_entries(vma->vm, obj->pages,
-						vma->node.start,
-						cache_level, flags);
+			struct i915_gtt *gtt = &dev_priv->gtt;
+			gtt->insert_entries(gtt, obj->pages,
+					    vma->node.start,
+					    cache_level, flags);
 			obj->has_global_gtt_mapping = 1;
 		}
 	}
@@ -2609,11 +2494,7 @@ static int ggtt_bind_vma(struct i915_vma *vma,
 	if (dev_priv->mm.aliasing_ppgtt &&
 	    (!obj->has_aliasing_ppgtt_mapping ||
 	     (cache_level != obj->cache_level))) {
-		struct i915_hw_ppgtt *appgtt = dev_priv->mm.aliasing_ppgtt;
-		appgtt->base.insert_entries(&appgtt->base,
-					    vma->obj->pages,
-					    vma->node.start,
-					    cache_level, flags);
+		ppgtt_bind_vma(vma, cache_level, flags);
 		vma->obj->has_aliasing_ppgtt_mapping = 1;
 	}
 
@@ -2622,24 +2503,20 @@ static int ggtt_bind_vma(struct i915_vma *vma,
 
 static void ggtt_unbind_vma(struct i915_vma *vma)
 {
-	struct drm_device *dev = vma->vm->dev;
-	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *obj = vma->obj;
+	struct drm_device *dev = obj->base.dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct i915_gtt *gtt = &dev_priv->gtt;
+
+	BUG_ON(vma->vm != &gtt->base);
 
 	if (obj->has_global_gtt_mapping) {
-		vma->vm->clear_range(vma->vm,
-				     vma->node.start,
-				     obj->base.size,
-				     true);
+		gtt->clear_range(gtt, vma->node.start, obj->base.size, true);
 		obj->has_global_gtt_mapping = 0;
 	}
 
 	if (obj->has_aliasing_ppgtt_mapping) {
-		struct i915_hw_ppgtt *appgtt = dev_priv->mm.aliasing_ppgtt;
-		appgtt->base.clear_range(&appgtt->base,
-					 vma->node.start,
-					 obj->base.size,
-					 true);
+		ppgtt_unbind_vma(vma);
 		obj->has_aliasing_ppgtt_mapping = 0;
 	}
 }
@@ -2692,7 +2569,8 @@ void i915_gem_setup_global_gtt(struct drm_device *dev,
 	 * of the aperture.
 	 */
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	struct i915_address_space *ggtt_vm = &dev_priv->gtt.base;
+	struct i915_gtt *gtt = &dev_priv->gtt;
+	struct i915_address_space *ggtt_vm = &gtt->base;
 	struct drm_mm_node *entry;
 	struct drm_i915_gem_object *obj;
 	unsigned long hole_start, hole_end;
@@ -2725,12 +2603,12 @@ void i915_gem_setup_global_gtt(struct drm_device *dev,
 	drm_mm_for_each_hole(entry, &ggtt_vm->mm, hole_start, hole_end) {
 		DRM_DEBUG_KMS("clearing unused GTT space: [%lx, %lx]\n",
 			      hole_start, hole_end);
-		ggtt_vm->clear_range(ggtt_vm, hole_start,
-				     hole_end - hole_start, true);
+		gtt->clear_range(gtt, hole_start,
+				 hole_end - hole_start, true);
 	}
 
 	/* And finally clear the reserved guard page */
-	ggtt_vm->clear_range(ggtt_vm, end - PAGE_SIZE, PAGE_SIZE, true);
+	gtt->clear_range(gtt, end - PAGE_SIZE, PAGE_SIZE, true);
 }
 
 void i915_gem_init_global_gtt(struct drm_device *dev)
@@ -2961,8 +2839,8 @@ static int gen8_gmch_probe(struct drm_device *dev,
 
 	ret = ggtt_probe_common(dev, gtt_size);
 
-	dev_priv->gtt.base.clear_range = gen8_ggtt_clear_range;
-	dev_priv->gtt.base.insert_entries = gen8_ggtt_insert_entries;
+	dev_priv->gtt.clear_range = gen8_ggtt_clear_range;
+	dev_priv->gtt.insert_entries = gen8_ggtt_insert_entries;
 
 	return ret;
 }
@@ -3001,8 +2879,8 @@ static int gen6_gmch_probe(struct drm_device *dev,
 
 	ret = ggtt_probe_common(dev, gtt_size);
 
-	dev_priv->gtt.base.clear_range = gen6_ggtt_clear_range;
-	dev_priv->gtt.base.insert_entries = gen6_ggtt_insert_entries;
+	dev_priv->gtt.clear_range = gen6_ggtt_clear_range;
+	dev_priv->gtt.insert_entries = gen6_ggtt_insert_entries;
 
 	return ret;
 }
@@ -3038,7 +2916,7 @@ static int i915_gmch_probe(struct drm_device *dev,
 	intel_gtt_get(gtt_total, stolen, mappable_base, mappable_end);
 
 	dev_priv->gtt.do_idle_maps = needs_idle_maps(dev_priv->dev);
-	dev_priv->gtt.base.clear_range = i915_ggtt_clear_range;
+	dev_priv->gtt.clear_range = i915_ggtt_clear_range;
 
 	if (unlikely(dev_priv->gtt.do_idle_maps))
 		DRM_INFO("applying Ironlake quirks for intel_iommu\n");
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index d2cd9cc..92acd95 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -281,20 +281,6 @@ struct i915_address_space {
 	gen6_gtt_pte_t (*pte_encode)(dma_addr_t addr,
 				     enum i915_cache_level level,
 				     bool valid, u32 flags); /* Create a valid PTE */
-	int (*allocate_va_range)(struct i915_address_space *vm,
-				 uint64_t start,
-				 uint64_t length);
-	void (*teardown_va_range)(struct i915_address_space *vm,
-				  uint64_t start,
-				  uint64_t length);
-	void (*clear_range)(struct i915_address_space *vm,
-			    uint64_t start,
-			    uint64_t length,
-			    bool use_scratch);
-	void (*insert_entries)(struct i915_address_space *vm,
-			       struct sg_table *st,
-			       uint64_t start,
-			       enum i915_cache_level cache_level, u32 flags);
 	int (*map_vma)(struct i915_vma *vma, enum i915_cache_level cache_level, u32 flags);
 	void (*unmap_vma)(struct i915_vma *vma);
 	void (*cleanup)(struct i915_address_space *vm);
@@ -313,6 +299,7 @@ struct i915_hw_ppgtt {
 	union {
 		struct i915_pagetab *scratch_pt;
 		struct i915_pagetab *scratch_pd; /* Just need the daddr */
+		struct i915_pagetab *scratch_pml4;
 	};
 
 	struct intel_context *ctx;
@@ -352,6 +339,14 @@ struct i915_gtt {
 	int (*gtt_probe)(struct drm_device *dev, size_t *gtt_total,
 			  size_t *stolen, phys_addr_t *mappable_base,
 			  unsigned long *mappable_end);
+	void (*insert_entries)(struct i915_gtt *gtt,
+			       struct sg_table *st,
+			       uint64_t start,
+			       enum i915_cache_level cache_level, u32 flags);
+	void (*clear_range)(struct i915_gtt *gtt,
+			    uint64_t start,
+			    uint64_t length,
+			    bool use_scratch);
 };
 
 /* For each pde iterates over every pde between from start until start + length.
-- 
2.0.4

* [PATCH 65/68] drm/i915: Expand error state's address width to 64b
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (63 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 64/68] drm/i915: Depend exclusively on map and unmap_vma Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 66/68] drm/i915/bdw: Flip the 48b switch Ben Widawsky
                   ` (7 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

v2: Zero-pad the new 8B fields, or else intel_error_decode has a hard
time. Note that we need an igt update regardless.

v3: Make reloc_offset 64b also.
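
To illustrate with made-up numbers why the width and padding matter:

    u64 gtt_offset = 0x0000123456789000ULL;

    err_printf(m, "0x%08x\n",    (u32)gtt_offset); /* "0x56789000", truncated */
    err_printf(m, "0x%016llx\n", gtt_offset);      /* "0x0000123456789000" */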

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h       |  4 ++--
 drivers/gpu/drm/i915/i915_gpu_error.c | 17 +++++++++--------
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ff921e6..8d46993 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -375,7 +375,7 @@ struct drm_i915_error_state {
 
 		struct drm_i915_error_object {
 			int page_count;
-			u32 gtt_offset;
+			u64 gtt_offset;
 			u32 *pages[0];
 		} *ringbuffer, *batchbuffer, *wa_batchbuffer, *ctx, *hws_page;
 
@@ -400,7 +400,7 @@ struct drm_i915_error_state {
 		u32 size;
 		u32 name;
 		u32 rseqno, wseqno;
-		u32 gtt_offset;
+		u64 gtt_offset;
 		u32 read_domains;
 		u32 write_domain;
 		s32 fence_reg:I915_MAX_NUM_FENCE_BITS;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index c391268..00b6176 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -209,7 +209,7 @@ static void print_error_buffers(struct drm_i915_error_state_buf *m,
 	err_printf(m, "%s [%d]:\n", name, count);
 
 	while (count--) {
-		err_printf(m, "  %08x %8u %02x %02x %x %x",
+		err_printf(m, "  %016llx %8u %02x %02x %x %x",
 			   err->gtt_offset,
 			   err->size,
 			   err->read_domains,
@@ -428,7 +428,7 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 				err_printf(m, " (submitted by %s [%d])",
 					   error->ring[i].comm,
 					   error->ring[i].pid);
-			err_printf(m, " --- gtt_offset = 0x%08x\n",
+			err_printf(m, " --- gtt_offset = 0x%016llx\n",
 				   obj->gtt_offset);
 			print_error_obj(m, obj);
 		}
@@ -436,7 +436,8 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 		obj = error->ring[i].wa_batchbuffer;
 		if (obj) {
 			err_printf(m, "%s (w/a) --- gtt_offset = 0x%08x\n",
-				   dev_priv->ring[i].name, obj->gtt_offset);
+				   dev_priv->ring[i].name,
+				   lower_32_bits(obj->gtt_offset));
 			print_error_obj(m, obj);
 		}
 
@@ -455,14 +456,14 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 		if ((obj = error->ring[i].ringbuffer)) {
 			err_printf(m, "%s --- ringbuffer = 0x%08x\n",
 				   dev_priv->ring[i].name,
-				   obj->gtt_offset);
+				   lower_32_bits(obj->gtt_offset));
 			print_error_obj(m, obj);
 		}
 
 		if ((obj = error->ring[i].hws_page)) {
 			err_printf(m, "%s --- HW Status = 0x%08x\n",
 				   dev_priv->ring[i].name,
-				   obj->gtt_offset);
+				   lower_32_bits(obj->gtt_offset));
 			offset = 0;
 			for (elt = 0; elt < PAGE_SIZE/16; elt += 4) {
 				err_printf(m, "[%04x] %08x %08x %08x %08x\n",
@@ -478,13 +479,13 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m,
 		if ((obj = error->ring[i].ctx)) {
 			err_printf(m, "%s --- HW Context = 0x%08x\n",
 				   dev_priv->ring[i].name,
-				   obj->gtt_offset);
+				   lower_32_bits(obj->gtt_offset));
 			print_error_obj(m, obj);
 		}
 	}
 
 	if ((obj = error->semaphore_obj)) {
-		err_printf(m, "Semaphore page = 0x%08x\n", obj->gtt_offset);
+		err_printf(m, "Semaphore page = 0x%016llx\n", obj->gtt_offset);
 		for (elt = 0; elt < PAGE_SIZE/16; elt += 4) {
 			err_printf(m, "[%04x] %08x %08x %08x %08x\n",
 				   elt * 4,
@@ -580,7 +581,7 @@ i915_error_object_create_sized(struct drm_i915_private *dev_priv,
 {
 	struct drm_i915_error_object *dst;
 	int i;
-	u32 reloc_offset;
+	u64 reloc_offset;
 
 	if (src == NULL || src->pages == NULL)
 		return NULL;
-- 
2.0.4

* [PATCH 66/68] drm/i915/bdw: Flip the 48b switch
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (64 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 65/68] drm/i915: Expand error state's address width to 64b Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 67/68] drm/i915: Provide a soft_pin hook Ben Widawsky
                   ` (6 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h     | 2 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c | 3 ---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8d46993..00d9ab3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2079,7 +2079,7 @@ struct drm_i915_cmd_table {
 #define USES_PPGTT(dev)		(i915.enable_ppgtt)
 #define USES_FULL_PPGTT(dev)	(i915.enable_ppgtt == 2)
 #ifdef CONFIG_64BIT
-# define HAS_48B_PPGTT(dev)	(IS_BROADWELL(dev) && false)
+# define HAS_48B_PPGTT(dev)	IS_BROADWELL(dev)
 #else
 # define HAS_48B_PPGTT(dev)	false
 #endif
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 5c23f5b..f5b8aea 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1013,9 +1013,6 @@ static int gen8_ppgtt_alloc_pagedirs(struct i915_address_space *vm,
 
 	BUG_ON(!bitmap_empty(new_pds, pdpes));
 
-	/* FIXME: PPGTT container_of won't work for 64b */
-	BUG_ON((start + length) > 0x800000000ULL);
-
 	gen8_for_each_pdpe(pd, pdp, start, length, temp, pdpe) {
 		if (pd)
 			continue;
-- 
2.0.4

* [PATCH 67/68] drm/i915: Provide a soft_pin hook
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (65 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 66/68] drm/i915/bdw: Flip the 48b switch Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 68/68] XXX: drm/i915: Unexplained workarounds Ben Widawsky
                   ` (5 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

This patch is a proof-of-concept hack which repurposes the MSB of the
size field of the GEM create ioctl. Userptr already has the gup code,
and all we need to do is reuse it.
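
A hypothetical userspace invocation on a 48b full-PPGTT kernel could
look like the sketch below. The bit-63 flag and the handle/pad packing
match this patch's kernel side; everything else (variable names,
alignment choice) is assumed:

    struct drm_i915_gem_create create = { 0 };
    void *ptr;

    posix_memalign(&ptr, 4096, sz);            /* CPU-side backing store */
    create.size = sz | (1ULL << 63);           /* MSB requests a soft pin */
    create.handle = (uintptr_t)ptr >> 32;      /* high 32 bits of the VA... */
    create.pad = (uint32_t)(uintptr_t)ptr;     /* ...low 32 bits */
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);
    /* On success, create.handle now holds the new GEM handle. */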

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            | 12 ++++-
 drivers/gpu/drm/i915/i915_gem.c            | 86 +++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  5 +-
 drivers/gpu/drm/i915/i915_gem_gtt.h        |  1 +
 drivers/gpu/drm/i915/i915_gem_userptr.c    |  7 +--
 include/uapi/drm/i915_drm.h                |  3 +-
 6 files changed, 100 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 00d9ab3..cfad8c1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2298,8 +2298,14 @@ void *i915_gem_object_alloc(struct drm_device *dev);
 void i915_gem_object_free(struct drm_i915_gem_object *obj);
 void i915_gem_object_init(struct drm_i915_gem_object *obj,
 			 const struct drm_i915_gem_object_ops *ops);
-struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
-						  size_t size);
+struct drm_i915_gem_object *_i915_gem_alloc_object(struct drm_device *dev,
+						  size_t size,
+						  bool user_obj);
+#define i915_gem_alloc_object(dev, size) _i915_gem_alloc_object(dev, size, false)
+int i915_gem_userptr_init__mmu_notifier(struct drm_i915_gem_object *obj, unsigned flags);
+int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj);
+void i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj);
+#define i915_is_soft_pinned(vma) ((vma)->obj->ops->get_pages == i915_gem_userptr_get_pages)
 void i915_init_vm(struct drm_i915_private *dev_priv,
 		  struct i915_address_space *vm);
 void i915_gem_free_object(struct drm_gem_object *obj);
@@ -2311,11 +2317,13 @@ void i915_gem_vma_destroy(struct i915_vma *vma);
 #define PIN_ALIASING	(1<<3)
 #define PIN_GLOBAL_ALIASED (PIN_ALIASING | PIN_GLOBAL)
 #define PIN_OFFSET_BIAS (1<<4)
+#define PIN_SOFT	(1<<5)
 #define PIN_OFFSET_MASK (PAGE_MASK)
 int __must_check i915_gem_object_pin(struct drm_i915_gem_object *obj,
 				     struct i915_address_space *vm,
 				     uint32_t alignment,
 				     uint64_t flags);
+int i915_vma_softpin(struct i915_vma *vma, uint64_t start, uint64_t size);
 int __must_check i915_vma_unbind(struct i915_vma *vma);
 int i915_gem_object_put_pages(struct drm_i915_gem_object *obj);
 void i915_gem_release_all_mmaps(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6413f3a..75d2454 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -347,21 +347,33 @@ static int
 i915_gem_create(struct drm_file *file,
 		struct drm_device *dev,
 		uint64_t size,
+		uint64_t loc,
 		uint32_t *handle_p)
 {
 	struct drm_i915_gem_object *obj;
 	int ret;
 	u32 handle;
+	bool user_obj = false;
+
+	if (size & BIT_ULL(63)) {
+		if (!HAS_48B_PPGTT(dev) || !USES_FULL_PPGTT(dev))
+			return -EINVAL;
+		size &= ~BIT_ULL(63);
+		user_obj = true;
+	}
 
 	size = roundup(size, PAGE_SIZE);
 	if (size == 0)
 		return -EINVAL;
 
 	/* Allocate the new object */
-	obj = i915_gem_alloc_object(dev, size);
+	obj = _i915_gem_alloc_object(dev, size, user_obj);
 	if (obj == NULL)
 		return -ENOMEM;
 
+	if (user_obj)
+		 obj->userptr.ptr = loc;
+
 	ret = drm_gem_handle_create(file, &obj->base, &handle);
 	/* drop reference from allocate - handle holds it now */
 	drm_gem_object_unreference_unlocked(&obj->base);
@@ -380,7 +392,7 @@ i915_gem_dumb_create(struct drm_file *file,
 	/* have to work out size/pitch and return them */
 	args->pitch = ALIGN(args->width * DIV_ROUND_UP(args->bpp, 8), 64);
 	args->size = args->pitch * args->height;
-	return i915_gem_create(file, dev,
-			       args->size, &args->handle);
+	return i915_gem_create(file, dev, args->size,
+			       0, &args->handle);
 }
 
@@ -392,9 +404,10 @@ i915_gem_create_ioctl(struct drm_device *dev, void *data,
 		      struct drm_file *file)
 {
 	struct drm_i915_gem_create *args = data;
+	uint64_t location = ((uint64_t)args->handle) << 32 | args->pad;
 
 	return i915_gem_create(file, dev,
-			       args->size, &args->handle);
+			       args->size, location, &args->handle);
 }
 
 static inline int
@@ -3038,7 +3051,10 @@ int i915_vma_unbind(struct i915_vma *vma)
 	if (i915_is_ggtt(vma->vm))
 		obj->map_and_fenceable = true;
 
-	drm_mm_remove_node(&vma->node);
+	if (i915_is_soft_pinned(vma))
+		vma->node.allocated = 0;
+	else
+		drm_mm_remove_node(&vma->node);
 	i915_gem_vma_destroy(vma);
 
 	/* Since the unbound list is global, only move to that list if
@@ -3544,6 +3560,8 @@ i915_gem_object_bind_to_vm(struct drm_i915_gem_object *obj,
 	if (IS_ERR(vma))
 		goto err_unpin;
 
+	BUG_ON(i915_is_soft_pinned(vma));
+
 search_free:
 	ret = drm_mm_insert_node_in_range_generic(&vm->mm, &vma->node,
 						  size, alignment,
@@ -4227,6 +4245,43 @@ i915_gem_object_pin(struct drm_i915_gem_object *obj,
 	return 0;
 }
 
+int i915_vma_softpin(struct i915_vma *vma, uint64_t start, uint64_t size)
+{
+	struct drm_i915_private *dev_priv = to_i915(vma->vm->dev);
+	struct drm_i915_gem_object *obj = vma->obj;
+	int ret;
+
+	BUG_ON(!i915_is_soft_pinned(vma));
+	ret = i915_gem_object_get_pages(obj);
+	if (ret)
+		return ret;
+
+	i915_gem_object_pin_pages(obj);
+
+	ret = i915_gem_gtt_prepare_object(obj);
+	if (ret)
+		goto err_release_pages;
+
+	vma->node.allocated = 1;
+
+	list_move_tail(&obj->global_list, &dev_priv->mm.bound_list);
+	list_add_tail(&vma->mm_list, &vma->vm->inactive_list);
+
+	vma->node.start = start;
+	vma->node.size = size;
+	trace_i915_vma_bind(vma, PIN_SOFT);
+	i915_gem_vma_bind(vma, obj->cache_level, SOFT_PINNED);
+	vma->pin_count++;
+
+	return 0;
+
+err_release_pages:
+	i915_gem_vma_destroy(vma);
+	i915_gem_object_unpin_pages(obj);
+
+	return ret;
+}
+
 void
 i915_gem_object_ggtt_unpin(struct drm_i915_gem_object *obj)
 {
@@ -4473,8 +4528,16 @@ static const struct drm_i915_gem_object_ops i915_gem_object_ops = {
 	.put_pages = i915_gem_object_put_pages_gtt,
 };
 
-struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
-						  size_t size)
+/* Soft pinned objects are those which have user pages, and an offset defined by
+ * userspace
+ */
+static const struct drm_i915_gem_object_ops i915_gem_soft_pin_object_ops = {
+	.get_pages = i915_gem_userptr_get_pages,
+	.put_pages = i915_gem_userptr_put_pages,
+};
+
+struct drm_i915_gem_object *_i915_gem_alloc_object(struct drm_device *dev,
+						   size_t size, bool user_obj)
 {
 	struct drm_i915_gem_object *obj;
 	struct address_space *mapping;
@@ -4499,7 +4562,16 @@ struct drm_i915_gem_object *i915_gem_alloc_object(struct drm_device *dev,
 	mapping = file_inode(obj->base.filp)->i_mapping;
 	mapping_set_gfp_mask(mapping, mask);
 
-	i915_gem_object_init(obj, &i915_gem_object_ops);
+	if (user_obj) {
+		i915_gem_object_init(obj, &i915_gem_soft_pin_object_ops);
+		obj->userptr.mm = get_task_mm(current);
+		if (i915_gem_userptr_init__mmu_notifier(obj, 0)) {
+			i915_gem_object_free(obj);
+			drm_gem_object_unreference_unlocked(&obj->base);
+			return NULL;
+		}
+	} else
+		i915_gem_object_init(obj, &i915_gem_object_ops);
 
 	obj->base.write_domain = I915_GEM_DOMAIN_CPU;
 	obj->base.read_domains = I915_GEM_DOMAIN_CPU;
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index fdd68d6..c1761f0 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -571,7 +571,10 @@ i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
 	if (entry->flags & __EXEC_OBJECT_NEEDS_BIAS)
 		flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
 
-	ret = i915_gem_object_pin(obj, vma->vm, entry->alignment, flags);
+	if (i915_is_soft_pinned(vma))
+		ret = i915_vma_softpin(vma, entry->offset, vma->obj->base.size);
+	else
+		ret = i915_gem_object_pin(obj, vma->vm, entry->alignment, flags);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h b/drivers/gpu/drm/i915/i915_gem_gtt.h
index 92acd95..e4ab4ba 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.h
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
@@ -185,6 +185,7 @@ struct i915_vma {
 /* Only use this if you know you want a strictly aliased binding */
 #define ALIASING_BIND (1<<1)
 #define PTE_READ_ONLY (1<<2)
+#define SOFT_PINNED   (1<<3) /* Just debug for now */
 	int (*bind_vma)(struct i915_vma *vma,
 			enum i915_cache_level cache_level,
 			u32 flags);
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index fe69fc8..ff7aea6 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -341,7 +341,7 @@ i915_gem_userptr_release__mmu_notifier(struct drm_i915_gem_object *obj)
 	obj->userptr.mn = NULL;
 }
 
-static int
+int
 i915_gem_userptr_init__mmu_notifier(struct drm_i915_gem_object *obj,
 				    unsigned flags)
 {
@@ -519,7 +519,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 	kfree(work);
 }
 
-static int
+int
 i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
 {
 	const int num_pages = obj->base.size >> PAGE_SHIFT;
@@ -598,6 +598,7 @@ i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
 					get_task_struct(work->task);
 
 					INIT_WORK(&work->work, __i915_gem_userptr_get_pages_worker);
+					WARN_ON(!obj->userptr.ptr);
 					schedule_work(&work->work);
 				} else
 					ret = -ENOMEM;
@@ -621,7 +622,7 @@ i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
 	return ret;
 }
 
-static void
+void
 i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj)
 {
 	struct scatterlist *sg;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index ff57f07..413a4fb 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -662,7 +662,8 @@ struct drm_i915_gem_exec_object2 {
 #define EXEC_OBJECT_NEEDS_FENCE (1<<0)
 #define EXEC_OBJECT_NEEDS_GTT	(1<<1)
 #define EXEC_OBJECT_WRITE	(1<<2)
-#define __EXEC_OBJECT_UNKNOWN_FLAGS -(EXEC_OBJECT_WRITE<<1)
+#define EXEC_OBJECT_SOFT_PINNED (1<<3)
+#define __EXEC_OBJECT_UNKNOWN_FLAGS -(EXEC_OBJECT_SOFT_PINNED<<1)
 	__u64 flags;
 
 	__u64 rsvd1;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 68/68] XXX: drm/i915: Unexplained workarounds
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (66 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 67/68] drm/i915: Provide a soft_pin hook Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 1/2] intel: Split out bo allocation Ben Widawsky
                   ` (4 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

1. Always force invalidate. This doesn't fix any bugs, but it makes the
time to failure longer.

2. Make TLB invalidation explicit

I can't say I've spent much time with these; however, it seems that
used individually, neither gives an observable stability improvement.
Used together, they do, though things still hang in the end.
---
 drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 7d81904..195bc0b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -813,6 +813,8 @@ static int do_switch_rcs(struct intel_engine_cs *ring,
 		 * die because future work may end up depending on valid address
 		 * space. This means we must enforce that a page table load
 		 * occur when this occurs. */
+	} else if (IS_GEN8(ring->dev) && USES_FULL_PPGTT(ring->dev)) {
+		hw_flags |= MI_FORCE_RESTORE;
 	} else if (test_and_clear_bit(ring->id, &to->vm->pd_reload_mask))
 		hw_flags |= MI_FORCE_RESTORE;
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 117543e..b6ed038 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -645,7 +645,7 @@ static int init_render_ring(struct intel_engine_cs *ring)
 			   _MASKED_BIT_ENABLE(GFX_TLB_INVALIDATE_EXPLICIT));
 
 	/* WaBCSVCSTlbInvalidationMode:ivb,vlv,hsw */
-	if (IS_GEN7(dev))
+	if (IS_GEN7(dev) || IS_GEN8(dev))
 		I915_WRITE(GFX_MODE_GEN7,
 			   _MASKED_BIT_ENABLE(GFX_TLB_INVALIDATE_EXPLICIT) |
 			   _MASKED_BIT_ENABLE(GFX_REPLAY_MODE));
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 1/2] intel: Split out bo allocation
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (67 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 68/68] XXX: drm/i915: Unexplained workarounds Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH 2/2] intel: Add prelocation support Ben Widawsky
                   ` (3 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 intel/intel_bufmgr_gem.c | 69 ++++++++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 29 deletions(-)

diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
index 0e1cb0d..d7d3769 100644
--- a/intel/intel_bufmgr_gem.c
+++ b/intel/intel_bufmgr_gem.c
@@ -636,6 +636,43 @@ drm_intel_gem_bo_cache_purge_bucket(drm_intel_bufmgr_gem *bufmgr_gem,
 	}
 }
 
+static drm_intel_bo_gem *
+__bo_alloc(drm_intel_bufmgr_gem *bufmgr_gem, unsigned long size)
+{
+	struct drm_i915_gem_create create;
+	drm_intel_bo_gem *bo_gem;
+	int ret;
+
+	bo_gem = calloc(1, sizeof(*bo_gem));
+	if (!bo_gem)
+		return NULL;
+
+	bo_gem->bo.size = size;
+
+	VG_CLEAR(create);
+	create.size = size;
+
+	ret = drmIoctl(bufmgr_gem->fd,
+		       DRM_IOCTL_I915_GEM_CREATE,
+		       &create);
+	bo_gem->gem_handle = create.handle;
+	bo_gem->bo.handle = bo_gem->gem_handle;
+	if (ret != 0) {
+		free(bo_gem);
+		return NULL;
+	}
+	bo_gem->bo.bufmgr = (drm_intel_bufmgr *)bufmgr_gem;
+
+	bo_gem->tiling_mode = I915_TILING_NONE;
+	bo_gem->swizzle_mode = I915_BIT_6_SWIZZLE_NONE;
+	bo_gem->stride = 0;
+
+	DRMINITLISTHEAD(&bo_gem->name_list);
+	DRMINITLISTHEAD(&bo_gem->vma_list);
+
+	return bo_gem;
+}
+
 static drm_intel_bo *
 drm_intel_gem_bo_alloc_internal(drm_intel_bufmgr *bufmgr,
 				const char *name,
@@ -647,7 +684,6 @@ drm_intel_gem_bo_alloc_internal(drm_intel_bufmgr *bufmgr,
 	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bufmgr;
 	drm_intel_bo_gem *bo_gem;
 	unsigned int page_size = getpagesize();
-	int ret;
 	struct drm_intel_gem_bo_bucket *bucket;
 	bool alloc_from_cache;
 	unsigned long bo_size;
@@ -720,41 +756,16 @@ retry:
 	pthread_mutex_unlock(&bufmgr_gem->lock);
 
 	if (!alloc_from_cache) {
-		struct drm_i915_gem_create create;
-
-		bo_gem = calloc(1, sizeof(*bo_gem));
+		bo_gem = __bo_alloc(bufmgr_gem, bo_size);
 		if (!bo_gem)
 			return NULL;
 
-		bo_gem->bo.size = bo_size;
-
-		VG_CLEAR(create);
-		create.size = bo_size;
-
-		ret = drmIoctl(bufmgr_gem->fd,
-			       DRM_IOCTL_I915_GEM_CREATE,
-			       &create);
-		bo_gem->gem_handle = create.handle;
-		bo_gem->bo.handle = bo_gem->gem_handle;
-		if (ret != 0) {
-			free(bo_gem);
-			return NULL;
-		}
-		bo_gem->bo.bufmgr = bufmgr;
-
-		bo_gem->tiling_mode = I915_TILING_NONE;
-		bo_gem->swizzle_mode = I915_BIT_6_SWIZZLE_NONE;
-		bo_gem->stride = 0;
-
 		if (drm_intel_gem_bo_set_tiling_internal(&bo_gem->bo,
 							 tiling_mode,
 							 stride)) {
-		    drm_intel_gem_bo_free(&bo_gem->bo);
-		    return NULL;
+			drm_intel_gem_bo_free(&bo_gem->bo);
+			return NULL;
 		}
-
-		DRMINITLISTHEAD(&bo_gem->name_list);
-		DRMINITLISTHEAD(&bo_gem->vma_list);
 	}
 
 	bo_gem->name = name;
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH 2/2] intel: Add prelocation support
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (68 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 1/2] intel: Split out bo allocation Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  3:12 ` [PATCH] i965: First step toward prelocation Ben Widawsky
                   ` (2 subsequent siblings)
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky, Ben Widawsky

Add the userspace half of prelocation: reserve CPU address space for
the object (a 32-bit mmap, or aligned_alloc otherwise), hand that
address to the kernel through the repurposed create args, report it via
offset64, and skip relocation processing for prelocated targets at
execbuffer time.
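
A hypothetical usage sketch (buffer name and size are illustrative):

    /* Reserve CPU address space and create a bo pinned at that address;
     * low=0 allows the full 48b range, low=1 forces it below 4GB. */
    drm_intel_bo *bo = drm_intel_bo_alloc_prelocated(bufmgr, "state", 8192, 0);

    if (bo) {
            drm_intel_bo_map(bo, 1);      /* maps the reserved CPU pages */
            memset(bo->virtual, 0, 8192);
            drm_intel_bo_unmap(bo);
            /* bo->offset64 already holds the final GPU address; relocations
             * against this bo are skipped at execbuffer time */
    }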

Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 intel/intel_bufmgr.h     |   8 ++++
 intel/intel_bufmgr_gem.c | 102 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/intel/intel_bufmgr.h b/intel/intel_bufmgr.h
index 9383c72..e4ecc44 100644
--- a/intel/intel_bufmgr.h
+++ b/intel/intel_bufmgr.h
@@ -88,6 +88,8 @@ struct _drm_intel_bo {
 	 * Last seen card virtual address (offset from the beginning of the
 	 * aperture) for the object.  This should be used to fill relocation
 	 * entries when calling drm_intel_bo_emit_reloc()
+	 *
+	 * This is also useful when prelocating an object.
 	 */
 	uint64_t offset64;
 };
@@ -106,6 +108,8 @@ typedef struct _drm_intel_aub_annotation {
 } drm_intel_aub_annotation;
 
 #define BO_ALLOC_FOR_RENDER (1<<0)
+#define BO_ALLOC_PRELOCATE  (1<<1)
+#define BO_ALLOC_PRELOCATE_32  (1<<2)
 
 drm_intel_bo *drm_intel_bo_alloc(drm_intel_bufmgr *bufmgr, const char *name,
 				 unsigned long size, unsigned int alignment);
@@ -119,6 +123,10 @@ drm_intel_bo *drm_intel_bo_alloc_tiled(drm_intel_bufmgr *bufmgr,
 				       uint32_t *tiling_mode,
 				       unsigned long *pitch,
 				       unsigned long flags);
+drm_intel_bo *drm_intel_bo_alloc_prelocated(drm_intel_bufmgr *bufmgr,
+					    const char *name,
+					    unsigned long size,
+					    int low);
 void drm_intel_bo_reference(drm_intel_bo *bo);
 void drm_intel_bo_unreference(drm_intel_bo *bo);
 int drm_intel_bo_map(drm_intel_bo *bo, int write_enable);
diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
index d7d3769..5a2a9bd 100644
--- a/intel/intel_bufmgr_gem.c
+++ b/intel/intel_bufmgr_gem.c
@@ -221,6 +221,11 @@ struct _drm_intel_bo_gem {
 	 */
 	bool idle;
 
+	/** How the CPU address backing this bo was reserved (0 if not prelocated) */
+	#define PRELOCATE_MMAP 1
+	#define PRELOCATE_MALLOC 2
+	int prelocated;
+
 	/**
 	 * Size in bytes of this buffer and its relocation descendents.
 	 *
@@ -489,7 +494,10 @@ drm_intel_add_validate_buffer2(drm_intel_bo *bo, int need_fence)
 	bufmgr_gem->exec2_objects[index].relocation_count = bo_gem->reloc_count;
 	bufmgr_gem->exec2_objects[index].relocs_ptr = (uintptr_t)bo_gem->relocs;
 	bufmgr_gem->exec2_objects[index].alignment = 0;
-	bufmgr_gem->exec2_objects[index].offset = 0;
+	if (bo_gem->prelocated)
+		bufmgr_gem->exec2_objects[index].offset = bo->offset64;
+	else
+		bufmgr_gem->exec2_objects[index].offset = 0;
 	bufmgr_gem->exec_bos[index] = bo;
 	bufmgr_gem->exec2_objects[index].flags = 0;
 	bufmgr_gem->exec2_objects[index].rsvd1 = 0;
@@ -637,9 +645,10 @@ drm_intel_gem_bo_cache_purge_bucket(drm_intel_bufmgr_gem *bufmgr_gem,
 }
 
 static drm_intel_bo_gem *
-__bo_alloc(drm_intel_bufmgr_gem *bufmgr_gem, unsigned long size)
+__bo_alloc(drm_intel_bufmgr_gem *bufmgr_gem, unsigned long size, bool prelocate, bool low32)
 {
 	struct drm_i915_gem_create create;
+	drm_intel_bo *bo;
 	drm_intel_bo_gem *bo_gem;
 	int ret;
 
@@ -647,10 +656,35 @@ __bo_alloc(drm_intel_bufmgr_gem *bufmgr_gem, unsigned long size)
 	if (!bo_gem)
 		return NULL;
 
+	bo = (drm_intel_bo *)bo_gem;
+
 	bo_gem->bo.size = size;
 
 	VG_CLEAR(create);
 	create.size = size;
+	/* FIXME: This is a gross hack to repurpose the create args */
+	if (prelocate) {
+		create.size |= (1ULL << 63);
+		if (low32) {
+			bo->offset64 = (uint64_t)mmap(NULL, size,
+					PROT_READ | PROT_WRITE,
+					MAP_ANONYMOUS | MAP_PRIVATE | MAP_32BIT,
+					-1, 0);
+			bo_gem->prelocated = PRELOCATE_MMAP;
+		} else {
+			bo->offset64 = (uint64_t)aligned_alloc(getpagesize(), size);
+			bo_gem->prelocated = PRELOCATE_MALLOC;
+		}
+		if (!bo->offset64 || bo->offset64 == (uint64_t)MAP_FAILED) {
+			DBG("Couldn't allocate %ld bytes of address space: %s\n",
+			    size, strerror(errno));
+			free(bo_gem);
+			return NULL;
+		}
+		create.handle = bo->offset64 >> 32;
+		create.pad = bo->offset64;
+	} else
+		bo->offset64 = 0x1;
 
 	ret = drmIoctl(bufmgr_gem->fd,
 		       DRM_IOCTL_I915_GEM_CREATE,
@@ -658,6 +692,10 @@ __bo_alloc(drm_intel_bufmgr_gem *bufmgr_gem, unsigned long size)
 	bo_gem->gem_handle = create.handle;
 	bo_gem->bo.handle = bo_gem->gem_handle;
 	if (ret != 0) {
+		if (prelocate && low32)
+			munmap((void *)bo->offset64, size);
+		else if (prelocate)
+			free((void *)bo->offset64);
 		free(bo_gem);
 		return NULL;
 	}
@@ -687,10 +725,17 @@ drm_intel_gem_bo_alloc_internal(drm_intel_bufmgr *bufmgr,
 	struct drm_intel_gem_bo_bucket *bucket;
 	bool alloc_from_cache;
 	unsigned long bo_size;
-	bool for_render = false;
+	bool for_render = false, prelocate = false, low = false;
 
 	if (flags & BO_ALLOC_FOR_RENDER)
 		for_render = true;
+	if (flags & BO_ALLOC_PRELOCATE) {
+		if (flags & BO_ALLOC_PRELOCATE_32)
+			low = true;
+		prelocate = true;
+		bo_size = size;
+		goto skip_cache;
+	}
 
 	/* Round the allocated size up to a power of two number of pages. */
 	bucket = drm_intel_gem_bo_bucket_for_size(bufmgr_gem, size);
@@ -756,7 +801,8 @@ retry:
 	pthread_mutex_unlock(&bufmgr_gem->lock);
 
 	if (!alloc_from_cache) {
-		bo_gem = __bo_alloc(bufmgr_gem, bo_size);
+skip_cache:
+		bo_gem = __bo_alloc(bufmgr_gem, bo_size, prelocate, low);
 		if (!bo_gem)
 			return NULL;
 
@@ -774,7 +820,7 @@ retry:
 	bo_gem->reloc_tree_fences = 0;
 	bo_gem->used_as_reloc_target = false;
 	bo_gem->has_error = false;
-	bo_gem->reusable = true;
+	bo_gem->reusable = !prelocate;
 	bo_gem->aub_annotations = NULL;
 	bo_gem->aub_annotation_count = 0;
 
@@ -859,6 +905,25 @@ drm_intel_gem_bo_alloc_tiled(drm_intel_bufmgr *bufmgr, const char *name,
 					       tiling, stride);
 }
 
+drm_public drm_intel_bo *
+drm_intel_bo_alloc_prelocated(drm_intel_bufmgr *bufmgr,
+			      const char *name,
+			      unsigned long size,
+			      int low)
+{
+	drm_intel_bufmgr_gem *bufmgr_gem = (drm_intel_bufmgr_gem *) bufmgr;
+	int flag = BO_ALLOC_PRELOCATE;
+	/* FIXME: Need to replace this with a paramcheck */
+	if (bufmgr_gem->gen < 8 || !bufmgr_gem->has_llc)
+		return NULL;
+
+	if (low)
+		flag |= BO_ALLOC_PRELOCATE_32;
+
+	return drm_intel_gem_bo_alloc_internal(bufmgr, name, size,
+					       flag, I915_TILING_NONE, 0);
+}
+
 /**
  * Returns a drm_intel_bo wrapping the given buffer object handle.
  *
@@ -964,7 +1029,7 @@ drm_intel_gem_bo_free(drm_intel_bo *bo)
 	int ret;
 
 	DRMLISTDEL(&bo_gem->vma_list);
-	if (bo_gem->mem_virtual) {
+	if (bo_gem->mem_virtual && !bo_gem->prelocated) {
 		VG(VALGRIND_FREELIKE_BLOCK(bo_gem->mem_virtual, 0));
 		munmap(bo_gem->mem_virtual, bo_gem->bo.size);
 		bufmgr_gem->vma_count--;
@@ -982,6 +1047,12 @@ drm_intel_gem_bo_free(drm_intel_bo *bo)
 		DBG("DRM_IOCTL_GEM_CLOSE %d failed (%s): %s\n",
 		    bo_gem->gem_handle, bo_gem->name, strerror(errno));
 	}
+
+	if (bo_gem->prelocated == PRELOCATE_MMAP)
+		munmap((void *)bo->offset64, bo->size);
+	else if (bo_gem->prelocated == PRELOCATE_MALLOC)
+		free((void *)bo->offset64);
+
 	free(bo_gem->aub_annotations);
 	free(bo);
 }
@@ -1190,7 +1261,9 @@ static int drm_intel_gem_bo_map(drm_intel_bo *bo, int write_enable)
 	if (bo_gem->map_count++ == 0)
 		drm_intel_gem_bo_open_vma(bufmgr_gem, bo_gem);
 
-	if (!bo_gem->mem_virtual) {
+	if (bo_gem->prelocated) {
+		bo_gem->mem_virtual = (void *)bo->offset64;
+	} else if (!bo_gem->mem_virtual) {
 		struct drm_i915_gem_mmap mmap_arg;
 
 		DBG("bo_map: %d (%s), map_count=%d\n",
@@ -1683,6 +1756,17 @@ do_bo_emit_reloc(drm_intel_bo *bo, uint32_t offset,
 		return -ENOMEM;
 	}
 
+	/* If the target we're trying to point to is a prelocated object, we
+	 * can skip actually telling the kernel about the relocation; userspace
+	 * is expected to use offset64 directly. */
+	if (target_bo_gem->prelocated) {
+		assert(target_bo->offset64 != 0x1);
+		assert(target_bo->offset64 != 0); // temp hack
+		if (bo_gem->validate_index == -1)
+			drm_intel_add_validate_buffer2(target_bo, false);
+		return 0;
+	}
+
 	/* We never use HW fences for rendering on 965+ */
 	if (bufmgr_gem->gen >= 4)
 		need_fence = false;
@@ -1863,7 +1947,6 @@ drm_intel_gem_bo_process_reloc2(drm_intel_bo *bo)
 	}
 }
 
-
 static void
 drm_intel_update_buffer_offsets(drm_intel_bufmgr_gem *bufmgr_gem)
 {
@@ -1894,6 +1977,9 @@ drm_intel_update_buffer_offsets2 (drm_intel_bufmgr_gem *bufmgr_gem)
 		drm_intel_bo *bo = bufmgr_gem->exec_bos[i];
 		drm_intel_bo_gem *bo_gem = (drm_intel_bo_gem *)bo;
 
+		if (bo_gem->prelocated)
+			continue;
+
 		/* Update the buffer offset */
 		if (bufmgr_gem->exec2_objects[i].offset != bo->offset64) {
 			DBG("BO %d (%s) migrated: 0x%08lx -> 0x%08llx\n",
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH] i965: First step toward prelocation
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (69 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH 2/2] intel: Add prelocation support Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22 12:15   ` [Mesa-dev] " Alex Deucher
  2014-08-22  3:12 ` [PATCH] no_reloc: test case Ben Widawsky
  2014-08-22  6:30 ` [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Chris Wilson
  72 siblings, 1 reply; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: mesa-dev, Ben Widawsky

This was a quick proof of concept to show the new API for prelocating
buffers.

It still needs much more testing, removal of the NO_RELOC ifdefs, and a
libdrm ABI dep bump.
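
For illustration, with -DNO_RELOC every allocation call in the driver
is redirected by the wrapper macro below (sketch; note the alignment
argument is silently dropped, which is one of the PoC shortcuts):

    bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "query", 4096, 4096);
    /* expands to:
     * bo = drm_intel_bo_alloc_prelocated(brw->bufmgr, "query", 4096, 1); */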
---
 src/mesa/drivers/dri/i965/Makefile.am               | 1 +
 src/mesa/drivers/dri/i965/brw_performance_monitor.c | 6 +++---
 src/mesa/drivers/dri/i965/brw_program.c             | 5 +++--
 src/mesa/drivers/dri/i965/brw_queryobj.c            | 6 +++---
 src/mesa/drivers/dri/i965/brw_state_cache.c         | 4 ++--
 src/mesa/drivers/dri/i965/intel_batchbuffer.c       | 3 +++
 src/mesa/drivers/dri/i965/intel_batchbuffer.h       | 8 ++++++++
 7 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
index 5809dc6..4b20d36 100644
--- a/src/mesa/drivers/dri/i965/Makefile.am
+++ b/src/mesa/drivers/dri/i965/Makefile.am
@@ -24,6 +24,7 @@
 include Makefile.sources
 
 AM_CFLAGS = \
+        -DNO_RELOC \
 	-I$(top_srcdir)/include \
 	-I$(top_srcdir)/src/ \
 	-I$(top_srcdir)/src/mapi \
diff --git a/src/mesa/drivers/dri/i965/brw_performance_monitor.c b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
index edfa3d2..e30c527 100644
--- a/src/mesa/drivers/dri/i965/brw_performance_monitor.c
+++ b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
@@ -1105,13 +1105,13 @@ brw_begin_perf_monitor(struct gl_context *ctx,
        * wasting memory for contexts that don't use performance monitors.
        */
       if (!brw->perfmon.bookend_bo) {
-         brw->perfmon.bookend_bo = drm_intel_bo_alloc(brw->bufmgr,
+         brw->perfmon.bookend_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr,
                                                       "OA bookend BO",
                                                       BOOKEND_BO_SIZE_BYTES, 64);
       }
 
       monitor->oa_bo =
-         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
+         drm_intel_bo_alloc_wrapper(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
 #ifdef DEBUG
       /* Pre-filling the BO helps debug whether writes landed. */
       drm_intel_bo_map(monitor->oa_bo, true);
@@ -1146,7 +1146,7 @@ brw_begin_perf_monitor(struct gl_context *ctx,
 
    if (monitor_needs_statistics_registers(brw, m)) {
       monitor->pipeline_stats_bo =
-         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
+         drm_intel_bo_alloc_wrapper(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
 
       /* Take starting snapshots. */
       snapshot_statistics_registers(brw, monitor, 0);
diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
index d782b4f..74ff40c 100644
--- a/src/mesa/drivers/dri/i965/brw_program.c
+++ b/src/mesa/drivers/dri/i965/brw_program.c
@@ -43,6 +43,7 @@
 
 #include "brw_context.h"
 #include "brw_wm.h"
+#include "intel_batchbuffer.h"
 
 static unsigned
 get_new_program_id(struct intel_screen *screen)
@@ -242,7 +243,7 @@ brw_get_scratch_bo(struct brw_context *brw,
    }
 
    if (!old_bo) {
-      *scratch_bo = drm_intel_bo_alloc(brw->bufmgr, "scratch bo", size, 4096);
+      *scratch_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "scratch bo", size, 4096);
    }
 }
 
@@ -265,7 +266,7 @@ void
 brw_init_shader_time(struct brw_context *brw)
 {
    const int max_entries = 4096;
-   brw->shader_time.bo = drm_intel_bo_alloc(brw->bufmgr, "shader time",
+   brw->shader_time.bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "shader time",
                                             max_entries * SHADER_TIME_STRIDE,
                                             4096);
    brw->shader_time.shader_programs = rzalloc_array(brw, struct gl_shader_program *,
diff --git a/src/mesa/drivers/dri/i965/brw_queryobj.c b/src/mesa/drivers/dri/i965/brw_queryobj.c
index c053c34..cf5a2a5 100644
--- a/src/mesa/drivers/dri/i965/brw_queryobj.c
+++ b/src/mesa/drivers/dri/i965/brw_queryobj.c
@@ -230,7 +230,7 @@ brw_begin_query(struct gl_context *ctx, struct gl_query_object *q)
        * the system was doing other work, such as running other applications.
        */
       drm_intel_bo_unreference(query->bo);
-      query->bo = drm_intel_bo_alloc(brw->bufmgr, "timer query", 4096, 4096);
+      query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "timer query", 4096, 4096);
       brw_write_timestamp(brw, query->bo, 0);
       break;
 
@@ -388,7 +388,7 @@ ensure_bo_has_space(struct gl_context *ctx, struct brw_query_object *query)
          brw_queryobj_get_results(ctx, query);
       }
 
-      query->bo = drm_intel_bo_alloc(brw->bufmgr, "query", 4096, 1);
+      query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "query", 4096, 1);
       query->last_index = 0;
    }
 }
@@ -474,7 +474,7 @@ brw_query_counter(struct gl_context *ctx, struct gl_query_object *q)
    assert(q->Target == GL_TIMESTAMP);
 
    drm_intel_bo_unreference(query->bo);
-   query->bo = drm_intel_bo_alloc(brw->bufmgr, "timestamp query", 4096, 4096);
+   query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "timestamp query", 4096, 4096);
    brw_write_timestamp(brw, query->bo, 0);
 }
 
diff --git a/src/mesa/drivers/dri/i965/brw_state_cache.c b/src/mesa/drivers/dri/i965/brw_state_cache.c
index b0986ea..daf5a11 100644
--- a/src/mesa/drivers/dri/i965/brw_state_cache.c
+++ b/src/mesa/drivers/dri/i965/brw_state_cache.c
@@ -171,7 +171,7 @@ brw_cache_new_bo(struct brw_cache *cache, uint32_t new_size)
    struct brw_context *brw = cache->brw;
    drm_intel_bo *new_bo;
 
-   new_bo = drm_intel_bo_alloc(brw->bufmgr, "program cache", new_size, 64);
+   new_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "program cache", new_size, 64);
 
    /* Copy any existing data that needs to be saved. */
    if (cache->next_offset != 0) {
@@ -335,7 +335,7 @@ brw_init_caches(struct brw_context *brw)
    cache->items =
       calloc(1, cache->size * sizeof(struct brw_cache_item *));
 
-   cache->bo = drm_intel_bo_alloc(brw->bufmgr,
+   cache->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr,
 				  "program cache",
 				  4096, 64);
 
diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
index 71dc268..50834c2 100644
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
+++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
@@ -253,6 +253,9 @@ do_flush_locked(struct brw_context *brw)
 
    if (!brw->intelScreen->no_hw) {
       int flags;
+#ifdef NO_RELOC
+      flags |= I915_EXEC_NO_RELOC;
+#endif
 
       if (brw->gen >= 6 && batch->ring == BLT_RING) {
          flags = I915_EXEC_BLT;
diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.h b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
index 7bdd836..2670d22 100644
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.h
+++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
@@ -11,6 +11,14 @@
 extern "C" {
 #endif
 
+#ifdef NO_RELOC
+ #define drm_intel_bo_alloc_wrapper(bufmgr, name, size, align) \
+   drm_intel_bo_alloc_prelocated(bufmgr, name, size, 1)
+#else
+ #define drm_intel_bo_alloc_wrapper drm_intel_bo_alloc
+#endif
+
+
 /**
  * Number of bytes to reserve for commands necessary to complete a batch.
  *
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH] no_reloc: test case
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (70 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH] i965: First step toward prelocation Ben Widawsky
@ 2014-08-22  3:12 ` Ben Widawsky
  2014-08-22  6:30 ` [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Chris Wilson
  72 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22  3:12 UTC (permalink / raw)
  To: Intel GFX; +Cc: Ben Widawsky

Needs the corresponding libdrm prelocation patches.
---
 tests/Makefile.am        |   4 +-
 tests/Makefile.sources   |   1 +
 tests/gem_exec_noreloc.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 175 insertions(+), 2 deletions(-)
 create mode 100644 tests/gem_exec_noreloc.c

diff --git a/tests/Makefile.am b/tests/Makefile.am
index a2fba51..5c06ab3 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -27,8 +27,8 @@ multi-tests.txt: Makefile.sources
 	@echo ${multi_kernel_tests} >> $@
 	@echo END TESTLIST >> $@
 
-EXTRA_PROGRAMS = $(TESTS_progs) $(TESTS_progs_M) $(HANG) $(TESTS_testsuite)
-EXTRA_DIST = $(TESTS_scripts) $(TESTS_scripts_M) $(scripts) $(IMAGES) $(common_files)
+bin_PROGRAMS = $(TESTS_progs) $(TESTS_progs_M) $(HANG) $(TESTS_testsuite)
+dist_bin_SCRIPTS = $(TESTS_scripts) $(TESTS_scripts_M) $(scripts) $(IMAGES) $(common_files)
 
 CLEANFILES = $(EXTRA_PROGRAMS) single-tests.txt multi-tests.txt
 
diff --git a/tests/Makefile.sources b/tests/Makefile.sources
index 698e290..cbc3bb9 100644
--- a/tests/Makefile.sources
+++ b/tests/Makefile.sources
@@ -31,6 +31,7 @@ TESTS_progs_M = \
 	gem_evict_everything \
 	gem_exec_bad_domains \
 	gem_exec_faulting_reloc \
+	gem_exec_noreloc \
 	gem_exec_nop \
 	gem_exec_params \
 	gem_exec_parse \
diff --git a/tests/gem_exec_noreloc.c b/tests/gem_exec_noreloc.c
new file mode 100644
index 0000000..3552246
--- /dev/null
+++ b/tests/gem_exec_noreloc.c
@@ -0,0 +1,172 @@
+/*
+ * Copyright (c) 2014 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <limits.h>
+#include <signal.h>
+#include <errno.h>
+#include <drm.h>
+
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "igt_core.h"
+#include "igt_aux.h"
+#include "intel_bufmgr.h"
+#include "intel_batchbuffer.h"
+#include "intel_io.h"
+#include "intel_chipset.h"
+
+#define PATTERN 0xdefaca7e
+
+int fd;
+static struct intel_batchbuffer *batch_3d;
+
+static void
+set_bo(drm_intel_bo *bo, uint32_t val, int width, int height)
+{
+	int size = width * height;
+	uint32_t *vaddr;
+
+	drm_intel_bo_map(bo, true);
+	vaddr = bo->virtual;
+	while (size--)
+		*vaddr++ = val;
+	drm_intel_bo_unmap(bo);
+}
+
+static void
+cmp_bo(drm_intel_bo *bo, uint32_t val, int width, int height)
+{
+	int size = width * height;
+	uint32_t *vaddr;
+
+	drm_intel_bo_map(bo, false);
+	vaddr = bo->virtual;
+	while (size--)
+		igt_assert(*vaddr++ == val);
+	drm_intel_bo_unmap(bo);
+}
+
+
+static void init_buffer(drm_intel_bufmgr *bufmgr,
+                        struct igt_buf *buf,
+                        drm_intel_bo *bo,
+                        int width, int height)
+{
+	buf->bo = bo;
+	buf->size = width * height * 4;
+	igt_assert(buf->bo);
+	buf->tiling = I915_TILING_NONE;
+	buf->num_tiles = width * height * 4;
+	buf->stride = width * 4;
+}
+
+static drm_intel_bo *
+create_bo(drm_intel_bufmgr *bufmgr, int width, int height)
+{
+	drm_intel_bo *bo;
+
+	bo = drm_intel_bo_alloc_prelocated(bufmgr, "soft bo", width * height * 4, 1);
+	igt_assert(bo);
+
+	set_bo(bo, PATTERN, width, height);
+
+	return bo;
+}
+
+static void release_bo(drm_intel_bo *bo)
+{
+	drm_intel_gem_bo_unmap_gtt(bo);
+	drm_intel_bo_unreference(bo);
+}
+
+static void render_copyfunc(struct igt_buf *src,
+			    struct igt_buf *dst,
+			    int width,
+			    int height)
+{
+	const int src_x = 0, src_y = 0, dst_x = 0, dst_y = 0;
+	igt_render_copyfunc_t rendercopy = igt_get_render_copyfunc(intel_get_drm_devid(fd));
+
+	igt_assert(rendercopy);
+
+	if (rendercopy) {
+		rendercopy(batch_3d, NULL,
+			   src, src_x, src_y,
+			   width, height,
+			   dst, dst_x, dst_y);
+		intel_batchbuffer_flush(batch_3d);
+	}
+}
+
+static struct intel_batchbuffer *
+noreloc_batchbuffer_alloc(drm_intel_bufmgr *bufmgr, uint32_t devid)
+{
+	struct intel_batchbuffer *batch = calloc(1, sizeof(*batch));
+	igt_assert(batch);
+
+	batch->bufmgr = bufmgr;
+	batch->devid = devid;
+	batch->bo = drm_intel_bo_alloc_prelocated(bufmgr, "batchbuffer", 4096, 1);
+	igt_assert(batch->bo);
+	memset(batch->buffer, 0, sizeof(batch->buffer));
+	batch->ptr = batch->buffer;
+
+	return batch;
+}
+
+int main(int argc, char **argv)
+{
+	struct igt_buf src, dest;
+	drm_intel_bo *from, *to;
+	drm_intel_bufmgr *bufmgr;
+
+	fd = drm_open_any();
+	igt_assert(fd >= 0);
+
+	bufmgr = drm_intel_bufmgr_gem_init(fd, 4096);
+	batch_3d = noreloc_batchbuffer_alloc(bufmgr, intel_get_drm_devid(fd));
+	igt_assert(batch_3d);
+
+	from = create_bo(bufmgr, 4096, 4096);
+	to = create_bo(bufmgr, 4096, 4096);
+	igt_assert(from);
+	igt_assert(to);
+	init_buffer(bufmgr, &src, from, 4096, 4096);
+	init_buffer(bufmgr, &dest,  to, 4096, 4096);
+
+	render_copyfunc(&src, &dest, 4096, 4096);
+	cmp_bo(to, PATTERN, 4096, 4096);
+
+	release_bo(to);
+	release_bo(from);
+
+	return 0;
+}
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
                   ` (71 preceding siblings ...)
  2014-08-22  3:12 ` [PATCH] no_reloc: test case Ben Widawsky
@ 2014-08-22  6:30 ` Chris Wilson
  2014-08-22  6:59   ` Kenneth Graunke
  72 siblings, 1 reply; 85+ messages in thread
From: Chris Wilson @ 2014-08-22  6:30 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: mesa-dev, Intel GFX, Anthony Bernecky

On Thu, Aug 21, 2014 at 08:11:23PM -0700, Ben Widawsky wrote:
> The primary goal of these patches is to introduce what I've started
> calling, "prelocations" on Broadwell. A prelocation is like a
> relocation, except not. When a GPU client specifies a prelocation, it is
> instructing the kernel where in the GPU address the buffer should be
> mapped. The mechanic works very similarly to a relocation except it uses
> the execbuffer object to obtain the offset, and bind if needed.

You are mixing two APIs. One to preallocate an offset at creation
and one to presume relocations during execbuffer. I'd much rather keep
the flexible execbuffer approach outlined and first submitted a couple of
years ago.

> If a GPU
> client uses only prelocations, the relocation process can be entirely
> skipped. This sounds like a big win initially,

Close to zero if the client uses existing interfaces.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22  6:30 ` [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Chris Wilson
@ 2014-08-22  6:59   ` Kenneth Graunke
  2014-08-22  7:03     ` Chris Wilson
  0 siblings, 1 reply; 85+ messages in thread
From: Kenneth Graunke @ 2014-08-22  6:59 UTC (permalink / raw)
  To: intel-gfx; +Cc: mesa-dev, Ben Widawsky, Anthony Bernecky


On Friday, August 22, 2014 07:30:37 AM Chris Wilson wrote:
> On Thu, Aug 21, 2014 at 08:11:23PM -0700, Ben Widawsky wrote:
> > The primary goal of these patches is to introduce what I've started
> > calling, "prelocations" on Broadwell. A prelocation is like a
> > relocation, except not. When a GPU client specifies a prelocation, it is
> > instructing the kernel where in the GPU address the buffer should be
> > mapped. The mechanic works very similarly to a relocation except it uses
> > the execbuffer object to obtain the offset, and bind if needed.
> 
> You are mixing two APIs. One to preallocate an offset at creation
> and one to presume relocations during execbuffer. I'd much rather keep
> the flexible execbuffer approach outlined and first submitted a couple of
> years ago.
> 
> > If a GPU
> > client uses only prelocations, the relocation process can be entirely
> > skipped. This sounds like a big win initially,
> 
> Close to zero if the client uses existing interfaces.
> -Chris

Chris,

I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.

With both the CPU and GPU using 48-bit addressing, using the same virtual address on both sides and never changing it seems quite appealing.  I'm not sure why we would need to do anything different than that.

As I understand it, we still need to let the kernel know what buffers we need pinned during the course of the batchbuffer, since we can't take a page fault and fetch them as needed.  Reusing the existing relocation list but just not doing relocations seems like a simple way to do that without having to invent much new API...
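
For concreteness, a rough sketch of what such an exec list entry could look like with the uapi from this series (the flag is illustrative; the current kernel patches key off the object type rather than the flag):

    struct drm_i915_gem_exec_object2 entry = {
            .handle           = bo_handle,     /* buffer the batch touches */
            .relocation_count = 0,             /* nothing to patch */
            .relocs_ptr       = 0,
            .offset           = presumed_addr, /* where it must stay bound */
            .flags            = EXEC_OBJECT_SOFT_PINNED,
    };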

What is the 'flexible execbuffer' approach you mention from a few years back?  I don't remember hearing about it (sorry)...

--Ken

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22  6:59   ` Kenneth Graunke
@ 2014-08-22  7:03     ` Chris Wilson
  2014-08-22 13:30       ` Daniel Vetter
  0 siblings, 1 reply; 85+ messages in thread
From: Chris Wilson @ 2014-08-22  7:03 UTC (permalink / raw)
  To: Kenneth Graunke; +Cc: mesa-dev, intel-gfx, Anthony Bernecky, Ben Widawsky

On Thu, Aug 21, 2014 at 11:59:04PM -0700, Kenneth Graunke wrote:
> On Friday, August 22, 2014 07:30:37 AM Chris Wilson wrote:
> > On Thu, Aug 21, 2014 at 08:11:23PM -0700, Ben Widawsky wrote:
> > > The primary goal of these patches is to introduce what I've started
> > > calling, "prelocations" on Broadwell. A prelocation is like a
> > > relocation, except not. When a GPU client specifies a prelocation, it is
> > > instructing the kernel where in the GPU address the buffer should be
> > > mapped. The mechanic works very similarly to a relocation except it uses
> > > the execbuffer object to obtain the offset, and bind if needed.
> > 
> > You are mixing two APIs. One to preallocate an offset at creation
> > and one to presume relocations during execbuffer. I'd much rather keep
> > the flexible execbuffer approach outlined and first submitted a couple of
> > years ago.
> > 
> > > If a GPU
> > > client uses only prelocations, the relocation process can be entirely
> > > skipped. This sounds like a big win initially,
> > 
> > Close to zero if the client uses existing interfaces.
> > -Chris
> 
> Chris,
> 
> I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.

Indeed, you could do everything inside libdrm with the code I posted 2
years ago.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [Mesa-dev] [PATCH] i965: First step toward prelocation
  2014-08-22  3:12 ` [PATCH] i965: First step toward prelocation Ben Widawsky
@ 2014-08-22 12:15   ` Alex Deucher
  2014-08-22 17:14     ` Ben Widawsky
  0 siblings, 1 reply; 85+ messages in thread
From: Alex Deucher @ 2014-08-22 12:15 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: mesa-dev, Intel GFX

On Thu, Aug 21, 2014 at 11:12 PM, Ben Widawsky
<benjamin.widawsky@intel.com> wrote:
> This was a quick proof of concept to show the new API for prelocating
> buffers.
>

What are prelocated buffers?

Alex

> It needs way more testing, to not ifdef the no-relocs, and to do a
> libdrm ABI dep bump.
> ---
>  src/mesa/drivers/dri/i965/Makefile.am               | 1 +
>  src/mesa/drivers/dri/i965/brw_performance_monitor.c | 6 +++---
>  src/mesa/drivers/dri/i965/brw_program.c             | 5 +++--
>  src/mesa/drivers/dri/i965/brw_queryobj.c            | 6 +++---
>  src/mesa/drivers/dri/i965/brw_state_cache.c         | 4 ++--
>  src/mesa/drivers/dri/i965/intel_batchbuffer.c       | 3 +++
>  src/mesa/drivers/dri/i965/intel_batchbuffer.h       | 8 ++++++++
>  7 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
> index 5809dc6..4b20d36 100644
> --- a/src/mesa/drivers/dri/i965/Makefile.am
> +++ b/src/mesa/drivers/dri/i965/Makefile.am
> @@ -24,6 +24,7 @@
>  include Makefile.sources
>
>  AM_CFLAGS = \
> +        -DNO_RELOC \
>         -I$(top_srcdir)/include \
>         -I$(top_srcdir)/src/ \
>         -I$(top_srcdir)/src/mapi \
> diff --git a/src/mesa/drivers/dri/i965/brw_performance_monitor.c b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
> index edfa3d2..e30c527 100644
> --- a/src/mesa/drivers/dri/i965/brw_performance_monitor.c
> +++ b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
> @@ -1105,13 +1105,13 @@ brw_begin_perf_monitor(struct gl_context *ctx,
>         * wasting memory for contexts that don't use performance monitors.
>         */
>        if (!brw->perfmon.bookend_bo) {
> -         brw->perfmon.bookend_bo = drm_intel_bo_alloc(brw->bufmgr,
> +         brw->perfmon.bookend_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr,
>                                                        "OA bookend BO",
>                                                        BOOKEND_BO_SIZE_BYTES, 64);
>        }
>
>        monitor->oa_bo =
> -         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
> +         drm_intel_bo_alloc_wrapper(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
>  #ifdef DEBUG
>        /* Pre-filling the BO helps debug whether writes landed. */
>        drm_intel_bo_map(monitor->oa_bo, true);
> @@ -1146,7 +1146,7 @@ brw_begin_perf_monitor(struct gl_context *ctx,
>
>     if (monitor_needs_statistics_registers(brw, m)) {
>        monitor->pipeline_stats_bo =
> -         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
> +         drm_intel_bo_alloc_wrapper(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
>
>        /* Take starting snapshots. */
>        snapshot_statistics_registers(brw, monitor, 0);
> diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
> index d782b4f..74ff40c 100644
> --- a/src/mesa/drivers/dri/i965/brw_program.c
> +++ b/src/mesa/drivers/dri/i965/brw_program.c
> @@ -43,6 +43,7 @@
>
>  #include "brw_context.h"
>  #include "brw_wm.h"
> +#include "intel_batchbuffer.h"
>
>  static unsigned
>  get_new_program_id(struct intel_screen *screen)
> @@ -242,7 +243,7 @@ brw_get_scratch_bo(struct brw_context *brw,
>     }
>
>     if (!old_bo) {
> -      *scratch_bo = drm_intel_bo_alloc(brw->bufmgr, "scratch bo", size, 4096);
> +      *scratch_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "scratch bo", size, 4096);
>     }
>  }
>
> @@ -265,7 +266,7 @@ void
>  brw_init_shader_time(struct brw_context *brw)
>  {
>     const int max_entries = 4096;
> -   brw->shader_time.bo = drm_intel_bo_alloc(brw->bufmgr, "shader time",
> +   brw->shader_time.bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "shader time",
>                                              max_entries * SHADER_TIME_STRIDE,
>                                              4096);
>     brw->shader_time.shader_programs = rzalloc_array(brw, struct gl_shader_program *,
> diff --git a/src/mesa/drivers/dri/i965/brw_queryobj.c b/src/mesa/drivers/dri/i965/brw_queryobj.c
> index c053c34..cf5a2a5 100644
> --- a/src/mesa/drivers/dri/i965/brw_queryobj.c
> +++ b/src/mesa/drivers/dri/i965/brw_queryobj.c
> @@ -230,7 +230,7 @@ brw_begin_query(struct gl_context *ctx, struct gl_query_object *q)
>         * the system was doing other work, such as running other applications.
>         */
>        drm_intel_bo_unreference(query->bo);
> -      query->bo = drm_intel_bo_alloc(brw->bufmgr, "timer query", 4096, 4096);
> +      query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "timer query", 4096, 4096);
>        brw_write_timestamp(brw, query->bo, 0);
>        break;
>
> @@ -388,7 +388,7 @@ ensure_bo_has_space(struct gl_context *ctx, struct brw_query_object *query)
>           brw_queryobj_get_results(ctx, query);
>        }
>
> -      query->bo = drm_intel_bo_alloc(brw->bufmgr, "query", 4096, 1);
> +      query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "query", 4096, 1);
>        query->last_index = 0;
>     }
>  }
> @@ -474,7 +474,7 @@ brw_query_counter(struct gl_context *ctx, struct gl_query_object *q)
>     assert(q->Target == GL_TIMESTAMP);
>
>     drm_intel_bo_unreference(query->bo);
> -   query->bo = drm_intel_bo_alloc(brw->bufmgr, "timestamp query", 4096, 4096);
> +   query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "timestamp query", 4096, 4096);
>     brw_write_timestamp(brw, query->bo, 0);
>  }
>
> diff --git a/src/mesa/drivers/dri/i965/brw_state_cache.c b/src/mesa/drivers/dri/i965/brw_state_cache.c
> index b0986ea..daf5a11 100644
> --- a/src/mesa/drivers/dri/i965/brw_state_cache.c
> +++ b/src/mesa/drivers/dri/i965/brw_state_cache.c
> @@ -171,7 +171,7 @@ brw_cache_new_bo(struct brw_cache *cache, uint32_t new_size)
>     struct brw_context *brw = cache->brw;
>     drm_intel_bo *new_bo;
>
> -   new_bo = drm_intel_bo_alloc(brw->bufmgr, "program cache", new_size, 64);
> +   new_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "program cache", new_size, 64);
>
>     /* Copy any existing data that needs to be saved. */
>     if (cache->next_offset != 0) {
> @@ -335,7 +335,7 @@ brw_init_caches(struct brw_context *brw)
>     cache->items =
>        calloc(1, cache->size * sizeof(struct brw_cache_item *));
>
> -   cache->bo = drm_intel_bo_alloc(brw->bufmgr,
> +   cache->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr,
>                                   "program cache",
>                                   4096, 64);
>
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> index 71dc268..50834c2 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> @@ -253,6 +253,9 @@ do_flush_locked(struct brw_context *brw)
>
>     if (!brw->intelScreen->no_hw) {
>        int flags;
> +#ifdef NO_RELOC
> +      flags |= I915_EXEC_NO_RELOC;
> +#endif
>
>        if (brw->gen >= 6 && batch->ring == BLT_RING) {
>           flags = I915_EXEC_BLT;
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.h b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
> index 7bdd836..2670d22 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.h
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
> @@ -11,6 +11,14 @@
>  extern "C" {
>  #endif
>
> +#ifdef NO_RELOC
> + #define drm_intel_bo_alloc_wrapper(bufmgr, name, size, align) \
> +   drm_intel_bo_alloc_prelocated(bufmgr, name, size, 1)
> +#else
> + #define drm_intel_bo_alloc_wrapper drm_intel_bo_alloc
> +#endif
> +
> +
>  /**
>   * Number of bytes to reserve for commands necessary to complete a batch.
>   *
> --
> 2.0.4
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22  7:03     ` Chris Wilson
@ 2014-08-22 13:30       ` Daniel Vetter
  2014-08-22 13:38         ` [Intel-gfx] " Chris Wilson
  0 siblings, 1 reply; 85+ messages in thread
From: Daniel Vetter @ 2014-08-22 13:30 UTC (permalink / raw)
  To: Chris Wilson, Kenneth Graunke, intel-gfx, Ben Widawsky, mesa-dev,
	Anthony Bernecky

On Fri, Aug 22, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > > If a GPU
>> > > client uses only prelocations, the relocation process can be entirely
>> > > skipped. This sounds like a big win initially,
>> >
>> > Close to zero if the client uses existing interfaces.
>> > -Chris
>>
>> Chris,
>>
>> I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.
>
> Indeed, you could do everything inside libdrm with the code I posted 2
> years ago.

I915_EXEC_NO_RELOC can be used to tell the kernel that it doesn't need
to walk all the reloc tables (if nothing moved) because userspace
didn't go insane and reuse reloc trees. So you'd need to implement a
flag + a libdrm function to set that (iirc mesa has been non-stupid
since years). And yeah I kinda expect any new reloc-less thing to get
benchmarked against an implementation using that, since the 48bit
specific thing proposed looks like a fairly short-lived stop-gap, and
since the current no-reloc we already have would work everywhere. And
yeah, I've been poking people to look at this for years, too.
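
A sketch of that existing path: userspace writes its last-known offset
into each exec object and sets the flag, and the kernel then only
processes relocations for objects that actually moved:

    execbuf.flags |= I915_EXEC_NO_RELOC;
    for (i = 0; i < bo_count; i++)
            exec_objects[i].offset = bos[i]->offset64; /* presumed address */
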
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22 13:30       ` Daniel Vetter
@ 2014-08-22 13:38         ` Chris Wilson
  2014-08-22 20:29           ` Daniel Vetter
  2014-08-22 20:38           ` [Intel-gfx] " Daniel Vetter
  0 siblings, 2 replies; 85+ messages in thread
From: Chris Wilson @ 2014-08-22 13:38 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: mesa-dev, intel-gfx, Anthony Bernecky, Kenneth Graunke, Ben Widawsky

On Fri, Aug 22, 2014 at 03:30:12PM +0200, Daniel Vetter wrote:
> On Fri, Aug 22, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> > > If a GPU
> >> > > client uses only prelocations, the relocation process can be entirely
> >> > > skipped. This sounds like a big win initially,
> >> >
> >> > Close to zero if the client uses existing interfaces.
> >> > -Chris
> >>
> >> Chris,
> >>
> >> I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.
> >
> > Indeed, you could do everything inside libdrm with the code I posted 2
> > years ago.
> 
> I915_EXEC_NO_RELOC can be used to tell the kernel that it doesn't need
> to walk all the reloc tables (if nothing moved) because userspace
> didn't go insane and reuse reloc trees. So you'd need to implement a
> flag + a libdrm function to set that (iirc mesa has been non-stupid
> since years). And yeah I kinda expect any new reloc-less thing to get
> benchmarked against an implementation using that, since the 48bit
> specific thing proposed looks like a fairly short-lived stop-gap, and
> since the current no-reloc we already have would work everywhere. And
> yeah I've been poking people to look at this for years. too.

Here, I was referring to soft-pinning. The API here essentially
comprises two parts:

1: a pin into the vm upon creation
2: implicit no-relocation upon execbuffer

By making those two steps independent, the API, as I see it, is more
flexible and powerful.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] i965: First step toward prelocation
  2014-08-22 12:15   ` [Mesa-dev] " Alex Deucher
@ 2014-08-22 17:14     ` Ben Widawsky
  0 siblings, 0 replies; 85+ messages in thread
From: Ben Widawsky @ 2014-08-22 17:14 UTC (permalink / raw)
  To: Alex Deucher; +Cc: mesa-dev, Intel GFX

On Fri, Aug 22, 2014 at 08:15:28AM -0400, Alex Deucher wrote:
> On Thu, Aug 21, 2014 at 11:12 PM, Ben Widawsky
> <benjamin.widawsky@intel.com> wrote:
> > This was a quick proof of concept to show the new API for prelocating
> > buffers.
> >
> 
> What are prelocated buffers?

http://lists.freedesktop.org/archives/mesa-dev/2014-August/066432.html

> 
> Alex
> 
> > It needs way more testing, to not ifdef the no-relocs, and to do a
> > libdrm ABI dep bump.
> > ---
> >  src/mesa/drivers/dri/i965/Makefile.am               | 1 +
> >  src/mesa/drivers/dri/i965/brw_performance_monitor.c | 6 +++---
> >  src/mesa/drivers/dri/i965/brw_program.c             | 5 +++--
> >  src/mesa/drivers/dri/i965/brw_queryobj.c            | 6 +++---
> >  src/mesa/drivers/dri/i965/brw_state_cache.c         | 4 ++--
> >  src/mesa/drivers/dri/i965/intel_batchbuffer.c       | 3 +++
> >  src/mesa/drivers/dri/i965/intel_batchbuffer.h       | 8 ++++++++
> >  7 files changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
> > index 5809dc6..4b20d36 100644
> > --- a/src/mesa/drivers/dri/i965/Makefile.am
> > +++ b/src/mesa/drivers/dri/i965/Makefile.am
> > @@ -24,6 +24,7 @@
> >  include Makefile.sources
> >
> >  AM_CFLAGS = \
> > +        -DNO_RELOC \
> >         -I$(top_srcdir)/include \
> >         -I$(top_srcdir)/src/ \
> >         -I$(top_srcdir)/src/mapi \
> > diff --git a/src/mesa/drivers/dri/i965/brw_performance_monitor.c b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
> > index edfa3d2..e30c527 100644
> > --- a/src/mesa/drivers/dri/i965/brw_performance_monitor.c
> > +++ b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
> > @@ -1105,13 +1105,13 @@ brw_begin_perf_monitor(struct gl_context *ctx,
> >         * wasting memory for contexts that don't use performance monitors.
> >         */
> >        if (!brw->perfmon.bookend_bo) {
> > -         brw->perfmon.bookend_bo = drm_intel_bo_alloc(brw->bufmgr,
> > +         brw->perfmon.bookend_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr,
> >                                                        "OA bookend BO",
> >                                                        BOOKEND_BO_SIZE_BYTES, 64);
> >        }
> >
> >        monitor->oa_bo =
> > -         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
> > +         drm_intel_bo_alloc_wrapper(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
> >  #ifdef DEBUG
> >        /* Pre-filling the BO helps debug whether writes landed. */
> >        drm_intel_bo_map(monitor->oa_bo, true);
> > @@ -1146,7 +1146,7 @@ brw_begin_perf_monitor(struct gl_context *ctx,
> >
> >     if (monitor_needs_statistics_registers(brw, m)) {
> >        monitor->pipeline_stats_bo =
> > -         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
> > +         drm_intel_bo_alloc_wrapper(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
> >
> >        /* Take starting snapshots. */
> >        snapshot_statistics_registers(brw, monitor, 0);
> > diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
> > index d782b4f..74ff40c 100644
> > --- a/src/mesa/drivers/dri/i965/brw_program.c
> > +++ b/src/mesa/drivers/dri/i965/brw_program.c
> > @@ -43,6 +43,7 @@
> >
> >  #include "brw_context.h"
> >  #include "brw_wm.h"
> > +#include "intel_batchbuffer.h"
> >
> >  static unsigned
> >  get_new_program_id(struct intel_screen *screen)
> > @@ -242,7 +243,7 @@ brw_get_scratch_bo(struct brw_context *brw,
> >     }
> >
> >     if (!old_bo) {
> > -      *scratch_bo = drm_intel_bo_alloc(brw->bufmgr, "scratch bo", size, 4096);
> > +      *scratch_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "scratch bo", size, 4096);
> >     }
> >  }
> >
> > @@ -265,7 +266,7 @@ void
> >  brw_init_shader_time(struct brw_context *brw)
> >  {
> >     const int max_entries = 4096;
> > -   brw->shader_time.bo = drm_intel_bo_alloc(brw->bufmgr, "shader time",
> > +   brw->shader_time.bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "shader time",
> >                                              max_entries * SHADER_TIME_STRIDE,
> >                                              4096);
> >     brw->shader_time.shader_programs = rzalloc_array(brw, struct gl_shader_program *,
> > diff --git a/src/mesa/drivers/dri/i965/brw_queryobj.c b/src/mesa/drivers/dri/i965/brw_queryobj.c
> > index c053c34..cf5a2a5 100644
> > --- a/src/mesa/drivers/dri/i965/brw_queryobj.c
> > +++ b/src/mesa/drivers/dri/i965/brw_queryobj.c
> > @@ -230,7 +230,7 @@ brw_begin_query(struct gl_context *ctx, struct gl_query_object *q)
> >         * the system was doing other work, such as running other applications.
> >         */
> >        drm_intel_bo_unreference(query->bo);
> > -      query->bo = drm_intel_bo_alloc(brw->bufmgr, "timer query", 4096, 4096);
> > +      query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "timer query", 4096, 4096);
> >        brw_write_timestamp(brw, query->bo, 0);
> >        break;
> >
> > @@ -388,7 +388,7 @@ ensure_bo_has_space(struct gl_context *ctx, struct brw_query_object *query)
> >           brw_queryobj_get_results(ctx, query);
> >        }
> >
> > -      query->bo = drm_intel_bo_alloc(brw->bufmgr, "query", 4096, 1);
> > +      query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "query", 4096, 1);
> >        query->last_index = 0;
> >     }
> >  }
> > @@ -474,7 +474,7 @@ brw_query_counter(struct gl_context *ctx, struct gl_query_object *q)
> >     assert(q->Target == GL_TIMESTAMP);
> >
> >     drm_intel_bo_unreference(query->bo);
> > -   query->bo = drm_intel_bo_alloc(brw->bufmgr, "timestamp query", 4096, 4096);
> > +   query->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "timestamp query", 4096, 4096);
> >     brw_write_timestamp(brw, query->bo, 0);
> >  }
> >
> > diff --git a/src/mesa/drivers/dri/i965/brw_state_cache.c b/src/mesa/drivers/dri/i965/brw_state_cache.c
> > index b0986ea..daf5a11 100644
> > --- a/src/mesa/drivers/dri/i965/brw_state_cache.c
> > +++ b/src/mesa/drivers/dri/i965/brw_state_cache.c
> > @@ -171,7 +171,7 @@ brw_cache_new_bo(struct brw_cache *cache, uint32_t new_size)
> >     struct brw_context *brw = cache->brw;
> >     drm_intel_bo *new_bo;
> >
> > -   new_bo = drm_intel_bo_alloc(brw->bufmgr, "program cache", new_size, 64);
> > +   new_bo = drm_intel_bo_alloc_wrapper(brw->bufmgr, "program cache", new_size, 64);
> >
> >     /* Copy any existing data that needs to be saved. */
> >     if (cache->next_offset != 0) {
> > @@ -335,7 +335,7 @@ brw_init_caches(struct brw_context *brw)
> >     cache->items =
> >        calloc(1, cache->size * sizeof(struct brw_cache_item *));
> >
> > -   cache->bo = drm_intel_bo_alloc(brw->bufmgr,
> > +   cache->bo = drm_intel_bo_alloc_wrapper(brw->bufmgr,
> >                                   "program cache",
> >                                   4096, 64);
> >
> > diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > index 71dc268..50834c2 100644
> > --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > @@ -253,6 +253,9 @@ do_flush_locked(struct brw_context *brw)
> >
> >     if (!brw->intelScreen->no_hw) {
> >        int flags;
> > +#ifdef NO_RELOC
> > +      flags |= I915_EXEC_NO_RELOC;
> > +#endif
> >
> >        if (brw->gen >= 6 && batch->ring == BLT_RING) {
> >           flags = I915_EXEC_BLT;
> > diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.h b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
> > index 7bdd836..2670d22 100644
> > --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.h
> > +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
> > @@ -11,6 +11,14 @@
> >  extern "C" {
> >  #endif
> >
> > +#ifdef NO_RELOC
> > + #define drm_intel_bo_alloc_wrapper(bufmgr, name, size, align) \
> > +   drm_intel_bo_alloc_prelocated(bufmgr, name, size, 1)
> > +#else
> > + #define drm_intel_bo_alloc_wrapper drm_intel_bo_alloc
> > +#endif
> > +
> > +
> >  /**
> >   * Number of bytes to reserve for commands necessary to complete a batch.
> >   *
> > --
> > 2.0.4
> >
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev

-- 
Ben Widawsky, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22 13:38         ` [Intel-gfx] " Chris Wilson
@ 2014-08-22 20:29           ` Daniel Vetter
  2014-08-22 20:38           ` [Intel-gfx] " Daniel Vetter
  1 sibling, 0 replies; 85+ messages in thread
From: Daniel Vetter @ 2014-08-22 20:29 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Kenneth Graunke, intel-gfx,
	Ben Widawsky, mesa-dev, Anthony Bernecky

On Fri, Aug 22, 2014 at 3:38 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Fri, Aug 22, 2014 at 03:30:12PM +0200, Daniel Vetter wrote:
>> On Fri, Aug 22, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> >> > > If a GPU
>> >> > > client uses only prelocations, the relocation process can be entirely
>> >> > > skipped. This sounds like a big win initially,
>> >> >
>> >> > Close to zero if the client uses existing interfaces.
>> >> > -Chris
>> >>
>> >> Chris,
>> >>
>> >> I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.
>> >
>> > Indeed, you could do everything inside libdrm with the code I posted 2
>> > years ago.
>>
>> I915_EXEC_NO_RELOC can be used to tell the kernel that it doesn't need
>> to walk all the reloc tables (if nothing moved) because userspace
>> didn't go insane and reuse reloc trees. So you'd need to implement a
>> flag + a libdrm function to set that (iirc mesa has been non-stupid
>> for years). And yeah I kinda expect any new reloc-less thing to get
>> benchmarked against an implementation using that, since the 48bit
>> specific thing proposed looks like a fairly short-lived stop-gap, and
>> since the current no-reloc we already have would work everywhere. And
>> yeah I've been poking people to look at this for years, too.
>
> Here, I was referring to soft-pinning. The API here is essentially
> comprised of two parts:
>
> 1: a pin into the vm upon creation
> 2: implicit no-relocation upon execbuffer
>
> By making those two steps independent, the API as I see it is more
> flexible and powerful.

Well I admit to not having read the patches over the terrible wifi
here, but I presumed Ben's patches


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22 13:38         ` [Intel-gfx] " Chris Wilson
  2014-08-22 20:29           ` Daniel Vetter
@ 2014-08-22 20:38           ` Daniel Vetter
  2014-08-25 22:42             ` Jesse Barnes
  1 sibling, 1 reply; 85+ messages in thread
From: Daniel Vetter @ 2014-08-22 20:38 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Kenneth Graunke, intel-gfx,
	Ben Widawsky, mesa-dev, Anthony Bernecky

On Fri, Aug 22, 2014 at 3:38 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Fri, Aug 22, 2014 at 03:30:12PM +0200, Daniel Vetter wrote:
>> On Fri, Aug 22, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> >> > > If a GPU
>> >> > > client uses only prelocations, the relocation process can be entirely
>> >> > > skipped. This sounds like a big win initially,
>> >> >
>> >> > Close to zero if the client uses existing interfaces.
>> >> > -Chris
>> >>
>> >> Chris,
>> >>
>> >> I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.
>> >
>> > Indeed, you could do everything inside libdrm with the code I posted 2
>> > years ago.
>>
>> I915_EXEC_NO_RELOC can be used to tell the kernel that it doesn't need
>> to walk all the reloc tables (if nothing moved) because userspace
>> didn't go insane and reuse reloc trees. So you'd need to implement a
>> flag + a libdrm function to set that (iirc mesa has been non-stupid
>> for years). And yeah I kinda expect any new reloc-less thing to get
>> benchmarked against an implementation using that, since the 48bit
>> specific thing proposed looks like a fairly short-lived stop-gap, and
>> since the current no-reloc we already have would work everywhere. And
>> yeah I've been poking people to look at this for years, too.
>
> Here, I was referring to soft-pinning. The API here is essentially
> comprised of two parts:
>
> 1: a pin into the vm upon creation
> 2: implicit no-relocation upon execbuffer
>
> By making those two steps independent, the API as I see it is more
> flexible and powerful.

Well I admit to not having read the patches over the terrible wifi
here, but I presumed Ben's patches did implement softpin. I guess I've
made a mess of all of this now. In any case I still want to see
relative improvements over what we have since the prelocated stuff
looks like a gen8 oneshot. And we still can't do relocation-less
execbuf because the gpu can't fault, so I'm not sure at all whether
this is actually useful for opencl 2.0.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs)
  2014-08-22 20:38           ` [Intel-gfx] " Daniel Vetter
@ 2014-08-25 22:42             ` Jesse Barnes
  0 siblings, 0 replies; 85+ messages in thread
From: Jesse Barnes @ 2014-08-25 22:42 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Kenneth Graunke, intel-gfx, Anthony Bernecky, Ben Widawsky, mesa-dev

On Fri, 22 Aug 2014 22:38:22 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Fri, Aug 22, 2014 at 3:38 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Fri, Aug 22, 2014 at 03:30:12PM +0200, Daniel Vetter wrote:
> >> On Fri, Aug 22, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> >> > > If a GPU
> >> >> > > client uses only prelocations, the relocation process can be entirely
> >> >> > > skipped. This sounds like a big win initially,
> >> >> >
> >> >> > Close to zero if the client uses existing interfaces.
> >> >> > -Chris
> >> >>
> >> >> Chris,
> >> >>
> >> >> I don't know if you've seen Ben's libdrm and Mesa patches, but with a few patches to libdrm and virtually zero Mesa changes, he's apparently eliminated our need to do any relocations for the 3D driver.  It wasn't invasive at all---I was surprised.
> >> >
> >> > Indeed, you could do everything inside libdrm with the code I posted 2
> >> > years ago.
> >>
> >> I915_EXEC_NO_RELOC can be used to tell the kernel that it doesn't need
> >> to walk all the reloc tables (if nothing moved) because userspace
> >> didn't go insane and reuse reloc trees. So you'd need to implement a
> >> flag + a libdrm function to set that (iirc mesa has been non-stupid
> >> for years). And yeah I kinda expect any new reloc-less thing to get
> >> benchmarked against an implementation using that, since the 48bit
> >> specific thing proposed looks like a fairly short-lived stop-gap, and
> >> since the current no-reloc we already have would work everywhere. And
> >> yeah I've been poking people to look at this for years, too.
> >
> > Here, I was referring to soft-pinning. The API here is essentially
> > comprised of two parts:
> >
> > 1: a pin into the vm upon creation
> > 2: implicit no-relocation upon execbuffer
> >
> > By making those two steps independent, the API as I see it is more
> > flexible and powerful.
> 
> Well I admit to not having read the patches over the terrible wifi
> here, but I presumed Ben's patches did implement softpin. I guess I've
> made a mess of all of this now. In any case I still want to see
> relative improvements over what we have since the prelocated stuff
> looks like a gen8 oneshot. And we still can't do relocation-less
> execbuf because the gpu can't fault, so I'm not sure at all whether
> this is actually useful for opencl 2.0.

It is.  OCL 2.0 has two modes of operation: bufferless and buffered.
Both modes require CPU/GPU pointer sharing, but in the latter case
for us the kernel GPU driver will be involved in all allocations.
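
Purely as an illustration (a minimal coarse-grained SVM sketch using
the standard CL 2.0 entry points; the context, queue, and kernel
setup is assumed), the pointer sharing looks roughly like:

#include <CL/cl.h>

/* Sketch: ctx, queue, and kernel are assumed to exist already. */
static void svm_roundtrip(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel)
{
        size_t n = 4096, gws = 4096;
        float *data = clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                 n * sizeof(float), 0);

        /* Host access to a coarse-grained SVM allocation is bracketed
         * by map/unmap, but no copy happens. */
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data,
                        n * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < n; i++)
                data[i] = (float)i;
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        /* The same pointer value is what the GPU dereferences, which
         * is why the kernel GPU driver must control where the buffer
         * lands in the GPU address space. */
        clSetKernelArgSVMPointer(kernel, 0, data);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
        clFinish(queue);
        clSVMFree(ctx, data);
}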

I'm not sure whether this is BDW only either, so don't shoot it down or
discount it based on that.

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw"
  2014-08-22  3:11 ` [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw" Ben Widawsky
@ 2014-10-31 19:45   ` Rodrigo Vivi
  2014-10-31 21:10     ` Rodrigo Vivi
  0 siblings, 1 reply; 85+ messages in thread
From: Rodrigo Vivi @ 2014-10-31 19:45 UTC (permalink / raw)
  To: Ben Widawsky, Gavin Hindman-Intel; +Cc: Intel GFX

I'm wondering how many hangs we could have fixed with this patch.

Although it can end up in worse RC6 residency, this is what the spec shows.
From our experience with SNB it is always better to stay with the spec,
mainly on threshold values.

So,

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


On Thu, Aug 21, 2014 at 8:11 PM, Ben Widawsky
<benjamin.widawsky@intel.com> wrote:
> This reverts commit 0d68b25e9ceb344fe2f93373b1c0311d33814265.
>
> At one time I bisected reset breakage to this patch by using a mesa that is
> guaranteed to generate a hang when using the fs, and then running the following
> test case:
>
> ./bin/shader_runner  tests/shaders/glsl-algebraic-add-zero.shader_test -auto
> ./bin/shader_runner  tests/shaders/glsl-fs-texture2d.shader_test -auto
> ./bin/shader_runner  tests/shaders/glsl-algebraic-add-zero.shader_test -auto
>
> The symptom is that after the first GPU hang, all subsequent commands
> will hang the GPU.
>
> Oddly at some point I believe this revert stopped fixing the issue, but I am
> leaving it in the series to minimize variables.
>
> ---
>  drivers/gpu/drm/i915/intel_pm.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index 41de760..3255c10 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -3658,23 +3658,15 @@ static void gen8_enable_rps(struct drm_device *dev)
>         for_each_ring(ring, dev_priv, unused)
>                 I915_WRITE(RING_MAX_IDLE(ring->mmio_base), 10);
>         I915_WRITE(GEN6_RC_SLEEP, 0);
> -       if (IS_BROADWELL(dev))
> -               I915_WRITE(GEN6_RC6_THRESHOLD, 625); /* 800us/1.28 for TO */
> -       else
> -               I915_WRITE(GEN6_RC6_THRESHOLD, 50000); /* 50/125ms per EI */
> +       I915_WRITE(GEN6_RC6_THRESHOLD, 50000); /* 50/125ms per EI */
>
>         /* 3: Enable RC6 */
>         if (intel_enable_rc6(dev) & INTEL_RC6_ENABLE)
>                 rc6_mask = GEN6_RC_CTL_RC6_ENABLE;
>         intel_print_rc6_info(dev, rc6_mask);
> -       if (IS_BROADWELL(dev))
> -               I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
> -                               GEN7_RC_CTL_TO_MODE |
> -                               rc6_mask);
> -       else
> -               I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
> -                               GEN6_RC_CTL_EI_MODE(1) |
> -                               rc6_mask);
> +       I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
> +                                   GEN6_RC_CTL_EI_MODE(1) |
> +                                   rc6_mask);
>
>         /* 4 Program defaults and thresholds for RPS*/
>         I915_WRITE(GEN6_RPNSWREQ,
> --
> 2.0.4
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx



-- 
Rodrigo Vivi
Blog: http://blog.vivi.eng.br
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw"
  2014-10-31 19:45   ` Rodrigo Vivi
@ 2014-10-31 21:10     ` Rodrigo Vivi
  0 siblings, 0 replies; 85+ messages in thread
From: Rodrigo Vivi @ 2014-10-31 21:10 UTC (permalink / raw)
  To: Ben Widawsky, Gavin Hindman-Intel; +Cc: Intel GFX

Please ignore my comment and this revert. It seems the original version
that is currently applied is a needed workaround.

On Fri, Oct 31, 2014 at 12:45 PM, Rodrigo Vivi <rodrigo.vivi@gmail.com> wrote:
> I'm wondering how many hangs we could have fixed with this patch.
>
> Although it can end up in worse RC6 residency, this is what the spec shows.
> From our experience with SNB it is always better to stay with the spec,
> mainly on threshold values.
>
> So,
>
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
>
>
> On Thu, Aug 21, 2014 at 8:11 PM, Ben Widawsky
> <benjamin.widawsky@intel.com> wrote:
>> This reverts commit 0d68b25e9ceb344fe2f93373b1c0311d33814265.
>>
>> At one time I bisected reset breakage to this patch by using a mesa that is
>> guaranteed to generate a hang when using the fs, and then running the following
>> test case:
>>
>> ./bin/shader_runner  tests/shaders/glsl-algebraic-add-zero.shader_test -auto
>> ./bin/shader_runner  tests/shaders/glsl-fs-texture2d.shader_test -auto
>> ./bin/shader_runner  tests/shaders/glsl-algebraic-add-zero.shader_test -auto
>>
>> The symptom is that after the first GPU hang, all subsequent commands
>> will hang the GPU.
>>
>> Oddly at some point I believe this revert stopped fixing the issue, but I am
>> leaving it in the series to minimize variables.
>>
>> ---
>>  drivers/gpu/drm/i915/intel_pm.c | 16 ++++------------
>>  1 file changed, 4 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
>> index 41de760..3255c10 100644
>> --- a/drivers/gpu/drm/i915/intel_pm.c
>> +++ b/drivers/gpu/drm/i915/intel_pm.c
>> @@ -3658,23 +3658,15 @@ static void gen8_enable_rps(struct drm_device *dev)
>>         for_each_ring(ring, dev_priv, unused)
>>                 I915_WRITE(RING_MAX_IDLE(ring->mmio_base), 10);
>>         I915_WRITE(GEN6_RC_SLEEP, 0);
>> -       if (IS_BROADWELL(dev))
>> -               I915_WRITE(GEN6_RC6_THRESHOLD, 625); /* 800us/1.28 for TO */
>> -       else
>> -               I915_WRITE(GEN6_RC6_THRESHOLD, 50000); /* 50/125ms per EI */
>> +       I915_WRITE(GEN6_RC6_THRESHOLD, 50000); /* 50/125ms per EI */
>>
>>         /* 3: Enable RC6 */
>>         if (intel_enable_rc6(dev) & INTEL_RC6_ENABLE)
>>                 rc6_mask = GEN6_RC_CTL_RC6_ENABLE;
>>         intel_print_rc6_info(dev, rc6_mask);
>> -       if (IS_BROADWELL(dev))
>> -               I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
>> -                               GEN7_RC_CTL_TO_MODE |
>> -                               rc6_mask);
>> -       else
>> -               I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
>> -                               GEN6_RC_CTL_EI_MODE(1) |
>> -                               rc6_mask);
>> +       I915_WRITE(GEN6_RC_CONTROL, GEN6_RC_CTL_HW_ENABLE |
>> +                                   GEN6_RC_CTL_EI_MODE(1) |
>> +                                   rc6_mask);
>>
>>         /* 4 Program defaults and thresholds for RPS*/
>>         I915_WRITE(GEN6_RPNSWREQ,
>> --
>> 2.0.4
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
>
>
> --
> Rodrigo Vivi
> Blog: http://blog.vivi.eng.br



-- 
Rodrigo Vivi
Blog: http://blog.vivi.eng.br
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2014-10-31 21:10 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-22  3:11 [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Ben Widawsky
2014-08-22  3:11 ` [PATCH 01/68] drm/i915: Split up do_switch Ben Widawsky
2014-08-22  3:11 ` [PATCH 02/68] drm/i915: Extract l3 remapping out of ctx switch Ben Widawsky
2014-08-22  3:11 ` [PATCH 03/68] drm/i915/ppgtt: Load address space after mi_set_context Ben Widawsky
2014-08-22  3:11 ` [PATCH 04/68] drm/i915: Fix another another use-after-free in do_switch Ben Widawsky
2014-08-22  3:11 ` [PATCH 05/68] drm/i915/ctx: Return earlier on failure Ben Widawsky
2014-08-22  3:11 ` [PATCH 06/68] drm/i915/error: vma error capture prettyify Ben Widawsky
2014-08-22  3:11 ` [PATCH 07/68] drm/i915/error: Do a better job of disambiguating VMAs Ben Widawsky
2014-08-22  3:11 ` [PATCH 08/68] drm/i915/error: Capture vmas instead of BOs Ben Widawsky
2014-08-22  3:11 ` [PATCH 09/68] drm/i915: Add some extra guards in evict_vm Ben Widawsky
2014-08-22  3:11 ` [PATCH 10/68] drm/i915: Make an uninterruptible evict Ben Widawsky
2014-08-22  3:11 ` [PATCH 11/68] drm/i915: More correct (slower) ppgtt cleanup Ben Widawsky
2014-08-22  3:11 ` [PATCH 12/68] drm/i915: Defer PPGTT cleanup Ben Widawsky
2014-08-22  3:11 ` [PATCH 13/68] drm/i915/bdw: Enable full PPGTT Ben Widawsky
2014-08-22  3:11 ` [PATCH 14/68] drm/i915: Get the error state over the wire (HACKish) Ben Widawsky
2014-08-22  3:11 ` [PATCH 15/68] drm/i915/gen8: Invalidate TLBs before PDP reload Ben Widawsky
2014-08-22  3:11 ` [PATCH 16/68] drm/i915: Remove false assertion in ppgtt_release Ben Widawsky
2014-08-22  3:11 ` [PATCH 17/68] Revert "drm/i915/bdw: Use timeout mode for RC6 on bdw" Ben Widawsky
2014-10-31 19:45   ` Rodrigo Vivi
2014-10-31 21:10     ` Rodrigo Vivi
2014-08-22  3:11 ` [PATCH 18/68] drm/i915/trace: Fix offsets for 64b Ben Widawsky
2014-08-22  3:11 ` [PATCH 19/68] drm/i915: Wrap VMA binding Ben Widawsky
2014-08-22  3:11 ` [PATCH 20/68] drm/i915: Make pin global flags explicit Ben Widawsky
2014-08-22  3:11 ` [PATCH 21/68] drm/i915: Split out aliasing binds Ben Widawsky
2014-08-22  3:11 ` [PATCH 22/68] drm/i915: fix gtt_total_entries() Ben Widawsky
2014-08-22  3:11 ` [PATCH 23/68] drm/i915: Rename to GEN8_LEGACY_PDPES Ben Widawsky
2014-08-22  3:11 ` [PATCH 24/68] drm/i915: Split out verbose PPGTT dumping Ben Widawsky
2014-08-22  3:11 ` [PATCH 25/68] drm/i915: s/pd/pdpe, s/pt/pde Ben Widawsky
2014-08-22  3:11 ` [PATCH 26/68] drm/i915: rename map/unmap to dma_map/unmap Ben Widawsky
2014-08-22  3:11 ` [PATCH 27/68] drm/i915: Setup less PPGTT on failed pagedir Ben Widawsky
2014-08-22  3:11 ` [PATCH 28/68] drm/i915: clean up PPGTT init error path Ben Widawsky
2014-08-22  3:11 ` [PATCH 29/68] drm/i915: Un-hardcode number of page directories Ben Widawsky
2014-08-22  3:11 ` [PATCH 30/68] drm/i915: Make gen6_write_pdes gen6_map_page_tables Ben Widawsky
2014-08-22  3:11 ` [PATCH 31/68] drm/i915: Range clearing is PPGTT agnostic Ben Widawsky
2014-08-22  3:11 ` [PATCH 32/68] drm/i915: Page table helpers, and define renames Ben Widawsky
2014-08-22  3:11 ` [PATCH 33/68] drm/i915: construct page table abstractions Ben Widawsky
2014-08-22  3:11 ` [PATCH 34/68] drm/i915: Complete page table structures Ben Widawsky
2014-08-22  3:11 ` [PATCH 35/68] drm/i915: Create page table allocators Ben Widawsky
2014-08-22  3:11 ` [PATCH 36/68] drm/i915: Generalize GEN6 mapping Ben Widawsky
2014-08-22  3:12 ` [PATCH 37/68] drm/i915: Clean up pagetable DMA map & unmap Ben Widawsky
2014-08-22  3:12 ` [PATCH 38/68] drm/i915: Always dma map page table allocations Ben Widawsky
2014-08-22  3:12 ` [PATCH 39/68] drm/i915: Consolidate dma mappings Ben Widawsky
2014-08-22  3:12 ` [PATCH 40/68] drm/i915: Always dma map page directory allocations Ben Widawsky
2014-08-22  3:12 ` [PATCH 41/68] drm/i915: Track GEN6 page table usage Ben Widawsky
2014-08-22  3:12 ` [PATCH 42/68] drm/i915: Extract context switch skip logic Ben Widawsky
2014-08-22  3:12 ` [PATCH 43/68] drm/i915: Track page table reload need Ben Widawsky
2014-08-22  3:12 ` [PATCH 44/68] drm/i915: Initialize all contexts Ben Widawsky
2014-08-22  3:12 ` [PATCH 45/68] drm/i915: Finish gen6/7 dynamic page table allocation Ben Widawsky
2014-08-22  3:12 ` [PATCH 46/68] drm/i915/bdw: Use dynamic allocation idioms on free Ben Widawsky
2014-08-22  3:12 ` [PATCH 47/68] drm/i915/bdw: pagedirs rework allocation Ben Widawsky
2014-08-22  3:12 ` [PATCH 48/68] drm/i915/bdw: pagetable allocation rework Ben Widawsky
2014-08-22  3:12 ` [PATCH 49/68] drm/i915/bdw: Make the pdp switch a bit less hacky Ben Widawsky
2014-08-22  3:12 ` [PATCH 50/68] drm/i915: num_pd_pages/num_pd_entries isn't useful Ben Widawsky
2014-08-22  3:12 ` [PATCH 51/68] drm/i915: Extract PPGTT param from pagedir alloc Ben Widawsky
2014-08-22  3:12 ` [PATCH 52/68] drm/i915/bdw: Split out mappings Ben Widawsky
2014-08-22  3:12 ` [PATCH 53/68] drm/i915/bdw: begin bitmap tracking Ben Widawsky
2014-08-22  3:12 ` [PATCH 54/68] drm/i915/bdw: Dynamic page table allocations Ben Widawsky
2014-08-22  3:12 ` [PATCH 55/68] drm/i915/bdw: Make pdp allocation more dynamic Ben Widawsky
2014-08-22  3:12 ` [PATCH 56/68] drm/i915/bdw: Abstract PDP usage Ben Widawsky
2014-08-22  3:12 ` [PATCH 57/68] drm/i915/bdw: Add dynamic page trace events Ben Widawsky
2014-08-22  3:12 ` [PATCH 58/68] drm/i915/bdw: Add ppgtt info for dynamic pages Ben Widawsky
2014-08-22  3:12 ` [PATCH 59/68] drm/i915/bdw: implement alloc/teardown for 4lvl Ben Widawsky
2014-08-22  3:12 ` [PATCH 60/68] drm/i915/bdw: Add 4 level switching infrastructure Ben Widawsky
2014-08-22  3:12 ` [PATCH 61/68] drm/i915/bdw: Generalize PTE writing for GEN8 PPGTT Ben Widawsky
2014-08-22  3:12 ` [PATCH 62/68] drm/i915: Plumb sg_iter through va allocation ->maps Ben Widawsky
2014-08-22  3:12 ` [PATCH 63/68] drm/i915: Introduce map and unmap for VMAs Ben Widawsky
2014-08-22  3:12 ` [PATCH 64/68] drm/i915: Depend exclusively on map and unmap_vma Ben Widawsky
2014-08-22  3:12 ` [PATCH 65/68] drm/i915: Expand error state's address width to 64b Ben Widawsky
2014-08-22  3:12 ` [PATCH 66/68] drm/i915/bdw: Flip the 48b switch Ben Widawsky
2014-08-22  3:12 ` [PATCH 67/68] drm/i915: Provide a soft_pin hook Ben Widawsky
2014-08-22  3:12 ` [PATCH 68/68] XXX: drm/i915: Unexplained workarounds Ben Widawsky
2014-08-22  3:12 ` [PATCH 1/2] intel: Split out bo allocation Ben Widawsky
2014-08-22  3:12 ` [PATCH 2/2] intel: Add prelocation support Ben Widawsky
2014-08-22  3:12 ` [PATCH] i965: First step toward prelocation Ben Widawsky
2014-08-22 12:15   ` [Mesa-dev] " Alex Deucher
2014-08-22 17:14     ` Ben Widawsky
2014-08-22  3:12 ` [PATCH] no_reloc: test case Ben Widawsky
2014-08-22  6:30 ` [Intel-gfx] [PATCH 00/68] Broadwell 48b addressing and prelocations (no relocs) Chris Wilson
2014-08-22  6:59   ` Kenneth Graunke
2014-08-22  7:03     ` Chris Wilson
2014-08-22 13:30       ` Daniel Vetter
2014-08-22 13:38         ` [Intel-gfx] " Chris Wilson
2014-08-22 20:29           ` Daniel Vetter
2014-08-22 20:38           ` [Intel-gfx] " Daniel Vetter
2014-08-25 22:42             ` Jesse Barnes
