* [PATCH 00/15] Habana Labs kernel driver
@ 2019-01-23  0:00 Oded Gabbay
  2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
                   ` (16 more replies)
  0 siblings, 17 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC
  To: gregkh, linux-kernel; +Cc: ogabbay

Hello,

For those who don't know me, my name is Oded Gabbay (kernel maintainer
for AMD's amdkfd driver, previously at Red Hat's Desktop group) and I have
worked at Habana Labs since its inception two and a half years ago.

Habana is a leading startup in the emerging AI processor space and we have
already started production of our first Goya inference processor PCIe card
and delivered it to customers. The Goya processor silicon has been tested
since June of 2018 and is production-qualified by now. The Gaudi training
processor solution is slated to sample in the second quarter of 2019.

This patch-set contains the kernel driver for Habana's AI Processors
(AIP), which are designed to accelerate Deep Learning inference and
training workloads. The current version supports only the Goya processor;
support for Gaudi will be upstreamed once the ASIC is available to
customers.

The Goya processor has been designed from the ground up for deep learning
inference workloads. It comprises a cluster of eight fully programmable
Tensor Processing Cores (TPC). The TPC core is a VLIW SIMD vector
processor with ISA and hardware that was tailored to serve deep learning
workloads efficiently. 

In addition, Goya contains software-managed, on-die memory along with five
separate DMA channels, a PCIe Gen4 x16 system interface and 4/8/16GB of
DDR4 memory.

Goya has three 64-bit PCI BARs, which are not exposed to user-space. They
map the on-chip memory and configuration space (BAR 0-1), the MSI-X table
(BAR 2-3) and the DDR4 memory (BAR 4-5).
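
The "0-1", "2-3" and "4-5" numbering follows from the fact that a 64-bit
BAR occupies two consecutive 32-bit BAR register slots in PCI
configuration space. A minimal user-space sketch of that mapping is shown
here; the enum and function names are hypothetical and are not taken from
the driver.

```c
/* Regions named in the text above; the identifiers are illustrative only. */
enum goya_region {
	GOYA_REGION_SRAM_CFG,	/* on-chip memory and configuration space */
	GOYA_REGION_MSIX,	/* MSI-X table */
	GOYA_REGION_DDR,	/* DDR4 memory */
	GOYA_REGION_INVALID
};

/*
 * Map a 32-bit BAR register slot (0..5) to the region it belongs to.
 * Each 64-bit BAR uses two consecutive slots, so slots 2n and 2n+1
 * describe the same region.
 */
enum goya_region goya_bar_slot_to_region(int slot)
{
	if (slot < 0 || slot > 5)
		return GOYA_REGION_INVALID;

	return (enum goya_region)(slot / 2);
}
```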

Each TPC engine and DMA channel has a H/W queue attached to it, called
QMAN. The S/W provides command buffers to the H/W queues (through the
kernel driver) and the H/W consumes the command buffers. To prevent
malicious users from stealing data from other users through the Host or
Device memory, Goya has an internal MMU and a security protection scheme.
In addition, the kernel driver parses the command buffer and rejects it if
it contains disallowed commands.

The QMANs are triggered by a write to a PI (producer index) register. The
QMAN H/W logic maintains a CI (consumer index) register. When PI==CI, the
queue is empty. When PI+1==CI, the queue is full (the queue is cyclic, so
the comparison wraps around at the queue length). Each entry in the H/W
queue is 16 bytes and contains the pointer and length of a variable-size
command buffer, which the user fills with specific commands that the H/W
logic can read and execute.
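
The PI/CI rule above can be sketched in plain user-space C as follows.
This is only an illustration under stated assumptions: the queue length
(HQ_LEN), the field layout of the 16-byte entry and all identifiers are
hypothetical, and a real driver would write PI to a device register
rather than to a struct field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queue length; the real QMAN depth is not given above. */
#define HQ_LEN 8u

/*
 * Assumed layout of one 16-byte queue entry: the text only says that an
 * entry holds a pointer and a length, so the padding field is a guess.
 */
struct hq_entry {
	uint64_t cb_addr;	/* address of the command buffer */
	uint32_t cb_len;	/* length of the command buffer in bytes */
	uint32_t reserved;
};
_Static_assert(sizeof(struct hq_entry) == 16, "entry must be 16 bytes");

struct hq_ring {
	uint32_t pi;	/* producer index, advanced by software */
	uint32_t ci;	/* consumer index, advanced by hardware */
};

bool hq_is_empty(const struct hq_ring *q)
{
	return q->pi == q->ci;
}

/* One slot is always left unused so that "full" differs from "empty". */
bool hq_is_full(const struct hq_ring *q)
{
	return (q->pi + 1) % HQ_LEN == q->ci;
}

/* Claim the next slot; returns its index, or -1 if the queue is full. */
int hq_produce(struct hq_ring *q)
{
	int slot;

	if (hq_is_full(q))
		return -1;

	slot = (int)q->pi;
	q->pi = (q->pi + 1) % HQ_LEN;
	return slot;
}

/* Retire the oldest slot; returns its index, or -1 if the queue is empty. */
int hq_consume(struct hq_ring *q)
{
	int slot;

	if (hq_is_empty(q))
		return -1;

	slot = (int)q->ci;
	q->ci = (q->ci + 1) % HQ_LEN;
	return slot;
}
```

With this convention a queue of HQ_LEN slots holds at most HQ_LEN - 1
entries at any time.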

For each DMA QMAN, there is a completion queue that the QMAN writes to
when it finishes the execution of the command buffer. The QMAN also
sends an MSI-X interrupt after writing the completion entry.

Inference workloads running on Goya are associated with an address space
through the ASID (address-space ID) property. Goya supports up to 1024
ASIDs. The ASID value is updated by the kernel driver in the relevant
registers before scheduling a workload.
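
One simple way to hand out up to 1024 ASIDs is a bitmap with one bit per
ASID. The sketch below is a user-space illustration only; the identifiers
are hypothetical, it is not taken from asid.c, and it ignores locking and
any ASIDs the driver may reserve for itself.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ASID 1024	/* upper limit stated in the text above */

struct asid_pool {
	uint64_t bitmap[MAX_ASID / 64];	/* bit set means ASID in use */
};

void asid_pool_init(struct asid_pool *p)
{
	memset(p, 0, sizeof(*p));
}

/* Return the lowest free ASID, or -1 if all of them are in use. */
int asid_alloc(struct asid_pool *p)
{
	int i;

	for (i = 0; i < MAX_ASID; i++) {
		uint64_t bit = 1ull << (i % 64);

		if (!(p->bitmap[i / 64] & bit)) {
			p->bitmap[i / 64] |= bit;
			return i;
		}
	}

	return -1;
}

/* Mark an ASID as free again; out-of-range values are ignored. */
void asid_free(struct asid_pool *p, int asid)
{
	if (asid < 0 || asid >= MAX_ASID)
		return;

	p->bitmap[asid / 64] &= ~(1ull << (asid % 64));
}
```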

During its initialization, the driver registers itself with the PCI
subsystem. For each Habana PCI device found, a char device node (/dev/hlX)
is created.

The driver currently exposes a total of five IOCTLs. One IOCTL allows
the application to submit workloads to the device, and another to wait on
completion of submitted workloads. The other three IOCTLs are used for
memory management, command buffer creation and information/status
retrieval.

In addition, the driver exposes several sensors through the hwmon
subsystem and provides various system-level information in sysfs for
system administrators.

The first step for an application process is to open the hlX device it
wants to work with. Each call to open() creates a new "context" for that
application in the driver's internal structures and a unique ASID is
assigned to that context. The context object lives until the process
releases the file descriptor AND its command submissions have finished
executing on the device.
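
The "file descriptor AND command submissions" lifetime rule is the usual
reference-counting pattern: the open file holds one reference and every
in-flight submission holds another. A minimal user-space sketch, with
hypothetical names and a plain integer where kernel code would typically
use a kref, might look like this:

```c
#include <stdbool.h>

struct ctx_sketch {
	int refcount;	/* not atomic; illustration only */
	bool destroyed;	/* set once the last reference is dropped */
};

/* Called on open(): the file descriptor owns the first reference. */
void ctx_open(struct ctx_sketch *ctx)
{
	ctx->refcount = 1;
	ctx->destroyed = false;
}

/* Called when a command submission is queued. */
void ctx_get(struct ctx_sketch *ctx)
{
	ctx->refcount++;
}

/* Called on close() and whenever a command submission completes. */
void ctx_put(struct ctx_sketch *ctx)
{
	if (--ctx->refcount == 0)
		ctx->destroyed = true;
}
```

Closing the file descriptor while a submission is still running only
drops the count to one, so the context survives until that submission
completes and drops the final reference.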

The next step is for the application to request information about the
device, such as the amount of DDR4 memory. The application can then go on
to create command buffers for its command submissions and allocate and map
device or host memory (host memory can only be mapped) in the device's
internal MMU subsystem.

At this point the application can load various deep learning
topologies to the device DDR4 memory. After that, it can start to submit
inference workloads using those topologies. For each workload, the
application receives a sequence number that represents the workload.
The application can then query the driver regarding the status of the
workload using that sequence number.

If a workload has not finished executing within 5 seconds (configurable
using a kernel module parameter) of the time it was scheduled to run, a
TDR (timeout detection & recovery) event occurs in the driver. The driver
then marks that workload as "timed out", performs a minimal reset of
the device (DMA and compute units only) and aborts all other workloads of
that context that were already submitted to the H/W queues.
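
The bookkeeping side of the TDR flow described above can be sketched in
user-space C as follows. Everything here is an assumption made for
illustration (the state names, the struct and the millisecond timestamps);
the device reset itself is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

enum cs_state { CS_QUEUED, CS_DONE, CS_TIMED_OUT, CS_ABORTED };

struct cs_sketch {
	int ctx_id;		/* context that submitted the workload */
	uint64_t sched_ms;	/* time the workload was scheduled */
	enum cs_state state;
};

/*
 * Look for the first queued workload older than timeout_ms. If one is
 * found, mark it as timed out, abort every other queued workload of the
 * same context and return its index. Return -1 if nothing timed out.
 */
long tdr_scan(struct cs_sketch *cs, size_t n, uint64_t now_ms,
	      uint64_t timeout_ms)
{
	size_t i, j;

	for (i = 0; i < n; i++) {
		if (cs[i].state != CS_QUEUED)
			continue;
		if (now_ms < cs[i].sched_ms ||
		    now_ms - cs[i].sched_ms <= timeout_ms)
			continue;

		cs[i].state = CS_TIMED_OUT;

		/* A real driver would reset DMA and compute units here. */

		for (j = 0; j < n; j++)
			if (cs[j].ctx_id == cs[i].ctx_id &&
			    cs[j].state == CS_QUEUED)
				cs[j].state = CS_ABORTED;

		return (long)i;
	}

	return -1;
}
```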

I would appreciate any feedback, questions and/or review.

P.S. For those who prefer to clone the tree instead of looking at the
emails, you can grab a copy from our company's page on GitHub:

https://github.com/HabanaAI/linux/releases/tag/hl_patchset_v1

Thanks,
Oded

Oded Gabbay (14):
  habanalabs: add skeleton driver
  habanalabs: add Goya registers header files
  habanalabs: add basic Goya support
  habanalabs: add context and ASID modules
  habanalabs: add command buffer module
  habanalabs: add basic Goya h/w initialization
  habanalabs: add h/w queues module
  habanalabs: add event queue and interrupts
  habanalabs: add sysfs and hwmon support
  habanalabs: add device reset support
  habanalabs: add command submission module
  habanalabs: implement INFO IOCTL
  habanalabs: add debugfs support
  Update MAINTAINERS and CREDITS with habanalabs info

Omer Shpigelman (1):
  habanalabs: add virtual memory and MMU modules

 CREDITS                                       |    2 +-
 .../ABI/testing/debugfs-driver-habanalabs     |  127 +
 .../ABI/testing/sysfs-driver-habanalabs       |  190 +
 MAINTAINERS                                   |    9 +
 drivers/misc/Kconfig                          |    1 +
 drivers/misc/Makefile                         |    1 +
 drivers/misc/habanalabs/Kconfig               |   22 +
 drivers/misc/habanalabs/Makefile              |   14 +
 drivers/misc/habanalabs/asid.c                |   58 +
 drivers/misc/habanalabs/command_buffer.c      |  425 +
 drivers/misc/habanalabs/command_submission.c  |  799 ++
 drivers/misc/habanalabs/context.c             |  216 +
 drivers/misc/habanalabs/debugfs.c             | 1069 ++
 drivers/misc/habanalabs/device.c              | 1097 ++
 drivers/misc/habanalabs/goya/Makefile         |    3 +
 drivers/misc/habanalabs/goya/goya.c           | 6347 ++++++++++++
 drivers/misc/habanalabs/goya/goyaP.h          |  161 +
 drivers/misc/habanalabs/goya/goya_hwmgr.c     |  306 +
 drivers/misc/habanalabs/goya/goya_security.c  | 2999 ++++++
 drivers/misc/habanalabs/habanalabs.h          | 1464 +++
 drivers/misc/habanalabs/habanalabs_drv.c      |  474 +
 drivers/misc/habanalabs/habanalabs_ioctl.c    |  237 +
 drivers/misc/habanalabs/hw_queue.c            |  654 ++
 drivers/misc/habanalabs/hwmon.c               |  449 +
 .../include/goya/asic_reg/cpu_ca53_cfg_regs.h |  213 +
 .../include/goya/asic_reg/cpu_if_regs.h       |  110 +
 .../include/goya/asic_reg/cpu_pll_regs.h      |  186 +
 .../include/goya/asic_reg/ddr_mc_ch0_regs.h   | 1158 +++
 .../include/goya/asic_reg/ddr_mc_ch1_regs.h   | 1158 +++
 .../include/goya/asic_reg/ddr_misc_ch0_regs.h |  156 +
 .../include/goya/asic_reg/ddr_misc_ch1_regs.h |  156 +
 .../include/goya/asic_reg/dma_ch_0_regs.h     |  512 +
 .../include/goya/asic_reg/dma_ch_1_regs.h     |  512 +
 .../include/goya/asic_reg/dma_ch_2_regs.h     |  512 +
 .../include/goya/asic_reg/dma_ch_3_regs.h     |  512 +
 .../include/goya/asic_reg/dma_ch_4_regs.h     |  512 +
 .../include/goya/asic_reg/dma_macro_regs.h    |  242 +
 .../include/goya/asic_reg/dma_nrtr_regs.h     |  380 +
 .../include/goya/asic_reg/dma_qm_0_regs.h     |  543 +
 .../include/goya/asic_reg/dma_qm_1_regs.h     |  543 +
 .../include/goya/asic_reg/dma_qm_2_regs.h     |  543 +
 .../include/goya/asic_reg/dma_qm_3_regs.h     |  543 +
 .../include/goya/asic_reg/dma_qm_4_regs.h     |  543 +
 .../include/goya/asic_reg/gic_regs.h          | 9079 +++++++++++++++++
 .../include/goya/asic_reg/goya_blocks.h       | 1372 +++
 .../include/goya/asic_reg/goya_masks.h        |  262 +
 .../include/goya/asic_reg/goya_regs.h         |  119 +
 .../include/goya/asic_reg/ic_pll_regs.h       |  186 +
 .../include/goya/asic_reg/mc_pll_regs.h       |  186 +
 .../include/goya/asic_reg/mme1_rtr_regs.h     |  876 ++
 .../include/goya/asic_reg/mme2_rtr_regs.h     |  876 ++
 .../include/goya/asic_reg/mme3_rtr_regs.h     |  876 ++
 .../include/goya/asic_reg/mme4_rtr_regs.h     |  876 ++
 .../include/goya/asic_reg/mme5_rtr_regs.h     |  876 ++
 .../include/goya/asic_reg/mme6_rtr_regs.h     |  876 ++
 .../include/goya/asic_reg/mme_cmdq_regs.h     |  431 +
 .../include/goya/asic_reg/mme_qm_regs.h       |  543 +
 .../include/goya/asic_reg/mme_regs.h          | 2422 +++++
 .../include/goya/asic_reg/mmu_regs.h          |  158 +
 .../include/goya/asic_reg/pci_nrtr_regs.h     |  380 +
 .../include/goya/asic_reg/pcie_aux_regs.h     |  476 +
 .../include/goya/asic_reg/pcie_dbi_regs.h     | 2909 ++++++
 .../goya/asic_reg/psoc_emmc_pll_regs.h        |  186 +
 .../goya/asic_reg/psoc_global_conf_regs.h     | 1119 ++
 .../include/goya/asic_reg/psoc_mme_pll_regs.h |  186 +
 .../include/goya/asic_reg/psoc_pci_pll_regs.h |  186 +
 .../include/goya/asic_reg/psoc_spi_regs.h     |  427 +
 .../goya/asic_reg/sram_y0_x0_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y0_x1_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y0_x2_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y0_x3_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y0_x4_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y1_x0_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y1_x1_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y1_x2_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y1_x3_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y1_x4_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y2_x0_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y2_x1_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y2_x2_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y2_x3_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y2_x4_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y3_x0_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y3_x1_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y3_x2_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y3_x3_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y3_x4_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y4_x0_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y4_x1_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y4_x2_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y4_x3_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y4_x4_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y5_x0_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y5_x1_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y5_x2_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y5_x3_rtr_regs.h       |  215 +
 .../goya/asic_reg/sram_y5_x4_rtr_regs.h       |  215 +
 .../include/goya/asic_reg/stlb_regs.h         |  133 +
 .../include/goya/asic_reg/sync_mngr_regs.h    | 4930 +++++++++
 .../include/goya/asic_reg/tpc0_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc0_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc0_eml_cfg_regs.h |  580 ++
 .../include/goya/asic_reg/tpc0_nrtr_regs.h    |  380 +
 .../include/goya/asic_reg/tpc0_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc1_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc1_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc1_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc1_rtr_regs.h     |  848 ++
 .../include/goya/asic_reg/tpc2_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc2_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc2_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc2_rtr_regs.h     |  848 ++
 .../include/goya/asic_reg/tpc3_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc3_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc3_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc3_rtr_regs.h     |  848 ++
 .../include/goya/asic_reg/tpc4_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc4_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc4_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc4_rtr_regs.h     |  848 ++
 .../include/goya/asic_reg/tpc5_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc5_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc5_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc5_rtr_regs.h     |  848 ++
 .../include/goya/asic_reg/tpc6_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc6_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc6_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc6_rtr_regs.h     |  848 ++
 .../include/goya/asic_reg/tpc7_cfg_regs.h     | 2110 ++++
 .../include/goya/asic_reg/tpc7_cmdq_regs.h    |  431 +
 .../include/goya/asic_reg/tpc7_nrtr_regs.h    |  380 +
 .../include/goya/asic_reg/tpc7_qm_regs.h      |  543 +
 .../include/goya/asic_reg/tpc_pll_regs.h      |  186 +
 drivers/misc/habanalabs/include/goya/goya.h   |  117 +
 .../include/goya/goya_async_events.h          |  186 +
 .../habanalabs/include/goya/goya_boot_if.h    |   32 +
 .../habanalabs/include/goya/goya_packets.h    |  234 +
 .../habanalabs/include/habanalabs_device_if.h |  397 +
 .../include/hw_ip/mmu/mmu_general.h           |   45 +
 .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
 drivers/misc/habanalabs/irq.c                 |  325 +
 drivers/misc/habanalabs/memory.c              | 1714 ++++
 drivers/misc/habanalabs/mmu.c                 |  604 ++
 drivers/misc/habanalabs/sysfs.c               |  690 ++
 include/uapi/misc/habanalabs.h                |  412 +
 145 files changed, 99610 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/debugfs-driver-habanalabs
 create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
 create mode 100644 drivers/misc/habanalabs/Kconfig
 create mode 100644 drivers/misc/habanalabs/Makefile
 create mode 100644 drivers/misc/habanalabs/asid.c
 create mode 100644 drivers/misc/habanalabs/command_buffer.c
 create mode 100644 drivers/misc/habanalabs/command_submission.c
 create mode 100644 drivers/misc/habanalabs/context.c
 create mode 100644 drivers/misc/habanalabs/debugfs.c
 create mode 100644 drivers/misc/habanalabs/device.c
 create mode 100644 drivers/misc/habanalabs/goya/Makefile
 create mode 100644 drivers/misc/habanalabs/goya/goya.c
 create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
 create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
 create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
 create mode 100644 drivers/misc/habanalabs/habanalabs.h
 create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c
 create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
 create mode 100644 drivers/misc/habanalabs/hw_queue.c
 create mode 100644 drivers/misc/habanalabs/hwmon.c
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_ca53_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_if_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_mc_ch0_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_mc_ch1_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_misc_ch0_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_misc_ch1_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_0_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_1_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_2_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_3_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_4_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_macro_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_0_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_1_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_2_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_3_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_4_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/gic_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_blocks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ic_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mc_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme5_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme6_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mmu_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pci_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pcie_aux_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pcie_dbi_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_emmc_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_global_conf_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_mme_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_pci_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_spi_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/stlb_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sync_mngr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_eml_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_boot_if.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
 create mode 100644 drivers/misc/habanalabs/include/habanalabs_device_if.h
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
 create mode 100644 drivers/misc/habanalabs/irq.c
 create mode 100644 drivers/misc/habanalabs/memory.c
 create mode 100644 drivers/misc/habanalabs/mmu.c
 create mode 100644 drivers/misc/habanalabs/sysfs.c
 create mode 100644 include/uapi/misc/habanalabs.h

-- 
2.17.1



* [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23  0:49   ` Joe Perches
                     ` (2 more replies)
  2019-01-23  0:00 ` [PATCH 03/15] habanalabs: add basic Goya support Oded Gabbay
                   ` (15 subsequent siblings)
  16 siblings, 3 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds the habanalabs skeleton driver. The driver does nothing at
this stage except very basic operations. It contains the minimal code to
insmod and rmmod the driver and to create a /dev/hlX file per PCI device.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/Kconfig                          |   1 +
 drivers/misc/Makefile                         |   1 +
 drivers/misc/habanalabs/Kconfig               |  22 ++
 drivers/misc/habanalabs/Makefile              |   7 +
 drivers/misc/habanalabs/device.c              | 331 ++++++++++++++++
 drivers/misc/habanalabs/habanalabs.h          | 149 +++++++
 drivers/misc/habanalabs/habanalabs_drv.c      | 366 ++++++++++++++++++
 .../habanalabs/include/habanalabs_device_if.h | 125 ++++++
 8 files changed, 1002 insertions(+)
 create mode 100644 drivers/misc/habanalabs/Kconfig
 create mode 100644 drivers/misc/habanalabs/Makefile
 create mode 100644 drivers/misc/habanalabs/device.c
 create mode 100644 drivers/misc/habanalabs/habanalabs.h
 create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c
 create mode 100644 drivers/misc/habanalabs/include/habanalabs_device_if.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index f417b06e11c5..fecab53c4f21 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -535,4 +535,5 @@ source "drivers/misc/echo/Kconfig"
 source "drivers/misc/cxl/Kconfig"
 source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
+source "drivers/misc/habanalabs/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index e39ccbbc1b3a..ae77dfd790a4 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_PCI_ENDPOINT_TEST)	+= pci_endpoint_test.o
 obj-$(CONFIG_OCXL)		+= ocxl/
 obj-y				+= cardreader/
 obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
+obj-$(CONFIG_HABANA_AI)		+= habanalabs/
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
new file mode 100644
index 000000000000..b7f38a14caf5
--- /dev/null
+++ b/drivers/misc/habanalabs/Kconfig
@@ -0,0 +1,22 @@
+#
+# HabanaLabs AI accelerators driver
+#
+
+config HABANA_AI
+	tristate "HabanaAI accelerators (habanalabs)"
+	depends on PCI
+	select FRAME_VECTOR
+	help
+	  Enables PCIe card driver for Habana's AI Processors (AIP) that are
+	  designed to accelerate Deep Learning inference and training workloads.
+
+	  The driver manages the PCIe devices and provides IOCTL interface for
+	  the user to submit workloads to the devices.
+
+	  The user-space interface is described in
+	  include/uapi/misc/habanalabs.h
+
+	  If unsure, say N.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called habanalabs.
diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
new file mode 100644
index 000000000000..b41433a09e02
--- /dev/null
+++ b/drivers/misc/habanalabs/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for HabanaLabs AI accelerators driver
+#
+
+obj-m	:= habanalabs.o
+
+habanalabs-y := habanalabs_drv.o device.o
\ No newline at end of file
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
new file mode 100644
index 000000000000..376b55eb73d4
--- /dev/null
+++ b/drivers/misc/habanalabs/device.c
@@ -0,0 +1,331 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/fs.h>
+#include <linux/kthread.h>
+#include <linux/sched/signal.h>
+
+static void hpriv_release(struct kref *ref)
+{
+	struct hl_fpriv *hpriv;
+	struct hl_device *hdev;
+
+	hpriv = container_of(ref, struct hl_fpriv, refcount);
+
+	hdev = hpriv->hdev;
+
+	put_pid(hpriv->taskpid);
+
+	kfree(hpriv);
+}
+
+void hl_hpriv_get(struct hl_fpriv *hpriv)
+{
+	kref_get(&hpriv->refcount);
+}
+
+void hl_hpriv_put(struct hl_fpriv *hpriv)
+{
+	kref_put(&hpriv->refcount, hpriv_release);
+}
+
+/**
+ * hl_device_release - release function for habanalabs device
+ *
+ * @inode: pointer to inode structure
+ * @filp: pointer to file structure
+ *
+ * Called when a process closes a habanalabs device
+ */
+static int hl_device_release(struct inode *inode, struct file *filp)
+{
+	struct hl_fpriv *hpriv = filp->private_data;
+
+	filp->private_data = NULL;
+
+	hl_hpriv_put(hpriv);
+
+	return 0;
+}
+
+static const struct file_operations hl_ops = {
+	.owner = THIS_MODULE,
+	.open = hl_device_open,
+	.release = hl_device_release
+};
+
+/**
+ * device_setup_cdev - setup cdev and device for habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @hclass: pointer to the class object of the device
+ * @minor: minor number of the specific device
+ * @fops: file operations to install for this device
+ *
+ * Create a cdev and a Linux device for habanalabs's device. Need to be
+ * called at the end of the habanalabs device initialization process,
+ * because this function exposes the device to the user
+ */
+static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
+				int minor, const struct file_operations *fops)
+{
+	int err, devno = MKDEV(hdev->major, minor);
+	struct cdev *hdev_cdev = &hdev->cdev;
+	char name[8];
+
+	sprintf(name, "hl%d", hdev->id);
+
+	cdev_init(hdev_cdev, fops);
+	hdev_cdev->owner = THIS_MODULE;
+	err = cdev_add(hdev_cdev, devno, 1);
+	if (err) {
+		pr_err("habanalabs: Failed to add char device %s\n", name);
+		goto err_cdev_add;
+	}
+
+	hdev->dev = device_create(hclass, NULL, devno, NULL, "%s", name);
+	if (IS_ERR(hdev->dev)) {
+		pr_err("habanalabs: Failed to create device %s\n", name);
+		err = PTR_ERR(hdev->dev);
+		goto err_device_create;
+	}
+
+	dev_set_drvdata(hdev->dev, hdev);
+
+	return 0;
+
+err_device_create:
+	cdev_del(hdev_cdev);
+err_cdev_add:
+	return err;
+}
+
+/**
+ * device_early_init - do some early initialization for the habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Install the relevant function pointers and call the early_init function,
+ * if such a function exists
+ */
+static int device_early_init(struct hl_device *hdev)
+{
+	switch (hdev->asic_type) {
+	case ASIC_GOYA:
+		sprintf(hdev->asic_name, "GOYA");
+		break;
+	default:
+		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
+			hdev->asic_type);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * device_early_fini - finalize all that was done in device_early_init
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ */
+static void device_early_fini(struct hl_device *hdev)
+{
+}
+
+/**
+ * hl_device_suspend - initiate device suspend
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Puts the hw in the suspend state (all asics).
+ * Returns 0 for success or an error on failure.
+ * Called at driver suspend.
+ */
+int hl_device_suspend(struct hl_device *hdev)
+{
+	pci_save_state(hdev->pdev);
+
+	/* Shut down the device */
+	pci_disable_device(hdev->pdev);
+	pci_set_power_state(hdev->pdev, PCI_D3hot);
+
+	return 0;
+}
+
+/**
+ * hl_device_resume - initiate device resume
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Bring the hw back to operating state (all asics).
+ * Returns 0 for success or an error on failure.
+ * Called at driver resume.
+ */
+int hl_device_resume(struct hl_device *hdev)
+{
+	int rc;
+
+	pci_set_power_state(hdev->pdev, PCI_D0);
+	pci_restore_state(hdev->pdev);
+	rc = pci_enable_device(hdev->pdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to enable PCI device in resume\n");
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * hl_device_init - main initialization function for habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Allocate an id for the device, do early initialization and then call the
+ * ASIC specific initialization functions. Finally, create the cdev and the
+ * Linux device to expose it to the user
+ */
+int hl_device_init(struct hl_device *hdev, struct class *hclass)
+{
+	int rc;
+
+	/* Create device */
+	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
+
+	if (rc)
+		goto out_disabled;
+
+	/* Initialize ASIC function pointers and perform early init */
+	rc = device_early_init(hdev);
+	if (rc)
+		goto release_device;
+
+	dev_notice(hdev->dev,
+		"Successfully added device to habanalabs driver\n");
+
+	return 0;
+
+release_device:
+	device_destroy(hclass, hdev->dev->devt);
+	cdev_del(&hdev->cdev);
+out_disabled:
+	hdev->disabled = true;
+	if (hdev->pdev)
+		dev_err(&hdev->pdev->dev,
+			"Failed to initialize hl%d. Device is NOT usable !!!\n",
+			hdev->id);
+	else
+		pr_err("habanalabs: Failed to initialize hl%d. Device is NOT usable !!!\n",
+			hdev->id);
+
+	return rc;
+}
+
+/**
+ * hl_device_fini - main tear-down function for habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Destroy the device, call ASIC fini functions and release the id
+ */
+void hl_device_fini(struct hl_device *hdev)
+{
+	dev_info(hdev->dev, "Removing device\n");
+
+	/* Mark device as disabled */
+	hdev->disabled = true;
+
+	device_early_fini(hdev);
+
+	/* Hide device from user */
+	device_destroy(hdev->dev->class, hdev->dev->devt);
+	cdev_del(&hdev->cdev);
+
+	pr_info("habanalabs: removed device successfully\n");
+}
+
+/**
+ * hl_poll_timeout_memory - Periodically poll a host memory address
+ *                              until it is not zero or a timeout occurs
+ * @hdev: pointer to habanalabs device structure
+ * @addr: Address to poll
+ * @timeout_us: timeout in us
+ * @val: Variable to read the value into
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
+ * case, the last read value at @addr is stored in @val. Must not
+ * be called from atomic context, because the function may sleep.
+ *
+ * The function sleeps for up to 100us between reads and gives up
+ * once @timeout_us has elapsed.
+ */
+int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr,
+				u32 timeout_us, u32 *val)
+{
+	/*
+	 * pReturnVal is defined as volatile because it points to HOST memory,
+	 * which is being written to by the device. Therefore, we can't use
+	 * locks to synchronize it and it is not a memory-mapped register space
+	 */
+	volatile u32 *pReturnVal = (volatile u32 *) addr;
+	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
+
+	might_sleep();
+
+	for (;;) {
+		*val = *pReturnVal;
+		if (*val)
+			break;
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			*val = *pReturnVal;
+			break;
+		}
+		usleep_range((100 >> 2) + 1, 100);
+	}
+
+	return (*val ? 0 : -ETIMEDOUT);
+}
+
+/**
+ * hl_poll_timeout_device_memory - Periodically poll a device memory address
+ *                                until it is not zero or a timeout occurs
+ * @hdev: pointer to habanalabs device structure
+ * @addr: Device address to poll
+ * @timeout_us: timeout in us
+ * @val: Variable to read the value into
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
+ * case, the last read value at @addr is stored in @val. Must not
+ * be called from atomic context, because the function may sleep.
+ *
+ * The function sleeps for up to 100us between reads and gives up
+ * once @timeout_us has elapsed.
+ */
+int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
+				u32 timeout_us, u32 *val)
+{
+	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
+
+	might_sleep();
+
+	for (;;) {
+		*val = readl(addr);
+		if (*val)
+			break;
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			*val = readl(addr);
+			break;
+		}
+		usleep_range((100 >> 2) + 1, 100);
+	}
+
+	return (*val ? 0 : -ETIMEDOUT);
+}
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
new file mode 100644
index 000000000000..7e1b088b677c
--- /dev/null
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -0,0 +1,149 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef HABANALABSP_H_
+#define HABANALABSP_H_
+
+#include "include/habanalabs_device_if.h"
+
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/cdev.h>
+#include <linux/interrupt.h>
+#include <linux/iopoll.h>
+#include <linux/dma-fence.h>
+#include <linux/hashtable.h>
+#include <linux/hwmon.h>
+
+#define HL_NAME				"habanalabs"
+
+struct hl_device;
+
+
+
+
+
+
+/*
+ * ASICs
+ */
+
+/**
+ * enum hl_asic_type - supported ASIC types.
+ * @ASIC_AUTO_DETECT: ASIC type will be automatically set.
+ * @ASIC_GOYA: Goya device.
+ * @ASIC_LAST: last ASIC type.
+ */
+enum hl_asic_type {
+	ASIC_AUTO_DETECT,
+	ASIC_GOYA,
+	ASIC_LAST
+};
+
+
+
+
+
+/*
+ * FILE PRIVATE STRUCTURE
+ */
+
+/**
+ * struct hl_fpriv - process information stored in FD private data.
+ * @hdev: habanalabs device structure.
+ * @filp: pointer to the given file structure.
+ * @taskpid: current process ID.
+ * @refcount: number of related contexts.
+ */
+struct hl_fpriv {
+	struct hl_device	*hdev;
+	struct file		*filp;
+	struct pid		*taskpid;
+	struct kref		refcount;
+};
+
+
+
+
+/*
+ * DEVICES
+ */
+
+/* Theoretical limit only. A single host can only contain up to 4 or 8 PCIe
+ * x16 cards. In extreme cases, there are hosts that can accommodate 16 cards
+ */
+#define HL_MAX_MINORS	256
+
+/**
+ * struct hl_device - habanalabs device structure.
+ * @pdev: pointer to PCI device, can be NULL in case of simulator device.
+ * @cdev: related char device.
+ * @dev: related kernel basic device structure.
+ * @asic_name: ASIC specific name.
+ * @asic_type: ASIC specific type.
+ * @major: habanalabs KMD major.
+ * @id: device minor.
+ * @disabled: is device disabled.
+ */
+struct hl_device {
+	struct pci_dev			*pdev;
+	struct cdev			cdev;
+	struct device			*dev;
+	char				asic_name[16];
+	enum hl_asic_type		asic_type;
+	u32				major;
+	u16				id;
+	u8				disabled;
+};
+
+/*
+ * IOCTLs
+ */
+
+/**
+ * typedef hl_ioctl_t - typedef for ioctl function in the driver
+ * @hpriv: pointer to the FD's private data, which contains state of
+ *		user process
+ * @data: pointer to the input/output arguments structure of the IOCTL
+ *
+ * Return: 0 for success, negative value for error
+ */
+typedef int hl_ioctl_t(struct hl_fpriv *hpriv, void *data);
+
+/**
+ * struct hl_ioctl_desc - describes an IOCTL entry of the driver.
+ * @cmd: the IOCTL code as created by the kernel macros.
+ * @func: pointer to the driver's function that should be called for this IOCTL.
+ */
+struct hl_ioctl_desc {
+	unsigned int cmd;
+	hl_ioctl_t *func;
+};
+
+
+
+
+
+/*
+ * Kernel module functions that can be accessed by entire module
+ */
+
+int hl_device_open(struct inode *inode, struct file *filp);
+int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
+		enum hl_asic_type asic_type, int minor);
+void destroy_hdev(struct hl_device *hdev);
+int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
+				u32 *val);
+int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
+				u32 timeout_us, u32 *val);
+
+int hl_device_init(struct hl_device *hdev, struct class *hclass);
+void hl_device_fini(struct hl_device *hdev);
+int hl_device_suspend(struct hl_device *hdev);
+int hl_device_resume(struct hl_device *hdev);
+
+#endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
new file mode 100644
index 000000000000..15217975327b
--- /dev/null
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ * Author: Oded Gabbay <oded.gabbay@gmail.com>
+ *
+ */
+
+#include "habanalabs.h"
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+
+#include <linux/fs.h>
+
+#define HL_DRIVER_AUTHOR	"HabanaLabs Kernel Driver Team"
+
+#define HL_DRIVER_DESC		"Driver for HabanaLabs's AI Accelerators"
+
+MODULE_AUTHOR(HL_DRIVER_AUTHOR);
+MODULE_DESCRIPTION(HL_DRIVER_DESC);
+MODULE_LICENSE("GPL v2");
+
+static int hl_major;
+static struct class *hl_class;
+DEFINE_IDR(hl_devs_idr);
+DEFINE_MUTEX(hl_devs_idr_lock);
+
+#define PCI_VENDOR_ID_HABANALABS	0x1da3
+
+#define PCI_IDS_GOYA			0x0001
+
+static struct pci_device_id ids[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_HABANALABS, PCI_IDS_GOYA), },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, ids);
+
+/**
+ * get_asic_type - translate device id to asic type
+ *
+ * @device: id of the PCI device
+ * @asic_type: pointer that will be filled by the asic type
+ *
+ * Translate device id to asic type.
+ * In case of unidentified device, return -1
+ */
+static int get_asic_type(u16 device, enum hl_asic_type *asic_type)
+{
+	int rc = 0;
+
+	switch (device) {
+	case PCI_IDS_GOYA:
+		*asic_type = ASIC_GOYA;
+		break;
+	default:
+		*asic_type = rc = -1;
+		break;
+	}
+
+	return rc;
+}
+
+/**
+ * hl_device_open - open function for habanalabs device
+ *
+ * @inode: pointer to inode structure
+ * @filp: pointer to file structure
+ *
+ * Called when a process opens a habanalabs device.
+ */
+int hl_device_open(struct inode *inode, struct file *filp)
+{
+	struct hl_device *hdev;
+	struct hl_fpriv *hpriv;
+
+	mutex_lock(&hl_devs_idr_lock);
+	hdev = idr_find(&hl_devs_idr, iminor(inode));
+	mutex_unlock(&hl_devs_idr_lock);
+
+	if (!hdev) {
+		pr_err("habanalabs: Couldn't find device %d:%d\n",
+			imajor(inode), iminor(inode));
+		return -ENXIO;
+	}
+
+	hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
+	if (!hpriv)
+		return -ENOMEM;
+
+	hpriv->hdev = hdev;
+	filp->private_data = hpriv;
+	hpriv->filp = filp;
+	kref_init(&hpriv->refcount);
+	nonseekable_open(inode, filp);
+
+	hpriv->taskpid = find_get_pid(current->pid);
+
+	return 0;
+}
+
+/**
+ * create_hdev - create habanalabs device instance
+ *
+ * @dev: will hold the pointer to the new habanalabs device structure
+ * @pdev: pointer to the pci device
+ * @asic_type: in case of simulator device, which device is it
+ * @minor: in case of simulator device, the minor of the device
+ *
+ * Allocate memory for habanalabs device and initialize basic fields
+ * Identify the ASIC type
+ * Allocate ID (minor) for the device (only for real devices)
+ */
+int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
+		enum hl_asic_type asic_type, int minor)
+{
+	struct hl_device *hdev;
+	int rc;
+
+	*dev = NULL;
+
+	hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
+	if (!hdev) {
+		if (pdev)
+			dev_err(&pdev->dev,
+				"Not enough memory for habanalabs device\n");
+		else
+			pr_err("habanalabs: Not enough memory for device\n");
+
+		return -ENOMEM;
+	}
+
+	hdev->major = hl_major;
+
+	hdev->disabled = true;
+	hdev->pdev = pdev; /* can be NULL in case of simulator device */
+
+	if (asic_type == ASIC_AUTO_DETECT) {
+		rc = get_asic_type(pdev->device, &hdev->asic_type);
+		if (rc) {
+			dev_err(&pdev->dev, "Unsupported ASIC\n");
+			rc = -ENODEV;
+			goto free_hdev;
+		}
+	} else {
+		hdev->asic_type = asic_type;
+	}
+
+	mutex_lock(&hl_devs_idr_lock);
+
+	if (minor == -1) {
+		rc = idr_alloc(&hl_devs_idr, hdev, 0, HL_MAX_MINORS,
+				GFP_KERNEL);
+	} else {
+		idr_replace(&hl_devs_idr, hdev, minor);
+		rc = minor;
+	}
+
+	mutex_unlock(&hl_devs_idr_lock);
+
+	if (rc < 0) {
+		if (rc == -ENOSPC) {
+			pr_err("habanalabs: too many devices in the system\n");
+			rc = -EBUSY;
+		}
+		goto free_hdev;
+	}
+
+	hdev->id = rc;
+
+	*dev = hdev;
+
+	return 0;
+
+free_hdev:
+	kfree(hdev);
+	return rc;
+}
+
+/**
+ * destroy_hdev - destroy habanalabs device instance
+ *
+ * @hdev: pointer to the habanalabs device structure
+ *
+ */
+void destroy_hdev(struct hl_device *hdev)
+{
+	/* Remove device from the device list */
+	mutex_lock(&hl_devs_idr_lock);
+	idr_remove(&hl_devs_idr, hdev->id);
+	mutex_unlock(&hl_devs_idr_lock);
+
+	kfree(hdev);
+}
+
+static int hl_pmops_suspend(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct hl_device *hdev = pci_get_drvdata(pdev);
+
+	pr_debug("habanalabs: Going to suspend PCI device\n");
+
+	if (!hdev) {
+		pr_err("habanalabs: device pointer is NULL in suspend\n");
+		return 0;
+	}
+
+	return hl_device_suspend(hdev);
+}
+
+static int hl_pmops_resume(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct hl_device *hdev = pci_get_drvdata(pdev);
+
+	pr_debug("habanalabs: Going to resume PCI device\n");
+
+	if (!hdev) {
+		pr_err("habanalabs: device pointer is NULL in resume\n");
+		return 0;
+	}
+
+	return hl_device_resume(hdev);
+}
+
+/**
+ * hl_pci_probe - probe PCI habanalabs devices
+ *
+ * @pdev: pointer to pci device
+ * @id: pointer to pci device id structure
+ *
+ * Standard PCI probe function for habanalabs device.
+ * Create a new habanalabs device and initialize it according to the
+ * device's type
+ */
+static int hl_pci_probe(struct pci_dev *pdev,
+				const struct pci_device_id *id)
+{
+	struct hl_device *hdev;
+	int rc;
+
+	dev_info(&pdev->dev, HL_NAME
+		 " device found [%04x:%04x] (rev %x)\n",
+		 (int)pdev->vendor, (int)pdev->device, (int)pdev->revision);
+
+	rc = create_hdev(&hdev, pdev, ASIC_AUTO_DETECT, -1);
+	if (rc)
+		return rc;
+
+	pci_set_drvdata(pdev, hdev);
+
+	rc = hl_device_init(hdev, hl_class);
+	if (rc) {
+		dev_err(&pdev->dev, "Fatal error during habanalabs device init\n");
+		rc = -ENODEV;
+		goto disable_device;
+	}
+
+	return 0;
+
+disable_device:
+	pci_set_drvdata(pdev, NULL);
+	destroy_hdev(hdev);
+
+	return rc;
+}
+
+/**
+ * hl_pci_remove - remove PCI habanalabs devices
+ *
+ * @pdev: pointer to pci device
+ *
+ * Standard PCI remove function for habanalabs device
+ */
+static void hl_pci_remove(struct pci_dev *pdev)
+{
+	struct hl_device *hdev;
+
+	hdev = pci_get_drvdata(pdev);
+	if (!hdev)
+		return;
+
+	hl_device_fini(hdev);
+	pci_set_drvdata(pdev, NULL);
+
+	destroy_hdev(hdev);
+}
+
+static const struct dev_pm_ops hl_pm_ops = {
+	.suspend = hl_pmops_suspend,
+	.resume = hl_pmops_resume,
+};
+
+static struct pci_driver hl_pci_driver = {
+	.name = HL_NAME,
+	.id_table = ids,
+	.probe = hl_pci_probe,
+	.remove = hl_pci_remove,
+	.driver.pm = &hl_pm_ops,
+};
+
+/**
+ * hl_init - Initialize the habanalabs kernel driver
+ *
+ */
+static int __init hl_init(void)
+{
+	int rc;
+	dev_t dev;
+
+	pr_info("habanalabs: loading driver\n");
+
+	rc = alloc_chrdev_region(&dev, 0, HL_MAX_MINORS, HL_NAME);
+	if (rc < 0) {
+		pr_err("habanalabs: unable to get major\n");
+		return rc;
+	}
+
+	hl_major = MAJOR(dev);
+
+	hl_class = class_create(THIS_MODULE, HL_NAME);
+	if (IS_ERR(hl_class)) {
+		pr_err("habanalabs: failed to allocate class\n");
+		rc = PTR_ERR(hl_class);
+		goto remove_major;
+	}
+
+	rc = pci_register_driver(&hl_pci_driver);
+	if (rc) {
+		pr_err("habanalabs: failed to register pci driver\n");
+		goto remove_class;
+	}
+
+	pr_debug("habanalabs: driver loaded\n");
+
+	return 0;
+
+remove_class:
+	class_destroy(hl_class);
+remove_major:
+	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
+	return rc;
+}
+
+/**
+ * hl_exit - Release all resources of the habanalabs kernel driver
+ *
+ */
+static void __exit hl_exit(void)
+{
+	pci_unregister_driver(&hl_pci_driver);
+
+	class_destroy(hl_class);
+	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
+
+	idr_destroy(&hl_devs_idr);
+
+	pr_debug("habanalabs: driver removed\n");
+}
+
+module_init(hl_init);
+module_exit(hl_exit);
diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
new file mode 100644
index 000000000000..9dbb7077eabd
--- /dev/null
+++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef HABANALABS_DEVICE_IF_H
+#define HABANALABS_DEVICE_IF_H
+
+#include <linux/types.h>
+
+/*
+ * PRIMARY QUEUE
+ */
+
+struct hl_bd {
+	__u64	ptr;
+	__u32	len;
+	union {
+		struct {
+			__u32	repeat:16;
+			__u32	res1:8;
+			__u32	repeat_valid:1;
+			__u32	res2:7;
+		};
+		__u32	ctl;
+	};
+};
+
+#define HL_BD_SIZE			sizeof(struct hl_bd)
+
+/*
+ * BD_CTL_REPEAT_VALID tells the CP whether the repeat field in the BD CTL is
+ * valid. 1 means the repeat field is valid, 0 means not-valid,
+ * i.e. repeat == 1
+ */
+#define BD_CTL_REPEAT_VALID_SHIFT	24
+#define BD_CTL_REPEAT_VALID_MASK	0x01000000
+
+#define BD_CTL_SHADOW_INDEX_SHIFT	0
+#define BD_CTL_SHADOW_INDEX_MASK	0x00000FFF
+
+/*
+ * COMPLETION QUEUE
+ */
+
+struct hl_cq_entry {
+	__u32	data;
+};
+
+#define HL_CQ_ENTRY_SIZE		sizeof(struct hl_cq_entry)
+
+#define CQ_ENTRY_READY_SHIFT			31
+#define CQ_ENTRY_READY_MASK			0x80000000
+
+#define CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT	30
+#define CQ_ENTRY_SHADOW_INDEX_VALID_MASK	0x40000000
+
+#define CQ_ENTRY_SHADOW_INDEX_SHIFT		BD_CTL_SHADOW_INDEX_SHIFT
+#define CQ_ENTRY_SHADOW_INDEX_MASK		BD_CTL_SHADOW_INDEX_MASK
+
+/*
+ * EVENT QUEUE
+ */
+
+struct hl_eq_header {
+	__u32 reserved;
+	union {
+		struct {
+			__u32 ctx_id :10;
+			__u32:6;
+			__u32 opcode :10;
+			__u32:5;
+			__u32 ready :1;
+		};
+		__u32 ctl;
+	};
+};
+
+struct hl_eq_entry {
+	struct hl_eq_header hdr;
+	__u64 data[7];
+};
+
+#define HL_EQ_ENTRY_SIZE		sizeof(struct hl_eq_entry)
+
+#define EQ_CTL_READY_SHIFT		31
+#define EQ_CTL_READY_MASK		0x80000000
+
+#define EQ_CTL_EVENT_TYPE_SHIFT		16
+#define EQ_CTL_EVENT_TYPE_MASK		0x03FF0000
+
+enum pq_init_status {
+	PQ_INIT_STATUS_NA = 0,
+	PQ_INIT_STATUS_READY_FOR_CP,
+	PQ_INIT_STATUS_READY_FOR_HOST
+};
+
+/*
+ * ArmCP info
+ */
+
+#define VERSION_MAX_LEN			128
+#define ARMCP_MAX_SENSORS		128
+
+struct armcp_sensor {
+	__u32 type;
+	__u32 flags;
+};
+
+/* must be aligned to 4 bytes */
+struct armcp_info {
+	struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
+	__u8 kernel_version[VERSION_MAX_LEN];
+	__u32 reserved[3];
+	__u32 cpld_version;
+	__u32 infineon_version;
+	__u8 fuse_version[VERSION_MAX_LEN];
+	__u8 thermal_version[VERSION_MAX_LEN];
+	__u8 armcp_version[VERSION_MAX_LEN];
+	__u64 dram_size;
+};
+
+#endif /* HABANALABS_DEVICE_IF_H */
-- 
2.17.1
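
Note on the polling helpers in this patch: hl_poll_timeout_memory() and
hl_poll_timeout_device_memory() share the same loop (read, test for
non-zero, test the deadline, sleep). Below is a minimal user-space sketch
of that loop, not the driver code. It assumes POSIX clock_gettime() and
nanosleep() in place of the kernel's ktime_get() and usleep_range();
poll_until_nonzero() and now_us() are hypothetical names used only for
illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic time in microseconds. */
static int64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Poll *addr until it is non-zero or timeout_us has elapsed.
 * The last value read is always stored in *val.
 * Returns 0 on success and -ETIMEDOUT on timeout.
 */
int poll_until_nonzero(const volatile uint32_t *addr, uint32_t timeout_us,
		       uint32_t *val)
{
	int64_t deadline = now_us() + timeout_us;
	struct timespec nap = { .tv_sec = 0, .tv_nsec = 100 * 1000 };

	assert(addr && val);

	for (;;) {
		*val = *addr;
		if (*val)
			break;
		if (now_us() > deadline) {
			/* One last read, in case the write landed late */
			*val = *addr;
			break;
		}
		nanosleep(&nap, NULL);	/* about 100us between reads */
	}

	return *val ? 0 : -ETIMEDOUT;
}
```

As in the patch, the value is read once more after the deadline passes, so
a write that lands during the final sleep is still reported as success.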



* [PATCH 03/15] habanalabs: add basic Goya support
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
  2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23 12:28   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 04/15] habanalabs: add context and ASID modules Oded Gabbay
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds basic support for the Goya device. The code initializes
the device's PCI controller and PCI BARs. It also initializes various S/W
structures and adds some basic helper functions.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile            |   5 +-
 drivers/misc/habanalabs/device.c            |  71 +++
 drivers/misc/habanalabs/goya/Makefile       |   3 +
 drivers/misc/habanalabs/goya/goya.c         | 633 ++++++++++++++++++++
 drivers/misc/habanalabs/goya/goyaP.h        | 125 ++++
 drivers/misc/habanalabs/habanalabs.h        | 131 ++++
 drivers/misc/habanalabs/habanalabs_drv.c    |   3 +
 drivers/misc/habanalabs/include/goya/goya.h | 115 ++++
 8 files changed, 1085 insertions(+), 1 deletion(-)
 create mode 100644 drivers/misc/habanalabs/goya/Makefile
 create mode 100644 drivers/misc/habanalabs/goya/goya.c
 create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
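
Note on the device.c hunks below: the common code picks a per-ASIC table
of function pointers by ASIC type and then calls through hdev->asic_funcs.
The following is a stand-alone user-space sketch of that dispatch, under
stated assumptions: the struct layouts and every demo_* / *_DEMO name are
hypothetical and exist only for illustration; the real function table is
not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

enum demo_asic_type { ASIC_GOYA_DEMO = 1, ASIC_LAST_DEMO };

struct demo_device;

/* Hypothetical per-ASIC function table; early_fini is optional. */
struct demo_asic_funcs {
	int (*early_init)(struct demo_device *dev);
	void (*early_fini)(struct demo_device *dev);
};

struct demo_device {
	enum demo_asic_type type;
	char name[16];
	const struct demo_asic_funcs *funcs;
	int early_init_calls;
};

static int goya_demo_early_init(struct demo_device *dev)
{
	dev->early_init_calls++;
	return 0;
}

static const struct demo_asic_funcs goya_demo_funcs = {
	.early_init = goya_demo_early_init,
	.early_fini = NULL,
};

/* Select the table by ASIC type, then call the mandatory hook. */
int demo_early_init(struct demo_device *dev)
{
	assert(dev);

	switch (dev->type) {
	case ASIC_GOYA_DEMO:
		dev->funcs = &goya_demo_funcs;
		snprintf(dev->name, sizeof(dev->name), "GOYA");
		break;
	default:
		return -EINVAL;
	}

	return dev->funcs->early_init(dev);
}

/* Optional hooks are checked for NULL before the call. */
void demo_early_fini(struct demo_device *dev)
{
	if (dev->funcs && dev->funcs->early_fini)
		dev->funcs->early_fini(dev);
}
```

The sketch mirrors the split in the patch: early_init is called
unconditionally once a table is installed, while early_fini is called only
if the ASIC provides it.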

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index b41433a09e02..6f1ead69bd77 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -4,4 +4,7 @@
 
 obj-m	:= habanalabs.o
 
-habanalabs-y := habanalabs_drv.o device.o
\ No newline at end of file
+habanalabs-y := habanalabs_drv.o device.o
+
+include $(src)/goya/Makefile
+habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 376b55eb73d4..a4276ef559b3 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -116,8 +116,11 @@ static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
  */
 static int device_early_init(struct hl_device *hdev)
 {
+	int rc;
+
 	switch (hdev->asic_type) {
 	case ASIC_GOYA:
+		goya_set_asic_funcs(hdev);
 		sprintf(hdev->asic_name, "GOYA");
 		break;
 	default:
@@ -126,6 +129,10 @@ static int device_early_init(struct hl_device *hdev)
 		return -EINVAL;
 	}
 
+	rc = hdev->asic_funcs->early_init(hdev);
+	if (rc)
+		return rc;
+
 	return 0;
 }
 
@@ -137,6 +144,10 @@ static int device_early_init(struct hl_device *hdev)
  */
 static void device_early_fini(struct hl_device *hdev)
 {
+
+	if (hdev->asic_funcs->early_fini)
+		hdev->asic_funcs->early_fini(hdev);
+
 }
 
 /**
@@ -150,8 +161,15 @@ static void device_early_fini(struct hl_device *hdev)
  */
 int hl_device_suspend(struct hl_device *hdev)
 {
+	int rc;
+
 	pci_save_state(hdev->pdev);
 
+	rc = hdev->asic_funcs->suspend(hdev);
+	if (rc)
+		dev_err(hdev->dev,
+			"Failed to disable PCI access of device CPU\n");
+
 	/* Shut down the device */
 	pci_disable_device(hdev->pdev);
 	pci_set_power_state(hdev->pdev, PCI_D3hot);
@@ -181,6 +199,13 @@ int hl_device_resume(struct hl_device *hdev)
 		return rc;
 	}
 
+	rc = hdev->asic_funcs->resume(hdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to enable PCI access from device CPU\n");
+		return rc;
+	}
+
 	return 0;
 }
 
@@ -208,11 +233,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	if (rc)
 		goto release_device;
 
+	/*
+	 * Start calling ASIC initialization. First S/W then H/W and finally
+	 * late init
+	 */
+	rc = hdev->asic_funcs->sw_init(hdev);
+	if (rc)
+		goto early_fini;
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+early_fini:
+	device_early_fini(hdev);
 release_device:
 	device_destroy(hclass, hdev->dev->devt);
 	cdev_del(&hdev->cdev);
@@ -243,6 +278,9 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	/* Call ASIC S/W finalize function */
+	hdev->asic_funcs->sw_fini(hdev);
+
 	device_early_fini(hdev);
 
 	/* Hide device from user */
@@ -329,3 +367,36 @@ int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
 
 	return (*val ? 0 : -ETIMEDOUT);
 }
+
+/*
+ * MMIO register access helper functions.
+ */
+
+/**
+ * hl_rreg - Read an MMIO register
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @reg: MMIO register offset (in bytes)
+ *
+ * Returns the value of the MMIO register we are asked to read
+ *
+ */
+inline u32 hl_rreg(struct hl_device *hdev, u32 reg)
+{
+	return readl(hdev->rmmio + reg);
+}
+
+/**
+ * hl_wreg - Write to an MMIO register
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @reg: MMIO register offset (in bytes)
+ * @val: 32-bit value
+ *
+ * Writes the 32-bit value into the MMIO register
+ *
+ */
+inline void hl_wreg(struct hl_device *hdev, u32 reg, u32 val)
+{
+	writel(val, hdev->rmmio + reg);
+}
diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
new file mode 100644
index 000000000000..5ebf3d0d5794
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/Makefile
@@ -0,0 +1,3 @@
+subdir-ccflags-y += -I$(src)
+
+HL_GOYA_FILES :=  goya/goya.o
\ No newline at end of file
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
new file mode 100644
index 000000000000..b2952296b890
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -0,0 +1,633 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "goyaP.h"
+#include "include/goya/asic_reg/goya_masks.h"
+
+#include <linux/fs.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/genalloc.h>
+#include <linux/sysfs.h>
+#include <linux/kfifo.h>
+#include <linux/dma-mapping.h>
+#include <linux/firmware.h>
+#include <linux/log2.h>
+#include <linux/hwmon.h>
+#include <linux/string.h>
+#include <linux/io.h>
+
+/*
+ * GOYA security scheme:
+ *
+ * 1. Host is protected by:
+ *        - Range registers (When MMU is enabled, DMA RR does NOT protect host)
+ *        - MMU
+ *
+ * 2. DRAM is protected by:
+ *        - Range registers (protect the first 512MB)
+ *        - MMU (isolation between users)
+ *
+ * 3. Configuration is protected by:
+ *        - Range registers
+ *        - Protection bits
+ *
+ * When MMU is disabled:
+ *
+ * QMAN DMA: PQ, CQ, CP, DMA are secured.
+ * PQ, CB and the data are on the host.
+ *
+ * QMAN TPC/MME:
+ * PQ, CQ and CP are not secured.
+ * PQ, CB and the data are on the SRAM/DRAM.
+ *
+ * Since QMAN DMA is secured, KMD is parsing the DMA CB:
+ *     - KMD checks DMA pointer
+ *     - WREG, MSG_PROT are not allowed.
+ *     - MSG_LONG/SHORT are allowed.
+ *
+ * A read/write transaction by the QMAN to a protected area will succeed if
+ * and only if the QMAN's CP is secured and MSG_PROT is used
+ *
+ *
+ * When MMU is enabled:
+ *
+ * QMAN DMA: PQ, CQ and CP are secured.
+ * MMU is set to bypass on the Secure props register of the QMAN.
+ * The reasons we don't enable MMU for PQ, CQ and CP are:
+ *     - PQ entry is in kernel address space and KMD doesn't map it.
+ *     - CP writes to MSIX register and to kernel address space (completion
+ *       queue).
+ *
+ * DMA is not secured but because CP is secured, KMD still needs to parse the
+ * CB, but doesn't need to check the DMA addresses.
+ *
+ * For QMAN DMA 0, DMA is also secured because only KMD uses this DMA and KMD
+ * doesn't map memory in MMU.
+ *
+ * QMAN TPC/MME: PQ, CQ and CP aren't secured (no change from MMU disabled mode)
+ *
+ * DMA RR does NOT protect host because DMA is not secured
+ *
+ */
+
+#define GOYA_MMU_REGS_NUM		61
+
+#define GOYA_DMA_POOL_BLK_SIZE		0x100		/* 256 bytes */
+
+#define GOYA_RESET_TIMEOUT_MSEC		500		/* 500ms */
+#define GOYA_PLDM_RESET_TIMEOUT_MSEC	20000		/* 20s */
+#define GOYA_RESET_WAIT_MSEC		1		/* 1ms */
+#define GOYA_CPU_RESET_WAIT_MSEC	100		/* 100ms */
+#define GOYA_PLDM_RESET_WAIT_MSEC	1000		/* 1s */
+#define GOYA_CPU_TIMEOUT_USEC		10000000	/* 10s */
+#define GOYA_TEST_QUEUE_WAIT_USEC	100000		/* 100ms */
+
+#define GOYA_QMAN0_FENCE_VAL		0xD169B243
+
+#define GOYA_MAX_INITIATORS		20
+
+static void goya_get_fixed_properties(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+
+	prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
+
+	prop->dram_base_address = DRAM_PHYS_BASE;
+	prop->dram_size = DRAM_PHYS_DEFAULT_SIZE;
+	prop->dram_end_address = prop->dram_base_address + prop->dram_size;
+	prop->dram_user_base_address = DRAM_BASE_ADDR_USER;
+
+	prop->sram_base_address = SRAM_BASE_ADDR;
+	prop->sram_size = SRAM_SIZE;
+	prop->sram_end_address = prop->sram_base_address + prop->sram_size;
+	prop->sram_user_base_address = prop->sram_base_address +
+						SRAM_USER_BASE_OFFSET;
+
+	prop->host_phys_base_address = HOST_PHYS_BASE;
+	prop->va_space_host_start_address = VA_HOST_SPACE_START;
+	prop->va_space_host_end_address = VA_HOST_SPACE_END;
+	prop->va_space_dram_start_address = VA_DDR_SPACE_START;
+	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
+	prop->cfg_size = CFG_SIZE;
+	prop->max_asid = MAX_ASID;
+	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
+
+	prop->high_pll = PLL_HIGH_DEFAULT;
+}
+
+/**
+ * goya_pci_bars_map - Map PCI BARS of Goya device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Request PCI regions and map them to kernel virtual addresses.
+ * Returns 0 on success
+ *
+ */
+int goya_pci_bars_map(struct hl_device *hdev)
+{
+	struct pci_dev *pdev = hdev->pdev;
+	int rc;
+
+	rc = pci_request_regions(pdev, HL_NAME);
+	if (rc) {
+		dev_err(hdev->dev, "Cannot obtain PCI resources\n");
+		return rc;
+	}
+
+	hdev->pcie_bar[SRAM_CFG_BAR_ID] =
+			pci_ioremap_bar(pdev, SRAM_CFG_BAR_ID);
+	if (!hdev->pcie_bar[SRAM_CFG_BAR_ID]) {
+		dev_err(hdev->dev, "pci_ioremap_bar failed for CFG\n");
+		rc = -ENODEV;
+		goto err_release_regions;
+	}
+
+	hdev->pcie_bar[MSIX_BAR_ID] = pci_ioremap_bar(pdev, MSIX_BAR_ID);
+	if (!hdev->pcie_bar[MSIX_BAR_ID]) {
+		dev_err(hdev->dev, "pci_ioremap_bar failed for MSIX\n");
+		rc = -ENODEV;
+		goto err_unmap_sram_cfg;
+	}
+
+	hdev->pcie_bar[DDR_BAR_ID] = pci_ioremap_wc_bar(pdev, DDR_BAR_ID);
+	if (!hdev->pcie_bar[DDR_BAR_ID]) {
+		dev_err(hdev->dev, "pci_ioremap_wc_bar failed for DDR\n");
+		rc = -ENODEV;
+		goto err_unmap_msix;
+	}
+
+	hdev->rmmio = hdev->pcie_bar[SRAM_CFG_BAR_ID] +
+				(CFG_BASE - SRAM_BASE_ADDR);
+
+	return 0;
+
+err_unmap_msix:
+	iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
+err_unmap_sram_cfg:
+	iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
+err_release_regions:
+	pci_release_regions(pdev);
+
+	return rc;
+}
+
+/**
+ * goya_pci_bars_unmap - Unmap PCI BARS of Goya device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Release all PCI BARS and unmap their virtual addresses
+ *
+ */
+static void goya_pci_bars_unmap(struct hl_device *hdev)
+{
+	struct pci_dev *pdev = hdev->pdev;
+
+	iounmap(hdev->pcie_bar[DDR_BAR_ID]);
+	iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
+	iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
+	pci_release_regions(pdev);
+}
+
+/**
+ * goya_elbi_write - Write through the ELBI interface
+ * @hdev: pointer to hl_device structure
+ * @addr: device address to write to; only the low 32 bits are used
+ * @data: 32-bit value to write
+ *
+ * Returns 0 on success, -1 on failure
+ */
+static int goya_elbi_write(struct hl_device *hdev, u64 addr, u32 data)
+{
+	struct pci_dev *pdev = hdev->pdev;
+	ktime_t timeout;
+	u32 val;
+
+	/* Clear previous status */
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, 0);
+
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_ADDR, (u32) addr);
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_DATA, data);
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_CTRL,
+				PCI_CONFIG_ELBI_CTRL_WRITE);
+
+	timeout = ktime_add_ms(ktime_get(), 10);
+	for (;;) {
+		pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, &val);
+		if (val & PCI_CONFIG_ELBI_STS_MASK)
+			break;
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS,
+						&val);
+			break;
+		}
+		usleep_range(300, 500);
+	}
+
+	if ((val & PCI_CONFIG_ELBI_STS_MASK) == PCI_CONFIG_ELBI_STS_DONE)
+		return 0;
+
+	if (val & PCI_CONFIG_ELBI_STS_ERR) {
+		dev_err(hdev->dev, "Error writing to ELBI\n");
+		return -1;
+	}
+
+	if (!(val & PCI_CONFIG_ELBI_STS_MASK)) {
+		dev_err(hdev->dev, "ELBI write didn't finish in time\n");
+		return -1;
+	}
+
+	dev_err(hdev->dev, "ELBI write has undefined bits in status\n");
+	return -1;
+}
+
+/**
+ * goya_iatu_write - iatu write routine
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static int goya_iatu_write(struct hl_device *hdev, u32 addr, u32 data)
+{
+	u32 dbi_offset;
+	int rc;
+
+	dbi_offset = addr & 0xFFF;
+
+	rc = goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0x00300000);
+	rc |= goya_elbi_write(hdev, mmPCIE_DBI_BASE + dbi_offset, data);
+
+	return rc;
+}
+
+void goya_reset_link_through_bridge(struct hl_device *hdev)
+{
+	struct pci_dev *pdev = hdev->pdev;
+	struct pci_dev *parent_port;
+	u16 val;
+
+	parent_port = pdev->bus->self;
+	pci_read_config_word(parent_port, PCI_BRIDGE_CONTROL, &val);
+	val |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
+	ssleep(1);
+
+	val &= ~(PCI_BRIDGE_CTL_BUS_RESET);
+	pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
+	ssleep(3);
+}
+
+/**
+ * goya_set_ddr_bar_base - set DDR bar to map specific device address
+ *
+ * @hdev: pointer to hl_device structure
+ * @addr: address in DDR. Must be aligned to DDR bar size
+ *
+ * This function configures the iATU so that the DDR bar will start at the
+ * specified addr.
+ *
+ */
+static int goya_set_ddr_bar_base(struct hl_device *hdev, u64 addr)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+
+	if ((goya) && (goya->ddr_bar_cur_addr == addr))
+		return 0;
+
+	/* Inbound Region 1 - Bar 4 - Point to DDR */
+	rc = goya_iatu_write(hdev, 0x314, lower_32_bits(addr));
+	rc |= goya_iatu_write(hdev, 0x318, upper_32_bits(addr));
+	rc |= goya_iatu_write(hdev, 0x300, 0);
+	/* Enable + Bar match + match enable + Bar 4 */
+	rc |= goya_iatu_write(hdev, 0x304, 0xC0080400);
+
+	/* Return the DBI window to the default location */
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to map DDR bar to 0x%08llx\n", addr);
+		return rc;
+	}
+
+	if (goya)
+		goya->ddr_bar_cur_addr = addr;
+
+	return 0;
+}
+
+/**
+ * goya_init_iatu - Initialize the iATU unit inside the PCI controller
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * This is needed in case the firmware doesn't initialize the iATU
+ *
+ */
+static int goya_init_iatu(struct hl_device *hdev)
+{
+	int rc;
+
+	/* Inbound Region 0 - Bar 0 - Point to SRAM_BASE_ADDR */
+	rc  = goya_iatu_write(hdev, 0x114, lower_32_bits(SRAM_BASE_ADDR));
+	rc |= goya_iatu_write(hdev, 0x118, upper_32_bits(SRAM_BASE_ADDR));
+	rc |= goya_iatu_write(hdev, 0x100, 0);
+	/* Enable + Bar match + match enable */
+	rc |= goya_iatu_write(hdev, 0x104, 0xC0080000);
+
+	/* Inbound Region 1 - Bar 4 - Point to DDR */
+	rc |= goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
+
+	/* Outbound Region 0 - Point to Host */
+	rc |= goya_iatu_write(hdev, 0x008, lower_32_bits(HOST_PHYS_BASE));
+	rc |= goya_iatu_write(hdev, 0x00C, upper_32_bits(HOST_PHYS_BASE));
+	rc |= goya_iatu_write(hdev, 0x010,
+		lower_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
+	rc |= goya_iatu_write(hdev, 0x014, 0);
+	rc |= goya_iatu_write(hdev, 0x018, 0);
+	rc |= goya_iatu_write(hdev, 0x020,
+		upper_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
+	/* Increase region size */
+	rc |= goya_iatu_write(hdev, 0x000, 0x00002000);
+	/* Enable */
+	rc |= goya_iatu_write(hdev, 0x004, 0x80000000);
+
+	/* Return the DBI window to the default location */
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
+
+	return rc;
+}
+
+/**
+ * goya_early_init - GOYA early initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Verify PCI bars
+ * Set DMA masks
+ * PCI controller initialization
+ * Map PCI bars
+ *
+ */
+static int goya_early_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct pci_dev *pdev = hdev->pdev;
+	u32 val;
+	int rc;
+
+	goya_get_fixed_properties(hdev);
+
+	/* Check BAR sizes */
+	if (pci_resource_len(pdev, SRAM_CFG_BAR_ID) != CFG_BAR_SIZE) {
+		dev_err(hdev->dev,
+			"Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
+			SRAM_CFG_BAR_ID,
+			pci_resource_len(pdev, SRAM_CFG_BAR_ID),
+			CFG_BAR_SIZE);
+		return -ENODEV;
+	}
+
+	if (pci_resource_len(pdev, MSIX_BAR_ID) != MSIX_BAR_SIZE) {
+		dev_err(hdev->dev,
+			"Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
+			MSIX_BAR_ID, pci_resource_len(pdev, MSIX_BAR_ID),
+			MSIX_BAR_SIZE);
+		return -ENODEV;
+	}
+
+	prop->dram_pci_bar_size = pci_resource_len(pdev, DDR_BAR_ID);
+
+	/* set DMA mask for GOYA */
+	rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(39));
+	if (rc) {
+		dev_warn(hdev->dev, "Unable to set pci dma mask to 39 bits\n");
+		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(39));
+	if (rc) {
+		dev_warn(hdev->dev,
+			"Unable to set pci consistent dma mask to 39 bits\n");
+		rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci consistent dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	if (hdev->reset_pcilink)
+		goya_reset_link_through_bridge(hdev);
+
+	rc = pci_enable_device_mem(pdev);
+	if (rc) {
+		dev_err(hdev->dev, "can't enable PCI device\n");
+		return rc;
+	}
+
+	pci_set_master(pdev);
+
+	rc = goya_init_iatu(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize iATU\n");
+		goto disable_device;
+	}
+
+	rc = goya_pci_bars_map(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize PCI BARS\n");
+		goto disable_device;
+	}
+
+	val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
+	if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
+		dev_warn(hdev->dev,
+			"PCI strap is not configured correctly, PCI bus errors may occur\n");
+
+	return 0;
+
+disable_device:
+	pci_clear_master(pdev);
+	pci_disable_device(pdev);
+
+	return rc;
+}
+
+/**
+ * goya_early_fini - GOYA early finalization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Unmap PCI bars
+ *
+ */
+int goya_early_fini(struct hl_device *hdev)
+{
+	goya_pci_bars_unmap(hdev);
+
+	pci_clear_master(hdev->pdev);
+	pci_disable_device(hdev->pdev);
+
+	return 0;
+}
+
+/**
+ * goya_sw_init - Goya software initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static int goya_sw_init(struct hl_device *hdev)
+{
+	struct goya_device *goya;
+	int rc;
+
+	/* Allocate device structure */
+	goya = kzalloc(sizeof(*goya), GFP_KERNEL);
+	if (!goya)
+		return -ENOMEM;
+
+	/* according to goya_init_iatu */
+	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
+	hdev->asic_specific = goya;
+
+	/* Create DMA pool for small allocations */
+	hdev->dma_pool = dma_pool_create(dev_name(hdev->dev),
+			&hdev->pdev->dev, GOYA_DMA_POOL_BLK_SIZE, 8, 0);
+	if (!hdev->dma_pool) {
+		dev_err(hdev->dev, "failed to create DMA pool\n");
+		rc = -ENOMEM;
+		goto free_goya_device;
+	}
+
+	hdev->cpu_accessible_dma_mem =
+			hdev->asic_funcs->dma_alloc_coherent(hdev,
+					CPU_ACCESSIBLE_MEM_SIZE,
+					&hdev->cpu_accessible_dma_address,
+					GFP_KERNEL | __GFP_ZERO);
+
+	if (!hdev->cpu_accessible_dma_mem) {
+		dev_err(hdev->dev,
+			"failed to allocate %d of dma memory for CPU accessible memory space\n",
+			CPU_ACCESSIBLE_MEM_SIZE);
+		rc = -ENOMEM;
+		goto free_dma_pool;
+	}
+
+	hdev->cpu_accessible_dma_pool = gen_pool_create(CPU_PKT_SHIFT, -1);
+	if (!hdev->cpu_accessible_dma_pool) {
+		dev_err(hdev->dev,
+			"Failed to create CPU accessible DMA pool\n");
+		rc = -ENOMEM;
+		goto free_cpu_pq_dma_mem;
+	}
+
+	rc = gen_pool_add(hdev->cpu_accessible_dma_pool,
+				(u64) hdev->cpu_accessible_dma_mem,
+				CPU_ACCESSIBLE_MEM_SIZE, -1);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to add memory to CPU accessible DMA pool\n");
+		rc = -EFAULT;
+		goto free_cpu_pq_pool;
+	}
+
+	spin_lock_init(&goya->hw_queues_lock);
+
+	return 0;
+
+free_cpu_pq_pool:
+	gen_pool_destroy(hdev->cpu_accessible_dma_pool);
+free_cpu_pq_dma_mem:
+	hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
+			hdev->cpu_accessible_dma_mem,
+			hdev->cpu_accessible_dma_address);
+free_dma_pool:
+	dma_pool_destroy(hdev->dma_pool);
+free_goya_device:
+	kfree(goya);
+
+	return rc;
+}
+
+/**
+ * goya_sw_fini - Goya software tear-down code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+int goya_sw_fini(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	gen_pool_destroy(hdev->cpu_accessible_dma_pool);
+
+	hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
+			hdev->cpu_accessible_dma_mem,
+			hdev->cpu_accessible_dma_address);
+
+	dma_pool_destroy(hdev->dma_pool);
+
+	kfree(goya);
+
+	return 0;
+}
+
+int goya_suspend(struct hl_device *hdev)
+{
+	return 0;
+}
+
+int goya_resume(struct hl_device *hdev)
+{
+	return 0;
+}
+
+void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
+					dma_addr_t *dma_handle, gfp_t flags)
+{
+	return dma_alloc_coherent(&hdev->pdev->dev, size, dma_handle, flags);
+}
+
+void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
+				dma_addr_t dma_handle)
+{
+	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
+}
+
+static const struct hl_asic_funcs goya_funcs = {
+	.early_init = goya_early_init,
+	.early_fini = goya_early_fini,
+	.sw_init = goya_sw_init,
+	.sw_fini = goya_sw_fini,
+	.suspend = goya_suspend,
+	.resume = goya_resume,
+	.dma_alloc_coherent = goya_dma_alloc_coherent,
+	.dma_free_coherent = goya_dma_free_coherent,
+};
+
+/**
+ * goya_set_asic_funcs - set Goya function pointers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+void goya_set_asic_funcs(struct hl_device *hdev)
+{
+	hdev->asic_funcs = &goya_funcs;
+}
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
new file mode 100644
index 000000000000..0e12c56472bd
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef GOYAP_H_
+#define GOYAP_H_
+
+#include "habanalabs.h"
+#include "include/goya/goya.h"
+
+#define NUMBER_OF_CMPLT_QUEUES		5
+#define NUMBER_OF_EXT_HW_QUEUES		5
+#define NUMBER_OF_CPU_HW_QUEUES		1
+#define NUMBER_OF_INT_HW_QUEUES		9
+#define NUMBER_OF_HW_QUEUES		(NUMBER_OF_EXT_HW_QUEUES + \
+					NUMBER_OF_CPU_HW_QUEUES + \
+					NUMBER_OF_INT_HW_QUEUES)
+
+/*
+ * Number of MSIX interrupt IDs:
+ * Each completion queue has 1 ID
+ * The event queue has 1 ID
+ * ArmCP reset has 1 ID
+ */
+#define NUMBER_OF_INTERRUPTS		(NUMBER_OF_CMPLT_QUEUES + 2)
+
+#if (NUMBER_OF_HW_QUEUES >= HL_MAX_QUEUES)
+#error "Number of H/W queues must be smaller than HL_MAX_QUEUES"
+#endif
+
+#if (NUMBER_OF_INTERRUPTS > GOYA_MSIX_ENTRIES)
+#error "Number of MSIX interrupts must be smaller than or equal to GOYA_MSIX_ENTRIES"
+#endif
+
+#define QMAN_FENCE_TIMEOUT_USEC		10000	/* 10 ms */
+
+#define QMAN_STOP_TIMEOUT_USEC		100000	/* 100 ms */
+
+#define TPC_MAX_NUM			8
+#define TPC_ENABLED_MASK		0xFF
+
+#define DMA_MAX_NUM			5
+
+#define PLL_HIGH_DEFAULT		1575000000	/* 1.575 GHz */
+
+#define GOYA_ARMCP_INFO_TIMEOUT		10000000	/* 10s */
+
+#define DRAM_PHYS_DEFAULT_SIZE		0x100000000ull	/* 4GB */
+
+/*
+ * SRAM Memory Map for KMD
+ *
+ * KMD occupies KMD_SRAM_SIZE bytes from the start of SRAM. It is used for
+ * MME/TPC QMANs
+ *
+ */
+
+#define MME_QMAN_BASE_OFFSET	0x000000	/* Must be 0 */
+#define MME_QMAN_LENGTH		64
+#define TPC_QMAN_LENGTH		64
+
+#define TPC0_QMAN_BASE_OFFSET	(MME_QMAN_BASE_OFFSET + \
+				(MME_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC1_QMAN_BASE_OFFSET	(TPC0_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC2_QMAN_BASE_OFFSET	(TPC1_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC3_QMAN_BASE_OFFSET	(TPC2_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC4_QMAN_BASE_OFFSET	(TPC3_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC5_QMAN_BASE_OFFSET	(TPC4_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC6_QMAN_BASE_OFFSET	(TPC5_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC7_QMAN_BASE_OFFSET	(TPC6_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+
+#define SRAM_KMD_RES_OFFSET	(TPC7_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+
+#if (SRAM_KMD_RES_OFFSET >= KMD_SRAM_RESERVED_SIZE)
+#error "MME/TPC QMANs SRAM space exceeds limit"
+#endif
+
+#define SRAM_USER_BASE_OFFSET	KMD_SRAM_RESERVED_SIZE
+
+#define DMA_MAX_TRANSFER_SIZE	0xFFFFFFFF
+
+#define HW_CAP_PLL		0x00000001
+#define HW_CAP_DDR_0		0x00000002
+#define HW_CAP_DDR_1		0x00000004
+#define HW_CAP_MME		0x00000008
+#define HW_CAP_CPU		0x00000010
+#define HW_CAP_DMA		0x00000020
+#define HW_CAP_MSIX		0x00000040
+#define HW_CAP_CPU_Q		0x00000080
+#define HW_CAP_MMU		0x00000100
+#define HW_CAP_TPC_MBIST	0x00000200
+#define HW_CAP_GOLDEN		0x00000400
+#define HW_CAP_TPC		0x00000800
+
+#define CPU_PKT_SHIFT		5
+#define CPU_PKT_SIZE		(1 << CPU_PKT_SHIFT)
+#define CPU_PKT_MASK		(~((1 << CPU_PKT_SHIFT) - 1))
+#define CPU_MAX_PKTS_IN_CB	32
+#define CPU_CB_SIZE		(CPU_PKT_SIZE * CPU_MAX_PKTS_IN_CB)
+#define CPU_ACCESSIBLE_MEM_SIZE	(HL_QUEUE_LENGTH * CPU_CB_SIZE)
+
+enum goya_fw_component {
+	FW_COMP_UBOOT,
+	FW_COMP_PREBOOT
+};
+
+struct goya_device {
+	/* TODO: remove hw_queues_lock after moving to scheduler code */
+	spinlock_t	hw_queues_lock;
+	u64		ddr_bar_cur_addr;
+	u32		hw_cap_initialized;
+};
+
+#endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 7e1b088b677c..97844825f7a8 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -21,11 +21,64 @@
 
 #define HL_NAME				"habanalabs"
 
+#define HL_MAX_QUEUES			128
+
 struct hl_device;
 
 
 
 
+/**
+ * struct asic_fixed_properties - ASIC specific immutable properties.
+ * @sram_base_address: SRAM physical start address.
+ * @sram_end_address: SRAM physical end address.
+ * @sram_user_base_address: SRAM physical start address for user access.
+ * @dram_base_address: DRAM physical start address.
+ * @dram_end_address: DRAM physical end address.
+ * @dram_user_base_address: DRAM physical start address for user access.
+ * @dram_size: DRAM total size.
+ * @dram_pci_bar_size: size of PCI bar towards DRAM.
+ * @host_phys_base_address: base physical address of host memory for
+ *				transactions that the device generates.
+ * @va_space_host_start_address: base address of virtual memory range for
+ *                               mapping host memory.
+ * @va_space_host_end_address: end address of virtual memory range for
+ *                             mapping host memory.
+ * @va_space_dram_start_address: base address of virtual memory range for
+ *                               mapping DRAM memory.
+ * @va_space_dram_end_address: end address of virtual memory range for
+ *                             mapping DRAM memory.
+ * @cfg_size: configuration space size on SRAM.
+ * @sram_size: total size of SRAM.
+ * @max_asid: maximum number of open contexts (ASIDs).
+ * @completion_queues_count: number of completion queues.
+ * @high_pll: high PLL frequency used by the device.
+ * @tpc_enabled_mask: which TPCs are enabled.
+ */
+struct asic_fixed_properties {
+	u64			sram_base_address;
+	u64			sram_end_address;
+	u64			sram_user_base_address;
+	u64			dram_base_address;
+	u64			dram_end_address;
+	u64			dram_user_base_address;
+	u64			dram_size;
+	u64			dram_pci_bar_size;
+	u64			host_phys_base_address;
+	u64			va_space_host_start_address;
+	u64			va_space_host_end_address;
+	u64			va_space_dram_start_address;
+	u64			va_space_dram_end_address;
+	u32			cfg_size;
+	u32			sram_size;
+	u32			max_asid;
+	u32			high_pll;
+	u8			completion_queues_count;
+	u8			tpc_enabled_mask;
+};
+
+
+#define HL_QUEUE_LENGTH			256
 
 
 /*
@@ -47,6 +100,30 @@ enum hl_asic_type {
 
 
 
+/**
+ * struct hl_asic_funcs - ASIC specific functions that can be called from
+ *                        common code.
+ * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
+ * @early_fini: tears down what was done in early_init.
+ * @sw_init: sets up driver state, does not configure H/W.
+ * @sw_fini: tears down driver state, does not configure H/W.
+ * @suspend: handles IP specific H/W or SW changes for suspend.
+ * @resume: handles IP specific H/W or SW changes for resume.
+ * @dma_alloc_coherent: DMA allocate coherent memory.
+ * @dma_free_coherent: free DMA allocation.
+ */
+struct hl_asic_funcs {
+	int (*early_init)(struct hl_device *hdev);
+	int (*early_fini)(struct hl_device *hdev);
+	int (*sw_init)(struct hl_device *hdev);
+	int (*sw_fini)(struct hl_device *hdev);
+	int (*suspend)(struct hl_device *hdev);
+	int (*resume)(struct hl_device *hdev);
+	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
+					dma_addr_t *dma_handle, gfp_t flag);
+	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
+					void *cpu_addr, dma_addr_t dma_handle);
+};
 
 /*
  * FILE PRIVATE STRUCTURE
@@ -78,26 +155,78 @@ struct hl_fpriv {
  */
 #define HL_MAX_MINORS	256
 
+/*
+ * Registers read & write functions.
+ */
+
+u32 hl_rreg(struct hl_device *hdev, u32 reg);
+void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
+
+#define hl_poll_timeout(hdev, addr, val, cond, sleep_us, timeout_us) \
+	readl_poll_timeout(hdev->rmmio + addr, val, cond, sleep_us, timeout_us)
+
+#define RREG32(reg) hl_rreg(hdev, (reg))
+#define WREG32(reg, v) hl_wreg(hdev, (reg), (v))
+#define DREG32(reg) pr_info("REGISTER: " #reg " : 0x%08X\n",	\
+				hl_rreg(hdev, (reg)))
+
+#define WREG32_P(reg, val, mask)				\
+	do {							\
+		u32 tmp_ = RREG32(reg);				\
+		tmp_ &= (mask);					\
+		tmp_ |= ((val) & ~(mask));			\
+		WREG32(reg, tmp_);				\
+	} while (0)
+#define WREG32_AND(reg, and) WREG32_P(reg, 0, and)
+#define WREG32_OR(reg, or) WREG32_P(reg, or, ~(or))
+
+#define REG_FIELD_SHIFT(reg, field) reg##_##field##_SHIFT
+#define REG_FIELD_MASK(reg, field) reg##_##field##_MASK
+#define WREG32_FIELD(reg, field, val)	\
+	WREG32(mm##reg, (RREG32(mm##reg) & ~REG_FIELD_MASK(reg, field)) | \
+			(val) << REG_FIELD_SHIFT(reg, field))
+
 /**
  * struct hl_device - habanalabs device structure.
  * @pdev: pointer to PCI device, can be NULL in case of simulator device.
+ * @pcie_bar: array of available PCIe bars.
+ * @rmmio: configuration area address on SRAM.
  * @cdev: related char device.
  * @dev: realted kernel basic device structure.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
+ * @dma_pool: DMA pool for small allocations.
+ * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
+ * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
+ * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
+ * @asic_prop: ASIC specific immutable properties.
+ * @asic_funcs: ASIC specific functions.
+ * @asic_specific: ASIC specific information to use only from ASIC files.
  * @major: habanalabs KMD major.
  * @id: device minor.
  * @disabled: is device disabled.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
+	void __iomem			*pcie_bar[6];
+	void __iomem			*rmmio;
 	struct cdev			cdev;
 	struct device			*dev;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
+	struct dma_pool			*dma_pool;
+	void				*cpu_accessible_dma_mem;
+	dma_addr_t			cpu_accessible_dma_address;
+	struct gen_pool			*cpu_accessible_dma_pool;
+	struct asic_fixed_properties	asic_prop;
+	const struct hl_asic_funcs	*asic_funcs;
+	void				*asic_specific;
 	u32				major;
 	u16				id;
 	u8				disabled;
+
+	/* Parameters for bring-up */
+	u8				reset_pcilink;
 };
 
 /*
@@ -146,4 +275,6 @@ void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
 
+void goya_set_asic_funcs(struct hl_device *hdev);
+
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 15217975327b..79545003b7c2 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -136,6 +136,9 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 
 	hdev->major = hl_major;
 
+	/* Parameters for bring-up - set them to defaults */
+	hdev->reset_pcilink = 0;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
new file mode 100644
index 000000000000..192a1450cbb1
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ * Author: Oded Gabbay <oded.gabbay@gmail.com>
+ *
+ */
+
+#ifndef GOYA_H
+#define GOYA_H
+
+#include "asic_reg/goya_regs.h"
+
+#include <linux/types.h>
+
+#define SRAM_CFG_BAR_ID		0
+#define MSIX_BAR_ID		2
+#define DDR_BAR_ID		4
+
+#define CFG_BAR_SIZE		0x10000000ull		/* 256MB */
+#define MSIX_BAR_SIZE		0x1000ull		/* 4KB */
+
+#define CFG_BASE		0x7FFC000000ull
+#define CFG_SIZE		0x4000000		/* 32MB CFG + 32MB DBG*/
+
+#define SRAM_BASE_ADDR		0x7FF0000000ull
+#define SRAM_SIZE		0x32A0000		/* 50.625MB */
+#define KMD_SRAM_RESERVED_SIZE	0x8000			/* 32KB */
+
+#define SRAM_BASE_ADDR_USER	(0x7FF0000000ull + KMD_SRAM_RESERVED_SIZE)
+#define SRAM_SIZE_USER		(SRAM_SIZE - KMD_SRAM_RESERVED_SIZE)
+
+#define DRAM_PHYS_BASE		0x0ull
+
+#define CPU_FW_IMAGE_SIZE	0x10000000	/* 256MB */
+#define MMU_PAGE_TABLES_SIZE	0x0E000000	/* 224MB */
+#define CPU_PQ_PKT_SIZE		0x00001000	/* 4KB */
+#define CPU_PQ_DATA_SIZE	0x01FFF000	/* 32MB - 4KB  */
+
+#define CPU_FW_IMAGE_ADDR	DRAM_PHYS_BASE
+#define MMU_PAGE_TABLES_ADDR	(CPU_FW_IMAGE_ADDR + CPU_FW_IMAGE_SIZE)
+#define CPU_PQ_PKT_ADDR		(MMU_PAGE_TABLES_ADDR + MMU_PAGE_TABLES_SIZE)
+#define CPU_PQ_DATA_ADDR	(CPU_PQ_PKT_ADDR + CPU_PQ_PKT_SIZE)
+#define DRAM_BASE_ADDR_USER	(CPU_PQ_DATA_ADDR + CPU_PQ_DATA_SIZE)
+
+#define HOST_PHYS_BASE		0x8000000000ull		/* 0.5TB */
+#define HOST_PHYS_SIZE		0x1000000000000ull	/* 0.25PB (48 bits) */
+
+#define VA_HOST_SPACE_START	0x1000000000000ull	/* 256TB */
+#define VA_HOST_SPACE_END	0x3FF8000000000ull	/* 1PB - 512GB */
+#define VA_HOST_SPACE_SIZE	(VA_HOST_SPACE_END - \
+					VA_HOST_SPACE_START) /* 767.5TB */
+
+#define VA_DDR_SPACE_START	0x800000000ull		/* 32GB */
+#define VA_DDR_SPACE_END	0x2000000000ull		/* 128GB */
+#define VA_DDR_SPACE_SIZE	(VA_DDR_SPACE_END - \
+					VA_DDR_SPACE_START)	/* 96GB */
+
+#define CPU_BOOT_ADDR		0x7FF8040000ull
+
+#define UBOOT_FW_OFFSET		0x100000		/* 1MB in SRAM */
+#define LINUX_FW_OFFSET		0x800000		/* 8MB in DDR */
+
+#define GOYA_MSIX_ENTRIES	8
+#define EVENT_QUEUE_MSIX_IDX	5
+#define ARMCP_RESET_MSIX_IDX	6
+
+#define QMAN_PQ_ENTRY_SIZE	16			/* Bytes */
+
+#define MAX_ASID		1024
+
+#define PROT_BITS_OFFS		0xF80
+
+/*
+ * Queue Numbering
+ *
+ * The external queues (DMA channels + CPU) MUST be before the internal queues
+ * and each group (DMA channels + CPU and internal) must be contiguous inside
+ * itself but there can be a gap between the two groups (although not
+ * recommended)
+ */
+
+enum goya_queue_id {
+	GOYA_QUEUE_ID_DMA_0 = 0,
+	GOYA_QUEUE_ID_DMA_1,
+	GOYA_QUEUE_ID_DMA_2,
+	GOYA_QUEUE_ID_DMA_3,
+	GOYA_QUEUE_ID_DMA_4,
+	GOYA_QUEUE_ID_CPU_PQ,
+	GOYA_QUEUE_ID_MME,
+	GOYA_QUEUE_ID_TPC0,
+	GOYA_QUEUE_ID_TPC1,
+	GOYA_QUEUE_ID_TPC2,
+	GOYA_QUEUE_ID_TPC3,
+	GOYA_QUEUE_ID_TPC4,
+	GOYA_QUEUE_ID_TPC5,
+	GOYA_QUEUE_ID_TPC6,
+	GOYA_QUEUE_ID_TPC7,
+	GOYA_QUEUE_ID_SIZE
+};
+
+enum goya_pll_index {
+	CPU_PLL = 0,
+	IC_PLL,
+	MC_PLL,
+	MME_PLL,
+	PCI_PLL,
+	EMMC_PLL,
+	TPC_PLL
+};
+
+#define GOYA_PLL_FREQ_LOW		50000000 /* 50 MHz */
+
+#endif /* GOYA_H */
-- 
2.17.1
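
Note for readers checking the SRAM reservation in goyaP.h: below is a minimal
user-space sketch of the arithmetic behind the
"#if (SRAM_KMD_RES_OFFSET >= KMD_SRAM_RESERVED_SIZE)" guard. The constants are
copied from the patch. The QMAN_BYTES() macro and the two helper functions are
hypothetical names used only for illustration and are not part of the driver.

```c
#include <stddef.h>

/* Constants copied from goya.h / goyaP.h in this patch */
#define QMAN_PQ_ENTRY_SIZE	16	/* bytes per queue entry */
#define MME_QMAN_LENGTH		64	/* entries */
#define TPC_QMAN_LENGTH		64	/* entries */
#define TPC_MAX_NUM		8
#define KMD_SRAM_RESERVED_SIZE	0x8000	/* 32KB */

/* Hypothetical helper macro: size in bytes of a queue with 'len' entries */
#define QMAN_BYTES(len)		((len) * QMAN_PQ_ENTRY_SIZE)

/*
 * Offset of TPC<n>'s queue from the start of SRAM. The MME queue sits at
 * offset 0, followed by the TPC queues back to back.
 */
static size_t tpc_qman_base_offset(unsigned int n)
{
	return QMAN_BYTES(MME_QMAN_LENGTH) +
		(size_t)n * QMAN_BYTES(TPC_QMAN_LENGTH);
}

/* First byte past the last TPC queue, i.e. SRAM_KMD_RES_OFFSET */
static size_t sram_kmd_res_offset(void)
{
	return tpc_qman_base_offset(TPC_MAX_NUM);
}

/* Same condition as the #if/#error guard in goyaP.h, inverted */
_Static_assert(QMAN_BYTES(MME_QMAN_LENGTH) +
	       TPC_MAX_NUM * QMAN_BYTES(TPC_QMAN_LENGTH) <
	       KMD_SRAM_RESERVED_SIZE,
	       "MME/TPC QMANs SRAM space exceeds limit");
```

With these values each queue takes 1KB, TPC0 starts at 0x400, TPC7 at 0x2000,
and the queues end at 0x2400 (9KB), inside the 32KB reserved for the driver.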



* [PATCH 04/15] habanalabs: add context and ASID modules
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
  2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
  2019-01-23  0:00 ` [PATCH 03/15] habanalabs: add basic Goya support Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23 12:28   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 05/15] habanalabs: add command buffer module Oded Gabbay
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds two modules - ASID and context.

Each user process that opens a device's file must have at least one context
before it is able to "work" with the device. Each context has its own
device address-space and contains information about its runtime state (its
active command submissions).

To have address-space separation between contexts, each context is assigned
a unique ASID, which stands for "address-space id". Goya supports up to
1024 ASIDs.

Currently, the driver doesn't support multiple contexts. Therefore, the
user doesn't need to actively create a context. A "primary context" is
created automatically when the user opens the device's file.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
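Note (not part of the commit message): the ASID scheme described above can be
sketched in user space roughly as follows. This is a simplified illustration,
not the driver code: struct asid_pool and the asid_* functions are hypothetical
names, the bitmap is a fixed-size array instead of a kcalloc() allocation, and
locking is left out (the driver takes asid_mutex around the search).

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_ASID	1024	/* value from goya.h */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for hdev->asid_bitmap */
struct asid_pool {
	unsigned long bitmap[(MAX_ASID + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

static void asid_pool_init(struct asid_pool *pool)
{
	memset(pool, 0, sizeof(*pool));
	/* ASID 0 is reserved for the kernel driver, so mark it as taken */
	pool->bitmap[0] |= 1UL;
}

/* Returns the lowest free ASID in [1, MAX_ASID), or 0 if none is left */
static unsigned long asid_alloc(struct asid_pool *pool)
{
	unsigned long i;

	for (i = 0; i < MAX_ASID; i++) {
		unsigned long *word = &pool->bitmap[i / BITS_PER_WORD];
		unsigned long mask = 1UL << (i % BITS_PER_WORD);

		if (!(*word & mask)) {
			*word |= mask;
			return i;
		}
	}

	return 0;
}

static void asid_free(struct asid_pool *pool, unsigned long asid)
{
	/* Same range check that hl_asid_free() guards with WARN() */
	if (asid == 0 || asid >= MAX_ASID)
		return;

	pool->bitmap[asid / BITS_PER_WORD] &=
			~(1UL << (asid % BITS_PER_WORD));
}
```

Because ASID 0 is always marked as used, a return value of 0 from the
allocator can double as the "no free ASID" indication, which is what
hl_ctx_init() relies on when it checks !ctx->asid.
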
 drivers/misc/habanalabs/Makefile         |   2 +-
 drivers/misc/habanalabs/asid.c           |  58 +++++++++
 drivers/misc/habanalabs/context.c        | 155 +++++++++++++++++++++++
 drivers/misc/habanalabs/device.c         |  47 +++++++
 drivers/misc/habanalabs/habanalabs.h     |  70 ++++++++++
 drivers/misc/habanalabs/habanalabs_drv.c |  46 ++++++-
 6 files changed, 375 insertions(+), 3 deletions(-)
 create mode 100644 drivers/misc/habanalabs/asid.c
 create mode 100644 drivers/misc/habanalabs/context.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 6f1ead69bd77..3ffbadc2ca01 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -4,7 +4,7 @@
 
 obj-m	:= habanalabs.o
 
-habanalabs-y := habanalabs_drv.o device.o
+habanalabs-y := habanalabs_drv.o device.o context.o asid.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/asid.c b/drivers/misc/habanalabs/asid.c
new file mode 100644
index 000000000000..0ce84c8f5a47
--- /dev/null
+++ b/drivers/misc/habanalabs/asid.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/slab.h>
+#include <linux/types.h>
+
+int hl_asid_init(struct hl_device *hdev)
+{
+	hdev->asid_bitmap = kcalloc(BITS_TO_LONGS(hdev->asic_prop.max_asid),
+					sizeof(*hdev->asid_bitmap), GFP_KERNEL);
+	if (!hdev->asid_bitmap)
+		return -ENOMEM;
+
+	mutex_init(&hdev->asid_mutex);
+
+	/* ASID 0 is reserved for KMD */
+	set_bit(0, hdev->asid_bitmap);
+
+	return 0;
+}
+
+void hl_asid_fini(struct hl_device *hdev)
+{
+	mutex_destroy(&hdev->asid_mutex);
+	kfree(hdev->asid_bitmap);
+}
+
+unsigned long hl_asid_alloc(struct hl_device *hdev)
+{
+	unsigned long found;
+
+	mutex_lock(&hdev->asid_mutex);
+
+	found = find_first_zero_bit(hdev->asid_bitmap,
+					hdev->asic_prop.max_asid);
+	if (found == hdev->asic_prop.max_asid)
+		found = 0;
+	else
+		set_bit(found, hdev->asid_bitmap);
+
+	mutex_unlock(&hdev->asid_mutex);
+
+	return found;
+}
+
+void hl_asid_free(struct hl_device *hdev, unsigned long asid)
+{
+	if (WARN((asid == 0 || asid >= hdev->asic_prop.max_asid),
+						"Invalid ASID %lu", asid))
+		return;
+	clear_bit(asid, hdev->asid_bitmap);
+}
diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
new file mode 100644
index 000000000000..cdcad077e5cf
--- /dev/null
+++ b/drivers/misc/habanalabs/context.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/sched.h>
+#include <linux/delay.h>
+
+static void hl_ctx_fini(struct hl_ctx *ctx)
+{
+	struct hl_device *hdev = ctx->hdev;
+
+	if (ctx->asid != HL_KERNEL_ASID_ID)
+		hl_asid_free(hdev, ctx->asid);
+}
+
+void hl_ctx_do_release(struct kref *ref)
+{
+	struct hl_ctx *ctx;
+
+	ctx = container_of(ref, struct hl_ctx, refcount);
+
+	dev_dbg(ctx->hdev->dev, "Now really releasing context %d\n", ctx->asid);
+
+	hl_ctx_fini(ctx);
+
+	if (ctx->hpriv)
+		hl_hpriv_put(ctx->hpriv);
+
+	kfree(ctx);
+}
+
+int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
+{
+	struct hl_ctx_mgr *mgr = &hpriv->ctx_mgr;
+	struct hl_ctx *ctx;
+	int rc;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx) {
+		rc = -ENOMEM;
+		goto out_err;
+	}
+
+	rc = hl_ctx_init(hdev, ctx, false);
+	if (rc)
+		goto free_ctx;
+
+	hl_hpriv_get(hpriv);
+	ctx->hpriv = hpriv;
+
+	/* TODO: remove for multiple contexts */
+	hpriv->ctx = ctx;
+	hdev->user_ctx = ctx;
+
+	mutex_lock(&mgr->ctx_lock);
+	rc = idr_alloc(&mgr->ctx_handles, ctx, 1, 0, GFP_KERNEL);
+	mutex_unlock(&mgr->ctx_lock);
+
+	if (rc < 0) {
+		dev_err(hdev->dev, "Failed to allocate IDR for a new CTX\n");
+		hl_ctx_free(hdev, ctx);
+		goto out_err;
+	}
+
+	return 0;
+
+free_ctx:
+	kfree(ctx);
+out_err:
+	return rc;
+}
+
+void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	if (kref_put(&ctx->refcount, hl_ctx_do_release) == 1)
+		return;
+
+	dev_warn(hdev->dev,
+		"Context %d closed or terminated but its CS are executing\n",
+		ctx->asid);
+}
+
+int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
+{
+	ctx->hdev = hdev;
+
+	kref_init(&ctx->refcount);
+
+	if (is_kernel_ctx) {
+		ctx->asid = HL_KERNEL_ASID_ID; /* KMD gets ASID 0 */
+	} else {
+		ctx->asid = hl_asid_alloc(hdev);
+		if (!ctx->asid) {
+			dev_err(hdev->dev, "No free ASID, failed to create context\n");
+			return -ENOMEM;
+		}
+	}
+
+	dev_dbg(hdev->dev, "Created context with ASID %u\n", ctx->asid);
+
+	return 0;
+}
+
+void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	kref_get(&ctx->refcount);
+}
+
+int hl_ctx_put(struct hl_ctx *ctx)
+{
+	return kref_put(&ctx->refcount, hl_ctx_do_release);
+}
+
+/**
+ * hl_ctx_mgr_init - initialize the context manager
+ *
+ * @mgr: pointer to context manager structure
+ *
+ * This manager is an object inside the hpriv object of the user process.
+ * The function is called when a user process opens the FD.
+ */
+void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr)
+{
+	mutex_init(&mgr->ctx_lock);
+	idr_init(&mgr->ctx_handles);
+}
+
+/**
+ * hl_ctx_mgr_fini - finalize the context manager
+ *
+ * @hdev: pointer to device structure
+ * @mgr: pointer to context manager structure
+ *
+ * This function goes over all the contexts in the manager and frees them.
+ * It is called when a process closes the FD.
+ */
+void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr)
+{
+	struct hl_ctx *ctx;
+	struct idr *idp;
+	u32 id;
+
+	idp = &mgr->ctx_handles;
+
+	idr_for_each_entry(idp, ctx, id)
+		hl_ctx_free(hdev, ctx);
+
+	idr_destroy(&mgr->ctx_handles);
+	mutex_destroy(&mgr->ctx_lock);
+}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index a4276ef559b3..84ce9fcb52da 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -23,6 +23,12 @@ static void hpriv_release(struct kref *ref)
 	put_pid(hpriv->taskpid);
 
 	kfree(hpriv);
+
+	/* Now the FD is really closed */
+	atomic_dec(&hdev->fd_open_cnt);
+
+	/* This allows a new user context to open the device */
+	hdev->user_ctx = NULL;
 }
 
 void hl_hpriv_get(struct hl_fpriv *hpriv)
@@ -47,6 +53,8 @@ static int hl_device_release(struct inode *inode, struct file *filp)
 {
 	struct hl_fpriv *hpriv = filp->private_data;
 
+	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
+
 	filp->private_data = NULL;
 
 	hl_hpriv_put(hpriv);
@@ -133,7 +141,20 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		return rc;
 
+	rc = hl_asid_init(hdev);
+	if (rc)
+		goto early_fini;
+
+	mutex_init(&hdev->device_open);
+	atomic_set(&hdev->fd_open_cnt, 0);
+
 	return 0;
+
+early_fini:
+	if (hdev->asic_funcs->early_fini)
+		hdev->asic_funcs->early_fini(hdev);
+
+	return rc;
 }
 
 /**
@@ -145,9 +166,12 @@ static int device_early_init(struct hl_device *hdev)
 static void device_early_fini(struct hl_device *hdev)
 {
 
+	hl_asid_fini(hdev);
+
 	if (hdev->asic_funcs->early_fini)
 		hdev->asic_funcs->early_fini(hdev);
 
+	mutex_destroy(&hdev->device_open);
 }
 
 /**
@@ -241,11 +265,30 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	if (rc)
 		goto early_fini;
 
+	/* Allocate the kernel context */
+	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
+	if (!hdev->kernel_ctx) {
+		rc = -ENOMEM;
+		goto sw_fini;
+	}
+
+	hdev->user_ctx = NULL;
+
+	rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize kernel context\n");
+		goto free_ctx;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+free_ctx:
+	kfree(hdev->kernel_ctx);
+sw_fini:
+	hdev->asic_funcs->sw_fini(hdev);
 early_fini:
 	device_early_fini(hdev);
 release_device:
@@ -278,6 +321,10 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	/* Release kernel context */
+	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
+		dev_err(hdev->dev, "kernel ctx is still alive\n");
+
 	/* Call ASIC S/W finalize function */
 	hdev->asic_funcs->sw_fini(hdev);
 
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 97844825f7a8..d003a6af2131 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -125,6 +125,45 @@ struct hl_asic_funcs {
 					void *cpu_addr, dma_addr_t dma_handle);
 };
 
+
+
+
+
+/*
+ * CONTEXTS
+ */
+
+#define HL_KERNEL_ASID_ID	0
+
+/**
+ * struct hl_ctx - user/kernel context.
+ * @hpriv: pointer to the private (KMD) data of the process (fd).
+ * @hdev: pointer to the device structure.
+ * @refcount: reference counter for the context. Context is released only when
+ *		this hits 0. It is incremented on CS and CS_WAIT.
+ * @asid: context's unique address space ID in the device's MMU.
+ */
+struct hl_ctx {
+	struct hl_fpriv		*hpriv;
+	struct hl_device	*hdev;
+	struct kref		refcount;
+	u32			asid;
+};
+
+/**
+ * struct hl_ctx_mgr - for handling multiple contexts.
+ * @ctx_lock: protects ctx_handles.
+ * @ctx_handles: idr to hold all ctx handles.
+ */
+struct hl_ctx_mgr {
+	struct mutex		ctx_lock;
+	struct idr		ctx_handles;
+};
+
+
+
+
+
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -134,12 +173,16 @@ struct hl_asic_funcs {
  * @hdev: habanalabs device structure.
  * @filp: pointer to the given file structure.
  * @taskpid: current process ID.
+ * @ctx: current executing context.
+ * @ctx_mgr: context manager to handle multiple context for this FD.
  * @refcount: number of related contexts.
  */
 struct hl_fpriv {
 	struct hl_device	*hdev;
 	struct file		*filp;
 	struct pid		*taskpid;
+	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
+	struct hl_ctx_mgr	ctx_mgr;
 	struct kref		refcount;
 };
 
@@ -195,13 +238,19 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @dev: realted kernel basic device structure.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
+ * @kernel_ctx: KMD context structure.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
  * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
+ * @asid_bitmap: holds used/available ASIDs.
+ * @asid_mutex: protects asid_bitmap.
+ * @device_open: lock for sanity checks upon FD open.
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @user_ctx: current user context executing.
+ * @fd_open_cnt: number of open context executing.
  * @major: habanalabs KMD major.
  * @id: device minor.
  * @disabled: is device disabled.
@@ -214,13 +263,21 @@ struct hl_device {
 	struct device			*dev;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
+	struct hl_ctx			*kernel_ctx;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
 	struct gen_pool			*cpu_accessible_dma_pool;
+	unsigned long			*asid_bitmap;
+	struct mutex			asid_mutex;
+	/* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
+	struct mutex			device_open;
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+	/* TODO: The following fields should be moved for multi-context */
+	struct hl_ctx			*user_ctx;
+	atomic_t			fd_open_cnt;
 	u32				major;
 	u16				id;
 	u8				disabled;
@@ -270,10 +327,23 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
 int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
 				u32 timeout_us, u32 *val);
 
+int hl_asid_init(struct hl_device *hdev);
+void hl_asid_fini(struct hl_device *hdev);
+unsigned long hl_asid_alloc(struct hl_device *hdev);
+void hl_asid_free(struct hl_device *hdev, unsigned long asid);
+
+int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv);
+void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx);
+int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx);
+int hl_ctx_put(struct hl_ctx *ctx);
+void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr);
+void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr);
 int hl_device_init(struct hl_device *hdev, struct class *hclass);
 void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
+void hl_hpriv_get(struct hl_fpriv *hpriv);
+void hl_hpriv_put(struct hl_fpriv *hpriv);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 79545003b7c2..0646da83eb53 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -77,6 +77,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 {
 	struct hl_device *hdev;
 	struct hl_fpriv *hpriv;
+	int rc;
 
 	mutex_lock(&hl_devs_idr_lock);
 	hdev = idr_find(&hl_devs_idr, iminor(inode));
@@ -88,9 +89,33 @@ int hl_device_open(struct inode *inode, struct file *filp)
 		return -ENXIO;
 	}
 
+	mutex_lock(&hdev->device_open);
+
+	if (hdev->disabled) {
+		dev_err_ratelimited(hdev->dev,
+			"Can't open %s because it is disabled\n",
+			dev_name(hdev->dev));
+		mutex_unlock(&hdev->device_open);
+		return -EPERM;
+	}
+
+	if (hdev->user_ctx) {
+		dev_info_ratelimited(hdev->dev,
+			"Device %s is already attached to application\n",
+			dev_name(hdev->dev));
+		mutex_unlock(&hdev->device_open);
+		return -EBUSY;
+	}
+
+	atomic_inc(&hdev->fd_open_cnt);
+
+	mutex_unlock(&hdev->device_open);
+
 	hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
-	if (!hpriv)
-		return -ENOMEM;
+	if (!hpriv) {
+		rc = -ENOMEM;
+		goto close_device;
+	}
 
 	hpriv->hdev = hdev;
 	filp->private_data = hpriv;
@@ -98,9 +123,26 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	kref_init(&hpriv->refcount);
 	nonseekable_open(inode, filp);
 
+	hl_ctx_mgr_init(&hpriv->ctx_mgr);
+
+	rc = hl_ctx_create(hdev, hpriv);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to open FD (CTX fail)\n");
+		goto out_err;
+	}
+
 	hpriv->taskpid = find_get_pid(current->pid);
 
 	return 0;
+
+out_err:
+	filp->private_data = NULL;
+	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
+	kfree(hpriv);
+
+close_device:
+	atomic_dec(&hdev->fd_open_cnt);
+	return rc;
 }
 
 /**
-- 
2.17.1


* [PATCH 05/15] habanalabs: add command buffer module
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (2 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 04/15] habanalabs: add context and ASID modules Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23 12:28   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 06/15] habanalabs: add basic Goya h/w initialization Oded Gabbay
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds the CB module, which allows the user to create and
destroy CBs and to map them to the user's process address-space.

A command buffer is a memory block that resides in DMA-able address-space
and is physically contiguous, so it can be accessed by the device without
MMU translation. The command buffer memory is allocated using the
coherent DMA API.

When creating a new CB, the IOCTL returns a handle to it, and the
user-space process needs to use that handle to mmap the buffer to get a VA
in the user's address-space.
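
The handle arithmetic can be sketched in plain user-space C as below. This
is only an illustration of what hl_cb_create() and hl_mmap() in this patch
do with the IDR id; the demo_* names are hypothetical and a 4 KiB page
(shift of 12) is assumed, whereas the driver uses the kernel's PAGE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT is architecture-specific. */
#define DEMO_PAGE_SHIFT		12
/* Mirrors HL_MMAP_CB_MASK: the top bit of a 64-bit offset, counted in pages. */
#define DEMO_MMAP_CB_MASK	(0x8000000000000000ull >> DEMO_PAGE_SHIFT)

/* Build the handle returned to user space from a (hypothetical) IDR id. */
static uint64_t demo_cb_handle_encode(uint32_t id)
{
	assert(id != 0); /* the patch asks the IDR for ids starting at 1 */
	return ((uint64_t)id | DEMO_MMAP_CB_MASK) << DEMO_PAGE_SHIFT;
}

/* The mmap offset in pages, i.e. what the driver reads as vma->vm_pgoff. */
static uint64_t demo_handle_to_pgoff(uint64_t handle)
{
	return handle >> DEMO_PAGE_SHIFT;
}

/* Non-zero if the page offset carries the CB marker bit. */
static int demo_pgoff_is_cb(uint64_t pgoff)
{
	return (pgoff & DEMO_MMAP_CB_MASK) == DEMO_MMAP_CB_MASK;
}

/* Strip the marker bit to recover the id used for the IDR lookup. */
static uint32_t demo_pgoff_to_id(uint64_t pgoff)
{
	return (uint32_t)(pgoff ^ DEMO_MMAP_CB_MASK);
}
```

Because the id is 32 bits wide and the marker sits above it, the two never
overlap, so the id survives the round trip through the mmap offset.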

Before destroying (freeing) a CB, the user must unmap the CB's VA using the
CB handle.

Each CB has a reference counter, which tracks its usage in command
submissions and also its mmaps (only a single mmap is allowed).

The driver maintains a pool of pre-allocated CBs in order to reduce
latency during command submissions. If the pool is empty, the driver
falls back to the slow path of allocating a new CB, i.e. calling
dma_alloc_coherent.
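
The fast-path/slow-path idea can be modelled in user space roughly as
follows. This is a simplified sketch, not the driver code: the demo_* types
are hypothetical, calloc() stands in for dma_alloc_coherent, and the
spinlock that protects the pool in the patch is omitted. In the patch
itself only kernel-context requests are served from the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_cb {
	struct demo_cb *next;	/* link in the free pool */
	size_t size;
	int is_pool;		/* non-zero if the buffer belongs to the pool */
};

struct demo_pool {
	struct demo_cb *head;	/* free pooled buffers */
	size_t cb_size;		/* size of every pooled buffer */
};

/* Fast path: reuse a pooled buffer. Slow path: allocate a new one. */
static struct demo_cb *demo_cb_acquire(struct demo_pool *pool, size_t size)
{
	struct demo_cb *cb;

	if (size <= pool->cb_size && pool->head) {
		cb = pool->head;
		pool->head = cb->next;
		cb->next = NULL;
		return cb;
	}

	cb = calloc(1, sizeof(*cb));
	if (cb)
		cb->size = size;
	return cb;
}

/* Pooled buffers go back on the list; all others are freed. */
static void demo_cb_release(struct demo_pool *pool, struct demo_cb *cb)
{
	assert(cb != NULL);

	if (cb->is_pool) {
		cb->next = pool->head;
		pool->head = cb;
	} else {
		free(cb);
	}
}
```

Marking each buffer with is_pool at creation time is what lets the release
path decide where the buffer goes without consulting the pool again.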

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile           |   3 +-
 drivers/misc/habanalabs/command_buffer.c   | 414 +++++++++++++++++++++
 drivers/misc/habanalabs/device.c           |  43 ++-
 drivers/misc/habanalabs/goya/goya.c        |  28 ++
 drivers/misc/habanalabs/habanalabs.h       |  95 ++++-
 drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
 drivers/misc/habanalabs/habanalabs_ioctl.c | 102 +++++
 include/uapi/misc/habanalabs.h             |  62 +++
 8 files changed, 746 insertions(+), 3 deletions(-)
 create mode 100644 drivers/misc/habanalabs/command_buffer.c
 create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
 create mode 100644 include/uapi/misc/habanalabs.h

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 3ffbadc2ca01..2530c9b78ca4 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -4,7 +4,8 @@
 
 obj-m	:= habanalabs.o
 
-habanalabs-y := habanalabs_drv.o device.o context.o asid.o
+habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
+		command_buffer.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
new file mode 100644
index 000000000000..535ed6cc5bda
--- /dev/null
+++ b/drivers/misc/habanalabs/command_buffer.c
@@ -0,0 +1,414 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+
+#include <linux/dma-mapping.h>
+
+static void cb_fini(struct hl_device *hdev, struct hl_cb *cb)
+{
+	hdev->asic_funcs->dma_free_coherent(hdev, cb->size,
+			(void *) cb->kernel_address, cb->bus_address);
+	kfree(cb);
+}
+
+static void cb_do_release(struct hl_device *hdev, struct hl_cb *cb)
+{
+	if (cb->is_pool) {
+		spin_lock(&hdev->cb_pool_lock);
+		list_add(&cb->pool_list, &hdev->cb_pool);
+		spin_unlock(&hdev->cb_pool_lock);
+	} else {
+		cb_fini(hdev, cb);
+	}
+}
+
+static void cb_release(struct kref *ref)
+{
+	struct hl_device *hdev;
+	struct hl_cb *cb;
+
+	cb = container_of(ref, struct hl_cb, refcount);
+	hdev = cb->hdev;
+
+	cb_do_release(hdev, cb);
+}
+
+static struct hl_cb *hl_cb_alloc(struct hl_device *hdev, u32 cb_size,
+					int ctx_id)
+{
+	struct hl_cb *cb;
+	void *p;
+
+	if (ctx_id == HL_KERNEL_ASID_ID)
+		cb = kzalloc(sizeof(*cb), GFP_ATOMIC);
+	else
+		cb = kzalloc(sizeof(*cb), GFP_KERNEL);
+
+	if (!cb)
+		return NULL;
+
+	if (ctx_id == HL_KERNEL_ASID_ID)
+		p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
+						&cb->bus_address, GFP_ATOMIC);
+	else
+		p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
+						&cb->bus_address,
+						GFP_USER | __GFP_ZERO);
+	if (!p) {
+		dev_err(hdev->dev,
+			"failed to allocate %u bytes of DMA memory for CB\n",
+			cb_size);
+		kfree(cb);
+		return NULL;
+	}
+
+	cb->kernel_address = (u64) p;
+	cb->size = cb_size;
+
+	return cb;
+}
+
+int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
+			u32 cb_size, u64 *handle, int ctx_id)
+{
+	struct hl_cb *cb;
+	bool alloc_new_cb = true;
+	int rc;
+
+	if (hdev->disabled) {
+		dev_warn_ratelimited(hdev->dev,
+			"Device is disabled !!! Can't create new CBs\n");
+		rc = -EBUSY;
+		goto out_err;
+	}
+
+	/* Minimum allocation must be PAGE_SIZE */
+	if (cb_size < PAGE_SIZE)
+		cb_size = PAGE_SIZE;
+
+	if (ctx_id == HL_KERNEL_ASID_ID &&
+			cb_size <= hdev->asic_prop.cb_pool_cb_size) {
+
+		spin_lock(&hdev->cb_pool_lock);
+		if (!list_empty(&hdev->cb_pool)) {
+			cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
+					pool_list);
+			list_del(&cb->pool_list);
+			spin_unlock(&hdev->cb_pool_lock);
+			alloc_new_cb = false;
+		} else {
+			spin_unlock(&hdev->cb_pool_lock);
+			dev_warn_once(hdev->dev, "CB pool is empty\n");
+		}
+	}
+
+	if (alloc_new_cb) {
+		cb = hl_cb_alloc(hdev, cb_size, ctx_id);
+		if (!cb) {
+			rc = -ENOMEM;
+			goto out_err;
+		}
+	}
+
+	cb->hdev = hdev;
+	cb->ctx_id = ctx_id;
+
+	spin_lock(&mgr->cb_lock);
+	rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);
+	spin_unlock(&mgr->cb_lock);
+
+	if (rc < 0) {
+		dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
+		goto release_cb;
+	}
+
+	cb->id = rc;
+
+	kref_init(&cb->refcount);
+	spin_lock_init(&cb->lock);
+
+	/*
+	 * idr is 32-bit so we can safely OR it with a mask that is above
+	 * 32 bit
+	 */
+	*handle = cb->id | HL_MMAP_CB_MASK;
+	*handle <<= PAGE_SHIFT;
+
+	return 0;
+
+release_cb:
+	cb_do_release(hdev, cb);
+out_err:
+	*handle = 0;
+
+	return rc;
+}
+
+int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle)
+{
+	struct hl_cb *cb;
+	u32 handle;
+	int rc = 0;
+
+	/*
+	 * The handle was shifted left by PAGE_SHIFT before it was given to
+	 * the user for mmap, so shift it back to the value the idr returned
+	 */
+	cb_handle >>= PAGE_SHIFT;
+	handle = (u32) cb_handle;
+
+	spin_lock(&mgr->cb_lock);
+
+	cb = idr_find(&mgr->cb_handles, handle);
+	if (cb) {
+		idr_remove(&mgr->cb_handles, handle);
+		spin_unlock(&mgr->cb_lock);
+		kref_put(&cb->refcount, cb_release);
+	} else {
+		spin_unlock(&mgr->cb_lock);
+		dev_err(hdev->dev,
+			"CB destroy failed, no match to handle 0x%x\n", handle);
+		rc = -EINVAL;
+	}
+
+	return rc;
+}
+
+int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	union hl_cb_args *args = data;
+	struct hl_device *hdev = hpriv->hdev;
+	u64 handle;
+	int rc;
+
+	switch (args->in.op) {
+	case HL_CB_OP_CREATE:
+		rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
+					&handle, hpriv->ctx->asid);
+		memset(args, 0, sizeof(*args));
+		args->out.cb_handle = handle;
+		break;
+	case HL_CB_OP_DESTROY:
+		rc = hl_cb_destroy(hdev, &hpriv->cb_mgr,
+					args->in.cb_handle);
+		memset(args, 0, sizeof(*args));
+		break;
+	default:
+		rc = -EINVAL;
+		break;
+	}
+
+	return rc;
+}
+
+static void cb_vm_close(struct vm_area_struct *vma)
+{
+	struct hl_cb *cb = (struct hl_cb *) vma->vm_private_data;
+
+	spin_lock(&cb->lock);
+	cb->mmap = false;
+	cb->vm_start = 0;
+	cb->vm_end = 0;
+	spin_unlock(&cb->lock);
+
+	vma->vm_private_data = NULL;
+	/* Drop the reference last, as this may release the CB */
+	hl_cb_put(cb);
+}
+
+static const struct vm_operations_struct cb_vm_ops = {
+	.close = cb_vm_close
+};
+
+int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_cb *cb;
+	phys_addr_t address;
+	u32 handle;
+	int rc;
+
+	handle = vma->vm_pgoff;
+
+	/* reference was taken here */
+	cb = hl_cb_get(hdev, &hpriv->cb_mgr, handle);
+	if (!cb) {
+		dev_err(hdev->dev,
+			"CB mmap failed, no match to handle %d\n", handle);
+		goto err_out;
+	}
+
+	/* Validation check */
+	if (vma->vm_end - vma->vm_start != cb->size) {
+		dev_err(hdev->dev,
+			"CB mmap failed, mmap size 0x%lx != 0x%x cb size\n",
+			vma->vm_end - vma->vm_start, cb->size);
+		goto put_cb;
+	}
+
+	spin_lock(&cb->lock);
+
+	if (cb->mmap) {
+		dev_err(hdev->dev,
+			"CB mmap failed, CB already mmaped to user\n");
+		goto release_lock;
+	}
+
+	cb->mmap = true;
+
+	spin_unlock(&cb->lock);
+
+	vma->vm_ops = &cb_vm_ops;
+
+	/*
+	 * Note: We're transferring the cb reference to
+	 * vma->vm_private_data here.
+	 */
+
+	vma->vm_private_data = cb;
+
+	/* Calculate address for CB */
+	address = virt_to_phys((void *) cb->kernel_address);
+
+	rc = hdev->asic_funcs->cb_mmap(hdev, vma, cb->kernel_address,
+					address, cb->size);
+
+	if (rc) {
+		spin_lock(&cb->lock);
+		cb->mmap = false;
+		goto release_lock;
+	}
+
+	cb->vm_start = vma->vm_start;
+	cb->vm_end = vma->vm_end;
+
+	return 0;
+
+release_lock:
+	spin_unlock(&cb->lock);
+put_cb:
+	hl_cb_put(cb);
+err_out:
+	return -EINVAL;
+}
+
+struct hl_cb *hl_cb_get(struct hl_device *hdev, struct hl_cb_mgr *mgr,
+			u32 handle)
+{
+	struct hl_cb *cb;
+
+	spin_lock(&mgr->cb_lock);
+	cb = idr_find(&mgr->cb_handles, handle);
+
+	if (!cb) {
+		spin_unlock(&mgr->cb_lock);
+		dev_warn(hdev->dev,
+			"CB get failed, no match to handle %d\n", handle);
+		return NULL;
+	}
+
+	kref_get(&cb->refcount);
+
+	spin_unlock(&mgr->cb_lock);
+
+	return cb;
+
+}
+
+void hl_cb_put(struct hl_cb *cb)
+{
+	kref_put(&cb->refcount, cb_release);
+}
+
+void hl_cb_mgr_init(struct hl_cb_mgr *mgr)
+{
+	spin_lock_init(&mgr->cb_lock);
+	idr_init(&mgr->cb_handles);
+}
+
+void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr)
+{
+	struct hl_cb *cb;
+	struct idr *idp;
+	u32 id;
+
+	idp = &mgr->cb_handles;
+
+	idr_for_each_entry(idp, cb, id) {
+		if (kref_put(&cb->refcount, cb_release) != 1)
+			dev_err(hdev->dev,
+				"CB %d for CTX ID %d is still alive\n",
+				id, cb->ctx_id);
+	}
+
+	idr_destroy(&mgr->cb_handles);
+}
+
+struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size)
+{
+	u64 cb_handle;
+	struct hl_cb *cb;
+	int rc;
+
+	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr, cb_size, &cb_handle,
+			HL_KERNEL_ASID_ID);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to allocate CB for KMD %d\n", rc);
+		return NULL;
+	}
+
+	cb_handle >>= PAGE_SHIFT;
+	cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr, (u32) cb_handle);
+	/* hl_cb_get should never fail here so use kernel WARN */
+	WARN(!cb, "Kernel CB handle invalid 0x%x\n", (u32) cb_handle);
+	if (!cb)
+		goto destroy_cb;
+
+	return cb;
+
+destroy_cb:
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb_handle << PAGE_SHIFT);
+
+	return NULL;
+}
+
+int hl_cb_pool_init(struct hl_device *hdev)
+{
+	struct hl_cb *cb;
+	int i;
+
+	INIT_LIST_HEAD(&hdev->cb_pool);
+	spin_lock_init(&hdev->cb_pool_lock);
+
+	for (i = 0 ; i < hdev->asic_prop.cb_pool_cb_cnt ; i++) {
+		cb = hl_cb_alloc(hdev, hdev->asic_prop.cb_pool_cb_size,
+				HL_KERNEL_ASID_ID);
+		if (cb) {
+			cb->is_pool = true;
+			list_add(&cb->pool_list, &hdev->cb_pool);
+		} else {
+			hl_cb_pool_fini(hdev);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+int hl_cb_pool_fini(struct hl_device *hdev)
+{
+	struct hl_cb *cb, *tmp;
+
+	list_for_each_entry_safe(cb, tmp, &hdev->cb_pool, pool_list) {
+		list_del(&cb->pool_list);
+		cb_fini(hdev, cb);
+	}
+
+	return 0;
+}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 84ce9fcb52da..0bd86a7d34db 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -53,6 +53,7 @@ static int hl_device_release(struct inode *inode, struct file *filp)
 {
 	struct hl_fpriv *hpriv = filp->private_data;
 
+	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
 
 	filp->private_data = NULL;
@@ -62,10 +63,34 @@ static int hl_device_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+/**
+ * hl_mmap - mmap function for habanalabs device
+ *
+ * @*filp: pointer to file structure
+ * @*vma: pointer to vm_area_struct of the process
+ *
+ * Called when process does an mmap on habanalabs device. Call the device's mmap
+ * function at the end of the common code.
+ */
+static int hl_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct hl_fpriv *hpriv = filp->private_data;
+
+	if ((vma->vm_pgoff & HL_MMAP_CB_MASK) == HL_MMAP_CB_MASK) {
+		vma->vm_pgoff ^= HL_MMAP_CB_MASK;
+		return hl_cb_mmap(hpriv, vma);
+	}
+
+	return hpriv->hdev->asic_funcs->mmap(hpriv, vma);
+}
+
 static const struct file_operations hl_ops = {
 	.owner = THIS_MODULE,
 	.open = hl_device_open,
-	.release = hl_device_release
+	.release = hl_device_release,
+	.mmap = hl_mmap,
+	.unlocked_ioctl = hl_ioctl,
+	.compat_ioctl = hl_ioctl
 };
 
 /**
@@ -145,6 +170,8 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		goto early_fini;
 
+	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
+
 	mutex_init(&hdev->device_open);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
@@ -166,6 +193,8 @@ static int device_early_init(struct hl_device *hdev)
 static void device_early_fini(struct hl_device *hdev)
 {
 
+	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
+
 	hl_asid_fini(hdev);
 
 	if (hdev->asic_funcs->early_fini)
@@ -280,11 +309,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto free_ctx;
 	}
 
+	rc = hl_cb_pool_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize CB pool\n");
+		goto release_ctx;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+release_ctx:
+	if (hl_ctx_put(hdev->kernel_ctx) != 1)
+		dev_err(hdev->dev,
+			"kernel ctx is still alive on initialization failure\n");
 free_ctx:
 	kfree(hdev->kernel_ctx);
 sw_fini:
@@ -321,6 +360,8 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	hl_cb_pool_fini(hdev);
+
 	/* Release kernel context */
 	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
 		dev_err(hdev->dev, "kernel ctx is still alive\n");
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index b2952296b890..341ac085af82 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -92,6 +92,9 @@
 
 #define GOYA_MAX_INITIATORS		20
 
+#define GOYA_CB_POOL_CB_CNT		512
+#define GOYA_CB_POOL_CB_SIZE		0x20000		/* 128KB */
+
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
@@ -119,6 +122,8 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
 
 	prop->high_pll = PLL_HIGH_DEFAULT;
+	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
+	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 }
 
 /**
@@ -598,6 +603,27 @@ int goya_resume(struct hl_device *hdev)
 	return 0;
 }
 
+int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
+{
+	return -EINVAL;
+}
+
+int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
+		u64 kaddress, phys_addr_t paddress, u32 size)
+{
+	int rc;
+
+	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
+			VM_DONTCOPY | VM_NORESERVE;
+
+	rc = remap_pfn_range(vma, vma->vm_start, paddress >> PAGE_SHIFT,
+				size, vma->vm_page_prot);
+	if (rc)
+		dev_err(hdev->dev, "remap_pfn_range error %d\n", rc);
+
+	return rc;
+}
+
 void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flags)
 {
@@ -617,6 +643,8 @@ static const struct hl_asic_funcs goya_funcs = {
 	.sw_fini = goya_sw_fini,
 	.suspend = goya_suspend,
 	.resume = goya_resume,
+	.mmap = goya_mmap,
+	.cb_mmap = goya_cb_mmap,
 	.dma_alloc_coherent = goya_dma_alloc_coherent,
 	.dma_free_coherent = goya_dma_free_coherent,
 };
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index d003a6af2131..6ad476df65b0 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -21,10 +21,12 @@
 
 #define HL_NAME				"habanalabs"
 
+#define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
+
 #define HL_MAX_QUEUES			128
 
 struct hl_device;
-
+struct hl_fpriv;
 
 
 
@@ -53,6 +55,8 @@ struct hl_device;
  * @max_asid: maximum number of open contexts (ASIDs).
  * @completion_queues_count: number of completion queues.
  * @high_pll: high PLL frequency used by the device.
+ * @cb_pool_cb_cnt: number of CBs in the CB pool.
+ * @cb_pool_cb_size: size of each CB in the CB pool.
  * @tpc_enabled_mask: which TPCs are enabled.
  */
 struct asic_fixed_properties {
@@ -73,11 +77,68 @@ struct asic_fixed_properties {
 	u32			sram_size;
 	u32			max_asid;
 	u32			high_pll;
+	u32			cb_pool_cb_cnt;
+	u32			cb_pool_cb_size;
 	u8			completion_queues_count;
 	u8			tpc_enabled_mask;
 };
 
 
+
+
+
+
+/*
+ * Command Buffers
+ */
+
+/**
+ * struct hl_cb_mgr - describes a Command Buffer Manager.
+ * @cb_lock: protects cb_handles.
+ * @cb_handles: an idr to hold all command buffer handles.
+ */
+struct hl_cb_mgr {
+	spinlock_t		cb_lock;
+	struct idr		cb_handles; /* protected by cb_lock */
+};
+
+/**
+ * struct hl_cb - describes a Command Buffer.
+ * @refcount: reference counter for usage of the CB.
+ * @hdev: pointer to device this CB belongs to.
+ * @lock: spinlock to protect mmap/cs flows.
+ * @pool_list: node in pool list of command buffers.
+ * @kernel_address: Holds the CB's kernel virtual address.
+ * @bus_address: Holds the CB's DMA address.
+ * @vm_start: Holds the CB's user start virtual address (when mmaped).
+ * @vm_end: Holds the CB's user end virtual address (when mmaped).
+ * @size: holds the CB's size.
+ * @id: the CB's ID.
+ * @ctx_id: holds the ID of the owner's context.
+ * @mmap: true if the CB is currently mmaped to user.
+ * @is_pool: true if CB was acquired from the pool, false otherwise.
+ */
+struct hl_cb {
+	struct kref		refcount;
+	struct hl_device	*hdev;
+	spinlock_t		lock;
+	struct list_head	pool_list;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u64			vm_start;
+	u64			vm_end;
+	u32			size;
+	u32			id;
+	u32			ctx_id;
+	u8			mmap;
+	u8			is_pool;
+};
+
+
+
+
+
+
 #define HL_QUEUE_LENGTH			256
 
 
@@ -109,6 +170,8 @@ enum hl_asic_type {
  * @sw_fini: tears down driver state, does not configure H/W.
  * @suspend: handles IP specific H/W or SW changes for suspend.
  * @resume: handles IP specific H/W or SW changes for resume.
+ * @mmap: mmap function, does nothing.
+ * @cb_mmap: maps a CB.
  * @dma_alloc_coherent: DMA allocate coherent memory.
  * @dma_free_coherent: free DMA allocation.
  */
@@ -119,6 +182,9 @@ struct hl_asic_funcs {
 	int (*sw_fini)(struct hl_device *hdev);
 	int (*suspend)(struct hl_device *hdev);
 	int (*resume)(struct hl_device *hdev);
+	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
+	int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
+			u64 kaddress, phys_addr_t paddress, u32 size);
 	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flag);
 	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
@@ -175,6 +241,7 @@ struct hl_ctx_mgr {
  * @taskpid: current process ID.
  * @ctx: current executing context.
  * @ctx_mgr: context manager to handle multiple context for this FD.
+ * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
  * @refcount: number of related contexts.
  */
 struct hl_fpriv {
@@ -183,6 +250,7 @@ struct hl_fpriv {
 	struct pid		*taskpid;
 	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
 	struct hl_ctx_mgr	ctx_mgr;
+	struct hl_cb_mgr	cb_mgr;
 	struct kref		refcount;
 };
 
@@ -239,6 +307,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @kernel_ctx: KMD context structure.
+ * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CBs.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
@@ -249,6 +318,8 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @cb_pool: list of preallocated CBs.
+ * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
  * @fd_open_cnt: number of open context executing.
  * @major: habanalabs KMD major.
@@ -264,6 +335,7 @@ struct hl_device {
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_ctx			*kernel_ctx;
+	struct hl_cb_mgr		kernel_cb_mgr;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
@@ -275,6 +347,10 @@ struct hl_device {
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+
+	struct list_head		cb_pool;
+	spinlock_t			cb_pool_lock;
+
 	/* TODO: The following fields should be moved for multi-context */
 	struct hl_ctx			*user_ctx;
 	atomic_t			fd_open_cnt;
@@ -345,6 +421,23 @@ int hl_device_resume(struct hl_device *hdev);
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
 
+int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
+		u64 *handle, int ctx_id);
+int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle);
+int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
+struct hl_cb *hl_cb_get(struct hl_device *hdev,	struct hl_cb_mgr *mgr,
+			u32 handle);
+void hl_cb_put(struct hl_cb *cb);
+void hl_cb_mgr_init(struct hl_cb_mgr *mgr);
+void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr);
+struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size);
+int hl_cb_pool_init(struct hl_device *hdev);
+int hl_cb_pool_fini(struct hl_device *hdev);
+
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+/* IOCTLs */
+long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
+int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
+
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 0646da83eb53..5c312dd3aa50 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -123,6 +123,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	kref_init(&hpriv->refcount);
 	nonseekable_open(inode, filp);
 
+	hl_cb_mgr_init(&hpriv->cb_mgr);
 	hl_ctx_mgr_init(&hpriv->ctx_mgr);
 
 	rc = hl_ctx_create(hdev, hpriv);
@@ -138,6 +139,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 out_err:
 	filp->private_data = NULL;
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
+	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
 	kfree(hpriv);
 
 close_device:
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
new file mode 100644
index 000000000000..fa2287569e0e
--- /dev/null
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/cred.h>
+
+#define HL_IOCTL_DEF(ioctl, _func) \
+	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
+
+static const struct hl_ioctl_desc hl_ioctls[] = {
+	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl)
+};
+
+#define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
+
+long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+	struct hl_fpriv *hpriv = filep->private_data;
+	struct hl_device *hdev = hpriv->hdev;
+	hl_ioctl_t *func;
+	const struct hl_ioctl_desc *ioctl = NULL;
+	unsigned int nr = _IOC_NR(cmd);
+	char stack_kdata[128];
+	char *kdata = NULL;
+	unsigned int usize, asize;
+	int retcode = -EINVAL;
+
+	if (nr >= HL_CORE_IOCTL_COUNT)
+		goto err_i1;
+
+	if ((nr >= HL_COMMAND_START) && (nr < HL_COMMAND_END)) {
+		u32 hl_size;
+
+		ioctl = &hl_ioctls[nr];
+
+		hl_size = _IOC_SIZE(ioctl->cmd);
+		usize = asize = _IOC_SIZE(cmd);
+		if (hl_size > asize)
+			asize = hl_size;
+
+		cmd = ioctl->cmd;
+	} else {
+		goto err_i1;
+	}
+
+	/* Do not trust userspace, use our own definition */
+	func = ioctl->func;
+
+	if (unlikely(!func)) {
+		dev_dbg(hdev->dev, "no function\n");
+		retcode = -EINVAL;
+		goto err_i1;
+	}
+
+	if (cmd & (IOC_IN | IOC_OUT)) {
+		if (asize <= sizeof(stack_kdata)) {
+			kdata = stack_kdata;
+		} else {
+			kdata = kmalloc(asize, GFP_KERNEL);
+			if (!kdata) {
+				retcode = -ENOMEM;
+				goto err_i1;
+			}
+		}
+		if (asize > usize)
+			memset(kdata + usize, 0, asize - usize);
+	}
+
+	if (cmd & IOC_IN) {
+		if (copy_from_user(kdata, (void __user *)arg, usize)) {
+			retcode = -EFAULT;
+			goto err_i1;
+		}
+	} else if (cmd & IOC_OUT) {
+		memset(kdata, 0, usize);
+	}
+
+	retcode = func(hpriv, kdata);
+
+	if (cmd & IOC_OUT)
+		if (copy_to_user((void __user *)arg, kdata, usize))
+			retcode = -EFAULT;
+
+err_i1:
+	if (!ioctl)
+		dev_dbg(hdev->dev,
+			"invalid ioctl: pid=%d, cmd=0x%02x, nr=0x%02x\n",
+			  task_pid_nr(current), cmd, nr);
+
+	if (kdata != stack_kdata)
+		kfree(kdata);
+
+	return retcode;
+}
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
new file mode 100644
index 000000000000..b3f9213d4709
--- /dev/null
+++ b/include/uapi/misc/habanalabs.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ * Author: Oded Gabbay <oded.gabbay@gmail.com>
+ *
+ */
+
+#ifndef HABANALABS_H_
+#define HABANALABS_H_
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* Opcode to create a new command buffer */
+#define HL_CB_OP_CREATE		0
+/* Opcode to destroy previously created command buffer */
+#define HL_CB_OP_DESTROY	1
+
+struct hl_cb_in {
+	/* Handle of CB or 0 if we want to create one */
+	__u64 cb_handle;
+	/* HL_CB_OP_* */
+	__u32 op;
+	/* Size of CB. Minimum requested size must be PAGE_SIZE */
+	__u32 cb_size;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
+struct hl_cb_out {
+	/* Handle of CB */
+	__u64 cb_handle;
+};
+
+union hl_cb_args {
+	struct hl_cb_in in;
+	struct hl_cb_out out;
+};
+
+/*
+ * Command Buffer
+ * - Request a Command Buffer
+ * - Destroy a Command Buffer
+ *
+ * The command buffers are memory blocks that reside in DMA-able address
+ * space and are physically contiguous so they can be accessed by the device
+ * directly. They are allocated using the coherent DMA API.
+ *
+ * When creating a new CB, the IOCTL returns a handle to it, and the user-space
+ * process needs to use that handle to mmap the buffer so it can access it.
+ *
+ */
+#define HL_IOCTL_CB		\
+		_IOWR('H', 0x02, union hl_cb_args)
+
+#define HL_COMMAND_START	0x02
+#define HL_COMMAND_END		0x03
+
+#endif /* HABANALABS_H_ */
-- 
2.17.1
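
Note on the ioctl encoding: a minimal user-space sketch, assuming a Linux
host with <sys/ioctl.h>. The structures are copied from the uapi header in
this patch, with stdint types standing in for __u32/__u64.
hl_kernel_buf_size() is a hypothetical helper, not a driver function; it
only mirrors how hl_ioctl() picks the larger of the caller's encoded size
and the driver's own.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Copied from include/uapi/misc/habanalabs.h in this patch, with stdint
 * types standing in for __u64/__u32.
 */
struct hl_cb_in {
	uint64_t cb_handle;
	uint32_t op;
	uint32_t cb_size;
	uint32_t ctx_id;
	uint32_t pad;
};

struct hl_cb_out {
	uint64_t cb_handle;
};

union hl_cb_args {
	struct hl_cb_in in;
	struct hl_cb_out out;
};

#define HL_IOCTL_CB	_IOWR('H', 0x02, union hl_cb_args)

/* The explicit pad keeps the size at 24 bytes for 32-bit and 64-bit callers. */
static_assert(sizeof(union hl_cb_args) == 24, "unexpected hl_cb_args size");

/*
 * HYPOTHETICAL helper mirroring the sizing in hl_ioctl(): the kernel buffer
 * is as large as the bigger of the size encoded by the caller (usize) and
 * the size in the driver's own definition (hl_size).
 */
static unsigned int hl_kernel_buf_size(unsigned int user_cmd,
					unsigned int drv_cmd)
{
	unsigned int usize = _IOC_SIZE(user_cmd);
	unsigned int hl_size = _IOC_SIZE(drv_cmd);

	return hl_size > usize ? hl_size : usize;
}
```

With these definitions, _IOC_NR(HL_IOCTL_CB) is 0x02, which is the index
hl_ioctl() uses into hl_ioctls[], and _IOC_SIZE(HL_IOCTL_CB) is 24. A caller
that encodes a smaller structure still gets a 24-byte kernel buffer, with
the tail zeroed by the memset() in hl_ioctl().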



* [PATCH 06/15] habanalabs: add basic Goya h/w initialization
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (3 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 05/15] habanalabs: add command buffer module Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-25  7:46   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 07/15] habanalabs: add h/w queues module Oded Gabbay
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

Add the basic part of Goya's H/W initialization: the code that initializes
Goya's internal CPU, the PLLs and the DDR controllers, and programs
registers such as those related to internal routing, scrambling and
workarounds for H/W bugs.

Also initialize Goya's security scheme, which prevents the user from
abusing Goya, for example to steal data from the host, crash the host or
change Goya's F/W.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
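Note on the PLL table in goya_init_pll(): the rows in the comment appear to
read as {frequency in Hz, NR, NF, OD, NB}, and every listed row is
consistent with f = 50 MHz * NF / (NR * OD). The 50 MHz reference and the
formula are inferred from the table, not stated in the patch. Comparing the
200 MHz row {1,64,16,32} with the values programmed in the non-pldm case
(nr=0, nf=63, od=15, nb=31) also suggests that the registers hold each field
minus one. A sketch under those assumptions; all names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 50 MHz reference clock, inferred from the table rows. */
#define PLL_REF_HZ	50000000ULL

/* Logical PLL settings as they appear in the table (not register values). */
struct pll_cfg {
	uint16_t nr;	/* divides the reference (assumed meaning) */
	uint16_t nf;	/* multiplies the reference (assumed meaning) */
	uint16_t od;	/* divides the output (assumed meaning) */
	uint16_t nb;	/* NF / 2 in every listed row */
};

/* Output frequency implied by a table row. */
static uint64_t pll_out_hz(const struct pll_cfg *c)
{
	return PLL_REF_HZ * c->nf / ((uint64_t)c->nr * c->od);
}

/* ASSUMPTION: register encoding is "field minus one". */
static struct pll_cfg pll_to_regs(const struct pll_cfg *c)
{
	struct pll_cfg r = {
		.nr = (uint16_t)(c->nr - 1),
		.nf = (uint16_t)(c->nf - 1),
		.od = (uint16_t)(c->od - 1),
		.nb = (uint16_t)(c->nb - 1),
	};

	return r;
}
```

Under the same reading, the CPU PLL values in the non-pldm case (nr=0,
nf=47, od=1, nb=23) correspond to the 1200 MHz row {1,48,2,24}.
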
 drivers/misc/habanalabs/device.c              |   12 +
 drivers/misc/habanalabs/goya/Makefile         |    2 +-
 drivers/misc/habanalabs/goya/goya.c           | 1892 ++++++++++-
 drivers/misc/habanalabs/goya/goyaP.h          |    3 +
 drivers/misc/habanalabs/goya/goya_security.c  | 2999 +++++++++++++++++
 drivers/misc/habanalabs/habanalabs.h          |   16 +
 drivers/misc/habanalabs/habanalabs_drv.c      |    8 +
 drivers/misc/habanalabs/include/goya/goya.h   |    1 +
 .../include/goya/goya_async_events.h          |  186 +
 .../habanalabs/include/goya/goya_boot_if.h    |   32 +
 10 files changed, 5144 insertions(+), 7 deletions(-)
 create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_boot_if.h
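
Note on re-entrancy: each init helper in goya.c (goya_init_pll(),
goya_init_ddr_ch0(), goya_init_golden_registers(), ...) returns early when
its bit is already set in goya->hw_cap_initialized and sets the bit once it
has finished. The pattern in isolation, as a sketch with hypothetical names
and bit values (the real HW_CAP_* values are defined elsewhere in the
driver and are not shown in this patch excerpt):

```c
#include <assert.h>
#include <stdint.h>

/* HYPOTHETICAL bit assignments, for illustration only. */
#define CAP_PLL		(1u << 0)
#define CAP_DDR_0	(1u << 1)

struct dev_state {
	uint32_t hw_cap_initialized;	/* one bit per completed stage */
	unsigned int pll_passes;	/* counts real programming passes */
};

static void init_pll(struct dev_state *d)
{
	/* Skip the stage if an earlier call already completed it. */
	if (d->hw_cap_initialized & CAP_PLL)
		return;

	d->pll_passes++;		/* stands in for the register writes */

	/* Mark the stage done only after all of its work has run. */
	d->hw_cap_initialized |= CAP_PLL;
}
```

Calling init_pll() twice on the same state programs the stage once; the
second call returns at the capability check.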

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 0bd86a7d34db..9fc7218a973c 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -315,6 +315,15 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto release_ctx;
 	}
 
+	rc = hdev->asic_funcs->hw_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize the H/W\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
+	hdev->disabled = false;
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
@@ -366,6 +375,9 @@ void hl_device_fini(struct hl_device *hdev)
 	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
 		dev_err(hdev->dev, "kernel ctx is still alive\n");
 
+	/* Reset the H/W. It will be in idle state after this returns */
+	hdev->asic_funcs->hw_fini(hdev, true);
+
 	/* Call ASIC S/W finalize function */
 	hdev->asic_funcs->sw_fini(hdev);
 
diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
index 5ebf3d0d5794..a57096fa41b6 100644
--- a/drivers/misc/habanalabs/goya/Makefile
+++ b/drivers/misc/habanalabs/goya/Makefile
@@ -1,3 +1,3 @@
 subdir-ccflags-y += -I$(src)
 
-HL_GOYA_FILES :=  goya/goya.o
\ No newline at end of file
+HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
\ No newline at end of file
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 341ac085af82..f715e01838b3 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -119,11 +119,11 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
 	prop->cfg_size = CFG_SIZE;
 	prop->max_asid = MAX_ASID;
+	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
+	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
 
 	prop->high_pll = PLL_HIGH_DEFAULT;
-	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
-	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 }
 
 /**
@@ -459,10 +459,12 @@ static int goya_early_init(struct hl_device *hdev)
 		goto disable_device;
 	}
 
-	val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
-	if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
-		dev_warn(hdev->dev,
-			"PCI strap is not configured correctly, PCI bus errors may occur\n");
+	if (!hdev->pldm) {
+		val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
+		if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
+			dev_warn(hdev->dev,
+				"PCI strap is not configured correctly, PCI bus errors may occur\n");
+	}
 
 	return 0;
 
@@ -593,6 +595,1882 @@ int goya_sw_fini(struct hl_device *hdev)
 	return 0;
 }
 
+/**
+ * goya_init_pll - Initialize pll registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_init_pll(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u16 hbw_nr, hbw_nf, hbw_od, hbw_nb;
+	u16 cpu_nr, cpu_nf, cpu_od, cpu_nb;
+	u16 mc_nr, mc_nf, mc_od, mc_nb;
+	u16 pci_nr, pci_nf, pci_od, pci_nb;
+	u16 emmc_nr, emmc_nf, emmc_od, emmc_nb;
+
+	if (!hdev->config_pll)
+		return;
+
+	if (goya->hw_cap_initialized & HW_CAP_PLL)
+		return;
+
+	if (hdev->cpu_enable) {
+		dev_info(hdev->dev,
+			"Waiting 5s for u-boot before configuring PLLs\n");
+		ssleep(5);
+	}
+
+/*
+ * PLL possible configuration values:
+	{50000000,1,16,16,8},
+	{100000000,1,32,16,16},
+	{150000000,1,48,16,24},
+	{200000000,1,64,16,32},
+	{250000000,1,70,14,35},
+	{300000000,1,60,10,30},
+	{350000000,1,70,10,35},
+	{400000000,1,64,8,32},
+	{450000000,1,54,6,27},
+	{500000000,1,60,6,30},
+	{550000000,1,66,6,33},
+	{600000000,1,48,4,24},
+	{650000000,1,52,4,26},
+	{700000000,1,56,4,28},
+	{750000000,1,60,4,30},
+	{800000000,1,64,4,32},
+	{850000000,1,68,4,34},
+	{900000000,1,36,2,18},
+	{950000000,1,38,2,19},
+	{1000000000,1,40,2,20},
+	{1050000000,1,42,2,21},
+	{1100000000,1,44,2,22},
+	{1150000000,1,46,2,23},
+	{1200000000,1,48,2,24},
+	{1250000000,1,50,2,25},
+	{1300000000,1,52,2,26},
+	{1350000000,1,54,2,27},
+	{1400000000,1,56,2,28},
+	{1450000000,1,58,2,29},
+	{1500000000,1,60,2,30},
+	{1550000000,1,62,2,31},
+*/
+
+	if (hdev->pldm) {
+		hbw_nr  = 4, hbw_nf  = 302, hbw_od  = 1, hbw_nb  = 151;
+		cpu_nr  = 0, cpu_nf  = 47, cpu_od  = 1, cpu_nb  = 32;
+		mc_nr   = 1, mc_nf   = 159, mc_od   = 9, mc_nb   = 79;
+		pci_nr  = 4, pci_nf  = 343, pci_od  = 3, pci_nb  = 171;
+		emmc_nr = 24, emmc_nf = 415, emmc_od = 15, emmc_nb = 207;
+	} else {
+		/* 200MHz */
+		hbw_nr  = 0, hbw_nf  = 63, hbw_od  = 15, hbw_nb  = 31;
+		cpu_nr  = 0, cpu_nf  = 47, cpu_od  = 1, cpu_nb  = 23;
+		mc_nr   = 2, mc_nf   = 0x9f, mc_od   = 3, mc_nb   = 0x4f;
+		pci_nr  = 4, pci_nf  = 343, pci_od  = 3, pci_nb  = 171;
+		emmc_nr = 24, emmc_nf = 415, emmc_od = 15, emmc_nb = 207;
+	}
+
+	/* Adjust divider for SPI */
+	WREG32(mmPSOC_SPI_BAUDR, 8);
+
+	WREG32(mmCPU_PLL_RST, 1);
+	WREG32(mmCPU_PLL_NR, cpu_nr);
+	WREG32(mmCPU_PLL_NF, cpu_nf);
+	WREG32(mmCPU_PLL_OD, cpu_od);
+	WREG32(mmCPU_PLL_NB, cpu_nb);
+	WREG32(mmCPU_PLL_DATA_CHNG, 0x11);
+
+	/* delay before taking PLL out of reset */
+	udelay(100);
+
+	WREG32(mmCPU_PLL_RST, 0);
+	WREG32(mmCPU_PLL_DIV_EN_0, 0x1);
+	WREG32(mmCPU_PLL_DIV_SEL_0, 0x1);
+
+	WREG32(mmCPU_PLL_DIV_FACTOR_1, 0x5);
+	WREG32(mmCPU_PLL_DIV_FACTOR_CMD_1, 0x1);
+	WREG32(mmCPU_PLL_DIV_EN_1, 0x1);
+	WREG32(mmCPU_PLL_DIV_SEL_1, 0x3);
+
+	WREG32(mmCPU_PLL_DIV_FACTOR_2, 0x1);
+	WREG32(mmCPU_PLL_DIV_FACTOR_CMD_2, 0x1);
+	WREG32(mmCPU_PLL_DIV_EN_2, 0x1);
+	WREG32(mmCPU_PLL_DIV_SEL_2, 0x3);
+
+	WREG32(mmIC_PLL_RST, 1);
+	WREG32(mmIC_PLL_NR, hbw_nr);
+	WREG32(mmIC_PLL_NF, hbw_nf);
+	WREG32(mmIC_PLL_OD, hbw_od);
+	WREG32(mmIC_PLL_NB, hbw_nb);
+	WREG32(mmIC_PLL_DATA_CHNG, 0x11);
+
+	udelay(100);
+
+	WREG32(mmIC_PLL_RST, 0);
+	WREG32(mmIC_PLL_DIV_EN_0, 0x1);
+	WREG32(mmIC_PLL_DIV_SEL_0, 0x1);
+
+	WREG32(mmIC_PLL_DIV_FACTOR_1, 0x1);
+	WREG32(mmIC_PLL_DIV_FACTOR_CMD_1, 0x1);
+	WREG32(mmIC_PLL_DIV_EN_1, 0x1);
+	WREG32(mmIC_PLL_DIV_SEL_1, 0x3);
+
+	WREG32(mmIC_PLL_DIV_FACTOR_2, 0x1);
+	WREG32(mmIC_PLL_DIV_FACTOR_CMD_2, 0x1);
+	WREG32(mmIC_PLL_DIV_EN_2, 0x1);
+	WREG32(mmIC_PLL_DIV_SEL_2, 0x3);
+
+	WREG32(mmIC_PLL_DIV_FACTOR_3, 0x8);
+	WREG32(mmIC_PLL_DIV_FACTOR_CMD_3, 0x1);
+	WREG32(mmIC_PLL_DIV_EN_3, 0x1);
+	WREG32(mmIC_PLL_DIV_SEL_3, 0x3);
+
+	if ((hdev->pldm) || (!hdev->cpu_enable)) {
+		WREG32(mmMC_PLL_RST, 1);
+
+		WREG32(mmMC_PLL_NR, mc_nr);
+		WREG32(mmMC_PLL_NF, mc_nf);
+		WREG32(mmMC_PLL_OD, mc_od);
+		WREG32(mmMC_PLL_NB, mc_nb);
+		WREG32(mmMC_PLL_DATA_CHNG, 0x11);
+
+		udelay(100);
+
+		WREG32(mmMC_PLL_RST, 0);
+		WREG32(mmMC_PLL_DIV_EN_0, 0x1);
+		WREG32(mmMC_PLL_DIV_SEL_0, 0x1);
+	}
+
+	WREG32(mmPSOC_MME_PLL_RST, 1);
+	WREG32(mmPSOC_MME_PLL_NR, hbw_nr);
+	WREG32(mmPSOC_MME_PLL_NF, hbw_nf);
+	WREG32(mmPSOC_MME_PLL_OD, hbw_od);
+	WREG32(mmPSOC_MME_PLL_NB, hbw_nb);
+	WREG32(mmPSOC_MME_PLL_DATA_CHNG, 0x11);
+
+	udelay(100);
+
+	WREG32(mmPSOC_MME_PLL_RST, 0);
+	WREG32(mmPSOC_MME_PLL_DIV_EN_0, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_0, 0x1);
+
+	WREG32(mmPSOC_MME_PLL_DIV_FACTOR_1, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_FACTOR_CMD_1, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_EN_1, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_1, 0x3);
+
+	WREG32(mmPSOC_MME_PLL_DIV_FACTOR_2, 0x2);
+	WREG32(mmPSOC_MME_PLL_DIV_FACTOR_CMD_2, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_EN_2, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_2, 0x3);
+
+	WREG32(mmPSOC_MME_PLL_DIV_FACTOR_3, 0x8);
+	WREG32(mmPSOC_MME_PLL_DIV_FACTOR_CMD_3, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_EN_3, 0x1);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_3, 0x3);
+
+	WREG32(mmPSOC_PCI_PLL_RST, 1);
+	WREG32(mmPSOC_PCI_PLL_NR, pci_nr);
+	WREG32(mmPSOC_PCI_PLL_NF, pci_nf);
+	WREG32(mmPSOC_PCI_PLL_OD, pci_od);
+	WREG32(mmPSOC_PCI_PLL_NB, pci_nb);
+	WREG32(mmPSOC_PCI_PLL_DATA_CHNG, 0x11);
+
+	udelay(100);
+
+	WREG32(mmPSOC_PCI_PLL_RST, 0);
+	WREG32(mmPSOC_PCI_PLL_DIV_EN_0, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_0, 0x1);
+
+	WREG32(mmPSOC_PCI_PLL_DIV_FACTOR_1, 0x4);
+	WREG32(mmPSOC_PCI_PLL_DIV_FACTOR_CMD_1, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_EN_1, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_1, 0x3);
+
+	WREG32(mmPSOC_PCI_PLL_DIV_FACTOR_2, 0x8);
+	WREG32(mmPSOC_PCI_PLL_DIV_FACTOR_CMD_2, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_EN_2, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_2, 0x3);
+
+	WREG32(mmPSOC_PCI_PLL_DIV_FACTOR_3, 0x55);
+	WREG32(mmPSOC_PCI_PLL_DIV_FACTOR_CMD_3, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_EN_3, 0x1);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_3, 0x3);
+
+	WREG32(mmPSOC_EMMC_PLL_RST, 1);
+	WREG32(mmPSOC_EMMC_PLL_NR, emmc_nr);
+	WREG32(mmPSOC_EMMC_PLL_NF, emmc_nf);
+	WREG32(mmPSOC_EMMC_PLL_OD, emmc_od);
+	WREG32(mmPSOC_EMMC_PLL_NB, emmc_nb);
+	WREG32(mmPSOC_EMMC_PLL_DATA_CHNG, 0x11);
+
+	udelay(100);
+
+	WREG32(mmPSOC_EMMC_PLL_RST, 0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_EN_0, 0x1);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_0, 0x1);
+
+	WREG32(mmPSOC_EMMC_PLL_DIV_FACTOR_1, 0x1);
+	WREG32(mmPSOC_EMMC_PLL_DIV_FACTOR_CMD_1, 0x1);
+	WREG32(mmPSOC_EMMC_PLL_DIV_EN_1, 0x1);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_1, 0x3);
+
+	/* 200MHz */
+	WREG32(mmTPC_PLL_RST, 1);
+	WREG32(mmTPC_PLL_NR, hbw_nr);
+	WREG32(mmTPC_PLL_NF, hbw_nf);
+	WREG32(mmTPC_PLL_OD, hbw_od);
+	WREG32(mmTPC_PLL_NB, hbw_nb);
+	WREG32(mmTPC_PLL_DATA_CHNG, 0x11);
+
+	udelay(100);
+
+	WREG32(mmTPC_PLL_RST, 0);
+	WREG32(mmTPC_PLL_DIV_EN_0, 0x1);
+	WREG32(mmTPC_PLL_DIV_SEL_0, 0x1);
+
+	WREG32(mmTPC_PLL_DIV_FACTOR_1, 0x1);
+	WREG32(mmTPC_PLL_DIV_FACTOR_CMD_1, 0x1);
+	WREG32(mmTPC_PLL_DIV_EN_1, 0x1);
+	WREG32(mmTPC_PLL_DIV_SEL_1, 0x3);
+
+	WREG32(mmTPC_PLL_DIV_FACTOR_2, 0x1);
+	WREG32(mmTPC_PLL_DIV_FACTOR_CMD_2, 0x1);
+	WREG32(mmTPC_PLL_DIV_EN_2, 0x1);
+	WREG32(mmTPC_PLL_DIV_SEL_2, 0x3);
+
+	WREG32(mmTPC_PLL_DIV_FACTOR_3, 0x8);
+	WREG32(mmTPC_PLL_DIV_FACTOR_CMD_3, 0x1);
+	WREG32(mmTPC_PLL_DIV_EN_3, 0x1);
+	WREG32(mmTPC_PLL_DIV_SEL_3, 0x3);
+
+	goya->hw_cap_initialized |= HW_CAP_PLL;
+}
+
+static void goya_set_pll_refclk(struct hl_device *hdev)
+{
+	WREG32(mmCPU_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmCPU_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmCPU_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmCPU_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmIC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmIC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmIC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmIC_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmMC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmMC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmMC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmMC_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmTPC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmTPC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmTPC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmTPC_PLL_DIV_SEL_3, 0x0);
+}
+
+static void goya_disable_clk_rlx(struct hl_device *hdev)
+{
+	WREG32(mmPSOC_MME_PLL_CLK_RLX_0, 0x100010);
+	WREG32(mmIC_PLL_CLK_RLX_0, 0x100010);
+}
+
+/**
+ * goya_init_ddr_ch0 - Initialize DDR CH0 controller of the chip
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_init_ddr_ch0(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 val;
+
+	if (goya->hw_cap_initialized & HW_CAP_DDR_0)
+		return;
+
+	val = RREG32(mmDDR_MISC_CH0_CFG_DONE);
+	if (val & DDR_MISC_CH0_CFG_DONE_CFG_DONE_MASK) {
+		goya->hw_cap_initialized |= HW_CAP_DDR_0;
+		return;
+	}
+
+	WREG32(mmDDR_MC_CH0_DBG1, 0x00000001);
+	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000001);
+
+	val = RREG32(mmDDR_MC_CH0_STAT);
+
+	WREG32(mmDDR_MC_CH0_MSTR, 0x81040210);
+	WREG32(mmDDR_MC_CH0_MRCTRL0, 0x4000a0f0);
+	WREG32(mmDDR_MC_CH0_MRCTRL1, 0x00022ad0);
+	WREG32(mmDDR_MC_CH0_MRCTRL2, 0x091629e1);
+	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000008);
+	WREG32(mmDDR_MC_CH0_PWRTMG, 0x00040002);
+	WREG32(mmDDR_MC_CH0_HWLPCTL, 0x00be0002);
+	WREG32(mmDDR_MC_CH0_RFSHCTL0, 0x0091f020);
+	WREG32(mmDDR_MC_CH0_RFSHCTL1, 0x00120018);
+	WREG32((mmDDR_MC_CH0_MSTR + 0x00000058), 0x00160005);
+	WREG32(mmDDR_MC_CH0_RFSHCTL3, 0x00000020);
+	WREG32(mmDDR_MC_CH0_RFSHTMG, 0x003000d0);
+	WREG32(mmDDR_MC_CH0_ECCCFG0, 0x00000010);
+	WREG32(mmDDR_MC_CH0_ECCCFG1, 0x00000002);
+	WREG32(mmDDR_MC_CH0_ECCCTL, 0x00000300);
+	WREG32(mmDDR_MC_CH0_ECCPOISONADDR0, 0x00000078);
+	WREG32(mmDDR_MC_CH0_ECCPOISONADDR1, 0x100062f7);
+	WREG32(mmDDR_MC_CH0_CRCPARCTL0, 0x00008000);
+	WREG32(mmDDR_MC_CH0_CRCPARCTL1, 0x0e088301);
+	WREG32(mmDDR_MC_CH0_CRCPARCTL2, 0x00600527);
+	WREG32(mmDDR_MC_CH0_INIT0, 0x00070002);
+	WREG32(mmDDR_MC_CH0_INIT1, 0x0001000e);
+	WREG32(mmDDR_MC_CH0_INIT3, 0x0c510001);
+	WREG32(mmDDR_MC_CH0_INIT4, 0x00280400);
+	WREG32(mmDDR_MC_CH0_INIT5, 0x00110000);
+	WREG32(mmDDR_MC_CH0_INIT6, 0x02000643);
+	WREG32(mmDDR_MC_CH0_INIT7, 0x00001000);
+	WREG32(mmDDR_MC_CH0_DIMMCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH0_RANKCTL, 0x000009a0);
+	WREG32(mmDDR_MC_CH0_DRAMTMG0, 0x1918361a);
+	WREG32(mmDDR_MC_CH0_DRAMTMG1, 0x00080724);
+	WREG32(mmDDR_MC_CH0_DRAMTMG2, 0x080d0713);
+	WREG32(mmDDR_MC_CH0_DRAMTMG3, 0x00012012);
+	WREG32(mmDDR_MC_CH0_DRAMTMG4, 0x0b04060b);
+	WREG32(mmDDR_MC_CH0_DRAMTMG5, 0x0a0c0804);
+	WREG32(mmDDR_MC_CH0_DRAMTMG8, 0x0606490c);
+	WREG32(mmDDR_MC_CH0_DRAMTMG9, 0x0002050f);
+	WREG32(mmDDR_MC_CH0_DRAMTMG10, 0x000e0d0f);
+	WREG32(mmDDR_MC_CH0_DRAMTMG11, 0x270b011f);
+	WREG32(mmDDR_MC_CH0_DRAMTMG12, 0x00000010);
+	WREG32(mmDDR_MC_CH0_DRAMTMG15, 0x00000000);
+	WREG32(mmDDR_MC_CH0_ZQCTL0, 0x31000040);
+	WREG32(mmDDR_MC_CH0_ZQCTL1, 0x00000070);
+	WREG32(mmDDR_MC_CH0_DFITMG0, 0x05978211);
+	WREG32(mmDDR_MC_CH0_DFITMG1, 0x00080101);
+	WREG32(mmDDR_MC_CH0_DFILPCFG0, 0x07006031);
+	WREG32(mmDDR_MC_CH0_DFILPCFG1, 0x00000010);
+	WREG32(mmDDR_MC_CH0_DFIUPD0, 0x40400018);
+	WREG32(mmDDR_MC_CH0_DFIUPD1, 0x000b0046);
+	WREG32(mmDDR_MC_CH0_DFIUPD2, 0x00000000);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH0_DFITMG2, 0x00001711);
+	WREG32(mmDDR_MC_CH0_DFITMG3, 0x0000001e);
+	WREG32(mmDDR_MC_CH0_DBICTL, 0x00000001);
+	WREG32(mmDDR_MC_CH0_DFIPHYMSTR, 0x00000000);
+	WREG32(mmDDR_MC_CH0_ADDRMAP0, 0x00001f1f);
+	WREG32(mmDDR_MC_CH0_ADDRMAP1, 0x003f1503);
+	WREG32(mmDDR_MC_CH0_ADDRMAP2, 0x01000400);
+	WREG32(mmDDR_MC_CH0_ADDRMAP3, 0x04000505);
+	WREG32(mmDDR_MC_CH0_ADDRMAP4, 0x00001f1f);
+	WREG32(mmDDR_MC_CH0_ADDRMAP5, 0x06060303);
+	WREG32(mmDDR_MC_CH0_ADDRMAP6, 0x0f050709);
+	WREG32(mmDDR_MC_CH0_ADDRMAP7, 0x00000f0f);
+	WREG32(mmDDR_MC_CH0_ADDRMAP8, 0x00003f01);
+	WREG32(mmDDR_MC_CH0_ADDRMAP9, 0x09000606);
+	WREG32(mmDDR_MC_CH0_ADDRMAP10, 0x02090105);
+	WREG32(mmDDR_MC_CH0_ADDRMAP11, 0x0000000a);
+	WREG32(mmDDR_MC_CH0_ODTCFG, 0x09090a08);
+	WREG32(mmDDR_MC_CH0_ODTMAP, 0x9ae1b5fe);
+	WREG32(mmDDR_MC_CH0_SCHED, 0x664d3700);
+	WREG32(mmDDR_MC_CH0_SCHED1, 0x00000000);
+	WREG32(mmDDR_MC_CH0_PERFHPR1, 0x1700e024);
+	WREG32(mmDDR_MC_CH0_PERFLPR1, 0x1e00836c);
+	WREG32(mmDDR_MC_CH0_PERFWR1, 0x260046c9);
+	WREG32(mmDDR_MC_CH0_DQMAP0, 0x0d2b3503);
+	WREG32(mmDDR_MC_CH0_DQMAP1, 0x042a0537);
+	WREG32(mmDDR_MC_CH0_DQMAP2, 0x330b2806);
+	WREG32(mmDDR_MC_CH0_DQMAP3, 0x27013803);
+	WREG32(mmDDR_MC_CH0_DQMAP4, 0x0000022c);
+	WREG32(mmDDR_MC_CH0_DQMAP5, 0x00000001);
+	WREG32(mmDDR_MC_CH0_DBG0, 0x00000001);
+	WREG32(mmDDR_MC_CH0_DBG1, 0x00000000);
+	WREG32(mmDDR_MC_CH0_DBGCMD, 0x00000000);
+	WREG32(mmDDR_MC_CH0_SWCTL, 0x00000001);
+	WREG32(mmDDR_MC_CH0_POISONCFG, 0x00000001);
+	WREG32(mmDDR_MC_CH0_ADVECCINDEX, 0x00000004);
+	WREG32(mmDDR_MC_CH0_ECCPOISONPAT0, 0x00000000);
+	WREG32(mmDDR_MC_CH0_ECCPOISONPAT1, 0x00000000);
+	WREG32(mmDDR_MC_CH0_ECCPOISONPAT2, 0x00000000);
+	WREG32(mmDDR_MC_CH0_CAPARPOISONCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH0_PCCFG, 0x00000011);
+	WREG32(mmDDR_MC_CH0_PCFGR_0, 0x0000518c);
+	WREG32(mmDDR_MC_CH0_PCFGW_0, 0x00001263);
+	WREG32(mmDDR_MC_CH0_PCTRL_0, 0x00000001);
+	WREG32(mmDDR_MC_CH0_PCFGQOS0_0, 0x0011000e);
+	WREG32(mmDDR_MC_CH0_SBRCTL, 0x0016b540);
+	WREG32(mmDDR_MC_CH0_SBRWDATA0, 0x8c1d1786);
+	WREG32(mmDDR_MC_CH0_SBRWDATA1, 0x265f03dd);
+
+	val = RREG32(mmDDR_MC_CH0_RFSHCTL3);
+
+	WREG32(mmDDR_MISC_CH0_CFG_DONE, 0x00000001);
+
+	WREG32(mmDDR_MC_CH0_DBG1, 0x00000000);
+
+	val = RREG32(mmDDR_MC_CH0_PWRCTL);
+
+	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000002);
+
+	val = RREG32(mmDDR_MC_CH0_PWRCTL);
+
+	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH0_SWCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000060);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH0_PCTRL_0, 0x00000001);
+
+	goya->hw_cap_initialized |= HW_CAP_DDR_0;
+}
+
+/**
+ * goya_init_ddr_ch1 - Initialize DDR CH1 controller of the chip
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_init_ddr_ch1(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 val;
+
+	if (goya->hw_cap_initialized & HW_CAP_DDR_1)
+		return;
+
+	val = RREG32(mmDDR_MISC_CH1_CFG_DONE);
+	if (val & DDR_MISC_CH1_CFG_DONE_CFG_DONE_MASK) {
+		goya->hw_cap_initialized |= HW_CAP_DDR_1;
+		return;
+	}
+
+	WREG32(mmDDR_MC_CH1_DBG1, 0x00000001);
+	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000001);
+
+	val = RREG32(mmDDR_MC_CH1_STAT);
+
+	WREG32(mmDDR_MC_CH1_MSTR, 0x81040210);
+	WREG32(mmDDR_MC_CH1_MRCTRL0, 0x4000a0f0);
+	WREG32(mmDDR_MC_CH1_MRCTRL1, 0x00022ad0);
+	WREG32(mmDDR_MC_CH1_MRCTRL2, 0x091629e1);
+	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000008);
+	WREG32(mmDDR_MC_CH1_PWRTMG, 0x00040002);
+	WREG32(mmDDR_MC_CH1_HWLPCTL, 0x00be0002);
+	WREG32(mmDDR_MC_CH1_RFSHCTL0, 0x0091f020);
+	WREG32(mmDDR_MC_CH1_RFSHCTL1, 0x00120018);
+	WREG32((mmDDR_MC_CH1_MSTR + 0x00000058), 0x00160005);
+	WREG32(mmDDR_MC_CH1_RFSHCTL3, 0x00000020);
+	WREG32(mmDDR_MC_CH1_RFSHTMG, 0x003000d0);
+	WREG32(mmDDR_MC_CH1_ECCCFG0, 0x00000010);
+	WREG32(mmDDR_MC_CH1_ECCCFG1, 0x00000002);
+	WREG32(mmDDR_MC_CH1_ECCCTL, 0x00000300);
+	WREG32(mmDDR_MC_CH1_ECCPOISONADDR0, 0x00000078);
+	WREG32(mmDDR_MC_CH1_ECCPOISONADDR1, 0x100062f7);
+	WREG32(mmDDR_MC_CH1_CRCPARCTL0, 0x00008000);
+	WREG32(mmDDR_MC_CH1_CRCPARCTL1, 0x0e088301);
+	WREG32(mmDDR_MC_CH1_CRCPARCTL2, 0x00600527);
+	WREG32(mmDDR_MC_CH1_INIT0, 0x00070002);
+	WREG32(mmDDR_MC_CH1_INIT1, 0x0001000e);
+	WREG32(mmDDR_MC_CH1_INIT3, 0x0c510001);
+	WREG32(mmDDR_MC_CH1_INIT4, 0x00280400);
+	WREG32(mmDDR_MC_CH1_INIT5, 0x00110000);
+	WREG32(mmDDR_MC_CH1_INIT6, 0x02000643);
+	WREG32(mmDDR_MC_CH1_INIT7, 0x00001000);
+	WREG32(mmDDR_MC_CH1_DIMMCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH1_RANKCTL, 0x000009a0);
+	WREG32(mmDDR_MC_CH1_DRAMTMG0, 0x1918361a);
+	WREG32(mmDDR_MC_CH1_DRAMTMG1, 0x00080724);
+	WREG32(mmDDR_MC_CH1_DRAMTMG2, 0x080d0713);
+	WREG32(mmDDR_MC_CH1_DRAMTMG3, 0x00012012);
+	WREG32(mmDDR_MC_CH1_DRAMTMG4, 0x0b04060b);
+	WREG32(mmDDR_MC_CH1_DRAMTMG5, 0x0a0c0804);
+	WREG32(mmDDR_MC_CH1_DRAMTMG8, 0x0606490c);
+	WREG32(mmDDR_MC_CH1_DRAMTMG9, 0x0002050f);
+	WREG32(mmDDR_MC_CH1_DRAMTMG10, 0x000e0d0f);
+	WREG32(mmDDR_MC_CH1_DRAMTMG11, 0x270b011f);
+	WREG32(mmDDR_MC_CH1_DRAMTMG12, 0x00000010);
+	WREG32(mmDDR_MC_CH1_DRAMTMG15, 0x00000000);
+	WREG32(mmDDR_MC_CH1_ZQCTL0, 0x31000040);
+	WREG32(mmDDR_MC_CH1_ZQCTL1, 0x00000070);
+	WREG32(mmDDR_MC_CH1_DFITMG0, 0x05978211);
+	WREG32(mmDDR_MC_CH1_DFITMG1, 0x00080101);
+	WREG32(mmDDR_MC_CH1_DFILPCFG0, 0x07006031);
+	WREG32(mmDDR_MC_CH1_DFILPCFG1, 0x00000010);
+	WREG32(mmDDR_MC_CH1_DFIUPD0, 0x40400018);
+	WREG32(mmDDR_MC_CH1_DFIUPD1, 0x000b0046);
+	WREG32(mmDDR_MC_CH1_DFIUPD2, 0x00000000);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH1_DFITMG2, 0x00001711);
+	WREG32(mmDDR_MC_CH1_DFITMG3, 0x0000001e);
+	WREG32(mmDDR_MC_CH1_DBICTL, 0x00000001);
+	WREG32(mmDDR_MC_CH1_DFIPHYMSTR, 0x00000000);
+	WREG32(mmDDR_MC_CH1_ADDRMAP0, 0x00001f1f);
+	WREG32(mmDDR_MC_CH1_ADDRMAP1, 0x003f1503);
+	WREG32(mmDDR_MC_CH1_ADDRMAP2, 0x01000400);
+	WREG32(mmDDR_MC_CH1_ADDRMAP3, 0x04000505);
+	WREG32(mmDDR_MC_CH1_ADDRMAP4, 0x00001f1f);
+	WREG32(mmDDR_MC_CH1_ADDRMAP5, 0x06060303);
+	WREG32(mmDDR_MC_CH1_ADDRMAP6, 0x0f050709);
+	WREG32(mmDDR_MC_CH1_ADDRMAP7, 0x00000f0f);
+	WREG32(mmDDR_MC_CH1_ADDRMAP8, 0x00003f01);
+	WREG32(mmDDR_MC_CH1_ADDRMAP9, 0x09000606);
+	WREG32(mmDDR_MC_CH1_ADDRMAP10, 0x02090105);
+	WREG32(mmDDR_MC_CH1_ADDRMAP11, 0x0000000a);
+	WREG32(mmDDR_MC_CH1_ODTCFG, 0x09090a08);
+	WREG32(mmDDR_MC_CH1_ODTMAP, 0x9ae1b5fe);
+	WREG32(mmDDR_MC_CH1_SCHED, 0x664d3700);
+	WREG32(mmDDR_MC_CH1_SCHED1, 0x00000000);
+	WREG32(mmDDR_MC_CH1_PERFHPR1, 0x1700e024);
+	WREG32(mmDDR_MC_CH1_PERFLPR1, 0x1e00836c);
+	WREG32(mmDDR_MC_CH1_PERFWR1, 0x260046c9);
+	WREG32(mmDDR_MC_CH1_DQMAP0, 0x0d2b3503);
+	WREG32(mmDDR_MC_CH1_DQMAP1, 0x042a0537);
+	WREG32(mmDDR_MC_CH1_DQMAP2, 0x330b2806);
+	WREG32(mmDDR_MC_CH1_DQMAP3, 0x27013803);
+	WREG32(mmDDR_MC_CH1_DQMAP4, 0x0000022c);
+	WREG32(mmDDR_MC_CH1_DQMAP5, 0x00000001);
+	WREG32(mmDDR_MC_CH1_DBG0, 0x00000001);
+	WREG32(mmDDR_MC_CH1_DBG1, 0x00000000);
+	WREG32(mmDDR_MC_CH1_DBGCMD, 0x00000000);
+	WREG32(mmDDR_MC_CH1_SWCTL, 0x00000001);
+	WREG32(mmDDR_MC_CH1_POISONCFG, 0x00000001);
+	WREG32(mmDDR_MC_CH1_ADVECCINDEX, 0x00000004);
+	WREG32(mmDDR_MC_CH1_ECCPOISONPAT0, 0x00000000);
+	WREG32(mmDDR_MC_CH1_ECCPOISONPAT1, 0x00000000);
+	WREG32(mmDDR_MC_CH1_ECCPOISONPAT2, 0x00000000);
+	WREG32(mmDDR_MC_CH1_CAPARPOISONCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH1_PCCFG, 0x00000011);
+	WREG32(mmDDR_MC_CH1_PCFGR_0, 0x0000518c);
+	WREG32(mmDDR_MC_CH1_PCFGW_0, 0x00001263);
+	WREG32(mmDDR_MC_CH1_PCTRL_0, 0x00000001);
+	WREG32(mmDDR_MC_CH1_PCFGQOS0_0, 0x0011000e);
+	WREG32(mmDDR_MC_CH1_SBRCTL, 0x0016b540);
+	WREG32(mmDDR_MC_CH1_SBRWDATA0, 0x8c1d1786);
+	WREG32(mmDDR_MC_CH1_SBRWDATA1, 0x265f03dd);
+
+	val = RREG32(mmDDR_MC_CH1_RFSHCTL3);
+
+	WREG32(mmDDR_MISC_CH1_CFG_DONE, 0x00000001);
+
+	WREG32(mmDDR_MC_CH1_DBG1, 0x00000000);
+
+	val = RREG32(mmDDR_MC_CH1_PWRCTL);
+
+	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000002);
+
+	val = RREG32(mmDDR_MC_CH1_PWRCTL);
+
+	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH1_SWCTL, 0x00000000);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000060);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
+	WREG32(mmDDR_MC_CH1_PCTRL_0, 0x00000001);
+
+	goya->hw_cap_initialized |= HW_CAP_DDR_1;
+}
+
+static void _goya_tpc_mbist_workaround(struct hl_device *hdev, u8 tpc_id)
+{
+	u64 tpc_eml_address;
+	u32 val, tpc_offset, tpc_eml_offset, tpc_slm_offset;
+	int err, slm_index;
+
+	WARN_ON(tpc_id >= TPC_MAX_NUM);
+
+	tpc_offset = tpc_id * 0x40000;
+	tpc_eml_offset = tpc_id * 0x200000;
+	tpc_eml_address = (mmTPC0_EML_CFG_BASE + tpc_eml_offset - CFG_BASE);
+	tpc_slm_offset = tpc_eml_address + 0x100000;
+
+	/*
+	 * Workaround for Bug H2 #2443 :
+	 * "TPC SB is not initialized on chip reset"
+	 */
+
+	val = RREG32(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset);
+	if (val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_ACTIVE_MASK)
+		dev_warn(hdev->dev, "TPC%d MBIST ACTIVE is not cleared\n",
+			tpc_id);
+
+	WREG32(mmTPC0_CFG_FUNC_MBIST_PAT + tpc_offset, val & 0xFFFFF000);
+
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_0 + tpc_offset, 0x37FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_1 + tpc_offset, 0x303F);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_2 + tpc_offset, 0x71FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_3 + tpc_offset, 0x71FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_4 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_5 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_6 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_7 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_8 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_9 + tpc_offset, 0x70FF);
+
+	WREG32_OR(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
+		1 << TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_START_SHIFT);
+
+	err = hl_poll_timeout(
+		hdev,
+		mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
+		val,
+		(val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_DONE_MASK),
+		1000,
+		HL_DEVICE_TIMEOUT_USEC);
+
+	if (err)
+		dev_err(hdev->dev,
+			"Timeout while waiting for TPC%d MBIST DONE\n", tpc_id);
+
+	WREG32_OR(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
+		1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT);
+
+	msleep(GOYA_RESET_WAIT_MSEC);
+
+	WREG32_AND(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
+		~(1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT));
+
+	msleep(GOYA_RESET_WAIT_MSEC);
+
+	for (slm_index = 0 ; slm_index < 256 ; slm_index++)
+		WREG32(tpc_slm_offset + (slm_index << 2), 0);
+
+	val = RREG32(tpc_slm_offset);
+
+	WREG32(mmTPC0_CFG_BASE + tpc_offset + 0xF40 - CFG_BASE, 0x100);
+}
+
+static void goya_tpc_mbist_workaround(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i;
+
+	if (hdev->pldm)
+		return;
+
+	if (goya->hw_cap_initialized & HW_CAP_TPC_MBIST)
+		return;
+
+	/* Workaround for H2 #2443 */
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++)
+		_goya_tpc_mbist_workaround(hdev, i);
+
+	goya->hw_cap_initialized |= HW_CAP_TPC_MBIST;
+}
+
+/**
+ * goya_init_golden_registers - Initialize golden registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Initialize the H/W registers of the device
+ *
+ */
+static void goya_init_golden_registers(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 polynom[10], tpc_intr_mask;
+
+	if (goya->hw_cap_initialized & HW_CAP_GOLDEN)
+		return;
+
+	polynom[0] = 0x00020080;
+	polynom[1] = 0x00401000;
+	polynom[2] = 0x00200800;
+	polynom[3] = 0x00002000;
+	polynom[4] = 0x00080200;
+	polynom[5] = 0x00040100;
+	polynom[6] = 0x00100400;
+	polynom[7] = 0x00004000;
+	polynom[8] = 0x00010000;
+	polynom[9] = 0x00008000;
+
+	/* Mask all arithmetic interrupts from TPC */
+	tpc_intr_mask = 0x7FFF;
+
+	WREG32(mmDMA_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmDMA_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmDMA_NRTR_SCRAMB_EN, 1 << DMA_NRTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmDMA_NRTR_NON_LIN_SCRAMB,
+			1 << DMA_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmSRAM_Y5_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y4_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y3_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y2_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y1_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y0_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y5_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y4_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y3_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y2_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y1_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y0_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y5_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y4_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y3_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y2_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y1_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y0_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y5_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y4_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y3_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y2_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y1_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y0_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y5_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y4_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y3_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y2_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y1_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y0_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
+	WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_L_ARB, 0x204);
+	WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_E_ARB, 0x206);
+	WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_E_ARB, 0x207);
+	WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_W_ARB, 0x207);
+	WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_W_ARB, 0x206);
+	WREG32(mmSRAM_Y5_X0_RTR_HBW_WR_RS_E_ARB, 0x101);
+	WREG32(mmSRAM_Y4_X0_RTR_HBW_WR_RS_E_ARB, 0x101);
+	WREG32(mmSRAM_Y3_X0_RTR_HBW_WR_RS_E_ARB, 0x101);
+	WREG32(mmSRAM_Y2_X0_RTR_HBW_WR_RS_E_ARB, 0x101);
+	WREG32(mmSRAM_Y1_X0_RTR_HBW_WR_RS_E_ARB, 0x101);
+	WREG32(mmSRAM_Y0_X0_RTR_HBW_WR_RS_E_ARB, 0x101);
+	WREG32(mmSRAM_Y5_X1_RTR_HBW_WR_RS_E_ARB, 0x102);
+	WREG32(mmSRAM_Y4_X1_RTR_HBW_WR_RS_E_ARB, 0x102);
+	WREG32(mmSRAM_Y3_X1_RTR_HBW_WR_RS_E_ARB, 0x102);
+	WREG32(mmSRAM_Y2_X1_RTR_HBW_WR_RS_E_ARB, 0x102);
+	WREG32(mmSRAM_Y1_X1_RTR_HBW_WR_RS_E_ARB, 0x102);
+	WREG32(mmSRAM_Y0_X1_RTR_HBW_WR_RS_E_ARB, 0x102);
+	WREG32(mmSRAM_Y5_X2_RTR_HBW_WR_RS_E_ARB, 0x103);
+	WREG32(mmSRAM_Y4_X2_RTR_HBW_WR_RS_E_ARB, 0x103);
+	WREG32(mmSRAM_Y3_X2_RTR_HBW_WR_RS_E_ARB, 0x103);
+	WREG32(mmSRAM_Y2_X2_RTR_HBW_WR_RS_E_ARB, 0x103);
+	WREG32(mmSRAM_Y1_X2_RTR_HBW_WR_RS_E_ARB, 0x103);
+	WREG32(mmSRAM_Y0_X2_RTR_HBW_WR_RS_E_ARB, 0x103);
+	WREG32(mmSRAM_Y5_X3_RTR_HBW_WR_RS_E_ARB, 0x104);
+	WREG32(mmSRAM_Y4_X3_RTR_HBW_WR_RS_E_ARB, 0x104);
+	WREG32(mmSRAM_Y3_X3_RTR_HBW_WR_RS_E_ARB, 0x104);
+	WREG32(mmSRAM_Y2_X3_RTR_HBW_WR_RS_E_ARB, 0x104);
+	WREG32(mmSRAM_Y1_X3_RTR_HBW_WR_RS_E_ARB, 0x104);
+	WREG32(mmSRAM_Y0_X3_RTR_HBW_WR_RS_E_ARB, 0x104);
+	WREG32(mmSRAM_Y5_X4_RTR_HBW_WR_RS_E_ARB, 0x105);
+	WREG32(mmSRAM_Y4_X4_RTR_HBW_WR_RS_E_ARB, 0x105);
+	WREG32(mmSRAM_Y3_X4_RTR_HBW_WR_RS_E_ARB, 0x105);
+	WREG32(mmSRAM_Y2_X4_RTR_HBW_WR_RS_E_ARB, 0x105);
+	WREG32(mmSRAM_Y1_X4_RTR_HBW_WR_RS_E_ARB, 0x105);
+	WREG32(mmSRAM_Y0_X4_RTR_HBW_WR_RS_E_ARB, 0x105);
+	WREG32(mmSRAM_Y5_X0_RTR_HBW_WR_RS_W_ARB, 0x105);
+	WREG32(mmSRAM_Y4_X0_RTR_HBW_WR_RS_W_ARB, 0x105);
+	WREG32(mmSRAM_Y3_X0_RTR_HBW_WR_RS_W_ARB, 0x105);
+	WREG32(mmSRAM_Y2_X0_RTR_HBW_WR_RS_W_ARB, 0x105);
+	WREG32(mmSRAM_Y1_X0_RTR_HBW_WR_RS_W_ARB, 0x105);
+	WREG32(mmSRAM_Y0_X0_RTR_HBW_WR_RS_W_ARB, 0x105);
+	WREG32(mmSRAM_Y5_X1_RTR_HBW_WR_RS_W_ARB, 0x104);
+	WREG32(mmSRAM_Y4_X1_RTR_HBW_WR_RS_W_ARB, 0x104);
+	WREG32(mmSRAM_Y3_X1_RTR_HBW_WR_RS_W_ARB, 0x104);
+	WREG32(mmSRAM_Y2_X1_RTR_HBW_WR_RS_W_ARB, 0x104);
+	WREG32(mmSRAM_Y1_X1_RTR_HBW_WR_RS_W_ARB, 0x104);
+	WREG32(mmSRAM_Y0_X1_RTR_HBW_WR_RS_W_ARB, 0x104);
+	WREG32(mmSRAM_Y5_X2_RTR_HBW_WR_RS_W_ARB, 0x103);
+	WREG32(mmSRAM_Y4_X2_RTR_HBW_WR_RS_W_ARB, 0x103);
+	WREG32(mmSRAM_Y3_X2_RTR_HBW_WR_RS_W_ARB, 0x103);
+	WREG32(mmSRAM_Y2_X2_RTR_HBW_WR_RS_W_ARB, 0x103);
+	WREG32(mmSRAM_Y1_X2_RTR_HBW_WR_RS_W_ARB, 0x103);
+	WREG32(mmSRAM_Y0_X2_RTR_HBW_WR_RS_W_ARB, 0x103);
+	WREG32(mmSRAM_Y5_X3_RTR_HBW_WR_RS_W_ARB, 0x102);
+	WREG32(mmSRAM_Y4_X3_RTR_HBW_WR_RS_W_ARB, 0x102);
+	WREG32(mmSRAM_Y3_X3_RTR_HBW_WR_RS_W_ARB, 0x102);
+	WREG32(mmSRAM_Y2_X3_RTR_HBW_WR_RS_W_ARB, 0x102);
+	WREG32(mmSRAM_Y1_X3_RTR_HBW_WR_RS_W_ARB, 0x102);
+	WREG32(mmSRAM_Y0_X3_RTR_HBW_WR_RS_W_ARB, 0x102);
+	WREG32(mmSRAM_Y5_X4_RTR_HBW_WR_RS_W_ARB, 0x101);
+	WREG32(mmSRAM_Y4_X4_RTR_HBW_WR_RS_W_ARB, 0x101);
+	WREG32(mmSRAM_Y3_X4_RTR_HBW_WR_RS_W_ARB, 0x101);
+	WREG32(mmSRAM_Y2_X4_RTR_HBW_WR_RS_W_ARB, 0x101);
+	WREG32(mmSRAM_Y1_X4_RTR_HBW_WR_RS_W_ARB, 0x101);
+	WREG32(mmSRAM_Y0_X4_RTR_HBW_WR_RS_W_ARB, 0x101);
+
+	WREG32(mmMME_STORE_MAX_CREDIT, 0x21);
+	WREG32(mmMME_AGU, 0x0f0f0f10);
+	WREG32(mmMME_SEI_MASK, ~0x0);
+
+	WREG32(mmMME6_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmMME5_RTR_HBW_RD_RQ_N_ARB, 0x01040101);
+	WREG32(mmMME4_RTR_HBW_RD_RQ_N_ARB, 0x01030101);
+	WREG32(mmMME3_RTR_HBW_RD_RQ_N_ARB, 0x01020101);
+	WREG32(mmMME2_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmMME1_RTR_HBW_RD_RQ_N_ARB, 0x07010701);
+	WREG32(mmMME6_RTR_HBW_RD_RQ_S_ARB, 0x04010401);
+	WREG32(mmMME5_RTR_HBW_RD_RQ_S_ARB, 0x04050401);
+	WREG32(mmMME4_RTR_HBW_RD_RQ_S_ARB, 0x03070301);
+	WREG32(mmMME3_RTR_HBW_RD_RQ_S_ARB, 0x01030101);
+	WREG32(mmMME2_RTR_HBW_RD_RQ_S_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_RD_RQ_S_ARB, 0x01050105);
+	WREG32(mmMME6_RTR_HBW_RD_RQ_W_ARB, 0x01010501);
+	WREG32(mmMME5_RTR_HBW_RD_RQ_W_ARB, 0x01010501);
+	WREG32(mmMME4_RTR_HBW_RD_RQ_W_ARB, 0x01040301);
+	WREG32(mmMME3_RTR_HBW_RD_RQ_W_ARB, 0x01030401);
+	WREG32(mmMME2_RTR_HBW_RD_RQ_W_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_RD_RQ_W_ARB, 0x01050101);
+	WREG32(mmMME6_RTR_HBW_WR_RQ_N_ARB, 0x02020202);
+	WREG32(mmMME5_RTR_HBW_WR_RQ_N_ARB, 0x01070101);
+	WREG32(mmMME4_RTR_HBW_WR_RQ_N_ARB, 0x02020201);
+	WREG32(mmMME3_RTR_HBW_WR_RQ_N_ARB, 0x07020701);
+	WREG32(mmMME2_RTR_HBW_WR_RQ_N_ARB, 0x01020101);
+	WREG32(mmMME1_RTR_HBW_WR_RQ_N_ARB, 0x01010101);
+	WREG32(mmMME6_RTR_HBW_WR_RQ_S_ARB, 0x01070101);
+	WREG32(mmMME5_RTR_HBW_WR_RQ_S_ARB, 0x01070101);
+	WREG32(mmMME4_RTR_HBW_WR_RQ_S_ARB, 0x07020701);
+	WREG32(mmMME3_RTR_HBW_WR_RQ_S_ARB, 0x02020201);
+	WREG32(mmMME2_RTR_HBW_WR_RQ_S_ARB, 0x01070101);
+	WREG32(mmMME1_RTR_HBW_WR_RQ_S_ARB, 0x01020102);
+	WREG32(mmMME6_RTR_HBW_WR_RQ_W_ARB, 0x01020701);
+	WREG32(mmMME5_RTR_HBW_WR_RQ_W_ARB, 0x01020701);
+	WREG32(mmMME4_RTR_HBW_WR_RQ_W_ARB, 0x07020707);
+	WREG32(mmMME3_RTR_HBW_WR_RQ_W_ARB, 0x01020201);
+	WREG32(mmMME2_RTR_HBW_WR_RQ_W_ARB, 0x01070201);
+	WREG32(mmMME1_RTR_HBW_WR_RQ_W_ARB, 0x01070201);
+	WREG32(mmMME6_RTR_HBW_RD_RS_N_ARB, 0x01070102);
+	WREG32(mmMME5_RTR_HBW_RD_RS_N_ARB, 0x01070102);
+	WREG32(mmMME4_RTR_HBW_RD_RS_N_ARB, 0x01060102);
+	WREG32(mmMME3_RTR_HBW_RD_RS_N_ARB, 0x01040102);
+	WREG32(mmMME2_RTR_HBW_RD_RS_N_ARB, 0x01020102);
+	WREG32(mmMME1_RTR_HBW_RD_RS_N_ARB, 0x01020107);
+	WREG32(mmMME6_RTR_HBW_RD_RS_S_ARB, 0x01020106);
+	WREG32(mmMME5_RTR_HBW_RD_RS_S_ARB, 0x01020102);
+	WREG32(mmMME4_RTR_HBW_RD_RS_S_ARB, 0x01040102);
+	WREG32(mmMME3_RTR_HBW_RD_RS_S_ARB, 0x01060102);
+	WREG32(mmMME2_RTR_HBW_RD_RS_S_ARB, 0x01070102);
+	WREG32(mmMME1_RTR_HBW_RD_RS_S_ARB, 0x01070102);
+	WREG32(mmMME6_RTR_HBW_RD_RS_E_ARB, 0x01020702);
+	WREG32(mmMME5_RTR_HBW_RD_RS_E_ARB, 0x01020702);
+	WREG32(mmMME4_RTR_HBW_RD_RS_E_ARB, 0x01040602);
+	WREG32(mmMME3_RTR_HBW_RD_RS_E_ARB, 0x01060402);
+	WREG32(mmMME2_RTR_HBW_RD_RS_E_ARB, 0x01070202);
+	WREG32(mmMME1_RTR_HBW_RD_RS_E_ARB, 0x01070102);
+	WREG32(mmMME6_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME5_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME4_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME3_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME2_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME1_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME6_RTR_HBW_WR_RS_N_ARB, 0x01050101);
+	WREG32(mmMME5_RTR_HBW_WR_RS_N_ARB, 0x01040101);
+	WREG32(mmMME4_RTR_HBW_WR_RS_N_ARB, 0x01030101);
+	WREG32(mmMME3_RTR_HBW_WR_RS_N_ARB, 0x01020101);
+	WREG32(mmMME2_RTR_HBW_WR_RS_N_ARB, 0x01010101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_N_ARB, 0x01010107);
+	WREG32(mmMME6_RTR_HBW_WR_RS_S_ARB, 0x01010107);
+	WREG32(mmMME5_RTR_HBW_WR_RS_S_ARB, 0x01010101);
+	WREG32(mmMME4_RTR_HBW_WR_RS_S_ARB, 0x01020101);
+	WREG32(mmMME3_RTR_HBW_WR_RS_S_ARB, 0x01030101);
+	WREG32(mmMME2_RTR_HBW_WR_RS_S_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_S_ARB, 0x01050101);
+	WREG32(mmMME6_RTR_HBW_WR_RS_E_ARB, 0x01010501);
+	WREG32(mmMME5_RTR_HBW_WR_RS_E_ARB, 0x01010501);
+	WREG32(mmMME4_RTR_HBW_WR_RS_E_ARB, 0x01040301);
+	WREG32(mmMME3_RTR_HBW_WR_RS_E_ARB, 0x01030401);
+	WREG32(mmMME2_RTR_HBW_WR_RS_E_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_E_ARB, 0x01050101);
+	WREG32(mmMME6_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME5_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME4_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME3_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME2_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+
+	WREG32(mmMME1_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmMME1_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmMME2_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmMME2_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmMME3_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmMME3_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmMME4_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmMME4_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmMME5_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmMME5_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmMME6_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmMME6_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmMME1_RTR_SCRAMB_EN, 1 << MME1_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmMME1_RTR_NON_LIN_SCRAMB,
+			1 << MME1_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmMME2_RTR_SCRAMB_EN, 1 << MME2_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmMME2_RTR_NON_LIN_SCRAMB,
+			1 << MME2_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmMME3_RTR_SCRAMB_EN, 1 << MME3_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmMME3_RTR_NON_LIN_SCRAMB,
+			1 << MME3_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmMME4_RTR_SCRAMB_EN, 1 << MME4_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmMME4_RTR_NON_LIN_SCRAMB,
+			1 << MME4_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmMME5_RTR_SCRAMB_EN, 1 << MME5_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmMME5_RTR_NON_LIN_SCRAMB,
+			1 << MME5_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmMME6_RTR_SCRAMB_EN, 1 << MME6_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmMME6_RTR_NON_LIN_SCRAMB,
+			1 << MME6_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC0_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC0_NRTR_SCRAMB_EN, 1 << TPC0_NRTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC0_NRTR_NON_LIN_SCRAMB,
+			1 << TPC0_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC0_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC1_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_RD_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_RD_RQ_E_ARB, 0x01060101);
+	WREG32(mmTPC1_RTR_HBW_WR_RQ_N_ARB, 0x02020102);
+	WREG32(mmTPC1_RTR_HBW_WR_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_WR_RQ_E_ARB, 0x02070202);
+	WREG32(mmTPC1_RTR_HBW_RD_RS_N_ARB, 0x01020201);
+	WREG32(mmTPC1_RTR_HBW_RD_RS_S_ARB, 0x01070201);
+	WREG32(mmTPC1_RTR_HBW_RD_RS_W_ARB, 0x01070202);
+	WREG32(mmTPC1_RTR_HBW_WR_RS_N_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_WR_RS_S_ARB, 0x01050101);
+	WREG32(mmTPC1_RTR_HBW_WR_RS_W_ARB, 0x01050101);
+
+	WREG32(mmTPC1_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC1_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC1_RTR_SCRAMB_EN, 1 << TPC1_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC1_RTR_NON_LIN_SCRAMB,
+			1 << TPC1_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC1_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC2_RTR_HBW_RD_RQ_N_ARB, 0x01020101);
+	WREG32(mmTPC2_RTR_HBW_RD_RQ_S_ARB, 0x01050101);
+	WREG32(mmTPC2_RTR_HBW_RD_RQ_E_ARB, 0x01010201);
+	WREG32(mmTPC2_RTR_HBW_WR_RQ_N_ARB, 0x02040102);
+	WREG32(mmTPC2_RTR_HBW_WR_RQ_S_ARB, 0x01050101);
+	WREG32(mmTPC2_RTR_HBW_WR_RQ_E_ARB, 0x02060202);
+	WREG32(mmTPC2_RTR_HBW_RD_RS_N_ARB, 0x01020201);
+	WREG32(mmTPC2_RTR_HBW_RD_RS_S_ARB, 0x01070201);
+	WREG32(mmTPC2_RTR_HBW_RD_RS_W_ARB, 0x01070202);
+	WREG32(mmTPC2_RTR_HBW_WR_RS_N_ARB, 0x01010101);
+	WREG32(mmTPC2_RTR_HBW_WR_RS_S_ARB, 0x01040101);
+	WREG32(mmTPC2_RTR_HBW_WR_RS_W_ARB, 0x01040101);
+
+	WREG32(mmTPC2_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC2_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC2_RTR_SCRAMB_EN, 1 << TPC2_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC2_RTR_NON_LIN_SCRAMB,
+			1 << TPC2_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC2_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC3_RTR_HBW_RD_RQ_N_ARB, 0x01030101);
+	WREG32(mmTPC3_RTR_HBW_RD_RQ_S_ARB, 0x01040101);
+	WREG32(mmTPC3_RTR_HBW_RD_RQ_E_ARB, 0x01040301);
+	WREG32(mmTPC3_RTR_HBW_WR_RQ_N_ARB, 0x02060102);
+	WREG32(mmTPC3_RTR_HBW_WR_RQ_S_ARB, 0x01040101);
+	WREG32(mmTPC3_RTR_HBW_WR_RQ_E_ARB, 0x01040301);
+	WREG32(mmTPC3_RTR_HBW_RD_RS_N_ARB, 0x01040201);
+	WREG32(mmTPC3_RTR_HBW_RD_RS_S_ARB, 0x01060201);
+	WREG32(mmTPC3_RTR_HBW_RD_RS_W_ARB, 0x01060402);
+	WREG32(mmTPC3_RTR_HBW_WR_RS_N_ARB, 0x01020101);
+	WREG32(mmTPC3_RTR_HBW_WR_RS_S_ARB, 0x01030101);
+	WREG32(mmTPC3_RTR_HBW_WR_RS_W_ARB, 0x01030401);
+
+	WREG32(mmTPC3_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC3_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC3_RTR_SCRAMB_EN, 1 << TPC3_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC3_RTR_NON_LIN_SCRAMB,
+			1 << TPC3_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC3_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC4_RTR_HBW_RD_RQ_N_ARB, 0x01040101);
+	WREG32(mmTPC4_RTR_HBW_RD_RQ_S_ARB, 0x01030101);
+	WREG32(mmTPC4_RTR_HBW_RD_RQ_E_ARB, 0x01030401);
+	WREG32(mmTPC4_RTR_HBW_WR_RQ_N_ARB, 0x02070102);
+	WREG32(mmTPC4_RTR_HBW_WR_RQ_S_ARB, 0x01030101);
+	WREG32(mmTPC4_RTR_HBW_WR_RQ_E_ARB, 0x02060702);
+	WREG32(mmTPC4_RTR_HBW_RD_RS_N_ARB, 0x01060201);
+	WREG32(mmTPC4_RTR_HBW_RD_RS_S_ARB, 0x01040201);
+	WREG32(mmTPC4_RTR_HBW_RD_RS_W_ARB, 0x01040602);
+	WREG32(mmTPC4_RTR_HBW_WR_RS_N_ARB, 0x01030101);
+	WREG32(mmTPC4_RTR_HBW_WR_RS_S_ARB, 0x01020101);
+	WREG32(mmTPC4_RTR_HBW_WR_RS_W_ARB, 0x01040301);
+
+	WREG32(mmTPC4_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC4_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC4_RTR_SCRAMB_EN, 1 << TPC4_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC4_RTR_NON_LIN_SCRAMB,
+			1 << TPC4_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC4_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC5_RTR_HBW_RD_RQ_N_ARB, 0x01050101);
+	WREG32(mmTPC5_RTR_HBW_RD_RQ_S_ARB, 0x01020101);
+	WREG32(mmTPC5_RTR_HBW_RD_RQ_E_ARB, 0x01200501);
+	WREG32(mmTPC5_RTR_HBW_WR_RQ_N_ARB, 0x02070102);
+	WREG32(mmTPC5_RTR_HBW_WR_RQ_S_ARB, 0x01020101);
+	WREG32(mmTPC5_RTR_HBW_WR_RQ_E_ARB, 0x02020602);
+	WREG32(mmTPC5_RTR_HBW_RD_RS_N_ARB, 0x01070201);
+	WREG32(mmTPC5_RTR_HBW_RD_RS_S_ARB, 0x01020201);
+	WREG32(mmTPC5_RTR_HBW_RD_RS_W_ARB, 0x01020702);
+	WREG32(mmTPC5_RTR_HBW_WR_RS_N_ARB, 0x01040101);
+	WREG32(mmTPC5_RTR_HBW_WR_RS_S_ARB, 0x01010101);
+	WREG32(mmTPC5_RTR_HBW_WR_RS_W_ARB, 0x01010501);
+
+	WREG32(mmTPC5_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC5_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC5_RTR_SCRAMB_EN, 1 << TPC5_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC5_RTR_NON_LIN_SCRAMB,
+			1 << TPC5_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC5_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC6_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RQ_E_ARB, 0x01010601);
+	WREG32(mmTPC6_RTR_HBW_WR_RQ_N_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_WR_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_WR_RQ_E_ARB, 0x02020702);
+	WREG32(mmTPC6_RTR_HBW_RD_RS_N_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RS_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RS_W_ARB, 0x01020702);
+	WREG32(mmTPC6_RTR_HBW_WR_RS_N_ARB, 0x01050101);
+	WREG32(mmTPC6_RTR_HBW_WR_RS_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_WR_RS_W_ARB, 0x01010501);
+
+	WREG32(mmTPC6_RTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC6_RTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC6_RTR_SCRAMB_EN, 1 << TPC6_RTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC6_RTR_NON_LIN_SCRAMB,
+			1 << TPC6_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC6_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmTPC7_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmTPC7_NRTR_SCRAMB_EN, 1 << TPC7_NRTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmTPC7_NRTR_NON_LIN_SCRAMB,
+			1 << TPC7_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for Bug H2 #2441 :
+	 * "ST.NOP set trace event illegal opcode"
+	 */
+	WREG32(mmTPC7_CFG_TPC_INTR_MASK, tpc_intr_mask);
+
+	WREG32(mmPCI_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
+	WREG32(mmPCI_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
+
+	WREG32(mmPCI_NRTR_SCRAMB_EN, 1 << PCI_NRTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmPCI_NRTR_NON_LIN_SCRAMB,
+			1 << PCI_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for H2 #HW-23 bug:
+	 * Set DMA max outstanding read requests to 240 on DMA CH 1 and to 16
+	 * on the KMD DMA (DMA CH 0).
+	 * Only these DMAs need to be limited because the user can read from
+	 * the host only through DMA CH 1.
+	 */
+	WREG32(mmDMA_CH_0_CFG0, 0x0fff0010);
+	WREG32(mmDMA_CH_1_CFG0, 0x0fff00F0);
+
+	goya->hw_cap_initialized |= HW_CAP_GOLDEN;
+}
+
+
+/**
+ * goya_push_uboot_to_device - Push u-boot FW code to device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Copy u-boot fw code from firmware file to SRAM BAR.
+ * Returns 0 on success
+ *
+ */
+static int goya_push_uboot_to_device(struct hl_device *hdev)
+{
+	char fw_name[200];
+	const u64 *fw_data;
+	void __iomem *dst;
+	size_t fw_size, i;
+	int rc;
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
+
+	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request u-boot fw image\n");
+		goto out;
+	}
+
+	fw_size = hdev->spl_fw->size;
+	if ((fw_size % 4) != 0) {
+		dev_err(hdev->dev, "illegal u-boot firmware size %zu\n",
+			fw_size);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	dev_dbg(hdev->dev, "u-boot firmware size == %zu\n", fw_size);
+
+	fw_data = (const u64 *) hdev->spl_fw->data;
+	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		fw_size -= 4;
+
+	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
+		if (!(i & (0x80000 - 1)))
+			dev_dbg(hdev->dev,
+				"u-boot copied so far %zu out of %zu\n",
+				i, fw_size);
+
+		writeq(*fw_data, dst);
+	}
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		writel(*(const u32 *) fw_data, dst);
+
+out:
+	release_firmware(hdev->spl_fw);
+	return rc;
+}
+
+/**
+ * goya_push_linux_to_device - Push Linux FW code to device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Copy Linux fw code from firmware file to DDR BAR.
+ * Returns 0 on success
+ *
+ */
+static int goya_push_linux_to_device(struct hl_device *hdev)
+{
+	char fw_name[200];
+	const u64 *fw_data;
+	void __iomem *dst;
+	size_t fw_size, i;
+	int rc;
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
+
+	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request Linux fw image\n");
+		goto out;
+	}
+
+	fw_size = hdev->spl_fw->size;
+	if ((fw_size % 4) != 0) {
+		dev_err(hdev->dev, "illegal Linux firmware size %zu\n",
+			fw_size);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	dev_dbg(hdev->dev, "Linux firmware size == %zu\n", fw_size);
+
+	fw_data = (const u64 *) hdev->spl_fw->data;
+	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		fw_size -= 4;
+
+	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
+		if (!(i & (0x80000 - 1))) {
+			dev_dbg(hdev->dev,
+				"Linux copied so far %zu out of %zu\n",
+				i, fw_size);
+			usleep_range(20, 100);
+		}
+		writeq(*fw_data, dst);
+	}
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		writel(*(const u32 *) fw_data, dst);
+
+out:
+	release_firmware(hdev->spl_fw);
+	return rc;
+}
+
+static int goya_pldm_init_cpu(struct hl_device *hdev)
+{
+	u32 val, unit_rst_val;
+	int rc;
+
+	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
+	goya_init_golden_registers(hdev);
+
+	/* Put ARM cores into reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	/* Reset the CA53 MACRO */
+	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+
+	rc = goya_push_uboot_to_device(hdev);
+	if (rc)
+		return rc;
+
+	rc = goya_push_linux_to_device(hdev);
+	if (rc)
+		return rc;
+
+	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
+	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
+
+	WREG32(mmCPU_CA53_CFG_RST_ADDR_LSB_0,
+		lower_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+	WREG32(mmCPU_CA53_CFG_RST_ADDR_MSB_0,
+		upper_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+
+	/* Release ARM core 0 from reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL,
+					CPU_RESET_CORE0_DEASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	return 0;
+}
+
+/*
+ * FW component passes an offset from SRAM_BASE_ADDR in SCRATCHPAD_xx.
+ * The version string should be located by that offset.
+ */
+static void goya_read_device_fw_version(struct hl_device *hdev,
+					enum goya_fw_component fwc)
+{
+	const char *name;
+	u32 ver_off;
+	char *dest;
+
+	switch (fwc) {
+	case FW_COMP_UBOOT:
+		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_29);
+		dest = hdev->asic_prop.uboot_ver;
+		name = "U-Boot";
+		break;
+	case FW_COMP_PREBOOT:
+		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_28);
+		dest = hdev->asic_prop.preboot_ver;
+		name = "Preboot";
+		break;
+	default:
+		dev_warn(hdev->dev, "Undefined FW component: %d\n", fwc);
+		return;
+	}
+
+	ver_off &= ~((u32)SRAM_BASE_ADDR);
+
+	if (ver_off < SRAM_SIZE - VERSION_MAX_LEN) {
+		memcpy_fromio(dest, hdev->pcie_bar[SRAM_CFG_BAR_ID] + ver_off,
+							VERSION_MAX_LEN);
+	} else {
+		dev_err(hdev->dev, "%s version offset (0x%x) is above SRAM\n",
+								name, ver_off);
+		strcpy(dest, "unavailable");
+	}
+}
+
+static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 status;
+	int rc;
+
+	if (!hdev->cpu_enable)
+		return 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_CPU)
+		return 0;
+
+	/*
+	 * Before pushing u-boot/linux to device, need to set the ddr bar to
+	 * base address of dram
+	 */
+	rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to map DDR bar to DRAM base address\n");
+		return rc;
+	}
+
+	if (hdev->pldm) {
+		rc = goya_pldm_init_cpu(hdev);
+		if (rc)
+			return rc;
+
+		goto out;
+	}
+
+	/* Make sure CPU boot-loader is running */
+	rc = hl_poll_timeout(
+		hdev,
+		mmPSOC_GLOBAL_CONF_WARM_REBOOT,
+		status,
+		(status == CPU_BOOT_STATUS_DRAM_RDY) ||
+		(status == CPU_BOOT_STATUS_SRAM_AVAIL),
+		10000,
+		cpu_timeout);
+
+	if (rc) {
+		dev_err(hdev->dev, "Error in ARM u-boot !!!\n");
+		switch (status) {
+		case CPU_BOOT_STATUS_NA:
+			dev_err(hdev->dev,
+				"ARM status %d - BTL did NOT run\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_WFE:
+			dev_err(hdev->dev,
+				"ARM status %d - Inside WFE loop\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_BTL:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in BTL\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_PREBOOT:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in Preboot\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_SPL:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in SPL\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_UBOOT:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in u-boot\n", status);
+			break;
+		case CPU_BOOT_STATUS_DRAM_INIT_FAIL:
+			dev_err(hdev->dev,
+				"ARM status %d - DDR initialization failed\n",
+				status);
+			break;
+		default:
+			dev_err(hdev->dev,
+				"ARM status %d - Invalid status code\n",
+				status);
+			break;
+		}
+		return -EIO;
+	}
+
+	/* Read U-Boot and Preboot versions now in case we fail later */
+	goya_read_device_fw_version(hdev, FW_COMP_UBOOT);
+	goya_read_device_fw_version(hdev, FW_COMP_PREBOOT);
+
+	if (status == CPU_BOOT_STATUS_SRAM_AVAIL)
+		goto out;
+
+	if (!hdev->fw_loading) {
+		dev_info(hdev->dev, "Skip loading FW\n");
+		goto out;
+	}
+
+	rc = goya_push_linux_to_device(hdev);
+	if (rc)
+		return rc;
+
+	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
+
+	rc = hl_poll_timeout(
+		hdev,
+		mmPSOC_GLOBAL_CONF_WARM_REBOOT,
+		status,
+		(status == CPU_BOOT_STATUS_SRAM_AVAIL),
+		10000,
+		cpu_timeout);
+
+	if (rc) {
+		if (status == CPU_BOOT_STATUS_FIT_CORRUPTED)
+			dev_err(hdev->dev,
+				"ARM u-boot reports FIT image is corrupted\n");
+		else
+			dev_err(hdev->dev,
+				"ARM Linux failed to load, %d\n", status);
+		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_NA);
+		return -EIO;
+	}
+
+	dev_info(hdev->dev, "Successfully loaded firmware to device\n");
+
+out:
+	goya->hw_cap_initialized |= HW_CAP_CPU;
+
+	return 0;
+}
+
+/**
+ * goya_hw_init - Goya hardware initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+static int goya_hw_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u32 val;
+	int rc;
+
+	dev_info(hdev->dev, "Starting initialization of H/W\n");
+
+	/* Perform read from the device to make sure device is up */
+	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
+
+	goya_init_pll(hdev);
+
+	if (hdev->pldm) {
+		goya_init_ddr_ch0(hdev);
+		goya_init_ddr_ch1(hdev);
+	}
+
+	rc = goya_init_cpu(hdev, GOYA_CPU_TIMEOUT_USEC);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize CPU\n");
+		return rc;
+	}
+
+	goya_tpc_mbist_workaround(hdev);
+
+	goya_init_golden_registers(hdev);
+
+	/*
+	 * After CPU initialization is finished, change DDR bar mapping inside
+	 * iATU to point to the start address of the MMU page tables
+	 */
+	rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
+		(MMU_PAGE_TABLES_ADDR & ~(prop->dram_pci_bar_size - 0x1ull)));
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to map DDR bar to MMU page tables\n");
+		return rc;
+	}
+
+	goya_init_security(hdev);
+
+	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
+	rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
+	if (rc) {
+		dev_warn(hdev->dev, "Unable to set pci dma mask to 48 bits\n");
+		rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
+	if (rc) {
+		dev_warn(hdev->dev,
+			"Unable to set pci consistent dma mask to 48 bits\n");
+		rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci consistent dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	/* Perform read from the device to flush all MSI-X configuration */
+	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
+
+	return 0;
+}
+
+/**
+ * goya_hw_fini - Goya hardware tear-down code
+ *
+ * @hdev: pointer to hl_device structure
+ * @hard_reset: should we do hard reset to all engines or just reset the
+ *              compute/dma engines
+ *
+ * The function does the following:
+ * - Send interrupt to CPU to go into "quiet" mode
+ * - Stall MME, TPC
+ * - Stop External, Internal QMANs
+ * - Disable MSI-X
+ * - Issue reset command
+ * - Wait until reset is done
+ * - Start device BTL
+ *
+ */
+static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 reset_timeout_ms, status;
+
+	if (hdev->pldm)
+		reset_timeout_ms = GOYA_PLDM_RESET_TIMEOUT_MSEC;
+	else
+		reset_timeout_ms = GOYA_RESET_TIMEOUT_MSEC;
+
+	if (hard_reset) {
+		goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
+		goya_disable_clk_rlx(hdev);
+		goya_set_pll_refclk(hdev);
+
+		WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, RESET_ALL);
+		dev_info(hdev->dev,
+			"Issued HARD reset command, going to wait %dms\n",
+			reset_timeout_ms);
+	} else {
+		WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, DMA_MME_TPC_RESET);
+		dev_info(hdev->dev,
+			"Issued SOFT reset command, going to wait %dms\n",
+			reset_timeout_ms);
+	}
+
+	/*
+	 * After hard reset, we can't poll the BTM_FSM register because the PSOC
+	 * itself is in reset. In either reset we need to wait until the reset
+	 * is deasserted
+	 */
+	msleep(reset_timeout_ms);
+
+	status = RREG32(mmPSOC_GLOBAL_CONF_BTM_FSM);
+	if (status & PSOC_GLOBAL_CONF_BTM_FSM_STATE_MASK)
+		dev_err(hdev->dev,
+			"Timeout while waiting for device to reset 0x%x\n",
+			status);
+
+	if (!hard_reset) {
+		goya->hw_cap_initialized &= ~(HW_CAP_DMA | HW_CAP_MME |
+						HW_CAP_GOLDEN | HW_CAP_TPC);
+		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+				GOYA_ASYNC_EVENT_ID_SOFT_RESET);
+		return;
+	}
+
+	/* Chicken bit to re-initiate boot sequencer flow */
+	WREG32(mmPSOC_GLOBAL_CONF_BOOT_SEQ_RE_START,
+		1 << PSOC_GLOBAL_CONF_BOOT_SEQ_RE_START_IND_SHIFT);
+	/* Move boot manager FSM to pre boot sequencer init state */
+	WREG32(mmPSOC_GLOBAL_CONF_SW_BTM_FSM,
+			0xA << PSOC_GLOBAL_CONF_SW_BTM_FSM_CTRL_SHIFT);
+
+	goya->hw_cap_initialized &= ~(HW_CAP_CPU | HW_CAP_CPU_Q |
+					HW_CAP_DDR_0 | HW_CAP_DDR_1 |
+					HW_CAP_DMA | HW_CAP_MME |
+					HW_CAP_MMU | HW_CAP_TPC_MBIST |
+					HW_CAP_GOLDEN | HW_CAP_TPC);
+
+	if (!hdev->pldm) {
+		int rc;
+		/* In case we are running inside VM and the VM is
+		 * shutting down, we need to make sure CPU boot-loader
+		 * is running before we can continue the VM shutdown.
+		 * That is because the VM will send an FLR signal that
+		 * we must answer
+		 */
+		dev_info(hdev->dev,
+			"Going to wait up to %ds for CPU boot loader\n",
+			GOYA_CPU_TIMEOUT_USEC / 1000 / 1000);
+
+		rc = hl_poll_timeout(
+			hdev,
+			mmPSOC_GLOBAL_CONF_WARM_REBOOT,
+			status,
+			(status == CPU_BOOT_STATUS_DRAM_RDY),
+			10000,
+			GOYA_CPU_TIMEOUT_USEC);
+		if (rc)
+			dev_err(hdev->dev,
+				"failed to wait for CPU boot loader\n");
+	}
+}
+
 int goya_suspend(struct hl_device *hdev)
 {
 	return 0;
@@ -641,6 +2519,8 @@ static const struct hl_asic_funcs goya_funcs = {
 	.early_fini = goya_early_fini,
 	.sw_init = goya_sw_init,
 	.sw_fini = goya_sw_fini,
+	.hw_init = goya_hw_init,
+	.hw_fini = goya_hw_fini,
 	.suspend = goya_suspend,
 	.resume = goya_resume,
 	.mmap = goya_mmap,
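
Note on the firmware copy loops in goya_push_uboot_to_device() and
goya_push_linux_to_device() above: both copy the image in 64-bit words and
finish with one 32-bit write when the size is a multiple of 4 but not of 8.
The following is a minimal user-space sketch of that pattern, not the driver
code itself. The names sketch_write64(), sketch_write32() and
sketch_push_fw() are hypothetical stand-ins (plain memory stores instead of
writeq()/writel() on a BAR). The sketch rounds the size down to a multiple
of 8 rather than subtracting 8, so a size below 8 cannot wrap around.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for writeq()/writel(): plain stores to a buffer. */
static void sketch_write64(uint8_t *dst, uint64_t v)
{
	memcpy(dst, &v, sizeof(v));
}

static void sketch_write32(uint8_t *dst, uint32_t v)
{
	memcpy(dst, &v, sizeof(v));
}

/*
 * Copy 'size' bytes from src to dst using 64-bit writes plus an optional
 * 32-bit tail. Returns 0 on success, -1 if size is not a multiple of 4.
 */
static int sketch_push_fw(uint8_t *dst, const uint8_t *src, size_t size)
{
	size_t body = size & ~(size_t)7;	/* bytes covered by 64-bit writes */
	size_t i;

	if (size % 4 != 0)
		return -1;

	for (i = 0; i < body; i += 8) {
		uint64_t w;

		memcpy(&w, src + i, sizeof(w));
		sketch_write64(dst + i, w);
	}

	/* A 4-byte remainder is left when size is not a multiple of 8. */
	if (size % 8 != 0) {
		uint32_t w;

		memcpy(&w, src + body, sizeof(w));
		sketch_write32(dst + body, w);
	}

	return 0;
}
```

For a 12-byte image this performs one 64-bit write followed by one 32-bit
write and leaves the bytes after the image untouched.
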
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 0e12c56472bd..45a6d2ca2752 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -9,6 +9,7 @@
 #define GOYAP_H_
 
 #include "habanalabs.h"
+#include "include/goya/goya_boot_if.h"
 #include "include/goya/goya.h"
 
 #define NUMBER_OF_CMPLT_QUEUES		5
@@ -122,4 +123,6 @@ struct goya_device {
 	u32		hw_cap_initialized;
 };
 
+void goya_init_security(struct hl_device *hdev);
+
 #endif /* GOYAP_H_ */
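
Note on the protection-bit arithmetic used throughout goya_security.c in
this patch: every register maps to one bit in a 32-bit word located near the
end of its 4KB configuration block. The sketch below only illustrates that
address computation under stated assumptions. SKETCH_PROT_BITS_OFFS is
assumed to be 0xF80 (the real PROT_BITS_OFFS value is not shown in this
excerpt), and struct sketch_pb_loc and sketch_pb_locate() are hypothetical
names. Because goya_pb_set_block() writes 0 to protect a whole block, a
cleared bit appears to mean "protected", which would explain why the driver
writes ~mask.

```c
#include <stdint.h>

/* Assumed value; the patch excerpt does not show PROT_BITS_OFFS. */
#define SKETCH_PROT_BITS_OFFS	0xF80u

struct sketch_pb_loc {
	uint32_t word_addr;	/* address of the 32-bit protection word */
	uint32_t bit;		/* bit index of the register in that word */
};

/* Locate the protection bit that corresponds to register offset 'reg'. */
static struct sketch_pb_loc sketch_pb_locate(uint32_t reg)
{
	struct sketch_pb_loc loc;
	/* Protection words sit at a fixed offset inside the 4KB block. */
	uint32_t pb_addr = (reg & ~0xFFFu) + SKETCH_PROT_BITS_OFFS;
	/* One 32-bit word covers 128 bytes, i.e. 32 registers. */
	uint32_t word_offset = ((reg & SKETCH_PROT_BITS_OFFS) >> 7) << 2;

	loc.word_addr = pb_addr + word_offset;
	loc.bit = (reg & 0x7Fu) >> 2;
	return loc;
}
```

With the assumed offset, register 0xDFF60 (the temporary value given to
mmMME_SBB_POWER_ECO1 in this patch) lands in the word at 0xDFFF8, bit 24.
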
diff --git a/drivers/misc/habanalabs/goya/goya_security.c b/drivers/misc/habanalabs/goya/goya_security.c
new file mode 100644
index 000000000000..99ad9aacf49e
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goya_security.c
@@ -0,0 +1,2999 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "goyaP.h"
+
+/**
+ * goya_pb_set_block - set the given block as protected
+ *
+ * @hdev: pointer to hl_device structure
+ * @base: block base address
+ *
+ */
+static void goya_pb_set_block(struct hl_device *hdev, u64 base)
+{
+	u32 pb_addr = base - CFG_BASE + PROT_BITS_OFFS;
+
+	while (pb_addr & 0xFFF) {
+		WREG32(pb_addr, 0);
+		pb_addr += 4;
+	}
+}
+
+static void goya_init_mme_protection_bits(struct hl_device *hdev)
+{
+	u32 pb_addr, mask;
+	u8 word_offset;
+
+	/* TODO: change to real reg name when Soc Online is updated */
+	u64 mmMME_SBB_POWER_ECO1 = 0xDFF60,
+		mmMME_SBB_POWER_ECO2 = 0xDFF64;
+
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_0_BASE);
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_1_BASE);
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_2_BASE);
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_3_BASE);
+
+	goya_pb_set_block(hdev, mmSBA_ECC_MEM_BASE);
+	goya_pb_set_block(hdev, mmSBB_ECC_MEM_BASE);
+
+	goya_pb_set_block(hdev, mmMME1_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME1_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME1_WR_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME2_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME2_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME2_WR_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME3_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME3_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME3_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmMME4_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME4_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME4_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmMME5_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME5_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME5_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmMME6_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME6_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME6_WR_REGULATOR_BASE);
+
+	pb_addr = (mmMME_DUMMY & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_DUMMY & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_DUMMY & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_RESET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_STALL & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_ADD & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_DATA_WR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_DATA_RD & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_CTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_RC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_LOG_SHADOW & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_STORE_MAX_CREDIT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_STORE_MAX_CREDIT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_STORE_MAX_CREDIT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_AGU & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBB & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_WBC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBA_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBB_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBC_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_WBC_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_TE & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_TE2DEC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_REI_STATUS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_REI_MASK & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SEI_STATUS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SEI_MASK & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SPI_STATUS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SPI_MASK & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_CP_STS & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_CP_STS & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_CURRENT_INST_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BUF_RDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_CQ_IFIFO_CNT &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmMME_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_SBB_POWER_ECO1 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_SBB_POWER_ECO1 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_SBB_POWER_ECO1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBB_POWER_ECO2 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+}
+
+static void goya_init_dma_protection_bits(struct hl_device *hdev)
+{
+	u32 pb_addr, mask;
+	u8 word_offset;
+
+	goya_pb_set_block(hdev, mmDMA_NRTR_BASE);
+	goya_pb_set_block(hdev, mmDMA_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmDMA_WR_REGULATOR_BASE);
+
+	pb_addr = (mmDMA_QM_0_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_0_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_0_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_0_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_0_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_0_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_0_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_0_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_0_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_0_BASE);
+
+	pb_addr = (mmDMA_QM_1_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_1_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_1_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_1_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_1_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_1_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_1_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_1_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_1_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_1_BASE);
+
+	pb_addr = (mmDMA_QM_2_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_2_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_2_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_2_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_2_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_2_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_2_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_2_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_2_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_2_BASE);
+
+	pb_addr = (mmDMA_QM_3_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_3_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_3_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_3_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_3_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_3_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_3_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_3_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_3_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_3_BASE);
+
+	pb_addr = (mmDMA_QM_4_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_4_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_4_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_4_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_4_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_4_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_4_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_4_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_4_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_4_BASE);
+}
+
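+/*
+ * goya_init_tpc_protection_bits - set the protection bits of the TPC registers
+ *
+ * For each group of registers handled below:
+ * - pb_addr is the 4KB-aligned base of the block that holds the group, plus
+ *   PROT_BITS_OFFS.
+ * - word_offset is the byte offset of the 32-bit protection word that covers
+ *   the group's 128-byte window (>> 7 picks the window, << 2 converts the
+ *   window index to a byte offset).
+ * - mask has one bit per 32-bit register inside that window
+ *   ((reg & 0x7F) >> 2 is the register's index in the window).
+ *
+ * Writing ~mask clears the bits of the listed registers and leaves the other
+ * bits of that protection word set.
+ */
+static void goya_init_tpc_protection_bits(struct hl_device *hdev)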
+{
+	u32 pb_addr, mask;
+	u8 word_offset;
+
+	goya_pb_set_block(hdev, mmTPC0_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC0_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC0_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC0_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC1_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC1_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC1_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC1_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CFG_FUNC_MBIST_CNTRL & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC1_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC1_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC2_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC2_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC2_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC2_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CFG_FUNC_MBIST_CNTRL & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC2_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC2_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC3_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC3_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC3_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC3_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC3_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC4_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC4_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC4_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC4_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC4_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC5_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC5_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC5_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC5_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC5_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC6_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC6_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC6_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC6_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC6_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC7_NRTR_BASE);
+	goya_pb_set_block(hdev, mmTPC7_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC7_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC7_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC7_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+}
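
The blocks above all repeat one pattern: OR together one bit per register that shares a protection word, then write the inverted mask. A minimal sketch of that pattern as a table-driven helper follows. The name pb_word_value and the addresses in the usage note are hypothetical and not part of this patch; the helper only computes the value that the patch passes to WREG32. It uses an unsigned 1u so that shifting into bit 31 is well defined in standard C.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Build the value for one 32-bit protection word: start from "all ones"
 * (nothing protected) and clear the bit of every listed register.
 *
 * Returns 0 and stores the value in *out on success, or -1 if the list is
 * empty or the registers do not all fall in the same 128-byte window
 * (registers in different windows map to different protection words).
 */
static int pb_word_value(const uint32_t *regs, size_t count, uint32_t *out)
{
	uint32_t mask = 0;
	size_t i;

	if (count == 0)
		return -1;

	for (i = 0; i < count; i++) {
		/* Address bits 7 and up select the word; they must all agree. */
		if ((regs[i] & ~0x7Fu) != (regs[0] & ~0x7Fu))
			return -1;

		/* Address bits 2-6 select the bit inside the word. */
		mask |= 1u << ((regs[i] & 0x7Fu) >> 2);
	}

	*out = ~mask;
	return 0;
}
```

For the made-up addresses 0x1000, 0x1004 and 0x107C the helper clears bits 0, 1 and 31 and yields 0x7FFFFFFC. Adding 0x1080 to the list makes it return -1, because that register belongs to the next protection word.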
+
+/**
+ * goya_init_protection_bits - Initialize protection bits for specific registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * All protection bits are 1 by default, meaning the register is not protected.
+ * Each bit that belongs to a protected register must be cleared to 0.
+ *
+ */
+static void goya_init_protection_bits(struct hl_device *hdev)
+{
+	/*
+	 * In each 4K block of registers, the last 128 bytes are protection
+	 * bits - total of 1024 bits, one for each register. Each bit is related
+	 * to a specific register, by the order of the registers.
+	 * So in order to calculate the bit that is related to a given register,
+	 * we need to calculate its word offset and then the exact bit inside
+	 * the word (which is 4 bytes).
+	 *
+	 * Register address:
+	 *
+	 * 31                 12 11           7   6             2  1      0
+	 * -----------------------------------------------------------------
+	 * |      Don't         |    word       |  bit location  |    0    |
+	 * |      care          |   offset      |  inside word   |         |
+	 * -----------------------------------------------------------------
+	 *
+	 * Bits 7-11 represent the word offset inside the 128 bytes.
+	 * Bits 2-6 represent the bit location inside the word.
+	 */
+
+	goya_pb_set_block(hdev, mmPCI_NRTR_BASE);
+	goya_pb_set_block(hdev, mmPCI_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmPCI_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y0_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y1_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y2_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y3_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y4_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y5_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmPCIE_WRAP_BASE);
+	goya_pb_set_block(hdev, mmPCIE_CORE_BASE);
+	goya_pb_set_block(hdev, mmPCIE_DB_CFG_BASE);
+	goya_pb_set_block(hdev, mmPCIE_DB_CMD_BASE);
+	goya_pb_set_block(hdev, mmPCIE_AUX_BASE);
+	goya_pb_set_block(hdev, mmPCIE_DB_RSV_BASE);
+	goya_pb_set_block(hdev, mmPCIE_PHY_BASE);
+
+	goya_init_mme_protection_bits(hdev);
+
+	goya_init_dma_protection_bits(hdev);
+
+	goya_init_tpc_protection_bits(hdev);
+}
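
The address layout described in the comment of goya_init_protection_bits can be checked with a small standalone sketch. It applies the same three expressions the patch uses for pb_addr, word_offset and the bit index. PROT_BITS_OFFS is assumed here to be 0xF80 (the last 128 bytes of a 4K block); its definition is not shown in this part of the patch. The struct, the function name pb_locate and the example address are hypothetical.

```c
#include <stdint.h>

/* Assumed value: offset of the last 128 bytes in a 4K block (0x1000 - 0x80). */
#define PROT_BITS_OFFS 0xF80u

/* Hypothetical container for the three values the patch computes. */
struct pb_location {
	uint32_t pb_addr;     /* start of the protection bits of the 4K block */
	uint32_t word_offset; /* byte offset of the 32-bit word holding the bit */
	uint32_t bit;         /* bit index (0-31) inside that word */
};

static struct pb_location pb_locate(uint32_t reg_addr)
{
	struct pb_location loc;

	/* Bits 12-31 select the 4K block; the bits live at its tail. */
	loc.pb_addr = (reg_addr & ~0xFFFu) + PROT_BITS_OFFS;

	/* Bits 7-11 give the word index; << 2 converts it to a byte offset. */
	loc.word_offset = ((reg_addr & PROT_BITS_OFFS) >> 7) << 2;

	/* Bits 2-6 give the bit inside the word (one bit per 4-byte register). */
	loc.bit = (reg_addr & 0x7Fu) >> 2;

	return loc;
}
```

For the made-up register address 0xFC123A54 this gives pb_addr 0xFC123F80, word_offset 0x50 and bit 21. The register is number 0xA54 / 4 = 661 in its block, and 661 = 20 * 32 + 21, so it lands in word 20 (byte offset 0x50), bit 21.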
+
+/**
+ * goya_init_security - Initialize security model
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Initialize the security model of the device. That includes the range
+ * registers and the per-register protection bits.
+ *
+ */
+void goya_init_security(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	u32 dram_addr_lo = lower_32_bits(DRAM_PHYS_BASE);
+	u32 dram_addr_hi = upper_32_bits(DRAM_PHYS_BASE);
+
+	u32 lbw_rng0_base = 0xFC440000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng0_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng1_base = 0xFC480000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng1_mask = 0xFFF80000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng2_base = 0xFC600000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng2_mask = 0xFFE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng3_base = 0xFC800000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng3_mask = 0xFFF00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng4_base = 0xFCC02000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng4_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng5_base = 0xFCC40000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng5_mask = 0xFFFF8000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng6_base = 0xFCC48000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng6_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng7_base = 0xFCC4A000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng7_mask = 0xFFFFE000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng8_base = 0xFCC4C000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng8_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng9_base = 0xFCC50000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng9_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng10_base = 0xFCC60000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng10_mask = 0xFFFE0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng11_base = 0xFCE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng11_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng12_base = 0xFE484000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng12_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng13_base = 0xFEC43000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng13_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	WREG32(mmDMA_MACRO_LBW_RANGE_HIT_BLOCK, 0xFFFF);
+	WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFF);
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU)) {
+		WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFE);
+
+		/* Protect HOST */
+		WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_0, 0);
+		WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_0, 0);
+		WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_0, 0);
+		WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_0, 0xFFF80);
+	}
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_1, dram_addr_lo);
+	WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_1, dram_addr_hi);
+	WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_1, 0xE0000000);
+	WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_1, 0x3FFFF);
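
The "first 512MB" statement can be verified from the mask values alone. The sketch below assumes that a range register hits when every address bit selected by the mask equals the corresponding base bit; that matching rule is an assumption about the hardware and is not stated in this patch. The 50-bit address width is taken from the register names (bits 49:32 and 31:0). All function names and the base address in the usage note are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the two 32-bit register halves into one 64-bit value. */
static uint64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Assumed rule: hit when all bits selected by the mask match the base. */
static bool range_hit(uint64_t addr, uint64_t base, uint64_t mask)
{
	return (addr & mask) == (base & mask);
}

/*
 * Size of the window covered by a mask whose set bits are contiguous from
 * bit 49 downwards: the cleared low bits are free to take any value.
 */
static uint64_t range_size(uint64_t mask)
{
	const uint64_t addr_bits = (1ULL << 50) - 1;

	return (~mask & addr_bits) + 1;
}
```

With the values written above, join_hi_lo(0x3FFFF, 0xE0000000) is 0x3FFFFE0000000, which leaves address bits 28:0 unmasked, so range_size() returns 0x20000000 (512MB). With a made-up, 512MB-aligned base of 0x20000000, address 0x3FFFFFFF hits and address 0x40000000 does not.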
+
+	/* Protect registers */
+
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME1_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME2_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME3_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME4_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME5_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME6_RTR_LBW_RANGE_HIT, 0xFFFF);
+
+	WREG32(mmMME1_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME2_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME3_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME4_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME5_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME6_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC0_NRTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC1_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC1_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC2_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC2_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC3_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC3_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC4_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC4_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC5_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC5_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC6_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC6_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC7_NRTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	goya_init_protection_bits(hdev);
+}
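
Note on the range programming above: every MME and TPC router receives the
same DDR window and the same 14 LBW base/mask pairs, so the sequence can also
be expressed as a loop over a table of routers. The following is only a
minimal sketch of that idea and is not part of this patch. Everything in it
is hypothetical: mock_wreg32() stands in for WREG32(), the per-router
register layout (RTR_BLOCK_SIZE, HBW_*, LBW_BASE(), LBW_MASK()) is invented
for illustration and does not reflect the real mm* offsets, and
hbw_window_bytes() assumes that a set mask bit means "this address bit is
compared".

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LBW_RANGES	14
#define NUM_ROUTERS	14	/* MME1..6 plus TPC0..7, as in the patch */

/* Hypothetical layout of one router block (NOT the real Goya map) */
#define RTR_BLOCK_SIZE	0x100
#define HBW_BASE_L_1	0x00
#define HBW_BASE_H_1	0x04
#define HBW_MASK_L_1	0x08
#define HBW_MASK_H_1	0x0C
#define LBW_BASE(i)	(0x10 + 4 * (i))
#define LBW_MASK(i)	(0x50 + 4 * (i))

struct lbw_range {
	uint32_t base;
	uint32_t mask;
};

/* Mock register file standing in for the device MMIO space */
uint32_t mock_regs[NUM_ROUTERS * RTR_BLOCK_SIZE / 4];

/* Mock replacement for WREG32(); 'off' is a byte offset */
void mock_wreg32(uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 &&
	       off / 4 < sizeof(mock_regs) / sizeof(mock_regs[0]));
	mock_regs[off / 4] = val;
}

/* Program the DDR window (HBW range 1) and all LBW ranges of one router */
void program_router(uint32_t blk, uint64_t dram_addr,
		    const struct lbw_range *lbw)
{
	int i;

	mock_wreg32(blk + HBW_BASE_L_1, (uint32_t)dram_addr);
	mock_wreg32(blk + HBW_BASE_H_1, (uint32_t)(dram_addr >> 32));
	mock_wreg32(blk + HBW_MASK_L_1, 0xE0000000);
	mock_wreg32(blk + HBW_MASK_H_1, 0x3FFFF);

	for (i = 0; i < NUM_LBW_RANGES; i++) {
		mock_wreg32(blk + LBW_BASE(i), lbw[i].base);
		mock_wreg32(blk + LBW_MASK(i), lbw[i].mask);
	}
}

/* Apply the same settings to every router block */
void program_all_routers(uint64_t dram_addr, const struct lbw_range *lbw)
{
	int r;

	for (r = 0; r < NUM_ROUTERS; r++)
		program_router(r * RTR_BLOCK_SIZE, dram_addr, lbw);
}

/*
 * Window size implied by a base/mask pair: the lowest set mask bit.
 * Assumes set mask bits are the compared address bits.
 */
uint64_t hbw_window_bytes(uint32_t mask_lo, uint32_t mask_hi)
{
	uint64_t mask = ((uint64_t)mask_hi << 32) | mask_lo;

	return mask & (~mask + 1);
}
```

Under that assumption, hbw_window_bytes(0xE0000000, 0x3FFFF) evaluates to
0x20000000, which agrees with the "first 512MB" comments above. Whether a
table is preferable to the explicit writes depends on how regular the real
register offsets are.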
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 6ad476df65b0..adda281ec2af 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,6 +23,8 @@
 
 #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
 
+#define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
+
 #define HL_MAX_QUEUES			128
 
 struct hl_device;
@@ -32,6 +34,8 @@ struct hl_fpriv;
 
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
+ * @uboot_ver: F/W U-boot version.
+ * @preboot_ver: F/W Preboot version.
  * @sram_base_address: SRAM physical start address.
  * @sram_end_address: SRAM physical end address.
  * @sram_user_base_address - SRAM physical start address for user access.
@@ -60,6 +64,8 @@ struct hl_fpriv;
  * @tpc_enabled_mask: which TPCs are enabled.
  */
 struct asic_fixed_properties {
+	char			uboot_ver[VERSION_MAX_LEN];
+	char			preboot_ver[VERSION_MAX_LEN];
 	u64			sram_base_address;
 	u64			sram_end_address;
 	u64			sram_user_base_address;
@@ -168,6 +174,8 @@ enum hl_asic_type {
  * @early_fini: tears down what was done in early_init.
  * @sw_init: sets up driver state, does not configure H/W.
  * @sw_fini: tears down driver state, does not configure H/W.
+ * @hw_init: sets up the H/W state.
+ * @hw_fini: tears down the H/W state.
  * @suspend: handles IP specific H/W or SW changes for suspend.
  * @resume: handles IP specific H/W or SW changes for resume.
  * @mmap: mmap function, does nothing.
@@ -180,6 +188,8 @@ struct hl_asic_funcs {
 	int (*early_fini)(struct hl_device *hdev);
 	int (*sw_init)(struct hl_device *hdev);
 	int (*sw_fini)(struct hl_device *hdev);
+	int (*hw_init)(struct hl_device *hdev);
+	void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
 	int (*suspend)(struct hl_device *hdev);
 	int (*resume)(struct hl_device *hdev);
 	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
@@ -312,6 +322,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
  * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
+ * @spl_fw: image to load to ArmCP.
  * @asid_bitmap: holds used/available ASIDs.
  * @asid_mutex: protects asid_bitmap.
  * @device_open: lock for sanity checks upon FD open.
@@ -340,6 +351,7 @@ struct hl_device {
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
 	struct gen_pool			*cpu_accessible_dma_pool;
+	const struct firmware		*spl_fw;
 	unsigned long			*asid_bitmap;
 	struct mutex			asid_mutex;
 	/* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
@@ -359,7 +371,11 @@ struct hl_device {
 	u8				disabled;
 
 	/* Parameters for bring-up */
+	u8				cpu_enable;
 	u8				reset_pcilink;
+	u8				config_pll;
+	u8				fw_loading;
+	u8				pldm;
 };
 
 /*
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 5c312dd3aa50..bd80683118d3 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -181,7 +181,15 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->major = hl_major;
 
 	/* Parameters for bring-up - set them to defaults */
+	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
+	hdev->config_pll = 0;
+	hdev->fw_loading = 1;
+	hdev->pldm = 0;
+
+	/* If CPU is disabled, no point in loading FW */
+	if (!hdev->cpu_enable)
+		hdev->fw_loading = 0;
 
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
index 192a1450cbb1..2d0efb7b44bb 100644
--- a/drivers/misc/habanalabs/include/goya/goya.h
+++ b/drivers/misc/habanalabs/include/goya/goya.h
@@ -11,6 +11,7 @@
 #define GOYA_H
 
 #include "asic_reg/goya_regs.h"
+#include "goya_async_events.h"
 
 #include <linux/types.h>
 
diff --git a/drivers/misc/habanalabs/include/goya/goya_async_events.h b/drivers/misc/habanalabs/include/goya/goya_async_events.h
new file mode 100644
index 000000000000..497937a17ee9
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya_async_events.h
@@ -0,0 +1,186 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef __GOYA_ASYNC_EVENTS_H_
+#define __GOYA_ASYNC_EVENTS_H_
+
+enum goya_async_event_id {
+	GOYA_ASYNC_EVENT_ID_PCIE_IF = 33,
+	GOYA_ASYNC_EVENT_ID_TPC0_ECC = 36,
+	GOYA_ASYNC_EVENT_ID_TPC1_ECC = 39,
+	GOYA_ASYNC_EVENT_ID_TPC2_ECC = 42,
+	GOYA_ASYNC_EVENT_ID_TPC3_ECC = 45,
+	GOYA_ASYNC_EVENT_ID_TPC4_ECC = 48,
+	GOYA_ASYNC_EVENT_ID_TPC5_ECC = 51,
+	GOYA_ASYNC_EVENT_ID_TPC6_ECC = 54,
+	GOYA_ASYNC_EVENT_ID_TPC7_ECC = 57,
+	GOYA_ASYNC_EVENT_ID_MME_ECC = 60,
+	GOYA_ASYNC_EVENT_ID_MME_ECC_EXT = 61,
+	GOYA_ASYNC_EVENT_ID_MMU_ECC = 63,
+	GOYA_ASYNC_EVENT_ID_DMA_MACRO = 64,
+	GOYA_ASYNC_EVENT_ID_DMA_ECC = 66,
+	GOYA_ASYNC_EVENT_ID_CPU_IF_ECC = 75,
+	GOYA_ASYNC_EVENT_ID_PSOC_MEM = 78,
+	GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT = 79,
+	GOYA_ASYNC_EVENT_ID_SRAM0 = 81,
+	GOYA_ASYNC_EVENT_ID_SRAM1 = 82,
+	GOYA_ASYNC_EVENT_ID_SRAM2 = 83,
+	GOYA_ASYNC_EVENT_ID_SRAM3 = 84,
+	GOYA_ASYNC_EVENT_ID_SRAM4 = 85,
+	GOYA_ASYNC_EVENT_ID_SRAM5 = 86,
+	GOYA_ASYNC_EVENT_ID_SRAM6 = 87,
+	GOYA_ASYNC_EVENT_ID_SRAM7 = 88,
+	GOYA_ASYNC_EVENT_ID_SRAM8 = 89,
+	GOYA_ASYNC_EVENT_ID_SRAM9 = 90,
+	GOYA_ASYNC_EVENT_ID_SRAM10 = 91,
+	GOYA_ASYNC_EVENT_ID_SRAM11 = 92,
+	GOYA_ASYNC_EVENT_ID_SRAM12 = 93,
+	GOYA_ASYNC_EVENT_ID_SRAM13 = 94,
+	GOYA_ASYNC_EVENT_ID_SRAM14 = 95,
+	GOYA_ASYNC_EVENT_ID_SRAM15 = 96,
+	GOYA_ASYNC_EVENT_ID_SRAM16 = 97,
+	GOYA_ASYNC_EVENT_ID_SRAM17 = 98,
+	GOYA_ASYNC_EVENT_ID_SRAM18 = 99,
+	GOYA_ASYNC_EVENT_ID_SRAM19 = 100,
+	GOYA_ASYNC_EVENT_ID_SRAM20 = 101,
+	GOYA_ASYNC_EVENT_ID_SRAM21 = 102,
+	GOYA_ASYNC_EVENT_ID_SRAM22 = 103,
+	GOYA_ASYNC_EVENT_ID_SRAM23 = 104,
+	GOYA_ASYNC_EVENT_ID_SRAM24 = 105,
+	GOYA_ASYNC_EVENT_ID_SRAM25 = 106,
+	GOYA_ASYNC_EVENT_ID_SRAM26 = 107,
+	GOYA_ASYNC_EVENT_ID_SRAM27 = 108,
+	GOYA_ASYNC_EVENT_ID_SRAM28 = 109,
+	GOYA_ASYNC_EVENT_ID_SRAM29 = 110,
+	GOYA_ASYNC_EVENT_ID_GIC500 = 112,
+	GOYA_ASYNC_EVENT_ID_PCIE_DEC = 115,
+	GOYA_ASYNC_EVENT_ID_TPC0_DEC = 117,
+	GOYA_ASYNC_EVENT_ID_TPC1_DEC = 120,
+	GOYA_ASYNC_EVENT_ID_TPC2_DEC = 123,
+	GOYA_ASYNC_EVENT_ID_TPC3_DEC = 126,
+	GOYA_ASYNC_EVENT_ID_TPC4_DEC = 129,
+	GOYA_ASYNC_EVENT_ID_TPC5_DEC = 132,
+	GOYA_ASYNC_EVENT_ID_TPC6_DEC = 135,
+	GOYA_ASYNC_EVENT_ID_TPC7_DEC = 138,
+	GOYA_ASYNC_EVENT_ID_AXI_ECC = 139,
+	GOYA_ASYNC_EVENT_ID_L2_RAM_ECC = 140,
+	GOYA_ASYNC_EVENT_ID_MME_WACS = 141,
+	GOYA_ASYNC_EVENT_ID_MME_WACSD = 142,
+	GOYA_ASYNC_EVENT_ID_PLL0 = 143,
+	GOYA_ASYNC_EVENT_ID_PLL1 = 144,
+	GOYA_ASYNC_EVENT_ID_PLL3 = 146,
+	GOYA_ASYNC_EVENT_ID_PLL4 = 147,
+	GOYA_ASYNC_EVENT_ID_PLL5 = 148,
+	GOYA_ASYNC_EVENT_ID_PLL6 = 149,
+	GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER = 155,
+	GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC = 159,
+	GOYA_ASYNC_EVENT_ID_PSOC = 160,
+	GOYA_ASYNC_EVENT_ID_PCIE_FLR = 171,
+	GOYA_ASYNC_EVENT_ID_PCIE_HOT_RESET = 172,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG0 = 174,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG1 = 175,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG2 = 176,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG3 = 177,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG0 = 178,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG1 = 179,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG2 = 180,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG3 = 181,
+	GOYA_ASYNC_EVENT_ID_PCIE_APB = 182,
+	GOYA_ASYNC_EVENT_ID_PCIE_QDB = 183,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_D_P_WR = 184,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_D_RD = 185,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_U_P_WR = 186,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_U_RD = 187,
+	GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU = 190,
+	GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR = 191,
+	GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU = 200,
+	GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR = 201,
+	GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU = 210,
+	GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR = 211,
+	GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU = 220,
+	GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR = 221,
+	GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU = 230,
+	GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR = 231,
+	GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU = 240,
+	GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR = 241,
+	GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU = 250,
+	GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR = 251,
+	GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU = 260,
+	GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR = 261,
+	GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU0 = 270,
+	GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU1 = 271,
+	GOYA_ASYNC_EVENT_ID_MME_WACS_UP = 272,
+	GOYA_ASYNC_EVENT_ID_MME_WACS_DOWN = 273,
+	GOYA_ASYNC_EVENT_ID_MMU_PAGE_FAULT = 280,
+	GOYA_ASYNC_EVENT_ID_MMU_WR_PERM = 281,
+	GOYA_ASYNC_EVENT_ID_MMU_DBG_BM = 282,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH0 = 290,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH1 = 291,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH2 = 292,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH3 = 293,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH4 = 294,
+	GOYA_ASYNC_EVENT_ID_DDR0_PHY_DFI = 300,
+	GOYA_ASYNC_EVENT_ID_DDR0_ECC_SCRUB = 301,
+	GOYA_ASYNC_EVENT_ID_DDR0_DB_ECC = 302,
+	GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC = 303,
+	GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC_MC = 304,
+	GOYA_ASYNC_EVENT_ID_DDR0_AXI_RD = 305,
+	GOYA_ASYNC_EVENT_ID_DDR0_AXI_WR = 306,
+	GOYA_ASYNC_EVENT_ID_DDR1_PHY_DFI = 310,
+	GOYA_ASYNC_EVENT_ID_DDR1_ECC_SCRUB = 311,
+	GOYA_ASYNC_EVENT_ID_DDR1_DB_ECC = 312,
+	GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC = 313,
+	GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC_MC = 314,
+	GOYA_ASYNC_EVENT_ID_DDR1_AXI_RD = 315,
+	GOYA_ASYNC_EVENT_ID_DDR1_AXI_WR = 316,
+	GOYA_ASYNC_EVENT_ID_CPU_BMON = 320,
+	GOYA_ASYNC_EVENT_ID_TS_EAST = 322,
+	GOYA_ASYNC_EVENT_ID_TS_WEST = 323,
+	GOYA_ASYNC_EVENT_ID_TS_NORTH = 324,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_0 = 330,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_1 = 331,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_2 = 332,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET = 356,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT = 361,
+	GOYA_ASYNC_EVENT_ID_TPC0_CMDQ = 430,
+	GOYA_ASYNC_EVENT_ID_TPC1_CMDQ = 431,
+	GOYA_ASYNC_EVENT_ID_TPC2_CMDQ = 432,
+	GOYA_ASYNC_EVENT_ID_TPC3_CMDQ = 433,
+	GOYA_ASYNC_EVENT_ID_TPC4_CMDQ = 434,
+	GOYA_ASYNC_EVENT_ID_TPC5_CMDQ = 435,
+	GOYA_ASYNC_EVENT_ID_TPC6_CMDQ = 436,
+	GOYA_ASYNC_EVENT_ID_TPC7_CMDQ = 437,
+	GOYA_ASYNC_EVENT_ID_TPC0_QM = 438,
+	GOYA_ASYNC_EVENT_ID_TPC1_QM = 439,
+	GOYA_ASYNC_EVENT_ID_TPC2_QM = 440,
+	GOYA_ASYNC_EVENT_ID_TPC3_QM = 441,
+	GOYA_ASYNC_EVENT_ID_TPC4_QM = 442,
+	GOYA_ASYNC_EVENT_ID_TPC5_QM = 443,
+	GOYA_ASYNC_EVENT_ID_TPC6_QM = 444,
+	GOYA_ASYNC_EVENT_ID_TPC7_QM = 445,
+	GOYA_ASYNC_EVENT_ID_MME_QM = 447,
+	GOYA_ASYNC_EVENT_ID_MME_CMDQ = 448,
+	GOYA_ASYNC_EVENT_ID_DMA0_QM = 449,
+	GOYA_ASYNC_EVENT_ID_DMA1_QM = 450,
+	GOYA_ASYNC_EVENT_ID_DMA2_QM = 451,
+	GOYA_ASYNC_EVENT_ID_DMA3_QM = 452,
+	GOYA_ASYNC_EVENT_ID_DMA4_QM = 453,
+	GOYA_ASYNC_EVENT_ID_DMA_ON_HBW = 454,
+	GOYA_ASYNC_EVENT_ID_DMA0_CH = 455,
+	GOYA_ASYNC_EVENT_ID_DMA1_CH = 456,
+	GOYA_ASYNC_EVENT_ID_DMA2_CH = 457,
+	GOYA_ASYNC_EVENT_ID_DMA3_CH = 458,
+	GOYA_ASYNC_EVENT_ID_DMA4_CH = 459,
+	GOYA_ASYNC_EVENT_ID_PI_UPDATE = 484,
+	GOYA_ASYNC_EVENT_ID_HALT_MACHINE = 485,
+	GOYA_ASYNC_EVENT_ID_INTS_REGISTER = 486,
+	GOYA_ASYNC_EVENT_ID_SOFT_RESET = 487,
+	GOYA_ASYNC_EVENT_ID_LAST_VALID_ID = 1023,
+	GOYA_ASYNC_EVENT_ID_SIZE
+};
+
+#endif /* __GOYA_ASYNC_EVENTS_H_ */
diff --git a/drivers/misc/habanalabs/include/goya/goya_boot_if.h b/drivers/misc/habanalabs/include/goya/goya_boot_if.h
new file mode 100644
index 000000000000..2e39578ec795
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya_boot_if.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ * Author: Oded Gabbay <oded.gabbay@gmail.com>
+ *
+ */
+
+#ifndef GOYA_BOOT_IF_H
+#define GOYA_BOOT_IF_H
+
+enum cpu_boot_status {
+	CPU_BOOT_STATUS_NA = 0,		/* Default value after reset of chip */
+	CPU_BOOT_STATUS_IN_WFE,
+	CPU_BOOT_STATUS_DRAM_RDY,
+	CPU_BOOT_STATUS_SRAM_AVAIL,
+	CPU_BOOT_STATUS_IN_BTL,		/* BTL is H/W FSM */
+	CPU_BOOT_STATUS_IN_PREBOOT,
+	CPU_BOOT_STATUS_IN_SPL,
+	CPU_BOOT_STATUS_IN_UBOOT,
+	CPU_BOOT_STATUS_DRAM_INIT_FAIL,
+	CPU_BOOT_STATUS_FIT_CORRUPTED
+};
+
+enum kmd_msg {
+	KMD_MSG_NA = 0,
+	KMD_MSG_GOTO_WFE,
+	KMD_MSG_FIT_RDY
+};
+
+#endif /* GOYA_BOOT_IF_H */
-- 
2.17.1


* [PATCH 07/15] habanalabs: add h/w queues module
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (4 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 06/15] habanalabs: add basic Goya h/w initialization Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-25  7:50   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 08/15] habanalabs: add event queue and interrupts Oded Gabbay
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds the H/W queues module and the code to initialize Goya's
various compute and DMA engines and their queues.

Goya has 5 DMA channels, 8 TPC engines and a single MME engine. Each
channel/engine has H/W queue logic, called QMAN, which is used to pass
commands from the user to the H/W.

There are two types of QMANs: external and internal. The DMA QMANs are
considered external while the TPC and MME QMANs are considered internal.
For each external queue there is a completion queue, which is located in
Host memory.
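
As a rough illustration of the split described above, the following
standalone C sketch classifies engines by QMAN type and derives the
number of completion queues. The enum and function names are hypothetical
and do not appear in the driver; only the engine counts and the
external/internal split are taken from this description.

```c
#include <assert.h>

/* Illustrative names only; they are not taken from the driver. */
enum engine_kind { ENGINE_DMA, ENGINE_TPC, ENGINE_MME };
enum qman_kind { QMAN_EXTERNAL, QMAN_INTERNAL };

/* Engine counts as stated in the description above. */
enum { NUM_DMA = 5, NUM_TPC = 8, NUM_MME = 1 };

/* DMA QMANs are external; TPC and MME QMANs are internal. */
static enum qman_kind qman_kind_of(enum engine_kind e)
{
	assert(e == ENGINE_DMA || e == ENGINE_TPC || e == ENGINE_MME);
	return e == ENGINE_DMA ? QMAN_EXTERNAL : QMAN_INTERNAL;
}

/* One completion queue per external (DMA) queue. */
static int completion_queue_count(void)
{
	return NUM_DMA;
}

/* One QMAN per channel/engine. */
static int total_qman_count(void)
{
	return NUM_DMA + NUM_TPC + NUM_MME;
}
```

Under these assumptions the model has 14 QMANs in total, 5 of which are
backed by a completion queue.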

The differences between external and internal QMANs are:

1. The location of the queue's memory. The queues of external QMANs are
   located in Host memory, while the queues of internal QMANs are
   located in on-chip memory.

2. An external QMAN writes an entry to a completion queue and sends an
   MSI-X interrupt upon completion of a command buffer that was given to
   it. An internal QMAN does neither.
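
The two differences listed above can be sketched as a small standalone
model. This is only an illustration under stated assumptions: struct
qman_model, enum queue_mem and both functions are hypothetical names and
are not part of the driver, and the counters merely stand in for the
completion queue entry and the MSI-X interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these names come from the driver. */
enum queue_mem { QUEUE_MEM_HOST, QUEUE_MEM_ON_CHIP };

struct qman_model {
	bool external;			/* true for DMA QMANs */
	unsigned int cq_entries;	/* entries written to the completion queue */
	unsigned int msix_raised;	/* MSI-X interrupts sent */
};

/* Difference 1: where the queue's memory is located. */
static enum queue_mem qman_queue_mem(const struct qman_model *q)
{
	assert(q != NULL);
	return q->external ? QUEUE_MEM_HOST : QUEUE_MEM_ON_CHIP;
}

/* Difference 2: what happens when a command buffer completes. */
static void qman_complete_cb(struct qman_model *q)
{
	assert(q != NULL);
	if (!q->external)
		return;	/* internal QMAN: no CQ entry, no interrupt */
	q->cq_entries++;
	q->msix_raised++;
}
```

Calling qman_complete_cb() once on an external and once on an internal
model leaves the external one with a single completion entry and a single
interrupt, while the internal one stays at zero for both.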

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile              |    2 +-
 drivers/misc/habanalabs/device.c              |   74 +-
 drivers/misc/habanalabs/goya/goya.c           | 1518 +++++++++++++++--
 drivers/misc/habanalabs/goya/goyaP.h          |    6 +
 drivers/misc/habanalabs/habanalabs.h          |  176 +-
 drivers/misc/habanalabs/habanalabs_drv.c      |    6 +
 drivers/misc/habanalabs/hw_queue.c            |  404 +++++
 .../habanalabs/include/goya/goya_packets.h    |  234 +++
 .../habanalabs/include/habanalabs_device_if.h |  272 +++
 drivers/misc/habanalabs/irq.c                 |  150 ++
 10 files changed, 2721 insertions(+), 121 deletions(-)
 create mode 100644 drivers/misc/habanalabs/hw_queue.c
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
 create mode 100644 drivers/misc/habanalabs/irq.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 2530c9b78ca4..c07f3ccb57dc 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,7 +5,7 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
-		command_buffer.o
+		command_buffer.o hw_queue.o irq.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 9fc7218a973c..98220628a467 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -170,13 +170,23 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		goto early_fini;
 
+	hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
+	if (hdev->cq_wq == NULL) {
+		dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
+		rc = -ENOMEM;
+		goto asid_fini;
+	}
+
 	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
 
 	mutex_init(&hdev->device_open);
+	mutex_init(&hdev->send_cpu_message_lock);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
 	return 0;
 
+asid_fini:
+	hl_asid_fini(hdev);
 early_fini:
 	if (hdev->asic_funcs->early_fini)
 		hdev->asic_funcs->early_fini(hdev);
@@ -192,9 +201,12 @@ static int device_early_init(struct hl_device *hdev)
  */
 static void device_early_fini(struct hl_device *hdev)
 {
+	mutex_destroy(&hdev->send_cpu_message_lock);
 
 	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
 
+	destroy_workqueue(hdev->cq_wq);
+
 	hl_asid_fini(hdev);
 
 	if (hdev->asic_funcs->early_fini)
@@ -273,7 +285,7 @@ int hl_device_resume(struct hl_device *hdev)
  */
 int hl_device_init(struct hl_device *hdev, struct class *hclass)
 {
-	int rc;
+	int i, rc, cq_ready_cnt;
 
 	/* Create device */
 	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
@@ -294,11 +306,48 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	if (rc)
 		goto early_fini;
 
+	/*
+	 * Initialize the H/W queues. Must be done before hw_init, because
+	 * hw_init writes the addresses of the kernel queues to the
+	 * registers of the device
+	 */
+	rc = hl_hw_queues_create(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize kernel queues\n");
+		goto sw_fini;
+	}
+
+	/*
+	 * Initialize the completion queues. Must be done before hw_init,
+	 * because hw_init passes the addresses of the completion queues
+	 * as arguments to request_irq
+	 */
+	hdev->completion_queue =
+			kcalloc(hdev->asic_prop.completion_queues_count,
+				sizeof(*hdev->completion_queue), GFP_KERNEL);
+
+	if (!hdev->completion_queue) {
+		dev_err(hdev->dev, "failed to allocate completion queues\n");
+		rc = -ENOMEM;
+		goto hw_queues_destroy;
+	}
+
+	for (i = 0, cq_ready_cnt = 0;
+			i < hdev->asic_prop.completion_queues_count;
+			i++, cq_ready_cnt++) {
+		rc = hl_cq_init(hdev, &hdev->completion_queue[i], i);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to initialize completion queue\n");
+			goto cq_fini;
+		}
+	}
+
 	/* Allocate the kernel context */
 	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
 	if (!hdev->kernel_ctx) {
 		rc = -ENOMEM;
-		goto sw_fini;
+		goto cq_fini;
 	}
 
 	hdev->user_ctx = NULL;
@@ -324,6 +373,14 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 
 	hdev->disabled = false;
 
+	/* Check that the communication with the device is working */
+	rc = hdev->asic_funcs->test_queues(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to detect if device is alive\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
@@ -335,6 +392,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 			"kernel ctx is still alive on initialization failure\n");
 free_ctx:
 	kfree(hdev->kernel_ctx);
+cq_fini:
+	for (i = 0 ; i < cq_ready_cnt ; i++)
+		hl_cq_fini(hdev, &hdev->completion_queue[i]);
+	kfree(hdev->completion_queue);
+hw_queues_destroy:
+	hl_hw_queues_destroy(hdev);
 sw_fini:
 	hdev->asic_funcs->sw_fini(hdev);
 early_fini:
@@ -364,6 +427,7 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
  */
 void hl_device_fini(struct hl_device *hdev)
 {
+	int i;
 	dev_info(hdev->dev, "Removing device\n");
 
 	/* Mark device as disabled */
@@ -378,6 +442,12 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, true);
 
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		hl_cq_fini(hdev, &hdev->completion_queue[i]);
+	kfree(hdev->completion_queue);
+
+	hl_hw_queues_destroy(hdev);
+
 	/* Call ASIC S/W finalize function */
 	hdev->asic_funcs->sw_fini(hdev);
 
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index f715e01838b3..08d5227eaf1d 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -98,6 +98,26 @@
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int i;
+
+	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
+		prop->hw_queues_props[i].type = QUEUE_TYPE_EXT;
+		prop->hw_queues_props[i].kmd_only = 0;
+	}
+
+	for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES ; i++) {
+		prop->hw_queues_props[i].type = QUEUE_TYPE_CPU;
+		prop->hw_queues_props[i].kmd_only = 1;
+	}
+
+	for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES +
+			NUMBER_OF_INT_HW_QUEUES; i++) {
+		prop->hw_queues_props[i].type = QUEUE_TYPE_INT;
+		prop->hw_queues_props[i].kmd_only = 0;
+	}
+
+	for (; i < HL_MAX_QUEUES; i++)
+		prop->hw_queues_props[i].type = QUEUE_TYPE_NA;
 
 	prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
 
@@ -126,6 +146,18 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->high_pll = PLL_HIGH_DEFAULT;
 }
 
+int goya_send_pci_access_msg(struct hl_device *hdev, u32 opcode)
+{
+	struct armcp_packet pkt;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = opcode;
+
+	return hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt,
+			sizeof(pkt), HL_DEVICE_TIMEOUT_USEC, NULL);
+}
+
 /**
  * goya_pci_bars_map - Map PCI BARS of Goya device
  *
@@ -509,6 +541,8 @@ static int goya_sw_init(struct hl_device *hdev)
 	if (!goya)
 		return -ENOMEM;
 
+	goya->test_cpu_queue = goya_test_cpu_queue;
+
 	/* according to goya_init_iatu */
 	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
 	hdev->asic_specific = goya;
@@ -595,6 +629,299 @@ int goya_sw_fini(struct hl_device *hdev)
 	return 0;
 }
 
+static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
+		dma_addr_t bus_address)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u32 reg_off = dma_id * (mmDMA_QM_1_PQ_PI - mmDMA_QM_0_PQ_PI);
+
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	WREG32(mmDMA_QM_0_PQ_BASE_LO + reg_off, lower_32_bits(bus_address));
+	WREG32(mmDMA_QM_0_PQ_BASE_HI + reg_off, upper_32_bits(bus_address));
+
+	WREG32(mmDMA_QM_0_PQ_SIZE + reg_off, ilog2(HL_QUEUE_LENGTH));
+	WREG32(mmDMA_QM_0_PQ_PI + reg_off, 0);
+	WREG32(mmDMA_QM_0_PQ_CI + reg_off, 0);
+
+	WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
+	WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
+	WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
+	WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
+	WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
+	WREG32(mmDMA_QM_0_GLBL_ERR_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_DMA0_QM + dma_id);
+
+	/* PQ has buffer of 2 cache lines, while CQ has 8 lines */
+	WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
+	WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
+
+	if (dma_id == 0)
+		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
+	else
+		if (goya->hw_cap_initialized & HW_CAP_MMU)
+			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
+					QMAN_DMA_PARTLY_TRUSTED);
+		else
+			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
+					QMAN_DMA_FULLY_TRUSTED);
+
+	WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
+	WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
+}
+
+static void goya_init_dma_ch(struct hl_device *hdev, int dma_id)
+{
+	u32 gic_base_lo, gic_base_hi;
+	u64 sob_addr;
+	u32 reg_off = dma_id * (mmDMA_CH_1_CFG1 - mmDMA_CH_0_CFG1);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	WREG32(mmDMA_CH_0_ERRMSG_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmDMA_CH_0_ERRMSG_ADDR_HI + reg_off, gic_base_hi);
+	WREG32(mmDMA_CH_0_ERRMSG_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_DMA0_CH + dma_id);
+
+	if (dma_id) {
+		sob_addr = CFG_BASE + mmSYNC_MNGR_SOB_OBJ_1000 +
+				(dma_id - 1) * 4;
+		WREG32(mmDMA_CH_0_WR_COMP_ADDR_LO + reg_off,
+				lower_32_bits(sob_addr));
+		WREG32(mmDMA_CH_0_WR_COMP_ADDR_HI + reg_off,
+				upper_32_bits(sob_addr));
+		WREG32(mmDMA_CH_0_WR_COMP_WDATA + reg_off, 0x80000001);
+	}
+}
+
+/**
+ * goya_init_dma_qmans - Initialize QMAN DMA registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Initialize the H/W registers of the QMAN DMA channels
+ *
+ */
+static void goya_init_dma_qmans(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct hl_hw_queue *q;
+	dma_addr_t bus_address;
+	int i;
+
+	if (goya->hw_cap_initialized & HW_CAP_DMA)
+		return;
+
+	q = &hdev->kernel_queues[0];
+
+	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++, q++) {
+		bus_address = q->bus_address +
+				hdev->asic_prop.host_phys_base_address;
+
+		goya_init_dma_qman(hdev, i, bus_address);
+		goya_init_dma_ch(hdev, i);
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_DMA;
+}
+
+/**
+ * goya_disable_external_queues - Disable external queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_disable_external_queues(struct hl_device *hdev)
+{
+	WREG32(mmDMA_QM_0_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_1_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_2_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_3_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_4_GLBL_CFG0, 0);
+}
+
+static int goya_stop_queue(struct hl_device *hdev, u32 cfg_reg,
+				u32 cp_sts_reg, u32 glbl_sts0_reg)
+{
+	int rc;
+	u32 status;
+
+	/* use the values of TPC0 as they are all the same */
+
+	WREG32(cfg_reg, 1 << TPC0_QM_GLBL_CFG1_CP_STOP_SHIFT);
+
+	status = RREG32(cp_sts_reg);
+	if (status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK) {
+		rc = hl_poll_timeout(
+			hdev,
+			cp_sts_reg,
+			status,
+			!(status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK),
+			1000,
+			QMAN_FENCE_TIMEOUT_USEC);
+
+		/* if QMAN is stuck in fence no need to check for stop */
+		if (rc)
+			return 0;
+	}
+
+	rc = hl_poll_timeout(
+		hdev,
+		glbl_sts0_reg,
+		status,
+		(status & TPC0_QM_GLBL_STS0_CP_IS_STOP_MASK),
+		1000,
+		QMAN_STOP_TIMEOUT_USEC);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Timeout while waiting for QMAN to stop\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * goya_stop_external_queues - Stop external queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+static int goya_stop_external_queues(struct hl_device *hdev)
+{
+	int rc = goya_stop_queue(hdev,
+			mmDMA_QM_0_GLBL_CFG1,
+			mmDMA_QM_0_CP_STS,
+			mmDMA_QM_0_GLBL_STS0);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to stop DMA QMAN 0\n");
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_1_GLBL_CFG1,
+			mmDMA_QM_1_CP_STS,
+			mmDMA_QM_1_GLBL_STS0);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to stop DMA QMAN 1\n");
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_2_GLBL_CFG1,
+			mmDMA_QM_2_CP_STS,
+			mmDMA_QM_2_GLBL_STS0);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to stop DMA QMAN 2\n");
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_3_GLBL_CFG1,
+			mmDMA_QM_3_CP_STS,
+			mmDMA_QM_3_GLBL_STS0);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to stop DMA QMAN 3\n");
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_4_GLBL_CFG1,
+			mmDMA_QM_4_CP_STS,
+			mmDMA_QM_4_GLBL_STS0);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to stop DMA QMAN 4\n");
+
+	return rc;
+}
+
+static void goya_resume_external_queues(struct hl_device *hdev)
+{
+	WREG32(mmDMA_QM_0_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_1_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_2_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_3_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_4_GLBL_CFG1, 0);
+}
+
+/**
+ * goya_init_cpu_queues - Initialize PQ/CQ/EQ of CPU
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+int goya_init_cpu_queues(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	dma_addr_t bus_address;
+	u32 status;
+	struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
+	int err;
+
+	if (!hdev->cpu_queues_enable)
+		return 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
+		return 0;
+
+	bus_address = cpu_pq->bus_address +
+			hdev->asic_prop.host_phys_base_address;
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
+
+	bus_address = hdev->cpu_accessible_dma_address +
+			hdev->asic_prop.host_phys_base_address;
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
+
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
+
+	/* Used for EQ CI */
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, 0);
+
+	WREG32(mmCPU_IF_PF_PQ_PI, 0);
+
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_7, PQ_INIT_STATUS_READY_FOR_CP);
+
+	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+			GOYA_ASYNC_EVENT_ID_PI_UPDATE);
+
+	err = hl_poll_timeout(
+		hdev,
+		mmPSOC_GLOBAL_CONF_SCRATCHPAD_7,
+		status,
+		(status == PQ_INIT_STATUS_READY_FOR_HOST),
+		1000,
+		GOYA_CPU_TIMEOUT_USEC);
+
+	if (err) {
+		dev_err(hdev->dev,
+			"Failed to communicate with ARM CPU (ArmCP timeout)\n");
+		return -EIO;
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_CPU_Q;
+	return 0;
+}
+
 /**
  * goya_init_pll - Initialize pll registers
  *
@@ -1960,152 +2287,646 @@ static void goya_init_golden_registers(struct hl_device *hdev)
 	goya->hw_cap_initialized |= HW_CAP_GOLDEN;
 }
 
-
-/**
- * goya_push_uboot_to_device - Push u-boot FW code to device
- *
- * @hdev: pointer to hl_device structure
- *
- * Copy u-boot fw code from firmware file to SRAM BAR.
- * Returns 0 on success
- *
- */
-static int goya_push_uboot_to_device(struct hl_device *hdev)
+static void goya_init_mme_qman(struct hl_device *hdev)
 {
-	char fw_name[200];
-	const u64 *fw_data;
-	void __iomem *dst;
-	size_t fw_size, i;
-	int rc;
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u64 qman_base_addr;
 
-	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
 
-	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
 
-	if (rc) {
-		dev_err(hdev->dev, "Failed to request u-boot fw image\n");
-		goto out;
-	}
+	qman_base_addr = hdev->asic_prop.sram_base_address +
+				MME_QMAN_BASE_OFFSET;
 
-	fw_size = hdev->spl_fw->size;
-	if ((fw_size % 4) != 0) {
-		dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
-			fw_size);
-		rc = -EINVAL;
-		goto out;
-	}
+	WREG32(mmMME_QM_PQ_BASE_LO, lower_32_bits(qman_base_addr));
+	WREG32(mmMME_QM_PQ_BASE_HI, upper_32_bits(qman_base_addr));
+	WREG32(mmMME_QM_PQ_SIZE, ilog2(MME_QMAN_LENGTH));
+	WREG32(mmMME_QM_PQ_PI, 0);
+	WREG32(mmMME_QM_PQ_CI, 0);
+	WREG32(mmMME_QM_CP_LDMA_SRC_BASE_LO_OFFSET, 0x10C0);
+	WREG32(mmMME_QM_CP_LDMA_SRC_BASE_HI_OFFSET, 0x10C4);
+	WREG32(mmMME_QM_CP_LDMA_TSIZE_OFFSET, 0x10C8);
+	WREG32(mmMME_QM_CP_LDMA_COMMIT_OFFSET, 0x10CC);
 
-	dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
+	WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
+	WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
+	WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_LO, so_base_lo);
+	WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_HI, so_base_hi);
 
-	fw_data = (const u64 *) hdev->spl_fw->data;
-	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
+	/* QMAN CQ has 8 cache lines */
+	WREG32(mmMME_QM_CQ_CFG1, 0x00080008);
 
-	if ((hdev->spl_fw->size % 8) != 0)
-		fw_size -= 8;
+	WREG32(mmMME_QM_GLBL_ERR_ADDR_LO, gic_base_lo);
+	WREG32(mmMME_QM_GLBL_ERR_ADDR_HI, gic_base_hi);
 
-	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
-		if (!(i & (0x80000 - 1)))
-			dev_dbg(hdev->dev,
-				"u-boot copied so far %lu out of %lu",
-				i, fw_size);
+	WREG32(mmMME_QM_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_QM);
 
-		writeq(*fw_data, dst);
-	}
+	WREG32(mmMME_QM_GLBL_ERR_CFG, QMAN_MME_ERR_MSG_EN);
 
-	if ((hdev->spl_fw->size % 8) != 0)
-		writel(*(const u32 *) fw_data, dst);
+	WREG32(mmMME_QM_GLBL_PROT, QMAN_MME_ERR_PROT);
 
-out:
-	release_firmware(hdev->spl_fw);
-	return rc;
+	WREG32(mmMME_QM_GLBL_CFG0, QMAN_MME_ENABLE);
 }
 
-/**
- * goya_push_linux_to_device - Push LINUX FW code to device
- *
- * @hdev: pointer to hl_device structure
- *
- * Copy LINXU fw code from firmware file to DDR BAR.
- * Returns 0 on success
- *
- */
-static int goya_push_linux_to_device(struct hl_device *hdev)
+static void goya_init_mme_cmdq(struct hl_device *hdev)
 {
-	char fw_name[200];
-	const u64 *fw_data;
-	void __iomem *dst;
-	size_t fw_size, i;
-	int rc;
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u64 qman_base_addr;
 
-	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
 
-	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
 
-	if (rc) {
-		dev_err(hdev->dev, "Failed to request Linux fw image\n");
-		goto out;
-	}
+	qman_base_addr = hdev->asic_prop.sram_base_address +
+				MME_QMAN_BASE_OFFSET;
 
-	fw_size = hdev->spl_fw->size;
-	if ((fw_size % 4) != 0) {
-		dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
-			fw_size);
-		rc = -EINVAL;
-		goto out;
-	}
+	WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
+	WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
+	WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_LO, so_base_lo);
+	WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_HI, so_base_hi);
 
-	dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
+	/* CMDQ CQ has 20 cache lines */
+	WREG32(mmMME_CMDQ_CQ_CFG1, 0x00140014);
 
-	fw_data = (const u64 *) hdev->spl_fw->data;
-	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
+	WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_LO, gic_base_lo);
+	WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_HI, gic_base_hi);
 
-	if ((hdev->spl_fw->size % 8) != 0)
-		fw_size -= 8;
+	WREG32(mmMME_CMDQ_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_CMDQ);
 
-	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
-		if (!(i & (0x80000 - 1))) {
-			dev_dbg(hdev->dev,
-				"Linux copied so far %lu out of %lu",
-				i, fw_size);
-			usleep_range(20, 100);
-		}
-		writeq(*fw_data, dst);
-	}
+	WREG32(mmMME_CMDQ_GLBL_ERR_CFG, CMDQ_MME_ERR_MSG_EN);
 
-	if ((hdev->spl_fw->size % 8) != 0)
-		writel(*(const u32 *) fw_data, dst);
+	WREG32(mmMME_CMDQ_GLBL_PROT, CMDQ_MME_ERR_PROT);
 
-out:
-	release_firmware(hdev->spl_fw);
-	return rc;
+	WREG32(mmMME_CMDQ_GLBL_CFG0, CMDQ_MME_ENABLE);
 }
 
-static int goya_pldm_init_cpu(struct hl_device *hdev)
+static void goya_init_mme_qmans(struct hl_device *hdev)
 {
-	u32 val, unit_rst_val;
-	int rc;
+	struct goya_device *goya = hdev->asic_specific;
+	u32 so_base_lo, so_base_hi;
 
-	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
-	goya_init_golden_registers(hdev);
+	if (goya->hw_cap_initialized & HW_CAP_MME)
+		return;
 
-	/* Put ARM cores into reset */
-	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
-	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
 
-	/* Reset the CA53 MACRO */
-	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
-	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
-	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
-	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
-	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmMME_SM_BASE_ADDRESS_LOW, so_base_lo);
+	WREG32(mmMME_SM_BASE_ADDRESS_HIGH, so_base_hi);
 
-	rc = goya_push_uboot_to_device(hdev);
-	if (rc)
-		return rc;
+	goya_init_mme_qman(hdev);
+	goya_init_mme_cmdq(hdev);
 
-	rc = goya_push_linux_to_device(hdev);
-	if (rc)
-		return rc;
+	goya->hw_cap_initialized |= HW_CAP_MME;
+}
+
+static void goya_init_tpc_qman(struct hl_device *hdev, u32 base_off, int tpc_id)
+{
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u64 qman_base_addr;
+	u32 reg_off = tpc_id * (mmTPC1_QM_PQ_PI - mmTPC0_QM_PQ_PI);
+
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	qman_base_addr = hdev->asic_prop.sram_base_address + base_off;
+
+	WREG32(mmTPC0_QM_PQ_BASE_LO + reg_off, lower_32_bits(qman_base_addr));
+	WREG32(mmTPC0_QM_PQ_BASE_HI + reg_off, upper_32_bits(qman_base_addr));
+	WREG32(mmTPC0_QM_PQ_SIZE + reg_off, ilog2(TPC_QMAN_LENGTH));
+	WREG32(mmTPC0_QM_PQ_PI + reg_off, 0);
+	WREG32(mmTPC0_QM_PQ_CI + reg_off, 0);
+	WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_LO_OFFSET + reg_off, 0x10C0);
+	WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_HI_OFFSET + reg_off, 0x10C4);
+	WREG32(mmTPC0_QM_CP_LDMA_TSIZE_OFFSET + reg_off, 0x10C8);
+	WREG32(mmTPC0_QM_CP_LDMA_COMMIT_OFFSET + reg_off, 0x10CC);
+
+	WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
+	WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
+	WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
+	WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
+
+	WREG32(mmTPC0_QM_CQ_CFG1 + reg_off, 0x00080008);
+
+	WREG32(mmTPC0_QM_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmTPC0_QM_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
+
+	WREG32(mmTPC0_QM_GLBL_ERR_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_TPC0_QM + tpc_id);
+
+	WREG32(mmTPC0_QM_GLBL_ERR_CFG + reg_off, QMAN_TPC_ERR_MSG_EN);
+
+	WREG32(mmTPC0_QM_GLBL_PROT + reg_off, QMAN_TPC_ERR_PROT);
+
+	WREG32(mmTPC0_QM_GLBL_CFG0 + reg_off, QMAN_TPC_ENABLE);
+}
+
+static void goya_init_tpc_cmdq(struct hl_device *hdev, int tpc_id)
+{
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u32 reg_off = tpc_id * (mmTPC1_CMDQ_CQ_CFG1 - mmTPC0_CMDQ_CQ_CFG1);
+
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
+
+	WREG32(mmTPC0_CMDQ_CQ_CFG1 + reg_off, 0x00140014);
+
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
+
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_TPC0_CMDQ + tpc_id);
+
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_CFG + reg_off, CMDQ_TPC_ERR_MSG_EN);
+
+	WREG32(mmTPC0_CMDQ_GLBL_PROT + reg_off, CMDQ_TPC_ERR_PROT);
+
+	WREG32(mmTPC0_CMDQ_GLBL_CFG0 + reg_off, CMDQ_TPC_ENABLE);
+}
+
+static void goya_init_tpc_qmans(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 so_base_lo, so_base_hi;
+	u32 cfg_off = mmTPC1_CFG_SM_BASE_ADDRESS_LOW -
+			mmTPC0_CFG_SM_BASE_ADDRESS_LOW;
+	int i;
+
+	if (goya->hw_cap_initialized & HW_CAP_TPC)
+		return;
+
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++) {
+		WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_LOW + i * cfg_off,
+				so_base_lo);
+		WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_HIGH + i * cfg_off,
+				so_base_hi);
+	}
+
+	goya_init_tpc_qman(hdev, TPC0_QMAN_BASE_OFFSET, 0);
+	goya_init_tpc_qman(hdev, TPC1_QMAN_BASE_OFFSET, 1);
+	goya_init_tpc_qman(hdev, TPC2_QMAN_BASE_OFFSET, 2);
+	goya_init_tpc_qman(hdev, TPC3_QMAN_BASE_OFFSET, 3);
+	goya_init_tpc_qman(hdev, TPC4_QMAN_BASE_OFFSET, 4);
+	goya_init_tpc_qman(hdev, TPC5_QMAN_BASE_OFFSET, 5);
+	goya_init_tpc_qman(hdev, TPC6_QMAN_BASE_OFFSET, 6);
+	goya_init_tpc_qman(hdev, TPC7_QMAN_BASE_OFFSET, 7);
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++)
+		goya_init_tpc_cmdq(hdev, i);
+
+	goya->hw_cap_initialized |= HW_CAP_TPC;
+}
+
+/**
+ * goya_disable_internal_queues - Disable internal queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_disable_internal_queues(struct hl_device *hdev)
+{
+	WREG32(mmMME_QM_GLBL_CFG0, 0);
+	WREG32(mmMME_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC0_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC0_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC1_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC1_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC2_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC2_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC3_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC3_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC4_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC4_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC5_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC5_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC6_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC6_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC7_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC7_CMDQ_GLBL_CFG0, 0);
+}
+
+/**
+ * goya_stop_internal_queues - Stop internal queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+static int goya_stop_internal_queues(struct hl_device *hdev)
+{
+	int rc, retval = 0;
+
+	rc = goya_stop_queue(hdev,
+			mmMME_QM_GLBL_CFG1,
+			mmMME_QM_CP_STS,
+			mmMME_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop MME QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmMME_CMDQ_GLBL_CFG1,
+			mmMME_CMDQ_CP_STS,
+			mmMME_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop MME CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC0_QM_GLBL_CFG1,
+			mmTPC0_QM_CP_STS,
+			mmTPC0_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 0 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC0_CMDQ_GLBL_CFG1,
+			mmTPC0_CMDQ_CP_STS,
+			mmTPC0_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 0 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC1_QM_GLBL_CFG1,
+			mmTPC1_QM_CP_STS,
+			mmTPC1_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 1 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC1_CMDQ_GLBL_CFG1,
+			mmTPC1_CMDQ_CP_STS,
+			mmTPC1_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 1 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC2_QM_GLBL_CFG1,
+			mmTPC2_QM_CP_STS,
+			mmTPC2_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 2 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC2_CMDQ_GLBL_CFG1,
+			mmTPC2_CMDQ_CP_STS,
+			mmTPC2_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 2 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC3_QM_GLBL_CFG1,
+			mmTPC3_QM_CP_STS,
+			mmTPC3_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 3 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC3_CMDQ_GLBL_CFG1,
+			mmTPC3_CMDQ_CP_STS,
+			mmTPC3_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 3 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC4_QM_GLBL_CFG1,
+			mmTPC4_QM_CP_STS,
+			mmTPC4_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 4 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC4_CMDQ_GLBL_CFG1,
+			mmTPC4_CMDQ_CP_STS,
+			mmTPC4_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 4 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC5_QM_GLBL_CFG1,
+			mmTPC5_QM_CP_STS,
+			mmTPC5_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 5 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC5_CMDQ_GLBL_CFG1,
+			mmTPC5_CMDQ_CP_STS,
+			mmTPC5_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 5 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC6_QM_GLBL_CFG1,
+			mmTPC6_QM_CP_STS,
+			mmTPC6_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 6 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC6_CMDQ_GLBL_CFG1,
+			mmTPC6_CMDQ_CP_STS,
+			mmTPC6_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 6 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC7_QM_GLBL_CFG1,
+			mmTPC7_QM_CP_STS,
+			mmTPC7_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 7 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC7_CMDQ_GLBL_CFG1,
+			mmTPC7_CMDQ_CP_STS,
+			mmTPC7_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 7 CMDQ\n");
+		retval = -EIO;
+	}
+
+	return retval;
+}
+
+static void goya_resume_internal_queues(struct hl_device *hdev)
+{
+	WREG32(mmMME_QM_GLBL_CFG1, 0);
+	WREG32(mmMME_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC0_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC0_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC1_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC1_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC2_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC2_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC3_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC3_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC4_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC4_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC5_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC5_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC6_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC6_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC7_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
+}
+
+
+/**
+ * goya_push_uboot_to_device - Push u-boot FW code to device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Copy u-boot fw code from firmware file to SRAM BAR.
+ * Returns 0 on success
+ *
+ */
+static int goya_push_uboot_to_device(struct hl_device *hdev)
+{
+	char fw_name[200];
+	const u64 *fw_data;
+	void __iomem *dst;
+	size_t fw_size, i;
+	int rc;
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
+
+	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request u-boot fw image\n");
+		goto out;
+	}
+
+	fw_size = hdev->spl_fw->size;
+	if ((fw_size % 4) != 0) {
+		dev_err(hdev->dev, "illegal u-boot firmware size %zu\n",
+			fw_size);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	dev_dbg(hdev->dev, "u-boot firmware size == %zu\n", fw_size);
+
+	fw_data = (const u64 *) hdev->spl_fw->data;
+	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		fw_size -= 8;
+
+	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
+		if (!(i & (0x80000 - 1)))
+			dev_dbg(hdev->dev,
+				"u-boot copied so far %zu out of %zu\n",
+				i, fw_size);
+
+		writeq(*fw_data, dst);
+	}
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		writel(*(const u32 *) fw_data, dst);
+
+out:
+	release_firmware(hdev->spl_fw);
+	return rc;
+}
+
+/**
+ * goya_push_linux_to_device - Push LINUX FW code to device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Copy Linux fw code from firmware file to DDR BAR.
+ * Returns 0 on success
+ *
+ */
+static int goya_push_linux_to_device(struct hl_device *hdev)
+{
+	char fw_name[200];
+	const u64 *fw_data;
+	void __iomem *dst;
+	size_t fw_size, i;
+	int rc;
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
+
+	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request Linux fw image\n");
+		goto out;
+	}
+
+	fw_size = hdev->spl_fw->size;
+	if ((fw_size % 4) != 0) {
+		dev_err(hdev->dev, "illegal Linux firmware size %zu\n",
+			fw_size);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	dev_dbg(hdev->dev, "Linux firmware size == %zu\n", fw_size);
+
+	fw_data = (const u64 *) hdev->spl_fw->data;
+	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		fw_size -= 8;
+
+	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
+		if (!(i & (0x80000 - 1))) {
+			dev_dbg(hdev->dev,
+				"Linux copied so far %zu out of %zu\n",
+				i, fw_size);
+			usleep_range(20, 100);
+		}
+		writeq(*fw_data, dst);
+	}
+
+	if ((hdev->spl_fw->size % 8) != 0)
+		writel(*(const u32 *) fw_data, dst);
+
+out:
+	release_firmware(hdev->spl_fw);
+	return rc;
+}
+
+static int goya_pldm_init_cpu(struct hl_device *hdev)
+{
+	u32 val, unit_rst_val;
+	int rc;
+
+	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
+	goya_init_golden_registers(hdev);
+
+	/* Put ARM cores into reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	/* Reset the CA53 MACRO */
+	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+
+	rc = goya_push_uboot_to_device(hdev);
+	if (rc)
+		return rc;
+
+	rc = goya_push_linux_to_device(hdev);
+	if (rc)
+		return rc;
 
 	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
 	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
@@ -2339,6 +3160,19 @@ static int goya_hw_init(struct hl_device *hdev)
 
 	goya_init_security(hdev);
 
+	goya_init_dma_qmans(hdev);
+
+	goya_init_mme_qmans(hdev);
+
+	goya_init_tpc_qmans(hdev);
+
+	rc = goya_init_cpu_queues(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
+			rc);
+		goto disable_queues;
+	}
+
 	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
 	rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
 	if (rc) {
@@ -2347,7 +3181,7 @@ static int goya_hw_init(struct hl_device *hdev)
 		if (rc) {
 			dev_err(hdev->dev,
 				"Unable to set pci dma mask to 32 bits\n");
-			return rc;
+			goto disable_pci_access;
 		}
 	}
 
@@ -2359,7 +3193,7 @@ static int goya_hw_init(struct hl_device *hdev)
 		if (rc) {
 			dev_err(hdev->dev,
 				"Unable to set pci consistent dma mask to 32 bits\n");
-			return rc;
+			goto disable_pci_access;
 		}
 	}
 
@@ -2367,6 +3201,14 @@ static int goya_hw_init(struct hl_device *hdev)
 	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
 
 	return 0;
+
+disable_pci_access:
+	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+disable_queues:
+	goya_disable_internal_queues(hdev);
+	goya_disable_external_queues(hdev);
+
+	return rc;
 }
 
 /**
@@ -2473,12 +3315,40 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
 
 int goya_suspend(struct hl_device *hdev)
 {
-	return 0;
+	int rc;
+
+	rc = goya_stop_internal_queues(hdev);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop internal queues\n");
+		return rc;
+	}
+
+	rc = goya_stop_external_queues(hdev);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop external queues\n");
+		return rc;
+	}
+
+	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+	if (rc)
+		dev_err(hdev->dev, "Failed to disable PCI access from CPU\n");
+
+	return rc;
 }
 
 int goya_resume(struct hl_device *hdev)
 {
-	return 0;
+	int rc;
+
+	goya_resume_external_queues(hdev);
+	goya_resume_internal_queues(hdev);
+
+	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
+	if (rc)
+		dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
+	return rc;
 }
 
 int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
@@ -2502,6 +3372,104 @@ int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 	return rc;
 }
 
+void goya_ring_doorbell(struct hl_device *hdev, u32 hw_queue_id, u32 pi)
+{
+	u32 db_reg_offset, db_value;
+	bool invalid_queue = false;
+
+	switch (hw_queue_id) {
+	case GOYA_QUEUE_ID_DMA_0:
+		db_reg_offset = mmDMA_QM_0_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_1:
+		db_reg_offset = mmDMA_QM_1_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_2:
+		db_reg_offset = mmDMA_QM_2_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_3:
+		db_reg_offset = mmDMA_QM_3_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_4:
+		db_reg_offset = mmDMA_QM_4_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_CPU_PQ:
+		if (hdev->cpu_queues_enable)
+			db_reg_offset = mmCPU_IF_PF_PQ_PI;
+		else
+			invalid_queue = true;
+		break;
+
+	case GOYA_QUEUE_ID_MME:
+		db_reg_offset = mmMME_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC0:
+		db_reg_offset = mmTPC0_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC1:
+		db_reg_offset = mmTPC1_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC2:
+		db_reg_offset = mmTPC2_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC3:
+		db_reg_offset = mmTPC3_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC4:
+		db_reg_offset = mmTPC4_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC5:
+		db_reg_offset = mmTPC5_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC6:
+		db_reg_offset = mmTPC6_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC7:
+		db_reg_offset = mmTPC7_QM_PQ_PI;
+		break;
+
+	default:
+		invalid_queue = true;
+	}
+
+	if (invalid_queue) {
+		/* Should never get here */
+		dev_err(hdev->dev, "h/w queue %d is invalid. Can't set pi\n",
+			hw_queue_id);
+		return;
+	}
+
+	db_value = pi;
+
+	if (hdev->ifh)
+		return;
+
+	/* ring the doorbell */
+	WREG32(db_reg_offset, db_value);
+
+	if (hw_queue_id == GOYA_QUEUE_ID_CPU_PQ)
+		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+				GOYA_ASYNC_EVENT_ID_PI_UPDATE);
+}
+
+void goya_flush_pq_write(struct hl_device *hdev, u64 *pq, u64 exp_val)
+{
+	/* Not needed in Goya */
+}
+
 void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flags)
 {
@@ -2514,6 +3482,311 @@ void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
 	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
 }
 
+void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
+				dma_addr_t *dma_handle,	u16 *queue_len)
+{
+	void *base;
+	u32 offset;
+
+	*dma_handle = hdev->asic_prop.sram_base_address;
+
+	base = hdev->pcie_bar[SRAM_CFG_BAR_ID];
+
+	switch (queue_id) {
+	case GOYA_QUEUE_ID_MME:
+		offset = MME_QMAN_BASE_OFFSET;
+		*queue_len = MME_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC0:
+		offset = TPC0_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC1:
+		offset = TPC1_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC2:
+		offset = TPC2_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC3:
+		offset = TPC3_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC4:
+		offset = TPC4_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC5:
+		offset = TPC5_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC6:
+		offset = TPC6_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC7:
+		offset = TPC7_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	default:
+		dev_err(hdev->dev, "Got invalid queue id %d\n", queue_id);
+		return NULL;
+	}
+
+	base += offset;
+	*dma_handle += offset;
+
+	return base;
+}
+
+int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
+				u32 timeout, long *result)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct armcp_packet *pkt;
+	dma_addr_t pkt_dma_addr;
+	u32 tmp;
+	int rc = 0;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q)) {
+		if (result)
+			*result = 0;
+		return 0;
+	}
+
+	if (len > CPU_CB_SIZE) {
+		dev_err(hdev->dev, "Invalid CPU message size of %d bytes\n",
+			len);
+		return -ENOMEM;
+	}
+
+	pkt = hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev, len,
+								&pkt_dma_addr);
+	if (!pkt) {
+		dev_err(hdev->dev,
+			"Failed to allocate DMA memory for packet to CPU\n");
+		return -ENOMEM;
+	}
+
+	memcpy(pkt, msg, len);
+
+	mutex_lock(&hdev->send_cpu_message_lock);
+
+	if (hdev->disabled)
+		goto out;
+
+	rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_CPU_PQ, len,
+			pkt_dma_addr);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to send CB on CPU PQ (%d)\n", rc);
+		goto out;
+	}
+
+	rc = hl_poll_timeout_memory(hdev, (u64) &pkt->fence, timeout, &tmp);
+
+	hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_CPU_PQ);
+
+	if (rc == -ETIMEDOUT) {
+		dev_err(hdev->dev,
+			"Timeout while waiting for CPU packet fence\n");
+		goto out;
+	}
+
+	if (tmp == ARMCP_PACKET_FENCE_VAL) {
+		if (pkt->rc) {
+			dev_err(hdev->dev,
+				"failed to execute CPU packet, rc: %d\n",
+					pkt->rc);
+			rc = -EINVAL;
+		} else if (result) {
+			*result = pkt->result;
+		}
+	} else {
+		dev_err(hdev->dev, "CPU packet wrong fence value\n");
+		rc = -EINVAL;
+	}
+
+out:
+	mutex_unlock(&hdev->send_cpu_message_lock);
+
+	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, len, pkt);
+
+	return rc;
+}
+
+int goya_test_queue(struct hl_device *hdev, u32 hw_queue_id)
+{
+	struct packet_msg_prot *fence_pkt;
+	dma_addr_t pkt_dma_addr;
+	u32 fence_val, tmp;
+	dma_addr_t fence_dma_addr;
+	u32 *fence_ptr;
+	int rc;
+
+	fence_val = GOYA_QMAN0_FENCE_VAL;
+
+	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
+							&fence_dma_addr);
+	if (!fence_ptr) {
+		dev_err(hdev->dev,
+			"Failed to allocate memory for queue testing\n");
+		return -ENOMEM;
+	}
+
+	*fence_ptr = 0;
+
+	fence_pkt = hdev->asic_funcs->dma_pool_zalloc(hdev,
+					sizeof(struct packet_msg_prot),
+					GFP_KERNEL, &pkt_dma_addr);
+	if (!fence_pkt) {
+		dev_err(hdev->dev,
+			"Failed to allocate packet for queue testing\n");
+		rc = -ENOMEM;
+		goto free_fence_ptr;
+	}
+
+	fence_pkt->opcode = PACKET_MSG_PROT;
+	fence_pkt->value = fence_val;
+	fence_pkt->addr = fence_dma_addr +
+				hdev->asic_prop.host_phys_base_address;
+
+	rc = hl_hw_queue_send_cb_no_cmpl(hdev, hw_queue_id,
+					sizeof(struct packet_msg_prot),
+					pkt_dma_addr);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to send fence packet\n");
+		goto free_pkt;
+	}
+
+	rc = hl_poll_timeout_memory(hdev, (u64) fence_ptr,
+					GOYA_TEST_QUEUE_WAIT_USEC, &tmp);
+
+	hl_hw_queue_inc_ci_kernel(hdev, hw_queue_id);
+
+	if ((!rc) && (tmp == fence_val)) {
+		dev_info(hdev->dev,
+			"queue test on H/W queue %d succeeded\n",
+			hw_queue_id);
+	} else {
+		dev_err(hdev->dev,
+			"H/W queue %d test failed (scratch(0x%08llX) == 0x%08X)\n",
+			hw_queue_id, (u64) fence_dma_addr, tmp);
+		rc = -EINVAL;
+	}
+
+free_pkt:
+	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_pkt,
+					pkt_dma_addr);
+free_fence_ptr:
+	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_ptr,
+					fence_dma_addr);
+	return rc;
+}
+
+int goya_test_cpu_queue(struct hl_device *hdev)
+{
+	struct armcp_packet test_pkt;
+	long result = 0;
+	int rc;
+
+	/* cpu_queues_enable flag is always checked in send cpu message */
+
+	memset(&test_pkt, 0, sizeof(test_pkt));
+
+	test_pkt.opcode = ARMCP_PACKET_TEST;
+	test_pkt.value = ARMCP_PACKET_FENCE_VAL;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &test_pkt,
+			sizeof(test_pkt), HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if (!rc)
+		dev_info(hdev->dev, "queue test on CPU queue succeeded\n");
+	else
+		dev_err(hdev->dev, "CPU queue test failed (0x%08lX)\n", result);
+
+	return rc;
+}
+
+static int goya_test_queues(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i, rc, ret_val = 0;
+
+	if (hdev->ifh)
+		return 0;
+
+	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
+		rc = goya_test_queue(hdev, i);
+		if (rc)
+			ret_val = -EINVAL;
+	}
+
+	if (hdev->cpu_queues_enable) {
+		rc = goya->test_cpu_queue(hdev);
+		if (rc)
+			ret_val = -EINVAL;
+	}
+
+	return ret_val;
+}
+
+void *goya_dma_pool_zalloc(struct hl_device *hdev, size_t size, gfp_t mem_flags,
+				dma_addr_t *dma_handle)
+{
+	if (size > GOYA_DMA_POOL_BLK_SIZE)
+		return NULL;
+
+	return dma_pool_zalloc(hdev->dma_pool, mem_flags, dma_handle);
+}
+
+void goya_dma_pool_free(struct hl_device *hdev, void *vaddr,
+			dma_addr_t dma_addr)
+{
+	dma_pool_free(hdev->dma_pool, vaddr, dma_addr);
+}
+
+void *goya_cpu_accessible_dma_pool_alloc(struct hl_device *hdev, size_t size,
+			dma_addr_t *dma_handle)
+{
+	u64 kernel_addr;
+
+	/* roundup to CPU_PKT_SIZE */
+	size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
+
+	kernel_addr = gen_pool_alloc(hdev->cpu_accessible_dma_pool, size);
+
+	*dma_handle = hdev->cpu_accessible_dma_address +
+			(kernel_addr - (u64) hdev->cpu_accessible_dma_mem);
+
+	return (void *) kernel_addr;
+}
+
+void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
+			void *vaddr)
+{
+	/* roundup to CPU_PKT_SIZE */
+	size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
+
+	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
+}
+
+
+static void goya_hw_queues_lock(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	spin_lock(&goya->hw_queues_lock);
+}
+
+static void goya_hw_queues_unlock(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	spin_unlock(&goya->hw_queues_lock);
+}
+
 static const struct hl_asic_funcs goya_funcs = {
 	.early_init = goya_early_init,
 	.early_fini = goya_early_fini,
@@ -2525,8 +3798,19 @@ static const struct hl_asic_funcs goya_funcs = {
 	.resume = goya_resume,
 	.mmap = goya_mmap,
 	.cb_mmap = goya_cb_mmap,
+	.ring_doorbell = goya_ring_doorbell,
+	.flush_pq_write = goya_flush_pq_write,
 	.dma_alloc_coherent = goya_dma_alloc_coherent,
 	.dma_free_coherent = goya_dma_free_coherent,
+	.get_int_queue_base = goya_get_int_queue_base,
+	.test_queues = goya_test_queues,
+	.dma_pool_zalloc = goya_dma_pool_zalloc,
+	.dma_pool_free = goya_dma_pool_free,
+	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
+	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
+	.hw_queues_lock = goya_hw_queues_lock,
+	.hw_queues_unlock = goya_hw_queues_unlock,
+	.send_cpu_message = goya_send_cpu_message
 };
 
 /**
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 45a6d2ca2752..598a718d3df1 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -9,6 +9,7 @@
 #define GOYAP_H_
 
 #include "habanalabs.h"
+#include "include/goya/goya_packets.h"
 #include "include/goya/goya_boot_if.h"
 #include "include/goya/goya.h"
 
@@ -117,12 +118,17 @@ enum goya_fw_component {
 };
 
 struct goya_device {
+	int (*test_cpu_queue)(struct hl_device *hdev);
+
 	/* TODO: remove hw_queues_lock after moving to scheduler code */
 	spinlock_t	hw_queues_lock;
 	u64		ddr_bar_cur_addr;
 	u32		hw_cap_initialized;
 };
 
+int goya_test_cpu_queue(struct hl_device *hdev);
+int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
+				u32 timeout, long *result);
 void goya_init_security(struct hl_device *hdev);
 
 #endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index adda281ec2af..8232e2259463 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -30,10 +30,36 @@
 struct hl_device;
 struct hl_fpriv;
 
+/**
+ * enum hl_queue_type - Supported QUEUE types.
+ * @QUEUE_TYPE_NA: queue is not available.
+ * @QUEUE_TYPE_EXT: external queue which is a DMA channel that may access the
+ *                  host.
+ * @QUEUE_TYPE_INT: internal queue that performs DMA inside the device's
+ *			memories and/or operates the compute engines.
+ * @QUEUE_TYPE_CPU: S/W queue for communication with the device's CPU.
+ */
+enum hl_queue_type {
+	QUEUE_TYPE_NA,
+	QUEUE_TYPE_EXT,
+	QUEUE_TYPE_INT,
+	QUEUE_TYPE_CPU
+};
 
+/**
+ * struct hw_queue_properties - queue information.
+ * @type: queue type.
+ * @kmd_only: true if only KMD is allowed to send a job to this queue, false
+ *            otherwise.
+ */
+struct hw_queue_properties {
+	enum hl_queue_type	type;
+	u8			kmd_only;
+};
 
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
+ * @hw_queues_props: H/W queues properties.
  * @uboot_ver: F/W U-boot version.
  * @preboot_ver: F/W Preboot version.
  * @sram_base_address: SRAM physical start address.
@@ -64,6 +90,7 @@ struct hl_fpriv;
  * @tpc_enabled_mask: which TPCs are enabled.
  */
 struct asic_fixed_properties {
+	struct hw_queue_properties	hw_queues_props[HL_MAX_QUEUES];
 	char			uboot_ver[VERSION_MAX_LEN];
 	char			preboot_ver[VERSION_MAX_LEN];
 	u64			sram_base_address;
@@ -145,7 +172,92 @@ struct hl_cb {
 
 
 
+/*
+ * QUEUES
+ */
+
+struct hl_cs_job;
+
+/*
+ * Currently, there are two limitations on the maximum length of a queue:
+ *
+ * 1. The memory footprint of the queue. The current allocated space for the
+ *    queue is PAGE_SIZE. Because each entry in the queue is HL_BD_SIZE,
+ *    the maximum length of the queue can be PAGE_SIZE / HL_BD_SIZE,
+ *    which currently is 4096/16 = 256 entries.
+ *
+ *    To increase that, we need either to decrease the size of the
+ *    BD (difficult), or allocate more than a single page (easier).
+ *
+ * 2. Because the size of the JOB handle field in the BD CTL / completion queue
+ *    is 10-bit, we can have up to 1024 open jobs per hardware queue.
+ *    Therefore, each queue can hold up to 1024 entries.
+ *
+ * HL_QUEUE_LENGTH is in units of struct hl_bd.
+ * HL_QUEUE_LENGTH * sizeof(struct hl_bd) should be <= HL_PAGE_SIZE
+ */
+
+#define HL_PAGE_SIZE			4096 /* minimum page size */
+/* Must be power of 2 (HL_PAGE_SIZE / HL_BD_SIZE) */
 #define HL_QUEUE_LENGTH			256
+#define HL_QUEUE_SIZE_IN_BYTES		(HL_QUEUE_LENGTH * HL_BD_SIZE)
+
+/*
+ * HL_CQ_LENGTH is in units of struct hl_cq_entry.
+ * HL_CQ_LENGTH * sizeof(struct hl_cq_entry) should be <= HL_PAGE_SIZE
+ */
+#define HL_CQ_LENGTH			HL_QUEUE_LENGTH
+#define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
+
+
+
+/**
+ * struct hl_hw_queue - describes a H/W transport queue.
+ * @shadow_queue: pointer to a shadow queue that holds pointers to jobs.
+ * @queue_type: type of queue.
+ * @kernel_address: holds the queue's kernel virtual address.
+ * @bus_address: holds the queue's DMA address.
+ * @pi: holds the queue's pi value.
+ * @ci: holds the queue's ci value, AS CALCULATED BY THE DRIVER (not real ci).
+ * @hw_queue_id: the id of the H/W queue.
+ * @int_queue_len: length of internal queue (number of entries).
+ * @valid: is the queue valid (we have an array of 32 queues, not all of them
+ *		exist).
+ */
+struct hl_hw_queue {
+	struct hl_cs_job	**shadow_queue;
+	enum hl_queue_type	queue_type;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			pi;
+	u32			ci;
+	u32			hw_queue_id;
+	u16			int_queue_len;
+	u8			valid;
+};
+
+/**
+ * struct hl_cq - describes a completion queue
+ * @hdev: pointer to the device structure
+ * @kernel_address: holds the queue's kernel virtual address
+ * @bus_address: holds the queue's DMA address
+ * @hw_queue_id: the id of the matching H/W queue
+ * @ci: ci inside the queue
+ * @pi: pi inside the queue
+ * @free_slots_cnt: counter of free slots in queue
+ */
+struct hl_cq {
+	struct hl_device	*hdev;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			hw_queue_id;
+	u32			ci;
+	u32			pi;
+	atomic_t		free_slots_cnt;
+};
+
+
+
 
 
 /*
@@ -180,8 +292,20 @@ enum hl_asic_type {
  * @resume: handles IP specific H/W or SW changes for resume.
  * @mmap: mmap function, does nothing.
  * @cb_mmap: maps a CB.
+ * @ring_doorbell: increment PI on a given QMAN.
+ * @flush_pq_write: flush PQ entry write if necessary, WARN if flushing failed.
  * @dma_alloc_coherent: DMA allocate coherent memory.
  * @dma_free_coherent: free DMA allocation.
+ * @get_int_queue_base: get the internal queue base address.
+ * @test_queues: run simple test on all queues for sanity check.
+ * @dma_pool_zalloc: small DMA allocation of coherent memory from DMA pool.
+ *                   size of allocation is HL_DMA_POOL_BLK_SIZE.
+ * @dma_pool_free: free small DMA allocation from pool.
+ * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
+ * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
+ * @hw_queues_lock: acquire H/W queues lock.
+ * @hw_queues_unlock: release H/W queues lock.
+ * @send_cpu_message: send buffer to ArmCP.
  */
 struct hl_asic_funcs {
 	int (*early_init)(struct hl_device *hdev);
@@ -195,10 +319,27 @@ struct hl_asic_funcs {
 	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
 	int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
 			u64 kaddress, phys_addr_t paddress, u32 size);
+	void (*ring_doorbell)(struct hl_device *hdev, u32 hw_queue_id, u32 pi);
+	void (*flush_pq_write)(struct hl_device *hdev, u64 *pq, u64 exp_val);
 	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flag);
 	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
 					void *cpu_addr, dma_addr_t dma_handle);
+	void* (*get_int_queue_base)(struct hl_device *hdev, u32 queue_id,
+				dma_addr_t *dma_handle, u16 *queue_len);
+	int (*test_queues)(struct hl_device *hdev);
+	void* (*dma_pool_zalloc)(struct hl_device *hdev, size_t size,
+				gfp_t mem_flags, dma_addr_t *dma_handle);
+	void (*dma_pool_free)(struct hl_device *hdev, void *vaddr,
+				dma_addr_t dma_addr);
+	void* (*cpu_accessible_dma_pool_alloc)(struct hl_device *hdev,
+				size_t size, dma_addr_t *dma_handle);
+	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
+				size_t size, void *vaddr);
+	void (*hw_queues_lock)(struct hl_device *hdev);
+	void (*hw_queues_unlock)(struct hl_device *hdev);
+	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
+				u16 len, u32 timeout, long *result);
 };
 
 
@@ -240,6 +381,17 @@ struct hl_ctx_mgr {
 
 
 
+/**
+ * struct hl_cs_job - command submission job.
+ * @finish_work: workqueue object to run when job is completed.
+ * @id: the id of this job inside a CS.
+ */
+struct hl_cs_job {
+	struct work_struct	finish_work;
+	u32			id;
+};
+
+
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -316,7 +468,11 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @dev: realted kernel basic device structure.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
+ * @completion_queue: array of hl_cq.
+ * @cq_wq: work queue of completion queues for executing work in process context
+ * @eq_wq: work queue of event queue for executing work in process context.
  * @kernel_ctx: KMD context structure.
+ * @kernel_queues: array of hl_hw_queue.
  * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
@@ -326,6 +482,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asid_bitmap: holds used/available ASIDs.
  * @asid_mutex: protects asid_bitmap.
  * @device_open: lock for sanity checks upon FD open.
+ * @send_cpu_message_lock: enforces only one message in KMD <-> ArmCP queue.
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
@@ -345,7 +502,10 @@ struct hl_device {
 	struct device			*dev;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
+	struct hl_cq			*completion_queue;
+	struct workqueue_struct		*cq_wq;
 	struct hl_ctx			*kernel_ctx;
+	struct hl_hw_queue		*kernel_queues;
 	struct hl_cb_mgr		kernel_cb_mgr;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
@@ -356,6 +516,7 @@ struct hl_device {
 	struct mutex			asid_mutex;
 	/* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
 	struct mutex			device_open;
+	struct mutex			send_cpu_message_lock;
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
@@ -374,7 +535,9 @@ struct hl_device {
 	u8				cpu_enable;
 	u8				reset_pcilink;
 	u8				config_pll;
+	u8				cpu_queues_enable;
 	u8				fw_loading;
+	u8				ifh;
 	u8				pldm;
 };
 
@@ -418,7 +581,18 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
 				u32 *val);
 int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
 				u32 timeout_us, u32 *val);
-
+int hl_hw_queues_create(struct hl_device *hdev);
+void hl_hw_queues_destroy(struct hl_device *hdev);
+int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
+				u32 cb_size, u64 cb_ptr);
+u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
+void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
+
+#define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
+#define hl_pi_2_offset(pi)		((pi) & (HL_QUEUE_LENGTH - 1))
+
+int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
+void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
 int hl_asid_init(struct hl_device *hdev);
 void hl_asid_fini(struct hl_device *hdev);
 unsigned long hl_asid_alloc(struct hl_device *hdev);
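
Note on the ring pointers declared in habanalabs.h: hl_hw_queue_add_ptr()
lets pi/ci run up to twice HL_QUEUE_LENGTH, while hl_pi_2_offset() masks
them down to a slot index. The following is a minimal user-space sketch of
that arithmetic, not driver code. It assumes HL_BD_SIZE is 16 bytes (as the
"4096/16 = 256" comment implies). The names ring_add, ring_offset and
ring_used are hypothetical, and the stated reason for the doubled range
(telling a full ring from an empty one) is a common ring-buffer reading,
not something the patch itself states.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES	4096u	/* HL_PAGE_SIZE in the patch */
#define BD_BYTES	16u	/* assumed HL_BD_SIZE, per the 4096/16 comment */
#define QUEUE_LEN	(PAGE_BYTES / BD_BYTES)	/* 256 entries */
#define JOB_HANDLE_BITS	10u	/* width of the JOB handle field */

/* Limit 1: the queue must fit in one page and be a power of two. */
_Static_assert((QUEUE_LEN & (QUEUE_LEN - 1)) == 0, "length must be 2^n");
_Static_assert(QUEUE_LEN * BD_BYTES <= PAGE_BYTES, "queue exceeds a page");
/* Limit 2: a 10-bit handle can name at most 1024 open jobs. */
_Static_assert(QUEUE_LEN <= (1u << JOB_HANDLE_BITS), "too many entries");

/* Advance a pointer; it wraps at 2 * QUEUE_LEN, like hl_hw_queue_add_ptr(). */
static uint32_t ring_add(uint32_t ptr, uint16_t val)
{
	return (ptr + val) & ((QUEUE_LEN << 1) - 1);
}

/* Map a pointer to a slot index, like hl_pi_2_offset(). */
static uint32_t ring_offset(uint32_t ptr)
{
	return ptr & (QUEUE_LEN - 1);
}

/* Number of occupied slots, valid for pi/ci in [0, 2 * QUEUE_LEN). */
static uint32_t ring_used(uint32_t pi, uint32_t ci)
{
	return (pi - ci) & ((QUEUE_LEN << 1) - 1);
}
```

With these definitions, ring_used(256, 0) is 256 (full) and ring_used(0, 0)
is 0 (empty), even though both pi values map to slot 0.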
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index bd80683118d3..b64f58ad0f5d 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -184,13 +184,19 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
 	hdev->config_pll = 0;
+	hdev->cpu_queues_enable = 1;
 	hdev->fw_loading = 1;
+	hdev->ifh = 0;
 	hdev->pldm = 0;
 
 	/* If CPU is disabled, no point in loading FW */
 	if (!hdev->cpu_enable)
 		hdev->fw_loading = 0;
 
+	/* If we don't load FW, no need to initialize CPU queues */
+	if (!hdev->fw_loading)
+		hdev->cpu_queues_enable = 0;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
new file mode 100644
index 000000000000..65102a5bc2ca
--- /dev/null
+++ b/drivers/misc/habanalabs/hw_queue.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/dma-mapping.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/delay.h>
+
+/**
+ * hl_hw_queue_add_ptr - add to pi or ci and wrap around if needed
+ *
+ * @ptr: the current pi/ci value
+ * @val: the amount to add
+ *
+ * Add val to ptr. The result wraps around at twice the queue length.
+ */
+inline u32 hl_hw_queue_add_ptr(u32 ptr, u16 val)
+{
+	ptr += val;
+	ptr &= ((HL_QUEUE_LENGTH << 1) - 1);
+	return ptr;
+}
+
+static inline int queue_free_slots(struct hl_hw_queue *q, u32 queue_len)
+{
+	int delta = (q->pi - q->ci);
+
+	if (delta >= 0)
+		return (queue_len - delta);
+	else
+		return (abs(delta) - queue_len);
+}
+
+/**
+ * ext_queue_submit_bd - Submit a buffer descriptor to an external queue
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @q: pointer to habanalabs queue structure
+ * @ctl: BD's control word
+ * @len: BD's length
+ * @ptr: BD's pointer
+ *
+ * This function assumes there is enough space on the queue to submit a new
+ * BD to it. It initializes the next BD and calls the device specific
+ * function to set the pi (and doorbell)
+ *
+ * The caller must hold the H/W queues lock (not needed for the CPU queue)
+ *
+ */
+static void ext_queue_submit_bd(struct hl_device *hdev, struct hl_hw_queue *q,
+				u32 ctl, u32 len, u64 ptr)
+{
+	struct hl_bd *bd;
+
+	bd = (struct hl_bd *) q->kernel_address;
+	bd += hl_pi_2_offset(q->pi);
+	bd->ctl = ctl;
+	bd->len = len;
+	bd->ptr = ptr + hdev->asic_prop.host_phys_base_address;
+
+	q->pi = hl_queue_inc_ptr(q->pi);
+	hdev->asic_funcs->ring_doorbell(hdev, q->hw_queue_id, q->pi);
+}
+
+/**
+ * ext_queue_sanity_checks - perform some sanity checks on external queue
+ *
+ * @hdev: pointer to hl_device structure
+ * @q: pointer to hl_hw_queue structure
+ * @num_of_entries: how many entries to check for space
+ * @reserve_cq_entry: whether to reserve an entry in the cq
+ *
+ * H/W queues spinlock should be taken before calling this function
+ *
+ * Perform the following:
+ * - Make sure we have enough space in the h/w queue
+ * - Make sure we have enough space in the completion queue
+ * - Reserve space in the completion queue (needs to be reversed if there
+ *   is a failure down the road before the actual submission of work). Only
+ *   do this action if reserve_cq_entry is true
+ *
+ */
+static int ext_queue_sanity_checks(struct hl_device *hdev,
+				struct hl_hw_queue *q, int num_of_entries,
+				bool reserve_cq_entry)
+{
+	atomic_t *free_slots =
+			&hdev->completion_queue[q->hw_queue_id].free_slots_cnt;
+	int free_slots_cnt;
+
+	/* Check we have enough space in the queue */
+	free_slots_cnt = queue_free_slots(q, HL_QUEUE_LENGTH);
+
+	if (free_slots_cnt < num_of_entries) {
+		dev_dbg(hdev->dev, "Queue %d doesn't have room for %d CBs\n",
+			q->hw_queue_id, num_of_entries);
+		return -EAGAIN;
+	}
+
+	if (reserve_cq_entry) {
+		/*
+		 * Check we have enough space in the completion queue.
+		 * Subtract num_of_entries from the free slots counter. If the
+		 * result is negative, the CQ doesn't have enough room, so we
+		 * can't submit the new CBs because we won't get an ack on
+		 * their completion. In that case, restore the counter
+		 */
+		if (atomic_add_negative(num_of_entries * -1, free_slots)) {
+			dev_dbg(hdev->dev, "No space for %d on CQ %d\n",
+				num_of_entries, q->hw_queue_id);
+			atomic_add(num_of_entries, free_slots);
+			return -EAGAIN;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * hl_hw_queue_send_cb_no_cmpl - send a single CB (not a JOB) without completion
+ *
+ * @hdev: pointer to hl_device structure
+ * @hw_queue_id: Queue's ID
+ * @cb_size: size of CB
+ * @cb_ptr: pointer to CB location
+ *
+ * This function sends a single CB, that must NOT generate a completion entry
+ *
+ */
+int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
+				u32 cb_size, u64 cb_ptr)
+{
+	struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
+	int rc;
+
+	/*
+	 * The CPU queue is a synchronous queue with an effective depth of
+	 * a single entry (although it is allocated with room for multiple
+	 * entries). Therefore, there is a different lock, called
+	 * send_cpu_message_lock, that serializes accesses to the CPU queue.
+	 * As a result, we don't need to lock the access to the entire H/W
+	 * queues module when submitting a JOB to the CPU queue
+	 */
+	if (q->queue_type != QUEUE_TYPE_CPU)
+		hdev->asic_funcs->hw_queues_lock(hdev);
+
+	if (hdev->disabled) {
+		rc = -EPERM;
+		goto out;
+	}
+
+	rc = ext_queue_sanity_checks(hdev, q, 1, false);
+	if (rc)
+		goto out;
+
+	ext_queue_submit_bd(hdev, q, 0, cb_size, cb_ptr);
+
+out:
+	if (q->queue_type != QUEUE_TYPE_CPU)
+		hdev->asic_funcs->hw_queues_unlock(hdev);
+
+	return rc;
+}
+
+/**
+ * hl_hw_queue_inc_ci_kernel - increment ci for kernel's queue
+ *
+ * @hdev: pointer to hl_device structure
+ * @hw_queue_id: the queue whose ci should be incremented
+ */
+void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id)
+{
+	struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
+
+	q->ci = hl_queue_inc_ptr(q->ci);
+}
+
+static int ext_and_cpu_hw_queue_init(struct hl_device *hdev,
+					struct hl_hw_queue *q)
+{
+	void *p;
+	int rc;
+
+	p = hdev->asic_funcs->dma_alloc_coherent(hdev,
+				HL_QUEUE_SIZE_IN_BYTES,
+				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	q->kernel_address = (u64) p;
+
+	q->shadow_queue = kmalloc_array(HL_QUEUE_LENGTH,
+					sizeof(*q->shadow_queue),
+					GFP_KERNEL);
+	if (!q->shadow_queue) {
+		dev_err(hdev->dev,
+			"Failed to allocate shadow queue for H/W queue %d\n",
+			q->hw_queue_id);
+		rc = -ENOMEM;
+		goto free_queue;
+	}
+
+	/* Make sure read/write pointers are initialized to start of queue */
+	q->ci = 0;
+	q->pi = 0;
+
+	return 0;
+
+free_queue:
+	hdev->asic_funcs->dma_free_coherent(hdev, HL_QUEUE_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+
+	return rc;
+}
+
+static int int_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	void *p;
+
+	p = hdev->asic_funcs->get_int_queue_base(hdev, q->hw_queue_id,
+					&q->bus_address, &q->int_queue_len);
+	if (!p) {
+		dev_err(hdev->dev,
+			"Failed to get base address for internal queue %d\n",
+			q->hw_queue_id);
+		return -EFAULT;
+	}
+
+	q->kernel_address = (u64) p;
+	q->pi = 0;
+	q->ci = 0;
+
+	return 0;
+}
+
+static int cpu_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	return ext_and_cpu_hw_queue_init(hdev, q);
+}
+
+static int ext_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	return ext_and_cpu_hw_queue_init(hdev, q);
+}
+
+/**
+ * hw_queue_init - main initialization function for H/W queue object
+ *
+ * @hdev: pointer to hl_device device structure
+ * @q: pointer to hl_hw_queue queue structure
+ * @hw_queue_id: The id of the H/W queue
+ *
+ * Allocate dma-able memory for the queue and initialize fields
+ * Returns 0 on success
+ */
+static int hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q,
+			u32 hw_queue_id)
+{
+	int rc;
+
+	BUILD_BUG_ON(HL_QUEUE_SIZE_IN_BYTES > HL_PAGE_SIZE);
+
+	q->hw_queue_id = hw_queue_id;
+
+	switch (q->queue_type) {
+	case QUEUE_TYPE_EXT:
+		rc = ext_hw_queue_init(hdev, q);
+		break;
+
+	case QUEUE_TYPE_INT:
+		rc = int_hw_queue_init(hdev, q);
+		break;
+
+	case QUEUE_TYPE_CPU:
+		rc = cpu_hw_queue_init(hdev, q);
+		break;
+
+	case QUEUE_TYPE_NA:
+		q->valid = 0;
+		return 0;
+
+	default:
+		dev_crit(hdev->dev, "wrong queue type %d during init\n",
+			q->queue_type);
+		rc = -EINVAL;
+		break;
+	}
+
+	if (rc)
+		return rc;
+
+	q->valid = 1;
+
+	return 0;
+}
+
+/**
+ * hw_queue_fini - destroy queue
+ *
+ * @hdev: pointer to hl_device device structure
+ * @q: pointer to hl_hw_queue queue structure
+ *
+ * Free the queue memory
+ */
+static void hw_queue_fini(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	if (!q->valid)
+		return;
+
+	/*
+	 * If we arrived here, there are no jobs waiting on this queue
+	 * so we can safely remove it.
+	 * This is because this function can only be called when:
+	 * 1. Either a context is deleted, which only can occur if all its
+	 *    jobs were finished
+	 * 2. A context wasn't able to be created due to failure or timeout,
+	 *    which means there are no jobs on the queue yet
+	 *
+	 * The only exceptions are the queues of the kernel context, but
+	 * if they are being destroyed, it means that the entire module is
+	 * being removed. If the module is removed, it means there is no open
+	 * user context. It also means that if a job was submitted by
+	 * the kernel driver (e.g. context creation), the job itself was
+	 * released by the kernel driver when a timeout occurred on its
+	 * completion. Thus, we don't need to release it again.
+	 */
+
+	if (q->queue_type == QUEUE_TYPE_INT)
+		return;
+
+	kfree(q->shadow_queue);
+
+	hdev->asic_funcs->dma_free_coherent(hdev,
+			HL_QUEUE_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+}
+
+int hl_hw_queues_create(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *asic = &hdev->asic_prop;
+	struct hl_hw_queue *q;
+	int i, rc, q_ready_cnt;
+
+	hdev->kernel_queues = kcalloc(HL_MAX_QUEUES,
+				sizeof(*hdev->kernel_queues), GFP_KERNEL);
+
+	if (!hdev->kernel_queues) {
+		dev_err(hdev->dev, "Not enough memory for H/W queues\n");
+		return -ENOMEM;
+	}
+
+	/* Initialize the H/W queues */
+	for (i = 0, q_ready_cnt = 0, q = hdev->kernel_queues;
+			i < HL_MAX_QUEUES ; i++, q_ready_cnt++, q++) {
+
+		q->queue_type = asic->hw_queues_props[i].type;
+		rc = hw_queue_init(hdev, q, i);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to initialize queue %d\n", i);
+			goto release_queues;
+		}
+	}
+
+	return 0;
+
+release_queues:
+	for (i = 0, q = hdev->kernel_queues ; i < q_ready_cnt ; i++, q++)
+		hw_queue_fini(hdev, q);
+
+	kfree(hdev->kernel_queues);
+
+	return rc;
+}
+
+void hl_hw_queues_destroy(struct hl_device *hdev)
+{
+	struct hl_hw_queue *q;
+	int i;
+
+	for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++)
+		hw_queue_fini(hdev, q);
+
+	kfree(hdev->kernel_queues);
+}
+
+void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset)
+{
+	struct hl_hw_queue *q;
+	int i;
+
+	for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++) {
+		if ((!q->valid) ||
+			((!hard_reset) && (q->queue_type == QUEUE_TYPE_CPU)))
+			continue;
+		q->pi = q->ci = 0;
+	}
+}
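
Note on the admission check in hw_queue.c: queue_free_slots() and
ext_queue_sanity_checks() together decide whether a submission fits in both
the H/W queue and the completion queue. The following user-space sketch
mirrors that logic under stated assumptions: it uses C11
atomic_fetch_sub()/atomic_fetch_add() in place of the kernel's
atomic_add_negative()/atomic_add(), and the names ring, ring_free_slots,
cq_reserve and admit are hypothetical, not part of the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define QUEUE_LEN	256	/* HL_QUEUE_LENGTH in the patch */

struct ring {
	unsigned int pi;	/* producer index, in [0, 2 * QUEUE_LEN) */
	unsigned int ci;	/* consumer index, in [0, 2 * QUEUE_LEN) */
};

/* Same arithmetic as queue_free_slots(): handles pi having wrapped. */
static int ring_free_slots(const struct ring *q)
{
	int delta = (int)q->pi - (int)q->ci;

	if (delta >= 0)
		return QUEUE_LEN - delta;
	return abs(delta) - QUEUE_LEN;
}

/* Try to take n completion-queue slots; roll back if that overdraws. */
static bool cq_reserve(atomic_int *free_slots, int n)
{
	/* atomic_fetch_sub() returns the value before the subtraction */
	if (atomic_fetch_sub(free_slots, n) - n < 0) {
		atomic_fetch_add(free_slots, n);
		return false;
	}
	return true;
}

/* Returns 0 if n entries may be submitted, -EAGAIN otherwise. */
static int admit(const struct ring *q, atomic_int *cq_free, int n,
		 bool reserve_cq)
{
	if (ring_free_slots(q) < n)
		return -EAGAIN;
	if (reserve_cq && !cq_reserve(cq_free, n))
		return -EAGAIN;
	return 0;
}
```

As in the patch, a caller that reserves completion-queue slots and then
fails before the actual submission would have to give the slots back.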
diff --git a/drivers/misc/habanalabs/include/goya/goya_packets.h b/drivers/misc/habanalabs/include/goya/goya_packets.h
new file mode 100644
index 000000000000..669a3f37ccb7
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya_packets.h
@@ -0,0 +1,234 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2017-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ * Authors:
+ *
+ * Oded Gabbay <oded.gabbay@gmail.com>
+ * Guy Eilat <geilat@habana.ai>
+ *
+ */
+
+#ifndef GOYA_PACKETS_H
+#define GOYA_PACKETS_H
+
+#include <linux/types.h>
+
+#define PACKET_HEADER_PACKET_ID_SHIFT		56
+#define PACKET_HEADER_PACKET_ID_MASK		0x1F00000000000000ull
+
+enum packet_id {
+	PACKET_WREG_32 = 0x1,
+	PACKET_WREG_BULK = 0x2,
+	PACKET_MSG_LONG = 0x3,
+	PACKET_MSG_SHORT = 0x4,
+	PACKET_CP_DMA = 0x5,
+	PACKET_MSG_PROT = 0x7,
+	PACKET_FENCE = 0x8,
+	PACKET_LIN_DMA = 0x9,
+	PACKET_NOP = 0xA,
+	PACKET_STOP = 0xB,
+	MAX_PACKET_ID = (PACKET_HEADER_PACKET_ID_MASK >>
+				PACKET_HEADER_PACKET_ID_SHIFT) + 1
+};
+
+enum goya_dma_direction {
+	DMA_HOST_TO_DRAM,
+	DMA_HOST_TO_SRAM,
+	DMA_DRAM_TO_SRAM,
+	DMA_SRAM_TO_DRAM,
+	DMA_SRAM_TO_HOST,
+	DMA_DRAM_TO_HOST,
+	DMA_DRAM_TO_DRAM,
+	DMA_SRAM_TO_SRAM,
+	DMA_ENUM_MAX
+};
+
+struct packet_nop {
+	__u32 reserved;
+	union {
+		struct {
+			__u32:24;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1;
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+};
+
+struct packet_stop {
+	__u32 reserved;
+	union {
+		struct {
+			__u32:24;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1; /* must be 0 */
+			__u32 msg_barrier :1; /* must be 0 */
+		};
+		__u32 ctl;
+	};
+};
+
+struct packet_wreg32 {
+	__u32 value;
+	union {
+		struct {
+			__u32 reg_offset :16;
+			__u32:7;
+			__u32 local :1; /* 0: write to TCL regs,
+					 * 1: write to CMDQ regs
+					 */
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1; /* must be 1 */
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+};
+
+struct packet_wreg_bulk {
+	__u32 size64 :16;
+	__u32:16;
+	__u32 reg_offset :16;
+	__u32:8;
+	__u32 opcode :5;
+	__u32 eng_barrier :1;
+	__u32 reg_barrier :1; /* must be 1 */
+	__u32 msg_barrier :1;
+	__u64 values[0]; /* data starts here */
+};
+
+struct packet_msg_long {
+	__u32 value;
+	union {
+		struct {
+			__u32:16;
+			__u32 weakly_ordered :1;
+			__u32 no_snoop :1;
+			__u32:2;
+			__u32 op :2; /* 0: write <value>. 1: write timestamp. */
+			__u32:2;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1;
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+	__u64 addr;
+};
+
+struct packet_msg_short {
+	union {
+		struct {
+			__u32 sync_id :10;
+			__u32:5;
+			__u32 mode : 1;
+			__u32 sync_value :16;
+		} mon_arm_register;
+		struct {
+			__u32 sync_value :16;
+			__u32:15;
+			__u32 mode :1;
+		} so_upd;
+		__u32 value;
+	};
+	union {
+		struct {
+			__u32 msg_addr_offset :16;
+			__u32 weakly_ordered :1;
+			__u32 no_snoop :1;
+			__u32:2;
+			__u32 op :2;
+			__u32 base :2;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1;
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+};
+
+struct packet_msg_prot {
+	__u32 value;
+	union {
+		struct {
+			__u32:16;
+			__u32 weakly_ordered :1;
+			__u32 no_snoop :1;
+			__u32:2;
+			__u32 op :2; /* 0: write <value>. 1: write timestamp. */
+			__u32:2;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1;
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+	__u64 addr;
+};
+
+struct packet_fence {
+	__u32 dec_val :4;
+	__u32:12;
+	__u32 gate_val :8;
+	__u32:6;
+	__u32 id :2;
+	__u32:24;
+	__u32 opcode :5;
+	__u32 eng_barrier :1;
+	__u32 reg_barrier :1;
+	__u32 msg_barrier :1;
+};
+
+struct packet_lin_dma {
+	__u32 tsize;
+	union {
+		struct {
+			__u32 weakly_ordered :1; /* H/W bug, must be 1 */
+			__u32 rdcomp :1;
+			__u32 wrcomp :1;
+			__u32 no_snoop :1;
+			__u32 src_disable :1;
+			__u32 dst_disable :1;
+			__u32 memset_mode :1;
+			__u32 tensor_dma :1; /* N/A, must be 0 */
+			__u32 cntrl :12;
+			__u32 dma_dir :3; /* S/W only, no effect on HW */
+			__u32:1;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1; /* must be 1 */
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+	__u64 src_addr;
+	__u64 dst_addr;
+};
+
+struct packet_cp_dma {
+	__u32 tsize;
+	union {
+		struct {
+			__u32 weakly_ordered :1;
+			__u32 no_snoop :1;
+			__u32:22;
+			__u32 opcode :5;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1; /* must be 1 */
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+	__u64 src_addr;
+};
+
+#endif /* GOYA_PACKETS_H */
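
Note on the packet layouts in goya_packets.h: every packet ends its ctl word
with a 5-bit opcode followed by three barrier bits. The sketch below builds
and decodes that word with explicit shifts instead of bitfields. It assumes
the bitfields are allocated from the least significant bit (the usual GCC
behaviour on little-endian hosts), which puts the opcode at bits 24-28 of
ctl, and it assumes ctl forms the upper 32 bits of the 64-bit header, which
is consistent with PACKET_HEADER_PACKET_ID_SHIFT being 56. The helper names
make_ctl, make_header and header_packet_id are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch */
#define PACKET_HEADER_PACKET_ID_SHIFT	56
#define PACKET_HEADER_PACKET_ID_MASK	0x1F00000000000000ull

/* Assumed bit positions inside the 32-bit ctl word */
#define CTL_OPCODE_SHIFT	24
#define CTL_OPCODE_MASK		(0x1Fu << CTL_OPCODE_SHIFT)
#define CTL_ENG_BARRIER		(1u << 29)
#define CTL_REG_BARRIER		(1u << 30)
#define CTL_MSG_BARRIER		(1u << 31)

/* Build a ctl word from an opcode and the three barrier flags. */
static uint32_t make_ctl(uint32_t opcode, int eng, int reg, int msg)
{
	uint32_t ctl = (opcode << CTL_OPCODE_SHIFT) & CTL_OPCODE_MASK;

	if (eng)
		ctl |= CTL_ENG_BARRIER;
	if (reg)
		ctl |= CTL_REG_BARRIER;
	if (msg)
		ctl |= CTL_MSG_BARRIER;
	return ctl;
}

/* Combine the first 32-bit word and ctl into one 64-bit header. */
static uint64_t make_header(uint32_t low_word, uint32_t ctl)
{
	return ((uint64_t)ctl << 32) | low_word;
}

/* Extract the packet id using the mask and shift from the patch. */
static uint32_t header_packet_id(uint64_t header)
{
	return (uint32_t)((header & PACKET_HEADER_PACKET_ID_MASK) >>
				PACKET_HEADER_PACKET_ID_SHIFT);
}
```

For example, a PACKET_NOP (0xA) with only eng_barrier set would give a ctl
word of 0x2A000000 under these assumptions.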
diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
index 9dbb7077eabd..62df9981f68a 100644
--- a/drivers/misc/habanalabs/include/habanalabs_device_if.h
+++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
@@ -97,6 +97,278 @@ enum pq_init_status {
 	PQ_INIT_STATUS_READY_FOR_HOST
 };
 
+/*
+ * ArmCP Primary Queue Packets
+ *
+ * During normal operation, KMD needs to send various messages to ArmCP,
+ * usually either to SET some value into a H/W periphery or to GET the current
+ * value of some H/W periphery. For example, SET the frequency of MME/TPC and
+ * GET the value of the thermal sensor.
+ *
+ * These messages can be initiated either by the User application or by KMD
+ * itself, e.g. power management code. In either case, the communication from
+ * KMD to ArmCP will *always* be in synchronous mode, meaning that KMD will
+ * send a single message and poll until the message is acknowledged and the
+ * results are ready (if results are needed).
+ *
+ * This means that only a single message can be sent at a time and KMD must
+ * wait for its result before sending the next message. Having said that,
+ * because these are control messages which are sent at a relatively low
+ * frequency, this limitation seems acceptable. It's important to note that
+ * in case of multiple devices, messages to different devices *can* be sent
+ * at the same time.
+ *
+ * The message, inputs/outputs (if relevant) and fence object will be located
+ * on the device DDR at an address that will be determined by KMD. During
+ * device initialization phase, KMD will pass to ArmCP that address.  Most of
+ * the message types will contain inputs/outputs inside the message itself.
+ * The common part of each message will contain the opcode of the message (its
+ * type) and a field representing a fence object.
+ *
+ * When KMD wishes to send a message to ArmCP, it will write the message
+ * contents to the device DDR, clear the fence object and then write the
+ * value 484 to the mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR register to issue
+ * the 484 interrupt-id to the ARM core.
+ *
+ * Upon receiving the 484 interrupt-id, ArmCP will read the message from the
+ * DDR. In case the message is a SET operation, ArmCP will first perform the
+ * operation and then write to the fence object on the device DDR. In case the
+ * message is a GET operation, ArmCP will first fill the results section on the
+ * device DDR and then write to the fence object. If an error occurred, ArmCP
+ * will fill the rc field with the right error code.
+ *
+ * In the meantime, KMD will poll on the fence object. Once KMD sees that the
+ * fence object is signaled, it will read the results from the device DDR
+ * (if relevant) and resume the code execution in KMD.
+ *
+ * To use QMAN packets, the opcode must be the QMAN opcode, shifted by 8
+ * so the value being put by the KMD matches the value read by ArmCP
+ *
+ * Non-QMAN packets should be limited to values 1 through (2^8 - 1)
+ *
+ * Detailed description:
+ *
+ * ARMCP_PACKET_DISABLE_PCI_ACCESS -
+ *       After receiving this packet the embedded CPU must NOT issue PCI
+ *       transactions (read/write) towards the Host CPU. This also includes
+ *       sending MSI-X interrupts.
+ *       This packet is usually sent before the device is moved to D3Hot state.
+ *
+ * ARMCP_PACKET_ENABLE_PCI_ACCESS -
+ *       After receiving this packet the embedded CPU is allowed to issue PCI
+ *       transactions towards the Host CPU, including sending MSI-X interrupts.
+ *       This packet is usually sent after the device is moved to D0 state.
+ *
+ * ARMCP_PACKET_TEMPERATURE_GET -
+ *       Fetch the current temperature / Max / Max Hyst / Critical /
+ *       Critical Hyst of a specified thermal sensor. The packet's
+ *       arguments specify the desired sensor and the field to get.
+ *
+ * ARMCP_PACKET_VOLTAGE_GET -
+ *       Fetch the voltage / Max / Min of a specified sensor. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_CURRENT_GET -
+ *       Fetch the current / Max / Min of a specified sensor. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_FAN_SPEED_GET -
+ *       Fetch the speed / Max / Min of a specified fan. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_PWM_GET -
+ *       Fetch the pwm value / mode of a specified pwm. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_PWM_SET -
+ *       Set the pwm value / mode of a specified pwm. The packet's
+ *       arguments specify the sensor, type and value.
+ *
+ * ARMCP_PACKET_FREQUENCY_SET -
+ *       Set the frequency of a specified PLL. The packet's arguments specify
+ *       the PLL and the desired frequency. The actual frequency in the device
+ *       might differ from the requested frequency.
+ *
+ * ARMCP_PACKET_FREQUENCY_GET -
+ *       Fetch the frequency of a specified PLL. The packet's arguments specify
+ *       the PLL.
+ *
+ * ARMCP_PACKET_LED_SET -
+ *       Set the state of a specified led. The packet's arguments
+ *       specify the led and the desired state.
+ *
+ * ARMCP_PACKET_I2C_WR -
+ *       Write 32-bit value to I2C device. The packet's arguments specify the
+ *       I2C bus, address and value.
+ *
+ * ARMCP_PACKET_I2C_RD -
+ *       Read 32-bit value from I2C device. The packet's arguments specify the
+ *       I2C bus and address.
+ *
+ * ARMCP_PACKET_INFO_GET -
+ *       Fetch information from the device as specified in the packet's
+ *       structure. KMD passes the max size it allows the ArmCP to write to
+ *       the structure, to prevent data corruption in case of mismatched
+ *       KMD/FW versions.
+ *
+ * ARMCP_PACKET_FLASH_PROGRAM_REMOVED - this packet was removed
+ *
+ * ARMCP_PACKET_UNMASK_RAZWI_IRQ -
+ *       Unmask the given IRQ. The IRQ number is specified in the value field.
+ *       The packet is sent after receiving an interrupt and printing its
+ *       relevant information.
+ *
+ * ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY -
+ *       Unmask the given IRQs. The IRQ numbers are specified in an array right
+ *       after the armcp_packet structure, where its first element is the array
+ *       length. The packet is sent after a soft reset was done in order to
+ *       handle any interrupts that were sent during the reset process.
+ *
+ * ARMCP_PACKET_TEST -
+ *       Test packet for ArmCP connectivity. The CPU will put the fence value
+ *       in the result field.
+ *
+ * ARMCP_PACKET_FREQUENCY_CURR_GET -
+ *       Fetch the current frequency of a specified PLL. The packet's arguments
+ *       specify the PLL.
+ *
+ * ARMCP_PACKET_MAX_POWER_GET -
+ *       Fetch the maximal power of the device.
+ *
+ * ARMCP_PACKET_MAX_POWER_SET -
+ *       Set the maximal power of the device. The packet's arguments specify
+ *       the power.
+ *
+ * ARMCP_PACKET_EEPROM_DATA_GET -
+ *       Get EEPROM data from the ArmCP kernel. The buffer is specified in the
+ *       addr field. The CPU will put the returned data size in the result
+ *       field. In addition, KMD passes the max size it allows the ArmCP to
+ *       write to the structure, to prevent data corruption in case of
+ *       mismatched KMD/FW versions.
+ *
+ */
+
+enum armcp_packet_id {
+	ARMCP_PACKET_DISABLE_PCI_ACCESS = 1,	/* internal */
+	ARMCP_PACKET_ENABLE_PCI_ACCESS,		/* internal */
+	ARMCP_PACKET_TEMPERATURE_GET,		/* sysfs */
+	ARMCP_PACKET_VOLTAGE_GET,		/* sysfs */
+	ARMCP_PACKET_CURRENT_GET,		/* sysfs */
+	ARMCP_PACKET_FAN_SPEED_GET,		/* sysfs */
+	ARMCP_PACKET_PWM_GET,			/* sysfs */
+	ARMCP_PACKET_PWM_SET,			/* sysfs */
+	ARMCP_PACKET_FREQUENCY_SET,		/* sysfs */
+	ARMCP_PACKET_FREQUENCY_GET,		/* sysfs */
+	ARMCP_PACKET_LED_SET,			/* debugfs */
+	ARMCP_PACKET_I2C_WR,			/* debugfs */
+	ARMCP_PACKET_I2C_RD,			/* debugfs */
+	ARMCP_PACKET_INFO_GET,			/* IOCTL */
+	ARMCP_PACKET_FLASH_PROGRAM_REMOVED,
+	ARMCP_PACKET_UNMASK_RAZWI_IRQ,		/* internal */
+	ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY,	/* internal */
+	ARMCP_PACKET_TEST,			/* internal */
+	ARMCP_PACKET_FREQUENCY_CURR_GET,	/* sysfs */
+	ARMCP_PACKET_MAX_POWER_GET,		/* sysfs */
+	ARMCP_PACKET_MAX_POWER_SET,		/* sysfs */
+	ARMCP_PACKET_EEPROM_DATA_GET,		/* sysfs */
+};
+
+#define ARMCP_PACKET_FENCE_VAL	0xFE8CE7A5
+
+struct armcp_packet {
+	union {
+		__u64 value;	/* For SET packets */
+		__u64 result;	/* For GET packets */
+		__u64 addr;	/* For PQ */
+	};
+
+	union {
+		struct {
+			__u32:12;
+			__u32 rc :4;
+			__u32 opcode :13;
+			__u32 eng_barrier :1;
+			__u32 reg_barrier :1;
+			__u32 msg_barrier :1;
+		};
+		__u32 ctl;
+	};
+
+	__u32 fence;		/* Signal to KMD that message is completed */
+
+	union {
+		struct {/* For temperature/current/voltage/fan/pwm get/set */
+			__u16 sensor_index;
+			__u16 type;
+		};
+
+		struct {	/* For I2C read/write */
+			__u8 i2c_bus;
+			__u8 i2c_addr;
+			__u8 i2c_reg;
+			__u8 pad; /* unused */
+		};
+
+		/* For frequency get/set */
+		__u32 pll_index;
+
+		/* For led set */
+		__u32 led_index;
+
+		/* For get Armcp info/EEPROM data */
+		__u32 data_max_size;
+	};
+};
+
+struct armcp_unmask_irq_arr_packet {
+	struct armcp_packet armcp_pkt;
+	__u32 length;
+	__u32 irqs[0];
+};
+
+enum armcp_packet_rc {
+	armcp_packet_success,
+	armcp_packet_invalid,
+	armcp_packet_fault
+};
+
+enum armcp_temp_type {
+	armcp_temp_input,
+	armcp_temp_max = 6,
+	armcp_temp_max_hyst,
+	armcp_temp_crit,
+	armcp_temp_crit_hyst
+};
+
+enum armcp_in_attributes {
+	armcp_in_input,
+	armcp_in_min,
+	armcp_in_max
+};
+
+enum armcp_curr_attributes {
+	armcp_curr_input,
+	armcp_curr_min,
+	armcp_curr_max
+};
+
+enum armcp_fan_attributes {
+	armcp_fan_input,
+	armcp_fan_min = 2,
+	armcp_fan_max
+};
+
+enum armcp_pwm_attributes {
+	armcp_pwm_input,
+	armcp_pwm_enable
+};
+
+/* Event Queue Packets */
+
+struct eq_generic_event {
+	__u64 data[7];
+};
+
 /*
  * ArmCP info
  */
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
new file mode 100644
index 000000000000..97b0de7ea5c2
--- /dev/null
+++ b/drivers/misc/habanalabs/irq.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/dma-mapping.h>
+
+
+/**
+ * hl_cq_inc_ptr - increment ci or pi of cq
+ *
+ * @ptr: the current ci or pi value of the completion queue
+ *
+ * Increment ptr by 1. If it reaches the number of completion queue
+ * entries, set it to 0
+ */
+inline u32 hl_cq_inc_ptr(u32 ptr)
+{
+	ptr++;
+	if (unlikely(ptr == HL_CQ_LENGTH))
+		ptr = 0;
+	return ptr;
+}
+
+/**
+ * hl_irq_handler_cq - irq handler for completion queue
+ *
+ * @irq: irq number
+ * @arg: pointer to completion queue structure
+ *
+ */
+irqreturn_t hl_irq_handler_cq(int irq, void *arg)
+{
+	struct hl_cq *cq = arg;
+	struct hl_device *hdev = cq->hdev;
+	struct hl_hw_queue *queue;
+	struct hl_cs_job *job;
+	bool shadow_index_valid;
+	u16 shadow_index;
+	u32 *cq_entry;
+	u32 *cq_base;
+
+	if (hdev->disabled) {
+		dev_dbg(hdev->dev,
+			"Device disabled but received IRQ %d for CQ %d\n",
+			irq, cq->hw_queue_id);
+		return IRQ_HANDLED;
+	}
+
+	cq_base = (u32 *) cq->kernel_address;
+
+	while (1) {
+		bool entry_ready = ((cq_base[cq->ci] & CQ_ENTRY_READY_MASK)
+						>> CQ_ENTRY_READY_SHIFT);
+
+		if (!entry_ready)
+			break;
+
+		cq_entry = (u32 *) &cq_base[cq->ci];
+
+		/*
+		 * Make sure we read CQ entry contents after we've
+		 * checked the ownership bit.
+		 */
+		dma_rmb();
+
+		shadow_index_valid =
+			((*cq_entry & CQ_ENTRY_SHADOW_INDEX_VALID_MASK)
+					>> CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT);
+
+		shadow_index = (u16)
+			((*cq_entry & CQ_ENTRY_SHADOW_INDEX_MASK)
+					>> CQ_ENTRY_SHADOW_INDEX_SHIFT);
+
+		queue = &hdev->kernel_queues[cq->hw_queue_id];
+
+		if ((shadow_index_valid) && (!hdev->disabled)) {
+			job = queue->shadow_queue[hl_pi_2_offset(shadow_index)];
+			queue_work(hdev->cq_wq, &job->finish_work);
+		}
+
+		/*
+		 * Update ci of the context's queue. There is no
+		 * need to protect it with spinlock because this update is
+		 * done only inside IRQ and there is a different IRQ per
+		 * queue
+		 */
+		queue->ci = hl_queue_inc_ptr(queue->ci);
+
+		/* Clear CQ entry ready bit */
+		cq_base[cq->ci] &= ~CQ_ENTRY_READY_MASK;
+
+		cq->ci = hl_cq_inc_ptr(cq->ci);
+
+		/* Increment free slots */
+		atomic_inc(&cq->free_slots_cnt);
+	}
+
+	return IRQ_HANDLED;
+}
+
+/**
+ * hl_cq_init - main initialization function for a cq object
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to cq structure
+ * @hw_queue_id: The H/W queue ID this completion queue belongs to
+ *
+ * Allocate dma-able memory for the completion queue and initialize fields
+ * Returns 0 on success
+ */
+int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id)
+{
+	void *p;
+
+	BUILD_BUG_ON(HL_CQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
+
+	p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
+				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	q->hdev = hdev;
+	q->kernel_address = (u64) p;
+	q->hw_queue_id = hw_queue_id;
+	q->ci = 0;
+	q->pi = 0;
+
+	atomic_set(&q->free_slots_cnt, HL_CQ_LENGTH);
+
+	return 0;
+}
+
+/**
+ * hl_cq_fini - destroy completion queue
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to cq structure
+ *
+ * Free the completion queue memory
+ */
+void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
+{
+	hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+}
-- 
2.17.1



* [PATCH 08/15] habanalabs: add event queue and interrupts
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (5 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 07/15] habanalabs: add h/w queues module Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-25  7:51   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 09/15] habanalabs: add sysfs and hwmon support Oded Gabbay
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds support for receiving events from Goya's control CPU and
for receiving MSI-X interrupts from Goya's DMA engines and CPU.

Goya's PCI controller supports up to 8 MSI-X interrupts, of which only 6
are currently used. The first 5 interrupts are dedicated to Goya's DMA
engine queues. The 6th interrupt is dedicated to Goya's control CPU.
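
The vector layout described above can be sketched as a small stand-alone
C helper. This is only an illustration: every SKETCH_* name is
hypothetical, and the EQ vector index of 5 is inferred from "the 6th
interrupt" rather than taken from the driver headers.

```c
#include <assert.h>

/* Hypothetical constants; the counts follow the text of this message. */
enum {
	SKETCH_MSIX_SUPPORTED = 8,	/* vectors the PCI controller offers */
	SKETCH_NUM_CQ         = 5,	/* one vector per DMA engine queue */
	SKETCH_EQ_VECTOR      = 5,	/* the 6th vector, zero-based index 5 */
};

enum sketch_vec_owner {
	SKETCH_VEC_CQ,		/* completion queue of a DMA engine */
	SKETCH_VEC_EQ,		/* event queue of the control CPU */
	SKETCH_VEC_UNUSED,	/* supported by H/W but not requested */
};

/* Classify an MSI-X vector index according to the layout above. */
static enum sketch_vec_owner sketch_vector_owner(unsigned int vec)
{
	assert(vec < SKETCH_MSIX_SUPPORTED);

	if (vec < SKETCH_NUM_CQ)
		return SKETCH_VEC_CQ;
	if (vec == SKETCH_EQ_VECTOR)
		return SKETCH_VEC_EQ;
	return SKETCH_VEC_UNUSED;
}
```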

The DMA queue will signal its MSI-X entry upon each completion of a command
buffer that was placed on its primary queue. The driver will then mark that
CB as completed and free the related resources. It will also update the
command submission object to which that CB belongs.
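
As a rough user-space model of that completion flow, the sketch below
drains a ring of 32-bit completion entries. It is not the driver code:
the ring size, the bit layout and all sketch_* names are assumptions
made for illustration, and the real handler additionally needs a DMA
read barrier between checking the ready bit and reading the entry.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CQ_LENGTH	8u		/* hypothetical ring size */
#define SKETCH_READY_MASK	0x80000000u	/* hypothetical bit layout */
#define SKETCH_INDEX_VALID_MASK	0x40000000u
#define SKETCH_INDEX_MASK	0x0000FFFFu

struct sketch_cq {
	uint32_t ring[SKETCH_CQ_LENGTH];
	uint32_t ci;		/* consumer index, owned by the handler */
	uint32_t free_slots;	/* slots handed back to the producer */
};

/* Advance an index by one and wrap it at the end of the ring. */
static uint32_t sketch_cq_inc_ptr(uint32_t ptr)
{
	ptr++;
	if (ptr == SKETCH_CQ_LENGTH)
		ptr = 0;
	return ptr;
}

/*
 * Consume every ready entry starting at ci. Job indices carried by the
 * entries are stored in done[]; the number stored is returned. Draining
 * stops early if done[] is full, leaving that entry for the next call.
 */
static unsigned int sketch_cq_drain(struct sketch_cq *cq, uint16_t *done,
				    unsigned int max_done)
{
	unsigned int n = 0;

	assert(cq->ci < SKETCH_CQ_LENGTH);

	while (cq->ring[cq->ci] & SKETCH_READY_MASK) {
		uint32_t entry = cq->ring[cq->ci];

		if (entry & SKETCH_INDEX_VALID_MASK) {
			if (n == max_done)
				break;
			done[n++] = (uint16_t)(entry & SKETCH_INDEX_MASK);
		}

		/* Clear the ready bit so the slot can be reused. */
		cq->ring[cq->ci] &= ~SKETCH_READY_MASK;
		cq->ci = sketch_cq_inc_ptr(cq->ci);
		cq->free_slots++;
	}

	return n;
}
```

Each iteration clears the ready bit it just tested, so the loop is
bounded by the ring length even if the producer filled every slot.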

There is a dedicated event queue (EQ) between the driver and Goya's control
CPU. The EQ is located in host memory. The control CPU writes a new entry
to the EQ for various reasons, such as an ECC error, an MMU page fault or
a high temperature. After writing the new entry to the EQ, the control CPU will
trigger its dedicated MSI-X entry to signal the driver that there is a new
entry in the EQ. The driver will then read the entry and act accordingly.
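
The consumer side of such a queue can be modelled in plain C as shown
below. This is a sketch under stated assumptions, not the handler added
by this patch: the control-word layout, the event-type width and all
sketch_* names are hypothetical, and the reported_ci field merely stands
in for the register write that tells the control CPU how far the driver
has read.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EQ_LENGTH	64u		/* power of two */
#define SKETCH_EQ_READY_MASK	0x80000000u	/* hypothetical layout */
#define SKETCH_EQ_TYPE_MASK	0x03FF0000u
#define SKETCH_EQ_TYPE_SHIFT	16
#define SKETCH_NUM_EVENTS	1024u		/* covers every 10-bit type */

struct sketch_eq_entry {
	uint32_t ctl;		/* ready bit plus event type */
};

struct sketch_eq {
	struct sketch_eq_entry entries[SKETCH_EQ_LENGTH];
	uint32_t ci;				/* consumer index */
	uint32_t reported_ci;			/* stand-in for the CI register */
	uint32_t events_stat[SKETCH_NUM_EVENTS];/* per-type histogram */
};

/* Handle every ready entry and return how many were consumed. */
static unsigned int sketch_eq_drain(struct sketch_eq *eq)
{
	unsigned int handled = 0;

	assert(eq->ci < SKETCH_EQ_LENGTH);

	while (eq->entries[eq->ci].ctl & SKETCH_EQ_READY_MASK) {
		uint32_t ctl = eq->entries[eq->ci].ctl;
		uint32_t type = (ctl & SKETCH_EQ_TYPE_MASK) >>
					SKETCH_EQ_TYPE_SHIFT;

		/* "Act accordingly": here we only count the event. */
		eq->events_stat[type]++;

		/* Release the slot and wrap using the power-of-two length. */
		eq->entries[eq->ci].ctl &= ~SKETCH_EQ_READY_MASK;
		eq->ci = (eq->ci + 1) & (SKETCH_EQ_LENGTH - 1);

		/* Assumed policy: report the new CI after each entry. */
		eq->reported_ci = eq->ci;
		handled++;
	}

	return handled;
}
```

Because the length is a power of two, the wrap is a single mask
operation instead of a compare-and-reset.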

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/device.c            |  35 +-
 drivers/misc/habanalabs/goya/goya.c         | 522 +++++++++++++++++++-
 drivers/misc/habanalabs/goya/goyaP.h        |   1 +
 drivers/misc/habanalabs/habanalabs.h        |  37 ++
 drivers/misc/habanalabs/include/goya/goya.h |   1 -
 drivers/misc/habanalabs/irq.c               | 144 ++++++
 6 files changed, 729 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 98220628a467..9199e070e79e 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -173,9 +173,17 @@ static int device_early_init(struct hl_device *hdev)
 	hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
 	if (hdev->cq_wq == NULL) {
 		dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
+		rc = -ENOMEM;
 		goto asid_fini;
 	}
 
+	hdev->eq_wq = alloc_workqueue("hl-events", WQ_UNBOUND, 0);
+	if (hdev->eq_wq == NULL) {
+		dev_err(hdev->dev, "Failed to allocate EQ workqueue\n");
+		rc = -ENOMEM;
+		goto free_cq_wq;
+	}
+
 	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
 
 	mutex_init(&hdev->device_open);
@@ -184,6 +192,8 @@ static int device_early_init(struct hl_device *hdev)
 
 	return 0;
 
+free_cq_wq:
+	destroy_workqueue(hdev->cq_wq);
 asid_fini:
 	hl_asid_fini(hdev);
 early_fini:
@@ -205,6 +215,7 @@ static void device_early_fini(struct hl_device *hdev)
 
 	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
 
+	destroy_workqueue(hdev->eq_wq);
 	destroy_workqueue(hdev->cq_wq);
 
 	hl_asid_fini(hdev);
@@ -343,11 +354,22 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		}
 	}
 
+	/*
+	 * Initialize the event queue. Must be done before hw_init,
+	 * because hw_init passes the address of the event queue
+	 * as an argument to request_irq
+	 */
+	rc = hl_eq_init(hdev, &hdev->event_queue);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize event queue\n");
+		goto cq_fini;
+	}
+
 	/* Allocate the kernel context */
 	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
 	if (!hdev->kernel_ctx) {
 		rc = -ENOMEM;
-		goto cq_fini;
+		goto eq_fini;
 	}
 
 	hdev->user_ctx = NULL;
@@ -392,6 +414,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 			"kernel ctx is still alive on initialization failure\n");
 free_ctx:
 	kfree(hdev->kernel_ctx);
+eq_fini:
+	hl_eq_fini(hdev, &hdev->event_queue);
 cq_fini:
 	for (i = 0 ; i < cq_ready_cnt ; i++)
 		hl_cq_fini(hdev, &hdev->completion_queue[i]);
@@ -433,6 +457,13 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	/*
+	 * Halt the engines and disable interrupts so we won't get any more
+	 * completions from H/W and we won't have any accesses from the
+	 * H/W to the host machine
+	 */
+	hdev->asic_funcs->halt_engines(hdev, true);
+
 	hl_cb_pool_fini(hdev);
 
 	/* Release kernel context */
@@ -442,6 +473,8 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, true);
 
+	hl_eq_fini(hdev, &hdev->event_queue);
+
 	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
 		hl_cq_fini(hdev, &hdev->completion_queue[i]);
 	kfree(hdev->completion_queue);
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 08d5227eaf1d..6c04277ae0fa 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -92,9 +92,41 @@
 
 #define GOYA_MAX_INITIATORS		20
 
+#define GOYA_MAX_STRING_LEN		20
+
 #define GOYA_CB_POOL_CB_CNT		512
 #define GOYA_CB_POOL_CB_SIZE		0x20000		/* 128KB */
 
+static const char goya_irq_name[GOYA_MSIX_ENTRIES][GOYA_MAX_STRING_LEN] = {
+		"goya cq 0", "goya cq 1", "goya cq 2", "goya cq 3",
+		"goya cq 4", "goya cpu eq"
+};
+
+static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
+	"MME0",
+	"MME1",
+	"MME2",
+	"MME3",
+	"MME4",
+	"MME5",
+	"TPC0",
+	"TPC1",
+	"TPC2",
+	"TPC3",
+	"TPC4",
+	"TPC5",
+	"TPC6",
+	"TPC7",
+	"PCI",
+	"DMA", /* HBW */
+	"DMA", /* LBW */
+	"PSOC",
+	"CPU",
+	"MMU"
+};
+
+#define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
+
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
@@ -139,6 +171,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
 	prop->cfg_size = CFG_SIZE;
 	prop->max_asid = MAX_ASID;
+	prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
 	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
 	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
@@ -668,15 +701,10 @@ static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
 	WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
 	WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
 
-	if (dma_id == 0)
-		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
+	if (goya->hw_cap_initialized & HW_CAP_MMU)
+		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_PARTLY_TRUSTED);
 	else
-		if (goya->hw_cap_initialized & HW_CAP_MMU)
-			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
-					QMAN_DMA_PARTLY_TRUSTED);
-		else
-			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
-					QMAN_DMA_FULLY_TRUSTED);
+		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
 
 	WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
 	WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
@@ -870,6 +898,7 @@ static void goya_resume_external_queues(struct hl_device *hdev)
 int goya_init_cpu_queues(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
+	struct hl_eq *eq;
 	dma_addr_t bus_address;
 	u32 status;
 	struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
@@ -881,17 +910,24 @@ int goya_init_cpu_queues(struct hl_device *hdev)
 	if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
 		return 0;
 
+	eq = &hdev->event_queue;
+
 	bus_address = cpu_pq->bus_address +
 			hdev->asic_prop.host_phys_base_address;
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
 
+	bus_address = eq->bus_address + hdev->asic_prop.host_phys_base_address;
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_2, lower_32_bits(bus_address));
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_3, upper_32_bits(bus_address));
+
 	bus_address = hdev->cpu_accessible_dma_address +
 			hdev->asic_prop.host_phys_base_address;
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
 
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_4, HL_EQ_SIZE_IN_BYTES);
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
 
 	/* Used for EQ CI */
@@ -2781,6 +2817,163 @@ static void goya_resume_internal_queues(struct hl_device *hdev)
 	WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
 }
 
+static void goya_dma_stall(struct hl_device *hdev)
+{
+	WREG32(mmDMA_QM_0_GLBL_CFG1, 1 << DMA_QM_0_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_1_GLBL_CFG1, 1 << DMA_QM_1_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_2_GLBL_CFG1, 1 << DMA_QM_2_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_3_GLBL_CFG1, 1 << DMA_QM_3_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_4_GLBL_CFG1, 1 << DMA_QM_4_GLBL_CFG1_DMA_STOP_SHIFT);
+}
+
+static void goya_tpc_stall(struct hl_device *hdev)
+{
+	WREG32(mmTPC0_CFG_TPC_STALL, 1 << TPC0_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC1_CFG_TPC_STALL, 1 << TPC1_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC2_CFG_TPC_STALL, 1 << TPC2_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC3_CFG_TPC_STALL, 1 << TPC3_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC4_CFG_TPC_STALL, 1 << TPC4_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC5_CFG_TPC_STALL, 1 << TPC5_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC6_CFG_TPC_STALL, 1 << TPC6_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC7_CFG_TPC_STALL, 1 << TPC7_CFG_TPC_STALL_V_SHIFT);
+}
+
+static void goya_mme_stall(struct hl_device *hdev)
+{
+	WREG32(mmMME_STALL, 0xFFFFFFFF);
+}
+
+static int goya_enable_msix(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int cq_cnt = hdev->asic_prop.completion_queues_count;
+	int rc, i, irq_cnt_init, irq;
+
+	if (goya->hw_cap_initialized & HW_CAP_MSIX)
+		return 0;
+
+	rc = pci_alloc_irq_vectors(hdev->pdev, GOYA_MSIX_ENTRIES,
+				GOYA_MSIX_ENTRIES, PCI_IRQ_MSIX);
+	if (rc < 0) {
+		dev_err(hdev->dev,
+			"MSI-X: Failed to enable support -- %d/%d\n",
+			GOYA_MSIX_ENTRIES, rc);
+		return rc;
+	}
+
+	for (i = 0, irq_cnt_init = 0 ; i < cq_cnt ; i++, irq_cnt_init++) {
+		irq = pci_irq_vector(hdev->pdev, i);
+		rc = request_irq(irq, hl_irq_handler_cq, 0, goya_irq_name[i],
+				&hdev->completion_queue[i]);
+		if (rc) {
+			dev_err(hdev->dev, "Failed to request IRQ %d\n", irq);
+			goto free_irqs;
+		}
+	}
+
+	irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
+
+	rc = request_irq(irq, hl_irq_handler_eq, 0,
+			goya_irq_name[EVENT_QUEUE_MSIX_IDX],
+			&hdev->event_queue);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request IRQ %d\n", irq);
+		goto free_irqs;
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_MSIX;
+	return 0;
+
+free_irqs:
+	for (i = 0 ; i < irq_cnt_init ; i++)
+		free_irq(pci_irq_vector(hdev->pdev, i),
+			&hdev->completion_queue[i]);
+
+	pci_free_irq_vectors(hdev->pdev);
+	return rc;
+}
+
+static void goya_sync_irqs(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
+		return;
+
+	/* Wait for all pending IRQs to be finished */
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		synchronize_irq(pci_irq_vector(hdev->pdev, i));
+
+	synchronize_irq(pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX));
+}
+
+static void goya_disable_msix(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i, irq;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
+		return;
+
+	goya_sync_irqs(hdev);
+
+	irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
+	free_irq(irq, &hdev->event_queue);
+
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++) {
+		irq = pci_irq_vector(hdev->pdev, i);
+		free_irq(irq, &hdev->completion_queue[i]);
+	}
+
+	pci_free_irq_vectors(hdev->pdev);
+
+	goya->hw_cap_initialized &= ~HW_CAP_MSIX;
+}
+
+static void goya_halt_engines(struct hl_device *hdev, bool hard_reset)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 wait_timeout_ms, cpu_timeout_ms;
+
+	dev_info(hdev->dev,
+		"Halting compute engines and disabling interrupts\n");
+
+	if (hdev->pldm) {
+		wait_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
+		cpu_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
+	} else {
+		wait_timeout_ms = GOYA_RESET_WAIT_MSEC;
+		cpu_timeout_ms = GOYA_CPU_RESET_WAIT_MSEC;
+	}
+
+	if ((hard_reset) && (goya->hw_cap_initialized & HW_CAP_CPU)) {
+		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_GOTO_WFE);
+		if (hdev->fw_loading)
+			WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+				GOYA_ASYNC_EVENT_ID_HALT_MACHINE);
+		msleep(cpu_timeout_ms);
+	}
+
+	goya_stop_external_queues(hdev);
+	goya_stop_internal_queues(hdev);
+
+	msleep(wait_timeout_ms);
+
+	goya_dma_stall(hdev);
+	goya_tpc_stall(hdev);
+	goya_mme_stall(hdev);
+
+	msleep(wait_timeout_ms);
+
+	goya_disable_external_queues(hdev);
+	goya_disable_internal_queues(hdev);
+
+	if (hard_reset)
+		goya_disable_msix(hdev);
+	else
+		goya_sync_irqs(hdev);
+}
 
 /**
  * goya_push_uboot_to_device - Push u-boot FW code to device
@@ -3166,11 +3359,16 @@ static int goya_hw_init(struct hl_device *hdev)
 
 	goya_init_tpc_qmans(hdev);
 
+	/* MSI-X must be enabled before CPU queues are initialized */
+	rc = goya_enable_msix(hdev);
+	if (rc)
+		goto disable_queues;
+
 	rc = goya_init_cpu_queues(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
 			rc);
-		goto disable_queues;
+		goto disable_msix;
 	}
 
 	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
@@ -3204,6 +3402,8 @@ static int goya_hw_init(struct hl_device *hdev)
 
 disable_pci_access:
 	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+disable_msix:
+	goya_disable_msix(hdev);
 disable_queues:
 	goya_disable_internal_queues(hdev);
 	goya_disable_external_queues(hdev);
@@ -3287,6 +3487,7 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
 					HW_CAP_DMA | HW_CAP_MME |
 					HW_CAP_MMU | HW_CAP_TPC_MBIST |
 					HW_CAP_GOLDEN | HW_CAP_TPC);
+	memset(goya->events_stat, 0, sizeof(goya->events_stat));
 
 	if (!hdev->pldm) {
 		int rc;
@@ -3772,6 +3973,305 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
 	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
 }
 
+static void goya_update_eq_ci(struct hl_device *hdev, u32 val)
+{
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, val);
+}
+
+static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
+		u16 event_type, char *axi_name, int len)
+{
+	if (!strcmp(goya_axi_name[agent_id], "DMA"))
+		if (event_type >= GOYA_ASYNC_EVENT_ID_DMA0_CH)
+			snprintf(axi_name, len, "DMA %d",
+				event_type - GOYA_ASYNC_EVENT_ID_DMA0_CH);
+		else
+			snprintf(axi_name, len, "DMA %d",
+				event_type - GOYA_ASYNC_EVENT_ID_DMA0_QM);
+	else
+		snprintf(axi_name, len, "%s", goya_axi_name[agent_id]);
+}
+
+static void goya_print_razwi_info(struct hl_device *hdev, u64 reg,
+		bool is_hbw, bool is_read, u16 event_type)
+{
+	u32 val, id, internal_id, agent_id, y, x;
+	char axi_name[10] = {0};
+
+	val = RREG32(reg);
+
+	if (is_hbw) {
+		id = (val & GOYA_IRQ_HBW_ID_MASK) >> GOYA_IRQ_HBW_ID_SHIFT;
+		internal_id = (val & GOYA_IRQ_HBW_INTERNAL_ID_MASK) >>
+				GOYA_IRQ_HBW_INTERNAL_ID_SHIFT;
+		agent_id = (val & GOYA_IRQ_HBW_AGENT_ID_MASK) >>
+				GOYA_IRQ_HBW_AGENT_ID_SHIFT;
+		y = (val & GOYA_IRQ_HBW_Y_MASK) >> GOYA_IRQ_HBW_Y_SHIFT;
+		x = (val & GOYA_IRQ_HBW_X_MASK) >> GOYA_IRQ_HBW_X_SHIFT;
+	} else {
+		id = (val & GOYA_IRQ_LBW_ID_MASK) >> GOYA_IRQ_LBW_ID_SHIFT;
+		internal_id = (val & GOYA_IRQ_LBW_INTERNAL_ID_MASK) >>
+				GOYA_IRQ_LBW_INTERNAL_ID_SHIFT;
+		agent_id = (val & GOYA_IRQ_LBW_AGENT_ID_MASK) >>
+				GOYA_IRQ_LBW_AGENT_ID_SHIFT;
+		y = (val & GOYA_IRQ_LBW_Y_MASK) >> GOYA_IRQ_LBW_Y_SHIFT;
+		x = (val & GOYA_IRQ_LBW_X_MASK) >> GOYA_IRQ_LBW_X_SHIFT;
+	}
+
+	if (agent_id >= GOYA_MAX_INITIATORS) {
+		dev_err(hdev->dev,
+			"Illegal %s %s with wrong initiator id %d, H/W IRQ %d\n",
+				is_read ? "read from" : "write to",
+				is_hbw ? "HBW" : "LBW",
+				agent_id,
+				event_type);
+	} else {
+		goya_get_axi_name(hdev, agent_id, event_type, axi_name,
+				sizeof(axi_name));
+		dev_err(hdev->dev, "Illegal %s by %s %s %s, H/W IRQ %d\n",
+				is_read ? "read" : "write",
+				axi_name,
+				is_read ? "from" : "to",
+				is_hbw ? "HBW" : "LBW",
+				event_type);
+	}
+}
+
+static void goya_print_irq_info(struct hl_device *hdev, u16 event_type)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	bool is_hbw = false, is_read = false, is_info = false;
+
+	if (RREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD)) {
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_WT_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD, 0);
+		is_info = true;
+	}
+	if (RREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD)) {
+		is_read = true;
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_RD_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD, 0);
+		is_info = true;
+	}
+	if (RREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD)) {
+		is_hbw = true;
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_WT_ID, is_hbw,
+				false, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD, 0);
+		is_info = true;
+	}
+	if (RREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD)) {
+		is_hbw = true;
+		is_read = true;
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_RD_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD, 0);
+		is_info = true;
+	}
+	if (!is_info) {
+		dev_err(hdev->dev,
+			"Received H/W interrupt %d, no additional info\n",
+			event_type);
+		return;
+	}
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU) {
+		u32 val = RREG32(mmMMU_PAGE_ERROR_CAPTURE);
+		u64 addr;
+
+		if (val & MMU_PAGE_ERROR_CAPTURE_ENTRY_VALID_MASK) {
+			addr = val & MMU_PAGE_ERROR_CAPTURE_VA_49_32_MASK;
+			addr <<= 32;
+			addr |= RREG32(mmMMU_PAGE_ERROR_CAPTURE_VA);
+
+			dev_err(hdev->dev, "MMU page fault on va 0x%llx\n",
+					addr);
+
+			WREG32(mmMMU_PAGE_ERROR_CAPTURE, 0);
+		}
+	}
+}
+
+static int goya_unmask_irq(struct hl_device *hdev, u16 event_type)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_UNMASK_RAZWI_IRQ;
+	pkt.value = event_type;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to unmask RAZWI IRQ %d\n", event_type);
+
+	return rc;
+}
+
+void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry)
+{
+	u16 event_type = ((eq_entry->hdr.ctl & EQ_CTL_EVENT_TYPE_MASK)
+			>> EQ_CTL_EVENT_TYPE_SHIFT);
+	struct goya_device *goya = hdev->asic_specific;
+
+	goya->events_stat[event_type]++;
+
+	switch (event_type) {
+	case GOYA_ASYNC_EVENT_ID_PCIE_IF:
+	case GOYA_ASYNC_EVENT_ID_TPC0_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC1_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC2_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC3_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC4_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC5_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC6_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC7_ECC:
+	case GOYA_ASYNC_EVENT_ID_MME_ECC:
+	case GOYA_ASYNC_EVENT_ID_MME_ECC_EXT:
+	case GOYA_ASYNC_EVENT_ID_MMU_ECC:
+	case GOYA_ASYNC_EVENT_ID_DMA_MACRO:
+	case GOYA_ASYNC_EVENT_ID_DMA_ECC:
+	case GOYA_ASYNC_EVENT_ID_CPU_IF_ECC:
+	case GOYA_ASYNC_EVENT_ID_PSOC_MEM:
+	case GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT:
+	case GOYA_ASYNC_EVENT_ID_SRAM0:
+	case GOYA_ASYNC_EVENT_ID_SRAM1:
+	case GOYA_ASYNC_EVENT_ID_SRAM2:
+	case GOYA_ASYNC_EVENT_ID_SRAM3:
+	case GOYA_ASYNC_EVENT_ID_SRAM4:
+	case GOYA_ASYNC_EVENT_ID_SRAM5:
+	case GOYA_ASYNC_EVENT_ID_SRAM6:
+	case GOYA_ASYNC_EVENT_ID_SRAM7:
+	case GOYA_ASYNC_EVENT_ID_SRAM8:
+	case GOYA_ASYNC_EVENT_ID_SRAM9:
+	case GOYA_ASYNC_EVENT_ID_SRAM10:
+	case GOYA_ASYNC_EVENT_ID_SRAM11:
+	case GOYA_ASYNC_EVENT_ID_SRAM12:
+	case GOYA_ASYNC_EVENT_ID_SRAM13:
+	case GOYA_ASYNC_EVENT_ID_SRAM14:
+	case GOYA_ASYNC_EVENT_ID_SRAM15:
+	case GOYA_ASYNC_EVENT_ID_SRAM16:
+	case GOYA_ASYNC_EVENT_ID_SRAM17:
+	case GOYA_ASYNC_EVENT_ID_SRAM18:
+	case GOYA_ASYNC_EVENT_ID_SRAM19:
+	case GOYA_ASYNC_EVENT_ID_SRAM20:
+	case GOYA_ASYNC_EVENT_ID_SRAM21:
+	case GOYA_ASYNC_EVENT_ID_SRAM22:
+	case GOYA_ASYNC_EVENT_ID_SRAM23:
+	case GOYA_ASYNC_EVENT_ID_SRAM24:
+	case GOYA_ASYNC_EVENT_ID_SRAM25:
+	case GOYA_ASYNC_EVENT_ID_SRAM26:
+	case GOYA_ASYNC_EVENT_ID_SRAM27:
+	case GOYA_ASYNC_EVENT_ID_SRAM28:
+	case GOYA_ASYNC_EVENT_ID_SRAM29:
+	case GOYA_ASYNC_EVENT_ID_GIC500:
+	case GOYA_ASYNC_EVENT_ID_PLL0:
+	case GOYA_ASYNC_EVENT_ID_PLL1:
+	case GOYA_ASYNC_EVENT_ID_PLL3:
+	case GOYA_ASYNC_EVENT_ID_PLL4:
+	case GOYA_ASYNC_EVENT_ID_PLL5:
+	case GOYA_ASYNC_EVENT_ID_PLL6:
+	case GOYA_ASYNC_EVENT_ID_AXI_ECC:
+	case GOYA_ASYNC_EVENT_ID_L2_RAM_ECC:
+	case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET:
+	case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT:
+		dev_err(hdev->dev,
+			"Received H/W interrupt %d, reset the chip\n",
+			event_type);
+		break;
+
+	case GOYA_ASYNC_EVENT_ID_PCIE_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC0_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC1_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC2_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC3_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC4_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC5_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC6_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC7_DEC:
+	case GOYA_ASYNC_EVENT_ID_MME_WACS:
+	case GOYA_ASYNC_EVENT_ID_MME_WACSD:
+	case GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER:
+	case GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC:
+	case GOYA_ASYNC_EVENT_ID_PSOC:
+	case GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC0_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC1_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC2_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC3_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC4_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC5_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC6_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC7_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_TPC0_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC1_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC2_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC3_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC4_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC5_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC6_QM:
+	case GOYA_ASYNC_EVENT_ID_TPC7_QM:
+	case GOYA_ASYNC_EVENT_ID_MME_QM:
+	case GOYA_ASYNC_EVENT_ID_MME_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_DMA0_QM:
+	case GOYA_ASYNC_EVENT_ID_DMA1_QM:
+	case GOYA_ASYNC_EVENT_ID_DMA2_QM:
+	case GOYA_ASYNC_EVENT_ID_DMA3_QM:
+	case GOYA_ASYNC_EVENT_ID_DMA4_QM:
+	case GOYA_ASYNC_EVENT_ID_DMA0_CH:
+	case GOYA_ASYNC_EVENT_ID_DMA1_CH:
+	case GOYA_ASYNC_EVENT_ID_DMA2_CH:
+	case GOYA_ASYNC_EVENT_ID_DMA3_CH:
+	case GOYA_ASYNC_EVENT_ID_DMA4_CH:
+		goya_print_irq_info(hdev, event_type);
+		goya_unmask_irq(hdev, event_type);
+		break;
+
+	case GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH0:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH1:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH2:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH3:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH4:
+		dev_info(hdev->dev, "Received H/W interrupt %d\n", event_type);
+		break;
+
+	default:
+		dev_err(hdev->dev, "Received invalid H/W interrupt %d\n",
+				event_type);
+		break;
+	}
+}
+
+void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	*size = (u32) sizeof(goya->events_stat);
+
+	return goya->events_stat;
+}
+
 
 static void goya_hw_queues_lock(struct hl_device *hdev)
 {
@@ -3794,6 +4294,7 @@ static const struct hl_asic_funcs goya_funcs = {
 	.sw_fini = goya_sw_fini,
 	.hw_init = goya_hw_init,
 	.hw_fini = goya_hw_fini,
+	.halt_engines = goya_halt_engines,
 	.suspend = goya_suspend,
 	.resume = goya_resume,
 	.mmap = goya_mmap,
@@ -3808,6 +4309,9 @@ static const struct hl_asic_funcs goya_funcs = {
 	.dma_pool_free = goya_dma_pool_free,
 	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
 	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
+	.update_eq_ci = goya_update_eq_ci,
+	.handle_eqe = goya_handle_eqe,
+	.get_events_stat = goya_get_events_stat,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
 	.send_cpu_message = goya_send_cpu_message
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 598a718d3df1..c6bfcb6c6905 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -123,6 +123,7 @@ struct goya_device {
 	/* TODO: remove hw_queues_lock after moving to scheduler code */
 	spinlock_t	hw_queues_lock;
 	u64		ddr_bar_cur_addr;
+	u32		events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
 	u32		hw_cap_initialized;
 };
 
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 8232e2259463..899bf98eb002 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -83,6 +83,7 @@ struct hw_queue_properties {
  * @cfg_size: configuration space size on SRAM.
  * @sram_size: total size of SRAM.
  * @max_asid: maximum number of open contexts (ASIDs).
+ * @num_of_events: number of possible internal H/W IRQs.
  * @completion_queues_count: number of completion queues.
  * @high_pll: high PLL frequency used by the device.
  * @cb_pool_cb_cnt: number of CBs in the CB pool.
@@ -109,6 +110,7 @@ struct asic_fixed_properties {
 	u32			cfg_size;
 	u32			sram_size;
 	u32			max_asid;
+	u32			num_of_events;
 	u32			high_pll;
 	u32			cb_pool_cb_cnt;
 	u32			cb_pool_cb_size;
@@ -209,6 +211,9 @@ struct hl_cs_job;
 #define HL_CQ_LENGTH			HL_QUEUE_LENGTH
 #define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
 
+/* Must be power of 2 (HL_PAGE_SIZE / HL_EQ_ENTRY_SIZE) */
+#define HL_EQ_LENGTH			64
+#define HL_EQ_SIZE_IN_BYTES		(HL_EQ_LENGTH * HL_EQ_ENTRY_SIZE)
 
 
 /**
@@ -256,6 +261,20 @@ struct hl_cq {
 	atomic_t		free_slots_cnt;
 };
 
+/**
+ * struct hl_eq - describes the event queue (single one per device)
+ * @hdev: pointer to the device structure
+ * @kernel_address: holds the queue's kernel virtual address
+ * @bus_address: holds the queue's DMA address
+ * @ci: ci inside the queue
+ */
+struct hl_eq {
+	struct hl_device	*hdev;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			ci;
+};
+
 
 
 
@@ -288,6 +307,9 @@ enum hl_asic_type {
  * @sw_fini: tears down driver state, does not configure H/W.
  * @hw_init: sets up the H/W state.
  * @hw_fini: tears down the H/W state.
+ * @halt_engines: halt engines, needed for reset sequence. This also disables
+ *                interrupts from the device. Should be called before
+ *                hw_fini and before CS rollback.
  * @suspend: handles IP specific H/W or SW changes for suspend.
  * @resume: handles IP specific H/W or SW changes for resume.
  * @mmap: mmap function, does nothing.
@@ -303,6 +325,9 @@ enum hl_asic_type {
  * @dma_pool_free: free small DMA allocation from pool.
  * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
  * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
+ * @update_eq_ci: update event queue CI.
+ * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
+ * @get_events_stat: retrieve event queue entries histogram.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
  * @send_cpu_message: send buffer to ArmCP.
@@ -314,6 +339,7 @@ struct hl_asic_funcs {
 	int (*sw_fini)(struct hl_device *hdev);
 	int (*hw_init)(struct hl_device *hdev);
 	void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
+	void (*halt_engines)(struct hl_device *hdev, bool hard_reset);
 	int (*suspend)(struct hl_device *hdev);
 	int (*resume)(struct hl_device *hdev);
 	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
@@ -336,6 +362,10 @@ struct hl_asic_funcs {
 				size_t size, dma_addr_t *dma_handle);
 	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
 				size_t size, void *vaddr);
+	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
+	void (*handle_eqe)(struct hl_device *hdev,
+				struct hl_eq_entry *eq_entry);
+	void *(*get_events_stat)(struct hl_device *hdev, u32 *size);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
@@ -474,6 +504,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @kernel_ctx: KMD context structure.
  * @kernel_queues: array of hl_hw_queue.
  * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
+ * @event_queue: event queue for IRQ from ArmCP.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
@@ -504,9 +535,11 @@ struct hl_device {
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
 	struct workqueue_struct		*cq_wq;
+	struct workqueue_struct		*eq_wq;
 	struct hl_ctx			*kernel_ctx;
 	struct hl_hw_queue		*kernel_queues;
 	struct hl_cb_mgr		kernel_cb_mgr;
+	struct hl_eq			event_queue;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
@@ -593,6 +626,10 @@ void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
 
 int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
 void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
+int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
+void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
+irqreturn_t hl_irq_handler_cq(int irq, void *arg);
+irqreturn_t hl_irq_handler_eq(int irq, void *arg);
 int hl_asid_init(struct hl_device *hdev);
 void hl_asid_fini(struct hl_device *hdev);
 unsigned long hl_asid_alloc(struct hl_device *hdev);
diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
index 2d0efb7b44bb..bcc461760e5f 100644
--- a/drivers/misc/habanalabs/include/goya/goya.h
+++ b/drivers/misc/habanalabs/include/goya/goya.h
@@ -65,7 +65,6 @@
 
 #define GOYA_MSIX_ENTRIES	8
 #define EVENT_QUEUE_MSIX_IDX	5
-#define ARMCP_RESET_MSIX_IDX	6
 
 #define QMAN_PQ_ENTRY_SIZE	16			/* Bytes */
 
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
index 97b0de7ea5c2..9586323e7dfb 100644
--- a/drivers/misc/habanalabs/irq.c
+++ b/drivers/misc/habanalabs/irq.c
@@ -9,6 +9,18 @@
 
 #include <linux/dma-mapping.h>
 
+/**
+ * struct hl_eqe_work - used to schedule work of EQ entry and armcp_reset event
+ *
+ * @eq_work: work object to run when an EQ entry is received
+ * @hdev: pointer to the device structure
+ * @eq_entry: copy of the EQ entry
+ */
+struct hl_eqe_work {
+	struct work_struct	eq_work;
+	struct hl_device	*hdev;
+	struct hl_eq_entry	eq_entry;
+};
 
 /**
  * hl_cq_inc_ptr - increment ci or pi of cq
@@ -26,6 +38,33 @@ inline u32 hl_cq_inc_ptr(u32 ptr)
 	return ptr;
 }
 
+/**
+ * hl_eq_inc_ptr - increment ci of eq
+ *
+ * @ptr: the current ci value of the event queue
+ *
+ * Increment ptr by 1. If it reaches the number of event queue
+ * entries, set it to 0
+ */
+inline u32 hl_eq_inc_ptr(u32 ptr)
+{
+	ptr++;
+	if (unlikely(ptr == HL_EQ_LENGTH))
+		ptr = 0;
+	return ptr;
+}
+
+static void irq_handle_eqe(struct work_struct *work)
+{
+	struct hl_eqe_work *eqe_work = container_of(work, struct hl_eqe_work,
+							eq_work);
+	struct hl_device *hdev = eqe_work->hdev;
+
+	hdev->asic_funcs->handle_eqe(hdev, &eqe_work->eq_entry);
+
+	kfree(eqe_work);
+}
+
 /**
  * hl_irq_handler_cq - irq handler for completion queue
  *
@@ -103,6 +142,68 @@ irqreturn_t hl_irq_handler_cq(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
+/**
+ * hl_irq_handler_eq - irq handler for event queue
+ *
+ * @irq: irq number
+ * @arg: pointer to event queue structure
+ *
+ */
+irqreturn_t hl_irq_handler_eq(int irq, void *arg)
+{
+	struct hl_eq *eq = arg;
+	struct hl_device *hdev = eq->hdev;
+	struct hl_eq_entry *eq_entry;
+	struct hl_eq_entry *eq_base;
+	struct hl_eqe_work *handle_eqe_work;
+
+	eq_base = (struct hl_eq_entry *) eq->kernel_address;
+
+	while (1) {
+		bool entry_ready =
+				((eq_base[eq->ci].hdr.ctl & EQ_CTL_READY_MASK)
+						>> EQ_CTL_READY_SHIFT);
+
+		if (!entry_ready)
+			break;
+
+		eq_entry = &eq_base[eq->ci];
+
+		/*
+		 * Make sure we read EQ entry contents after we've
+		 * checked the ownership bit.
+		 */
+		dma_rmb();
+
+		if (hdev->disabled) {
+			dev_warn(hdev->dev,
+				"Device disabled but received IRQ %d for EQ\n",
+					irq);
+			goto skip_irq;
+		}
+
+		handle_eqe_work = kmalloc(sizeof(*handle_eqe_work), GFP_ATOMIC);
+		if (handle_eqe_work) {
+			INIT_WORK(&handle_eqe_work->eq_work, irq_handle_eqe);
+			handle_eqe_work->hdev = hdev;
+
+			memcpy(&handle_eqe_work->eq_entry, eq_entry,
+					sizeof(*eq_entry));
+
+			queue_work(hdev->eq_wq, &handle_eqe_work->eq_work);
+		}
+skip_irq:
+		/* Clear EQ entry ready bit */
+		eq_entry->hdr.ctl &= ~EQ_CTL_READY_MASK;
+
+		eq->ci = hl_eq_inc_ptr(eq->ci);
+
+		hdev->asic_funcs->update_eq_ci(hdev, eq->ci);
+	}
+
+	return IRQ_HANDLED;
+}
+
 /**
  * hl_cq_init - main initialization function for an cq object
  *
@@ -148,3 +249,46 @@ void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
 	hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
 			(void *) q->kernel_address, q->bus_address);
 }
+
+/**
+ * hl_eq_init - main initialization function for an event queue object
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to eq structure
+ *
+ * Allocate dma-able memory for the event queue and initialize fields
+ * Returns 0 on success
+ */
+int hl_eq_init(struct hl_device *hdev, struct hl_eq *q)
+{
+	void *p;
+
+	BUILD_BUG_ON(HL_EQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
+
+	p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
+				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	q->hdev = hdev;
+	q->kernel_address = (u64) p;
+	q->ci = 0;
+
+	return 0;
+}
+
+/**
+ * hl_eq_fini - destroy event queue
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to eq structure
+ *
+ * Free the event queue memory
+ */
+void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q)
+{
+	flush_workqueue(hdev->eq_wq);
+
+	hdev->asic_funcs->dma_free_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+}
-- 
2.17.1
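
The consumer loop of hl_irq_handler_eq() above can be modelled in user
space. This is only a sketch under stated assumptions: struct eq_entry,
EQ_CTL_READY and eq_drain() are hypothetical names, the position of the
ready bit is made up, and the dma_rmb() barrier and the workqueue hand-off
of the real handler are left out.

```c
#include <assert.h>
#include <stdint.h>

#define EQ_LENGTH	64		/* mirrors HL_EQ_LENGTH */
#define EQ_CTL_READY	0x80000000u	/* assumed bit position */

struct eq_entry {
	uint32_t ctl;
	uint32_t data;
};

struct eq {
	struct eq_entry ring[EQ_LENGTH];
	uint32_t ci;			/* consumer index */
};

/* Advance the consumer index, wrapping at the end of the ring. */
static uint32_t eq_inc_ptr(uint32_t ptr)
{
	ptr++;
	if (ptr == EQ_LENGTH)
		ptr = 0;
	return ptr;
}

/*
 * Copy every ready entry (at most max) into out, clear its ready bit
 * and advance ci. Returns the number of entries consumed.
 */
static unsigned int eq_drain(struct eq *q, struct eq_entry *out,
			     unsigned int max)
{
	unsigned int n = 0;

	assert(q && out);

	while (n < max && (q->ring[q->ci].ctl & EQ_CTL_READY)) {
		out[n++] = q->ring[q->ci];
		q->ring[q->ci].ctl &= ~EQ_CTL_READY;
		q->ci = eq_inc_ptr(q->ci);
	}

	return n;
}
```

Copying the entry before clearing the ready bit follows the order used in
the handler: once the bit is cleared and the CI is updated, the producer
is free to reuse the slot.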



* [PATCH 09/15] habanalabs: add sysfs and hwmon support
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (6 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 08/15] habanalabs: add event queue and interrupts Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-25  7:54   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 10/15] habanalabs: add device reset support Oded Gabbay
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds the sysfs and hwmon entries that are exposed by the driver.

Goya has several sensors, from various categories such as temperature,
voltage, current, etc. The driver exposes those sensors in the standard
hwmon mechanism.

In addition, the driver exposes a couple of interfaces in sysfs, both for
configuration and for providing status of the device or driver.

The configuration attributes are for power management:
- Automatic or manual
- Frequency value when moving to high frequency mode
- Maximum power the device is allowed to consume

The rest of the attributes are read-only and provide the following
information:
- Versions of the various firmwares running on the device
- Contents of the device's EEPROM
- The device type (currently only Goya is supported)
- PCI address of the device (to allow user-space to map /dev/hlX to its
  PCI address)
- Status of the device (operational, malfunction, in_reset)
- How many processes have the device's file open

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
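
A minimal user-space sketch of how the read-only attributes listed above
might be consumed. The helper read_sysfs_attr() is hypothetical and not
part of this patch; it only assumes that the text attributes hold a single
line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a text sysfs attribute into buf and strip the
 * trailing newline. Returns 0 on success, -1 if the file cannot be
 * opened or is empty. Hypothetical helper, for illustration only.
 */
static int read_sysfs_attr(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path && buf && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* sysfs show() callbacks normally end the value with '\n' */
	buf[strcspn(buf, "\n")] = '\0';

	return 0;
}
```

Calling read_sysfs_attr("/sys/class/habanalabs/hl0/status", buf,
sizeof(buf)), where hl0 is an assumed device index, should leave one of
the status strings documented in the ABI file in buf. The binary eeprom
attribute is not line-oriented and would need fread() instead.
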
 .../ABI/testing/sysfs-driver-habanalabs       | 190 ++++++
 drivers/misc/habanalabs/Makefile              |   2 +-
 drivers/misc/habanalabs/device.c              | 146 +++++
 drivers/misc/habanalabs/goya/Makefile         |   2 +-
 drivers/misc/habanalabs/goya/goya.c           | 230 +++++++
 drivers/misc/habanalabs/goya/goyaP.h          |  21 +
 drivers/misc/habanalabs/goya/goya_hwmgr.c     | 306 +++++++++
 drivers/misc/habanalabs/habanalabs.h          |  97 +++
 drivers/misc/habanalabs/habanalabs_drv.c      |   7 +
 drivers/misc/habanalabs/hwmon.c               | 449 +++++++++++++
 drivers/misc/habanalabs/sysfs.c               | 588 ++++++++++++++++++
 11 files changed, 2036 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
 create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
 create mode 100644 drivers/misc/habanalabs/hwmon.c
 create mode 100644 drivers/misc/habanalabs/sysfs.c
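
The automatic power-management mode relies on a compare-and-exchange so
that only the caller that actually flips the PLL profile touches the H/W.
The following is a user-space model of that logic with C11 atomics, not
the driver code itself: struct pm_state, set_frequency() and the
hw_writes counter are hypothetical stand-ins for hl_device,
hl_device_set_frequency() and the set_pll_profile() callback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum pll_freq { PLL_LOW, PLL_HIGH };

struct pm_state {
	atomic_int curr_pll;	/* last profile that was requested */
	atomic_int open_cnt;	/* number of open file descriptors */
	bool manual;		/* true when the profile is "manual" */
	int hw_writes;		/* counts simulated PLL reprogramming */
};

/* Returns 1 if the profile was changed, 0 otherwise. */
static int set_frequency(struct pm_state *s, enum pll_freq freq)
{
	int expected = (freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;

	assert(s);

	if (s->manual)
		return 0;

	/* Fails when the state is already freq, so nothing to do. */
	if (!atomic_compare_exchange_strong(&s->curr_pll, &expected, freq))
		return 0;

	/* An open() may have raced with the request to go low: undo it. */
	if (freq == PLL_LOW && atomic_load(&s->open_cnt) > 0) {
		atomic_store(&s->curr_pll, PLL_HIGH);
		return 0;
	}

	s->hw_writes++;

	return 1;
}
```

The second check is what keeps the periodic low-frequency job from
lowering the clocks under a process that opened the device between the
job's own open-count test and the exchange.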

diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
new file mode 100644
index 000000000000..19edd4da87c1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
@@ -0,0 +1,190 @@
+What:           /sys/class/habanalabs/hl<n>/armcp_kernel_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Linux kernel running on the device's CPU
+
+What:           /sys/class/habanalabs/hl<n>/armcp_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the application running on the device's CPU
+
+What:           /sys/class/habanalabs/hl<n>/cpld_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Device's CPLD F/W
+
+What:           /sys/class/habanalabs/hl<n>/device_type
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the code name of the device according to its type.
+                The supported values are: "GOYA"
+
+What:           /sys/class/habanalabs/hl<n>/eeprom
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    A binary file attribute that contains the contents of the
+                on-board EEPROM
+
+What:           /sys/class/habanalabs/hl<n>/fuse_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the device's version from the eFuse
+
+What:           /sys/class/habanalabs/hl<n>/hard_reset
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Interface to trigger a hard-reset operation for the device.
+                Hard-reset will reset ALL internal components of the device
+                except for the PCI interface and the internal PLLs
+
+What:           /sys/class/habanalabs/hl<n>/hard_reset_cnt
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays how many times the device has undergone a hard-reset
+                operation
+
+What:           /sys/class/habanalabs/hl<n>/high_pll
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency for MME, TPC
+                and IC when the power management profile is set to "auto".
+
+What:           /sys/class/habanalabs/hl<n>/ic_clk
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency of the
+                Interconnect fabric. Writes to this parameter affect the device
+                only when the power management profile is set to "manual" mode.
+                The device IC clock might be set to a lower value than the
+                maximum. The user should read the ic_clk_curr to see the actual
+                frequency value of the IC
+
+What:           /sys/class/habanalabs/hl<n>/ic_clk_curr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the current clock frequency of the Interconnect fabric
+
+What:           /sys/class/habanalabs/hl<n>/infineon_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Device's power supply F/W code
+
+What:           /sys/class/habanalabs/hl<n>/max_power
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum power consumption of the
+                device in milliwatts.
+
+What:           /sys/class/habanalabs/hl<n>/mme_clk
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency of the
+                MME compute engine. Writes to this parameter affect the device
+                only when the power management profile is set to "manual" mode.
+                The device MME clock might be set to a lower value than the
+                maximum. The user should read the mme_clk_curr to see the actual
+                frequency value of the MME
+
+What:           /sys/class/habanalabs/hl<n>/mme_clk_curr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the current clock frequency of the MME compute engine
+
+What:           /sys/class/habanalabs/hl<n>/pci_addr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the PCI address of the device. This is needed so the
+                user would be able to open a device based on its PCI address
+
+What:           /sys/class/habanalabs/hl<n>/pm_mng_profile
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Power management profile. Values are "auto", "manual". In "auto"
+                mode, the driver will set the maximum clock frequency to a high
+                value when a user-space process opens the device's file (unless
+                it was already opened by another process). The driver will set
+                the max clock frequency to a low value when no user process
+                has the device's file open. In "manual"
+                mode, the user sets the maximum clock frequency by writing to
+                ic_clk, mme_clk and tpc_clk
+
+
+What:           /sys/class/habanalabs/hl<n>/preboot_btl_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the device's preboot F/W code
+
+What:           /sys/class/habanalabs/hl<n>/soft_reset
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Interface to trigger a soft-reset operation for the device.
+                Soft-reset will reset only the compute and DMA engines of the
+                device
+
+What:           /sys/class/habanalabs/hl<n>/soft_reset_cnt
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays how many times the device has undergone a soft-reset
+                operation
+
+What:           /sys/class/habanalabs/hl<n>/status
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Status of the card: "Operational", "Malfunction", "In reset".
+
+What:           /sys/class/habanalabs/hl<n>/thermal_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Device's thermal daemon
+
+What:           /sys/class/habanalabs/hl<n>/tpc_clk
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency of the
+                TPC compute engines. Writes to this parameter affect the device
+                only when the power management profile is set to "manual" mode.
+                The device TPC clock might be set to a lower value than the
+                maximum. The user should read the tpc_clk_curr to see the actual
+                frequency value of the TPC
+
+What:           /sys/class/habanalabs/hl<n>/tpc_clk_curr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the current clock frequency of the TPC compute engines
+
+What:           /sys/class/habanalabs/hl<n>/uboot_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the u-boot running on the device's CPU
+
+What:           /sys/class/habanalabs/hl<n>/write_open_cnt
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the total number of user processes that currently have
+                the device's file open
diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index c07f3ccb57dc..b5607233d216 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,7 +5,7 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
-		command_buffer.o hw_queue.o irq.o
+		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 9199e070e79e..ff7b610f18c4 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -226,6 +226,118 @@ static void device_early_fini(struct hl_device *hdev)
 	mutex_destroy(&hdev->device_open);
 }
 
+static void set_freq_to_low_job(struct work_struct *work)
+{
+	struct hl_device *hdev = container_of(work, struct hl_device,
+						work_freq.work);
+
+	if (atomic_read(&hdev->fd_open_cnt) == 0)
+		hl_device_set_frequency(hdev, PLL_LOW);
+
+	schedule_delayed_work(&hdev->work_freq,
+			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
+}
+
+/**
+ * device_late_init - do late initialization for the habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Do stuff that either needs the device H/W queues to be active or needs
+ * to happen after all the rest of the initialization is finished
+ */
+static int device_late_init(struct hl_device *hdev)
+{
+	int rc;
+
+	INIT_DELAYED_WORK(&hdev->work_freq, set_freq_to_low_job);
+	hdev->high_pll = hdev->asic_prop.high_pll;
+
+	/* force setting to low frequency */
+	atomic_set(&hdev->curr_pll_profile, PLL_LOW);
+
+	if (hdev->pm_mng_profile == PM_AUTO)
+		hdev->asic_funcs->set_pll_profile(hdev, PLL_LOW);
+	else
+		hdev->asic_funcs->set_pll_profile(hdev, PLL_LAST);
+
+	if (hdev->asic_funcs->late_init) {
+		rc = hdev->asic_funcs->late_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed late initialization for the H/W\n");
+			return rc;
+		}
+	}
+
+	schedule_delayed_work(&hdev->work_freq,
+			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
+
+	hdev->late_init_done = true;
+
+	return 0;
+}
+
+/**
+ * device_late_fini - finalize all that was done in device_late_init
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ */
+static void device_late_fini(struct hl_device *hdev)
+{
+	if (!hdev->late_init_done)
+		return;
+
+	cancel_delayed_work_sync(&hdev->work_freq);
+
+	if (hdev->asic_funcs->late_fini)
+		hdev->asic_funcs->late_fini(hdev);
+
+	hdev->late_init_done = false;
+}
+
+/**
+ * hl_device_set_frequency - set the frequency of the device
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @freq: the new frequency value
+ *
+ * Change the frequency if needed.
+ * We allow setting the PLL to low only if there is no user process.
+ * Returns 0 if no change was made, otherwise returns 1.
+ */
+int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq)
+{
+	enum hl_pll_frequency old_freq =
+			(freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;
+	int ret;
+
+	if (hdev->pm_mng_profile == PM_MANUAL)
+		return 0;
+
+	ret = atomic_cmpxchg(&hdev->curr_pll_profile, old_freq, freq);
+	if (ret == freq)
+		return 0;
+
+	/*
+	 * in case we want to lower frequency, check that the device is not
+	 * opened. We must have a check here to work around a race condition
+	 * with hl_device_open
+	 */
+	if ((freq == PLL_LOW) && (atomic_read(&hdev->fd_open_cnt) > 0)) {
+		atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
+		return 0;
+	}
+
+	dev_dbg(hdev->dev, "Changing device frequency to %s\n",
+		freq == PLL_HIGH ? "high" : "low");
+
+	hdev->asic_funcs->set_pll_profile(hdev, freq);
+
+	return 1;
+}
+
 /**
  * hl_device_suspend - initiate device suspend
  *
@@ -386,6 +498,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto release_ctx;
 	}
 
+	rc = hl_sysfs_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize sysfs\n");
+		goto free_cb_pool;
+	}
+
 	rc = hdev->asic_funcs->hw_init(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize the H/W\n");
@@ -403,11 +521,33 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto out_disabled;
 	}
 
+	/* After test_queues, KMD can start sending messages to device CPU */
+
+	rc = device_late_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed late initialization\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
+	dev_info(hdev->dev, "Found %s device with %lluGB DRAM\n",
+		hdev->asic_name,
+		hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
+
+	rc = hl_hwmon_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize hwmon\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+free_cb_pool:
+	hl_cb_pool_fini(hdev);
 release_ctx:
 	if (hl_ctx_put(hdev->kernel_ctx) != 1)
 		dev_err(hdev->dev,
@@ -457,6 +597,12 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	hl_hwmon_fini(hdev);
+
+	device_late_fini(hdev);
+
+	hl_sysfs_fini(hdev);
+
 	/*
 	 * Halt the engines and disable interrupts so we won't get any more
 	 * completions from H/W and we won't have any accesses from the
diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
index a57096fa41b6..ada8518ec215 100644
--- a/drivers/misc/habanalabs/goya/Makefile
+++ b/drivers/misc/habanalabs/goya/Makefile
@@ -1,3 +1,3 @@
 subdir-ccflags-y += -I$(src)
 
-HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
\ No newline at end of file
+HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o goya/goya_hwmgr.o
\ No newline at end of file
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 6c04277ae0fa..7899ff762e0b 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -127,6 +127,8 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 
 #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
 
+static int goya_armcp_info_get(struct hl_device *hdev);
+
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
@@ -174,6 +176,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
 	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
 	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
+	prop->max_power_default = MAX_POWER_DEFAULT;
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
 
 	prop->high_pll = PLL_HIGH_DEFAULT;
@@ -558,6 +561,89 @@ int goya_early_fini(struct hl_device *hdev)
 	return 0;
 }
 
+/**
+ * goya_fetch_psoc_frequency - Fetch PSOC frequency values
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_fetch_psoc_frequency(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+
+	prop->psoc_pci_pll_nr = RREG32(mmPSOC_PCI_PLL_NR);
+	prop->psoc_pci_pll_nf = RREG32(mmPSOC_PCI_PLL_NF);
+	prop->psoc_pci_pll_od = RREG32(mmPSOC_PCI_PLL_OD);
+	prop->psoc_pci_pll_div_factor = RREG32(mmPSOC_PCI_PLL_DIV_FACTOR_1);
+}
+
+/**
+ * goya_late_init - GOYA late initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Get ArmCP info and send message to CPU to enable PCI access
+ */
+static int goya_late_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+
+	rc = goya->armcp_info_get(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to get armcp info\n");
+		return rc;
+	}
+
+	/* Now that we have the DRAM size in ASIC prop, we need to check
+	 * its size and configure the DMA_IF DDR wrap protection (which is in
+	 * the MMU block) accordingly. The value is the log2 of the DRAM size
+	 */
+	WREG32(mmMMU_LOG2_DDR_SIZE, ilog2(prop->dram_size));
+
+	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
+		return rc;
+	}
+
+	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+			GOYA_ASYNC_EVENT_ID_INTS_REGISTER);
+
+	goya_fetch_psoc_frequency(hdev);
+
+	return 0;
+}
+
+/**
+ * goya_late_fini - GOYA late tear-down code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Free the structures allocated for the sensors
+ */
+void goya_late_fini(struct hl_device *hdev)
+{
+	const struct hwmon_channel_info **channel_info_arr;
+	int i = 0;
+
+	if (!hdev->hl_chip_info.info)
+		return;
+
+	channel_info_arr = hdev->hl_chip_info.info;
+
+	while (channel_info_arr[i]) {
+		kfree(channel_info_arr[i]->config);
+		kfree(channel_info_arr[i]);
+		i++;
+	}
+
+	kfree(channel_info_arr);
+
+	hdev->hl_chip_info.info = NULL;
+}
+
 /**
  * goya_sw_init - Goya software initialization code
  *
@@ -575,9 +661,15 @@ static int goya_sw_init(struct hl_device *hdev)
 		return -ENOMEM;
 
 	goya->test_cpu_queue = goya_test_cpu_queue;
+	goya->armcp_info_get = goya_armcp_info_get;
 
 	/* according to goya_init_iatu */
 	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
+
+	goya->mme_clk = GOYA_PLL_FREQ_LOW;
+	goya->tpc_clk = GOYA_PLL_FREQ_LOW;
+	goya->ic_clk = GOYA_PLL_FREQ_LOW;
+
 	hdev->asic_specific = goya;
 
 	/* Create DMA pool for small allocations */
@@ -4272,6 +4364,87 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
 	return goya->events_stat;
 }
 
+static int goya_armcp_info_get(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct armcp_packet pkt;
+	void *armcp_info_cpu_addr;
+	dma_addr_t armcp_info_dma_addr;
+	u64 dram_size;
+	long result;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
+		return 0;
+
+	armcp_info_cpu_addr =
+			hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
+			sizeof(struct armcp_info), &armcp_info_dma_addr);
+	if (!armcp_info_cpu_addr) {
+		dev_err(hdev->dev,
+			"Failed to allocate DMA memory for ArmCP info packet\n");
+		return -ENOMEM;
+	}
+
+	memset(armcp_info_cpu_addr, 0, sizeof(struct armcp_info));
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_INFO_GET;
+	pkt.addr = armcp_info_dma_addr + prop->host_phys_base_address;
+	pkt.data_max_size = sizeof(struct armcp_info);
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			GOYA_ARMCP_INFO_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to send armcp info pkt, error %d\n", rc);
+		goto out;
+	}
+
+	memcpy(&prop->armcp_info, armcp_info_cpu_addr,
+			sizeof(prop->armcp_info));
+
+	dram_size = prop->armcp_info.dram_size;
+	if (dram_size) {
+		if ((!is_power_of_2(dram_size)) ||
+				(dram_size < DRAM_PHYS_DEFAULT_SIZE)) {
+			dev_err(hdev->dev,
+				"F/W reported invalid DRAM size %llu. Trying to use default size\n",
+				dram_size);
+			dram_size = DRAM_PHYS_DEFAULT_SIZE;
+		}
+
+		prop->dram_size = dram_size;
+		prop->dram_end_address = prop->dram_base_address + dram_size;
+	}
+
+	rc = hl_build_hwmon_channel_info(hdev, prop->armcp_info.sensors);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to build hwmon channel info, error %d\n", rc);
+		rc = -EFAULT;
+		goto out;
+	}
+
+out:
+	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev,
+			sizeof(struct armcp_info), armcp_info_cpu_addr);
+
+	return rc;
+}
+
+static void goya_init_clock_gating(struct hl_device *hdev)
+{
+
+}
+
+static void goya_disable_clock_gating(struct hl_device *hdev)
+{
+
+}
 
 static void goya_hw_queues_lock(struct hl_device *hdev)
 {
@@ -4287,9 +4460,60 @@ static void goya_hw_queues_unlock(struct hl_device *hdev)
 	spin_unlock(&goya->hw_queues_lock);
 }
 
+int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct armcp_packet pkt;
+	void *eeprom_info_cpu_addr;
+	dma_addr_t eeprom_info_dma_addr;
+	long result;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
+		return 0;
+
+	eeprom_info_cpu_addr =
+			hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
+					max_size, &eeprom_info_dma_addr);
+	if (!eeprom_info_cpu_addr) {
+		dev_err(hdev->dev,
+			"Failed to allocate DMA memory for EEPROM info packet\n");
+		return -ENOMEM;
+	}
+
+	memset(eeprom_info_cpu_addr, 0, max_size);
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_EEPROM_DATA_GET;
+	pkt.addr = eeprom_info_dma_addr + prop->host_phys_base_address;
+	pkt.data_max_size = max_size;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			GOYA_ARMCP_EEPROM_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to send armcp EEPROM pkt, error %d\n", rc);
+		goto out;
+	}
+
+	/* result contains the actual size */
+	memcpy(data, eeprom_info_cpu_addr, min((size_t)result, max_size));
+
+out:
+	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, max_size,
+			eeprom_info_cpu_addr);
+
+	return rc;
+}
+
 static const struct hl_asic_funcs goya_funcs = {
 	.early_init = goya_early_init,
 	.early_fini = goya_early_fini,
+	.late_init = goya_late_init,
+	.late_fini = goya_late_fini,
 	.sw_init = goya_sw_init,
 	.sw_fini = goya_sw_fini,
 	.hw_init = goya_hw_init,
@@ -4310,10 +4534,16 @@ static const struct hl_asic_funcs goya_funcs = {
 	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
 	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
 	.update_eq_ci = goya_update_eq_ci,
+	.add_device_attr = goya_add_device_attr,
+	.remove_device_attr = goya_remove_device_attr,
 	.handle_eqe = goya_handle_eqe,
+	.set_pll_profile = goya_set_pll_profile,
 	.get_events_stat = goya_get_events_stat,
+	.enable_clock_gating = goya_init_clock_gating,
+	.disable_clock_gating = goya_disable_clock_gating,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
+	.get_eeprom_data = goya_get_eeprom_data,
 	.send_cpu_message = goya_send_cpu_message
 };
 
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index c6bfcb6c6905..42e8b1baef2f 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -48,7 +48,10 @@
 
 #define PLL_HIGH_DEFAULT		1575000000	/* 1.575 GHz */
 
+#define MAX_POWER_DEFAULT		200000		/* 200W */
+
 #define GOYA_ARMCP_INFO_TIMEOUT		10000000	/* 10s */
+#define GOYA_ARMCP_EEPROM_TIMEOUT	10000000	/* 10s */
 
 #define DRAM_PHYS_DEFAULT_SIZE		0x100000000ull	/* 4GB */
 
@@ -119,9 +122,15 @@ enum goya_fw_component {
 
 struct goya_device {
 	int (*test_cpu_queue)(struct hl_device *hdev);
+	int (*armcp_info_get)(struct hl_device *hdev);
 
 	/* TODO: remove hw_queues_lock after moving to scheduler code */
 	spinlock_t	hw_queues_lock;
+
+	u64		mme_clk;
+	u64		tpc_clk;
+	u64		ic_clk;
+
 	u64		ddr_bar_cur_addr;
 	u32		events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
 	u32		hw_cap_initialized;
@@ -130,6 +139,18 @@ struct goya_device {
 int goya_test_cpu_queue(struct hl_device *hdev);
 int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 				u32 timeout, long *result);
+long goya_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
+void goya_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
+			long value);
+void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq);
+int goya_add_device_attr(struct hl_device *hdev);
+void goya_remove_device_attr(struct hl_device *hdev);
 void goya_init_security(struct hl_device *hdev);
+u64 goya_get_max_power(struct hl_device *hdev);
+void goya_set_max_power(struct hl_device *hdev, u64 value);
 
 #endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
new file mode 100644
index 000000000000..866d1774b2e4
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "goyaP.h"
+
+void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	switch (freq) {
+	case PLL_HIGH:
+		hl_set_frequency(hdev, MME_PLL, hdev->high_pll);
+		hl_set_frequency(hdev, TPC_PLL, hdev->high_pll);
+		hl_set_frequency(hdev, IC_PLL, hdev->high_pll);
+		break;
+	case PLL_LOW:
+		hl_set_frequency(hdev, MME_PLL, GOYA_PLL_FREQ_LOW);
+		hl_set_frequency(hdev, TPC_PLL, GOYA_PLL_FREQ_LOW);
+		hl_set_frequency(hdev, IC_PLL, GOYA_PLL_FREQ_LOW);
+		break;
+	case PLL_LAST:
+		hl_set_frequency(hdev, MME_PLL, goya->mme_clk);
+		hl_set_frequency(hdev, TPC_PLL, goya->tpc_clk);
+		hl_set_frequency(hdev, IC_PLL, goya->ic_clk);
+		break;
+	default:
+		dev_err(hdev->dev, "unknown frequency setting\n");
+	}
+}
+
+static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, MME_PLL, false);
+
+	if (value < 0)
+		return value;
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", value);
+}
+
+static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+	unsigned long value;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto fail;
+	}
+
+	if (hdev->pm_mng_profile == PM_AUTO) {
+		count = -EPERM;
+		goto fail;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto fail;
+	}
+
+	hl_set_frequency(hdev, MME_PLL, value);
+	goya->mme_clk = value;
+
+fail:
+	return count;
+}
+
+static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, TPC_PLL, false);
+
+	if (value < 0)
+		return value;
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", value);
+}
+
+static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+	unsigned long value;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto fail;
+	}
+
+	if (hdev->pm_mng_profile == PM_AUTO) {
+		count = -EPERM;
+		goto fail;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto fail;
+	}
+
+	hl_set_frequency(hdev, TPC_PLL, value);
+	goya->tpc_clk = value;
+
+fail:
+	return count;
+}
+
+static ssize_t ic_clk_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, IC_PLL, false);
+
+	if (value < 0)
+		return value;
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", value);
+}
+
+static ssize_t ic_clk_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+	unsigned long value;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto fail;
+	}
+
+	if (hdev->pm_mng_profile == PM_AUTO) {
+		count = -EPERM;
+		goto fail;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto fail;
+	}
+
+	hl_set_frequency(hdev, IC_PLL, value);
+	goya->ic_clk = value;
+
+fail:
+	return count;
+}
+
+static ssize_t mme_clk_curr_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, MME_PLL, true);
+
+	if (value < 0)
+		return value;
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", value);
+}
+
+static ssize_t tpc_clk_curr_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, TPC_PLL, true);
+
+	if (value < 0)
+		return value;
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", value);
+}
+
+static ssize_t ic_clk_curr_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, IC_PLL, true);
+
+	if (value < 0)
+		return value;
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", value);
+}
+
+static DEVICE_ATTR_RW(mme_clk);
+static DEVICE_ATTR_RW(tpc_clk);
+static DEVICE_ATTR_RW(ic_clk);
+static DEVICE_ATTR_RO(mme_clk_curr);
+static DEVICE_ATTR_RO(tpc_clk_curr);
+static DEVICE_ATTR_RO(ic_clk_curr);
+
+int goya_add_device_attr(struct hl_device *hdev)
+{
+	int rc;
+
+	rc = device_create_file(hdev->dev, &dev_attr_mme_clk);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file mme_clk\n");
+		return rc;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_tpc_clk);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file tpc_clk\n");
+		goto remove_mme_clk;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_ic_clk);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file ic_clk\n");
+		goto remove_tpc_clk;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_mme_clk_curr);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file mme_clk_curr\n");
+		goto remove_ic_clk;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_tpc_clk_curr);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file tpc_clk_curr\n");
+		goto remove_mme_clk_curr;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_ic_clk_curr);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file ic_clk_curr\n");
+		goto remove_tpc_clk_curr;
+	}
+
+	return 0;
+
+remove_tpc_clk_curr:
+	device_remove_file(hdev->dev, &dev_attr_tpc_clk_curr);
+remove_mme_clk_curr:
+	device_remove_file(hdev->dev, &dev_attr_mme_clk_curr);
+remove_ic_clk:
+	device_remove_file(hdev->dev, &dev_attr_ic_clk);
+remove_tpc_clk:
+	device_remove_file(hdev->dev, &dev_attr_tpc_clk);
+remove_mme_clk:
+	device_remove_file(hdev->dev, &dev_attr_mme_clk);
+	return rc;
+}
+
+void goya_remove_device_attr(struct hl_device *hdev)
+{
+	device_remove_file(hdev->dev, &dev_attr_ic_clk_curr);
+	device_remove_file(hdev->dev, &dev_attr_tpc_clk_curr);
+	device_remove_file(hdev->dev, &dev_attr_mme_clk_curr);
+	device_remove_file(hdev->dev, &dev_attr_ic_clk);
+	device_remove_file(hdev->dev, &dev_attr_tpc_clk);
+	device_remove_file(hdev->dev, &dev_attr_mme_clk);
+}
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 899bf98eb002..49b84b3ff864 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -25,6 +25,8 @@
 
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
 
+#define HL_PLL_LOW_JOB_FREQ_USEC	5000000 /* 5 s */
+
 #define HL_MAX_QUEUES			128
 
 struct hl_device;
@@ -60,6 +62,8 @@ struct hw_queue_properties {
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
  * @hw_queues_props: H/W queues properties.
+ * @armcp_info: various information received from ArmCP regarding the H/W, e.g.
+ *		available sensors.
  * @uboot_ver: F/W U-boot version.
  * @preboot_ver: F/W Preboot version.
  * @sram_base_address: SRAM physical start address.
@@ -72,6 +76,7 @@ struct hw_queue_properties {
  * @dram_pci_bar_size: size of PCI bar towards DRAM.
  * @host_phys_base_address: base physical address of host memory for
  *				transactions that the device generates.
+ * @max_power_default: max power of the device after reset.
  * @va_space_host_start_address: base address of virtual memory range for
  *                               mapping host memory.
  * @va_space_host_end_address: end address of virtual memory range for
@@ -84,6 +89,10 @@ struct hw_queue_properties {
  * @sram_size: total size of SRAM.
  * @max_asid: maximum number of open contexts (ASIDs).
  * @num_of_events: number of possible internal H/W IRQs.
+ * @psoc_pci_pll_nr: PCI PLL NR value.
+ * @psoc_pci_pll_nf: PCI PLL NF value.
+ * @psoc_pci_pll_od: PCI PLL OD value.
+ * @psoc_pci_pll_div_factor: PCI PLL DIV FACTOR 1 value.
  * @completion_queues_count: number of completion queues.
  * @high_pll: high PLL frequency used by the device.
  * @cb_pool_cb_cnt: number of CBs in the CB pool.
@@ -92,6 +101,7 @@ struct hw_queue_properties {
  */
 struct asic_fixed_properties {
 	struct hw_queue_properties	hw_queues_props[HL_MAX_QUEUES];
+	struct armcp_info	armcp_info;
 	char			uboot_ver[VERSION_MAX_LEN];
 	char			preboot_ver[VERSION_MAX_LEN];
 	u64			sram_base_address;
@@ -103,6 +113,7 @@ struct asic_fixed_properties {
 	u64			dram_size;
 	u64			dram_pci_bar_size;
 	u64			host_phys_base_address;
+	u64			max_power_default;
 	u64			va_space_host_start_address;
 	u64			va_space_host_end_address;
 	u64			va_space_dram_start_address;
@@ -111,6 +122,10 @@ struct asic_fixed_properties {
 	u32			sram_size;
 	u32			max_asid;
 	u32			num_of_events;
+	u32			psoc_pci_pll_nr;
+	u32			psoc_pci_pll_nf;
+	u32			psoc_pci_pll_od;
+	u32			psoc_pci_pll_div_factor;
 	u32			high_pll;
 	u32			cb_pool_cb_cnt;
 	u32			cb_pool_cb_size;
@@ -296,13 +311,37 @@ enum hl_asic_type {
 };
 
 
+/**
+ * enum hl_pm_mng_profile - power management profile.
+ * @PM_AUTO: internal clock is set by KMD.
+ * @PM_MANUAL: internal clock is set by the user.
+ * @PM_LAST: last power management type.
+ */
+enum hl_pm_mng_profile {
+	PM_AUTO = 1,
+	PM_MANUAL,
+	PM_LAST
+};
 
+/**
+ * enum hl_pll_frequency - PLL frequency.
+ * @PLL_HIGH: high frequency.
+ * @PLL_LOW: low frequency.
+ * @PLL_LAST: last frequency values that were configured by the user.
+ */
+enum hl_pll_frequency {
+	PLL_HIGH = 1,
+	PLL_LOW,
+	PLL_LAST
+};
 
 /**
  * struct hl_asic_funcs - ASIC specific functions that are can be called from
  *                        common code.
  * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
  * @early_fini: tears down what was done in early_init.
+ * @late_init: sets up late driver/hw state (post hw_init) - Optional.
+ * @late_fini: tears down what was done in late_init (pre hw_fini) - Optional.
  * @sw_init: sets up driver state, does not configure H/W.
  * @sw_fini: tears down driver state, does not configure H/W.
  * @hw_init: sets up the H/W state.
@@ -326,15 +365,23 @@ enum hl_asic_type {
  * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
  * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
  * @update_eq_ci: update event queue CI.
+ * @add_device_attr: add ASIC specific device attributes.
+ * @remove_device_attr: remove ASIC specific device attributes.
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
+ * @set_pll_profile: change PLL frequency profile (high/low/last).
  * @get_events_stat: retrieve event queue entries histogram.
+ * @enable_clock_gating: enable clock gating for reducing power consumption.
+ * @disable_clock_gating: disable clock gating for accessing registers on HBW.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
+ * @get_eeprom_data: retrieve EEPROM data from F/W.
  * @send_cpu_message: send buffer to ArmCP.
  */
 struct hl_asic_funcs {
 	int (*early_init)(struct hl_device *hdev);
 	int (*early_fini)(struct hl_device *hdev);
+	int (*late_init)(struct hl_device *hdev);
+	void (*late_fini)(struct hl_device *hdev);
 	int (*sw_init)(struct hl_device *hdev);
 	int (*sw_fini)(struct hl_device *hdev);
 	int (*hw_init)(struct hl_device *hdev);
@@ -363,11 +410,19 @@ struct hl_asic_funcs {
 	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
 				size_t size, void *vaddr);
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
+	int (*add_device_attr)(struct hl_device *hdev);
+	void (*remove_device_attr)(struct hl_device *hdev);
 	void (*handle_eqe)(struct hl_device *hdev,
 				struct hl_eq_entry *eq_entry);
+	void (*set_pll_profile)(struct hl_device *hdev,
+			enum hl_pll_frequency freq);
 	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
+	void (*enable_clock_gating)(struct hl_device *hdev);
+	void (*disable_clock_gating)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
+	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
+				size_t max_size);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
 				u16 len, u32 timeout, long *result);
 };
@@ -496,6 +551,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @rmmio: configuration area address on SRAM.
  * @cdev: related char device.
  * @dev: realted kernel basic device structure.
+ * @work_freq: delayed work to lower device frequency if possible.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @completion_queue: array of hl_cq.
@@ -517,13 +573,23 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @hwmon_dev: H/W monitor device.
+ * @pm_mng_profile: current power management profile.
+ * @hl_chip_info: ASIC's sensors information.
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
+ * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open context executing.
+ * @max_power: the max power of the device, as configured by the sysadmin. This
+ *             value is saved so that, in case of a hard reset, KMD restores
+ *             it and updates the F/W after re-initialization.
  * @major: habanalabs KMD major.
+ * @high_pll: high PLL profile frequency.
  * @id: device minor.
  * @disabled: is device disabled.
+ * @late_init_done: whether the late init stage was done during initialization.
+ * @hwmon_initialized: whether the H/W monitor sensors were initialized.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -531,6 +597,7 @@ struct hl_device {
 	void __iomem			*rmmio;
 	struct cdev			cdev;
 	struct device			*dev;
+	struct delayed_work		work_freq;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
@@ -553,16 +620,25 @@ struct hl_device {
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+	struct device			*hwmon_dev;
+	enum hl_pm_mng_profile		pm_mng_profile;
+	struct hwmon_chip_info		hl_chip_info;
 
 	struct list_head		cb_pool;
 	spinlock_t			cb_pool_lock;
 
 	/* TODO: The following fields should be moved for multi-context */
 	struct hl_ctx			*user_ctx;
+
+	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
+	u64				max_power;
 	u32				major;
+	u32				high_pll;
 	u16				id;
 	u8				disabled;
+	u8				late_init_done;
+	u8				hwmon_initialized;
 
 	/* Parameters for bring-up */
 	u8				cpu_enable;
@@ -647,6 +723,15 @@ int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
+int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
+int hl_build_hwmon_channel_info(struct hl_device *hdev,
+		struct armcp_sensor *sensors_arr);
+
+int hl_sysfs_init(struct hl_device *hdev);
+void hl_sysfs_fini(struct hl_device *hdev);
+
+int hl_hwmon_init(struct hl_device *hdev);
+void hl_hwmon_fini(struct hl_device *hdev);
 
 int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
 		u64 *handle, int ctx_id);
@@ -663,6 +748,18 @@ int hl_cb_pool_fini(struct hl_device *hdev);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
+void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
+long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
+void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
+			long value);
+u64 hl_get_max_power(struct hl_device *hdev);
+void hl_set_max_power(struct hl_device *hdev, u64 value);
+
 /* IOCTLs */
 long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index b64f58ad0f5d..47a9ab458b43 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -134,6 +134,13 @@ int hl_device_open(struct inode *inode, struct file *filp)
 
 	hpriv->taskpid = find_get_pid(current->pid);
 
+	/*
+	 * Device is IDLE at this point so it is legal to change PLLs. There
+	 * is no need to check anything because if the PLL is already HIGH, the
+	 * set function will return without doing anything.
+	 */
+	hl_device_set_frequency(hdev, PLL_HIGH);
+
 	return 0;
 
 out_err:
diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
new file mode 100644
index 000000000000..6ca0decb7490
--- /dev/null
+++ b/drivers/misc/habanalabs/hwmon.c
@@ -0,0 +1,449 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#define SENSORS_PKT_TIMEOUT		100000	/* 100ms */
+#define HWMON_NR_SENSOR_TYPES		(hwmon_pwm + 1)
+
+int hl_build_hwmon_channel_info(struct hl_device *hdev,
+				struct armcp_sensor *sensors_arr)
+{
+	u32 counts[HWMON_NR_SENSOR_TYPES] = {0};
+	u32 *sensors_by_type[HWMON_NR_SENSOR_TYPES] = {0};
+	u32 sensors_by_type_next_index[HWMON_NR_SENSOR_TYPES] = {0};
+	struct hwmon_channel_info **channels_info;
+	u32 num_sensors_for_type, num_active_sensor_types = 0,
+			arr_size = 0, *curr_arr;
+	enum hwmon_sensor_types type;
+	int rc, i, j;
+
+	for (i = 0 ; i < ARMCP_MAX_SENSORS ; i++) {
+		type = sensors_arr[i].type;
+
+		if ((type == 0) && (sensors_arr[i].flags == 0))
+			break;
+
+		if (type >= HWMON_NR_SENSOR_TYPES) {
+			dev_err(hdev->dev,
+				"Got wrong sensor type %d from device\n", type);
+			return -EINVAL;
+		}
+
+		counts[type]++;
+		arr_size++;
+	}
+
+	for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
+		if (counts[i] == 0)
+			continue;
+
+		num_sensors_for_type = counts[i] + 1;
+		curr_arr = kcalloc(num_sensors_for_type, sizeof(*curr_arr),
+				GFP_KERNEL);
+		if (!curr_arr) {
+			rc = -ENOMEM;
+			goto sensors_type_err;
+		}
+
+		num_active_sensor_types++;
+		sensors_by_type[i] = curr_arr;
+	}
+
+	for (i = 0 ; i < arr_size ; i++) {
+		type = sensors_arr[i].type;
+		curr_arr = sensors_by_type[type];
+		curr_arr[sensors_by_type_next_index[type]++] =
+				sensors_arr[i].flags;
+	}
+
+	channels_info = kcalloc(num_active_sensor_types + 1,
+			sizeof(*channels_info), GFP_KERNEL);
+	if (!channels_info) {
+		rc = -ENOMEM;
+		goto channels_info_array_err;
+	}
+
+	for (i = 0 ; i < num_active_sensor_types ; i++) {
+		channels_info[i] = kzalloc(sizeof(*channels_info[i]),
+				GFP_KERNEL);
+		if (!channels_info[i]) {
+			rc = -ENOMEM;
+			goto channel_info_err;
+		}
+	}
+
+	for (i = 0, j = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
+		if (!sensors_by_type[i])
+			continue;
+
+		channels_info[j]->type = i;
+		channels_info[j]->config = sensors_by_type[i];
+		j++;
+	}
+
+	hdev->hl_chip_info.info =
+			(const struct hwmon_channel_info **)channels_info;
+
+	return 0;
+
+channel_info_err:
+	for (i = 0 ; i < num_active_sensor_types ; i++)
+		if (channels_info[i]) {
+			kfree(channels_info[i]->config);
+			kfree(channels_info[i]);
+		}
+	kfree(channels_info);
+channels_info_array_err:
+sensors_type_err:
+	for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++)
+		kfree(sensors_by_type[i]);
+
+	return rc;
+}
+
+static int hl_read(struct device *dev, enum hwmon_sensor_types type,
+			u32 attr, int channel, long *val)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	switch (type) {
+	case hwmon_temp:
+		switch (attr) {
+		case hwmon_temp_input:
+		case hwmon_temp_max:
+		case hwmon_temp_crit:
+		case hwmon_temp_max_hyst:
+		case hwmon_temp_crit_hyst:
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		*val = hl_get_temperature(hdev, channel, attr);
+		break;
+	case hwmon_in:
+		switch (attr) {
+		case hwmon_in_input:
+		case hwmon_in_min:
+		case hwmon_in_max:
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		*val = hl_get_voltage(hdev, channel, attr);
+		break;
+	case hwmon_curr:
+		switch (attr) {
+		case hwmon_curr_input:
+		case hwmon_curr_min:
+		case hwmon_curr_max:
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		*val = hl_get_current(hdev, channel, attr);
+		break;
+	case hwmon_fan:
+		switch (attr) {
+		case hwmon_fan_input:
+		case hwmon_fan_min:
+		case hwmon_fan_max:
+			break;
+		default:
+			return -EINVAL;
+		}
+		*val = hl_get_fan_speed(hdev, channel, attr);
+		break;
+	case hwmon_pwm:
+		switch (attr) {
+		case hwmon_pwm_input:
+		case hwmon_pwm_enable:
+			break;
+		default:
+			return -EINVAL;
+		}
+		*val = hl_get_pwm_info(hdev, channel, attr);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int hl_write(struct device *dev, enum hwmon_sensor_types type,
+			u32 attr, int channel, long val)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	switch (type) {
+	case hwmon_pwm:
+		switch (attr) {
+		case hwmon_pwm_input:
+		case hwmon_pwm_enable:
+			break;
+		default:
+			return -EINVAL;
+		}
+		hl_set_pwm_info(hdev, channel, attr, val);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static umode_t hl_is_visible(const void *data, enum hwmon_sensor_types type,
+				u32 attr, int channel)
+{
+	switch (type) {
+	case hwmon_temp:
+		switch (attr) {
+		case hwmon_temp_input:
+		case hwmon_temp_max:
+		case hwmon_temp_max_hyst:
+		case hwmon_temp_crit:
+		case hwmon_temp_crit_hyst:
+			return 0444;
+		}
+		break;
+	case hwmon_in:
+		switch (attr) {
+		case hwmon_in_input:
+		case hwmon_in_min:
+		case hwmon_in_max:
+			return 0444;
+		}
+		break;
+	case hwmon_curr:
+		switch (attr) {
+		case hwmon_curr_input:
+		case hwmon_curr_min:
+		case hwmon_curr_max:
+			return 0444;
+		}
+		break;
+	case hwmon_fan:
+		switch (attr) {
+		case hwmon_fan_input:
+		case hwmon_fan_min:
+		case hwmon_fan_max:
+			return 0444;
+		}
+		break;
+	case hwmon_pwm:
+		switch (attr) {
+		case hwmon_pwm_input:
+		case hwmon_pwm_enable:
+			return 0644;
+		}
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+
+static const struct hwmon_ops hl_hwmon_ops = {
+	.is_visible = hl_is_visible,
+	.read = hl_read,
+	.write = hl_write
+};
+
+long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_TEMPERATURE_GET;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get temperature from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_VOLTAGE_GET;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get voltage from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_CURRENT_GET;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get current from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_FAN_SPEED_GET;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get fan speed from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_PWM_GET;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get pwm info from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
+			long value)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_PWM_SET;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+	pkt.value = value;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, NULL);
+
+	if (rc)
+		dev_err(hdev->dev,
+			"Failed to set pwm info to sensor %d, error %d\n",
+			sensor_index, rc);
+}
+
+int hl_hwmon_init(struct hl_device *hdev)
+{
+	struct device *dev = hdev->pdev ? &hdev->pdev->dev : hdev->dev;
+	int rc;
+
+	if ((hdev->hwmon_initialized) || !(hdev->fw_loading))
+		return 0;
+
+	if (hdev->hl_chip_info.info) {
+		hdev->hl_chip_info.ops = &hl_hwmon_ops;
+
+		hdev->hwmon_dev = hwmon_device_register_with_info(dev,
+				"habanalabs", hdev, &hdev->hl_chip_info, NULL);
+		if (IS_ERR(hdev->hwmon_dev)) {
+			rc = PTR_ERR(hdev->hwmon_dev);
+			dev_err(hdev->dev,
+				"Unable to register hwmon device: %d\n", rc);
+			return rc;
+		}
+
+		dev_info(hdev->dev, "%s: add sensors information\n",
+			dev_name(hdev->hwmon_dev));
+
+		hdev->hwmon_initialized = true;
+	} else {
+		dev_info(hdev->dev, "no available sensors\n");
+	}
+
+	return 0;
+}
+
+void hl_hwmon_fini(struct hl_device *hdev)
+{
+	if (!hdev->hwmon_initialized)
+		return;
+
+	hwmon_device_unregister(hdev->hwmon_dev);
+}
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
new file mode 100644
index 000000000000..edd5f7159de0
--- /dev/null
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -0,0 +1,588 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+#include "include/habanalabs_device_if.h"
+
+#include <linux/hwmon-sysfs.h>
+#include <linux/hwmon.h>
+
+#define SET_CLK_PKT_TIMEOUT	200000	/* 200ms */
+#define SET_PWR_PKT_TIMEOUT	400000	/* 400ms */
+
+long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	if (curr)
+		pkt.opcode = ARMCP_PACKET_FREQUENCY_CURR_GET;
+	else
+		pkt.opcode = ARMCP_PACKET_FREQUENCY_GET;
+	pkt.pll_index = pll_index;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+						SET_CLK_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get frequency of PLL %d, error %d\n",
+			pll_index, rc);
+		result = rc;
+	}
+
+	return result;
+}
+
+void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_FREQUENCY_SET;
+	pkt.pll_index = pll_index;
+	pkt.value = freq;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SET_CLK_PKT_TIMEOUT, NULL);
+
+	if (rc)
+		dev_err(hdev->dev,
+			"Failed to set frequency to PLL %d, error %d\n",
+			pll_index, rc);
+}
+
+u64 hl_get_max_power(struct hl_device *hdev)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_MAX_POWER_GET;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+						SET_PWR_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to get max power, error %d\n", rc);
+		result = rc;
+	}
+
+	return result;
+}
+
+void hl_set_max_power(struct hl_device *hdev, u64 value)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_MAX_POWER_SET;
+	pkt.value = value;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SET_PWR_PKT_TIMEOUT, NULL);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to set max power, error %d\n", rc);
+}
+
+static ssize_t pm_mng_profile_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	return snprintf(buf, PAGE_SIZE, "%s\n",
+			(hdev->pm_mng_profile == PM_AUTO) ? "auto" :
+			(hdev->pm_mng_profile == PM_MANUAL) ? "manual" :
+			"unknown");
+}
+
+static ssize_t pm_mng_profile_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto out;
+	}
+
+	mutex_lock(&hdev->device_open);
+
+	if (atomic_read(&hdev->fd_open_cnt) > 0) {
+		dev_err(hdev->dev,
+			"Can't change PM profile while user process is opened on the device\n");
+		count = -EPERM;
+		goto unlock_mutex;
+	}
+
+	if (strncmp("auto", buf, strlen("auto")) == 0) {
+		/* Make sure we are in LOW PLL when changing modes */
+		if (hdev->pm_mng_profile == PM_MANUAL) {
+			atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
+			hl_device_set_frequency(hdev, PLL_LOW);
+			hdev->pm_mng_profile = PM_AUTO;
+		}
+	} else if (strncmp("manual", buf, strlen("manual")) == 0) {
+		/* Make sure we are in LOW PLL when changing modes */
+		if (hdev->pm_mng_profile == PM_AUTO) {
+			flush_delayed_work(&hdev->work_freq);
+			hdev->pm_mng_profile = PM_MANUAL;
+		}
+	} else {
+		dev_err(hdev->dev, "value should be auto or manual\n");
+		count = -EINVAL;
+		goto unlock_mutex;
+	}
+
+unlock_mutex:
+	mutex_unlock(&hdev->device_open);
+out:
+	return count;
+}
+
+static ssize_t high_pll_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", hdev->high_pll);
+}
+
+static ssize_t high_pll_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	unsigned long value;
+	int rc;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto out;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hdev->high_pll = value;
+
+out:
+	return count;
+}
+
+static ssize_t uboot_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.uboot_ver);
+}
+
+static ssize_t armcp_kernel_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%s",
+			hdev->asic_prop.armcp_info.kernel_version);
+}
+
+static ssize_t armcp_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%s\n",
+			hdev->asic_prop.armcp_info.armcp_version);
+}
+
+static ssize_t cpld_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "0x%08x\n",
+			hdev->asic_prop.armcp_info.cpld_version);
+}
+
+static ssize_t infineon_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "0x%04x\n",
+			hdev->asic_prop.armcp_info.infineon_version);
+}
+
+static ssize_t fuse_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%s\n",
+			hdev->asic_prop.armcp_info.fuse_version);
+}
+
+static ssize_t thermal_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%s",
+			hdev->asic_prop.armcp_info.thermal_version);
+}
+
+static ssize_t preboot_btl_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.preboot_ver);
+}
+
+static ssize_t device_type_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	char *str;
+
+	switch (hdev->asic_type) {
+	case ASIC_GOYA:
+		str = "GOYA";
+		break;
+	default:
+		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
+				hdev->asic_type);
+		return -EINVAL;
+	}
+
+	return snprintf(buf, PAGE_SIZE, "%s\n", str);
+}
+
+static ssize_t pci_addr_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	/* Use dummy, fixed address for simulator */
+	if (!hdev->pdev)
+		return snprintf(buf, PAGE_SIZE, "0000:%02d:00.0\n", hdev->id);
+
+	return snprintf(buf, PAGE_SIZE, "%04x:%02x:%02x.%x\n",
+			pci_domain_nr(hdev->pdev->bus),
+			hdev->pdev->bus->number,
+			PCI_SLOT(hdev->pdev->devfn),
+			PCI_FUNC(hdev->pdev->devfn));
+}
+
+static ssize_t status_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	char *str;
+
+	if (hdev->disabled)
+		str = "Malfunction";
+	else
+		str = "Operational";
+
+	return snprintf(buf, PAGE_SIZE, "%s\n", str);
+}
+
+static ssize_t write_open_cnt_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", hdev->user_ctx ? 1 : 0);
+}
+
+static ssize_t max_power_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long val;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	val = hl_get_max_power(hdev);
+
+	return snprintf(buf, PAGE_SIZE, "%ld\n", val);
+}
+
+static ssize_t max_power_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	unsigned long value;
+	int rc;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto out;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hdev->max_power = value;
+	hl_set_max_power(hdev, value);
+
+out:
+	return count;
+}
+
+static ssize_t eeprom_read_handler(struct file *filp, struct kobject *kobj,
+			struct bin_attribute *attr, char *buf, loff_t offset,
+			size_t max_size)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	char *data;
+	int rc;
+
+	if (!max_size)
+		return -EINVAL;
+
+	data = kzalloc(max_size, GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	rc = hdev->asic_funcs->get_eeprom_data(hdev, data, max_size);
+	if (rc)
+		goto out;
+
+	memcpy(buf, data, max_size);
+
+out:
+	kfree(data);
+
+	return rc ? rc : max_size;
+}
+
+static DEVICE_ATTR_RW(pm_mng_profile);
+static DEVICE_ATTR_RW(high_pll);
+static DEVICE_ATTR_RO(uboot_ver);
+static DEVICE_ATTR_RO(armcp_kernel_ver);
+static DEVICE_ATTR_RO(armcp_ver);
+static DEVICE_ATTR_RO(cpld_ver);
+static DEVICE_ATTR_RO(infineon_ver);
+static DEVICE_ATTR_RO(fuse_ver);
+static DEVICE_ATTR_RO(thermal_ver);
+static DEVICE_ATTR_RO(preboot_btl_ver);
+static DEVICE_ATTR_RO(device_type);
+static DEVICE_ATTR_RO(pci_addr);
+static DEVICE_ATTR_RO(status);
+static DEVICE_ATTR_RO(write_open_cnt);
+static DEVICE_ATTR_RW(max_power);
+
+static const struct bin_attribute bin_attr_eeprom = {
+	.attr = {.name = "eeprom", .mode = (0444)},
+	.size = PAGE_SIZE,
+	.read = eeprom_read_handler
+};
+
+int hl_sysfs_init(struct hl_device *hdev)
+{
+	int rc;
+
+	rc = hdev->asic_funcs->add_device_attr(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to add device attributes\n");
+		return rc;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_pm_mng_profile);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file pm_mng_profile\n");
+		goto remove_device_attr;
+	}
+
+	hdev->pm_mng_profile = PM_AUTO;
+
+	rc = device_create_file(hdev->dev, &dev_attr_high_pll);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file high_pll\n");
+		goto remove_pm_mng_profile;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_uboot_ver);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file uboot_ver\n");
+		goto remove_pll_profile;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_armcp_kernel_ver);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file armcp_kernel_ver\n");
+		goto remove_uboot_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_armcp_ver);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file armcp_ver\n");
+		goto remove_armcp_kernel_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_cpld_ver);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file cpld_ver\n");
+		goto remove_armcp_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_infineon_ver);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file infineon_ver\n");
+		goto remove_cpld_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_fuse_ver);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file fuse_ver\n");
+		goto remove_infineon_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_thermal_ver);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file thermal_ver\n");
+		goto remove_fuse_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_preboot_btl_ver);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file preboot_btl_ver\n");
+		goto remove_thermal_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_device_type);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file device_type\n");
+		goto remove_preboot_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_pci_addr);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file pci_addr\n");
+		goto remove_device_type;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_status);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file status\n");
+		goto remove_pci_addr;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_write_open_cnt);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file write_open_cnt\n");
+		goto remove_status;
+	}
+
+	hdev->max_power = hdev->asic_prop.max_power_default;
+
+	rc = device_create_file(hdev->dev, &dev_attr_max_power);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file max_power\n");
+		goto remove_write_open_cnt;
+	}
+
+	rc = sysfs_create_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create EEPROM sysfs entry\n");
+		goto remove_attr_max_power;
+	}
+
+	return 0;
+
+remove_attr_max_power:
+	device_remove_file(hdev->dev, &dev_attr_max_power);
+remove_write_open_cnt:
+	device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
+remove_status:
+	device_remove_file(hdev->dev, &dev_attr_status);
+remove_pci_addr:
+	device_remove_file(hdev->dev, &dev_attr_pci_addr);
+remove_device_type:
+	device_remove_file(hdev->dev, &dev_attr_device_type);
+remove_preboot_ver:
+	device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
+remove_thermal_ver:
+	device_remove_file(hdev->dev, &dev_attr_thermal_ver);
+remove_fuse_ver:
+	device_remove_file(hdev->dev, &dev_attr_fuse_ver);
+remove_infineon_ver:
+	device_remove_file(hdev->dev, &dev_attr_infineon_ver);
+remove_cpld_ver:
+	device_remove_file(hdev->dev, &dev_attr_cpld_ver);
+remove_armcp_ver:
+	device_remove_file(hdev->dev, &dev_attr_armcp_ver);
+remove_armcp_kernel_ver:
+	device_remove_file(hdev->dev, &dev_attr_armcp_kernel_ver);
+remove_uboot_ver:
+	device_remove_file(hdev->dev, &dev_attr_uboot_ver);
+remove_pll_profile:
+	device_remove_file(hdev->dev, &dev_attr_high_pll);
+remove_pm_mng_profile:
+	device_remove_file(hdev->dev, &dev_attr_pm_mng_profile);
+remove_device_attr:
+	hdev->asic_funcs->remove_device_attr(hdev);
+
+	return rc;
+}
+
+void hl_sysfs_fini(struct hl_device *hdev)
+{
+	sysfs_remove_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
+	device_remove_file(hdev->dev, &dev_attr_max_power);
+	device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
+	device_remove_file(hdev->dev, &dev_attr_status);
+	device_remove_file(hdev->dev, &dev_attr_pci_addr);
+	device_remove_file(hdev->dev, &dev_attr_device_type);
+	device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
+	device_remove_file(hdev->dev, &dev_attr_thermal_ver);
+	device_remove_file(hdev->dev, &dev_attr_fuse_ver);
+	device_remove_file(hdev->dev, &dev_attr_infineon_ver);
+	device_remove_file(hdev->dev, &dev_attr_cpld_ver);
+	device_remove_file(hdev->dev, &dev_attr_armcp_ver);
+	device_remove_file(hdev->dev, &dev_attr_armcp_kernel_ver);
+	device_remove_file(hdev->dev, &dev_attr_uboot_ver);
+	device_remove_file(hdev->dev, &dev_attr_high_pll);
+	device_remove_file(hdev->dev, &dev_attr_pm_mng_profile);
+	hdev->asic_funcs->remove_device_attr(hdev);
+}
-- 
2.17.1



* [PATCH 10/15] habanalabs: add device reset support
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (7 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 09/15] habanalabs: add sysfs and hwmon support Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-27  7:51   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 11/15] habanalabs: add command submission module Oded Gabbay
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds support for performing various on-the-fly resets of Goya.

The driver supports two types of resets:
1. soft-reset
2. hard-reset

Soft-reset is done when the device detects a timeout of a command
submission that was given to the device. The soft-reset process only resets
the engines that are relevant for the submission of compute jobs, i.e. the
DMA channels, the TPCs and the MME. The purpose is to bring the device as
fast as possible to a working state.

Hard-reset is done in several cases:
1. After soft-reset is done but the device is not responding
2. When fatal errors occur inside the device, e.g. ECC error
3. When the driver is removed

Hard-reset performs a reset of the entire chip except for the PCI
controller and the PLLs. It is a much longer process than soft-reset but it
helps to recover the device without the need to reboot the Host.
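
The escalation described above (try a soft-reset first, fall back to a
hard-reset if it does not recover the device, and allow only one reset at a
time) can be illustrated with a small user-space sketch. Everything in it is
hypothetical: struct sketch_dev, sketch_reset() and the two "works" knobs
exist only for illustration and are not part of the driver, and C11 atomics
stand in for the kernel's atomic_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the driver's struct. */
struct sketch_dev {
	atomic_int in_reset;		/* 1 while a reset is in flight */
	bool disabled;			/* true while the device is unusable */
	bool soft_reset_works;		/* knob: does a soft-reset recover? */
	bool hard_reset_works;		/* knob: does a hard-reset recover? */
	unsigned int soft_reset_cnt;	/* soft-reset attempts */
	unsigned int hard_reset_cnt;	/* hard-reset attempts */
};

/* Returns 0 on success (or if a reset is already running), -1 on failure. */
static int sketch_reset(struct sketch_dev *d, bool hard)
{
	int expected = 0;

	assert(d);

	/* Only one reset at a time; a second caller backs off quietly. */
	if (!atomic_compare_exchange_strong(&d->in_reset, &expected, 1))
		return 0;

	/* Block new work while the reset is running. */
	d->disabled = true;

	if (!hard) {
		d->soft_reset_cnt++;
		if (d->soft_reset_works)
			goto done;

		/* Soft-reset did not recover the device: escalate. */
		hard = true;
	}

	d->hard_reset_cnt++;
	if (!d->hard_reset_works) {
		/* Device stays disabled; release the guard and report. */
		atomic_store(&d->in_reset, 0);
		return -1;
	}

done:
	d->disabled = false;
	atomic_store(&d->in_reset, 0);
	return 0;
}
```

Calling sketch_reset(&dev, false) on a device whose soft-reset fails bumps
both counters and leaves the device enabled only if the hard-reset succeeds.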

After hard-reset, the driver restores the max power attribute and, in
case of manual power management, the frequencies that were set.
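
As a rough illustration of that restore step, the sketch below models a
hard-reset that wipes the device-side settings and then re-applies the values
the host cached. All names (struct sketch_state, sketch_hard_reset(), the
SKETCH_* constants and their values) are assumptions made for this example
and do not correspond to driver symbols.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_pm_profile { SKETCH_PM_AUTO, SKETCH_PM_MANUAL };

/* Arbitrary power-on defaults of the imaginary device. */
#define SKETCH_HW_DEFAULT_POWER	100
#define SKETCH_HW_DEFAULT_FREQ	50

/* Hypothetical state: host-side cached values plus the "hardware" side. */
struct sketch_state {
	enum sketch_pm_profile pm_profile;
	uint64_t max_power;	/* last max power set by the user (cached) */
	uint64_t manual_freq;	/* last frequency set in manual mode (cached) */
	uint64_t hw_max_power;	/* what the device currently uses */
	uint64_t hw_freq;	/* what the device currently uses */
};

static void sketch_hard_reset(struct sketch_state *s)
{
	assert(s);

	/* A hard-reset returns the device-side settings to their defaults. */
	s->hw_max_power = SKETCH_HW_DEFAULT_POWER;
	s->hw_freq = SKETCH_HW_DEFAULT_FREQ;

	/* Max power is always restored from the cached value. */
	s->hw_max_power = s->max_power;

	/* Frequencies are restored only under manual power management. */
	if (s->pm_profile == SKETCH_PM_MANUAL)
		s->hw_freq = s->manual_freq;
}
```

In automatic mode the frequency is left at its post-reset default, on the
assumption that the automatic policy will pick the frequency again.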

This patch also adds two entries to sysfs, which allow the root user
to initiate a soft or hard reset.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/command_buffer.c  |  11 +-
 drivers/misc/habanalabs/device.c          | 308 +++++++++++++++++++++-
 drivers/misc/habanalabs/goya/goya.c       | 201 ++++++++++++++
 drivers/misc/habanalabs/goya/goya_hwmgr.c |  18 +-
 drivers/misc/habanalabs/habanalabs.h      |  35 +++
 drivers/misc/habanalabs/habanalabs_drv.c  |   9 +-
 drivers/misc/habanalabs/hwmon.c           |   4 +-
 drivers/misc/habanalabs/irq.c             |  31 +++
 drivers/misc/habanalabs/sysfs.c           | 120 ++++++++-
 9 files changed, 712 insertions(+), 25 deletions(-)

diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
index 535ed6cc5bda..700c6da01188 100644
--- a/drivers/misc/habanalabs/command_buffer.c
+++ b/drivers/misc/habanalabs/command_buffer.c
@@ -81,9 +81,10 @@ int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
 	bool alloc_new_cb = true;
 	int rc;
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || ((atomic_read(&hdev->in_reset)) &&
+					(ctx_id != HL_KERNEL_ASID_ID))) {
 		dev_warn_ratelimited(hdev->dev,
-			"Device is disabled !!! Can't create new CBs\n");
+			"Device is disabled or in reset !!! Can't create new CBs\n");
 		rc = -EBUSY;
 		goto out_err;
 	}
@@ -187,6 +188,12 @@ int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
 	u64 handle;
 	int rc;
 
+	if (hdev->hard_reset_pending) {
+		dev_crit_ratelimited(hdev->dev,
+			"Device HARD reset pending !!! Please close FD\n");
+		return -ENODEV;
+	}
+
 	switch (args->in.op) {
 	case HL_CB_OP_CREATE:
 		rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index ff7b610f18c4..00fde57ce823 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -188,6 +188,7 @@ static int device_early_init(struct hl_device *hdev)
 
 	mutex_init(&hdev->device_open);
 	mutex_init(&hdev->send_cpu_message_lock);
+	atomic_set(&hdev->in_reset, 0);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
 	return 0;
@@ -238,6 +239,27 @@ static void set_freq_to_low_job(struct work_struct *work)
 			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
 }
 
+static void hl_device_heartbeat(struct work_struct *work)
+{
+	struct hl_device *hdev = container_of(work, struct hl_device,
+						work_heartbeat.work);
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
+		goto reschedule;
+
+	if (!hdev->asic_funcs->send_heartbeat(hdev))
+		goto reschedule;
+
+	dev_err(hdev->dev, "Device heartbeat failed !!!\n");
+	hl_device_reset(hdev, true, false);
+
+	return;
+
+reschedule:
+	schedule_delayed_work(&hdev->work_heartbeat,
+			usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
+}
+
 /**
  * device_late_init - do late stuff initialization for the habanalabs device
  *
@@ -273,6 +295,12 @@ static int device_late_init(struct hl_device *hdev)
 	schedule_delayed_work(&hdev->work_freq,
 			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
 
+	if (hdev->heartbeat) {
+		INIT_DELAYED_WORK(&hdev->work_heartbeat, hl_device_heartbeat);
+		schedule_delayed_work(&hdev->work_heartbeat,
+				usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
+	}
+
 	hdev->late_init_done = true;
 
 	return 0;
@@ -290,6 +318,8 @@ static void device_late_fini(struct hl_device *hdev)
 		return;
 
 	cancel_delayed_work_sync(&hdev->work_freq);
+	if (hdev->heartbeat)
+		cancel_delayed_work_sync(&hdev->work_heartbeat);
 
 	if (hdev->asic_funcs->late_fini)
 		hdev->asic_funcs->late_fini(hdev);
@@ -397,6 +427,254 @@ int hl_device_resume(struct hl_device *hdev)
 	return 0;
 }
 
+static void hl_device_hard_reset_pending(struct work_struct *work)
+{
+	struct hl_device_reset_work *device_reset_work =
+		container_of(work, struct hl_device_reset_work, reset_work);
+	struct hl_device *hdev = device_reset_work->hdev;
+	u16 pending_cnt = HL_PENDING_RESET_PER_SEC;
+	struct task_struct *task = NULL;
+
+	/* Flush all processes that are inside hl_open */
+	mutex_lock(&hdev->device_open);
+
+	while ((atomic_read(&hdev->fd_open_cnt)) && (pending_cnt)) {
+
+		pending_cnt--;
+
+		dev_info(hdev->dev,
+			"Can't HARD reset, waiting for user to close FD\n");
+		ssleep(1);
+	}
+
+	if (atomic_read(&hdev->fd_open_cnt)) {
+		task = get_pid_task(hdev->user_ctx->hpriv->taskpid,
+					PIDTYPE_PID);
+		if (task) {
+			dev_info(hdev->dev, "Killing user processes\n");
+			send_sig(SIGKILL, task, 1);
+			msleep(100);
+
+			put_task_struct(task);
+		}
+	}
+
+	mutex_unlock(&hdev->device_open);
+
+	hl_device_reset(hdev, true, true);
+
+	kfree(device_reset_work);
+}
+
+/**
+ * hl_device_reset - reset the device
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @hard_reset: should we do hard reset to all engines or just reset the
+ *              compute/dma engines
+ *
+ * Block future CS and wait for pending CS to be enqueued
+ * Call ASIC H/W fini
+ * Flush all completions
+ * Re-initialize all internal data structures
+ * Call ASIC H/W init, late_init
+ * Test queues
+ * Enable device
+ *
+ * Returns 0 for success or an error on failure.
+ */
+int hl_device_reset(struct hl_device *hdev, bool hard_reset,
+			bool from_hard_reset_thread)
+{
+	int i, rc;
+
+	/*
+	 * Prevent concurrency in this function - only one reset should be
+	 * done at any given time. Only need to perform this if we didn't
+	 * get here from the dedicated hard reset thread
+	 */
+	if (!from_hard_reset_thread) {
+		/* Block future CS/VM/JOB completion operations */
+		rc = atomic_cmpxchg(&hdev->in_reset, 0, 1);
+		if (rc)
+			return 0;
+
+		/* This also blocks future CS/VM/JOB completion operations */
+		hdev->disabled = true;
+
+		/*
+		 * Flush anyone that is inside the critical section of enqueue
+		 * jobs to the H/W
+		 */
+		hdev->asic_funcs->hw_queues_lock(hdev);
+		hdev->asic_funcs->hw_queues_unlock(hdev);
+
+		dev_err(hdev->dev, "Going to RESET device !!!\n");
+	}
+
+again:
+	if ((hard_reset) && (!from_hard_reset_thread)) {
+		struct hl_device_reset_work *device_reset_work;
+
+		if (!hdev->pdev) {
+			dev_err(hdev->dev,
+				"Reset action is NOT supported in simulator !!!\n");
+			rc = -EINVAL;
+			goto out_err;
+		}
+
+		hdev->hard_reset_pending = true;
+
+		device_reset_work = kzalloc(sizeof(*device_reset_work),
+						GFP_ATOMIC);
+		if (!device_reset_work) {
+			rc = -ENOMEM;
+			goto out_err;
+		}
+
+		/*
+		 * Because the reset function can't run from interrupt or
+		 * from heartbeat work, we need to call the reset function
+		 * from a dedicated work
+		 */
+		INIT_WORK(&device_reset_work->reset_work,
+				hl_device_hard_reset_pending);
+		device_reset_work->hdev = hdev;
+		schedule_work(&device_reset_work->reset_work);
+
+		return 0;
+	}
+
+	if (hard_reset) {
+		device_late_fini(hdev);
+
+		/*
+		 * Now that the heartbeat thread is closed, flush processes
+		 * which are sending messages to CPU
+		 */
+		mutex_lock(&hdev->send_cpu_message_lock);
+		mutex_unlock(&hdev->send_cpu_message_lock);
+	}
+
+	/*
+	 * Halt the engines and disable interrupts so we won't get any more
+	 * completions from H/W and we won't have any accesses from the
+	 * H/W to the host machine
+	 */
+	hdev->asic_funcs->halt_engines(hdev, hard_reset);
+
+	if (hard_reset) {
+		/* Release kernel context */
+		if (hl_ctx_put(hdev->kernel_ctx) != 1) {
+			dev_err(hdev->dev,
+				"kernel ctx is alive during hard reset\n");
+			rc = -EBUSY;
+			goto out_err;
+		}
+
+		hdev->kernel_ctx = NULL;
+	}
+
+	/* Reset the H/W. It will be in idle state after this returns */
+	hdev->asic_funcs->hw_fini(hdev, hard_reset);
+
+	if (hard_reset)
+		hl_eq_reset(hdev, &hdev->event_queue);
+
+	/* Re-initialize PI,CI to 0 in all queues (hw queue, cq) */
+	hl_hw_queue_reset(hdev, hard_reset);
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		hl_cq_reset(hdev, &hdev->completion_queue[i]);
+
+	/* Finished tear-down, starting to re-initialize */
+
+	if (hard_reset) {
+		/* Allocate the kernel context */
+		hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx),
+						GFP_KERNEL);
+		if (!hdev->kernel_ctx) {
+			rc = -ENOMEM;
+			goto out_err;
+		}
+
+		hdev->user_ctx = NULL;
+
+		rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to init kernel ctx in hard reset\n");
+			kfree(hdev->kernel_ctx);
+			hdev->kernel_ctx = NULL;
+			goto out_err;
+		}
+	}
+
+	rc = hdev->asic_funcs->hw_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to initialize the H/W after reset\n");
+		goto out_err;
+	}
+
+	hdev->disabled = false;
+
+	/* Check that the communication with the device is working */
+	rc = hdev->asic_funcs->test_queues(hdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to detect if device is alive after reset\n");
+		goto out_err;
+	}
+
+	if (hard_reset) {
+		rc = device_late_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed late init after hard reset\n");
+			goto out_err;
+		}
+
+		hl_set_max_power(hdev, hdev->max_power);
+
+		hdev->hard_reset_pending = false;
+	} else {
+		rc = hdev->asic_funcs->soft_reset_late_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed late init after soft reset\n");
+			goto out_err;
+		}
+	}
+
+	atomic_set(&hdev->in_reset, 0);
+
+	if (hard_reset)
+		hdev->hard_reset_cnt++;
+	else
+		hdev->soft_reset_cnt++;
+
+	return 0;
+
+out_err:
+	hdev->disabled = true;
+
+	if (hard_reset) {
+		dev_err(hdev->dev,
+			"Failed to reset. Device is NOT usable !!!\n");
+		hdev->hard_reset_cnt++;
+	} else {
+		dev_err(hdev->dev,
+			"Failed to do soft-reset, trying hard reset\n");
+		hdev->soft_reset_cnt++;
+		hard_reset = true;
+		goto again;
+	}
+
+	atomic_set(&hdev->in_reset, 0);
+
+	return rc;
+}
+
 /**
  * hl_device_init - main initialization function for habanalabs device
  *
@@ -410,6 +688,9 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 {
 	int i, rc, cq_ready_cnt;
 
+	/* Don't allow reset to run while we are in this function */
+	atomic_set(&hdev->in_reset, 1);
+
 	/* Create device */
 	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
 
@@ -544,6 +825,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
+	atomic_set(&hdev->in_reset, 0);
+
 	return 0;
 
 free_cb_pool:
@@ -579,6 +862,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		pr_err("habanalabs: Failed to initialize hl%d. Device is NOT usable !!!\n",
 			hdev->id);
 
+	atomic_set(&hdev->in_reset, 0);
+
 	return rc;
 }
 
@@ -591,9 +876,30 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
  */
 void hl_device_fini(struct hl_device *hdev)
 {
-	int i;
+	int i, rc;
+	ktime_t timeout;
+
 	dev_info(hdev->dev, "Removing device\n");
 
+	/*
+	 * This function is competing with the reset function, so try to
+	 * take the reset atomic and if we are already in middle of reset,
+	 * wait until reset function is finished. Reset function is designed
+	 * to always finish (could take up to a few seconds in worst case).
+	 */
+
+	timeout = ktime_add_us(ktime_get(),
+				HL_PENDING_RESET_PER_SEC * 1000 * 1000 * 4);
+	rc = atomic_cmpxchg(&hdev->in_reset, 0, 1);
+	while (rc) {
+		usleep_range(50, 200);
+		rc = atomic_cmpxchg(&hdev->in_reset, 0, 1);
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			WARN(1, "Failed to remove device because reset function did not finish\n");
+			return;
+		}
+	}
+
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 7899ff762e0b..ba9dd314c060 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -127,6 +127,130 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 
 #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
 
+static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
+	GOYA_ASYNC_EVENT_ID_PCIE_IF,
+	GOYA_ASYNC_EVENT_ID_TPC0_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC1_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC2_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC3_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC4_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC5_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC6_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC7_ECC,
+	GOYA_ASYNC_EVENT_ID_MME_ECC,
+	GOYA_ASYNC_EVENT_ID_MME_ECC_EXT,
+	GOYA_ASYNC_EVENT_ID_MMU_ECC,
+	GOYA_ASYNC_EVENT_ID_DMA_MACRO,
+	GOYA_ASYNC_EVENT_ID_DMA_ECC,
+	GOYA_ASYNC_EVENT_ID_CPU_IF_ECC,
+	GOYA_ASYNC_EVENT_ID_PSOC_MEM,
+	GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT,
+	GOYA_ASYNC_EVENT_ID_SRAM0,
+	GOYA_ASYNC_EVENT_ID_SRAM1,
+	GOYA_ASYNC_EVENT_ID_SRAM2,
+	GOYA_ASYNC_EVENT_ID_SRAM3,
+	GOYA_ASYNC_EVENT_ID_SRAM4,
+	GOYA_ASYNC_EVENT_ID_SRAM5,
+	GOYA_ASYNC_EVENT_ID_SRAM6,
+	GOYA_ASYNC_EVENT_ID_SRAM7,
+	GOYA_ASYNC_EVENT_ID_SRAM8,
+	GOYA_ASYNC_EVENT_ID_SRAM9,
+	GOYA_ASYNC_EVENT_ID_SRAM10,
+	GOYA_ASYNC_EVENT_ID_SRAM11,
+	GOYA_ASYNC_EVENT_ID_SRAM12,
+	GOYA_ASYNC_EVENT_ID_SRAM13,
+	GOYA_ASYNC_EVENT_ID_SRAM14,
+	GOYA_ASYNC_EVENT_ID_SRAM15,
+	GOYA_ASYNC_EVENT_ID_SRAM16,
+	GOYA_ASYNC_EVENT_ID_SRAM17,
+	GOYA_ASYNC_EVENT_ID_SRAM18,
+	GOYA_ASYNC_EVENT_ID_SRAM19,
+	GOYA_ASYNC_EVENT_ID_SRAM20,
+	GOYA_ASYNC_EVENT_ID_SRAM21,
+	GOYA_ASYNC_EVENT_ID_SRAM22,
+	GOYA_ASYNC_EVENT_ID_SRAM23,
+	GOYA_ASYNC_EVENT_ID_SRAM24,
+	GOYA_ASYNC_EVENT_ID_SRAM25,
+	GOYA_ASYNC_EVENT_ID_SRAM26,
+	GOYA_ASYNC_EVENT_ID_SRAM27,
+	GOYA_ASYNC_EVENT_ID_SRAM28,
+	GOYA_ASYNC_EVENT_ID_SRAM29,
+	GOYA_ASYNC_EVENT_ID_GIC500,
+	GOYA_ASYNC_EVENT_ID_PLL0,
+	GOYA_ASYNC_EVENT_ID_PLL1,
+	GOYA_ASYNC_EVENT_ID_PLL3,
+	GOYA_ASYNC_EVENT_ID_PLL4,
+	GOYA_ASYNC_EVENT_ID_PLL5,
+	GOYA_ASYNC_EVENT_ID_PLL6,
+	GOYA_ASYNC_EVENT_ID_AXI_ECC,
+	GOYA_ASYNC_EVENT_ID_L2_RAM_ECC,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT,
+	GOYA_ASYNC_EVENT_ID_PCIE_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC0_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC1_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC2_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC3_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC4_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC5_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC6_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC7_DEC,
+	GOYA_ASYNC_EVENT_ID_MME_WACS,
+	GOYA_ASYNC_EVENT_ID_MME_WACSD,
+	GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER,
+	GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC,
+	GOYA_ASYNC_EVENT_ID_PSOC,
+	GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC0_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC1_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC2_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC3_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC4_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC5_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC6_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC7_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC0_QM,
+	GOYA_ASYNC_EVENT_ID_TPC1_QM,
+	GOYA_ASYNC_EVENT_ID_TPC2_QM,
+	GOYA_ASYNC_EVENT_ID_TPC3_QM,
+	GOYA_ASYNC_EVENT_ID_TPC4_QM,
+	GOYA_ASYNC_EVENT_ID_TPC5_QM,
+	GOYA_ASYNC_EVENT_ID_TPC6_QM,
+	GOYA_ASYNC_EVENT_ID_TPC7_QM,
+	GOYA_ASYNC_EVENT_ID_MME_QM,
+	GOYA_ASYNC_EVENT_ID_MME_CMDQ,
+	GOYA_ASYNC_EVENT_ID_DMA0_QM,
+	GOYA_ASYNC_EVENT_ID_DMA1_QM,
+	GOYA_ASYNC_EVENT_ID_DMA2_QM,
+	GOYA_ASYNC_EVENT_ID_DMA3_QM,
+	GOYA_ASYNC_EVENT_ID_DMA4_QM,
+	GOYA_ASYNC_EVENT_ID_DMA0_CH,
+	GOYA_ASYNC_EVENT_ID_DMA1_CH,
+	GOYA_ASYNC_EVENT_ID_DMA2_CH,
+	GOYA_ASYNC_EVENT_ID_DMA3_CH,
+	GOYA_ASYNC_EVENT_ID_DMA4_CH,
+	GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH0,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH1,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH2,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH3,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH4
+};
+
 static int goya_armcp_info_get(struct hl_device *hdev);
 
 static void goya_get_fixed_properties(struct hl_device *hdev)
@@ -4186,6 +4310,56 @@ static void goya_print_irq_info(struct hl_device *hdev, u16 event_type)
 	}
 }
 
+static int goya_unmask_irq_arr(struct hl_device *hdev, u32 *irq_arr,
+		size_t irq_arr_size)
+{
+	struct armcp_unmask_irq_arr_packet *pkt;
+	size_t total_pkt_size;
+	long result;
+	int rc;
+
+	total_pkt_size = sizeof(struct armcp_unmask_irq_arr_packet) +
+			irq_arr_size;
+
+	/* data should be aligned to 8 bytes in order for ArmCP to copy it */
+	total_pkt_size = (total_pkt_size + 0x7) & ~0x7;
+
+	/* total_pkt_size is cast to u16 later on */
+	if (total_pkt_size > USHRT_MAX) {
+		dev_err(hdev->dev, "too many elements in IRQ array\n");
+		return -EINVAL;
+	}
+
+	pkt = kzalloc(total_pkt_size, GFP_KERNEL);
+	if (!pkt)
+		return -ENOMEM;
+
+	pkt->length = irq_arr_size / sizeof(irq_arr[0]);
+	memcpy(&pkt->irqs, irq_arr, irq_arr_size);
+
+	pkt->armcp_pkt.opcode = ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) pkt,
+			total_pkt_size, HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to unmask IRQ array\n");
+
+	kfree(pkt);
+
+	return rc;
+}
+
+static int goya_soft_reset_late_init(struct hl_device *hdev)
+{
+	/*
+	 * Unmask all IRQs since some could have been received
+	 * during the soft reset
+	 */
+	return goya_unmask_irq_arr(hdev, goya_non_fatal_events,
+			sizeof(goya_non_fatal_events));
+}
+
 static int goya_unmask_irq(struct hl_device *hdev, u16 event_type)
 {
 	struct armcp_packet pkt;
@@ -4276,6 +4450,7 @@ void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry)
 		dev_err(hdev->dev,
 			"Received H/W interrupt %d, reset the chip\n",
 			event_type);
+		hl_device_reset(hdev, true, false);
 		break;
 
 	case GOYA_ASYNC_EVENT_ID_PCIE_DEC:
@@ -4364,6 +4539,30 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
 	return goya->events_stat;
 }
 
+int goya_send_heartbeat(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct armcp_packet hb_pkt;
+	long result;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
+		return 0;
+
+	memset(&hb_pkt, 0, sizeof(hb_pkt));
+
+	hb_pkt.opcode = ARMCP_PACKET_TEST;
+	hb_pkt.value = ARMCP_PACKET_FENCE_VAL;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &hb_pkt,
+			sizeof(hb_pkt), HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if ((rc) || (result != ARMCP_PACKET_FENCE_VAL))
+		rc = -EIO;
+
+	return rc;
+}
+
 static int goya_armcp_info_get(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -4539,8 +4738,10 @@ static const struct hl_asic_funcs goya_funcs = {
 	.handle_eqe = goya_handle_eqe,
 	.set_pll_profile = goya_set_pll_profile,
 	.get_events_stat = goya_get_events_stat,
+	.send_heartbeat = goya_send_heartbeat,
 	.enable_clock_gating = goya_init_clock_gating,
 	.disable_clock_gating = goya_disable_clock_gating,
+	.soft_reset_late_init = goya_soft_reset_late_init,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
 	.get_eeprom_data = goya_get_eeprom_data,
diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
index 866d1774b2e4..9482dbb2e03a 100644
--- a/drivers/misc/habanalabs/goya/goya_hwmgr.c
+++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
@@ -38,7 +38,7 @@ static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, MME_PLL, false);
@@ -57,7 +57,7 @@ static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
 	int rc;
 	long value;
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		count = -ENODEV;
 		goto fail;
 	}
@@ -87,7 +87,7 @@ static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, TPC_PLL, false);
@@ -106,7 +106,7 @@ static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
 	int rc;
 	long value;
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		count = -ENODEV;
 		goto fail;
 	}
@@ -136,7 +136,7 @@ static ssize_t ic_clk_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, IC_PLL, false);
@@ -155,7 +155,7 @@ static ssize_t ic_clk_store(struct device *dev, struct device_attribute *attr,
 	int rc;
 	long value;
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		count = -ENODEV;
 		goto fail;
 	}
@@ -185,7 +185,7 @@ static ssize_t mme_clk_curr_show(struct device *dev,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, MME_PLL, true);
@@ -202,7 +202,7 @@ static ssize_t tpc_clk_curr_show(struct device *dev,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, TPC_PLL, true);
@@ -219,7 +219,7 @@ static ssize_t ic_clk_curr_show(struct device *dev,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, IC_PLL, true);
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 49b84b3ff864..c0779dd447bd 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,8 +23,12 @@
 
 #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
 
+#define HL_PENDING_RESET_PER_SEC	5
+
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
 
+#define HL_HEARTBEAT_PER_USEC		5000000 /* 5 s */
+
 #define HL_PLL_LOW_JOB_FREQ_USEC	5000000 /* 5 s */
 
 #define HL_MAX_QUEUES			128
@@ -370,8 +374,10 @@ enum hl_pll_frequency {
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
  * @set_pll_profile: change PLL profile (manual/automatic).
  * @get_events_stat: retrieve event queue entries histogram.
+ * @send_heartbeat: send is-alive packet to ArmCP and verify response.
  * @enable_clock_gating: enable clock gating for reducing power consumption.
  * @disable_clock_gating: disable clock for accessing registers on HBW.
+ * @soft_reset_late_init: perform certain actions needed after soft reset.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
  * @get_eeprom_data: retrieve EEPROM data from F/W.
@@ -417,8 +423,10 @@ struct hl_asic_funcs {
 	void (*set_pll_profile)(struct hl_device *hdev,
 			enum hl_pll_frequency freq);
 	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
+	int (*send_heartbeat)(struct hl_device *hdev);
 	void (*enable_clock_gating)(struct hl_device *hdev);
 	void (*disable_clock_gating)(struct hl_device *hdev);
+	int (*soft_reset_late_init)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
 	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
@@ -544,6 +552,16 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
 	WREG32(mm##reg, (RREG32(mm##reg) & ~REG_FIELD_MASK(reg, field)) | \
 			(val) << REG_FIELD_SHIFT(reg, field))
 
+/**
+ * struct hl_device_reset_work - reset workqueue task wrapper.
+ * @reset_work: reset work to be done.
+ * @hdev: habanalabs device structure.
+ */
+struct hl_device_reset_work {
+	struct work_struct		reset_work;
+	struct hl_device		*hdev;
+};
+
 /**
  * struct hl_device - habanalabs device structure.
  * @pdev: pointer to PCI device, can be NULL in case of simulator device.
@@ -552,6 +570,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @cdev: related char device.
  * @dev: realted kernel basic device structure.
  * @work_freq: delayed work to lower device frequency if possible.
+ * @work_heartbeat: delayed work for ArmCP is-alive check.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @completion_queue: array of hl_cq.
@@ -579,6 +598,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
+ * @in_reset: is device in reset flow.
  * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open context executing.
  * @max_power: the max power of the device, as configured by the sysadmin. This
@@ -586,10 +606,14 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  *             value and update the F/W after the re-initialization
  * @major: habanalabs KMD major.
  * @high_pll: high PLL profile frequency.
+ * @soft_reset_cnt: number of soft resets since KMD loading.
+ * @hard_reset_cnt: number of hard resets since KMD loading.
  * @id: device minor.
  * @disabled: is device disabled.
  * @late_init_done: is late init stage was done during initialization.
  * @hwmon_initialized: is H/W monitor sensors was initialized.
+ * @hard_reset_pending: is there a hard reset work pending.
+ * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -598,6 +622,7 @@ struct hl_device {
 	struct cdev			cdev;
 	struct device			*dev;
 	struct delayed_work		work_freq;
+	struct delayed_work		work_heartbeat;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
@@ -630,15 +655,20 @@ struct hl_device {
 	/* TODO: The following fields should be moved for multi-context */
 	struct hl_ctx			*user_ctx;
 
+	atomic_t			in_reset;
 	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
 	u64				max_power;
 	u32				major;
 	u32				high_pll;
+	u32				soft_reset_cnt;
+	u32				hard_reset_cnt;
 	u16				id;
 	u8				disabled;
 	u8				late_init_done;
 	u8				hwmon_initialized;
+	u8				hard_reset_pending;
+	u8				heartbeat;
 
 	/* Parameters for bring-up */
 	u8				cpu_enable;
@@ -696,6 +726,7 @@ int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
 				u32 cb_size, u64 cb_ptr);
 u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
 void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
+void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset);
 
 #define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
 #define hl_pi_2_offset(pi)		((pi) & (HL_QUEUE_LENGTH - 1))
@@ -704,6 +735,8 @@ int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
 void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
 int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
 void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
+void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q);
+void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q);
 irqreturn_t hl_irq_handler_cq(int irq, void *arg);
 irqreturn_t hl_irq_handler_eq(int irq, void *arg);
 int hl_asid_init(struct hl_device *hdev);
@@ -721,6 +754,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass);
 void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
+int hl_device_reset(struct hl_device *hdev, bool hard_reset,
+			bool from_hard_reset_thread);
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
 int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 47a9ab458b43..7d101ee0f0f2 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -91,9 +91,9 @@ int hl_device_open(struct inode *inode, struct file *filp)
 
 	mutex_lock(&hdev->device_open);
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		dev_err_ratelimited(hdev->dev,
-			"Can't open %s because it is disabled\n",
+			"Can't open %s because it is disabled or in reset\n",
 			dev_name(hdev->dev));
 		mutex_unlock(&hdev->device_open);
 		return -EPERM;
@@ -195,6 +195,7 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->fw_loading = 1;
 	hdev->ifh = 0;
 	hdev->pldm = 0;
+	hdev->heartbeat = 1;
 
 	/* If CPU is disabled, no point in loading FW */
 	if (!hdev->cpu_enable)
@@ -204,6 +205,10 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	if (!hdev->fw_loading)
 		hdev->cpu_queues_enable = 0;
 
+	/* If CPU queues not enabled, no way to do heartbeat */
+	if (!hdev->cpu_queues_enable)
+		hdev->heartbeat = 0;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
index 6ca0decb7490..3b2a47a13705 100644
--- a/drivers/misc/habanalabs/hwmon.c
+++ b/drivers/misc/habanalabs/hwmon.c
@@ -111,7 +111,7 @@ static int hl_read(struct device *dev, enum hwmon_sensor_types type,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	switch (type) {
@@ -185,7 +185,7 @@ static int hl_write(struct device *dev, enum hwmon_sensor_types type,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	switch (type) {
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
index 9586323e7dfb..f2990931d88b 100644
--- a/drivers/misc/habanalabs/irq.c
+++ b/drivers/misc/habanalabs/irq.c
@@ -250,6 +250,23 @@ void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
 			(void *) q->kernel_address, q->bus_address);
 }
 
+void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q)
+{
+	q->ci = 0;
+	q->pi = 0;
+
+	atomic_set(&q->free_slots_cnt, HL_CQ_LENGTH);
+
+	/*
+	 * It's not enough to just reset the PI/CI because the H/W may have
+	 * written valid completion entries before it was halted and therefore
+	 * we need to clean the actual queues so we won't process old entries
+	 * when the device is operational again
+	 */
+
+	memset((void *) q->kernel_address, 0, HL_CQ_SIZE_IN_BYTES);
+}
+
 /**
  * hl_eq_init - main initialization function for an event queue object
  *
@@ -292,3 +309,17 @@ void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q)
 	hdev->asic_funcs->dma_free_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
 			(void *) q->kernel_address, q->bus_address);
 }
+
+void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q)
+{
+	q->ci = 0;
+
+	/*
+	 * It's not enough to just reset the PI/CI because the H/W may have
+	 * written valid event entries before it was halted and therefore
+	 * we need to clean the actual queues so we won't process old entries
+	 * when the device is operational again
+	 */
+
+	memset((void *) q->kernel_address, 0, HL_EQ_SIZE_IN_BYTES);
+}
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
index edd5f7159de0..52d2ec580f8f 100644
--- a/drivers/misc/habanalabs/sysfs.c
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -104,7 +104,7 @@ static ssize_t pm_mng_profile_show(struct device *dev,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	return snprintf(buf, PAGE_SIZE, "%s\n",
@@ -118,7 +118,7 @@ static ssize_t pm_mng_profile_store(struct device *dev,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		count = -ENODEV;
 		goto out;
 	}
@@ -162,7 +162,7 @@ static ssize_t high_pll_show(struct device *dev, struct device_attribute *attr,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	return snprintf(buf, PAGE_SIZE, "%u\n", hdev->high_pll);
@@ -175,7 +175,7 @@ static ssize_t high_pll_store(struct device *dev, struct device_attribute *attr,
 	long value;
 	int rc;
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		count = -ENODEV;
 		goto out;
 	}
@@ -263,6 +263,48 @@ static ssize_t preboot_btl_ver_show(struct device *dev,
 	return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.preboot_ver);
 }
 
+static ssize_t soft_reset_store(struct device *dev,
+				struct device_attribute *attr, const char *buf,
+				size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	unsigned long value;
+	int rc;
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hl_device_reset(hdev, false, false);
+
+out:
+	return count;
+}
+
+static ssize_t hard_reset_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	unsigned long value;
+	int rc;
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hl_device_reset(hdev, true, false);
+
+out:
+	return count;
+}
+
 static ssize_t device_type_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -304,7 +346,9 @@ static ssize_t status_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	char *str;
 
-	if (hdev->disabled)
+	if (atomic_read(&hdev->in_reset))
+		str = "In reset";
+	else if (hdev->disabled)
 		str = "Malfunction";
 	else
 		str = "Operational";
@@ -320,13 +364,29 @@ static ssize_t write_open_cnt_show(struct device *dev,
 	return snprintf(buf, PAGE_SIZE, "%d\n", hdev->user_ctx ? 1 : 0);
 }
 
+static ssize_t soft_reset_cnt_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", hdev->soft_reset_cnt);
+}
+
+static ssize_t hard_reset_cnt_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", hdev->hard_reset_cnt);
+}
+
 static ssize_t max_power_show(struct device *dev, struct device_attribute *attr,
 				char *buf)
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long val;
 
-	if (hdev->disabled)
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
 		return -ENODEV;
 
 	val = hl_get_max_power(hdev);
@@ -341,7 +401,7 @@ static ssize_t max_power_store(struct device *dev,
 	unsigned long value;
 	int rc;
 
-	if (hdev->disabled) {
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
 		count = -ENODEV;
 		goto out;
 	}
@@ -398,10 +458,14 @@ static DEVICE_ATTR_RO(infineon_ver);
 static DEVICE_ATTR_RO(fuse_ver);
 static DEVICE_ATTR_RO(thermal_ver);
 static DEVICE_ATTR_RO(preboot_btl_ver);
+static DEVICE_ATTR_WO(soft_reset);
+static DEVICE_ATTR_WO(hard_reset);
 static DEVICE_ATTR_RO(device_type);
 static DEVICE_ATTR_RO(pci_addr);
 static DEVICE_ATTR_RO(status);
 static DEVICE_ATTR_RO(write_open_cnt);
+static DEVICE_ATTR_RO(soft_reset_cnt);
+static DEVICE_ATTR_RO(hard_reset_cnt);
 static DEVICE_ATTR_RW(max_power);
 
 static const struct bin_attribute bin_attr_eeprom = {
@@ -487,11 +551,23 @@ int hl_sysfs_init(struct hl_device *hdev)
 		goto remove_thermal_ver;
 	}
 
+	rc = device_create_file(hdev->dev, &dev_attr_soft_reset);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file soft_reset\n");
+		goto remove_preboot_ver;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_hard_reset);
+	if (rc) {
+		dev_err(hdev->dev, "failed to create device file hard_reset\n");
+		goto remove_soft_reset;
+	}
+
 	rc = device_create_file(hdev->dev, &dev_attr_device_type);
 	if (rc) {
 		dev_err(hdev->dev,
 			"failed to create device file device_type\n");
-		goto remove_preboot_ver;
+		goto remove_hard_reset;
 	}
 
 	rc = device_create_file(hdev->dev, &dev_attr_pci_addr);
@@ -513,13 +589,27 @@ int hl_sysfs_init(struct hl_device *hdev)
 		goto remove_status;
 	}
 
+	rc = device_create_file(hdev->dev, &dev_attr_soft_reset_cnt);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file soft_reset_cnt\n");
+		goto remove_write_open_cnt;
+	}
+
+	rc = device_create_file(hdev->dev, &dev_attr_hard_reset_cnt);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to create device file hard_reset_cnt\n");
+		goto remove_soft_reset_cnt;
+	}
+
 	hdev->max_power = hdev->asic_prop.max_power_default;
 
 	rc = device_create_file(hdev->dev, &dev_attr_max_power);
 	if (rc) {
 		dev_err(hdev->dev,
 			"failed to create device file max_power\n");
-		goto remove_write_open_cnt;
+		goto remove_hard_reset_cnt;
 	}
 
 	rc = sysfs_create_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
@@ -532,6 +622,10 @@ int hl_sysfs_init(struct hl_device *hdev)
 
 remove_attr_max_power:
 	device_remove_file(hdev->dev, &dev_attr_max_power);
+remove_hard_reset_cnt:
+	device_remove_file(hdev->dev, &dev_attr_hard_reset_cnt);
+remove_soft_reset_cnt:
+	device_remove_file(hdev->dev, &dev_attr_soft_reset_cnt);
 remove_write_open_cnt:
 	device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
 remove_status:
@@ -540,6 +634,10 @@ int hl_sysfs_init(struct hl_device *hdev)
 	device_remove_file(hdev->dev, &dev_attr_pci_addr);
 remove_device_type:
 	device_remove_file(hdev->dev, &dev_attr_device_type);
+remove_hard_reset:
+	device_remove_file(hdev->dev, &dev_attr_hard_reset);
+remove_soft_reset:
+	device_remove_file(hdev->dev, &dev_attr_soft_reset);
 remove_preboot_ver:
 	device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
 remove_thermal_ver:
@@ -570,10 +668,14 @@ void hl_sysfs_fini(struct hl_device *hdev)
 {
 	sysfs_remove_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
 	device_remove_file(hdev->dev, &dev_attr_max_power);
+	device_remove_file(hdev->dev, &dev_attr_hard_reset_cnt);
+	device_remove_file(hdev->dev, &dev_attr_soft_reset_cnt);
 	device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
 	device_remove_file(hdev->dev, &dev_attr_status);
 	device_remove_file(hdev->dev, &dev_attr_pci_addr);
 	device_remove_file(hdev->dev, &dev_attr_device_type);
+	device_remove_file(hdev->dev, &dev_attr_hard_reset);
+	device_remove_file(hdev->dev, &dev_attr_soft_reset);
 	device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
 	device_remove_file(hdev->dev, &dev_attr_thermal_ver);
 	device_remove_file(hdev->dev, &dev_attr_fuse_ver);
-- 
2.17.1
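
The packet sizing done in goya_unmask_irq_arr() above can be tried out in
user space. The following is only a sketch of the same arithmetic, not
driver code: ALIGN8 and padded_pkt_size are hypothetical names, and
UINT16_MAX stands in for the USHRT_MAX limit that the patch checks before
the size is later cast to u16.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of 8 (hypothetical helper). */
#define ALIGN8(x) (((x) + (size_t)7) & ~(size_t)7)

/*
 * Return the 8-byte-aligned packet size for a header followed by a
 * payload, or 0 if the padded size would not fit in 16 bits.
 */
size_t padded_pkt_size(size_t header_size, size_t payload_size)
{
	size_t total = ALIGN8(header_size + payload_size);

	if (total > UINT16_MAX)
		return 0;

	return total;
}
```

Note that the limit is applied after rounding up, as in the patch, so a
payload that fits before padding can still be rejected once padded.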



* [PATCH 11/15] habanalabs: add command submission module
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (8 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 10/15] habanalabs: add device reset support Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-27 15:11   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds the main flow for the user to submit work to the device.

Each unit of work is described by a command submission (CS) object. The CS
contains three arrays of command buffers (CBs): one for execution and two
for context switch (store and restore).

For each CB, the user specifies the queue on which to place it. For an
internal queue, the queue entry holds the on-chip memory address at which
the CB resides rather than a pointer to the CB.

The driver parses some of the CBs to enforce security restrictions.

The user receives a sequence number that represents the CS object. The user
can then query the driver regarding the status of the CS, using that
sequence number.
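
As an illustration of the sequence-number scheme, here is a small user-space
sketch. It is not the driver code: it only mimics the slot selection that
allocate_cs() in this patch performs with
ctx->cs_pending[seq & (HL_MAX_PENDING_CS - 1)]. All names below
(MAX_PENDING, ctx_sketch, submit_sketch, complete_sketch, query_sketch) are
hypothetical, and a plain flag stands in for the dma_fence.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 4	/* hypothetical ring size, must be a power of two */

struct pending_slot {
	uint64_t seq;
	bool in_use;
	bool signaled;
};

struct ctx_sketch {
	uint64_t next_seq;
	struct pending_slot ring[MAX_PENDING];
};

/* Hand out the next sequence number, or -1 if its ring slot is still busy. */
int submit_sketch(struct ctx_sketch *ctx, uint64_t *seq_out)
{
	struct pending_slot *slot =
		&ctx->ring[ctx->next_seq & (MAX_PENDING - 1)];

	if (slot->in_use && !slot->signaled)
		return -1;	/* the driver returns -EAGAIN in this case */

	slot->seq = ctx->next_seq;
	slot->in_use = true;
	slot->signaled = false;
	*seq_out = ctx->next_seq++;

	return 0;
}

/* Mark a submission as finished if its slot has not been recycled. */
void complete_sketch(struct ctx_sketch *ctx, uint64_t seq)
{
	struct pending_slot *slot = &ctx->ring[seq & (MAX_PENDING - 1)];

	if (slot->in_use && slot->seq == seq)
		slot->signaled = true;
}

/* 1 = finished, 0 = still pending, -1 = sequence number never issued. */
int query_sketch(const struct ctx_sketch *ctx, uint64_t seq)
{
	const struct pending_slot *slot = &ctx->ring[seq & (MAX_PENDING - 1)];

	if (seq >= ctx->next_seq)
		return -1;

	if (slot->in_use && slot->seq == seq)
		return slot->signaled ? 1 : 0;

	/* The slot was recycled, so the older submission must have finished. */
	return 1;
}
```

In this sketch, a submission is refused while the slot it would reuse still
holds an unfinished entry, which bounds the number of in-flight submissions
per context to the ring size.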

If the CS doesn't finish before the timeout expires, the driver performs a
soft reset of the device.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile             |    3 +-
 drivers/misc/habanalabs/command_submission.c |  787 +++++++++++++
 drivers/misc/habanalabs/context.c            |   52 +-
 drivers/misc/habanalabs/device.c             |   16 +
 drivers/misc/habanalabs/goya/goya.c          | 1082 ++++++++++++++++++
 drivers/misc/habanalabs/habanalabs.h         |  274 +++++
 drivers/misc/habanalabs/habanalabs_drv.c     |   23 +
 drivers/misc/habanalabs/habanalabs_ioctl.c   |    4 +-
 drivers/misc/habanalabs/hw_queue.c           |  250 ++++
 drivers/misc/habanalabs/memory.c             |  200 ++++
 include/uapi/misc/habanalabs.h               |  158 ++-
 11 files changed, 2842 insertions(+), 7 deletions(-)
 create mode 100644 drivers/misc/habanalabs/command_submission.c
 create mode 100644 drivers/misc/habanalabs/memory.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index b5607233d216..d2fd0e18b1eb 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,7 +5,8 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
-		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
+		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
+		command_submission.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
new file mode 100644
index 000000000000..0116c2262f17
--- /dev/null
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -0,0 +1,787 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
+#include <linux/sched/signal.h>
+#include <linux/wait.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+
+static void job_wq_completion(struct work_struct *work);
+static long _hl_cs_wait_ioctl(struct hl_device *hdev,
+		struct hl_ctx *ctx, u64 timeout_us, u64 seq);
+static void cs_do_release(struct kref *ref);
+
+static const char *hl_fence_get_driver_name(struct dma_fence *fence)
+{
+	return "HabanaLabs";
+}
+
+static const char *hl_fence_get_timeline_name(struct dma_fence *fence)
+{
+	struct hl_dma_fence *hl_fence =
+		container_of(fence, struct hl_dma_fence, base_fence);
+
+	return dev_name(hl_fence->hdev->dev);
+}
+
+static bool hl_fence_enable_signaling(struct dma_fence *fence)
+{
+	return true;
+}
+
+static void hl_fence_release(struct dma_fence *fence)
+{
+	struct hl_dma_fence *hl_fence =
+		container_of(fence, struct hl_dma_fence, base_fence);
+
+	kfree_rcu(hl_fence, base_fence.rcu);
+}
+
+static const struct dma_fence_ops hl_fence_ops = {
+	.get_driver_name = hl_fence_get_driver_name,
+	.get_timeline_name = hl_fence_get_timeline_name,
+	.enable_signaling = hl_fence_enable_signaling,
+	.wait = dma_fence_default_wait,
+	.release = hl_fence_release
+};
+
+static void cs_get(struct hl_cs *cs)
+{
+	kref_get(&cs->refcount);
+}
+
+static int cs_get_unless_zero(struct hl_cs *cs)
+{
+	return kref_get_unless_zero(&cs->refcount);
+}
+
+static void cs_put(struct hl_cs *cs)
+{
+	kref_put(&cs->refcount, cs_do_release);
+}
+
+/**
+ * cs_parser - parse the user command submission
+ *
+ * @hpriv	: pointer to the private data of the fd
+ * @job	: pointer to the job that holds the command submission info
+ *
+ * The function parses the command submission of the user. It calls the
+ * ASIC specific parser, which returns a list of memory blocks to send
+ * to the device as different command buffers
+ *
+ */
+static int cs_parser(struct hl_fpriv *hpriv, struct hl_cs_job *job)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_cs_parser parser;
+	int rc;
+
+	parser.ctx_id = job->cs->ctx->asid;
+	parser.cs_sequence = job->cs->sequence;
+	parser.job_id = job->id;
+
+	parser.hw_queue_id = job->hw_queue_id;
+	parser.job_userptr_list = &job->userptr_list;
+	parser.patched_cb = NULL;
+	parser.user_cb = job->user_cb;
+	parser.user_cb_size = job->user_cb_size;
+	parser.ext_queue = job->ext_queue;
+	job->patched_cb = NULL;
+	parser.use_virt_addr = hdev->mmu_enable;
+
+	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
+	if (job->ext_queue) {
+		if (!rc) {
+			job->patched_cb = parser.patched_cb;
+			job->job_cb_size = parser.patched_cb_size;
+
+			spin_lock(&job->patched_cb->lock);
+			job->patched_cb->cs_cnt++;
+			spin_unlock(&job->patched_cb->lock);
+		}
+
+		/*
+		 * Whether the parsing worked or not, we don't need the
+		 * original CB anymore because it was already parsed and
+		 * won't be accessed again for this CS
+		 */
+		spin_lock(&job->user_cb->lock);
+		job->user_cb->cs_cnt--;
+		spin_unlock(&job->user_cb->lock);
+		hl_cb_put(job->user_cb);
+		job->user_cb = NULL;
+	}
+
+	return rc;
+}
+
+static void free_job(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct hl_cs *cs = job->cs;
+
+	if (job->ext_queue) {
+		hl_userptr_delete_list(hdev, &job->userptr_list);
+
+		/*
+		 * We might arrive here from rollback and patched CB wasn't
+		 * created, so we need to check it's not NULL
+		 */
+		if (job->patched_cb) {
+			spin_lock(&job->patched_cb->lock);
+			job->patched_cb->cs_cnt--;
+			spin_unlock(&job->patched_cb->lock);
+
+			hl_cb_put(job->patched_cb);
+		}
+	}
+
+	/*
+	 * This is the only place where there can be multiple threads
+	 * modifying the list at the same time
+	 */
+	spin_lock(&cs->job_lock);
+	list_del(&job->cs_node);
+	spin_unlock(&cs->job_lock);
+
+	if (job->ext_queue)
+		cs_put(cs);
+
+	kfree(job);
+}
+
+static void cs_do_release(struct kref *ref)
+{
+	struct hl_cs *cs = container_of(ref, struct hl_cs,
+						refcount);
+	struct hl_device *hdev = cs->ctx->hdev;
+	struct hl_cs_job *job, *tmp;
+
+	cs->completed = true;
+
+	/*
+	 * Reaching here means all external jobs have finished, because
+	 * each of them took a reference on the CS. We still need to go
+	 * over the internal jobs and free them. Otherwise, we would leak
+	 * memory and, worse, the CS object (and potentially the CTX
+	 * object) could be released while the job still holds a pointer
+	 * to them (but no reference).
+	 */
+	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node)
+		free_job(hdev, job);
+
+	/* We also need to update CI for internal queues */
+	if (cs->submitted) {
+		hl_int_hw_queue_update_ci(cs);
+
+		spin_lock(&hdev->hw_queues_mirror_lock);
+		/* remove CS from hw_queues mirror list */
+		list_del_init(&cs->mirror_node);
+		spin_unlock(&hdev->hw_queues_mirror_lock);
+
+		/*
+		 * Don't cancel the TDR if this CS timed out, because we
+		 * might be running from the TDR context
+		 */
+		if ((!cs->timedout) &&
+			(hdev->timeout_jiffies != MAX_SCHEDULE_TIMEOUT)) {
+			struct hl_cs *next;
+
+			if (cs->tdr_active)
+				cancel_delayed_work_sync(&cs->work_tdr);
+
+			spin_lock(&hdev->hw_queues_mirror_lock);
+			/* queue TDR for next CS */
+			next = list_first_entry_or_null(
+					&hdev->hw_queues_mirror_list,
+					struct hl_cs, mirror_node);
+			if ((next) && (!next->tdr_active)) {
+				next->tdr_active = true;
+				schedule_delayed_work(&next->work_tdr,
+							hdev->timeout_jiffies);
+				spin_unlock(&hdev->hw_queues_mirror_lock);
+			} else {
+				spin_unlock(&hdev->hw_queues_mirror_lock);
+			}
+		}
+	}
+
+	hl_ctx_put(cs->ctx);
+
+	if (cs->timedout)
+		dma_fence_set_error(cs->fence, -ETIMEDOUT);
+	else if (cs->aborted)
+		dma_fence_set_error(cs->fence, -EIO);
+
+	dma_fence_signal(cs->fence);
+	dma_fence_put(cs->fence);
+
+	kfree(cs);
+}
+
+static void cs_timedout(struct work_struct *work)
+{
+	struct hl_device *hdev;
+	int ctx_asid, rc;
+	struct hl_cs *cs = container_of(work, struct hl_cs,
+						 work_tdr.work);
+	rc = cs_get_unless_zero(cs);
+	if (!rc)
+		return;
+
+	if ((!cs->submitted) || (cs->completed)) {
+		cs_put(cs);
+		return;
+	}
+
+	/* Mark the CS as timed out so we won't try to cancel its TDR */
+	cs->timedout = true;
+
+	hdev = cs->ctx->hdev;
+	ctx_asid = cs->ctx->asid;
+
+	/* TODO: add information about last signaled seq and last emitted seq */
+	dev_err(hdev->dev, "CS %d.%llu got stuck!!!\n", ctx_asid, cs->sequence);
+
+	cs_put(cs);
+
+	if (hdev->reset_on_lockup)
+		hl_device_reset(hdev, false, false);
+}
+
+static int allocate_cs(struct hl_device *hdev, struct hl_ctx *ctx,
+			struct hl_cs **cs_new)
+{
+	struct hl_dma_fence *fence;
+	struct dma_fence *other = NULL;
+	struct hl_cs *cs;
+	int rc;
+
+	cs = kzalloc(sizeof(*cs), GFP_ATOMIC);
+	if (!cs)
+		return -ENOMEM;
+
+	cs->ctx = ctx;
+	cs->submitted = false;
+	cs->completed = false;
+	INIT_LIST_HEAD(&cs->job_list);
+	INIT_DELAYED_WORK(&cs->work_tdr, cs_timedout);
+	kref_init(&cs->refcount);
+	spin_lock_init(&cs->job_lock);
+
+	fence = kmalloc(sizeof(*fence), GFP_ATOMIC);
+	if (!fence) {
+		rc = -ENOMEM;
+		goto free_cs;
+	}
+
+	fence->hdev = hdev;
+	spin_lock_init(&fence->lock);
+	cs->fence = &fence->base_fence;
+
+	spin_lock(&ctx->cs_lock);
+
+	fence->cs_seq = ctx->cs_sequence;
+	other = ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)];
+	if ((other) && (!dma_fence_is_signaled(other))) {
+		spin_unlock(&ctx->cs_lock);
+		rc = -EAGAIN;
+		goto free_fence;
+	}
+
+	dma_fence_init(&fence->base_fence, &hl_fence_ops, &fence->lock,
+			ctx->asid, ctx->cs_sequence);
+
+	cs->sequence = fence->cs_seq;
+
+	ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)] =
+							&fence->base_fence;
+	ctx->cs_sequence++;
+
+	dma_fence_get(&fence->base_fence);
+
+	dma_fence_put(other);
+
+	spin_unlock(&ctx->cs_lock);
+
+	*cs_new = cs;
+
+	return 0;
+
+free_fence:
+	kfree(fence);
+free_cs:
+	kfree(cs);
+	return rc;
+}
+
+static void cs_rollback(struct hl_device *hdev, struct hl_cs *cs)
+{
+	struct hl_cs_job *job, *tmp;
+
+	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node)
+		free_job(hdev, job);
+}
+
+void hl_cs_rollback_all(struct hl_device *hdev)
+{
+	struct hl_cs *cs, *tmp;
+
+	/* flush all completions */
+	flush_workqueue(hdev->cq_wq);
+
+	/* Make sure we don't have leftovers in the H/W queues mirror list */
+	list_for_each_entry_safe(cs, tmp, &hdev->hw_queues_mirror_list,
+				mirror_node) {
+		cs_get(cs);
+		cs->aborted = true;
+		dev_warn_ratelimited(hdev->dev, "Killing CS %d.%llu\n",
+					cs->ctx->asid, cs->sequence);
+		cs_rollback(hdev, cs);
+		cs_put(cs);
+	}
+}
+
+static void job_wq_completion(struct work_struct *work)
+{
+	struct hl_cs_job *job = container_of(work, struct hl_cs_job,
+						finish_work);
+	struct hl_cs *cs = job->cs;
+	struct hl_device *hdev = cs->ctx->hdev;
+
+	/* job is no longer needed */
+	free_job(hdev, job);
+}
+
+static struct hl_cb *validate_queue_index(struct hl_device *hdev,
+					struct hl_cb_mgr *cb_mgr,
+					struct hl_cs_chunk *chunk,
+					bool *ext_queue)
+{
+	struct asic_fixed_properties *asic = &hdev->asic_prop;
+	struct hw_queue_properties *hw_queue_prop;
+	u32 cb_handle;
+	struct hl_cb *cb;
+
+	/* Assume external queue */
+	*ext_queue = true;
+
+	hw_queue_prop = &asic->hw_queues_props[chunk->queue_index];
+
+	if ((chunk->queue_index >= HL_MAX_QUEUES) ||
+			(hw_queue_prop->type == QUEUE_TYPE_NA)) {
+		dev_err(hdev->dev, "Queue index %d is invalid\n",
+			chunk->queue_index);
+		return NULL;
+	}
+
+	if (hw_queue_prop->kmd_only) {
+		dev_err(hdev->dev, "Queue index %d is restricted for KMD\n",
+			chunk->queue_index);
+		return NULL;
+	} else if (hw_queue_prop->type == QUEUE_TYPE_INT) {
+		*ext_queue = false;
+		return (struct hl_cb *) chunk->cb_handle;
+	}
+
+	/* Retrieve CB object */
+	cb_handle = (u32) (chunk->cb_handle >> PAGE_SHIFT);
+
+	cb = hl_cb_get(hdev, cb_mgr, cb_handle);
+	if (!cb) {
+		dev_err(hdev->dev, "CB handle 0x%x invalid\n", cb_handle);
+		return NULL;
+	}
+
+	if ((chunk->cb_size < 8) || (chunk->cb_size > cb->size)) {
+		dev_err(hdev->dev, "CB size %u invalid\n", chunk->cb_size);
+		goto release_cb;
+	}
+
+	spin_lock(&cb->lock);
+	cb->cs_cnt++;
+	spin_unlock(&cb->lock);
+
+	return cb;
+
+release_cb:
+	hl_cb_put(cb);
+	return NULL;
+}
+
+struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue)
+{
+	struct hl_cs_job *job;
+
+	job = kzalloc(sizeof(*job), GFP_ATOMIC);
+	if (!job)
+		return NULL;
+
+	job->ext_queue = ext_queue;
+
+	if (job->ext_queue) {
+		INIT_LIST_HEAD(&job->userptr_list);
+		INIT_WORK(&job->finish_work, job_wq_completion);
+	}
+
+	return job;
+}
+
+static int _hl_cs_ioctl(struct hl_fpriv *hpriv, void __user *chunks,
+			u32 num_chunks, u64 *cs_seq)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_cs_chunk *cs_chunk_array;
+	struct hl_cs_job *job;
+	struct hl_cs *cs;
+	struct hl_cb *cb;
+	bool ext_queue_present = false;
+	u32 size_to_copy;
+	int rc, i, parse_cnt;
+
+	*cs_seq = ULLONG_MAX;
+
+	if (num_chunks > HL_MAX_JOBS_PER_CS) {
+		dev_err(hdev->dev,
+			"Number of chunks can NOT be larger than %d\n",
+			HL_MAX_JOBS_PER_CS);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	cs_chunk_array = kmalloc_array(num_chunks, sizeof(*cs_chunk_array),
+					GFP_ATOMIC);
+	if (!cs_chunk_array) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	size_to_copy = num_chunks * sizeof(struct hl_cs_chunk);
+	if (copy_from_user(cs_chunk_array, chunks, size_to_copy)) {
+		dev_err(hdev->dev, "Failed to copy cs chunk array from user\n");
+		rc = -EFAULT;
+		goto free_cs_chunk_array;
+	}
+
+	/* increment refcnt for context */
+	hl_ctx_get(hdev, hpriv->ctx);
+
+	rc = allocate_cs(hdev, hpriv->ctx, &cs);
+	if (rc) {
+		hl_ctx_put(hpriv->ctx);
+		goto free_cs_chunk_array;
+	}
+
+	*cs_seq = cs->sequence;
+
+	/* Validate ALL the CS chunks before submitting the CS */
+	for (i = 0, parse_cnt = 0 ; i < num_chunks ; i++, parse_cnt++) {
+		struct hl_cs_chunk *chunk = &cs_chunk_array[i];
+		bool ext_queue;
+
+		cb = validate_queue_index(hdev, &hpriv->cb_mgr, chunk,
+					&ext_queue);
+		if (ext_queue) {
+			ext_queue_present = true;
+			if (!cb) {
+				rc = -EINVAL;
+				goto free_cs_object;
+			}
+		}
+
+		job = hl_cs_allocate_job(hdev, ext_queue);
+		if (!job) {
+			dev_err(hdev->dev, "Failed to allocate a new job\n");
+			rc = -ENOMEM;
+			if (ext_queue)
+				goto release_cb;
+			else
+				goto free_cs_object;
+		}
+
+		job->id = i + 1;
+		job->cs = cs;
+		job->user_cb = cb;
+		job->user_cb_size = chunk->cb_size;
+		if (job->ext_queue)
+			job->job_cb_size = cb->size;
+		else
+			job->job_cb_size = chunk->cb_size;
+		job->hw_queue_id = chunk->queue_index;
+
+		cs->jobs_in_queue_cnt[job->hw_queue_id]++;
+
+		list_add_tail(&job->cs_node, &cs->job_list);
+
+		/*
+		 * Increment the CS reference. When the reference drops to 0,
+		 * the CS is done: it can be signaled to the user and all its
+		 * resources freed. Only increment for jobs on external
+		 * queues, because only those jobs generate a completion
+		 */
+		if (job->ext_queue)
+			cs_get(cs);
+
+		rc = cs_parser(hpriv, job);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed to parse JOB %d.%llu.%d, err %d, rejecting the CS!!!\n",
+				cs->ctx->asid, cs->sequence, job->id, rc);
+			goto free_cs_object;
+		}
+	}
+
+	if (!ext_queue_present) {
+		dev_err(hdev->dev,
+			"Reject CS %d.%llu because it has no external queue jobs\n",
+			cs->ctx->asid, cs->sequence);
+		rc = -EINVAL;
+		goto free_cs_object;
+	}
+
+	rc = hl_hw_queue_schedule_cs(cs);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to submit CS %d.%llu to H/W queues, error %d\n",
+			cs->ctx->asid, cs->sequence, rc);
+		goto free_cs_object;
+	}
+
+	rc = HL_CS_STATUS_SUCCESS;
+	goto put_cs;
+
+release_cb:
+	spin_lock(&cb->lock);
+	cb->cs_cnt--;
+	spin_unlock(&cb->lock);
+	hl_cb_put(cb);
+free_cs_object:
+	cs_rollback(hdev, cs);
+	*cs_seq = ULLONG_MAX;
+	/* The path below is both for good and erroneous exits */
+put_cs:
+	/* We finished with the CS in this function, so put the ref */
+	cs_put(cs);
+free_cs_chunk_array:
+	kfree(cs_chunk_array);
+out:
+	return rc;
+}
+
+int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	union hl_cs_args *args = data;
+	void __user *chunks;
+	u32 num_chunks;
+	u64 cs_seq = ULLONG_MAX;
+	int rc, do_restore;
+	bool need_soft_reset = false;
+
+	if (hdev->hard_reset_pending) {
+		dev_crit_ratelimited(hdev->dev,
+			"Device HARD reset pending !!! Please close FD\n");
+		rc = -ENODEV;
+		goto out;
+	}
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
+		dev_warn(hdev->dev,
+			"Device is %s !!! Can't submit new CS\n",
+			atomic_read(&hdev->in_reset) ? "in_reset" : "disabled");
+		rc = -EBUSY;
+		goto out;
+	}
+
+	do_restore = atomic_cmpxchg(&hpriv->ctx->thread_restore_token, 1, 0);
+
+	if (do_restore || (args->in.cs_flags & HL_CS_FLAGS_FORCE_RESTORE)) {
+		long ret;
+
+		chunks = (void __user *)(uintptr_t)args->in.chunks_restore;
+		num_chunks = args->in.num_chunks_restore;
+
+		mutex_lock(&hpriv->restore_phase_mutex);
+
+		if (do_restore) {
+			rc = hdev->asic_funcs->context_switch(hdev,
+					hpriv->ctx->asid);
+			if (rc) {
+				dev_err_ratelimited(hdev->dev,
+					"Failed to switch to context %d, rejecting CS! %d\n",
+					hpriv->ctx->asid, rc);
+				/*
+				 * If we timed out, we need to soft-reset
+				 * because QMAN is probably stuck. However, we
+				 * can't call reset directly here because that
+				 * would deadlock, so do it at the very end of
+				 * this function
+				 */
+				if (rc == -ETIMEDOUT)
+					need_soft_reset = true;
+				mutex_unlock(&hpriv->restore_phase_mutex);
+				goto out;
+			}
+		}
+
+		hdev->asic_funcs->restore_phase_topology(hdev);
+
+		if (num_chunks == 0) {
+			dev_dbg(hdev->dev,
+			"Need to run restore phase but restore CS is empty\n");
+			rc = 0;
+		} else {
+			rc = _hl_cs_ioctl(hpriv, chunks, num_chunks,
+						&cs_seq);
+		}
+
+		mutex_unlock(&hpriv->restore_phase_mutex);
+
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed to submit restore CS for context %d (%d)\n",
+				hpriv->ctx->asid, rc);
+			goto out;
+		}
+
+		/* Need to wait for restore completion before execution phase */
+		if (num_chunks > 0) {
+wait_again:
+			ret = _hl_cs_wait_ioctl(hdev, hpriv->ctx,
+					jiffies_to_usecs(hdev->timeout_jiffies),
+					cs_seq);
+			if (ret <= 0) {
+				if ((ret == -ERESTARTSYS) && (hdev->ifh)) {
+					usleep_range(100, 200);
+					goto wait_again;
+				}
+				dev_err(hdev->dev,
+					"Restore CS for context %d failed to complete %ld\n",
+					hpriv->ctx->asid, ret);
+				rc = -ENOEXEC;
+				goto out;
+			}
+		}
+
+		hpriv->ctx->thread_restore_wait_token = 1;
+	} else if (!hpriv->ctx->thread_restore_wait_token) {
+		u32 tmp;
+
+		rc = hl_poll_timeout_memory(hdev,
+			(u64) &hpriv->ctx->thread_restore_wait_token,
+			jiffies_to_usecs(hdev->timeout_jiffies),
+			&tmp);
+
+		if (rc || !tmp) {
+			dev_err(hdev->dev,
+				"restore phase hasn't finished in time\n");
+			rc = -ETIMEDOUT;
+			goto out;
+		}
+	}
+
+	chunks = (void __user *)(uintptr_t)args->in.chunks_execute;
+	num_chunks = args->in.num_chunks_execute;
+
+	if (num_chunks == 0) {
+		dev_err(hdev->dev,
+			"Got execute CS with 0 chunks, context %d!!!\n",
+			hpriv->ctx->asid);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	rc = _hl_cs_ioctl(hpriv, chunks, num_chunks, &cs_seq);
+
+out:
+	if (rc != -EAGAIN) {
+		memset(args, 0, sizeof(*args));
+		args->out.status = rc;
+		args->out.seq = cs_seq;
+	}
+
+	if ((rc == -ETIMEDOUT) && (need_soft_reset))
+		hl_device_reset(hdev, false, false);
+
+	return rc;
+}
+
+static long _hl_cs_wait_ioctl(struct hl_device *hdev,
+		struct hl_ctx *ctx, u64 timeout_us, u64 seq)
+{
+	struct dma_fence *fence;
+	unsigned long timeout;
+	long rc;
+
+	if (timeout_us == MAX_SCHEDULE_TIMEOUT)
+		timeout = timeout_us;
+	else
+		timeout = usecs_to_jiffies(timeout_us);
+
+	hl_ctx_get(hdev, ctx);
+
+	fence = hl_ctx_get_fence(ctx, seq);
+	if (IS_ERR(fence)) {
+		rc = PTR_ERR(fence);
+	} else if (fence) {
+		rc = dma_fence_wait_timeout(fence, true, timeout);
+		if (fence->error == -ETIMEDOUT)
+			rc = -ETIMEDOUT;
+		else if (fence->error == -EIO)
+			rc = -EIO;
+		dma_fence_put(fence);
+	} else
+		rc = 1;
+
+	hl_ctx_put(ctx);
+
+	return rc;
+}
+
+int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	union hl_wait_cs_args *args = data;
+	u64 seq = args->in.seq;
+	long rc;
+
+	if (hdev->hard_reset_pending) {
+		dev_crit_ratelimited(hdev->dev,
+			"Device HARD reset pending !!! Please close FD\n");
+		return -ENODEV;
+	}
+
+	rc = _hl_cs_wait_ioctl(hdev, hpriv->ctx, args->in.timeout_us, seq);
+
+	memset(args, 0, sizeof(*args));
+
+	if (rc < 0) {
+		dev_err(hdev->dev, "Error %ld on waiting for CS handle %llu\n",
+			rc, seq);
+		if (rc == -ERESTARTSYS) {
+			args->out.status = HL_WAIT_CS_STATUS_INTERRUPTED;
+			rc = -EINTR;
+		} else if (rc == -ETIMEDOUT) {
+			args->out.status = HL_WAIT_CS_STATUS_TIMEDOUT;
+		} else if (rc == -EIO) {
+			args->out.status = HL_WAIT_CS_STATUS_ABORTED;
+		}
+		return rc;
+	}
+
+	if (rc == 0)
+		args->out.status = HL_WAIT_CS_STATUS_BUSY;
+	else
+		args->out.status = HL_WAIT_CS_STATUS_COMPLETED;
+
+	return 0;
+}
diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
index cdcad077e5cf..2da672113e7a 100644
--- a/drivers/misc/habanalabs/context.c
+++ b/drivers/misc/habanalabs/context.c
@@ -13,6 +13,18 @@
 static void hl_ctx_fini(struct hl_ctx *ctx)
 {
 	struct hl_device *hdev = ctx->hdev;
+	int i;
+
+	/*
+	 * If we arrived here, there are no jobs waiting for this context
+	 * on its queues, so we can safely remove it.
+	 * This is because we increment the ref count for each CS and
+	 * decrement it for every CS that finished, so we won't arrive
+	 * at this function unless the ref count is 0
+	 */
+
+	for (i = 0 ; i < HL_MAX_PENDING_CS ; i++)
+		dma_fence_put(ctx->cs_pending[i]);
 
 	if (ctx->asid != HL_KERNEL_ASID_ID)
 		hl_asid_free(hdev, ctx->asid);
@@ -24,8 +36,6 @@ void hl_ctx_do_release(struct kref *ref)
 
 	ctx = container_of(ref, struct hl_ctx, refcount);
 
-	dev_dbg(ctx->hdev->dev, "Now really releasing context %d\n", ctx->asid);
-
 	hl_ctx_fini(ctx);
 
 	if (ctx->hpriv)
@@ -91,6 +101,11 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 
 	kref_init(&ctx->refcount);
 
+	ctx->cs_sequence = 1;
+	spin_lock_init(&ctx->cs_lock);
+	atomic_set(&ctx->thread_restore_token, 1);
+	ctx->thread_restore_wait_token = 0;
+
 	if (is_kernel_ctx) {
 		ctx->asid = HL_KERNEL_ASID_ID; /* KMD gets ASID 0 */
 	} else {
@@ -101,8 +116,6 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 		}
 	}
 
-	dev_dbg(hdev->dev, "Created context with ASID %u\n", ctx->asid);
-
 	return 0;
 }
 
@@ -116,6 +129,37 @@ int hl_ctx_put(struct hl_ctx *ctx)
 	return kref_put(&ctx->refcount, hl_ctx_do_release);
 }
 
+struct dma_fence *hl_ctx_get_fence(struct hl_ctx *ctx, u64 seq)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct dma_fence *fence;
+
+	spin_lock(&ctx->cs_lock);
+
+	if (seq >= ctx->cs_sequence) {
+		dev_notice(hdev->dev,
+			"Can't wait on seq %llu because current CS is at seq %llu\n",
+			seq, ctx->cs_sequence);
+		spin_unlock(&ctx->cs_lock);
+		return ERR_PTR(-EINVAL);
+	}
+
+
+	if (seq + HL_MAX_PENDING_CS < ctx->cs_sequence) {
+		dev_dbg(hdev->dev,
+			"Can't wait on seq %llu because current CS is at seq %llu (Fence is gone)\n",
+			seq, ctx->cs_sequence);
+		spin_unlock(&ctx->cs_lock);
+		return NULL;
+	}
+
+	fence = dma_fence_get(
+			ctx->cs_pending[seq & (HL_MAX_PENDING_CS - 1)]);
+	spin_unlock(&ctx->cs_lock);
+
+	return fence;
+}
+
 /**
  * hl_ctx_mgr_init - initialize the context manager
  *
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 00fde57ce823..a47e00fe5ccf 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -22,6 +22,8 @@ static void hpriv_release(struct kref *ref)
 
 	put_pid(hpriv->taskpid);
 
+	mutex_destroy(&hpriv->restore_phase_mutex);
+
 	kfree(hpriv);
 
 	/* Now the FD is really closed */
@@ -188,6 +190,8 @@ static int device_early_init(struct hl_device *hdev)
 
 	mutex_init(&hdev->device_open);
 	mutex_init(&hdev->send_cpu_message_lock);
+	INIT_LIST_HEAD(&hdev->hw_queues_mirror_list);
+	spin_lock_init(&hdev->hw_queues_mirror_lock);
 	atomic_set(&hdev->in_reset, 0);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
@@ -563,6 +567,9 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	 */
 	hdev->asic_funcs->halt_engines(hdev, hard_reset);
 
+	/* Go over all the queues, release all CS and their jobs */
+	hl_cs_rollback_all(hdev);
+
 	if (hard_reset) {
 		/* Release kernel context */
 		if (hl_ctx_put(hdev->kernel_ctx) != 1) {
@@ -586,6 +593,12 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
 		hl_cq_reset(hdev, &hdev->completion_queue[i]);
 
+	/* Make sure the setup phase for the user context will run again */
+	if (hdev->user_ctx) {
+		atomic_set(&hdev->user_ctx->thread_restore_token, 1);
+		hdev->user_ctx->thread_restore_wait_token = 0;
+	}
+
 	/* Finished tear-down, starting to re-initialize */
 
 	if (hard_reset) {
@@ -916,6 +929,9 @@ void hl_device_fini(struct hl_device *hdev)
 	 */
 	hdev->asic_funcs->halt_engines(hdev, true);
 
+	/* Go over all the queues, release all CS and their jobs */
+	hl_cs_rollback_all(hdev);
+
 	hl_cb_pool_fini(hdev);
 
 	/* Release kernel context */
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index ba9dd314c060..e3867615b974 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -102,6 +102,19 @@ static const char goya_irq_name[GOYA_MSIX_ENTRIES][GOYA_MAX_STRING_LEN] = {
 		"goya cq 4", "goya cpu eq"
 };
 
+static u16 goya_packet_sizes[MAX_PACKET_ID] = {
+	[PACKET_WREG_32]	= sizeof(struct packet_wreg32),
+	[PACKET_WREG_BULK]	= sizeof(struct packet_wreg_bulk),
+	[PACKET_MSG_LONG]	= sizeof(struct packet_msg_long),
+	[PACKET_MSG_SHORT]	= sizeof(struct packet_msg_short),
+	[PACKET_CP_DMA]		= sizeof(struct packet_cp_dma),
+	[PACKET_MSG_PROT]	= sizeof(struct packet_msg_prot),
+	[PACKET_FENCE]		= sizeof(struct packet_fence),
+	[PACKET_LIN_DMA]	= sizeof(struct packet_lin_dma),
+	[PACKET_NOP]		= sizeof(struct packet_nop),
+	[PACKET_STOP]		= sizeof(struct packet_stop)
+};
+
 static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 	"MME0",
 	"MME1",
@@ -3957,6 +3970,86 @@ void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
 	return base;
 }
 
+int goya_send_job_on_qman0(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct packet_msg_prot *fence_pkt;
+	u32 *fence_ptr;
+	dma_addr_t fence_dma_addr;
+	struct hl_cb *cb;
+	u32 tmp;
+	int rc;
+
+	if (!hdev->asic_funcs->is_device_idle(hdev)) {
+		dev_err_ratelimited(hdev->dev,
+			"Can't send KMD job on QMAN0 if device is not idle\n");
+		return -EFAULT;
+	}
+
+	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
+							&fence_dma_addr);
+	if (!fence_ptr) {
+		dev_err(hdev->dev,
+			"Failed to allocate fence memory for QMAN0\n");
+		return -ENOMEM;
+	}
+
+	*fence_ptr = 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU) {
+		WREG32(mmDMA_QM_0_GLBL_PROT, QMAN_DMA_FULLY_TRUSTED);
+		RREG32(mmDMA_QM_0_GLBL_PROT);
+	}
+
+	/*
+	 * goya cs parser saves space for 2xpacket_msg_prot at end of CB. For
+	 * synchronized kernel jobs we only need space for 1 packet_msg_prot
+	 */
+	job->job_cb_size -= sizeof(struct packet_msg_prot);
+
+	cb = job->patched_cb;
+
+	fence_pkt = (struct packet_msg_prot *) (cb->kernel_address +
+			job->job_cb_size - sizeof(struct packet_msg_prot));
+
+	fence_pkt->ctl = 0;
+	fence_pkt->opcode = PACKET_MSG_PROT;
+	fence_pkt->reg_barrier = 0;
+	fence_pkt->msg_barrier = 1;
+	fence_pkt->eng_barrier = 1;
+	fence_pkt->value = GOYA_QMAN0_FENCE_VAL;
+	fence_pkt->addr = fence_dma_addr +
+			hdev->asic_prop.host_phys_base_address;
+
+	rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_DMA_0,
+					job->job_cb_size, cb->bus_address);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to send CB on QMAN0, %d\n", rc);
+		goto free_fence_ptr;
+	}
+
+	rc = hl_poll_timeout_memory(hdev, (u64) fence_ptr,
+					HL_DEVICE_TIMEOUT_USEC, &tmp);
+
+	hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_DMA_0);
+
+	if ((rc) || (tmp != GOYA_QMAN0_FENCE_VAL)) {
+		dev_err(hdev->dev, "QMAN0 Job hasn't finished in time\n");
+		rc = -ETIMEDOUT;
+	}
+
+free_fence_ptr:
+	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_ptr,
+					fence_dma_addr);
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU) {
+		WREG32(mmDMA_QM_0_GLBL_PROT, QMAN_DMA_PARTLY_TRUSTED);
+		RREG32(mmDMA_QM_0_GLBL_PROT);
+	}
+
+	return rc;
+}
+
 int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 				u32 timeout, long *result)
 {
@@ -4189,11 +4282,950 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
 	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
 }
 
+int goya_dma_map_sg(struct hl_device *hdev, struct scatterlist *sg, int nents,
+			enum dma_data_direction dir)
+{
+	if (!dma_map_sg(&hdev->pdev->dev, sg, nents, dir))
+		return -ENOMEM;
+
+	return 0;
+}
+
+void goya_dma_unmap_sg(struct hl_device *hdev, struct scatterlist *sg,
+			int nents, enum dma_data_direction dir)
+{
+	dma_unmap_sg(&hdev->pdev->dev, sg, nents, dir);
+}
+
+u32 goya_get_dma_desc_list_size(struct hl_device *hdev,
+					struct sg_table *sgt)
+{
+	struct scatterlist *sg, *sg_next_iter;
+	u32 count, len, dma_desc_cnt, len_next;
+	dma_addr_t addr, addr_next;
+
+	dma_desc_cnt = 0;
+
+	for_each_sg(sgt->sgl, sg, sgt->nents, count) {
+
+		len = sg_dma_len(sg);
+		addr = sg_dma_address(sg);
+
+		if (len == 0)
+			break;
+
+		while ((count + 1) < sgt->nents) {
+			sg_next_iter = sg_next(sg);
+			len_next = sg_dma_len(sg_next_iter);
+			addr_next = sg_dma_address(sg_next_iter);
+
+			if (len_next == 0)
+				break;
+
+			if ((addr + len == addr_next) &&
+				(len + len_next <= DMA_MAX_TRANSFER_SIZE)) {
+				len += len_next;
+				count++;
+				sg = sg_next_iter;
+			} else {
+				break;
+			}
+		}
+
+		dma_desc_cnt++;
+	}
+
+	return dma_desc_cnt * sizeof(struct packet_lin_dma);
+}
+
+static int goya_pin_memory_before_cs(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt,
+				u64 addr, enum dma_data_direction dir)
+{
+	struct hl_userptr *userptr;
+	int rc;
+
+	if (hl_userptr_is_pinned(hdev, addr, user_dma_pkt->tsize,
+			parser->job_userptr_list, &userptr))
+		goto already_pinned;
+
+	userptr = kzalloc(sizeof(*userptr), GFP_ATOMIC);
+	if (!userptr)
+		return -ENOMEM;
+
+	rc = hl_pin_host_memory(hdev, addr, user_dma_pkt->tsize, userptr);
+	if (rc)
+		goto free_userptr;
+
+	list_add_tail(&userptr->job_node, parser->job_userptr_list);
+
+	rc = hdev->asic_funcs->asic_dma_map_sg(hdev, userptr->sgt->sgl,
+					userptr->sgt->nents, dir);
+	if (rc) {
+		dev_err(hdev->dev, "failed to map sgt with DMA region\n");
+		goto unpin_memory;
+	}
+
+	userptr->dma_mapped = true;
+	userptr->dir = dir;
+
+already_pinned:
+	parser->patched_cb_size +=
+			goya_get_dma_desc_list_size(hdev, userptr->sgt);
+
+	return 0;
+
+unpin_memory:
+	hl_unpin_host_memory(hdev, userptr);
+free_userptr:
+	kfree(userptr);
+	return rc;
+}
+
+static int goya_validate_dma_pkt_host(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	u64 device_memory_addr, addr;
+	enum dma_data_direction dir;
+	bool sram_addr = true;
+	bool skip_host_mem_pin = false;
+	int rc = 0;
+
+	switch (user_dma_pkt->dma_dir) {
+	case DMA_HOST_TO_DRAM:
+		dev_dbg(hdev->dev, "DMA direction is HOST --> DRAM\n");
+		dir = DMA_TO_DEVICE;
+		sram_addr = false;
+		addr = user_dma_pkt->src_addr;
+		device_memory_addr = user_dma_pkt->dst_addr;
+		if (user_dma_pkt->memset_mode)
+			skip_host_mem_pin = true;
+		break;
+
+	case DMA_DRAM_TO_HOST:
+		dev_dbg(hdev->dev, "DMA direction is DRAM --> HOST\n");
+		dir = DMA_FROM_DEVICE;
+		sram_addr = false;
+		addr = user_dma_pkt->dst_addr;
+		device_memory_addr = user_dma_pkt->src_addr;
+		break;
+
+	case DMA_HOST_TO_SRAM:
+		dev_dbg(hdev->dev, "DMA direction is HOST --> SRAM\n");
+		dir = DMA_TO_DEVICE;
+		addr = user_dma_pkt->src_addr;
+		device_memory_addr = user_dma_pkt->dst_addr;
+		if (user_dma_pkt->memset_mode)
+			skip_host_mem_pin = true;
+		break;
+
+	case DMA_SRAM_TO_HOST:
+		dev_dbg(hdev->dev, "DMA direction is SRAM --> HOST\n");
+		dir = DMA_FROM_DEVICE;
+		addr = user_dma_pkt->dst_addr;
+		device_memory_addr = user_dma_pkt->src_addr;
+		break;
+	default:
+		dev_err(hdev->dev, "DMA direction is undefined!!!\n");
+		return -EFAULT;
+	}
+
+	if (parser->ctx_id != HL_KERNEL_ASID_ID) {
+		if (sram_addr) {
+			if (!hl_mem_area_inside_range(device_memory_addr,
+					user_dma_pkt->tsize,
+					hdev->asic_prop.sram_user_base_address,
+					hdev->asic_prop.sram_end_address)) {
+
+				dev_err(hdev->dev,
+					"SRAM address 0x%llx + 0x%x is invalid\n",
+					device_memory_addr,
+					user_dma_pkt->tsize);
+				return -EFAULT;
+			}
+		} else {
+			if (!hl_mem_area_inside_range(device_memory_addr,
+					user_dma_pkt->tsize,
+					hdev->asic_prop.dram_user_base_address,
+					hdev->asic_prop.dram_end_address)) {
+
+				dev_err(hdev->dev,
+					"DRAM address 0x%llx + 0x%x is invalid\n",
+					device_memory_addr,
+					user_dma_pkt->tsize);
+				return -EFAULT;
+			}
+		}
+	}
+
+	if (skip_host_mem_pin)
+		parser->patched_cb_size += sizeof(*user_dma_pkt);
+	else {
+		if ((dir == DMA_TO_DEVICE) &&
+				(parser->hw_queue_id > GOYA_QUEUE_ID_DMA_1)) {
+			dev_err(hdev->dev,
+				"Can't DMA from host on queue other than 1\n");
+			return -EFAULT;
+		}
+
+		rc = goya_pin_memory_before_cs(hdev, parser, user_dma_pkt,
+						addr, dir);
+	}
+
+	return rc;
+}
+
+static int goya_validate_dma_pkt_no_host(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	u64 sram_memory_addr, dram_memory_addr;
+
+	if (user_dma_pkt->dma_dir == DMA_DRAM_TO_SRAM) {
+		dev_dbg(hdev->dev, "DMA direction is DRAM --> SRAM\n");
+		dram_memory_addr = user_dma_pkt->src_addr;
+		sram_memory_addr = user_dma_pkt->dst_addr;
+	} else {
+		dev_dbg(hdev->dev, "DMA direction is SRAM --> DRAM\n");
+		sram_memory_addr = user_dma_pkt->src_addr;
+		dram_memory_addr = user_dma_pkt->dst_addr;
+	}
+
+	if (!hl_mem_area_inside_range(sram_memory_addr, user_dma_pkt->tsize,
+				hdev->asic_prop.sram_user_base_address,
+				hdev->asic_prop.sram_end_address)) {
+		dev_err(hdev->dev, "SRAM address 0x%llx + 0x%x is invalid\n",
+			sram_memory_addr, user_dma_pkt->tsize);
+		return -EFAULT;
+	}
+
+	if (!hl_mem_area_inside_range(dram_memory_addr, user_dma_pkt->tsize,
+				hdev->asic_prop.dram_user_base_address,
+				hdev->asic_prop.dram_end_address)) {
+		dev_err(hdev->dev, "DRAM address 0x%llx + 0x%x is invalid\n",
+			dram_memory_addr, user_dma_pkt->tsize);
+		return -EFAULT;
+	}
+
+	parser->patched_cb_size += sizeof(*user_dma_pkt);
+
+	return 0;
+}
+
+static int goya_validate_dma_pkt_no_mmu(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	int rc;
+
+	dev_dbg(hdev->dev, "DMA packet details:\n");
+	dev_dbg(hdev->dev, "source == 0x%llx\n", user_dma_pkt->src_addr);
+	dev_dbg(hdev->dev, "destination == 0x%llx\n", user_dma_pkt->dst_addr);
+	dev_dbg(hdev->dev, "size == %u\n", user_dma_pkt->tsize);
+
+	/*
+	 * Special handling for DMA with size 0. The H/W has a bug where
+	 * this can cause the QMAN DMA to get stuck, so block it here.
+	 */
+	if (user_dma_pkt->tsize == 0) {
+		dev_err(hdev->dev,
+			"Got DMA with size 0, might reset the device\n");
+		return -EINVAL;
+	}
+
+	if ((user_dma_pkt->dma_dir == DMA_DRAM_TO_SRAM) ||
+			(user_dma_pkt->dma_dir == DMA_SRAM_TO_DRAM)) {
+		rc = goya_validate_dma_pkt_no_host(hdev, parser, user_dma_pkt);
+	} else {
+		rc = goya_validate_dma_pkt_host(hdev, parser, user_dma_pkt);
+	}
+
+	return rc;
+}
+
+static int goya_validate_dma_pkt_mmu(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	dev_dbg(hdev->dev, "DMA packet details:\n");
+	dev_dbg(hdev->dev, "source == 0x%llx\n", user_dma_pkt->src_addr);
+	dev_dbg(hdev->dev, "destination == 0x%llx\n", user_dma_pkt->dst_addr);
+	dev_dbg(hdev->dev, "size == %u\n", user_dma_pkt->tsize);
+
+	/*
+	 * WA for HW-23.
+	 * We can't allow user to read from Host using QMANs other than 1.
+	 */
+	if (parser->hw_queue_id > GOYA_QUEUE_ID_DMA_1 &&
+		hl_mem_area_inside_range(user_dma_pkt->src_addr,
+				user_dma_pkt->tsize,
+				hdev->asic_prop.va_space_host_start_address,
+				hdev->asic_prop.va_space_host_end_address)) {
+		dev_err(hdev->dev,
+			"Can't DMA from host on queue other than 1\n");
+		return -EFAULT;
+	}
+
+	if (user_dma_pkt->tsize == 0) {
+		dev_err(hdev->dev,
+			"Got DMA with size 0, might reset the device\n");
+		return -EINVAL;
+	}
+
+	parser->patched_cb_size += sizeof(*user_dma_pkt);
+
+	return 0;
+}
+
+static int goya_validate_wreg32(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_wreg32 *wreg_pkt)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 sob_start_addr, sob_end_addr;
+
+	dev_dbg(hdev->dev, "WREG32 packet details:\n");
+	dev_dbg(hdev->dev, "reg_offset == 0x%x\n", wreg_pkt->reg_offset);
+	dev_dbg(hdev->dev, "value      == 0x%x\n", wreg_pkt->value);
+
+	if (wreg_pkt->reg_offset != (mmDMA_CH_1_WR_COMP_ADDR_LO & 0xFFFF)) {
+		dev_err(hdev->dev, "WREG32 packet with illegal address 0x%x\n",
+			wreg_pkt->reg_offset);
+		return -EPERM;
+	}
+
+	/*
+	 * With MMU, DMA channels are not secured, so it doesn't matter where
+	 * the WR COMP will be written to because it will go out with
+	 * non-secured property
+	 */
+	if (goya->hw_cap_initialized & HW_CAP_MMU)
+		return 0;
+
+	sob_start_addr = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	sob_end_addr = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_1023);
+
+	if ((wreg_pkt->value < sob_start_addr) ||
+			(wreg_pkt->value > sob_end_addr)) {
+
+		dev_err(hdev->dev, "WREG32 packet with illegal value 0x%x\n",
+			wreg_pkt->value);
+		return -EPERM;
+	}
+
+	return 0;
+}
+
+static int goya_validate_cb(struct hl_device *hdev,
+			struct hl_cs_parser *parser, bool is_mmu)
+{
+	u32 cb_parsed_length = 0;
+	int rc = 0;
+
+	parser->patched_cb_size = 0;
+
+	/* user_cb_size is more than 0 so loop will always be executed */
+	while ((cb_parsed_length < parser->user_cb_size) && (!rc)) {
+		enum packet_id pkt_id;
+		u16 pkt_size;
+		void *user_pkt;
+
+		user_pkt = (void *) (parser->user_cb->kernel_address +
+							cb_parsed_length);
+
+		pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
+				PACKET_HEADER_PACKET_ID_MASK) >>
+					PACKET_HEADER_PACKET_ID_SHIFT);
+
+		pkt_size = goya_packet_sizes[pkt_id];
+		cb_parsed_length += pkt_size;
+		if (cb_parsed_length > parser->user_cb_size) {
+			dev_err(hdev->dev,
+				"packet 0x%x is out of CB boundary\n", pkt_id);
+			rc = -EINVAL;
+			continue;
+		}
+
+		switch (pkt_id) {
+		case PACKET_WREG_32:
+			/*
+			 * Although it is validated after copy in patch_cb(),
+			 * need to validate here as well because patch_cb() is
+			 * not called in MMU path while this function is called
+			 */
+			rc = goya_validate_wreg32(hdev, parser, user_pkt);
+			break;
+
+		case PACKET_WREG_BULK:
+			dev_err(hdev->dev,
+				"User not allowed to use WREG_BULK\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_MSG_PROT:
+			dev_err(hdev->dev,
+				"User not allowed to use MSG_PROT\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_CP_DMA:
+			dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_STOP:
+			dev_err(hdev->dev, "User not allowed to use STOP\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_LIN_DMA:
+			if (is_mmu)
+				rc = goya_validate_dma_pkt_mmu(hdev, parser,
+						user_pkt);
+			else
+				rc = goya_validate_dma_pkt_no_mmu(hdev, parser,
+						user_pkt);
+			break;
+
+		case PACKET_MSG_LONG:
+		case PACKET_MSG_SHORT:
+		case PACKET_FENCE:
+		case PACKET_NOP:
+			parser->patched_cb_size += pkt_size;
+			break;
+
+		default:
+			dev_err(hdev->dev, "Invalid packet header 0x%x\n",
+				pkt_id);
+			rc = -EINVAL;
+			break;
+		}
+	}
+
+	/*
+	 * The new CB should have space at the end for two MSG_PROT packets:
+	 * 1. A packet that will act as a completion packet
+	 * 2. A packet that will generate MSI-X interrupt
+	 */
+	parser->patched_cb_size += sizeof(struct packet_msg_prot) * 2;
+
+	return rc;
+}
+
+static int goya_patch_dma_packet(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt,
+				struct packet_lin_dma *new_dma_pkt,
+				u32 *new_dma_pkt_size)
+{
+	struct hl_userptr *userptr;
+	struct scatterlist *sg, *sg_next_iter;
+	u32 count, len, dma_desc_cnt, len_next;
+	dma_addr_t dma_addr, dma_addr_next;
+	u64 device_memory_addr, addr;
+	enum dma_data_direction dir;
+	struct sg_table *sgt;
+	bool skip_host_mem_pin = false;
+
+	if ((user_dma_pkt->dma_dir == DMA_DRAM_TO_SRAM) ||
+			(user_dma_pkt->dma_dir == DMA_SRAM_TO_DRAM) ||
+			(user_dma_pkt->tsize == 0)) {
+		memcpy(new_dma_pkt, user_dma_pkt, sizeof(*new_dma_pkt));
+		*new_dma_pkt_size = sizeof(*new_dma_pkt);
+		return 0;
+	}
+
+	if ((user_dma_pkt->dma_dir == DMA_HOST_TO_DRAM) ||
+			(user_dma_pkt->dma_dir == DMA_HOST_TO_SRAM)) {
+		addr = user_dma_pkt->src_addr;
+		device_memory_addr = user_dma_pkt->dst_addr;
+		dir = DMA_TO_DEVICE;
+		if (user_dma_pkt->memset_mode)
+			skip_host_mem_pin = true;
+	} else {
+		addr = user_dma_pkt->dst_addr;
+		device_memory_addr = user_dma_pkt->src_addr;
+		dir = DMA_FROM_DEVICE;
+	}
+
+	if ((!skip_host_mem_pin) &&
+		(hl_userptr_is_pinned(hdev, addr, user_dma_pkt->tsize,
+			parser->job_userptr_list, &userptr) == false)) {
+		dev_err(hdev->dev, "Userptr 0x%llx + 0x%x NOT mapped !!!\n",
+				addr, user_dma_pkt->tsize);
+		return -EFAULT;
+	}
+
+	if ((user_dma_pkt->memset_mode) && (dir == DMA_TO_DEVICE)) {
+		memcpy(new_dma_pkt, user_dma_pkt, sizeof(*user_dma_pkt));
+		*new_dma_pkt_size = sizeof(*user_dma_pkt);
+		return 0;
+	}
+
+	sgt = userptr->sgt;
+	dma_desc_cnt = 0;
+
+	for_each_sg(sgt->sgl, sg, sgt->nents, count) {
+		len = sg_dma_len(sg);
+		dma_addr = sg_dma_address(sg);
+
+		if (len == 0)
+			break;
+
+		while ((count + 1) < sgt->nents) {
+			sg_next_iter = sg_next(sg);
+			len_next = sg_dma_len(sg_next_iter);
+			dma_addr_next = sg_dma_address(sg_next_iter);
+
+			if (len_next == 0)
+				break;
+
+			if ((dma_addr + len == dma_addr_next) &&
+				(len + len_next <= DMA_MAX_TRANSFER_SIZE)) {
+				len += len_next;
+				count++;
+				sg = sg_next_iter;
+			} else {
+				break;
+			}
+		}
+
+		new_dma_pkt->ctl = user_dma_pkt->ctl;
+		if (likely(dma_desc_cnt))
+			new_dma_pkt->eng_barrier = 0;
+		new_dma_pkt->rdcomp = 0;
+		new_dma_pkt->wrcomp = 0;
+		new_dma_pkt->tsize = len;
+
+		dma_addr += hdev->asic_prop.host_phys_base_address;
+
+		if (dir == DMA_TO_DEVICE) {
+			new_dma_pkt->src_addr = dma_addr;
+			new_dma_pkt->dst_addr = device_memory_addr;
+		} else {
+			new_dma_pkt->src_addr = device_memory_addr;
+			new_dma_pkt->dst_addr = dma_addr;
+		}
+
+		if (!user_dma_pkt->memset_mode)
+			device_memory_addr += len;
+		dma_desc_cnt++;
+		new_dma_pkt++;
+	}
+
+	if (!dma_desc_cnt) {
+		dev_err(hdev->dev,
+			"Error of 0 SG entries when patching DMA packet\n");
+		return -EFAULT;
+	}
+
+	/* Fix the last dma packet - rdcomp/wrcomp must be as user set them */
+	new_dma_pkt--;
+	new_dma_pkt->rdcomp = user_dma_pkt->rdcomp;
+	new_dma_pkt->wrcomp = user_dma_pkt->wrcomp;
+
+	*new_dma_pkt_size = dma_desc_cnt * sizeof(struct packet_lin_dma);
+
+	return 0;
+}
+
+static int goya_patch_cb(struct hl_device *hdev,
+				struct hl_cs_parser *parser)
+{
+	u32 cb_parsed_length = 0;
+	u32 cb_patched_cur_length = 0;
+	int rc = 0;
+
+	/* user_cb_size is more than 0 so loop will always be executed */
+	while ((cb_parsed_length < parser->user_cb_size) && (!rc)) {
+		enum packet_id pkt_id;
+		u16 pkt_size;
+		u32 new_pkt_size = 0;
+		void *user_pkt, *kernel_pkt;
+
+		user_pkt = (void *) (parser->user_cb->kernel_address +
+							cb_parsed_length);
+		kernel_pkt = (void *) (parser->patched_cb->kernel_address +
+							cb_patched_cur_length);
+
+		pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
+				PACKET_HEADER_PACKET_ID_MASK) >>
+					PACKET_HEADER_PACKET_ID_SHIFT);
+
+		pkt_size = goya_packet_sizes[pkt_id];
+		cb_parsed_length += pkt_size;
+		if (cb_parsed_length > parser->user_cb_size) {
+			dev_err(hdev->dev,
+				"packet 0x%x is out of CB boundary\n", pkt_id);
+			rc = -EINVAL;
+			continue;
+		}
+
+		switch (pkt_id) {
+		case PACKET_LIN_DMA:
+			rc = goya_patch_dma_packet(hdev, parser, user_pkt,
+						kernel_pkt, &new_pkt_size);
+			cb_patched_cur_length += new_pkt_size;
+			break;
+
+		case PACKET_WREG_32:
+			memcpy(kernel_pkt, user_pkt, pkt_size);
+			cb_patched_cur_length += pkt_size;
+			rc = goya_validate_wreg32(hdev, parser, kernel_pkt);
+			break;
+
+		case PACKET_WREG_BULK:
+			dev_err(hdev->dev,
+				"User not allowed to use WREG_BULK\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_MSG_PROT:
+			dev_err(hdev->dev,
+				"User not allowed to use MSG_PROT\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_CP_DMA:
+			dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_STOP:
+			dev_err(hdev->dev, "User not allowed to use STOP\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_MSG_LONG:
+		case PACKET_MSG_SHORT:
+		case PACKET_FENCE:
+		case PACKET_NOP:
+			memcpy(kernel_pkt, user_pkt, pkt_size);
+			cb_patched_cur_length += pkt_size;
+			break;
+
+		default:
+			dev_err(hdev->dev, "Invalid packet header 0x%x\n",
+				pkt_id);
+			rc = -EINVAL;
+			break;
+		}
+	}
+
+	return rc;
+}
+
+static int goya_parse_cb_mmu(struct hl_device *hdev,
+		struct hl_cs_parser *parser)
+{
+	u64 patched_cb_handle;
+	u32 patched_cb_size;
+	struct hl_cb *user_cb;
+	int rc;
+
+	/*
+	 * The new CB should have space at the end for two MSG_PROT pkt:
+	 * 1. A packet that will act as a completion packet
+	 * 2. A packet that will generate MSI-X interrupt
+	 */
+	parser->patched_cb_size = parser->user_cb_size +
+			sizeof(struct packet_msg_prot) * 2;
+
+	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr,
+				parser->patched_cb_size,
+				&patched_cb_handle, HL_KERNEL_ASID_ID);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to allocate patched CB for DMA CS %d\n",
+			rc);
+		return rc;
+	}
+
+	patched_cb_handle >>= PAGE_SHIFT;
+	parser->patched_cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr,
+				(u32) patched_cb_handle);
+	/* hl_cb_get should never fail here so use kernel WARN */
+	WARN(!parser->patched_cb, "DMA CB handle invalid 0x%x\n",
+			(u32) patched_cb_handle);
+	if (!parser->patched_cb) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	/*
+	 * The check that parser->user_cb_size <= parser->user_cb->size was done
+	 * in validate_queue_index().
+	 */
+	memcpy((void *) parser->patched_cb->kernel_address,
+		(void *) parser->user_cb->kernel_address,
+		parser->user_cb_size);
+
+	patched_cb_size = parser->patched_cb_size;
+
+	/* validate patched CB instead of user CB */
+	user_cb = parser->user_cb;
+	parser->user_cb = parser->patched_cb;
+	rc = goya_validate_cb(hdev, parser, true);
+	parser->user_cb = user_cb;
+
+	if (rc) {
+		hl_cb_put(parser->patched_cb);
+		goto out;
+	}
+
+	if (patched_cb_size != parser->patched_cb_size) {
+		dev_err(hdev->dev, "user CB size mismatch\n");
+		hl_cb_put(parser->patched_cb);
+		rc = -EINVAL;
+		goto out;
+	}
+
+out:
+	/*
+	 * Always call cb destroy here because we still have 1 reference
+	 * to it by calling cb_get earlier. After the job is completed,
+	 * cb_put will release it, but here we want to remove it from the
+	 * idr
+	 */
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr,
+					patched_cb_handle << PAGE_SHIFT);
+
+	return rc;
+}
+
+int goya_parse_cb_no_mmu(struct hl_device *hdev, struct hl_cs_parser *parser)
+{
+	u64 patched_cb_handle;
+	int rc;
+
+	rc = goya_validate_cb(hdev, parser, false);
+
+	if (rc)
+		goto free_userptr;
+
+	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr,
+				parser->patched_cb_size,
+				&patched_cb_handle, HL_KERNEL_ASID_ID);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to allocate patched CB for DMA CS %d\n", rc);
+		goto free_userptr;
+	}
+
+	patched_cb_handle >>= PAGE_SHIFT;
+	parser->patched_cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr,
+				(u32) patched_cb_handle);
+	/* hl_cb_get should never fail here so use kernel WARN */
+	WARN(!parser->patched_cb, "DMA CB handle invalid 0x%x\n",
+			(u32) patched_cb_handle);
+	if (!parser->patched_cb) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	rc = goya_patch_cb(hdev, parser);
+
+	if (rc)
+		hl_cb_put(parser->patched_cb);
+
+out:
+	/*
+	 * Always call cb destroy here because we still have 1 reference
+	 * to it by calling cb_get earlier. After the job is completed,
+	 * cb_put will release it, but here we want to remove it from the
+	 * idr
+	 */
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr,
+				patched_cb_handle << PAGE_SHIFT);
+
+free_userptr:
+	if (rc)
+		hl_userptr_delete_list(hdev, parser->job_userptr_list);
+	return rc;
+}
+
+int goya_parse_cb_no_ext_quque(struct hl_device *hdev,
+		struct hl_cs_parser *parser)
+{
+	struct asic_fixed_properties *asic_prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU)) {
+		/* For internal queue jobs, just check if cb address is valid */
+		if (hl_mem_area_inside_range(
+				(u64) parser->user_cb,
+				parser->user_cb_size,
+				asic_prop->sram_user_base_address,
+				asic_prop->sram_end_address))
+			return 0;
+
+		if (hl_mem_area_inside_range(
+				(u64) parser->user_cb,
+				parser->user_cb_size,
+				asic_prop->dram_user_base_address,
+				asic_prop->dram_end_address))
+			return 0;
+
+		dev_err(hdev->dev,
+			"Internal CB address 0x%llx + 0x%x is not in SRAM nor in DRAM\n",
+			(u64) parser->user_cb, parser->user_cb_size);
+
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	if (!parser->ext_queue)
+		return goya_parse_cb_no_ext_quque(hdev, parser);
+
+	if ((goya->hw_cap_initialized & HW_CAP_MMU) && parser->use_virt_addr)
+		return goya_parse_cb_mmu(hdev, parser);
+	else
+		return goya_parse_cb_no_mmu(hdev, parser);
+}
+
+void goya_add_end_of_cb_packets(u64 kernel_address, u32 len, u64 cq_addr,
+				u32 cq_val, u32 msix_vec)
+{
+	struct packet_msg_prot *cq_pkt;
+
+	cq_pkt = (struct packet_msg_prot *) (kernel_address + len -
+					(sizeof(struct packet_msg_prot) * 2));
+
+	cq_pkt->ctl = 0;
+	cq_pkt->opcode = PACKET_MSG_PROT;
+	cq_pkt->reg_barrier = 0;
+	cq_pkt->msg_barrier = 1;
+	cq_pkt->eng_barrier = 1;
+	cq_pkt->value = cq_val;
+	cq_pkt->addr = cq_addr;
+
+	cq_pkt++;
+
+	cq_pkt->ctl = 0;
+	cq_pkt->opcode = PACKET_MSG_PROT;
+	cq_pkt->reg_barrier = 0;
+	cq_pkt->msg_barrier = 1;
+	cq_pkt->eng_barrier = 0;
+	cq_pkt->value = msix_vec & 0x7FF;
+	cq_pkt->addr = CFG_BASE + mmPCIE_DBI_MSIX_DOORBELL_OFF;
+}
+
 static void goya_update_eq_ci(struct hl_device *hdev, u32 val)
 {
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, val);
 }
 
+int goya_context_switch(struct hl_device *hdev, u32 asid)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct packet_lin_dma *clear_sram_pkt;
+	struct hl_cs_parser parser;
+	struct hl_cs_job *job;
+	u32 cb_size;
+	struct hl_cb *cb;
+	int rc;
+
+	cb = hl_cb_kernel_create(hdev, PAGE_SIZE);
+	if (!cb)
+		return -EFAULT;
+
+	clear_sram_pkt = (struct packet_lin_dma *) cb->kernel_address;
+	memset(clear_sram_pkt, 0, sizeof(*clear_sram_pkt));
+	cb_size = sizeof(*clear_sram_pkt);
+
+	clear_sram_pkt->opcode = PACKET_LIN_DMA;
+	clear_sram_pkt->src_addr = 0x7777777777777777ull;
+	clear_sram_pkt->dst_addr = prop->sram_base_address;
+	clear_sram_pkt->dma_dir = DMA_HOST_TO_SRAM;
+	if (hdev->pldm)
+		clear_sram_pkt->tsize = 0x10000;
+	else
+		clear_sram_pkt->tsize = prop->sram_size;
+	clear_sram_pkt->weakly_ordered = 1;
+	clear_sram_pkt->reg_barrier = 1;
+	clear_sram_pkt->msg_barrier = 1;
+	clear_sram_pkt->memset_mode = 1;
+
+	job = hl_cs_allocate_job(hdev, true);
+	if (!job) {
+		dev_err(hdev->dev, "Failed to allocate a new job\n");
+		rc = -ENOMEM;
+		goto release_cb;
+	}
+
+	job->id = 0;
+	job->user_cb = cb;
+	job->user_cb->cs_cnt++;
+	job->user_cb_size = cb_size;
+	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
+
+	parser.ctx_id = HL_KERNEL_ASID_ID;
+	parser.cs_sequence = 0;
+	parser.job_id = job->id;
+	parser.hw_queue_id = job->hw_queue_id;
+	parser.job_userptr_list = &job->userptr_list;
+	parser.user_cb = job->user_cb;
+	parser.user_cb_size = job->user_cb_size;
+	parser.ext_queue = job->ext_queue;
+	parser.use_virt_addr = hdev->mmu_enable;
+
+	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to parse kernel CB during context switch\n");
+		goto free_job;
+	}
+
+	job->patched_cb = parser.patched_cb;
+	job->job_cb_size = parser.patched_cb_size;
+	job->patched_cb->cs_cnt++;
+
+	rc = goya_send_job_on_qman0(hdev, job);
+
+	job->patched_cb->cs_cnt--;
+	hl_cb_put(job->patched_cb);
+
+free_job:
+	hl_userptr_delete_list(hdev, &job->userptr_list);
+	kfree(job);
+	cb->cs_cnt--;
+
+release_cb:
+	hl_cb_put(cb);
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb->id << PAGE_SHIFT);
+
+	return rc;
+}
+
+void goya_restore_phase_topology(struct hl_device *hdev)
+{
+	int i, num_of_sob_in_longs, num_of_mon_in_longs;
+
+	num_of_sob_in_longs =
+		((mmSYNC_MNGR_SOB_OBJ_1023 - mmSYNC_MNGR_SOB_OBJ_0) + 4);
+
+	num_of_mon_in_longs =
+		((mmSYNC_MNGR_MON_STATUS_255 - mmSYNC_MNGR_MON_STATUS_0) + 4);
+
+	for (i = 0 ; i < num_of_sob_in_longs ; i += 4)
+		WREG32(mmSYNC_MNGR_SOB_OBJ_0 + i, 0);
+
+	for (i = 0 ; i < num_of_mon_in_longs ; i += 4)
+		WREG32(mmSYNC_MNGR_MON_STATUS_0 + i, 0);
+
+	/* Flush all WREG to prevent race */
+	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
+}
+
 static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
 		u16 event_type, char *axi_name, int len)
 {
@@ -4645,6 +5677,48 @@ static void goya_disable_clock_gating(struct hl_device *hdev)
 
 }
 
+static bool goya_is_device_idle(struct hl_device *hdev)
+{
+	u64 offset, dma_qm_reg, tpc_qm_reg, tpc_cmdq_reg, tpc_cfg_reg;
+	bool val = true;
+	int i;
+
+	offset = mmDMA_QM_1_GLBL_STS0 - mmDMA_QM_0_GLBL_STS0;
+
+	for (i = 0 ; i < DMA_MAX_NUM ; i++) {
+		dma_qm_reg = mmDMA_QM_0_GLBL_STS0 + i * offset;
+
+		val = val && ((RREG32(dma_qm_reg) & DMA_QM_IDLE_MASK) ==
+				DMA_QM_IDLE_MASK);
+	}
+
+	offset = mmTPC1_QM_GLBL_STS0 - mmTPC0_QM_GLBL_STS0;
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++) {
+		tpc_qm_reg = mmTPC0_QM_GLBL_STS0 + i * offset;
+		tpc_cmdq_reg = mmTPC0_CMDQ_GLBL_STS0 + i * offset;
+		tpc_cfg_reg = mmTPC0_CFG_STATUS + i * offset;
+
+		val = val && ((RREG32(tpc_qm_reg) & TPC_QM_IDLE_MASK) ==
+				TPC_QM_IDLE_MASK);
+		val = val && ((RREG32(tpc_cmdq_reg) & TPC_CMDQ_IDLE_MASK) ==
+				TPC_CMDQ_IDLE_MASK);
+		val = val && ((RREG32(tpc_cfg_reg) & TPC_CFG_IDLE_MASK) ==
+				TPC_CFG_IDLE_MASK);
+	}
+
+	val = val && ((RREG32(mmMME_QM_GLBL_STS0) & MME_QM_IDLE_MASK) ==
+			MME_QM_IDLE_MASK);
+	val = val && ((RREG32(mmMME_CMDQ_GLBL_STS0) & MME_CMDQ_IDLE_MASK) ==
+			MME_CMDQ_IDLE_MASK);
+	val = val && ((RREG32(mmMME_ARCH_STATUS) & MME_ARCH_IDLE_MASK) ==
+			MME_ARCH_IDLE_MASK);
+	val = val && ((RREG32(mmMME_SHADOW_0_STATUS) & MME_SHADOW_IDLE_MASK) ==
+			0);
+
+	return val;
+}
+
 static void goya_hw_queues_lock(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -4732,7 +5806,14 @@ static const struct hl_asic_funcs goya_funcs = {
 	.dma_pool_free = goya_dma_pool_free,
 	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
 	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
+	.hl_dma_unmap_sg = goya_dma_unmap_sg,
+	.cs_parser = goya_cs_parser,
+	.asic_dma_map_sg = goya_dma_map_sg,
+	.get_dma_desc_list_size = goya_get_dma_desc_list_size,
+	.add_end_of_cb_packets = goya_add_end_of_cb_packets,
 	.update_eq_ci = goya_update_eq_ci,
+	.context_switch = goya_context_switch,
+	.restore_phase_topology = goya_restore_phase_topology,
 	.add_device_attr = goya_add_device_attr,
 	.remove_device_attr = goya_remove_device_attr,
 	.handle_eqe = goya_handle_eqe,
@@ -4741,6 +5822,7 @@ static const struct hl_asic_funcs goya_funcs = {
 	.send_heartbeat = goya_send_heartbeat,
 	.enable_clock_gating = goya_init_clock_gating,
 	.disable_clock_gating = goya_disable_clock_gating,
+	.is_device_idle = goya_is_device_idle,
 	.soft_reset_late_init = goya_soft_reset_late_init,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index c0779dd447bd..793512b0fa09 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -33,6 +33,11 @@
 
 #define HL_MAX_QUEUES			128
 
+#define HL_MAX_JOBS_PER_CS		64
+
+/* MUST BE POWER OF 2 and larger than 1 */
+#define HL_MAX_PENDING_CS		64
+
 struct hl_device;
 struct hl_fpriv;
 
@@ -63,6 +68,16 @@ struct hw_queue_properties {
 	u8			kmd_only;
 };
 
+/**
+ * enum vm_type_t - virtual memory mapping request information.
+ * @VM_TYPE_USERPTR: mapping of user memory to device virtual address.
+ * @VM_TYPE_PHYS_LIST: mapping of DRAM memory to device virtual address.
+ */
+enum vm_type_t {
+	VM_TYPE_USERPTR,
+	VM_TYPE_PHYS_LIST
+};
+
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
  * @hw_queues_props: H/W queues properties.
@@ -137,6 +152,19 @@ struct asic_fixed_properties {
 	u8			tpc_enabled_mask;
 };
 
+/**
+ * struct hl_dma_fence - wrapper for fence object used by command submissions.
+ * @base_fence: kernel fence object.
+ * @lock: spinlock to protect fence.
+ * @hdev: habanalabs device structure.
+ * @cs_seq: command submission sequence number.
+ */
+struct hl_dma_fence {
+	struct dma_fence	base_fence;
+	spinlock_t		lock;
+	struct hl_device	*hdev;
+	u64			cs_seq;
+};
 
 
 
@@ -168,6 +196,7 @@ struct hl_cb_mgr {
  * @vm_end: Holds the CB's user end virtual address (when mmaped).
  * @size: holds the CB's size.
  * @id: the CB's ID.
+ * @cs_cnt: holds number of CS that this CB participates in.
  * @ctx_id: holds the ID of the owner's context.
  * @mmap: true if the CB is currently mmaped to user.
  * @is_pool: true if CB was acquired from the pool, false otherwise.
@@ -183,6 +212,7 @@ struct hl_cb {
 	u64			vm_end;
 	u32			size;
 	u32			id;
+	u32			cs_cnt;
 	u32			ctx_id;
 	u8			mmap;
 	u8			is_pool;
@@ -314,6 +344,7 @@ enum hl_asic_type {
 	ASIC_LAST
 };
 
+struct hl_cs_parser;
 
 /**
  * enum hl_pm_mng_profile - power management profile.
@@ -368,7 +399,14 @@ enum hl_pll_frequency {
  * @dma_pool_free: free small DMA allocation from pool.
  * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
  * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
+ * @hl_dma_unmap_sg: DMA unmap scatter-gather list.
+ * @cs_parser: parse Command Submission.
+ * @asic_dma_map_sg: DMA map scatter-gather list.
+ * @get_dma_desc_list_size: get number of LIN_DMA packets required for CB.
+ * @add_end_of_cb_packets: Add packets to the end of CB, if device requires it.
  * @update_eq_ci: update event queue CI.
+ * @context_switch: called upon ASID context switch.
+ * @restore_phase_topology: clear all SOBs and MONs.
  * @add_device_attr: add ASIC specific device attributes.
  * @remove_device_attr: remove ASIC specific device attributes.
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
@@ -377,6 +415,7 @@ enum hl_pll_frequency {
  * @send_heartbeat: send is-alive packet to ArmCP and verify response.
  * @enable_clock_gating: enable clock gating for reducing power consumption.
  * @disable_clock_gating: disable clock for accessing registers on HBW.
+ * @is_device_idle: return true if device is idle, false otherwise.
  * @soft_reset_late_init: perform certain actions needed after soft reset.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
@@ -415,7 +454,20 @@ struct hl_asic_funcs {
 				size_t size, dma_addr_t *dma_handle);
 	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
 				size_t size, void *vaddr);
+	void (*hl_dma_unmap_sg)(struct hl_device *hdev,
+				struct scatterlist *sg, int nents,
+				enum dma_data_direction dir);
+	int (*cs_parser)(struct hl_device *hdev, struct hl_cs_parser *parser);
+	int (*asic_dma_map_sg)(struct hl_device *hdev,
+				struct scatterlist *sg, int nents,
+				enum dma_data_direction dir);
+	u32 (*get_dma_desc_list_size)(struct hl_device *hdev,
+					struct sg_table *sgt);
+	void (*add_end_of_cb_packets)(u64 kernel_address, u32 len, u64 cq_addr,
+					u32 cq_val, u32 msix_num);
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
+	int (*context_switch)(struct hl_device *hdev, u32 asid);
+	void (*restore_phase_topology)(struct hl_device *hdev);
 	int (*add_device_attr)(struct hl_device *hdev);
 	void (*remove_device_attr)(struct hl_device *hdev);
 	void (*handle_eqe)(struct hl_device *hdev,
@@ -426,6 +478,7 @@ struct hl_asic_funcs {
 	int (*send_heartbeat)(struct hl_device *hdev);
 	void (*enable_clock_gating)(struct hl_device *hdev);
 	void (*disable_clock_gating)(struct hl_device *hdev);
+	bool (*is_device_idle)(struct hl_device *hdev);
 	int (*soft_reset_late_init)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
@@ -451,12 +504,28 @@ struct hl_asic_funcs {
  * @hdev: pointer to the device structure.
  * @refcount: reference counter for the context. Context is released only when
  *		this hits 0l. It is incremented on CS and CS_WAIT.
+ * @cs_pending: array of DMA fence objects representing pending CS.
+ * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
+ *			to user so user could inquire about CS. It is used as
+ *			index to cs_pending array.
+ * @cs_lock: spinlock to protect cs_sequence.
+ * @thread_restore_token: token to prevent multiple threads of the same context
+ *				from running the restore phase. Only one thread
+ *				should run it.
+ * @thread_restore_wait_token: token to prevent the threads that didn't run
+ *				the restore phase from moving to their execution
+ *				phase before the restore phase has finished.
  * @asid: context's unique address space ID in the device's MMU.
  */
 struct hl_ctx {
 	struct hl_fpriv		*hpriv;
 	struct hl_device	*hdev;
 	struct kref		refcount;
+	struct dma_fence	*cs_pending[HL_MAX_PENDING_CS];
+	u64			cs_sequence;
+	spinlock_t		cs_lock;
+	atomic_t		thread_restore_token;
+	u32			thread_restore_wait_token;
 	u32			asid;
 };
 
@@ -474,16 +543,134 @@ struct hl_ctx_mgr {
 
 
 
+
+/*
+ * COMMAND SUBMISSIONS
+ */
+
+/**
+ * struct hl_userptr - memory mapping chunk information
+ * @vm_type: type of the VM.
+ * @job_node: linked-list node for hanging the object on the Job's list.
+ * @vec: pointer to the frame vector.
+ * @sgt: pointer to the scatter-gather table that holds the pages.
+ * @dir: for DMA unmapping, the direction must be supplied, so save it.
+ * @debugfs_list: node in debugfs list of command submissions.
+ * @addr: user-space virtual pointer to the start of the memory area.
+ * @size: size of the memory area to pin & map.
+ * @dma_mapped: true if the SG was mapped to DMA addresses, false otherwise.
+ */
+struct hl_userptr {
+	enum vm_type_t		vm_type; /* must be first */
+	struct list_head	job_node;
+	struct frame_vector	*vec;
+	struct sg_table		*sgt;
+	enum dma_data_direction dir;
+	struct list_head	debugfs_list;
+	u64			addr;
+	u32			size;
+	u8			dma_mapped;
+};
+
+/**
+ * struct hl_cs - command submission.
+ * @jobs_in_queue_cnt: per-queue counter of submitted jobs.
+ * @ctx: the context this CS belongs to.
+ * @job_list: list of the CS's jobs in the various queues.
+ * @job_lock: spinlock for the CS's jobs list. Needed for free_job.
+ * @refcount: reference counter for usage of the CS.
+ * @fence: pointer to the fence object of this CS.
+ * @work_tdr: delayed work node for TDR.
+ * @mirror_node: node in device mirror list of command submissions.
+ * @sequence: the sequence number of this CS.
+ * @submitted: true if CS was submitted to H/W.
+ * @completed: true if CS was completed by device.
+ * @timedout: true if CS has timed out.
+ * @tdr_active: true if TDR was activated for this CS (to prevent
+ *		double TDR activation).
+ * @aborted: true if CS was aborted due to some device error.
+ */
+struct hl_cs {
+	u8			jobs_in_queue_cnt[HL_MAX_QUEUES];
+	struct hl_ctx		*ctx;
+	struct list_head	job_list;
+	spinlock_t		job_lock;
+	struct kref		refcount;
+	struct dma_fence	*fence;
+	struct delayed_work	work_tdr;
+	struct list_head	mirror_node;
+	u64			sequence;
+	u8			submitted;
+	u8			completed;
+	u8			timedout;
+	u8			tdr_active;
+	u8			aborted;
+};
+
 /**
  * struct hl_cs_job - command submission job.
+ * @cs_node: the node to hang on the CS jobs list.
+ * @cs: the CS this job belongs to.
+ * @user_cb: the CB we got from the user.
+ * @patched_cb: in case of patching, this is internal CB which is submitted on
+ *		the queue instead of the CB we got from the IOCTL.
  * @finish_work: workqueue object to run when job is completed.
+ * @userptr_list: linked-list of userptr mappings that belong to this job and
+ *			wait for completion.
  * @id: the id of this job inside a CS.
+ * @hw_queue_id: the id of the H/W queue this job is submitted to.
+ * @user_cb_size: the actual size of the CB we got from the user.
+ * @job_cb_size: the actual size of the CB that we put on the queue.
+ * @ext_queue: whether the job is for external queue or internal queue.
  */
 struct hl_cs_job {
+	struct list_head	cs_node;
+	struct hl_cs		*cs;
+	struct hl_cb		*user_cb;
+	struct hl_cb		*patched_cb;
 	struct work_struct	finish_work;
+	struct list_head	userptr_list;
 	u32			id;
+	u32			hw_queue_id;
+	u32			user_cb_size;
+	u32			job_cb_size;
+	u8			ext_queue;
 };
 
+/**
+ * struct hl_cs_parser - command submission parser properties.
+ * @user_cb: the CB we got from the user.
+ * @patched_cb: in case of patching, this is internal CB which is submitted on
+ *		the queue instead of the CB we got from the IOCTL.
+ * @job_userptr_list: linked-list of userptr mappings that belong to the related
+ *			job and wait for completion.
+ * @cs_sequence: the sequence number of the related CS.
+ * @ctx_id: the ID of the context the related CS belongs to.
+ * @hw_queue_id: the id of the H/W queue this job is submitted to.
+ * @user_cb_size: the actual size of the CB we got from the user.
+ * @patched_cb_size: the size of the CB after parsing.
+ * @ext_queue: whether the job is for external queue or internal queue.
+ * @job_id: the id of the related job inside the related CS.
+ * @use_virt_addr: whether to treat the addresses in the CB as virtual during
+ *			parsing.
+ */
+struct hl_cs_parser {
+	struct hl_cb		*user_cb;
+	struct hl_cb		*patched_cb;
+	struct list_head	*job_userptr_list;
+	u64			cs_sequence;
+	u32			ctx_id;
+	u32			hw_queue_id;
+	u32			user_cb_size;
+	u32			patched_cb_size;
+	u8			ext_queue;
+	u8			job_id;
+	u8			use_virt_addr;
+};
+
+
+
+
 
 /*
  * FILE PRIVATE STRUCTURE
@@ -498,6 +685,7 @@ struct hl_cs_job {
  * @ctx_mgr: context manager to handle multiple context for this FD.
  * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
  * @refcount: number of related contexts.
+ * @restore_phase_mutex: lock for context switch and restore phase.
  */
 struct hl_fpriv {
 	struct hl_device	*hdev;
@@ -507,6 +695,7 @@ struct hl_fpriv {
 	struct hl_ctx_mgr	ctx_mgr;
 	struct hl_cb_mgr	cb_mgr;
 	struct kref		refcount;
+	struct mutex		restore_phase_mutex;
 };
 
 
@@ -578,6 +767,8 @@ struct hl_device_reset_work {
  * @eq_wq: work queue of event queue for executing work in process context.
  * @kernel_ctx: KMD context structure.
  * @kernel_queues: array of hl_hw_queue.
+ * @hw_queues_mirror_list: CS mirror list for TDR.
+ * @hw_queues_mirror_lock: protects hw_queues_mirror_list.
  * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
  * @event_queue: event queue for IRQ from ArmCP.
  * @dma_pool: DMA pool for small allocations.
@@ -601,6 +792,7 @@ struct hl_device_reset_work {
  * @in_reset: is device in reset flow.
  * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open context executing.
+ * @timeout_jiffies: device CS timeout value.
  * @max_power: the max power of the device, as configured by the sysadmin. This
  *             value is saved so in case of hard-reset, KMD will restore this
  *             value and update the F/W after the re-initialization
@@ -614,6 +806,9 @@ struct hl_device_reset_work {
  * @hwmon_initialized: is H/W monitor sensors was initialized.
  * @hard_reset_pending: is there a hard reset work pending.
  * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
+ * @reset_on_lockup: true if a reset should be done in case of stuck CS, false
+ *                   otherwise.
+ * @mmu_enable: is MMU enabled.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -630,6 +825,8 @@ struct hl_device {
 	struct workqueue_struct		*eq_wq;
 	struct hl_ctx			*kernel_ctx;
 	struct hl_hw_queue		*kernel_queues;
+	struct list_head		hw_queues_mirror_list;
+	spinlock_t			hw_queues_mirror_lock;
 	struct hl_cb_mgr		kernel_cb_mgr;
 	struct hl_eq			event_queue;
 	struct dma_pool			*dma_pool;
@@ -658,6 +855,7 @@ struct hl_device {
 	atomic_t			in_reset;
 	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
+	u64				timeout_jiffies;
 	u64				max_power;
 	u32				major;
 	u32				high_pll;
@@ -669,8 +867,10 @@ struct hl_device {
 	u8				hwmon_initialized;
 	u8				hard_reset_pending;
 	u8				heartbeat;
+	u8				reset_on_lockup;
 
 	/* Parameters for bring-up */
+	u8				mmu_enable;
 	u8				cpu_enable;
 	u8				reset_pcilink;
 	u8				config_pll;
@@ -712,6 +912,58 @@ struct hl_ioctl_desc {
  * Kernel module functions that can be accessed by entire module
  */
 
+/**
+ * hl_mem_area_inside_range() - Checks whether address+size are inside a range.
+ * @address: The start address of the area we want to validate.
+ * @size: The size in bytes of the area we want to validate.
+ * @range_start_address: The start address of the valid range.
+ * @range_end_address: The end address of the valid range.
+ *
+ * Return: true if the area is inside the valid range, false otherwise.
+ */
+static inline bool hl_mem_area_inside_range(u64 address, u32 size,
+				u64 range_start_address, u64 range_end_address)
+{
+	u64 end_address = address + size;
+
+	if ((address >= range_start_address) &&
+			(end_address <= range_end_address) &&
+			(end_address > address))
+		return true;
+
+	return false;
+}
+
+/**
+ * hl_mem_area_crosses_range() - Checks whether address+size crosses a range.
+ * @address: The start address of the area we want to validate.
+ * @size: The size in bytes of the area we want to validate.
+ * @range_start_address: The start address of the valid range.
+ * @range_end_address: The end address of the valid range.
+ *
+ * Return: true if the area overlaps part or all of the valid range,
+ *		false otherwise.
+ */
+static inline bool hl_mem_area_crosses_range(u64 address, u32 size,
+				u64 range_start_address, u64 range_end_address)
+{
+	u64 end_address = address + size;
+
+	if ((address >= range_start_address) &&
+			(address < range_end_address))
+		return true;
+
+	if ((end_address >= range_start_address) &&
+			(end_address < range_end_address))
+		return true;
+
+	if ((address < range_start_address) &&
+			(end_address >= range_end_address))
+		return true;
+
+	return false;
+}
+
 int hl_device_open(struct inode *inode, struct file *filp);
 int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 		enum hl_asic_type asic_type, int minor);
@@ -724,8 +976,10 @@ int hl_hw_queues_create(struct hl_device *hdev);
 void hl_hw_queues_destroy(struct hl_device *hdev);
 int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
 				u32 cb_size, u64 cb_ptr);
+int hl_hw_queue_schedule_cs(struct hl_cs *cs);
 u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
 void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
+void hl_int_hw_queue_update_ci(struct hl_cs *cs);
 void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset);
 
 #define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
@@ -739,6 +993,8 @@ void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q);
 void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q);
 irqreturn_t hl_irq_handler_cq(int irq, void *arg);
 irqreturn_t hl_irq_handler_eq(int irq, void *arg);
+u32 hl_cq_inc_ptr(u32 ptr);
+
 int hl_asid_init(struct hl_device *hdev);
 void hl_asid_fini(struct hl_device *hdev);
 unsigned long hl_asid_alloc(struct hl_device *hdev);
@@ -747,9 +1003,13 @@ void hl_asid_free(struct hl_device *hdev, unsigned long asid);
 int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv);
 void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx);
 int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx);
+void hl_ctx_do_release(struct kref *ref);
+void hl_ctx_get(struct hl_device *hdev,	struct hl_ctx *ctx);
 int hl_ctx_put(struct hl_ctx *ctx);
+struct dma_fence *hl_ctx_get_fence(struct hl_ctx *ctx, u64 seq);
 void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr);
 void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr);
+
 int hl_device_init(struct hl_device *hdev, struct class *hclass);
 void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
@@ -781,8 +1041,20 @@ struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size);
 int hl_cb_pool_init(struct hl_device *hdev);
 int hl_cb_pool_fini(struct hl_device *hdev);
 
+void hl_cs_rollback_all(struct hl_device *hdev);
+struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue);
+
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
+			struct hl_userptr *userptr);
+int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr);
+void hl_userptr_delete_list(struct hl_device *hdev,
+				struct list_head *userptr_list);
+bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr, u32 size,
+				struct list_head *userptr_list,
+				struct hl_userptr **userptr);
+
 long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
 void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
 long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
@@ -798,5 +1070,7 @@ void hl_set_max_power(struct hl_device *hdev, u64 value);
 /* IOCTLs */
 long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
+int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data);
+int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data);
 
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 7d101ee0f0f2..fccfa7830121 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -30,6 +30,17 @@ static struct class *hl_class;
 DEFINE_IDR(hl_devs_idr);
 DEFINE_MUTEX(hl_devs_idr_lock);
 
+static int timeout_locked = 5;
+static int reset_on_lockup = 1;
+
+module_param(timeout_locked, int, 0444);
+MODULE_PARM_DESC(timeout_locked,
+	"Device lockup timeout in seconds (0 = disabled, default 5s)");
+
+module_param(reset_on_lockup, int, 0444);
+MODULE_PARM_DESC(reset_on_lockup,
+	"Do device reset on lockup (0 = no, 1 = yes, default yes)");
+
 #define PCI_VENDOR_ID_HABANALABS	0x1da3
 
 #define PCI_IDS_GOYA			0x0001
@@ -120,6 +131,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	hpriv->hdev = hdev;
 	filp->private_data = hpriv;
 	hpriv->filp = filp;
+	mutex_init(&hpriv->restore_phase_mutex);
 	kref_init(&hpriv->refcount);
 	nonseekable_open(inode, filp);
 
@@ -147,6 +159,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	filp->private_data = NULL;
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
 	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
+	mutex_destroy(&hpriv->restore_phase_mutex);
 	kfree(hpriv);
 
 close_device:
@@ -186,8 +199,10 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	}
 
 	hdev->major = hl_major;
+	hdev->reset_on_lockup = reset_on_lockup;
 
 	/* Parameters for bring-up - set them to defaults */
+	hdev->mmu_enable = 0;
 	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
 	hdev->config_pll = 0;
@@ -209,6 +224,14 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	if (!hdev->cpu_queues_enable)
 		hdev->heartbeat = 0;
 
+	if (hdev->ifh)
+		timeout_locked = 0;
+
+	if (timeout_locked)
+		hdev->timeout_jiffies = msecs_to_jiffies(timeout_locked * 1000);
+	else
+		hdev->timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index fa2287569e0e..f6969d6dba9c 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -16,7 +16,9 @@
 	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
 
 static const struct hl_ioctl_desc hl_ioctls[] = {
-	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl)
+	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
+	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
+	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl)
 };
 
 #define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
index 65102a5bc2ca..a414e3775d3e 100644
--- a/drivers/misc/habanalabs/hw_queue.c
+++ b/drivers/misc/habanalabs/hw_queue.c
@@ -37,6 +37,29 @@ static inline int queue_free_slots(struct hl_hw_queue *q, u32 queue_len)
 		return (abs(delta) - queue_len);
 }
 
+void hl_int_hw_queue_update_ci(struct hl_cs *cs)
+{
+	struct hl_device *hdev = cs->ctx->hdev;
+	struct hl_hw_queue *q;
+	int i;
+
+	hdev->asic_funcs->hw_queues_lock(hdev);
+
+	if (hdev->disabled)
+		goto out;
+
+	q = &hdev->kernel_queues[0];
+	for (i = 0 ; i < HL_MAX_QUEUES ; i++, q++) {
+		if (q->queue_type == QUEUE_TYPE_INT) {
+			q->ci += cs->jobs_in_queue_cnt[i];
+			q->ci &= ((q->int_queue_len << 1) - 1);
+		}
+	}
+
+out:
+	hdev->asic_funcs->hw_queues_unlock(hdev);
+}
+
 /**
  * ext_queue_submit_bd - Submit a buffer descriptor to an external queue
  *
@@ -122,6 +145,37 @@ static int ext_queue_sanity_checks(struct hl_device *hdev,
 	return 0;
 }
 
+/**
+ * int_queue_sanity_checks - perform some sanity checks on internal queue
+ *
+ * @hdev              : pointer to hl_device structure
+ * @q                 :	pointer to hl_hw_queue structure
+ * @num_of_entries    : how many entries to check for space
+ *
+ * H/W queues spinlock should be taken before calling this function
+ *
+ * Perform the following:
+ * - Make sure we have enough space in the h/w queue
+ *
+ */
+static int int_queue_sanity_checks(struct hl_device *hdev,
+					struct hl_hw_queue *q,
+					int num_of_entries)
+{
+	int free_slots_cnt;
+
+	/* Check we have enough space in the queue */
+	free_slots_cnt = queue_free_slots(q, q->int_queue_len);
+
+	if (free_slots_cnt < num_of_entries) {
+		dev_dbg(hdev->dev, "Queue %d doesn't have room for %d CBs\n",
+			q->hw_queue_id, num_of_entries);
+		return -EAGAIN;
+	}
+
+	return 0;
+}
+
 /**
  * hl_hw_queue_send_cb_no_cmpl - send a single CB (not a JOB) without completion
  *
@@ -168,6 +222,202 @@ int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
 	return rc;
 }
 
+/**
+ * ext_hw_queue_schedule_job - submit a JOB to an external queue
+ *
+ * @job: pointer to the job that needs to be submitted to the queue
+ *
+ * This function must be called when the scheduler mutex is taken
+ *
+ */
+static void ext_hw_queue_schedule_job(struct hl_cs_job *job)
+{
+	struct hl_device *hdev = job->cs->ctx->hdev;
+	struct hl_hw_queue *q = &hdev->kernel_queues[job->hw_queue_id];
+	struct hl_cq_entry cq_pkt;
+	struct hl_cq *cq;
+	u64 cq_addr;
+	struct hl_cb *cb;
+	u32 ctl;
+	u32 len;
+	u64 ptr;
+
+	/*
+	 * Update the JOB ID inside the BD CTL so the device would know what
+	 * to write in the completion queue
+	 */
+	ctl = ((q->pi << BD_CTL_SHADOW_INDEX_SHIFT) & BD_CTL_SHADOW_INDEX_MASK);
+
+	cb = job->patched_cb;
+	len = job->job_cb_size;
+	ptr = cb->bus_address;
+
+	cq_pkt.data = (q->pi << CQ_ENTRY_SHADOW_INDEX_SHIFT)
+					& CQ_ENTRY_SHADOW_INDEX_MASK;
+	cq_pkt.data |= 1 << CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT;
+	cq_pkt.data |= 1 << CQ_ENTRY_READY_SHIFT;
+
+	/*
+	 * No need to protect pi_offset because scheduling to the
+	 * H/W queues is done under the scheduler mutex
+	 *
+	 * No need to check if CQ is full because it was already
+	 * checked in ext_queue_sanity_checks
+	 */
+	cq = &hdev->completion_queue[q->hw_queue_id];
+	cq_addr = cq->bus_address +
+			hdev->asic_prop.host_phys_base_address;
+	cq_addr += cq->pi * sizeof(struct hl_cq_entry);
+
+	hdev->asic_funcs->add_end_of_cb_packets(cb->kernel_address, len,
+				cq_addr, cq_pkt.data, q->hw_queue_id);
+
+	q->shadow_queue[hl_pi_2_offset(q->pi)] = job;
+
+	if (hdev->ifh) {
+		u32 *cq_kernel_addr = (u32 *) cq->kernel_address;
+
+		cq_kernel_addr[cq->pi] = cq_pkt.data;
+	}
+
+	cq->pi = hl_cq_inc_ptr(cq->pi);
+
+	ext_queue_submit_bd(hdev, q, ctl, len, ptr);
+}
+
+/**
+ * int_hw_queue_schedule_job - submit a JOB to an internal queue
+ *
+ * @job: pointer to the job that needs to be submitted to the queue
+ *
+ * This function must be called when the scheduler mutex is taken
+ *
+ */
+static void int_hw_queue_schedule_job(struct hl_cs_job *job)
+{
+	struct hl_device *hdev = job->cs->ctx->hdev;
+	struct hl_hw_queue *q = &hdev->kernel_queues[job->hw_queue_id];
+	struct hl_bd bd;
+	u64 *pi, *pbd = (u64 *) &bd;
+
+	bd.ctl = 0;
+	bd.len = job->job_cb_size;
+	bd.ptr = (u64) job->user_cb;
+
+	pi = (u64 *) (q->kernel_address +
+		((q->pi & (q->int_queue_len - 1)) * sizeof(bd)));
+
+	pi[0] = pbd[0];
+	pi[1] = pbd[1];
+
+	q->pi++;
+	q->pi &= ((q->int_queue_len << 1) - 1);
+
+	/* Flush PQ entry write. Relevant only for specific ASICs */
+	hdev->asic_funcs->flush_pq_write(hdev, pi, pbd[0]);
+
+	hdev->asic_funcs->ring_doorbell(hdev, q->hw_queue_id, q->pi);
+}
+
+/**
+ * hl_hw_queue_schedule_cs - schedule a command submission
+ *
+ * @cs         : pointer to the CS
+ *
+ */
+int hl_hw_queue_schedule_cs(struct hl_cs *cs)
+{
+	struct hl_device *hdev = cs->ctx->hdev;
+	struct hl_cs_job *job, *tmp;
+	struct hl_hw_queue *q;
+	int rc = 0, i, cq_cnt;
+
+	hdev->asic_funcs->hw_queues_lock(hdev);
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
+		dev_err(hdev->dev,
+			"device is disabled or in reset, CS rejected!\n");
+		rc = -EPERM;
+		goto out;
+	}
+
+	q = &hdev->kernel_queues[0];
+	/* This loop assumes all external queues are consecutive */
+	for (i = 0, cq_cnt = 0 ; i < HL_MAX_QUEUES ; i++, q++) {
+		if (q->queue_type == QUEUE_TYPE_EXT) {
+			if (cs->jobs_in_queue_cnt[i]) {
+				rc = ext_queue_sanity_checks(hdev, q,
+					cs->jobs_in_queue_cnt[i], true);
+				if (rc)
+					goto unroll_cq_resv;
+				cq_cnt++;
+			}
+		} else if (q->queue_type == QUEUE_TYPE_INT) {
+			if (cs->jobs_in_queue_cnt[i]) {
+				rc = int_queue_sanity_checks(hdev, q,
+					cs->jobs_in_queue_cnt[i]);
+				if (rc)
+					goto unroll_cq_resv;
+			}
+		}
+	}
+
+	spin_lock(&hdev->hw_queues_mirror_lock);
+	list_add_tail(&cs->mirror_node, &hdev->hw_queues_mirror_list);
+
+	/* Queue TDR if the CS is the first entry and if timeout is wanted */
+	if ((hdev->timeout_jiffies != MAX_SCHEDULE_TIMEOUT) &&
+			(list_first_entry(&hdev->hw_queues_mirror_list,
+					struct hl_cs, mirror_node) == cs)) {
+		cs->tdr_active = true;
+		schedule_delayed_work(&cs->work_tdr, hdev->timeout_jiffies);
+		spin_unlock(&hdev->hw_queues_mirror_lock);
+	} else {
+		spin_unlock(&hdev->hw_queues_mirror_lock);
+	}
+
+	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node) {
+		if (job->ext_queue)
+			ext_hw_queue_schedule_job(job);
+		else
+			int_hw_queue_schedule_job(job);
+	}
+
+	cs->submitted = true;
+
+	goto out;
+
+unroll_cq_resv:
+	/* This loop assumes all external queues are consecutive */
+	q = &hdev->kernel_queues[0];
+	for (i = 0 ; (i < HL_MAX_QUEUES) && (cq_cnt > 0) ; i++, q++) {
+		if ((q->queue_type == QUEUE_TYPE_EXT) &&
+				(cs->jobs_in_queue_cnt[i])) {
+			atomic_t *free_slots =
+				&hdev->completion_queue[i].free_slots_cnt;
+			atomic_add(cs->jobs_in_queue_cnt[i], free_slots);
+			cq_cnt--;
+		}
+	}
+
+out:
+	if ((cs->submitted) && (hdev->ifh)) {
+		list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node) {
+			struct hl_cq *cq;
+
+			if (!job->ext_queue)
+				continue;
+
+			cq = &hdev->completion_queue[job->hw_queue_id];
+			hl_irq_handler_cq(cq->hw_queue_id, cq);
+		}
+	}
+
+	hdev->asic_funcs->hw_queues_unlock(hdev);
+
+	return rc;
+}
+
 /**
  * hl_hw_queue_inc_ci_kernel - increment ci for kernel's queue
  *
diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
new file mode 100644
index 000000000000..94cbb252656d
--- /dev/null
+++ b/drivers/misc/habanalabs/memory.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/genalloc.h>
+
+/**
+ * hl_pin_host_memory - pins a chunk of host memory
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @addr                : the user-space virtual address of the memory area
+ * @size                : the size of the memory area
+ * @userptr	        : pointer to hl_userptr structure
+ *
+ * This function does the following:
+ * - Pins the physical pages
+ * - Create a SG list from those pages
+ */
+int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
+			struct hl_userptr *userptr)
+{
+	u64 start, end;
+	u32 npages, offset;
+	int rc;
+
+	if (!size) {
+		dev_err(hdev->dev, "size to pin is invalid - %u\n",
+			size);
+		return -EINVAL;
+	}
+
+	if (!access_ok((void __user *) addr, size)) {
+		dev_err(hdev->dev, "user pointer is invalid - 0x%llx\n",
+			addr);
+		return -EFAULT;
+	}
+
+	/*
+	 * If the combination of the address and size requested for this memory
+	 * region causes an integer overflow, return error.
+	 */
+	if (((addr + size) < addr) ||
+			PAGE_ALIGN(addr + size) < (addr + size)) {
+		dev_err(hdev->dev,
+			"user pointer 0x%llx + %u causes integer overflow\n",
+			addr, size);
+		return -EINVAL;
+	}
+
+	start = addr & PAGE_MASK;
+	offset = addr & ~PAGE_MASK;
+	end = PAGE_ALIGN(addr + size);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	userptr->size = size;
+	userptr->addr = addr;
+	userptr->dma_mapped = false;
+	INIT_LIST_HEAD(&userptr->job_node);
+
+	userptr->vec = frame_vector_create(npages);
+	if (!userptr->vec) {
+		dev_err(hdev->dev, "Failed to create frame vector\n");
+		return -ENOMEM;
+	}
+
+	rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
+				userptr->vec);
+
+	if (rc != npages) {
+		dev_err(hdev->dev,
+			"Failed to map host memory, user ptr probably wrong\n");
+		if (rc < 0)
+			goto destroy_framevec;
+		rc = -EFAULT;
+		goto put_framevec;
+	}
+
+	if (frame_vector_to_pages(userptr->vec) < 0) {
+		dev_err(hdev->dev,
+			"Failed to translate frame vector to pages\n");
+		rc = -EFAULT;
+		goto put_framevec;
+	}
+
+	userptr->sgt = kzalloc(sizeof(*userptr->sgt), GFP_ATOMIC);
+	if (!userptr->sgt) {
+		rc = -ENOMEM;
+		goto put_framevec;
+	}
+
+	rc = sg_alloc_table_from_pages(userptr->sgt,
+					frame_vector_pages(userptr->vec),
+					npages, offset, size, GFP_ATOMIC);
+	if (rc < 0) {
+		dev_err(hdev->dev, "failed to create SG table from pages\n");
+		goto free_sgt;
+	}
+
+	return 0;
+
+free_sgt:
+	kfree(userptr->sgt);
+put_framevec:
+	put_vaddr_frames(userptr->vec);
+destroy_framevec:
+	frame_vector_destroy(userptr->vec);
+	return rc;
+}
+
+/**
+ * hl_unpin_host_memory - unpins a chunk of host memory
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @userptr             : pointer to hl_userptr structure
+ *
+ * This function does the following:
+ * - Unpins the physical pages related to the host memory
+ * - Free the SG list
+ */
+int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
+{
+	struct page **pages;
+
+	if (userptr->dma_mapped)
+		hdev->asic_funcs->hl_dma_unmap_sg(hdev,
+				userptr->sgt->sgl,
+				userptr->sgt->nents,
+				userptr->dir);
+
+	pages = frame_vector_pages(userptr->vec);
+	if (!IS_ERR(pages)) {
+		int i;
+
+		for (i = 0; i < frame_vector_count(userptr->vec); i++)
+			set_page_dirty_lock(pages[i]);
+	}
+	put_vaddr_frames(userptr->vec);
+	frame_vector_destroy(userptr->vec);
+
+	list_del(&userptr->job_node);
+
+	sg_free_table(userptr->sgt);
+	kfree(userptr->sgt);
+
+	return 0;
+}
+
+/**
+ * hl_userptr_delete_list - clear userptr list
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @userptr_list        : pointer to the list to clear
+ *
+ * This function does the following:
+ * - Iterates over the list and unpins the host memory and frees the userptr
+ *   structure.
+ */
+void hl_userptr_delete_list(struct hl_device *hdev,
+				struct list_head *userptr_list)
+{
+	struct hl_userptr *userptr, *tmp;
+
+	list_for_each_entry_safe(userptr, tmp, userptr_list, job_node) {
+		hl_unpin_host_memory(hdev, userptr);
+		kfree(userptr);
+	}
+
+	INIT_LIST_HEAD(userptr_list);
+}
+
+/**
+ * hl_userptr_is_pinned - returns whether the given userptr is pinned
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @userptr_list        : pointer to the list to search
+ * @userptr             : pointer to userptr to check
+ *
+ * This function does the following:
+ * - Iterates over the list and checks whether the given userptr is in it,
+ *   i.e. is pinned. If so, returns true, otherwise returns false.
+ */
+bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr,
+				u32 size, struct list_head *userptr_list,
+				struct hl_userptr **userptr)
+{
+	list_for_each_entry((*userptr), userptr_list, job_node) {
+		if ((addr == (*userptr)->addr) && (size == (*userptr)->size))
+			return true;
+	}
+
+	return false;
+}
+
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index b3f9213d4709..369438dbc9c3 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -40,6 +40,95 @@ union hl_cb_args {
 	struct hl_cb_out out;
 };
 
+/*
+ * This structure size must always be fixed to 64-bytes for backward
+ * compatibility
+ */
+struct hl_cs_chunk {
+	/*
+	 * For external queue, this represents a Handle of CB on the Host
+	 * For internal queue, this represents an SRAM or DRAM address of the
+	 * internal CB
+	 */
+	__u64 cb_handle;
+	/* Index of queue to put the CB on */
+	__u32 queue_index;
+	/*
+	 * Size of command buffer with valid packets
+	 * Can be smaller than actual CB size
+	 */
+	__u32 cb_size;
+	/* HL_CS_CHUNK_FLAGS_* */
+	__u32 cs_chunk_flags;
+	/* Align structure to 64 bytes */
+	__u32 pad[11];
+};
+
+#define HL_CS_FLAGS_FORCE_RESTORE	0x1
+
+#define HL_CS_STATUS_SUCCESS		0
+
+struct hl_cs_in {
+	/* this holds address of array of hl_cs_chunk for restore phase */
+	__u64 chunks_restore;
+	/* this holds address of array of hl_cs_chunk for execution phase */
+	__u64 chunks_execute;
+	/* this holds address of array of hl_cs_chunk for store phase -
+	 * Currently not in use
+	 */
+	__u64 chunks_store;
+	/* Number of chunks in restore phase array */
+	__u32 num_chunks_restore;
+	/* Number of chunks in execution array */
+	__u32 num_chunks_execute;
+	/* Number of chunks in store phase array - Currently not in use */
+	__u32 num_chunks_store;
+	/* HL_CS_FLAGS_* */
+	__u32 cs_flags;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+};
+
+struct hl_cs_out {
+	/* this holds the sequence number of the CS to pass to wait ioctl */
+	__u64 seq;
+	/* HL_CS_STATUS_* */
+	__u32 status;
+	__u32 pad;
+};
+
+union hl_cs_args {
+	struct hl_cs_in in;
+	struct hl_cs_out out;
+};
+
+struct hl_wait_cs_in {
+	/* Command submission sequence number */
+	__u64 seq;
+	/* Absolute timeout to wait in microseconds */
+	__u64 timeout_us;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
+#define HL_WAIT_CS_STATUS_COMPLETED	0
+#define HL_WAIT_CS_STATUS_BUSY		1
+#define HL_WAIT_CS_STATUS_TIMEDOUT	2
+#define HL_WAIT_CS_STATUS_ABORTED	3
+#define HL_WAIT_CS_STATUS_INTERRUPTED	4
+
+struct hl_wait_cs_out {
+	/* HL_WAIT_CS_STATUS_* */
+	__u32 status;
+	__u32 pad;
+};
+
+union hl_wait_cs_args {
+	struct hl_wait_cs_in in;
+	struct hl_wait_cs_out out;
+};
+
 /*
  * Command Buffer
  * - Request a Command Buffer
@@ -56,7 +145,74 @@ union hl_cb_args {
 #define HL_IOCTL_CB		\
 		_IOWR('H', 0x02, union hl_cb_args)
 
+/*
+ * Command Submission
+ *
+ * To submit work to the device, the user needs to call this IOCTL with a set
+ * of JOBS. That set of JOBS constitutes a CS object.
+ * Each JOB will be enqueued on a specific queue, according to the user's input.
+ * There can be more than one JOB per queue.
+ *
+ * There are two types of queues - external and internal. External queues
+ * are DMA queues which transfer data from/to the Host. All other queues are
+ * internal. The driver will get completion notifications from the device only
+ * on JOBS which are enqueued in the external queues.
+ *
+ * This IOCTL is asynchronous in regard to the actual execution of the CS. This
+ * means it returns immediately after ALL the JOBS were enqueued on their
+ * relevant queues. Therefore, the user mustn't assume the CS has been completed
+ * or has even started to execute.
+ *
+ * Upon successful enqueue, the IOCTL returns an opaque handle which the user
+ * can use with the "Wait for CS" IOCTL to check whether the handle's CS
+ * external JOBS have been completed. Note that if the CS has internal JOBS
+ * which can execute AFTER the external JOBS have finished, the driver might
+ * report that the CS has finished executing BEFORE the internal JOBS have
+ * actually finished executing.
+ *
+ * The CS IOCTL will receive three sets of JOBS. One set is for "restore" phase,
+ * a second set is for "execution" phase and a third set is for "store" phase.
+ * The JOBS on the "restore" phase are enqueued only after context-switch
+ * (or if it's the first CS for this context). The user can also order the
+ * driver to run the "restore" phase explicitly.
+ *
+ */
+#define HL_IOCTL_CS			\
+		_IOWR('H', 0x03, union hl_cs_args)
+
+/*
+ * Wait for Command Submission
+ *
+ * The user can call this IOCTL with a handle it received from the CS IOCTL
+ * to wait until the handle's CS has finished executing. The user will wait
+ * inside the kernel until the CS has finished or until the user-requested
+ * timeout has expired.
+ *
+ * The return value of the IOCTL is a standard Linux error code. The possible
+ * values are:
+ *
+ * EINTR     - Kernel waiting has been interrupted, e.g. due to OS signal
+ *             that the user process received
+ * ETIMEDOUT - The CS has caused a timeout on the device
+ * EIO       - The CS was aborted (usually because the device was reset)
+ * ENODEV    - The device wants to do hard-reset (so user needs to close FD)
+ *
+ * The driver also returns a custom define inside the IOCTL which can be:
+ *
+ * HL_WAIT_CS_STATUS_COMPLETED   - The CS has been completed successfully (0)
+ * HL_WAIT_CS_STATUS_BUSY        - The CS is still executing (0)
+ * HL_WAIT_CS_STATUS_TIMEDOUT    - The CS has caused a timeout on the device
+ *                                 (ETIMEDOUT)
+ * HL_WAIT_CS_STATUS_ABORTED     - The CS was aborted, usually because the
+ *                                 device was reset (EIO)
+ * HL_WAIT_CS_STATUS_INTERRUPTED - Waiting for the CS was interrupted (EINTR)
+ *
+ */
+
+#define HL_IOCTL_WAIT_CS			\
+		_IOWR('H', 0x04, union hl_wait_cs_args)
+
 #define HL_COMMAND_START	0x02
-#define HL_COMMAND_END		0x03
+#define HL_COMMAND_END		0x05
 
 #endif /* HABANALABS_H_ */
-- 
2.17.1
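
The uapi comment above requires struct hl_cs_chunk to stay at exactly
64 bytes. A minimal user-space sketch of how that can be checked at
compile time follows. It assumes the field layout shown in the uapi hunk
and substitutes <stdint.h> types for __u64/__u32; the mirror struct and
the helper name are hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of struct hl_cs_chunk from the patch. */
struct hl_cs_chunk_mirror {
	uint64_t cb_handle;      /* CB handle (external) or device address (internal) */
	uint32_t queue_index;    /* index of the queue to put the CB on */
	uint32_t cb_size;        /* size of the valid packets in the CB */
	uint32_t cs_chunk_flags; /* HL_CS_CHUNK_FLAGS_* */
	uint32_t pad[11];        /* pads the structure to 64 bytes */
};

/* Fails the build if a field is added without shrinking the padding. */
_Static_assert(sizeof(struct hl_cs_chunk_mirror) == 64,
	       "hl_cs_chunk must stay 64 bytes");
_Static_assert(offsetof(struct hl_cs_chunk_mirror, pad) == 20,
	       "padding must start right after cs_chunk_flags");

/* Small helper so the size can also be inspected at run time. */
static size_t hl_cs_chunk_mirror_size(void)
{
	return sizeof(struct hl_cs_chunk_mirror);
}
```

With this layout 8 + 4 + 4 + 4 + 11 * 4 = 64, so there is no implicit
compiler padding and the size does not depend on the ABI.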

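The page-span arithmetic in hl_pin_host_memory() (start, offset, end,
npages, plus the two overflow checks) can be tried outside the kernel.
The sketch below is a user-space approximation, not the driver code: it
assumes 4 KiB pages, and the SKETCH_* macros, struct pin_span and
pin_span_compute() are hypothetical names introduced for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u                          /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

struct pin_span {
	uint64_t start;   /* addr rounded down to a page boundary */
	uint32_t offset;  /* offset of addr inside the first page */
	uint32_t npages;  /* number of pages covering [addr, addr + size) */
};

/* Returns 0 on success, -1 on zero size or integer overflow. */
static int pin_span_compute(uint64_t addr, uint32_t size, struct pin_span *out)
{
	uint64_t last, end;

	if (!size)
		return -1;

	last = addr + size;
	if (last < addr)              /* addr + size wrapped around */
		return -1;

	/* Round up to the next page boundary, like PAGE_ALIGN() */
	end = (last + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
	if (end < last)               /* the round-up itself wrapped around */
		return -1;

	out->start = addr & SKETCH_PAGE_MASK;
	out->offset = (uint32_t)(addr & ~SKETCH_PAGE_MASK);
	out->npages = (uint32_t)((end - out->start) >> SKETCH_PAGE_SHIFT);
	return 0;
}
```

For example, a 0x20-byte buffer at 0x1ff0 straddles a page boundary, so
it needs two pages even though it is much smaller than one page.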

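The internal queue code in hw_queue.c wraps pi and ci at twice the queue
length and masks with (len - 1) to find the slot to write. A small
user-space sketch of that scheme is below. It assumes the queue length
is a power of two; struct ring_sketch and the ring_* helpers are
hypothetical, and ring_free() is one possible way to count free slots,
not the driver's queue_free_slots().

```c
#include <stdint.h>

struct ring_sketch {
	uint32_t pi;   /* producer index, counts modulo 2 * len */
	uint32_t ci;   /* consumer index, counts modulo 2 * len */
	uint32_t len;  /* number of slots, assumed to be a power of two */
};

/* Slot that the next entry is written to (index into the ring buffer). */
static uint32_t ring_slot(const struct ring_sketch *r)
{
	return r->pi & (r->len - 1);
}

/* Advance the producer by one entry, wrapping at 2 * len. */
static void ring_advance_pi(struct ring_sketch *r)
{
	r->pi = (r->pi + 1) & ((r->len << 1) - 1);
}

/* Advance the consumer by n completed entries, wrapping at 2 * len. */
static void ring_advance_ci(struct ring_sketch *r, uint32_t n)
{
	r->ci = (r->ci + n) & ((r->len << 1) - 1);
}

/* Entries currently in flight; valid as long as the ring is not overfilled. */
static uint32_t ring_used(const struct ring_sketch *r)
{
	return (r->pi - r->ci) & ((r->len << 1) - 1);
}

static uint32_t ring_free(const struct ring_sketch *r)
{
	return r->len - ring_used(r);
}
```

Counting modulo 2 * len rather than len is what lets a full ring
(pi - ci == len) be told apart from an empty one (pi == ci) without
sacrificing a slot.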

* [PATCH 12/15] habanalabs: add virtual memory and MMU modules
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (9 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 11/15] habanalabs: add command submission module Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-27 16:13   ` Mike Rapoport
  2019-01-23  0:00 ` [PATCH 13/15] habanalabs: implement INFO IOCTL Oded Gabbay
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay, Omer Shpigelman

From: Omer Shpigelman <oshpigelman@habana.ai>

This patch adds the Virtual Memory and MMU modules.

Goya has an internal MMU which provides process isolation on the internal
DDR. The internal MMU also performs translations for transactions that go
from Goya to the Host.

The driver is responsible for allocating and freeing memory on the DDR
upon user request. It also provides an interface to map and unmap DDR and
Host memory to the device address space.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile              |    2 +-
 drivers/misc/habanalabs/context.c             |   19 +-
 drivers/misc/habanalabs/device.c              |   20 +-
 drivers/misc/habanalabs/goya/goya.c           |  391 +++++
 drivers/misc/habanalabs/habanalabs.h          |  195 +++
 drivers/misc/habanalabs/habanalabs_drv.c      |    2 +-
 drivers/misc/habanalabs/habanalabs_ioctl.c    |    3 +-
 drivers/misc/habanalabs/include/goya/goya.h   |    6 +-
 .../include/hw_ip/mmu/mmu_general.h           |   45 +
 .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
 drivers/misc/habanalabs/memory.c              | 1506 +++++++++++++++++
 drivers/misc/habanalabs/mmu.c                 |  604 +++++++
 include/uapi/misc/habanalabs.h                |  122 +-
 13 files changed, 2922 insertions(+), 8 deletions(-)
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
 create mode 100644 drivers/misc/habanalabs/mmu.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index d2fd0e18b1eb..fd46f8b48bab 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -6,7 +6,7 @@ obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
 		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
-		command_submission.o
+		command_submission.o mmu.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
index 2da672113e7a..dc0800a0ac9c 100644
--- a/drivers/misc/habanalabs/context.c
+++ b/drivers/misc/habanalabs/context.c
@@ -26,8 +26,10 @@ static void hl_ctx_fini(struct hl_ctx *ctx)
 	for (i = 0 ; i < HL_MAX_PENDING_CS ; i++)
 		dma_fence_put(ctx->cs_pending[i]);
 
-	if (ctx->asid != HL_KERNEL_ASID_ID)
+	if (ctx->asid != HL_KERNEL_ASID_ID) {
+		hl_vm_ctx_fini(ctx);
 		hl_asid_free(hdev, ctx->asid);
+	}
 }
 
 void hl_ctx_do_release(struct kref *ref)
@@ -97,6 +99,8 @@ void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
 
 int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 {
+	int rc = 0;
+
 	ctx->hdev = hdev;
 
 	kref_init(&ctx->refcount);
@@ -114,9 +118,22 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 			dev_err(hdev->dev, "No free ASID, failed to create context\n");
 			return -ENOMEM;
 		}
+
+		rc = hl_vm_ctx_init(ctx);
+		if (rc) {
+			dev_err(hdev->dev, "Failed to init mem ctx module\n");
+			rc = -ENOMEM;
+			goto mem_ctx_err;
+		}
 	}
 
 	return 0;
+
+mem_ctx_err:
+	if (ctx->asid != HL_KERNEL_ASID_ID)
+		hl_asid_free(hdev, ctx->asid);
+
+	return rc;
 }
 
 void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index a47e00fe5ccf..1f7340551386 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -585,8 +585,10 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, hard_reset);
 
-	if (hard_reset)
+	if (hard_reset) {
+		hl_vm_fini(hdev);
 		hl_eq_reset(hdev, &hdev->event_queue);
+	}
 
 	/* Re-initialize PI,CI to 0 in all queues (hw queue, cq) */
 	hl_hw_queue_reset(hdev, hard_reset);
@@ -647,6 +649,13 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 			goto out_err;
 		}
 
+		rc = hl_vm_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed to init memory module after hard reset\n");
+			goto out_err;
+		}
+
 		hl_set_max_power(hdev, hdev->max_power);
 
 		hdev->hard_reset_pending = false;
@@ -828,6 +837,13 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		hdev->asic_name,
 		hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
 
+	rc = hl_vm_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize memory module\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
 	rc = hl_hwmon_init(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "Failed to initialize hwmon\n");
@@ -941,6 +957,8 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, true);
 
+	hl_vm_fini(hdev);
+
 	hl_eq_fini(hdev, &hdev->event_queue);
 
 	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index e3867615b974..94ee4cb00a49 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -6,6 +6,8 @@
  */
 
 #include "goyaP.h"
+#include "include/hw_ip/mmu/mmu_general.h"
+#include "include/hw_ip/mmu/mmu_v1_0.h"
 #include "include/goya/asic_reg/goya_masks.h"
 
 #include <linux/fs.h>
@@ -87,6 +89,7 @@
 #define GOYA_PLDM_RESET_WAIT_MSEC	1000		/* 1s */
 #define GOYA_CPU_TIMEOUT_USEC		10000000	/* 10s */
 #define GOYA_TEST_QUEUE_WAIT_USEC	100000		/* 100ms */
+#define GOYA_PLDM_MMU_TIMEOUT_USEC	(MMU_CONFIG_TIMEOUT_USEC * 100)
 
 #define GOYA_QMAN0_FENCE_VAL		0xD169B243
 
@@ -138,6 +141,70 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 	"MMU"
 };
 
+static u64 goya_mmu_regs[GOYA_MMU_REGS_NUM] = {
+	mmDMA_QM_0_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_1_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_2_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_3_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_4_GLBL_NON_SECURE_PROPS,
+	mmTPC0_QM_GLBL_SECURE_PROPS,
+	mmTPC0_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC0_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC0_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC0_CFG_ARUSER,
+	mmTPC0_CFG_AWUSER,
+	mmTPC1_QM_GLBL_SECURE_PROPS,
+	mmTPC1_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC1_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC1_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC1_CFG_ARUSER,
+	mmTPC1_CFG_AWUSER,
+	mmTPC2_QM_GLBL_SECURE_PROPS,
+	mmTPC2_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC2_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC2_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC2_CFG_ARUSER,
+	mmTPC2_CFG_AWUSER,
+	mmTPC3_QM_GLBL_SECURE_PROPS,
+	mmTPC3_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC3_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC3_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC3_CFG_ARUSER,
+	mmTPC3_CFG_AWUSER,
+	mmTPC4_QM_GLBL_SECURE_PROPS,
+	mmTPC4_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC4_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC4_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC4_CFG_ARUSER,
+	mmTPC4_CFG_AWUSER,
+	mmTPC5_QM_GLBL_SECURE_PROPS,
+	mmTPC5_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC5_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC5_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC5_CFG_ARUSER,
+	mmTPC5_CFG_AWUSER,
+	mmTPC6_QM_GLBL_SECURE_PROPS,
+	mmTPC6_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC6_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC6_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC6_CFG_ARUSER,
+	mmTPC6_CFG_AWUSER,
+	mmTPC7_QM_GLBL_SECURE_PROPS,
+	mmTPC7_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC7_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC7_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC7_CFG_ARUSER,
+	mmTPC7_CFG_AWUSER,
+	mmMME_QM_GLBL_SECURE_PROPS,
+	mmMME_QM_GLBL_NON_SECURE_PROPS,
+	mmMME_CMDQ_GLBL_SECURE_PROPS,
+	mmMME_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmMME_SBA_CONTROL_DATA,
+	mmMME_SBB_CONTROL_DATA,
+	mmMME_SBC_CONTROL_DATA,
+	mmMME_WBC_CONTROL_DATA
+};
+
 #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
 
 static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
@@ -265,6 +332,10 @@ static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
 };
 
 static int goya_armcp_info_get(struct hl_device *hdev);
+static void goya_mmu_prepare(struct hl_device *hdev, u32 asid);
+static int goya_mmu_clear_pgt_range(struct hl_device *hdev);
+static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
+					u64 phys_addr);
 
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
@@ -303,6 +374,16 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->sram_user_base_address = prop->sram_base_address +
 						SRAM_USER_BASE_OFFSET;
 
+	prop->mmu_pgt_addr = MMU_PAGE_TABLES_ADDR;
+	if (hdev->pldm)
+		prop->mmu_pgt_size = 0x800000; /* 8MB */
+	else
+		prop->mmu_pgt_size = MMU_PAGE_TABLES_SIZE;
+	prop->mmu_pte_size = PTE_SIZE;
+	prop->mmu_hop_table_size = HOP_TABLE_SIZE;
+	prop->mmu_hop0_tables_total_size = HOP0_TABLES_TOTAL_SIZE;
+	prop->dram_page_size = PAGE_SIZE_2MB;
+
 	prop->host_phys_base_address = HOST_PHYS_BASE;
 	prop->va_space_host_start_address = VA_HOST_SPACE_START;
 	prop->va_space_host_end_address = VA_HOST_SPACE_END;
@@ -750,7 +831,18 @@ static int goya_late_init(struct hl_device *hdev)
 
 	goya_fetch_psoc_frequency(hdev);
 
+	rc = goya_mmu_clear_pgt_range(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to clear MMU page tables range\n");
+		goto disable_pci_access;
+	}
+
 	return 0;
+
+disable_pci_access:
+	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+
+	return rc;
 }
 
 /**
@@ -3532,6 +3624,54 @@ static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
 	return 0;
 }
 
+static int goya_mmu_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+	u64 hop0_addr;
+	int rc, i;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU)
+		return 0;
+
+	hdev->dram_supports_virtual_memory = true;
+
+	for (i = 0 ; i < prop->max_asid ; i++) {
+		hop0_addr = prop->mmu_pgt_addr +
+				(i * prop->mmu_hop_table_size);
+
+		rc = goya_mmu_update_asid_hop0_addr(hdev, i, hop0_addr);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to set hop0 addr for asid %d\n", i);
+			goto err;
+		}
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_MMU;
+
+	/* init MMU cache manage page */
+	WREG32(mmSTLB_CACHE_INV_BASE_39_8, MMU_CACHE_MNG_ADDR >> 8);
+	WREG32(mmSTLB_CACHE_INV_BASE_49_40, MMU_CACHE_MNG_ADDR >> 40);
+
+	/* Remove follower feature due to performance bug */
+	WREG32_AND(mmSTLB_STLB_FEATURE_EN,
+			(~STLB_STLB_FEATURE_EN_FOLLOWER_EN_MASK));
+
+	hdev->asic_funcs->mmu_invalidate_cache(hdev, true);
+
+	WREG32(mmMMU_MMU_ENABLE, 1);
+	WREG32(mmMMU_SPI_MASK, 0xF);
+
+	return 0;
+
+err:
+	return rc;
+}
+
 /**
  * goya_hw_init - Goya hardware initialization code
  *
@@ -3580,6 +3720,10 @@ static int goya_hw_init(struct hl_device *hdev)
 		return rc;
 	}
 
+	rc = goya_mmu_init(hdev);
+	if (rc)
+		return rc;
+
 	goya_init_security(hdev);
 
 	goya_init_dma_qmans(hdev);
@@ -5191,6 +5335,10 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
 
 	rc = goya_send_job_on_qman0(hdev, job);
 
+	/* no point in setting the asid in case of failure */
+	if (!rc)
+		goya_mmu_prepare(hdev, asid);
+
 	job->patched_cb->cs_cnt--;
 	hl_cb_put(job->patched_cb);
 
@@ -5226,6 +5374,22 @@ void goya_restore_phase_topology(struct hl_device *hdev)
 	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
 }
 
+static u64 goya_read_pte(struct hl_device *hdev, u64 addr)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	return readq(hdev->pcie_bar[DDR_BAR_ID] +
+			(addr - goya->ddr_bar_cur_addr));
+}
+
+static void goya_write_pte(struct hl_device *hdev, u64 addr, u64 val)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	writeq(val, hdev->pcie_bar[DDR_BAR_ID] +
+			(addr - goya->ddr_bar_cur_addr));
+}
+
 static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
 		u16 event_type, char *axi_name, int len)
 {
@@ -5571,6 +5735,229 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
 	return goya->events_stat;
 }
 
+static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+	struct packet_lin_dma *clear_pgt_range_pkt;
+	struct hl_cs_parser parser;
+	struct hl_cs_job *job;
+	u32 cb_size;
+	struct hl_cb *cb;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return 0;
+
+	cb = hl_cb_kernel_create(hdev, PAGE_SIZE);
+	if (!cb)
+		return -EFAULT;
+
+	clear_pgt_range_pkt = (struct packet_lin_dma *) cb->kernel_address;
+	memset(clear_pgt_range_pkt, 0, sizeof(*clear_pgt_range_pkt));
+	cb_size = sizeof(*clear_pgt_range_pkt);
+
+	clear_pgt_range_pkt->opcode = PACKET_LIN_DMA;
+	clear_pgt_range_pkt->src_addr = 0;
+	clear_pgt_range_pkt->dst_addr = prop->mmu_pgt_addr;
+	clear_pgt_range_pkt->dma_dir = DMA_HOST_TO_DRAM;
+	clear_pgt_range_pkt->tsize = prop->mmu_pgt_size + MMU_CACHE_MNG_SIZE;
+	clear_pgt_range_pkt->weakly_ordered = 1;
+	clear_pgt_range_pkt->reg_barrier = 1;
+	clear_pgt_range_pkt->msg_barrier = 1;
+	clear_pgt_range_pkt->memset_mode = 1;
+
+	job = hl_cs_allocate_job(hdev, true);
+	if (!job) {
+		dev_err(hdev->dev, "Failed to allocate a new job\n");
+		rc = -ENOMEM;
+		goto release_cb;
+	}
+
+	job->id = 0;
+	job->user_cb = cb;
+	job->user_cb->cs_cnt++;
+	job->user_cb_size = cb_size;
+	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
+
+	parser.ctx_id = HL_KERNEL_ASID_ID;
+	parser.cs_sequence = 0;
+	parser.job_id = job->id;
+	parser.hw_queue_id = job->hw_queue_id;
+	parser.job_userptr_list = &job->userptr_list;
+	parser.user_cb = job->user_cb;
+	parser.user_cb_size = job->user_cb_size;
+	parser.ext_queue = job->ext_queue;
+	parser.use_virt_addr = hdev->mmu_enable;
+
+	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to parse kernel CB when clearing pgt\n");
+		goto free_job;
+	}
+
+	job->patched_cb = parser.patched_cb;
+	job->job_cb_size = parser.patched_cb_size;
+	job->patched_cb->cs_cnt++;
+
+	rc = goya_send_job_on_qman0(hdev, job);
+
+	job->patched_cb->cs_cnt--;
+	hl_cb_put(job->patched_cb);
+
+free_job:
+	hl_userptr_delete_list(hdev, &job->userptr_list);
+	kfree(job);
+	cb->cs_cnt--;
+
+release_cb:
+	hl_cb_put(cb);
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb->id << PAGE_SHIFT);
+
+	return rc;
+}
+
+static void goya_mmu_prepare(struct hl_device *hdev, u32 asid)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return;
+
+	if (asid & ~MME_QM_GLBL_SECURE_PROPS_ASID_MASK) {
+		WARN(1, "asid %u is too big\n", asid);
+		return;
+	}
+
+	/* zero the MMBP and ASID bits and then set the ASID */
+	for (i = 0 ; i < GOYA_MMU_REGS_NUM ; i++) {
+		WREG32_AND(goya_mmu_regs[i], ~0x7FF);
+		WREG32_OR(goya_mmu_regs[i], asid);
+	}
+}
+
+static void goya_mmu_invalidate_cache(struct hl_device *hdev, bool is_hard)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 status, timeout_usec;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return;
+
+	/* no need for L1-only invalidation in Goya */
+	if (!is_hard)
+		return;
+
+	if (hdev->pldm)
+		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
+	else
+		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
+
+	mutex_lock(&hdev->mmu_cache_lock);
+
+	/* L0 & L1 invalidation */
+	WREG32(mmSTLB_INV_ALL_START, 1);
+
+	rc = hl_poll_timeout(
+		hdev,
+		mmSTLB_INV_ALL_START,
+		status,
+		!status,
+		1000,
+		timeout_usec);
+
+	mutex_unlock(&hdev->mmu_cache_lock);
+
+	if (rc)
+		dev_notice_ratelimited(hdev->dev,
+			"Timeout when waiting for MMU cache invalidation\n");
+}
+
+static void goya_mmu_invalidate_cache_range(struct hl_device *hdev,
+		bool is_hard, u32 asid, u64 va, u64 size)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 status, timeout_usec, inv_data, pi;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return;
+
+	/* no need for L1-only invalidation in Goya */
+	if (!is_hard)
+		return;
+
+	if (hdev->pldm)
+		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
+	else
+		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
+
+	mutex_lock(&hdev->mmu_cache_lock);
+
+	/*
+	 * TODO: currently invalidate entire L0 & L1 as in regular hard
+	 * invalidation. Need to apply invalidation of specific cache lines with
+	 * mask of ASID & VA & size.
+	 * Note that L1 will be flushed entirely in any case.
+	 */
+
+	/* L0 & L1 invalidation */
+	inv_data = RREG32(mmSTLB_CACHE_INV);
+	/* PI is 8 bit */
+	pi = ((inv_data & STLB_CACHE_INV_PRODUCER_INDEX_MASK) + 1) & 0xFF;
+	WREG32(mmSTLB_CACHE_INV,
+			(inv_data & STLB_CACHE_INV_INDEX_MASK_MASK) | pi);
+
+	rc = hl_poll_timeout(
+		hdev,
+		mmSTLB_INV_CONSUMER_INDEX,
+		status,
+		status == pi,
+		1000,
+		timeout_usec);
+
+	mutex_unlock(&hdev->mmu_cache_lock);
+
+	if (rc)
+		dev_notice_ratelimited(hdev->dev,
+			"Timeout when waiting for MMU cache invalidation\n");
+}
+
+static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
+						u64 phys_addr)
+{
+	u32 status, timeout_usec;
+	int rc;
+
+	if (hdev->pldm)
+		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
+	else
+		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
+
+	WREG32(MMU_HOP0_PA43_12, phys_addr >> MMU_HOP0_PA43_12_SHIFT);
+	WREG32(MMU_HOP0_PA49_44, phys_addr >> MMU_HOP0_PA49_44_SHIFT);
+	WREG32(MMU_ASID_BUSY, 0x80000000 | asid);
+
+	rc = hl_poll_timeout(
+		hdev,
+		MMU_ASID_BUSY,
+		status,
+		!(status & 0x80000000),
+		1000,
+		timeout_usec);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Timeout during MMU hop0 config of asid %d\n", asid);
+		return rc;
+	}
+
+	return 0;
+}
+
 int goya_send_heartbeat(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -5819,6 +6206,10 @@ static const struct hl_asic_funcs goya_funcs = {
 	.handle_eqe = goya_handle_eqe,
 	.set_pll_profile = goya_set_pll_profile,
 	.get_events_stat = goya_get_events_stat,
+	.read_pte = goya_read_pte,
+	.write_pte = goya_write_pte,
+	.mmu_invalidate_cache = goya_mmu_invalidate_cache,
+	.mmu_invalidate_cache_range = goya_mmu_invalidate_cache_range,
 	.send_heartbeat = goya_send_heartbeat,
 	.enable_clock_gating = goya_init_clock_gating,
 	.disable_clock_gating = goya_disable_clock_gating,
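Reviewer note on the goya.c changes above: the two pieces of arithmetic in goya_mmu_init() and goya_mmu_prepare() can be checked in isolation. The sketch below is a user-space illustration, not driver code. The helper names are hypothetical, and the 10-bit ASID width is an assumption (the patch only shows that bits 10:0 are cleared to cover both the MMBP and ASID fields).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: ASID occupies bits 9:0 and MMBP is bit 10 (0x7FF cleared). */
#define ASID_FIELD_MASK		0x3FFu
#define MMBP_AND_ASID_MASK	0x7FFu

/* Hypothetical helper: hop0 table address for a given ASID, one table per ASID. */
static uint64_t hop0_table_addr(uint64_t pgt_base, uint32_t asid,
				uint32_t hop_table_size)
{
	return pgt_base + (uint64_t)asid * hop_table_size;
}

/* Hypothetical helper: clear MMBP + ASID bits, then set the new ASID. */
static uint32_t set_asid_bits(uint32_t reg, uint32_t asid)
{
	/* mirrors the "asid is too big" check in goya_mmu_prepare() */
	assert((asid & ~ASID_FIELD_MASK) == 0);

	return (reg & ~MMBP_AND_ASID_MASK) | asid;
}
```

Doing the multiplication in 64 bits avoids any dependence on how large max_asid * mmu_hop_table_size can get. The patch multiplies an int by a u32, which is fine for the sizes shown here.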
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 793512b0fa09..1abc139d4293 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -38,6 +38,31 @@
 /* MUST BE POWER OF 2 and larger than 1 */
 #define HL_MAX_PENDING_CS		64
 
+/* Memory */
+#define MEM_HASH_TABLE_BITS		7 /* 1 << 7 buckets */
+
+/* MMU */
+#define MMU_HASH_TABLE_BITS		7 /* 1 << 7 buckets */
+
+/**
+ * struct pgt_info - MMU hop page info.
+ * @node: hash linked-list node for the pgts hash.
+ * @addr: physical address of the pgt.
+ * @ctx: pointer to the owner ctx.
+ * @num_of_ptes: indicates how many ptes are used in the pgt.
+ *
+ * The MMU page tables hierarchy is placed on the DRAM. When a new level (hop)
+ * is needed during mapping, a new page is allocated and this structure holds
+ * its essential information. During unmapping, if no valid PTEs remain in the
+ * page, it is freed with its pgt_info structure.
+ */
+struct pgt_info {
+	struct hlist_node node;
+	u64 addr;
+	struct hl_ctx *ctx;
+	int num_of_ptes;
+};
+
 struct hl_device;
 struct hl_fpriv;
 
@@ -104,6 +129,12 @@ enum vm_type_t {
  *                               mapping DRAM memory.
  * @va_space_dram_end_address: end address of virtual memory range for
  *                             mapping DRAM memory.
+ * @mmu_pgt_addr: base physical address in DRAM of MMU page tables.
+ * @mmu_pgt_size: MMU page tables total size.
+ * @mmu_pte_size: PTE size in MMU page tables.
+ * @mmu_hop_table_size: MMU hop table size.
+ * @mmu_hop0_tables_total_size: total size of MMU hop0 tables.
+ * @dram_page_size: page size for MMU DRAM allocation.
  * @cfg_size: configuration space size on SRAM.
  * @sram_size: total size of SRAM.
  * @max_asid: maximum number of open contexts (ASIDs).
@@ -137,6 +168,12 @@ struct asic_fixed_properties {
 	u64			va_space_host_end_address;
 	u64			va_space_dram_start_address;
 	u64			va_space_dram_end_address;
+	u64			mmu_pgt_addr;
+	u32			mmu_pgt_size;
+	u32			mmu_pte_size;
+	u32			mmu_hop_table_size;
+	u32			mmu_hop0_tables_total_size;
+	u32			dram_page_size;
 	u32			cfg_size;
 	u32			sram_size;
 	u32			max_asid;
@@ -412,6 +449,12 @@ enum hl_pll_frequency {
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
  * @set_pll_profile: change PLL profile (manual/automatic).
  * @get_events_stat: retrieve event queue entries histogram.
+ * @read_pte: read MMU page table entry from DRAM.
+ * @write_pte: write MMU page table entry to DRAM.
+ * @mmu_invalidate_cache: flush MMU STLB cache, either with soft (L1 only) or
+ *                        hard (L0 & L1) flush.
+ * @mmu_invalidate_cache_range: flush specific MMU STLB cache lines with
+ *                              ASID-VA-size mask.
  * @send_heartbeat: send is-alive packet to ArmCP and verify response.
  * @enable_clock_gating: enable clock gating for reducing power consumption.
  * @disable_clock_gating: disable clock for accessing registers on HBW.
@@ -475,6 +518,11 @@ struct hl_asic_funcs {
 	void (*set_pll_profile)(struct hl_device *hdev,
 			enum hl_pll_frequency freq);
 	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
+	u64 (*read_pte)(struct hl_device *hdev, u64 addr);
+	void (*write_pte)(struct hl_device *hdev, u64 addr, u64 val);
+	void (*mmu_invalidate_cache)(struct hl_device *hdev, bool is_hard);
+	void (*mmu_invalidate_cache_range)(struct hl_device *hdev, bool is_hard,
+			u32 asid, u64 va, u64 size);
 	int (*send_heartbeat)(struct hl_device *hdev);
 	void (*enable_clock_gating)(struct hl_device *hdev);
 	void (*disable_clock_gating)(struct hl_device *hdev);
@@ -498,17 +546,40 @@ struct hl_asic_funcs {
 
 #define HL_KERNEL_ASID_ID	0
 
+/**
+ * struct hl_va_range - virtual addresses range.
+ * @lock: protects the virtual addresses list.
+ * @list: list of virtual addresses blocks available for mappings.
+ * @start_addr: range start address.
+ * @end_addr: range end address.
+ */
+struct hl_va_range {
+	struct mutex		lock;
+	struct list_head	list;
+	u64			start_addr;
+	u64			end_addr;
+};
+
 /**
  * struct hl_ctx - user/kernel context.
+ * @mem_hash: holds mapping from virtual address to virtual memory area
+ *		descriptor (hl_vm_phys_pg_list or hl_userptr).
+ * @mmu_hash: holds a mapping from virtual address to pgt_info structure.
  * @hpriv: pointer to the private (KMD) data of the process (fd).
  * @hdev: pointer to the device structure.
  * @refcount: reference counter for the context. Context is released only when
  *		this hits 0l. It is incremented on CS and CS_WAIT.
  * @cs_pending: array of DMA fence objects representing pending CS.
+ * @host_va_range: holds available virtual addresses for host mappings.
+ * @dram_va_range: holds available virtual addresses for DRAM mappings.
+ * @mem_hash_lock: protects the mem_hash.
+ * @mmu_lock: protects the MMU page tables. Any change to the PGT, modifying the
+ *            MMU hash or walking the PGT requires taking this lock.
  * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
  *			to user so user could inquire about CS. It is used as
  *			index to cs_pending array.
  * @cs_lock: spinlock to protect cs_sequence.
+ * @dram_phys_mem: amount of used physical DRAM memory by this context.
  * @thread_restore_token: token to prevent multiple threads of the same context
  *				from running the restore phase. Only one thread
  *				should run it.
@@ -518,12 +589,19 @@ struct hl_asic_funcs {
  * @asid: context's unique address space ID in the device's MMU.
  */
 struct hl_ctx {
+	DECLARE_HASHTABLE(mem_hash, MEM_HASH_TABLE_BITS);
+	DECLARE_HASHTABLE(mmu_hash, MMU_HASH_TABLE_BITS);
 	struct hl_fpriv		*hpriv;
 	struct hl_device	*hdev;
 	struct kref		refcount;
 	struct dma_fence	*cs_pending[HL_MAX_PENDING_CS];
+	struct hl_va_range	host_va_range;
+	struct hl_va_range	dram_va_range;
+	struct mutex		mem_hash_lock;
+	struct mutex		mmu_lock;
 	u64			cs_sequence;
 	spinlock_t		cs_lock;
+	atomic64_t		dram_phys_mem;
 	atomic_t		thread_restore_token;
 	u32			thread_restore_wait_token;
 	u32			asid;
@@ -672,6 +750,96 @@ struct hl_cs_parser {
 
 
 
+/*
+ * MEMORY STRUCTURE
+ */
+
+/**
+ * struct hl_vm_hash_node - hash element from virtual address to virtual
+ *				memory area descriptor (hl_vm_phys_pg_list or
+ *				hl_userptr).
+ * @node: node to hang on the hash table in context object.
+ * @vaddr: key virtual address.
+ * @ptr: value pointer (hl_vm_phys_pg_list or hl_userptr).
+ */
+struct hl_vm_hash_node {
+	struct hlist_node	node;
+	u64			vaddr;
+	void			*ptr;
+};
+
+/**
+ * struct hl_vm_phys_pg - physical page information.
+ * @node: node to hang on the physical page list.
+ * @paddr: physical address of the page.
+ * @page_size: size of the physical page.
+ */
+struct hl_vm_phys_pg {
+	struct list_head	node;
+	u64			paddr;
+	u32			page_size;
+	u32			pad;
+};
+
+/**
+ * struct hl_vm_phys_pg_list - physical page list.
+ * @vm_type: describes the type of the virtual area descriptor.
+ * @list: head of the physical page list.
+ * @mapping_cnt: number of shared mappings.
+ * @list_size: page list size.
+ * @asid: the context related to this list.
+ * @total_size: total size of all the pages in this list.
+ * @flags: HL_MEM_* flags related to this list.
+ * @handle: the provided handle related to this list.
+ * @offset: offset from the first page.
+ * @contiguous: is contiguous physical memory.
+ * @created_from_userptr: is product of host virtual address.
+ */
+struct hl_vm_phys_pg_list {
+	enum vm_type_t		vm_type; /* must be first */
+	struct list_head	list;
+	atomic_t		mapping_cnt;
+	u32			list_size;
+	u32			asid;
+	u32			total_size;
+	u32			flags;
+	u32			handle;
+	u32			offset;
+	u8			contiguous;
+	u8			created_from_userptr;
+};
+
+/**
+ * struct hl_vm_va_block - virtual range block information.
+ * @node: node to hang on the virtual range list in context object.
+ * @start: virtual range start address.
+ * @end: virtual range end address.
+ * @size: virtual range size.
+ */
+struct hl_vm_va_block {
+	struct list_head	node;
+	u64			start;
+	u64			end;
+	u64			size;
+};
+
+/**
+ * struct hl_vm - virtual memory manager for MMU.
+ * @dram_pg_pool: pool for DRAM physical pages of 2MB.
+ * @dram_pg_pool_refcount: reference counter for the pool usage.
+ * @idr_lock: protects the phys_pg_list_handles.
+ * @phys_pg_list_handles: idr to hold all device allocations handles.
+ * @init_done: whether initialization was done. We need this because VM
+ *		initialization might be skipped during device initialization.
+ */
+struct hl_vm {
+	struct gen_pool		*dram_pg_pool;
+	struct kref		dram_pg_pool_refcount;
+	spinlock_t		idr_lock;
+	struct idr		phys_pg_list_handles;
+	u8			init_done;
+};
+
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -783,12 +951,16 @@ struct hl_device_reset_work {
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @mmu_pgt_pool: pool of available MMU hops.
+ * @vm: virtual memory manager for MMU.
+ * @mmu_cache_lock: serializes MMU cache invalidation (one context at a time).
  * @hwmon_dev: H/W monitor device.
  * @pm_mng_profile: current power management profile.
  * @hl_chip_info: ASIC's sensors information.
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
+ * @dram_used_mem: current DRAM memory consumption.
  * @in_reset: is device in reset flow.
  * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open context executing.
@@ -808,6 +980,7 @@ struct hl_device_reset_work {
  * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
  * @reset_on_lockup: true if a reset should be done in case of stuck CS, false
  *                   otherwise.
+ * @dram_supports_virtual_memory: is MMU enabled towards DRAM.
  * @mmu_enable: is MMU enabled.
  */
 struct hl_device {
@@ -842,6 +1015,9 @@ struct hl_device {
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+	struct gen_pool			*mmu_pgt_pool;
+	struct hl_vm			vm;
+	struct mutex			mmu_cache_lock;
 	struct device			*hwmon_dev;
 	enum hl_pm_mng_profile		pm_mng_profile;
 	struct hwmon_chip_info		hl_chip_info;
@@ -852,6 +1028,7 @@ struct hl_device {
 	/* TODO: The following fields should be moved for multi-context */
 	struct hl_ctx			*user_ctx;
 
+	atomic64_t			dram_used_mem;
 	atomic_t			in_reset;
 	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
@@ -868,6 +1045,7 @@ struct hl_device {
 	u8				hard_reset_pending;
 	u8				heartbeat;
 	u8				reset_on_lockup;
+	u8				dram_supports_virtual_memory;
 
 	/* Parameters for bring-up */
 	u8				mmu_enable;
@@ -1019,6 +1197,7 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
 int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
+
 int hl_build_hwmon_channel_info(struct hl_device *hdev,
 		struct armcp_sensor *sensors_arr);
 
@@ -1046,6 +1225,12 @@ struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+int hl_vm_ctx_init(struct hl_ctx *ctx);
+void hl_vm_ctx_fini(struct hl_ctx *ctx);
+
+int hl_vm_init(struct hl_device *hdev);
+void hl_vm_fini(struct hl_device *hdev);
+
 int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
 			struct hl_userptr *userptr);
 int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr);
@@ -1055,6 +1240,15 @@ bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr, u32 size,
 				struct list_head *userptr_list,
 				struct hl_userptr **userptr);
 
+int hl_mmu_init(struct hl_device *hdev);
+void hl_mmu_fini(struct hl_device *hdev);
+void hl_mmu_ctx_init(struct hl_ctx *ctx);
+void hl_mmu_ctx_fini(struct hl_ctx *ctx);
+int hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr, u32 page_size);
+int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr);
+void hl_mmu_swap_out(struct hl_ctx *ctx);
+void hl_mmu_swap_in(struct hl_ctx *ctx);
+
 long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
 void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
 long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
@@ -1072,5 +1266,6 @@ long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
 int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data);
 int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data);
+int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data);
 
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index fccfa7830121..4b7bf42a4d3e 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -202,7 +202,7 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->reset_on_lockup = reset_on_lockup;
 
 	/* Parameters for bring-up - set them to defaults */
-	hdev->mmu_enable = 0;
+	hdev->mmu_enable = 1;
 	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
 	hdev->config_pll = 0;
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index f6969d6dba9c..6dcad810b821 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -18,7 +18,8 @@
 static const struct hl_ioctl_desc hl_ioctls[] = {
 	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
-	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl)
+	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl),
+	HL_IOCTL_DEF(HL_IOCTL_MEMORY, hl_mem_ioctl)
 };
 
 #define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
index bcc461760e5f..3599a7833679 100644
--- a/drivers/misc/habanalabs/include/goya/goya.h
+++ b/drivers/misc/habanalabs/include/goya/goya.h
@@ -36,12 +36,14 @@
 
 #define CPU_FW_IMAGE_SIZE	0x10000000	/* 256MB */
 #define MMU_PAGE_TABLES_SIZE	0x0E000000	/* 224MB */
+#define MMU_CACHE_MNG_SIZE	0x00001000	/* 4KB */
 #define CPU_PQ_PKT_SIZE		0x00001000	/* 4KB */
-#define CPU_PQ_DATA_SIZE	0x01FFF000	/* 32MB - 4KB  */
+#define CPU_PQ_DATA_SIZE	0x01FFE000	/* 32MB - 8KB  */
 
 #define CPU_FW_IMAGE_ADDR	DRAM_PHYS_BASE
 #define MMU_PAGE_TABLES_ADDR	(CPU_FW_IMAGE_ADDR + CPU_FW_IMAGE_SIZE)
-#define CPU_PQ_PKT_ADDR		(MMU_PAGE_TABLES_ADDR + MMU_PAGE_TABLES_SIZE)
+#define MMU_CACHE_MNG_ADDR	(MMU_PAGE_TABLES_ADDR + MMU_PAGE_TABLES_SIZE)
+#define CPU_PQ_PKT_ADDR		(MMU_CACHE_MNG_ADDR + MMU_CACHE_MNG_SIZE)
 #define CPU_PQ_DATA_ADDR	(CPU_PQ_PKT_ADDR + CPU_PQ_PKT_SIZE)
 #define DRAM_BASE_ADDR_USER	(CPU_PQ_DATA_ADDR + CPU_PQ_DATA_SIZE)
 
diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
new file mode 100644
index 000000000000..8d61ee4f2d17
--- /dev/null
+++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef INCLUDE_MMU_GENERAL_H_
+#define INCLUDE_MMU_GENERAL_H_
+
+#define PAGE_SHIFT_4KB			12
+#define PAGE_SHIFT_2MB			21
+#define PAGE_SIZE_2MB			(_AC(1, UL) << PAGE_SHIFT_2MB)
+#define PAGE_SIZE_4KB			(_AC(1, UL) << PAGE_SHIFT_4KB)
+
+#define PAGE_PRESENT_MASK		0x0000000000001
+#define SWAP_OUT_MASK			0x0000000000004
+#define LAST_MASK			0x0000000000800
+#define PHYS_ADDR_MASK			0x3FFFFFFFFF000ull
+#define HOP0_MASK			0x3000000000000ull
+#define HOP1_MASK			0x0FF8000000000ull
+#define HOP2_MASK			0x0007FC0000000ull
+#define HOP3_MASK			0x000003FE00000
+#define HOP4_MASK			0x00000001FF000
+#define OFFSET_MASK			0x0000000000FFF
+
+#define HOP0_SHIFT			48
+#define HOP1_SHIFT			39
+#define HOP2_SHIFT			30
+#define HOP3_SHIFT			21
+#define HOP4_SHIFT			12
+
+#define PTE_PHYS_ADDR_SHIFT		12
+#define PTE_PHYS_ADDR_MASK		~0xFFF
+
+#define PTE_SIZE			sizeof(u64)
+#define HOP_TABLE_SIZE			PAGE_SIZE_4KB
+#define HOP0_TABLES_TOTAL_SIZE		(HOP_TABLE_SIZE * MAX_ASID)
+
+#define MMU_HOP0_PA43_12_SHIFT		12
+#define MMU_HOP0_PA49_44_SHIFT		(12 + 32)
+
+#define MMU_CONFIG_TIMEOUT_USEC		2000 /* 2 ms */
+
+#endif /* INCLUDE_MMU_GENERAL_H_ */
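Reviewer note on mmu_general.h: the HOPn_MASK/HOPn_SHIFT pairs describe a five-level walk in which each level selects a 9-bit index (2 bits for hop0) out of a 50-bit device virtual address. The sketch below is a user-space illustration of that decomposition. The mask and shift values are copied from the header above, while va_hop_index() and va_page_offset() are hypothetical helpers that do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from mmu_general.h in this patch. */
#define HOP0_MASK	0x3000000000000ull
#define HOP1_MASK	0x0FF8000000000ull
#define HOP2_MASK	0x0007FC0000000ull
#define HOP3_MASK	0x000003FE00000ull
#define HOP4_MASK	0x00000001FF000ull
#define OFFSET_MASK	0x0000000000FFFull

#define HOP0_SHIFT	48
#define HOP1_SHIFT	39
#define HOP2_SHIFT	30
#define HOP3_SHIFT	21
#define HOP4_SHIFT	12

/* Hypothetical helper: index into the hop table at level 'hop' (0..4). */
static uint64_t va_hop_index(uint64_t va, unsigned int hop)
{
	static const uint64_t masks[] = {
		HOP0_MASK, HOP1_MASK, HOP2_MASK, HOP3_MASK, HOP4_MASK
	};
	static const unsigned int shifts[] = {
		HOP0_SHIFT, HOP1_SHIFT, HOP2_SHIFT, HOP3_SHIFT, HOP4_SHIFT
	};

	assert(hop <= 4);

	return (va & masks[hop]) >> shifts[hop];
}

/* Hypothetical helper: byte offset inside a 4KB page. */
static uint64_t va_page_offset(uint64_t va)
{
	return va & OFFSET_MASK;
}
```

For a 2MB mapping the walk would presumably stop at hop3 and the low 21 bits become the in-page offset. That reading is an assumption based on PAGE_SHIFT_2MB matching HOP3_SHIFT, since the walker itself is outside this section.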
diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
new file mode 100644
index 000000000000..8539dd041f2c
--- /dev/null
+++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef INCLUDE_MMU_V1_0_H_
+#define INCLUDE_MMU_V1_0_H_
+
+#define MMU_HOP0_PA43_12	0x490004
+#define MMU_HOP0_PA49_44	0x490008
+#define MMU_ASID_BUSY		0x490000
+
+#endif /* INCLUDE_MMU_V1_0_H_ */
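Reviewer note on the hop0 registers: goya_mmu_update_asid_hop0_addr() writes one physical address into two 32-bit registers using MMU_HOP0_PA43_12_SHIFT and MMU_HOP0_PA49_44_SHIFT. The sketch below is a user-space illustration of that split. It assumes the register write truncates its value to 32 bits and that hop tables are 4KB aligned. The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from mmu_general.h in this patch. */
#define MMU_HOP0_PA43_12_SHIFT	12
#define MMU_HOP0_PA49_44_SHIFT	(12 + 32)

/* Hypothetical container for the two register values. */
struct hop0_regs {
	uint32_t pa43_12;	/* physical address bits 43:12 */
	uint32_t pa49_44;	/* physical address bits 49:44 */
};

/* Split a hop0 table address the way the two register writes do. */
static struct hop0_regs split_hop0_addr(uint64_t phys_addr)
{
	struct hop0_regs r;

	/* assumption: hop tables are 4KB aligned, so bits 11:0 are zero */
	assert((phys_addr & 0xFFFull) == 0);

	/* the cast models truncation to a 32-bit register */
	r.pa43_12 = (uint32_t)(phys_addr >> MMU_HOP0_PA43_12_SHIFT);
	r.pa49_44 = (uint32_t)(phys_addr >> MMU_HOP0_PA49_44_SHIFT);

	return r;
}

/* Rebuild the address from the two register values. */
static uint64_t join_hop0_addr(struct hop0_regs r)
{
	return ((uint64_t)r.pa49_44 << MMU_HOP0_PA49_44_SHIFT) |
	       ((uint64_t)r.pa43_12 << MMU_HOP0_PA43_12_SHIFT);
}
```

The round trip only holds when both values are obtained by shifting right. A left shift of a DRAM address by 40 or more bits would leave nothing useful in the low 32 bits.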
diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
index 94cbb252656d..c41ea19502e5 100644
--- a/drivers/misc/habanalabs/memory.c
+++ b/drivers/misc/habanalabs/memory.c
@@ -5,12 +5,1193 @@
  * All Rights Reserved.
  */
 
+#include <uapi/misc/habanalabs.h>
 #include "habanalabs.h"
+#include "include/hw_ip/mmu/mmu_general.h"
 
 #include <linux/sched.h>
 #include <linux/uaccess.h>
 #include <linux/genalloc.h>
 
+#define HL_MMU_DEBUG	0
+
+/*
+ * The va ranges in context object contain a list with the available chunks of
+ * device virtual memory.
+ * There is one range for host allocations and one for DRAM allocations.
+ *
+ * On initialization each range contains one chunk of all of its available
+ * virtual range which is a half of the total device virtual range.
+ *
+ * On each mapping of physical pages, a suitable virtual range chunk (with a
+ * minimum size) is selected from the list. If the chunk size equals the
+ * requested size, the chunk is returned. Otherwise, the chunk is split into
+ * two chunks - one to return as result and a remainder to stay in the list.
+ *
+ * On each unmapping of a virtual address, the relevant virtual chunk is
+ * returned to the list. The chunk is added to the list and if its edges match
+ * the edges of the adjacent chunks (meaning a contiguous chunk can be created),
+ * the chunks are merged.
+ *
+ * On finish, the list is checked to have only one chunk of all the relevant
+ * virtual range (which is a half of the device total virtual range).
+ * If not (meaning not all mappings were unmapped), a warning is printed.
+ */
+
+/**
+ * alloc_device_memory - allocate device memory
+ *
+ * @ctx                 : current context
+ * @args                : host parameters containing the requested size
+ * @ret_handle          : result handle
+ *
+ * This function does the following:
+ * - Allocate the requested size rounded up to 2MB pages
+ * - Return unique handle
+ */
+static int alloc_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args,
+				u32 *ret_handle)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_list *phys_pg_list;
+	struct hl_vm_phys_pg *phys_pg, *tmp;
+	u64 paddr = 0;
+	u32 total_size, num_pgs, page_size, page_shift;
+	int handle, rc, i;
+	bool contiguous;
+
+	page_size = hdev->asic_prop.dram_page_size;
+	page_shift = __ffs(page_size);
+	num_pgs = (args->alloc.mem_size + (page_size - 1)) >> page_shift;
+	total_size = num_pgs << page_shift;
+
+	contiguous = args->flags & HL_MEM_CONTIGUOUS;
+
+	if (contiguous) {
+		paddr = (u64) gen_pool_alloc(vm->dram_pg_pool, total_size);
+		if (!paddr) {
+			dev_err(hdev->dev,
+				"failed to allocate %u huge contiguous pages\n",
+				num_pgs);
+			return -ENOMEM;
+		}
+	}
+
+	phys_pg_list = kzalloc(sizeof(*phys_pg_list), GFP_KERNEL);
+	if (!phys_pg_list) {
+		rc = -ENOMEM;
+		goto page_list_err;
+	}
+
+	phys_pg_list->vm_type = VM_TYPE_PHYS_LIST;
+	phys_pg_list->asid = ctx->asid;
+	phys_pg_list->total_size = total_size;
+	phys_pg_list->flags = args->flags;
+	phys_pg_list->contiguous = contiguous;
+	INIT_LIST_HEAD(&phys_pg_list->list);
+
+	for (i = 0 ; i < num_pgs ; i++) {
+		phys_pg = kzalloc(sizeof(*phys_pg), GFP_KERNEL);
+		if (!phys_pg) {
+			rc = -ENOMEM;
+			goto pb_err;
+		}
+
+		phys_pg->page_size = page_size;
+
+		if (phys_pg_list->contiguous) {
+			phys_pg->paddr = paddr + i * phys_pg->page_size;
+		} else {
+			phys_pg->paddr =
+				(u64) gen_pool_alloc(vm->dram_pg_pool,
+							phys_pg->page_size);
+			if (!phys_pg->paddr) {
+				dev_err(hdev->dev, "ioctl failed to allocate page\n");
+				kfree(phys_pg);
+				rc = -ENOMEM;
+				goto pb_err;
+			}
+		}
+
+		list_add_tail(&phys_pg->node, &phys_pg_list->list);
+	}
+
+	spin_lock(&vm->idr_lock);
+	handle = idr_alloc(&vm->phys_pg_list_handles, phys_pg_list, 1, 0,
+				GFP_ATOMIC);
+	spin_unlock(&vm->idr_lock);
+
+	if (handle < 0) {
+		dev_err(hdev->dev, "Failed to get handle for page\n");
+		rc = -EFAULT;
+		goto idr_err;
+	}
+
+	for (i = 0; i < num_pgs ; i++)
+		kref_get(&vm->dram_pg_pool_refcount);
+
+	phys_pg_list->handle = handle;
+
+	atomic64_add(phys_pg_list->total_size, &ctx->dram_phys_mem);
+	atomic64_add(phys_pg_list->total_size, &hdev->dram_used_mem);
+
+	*ret_handle = handle;
+
+	return 0;
+
+idr_err:
+pb_err:
+	list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
+		if (!phys_pg_list->contiguous)
+			gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
+					phys_pg->page_size);
+
+		list_del(&phys_pg->node);
+		kfree(phys_pg);
+	}
+
+	kfree(phys_pg_list);
+page_list_err:
+	if (contiguous)
+		gen_pool_free(vm->dram_pg_pool, paddr, total_size);
+
+	return rc;
+}
+
+/**
+ * get_userptr_from_host_va - initialize userptr structure from given host
+ *                            virtual address
+ *
+ * @hdev                : habanalabs device structure
+ * @args                : parameters containing the virtual address and size
+ * @p_userptr           : pointer to result userptr structure
+ *
+ * This function does the following:
+ * - Allocate userptr structure
+ * - Pin the given host memory using the userptr structure
+ * - Perform DMA mapping to have the DMA addresses of the pages
+ */
+static int get_userptr_from_host_va(struct hl_device *hdev,
+		struct hl_mem_in *args, struct hl_userptr **p_userptr)
+{
+	struct hl_userptr *userptr;
+	int rc;
+
+	userptr = kzalloc(sizeof(*userptr), GFP_KERNEL);
+	if (!userptr) {
+		rc = -ENOMEM;
+		goto userptr_err;
+	}
+
+	rc = hl_pin_host_memory(hdev, args->map_host.host_virt_addr,
+			args->map_host.mem_size, userptr);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to pin host memory\n");
+		goto pin_err;
+	}
+
+	rc = hdev->asic_funcs->asic_dma_map_sg(hdev, userptr->sgt->sgl,
+					userptr->sgt->nents, DMA_BIDIRECTIONAL);
+	if (rc) {
+		dev_err(hdev->dev, "failed to map sgt with DMA region\n");
+		goto dma_map_err;
+	}
+
+	userptr->dma_mapped = true;
+	userptr->dir = DMA_BIDIRECTIONAL;
+	userptr->vm_type = VM_TYPE_USERPTR;
+
+	*p_userptr = userptr;
+
+	return 0;
+
+dma_map_err:
+	hl_unpin_host_memory(hdev, userptr);
+pin_err:
+	kfree(userptr);
+userptr_err:
+
+	return rc;
+}
+
+/**
+ * free_userptr - free userptr structure
+ *
+ * @hdev                : habanalabs device structure
+ * @userptr             : userptr to free
+ *
+ * This function does the following:
+ * - Unpins the physical pages
+ * - Frees the userptr structure
+ */
+static void free_userptr(struct hl_device *hdev, struct hl_userptr *userptr)
+{
+	hl_unpin_host_memory(hdev, userptr);
+	kfree(userptr);
+}
+
+/**
+ * dram_pg_pool_do_release - free DRAM pages pool
+ *
+ * @ref                 : pointer to reference object
+ *
+ * This function does the following:
+ * - Frees the idr structure of physical pages handles
+ * - Frees the generic pool of DRAM physical pages
+ */
+static void dram_pg_pool_do_release(struct kref *ref)
+{
+	struct hl_vm *vm = container_of(ref, struct hl_vm,
+			dram_pg_pool_refcount);
+
+	/*
+	 * free the idr here as only here we know for sure that there are no
+	 * allocated physical pages and hence there are no handles in use
+	 */
+	idr_destroy(&vm->phys_pg_list_handles);
+	gen_pool_destroy(vm->dram_pg_pool);
+}
+
+/**
+ * free_phys_pg_list    - free physical page list
+ *
+ * @hdev                : habanalabs device structure
+ * @phys_pg_list        : physical page list to free
+ *
+ * This function does the following:
+ * - Iterate over the list and free each physical block structure
+ * - In case of allocated memory, return the physical memory to the general pool
+ * - Free the hl_vm_phys_pg_list structure
+ */
+static void free_phys_pg_list(struct hl_device *hdev,
+		struct hl_vm_phys_pg_list *phys_pg_list)
+{
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg *phys_pg, *tmp;
+	u32 num_pgs;
+	bool first = true;
+	int i;
+
+	list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
+		/*
+		 * this if statement is relevant only when called from
+		 * hl_vm_ctx_fini() and free_device_memory()
+		 */
+		if (!phys_pg_list->created_from_userptr) {
+			if ((phys_pg_list->contiguous) && (first)) {
+				first = false;
+				gen_pool_free(vm->dram_pg_pool,
+						phys_pg->paddr,
+						phys_pg_list->total_size);
+
+				num_pgs = phys_pg_list->total_size >>
+					__ffs(hdev->asic_prop.dram_page_size);
+
+				for (i = 0; i < num_pgs ; i++)
+					kref_put(&vm->dram_pg_pool_refcount,
+						dram_pg_pool_do_release);
+
+			} else if (!phys_pg_list->contiguous) {
+				gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
+						phys_pg->page_size);
+				kref_put(&vm->dram_pg_pool_refcount,
+						dram_pg_pool_do_release);
+			}
+		}
+
+		list_del(&phys_pg->node);
+		kfree(phys_pg);
+	}
+
+	kfree(phys_pg_list);
+}
+
+/**
+ * free_device_memory - free device memory
+ *
+ * @ctx                 : current context
+ * @handle              : handle of the memory chunk to free
+ *
+ * This function does the following:
+ * - Free the device memory related to the given handle
+ */
+static int free_device_memory(struct hl_ctx *ctx, u32 handle)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_list *phys_pg_list;
+
+	spin_lock(&vm->idr_lock);
+	phys_pg_list = idr_find(&vm->phys_pg_list_handles, handle);
+	if (phys_pg_list) {
+		if (atomic_read(&phys_pg_list->mapping_cnt) > 0) {
+			dev_err(hdev->dev, "handle %u is mapped, cannot free\n",
+				handle);
+			spin_unlock(&vm->idr_lock);
+			return -EINVAL;
+		}
+
+		/*
+		 * must remove from idr before the freeing of the physical
+		 * pages as the refcount of the pool is also the trigger of the
+		 * idr destroy
+		 */
+		idr_remove(&vm->phys_pg_list_handles, handle);
+		spin_unlock(&vm->idr_lock);
+
+		atomic64_sub(phys_pg_list->total_size, &ctx->dram_phys_mem);
+		atomic64_sub(phys_pg_list->total_size, &hdev->dram_used_mem);
+
+		free_phys_pg_list(hdev, phys_pg_list);
+	} else {
+		spin_unlock(&vm->idr_lock);
+		dev_err(hdev->dev,
+			"free device memory failed, no match for handle %u\n",
+			handle);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * clear_va_list_locked    - free virtual addresses list
+ *
+ * @hdev                : habanalabs device structure
+ * @va_list             : list of virtual addresses to free
+ *
+ * This function does the following:
+ * - Iterate over the list and free each virtual addresses block
+ *
+ * This function should be called only when va_list lock is taken
+ */
+static void clear_va_list_locked(struct hl_device *hdev,
+		struct list_head *va_list)
+{
+	struct hl_vm_va_block *va_block, *tmp;
+
+	list_for_each_entry_safe(va_block, tmp, va_list, node) {
+		list_del(&va_block->node);
+		kfree(va_block);
+	}
+}
+
+/**
+ * print_va_list_locked    - print virtual addresses list
+ *
+ * @hdev                : habanalabs device structure
+ * @va_list             : list of virtual addresses to print
+ *
+ * This function does the following:
+ * - Iterate over the list and print each virtual addresses block
+ *
+ * This function should be called only when va_list lock is taken
+ */
+static void print_va_list_locked(struct hl_device *hdev,
+		struct list_head *va_list)
+{
+#if HL_MMU_DEBUG
+	struct hl_vm_va_block *va_block;
+
+	dev_dbg(hdev->dev, "print va list:\n");
+
+	list_for_each_entry(va_block, va_list, node)
+		dev_dbg(hdev->dev,
+			"va block, start: 0x%llx, end: 0x%llx, size: %llu\n",
+			va_block->start, va_block->end, va_block->size);
+#endif
+}
+
+/**
+ * merge_va_blocks_locked - merge a virtual block if possible
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_list             : pointer to the virtual addresses block list
+ * @va_block            : virtual block to merge with adjacent blocks
+ *
+ * This function does the following:
+ * - Merge the given block with the adjacent blocks if their virtual ranges
+ *   create a contiguous virtual range
+ *
+ * This function should be called only when va_list lock is taken
+ */
+static void merge_va_blocks_locked(struct hl_device *hdev,
+		struct list_head *va_list, struct hl_vm_va_block *va_block)
+{
+	struct hl_vm_va_block *prev, *next;
+
+	prev = list_prev_entry(va_block, node);
+	if (&prev->node != va_list && prev->end + 1 == va_block->start) {
+		prev->end = va_block->end;
+		prev->size = prev->end - prev->start;
+		list_del(&va_block->node);
+		kfree(va_block);
+		va_block = prev;
+	}
+
+	next = list_next_entry(va_block, node);
+	if (&next->node != va_list && va_block->end + 1 == next->start) {
+		next->start = va_block->start;
+		next->size = next->end - next->start;
+		list_del(&va_block->node);
+		kfree(va_block);
+	}
+}
+
+/**
+ * add_va_block_locked - add a virtual block to the virtual addresses list
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_list             : pointer to the virtual addresses block list
+ * @start               : start virtual address
+ * @end                 : end virtual address
+ *
+ * This function does the following:
+ * - Add the given block to the virtual blocks list and merge with other
+ *   blocks if a contiguous virtual block can be created
+ *
+ * This function should be called only when va_list lock is taken
+ */
+static int add_va_block_locked(struct hl_device *hdev,
+		struct list_head *va_list, u64 start, u64 end)
+{
+	struct hl_vm_va_block *va_block, *res = NULL;
+	u64 size = end - start;
+
+	print_va_list_locked(hdev, va_list);
+
+	list_for_each_entry(va_block, va_list, node) {
+		/* TODO: remove this sanity check once the code matures */
+		if (hl_mem_area_crosses_range(start, size, va_block->start,
+				va_block->end)) {
+			dev_err(hdev->dev,
+				"block crossing ranges at start 0x%llx, end 0x%llx\n",
+				va_block->start, va_block->end);
+			return -EINVAL;
+		}
+
+		if (va_block->end < start)
+			res = va_block;
+	}
+
+	va_block = kmalloc(sizeof(*va_block), GFP_KERNEL);
+	if (!va_block)
+		return -ENOMEM;
+
+	va_block->start = start;
+	va_block->end = end;
+	va_block->size = size;
+
+	if (!res)
+		list_add(&va_block->node, va_list);
+	else
+		list_add(&va_block->node, &res->node);
+
+	merge_va_blocks_locked(hdev, va_list, va_block);
+
+	print_va_list_locked(hdev, va_list);
+
+	return 0;
+}
+
+/**
+ * add_va_block - wrapper for add_va_block_locked
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_range            : pointer to the virtual addresses range
+ * @start               : start virtual address
+ * @end                 : end virtual address
+ *
+ * This function does the following:
+ * - Takes the list lock and calls add_va_block_locked
+ */
+static inline int add_va_block(struct hl_device *hdev,
+		struct hl_va_range *va_range, u64 start, u64 end)
+{
+	int rc;
+
+	mutex_lock(&va_range->lock);
+	rc = add_va_block_locked(hdev, &va_range->list, start, end);
+	mutex_unlock(&va_range->lock);
+
+	return rc;
+}
+
+/**
+ * get_va_block - get a virtual block with the requested size
+ *
+ * @hdev            : pointer to the habanalabs device structure
+ * @va_range        : pointer to the virtual addresses range
+ * @size            : requested block size
+ * @hint_addr       : hint for the requested address, given by the user
+ * @is_userptr      : true for host memory, false for DRAM memory
+ *
+ * This function does the following:
+ * - Iterate on the virtual block list to find a suitable virtual block for the
+ *   requested size
+ * - Reserve the requested block and update the list
+ * - Return the start address of the virtual block
+ */
+static u64 get_va_block(struct hl_device *hdev,
+		struct hl_va_range *va_range, u32 size, u64 hint_addr,
+		bool is_userptr)
+{
+	struct hl_vm_va_block *va_block, *new_va_block = NULL;
+	u64 valid_start, valid_size, prev_start, prev_end, page_mask,
+		res_valid_start = 0, res_valid_size = 0;
+	u32 page_size;
+	bool add_prev = false;
+
+	if (is_userptr) {
+		/*
+		 * We cannot know if the user allocated memory with huge pages
+		 * or not, hence we continue with the biggest possible
+		 * granularity.
+		 */
+		page_size = HPAGE_SIZE;
+		page_mask = HPAGE_MASK;
+	} else {
+		page_size = hdev->asic_prop.dram_page_size;
+		page_mask = ~((u64)page_size - 1);
+	}
+
+	mutex_lock(&va_range->lock);
+
+	print_va_list_locked(hdev, &va_range->list);
+
+	list_for_each_entry(va_block, &va_range->list, node) {
+		/* calc the first possible aligned addr */
+		valid_start = va_block->start;
+
+
+		if (valid_start & (page_size - 1)) {
+			valid_start &= page_mask;
+			valid_start += page_size;
+			if (valid_start > va_block->end)
+				continue;
+		}
+
+		valid_size = va_block->end - valid_start;
+
+		if (valid_size >= size &&
+			(!new_va_block || valid_size < res_valid_size)) {
+
+			new_va_block = va_block;
+			res_valid_start = valid_start;
+			res_valid_size = valid_size;
+		}
+
+		if (hint_addr && hint_addr >= valid_start &&
+				((hint_addr + size) <= va_block->end)) {
+			new_va_block = va_block;
+			res_valid_start = hint_addr;
+			res_valid_size = valid_size;
+			break;
+		}
+	}
+
+	if (!new_va_block) {
+		dev_err(hdev->dev, "no available va block for size %u\n", size);
+		goto out;
+	}
+
+	if (res_valid_start > new_va_block->start) {
+		prev_start = new_va_block->start;
+		prev_end = res_valid_start - 1;
+
+		new_va_block->start = res_valid_start;
+		new_va_block->size = res_valid_size;
+
+		add_prev = true;
+	}
+
+	if (new_va_block->size > size) {
+		new_va_block->start += size;
+		new_va_block->size = new_va_block->end - new_va_block->start;
+	} else {
+		list_del(&new_va_block->node);
+		kfree(new_va_block);
+	}
+
+	if (add_prev)
+		add_va_block_locked(hdev, &va_range->list, prev_start,
+				prev_end);
+
+	print_va_list_locked(hdev, &va_range->list);
+out:
+	mutex_unlock(&va_range->lock);
+
+	return res_valid_start;
+}
+
+/**
+ * init_phys_pg_list_from_userptr - initialize physical page list from host
+ *                                  memory
+ *
+ * @ctx                 : current context
+ * @userptr             : userptr to initialize from
+ * @pphys_pg_list       : pointer to the resulting physical page list
+ *
+ * This function does the following:
+ * - Pin the physical pages related to the given virtual block
+ * - Create a physical page list from the physical pages related to the given
+ *   virtual block
+ */
+static int init_phys_pg_list_from_userptr(struct hl_ctx *ctx,
+		struct hl_userptr *userptr,
+		struct hl_vm_phys_pg_list **pphys_pg_list)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm_phys_pg_list *phys_pg_list;
+	struct hl_vm_phys_pg *phys_pg;
+	struct scatterlist *sg;
+	dma_addr_t dma_addr;
+	u32 npages, len;
+	bool first = true;
+	int rc, i;
+
+	phys_pg_list = kzalloc(sizeof(*phys_pg_list), GFP_KERNEL);
+	if (!phys_pg_list) {
+		rc = -ENOMEM;
+		goto page_list_mem_err;
+	}
+
+	phys_pg_list->vm_type = userptr->vm_type;
+	phys_pg_list->created_from_userptr = true;
+	INIT_LIST_HEAD(&phys_pg_list->list);
+	phys_pg_list->asid = ctx->asid;
+	atomic_set(&phys_pg_list->mapping_cnt, 1);
+
+	for_each_sg(userptr->sgt->sgl, sg, userptr->sgt->nents, i) {
+		len = sg_dma_len(sg);
+		dma_addr = sg_dma_address(sg);
+
+		/*
+		 * Calculate the number of consecutive pages described by
+		 * this SG entry: take the offset of the DMA address within
+		 * its first page, add the entry length to it, and round the
+		 * result up to a whole number of PAGE_SIZE pages (which are
+		 * not necessarily 4K).
+		 */
+		npages =
+			(((dma_addr & (PAGE_SIZE - 1)) + len) + (PAGE_SIZE - 1))
+			>> PAGE_SHIFT;
+
+		/* align down to physical page size and save the offset */
+		if (first) {
+			first = false;
+			phys_pg_list->offset = dma_addr & (PAGE_SIZE - 1);
+			dma_addr &= PAGE_MASK;
+		}
+
+		while (npages) {
+			phys_pg = kzalloc(sizeof(*phys_pg), GFP_KERNEL);
+			if (!phys_pg) {
+				rc = -ENOMEM;
+				goto page_mem_err;
+			}
+
+			list_add_tail(&phys_pg->node, &phys_pg_list->list);
+			phys_pg_list->list_size++;
+
+			phys_pg->page_size = PAGE_SIZE;
+			npages--;
+
+			phys_pg->paddr = dma_addr;
+			phys_pg_list->total_size += phys_pg->page_size;
+
+			dma_addr += phys_pg->page_size;
+		}
+	}
+
+	*pphys_pg_list = phys_pg_list;
+
+	return 0;
+
+page_mem_err:
+	free_phys_pg_list(hdev, phys_pg_list);
+page_list_mem_err:
+
+	return rc;
+}
+
+/**
+ * map_phys_page_list - maps the page list
+ *
+ * @ctx                 : current context
+ * @vaddr               : start address of the virtual area to map from
+ * @phys_pg_list        : the list of physical pages to map to
+ *
+ * This function does the following:
+ * - Maps each chunk of virtual memory to matching physical chunk
+ * - Returns the number of successful mappings
+ */
+static int map_phys_page_list(struct hl_ctx *ctx, u64 vaddr,
+		struct hl_vm_phys_pg_list *phys_pg_list)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm_phys_pg *phys_pg;
+	u64 next_vaddr = vaddr, paddr;
+	int rc, mapped_pg_cnt = 0;
+
+	list_for_each_entry(phys_pg, &phys_pg_list->list, node) {
+		paddr = phys_pg->paddr;
+
+		/* For accessing the host we need to turn on bit 39 */
+		if (phys_pg_list->created_from_userptr)
+			paddr += hdev->asic_prop.host_phys_base_address;
+
+		rc = hl_mmu_map(ctx, next_vaddr, paddr, phys_pg->page_size);
+		if (rc) {
+			dev_err(hdev->dev, "map failed for handle %u\n",
+					phys_pg_list->handle);
+			break;
+		}
+		mapped_pg_cnt++;
+		next_vaddr += phys_pg->page_size;
+	}
+
+	return mapped_pg_cnt;
+}
+
+static int get_paddr_from_handle(struct hl_ctx *ctx, struct hl_mem_in *args,
+				u64 *paddr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_list *phys_pg_list;
+	struct hl_vm_phys_pg *phys_pg;
+	u32 handle;
+
+	handle = lower_32_bits(args->map_device.handle);
+	spin_lock(&vm->idr_lock);
+	phys_pg_list = idr_find(&vm->phys_pg_list_handles, handle);
+	if (!phys_pg_list) {
+		spin_unlock(&vm->idr_lock);
+		dev_err(hdev->dev, "no match for handle %u\n", handle);
+		return -EINVAL;
+	}
+
+	phys_pg = list_first_entry(&phys_pg_list->list, typeof(*phys_pg), node);
+
+	*paddr = phys_pg->paddr;
+
+	spin_unlock(&vm->idr_lock);
+
+	return 0;
+}
+
+/**
+ * map_device_va - map the given memory
+ *
+ * @ctx	         : current context
+ * @args         : host parameters with handle/host virtual address
+ * @device_addr	 : pointer to result device virtual address
+ *
+ * This function does the following:
+ * - If given a physical device memory handle, map to a device virtual block
+ *   and return the start address of this block
+ * - If given a host virtual address and size, find the related physical pages,
+ *   map a device virtual block to these pages and return the start address of
+ *   this block
+ */
+static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args,
+		u64 *device_addr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_list *phys_pg_list;
+	struct hl_userptr *userptr = NULL;
+	struct hl_vm_phys_pg *phys_pg;
+	struct hl_vm_hash_node *hnode;
+	enum vm_type_t *vm_type;
+	u64 ret_vaddr, next_vaddr, hint_addr;
+	u32 handle = 0;
+	int rc, mapped_pg_cnt = 0;
+	bool is_userptr = args->flags & HL_MEM_USERPTR;
+
+	/* Assume failure */
+	*device_addr = 0;
+
+	if (is_userptr) {
+		rc = get_userptr_from_host_va(hdev, args, &userptr);
+		if (rc) {
+			dev_err(hdev->dev, "failed to get userptr from va\n");
+			return rc;
+		}
+
+		rc = init_phys_pg_list_from_userptr(ctx, userptr,
+				&phys_pg_list);
+		if (rc) {
+			dev_err(hdev->dev,
+				"unable to init page list for vaddr 0x%llx\n",
+				args->map_host.host_virt_addr);
+			goto init_page_list_err;
+		}
+
+		vm_type = (enum vm_type_t *) userptr;
+		hint_addr = args->map_host.hint_addr;
+	} else {
+		handle = lower_32_bits(args->map_device.handle);
+
+		spin_lock(&vm->idr_lock);
+		phys_pg_list = idr_find(&vm->phys_pg_list_handles, handle);
+		if (!phys_pg_list) {
+			spin_unlock(&vm->idr_lock);
+			dev_err(hdev->dev,
+				"no match for handle %u\n", handle);
+			return -EINVAL;
+		}
+
+		/* increment now to avoid freeing device memory while mapping */
+		atomic_inc(&phys_pg_list->mapping_cnt);
+
+		spin_unlock(&vm->idr_lock);
+
+		vm_type = (enum vm_type_t *) phys_pg_list;
+
+		hint_addr = args->map_device.hint_addr;
+	}
+
+	/*
+	 * relevant for mapping device physical memory only, as host memory is
+	 * implicitly shared
+	 */
+	if (!is_userptr && !(phys_pg_list->flags & HL_MEM_SHARED) &&
+			phys_pg_list->asid != ctx->asid) {
+		dev_err(hdev->dev,
+			"Failed to map memory, handle %u is not shared\n",
+			handle);
+		rc = -EPERM;
+		goto shared_err;
+	}
+
+	hnode = kzalloc(sizeof(*hnode), GFP_KERNEL);
+	if (!hnode) {
+		rc = -ENOMEM;
+		goto hnode_err;
+	}
+
+	ret_vaddr = get_va_block(hdev,
+			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
+			phys_pg_list->total_size, hint_addr, is_userptr);
+	if (!ret_vaddr) {
+		dev_err(hdev->dev, "no available va block for handle %u\n",
+				handle);
+		rc = -ENOMEM;
+		goto va_block_err;
+	}
+
+	mutex_lock(&ctx->mmu_lock);
+
+	mapped_pg_cnt = map_phys_page_list(ctx, ret_vaddr, phys_pg_list);
+	if (mapped_pg_cnt < phys_pg_list->list_size) {
+		mutex_unlock(&ctx->mmu_lock);
+		dev_err(hdev->dev, "mapping page list failed for handle %u\n",
+				handle);
+		goto map_err;
+	}
+
+	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, false, ctx->asid,
+			ret_vaddr, phys_pg_list->total_size);
+
+	mutex_unlock(&ctx->mmu_lock);
+
+	hnode->ptr = vm_type;
+	hnode->vaddr = ret_vaddr;
+
+	mutex_lock(&ctx->mem_hash_lock);
+	hash_add(ctx->mem_hash, &hnode->node, ret_vaddr);
+	mutex_unlock(&ctx->mem_hash_lock);
+
+	*device_addr = ret_vaddr + phys_pg_list->offset;
+
+	if (is_userptr)
+		free_phys_pg_list(hdev, phys_pg_list);
+
+	return 0;
+
+map_err:
+	/*
+	 * map_phys_page_list() reports only a page count, so set the error
+	 * code here and keep it intact through the cleanup below.
+	 */
+	rc = -ENOMEM;
+
+	mutex_lock(&ctx->mmu_lock);
+	next_vaddr = ret_vaddr;
+	list_for_each_entry(phys_pg, &phys_pg_list->list, node) {
+		if (mapped_pg_cnt-- == 0)
+			break;
+
+		WARN(hl_mmu_unmap(ctx, next_vaddr),
+			"failed to unmap handle %u, vaddr 0x%llx, paddr 0x%llx, page size %u\n",
+			handle, next_vaddr, phys_pg->paddr,
+			phys_pg->page_size);
+
+		next_vaddr += phys_pg->page_size;
+	}
+
+	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, true, ctx->asid,
+			ret_vaddr, phys_pg_list->total_size);
+
+	mutex_unlock(&ctx->mmu_lock);
+
+	WARN(add_va_block(hdev,
+			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
+			ret_vaddr,
+			ret_vaddr + phys_pg_list->total_size - 1),
+		"release va block failed for handle 0x%x, vaddr: 0x%llx\n",
+		handle, ret_vaddr);
+
+va_block_err:
+	kfree(hnode);
+hnode_err:
+shared_err:
+	atomic_dec(&phys_pg_list->mapping_cnt);
+	if (is_userptr)
+		free_phys_pg_list(hdev, phys_pg_list);
+init_page_list_err:
+	if (is_userptr)
+		free_userptr(hdev, userptr);
+
+	return rc;
+}
+
+/**
+ * unmap_device_va      - unmap the given device virtual address
+ *
+ * @ctx                 : current context
+ * @vaddr               : device virtual address to unmap
+ *
+ * This function does the following:
+ * - Unmap the physical pages related to the given virtual address
+ * - Return the device virtual block to the virtual block list
+ */
+static int unmap_device_va(struct hl_ctx *ctx, u64 vaddr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm_phys_pg_list *phys_pg_list = NULL;
+	struct hl_vm_phys_pg *phys_pg;
+	struct hl_vm_hash_node *hnode = NULL;
+	struct hl_userptr *userptr = NULL;
+	enum vm_type_t *vm_type;
+	u64 next_vaddr;
+	bool is_userptr;
+	int rc;
+
+	vaddr &= PAGE_MASK;
+
+	/* protect from double entrance */
+	mutex_lock(&ctx->mem_hash_lock);
+	hash_for_each_possible(ctx->mem_hash, hnode, node, (unsigned long)vaddr)
+		if (vaddr == hnode->vaddr)
+			break;
+
+	if (!hnode) {
+		mutex_unlock(&ctx->mem_hash_lock);
+		dev_err(hdev->dev,
+			"unmap failed, no mem hnode for vaddr 0x%llx\n",
+			vaddr);
+		return -EINVAL;
+	}
+
+	hash_del(&hnode->node);
+	mutex_unlock(&ctx->mem_hash_lock);
+
+	vm_type = hnode->ptr;
+
+	if (*vm_type == VM_TYPE_USERPTR) {
+		is_userptr = true;
+		userptr = hnode->ptr;
+		rc = init_phys_pg_list_from_userptr(ctx, userptr,
+				&phys_pg_list);
+		if (rc) {
+			dev_err(hdev->dev,
+				"unable to init page list for vaddr 0x%llx\n",
+				vaddr);
+			goto vm_type_err;
+		}
+	} else if (*vm_type == VM_TYPE_PHYS_LIST) {
+		is_userptr = false;
+		phys_pg_list = hnode->ptr;
+	} else {
+		WARN(1, "unmap failed, unknown vm desc for vaddr 0x%llx\n",
+				vaddr);
+		rc = -EFAULT;
+		goto vm_type_err;
+	}
+
+	if (atomic_read(&phys_pg_list->mapping_cnt) == 0) {
+		dev_err(hdev->dev, "vaddr 0x%llx is not mapped\n", vaddr);
+		rc = -EINVAL;
+		goto mapping_cnt_err;
+	}
+
+	next_vaddr = vaddr;
+
+	mutex_lock(&ctx->mmu_lock);
+
+	list_for_each_entry(phys_pg, &phys_pg_list->list, node) {
+		WARN(hl_mmu_unmap(ctx, next_vaddr),
+				"unmap failed for vaddr: 0x%llx\n", next_vaddr);
+
+		next_vaddr += phys_pg->page_size;
+	}
+
+	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, true, ctx->asid,
+			vaddr, phys_pg_list->total_size);
+
+	mutex_unlock(&ctx->mmu_lock);
+
+	WARN(add_va_block(hdev,
+			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
+			vaddr,
+			vaddr + phys_pg_list->total_size - 1),
+			"add va block failed for vaddr: 0x%llx\n", vaddr);
+
+	atomic_dec(&phys_pg_list->mapping_cnt);
+
+	kfree(hnode);
+
+	if (userptr) {
+		free_phys_pg_list(hdev, phys_pg_list);
+		free_userptr(hdev, userptr);
+	}
+
+	return 0;
+
+mapping_cnt_err:
+	if (userptr)
+		free_phys_pg_list(hdev, phys_pg_list);
+vm_type_err:
+	mutex_lock(&ctx->mem_hash_lock);
+	hash_add(ctx->mem_hash, &hnode->node, vaddr);
+	mutex_unlock(&ctx->mem_hash_lock);
+
+	return rc;
+}
+
+int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	union hl_mem_args *args = data;
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_ctx *ctx = hpriv->ctx;
+	u64 device_addr = 0;
+	u32 handle = 0;
+	int rc;
+
+	if (hdev->hard_reset_pending) {
+		dev_crit_ratelimited(hdev->dev,
+			"Device HARD reset pending !!! Please close FD\n");
+		return -ENODEV;
+	}
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
+		dev_warn_ratelimited(hdev->dev,
+			"Device is disabled or in reset !!! Can't execute memory IOCTL\n");
+		return -EBUSY;
+	}
+
+	if (hdev->mmu_enable) {
+		switch (args->in.op) {
+		case HL_MEM_OP_ALLOC:
+			if (!hdev->dram_supports_virtual_memory) {
+				dev_err(hdev->dev,
+					"DRAM alloc is not supported\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			if (args->in.alloc.mem_size == 0) {
+				dev_err(hdev->dev,
+					"alloc size must be larger than 0\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			rc = alloc_device_memory(ctx, &args->in, &handle);
+
+			memset(args, 0, sizeof(*args));
+			args->out.handle = (__u64) handle;
+			break;
+
+		case HL_MEM_OP_FREE:
+			if (!hdev->dram_supports_virtual_memory) {
+				dev_err(hdev->dev,
+					"DRAM free is not supported\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			rc = free_device_memory(ctx, args->in.free.handle);
+			break;
+
+		case HL_MEM_OP_MAP:
+			rc = map_device_va(ctx, &args->in, &device_addr);
+
+			memset(args, 0, sizeof(*args));
+			args->out.device_virt_addr = device_addr;
+			break;
+
+		case HL_MEM_OP_UNMAP:
+			rc = unmap_device_va(ctx,
+					args->in.unmap.device_virt_addr);
+			break;
+
+		default:
+			dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
+			rc = -EINVAL;
+			break;
+		}
+	} else {
+		switch (args->in.op) {
+		case HL_MEM_OP_ALLOC:
+			if (args->in.alloc.mem_size == 0) {
+				dev_err(hdev->dev,
+					"alloc size must be larger than 0\n");
+				rc = -EINVAL;
+				goto out;
+			}
+
+			/* Force contiguous as there are no real MMU
+			 * translations to overcome physical memory gaps
+			 */
+			args->in.flags |= HL_MEM_CONTIGUOUS;
+			rc = alloc_device_memory(ctx, &args->in, &handle);
+
+			memset(args, 0, sizeof(*args));
+			args->out.handle = (__u64) handle;
+			break;
+
+		case HL_MEM_OP_FREE:
+			rc = free_device_memory(ctx, args->in.free.handle);
+			break;
+
+		case HL_MEM_OP_MAP:
+			if (args->in.flags & HL_MEM_USERPTR) {
+				device_addr = args->in.map_host.host_virt_addr;
+				rc = 0;
+			} else {
+				rc = get_paddr_from_handle(ctx, &args->in,
+						&device_addr);
+			}
+
+			memset(args, 0, sizeof(*args));
+			args->out.device_virt_addr = device_addr;
+			break;
+
+		case HL_MEM_OP_UNMAP:
+			rc = 0;
+			break;
+
+		default:
+			dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
+			rc = -EINVAL;
+			break;
+		}
+	}
+
+out:
+	return rc;
+}
+
 /**
  * hl_pin_host_memory - pins a chunk of host memory
  *
@@ -198,3 +1379,328 @@ bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr,
 	return false;
 }
 
+/**
+ * hl_va_range_init - initialize virtual addresses range
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_range            : pointer to the range to initialize
+ * @start               : range start address
+ * @end                 : range end address
+ *
+ * This function does the following:
+ * - Initializes the virtual addresses list of the given range with the given
+ *   addresses.
+ */
+static int hl_va_range_init(struct hl_device *hdev,
+		struct hl_va_range *va_range, u64 start, u64 end)
+{
+	int rc;
+
+	INIT_LIST_HEAD(&va_range->list);
+
+	/* PAGE_SIZE alignment */
+
+	if (start & (PAGE_SIZE - 1)) {
+		start &= PAGE_MASK;
+		start += PAGE_SIZE;
+	}
+
+	if (end & (PAGE_SIZE - 1))
+		end &= PAGE_MASK;
+
+	if (start >= end) {
+		dev_err(hdev->dev, "too small vm range for va list\n");
+		return -EFAULT;
+	}
+
+	rc = add_va_block(hdev, va_range, start, end);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to init va list\n");
+		return rc;
+	}
+
+	va_range->start_addr = start;
+	va_range->end_addr = end;
+
+	return 0;
+}
+
+/**
+ * hl_vm_ctx_init_with_ranges - initialize virtual memory for context
+ *
+ * @ctx                 : pointer to the habanalabs context structure
+ * @host_range_start    : host virtual addresses range start
+ * @host_range_end      : host virtual addresses range end
+ * @dram_range_start    : dram virtual addresses range start
+ * @dram_range_end      : dram virtual addresses range end
+ *
+ * This function initializes the following:
+ * - MMU for context
+ * - Virtual address to area descriptor hashtable
+ * - Virtual block list of available virtual memory
+ */
+int hl_vm_ctx_init_with_ranges(struct hl_ctx *ctx, u64 host_range_start,
+				u64 host_range_end, u64 dram_range_start,
+				u64 dram_range_end)
+{
+	struct hl_device *hdev = ctx->hdev;
+	int rc;
+
+	hl_mmu_ctx_init(ctx);
+
+	mutex_init(&ctx->mem_hash_lock);
+	hash_init(ctx->mem_hash);
+
+	mutex_init(&ctx->host_va_range.lock);
+
+	rc = hl_va_range_init(hdev, &ctx->host_va_range, host_range_start,
+			host_range_end);
+	if (rc) {
+		dev_err(hdev->dev, "failed to init host vm range\n");
+		goto host_vm_err;
+	}
+
+	mutex_init(&ctx->dram_va_range.lock);
+
+	rc = hl_va_range_init(hdev, &ctx->dram_va_range, dram_range_start,
+			dram_range_end);
+	if (rc) {
+		dev_err(hdev->dev, "failed to init dram vm range\n");
+		goto dram_vm_err;
+	}
+
+	return 0;
+
+dram_vm_err:
+	mutex_destroy(&ctx->dram_va_range.lock);
+
+	mutex_lock(&ctx->host_va_range.lock);
+	clear_va_list_locked(hdev, &ctx->host_va_range.list);
+	mutex_unlock(&ctx->host_va_range.lock);
+host_vm_err:
+	mutex_destroy(&ctx->host_va_range.lock);
+	mutex_destroy(&ctx->mem_hash_lock);
+	hl_mmu_ctx_fini(ctx);
+
+	return rc;
+}
+
+int hl_vm_ctx_init(struct hl_ctx *ctx)
+{
+	struct asic_fixed_properties *prop = &ctx->hdev->asic_prop;
+	u64 host_range_start, host_range_end, dram_range_start,
+		dram_range_end;
+
+	atomic64_set(&ctx->dram_phys_mem, 0);
+
+	/*
+	 * - If MMU is enabled, init the ranges as usual.
+	 * - If MMU is disabled, in case of host mapping, the returned address
+	 *   is the given one.
+	 *   In case of DRAM mapping, the returned address is the physical
+	 *   address of the memory related to the given handle.
+	 */
+	if (ctx->hdev->mmu_enable) {
+		dram_range_start = prop->va_space_dram_start_address;
+		dram_range_end = prop->va_space_dram_end_address;
+		host_range_start = prop->va_space_host_start_address;
+		host_range_end = prop->va_space_host_end_address;
+	} else {
+		dram_range_start = prop->dram_user_base_address;
+		dram_range_end = prop->dram_end_address;
+		host_range_start = prop->dram_user_base_address;
+		host_range_end = prop->dram_end_address;
+	}
+
+	return hl_vm_ctx_init_with_ranges(ctx, host_range_start, host_range_end,
+			dram_range_start, dram_range_end);
+}
+
+/**
+ * hl_va_range_fini     - clear a virtual addresses range
+ *
+ * @hdev                : pointer to the habanalabs structure
+ * @va_range            : pointer to virtual addresses range
+ *
+ * This function does the following:
+ * - Checks that the given range contains the whole initial range
+ * - Frees the virtual addresses block list and its lock
+ */
+static void hl_va_range_fini(struct hl_device *hdev,
+		struct hl_va_range *va_range)
+{
+	struct hl_vm_va_block *va_block;
+
+	if (list_empty(&va_range->list)) {
+		WARN(1, "va list should not be empty on cleanup!\n");
+		goto out;
+	}
+
+	if (!list_is_singular(&va_range->list)) {
+		WARN(1,
+		"va list should not contain multiple blocks on cleanup!\n");
+		goto free_va_list;
+	}
+
+	va_block = list_first_entry(&va_range->list, typeof(*va_block), node);
+
+	if (va_block->start != va_range->start_addr ||
+		va_block->end != va_range->end_addr) {
+		WARN(1, "wrong va block on cleanup, from 0x%llx to 0x%llx\n",
+			va_block->start, va_block->end);
+		goto free_va_list;
+	}
+
+free_va_list:
+	mutex_lock(&va_range->lock);
+	clear_va_list_locked(hdev, &va_range->list);
+	mutex_unlock(&va_range->lock);
+
+out:
+	mutex_destroy(&va_range->lock);
+}
+
+/**
+ * hl_vm_ctx_fini       - virtual memory teardown of context
+ *
+ * @ctx                 : pointer to the habanalabs context structure
+ *
+ * This function performs teardown of the following:
+ * - Virtual block list of available virtual memory
+ * - Virtual address to area descriptor hashtable
+ * - MMU for context
+ *
+ * In addition this function does the following:
+ * - Unmaps the existing hashtable nodes if the hashtable is not empty. The
+ *   hashtable should be empty as no valid mappings should exist at this
+ *   point.
+ * - Frees any existing physical page list from the idr which relates to the
+ *   current context asid.
+ * - This function checks the virtual block list for correctness. At this point
+ *   the list should contain one element which describes the whole virtual
+ *   memory range of the context. Otherwise, a warning is printed.
+ */
+void hl_vm_ctx_fini(struct hl_ctx *ctx)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_list *phys_pg_list;
+	struct hl_vm_hash_node *hnode;
+	struct hlist_node *tmp_node;
+	int i;
+
+	if (!hash_empty(ctx->mem_hash))
+		dev_notice(hdev->dev, "ctx is freed while it has va in use\n");
+
+	hash_for_each_safe(ctx->mem_hash, i, tmp_node, hnode, node) {
+		dev_dbg(hdev->dev,
+			"hl_mem_hash_node of vaddr 0x%llx of asid %d is still alive\n",
+			hnode->vaddr, ctx->asid);
+		unmap_device_va(ctx, hnode->vaddr);
+	}
+
+	spin_lock(&vm->idr_lock);
+	idr_for_each_entry(&vm->phys_pg_list_handles, phys_pg_list, i)
+		if (phys_pg_list->asid == ctx->asid) {
+			dev_dbg(hdev->dev,
+				"page list 0x%p of asid %d is still alive\n",
+				phys_pg_list, ctx->asid);
+			free_phys_pg_list(hdev, phys_pg_list);
+			idr_remove(&vm->phys_pg_list_handles, i);
+		}
+	spin_unlock(&vm->idr_lock);
+
+	hl_va_range_fini(hdev, &ctx->dram_va_range);
+	hl_va_range_fini(hdev, &ctx->host_va_range);
+
+	mutex_destroy(&ctx->mem_hash_lock);
+	hl_mmu_ctx_fini(ctx);
+}
+
+/**
+ * hl_vm_init           - initialize virtual memory module
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ *
+ * This function initializes the following:
+ * - MMU module
+ * - DRAM physical pages pool of 2MB
+ * - Idr for device memory allocation handles
+ */
+int hl_vm_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct hl_vm *vm = &hdev->vm;
+	int rc;
+
+	rc = hl_mmu_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to init MMU\n");
+		return rc;
+	}
+
+	vm->dram_pg_pool = gen_pool_create(__ffs(prop->dram_page_size), -1);
+	if (!vm->dram_pg_pool) {
+		dev_err(hdev->dev, "Failed to create dram page pool\n");
+		rc = -ENOMEM;
+		goto pool_create_err;
+	}
+
+	kref_init(&vm->dram_pg_pool_refcount);
+
+	rc = gen_pool_add(vm->dram_pg_pool, prop->dram_user_base_address,
+			prop->dram_end_address - prop->dram_user_base_address,
+			-1);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to add memory to dram page pool %d\n", rc);
+		goto pool_add_err;
+	}
+
+	spin_lock_init(&vm->idr_lock);
+	idr_init(&vm->phys_pg_list_handles);
+
+	atomic64_set(&hdev->dram_used_mem, 0);
+
+	vm->init_done = true;
+
+	return 0;
+
+pool_add_err:
+	gen_pool_destroy(vm->dram_pg_pool);
+pool_create_err:
+	hl_mmu_fini(hdev);
+
+	return rc;
+}
+
+/**
+ * hl_vm_fini           - virtual memory module teardown
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ *
+ * This function performs teardown of the following:
+ * - Idr for device memory allocation handles
+ * - DRAM physical pages pool of 2MB
+ * - MMU module
+ */
+void hl_vm_fini(struct hl_device *hdev)
+{
+	struct hl_vm *vm = &hdev->vm;
+
+	if (!vm->init_done)
+		return;
+
+	/*
+	 * At this point all the contexts should be freed and hence no DRAM
+	 * memory should be in use. Hence the DRAM pool should be freed here.
+	 */
+	WARN(kref_put(&vm->dram_pg_pool_refcount, dram_pg_pool_do_release) != 1,
+			"dram_pg_pool was not destroyed on %s\n", __func__);
+
+	hl_mmu_fini(hdev);
+
+	vm->init_done = false;
+}
diff --git a/drivers/misc/habanalabs/mmu.c b/drivers/misc/habanalabs/mmu.c
new file mode 100644
index 000000000000..083842d3c1a4
--- /dev/null
+++ b/drivers/misc/habanalabs/mmu.c
@@ -0,0 +1,604 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+#include "include/hw_ip/mmu/mmu_general.h"
+
+#include <linux/genalloc.h>
+
+#define HOP_NOT_FREED	0
+#define HOP_FREED	1
+
+static struct pgt_info *get_pgt_info(struct hl_ctx *ctx, u64 addr)
+{
+	struct pgt_info *pgt_info = NULL;
+
+	hash_for_each_possible(ctx->mmu_hash, pgt_info, node,
+				(unsigned long) addr)
+		if (addr == pgt_info->addr)
+			break;
+
+	return pgt_info;
+}
+
+static void free_hop(struct hl_ctx *ctx, u64 hop_addr)
+{
+	struct pgt_info *pgt_info = get_pgt_info(ctx, hop_addr);
+
+	gen_pool_free(pgt_info->ctx->hdev->mmu_pgt_pool, pgt_info->addr,
+			ctx->hdev->asic_prop.mmu_hop_table_size);
+	hash_del(&pgt_info->node);
+
+	kfree(pgt_info);
+}
+
+static u64 alloc_hop(struct hl_ctx *ctx)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct pgt_info *pgt_info;
+	u64 addr;
+
+	pgt_info = kmalloc(sizeof(*pgt_info), GFP_KERNEL);
+	if (!pgt_info)
+		return ULLONG_MAX;
+
+	addr = (u64) gen_pool_alloc(hdev->mmu_pgt_pool,
+			hdev->asic_prop.mmu_hop_table_size);
+	if (!addr) {
+		dev_err(hdev->dev, "failed to allocate page\n");
+		kfree(pgt_info);
+		return ULLONG_MAX;
+	}
+
+	pgt_info->addr = addr;
+	pgt_info->ctx = ctx;
+	pgt_info->num_of_ptes = 0;
+	hash_add(ctx->mmu_hash, &pgt_info->node, addr);
+
+	return addr;
+}
+
+static inline void clear_pte(struct hl_device *hdev, u64 pte_addr)
+{
+	/* clear the last and present bits */
+	hdev->asic_funcs->write_pte(hdev, pte_addr, 0);
+}
+
+static inline void inc_num_of_ptes(struct hl_ctx *ctx, u64 hop_addr)
+{
+	get_pgt_info(ctx, hop_addr)->num_of_ptes++;
+}
+
+/**
+ * dec_num_of_ptes - decrement the num of ptes and free the hop if possible
+ *
+ * @ctx: pointer to the context structure
+ * @hop_addr: addr of the hop
+ *
+ * This function returns HOP_FREED if the hop was freed or HOP_NOT_FREED if the
+ * num of ptes was decremented without freeing the hop.
+ */
+static inline int dec_num_of_ptes(struct hl_ctx *ctx, u64 hop_addr)
+{
+	struct pgt_info *pgt_info = get_pgt_info(ctx, hop_addr);
+
+	pgt_info->num_of_ptes--;
+
+	if (pgt_info->num_of_ptes == 0) {
+		free_hop(ctx, hop_addr);
+		return HOP_FREED;
+	}
+
+	return HOP_NOT_FREED;
+}
+
+static inline u64 get_hop0_addr(struct hl_ctx *ctx)
+{
+	return ctx->hdev->asic_prop.mmu_pgt_addr +
+			(ctx->asid * ctx->hdev->asic_prop.mmu_hop_table_size);
+}
+
+static inline u64 get_hop0_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+					u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP0_MASK) >> HOP0_SHIFT);
+}
+
+static inline u64 get_hop1_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+					u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP1_MASK) >> HOP1_SHIFT);
+}
+
+static inline u64 get_hop2_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+					u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP2_MASK) >> HOP2_SHIFT);
+}
+
+static inline u64 get_hop3_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+					u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP3_MASK) >> HOP3_SHIFT);
+}
+
+static inline u64 get_hop4_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+					u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP4_MASK) >> HOP4_SHIFT);
+}
+
+static inline u64 get_next_hop_addr(u64 curr_pte)
+{
+	if (curr_pte & PAGE_PRESENT_MASK)
+		return curr_pte & PHYS_ADDR_MASK;
+	else
+		return ULLONG_MAX;
+}
+
+/**
+ * hl_mmu_init - init the mmu module
+ *
+ * @hdev: pointer to the habanalabs device structure
+ *
+ * This function does the following:
+ * - Allocate max_asid zeroed hop0 pgts so no mapping is available
+ * - Enable mmu in hw
+ * - Invalidate the mmu cache
+ * - Create a pool of pages for pgts
+ * - Returns 0 on success
+ *
+ * This function depends on DMA QMAN to be working!
+ */
+int hl_mmu_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int rc;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	/* MMU HW init was already done in device hw_init() */
+
+	mutex_init(&hdev->mmu_cache_lock);
+
+	hdev->mmu_pgt_pool =
+			gen_pool_create(__ffs(prop->mmu_hop_table_size), -1);
+
+	if (!hdev->mmu_pgt_pool) {
+		dev_err(hdev->dev, "Failed to create page gen pool\n");
+		rc = -ENOMEM;
+		goto err_pool_create;
+	}
+
+	rc = gen_pool_add(hdev->mmu_pgt_pool, prop->mmu_pgt_addr +
+			prop->mmu_hop0_tables_total_size,
+			prop->mmu_pgt_size - prop->mmu_hop0_tables_total_size,
+			-1);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to add memory to page gen pool\n");
+		goto err_pool_add;
+	}
+
+	return 0;
+
+err_pool_add:
+	gen_pool_destroy(hdev->mmu_pgt_pool);
+err_pool_create:
+	mutex_destroy(&hdev->mmu_cache_lock);
+
+	return rc;
+}
+
+/**
+ * hl_mmu_fini - release the mmu module.
+ *
+ * @hdev: pointer to the habanalabs device structure
+ *
+ * This function does the following:
+ * - Disable mmu in hw
+ * - free the pgts pool
+ *
+ * All ctxs should be freed before calling this func
+ */
+void hl_mmu_fini(struct hl_device *hdev)
+{
+	if (!hdev->mmu_enable)
+		return;
+
+	gen_pool_destroy(hdev->mmu_pgt_pool);
+
+	mutex_destroy(&hdev->mmu_cache_lock);
+
+	/* MMU HW fini will be done in device hw_fini() */
+}
+
+/**
+ * hl_mmu_ctx_init - init a ctx for using the mmu module
+ *
+ * @ctx: pointer to the context structure
+ *
+ * This function does the following:
+ * - Init a mutex to protect the concurrent mapping flow
+ * - Init a hash to hold all pgts related to this ctx
+ */
+void hl_mmu_ctx_init(struct hl_ctx *ctx)
+{
+	if (!ctx->hdev->mmu_enable)
+		return;
+
+	mutex_init(&ctx->mmu_lock);
+	hash_init(ctx->mmu_hash);
+}
+
+/**
+ * hl_mmu_ctx_fini - disable a ctx from using the mmu module
+ *
+ * @ctx: pointer to the context structure
+ *
+ * This function does the following:
+ * - Free any pgts which were not freed yet
+ * - Free the mutex
+ */
+void hl_mmu_ctx_fini(struct hl_ctx *ctx)
+{
+	struct pgt_info *pgt_info;
+	struct hlist_node *tmp;
+	int i;
+
+	if (!ctx->hdev->mmu_enable)
+		return;
+
+	if (!hash_empty(ctx->mmu_hash))
+		dev_err(ctx->hdev->dev,
+				"ctx is freed while it has pgts in use\n");
+
+	hash_for_each_safe(ctx->mmu_hash, i, tmp, pgt_info, node) {
+		dev_err(ctx->hdev->dev,
+			"pgt_info of addr 0x%llx of asid %d was not destroyed, num_ptes: %d\n",
+			pgt_info->addr, ctx->asid, pgt_info->num_of_ptes);
+		free_hop(ctx, pgt_info->addr);
+	}
+
+	mutex_destroy(&ctx->mmu_lock);
+}
+
+/**
+ * hl_mmu_map - maps a virtual addr to physical addr
+ *
+ * @ctx: pointer to the context structure
+ * @virt_addr: virt addr to map from
+ * @phys_addr: phys addr to map to
+ * @page_size: physical page size
+ *
+ * This function does the following:
+ * - Check that the virt addr is not mapped
+ * - Allocate pgts as necessary in order to map the virt addr to the phys
+ * - Returns 0 on success, -EINVAL if addr is already mapped, or -ENOMEM.
+ *
+ * Because this function changes the page tables in the device and because it
+ * changes the MMU hash, it must be protected by a lock.
+ * However, because it maps only a single page, the lock should be implemented
+ * at a higher level in order to protect the entire mapping of the memory area
+ */
+int hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr, u32 page_size)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 hop0_addr = 0, hop0_pte_addr = 0,
+		hop1_addr = 0, hop1_pte_addr = 0,
+		hop2_addr = 0, hop2_pte_addr = 0,
+		hop3_addr = 0, hop3_pte_addr = 0,
+		hop4_addr = 0, hop4_pte_addr = 0,
+		curr_pte = 0;
+	int hop1_new = 0, hop2_new = 0, hop3_new = 0, hop4_new = 0,
+			rc = -ENOMEM;
+	bool is_huge;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	/*
+	 * This mapping function can map a 4KB/2MB page. For 2MB page there are
+	 * only 3 hops rather than 4. Currently the DRAM allocation uses 2MB
+	 * pages only but user memory could have been allocated with one of the
+	 * two page sizes. Since this is common code for all three cases,
+	 * we need this huge page check.
+	 */
+	is_huge = page_size == PAGE_SIZE_2MB;
+
+	hop0_addr = get_hop0_addr(ctx);
+
+	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
+
+	hop1_addr = get_next_hop_addr(curr_pte);
+
+	if (hop1_addr == ULLONG_MAX) {
+		hop1_addr = alloc_hop(ctx);
+		if (hop1_addr == ULLONG_MAX)
+			goto err;
+		else
+			hop1_new = 1;
+	}
+
+	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
+
+	hop2_addr = get_next_hop_addr(curr_pte);
+
+	if (hop2_addr == ULLONG_MAX) {
+		hop2_addr = alloc_hop(ctx);
+		if (hop2_addr == ULLONG_MAX)
+			goto err;
+		else
+			hop2_new = 1;
+	}
+
+	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
+
+	hop3_addr = get_next_hop_addr(curr_pte);
+
+	if (hop3_addr == ULLONG_MAX) {
+		hop3_addr = alloc_hop(ctx);
+		if (hop3_addr == ULLONG_MAX)
+			goto err;
+		else
+			hop3_new = 1;
+	}
+
+	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
+
+	if (!is_huge) {
+		hop4_addr = get_next_hop_addr(curr_pte);
+
+		if (hop4_addr == ULLONG_MAX) {
+			hop4_addr = alloc_hop(ctx);
+			if (hop4_addr == ULLONG_MAX)
+				goto err;
+			else
+				hop4_new = 1;
+		}
+
+		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
+
+		curr_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
+	}
+
+	if (curr_pte & PAGE_PRESENT_MASK) {
+		dev_err(hdev->dev,
+				"mapping already exists for virt_addr 0x%llx\n",
+					virt_addr);
+
+		dev_dbg(hdev->dev, "hop0 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop0_pte_addr),
+				hop0_pte_addr);
+		dev_dbg(hdev->dev, "hop1 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop1_pte_addr),
+				hop1_pte_addr);
+		dev_dbg(hdev->dev, "hop2 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop2_pte_addr),
+				hop2_pte_addr);
+		dev_dbg(hdev->dev, "hop3 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop3_pte_addr),
+				hop3_pte_addr);
+
+		if (!is_huge)
+			dev_dbg(hdev->dev, "hop4 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev,
+							hop4_pte_addr),
+							hop4_pte_addr);
+
+		rc = -EINVAL;
+		goto err;
+	}
+
+	curr_pte = (phys_addr & PTE_PHYS_ADDR_MASK) | LAST_MASK
+			| PAGE_PRESENT_MASK;
+
+	hdev->asic_funcs->write_pte(hdev,
+				is_huge ? hop3_pte_addr : hop4_pte_addr,
+				curr_pte);
+
+	if (hop1_new) {
+		curr_pte = (hop1_addr & PTE_PHYS_ADDR_MASK) |
+				PAGE_PRESENT_MASK;
+		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop0_pte_addr,
+				curr_pte);
+	}
+	if (hop2_new) {
+		curr_pte = (hop2_addr & PTE_PHYS_ADDR_MASK) |
+				PAGE_PRESENT_MASK;
+		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop1_pte_addr,
+				curr_pte);
+		inc_num_of_ptes(ctx, hop1_addr);
+	}
+	if (hop3_new) {
+		curr_pte = (hop3_addr & PTE_PHYS_ADDR_MASK) |
+				PAGE_PRESENT_MASK;
+		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop2_pte_addr,
+				curr_pte);
+		inc_num_of_ptes(ctx, hop2_addr);
+	}
+
+	if (!is_huge) {
+		if (hop4_new) {
+			curr_pte = (hop4_addr & PTE_PHYS_ADDR_MASK) |
+					PAGE_PRESENT_MASK;
+			ctx->hdev->asic_funcs->write_pte(ctx->hdev,
+					hop3_pte_addr, curr_pte);
+			inc_num_of_ptes(ctx, hop3_addr);
+		}
+
+		inc_num_of_ptes(ctx, hop4_addr);
+	} else
+		inc_num_of_ptes(ctx, hop3_addr);
+
+	/* flush all writes from all cores to reach PCI */
+	mb();
+
+	hdev->asic_funcs->read_pte(hdev,
+				is_huge ? hop3_pte_addr : hop4_pte_addr);
+
+	return 0;
+
+err:
+	if (hop4_new)
+		free_hop(ctx, hop4_addr);
+	if (hop3_new)
+		free_hop(ctx, hop3_addr);
+	if (hop2_new)
+		free_hop(ctx, hop2_addr);
+	if (hop1_new)
+		free_hop(ctx, hop1_addr);
+
+	return rc;
+}
+
+/**
+ * hl_mmu_unmap - unmaps a virtual addr
+ *
+ * @ctx: pointer to the context structure
+ * @virt_addr: virt addr to unmap
+ *
+ * This function does the following:
+ * - Check that the virt addr is mapped
+ * - Unmap the virt addr and free pgts if possible
+ * - Returns 0 on success, -EINVAL if the given addr is not mapped
+ *
+ * Because this function changes the page tables in the device and because it
+ * changes the MMU hash, it must be protected by a lock.
+ * However, because it unmaps only a single page, the lock should be implemented
+ * at a higher level in order to protect the entire unmapping of the memory area
+ */
+int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 hop0_addr = 0, hop0_pte_addr = 0,
+		hop1_addr = 0, hop1_pte_addr = 0,
+		hop2_addr = 0, hop2_pte_addr = 0,
+		hop3_addr = 0, hop3_pte_addr = 0,
+		hop4_addr = 0, hop4_pte_addr = 0,
+		curr_pte;
+	int clear_hop3 = 1;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	hop0_addr = get_hop0_addr(ctx);
+
+	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
+
+	hop1_addr = get_next_hop_addr(curr_pte);
+
+	if (hop1_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
+
+	hop2_addr = get_next_hop_addr(curr_pte);
+
+	if (hop2_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
+
+	hop3_addr = get_next_hop_addr(curr_pte);
+
+	if (hop3_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
+
+	if (!(curr_pte & LAST_MASK)) {
+		hop4_addr = get_next_hop_addr(curr_pte);
+
+		if (hop4_addr == ULLONG_MAX)
+			goto not_mapped;
+
+		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
+
+		curr_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
+
+		clear_hop3 = 0;
+	}
+
+	if (!(curr_pte & PAGE_PRESENT_MASK))
+		goto not_mapped;
+
+	clear_pte(hdev, hop4_addr ? hop4_pte_addr : hop3_pte_addr);
+
+	if (hop4_addr && dec_num_of_ptes(ctx, hop4_addr) == HOP_FREED)
+		clear_hop3 = 1;
+
+	if (clear_hop3) {
+		clear_pte(hdev, hop3_pte_addr);
+		if (dec_num_of_ptes(ctx, hop3_addr) == HOP_FREED) {
+			clear_pte(hdev, hop2_pte_addr);
+			if (dec_num_of_ptes(ctx, hop2_addr) == HOP_FREED) {
+				clear_pte(hdev, hop1_pte_addr);
+				if (dec_num_of_ptes(ctx, hop1_addr) ==
+						HOP_FREED)
+					clear_pte(hdev, hop0_pte_addr);
+			}
+		}
+	}
+
+	/* flush all writes from all cores to reach PCI */
+	mb();
+
+	hdev->asic_funcs->read_pte(hdev,
+				hop4_addr ? hop4_pte_addr : hop3_pte_addr);
+
+	return 0;
+
+not_mapped:
+	dev_err(hdev->dev, "virt addr 0x%llx is not mapped to phys addr\n",
+		virt_addr);
+
+	return -EINVAL;
+}
+
+/**
+ * hl_mmu_swap_out - marks all mapping of the given ctx as swapped out
+ *
+ * @ctx: pointer to the context structure
+ *
+ */
+void hl_mmu_swap_out(struct hl_ctx *ctx)
+{
+
+}
+
+/**
+ * hl_mmu_swap_in - marks all mapping of the given ctx as swapped in
+ *
+ * @ctx: pointer to the context structure
+ *
+ */
+void hl_mmu_swap_in(struct hl_ctx *ctx)
+{
+
+}
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index 369438dbc9c3..e5bfd7586b79 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -129,6 +129,108 @@ union hl_wait_cs_args {
 	struct hl_wait_cs_out out;
 };
 
+/* Opcode to alloc device memory */
+#define HL_MEM_OP_ALLOC			0
+/* Opcode to free previously allocated device memory */
+#define HL_MEM_OP_FREE			1
+/* Opcode to map host memory */
+#define HL_MEM_OP_MAP			2
+/* Opcode to unmap previously mapped host memory */
+#define HL_MEM_OP_UNMAP			3
+
+/* Memory flags */
+#define HL_MEM_CONTIGUOUS	0x1
+#define HL_MEM_SHARED		0x2
+#define HL_MEM_USERPTR		0x4
+
+struct hl_mem_in {
+	union {
+		/* HL_MEM_OP_ALLOC - allocate device memory */
+		struct {
+			/* Size to alloc */
+			__u32 mem_size;
+			__u32 pad;
+		} alloc;
+
+		/* HL_MEM_OP_FREE - free device memory */
+		struct {
+			/* Handle returned from HL_MEM_OP_ALLOC */
+			__u64 handle;
+		} free;
+
+		/* HL_MEM_OP_MAP - map device memory */
+		struct {
+			/*
+			 * Requested virtual address of mapped memory.
+			 * KMD will try to map the requested region to this
+			 * hint address, as long as the address is valid and
+			 * not already mapped. The user should check the
+			 * returned address of the IOCTL to make sure he got
+			 * the hint address. Passing 0 here means that KMD
+			 * will choose the address itself.
+			 */
+			__u64 hint_addr;
+			/* Handle returned from HL_MEM_OP_ALLOC */
+			__u64 handle;
+		} map_device;
+
+		/* HL_MEM_OP_MAP - map host memory */
+		struct {
+			/* Address of allocated host memory */
+			__u64 host_virt_addr;
+			/*
+			 * Requested virtual address of mapped memory.
+			 * KMD will try to map the requested region to this
+			 * hint address, as long as the address is valid and
+			 * not already mapped. The user should check the
+			 * returned address of the IOCTL to make sure he got
+			 * the hint address. Passing 0 here means that KMD
+			 * will choose the address itself.
+			 */
+			__u64 hint_addr;
+			/* Size of allocated host memory */
+			__u32 mem_size;
+			__u32 pad;
+		} map_host;
+
+		/* HL_MEM_OP_UNMAP - unmap host memory */
+		struct {
+			/* Virtual address returned from HL_MEM_OP_MAP */
+			__u64 device_virt_addr;
+		} unmap;
+	};
+
+	/* HL_MEM_OP_* */
+	__u32 op;
+	/* HL_MEM_* flags */
+	__u32 flags;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
+struct hl_mem_out {
+	union {
+		/*
+		 * Used for HL_MEM_OP_MAP as the virtual address that was
+		 * assigned in the device VA space.
+		 * A value of 0 means the requested operation failed.
+		 */
+		__u64 device_virt_addr;
+
+		/*
+		 * Used for HL_MEM_OP_ALLOC. This is the assigned
+		 * handle for the allocated memory
+		 */
+		__u64 handle;
+	};
+};
+
+union hl_mem_args {
+	struct hl_mem_in in;
+	struct hl_mem_out out;
+};
+
 /*
  * Command Buffer
  * - Request a Command Buffer
@@ -212,7 +314,25 @@ union hl_wait_cs_args {
 #define HL_IOCTL_WAIT_CS			\
 		_IOWR('H', 0x04, union hl_wait_cs_args)
 
+/*
+ * Memory
+ * - Map host memory to device MMU
+ * - Unmap host memory from device MMU
+ *
+ * This IOCTL allows the user to map host memory to the device MMU
+ *
+ * For host memory, the IOCTL doesn't allocate memory. The user is supposed
+ * to allocate the memory in user-space (malloc/new). The driver pins the
+ * physical pages (up to the limit allowed by the OS), assigns a virtual
+ * address in the device VA space and initializes the device MMU.
+ *
+ * There is an option for the user to specify the requested virtual address.
+ *
+ */
+#define HL_IOCTL_MEMORY		\
+		_IOWR('H', 0x05, union hl_mem_args)
+
 #define HL_COMMAND_START	0x02
-#define HL_COMMAND_END		0x05
+#define HL_COMMAND_END		0x06
 
 #endif /* HABANALABS_H_ */
-- 
2.17.1



* [PATCH 13/15] habanalabs: implement INFO IOCTL
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (10 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23  0:00 ` [PATCH 14/15] habanalabs: add debugfs support Oded Gabbay
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch implements the INFO IOCTL, which lets the user query the
information needed to submit deep learning jobs to Goya.

The information is divided into several categories: H/W IP properties,
H/W event statistics, DRAM usage and device idle status.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
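The return_size handling used by the handlers in this patch can be sketched
in plain userspace C. This is only an illustration under stated assumptions:
info_copy_clamped() is a hypothetical stand-in for the driver's
copy_to_user(out, &s, min(return_size, sizeof(s))) pattern, and the struct
is copied from the uapi header added by this patch. It is not driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copied from the uapi header in this patch (__u64 spelled as uint64_t). */
struct hl_info_dram_usage {
	uint64_t dram_free_mem;
	uint64_t ctx_dram_mem;
};

/*
 * Hypothetical helper, not part of the driver: copy at most return_size
 * bytes of the kernel-side struct into the caller's buffer, so the copy
 * never writes past what the caller said it can hold.
 * Returns the number of bytes actually copied.
 */
static size_t info_copy_clamped(void *out, size_t return_size,
				const void *src, size_t src_size)
{
	size_t n = return_size < src_size ? return_size : src_size;

	memcpy(out, src, n);
	return n;
}
```

With this clamping, a caller that passes a return_size smaller than the
kernel's struct receives only the leading fields, and a caller that passes a
larger return_size receives sizeof(struct) bytes and nothing more.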
 drivers/misc/habanalabs/goya/goya.c        |   6 +
 drivers/misc/habanalabs/habanalabs.h       |   2 +
 drivers/misc/habanalabs/habanalabs_ioctl.c | 132 +++++++++++++++++++++
 include/uapi/misc/habanalabs.h             |  76 +++++++++++-
 4 files changed, 215 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 94ee4cb00a49..c21c6046f09b 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -6120,6 +6120,11 @@ static void goya_hw_queues_unlock(struct hl_device *hdev)
 	spin_unlock(&goya->hw_queues_lock);
 }
 
+static u32 goya_get_pci_id(struct hl_device *hdev)
+{
+	return hdev->pdev->device;
+}
+
 int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -6217,6 +6222,7 @@ static const struct hl_asic_funcs goya_funcs = {
 	.soft_reset_late_init = goya_soft_reset_late_init,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
+	.get_pci_id = goya_get_pci_id,
 	.get_eeprom_data = goya_get_eeprom_data,
 	.send_cpu_message = goya_send_cpu_message
 };
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 1abc139d4293..6c0fe76936be 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -462,6 +462,7 @@ enum hl_pll_frequency {
  * @soft_reset_late_init: perform certain actions needed after soft reset.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
+ * @get_pci_id: retrieve PCI ID.
  * @get_eeprom_data: retrieve EEPROM data from F/W.
  * @send_cpu_message: send buffer to ArmCP.
  */
@@ -530,6 +531,7 @@ struct hl_asic_funcs {
 	int (*soft_reset_late_init)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
+	u32 (*get_pci_id)(struct hl_device *hdev);
 	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
 				size_t max_size);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index 6dcad810b821..067cf640ad50 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -12,10 +12,142 @@
 #include <linux/uaccess.h>
 #include <linux/cred.h>
 
+static int hw_ip_info(struct hl_device *hdev, struct hl_info_args *args)
+{
+	struct hl_info_hw_ip_info hw_ip = {0};
+	u32 size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u64 sram_kmd_size, dram_kmd_size;
+
+	if ((!size) || (!out))
+		return -EINVAL;
+
+	sram_kmd_size = (prop->sram_user_base_address -
+				prop->sram_base_address);
+	dram_kmd_size = (prop->dram_user_base_address -
+				prop->dram_base_address);
+
+	hw_ip.device_id = hdev->asic_funcs->get_pci_id(hdev);
+	hw_ip.sram_base_address = prop->sram_user_base_address;
+	hw_ip.dram_base_address = prop->dram_user_base_address;
+	hw_ip.tpc_enabled_mask = prop->tpc_enabled_mask;
+	hw_ip.sram_size = prop->sram_size - sram_kmd_size;
+	hw_ip.dram_size = prop->dram_size - dram_kmd_size;
+	if (hw_ip.dram_size > 0)
+		hw_ip.dram_enabled = 1;
+	hw_ip.num_of_events = prop->num_of_events;
+	memcpy(hw_ip.armcp_version,
+		prop->armcp_info.armcp_version, VERSION_MAX_LEN);
+	hw_ip.armcp_cpld_version = prop->armcp_info.cpld_version;
+	hw_ip.psoc_pci_pll_nr = prop->psoc_pci_pll_nr;
+	hw_ip.psoc_pci_pll_nf = prop->psoc_pci_pll_nf;
+	hw_ip.psoc_pci_pll_od = prop->psoc_pci_pll_od;
+	hw_ip.psoc_pci_pll_div_factor = prop->psoc_pci_pll_div_factor;
+
+	return copy_to_user(out, &hw_ip,
+		min((size_t)size, sizeof(hw_ip))) ? -EFAULT : 0;
+}
+
+static int hw_events_info(struct hl_device *hdev, struct hl_info_args *args)
+{
+	u32 size, max_size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	void *arr;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	arr = hdev->asic_funcs->get_events_stat(hdev, &size);
+
+	return copy_to_user(out, arr, min(max_size, size)) ? -EFAULT : 0;
+}
+
+static int dram_usage_info(struct hl_device *hdev, struct hl_info_args *args)
+{
+	struct hl_info_dram_usage dram_usage = {0};
+	u32 max_size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u64 dram_kmd_size;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	dram_kmd_size = (prop->dram_user_base_address -
+				prop->dram_base_address);
+	dram_usage.dram_free_mem = (prop->dram_size - dram_kmd_size) -
+					atomic64_read(&hdev->dram_used_mem);
+	dram_usage.ctx_dram_mem = atomic64_read(&hdev->user_ctx->dram_phys_mem);
+
+	return copy_to_user(out, &dram_usage,
+		min((size_t) max_size, sizeof(dram_usage))) ? -EFAULT : 0;
+}
+
+static int hw_idle(struct hl_device *hdev, struct hl_info_args *args)
+{
+	struct hl_info_hw_idle hw_idle = {0};
+	u32 max_size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	hw_idle.is_idle = hdev->asic_funcs->is_device_idle(hdev);
+
+	return copy_to_user(out, &hw_idle,
+		min((size_t) max_size, sizeof(hw_idle))) ? -EFAULT : 0;
+}
+
+static int hl_info_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	struct hl_info_args *args = data;
+	struct hl_device *hdev = hpriv->hdev;
+	int rc;
+
+	if (hdev->hard_reset_pending) {
+		dev_crit(hdev->dev,
+			"Device HARD reset pending !!! Please close FD\n");
+		return -ENODEV;
+	}
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
+		dev_err(hdev->dev,
+			"Device is disabled or in reset !!! Can't execute INFO IOCTL\n");
+		return -EBUSY;
+	}
+
+	switch (args->op) {
+	case HL_INFO_HW_IP_INFO:
+		rc = hw_ip_info(hdev, args);
+		break;
+
+	case HL_INFO_HW_EVENTS:
+		rc = hw_events_info(hdev, args);
+		break;
+
+	case HL_INFO_DRAM_USAGE:
+		rc = dram_usage_info(hdev, args);
+		break;
+
+	case HL_INFO_HW_IDLE:
+		rc = hw_idle(hdev, args);
+		break;
+
+	default:
+		dev_err(hdev->dev, "Invalid request %u\n", args->op);
+		rc = -EINVAL;
+		break;
+	}
+
+	return rc;
+}
+
 #define HL_IOCTL_DEF(ioctl, _func) \
 	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
 
 static const struct hl_ioctl_desc hl_ioctls[] = {
+	HL_IOCTL_DEF(HL_IOCTL_INFO, hl_info_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl),
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index e5bfd7586b79..2f202043e1e0 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -13,6 +13,63 @@
 #include <linux/types.h>
 #include <linux/ioctl.h>
 
+/* Opcode for management ioctl */
+#define HL_INFO_HW_IP_INFO	0
+#define HL_INFO_HW_EVENTS	1
+#define HL_INFO_DRAM_USAGE	2
+#define HL_INFO_HW_IDLE		3
+
+#define HL_INFO_VERSION_MAX_LEN	128
+
+struct hl_info_hw_ip_info {
+	__u64 sram_base_address;
+	__u64 dram_base_address;
+	__u64 dram_size;
+	__u32 sram_size;
+	__u32 num_of_events;
+	__u32 device_id; /* PCI Device ID */
+	__u32 reserved[3];
+	__u32 armcp_cpld_version;
+	__u32 psoc_pci_pll_nr;
+	__u32 psoc_pci_pll_nf;
+	__u32 psoc_pci_pll_od;
+	__u32 psoc_pci_pll_div_factor;
+	__u8 tpc_enabled_mask;
+	__u8 dram_enabled;
+	__u8 pad[2];
+	__u8 armcp_version[HL_INFO_VERSION_MAX_LEN];
+};
+
+struct hl_info_dram_usage {
+	__u64 dram_free_mem;
+	__u64 ctx_dram_mem;
+};
+
+struct hl_info_hw_idle {
+	__u32 is_idle;
+	__u32 pad;
+};
+
+struct hl_info_args {
+	/* Location of relevant struct in userspace */
+	__u64 return_pointer;
+	/*
+	 * The size of the return value. Just like "size" in "snprintf",
+	 * it limits how many bytes the kernel can write
+	 *
+	 * For hw_events array, the size should be
+	 * hl_info_hw_ip_info.num_of_events * sizeof(__u32)
+	 */
+	__u32 return_size;
+
+	/* HL_INFO_* */
+	__u32 op;
+
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
 /* Opcode to create a new command buffer */
 #define HL_CB_OP_CREATE		0
 /* Opcode to destroy previously created command buffer */
@@ -231,6 +288,23 @@ union hl_mem_args {
 	struct hl_mem_out out;
 };
 
+/*
+ * Various information operations such as:
+ * - H/W IP information
+ * - Current dram usage
+ *
+ * The user calls this IOCTL with an opcode that describes the required
+ * information. The user should supply a pointer to a user-allocated memory
+ * chunk, which will be filled by the driver with the requested information.
+ *
+ * The user supplies the maximum number of bytes to copy into the user's
+ * memory, in order to prevent data corruption in case of differences between
+ * the definitions of structures in kernel and userspace, e.g. in case of old
+ * userspace and new kernel driver
+ */
+#define HL_IOCTL_INFO	\
+		_IOWR('H', 0x01, struct hl_info_args)
+
 /*
  * Command Buffer
  * - Request a Command Buffer
@@ -332,7 +406,7 @@ union hl_mem_args {
 #define HL_IOCTL_MEMORY		\
 		_IOWR('H', 0x05, union hl_mem_args)
 
-#define HL_COMMAND_START	0x02
+#define HL_COMMAND_START	0x01
 #define HL_COMMAND_END		0x06
 
 #endif /* HABANALABS_H_ */
-- 
2.17.1



* [PATCH 14/15] habanalabs: add debugfs support
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (11 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 13/15] habanalabs: implement INFO IOCTL Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23  0:00 ` [PATCH 15/15] Update MAINTAINERS and CREDITS with habanalabs info Oded Gabbay
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

This patch adds debugfs support to the driver. It allows user-space to
display information held in the driver's internal structures, such as:
- active command submissions
- active user virtual memory mappings
- number of allocated command buffers

It also enables the user to perform reads and writes through Goya's PCI
bars.
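
As a rough user-space illustration of that read path (a sketch only, not
part of this patch): write the device address to "addr", then read the value
back from "data32". The helper names hl_dbg_write() and hl_dbg_read32() are
hypothetical, and the debugfs directory is passed in by the caller rather
than assumed.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the string val to the file <dir>/<name>. */
static int hl_dbg_write(const char *dir, const char *name, const char *val)
{
	char path[256];
	FILE *f;
	int rc;

	snprintf(path, sizeof(path), "%s/%s", dir, name);
	f = fopen(path, "w");
	if (!f)
		return -1;

	rc = (fputs(val, f) < 0) ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;

	return rc;
}

/*
 * Hypothetical helper: read one 32-bit word from device address addr.
 * Step 1 writes the address, as a "0x"-prefixed hex string, to "addr".
 * Step 2 reads "data32", which the driver formats as "0x%08x\n".
 */
static int hl_dbg_read32(const char *dir, uint64_t addr, uint32_t *out)
{
	char buf[64], path[256];
	unsigned int val;
	FILE *f;

	snprintf(buf, sizeof(buf), "0x%" PRIx64, addr);
	if (hl_dbg_write(dir, "addr", buf))
		return -1;

	snprintf(path, sizeof(path), "%s/data32", dir);
	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* %x accepts an optional "0x" prefix */
	if (sscanf(buf, "%x", &val) != 1)
		return -1;

	*out = val;
	return 0;
}
```

On a real system the directory would be something like
/sys/kernel/debug/habanalabs/hl0, and the caller must be root. A write would
follow the same pattern, with the value written to "data32" as a hex string.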

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 .../ABI/testing/debugfs-driver-habanalabs     |  127 ++
 drivers/misc/habanalabs/Makefile              |    2 +
 drivers/misc/habanalabs/command_buffer.c      |    4 +
 drivers/misc/habanalabs/command_submission.c  |   12 +
 drivers/misc/habanalabs/debugfs.c             | 1069 +++++++++++++++++
 drivers/misc/habanalabs/device.c              |    6 +
 drivers/misc/habanalabs/goya/goya.c           |  108 ++
 drivers/misc/habanalabs/goya/goyaP.h          |    5 +
 drivers/misc/habanalabs/habanalabs.h          |  191 +++
 drivers/misc/habanalabs/habanalabs_drv.c      |   16 +-
 drivers/misc/habanalabs/memory.c              |    8 +
 11 files changed, 1546 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/debugfs-driver-habanalabs
 create mode 100644 drivers/misc/habanalabs/debugfs.c

diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
new file mode 100644
index 000000000000..2b606c84938c
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -0,0 +1,127 @@
+What:           /sys/kernel/debug/habanalabs/hl<n>/addr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the device address to be used for read or write through
+                PCI bar. The acceptable value is a string that starts with "0x"
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/command_buffers
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about the currently allocated
+                command buffers
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/command_submission
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about the currently active
+                command submissions
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/command_submission_jobs
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with detailed information about each JOB (CB) of
+                each active command submission
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/data32
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the root user to read or write directly through the
+                device's PCI bar. Writing to this file generates a write
+                transaction while reading from the file generates a read
+                transaction. This custom interface is needed (instead of using
+                the generic Linux user-space PCI mapping) because the DDR bar
+                is very small compared to the DDR memory and only the driver can
+                move the bar before and after the transaction
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/device
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Enables the root user to set the device to a specific state.
+                Valid values are "disable", "enable", "suspend", "resume".
+                User can read this property to see the valid values
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_addr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets I2C device address for I2C transaction that is generated
+                by the device's CPU
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_bus
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets I2C bus address for I2C transaction that is generated by
+                the device's CPU
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_data
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Triggers an I2C transaction that is generated by the device's
+                CPU. Writing to this file generates a write transaction while
+                reading from the file generates a read transaction
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_reg
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets I2C register id for I2C transaction that is generated by
+                the device's CPU
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/led0
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the state of the first S/W led on the device
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/led1
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the state of the second S/W led on the device
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/led2
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the state of the third S/W led on the device
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/mmu
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the hop values and physical address for a given ASID
+                and virtual address. The user should write the ASID and VA into
+                the file and then read the file to get the result.
+                e.g. to display info about VA 0x1000 for ASID 1 you need to do:
+                echo "1 0x1000" > /sys/kernel/debug/habanalabs/hl0/mmu
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/set_power_state
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the PCI power state. Valid values are "1" for D0 and "2"
+                for D3Hot
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/userptr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about the user pointers
+                (user virtual addresses) that are currently pinned and mapped
+                to DMA addresses
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/vm
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about all the active virtual
+                address mappings per ASID
+
diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index fd46f8b48bab..c6592db59b25 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -8,5 +8,7 @@ habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
 		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
 		command_submission.o mmu.o
 
+habanalabs-$(CONFIG_DEBUG_FS) += debugfs.o
+
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
index 700c6da01188..c2aa580f2bd0 100644
--- a/drivers/misc/habanalabs/command_buffer.c
+++ b/drivers/misc/habanalabs/command_buffer.c
@@ -36,6 +36,8 @@ static void cb_release(struct kref *ref)
 	cb = container_of(ref, struct hl_cb, refcount);
 	hdev = cb->hdev;
 
+	hl_debugfs_remove_cb(cb);
+
 	cb_do_release(hdev, cb);
 }
 
@@ -141,6 +143,8 @@ int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
 	*handle = cb->id | HL_MMAP_CB_MASK;
 	*handle <<= PAGE_SHIFT;
 
+	hl_debugfs_add_cb(cb);
+
 	return 0;
 
 release_cb:
diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
index 0116c2262f17..bc1a50682304 100644
--- a/drivers/misc/habanalabs/command_submission.c
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -153,6 +153,8 @@ static void free_job(struct hl_device *hdev, struct hl_cs_job *job)
 	list_del(&job->cs_node);
 	spin_unlock(&cs->job_lock);
 
+	hl_debugfs_remove_job(hdev, job);
+
 	if (job->ext_queue)
 		cs_put(cs);
 
@@ -215,6 +217,12 @@ static void cs_do_release(struct kref *ref)
 		}
 	}
 
+	/*
+	 * Must be called before hl_ctx_put because inside we use ctx to get
+	 * the device
+	 */
+	hl_debugfs_remove_cs(cs);
+
 	hl_ctx_put(cs->ctx);
 
 	if (cs->timedout)
@@ -483,6 +491,8 @@ static int _hl_cs_ioctl(struct hl_fpriv *hpriv, void __user *chunks,
 
 	*cs_seq = cs->sequence;
 
+	hl_debugfs_add_cs(cs);
+
 	/* Validate ALL the CS chunks before submitting the CS */
 	for (i = 0, parse_cnt = 0 ; i < num_chunks ; i++, parse_cnt++) {
 		struct hl_cs_chunk *chunk = &cs_chunk_array[i];
@@ -531,6 +541,8 @@ static int _hl_cs_ioctl(struct hl_fpriv *hpriv, void __user *chunks,
 		if (job->ext_queue)
 			cs_get(cs);
 
+		hl_debugfs_add_job(hdev, job);
+
 		rc = cs_parser(hpriv, job);
 		if (rc) {
 			dev_err(hdev->dev,
diff --git a/drivers/misc/habanalabs/debugfs.c b/drivers/misc/habanalabs/debugfs.c
new file mode 100644
index 000000000000..09221b05daf7
--- /dev/null
+++ b/drivers/misc/habanalabs/debugfs.c
@@ -0,0 +1,1069 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+#include "include/hw_ip/mmu/mmu_general.h"
+
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+
+#define MMU_ADDR_BUF_SIZE	40
+#define MMU_ASID_BUF_SIZE	10
+#define MMU_KBUF_SIZE		(MMU_ADDR_BUF_SIZE + MMU_ASID_BUF_SIZE)
+
+static struct dentry *hl_debug_root;
+
+static int hl_debugfs_i2c_read(struct hl_device *hdev, u8 i2c_bus, u8 i2c_addr,
+				u8 i2c_reg, u32 *val)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
+		return -EBUSY;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_I2C_RD;
+	pkt.i2c_bus = i2c_bus;
+	pkt.i2c_addr = i2c_addr;
+	pkt.i2c_reg = i2c_reg;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					HL_DEVICE_TIMEOUT_USEC, (long *) val);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to read from I2C, error %d\n", rc);
+
+	return rc;
+}
+
+static int hl_debugfs_i2c_write(struct hl_device *hdev, u8 i2c_bus, u8 i2c_addr,
+				u8 i2c_reg, u32 val)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
+		return -EBUSY;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_I2C_WR;
+	pkt.i2c_bus = i2c_bus;
+	pkt.i2c_addr = i2c_addr;
+	pkt.i2c_reg = i2c_reg;
+	pkt.value = val;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					HL_DEVICE_TIMEOUT_USEC, NULL);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to write to I2C, error %d\n", rc);
+
+	return rc;
+}
+
+static void hl_debugfs_led_set(struct hl_device *hdev, u8 led, u8 state)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
+		return;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.opcode = ARMCP_PACKET_LED_SET;
+	pkt.led_index = led;
+	pkt.value = state;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+						HL_DEVICE_TIMEOUT_USEC, NULL);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to set LED %d, error %d\n", led, rc);
+}
+
+static int command_buffers_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_cb *cb;
+	bool first = true;
+
+	spin_lock(&dev_entry->cb_spinlock);
+
+	list_for_each_entry(cb, &dev_entry->cb_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " CB ID   CTX ID   CB size    CB RefCnt    mmap?   CS counter\n");
+			seq_puts(s, "---------------------------------------------------------------\n");
+		}
+		seq_printf(s,
+			"   %03d        %d    0x%08x      %d          %d          %d\n",
+			cb->id, cb->ctx_id, cb->size,
+			kref_read(&cb->refcount),
+			cb->mmap, cb->cs_cnt);
+	}
+
+	spin_unlock(&dev_entry->cb_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int command_submission_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_cs *cs;
+	bool first = true;
+
+	spin_lock(&dev_entry->cs_spinlock);
+
+	list_for_each_entry(cs, &dev_entry->cs_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " CS ID   CTX ASID   CS RefCnt   Submitted    Completed\n");
+			seq_puts(s, "------------------------------------------------------\n");
+		}
+		seq_printf(s,
+			"   %llu       %d          %d           %d            %d\n",
+			cs->sequence, cs->ctx->asid,
+			kref_read(&cs->refcount),
+			cs->submitted, cs->completed);
+	}
+
+	spin_unlock(&dev_entry->cs_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int command_submission_jobs_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_cs_job *job;
+	bool first = true;
+
+	spin_lock(&dev_entry->cs_job_spinlock);
+
+	list_for_each_entry(job, &dev_entry->cs_job_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " JOB ID   CS ID    CTX ASID   H/W Queue\n");
+			seq_puts(s, "---------------------------------------\n");
+		}
+		if (job->cs)
+			seq_printf(s,
+				"    %02d       %llu         %d         %d\n",
+				job->id, job->cs->sequence, job->cs->ctx->asid,
+				job->hw_queue_id);
+		else
+			seq_printf(s,
+				"    %02d       0         %d         %d\n",
+				job->id, HL_KERNEL_ASID_ID, job->hw_queue_id);
+	}
+
+	spin_unlock(&dev_entry->cs_job_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int userptr_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_userptr *userptr;
+	char dma_dir[4][30] = {"DMA_BIDIRECTIONAL", "DMA_TO_DEVICE",
+				"DMA_FROM_DEVICE", "DMA_NONE"};
+	bool first = true;
+
+	spin_lock(&dev_entry->userptr_spinlock);
+
+	list_for_each_entry(userptr, &dev_entry->userptr_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " user virtual address     size             dma dir\n");
+			seq_puts(s, "----------------------------------------------------------\n");
+		}
+		seq_printf(s,
+			"    0x%-14llx      %-10u    %-30s\n",
+			userptr->addr, userptr->size, dma_dir[userptr->dir]);
+	}
+
+	spin_unlock(&dev_entry->userptr_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int vm_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_ctx *ctx;
+	struct hl_vm *vm;
+	struct hl_vm_hash_node *hnode;
+	struct hl_userptr *userptr;
+	struct hl_vm_phys_pg_list *phys_pg_list = NULL;
+	struct hl_vm_phys_pg *phys_pg;
+	enum vm_type_t *vm_type;
+	bool once = true;
+	int i;
+
+	if (!dev_entry->hdev->mmu_enable)
+		return 0;
+
+	spin_lock(&dev_entry->ctx_mem_hash_spinlock);
+
+	list_for_each_entry(ctx, &dev_entry->ctx_mem_hash_list, debugfs_list) {
+		once = false;
+		seq_puts(s, "\n\n----------------------------------------------------");
+		seq_puts(s, "\n----------------------------------------------------\n\n");
+		seq_printf(s, "ctx asid: %u\n", ctx->asid);
+
+		seq_puts(s, "\nmappings:\n\n");
+		seq_puts(s, "    virtual address        size          handle\n");
+		seq_puts(s, "----------------------------------------------------\n");
+		mutex_lock(&ctx->mem_hash_lock);
+		hash_for_each(ctx->mem_hash, i, hnode, node) {
+			vm_type = hnode->ptr;
+
+			if (*vm_type == VM_TYPE_USERPTR) {
+				userptr = hnode->ptr;
+				seq_printf(s,
+					"    0x%-14llx      %-10u\n",
+					hnode->vaddr, userptr->size);
+			} else {
+				phys_pg_list = hnode->ptr;
+				seq_printf(s,
+					"    0x%-14llx      %-10u       %-4u\n",
+					hnode->vaddr, phys_pg_list->total_size,
+					phys_pg_list->handle);
+			}
+		}
+		mutex_unlock(&ctx->mem_hash_lock);
+
+		vm = &ctx->hdev->vm;
+		spin_lock(&vm->idr_lock);
+
+		if (!idr_is_empty(&vm->phys_pg_list_handles))
+			seq_puts(s, "\n\nallocations:\n");
+
+		idr_for_each_entry(&vm->phys_pg_list_handles, phys_pg_list, i) {
+			if (phys_pg_list->asid != ctx->asid)
+				continue;
+
+			seq_printf(s, "\nhandle: %u\n", phys_pg_list->handle);
+			seq_puts(s, "   physical address        size\n");
+			seq_puts(s, "-------------------------------------\n");
+			list_for_each_entry(phys_pg, &phys_pg_list->list, node)
+				seq_printf(s, "    0x%-14llx      %-10u\n",
+					phys_pg->paddr, phys_pg->page_size);
+		}
+		spin_unlock(&vm->idr_lock);
+
+	}
+
+	spin_unlock(&dev_entry->ctx_mem_hash_spinlock);
+
+	if (!once)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+/* these inline functions are copied from mmu.c */
+static inline u64 get_hop0_addr(struct hl_ctx *ctx)
+{
+	return ctx->hdev->asic_prop.mmu_pgt_addr +
+			(ctx->asid * ctx->hdev->asic_prop.mmu_hop_table_size);
+}
+
+static inline u64 get_hop0_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP0_MASK) >> HOP0_SHIFT);
+}
+
+static inline u64 get_hop1_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP1_MASK) >> HOP1_SHIFT);
+}
+
+static inline u64 get_hop2_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP2_MASK) >> HOP2_SHIFT);
+}
+
+static inline u64 get_hop3_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP3_MASK) >> HOP3_SHIFT);
+}
+
+static inline u64 get_hop4_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP4_MASK) >> HOP4_SHIFT);
+}
+
+static inline u64 get_next_hop_addr(u64 curr_pte)
+{
+	if (curr_pte & PAGE_PRESENT_MASK)
+		return curr_pte & PHYS_ADDR_MASK;
+	else
+		return ULLONG_MAX;
+}
+
+static int mmu_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_device *hdev = dev_entry->hdev;
+	struct hl_ctx *ctx = hdev->user_ctx;
+
+	u64 hop0_addr = 0, hop0_pte_addr = 0, hop0_pte = 0,
+		hop1_addr = 0, hop1_pte_addr = 0, hop1_pte = 0,
+		hop2_addr = 0, hop2_pte_addr = 0, hop2_pte = 0,
+		hop3_addr = 0, hop3_pte_addr = 0, hop3_pte = 0,
+		hop4_addr = 0, hop4_pte_addr = 0, hop4_pte = 0,
+		virt_addr = dev_entry->mmu_addr;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	if (!ctx) {
+		dev_err(hdev->dev, "no ctx available\n");
+		return 0;
+	}
+
+	mutex_lock(&ctx->mmu_lock);
+
+	/* the following lookup is copied from unmap() in mmu.c */
+
+	hop0_addr = get_hop0_addr(ctx);
+	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
+	hop0_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
+	hop1_addr = get_next_hop_addr(hop0_pte);
+
+	if (hop1_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
+	hop1_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
+	hop2_addr = get_next_hop_addr(hop1_pte);
+
+	if (hop2_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
+	hop2_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
+	hop3_addr = get_next_hop_addr(hop2_pte);
+
+	if (hop3_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
+	hop3_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
+
+	if (!(hop3_pte & LAST_MASK)) {
+		hop4_addr = get_next_hop_addr(hop3_pte);
+
+		if (hop4_addr == ULLONG_MAX)
+			goto not_mapped;
+
+		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
+		hop4_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
+		if (!(hop4_pte & PAGE_PRESENT_MASK))
+			goto not_mapped;
+	} else {
+		if (!(hop3_pte & PAGE_PRESENT_MASK))
+			goto not_mapped;
+	}
+
+	seq_printf(s, "asid: %u, virt_addr: 0x%llx\n",
+			dev_entry->mmu_asid, dev_entry->mmu_addr);
+
+	seq_printf(s, "hop0_addr: 0x%llx\n", hop0_addr);
+	seq_printf(s, "hop0_pte_addr: 0x%llx\n", hop0_pte_addr);
+	seq_printf(s, "hop0_pte: 0x%llx\n", hop0_pte);
+
+	seq_printf(s, "hop1_addr: 0x%llx\n", hop1_addr);
+	seq_printf(s, "hop1_pte_addr: 0x%llx\n", hop1_pte_addr);
+	seq_printf(s, "hop1_pte: 0x%llx\n", hop1_pte);
+
+	seq_printf(s, "hop2_addr: 0x%llx\n", hop2_addr);
+	seq_printf(s, "hop2_pte_addr: 0x%llx\n", hop2_pte_addr);
+	seq_printf(s, "hop2_pte: 0x%llx\n", hop2_pte);
+
+	seq_printf(s, "hop3_addr: 0x%llx\n", hop3_addr);
+	seq_printf(s, "hop3_pte_addr: 0x%llx\n", hop3_pte_addr);
+	seq_printf(s, "hop3_pte: 0x%llx\n", hop3_pte);
+
+	if (!(hop3_pte & LAST_MASK)) {
+		seq_printf(s, "hop4_addr: 0x%llx\n", hop4_addr);
+		seq_printf(s, "hop4_pte_addr: 0x%llx\n", hop4_pte_addr);
+		seq_printf(s, "hop4_pte: 0x%llx\n", hop4_pte);
+	}
+
+	goto out;
+
+not_mapped:
+	dev_err(hdev->dev, "virt addr 0x%llx is not mapped to phys addr\n",
+			virt_addr);
+out:
+	mutex_unlock(&ctx->mmu_lock);
+
+	return 0;
+}
+
+static ssize_t mmu_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *f_pos)
+{
+	struct seq_file *s = file->private_data;
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_device *hdev = dev_entry->hdev;
+	char kbuf[MMU_KBUF_SIZE], asid_kbuf[MMU_ASID_BUF_SIZE],
+		addr_kbuf[MMU_ADDR_BUF_SIZE];
+	char *c;
+	ssize_t rc;
+
+	if (!hdev->mmu_enable)
+		return count;
+
+	memset(kbuf, 0, sizeof(kbuf));
+	memset(asid_kbuf, 0, sizeof(asid_kbuf));
+	memset(addr_kbuf, 0, sizeof(addr_kbuf));
+
+	if (count > sizeof(kbuf) - 1 || copy_from_user(kbuf, buf, count))
+		goto err;
+
+	kbuf[MMU_KBUF_SIZE - 1] = 0;
+
+	c = strchr(kbuf, ' ');
+	if (!c || c - kbuf >= MMU_ASID_BUF_SIZE)
+		goto err;
+
+	memcpy(asid_kbuf, kbuf, c - kbuf);
+
+	rc = kstrtouint(asid_kbuf, 10, &dev_entry->mmu_asid);
+	if (rc)
+		goto err;
+
+	c = strstr(kbuf, " 0x");
+	if (!c || (kbuf + count) - (c + 3) >= MMU_ADDR_BUF_SIZE)
+		goto err;
+
+	c += 3;
+	memcpy(addr_kbuf, c, (kbuf + count) - c);
+
+	rc = kstrtoull(addr_kbuf, 16, &dev_entry->mmu_addr);
+	if (rc)
+		goto err;
+
+	return count;
+
+err:
+	dev_err(hdev->dev, "usage: echo <asid> <0xaddr> > mmu\n");
+
+	return -EINVAL;
+}
+
+static ssize_t hl_data_read32(struct file *f, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	char tmp_buf[32];
+	u32 val;
+	ssize_t rc;
+
+	if (*ppos)
+		return 0;
+
+	rc = hdev->asic_funcs->debugfs_read32(hdev, entry->addr, &val);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to read from 0x%010llx\n",
+			entry->addr);
+		return rc;
+	}
+
+	sprintf(tmp_buf, "0x%08x\n", val);
+	rc = simple_read_from_buffer(buf, count, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_data_write32(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 16, &value);
+	if (rc)
+		return rc;
+
+	rc = hdev->asic_funcs->debugfs_write32(hdev, entry->addr, value);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to write 0x%08x to 0x%010llx\n",
+			value, entry->addr);
+		return rc;
+	}
+
+	return count;
+}
+
+static ssize_t hl_get_power_state(struct file *f, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	char tmp_buf[200];
+	ssize_t rc;
+	int i;
+
+	if (*ppos)
+		return 0;
+
+	if (hdev->pdev->current_state == PCI_D0)
+		i = 1;
+	else if (hdev->pdev->current_state == PCI_D3hot)
+		i = 2;
+	else
+		i = 3;
+
+	sprintf(tmp_buf,
+		"current power state: %d\n1 - D0\n2 - D3hot\n3 - Unknown\n", i);
+	rc = simple_read_from_buffer(buf, count, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_set_power_state(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	if (value == 1) {
+		pci_set_power_state(hdev->pdev, PCI_D0);
+		pci_restore_state(hdev->pdev);
+		rc = pci_enable_device(hdev->pdev);
+	} else if (value == 2) {
+		pci_save_state(hdev->pdev);
+		pci_disable_device(hdev->pdev);
+		pci_set_power_state(hdev->pdev, PCI_D3hot);
+	} else {
+		dev_dbg(hdev->dev, "invalid power state value %u\n", value);
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static ssize_t hl_i2c_data_read(struct file *f, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	char tmp_buf[32];
+	u32 val;
+	ssize_t rc;
+
+	if (*ppos)
+		return 0;
+
+	rc = hl_debugfs_i2c_read(hdev, entry->i2c_bus, entry->i2c_addr,
+			entry->i2c_reg, &val);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to read from I2C bus %d, addr %d, reg %d\n",
+			entry->i2c_bus, entry->i2c_addr, entry->i2c_reg);
+		return rc;
+	}
+
+	sprintf(tmp_buf, "0x%02x\n", val);
+	rc = simple_read_from_buffer(buf, count, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_i2c_data_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 16, &value);
+	if (rc)
+		return rc;
+
+	rc = hl_debugfs_i2c_write(hdev, entry->i2c_bus, entry->i2c_addr,
+			entry->i2c_reg, value);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to write 0x%02x to I2C bus %d, addr %d, reg %d\n",
+			value, entry->i2c_bus, entry->i2c_addr, entry->i2c_reg);
+		return rc;
+	}
+
+	return count;
+}
+
+static ssize_t hl_led0_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	value = value ? 1 : 0;
+
+	hl_debugfs_led_set(hdev, 0, value);
+
+	return count;
+}
+
+static ssize_t hl_led1_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	value = value ? 1 : 0;
+
+	hl_debugfs_led_set(hdev, 1, value);
+
+	return count;
+}
+
+static ssize_t hl_led2_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	value = value ? 1 : 0;
+
+	hl_debugfs_led_set(hdev, 2, value);
+
+	return count;
+}
+
+static ssize_t hl_device_read(struct file *f, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	char tmp_buf[200];
+	ssize_t rc;
+
+	if (*ppos)
+		return 0;
+
+	sprintf(tmp_buf,
+		"Valid values are: disable, enable, suspend, resume\n");
+	rc = simple_read_from_buffer(buf, count, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_device_write(struct file *f, const char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+
+	if (strncmp("disable", buf, strlen("disable")) == 0) {
+		hdev->disabled = true;
+	} else if (strncmp("enable", buf, strlen("enable")) == 0) {
+		hdev->disabled = false;
+	} else if (strncmp("suspend", buf, strlen("suspend")) == 0) {
+		hdev->asic_funcs->suspend(hdev);
+	} else if (strncmp("resume", buf, strlen("resume")) == 0) {
+		hdev->asic_funcs->resume(hdev);
+	} else {
+		dev_err(hdev->dev,
+			"Valid values are: disable, enable, suspend, resume\n");
+		count = -EINVAL;
+	}
+
+	return count;
+}
+
+static const struct file_operations hl_data32b_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_data_read32,
+	.write = hl_data_write32
+};
+
+static const struct file_operations hl_i2c_data_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_i2c_data_read,
+	.write = hl_i2c_data_write
+};
+
+static const struct file_operations hl_power_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_get_power_state,
+	.write = hl_set_power_state
+};
+
+static const struct file_operations hl_led0_fops = {
+	.owner = THIS_MODULE,
+	.write = hl_led0_write
+};
+
+static const struct file_operations hl_led1_fops = {
+	.owner = THIS_MODULE,
+	.write = hl_led1_write
+};
+
+static const struct file_operations hl_led2_fops = {
+	.owner = THIS_MODULE,
+	.write = hl_led2_write
+};
+
+static const struct file_operations hl_device_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_device_read,
+	.write = hl_device_write
+};
+
+static const struct hl_info_list hl_debugfs_list[] = {
+	{"command_buffers", command_buffers_show, NULL},
+	{"command_submission", command_submission_show, NULL},
+	{"command_submission_jobs", command_submission_jobs_show, NULL},
+	{"userptr", userptr_show, NULL},
+	{"vm", vm_show, NULL},
+	{"mmu", mmu_show, mmu_write},
+};
+
+static int hl_debugfs_open(struct inode *inode, struct file *file)
+{
+	struct hl_debugfs_entry *node = inode->i_private;
+
+	return single_open(file, node->info_ent->show, node);
+}
+
+static ssize_t hl_debugfs_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *f_pos)
+{
+	struct hl_debugfs_entry *node = file->f_inode->i_private;
+
+	if (node->info_ent->write)
+		return node->info_ent->write(file, buf, count, f_pos);
+	else
+		return -EINVAL;
+
+}
+
+static const struct file_operations hl_debugfs_fops = {
+	.owner = THIS_MODULE,
+	.open = hl_debugfs_open,
+	.read = seq_read,
+	.write = hl_debugfs_write,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+void hl_debugfs_add_device(struct hl_device *hdev)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+	int count = ARRAY_SIZE(hl_debugfs_list);
+	struct hl_debugfs_entry *entry;
+	struct dentry *ent;
+	int i;
+
+	dev_entry->hdev = hdev;
+	dev_entry->entry_arr = kmalloc_array(count,
+					sizeof(struct hl_debugfs_entry),
+					GFP_KERNEL);
+	if (!dev_entry->entry_arr)
+		return;
+
+	INIT_LIST_HEAD(&dev_entry->file_list);
+	INIT_LIST_HEAD(&dev_entry->cb_list);
+	INIT_LIST_HEAD(&dev_entry->cs_list);
+	INIT_LIST_HEAD(&dev_entry->cs_job_list);
+	INIT_LIST_HEAD(&dev_entry->userptr_list);
+	INIT_LIST_HEAD(&dev_entry->ctx_mem_hash_list);
+	mutex_init(&dev_entry->file_mutex);
+	spin_lock_init(&dev_entry->cb_spinlock);
+	spin_lock_init(&dev_entry->cs_spinlock);
+	spin_lock_init(&dev_entry->cs_job_spinlock);
+	spin_lock_init(&dev_entry->userptr_spinlock);
+	spin_lock_init(&dev_entry->ctx_mem_hash_spinlock);
+
+	dev_entry->root = debugfs_create_dir(dev_name(hdev->dev),
+						hl_debug_root);
+
+	debugfs_create_x64("addr",
+				0644,
+				dev_entry->root,
+				&dev_entry->addr);
+
+	debugfs_create_file("data32",
+				0644,
+				dev_entry->root,
+				dev_entry,
+				&hl_data32b_fops);
+
+	debugfs_create_file("set_power_state",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_power_fops);
+
+	debugfs_create_u8("i2c_bus",
+				0644,
+				dev_entry->root,
+				&dev_entry->i2c_bus);
+
+	debugfs_create_u8("i2c_addr",
+				0644,
+				dev_entry->root,
+				&dev_entry->i2c_addr);
+
+	debugfs_create_u8("i2c_reg",
+				0644,
+				dev_entry->root,
+				&dev_entry->i2c_reg);
+
+	debugfs_create_file("i2c_data",
+				0644,
+				dev_entry->root,
+				dev_entry,
+				&hl_i2c_data_fops);
+
+	debugfs_create_file("led0",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_led0_fops);
+
+	debugfs_create_file("led1",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_led1_fops);
+
+	debugfs_create_file("led2",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_led2_fops);
+
+	debugfs_create_file("device",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_device_fops);
+
+	for (i = 0, entry = dev_entry->entry_arr ; i < count ; i++, entry++) {
+
+		ent = debugfs_create_file(hl_debugfs_list[i].name,
+					0444,
+					dev_entry->root,
+					entry,
+					&hl_debugfs_fops);
+		entry->dent = ent;
+		entry->info_ent = &hl_debugfs_list[i];
+		entry->dev_entry = dev_entry;
+	}
+}
+
+void hl_debugfs_remove_device(struct hl_device *hdev)
+{
+	struct hl_dbg_device_entry *entry = &hdev->hl_debugfs;
+
+	debugfs_remove_recursive(entry->root);
+
+	mutex_destroy(&entry->file_mutex);
+	kfree(entry->entry_arr);
+}
+
+void hl_debugfs_add_file(struct hl_fpriv *hpriv)
+{
+	struct hl_dbg_device_entry *dev_entry = &hpriv->hdev->hl_debugfs;
+
+	mutex_lock(&dev_entry->file_mutex);
+	list_add(&hpriv->debugfs_list, &dev_entry->file_list);
+	mutex_unlock(&dev_entry->file_mutex);
+}
+
+void hl_debugfs_remove_file(struct hl_fpriv *hpriv)
+{
+	struct hl_dbg_device_entry *dev_entry = &hpriv->hdev->hl_debugfs;
+
+	mutex_lock(&dev_entry->file_mutex);
+	list_del(&hpriv->debugfs_list);
+	mutex_unlock(&dev_entry->file_mutex);
+}
+
+void hl_debugfs_add_cb(struct hl_cb *cb)
+{
+	struct hl_dbg_device_entry *dev_entry = &cb->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cb_spinlock);
+	list_add(&cb->debugfs_list, &dev_entry->cb_list);
+	spin_unlock(&dev_entry->cb_spinlock);
+}
+
+void hl_debugfs_remove_cb(struct hl_cb *cb)
+{
+	struct hl_dbg_device_entry *dev_entry = &cb->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cb_spinlock);
+	list_del(&cb->debugfs_list);
+	spin_unlock(&dev_entry->cb_spinlock);
+}
+
+void hl_debugfs_add_cs(struct hl_cs *cs)
+{
+	struct hl_dbg_device_entry *dev_entry = &cs->ctx->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_spinlock);
+	list_add(&cs->debugfs_list, &dev_entry->cs_list);
+	spin_unlock(&dev_entry->cs_spinlock);
+}
+
+void hl_debugfs_remove_cs(struct hl_cs *cs)
+{
+	struct hl_dbg_device_entry *dev_entry = &cs->ctx->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_spinlock);
+	list_del(&cs->debugfs_list);
+	spin_unlock(&dev_entry->cs_spinlock);
+}
+
+void hl_debugfs_add_job(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_job_spinlock);
+	list_add(&job->debugfs_list, &dev_entry->cs_job_list);
+	spin_unlock(&dev_entry->cs_job_spinlock);
+}
+
+void hl_debugfs_remove_job(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_job_spinlock);
+	list_del(&job->debugfs_list);
+	spin_unlock(&dev_entry->cs_job_spinlock);
+}
+
+void hl_debugfs_add_userptr(struct hl_device *hdev, struct hl_userptr *userptr)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->userptr_spinlock);
+	list_add(&userptr->debugfs_list, &dev_entry->userptr_list);
+	spin_unlock(&dev_entry->userptr_spinlock);
+}
+
+void hl_debugfs_remove_userptr(struct hl_device *hdev,
+				struct hl_userptr *userptr)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->userptr_spinlock);
+	list_del(&userptr->debugfs_list);
+	spin_unlock(&dev_entry->userptr_spinlock);
+}
+
+void hl_debugfs_add_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->ctx_mem_hash_spinlock);
+	list_add(&ctx->debugfs_list, &dev_entry->ctx_mem_hash_list);
+	spin_unlock(&dev_entry->ctx_mem_hash_spinlock);
+}
+
+void hl_debugfs_remove_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->ctx_mem_hash_spinlock);
+	list_del(&ctx->debugfs_list);
+	spin_unlock(&dev_entry->ctx_mem_hash_spinlock);
+}
+
+int __init hl_debugfs_init(void)
+{
+	hl_debug_root = debugfs_create_dir("habanalabs", NULL);
+	if (IS_ERR_OR_NULL(hl_debug_root)) {
+		pr_err("habanalabs: can not create debugfs directory\n");
+		hl_debug_root = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void hl_debugfs_fini(void)
+{
+	debugfs_remove_recursive(hl_debug_root);
+}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 1f7340551386..ba69307de0a1 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -22,6 +22,8 @@ static void hpriv_release(struct kref *ref)
 
 	put_pid(hpriv->taskpid);
 
+	hl_debugfs_remove_file(hpriv);
+
 	mutex_destroy(&hpriv->restore_phase_mutex);
 
 	kfree(hpriv);
@@ -807,6 +809,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto free_cb_pool;
 	}
 
+	hl_debugfs_add_device(hdev);
+
 	rc = hdev->asic_funcs->hw_init(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize the H/W\n");
@@ -936,6 +940,8 @@ void hl_device_fini(struct hl_device *hdev)
 
 	device_late_fini(hdev);
 
+	hl_debugfs_remove_device(hdev);
+
 	hl_sysfs_fini(hdev);
 
 	/*
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index c21c6046f09b..d9782cdc6091 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -5312,6 +5312,8 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
 	job->user_cb_size = cb_size;
 	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
 
+	hl_debugfs_add_job(hdev, job);
+
 	parser.ctx_id = HL_KERNEL_ASID_ID;
 	parser.cs_sequence = 0;
 	parser.job_id = job->id;
@@ -5344,6 +5346,7 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
 
 free_job:
 	hl_userptr_delete_list(hdev, &job->userptr_list);
+	hl_debugfs_remove_job(hdev, job);
 	kfree(job);
 	cb->cs_cnt--;
 
@@ -5374,6 +5377,106 @@ void goya_restore_phase_topology(struct hl_device *hdev)
 	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
 }
 
+/**
+ * goya_debugfs_read32 - read a 32bit value from a given device address
+ *
+ * @hdev:	pointer to hl_device structure
+ * @addr:	address in device
+ * @val:	returned value
+ *
+ * In case of DDR address that is not mapped into the default aperture that
+ * the DDR bar exposes, the function will configure the iATU so that the DDR
+ * bar will be positioned at a base address that allows reading from the
+ * required address. Configuring the iATU during normal operation can
+ * lead to undefined behavior and therefore, should be done with extreme care
+ *
+ */
+int goya_debugfs_read32(struct hl_device *hdev, u64 addr, u32 *val)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int rc = 0;
+
+	if ((addr >= CFG_BASE) && (addr < CFG_BASE + CFG_SIZE)) {
+		*val = RREG32(addr - CFG_BASE);
+
+	} else if ((addr >= SRAM_BASE_ADDR) &&
+			(addr < SRAM_BASE_ADDR + SRAM_SIZE)) {
+
+		*val = readl(hdev->pcie_bar[SRAM_CFG_BAR_ID] +
+				(addr - SRAM_BASE_ADDR));
+
+	} else if ((addr >= DRAM_PHYS_BASE) &&
+			(addr < DRAM_PHYS_BASE + hdev->asic_prop.dram_size)) {
+
+		u64 bar_base_addr = DRAM_PHYS_BASE +
+				(addr & ~(prop->dram_pci_bar_size - 0x1ull));
+
+		rc = goya_set_ddr_bar_base(hdev, bar_base_addr);
+		if (!rc) {
+			*val = readl(hdev->pcie_bar[DDR_BAR_ID] +
+						(addr - bar_base_addr));
+
+			rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
+				(MMU_PAGE_TABLES_ADDR &
+					~(prop->dram_pci_bar_size - 0x1ull)));
+		}
+	} else {
+		rc = -EFAULT;
+	}
+
+	return rc;
+}
+
+/**
+ * goya_debugfs_write32 - write a 32bit value to a given device address
+ *
+ * @hdev:	pointer to hl_device structure
+ * @addr:	address in device
+ * @val:	value to write
+ *
+ * In case of DDR address that is not mapped into the default aperture that
+ * the DDR bar exposes, the function will configure the iATU so that the DDR
+ * bar will be positioned at a base address that allows writing to the
+ * required address. Configuring the iATU during normal operation can
+ * lead to undefined behavior and therefore, should be done with extreme care
+ *
+ */
+int goya_debugfs_write32(struct hl_device *hdev, u64 addr, u32 val)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int rc = 0;
+
+	if ((addr >= CFG_BASE) && (addr < CFG_BASE + CFG_SIZE)) {
+		WREG32(addr - CFG_BASE, val);
+
+	} else if ((addr >= SRAM_BASE_ADDR) &&
+			(addr < SRAM_BASE_ADDR + SRAM_SIZE)) {
+
+		writel(val, hdev->pcie_bar[SRAM_CFG_BAR_ID] +
+					(addr - SRAM_BASE_ADDR));
+
+	} else if ((addr >= DRAM_PHYS_BASE) &&
+			(addr < DRAM_PHYS_BASE + hdev->asic_prop.dram_size)) {
+
+		u64 bar_base_addr = DRAM_PHYS_BASE +
+				(addr & ~(prop->dram_pci_bar_size - 0x1ull));
+
+		rc = goya_set_ddr_bar_base(hdev, bar_base_addr);
+		if (!rc) {
+			writel(val, hdev->pcie_bar[DDR_BAR_ID] +
+						(addr - bar_base_addr));
+
+			rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
+				(MMU_PAGE_TABLES_ADDR &
+					~(prop->dram_pci_bar_size - 0x1ull)));
+		}
+	} else {
+		rc = -EFAULT;
+	}
+
+	return rc;
+}
+
 static u64 goya_read_pte(struct hl_device *hdev, u64 addr)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -5780,6 +5883,8 @@ static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
 	job->user_cb_size = cb_size;
 	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
 
+	hl_debugfs_add_job(hdev, job);
+
 	parser.ctx_id = HL_KERNEL_ASID_ID;
 	parser.cs_sequence = 0;
 	parser.job_id = job->id;
@@ -5808,6 +5913,7 @@ static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
 
 free_job:
 	hl_userptr_delete_list(hdev, &job->userptr_list);
+	hl_debugfs_remove_job(hdev, job);
 	kfree(job);
 	cb->cs_cnt--;
 
@@ -6206,6 +6312,8 @@ static const struct hl_asic_funcs goya_funcs = {
 	.update_eq_ci = goya_update_eq_ci,
 	.context_switch = goya_context_switch,
 	.restore_phase_topology = goya_restore_phase_topology,
+	.debugfs_read32 = goya_debugfs_read32,
+	.debugfs_write32 = goya_debugfs_write32,
 	.add_device_attr = goya_add_device_attr,
 	.remove_device_attr = goya_remove_device_attr,
 	.handle_eqe = goya_handle_eqe,
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 42e8b1baef2f..4e98d0da75c9 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -136,6 +136,10 @@ struct goya_device {
 	u32		hw_cap_initialized;
 };
 
+int goya_debugfs_i2c_read(struct hl_device *hdev, u8 i2c_bus,
+			u8 i2c_addr, u8 i2c_reg, u32 *val);
+int goya_debugfs_i2c_write(struct hl_device *hdev, u8 i2c_bus,
+			u8 i2c_addr, u8 i2c_reg, u32 val);
 int goya_test_cpu_queue(struct hl_device *hdev);
 int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 				u32 timeout, long *result);
@@ -146,6 +150,7 @@ long goya_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
 long goya_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
 void goya_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
 			long value);
+void goya_debugfs_led_set(struct hl_device *hdev, u8 led, u8 state);
 void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq);
 int goya_add_device_attr(struct hl_device *hdev);
 void goya_remove_device_attr(struct hl_device *hdev);
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 6c0fe76936be..3f8a42a5af82 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -226,6 +226,7 @@ struct hl_cb_mgr {
  * @refcount: reference counter for usage of the CB.
  * @hdev: pointer to device this CB belongs to.
  * @lock: spinlock to protect mmap/cs flows.
+ * @debugfs_list: node in debugfs list of command buffers.
  * @pool_list: node in pool list of command buffers.
  * @kernel_address: Holds the CB's kernel virtual address.
  * @bus_address: Holds the CB's DMA address.
@@ -242,6 +243,7 @@ struct hl_cb {
 	struct kref		refcount;
 	struct hl_device	*hdev;
 	spinlock_t		lock;
+	struct list_head	debugfs_list;
 	struct list_head	pool_list;
 	u64			kernel_address;
 	dma_addr_t		bus_address;
@@ -444,6 +446,8 @@ enum hl_pll_frequency {
  * @update_eq_ci: update event queue CI.
  * @context_switch: called upon ASID context switch.
  * @restore_phase_topology: clear all SOBs amd MONs.
+ * @debugfs_read32: debug interface for reading u32 from DRAM/SRAM.
+ * @debugfs_write32: debug interface for writing u32 to DRAM/SRAM.
  * @add_device_attr: add ASIC specific device attributes.
  * @remove_device_attr: remove ASIC specific device attributes.
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
@@ -512,6 +516,8 @@ struct hl_asic_funcs {
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
 	int (*context_switch)(struct hl_device *hdev, u32 asid);
 	void (*restore_phase_topology)(struct hl_device *hdev);
+	int (*debugfs_read32)(struct hl_device *hdev, u64 addr, u32 *val);
+	int (*debugfs_write32)(struct hl_device *hdev, u64 addr, u32 val);
 	int (*add_device_attr)(struct hl_device *hdev);
 	void (*remove_device_attr)(struct hl_device *hdev);
 	void (*handle_eqe)(struct hl_device *hdev,
@@ -577,6 +583,7 @@ struct hl_va_range {
  * @mem_hash_lock: protects the mem_hash.
  * @mmu_lock: protects the MMU page tables. Any change to the PGT, modifing the
  *            MMU hash or walking the PGT requires talking this lock
+ * @debugfs_list: node in debugfs list of contexts.
  * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
  *			to user so user could inquire about CS. It is used as
  *			index to cs_pending array.
@@ -601,6 +608,7 @@ struct hl_ctx {
 	struct hl_va_range	dram_va_range;
 	struct mutex		mem_hash_lock;
 	struct mutex		mmu_lock;
+	struct list_head	debugfs_list;
 	u64			cs_sequence;
 	spinlock_t		cs_lock;
 	atomic64_t		dram_phys_mem;
@@ -662,6 +670,7 @@ struct hl_userptr {
  * @fence: pointer to the fence object of this CS.
  * @work_tdr: delayed work node for TDR.
  * @mirror_node : node in device mirror list of command submissions.
+ * @debugfs_list: node in debugfs list of command submissions.
  * @sequence: the sequence number of this CS.
  * @submitted: true if CS was submitted to H/W.
  * @completed: true if CS was completed by device.
@@ -679,6 +688,7 @@ struct hl_cs {
 	struct dma_fence	*fence;
 	struct delayed_work	work_tdr;
 	struct list_head	mirror_node;
+	struct list_head	debugfs_list;
 	u64			sequence;
 	u8			submitted;
 	u8			completed;
@@ -697,6 +707,7 @@ struct hl_cs {
  * @finish_work: workqueue object to run when job is completed.
  * @userptr_list: linked-list of userptr mappings that belong to this job and
  *			wait for completion.
+ * @debugfs_list: node in debugfs list of command submission jobs.
  * @id: the id of this job inside a CS.
  * @hw_queue_id: the id of the H/W queue this job is submitted to.
  * @user_cb_size: the actual size of the CB we got from the user.
@@ -710,6 +721,7 @@ struct hl_cs_job {
 	struct hl_cb		*patched_cb;
 	struct work_struct	finish_work;
 	struct list_head	userptr_list;
+	struct list_head	debugfs_list;
 	u32			id;
 	u32			hw_queue_id;
 	u32			user_cb_size;
@@ -854,6 +866,7 @@ struct hl_vm {
  * @ctx: current executing context.
  * @ctx_mgr: context manager to handle multiple context for this FD.
  * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
+ * @debugfs_list: list of relevant ASIC debugfs.
  * @refcount: number of related contexts.
  * @restore_phase_mutex: lock for context switch and restore phase.
  */
@@ -864,6 +877,7 @@ struct hl_fpriv {
 	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
 	struct hl_ctx_mgr	ctx_mgr;
 	struct hl_cb_mgr	cb_mgr;
+	struct list_head	debugfs_list;
 	struct kref		refcount;
 	struct mutex		restore_phase_mutex;
 };
@@ -871,6 +885,85 @@ struct hl_fpriv {
 
 
 
+
+/*
+ * DebugFS
+ */
+
+/**
+ * struct hl_info_list - debugfs file ops.
+ * @name: file name.
+ * @show: function to output information.
+ * @write: function to write to the file.
+ */
+struct hl_info_list {
+	const char	*name;
+	int		(*show)(struct seq_file *s, void *data);
+	ssize_t		(*write)(struct file *file, const char __user *buf,
+				size_t count, loff_t *f_pos);
+};
+
+/**
+ * struct hl_debugfs_entry - debugfs dentry wrapper.
+ * @dent: base debugfs entry structure.
+ * @info_ent: dentry related ops.
+ * @dev_entry: ASIC specific debugfs manager.
+ */
+struct hl_debugfs_entry {
+	struct dentry			*dent;
+	const struct hl_info_list	*info_ent;
+	struct hl_dbg_device_entry	*dev_entry;
+};
+
+/**
+ * struct hl_dbg_device_entry - ASIC specific debugfs manager.
+ * @root: root dentry.
+ * @hdev: habanalabs device structure.
+ * @entry_arr: array of available hl_debugfs_entry.
+ * @file_list: list of available debugfs files.
+ * @file_mutex: protects file_list.
+ * @cb_list: list of available CBs.
+ * @cb_spinlock: protects cb_list.
+ * @cs_list: list of available CSs.
+ * @cs_spinlock: protects cs_list.
+ * @cs_job_list: list of available CS jobs.
+ * @cs_job_spinlock: protects cs_job_list.
+ * @userptr_list: list of available userptrs (virtual memory chunk descriptor).
+ * @userptr_spinlock: protects userptr_list.
+ * @ctx_mem_hash_list: list of available contexts with MMU mappings.
+ * @ctx_mem_hash_spinlock: protects ctx_mem_hash_list.
+ * @addr: next address to read/write from/to in read/write32.
+ * @mmu_addr: next virtual address to translate to physical address in mmu_show.
+ * @mmu_asid: ASID to use while translating in mmu_show.
+ * @i2c_bus: generic u8 debugfs file for bus value to use in i2c_data_read.
+ * @i2c_addr: generic u8 debugfs file for address value to use in i2c_data_read.
+ * @i2c_reg: generic u8 debugfs file for register value to use in i2c_data_read.
+ */
+struct hl_dbg_device_entry {
+	struct dentry			*root;
+	struct hl_device		*hdev;
+	struct hl_debugfs_entry		*entry_arr;
+	struct list_head		file_list;
+	struct mutex			file_mutex;
+	struct list_head		cb_list;
+	spinlock_t			cb_spinlock;
+	struct list_head		cs_list;
+	spinlock_t			cs_spinlock;
+	struct list_head		cs_job_list;
+	spinlock_t			cs_job_spinlock;
+	struct list_head		userptr_list;
+	spinlock_t			userptr_spinlock;
+	struct list_head		ctx_mem_hash_list;
+	spinlock_t			ctx_mem_hash_spinlock;
+	u64				addr;
+	u64				mmu_addr;
+	u32				mmu_asid;
+	u8				i2c_bus;
+	u8				i2c_addr;
+	u8				i2c_reg;
+};
+
+
 /*
  * DEVICES
  */
@@ -959,6 +1052,7 @@ struct hl_device_reset_work {
  * @hwmon_dev: H/W monitor device.
  * @pm_mng_profile: current power management profile.
  * @hl_chip_info: ASIC's sensors information.
+ * @hl_debugfs: device's debugfs manager.
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
@@ -1024,6 +1118,8 @@ struct hl_device {
 	enum hl_pm_mng_profile		pm_mng_profile;
 	struct hwmon_chip_info		hl_chip_info;
 
+	struct hl_dbg_device_entry	hl_debugfs;
+
 	struct list_head		cb_pool;
 	spinlock_t			cb_pool_lock;
 
@@ -1263,6 +1359,101 @@ void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
 u64 hl_get_max_power(struct hl_device *hdev);
 void hl_set_max_power(struct hl_device *hdev, u64 value);
 
+#ifdef CONFIG_DEBUG_FS
+
+int hl_debugfs_init(void);
+void hl_debugfs_fini(void);
+void hl_debugfs_add_device(struct hl_device *hdev);
+void hl_debugfs_remove_device(struct hl_device *hdev);
+void hl_debugfs_add_file(struct hl_fpriv *hpriv);
+void hl_debugfs_remove_file(struct hl_fpriv *hpriv);
+void hl_debugfs_add_cb(struct hl_cb *cb);
+void hl_debugfs_remove_cb(struct hl_cb *cb);
+void hl_debugfs_add_cs(struct hl_cs *cs);
+void hl_debugfs_remove_cs(struct hl_cs *cs);
+void hl_debugfs_add_job(struct hl_device *hdev, struct hl_cs_job *job);
+void hl_debugfs_remove_job(struct hl_device *hdev, struct hl_cs_job *job);
+void hl_debugfs_add_userptr(struct hl_device *hdev, struct hl_userptr *userptr);
+void hl_debugfs_remove_userptr(struct hl_device *hdev,
+				struct hl_userptr *userptr);
+void hl_debugfs_add_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx);
+void hl_debugfs_remove_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx);
+
+#else
+
+static inline int __init hl_debugfs_init(void)
+{
+	return 0;
+}
+
+static inline void hl_debugfs_fini(void)
+{
+}
+
+static inline void hl_debugfs_add_device(struct hl_device *hdev)
+{
+}
+
+static inline void hl_debugfs_remove_device(struct hl_device *hdev)
+{
+}
+
+static inline void hl_debugfs_add_file(struct hl_fpriv *hpriv)
+{
+}
+
+static inline void hl_debugfs_remove_file(struct hl_fpriv *hpriv)
+{
+}
+
+static inline void hl_debugfs_add_cb(struct hl_cb *cb)
+{
+}
+
+static inline void hl_debugfs_remove_cb(struct hl_cb *cb)
+{
+}
+
+static inline void hl_debugfs_add_cs(struct hl_cs *cs)
+{
+}
+
+static inline void hl_debugfs_remove_cs(struct hl_cs *cs)
+{
+}
+
+static inline void hl_debugfs_add_job(struct hl_device *hdev,
+					struct hl_cs_job *job)
+{
+}
+
+static inline void hl_debugfs_remove_job(struct hl_device *hdev,
+					struct hl_cs_job *job)
+{
+}
+
+static inline void hl_debugfs_add_userptr(struct hl_device *hdev,
+					struct hl_userptr *userptr)
+{
+}
+
+static inline void hl_debugfs_remove_userptr(struct hl_device *hdev,
+					struct hl_userptr *userptr)
+{
+}
+
+static inline void hl_debugfs_add_ctx_mem_hash(struct hl_device *hdev,
+					struct hl_ctx *ctx)
+{
+}
+
+static inline void hl_debugfs_remove_ctx_mem_hash(struct hl_device *hdev,
+					struct hl_ctx *ctx)
+{
+}
+
+#endif
+
 /* IOCTLs */
 long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 4b7bf42a4d3e..c12a7807664e 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -153,6 +153,8 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	 */
 	hl_device_set_frequency(hdev, PLL_HIGH);
 
+	hl_debugfs_add_file(hpriv);
+
 	return 0;
 
 out_err:
@@ -425,17 +427,20 @@ static int __init hl_init(void)
 		goto remove_major;
 	}
 
+	hl_debugfs_init();
+
 	rc = pci_register_driver(&hl_pci_driver);
 	if (rc) {
 		pr_err("habanalabs: failed to register pci device\n");
-		goto remove_class;
+		goto remove_debugfs;
 	}
 
 	pr_debug("habanalabs: driver loaded\n");
 
 	return 0;
 
-remove_class:
+remove_debugfs:
+	hl_debugfs_fini();
 	class_destroy(hl_class);
 remove_major:
 	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
@@ -450,6 +455,13 @@ static void __exit hl_exit(void)
 {
 	pci_unregister_driver(&hl_pci_driver);
 
+	/*
+	 * Removing debugfs must be after all devices or simulator devices
+	 * have been removed because otherwise we get a bug in the
+	 * debugfs module for referencing NULL objects
+	 */
+	hl_debugfs_fini();
+
 	class_destroy(hl_class);
 	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
 
diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
index c41ea19502e5..32698ec978b2 100644
--- a/drivers/misc/habanalabs/memory.c
+++ b/drivers/misc/habanalabs/memory.c
@@ -1284,6 +1284,8 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
 		goto free_sgt;
 	}
 
+	hl_debugfs_add_userptr(hdev, userptr);
+
 	return 0;
 
 free_sgt:
@@ -1309,6 +1311,8 @@ int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 {
 	struct page **pages;
 
+	hl_debugfs_remove_userptr(hdev, userptr);
+
 	if (userptr->dma_mapped)
 		hdev->asic_funcs->hl_dma_unmap_sg(hdev,
 				userptr->sgt->sgl,
@@ -1470,6 +1474,8 @@ int hl_vm_ctx_init_with_ranges(struct hl_ctx *ctx, u64 host_range_start,
 		goto dram_vm_err;
 	}
 
+	hl_debugfs_add_ctx_mem_hash(hdev, ctx);
+
 	return 0;
 
 dram_vm_err:
@@ -1590,6 +1596,8 @@ void hl_vm_ctx_fini(struct hl_ctx *ctx)
 	struct hlist_node *tmp_node;
 	int i;
 
+	hl_debugfs_remove_ctx_mem_hash(hdev, ctx);
+
 	if (!hash_empty(ctx->mem_hash))
 		dev_notice(hdev->dev, "ctx is freed while it has va in use\n");
 
-- 
2.17.1
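
The DDR BAR handling described in the goya_debugfs_read32() and
goya_debugfs_write32() comments above comes down to aligning the requested
address to the BAR size and accessing the remainder as an offset into the
BAR. A minimal userspace sketch of that arithmetic follows. The helper names
are hypothetical, the BAR size is assumed to be a power of two, and
DRAM_PHYS_BASE is left out (device DRAM is treated as starting at offset 0),
which is an assumption of this sketch and not something the patch states.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: base of the BAR-sized window that contains addr.
 * bar_size must be a power of two for the mask to be valid.
 */
static uint64_t ddr_window_base(uint64_t addr, uint64_t bar_size)
{
	return addr & ~(bar_size - 1);
}

/* Hypothetical helper: offset of addr inside that window. */
static uint64_t ddr_window_offset(uint64_t addr, uint64_t bar_size)
{
	return addr - ddr_window_base(addr, bar_size);
}
```

With a 512MB BAR (0x20000000), address 0x23456789 falls in the window based
at 0x20000000 and is accessed at offset 0x03456789 from the start of the BAR.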



* [PATCH 15/15] Update MAINTAINERS and CREDITS with habanalabs info
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (12 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 14/15] habanalabs: add debugfs support Oded Gabbay
@ 2019-01-23  0:00 ` Oded Gabbay
  2019-01-23 12:27 ` [PATCH 00/15] Habana Labs kernel driver Mike Rapoport
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23  0:00 UTC (permalink / raw)
  To: gregkh, linux-kernel; +Cc: ogabbay

The habanalabs driver was written from scratch from the very first days
of Habana and is maintained by Oded Gabbay.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 CREDITS     | 2 +-
 MAINTAINERS | 9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/CREDITS b/CREDITS
index e818eb6a3e71..03f3d67126fc 100644
--- a/CREDITS
+++ b/CREDITS
@@ -1222,7 +1222,7 @@ S: Brazil
 
 N: Oded Gabbay
 E: oded.gabbay@gmail.com
-D: AMD KFD maintainer
+D: HabanaLabs and AMD KFD maintainer
 S: 12 Shraga Raphaeli
 S: Petah-Tikva, 4906418
 S: Israel
diff --git a/MAINTAINERS b/MAINTAINERS
index 51029a425dbe..93e047336cab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6641,6 +6641,15 @@ F:	drivers/clocksource/h8300_*.c
 F:	drivers/clk/h8300/
 F:	drivers/irqchip/irq-renesas-h8*.c
 
+HABANALABS PCI DRIVER
+M:	Oded Gabbay <oded.gabbay@gmail.com>
+T:	git https://github.com/HabanaAI/linux.git
+S:	Supported
+F:	drivers/misc/habanalabs/
+F:	include/uapi/misc/habanalabs.h
+F:	Documentation/ABI/testing/sysfs-driver-habanalabs
+F:	Documentation/ABI/testing/debugfs-driver-habanalabs
+
 HACKRF MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
 L:	linux-media@vger.kernel.org
-- 
2.17.1



* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
@ 2019-01-23  0:49   ` Joe Perches
  2019-01-25 19:18     ` Oded Gabbay
  2019-01-23 12:28   ` Mike Rapoport
  2019-01-26 16:05   ` Arnd Bergmann
  2 siblings, 1 reply; 103+ messages in thread
From: Joe Perches @ 2019-01-23  0:49 UTC (permalink / raw)
  To: Oded Gabbay, gregkh, linux-kernel; +Cc: ogabbay

On Wed, 2019-01-23 at 02:00 +0200, Oded Gabbay wrote:
> This patch adds the habanalabs skeleton driver. The driver does nothing at
> this stage except very basic operations. It contains the minimal code to
> insmod and rmmod the driver and to create a /dev/hlX file per PCI device.

trivial notes:

> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
[]
> \ No newline at end of file

You should fix these.  There are at least a couple of them.

> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
[]
> @@ -0,0 +1,331 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */

Add #define pr_fmt(fmt) "habanalabs: " fmt

> +
> +#include "habanalabs.h"

or add it in this file


> +static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
> +				int minor, const struct file_operations *fops)
> +{
> +	int err, devno = MKDEV(hdev->major, minor);
> +	struct cdev *hdev_cdev = &hdev->cdev;
> +	char name[8];
> +
> +	sprintf(name, "hl%d", hdev->id);

Might overflow name one day
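
To illustrate the concern: with name[8], "hl%d" fits ids of up to five
digits and overflows on the sixth. A bounded variant could look roughly like
the userspace sketch below. hl_format_name() and HL_NAME_LEN are hypothetical
names used only for this example, not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HL_NAME_LEN 8	/* same size as the name[] array in the patch */

/*
 * Hypothetical helper: format "hl<id>" into buf without writing past len.
 * Returns 0 on success, -1 if the name did not fit or formatting failed.
 */
static int hl_format_name(char *buf, size_t len, int id)
{
	int n = snprintf(buf, len, "hl%d", id);

	/* snprintf() returns the length it wanted to write, excluding NUL */
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

snprintf() always NUL-terminates within len, so a too-long id is reported as
an error here instead of corrupting the stack.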

> +
> +	cdev_init(hdev_cdev, fops);
> +	hdev_cdev->owner = THIS_MODULE;
> +	err = cdev_add(hdev_cdev, devno, 1);
> +	if (err) {
> +		pr_err("habanalabs: Failed to add char device %s", name);

So #define pr_fmt can auto prefix these and this would be

		pr_err("Failed to add char device %s\n", name);

missing terminating '\n' btw
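
The prefixing works purely through string literal concatenation, which the
userspace sketch below tries to show. format_cdev_error() is a hypothetical
stand-in that formats into a buffer; in the kernel, pr_err() itself applies
pr_fmt() to its format string, and the define has to appear before the
printk headers are included.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Every format string passed through pr_fmt() gets this prefix. */
#define pr_fmt(fmt) "habanalabs: " fmt

/*
 * Hypothetical stand-in for a pr_err() call site: the adjacent string
 * literals are merged at compile time, so callers no longer spell out
 * the "habanalabs: " prefix themselves.
 */
static int format_cdev_error(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len,
			pr_fmt("Failed to add char device %s\n"), name);
}
```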

> +		goto err_cdev_add;
> +	}
> +
> +	hdev->dev = device_create(hclass, NULL, devno, NULL, "%s", name);
> +	if (IS_ERR(hdev->dev)) {
> +		pr_err("habanalabs: Failed to create device %s\n", name);

And this would be:
		pr_err("Failed to create device %s\n", name);


etc...

> +static int device_early_init(struct hl_device *hdev)
> +{
> +	switch (hdev->asic_type) {
> +	case ASIC_GOYA:
> +		sprintf(hdev->asic_name, "GOYA");

strcpy, or perhaps better still, strlcpy

> +int hl_device_init(struct hl_device *hdev, struct class *hclass)
> +{
[]
> +	dev_notice(hdev->dev,
> +		"Successfully added device to habanalabs driver\n");

The code is mostly aligned to open parenthesis, but perhaps
you could run scripts/checkpatch.pl --strict over it and
see if you agree with anything it bleats.

> +int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr,
> +				u32 timeout_us, u32 *val)
> +{
> +	/*
> +	 * pReturnVal is defined as volatile because it points to HOST memory,
> +	 * which is being written to by the device. Therefore, we can't use
> +	 * locks to synchronize it and it is not a memory-mapped register space
> +	 */
> +	volatile u32 *pReturnVal = (volatile u32 *) addr;

It'd be nice to avoid Hungarian notation and CamelCase

> +	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
> +
> +	might_sleep();
> +
> +	for (;;) {
> +		*val = *pReturnVal;
> +		if (*val)
> +			break;
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			*val = *pReturnVal;
> +			break;
> +		}
> +		usleep_range((100 >> 2) + 1, 100);
> +	}
> +
> +	return (*val ? 0 : -ETIMEDOUT);

Unnecessary parentheses

> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
[]
> +static struct pci_device_id ids[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_HABANALABS, PCI_IDS_GOYA), },
> +	{ 0, }
> +};

static const?

> diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
[]
> +struct hl_bd {
> +	__u64	ptr;
> +	__u32	len;
> +	union {
> +		struct {
> +			__u32	repeat:16;
> +			__u32	res1:8;
> +			__u32	repeat_valid:1;
> +			__u32	res2:7;
> +		};
> +		__u32	ctl;
> +	};
> +};

Maybe use the appropriate endian-annotated __le<size> types instead of
__u<size>, with whatever cpu_to_le<size> / le<size>_to_cpu conversions
are necessary.
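
A sketch of the idea in plain userspace C, under the assumption that the
device expects the layout a little-endian compiler gives the bitfields
above (repeat in bits 0-15, repeat_valid in bit 24). The mask names,
bd_ctl_pack() and put_le32() are all hypothetical; put_le32() stands in
for a cpu_to_le32() conversion followed by the store.

```c
#include <stdint.h>

/* Hypothetical masks mirroring the bitfields of struct hl_bd. */
#define BD_CTL_REPEAT_MASK		0x0000ffffu
#define BD_CTL_REPEAT_VALID_SHIFT	24

/* Build the ctl word with shifts and masks instead of bitfields. */
static uint32_t bd_ctl_pack(uint16_t repeat, int repeat_valid)
{
	return ((uint32_t)repeat & BD_CTL_REPEAT_MASK) |
	       ((uint32_t)(repeat_valid ? 1 : 0) << BD_CTL_REPEAT_VALID_SHIFT);
}

/* Userspace stand-in for cpu_to_le32() + store: least significant byte first. */
static void put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}
```

That keeps the in-memory descriptor identical on big-endian hosts, which
the bitfield version does not guarantee.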



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (13 preceding siblings ...)
  2019-01-23  0:00 ` [PATCH 15/15] Update MAINTAINERS and CREDITS with habanalabs info Oded Gabbay
@ 2019-01-23 12:27 ` Mike Rapoport
  2019-01-23 22:43   ` Oded Gabbay
  2019-01-23 21:52 ` Olof Johansson
  2019-01-23 21:57 ` Dave Airlie
  16 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-23 12:27 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

Hi,

On Wed, Jan 23, 2019 at 02:00:42AM +0200, Oded Gabbay wrote:
> Hello,
> 
> For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> Habana Labs since its inception two and a half years ago. 
> 
> Habana is a leading startup in the emerging AI processor space and we have
> already started production of our first Goya inference processor PCIe card
> and delivered it to customers. The Goya processor silicon has been tested
> since June of 2018 and is production-qualified by now. The Gaudi training
> processor solution is slated to sample in the second quarter of 2019.
> 
> This patch-set contains the kernel driver for Habana's AI Processors 
> (AIP) that are designed to accelerate Deep Learning inference and training
> workloads. The current version supports only the Goya processor and
> support for Gaudi will be upstreamed after the ASIC will be available to
> customers.
> 
> The Goya processor has been designed from the ground up for deep learning
> inference workloads. It comprises a cluster of eight fully programmable
> Tensor Processing Cores (TPC). The TPC core is a VLIW SIMD vector
> processor with ISA and hardware that was tailored to serve deep learning
> workloads efficiently. 

[ ... ] 
 
> I would appricate any feedback, question and/or review.

I've looked at patches 1 and 3-5 for now; it seems patch 2 still hasn't
made it to lore.kernel.org.

FWIW, I think it's good, solid work, unless you spoil it in patches 6-14
;-)

As a general note, maybe drivers/misc is not the most appropriate place for
such a complex beast. How about drivers/accelerator/ai?
 
> p.s. for those who prefer to clone the tree instead of looking at the
> emails, you can grab a copy from our company's page in GitHub:
> 
> https://github.com/HabanaAI/linux/releases/tag/hl_patchset_v1
> 
> Thanks,
> Oded
> 
> Oded Gabbay (14):
>   habanalabs: add skeleton driver
>   habanalabs: add Goya registers header files
>   habanalabs: add basic Goya support
>   habanalabs: add context and ASID modules
>   habanalabs: add command buffer module
>   habanalabs: add basic Goya h/w initialization
>   habanalabs: add h/w queues module
>   habanalabs: add event queue and interrupts
>   habanalabs: add sysfs and hwmon support
>   habanalabs: add device reset support
>   habanalabs: add command submission module
>   habanalabs: implement INFO IOCTL
>   habanalabs: add debugfs support
>   Update MAINTAINERS and CREDITS with habanalabs info
> 
> Omer Shpigelman (1):
>   habanalabs: add virtual memory and MMU modules
> 
>  CREDITS                                       |    2 +-
>  .../ABI/testing/debugfs-driver-habanalabs     |  127 +
>  .../ABI/testing/sysfs-driver-habanalabs       |  190 +
>  MAINTAINERS                                   |    9 +
>  drivers/misc/Kconfig                          |    1 +
>  drivers/misc/Makefile                         |    1 +
>  drivers/misc/habanalabs/Kconfig               |   22 +
>  drivers/misc/habanalabs/Makefile              |   14 +
>  drivers/misc/habanalabs/asid.c                |   58 +
>  drivers/misc/habanalabs/command_buffer.c      |  425 +
>  drivers/misc/habanalabs/command_submission.c  |  799 ++
>  drivers/misc/habanalabs/context.c             |  216 +
>  drivers/misc/habanalabs/debugfs.c             | 1069 ++
>  drivers/misc/habanalabs/device.c              | 1097 ++
>  drivers/misc/habanalabs/goya/Makefile         |    3 +
>  drivers/misc/habanalabs/goya/goya.c           | 6347 ++++++++++++
>  drivers/misc/habanalabs/goya/goyaP.h          |  161 +
>  drivers/misc/habanalabs/goya/goya_hwmgr.c     |  306 +
>  drivers/misc/habanalabs/goya/goya_security.c  | 2999 ++++++
>  drivers/misc/habanalabs/habanalabs.h          | 1464 +++
>  drivers/misc/habanalabs/habanalabs_drv.c      |  474 +
>  drivers/misc/habanalabs/habanalabs_ioctl.c    |  237 +
>  drivers/misc/habanalabs/hw_queue.c            |  654 ++
>  drivers/misc/habanalabs/hwmon.c               |  449 +
>  .../include/goya/asic_reg/cpu_ca53_cfg_regs.h |  213 +
>  .../include/goya/asic_reg/cpu_if_regs.h       |  110 +
>  .../include/goya/asic_reg/cpu_pll_regs.h      |  186 +
>  .../include/goya/asic_reg/ddr_mc_ch0_regs.h   | 1158 +++
>  .../include/goya/asic_reg/ddr_mc_ch1_regs.h   | 1158 +++
>  .../include/goya/asic_reg/ddr_misc_ch0_regs.h |  156 +
>  .../include/goya/asic_reg/ddr_misc_ch1_regs.h |  156 +
>  .../include/goya/asic_reg/dma_ch_0_regs.h     |  512 +
>  .../include/goya/asic_reg/dma_ch_1_regs.h     |  512 +
>  .../include/goya/asic_reg/dma_ch_2_regs.h     |  512 +
>  .../include/goya/asic_reg/dma_ch_3_regs.h     |  512 +
>  .../include/goya/asic_reg/dma_ch_4_regs.h     |  512 +
>  .../include/goya/asic_reg/dma_macro_regs.h    |  242 +
>  .../include/goya/asic_reg/dma_nrtr_regs.h     |  380 +
>  .../include/goya/asic_reg/dma_qm_0_regs.h     |  543 +
>  .../include/goya/asic_reg/dma_qm_1_regs.h     |  543 +
>  .../include/goya/asic_reg/dma_qm_2_regs.h     |  543 +
>  .../include/goya/asic_reg/dma_qm_3_regs.h     |  543 +
>  .../include/goya/asic_reg/dma_qm_4_regs.h     |  543 +
>  .../include/goya/asic_reg/gic_regs.h          | 9079 +++++++++++++++++
>  .../include/goya/asic_reg/goya_blocks.h       | 1372 +++
>  .../include/goya/asic_reg/goya_masks.h        |  262 +
>  .../include/goya/asic_reg/goya_regs.h         |  119 +
>  .../include/goya/asic_reg/ic_pll_regs.h       |  186 +
>  .../include/goya/asic_reg/mc_pll_regs.h       |  186 +
>  .../include/goya/asic_reg/mme1_rtr_regs.h     |  876 ++
>  .../include/goya/asic_reg/mme2_rtr_regs.h     |  876 ++
>  .../include/goya/asic_reg/mme3_rtr_regs.h     |  876 ++
>  .../include/goya/asic_reg/mme4_rtr_regs.h     |  876 ++
>  .../include/goya/asic_reg/mme5_rtr_regs.h     |  876 ++
>  .../include/goya/asic_reg/mme6_rtr_regs.h     |  876 ++
>  .../include/goya/asic_reg/mme_cmdq_regs.h     |  431 +
>  .../include/goya/asic_reg/mme_qm_regs.h       |  543 +
>  .../include/goya/asic_reg/mme_regs.h          | 2422 +++++
>  .../include/goya/asic_reg/mmu_regs.h          |  158 +
>  .../include/goya/asic_reg/pci_nrtr_regs.h     |  380 +
>  .../include/goya/asic_reg/pcie_aux_regs.h     |  476 +
>  .../include/goya/asic_reg/pcie_dbi_regs.h     | 2909 ++++++
>  .../goya/asic_reg/psoc_emmc_pll_regs.h        |  186 +
>  .../goya/asic_reg/psoc_global_conf_regs.h     | 1119 ++
>  .../include/goya/asic_reg/psoc_mme_pll_regs.h |  186 +
>  .../include/goya/asic_reg/psoc_pci_pll_regs.h |  186 +
>  .../include/goya/asic_reg/psoc_spi_regs.h     |  427 +
>  .../goya/asic_reg/sram_y0_x0_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y0_x1_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y0_x2_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y0_x3_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y0_x4_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y1_x0_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y1_x1_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y1_x2_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y1_x3_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y1_x4_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y2_x0_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y2_x1_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y2_x2_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y2_x3_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y2_x4_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y3_x0_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y3_x1_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y3_x2_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y3_x3_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y3_x4_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y4_x0_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y4_x1_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y4_x2_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y4_x3_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y4_x4_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y5_x0_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y5_x1_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y5_x2_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y5_x3_rtr_regs.h       |  215 +
>  .../goya/asic_reg/sram_y5_x4_rtr_regs.h       |  215 +
>  .../include/goya/asic_reg/stlb_regs.h         |  133 +
>  .../include/goya/asic_reg/sync_mngr_regs.h    | 4930 +++++++++
>  .../include/goya/asic_reg/tpc0_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc0_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc0_eml_cfg_regs.h |  580 ++
>  .../include/goya/asic_reg/tpc0_nrtr_regs.h    |  380 +
>  .../include/goya/asic_reg/tpc0_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc1_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc1_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc1_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc1_rtr_regs.h     |  848 ++
>  .../include/goya/asic_reg/tpc2_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc2_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc2_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc2_rtr_regs.h     |  848 ++
>  .../include/goya/asic_reg/tpc3_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc3_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc3_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc3_rtr_regs.h     |  848 ++
>  .../include/goya/asic_reg/tpc4_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc4_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc4_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc4_rtr_regs.h     |  848 ++
>  .../include/goya/asic_reg/tpc5_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc5_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc5_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc5_rtr_regs.h     |  848 ++
>  .../include/goya/asic_reg/tpc6_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc6_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc6_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc6_rtr_regs.h     |  848 ++
>  .../include/goya/asic_reg/tpc7_cfg_regs.h     | 2110 ++++
>  .../include/goya/asic_reg/tpc7_cmdq_regs.h    |  431 +
>  .../include/goya/asic_reg/tpc7_nrtr_regs.h    |  380 +
>  .../include/goya/asic_reg/tpc7_qm_regs.h      |  543 +
>  .../include/goya/asic_reg/tpc_pll_regs.h      |  186 +
>  drivers/misc/habanalabs/include/goya/goya.h   |  117 +
>  .../include/goya/goya_async_events.h          |  186 +
>  .../habanalabs/include/goya/goya_boot_if.h    |   32 +
>  .../habanalabs/include/goya/goya_packets.h    |  234 +
>  .../habanalabs/include/habanalabs_device_if.h |  397 +
>  .../include/hw_ip/mmu/mmu_general.h           |   45 +
>  .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
>  drivers/misc/habanalabs/irq.c                 |  325 +
>  drivers/misc/habanalabs/memory.c              | 1714 ++++
>  drivers/misc/habanalabs/mmu.c                 |  604 ++
>  drivers/misc/habanalabs/sysfs.c               |  690 ++
>  include/uapi/misc/habanalabs.h                |  412 +
>  145 files changed, 99610 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/ABI/testing/debugfs-driver-habanalabs
>  create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
>  create mode 100644 drivers/misc/habanalabs/Kconfig
>  create mode 100644 drivers/misc/habanalabs/Makefile
>  create mode 100644 drivers/misc/habanalabs/asid.c
>  create mode 100644 drivers/misc/habanalabs/command_buffer.c
>  create mode 100644 drivers/misc/habanalabs/command_submission.c
>  create mode 100644 drivers/misc/habanalabs/context.c
>  create mode 100644 drivers/misc/habanalabs/debugfs.c
>  create mode 100644 drivers/misc/habanalabs/device.c
>  create mode 100644 drivers/misc/habanalabs/goya/Makefile
>  create mode 100644 drivers/misc/habanalabs/goya/goya.c
>  create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
>  create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
>  create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
>  create mode 100644 drivers/misc/habanalabs/habanalabs.h
>  create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c
>  create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
>  create mode 100644 drivers/misc/habanalabs/hw_queue.c
>  create mode 100644 drivers/misc/habanalabs/hwmon.c
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_ca53_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_if_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_mc_ch0_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_mc_ch1_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_misc_ch0_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ddr_misc_ch1_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_0_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_1_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_2_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_3_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_4_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_macro_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_nrtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_0_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_1_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_2_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_3_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_4_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/gic_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_blocks.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ic_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mc_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme5_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme6_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mmu_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pci_nrtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pcie_aux_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pcie_dbi_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_emmc_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_global_conf_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_mme_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_pci_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_spi_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x0_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x0_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y1_x4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x0_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y2_x4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x0_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y3_x4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x0_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y4_x4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x0_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y5_x4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/stlb_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sync_mngr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_eml_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_nrtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_rtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_cfg_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_cmdq_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_nrtr_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_qm_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc_pll_regs.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya_boot_if.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
>  create mode 100644 drivers/misc/habanalabs/include/habanalabs_device_if.h
>  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
>  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
>  create mode 100644 drivers/misc/habanalabs/irq.c
>  create mode 100644 drivers/misc/habanalabs/memory.c
>  create mode 100644 drivers/misc/habanalabs/mmu.c
>  create mode 100644 drivers/misc/habanalabs/sysfs.c
>  create mode 100644 include/uapi/misc/habanalabs.h
> 
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
  2019-01-23  0:49   ` Joe Perches
@ 2019-01-23 12:28   ` Mike Rapoport
  2019-01-23 12:40     ` Greg KH
  2019-01-25 20:05     ` Oded Gabbay
  2019-01-26 16:05   ` Arnd Bergmann
  2 siblings, 2 replies; 103+ messages in thread
From: Mike Rapoport @ 2019-01-23 12:28 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:43AM +0200, Oded Gabbay wrote:
> This patch adds the habanalabs skeleton driver. The driver does nothing at
> this stage except very basic operations. It contains the minimal code to
> insmod and rmmod the driver and to create a /dev/hlX file per PCI device.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/Kconfig                          |   1 +
>  drivers/misc/Makefile                         |   1 +
>  drivers/misc/habanalabs/Kconfig               |  22 ++
>  drivers/misc/habanalabs/Makefile              |   7 +
>  drivers/misc/habanalabs/device.c              | 331 ++++++++++++++++
>  drivers/misc/habanalabs/habanalabs.h          | 149 +++++++
>  drivers/misc/habanalabs/habanalabs_drv.c      | 366 ++++++++++++++++++
>  .../habanalabs/include/habanalabs_device_if.h | 125 ++++++
>  8 files changed, 1002 insertions(+)
>  create mode 100644 drivers/misc/habanalabs/Kconfig
>  create mode 100644 drivers/misc/habanalabs/Makefile
>  create mode 100644 drivers/misc/habanalabs/device.c
>  create mode 100644 drivers/misc/habanalabs/habanalabs.h
>  create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c
>  create mode 100644 drivers/misc/habanalabs/include/habanalabs_device_if.h
> 
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index f417b06e11c5..fecab53c4f21 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -535,4 +535,5 @@ source "drivers/misc/echo/Kconfig"
>  source "drivers/misc/cxl/Kconfig"
>  source "drivers/misc/ocxl/Kconfig"
>  source "drivers/misc/cardreader/Kconfig"
> +source "drivers/misc/habanalabs/Kconfig"
>  endmenu
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index e39ccbbc1b3a..ae77dfd790a4 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -59,3 +59,4 @@ obj-$(CONFIG_PCI_ENDPOINT_TEST)	+= pci_endpoint_test.o
>  obj-$(CONFIG_OCXL)		+= ocxl/
>  obj-y				+= cardreader/
>  obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
> +obj-$(CONFIG_HABANA_AI)		+= habanalabs/
> diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
> new file mode 100644
> index 000000000000..b7f38a14caf5
> --- /dev/null
> +++ b/drivers/misc/habanalabs/Kconfig
> @@ -0,0 +1,22 @@
> +#
> +# HabanaLabs AI accelerators driver
> +#
> +
> +config HABANA_AI
> +	tristate "HabanaAI accelerators (habanalabs)"
> +	depends on PCI
> +	select FRAME_VECTOR
> +	help
> +	  Enables PCIe card driver for Habana's AI Processors (AIP) that are
> +	  designed to accelerate Deep Learning inference and training workloads.
> +
> +	  The driver manages the PCIe devices and provides IOCTL interface for
> +	  the user to submit workloads to the devices.
> +
> +	  The user-space interface is described in
> +	  include/uapi/misc/habanalabs.h
> +
> +	  If unsure, say N.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called habanalabs.
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> new file mode 100644
> index 000000000000..b41433a09e02
> --- /dev/null
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -0,0 +1,7 @@
> +#
> +# Makefile for HabanaLabs AI accelerators driver
> +#
> +
> +obj-m	:= habanalabs.o
> +
> +habanalabs-y := habanalabs_drv.o device.o
> \ No newline at end of file
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> new file mode 100644
> index 000000000000..376b55eb73d4
> --- /dev/null
> +++ b/drivers/misc/habanalabs/device.c
> @@ -0,0 +1,331 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +
> +#include <linux/fs.h>
> +#include <linux/kthread.h>
> +#include <linux/sched/signal.h>
> +
> +static void hpriv_release(struct kref *ref)
> +{
> +	struct hl_fpriv *hpriv;
> +	struct hl_device *hdev;
> +
> +	hpriv = container_of(ref, struct hl_fpriv, refcount);
> +
> +	hdev = hpriv->hdev;
> +
> +	put_pid(hpriv->taskpid);
> +
> +	kfree(hpriv);
> +}
> +
> +void hl_hpriv_get(struct hl_fpriv *hpriv)
> +{
> +	kref_get(&hpriv->refcount);
> +}
> +
> +void hl_hpriv_put(struct hl_fpriv *hpriv)
> +{
> +	kref_put(&hpriv->refcount, hpriv_release);
> +}
> +
> +/**
> + * hl_device_release - release function for habanalabs device
> + *
> + * @inode: pointer to inode structure
> + * @filp: pointer to file structure
> + *
> + * Called when process closes an habanalabs device
> + */

It's nice to see docs coming along with the code.
I have some comments on the formatting.

kernel-doc won't be happy about missing return value descriptions, and
although they are sometimes redundant or too obvious their absence makes
'make V=1 htmldocs' really noisy.

In general, it would be nice if you could link the habanalabs driver
kernel-doc somewhere in Documentation/ and run 'make V=1 htmldocs'.

> +static int hl_device_release(struct inode *inode, struct file *filp)
> +{
> +	struct hl_fpriv *hpriv = filp->private_data;
> +
> +	filp->private_data = NULL;
> +
> +	hl_hpriv_put(hpriv);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations hl_ops = {
> +	.owner = THIS_MODULE,
> +	.open = hl_device_open,
> +	.release = hl_device_release
> +};
> +
> +/**
> + * device_setup_cdev - setup cdev and device for habanalabs device
> + *
> + * @hdev: pointer to habanalabs device structure
> + * @hclass: pointer to the class object of the device
> + * @minor: minor number of the specific device
> + * @fpos : file operations to install for this device
> + *
> + * Create a cdev and a Linux device for habanalabs's device. Need to be
> + * called at the end of the habanalabs device initialization process,
> + * because this function exposes the device to the user
> + */
> +static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
> +				int minor, const struct file_operations *fops)
> +{
> +	int err, devno = MKDEV(hdev->major, minor);
> +	struct cdev *hdev_cdev = &hdev->cdev;
> +	char name[8];
> +
> +	sprintf(name, "hl%d", hdev->id);
> +
> +	cdev_init(hdev_cdev, fops);
> +	hdev_cdev->owner = THIS_MODULE;
> +	err = cdev_add(hdev_cdev, devno, 1);
> +	if (err) {
> +		pr_err("habanalabs: Failed to add char device %s", name);
> +		goto err_cdev_add;
> +	}
> +
> +	hdev->dev = device_create(hclass, NULL, devno, NULL, "%s", name);
> +	if (IS_ERR(hdev->dev)) {
> +		pr_err("habanalabs: Failed to create device %s\n", name);
> +		err = PTR_ERR(hdev->dev);
> +		goto err_device_create;
> +	}
> +
> +	dev_set_drvdata(hdev->dev, hdev);
> +
> +	return 0;
> +
> +err_device_create:
> +	cdev_del(hdev_cdev);
> +err_cdev_add:
> +	return err;
> +}
> +
> +/**
> + * device_early_init - do some early initialization for the habanalabs device
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + * Install the relevant function pointers and call the early_init function,
> + * if such a function exists
> + */
> +static int device_early_init(struct hl_device *hdev)
> +{
> +	switch (hdev->asic_type) {
> +	case ASIC_GOYA:
> +		sprintf(hdev->asic_name, "GOYA");
> +		break;
> +	default:
> +		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
> +			hdev->asic_type);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * device_early_fini - finalize all that was done in device_early_fini

                                                                    ^init
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + */
> +static void device_early_fini(struct hl_device *hdev)
> +{
> +}
> +
> +/**
> + * hl_device_suspend - initiate device suspend
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + * Puts the hw in the suspend state (all asics).
> + * Returns 0 for success or an error on failure.

Should be Return: or Returns: for kernel-doc to understand it.

> + * Called at driver suspend.

This probably should be marked as Context:
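
For reference, a sketch of the layout kernel-doc expects, with the calling
context and the return value in their own sections. The struct and function
here are user-space stand-ins (demo_device and demo_device_suspend are
hypothetical names, not part of the patch) so that the snippet builds outside
the kernel; only the comment format matters.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct hl_device, for illustration only. */
struct demo_device {
	bool suspended;
};

/**
 * demo_device_suspend() - initiate device suspend
 * @dev: pointer to the demo device structure
 *
 * Puts the device in the suspend state.
 *
 * Context: Called at driver suspend.
 * Return: 0 for success or a negative value on failure.
 */
int demo_device_suspend(struct demo_device *dev)
{
	if (!dev)
		return -1;	/* stand-in for a negative errno such as -EINVAL */

	dev->suspended = true;
	return 0;
}
```

With "Context:" and "Return:" spelled this way, the text should land in
dedicated sections of the generated HTML rather than in the free-form
description.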

> + */
> +int hl_device_suspend(struct hl_device *hdev)
> +{
> +	pci_save_state(hdev->pdev);
> +
> +	/* Shut down the device */
> +	pci_disable_device(hdev->pdev);
> +	pci_set_power_state(hdev->pdev, PCI_D3hot);
> +
> +	return 0;
> +}
> +
> +/**
> + * hl_device_resume - initiate device resume
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + * Bring the hw back to operating state (all asics).
> + * Returns 0 for success or an error on failure.
> + * Called at driver resume.

Same comments as for the previous functions.

> + */
> +int hl_device_resume(struct hl_device *hdev)
> +{
> +	int rc;
> +
> +	pci_set_power_state(hdev->pdev, PCI_D0);
> +	pci_restore_state(hdev->pdev);
> +	rc = pci_enable_device(hdev->pdev);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to enable PCI device in resume\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * hl_device_init - main initialization function for habanalabs device
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + * Allocate an id for the device, do early initialization and then call the
> + * ASIC specific initialization functions. Finally, create the cdev and the
> + * Linux device to expose it to the user
> + */
> +int hl_device_init(struct hl_device *hdev, struct class *hclass)
> +{
> +	int rc;
> +
> +	/* Create device */
> +	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
> +
> +	if (rc)
> +		goto out_disabled;
> +
> +	/* Initialize ASIC function pointers and perform early init */
> +	rc = device_early_init(hdev);
> +	if (rc)
> +		goto release_device;
> +
> +	dev_notice(hdev->dev,
> +		"Successfully added device to habanalabs driver\n");
> +
> +	return 0;
> +
> +release_device:
> +	device_destroy(hclass, hdev->dev->devt);
> +	cdev_del(&hdev->cdev);
> +out_disabled:
> +	hdev->disabled = true;
> +	if (hdev->pdev)
> +		dev_err(&hdev->pdev->dev,
> +			"Failed to initialize hl%d. Device is NOT usable !!!\n",
> +			hdev->id);
> +	else
> +		pr_err("habanalabs: Failed to initialize hl%d. Device is NOT usable !!!\n",
> +			hdev->id);

Maybe three exclamation marks would be too much?

> +
> +	return rc;
> +}
> +
> +/**
> + * hl_device_fini - main tear-down function for habanalabs device
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + * Destroy the device, call ASIC fini functions and release the id
> + */
> +void hl_device_fini(struct hl_device *hdev)
> +{
> +	dev_info(hdev->dev, "Removing device\n");
> +
> +	/* Mark device as disabled */
> +	hdev->disabled = true;
> +
> +	device_early_fini(hdev);
> +
> +	/* Hide device from user */
> +	device_destroy(hdev->dev->class, hdev->dev->devt);
> +	cdev_del(&hdev->cdev);
> +
> +	pr_info("habanalabs: removed device successfully\n");
> +}
> +
> +/**
> + * hl_poll_timeout_memory - Periodically poll a host memory address
> + *                              until it is not zero or a timeout occurs
> + * @hdev: pointer to habanalabs device structure
> + * @addr: Address to poll
> + * @timeout_us: timeout in us
> + * @val: Variable to read the value into
> + *
> + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> + * case, the last read value at @addr is stored in @val. Must not
> + * be called from atomic context if sleep_us or timeout_us are used.
> + *
> + * The function sleeps for 100us with timeout value of
> + * timeout_us
> + */
> +int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr,
> +				u32 timeout_us, u32 *val)
> +{
> +	/*
> +	 * pReturnVal is defined as volatile because it points to HOST memory,
> +	 * which is being written to by the device. Therefore, we can't use
> +	 * locks to synchronize it and it is not a memory-mapped register space
> +	 */
> +	volatile u32 *pReturnVal = (volatile u32 *) addr;
> +	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
> +
> +	might_sleep();
> +
> +	for (;;) {
> +		*val = *pReturnVal;
> +		if (*val)
> +			break;
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			*val = *pReturnVal;
> +			break;
> +		}
> +		usleep_range((100 >> 2) + 1, 100);
> +	}
> +
> +	return (*val ? 0 : -ETIMEDOUT);
> +}
> +
> +/**
> + * hl_poll_timeout_devicememory - Periodically poll a device memory address
> + *                                until it is not zero or a timeout occurs
> + * @hdev: pointer to habanalabs device structure
> + * @addr: Device address to poll
> + * @timeout_us: timeout in us
> + * @val: Variable to read the value into
> + *
> + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> + * case, the last read value at @addr is stored in @val. Must not
> + * be called from atomic context if sleep_us or timeout_us are used.
> + *
> + * The function sleeps for 100us with timeout value of
> + * timeout_us
> + */
> +int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> +				u32 timeout_us, u32 *val)
> +{
> +	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
> +
> +	might_sleep();
> +
> +	for (;;) {
> +		*val = readl(addr);
> +		if (*val)
> +			break;
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			*val = readl(addr);
> +			break;
> +		}
> +		usleep_range((100 >> 2) + 1, 100);
> +	}
> +
> +	return (*val ? 0 : -ETIMEDOUT);
> +}
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> new file mode 100644
> index 000000000000..7e1b088b677c
> --- /dev/null
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -0,0 +1,149 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef HABANALABSP_H_
> +#define HABANALABSP_H_
> +
> +#include "include/habanalabs_device_if.h"
> +
> +#include <linux/pci.h>
> +#include <linux/types.h>
> +#include <linux/cdev.h>
> +#include <linux/interrupt.h>
> +#include <linux/iopoll.h>
> +#include <linux/dma-fence.h>
> +#include <linux/hashtable.h>
> +#include <linux/hwmon.h>
> +
> +#define HL_NAME				"habanalabs"
> +
> +struct hl_device;
> +
> +
> +
> +
> +
> +

Too many blank lines, IMHO.

> +/*
> + * ASICs
> + */
> +
> +/**
> + * enum hl_asic_type - supported ASIC types.
> + * @ASIC_AUTO_DETECT: ASIC type will be automatically set.
> + * @ASIC_GOYA: Goya device.
> + * @ASIC_LAST: last ASIC type.
> + */
> +enum hl_asic_type {
> +	ASIC_AUTO_DETECT,
> +	ASIC_GOYA,
> +	ASIC_LAST
> +};
> +
> +
> +
> +
> +
> +/*
> + * FILE PRIVATE STRUCTURE
> + */
> +
> +/**
> + * struct hl_fpriv - process information stored in FD private data.
> + * @hdev: habanalabs device structure.
> + * @filp: pointer to the given file structure.
> + * @taskpid: current process ID.
> + * @refcount: number of related contexts.
> + */
> +struct hl_fpriv {
> +	struct hl_device	*hdev;
> +	struct file		*filp;
> +	struct pid		*taskpid;
> +	struct kref		refcount;
> +};
> +
> +
> +
> +
> +/*
> + * DEVICES
> + */
> +
> +/* Theoretical limit only. A single host can only contain up to 4 or 8 PCIe
> + * x16 cards. In extreme cases, there are hosts that can accommodate 16 cards
> + */
> +#define HL_MAX_MINORS	256
> +
> +/**
> + * struct hl_device - habanalabs device structure.
> + * @pdev: pointer to PCI device, can be NULL in case of simulator device.
> + * @cdev: related char device.
> + * @dev: related kernel basic device structure.
> + * @asic_name: ASIC specific name.
> + * @asic_type: ASIC specific type.
> + * @major: habanalabs KMD major.
> + * @id: device minor.
> + * @disabled: is device disabled.
> + */
> +struct hl_device {
> +	struct pci_dev			*pdev;
> +	struct cdev			cdev;
> +	struct device			*dev;
> +	char				asic_name[16];
> +	enum hl_asic_type		asic_type;
> +	u32				major;
> +	u16				id;
> +	u8				disabled;
> +};
> +
> +/*
> + * IOCTLs
> + */
> +
> +/**
> + * typedef hl_ioctl_t - typedef for ioctl function in the driver
> + * @hpriv: pointer to the FD's private data, which contains state of
> + *		user process
> + * @data: pointer to the input/output arguments structure of the IOCTL
> + *
> + * Return: 0 for success, negative value for error
> + */
> +typedef int hl_ioctl_t(struct hl_fpriv *hpriv, void *data);
> +
> +/**
> + * struct hl_ioctl_desc - describes an IOCTL entry of the driver.
> + * @cmd: the IOCTL code as created by the kernel macros.
> + * @func: pointer to the driver's function that should be called for this IOCTL.
> + */
> +struct hl_ioctl_desc {
> +	unsigned int cmd;
> +	hl_ioctl_t *func;
> +};
> +
> +
> +
> +
> +
> +/*
> + * Kernel module functions that can be accessed by entire module
> + */
> +
> +int hl_device_open(struct inode *inode, struct file *filp);
> +int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> +		enum hl_asic_type asic_type, int minor);
> +void destroy_hdev(struct hl_device *hdev);
> +int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
> +				u32 *val);
> +int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> +				u32 timeout_us, u32 *val);
> +
> +int hl_device_init(struct hl_device *hdev, struct class *hclass);
> +void hl_device_fini(struct hl_device *hdev);
> +int hl_device_suspend(struct hl_device *hdev);
> +int hl_device_resume(struct hl_device *hdev);
> +
> +#endif /* HABANALABSP_H_ */
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> new file mode 100644
> index 000000000000..15217975327b
> --- /dev/null
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -0,0 +1,366 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> + *
> + */
> +
> +#include "habanalabs.h"
> +
> +#include <linux/device.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/kthread.h>
> +
> +#include <linux/fs.h>
> +
> +#define HL_DRIVER_AUTHOR	"HabanaLabs Kernel Driver Team"
> +
> +#define HL_DRIVER_DESC		"Driver for HabanaLabs's AI Accelerators"
> +
> +MODULE_AUTHOR(HL_DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(HL_DRIVER_DESC);
> +MODULE_LICENSE("GPL v2");
> +
> +static int hl_major;
> +static struct class *hl_class;
> +DEFINE_IDR(hl_devs_idr);
> +DEFINE_MUTEX(hl_devs_idr_lock);
> +
> +#define PCI_VENDOR_ID_HABANALABS	0x1da3
> +
> +#define PCI_IDS_GOYA			0x0001
> +
> +static struct pci_device_id ids[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_HABANALABS, PCI_IDS_GOYA), },
> +	{ 0, }
> +};
> +MODULE_DEVICE_TABLE(pci, ids);
> +
> +/**
> + * get_asic_type - translate device id to asic type
> + *
> + * @device: id of the PCI device
> + * @asic_type: pointer that will be filled by the asic type
> + *
> + * Translate device id to asic type.
> + * In case of unidentified device, return -1
> + */
> +static int get_asic_type(u16 device, enum hl_asic_type *asic_type)

This can simply return the hl_asic_type; see also the comment in
create_hdev().
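
A rough sketch of what I mean, written as plain user-space C so it can be
compiled on its own. ASIC_INVALID is an assumption on my side (the patch has
no such value); any sentinel that is not a real ASIC type would do. uint16_t
stands in for the kernel's u16.

```c
#include <stdint.h>

/* Mirrors the enum from the patch, plus a hypothetical ASIC_INVALID value. */
enum hl_asic_type {
	ASIC_AUTO_DETECT,
	ASIC_GOYA,
	ASIC_LAST,
	ASIC_INVALID	/* assumption: not in the patch, marks an unknown device */
};

#define PCI_IDS_GOYA	0x0001

/* Translate a PCI device id to an ASIC type, ASIC_INVALID if unrecognized. */
static enum hl_asic_type get_asic_type(uint16_t device)
{
	switch (device) {
	case PCI_IDS_GOYA:
		return ASIC_GOYA;
	default:
		return ASIC_INVALID;
	}
}
```

The caller would then compare the result against ASIC_INVALID instead of
checking a separate rc, and the out parameter goes away.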

> +{
> +	int rc = 0;
> +
> +	switch (device) {
> +	case PCI_IDS_GOYA:
> +		*asic_type = ASIC_GOYA;
> +		break;
> +	default:
> +		*asic_type = rc = -1;
> +		break;
> +	}
> +
> +	return rc;
> +}
> +
> +/**
> + * hl_device_open - open function for habanalabs device
> + *
> + * @inode: pointer to inode structure
> + * @filp: pointer to file structure
> + *
> + * Called when process opens an habanalabs device.
> + */
> +int hl_device_open(struct inode *inode, struct file *filp)
> +{
> +	struct hl_device *hdev;
> +	struct hl_fpriv *hpriv;
> +
> +	mutex_lock(&hl_devs_idr_lock);
> +	hdev = idr_find(&hl_devs_idr, iminor(inode));
> +	mutex_unlock(&hl_devs_idr_lock);
> +
> +	if (!hdev) {
> +		pr_err("habanalabs: Couldn't find device %d:%d\n",
> +			imajor(inode), iminor(inode));
> +		return -ENXIO;
> +	}
> +
> +	hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
> +	if (!hpriv)
> +		return -ENOMEM;
> +
> +	hpriv->hdev = hdev;
> +	filp->private_data = hpriv;
> +	hpriv->filp = filp;
> +	kref_init(&hpriv->refcount);
> +	nonseekable_open(inode, filp);
> +
> +	hpriv->taskpid = find_get_pid(current->pid);
> +
> +	return 0;
> +}
> +
> +/**
> + * create_hdev - create habanalabs device instance
> + *
> + * @dev: will hold the pointer to the new habanalabs device structure
> + * @pdev: pointer to the pci device
> + * @asic_type: in case of simulator device, which device is it
> + * @minor: in case of simulator device, the minor of the device
> + *
> + * Allocate memory for habanalabs device and initialize basic fields
> + * Identify the ASIC type
> + * Allocate ID (minor) for the device (only for real devices)
> + */
> +int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> +		enum hl_asic_type asic_type, int minor)
> +{
> +	struct hl_device *hdev;
> +	int rc;
> +
> +	*dev = NULL;
> +
> +	hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
> +	if (!hdev) {
> +		if (pdev)
> +			dev_err(&pdev->dev,
> +				"Not enough memory for habanalabs device\n");
> +		else
> +			pr_err("habanalabs: Not enough memory for  device\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	hdev->major = hl_major;
> +
> +	hdev->disabled = true;
> +	hdev->pdev = pdev; /* can be NULL in case of simulator device */
> +
> +	if (asic_type == ASIC_AUTO_DETECT) {
> +		rc = get_asic_type(pdev->device, &hdev->asic_type);

You can just make it

		hdev->asic_type = get_asic_type(pdev->device);

> +		if (rc) {
> +			dev_err(&pdev->dev, "Unsupported ASIC\n");
> +			rc = -ENODEV;
> +			goto free_hdev;
> +		}
> +	} else {
> +		hdev->asic_type = asic_type;
> +	}

In the current version create_hdev() is always called with
ASIC_AUTO_DETECT. What are the use cases for other types?

> +
> +	mutex_lock(&hl_devs_idr_lock);
> +
> +	if (minor == -1) {
> +		rc = idr_alloc(&hl_devs_idr, hdev, 0, HL_MAX_MINORS,
> +				GFP_KERNEL);
> +	} else {
> +		idr_replace(&hl_devs_idr, hdev, minor);

idr_replace can fail, can't it?
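
As far as I remember, idr_replace() returns the old pointer on success and
an ERR_PTR() value (for example -ENOENT when nothing is stored at that id)
on failure. A sketch of the check, using user-space stand-ins so it builds
outside the kernel: demo_idr_replace(), demo_claim_minor() and demo_slots
are hypothetical, and the ERR_PTR helpers are simplified copies of the ones
in linux/err.h.

```c
#include <errno.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's ERR_PTR helpers (linux/err.h). */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

#define DEMO_MAX_MINORS	4
static void *demo_slots[DEMO_MAX_MINORS];

/*
 * Toy model of idr_replace(): returns the old pointer on success and
 * ERR_PTR(-ENOENT) if no entry is stored at @id.
 */
static void *demo_idr_replace(void *ptr, int id)
{
	void *old;

	if (id < 0 || id >= DEMO_MAX_MINORS || !demo_slots[id])
		return ERR_PTR(-ENOENT);

	old = demo_slots[id];
	demo_slots[id] = ptr;
	return old;
}

/* Caller pattern: propagate the error instead of assuming success. */
static int demo_claim_minor(void *hdev, int minor)
{
	void *old = demo_idr_replace(hdev, minor);

	if (IS_ERR(old))
		return (int)PTR_ERR(old);

	return minor;
}
```

In create_hdev() the equivalent would be to set rc from PTR_ERR() when the
replace fails, so that the existing "rc < 0" path frees hdev.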

> +		rc = minor;
> +	}
> +
> +	mutex_unlock(&hl_devs_idr_lock);
> +
> +	if (rc < 0) {
> +		if (rc == -ENOSPC) {
> +			pr_err("habanalabs: too many devices in the system\n");
> +			rc = -EBUSY;
> +		}
> +		goto free_hdev;
> +	}
> +
> +	hdev->id = rc;
> +
> +	*dev = hdev;
> +
> +	return 0;
> +
> +free_hdev:
> +	kfree(hdev);
> +	return rc;
> +}
> +
> +/**
> + * destroy_hdev - destroy habanalabs device instance
> + *
> + * @dev: pointer to the habanalabs device structure
> + *
> + */
> +void destroy_hdev(struct hl_device *hdev)
> +{
> +	/* Remove device from the device list */
> +	mutex_lock(&hl_devs_idr_lock);
> +	idr_remove(&hl_devs_idr, hdev->id);
> +	mutex_unlock(&hl_devs_idr_lock);
> +
> +	kfree(hdev);
> +}
> +
> +static int hl_pmops_suspend(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct hl_device *hdev = pci_get_drvdata(pdev);
> +
> +	pr_debug("habanalabs: Going to suspend PCI device\n");
> +
> +	if (!hdev) {
> +		pr_err("habanalabs: device pointer is NULL in suspend\n");
> +		return 0;
> +	}
> +
> +	return hl_device_suspend(hdev);
> +}
> +
> +static int hl_pmops_resume(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct hl_device *hdev = pci_get_drvdata(pdev);
> +
> +	pr_debug("habanalabs: Going to resume PCI device\n");
> +
> +	if (!hdev) {
> +		pr_err("habanalabs: device pointer is NULL in resume\n");
> +		return 0;
> +	}
> +
> +	return hl_device_resume(hdev);
> +}
> +
> +/**
> + * hl_pci_probe - probe PCI habanalabs devices
> + *
> + * @pdev: pointer to pci device
> + * @id: pointer to pci device id structure
> + *
> + * Standard PCI probe function for habanalabs device.
> + * Create a new habanalabs device and initialize it according to the
> + * device's type
> + */
> +static int hl_pci_probe(struct pci_dev *pdev,
> +				const struct pci_device_id *id)
> +{
> +	struct hl_device *hdev;
> +	int rc;
> +
> +	dev_info(&pdev->dev, HL_NAME
> +		 " device found [%04x:%04x] (rev %x)\n",
> +		 (int)pdev->vendor, (int)pdev->device, (int)pdev->revision);
> +
> +	rc = create_hdev(&hdev, pdev, ASIC_AUTO_DETECT, -1);
> +	if (rc)
> +		return rc;
> +
> +	pci_set_drvdata(pdev, hdev);
> +
> +	rc = hl_device_init(hdev, hl_class);
> +	if (rc) {
> +		dev_err(&pdev->dev, "Fatal error during habanalabs device init\n");
> +		rc = -ENODEV;
> +		goto disable_device;
> +	}
> +
> +	return 0;
> +
> +disable_device:
> +	pci_set_drvdata(pdev, NULL);
> +	destroy_hdev(hdev);
> +
> +	return rc;
> +}
> +
> +/**
> + * hl_pci_remove - remove PCI habanalabs devices
> + *
> + * @pdev: pointer to pci device
> + *
> + * Standard PCI remove function for habanalabs device
> + */
> +static void hl_pci_remove(struct pci_dev *pdev)
> +{
> +	struct hl_device *hdev;
> +
> +	hdev = pci_get_drvdata(pdev);
> +	if (!hdev)
> +		return;
> +
> +	hl_device_fini(hdev);
> +	pci_set_drvdata(pdev, NULL);
> +
> +	destroy_hdev(hdev);
> +}
> +
> +static const struct dev_pm_ops hl_pm_ops = {
> +	.suspend = hl_pmops_suspend,
> +	.resume = hl_pmops_resume,
> +};
> +
> +static struct pci_driver hl_pci_driver = {
> +	.name = HL_NAME,
> +	.id_table = ids,
> +	.probe = hl_pci_probe,
> +	.remove = hl_pci_remove,
> +	.driver.pm = &hl_pm_ops,
> +};
> +
> +/**
> + * hl_init - Initialize the habanalabs kernel driver
> + *
> + */
> +static int __init hl_init(void)
> +{
> +	int rc;
> +	dev_t dev;
> +
> +	pr_info("habanalabs: loading driver\n");
> +
> +	rc = alloc_chrdev_region(&dev, 0, HL_MAX_MINORS, HL_NAME);
> +	if (rc < 0) {
> +		pr_err("habanalabs: unable to get major\n");
> +		return rc;
> +	}
> +
> +	hl_major = MAJOR(dev);
> +
> +	hl_class = class_create(THIS_MODULE, HL_NAME);
> +	if (IS_ERR(hl_class)) {
> +		pr_err("habanalabs: failed to allocate class\n");
> +		rc = PTR_ERR(hl_class);
> +		goto remove_major;
> +	}
> +
> +	rc = pci_register_driver(&hl_pci_driver);
> +	if (rc) {
> +		pr_err("habanalabs: failed to register pci device\n");
> +		goto remove_class;
> +	}
> +
> +	pr_debug("habanalabs: driver loaded\n");
> +
> +	return 0;
> +
> +remove_class:
> +	class_destroy(hl_class);
> +remove_major:
> +	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
> +	return rc;
> +}
> +
> +/**
> + * hl_exit - Release all resources of the habanalabs kernel driver
> + *
> + */
> +static void __exit hl_exit(void)
> +{
> +	pci_unregister_driver(&hl_pci_driver);
> +
> +	class_destroy(hl_class);
> +	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
> +
> +	idr_destroy(&hl_devs_idr);
> +
> +	pr_debug("habanalabs: driver removed\n");
> +}
> +
> +module_init(hl_init);
> +module_exit(hl_exit);
> diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> new file mode 100644
> index 000000000000..9dbb7077eabd
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef HABANALABS_DEVICE_IF_H
> +#define HABANALABS_DEVICE_IF_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * PRIMARY QUEUE
> + */
> +
> +struct hl_bd {
> +	__u64	ptr;
> +	__u32	len;
> +	union {
> +		struct {
> +			__u32	repeat:16;
> +			__u32	res1:8;
> +			__u32	repeat_valid:1;
> +			__u32	res2:7;
> +		};
> +		__u32	ctl;
> +	};
> +};
> +
> +#define HL_BD_SIZE			sizeof(struct hl_bd)
> +
> +/*
> + * BD_CTL_REPEAT_VALID tells the CP whether the repeat field in the BD CTL is
> + * valid. 1 means the repeat field is valid, 0 means not-valid,
> + * i.e. repeat == 1
> + */
> +#define BD_CTL_REPEAT_VALID_SHIFT	24
> +#define BD_CTL_REPEAT_VALID_MASK	0x01000000
> +
> +#define BD_CTL_SHADOW_INDEX_SHIFT	0
> +#define BD_CTL_SHADOW_INDEX_MASK	0x00000FFF
> +
> +/*
> + * COMPLETION QUEUE
> + */
> +
> +struct hl_cq_entry {
> +	__u32	data;
> +};
> +
> +#define HL_CQ_ENTRY_SIZE		sizeof(struct hl_cq_entry)
> +
> +#define CQ_ENTRY_READY_SHIFT			31
> +#define CQ_ENTRY_READY_MASK			0x80000000
> +
> +#define CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT	30
> +#define CQ_ENTRY_SHADOW_INDEX_VALID_MASK	0x40000000
> +
> +#define CQ_ENTRY_SHADOW_INDEX_SHIFT		BD_CTL_SHADOW_INDEX_SHIFT
> +#define CQ_ENTRY_SHADOW_INDEX_MASK		BD_CTL_SHADOW_INDEX_MASK
> +
> +/*
> + * EVENT QUEUE
> + */
> +
> +struct hl_eq_header {
> +	__u32 reserved;
> +	union {
> +		struct {
> +			__u32 ctx_id :10;
> +			__u32:6;
> +			__u32 opcode :10;
> +			__u32:5;
> +			__u32 ready :1;
> +		};
> +		__u32 ctl;
> +	};
> +};
> +
> +struct hl_eq_entry {
> +	struct hl_eq_header hdr;
> +	__u64 data[7];
> +};
> +
> +#define HL_EQ_ENTRY_SIZE		sizeof(struct hl_eq_entry)
> +
> +#define EQ_CTL_READY_SHIFT		31
> +#define EQ_CTL_READY_MASK		0x80000000
> +
> +#define EQ_CTL_EVENT_TYPE_SHIFT		16
> +#define EQ_CTL_EVENT_TYPE_MASK		0x03FF0000
> +
> +enum pq_init_status {
> +	PQ_INIT_STATUS_NA = 0,
> +	PQ_INIT_STATUS_READY_FOR_CP,
> +	PQ_INIT_STATUS_READY_FOR_HOST
> +};
> +
> +/*
> + * ArmCP info
> + */
> +
> +#define VERSION_MAX_LEN			128
> +#define ARMCP_MAX_SENSORS		128
> +
> +struct armcp_sensor {
> +	__u32 type;
> +	__u32 flags;
> +};
> +
> +/* must be aligned to 4 bytes */
> +struct armcp_info {
> +	struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
> +	__u8 kernel_version[VERSION_MAX_LEN];
> +	__u32 reserved[3];
> +	__u32 cpld_version;
> +	__u32 infineon_version;
> +	__u8 fuse_version[VERSION_MAX_LEN];
> +	__u8 thermal_version[VERSION_MAX_LEN];
> +	__u8 armcp_version[VERSION_MAX_LEN];
> +	__u64 dram_size;
> +};
> +
> +#endif /* HABANALABS_DEVICE_IF_H */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 03/15] habanalabs: add basic Goya support
  2019-01-23  0:00 ` [PATCH 03/15] habanalabs: add basic Goya support Oded Gabbay
@ 2019-01-23 12:28   ` Mike Rapoport
  2019-01-25 20:32     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-23 12:28 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:45AM +0200, Oded Gabbay wrote:
> This patch adds a basic support for the Goya device. The code initializes
> the device's PCI controller and PCI bars. It also initializes various S/W
> structures and adds some basic helper functions.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/Makefile            |   5 +-
>  drivers/misc/habanalabs/device.c            |  71 +++
>  drivers/misc/habanalabs/goya/Makefile       |   3 +
>  drivers/misc/habanalabs/goya/goya.c         | 633 ++++++++++++++++++++
>  drivers/misc/habanalabs/goya/goyaP.h        | 125 ++++
>  drivers/misc/habanalabs/habanalabs.h        | 131 ++++
>  drivers/misc/habanalabs/habanalabs_drv.c    |   3 +
>  drivers/misc/habanalabs/include/goya/goya.h | 115 ++++
>  8 files changed, 1085 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/misc/habanalabs/goya/Makefile
>  create mode 100644 drivers/misc/habanalabs/goya/goya.c
>  create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index b41433a09e02..6f1ead69bd77 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -4,4 +4,7 @@
>  
>  obj-m	:= habanalabs.o
>  
> -habanalabs-y := habanalabs_drv.o device.o
> \ No newline at end of file
> +habanalabs-y := habanalabs_drv.o device.o
> +
> +include $(src)/goya/Makefile
> +habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 376b55eb73d4..a4276ef559b3 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -116,8 +116,11 @@ static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
>   */
>  static int device_early_init(struct hl_device *hdev)
>  {
> +	int rc;
> +
>  	switch (hdev->asic_type) {
>  	case ASIC_GOYA:
> +		goya_set_asic_funcs(hdev);
>  		sprintf(hdev->asic_name, "GOYA");
>  		break;
>  	default:
> @@ -126,6 +129,10 @@ static int device_early_init(struct hl_device *hdev)
>  		return -EINVAL;
>  	}
>  
> +	rc = hdev->asic_funcs->early_init(hdev);
> +	if (rc)
> +		return rc;
> +
>  	return 0;
>  }
>  
> @@ -137,6 +144,10 @@ static int device_early_init(struct hl_device *hdev)
>   */
>  static void device_early_fini(struct hl_device *hdev)
>  {
> +
> +	if (hdev->asic_funcs->early_fini)
> +		hdev->asic_funcs->early_fini(hdev);
> +
>  }
>  
>  /**
> @@ -150,8 +161,15 @@ static void device_early_fini(struct hl_device *hdev)
>   */
>  int hl_device_suspend(struct hl_device *hdev)
>  {
> +	int rc;
> +
>  	pci_save_state(hdev->pdev);
>  
> +	rc = hdev->asic_funcs->suspend(hdev);
> +	if (rc)
> +		dev_err(hdev->dev,
> +			"Failed to disable PCI access of device CPU\n");
> +
>  	/* Shut down the device */
>  	pci_disable_device(hdev->pdev);
>  	pci_set_power_state(hdev->pdev, PCI_D3hot);
> @@ -181,6 +199,13 @@ int hl_device_resume(struct hl_device *hdev)
>  		return rc;
>  	}
>  
> +	rc = hdev->asic_funcs->resume(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to enable PCI access from device CPU\n");
> +		return rc;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -208,11 +233,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  	if (rc)
>  		goto release_device;
>  
> +	/*
> +	 * Start calling ASIC initialization. First S/W then H/W and finally
> +	 * late init
> +	 */
> +	rc = hdev->asic_funcs->sw_init(hdev);
> +	if (rc)
> +		goto early_fini;
> +
>  	dev_notice(hdev->dev,
>  		"Successfully added device to habanalabs driver\n");
>  
>  	return 0;
>  
> +early_fini:
> +	device_early_fini(hdev);
>  release_device:
>  	device_destroy(hclass, hdev->dev->devt);
>  	cdev_del(&hdev->cdev);
> @@ -243,6 +278,9 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Mark device as disabled */
>  	hdev->disabled = true;
>  
> +	/* Call ASIC S/W finalize function */
> +	hdev->asic_funcs->sw_fini(hdev);
> +
>  	device_early_fini(hdev);
>  
>  	/* Hide device from user */
> @@ -329,3 +367,36 @@ int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
>  
>  	return (*val ? 0 : -ETIMEDOUT);
>  }
> +
> +/*
> + * MMIO register access helper functions.
> + */
> +
> +/**
> + * hl_rreg - Read an MMIO register
> + *
> + * @hdev: pointer to habanalabs device structure
> + * @reg: MMIO register offset (in bytes)
> + *
> + * Returns the value of the MMIO register we are asked to read
> + *
> + */
> +inline u32 hl_rreg(struct hl_device *hdev, u32 reg)
> +{
> +	return readl(hdev->rmmio + reg);
> +}
> +
> +/**
> + * hl_wreg - Write to an MMIO register
> + *
> + * @hdev: pointer to habanalabs device structure
> + * @reg: MMIO register offset (in bytes)
> + * @val: 32-bit value
> + *
> + * Writes the 32-bit value into the MMIO register
> + *
> + */
> +inline void hl_wreg(struct hl_device *hdev, u32 reg, u32 val)
> +{
> +	writel(val, hdev->rmmio + reg);
> +}
> diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
> new file mode 100644
> index 000000000000..5ebf3d0d5794
> --- /dev/null
> +++ b/drivers/misc/habanalabs/goya/Makefile
> @@ -0,0 +1,3 @@
> +subdir-ccflags-y += -I$(src)
> +
> +HL_GOYA_FILES :=  goya/goya.o
> \ No newline at end of file
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> new file mode 100644
> index 000000000000..b2952296b890
> --- /dev/null
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -0,0 +1,633 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "goyaP.h"
> +#include "include/goya/asic_reg/goya_masks.h"
> +
> +#include <linux/fs.h>
> +#include <linux/delay.h>
> +#include <linux/vmalloc.h>
> +#include <linux/sched.h>
> +#include <linux/genalloc.h>
> +#include <linux/sysfs.h>
> +#include <linux/kfifo.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/firmware.h>
> +#include <linux/log2.h>
> +#include <linux/hwmon.h>
> +#include <linux/string.h>
> +#include <linux/io.h>
> +
> +/*
> + * GOYA security scheme:
> + *
> + * 1. Host is protected by:
> + *        - Range registers (When MMU is enabled, DMA RR does NOT protect host)
> + *        - MMU
> + *
> + * 2. DRAM is protected by:
> + *        - Range registers (protect the first 512MB)
> + *        - MMU (isolation between users)
> + *
> + * 3. Configuration is protected by:
> + *        - Range registers
> + *        - Protection bits
> + *
> + * When MMU is disabled:
> + *
> + * QMAN DMA: PQ, CQ, CP, DMA are secured.
> + * PQ, CB and the data are on the host.
> + *
> + * QMAN TPC/MME:
> + * PQ, CQ and CP are not secured.
> + * PQ, CB and the data are on the SRAM/DRAM.
> + *
> + * Since QMAN DMA is secured, KMD is parsing the DMA CB:
> + *     - KMD checks DMA pointer
> + *     - WREG, MSG_PROT are not allowed.
> + *     - MSG_LONG/SHORT are allowed.
> + *
> + * A read/write transaction by the QMAN to a protected area will succeed if
> + * and only if the QMAN's CP is secured and MSG_PROT is used
> + *
> + *
> + * When MMU is enabled:
> + *
> + * QMAN DMA: PQ, CQ and CP are secured.
> + * MMU is set to bypass on the Secure props register of the QMAN.
> + * The reasons we don't enable MMU for PQ, CQ and CP are:
> + *     - PQ entry is in kernel address space and KMD doesn't map it.
> + *     - CP writes to MSIX register and to kernel address space (completion
> + *       queue).
> + *
> + * DMA is not secured but because CP is secured, KMD still needs to parse the
> + * CB, but doesn't need to check the DMA addresses.
> + *
> + * For QMAN DMA 0, DMA is also secured because only KMD uses this DMA and KMD
> + * doesn't map memory in MMU.
> + *
> + * QMAN TPC/MME: PQ, CQ and CP aren't secured (no change from MMU disabled mode)
> + *
> + * DMA RR does NOT protect host because DMA is not secured
> + *
> + */
> +
> +#define GOYA_MMU_REGS_NUM		61
> +
> +#define GOYA_DMA_POOL_BLK_SIZE		0x100		/* 256 bytes */
> +
> +#define GOYA_RESET_TIMEOUT_MSEC		500		/* 500ms */
> +#define GOYA_PLDM_RESET_TIMEOUT_MSEC	20000		/* 20s */
> +#define GOYA_RESET_WAIT_MSEC		1		/* 1ms */
> +#define GOYA_CPU_RESET_WAIT_MSEC	100		/* 100ms */
> +#define GOYA_PLDM_RESET_WAIT_MSEC	1000		/* 1s */
> +#define GOYA_CPU_TIMEOUT_USEC		10000000	/* 10s */
> +#define GOYA_TEST_QUEUE_WAIT_USEC	100000		/* 100ms */
> +
> +#define GOYA_QMAN0_FENCE_VAL		0xD169B243
> +
> +#define GOYA_MAX_INITIATORS		20
> +
> +static void goya_get_fixed_properties(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +
> +	prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
> +
> +	prop->dram_base_address = DRAM_PHYS_BASE;
> +	prop->dram_size = DRAM_PHYS_DEFAULT_SIZE;
> +	prop->dram_end_address = prop->dram_base_address + prop->dram_size;
> +	prop->dram_user_base_address = DRAM_BASE_ADDR_USER;
> +
> +	prop->sram_base_address = SRAM_BASE_ADDR;
> +	prop->sram_size = SRAM_SIZE;
> +	prop->sram_end_address = prop->sram_base_address + prop->sram_size;
> +	prop->sram_user_base_address = prop->sram_base_address +
> +						SRAM_USER_BASE_OFFSET;
> +
> +	prop->host_phys_base_address = HOST_PHYS_BASE;
> +	prop->va_space_host_start_address = VA_HOST_SPACE_START;
> +	prop->va_space_host_end_address = VA_HOST_SPACE_END;
> +	prop->va_space_dram_start_address = VA_DDR_SPACE_START;
> +	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
> +	prop->cfg_size = CFG_SIZE;
> +	prop->max_asid = MAX_ASID;
> +	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> +
> +	prop->high_pll = PLL_HIGH_DEFAULT;
> +}
> +
> +/**
> + * goya_pci_bars_map - Map PCI BARS of Goya device
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Request PCI regions and map them to kernel virtual addresses.
> + * Returns 0 on success
> + *
> + */
> +int goya_pci_bars_map(struct hl_device *hdev)
> +{
> +	struct pci_dev *pdev = hdev->pdev;
> +	int rc;

You could just init rc= -ENODEV here and avoid the hassle below.
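
A minimal userspace sketch of that suggestion, assuming made-up stand-ins
(struct fake_pdev, fake_request_regions(), fake_ioremap()) in place of the
real PCI helpers. Because rc is reassigned by the pci_request_regions() call,
the preset most likely has to come right after that check rather than at the
declaration:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PCI device; it only tracks state. */
struct fake_pdev {
	bool request_ok;	/* should the region request succeed? */
	bool map_ok[3];		/* should mapping step i succeed? */
	int mapped;		/* number of currently mapped BARs */
	bool regions_held;	/* are the regions currently requested? */
};

/* Hypothetical replacement for pci_request_regions(). */
static int fake_request_regions(struct fake_pdev *p)
{
	if (!p->request_ok)
		return -EBUSY;
	p->regions_held = true;
	return 0;
}

/* Hypothetical replacement for pci_ioremap_bar(). */
static bool fake_ioremap(struct fake_pdev *p, int i)
{
	if (!p->map_ok[i])
		return false;
	p->mapped++;
	return true;
}

static int bars_map(struct fake_pdev *p)
{
	int rc;

	rc = fake_request_regions(p);
	if (rc)
		return rc;

	/* From here on every failure is a mapping failure. */
	rc = -ENODEV;

	if (!fake_ioremap(p, 0))
		goto err_release_regions;
	if (!fake_ioremap(p, 1))
		goto err_unmap_first;
	if (!fake_ioremap(p, 2))
		goto err_unmap_second;

	return 0;

err_unmap_second:
	p->mapped--;
err_unmap_first:
	p->mapped--;
err_release_regions:
	p->regions_held = false;
	return rc;
}
```

Each failing branch then shrinks to a message plus a goto, and the unwind
labels stay exactly as they are in the patch.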
> +
> +	rc = pci_request_regions(pdev, HL_NAME);
> +	if (rc) {
> +		dev_err(hdev->dev, "Cannot obtain PCI resources\n");
> +		return rc;
> +	}
> +
> +	hdev->pcie_bar[SRAM_CFG_BAR_ID] =
> +			pci_ioremap_bar(pdev, SRAM_CFG_BAR_ID);
> +	if (!hdev->pcie_bar[SRAM_CFG_BAR_ID]) {
> +		dev_err(hdev->dev, "pci_ioremap_bar failed for CFG\n");
> +		rc = -ENODEV;
> +		goto err_release_regions;
> +	}
> +
> +	hdev->pcie_bar[MSIX_BAR_ID] = pci_ioremap_bar(pdev, MSIX_BAR_ID);
> +	if (!hdev->pcie_bar[MSIX_BAR_ID]) {
> +		dev_err(hdev->dev, "pci_ioremap_bar failed for MSIX\n");
> +		rc = -ENODEV;
> +		goto err_unmap_sram_cfg;
> +	}
> +
> +	hdev->pcie_bar[DDR_BAR_ID] = pci_ioremap_wc_bar(pdev, DDR_BAR_ID);
> +	if (!hdev->pcie_bar[DDR_BAR_ID]) {
> +		dev_err(hdev->dev, "pci_ioremap_bar failed for DDR\n");
> +		rc = -ENODEV;
> +		goto err_unmap_msix;
> +	}
> +
> +	hdev->rmmio = hdev->pcie_bar[SRAM_CFG_BAR_ID] +
> +				(CFG_BASE - SRAM_BASE_ADDR);
> +
> +	return 0;
> +
> +err_unmap_msix:
> +	iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
> +err_unmap_sram_cfg:
> +	iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
> +err_release_regions:
> +	pci_release_regions(pdev);
> +
> +	return rc;
> +}
> +
> +/**
> + * goya_pci_bars_unmap - Unmap PCI BARS of Goya device
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Release all PCI BARS and unmap their virtual addresses
> + *
> + */
> +static void goya_pci_bars_unmap(struct hl_device *hdev)
> +{
> +	struct pci_dev *pdev = hdev->pdev;
> +
> +	iounmap(hdev->pcie_bar[DDR_BAR_ID]);
> +	iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
> +	iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
> +	pci_release_regions(pdev);
> +}
> +
> +/**
> + * goya_elbi_write - Write through the ELBI interface
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * return 0 on success, -1 on failure
> + *
> + */
> +static int goya_elbi_write(struct hl_device *hdev, u64 addr, u32 data)
> +{
> +	struct pci_dev *pdev = hdev->pdev;
> +	ktime_t timeout;
> +	u32 val;
> +
> +	/* Clear previous status */
> +	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, 0);
> +
> +	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_ADDR, (u32) addr);
> +	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_DATA, data);
> +	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_CTRL,
> +				PCI_CONFIG_ELBI_CTRL_WRITE);
> +
> +	timeout = ktime_add_ms(ktime_get(), 10);
> +	for (;;) {
> +		pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, &val);
> +		if (val & PCI_CONFIG_ELBI_STS_MASK)
> +			break;
> +		if (ktime_compare(ktime_get(), timeout) > 0) {
> +			pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS,
> +						&val);
> +			break;
> +		}
> +		usleep_range(300, 500);
> +	}
> +
> +	if ((val & PCI_CONFIG_ELBI_STS_MASK) == PCI_CONFIG_ELBI_STS_DONE)
> +		return 0;
> +
> +	if (val & PCI_CONFIG_ELBI_STS_ERR) {
> +		dev_err(hdev->dev, "Error writing to ELBI\n");
> +		return -1;

Please change -1 to an error code, say -EIO...
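
One possible mapping, sketched in userspace. The STS_* bit positions below
are invented for the example (the real PCI_CONFIG_ELBI_STS_* values come from
the register headers), and -ETIMEDOUT for the "didn't finish in time" case is
only a suggestion:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this sketch. */
#define STS_ERR		(1u << 30)
#define STS_DONE	(1u << 31)

/* Translate the final polled status word into 0 or a negative errno. */
static int elbi_status_to_errno(uint32_t val)
{
	if (val & STS_ERR)
		return -EIO;		/* device reported a failed write */
	if (val & STS_DONE)
		return 0;		/* write completed cleanly */
	return -ETIMEDOUT;		/* neither bit was set before the deadline */
}
```

With distinct codes the caller can tell a device-side failure from a timeout
without parsing the log.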

> +	}
> +
> +	if (!(val & PCI_CONFIG_ELBI_STS_MASK)) {
> +		dev_err(hdev->dev, "ELBI write didn't finish in time\n");
> +		return -1;
> +	}
> +
> +	dev_err(hdev->dev, "ELBI write has undefined bits in status\n");
> +	return -1;
> +}
> +
> +/**
> + * goya_iatu_write - iatu write routine
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static int goya_iatu_write(struct hl_device *hdev, u32 addr, u32 data)
> +{
> +	u32 dbi_offset;
> +	int rc;
> +
> +	dbi_offset = addr & 0xFFF;
> +
> +	rc = goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0x00300000);
> +	rc |= goya_elbi_write(hdev, mmPCIE_DBI_BASE + dbi_offset, data);

Hmm, OR-ing the return values probably won't work once goya_elbi_write
returns a real error code...
Any reason to attempt the second write if the first failed?
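
To illustrate the concern, here is a small userspace sketch. fake_write() is
a made-up helper, and the numbers assume the usual Linux errno values
(EPERM = 1, EIO = 5, ENOMEM = 12). OR-ing two negative errno values yields a
third, unrelated one:

```c
#include <errno.h>

/* Hypothetical write primitive: returns the code stored for this step
 * (0 for success, a negative errno for failure). */
static int fake_write(const int *result_of, int step)
{
	return result_of[step];
}

/* Pattern from the patch: accumulate results with |=. */
static int write_pair_or(const int *result_of)
{
	int rc;

	rc = fake_write(result_of, 0);
	rc |= fake_write(result_of, 1);
	return rc;
}

/* Alternative: stop at the first failure and return its code unchanged. */
static int write_pair_checked(const int *result_of)
{
	int rc;

	rc = fake_write(result_of, 0);
	if (rc)
		return rc;
	return fake_write(result_of, 1);
}
```

With results { -EIO, -ENOMEM }, write_pair_or() returns -5 | -12 = -1, which
reads as -EPERM, while write_pair_checked() returns -EIO and never issues the
second write.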

> +
> +	return rc;
> +}
> +
> +void goya_reset_link_through_bridge(struct hl_device *hdev)
> +{
> +	struct pci_dev *pdev = hdev->pdev;
> +	struct pci_dev *parent_port;
> +	u16 val;
> +
> +	parent_port = pdev->bus->self;
> +	pci_read_config_word(parent_port, PCI_BRIDGE_CONTROL, &val);
> +	val |= PCI_BRIDGE_CTL_BUS_RESET;
> +	pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
> +	ssleep(1);
> +
> +	val &= ~(PCI_BRIDGE_CTL_BUS_RESET);
> +	pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
> +	ssleep(3);
> +}
> +
> +/**
> + * goya_set_ddr_bar_base - set DDR bar to map specific device address
> + *
> + * @hdev: pointer to hl_device structure
> + * @addr: address in DDR. Must be aligned to DDR bar size
> + *
> + * This function configures the iATU so that the DDR bar will start at the
> + * specified addr.
> + *
> + */
> +static int goya_set_ddr_bar_base(struct hl_device *hdev, u64 addr)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int rc;
> +
> +	if ((goya) && (goya->ddr_bar_cur_addr == addr))
> +		return 0;
> +
> +	/* Inbound Region 1 - Bar 4 - Point to DDR */
> +	rc = goya_iatu_write(hdev, 0x314, lower_32_bits(addr));
> +	rc |= goya_iatu_write(hdev, 0x318, upper_32_bits(addr));
> +	rc |= goya_iatu_write(hdev, 0x300, 0);
> +	/* Enable + Bar match + match enable + Bar 4 */
> +	rc |= goya_iatu_write(hdev, 0x304, 0xC0080400);
> +
> +	/* Return the DBI window to the default location */
> +	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
> +	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);

And here as well.
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to map DDR bar to 0x%08llx\n", addr);
> +		return rc;
> +	}

I believe that at least here you'd want to return an error code.

> +
> +	if (goya)
> +		goya->ddr_bar_cur_addr = addr;
> +
> +	return 0;
> +}
> +
> +/**
> + * goya_init_iatu - Initialize the iATU unit inside the PCI controller
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * This is needed in case the firmware doesn't initialize the iATU
> + *
> + */
> +static int goya_init_iatu(struct hl_device *hdev)
> +{
> +	int rc;
> +
> +	/* Inbound Region 0 - Bar 0 - Point to SRAM_BASE_ADDR */
> +	rc  = goya_iatu_write(hdev, 0x114, lower_32_bits(SRAM_BASE_ADDR));
> +	rc |= goya_iatu_write(hdev, 0x118, upper_32_bits(SRAM_BASE_ADDR));
> +	rc |= goya_iatu_write(hdev, 0x100, 0);
> +	/* Enable + Bar match + match enable */
> +	rc |= goya_iatu_write(hdev, 0x104, 0xC0080000);
> +
> +	/* Inbound Region 1 - Bar 4 - Point to DDR */
> +	rc |= goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
> +
> +	/* Outbound Region 0 - Point to Host */
> +	rc |= goya_iatu_write(hdev, 0x008, lower_32_bits(HOST_PHYS_BASE));
> +	rc |= goya_iatu_write(hdev, 0x00C, upper_32_bits(HOST_PHYS_BASE));
> +	rc |= goya_iatu_write(hdev, 0x010,
> +		lower_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
> +	rc |= goya_iatu_write(hdev, 0x014, 0);
> +	rc |= goya_iatu_write(hdev, 0x018, 0);
> +	rc |= goya_iatu_write(hdev, 0x020,
> +		upper_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
> +	/* Increase region size */
> +	rc |= goya_iatu_write(hdev, 0x000, 0x00002000);
> +	/* Enable */
> +	rc |= goya_iatu_write(hdev, 0x004, 0x80000000);
> +
> +	/* Return the DBI window to the default location */
> +	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
> +	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
> +
> +	return rc;

Ditto
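
For a long sequence like this one, a table-driven helper is one way to keep
the code compact and still stop at the first failure. This is only a sketch:
struct reg_write, apply_reg_writes(), struct fake_dev and fake_iatu_write()
are hypothetical names, not part of the driver.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct reg_write {
	uint32_t addr;
	uint32_t data;
};

/* Hypothetical writer type; a driver would plug in its own iATU helper. */
typedef int (*reg_write_fn)(void *ctx, uint32_t addr, uint32_t data);

/* Issue the writes in order and return the first non-zero result. */
static int apply_reg_writes(void *ctx, reg_write_fn do_write,
			    const struct reg_write *seq, size_t n)
{
	size_t i;
	int rc;

	for (i = 0; i < n; i++) {
		rc = do_write(ctx, seq[i].addr, seq[i].data);
		if (rc)
			return rc;
	}
	return 0;
}

/* Test double: counts calls and fails writes to one chosen offset. */
struct fake_dev {
	uint32_t bad_addr;	/* writes to this offset fail with -EIO */
	unsigned int calls;	/* number of writes attempted */
};

static int fake_iatu_write(void *ctx, uint32_t addr, uint32_t data)
{
	struct fake_dev *dev = ctx;

	(void)data;
	dev->calls++;
	return addr == dev->bad_addr ? -EIO : 0;
}
```

The register/value pairs for each region would then live in a const array,
and the DBI window restore could run unconditionally after the loop.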

> +}
> +
> +/**
> + * goya_early_init - GOYA early initialization code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Verify PCI bars
> + * Set DMA masks
> + * PCI controller initialization
> + * Map PCI bars
> + *
> + */
> +static int goya_early_init(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct pci_dev *pdev = hdev->pdev;
> +	u32 val;
> +	int rc;
> +
> +	goya_get_fixed_properties(hdev);
> +
> +	/* Check BAR sizes */
> +	if (pci_resource_len(pdev, SRAM_CFG_BAR_ID) != CFG_BAR_SIZE) {
> +		dev_err(hdev->dev,
> +			"Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> +			SRAM_CFG_BAR_ID,
> +			pci_resource_len(pdev, SRAM_CFG_BAR_ID),
> +			CFG_BAR_SIZE);
> +		return -ENODEV;
> +	}
> +
> +	if (pci_resource_len(pdev, MSIX_BAR_ID) != MSIX_BAR_SIZE) {
> +		dev_err(hdev->dev,
> +			"Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> +			MSIX_BAR_ID, pci_resource_len(pdev, MSIX_BAR_ID),
> +			MSIX_BAR_SIZE);
> +		return -ENODEV;
> +	}
> +
> +	prop->dram_pci_bar_size = pci_resource_len(pdev, DDR_BAR_ID);
> +
> +	/* set DMA mask for GOYA */
> +	rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(39));
> +	if (rc) {
> +		dev_warn(hdev->dev, "Unable to set pci dma mask to 39 bits\n");
> +		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"Unable to set pci dma mask to 32 bits\n");
> +			return rc;
> +		}
> +	}
> +
> +	rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(39));
> +	if (rc) {
> +		dev_warn(hdev->dev,
> +			"Unable to set pci consistent dma mask to 39 bits\n");
> +		rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"Unable to set pci consistent dma mask to 32 bits\n");
> +			return rc;
> +		}
> +	}
> +
> +	if (hdev->reset_pcilink)
> +		goya_reset_link_through_bridge(hdev);
> +
> +	rc = pci_enable_device_mem(pdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "can't enable PCI device\n");
> +		return rc;
> +	}
> +
> +	pci_set_master(pdev);
> +
> +	rc = goya_init_iatu(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to initialize iATU\n");
> +		goto disable_device;
> +	}
> +
> +	rc = goya_pci_bars_map(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to initialize PCI BARS\n");
> +		goto disable_device;
> +	}
> +
> +	val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
> +	if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
> +		dev_warn(hdev->dev,
> +			"PCI strap is not configured correctly, PCI bus errors may occur\n");
> +
> +	return 0;
> +
> +disable_device:
> +	pci_clear_master(pdev);
> +	pci_disable_device(pdev);
> +
> +	return rc;
> +}
> +
> +/**
> + * goya_early_fini - GOYA early finalization code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Unmap PCI bars
> + *
> + */
> +int goya_early_fini(struct hl_device *hdev)
> +{
> +	goya_pci_bars_unmap(hdev);
> +
> +	pci_clear_master(hdev->pdev);
> +	pci_disable_device(hdev->pdev);
> +
> +	return 0;
> +}
> +
> +/**
> + * goya_sw_init - Goya software initialization code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static int goya_sw_init(struct hl_device *hdev)
> +{
> +	struct goya_device *goya;
> +	int rc;
> +
> +	/* Allocate device structure */
> +	goya = kzalloc(sizeof(*goya), GFP_KERNEL);

Consider using devm_k[mz]alloc() for memory allocations throughout the
driver. I didn't check all the spots where it can be applicable.

> +	if (!goya)
> +		return -ENOMEM;
> +
> +	/* according to goya_init_iatu */
> +	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> +	hdev->asic_specific = goya;
> +
> +	/* Create DMA pool for small allocations */
> +	hdev->dma_pool = dma_pool_create(dev_name(hdev->dev),
> +			&hdev->pdev->dev, GOYA_DMA_POOL_BLK_SIZE, 8, 0);
> +	if (!hdev->dma_pool) {
> +		dev_err(hdev->dev, "failed to create DMA pool\n");
> +		rc = -ENOMEM;
> +		goto free_goya_device;
> +	}
> +
> +	hdev->cpu_accessible_dma_mem =
> +			hdev->asic_funcs->dma_alloc_coherent(hdev,
> +					CPU_ACCESSIBLE_MEM_SIZE,
> +					&hdev->cpu_accessible_dma_address,
> +					GFP_KERNEL | __GFP_ZERO);
> +
> +	if (!hdev->cpu_accessible_dma_mem) {
> +		dev_err(hdev->dev,
> +			"failed to allocate %d of dma memory for CPU accessible memory space\n",
> +			CPU_ACCESSIBLE_MEM_SIZE);
> +		rc = -ENOMEM;
> +		goto free_dma_pool;
> +	}
> +
> +	hdev->cpu_accessible_dma_pool = gen_pool_create(CPU_PKT_SHIFT, -1);
> +	if (!hdev->cpu_accessible_dma_pool) {
> +		dev_err(hdev->dev,
> +			"Failed to create CPU accessible DMA pool\n");
> +		rc = -ENOMEM;

You could init rc = -ENOMEM at the beginning and save the duplication.

> +		goto free_cpu_pq_dma_mem;
> +	}
> +
> +	rc = gen_pool_add(hdev->cpu_accessible_dma_pool,
> +				(u64) hdev->cpu_accessible_dma_mem,
> +				CPU_ACCESSIBLE_MEM_SIZE, -1);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to add memory to CPU accessible DMA pool\n");
> +		rc = -EFAULT;
> +		goto free_cpu_pq_pool;
> +	}
> +
> +	spin_lock_init(&goya->hw_queues_lock);
> +
> +	return 0;
> +
> +free_cpu_pq_pool:
> +	gen_pool_destroy(hdev->cpu_accessible_dma_pool);
> +free_cpu_pq_dma_mem:
> +	hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
> +			hdev->cpu_accessible_dma_mem,
> +			hdev->cpu_accessible_dma_address);
> +free_dma_pool:
> +	dma_pool_destroy(hdev->dma_pool);
> +free_goya_device:
> +	kfree(goya);
> +
> +	return rc;
> +}
> +
> +/**
> + * goya_sw_fini - Goya software tear-down code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +int goya_sw_fini(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	gen_pool_destroy(hdev->cpu_accessible_dma_pool);
> +
> +	hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
> +			hdev->cpu_accessible_dma_mem,
> +			hdev->cpu_accessible_dma_address);
> +
> +	dma_pool_destroy(hdev->dma_pool);
> +
> +	kfree(goya);
> +
> +	return 0;
> +}
> +
> +int goya_suspend(struct hl_device *hdev)
> +{
> +	return 0;
> +}
> +
> +int goya_resume(struct hl_device *hdev)
> +{
> +	return 0;
> +}
> +
> +void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
> +					dma_addr_t *dma_handle, gfp_t flags)
> +{
> +	return dma_alloc_coherent(&hdev->pdev->dev, size, dma_handle, flags);
> +}
> +
> +void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
> +				dma_addr_t dma_handle)
> +{
> +	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
> +}
> +
> +static const struct hl_asic_funcs goya_funcs = {
> +	.early_init = goya_early_init,
> +	.early_fini = goya_early_fini,
> +	.sw_init = goya_sw_init,
> +	.sw_fini = goya_sw_fini,
> +	.suspend = goya_suspend,
> +	.resume = goya_resume,
> +	.dma_alloc_coherent = goya_dma_alloc_coherent,
> +	.dma_free_coherent = goya_dma_free_coherent,

Is there any additional functionality planned for goya or gaudi in these
two functions?
They don't seem to be needed, at least at the moment, and they surely don't
need to be part of the ASIC ops.

> +};
> +
> +/**
> + * goya_set_asic_funcs - set Goya function pointers
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +void goya_set_asic_funcs(struct hl_device *hdev)
> +{
> +	hdev->asic_funcs = &goya_funcs;
> +}
> diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> new file mode 100644
> index 000000000000..0e12c56472bd
> --- /dev/null
> +++ b/drivers/misc/habanalabs/goya/goyaP.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef GOYAP_H_
> +#define GOYAP_H_
> +
> +#include "habanalabs.h"
> +#include "include/goya/goya.h"
> +
> +#define NUMBER_OF_CMPLT_QUEUES		5
> +#define NUMBER_OF_EXT_HW_QUEUES		5
> +#define NUMBER_OF_CPU_HW_QUEUES		1
> +#define NUMBER_OF_INT_HW_QUEUES		9
> +#define NUMBER_OF_HW_QUEUES		(NUMBER_OF_EXT_HW_QUEUES + \
> +					NUMBER_OF_CPU_HW_QUEUES + \
> +					NUMBER_OF_INT_HW_QUEUES)
> +
> +/*
> + * Number of MSIX interrupts IDS:
> + * Each completion queue has 1 ID
> + * The event queue has 1 ID
> + * ArmCP reset has 1 ID
> + */
> +#define NUMBER_OF_INTERRUPTS		(NUMBER_OF_CMPLT_QUEUES + 2)
> +
> +#if (NUMBER_OF_HW_QUEUES >= HL_MAX_QUEUES)
> +#error "Number of H/W queues must be smaller than HL_MAX_QUEUES"
> +#endif
> +
> +#if (NUMBER_OF_INTERRUPTS > GOYA_MSIX_ENTRIES)
> +#error "Number of MSIX interrupts must be smaller or equal to GOYA_MSIX_ENTRIES"
> +#endif
> +
> +#define QMAN_FENCE_TIMEOUT_USEC		10000	/* 10 ms */
> +
> +#define QMAN_STOP_TIMEOUT_USEC		100000	/* 100 ms */
> +
> +#define TPC_MAX_NUM			8
> +#define TPC_ENABLED_MASK		0xFF
> +
> +#define DMA_MAX_NUM			5
> +
> +#define PLL_HIGH_DEFAULT		1575000000	/* 1.575 GHz */
> +
> +#define GOYA_ARMCP_INFO_TIMEOUT		10000000	/* 10s */
> +
> +#define DRAM_PHYS_DEFAULT_SIZE		0x100000000ull	/* 4GB */
> +
> +/*
> + * SRAM Memory Map for KMD
> + *
> + * KMD occupies KMD_SRAM_SIZE bytes from the start of SRAM. It is used for
> + * MME/TPC QMANs
> + *
> + */
> +
> +#define MME_QMAN_BASE_OFFSET	0x000000	/* Must be 0 */
> +#define MME_QMAN_LENGTH		64
> +#define TPC_QMAN_LENGTH		64
> +
> +#define TPC0_QMAN_BASE_OFFSET	(MME_QMAN_BASE_OFFSET + \
> +				(MME_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC1_QMAN_BASE_OFFSET	(TPC0_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC2_QMAN_BASE_OFFSET	(TPC1_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC3_QMAN_BASE_OFFSET	(TPC2_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC4_QMAN_BASE_OFFSET	(TPC3_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC5_QMAN_BASE_OFFSET	(TPC4_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC6_QMAN_BASE_OFFSET	(TPC5_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +#define TPC7_QMAN_BASE_OFFSET	(TPC6_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +
> +#define SRAM_KMD_RES_OFFSET	(TPC7_QMAN_BASE_OFFSET + \
> +				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> +
> +#if (SRAM_KMD_RES_OFFSET >= KMD_SRAM_RESERVED_SIZE)
> +#error "MME/TPC QMANs SRAM space exceeds limit"
> +#endif
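
As a sanity check of the layout above, the offsets can be worked out by hand.
The sketch below restates the constants from the patch; the two helper
functions are hypothetical and exist only to show the arithmetic:

```c
#define QMAN_PQ_ENTRY_SIZE	16	/* bytes per PQ entry, as in goya.h */
#define MME_QMAN_LENGTH		64	/* entries */
#define TPC_QMAN_LENGTH		64	/* entries */
#define TPC_MAX_NUM		8

/* Byte offset in SRAM of the PQ for TPC idx; the MME queue comes first. */
static unsigned int tpc_qman_base_offset(unsigned int idx)
{
	return MME_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE +
	       idx * TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE;
}

/* First byte past the last TPC queue, i.e. SRAM_KMD_RES_OFFSET. */
static unsigned int sram_kmd_res_offset(void)
{
	return tpc_qman_base_offset(TPC_MAX_NUM);
}
```

Each queue takes 64 * 16 = 1KB, so the nine queues end at 0x2400 (9KB), well
below the 32KB KMD_SRAM_RESERVED_SIZE that the #if guards.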
> +
> +#define SRAM_USER_BASE_OFFSET	KMD_SRAM_RESERVED_SIZE
> +
> +#define DMA_MAX_TRANSFER_SIZE	0xFFFFFFFF
> +
> +#define HW_CAP_PLL		0x00000001
> +#define HW_CAP_DDR_0		0x00000002
> +#define HW_CAP_DDR_1		0x00000004
> +#define HW_CAP_MME		0x00000008
> +#define HW_CAP_CPU		0x00000010
> +#define HW_CAP_DMA		0x00000020
> +#define HW_CAP_MSIX		0x00000040
> +#define HW_CAP_CPU_Q		0x00000080
> +#define HW_CAP_MMU		0x00000100
> +#define HW_CAP_TPC_MBIST	0x00000200
> +#define HW_CAP_GOLDEN		0x00000400
> +#define HW_CAP_TPC		0x00000800
> +
> +#define CPU_PKT_SHIFT		5
> +#define CPU_PKT_SIZE		(1 << CPU_PKT_SHIFT)
> +#define CPU_PKT_MASK		(~((1 << CPU_PKT_SHIFT) - 1))
> +#define CPU_MAX_PKTS_IN_CB	32
> +#define CPU_CB_SIZE		(CPU_PKT_SIZE * CPU_MAX_PKTS_IN_CB)
> +#define CPU_ACCESSIBLE_MEM_SIZE	(HL_QUEUE_LENGTH * CPU_CB_SIZE)
> +
> +enum goya_fw_component {
> +	FW_COMP_UBOOT,
> +	FW_COMP_PREBOOT
> +};
> +
> +struct goya_device {
> +	/* TODO: remove hw_queues_lock after moving to scheduler code */
> +	spinlock_t	hw_queues_lock;
> +	u64		ddr_bar_cur_addr;
> +	u32		hw_cap_initialized;
> +};
> +
> +#endif /* GOYAP_H_ */
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index 7e1b088b677c..97844825f7a8 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -21,11 +21,64 @@
>  
>  #define HL_NAME				"habanalabs"
>  
> +#define HL_MAX_QUEUES			128
> +
>  struct hl_device;
>  
>  
>  
>  
> +/**
> + * struct asic_fixed_properties - ASIC specific immutable properties.
> + * @sram_base_address: SRAM physical start address.
> + * @sram_end_address: SRAM physical end address.
> + * @sram_user_base_address: SRAM physical start address for user access.
> + * @dram_base_address: DRAM physical start address.
> + * @dram_end_address: DRAM physical end address.
> + * @dram_user_base_address: DRAM physical start address for user access.
> + * @dram_size: DRAM total size.
> + * @dram_pci_bar_size: size of PCI bar towards DRAM.
> + * @host_phys_base_address: base physical address of host memory for
> + *				transactions that the device generates.
> + * @va_space_host_start_address: base address of virtual memory range for
> + *                               mapping host memory.
> + * @va_space_host_end_address: end address of virtual memory range for
> + *                             mapping host memory.
> + * @va_space_dram_start_address: base address of virtual memory range for
> + *                               mapping DRAM memory.
> + * @va_space_dram_end_address: end address of virtual memory range for
> + *                             mapping DRAM memory.
> + * @cfg_size: configuration space size on SRAM.
> + * @sram_size: total size of SRAM.
> + * @max_asid: maximum number of open contexts (ASIDs).
> + * @completion_queues_count: number of completion queues.
> + * @high_pll: high PLL frequency used by the device.
> + * @tpc_enabled_mask: which TPCs are enabled.
> + */
> +struct asic_fixed_properties {
> +	u64			sram_base_address;
> +	u64			sram_end_address;
> +	u64			sram_user_base_address;
> +	u64			dram_base_address;
> +	u64			dram_end_address;
> +	u64			dram_user_base_address;
> +	u64			dram_size;
> +	u64			dram_pci_bar_size;
> +	u64			host_phys_base_address;
> +	u64			va_space_host_start_address;
> +	u64			va_space_host_end_address;
> +	u64			va_space_dram_start_address;
> +	u64			va_space_dram_end_address;
> +	u32			cfg_size;
> +	u32			sram_size;
> +	u32			max_asid;
> +	u32			high_pll;
> +	u8			completion_queues_count;
> +	u8			tpc_enabled_mask;
> +};
> +
> +
> +#define HL_QUEUE_LENGTH			256
>  
>  
>  /*
> @@ -47,6 +100,30 @@ enum hl_asic_type {
>  
>  
>  
> +/**
> + * struct hl_asic_funcs - ASIC specific functions that can be called from
> + *                        common code.
> + * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
> + * @early_fini: tears down what was done in early_init.
> + * @sw_init: sets up driver state, does not configure H/W.
> + * @sw_fini: tears down driver state, does not configure H/W.
> + * @suspend: handles IP specific H/W or SW changes for suspend.
> + * @resume: handles IP specific H/W or SW changes for resume.
> + * @dma_alloc_coherent: DMA allocate coherent memory.
> + * @dma_free_coherent: free DMA allocation.
> + */
> +struct hl_asic_funcs {
> +	int (*early_init)(struct hl_device *hdev);
> +	int (*early_fini)(struct hl_device *hdev);
> +	int (*sw_init)(struct hl_device *hdev);
> +	int (*sw_fini)(struct hl_device *hdev);
> +	int (*suspend)(struct hl_device *hdev);
> +	int (*resume)(struct hl_device *hdev);
> +	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
> +					dma_addr_t *dma_handle, gfp_t flag);
> +	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
> +					void *cpu_addr, dma_addr_t dma_handle);
> +};
>  
>  /*
>   * FILE PRIVATE STRUCTURE
> @@ -78,26 +155,78 @@ struct hl_fpriv {
>   */
>  #define HL_MAX_MINORS	256
>  
> +/*
> + * Registers read & write functions.
> + */
> +
> +u32 hl_rreg(struct hl_device *hdev, u32 reg);
> +void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> +
> +#define hl_poll_timeout(hdev, addr, val, cond, sleep_us, timeout_us) \
> +	readl_poll_timeout(hdev->rmmio + addr, val, cond, sleep_us, timeout_us)
> +
> +#define RREG32(reg) hl_rreg(hdev, (reg))
> +#define WREG32(reg, v) hl_wreg(hdev, (reg), (v))
> +#define DREG32(reg) pr_info("REGISTER: " #reg " : 0x%08X\n",	\
> +				hl_rreg(hdev, (reg)))
> +
> +#define WREG32_P(reg, val, mask)				\
> +	do {							\
> +		u32 tmp_ = RREG32(reg);				\
> +		tmp_ &= (mask);					\
> +		tmp_ |= ((val) & ~(mask));			\
> +		WREG32(reg, tmp_);				\
> +	} while (0)
> +#define WREG32_AND(reg, and) WREG32_P(reg, 0, and)
> +#define WREG32_OR(reg, or) WREG32_P(reg, or, ~(or))
> +
> +#define REG_FIELD_SHIFT(reg, field) reg##_##field##_SHIFT
> +#define REG_FIELD_MASK(reg, field) reg##_##field##_MASK
> +#define WREG32_FIELD(reg, field, val)	\
> +	WREG32(mm##reg, (RREG32(mm##reg) & ~REG_FIELD_MASK(reg, field)) | \
> +			(val) << REG_FIELD_SHIFT(reg, field))
> +
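
The mask convention of WREG32_P is easy to misread: bits set in the mask are
the ones preserved from the current register value, and the remaining bits
come from val. A userspace model of the three macros, using hypothetical
wreg32_* functions that take the current value instead of touching hardware:

```c
#include <stdint.h>

/* Bits set in `mask` are kept from `cur`; all other bits come from `val`. */
static uint32_t wreg32_p(uint32_t cur, uint32_t val, uint32_t mask)
{
	uint32_t tmp = cur & mask;

	tmp |= val & ~mask;
	return tmp;
}

/* Model of WREG32_AND: clear every bit that is not set in and_mask. */
static uint32_t wreg32_and(uint32_t cur, uint32_t and_mask)
{
	return wreg32_p(cur, 0, and_mask);
}

/* Model of WREG32_OR: set or_bits, keep everything else. */
static uint32_t wreg32_or(uint32_t cur, uint32_t or_bits)
{
	return wreg32_p(cur, or_bits, ~or_bits);
}
```

For example, wreg32_p(0xAAAA5555, 0x12345678, 0xFFFF0000) keeps the upper
half of the old value and takes the lower half from val, giving 0xAAAA5678.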
>  /**
>   * struct hl_device - habanalabs device structure.
>   * @pdev: pointer to PCI device, can be NULL in case of simulator device.
> + * @pcie_bar: array of available PCIe bars.
> + * @rmmio: configuration area address on SRAM.
>   * @cdev: related char device.
>   * @dev: related kernel basic device structure.
>   * @asic_name: ASIC specific name.
>   * @asic_type: ASIC specific type.
> + * @dma_pool: DMA pool for small allocations.
> + * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> + * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> + * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
> + * @asic_prop: ASIC specific immutable properties.
> + * @asic_funcs: ASIC specific functions.
> + * @asic_specific: ASIC specific information to use only from ASIC files.
>   * @major: habanalabs KMD major.
>   * @id: device minor.
>   * @disabled: is device disabled.
>   */
>  struct hl_device {
>  	struct pci_dev			*pdev;
> +	void __iomem			*pcie_bar[6];
> +	void __iomem			*rmmio;
>  	struct cdev			cdev;
>  	struct device			*dev;
>  	char				asic_name[16];
>  	enum hl_asic_type		asic_type;
> +	struct dma_pool			*dma_pool;
> +	void				*cpu_accessible_dma_mem;
> +	dma_addr_t			cpu_accessible_dma_address;
> +	struct gen_pool			*cpu_accessible_dma_pool;
> +	struct asic_fixed_properties	asic_prop;
> +	const struct hl_asic_funcs	*asic_funcs;
> +	void				*asic_specific;
>  	u32				major;
>  	u16				id;
>  	u8				disabled;
> +
> +	/* Parameters for bring-up */
> +	u8				reset_pcilink;
>  };
>  
>  /*
> @@ -146,4 +275,6 @@ void hl_device_fini(struct hl_device *hdev);
>  int hl_device_suspend(struct hl_device *hdev);
>  int hl_device_resume(struct hl_device *hdev);
>  
> +void goya_set_asic_funcs(struct hl_device *hdev);
> +
>  #endif /* HABANALABSP_H_ */
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index 15217975327b..79545003b7c2 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -136,6 +136,9 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
>  
>  	hdev->major = hl_major;
>  
> +	/* Parameters for bring-up - set them to defaults */
> +	hdev->reset_pcilink = 0;
> +
>  	hdev->disabled = true;
>  	hdev->pdev = pdev; /* can be NULL in case of simulator device */
>  
> diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
> new file mode 100644
> index 000000000000..192a1450cbb1
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/goya/goya.h
> @@ -0,0 +1,115 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> + *
> + */
> +
> +#ifndef GOYA_H
> +#define GOYA_H
> +
> +#include "asic_reg/goya_regs.h"
> +
> +#include <linux/types.h>
> +
> +#define SRAM_CFG_BAR_ID		0
> +#define MSIX_BAR_ID		2
> +#define DDR_BAR_ID		4
> +
> +#define CFG_BAR_SIZE		0x10000000ull		/* 256MB */
> +#define MSIX_BAR_SIZE		0x1000ull		/* 4KB */
> +
> +#define CFG_BASE		0x7FFC000000ull
> +#define CFG_SIZE		0x4000000		/* 32MB CFG + 32MB DBG*/
> +
> +#define SRAM_BASE_ADDR		0x7FF0000000ull
> +#define SRAM_SIZE		0x32A0000		/* 50.625MB */
> +#define KMD_SRAM_RESERVED_SIZE	0x8000			/* 32KB */
> +
> +#define SRAM_BASE_ADDR_USER	(0x7FF0000000ull + KMD_SRAM_RESERVED_SIZE)
> +#define SRAM_SIZE_USER		(SRAM_SIZE - KMD_SRAM_RESERVED_SIZE)
> +
> +#define DRAM_PHYS_BASE		0x0ull
> +
> +#define CPU_FW_IMAGE_SIZE	0x10000000	/* 256MB */
> +#define MMU_PAGE_TABLES_SIZE	0x0E000000	/* 224MB */
> +#define CPU_PQ_PKT_SIZE		0x00001000	/* 4KB */
> +#define CPU_PQ_DATA_SIZE	0x01FFF000	/* 32MB - 4KB  */
> +
> +#define CPU_FW_IMAGE_ADDR	DRAM_PHYS_BASE
> +#define MMU_PAGE_TABLES_ADDR	(CPU_FW_IMAGE_ADDR + CPU_FW_IMAGE_SIZE)
> +#define CPU_PQ_PKT_ADDR		(MMU_PAGE_TABLES_ADDR + MMU_PAGE_TABLES_SIZE)
> +#define CPU_PQ_DATA_ADDR	(CPU_PQ_PKT_ADDR + CPU_PQ_PKT_SIZE)
> +#define DRAM_BASE_ADDR_USER	(CPU_PQ_DATA_ADDR + CPU_PQ_DATA_SIZE)
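
Working through the chain of defines above shows where user DRAM starts. The
constants are copied from the patch (with an explicit ull suffix added here
so the sum is 64-bit); dram_base_addr_user() is a hypothetical helper for the
arithmetic only:

```c
#include <stdint.h>

#define DRAM_PHYS_BASE		0x0ull
#define CPU_FW_IMAGE_SIZE	0x10000000ull	/* 256MB */
#define MMU_PAGE_TABLES_SIZE	0x0E000000ull	/* 224MB */
#define CPU_PQ_PKT_SIZE		0x00001000ull	/* 4KB */
#define CPU_PQ_DATA_SIZE	0x01FFF000ull	/* 32MB - 4KB */

/* The regions are stacked back to back starting at DRAM_PHYS_BASE. */
static uint64_t dram_base_addr_user(void)
{
	uint64_t mmu_page_tables_addr = DRAM_PHYS_BASE + CPU_FW_IMAGE_SIZE;
	uint64_t cpu_pq_pkt_addr = mmu_page_tables_addr + MMU_PAGE_TABLES_SIZE;
	uint64_t cpu_pq_data_addr = cpu_pq_pkt_addr + CPU_PQ_PKT_SIZE;

	return cpu_pq_data_addr + CPU_PQ_DATA_SIZE;
}
```

256MB + 224MB + 4KB + (32MB - 4KB) comes to exactly 512MB, so the first
0x20000000 bytes of DRAM are reserved for the kernel driver and firmware.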
> +
> +#define HOST_PHYS_BASE		0x8000000000ull		/* 0.5TB */
> +#define HOST_PHYS_SIZE		0x1000000000000ull	/* 0.25PB (48 bits) */
> +
> +#define VA_HOST_SPACE_START	0x1000000000000ull	/* 256TB */
> +#define VA_HOST_SPACE_END	0x3FF8000000000ull	/* 1PB - 512GB */
> +#define VA_HOST_SPACE_SIZE	(VA_HOST_SPACE_END - \
> +					VA_HOST_SPACE_START) /* 767.5TB */
> +
> +#define VA_DDR_SPACE_START	0x800000000ull		/* 32GB */
> +#define VA_DDR_SPACE_END	0x2000000000ull		/* 128GB */
> +#define VA_DDR_SPACE_SIZE	(VA_DDR_SPACE_END - \
> +					VA_DDR_SPACE_START)	/* 96GB */
> +
> +#define CPU_BOOT_ADDR		0x7FF8040000ull
> +
> +#define UBOOT_FW_OFFSET		0x100000		/* 1MB in SRAM */
> +#define LINUX_FW_OFFSET		0x800000		/* 8MB in DDR */
> +
> +#define GOYA_MSIX_ENTRIES	8
> +#define EVENT_QUEUE_MSIX_IDX	5
> +#define ARMCP_RESET_MSIX_IDX	6
> +
> +#define QMAN_PQ_ENTRY_SIZE	16			/* Bytes */
> +
> +#define MAX_ASID		1024
> +
> +#define PROT_BITS_OFFS		0xF80
> +
> +/*
> + * Queue Numbering
> + *
> + * The external queues (DMA channels + CPU) MUST be before the internal queues
> + * and each group (DMA channels + CPU and internal) must be contiguous inside
> + * itself but there can be a gap between the two groups (although not
> + * recommended)
> + */
> +
> +enum goya_queue_id {
> +	GOYA_QUEUE_ID_DMA_0 = 0,
> +	GOYA_QUEUE_ID_DMA_1,
> +	GOYA_QUEUE_ID_DMA_2,
> +	GOYA_QUEUE_ID_DMA_3,
> +	GOYA_QUEUE_ID_DMA_4,
> +	GOYA_QUEUE_ID_CPU_PQ,
> +	GOYA_QUEUE_ID_MME,
> +	GOYA_QUEUE_ID_TPC0,
> +	GOYA_QUEUE_ID_TPC1,
> +	GOYA_QUEUE_ID_TPC2,
> +	GOYA_QUEUE_ID_TPC3,
> +	GOYA_QUEUE_ID_TPC4,
> +	GOYA_QUEUE_ID_TPC5,
> +	GOYA_QUEUE_ID_TPC6,
> +	GOYA_QUEUE_ID_TPC7,
> +	GOYA_QUEUE_ID_SIZE
> +};
> +
> +enum goya_pll_index {
> +	CPU_PLL = 0,
> +	IC_PLL,
> +	MC_PLL,
> +	MME_PLL,
> +	PCI_PLL,
> +	EMMC_PLL,
> +	TPC_PLL
> +};
> +
> +#define GOYA_PLL_FREQ_LOW		50000000 /* 50 MHz */
> +
> +#endif /* GOYA_H */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 05/15] habanalabs: add command buffer module
  2019-01-23  0:00 ` [PATCH 05/15] habanalabs: add command buffer module Oded Gabbay
@ 2019-01-23 12:28   ` Mike Rapoport
  2019-01-25 21:47     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-23 12:28 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:47AM +0200, Oded Gabbay wrote:
> This patch adds the CB module, which allows the user to create and
> destroy CBs and to map them to the user's process address-space.

Can you please spell out "command buffer" at least the first time it's mentioned?
 
> A command buffer is a memory block that resides in DMA-able address-space
> and is physically contiguous so it can be accessed by the device without
> MMU translation. The command buffer memory is allocated using the
> coherent DMA API.
> 
> When creating a new CB, the IOCTL returns a handle of it, and the
> user-space process needs to use that handle to mmap the buffer to get a VA
> in the user's address-space.
> 
> Before destroying (freeing) a CB, the user must unmap the CB's VA using the
> CB handle.
> 
> Each CB has a reference counter, which tracks its usage in command
> submissions and also its mmaps (only a single mmap is allowed).
> 
> The driver maintains a pool of pre-allocated CBs in order to reduce
> latency during command submissions. In case the pool is empty, the driver
> will go to the slow-path of allocating a new CB, i.e. calling
> dma_alloc_coherent.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/Makefile           |   3 +-
>  drivers/misc/habanalabs/command_buffer.c   | 414 +++++++++++++++++++++
>  drivers/misc/habanalabs/device.c           |  43 ++-
>  drivers/misc/habanalabs/goya/goya.c        |  28 ++
>  drivers/misc/habanalabs/habanalabs.h       |  95 ++++-
>  drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
>  drivers/misc/habanalabs/habanalabs_ioctl.c | 102 +++++
>  include/uapi/misc/habanalabs.h             |  62 +++
>  8 files changed, 746 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/command_buffer.c
>  create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
>  create mode 100644 include/uapi/misc/habanalabs.h
> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index 3ffbadc2ca01..2530c9b78ca4 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -4,7 +4,8 @@
>  
>  obj-m	:= habanalabs.o
>  
> -habanalabs-y := habanalabs_drv.o device.o context.o asid.o
> +habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> +		command_buffer.o
>  
>  include $(src)/goya/Makefile
>  habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
> new file mode 100644
> index 000000000000..535ed6cc5bda
> --- /dev/null
> +++ b/drivers/misc/habanalabs/command_buffer.c
> @@ -0,0 +1,414 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include <uapi/misc/habanalabs.h>
> +#include "habanalabs.h"
> +
> +#include <linux/dma-mapping.h>
> +
> +static void cb_fini(struct hl_device *hdev, struct hl_cb *cb)
> +{
> +	hdev->asic_funcs->dma_free_coherent(hdev, cb->size,
> +			(void *) cb->kernel_address, cb->bus_address);

It seems the ASIC-specific dma_free_coherent is just a wrapper around the
generic dma_free_coherent. Why not use the generic one directly?

> +	kfree(cb);
> +}
> +
> +static void cb_do_release(struct hl_device *hdev, struct hl_cb *cb)
> +{
> +	if (cb->is_pool) {
> +		spin_lock(&hdev->cb_pool_lock);
> +		list_add(&cb->pool_list, &hdev->cb_pool);
> +		spin_unlock(&hdev->cb_pool_lock);
> +	} else {
> +		cb_fini(hdev, cb);
> +	}
> +}
> +
> +static void cb_release(struct kref *ref)
> +{
> +	struct hl_device *hdev;
> +	struct hl_cb *cb;
> +
> +	cb = container_of(ref, struct hl_cb, refcount);
> +	hdev = cb->hdev;
> +
> +	cb_do_release(hdev, cb);
> +}
> +
> +static struct hl_cb *hl_cb_alloc(struct hl_device *hdev, u32 cb_size,
> +					int ctx_id)
> +{
> +	struct hl_cb *cb;
> +	void *p;
> +
> +	if (ctx_id == HL_KERNEL_ASID_ID)
> +		cb = kzalloc(sizeof(*cb), GFP_ATOMIC);

GFP_ATOMIC should be used only when the caller cannot tolerate reclaim or
sleep, and that does not seem to be the case here.

> +	else
> +		cb = kzalloc(sizeof(*cb), GFP_KERNEL);
> +
> +	if (!cb)
> +		return NULL;
> +
> +	if (ctx_id == HL_KERNEL_ASID_ID)
> +		p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
> +						&cb->bus_address, GFP_ATOMIC);

GFP_KERNEL?

> +	else
> +		p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
> +						&cb->bus_address,
> +						GFP_USER | __GFP_ZERO);
> +	if (!p) {
> +		dev_err(hdev->dev,
> +			"failed to allocate %d of dma memory for CB\n",
> +			cb_size);
> +		kfree(cb);
> +		return NULL;
> +	}
> +
> +	cb->kernel_address = (u64) p;
> +	cb->size = cb_size;
> +
> +	return cb;
> +}
> +
> +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> +			u32 cb_size, u64 *handle, int ctx_id)
> +{
> +	struct hl_cb *cb;
> +	bool alloc_new_cb = true;
> +	int rc;
> +
> +	if (hdev->disabled) {
> +		dev_warn_ratelimited(hdev->dev,
> +			"Device is disabled !!! Can't create new CBs\n");
> +		rc = -EBUSY;
> +		goto out_err;
> +	}
> +
> +	/* Minimum allocation must be PAGE SIZE */
> +	if (cb_size < PAGE_SIZE)
> +		cb_size = PAGE_SIZE;
> +
> +	if (ctx_id == HL_KERNEL_ASID_ID &&
> +			cb_size <= hdev->asic_prop.cb_pool_cb_size) {
> +
> +		spin_lock(&hdev->cb_pool_lock);
> +		if (!list_empty(&hdev->cb_pool)) {
> +			cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
> +					pool_list);
> +			list_del(&cb->pool_list);
> +			spin_unlock(&hdev->cb_pool_lock);
> +			alloc_new_cb = false;
> +		} else {
> +			spin_unlock(&hdev->cb_pool_lock);
> +			dev_warn_once(hdev->dev, "CB pool is empty\n");

Isn't it going to be a false alarm when you allocate the cb for the first
time?
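
For reference, a user-space sketch of the fast path / slow path split,
assuming the pool is pre-filled at init the way hl_cb_pool_init() in
this patch appears to do. Under that assumption the "pool is empty"
branch is only reached once every pre-allocated buffer is in use. The
sketch_* names and the pool size of 4 are made up, calloc() stands in
for dma_alloc_coherent(), and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_POOL_CNT 4	/* hypothetical; the patch uses 512 for Goya */

struct sketch_buf {
	struct sketch_buf *next;	/* free-list link while parked in the pool */
	bool is_pool;			/* true if it must go back to the pool */
};

struct sketch_pool {
	struct sketch_buf *free_list;
	struct sketch_buf slots[SKETCH_POOL_CNT];
};

/* Pre-fill the pool once, before any buffer is handed out. */
void sketch_pool_init(struct sketch_pool *p)
{
	p->free_list = NULL;
	for (int i = 0; i < SKETCH_POOL_CNT; i++) {
		p->slots[i].is_pool = true;
		p->slots[i].next = p->free_list;
		p->free_list = &p->slots[i];
	}
}

/* Fast path: pop from the pool. Slow path: allocate a fresh buffer. */
struct sketch_buf *sketch_acquire(struct sketch_pool *p)
{
	struct sketch_buf *b = p->free_list;

	if (b) {
		p->free_list = b->next;
		b->next = NULL;
		return b;
	}

	/* Pool exhausted; may return NULL, and is_pool stays false. */
	return calloc(1, sizeof(*b));
}

/* Pool buffers are parked again; slow-path buffers are freed. */
void sketch_release(struct sketch_pool *p, struct sketch_buf *b)
{
	if (b->is_pool) {
		b->next = p->free_list;
		p->free_list = b;
	} else {
		free(b);
	}
}
```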

> +		}
> +	}
> +
> +	if (alloc_new_cb) {
> +		cb = hl_cb_alloc(hdev, cb_size, ctx_id);
> +		if (!cb) {
> +			rc = -ENOMEM;
> +			goto out_err;
> +		}
> +	}
> +
> +	cb->hdev = hdev;
> +	cb->ctx_id = ctx_id;
> +
> +	spin_lock(&mgr->cb_lock);
> +	rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);

It seems the ID will remain dangling if the cb is reused.

> +	spin_unlock(&mgr->cb_lock);
> +
> +	if (rc < 0) {
> +		dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
> +		goto release_cb;
> +	}
> +
> +	cb->id = rc;
> +
> +	kref_init(&cb->refcount);
> +	spin_lock_init(&cb->lock);
> +
> +	/*
> +	 * idr is 32-bit so we can safely OR it with a mask that is above
> +	 * 32 bit
> +	 */
> +	*handle = cb->id | HL_MMAP_CB_MASK;
> +	*handle <<= PAGE_SHIFT;
> +
> +	return 0;
> +
> +release_cb:
> +	cb_do_release(hdev, cb);
> +out_err:
> +	*handle = 0;
> +
> +	return rc;
> +}
> +
> +int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle)
> +{
> +	struct hl_cb *cb;
> +	u32 handle;
> +	int rc = 0;
> +
> +	/*
> +	 * handle was given to user to do mmap, I need to shift it back to
> +	 * how the idr module gave it to me
> +	 */
> +	cb_handle >>= PAGE_SHIFT;
> +	handle = (u32) cb_handle;
> +
> +	spin_lock(&mgr->cb_lock);
> +
> +	cb = idr_find(&mgr->cb_handles, handle);
> +	if (cb) {
> +		idr_remove(&mgr->cb_handles, handle);
> +		spin_unlock(&mgr->cb_lock);
> +		kref_put(&cb->refcount, cb_release);
> +	} else {
> +		spin_unlock(&mgr->cb_lock);
> +		dev_err(hdev->dev,
> +			"CB destroy failed, no match to handle 0x%x\n", handle);
> +		rc = -EINVAL;
> +	}
> +
> +	return rc;
> +}
> +
> +int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
> +{
> +	union hl_cb_args *args = data;
> +	struct hl_device *hdev = hpriv->hdev;
> +	u64 handle;
> +	int rc;
> +
> +	switch (args->in.op) {
> +	case HL_CB_OP_CREATE:
> +		rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
> +					&handle, hpriv->ctx->asid);
> +		memset(args, 0, sizeof(*args));
> +		args->out.cb_handle = handle;
> +		break;
> +	case HL_CB_OP_DESTROY:
> +		rc = hl_cb_destroy(hdev, &hpriv->cb_mgr,
> +					args->in.cb_handle);
> +		memset(args, 0, sizeof(*args));
> +		break;
> +	default:
> +		rc = -EINVAL;
> +		break;
> +	}
> +
> +	return rc;
> +}
> +
> +static void cb_vm_close(struct vm_area_struct *vma)
> +{
> +	struct hl_cb *cb = (struct hl_cb *) vma->vm_private_data;
> +
> +	hl_cb_put(cb);
> +
> +	spin_lock(&cb->lock);
> +	cb->mmap = false;
> +	cb->vm_start = 0;
> +	cb->vm_end = 0;
> +	spin_unlock(&cb->lock);
> +
> +	vma->vm_private_data = NULL;
> +}
> +
> +static const struct vm_operations_struct cb_vm_ops = {
> +	.close = cb_vm_close
> +};
> +
> +int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
> +{
> +	struct hl_device *hdev = hpriv->hdev;
> +	struct hl_cb *cb;
> +	phys_addr_t address;
> +	u32 handle;
> +	int rc;
> +
> +	handle = vma->vm_pgoff;
> +
> +	/* reference was taken here */
> +	cb = hl_cb_get(hdev, &hpriv->cb_mgr, handle);
> +	if (!cb) {
> +		dev_err(hdev->dev,
> +			"CB mmap failed, no match to handle %d\n", handle);
> +		goto err_out;

Why not simply return -EINVAL?

> +	}
> +
> +	/* Validation check */
> +	if (vma->vm_end - vma->vm_start != cb->size) {
> +		dev_err(hdev->dev,
> +			"CB mmap failed, mmap size 0x%lx != 0x%x cb size\n",
> +			vma->vm_end - vma->vm_start, cb->size);
> +		goto put_cb;
> +	}
> +
> +	spin_lock(&cb->lock);
> +
> +	if (cb->mmap) {
> +		dev_err(hdev->dev,
> +			"CB mmap failed, CB already mmaped to user\n");
> +		goto release_lock;
> +	}
> +
> +	cb->mmap = true;
> +
> +	spin_unlock(&cb->lock);
> +
> +	vma->vm_ops = &cb_vm_ops;
> +
> +	/*
> +	 * Note: We're transferring the cb reference to
> +	 * vma->vm_private_data here.
> +	 */
> +
> +	vma->vm_private_data = cb;
> +
> +	/* Calculate address for CB */
> +	address = virt_to_phys((void *) cb->kernel_address);
> +
> +	rc = hdev->asic_funcs->cb_mmap(hdev, vma, cb->kernel_address,
> +					address, cb->size);
> +
> +	if (rc) {
> +		spin_lock(&cb->lock);
> +		cb->mmap = false;
> +		goto release_lock;
> +	}
> +
> +	cb->vm_start = vma->vm_start;
> +	cb->vm_end = vma->vm_end;
> +
> +	return 0;
> +
> +release_lock:
> +	spin_unlock(&cb->lock);
> +put_cb:
> +	hl_cb_put(cb);
> +err_out:
> +	return -EINVAL;
> +}
> +
> +struct hl_cb *hl_cb_get(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> +			u32 handle)
> +{
> +	struct hl_cb *cb;
> +
> +	spin_lock(&mgr->cb_lock);
> +	cb = idr_find(&mgr->cb_handles, handle);
> +
> +	if (!cb) {
> +		spin_unlock(&mgr->cb_lock);
> +		dev_warn(hdev->dev,
> +			"CB get failed, no match to handle %d\n", handle);
> +		return NULL;
> +	}
> +
> +	kref_get(&cb->refcount);
> +
> +	spin_unlock(&mgr->cb_lock);
> +
> +	return cb;
> +
> +}
> +
> +void hl_cb_put(struct hl_cb *cb)
> +{
> +	kref_put(&cb->refcount, cb_release);
> +}
> +
> +void hl_cb_mgr_init(struct hl_cb_mgr *mgr)
> +{
> +	spin_lock_init(&mgr->cb_lock);
> +	idr_init(&mgr->cb_handles);
> +}
> +
> +void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr)
> +{
> +	struct hl_cb *cb;
> +	struct idr *idp;
> +	u32 id;
> +
> +	idp = &mgr->cb_handles;
> +
> +	idr_for_each_entry(idp, cb, id) {
> +		if (kref_put(&cb->refcount, cb_release) != 1)
> +			dev_err(hdev->dev,
> +				"CB %d for CTX ID %d is still alive\n",
> +				id, cb->ctx_id);
> +	}
> +
> +	idr_destroy(&mgr->cb_handles);
> +}
> +
> +struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size)
> +{
> +	u64 cb_handle;
> +	struct hl_cb *cb;
> +	int rc;
> +
> +	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr, cb_size, &cb_handle,
> +			HL_KERNEL_ASID_ID);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to allocate CB for KMD %d\n", rc);
> +		return NULL;
> +	}
> +
> +	cb_handle >>= PAGE_SHIFT;
> +	cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr, (u32) cb_handle);
> +	/* hl_cb_get should never fail here so use kernel WARN */
> +	WARN(!cb, "Kernel CB handle invalid 0x%x\n", (u32) cb_handle);
> +	if (!cb)
> +		goto destroy_cb;
> +
> +	return cb;
> +
> +destroy_cb:
> +	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb_handle << PAGE_SHIFT);
> +
> +	return NULL;
> +}
> +
> +int hl_cb_pool_init(struct hl_device *hdev)
> +{
> +	struct hl_cb *cb;
> +	int i;
> +
> +	INIT_LIST_HEAD(&hdev->cb_pool);
> +	spin_lock_init(&hdev->cb_pool_lock);
> +
> +	for (i = 0 ; i < hdev->asic_prop.cb_pool_cb_cnt ; i++) {
> +		cb = hl_cb_alloc(hdev, hdev->asic_prop.cb_pool_cb_size,
> +				HL_KERNEL_ASID_ID);
> +		if (cb) {
> +			cb->is_pool = true;
> +			list_add(&cb->pool_list, &hdev->cb_pool);
> +		} else {
> +			hl_cb_pool_fini(hdev);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +int hl_cb_pool_fini(struct hl_device *hdev)
> +{
> +	struct hl_cb *cb, *tmp;
> +
> +	list_for_each_entry_safe(cb, tmp, &hdev->cb_pool, pool_list) {
> +		list_del(&cb->pool_list);
> +		cb_fini(hdev, cb);
> +	}
> +
> +	return 0;
> +}
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 84ce9fcb52da..0bd86a7d34db 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -53,6 +53,7 @@ static int hl_device_release(struct inode *inode, struct file *filp)
>  {
>  	struct hl_fpriv *hpriv = filp->private_data;
>  
> +	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
>  	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
>  
>  	filp->private_data = NULL;
> @@ -62,10 +63,34 @@ static int hl_device_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> +/**
> + * hl_mmap - mmap function for habanalabs device
> + *
> + * @*filp: pointer to file structure
> + * @*vma: pointer to vm_area_struct of the process
> + *
> + * Called when process does an mmap on habanalabs device. Call the device's mmap
> + * function at the end of the common code.
> + */
> +static int hl_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	struct hl_fpriv *hpriv = filp->private_data;
> +
> +	if ((vma->vm_pgoff & HL_MMAP_CB_MASK) == HL_MMAP_CB_MASK) {
> +		vma->vm_pgoff ^= HL_MMAP_CB_MASK;
> +		return hl_cb_mmap(hpriv, vma);
> +	}
> +
> +	return hpriv->hdev->asic_funcs->mmap(hpriv, vma);
> +}
> +
>  static const struct file_operations hl_ops = {
>  	.owner = THIS_MODULE,
>  	.open = hl_device_open,
> -	.release = hl_device_release
> +	.release = hl_device_release,
> +	.mmap = hl_mmap,
> +	.unlocked_ioctl = hl_ioctl,
> +	.compat_ioctl = hl_ioctl
>  };
>  
>  /**
> @@ -145,6 +170,8 @@ static int device_early_init(struct hl_device *hdev)
>  	if (rc)
>  		goto early_fini;
>  
> +	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
> +
>  	mutex_init(&hdev->device_open);
>  	atomic_set(&hdev->fd_open_cnt, 0);
>  
> @@ -166,6 +193,8 @@ static int device_early_init(struct hl_device *hdev)
>  static void device_early_fini(struct hl_device *hdev)
>  {
>  
> +	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
> +
>  	hl_asid_fini(hdev);
>  
>  	if (hdev->asic_funcs->early_fini)
> @@ -280,11 +309,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  		goto free_ctx;
>  	}
>  
> +	rc = hl_cb_pool_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize CB pool\n");
> +		goto release_ctx;
> +	}
> +
>  	dev_notice(hdev->dev,
>  		"Successfully added device to habanalabs driver\n");
>  
>  	return 0;
>  
> +release_ctx:
> +	if (hl_ctx_put(hdev->kernel_ctx) != 1)
> +		dev_err(hdev->dev,
> +			"kernel ctx is still alive on initialization failure\n");
>  free_ctx:
>  	kfree(hdev->kernel_ctx);
>  sw_fini:
> @@ -321,6 +360,8 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Mark device as disabled */
>  	hdev->disabled = true;
>  
> +	hl_cb_pool_fini(hdev);
> +
>  	/* Release kernel context */
>  	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
>  		dev_err(hdev->dev, "kernel ctx is still alive\n");
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index b2952296b890..341ac085af82 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -92,6 +92,9 @@
>  
>  #define GOYA_MAX_INITIATORS		20
>  
> +#define GOYA_CB_POOL_CB_CNT		512
> +#define GOYA_CB_POOL_CB_SIZE		0x20000		/* 128KB */
> +
>  static void goya_get_fixed_properties(struct hl_device *hdev)
>  {
>  	struct asic_fixed_properties *prop = &hdev->asic_prop;
> @@ -119,6 +122,8 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
>  
>  	prop->high_pll = PLL_HIGH_DEFAULT;
> +	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> +	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
>  }
>  
>  /**
> @@ -598,6 +603,27 @@ int goya_resume(struct hl_device *hdev)
>  	return 0;
>  }
>  
> +int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
> +{
> +	return -EINVAL;
> +}
> +
> +int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
> +		u64 kaddress, phys_addr_t paddress, u32 size)
> +{
> +	int rc;
> +
> +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
> +			VM_DONTCOPY | VM_NORESERVE;
> +
> +	rc = remap_pfn_range(vma, vma->vm_start, paddress >> PAGE_SHIFT,
> +				size, vma->vm_page_prot);
> +	if (rc)
> +		dev_err(hdev->dev, "remap_pfn_range error %d", rc);
> +
> +	return rc;
> +}
> +
>  void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
>  					dma_addr_t *dma_handle, gfp_t flags)
>  {
> @@ -617,6 +643,8 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.sw_fini = goya_sw_fini,
>  	.suspend = goya_suspend,
>  	.resume = goya_resume,
> +	.mmap = goya_mmap,
> +	.cb_mmap = goya_cb_mmap,
>  	.dma_alloc_coherent = goya_dma_alloc_coherent,
>  	.dma_free_coherent = goya_dma_free_coherent,
>  };
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index d003a6af2131..6ad476df65b0 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -21,10 +21,12 @@
>  
>  #define HL_NAME				"habanalabs"
>  
> +#define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
> +
>  #define HL_MAX_QUEUES			128
>  
>  struct hl_device;
> -
> +struct hl_fpriv;
>  
>  
>  
> @@ -53,6 +55,8 @@ struct hl_device;
>   * @max_asid: maximum number of open contexts (ASIDs).
>   * @completion_queues_count: number of completion queues.
>   * @high_pll: high PLL frequency used by the device.
> + * @cb_pool_cb_cnt: number of CBs in the CB pool.
> + * @cb_pool_cb_size: size of each CB in the CB pool.
>   * @tpc_enabled_mask: which TPCs are enabled.
>   */
>  struct asic_fixed_properties {
> @@ -73,11 +77,68 @@ struct asic_fixed_properties {
>  	u32			sram_size;
>  	u32			max_asid;
>  	u32			high_pll;
> +	u32			cb_pool_cb_cnt;
> +	u32			cb_pool_cb_size;
>  	u8			completion_queues_count;
>  	u8			tpc_enabled_mask;
>  };
>  
>  
> +
> +
> +
> +
> +/*
> + * Command Buffers
> + */
> +
> +/**
> + * struct hl_cb_mgr - describes a Command Buffer Manager.
> + * @cb_lock: protects cb_handles.
> + * @cb_handles: an idr to hold all command buffer handles.
> + */
> +struct hl_cb_mgr {
> +	spinlock_t		cb_lock;
> +	struct idr		cb_handles; /* protected by cb_lock */
> +};
> +
> +/**
> + * struct hl_cb - describes a Command Buffer.
> + * @refcount: reference counter for usage of the CB.
> + * @hdev: pointer to device this CB belongs to.
> + * @lock: spinlock to protect mmap/cs flows.
> + * @pool_list: node in pool list of command buffers.
> + * @kernel_address: Holds the CB's kernel virtual address.
> + * @bus_address: Holds the CB's DMA address.
> + * @vm_start: Holds the CB's user start virtual address (when mmaped).
> + * @vm_end: Holds the CB's user end virtual address (when mmaped).
> + * @size: holds the CB's size.
> + * @id: the CB's ID.
> + * @ctx_id: holds the ID of the owner's context.
> + * @mmap: true if the CB is currently mmaped to user.
> + * @is_pool: true if CB was acquired from the pool, false otherwise.
> + */
> +struct hl_cb {
> +	struct kref		refcount;
> +	struct hl_device	*hdev;
> +	spinlock_t		lock;
> +	struct list_head	pool_list;
> +	u64			kernel_address;
> +	dma_addr_t		bus_address;
> +	u64			vm_start;
> +	u64			vm_end;
> +	u32			size;
> +	u32			id;
> +	u32			ctx_id;
> +	u8			mmap;
> +	u8			is_pool;
> +};
> +
> +
> +
> +
> +
> +
>  #define HL_QUEUE_LENGTH			256
>  
>  
> @@ -109,6 +170,8 @@ enum hl_asic_type {
>   * @sw_fini: tears down driver state, does not configure H/W.
>   * @suspend: handles IP specific H/W or SW changes for suspend.
>   * @resume: handles IP specific H/W or SW changes for resume.
> + * @mmap: mmap function, does nothing.
> + * @cb_mmap: maps a CB.
>   * @dma_alloc_coherent: DMA allocate coherent memory.
>   * @dma_free_coherent: free DMA allocation.
>   */
> @@ -119,6 +182,9 @@ struct hl_asic_funcs {
>  	int (*sw_fini)(struct hl_device *hdev);
>  	int (*suspend)(struct hl_device *hdev);
>  	int (*resume)(struct hl_device *hdev);
> +	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> +	int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
> +			u64 kaddress, phys_addr_t paddress, u32 size);
>  	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
>  					dma_addr_t *dma_handle, gfp_t flag);
>  	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
> @@ -175,6 +241,7 @@ struct hl_ctx_mgr {
>   * @taskpid: current process ID.
>   * @ctx: current executing context.
>   * @ctx_mgr: context manager to handle multiple context for this FD.
> + * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
>   * @refcount: number of related contexts.
>   */
>  struct hl_fpriv {
> @@ -183,6 +250,7 @@ struct hl_fpriv {
>  	struct pid		*taskpid;
>  	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
>  	struct hl_ctx_mgr	ctx_mgr;
> +	struct hl_cb_mgr	cb_mgr;
>  	struct kref		refcount;
>  };
>  
> @@ -239,6 +307,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @asic_name: ASIC specific nmae.
>   * @asic_type: ASIC specific type.
>   * @kernel_ctx: KMD context structure.
> + * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
>   * @dma_pool: DMA pool for small allocations.
>   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
>   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> @@ -249,6 +318,8 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @asic_prop: ASIC specific immutable properties.
>   * @asic_funcs: ASIC specific functions.
>   * @asic_specific: ASIC specific information to use only from ASIC files.
> + * @cb_pool: list of preallocated CBs.
> + * @cb_pool_lock: protects the CB pool.
>   * @user_ctx: current user context executing.
>   * @fd_open_cnt: number of open context executing.
>   * @major: habanalabs KMD major.
> @@ -264,6 +335,7 @@ struct hl_device {
>  	char				asic_name[16];
>  	enum hl_asic_type		asic_type;
>  	struct hl_ctx			*kernel_ctx;
> +	struct hl_cb_mgr		kernel_cb_mgr;
>  	struct dma_pool			*dma_pool;
>  	void				*cpu_accessible_dma_mem;
>  	dma_addr_t			cpu_accessible_dma_address;
> @@ -275,6 +347,10 @@ struct hl_device {
>  	struct asic_fixed_properties	asic_prop;
>  	const struct hl_asic_funcs	*asic_funcs;
>  	void				*asic_specific;
> +
> +	struct list_head		cb_pool;
> +	spinlock_t			cb_pool_lock;
> +
>  	/* TODO: The following fields should be moved for multi-context */
>  	struct hl_ctx			*user_ctx;
>  	atomic_t			fd_open_cnt;
> @@ -345,6 +421,23 @@ int hl_device_resume(struct hl_device *hdev);
>  void hl_hpriv_get(struct hl_fpriv *hpriv);
>  void hl_hpriv_put(struct hl_fpriv *hpriv);
>  
> +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
> +		u64 *handle, int ctx_id);
> +int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle);
> +int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> +struct hl_cb *hl_cb_get(struct hl_device *hdev,	struct hl_cb_mgr *mgr,
> +			u32 handle);
> +void hl_cb_put(struct hl_cb *cb);
> +void hl_cb_mgr_init(struct hl_cb_mgr *mgr);
> +void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr);
> +struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size);
> +int hl_cb_pool_init(struct hl_device *hdev);
> +int hl_cb_pool_fini(struct hl_device *hdev);
> +
>  void goya_set_asic_funcs(struct hl_device *hdev);
>  
> +/* IOCTLs */
> +long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
> +int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
> +
>  #endif /* HABANALABSP_H_ */
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index 0646da83eb53..5c312dd3aa50 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -123,6 +123,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
>  	kref_init(&hpriv->refcount);
>  	nonseekable_open(inode, filp);
>  
> +	hl_cb_mgr_init(&hpriv->cb_mgr);
>  	hl_ctx_mgr_init(&hpriv->ctx_mgr);
>  
>  	rc = hl_ctx_create(hdev, hpriv);
> @@ -138,6 +139,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
>  out_err:
>  	filp->private_data = NULL;
>  	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> +	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
>  	kfree(hpriv);
>  
>  close_device:
> diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
> new file mode 100644
> index 000000000000..fa2287569e0e
> --- /dev/null
> +++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include <uapi/misc/habanalabs.h>
> +#include "habanalabs.h"
> +
> +#include <linux/fs.h>
> +#include <linux/uaccess.h>
> +#include <linux/cred.h>
> +
> +#define HL_IOCTL_DEF(ioctl, _func) \
> +	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
> +
> +static const struct hl_ioctl_desc hl_ioctls[] = {
> +	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl)
> +};
> +
> +#define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
> +
> +long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> +{
> +	struct hl_fpriv *hpriv = filep->private_data;
> +	struct hl_device *hdev = hpriv->hdev;
> +	hl_ioctl_t *func;
> +	const struct hl_ioctl_desc *ioctl = NULL;
> +	unsigned int nr = _IOC_NR(cmd);
> +	char stack_kdata[128];
> +	char *kdata = NULL;
> +	unsigned int usize, asize;
> +	int retcode = -EINVAL;
> +
> +	if (nr >= HL_CORE_IOCTL_COUNT)

	nr > HL_CORE_IOCTL_COUNT, isn't it?

> +		goto err_i1;

err_i1 is not very meaningful. Maybe invalid_ioctl?

> +
> +	if ((nr >= HL_COMMAND_START) && (nr < HL_COMMAND_END)) {

The HL_COMMAND_{START,END} do not seem to be defined.
Besides, this check seems to overlap with

	if (nr > HL_CORE_IOCTL_COUNT)
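
A user-space sketch of how the two checks interact on a table built with
a designated initializer. The command numbers 0x02 and 0x03 are taken
from the uapi header at the end of this patch; everything named sketch_*
or SKETCH_* is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Values matching HL_COMMAND_START / HL_COMMAND_END in the uapi header. */
#define SKETCH_CMD_START 0x02
#define SKETCH_CMD_END   0x03

typedef int (*sketch_ioctl_fn)(void *data);

int sketch_cb_ioctl(void *data)
{
	(void)data;
	return 0;
}

/*
 * A designated initializer at index 2 makes the table three entries long;
 * slots 0 and 1 are zero-filled, so their handler is NULL.
 */
const sketch_ioctl_fn sketch_ioctls[] = {
	[SKETCH_CMD_START] = sketch_cb_ioctl,
};

size_t sketch_table_len(void)
{
	return SKETCH_ARRAY_SIZE(sketch_ioctls);
}

/* Returns the handler for nr, or NULL if nr must be rejected. */
sketch_ioctl_fn sketch_lookup(unsigned int nr)
{
	/* Valid indexes are 0 .. len - 1, so nr == len is already out. */
	if (nr >= SKETCH_ARRAY_SIZE(sketch_ioctls))
		return NULL;

	/* Narrows the accepted range to [START, END). */
	if (nr < SKETCH_CMD_START || nr >= SKETCH_CMD_END)
		return NULL;

	return sketch_ioctls[nr];
}
```

With a single command the range check alone happens to be sufficient,
because END equals the table length. The first check only matters if
the two ever drift apart.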

> +		u32 hl_size;
> +
> +		ioctl = &hl_ioctls[nr];
> +
> +		hl_size = _IOC_SIZE(ioctl->cmd);
> +		usize = asize = _IOC_SIZE(cmd);
> +		if (hl_size > asize)
> +			asize = hl_size;
> +
> +		cmd = ioctl->cmd;
> +	} else {
> +		goto err_i1;
> +	}
> +
> +	/* Do not trust userspace, use our own definition */
> +	func = ioctl->func;
> +
> +	if (unlikely(!func)) {
> +		dev_dbg(hdev->dev, "no function\n");
> +		retcode = -EINVAL;
> +		goto err_i1;
> +	}
> +
> +	if (cmd & (IOC_IN | IOC_OUT)) {
> +		if (asize <= sizeof(stack_kdata)) {
> +			kdata = stack_kdata;
> +		} else {
> +			kdata = kmalloc(asize, GFP_KERNEL);
> +			if (!kdata) {
> +				retcode = -ENOMEM;
> +				goto err_i1;
> +			}
> +		}
> +		if (asize > usize)
> +			memset(kdata + usize, 0, asize - usize);

Just init stack_kdata to 0 and use kzalloc instead of kmalloc.
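
A user-space sketch of the zero-first approach: if the buffer starts out
zeroed, the separate memset() of the tail beyond usize is no longer
needed. calloc() and memcpy() stand in for kzalloc() and
copy_from_user(); sketch_prepare and SKETCH_STACK_SZ are made-up names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_STACK_SZ 128

/*
 * Copies usize bytes of src into a zeroed buffer of asize bytes.
 * Uses stack_buf (which must hold SKETCH_STACK_SZ bytes) when it is
 * large enough, else a zeroed heap buffer the caller must free().
 * Returns NULL on allocation failure or if usize > asize.
 */
char *sketch_prepare(char *stack_buf, const void *src,
		     size_t usize, size_t asize)
{
	char *kdata;

	if (usize > asize)
		return NULL;

	if (asize <= SKETCH_STACK_SZ) {
		/* Equivalent to declaring the array with "= {0}". */
		memset(stack_buf, 0, SKETCH_STACK_SZ);
		kdata = stack_buf;
	} else {
		kdata = calloc(1, asize);
		if (!kdata)
			return NULL;
	}

	memcpy(kdata, src, usize);
	return kdata;
}
```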

> +	}
> +
> +	if (cmd & IOC_IN) {
> +		if (copy_from_user(kdata, (void __user *)arg, usize)) {
> +			retcode = -EFAULT;
> +			goto err_i1;
> +		}
> +	} else if (cmd & IOC_OUT) {
> +		memset(kdata, 0, usize);
> +	}
> +
> +	retcode = func(hpriv, kdata);
> +
> +	if (cmd & IOC_OUT)
> +		if (copy_to_user((void __user *)arg, kdata, usize))
> +			retcode = -EFAULT;
> +
> +err_i1:
> +	if (!ioctl)
> +		dev_dbg(hdev->dev,
> +			"invalid ioctl: pid=%d, cmd=0x%02x, nr=0x%02x\n",
> +			  task_pid_nr(current), cmd, nr);

I think this can move right after the 'nr' sanity check, and there you can
simply return -EINVAL after dev_dbg().

> +
> +	if (kdata != stack_kdata)
> +		kfree(kdata);
> +
> +	return retcode;
> +}
> diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
> new file mode 100644
> index 000000000000..b3f9213d4709
> --- /dev/null
> +++ b/include/uapi/misc/habanalabs.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> + *
> + */
> +
> +#ifndef HABANALABS_H_
> +#define HABANALABS_H_
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/* Opcode to create a new command buffer */
> +#define HL_CB_OP_CREATE		0
> +/* Opcode to destroy previously created command buffer */
> +#define HL_CB_OP_DESTROY	1
> +
> +struct hl_cb_in {
> +	/* Handle of CB or 0 if we want to create one */
> +	__u64 cb_handle;
> +	/* HL_CB_OP_* */
> +	__u32 op;
> +	/* Size of CB. Minimum requested size must be PAGE_SIZE */
> +	__u32 cb_size;
> +	/* Context ID - Currently not in use */
> +	__u32 ctx_id;
> +	__u32 pad;
> +};
> +
> +struct hl_cb_out {
> +	/* Handle of CB */
> +	__u64 cb_handle;
> +};
> +
> +union hl_cb_args {
> +	struct hl_cb_in in;
> +	struct hl_cb_out out;
> +};
> +
> +/*
> + * Command Buffer
> + * - Request a Command Buffer
> + * - Destroy a Command Buffer
> + *
> + * The command buffers are memory blocks that reside in DMA-able address
> + * space and are physically contiguous so they can be accessed by the device
> + * directly. They are allocated using the coherent DMA API.
> + *
> + * When creating a new CB, the IOCTL returns a handle of it, and the user-space
> + * process needs to use that handle to mmap the buffer so it can access them.
> + *
> + */
> +#define HL_IOCTL_CB		\
> +		_IOWR('H', 0x02, union hl_cb_args)
> +
> +#define HL_COMMAND_START	0x02
> +#define HL_COMMAND_END		0x03
> +
> +#endif /* HABANALABS_H_ */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 04/15] habanalabs: add context and ASID modules
  2019-01-23  0:00 ` [PATCH 04/15] habanalabs: add context and ASID modules Oded Gabbay
@ 2019-01-23 12:28   ` Mike Rapoport
  2019-01-25 21:07     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-23 12:28 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:46AM +0200, Oded Gabbay wrote:
> This patch adds two modules - ASID and context.
> 
> Each user process the opens a device's file must have at least one context

                   ^that

> before it is able to "work" with the device. Each context has its own
> device address-space and contains information about its runtime state (its
> active command submissions).
> 
> To have address-space separation between contexts, each context is assigned
> a unique ASID, which stands for "address-space id". Goya supports up to
> 1024 ASIDs.
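
For reference, a user-space sketch of a bitmap ASID allocator along the
lines of hl_asid_alloc()/hl_asid_free() further down: ASID 0 is reserved,
allocation returns the lowest free id, and 0 signals exhaustion. The
sketch_* names are hypothetical, the limit of 1024 comes from the text
above, and the mutex is left out.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define SKETCH_MAX_ASID 1024	/* from the commit message: up to 1024 ASIDs */
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_MAX_ASID + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

struct sketch_asid_map {
	unsigned long bits[SKETCH_WORDS];
};

void sketch_asid_init(struct sketch_asid_map *m)
{
	memset(m->bits, 0, sizeof(m->bits));
	m->bits[0] |= 1UL;	/* ASID 0 is reserved for the kernel driver */
}

/* Returns the lowest free ASID and marks it used, or 0 if none is left. */
unsigned long sketch_asid_alloc(struct sketch_asid_map *m)
{
	for (unsigned long id = 1; id < SKETCH_MAX_ASID; id++) {
		unsigned long *word = &m->bits[id / SKETCH_BITS_PER_WORD];
		unsigned long mask = 1UL << (id % SKETCH_BITS_PER_WORD);

		if (!(*word & mask)) {
			*word |= mask;
			return id;
		}
	}
	return 0;
}

/* Returns 0 on success, -1 for ASID 0 or an out-of-range value. */
int sketch_asid_free(struct sketch_asid_map *m, unsigned long id)
{
	if (id == 0 || id >= SKETCH_MAX_ASID)
		return -1;
	m->bits[id / SKETCH_BITS_PER_WORD] &=
		~(1UL << (id % SKETCH_BITS_PER_WORD));
	return 0;
}
```

Returning 0 on exhaustion works only because ASID 0 can never be handed
out to a user context.
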
> 
> Currently, the driver doesn't support multiple contexts. Therefore, the
> user doesn't need to actively create a context. A "primary context" is
> created automatically when the user opens the device's file.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/Makefile         |   2 +-
>  drivers/misc/habanalabs/asid.c           |  58 +++++++++
>  drivers/misc/habanalabs/context.c        | 155 +++++++++++++++++++++++
>  drivers/misc/habanalabs/device.c         |  47 +++++++
>  drivers/misc/habanalabs/habanalabs.h     |  70 ++++++++++
>  drivers/misc/habanalabs/habanalabs_drv.c |  46 ++++++-
>  6 files changed, 375 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/asid.c
>  create mode 100644 drivers/misc/habanalabs/context.c
> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index 6f1ead69bd77..3ffbadc2ca01 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -4,7 +4,7 @@
>  
>  obj-m	:= habanalabs.o
>  
> -habanalabs-y := habanalabs_drv.o device.o
> +habanalabs-y := habanalabs_drv.o device.o context.o asid.o
>  
>  include $(src)/goya/Makefile
>  habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/asid.c b/drivers/misc/habanalabs/asid.c
> new file mode 100644
> index 000000000000..0ce84c8f5a47
> --- /dev/null
> +++ b/drivers/misc/habanalabs/asid.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +
> +int hl_asid_init(struct hl_device *hdev)
> +{
> +	hdev->asid_bitmap = kcalloc(BITS_TO_LONGS(hdev->asic_prop.max_asid),
> +					sizeof(*hdev->asid_bitmap), GFP_KERNEL);
> +	if (!hdev->asid_bitmap)
> +		return -ENOMEM;
> +
> +	mutex_init(&hdev->asid_mutex);
> +
> +	/* ASID 0 is reserved for KMD */
> +	set_bit(0, hdev->asid_bitmap);
> +
> +	return 0;
> +}
> +
> +void hl_asid_fini(struct hl_device *hdev)
> +{
> +	mutex_destroy(&hdev->asid_mutex);
> +	kfree(hdev->asid_bitmap);
> +}
> +
> +unsigned long hl_asid_alloc(struct hl_device *hdev)
> +{
> +	unsigned long found;
> +
> +	mutex_lock(&hdev->asid_mutex);
> +
> +	found = find_first_zero_bit(hdev->asid_bitmap,
> +					hdev->asic_prop.max_asid);
> +	if (found == hdev->asic_prop.max_asid)
> +		found = 0;
> +	else
> +		set_bit(found, hdev->asid_bitmap);
> +
> +	mutex_unlock(&hdev->asid_mutex);
> +
> +	return found;
> +}
> +
> +void hl_asid_free(struct hl_device *hdev, unsigned long asid)
> +{
> +	if (WARN((asid == 0 || asid >= hdev->asic_prop.max_asid),
> +						"Invalid ASID %lu", asid))
> +		return;
> +	clear_bit(asid, hdev->asid_bitmap);
> +}
> diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
> new file mode 100644
> index 000000000000..cdcad077e5cf
> --- /dev/null
> +++ b/drivers/misc/habanalabs/context.c
> @@ -0,0 +1,155 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +
> +#include <linux/sched.h>
> +#include <linux/delay.h>
> +
> +static void hl_ctx_fini(struct hl_ctx *ctx)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +
> +	if (ctx->asid != HL_KERNEL_ASID_ID)
> +		hl_asid_free(hdev, ctx->asid);
> +}
> +
> +void hl_ctx_do_release(struct kref *ref)
> +{
> +	struct hl_ctx *ctx;
> +
> +	ctx = container_of(ref, struct hl_ctx, refcount);
> +
> +	dev_dbg(ctx->hdev->dev, "Now really releasing context %d\n", ctx->asid);
> +
> +	hl_ctx_fini(ctx);
> +
> +	if (ctx->hpriv)
> +		hl_hpriv_put(ctx->hpriv);
> +
> +	kfree(ctx);
> +}
> +
> +int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
> +{
> +	struct hl_ctx_mgr *mgr = &hpriv->ctx_mgr;
> +	struct hl_ctx *ctx;
> +	int rc;
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx) {
> +		rc = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	rc = hl_ctx_init(hdev, ctx, false);
> +	if (rc)
> +		goto free_ctx;
> +
> +	hl_hpriv_get(hpriv);
> +	ctx->hpriv = hpriv;
> +
> +	/* TODO: remove for multiple contexts */
> +	hpriv->ctx = ctx;
> +	hdev->user_ctx = ctx;
> +
> +	mutex_lock(&mgr->ctx_lock);
> +	rc = idr_alloc(&mgr->ctx_handles, ctx, 1, 0, GFP_KERNEL);
> +	mutex_unlock(&mgr->ctx_lock);
> +
> +	if (rc < 0) {
> +		dev_err(hdev->dev, "Failed to allocate IDR for a new CTX\n");
> +		hl_ctx_free(hdev, ctx);
> +		goto out_err;
> +	}
> +
> +	return 0;
> +
> +free_ctx:
> +	kfree(ctx);
> +out_err:
> +	return rc;
> +}
> +
> +void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
> +{
> +	if (kref_put(&ctx->refcount, hl_ctx_do_release) == 1)
> +		return;
> +
> +	dev_warn(hdev->dev,
> +		"Context %d closed or terminated but its CS are executing\n",
> +		ctx->asid);
> +}
> +
> +int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
> +{
> +	ctx->hdev = hdev;
> +
> +	kref_init(&ctx->refcount);
> +
> +	if (is_kernel_ctx) {
> +		ctx->asid = HL_KERNEL_ASID_ID; /* KMD gets ASID 0 */
> +	} else {
> +		ctx->asid = hl_asid_alloc(hdev);
> +		if (!ctx->asid) {
> +			dev_err(hdev->dev, "No free ASID, failed to create context\n");
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	dev_dbg(hdev->dev, "Created context with ASID %u\n", ctx->asid);
> +
> +	return 0;
> +}
> +
> +void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
> +{
> +	kref_get(&ctx->refcount);
> +}
> +
> +int hl_ctx_put(struct hl_ctx *ctx)
> +{
> +	return kref_put(&ctx->refcount, hl_ctx_do_release);
> +}
> +
> +/**
> + * hl_ctx_mgr_init - initialize the context manager
> + *
> + * @mgr: pointer to context manager structure
> + *
> + * This manager is an object inside the hpriv object of the user process.
> + * The function is called when a user process opens the FD.
> + */
> +void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr)
> +{
> +	mutex_init(&mgr->ctx_lock);
> +	idr_init(&mgr->ctx_handles);
> +}
> +
> +/**
> + * hl_ctx_mgr_fini - finalize the context manager
> + *
> + * @hdev: pointer to device structure
> + * @mgr: pointer to context manager structure
> + *
> + * This function goes over all the contexts in the manager and frees them.
> + * It is called when a process closes the FD.
> + */
> +void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr)
> +{
> +	struct hl_ctx *ctx;
> +	struct idr *idp;
> +	u32 id;
> +
> +	idp = &mgr->ctx_handles;
> +
> +	idr_for_each_entry(idp, ctx, id)
> +		hl_ctx_free(hdev, ctx);
> +
> +	idr_destroy(&mgr->ctx_handles);
> +	mutex_destroy(&mgr->ctx_lock);
> +}
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index a4276ef559b3..84ce9fcb52da 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -23,6 +23,12 @@ static void hpriv_release(struct kref *ref)
>  	put_pid(hpriv->taskpid);
>  
>  	kfree(hpriv);
> +
> +	/* Now the FD is really closed */
> +	atomic_dec(&hdev->fd_open_cnt);
> +
> +	/* This allows a new user context to open the device */
> +	hdev->user_ctx = NULL;
>  }
>  
>  void hl_hpriv_get(struct hl_fpriv *hpriv)
> @@ -47,6 +53,8 @@ static int hl_device_release(struct inode *inode, struct file *filp)
>  {
>  	struct hl_fpriv *hpriv = filp->private_data;
>  
> +	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> +
>  	filp->private_data = NULL;
>  
>  	hl_hpriv_put(hpriv);
> @@ -133,7 +141,20 @@ static int device_early_init(struct hl_device *hdev)
>  	if (rc)
>  		return rc;
>  
> +	rc = hl_asid_init(hdev);
> +	if (rc)
> +		goto early_fini;
> +
> +	mutex_init(&hdev->device_open);
> +	atomic_set(&hdev->fd_open_cnt, 0);
> +
>  	return 0;
> +
> +early_fini:
> +	if (hdev->asic_funcs->early_fini)
> +		hdev->asic_funcs->early_fini(hdev);
> +
> +	return rc;
>  }
>  
>  /**
> @@ -145,9 +166,12 @@ static int device_early_init(struct hl_device *hdev)
>  static void device_early_fini(struct hl_device *hdev)
>  {
>  
> +	hl_asid_fini(hdev);
> +
>  	if (hdev->asic_funcs->early_fini)
>  		hdev->asic_funcs->early_fini(hdev);
>  
> +	mutex_destroy(&hdev->device_open);
>  }
>  
>  /**
> @@ -241,11 +265,30 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  	if (rc)
>  		goto early_fini;
>  
> +	/* Allocate the kernel context */
> +	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
> +	if (!hdev->kernel_ctx) {
> +		rc = -ENOMEM;
> +		goto sw_fini;
> +	}
> +
> +	hdev->user_ctx = NULL;
> +
> +	rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize kernel context\n");
> +		goto free_ctx;
> +	}
> +
>  	dev_notice(hdev->dev,
>  		"Successfully added device to habanalabs driver\n");
>  
>  	return 0;
>  
> +free_ctx:
> +	kfree(hdev->kernel_ctx);
> +sw_fini:
> +	hdev->asic_funcs->sw_fini(hdev);
>  early_fini:
>  	device_early_fini(hdev);
>  release_device:
> @@ -278,6 +321,10 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Mark device as disabled */
>  	hdev->disabled = true;
>  
> +	/* Release kernel context */
> +	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
> +		dev_err(hdev->dev, "kernel ctx is still alive\n");
> +
>  	/* Call ASIC S/W finalize function */
>  	hdev->asic_funcs->sw_fini(hdev);
>  
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index 97844825f7a8..d003a6af2131 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -125,6 +125,45 @@ struct hl_asic_funcs {
>  					void *cpu_addr, dma_addr_t dma_handle);
>  };
>  
> +
> +
> +
> +
> +/*
> + * CONTEXTS
> + */
> +
> +#define HL_KERNEL_ASID_ID	0
> +
> +/**
> + * struct hl_ctx - user/kernel context.
> + * @hpriv: pointer to the private (KMD) data of the process (fd).
> + * @hdev: pointer to the device structure.
> + * @refcount: reference counter for the context. Context is released only when
> + *		this hits 0. It is incremented on CS and CS_WAIT.
> + * @asid: context's unique address space ID in the device's MMU.
> + */
> +struct hl_ctx {
> +	struct hl_fpriv		*hpriv;
> +	struct hl_device	*hdev;
> +	struct kref		refcount;
> +	u32			asid;
> +};
> +
> +/**
> + * struct hl_ctx_mgr - for handling multiple contexts.
> + * @ctx_lock: protects ctx_handles.
> + * @ctx_handles: idr to hold all ctx handles.
> + */
> +struct hl_ctx_mgr {
> +	struct mutex		ctx_lock;
> +	struct idr		ctx_handles;
> +};
> +
> +
> +
> +
> +
>  /*
>   * FILE PRIVATE STRUCTURE
>   */
> @@ -134,12 +173,16 @@ struct hl_asic_funcs {
>   * @hdev: habanalabs device structure.
>   * @filp: pointer to the given file structure.
>   * @taskpid: current process ID.
> + * @ctx: current executing context.
> + * @ctx_mgr: context manager to handle multiple context for this FD.
>   * @refcount: number of related contexts.
>   */
>  struct hl_fpriv {
>  	struct hl_device	*hdev;
>  	struct file		*filp;
>  	struct pid		*taskpid;
> +	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
> +	struct hl_ctx_mgr	ctx_mgr;
>  	struct kref		refcount;
>  };
>  
> @@ -195,13 +238,19 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @dev: related kernel basic device structure.
>   * @asic_name: ASIC specific name.
>   * @asic_type: ASIC specific type.
> + * @kernel_ctx: KMD context structure.
>   * @dma_pool: DMA pool for small allocations.
>   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
>   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
>   * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
> + * @asid_bitmap: holds used/available ASIDs.
> + * @asid_mutex: protects asid_bitmap.
> + * @device_open: lock for sanity checks upon FD open.

device_open is an ambiguous name for a lock

>   * @asic_prop: ASIC specific immutable properties.
>   * @asic_funcs: ASIC specific functions.
>   * @asic_specific: ASIC specific information to use only from ASIC files.
> + * @user_ctx: current user context executing.
> + * @fd_open_cnt: number of open context executing.
>   * @major: habanalabs KMD major.
>   * @id: device minor.
>   * @disabled: is device disabled.
> @@ -214,13 +263,21 @@ struct hl_device {
>  	struct device			*dev;
>  	char				asic_name[16];
>  	enum hl_asic_type		asic_type;
> +	struct hl_ctx			*kernel_ctx;
>  	struct dma_pool			*dma_pool;
>  	void				*cpu_accessible_dma_mem;
>  	dma_addr_t			cpu_accessible_dma_address;
>  	struct gen_pool			*cpu_accessible_dma_pool;
> +	unsigned long			*asid_bitmap;
> +	struct mutex			asid_mutex;
> +	/* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
> +	struct mutex			device_open;
>  	struct asic_fixed_properties	asic_prop;
>  	const struct hl_asic_funcs	*asic_funcs;
>  	void				*asic_specific;
> +	/* TODO: The following fields should be moved for multi-context */
> +	struct hl_ctx			*user_ctx;
> +	atomic_t			fd_open_cnt;
>  	u32				major;
>  	u16				id;
>  	u8				disabled;
> @@ -270,10 +327,23 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
>  int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
>  				u32 timeout_us, u32 *val);
>  
> +int hl_asid_init(struct hl_device *hdev);
> +void hl_asid_fini(struct hl_device *hdev);
> +unsigned long hl_asid_alloc(struct hl_device *hdev);
> +void hl_asid_free(struct hl_device *hdev, unsigned long asid);
> +
> +int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv);
> +void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx);
> +int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx);
> +int hl_ctx_put(struct hl_ctx *ctx);
> +void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr);
> +void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr);
>  int hl_device_init(struct hl_device *hdev, struct class *hclass);
>  void hl_device_fini(struct hl_device *hdev);
>  int hl_device_suspend(struct hl_device *hdev);
>  int hl_device_resume(struct hl_device *hdev);
> +void hl_hpriv_get(struct hl_fpriv *hpriv);
> +void hl_hpriv_put(struct hl_fpriv *hpriv);
>  
>  void goya_set_asic_funcs(struct hl_device *hdev);
>  
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index 79545003b7c2..0646da83eb53 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -77,6 +77,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
>  {
>  	struct hl_device *hdev;
>  	struct hl_fpriv *hpriv;
> +	int rc;
>  
>  	mutex_lock(&hl_devs_idr_lock);
>  	hdev = idr_find(&hl_devs_idr, iminor(inode));
> @@ -88,9 +89,33 @@ int hl_device_open(struct inode *inode, struct file *filp)
>  		return -ENXIO;
>  	}
>  
> +	mutex_lock(&hdev->device_open);
> +
> +	if (hdev->disabled) {
> +		dev_err_ratelimited(hdev->dev,
> +			"Can't open %s because it is disabled\n",
> +			dev_name(hdev->dev));
> +		mutex_unlock(&hdev->device_open);
> +		return -EPERM;
> +	}
> +
> +	if (hdev->user_ctx) {
> +		dev_info_ratelimited(hdev->dev,
> +			"Device %s is already attached to application\n",
> +			dev_name(hdev->dev));
> +		mutex_unlock(&hdev->device_open);
> +		return -EBUSY;
> +	}
> +
> +	atomic_inc(&hdev->fd_open_cnt);
> +
> +	mutex_unlock(&hdev->device_open);
> +
>  	hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
> -	if (!hpriv)
> -		return -ENOMEM;
> +	if (!hpriv) {
> +		rc = -ENOMEM;
> +		goto close_device;
> +	}
>  
>  	hpriv->hdev = hdev;
>  	filp->private_data = hpriv;
> @@ -98,9 +123,26 @@ int hl_device_open(struct inode *inode, struct file *filp)
>  	kref_init(&hpriv->refcount);
>  	nonseekable_open(inode, filp);
>  
> +	hl_ctx_mgr_init(&hpriv->ctx_mgr);
> +
> +	rc = hl_ctx_create(hdev, hpriv);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to open FD (CTX fail)\n");
> +		goto out_err;
> +	}
> +
>  	hpriv->taskpid = find_get_pid(current->pid);
>  
>  	return 0;
> +
> +out_err:
> +	filp->private_data = NULL;
> +	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> +	kfree(hpriv);
> +
> +close_device:
> +	atomic_dec(&hdev->fd_open_cnt);
> +	return rc;
>  }
>  
>  /**
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23 12:28   ` Mike Rapoport
@ 2019-01-23 12:40     ` Greg KH
  2019-01-23 12:55       ` Mike Rapoport
  2019-01-25 20:05     ` Oded Gabbay
  1 sibling, 1 reply; 103+ messages in thread
From: Greg KH @ 2019-01-23 12:40 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Oded Gabbay, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:28:05PM +0200, Mike Rapoport wrote:
> On Wed, Jan 23, 2019 at 02:00:43AM +0200, Oded Gabbay wrote:
> > +/**
> > + * hl_device_release - release function for habanalabs device
> > + *
> > + * @inode: pointer to inode structure
> > + * @filp: pointer to file structure
> > + *
> > + * Called when process closes an habanalabs device
> > + */
> 
> It's nice to see docs coming along with the code.
> I have some comments for the formatting.
> 
> kernel-doc won't be happy about missing return value descriptions, and
> although they are sometimes redundant or too obvious their absence makes
> 'make V=1 htmldocs' really noisy.
> 
> In general, it would be nice if you could link the habanalabs driver
> kernel-doc somewhere in Documentation/ and run 'make V=1 htmldocs'.
> 
> > +static int hl_device_release(struct inode *inode, struct file *filp)

There's no need for kerneldoc comments for static functions, as no one
can call them and they are not part of any api.

So what would be better here is to just drop the /** line and use /*
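
As a rough sketch of what that looks like (demo_release and struct
demo_file are made-up names for illustration, not taken from the patch),
a static function keeps its description but in an ordinary comment that
kernel-doc will not try to parse:

```c
/* Hypothetical stand-in for per-file driver state. */
struct demo_file {
	int open_count;
};

/*
 * demo_release - release function for a hypothetical demo device
 *
 * Plain comment opener: the function is static, so it is not part of
 * any API and does not need to be parsed by kernel-doc.
 * Returns 0 on success, -1 if the file was not open.
 */
static int demo_release(struct demo_file *f)
{
	if (f->open_count == 0)
		return -1;

	f->open_count--;
	return 0;
}
```

The only change from the kernel-doc form is the comment opener; the
text itself can stay as it is.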

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23 12:40     ` Greg KH
@ 2019-01-23 12:55       ` Mike Rapoport
  2019-01-25 20:09         ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-23 12:55 UTC (permalink / raw)
  To: Greg KH; +Cc: Oded Gabbay, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 01:40:04PM +0100, Greg KH wrote:
> On Wed, Jan 23, 2019 at 02:28:05PM +0200, Mike Rapoport wrote:
> > On Wed, Jan 23, 2019 at 02:00:43AM +0200, Oded Gabbay wrote:
> > > +/**
> > > + * hl_device_release - release function for habanalabs device
> > > + *
> > > + * @inode: pointer to inode structure
> > > + * @filp: pointer to file structure
> > > + *
> > > + * Called when process closes an habanalabs device
> > > + */
> > 
> > > It's nice to see docs coming along with the code.
> > I have some comments for the formatting.
> > 
> > kernel-doc won't be happy about missing return value descriptions, and
> > although they are sometimes redundant or too obvious their absence makes
> > 'make V=1 htmldocs' really noisy.
> > 
> > > In general, it would be nice if you could link the habanalabs driver
> > > kernel-doc somewhere in Documentation/ and run 'make V=1 htmldocs'.
> > 
> > > +static int hl_device_release(struct inode *inode, struct file *filp)
> 
> There's no need for kerneldoc comments for static functions, as no one
> can call them and they are not part of any api.
> 
> So what would be better here is to just drop the /** line and use /*

Maybe it'd make sense to use /* for most of the comments in this driver as
there are kernel-doc formatting issues in non-static functions as well, I
was just too lazy to go over all of them.
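
For the non-static functions that do keep kernel-doc, here is a minimal
sketch of the layout I have in mind, with a "Return:" section.
demo_asid_alloc and its parameters are hypothetical and only loosely
modelled on the ASID allocator in this series:

```c
#include <stddef.h>

/**
 * demo_asid_alloc() - pick the first free ID from a small table
 * @used: array of flags, non-zero means the ID is already taken
 * @max: number of entries in @used
 *
 * ID 0 is treated as reserved and is never handed out.
 *
 * Return: the allocated ID in the range 1..@max-1, or 0 if no ID is free.
 */
unsigned long demo_asid_alloc(unsigned char *used, size_t max)
{
	size_t i;

	/* Start at 1 because ID 0 is reserved. */
	for (i = 1; i < max; i++) {
		if (!used[i]) {
			used[i] = 1;
			return i;
		}
	}

	return 0;
}
```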
 
> thanks,
> 
> greg k-h
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (14 preceding siblings ...)
  2019-01-23 12:27 ` [PATCH 00/15] Habana Labs kernel driver Mike Rapoport
@ 2019-01-23 21:52 ` Olof Johansson
  2019-01-23 22:40   ` Oded Gabbay
                     ` (2 more replies)
  2019-01-23 21:57 ` Dave Airlie
  16 siblings, 3 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-23 21:52 UTC (permalink / raw)
  To: Oded Gabbay, Dave Airlie
  Cc: Greg Kroah-Hartman, Linux Kernel Mailing List, ogabbay,
	Arnd Bergmann, fbarrat, andrew.donnellan

Hi,

On Tue, Jan 22, 2019 at 4:01 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> Hello,
>
> For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> Habana Labs since its inception two and a half years ago.
>
> Habana is a leading startup in the emerging AI processor space and we have
> already started production of our first Goya inference processor PCIe card
> and delivered it to customers. The Goya processor silicon has been tested
> since June of 2018 and is production-qualified by now. The Gaudi training
> processor solution is slated to sample in the second quarter of 2019.
>
> This patch-set contains the kernel driver for Habana's AI Processors
> (AIP) that are designed to accelerate Deep Learning inference and training
> workloads. The current version supports only the Goya processor and
> support for Gaudi will be upstreamed after the ASIC will be available to
> customers.
[...]

As others have mentioned, thanks for the amount of background and
information in this patch set, it's great to see.

Some have pointed out style and formatting issues, I'm not going to do
that here but I do have some higher-level comments:

 - There's a whole bunch of register definition headers. Outside of
GPUs, traditionally we don't include the full sets unless they're
needed in the driver since they tend to be very verbose.
 - I see a good amount of HW setup code that's mostly just writing
hardcoded values to a large number of registers. I don't have any
specific recommendation on how to do it better, but doing as much as
possible of this through on-device firmware tends to be a little
cleaner (or rather, hides it from the kernel. :). I don't know if that
fits your design though.
 - Are there any pointers to the userspace pieces that are used to run
on this card, or any kind of test suites that can be used when someone
has the hardware and is looking to change the driver?

But, I think the largest question I have (for a broader audience) is:

I predict that we will see a handful of these kind of devices over the
upcoming future -- definitely from ML accelerators but maybe also for
other kinds of processing, where there's a command-based, buffer-based
setup sending workloads to an offload engine and getting results back.
While the first waves will all look different due to design trade-offs
made in isolation, I think it makes sense to group them in one bucket
instead of merging them through drivers/misc, if nothing else to
encourage more cross-collaboration over time. First steps in figuring
out long-term suitable frameworks is to get a survey of a few
non-shared implementations.

So, I'd like to propose a drivers/accel drivers subtree, and I'd be
happy to bootstrap it with a small group (@Dave Airlie: I think your
input from GPU land would be very useful, want to join in?). Individual
drivers maintained by existing maintainers, of course.

I think it might make sense to move the CAPI/OpenCAPI drivers over as
well -- not necessarily to change those drivers, but to group them
with the rest as more show up.


-Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (15 preceding siblings ...)
  2019-01-23 21:52 ` Olof Johansson
@ 2019-01-23 21:57 ` Dave Airlie
  2019-01-23 22:02   ` Dave Airlie
  2019-01-25  7:37   ` Greg Kroah-Hartman
  16 siblings, 2 replies; 103+ messages in thread
From: Dave Airlie @ 2019-01-23 21:57 UTC (permalink / raw)
  To: Oded Gabbay, Jerome Glisse; +Cc: Greg Kroah-Hartman, LKML, ogabbay

On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> Hello,
>
> For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> Habana Labs since its inception two and a half years ago.

Hey Oded,

So this creates a driver with a userspace facing API via ioctls.
Although this isn't a "GPU" driver, we have a rule in the graphics
drivers area for accelerators that we don't merge userspace API without
an appropriate userspace user.

https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements

I see nothing in these accelerator drivers that makes me think we
should be treating them differently.

Having large closed userspaces that we have no insight into means we
get suboptimal, locked-forever uAPIs. If someone in the future creates
an open source userspace, we will end up in a place where they get
suboptimal behaviour because they are locked into a uAPI that we can't
change.

Dave.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 21:57 ` Dave Airlie
@ 2019-01-23 22:02   ` Dave Airlie
  2019-01-23 22:31     ` Oded Gabbay
  2019-01-25  7:37   ` Greg Kroah-Hartman
  1 sibling, 1 reply; 103+ messages in thread
From: Dave Airlie @ 2019-01-23 22:02 UTC (permalink / raw)
  To: Oded Gabbay, Jerome Glisse, Daniel Vetter
  Cc: Greg Kroah-Hartman, LKML, ogabbay

Adding Daniel as well.

Dave.

On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
>
> On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > Hello,
> >
> > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > Habana Labs since its inception two and a half years ago.
>
> Hey Oded,
>
> So this creates a driver with a userspace facing API via ioctls.
> Although this isn't a "GPU" driver, we have a rule in the graphics
> drivers area for accelerators that we don't merge userspace API without
> an appropriate userspace user.
>
> https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
>
> I see nothing in these accelerator drivers that makes me think we
> should be treating them differently.
>
> Having large closed userspaces that we have no insight into means we
> get suboptimal, locked-forever uAPIs. If someone in the future creates
> an open source userspace, we will end up in a place where they get
> suboptimal behaviour because they are locked into a uAPI that we can't
> change.
>
> Dave.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 22:02   ` Dave Airlie
@ 2019-01-23 22:31     ` Oded Gabbay
  2019-01-23 22:45       ` Dave Airlie
  0 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23 22:31 UTC (permalink / raw)
  To: Dave Airlie, Greg Kroah-Hartman
  Cc: Jerome Glisse, Daniel Vetter, LKML, ogabbay, Arnd Bergmann,
	fbarrat, andrew.donnellan, Olof Johansson

On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
>
> Adding Daniel as well.
>
> Dave.
>
> On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > Habana Labs since its inception two and a half years ago.
> >
> > Hey Oded,
> >
> > So this creates a driver with a userspace facing API via ioctls.
> > Although this isn't a "GPU" driver, we have a rule in the graphics
> > drivers area for accelerators that we don't merge userspace API without
> > an appropriate userspace user.
> >
> > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> >
> > I see nothing in these accelerator drivers that makes me think we
> > should be treating them differently.
> >
> > Having large closed userspaces that we have no insight into means we
> > get suboptimal, locked-forever uAPIs. If someone in the future creates
> > an open source userspace, we will end up in a place where they get
> > suboptimal behaviour because they are locked into a uAPI that we can't
> > change.
> >
> > Dave.

Hi Dave,
While I always appreciate your opinion and am happy to hear it, I totally
disagree with you on this point.

First of all, as you said, this device is NOT a GPU. Hence, I wasn't
aware that this rule might apply to this driver or to any other driver
outside of drm. Has this rule been applied to all the current drivers
in the kernel tree with userspace facing API via IOCTLs, which are not
in the drm subsystem?  I see the logic for GPUs as they drive the
display of the entire machine, but this is an accelerator for a
specific purpose, not something as generic as a GPU. I just don't see
how one can treat them in the same way.

Second, I talked to Greg a couple of weeks ago about this driver and I
definitely didn't get any such requirement from him. Had I gotten such
a requirement, I would have planned this differently.

Third, I think this requirement, while maybe valid for drivers that
are inside an established framework with common userspace library,
such as drm, doesn't apply to a standalone driver which is not part of
any subsystem. There is no way that "someone" will create a userspace
for our H/W without the intimate knowledge of the H/W or without the
ISA of our programmable cores. Maybe for large companies this request
is valid, but for startups complying with this request is not realistic.

To conclude, I think this approach discourages other companies from
open sourcing their drivers and is counter-productive. I'm not sure
you are aware of how difficult it is to convince startup management to
open source the code...

Thanks,
Oded

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 21:52 ` Olof Johansson
@ 2019-01-23 22:40   ` Oded Gabbay
  2019-01-23 23:16     ` Olof Johansson
  2019-01-24  1:03   ` Andrew Donnellan
  2019-02-24 22:23   ` Pavel Machek
  2 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23 22:40 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dave Airlie, Greg Kroah-Hartman, Linux Kernel Mailing List,
	ogabbay, Arnd Bergmann, fbarrat, andrew.donnellan

On Wed, Jan 23, 2019 at 11:52 PM Olof Johansson <olof@lixom.net> wrote:
>
> Hi,
>
> On Tue, Jan 22, 2019 at 4:01 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > Hello,
> >
> > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > Habana Labs since its inception two and a half years ago.
> >
> > Habana is a leading startup in the emerging AI processor space and we have
> > already started production of our first Goya inference processor PCIe card
> > and delivered it to customers. The Goya processor silicon has been tested
> > since June of 2018 and is production-qualified by now. The Gaudi training
> > processor solution is slated to sample in the second quarter of 2019.
> >
> > This patch-set contains the kernel driver for Habana's AI Processors
> > (AIP) that are designed to accelerate Deep Learning inference and training
> > workloads. The current version supports only the Goya processor and
> > support for Gaudi will be upstreamed after the ASIC will be available to
> > customers.
> [...]
>
> As others have mentioned, thanks for the amount of background and
> information in this patch set, it's great to see.
>
> Some have pointed out style and formatting issues, I'm not going to do
> that here but I do have some higher-level comments:
>
>  - There's a whole bunch of register definition headers. Outside of
> GPUs, traditionally we don't include the full sets unless they're
> needed in the driver since they tend to be very verbose.

And it is not the entire list :)
I trimmed down the files to only the files I actually use registers
from. I didn't go into those files and remove from them the
registers I don't use.
I hope this isn't a hard requirement because that's really dirty work.

>  - I see a good amount of HW setup code that's mostly just writing
> hardcoded values to a large number of registers. I don't have any
> specific recommendation on how to do it better, but doing as much as
> possible of this through on-device firmware tends to be a little
> cleaner (or rather, hides it from the kernel. :). I don't know if that
> fits your design though.

This is actually not according to our design. In our design, the host
driver is the "king" of the device and we prefer to have all
initializations which can be done from the host to be done from the
host.
I know it's not a "technical" hard reason, but on the other hand, I
don't think that's really something so terrible that it can't be done
from the driver.

>  - Are there any pointers to the userspace pieces that are used to run
> on this card, or any kind of test suites that can be used when someone
> has the hardware and is looking to change the driver?

Not right now. I do hope we can release a package with some
pre-compiled libraries and binaries that can be used to work vs. the
driver, but I don't believe it will be open-source. At least, not in
2019.

>
> But, I think the largest question I have (for a broader audience) is:
>
> I predict that we will see a handful of these kind of devices over the
> upcoming future -- definitely from ML accelerators but maybe also for
> other kinds of processing, where there's a command-based, buffer-based
> setup sending workloads to an offload engine and getting results back.
> While the first waves will all look different due to design trade-offs
> made in isolation, I think it makes sense to group them in one bucket
> instead of merging them through drivers/misc, if nothing else to
> encourage more cross-collaboration over time. First steps in figuring
> out long-term suitable frameworks is to get a survey of a few
> non-shared implementations.
>
> So, I'd like to propose a drivers/accel drivers subtree, and I'd be
> happy to bootstrap it with a small group (@Dave Airlie: I think your
> input from GPU land would be very useful, want to join in?). Individual
> drivers maintained by existing maintainers, of course.
>
> I think it might make sense to move the CAPI/OpenCAPI drivers over as
> well -- not necessarily to change those drivers, but to group them
> with the rest as more show up.

I actually prefer not going down that path, at least not from the
start. AFAIK, there is no other device driver in the kernel for AI
acceleration and I don't want to presume I know all the answers for
such devices.
You have said it yourself: there will be many devices and they won't
be similar, at least not in the next few years. So I think that trying
to set up a subsystem for this now would be a premature optimization.

Oded

>
>
> -Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 12:27 ` [PATCH 00/15] Habana Labs kernel driver Mike Rapoport
@ 2019-01-23 22:43   ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23 22:43 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> Hi,
>
> On Wed, Jan 23, 2019 at 02:00:42AM +0200, Oded Gabbay wrote:
> > Hello,
> >
> > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > Habana Labs since its inception two and a half years ago.
> >
> > Habana is a leading startup in the emerging AI processor space and we have
> > already started production of our first Goya inference processor PCIe card
> > and delivered it to customers. The Goya processor silicon has been tested
> > since June of 2018 and is production-qualified by now. The Gaudi training
> > processor solution is slated to sample in the second quarter of 2019.
> >
> > This patch-set contains the kernel driver for Habana's AI Processors
> > (AIP) that are designed to accelerate Deep Learning inference and training
> > workloads. The current version supports only the Goya processor and
> > support for Gaudi will be upstreamed after the ASIC will be available to
> > customers.
> >
> > The Goya processor has been designed from the ground up for deep learning
> > inference workloads. It comprises a cluster of eight fully programmable
> > Tensor Processing Cores (TPC). The TPC core is a VLIW SIMD vector
> > processor with ISA and hardware that was tailored to serve deep learning
> > workloads efficiently.
>
> [ ... ]
>
> > I would appreciate any feedback, question and/or review.
>
> I've looked at the patches 1,3-5 for now, it seems patch 2 still didn't
> make it to lore.kernel.org.
>
> FWIW, I think it's a good solid work, unless you spoil it in patches 6-14
> ;-)

Thanks! I will go over your comments in the next few days.
>
> As a general note, maybe drivers/misc is not the most appropriate place for
> such a complex beast. How about drivers/accelerator/ai?

I talked to Greg about it a couple of weeks ago and we both agreed
that drivers/misc is the correct place for this driver, at this point
in time.
I don't see any reason why it couldn't be moved later on, once there
are other drivers, but for now I just don't see the need to open a
new subsystem in the kernel.
Oded

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 22:31     ` Oded Gabbay
@ 2019-01-23 22:45       ` Dave Airlie
  2019-01-23 23:04         ` Olof Johansson
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Airlie @ 2019-01-23 22:45 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Greg Kroah-Hartman, Jerome Glisse, Daniel Vetter, LKML, ogabbay,
	Arnd Bergmann, fbarrat, andrew.donnellan, Olof Johansson

On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> >
> > Adding Daniel as well.
> >
> > Dave.
> >
> > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > Habana Labs since its inception two and a half years ago.
> > >
> > > Hey Oded,
> > >
> > > So this creates a driver with a userspace facing API via ioctls.
> > > > Although this isn't a "GPU" driver, we have a rule in the graphics
> > > > drivers area for accelerators that we don't merge userspace API
> > > > without an appropriate userspace user.
> > >
> > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > >
> > > > I see nothing in these accelerator drivers that makes me think we
> > > > should be treating them differently.
> > > >
> > > > Having large closed userspaces that we have no insight into means we
> > > > get suboptimal uAPIs that are locked in forever. If someone in the
> > > > future creates an open source userspace, they will get suboptimal
> > > > behaviour because they are locked into a uAPI that we can't change.
> > >
> > > Dave.
>
> Hi Dave,
> While I always appreciate your opinion and am happy to hear it, I
> totally disagree with you on this point.
>
> First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> aware that this rule might apply to this driver or to any other driver
> outside of drm. Has this rule been applied to all the current drivers
> in the kernel tree with userspace facing API via IOCTLs, which are not
> in the drm subsystem? I see the logic for GPUs as they drive the
> display of the entire machine, but this is an accelerator for a
> specific purpose, not something as generic as a GPU. I just don't see
> how one can treat them in the same way.

The rule isn't there for GPUs because we have an established library
or because GPUs are in laptops. GPUs are just where we learned the
lessons of merging things whose primary reason for being in the kernel
is to execute stuff from misc userspace stacks, where the uAPI has to
remain stable indefinitely.

a) security - without knowledge of what the accelerator can do, how can
we know if the API you expose isn't just a giant root hole?

b) uAPI stability. Without a userspace for this, there is no way for
anyone, even if in possession of the hardware, to validate that the
uAPI you provide, and are asking the kernel to commit to supporting
indefinitely, is optimal or secure. If an open source userspace
appears, is it to be limited to the API the closed userspace has
created? That limits the future unnecessarily.

> There is no way that "someone" will create a userspace
> for our H/W without the intimate knowledge of the H/W or without the
> ISA of our programmable cores. Maybe for large companies this request
> is valid, but for startups complying with this request is not realistic.

So what benefit does the Linux kernel get from having support for this
feature upstream?

If users can't access the necessary code to use it, why does this
need to be maintained in the kernel?

> To conclude, I think this approach discourages other companies from
> open sourcing their drivers and is counter-productive. I'm not sure
> you are aware of how difficult it is to convince startup management to
> open source the code...

Oh I am, but I'm even more aware of how quickly startups go away and
leave the kernel holding a lot of code we don't know how to validate
or use.

I'm open to being convinced, but I think defining new userspace
facing APIs is a task that we should take a lot more seriously going
forward to avoid mistakes of the past.

Dave.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 22:45       ` Dave Airlie
@ 2019-01-23 23:04         ` Olof Johansson
  2019-01-23 23:20           ` Jerome Glisse
  2019-01-23 23:23           ` Oded Gabbay
  0 siblings, 2 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-23 23:04 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Oded Gabbay, Greg Kroah-Hartman, Jerome Glisse, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, andrew.donnellan

On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > Adding Daniel as well.
> > >
> > > Dave.
> > >
> > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > Habana Labs since its inception two and a half years ago.
> > > >
> > > > Hey Oded,
> > > >
> > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > Although this isn't a "GPU" driver, we have a rule in the graphics
> > > > > drivers area for accelerators that we don't merge userspace API
> > > > > without an appropriate userspace user.
> > > >
> > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > >
> > > > > I see nothing in these accelerator drivers that makes me think we
> > > > > should be treating them differently.
> > > > >
> > > > > Having large closed userspaces that we have no insight into means we
> > > > > get suboptimal uAPIs that are locked in forever. If someone in the
> > > > > future creates an open source userspace, they will get suboptimal
> > > > > behaviour because they are locked into a uAPI that we can't change.
> > > >
> > > > Dave.
> >
> > Hi Dave,
> > While I always appreciate your opinion and am happy to hear it, I
> > totally disagree with you on this point.
> >
> > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > aware that this rule might apply to this driver or to any other driver
> > outside of drm. Has this rule been applied to all the current drivers
> > in the kernel tree with userspace facing API via IOCTLs, which are not
> > in the drm subsystem? I see the logic for GPUs as they drive the
> > display of the entire machine, but this is an accelerator for a
> > specific purpose, not something as generic as a GPU. I just don't see
> > how one can treat them in the same way.
>
> The logic isn't there for GPUs for those reason that we have an
> established library or that GPUs are in laptops. They are just where
> we learned the lessons of merging things whose primary reason for
> being in the kernel is to execute stuff from misc userspace stacks,
> where the uAPI has to remain stable indefinitely.
>
> a) security - without knowledge of what the accelerator can do how can
> we know if the API you expose isn't just a giant root hole?
>
> b) uAPI stability. Without a userspace for this, there is no way for
> anyone even if in possession of the hardware to validate the uAPI you
> provide and are asking the kernel to commit to supporting indefinitely
> is optimal or secure. If an open source userspace appears is it to be
> limited to API the closed userspace has created. It limits the future
> unnecessarily.
>
> > There is no way that "someone" will create a userspace
> > for our H/W without the intimate knowledge of the H/W or without the
> > ISA of our programmable cores. Maybe for large companies this request
> > is valid, but for startups complying to this request is not realistic.
>
> So what benefit does the Linux kernel get from having support for this
> feature upstream?
>
> If users can't access the necessary code to use it, why does this
> require to be maintained in the kernel.
>
> > To conclude, I think this approach discourage other companies from
> > open sourcing their drivers and is counter-productive. I'm not sure
> > you are aware of how difficult it is to convince startup management to
> > opensource the code...
>
> Oh I am, but I'm also more aware how quickly startups go away and
> leave the kernel holding a lot of code we don't know how to validate
> or use.
>
> I'm opening to being convinced but I think defining new userspace
> facing APIs is a task that we should take a lot more seriously going
> forward to avoid mistakes of the past.

I think the most important thing here is to know that things are
likely to change quite a bit over the next couple of years, and that
we don't know yet what we actually need. If we hold off picking up
support for hardware while all of this is ironed out, we'll miss out
on being exposed to it, and will have a very tall hill to climb once
we try to convince vendors to come into the fold. It's also not been a
requirement for the other two drivers we have merged, as far as I can
tell (CAPI and OpenCAPI) so the cat's already out of the bag.

I'd rather not get stuck in a stand-off where we need the long-term
solution before we can pick up the short-term contribution. That way we
can move over to a _new_ API once there's been a better chance of
finding common ground and once things settle down a bit, instead of
trying to bring some larger legacy codebase, for devices that people
might no longer care much about, over to the newer APIs.

It's better to be exposed to the HW and drivers now, than having
people build large elaborate out-of-tree software stacks for this.
It's also better to get them to come and collaborate now, instead of
pushing them away until things are perfect.

Having a way to validate and exercise the userspace API is important,
including the ability to change it if needed. Would it be possible to
open up the lowest userspace pieces (driver interactions), even if some
other layers might not yet be, to exercise the device/kernel/userspace
interfaces without a "live" workload, etc?


-Olof


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 22:40   ` Oded Gabbay
@ 2019-01-23 23:16     ` Olof Johansson
  0 siblings, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-23 23:16 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Dave Airlie, Greg Kroah-Hartman, Linux Kernel Mailing List,
	ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Wed, Jan 23, 2019 at 2:41 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Wed, Jan 23, 2019 at 11:52 PM Olof Johansson <olof@lixom.net> wrote:
> >
> > Hi,
> >
> > On Tue, Jan 22, 2019 at 4:01 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > Habana Labs since its inception two and a half years ago.
> > >
> > > Habana is a leading startup in the emerging AI processor space and we have
> > > already started production of our first Goya inference processor PCIe card
> > > and delivered it to customers. The Goya processor silicon has been tested
> > > since June of 2018 and is production-qualified by now. The Gaudi training
> > > processor solution is slated to sample in the second quarter of 2019.
> > >
> > > This patch-set contains the kernel driver for Habana's AI Processors
> > > (AIP) that are designed to accelerate Deep Learning inference and training
> > > workloads. The current version supports only the Goya processor and
> > > support for Gaudi will be upstreamed after the ASIC will be available to
> > > customers.
> > [...]
> >
> > As others have mentioned, thanks for the amount of background and
> > information in this patch set, it's great to see.
> >
> > Some have pointed out style and formatting issues, I'm not going to do
> > that here but I do have some higher-level comments:
> >
> >  - There's a whole bunch of register definition headers. Outside of
> > GPUs, traditionally we don't include the full sets unless they're
> > needed in the driver since they tend to be very verbose.
>
> And it is not the entire list :)
> I trimmed down the files to only the files I actually use registers
> from. I didn't went into those files and removed from them the
> registers I don't use.
> I hope this isn't a hard requirement because that's really a dirty work.

Yeah, it's always awkward to do this kind of cleanup. drivers/staging
was created in part for allowing a driver to go through this while
in-tree, if that helps.

> >  - I see a good amount of HW setup code that's mostly just writing
> > hardcoded values to a large number of registers. I don't have any
> > specific recommendation on how to do it better, but doing as much as
> > possible of this through on-device firmware tends to be a little
> > cleaner (or rather, hides it from the kernel. :). I don't know if that
> > fits your design though.
>
> This is actually not according to our design. In our design, the host
> driver is the "king" of the device and we prefer to have all
> initializations which can be done from the host to be done from the
> host.
> I know its not a "technical" hard reason, but on the other hand, I
> don't think that's really something so terrible that it can't be done
> from the driver.

This is why I was asking. It makes for a lot of boilerplate in the
driver, all with magic constants. They usually end up in some other
layer, but they're often constants no matter what.

> >  - Are there any pointers to the userspace pieces that are used to run
> > on this card, or any kind of test suites that can be used when someone
> > has the hardware and is looking to change the driver?
>
> Not right now. I do hope we can release a package with some
> pre-compiled libraries and binaries that can be used to work vs. the
> driver, but I don't believe it will be open-source. At least, not in
> 2019.

See my other reply: having the lowest layer of the userspace interface
open might be an approach worth exploring.

> > But, I think the largest question I have (for a broader audience) is:
> >
> > I predict that we will see a handful of these kind of devices over the
> > upcoming future -- definitely from ML accelerators but maybe also for
> > other kinds of processing, where there's a command-based, buffer-based
> > setup sending workloads to an offload engine and getting results back.
> > While the first waves will all look different due to design trade-offs
> > made in isolation, I think it makes sense to group them in one bucket
> > instead of merging them through drivers/misc, if nothing else to
> > encourage more cross-collaboration over time. First steps in figuring
> > out long-term suitable frameworks is to get a survey of a few
> > non-shared implementations.
> >
> > So, I'd like to propose a drivers/accel drivers subtree, and I'd be
> > happy to bootstrap it with a small group (@Dave Airlie: I think your
> > input from GPU land be very useful, want to join in?). Individual
> > drivers maintained by existing maintainers, of course.
> >
> > I think it might make sense to move the CAPI/OpenCAPI drivers over as
> > well -- not necessarily to change those drivers, but to group them
> > with the rest as more show up.
>
> I actually prefer not going down that path, at least not from the
> start. AFAIK, there is no other device driver in the kernel for AI
> acceleration and I don't want to presume I know all the answers for
> such devices.

I'm not saying you have to have those answers, I'm just saying let's
start grouping them now so we at least have one place to look at them,
especially since we now have more than 2.

> You have said it yourself: there will be many devices and they won't
> be similar, at least not in the next few years. So I think that trying
> to setup a subsystem for this now would be a premature optimization.

It is initially not about building the shared subsystem as much as
grouping the general type of devices together, getting someone to keep
an overall view across them, and encouraging more work between vendors
such as cross-review, etc.


-Olof


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:04         ` Olof Johansson
@ 2019-01-23 23:20           ` Jerome Glisse
  2019-01-23 23:35             ` Oded Gabbay
  2019-01-23 23:40             ` Olof Johansson
  2019-01-23 23:23           ` Oded Gabbay
  1 sibling, 2 replies; 103+ messages in thread
From: Jerome Glisse @ 2019-01-23 23:20 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dave Airlie, Oded Gabbay, Greg Kroah-Hartman, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, andrew.donnellan

On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > Adding Daniel as well.
> > > >
> > > > Dave.
> > > >
> > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > >
> > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > Habana Labs since its inception two and a half years ago.
> > > > >
> > > > > Hey Oded,
> > > > >
> > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > appropriate userspace user.
> > > > >
> > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > >
> > > > > I see nothing in these accelerator drivers that make me think we
> > > > > should be treating them different.
> > > > >
> > > > > Having large closed userspaces that we have no insight into means we
> > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > an open source userspace, we will end up in a place where they get
> > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > change.
> > > > >
> > > > > Dave.
> > >
> > > Hi Dave,
> > > While I always appreciate your opinion and happy to hear it, I totally
> > > disagree with you on this point.
> > >
> > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > aware that this rule might apply to this driver or to any other driver
> > > outside of drm. Has this rule been applied to all the current drivers
> > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > display of the entire machine, but this is an accelerator for a
> > > specific purpose, not something generic as GPU. I just don't see how
> > > one can treat them in the same way.
> >
> > The logic isn't there for GPUs for those reason that we have an
> > established library or that GPUs are in laptops. They are just where
> > we learned the lessons of merging things whose primary reason for
> > being in the kernel is to execute stuff from misc userspace stacks,
> > where the uAPI has to remain stable indefinitely.
> >
> > a) security - without knowledge of what the accelerator can do how can
> > we know if the API you expose isn't just a giant root hole?
> >
> > b) uAPI stability. Without a userspace for this, there is no way for
> > anyone even if in possession of the hardware to validate the uAPI you
> > provide and are asking the kernel to commit to supporting indefinitely
> > is optimal or secure. If an open source userspace appears is it to be
> > limited to API the closed userspace has created. It limits the future
> > unnecessarily.
> >
> > > There is no way that "someone" will create a userspace
> > > for our H/W without the intimate knowledge of the H/W or without the
> > > ISA of our programmable cores. Maybe for large companies this request
> > > is valid, but for startups complying to this request is not realistic.
> >
> > So what benefit does the Linux kernel get from having support for this
> > feature upstream?
> >
> > If users can't access the necessary code to use it, why does this
> > require to be maintained in the kernel.
> >
> > > To conclude, I think this approach discourage other companies from
> > > open sourcing their drivers and is counter-productive. I'm not sure
> > > you are aware of how difficult it is to convince startup management to
> > > opensource the code...
> >
> > Oh I am, but I'm also more aware how quickly startups go away and
> > leave the kernel holding a lot of code we don't know how to validate
> > or use.
> >
> > I'm opening to being convinced but I think defining new userspace
> > facing APIs is a task that we should take a lot more seriously going
> > forward to avoid mistakes of the past.
> 
> I think the most important thing here is to know that things are
> likely to change quite a bit over the next couple of years, and that
> we don't know yet what we actually need. If we hold off picking up
> support for hardware while all of this is ironed out, we'll miss out
> on being exposed to it, and will have a very tall hill to climb once
> we try to convince vendors to come into the fold. It's also not been a
> requirement for the other two drivers we have merged, as far as I can
> tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> 
> I'd rather not get stuck in a stand-off needing the longterm solution
> to pick up the short term contribution. That way we can move over to a
> _new_ API once there's been a better chance of finding common grounds
> and once things settle down a bit, instead of trying to bring some
> larger legacy codebase for devices that people might no longer care
> much about over to the newer APIs.
> 
> It's better to be exposed to the HW and drivers now, than having
> people build large elaborate out-of-tree software stacks for this.
> It's also better to get them to come and collaborate now, instead of
> pushing them away until things are perfect.
> 
> Having a way to validate and exercise the userspace API is important,
> including ability to change it if needed. Would it be possible to open
> up the lowest userspace pieces (driver interactions), even if some
> other layers might not yet be, to exercise the device/kernel/userspace
> interfaces without "live" workload, etc?

Yes, and to exercise the userspace API you need at the very least to
know the ISA so that you can write programs for the accelerator.
You also need to know the set of commands the hardware has. The
ioctls and how to create a userspace that interacts with the kernel
are the easy part; the hard part is the compiler.

So if we want any kind of freedom to play with the UAPI, enhance
it or change it in any way, we must be free to build programs for the
device ourselves.

I believe that the GPU subsystem requirements are a good guideline
to follow, and the only exception within drivers/ that I am aware of
is the FPGA one. Everything else in drivers/ either has an open source
userspace, exposes a common API (like networking), or is so simple that
anyone can write a userspace for it.

For any complex device that executes programs we should really enforce
the open source userspace requirement so that we can properly audit the
driver, as otherwise we only have half of the story with no idea what
the other half might imply.

Cheers,
Jérôme




* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:04         ` Olof Johansson
  2019-01-23 23:20           ` Jerome Glisse
@ 2019-01-23 23:23           ` Oded Gabbay
  1 sibling, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23 23:23 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dave Airlie, Greg Kroah-Hartman, Jerome Glisse, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, andrew.donnellan

On Thu, Jan 24, 2019 at 1:04 AM Olof Johansson <olof@lixom.net> wrote:
>
> On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > Adding Daniel as well.
> > > >
> > > > Dave.
> > > >
> > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > >
> > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > Habana Labs since its inception two and a half years ago.
> > > > >
> > > > > Hey Oded,
> > > > >
> > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > appropriate userspace user.
> > > > >
> > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > >
> > > > > I see nothing in these accelerator drivers that make me think we
> > > > > should be treating them different.
> > > > >
> > > > > Having large closed userspaces that we have no insight into means we
> > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > an open source userspace, we will end up in a place where they get
> > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > change.
> > > > >
> > > > > Dave.
> > >
> > > Hi Dave,
> > > While I always appreciate your opinion and happy to hear it, I totally
> > > disagree with you on this point.
> > >
> > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > aware that this rule might apply to this driver or to any other driver
> > > outside of drm. Has this rule been applied to all the current drivers
> > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > display of the entire machine, but this is an accelerator for a
> > > specific purpose, not something generic as GPU. I just don't see how
> > > one can treat them in the same way.
> >
> > The logic isn't there for GPUs for those reason that we have an
> > established library or that GPUs are in laptops. They are just where
> > we learned the lessons of merging things whose primary reason for
> > being in the kernel is to execute stuff from misc userspace stacks,
> > where the uAPI has to remain stable indefinitely.
> >
> > a) security - without knowledge of what the accelerator can do how can
> > we know if the API you expose isn't just a giant root hole?

I'm willing to explain the security mechanisms we have in our device
and how the driver initializes them in order to protect the host and
to isolate users from each other.
I've done a LOT of work on that during the design of the ASIC and I
believe we have a good story there. If you want, we can go over that
code and explain the architecture in more detail.

> >
> > b) uAPI stability. Without a userspace for this, there is no way for
> > anyone even if in possession of the hardware to validate the uAPI you
> > provide and are asking the kernel to commit to supporting indefinitely
> > is optimal or secure. If an open source userspace appears is it to be
> > limited to API the closed userspace has created. It limits the future
> > unnecessarily.
I understand what you are saying and I think (but I can't guarantee it
yet) that I may be able to provide some minimal userspace to make sure
this interface won't break.

> >
> > > There is no way that "someone" will create a userspace
> > > for our H/W without the intimate knowledge of the H/W or without the
> > > ISA of our programmable cores. Maybe for large companies this request
> > > is valid, but for startups complying to this request is not realistic.
> >
> > So what benefit does the Linux kernel get from having support for this
> > feature upstream?
> >
> > If users can't access the necessary code to use it, why does this
> > require to be maintained in the kernel.
> >
> > > To conclude, I think this approach discourage other companies from
> > > open sourcing their drivers and is counter-productive. I'm not sure
> > > you are aware of how difficult it is to convince startup management to
> > > opensource the code...
> >
> > Oh I am, but I'm also more aware how quickly startups go away and
> > leave the kernel holding a lot of code we don't know how to validate
> > or use.
> >
> > I'm opening to being convinced but I think defining new userspace
> > facing APIs is a task that we should take a lot more seriously going
> > forward to avoid mistakes of the past.
>
> I think the most important thing here is to know that things are
> likely to change quite a bit over the next couple of years, and that
> we don't know yet what we actually need. If we hold off picking up
> support for hardware while all of this is ironed out, we'll miss out
> on being exposed to it, and will have a very tall hill to climb once
> we try to convince vendors to come into the fold. It's also not been a
> requirement for the other two drivers we have merged, as far as I can
> tell (CAPI and OpenCAPI) so the cat's already out of the bag.
>
> I'd rather not get stuck in a stand-off needing the longterm solution
> to pick up the short term contribution. That way we can move over to a
> _new_ API once there's been a better chance of finding common grounds
> and once things settle down a bit, instead of trying to bring some
> larger legacy codebase for devices that people might no longer care
> much about over to the newer APIs.
>
> It's better to be exposed to the HW and drivers now, than having
> people build large elaborate out-of-tree software stacks for this.
> It's also better to get them to come and collaborate now, instead of
> pushing them away until things are perfect.
>
> Having a way to validate and exercise the userspace API is important,
> including ability to change it if needed. Would it be possible to open
> up the lowest userspace pieces (driver interactions), even if some
> other layers might not yet be, to exercise the device/kernel/userspace
> interfaces without "live" workload, etc?
>
>
> -Olof

As I wrote above, I do think I could provide a very low-level userspace
piece that will contain the IOCTL-facing functions and perhaps some
additional test code that will show how to run some simple tests on
the hardware, to give users the ability to do minimal liveness
checking.
I do want to ask that it won't block the upstream process because it
could take some time to provide that as I would need to split our
userspace code.
I will also need to clear that internally first, but I don't see a
reason why I won't be able to do that.

But before I start doing that internally, I need to have some kind of
assurance I won't do that for nothing, i.e., that providing this
library's code will be enough to satisfy this specific requirement (I'm
not talking about other stuff, of course, that might show up in the
review).

Thanks,
Oded


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:20           ` Jerome Glisse
@ 2019-01-23 23:35             ` Oded Gabbay
  2019-01-23 23:41               ` Olof Johansson
  2019-01-23 23:40             ` Olof Johansson
  1 sibling, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-23 23:35 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Olof Johansson, Dave Airlie, Greg Kroah-Hartman, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, andrew.donnellan

On Thu, Jan 24, 2019 at 1:20 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > >
> > > > > Adding Daniel as well.
> > > > >
> > > > > Dave.
> > > > >
> > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > >
> > > > > > Hey Oded,
> > > > > >
> > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > > appropriate userspace user.
> > > > > >
> > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > >
> > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > should be treating them different.
> > > > > >
> > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > > an open source userspace, we will end up in a place where they get
> > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > change.
> > > > > >
> > > > > > Dave.
> > > >
> > > > Hi Dave,
> > > > While I always appreciate your opinion and happy to hear it, I totally
> > > > disagree with you on this point.
> > > >
> > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > aware that this rule might apply to this driver or to any other driver
> > > > outside of drm. Has this rule been applied to all the current drivers
> > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > > display of the entire machine, but this is an accelerator for a
> > > > specific purpose, not something generic as GPU. I just don't see how
> > > > one can treat them in the same way.
> > >
> > > The logic isn't there for GPUs for those reason that we have an
> > > established library or that GPUs are in laptops. They are just where
> > > we learned the lessons of merging things whose primary reason for
> > > being in the kernel is to execute stuff from misc userspace stacks,
> > > where the uAPI has to remain stable indefinitely.
> > >
> > > a) security - without knowledge of what the accelerator can do how can
> > > we know if the API you expose isn't just a giant root hole?
> > >
> > > b) uAPI stability. Without a userspace for this, there is no way for
> > > anyone even if in possession of the hardware to validate the uAPI you
> > > provide and are asking the kernel to commit to supporting indefinitely
> > > is optimal or secure. If an open source userspace appears is it to be
> > > limited to API the closed userspace has created. It limits the future
> > > unnecessarily.
> > >
> > > > There is no way that "someone" will create a userspace
> > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > ISA of our programmable cores. Maybe for large companies this request
> > > > is valid, but for startups complying to this request is not realistic.
> > >
> > > So what benefit does the Linux kernel get from having support for this
> > > feature upstream?
> > >
> > > If users can't access the necessary code to use it, why does this
> > > require to be maintained in the kernel.
> > >
> > > > To conclude, I think this approach discourage other companies from
> > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > you are aware of how difficult it is to convince startup management to
> > > > opensource the code...
> > >
> > > Oh I am, but I'm also more aware how quickly startups go away and
> > > leave the kernel holding a lot of code we don't know how to validate
> > > or use.
> > >
> > > I'm opening to being convinced but I think defining new userspace
> > > facing APIs is a task that we should take a lot more seriously going
> > > forward to avoid mistakes of the past.
> >
> > I think the most important thing here is to know that things are
> > likely to change quite a bit over the next couple of years, and that
> > we don't know yet what we actually need. If we hold off picking up
> > support for hardware while all of this is ironed out, we'll miss out
> > on being exposed to it, and will have a very tall hill to climb once
> > we try to convince vendors to come into the fold. It's also not been a
> > requirement for the other two drivers we have merged, as far as I can
> > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> >
> > I'd rather not get stuck in a stand-off needing the longterm solution
> > to pick up the short term contribution. That way we can move over to a
> > _new_ API once there's been a better chance of finding common grounds
> > and once things settle down a bit, instead of trying to bring some
> > larger legacy codebase for devices that people might no longer care
> > much about over to the newer APIs.
> >
> > It's better to be exposed to the HW and drivers now, than having
> > people build large elaborate out-of-tree software stacks for this.
> > It's also better to get them to come and collaborate now, instead of
> > pushing them away until things are perfect.
> >
> > Having a way to validate and exercise the userspace API is important,
> > including ability to change it if needed. Would it be possible to open
> > up the lowest userspace pieces (driver interactions), even if some
> > other layers might not yet be, to exercise the device/kernel/userspace
> > interfaces without "live" workload, etc?
>
> Yes and to exercise the userspace API you need at very least to
> know the ISA so that you can write program for the accelerator.
> You also need to know the set of commands the hardware has. The
> ioctl and how to create a userspace that interact with the kernel
> is the easy part, the hard part is the compiler.

So actually in my case in order to exercise the IOCTL API, you can
give "work" to the device that will not trigger the compute parts, but
only the different queues and the DMA engines.
I think that is enough to validate that the IOCTLs won't break.
All the "commands" that you can give to the queue logic (QMAN) are
exposed in one of the files in the driver (goya_packets.h).

I want to stress this - To validate the IOCTLs, it is enough to do DMA
work. You will use ALL the 5 IOCTLs to do just that - give work to the
DMA engines.

And as I wrote to Dave, I can explain my security architecture in
detail and how I make sure the programmable cores can't be used by
malicious users.

And nice to hear from you Jerome :) I missed you!

Thanks,
Oded

>
> So if we want any kind of freedom to play with the UAPI, enhance
> it or change it in anyway we must be free to build program for the
> device ourself.
>
> I believe that the GPU sub-system requirement are a good guideline
> to follow and the only exception with drivers/ that i am aware of
> is the fpga. Everything else in driver as either an open source
> userspace, expose a common API (like network) or is so simple that
> anyone can write a userspace for it.
>
>
> For any complex device that execute program we should really enforce
> the open source userspace so that we can properly audit the driver
> as otherwise we only have half of the story with no idea what the
> other half might implies.
>
> Cheers,
> Jérôme
>
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:20           ` Jerome Glisse
  2019-01-23 23:35             ` Oded Gabbay
@ 2019-01-23 23:40             ` Olof Johansson
  2019-01-23 23:48               ` Jerome Glisse
  1 sibling, 1 reply; 103+ messages in thread
From: Olof Johansson @ 2019-01-23 23:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dave Airlie, Oded Gabbay, Greg Kroah-Hartman, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Wed, Jan 23, 2019 at 3:20 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > >
> > > > > Adding Daniel as well.
> > > > >
> > > > > Dave.
> > > > >
> > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > >
> > > > > > Hey Oded,
> > > > > >
> > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > > appropriate userspace user.
> > > > > >
> > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > >
> > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > should be treating them different.
> > > > > >
> > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > > an open source userspace, we will end up in a place where they get
> > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > change.
> > > > > >
> > > > > > Dave.
> > > >
> > > > Hi Dave,
> > > > While I always appreciate your opinion and happy to hear it, I totally
> > > > disagree with you on this point.
> > > >
> > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > aware that this rule might apply to this driver or to any other driver
> > > > outside of drm. Has this rule been applied to all the current drivers
> > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > > display of the entire machine, but this is an accelerator for a
> > > > specific purpose, not something generic as GPU. I just don't see how
> > > > one can treat them in the same way.
> > >
> > > The logic isn't there for GPUs for those reason that we have an
> > > established library or that GPUs are in laptops. They are just where
> > > we learned the lessons of merging things whose primary reason for
> > > being in the kernel is to execute stuff from misc userspace stacks,
> > > where the uAPI has to remain stable indefinitely.
> > >
> > > a) security - without knowledge of what the accelerator can do how can
> > > we know if the API you expose isn't just a giant root hole?
> > >
> > > b) uAPI stability. Without a userspace for this, there is no way for
> > > anyone even if in possession of the hardware to validate the uAPI you
> > > provide and are asking the kernel to commit to supporting indefinitely
> > > is optimal or secure. If an open source userspace appears is it to be
> > > limited to API the closed userspace has created. It limits the future
> > > unnecessarily.
> > >
> > > > There is no way that "someone" will create a userspace
> > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > ISA of our programmable cores. Maybe for large companies this request
> > > > is valid, but for startups complying to this request is not realistic.
> > >
> > > So what benefit does the Linux kernel get from having support for this
> > > feature upstream?
> > >
> > > If users can't access the necessary code to use it, why does this
> > > require to be maintained in the kernel.
> > >
> > > > To conclude, I think this approach discourage other companies from
> > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > you are aware of how difficult it is to convince startup management to
> > > > opensource the code...
> > >
> > > Oh I am, but I'm also more aware how quickly startups go away and
> > > leave the kernel holding a lot of code we don't know how to validate
> > > or use.
> > >
> > > I'm opening to being convinced but I think defining new userspace
> > > facing APIs is a task that we should take a lot more seriously going
> > > forward to avoid mistakes of the past.
> >
> > I think the most important thing here is to know that things are
> > likely to change quite a bit over the next couple of years, and that
> > we don't know yet what we actually need. If we hold off picking up
> > support for hardware while all of this is ironed out, we'll miss out
> > on being exposed to it, and will have a very tall hill to climb once
> > we try to convince vendors to come into the fold. It's also not been a
> > requirement for the other two drivers we have merged, as far as I can
> > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> >
> > I'd rather not get stuck in a stand-off needing the longterm solution
> > to pick up the short term contribution. That way we can move over to a
> > _new_ API once there's been a better chance of finding common grounds
> > and once things settle down a bit, instead of trying to bring some
> > larger legacy codebase for devices that people might no longer care
> > much about over to the newer APIs.
> >
> > It's better to be exposed to the HW and drivers now, than having
> > people build large elaborate out-of-tree software stacks for this.
> > It's also better to get them to come and collaborate now, instead of
> > pushing them away until things are perfect.
> >
> > Having a way to validate and exercise the userspace API is important,
> > including ability to change it if needed. Would it be possible to open
> > up the lowest userspace pieces (driver interactions), even if some
> > other layers might not yet be, to exercise the device/kernel/userspace
> > interfaces without "live" workload, etc?
>
> Yes and to exercise the userspace API you need at very least to
> know the ISA so that you can write program for the accelerator.
> You also need to know the set of commands the hardware has. The
> ioctl and how to create a userspace that interact with the kernel
> is the easy part, the hard part is the compiler.
>
> So if we want any kind of freedom to play with the UAPI, enhance
> it or change it in anyway we must be free to build program for the
> device ourself.
>
> I believe that the GPU sub-system requirement are a good guideline
> to follow and the only exception with drivers/ that i am aware of
> is the fpga. Everything else in driver as either an open source
> userspace, expose a common API (like network) or is so simple that
> anyone can write a userspace for it.

Once we have a common framework I agree that we need enough tools to
exercise everything needed. I don't agree that this includes full
sources to everything. We don't expect this for most PCIe cards today
either.

If the GPU subsystem is to be followed, I fear that we will end up
with Nvidia-equivalent vendors from day 1, where they will just build
a bigger and bigger software stack on the side instead of joining in,
and someone will need to best-effort bridge the gap by reverse
engineering. I don't want that situation long-term, which is why I
think it's reasonable to be more relaxed during the early days with
upfront, clear, expectations for the longer term that hardware/kernel
interfaces need to be exercisable.

> For any complex device that execute program we should really enforce
> the open source userspace so that we can properly audit the driver
> as otherwise we only have half of the story with no idea what the
> other half might implies.

What you're demanding is open userspace _and_ firmware. Since without
firmware sources, you can't audit any on-chip behavior either (in
reality, most commands passed down are likely parsed by said
firmware).


-Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:35             ` Oded Gabbay
@ 2019-01-23 23:41               ` Olof Johansson
  0 siblings, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-23 23:41 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Jerome Glisse, Dave Airlie, Greg Kroah-Hartman, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Wed, Jan 23, 2019 at 3:35 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Thu, Jan 24, 2019 at 1:20 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > > >
> > > > > > Adding Daniel as well.
> > > > > >
> > > > > > Dave.
> > > > > >
> > > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > > >
> > > > > > > Hey Oded,
> > > > > > >
> > > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > > > appropriate userspace user.
> > > > > > >
> > > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > > >
> > > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > > should be treating them different.
> > > > > > >
> > > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > > > an open source userspace, we will end up in a place where they get
> > > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > > change.
> > > > > > >
> > > > > > > Dave.
> > > > >
> > > > > Hi Dave,
> > > > > While I always appreciate your opinion and happy to hear it, I totally
> > > > > disagree with you on this point.
> > > > >
> > > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > > aware that this rule might apply to this driver or to any other driver
> > > > > outside of drm. Has this rule been applied to all the current drivers
> > > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > > > display of the entire machine, but this is an accelerator for a
> > > > > specific purpose, not something generic as GPU. I just don't see how
> > > > > one can treat them in the same way.
> > > >
> > > > The logic isn't there for GPUs for those reason that we have an
> > > > established library or that GPUs are in laptops. They are just where
> > > > we learned the lessons of merging things whose primary reason for
> > > > being in the kernel is to execute stuff from misc userspace stacks,
> > > > where the uAPI has to remain stable indefinitely.
> > > >
> > > > a) security - without knowledge of what the accelerator can do how can
> > > > we know if the API you expose isn't just a giant root hole?
> > > >
> > > > b) uAPI stability. Without a userspace for this, there is no way for
> > > > anyone even if in possession of the hardware to validate the uAPI you
> > > > provide and are asking the kernel to commit to supporting indefinitely
> > > > is optimal or secure. If an open source userspace appears is it to be
> > > > limited to API the closed userspace has created. It limits the future
> > > > unnecessarily.
> > > >
> > > > > There is no way that "someone" will create a userspace
> > > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > > ISA of our programmable cores. Maybe for large companies this request
> > > > > is valid, but for startups complying to this request is not realistic.
> > > >
> > > > So what benefit does the Linux kernel get from having support for this
> > > > feature upstream?
> > > >
> > > > If users can't access the necessary code to use it, why does this
> > > > require to be maintained in the kernel.
> > > >
> > > > > To conclude, I think this approach discourage other companies from
> > > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > > you are aware of how difficult it is to convince startup management to
> > > > > opensource the code...
> > > >
> > > > Oh I am, but I'm also more aware how quickly startups go away and
> > > > leave the kernel holding a lot of code we don't know how to validate
> > > > or use.
> > > >
> > > > I'm opening to being convinced but I think defining new userspace
> > > > facing APIs is a task that we should take a lot more seriously going
> > > > forward to avoid mistakes of the past.
> > >
> > > I think the most important thing here is to know that things are
> > > likely to change quite a bit over the next couple of years, and that
> > > we don't know yet what we actually need. If we hold off picking up
> > > support for hardware while all of this is ironed out, we'll miss out
> > > on being exposed to it, and will have a very tall hill to climb once
> > > we try to convince vendors to come into the fold. It's also not been a
> > > requirement for the other two drivers we have merged, as far as I can
> > > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> > >
> > > I'd rather not get stuck in a stand-off needing the longterm solution
> > > to pick up the short term contribution. That way we can move over to a
> > > _new_ API once there's been a better chance of finding common grounds
> > > and once things settle down a bit, instead of trying to bring some
> > > larger legacy codebase for devices that people might no longer care
> > > much about over to the newer APIs.
> > >
> > > It's better to be exposed to the HW and drivers now, than having
> > > people build large elaborate out-of-tree software stacks for this.
> > > It's also better to get them to come and collaborate now, instead of
> > > pushing them away until things are perfect.
> > >
> > > Having a way to validate and exercise the userspace API is important,
> > > including ability to change it if needed. Would it be possible to open
> > > up the lowest userspace pieces (driver interactions), even if some
> > > other layers might not yet be, to exercise the device/kernel/userspace
> > > interfaces without "live" workload, etc?
> >
> > Yes and to exercise the userspace API you need at very least to
> > know the ISA so that you can write program for the accelerator.
> > You also need to know the set of commands the hardware has. The
> > ioctl and how to create a userspace that interact with the kernel
> > is the easy part, the hard part is the compiler.
>
> So actually in my case in order to exercise the IOCTL API, you can
> give "work" to the device that will not trigger the compute parts, but
> only the different queues and the DMA engines.
> I think that is enough to validate that the IOCTLs won't break.
> All the "commands" that you can give to the queue logic (QMAN) is
> exposed in one of the files in the driver (goya_packets.h).
>
> I want to stress this - To validate the IOCTLs, it is enough to do DMA
> work. You will use ALL the 5 IOCTLs to do just that - give work to the
> DMA engines.

I personally think this is a reasonable trade-off, given that you have
a communication layer between. For hardware that doesn't have that,
and where device behavior and data movement depends on execution on
the compute parts, more would need to be open.


-Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:40             ` Olof Johansson
@ 2019-01-23 23:48               ` Jerome Glisse
  2019-01-24  7:35                 ` Daniel Vetter
  0 siblings, 1 reply; 103+ messages in thread
From: Jerome Glisse @ 2019-01-23 23:48 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dave Airlie, Oded Gabbay, Greg Kroah-Hartman, Daniel Vetter,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Wed, Jan 23, 2019 at 03:40:25PM -0800, Olof Johansson wrote:
> On Wed, Jan 23, 2019 at 3:20 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > > >
> > > > > > Adding Daniel as well.
> > > > > >
> > > > > > Dave.
> > > > > >
> > > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > > >
> > > > > > > Hey Oded,
> > > > > > >
> > > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > > > appropriate userspace user.
> > > > > > >
> > > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > > >
> > > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > > should be treating them different.
> > > > > > >
> > > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > > > an open source userspace, we will end up in a place where they get
> > > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > > change.
> > > > > > >
> > > > > > > Dave.
> > > > >
> > > > > Hi Dave,
> > > > > While I always appreciate your opinion and happy to hear it, I totally
> > > > > disagree with you on this point.
> > > > >
> > > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > > aware that this rule might apply to this driver or to any other driver
> > > > > outside of drm. Has this rule been applied to all the current drivers
> > > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > > > display of the entire machine, but this is an accelerator for a
> > > > > specific purpose, not something generic as GPU. I just don't see how
> > > > > one can treat them in the same way.
> > > >
> > > > The logic isn't there for GPUs for those reason that we have an
> > > > established library or that GPUs are in laptops. They are just where
> > > > we learned the lessons of merging things whose primary reason for
> > > > being in the kernel is to execute stuff from misc userspace stacks,
> > > > where the uAPI has to remain stable indefinitely.
> > > >
> > > > a) security - without knowledge of what the accelerator can do how can
> > > > we know if the API you expose isn't just a giant root hole?
> > > >
> > > > b) uAPI stability. Without a userspace for this, there is no way for
> > > > anyone even if in possession of the hardware to validate the uAPI you
> > > > provide and are asking the kernel to commit to supporting indefinitely
> > > > is optimal or secure. If an open source userspace appears is it to be
> > > > limited to API the closed userspace has created. It limits the future
> > > > unnecessarily.
> > > >
> > > > > There is no way that "someone" will create a userspace
> > > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > > ISA of our programmable cores. Maybe for large companies this request
> > > > > is valid, but for startups complying to this request is not realistic.
> > > >
> > > > So what benefit does the Linux kernel get from having support for this
> > > > feature upstream?
> > > >
> > > > If users can't access the necessary code to use it, why does this
> > > > require to be maintained in the kernel.
> > > >
> > > > > To conclude, I think this approach discourage other companies from
> > > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > > you are aware of how difficult it is to convince startup management to
> > > > > opensource the code...
> > > >
> > > > Oh I am, but I'm also more aware how quickly startups go away and
> > > > leave the kernel holding a lot of code we don't know how to validate
> > > > or use.
> > > >
> > > > I'm opening to being convinced but I think defining new userspace
> > > > facing APIs is a task that we should take a lot more seriously going
> > > > forward to avoid mistakes of the past.
> > >
> > > I think the most important thing here is to know that things are
> > > likely to change quite a bit over the next couple of years, and that
> > > we don't know yet what we actually need. If we hold off picking up
> > > support for hardware while all of this is ironed out, we'll miss out
> > > on being exposed to it, and will have a very tall hill to climb once
> > > we try to convince vendors to come into the fold. It's also not been a
> > > requirement for the other two drivers we have merged, as far as I can
> > > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> > >
> > > I'd rather not get stuck in a stand-off needing the longterm solution
> > > to pick up the short term contribution. That way we can move over to a
> > > _new_ API once there's been a better chance of finding common grounds
> > > and once things settle down a bit, instead of trying to bring some
> > > larger legacy codebase for devices that people might no longer care
> > > much about over to the newer APIs.
> > >
> > > It's better to be exposed to the HW and drivers now, than having
> > > people build large elaborate out-of-tree software stacks for this.
> > > It's also better to get them to come and collaborate now, instead of
> > > pushing them away until things are perfect.
> > >
> > > Having a way to validate and exercise the userspace API is important,
> > > including ability to change it if needed. Would it be possible to open
> > > up the lowest userspace pieces (driver interactions), even if some
> > > other layers might not yet be, to exercise the device/kernel/userspace
> > > interfaces without "live" workload, etc?
> >
> > Yes and to exercise the userspace API you need at very least to
> > know the ISA so that you can write program for the accelerator.
> > You also need to know the set of commands the hardware has. The
> > ioctl and how to create a userspace that interact with the kernel
> > is the easy part, the hard part is the compiler.
> >
> > So if we want any kind of freedom to play with the UAPI, enhance
> > it or change it in anyway we must be free to build program for the
> > device ourself.
> >
> > I believe that the GPU sub-system requirement are a good guideline
> > to follow and the only exception with drivers/ that i am aware of
> > is the fpga. Everything else in driver as either an open source
> > userspace, expose a common API (like network) or is so simple that
> > anyone can write a userspace for it.
> 
> Once we have a common framework I agree that we need enough tools to
> exercise everything needed. I don't agree that this includes full
> sources to everything. We don't expect this for most PCIe cards today
> either.

We do expect this today, except for FPGA. I do not know of a single
PCIe device with an upstream driver that we do not know how to program.
The biggest chunk of PCIe devices are straightforward (network, sound,
media, ...).

So in effect, today the lowest common denominator is either an open
source userspace or a device API so simple that the userspace is
obvious (various media devices).

> 
> If the GPU subsystem is to be followed, I fear that we will end up
> with Nvidia-equivalent vendors from day 1, where they will just build
> a bigger and bigger software stack on the side instead of joining in,
> and someone will need to best-effort bridge the gap by reverse
> engineering. I don't want that situation long-term, which is why I
> think it's reasonable to be more relaxed during the early days with
> upfront, clear, expectations for the longer term that hardware/kernel
> interfaces need to be exercisable.

I think it is the other way around: allow people to push an upstream
driver with no open source userspace and they lose any motivation to
work on open sourcing their userspace. Not being upstream is painful
enough that they will get pressure to go upstream, and if upstream
means open source userspace then they have to comply.

> 
> > For any complex device that execute program we should really enforce
> > the open source userspace so that we can properly audit the driver
> > as otherwise we only have half of the story with no idea what the
> > other half might implies.
> 
> What you're demanding is open userspace _and_ firmware. Since without
> firmware sources, you can't audit any on-chip behavior either (in
> reality, most commands passed down are likely parsed by said
> firmware).

No, I am not asking for firmware. If we have any doubt about what the
firmware can let through, then we can lock down the ioctl, i.e. parse
commands from userspace and only allow the kernel to write sanitized
commands to the command queue. By auditing here I mean being able to
understand the overall flow that is expected from a program, so that
from that program flow we can work out the best UAPI, with minimum
overhead, to achieve that flow most efficiently. Sorry if that was not
clear.
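
To make the "lock down the ioctl" idea concrete, here is a rough sketch
of what parsing and sanitizing a command stream could look like. This
is only an illustration: the packet layout, the opcode names and
sanitize_cmds() are all made up for this example and are not taken from
goya_packets.h or from any real driver. It also assumes the buffer has
already been copied out of userspace, so it cannot change between the
check and the copy.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical packet format (an assumption for this sketch only):
 * one 32-bit header word, opcode in bits 31:24 and payload length
 * (in words) in bits 15:0, followed by that many payload words.
 */
enum pkt_opcode {
	PKT_NOP       = 0,
	PKT_DMA_COPY  = 1,
	PKT_FENCE     = 2,
	PKT_WRITE_REG = 3,	/* privileged: never accepted from userspace */
};

#define PKT_OPCODE(hdr)	((hdr) >> 24)
#define PKT_LEN(hdr)	((hdr) & 0xffffu)

/*
 * Validate a command stream and, only if every packet is acceptable,
 * copy it into the kernel-owned queue.
 *
 * Returns the number of words written to the queue, or -1 if the
 * stream is rejected (in which case the queue is left untouched).
 */
long sanitize_cmds(const uint32_t *user, size_t n_user,
		   uint32_t *queue, size_t n_queue)
{
	size_t i = 0;

	if (n_user > n_queue)
		return -1;

	/* Pass 1: walk the packets and check each one against the whitelist. */
	while (i < n_user) {
		uint32_t hdr = user[i];
		size_t len = PKT_LEN(hdr);

		switch (PKT_OPCODE(hdr)) {
		case PKT_NOP:
		case PKT_DMA_COPY:
		case PKT_FENCE:
			break;
		default:
			return -1;	/* unknown or privileged opcode */
		}

		if (len > n_user - i - 1)
			return -1;	/* payload runs past the end of the buffer */

		i += 1 + len;
	}

	/* Pass 2: the whole stream is clean, so hand it to the queue. */
	for (i = 0; i < n_user; i++)
		queue[i] = user[i];

	return (long)n_user;
}
```

The point of validating the whole stream before copying anything is
that a rejected submission never leaves a partial command sequence in
the queue. A real parser would also have to check the payloads
themselves (addresses, sizes), which is exactly the part that needs
knowledge of the hardware.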

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 21:52 ` Olof Johansson
  2019-01-23 22:40   ` Oded Gabbay
@ 2019-01-24  1:03   ` Andrew Donnellan
  2019-01-24 11:59     ` Jonathan Cameron
  2019-01-25 17:13     ` Olof Johansson
  2019-02-24 22:23   ` Pavel Machek
  2 siblings, 2 replies; 103+ messages in thread
From: Andrew Donnellan @ 2019-01-24  1:03 UTC (permalink / raw)
  To: Olof Johansson, Oded Gabbay, Dave Airlie
  Cc: Greg Kroah-Hartman, Linux Kernel Mailing List, ogabbay,
	Arnd Bergmann, fbarrat, linux-accelerators

On 24/1/19 8:52 am, Olof Johansson wrote:
> But, I think the largest question I have (for a broader audience) is:
> 
> I predict that we will see a handful of these kind of devices over the
> upcoming future -- definitely from ML accelerators but maybe also for
> other kinds of processing, where there's a command-based, buffer-based
> setup sending workloads to an offload engine and getting results back.
> While the first waves will all look different due to design trade-offs
> made in isolation, I think it makes sense to group them in one bucket
> instead of merging them through drivers/misc, if nothing else to
> encourage more cross-collaboration over time. First steps in figuring
> out long-term suitable frameworks is to get a survey of a few
> non-shared implementations.
> 
> So, I'd like to propose a drivers/accel drivers subtree, and I'd be
> happy to bootstrap it with a small group (@Dave Airlie: I think your
> input from GPU land be very useful, want to join in?). Individual
> drivers maintained by existing maintainers, of course.
> 
> I think it might make sense to move the CAPI/OpenCAPI drivers over as
> well -- not necessarily to change those drivers, but to group them
> with the rest as more show up.

For cxl/ocxl, I have no objection to moving to this new subtree if 
that's what we all agree to do. (what do people do about UAPI headers in 
this situation? keep them where they are in misc/?)

If we do go ahead and set up this new subtree, perhaps we can use the 
mailing list I set up at linux-accelerators@lists.ozlabs.org but we 
haven't really started using...

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 23:48               ` Jerome Glisse
@ 2019-01-24  7:35                 ` Daniel Vetter
  2019-01-24  9:50                   ` Oded Gabbay
  2019-01-24 23:51                   ` Olof Johansson
  0 siblings, 2 replies; 103+ messages in thread
From: Daniel Vetter @ 2019-01-24  7:35 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Olof Johansson, Dave Airlie, Oded Gabbay, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

Hi all,

Top post, because new argument.

There's lots of really good technical arguments for having the
userspace component of a driver stack that spans both kernel and
userspace open too. For me, that's not really the important argument.

I care about open source, I'm not interested in blobs (beyond that
they're useful for reverse engineering). I think the upstream
community should care about open source, and by and large it very much
does: We haven't merged ndiswrapper, or the nvidia shim, or anything
like that to make running blobs in the kernel easier. And at least in
the case of the one traditional driver subsystem where 90% of the
driver lives in userspace, we also care about that part being open.

Anything else is imo just a long-term disservice to the community of
customers, other vendors, ... Adapting a famous quote: If you're ok
with throwing away some long term software freedom for a bit of short
term hardware support you'll get neither.

So if someone proposes to merge some open source kernel driver that
requires piles of closed source userspace to be any use at all, I'm
just not interested. And if the fpga folks have merged fpga drivers
without at least a basic (non-optimizing) RTL compiler, then that was
a grave mistake. That doing this is also technically a bad idea (for
all the reasons already discussed) is just the icing on the top for
me.

And to tie this back to the technical discussion, here's a scenario
that's bound to happen:
1. vendor crams their open source driver into upstream, with full blob userspace
2. vendor gets bored (runs low on money, accidentally fired the entire
old team, needs to do more value add, whatever, ...) rewrites the
entire stack
3. vendor crams their new&completely incompatible open source stack
into upstream
4. upstream is now involuntarily stuck maintaining 2 drivers for the
exact same thing, and we can't fix anything of that because if you
touch one side of the stack without understanding the other part
you're guaranteed to create regressions (yes this is how this works
with gpu drivers, we've learned this the hard way)
5. repeat

Hence for these technical reasons you'll then end up with a subsystem
that only the vendor can touch, and hence also the vendor can abandon
at will. Not like drivers/gpu, where customers, consulting shops,
students, ... routinely can&do add new features to existing drivers.

This is not a winning move.

Cheers, Daniel

On Thu, Jan 24, 2019 at 12:48 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Jan 23, 2019 at 03:40:25PM -0800, Olof Johansson wrote:
> > On Wed, Jan 23, 2019 at 3:20 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > > > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> > > > >
> > > > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > > > >
> > > > > > > Adding Daniel as well.
> > > > > > >
> > > > > > > Dave.
> > > > > > >
> > > > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > > > >
> > > > > > > > Hey Oded,
> > > > > > > >
> > > > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > > > > appropriate userspace user.
> > > > > > > >
> > > > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > > > >
> > > > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > > > should be treating them different.
> > > > > > > >
> > > > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > > > > an open source userspace, we will end up in a place where they get
> > > > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > > > change.
> > > > > > > >
> > > > > > > > Dave.
> > > > > >
> > > > > > Hi Dave,
> > > > > > While I always appreciate your opinion and happy to hear it, I totally
> > > > > > disagree with you on this point.
> > > > > >
> > > > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > > > aware that this rule might apply to this driver or to any other driver
> > > > > > outside of drm. Has this rule been applied to all the current drivers
> > > > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > > > > display of the entire machine, but this is an accelerator for a
> > > > > > specific purpose, not something generic as GPU. I just don't see how
> > > > > > one can treat them in the same way.
> > > > >
> > > > > The logic isn't there for GPUs for those reason that we have an
> > > > > established library or that GPUs are in laptops. They are just where
> > > > > we learned the lessons of merging things whose primary reason for
> > > > > being in the kernel is to execute stuff from misc userspace stacks,
> > > > > where the uAPI has to remain stable indefinitely.
> > > > >
> > > > > a) security - without knowledge of what the accelerator can do how can
> > > > > we know if the API you expose isn't just a giant root hole?
> > > > >
> > > > > b) uAPI stability. Without a userspace for this, there is no way for
> > > > > anyone even if in possession of the hardware to validate the uAPI you
> > > > > provide and are asking the kernel to commit to supporting indefinitely
> > > > > is optimal or secure. If an open source userspace appears is it to be
> > > > > limited to API the closed userspace has created. It limits the future
> > > > > unnecessarily.
> > > > >
> > > > > > There is no way that "someone" will create a userspace
> > > > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > > > ISA of our programmable cores. Maybe for large companies this request
> > > > > > is valid, but for startups complying to this request is not realistic.
> > > > >
> > > > > So what benefit does the Linux kernel get from having support for this
> > > > > feature upstream?
> > > > >
> > > > > If users can't access the necessary code to use it, why does this
> > > > > require to be maintained in the kernel.
> > > > >
> > > > > > To conclude, I think this approach discourage other companies from
> > > > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > > > you are aware of how difficult it is to convince startup management to
> > > > > > opensource the code...
> > > > >
> > > > > Oh I am, but I'm also more aware how quickly startups go away and
> > > > > leave the kernel holding a lot of code we don't know how to validate
> > > > > or use.
> > > > >
> > > > > I'm opening to being convinced but I think defining new userspace
> > > > > facing APIs is a task that we should take a lot more seriously going
> > > > > forward to avoid mistakes of the past.
> > > >
> > > > I think the most important thing here is to know that things are
> > > > likely to change quite a bit over the next couple of years, and that
> > > > we don't know yet what we actually need. If we hold off picking up
> > > > support for hardware while all of this is ironed out, we'll miss out
> > > > on being exposed to it, and will have a very tall hill to climb once
> > > > we try to convince vendors to come into the fold. It's also not been a
> > > > requirement for the other two drivers we have merged, as far as I can
> > > > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> > > >
> > > > I'd rather not get stuck in a stand-off needing the longterm solution
> > > > to pick up the short term contribution. That way we can move over to a
> > > > _new_ API once there's been a better chance of finding common grounds
> > > > and once things settle down a bit, instead of trying to bring some
> > > > larger legacy codebase for devices that people might no longer care
> > > > much about over to the newer APIs.
> > > >
> > > > It's better to be exposed to the HW and drivers now, than having
> > > > people build large elaborate out-of-tree software stacks for this.
> > > > It's also better to get them to come and collaborate now, instead of
> > > > pushing them away until things are perfect.
> > > >
> > > > Having a way to validate and exercise the userspace API is important,
> > > > including ability to change it if needed. Would it be possible to open
> > > > up the lowest userspace pieces (driver interactions), even if some
> > > > other layers might not yet be, to exercise the device/kernel/userspace
> > > > interfaces without "live" workload, etc?
> > >
> > > Yes and to exercise the userspace API you need at very least to
> > > know the ISA so that you can write program for the accelerator.
> > > You also need to know the set of commands the hardware has. The
> > > ioctl and how to create a userspace that interact with the kernel
> > > is the easy part, the hard part is the compiler.
> > >
> > > So if we want any kind of freedom to play with the UAPI, enhance
> > > it or change it in anyway we must be free to build program for the
> > > device ourself.
> > >
> > > I believe that the GPU sub-system requirement are a good guideline
> > > to follow and the only exception with drivers/ that i am aware of
> > > is the fpga. Everything else in driver as either an open source
> > > userspace, expose a common API (like network) or is so simple that
> > > anyone can write a userspace for it.
> >
> > Once we have a common framework I agree that we need enough tools to
> > exercise everything needed. I don't agree that this includes full
> > sources to everything. We don't expect this for most PCIe cards today
> > either.
>
> We do expected this today except for FPGA, i do not know any single
> pcie device with upstream driver that we do not know how to program.
> Biggest chunk of PCIE devices are straightforward (network, sound,
> media, ...).
>
> So in effect today the lowest common denominator is open source user
> space or device API is so simple that user space is obvious (various
> media device).
>
> >
> > If the GPU subsystem is to be followed, I fear that we will end up
> > with Nvidia-equivalent vendors from day 1, where they will just build
> > a bigger and bigger software stack on the side instead of joining in,
> > and someone will need to best-effort bridge the gap by reverse
> > engineering. I don't want that situation long-term, which is why I
> > think it's reasonable to be more relaxed during the early days with
> > upfront, clear, expectations for the longer term that hardware/kernel
> > interfaces need to be exercisable.
>
> I think the other way around, allowing people to push upstream driver
> with no open source user space and people loose any motivation to
> work on open sourcing their userspace. Not being upstream is painful
> enough that they will get pressure to go upstream and if upstream
> means open source userspace then they have to comply.
>
> >
> > > For any complex device that execute program we should really enforce
> > > the open source userspace so that we can properly audit the driver
> > > as otherwise we only have half of the story with no idea what the
> > > other half might implies.
> >
> > What you're demanding is open userspace _and_ firmware. Since without
> > firmware sources, you can't audit any on-chip behavior either (in
> > reality, most commands passed down are likely parsed by said
> > firmware).
>
> No i do not ask for firmware. If we have any doubt about what the firm-
> ware can let through then we can lock down the ioctl ie parse commands
> from userspace and only allow kernel to write sanitize command to
> command queue. By auditing here i mean being able to understand the
> overall flow that is expected from program so from that program flow
> we can work on what is the best UAPI with minimum overhead to achieve
> that program flow the most efficiently. Sorry if that was not clear.
>
> Cheers,
> Jérôme



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-24  7:35                 ` Daniel Vetter
@ 2019-01-24  9:50                   ` Oded Gabbay
  2019-01-24 10:22                     ` Dave Airlie
  2019-01-24 23:51                   ` Olof Johansson
  1 sibling, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-24  9:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jerome Glisse, Olof Johansson, Dave Airlie, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

Hi Daniel and Jerome,
I know I won't be able to convince you but I want to say that I think
your arguments for full userspace open source are not really
technical.

IMHO, an open-source, thin runtime that provides code to operate ALL
the uAPI the driver exports + commitment to only using this library as
the interface to the driver is good enough for the issue of uAPI
breakage. And that thing I can provide.

If at a later time Habana decides to throw it all away (not going
to happen while I'm here), then it's Habana's problem, not the kernel's.
In that case, I would argue that the kernel shouldn't accept a new
driver from Habana. But as long as we keep API compatibility, I don't
see any harm.

I'm not convinced by your request to open the ISA of the programmable
cores. How is that relevant to the kernel driver? I don't even do
anything with those cores. The uAPI I export isn't related whatsoever
to those cores.

I honestly think that if your position is accepted by the Linux kernel
community, companies building AI accelerators won't go near the
kernel. They will simply go down the path Nvidia has gone, which is to
have an out-of-tree kernel driver + big closed userspace. That's a
totally sustainable path from a business POV (see Nvidia's dominance in
GPU and deep learning). I don't see how that will serve any of us. I
would think we want to lure companies into open sourcing their code,
and you don't do that by saying: "if you don't open your entire code
base, don't bother to open any part of it".

Thanks,
Oded

On Thu, Jan 24, 2019 at 9:36 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Hi all,
>
> Top post, because new argument.
>
> There's lots of really good technical arguments for having the
> userspace component of a driver stack that spans both kernel and
> userspace open too. For me, that's not really the important argument.
>
> I care about open source, I'm not interested in blobs (beyond that
> they're useful for reverse engineering). I think the upstream
> community should care about open source, and by and large it very much
> does: We haven't merged ndiswrapper, or the nvidia shim, or anything
> like that to make running blobs in the kernel easier. And at least in
> the case of the one traditional driver subsystem where 90% of the
> driver lives in userspace, we also care about that part being open.
>
> Anything else is imo just a long-term disservice to the community of
> customers, other vendors, ... Adapting a famous quote: If you're ok
> with throwing away some long term software freedom for a bit of short
> term hardware support you'll get neither.
>
> So if someone proposes to merge some open source kernel driver that
> requires piles of closed source userspace to be any use at all, I'm
> just not interested. And if the fpga folks have merged fpga drivers
> without at least a basic (non-optimizing) RTL compiler, then that was
> a grave mistake. That doing this is also technically a bad idea (for
> all the reasons already discussed) is just the icing on the top for
> me.
>
> And to tie this back to the technical discussion, here's a scenario
> that's bound to happen:
> 1. vendor crams their open source driver into upstream, with full blob userspace
> 2. vendor gets bored (runs low on money, accidentally fired the entire
> old team, needs to do more value add, whatever, ...) rewrites the
> entire stack
> 3. vendor crams their new&completely incompatible open source stack
> into upstream
> 4. upstream is now involuntarily stuck maintaining 2 drivers for the
> exact same thing, and we can't fix anything of that because if you
> touch one side of the stack without understanding the other part
> you're guaranteed to create regressions (yes this is how this works
> with gpu drivers, we've learned this the hard way)
> 5. repeat
>
> Hence for these technical reasons you'll then end up with a subsystem
> that only the vendor can touch, and hence also the vendor can abandon
> at will. Not like drivers/gpu, where customers, consulting shops,
> students, ... routinely can&do add new features to existing drivers.
>
> This is not a winning move.
>
> Cheers, Daniel
>
> On Thu, Jan 24, 2019 at 12:48 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 03:40:25PM -0800, Olof Johansson wrote:
> > > On Wed, Jan 23, 2019 at 3:20 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > > > > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Adding Daniel as well.
> > > > > > > >
> > > > > > > > Dave.
> > > > > > > >
> > > > > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > > > > >
> > > > > > > > > Hey Oded,
> > > > > > > > >
> > > > > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > > > > Although this isn't a "GPU" driver we have a rule in the graphics
> > > > > > > > > drivers are for accelerators that we don't merge userspace API with an
> > > > > > > > > appropriate userspace user.
> > > > > > > > >
> > > > > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > > > > >
> > > > > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > > > > should be treating them different.
> > > > > > > > >
> > > > > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > > > > get suboptimal locked for ever uAPIs. If someone in the future creates
> > > > > > > > > an open source userspace, we will end up in a place where they get
> > > > > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > > > > change.
> > > > > > > > >
> > > > > > > > > Dave.
> > > > > > >
> > > > > > > Hi Dave,
> > > > > > > While I always appreciate your opinion and happy to hear it, I totally
> > > > > > > disagree with you on this point.
> > > > > > >
> > > > > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > > > > aware that this rule might apply to this driver or to any other driver
> > > > > > > outside of drm. Has this rule been applied to all the current drivers
> > > > > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > > > > in the drm subsystem ?  I see the logic for GPUs as they drive the
> > > > > > > display of the entire machine, but this is an accelerator for a
> > > > > > > specific purpose, not something generic as GPU. I just don't see how
> > > > > > > one can treat them in the same way.
> > > > > >
> > > > > > The logic isn't there for GPUs for those reason that we have an
> > > > > > established library or that GPUs are in laptops. They are just where
> > > > > > we learned the lessons of merging things whose primary reason for
> > > > > > being in the kernel is to execute stuff from misc userspace stacks,
> > > > > > where the uAPI has to remain stable indefinitely.
> > > > > >
> > > > > > a) security - without knowledge of what the accelerator can do how can
> > > > > > we know if the API you expose isn't just a giant root hole?
> > > > > >
> > > > > > b) uAPI stability. Without a userspace for this, there is no way for
> > > > > > anyone even if in possession of the hardware to validate the uAPI you
> > > > > > provide and are asking the kernel to commit to supporting indefinitely
> > > > > > is optimal or secure. If an open source userspace appears is it to be
> > > > > > limited to API the closed userspace has created. It limits the future
> > > > > > unnecessarily.
> > > > > >
> > > > > > > There is no way that "someone" will create a userspace
> > > > > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > > > > ISA of our programmable cores. Maybe for large companies this request
> > > > > > > is valid, but for startups complying to this request is not realistic.
> > > > > >
> > > > > > So what benefit does the Linux kernel get from having support for this
> > > > > > feature upstream?
> > > > > >
> > > > > > If users can't access the necessary code to use it, why does this
> > > > > > require to be maintained in the kernel.
> > > > > >
> > > > > > > To conclude, I think this approach discourage other companies from
> > > > > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > > > > you are aware of how difficult it is to convince startup management to
> > > > > > > opensource the code...
> > > > > >
> > > > > > Oh I am, but I'm also more aware how quickly startups go away and
> > > > > > leave the kernel holding a lot of code we don't know how to validate
> > > > > > or use.
> > > > > >
> > > > > > I'm opening to being convinced but I think defining new userspace
> > > > > > facing APIs is a task that we should take a lot more seriously going
> > > > > > forward to avoid mistakes of the past.
> > > > >
> > > > > I think the most important thing here is to know that things are
> > > > > likely to change quite a bit over the next couple of years, and that
> > > > > we don't know yet what we actually need. If we hold off picking up
> > > > > support for hardware while all of this is ironed out, we'll miss out
> > > > > on being exposed to it, and will have a very tall hill to climb once
> > > > > we try to convince vendors to come into the fold. It's also not been a
> > > > > requirement for the other two drivers we have merged, as far as I can
> > > > > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> > > > >
> > > > > I'd rather not get stuck in a stand-off needing the longterm solution
> > > > > to pick up the short term contribution. That way we can move over to a
> > > > > _new_ API once there's been a better chance of finding common grounds
> > > > > and once things settle down a bit, instead of trying to bring some
> > > > > larger legacy codebase for devices that people might no longer care
> > > > > much about over to the newer APIs.
> > > > >
> > > > > It's better to be exposed to the HW and drivers now, than having
> > > > > people build large elaborate out-of-tree software stacks for this.
> > > > > It's also better to get them to come and collaborate now, instead of
> > > > > pushing them away until things are perfect.
> > > > >
> > > > > Having a way to validate and exercise the userspace API is important,
> > > > > including ability to change it if needed. Would it be possible to open
> > > > > up the lowest userspace pieces (driver interactions), even if some
> > > > > other layers might not yet be, to exercise the device/kernel/userspace
> > > > > interfaces without "live" workload, etc?
> > > >
> > > > Yes and to exercise the userspace API you need at very least to
> > > > know the ISA so that you can write program for the accelerator.
> > > > You also need to know the set of commands the hardware has. The
> > > > ioctl and how to create a userspace that interact with the kernel
> > > > is the easy part, the hard part is the compiler.
> > > >
> > > > So if we want any kind of freedom to play with the UAPI, enhance
> > > > it or change it in anyway we must be free to build program for the
> > > > device ourself.
> > > >
> > > > I believe that the GPU sub-system requirement are a good guideline
> > > > to follow and the only exception with drivers/ that i am aware of
> > > > is the fpga. Everything else in driver as either an open source
> > > > userspace, expose a common API (like network) or is so simple that
> > > > anyone can write a userspace for it.
> > >
> > > Once we have a common framework I agree that we need enough tools to
> > > exercise everything needed. I don't agree that this includes full
> > > sources to everything. We don't expect this for most PCIe cards today
> > > either.
> >
> > We do expected this today except for FPGA, i do not know any single
> > pcie device with upstream driver that we do not know how to program.
> > Biggest chunk of PCIE devices are straightforward (network, sound,
> > media, ...).
> >
> > So in effect today the lowest common denominator is open source user
> > space or device API is so simple that user space is obvious (various
> > media device).
> >
> > >
> > > If the GPU subsystem is to be followed, I fear that we will end up
> > > with Nvidia-equivalent vendors from day 1, where they will just build
> > > a bigger and bigger software stack on the side instead of joining in,
> > > and someone will need to best-effort bridge the gap by reverse
> > > engineering. I don't want that situation long-term, which is why I
> > > think it's reasonable to be more relaxed during the early days with
> > > upfront, clear, expectations for the longer term that hardware/kernel
> > > interfaces need to be exercisable.
> >
> > I think the other way around, allowing people to push upstream driver
> > with no open source user space and people loose any motivation to
> > work on open sourcing their userspace. Not being upstream is painful
> > enough that they will get pressure to go upstream and if upstream
> > means open source userspace then they have to comply.
> >
> > >
> > > > For any complex device that execute program we should really enforce
> > > > the open source userspace so that we can properly audit the driver
> > > > as otherwise we only have half of the story with no idea what the
> > > > other half might implies.
> > >
> > > What you're demanding is open userspace _and_ firmware. Since without
> > > firmware sources, you can't audit any on-chip behavior either (in
> > > reality, most commands passed down are likely parsed by said
> > > firmware).
> >
> > No i do not ask for firmware. If we have any doubt about what the firm-
> > ware can let through then we can lock down the ioctl ie parse commands
> > from userspace and only allow kernel to write sanitize command to
> > command queue. By auditing here i mean being able to understand the
> > overall flow that is expected from program so from that program flow
> > we can work on what is the best UAPI with minimum overhead to achieve
> > that program flow the most efficiently. Sorry if that was not clear.
> >
> > Cheers,
> > Jérôme
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-24  9:50                   ` Oded Gabbay
@ 2019-01-24 10:22                     ` Dave Airlie
  2019-01-25  0:13                       ` Olof Johansson
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Airlie @ 2019-01-24 10:22 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Daniel Vetter, Jerome Glisse, Olof Johansson, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

> I know I won't be able to convince you but I want to say that I think
> your arguments for full userspace open source are not really
> technical.

There is more to keeping a kernel going than technical arguments, unfortunately.

I guess the question for Greg, Olof, etc. is: do we care about Linux the
kernel, or Linux the open source ecosystem? If the former, these sorts
of accelerator shim drivers are fine: useless to anyone who doesn't
have all the magic hidden userspace, and impossible to support for
anyone else. If the latter, we should leave the cost of maintenance to
the company benefiting from it and leave maintaining it out of tree.

A simple question: if I plug your accelerator into Power or ARM64,
where do I get the port of your userspace to use it?

I'm not the final arbiter on this sort of thing, but I'm definitely
going to make sure that anyone who lands this code is explicit in
ignoring any experience we've had in this area and in the future will
gladly accept "I told you so" :-)

Dave.


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-24  1:03   ` Andrew Donnellan
@ 2019-01-24 11:59     ` Jonathan Cameron
  2019-01-25 17:13     ` Olof Johansson
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2019-01-24 11:59 UTC (permalink / raw)
  To: Andrew Donnellan
  Cc: Olof Johansson, Oded Gabbay, Dave Airlie, Arnd Bergmann, ogabbay,
	Greg Kroah-Hartman, Linux Kernel Mailing List, fbarrat,
	linux-accelerators

On Thu, 24 Jan 2019 12:03:06 +1100
Andrew Donnellan <andrew.donnellan@au1.ibm.com> wrote:

> On 24/1/19 8:52 am, Olof Johansson wrote:
> > But, I think the largest question I have (for a broader audience) is:
> > 
> > I predict that we will see a handful of these kind of devices over the
> > upcoming future -- definitely from ML accelerators but maybe also for
> > other kinds of processing, where there's a command-based, buffer-based
> > setup sending workloads to an offload engine and getting results back.
> > While the first waves will all look different due to design trade-offs
> > made in isolation, I think it makes sense to group them in one bucket
> > instead of merging them through drivers/misc, if nothing else to
> > encourage more cross-collaboration over time. First steps in figuring
> > out long-term suitable frameworks is to get a survey of a few
> > non-shared implementations.
> > 
> > So, I'd like to propose a drivers/accel drivers subtree, and I'd be
> > happy to bootstrap it with a small group (@Dave Airlie: I think your
> > input from GPU land would be very useful, want to join in?). Individual
> > drivers maintained by existing maintainers, of course.
> > 
> > I think it might make sense to move the CAPI/OpenCAPI drivers over as
> > well -- not necessarily to change those drivers, but to group them
> > with the rest as more show up.  
> 
> For cxl/ocxl, I have no objection to moving to this new subtree if 
> that's what we all agree to do. (what do people do about UAPI headers in 
> this situation? keep them where they are in misc/?)
> 
> If we do go ahead and set up this new subtree, perhaps we can use the 
> mailing list I set up at linux-accelerators@lists.ozlabs.org but we 
> haven't really started using...
> 
Assuming the consensus falls behind this...

I'll push this for the CCIX drivers as well as those start to turn up.

This particular driver had passed me by until this email so great
to get the heads up via that list!

Sounds like a good plan in general to me.

Jonathan




* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-24  7:35                 ` Daniel Vetter
  2019-01-24  9:50                   ` Oded Gabbay
@ 2019-01-24 23:51                   ` Olof Johansson
  1 sibling, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-24 23:51 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jerome Glisse, Dave Airlie, Oded Gabbay, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

Hi,

On Wed, Jan 23, 2019 at 11:36 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Hi all,
>
> Top post, because new argument.

I'm diving in and replying to this instead of other replies upthread,
since I think it brings up the core of the disagreement.

> There's lots of really good technical arguments for having the
> userspace component of a driver stack that spans both kernel and
> userspace open too. For me, that's not really the important argument.
>
> I care about open source, I'm not interested in blobs (beyond that
> they're useful for reverse engineering). I think the upstream
> community should care about open source, and by and large it very much
> does: We haven't merged ndiswrapper, or the nvidia shim, or anything
> like that to make running blobs in the kernel easier. And at least in
> the case of the one traditional driver subsystem where 90% of the
> driver lives in userspace, we also care about that part being open.

Nobody is talking about merging kernel blobs. I think we're all in
agreement that it's absolutely out of the question.

Traditionally, nearly all hardware has had closed firmware as well,
and if anything affects how we are tied down on making kernel-level
changes, this is a big one. What makes userspace different from that
perspective? Why do we have that double standard?

The question is if we're looking to alienate vendors and create a
whole new set of Nvidia-style driver stacks that will grow and grow,
or if we're willing to discuss with them and get them involved now, to
a point where we can come up with a reasonable,
standardized/extensible interface between upper levels of device FW,
through kernel and into low-level userspace. Getting them to separate
out the low-level portions of their software stacks to something that
is open is a medium-term good compromise in this direction (ideally
they might end up sharing this layer too, but that's not on me to
decide). Most of these pieces of hardware work in similar manners; a
stream of commands with data, and a stream of
completions/results/output data.

I'm incredibly impressed by how much of the graphics stack is open,
and how much of it has been reverse engineered for the closed
platforms. But if we have a chance to do it differently here, and in
particular avoid the long cycle of alienating the vendors and
encouraging them to build out-of-tree elaborate stacks for later
reverse engineering and catch-up, I would really like to.

There's currently one large benefit between these drivers and the
graphics space as far as I know; nobody's trying to do unified drivers
between Linux and other OS:es, so the whole "we need a messy shim
layer and a universal driver" situation should be avoidable (and to be
clear, we would not accept such drivers no matter what).

> Anything else is imo just a long-term dis-service to the community of
> customers, other vendors, ... Adapting a famous quote: If you're ok
> with throwing away some long term software freedom for a bit of short
> term hardware support you'll get neither.

The argument here is not "short term hardware support", since that's
not what we're adding (since you need more than the kernel pieces for
that). What we're able to do is collaborate instead of having all
these vendors work out-of-tree on their own with absolutely no
discussions with us at all, and nowhere to share their work without
setting up some new organization (with all the overhead from that). I
think getting people to collaborate in-tree is the best shot we have
at success.

> So if someone proposes to merge some open source kernel driver that
> requires piles of closed source userspace to be any use at all, I'm
> just not interested. And if the fpga folks have merged fpga drivers
> without at least a basic (non-optimizing) RTL compiler, then that was
> a grave mistake. That doing this is also technically a bad idea (for
> all the reasons already discussed) is just the icing on the top for
> me.
>
> And to tie this back to the technical discussion, here's a scenario
> that's bound to happen:
> 1. vendor crams their open source driver into upstream, with full blob userspace
> 2. vendor gets bored (runs low on money, accidentally fired the entire
> old team, needs to do more value add, whatever, ...) rewrites the
> entire stack
> 3. vendor crams their new&completely incompatible open source stack
> into upstream
> 4. upstream is now involuntarily stuck maintaining 2 drivers for the
> exact same thing, and we can't fix anything of that because if you
> touch one side of the stack without understanding the other part
> you're guaranteed to create regressions (yes this is how this works
> with gpu drivers, we've learned this the hard way)
> 5. repeat

This can be avoided, in that we would not allow second completely
separate stacks. We should have a transition point where we don't
allow one-off weird custom drivers in the future, but we don't know
what the shared implementation will look like yet.

We have precedent from the wifi space, where we pushed back and got
vendors to move towards shared interfaces.

> Hence for these technical reasons you'll then end up with a subsystem
> that only the vendor can touch, and hence also the vendor can abandon
> at will. Not like drivers/gpu, where customers, consulting shops,
> students, ... routinely can&do add new features to existing drivers.
>
> This is not a winning move.

It depends on what the goal is. Complete software freedom? I agree,
this might not get us much closer to that (but also not further). And
if that's the goal, we should refuse to merge any driver that doesn't
have open device firmware as well. Why would we have double standards
in this area? Why are we allowing libusb to implement proprietary
userspace drivers?



So, let's loop back to the technical arguments instead.

What we want from a technical goal is to avoid broad proliferation of
completely separate out-of-tree software stacks, and get people to
collaborate and benefit from each other's work in ways that we can
still change things over time where we need to from the kernel side.
Is anyone disagreeing with that (technical) goal?

Unless there's disagreement on the goal, where the views differ is on
how to get there -- whether we are better off pretending that this
hardware doesn't exist, and try to come up with some elaborate shared
framework that nobody is using yet, with the hopes that vendors will
move over from their proprietary stack once they've already been
successful in shipping that. Or whether we're better off getting them
engaged with us, picking up their drivers for the early hardware and
we all get exposure to the stacks and keep communication channels open
with clear understanding that we expect this engagement to shift over
time.

Since we're starting fresh here, we can set our own expectations
upfront: No second implementations unless they're onto a shared
framework, and we can even preserve the right to remove hardware
support (treat it as staging drivers) if a vendor disengages and goes
away, or if promises in other areas are broken (such as open low-level
userspace).


-Olof


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-24 10:22                     ` Dave Airlie
@ 2019-01-25  0:13                       ` Olof Johansson
  2019-01-25  7:43                         ` Daniel Vetter
  0 siblings, 1 reply; 103+ messages in thread
From: Olof Johansson @ 2019-01-25  0:13 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Oded Gabbay, Daniel Vetter, Jerome Glisse, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Thu, Jan 24, 2019 at 2:23 AM Dave Airlie <airlied@gmail.com> wrote:
>
> > I know I won't be able to convince you but I want to say that I think
> > your arguments for full userspace open source are not really
> > technical.
>
> There is more to keeping a kernel going than technical argument unfortunately.
>
> I guess the question for Greg, Olof etc, is do we care about Linux the
> kernel, or Linux the open source ecosystem, if the former, these sort
> of accelerator shim drivers are fine, useless to anyone who doesn't
> have all the magic hidden userspace, and impossible to support for
> anyone else, if the latter, we should leave the cost of maintenance to
> the company benefiting from it and leave maintaining it out of tree.

As mentioned in my reply to Daniel, I think we've got a history of
being pragmatic and finding reasonable trade-offs of what can be open
and what can be closed. For example, if we truly care about open source
ecosystem, drivers that require closed firmware should also be
refused.

> Simple question like If I plug your accelerator into Power or ARM64,
> where do I get the port of your userspace to use it?

Does demanding complete open userspace get us closer to that goal in
reality? By refusing to work with people to enable their hardware,
they will still ship their platforms out of tree, using DKMS and all
the other ways of getting kernel modules installed to talk to the
hardware. And we'd be no closer.

In the end, they'd open up their userspace when there's business
reasons to do so. It's well-known how to work around refusal from us
to merge drivers by now, so it's not much leverage in that area.

> I'm not the final arbiter on this sort of thing, but I'm definitely
> going to make sure that anyone who lands this code is explicit in
> ignoring any experience we've had in this area and in the future will
> gladly accept "I told you so" :-)

There's only one final arbiter on any inclusion of code in the kernel,
but we tend to sort out most disagreements without going all the way
there.

I still think engaging has a better chance of success than rejecting
the contributions, especially with clear expectations w.r.t. continued
engagement and no second implementations over time. In all honesty,
either approach might fail miserably.


-Olof


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 21:57 ` Dave Airlie
  2019-01-23 22:02   ` Dave Airlie
@ 2019-01-25  7:37   ` Greg Kroah-Hartman
  2019-01-25 15:33     ` Olof Johansson
  1 sibling, 1 reply; 103+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-25  7:37 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Oded Gabbay, Jerome Glisse, LKML, ogabbay

On Thu, Jan 24, 2019 at 07:57:11AM +1000, Dave Airlie wrote:
> On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > Hello,
> >
> > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > Habana Labs since its inception two and a half years ago.
> 
> Hey Oded,
> 
> So this creates a driver with a userspace facing API via ioctls.
> Although this isn't a "GPU" driver, we have a rule in the graphics
> drivers area for accelerators that we don't merge userspace API without
> an appropriate userspace user.
> 
> https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> 
> I see nothing in these accelerator drivers that make me think we
> should be treating them different.

I understand that this is your position on when you accept drivers into
the DRM layer, as you need to interact with common interfaces and a
massive userspace stack at the same time.  And that's wonderful, it
allows you to be able to move both sides of that stack forward without
removing support for devices that worked on older kernels.

But, that's not really the case with this new driver at all.  We add new
driver subsystems, and individual drivers, with loads of new ioctls, in
every new kernel release.  We don't impose on all of them the "your
userspace code must be open" rule, so why is this new driver somehow
different from them?

Yes, there is the fun legal issue of "derivative works" when talking
about a userspace program that is written to only interact with a
specific kernel driver using a custom api like this one has, and how the
license of the kernel side (GPLv2) affects the userspace side
(whatever), but that is something that I leave up to the lawyers who
like discussing and enforcing such things.

When evaluating this driver (note, I saw it for a few revisions before
Oded posted it here), all I did was try to make sure that it fit in
properly with the kernel apis and methods of operations.  Given that
there are no in-kernel drivers for this type of device, and that it
really is a pretty small shim layer around the hardware, which means
that userspace does a lot of the heavy lifting, it is going to be a
very hardware-specific user/kernel api, and that shows.

Sidenote, this could have almost just been a UIO driver, which would
have put _ALL_ of the logic in userspace.  At least this way we have a
chance for the kernel code to be sane and not try to inflict problems on
the rest of the system.

Going forward, it would be wonderful if we could come up with a unified
api for how to interact with these types of hardware accelerators, but
given the newness of this industry, and the vastly different ways people
are trying to solve the problem, that is going to take a few attempts,
and many years before we can get there.  Until then, taking drivers like
this into the kernel tree makes sense as that way all of our users will
be able to use that hardware, and better yet, the rest of us can learn
more about how this stuff works so that we can help out with that api
generation when it happens.

So for now, I have no objection to taking this type of driver into the
tree.  Yes, it would be wonderful if we had an open userspace program to
drive it so that we could actually test and make changes to the api over
time, but I think that is something that the submitting company needs to
realize will be better for them to do, as for right now, all of that
testing and changes are their responsibility.

As for what directory the code should live in, I suggested "misc" as
there was no other universal location, and I hate to see new subsystems
be created with only one driver, as that's pretty sad.  But it's just a
name/location, I have no dog in the fight, so I really don't care where
it ends up in the tree, just as long as it gets merged somewhere :)

thanks,

greg k-h


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25  0:13                       ` Olof Johansson
@ 2019-01-25  7:43                         ` Daniel Vetter
  2019-01-25 15:02                           ` Olof Johansson
  0 siblings, 1 reply; 103+ messages in thread
From: Daniel Vetter @ 2019-01-25  7:43 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Fri, Jan 25, 2019 at 1:14 AM Olof Johansson <olof@lixom.net> wrote:
>
> On Thu, Jan 24, 2019 at 2:23 AM Dave Airlie <airlied@gmail.com> wrote:
> >
> > > I know I won't be able to convince you but I want to say that I think
> > > your arguments for full userspace open source are not really
> > > technical.
> >
> > There is more to keeping a kernel going than technical argument unfortunately.
> >
> > I guess the question for Greg, Olof etc, is do we care about Linux the
> > kernel, or Linux the open source ecosystem, if the former, these sort
> > of accelerator shim drivers are fine, useless to anyone who doesn't
> > have all the magic hidden userspace, and impossible to support for
> > anyone else, if the latter, we should leave the cost of maintenance to
> > the company benefiting from it and leave maintaining it out of tree.
>
> As mentioned in my reply to Daniel, I think we've got a history of
> being pragmatic and finding reasonable trade-offs of what can be open
> and what can be closed. For example, if we truly care about open source
> ecosystem, drivers that require closed firmware should also be
> refused.

Firmware has traditionally been different since usually it's locked
down, doesn't do much wrt functionality (dumb fifo scheduling at best,
not really power management) and so could be reasonably shrugged off
as "it's part of hw". If you care about the open graphics ecosystem,
i.e. your ability to port the stack to new cpu architectures, new
window systems (e.g. android -> xorg, or xorg -> android, or something
entirely new like wayland), new, more efficient client interface
(vulkan is a very new fad), then having a closed firmware is not going
to be a problem. Closed compiler, closed runtime, closed anything else
otoh is a serious practical pain.

Unfortunately hw vendors seem to have realized that we (overall
community of customers, distro, upstream) are not insisting on open
firmware, so they're moving a lot of "valuable sauce" (no really, it's
not) into the firmware. PM governors, cpu scheduling algorithms, that
kind of stuff. We're not pleased, and there's lots of people doing the
behind the scenes work to fix it. One practical problem is that even
if we've demonstrated that r/e'ing a uc is no bigger challenge than
anything, there's usually this pesky issue with signatures. So we
can't force the vendors like we can with the userspace side. Otherwise
nouveau would have completely open firmware even for latest chips
(like it has for older ones).

> > Simple question like If I plug your accelerator into Power or ARM64,
> > where do I get the port of your userspace to use it?
>
> Does demanding complete open userspace get us closer to that goal in
> reality? By refusing to work with people to enable their hardware,
> they will still ship their platforms out of tree, using DKMS and all
> the other ways of getting kernel modules installed to talk to the
> hardware. And we'd be no closer.
>
> In the end, they'd open up their userspace when there's business
> reasons to do so. It's well-known how to work around refusal from us
> to merge drivers by now, so it's not much leverage in that area.

Correct. None of the hw vendors had a business reason to open source
anything unfortunately. Yes, eventually customers started demanding
open source and threatening to buy the competition, but this only works
if you have multiple reasonably performant & conformant stacks for
different vendors. The only way to get these is to reverse engineer
them.

Now reverse-engineering is a major pain in itself (despite all the
great tooling gpu folks developed over the past 10 years to convert it
from a black art to a repeatable engineering exercise), but if you
additionally prefer the vendor's closed stack (which you do by allowing
them to get merged) the r/e'd stack has no chance. And there is
no other way to get your open source stack. I can't really go into
all the details of the past 15+ years of open source gpus, but without the
pressure of other r/e'ed stacks and the pressure of having stacks for
competitors (all made possible through aggressive code sharing) we
would have 0 open source gfx stacks. All the ones we have either got
started with r/e first (and eventually the vendor jumped on board) or
survived through r/e and customer efforts (because the vendor planned
to abandon it). Another part of this is that we accept userspace only
when it's the common upstream (if there is one), to prevent vendors
closing down their stacks gradually.

So yeah I think by not clearly preferring open source over
stacks-with-blobs (how radically you do that is a bit of a balancing act in
the end, I think we've maxed out in drivers/gpu on what's practically
possible) you'll just make sure that there's never going to be a
serious open source stack.

> > I'm not the final arbiter on this sort of thing, but I'm definitely
> > going to make sure that anyone who lands this code is explicit in
> > ignoring any experience we've had in this area and in the future will
> > gladly accept "I told you so" :-)
>
> There's only one final arbiter on any inclusion of code in the kernel,
> but we tend to sort out most disagreements without going all the way
> there.
>
> I still think engaging has a better chance of success than rejecting
> the contributions, especially with clear expectations w.r.t. continued
> engagement and no second implementations over time. In all honesty,
> either approach might fail miserably.

This is maybe not clear, but we still work together with the blob
folks as much as possible, for demonstration: nvidia sponsored XDC
this year, and nvidia engineers have been regularly presenting there.
Collaboration happens around the driver interfaces, like loaders (in
userspace), buffer sharing, synchronization, negotiation of buffer
formats and all that stuff. Do as much engaging as possible, but if
you give preferential treatment to the closed stacks over the open
ones (and by default the vendor _always_ gives you a closed stack, or
as closed as possible, there's just no business case for them to open
up without a customer demanding it and competition providing it too),
you will end up with a closed stack for a very long time, maybe
forever.

Even if you insist on an open stack it's going to take years, since
the only way to get there is lots of r/e, and you need to have at
least 2 stacks or otherwise the customers can't walk away from the
negotiation table. So again from gfx experience: The only way to get
open stacks is solid competition by open stacks, and customers/distros
investing ridiculous amounts of money to r/e the chips and write these
open&cross vendor stacks. The business case for vendors to open source
their stacks is just not there. Not until they can't sell their chips
any other way anymore (nvidia will embrace open stacks as soon as
their margins evaporate, not a second earlier, like all the others
before them). Maybe at the next hallway track we need to go through a
few examples of what all happened and is still happening in the
background (here's maybe not a good idea).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 06/15] habanalabs: add basic Goya h/w initialization
  2019-01-23  0:00 ` [PATCH 06/15] habanalabs: add basic Goya h/w initialization Oded Gabbay
@ 2019-01-25  7:46   ` Mike Rapoport
  2019-01-28 10:35     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-25  7:46 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

Hi,

This starts the 6-9 review :)

These were more difficult to review because small pieces of code are
interleaved with large sequences of register writes. Expressing these
register writes as data (tables) rather than code would probably help.
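
To illustrate the suggestion above, here is a minimal sketch of a
table-driven init sequence. It is not taken from the patch: struct
reg_init, write_reg_table(), fake_wreg32(), ARRAY_LEN() and the register
offsets are hypothetical stand-ins so that the example builds in
userspace. Only the three values come from the quoted
goya_init_ddr_ch0(). In the driver, the loop body would presumably call
WREG32() with the real mmDDR_MC_CH0_* offsets instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical register/value pair; not part of the posted driver. */
struct reg_init {
	uint32_t offset;	/* byte offset of the register */
	uint32_t value;		/* value to write */
};

/* Stand-in for the device register space so the sketch runs anywhere. */
#define FAKE_REG_WORDS 64
static uint32_t fake_regs[FAKE_REG_WORDS];

/* Stand-in for the driver's WREG32(); bounds-checked for the sketch. */
static void fake_wreg32(uint32_t offset, uint32_t value)
{
	assert(offset % 4 == 0 && offset / 4 < FAKE_REG_WORDS);
	fake_regs[offset / 4] = value;
}

/* Write every entry of a table, in order. */
static void write_reg_table(const struct reg_init *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fake_wreg32(tbl[i].offset, tbl[i].value);
}

/*
 * Values copied from the quoted goya_init_ddr_ch0(); the offsets are
 * placeholders, not the real register addresses.
 */
static const struct reg_init ddr_ch0_init[] = {
	{ 0x00, 0x81040210 },	/* MSTR */
	{ 0x10, 0x4000a0f0 },	/* MRCTRL0 */
	{ 0x14, 0x00022ad0 },	/* MRCTRL1 */
};
```

A caller would run write_reg_table(ddr_ch0_init, ARRAY_LEN(ddr_ch0_init)).
Steps that need a read-back or a delay between writes can stay as
ordinary code between two tables, so the control flow remains visible
while the bulk values move out of the way.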

On Wed, Jan 23, 2019 at 02:00:48AM +0200, Oded Gabbay wrote:
> This patch adds the basic part of Goya's H/W initialization. It adds code
> that initializes Goya's internal CPU, various registers that are related to
> internal routing, scrambling, workarounds for H/W bugs, etc.
> 
> It also initializes Goya's security scheme that prevents the user from
> abusing Goya to steal data from the host, crash the host, change
> Goya's F/W, etc.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/device.c              |   12 +
>  drivers/misc/habanalabs/goya/Makefile         |    2 +-
>  drivers/misc/habanalabs/goya/goya.c           | 1892 ++++++++++-
>  drivers/misc/habanalabs/goya/goyaP.h          |    3 +
>  drivers/misc/habanalabs/goya/goya_security.c  | 2999 +++++++++++++++++
>  drivers/misc/habanalabs/habanalabs.h          |   16 +
>  drivers/misc/habanalabs/habanalabs_drv.c      |    8 +
>  drivers/misc/habanalabs/include/goya/goya.h   |    1 +
>  .../include/goya/goya_async_events.h          |  186 +
>  .../habanalabs/include/goya/goya_boot_if.h    |   32 +
>  10 files changed, 5144 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya_boot_if.h
> 
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 0bd86a7d34db..9fc7218a973c 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -315,6 +315,15 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  		goto release_ctx;
>  	}
>  
> +	rc = hdev->asic_funcs->hw_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize the H/W\n");
> +		rc = 0;

Mistype, I suppose.

> +		goto out_disabled;
> +	}
> +
> +	hdev->disabled = false;
> +
>  	dev_notice(hdev->dev,
>  		"Successfully added device to habanalabs driver\n");
>  
> @@ -366,6 +375,9 @@ void hl_device_fini(struct hl_device *hdev)
>  	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
>  		dev_err(hdev->dev, "kernel ctx is still alive\n");
>  
> +	/* Reset the H/W. It will be in idle state after this returns */
> +	hdev->asic_funcs->hw_fini(hdev, true);
> +
>  	/* Call ASIC S/W finalize function */
>  	hdev->asic_funcs->sw_fini(hdev);
>  
> diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
> index 5ebf3d0d5794..a57096fa41b6 100644
> --- a/drivers/misc/habanalabs/goya/Makefile
> +++ b/drivers/misc/habanalabs/goya/Makefile
> @@ -1,3 +1,3 @@
>  subdir-ccflags-y += -I$(src)
>  
> -HL_GOYA_FILES :=  goya/goya.o
> \ No newline at end of file
> +HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
> \ No newline at end of file
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index 341ac085af82..f715e01838b3 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -119,11 +119,11 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
>  	prop->cfg_size = CFG_SIZE;
>  	prop->max_asid = MAX_ASID;
> +	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> +	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
>  	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
>  
>  	prop->high_pll = PLL_HIGH_DEFAULT;
> -	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> -	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
>  }
>  
>  /**
> @@ -459,10 +459,12 @@ static int goya_early_init(struct hl_device *hdev)
>  		goto disable_device;
>  	}
>  
> -	val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
> -	if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
> -		dev_warn(hdev->dev,
> -			"PCI strap is not configured correctly, PCI bus errors may occur\n");
> +	if (!hdev->pldm) {
> +		val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);

What is the purpose of the 'mm' prefix in register names?

> +		if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
> +			dev_warn(hdev->dev,
> +				"PCI strap is not configured correctly, PCI bus errors may occur\n");
> +	}
>  
>  	return 0;
>  
> @@ -593,6 +595,1882 @@ int goya_sw_fini(struct hl_device *hdev)
>  	return 0;
>  }
>  
> +/**
> + * goya_init_pll - Initialize pll registers
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static void goya_init_pll(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u16 hbw_nr, hbw_nf, hbw_od, hbw_nb;
> +	u16 cpu_nr, cpu_nf, cpu_od, cpu_nb;
> +	u16 mc_nr, mc_nf, mc_od, mc_nb;
> +	u16 pci_nr, pci_nf, pci_od, pci_nb;
> +	u16 emmc_nr, emmc_nf, emmc_od, emmc_nb;
> +
> +	if (!hdev->config_pll)
> +		return;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_PLL)
> +		return;
> +
> +	if (hdev->cpu_enable) {
> +		dev_info(hdev->dev,
> +			"Waiting 5s for u-boot before configuring PLLs\n");
> +		ssleep(5);
> +	}
> +
> +/*
> + * PLL possible configuration values:
> +	{50000000,1,16,16,8},
> +	{100000000,1,32,16,16},
> +	{150000000,1,48,16,24},
> +	{200000000,1,64,16,32},
> +	{250000000,1,70,14,35},
> +	{300000000,1,60,10,30},
> +	{350000000,1,70,10,35},
> +	{400000000,1,64,8,32},
> +	{450000000,1,54,6,27},
> +	{500000000,1,60,6,30},
> +	{550000000,1,66,6,33},
> +	{600000000,1,48,4,24},
> +	{650000000,1,52,4,26},
> +	{700000000,1,56,4,28},
> +	{750000000,1,60,4,30},
> +	{800000000,1,64,4,32},
> +	{850000000,1,68,4,34},
> +	{900000000,1,36,2,18},
> +	{950000000,1,38,2,19},
> +	{1000000000,1,40,2,20},
> +	{1050000000,1,42,2,21},
> +	{1100000000,1,44,2,22},
> +	{1150000000,1,46,2,23},
> +	{1200000000,1,48,2,24},
> +	{1250000000,1,50,2,25},
> +	{1300000000,1,52,2,26},
> +	{1350000000,1,54,2,27},
> +	{1400000000,1,56,2,28},
> +	{1450000000,1,58,2,29},
> +	{1500000000,1,60,2,30},
> +	{1550000000,1,62,2,31},

Some explanation about the correspondence of these values to _nr, _nf, _od
and _nb would be helpful.
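
For what it's worth, the rows listed above are numerically consistent
with the layout {frequency in Hz, nr, nf, od, nb} and with
frequency = ref * nf / (nr * od) for a 50 MHz reference. The "200MHz"
branch below writes 0/63/15/31 for hbw, which matches the row
{200000000,1,64,16,32} if each register takes the field value minus one.
All of this is inferred from the numbers alone, not from any
documentation, so treat the 50 MHz reference, the formula and the
minus-one encoding as assumptions. The names struct pll_cfg,
pll_out_hz() and pll_field_to_reg() are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* One row of the quoted table, assumed to be {freq, nr, nf, od, nb}. */
struct pll_cfg {
	uint64_t freq_hz;
	uint32_t nr, nf, od, nb;
};

/* ASSUMPTION: 50 MHz reference clock, inferred from the table only. */
#define ASSUMED_REF_HZ 50000000ULL

/* ASSUMPTION: output frequency = ref * nf / (nr * od). */
static uint64_t pll_out_hz(const struct pll_cfg *c)
{
	assert(c->nr != 0 && c->od != 0);
	return ASSUMED_REF_HZ * c->nf / ((uint64_t)c->nr * c->od);
}

/* ASSUMPTION: the register holds the field value minus one. */
static uint32_t pll_field_to_reg(uint32_t field)
{
	assert(field != 0);
	return field - 1;
}
```

If the assumptions hold, feeding the row {200000000,1,64,16,32} through
pll_field_to_reg() gives nr=0, nf=63, od=15, nb=31, i.e. the hbw values
in the non-pldm branch. nb equals nf/2 in every row shown, but the table
itself does not say what it controls.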

> +*/
> +
> +	if (hdev->pldm) {

                /* ? MHz */

> +		hbw_nr  = 4, hbw_nf  = 302, hbw_od  = 1, hbw_nb  = 151;
> +		cpu_nr  = 0, cpu_nf  = 47, cpu_od  = 1, cpu_nb  = 32;
> +		mc_nr   = 1, mc_nf   = 159, mc_od   = 9, mc_nb   = 79;
> +		pci_nr  = 4, pci_nf  = 343, pci_od  = 3, pci_nb  = 171;
> +		emmc_nr = 24, emmc_nf = 415, emmc_od = 15, emmc_nb = 207;
> +	} else {
> +		/* 200MHz */
> +		hbw_nr  = 0, hbw_nf  = 63, hbw_od  = 15, hbw_nb  = 31;
> +		cpu_nr  = 0, cpu_nf  = 47, cpu_od  = 1, cpu_nb  = 23;
> +		mc_nr   = 2, mc_nf   = 0x9f, mc_od   = 3, mc_nb   = 0x4f;

The hex here looks inconsistent.

> +		pci_nr  = 4, pci_nf  = 343, pci_od  = 3, pci_nb  = 171;
> +		emmc_nr = 24, emmc_nf = 415, emmc_od = 15, emmc_nb = 207;
> +	}
> +
> +	/* Adjust divider for SPI */
> +	WREG32(mmPSOC_SPI_BAUDR, 8);
> +
> +	WREG32(mmCPU_PLL_RST, 1);
> +	WREG32(mmCPU_PLL_NR, cpu_nr);
> +	WREG32(mmCPU_PLL_NF, cpu_nf);
> +	WREG32(mmCPU_PLL_OD, cpu_od);
> +	WREG32(mmCPU_PLL_NB, cpu_nb);
> +	WREG32(mmCPU_PLL_DATA_CHNG, 0x11);
> +
> +	/* delay before taking PLL out of reset */
> +	udelay(100);
> +

[ ... ]

> +
> +	goya->hw_cap_initialized |= HW_CAP_PLL;
> +}
> +
> +static void goya_set_pll_refclk(struct hl_device *hdev)
> +{
> +	WREG32(mmCPU_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmCPU_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmCPU_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmCPU_PLL_DIV_SEL_3, 0x0);
> +
> +	WREG32(mmIC_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmIC_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmIC_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmIC_PLL_DIV_SEL_3, 0x0);
> +
> +	WREG32(mmMC_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmMC_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmMC_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmMC_PLL_DIV_SEL_3, 0x0);
> +
> +	WREG32(mmPSOC_MME_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmPSOC_MME_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmPSOC_MME_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmPSOC_MME_PLL_DIV_SEL_3, 0x0);
> +
> +	WREG32(mmPSOC_PCI_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmPSOC_PCI_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmPSOC_PCI_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmPSOC_PCI_PLL_DIV_SEL_3, 0x0);
> +
> +	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_3, 0x0);
> +
> +	WREG32(mmTPC_PLL_DIV_SEL_0, 0x0);
> +	WREG32(mmTPC_PLL_DIV_SEL_1, 0x0);
> +	WREG32(mmTPC_PLL_DIV_SEL_2, 0x0);
> +	WREG32(mmTPC_PLL_DIV_SEL_3, 0x0);
> +}
> +
> +static void goya_disable_clk_rlx(struct hl_device *hdev)
> +{
> +	WREG32(mmPSOC_MME_PLL_CLK_RLX_0, 0x100010);
> +	WREG32(mmIC_PLL_CLK_RLX_0, 0x100010);
> +}
> +
> +/**
> + * goya_init_ddr_ch0 - Initialize DDR CH0 controller of the chip
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static void goya_init_ddr_ch0(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 val;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_DDR_0)
> +		return;
> +
> +	val = RREG32(mmDDR_MISC_CH0_CFG_DONE);
> +	if (val & DDR_MISC_CH0_CFG_DONE_CFG_DONE_MASK) {
> +		goya->hw_cap_initialized |= HW_CAP_DDR_0;
> +		return;
> +	}
> +
> +	WREG32(mmDDR_MC_CH0_DBG1, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000001);
> +
> +	val = RREG32(mmDDR_MC_CH0_STAT);
> +
> +	WREG32(mmDDR_MC_CH0_MSTR, 0x81040210);
> +	WREG32(mmDDR_MC_CH0_MRCTRL0, 0x4000a0f0);
> +	WREG32(mmDDR_MC_CH0_MRCTRL1, 0x00022ad0);
> +	WREG32(mmDDR_MC_CH0_MRCTRL2, 0x091629e1);
> +	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000008);
> +	WREG32(mmDDR_MC_CH0_PWRTMG, 0x00040002);
> +	WREG32(mmDDR_MC_CH0_HWLPCTL, 0x00be0002);
> +	WREG32(mmDDR_MC_CH0_RFSHCTL0, 0x0091f020);
> +	WREG32(mmDDR_MC_CH0_RFSHCTL1, 0x00120018);
> +	WREG32((mmDDR_MC_CH0_MSTR + 0x00000058), 0x00160005);
> +	WREG32(mmDDR_MC_CH0_RFSHCTL3, 0x00000020);
> +	WREG32(mmDDR_MC_CH0_RFSHTMG, 0x003000d0);
> +	WREG32(mmDDR_MC_CH0_ECCCFG0, 0x00000010);
> +	WREG32(mmDDR_MC_CH0_ECCCFG1, 0x00000002);
> +	WREG32(mmDDR_MC_CH0_ECCCTL, 0x00000300);
> +	WREG32(mmDDR_MC_CH0_ECCPOISONADDR0, 0x00000078);
> +	WREG32(mmDDR_MC_CH0_ECCPOISONADDR1, 0x100062f7);
> +	WREG32(mmDDR_MC_CH0_CRCPARCTL0, 0x00008000);
> +	WREG32(mmDDR_MC_CH0_CRCPARCTL1, 0x0e088301);
> +	WREG32(mmDDR_MC_CH0_CRCPARCTL2, 0x00600527);
> +	WREG32(mmDDR_MC_CH0_INIT0, 0x00070002);
> +	WREG32(mmDDR_MC_CH0_INIT1, 0x0001000e);
> +	WREG32(mmDDR_MC_CH0_INIT3, 0x0c510001);
> +	WREG32(mmDDR_MC_CH0_INIT4, 0x00280400);
> +	WREG32(mmDDR_MC_CH0_INIT5, 0x00110000);
> +	WREG32(mmDDR_MC_CH0_INIT6, 0x02000643);
> +	WREG32(mmDDR_MC_CH0_INIT7, 0x00001000);
> +	WREG32(mmDDR_MC_CH0_DIMMCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_RANKCTL, 0x000009a0);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG0, 0x1918361a);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG1, 0x00080724);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG2, 0x080d0713);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG3, 0x00012012);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG4, 0x0b04060b);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG5, 0x0a0c0804);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG8, 0x0606490c);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG9, 0x0002050f);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG10, 0x000e0d0f);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG11, 0x270b011f);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG12, 0x00000010);
> +	WREG32(mmDDR_MC_CH0_DRAMTMG15, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_ZQCTL0, 0x31000040);
> +	WREG32(mmDDR_MC_CH0_ZQCTL1, 0x00000070);
> +	WREG32(mmDDR_MC_CH0_DFITMG0, 0x05978211);
> +	WREG32(mmDDR_MC_CH0_DFITMG1, 0x00080101);
> +	WREG32(mmDDR_MC_CH0_DFILPCFG0, 0x07006031);
> +	WREG32(mmDDR_MC_CH0_DFILPCFG1, 0x00000010);
> +	WREG32(mmDDR_MC_CH0_DFIUPD0, 0x40400018);
> +	WREG32(mmDDR_MC_CH0_DFIUPD1, 0x000b0046);
> +	WREG32(mmDDR_MC_CH0_DFIUPD2, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH0_DFITMG2, 0x00001711);
> +	WREG32(mmDDR_MC_CH0_DFITMG3, 0x0000001e);
> +	WREG32(mmDDR_MC_CH0_DBICTL, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_DFIPHYMSTR, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP0, 0x00001f1f);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP1, 0x003f1503);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP2, 0x01000400);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP3, 0x04000505);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP4, 0x00001f1f);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP5, 0x06060303);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP6, 0x0f050709);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP7, 0x00000f0f);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP8, 0x00003f01);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP9, 0x09000606);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP10, 0x02090105);
> +	WREG32(mmDDR_MC_CH0_ADDRMAP11, 0x0000000a);
> +	WREG32(mmDDR_MC_CH0_ODTCFG, 0x09090a08);
> +	WREG32(mmDDR_MC_CH0_ODTMAP, 0x9ae1b5fe);
> +	WREG32(mmDDR_MC_CH0_SCHED, 0x664d3700);
> +	WREG32(mmDDR_MC_CH0_SCHED1, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_PERFHPR1, 0x1700e024);
> +	WREG32(mmDDR_MC_CH0_PERFLPR1, 0x1e00836c);
> +	WREG32(mmDDR_MC_CH0_PERFWR1, 0x260046c9);
> +	WREG32(mmDDR_MC_CH0_DQMAP0, 0x0d2b3503);
> +	WREG32(mmDDR_MC_CH0_DQMAP1, 0x042a0537);
> +	WREG32(mmDDR_MC_CH0_DQMAP2, 0x330b2806);
> +	WREG32(mmDDR_MC_CH0_DQMAP3, 0x27013803);
> +	WREG32(mmDDR_MC_CH0_DQMAP4, 0x0000022c);
> +	WREG32(mmDDR_MC_CH0_DQMAP5, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_DBG0, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_DBG1, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_DBGCMD, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_SWCTL, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_POISONCFG, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_ADVECCINDEX, 0x00000004);
> +	WREG32(mmDDR_MC_CH0_ECCPOISONPAT0, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_ECCPOISONPAT1, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_ECCPOISONPAT2, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_CAPARPOISONCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_PCCFG, 0x00000011);
> +	WREG32(mmDDR_MC_CH0_PCFGR_0, 0x0000518c);
> +	WREG32(mmDDR_MC_CH0_PCFGW_0, 0x00001263);
> +	WREG32(mmDDR_MC_CH0_PCTRL_0, 0x00000001);
> +	WREG32(mmDDR_MC_CH0_PCFGQOS0_0, 0x0011000e);
> +	WREG32(mmDDR_MC_CH0_SBRCTL, 0x0016b540);
> +	WREG32(mmDDR_MC_CH0_SBRWDATA0, 0x8c1d1786);
> +	WREG32(mmDDR_MC_CH0_SBRWDATA1, 0x265f03dd);
> +
> +	val = RREG32(mmDDR_MC_CH0_RFSHCTL3);
> +
> +	WREG32(mmDDR_MISC_CH0_CFG_DONE, 0x00000001);
> +
> +	WREG32(mmDDR_MC_CH0_DBG1, 0x00000000);
> +
> +	val = RREG32(mmDDR_MC_CH0_PWRCTL);
> +
> +	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000002);
> +
> +	val = RREG32(mmDDR_MC_CH0_PWRCTL);
> +
> +	WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_SWCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000060);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH0_PCTRL_0, 0x00000001);
> +
> +	goya->hw_cap_initialized |= HW_CAP_DDR_0;
> +}
> +
> +/**
> + * goya_init_ddr_ch1 - Initialize DDR CH1 controller of the chip
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static void goya_init_ddr_ch1(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 val;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_DDR_1)
> +		return;
> +
> +	val = RREG32(mmDDR_MISC_CH1_CFG_DONE);
> +	if (val & DDR_MISC_CH1_CFG_DONE_CFG_DONE_MASK) {
> +		goya->hw_cap_initialized |= HW_CAP_DDR_1;
> +		return;
> +	}
> +
> +	WREG32(mmDDR_MC_CH1_DBG1, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000001);
> +
> +	val = RREG32(mmDDR_MC_CH1_STAT);
> +
> +	WREG32(mmDDR_MC_CH1_MSTR, 0x81040210);
> +	WREG32(mmDDR_MC_CH1_MRCTRL0, 0x4000a0f0);
> +	WREG32(mmDDR_MC_CH1_MRCTRL1, 0x00022ad0);
> +	WREG32(mmDDR_MC_CH1_MRCTRL2, 0x091629e1);
> +	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000008);
> +	WREG32(mmDDR_MC_CH1_PWRTMG, 0x00040002);
> +	WREG32(mmDDR_MC_CH1_HWLPCTL, 0x00be0002);
> +	WREG32(mmDDR_MC_CH1_RFSHCTL0, 0x0091f020);
> +	WREG32(mmDDR_MC_CH1_RFSHCTL1, 0x00120018);
> +	WREG32((mmDDR_MC_CH1_MSTR + 0x00000058), 0x00160005);
> +	WREG32(mmDDR_MC_CH1_RFSHCTL3, 0x00000020);
> +	WREG32(mmDDR_MC_CH1_RFSHTMG, 0x003000d0);
> +	WREG32(mmDDR_MC_CH1_ECCCFG0, 0x00000010);
> +	WREG32(mmDDR_MC_CH1_ECCCFG1, 0x00000002);
> +	WREG32(mmDDR_MC_CH1_ECCCTL, 0x00000300);
> +	WREG32(mmDDR_MC_CH1_ECCPOISONADDR0, 0x00000078);
> +	WREG32(mmDDR_MC_CH1_ECCPOISONADDR1, 0x100062f7);
> +	WREG32(mmDDR_MC_CH1_CRCPARCTL0, 0x00008000);
> +	WREG32(mmDDR_MC_CH1_CRCPARCTL1, 0x0e088301);
> +	WREG32(mmDDR_MC_CH1_CRCPARCTL2, 0x00600527);
> +	WREG32(mmDDR_MC_CH1_INIT0, 0x00070002);
> +	WREG32(mmDDR_MC_CH1_INIT1, 0x0001000e);
> +	WREG32(mmDDR_MC_CH1_INIT3, 0x0c510001);
> +	WREG32(mmDDR_MC_CH1_INIT4, 0x00280400);
> +	WREG32(mmDDR_MC_CH1_INIT5, 0x00110000);
> +	WREG32(mmDDR_MC_CH1_INIT6, 0x02000643);
> +	WREG32(mmDDR_MC_CH1_INIT7, 0x00001000);
> +	WREG32(mmDDR_MC_CH1_DIMMCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_RANKCTL, 0x000009a0);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG0, 0x1918361a);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG1, 0x00080724);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG2, 0x080d0713);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG3, 0x00012012);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG4, 0x0b04060b);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG5, 0x0a0c0804);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG8, 0x0606490c);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG9, 0x0002050f);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG10, 0x000e0d0f);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG11, 0x270b011f);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG12, 0x00000010);
> +	WREG32(mmDDR_MC_CH1_DRAMTMG15, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_ZQCTL0, 0x31000040);
> +	WREG32(mmDDR_MC_CH1_ZQCTL1, 0x00000070);
> +	WREG32(mmDDR_MC_CH1_DFITMG0, 0x05978211);
> +	WREG32(mmDDR_MC_CH1_DFITMG1, 0x00080101);
> +	WREG32(mmDDR_MC_CH1_DFILPCFG0, 0x07006031);
> +	WREG32(mmDDR_MC_CH1_DFILPCFG1, 0x00000010);
> +	WREG32(mmDDR_MC_CH1_DFIUPD0, 0x40400018);
> +	WREG32(mmDDR_MC_CH1_DFIUPD1, 0x000b0046);
> +	WREG32(mmDDR_MC_CH1_DFIUPD2, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH1_DFITMG2, 0x00001711);
> +	WREG32(mmDDR_MC_CH1_DFITMG3, 0x0000001e);
> +	WREG32(mmDDR_MC_CH1_DBICTL, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_DFIPHYMSTR, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP0, 0x00001f1f);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP1, 0x003f1503);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP2, 0x01000400);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP3, 0x04000505);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP4, 0x00001f1f);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP5, 0x06060303);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP6, 0x0f050709);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP7, 0x00000f0f);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP8, 0x00003f01);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP9, 0x09000606);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP10, 0x02090105);
> +	WREG32(mmDDR_MC_CH1_ADDRMAP11, 0x0000000a);
> +	WREG32(mmDDR_MC_CH1_ODTCFG, 0x09090a08);
> +	WREG32(mmDDR_MC_CH1_ODTMAP, 0x9ae1b5fe);
> +	WREG32(mmDDR_MC_CH1_SCHED, 0x664d3700);
> +	WREG32(mmDDR_MC_CH1_SCHED1, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_PERFHPR1, 0x1700e024);
> +	WREG32(mmDDR_MC_CH1_PERFLPR1, 0x1e00836c);
> +	WREG32(mmDDR_MC_CH1_PERFWR1, 0x260046c9);
> +	WREG32(mmDDR_MC_CH1_DQMAP0, 0x0d2b3503);
> +	WREG32(mmDDR_MC_CH1_DQMAP1, 0x042a0537);
> +	WREG32(mmDDR_MC_CH1_DQMAP2, 0x330b2806);
> +	WREG32(mmDDR_MC_CH1_DQMAP3, 0x27013803);
> +	WREG32(mmDDR_MC_CH1_DQMAP4, 0x0000022c);
> +	WREG32(mmDDR_MC_CH1_DQMAP5, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_DBG0, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_DBG1, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_DBGCMD, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_SWCTL, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_POISONCFG, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_ADVECCINDEX, 0x00000004);
> +	WREG32(mmDDR_MC_CH1_ECCPOISONPAT0, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_ECCPOISONPAT1, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_ECCPOISONPAT2, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_CAPARPOISONCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_PCCFG, 0x00000011);
> +	WREG32(mmDDR_MC_CH1_PCFGR_0, 0x0000518c);
> +	WREG32(mmDDR_MC_CH1_PCFGW_0, 0x00001263);
> +	WREG32(mmDDR_MC_CH1_PCTRL_0, 0x00000001);
> +	WREG32(mmDDR_MC_CH1_PCFGQOS0_0, 0x0011000e);
> +	WREG32(mmDDR_MC_CH1_SBRCTL, 0x0016b540);
> +	WREG32(mmDDR_MC_CH1_SBRWDATA0, 0x8c1d1786);
> +	WREG32(mmDDR_MC_CH1_SBRWDATA1, 0x265f03dd);
> +
> +	val = RREG32(mmDDR_MC_CH1_RFSHCTL3);
> +
> +	WREG32(mmDDR_MISC_CH1_CFG_DONE, 0x00000001);
> +
> +	WREG32(mmDDR_MC_CH1_DBG1, 0x00000000);
> +
> +	val = RREG32(mmDDR_MC_CH1_PWRCTL);
> +
> +	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000002);
> +
> +	val = RREG32(mmDDR_MC_CH1_PWRCTL);
> +
> +	WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_SWCTL, 0x00000000);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000060);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> +	WREG32(mmDDR_MC_CH1_PCTRL_0, 0x00000001);

The initialization sequence for the second DDR channel looks nearly
identical to that of the first channel.
I would guess that their control registers have identical offsets from some
base address. If that is the case, the DDR initialization can be factored
out into a single function that takes the base address as a parameter.
  
> +
> +	goya->hw_cap_initialized |= HW_CAP_DDR_1;
> +}
> +
> +static void _goya_tpc_mbist_workaround(struct hl_device *hdev, u8 tpc_id)
> +{
> +	u64 tpc_eml_address;
> +	u32 val, tpc_offset, tpc_eml_offset, tpc_slm_offset;
> +	int err, slm_index;
> +
> +	WARN_ON(tpc_id >= TPC_MAX_NUM);

Is it safe to continue if tpc_id >= TPC_MAX_NUM?
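
If it is not, one option is to bail out early. In the kernel WARN_ON()
evaluates to its condition, so "if (WARN_ON(...)) return;" works. Below is
a user-space sketch of that pattern; warn_on(), the counter and the value
of TPC_MAX_NUM (8, going by the eight TPCs mentioned in the cover letter)
are assumptions made for the example.

```c
#include <stdio.h>

#define TPC_MAX_NUM 8	/* assumed: eight TPCs, per the cover letter */

/* Stand-in for the kernel's WARN_ON(): report and return the condition. */
static int warn_on(int cond)
{
	if (cond)
		fprintf(stderr, "warning: condition hit\n");
	return cond;
}

static int tpc_writes;	/* counts how many times the "hardware" was touched */

static int tpc_mbist_workaround(unsigned int tpc_id)
{
	/* Refuse to compute offsets from an out-of-range id. */
	if (warn_on(tpc_id >= TPC_MAX_NUM))
		return -1;

	tpc_writes++;	/* the register accesses would go here */
	return 0;
}
```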

> +	tpc_offset = tpc_id * 0x40000;
> +	tpc_eml_offset = tpc_id * 0x200000;
> +	tpc_eml_address = (mmTPC0_EML_CFG_BASE + tpc_eml_offset - CFG_BASE);
> +	tpc_slm_offset = tpc_eml_address + 0x100000;
> +
> +	/*
> +	 * Workaround for Bug H2 #2443 :
> +	 * "TPC SB is not initialized on chip reset"
> +	 */
> +
> +	val = RREG32(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset);
> +	if (val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_ACTIVE_MASK)
> +		dev_warn(hdev->dev, "TPC%d MBIST ACTIVE is not cleared\n",
> +			tpc_id);
> +
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_PAT + tpc_offset, val & 0xFFFFF000);
> +
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_0 + tpc_offset, 0x37FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_1 + tpc_offset, 0x303F);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_2 + tpc_offset, 0x71FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_3 + tpc_offset, 0x71FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_4 + tpc_offset, 0x70FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_5 + tpc_offset, 0x70FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_6 + tpc_offset, 0x70FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_7 + tpc_offset, 0x70FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_8 + tpc_offset, 0x70FF);
> +	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_9 + tpc_offset, 0x70FF);
> +
> +	WREG32_OR(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
> +		1 << TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_START_SHIFT);
> +
> +	err = hl_poll_timeout(
> +		hdev,
> +		mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
> +		val,
> +		(val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_DONE_MASK),
> +		1000,
> +		HL_DEVICE_TIMEOUT_USEC);
> +
> +	if (err)
> +		dev_err(hdev->dev,
> +			"Timeout while waiting for TPC%d MBIST DONE\n", tpc_id);
> +
> +	WREG32_OR(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
> +		1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT);
> +
> +	msleep(GOYA_RESET_WAIT_MSEC);
> +
> +	WREG32_AND(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
> +		~(1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT));
> +
> +	msleep(GOYA_RESET_WAIT_MSEC);
> +
> +	for (slm_index = 0 ; slm_index < 256 ; slm_index++)
> +		WREG32(tpc_slm_offset + (slm_index << 2), 0);
> +
> +	val = RREG32(tpc_slm_offset);
> +
> +	WREG32(mmTPC0_CFG_BASE + tpc_offset + 0xF40 - CFG_BASE, 0x100);
> +}
> +
> +static void goya_tpc_mbist_workaround(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int i;
> +
> +	if (hdev->pldm)
> +		return;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_TPC_MBIST)
> +		return;
> +
> +	/* Workaround for H2 #2443 */
> +
> +	for (i = 0 ; i < TPC_MAX_NUM ; i++)
> +		_goya_tpc_mbist_workaround(hdev, i);
> +
> +	goya->hw_cap_initialized |= HW_CAP_TPC_MBIST;
> +}
> +
> +/**
> + * goya_init_golden_registers - Initialize golden registers
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Initialize the H/W registers of the device
> + *
> + */
> +static void goya_init_golden_registers(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 polynom[10], tpc_intr_mask;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_GOLDEN)
> +		return;
> +
> +	polynom[0] = 0x00020080;
> +	polynom[1] = 0x00401000;
> +	polynom[2] = 0x00200800;
> +	polynom[3] = 0x00002000;
> +	polynom[4] = 0x00080200;
> +	polynom[5] = 0x00040100;
> +	polynom[6] = 0x00100400;
> +	polynom[7] = 0x00004000;
> +	polynom[8] = 0x00010000;
> +	polynom[9] = 0x00008000;
> +
> +	/* Mask all arithmetic interrupts from TPC */
> +	tpc_intr_mask = 0x7FFF;
> +
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmDMA_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmDMA_NRTR_SCRAMB_EN, 1 << DMA_NRTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmDMA_NRTR_NON_LIN_SCRAMB,
> +			1 << DMA_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmSRAM_Y5_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y4_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y3_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y2_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y1_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y0_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y5_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y4_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y3_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y2_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y1_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y0_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y5_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y4_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y3_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y2_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y1_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y0_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y5_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y4_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y3_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y2_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y1_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y0_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y5_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y4_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y3_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y2_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y1_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> +	WREG32(mmSRAM_Y0_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);

Any chance this can be done in a loop?
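
Something along these lines, assuming the per-router registers sit at a
fixed stride in X and Y. The strides and mock_wreg32() below are invented
for the sketch; the grid size (X0..X4, Y0..Y5) is what the quoted code
walks through.

```c
#include <stdint.h>

#define SRAM_RTR_X_NUM		5	/* X0..X4, as in the quoted code */
#define SRAM_RTR_Y_NUM		6	/* Y0..Y5, as in the quoted code */
/* Hypothetical strides between neighbouring routers. */
#define SRAM_RTR_X_STRIDE	0x40
#define SRAM_RTR_Y_STRIDE	0x200

/* User-space stand-in for the register file; the driver would call WREG32(). */
static uint32_t mock_regs[(SRAM_RTR_Y_NUM * SRAM_RTR_Y_STRIDE) / 4];

static void mock_wreg32(uint32_t reg, uint32_t val)
{
	mock_regs[reg / 4] = val;
}

/* Write val to the same register in every router, given its Y0_X0 address. */
static void sram_rtr_write_all(uint32_t y0_x0_reg, uint32_t val)
{
	uint32_t x, y;

	for (y = 0; y < SRAM_RTR_Y_NUM; y++)
		for (x = 0; x < SRAM_RTR_X_NUM; x++)
			mock_wreg32(y0_x0_reg + y * SRAM_RTR_Y_STRIDE +
				    x * SRAM_RTR_X_STRIDE, val);
}
```

Blocks where the value depends on the column (0x206 for X0..X2 and 0x207
for X3..X4 further down) would need the value picked inside the loop.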

> +	WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_L_ARB, 0x204);
> +	WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_L_ARB, 0x204);

Ditto.

> +	WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_E_ARB, 0x206);
> +	WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_E_ARB, 0x206);

And here and below as well.

> +	WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_E_ARB, 0x207);
> +	WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_E_ARB, 0x207);

[ ... ]

> +	WREG32(mmMME1_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmMME1_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmMME2_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmMME2_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmMME3_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmMME3_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmMME4_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmMME4_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmMME5_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmMME5_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmMME6_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmMME6_RTR_SPLIT_COEF_9, polynom[9] >> 7);

This sequence seems to repeat itself. If the register map permits, I'd
suggest moving the writes of polynom[] to the registers into a helper
function.
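
A possible shape for such a helper, sketched in user space. It assumes
that SPLIT_COEF_0..9 of each router are consecutive 32-bit registers, which
I have not checked; mock_wreg32() and the helper name are made up, the
polynom[] values are the ones from the patch.

```c
#include <stdint.h>

/* User-space stand-in for the register file; the driver would call WREG32(). */
static uint32_t mock_regs[0x400 / 4];

static void mock_wreg32(uint32_t reg, uint32_t val)
{
	mock_regs[reg / 4] = val;
}

static const uint32_t polynom[10] = {
	0x00020080, 0x00401000, 0x00200800, 0x00002000, 0x00080200,
	0x00040100, 0x00100400, 0x00004000, 0x00010000, 0x00008000,
};

/* Assumes SPLIT_COEF_0..9 are consecutive 32-bit registers. */
static void rtr_write_split_coefs(uint32_t coef0_reg)
{
	uint32_t i;

	for (i = 0; i < sizeof(polynom) / sizeof(polynom[0]); i++)
		mock_wreg32(coef0_reg + 4 * i, polynom[i] >> 7);
}
```

Each of the DMA, MME, TPC and PCI blocks would then shrink to one call
with the address of its SPLIT_COEF_0 register.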

> +
> +	WREG32(mmMME1_RTR_SCRAMB_EN, 1 << MME1_RTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmMME1_RTR_NON_LIN_SCRAMB,
> +			1 << MME1_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmMME2_RTR_SCRAMB_EN, 1 << MME2_RTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmMME2_RTR_NON_LIN_SCRAMB,
> +			1 << MME2_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmMME3_RTR_SCRAMB_EN, 1 << MME3_RTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmMME3_RTR_NON_LIN_SCRAMB,
> +			1 << MME3_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmMME4_RTR_SCRAMB_EN, 1 << MME4_RTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmMME4_RTR_NON_LIN_SCRAMB,
> +			1 << MME4_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmMME5_RTR_SCRAMB_EN, 1 << MME5_RTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmMME5_RTR_NON_LIN_SCRAMB,
> +			1 << MME5_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmMME6_RTR_SCRAMB_EN, 1 << MME6_RTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmMME6_RTR_NON_LIN_SCRAMB,
> +			1 << MME6_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmTPC0_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmTPC0_NRTR_SCRAMB_EN, 1 << TPC0_NRTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmTPC0_NRTR_NON_LIN_SCRAMB,
> +			1 << TPC0_NRTR_NON_LIN_SCRAMB_EN_SHIFT);

[ ... ]

> +	/*
> +	 * Workaround for Bug H2 #2441 :
> +	 * "ST.NOP set trace event illegal opcode"
> +	 */
> +	WREG32(mmTPC6_CFG_TPC_INTR_MASK, tpc_intr_mask);
> +
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmTPC7_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmTPC7_NRTR_SCRAMB_EN, 1 << TPC7_NRTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmTPC7_NRTR_NON_LIN_SCRAMB,
> +			1 << TPC7_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
> +
> +	/*
> +	 * Workaround for Bug H2 #2441 :
> +	 * "ST.NOP set trace event illegal opcode"
> +	 */
> +	WREG32(mmTPC7_CFG_TPC_INTR_MASK, tpc_intr_mask);
> +
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> +	WREG32(mmPCI_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> +
> +	WREG32(mmPCI_NRTR_SCRAMB_EN, 1 << PCI_NRTR_SCRAMB_EN_VAL_SHIFT);
> +	WREG32(mmPCI_NRTR_NON_LIN_SCRAMB,
> +			1 << PCI_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
> +

I think all these long sequences of register writes could be grouped into
something like

struct regs_write_seq {
	unsigned long addr;
	unsigned long val;
};

const struct regs_write_seq golden_regs1[] = {
	...
};

const struct regs_write_seq workaround_bug_2441[] = {
	...
};

and written with a helper function looping over such array.
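
For example, a minimal user-space sketch of the table plus helper. The
register addresses and mock_wreg32() are placeholders; the two values are
the ones the patch writes for the H2 #HW-23 workaround.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the register file; the driver would call WREG32(). */
static uint32_t mock_regs[0x100 / 4];

static void mock_wreg32(uint32_t reg, uint32_t val)
{
	mock_regs[reg / 4] = val;
}

struct regs_write_seq {
	uint32_t addr;
	uint32_t val;
};

/* Hypothetical addresses, for illustration only. */
#define DMA_CH_0_CFG0	0x10
#define DMA_CH_1_CFG0	0x20

static const struct regs_write_seq workaround_hw_23[] = {
	{ DMA_CH_0_CFG0, 0x0fff0010 },
	{ DMA_CH_1_CFG0, 0x0fff00f0 },
};

/* Apply n address/value pairs in order. */
static void regs_write_seq_apply(const struct regs_write_seq *seq, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		mock_wreg32(seq[i].addr, seq[i].val);
}
```

In the driver the call would look like
regs_write_seq_apply(workaround_hw_23, ARRAY_SIZE(workaround_hw_23)).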

> +	/*
> +	 * Workaround for H2 #HW-23 bug
> +	 * Set DMA max outstanding read requests to 240 on DMA CH 1. Set it
> +	 * to 16 on KMD DMA
> +	 * We need to limit only these DMAs because the user can only read
> +	 * from Host using DMA CH 1
> +	 */
> +	WREG32(mmDMA_CH_0_CFG0, 0x0fff0010);
> +	WREG32(mmDMA_CH_1_CFG0, 0x0fff00F0);
> +
> +	goya->hw_cap_initialized |= HW_CAP_GOLDEN;
> +}
> +
> +
> +/**
> + * goya_push_uboot_to_device - Push u-boot FW code to device
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Copy u-boot fw code from firmware file to SRAM BAR.
> + * Returns 0 on success
> + *
> + */
> +static int goya_push_uboot_to_device(struct hl_device *hdev)
> +{
> +	char fw_name[200];
> +	const u64 *fw_data;
> +	void __iomem *dst;
> +	size_t fw_size, i;
> +	int rc;
> +
> +	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
> +
> +	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to request u-boot fw image\n");
> +		goto out;
> +	}
> +
> +	fw_size = hdev->spl_fw->size;
> +	if ((fw_size % 4) != 0) {
> +		dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
> +			fw_size);
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
> +
> +	fw_data = (const u64 *) hdev->spl_fw->data;
> +	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		fw_size -= 8;
> +
> +	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> +		if (!(i & (0x80000 - 1)))
> +			dev_dbg(hdev->dev,
> +				"u-boot copied so far %lu out of %lu",
> +				i, fw_size);
> +
> +		writeq(*fw_data, dst);
> +	}
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		writel(*(const u32 *) fw_data, dst);
> +
> +out:
> +	release_firmware(hdev->spl_fw);
> +	return rc;
> +}
> +
> +/**
> + * goya_push_linux_to_device - Push LINUX FW code to device
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Copy LINXU fw code from firmware file to DDR BAR.

	  ^ Linux

> + * Returns 0 on success
> + *
> + */
> +static int goya_push_linux_to_device(struct hl_device *hdev)
> +{
> +	char fw_name[200];
> +	const u64 *fw_data;
> +	void __iomem *dst;
> +	size_t fw_size, i;
> +	int rc;
> +
> +	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
> +
> +	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to request Linux fw image\n");
> +		goto out;
> +	}
> +
> +	fw_size = hdev->spl_fw->size;
> +	if ((fw_size % 4) != 0) {
> +		dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
> +			fw_size);
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
> +
> +	fw_data = (const u64 *) hdev->spl_fw->data;
> +	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		fw_size -= 8;
> +
> +	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> +		if (!(i & (0x80000 - 1))) {
> +			dev_dbg(hdev->dev,
> +				"Linux copied so far %lu out of %lu",
> +				i, fw_size);
> +			usleep_range(20, 100);
> +		}
> +		writeq(*fw_data, dst);
> +	}
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		writel(*(const u32 *) fw_data, dst);
> +
> +out:
> +	release_firmware(hdev->spl_fw);
> +	return rc;

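The U-Boot and Linux loaders are almost identical. As far as I can see,
they differ in the firmware name, the destination and the usleep_range()
call in the Linux copy loop. I think they can be merged into one helper
declared as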

static int goya_push_fw_to_device(struct hl_device *hdev, const char *name,
				  void __iomem *dst)

and called twice.
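
To make the idea concrete, here is a rough userspace model of the shared
copy loop, not driver code. The __iomem destination is replaced by a plain
buffer, writeq()/writel() by memcpy(), and push_fw_image() is a made-up
name. The size check mirrors the "% 4" test in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace model of a shared firmware copy helper. In the driver, dst
 * would be a void __iomem * and the two memcpy() calls would be writeq()
 * and writel(). push_fw_image() is a hypothetical name.
 *
 * Returns 0 on success, -1 if the size is not a non-zero multiple of 4.
 */
static int push_fw_image(uint8_t *dst, const uint8_t *src, size_t size)
{
	size_t i, qwords_len;

	if (size == 0 || (size % 4) != 0)
		return -1;

	/* Largest multiple of 8 that fits; cannot wrap for size == 4 */
	qwords_len = size & ~(size_t)7;

	/* Bulk of the image, 8 bytes at a time */
	for (i = 0; i < qwords_len; i += 8)
		memcpy(dst + i, src + i, 8);

	/* Optional 4-byte tail */
	if (size != qwords_len)
		memcpy(dst + i, src + i, 4);

	return 0;
}
```

Rounding the length down instead of doing "fw_size -= 8" also sidesteps
the size_t wrap-around that I believe the current loops would hit for a
4-byte image, unlikely as such an image is.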

> +}
> +
> +static int goya_pldm_init_cpu(struct hl_device *hdev)
> +{
> +	u32 val, unit_rst_val;
> +	int rc;
> +
> +	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
> +	goya_init_golden_registers(hdev);
> +
> +	/* Put ARM cores into reset */
> +	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
> +	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> +
> +	/* Reset the CA53 MACRO */
> +	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
> +	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
> +	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +
> +	rc = goya_push_uboot_to_device(hdev);
> +	if (rc)
> +		return rc;
> +
> +	rc = goya_push_linux_to_device(hdev);
> +	if (rc)
> +		return rc;
> +
> +	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
> +	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
> +
> +	WREG32(mmCPU_CA53_CFG_RST_ADDR_LSB_0,
> +		lower_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
> +	WREG32(mmCPU_CA53_CFG_RST_ADDR_MSB_0,
> +		upper_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
> +
> +	/* Release ARM core 0 from reset */
> +	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL,
> +					CPU_RESET_CORE0_DEASSERT);
> +	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> +
> +	return 0;
> +}
> +
> +/*
> + * FW component passes an offset from SRAM_BASE_ADDR in SCRATCHPAD_xx.
> + * The version string should be located by that offset.
> + */
> +static void goya_read_device_fw_version(struct hl_device *hdev,
> +					enum goya_fw_component fwc)
> +{
> +	const char *name;
> +	u32 ver_off;
> +	char *dest;
> +
> +	switch (fwc) {
> +	case FW_COMP_UBOOT:
> +		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_29);
> +		dest = hdev->asic_prop.uboot_ver;
> +		name = "U-Boot";
> +		break;
> +	case FW_COMP_PREBOOT:
> +		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_28);
> +		dest = hdev->asic_prop.preboot_ver;
> +		name = "Preboot";
> +		break;
> +	default:
> +		dev_warn(hdev->dev, "Undefined FW component: %d\n", fwc);
> +		return;
> +	}
> +
> +	ver_off &= ~((u32)SRAM_BASE_ADDR);
> +
> +	if (ver_off < SRAM_SIZE - VERSION_MAX_LEN) {
> +		memcpy_fromio(dest, hdev->pcie_bar[SRAM_CFG_BAR_ID] + ver_off,
> +							VERSION_MAX_LEN);
> +	} else {
> +		dev_err(hdev->dev, "%s version offset (0x%x) is above SRAM\n",
> +								name, ver_off);
> +		strcpy(dest, "unavailable");
> +	}
> +}
> +
> +static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 status;
> +	int rc;
> +
> +	if (!hdev->cpu_enable)
> +		return 0;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_CPU)
> +		return 0;
> +
> +	/*
> +	 * Before pushing u-boot/linux to device, need to set the ddr bar to
> +	 * base address of dram
> +	 */
> +	rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to map DDR bar to DRAM base address\n");
> +		return rc;
> +	}
> +
> +	if (hdev->pldm) {
> +		rc = goya_pldm_init_cpu(hdev);
> +		if (rc)
> +			return rc;
> +
> +		goto out;
> +	}
> +
> +	/* Make sure CPU boot-loader is running */
> +	rc = hl_poll_timeout(
> +		hdev,
> +		mmPSOC_GLOBAL_CONF_WARM_REBOOT,
> +		status,
> +		(status == CPU_BOOT_STATUS_DRAM_RDY) ||
> +		(status == CPU_BOOT_STATUS_SRAM_AVAIL),
> +		10000,
> +		cpu_timeout);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Error in ARM u-boot !!!");
> +		switch (status) {
> +		case CPU_BOOT_STATUS_NA:
> +			dev_err(hdev->dev,
> +				"ARM status %d - BTL did NOT run\n", status);
> +			break;
> +		case CPU_BOOT_STATUS_IN_WFE:
> +			dev_err(hdev->dev,
> +				"ARM status %d - Inside WFE loop\n", status);
> +			break;
> +		case CPU_BOOT_STATUS_IN_BTL:
> +			dev_err(hdev->dev,
> +				"ARM status %d - Stuck in BTL\n", status);
> +			break;
> +		case CPU_BOOT_STATUS_IN_PREBOOT:
> +			dev_err(hdev->dev,
> +				"ARM status %d - Stuck in Preboot\n", status);
> +			break;
> +		case CPU_BOOT_STATUS_IN_SPL:
> +			dev_err(hdev->dev,
> +				"ARM status %d - Stuck in SPL\n", status);
> +			break;
> +		case CPU_BOOT_STATUS_IN_UBOOT:
> +			dev_err(hdev->dev,
> +				"ARM status %d - Stuck in u-boot\n", status);
> +			break;
> +		case CPU_BOOT_STATUS_DRAM_INIT_FAIL:
> +			dev_err(hdev->dev,
> +				"ARM status %d - DDR initialization failed\n",
> +				status);
> +			break;
> +		default:
> +			dev_err(hdev->dev,
> +				"ARM status %d - Invalid status code\n",
> +				status);
> +			break;
> +		}
> +		return -EIO;
> +	}
> +
> +	/* Read U-Boot version now in case we will later fail */
> +	goya_read_device_fw_version(hdev, FW_COMP_UBOOT);
> +	goya_read_device_fw_version(hdev, FW_COMP_PREBOOT);
> +
> +	if (status == CPU_BOOT_STATUS_SRAM_AVAIL)
> +		goto out;
> +
> +	if (!hdev->fw_loading) {
> +		dev_info(hdev->dev, "Skip loading FW\n");
> +		goto out;
> +	}
> +
> +	rc = goya_push_linux_to_device(hdev);
> +	if (rc)
> +		return rc;
> +
> +	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
> +
> +	rc = hl_poll_timeout(
> +		hdev,
> +		mmPSOC_GLOBAL_CONF_WARM_REBOOT,
> +		status,
> +		(status == CPU_BOOT_STATUS_SRAM_AVAIL),
> +		10000,
> +		cpu_timeout);
> +
> +	if (rc) {
> +		if (status == CPU_BOOT_STATUS_FIT_CORRUPTED)
> +			dev_err(hdev->dev,
> +				"ARM u-boot reports FIT image is corrupted\n");
> +		else
> +			dev_err(hdev->dev,
> +				"ARM Linux failed to load, %d\n", status);
> +		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_NA);
> +		return -EIO;
> +	}
> +
> +	dev_info(hdev->dev, "Successfully loaded firmware to device\n");
> +
> +out:
> +	goya->hw_cap_initialized |= HW_CAP_CPU;
> +
> +	return 0;
> +}
> +
> +/**
> + * goya_hw_init - Goya hardware initialization code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Returns 0 on success
> + *
> + */
> +static int goya_hw_init(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	u32 val;
> +	int rc;
> +
> +	dev_info(hdev->dev, "Starting initialization of H/W\n");
> +
> +	/* Perform read from the device to make sure device is up */
> +	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
> +
> +	goya_init_pll(hdev);
> +
> +	if (hdev->pldm) {
> +		goya_init_ddr_ch0(hdev);
> +		goya_init_ddr_ch1(hdev);
> +	}
> +
> +	rc = goya_init_cpu(hdev, GOYA_CPU_TIMEOUT_USEC);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize CPU\n");
> +		return rc;
> +	}
> +
> +	goya_tpc_mbist_workaround(hdev);
> +
> +	goya_init_golden_registers(hdev);
> +
> +	/*
> +	 * After CPU initialization is finished, change DDR bar mapping inside
> +	 * iATU to point to the start address of the MMU page tables
> +	 */
> +	rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
> +		(MMU_PAGE_TABLES_ADDR & ~(prop->dram_pci_bar_size - 0x1ull)));
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to map DDR bar to MMU page tables\n");
> +		return rc;
> +	}
> +
> +	goya_init_security(hdev);
> +
> +	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
> +	rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
> +	if (rc) {
> +		dev_warn(hdev->dev, "Unable to set pci dma mask to 48 bits\n");
> +		rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"Unable to set pci dma mask to 32 bits\n");
> +			return rc;
> +		}
> +	}
> +
> +	rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
> +	if (rc) {
> +		dev_warn(hdev->dev,
> +			"Unable to set pci consistent dma mask to 48 bits\n");
> +		rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"Unable to set pci consistent dma mask to 32 bits\n");
> +			return rc;
> +		}
> +	}
> +
> +	/* Perform read from the device to flush all MSI-X configuration */
> +	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
> +
> +	return 0;
> +}
> +
> +/**
> + * goya_hw_fini - Goya hardware tear-down code
> + *
> + * @hdev: pointer to hl_device structure
> + * @hard_reset: should we do hard reset to all engines or just reset the
> + *              compute/dma engines
> + *
> + * The function does the following:
> + * - Send interrupt to CPU to go into "quiet" mode
> + * - Stall MME, TPC
> + * - Stop External, Internal QMANs
> + * - Disable MSI-X
> + * - Issue reset command
> + * - Wait until reset is done
> + * - Start device BTL
> + *
> + */
> +static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 reset_timeout_ms, status;
> +
> +	if (hdev->pldm)
> +		reset_timeout_ms = GOYA_PLDM_RESET_TIMEOUT_MSEC;
> +	else
> +		reset_timeout_ms = GOYA_RESET_TIMEOUT_MSEC;
> +
> +	if (hard_reset) {
> +		goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
> +		goya_disable_clk_rlx(hdev);
> +		goya_set_pll_refclk(hdev);
> +
> +		WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, RESET_ALL);
> +		dev_info(hdev->dev,
> +			"Issued HARD reset command, going to wait %dms\n",
> +			reset_timeout_ms);
> +	} else {
> +		WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, DMA_MME_TPC_RESET);
> +		dev_info(hdev->dev,
> +			"Issued SOFT reset command, going to wait %dms\n",
> +			reset_timeout_ms);
> +	}
> +
> +	/*
> +	 * After hard reset, we can't poll the BTM_FSM register because the PSOC
> +	 * itself is in reset. In either reset we need to wait until the reset
> +	 * is deasserted
> +	 */
> +	msleep(reset_timeout_ms);
> +
> +	status = RREG32(mmPSOC_GLOBAL_CONF_BTM_FSM);
> +	if (status & PSOC_GLOBAL_CONF_BTM_FSM_STATE_MASK)
> +		dev_err(hdev->dev,
> +			"Timeout while waiting for device to reset 0x%x\n",
> +			status);
> +
> +	if (!hard_reset) {
> +		goya->hw_cap_initialized &= ~(HW_CAP_DMA | HW_CAP_MME |
> +						HW_CAP_GOLDEN | HW_CAP_TPC);
> +		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> +				GOYA_ASYNC_EVENT_ID_SOFT_RESET);
> +		return;
> +	}
> +
> +	/* Chicken bit to re-initiate boot sequencer flow */
> +	WREG32(mmPSOC_GLOBAL_CONF_BOOT_SEQ_RE_START,
> +		1 << PSOC_GLOBAL_CONF_BOOT_SEQ_RE_START_IND_SHIFT);
> +	/* Move boot manager FSM to pre boot sequencer init state */
> +	WREG32(mmPSOC_GLOBAL_CONF_SW_BTM_FSM,
> +			0xA << PSOC_GLOBAL_CONF_SW_BTM_FSM_CTRL_SHIFT);
> +
> +	goya->hw_cap_initialized &= ~(HW_CAP_CPU | HW_CAP_CPU_Q |
> +					HW_CAP_DDR_0 | HW_CAP_DDR_1 |
> +					HW_CAP_DMA | HW_CAP_MME |
> +					HW_CAP_MMU | HW_CAP_TPC_MBIST |
> +					HW_CAP_GOLDEN | HW_CAP_TPC);
> +
> +	if (!hdev->pldm) {
> +		int rc;
> +		/* In case we are running inside VM and the VM is
> +		 * shutting down, we need to make sure CPU boot-loader
> +		 * is running before we can continue the VM shutdown.
> +		 * That is because the VM will send an FLR signal that
> +		 * we must answer
> +		 */
> +		dev_info(hdev->dev,
> +			"Going to wait up to %ds for CPU boot loader\n",
> +			GOYA_CPU_TIMEOUT_USEC / 1000 / 1000);
> +
> +		rc = hl_poll_timeout(
> +			hdev,
> +			mmPSOC_GLOBAL_CONF_WARM_REBOOT,
> +			status,
> +			(status == CPU_BOOT_STATUS_DRAM_RDY),
> +			10000,
> +			GOYA_CPU_TIMEOUT_USEC);
> +		if (rc)
> +			dev_err(hdev->dev,
> +				"failed to wait for CPU boot loader\n");
> +	}
> +}
> +
>  int goya_suspend(struct hl_device *hdev)
>  {
>  	return 0;
> @@ -641,6 +2519,8 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.early_fini = goya_early_fini,
>  	.sw_init = goya_sw_init,
>  	.sw_fini = goya_sw_fini,
> +	.hw_init = goya_hw_init,
> +	.hw_fini = goya_hw_fini,
>  	.suspend = goya_suspend,
>  	.resume = goya_resume,
>  	.mmap = goya_mmap,
> diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> index 0e12c56472bd..45a6d2ca2752 100644
> --- a/drivers/misc/habanalabs/goya/goyaP.h
> +++ b/drivers/misc/habanalabs/goya/goyaP.h
> @@ -9,6 +9,7 @@
>  #define GOYAP_H_
>  
>  #include "habanalabs.h"
> +#include "include/goya/goya_boot_if.h"
>  #include "include/goya/goya.h"
>  
>  #define NUMBER_OF_CMPLT_QUEUES		5
> @@ -122,4 +123,6 @@ struct goya_device {
>  	u32		hw_cap_initialized;
>  };
>  
> +void goya_init_security(struct hl_device *hdev);
> +
>  #endif /* GOYAP_H_ */
> diff --git a/drivers/misc/habanalabs/goya/goya_security.c b/drivers/misc/habanalabs/goya/goya_security.c
> new file mode 100644
> index 000000000000..99ad9aacf49e
> --- /dev/null
> +++ b/drivers/misc/habanalabs/goya/goya_security.c
> @@ -0,0 +1,2999 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "goyaP.h"
> +
> +/**
> + * goya_set_block_as_protected - set the given block as protected
> + *
> + * @hdev: pointer to hl_device structure
> + * @block: block base address
> + *
> + */
> +static void goya_pb_set_block(struct hl_device *hdev, u64 base)
> +{
> +	u32 pb_addr = base - CFG_BASE + PROT_BITS_OFFS;
> +
> +	while (pb_addr & 0xFFF) {
> +		WREG32(pb_addr, 0);
> +		pb_addr += 4;
> +	}
> +}
> +
> +static void goya_init_mme_protection_bits(struct hl_device *hdev)
> +{
> +	u32 pb_addr, mask;
> +	u8 word_offset;
> +
> +	/* TODO: change to real reg name when Soc Online is updated */
> +	u64 mmMME_SBB_POWER_ECO1 = 0xDFF60,
> +		mmMME_SBB_POWER_ECO2 = 0xDFF64;
> +
> +	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_0_BASE);
> +	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_1_BASE);
> +	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_2_BASE);
> +	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_3_BASE);
> +
> +	goya_pb_set_block(hdev, mmSBA_ECC_MEM_BASE);
> +	goya_pb_set_block(hdev, mmSBB_ECC_MEM_BASE);
> +
> +	goya_pb_set_block(hdev, mmMME1_RTR_BASE);
> +	goya_pb_set_block(hdev, mmMME1_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME1_WR_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME2_RTR_BASE);
> +	goya_pb_set_block(hdev, mmMME2_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME2_WR_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME3_RTR_BASE);
> +	goya_pb_set_block(hdev, mmMME3_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME3_WR_REGULATOR_BASE);
> +
> +	goya_pb_set_block(hdev, mmMME4_RTR_BASE);
> +	goya_pb_set_block(hdev, mmMME4_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME4_WR_REGULATOR_BASE);
> +
> +	goya_pb_set_block(hdev, mmMME5_RTR_BASE);
> +	goya_pb_set_block(hdev, mmMME5_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME5_WR_REGULATOR_BASE);
> +
> +	goya_pb_set_block(hdev, mmMME6_RTR_BASE);
> +	goya_pb_set_block(hdev, mmMME6_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmMME6_WR_REGULATOR_BASE);
> +
> +	pb_addr = (mmMME_DUMMY & ~0xFFF) + PROT_BITS_OFFS;
> +	word_offset = ((mmMME_DUMMY & PROT_BITS_OFFS) >> 7) << 2;
> +	mask = 1 << ((mmMME_DUMMY & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_RESET & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_STALL & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_DBGMEM_ADD & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_DBGMEM_DATA_WR & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_DBGMEM_DATA_RD & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_DBGMEM_CTRL & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_DBGMEM_RC & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_LOG_SHADOW & 0x7F) >> 2);
> +

The mask here and below seems to be a constant.
A #define could suffice, no?
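
Something along these lines, as a standalone sketch. PB_BIT() and the
EX_* names and offsets are invented for illustration; the real register
offsets come from the asic_reg headers.

```c
#include <stdint.h>

/* Bit inside a 32-bit protection word that belongs to register 'reg' */
#define PB_BIT(reg)	(1u << (((reg) & 0x7F) >> 2))

/*
 * Hypothetical register offsets, for illustration only; the real values
 * come from the generated asic_reg headers.
 */
#define EX_REG_A	0xD0000
#define EX_REG_B	0xD0004
#define EX_REG_C	0xD0010

/* Folded at compile time, so it can live next to the register definitions */
#define EX_PB_MASK	(PB_BIT(EX_REG_A) | PB_BIT(EX_REG_B) | PB_BIT(EX_REG_C))

/* Value to write to the protection word: a 0 bit means "protected" */
static inline uint32_t ex_pb_word(void)
{
	return ~EX_PB_MASK;
}
```

Using 1u (or BIT() in the kernel) rather than a plain 1 also keeps the
shift well defined when a register lands on bit 31 of the word.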

> +	WREG32(pb_addr + word_offset, ~mask);
> +
> +	pb_addr = (mmMME_STORE_MAX_CREDIT & ~0xFFF) + PROT_BITS_OFFS;
> +	word_offset = ((mmMME_STORE_MAX_CREDIT & PROT_BITS_OFFS) >> 7) << 2;
> +	mask = 1 << ((mmMME_STORE_MAX_CREDIT & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_AGU & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SBA & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SBB & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SBC & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_WBC & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SBA_CONTROL_DATA & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SBB_CONTROL_DATA & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SBC_CONTROL_DATA & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_WBC_CONTROL_DATA & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_TE & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_TE2DEC & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_REI_STATUS & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_REI_MASK & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SEI_STATUS & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SEI_MASK & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SPI_STATUS & 0x7F) >> 2);
> +	mask |= 1 << ((mmMME_SPI_MASK & 0x7F) >> 2);
> +
> +	WREG32(pb_addr + word_offset, ~mask);
> +

[ ... ]

> +
> +/**
> + * goya_init_protection_bits - Initialize protection bits for specific registers
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * All protection bits are 1 by default, means not protected. Need to set to 0
> + * each bit that belongs to a protected register.
> + *
> + */
> +static void goya_init_protection_bits(struct hl_device *hdev)
> +{
> +	/*
> +	 * In each 4K block of registers, the last 128 bytes are protection
> +	 * bits - total of 1024 bits, one for each register. Each bit is related
> +	 * to a specific register, by the order of the registers.
> +	 * So in order to calculate the bit that is related to a given register,
> +	 * we need to calculate its word offset and then the exact bit inside
> +	 * the word (which is 4 bytes).
> +	 *
> +	 * Register address:
> +	 *
> +	 * 31                 12 11           7   6             2  1      0
> +	 * -----------------------------------------------------------------
> +	 * |      Don't         |    word       |  bit location  |    0    |
> +	 * |      care          |   offset      |  inside word   |         |
> +	 * -----------------------------------------------------------------
> +	 *
> +	 * Bits 7-11 represents the word offset inside the 128 bytes.
> +	 * Bits 2-6 represents the bit location inside the word.
> +	 */
> +
> +	goya_pb_set_block(hdev, mmPCI_NRTR_BASE);
> +	goya_pb_set_block(hdev, mmPCI_RD_REGULATOR_BASE);
> +	goya_pb_set_block(hdev, mmPCI_WR_REGULATOR_BASE);

[ ... ]

> +	goya_init_mme_protection_bits(hdev);
> +
> +	goya_init_dma_protection_bits(hdev);
> +
> +	goya_init_tpc_protection_bits(hdev);
> +}
> +
> +/**
> + * goya_init_security - Initialize security model
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Initialize the security model of the device
> + * That includes range registers and protection bit per register
> + *
> + */
> +void goya_init_security(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	u32 dram_addr_lo = lower_32_bits(DRAM_PHYS_BASE);
> +	u32 dram_addr_hi = upper_32_bits(DRAM_PHYS_BASE);
> +
> +	u32 lbw_rng0_base = 0xFC440000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng0_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;

These are magic numbers anyway, so why not fold the mask into them directly?
BTW, I couldn't find DMA_MACRO_LBW_RANGE_BASE_R_MASK anywhere in the
driver.
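
A table would also get rid of the 28 local variables. Below is a
standalone sketch with the first three base/mask pairs copied as they
appear in the patch, i.e. before the "&" with
DMA_MACRO_LBW_RANGE_BASE_R_MASK, whose value I don't know. struct
lbw_range and program_lbw_ranges() are made-up names, and plain arrays
stand in for the registers.

```c
#include <stddef.h>
#include <stdint.h>

/* One low-bandwidth protection range: base address and match mask */
struct lbw_range {
	uint32_t base;
	uint32_t mask;
};

/* First three base/mask pairs from the patch, without the extra masking */
static const struct lbw_range lbw_ranges[] = {
	{ 0xFC440000, 0xFFFF0000 },
	{ 0xFC480000, 0xFFF80000 },
	{ 0xFC600000, 0xFFE00000 },
};

#define LBW_RANGE_COUNT	(sizeof(lbw_ranges) / sizeof(lbw_ranges[0]))

/*
 * Stand-in for the register file. In the driver each assignment would be
 * a WREG32() call; whether the BASE_n/MASK_n registers sit at a fixed
 * stride is an assumption I have not checked.
 */
static void program_lbw_ranges(uint32_t *base_regs, uint32_t *mask_regs)
{
	size_t i;

	for (i = 0; i < LBW_RANGE_COUNT; i++) {
		base_regs[i] = lbw_ranges[i].base;
		mask_regs[i] = lbw_ranges[i].mask;
	}
}
```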

> +
> +	u32 lbw_rng1_base = 0xFC480000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng1_mask = 0xFFF80000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng2_base = 0xFC600000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng2_mask = 0xFFE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng3_base = 0xFC800000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng3_mask = 0xFFF00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng4_base = 0xFCC02000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng4_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng5_base = 0xFCC40000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng5_mask = 0xFFFF8000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng6_base = 0xFCC48000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng6_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng7_base = 0xFCC4A000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng7_mask = 0xFFFFE000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng8_base = 0xFCC4C000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng8_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng9_base = 0xFCC50000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng9_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng10_base = 0xFCC60000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng10_mask = 0xFFFE0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng11_base = 0xFCE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng11_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng12_base = 0xFE484000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng12_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	u32 lbw_rng13_base = 0xFEC43000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +	u32 lbw_rng13_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> +
> +	WREG32(mmDMA_MACRO_LBW_RANGE_HIT_BLOCK, 0xFFFF);
> +	WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFF);
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MMU)) {
> +		WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFE);
> +
> +		/* Protect HOST */
> +		WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_0, 0);
> +		WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_0, 0);
> +		WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_0, 0);
> +		WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_0, 0xFFF80);
> +	}
> +
> +	/*
> +	 * Protect DDR @
> +	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
> +	 * The mask protects the first 512MB
> +	 */
> +	WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_1, dram_addr_lo);
> +	WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_1, dram_addr_hi);
> +	WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_1, 0xE0000000);
> +	WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_1, 0x3FFFF);
> +
> +	/* Protect registers */
> +
> +	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_0, lbw_rng0_base);
> +	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_0, lbw_rng0_mask);
> +	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_1, lbw_rng1_base);
> +	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_1, lbw_rng1_mask);
> +	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_2, lbw_rng2_base);
> +	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_2, lbw_rng2_mask);
> +	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_3, lbw_rng3_base);

[ ... ]

> +	goya_init_protection_bits(hdev);
> +}
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index 6ad476df65b0..adda281ec2af 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -23,6 +23,8 @@
>  
>  #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
>  
> +#define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
> +
>  #define HL_MAX_QUEUES			128
>  
>  struct hl_device;
> @@ -32,6 +34,8 @@ struct hl_fpriv;
>  
>  /**
>   * struct asic_fixed_properties - ASIC specific immutable properties.
> + * @uboot_ver: F/W U-boot version.
> + * @preboot_ver: F/W Preboot version.
>   * @sram_base_address: SRAM physical start address.
>   * @sram_end_address: SRAM physical end address.
>   * @sram_user_base_address - SRAM physical start address for user access.
> @@ -60,6 +64,8 @@ struct hl_fpriv;
>   * @tpc_enabled_mask: which TPCs are enabled.
>   */
>  struct asic_fixed_properties {
> +	char			uboot_ver[VERSION_MAX_LEN];
> +	char			preboot_ver[VERSION_MAX_LEN];
>  	u64			sram_base_address;
>  	u64			sram_end_address;
>  	u64			sram_user_base_address;
> @@ -168,6 +174,8 @@ enum hl_asic_type {
>   * @early_fini: tears down what was done in early_init.
>   * @sw_init: sets up driver state, does not configure H/W.
>   * @sw_fini: tears down driver state, does not configure H/W.
> + * @hw_init: sets up the H/W state.
> + * @hw_fini: tears down the H/W state.
>   * @suspend: handles IP specific H/W or SW changes for suspend.
>   * @resume: handles IP specific H/W or SW changes for resume.
>   * @mmap: mmap function, does nothing.
> @@ -180,6 +188,8 @@ struct hl_asic_funcs {
>  	int (*early_fini)(struct hl_device *hdev);
>  	int (*sw_init)(struct hl_device *hdev);
>  	int (*sw_fini)(struct hl_device *hdev);
> +	int (*hw_init)(struct hl_device *hdev);
> +	void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
>  	int (*suspend)(struct hl_device *hdev);
>  	int (*resume)(struct hl_device *hdev);
>  	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> @@ -312,6 +322,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
>   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
>   * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
> + * @spl_fw: image to load to ArmCP.
>   * @asid_bitmap: holds used/available ASIDs.
>   * @asid_mutex: protects asid_bitmap.
>   * @device_open: lock for sanity checks upon FD open.
> @@ -340,6 +351,7 @@ struct hl_device {
>  	void				*cpu_accessible_dma_mem;
>  	dma_addr_t			cpu_accessible_dma_address;
>  	struct gen_pool			*cpu_accessible_dma_pool;
> +	const struct firmware		*spl_fw;
>  	unsigned long			*asid_bitmap;
>  	struct mutex			asid_mutex;
>  	/* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
> @@ -359,7 +371,11 @@ struct hl_device {
>  	u8				disabled;
>  
>  	/* Parameters for bring-up */
> +	u8				cpu_enable;
>  	u8				reset_pcilink;
> +	u8				config_pll;
> +	u8				fw_loading;
> +	u8				pldm;
>  };
>  
>  /*
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index 5c312dd3aa50..bd80683118d3 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -181,7 +181,15 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
>  	hdev->major = hl_major;
>  
>  	/* Parameters for bring-up - set them to defaults */
> +	hdev->cpu_enable = 1;
>  	hdev->reset_pcilink = 0;
> +	hdev->config_pll = 0;
> +	hdev->fw_loading = 1;
> +	hdev->pldm = 0;
> +
> +	/* If CPU is disabled, no point in loading FW */
> +	if (!hdev->cpu_enable)
> +		hdev->fw_loading = 0;

cpu_enable was set to 1 just a couple of lines above, so this check can
never be true here, can it?
I've noticed there are a lot of checks for hdev->cpu_enable and hdev->pldm,
but I didn't see them ever change.

>  
>  	hdev->disabled = true;
>  	hdev->pdev = pdev; /* can be NULL in case of simulator device */
> diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
> index 192a1450cbb1..2d0efb7b44bb 100644
> --- a/drivers/misc/habanalabs/include/goya/goya.h
> +++ b/drivers/misc/habanalabs/include/goya/goya.h
> @@ -11,6 +11,7 @@
>  #define GOYA_H
>  
>  #include "asic_reg/goya_regs.h"
> +#include "goya_async_events.h"
>  
>  #include <linux/types.h>
>  
> diff --git a/drivers/misc/habanalabs/include/goya/goya_async_events.h b/drivers/misc/habanalabs/include/goya/goya_async_events.h
> new file mode 100644
> index 000000000000..497937a17ee9
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/goya/goya_async_events.h

This, apparently, should have been a part of patch 8 (habanalabs: add event
queue and interrupts)

> @@ -0,0 +1,186 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef __GOYA_ASYNC_EVENTS_H_
> +#define __GOYA_ASYNC_EVENTS_H_
> +
> +enum goya_async_event_id {
> +	GOYA_ASYNC_EVENT_ID_PCIE_IF = 33,
> +	GOYA_ASYNC_EVENT_ID_TPC0_ECC = 36,
> +	GOYA_ASYNC_EVENT_ID_TPC1_ECC = 39,
> +	GOYA_ASYNC_EVENT_ID_TPC2_ECC = 42,
> +	GOYA_ASYNC_EVENT_ID_TPC3_ECC = 45,
> +	GOYA_ASYNC_EVENT_ID_TPC4_ECC = 48,
> +	GOYA_ASYNC_EVENT_ID_TPC5_ECC = 51,
> +	GOYA_ASYNC_EVENT_ID_TPC6_ECC = 54,
> +	GOYA_ASYNC_EVENT_ID_TPC7_ECC = 57,
> +	GOYA_ASYNC_EVENT_ID_MME_ECC = 60,
> +	GOYA_ASYNC_EVENT_ID_MME_ECC_EXT = 61,
> +	GOYA_ASYNC_EVENT_ID_MMU_ECC = 63,
> +	GOYA_ASYNC_EVENT_ID_DMA_MACRO = 64,
> +	GOYA_ASYNC_EVENT_ID_DMA_ECC = 66,
> +	GOYA_ASYNC_EVENT_ID_CPU_IF_ECC = 75,
> +	GOYA_ASYNC_EVENT_ID_PSOC_MEM = 78,
> +	GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT = 79,
> +	GOYA_ASYNC_EVENT_ID_SRAM0 = 81,
> +	GOYA_ASYNC_EVENT_ID_SRAM1 = 82,
> +	GOYA_ASYNC_EVENT_ID_SRAM2 = 83,
> +	GOYA_ASYNC_EVENT_ID_SRAM3 = 84,
> +	GOYA_ASYNC_EVENT_ID_SRAM4 = 85,
> +	GOYA_ASYNC_EVENT_ID_SRAM5 = 86,
> +	GOYA_ASYNC_EVENT_ID_SRAM6 = 87,
> +	GOYA_ASYNC_EVENT_ID_SRAM7 = 88,
> +	GOYA_ASYNC_EVENT_ID_SRAM8 = 89,
> +	GOYA_ASYNC_EVENT_ID_SRAM9 = 90,
> +	GOYA_ASYNC_EVENT_ID_SRAM10 = 91,
> +	GOYA_ASYNC_EVENT_ID_SRAM11 = 92,
> +	GOYA_ASYNC_EVENT_ID_SRAM12 = 93,
> +	GOYA_ASYNC_EVENT_ID_SRAM13 = 94,
> +	GOYA_ASYNC_EVENT_ID_SRAM14 = 95,
> +	GOYA_ASYNC_EVENT_ID_SRAM15 = 96,
> +	GOYA_ASYNC_EVENT_ID_SRAM16 = 97,
> +	GOYA_ASYNC_EVENT_ID_SRAM17 = 98,
> +	GOYA_ASYNC_EVENT_ID_SRAM18 = 99,
> +	GOYA_ASYNC_EVENT_ID_SRAM19 = 100,
> +	GOYA_ASYNC_EVENT_ID_SRAM20 = 101,
> +	GOYA_ASYNC_EVENT_ID_SRAM21 = 102,
> +	GOYA_ASYNC_EVENT_ID_SRAM22 = 103,
> +	GOYA_ASYNC_EVENT_ID_SRAM23 = 104,
> +	GOYA_ASYNC_EVENT_ID_SRAM24 = 105,
> +	GOYA_ASYNC_EVENT_ID_SRAM25 = 106,
> +	GOYA_ASYNC_EVENT_ID_SRAM26 = 107,
> +	GOYA_ASYNC_EVENT_ID_SRAM27 = 108,
> +	GOYA_ASYNC_EVENT_ID_SRAM28 = 109,
> +	GOYA_ASYNC_EVENT_ID_SRAM29 = 110,
> +	GOYA_ASYNC_EVENT_ID_GIC500 = 112,
> +	GOYA_ASYNC_EVENT_ID_PCIE_DEC = 115,
> +	GOYA_ASYNC_EVENT_ID_TPC0_DEC = 117,
> +	GOYA_ASYNC_EVENT_ID_TPC1_DEC = 120,
> +	GOYA_ASYNC_EVENT_ID_TPC2_DEC = 123,
> +	GOYA_ASYNC_EVENT_ID_TPC3_DEC = 126,
> +	GOYA_ASYNC_EVENT_ID_TPC4_DEC = 129,
> +	GOYA_ASYNC_EVENT_ID_TPC5_DEC = 132,
> +	GOYA_ASYNC_EVENT_ID_TPC6_DEC = 135,
> +	GOYA_ASYNC_EVENT_ID_TPC7_DEC = 138,
> +	GOYA_ASYNC_EVENT_ID_AXI_ECC = 139,
> +	GOYA_ASYNC_EVENT_ID_L2_RAM_ECC = 140,
> +	GOYA_ASYNC_EVENT_ID_MME_WACS = 141,
> +	GOYA_ASYNC_EVENT_ID_MME_WACSD = 142,
> +	GOYA_ASYNC_EVENT_ID_PLL0 = 143,
> +	GOYA_ASYNC_EVENT_ID_PLL1 = 144,
> +	GOYA_ASYNC_EVENT_ID_PLL3 = 146,
> +	GOYA_ASYNC_EVENT_ID_PLL4 = 147,
> +	GOYA_ASYNC_EVENT_ID_PLL5 = 148,
> +	GOYA_ASYNC_EVENT_ID_PLL6 = 149,
> +	GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER = 155,
> +	GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC = 159,
> +	GOYA_ASYNC_EVENT_ID_PSOC = 160,
> +	GOYA_ASYNC_EVENT_ID_PCIE_FLR = 171,
> +	GOYA_ASYNC_EVENT_ID_PCIE_HOT_RESET = 172,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG0 = 174,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG1 = 175,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG2 = 176,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG3 = 177,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG0 = 178,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG1 = 179,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG2 = 180,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG3 = 181,
> +	GOYA_ASYNC_EVENT_ID_PCIE_APB = 182,
> +	GOYA_ASYNC_EVENT_ID_PCIE_QDB = 183,
> +	GOYA_ASYNC_EVENT_ID_PCIE_BM_D_P_WR = 184,
> +	GOYA_ASYNC_EVENT_ID_PCIE_BM_D_RD = 185,
> +	GOYA_ASYNC_EVENT_ID_PCIE_BM_U_P_WR = 186,
> +	GOYA_ASYNC_EVENT_ID_PCIE_BM_U_RD = 187,
> +	GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU = 190,
> +	GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR = 191,
> +	GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU = 200,
> +	GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR = 201,
> +	GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU = 210,
> +	GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR = 211,
> +	GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU = 220,
> +	GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR = 221,
> +	GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU = 230,
> +	GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR = 231,
> +	GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU = 240,
> +	GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR = 241,
> +	GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU = 250,
> +	GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR = 251,
> +	GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU = 260,
> +	GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR = 261,
> +	GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU0 = 270,
> +	GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU1 = 271,
> +	GOYA_ASYNC_EVENT_ID_MME_WACS_UP = 272,
> +	GOYA_ASYNC_EVENT_ID_MME_WACS_DOWN = 273,
> +	GOYA_ASYNC_EVENT_ID_MMU_PAGE_FAULT = 280,
> +	GOYA_ASYNC_EVENT_ID_MMU_WR_PERM = 281,
> +	GOYA_ASYNC_EVENT_ID_MMU_DBG_BM = 282,
> +	GOYA_ASYNC_EVENT_ID_DMA_BM_CH0 = 290,
> +	GOYA_ASYNC_EVENT_ID_DMA_BM_CH1 = 291,
> +	GOYA_ASYNC_EVENT_ID_DMA_BM_CH2 = 292,
> +	GOYA_ASYNC_EVENT_ID_DMA_BM_CH3 = 293,
> +	GOYA_ASYNC_EVENT_ID_DMA_BM_CH4 = 294,
> +	GOYA_ASYNC_EVENT_ID_DDR0_PHY_DFI = 300,
> +	GOYA_ASYNC_EVENT_ID_DDR0_ECC_SCRUB = 301,
> +	GOYA_ASYNC_EVENT_ID_DDR0_DB_ECC = 302,
> +	GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC = 303,
> +	GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC_MC = 304,
> +	GOYA_ASYNC_EVENT_ID_DDR0_AXI_RD = 305,
> +	GOYA_ASYNC_EVENT_ID_DDR0_AXI_WR = 306,
> +	GOYA_ASYNC_EVENT_ID_DDR1_PHY_DFI = 310,
> +	GOYA_ASYNC_EVENT_ID_DDR1_ECC_SCRUB = 311,
> +	GOYA_ASYNC_EVENT_ID_DDR1_DB_ECC = 312,
> +	GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC = 313,
> +	GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC_MC = 314,
> +	GOYA_ASYNC_EVENT_ID_DDR1_AXI_RD = 315,
> +	GOYA_ASYNC_EVENT_ID_DDR1_AXI_WR = 316,
> +	GOYA_ASYNC_EVENT_ID_CPU_BMON = 320,
> +	GOYA_ASYNC_EVENT_ID_TS_EAST = 322,
> +	GOYA_ASYNC_EVENT_ID_TS_WEST = 323,
> +	GOYA_ASYNC_EVENT_ID_TS_NORTH = 324,
> +	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_0 = 330,
> +	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_1 = 331,
> +	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_2 = 332,
> +	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET = 356,
> +	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT = 361,
> +	GOYA_ASYNC_EVENT_ID_TPC0_CMDQ = 430,
> +	GOYA_ASYNC_EVENT_ID_TPC1_CMDQ = 431,
> +	GOYA_ASYNC_EVENT_ID_TPC2_CMDQ = 432,
> +	GOYA_ASYNC_EVENT_ID_TPC3_CMDQ = 433,
> +	GOYA_ASYNC_EVENT_ID_TPC4_CMDQ = 434,
> +	GOYA_ASYNC_EVENT_ID_TPC5_CMDQ = 435,
> +	GOYA_ASYNC_EVENT_ID_TPC6_CMDQ = 436,
> +	GOYA_ASYNC_EVENT_ID_TPC7_CMDQ = 437,
> +	GOYA_ASYNC_EVENT_ID_TPC0_QM = 438,
> +	GOYA_ASYNC_EVENT_ID_TPC1_QM = 439,
> +	GOYA_ASYNC_EVENT_ID_TPC2_QM = 440,
> +	GOYA_ASYNC_EVENT_ID_TPC3_QM = 441,
> +	GOYA_ASYNC_EVENT_ID_TPC4_QM = 442,
> +	GOYA_ASYNC_EVENT_ID_TPC5_QM = 443,
> +	GOYA_ASYNC_EVENT_ID_TPC6_QM = 444,
> +	GOYA_ASYNC_EVENT_ID_TPC7_QM = 445,
> +	GOYA_ASYNC_EVENT_ID_MME_QM = 447,
> +	GOYA_ASYNC_EVENT_ID_MME_CMDQ = 448,
> +	GOYA_ASYNC_EVENT_ID_DMA0_QM = 449,
> +	GOYA_ASYNC_EVENT_ID_DMA1_QM = 450,
> +	GOYA_ASYNC_EVENT_ID_DMA2_QM = 451,
> +	GOYA_ASYNC_EVENT_ID_DMA3_QM = 452,
> +	GOYA_ASYNC_EVENT_ID_DMA4_QM = 453,
> +	GOYA_ASYNC_EVENT_ID_DMA_ON_HBW = 454,
> +	GOYA_ASYNC_EVENT_ID_DMA0_CH = 455,
> +	GOYA_ASYNC_EVENT_ID_DMA1_CH = 456,
> +	GOYA_ASYNC_EVENT_ID_DMA2_CH = 457,
> +	GOYA_ASYNC_EVENT_ID_DMA3_CH = 458,
> +	GOYA_ASYNC_EVENT_ID_DMA4_CH = 459,
> +	GOYA_ASYNC_EVENT_ID_PI_UPDATE = 484,
> +	GOYA_ASYNC_EVENT_ID_HALT_MACHINE = 485,
> +	GOYA_ASYNC_EVENT_ID_INTS_REGISTER = 486,
> +	GOYA_ASYNC_EVENT_ID_SOFT_RESET = 487,
> +	GOYA_ASYNC_EVENT_ID_LAST_VALID_ID = 1023,
> +	GOYA_ASYNC_EVENT_ID_SIZE
> +};
> +
> +#endif /* __GOYA_ASYNC_EVENTS_H_ */
> diff --git a/drivers/misc/habanalabs/include/goya/goya_boot_if.h b/drivers/misc/habanalabs/include/goya/goya_boot_if.h
> new file mode 100644
> index 000000000000..2e39578ec795
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/goya/goya_boot_if.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> + *
> + */
> +
> +#ifndef GOYA_BOOT_IF_H
> +#define GOYA_BOOT_IF_H
> +
> +enum cpu_boot_status {
> +	CPU_BOOT_STATUS_NA = 0,		/* Default value after reset of chip */
> +	CPU_BOOT_STATUS_IN_WFE,
> +	CPU_BOOT_STATUS_DRAM_RDY,
> +	CPU_BOOT_STATUS_SRAM_AVAIL,
> +	CPU_BOOT_STATUS_IN_BTL,		/* BTL is H/W FSM */
> +	CPU_BOOT_STATUS_IN_PREBOOT,
> +	CPU_BOOT_STATUS_IN_SPL,
> +	CPU_BOOT_STATUS_IN_UBOOT,
> +	CPU_BOOT_STATUS_DRAM_INIT_FAIL,
> +	CPU_BOOT_STATUS_FIT_CORRUPTED
> +};
> +
> +enum kmd_msg {
> +	KMD_MSG_NA = 0,
> +	KMD_MSG_GOTO_WFE,
> +	KMD_MSG_FIT_RDY
> +};
> +
> +#endif /* GOYA_BOOT_IF_H */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 07/15] habanalabs: add h/w queues module
  2019-01-23  0:00 ` [PATCH 07/15] habanalabs: add h/w queues module Oded Gabbay
@ 2019-01-25  7:50   ` Mike Rapoport
  2019-01-28 10:50     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-25  7:50 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:49AM +0200, Oded Gabbay wrote:
> This patch adds the H/W queues module and the code to initialize Goya's
> various compute and DMA engines and their queues.
> 
> Goya has 5 DMA channels, 8 TPC engines and a single MME engine. For each
> channel/engine, there is a H/W queue logic which is used to pass commands
> from the user to the H/W. That logic is called QMAN.
> 
> There are two types of QMANs: external and internal. The DMA QMANs are
> considered external while the TPC and MME QMANs are considered internal.
> For each external queue there is a completion queue, which is located on
> the Host memory.
> 
> The differences between external and internal QMANs are:
> 
> 1. The location of the queue's memory. External QMANs are located on the
>    Host memory while internal QMANs are located on the on-chip memory.
> 
> 2. The external QMAN writes an entry to a completion queue and sends an
>    MSI-X interrupt upon completion of a command buffer that was given to
>    it. The internal QMAN doesn't do that.
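
To make the two differences above concrete, here is a minimal user-space
sketch of how they could be captured per queue. All sketch_* names are
hypothetical and are not taken from the patch. The counts (5 external,
9 internal) are read off the commit message, and the CPU queue that the
patch places between the two groups is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed counts: 5 DMA channels (external), 8 TPC + 1 MME (internal). */
enum { SKETCH_EXT_QUEUES = 5, SKETCH_INT_QUEUES = 9 };

enum sketch_qman_type { SKETCH_QMAN_EXT, SKETCH_QMAN_INT };

struct sketch_qman {
	enum sketch_qman_type type;
	bool queue_in_host_memory;	/* false: queue lives in on-chip memory */
	bool has_completion_queue;	/* CQ entry + MSI-X on completion */
};

/* Describe queue number idx, with the external (DMA) queues placed first. */
static struct sketch_qman sketch_describe_qman(int idx)
{
	struct sketch_qman q;

	assert(idx >= 0 && idx < SKETCH_EXT_QUEUES + SKETCH_INT_QUEUES);

	if (idx < SKETCH_EXT_QUEUES) {
		q.type = SKETCH_QMAN_EXT;
		q.queue_in_host_memory = true;
		q.has_completion_queue = true;
	} else {
		q.type = SKETCH_QMAN_INT;
		q.queue_in_host_memory = false;
		q.has_completion_queue = false;
	}
	return q;
}
```

With this layout, sketch_describe_qman(0) through (4) describe the DMA
QMANs and everything from index 5 upwards is internal.
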
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/Makefile              |    2 +-
>  drivers/misc/habanalabs/device.c              |   74 +-
>  drivers/misc/habanalabs/goya/goya.c           | 1518 +++++++++++++++--
>  drivers/misc/habanalabs/goya/goyaP.h          |    6 +
>  drivers/misc/habanalabs/habanalabs.h          |  176 +-
>  drivers/misc/habanalabs/habanalabs_drv.c      |    6 +
>  drivers/misc/habanalabs/hw_queue.c            |  404 +++++
>  .../habanalabs/include/goya/goya_packets.h    |  234 +++
>  .../habanalabs/include/habanalabs_device_if.h |  272 +++
>  drivers/misc/habanalabs/irq.c                 |  150 ++
>  10 files changed, 2721 insertions(+), 121 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/hw_queue.c
>  create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
>  create mode 100644 drivers/misc/habanalabs/irq.c
> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index 2530c9b78ca4..c07f3ccb57dc 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -5,7 +5,7 @@
>  obj-m	:= habanalabs.o
>  
>  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> -		command_buffer.o
> +		command_buffer.o hw_queue.o irq.o
>  
>  include $(src)/goya/Makefile
>  habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 9fc7218a973c..98220628a467 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -170,13 +170,22 @@ static int device_early_init(struct hl_device *hdev)
>  	if (rc)
>  		goto early_fini;
>  
> +	hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
> +	if (hdev->cq_wq == NULL) {
> +		dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
> +		goto asid_fini;
> +	}
> +
>  	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
>  
>  	mutex_init(&hdev->device_open);
> +	mutex_init(&hdev->send_cpu_message_lock);
>  	atomic_set(&hdev->fd_open_cnt, 0);
>  
>  	return 0;
>  
> +asid_fini:
> +	hl_asid_fini(hdev);
>  early_fini:
>  	if (hdev->asic_funcs->early_fini)
>  		hdev->asic_funcs->early_fini(hdev);
> @@ -192,9 +201,12 @@ static int device_early_init(struct hl_device *hdev)
>   */
>  static void device_early_fini(struct hl_device *hdev)
>  {
> +	mutex_destroy(&hdev->send_cpu_message_lock);
>  
>  	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
>  
> +	destroy_workqueue(hdev->cq_wq);
> +
>  	hl_asid_fini(hdev);
>  
>  	if (hdev->asic_funcs->early_fini)
> @@ -273,7 +285,7 @@ int hl_device_resume(struct hl_device *hdev)
>   */
>  int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  {
> -	int rc;
> +	int i, rc, cq_ready_cnt;
>  
>  	/* Create device */
>  	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
> @@ -294,11 +306,48 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  	if (rc)
>  		goto early_fini;
>  
> +	/*
> +	 * Initialize the H/W queues. Must be done before hw_init, because
> +	 * there the addresses of the kernel queue are being written to the
> +	 * registers of the device
> +	 */
> +	rc = hl_hw_queues_create(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize kernel queues\n");
> +		goto sw_fini;
> +	}
> +
> +	/*
> +	 * Initialize the completion queues. Must be done before hw_init,
> +	 * because there the addresses of the completion queues are being
> +	 * passed as arguments to request_irq
> +	 */
> +	hdev->completion_queue =
> +			kcalloc(hdev->asic_prop.completion_queues_count,
> +				sizeof(*hdev->completion_queue), GFP_KERNEL);
> +
> +	if (!hdev->completion_queue) {
> +		dev_err(hdev->dev, "failed to allocate completion queues\n");
> +		rc = -ENOMEM;
> +		goto hw_queues_destroy;
> +	}
> +
> +	for (i = 0, cq_ready_cnt = 0;
> +			i < hdev->asic_prop.completion_queues_count;
> +			i++, cq_ready_cnt++) {
> +		rc = hl_cq_init(hdev, &hdev->completion_queue[i], i);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"failed to initialize completion queue\n");
> +			goto cq_fini;
> +		}
> +	}
> +
>  	/* Allocate the kernel context */
>  	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
>  	if (!hdev->kernel_ctx) {
>  		rc = -ENOMEM;
> -		goto sw_fini;
> +		goto cq_fini;
>  	}
>  
>  	hdev->user_ctx = NULL;
> @@ -324,6 +373,14 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  
>  	hdev->disabled = false;
>  
> +	/* Check that the communication with the device is working */
> +	rc = hdev->asic_funcs->test_queues(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to detect if device is alive\n");
> +		rc = 0;

Why is rc set to 0 here?
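
For comparison, a minimal user-space sketch of the same flow that keeps
the error code. It assumes the out_disabled label marks the device as
disabled and returns rc; that part is not visible in this hunk, and the
sketch_* names are hypothetical. If returning success is deliberate, for
example to keep the device node around for debugging, a comment saying
so would help.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for hdev->asic_funcs->test_queues(). */
static int sketch_test_queues(int device_alive)
{
	return device_alive ? 0 : -EIO;
}

/* Returns the result of the queue test instead of overwriting it with 0. */
static int sketch_device_init(int device_alive, int *disabled)
{
	int rc;

	assert(disabled);
	*disabled = 0;

	rc = sketch_test_queues(device_alive);
	if (rc)
		goto out_disabled;

	return 0;

out_disabled:
	*disabled = 1;
	return rc;
}
```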

> +		goto out_disabled;
> +	}
> +
>  	dev_notice(hdev->dev,
>  		"Successfully added device to habanalabs driver\n");
>  
> @@ -335,6 +392,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  			"kernel ctx is still alive on initialization failure\n");
>  free_ctx:
>  	kfree(hdev->kernel_ctx);
> +cq_fini:
> +	for (i = 0 ; i < cq_ready_cnt ; i++)
> +		hl_cq_fini(hdev, &hdev->completion_queue[i]);
> +	kfree(hdev->completion_queue);
> +hw_queues_destroy:
> +	hl_hw_queues_destroy(hdev);
>  sw_fini:
>  	hdev->asic_funcs->sw_fini(hdev);
>  early_fini:
> @@ -364,6 +427,7 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>   */
>  void hl_device_fini(struct hl_device *hdev)
>  {
> +	int i;
>  	dev_info(hdev->dev, "Removing device\n");
>  
>  	/* Mark device as disabled */
> @@ -378,6 +442,12 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Reset the H/W. It will be in idle state after this returns */
>  	hdev->asic_funcs->hw_fini(hdev, true);
>  
> +	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
> +		hl_cq_fini(hdev, &hdev->completion_queue[i]);
> +	kfree(hdev->completion_queue);
> +
> +	hl_hw_queues_destroy(hdev);
> +
>  	/* Call ASIC S/W finalize function */
>  	hdev->asic_funcs->sw_fini(hdev);
>  
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index f715e01838b3..08d5227eaf1d 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -98,6 +98,26 @@
>  static void goya_get_fixed_properties(struct hl_device *hdev)
>  {
>  	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	int i;
> +
> +	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
> +		prop->hw_queues_props[i].type = QUEUE_TYPE_EXT;
> +		prop->hw_queues_props[i].kmd_only = 0;
> +	}
> +
> +	for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES ; i++) {
> +		prop->hw_queues_props[i].type = QUEUE_TYPE_CPU;
> +		prop->hw_queues_props[i].kmd_only = 1;
> +	}
> +
> +	for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES +
> +			NUMBER_OF_INT_HW_QUEUES; i++) {
> +		prop->hw_queues_props[i].type = QUEUE_TYPE_INT;
> +		prop->hw_queues_props[i].kmd_only = 0;
> +	}
> +
> +	for (; i < HL_MAX_QUEUES; i++)
> +		prop->hw_queues_props[i].type = QUEUE_TYPE_NA;
>  
>  	prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
>  
> @@ -126,6 +146,18 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->high_pll = PLL_HIGH_DEFAULT;
>  }
>  
> +int goya_send_pci_access_msg(struct hl_device *hdev, u32 opcode)
> +{
> +	struct armcp_packet pkt;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = opcode;
> +
> +	return hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt,
> +			sizeof(pkt), HL_DEVICE_TIMEOUT_USEC, NULL);
> +}
> +
>  /**
>   * goya_pci_bars_map - Map PCI BARS of Goya device
>   *
> @@ -509,6 +541,8 @@ static int goya_sw_init(struct hl_device *hdev)
>  	if (!goya)
>  		return -ENOMEM;
>  
> +	goya->test_cpu_queue = goya_test_cpu_queue;
> +
>  	/* according to goya_init_iatu */
>  	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
>  	hdev->asic_specific = goya;
> @@ -595,6 +629,299 @@ int goya_sw_fini(struct hl_device *hdev)
>  	return 0;
>  }
>  
> +static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
> +		dma_addr_t bus_address)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 mtr_base_lo, mtr_base_hi;
> +	u32 so_base_lo, so_base_hi;
> +	u32 gic_base_lo, gic_base_hi;
> +	u32 reg_off = dma_id * (mmDMA_QM_1_PQ_PI - mmDMA_QM_0_PQ_PI);
> +
> +	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +
> +	gic_base_lo =
> +		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +	gic_base_hi =
> +		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +
> +	WREG32(mmDMA_QM_0_PQ_BASE_LO + reg_off, lower_32_bits(bus_address));
> +	WREG32(mmDMA_QM_0_PQ_BASE_HI + reg_off, upper_32_bits(bus_address));
> +
> +	WREG32(mmDMA_QM_0_PQ_SIZE + reg_off, ilog2(HL_QUEUE_LENGTH));
> +	WREG32(mmDMA_QM_0_PQ_PI + reg_off, 0);
> +	WREG32(mmDMA_QM_0_PQ_CI + reg_off, 0);
> +
> +	WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
> +	WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
> +	WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
> +	WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
> +	WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
> +	WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
> +	WREG32(mmDMA_QM_0_GLBL_ERR_WDATA + reg_off,
> +			GOYA_ASYNC_EVENT_ID_DMA0_QM + dma_id);
> +
> +	/* PQ has buffer of 2 cache lines, while CQ has 8 lines */
> +	WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
> +	WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
> +
> +	if (dma_id == 0)
> +		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
> +	else
> +		if (goya->hw_cap_initialized & HW_CAP_MMU)
> +			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> +					QMAN_DMA_PARTLY_TRUSTED);
> +		else
> +			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> +					QMAN_DMA_FULLY_TRUSTED);
> +
> +	WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
> +	WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
> +}
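
The register arithmetic in goya_init_dma_qman() can be sketched in plain
C as below. The register addresses are made up for illustration (the
real mmDMA_QM_* values are not shown in this hunk) and the sketch_*
helpers only stand in for the kernel's ilog2(), lower_32_bits() and
upper_32_bits().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register addresses, chosen only for the example. */
#define SKETCH_DMA_QM_0_PQ_PI	0x400000u
#define SKETCH_DMA_QM_1_PQ_PI	0x408000u
#define SKETCH_DMA_QM_0_PQ_SIZE	0x400008u

/* Address of a QMAN 0 register as seen by DMA QMAN number dma_id. */
static uint32_t sketch_dma_reg(uint32_t qm0_reg, unsigned int dma_id)
{
	uint32_t stride = SKETCH_DMA_QM_1_PQ_PI - SKETCH_DMA_QM_0_PQ_PI;

	return qm0_reg + dma_id * stride;
}

/* Floor of log2(n) for n > 0, standing in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint32_t n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/* Split a 64-bit bus address into the two 32-bit register values. */
static uint32_t sketch_lower_32(uint64_t v) { return (uint32_t)v; }
static uint32_t sketch_upper_32(uint64_t v) { return (uint32_t)(v >> 32); }
```

Since the PQ_SIZE register is written with the log2 of the queue length,
this scheme only describes the queue exactly when the length is a power
of two.
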
> +
> +static void goya_init_dma_ch(struct hl_device *hdev, int dma_id)
> +{
> +	u32 gic_base_lo, gic_base_hi;
> +	u64 sob_addr;
> +	u32 reg_off = dma_id * (mmDMA_CH_1_CFG1 - mmDMA_CH_0_CFG1);
> +
> +	gic_base_lo =
> +		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +	gic_base_hi =
> +		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +
> +	WREG32(mmDMA_CH_0_ERRMSG_ADDR_LO + reg_off, gic_base_lo);
> +	WREG32(mmDMA_CH_0_ERRMSG_ADDR_HI + reg_off, gic_base_hi);
> +	WREG32(mmDMA_CH_0_ERRMSG_WDATA + reg_off,
> +			GOYA_ASYNC_EVENT_ID_DMA0_CH + dma_id);
> +
> +	if (dma_id) {
> +		sob_addr = CFG_BASE + mmSYNC_MNGR_SOB_OBJ_1000 +
> +				(dma_id - 1) * 4;
> +		WREG32(mmDMA_CH_0_WR_COMP_ADDR_LO + reg_off,
> +				lower_32_bits(sob_addr));
> +		WREG32(mmDMA_CH_0_WR_COMP_ADDR_HI + reg_off,
> +				upper_32_bits(sob_addr));
> +		WREG32(mmDMA_CH_0_WR_COMP_WDATA + reg_off, 0x80000001);
> +	}
> +}
> +
> +/**
> + * goya_init_dma_qmans - Initialize QMAN DMA registers
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Initialize the H/W registers of the QMAN DMA channels
> + *
> + */
> +static void goya_init_dma_qmans(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	struct hl_hw_queue *q;
> +	dma_addr_t bus_address;
> +	int i;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_DMA)
> +		return;
> +
> +	q = &hdev->kernel_queues[0];
> +
> +	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++, q++) {
> +		bus_address = q->bus_address +
> +				hdev->asic_prop.host_phys_base_address;
> +
> +		goya_init_dma_qman(hdev, i, bus_address);
> +		goya_init_dma_ch(hdev, i);
> +	}
> +
> +	goya->hw_cap_initialized |= HW_CAP_DMA;
> +}
> +
> +/**
> + * goya_disable_external_queues - Disable external queues
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static void goya_disable_external_queues(struct hl_device *hdev)
> +{
> +	WREG32(mmDMA_QM_0_GLBL_CFG0, 0);
> +	WREG32(mmDMA_QM_1_GLBL_CFG0, 0);
> +	WREG32(mmDMA_QM_2_GLBL_CFG0, 0);
> +	WREG32(mmDMA_QM_3_GLBL_CFG0, 0);
> +	WREG32(mmDMA_QM_4_GLBL_CFG0, 0);
> +}
> +
> +static int goya_stop_queue(struct hl_device *hdev, u32 cfg_reg,
> +				u32 cp_sts_reg, u32 glbl_sts0_reg)
> +{
> +	int rc;
> +	u32 status;
> +
> +	/* use the values of TPC0 as they are all the same */
> +
> +	WREG32(cfg_reg, 1 << TPC0_QM_GLBL_CFG1_CP_STOP_SHIFT);
> +
> +	status = RREG32(cp_sts_reg);
> +	if (status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK) {
> +		rc = hl_poll_timeout(
> +			hdev,
> +			cp_sts_reg,
> +			status,
> +			!(status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK),
> +			1000,
> +			QMAN_FENCE_TIMEOUT_USEC);
> +
> +		/* if QMAN is stuck in fence no need to check for stop */
> +		if (rc)
> +			return 0;

Isn't this an error? The fence poll timed out, yet the function returns 0.
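
A minimal user-space sketch of the alternative, where the fence timeout
is propagated to the caller. The register model and all sketch_* names
are hypothetical, and -ETIMEDOUT is only one possible choice of error
code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_FENCE_IN_PROGRESS 0x1u

/* Fake register: the fence bit clears after polls_until_clear reads. */
struct sketch_reg { int polls_until_clear; };

static uint32_t sketch_read(struct sketch_reg *r)
{
	if (r->polls_until_clear > 0) {
		r->polls_until_clear--;
		return SKETCH_FENCE_IN_PROGRESS;
	}
	return 0;
}

/* Poll up to max_polls times; 0 once the fence clears, else -ETIMEDOUT. */
static int sketch_wait_fence(struct sketch_reg *r, int max_polls)
{
	int i;

	assert(r && max_polls > 0);

	for (i = 0; i < max_polls; i++)
		if (!(sketch_read(r) & SKETCH_FENCE_IN_PROGRESS))
			return 0;

	return -ETIMEDOUT;
}

/* Propagate the fence timeout instead of reporting success. */
static int sketch_stop_queue(struct sketch_reg *r, int max_polls)
{
	int rc = sketch_wait_fence(r, max_polls);

	if (rc)
		return rc;	/* the patch returns 0 at this point */

	return 0;
}
```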

> +	}
> +
> +	rc = hl_poll_timeout(
> +		hdev,
> +		glbl_sts0_reg,
> +		status,
> +		(status & TPC0_QM_GLBL_STS0_CP_IS_STOP_MASK),
> +		1000,
> +		QMAN_STOP_TIMEOUT_USEC);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Timeout while waiting for QMAN to stop\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * goya_stop_external_queues - Stop external queues
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Returns 0 on success
> + *
> + */
> +static int goya_stop_external_queues(struct hl_device *hdev)
> +{
> +	int rc = goya_stop_queue(hdev,
> +			mmDMA_QM_0_GLBL_CFG1,
> +			mmDMA_QM_0_CP_STS,
> +			mmDMA_QM_0_GLBL_STS0);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "failed to stop DMA QMAN 0\n");
> +
> +	rc = goya_stop_queue(hdev,
> +			mmDMA_QM_1_GLBL_CFG1,
> +			mmDMA_QM_1_CP_STS,
> +			mmDMA_QM_1_GLBL_STS0);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "failed to stop DMA QMAN 1\n");
> +
> +	rc = goya_stop_queue(hdev,
> +			mmDMA_QM_2_GLBL_CFG1,
> +			mmDMA_QM_2_CP_STS,
> +			mmDMA_QM_2_GLBL_STS0);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "failed to stop DMA QMAN 2\n");
> +
> +	rc = goya_stop_queue(hdev,
> +			mmDMA_QM_3_GLBL_CFG1,
> +			mmDMA_QM_3_CP_STS,
> +			mmDMA_QM_3_GLBL_STS0);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "failed to stop DMA QMAN 3\n");
> +
> +	rc = goya_stop_queue(hdev,
> +			mmDMA_QM_4_GLBL_CFG1,
> +			mmDMA_QM_4_CP_STS,
> +			mmDMA_QM_4_GLBL_STS0);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "failed to stop DMA QMAN 4\n");
> +
> +	return rc;
> +}
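
As written, rc is overwritten by every goya_stop_queue() call, so the
function returns only the result for DMA QMAN 4. If the intent is to
report any failure, a loop that remembers the first error is one option.
A minimal user-space sketch, with hypothetical sketch_* names and a
bitmask standing in for the hardware state:

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_DMA_QMANS 5

/* Stand-in for goya_stop_queue(): fails for QMANs set in fail_mask. */
static int sketch_stop_one(unsigned int fail_mask, int id)
{
	assert(id >= 0 && id < SKETCH_DMA_QMANS);

	return (fail_mask & (1u << id)) ? -EINVAL : 0;
}

/* Try to stop every QMAN and return the first error seen (0 if none). */
static int sketch_stop_all(unsigned int fail_mask)
{
	int i, rc, ret = 0;

	for (i = 0; i < SKETCH_DMA_QMANS; i++) {
		rc = sketch_stop_one(fail_mask, i);
		if (rc && !ret)
			ret = rc;
	}
	return ret;
}
```

With this shape a failure on QMAN 0 is still reported even when QMANs 1
to 4 stop cleanly.
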
> +
> +static void goya_resume_external_queues(struct hl_device *hdev)
> +{
> +	WREG32(mmDMA_QM_0_GLBL_CFG1, 0);
> +	WREG32(mmDMA_QM_1_GLBL_CFG1, 0);
> +	WREG32(mmDMA_QM_2_GLBL_CFG1, 0);
> +	WREG32(mmDMA_QM_3_GLBL_CFG1, 0);
> +	WREG32(mmDMA_QM_4_GLBL_CFG1, 0);
> +}
> +
> +/**
> + * goya_init_cpu_queues - Initialize PQ/CQ/EQ of CPU
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Returns 0 on success
> + *
> + */
> +int goya_init_cpu_queues(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	dma_addr_t bus_address;
> +	u32 status;
> +	struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
> +	int err;
> +
> +	if (!hdev->cpu_queues_enable)
> +		return 0;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
> +		return 0;
> +
> +	bus_address = cpu_pq->bus_address +
> +			hdev->asic_prop.host_phys_base_address;
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
> +
> +	bus_address = hdev->cpu_accessible_dma_address +
> +			hdev->asic_prop.host_phys_base_address;
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
> +
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
> +
> +	/* Used for EQ CI */
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, 0);
> +
> +	WREG32(mmCPU_IF_PF_PQ_PI, 0);
> +
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_7, PQ_INIT_STATUS_READY_FOR_CP);
> +
> +	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> +			GOYA_ASYNC_EVENT_ID_PI_UPDATE);
> +
> +	err = hl_poll_timeout(
> +		hdev,
> +		mmPSOC_GLOBAL_CONF_SCRATCHPAD_7,
> +		status,
> +		(status == PQ_INIT_STATUS_READY_FOR_HOST),
> +		1000,
> +		GOYA_CPU_TIMEOUT_USEC);
> +
> +	if (err) {
> +		dev_err(hdev->dev,
> +			"Failed to communicate with ARM CPU (ArmCP timeout)\n");
> +		return -EIO;
> +	}
> +
> +	goya->hw_cap_initialized |= HW_CAP_CPU_Q;
> +	return 0;
> +}
> +
>  /**
>   * goya_init_pll - Initialize pll registers
>   *
> @@ -1960,152 +2287,646 @@ static void goya_init_golden_registers(struct hl_device *hdev)
>  	goya->hw_cap_initialized |= HW_CAP_GOLDEN;
>  }
>  
> -
> -/**
> - * goya_push_uboot_to_device - Push u-boot FW code to device
> - *
> - * @hdev: pointer to hl_device structure
> - *
> - * Copy u-boot fw code from firmware file to SRAM BAR.
> - * Returns 0 on success
> - *
> - */
> -static int goya_push_uboot_to_device(struct hl_device *hdev)
> +static void goya_init_mme_qman(struct hl_device *hdev)
>  {
> -	char fw_name[200];
> -	const u64 *fw_data;
> -	void __iomem *dst;
> -	size_t fw_size, i;
> -	int rc;
> +	u32 mtr_base_lo, mtr_base_hi;
> +	u32 so_base_lo, so_base_hi;
> +	u32 gic_base_lo, gic_base_hi;
> +	u64 qman_base_addr;
>  
> -	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
> +	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
>  
> -	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> +	gic_base_lo =
> +		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +	gic_base_hi =
> +		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
>  
> -	if (rc) {
> -		dev_err(hdev->dev, "Failed to request u-boot fw image\n");
> -		goto out;
> -	}
> +	qman_base_addr = hdev->asic_prop.sram_base_address +
> +				MME_QMAN_BASE_OFFSET;
>  
> -	fw_size = hdev->spl_fw->size;
> -	if ((fw_size % 4) != 0) {
> -		dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
> -			fw_size);
> -		rc = -EINVAL;
> -		goto out;
> -	}
> +	WREG32(mmMME_QM_PQ_BASE_LO, lower_32_bits(qman_base_addr));
> +	WREG32(mmMME_QM_PQ_BASE_HI, upper_32_bits(qman_base_addr));
> +	WREG32(mmMME_QM_PQ_SIZE, ilog2(MME_QMAN_LENGTH));
> +	WREG32(mmMME_QM_PQ_PI, 0);
> +	WREG32(mmMME_QM_PQ_CI, 0);
> +	WREG32(mmMME_QM_CP_LDMA_SRC_BASE_LO_OFFSET, 0x10C0);
> +	WREG32(mmMME_QM_CP_LDMA_SRC_BASE_HI_OFFSET, 0x10C4);
> +	WREG32(mmMME_QM_CP_LDMA_TSIZE_OFFSET, 0x10C8);
> +	WREG32(mmMME_QM_CP_LDMA_COMMIT_OFFSET, 0x10CC);
>  
> -	dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
> +	WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
> +	WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
> +	WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_LO, so_base_lo);
> +	WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_HI, so_base_hi);
>  
> -	fw_data = (const u64 *) hdev->spl_fw->data;
> -	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
> +	/* QMAN CQ has 8 cache lines */
> +	WREG32(mmMME_QM_CQ_CFG1, 0x00080008);
>  
> -	if ((hdev->spl_fw->size % 8) != 0)
> -		fw_size -= 8;
> +	WREG32(mmMME_QM_GLBL_ERR_ADDR_LO, gic_base_lo);
> +	WREG32(mmMME_QM_GLBL_ERR_ADDR_HI, gic_base_hi);
>  
> -	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> -		if (!(i & (0x80000 - 1)))
> -			dev_dbg(hdev->dev,
> -				"u-boot copied so far %lu out of %lu",
> -				i, fw_size);
> +	WREG32(mmMME_QM_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_QM);
>  
> -		writeq(*fw_data, dst);
> -	}
> +	WREG32(mmMME_QM_GLBL_ERR_CFG, QMAN_MME_ERR_MSG_EN);
>  
> -	if ((hdev->spl_fw->size % 8) != 0)
> -		writel(*(const u32 *) fw_data, dst);
> +	WREG32(mmMME_QM_GLBL_PROT, QMAN_MME_ERR_PROT);
>  
> -out:
> -	release_firmware(hdev->spl_fw);
> -	return rc;
> +	WREG32(mmMME_QM_GLBL_CFG0, QMAN_MME_ENABLE);
>  }
>  
> -/**
> - * goya_push_linux_to_device - Push LINUX FW code to device
> - *
> - * @hdev: pointer to hl_device structure
> - *
> - * Copy LINXU fw code from firmware file to DDR BAR.
> - * Returns 0 on success
> - *
> - */
> -static int goya_push_linux_to_device(struct hl_device *hdev)
> +static void goya_init_mme_cmdq(struct hl_device *hdev)
>  {
> -	char fw_name[200];
> -	const u64 *fw_data;
> -	void __iomem *dst;
> -	size_t fw_size, i;
> -	int rc;
> +	u32 mtr_base_lo, mtr_base_hi;
> +	u32 so_base_lo, so_base_hi;
> +	u32 gic_base_lo, gic_base_hi;
> +	u64 qman_base_addr;
>  
> -	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
> +	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
>  
> -	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> +	gic_base_lo =
> +		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +	gic_base_hi =
> +		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
>  
> -	if (rc) {
> -		dev_err(hdev->dev, "Failed to request Linux fw image\n");
> -		goto out;
> -	}
> +	qman_base_addr = hdev->asic_prop.sram_base_address +
> +				MME_QMAN_BASE_OFFSET;
>  
> -	fw_size = hdev->spl_fw->size;
> -	if ((fw_size % 4) != 0) {
> -		dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
> -			fw_size);
> -		rc = -EINVAL;
> -		goto out;
> -	}
> +	WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
> +	WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
> +	WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_LO,	so_base_lo);
> +	WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_HI, so_base_hi);
>  
> -	dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
> +	/* CMDQ CQ has 20 cache lines */
> +	WREG32(mmMME_CMDQ_CQ_CFG1, 0x00140014);
>  
> -	fw_data = (const u64 *) hdev->spl_fw->data;
> -	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
> +	WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_LO, gic_base_lo);
> +	WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_HI, gic_base_hi);
>  
> -	if ((hdev->spl_fw->size % 8) != 0)
> -		fw_size -= 8;
> +	WREG32(mmMME_CMDQ_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_CMDQ);
>  
> -	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> -		if (!(i & (0x80000 - 1))) {
> -			dev_dbg(hdev->dev,
> -				"Linux copied so far %lu out of %lu",
> -				i, fw_size);
> -			usleep_range(20, 100);
> -		}
> -		writeq(*fw_data, dst);
> -	}
> +	WREG32(mmMME_CMDQ_GLBL_ERR_CFG, CMDQ_MME_ERR_MSG_EN);
>  
> -	if ((hdev->spl_fw->size % 8) != 0)
> -		writel(*(const u32 *) fw_data, dst);
> +	WREG32(mmMME_CMDQ_GLBL_PROT, CMDQ_MME_ERR_PROT);
>  
> -out:
> -	release_firmware(hdev->spl_fw);
> -	return rc;
> +	WREG32(mmMME_CMDQ_GLBL_CFG0, CMDQ_MME_ENABLE);
>  }
>  
> -static int goya_pldm_init_cpu(struct hl_device *hdev)
> +static void goya_init_mme_qmans(struct hl_device *hdev)
>  {
> -	u32 val, unit_rst_val;
> -	int rc;
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 so_base_lo, so_base_hi;
>  
> -	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
> -	goya_init_golden_registers(hdev);
> +	if (goya->hw_cap_initialized & HW_CAP_MME)
> +		return;
>  
> -	/* Put ARM cores into reset */
> -	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
> -	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
>  
> -	/* Reset the CA53 MACRO */
> -	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> -	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
> -	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> -	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
> -	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +	WREG32(mmMME_SM_BASE_ADDRESS_LOW, so_base_lo);
> +	WREG32(mmMME_SM_BASE_ADDRESS_HIGH, so_base_hi);
>  
> -	rc = goya_push_uboot_to_device(hdev);
> -	if (rc)
> -		return rc;
> +	goya_init_mme_qman(hdev);
> +	goya_init_mme_cmdq(hdev);
>  
> -	rc = goya_push_linux_to_device(hdev);
> -	if (rc)
> -		return rc;
> +	goya->hw_cap_initialized |= HW_CAP_MME;
> +}
> +
> +static void goya_init_tpc_qman(struct hl_device *hdev, u32 base_off, int tpc_id)
> +{
> +	u32 mtr_base_lo, mtr_base_hi;
> +	u32 so_base_lo, so_base_hi;
> +	u32 gic_base_lo, gic_base_hi;
> +	u64 qman_base_addr;
> +	u32 reg_off = tpc_id * (mmTPC1_QM_PQ_PI - mmTPC0_QM_PQ_PI);
> +
> +	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +
> +	gic_base_lo =
> +		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +	gic_base_hi =
> +		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +
> +	qman_base_addr = hdev->asic_prop.sram_base_address + base_off;
> +
> +	WREG32(mmTPC0_QM_PQ_BASE_LO + reg_off, lower_32_bits(qman_base_addr));
> +	WREG32(mmTPC0_QM_PQ_BASE_HI + reg_off, upper_32_bits(qman_base_addr));
> +	WREG32(mmTPC0_QM_PQ_SIZE + reg_off, ilog2(TPC_QMAN_LENGTH));
> +	WREG32(mmTPC0_QM_PQ_PI + reg_off, 0);
> +	WREG32(mmTPC0_QM_PQ_CI + reg_off, 0);
> +	WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_LO_OFFSET + reg_off, 0x10C0);
> +	WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_HI_OFFSET + reg_off, 0x10C4);
> +	WREG32(mmTPC0_QM_CP_LDMA_TSIZE_OFFSET + reg_off, 0x10C8);
> +	WREG32(mmTPC0_QM_CP_LDMA_COMMIT_OFFSET + reg_off, 0x10CC);
> +
> +	WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
> +	WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
> +	WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
> +	WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
> +
> +	WREG32(mmTPC0_QM_CQ_CFG1 + reg_off, 0x00080008);
> +
> +	WREG32(mmTPC0_QM_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
> +	WREG32(mmTPC0_QM_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
> +
> +	WREG32(mmTPC0_QM_GLBL_ERR_WDATA + reg_off,
> +			GOYA_ASYNC_EVENT_ID_TPC0_QM + tpc_id);
> +
> +	WREG32(mmTPC0_QM_GLBL_ERR_CFG + reg_off, QMAN_TPC_ERR_MSG_EN);
> +
> +	WREG32(mmTPC0_QM_GLBL_PROT + reg_off, QMAN_TPC_ERR_PROT);
> +
> +	WREG32(mmTPC0_QM_GLBL_CFG0 + reg_off, QMAN_TPC_ENABLE);
> +}
> +
> +static void goya_init_tpc_cmdq(struct hl_device *hdev, int tpc_id)
> +{
> +	u32 mtr_base_lo, mtr_base_hi;
> +	u32 so_base_lo, so_base_hi;
> +	u32 gic_base_lo, gic_base_hi;
> +	u32 reg_off = tpc_id * (mmTPC1_CMDQ_CQ_CFG1 - mmTPC0_CMDQ_CQ_CFG1);
> +
> +	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +
> +	gic_base_lo =
> +		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +	gic_base_hi =
> +		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> +
> +	WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
> +	WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
> +	WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
> +	WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
> +
> +	WREG32(mmTPC0_CMDQ_CQ_CFG1 + reg_off, 0x00140014);
> +
> +	WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
> +	WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
> +
> +	WREG32(mmTPC0_CMDQ_GLBL_ERR_WDATA + reg_off,
> +			GOYA_ASYNC_EVENT_ID_TPC0_CMDQ + tpc_id);
> +
> +	WREG32(mmTPC0_CMDQ_GLBL_ERR_CFG + reg_off, CMDQ_TPC_ERR_MSG_EN);
> +
> +	WREG32(mmTPC0_CMDQ_GLBL_PROT + reg_off, CMDQ_TPC_ERR_PROT);
> +
> +	WREG32(mmTPC0_CMDQ_GLBL_CFG0 + reg_off, CMDQ_TPC_ENABLE);
> +}
> +
> +static void goya_init_tpc_qmans(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 so_base_lo, so_base_hi;
> +	u32 cfg_off = mmTPC1_CFG_SM_BASE_ADDRESS_LOW -
> +			mmTPC0_CFG_SM_BASE_ADDRESS_LOW;
> +	int i;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_TPC)
> +		return;
> +
> +	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> +
> +	for (i = 0 ; i < TPC_MAX_NUM ; i++) {
> +		WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_LOW + i * cfg_off,
> +				so_base_lo);
> +		WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_HIGH + i * cfg_off,
> +				so_base_hi);
> +	}
> +
> +	goya_init_tpc_qman(hdev, TPC0_QMAN_BASE_OFFSET, 0);
> +	goya_init_tpc_qman(hdev, TPC1_QMAN_BASE_OFFSET, 1);
> +	goya_init_tpc_qman(hdev, TPC2_QMAN_BASE_OFFSET, 2);
> +	goya_init_tpc_qman(hdev, TPC3_QMAN_BASE_OFFSET, 3);
> +	goya_init_tpc_qman(hdev, TPC4_QMAN_BASE_OFFSET, 4);
> +	goya_init_tpc_qman(hdev, TPC5_QMAN_BASE_OFFSET, 5);
> +	goya_init_tpc_qman(hdev, TPC6_QMAN_BASE_OFFSET, 6);
> +	goya_init_tpc_qman(hdev, TPC7_QMAN_BASE_OFFSET, 7);
> +
> +	for (i = 0 ; i < TPC_MAX_NUM ; i++)
> +		goya_init_tpc_cmdq(hdev, i);
> +
> +	goya->hw_cap_initialized |= HW_CAP_TPC;
> +}
> +
> +/**
> + * goya_disable_internal_queues - Disable internal queues
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static void goya_disable_internal_queues(struct hl_device *hdev)
> +{
> +	WREG32(mmMME_QM_GLBL_CFG0, 0);
> +	WREG32(mmMME_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC0_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC0_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC1_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC1_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC2_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC2_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC3_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC3_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC4_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC4_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC5_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC5_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC6_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC6_CMDQ_GLBL_CFG0, 0);
> +
> +	WREG32(mmTPC7_QM_GLBL_CFG0, 0);
> +	WREG32(mmTPC7_CMDQ_GLBL_CFG0, 0);
> +}
> +
> +/**
> + * goya_stop_internal_queues - Stop internal queues
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Returns 0 on success
> + *
> + */
> +static int goya_stop_internal_queues(struct hl_device *hdev)
> +{
> +	int rc, retval = 0;
> +
> +	rc = goya_stop_queue(hdev,
> +			mmMME_QM_GLBL_CFG1,
> +			mmMME_QM_CP_STS,
> +			mmMME_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop MME QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmMME_CMDQ_GLBL_CFG1,
> +			mmMME_CMDQ_CP_STS,
> +			mmMME_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop MME CMDQ\n");
> +		retval = -EIO;
> +	}

If I understand correctly, the queues can and should be stopped independently,
so a failure to stop one of them must not prevent stopping the others.
If that's the case, a comment explaining it would be nice.
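
A table-driven loop is one way to make that independence explicit. This is
only a minimal userspace sketch: struct queue_regs, stop_all_queues() and the
stub stop_queue() are hypothetical names for illustration, not part of the
patch. It tries every queue even after a failure and returns the accumulated
status rather than the status of the last call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical descriptor for one queue's stop/status registers. */
struct queue_regs {
	const char *name;
	unsigned int cfg1, cp_sts, glbl_sts0;
};

/* Stub standing in for goya_stop_queue(); fails when cfg1 is odd. */
static int stop_queue(const struct queue_regs *q)
{
	return (q->cfg1 & 1) ? -EIO : 0;
}

/*
 * Queues are independent: try to stop every one of them even if an
 * earlier one failed, and report -EIO if any stop failed.
 */
static int stop_all_queues(const struct queue_regs *q, size_t n,
			   size_t *attempted)
{
	int retval = 0;
	size_t i;

	assert(q && attempted);

	for (i = 0; i < n; i++) {
		if (stop_queue(&q[i])) {
			fprintf(stderr, "failed to stop %s\n", q[i].name);
			retval = -EIO;
		}
	}

	*attempted = i;
	return retval;
}
```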

> +	rc = goya_stop_queue(hdev,
> +			mmTPC0_QM_GLBL_CFG1,
> +			mmTPC0_QM_CP_STS,
> +			mmTPC0_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 0 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC0_CMDQ_GLBL_CFG1,
> +			mmTPC0_CMDQ_CP_STS,
> +			mmTPC0_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 0 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC1_QM_GLBL_CFG1,
> +			mmTPC1_QM_CP_STS,
> +			mmTPC1_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 1 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC1_CMDQ_GLBL_CFG1,
> +			mmTPC1_CMDQ_CP_STS,
> +			mmTPC1_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 1 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC2_QM_GLBL_CFG1,
> +			mmTPC2_QM_CP_STS,
> +			mmTPC2_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 2 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC2_CMDQ_GLBL_CFG1,
> +			mmTPC2_CMDQ_CP_STS,
> +			mmTPC2_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 2 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC3_QM_GLBL_CFG1,
> +			mmTPC3_QM_CP_STS,
> +			mmTPC3_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 3 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC3_CMDQ_GLBL_CFG1,
> +			mmTPC3_CMDQ_CP_STS,
> +			mmTPC3_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 3 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC4_QM_GLBL_CFG1,
> +			mmTPC4_QM_CP_STS,
> +			mmTPC4_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 4 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC4_CMDQ_GLBL_CFG1,
> +			mmTPC4_CMDQ_CP_STS,
> +			mmTPC4_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 4 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC5_QM_GLBL_CFG1,
> +			mmTPC5_QM_CP_STS,
> +			mmTPC5_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 5 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC5_CMDQ_GLBL_CFG1,
> +			mmTPC5_CMDQ_CP_STS,
> +			mmTPC5_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 5 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC6_QM_GLBL_CFG1,
> +			mmTPC6_QM_CP_STS,
> +			mmTPC6_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 6 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC6_CMDQ_GLBL_CFG1,
> +			mmTPC6_CMDQ_CP_STS,
> +			mmTPC6_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 6 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC7_QM_GLBL_CFG1,
> +			mmTPC7_QM_CP_STS,
> +			mmTPC7_QM_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 7 QMAN\n");
> +		retval = -EIO;
> +	}
> +
> +	rc = goya_stop_queue(hdev,
> +			mmTPC7_CMDQ_GLBL_CFG1,
> +			mmTPC7_CMDQ_CP_STS,
> +			mmTPC7_CMDQ_GLBL_STS0);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop TPC 7 CMDQ\n");
> +		retval = -EIO;
> +	}
> +
> +	return rc;
> +}
> +
> +static void goya_resume_internal_queues(struct hl_device *hdev)
> +{
> +	WREG32(mmMME_QM_GLBL_CFG1, 0);
> +	WREG32(mmMME_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC0_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC0_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC1_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC1_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC2_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC2_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC3_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC3_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC4_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC4_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC5_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC5_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC6_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC6_CMDQ_GLBL_CFG1, 0);
> +
> +	WREG32(mmTPC7_QM_GLBL_CFG1, 0);
> +	WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
> +}
> +
> +
> +/**
> + * goya_push_uboot_to_device - Push u-boot FW code to device
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Copy u-boot fw code from firmware file to SRAM BAR.
> + * Returns 0 on success
> + *
> + */
> +static int goya_push_uboot_to_device(struct hl_device *hdev)
> +{
> +	char fw_name[200];
> +	const u64 *fw_data;
> +	void __iomem *dst;
> +	size_t fw_size, i;
> +	int rc;
> +
> +	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
> +
> +	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to request u-boot fw image\n");
> +		goto out;
> +	}
> +
> +	fw_size = hdev->spl_fw->size;
> +	if ((fw_size % 4) != 0) {
> +		dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
> +			fw_size);
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
> +
> +	fw_data = (const u64 *) hdev->spl_fw->data;
> +	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		fw_size -= 8;
> +
> +	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> +		if (!(i & (0x80000 - 1)))
> +			dev_dbg(hdev->dev,
> +				"u-boot copied so far %lu out of %lu",
> +				i, fw_size);
> +
> +		writeq(*fw_data, dst);
> +	}
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		writel(*(const u32 *) fw_data, dst);
> +
> +out:
> +	release_firmware(hdev->spl_fw);
> +	return rc;
> +}
> +
> +/**
> + * goya_push_linux_to_device - Push LINUX FW code to device
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Copy LINXU fw code from firmware file to DDR BAR.
> + * Returns 0 on success
> + *
> + */
> +static int goya_push_linux_to_device(struct hl_device *hdev)
> +{
> +	char fw_name[200];
> +	const u64 *fw_data;
> +	void __iomem *dst;
> +	size_t fw_size, i;
> +	int rc;
> +
> +	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
> +
> +	rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to request Linux fw image\n");
> +		goto out;
> +	}
> +
> +	fw_size = hdev->spl_fw->size;
> +	if ((fw_size % 4) != 0) {
> +		dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
> +			fw_size);
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
> +
> +	fw_data = (const u64 *) hdev->spl_fw->data;
> +	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		fw_size -= 8;
> +
> +	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> +		if (!(i & (0x80000 - 1))) {
> +			dev_dbg(hdev->dev,
> +				"Linux copied so far %lu out of %lu",
> +				i, fw_size);
> +			usleep_range(20, 100);
> +		}
> +		writeq(*fw_data, dst);
> +	}
> +
> +	if ((hdev->spl_fw->size % 8) != 0)
> +		writel(*(const u32 *) fw_data, dst);
> +
> +out:
> +	release_firmware(hdev->spl_fw);
> +	return rc;
> +}
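
Both push functions share the same copy pattern: 8-byte writes for the bulk of
the image, plus one 4-byte write when the size is a multiple of 4 but not of 8.
Below is a sketch of that pattern in plain C, assuming memcpy() as a stand-in
for writeq()/writel(); copy_fw() is a hypothetical name. Rounding the bulk
length down with a mask also covers a 4-byte image, where "fw_size -= 8" on a
size_t would wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 'size' bytes (a multiple of 4) in 8-byte chunks plus an optional
 * 4-byte tail. memcpy() stands in for writeq()/writel() in this sketch.
 * Returns 0 on success, -1 if the size is not a multiple of 4.
 */
static int copy_fw(uint8_t *dst, const uint8_t *src, size_t size)
{
	size_t bulk = size & ~(size_t)7; /* rounds down, cannot wrap */
	size_t i;

	assert(dst && src);

	if (size % 4)
		return -1;

	for (i = 0; i < bulk; i += 8)
		memcpy(dst + i, src + i, 8);

	/* size is 8k + 4 here, so exactly one 4-byte word remains */
	if (size != bulk)
		memcpy(dst + bulk, src + bulk, 4);

	return 0;
}
```

A shared helper along these lines could serve both the u-boot and the Linux
image, with the destination BAR and offset passed in by the caller.
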
> +
> +static int goya_pldm_init_cpu(struct hl_device *hdev)
> +{
> +	u32 val, unit_rst_val;
> +	int rc;
> +
> +	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
> +	goya_init_golden_registers(hdev);
> +
> +	/* Put ARM cores into reset */
> +	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
> +	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> +
> +	/* Reset the CA53 MACRO */
> +	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
> +	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
> +	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> +
> +	rc = goya_push_uboot_to_device(hdev);
> +	if (rc)
> +		return rc;
> +
> +	rc = goya_push_linux_to_device(hdev);
> +	if (rc)
> +		return rc;
>  
>  	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
>  	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
> @@ -2339,6 +3160,19 @@ static int goya_hw_init(struct hl_device *hdev)
>  
>  	goya_init_security(hdev);
>  
> +	goya_init_dma_qmans(hdev);
> +
> +	goya_init_mme_qmans(hdev);
> +
> +	goya_init_tpc_qmans(hdev);
> +
> +	rc = goya_init_cpu_queues(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
> +			rc);
> +		goto disable_queues;
> +	}
> +
>  	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
>  	rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
>  	if (rc) {
> @@ -2347,7 +3181,7 @@ static int goya_hw_init(struct hl_device *hdev)
>  		if (rc) {
>  			dev_err(hdev->dev,
>  				"Unable to set pci dma mask to 32 bits\n");
> -			return rc;
> +			goto disable_pci_access;
>  		}
>  	}
>  
> @@ -2359,7 +3193,7 @@ static int goya_hw_init(struct hl_device *hdev)
>  		if (rc) {
>  			dev_err(hdev->dev,
>  				"Unable to set pci consistent dma mask to 32 bits\n");
> -			return rc;
> +			goto disable_pci_access;
>  		}
>  	}
>  
> @@ -2367,6 +3201,14 @@ static int goya_hw_init(struct hl_device *hdev)
>  	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
>  
>  	return 0;
> +
> +disable_pci_access:
> +	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> +disable_queues:
> +	goya_disable_internal_queues(hdev);
> +	goya_disable_external_queues(hdev);
> +
> +	return rc;
>  }
>  
>  /**
> @@ -2473,12 +3315,40 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
>  
>  int goya_suspend(struct hl_device *hdev)
>  {
> -	return 0;
> +	int rc;
> +
> +	rc = goya_stop_internal_queues(hdev);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop internal queues\n");
> +		return rc;
> +	}
> +
> +	rc = goya_stop_external_queues(hdev);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to stop external queues\n");
> +		return rc;
> +	}
> +
> +	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> +	if (rc)
> +		dev_err(hdev->dev, "Failed to disable PCI access from CPU\n");
> +
> +	return rc;
>  }
>  
>  int goya_resume(struct hl_device *hdev)
>  {
> -	return 0;
> +	int rc;
> +
> +	goya_resume_external_queues(hdev);
> +	goya_resume_internal_queues(hdev);
> +
> +	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
> +	if (rc)
> +		dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
> +	return rc;
>  }
>  
>  int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
> @@ -2502,6 +3372,104 @@ int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
>  	return rc;
>  }
>  
> +void goya_ring_doorbell(struct hl_device *hdev, u32 hw_queue_id, u32 pi)
> +{
> +	u32 db_reg_offset, db_value;
> +	bool invalid_queue = false;
> +
> +	switch (hw_queue_id) {
> +	case GOYA_QUEUE_ID_DMA_0:
> +		db_reg_offset = mmDMA_QM_0_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_DMA_1:
> +		db_reg_offset = mmDMA_QM_1_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_DMA_2:
> +		db_reg_offset = mmDMA_QM_2_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_DMA_3:
> +		db_reg_offset = mmDMA_QM_3_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_DMA_4:
> +		db_reg_offset = mmDMA_QM_4_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_CPU_PQ:
> +		if (hdev->cpu_queues_enable)
> +			db_reg_offset = mmCPU_IF_PF_PQ_PI;
> +		else
> +			invalid_queue = true;
> +		break;
> +
> +	case GOYA_QUEUE_ID_MME:
> +		db_reg_offset = mmMME_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC0:
> +		db_reg_offset = mmTPC0_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC1:
> +		db_reg_offset = mmTPC1_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC2:
> +		db_reg_offset = mmTPC2_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC3:
> +		db_reg_offset = mmTPC3_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC4:
> +		db_reg_offset = mmTPC4_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC5:
> +		db_reg_offset = mmTPC5_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC6:
> +		db_reg_offset = mmTPC6_QM_PQ_PI;
> +		break;
> +
> +	case GOYA_QUEUE_ID_TPC7:
> +		db_reg_offset = mmTPC7_QM_PQ_PI;
> +		break;
> +
> +	default:
> +		invalid_queue = true;
> +	}
> +
> +	if (invalid_queue) {
> +		/* Should never get here */
> +		dev_err(hdev->dev, "h/w queue %d is invalid. Can't set pi\n",
> +			hw_queue_id);
> +		return;
> +	}
> +
> +	db_value = pi;
> +
> +	if (hdev->ifh)
> +		return;
> +
> +	/* ring the doorbell */
> +	WREG32(db_reg_offset, db_value);
> +
> +	if (hw_queue_id == GOYA_QUEUE_ID_CPU_PQ)
> +		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> +				GOYA_ASYNC_EVENT_ID_PI_UPDATE);
> +}
> +
> +void goya_flush_pq_write(struct hl_device *hdev, u64 *pq, u64 exp_val)
> +{
> +	/* Not needed in Goya */
> +}
> +
>  void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
>  					dma_addr_t *dma_handle, gfp_t flags)
>  {
> @@ -2514,6 +3482,311 @@ void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
>  	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
>  }
>  
> +void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
> +				dma_addr_t *dma_handle,	u16 *queue_len)
> +{
> +	void *base;
> +	u32 offset;
> +
> +	*dma_handle = hdev->asic_prop.sram_base_address;
> +
> +	base = hdev->pcie_bar[SRAM_CFG_BAR_ID];
> +
> +	switch (queue_id) {
> +	case GOYA_QUEUE_ID_MME:
> +		offset = MME_QMAN_BASE_OFFSET;
> +		*queue_len = MME_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC0:
> +		offset = TPC0_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC1:
> +		offset = TPC1_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC2:
> +		offset = TPC2_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC3:
> +		offset = TPC3_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC4:
> +		offset = TPC4_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC5:
> +		offset = TPC5_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC6:
> +		offset = TPC6_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	case GOYA_QUEUE_ID_TPC7:
> +		offset = TPC7_QMAN_BASE_OFFSET;
> +		*queue_len = TPC_QMAN_LENGTH;
> +		break;
> +	default:
> +		dev_err(hdev->dev, "Got invalid queue id %d\n", queue_id);
> +		return NULL;
> +	}
> +
> +	base += offset;
> +	*dma_handle += offset;
> +
> +	return base;
> +}
> +
> +int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
> +				u32 timeout, long *result)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	struct armcp_packet *pkt;
> +	dma_addr_t pkt_dma_addr;
> +	u32 tmp;
> +	int rc = 0;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q)) {
> +		if (result)
> +			*result = 0;
> +		return 0;
> +	}
> +
> +	if (len > CPU_CB_SIZE) {
> +		dev_err(hdev->dev, "Invalid CPU message size of %d bytes\n",
> +			len);
> +		return -ENOMEM;
> +	}
> +
> +	pkt = hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev, len,
> +								&pkt_dma_addr);
> +	if (!pkt) {
> +		dev_err(hdev->dev,
> +			"Failed to allocate DMA memory for packet to CPU\n");
> +		return -ENOMEM;
> +	}
> +
> +	memcpy(pkt, msg, len);
> +
> +	mutex_lock(&hdev->send_cpu_message_lock);
> +
> +	if (hdev->disabled)
> +		goto out;
> +
> +	rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_CPU_PQ, len,
> +			pkt_dma_addr);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to send CB on CPU PQ (%d)\n", rc);
> +		goto out;
> +	}
> +
> +	rc = hl_poll_timeout_memory(hdev, (u64) &pkt->fence, timeout, &tmp);
> +
> +	hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_CPU_PQ);
> +
> +	if (rc == -ETIMEDOUT) {
> +		dev_err(hdev->dev,
> +			"Timeout while waiting for CPU packet fence\n");
> +		goto out;
> +	}
> +
> +	if (tmp == ARMCP_PACKET_FENCE_VAL) {
> +		if (pkt->rc) {
> +			dev_err(hdev->dev,
> +				"failed to execute CPU packet, rc: %d\n",
> +					pkt->rc);
> +			rc = -EINVAL;
> +		} else if (result) {
> +			*result = pkt->result;

For some of the error paths above (e.g. timeout, non-zero pkt->rc, wrong fence
value), *result is never initialized.
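
One option is to give *result a defined value on entry, so every path leaves
it initialized. This matters for goya_test_cpu_queue() further down, which
prints result in its failure message. A minimal sketch, assuming a simplified
reply layout: struct reply, FENCE_OK and read_reply() are hypothetical and only
mirror the fields used above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reply layout, loosely mirroring the fields used above. */
struct reply {
	unsigned int fence;
	int rc;
	long result;
};

#define FENCE_OK 0xFE8CE7A5u /* arbitrary value for this sketch */

static int read_reply(const struct reply *pkt, int poll_rc, long *result)
{
	assert(pkt);

	/* Give the caller a defined value on every path, including errors. */
	if (result)
		*result = 0;

	if (poll_rc == -ETIMEDOUT)
		return poll_rc;

	if (pkt->fence != FENCE_OK || pkt->rc)
		return -EINVAL;

	if (result)
		*result = pkt->result;

	return 0;
}
```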

> +		}
> +	} else {
> +		dev_err(hdev->dev, "CPU packet wrong fence value\n");
> +		rc = -EINVAL;
> +	}
> +
> +out:
> +	mutex_unlock(&hdev->send_cpu_message_lock);
> +
> +	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, len, pkt);
> +
> +	return rc;
> +}
> +
> +int goya_test_queue(struct hl_device *hdev, u32 hw_queue_id)
> +{
> +	struct packet_msg_prot *fence_pkt;
> +	dma_addr_t pkt_dma_addr;
> +	u32 fence_val, tmp;
> +	dma_addr_t fence_dma_addr;
> +	u32 *fence_ptr;
> +	int rc;
> +
> +	fence_val = GOYA_QMAN0_FENCE_VAL;
> +
> +	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
> +							&fence_dma_addr);
> +	if (!fence_ptr) {
> +		dev_err(hdev->dev,
> +			"Failed to allocate memory for queue testing\n");
> +		return -ENOMEM;
> +	}
> +
> +	*fence_ptr = 0;
> +
> +	fence_pkt = hdev->asic_funcs->dma_pool_zalloc(hdev,
> +					sizeof(struct packet_msg_prot),
> +					GFP_KERNEL, &pkt_dma_addr);
> +	if (!fence_pkt) {
> +		dev_err(hdev->dev,
> +			"Failed to allocate packet for queue testing\n");
> +		rc = -ENOMEM;
> +		goto free_fence_ptr;
> +	}
> +
> +	fence_pkt->opcode = PACKET_MSG_PROT;
> +	fence_pkt->value = fence_val;
> +	fence_pkt->addr = fence_dma_addr +
> +				hdev->asic_prop.host_phys_base_address;
> +
> +	rc = hl_hw_queue_send_cb_no_cmpl(hdev, hw_queue_id,
> +					sizeof(struct packet_msg_prot),
> +					pkt_dma_addr);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to send fence packet\n");
> +		goto free_pkt;
> +	}
> +
> +	rc = hl_poll_timeout_memory(hdev, (u64) fence_ptr,
> +					GOYA_TEST_QUEUE_WAIT_USEC, &tmp);
> +
> +	hl_hw_queue_inc_ci_kernel(hdev, hw_queue_id);
> +
> +	if ((!rc) && (tmp == fence_val)) {
> +		dev_info(hdev->dev,
> +			"queue test on H/W queue %d succeeded\n",
> +			hw_queue_id);
> +	} else {
> +		dev_err(hdev->dev,
> +			"H/W queue %d test failed (scratch(0x%08llX) == 0x%08X)\n",
> +			hw_queue_id, fence_dma_addr, tmp);
> +		rc = -EINVAL;
> +	}
> +
> +free_pkt:
> +	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_pkt,
> +					pkt_dma_addr);
> +free_fence_ptr:
> +	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_ptr,
> +					fence_dma_addr);
> +	return rc;
> +}
> +
> +int goya_test_cpu_queue(struct hl_device *hdev)
> +{
> +	struct armcp_packet test_pkt;
> +	long result;
> +	int rc;
> +
> +	/* cpu_queues_enable flag is always checked in send cpu message */
> +
> +	memset(&test_pkt, 0, sizeof(test_pkt));
> +
> +	test_pkt.opcode = ARMCP_PACKET_TEST;
> +	test_pkt.value = ARMCP_PACKET_FENCE_VAL;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &test_pkt,
> +			sizeof(test_pkt), HL_DEVICE_TIMEOUT_USEC, &result);
> +
> +	if (!rc)
> +		dev_info(hdev->dev, "queue test on CPU queue succeeded\n");
> +	else
> +		dev_err(hdev->dev, "CPU queue test failed (0x%08lX)\n", result);
> +
> +	return rc;
> +}
> +
> +static int goya_test_queues(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int i, rc, ret_val = 0;
> +
> +	if (hdev->ifh)
> +		return 0;
> +
> +	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
> +		rc = goya_test_queue(hdev, i);
> +		if (rc)
> +			ret_val = -EINVAL;
> +	}
> +
> +	if (hdev->cpu_queues_enable) {
> +		rc = goya->test_cpu_queue(hdev);
> +		if (rc)
> +			ret_val = -EINVAL;
> +	}
> +
> +	return ret_val;
> +}
> +
> +void *goya_dma_pool_zalloc(struct hl_device *hdev, size_t size, gfp_t mem_flags,
> +				dma_addr_t *dma_handle)
> +{
> +	if (size > GOYA_DMA_POOL_BLK_SIZE)
> +		return NULL;
> +
> +	return dma_pool_zalloc(hdev->dma_pool, mem_flags, dma_handle);
> +}
> +
> +void goya_dma_pool_free(struct hl_device *hdev, void *vaddr,
> +			dma_addr_t dma_addr)
> +{
> +	dma_pool_free(hdev->dma_pool, vaddr, dma_addr);
> +}
> +
> +void *goya_cpu_accessible_dma_pool_alloc(struct hl_device *hdev, size_t size,
> +			dma_addr_t *dma_handle)
> +{
> +	u64 kernel_addr;
> +
> +	/* roundup to CPU_PKT_SIZE */
> +	size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
> +
> +	kernel_addr = gen_pool_alloc(hdev->cpu_accessible_dma_pool, size);
> +
> +	*dma_handle = hdev->cpu_accessible_dma_address +
> +			(kernel_addr - (u64) hdev->cpu_accessible_dma_mem);
> +
> +	return (void *) kernel_addr;
> +}
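
The round-up above only works when CPU_PKT_SIZE is a power of 2 and
CPU_PKT_MASK is its complement mask. A small sketch of that arithmetic, with
assumed values: PKT_SIZE, PKT_MASK and pkt_roundup() are hypothetical stand-ins,
since the real definitions are not part of this hunk.

```c
#include <assert.h>
#include <stddef.h>

#define PKT_SIZE 32u                       /* assumed; must be a power of 2 */
#define PKT_MASK (~((size_t)PKT_SIZE - 1)) /* assumed definition of the mask */

/* Round 'size' up to the next multiple of PKT_SIZE. */
static size_t pkt_roundup(size_t size)
{
	static_assert((PKT_SIZE & (PKT_SIZE - 1)) == 0,
		      "power of 2 required");

	return (size + (PKT_SIZE - 1)) & PKT_MASK;
}
```
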
> +
> +void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
> +			void *vaddr)
> +{
> +	/* roundup to CPU_PKT_SIZE */
> +	size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
> +
> +	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
> +}
> +
> +
> +static void goya_hw_queues_lock(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	spin_lock(&goya->hw_queues_lock);
> +}
> +
> +static void goya_hw_queues_unlock(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	spin_unlock(&goya->hw_queues_lock);
> +}
> +
>  static const struct hl_asic_funcs goya_funcs = {
>  	.early_init = goya_early_init,
>  	.early_fini = goya_early_fini,
> @@ -2525,8 +3798,19 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.resume = goya_resume,
>  	.mmap = goya_mmap,
>  	.cb_mmap = goya_cb_mmap,
> +	.ring_doorbell = goya_ring_doorbell,
> +	.flush_pq_write = goya_flush_pq_write,
>  	.dma_alloc_coherent = goya_dma_alloc_coherent,
>  	.dma_free_coherent = goya_dma_free_coherent,
> +	.get_int_queue_base = goya_get_int_queue_base,
> +	.test_queues = goya_test_queues,
> +	.dma_pool_zalloc = goya_dma_pool_zalloc,
> +	.dma_pool_free = goya_dma_pool_free,
> +	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
> +	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
> +	.hw_queues_lock = goya_hw_queues_lock,
> +	.hw_queues_unlock = goya_hw_queues_unlock,
> +	.send_cpu_message = goya_send_cpu_message
>  };
>  
>  /**
> diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> index 45a6d2ca2752..598a718d3df1 100644
> --- a/drivers/misc/habanalabs/goya/goyaP.h
> +++ b/drivers/misc/habanalabs/goya/goyaP.h
> @@ -9,6 +9,7 @@
>  #define GOYAP_H_
>  
>  #include "habanalabs.h"
> +#include "include/goya/goya_packets.h"
>  #include "include/goya/goya_boot_if.h"
>  #include "include/goya/goya.h"
>  
> @@ -117,12 +118,17 @@ enum goya_fw_component {
>  };
>  
>  struct goya_device {
> +	int (*test_cpu_queue)(struct hl_device *hdev);
> +
>  	/* TODO: remove hw_queues_lock after moving to scheduler code */
>  	spinlock_t	hw_queues_lock;
>  	u64		ddr_bar_cur_addr;
>  	u32		hw_cap_initialized;
>  };
>  
> +int goya_test_cpu_queue(struct hl_device *hdev);
> +int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
> +				u32 timeout, long *result);
>  void goya_init_security(struct hl_device *hdev);
>  
>  #endif /* GOYAP_H_ */
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index adda281ec2af..8232e2259463 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -30,10 +30,36 @@
>  struct hl_device;
>  struct hl_fpriv;
>  
> +/**
> + * enum hl_queue_type - Supported QUEUE types.
> + * @QUEUE_TYPE_NA: queue is not available.
> + * @QUEUE_TYPE_EXT: external queue which is a DMA channel that may access the
> + *                  host.
> + * @QUEUE_TYPE_INT: internal queue that performs DMA inside the device's
> + *			memories and/or operates the compute engines.
> + * @QUEUE_TYPE_CPU: S/W queue for communication with the device's CPU.
> + */
> +enum hl_queue_type {
> +	QUEUE_TYPE_NA,
> +	QUEUE_TYPE_EXT,
> +	QUEUE_TYPE_INT,
> +	QUEUE_TYPE_CPU
> +};
>  
> +/**
> + * struct hw_queue_properties - queue information.
> + * @type: queue type.
> + * @kmd_only: true if only KMD is allowed to send a job to this queue, false
> + *            otherwise.
> + */
> +struct hw_queue_properties {
> +	enum hl_queue_type	type;
> +	u8			kmd_only;
> +};
>  
>  /**
>   * struct asic_fixed_properties - ASIC specific immutable properties.
> + * @hw_queues_props: H/W queues properties.
>   * @uboot_ver: F/W U-boot version.
>   * @preboot_ver: F/W Preboot version.
>   * @sram_base_address: SRAM physical start address.
> @@ -64,6 +90,7 @@ struct hl_fpriv;
>   * @tpc_enabled_mask: which TPCs are enabled.
>   */
>  struct asic_fixed_properties {
> +	struct hw_queue_properties	hw_queues_props[HL_MAX_QUEUES];
>  	char			uboot_ver[VERSION_MAX_LEN];
>  	char			preboot_ver[VERSION_MAX_LEN];
>  	u64			sram_base_address;
> @@ -145,7 +172,92 @@ struct hl_cb {
>  
>  
>  
> +/*
> + * QUEUES
> + */
> +
> +struct hl_cs_job;
> +
> +/*
> + * Currently, there are two limitations on the maximum length of a queue:
> + *
> + * 1. The memory footprint of the queue. The current allocated space for the
> + *    queue is PAGE_SIZE. Because each entry in the queue is HL_BD_SIZE,
> + *    the maximum length of the queue can be PAGE_SIZE / HL_BD_SIZE,
> + *    which currently is 4096/16 = 256 entries.
> + *
> + *    To increase that, we need either to decrease the size of the
> + *    BD (difficult), or allocate more than a single page (easier).
> + *
> + * 2. Because the size of the JOB handle field in the BD CTL / completion queue
> + *    is 10-bit, we can have up to 1024 open jobs per hardware queue.
> + *    Therefore, each queue can hold up to 1024 entries.
> + *
> + * HL_QUEUE_LENGTH is in units of struct hl_bd.
> + * HL_QUEUE_LENGTH * sizeof(struct hl_bd) should be <= HL_PAGE_SIZE
> + */
> +
> +#define HL_PAGE_SIZE			4096 /* minimum page size */
> +/* Must be power of 2 (HL_PAGE_SIZE / HL_BD_SIZE) */
>  #define HL_QUEUE_LENGTH			256
> +#define HL_QUEUE_SIZE_IN_BYTES		(HL_QUEUE_LENGTH * HL_BD_SIZE)
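
The constraints listed in the comment could also be checked at build time. A
sketch using C11 static_assert (in the kernel this would more likely be
BUILD_BUG_ON); HL_BD_SIZE = 16 is assumed from the "4096/16 = 256" arithmetic
in the comment, and IS_POW2() and queue_wrap() are hypothetical helpers. The
mask-based wrap is presumably why a power of 2 is required.

```c
#include <assert.h>

#define HL_PAGE_SIZE		4096
#define HL_BD_SIZE		16	/* assumed from "4096/16 = 256" */
#define HL_QUEUE_LENGTH		256
#define HL_QUEUE_SIZE_IN_BYTES	(HL_QUEUE_LENGTH * HL_BD_SIZE)

/* Power-of-two test: non-zero with exactly one bit set. */
#define IS_POW2(x)	((x) != 0 && (((x) & ((x) - 1)) == 0))

static_assert(IS_POW2(HL_QUEUE_LENGTH), "queue length must be a power of 2");
static_assert(HL_QUEUE_SIZE_IN_BYTES <= HL_PAGE_SIZE,
	      "queue must fit in one page");
static_assert(HL_QUEUE_LENGTH <= 1024, "job handle field is 10 bits wide");

/* With a power-of-two length, wrapping an index is a mask, not a modulo. */
static unsigned int queue_wrap(unsigned int idx)
{
	return idx & (HL_QUEUE_LENGTH - 1);
}
```
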
> +
> +/*
> + * HL_CQ_LENGTH is in units of struct hl_cq_entry.
> + * HL_CQ_LENGTH should be <= HL_PAGE_SIZE
> + */
> +#define HL_CQ_LENGTH			HL_QUEUE_LENGTH
> +#define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
> +
> +
> +
> +/**
> + * struct hl_hw_queue - describes a H/W transport queue.
> + * @shadow_queue: pointer to a shadow queue that holds pointers to jobs.
> + * @queue_type: type of queue.
> + * @kernel_address: holds the queue's kernel virtual address.
> + * @bus_address: holds the queue's DMA address.
> + * @pi: holds the queue's pi value.
> + * @ci: holds the queue's ci value, AS CALCULATED BY THE DRIVER (not real ci).
> + * @hw_queue_id: the id of the H/W queue.
> + * @int_queue_len: length of internal queue (number of entries).
> + * @valid: is the queue valid (we have array of 32 queues, not all of them
> + *		exists).
> + */
> +struct hl_hw_queue {
> +	struct hl_cs_job	**shadow_queue;
> +	enum hl_queue_type	queue_type;
> +	u64			kernel_address;
> +	dma_addr_t		bus_address;
> +	u32			pi;
> +	u32			ci;
> +	u32			hw_queue_id;
> +	u16			int_queue_len;
> +	u8			valid;
> +};
> +
> +/**
> + * struct hl_cq - describes a completion queue
> + * @hdev: pointer to the device structure
> + * @kernel_address: holds the queue's kernel virtual address
> + * @bus_address: holds the queue's DMA address
> + * @hw_queue_id: the id of the matching H/W queue
> + * @ci: ci inside the queue
> + * @pi: pi inside the queue
> + * @free_slots_cnt: counter of free slots in queue
> + */
> +struct hl_cq {
> +	struct hl_device	*hdev;
> +	u64			kernel_address;
> +	dma_addr_t		bus_address;
> +	u32			hw_queue_id;
> +	u32			ci;
> +	u32			pi;
> +	atomic_t		free_slots_cnt;
> +};
> +
> +
> +
>  
>  
>  /*
> @@ -180,8 +292,20 @@ enum hl_asic_type {
>   * @resume: handles IP specific H/W or SW changes for resume.
>   * @mmap: mmap function, does nothing.
>   * @cb_mmap: maps a CB.
> + * @ring_doorbell: increment PI on a given QMAN.
> + * @flush_pq_write: flush PQ entry write if necessary, WARN if flushing failed.
>   * @dma_alloc_coherent: DMA allocate coherent memory.
>   * @dma_free_coherent: free DMA allocation.
> + * @get_int_queue_base: get the internal queue base address.
> + * @test_queues: run simple test on all queues for sanity check.
> + * @dma_pool_zalloc: small DMA allocation of coherent memory from DMA pool.
> + *                   size of allocation is HL_DMA_POOL_BLK_SIZE.
> + * @dma_pool_free: free small DMA allocation from pool.
> + * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
> + * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
> + * @hw_queues_lock: acquire H/W queues lock.
> + * @hw_queues_unlock: release H/W queues lock.
> + * @send_cpu_message: send buffer to ArmCP.
>   */
>  struct hl_asic_funcs {
>  	int (*early_init)(struct hl_device *hdev);
> @@ -195,10 +319,27 @@ struct hl_asic_funcs {
>  	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
>  	int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
>  			u64 kaddress, phys_addr_t paddress, u32 size);
> +	void (*ring_doorbell)(struct hl_device *hdev, u32 hw_queue_id, u32 pi);
> +	void (*flush_pq_write)(struct hl_device *hdev, u64 *pq, u64 exp_val);
>  	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
>  					dma_addr_t *dma_handle, gfp_t flag);
>  	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
>  					void *cpu_addr, dma_addr_t dma_handle);
> +	void* (*get_int_queue_base)(struct hl_device *hdev, u32 queue_id,
> +				dma_addr_t *dma_handle, u16 *queue_len);
> +	int (*test_queues)(struct hl_device *hdev);
> +	void* (*dma_pool_zalloc)(struct hl_device *hdev, size_t size,
> +				gfp_t mem_flags, dma_addr_t *dma_handle);
> +	void (*dma_pool_free)(struct hl_device *hdev, void *vaddr,
> +				dma_addr_t dma_addr);
> +	void* (*cpu_accessible_dma_pool_alloc)(struct hl_device *hdev,
> +				size_t size, dma_addr_t *dma_handle);
> +	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
> +				size_t size, void *vaddr);
> +	void (*hw_queues_lock)(struct hl_device *hdev);
> +	void (*hw_queues_unlock)(struct hl_device *hdev);
> +	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
> +				u16 len, u32 timeout, long *result);
>  };
>  
>  
> @@ -240,6 +381,17 @@ struct hl_ctx_mgr {
>  
>  
>  
> +/**
> + * struct hl_cs_job - command submission job.
> + * @finish_work: workqueue object to run when job is completed.
> + * @id: the id of this job inside a CS.
> + */
> +struct hl_cs_job {
> +	struct work_struct	finish_work;
> +	u32			id;
> +};
> +
> +
>  /*
>   * FILE PRIVATE STRUCTURE
>   */
> @@ -316,7 +468,11 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @dev: related kernel basic device structure.
>   * @asic_name: ASIC specific name.
>   * @asic_type: ASIC specific type.
> + * @completion_queue: array of hl_cq.
> + * @cq_wq: work queue of completion queues for executing work in process context.
> + * @eq_wq: work queue of event queue for executing work in process context.
>   * @kernel_ctx: KMD context structure.
> + * @kernel_queues: array of hl_hw_queue.
>   * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CBs.
>   * @dma_pool: DMA pool for small allocations.
>   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> @@ -326,6 +482,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @asid_bitmap: holds used/available ASIDs.
>   * @asid_mutex: protects asid_bitmap.
>   * @device_open: lock for sanity checks upon FD open.
> + * @send_cpu_message_lock: enforces only one message in KMD <-> ArmCP queue.
>   * @asic_prop: ASIC specific immutable properties.
>   * @asic_funcs: ASIC specific functions.
>   * @asic_specific: ASIC specific information to use only from ASIC files.
> @@ -345,7 +502,10 @@ struct hl_device {
>  	struct device			*dev;
>  	char				asic_name[16];
>  	enum hl_asic_type		asic_type;
> +	struct hl_cq			*completion_queue;
> +	struct workqueue_struct		*cq_wq;
>  	struct hl_ctx			*kernel_ctx;
> +	struct hl_hw_queue		*kernel_queues;
>  	struct hl_cb_mgr		kernel_cb_mgr;
>  	struct dma_pool			*dma_pool;
>  	void				*cpu_accessible_dma_mem;
> @@ -356,6 +516,7 @@ struct hl_device {
>  	struct mutex			asid_mutex;
>  	/* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
>  	struct mutex			device_open;
> +	struct mutex			send_cpu_message_lock;
>  	struct asic_fixed_properties	asic_prop;
>  	const struct hl_asic_funcs	*asic_funcs;
>  	void				*asic_specific;
> @@ -374,7 +535,9 @@ struct hl_device {
>  	u8				cpu_enable;
>  	u8				reset_pcilink;
>  	u8				config_pll;
> +	u8				cpu_queues_enable;
>  	u8				fw_loading;
> +	u8				ifh;
>  	u8				pldm;
>  };
>  
> @@ -418,7 +581,18 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
>  				u32 *val);
>  int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
>  				u32 timeout_us, u32 *val);
> -
> +int hl_hw_queues_create(struct hl_device *hdev);
> +void hl_hw_queues_destroy(struct hl_device *hdev);
> +int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
> +				u32 cb_size, u64 cb_ptr);
> +u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
> +void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
> +
> +#define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
> +#define hl_pi_2_offset(pi)		((pi) & (HL_QUEUE_LENGTH - 1))
> +
> +int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
> +void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
>  int hl_asid_init(struct hl_device *hdev);
>  void hl_asid_fini(struct hl_device *hdev);
>  unsigned long hl_asid_alloc(struct hl_device *hdev);
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index bd80683118d3..b64f58ad0f5d 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -184,13 +184,19 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
>  	hdev->cpu_enable = 1;
>  	hdev->reset_pcilink = 0;
>  	hdev->config_pll = 0;
> +	hdev->cpu_queues_enable = 1;
>  	hdev->fw_loading = 1;
> +	hdev->ifh = 0;
>  	hdev->pldm = 0;
>  
>  	/* If CPU is disabled, no point in loading FW */
>  	if (!hdev->cpu_enable)
>  		hdev->fw_loading = 0;
>  
> +	/* If we don't load FW, no need to initialize CPU queues */
> +	if (!hdev->fw_loading)
> +		hdev->cpu_queues_enable = 0;
> +
>  	hdev->disabled = true;
>  	hdev->pdev = pdev; /* can be NULL in case of simulator device */
>  
> diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
> new file mode 100644
> index 000000000000..65102a5bc2ca
> --- /dev/null
> +++ b/drivers/misc/habanalabs/hw_queue.c
> @@ -0,0 +1,404 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +
> +#include <linux/dma-mapping.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/delay.h>
> +
> +/**
> + * hl_hw_queue_add_ptr - add to pi or ci and wrap around if needed
> + *
> + * @ptr: the current pi/ci value
> + * @val: the amount to add
> + *
> + * Add val to ptr. The result wraps around at twice the queue length.
> + */
> +inline u32 hl_hw_queue_add_ptr(u32 ptr, u16 val)
> +{
> +	ptr += val;
> +	ptr &= ((HL_QUEUE_LENGTH << 1) - 1);
> +	return ptr;
> +}
> +
> +static inline int queue_free_slots(struct hl_hw_queue *q, u32 queue_len)
> +{
> +	int delta = (q->pi - q->ci);
> +
> +	if (delta >= 0)
> +		return (queue_len - delta);
> +	else
> +		return (abs(delta) - queue_len);
> +}
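
The pi/ci arithmetic above is easy to misread, so here is a minimal user-space sketch of it. It assumes a queue length of 256 (QUEUE_LEN is a made-up stand-in for HL_QUEUE_LENGTH, whose value is not visible in this hunk) and uses hypothetical helper names. Because the pointers run over twice the queue length, pi == ci means empty while pi - ci == QUEUE_LEN means full.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define QUEUE_LEN 256u /* assumed value; must be a power of two */

/* Advance a pi/ci value; it wraps at twice the queue length. */
static uint32_t add_ptr(uint32_t ptr, uint16_t val)
{
	return (ptr + val) & ((QUEUE_LEN << 1) - 1);
}

/* Map a pi/ci value to a slot index inside the ring. */
static uint32_t ptr_to_offset(uint32_t ptr)
{
	return ptr & (QUEUE_LEN - 1);
}

/* Same arithmetic as queue_free_slots() in the patch. */
static int free_slots(uint32_t pi, uint32_t ci)
{
	int delta = (int)pi - (int)ci;

	if (delta >= 0)
		return (int)QUEUE_LEN - delta;
	return abs(delta) - (int)QUEUE_LEN;
}
```

With these definitions, a producer that has wrapped to pi = 0 while the consumer sits at ci = 300 still has 212 entries in flight, so free_slots() reports 44.
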
> +
> +/**
> + * ext_queue_submit_bd - Submit a buffer descriptor to an external queue
> + *
> + * @hdev: pointer to habanalabs device structure
> + * @q: pointer to habanalabs queue structure
> + * @ctl: BD's control word
> + * @len: BD's length
> + * @ptr: BD's pointer
> + *
> + * This function assumes there is enough space on the queue to submit a new
> + * BD to it. It initializes the next BD and calls the device specific
> + * function to set the pi (and doorbell)
> + *
> + * This function must be called when the scheduler mutex is taken
> + *
> + */
> +static void ext_queue_submit_bd(struct hl_device *hdev, struct hl_hw_queue *q,
> +				u32 ctl, u32 len, u64 ptr)
> +{
> +	struct hl_bd *bd;
> +
> +	bd = (struct hl_bd *) q->kernel_address;
> +	bd += hl_pi_2_offset(q->pi);
> +	bd->ctl = ctl;
> +	bd->len = len;
> +	bd->ptr = ptr + hdev->asic_prop.host_phys_base_address;
> +
> +	q->pi = hl_queue_inc_ptr(q->pi);
> +	hdev->asic_funcs->ring_doorbell(hdev, q->hw_queue_id, q->pi);
> +}
> +
> +/**
> + * ext_queue_sanity_checks - perform some sanity checks on external queue
> + *
> + * @hdev: pointer to hl_device structure
> + * @q: pointer to hl_hw_queue structure
> + * @num_of_entries: how many entries to check for space
> + * @reserve_cq_entry: whether to reserve an entry in the cq
> + *
> + * H/W queues spinlock should be taken before calling this function
> + *
> + * Perform the following:
> + * - Make sure we have enough space in the h/w queue
> + * - Make sure we have enough space in the completion queue
> + * - Reserve space in the completion queue (needs to be reversed if there
> + *   is a failure down the road before the actual submission of work). Only
> + *   do this action if reserve_cq_entry is true
> + *
> + */
> +static int ext_queue_sanity_checks(struct hl_device *hdev,
> +				struct hl_hw_queue *q, int num_of_entries,
> +				bool reserve_cq_entry)
> +{
> +	atomic_t *free_slots =
> +			&hdev->completion_queue[q->hw_queue_id].free_slots_cnt;
> +	int free_slots_cnt;
> +
> +	/* Check we have enough space in the queue */
> +	free_slots_cnt = queue_free_slots(q, HL_QUEUE_LENGTH);
> +
> +	if (free_slots_cnt < num_of_entries) {
> +		dev_dbg(hdev->dev, "Queue %d doesn't have room for %d CBs\n",
> +			q->hw_queue_id, num_of_entries);
> +		return -EAGAIN;
> +	}
> +
> +	if (reserve_cq_entry) {
> +		/*
> +		 * Check we have enough space in the completion queue.
> +		 * Subtract num_of_entries from the free slots counter. If the
> +		 * result is negative, the CQ is too full to ack the new CBs,
> +		 * so restore the counter and fail the submission.
> +		 * atomic_add_negative returns true if the result is negative
> +		 */
> +		if (atomic_add_negative(num_of_entries * -1, free_slots)) {
> +			dev_dbg(hdev->dev, "No space for %d on CQ %d\n",
> +				num_of_entries, q->hw_queue_id);
> +			atomic_add(num_of_entries, free_slots);
> +			return -EAGAIN;
> +		}
> +	}
> +
> +	return 0;
> +}
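
For reference, the reserve-then-undo pattern used for the completion queue counter can be modelled outside the kernel with C11 atomics. This is only a sketch: cq_reserve() is a hypothetical name, and atomic_fetch_sub() stands in for the kernel's atomic_add_negative(), which is not available in user space.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Try to reserve n completion-queue slots; returns 0 or -EAGAIN. */
static int cq_reserve(atomic_int *free_slots, int n)
{
	/* atomic_fetch_sub() returns the value before the subtraction. */
	int remaining = atomic_fetch_sub(free_slots, n) - n;

	if (remaining < 0) {
		/* Not enough room: undo the subtraction and fail. */
		atomic_fetch_add(free_slots, n);
		return -EAGAIN;
	}

	return 0;
}
```

Note that between the subtraction and the undo, another thread can observe a transiently negative counter. In the patch that window is covered by the H/W queues lock, which this sketch does not model.
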
> +
> +/**
> + * hl_hw_queue_send_cb_no_cmpl - send a single CB (not a JOB) without completion
> + *
> + * @hdev: pointer to hl_device structure
> + * @hw_queue_id: Queue's ID
> + * @cb_size: size of CB
> + * @cb_ptr: pointer to CB location
> + *
> + * This function sends a single CB, that must NOT generate a completion entry
> + *
> + */
> +int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
> +				u32 cb_size, u64 cb_ptr)
> +{
> +	struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
> +	int rc;
> +
> +	/*
> +	 * The CPU queue is a synchronous queue with an effective depth of
> +	 * a single entry (although it is allocated with room for multiple
> +	 * entries). Therefore, there is a different lock, called
> +	 * send_cpu_message_lock, that serializes accesses to the CPU queue.
> +	 * As a result, we don't need to lock the access to the entire H/W
> +	 * queues module when submitting a JOB to the CPU queue
> +	 */
> +	if (q->queue_type != QUEUE_TYPE_CPU)
> +		hdev->asic_funcs->hw_queues_lock(hdev);
> +
> +	if (hdev->disabled) {
> +		rc = -EPERM;
> +		goto out;
> +	}
> +
> +	rc = ext_queue_sanity_checks(hdev, q, 1, false);
> +	if (rc)
> +		goto out;
> +
> +	ext_queue_submit_bd(hdev, q, 0, cb_size, cb_ptr);
> +
> +out:
> +	if (q->queue_type != QUEUE_TYPE_CPU)
> +		hdev->asic_funcs->hw_queues_unlock(hdev);
> +
> +	return rc;
> +}
> +
> +/**
> + * hl_hw_queue_inc_ci_kernel - increment ci for kernel's queue
> + *
> + * @hdev: pointer to hl_device structure
> + * @hw_queue_id: which queue to increment its ci
> + */
> +void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id)
> +{
> +	struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
> +
> +	q->ci = hl_queue_inc_ptr(q->ci);
> +}
> +
> +static int ext_and_cpu_hw_queue_init(struct hl_device *hdev,
> +					struct hl_hw_queue *q)
> +{
> +	void *p;
> +	int rc;
> +
> +	p = hdev->asic_funcs->dma_alloc_coherent(hdev,
> +				HL_QUEUE_SIZE_IN_BYTES,
> +				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	q->kernel_address = (u64) p;
> +
> +	q->shadow_queue = kmalloc_array(HL_QUEUE_LENGTH,
> +					sizeof(*q->shadow_queue),
> +					GFP_KERNEL);
> +	if (!q->shadow_queue) {
> +		dev_err(hdev->dev,
> +			"Failed to allocate shadow queue for H/W queue %d\n",
> +			q->hw_queue_id);
> +		rc = -ENOMEM;
> +		goto free_queue;
> +	}
> +
> +	/* Make sure read/write pointers are initialized to start of queue */
> +	q->ci = 0;
> +	q->pi = 0;
> +
> +	return 0;
> +
> +free_queue:
> +	hdev->asic_funcs->dma_free_coherent(hdev, HL_QUEUE_SIZE_IN_BYTES,
> +			(void *) q->kernel_address, q->bus_address);
> +
> +	return rc;
> +}
> +
> +static int int_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
> +{
> +	void *p;
> +
> +	p = hdev->asic_funcs->get_int_queue_base(hdev, q->hw_queue_id,
> +					&q->bus_address, &q->int_queue_len);
> +	if (!p) {
> +		dev_err(hdev->dev,
> +			"Failed to get base address for internal queue %d\n",
> +			q->hw_queue_id);
> +		return -EFAULT;
> +	}
> +
> +	q->kernel_address = (u64) p;
> +	q->pi = 0;
> +	q->ci = 0;
> +
> +	return 0;
> +}
> +
> +static int cpu_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
> +{
> +	return ext_and_cpu_hw_queue_init(hdev, q);
> +}
> +
> +static int ext_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
> +{
> +	return ext_and_cpu_hw_queue_init(hdev, q);
> +}
> +
> +/**
> + * hw_queue_init - main initialization function for H/W queue object
> + *
> + * @hdev: pointer to hl_device device structure
> + * @q: pointer to hl_hw_queue queue structure
> + * @hw_queue_id: The id of the H/W queue
> + *
> + * Allocate dma-able memory for the queue and initialize fields
> + * Returns 0 on success
> + */
> +static int hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q,
> +			u32 hw_queue_id)
> +{
> +	int rc;
> +
> +	BUILD_BUG_ON(HL_QUEUE_SIZE_IN_BYTES > HL_PAGE_SIZE);
> +
> +	q->hw_queue_id = hw_queue_id;
> +
> +	switch (q->queue_type) {
> +	case QUEUE_TYPE_EXT:
> +		rc = ext_hw_queue_init(hdev, q);
> +		break;
> +
> +	case QUEUE_TYPE_INT:
> +		rc = int_hw_queue_init(hdev, q);
> +		break;
> +
> +	case QUEUE_TYPE_CPU:
> +		rc = cpu_hw_queue_init(hdev, q);
> +		break;
> +
> +	case QUEUE_TYPE_NA:
> +		q->valid = 0;
> +		return 0;
> +
> +	default:
> +		dev_crit(hdev->dev, "wrong queue type %d during init\n",
> +			q->queue_type);
> +		rc = -EINVAL;
> +		break;
> +	}
> +
> +	if (rc)
> +		return rc;
> +
> +	q->valid = 1;
> +
> +	return 0;
> +}
> +
> +/**
> + * hw_queue_fini - destroy queue
> + *
> + * @hdev: pointer to hl_device device structure
> + * @q: pointer to hl_hw_queue queue structure
> + *
> + * Free the queue memory
> + */
> +static void hw_queue_fini(struct hl_device *hdev, struct hl_hw_queue *q)
> +{
> +	if (!q->valid)
> +		return;
> +
> +	/*
> +	 * If we arrived here, there are no jobs waiting on this queue
> +	 * so we can safely remove it.
> +	 * This is because this function can only be called when:
> +	 * 1. Either a context is deleted, which only can occur if all its
> +	 *    jobs were finished
> +	 * 2. A context wasn't able to be created due to failure or timeout,
> +	 *    which means there are no jobs on the queue yet
> +	 *
> +	 * The only exceptions are the queues of the kernel context, but
> +	 * if they are being destroyed, it means that the entire module is
> +	 * being removed. If the module is removed, it means there is no open
> +	 * user context. It also means that if a job was submitted by
> +	 * the kernel driver (e.g. context creation), the job itself was
> +	 * released by the kernel driver when a timeout occurred on its
> +	 * completion. Thus, we don't need to release it again.
> +	 */
> +
> +	if (q->queue_type == QUEUE_TYPE_INT)
> +		return;
> +
> +	kfree(q->shadow_queue);
> +
> +	hdev->asic_funcs->dma_free_coherent(hdev,
> +			HL_QUEUE_SIZE_IN_BYTES,
> +			(void *) q->kernel_address, q->bus_address);
> +}
> +
> +int hl_hw_queues_create(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *asic = &hdev->asic_prop;
> +	struct hl_hw_queue *q;
> +	int i, rc, q_ready_cnt;
> +
> +	hdev->kernel_queues = kcalloc(HL_MAX_QUEUES,
> +				sizeof(*hdev->kernel_queues), GFP_KERNEL);
> +
> +	if (!hdev->kernel_queues) {
> +		dev_err(hdev->dev, "Not enough memory for H/W queues\n");
> +		return -ENOMEM;
> +	}
> +
> +	/* Initialize the H/W queues */
> +	for (i = 0, q_ready_cnt = 0, q = hdev->kernel_queues;
> +			i < HL_MAX_QUEUES ; i++, q_ready_cnt++, q++) {
> +
> +		q->queue_type = asic->hw_queues_props[i].type;
> +		rc = hw_queue_init(hdev, q, i);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"failed to initialize queue %d\n", i);
> +			goto release_queues;
> +		}
> +	}
> +
> +	return 0;
> +
> +release_queues:
> +	for (i = 0, q = hdev->kernel_queues ; i < q_ready_cnt ; i++, q++)
> +		hw_queue_fini(hdev, q);
> +
> +	kfree(hdev->kernel_queues);
> +
> +	return rc;
> +}
> +
> +void hl_hw_queues_destroy(struct hl_device *hdev)
> +{
> +	struct hl_hw_queue *q;
> +	int i;
> +
> +	for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++)
> +		hw_queue_fini(hdev, q);
> +
> +	kfree(hdev->kernel_queues);
> +}
> +
> +void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset)
> +{
> +	struct hl_hw_queue *q;
> +	int i;
> +
> +	for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++) {
> +		if ((!q->valid) ||
> +			((!hard_reset) && (q->queue_type == QUEUE_TYPE_CPU)))
> +			continue;
> +		q->pi = q->ci = 0;
> +	}
> +}
> diff --git a/drivers/misc/habanalabs/include/goya/goya_packets.h b/drivers/misc/habanalabs/include/goya/goya_packets.h
> new file mode 100644
> index 000000000000..669a3f37ccb7
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/goya/goya_packets.h
> @@ -0,0 +1,234 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2017-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + * Authors:
> + *
> + * Oded Gabbay <oded.gabbay@gmail.com>
> + * Guy Eilat <geilat@habana.ai>
> + *
> + */
> +
> +#ifndef GOYA_PACKETS_H
> +#define GOYA_PACKETS_H
> +
> +#include <linux/types.h>
> +
> +#define PACKET_HEADER_PACKET_ID_SHIFT		56
> +#define PACKET_HEADER_PACKET_ID_MASK		0x1F00000000000000ull
> +
> +enum packet_id {
> +	PACKET_WREG_32 = 0x1,
> +	PACKET_WREG_BULK = 0x2,
> +	PACKET_MSG_LONG = 0x3,
> +	PACKET_MSG_SHORT = 0x4,
> +	PACKET_CP_DMA = 0x5,
> +	PACKET_MSG_PROT = 0x7,
> +	PACKET_FENCE = 0x8,
> +	PACKET_LIN_DMA = 0x9,
> +	PACKET_NOP = 0xA,
> +	PACKET_STOP = 0xB,
> +	MAX_PACKET_ID = (PACKET_HEADER_PACKET_ID_MASK >>
> +				PACKET_HEADER_PACKET_ID_SHIFT) + 1
> +};
> +
> +enum goya_dma_direction {
> +	DMA_HOST_TO_DRAM,
> +	DMA_HOST_TO_SRAM,
> +	DMA_DRAM_TO_SRAM,
> +	DMA_SRAM_TO_DRAM,
> +	DMA_SRAM_TO_HOST,
> +	DMA_DRAM_TO_HOST,
> +	DMA_DRAM_TO_DRAM,
> +	DMA_SRAM_TO_SRAM,
> +	DMA_ENUM_MAX
> +};
> +
> +struct packet_nop {
> +	__u32 reserved;
> +	union {
> +		struct {
> +			__u32:24;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1;
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +};
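
To make the relation between the bit-fields and PACKET_HEADER_PACKET_ID_SHIFT/MASK explicit, here is a small sketch that builds the NOP ctl word with plain shifts. It assumes bit-fields are allocated from the least significant bit (as GCC does on little-endian targets) and that ctl is the upper 32 bits of the 64-bit packet. The PKT_* macros and both helpers are illustrative names, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PKT_ID_SHIFT 56                    /* PACKET_HEADER_PACKET_ID_SHIFT */
#define PKT_ID_MASK  0x1F00000000000000ull /* PACKET_HEADER_PACKET_ID_MASK */
#define PKT_NOP      0xAu                  /* PACKET_NOP */

/* Build the 32-bit ctl word of a NOP packet with explicit shifts. */
static uint32_t nop_ctl(int eng_barrier, int reg_barrier, int msg_barrier)
{
	return (PKT_NOP << 24) |                   /* opcode: bits 24..28 */
	       ((uint32_t)!!eng_barrier << 29) |
	       ((uint32_t)!!reg_barrier << 30) |
	       ((uint32_t)!!msg_barrier << 31);
}

/* Recover the packet id from the full 64-bit packet. */
static uint32_t packet_id(uint64_t pkt)
{
	return (uint32_t)((pkt & PKT_ID_MASK) >> PKT_ID_SHIFT);
}
```

Bit 24 of ctl corresponds to bit 56 of the 64-bit packet under these assumptions, which is why a shift of 56 and a 5-bit mask recover the opcode.
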
> +
> +struct packet_stop {
> +	__u32 reserved;
> +	union {
> +		struct {
> +			__u32:24;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1; /* must be 0 */
> +			__u32 msg_barrier :1; /* must be 0 */
> +		};
> +		__u32 ctl;
> +	};
> +};
> +
> +struct packet_wreg32 {
> +	__u32 value;
> +	union {
> +		struct {
> +			__u32 reg_offset :16;
> +			__u32:7;
> +			__u32 local :1; /* 0: write to TCL regs,
> +					 * 1: write to CMDQ regs
> +					 */
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1; /* must be 1 */
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +};
> +
> +struct packet_wreg_bulk {
> +	__u32 size64 :16;
> +	__u32:16;
> +	__u32 reg_offset :16;
> +	__u32:8;
> +	__u32 opcode :5;
> +	__u32 eng_barrier :1;
> +	__u32 reg_barrier :1; /* must be 1 */
> +	__u32 msg_barrier :1;
> +	__u64 values[0]; /* data starts here */
> +};
> +
> +struct packet_msg_long {
> +	__u32 value;
> +	union {
> +		struct {
> +			__u32:16;
> +			__u32 weakly_ordered :1;
> +			__u32 no_snoop :1;
> +			__u32:2;
> +			__u32 op :2; /* 0: write <value>. 1: write timestamp. */
> +			__u32:2;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1;
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +	__u64 addr;
> +};
> +
> +struct packet_msg_short {
> +	union {
> +		struct {
> +			__u32 sync_id :10;
> +			__u32:5;
> +			__u32 mode : 1;
> +			__u32 sync_value :16;
> +		} mon_arm_register;
> +		struct {
> +			__u32 sync_value :16;
> +			__u32:15;
> +			__u32 mode :1;
> +		} so_upd;
> +		__u32 value;
> +	};
> +	union {
> +		struct {
> +			__u32 msg_addr_offset :16;
> +			__u32 weakly_ordered :1;
> +			__u32 no_snoop :1;
> +			__u32:2;
> +			__u32 op :2;
> +			__u32 base :2;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1;
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +};
> +
> +struct packet_msg_prot {
> +	__u32 value;
> +	union {
> +		struct {
> +			__u32:16;
> +			__u32 weakly_ordered :1;
> +			__u32 no_snoop :1;
> +			__u32:2;
> +			__u32 op :2; /* 0: write <value>. 1: write timestamp. */
> +			__u32:2;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1;
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +	__u64 addr;
> +};
> +
> +struct packet_fence {
> +	__u32 dec_val :4;
> +	__u32:12;
> +	__u32 gate_val :8;
> +	__u32:6;
> +	__u32 id :2;
> +	__u32:24;
> +	__u32 opcode :5;
> +	__u32 eng_barrier :1;
> +	__u32 reg_barrier :1;
> +	__u32 msg_barrier :1;
> +};
> +
> +struct packet_lin_dma {
> +	__u32 tsize;
> +	union {
> +		struct {
> +			__u32 weakly_ordered :1; /* H/W bug, must be 1 */
> +			__u32 rdcomp :1;
> +			__u32 wrcomp :1;
> +			__u32 no_snoop :1;
> +			__u32 src_disable :1;
> +			__u32 dst_disable :1;
> +			__u32 memset_mode :1;
> +			__u32 tensor_dma :1; /* N/A, must be 0 */
> +			__u32 cntrl :12;
> +			__u32 dma_dir :3; /* S/W only, no effect on HW */
> +			__u32:1;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1; /* must be 1 */
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +	__u64 src_addr;
> +	__u64 dst_addr;
> +};
> +
> +struct packet_cp_dma {
> +	__u32 tsize;
> +	union {
> +		struct {
> +			__u32 weakly_ordered :1;
> +			__u32 no_snoop :1;
> +			__u32:22;
> +			__u32 opcode :5;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1; /* must be 1 */
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +	__u64 src_addr;
> +};
> +
> +#endif /* GOYA_PACKETS_H */
> diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> index 9dbb7077eabd..62df9981f68a 100644
> --- a/drivers/misc/habanalabs/include/habanalabs_device_if.h
> +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> @@ -97,6 +97,278 @@ enum pq_init_status {
>  	PQ_INIT_STATUS_READY_FOR_HOST
>  };
>  
> +/*
> + * ArmCP Primary Queue Packets
> + *
> + * During normal operation, KMD needs to send various messages to ArmCP,
> + * usually either to SET some value into a H/W periphery or to GET the current
> + * value of some H/W periphery. For example, SET the frequency of MME/TPC and
> + * GET the value of the thermal sensor.
> + *
> + * These messages can be initiated either by the User application or by KMD
> + * itself, e.g. power management code. In either case, the communication from
> + * KMD to ArmCP will *always* be in synchronous mode, meaning that KMD will
> + * send a single message and poll until the message is acknowledged and the
> + * results are ready (if results are needed).
> + *
> + * This means that only a single message can be sent at a time and KMD must
> + * wait for its result before sending the next message. Having said that,
> + * because these are control messages which are sent in a relatively low
> + * frequency, this limitation seems acceptable. It's important to note that
> + * in case of multiple devices, messages to different devices *can* be sent
> + * at the same time.
> + *
> + * The message, inputs/outputs (if relevant) and fence object will be located
> + * on the device DDR at an address that will be determined by KMD. During
> + * device initialization phase, KMD will pass that address to ArmCP. Most of
> + * the message types will contain inputs/outputs inside the message itself.
> + * The common part of each message will contain the opcode of the message (its
> + * type) and a field representing a fence object.
> + *
> + * When KMD wishes to send a message to ArmCP, it will write the message
> + * contents to the device DDR, clear the fence object and then write the
> + * value 484 to the mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR register to issue
> + * the 484 interrupt-id to the ARM core.
> + *
> + * Upon receiving the 484 interrupt-id, ArmCP will read the message from the
> + * DDR. In case the message is a SET operation, ArmCP will first perform the
> + * operation and then write to the fence object on the device DDR. In case the
> + * message is a GET operation, ArmCP will first fill the results section on the
> + * device DDR and then write to the fence object. If an error occurred, ArmCP
> + * will fill the rc field with the right error code.
> + *
> + * In the meantime, KMD will poll on the fence object. Once KMD sees that the
> + * fence object is signaled, it will read the results from the device DDR
> + * (if relevant) and resume the code execution in KMD.
> + *
> + * To use QMAN packets, the opcode must be the QMAN opcode, shifted by 8
> + * so the value being put by the KMD matches the value read by ArmCP
> + *
> + * Non-QMAN packets should be limited to values 1 through (2^8 - 1)
> + *
> + * Detailed description:
> + *
> + * ARMCP_PACKET_DISABLE_PCI_ACCESS -
> + *       After receiving this packet the embedded CPU must NOT issue PCI
> + *       transactions (read/write) towards the Host CPU. This also includes
> + *       sending MSI-X interrupts.
> + *       This packet is usually sent before the device is moved to D3Hot state.
> + *
> + * ARMCP_PACKET_ENABLE_PCI_ACCESS -
> + *       After receiving this packet the embedded CPU is allowed to issue PCI
> + *       transactions towards the Host CPU, including sending MSI-X interrupts.
> + *       This packet is usually sent after the device is moved to D0 state.
> + *
> + * ARMCP_PACKET_TEMPERATURE_GET -
> + *       Fetch the current temperature / Max / Max Hyst / Critical /
> + *       Critical Hyst of a specified thermal sensor. The packet's
> + *       arguments specify the desired sensor and the field to get.
> + *
> + * ARMCP_PACKET_VOLTAGE_GET -
> + *       Fetch the voltage / Max / Min of a specified sensor. The packet's
> + *       arguments specify the sensor and type.
> + *
> + * ARMCP_PACKET_CURRENT_GET -
> + *       Fetch the current / Max / Min of a specified sensor. The packet's
> + *       arguments specify the sensor and type.
> + *
> + * ARMCP_PACKET_FAN_SPEED_GET -
> + *       Fetch the speed / Max / Min of a specified fan. The packet's
> + *       arguments specify the sensor and type.
> + *
> + * ARMCP_PACKET_PWM_GET -
> + *       Fetch the pwm value / mode of a specified pwm. The packet's
> + *       arguments specify the sensor and type.
> + *
> + * ARMCP_PACKET_PWM_SET -
> + *       Set the pwm value / mode of a specified pwm. The packet's
> + *       arguments specify the sensor, type and value.
> + *
> + * ARMCP_PACKET_FREQUENCY_SET -
> + *       Set the frequency of a specified PLL. The packet's arguments specify
> + *       the PLL and the desired frequency. The actual frequency in the device
> + *       might differ from the requested frequency.
> + *
> + * ARMCP_PACKET_FREQUENCY_GET -
> + *       Fetch the frequency of a specified PLL. The packet's arguments specify
> + *       the PLL.
> + *
> + * ARMCP_PACKET_LED_SET -
> + *       Set the state of a specified led. The packet's arguments
> + *       specify the led and the desired state.
> + *
> + * ARMCP_PACKET_I2C_WR -
> + *       Write 32-bit value to I2C device. The packet's arguments specify the
> + *       I2C bus, address and value.
> + *
> + * ARMCP_PACKET_I2C_RD -
> + *       Read 32-bit value from I2C device. The packet's arguments specify the
> + *       I2C bus and address.
> + *
> + * ARMCP_PACKET_INFO_GET -
> + *       Fetch information from the device as specified in the packet's
> + *       structure. KMD passes the max size it allows the ArmCP to write to
> + *       the structure, to prevent data corruption in case of mismatched
> + *       KMD/FW versions.
> + *
> + * ARMCP_PACKET_FLASH_PROGRAM_REMOVED - this packet was removed
> + *
> + * ARMCP_PACKET_UNMASK_RAZWI_IRQ -
> + *       Unmask the given IRQ. The IRQ number is specified in the value field.
> + *       The packet is sent after receiving an interrupt and printing its
> + *       relevant information.
> + *
> + * ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY -
> + *       Unmask the given IRQs. The IRQ numbers are specified in an array right
> + *       after the armcp_packet structure, where its first element is the array
> + *       length. The packet is sent after a soft reset was done in order to
> + *       handle any interrupts that were sent during the reset process.
> + *
> + * ARMCP_PACKET_TEST -
> + *       Test packet for ArmCP connectivity. The CPU will put the fence value
> + *       in the result field.
> + *
> + * ARMCP_PACKET_FREQUENCY_CURR_GET -
> + *       Fetch the current frequency of a specified PLL. The packet's arguments
> + *       specify the PLL.
> + *
> + * ARMCP_PACKET_MAX_POWER_GET -
> + *       Fetch the maximal power of the device.
> + *
> + * ARMCP_PACKET_MAX_POWER_SET -
> + *       Set the maximal power of the device. The packet's arguments specify
> + *       the power.
> + *
> + * ARMCP_PACKET_EEPROM_DATA_GET -
> + *       Get EEPROM data from the ArmCP kernel. The buffer is specified in the
> + *       addr field. The CPU will put the returned data size in the result
> + *       field. In addition, KMD passes the max size it allows the ArmCP to
> + *       write to the structure, to prevent data corruption in case of
> + *       mismatched KMD/FW versions.
> + *
> + */
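
The synchronous handshake described at the top of this comment (write the message, clear the fence, notify the ARM core, poll the fence) can be modelled in a few lines. This is a single-threaded sketch under loud assumptions: struct msg is a simplified stand-in for struct armcp_packet, the notify callback replaces the GIC register write, and fw_responds()/fw_silent() are hypothetical firmware behaviours. Real code would need a proper timeout and ordered accesses to the shared memory.

```c
#include <assert.h>
#include <stdint.h>

#define FENCE_VAL 0xFE8CE7A5u /* ARMCP_PACKET_FENCE_VAL from the patch */

struct msg { /* simplified stand-in for struct armcp_packet */
	uint64_t result;
	uint32_t opcode;
	uint32_t fence;
};

typedef void (*notify_fn)(struct msg *shared);

/* Hypothetical firmware: fill the result first, then signal the fence. */
static void fw_responds(struct msg *shared)
{
	shared->result = FENCE_VAL; /* mimics the TEST packet behaviour */
	shared->fence = FENCE_VAL;
}

/* Hypothetical hung firmware: never touches the fence. */
static void fw_silent(struct msg *shared)
{
	(void)shared;
}

/* Host side: returns 0 on success, -1 if the fence was never signalled. */
static int send_msg(struct msg *shared, uint32_t opcode, notify_fn notify,
		    int max_polls, uint64_t *result)
{
	shared->opcode = opcode;
	shared->result = 0;
	shared->fence = 0;  /* clear the fence before notifying */
	notify(shared);     /* stands in for the interrupt to the ARM core */

	for (int i = 0; i < max_polls; i++) {
		if (shared->fence == FENCE_VAL) {
			*result = shared->result;
			return 0;
		}
	}

	return -1;
}
```

Since only one message may be in flight per device, callers of something like send_msg() would be serialized, which is the role send_cpu_message_lock plays in the patch.
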
> +
> +enum armcp_packet_id {
> +	ARMCP_PACKET_DISABLE_PCI_ACCESS = 1,	/* internal */
> +	ARMCP_PACKET_ENABLE_PCI_ACCESS,		/* internal */
> +	ARMCP_PACKET_TEMPERATURE_GET,		/* sysfs */
> +	ARMCP_PACKET_VOLTAGE_GET,		/* sysfs */
> +	ARMCP_PACKET_CURRENT_GET,		/* sysfs */
> +	ARMCP_PACKET_FAN_SPEED_GET,		/* sysfs */
> +	ARMCP_PACKET_PWM_GET,			/* sysfs */
> +	ARMCP_PACKET_PWM_SET,			/* sysfs */
> +	ARMCP_PACKET_FREQUENCY_SET,		/* sysfs */
> +	ARMCP_PACKET_FREQUENCY_GET,		/* sysfs */
> +	ARMCP_PACKET_LED_SET,			/* debugfs */
> +	ARMCP_PACKET_I2C_WR,			/* debugfs */
> +	ARMCP_PACKET_I2C_RD,			/* debugfs */
> +	ARMCP_PACKET_INFO_GET,			/* IOCTL */
> +	ARMCP_PACKET_FLASH_PROGRAM_REMOVED,
> +	ARMCP_PACKET_UNMASK_RAZWI_IRQ,		/* internal */
> +	ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY,	/* internal */
> +	ARMCP_PACKET_TEST,			/* internal */
> +	ARMCP_PACKET_FREQUENCY_CURR_GET,	/* sysfs */
> +	ARMCP_PACKET_MAX_POWER_GET,		/* sysfs */
> +	ARMCP_PACKET_MAX_POWER_SET,		/* sysfs */
> +	ARMCP_PACKET_EEPROM_DATA_GET,		/* sysfs */
> +};
> +
> +#define ARMCP_PACKET_FENCE_VAL	0xFE8CE7A5
> +
> +struct armcp_packet {
> +	union {
> +		__u64 value;	/* For SET packets */
> +		__u64 result;	/* For GET packets */
> +		__u64 addr;	/* For PQ */
> +	};
> +
> +	union {
> +		struct {
> +			__u32:12;
> +			__u32 rc :4;
> +			__u32 opcode :13;
> +			__u32 eng_barrier :1;
> +			__u32 reg_barrier :1;
> +			__u32 msg_barrier :1;
> +		};
> +		__u32 ctl;
> +	};
> +
> +	__u32 fence;		/* Signal to KMD that message is completed */
> +
> +	union {
> +		struct {/* For temperature/current/voltage/fan/pwm get/set */
> +			__u16 sensor_index;
> +			__u16 type;
> +		};
> +
> +		struct {	/* For I2C read/write */
> +			__u8 i2c_bus;
> +			__u8 i2c_addr;
> +			__u8 i2c_reg;
> +			__u8 pad; /* unused */
> +		};
> +
> +		/* For frequency get/set */
> +		__u32 pll_index;
> +
> +		/* For led set */
> +		__u32 led_index;
> +
> +		/* For get Armcp info/EEPROM data */
> +		__u32 data_max_size;
> +	};
> +};
> +
> +struct armcp_unmask_irq_arr_packet {
> +	struct armcp_packet armcp_pkt;
> +	__u32 length;
> +	__u32 irqs[0];
> +};
> +
> +enum armcp_packet_rc {
> +	armcp_packet_success,
> +	armcp_packet_invalid,
> +	armcp_packet_fault
> +};
> +
> +enum armcp_temp_type {
> +	armcp_temp_input,
> +	armcp_temp_max = 6,
> +	armcp_temp_max_hyst,
> +	armcp_temp_crit,
> +	armcp_temp_crit_hyst
> +};
> +
> +enum armcp_in_attributes {
> +	armcp_in_input,
> +	armcp_in_min,
> +	armcp_in_max
> +};
> +
> +enum armcp_curr_attributes {
> +	armcp_curr_input,
> +	armcp_curr_min,
> +	armcp_curr_max
> +};
> +
> +enum armcp_fan_attributes {
> +	armcp_fan_input,
> +	armcp_fan_min = 2,
> +	armcp_fan_max
> +};
> +
> +enum armcp_pwm_attributes {
> +	armcp_pwm_input,
> +	armcp_pwm_enable
> +};
> +
> +/* Event Queue Packets */
> +
> +struct eq_generic_event {
> +	__u64 data[7];
> +};
> +
>  /*
>   * ArmCP info
>   */
> diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
> new file mode 100644
> index 000000000000..97b0de7ea5c2
> --- /dev/null
> +++ b/drivers/misc/habanalabs/irq.c
> @@ -0,0 +1,150 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +
> +#include <linux/dma-mapping.h>
> +
> +
> +/**
> + * hl_cq_inc_ptr - increment ci or pi of cq
> + *
> + * @ptr: the current ci or pi value of the completion queue
> + *
> + * Increment ptr by 1. If it reaches the number of completion queue
> + * entries, set it to 0
> + */
> +inline u32 hl_cq_inc_ptr(u32 ptr)
> +{
> +	ptr++;
> +	if (unlikely(ptr == HL_CQ_LENGTH))
> +		ptr = 0;
> +	return ptr;
> +}
> +
> +/**
> + * hl_irq_handler_cq - irq handler for completion queue
> + *
> + * @irq: irq number
> + * @arg: pointer to completion queue structure
> + *
> + */
> +irqreturn_t hl_irq_handler_cq(int irq, void *arg)
> +{
> +	struct hl_cq *cq = arg;
> +	struct hl_device *hdev = cq->hdev;
> +	struct hl_hw_queue *queue;
> +	struct hl_cs_job *job;
> +	bool shadow_index_valid;
> +	u16 shadow_index;
> +	u32 *cq_entry;
> +	u32 *cq_base;
> +
> +	if (hdev->disabled) {
> +		dev_dbg(hdev->dev,
> +			"Device disabled but received IRQ %d for CQ %d\n",
> +			irq, cq->hw_queue_id);
> +		return IRQ_HANDLED;
> +	}
> +
> +	cq_base = (u32 *) cq->kernel_address;
> +
> +	while (1) {
> +		bool entry_ready = ((cq_base[cq->ci] & CQ_ENTRY_READY_MASK)
> +						>> CQ_ENTRY_READY_SHIFT);
> +
> +		if (!entry_ready)
> +			break;
> +
> +		cq_entry = (u32 *) &cq_base[cq->ci];
> +
> +		/*
> +		 * Make sure we read CQ entry contents after we've
> +		 * checked the ownership bit.
> +		 */
> +		dma_rmb();
> +
> +		shadow_index_valid =
> +			((*cq_entry & CQ_ENTRY_SHADOW_INDEX_VALID_MASK)
> +					>> CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT);
> +
> +		shadow_index = (u16)
> +			((*cq_entry & CQ_ENTRY_SHADOW_INDEX_MASK)
> +					>> CQ_ENTRY_SHADOW_INDEX_SHIFT);
> +
> +		queue = &hdev->kernel_queues[cq->hw_queue_id];
> +
> +		if ((shadow_index_valid) && (!hdev->disabled)) {
> +			job = queue->shadow_queue[hl_pi_2_offset(shadow_index)];
> +			queue_work(hdev->cq_wq, &job->finish_work);
> +		}
> +
> +		/*
> +		 * Update ci of the context's queue. There is no
> +		 * need to protect it with spinlock because this update is
> +		 * done only inside IRQ and there is a different IRQ per
> +		 * queue
> +		 */
> +		queue->ci = hl_queue_inc_ptr(queue->ci);
> +
> +		/* Clear CQ entry ready bit */
> +		cq_base[cq->ci] &= ~CQ_ENTRY_READY_MASK;
> +
> +		cq->ci = hl_cq_inc_ptr(cq->ci);
> +
> +		/* Increment free slots */
> +		atomic_inc(&cq->free_slots_cnt);
> +	}
> +
> +	return IRQ_HANDLED;
> +}
> +
> +/**
> + * hl_cq_init - main initialization function for a cq object
> + *
> + * @hdev: pointer to device structure
> + * @q: pointer to cq structure
> + * @hw_queue_id: The H/W queue ID this completion queue belongs to
> + *
> + * Allocate dma-able memory for the completion queue and initialize fields
> + * Returns 0 on success
> + */
> +int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id)
> +{
> +	void *p;
> +
> +	BUILD_BUG_ON(HL_CQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
> +
> +	p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
> +				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	q->hdev = hdev;
> +	q->kernel_address = (u64) p;
> +	q->hw_queue_id = hw_queue_id;
> +	q->ci = 0;
> +	q->pi = 0;
> +
> +	atomic_set(&q->free_slots_cnt, HL_CQ_LENGTH);
> +
> +	return 0;
> +}
> +
> +/**
> + * hl_cq_fini - destroy completion queue
> + *
> + * @hdev: pointer to device structure
> + * @q: pointer to cq structure
> + *
> + * Free the completion queue memory
> + */
> +void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
> +{
> +	hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
> +			(void *) q->kernel_address, q->bus_address);
> +}
> -- 
> 2.17.1
> 
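
For reference, the ready-bit polling and index wrap-around in
hl_irq_handler_cq() above can be modelled in plain userspace C. This is
only a sketch: the demo_* names, the ring length of 8 and the bit-31 ready
flag are assumptions made for illustration, not the driver's actual values.

```c
#include <stdint.h>

/* Hypothetical values; the real HL_CQ_LENGTH and ready mask live in the driver. */
#define DEMO_CQ_LENGTH		8u
#define DEMO_ENTRY_READY	(1u << 31)

struct demo_cq {
	uint32_t entries[DEMO_CQ_LENGTH];
	uint32_t ci;	/* consumer index */
};

/* Advance an index by one, wrapping to 0 at the end of the ring. */
static uint32_t demo_cq_inc_ptr(uint32_t ptr)
{
	ptr++;
	if (ptr == DEMO_CQ_LENGTH)
		ptr = 0;
	return ptr;
}

/* Consume every ready entry starting at ci; return how many were handled. */
static unsigned int demo_cq_drain(struct demo_cq *cq)
{
	unsigned int handled = 0;

	while (cq->entries[cq->ci] & DEMO_ENTRY_READY) {
		/* A real handler would need a read barrier here (dma_rmb() in
		 * the patch) before looking at the rest of the entry. */
		cq->entries[cq->ci] &= ~DEMO_ENTRY_READY;	/* hand the slot back */
		cq->ci = demo_cq_inc_ptr(cq->ci);
		handled++;
	}

	return handled;
}
```

Because every consumed slot has its ready bit cleared, the loop stops after
at most one full pass over the ring, even if all entries were ready.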

-- 
Sincerely yours,
Mike.



* Re: [PATCH 08/15] habanalabs: add event queue and interrupts
  2019-01-23  0:00 ` [PATCH 08/15] habanalabs: add event queue and interrupts Oded Gabbay
@ 2019-01-25  7:51   ` Mike Rapoport
  2019-01-28 11:14     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-25  7:51 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:50AM +0200, Oded Gabbay wrote:
> This patch adds support for receiving events from Goya's control CPU and
> for receiving MSI-X interrupts from Goya's DMA engines and CPU.
> 
> Goya's PCI controller supports up to 8 MSI-X interrupts, of which only 6
> are currently used. The first 5 interrupts are dedicated to Goya's DMA
> engine queues. The 6th interrupt is dedicated to Goya's control CPU.
> 
> The DMA queue will signal its MSI-X entry upon each completion of a command
> buffer that was placed on its primary queue. The driver will then mark that
> CB as completed and free the related resources. It will also update the
> command submission object which that CB belongs to.
> 
> There is a dedicated event queue (EQ) between the driver and Goya's control
> CPU. The EQ is located on the Host memory. The control CPU writes a new
> entry to the EQ for various reasons, such as ECC error, MMU page fault, Hot
> temperature. After writing the new entry to the EQ, the control CPU will
> trigger its dedicated MSI-X entry to signal the driver that there is a new
> entry in the EQ. The driver will then read the entry and act accordingly.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/device.c            |  35 +-
>  drivers/misc/habanalabs/goya/goya.c         | 522 +++++++++++++++++++-
>  drivers/misc/habanalabs/goya/goyaP.h        |   1 +
>  drivers/misc/habanalabs/habanalabs.h        |  37 ++
>  drivers/misc/habanalabs/include/goya/goya.h |   1 -
>  drivers/misc/habanalabs/irq.c               | 144 ++++++
>  6 files changed, 729 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 98220628a467..9199e070e79e 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -173,9 +173,17 @@ static int device_early_init(struct hl_device *hdev)
>  	hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
>  	if (hdev->cq_wq == NULL) {
>  		dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
> +		rc = -ENOMEM;

Apparently, this fix should have been in one of the earlier patches.
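
To illustrate why the assignment matters, here is a minimal userspace
sketch of the goto-unwind pattern. The demo_* names and the fail flags are
made up for the example; they only stand in for alloc_workqueue() and
friends.

```c
#include <errno.h>
#include <stdlib.h>

struct demo_dev {
	void *cq_wq;
	void *eq_wq;
};

/* Hypothetical stand-in for alloc_workqueue(); fails when 'fail' is nonzero. */
static void *demo_alloc(int fail)
{
	return fail ? NULL : malloc(1);
}

static int demo_early_init(struct demo_dev *dev, int fail_cq, int fail_eq)
{
	int rc;

	dev->cq_wq = demo_alloc(fail_cq);
	if (!dev->cq_wq) {
		/* Set the error code before jumping; otherwise the function
		 * would return whatever rc happened to hold before. */
		rc = -ENOMEM;
		goto out;
	}

	dev->eq_wq = demo_alloc(fail_eq);
	if (!dev->eq_wq) {
		rc = -ENOMEM;
		goto free_cq_wq;
	}

	return 0;

free_cq_wq:
	/* Unwind in reverse order of allocation. */
	free(dev->cq_wq);
	dev->cq_wq = NULL;
out:
	return rc;
}
```

Each failure site sets rc itself, so the labels only have to undo work and
never need to guess which error brought them there.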

>  		goto asid_fini;
>  	}
>  
> +	hdev->eq_wq = alloc_workqueue("hl-events", WQ_UNBOUND, 0);
> +	if (hdev->eq_wq == NULL) {
> +		dev_err(hdev->dev, "Failed to allocate EQ workqueue\n");
> +		rc = -ENOMEM;
> +		goto free_cq_wq;
> +	}
> +
>  	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
>  
>  	mutex_init(&hdev->device_open);
> @@ -184,6 +192,8 @@ static int device_early_init(struct hl_device *hdev)
>  
>  	return 0;
>  
> +free_cq_wq:
> +	destroy_workqueue(hdev->cq_wq);
>  asid_fini:
>  	hl_asid_fini(hdev);
>  early_fini:
> @@ -205,6 +215,7 @@ static void device_early_fini(struct hl_device *hdev)
>  
>  	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
>  
> +	destroy_workqueue(hdev->eq_wq);
>  	destroy_workqueue(hdev->cq_wq);
>  
>  	hl_asid_fini(hdev);
> @@ -343,11 +354,22 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  		}
>  	}
>  
> +	/*
> +	 * Initialize the event queue. Must be done before hw_init,
> +	 * because there the address of the event queue is being
> +	 * passed as argument to request_irq
> +	 */
> +	rc = hl_eq_init(hdev, &hdev->event_queue);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize event queue\n");
> +		goto cq_fini;
> +	}
> +
>  	/* Allocate the kernel context */
>  	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
>  	if (!hdev->kernel_ctx) {
>  		rc = -ENOMEM;
> -		goto cq_fini;
> +		goto eq_fini;
>  	}
>  
>  	hdev->user_ctx = NULL;
> @@ -392,6 +414,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  			"kernel ctx is still alive on initialization failure\n");
>  free_ctx:
>  	kfree(hdev->kernel_ctx);
> +eq_fini:
> +	hl_eq_fini(hdev, &hdev->event_queue);
>  cq_fini:
>  	for (i = 0 ; i < cq_ready_cnt ; i++)
>  		hl_cq_fini(hdev, &hdev->completion_queue[i]);
> @@ -433,6 +457,13 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Mark device as disabled */
>  	hdev->disabled = true;
>  
> +	/*
> +	 * Halt the engines and disable interrupts so we won't get any more
> +	 * completions from H/W and we won't have any accesses from the
> +	 * H/W to the host machine
> +	 */
> +	hdev->asic_funcs->halt_engines(hdev, true);
> +
>  	hl_cb_pool_fini(hdev);
>  
>  	/* Release kernel context */
> @@ -442,6 +473,8 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Reset the H/W. It will be in idle state after this returns */
>  	hdev->asic_funcs->hw_fini(hdev, true);
>  
> +	hl_eq_fini(hdev, &hdev->event_queue);
> +
>  	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
>  		hl_cq_fini(hdev, &hdev->completion_queue[i]);
>  	kfree(hdev->completion_queue);
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index 08d5227eaf1d..6c04277ae0fa 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -92,9 +92,41 @@
>  
>  #define GOYA_MAX_INITIATORS		20
>  
> +#define GOYA_MAX_STRING_LEN		20
> +
>  #define GOYA_CB_POOL_CB_CNT		512
>  #define GOYA_CB_POOL_CB_SIZE		0x20000		/* 128KB */
>  
> +static const char goya_irq_name[GOYA_MSIX_ENTRIES][GOYA_MAX_STRING_LEN] = {
> +		"goya cq 0", "goya cq 1", "goya cq 2", "goya cq 3",
> +		"goya cq 4", "goya cpu eq"
> +};
> +
> +static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
> +	"MME0",
> +	"MME1",
> +	"MME2",
> +	"MME3",
> +	"MME4",
> +	"MME5",
> +	"TPC0",
> +	"TPC1",
> +	"TPC2",
> +	"TPC3",
> +	"TPC4",
> +	"TPC5",
> +	"TPC6",
> +	"TPC7",
> +	"PCI",
> +	"DMA", /* HBW */
> +	"DMA", /* LBW */
> +	"PSOC",
> +	"CPU",
> +	"MMU"
> +};
> +
> +#define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
> +
>  static void goya_get_fixed_properties(struct hl_device *hdev)
>  {
>  	struct asic_fixed_properties *prop = &hdev->asic_prop;
> @@ -139,6 +171,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
>  	prop->cfg_size = CFG_SIZE;
>  	prop->max_asid = MAX_ASID;
> +	prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
>  	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
>  	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
>  	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> @@ -668,15 +701,10 @@ static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
>  	WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
>  	WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
>  
> -	if (dma_id == 0)
> -		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
> +	if (goya->hw_cap_initialized & HW_CAP_MMU)
> +		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_PARTLY_TRUSTED);
>  	else
> -		if (goya->hw_cap_initialized & HW_CAP_MMU)
> -			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> -					QMAN_DMA_PARTLY_TRUSTED);
> -		else
> -			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> -					QMAN_DMA_FULLY_TRUSTED);
> +		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
>  
>  	WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
>  	WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
> @@ -870,6 +898,7 @@ static void goya_resume_external_queues(struct hl_device *hdev)
>  int goya_init_cpu_queues(struct hl_device *hdev)
>  {
>  	struct goya_device *goya = hdev->asic_specific;
> +	struct hl_eq *eq;
>  	dma_addr_t bus_address;
>  	u32 status;
>  	struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
> @@ -881,17 +910,24 @@ int goya_init_cpu_queues(struct hl_device *hdev)
>  	if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
>  		return 0;
>  
> +	eq = &hdev->event_queue;
> +
>  	bus_address = cpu_pq->bus_address +
>  			hdev->asic_prop.host_phys_base_address;
>  	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
>  	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
>  
> +	bus_address = eq->bus_address + hdev->asic_prop.host_phys_base_address;
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_2, lower_32_bits(bus_address));
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_3, upper_32_bits(bus_address));
> +
>  	bus_address = hdev->cpu_accessible_dma_address +
>  			hdev->asic_prop.host_phys_base_address;
>  	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
>  	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
>  
>  	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_4, HL_EQ_SIZE_IN_BYTES);
>  	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
>  
>  	/* Used for EQ CI */
> @@ -2781,6 +2817,163 @@ static void goya_resume_internal_queues(struct hl_device *hdev)
>  	WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
>  }
>  
> +static void goya_dma_stall(struct hl_device *hdev)
> +{
> +	WREG32(mmDMA_QM_0_GLBL_CFG1, 1 << DMA_QM_0_GLBL_CFG1_DMA_STOP_SHIFT);
> +	WREG32(mmDMA_QM_1_GLBL_CFG1, 1 << DMA_QM_1_GLBL_CFG1_DMA_STOP_SHIFT);
> +	WREG32(mmDMA_QM_2_GLBL_CFG1, 1 << DMA_QM_2_GLBL_CFG1_DMA_STOP_SHIFT);
> +	WREG32(mmDMA_QM_3_GLBL_CFG1, 1 << DMA_QM_3_GLBL_CFG1_DMA_STOP_SHIFT);
> +	WREG32(mmDMA_QM_4_GLBL_CFG1, 1 << DMA_QM_4_GLBL_CFG1_DMA_STOP_SHIFT);
> +}
> +
> +static void goya_tpc_stall(struct hl_device *hdev)
> +{
> +	WREG32(mmTPC0_CFG_TPC_STALL, 1 << TPC0_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC1_CFG_TPC_STALL, 1 << TPC1_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC2_CFG_TPC_STALL, 1 << TPC2_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC3_CFG_TPC_STALL, 1 << TPC3_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC4_CFG_TPC_STALL, 1 << TPC4_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC5_CFG_TPC_STALL, 1 << TPC5_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC6_CFG_TPC_STALL, 1 << TPC6_CFG_TPC_STALL_V_SHIFT);
> +	WREG32(mmTPC7_CFG_TPC_STALL, 1 << TPC7_CFG_TPC_STALL_V_SHIFT);
> +}
> +
> +static void goya_mme_stall(struct hl_device *hdev)
> +{
> +	WREG32(mmMME_STALL, 0xFFFFFFFF);
> +}
> +
> +static int goya_enable_msix(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int cq_cnt = hdev->asic_prop.completion_queues_count;
> +	int rc, i, irq_cnt_init, irq;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_MSIX)
> +		return 0;
> +
> +	rc = pci_alloc_irq_vectors(hdev->pdev, GOYA_MSIX_ENTRIES,
> +				GOYA_MSIX_ENTRIES, PCI_IRQ_MSIX);
> +	if (rc < 0) {
> +		dev_err(hdev->dev,
> +			"MSI-X: Failed to enable support -- %d/%d\n",
> +			GOYA_MSIX_ENTRIES, rc);
> +		return rc;
> +	}
> +
> +	for (i = 0, irq_cnt_init = 0 ; i < cq_cnt ; i++, irq_cnt_init++) {
> +		irq = pci_irq_vector(hdev->pdev, i);
> +		rc = request_irq(irq, hl_irq_handler_cq, 0, goya_irq_name[i],
> +				&hdev->completion_queue[i]);
> +		if (rc) {
> +			dev_err(hdev->dev, "Failed to request IRQ %d", irq);
> +			goto free_irqs;
> +		}
> +	}
> +
> +	irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
> +
> +	rc = request_irq(irq, hl_irq_handler_eq, 0,
> +			goya_irq_name[EVENT_QUEUE_MSIX_IDX],
> +			&hdev->event_queue);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to request IRQ %d", irq);
> +		goto free_irqs;
> +	}
> +
> +	goya->hw_cap_initialized |= HW_CAP_MSIX;
> +	return 0;
> +
> +free_irqs:
> +	for (i = 0 ; i < irq_cnt_init ; i++)
> +		free_irq(pci_irq_vector(hdev->pdev, i),
> +			&hdev->completion_queue[i]);
> +
> +	pci_free_irq_vectors(hdev->pdev);
> +	return rc;
> +}
> +
> +static void goya_sync_irqs(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int i;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
> +		return;
> +
> +	/* Wait for all pending IRQs to be finished */
> +	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
> +		synchronize_irq(pci_irq_vector(hdev->pdev, i));
> +
> +	synchronize_irq(pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX));
> +}
> +
> +static void goya_disable_msix(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int i, irq;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
> +		return;
> +
> +	goya_sync_irqs(hdev);
> +
> +	irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
> +	free_irq(irq, &hdev->event_queue);
> +
> +	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++) {
> +		irq = pci_irq_vector(hdev->pdev, i);
> +		free_irq(irq, &hdev->completion_queue[i]);
> +	}
> +
> +	pci_free_irq_vectors(hdev->pdev);
> +
> +	goya->hw_cap_initialized &= ~HW_CAP_MSIX;
> +}
> +
> +static void goya_halt_engines(struct hl_device *hdev, bool hard_reset)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 wait_timeout_ms, cpu_timeout_ms;
> +
> +	dev_info(hdev->dev,
> +		"Halting compute engines and disabling interrupts\n");
> +
> +	if (hdev->pldm) {
> +		wait_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
> +		cpu_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
> +	} else {
> +		wait_timeout_ms = GOYA_RESET_WAIT_MSEC;
> +		cpu_timeout_ms = GOYA_CPU_RESET_WAIT_MSEC;
> +	}
> +
> +	if ((hard_reset) && (goya->hw_cap_initialized & HW_CAP_CPU)) {
> +		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_GOTO_WFE);
> +		if (hdev->fw_loading)
> +			WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> +				GOYA_ASYNC_EVENT_ID_HALT_MACHINE);
> +		msleep(cpu_timeout_ms);
> +	}
> +
> +	goya_stop_external_queues(hdev);
> +	goya_stop_internal_queues(hdev);
> +
> +	msleep(wait_timeout_ms);
> +
> +	goya_dma_stall(hdev);
> +	goya_tpc_stall(hdev);
> +	goya_mme_stall(hdev);
> +
> +	msleep(wait_timeout_ms);
> +
> +	goya_disable_external_queues(hdev);
> +	goya_disable_internal_queues(hdev);
> +
> +	if (hard_reset)
> +		goya_disable_msix(hdev);
> +	else
> +		goya_sync_irqs(hdev);
> +}
>  
>  /**
>   * goya_push_uboot_to_device - Push u-boot FW code to device
> @@ -3166,11 +3359,16 @@ static int goya_hw_init(struct hl_device *hdev)
>  
>  	goya_init_tpc_qmans(hdev);
>  
> +	/* MSI-X must be enabled before CPU queues are initialized */
> +	rc = goya_enable_msix(hdev);
> +	if (rc)
> +		goto disable_queues;
> +
>  	rc = goya_init_cpu_queues(hdev);
>  	if (rc) {
>  		dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
>  			rc);
> -		goto disable_queues;
> +		goto disable_msix;
>  	}
>  
>  	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
> @@ -3204,6 +3402,8 @@ static int goya_hw_init(struct hl_device *hdev)
>  
>  disable_pci_access:
>  	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> +disable_msix:
> +	goya_disable_msix(hdev);
>  disable_queues:
>  	goya_disable_internal_queues(hdev);
>  	goya_disable_external_queues(hdev);
> @@ -3287,6 +3487,7 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
>  					HW_CAP_DMA | HW_CAP_MME |
>  					HW_CAP_MMU | HW_CAP_TPC_MBIST |
>  					HW_CAP_GOLDEN | HW_CAP_TPC);
> +	memset(goya->events_stat, 0, sizeof(goya->events_stat));
>  
>  	if (!hdev->pldm) {
>  		int rc;
> @@ -3772,6 +3973,305 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
>  	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
>  }
>  
> +static void goya_update_eq_ci(struct hl_device *hdev, u32 val)
> +{
> +	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, val);
> +}
> +
> +static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
> +		u16 event_type, char *axi_name, int len)
> +{
> +	if (!strcmp(goya_axi_name[agent_id], "DMA"))
> +		if (event_type >= GOYA_ASYNC_EVENT_ID_DMA0_CH)
> +			snprintf(axi_name, len, "DMA %d",
> +				event_type - GOYA_ASYNC_EVENT_ID_DMA0_CH);
> +		else
> +			snprintf(axi_name, len, "DMA %d",
> +				event_type - GOYA_ASYNC_EVENT_ID_DMA0_QM);
> +	else
> +		snprintf(axi_name, len, "%s", goya_axi_name[agent_id]);
> +}
> +
> +static void goya_print_razwi_info(struct hl_device *hdev, u64 reg,
> +		bool is_hbw, bool is_read, u16 event_type)
> +{
> +	u32 val, id, internal_id, agent_id, y, x;
> +	char axi_name[10] = {0};
> +
> +	val = RREG32(reg);
> +
> +	if (is_hbw) {
> +		id = (val & GOYA_IRQ_HBW_ID_MASK) >> GOYA_IRQ_HBW_ID_SHIFT;
> +		internal_id = (val & GOYA_IRQ_HBW_INTERNAL_ID_MASK) >>
> +				GOYA_IRQ_HBW_INTERNAL_ID_SHIFT;
> +		agent_id = (val & GOYA_IRQ_HBW_AGENT_ID_MASK) >>
> +				GOYA_IRQ_HBW_AGENT_ID_SHIFT;
> +		y = (val & GOYA_IRQ_HBW_Y_MASK) >> GOYA_IRQ_HBW_Y_SHIFT;
> +		x = (val & GOYA_IRQ_HBW_X_MASK) >> GOYA_IRQ_HBW_X_SHIFT;
> +	} else {
> +		id = (val & GOYA_IRQ_LBW_ID_MASK) >> GOYA_IRQ_LBW_ID_SHIFT;
> +		internal_id = (val & GOYA_IRQ_LBW_INTERNAL_ID_MASK) >>
> +				GOYA_IRQ_LBW_INTERNAL_ID_SHIFT;
> +		agent_id = (val & GOYA_IRQ_LBW_AGENT_ID_MASK) >>
> +				GOYA_IRQ_LBW_AGENT_ID_SHIFT;
> +		y = (val & GOYA_IRQ_LBW_Y_MASK) >> GOYA_IRQ_LBW_Y_SHIFT;
> +		x = (val & GOYA_IRQ_LBW_X_MASK) >> GOYA_IRQ_LBW_X_SHIFT;
> +	}

It seems that only agent_id is used; id, internal_id, y and x are assigned
but never read.
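
Something along these lines would probably be enough. It is only a sketch:
the DEMO_* masks and shifts below are invented for the example and do not
reflect the real GOYA_IRQ_*_AGENT_ID layout.

```c
#include <stdint.h>

/* Hypothetical field layout; the real masks are defined in the driver headers. */
#define DEMO_HBW_AGENT_ID_MASK	0x00003E00u
#define DEMO_HBW_AGENT_ID_SHIFT	9
#define DEMO_LBW_AGENT_ID_MASK	0x000001F0u
#define DEMO_LBW_AGENT_ID_SHIFT	4

/* Extract only the initiator (agent) id from a captured RAZWI register value. */
static uint32_t demo_razwi_agent_id(uint32_t val, int is_hbw)
{
	if (is_hbw)
		return (val & DEMO_HBW_AGENT_ID_MASK) >> DEMO_HBW_AGENT_ID_SHIFT;

	return (val & DEMO_LBW_AGENT_ID_MASK) >> DEMO_LBW_AGENT_ID_SHIFT;
}
```

The remaining fields can be decoded later if a message ever needs to print
them.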

> +
> +	if (agent_id >= GOYA_MAX_INITIATORS) {
> +		dev_err(hdev->dev,
> +			"Illegal %s %s with wrong initiator id %d, H/W IRQ %d\n",
> +				is_read ? "read from" : "write to",
> +				is_hbw ? "HBW" : "LBW",
> +				agent_id,
> +				event_type);
> +	} else {
> +		goya_get_axi_name(hdev, agent_id, event_type, axi_name,
> +				sizeof(axi_name));
> +		dev_err(hdev->dev, "Illegal %s by %s %s %s, H/W IRQ %d\n",
> +				is_read ? "read" : "write",
> +				axi_name,
> +				is_read ? "from" : "to",
> +				is_hbw ? "HBW" : "LBW",
> +				event_type);
> +	}
> +}
> +
> +static void goya_print_irq_info(struct hl_device *hdev, u16 event_type)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	bool is_hbw = false, is_read = false, is_info = false;
> +
> +	if (RREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD)) {
> +		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_WT_ID, is_hbw,
> +				is_read, event_type);
> +		WREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD, 0);
> +		is_info = true;
> +	}
> +	if (RREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD)) {
> +		is_read = true;
> +		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_RD_ID, is_hbw,
> +				is_read, event_type);
> +		WREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD, 0);
> +		is_info = true;
> +	}
> +	if (RREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD)) {
> +		is_hbw = true;
> +		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_WT_ID, is_hbw,
> +				is_read, event_type);
> +		WREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD, 0);
> +		is_info = true;
> +	}
> +	if (RREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD)) {
> +		is_hbw = true;
> +		is_read = true;
> +		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_RD_ID, is_hbw,
> +				is_read, event_type);
> +		WREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD, 0);
> +		is_info = true;
> +	}
> +	if (!is_info) {
> +		dev_err(hdev->dev,
> +			"Received H/W interrupt %d, no additional info\n",
> +			event_type);
> +		return;
> +	}
> +
> +	if (goya->hw_cap_initialized & HW_CAP_MMU) {
> +		u32 val = RREG32(mmMMU_PAGE_ERROR_CAPTURE);
> +		u64 addr;
> +
> +		if (val & MMU_PAGE_ERROR_CAPTURE_ENTRY_VALID_MASK) {
> +			addr = val & MMU_PAGE_ERROR_CAPTURE_VA_49_32_MASK;
> +			addr <<= 32;
> +			addr |= RREG32(mmMMU_PAGE_ERROR_CAPTURE_VA);
> +
> +			dev_err(hdev->dev, "MMU page fault on va 0x%llx\n",
> +					addr);
> +
> +			WREG32(mmMMU_PAGE_ERROR_CAPTURE, 0);
> +		}
> +	}
> +}
> +
> +static int goya_unmask_irq(struct hl_device *hdev, u16 event_type)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_UNMASK_RAZWI_IRQ;
> +	pkt.value = event_type;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +			HL_DEVICE_TIMEOUT_USEC, &result);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "failed to unmask RAZWI IRQ %d", event_type);
> +
> +	return rc;
> +}
> +
> +void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry)
> +{
> +	u16 event_type = ((eq_entry->hdr.ctl & EQ_CTL_EVENT_TYPE_MASK)
> +			>> EQ_CTL_EVENT_TYPE_SHIFT);
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	goya->events_stat[event_type]++;
> +
> +	switch (event_type) {
> +	case GOYA_ASYNC_EVENT_ID_PCIE_IF:
> +	case GOYA_ASYNC_EVENT_ID_TPC0_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC1_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC2_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC3_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC4_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC5_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC6_ECC:
> +	case GOYA_ASYNC_EVENT_ID_TPC7_ECC:
> +	case GOYA_ASYNC_EVENT_ID_MME_ECC:
> +	case GOYA_ASYNC_EVENT_ID_MME_ECC_EXT:
> +	case GOYA_ASYNC_EVENT_ID_MMU_ECC:
> +	case GOYA_ASYNC_EVENT_ID_DMA_MACRO:
> +	case GOYA_ASYNC_EVENT_ID_DMA_ECC:
> +	case GOYA_ASYNC_EVENT_ID_CPU_IF_ECC:
> +	case GOYA_ASYNC_EVENT_ID_PSOC_MEM:
> +	case GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT:
> +	case GOYA_ASYNC_EVENT_ID_SRAM0:
> +	case GOYA_ASYNC_EVENT_ID_SRAM1:
> +	case GOYA_ASYNC_EVENT_ID_SRAM2:
> +	case GOYA_ASYNC_EVENT_ID_SRAM3:
> +	case GOYA_ASYNC_EVENT_ID_SRAM4:
> +	case GOYA_ASYNC_EVENT_ID_SRAM5:
> +	case GOYA_ASYNC_EVENT_ID_SRAM6:
> +	case GOYA_ASYNC_EVENT_ID_SRAM7:
> +	case GOYA_ASYNC_EVENT_ID_SRAM8:
> +	case GOYA_ASYNC_EVENT_ID_SRAM9:
> +	case GOYA_ASYNC_EVENT_ID_SRAM10:
> +	case GOYA_ASYNC_EVENT_ID_SRAM11:
> +	case GOYA_ASYNC_EVENT_ID_SRAM12:
> +	case GOYA_ASYNC_EVENT_ID_SRAM13:
> +	case GOYA_ASYNC_EVENT_ID_SRAM14:
> +	case GOYA_ASYNC_EVENT_ID_SRAM15:
> +	case GOYA_ASYNC_EVENT_ID_SRAM16:
> +	case GOYA_ASYNC_EVENT_ID_SRAM17:
> +	case GOYA_ASYNC_EVENT_ID_SRAM18:
> +	case GOYA_ASYNC_EVENT_ID_SRAM19:
> +	case GOYA_ASYNC_EVENT_ID_SRAM20:
> +	case GOYA_ASYNC_EVENT_ID_SRAM21:
> +	case GOYA_ASYNC_EVENT_ID_SRAM22:
> +	case GOYA_ASYNC_EVENT_ID_SRAM23:
> +	case GOYA_ASYNC_EVENT_ID_SRAM24:
> +	case GOYA_ASYNC_EVENT_ID_SRAM25:
> +	case GOYA_ASYNC_EVENT_ID_SRAM26:
> +	case GOYA_ASYNC_EVENT_ID_SRAM27:
> +	case GOYA_ASYNC_EVENT_ID_SRAM28:
> +	case GOYA_ASYNC_EVENT_ID_SRAM29:
> +	case GOYA_ASYNC_EVENT_ID_GIC500:
> +	case GOYA_ASYNC_EVENT_ID_PLL0:
> +	case GOYA_ASYNC_EVENT_ID_PLL1:
> +	case GOYA_ASYNC_EVENT_ID_PLL3:
> +	case GOYA_ASYNC_EVENT_ID_PLL4:
> +	case GOYA_ASYNC_EVENT_ID_PLL5:
> +	case GOYA_ASYNC_EVENT_ID_PLL6:
> +	case GOYA_ASYNC_EVENT_ID_AXI_ECC:
> +	case GOYA_ASYNC_EVENT_ID_L2_RAM_ECC:
> +	case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET:
> +	case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT:
> +		dev_err(hdev->dev,
> +			"Received H/W interrupt %d, reset the chip\n",
> +			event_type);
> +		break;

Looks tough. Any chance some of these values are consecutive and can be
grouped, e.g.

	case GOYA_ASYNC_EVENT_ID_SRAM0 ... GOYA_ASYNC_EVENT_ID_SRAM29:
?
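
A small standalone sketch of what I mean, using the GCC/Clang case-range
extension. The demo_* enum values are assumed to be contiguous here purely
for illustration; whether the real GOYA_ASYNC_EVENT_ID_* values are
consecutive needs to be checked against the header.

```c
/* Hypothetical event IDs, laid out contiguously for the sake of the example. */
enum demo_event_id {
	DEMO_EVENT_PCIE_IF = 1,
	DEMO_EVENT_SRAM0 = 10,
	DEMO_EVENT_SRAM29 = DEMO_EVENT_SRAM0 + 29,
	DEMO_EVENT_DMA0_CH = 50,
	DEMO_EVENT_DMA4_CH = DEMO_EVENT_DMA0_CH + 4
};

enum demo_action {
	DEMO_RESET,
	DEMO_UNMASK,
	DEMO_INVALID
};

/* Map an event to the action taken, grouping consecutive IDs with ranges. */
static enum demo_action demo_classify(unsigned int event_type)
{
	switch (event_type) {
	case DEMO_EVENT_PCIE_IF:
	case DEMO_EVENT_SRAM0 ... DEMO_EVENT_SRAM29:	/* GNU C extension */
		return DEMO_RESET;
	case DEMO_EVENT_DMA0_CH ... DEMO_EVENT_DMA4_CH:
		return DEMO_UNMASK;
	default:
		return DEMO_INVALID;
	}
}
```

Note that a range silently covers every value between its endpoints, so it
is only safe when no unrelated event ID sits inside the range.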

> +
> +	case GOYA_ASYNC_EVENT_ID_PCIE_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC0_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC1_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC2_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC3_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC4_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC5_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC6_DEC:
> +	case GOYA_ASYNC_EVENT_ID_TPC7_DEC:
> +	case GOYA_ASYNC_EVENT_ID_MME_WACS:
> +	case GOYA_ASYNC_EVENT_ID_MME_WACSD:
> +	case GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER:
> +	case GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC:
> +	case GOYA_ASYNC_EVENT_ID_PSOC:
> +	case GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR:
> +	case GOYA_ASYNC_EVENT_ID_TPC0_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC1_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC2_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC3_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC4_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC5_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC6_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC7_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_TPC0_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC1_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC2_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC3_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC4_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC5_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC6_QM:
> +	case GOYA_ASYNC_EVENT_ID_TPC7_QM:
> +	case GOYA_ASYNC_EVENT_ID_MME_QM:
> +	case GOYA_ASYNC_EVENT_ID_MME_CMDQ:
> +	case GOYA_ASYNC_EVENT_ID_DMA0_QM:
> +	case GOYA_ASYNC_EVENT_ID_DMA1_QM:
> +	case GOYA_ASYNC_EVENT_ID_DMA2_QM:
> +	case GOYA_ASYNC_EVENT_ID_DMA3_QM:
> +	case GOYA_ASYNC_EVENT_ID_DMA4_QM:
> +	case GOYA_ASYNC_EVENT_ID_DMA0_CH:
> +	case GOYA_ASYNC_EVENT_ID_DMA1_CH:
> +	case GOYA_ASYNC_EVENT_ID_DMA2_CH:
> +	case GOYA_ASYNC_EVENT_ID_DMA3_CH:
> +	case GOYA_ASYNC_EVENT_ID_DMA4_CH:
> +		goya_print_irq_info(hdev, event_type);
> +		goya_unmask_irq(hdev, event_type);
> +		break;
> +
> +	case GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU:
> +	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH0:
> +	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH1:
> +	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH2:
> +	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH3:
> +	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH4:
> +		dev_info(hdev->dev, "Received H/W interrupt %d\n", event_type);
> +		break;
> +
> +	default:
> +		dev_err(hdev->dev, "Received invalid H/W interrupt %d\n",
> +				event_type);
> +		break;
> +	}
> +}
> +
> +void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	*size = (u32) sizeof(goya->events_stat);
> +
> +	return goya->events_stat;
> +}
> +
>  
>  static void goya_hw_queues_lock(struct hl_device *hdev)
>  {
> @@ -3794,6 +4294,7 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.sw_fini = goya_sw_fini,
>  	.hw_init = goya_hw_init,
>  	.hw_fini = goya_hw_fini,
> +	.halt_engines = goya_halt_engines,
>  	.suspend = goya_suspend,
>  	.resume = goya_resume,
>  	.mmap = goya_mmap,
> @@ -3808,6 +4309,9 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.dma_pool_free = goya_dma_pool_free,
>  	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
>  	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
> +	.update_eq_ci = goya_update_eq_ci,
> +	.handle_eqe = goya_handle_eqe,
> +	.get_events_stat = goya_get_events_stat,
>  	.hw_queues_lock = goya_hw_queues_lock,
>  	.hw_queues_unlock = goya_hw_queues_unlock,
>  	.send_cpu_message = goya_send_cpu_message
> diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> index 598a718d3df1..c6bfcb6c6905 100644
> --- a/drivers/misc/habanalabs/goya/goyaP.h
> +++ b/drivers/misc/habanalabs/goya/goyaP.h
> @@ -123,6 +123,7 @@ struct goya_device {
>  	/* TODO: remove hw_queues_lock after moving to scheduler code */
>  	spinlock_t	hw_queues_lock;
>  	u64		ddr_bar_cur_addr;
> +	u32		events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
>  	u32		hw_cap_initialized;
>  };
>  
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index 8232e2259463..899bf98eb002 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -83,6 +83,7 @@ struct hw_queue_properties {
>   * @cfg_size: configuration space size on SRAM.
>   * @sram_size: total size of SRAM.
>   * @max_asid: maximum number of open contexts (ASIDs).
> + * @num_of_events: number of possible internal H/W IRQs.
>   * @completion_queues_count: number of completion queues.
>   * @high_pll: high PLL frequency used by the device.
>   * @cb_pool_cb_cnt: number of CBs in the CB pool.
> @@ -109,6 +110,7 @@ struct asic_fixed_properties {
>  	u32			cfg_size;
>  	u32			sram_size;
>  	u32			max_asid;
> +	u32			num_of_events;
>  	u32			high_pll;
>  	u32			cb_pool_cb_cnt;
>  	u32			cb_pool_cb_size;
> @@ -209,6 +211,9 @@ struct hl_cs_job;
>  #define HL_CQ_LENGTH			HL_QUEUE_LENGTH
>  #define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
>  
> +/* Must be power of 2 (HL_PAGE_SIZE / HL_EQ_ENTRY_SIZE) */
> +#define HL_EQ_LENGTH			64
> +#define HL_EQ_SIZE_IN_BYTES		(HL_EQ_LENGTH * HL_EQ_ENTRY_SIZE)
>  
>  
>  /**
> @@ -256,6 +261,20 @@ struct hl_cq {
>  	atomic_t		free_slots_cnt;
>  };
>  
> +/**
> + * struct hl_eq - describes the event queue (single one per device)
> + * @hdev: pointer to the device structure
> + * @kernel_address: holds the queue's kernel virtual address
> + * @bus_address: holds the queue's DMA address
> + * @ci: ci inside the queue
> + */
> +struct hl_eq {
> +	struct hl_device	*hdev;
> +	u64			kernel_address;
> +	dma_addr_t		bus_address;
> +	u32			ci;
> +};
> +
>  
>  
>  
> @@ -288,6 +307,9 @@ enum hl_asic_type {
>   * @sw_fini: tears down driver state, does not configure H/W.
>   * @hw_init: sets up the H/W state.
>   * @hw_fini: tears down the H/W state.
> + * @halt_engines: halt engines, needed for reset sequence. This also disables
> + *                interrupts from the device. Should be called before
> + *                hw_fini and before CS rollback.
>   * @suspend: handles IP specific H/W or SW changes for suspend.
>   * @resume: handles IP specific H/W or SW changes for resume.
>   * @mmap: mmap function, does nothing.
> @@ -303,6 +325,9 @@ enum hl_asic_type {
>   * @dma_pool_free: free small DMA allocation from pool.
>   * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
>   * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
> + * @update_eq_ci: update event queue CI.
> + * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
> + * @get_events_stat: retrieve event queue entries histogram.
>   * @hw_queues_lock: acquire H/W queues lock.
>   * @hw_queues_unlock: release H/W queues lock.
>   * @send_cpu_message: send buffer to ArmCP.
> @@ -314,6 +339,7 @@ struct hl_asic_funcs {
>  	int (*sw_fini)(struct hl_device *hdev);
>  	int (*hw_init)(struct hl_device *hdev);
>  	void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
> +	void (*halt_engines)(struct hl_device *hdev, bool hard_reset);
>  	int (*suspend)(struct hl_device *hdev);
>  	int (*resume)(struct hl_device *hdev);
>  	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> @@ -336,6 +362,10 @@ struct hl_asic_funcs {
>  				size_t size, dma_addr_t *dma_handle);
>  	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
>  				size_t size, void *vaddr);
> +	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
> +	void (*handle_eqe)(struct hl_device *hdev,
> +				struct hl_eq_entry *eq_entry);
> +	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
>  	void (*hw_queues_lock)(struct hl_device *hdev);
>  	void (*hw_queues_unlock)(struct hl_device *hdev);
>  	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
> @@ -474,6 +504,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @kernel_ctx: KMD context structure.
>   * @kernel_queues: array of hl_hw_queue.
>   * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
> + * @event_queue: event queue for IRQ from ArmCP.
>   * @dma_pool: DMA pool for small allocations.
>   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
>   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> @@ -504,9 +535,11 @@ struct hl_device {
>  	enum hl_asic_type		asic_type;
>  	struct hl_cq			*completion_queue;
>  	struct workqueue_struct		*cq_wq;
> +	struct workqueue_struct		*eq_wq;
>  	struct hl_ctx			*kernel_ctx;
>  	struct hl_hw_queue		*kernel_queues;
>  	struct hl_cb_mgr		kernel_cb_mgr;
> +	struct hl_eq			event_queue;
>  	struct dma_pool			*dma_pool;
>  	void				*cpu_accessible_dma_mem;
>  	dma_addr_t			cpu_accessible_dma_address;
> @@ -593,6 +626,10 @@ void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
>  
>  int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
>  void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
> +int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
> +void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
> +irqreturn_t hl_irq_handler_cq(int irq, void *arg);
> +irqreturn_t hl_irq_handler_eq(int irq, void *arg);
>  int hl_asid_init(struct hl_device *hdev);
>  void hl_asid_fini(struct hl_device *hdev);
>  unsigned long hl_asid_alloc(struct hl_device *hdev);
> diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
> index 2d0efb7b44bb..bcc461760e5f 100644
> --- a/drivers/misc/habanalabs/include/goya/goya.h
> +++ b/drivers/misc/habanalabs/include/goya/goya.h
> @@ -65,7 +65,6 @@
>  
>  #define GOYA_MSIX_ENTRIES	8
>  #define EVENT_QUEUE_MSIX_IDX	5
> -#define ARMCP_RESET_MSIX_IDX	6
>  
>  #define QMAN_PQ_ENTRY_SIZE	16			/* Bytes */
>  
> diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
> index 97b0de7ea5c2..9586323e7dfb 100644
> --- a/drivers/misc/habanalabs/irq.c
> +++ b/drivers/misc/habanalabs/irq.c
> @@ -9,6 +9,18 @@
>  
>  #include <linux/dma-mapping.h>
>  
> +/**
> + * This structure is used to schedule work of EQ entry and armcp_reset event
> + *
> + * @eq_work          - workqueue object to run when EQ entry is received
> + * @hdev             - pointer to device structure
> + * @eq_entry         - copy of the EQ entry
> + */
> +struct hl_eqe_work {
> +	struct work_struct	eq_work;
> +	struct hl_device	*hdev;
> +	struct hl_eq_entry	eq_entry;
> +};
>  
>  /**
>   * hl_cq_inc_ptr - increment ci or pi of cq
> @@ -26,6 +38,33 @@ inline u32 hl_cq_inc_ptr(u32 ptr)
>  	return ptr;
>  }
>  
> +/**
> + * hl_eq_inc_ptr - increment ci of eq
> + *
> + * @ptr: the current ci value of the event queue
> + *
> + * Increment ptr by 1. If it reaches the number of event queue
> + * entries, set it to 0
> + */
> +inline u32 hl_eq_inc_ptr(u32 ptr)
> +{
> +	ptr++;
> +	if (unlikely(ptr == HL_EQ_LENGTH))
> +		ptr = 0;
> +	return ptr;
> +}
> +
> +static void irq_handle_eqe(struct work_struct *work)
> +{
> +	struct hl_eqe_work *eqe_work = container_of(work, struct hl_eqe_work,
> +							eq_work);
> +	struct hl_device *hdev = eqe_work->hdev;
> +
> +	hdev->asic_funcs->handle_eqe(hdev, &eqe_work->eq_entry);
> +
> +	kfree(eqe_work);
> +}
> +
>  /**
>   * hl_irq_handler_cq - irq handler for completion queue
>   *
> @@ -103,6 +142,68 @@ irqreturn_t hl_irq_handler_cq(int irq, void *arg)
>  	return IRQ_HANDLED;
>  }
>  
> +/**
> + * hl_irq_handler_eq - irq handler for event queue
> + *
> + * @irq: irq number
> + * @arg: pointer to event queue structure
> + *
> + */
> +irqreturn_t hl_irq_handler_eq(int irq, void *arg)
> +{
> +	struct hl_eq *eq = arg;
> +	struct hl_device *hdev = eq->hdev;
> +	struct hl_eq_entry *eq_entry;
> +	struct hl_eq_entry *eq_base;
> +	struct hl_eqe_work *handle_eqe_work;
> +
> +	eq_base = (struct hl_eq_entry *) eq->kernel_address;
> +
> +	while (1) {
> +		bool entry_ready =
> +				((eq_base[eq->ci].hdr.ctl & EQ_CTL_READY_MASK)
> +						>> EQ_CTL_READY_SHIFT);
> +
> +		if (!entry_ready)
> +			break;
> +
> +		eq_entry = &eq_base[eq->ci];
> +
> +		/*
> +		 * Make sure we read EQ entry contents after we've
> +		 * checked the ownership bit.
> +		 */
> +		dma_rmb();
> +
> +		if (hdev->disabled) {
> +			dev_warn(hdev->dev,
> +				"Device disabled but received IRQ %d for EQ\n",
> +					irq);
> +			goto skip_irq;
> +		}
> +
> +		handle_eqe_work = kmalloc(sizeof(*handle_eqe_work), GFP_ATOMIC);
> +		if (handle_eqe_work) {

I couldn't find where it is freed.
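
For reference, irq_handle_eqe() quoted further up ends with
kfree(eqe_work), so the worker looks like the owner that releases this
allocation. Below is a minimal userspace sketch of that hand-off, only to
illustrate the pattern. struct work_item, queue_eqe(), run_work() and the
two counters are hypothetical stand-ins, not driver or workqueue API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified stand-ins for work_struct and hl_eq_entry. */
struct work_item {
	void (*fn)(struct work_item *work);
};

struct eq_entry {
	unsigned int ctl;
	unsigned int data;
};

struct eqe_work {
	struct work_item work;
	struct eq_entry entry;	/* private copy of the queue entry */
};

static unsigned int last_handled_data;
static int live_allocations;

/* Worker: consumes the copied entry, then frees the wrapper it was given. */
static void handle_eqe(struct work_item *work)
{
	struct eqe_work *w;

	assert(work != NULL);
	w = container_of(work, struct eqe_work, work);

	last_handled_data = w->entry.data;
	free(w);
	live_allocations--;
}

/* Producer: allocates the wrapper and hands ownership to the worker. */
static struct work_item *queue_eqe(const struct eq_entry *src)
{
	struct eqe_work *w = malloc(sizeof(*w));

	if (!w)
		return NULL;

	live_allocations++;
	w->work.fn = handle_eqe;
	memcpy(&w->entry, src, sizeof(*src));
	return &w->work;
}

/* Stand-in for the workqueue running the item once. */
static void run_work(struct work_item *work)
{
	work->fn(work);
}
```

The point of the sketch is that the producer never frees on the success
path. Whoever runs the work item is responsible for the free, so every
queued item has to run (or be flushed) before teardown.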

> +			INIT_WORK(&handle_eqe_work->eq_work, irq_handle_eqe);
> +			handle_eqe_work->hdev = hdev;
> +
> +			memcpy(&handle_eqe_work->eq_entry, eq_entry,
> +					sizeof(*eq_entry));
> +
> +			queue_work(hdev->eq_wq, &handle_eqe_work->eq_work);
> +		}
> +skip_irq:
> +		/* Clear EQ entry ready bit */
> +		eq_entry->hdr.ctl &= ~EQ_CTL_READY_MASK;
> +
> +		eq->ci = hl_eq_inc_ptr(eq->ci);
> +
> +		hdev->asic_funcs->update_eq_ci(hdev, eq->ci);
> +	}
> +
> +	return IRQ_HANDLED;
> +}
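
To check my reading of the consumer loop above, here is a small userspace
model of it: walk the ring from ci while the ready bit is set, clear the
bit, advance ci with wrap-around. The names, the ring length and the bit
position are all hypothetical. The model wraps with a mask (valid because
the length is a power of 2) instead of the compare-and-reset used in the
patch, and it leaves out dma_rmb() and the workqueue hand-off.

```c
#include <assert.h>
#include <stdint.h>

#define EQ_LENGTH	8u		/* hypothetical; must be a power of 2 */
#define READY_MASK	0x80000000u	/* hypothetical ready-bit position */

struct model_entry {
	uint32_t ctl;
	uint32_t data;
};

struct model_eq {
	struct model_entry ring[EQ_LENGTH];
	uint32_t ci;	/* consumer index */
};

/* Advance the index by one and wrap at the end of the ring. */
static uint32_t eq_inc_ptr(uint32_t ptr)
{
	return (ptr + 1) & (EQ_LENGTH - 1);
}

/*
 * Consume every ready entry starting at ci, summing the payloads.
 * Returns the number of entries consumed.
 */
static unsigned int eq_drain(struct model_eq *eq, uint32_t *sum)
{
	unsigned int handled = 0;

	assert(eq && sum);

	while (eq->ring[eq->ci].ctl & READY_MASK) {
		struct model_entry *e = &eq->ring[eq->ci];

		*sum += e->data;
		e->ctl &= ~READY_MASK;	/* hand the slot back to the producer */
		eq->ci = eq_inc_ptr(eq->ci);
		handled++;
	}

	return handled;
}
```

Since each consumed slot has its ready bit cleared, the loop ends after
at most EQ_LENGTH iterations even if the whole ring was full on entry.
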
> +
>  /**
>   * hl_cq_init - main initialization function for an cq object
>   *
> @@ -148,3 +249,46 @@ void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
>  	hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
>  			(void *) q->kernel_address, q->bus_address);
>  }
> +
> +/**
> + * hl_eq_init - main initialization function for an event queue object
> + *
> + * @hdev: pointer to device structure
> + * @q: pointer to eq structure
> + *
> + * Allocate dma-able memory for the event queue and initialize fields
> + * Returns 0 on success
> + */
> +int hl_eq_init(struct hl_device *hdev, struct hl_eq *q)
> +{
> +	void *p;
> +
> +	BUILD_BUG_ON(HL_EQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
> +
> +	p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
> +				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	q->hdev = hdev;
> +	q->kernel_address = (u64) p;
> +	q->ci = 0;
> +
> +	return 0;
> +}
> +
> +/**
> + * hl_eq_fini - destroy event queue
> + *
> + * @hdev: pointer to device structure
> + * @q: pointer to eq structure
> + *
> + * Free the event queue memory
> + */
> +void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q)
> +{
> +	flush_workqueue(hdev->eq_wq);
> +
> +	hdev->asic_funcs->dma_free_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
> +			(void *) q->kernel_address, q->bus_address);
> +}
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 09/15] habanalabs: add sysfs and hwmon support
  2019-01-23  0:00 ` [PATCH 09/15] habanalabs: add sysfs and hwmon support Oded Gabbay
@ 2019-01-25  7:54   ` Mike Rapoport
  2019-01-28 11:26     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-25  7:54 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:51AM +0200, Oded Gabbay wrote:
> This patch adds the sysfs and hwmon entries that are exposed by the driver.
> 
> Goya has several sensors, from various categories such as temperature,
> voltage, current, etc. The driver exposes those sensors in the standard
> hwmon mechanism.
> 
> In addition, the driver exposes a couple of interfaces in sysfs, both for
> configuration and for providing status of the device or driver.
> 
> The configuration attributes are for Power Management:
> - Automatic or manual
> - Frequency value when moving to high frequency mode
> - Maximum power the device is allowed to consume
> 
> The rest of the attributes are read-only and provide the following
> information:
> - Versions of the various firmwares running on the device
> - Contents of the device's EEPROM
> - The device type (currently only Goya is supported)
> - PCI address of the device (to allow user-space to map /dev/hlX to its
>   PCI address)
> - Status of the device (operational, malfunction, in_reset)
> - How many processes have the device's file open
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  .../ABI/testing/sysfs-driver-habanalabs       | 190 ++++++
>  drivers/misc/habanalabs/Makefile              |   2 +-
>  drivers/misc/habanalabs/device.c              | 146 +++++
>  drivers/misc/habanalabs/goya/Makefile         |   2 +-
>  drivers/misc/habanalabs/goya/goya.c           | 230 +++++++
>  drivers/misc/habanalabs/goya/goyaP.h          |  21 +
>  drivers/misc/habanalabs/goya/goya_hwmgr.c     | 306 +++++++++
>  drivers/misc/habanalabs/habanalabs.h          |  97 +++
>  drivers/misc/habanalabs/habanalabs_drv.c      |   7 +
>  drivers/misc/habanalabs/hwmon.c               | 449 +++++++++++++
>  drivers/misc/habanalabs/sysfs.c               | 588 ++++++++++++++++++
>  11 files changed, 2036 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
>  create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
>  create mode 100644 drivers/misc/habanalabs/hwmon.c
>  create mode 100644 drivers/misc/habanalabs/sysfs.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
> new file mode 100644
> index 000000000000..19edd4da87c1
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
> @@ -0,0 +1,190 @@
> +What:           /sys/class/habanalabs/hl<n>/armcp_kernel_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the Linux kernel running on the device's CPU
> +
> +What:           /sys/class/habanalabs/hl<n>/armcp_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the application running on the device's CPU
> +
> +What:           /sys/class/habanalabs/hl<n>/cpld_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the Device's CPLD F/W
> +
> +What:           /sys/class/habanalabs/hl<n>/device_type
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the code name of the device according to its type.
> +                The supported values are: "GOYA"
> +
> +What:           /sys/class/habanalabs/hl<n>/eeprom
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    A binary file attribute that contains the contents of the
> +                on-board EEPROM
> +
> +What:           /sys/class/habanalabs/hl<n>/fuse_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the device's version from the eFuse
> +
> +What:           /sys/class/habanalabs/hl<n>/hard_reset
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Interface to trigger a hard-reset operation for the device.
> +                Hard-reset will reset ALL internal components of the device
> +                except for the PCI interface and the internal PLLs
> +
> +What:           /sys/class/habanalabs/hl<n>/hard_reset_cnt
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays how many times the device has undergone a hard-reset
> +                operation
> +
> +What:           /sys/class/habanalabs/hl<n>/high_pll
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Allows the user to set the maximum clock frequency for MME, TPC
> +                and IC when the power management profile is set to "automatic".
> +
> +What:           /sys/class/habanalabs/hl<n>/ic_clk
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Allows the user to set the maximum clock frequency of the
> +                Interconnect fabric. Writes to this parameter affect the device
> +                only when the power management profile is set to "manual" mode.
> +                The device IC clock might be set to a lower value than the
> +                maximum. The user should read the ic_clk_curr to see the actual
> +                frequency value of the IC
> +
> +What:           /sys/class/habanalabs/hl<n>/ic_clk_curr
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the current clock frequency of the Interconnect fabric
> +
> +What:           /sys/class/habanalabs/hl<n>/infineon_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the Device's power supply F/W code
> +
> +What:           /sys/class/habanalabs/hl<n>/max_power
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Allows the user to set the maximum power consumption of the
> +                device in milliwatts.
> +
> +What:           /sys/class/habanalabs/hl<n>/mme_clk
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Allows the user to set the maximum clock frequency of the
> +                MME compute engine. Writes to this parameter affect the device
> +                only when the power management profile is set to "manual" mode.
> +                The device MME clock might be set to a lower value than the
> +                maximum. The user should read the mme_clk_curr to see the actual
> +                frequency value of the MME
> +
> +What:           /sys/class/habanalabs/hl<n>/mme_clk_curr
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the current clock frequency of the MME compute engine
> +
> +What:           /sys/class/habanalabs/hl<n>/pci_addr
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the PCI address of the device. This is needed so the
> +                user would be able to open a device based on its PCI address
> +
> +What:           /sys/class/habanalabs/hl<n>/pm_mng_profile
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Power management profile. Values are "auto", "manual". In "auto"
> +                mode, the driver will set the maximum clock frequency to a high
> +                value when a user-space process opens the device's file (unless
> +                it was already opened by another process). The driver will set
> +                the max clock frequency to a low value when there are no user
> +                processes that have the device's file open. In "manual"
> +                mode, the user sets the maximum clock frequency by writing to
> +                ic_clk, mme_clk and tpc_clk
> +
> +
> +What:           /sys/class/habanalabs/hl<n>/preboot_btl_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the device's preboot F/W code
> +
> +What:           /sys/class/habanalabs/hl<n>/soft_reset
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Interface to trigger a soft-reset operation for the device.
> +                Soft-reset will reset only the compute and DMA engines of the
> +                device
> +
> +What:           /sys/class/habanalabs/hl<n>/soft_reset_cnt
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays how many times the device has undergone a soft-reset
> +                operation
> +
> +What:           /sys/class/habanalabs/hl<n>/status
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Status of the card: "Operational", "Malfunction", "In reset".
> +
> +What:           /sys/class/habanalabs/hl<n>/thermal_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the Device's thermal daemon
> +
> +What:           /sys/class/habanalabs/hl<n>/tpc_clk
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Allows the user to set the maximum clock frequency of the
> +                TPC compute engines. Writes to this parameter affect the device
> +                only when the power management profile is set to "manual" mode.
> +                The device TPC clock might be set to a lower value than the
> +                maximum. The user should read the tpc_clk_curr to see the actual
> +                frequency value of the TPC
> +
> +What:           /sys/class/habanalabs/hl<n>/tpc_clk_curr
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the current clock frequency of the TPC compute engines
> +
> +What:           /sys/class/habanalabs/hl<n>/uboot_ver
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Version of the u-boot running on the device's CPU
> +
> +What:           /sys/class/habanalabs/hl<n>/write_open_cnt
> +Date:           Jan 2019
> +KernelVersion:  5.1
> +Contact:        oded.gabbay@gmail.com
> +Description:    Displays the total number of user processes that currently
> +                have the device's file open
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index c07f3ccb57dc..b5607233d216 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -5,7 +5,7 @@
>  obj-m	:= habanalabs.o
>  
>  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> -		command_buffer.o hw_queue.o irq.o
> +		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
>  
>  include $(src)/goya/Makefile
>  habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 9199e070e79e..ff7b610f18c4 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -226,6 +226,118 @@ static void device_early_fini(struct hl_device *hdev)
>  	mutex_destroy(&hdev->device_open);
>  }
>  
> +static void set_freq_to_low_job(struct work_struct *work)
> +{
> +	struct hl_device *hdev = container_of(work, struct hl_device,
> +						work_freq.work);
> +
> +	if (atomic_read(&hdev->fd_open_cnt) == 0)
> +		hl_device_set_frequency(hdev, PLL_LOW);
> +
> +	schedule_delayed_work(&hdev->work_freq,
> +			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
> +}
> +
> +/**
> + * device_late_init - do late stuff initialization for the habanalabs device
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + * Do stuff that either needs the device H/W queues to be active or needs
> + * to happen after all the rest of the initialization is finished
> + */
> +static int device_late_init(struct hl_device *hdev)
> +{
> +	int rc;
> +
> +	INIT_DELAYED_WORK(&hdev->work_freq, set_freq_to_low_job);
> +	hdev->high_pll = hdev->asic_prop.high_pll;
> +
> +	/* force setting to low frequency */
> +	atomic_set(&hdev->curr_pll_profile, PLL_LOW);
> +
> +	if (hdev->pm_mng_profile == PM_AUTO)
> +		hdev->asic_funcs->set_pll_profile(hdev, PLL_LOW);
> +	else
> +		hdev->asic_funcs->set_pll_profile(hdev, PLL_LAST);
> +
> +	if (hdev->asic_funcs->late_init) {
> +		rc = hdev->asic_funcs->late_init(hdev);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"failed late initialization for the H/W\n");
> +			return rc;
> +		}
> +	}
> +
> +	schedule_delayed_work(&hdev->work_freq,
> +			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
> +
> +	hdev->late_init_done = true;
> +
> +	return 0;
> +}
> +
> +/**
> + * device_late_fini - finalize all that was done in device_late_init
> + *
> + * @hdev: pointer to habanalabs device structure
> + *
> + */
> +static void device_late_fini(struct hl_device *hdev)
> +{
> +	if (!hdev->late_init_done)
> +		return;
> +
> +	cancel_delayed_work_sync(&hdev->work_freq);
> +
> +	if (hdev->asic_funcs->late_fini)
> +		hdev->asic_funcs->late_fini(hdev);
> +
> +	hdev->late_init_done = false;
> +}
> +
> +/**
> + * hl_device_set_frequency - set the frequency of the device
> + *
> + * @hdev: pointer to habanalabs device structure
> + * @freq: the new frequency value
> + *
> + * Change the frequency if needed.
> + * We allow setting the PLL to low only if there is no user process.
> + * Returns 0 if no change was done, otherwise returns 1.
> + */
> +int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq)
> +{
> +	enum hl_pll_frequency old_freq =
> +			(freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;
> +	int ret;
> +
> +	if (hdev->pm_mng_profile == PM_MANUAL)
> +		return 0;
> +
> +	ret = atomic_cmpxchg(&hdev->curr_pll_profile, old_freq, freq);
> +	if (ret == freq)
> +		return 0;
> +
> +	/*
> +	 * in case we want to lower frequency, check if device is not
> +	 * opened. We must have a check here to workaround race condition with
> +	 * hl_device_open
> +	 */
> +	if ((freq == PLL_LOW) && (atomic_read(&hdev->fd_open_cnt) > 0)) {
> +		atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
> +		return 0;
> +	}
> +
> +	dev_dbg(hdev->dev, "Changing device frequency to %s\n",
> +		freq == PLL_HIGH ? "high" : "low");
> +
> +	hdev->asic_funcs->set_pll_profile(hdev, freq);
> +
> +	return 1;
> +}
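
A userspace model of the cmpxchg logic above, as I read it, using C11
atomics. It is a sketch only: struct model_dev and model_set_frequency()
are hypothetical names, hw_writes stands in for the call to
set_pll_profile(), and the model assumes the profile only ever holds
PLL_LOW or PLL_HIGH, so any failed compare-and-swap is treated as
"already at the requested frequency".

```c
#include <assert.h>
#include <stdatomic.h>

enum pll_freq { PLL_LOW, PLL_HIGH };

/* Hypothetical model of the device state touched by the quoted function. */
struct model_dev {
	atomic_int curr_pll_profile;
	atomic_int fd_open_cnt;
	int manual_mode;	/* stands in for pm_mng_profile == PM_MANUAL */
	int hw_writes;		/* counts calls that would reach set_pll_profile */
};

/* Returns 1 if the frequency was changed, 0 otherwise. */
static int model_set_frequency(struct model_dev *dev, enum pll_freq freq)
{
	int expected = (freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;

	assert(freq == PLL_LOW || freq == PLL_HIGH);

	if (dev->manual_mode)
		return 0;

	/* Swap only if the profile currently holds the opposite value. */
	if (!atomic_compare_exchange_strong(&dev->curr_pll_profile,
					    &expected, freq))
		return 0;

	/* Do not drop to low while a process still has the device open. */
	if (freq == PLL_LOW && atomic_load(&dev->fd_open_cnt) > 0) {
		atomic_store(&dev->curr_pll_profile, PLL_HIGH);
		return 0;
	}

	dev->hw_writes++;
	return 1;
}
```

In the model, a request for low frequency while the open count is non-zero
briefly stores PLL_LOW and then restores PLL_HIGH without touching the
hardware, which matches the race workaround described in the comment.
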
> +
>  /**
>   * hl_device_suspend - initiate device suspend
>   *
> @@ -386,6 +498,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  		goto release_ctx;
>  	}
>  
> +	rc = hl_sysfs_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to initialize sysfs\n");
> +		goto free_cb_pool;
> +	}
> +
>  	rc = hdev->asic_funcs->hw_init(hdev);
>  	if (rc) {
>  		dev_err(hdev->dev, "failed to initialize the H/W\n");
> @@ -403,11 +521,33 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  		goto out_disabled;
>  	}
>  
> +	/* After test_queues, KMD can start sending messages to device CPU */
> +
> +	rc = device_late_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed late initialization\n");
> +		rc = 0;

Isn't this an error?

> +		goto out_disabled;
> +	}
> +
> +	dev_info(hdev->dev, "Found %s device with %lluGB DRAM\n",
> +		hdev->asic_name,
> +		hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
> +
> +	rc = hl_hwmon_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to initialize hwmon\n");
> +		rc = 0;

Ditto

> +		goto out_disabled;
> +	}
> +
>  	dev_notice(hdev->dev,
>  		"Successfully added device to habanalabs driver\n");
>  
>  	return 0;
>  
> +free_cb_pool:
> +	hl_cb_pool_fini(hdev);
>  release_ctx:
>  	if (hl_ctx_put(hdev->kernel_ctx) != 1)
>  		dev_err(hdev->dev,
> @@ -457,6 +597,12 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Mark device as disabled */
>  	hdev->disabled = true;
>  
> +	hl_hwmon_fini(hdev);
> +
> +	device_late_fini(hdev);
> +
> +	hl_sysfs_fini(hdev);
> +
>  	/*
>  	 * Halt the engines and disable interrupts so we won't get any more
>  	 * completions from H/W and we won't have any accesses from the
> diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
> index a57096fa41b6..ada8518ec215 100644
> --- a/drivers/misc/habanalabs/goya/Makefile
> +++ b/drivers/misc/habanalabs/goya/Makefile
> @@ -1,3 +1,3 @@
>  subdir-ccflags-y += -I$(src)
>  
> -HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
> \ No newline at end of file
> +HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o goya/goya_hwmgr.o
> \ No newline at end of file
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index 6c04277ae0fa..7899ff762e0b 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -127,6 +127,8 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
>  
>  #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
>  
> +static int goya_armcp_info_get(struct hl_device *hdev);
> +
>  static void goya_get_fixed_properties(struct hl_device *hdev)
>  {
>  	struct asic_fixed_properties *prop = &hdev->asic_prop;
> @@ -174,6 +176,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
>  	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
>  	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
> +	prop->max_power_default = MAX_POWER_DEFAULT;
>  	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
>  
>  	prop->high_pll = PLL_HIGH_DEFAULT;
> @@ -558,6 +561,89 @@ int goya_early_fini(struct hl_device *hdev)
>  	return 0;
>  }
>  
> +/**
> + * goya_fetch_psoc_frequency - Fetch PSOC frequency values
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + */
> +static void goya_fetch_psoc_frequency(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +
> +	prop->psoc_pci_pll_nr = RREG32(mmPSOC_PCI_PLL_NR);
> +	prop->psoc_pci_pll_nf = RREG32(mmPSOC_PCI_PLL_NF);
> +	prop->psoc_pci_pll_od = RREG32(mmPSOC_PCI_PLL_OD);
> +	prop->psoc_pci_pll_div_factor = RREG32(mmPSOC_PCI_PLL_DIV_FACTOR_1);
> +}
> +
> +/**
> + * goya_late_init - GOYA late initialization code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Get ArmCP info and send message to CPU to enable PCI access
> + */
> +static int goya_late_init(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct goya_device *goya = hdev->asic_specific;
> +	int rc;
> +
> +	rc = goya->armcp_info_get(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to get armcp info\n");
> +		return rc;
> +	}
> +
> +	/* Now that we have the DRAM size in ASIC prop, we need to check
> +	 * its size and configure the DMA_IF DDR wrap protection (which is in
> +	 * the MMU block) accordingly. The value is the log2 of the DRAM size
> +	 */
> +	WREG32(mmMMU_LOG2_DDR_SIZE, ilog2(prop->dram_size));
> +
> +	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
> +		return rc;
> +	}
> +
> +	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> +			GOYA_ASYNC_EVENT_ID_INTS_REGISTER);
> +
> +	goya_fetch_psoc_frequency(hdev);
> +
> +	return 0;
> +}
> +
> +/**
> + * goya_late_fini - GOYA late tear-down code
> + *
> + * @hdev: pointer to hl_device structure
> + *
> + * Free sensors allocated structures
> + */
> +void goya_late_fini(struct hl_device *hdev)
> +{
> +	const struct hwmon_channel_info **channel_info_arr;
> +	int i = 0;
> +
> +	if (!hdev->hl_chip_info.info)
> +		return;
> +
> +	channel_info_arr = hdev->hl_chip_info.info;
> +
> +	while (channel_info_arr[i]) {
> +		kfree(channel_info_arr[i]->config);
> +		kfree(channel_info_arr[i]);
> +		i++;
> +	}
> +
> +	kfree(channel_info_arr);
> +
> +	hdev->hl_chip_info.info = NULL;
> +}
> +
>  /**
>   * goya_sw_init - Goya software initialization code
>   *
> @@ -575,9 +661,15 @@ static int goya_sw_init(struct hl_device *hdev)
>  		return -ENOMEM;
>  
>  	goya->test_cpu_queue = goya_test_cpu_queue;
> +	goya->armcp_info_get = goya_armcp_info_get;
>  
>  	/* according to goya_init_iatu */
>  	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> +
> +	goya->mme_clk = GOYA_PLL_FREQ_LOW;
> +	goya->tpc_clk = GOYA_PLL_FREQ_LOW;
> +	goya->ic_clk = GOYA_PLL_FREQ_LOW;
> +
>  	hdev->asic_specific = goya;
>  
>  	/* Create DMA pool for small allocations */
> @@ -4272,6 +4364,87 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
>  	return goya->events_stat;
>  }
>  
> +static int goya_armcp_info_get(struct hl_device *hdev)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct armcp_packet pkt;
> +	void *armcp_info_cpu_addr;
> +	dma_addr_t armcp_info_dma_addr;
> +	u64 dram_size;
> +	long result;
> +	int rc;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
> +		return 0;
> +
> +	armcp_info_cpu_addr =
> +			hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
> +			sizeof(struct armcp_info), &armcp_info_dma_addr);
> +	if (!armcp_info_cpu_addr) {
> +		dev_err(hdev->dev,
> +			"Failed to allocate DMA memory for ArmCP info packet\n");
> +		return -ENOMEM;
> +	}
> +
> +	memset(armcp_info_cpu_addr, 0, sizeof(struct armcp_info));

Do you expect any caller of cpu_accessible_dma_pool_alloc() that does not
need the memory cleared?
If not, the memset(0) can be moved inside that function.
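
To illustrate what I mean, here is a rough userspace sketch, not the driver
code. raw_pool_alloc(), pool_alloc_zeroed() and pool_free() are made-up
names standing in for the pool helpers, and malloc() stands in for the real
pool backend. The only point is that zeroing happens in one place, so the
callers cannot forget it:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the raw pool allocator. It deliberately
 * returns dirty memory to mimic a buffer that was used before.
 */
static void *raw_pool_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, 0xA5, size);

	return p;
}

/*
 * Hypothetical wrapper: clear the buffer here, once, instead of having
 * every caller do its own memset() after a successful allocation.
 */
static void *pool_alloc_zeroed(size_t size)
{
	void *p = raw_pool_alloc(size);

	if (p)
		memset(p, 0, size);

	return p;
}

/* Hypothetical counterpart that returns the buffer to the pool. */
static void pool_free(void *p)
{
	free(p);
}
```

With something like this, both the ArmCP info path and the EEPROM path
could drop their own memset() after the allocation.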

> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_INFO_GET;
> +	pkt.addr = armcp_info_dma_addr + prop->host_phys_base_address;
> +	pkt.data_max_size = sizeof(struct armcp_info);
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +			GOYA_ARMCP_INFO_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to send armcp info pkt, error %d\n", rc);
> +		goto out;
> +	}
> +
> +	memcpy(&prop->armcp_info, armcp_info_cpu_addr,
> +			sizeof(prop->armcp_info));
> +
> +	dram_size = prop->armcp_info.dram_size;
> +	if (dram_size) {
> +		if ((!is_power_of_2(dram_size)) ||
> +				(dram_size < DRAM_PHYS_DEFAULT_SIZE)) {
> +			dev_err(hdev->dev,
> +				"F/W reported invalid DRAM size %llu. Trying to use default size\n",
> +				dram_size);
> +			dram_size = DRAM_PHYS_DEFAULT_SIZE;
> +		}
> +
> +		prop->dram_size = dram_size;
> +		prop->dram_end_address = prop->dram_base_address + dram_size;
> +	}
> +
> +	rc = hl_build_hwmon_channel_info(hdev, prop->armcp_info.sensors);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to build hwmon channel info, error %d\n", rc);
> +		rc = -EFAULT;
> +		goto out;
> +	}
> +
> +out:
> +	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev,
> +			sizeof(struct armcp_info), armcp_info_cpu_addr);
> +
> +	return rc;
> +}
> +
> +static void goya_init_clock_gating(struct hl_device *hdev)
> +{
> +
> +}
> +
> +static void goya_disable_clock_gating(struct hl_device *hdev)
> +{
> +
> +}
>  
>  static void goya_hw_queues_lock(struct hl_device *hdev)
>  {
> @@ -4287,9 +4460,60 @@ static void goya_hw_queues_unlock(struct hl_device *hdev)
>  	spin_unlock(&goya->hw_queues_lock);
>  }
>  
> +int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct armcp_packet pkt;
> +	void *eeprom_info_cpu_addr;
> +	dma_addr_t eeprom_info_dma_addr;
> +	long result;
> +	int rc;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
> +		return 0;
> +
> +	eeprom_info_cpu_addr =
> +			hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
> +					max_size, &eeprom_info_dma_addr);
> +	if (!eeprom_info_cpu_addr) {
> +		dev_err(hdev->dev,
> +			"Failed to allocate DMA memory for EEPROM info packet\n");
> +		return -ENOMEM;
> +	}
> +
> +	memset(eeprom_info_cpu_addr, 0, max_size);
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_EEPROM_DATA_GET;
> +	pkt.addr = eeprom_info_dma_addr + prop->host_phys_base_address;
> +	pkt.data_max_size = max_size;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +			GOYA_ARMCP_EEPROM_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to send armcp EEPROM pkt, error %d\n", rc);
> +		goto out;
> +	}
> +
> +	/* result contains the actual size */
> +	memcpy(data, eeprom_info_cpu_addr, min((size_t)result, max_size));
> +
> +out:
> +	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, max_size,
> +			eeprom_info_cpu_addr);
> +
> +	return rc;
> +}
> +
>  static const struct hl_asic_funcs goya_funcs = {
>  	.early_init = goya_early_init,
>  	.early_fini = goya_early_fini,
> +	.late_init = goya_late_init,
> +	.late_fini = goya_late_fini,
>  	.sw_init = goya_sw_init,
>  	.sw_fini = goya_sw_fini,
>  	.hw_init = goya_hw_init,
> @@ -4310,10 +4534,16 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
>  	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
>  	.update_eq_ci = goya_update_eq_ci,
> +	.add_device_attr = goya_add_device_attr,
> +	.remove_device_attr = goya_remove_device_attr,
>  	.handle_eqe = goya_handle_eqe,
> +	.set_pll_profile = goya_set_pll_profile,
>  	.get_events_stat = goya_get_events_stat,
> +	.enable_clock_gating = goya_init_clock_gating,
> +	.disable_clock_gating = goya_disable_clock_gating,
>  	.hw_queues_lock = goya_hw_queues_lock,
>  	.hw_queues_unlock = goya_hw_queues_unlock,
> +	.get_eeprom_data = goya_get_eeprom_data,
>  	.send_cpu_message = goya_send_cpu_message
>  };
>  
> diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> index c6bfcb6c6905..42e8b1baef2f 100644
> --- a/drivers/misc/habanalabs/goya/goyaP.h
> +++ b/drivers/misc/habanalabs/goya/goyaP.h
> @@ -48,7 +48,10 @@
>  
>  #define PLL_HIGH_DEFAULT		1575000000	/* 1.575 GHz */
>  
> +#define MAX_POWER_DEFAULT		200000		/* 200W */
> +
>  #define GOYA_ARMCP_INFO_TIMEOUT		10000000	/* 10s */
> +#define GOYA_ARMCP_EEPROM_TIMEOUT	10000000	/* 10s */
>  
>  #define DRAM_PHYS_DEFAULT_SIZE		0x100000000ull	/* 4GB */
>  
> @@ -119,9 +122,15 @@ enum goya_fw_component {
>  
>  struct goya_device {
>  	int (*test_cpu_queue)(struct hl_device *hdev);
> +	int (*armcp_info_get)(struct hl_device *hdev);
>  
>  	/* TODO: remove hw_queues_lock after moving to scheduler code */
>  	spinlock_t	hw_queues_lock;
> +
> +	u64		mme_clk;
> +	u64		tpc_clk;
> +	u64		ic_clk;
> +
>  	u64		ddr_bar_cur_addr;
>  	u32		events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
>  	u32		hw_cap_initialized;
> @@ -130,6 +139,18 @@ struct goya_device {
>  int goya_test_cpu_queue(struct hl_device *hdev);
>  int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
>  				u32 timeout, long *result);
> +long goya_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
> +long goya_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
> +long goya_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
> +long goya_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
> +long goya_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
> +void goya_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
> +			long value);
> +void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq);
> +int goya_add_device_attr(struct hl_device *hdev);
> +void goya_remove_device_attr(struct hl_device *hdev);
>  void goya_init_security(struct hl_device *hdev);
> +u64 goya_get_max_power(struct hl_device *hdev);
> +void goya_set_max_power(struct hl_device *hdev, u64 value);
>  
>  #endif /* GOYAP_H_ */
> diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> new file mode 100644
> index 000000000000..866d1774b2e4
> --- /dev/null
> +++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> @@ -0,0 +1,306 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "goyaP.h"
> +
> +void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	switch (freq) {
> +	case PLL_HIGH:
> +		hl_set_frequency(hdev, MME_PLL, hdev->high_pll);
> +		hl_set_frequency(hdev, TPC_PLL, hdev->high_pll);
> +		hl_set_frequency(hdev, IC_PLL, hdev->high_pll);
> +		break;
> +	case PLL_LOW:
> +		hl_set_frequency(hdev, MME_PLL, GOYA_PLL_FREQ_LOW);
> +		hl_set_frequency(hdev, TPC_PLL, GOYA_PLL_FREQ_LOW);
> +		hl_set_frequency(hdev, IC_PLL, GOYA_PLL_FREQ_LOW);
> +		break;
> +	case PLL_LAST:
> +		hl_set_frequency(hdev, MME_PLL, goya->mme_clk);
> +		hl_set_frequency(hdev, TPC_PLL, goya->tpc_clk);
> +		hl_set_frequency(hdev, IC_PLL, goya->ic_clk);
> +		break;
> +	default:
> +		dev_err(hdev->dev, "unknown frequency setting\n");
> +	}
> +}
> +
> +static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long value;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	value = hl_get_frequency(hdev, MME_PLL, false);
> +
> +	if (value < 0)
> +		return value;
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> +}
> +
> +static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
> +				const char *buf, size_t count)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	struct goya_device *goya = hdev->asic_specific;
> +	int rc;
> +	long value;
> +
> +	if (hdev->disabled) {
> +		count = -ENODEV;
> +		goto fail;
> +	}
> +
> +	if (hdev->pm_mng_profile == PM_AUTO) {
> +		count = -EPERM;
> +		goto fail;
> +	}
> +
> +	rc = kstrtoul(buf, 0, &value);
> +
> +	if (rc) {
> +		count = -EINVAL;
> +		goto fail;
> +	}
> +
> +	hl_set_frequency(hdev, MME_PLL, value);
> +	goya->mme_clk = value;
> +
> +fail:
> +	return count;
> +}
> +
> +static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long value;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	value = hl_get_frequency(hdev, TPC_PLL, false);
> +
> +	if (value < 0)
> +		return value;
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> +}
> +
> +static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
> +				const char *buf, size_t count)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	struct goya_device *goya = hdev->asic_specific;
> +	int rc;
> +	long value;
> +
> +	if (hdev->disabled) {
> +		count = -ENODEV;
> +		goto fail;
> +	}
> +
> +	if (hdev->pm_mng_profile == PM_AUTO) {
> +		count = -EPERM;
> +		goto fail;
> +	}
> +
> +	rc = kstrtoul(buf, 0, &value);
> +
> +	if (rc) {
> +		count = -EINVAL;
> +		goto fail;
> +	}
> +
> +	hl_set_frequency(hdev, TPC_PLL, value);
> +	goya->tpc_clk = value;
> +
> +fail:
> +	return count;
> +}
> +
> +static ssize_t ic_clk_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long value;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	value = hl_get_frequency(hdev, IC_PLL, false);
> +
> +	if (value < 0)
> +		return value;
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> +}
> +
> +static ssize_t ic_clk_store(struct device *dev, struct device_attribute *attr,
> +				const char *buf, size_t count)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	struct goya_device *goya = hdev->asic_specific;
> +	int rc;
> +	long value;
> +
> +	if (hdev->disabled) {
> +		count = -ENODEV;
> +		goto fail;
> +	}
> +
> +	if (hdev->pm_mng_profile == PM_AUTO) {
> +		count = -EPERM;
> +		goto fail;
> +	}
> +
> +	rc = kstrtoul(buf, 0, &value);
> +
> +	if (rc) {
> +		count = -EINVAL;
> +		goto fail;
> +	}
> +
> +	hl_set_frequency(hdev, IC_PLL, value);
> +	goya->ic_clk = value;
> +
> +fail:
> +	return count;
> +}
> +
> +static ssize_t mme_clk_curr_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long value;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	value = hl_get_frequency(hdev, MME_PLL, true);
> +
> +	if (value < 0)
> +		return value;
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> +}
> +
> +static ssize_t tpc_clk_curr_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long value;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	value = hl_get_frequency(hdev, TPC_PLL, true);
> +
> +	if (value < 0)
> +		return value;
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> +}
> +
> +static ssize_t ic_clk_curr_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long value;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	value = hl_get_frequency(hdev, IC_PLL, true);
> +
> +	if (value < 0)
> +		return value;
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> +}
> +
> +static DEVICE_ATTR_RW(mme_clk);
> +static DEVICE_ATTR_RW(tpc_clk);
> +static DEVICE_ATTR_RW(ic_clk);
> +static DEVICE_ATTR_RO(mme_clk_curr);
> +static DEVICE_ATTR_RO(tpc_clk_curr);
> +static DEVICE_ATTR_RO(ic_clk_curr);
> +
> +int goya_add_device_attr(struct hl_device *hdev)
> +{
> +	int rc;
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_mme_clk);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file mme_clk\n");
> +		return rc;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_tpc_clk);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file tpc_clk\n");
> +		goto remove_mme_clk;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_ic_clk);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file ic_clk\n");
> +		goto remove_tpc_clk;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_mme_clk_curr);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file mme_clk_curr\n");
> +		goto remove_ic_clk;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_tpc_clk_curr);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file tpc_clk_curr\n");
> +		goto remove_mme_clk_curr;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_ic_clk_curr);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file ic_clk_curr\n");
> +		goto remove_tpc_clk_curr;
> +	}
> +
> +	return 0;
> +
> +remove_tpc_clk_curr:
> +	device_remove_file(hdev->dev, &dev_attr_tpc_clk_curr);
> +remove_mme_clk_curr:
> +	device_remove_file(hdev->dev, &dev_attr_mme_clk_curr);
> +remove_ic_clk:
> +	device_remove_file(hdev->dev, &dev_attr_ic_clk);
> +remove_tpc_clk:
> +	device_remove_file(hdev->dev, &dev_attr_tpc_clk);
> +remove_mme_clk:
> +	device_remove_file(hdev->dev, &dev_attr_mme_clk);
> +	return rc;
> +}
> +
> +void goya_remove_device_attr(struct hl_device *hdev)
> +{
> +	device_remove_file(hdev->dev, &dev_attr_ic_clk_curr);
> +	device_remove_file(hdev->dev, &dev_attr_tpc_clk_curr);
> +	device_remove_file(hdev->dev, &dev_attr_mme_clk_curr);
> +	device_remove_file(hdev->dev, &dev_attr_ic_clk);
> +	device_remove_file(hdev->dev, &dev_attr_tpc_clk);
> +	device_remove_file(hdev->dev, &dev_attr_mme_clk);
> +}
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index 899bf98eb002..49b84b3ff864 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -25,6 +25,8 @@
>  
>  #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
>  
> +#define HL_PLL_LOW_JOB_FREQ_USEC	5000000 /* 5 s */
> +
>  #define HL_MAX_QUEUES			128
>  
>  struct hl_device;
> @@ -60,6 +62,8 @@ struct hw_queue_properties {
>  /**
>   * struct asic_fixed_properties - ASIC specific immutable properties.
>   * @hw_queues_props: H/W queues properties.
> + * @armcp_info: received various information from ArmCP regarding the H/W. e.g.
> + *		available sensors.
>   * @uboot_ver: F/W U-boot version.
>   * @preboot_ver: F/W Preboot version.
>   * @sram_base_address: SRAM physical start address.
> @@ -72,6 +76,7 @@ struct hw_queue_properties {
>   * @dram_pci_bar_size: size of PCI bar towards DRAM.
>   * @host_phys_base_address: base physical address of host memory for
>   *				transactions that the device generates.
> + * @max_power_default: max power of the device after reset
>   * @va_space_host_start_address: base address of virtual memory range for
>   *                               mapping host memory.
>   * @va_space_host_end_address: end address of virtual memory range for
> @@ -84,6 +89,10 @@ struct hw_queue_properties {
>   * @sram_size: total size of SRAM.
>   * @max_asid: maximum number of open contexts (ASIDs).
>   * @num_of_events: number of possible internal H/W IRQs.
> + * @psoc_pci_pll_nr: PCI PLL NR value.
> + * @psoc_pci_pll_nf: PCI PLL NF value.
> + * @psoc_pci_pll_od: PCI PLL OD value.
> + * @psoc_pci_pll_div_factor: PCI PLL DIV FACTOR 1 value.
>   * @completion_queues_count: number of completion queues.
>   * @high_pll: high PLL frequency used by the device.
>   * @cb_pool_cb_cnt: number of CBs in the CB pool.
> @@ -92,6 +101,7 @@ struct hw_queue_properties {
>   */
>  struct asic_fixed_properties {
>  	struct hw_queue_properties	hw_queues_props[HL_MAX_QUEUES];
> +	struct armcp_info	armcp_info;
>  	char			uboot_ver[VERSION_MAX_LEN];
>  	char			preboot_ver[VERSION_MAX_LEN];
>  	u64			sram_base_address;
> @@ -103,6 +113,7 @@ struct asic_fixed_properties {
>  	u64			dram_size;
>  	u64			dram_pci_bar_size;
>  	u64			host_phys_base_address;
> +	u64			max_power_default;
>  	u64			va_space_host_start_address;
>  	u64			va_space_host_end_address;
>  	u64			va_space_dram_start_address;
> @@ -111,6 +122,10 @@ struct asic_fixed_properties {
>  	u32			sram_size;
>  	u32			max_asid;
>  	u32			num_of_events;
> +	u32			psoc_pci_pll_nr;
> +	u32			psoc_pci_pll_nf;
> +	u32			psoc_pci_pll_od;
> +	u32			psoc_pci_pll_div_factor;
>  	u32			high_pll;
>  	u32			cb_pool_cb_cnt;
>  	u32			cb_pool_cb_size;
> @@ -296,13 +311,37 @@ enum hl_asic_type {
>  };
>  
>  
> +/**
> + * enum hl_pm_mng_profile - power management profile.
> + * @PM_AUTO: internal clock is set by KMD.
> + * @PM_MANUAL: internal clock is set by the user.
> + * @PM_LAST: last power management type.
> + */
> +enum hl_pm_mng_profile {
> +	PM_AUTO = 1,
> +	PM_MANUAL,
> +	PM_LAST
> +};
>  
> +/**
> + * enum hl_pll_frequency - PLL frequency.
> + * @PLL_HIGH: high frequency.
> + * @PLL_LOW: low frequency.
> + * @PLL_LAST: last frequency values that were configured by the user.
> + */
> +enum hl_pll_frequency {
> +	PLL_HIGH = 1,
> +	PLL_LOW,
> +	PLL_LAST
> +};
>  
>  /**
>   * struct hl_asic_funcs - ASIC specific functions that are can be called from
>   *                        common code.
>   * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
>   * @early_fini: tears down what was done in early_init.
> + * @late_init: sets up late driver/hw state (post hw_init) - Optional.
> + * @late_fini: tears down what was done in late_init (pre hw_fini) - Optional.
>   * @sw_init: sets up driver state, does not configure H/W.
>   * @sw_fini: tears down driver state, does not configure H/W.
>   * @hw_init: sets up the H/W state.
> @@ -326,15 +365,23 @@ enum hl_asic_type {
>   * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
>   * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
>   * @update_eq_ci: update event queue CI.
> + * @add_device_attr: add ASIC specific device attributes.
> + * @remove_device_attr: remove ASIC specific device attributes.
>   * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
> + * @set_pll_profile: change PLL profile (manual/automatic).
>   * @get_events_stat: retrieve event queue entries histogram.
> + * @enable_clock_gating: enable clock gating for reducing power consumption.
> + * @disable_clock_gating: disable clock for accessing registers on HBW.
>   * @hw_queues_lock: acquire H/W queues lock.
>   * @hw_queues_unlock: release H/W queues lock.
> + * @get_eeprom_data: retrieve EEPROM data from F/W.
>   * @send_cpu_message: send buffer to ArmCP.
>   */
>  struct hl_asic_funcs {
>  	int (*early_init)(struct hl_device *hdev);
>  	int (*early_fini)(struct hl_device *hdev);
> +	int (*late_init)(struct hl_device *hdev);
> +	void (*late_fini)(struct hl_device *hdev);
>  	int (*sw_init)(struct hl_device *hdev);
>  	int (*sw_fini)(struct hl_device *hdev);
>  	int (*hw_init)(struct hl_device *hdev);
> @@ -363,11 +410,19 @@ struct hl_asic_funcs {
>  	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
>  				size_t size, void *vaddr);
>  	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
> +	int (*add_device_attr)(struct hl_device *hdev);
> +	void (*remove_device_attr)(struct hl_device *hdev);
>  	void (*handle_eqe)(struct hl_device *hdev,
>  				struct hl_eq_entry *eq_entry);
> +	void (*set_pll_profile)(struct hl_device *hdev,
> +			enum hl_pll_frequency freq);
>  	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
> +	void (*enable_clock_gating)(struct hl_device *hdev);
> +	void (*disable_clock_gating)(struct hl_device *hdev);
>  	void (*hw_queues_lock)(struct hl_device *hdev);
>  	void (*hw_queues_unlock)(struct hl_device *hdev);
> +	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
> +				size_t max_size);
>  	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
>  				u16 len, u32 timeout, long *result);
>  };
> @@ -496,6 +551,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @rmmio: configuration area address on SRAM.
>   * @cdev: related char device.
>   * @dev: realted kernel basic device structure.
> + * @work_freq: delayed work to lower device frequency if possible.
>   * @asic_name: ASIC specific nmae.
>   * @asic_type: ASIC specific type.
>   * @completion_queue: array of hl_cq.
> @@ -517,13 +573,23 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
>   * @asic_prop: ASIC specific immutable properties.
>   * @asic_funcs: ASIC specific functions.
>   * @asic_specific: ASIC specific information to use only from ASIC files.
> + * @hwmon_dev: H/W monitor device.
> + * @pm_mng_profile: current power management profile.
> + * @hl_chip_info: ASIC's sensors information.
>   * @cb_pool: list of preallocated CBs.
>   * @cb_pool_lock: protects the CB pool.
>   * @user_ctx: current user context executing.
> + * @curr_pll_profile: current PLL profile.
>   * @fd_open_cnt: number of open context executing.
> + * @max_power: the max power of the device, as configured by the sysadmin. This
> + *             value is saved so in case of hard-reset, KMD will restore this
> + *             value and update the F/W after the re-initialization
>   * @major: habanalabs KMD major.
> + * @high_pll: high PLL profile frequency.
>   * @id: device minor.
>   * @disabled: is device disabled.
> + * @late_init_done: is late init stage was done during initialization.
> + * @hwmon_initialized: is H/W monitor sensors was initialized.
>   */
>  struct hl_device {
>  	struct pci_dev			*pdev;
> @@ -531,6 +597,7 @@ struct hl_device {
>  	void __iomem			*rmmio;
>  	struct cdev			cdev;
>  	struct device			*dev;
> +	struct delayed_work		work_freq;
>  	char				asic_name[16];
>  	enum hl_asic_type		asic_type;
>  	struct hl_cq			*completion_queue;
> @@ -553,16 +620,25 @@ struct hl_device {
>  	struct asic_fixed_properties	asic_prop;
>  	const struct hl_asic_funcs	*asic_funcs;
>  	void				*asic_specific;
> +	struct device			*hwmon_dev;
> +	enum hl_pm_mng_profile		pm_mng_profile;
> +	struct hwmon_chip_info		hl_chip_info;
>  
>  	struct list_head		cb_pool;
>  	spinlock_t			cb_pool_lock;
>  
>  	/* TODO: The following fields should be moved for multi-context */
>  	struct hl_ctx			*user_ctx;
> +
> +	atomic_t			curr_pll_profile;
>  	atomic_t			fd_open_cnt;
> +	u64				max_power;
>  	u32				major;
> +	u32				high_pll;
>  	u16				id;
>  	u8				disabled;
> +	u8				late_init_done;
> +	u8				hwmon_initialized;
>  
>  	/* Parameters for bring-up */
>  	u8				cpu_enable;
> @@ -647,6 +723,15 @@ int hl_device_suspend(struct hl_device *hdev);
>  int hl_device_resume(struct hl_device *hdev);
>  void hl_hpriv_get(struct hl_fpriv *hpriv);
>  void hl_hpriv_put(struct hl_fpriv *hpriv);
> +int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
> +int hl_build_hwmon_channel_info(struct hl_device *hdev,
> +		struct armcp_sensor *sensors_arr);
> +
> +int hl_sysfs_init(struct hl_device *hdev);
> +void hl_sysfs_fini(struct hl_device *hdev);
> +
> +int hl_hwmon_init(struct hl_device *hdev);
> +void hl_hwmon_fini(struct hl_device *hdev);
>  
>  int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
>  		u64 *handle, int ctx_id);
> @@ -663,6 +748,18 @@ int hl_cb_pool_fini(struct hl_device *hdev);
>  
>  void goya_set_asic_funcs(struct hl_device *hdev);
>  
> +long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
> +void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
> +long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
> +long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
> +long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
> +long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
> +long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
> +void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
> +			long value);
> +u64 hl_get_max_power(struct hl_device *hdev);
> +void hl_set_max_power(struct hl_device *hdev, u64 value);
> +
>  /* IOCTLs */
>  long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
>  int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index b64f58ad0f5d..47a9ab458b43 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -134,6 +134,13 @@ int hl_device_open(struct inode *inode, struct file *filp)
>  
>  	hpriv->taskpid = find_get_pid(current->pid);
>  
> +	/*
> +	 * Device is IDLE at this point so it is legal to change PLLs. There
> +	 * is no need to check anything because if the PLL is already HIGH, the
> +	 * set function will return without doing anything
> +	 */
> +	hl_device_set_frequency(hdev, PLL_HIGH);
> +
>  	return 0;
>  
>  out_err:
> diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
> new file mode 100644
> index 000000000000..6ca0decb7490
> --- /dev/null
> +++ b/drivers/misc/habanalabs/hwmon.c
> @@ -0,0 +1,449 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +
> +#define SENSORS_PKT_TIMEOUT		100000	/* 100ms */
> +#define HWMON_NR_SENSOR_TYPES		(hwmon_pwm + 1)
> +
> +int hl_build_hwmon_channel_info(struct hl_device *hdev,
> +				struct armcp_sensor *sensors_arr)
> +{
> +	u32 counts[HWMON_NR_SENSOR_TYPES] = {0};
> +	u32 *sensors_by_type[HWMON_NR_SENSOR_TYPES] = {0};
> +	u32 sensors_by_type_next_index[HWMON_NR_SENSOR_TYPES] = {0};
> +	struct hwmon_channel_info **channels_info;
> +	u32 num_sensors_for_type, num_active_sensor_types = 0,
> +			arr_size = 0, *curr_arr;
> +	enum hwmon_sensor_types type;
> +	int rc, i, j;
> +
> +	for (i = 0 ; i < ARMCP_MAX_SENSORS ; i++) {
> +		type = sensors_arr[i].type;
> +
> +		if ((type == 0) && (sensors_arr[i].flags == 0))
> +			break;
> +
> +		if (type >= HWMON_NR_SENSOR_TYPES) {
> +			dev_err(hdev->dev,
> +				"Got wrong sensor type %d from device\n", type);
> +			return -EINVAL;
> +		}
> +
> +		counts[type]++;
> +		arr_size++;
> +	}
> +
> +	for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
> +		if (counts[i] == 0)
> +			continue;
> +
> +		num_sensors_for_type = counts[i] + 1;
> +		curr_arr = kcalloc(num_sensors_for_type, sizeof(*curr_arr),
> +				GFP_KERNEL);
> +		if (!curr_arr) {
> +			rc = -ENOMEM;
> +			goto sensors_type_err;
> +		}
> +
> +		num_active_sensor_types++;
> +		sensors_by_type[i] = curr_arr;
> +	}
> +
> +	for (i = 0 ; i < arr_size ; i++) {
> +		type = sensors_arr[i].type;
> +		curr_arr = sensors_by_type[type];
> +		curr_arr[sensors_by_type_next_index[type]++] =
> +				sensors_arr[i].flags;
> +	}
> +
> +	channels_info = kcalloc(num_active_sensor_types + 1,
> +			sizeof(*channels_info), GFP_KERNEL);
> +	if (!channels_info) {
> +		rc = -ENOMEM;
> +		goto channels_info_array_err;
> +	}
> +
> +	for (i = 0 ; i < num_active_sensor_types ; i++) {
> +		channels_info[i] = kzalloc(sizeof(*channels_info[i]),
> +				GFP_KERNEL);
> +		if (!channels_info[i]) {
> +			rc = -ENOMEM;
> +			goto channel_info_err;
> +		}
> +	}
> +
> +	for (i = 0, j = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
> +		if (!sensors_by_type[i])
> +			continue;
> +
> +		channels_info[j]->type = i;
> +		channels_info[j]->config = sensors_by_type[i];
> +		j++;
> +	}
> +
> +	hdev->hl_chip_info.info =
> +			(const struct hwmon_channel_info **)channels_info;
> +
> +	return 0;
> +
> +channel_info_err:
> +	for (i = 0 ; i < num_active_sensor_types ; i++)
> +		if (channels_info[i]) {
> +			kfree(channels_info[i]->config);
> +			kfree(channels_info[i]);
> +		}
> +	kfree(channels_info);
> +channels_info_array_err:
> +sensors_type_err:
> +	for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++)
> +		kfree(sensors_by_type[i]);
> +
> +	return rc;
> +}
> +
> +static int hl_read(struct device *dev, enum hwmon_sensor_types type,
> +			u32 attr, int channel, long *val)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	switch (type) {
> +	case hwmon_temp:
> +		switch (attr) {
> +		case hwmon_temp_input:
> +		case hwmon_temp_max:
> +		case hwmon_temp_crit:
> +		case hwmon_temp_max_hyst:
> +		case hwmon_temp_crit_hyst:
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		*val = hl_get_temperature(hdev, channel, attr);
> +		break;
> +	case hwmon_in:
> +		switch (attr) {
> +		case hwmon_in_input:
> +		case hwmon_in_min:
> +		case hwmon_in_max:
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		*val = hl_get_voltage(hdev, channel, attr);
> +		break;
> +	case hwmon_curr:
> +		switch (attr) {
> +		case hwmon_curr_input:
> +		case hwmon_curr_min:
> +		case hwmon_curr_max:
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		*val = hl_get_current(hdev, channel, attr);
> +		break;
> +	case hwmon_fan:
> +		switch (attr) {
> +		case hwmon_fan_input:
> +		case hwmon_fan_min:
> +		case hwmon_fan_max:
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		*val = hl_get_fan_speed(hdev, channel, attr);
> +		break;
> +	case hwmon_pwm:
> +		switch (attr) {
> +		case hwmon_pwm_input:
> +		case hwmon_pwm_enable:
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		*val = hl_get_pwm_info(hdev, channel, attr);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static int hl_write(struct device *dev, enum hwmon_sensor_types type,
> +			u32 attr, int channel, long val)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	switch (type) {
> +	case hwmon_pwm:
> +		switch (attr) {
> +		case hwmon_pwm_input:
> +		case hwmon_pwm_enable:
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		hl_set_pwm_info(hdev, channel, attr, val);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static umode_t hl_is_visible(const void *data, enum hwmon_sensor_types type,
> +				u32 attr, int channel)
> +{
> +	switch (type) {
> +	case hwmon_temp:
> +		switch (attr) {
> +		case hwmon_temp_input:
> +		case hwmon_temp_max:
> +		case hwmon_temp_max_hyst:
> +		case hwmon_temp_crit:
> +		case hwmon_temp_crit_hyst:
> +			return 0444;
> +		}
> +		break;
> +	case hwmon_in:
> +		switch (attr) {
> +		case hwmon_in_input:
> +		case hwmon_in_min:
> +		case hwmon_in_max:
> +			return 0444;
> +		}
> +		break;
> +	case hwmon_curr:
> +		switch (attr) {
> +		case hwmon_curr_input:
> +		case hwmon_curr_min:
> +		case hwmon_curr_max:
> +			return 0444;
> +		}
> +		break;
> +	case hwmon_fan:
> +		switch (attr) {
> +		case hwmon_fan_input:
> +		case hwmon_fan_min:
> +		case hwmon_fan_max:
> +			return 0444;
> +		}
> +		break;
> +	case hwmon_pwm:
> +		switch (attr) {
> +		case hwmon_pwm_input:
> +		case hwmon_pwm_enable:
> +			return 0644;
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +	return 0;
> +}
> +
> +static const struct hwmon_ops hl_hwmon_ops = {
> +	.is_visible = hl_is_visible,
> +	.read = hl_read,
> +	.write = hl_write
> +};
> +
> +long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_TEMPERATURE_GET;
> +	pkt.sensor_index = sensor_index;
> +	pkt.type = attr;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +			SENSORS_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to get temperature from sensor %d, error %d\n",
> +			sensor_index, rc);
> +		result = 0;
> +	}
> +
> +	return result;
> +}
> +
> +long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_VOLTAGE_GET;
> +	pkt.sensor_index = sensor_index;
> +	pkt.type = attr;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SENSORS_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to get voltage from sensor %d, error %d\n",
> +			sensor_index, rc);
> +		result = 0;
> +	}
> +
> +	return result;
> +}
> +
> +long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_CURRENT_GET;
> +	pkt.sensor_index = sensor_index;
> +	pkt.type = attr;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SENSORS_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to get current from sensor %d, error %d\n",
> +			sensor_index, rc);
> +		result = 0;
> +	}
> +
> +	return result;
> +}
> +
> +long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_FAN_SPEED_GET;
> +	pkt.sensor_index = sensor_index;
> +	pkt.type = attr;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SENSORS_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to get fan speed from sensor %d, error %d\n",
> +			sensor_index, rc);
> +		result = 0;
> +	}
> +
> +	return result;
> +}
> +
> +long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_PWM_GET;
> +	pkt.sensor_index = sensor_index;
> +	pkt.type = attr;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SENSORS_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to get pwm info from sensor %d, error %d\n",
> +			sensor_index, rc);
> +		result = 0;
> +	}
> +
> +	return result;
> +}
> +
> +void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
> +			long value)
> +{
> +	struct armcp_packet pkt;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_PWM_SET;
> +	pkt.sensor_index = sensor_index;
> +	pkt.type = attr;
> +	pkt.value = value;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SENSORS_PKT_TIMEOUT, NULL);
> +
> +	if (rc)
> +		dev_err(hdev->dev,
> +			"Failed to set pwm info to sensor %d, error %d\n",
> +			sensor_index, rc);
> +}
> +
> +int hl_hwmon_init(struct hl_device *hdev)
> +{
> +	struct device *dev = hdev->pdev ? &hdev->pdev->dev : hdev->dev;
> +	int rc;
> +
> +	if ((hdev->hwmon_initialized) || !(hdev->fw_loading))
> +		return 0;
> +
> +	if (hdev->hl_chip_info.info) {
> +		hdev->hl_chip_info.ops = &hl_hwmon_ops;
> +
> +		hdev->hwmon_dev = hwmon_device_register_with_info(dev,
> +				"habanalabs", hdev, &hdev->hl_chip_info, NULL);
> +		if (IS_ERR(hdev->hwmon_dev)) {
> +			rc = PTR_ERR(hdev->hwmon_dev);
> +			dev_err(hdev->dev,
> +				"Unable to register hwmon device: %d\n", rc);
> +			return rc;
> +		}
> +
> +		dev_info(hdev->dev, "%s: add sensors information\n",
> +			dev_name(hdev->hwmon_dev));
> +
> +		hdev->hwmon_initialized = true;
> +	} else {
> +		dev_info(hdev->dev, "no available sensors\n");
> +	}
> +
> +	return 0;
> +}
> +
> +void hl_hwmon_fini(struct hl_device *hdev)
> +{
> +	if (!hdev->hwmon_initialized)
> +		return;
> +
> +	hwmon_device_unregister(hdev->hwmon_dev);
> +}
> diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
> new file mode 100644
> index 000000000000..edd5f7159de0
> --- /dev/null
> +++ b/drivers/misc/habanalabs/sysfs.c
> @@ -0,0 +1,588 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +#include "include/habanalabs_device_if.h"
> +
> +#include <linux/hwmon-sysfs.h>
> +#include <linux/hwmon.h>
> +
> +#define SET_CLK_PKT_TIMEOUT	200000	/* 200ms */
> +#define SET_PWR_PKT_TIMEOUT	400000	/* 400ms */
> +
> +long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	if (curr)
> +		pkt.opcode = ARMCP_PACKET_FREQUENCY_CURR_GET;
> +	else
> +		pkt.opcode = ARMCP_PACKET_FREQUENCY_GET;
> +	pkt.pll_index = pll_index;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +						SET_CLK_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to get frequency of PLL %d, error %d\n",
> +			pll_index, rc);
> +		result = rc;
> +	}
> +
> +	return result;
> +}
> +
> +void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq)
> +{
> +	struct armcp_packet pkt;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_FREQUENCY_SET;
> +	pkt.pll_index = pll_index;
> +	pkt.value = freq;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SET_CLK_PKT_TIMEOUT, NULL);
> +
> +	if (rc)
> +		dev_err(hdev->dev,
> +			"Failed to set frequency to PLL %d, error %d\n",
> +			pll_index, rc);
> +}
> +
> +u64 hl_get_max_power(struct hl_device *hdev)
> +{
> +	struct armcp_packet pkt;
> +	long result;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_MAX_POWER_GET;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +						SET_PWR_PKT_TIMEOUT, &result);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to get max power, error %d\n", rc);
> +		result = rc;
> +	}
> +
> +	return result;
> +}
> +
> +void hl_set_max_power(struct hl_device *hdev, u64 value)
> +{
> +	struct armcp_packet pkt;
> +	int rc;
> +
> +	memset(&pkt, 0, sizeof(pkt));
> +
> +	pkt.opcode = ARMCP_PACKET_MAX_POWER_SET;
> +	pkt.value = value;
> +
> +	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> +					SET_PWR_PKT_TIMEOUT, NULL);
> +
> +	if (rc)
> +		dev_err(hdev->dev, "Failed to set max power, error %d\n", rc);
> +}
> +
> +static ssize_t pm_mng_profile_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n",
> +			(hdev->pm_mng_profile == PM_AUTO) ? "auto" :
> +			(hdev->pm_mng_profile == PM_MANUAL) ? "manual" :
> +			"unknown");
> +}
> +
> +static ssize_t pm_mng_profile_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	if (hdev->disabled) {
> +		count = -ENODEV;
> +		goto out;
> +	}
> +
> +	mutex_lock(&hdev->device_open);
> +
> +	if (atomic_read(&hdev->fd_open_cnt) > 0) {
> +		dev_err(hdev->dev,
> +			"Can't change PM profile while user process is opened on the device\n");
> +		count = -EPERM;
> +		goto unlock_mutex;
> +	}
> +
> +	if (strncmp("auto", buf, strlen("auto")) == 0) {
> +		/* Make sure we are in LOW PLL when changing modes */
> +		if (hdev->pm_mng_profile == PM_MANUAL) {
> +			atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
> +			hl_device_set_frequency(hdev, PLL_LOW);
> +			hdev->pm_mng_profile = PM_AUTO;
> +		}
> +	} else if (strncmp("manual", buf, strlen("manual")) == 0) {
> +		/* Let pending frequency work finish before going manual */
> +		if (hdev->pm_mng_profile == PM_AUTO) {
> +			flush_delayed_work(&hdev->work_freq);
> +			hdev->pm_mng_profile = PM_MANUAL;
> +		}
> +	} else {
> +		dev_err(hdev->dev, "value should be auto or manual\n");
> +		count = -EINVAL;
> +		goto unlock_mutex;
> +	}
> +
> +unlock_mutex:
> +	mutex_unlock(&hdev->device_open);
> +out:
> +	return count;
> +}
> +
> +static ssize_t high_pll_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	return snprintf(buf, PAGE_SIZE, "%u\n", hdev->high_pll);
> +}
> +
> +static ssize_t high_pll_store(struct device *dev, struct device_attribute *attr,
> +				const char *buf, size_t count)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	unsigned long value;
> +	int rc;
> +
> +	if (hdev->disabled) {
> +		count = -ENODEV;
> +		goto out;
> +	}
> +
> +	rc = kstrtoul(buf, 0, &value);
> +
> +	if (rc) {
> +		count = -EINVAL;
> +		goto out;
> +	}
> +
> +	hdev->high_pll = value;
> +
> +out:
> +	return count;
> +}
> +
> +static ssize_t uboot_ver_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.uboot_ver);
> +}
> +
> +static ssize_t armcp_kernel_ver_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%s",
> +			hdev->asic_prop.armcp_info.kernel_version);
> +}
> +
> +static ssize_t armcp_ver_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n",
> +			hdev->asic_prop.armcp_info.armcp_version);
> +}
> +
> +static ssize_t cpld_ver_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "0x%08x\n",
> +			hdev->asic_prop.armcp_info.cpld_version);
> +}
> +
> +static ssize_t infineon_ver_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "0x%04x\n",
> +			hdev->asic_prop.armcp_info.infineon_version);
> +}
> +
> +static ssize_t fuse_ver_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n",
> +			hdev->asic_prop.armcp_info.fuse_version);
> +}
> +
> +static ssize_t thermal_ver_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%s",
> +			hdev->asic_prop.armcp_info.thermal_version);
> +}
> +
> +static ssize_t preboot_btl_ver_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.preboot_ver);
> +}
> +
> +static ssize_t device_type_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	char *str;
> +
> +	switch (hdev->asic_type) {
> +	case ASIC_GOYA:
> +		str = "GOYA";
> +		break;
> +	default:
> +		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
> +				hdev->asic_type);
> +		return -EINVAL;
> +	}
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n", str);
> +}
> +
> +static ssize_t pci_addr_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	/* Use dummy, fixed address for simulator */
> +	if (!hdev->pdev)
> +		return snprintf(buf, PAGE_SIZE, "0000:%02d:00.0\n", hdev->id);
> +
> +	return snprintf(buf, PAGE_SIZE, "%04x:%02x:%02x.%x\n",
> +			pci_domain_nr(hdev->pdev->bus),
> +			hdev->pdev->bus->number,
> +			PCI_SLOT(hdev->pdev->devfn),
> +			PCI_FUNC(hdev->pdev->devfn));
> +}
> +
> +static ssize_t status_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	char *str;
> +
> +	if (hdev->disabled)
> +		str = "Malfunction";
> +	else
> +		str = "Operational";
> +
> +	return snprintf(buf, PAGE_SIZE, "%s\n", str);
> +}
> +
> +static ssize_t write_open_cnt_show(struct device *dev,
> +		struct device_attribute *attr, char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%d\n", hdev->user_ctx ? 1 : 0);
> +}
> +
> +static ssize_t max_power_show(struct device *dev, struct device_attribute *attr,
> +				char *buf)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	long val;
> +
> +	if (hdev->disabled)
> +		return -ENODEV;
> +
> +	val = hl_get_max_power(hdev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%lu\n", val);
> +}
> +
> +static ssize_t max_power_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	unsigned long value;
> +	int rc;
> +
> +	if (hdev->disabled) {
> +		count = -ENODEV;
> +		goto out;
> +	}
> +
> +	rc = kstrtoul(buf, 0, &value);
> +
> +	if (rc) {
> +		count = -EINVAL;
> +		goto out;
> +	}
> +
> +	hdev->max_power = value;
> +	hl_set_max_power(hdev, value);
> +
> +out:
> +	return count;
> +}
> +
> +static ssize_t eeprom_read_handler(struct file *filp, struct kobject *kobj,
> +			struct bin_attribute *attr, char *buf, loff_t offset,
> +			size_t max_size)
> +{
> +	struct device *dev = container_of(kobj, struct device, kobj);
> +	struct hl_device *hdev = dev_get_drvdata(dev);
> +	char *data;
> +	int rc;
> +
> +	if (!max_size)
> +		return -EINVAL;
> +
> +	data = kzalloc(max_size, GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	rc = hdev->asic_funcs->get_eeprom_data(hdev, data, max_size);
> +	if (rc)
> +		goto out;
> +
> +	memcpy(buf, data, max_size);
> +
> +out:
> +	kfree(data);
> +
> +	return rc ? rc : max_size;
> +}
> +
> +static DEVICE_ATTR_RW(pm_mng_profile);
> +static DEVICE_ATTR_RW(high_pll);
> +static DEVICE_ATTR_RO(uboot_ver);
> +static DEVICE_ATTR_RO(armcp_kernel_ver);
> +static DEVICE_ATTR_RO(armcp_ver);
> +static DEVICE_ATTR_RO(cpld_ver);
> +static DEVICE_ATTR_RO(infineon_ver);
> +static DEVICE_ATTR_RO(fuse_ver);
> +static DEVICE_ATTR_RO(thermal_ver);
> +static DEVICE_ATTR_RO(preboot_btl_ver);
> +static DEVICE_ATTR_RO(device_type);
> +static DEVICE_ATTR_RO(pci_addr);
> +static DEVICE_ATTR_RO(status);
> +static DEVICE_ATTR_RO(write_open_cnt);
> +static DEVICE_ATTR_RW(max_power);
> +
> +static const struct bin_attribute bin_attr_eeprom = {
> +	.attr = {.name = "eeprom", .mode = (0444)},
> +	.size = PAGE_SIZE,
> +	.read = eeprom_read_handler
> +};
> +
> +int hl_sysfs_init(struct hl_device *hdev)
> +{
> +	int rc;
> +
> +	rc = hdev->asic_funcs->add_device_attr(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to add device attributes\n");
> +		return rc;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_pm_mng_profile);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file pm_mng_profile\n");
> +		goto remove_device_attr;
> +	}
> +
> +	hdev->pm_mng_profile = PM_AUTO;
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_high_pll);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file high_pll\n");
> +		goto remove_pm_mng_profile;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_uboot_ver);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file uboot_ver\n");
> +		goto remove_pll_profile;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_armcp_kernel_ver);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file armcp_kernel_ver\n");
> +		goto remove_uboot_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_armcp_ver);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file armcp_ver\n");
> +		goto remove_armcp_kernel_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_cpld_ver);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file cpld_ver\n");
> +		goto remove_armcp_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_infineon_ver);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file infineon_ver\n");
> +		goto remove_cpld_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_fuse_ver);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file fuse_ver\n");
> +		goto remove_infineon_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_thermal_ver);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file thermal_ver\n");
> +		goto remove_fuse_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_preboot_btl_ver);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file preboot_btl_ver\n");
> +		goto remove_thermal_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_device_type);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file device_type\n");
> +		goto remove_preboot_ver;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_pci_addr);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file pci_addr\n");
> +		goto remove_device_type;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_status);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create device file status\n");
> +		goto remove_pci_addr;
> +	}
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_write_open_cnt);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file write_open_cnt\n");
> +		goto remove_status;
> +	}
> +
> +	hdev->max_power = hdev->asic_prop.max_power_default;
> +
> +	rc = device_create_file(hdev->dev, &dev_attr_max_power);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"failed to create device file max_power\n");
> +		goto remove_write_open_cnt;
> +	}
> +
> +	rc = sysfs_create_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to create EEPROM sysfs entry\n");
> +		goto remove_attr_max_power;
> +	}
> +
> +	return 0;
> +
> +remove_attr_max_power:
> +	device_remove_file(hdev->dev, &dev_attr_max_power);
> +remove_write_open_cnt:
> +	device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
> +remove_status:
> +	device_remove_file(hdev->dev, &dev_attr_status);
> +remove_pci_addr:
> +	device_remove_file(hdev->dev, &dev_attr_pci_addr);
> +remove_device_type:
> +	device_remove_file(hdev->dev, &dev_attr_device_type);
> +remove_preboot_ver:
> +	device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
> +remove_thermal_ver:
> +	device_remove_file(hdev->dev, &dev_attr_thermal_ver);
> +remove_fuse_ver:
> +	device_remove_file(hdev->dev, &dev_attr_fuse_ver);
> +remove_infineon_ver:
> +	device_remove_file(hdev->dev, &dev_attr_infineon_ver);
> +remove_cpld_ver:
> +	device_remove_file(hdev->dev, &dev_attr_cpld_ver);
> +remove_armcp_ver:
> +	device_remove_file(hdev->dev, &dev_attr_armcp_ver);
> +remove_armcp_kernel_ver:
> +	device_remove_file(hdev->dev, &dev_attr_armcp_kernel_ver);
> +remove_uboot_ver:
> +	device_remove_file(hdev->dev, &dev_attr_uboot_ver);
> +remove_pll_profile:
> +	device_remove_file(hdev->dev, &dev_attr_high_pll);
> +remove_pm_mng_profile:
> +	device_remove_file(hdev->dev, &dev_attr_pm_mng_profile);
> +remove_device_attr:
> +	hdev->asic_funcs->remove_device_attr(hdev);
> +
> +	return rc;
> +}
> +
> +void hl_sysfs_fini(struct hl_device *hdev)
> +{
> +	sysfs_remove_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
> +	device_remove_file(hdev->dev, &dev_attr_max_power);
> +	device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
> +	device_remove_file(hdev->dev, &dev_attr_status);
> +	device_remove_file(hdev->dev, &dev_attr_pci_addr);
> +	device_remove_file(hdev->dev, &dev_attr_device_type);
> +	device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
> +	device_remove_file(hdev->dev, &dev_attr_thermal_ver);
> +	device_remove_file(hdev->dev, &dev_attr_fuse_ver);
> +	device_remove_file(hdev->dev, &dev_attr_infineon_ver);
> +	device_remove_file(hdev->dev, &dev_attr_cpld_ver);
> +	device_remove_file(hdev->dev, &dev_attr_armcp_ver);
> +	device_remove_file(hdev->dev, &dev_attr_armcp_kernel_ver);
> +	device_remove_file(hdev->dev, &dev_attr_uboot_ver);
> +	device_remove_file(hdev->dev, &dev_attr_high_pll);
> +	device_remove_file(hdev->dev, &dev_attr_pm_mng_profile);
> +	hdev->asic_funcs->remove_device_attr(hdev);
> +}
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25  7:43                         ` Daniel Vetter
@ 2019-01-25 15:02                           ` Olof Johansson
  2019-01-25 16:00                             ` Daniel Vetter
  0 siblings, 1 reply; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 15:02 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Thu, Jan 24, 2019 at 11:43 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Jan 25, 2019 at 1:14 AM Olof Johansson <olof@lixom.net> wrote:
> >
> > On Thu, Jan 24, 2019 at 2:23 AM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > > I know I won't be able to convince you but I want to say that I think
> > > > your arguments for full userspace open source are not really
> > > > technical.
> > >
> > > There is more to keeping a kernel going than technical argument unfortunately.
> > >
> > > I guess the question for Greg, Olof etc, is do we care about Linux the
> > > kernel, or Linux the open source ecosystem, if the former, these sort
> > > of accelerator shim drivers are fine, useless to anyone who doesn't
> > > have all the magic hidden userspace, and impossible to support for
> > > anyone else, if the latter, we should leave the cost of maintenance to
> > > the company benefiting from it and leave maintaining it out of tree.
> >
> > As mentioned in my reply to Daniel, I think we've got a history of
> > being pragmatic and finding reasonable trade-offs of what can be open
> > > and what can be closed. For example, if we truly care about open source
> > ecosystem, drivers that require closed firmware should also be
> > refused.
>
> Firmware has traditionally been different since usually it's locked
> down, doesn't do much wrt functionality (dumb fifo scheduling at best,
> not really power management) and so could be reasonably shrugged off
> as "it's part of hw". If you care about the open graphics ecosystem,
> i.e. your ability to port the stack to new cpu architectures, new
> window systems (e.g. android -> xorg, or xorg -> android, or something
> entirely new like wayland), new, more efficient client interface
> (vulkan is a very new fad), then having a closed firmware is not going
> to be a problem. Closed compiler, closed runtime, closed anything else
> otoh is a serious practical pain.
>
> Unfortunately hw vendors seem to have realized that we (overall
> community of customers, distro, upstream) are not insisting on open
> firmware, so they're moving a lot of "valuable sauce" (no really, it's
> not) into the firmware. PM governors, cpu scheduling algorithms, that
> kind of stuff. We're not pleased, and there's lots of people doing the
> behind the scenes work to fix it. One practical problem is that even
> if we've demonstrated that r/e'ing a uc is no bigger challenge than
> anything, there's usually this pesky issue with signatures. So we
> can't force the vendors like we can with the userspace side. Otherwise
> nouveau would have completely open firmware even for latest chips
> (like it has for older ones).
>
> > > Simple question like If I plug your accelerator into Power or ARM64,
> > > where do I get the port of your userspace to use it?
> >
> > Does demanding complete open userspace get us closer to that goal in
> > reality? By refusing to work with people to enable their hardware,
> > they will still ship their platforms out of tree, using DKMS and all
> > the other ways of getting kernel modules installed to talk to the
> > hardware. And we'd be no closer.
> >
> > In the end, they'd open up their userspace when there's business
> > reasons to do so. It's well-known how to work around refusal from us
> > to merge drivers by now, so it's not much leverage in that area.
>
> Correct. None of the hw vendors had a business reason to open source
> anything unfortunately. Yes, eventually customers started demanding
> open source and threatening to buy the competition, but this only works
> if you have multiple reasonably performant & conformant stacks for
> different vendors. The only way to get these is to reverse engineer
> them.

That's the grass-roots version of it, and it is indeed a lot of work.
What _has_ proven to have success is when companies that would drive
real revenue for the vendors have requirements for them to open up,
contribute, and participate. In the graphics world it hasn't gotten
things all the way to the right spot, but I know first hand that for
example Chrome OS's insistence on upstream participation has made
significant differences in how several companies interact with our
communities.

It's not something that's easy to do when the target is
consumer-oriented hardware (which graphics mostly is), but at least
for now, these accelerators aren't targeting end users as much as
corporate environments, where we do have allies.

> Now reverse-engineering is a major pain in itself (despite all the
> great tooling gpu folks developed over the past 10 years to convert it
> from a black art to a repeatable engineering excercise), but if you
> additionally prefer the vendors closed stack (which you do by allowing
> to get them to get merged) the r/e'd stack has no chance. And there is
> not other way to get your open source stack. I can't really go into
> all the details of the past 15+ of open source gpus, but without the
> pressure of other r/e'ed stacks and the pressure of having stacks for
> competitiors (all made possible through aggressive code sharing) we
> would have 0 open source gfx stacks. All the ones we have either got
> started with r/e first (and eventually the vendor jumped on board) or
> survived through r/e and customer efforts (because the vendor planned
> to abandon it). Another part of this is that we accept userspace only
> when it's the common upstream (if there is one), to prevent vendors
> closing down their stacks gradually.
>
> So yeah I think by not clearly preferring open source over
> stacks-with-blobs (how radically you do that is a bit of a balancing act in
> the end, I think we've maxed out in drivers/gpu on what's practically
> possible) you'll just make sure that there's never going to be a
> serious open source stack.

I can confidently say that I would myself clearly give preferential
treatment to open stacks when they show up. The trick is how we get
there -- do we get there quicker by refusing to work with the
partially closed stacks? My viewpoint is that we don't, and that the
best way to get there is to bring them in and start working with what
we have instead of building separate camps that we later need to
figure out how to move.

> > > I'm not the final arbiter on this sort of thing, but I'm definitely
> > > going to make sure that anyone who lands this code is explicit in
> > > ignoring any experience we've had in this area and in the future will
> > > gladly accept "I told you so" :-)
> >
> > There's only one final arbiter on any inclusion of code in the kernel,
> > but we tend to sort out most disagreements without going all the way
> > there.
> >
> > I still think engaging has a better chance of success than rejecting
> > the contributions, especially with clear expectations w.r.t. continued
> > engagement and no second implementations over time. In all honesty,
> > either approach might fail miserably.
>
> This is maybe not clear, but we still work together with the blob
> folks as much as possible, for demonstration: nvidia sponsored XDC
> this year, and nvidia engineers have been regularly presenting there.
> Collaboration happens around the driver interfaces, like loaders (in
> userspace), buffer sharing, synchronization, negotiation of buffer
> formats and all that stuff. Do as much engaging as possible, but if
> you give preferential treatment to the closed stacks over the open
> ones (and by default the vendor _always_ gives you a closed stack, or
> as closed as possible, there's just no business case for them to open
> up without a customer demanding it and competition providing it too),
> you will end up with a closed stack for a very long time, maybe
> forever.
>
> Even if you insist on an open stack it's going to take years, since
> the only way to get there is lots of r/e, and you need to have at
> least 2 stacks or otherwise the customers can't walk away from the
> negotiation table. So again from gfx experience: The only way to get
> open stacks is solid competition by open stacks, and customers/distros
> investing ridiculous amounts of money to r/e the chips and write these
> open&cross vendor stacks. The business case for vendors to open source
> their stacks is just not there. Not until they can't sell their chips
> any other way anymore (nvidia will embrace open stacks as soon as
> their margins evaporate, not a second earlier, like all the others
> before them). Maybe at the next hallway track we need to go through a
> few examples of what all happened and is still happening in the
> background (here's maybe not a good idea).

Again, the graphics world is different since the volume market has
traditionally been consumers, and the split got very deep. What we
want to avoid here is to get into the same situation by avoiding the
large split.

Look at it another way, these are roughly the options and possible outcomes:

1a. We don't merge these drivers, vendors say "okay then" and open up
their whole stacks. We merge the drivers. Then we start working
together on moving to a common base.
1b. We don't merge these drivers, the vendors all do their own thing
and over the next 5 years, we reverse engineer and start to bring in
second implementations of all their code.
2a. We merge these drivers, start close engagement with the vendors,
collaborate and converge with their next-gen products (possibly move
first-gen over).
2b. We merge these drivers, and vendors still go off on their own and
do their own thing. We spend the next 5 years reverse engineering and
move over to open, new drivers even though the first ones are in-tree.

1a/2a are successful outcomes. I put 1a at very very low probability.
2a at medium probability with the right allies in the corporate world.

1b/2b are partial failure modes with huge effort needed and a
passionate volunteer base.

Both 1a/2a and 1b/2b have similar amounts of work involved. In other
words, I don't see how anyone benefits from eliminating (2) as an
approach, especially since 1a is an unlikely development of events.

And, guess what -- if we do get open stacks early on, and give heavy
preferential treatment to these, we can push others in that direction
sooner rather than later, before stacks diverge too far. I just don't
see how _not_ engaging is going to help here. Even if we do engage,
the worst possible outcome is still the same as not engaging, but with a
good chance of something better.


-Olof


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25  7:37   ` Greg Kroah-Hartman
@ 2019-01-25 15:33     ` Olof Johansson
  2019-01-25 16:06       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 15:33 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, LKML, ogabbay

Hi,

On Thu, Jan 24, 2019 at 11:37 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, Jan 24, 2019 at 07:57:11AM +1000, Dave Airlie wrote:
> > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > Habana Labs since its inception two and a half years ago.
> >
> > Hey Oded,
> >
> > So this creates a driver with a userspace facing API via ioctls.
> > Although this isn't a "GPU" driver we have a rule in the graphics
> > drivers area for accelerators that we don't merge userspace API without an
> > appropriate userspace user.
> >
> > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> >
> > I see nothing in these accelerator drivers that makes me think we
> > should be treating them differently.
>
> I understand that this is your position on when you accept drivers into
> the DRM layer, as you need to interact with common interfaces and a
> massive userspace stack at the same time.  And that's wonderful, it
> allows you to be able to move both sides of that stack forward without
> removing support for devices that worked on older kernels.
>
> But, that's not really the case with this new driver at all.  We add new
> driver subsystems, and individual drivers, with loads of new ioctls, in
> every new kernel release.  We don't impose on all of them the "your
> userspace code must be open" rule, so why is this new driver somehow
> different from them?
>
> Yes, there is the fun legal issue of "derivative works" when talking
> about a userspace program that is written to only interact with a
> specific kernel driver using a custom api like this one has, and how the
> license of the kernel side (GPLv2) affects the userspace side
> (whatever), but that is something that I leave up to the lawyers who
> like discussing and enforcing such things.
>
> When evaluating this driver (note, I saw it for a few revisions before
> Oded posted it here), all I did was try to make sure that it fit in
> properly with the kernel apis and methods of operations.  Given that
> there are no in-kernel drivers for this type of device, and that it
> really is a pretty small shim layer around the hardware, which means
> that userspace does a lot of the heavy lifting, it is going to be a
> very hardware-specific user/kernel api, and that shows.

I brought this up because there are sort of, if you squint, three-ish
of these already (the OpenCAPI/CAPI ones and now this). Also,
Jonathan's comment about CCIX pieces coming as well.

I've talked to a handful of vendors in this space and more or less all
of them hope that drivers/misc is still a suitable home for their
driver by the time it's ready to post, and I know that if we keep them
there, it'll only be one or two more drivers until we have this
discussion anyway.

That's really why I brought it up now, to get a clear stance on "Yeah,
we know these are all slightly different today, and we're willing to
give them a predictable home for the first while as we figure out
together how things should look". Keep in mind, this space is also
currently a bit of a gold rush, with many people working hard to get
to market with near-zero talk between them.

> Sidenote, this could have almost just been a UIO driver, which would
> have put _ALL_ of the logic in userspace.  At least this way we have a
> chance for the kernel code to be sane and not try to inflict problems on
> the rest of the system.

Yeah, and sharing hardware when you have userspace drivers tends to
need a bunch of synchronization and coordination which leads to taller
(closed?) stacks there. Having resource arbitration and sharing
assisted by the kernel is usually the right thing, and we should
encourage that.

> Going forward, it would be wonderful if we could come up with a unified
> api for how to interact with these types of hardware accelerators, but
> given the newness of this industry, and the vastly different ways people
> are trying to solve the problem, that is going to take a few attempts,
> and many years before we can get there.  Until then, taking drivers like
> this into the kernel tree makes sense as that way all of our users will
> be able to use that hardware, and better yet, the rest of us can learn
> more about how this stuff works so that we can help out with that api
> generation when it happens.

Exactly my viewpoint as well, combined with not pushing vendors
towards nvidia models by default even if they start out separate, and
having a chance to find inroads with them on engineering levels when
they come to us with the drivers.

Absolutely best chance for success is _always_ when we can engage with
the vendor engineers as peers, instead of going up and down the
corporate charts first (you often need to do a bit of both, but having
allies in engineering makes it much easier).

> So for now, I have no objection to taking this type of driver into the
> tree.  Yes, it would be wonderful if we had an open userspace program to
> drive it so that we could actually test and make changes to the api over
> time, but I think that is something that the submitting company needs to
> realize will be better for them to do, as for right now, all of that
> testing and changes are their responsibility.

I do think having requirements of being able to exercise the hardware
is really valuable, and we should consider what a requirement for that
would look like. For the Habana case that could be a separate
low-level library and a test workload on top. For FPGA cases it could
be a well-known reference bitstream with the vendor reference
control/communication path + similar low-level pieces.

> As for what directory the code should live in, I suggested "misc" as
> there was no other universal location, and I hate to see new subsystems
> be created with only one driver, as that's pretty sad.  But it's just a
> name/location, I have no dog in the fight, so I really don't care where
> it ends up in the tree, just as long as it gets merged somewhere :)

I'm usually one to push back against new subsystems too, especially
when I see a framework proposal with just one driver. In this case,
given that we all know more vendors will come along, I think it makes
sense to take the discussion and establish structure now. This should
give some clarity to those who are out there that we haven't seen yet,
and give them a chance to prepare for things such as the low-level
userspace pieces mentioned above.

So I think setting this up now is the right thing to do, we know there
will be more material here and having a common aggregation of it makes
sense.


-Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25 15:02                           ` Olof Johansson
@ 2019-01-25 16:00                             ` Daniel Vetter
  0 siblings, 0 replies; 103+ messages in thread
From: Daniel Vetter @ 2019-01-25 16:00 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, Greg Kroah-Hartman,
	LKML, ogabbay, Arnd Bergmann, fbarrat, Andrew Donnellan

On Fri, Jan 25, 2019 at 4:02 PM Olof Johansson <olof@lixom.net> wrote:
>
> On Thu, Jan 24, 2019 at 11:43 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Fri, Jan 25, 2019 at 1:14 AM Olof Johansson <olof@lixom.net> wrote:
> > >
> > > On Thu, Jan 24, 2019 at 2:23 AM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > > I know I won't be able to convince you but I want to say that I think
> > > > > your arguments for full userspace open source are not really
> > > > > technical.
> > > >
> > > > There is more to keeping a kernel going than technical argument unfortunately.
> > > >
> > > > I guess the question for Greg, Olof etc, is do we care about Linux the
> > > > kernel, or Linux the open source ecosystem, if the former, these sort
> > > > of accelerator shim drivers are fine, useless to anyone who doesn't
> > > > have all the magic hidden userspace, and impossible to support for
> > > > anyone else, if the latter, we should leave the cost of maintenance to
> > > > the company benefiting from it and leave maintaining it out of tree.
> > >
> > > As mentioned in my reply to Daniel, I think we've got a history of
> > > being pragmatic and finding reasonable trade-offs of what can be open
> > > and what can be closed. For example, if we truly care about open source
> > > ecosystem, drivers that require closed firmware should also be
> > > refused.
> >
> > Firmware has traditionally been different since usually it's locked
> > down, doesn't do much wrt functionality (dumb fifo scheduling at best,
> > not really power management) and so could be reasonably shrugged off
> > as "it's part of hw". If you care about the open graphics ecosystem,
> > i.e. your ability to port the stack to new cpu architectures, new
> > window systems (e.g. android -> xorg, or xorg -> android, or something
> > entirely new like wayland), new, more efficient client interface
> > (vulkan is a very new fad), then having a closed firmware is not going
> > to be a problem. Closed compiler, closed runtime, closed anything else
> > otoh is a serious practical pain.
> >
> > Unfortunately hw vendors seem to have realized that we (overall
> > community of customers, distro, upstream) are not insisting on open
> > firmware, so they're moving a lot of "valuable sauce" (no really, it's
> > not) into the firmware. PM governors, cpu scheduling algorithms, that
> > kind of stuff. We're not pleased, and there's lots of people doing the
> > behind the scenes work to fix it. One practical problem is that even
> > if we've demonstrated that r/e'ing a uc is no bigger challenge than
> > anything, there's usually this pesky issue with signatures. So we
> > can't force the vendors like we can with the userspace side. Otherwise
> > nouveau would have completely open firmware even for latest chips
> > (like it has for older ones).
> >
> > > > Simple question like If I plug your accelerator into Power or ARM64,
> > > > where do I get the port of your userspace to use it?
> > >
> > > Does demanding complete open userspace get us closer to that goal in
> > > reality? By refusing to work with people to enable their hardware,
> > > they will still ship their platforms out of tree, using DKMS and all
> > > the other ways of getting kernel modules installed to talk to the
> > > hardware. And we'd be no closer.
> > >
> > > In the end, they'd open up their userspace when there's business
> > > reasons to do so. It's well-known how to work around refusal from us
> > > to merge drivers by now, so it's not much leverage in that area.
> >
> > Correct. None of the hw vendors had a business reason to open source
> > anything unfortunately. Yes, eventually customers started demanding
> > open source and threatening to buy the competition, but this only works
> > if you have multiple reasonably performant & conformant stacks for
> > different vendors. The only way to get these is to reverse engineer
> > them.
>
> That's the grass-roots version of it, and it is indeed a lot of work.
> What _has_ proven to have success is when companies that would drive
> real revenue for the vendors have requirements for them to open up,
> contribute, and participate. In the graphics world it hasn't gotten
> things all the way to the right spot, but I know first hand that for
> example Chrome OS's insistence on upstream participation has made
> significant differences in how several companies interact with our
> communities.
>
> It's not something that's easy to do when the target is
> consumer-oriented hardware (which graphics mostly is), but at least
> for now, these accelerators aren't targeting end users as much as
> corporate environments, where we do have allies.

Cros is the only reason we do have the stack we do in graphics.
They've been investing ridiculous amounts of money into this, and they
actually invest even more now than 1-2 years ago. I know for a fact
that without cros a few of the open stacks would have substantially,
if not completely, closed down again. If you don't believe that, look
at how debian treats the latest intel libva driver as partially
non-free (because it's become that).

> > Now reverse-engineering is a major pain in itself (despite all the
> > great tooling gpu folks developed over the past 10 years to convert it
> > from a black art to a repeatable engineering exercise), but if you
> > additionally prefer the vendor's closed stack (which you do by allowing
> > them to get merged) the r/e'd stack has no chance. And there is
> > no other way to get your open source stack. I can't really go into
> > all the details of the past 15+ years of open source gpus, but without the
> > pressure of other r/e'ed stacks and the pressure of having stacks for
> > competitors (all made possible through aggressive code sharing) we
> > would have 0 open source gfx stacks. All the ones we have either got
> > started with r/e first (and eventually the vendor jumped on board) or
> > survived through r/e and customer efforts (because the vendor planned
> > to abandon it). Another part of this is that we accept userspace only
> > when it's the common upstream (if there is one), to prevent vendors
> > closing down their stacks gradually.
> >
> > So yeah I think by not clearly preferring open source over
> > stacks-with-blobs (how radically you do that is a bit a balance act in
> > the end, I think we've maxed out in drivers/gpu on what's practically
> > possible) you'll just make sure that there's never going to be a
> > serious open source stack.
>
> I can confidently say that I would myself clearly give preferential
> treatment to open stacks when they show up. The trick is how we get
> there -- do we get there quicker by refusing to work with the
> partially closed stacks? My viewpoint is that we don't, and that the
> best way to get there is to bring them in and start working with what
> we have instead of building separate camps that we later need to
> figure out how to move.
>
> > > > I'm not the final arbiter on this sort of thing, but I'm definitely
> > > > going to make sure that anyone who lands this code is explicit in
> > > > ignoring any experience we've had in this area and in the future will
> > > > gladly accept "I told you so" :-)
> > >
> > > There's only one final arbiter on any inclusion to code to the kernel,
> > > but we tend to sort out most disagreements without going all the way
> > > there.
> > >
> > > I still think engaging has a better chance of success than rejecting
> > > the contributions, especially with clear expectations w.r.t. continued
> > > engagement and no second implementations over time. In all honesty,
> > > either approach might fail miserably.
> >
> > This is maybe not clear, but we still work together with the blob
> > folks as much as possible, for demonstration: nvidia sponsored XDC
> > this year, and nvidia engineers have been regularly presenting there.
> > Collaboration happens around the driver interfaces, like loaders (in
> > userspace), buffer sharing, synchronization, negotiation of buffer
> > formats and all that stuff. Do as much engaging as possible, but if
> > you give preferential treatment to the closed stacks over the open
> > ones (and by default the vendor _always_ gives you a closed stack, or
> > as closed as possible, there's just no business case for them to open
> > up without a customer demanding it and competition providing it too),
> > you will end up with a closed stack for a very long time, maybe
> > forever.
> >
> > Even if you insist on an open stack it's going to take years, since
> > the only way to get there is lots of r/e, and you need to have at
> > least 2 stacks or otherwise the customers can't walk away from the
> > negotiation table. So again from gfx experience: The only way to get
> > open stacks is solid competition by open stacks, and customers/distros
> > investing ridiculous amounts of money to r/e the chips and write these
> > open&cross vendor stacks. The business case for vendors to open source
> > their stacks is just not there. Not until they can't sell their chips
> > any other way anymore (nvidia will embrace open stacks as soon as
> > their margins evaporate, not a second earlier, like all the others
> > before them). Maybe at the next hallway track we need to go through a
> > few examples of what all happened and is still happening in the
> > background (here's maybe not a good idea).
>
> Again, the graphics world is different since the volume market has
> traditionally been consumers, and the split got very deep. What we
> want to avoid here is to get into the same situation by avoiding the
> large split.
>
> Look at it another way, these are roughly the options and possible outcomes:
>
> 1a. We don't merge these drivers, vendors say "okay then" and open up
> their whole stacks. We merge the drivers. Then we start working
> together on moving to a common base.
> 1b. We don't merge these drivers, the vendors all do their own thing
> and over the next 5 years, we reverse engineer and start to bring in
> second implementations of all their code.
> 2a. We merge these drivers, start close engagement with the vendors,
> collaborate and converge with their next-gen products (possibly move
> first-gen over).
> 2b. We merge these drivers, and vendors still go off on their own and
> do their own thing. We spend the next 5 years reverse engineering and
> move over to open, new drivers even though the first ones are in-tree.
>
> 1a/2a are successful outcomes. I put 1a at very very low probability.
> 2a at medium probability with the right allies in the corporate world.
>
> 1b/2b are partial failure modes with huge effort needed and a
> passionate volunteer base.
>
> Both 1a/2a and 1b/2b have similar amounts of work involved. In other
> words, I don't see how anyone benefits from eliminating (2) as an
> approach, especially since 1a is an unlikely development of events.
>
> And, guess what -- if we do get open stacks early on, and give heavy
> preferential treatment to these, we can push others in that direction
> sooner rather than later, before stacks diverge too far. I just don't
> see how _not_ engaging is going to help here. Even if we do engage,
> the worst possible outcome is still the same as not engaging, but with
> a good chance of something better.

If you do have allies with big purses (both product buying power and
the willingness & ability to just hire a driver team and create facts
if the vendor doesn't cooperate), then you can directly make option 1a
happen: vendors open their stacks, you merge them (and of course you
start engaging right away). If you don't have that, then you'll have
1b or 2b and will entirely depend upon students and other fools with
too much time to r/e and write your open stack. And for those 1b is
the substantially better option. The only case where 2a works is if some
team internally in the vendor makes it happen and somehow convinces
management that there's a real market for this (or going to be). But
if that real market (said customer with a big purse and an insistence
on open source) does not materialize, the effort will crumble again
after a few years.

Anyway I don't think we'll convince each other of anything here, so
we'll just see what happens and we have a nice chat in a few years
about this all :-) Maybe if you want someone else still from the
graphics hippies club, perhaps chat with Keith Packard. He's got quite
a bit of experience and has seen even more dumpster fires than Dave,
Jerome or me.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25 15:33     ` Olof Johansson
@ 2019-01-25 16:06       ` Greg Kroah-Hartman
  2019-01-25 17:12         ` Olof Johansson
  0 siblings, 1 reply; 103+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-25 16:06 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, LKML, ogabbay

On Fri, Jan 25, 2019 at 07:33:23AM -0800, Olof Johansson wrote:
> On Thu, Jan 24, 2019 at 11:37 PM Greg Kroah-Hartman
> > As for what directory the code should live in, I suggested "misc" as
> > there was no other universal location, and I hate to see new subsystems
> > be created with only one driver, as that's pretty sad.  But it's just a
> > name/location, I have no dog in the fight, so I really don't care where
> > it ends up in the tree, just as long as it gets merged somewhere :)
> 
> I'm usually one to push back against new subsystems too, especially
> when I see a framework proposal with just one driver. In this case,
> given that we all know more vendors will come along, I think it makes
> sense to take the discussion and establish structure now. This should
> give some clarity to those who are out there that we haven't seen yet,
> and give them a chance to prepare for things such as the low-level
> userspace pieces mentioned above.
> 
> So I think setting this up now is the right thing to do, we know there
> will be more material here and having a common aggregation of it makes
> sense.

Ok, how about:
	drivers/deep_thought/

as a first proposal.

Let the bikeshedding begin!  :)

greg k-h

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25 16:06       ` Greg Kroah-Hartman
@ 2019-01-25 17:12         ` Olof Johansson
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
  2019-01-26 13:52           ` [PATCH 00/15] Habana Labs kernel driver Greg Kroah-Hartman
  0 siblings, 2 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 17:12 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, LKML, ogabbay

On Fri, Jan 25, 2019 at 8:06 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Jan 25, 2019 at 07:33:23AM -0800, Olof Johansson wrote:
> > On Thu, Jan 24, 2019 at 11:37 PM Greg Kroah-Hartman
> > > As for what directory the code should live in, I suggested "misc" as
> > > there was no other universal location, and I hate to see new subsystems
> > > be created with only one driver, as that's pretty sad.  But it's just a
> > > name/location, I have no dog in the fight, so I really don't care where
> > > it ends up in the tree, just as long as it gets merged somewhere :)
> >
> > I'm usually one to push back against new subsystems too, especially
> > when I see a framework proposal with just one driver. In this case,
> > given that we all know more vendors will come along, I think it makes
> > sense to take the discussion and establish structure now. This should
> > give some clarity to those who are out there that we haven't seen yet,
> > and give them a chance to prepare for things such as the low-level
> > userspace pieces mentioned above.
> >
> > So I think setting this up now is the right thing to do, we know there
> > will be more material here and having a common aggregation of it makes
> > sense.
>
> Ok, how about:
>         drivers/deep_thought/
>
> as a first proposal.
>
> Let the bikeshedding begin!  :)

My original proposal upthread was drivers/accel. I'm not sure whether
Dave and/or anyone else wants to participate to start though, I hope
they will at least join in once things are heading in the direction
they want.

I'll post patches with the proposed moves and documentation of
expectations shortly, hopefully to collect acks.



-Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-24  1:03   ` Andrew Donnellan
  2019-01-24 11:59     ` Jonathan Cameron
@ 2019-01-25 17:13     ` Olof Johansson
  1 sibling, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 17:13 UTC (permalink / raw)
  To: Andrew Donnellan
  Cc: Oded Gabbay, Dave Airlie, Greg Kroah-Hartman,
	Linux Kernel Mailing List, ogabbay, Arnd Bergmann, fbarrat,
	linux-accelerators

On Wed, Jan 23, 2019 at 5:03 PM Andrew Donnellan
<andrew.donnellan@au1.ibm.com> wrote:
>
> On 24/1/19 8:52 am, Olof Johansson wrote:
> > But, I think the largest question I have (for a broader audience) is:
> >
> > I predict that we will see a handful of these kind of devices over the
> > upcoming future -- definitely from ML accelerators but maybe also for
> > other kinds of processing, where there's a command-based, buffer-based
> > setup sending workloads to an offload engine and getting results back.
> > While the first waves will all look different due to design trade-offs
> > made in isolation, I think it makes sense to group them in one bucket
> > instead of merging them through drivers/misc, if nothing else to
> > encourage more cross-collaboration over time. First steps in figuring
> > out long-term suitable frameworks is to get a survey of a few
> > non-shared implementations.
> >
> > So, I'd like to propose a drivers/accel drivers subtree, and I'd be
> > happy to bootstrap it with a small group (@Dave Airlie: I think your
> > input from GPU land would be very useful, want to join in?). Individual
> > drivers maintained by existing maintainers, of course.
> >
> > I think it might make sense to move the CAPI/OpenCAPI drivers over as
> > well -- not necessarily to change those drivers, but to group them
> > with the rest as more show up.
>
> For cxl/ocxl, I have no objection to moving to this new subtree if
> that's what we all agree to do. (what do people do about UAPI headers in
> this situation? keep them where they are in misc/?)

How about moving them but keeping a stub in misc that just includes
the moved file?

> If we do go ahead and set up this new subtree, perhaps we can use the
> mailing list I set up at linux-accelerators@lists.ozlabs.org but we
> haven't really started using...

I've lost all lists.ozlabs.org muscle memory these days, but that
seems reasonable to me.



-Olof

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH/RFC 0/5] HW accel subsystem
  2019-01-25 17:12         ` Olof Johansson
@ 2019-01-25 18:16           ` Olof Johansson
  2019-01-25 18:16             ` [PATCH 1/5] drivers/accel: Introduce subsystem Olof Johansson
                               ` (6 more replies)
  2019-01-26 13:52           ` [PATCH 00/15] Habana Labs kernel driver Greg Kroah-Hartman
  1 sibling, 7 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 18:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse

Per the discussion on the Habana Labs driver submission
(https://lore.kernel.org/lkml/20190123000057.31477-1-oded.gabbay@gmail.com/),
it seems to be time to create a separate subsystem for hw accelerators
instead of letting them proliferate around the tree (and/or in misc).

There's a difference of opinion on how stringent the requirements are for
a fully open stack for these kinds of drivers. I've documented the middle
road approach in the first patch (requiring some sort of open low-level
userspace for the kernel interaction, and a way to use/test it).

Comments and suggestions for better approaches are definitely welcome.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 1/5] drivers/accel: Introduce subsystem
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
@ 2019-01-25 18:16             ` Olof Johansson
  2019-01-25 21:13               ` [PATCH v2 " Olof Johansson
  2019-01-25 22:23               ` [PATCH " Daniel Vetter
  2019-01-25 18:16             ` [PATCH 2/5] cxl: Move to drivers/accel Olof Johansson
                               ` (5 subsequent siblings)
  6 siblings, 2 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 18:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Olof Johansson

We're starting to see more of these kinds of devices; the current
upcoming wave will likely be around machine learning and inference
engines. A few drivers have been added to drivers/misc for this, but
it's timely to make it into a separate group of drivers/subsystem, to
make it easier to find them, and to encourage collaboration between
contributors.

Over time, we expect to build shared frameworks that the drivers will
make use of, but what that framework needs to look like to fill the needs
is still unclear, and the best way to gain that knowledge is to give the
disparate implementations a shared location.

There has been some controversy around expectations for userspace
stacks being open. The clear preference is to see that happen, and any
driver and platform stack that is delivered like that will be given
preferential treatment, and at some point in the future it might
become the requirement. Until then, the bare minimum we need is an
open low-level userspace such that the driver and HW interfaces can be
exercised if someone is modifying the driver, even if the full details
of the workload are not always available.

Bootstrapping this with myself and Greg as maintainers (since the current
drivers will be moving out of drivers/misc). Looking forward to expanding
that group over time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Olof Johansson <olof@lixom.net>
---
 MAINTAINERS            |  8 ++++++++
 drivers/Kconfig        |  2 ++
 drivers/Makefile       |  1 +
 drivers/accel/Kconfig  | 16 ++++++++++++++++
 drivers/accel/Makefile |  5 +++++
 5 files changed, 32 insertions(+)
 create mode 100644 drivers/accel/Kconfig
 create mode 100644 drivers/accel/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index ddcdc29dfe1f6..8a9bbaf8f6e90 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7033,6 +7033,14 @@ W:	https://linuxtv.org
 S:	Supported
 F:	drivers/media/platform/sti/hva
 
+HW ACCELERATOR OFFLOAD SUBSYSTEM
+M:	Olof Johansson <olof@lixom.net>
+M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+L:	linux-accelerators@lists.ozlabs.org
+S:	Supported
+F:	drivers/accel/
+F:	Documentation/accelerators/
+
 HWPOISON MEMORY FAILURE HANDLING
 M:	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
 L:	linux-mm@kvack.org
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 4f9f99057ff85..3cc461f325569 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -228,4 +228,6 @@ source "drivers/siox/Kconfig"
 
 source "drivers/slimbus/Kconfig"
 
+source "drivers/accel/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 04da7876032cc..e4be06579cc5d 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -186,3 +186,4 @@ obj-$(CONFIG_MULTIPLEXER)	+= mux/
 obj-$(CONFIG_UNISYS_VISORBUS)	+= visorbus/
 obj-$(CONFIG_SIOX)		+= siox/
 obj-$(CONFIG_GNSS)		+= gnss/
+obj-$(CONFIG_ACCEL)		+= accel/
diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
new file mode 100644
index 0000000000000..13b36c0398895
--- /dev/null
+++ b/drivers/accel/Kconfig
@@ -0,0 +1,16 @@
+#
+# Drivers for hardware offload accelerators
+# See Documentation/accel/README.rst for more details
+#
+
+menuconfig ACCEL
+	bool "Hardware offload accelerator support"
+	help
+	  HW offload accelerators are used for high-bandwidth workloads
+	  where a higher-level kernel/userspace interface isn't suitable.
+
+if ACCEL
+
+comment "HW Accelerator drivers"
+
+endif
diff --git a/drivers/accel/Makefile b/drivers/accel/Makefile
new file mode 100644
index 0000000000000..343bbb8f45a14
--- /dev/null
+++ b/drivers/accel/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for accel devices
+#
+
-- 
2.11.0



* [PATCH 2/5] cxl: Move to drivers/accel
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
  2019-01-25 18:16             ` [PATCH 1/5] drivers/accel: Introduce subsystem Olof Johansson
@ 2019-01-25 18:16             ` Olof Johansson
  2019-01-25 18:16             ` [PATCH 3/5] drivers/accel: cxl: Move non-uapi include files Olof Johansson
                               ` (4 subsequent siblings)
  6 siblings, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 18:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Olof Johansson,
	Arnd Bergmann

The include files are moved in a separate commit, so they are left alone for now.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
---
 MAINTAINERS                           | 2 +-
 drivers/accel/Kconfig                 | 2 ++
 drivers/accel/Makefile                | 1 +
 drivers/{misc => accel}/cxl/Kconfig   | 0
 drivers/{misc => accel}/cxl/Makefile  | 0
 drivers/{misc => accel}/cxl/api.c     | 0
 drivers/{misc => accel}/cxl/base.c    | 0
 drivers/{misc => accel}/cxl/context.c | 0
 drivers/{misc => accel}/cxl/cxl.h     | 0
 drivers/{misc => accel}/cxl/cxllib.c  | 0
 drivers/{misc => accel}/cxl/debugfs.c | 0
 drivers/{misc => accel}/cxl/fault.c   | 0
 drivers/{misc => accel}/cxl/file.c    | 0
 drivers/{misc => accel}/cxl/flash.c   | 0
 drivers/{misc => accel}/cxl/guest.c   | 0
 drivers/{misc => accel}/cxl/hcalls.c  | 0
 drivers/{misc => accel}/cxl/hcalls.h  | 0
 drivers/{misc => accel}/cxl/irq.c     | 0
 drivers/{misc => accel}/cxl/main.c    | 0
 drivers/{misc => accel}/cxl/native.c  | 0
 drivers/{misc => accel}/cxl/of.c      | 0
 drivers/{misc => accel}/cxl/pci.c     | 0
 drivers/{misc => accel}/cxl/sysfs.c   | 0
 drivers/{misc => accel}/cxl/trace.c   | 0
 drivers/{misc => accel}/cxl/trace.h   | 0
 drivers/{misc => accel}/cxl/vphb.c    | 0
 drivers/misc/Kconfig                  | 1 -
 drivers/misc/Makefile                 | 1 -
 28 files changed, 4 insertions(+), 3 deletions(-)
 rename drivers/{misc => accel}/cxl/Kconfig (100%)
 rename drivers/{misc => accel}/cxl/Makefile (100%)
 rename drivers/{misc => accel}/cxl/api.c (100%)
 rename drivers/{misc => accel}/cxl/base.c (100%)
 rename drivers/{misc => accel}/cxl/context.c (100%)
 rename drivers/{misc => accel}/cxl/cxl.h (100%)
 rename drivers/{misc => accel}/cxl/cxllib.c (100%)
 rename drivers/{misc => accel}/cxl/debugfs.c (100%)
 rename drivers/{misc => accel}/cxl/fault.c (100%)
 rename drivers/{misc => accel}/cxl/file.c (100%)
 rename drivers/{misc => accel}/cxl/flash.c (100%)
 rename drivers/{misc => accel}/cxl/guest.c (100%)
 rename drivers/{misc => accel}/cxl/hcalls.c (100%)
 rename drivers/{misc => accel}/cxl/hcalls.h (100%)
 rename drivers/{misc => accel}/cxl/irq.c (100%)
 rename drivers/{misc => accel}/cxl/main.c (100%)
 rename drivers/{misc => accel}/cxl/native.c (100%)
 rename drivers/{misc => accel}/cxl/of.c (100%)
 rename drivers/{misc => accel}/cxl/pci.c (100%)
 rename drivers/{misc => accel}/cxl/sysfs.c (100%)
 rename drivers/{misc => accel}/cxl/trace.c (100%)
 rename drivers/{misc => accel}/cxl/trace.h (100%)
 rename drivers/{misc => accel}/cxl/vphb.c (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8a9bbaf8f6e90..93fbfed6e6915 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4214,7 +4214,7 @@ M:	Andrew Donnellan <andrew.donnellan@au1.ibm.com>
 L:	linuxppc-dev@lists.ozlabs.org
 S:	Supported
 F:	arch/powerpc/platforms/powernv/pci-cxl.c
-F:	drivers/misc/cxl/
+F:	drivers/accel/cxl/
 F:	include/misc/cxl*
 F:	include/uapi/misc/cxl.h
 F:	Documentation/powerpc/cxl.txt
diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
index 13b36c0398895..c0754448efdf0 100644
--- a/drivers/accel/Kconfig
+++ b/drivers/accel/Kconfig
@@ -13,4 +13,6 @@ if ACCEL
 
 comment "HW Accellerator drivers"
 
+source "drivers/accel/cxl/Kconfig"
+
 endif
diff --git a/drivers/accel/Makefile b/drivers/accel/Makefile
index 343bbb8f45a14..752a54e227ad7 100644
--- a/drivers/accel/Makefile
+++ b/drivers/accel/Makefile
@@ -3,3 +3,4 @@
 # Makefile for accel devices
 #
 
+obj-$(CONFIG_CXL_BASE)		+= cxl/
diff --git a/drivers/misc/cxl/Kconfig b/drivers/accel/cxl/Kconfig
similarity index 100%
rename from drivers/misc/cxl/Kconfig
rename to drivers/accel/cxl/Kconfig
diff --git a/drivers/misc/cxl/Makefile b/drivers/accel/cxl/Makefile
similarity index 100%
rename from drivers/misc/cxl/Makefile
rename to drivers/accel/cxl/Makefile
diff --git a/drivers/misc/cxl/api.c b/drivers/accel/cxl/api.c
similarity index 100%
rename from drivers/misc/cxl/api.c
rename to drivers/accel/cxl/api.c
diff --git a/drivers/misc/cxl/base.c b/drivers/accel/cxl/base.c
similarity index 100%
rename from drivers/misc/cxl/base.c
rename to drivers/accel/cxl/base.c
diff --git a/drivers/misc/cxl/context.c b/drivers/accel/cxl/context.c
similarity index 100%
rename from drivers/misc/cxl/context.c
rename to drivers/accel/cxl/context.c
diff --git a/drivers/misc/cxl/cxl.h b/drivers/accel/cxl/cxl.h
similarity index 100%
rename from drivers/misc/cxl/cxl.h
rename to drivers/accel/cxl/cxl.h
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/accel/cxl/cxllib.c
similarity index 100%
rename from drivers/misc/cxl/cxllib.c
rename to drivers/accel/cxl/cxllib.c
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/accel/cxl/debugfs.c
similarity index 100%
rename from drivers/misc/cxl/debugfs.c
rename to drivers/accel/cxl/debugfs.c
diff --git a/drivers/misc/cxl/fault.c b/drivers/accel/cxl/fault.c
similarity index 100%
rename from drivers/misc/cxl/fault.c
rename to drivers/accel/cxl/fault.c
diff --git a/drivers/misc/cxl/file.c b/drivers/accel/cxl/file.c
similarity index 100%
rename from drivers/misc/cxl/file.c
rename to drivers/accel/cxl/file.c
diff --git a/drivers/misc/cxl/flash.c b/drivers/accel/cxl/flash.c
similarity index 100%
rename from drivers/misc/cxl/flash.c
rename to drivers/accel/cxl/flash.c
diff --git a/drivers/misc/cxl/guest.c b/drivers/accel/cxl/guest.c
similarity index 100%
rename from drivers/misc/cxl/guest.c
rename to drivers/accel/cxl/guest.c
diff --git a/drivers/misc/cxl/hcalls.c b/drivers/accel/cxl/hcalls.c
similarity index 100%
rename from drivers/misc/cxl/hcalls.c
rename to drivers/accel/cxl/hcalls.c
diff --git a/drivers/misc/cxl/hcalls.h b/drivers/accel/cxl/hcalls.h
similarity index 100%
rename from drivers/misc/cxl/hcalls.h
rename to drivers/accel/cxl/hcalls.h
diff --git a/drivers/misc/cxl/irq.c b/drivers/accel/cxl/irq.c
similarity index 100%
rename from drivers/misc/cxl/irq.c
rename to drivers/accel/cxl/irq.c
diff --git a/drivers/misc/cxl/main.c b/drivers/accel/cxl/main.c
similarity index 100%
rename from drivers/misc/cxl/main.c
rename to drivers/accel/cxl/main.c
diff --git a/drivers/misc/cxl/native.c b/drivers/accel/cxl/native.c
similarity index 100%
rename from drivers/misc/cxl/native.c
rename to drivers/accel/cxl/native.c
diff --git a/drivers/misc/cxl/of.c b/drivers/accel/cxl/of.c
similarity index 100%
rename from drivers/misc/cxl/of.c
rename to drivers/accel/cxl/of.c
diff --git a/drivers/misc/cxl/pci.c b/drivers/accel/cxl/pci.c
similarity index 100%
rename from drivers/misc/cxl/pci.c
rename to drivers/accel/cxl/pci.c
diff --git a/drivers/misc/cxl/sysfs.c b/drivers/accel/cxl/sysfs.c
similarity index 100%
rename from drivers/misc/cxl/sysfs.c
rename to drivers/accel/cxl/sysfs.c
diff --git a/drivers/misc/cxl/trace.c b/drivers/accel/cxl/trace.c
similarity index 100%
rename from drivers/misc/cxl/trace.c
rename to drivers/accel/cxl/trace.c
diff --git a/drivers/misc/cxl/trace.h b/drivers/accel/cxl/trace.h
similarity index 100%
rename from drivers/misc/cxl/trace.h
rename to drivers/accel/cxl/trace.h
diff --git a/drivers/misc/cxl/vphb.c b/drivers/accel/cxl/vphb.c
similarity index 100%
rename from drivers/misc/cxl/vphb.c
rename to drivers/accel/cxl/vphb.c
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index f417b06e11c51..02153382d67b6 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -532,7 +532,6 @@ source "drivers/misc/vmw_vmci/Kconfig"
 source "drivers/misc/mic/Kconfig"
 source "drivers/misc/genwqe/Kconfig"
 source "drivers/misc/echo/Kconfig"
-source "drivers/misc/cxl/Kconfig"
 source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index e39ccbbc1b3a8..72fa4fc42a2d6 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -52,7 +52,6 @@ obj-y				+= mic/
 obj-$(CONFIG_GENWQE)		+= genwqe/
 obj-$(CONFIG_ECHO)		+= echo/
 obj-$(CONFIG_VEXPRESS_SYSCFG)	+= vexpress-syscfg.o
-obj-$(CONFIG_CXL_BASE)		+= cxl/
 obj-$(CONFIG_ASPEED_LPC_CTRL)	+= aspeed-lpc-ctrl.o
 obj-$(CONFIG_ASPEED_LPC_SNOOP)	+= aspeed-lpc-snoop.o
 obj-$(CONFIG_PCI_ENDPOINT_TEST)	+= pci_endpoint_test.o
-- 
2.11.0



* [PATCH 3/5] drivers/accel: cxl: Move non-uapi include files
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
  2019-01-25 18:16             ` [PATCH 1/5] drivers/accel: Introduce subsystem Olof Johansson
  2019-01-25 18:16             ` [PATCH 2/5] cxl: Move to drivers/accel Olof Johansson
@ 2019-01-25 18:16             ` Olof Johansson
  2019-01-25 18:16             ` [PATCH 4/5] ocxl: Move to drivers/accel Olof Johansson
                               ` (3 subsequent siblings)
  6 siblings, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 18:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Olof Johansson,
	Arnd Bergmann

Kept separate from the driver move so that the edits stand apart from the pure renames.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
---
 MAINTAINERS                               | 2 +-
 arch/powerpc/include/asm/pnv-pci.h        | 2 +-
 arch/powerpc/mm/copro_fault.c             | 2 +-
 arch/powerpc/mm/hash_native_64.c          | 2 +-
 arch/powerpc/mm/pgtable-book3s64.c        | 2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
 drivers/accel/cxl/api.c                   | 2 +-
 drivers/accel/cxl/base.c                  | 2 +-
 drivers/accel/cxl/cxl.h                   | 4 ++--
 drivers/accel/cxl/cxllib.c                | 2 +-
 drivers/accel/cxl/irq.c                   | 2 +-
 drivers/accel/cxl/main.c                  | 2 +-
 drivers/accel/cxl/native.c                | 2 +-
 drivers/accel/cxl/pci.c                   | 2 +-
 drivers/accel/cxl/vphb.c                  | 2 +-
 drivers/scsi/cxlflash/cxl_hw.c            | 2 +-
 include/{misc => linux/accel}/cxl-base.h  | 4 ++--
 include/{misc => linux/accel}/cxl.h       | 6 +++---
 include/{misc => linux/accel}/cxllib.h    | 6 +++---
 19 files changed, 25 insertions(+), 25 deletions(-)
 rename include/{misc => linux/accel}/cxl-base.h (94%)
 rename include/{misc => linux/accel}/cxl.h (99%)
 rename include/{misc => linux/accel}/cxllib.h (97%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 93fbfed6e6915..97aed390129f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4215,7 +4215,7 @@ L:	linuxppc-dev@lists.ozlabs.org
 S:	Supported
 F:	arch/powerpc/platforms/powernv/pci-cxl.c
 F:	drivers/accel/cxl/
-F:	include/misc/cxl*
+F:	include/linux/accel/cxl*
 F:	include/uapi/misc/cxl.h
 F:	Documentation/powerpc/cxl.txt
 F:	Documentation/ABI/testing/sysfs-class-cxl
diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index 630eb8b1b7ed3..17e0ded18ffd6 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -13,7 +13,7 @@
 #include <linux/pci.h>
 #include <linux/pci_hotplug.h>
 #include <linux/irq.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 #include <asm/opal-api.h>
 
 #define PCI_SLOT_ID_PREFIX	(1UL << 63)
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index c8da352e8686c..441a51d9c8f8a 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -26,7 +26,7 @@
 #include <asm/reg.h>
 #include <asm/copro.h>
 #include <asm/spu.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 /*
  * This ought to be kept in sync with the powerpc specific do_page_fault
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index aaa28fd918fe4..b6f49c98aa732 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -31,7 +31,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/feature-fixups.h>
 
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 #ifdef DEBUG_LOW
 #define DBG_LOW(fmt...) udbg_printf(fmt)
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 55876b7e38130..34bdd8bca31e6 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -10,7 +10,7 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/memblock.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7db3119f8a5b3..0506eb74b99b6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -41,7 +41,7 @@
 #include <asm/pnv-pci.h>
 #include <asm/mmzone.h>
 
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 #include "powernv.h"
 #include "pci.h"
diff --git a/drivers/accel/cxl/api.c b/drivers/accel/cxl/api.c
index 750470ef2049b..220944d7a398c 100644
--- a/drivers/accel/cxl/api.c
+++ b/drivers/accel/cxl/api.c
@@ -10,7 +10,7 @@
 #include <linux/pci.h>
 #include <linux/slab.h>
 #include <linux/file.h>
-#include <misc/cxl.h>
+#include <linux/accel/cxl.h>
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <linux/sched/mm.h>
diff --git a/drivers/accel/cxl/base.c b/drivers/accel/cxl/base.c
index 7557835cdfcd6..bd2958a7a379a 100644
--- a/drivers/accel/cxl/base.c
+++ b/drivers/accel/cxl/base.c
@@ -10,7 +10,7 @@
 #include <linux/module.h>
 #include <linux/rcupdate.h>
 #include <asm/errno.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 #include <linux/of_platform.h>
 #include "cxl.h"
 
diff --git a/drivers/accel/cxl/cxl.h b/drivers/accel/cxl/cxl.h
index d1d927ccb589c..82457ccc58cad 100644
--- a/drivers/accel/cxl/cxl.h
+++ b/drivers/accel/cxl/cxl.h
@@ -22,9 +22,9 @@
 #include <asm/cputable.h>
 #include <asm/mmu.h>
 #include <asm/reg.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
-#include <misc/cxl.h>
+#include <linux/accel/cxl.h>
 #include <uapi/misc/cxl.h>
 
 extern uint cxl_verbose;
diff --git a/drivers/accel/cxl/cxllib.c b/drivers/accel/cxl/cxllib.c
index 5a3f912552585..ba5cde62e030f 100644
--- a/drivers/accel/cxl/cxllib.c
+++ b/drivers/accel/cxl/cxllib.c
@@ -10,7 +10,7 @@
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
 #include <asm/pnv-pci.h>
-#include <misc/cxllib.h>
+#include <linux/accel/cxllib.h>
 
 #include "cxl.h"
 
diff --git a/drivers/accel/cxl/irq.c b/drivers/accel/cxl/irq.c
index ce08a9f22308f..6f55130dd014c 100644
--- a/drivers/accel/cxl/irq.c
+++ b/drivers/accel/cxl/irq.c
@@ -14,7 +14,7 @@
 #include <linux/slab.h>
 #include <linux/pid.h>
 #include <asm/cputable.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 #include "cxl.h"
 #include "trace.h"
diff --git a/drivers/accel/cxl/main.c b/drivers/accel/cxl/main.c
index f35406be465a5..f49edfe0371d2 100644
--- a/drivers/accel/cxl/main.c
+++ b/drivers/accel/cxl/main.c
@@ -22,7 +22,7 @@
 #include <linux/sched/task.h>
 
 #include <asm/cputable.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 #include "cxl.h"
 #include "trace.h"
diff --git a/drivers/accel/cxl/native.c b/drivers/accel/cxl/native.c
index c9d5d82dce8ec..e9b8116a5e34b 100644
--- a/drivers/accel/cxl/native.c
+++ b/drivers/accel/cxl/native.c
@@ -17,7 +17,7 @@
 #include <linux/delay.h>
 #include <asm/synch.h>
 #include <asm/switch_to.h>
-#include <misc/cxl-base.h>
+#include <linux/accel/cxl-base.h>
 
 #include "cxl.h"
 #include "trace.h"
diff --git a/drivers/accel/cxl/pci.c b/drivers/accel/cxl/pci.c
index c79ba1c699ad1..ff8d0b5679c43 100644
--- a/drivers/accel/cxl/pci.c
+++ b/drivers/accel/cxl/pci.c
@@ -24,7 +24,7 @@
 #include <asm/reg.h>
 
 #include "cxl.h"
-#include <misc/cxl.h>
+#include <linux/accel/cxl.h>
 
 
 #define CXL_PCI_VSEC_ID	0x1280
diff --git a/drivers/accel/cxl/vphb.c b/drivers/accel/cxl/vphb.c
index 49da2f744bbf1..c3670d8b4a252 100644
--- a/drivers/accel/cxl/vphb.c
+++ b/drivers/accel/cxl/vphb.c
@@ -8,7 +8,7 @@
  */
 
 #include <linux/pci.h>
-#include <misc/cxl.h>
+#include <linux/accel/cxl.h>
 #include "cxl.h"
 
 static int cxl_pci_probe_mode(struct pci_bus *bus)
diff --git a/drivers/scsi/cxlflash/cxl_hw.c b/drivers/scsi/cxlflash/cxl_hw.c
index b42da88386bdd..bb3b6b9443062 100644
--- a/drivers/scsi/cxlflash/cxl_hw.c
+++ b/drivers/scsi/cxlflash/cxl_hw.c
@@ -12,7 +12,7 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <misc/cxl.h>
+#include <linux/accel/cxl.h>
 
 #include "backend.h"
 
diff --git a/include/misc/cxl-base.h b/include/linux/accel/cxl-base.h
similarity index 94%
rename from include/misc/cxl-base.h
rename to include/linux/accel/cxl-base.h
index f53808fa638ab..8e7825693f7de 100644
--- a/include/misc/cxl-base.h
+++ b/include/linux/accel/cxl-base.h
@@ -7,8 +7,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifndef _MISC_CXL_BASE_H
-#define _MISC_CXL_BASE_H
+#ifndef _LINUX_ACCEL_CXL_BASE_H
+#define _LINUX_ACCEL_CXL_BASE_H
 
 #ifdef CONFIG_CXL_BASE
 
diff --git a/include/misc/cxl.h b/include/linux/accel/cxl.h
similarity index 99%
rename from include/misc/cxl.h
rename to include/linux/accel/cxl.h
index ea9ff4a1a9ca5..07c3942c62ea1 100644
--- a/include/misc/cxl.h
+++ b/include/linux/accel/cxl.h
@@ -7,8 +7,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifndef _MISC_CXL_H
-#define _MISC_CXL_H
+#ifndef _LINUX_ACCEL_CXL_H
+#define _LINUX_ACCEL_CXL_H
 
 #include <linux/pci.h>
 #include <linux/poll.h>
@@ -266,4 +266,4 @@ void cxl_set_driver_ops(struct cxl_context *ctx,
 void cxl_context_events_pending(struct cxl_context *ctx,
 				unsigned int new_events);
 
-#endif /* _MISC_CXL_H */
+#endif /* _LINUX_ACCEL_CXL_H */
diff --git a/include/misc/cxllib.h b/include/linux/accel/cxllib.h
similarity index 97%
rename from include/misc/cxllib.h
rename to include/linux/accel/cxllib.h
index e5aa29f019a6b..ef045430a9679 100644
--- a/include/misc/cxllib.h
+++ b/include/linux/accel/cxllib.h
@@ -7,8 +7,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#ifndef _MISC_CXLLIB_H
-#define _MISC_CXLLIB_H
+#ifndef _LINUX_ACCEL_CXLLIB_H
+#define _LINUX_ACCEL_CXLLIB_H
 
 #include <linux/pci.h>
 #include <asm/reg.h>
@@ -130,4 +130,4 @@ int cxllib_get_PE_attributes(struct task_struct *task,
 int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags);
 
 
-#endif /* _MISC_CXLLIB_H */
+#endif /* _LINUX_ACCEL_CXLLIB_H */
-- 
2.11.0



* [PATCH 4/5] ocxl: Move to drivers/accel
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
                               ` (2 preceding siblings ...)
  2019-01-25 18:16             ` [PATCH 3/5] drivers/accel: cxl: Move non-uapi include files Olof Johansson
@ 2019-01-25 18:16             ` Olof Johansson
  2019-01-25 18:16             ` [PATCH 5/5] drivers/accel: ocxl: Move non-uapi include files Olof Johansson
                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 18:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Olof Johansson,
	Arnd Bergmann

The include files are moved in a separate commit, so they are left alone for now.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
---
 MAINTAINERS                                  | 2 +-
 drivers/accel/Kconfig                        | 1 +
 drivers/accel/Makefile                       | 1 +
 drivers/{misc => accel}/ocxl/Kconfig         | 0
 drivers/{misc => accel}/ocxl/Makefile        | 0
 drivers/{misc => accel}/ocxl/afu_irq.c       | 0
 drivers/{misc => accel}/ocxl/config.c        | 0
 drivers/{misc => accel}/ocxl/context.c       | 0
 drivers/{misc => accel}/ocxl/file.c          | 0
 drivers/{misc => accel}/ocxl/link.c          | 0
 drivers/{misc => accel}/ocxl/main.c          | 0
 drivers/{misc => accel}/ocxl/ocxl_internal.h | 0
 drivers/{misc => accel}/ocxl/pasid.c         | 0
 drivers/{misc => accel}/ocxl/pci.c           | 0
 drivers/{misc => accel}/ocxl/sysfs.c         | 0
 drivers/{misc => accel}/ocxl/trace.c         | 0
 drivers/{misc => accel}/ocxl/trace.h         | 0
 drivers/misc/Kconfig                         | 1 -
 drivers/misc/Makefile                        | 1 -
 19 files changed, 3 insertions(+), 3 deletions(-)
 rename drivers/{misc => accel}/ocxl/Kconfig (100%)
 rename drivers/{misc => accel}/ocxl/Makefile (100%)
 rename drivers/{misc => accel}/ocxl/afu_irq.c (100%)
 rename drivers/{misc => accel}/ocxl/config.c (100%)
 rename drivers/{misc => accel}/ocxl/context.c (100%)
 rename drivers/{misc => accel}/ocxl/file.c (100%)
 rename drivers/{misc => accel}/ocxl/link.c (100%)
 rename drivers/{misc => accel}/ocxl/main.c (100%)
 rename drivers/{misc => accel}/ocxl/ocxl_internal.h (100%)
 rename drivers/{misc => accel}/ocxl/pasid.c (100%)
 rename drivers/{misc => accel}/ocxl/pci.c (100%)
 rename drivers/{misc => accel}/ocxl/sysfs.c (100%)
 rename drivers/{misc => accel}/ocxl/trace.c (100%)
 rename drivers/{misc => accel}/ocxl/trace.h (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 97aed390129f0..a1b2ba3bd402d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11012,7 +11012,7 @@ L:	linuxppc-dev@lists.ozlabs.org
 S:	Supported
 F:	arch/powerpc/platforms/powernv/ocxl.c
 F:	arch/powerpc/include/asm/pnv-ocxl.h
-F:	drivers/misc/ocxl/
+F:	drivers/accel/ocxl/
 F:	include/misc/ocxl*
 F:	include/uapi/misc/ocxl.h
 F:	Documentation/accelerators/ocxl.rst
diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
index c0754448efdf0..1790630c560e6 100644
--- a/drivers/accel/Kconfig
+++ b/drivers/accel/Kconfig
@@ -14,5 +14,6 @@ if ACCEL
 comment "HW Accellerator drivers"
 
 source "drivers/accel/cxl/Kconfig"
+source "drivers/accel/ocxl/Kconfig"
 
 endif
diff --git a/drivers/accel/Makefile b/drivers/accel/Makefile
index 752a54e227ad7..313867582ff5a 100644
--- a/drivers/accel/Makefile
+++ b/drivers/accel/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-$(CONFIG_CXL_BASE)		+= cxl/
+obj-$(CONFIG_OCXL)		+= ocxl/
diff --git a/drivers/misc/ocxl/Kconfig b/drivers/accel/ocxl/Kconfig
similarity index 100%
rename from drivers/misc/ocxl/Kconfig
rename to drivers/accel/ocxl/Kconfig
diff --git a/drivers/misc/ocxl/Makefile b/drivers/accel/ocxl/Makefile
similarity index 100%
rename from drivers/misc/ocxl/Makefile
rename to drivers/accel/ocxl/Makefile
diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/accel/ocxl/afu_irq.c
similarity index 100%
rename from drivers/misc/ocxl/afu_irq.c
rename to drivers/accel/ocxl/afu_irq.c
diff --git a/drivers/misc/ocxl/config.c b/drivers/accel/ocxl/config.c
similarity index 100%
rename from drivers/misc/ocxl/config.c
rename to drivers/accel/ocxl/config.c
diff --git a/drivers/misc/ocxl/context.c b/drivers/accel/ocxl/context.c
similarity index 100%
rename from drivers/misc/ocxl/context.c
rename to drivers/accel/ocxl/context.c
diff --git a/drivers/misc/ocxl/file.c b/drivers/accel/ocxl/file.c
similarity index 100%
rename from drivers/misc/ocxl/file.c
rename to drivers/accel/ocxl/file.c
diff --git a/drivers/misc/ocxl/link.c b/drivers/accel/ocxl/link.c
similarity index 100%
rename from drivers/misc/ocxl/link.c
rename to drivers/accel/ocxl/link.c
diff --git a/drivers/misc/ocxl/main.c b/drivers/accel/ocxl/main.c
similarity index 100%
rename from drivers/misc/ocxl/main.c
rename to drivers/accel/ocxl/main.c
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/accel/ocxl/ocxl_internal.h
similarity index 100%
rename from drivers/misc/ocxl/ocxl_internal.h
rename to drivers/accel/ocxl/ocxl_internal.h
diff --git a/drivers/misc/ocxl/pasid.c b/drivers/accel/ocxl/pasid.c
similarity index 100%
rename from drivers/misc/ocxl/pasid.c
rename to drivers/accel/ocxl/pasid.c
diff --git a/drivers/misc/ocxl/pci.c b/drivers/accel/ocxl/pci.c
similarity index 100%
rename from drivers/misc/ocxl/pci.c
rename to drivers/accel/ocxl/pci.c
diff --git a/drivers/misc/ocxl/sysfs.c b/drivers/accel/ocxl/sysfs.c
similarity index 100%
rename from drivers/misc/ocxl/sysfs.c
rename to drivers/accel/ocxl/sysfs.c
diff --git a/drivers/misc/ocxl/trace.c b/drivers/accel/ocxl/trace.c
similarity index 100%
rename from drivers/misc/ocxl/trace.c
rename to drivers/accel/ocxl/trace.c
diff --git a/drivers/misc/ocxl/trace.h b/drivers/accel/ocxl/trace.h
similarity index 100%
rename from drivers/misc/ocxl/trace.h
rename to drivers/accel/ocxl/trace.h
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 02153382d67b6..35c11005a5150 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -532,6 +532,5 @@ source "drivers/misc/vmw_vmci/Kconfig"
 source "drivers/misc/mic/Kconfig"
 source "drivers/misc/genwqe/Kconfig"
 source "drivers/misc/echo/Kconfig"
-source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 72fa4fc42a2d6..2ee4eeb573b58 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -55,6 +55,5 @@ obj-$(CONFIG_VEXPRESS_SYSCFG)	+= vexpress-syscfg.o
 obj-$(CONFIG_ASPEED_LPC_CTRL)	+= aspeed-lpc-ctrl.o
 obj-$(CONFIG_ASPEED_LPC_SNOOP)	+= aspeed-lpc-snoop.o
 obj-$(CONFIG_PCI_ENDPOINT_TEST)	+= pci_endpoint_test.o
-obj-$(CONFIG_OCXL)		+= ocxl/
 obj-y				+= cardreader/
 obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
-- 
2.11.0



* [PATCH 5/5] drivers/accel: ocxl: Move non-uapi include files
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
                               ` (3 preceding siblings ...)
  2019-01-25 18:16             ` [PATCH 4/5] ocxl: Move to drivers/accel Olof Johansson
@ 2019-01-25 18:16             ` Olof Johansson
  2019-01-26 13:51               ` Greg Kroah-Hartman
  2019-01-26 21:11             ` [PATCH/RFC 0/5] HW accel subsystem Arnd Bergmann
  2019-02-01  9:10             ` Kenneth Lee
  6 siblings, 1 reply; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 18:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Olof Johansson,
	Arnd Bergmann

Kept separate from the driver move so that the edits stand apart from the pure renames.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>
---
 MAINTAINERS                                 | 2 +-
 arch/powerpc/platforms/powernv/ocxl.c       | 2 +-
 drivers/accel/ocxl/config.c                 | 4 ++--
 drivers/accel/ocxl/link.c                   | 2 +-
 drivers/accel/ocxl/ocxl_internal.h          | 2 +-
 drivers/scsi/cxlflash/ocxl_hw.c             | 2 +-
 include/{misc => linux/accel}/ocxl-config.h | 6 +++---
 include/{misc => linux/accel}/ocxl.h        | 6 +++---
 8 files changed, 13 insertions(+), 13 deletions(-)
 rename include/{misc => linux/accel}/ocxl-config.h (94%)
 rename include/{misc => linux/accel}/ocxl.h (98%)

diff --git a/MAINTAINERS b/MAINTAINERS
index a1b2ba3bd402d..faa39da1445d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11013,7 +11013,7 @@ S:	Supported
 F:	arch/powerpc/platforms/powernv/ocxl.c
 F:	arch/powerpc/include/asm/pnv-ocxl.h
 F:	drivers/accel/ocxl/
-F:	include/misc/ocxl*
+F:	include/linux/accel/ocxl*
 F:	include/uapi/misc/ocxl.h
 F:	Documentation/accelerators/ocxl.rst
 
diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c81..90e3a66c51dde 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -3,7 +3,7 @@
 #include <asm/pnv-ocxl.h>
 #include <asm/opal.h>
 #include <asm/xive.h>
-#include <misc/ocxl-config.h>
+#include <linux/accel/ocxl-config.h>
 #include "pci.h"
 
 #define PNV_OCXL_TL_P9_RECV_CAP		0x000000000000000Full
diff --git a/drivers/accel/ocxl/config.c b/drivers/accel/ocxl/config.c
index 8f2c5d8bd2eee..38351c6b28039 100644
--- a/drivers/accel/ocxl/config.c
+++ b/drivers/accel/ocxl/config.c
@@ -2,8 +2,8 @@
 // Copyright 2017 IBM Corp.
 #include <linux/pci.h>
 #include <asm/pnv-ocxl.h>
-#include <misc/ocxl.h>
-#include <misc/ocxl-config.h>
+#include <linux/accel/ocxl.h>
+#include <linux/accel/ocxl-config.h>
 
 #define EXTRACT_BIT(val, bit) (!!(val & BIT(bit)))
 #define EXTRACT_BITS(val, s, e) ((val & GENMASK(e, s)) >> s)
diff --git a/drivers/accel/ocxl/link.c b/drivers/accel/ocxl/link.c
index d50b861d7e57b..7c0550425a129 100644
--- a/drivers/accel/ocxl/link.c
+++ b/drivers/accel/ocxl/link.c
@@ -6,7 +6,7 @@
 #include <linux/mmu_context.h>
 #include <asm/copro.h>
 #include <asm/pnv-ocxl.h>
-#include <misc/ocxl.h>
+#include <linux/accel/ocxl.h>
 #include "ocxl_internal.h"
 #include "trace.h"
 
diff --git a/drivers/accel/ocxl/ocxl_internal.h b/drivers/accel/ocxl/ocxl_internal.h
index a32f2151029f6..4516390a8dbcb 100644
--- a/drivers/accel/ocxl/ocxl_internal.h
+++ b/drivers/accel/ocxl/ocxl_internal.h
@@ -6,7 +6,7 @@
 #include <linux/pci.h>
 #include <linux/cdev.h>
 #include <linux/list.h>
-#include <misc/ocxl.h>
+#include <linux/accel/ocxl.h>
 
 #define MAX_IRQ_PER_LINK	2000
 #define MAX_IRQ_PER_CONTEXT	MAX_IRQ_PER_LINK
diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
index 37b8dc60f5f6d..7a62f78033b73 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.c
+++ b/drivers/scsi/cxlflash/ocxl_hw.c
@@ -19,7 +19,7 @@
 #include <linux/poll.h>
 #include <linux/sched/signal.h>
 
-#include <misc/ocxl.h>
+#include <linux/accel/ocxl.h>
 
 #include <uapi/misc/cxl.h>
 
diff --git a/include/misc/ocxl-config.h b/include/linux/accel/ocxl-config.h
similarity index 94%
rename from include/misc/ocxl-config.h
rename to include/linux/accel/ocxl-config.h
index 3526fa996a220..4d25ed7b971f8 100644
--- a/include/misc/ocxl-config.h
+++ b/include/linux/accel/ocxl-config.h
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0+
 // Copyright 2017 IBM Corp.
-#ifndef _OCXL_CONFIG_H_
-#define _OCXL_CONFIG_H_
+#ifndef _LINUX_ACCEL_OCXL_CONFIG_H_
+#define _LINUX_ACCEL_OCXL_CONFIG_H_
 
 /*
  * This file lists the various constants used to read the
@@ -42,4 +42,4 @@
 #define   OCXL_DVSEC_VENDOR_TLX_VERS            0x10
 #define   OCXL_DVSEC_VENDOR_DLX_VERS            0x20
 
-#endif /* _OCXL_CONFIG_H_ */
+#endif /* _LINUX_ACCEL_OCXL_CONFIG_H_ */
diff --git a/include/misc/ocxl.h b/include/linux/accel/ocxl.h
similarity index 98%
rename from include/misc/ocxl.h
rename to include/linux/accel/ocxl.h
index 9ff6ddc28e221..1ab4c50700029 100644
--- a/include/misc/ocxl.h
+++ b/include/linux/accel/ocxl.h
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0+
 // Copyright 2017 IBM Corp.
-#ifndef _MISC_OCXL_H_
-#define _MISC_OCXL_H_
+#ifndef _LINUX_ACCEL_OCXL_H_
+#define _LINUX_ACCEL_OCXL_H_
 
 #include <linux/pci.h>
 
@@ -220,4 +220,4 @@ extern int ocxl_link_irq_alloc(void *link_handle, int *hw_irq,
  */
 extern void ocxl_link_free_irq(void *link_handle, int hw_irq);
 
-#endif /* _MISC_OCXL_H_ */
+#endif /* _LINUX_ACCEL_OCXL_H_ */
-- 
2.11.0



* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23  0:49   ` Joe Perches
@ 2019-01-25 19:18     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-25 19:18 UTC (permalink / raw)
  To: Joe Perches; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org

On Wed, Jan 23, 2019 at 2:49 AM Joe Perches <joe@perches.com> wrote:
>
> On Wed, 2019-01-23 at 02:00 +0200, Oded Gabbay wrote:
> > This patch adds the habanalabs skeleton driver. The driver does nothing at
> > this stage except very basic operations. It contains the minimal code to
> > insmod and rmmod the driver and to create a /dev/hlX file per PCI device.
>
> trivial notes:
>
> >
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> []
> > \ No newline at end of file
>
> You should fix these.  There are at least a couple of them.
>
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> []
> > @@ -0,0 +1,331 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
>
> Add #define pr_fmt(fmt) "habanalabs: " fmt
>
> > +
> > +#include "habanalabs.h"
>
> or add it in this file
>
>
> > +static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
> > +                             int minor, const struct file_operations *fops)
> > +{
> > +     int err, devno = MKDEV(hdev->major, minor);
> > +     struct cdev *hdev_cdev = &hdev->cdev;
> > +     char name[8];
> > +
> > +     sprintf(name, "hl%d", hdev->id);
>
> Might overflow name one day
>
> > +
> > +     cdev_init(hdev_cdev, fops);
> > +     hdev_cdev->owner = THIS_MODULE;
> > +     err = cdev_add(hdev_cdev, devno, 1);
> > +     if (err) {
> > +             pr_err("habanalabs: Failed to add char device %s", name);
>
> So #define pr_fmt can auto prefix these and this would be
>
>                 pr_err("Failed to add char device %s\n", name);
>
> missing terminating '\n' btw
>
> > +             goto err_cdev_add;
> > +     }
> > +
> > +     hdev->dev = device_create(hclass, NULL, devno, NULL, "%s", name);
> > +     if (IS_ERR(hdev->dev)) {
> > +             pr_err("habanalabs: Failed to create device %s\n", name);
>
> And this would be:
>                 pr_err("Failed to create device %s\n", name);
>
>
> etc...
>
> > +static int device_early_init(struct hl_device *hdev)
> > +{
> > +     switch (hdev->asic_type) {
> > +     case ASIC_GOYA:
> > +             sprintf(hdev->asic_name, "GOYA");
>
> strcpy or, perhaps better still, strlcpy
>
> > +int hl_device_init(struct hl_device *hdev, struct class *hclass)
> > +{
> []
> > +     dev_notice(hdev->dev,
> > +             "Successfully added device to habanalabs driver\n");
>
> This is mostly aligned to open parenthesis, but perhaps
> you could check with scripts/checkpatch.pl --strict and
> see if you agree with anything it bleats.
>
> > +int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr,
> > +                             u32 timeout_us, u32 *val)
> > +{
> > +     /*
> > +      * pReturnVal is defined as volatile because it points to HOST memory,
> > +      * which is being written to by the device. Therefore, we can't use
> > +      * locks to synchronize it and it is not a memory-mapped register space
> > +      */
> > +     volatile u32 *pReturnVal = (volatile u32 *) addr;
>
> It'd be nice to avoid hungarian and camelcase
>
> > +     ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
> > +
> > +     might_sleep();
> > +
> > +     for (;;) {
> > +             *val = *pReturnVal;
> > +             if (*val)
> > +                     break;
> > +             if (ktime_compare(ktime_get(), timeout) > 0) {
> > +                     *val = *pReturnVal;
> > +                     break;
> > +             }
> > +             usleep_range((100 >> 2) + 1, 100);
> > +     }
> > +
> > +     return (*val ? 0 : -ETIMEDOUT);
>
> Unnecessary parentheses
>
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> []
> > +static struct pci_device_id ids[] = {
> > +     { PCI_DEVICE(PCI_VENDOR_ID_HABANALABS, PCI_IDS_GOYA), },
> > +     { 0, }
> > +};
>
> static const?
>
> > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> []
> > +struct hl_bd {
> > +     __u64   ptr;
> > +     __u32   len;
> > +     union {
> > +             struct {
> > +                     __u32   repeat:16;
> > +                     __u32   res1:8;
> > +                     __u32   repeat_valid:1;
> > +                     __u32   res2:7;
> > +             };
> > +             __u32   ctl;
> > +     };
> > +};
>
> Maybe use the appropriate bit-endian __le<size> instead of __u<size>
> with whatever cpu_to_le<size> / le<size>_to_cpu bits are necessary.
>
>

Hi Joe,
Thanks for the review.
I fixed everything except for two things:
1. Alignment to open parenthesis. I never code like that in the kernel
and I don't really believe in anything that requires combining spaces
and tabs.
2. The endian-annotated (__le<size>) format. We don't support big-endian
architectures (what's left after POWER moved to support little-endian?).
And in any case, our software stack is so big that this minor change in
the driver won't have any impact.
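
For illustration only, here is a userspace sketch of what the suggested
explicit-endian handling amounts to for the ctl word of struct hl_bd. It
is not code from the driver. The helper names (bd_ctl_pack, put_le32,
get_le32) and the mask/shift macros are hypothetical, and the bit
positions assume the bit-fields are allocated from the least significant
bit, as is typical on little-endian ABIs. In the kernel the byte-order
step would be done with cpu_to_le32() / le32_to_cpu() on an __le32 field.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout of the 32-bit ctl word, LSB first, mirroring the
 * bit-field order in struct hl_bd: repeat occupies bits 0-15 and
 * repeat_valid is bit 24. Names below are hypothetical.
 */
#define BD_CTL_REPEAT_MASK		0xffffu
#define BD_CTL_REPEAT_VALID_SHIFT	24

/* Build the ctl word with explicit shifts instead of C bit-fields. */
static uint32_t bd_ctl_pack(uint32_t repeat, int repeat_valid)
{
	assert(repeat <= BD_CTL_REPEAT_MASK);

	return repeat |
	       ((uint32_t)(repeat_valid != 0) << BD_CTL_REPEAT_VALID_SHIFT);
}

/* Store a 32-bit value in little-endian byte order on any host. */
static void put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)(v >> 24);
}

/* Load a little-endian 32-bit value on any host. */
static uint32_t get_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] |
	       ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) |
	       ((uint32_t)in[3] << 24);
}
```

With shifts and an explicit byte order, the descriptor bytes seen by the
device do not depend on the compiler's bit-field layout or on the host's
endianness.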

Thanks,
Oded


* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23 12:28   ` Mike Rapoport
  2019-01-23 12:40     ` Greg KH
@ 2019-01-25 20:05     ` Oded Gabbay
  1 sibling, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-25 20:05 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:43AM +0200, Oded Gabbay wrote:
> > This patch adds the habanalabs skeleton driver. The driver does nothing at
> > this stage except very basic operations. It contains the minimal code to
> > insmod and rmmod the driver and to create a /dev/hlX file per PCI device.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/Kconfig                          |   1 +
> >  drivers/misc/Makefile                         |   1 +
> >  drivers/misc/habanalabs/Kconfig               |  22 ++
> >  drivers/misc/habanalabs/Makefile              |   7 +
> >  drivers/misc/habanalabs/device.c              | 331 ++++++++++++++++
> >  drivers/misc/habanalabs/habanalabs.h          | 149 +++++++
> >  drivers/misc/habanalabs/habanalabs_drv.c      | 366 ++++++++++++++++++
> >  .../habanalabs/include/habanalabs_device_if.h | 125 ++++++
> >  8 files changed, 1002 insertions(+)
> >  create mode 100644 drivers/misc/habanalabs/Kconfig
> >  create mode 100644 drivers/misc/habanalabs/Makefile
> >  create mode 100644 drivers/misc/habanalabs/device.c
> >  create mode 100644 drivers/misc/habanalabs/habanalabs.h
> >  create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c
> >  create mode 100644 drivers/misc/habanalabs/include/habanalabs_device_if.h
> >
> > diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> > index f417b06e11c5..fecab53c4f21 100644
> > --- a/drivers/misc/Kconfig
> > +++ b/drivers/misc/Kconfig
> > @@ -535,4 +535,5 @@ source "drivers/misc/echo/Kconfig"
> >  source "drivers/misc/cxl/Kconfig"
> >  source "drivers/misc/ocxl/Kconfig"
> >  source "drivers/misc/cardreader/Kconfig"
> > +source "drivers/misc/habanalabs/Kconfig"
> >  endmenu
> > diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> > index e39ccbbc1b3a..ae77dfd790a4 100644
> > --- a/drivers/misc/Makefile
> > +++ b/drivers/misc/Makefile
> > @@ -59,3 +59,4 @@ obj-$(CONFIG_PCI_ENDPOINT_TEST)     += pci_endpoint_test.o
> >  obj-$(CONFIG_OCXL)           += ocxl/
> >  obj-y                                += cardreader/
> >  obj-$(CONFIG_PVPANIC)        += pvpanic.o
> > +obj-$(CONFIG_HABANA_AI)              += habanalabs/
> > diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
> > new file mode 100644
> > index 000000000000..b7f38a14caf5
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/Kconfig
> > @@ -0,0 +1,22 @@
> > +#
> > +# HabanaLabs AI accelerators driver
> > +#
> > +
> > +config HABANA_AI
> > +     tristate "HabanaAI accelerators (habanalabs)"
> > +     depends on PCI
> > +     select FRAME_VECTOR
> > +     help
> > +       Enables PCIe card driver for Habana's AI Processors (AIP) that are
> > +       designed to accelerate Deep Learning inference and training workloads.
> > +
> > +       The driver manages the PCIe devices and provides IOCTL interface for
> > +       the user to submit workloads to the devices.
> > +
> > +       The user-space interface is described in
> > +       include/uapi/misc/habanalabs.h
> > +
> > +       If unsure, say N.
> > +
> > +       To compile this driver as a module, choose M here: the
> > +       module will be called habanalabs.
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > new file mode 100644
> > index 000000000000..b41433a09e02
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -0,0 +1,7 @@
> > +#
> > +# Makefile for HabanaLabs AI accelerators driver
> > +#
> > +
> > +obj-m        := habanalabs.o
> > +
> > +habanalabs-y := habanalabs_drv.o device.o
> > \ No newline at end of file
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > new file mode 100644
> > index 000000000000..376b55eb73d4
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -0,0 +1,331 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#include <linux/fs.h>
> > +#include <linux/kthread.h>
> > +#include <linux/sched/signal.h>
> > +
> > +static void hpriv_release(struct kref *ref)
> > +{
> > +     struct hl_fpriv *hpriv;
> > +     struct hl_device *hdev;
> > +
> > +     hpriv = container_of(ref, struct hl_fpriv, refcount);
> > +
> > +     hdev = hpriv->hdev;
> > +
> > +     put_pid(hpriv->taskpid);
> > +
> > +     kfree(hpriv);
> > +}
> > +
> > +void hl_hpriv_get(struct hl_fpriv *hpriv)
> > +{
> > +     kref_get(&hpriv->refcount);
> > +}
> > +
> > +void hl_hpriv_put(struct hl_fpriv *hpriv)
> > +{
> > +     kref_put(&hpriv->refcount, hpriv_release);
> > +}
> > +
> > +/**
> > + * hl_device_release - release function for habanalabs device
> > + *
> > + * @inode: pointer to inode structure
> > + * @filp: pointer to file structure
> > + *
> > + * Called when process closes an habanalabs device
> > + */
>
> It's nice to see docs coming along with the code.
> I have some comments for the formatting.
>
> kernel-doc won't be happy about missing return value descriptions, and
> although they are sometimes redundant or too obvious their absence makes
> 'make V=1 htmldocs' really noisy.
>
> In general, it would be nice if you could link the habanalabs driver
> kernel-doc somewhere in Documentation/ and run 'make V=1 htmldocs'.
>
> > +static int hl_device_release(struct inode *inode, struct file *filp)
> > +{
> > +     struct hl_fpriv *hpriv = filp->private_data;
> > +
> > +     filp->private_data = NULL;
> > +
> > +     hl_hpriv_put(hpriv);
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations hl_ops = {
> > +     .owner = THIS_MODULE,
> > +     .open = hl_device_open,
> > +     .release = hl_device_release
> > +};
> > +
> > +/**
> > + * device_setup_cdev - setup cdev and device for habanalabs device
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + * @hclass: pointer to the class object of the device
> > + * @minor: minor number of the specific device
> > + * @fpos : file operations to install for this device
> > + *
> > + * Create a cdev and a Linux device for habanalabs's device. Need to be
> > + * called at the end of the habanalabs device initialization process,
> > + * because this function exposes the device to the user
> > + */
> > +static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
> > +                             int minor, const struct file_operations *fops)
> > +{
> > +     int err, devno = MKDEV(hdev->major, minor);
> > +     struct cdev *hdev_cdev = &hdev->cdev;
> > +     char name[8];
> > +
> > +     sprintf(name, "hl%d", hdev->id);
> > +
> > +     cdev_init(hdev_cdev, fops);
> > +     hdev_cdev->owner = THIS_MODULE;
> > +     err = cdev_add(hdev_cdev, devno, 1);
> > +     if (err) {
> > +             pr_err("habanalabs: Failed to add char device %s", name);
> > +             goto err_cdev_add;
> > +     }
> > +
> > +     hdev->dev = device_create(hclass, NULL, devno, NULL, "%s", name);
> > +     if (IS_ERR(hdev->dev)) {
> > +             pr_err("habanalabs: Failed to create device %s\n", name);
> > +             err = PTR_ERR(hdev->dev);
> > +             goto err_device_create;
> > +     }
> > +
> > +     dev_set_drvdata(hdev->dev, hdev);
> > +
> > +     return 0;
> > +
> > +err_device_create:
> > +     cdev_del(hdev_cdev);
> > +err_cdev_add:
> > +     return err;
> > +}
> > +
> > +/**
> > + * device_early_init - do some early initialization for the habanalabs device
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + * Install the relevant function pointers and call the early_init function,
> > + * if such a function exists
> > + */
> > +static int device_early_init(struct hl_device *hdev)
> > +{
> > +     switch (hdev->asic_type) {
> > +     case ASIC_GOYA:
> > +             sprintf(hdev->asic_name, "GOYA");
> > +             break;
> > +     default:
> > +             dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
> > +                     hdev->asic_type);
> > +             return -EINVAL;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * device_early_fini - finalize all that was done in device_early_fini
>
>                                                                     ^init
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + */
> > +static void device_early_fini(struct hl_device *hdev)
> > +{
> > +}
> > +
> > +/**
> > + * hl_device_suspend - initiate device suspend
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + * Puts the hw in the suspend state (all asics).
> > + * Returns 0 for success or an error on failure.
>
> Should be Return: or Returns: for kernel-doc to understand it.
>
> > + * Called at driver suspend.
>
> This probably should be marked as Context:
>
> > + */
> > +int hl_device_suspend(struct hl_device *hdev)
> > +{
> > +     pci_save_state(hdev->pdev);
> > +
> > +     /* Shut down the device */
> > +     pci_disable_device(hdev->pdev);
> > +     pci_set_power_state(hdev->pdev, PCI_D3hot);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * hl_device_resume - initiate device resume
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + * Bring the hw back to operating state (all asics).
> > + * Returns 0 for success or an error on failure.
> > + * Called at driver resume.
>
> Same comments as for the previous functions.
>
> > + */
> > +int hl_device_resume(struct hl_device *hdev)
> > +{
> > +     int rc;
> > +
> > +     pci_set_power_state(hdev->pdev, PCI_D0);
> > +     pci_restore_state(hdev->pdev);
> > +     rc = pci_enable_device(hdev->pdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to enable PCI device in resume\n");
> > +             return rc;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * hl_device_init - main initialization function for habanalabs device
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + * Allocate an id for the device, do early initialization and then call the
> > + * ASIC specific initialization functions. Finally, create the cdev and the
> > + * Linux device to expose it to the user
> > + */
> > +int hl_device_init(struct hl_device *hdev, struct class *hclass)
> > +{
> > +     int rc;
> > +
> > +     /* Create device */
> > +     rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
> > +
> > +     if (rc)
> > +             goto out_disabled;
> > +
> > +     /* Initialize ASIC function pointers and perform early init */
> > +     rc = device_early_init(hdev);
> > +     if (rc)
> > +             goto release_device;
> > +
> > +     dev_notice(hdev->dev,
> > +             "Successfully added device to habanalabs driver\n");
> > +
> > +     return 0;
> > +
> > +release_device:
> > +     device_destroy(hclass, hdev->dev->devt);
> > +     cdev_del(&hdev->cdev);
> > +out_disabled:
> > +     hdev->disabled = true;
> > +     if (hdev->pdev)
> > +             dev_err(&hdev->pdev->dev,
> > +                     "Failed to initialize hl%d. Device is NOT usable !!!\n",
> > +                     hdev->id);
> > +     else
> > +             pr_err("habanalabs: Failed to initialize hl%d. Device is NOT usable !!!\n",
> > +                     hdev->id);
>
> Maybe three exclamation marks would be too much?
>
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * hl_device_fini - main tear-down function for habanalabs device
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + * Destroy the device, call ASIC fini functions and release the id
> > + */
> > +void hl_device_fini(struct hl_device *hdev)
> > +{
> > +     dev_info(hdev->dev, "Removing device\n");
> > +
> > +     /* Mark device as disabled */
> > +     hdev->disabled = true;
> > +
> > +     device_early_fini(hdev);
> > +
> > +     /* Hide device from user */
> > +     device_destroy(hdev->dev->class, hdev->dev->devt);
> > +     cdev_del(&hdev->cdev);
> > +
> > +     pr_info("habanalabs: removed device successfully\n");
> > +}
> > +
> > +/**
> > + * hl_poll_timeout_memory - Periodically poll a host memory address
> > + *                              until it is not zero or a timeout occurs
> > + * @hdev: pointer to habanalabs device structure
> > + * @addr: Address to poll
> > + * @timeout_us: timeout in us
> > + * @val: Variable to read the value into
> > + *
> > + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> > + * case, the last read value at @addr is stored in @val. Must not
> > + * be called from atomic context if sleep_us or timeout_us are used.
> > + *
> > + * The function sleeps for 100us with timeout value of
> > + * timeout_us
> > + */
> > +int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr,
> > +                             u32 timeout_us, u32 *val)
> > +{
> > +     /*
> > +      * pReturnVal is defined as volatile because it points to HOST memory,
> > +      * which is being written to by the device. Therefore, we can't use
> > +      * locks to synchronize it and it is not a memory-mapped register space
> > +      */
> > +     volatile u32 *pReturnVal = (volatile u32 *) addr;
> > +     ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
> > +
> > +     might_sleep();
> > +
> > +     for (;;) {
> > +             *val = *pReturnVal;
> > +             if (*val)
> > +                     break;
> > +             if (ktime_compare(ktime_get(), timeout) > 0) {
> > +                     *val = *pReturnVal;
> > +                     break;
> > +             }
> > +             usleep_range((100 >> 2) + 1, 100);
> > +     }
> > +
> > +     return (*val ? 0 : -ETIMEDOUT);
> > +}
> > +
> > +/**
> > + * hl_poll_timeout_devicememory - Periodically poll a device memory address
> > + *                                until it is not zero or a timeout occurs
> > + * @hdev: pointer to habanalabs device structure
> > + * @addr: Device address to poll
> > + * @timeout_us: timeout in us
> > + * @val: Variable to read the value into
> > + *
> > + * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> > + * case, the last read value at @addr is stored in @val. Must not
> > + * be called from atomic context if sleep_us or timeout_us are used.
> > + *
> > + * The function sleeps for 100us with timeout value of
> > + * timeout_us
> > + */
> > +int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> > +                             u32 timeout_us, u32 *val)
> > +{
> > +     ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
> > +
> > +     might_sleep();
> > +
> > +     for (;;) {
> > +             *val = readl(addr);
> > +             if (*val)
> > +                     break;
> > +             if (ktime_compare(ktime_get(), timeout) > 0) {
> > +                     *val = readl(addr);
> > +                     break;
> > +             }
> > +             usleep_range((100 >> 2) + 1, 100);
> > +     }
> > +
> > +     return (*val ? 0 : -ETIMEDOUT);
> > +}
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > new file mode 100644
> > index 000000000000..7e1b088b677c
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -0,0 +1,149 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + */
> > +
> > +#ifndef HABANALABSP_H_
> > +#define HABANALABSP_H_
> > +
> > +#include "include/habanalabs_device_if.h"
> > +
> > +#include <linux/pci.h>
> > +#include <linux/types.h>
> > +#include <linux/cdev.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/iopoll.h>
> > +#include <linux/dma-fence.h>
> > +#include <linux/hashtable.h>
> > +#include <linux/hwmon.h>
> > +
> > +#define HL_NAME                              "habanalabs"
> > +
> > +struct hl_device;
> > +
> > +
> > +
> > +
> > +
> > +
>
> Too many blank lines, IMHO.
>
> > +/*
> > + * ASICs
> > + */
> > +
> > +/**
> > + * enum hl_asic_type - supported ASIC types.
> > + * @ASIC_AUTO_DETECT: ASIC type will be automatically set.
> > + * @ASIC_GOYA: Goya device.
> > + * @ASIC_LAST: last ASIC type.
> > + */
> > +enum hl_asic_type {
> > +     ASIC_AUTO_DETECT,
> > +     ASIC_GOYA,
> > +     ASIC_LAST
> > +};
> > +
> > +
> > +
> > +
> > +
> > +/*
> > + * FILE PRIVATE STRUCTURE
> > + */
> > +
> > +/**
> > + * struct hl_fpriv - process information stored in FD private data.
> > + * @hdev: habanalabs device structure.
> > + * @filp: pointer to the given file structure.
> > + * @taskpid: current process ID.
> > + * @refcount: number of related contexts.
> > + */
> > +struct hl_fpriv {
> > +     struct hl_device        *hdev;
> > +     struct file             *filp;
> > +     struct pid              *taskpid;
> > +     struct kref             refcount;
> > +};
> > +
> > +
> > +
> > +
> > +/*
> > + * DEVICES
> > + */
> > +
> > +/* Theoretical limit only. A single host can only contain up to 4 or 8 PCIe
> > + * x16 cards. In extereme cases, there are hosts that can accommodate 16 cards
> > + */
> > +#define HL_MAX_MINORS        256
> > +
> > +/**
> > + * struct hl_device - habanalabs device structure.
> > + * @pdev: pointer to PCI device, can be NULL in case of simulator device.
> > + * @cdev: related char device.
> > + * @dev: realted kernel basic device structure.
> > + * @asic_name: ASIC specific nmae.
> > + * @asic_type: ASIC specific type.
> > + * @major: habanalabs KMD major.
> > + * @id: device minor.
> > + * @disabled: is device disabled.
> > + */
> > +struct hl_device {
> > +     struct pci_dev                  *pdev;
> > +     struct cdev                     cdev;
> > +     struct device                   *dev;
> > +     char                            asic_name[16];
> > +     enum hl_asic_type               asic_type;
> > +     u32                             major;
> > +     u16                             id;
> > +     u8                              disabled;
> > +};
> > +
> > +/*
> > + * IOCTLs
> > + */
> > +
> > +/**
> > + * typedef hl_ioctl_t - typedef for ioctl function in the driver
> > + * @hpriv: pointer to the FD's private data, which contains state of
> > + *           user process
> > + * @data: pointer to the input/output arguments structure of the IOCTL
> > + *
> > + * Return: 0 for success, negative value for error
> > + */
> > +typedef int hl_ioctl_t(struct hl_fpriv *hpriv, void *data);
> > +
> > +/**
> > + * struct hl_ioctl_desc - describes an IOCTL entry of the driver.
> > + * @cmd: the IOCTL code as created by the kernel macros.
> > + * @func: pointer to the driver's function that should be called for this IOCTL.
> > + */
> > +struct hl_ioctl_desc {
> > +     unsigned int cmd;
> > +     hl_ioctl_t *func;
> > +};
> > +
> > +
> > +
> > +
> > +
> > +/*
> > + * Kernel module functions that can be accessed by entire module
> > + */
> > +
> > +int hl_device_open(struct inode *inode, struct file *filp);
> > +int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> > +             enum hl_asic_type asic_type, int minor);
> > +void destroy_hdev(struct hl_device *hdev);
> > +int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
> > +                             u32 *val);
> > +int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> > +                             u32 timeout_us, u32 *val);
> > +
> > +int hl_device_init(struct hl_device *hdev, struct class *hclass);
> > +void hl_device_fini(struct hl_device *hdev);
> > +int hl_device_suspend(struct hl_device *hdev);
> > +int hl_device_resume(struct hl_device *hdev);
> > +
> > +#endif /* HABANALABSP_H_ */
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > new file mode 100644
> > index 000000000000..15217975327b
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -0,0 +1,366 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> > + *
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#include <linux/device.h>
> > +#include <linux/module.h>
> > +#include <linux/init.h>
> > +#include <linux/kthread.h>
> > +
> > +#include <linux/fs.h>
> > +
> > +#define HL_DRIVER_AUTHOR     "HabanaLabs Kernel Driver Team"
> > +
> > +#define HL_DRIVER_DESC               "Driver for HabanaLabs's AI Accelerators"
> > +
> > +MODULE_AUTHOR(HL_DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(HL_DRIVER_DESC);
> > +MODULE_LICENSE("GPL v2");
> > +
> > +static int hl_major;
> > +static struct class *hl_class;
> > +DEFINE_IDR(hl_devs_idr);
> > +DEFINE_MUTEX(hl_devs_idr_lock);
> > +
> > +#define PCI_VENDOR_ID_HABANALABS     0x1da3
> > +
> > +#define PCI_IDS_GOYA                 0x0001
> > +
> > +static struct pci_device_id ids[] = {
> > +     { PCI_DEVICE(PCI_VENDOR_ID_HABANALABS, PCI_IDS_GOYA), },
> > +     { 0, }
> > +};
> > +MODULE_DEVICE_TABLE(pci, ids);
> > +
> > +/**
> > + * get_asic_type - translate device id to asic type
> > + *
> > + * @device: id of the PCI device
> > + * @asic_type: pointer that will be filled by the asic type
> > + *
> > + * Translate device id to asic type.
> > + * In case of unidentified device, return -1
> > + */
> > +static int get_asic_type(u16 device, enum hl_asic_type *asic_type)
>
> This can simply return the hl_asic_type, see also a comment in
> create_hdev().
>
> > +{
> > +     int rc = 0;
> > +
> > +     switch (device) {
> > +     case PCI_IDS_GOYA:
> > +             *asic_type = ASIC_GOYA;
> > +             break;
> > +     default:
> > +             *asic_type = rc = -1;
> > +             break;
> > +     }
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * hl_device_open - open function for habanalabs device
> > + *
> > + * @inode: pointer to inode structure
> > + * @filp: pointer to file structure
> > + *
> > + * Called when process opens an habanalabs device.
> > + */
> > +int hl_device_open(struct inode *inode, struct file *filp)
> > +{
> > +     struct hl_device *hdev;
> > +     struct hl_fpriv *hpriv;
> > +
> > +     mutex_lock(&hl_devs_idr_lock);
> > +     hdev = idr_find(&hl_devs_idr, iminor(inode));
> > +     mutex_unlock(&hl_devs_idr_lock);
> > +
> > +     if (!hdev) {
> > +             pr_err("habanalabs: Couldn't find device %d:%d\n",
> > +                     imajor(inode), iminor(inode));
> > +             return -ENXIO;
> > +     }
> > +
> > +     hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
> > +     if (!hpriv)
> > +             return -ENOMEM;
> > +
> > +     hpriv->hdev = hdev;
> > +     filp->private_data = hpriv;
> > +     hpriv->filp = filp;
> > +     kref_init(&hpriv->refcount);
> > +     nonseekable_open(inode, filp);
> > +
> > +     hpriv->taskpid = find_get_pid(current->pid);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * create_hdev - create habanalabs device instance
> > + *
> > + * @dev: will hold the pointer to the new habanalabs device structure
> > + * @pdev: pointer to the pci device
> > + * @asic_type: in case of simulator device, which device is it
> > + * @minor: in case of simulator device, the minor of the device
> > + *
> > + * Allocate memory for habanalabs device and initialize basic fields
> > + * Identify the ASIC type
> > + * Allocate ID (minor) for the device (only for real devices)
> > + */
> > +int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> > +             enum hl_asic_type asic_type, int minor)
> > +{
> > +     struct hl_device *hdev;
> > +     int rc;
> > +
> > +     *dev = NULL;
> > +
> > +     hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
> > +     if (!hdev) {
> > +             if (pdev)
> > +                     dev_err(&pdev->dev,
> > +                             "Not enough memory for habanalabs device\n");
> > +             else
> > +                     pr_err("habanalabs: Not enough memory for  device\n");
> > +
> > +             return -ENOMEM;
> > +     }
> > +
> > +     hdev->major = hl_major;
> > +
> > +     hdev->disabled = true;
> > +     hdev->pdev = pdev; /* can be NULL in case of simulator device */
> > +
> > +     if (asic_type == ASIC_AUTO_DETECT) {
> > +             rc = get_asic_type(pdev->device, &hdev->asic_type);
>
> You can just make it
>
>                 hdev->asic_type = get_asic_type(pdev->device);
>
> > +             if (rc) {
> > +                     dev_err(&pdev->dev, "Unsupported ASIC\n");
> > +                     rc = -ENODEV;
> > +                     goto free_hdev;
> > +             }
> > +     } else {
> > +             hdev->asic_type = asic_type;
> > +     }
>
> In the current version create_hdev() is always called with
> ASIC_AUTO_DETECT, what are the usecases for other types?
>
So I don't think I mentioned this, but we have a software simulator
that we wrote for our ASICs.
To support that, you can load the driver in simulation mode.
Most of the simulation code is in an asic-specific file
(goya_simulator.c) which I don't intend to upstream because:
1. It does really nasty things to make the simulator work and those
nasty things are totally un-upstreamable :)
2. We don't intend to open source the simulator, nor give it to
customers, so there is no need to upstream that code.

Having said that, there are very few places in the common code (I think
all of them are in habanalabs_drv.c) that contain code for simulation
mode.
One of those places is:
hdev->asic_type = asic_type;

Another place is the idr_replace below.

I hope that because we are talking about a couple of lines in the
entire driver, and because by themselves they are totally valid, I
could upstream them even if that path will never be taken.
If even those few lines are problematic, I will remove them, but it
will just make my life a bit harder.
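
For illustration only (this is not part of the patch): the question
further down about idr_replace() failing can be modelled in user space.
In the kernel, idr_replace() returns the old pointer on success and an
ERR_PTR() value such as ERR_PTR(-ENOENT) when the id is not present. The
sketch below assumes a fixed-size slot table in place of the IDR; every
demo_* name is hypothetical, and the ERR_PTR helpers are simplified
copies of the kernel ones.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space copies of the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical fixed-size table standing in for the driver's IDR. */
#define DEMO_MAX_MINORS 4
static void *demo_slots[DEMO_MAX_MINORS];

/*
 * Hypothetical stand-in for idr_replace(): returns the old pointer on
 * success, or ERR_PTR(-ENOENT) when the id is out of range or unused.
 */
static void *demo_replace(void *ptr, int id)
{
	void *old;

	if (id < 0 || id >= DEMO_MAX_MINORS || !demo_slots[id])
		return ERR_PTR(-ENOENT);

	old = demo_slots[id];
	demo_slots[id] = ptr;
	return old;
}

/* Returns the minor on success or a negative errno on failure. */
static int demo_claim_minor(void *hdev, int minor)
{
	void *old = demo_replace(hdev, minor);

	if (IS_ERR(old))
		return (int)PTR_ERR(old);

	return minor;
}
```

With a check like this, the existing "if (rc < 0)" test after the
unlock would also catch a failed replace, since rc would carry the
negative errno instead of the requested minor.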

Oded

> > +
> > +     mutex_lock(&hl_devs_idr_lock);
> > +
> > +     if (minor == -1) {
> > +             rc = idr_alloc(&hl_devs_idr, hdev, 0, HL_MAX_MINORS,
> > +                             GFP_KERNEL);
> > +     } else {
> > +             idr_replace(&hl_devs_idr, hdev, minor);
>
> idr_replace can fail, can't it?
>
> > +             rc = minor;
> > +     }
> > +
> > +     mutex_unlock(&hl_devs_idr_lock);
> > +
> > +     if (rc < 0) {
> > +             if (rc == -ENOSPC) {
> > +                     pr_err("habanalabs: too many devices in the system\n");
> > +                     rc = -EBUSY;
> > +             }
> > +             goto free_hdev;
> > +     }
> > +
> > +     hdev->id = rc;
> > +
> > +     *dev = hdev;
> > +
> > +     return 0;
> > +
> > +free_hdev:
> > +     kfree(hdev);
> > +     return rc;
> > +}
> > +
> > +/**
> > + * destroy_hdev - destroy habanalabs device instance
> > + *
> > + * @dev: pointer to the habanalabs device structure
> > + *
> > + */
> > +void destroy_hdev(struct hl_device *hdev)
> > +{
> > +     /* Remove device from the device list */
> > +     mutex_lock(&hl_devs_idr_lock);
> > +     idr_remove(&hl_devs_idr, hdev->id);
> > +     mutex_unlock(&hl_devs_idr_lock);
> > +
> > +     kfree(hdev);
> > +}
> > +
> > +static int hl_pmops_suspend(struct device *dev)
> > +{
> > +     struct pci_dev *pdev = to_pci_dev(dev);
> > +     struct hl_device *hdev = pci_get_drvdata(pdev);
> > +
> > +     pr_debug("habanalabs: Going to suspend PCI device\n");
> > +
> > +     if (!hdev) {
> > +             pr_err("habanalabs: device pointer is NULL in suspend\n");
> > +             return 0;
> > +     }
> > +
> > +     return hl_device_suspend(hdev);
> > +}
> > +
> > +static int hl_pmops_resume(struct device *dev)
> > +{
> > +     struct pci_dev *pdev = to_pci_dev(dev);
> > +     struct hl_device *hdev = pci_get_drvdata(pdev);
> > +
> > +     pr_debug("habanalabs: Going to resume PCI device\n");
> > +
> > +     if (!hdev) {
> > +             pr_err("habanalabs: device pointer is NULL in resume\n");
> > +             return 0;
> > +     }
> > +
> > +     return hl_device_resume(hdev);
> > +}
> > +
> > +/**
> > + * hl_pci_probe - probe PCI habanalabs devices
> > + *
> > + * @pdev: pointer to pci device
> > + * @id: pointer to pci device id structure
> > + *
> > + * Standard PCI probe function for habanalabs device.
> > + * Create a new habanalabs device and initialize it according to the
> > + * device's type
> > + */
> > +static int hl_pci_probe(struct pci_dev *pdev,
> > +                             const struct pci_device_id *id)
> > +{
> > +     struct hl_device *hdev;
> > +     int rc;
> > +
> > +     dev_info(&pdev->dev, HL_NAME
> > +              " device found [%04x:%04x] (rev %x)\n",
> > +              (int)pdev->vendor, (int)pdev->device, (int)pdev->revision);
> > +
> > +     rc = create_hdev(&hdev, pdev, ASIC_AUTO_DETECT, -1);
> > +     if (rc)
> > +             return rc;
> > +
> > +     pci_set_drvdata(pdev, hdev);
> > +
> > +     rc = hl_device_init(hdev, hl_class);
> > +     if (rc) {
> > +             dev_err(&pdev->dev, "Fatal error during habanalabs device init\n");
> > +             rc = -ENODEV;
> > +             goto disable_device;
> > +     }
> > +
> > +     return 0;
> > +
> > +disable_device:
> > +     pci_set_drvdata(pdev, NULL);
> > +     destroy_hdev(hdev);
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * hl_pci_remove - remove PCI habanalabs devices
> > + *
> > + * @pdev: pointer to pci device
> > + *
> > + * Standard PCI remove function for habanalabs device
> > + */
> > +static void hl_pci_remove(struct pci_dev *pdev)
> > +{
> > +     struct hl_device *hdev;
> > +
> > +     hdev = pci_get_drvdata(pdev);
> > +     if (!hdev)
> > +             return;
> > +
> > +     hl_device_fini(hdev);
> > +     pci_set_drvdata(pdev, NULL);
> > +
> > +     destroy_hdev(hdev);
> > +}
> > +
> > +static const struct dev_pm_ops hl_pm_ops = {
> > +     .suspend = hl_pmops_suspend,
> > +     .resume = hl_pmops_resume,
> > +};
> > +
> > +static struct pci_driver hl_pci_driver = {
> > +     .name = HL_NAME,
> > +     .id_table = ids,
> > +     .probe = hl_pci_probe,
> > +     .remove = hl_pci_remove,
> > +     .driver.pm = &hl_pm_ops,
> > +};
> > +
> > +/**
> > + * hl_init - Initialize the habanalabs kernel driver
> > + *
> > + */
> > +static int __init hl_init(void)
> > +{
> > +     int rc;
> > +     dev_t dev;
> > +
> > +     pr_info("habanalabs: loading driver\n");
> > +
> > +     rc = alloc_chrdev_region(&dev, 0, HL_MAX_MINORS, HL_NAME);
> > +     if (rc < 0) {
> > +             pr_err("habanalabs: unable to get major\n");
> > +             return rc;
> > +     }
> > +
> > +     hl_major = MAJOR(dev);
> > +
> > +     hl_class = class_create(THIS_MODULE, HL_NAME);
> > +     if (IS_ERR(hl_class)) {
> > +             pr_err("habanalabs: failed to allocate class\n");
> > +             rc = PTR_ERR(hl_class);
> > +             goto remove_major;
> > +     }
> > +
> > +     rc = pci_register_driver(&hl_pci_driver);
> > +     if (rc) {
> > +             pr_err("habanalabs: failed to register pci device\n");
> > +             goto remove_class;
> > +     }
> > +
> > +     pr_debug("habanalabs: driver loaded\n");
> > +
> > +     return 0;
> > +
> > +remove_class:
> > +     class_destroy(hl_class);
> > +remove_major:
> > +     unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
> > +     return rc;
> > +}
> > +
> > +/**
> > + * hl_exit - Release all resources of the habanalabs kernel driver
> > + *
> > + */
> > +static void __exit hl_exit(void)
> > +{
> > +     pci_unregister_driver(&hl_pci_driver);
> > +
> > +     class_destroy(hl_class);
> > +     unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
> > +
> > +     idr_destroy(&hl_devs_idr);
> > +
> > +     pr_debug("habanalabs: driver removed\n");
> > +}
> > +
> > +module_init(hl_init);
> > +module_exit(hl_exit);
> > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > new file mode 100644
> > index 000000000000..9dbb7077eabd
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + */
> > +
> > +#ifndef HABANALABS_DEVICE_IF_H
> > +#define HABANALABS_DEVICE_IF_H
> > +
> > +#include <linux/types.h>
> > +
> > +/*
> > + * PRIMARY QUEUE
> > + */
> > +
> > +struct hl_bd {
> > +     __u64   ptr;
> > +     __u32   len;
> > +     union {
> > +             struct {
> > +                     __u32   repeat:16;
> > +                     __u32   res1:8;
> > +                     __u32   repeat_valid:1;
> > +                     __u32   res2:7;
> > +             };
> > +             __u32   ctl;
> > +     };
> > +};
> > +
> > +#define HL_BD_SIZE                   sizeof(struct hl_bd)
> > +
> > +/*
> > + * BD_CTL_REPEAT_VALID tells the CP whether the repeat field in the BD CTL is
> > + * valid. 1 means the repeat field is valid, 0 means not-valid,
> > + * i.e. repeat == 1
> > + */
> > +#define BD_CTL_REPEAT_VALID_SHIFT    24
> > +#define BD_CTL_REPEAT_VALID_MASK     0x01000000
> > +
> > +#define BD_CTL_SHADOW_INDEX_SHIFT    0
> > +#define BD_CTL_SHADOW_INDEX_MASK     0x00000FFF
> > +
> > +/*
> > + * COMPLETION QUEUE
> > + */
> > +
> > +struct hl_cq_entry {
> > +     __u32   data;
> > +};
> > +
> > +#define HL_CQ_ENTRY_SIZE             sizeof(struct hl_cq_entry)
> > +
> > +#define CQ_ENTRY_READY_SHIFT                 31
> > +#define CQ_ENTRY_READY_MASK                  0x80000000
> > +
> > +#define CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT    30
> > +#define CQ_ENTRY_SHADOW_INDEX_VALID_MASK     0x40000000
> > +
> > +#define CQ_ENTRY_SHADOW_INDEX_SHIFT          BD_CTL_SHADOW_INDEX_SHIFT
> > +#define CQ_ENTRY_SHADOW_INDEX_MASK           BD_CTL_SHADOW_INDEX_MASK
> > +
> > +/*
> > + * EVENT QUEUE
> > + */
> > +
> > +struct hl_eq_header {
> > +     __u32 reserved;
> > +     union {
> > +             struct {
> > +                     __u32 ctx_id :10;
> > +                     __u32:6;
> > +                     __u32 opcode :10;
> > +                     __u32:5;
> > +                     __u32 ready :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +};
> > +
> > +struct hl_eq_entry {
> > +     struct hl_eq_header hdr;
> > +     __u64 data[7];
> > +};
> > +
> > +#define HL_EQ_ENTRY_SIZE             sizeof(struct hl_eq_entry)
> > +
> > +#define EQ_CTL_READY_SHIFT           31
> > +#define EQ_CTL_READY_MASK            0x80000000
> > +
> > +#define EQ_CTL_EVENT_TYPE_SHIFT              16
> > +#define EQ_CTL_EVENT_TYPE_MASK               0x03FF0000
> > +
> > +enum pq_init_status {
> > +     PQ_INIT_STATUS_NA = 0,
> > +     PQ_INIT_STATUS_READY_FOR_CP,
> > +     PQ_INIT_STATUS_READY_FOR_HOST
> > +};
> > +
> > +/*
> > + * ArmCP info
> > + */
> > +
> > +#define VERSION_MAX_LEN                      128
> > +#define ARMCP_MAX_SENSORS            128
> > +
> > +struct armcp_sensor {
> > +     __u32 type;
> > +     __u32 flags;
> > +};
> > +
> > +/* must be aligned to 4 bytes */
> > +struct armcp_info {
> > +     struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
> > +     __u8 kernel_version[VERSION_MAX_LEN];
> > +     __u32 reserved[3];
> > +     __u32 cpld_version;
> > +     __u32 infineon_version;
> > +     __u8 fuse_version[VERSION_MAX_LEN];
> > +     __u8 thermal_version[VERSION_MAX_LEN];
> > +     __u8 armcp_version[VERSION_MAX_LEN];
> > +     __u64 dram_size;
> > +};
> > +
> > +#endif /* HABANALABS_DEVICE_IF_H */
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23 12:55       ` Mike Rapoport
@ 2019-01-25 20:09         ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-25 20:09 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg KH, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Wed, Jan 23, 2019 at 2:55 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 01:40:04PM +0100, Greg KH wrote:
> > On Wed, Jan 23, 2019 at 02:28:05PM +0200, Mike Rapoport wrote:
> > > On Wed, Jan 23, 2019 at 02:00:43AM +0200, Oded Gabbay wrote:
> > > > +/**
> > > > + * hl_device_release - release function for habanalabs device
> > > > + *
> > > > + * @inode: pointer to inode structure
> > > > + * @filp: pointer to file structure
> > > > + *
> > > > + * Called when process closes an habanalabs device
> > > > + */
> > >
> > > > It's nice to see docs coming along with the code.
> > > I have some comments for the formatting.
> > >
> > > kernel-doc won't be happy about missing return value descriptions, and
> > > although they are sometimes redundant or too obvious their absence makes
> > > 'make V=1 htmldocs' really noisy.
> > >
> > > > In general, it would be nice if you could link the habanalabs driver
> > > > kernel-doc somewhere in Documentation/ and run 'make V=1 htmldocs'.
> > >
> > > > +static int hl_device_release(struct inode *inode, struct file *filp)
> >
> > There's no need for kerneldoc comments for static functions, as no one
> > can call them and they are not part of any api.
> >
> > So what would be better here is to just drop the /** line and use /*
>
> Maybe it'd make sense to use /* for most of the comments in this driver as
> there are kernel-doc formatting issues in non-static functions as well, I
> was just too lazy to go over all of them.
>
Hi Mike,
That's what I'm going to do for v2, except for the comments in
habanalabs.h, which we made sure are written according to kernel-doc.
I promise that we will go over the entire code and make everything
kernel-doc compatible (I'm opening a jira ticket now :) ) but it may
take a couple of weeks or more.

In any case, I fixed all of your other comments for this patch.
Thanks,
Oded
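
For illustration only: a minimal sketch of the two comment styles
discussed above. The first comment follows the kernel-doc layout (name
with parentheses, one @argument line per parameter, and a "Return:"
section, whose absence is what makes 'make V=1 htmldocs' noisy). The
second is a plain comment, as suggested for static functions. The
functions themselves (demo_rreg, demo_rreg_first) are hypothetical
user-space stand-ins, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/**
 * demo_rreg() - Read a 32-bit value from a register window.
 * @base: start of the (hypothetical) register window
 * @reg: register offset in bytes; must be a multiple of 4
 *
 * User-space model of an MMIO read helper. It indexes a plain array
 * instead of calling readl().
 *
 * Return: the 32-bit value stored at offset @reg.
 */
uint32_t demo_rreg(const uint32_t *base, uint32_t reg)
{
	assert(reg % 4 == 0);
	return base[reg / 4];
}

/*
 * Plain comment (note the single asterisk on the opening line), so
 * kernel-doc does not parse it: read the first register in the window.
 */
static uint32_t demo_rreg_first(const uint32_t *base)
{
	return demo_rreg(base, 0);
}
```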

> > thanks,
> >
> > greg k-h
> >
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 03/15] habanalabs: add basic Goya support
  2019-01-23 12:28   ` Mike Rapoport
@ 2019-01-25 20:32     ` Oded Gabbay
  2019-01-27  6:39       ` Mike Rapoport
  0 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-25 20:32 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org

On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:45AM +0200, Oded Gabbay wrote:
> > This patch adds a basic support for the Goya device. The code initializes
> > the device's PCI controller and PCI bars. It also initializes various S/W
> > structures and adds some basic helper functions.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/Makefile            |   5 +-
> >  drivers/misc/habanalabs/device.c            |  71 +++
> >  drivers/misc/habanalabs/goya/Makefile       |   3 +
> >  drivers/misc/habanalabs/goya/goya.c         | 633 ++++++++++++++++++++
> >  drivers/misc/habanalabs/goya/goyaP.h        | 125 ++++
> >  drivers/misc/habanalabs/habanalabs.h        | 131 ++++
> >  drivers/misc/habanalabs/habanalabs_drv.c    |   3 +
> >  drivers/misc/habanalabs/include/goya/goya.h | 115 ++++
> >  8 files changed, 1085 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/misc/habanalabs/goya/Makefile
> >  create mode 100644 drivers/misc/habanalabs/goya/goya.c
> >  create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
> >  create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
> >
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > index b41433a09e02..6f1ead69bd77 100644
> > --- a/drivers/misc/habanalabs/Makefile
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -4,4 +4,7 @@
> >
> >  obj-m        := habanalabs.o
> >
> > -habanalabs-y := habanalabs_drv.o device.o
> > \ No newline at end of file
> > +habanalabs-y := habanalabs_drv.o device.o
> > +
> > +include $(src)/goya/Makefile
> > +habanalabs-y += $(HL_GOYA_FILES)
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index 376b55eb73d4..a4276ef559b3 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -116,8 +116,11 @@ static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
> >   */
> >  static int device_early_init(struct hl_device *hdev)
> >  {
> > +     int rc;
> > +
> >       switch (hdev->asic_type) {
> >       case ASIC_GOYA:
> > +             goya_set_asic_funcs(hdev);
> >               sprintf(hdev->asic_name, "GOYA");
> >               break;
> >       default:
> > @@ -126,6 +129,10 @@ static int device_early_init(struct hl_device *hdev)
> >               return -EINVAL;
> >       }
> >
> > +     rc = hdev->asic_funcs->early_init(hdev);
> > +     if (rc)
> > +             return rc;
> > +
> >       return 0;
> >  }
> >
> > @@ -137,6 +144,10 @@ static int device_early_init(struct hl_device *hdev)
> >   */
> >  static void device_early_fini(struct hl_device *hdev)
> >  {
> > +
> > +     if (hdev->asic_funcs->early_fini)
> > +             hdev->asic_funcs->early_fini(hdev);
> > +
> >  }
> >
> >  /**
> > @@ -150,8 +161,15 @@ static void device_early_fini(struct hl_device *hdev)
> >   */
> >  int hl_device_suspend(struct hl_device *hdev)
> >  {
> > +     int rc;
> > +
> >       pci_save_state(hdev->pdev);
> >
> > +     rc = hdev->asic_funcs->suspend(hdev);
> > +     if (rc)
> > +             dev_err(hdev->dev,
> > +                     "Failed to disable PCI access of device CPU\n");
> > +
> >       /* Shut down the device */
> >       pci_disable_device(hdev->pdev);
> >       pci_set_power_state(hdev->pdev, PCI_D3hot);
> > @@ -181,6 +199,13 @@ int hl_device_resume(struct hl_device *hdev)
> >               return rc;
> >       }
> >
> > +     rc = hdev->asic_funcs->resume(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to enable PCI access from device CPU\n");
> > +             return rc;
> > +     }
> > +
> >       return 0;
> >  }
> >
> > @@ -208,11 +233,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >       if (rc)
> >               goto release_device;
> >
> > +     /*
> > +      * Start calling ASIC initialization. First S/W then H/W and finally
> > +      * late init
> > +      */
> > +     rc = hdev->asic_funcs->sw_init(hdev);
> > +     if (rc)
> > +             goto early_fini;
> > +
> >       dev_notice(hdev->dev,
> >               "Successfully added device to habanalabs driver\n");
> >
> >       return 0;
> >
> > +early_fini:
> > +     device_early_fini(hdev);
> >  release_device:
> >       device_destroy(hclass, hdev->dev->devt);
> >       cdev_del(&hdev->cdev);
> > @@ -243,6 +278,9 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Mark device as disabled */
> >       hdev->disabled = true;
> >
> > +     /* Call ASIC S/W finalize function */
> > +     hdev->asic_funcs->sw_fini(hdev);
> > +
> >       device_early_fini(hdev);
> >
> >       /* Hide device from user */
> > @@ -329,3 +367,36 @@ int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> >
> >       return (*val ? 0 : -ETIMEDOUT);
> >  }
> > +
> > +/*
> > + * MMIO register access helper functions.
> > + */
> > +
> > +/**
> > + * hl_rreg - Read an MMIO register
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + * @reg: MMIO register offset (in bytes)
> > + *
> > + * Returns the value of the MMIO register we are asked to read
> > + *
> > + */
> > +inline u32 hl_rreg(struct hl_device *hdev, u32 reg)
> > +{
> > +     return readl(hdev->rmmio + reg);
> > +}
> > +
> > +/**
> > + * hl_wreg - Write to an MMIO register
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + * @reg: MMIO register offset (in bytes)
> > + * @val: 32-bit value
> > + *
> > + * Writes the 32-bit value into the MMIO register
> > + *
> > + */
> > +inline void hl_wreg(struct hl_device *hdev, u32 reg, u32 val)
> > +{
> > +     writel(val, hdev->rmmio + reg);
> > +}
> > diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
> > new file mode 100644
> > index 000000000000..5ebf3d0d5794
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/goya/Makefile
> > @@ -0,0 +1,3 @@
> > +subdir-ccflags-y += -I$(src)
> > +
> > +HL_GOYA_FILES :=  goya/goya.o
> > \ No newline at end of file
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > new file mode 100644
> > index 000000000000..b2952296b890
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/goya/goya.c
> > @@ -0,0 +1,633 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "goyaP.h"
> > +#include "include/goya/asic_reg/goya_masks.h"
> > +
> > +#include <linux/fs.h>
> > +#include <linux/delay.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/sched.h>
> > +#include <linux/genalloc.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/kfifo.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/firmware.h>
> > +#include <linux/log2.h>
> > +#include <linux/hwmon.h>
> > +#include <linux/string.h>
> > +#include <linux/io.h>
> > +
> > +/*
> > + * GOYA security scheme:
> > + *
> > + * 1. Host is protected by:
> > + *        - Range registers (When MMU is enabled, DMA RR does NOT protect host)
> > + *        - MMU
> > + *
> > + * 2. DRAM is protected by:
> > + *        - Range registers (protect the first 512MB)
> > + *        - MMU (isolation between users)
> > + *
> > + * 3. Configuration is protected by:
> > + *        - Range registers
> > + *        - Protection bits
> > + *
> > + * When MMU is disabled:
> > + *
> > + * QMAN DMA: PQ, CQ, CP, DMA are secured.
> > + * PQ, CB and the data are on the host.
> > + *
> > + * QMAN TPC/MME:
> > + * PQ, CQ and CP are not secured.
> > + * PQ, CB and the data are on the SRAM/DRAM.
> > + *
> > + * Since QMAN DMA is secured, KMD is parsing the DMA CB:
> > + *     - KMD checks DMA pointer
> > + *     - WREG, MSG_PROT are not allowed.
> > + *     - MSG_LONG/SHORT are allowed.
> > + *
> > + * A read/write transaction by the QMAN to a protected area will succeed if
> > + * and only if the QMAN's CP is secured and MSG_PROT is used
> > + *
> > + *
> > + * When MMU is enabled:
> > + *
> > + * QMAN DMA: PQ, CQ and CP are secured.
> > + * MMU is set to bypass on the Secure props register of the QMAN.
> > + * The reasons we don't enable MMU for PQ, CQ and CP are:
> > + *     - PQ entry is in kernel address space and KMD doesn't map it.
> > + *     - CP writes to MSIX register and to kernel address space (completion
> > + *       queue).
> > + *
> > + * DMA is not secured but because CP is secured, KMD still needs to parse the
> > + * CB, but doesn't need to check the DMA addresses.
> > + *
> > + * For QMAN DMA 0, DMA is also secured because only KMD uses this DMA and KMD
> > + * doesn't map memory in MMU.
> > + *
> > + * QMAN TPC/MME: PQ, CQ and CP aren't secured (no change from MMU disabled mode)
> > + *
> > + * DMA RR does NOT protect host because DMA is not secured
> > + *
> > + */
> > +
> > +#define GOYA_MMU_REGS_NUM            61
> > +
> > +#define GOYA_DMA_POOL_BLK_SIZE               0x100           /* 256 bytes */
> > +
> > +#define GOYA_RESET_TIMEOUT_MSEC              500             /* 500ms */
> > +#define GOYA_PLDM_RESET_TIMEOUT_MSEC 20000           /* 20s */
> > +#define GOYA_RESET_WAIT_MSEC         1               /* 1ms */
> > +#define GOYA_CPU_RESET_WAIT_MSEC     100             /* 100ms */
> > +#define GOYA_PLDM_RESET_WAIT_MSEC    1000            /* 1s */
> > +#define GOYA_CPU_TIMEOUT_USEC                10000000        /* 10s */
> > +#define GOYA_TEST_QUEUE_WAIT_USEC    100000          /* 100ms */
> > +
> > +#define GOYA_QMAN0_FENCE_VAL         0xD169B243
> > +
> > +#define GOYA_MAX_INITIATORS          20
> > +
> > +static void goya_get_fixed_properties(struct hl_device *hdev)
> > +{
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +
> > +     prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
> > +
> > +     prop->dram_base_address = DRAM_PHYS_BASE;
> > +     prop->dram_size = DRAM_PHYS_DEFAULT_SIZE;
> > +     prop->dram_end_address = prop->dram_base_address + prop->dram_size;
> > +     prop->dram_user_base_address = DRAM_BASE_ADDR_USER;
> > +
> > +     prop->sram_base_address = SRAM_BASE_ADDR;
> > +     prop->sram_size = SRAM_SIZE;
> > +     prop->sram_end_address = prop->sram_base_address + prop->sram_size;
> > +     prop->sram_user_base_address = prop->sram_base_address +
> > +                                             SRAM_USER_BASE_OFFSET;
> > +
> > +     prop->host_phys_base_address = HOST_PHYS_BASE;
> > +     prop->va_space_host_start_address = VA_HOST_SPACE_START;
> > +     prop->va_space_host_end_address = VA_HOST_SPACE_END;
> > +     prop->va_space_dram_start_address = VA_DDR_SPACE_START;
> > +     prop->va_space_dram_end_address = VA_DDR_SPACE_END;
> > +     prop->cfg_size = CFG_SIZE;
> > +     prop->max_asid = MAX_ASID;
> > +     prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> > +
> > +     prop->high_pll = PLL_HIGH_DEFAULT;
> > +}
> > +
> > +/**
> > + * goya_pci_bars_map - Map PCI BARS of Goya device
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Request PCI regions and map them to kernel virtual addresses.
> > + * Returns 0 on success
> > + *
> > + */
> > +int goya_pci_bars_map(struct hl_device *hdev)
> > +{
> > +     struct pci_dev *pdev = hdev->pdev;
> > +     int rc;
>
> You could just init rc= -ENODEV here and avoid the hassle below.

But the next line assigns rc the return value of pci_request_regions...
I could do rc = -ENODEV before the calls to pci_ioremap_bar, but then if
this function changes in the future and gains another possible error,
it will seem strange.
I honestly prefer to write code in drivers as explicitly as possible,
even if that means a bit more code.
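
For illustration only: a user-space model of the explicit style
described above, where each failure site sets rc itself before jumping
to the matching unwind label. The demo_* names, the three-entry BAR
enum and the fail_bar field are hypothetical; nothing here touches real
PCI resources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { DEMO_BAR_CFG, DEMO_BAR_MSIX, DEMO_BAR_DDR, DEMO_BAR_COUNT };

struct demo_dev {
	bool regions_held;
	bool mapped[DEMO_BAR_COUNT];
	int fail_bar;	/* BAR whose mapping should fail, or -1 for none */
};

/* Hypothetical stand-in for pci_ioremap_bar(): false models a NULL return. */
static bool demo_map(struct demo_dev *d, int bar)
{
	if (bar == d->fail_bar)
		return false;

	d->mapped[bar] = true;
	return true;
}

static void demo_unmap(struct demo_dev *d, int bar)
{
	d->mapped[bar] = false;
}

static int demo_bars_map(struct demo_dev *d)
{
	int rc;

	/* Models pci_request_regions() succeeding. */
	d->regions_held = true;

	if (!demo_map(d, DEMO_BAR_CFG)) {
		rc = -ENODEV;
		goto err_release_regions;
	}

	if (!demo_map(d, DEMO_BAR_MSIX)) {
		rc = -ENODEV;
		goto err_unmap_cfg;
	}

	if (!demo_map(d, DEMO_BAR_DDR)) {
		rc = -ENODEV;
		goto err_unmap_msix;
	}

	return 0;

	/* Unwind in reverse order of acquisition. */
err_unmap_msix:
	demo_unmap(d, DEMO_BAR_MSIX);
err_unmap_cfg:
	demo_unmap(d, DEMO_BAR_CFG);
err_release_regions:
	d->regions_held = false;

	return rc;
}
```

Because rc is assigned next to each failing call, a new step with a
different error code can be added later without touching the others.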

> > +
> > +     rc = pci_request_regions(pdev, HL_NAME);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Cannot obtain PCI resources\n");
> > +             return rc;
> > +     }
> > +
> > +     hdev->pcie_bar[SRAM_CFG_BAR_ID] =
> > +                     pci_ioremap_bar(pdev, SRAM_CFG_BAR_ID);
> > +     if (!hdev->pcie_bar[SRAM_CFG_BAR_ID]) {
> > +             dev_err(hdev->dev, "pci_ioremap_bar failed for CFG\n");
> > +             rc = -ENODEV;
> > +             goto err_release_regions;
> > +     }
> > +
> > +     hdev->pcie_bar[MSIX_BAR_ID] = pci_ioremap_bar(pdev, MSIX_BAR_ID);
> > +     if (!hdev->pcie_bar[MSIX_BAR_ID]) {
> > +             dev_err(hdev->dev, "pci_ioremap_bar failed for MSIX\n");
> > +             rc = -ENODEV;
> > +             goto err_unmap_sram_cfg;
> > +     }
> > +
> > +     hdev->pcie_bar[DDR_BAR_ID] = pci_ioremap_wc_bar(pdev, DDR_BAR_ID);
> > +     if (!hdev->pcie_bar[DDR_BAR_ID]) {
> > +             dev_err(hdev->dev, "pci_ioremap_bar failed for DDR\n");
> > +             rc = -ENODEV;
> > +             goto err_unmap_msix;
> > +     }
> > +
> > +     hdev->rmmio = hdev->pcie_bar[SRAM_CFG_BAR_ID] +
> > +                             (CFG_BASE - SRAM_BASE_ADDR);
> > +
> > +     return 0;
> > +
> > +err_unmap_msix:
> > +     iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
> > +err_unmap_sram_cfg:
> > +     iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
> > +err_release_regions:
> > +     pci_release_regions(pdev);
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * goya_pci_bars_unmap - Unmap PCI BARS of Goya device
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Release all PCI BARS and unmap their virtual addresses
> > + *
> > + */
> > +static void goya_pci_bars_unmap(struct hl_device *hdev)
> > +{
> > +     struct pci_dev *pdev = hdev->pdev;
> > +
> > +     iounmap(hdev->pcie_bar[DDR_BAR_ID]);
> > +     iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
> > +     iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
> > +     pci_release_regions(pdev);
> > +}
> > +
> > +/**
> > + * goya_elbi_write - Write through the ELBI interface
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * return 0 on success, -1 on failure
> > + *
> > + */
> > +static int goya_elbi_write(struct hl_device *hdev, u64 addr, u32 data)
> > +{
> > +     struct pci_dev *pdev = hdev->pdev;
> > +     ktime_t timeout;
> > +     u32 val;
> > +
> > +     /* Clear previous status */
> > +     pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, 0);
> > +
> > +     pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_ADDR, (u32) addr);
> > +     pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_DATA, data);
> > +     pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_CTRL,
> > +                             PCI_CONFIG_ELBI_CTRL_WRITE);
> > +
> > +     timeout = ktime_add_ms(ktime_get(), 10);
> > +     for (;;) {
> > +             pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, &val);
> > +             if (val & PCI_CONFIG_ELBI_STS_MASK)
> > +                     break;
> > +             if (ktime_compare(ktime_get(), timeout) > 0) {
> > +                     pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS,
> > +                                             &val);
> > +                     break;
> > +             }
> > +             usleep_range(300, 500);
> > +     }
> > +
> > +     if ((val & PCI_CONFIG_ELBI_STS_MASK) == PCI_CONFIG_ELBI_STS_DONE)
> > +             return 0;
> > +
> > +     if (val & PCI_CONFIG_ELBI_STS_ERR) {
> > +             dev_err(hdev->dev, "Error writing to ELBI\n");
> > +             return -1;
>
> Please change -1 to an error code, say -EIO...
Of course, done.

>
> > +     }
> > +
> > +     if (!(val & PCI_CONFIG_ELBI_STS_MASK)) {
> > +             dev_err(hdev->dev, "ELBI write didn't finish in time\n");
> > +             return -1;
> > +     }
> > +
> > +     dev_err(hdev->dev, "ELBI write has undefined bits in status\n");
> > +     return -1;
> > +}
> > +
> > +/**
> > + * goya_iatu_write - iatu write routine
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static int goya_iatu_write(struct hl_device *hdev, u32 addr, u32 data)
> > +{
> > +     u32 dbi_offset;
> > +     int rc;
> > +
> > +     dbi_offset = addr & 0xFFF;
> > +
> > +     rc = goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0x00300000);
> > +     rc |= goya_elbi_write(hdev, mmPCIE_DBI_BASE + dbi_offset, data);
>
> hmm, error code in goya_elbi_write probably won't work...
> Any reason to try the second write if the first failed?
>
You are correct, it definitely won't work. But I didn't want to put an
if() after each call to that function - it happens a few more times in
the code.
And because the second write won't do any harm either, I thought this
is a more elegant solution to make the code more readable.
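
For illustration only: a user-space sketch of why OR-ing return values
stops being meaningful once the callee returns real error codes, plus
one possible alternative that still avoids an if() after every call.
As long as every failure is the same value (-1, or always -EIO) the OR
is harmless, but two different negative errnos combine into an
unrelated one; on Linux, -EIO | -ENODEV evaluates to -1, which reads as
-EPERM. The demo_* helpers are hypothetical and only echo the result
they are given.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a register write: returns the given result. */
static int demo_write(int result)
{
	return result;
}

/* Accumulation as in the quoted patch: every write runs, results are OR-ed. */
static int demo_or_accumulate(int first, int second)
{
	int rc;

	rc = demo_write(first);
	rc |= demo_write(second);

	return rc;
}

/* Record ret in *rc only if no earlier failure has been recorded. */
static void demo_note_error(int *rc, int ret)
{
	if (!*rc)
		*rc = ret;
}

/* Alternative: every write still runs, but the first failure is preserved. */
static int demo_first_error(int first, int second)
{
	int rc = 0;

	demo_note_error(&rc, demo_write(first));
	demo_note_error(&rc, demo_write(second));

	return rc;
}
```

Both variants keep the call sites as flat as the OR chain; the second
one just reports a code that some call actually returned.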

> > +
> > +     return rc;
> > +}
> > +
> > +void goya_reset_link_through_bridge(struct hl_device *hdev)
> > +{
> > +     struct pci_dev *pdev = hdev->pdev;
> > +     struct pci_dev *parent_port;
> > +     u16 val;
> > +
> > +     parent_port = pdev->bus->self;
> > +     pci_read_config_word(parent_port, PCI_BRIDGE_CONTROL, &val);
> > +     val |= PCI_BRIDGE_CTL_BUS_RESET;
> > +     pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
> > +     ssleep(1);
> > +
> > +     val &= ~(PCI_BRIDGE_CTL_BUS_RESET);
> > +     pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
> > +     ssleep(3);
> > +}
> > +
> > +/**
> > + * goya_set_ddr_bar_base - set DDR bar to map specific device address
> > + *
> > + * @hdev: pointer to hl_device structure
> > + * @addr: address in DDR. Must be aligned to DDR bar size
> > + *
> > + * This function configures the iATU so that the DDR bar will start at the
> > + * specified addr.
> > + *
> > + */
> > +static int goya_set_ddr_bar_base(struct hl_device *hdev, u64 addr)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int rc;
> > +
> > +     if ((goya) && (goya->ddr_bar_cur_addr == addr))
> > +             return 0;
> > +
> > +     /* Inbound Region 1 - Bar 4 - Point to DDR */
> > +     rc = goya_iatu_write(hdev, 0x314, lower_32_bits(addr));
> > +     rc |= goya_iatu_write(hdev, 0x318, upper_32_bits(addr));
> > +     rc |= goya_iatu_write(hdev, 0x300, 0);
> > +     /* Enable + Bar match + match enable + Bar 4 */
> > +     rc |= goya_iatu_write(hdev, 0x304, 0xC0080400);
> > +
> > +     /* Return the DBI window to the default location */
> > +     rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
> > +     rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
>
> And here as well.
Same remark as the previous one

> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to map DDR bar to 0x%08llx\n", addr);
> > +             return rc;
> > +     }
>
> I believe that at least here you'd want to return an error code.
Fixed
>
> > +
> > +     if (goya)
> > +             goya->ddr_bar_cur_addr = addr;
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * goya_init_iatu - Initialize the iATU unit inside the PCI controller
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * This is needed in case the firmware doesn't initialize the iATU
> > + *
> > + */
> > +static int goya_init_iatu(struct hl_device *hdev)
> > +{
> > +     int rc;
> > +
> > +     /* Inbound Region 0 - Bar 0 - Point to SRAM_BASE_ADDR */
> > +     rc  = goya_iatu_write(hdev, 0x114, lower_32_bits(SRAM_BASE_ADDR));
> > +     rc |= goya_iatu_write(hdev, 0x118, upper_32_bits(SRAM_BASE_ADDR));
> > +     rc |= goya_iatu_write(hdev, 0x100, 0);
> > +     /* Enable + Bar match + match enable */
> > +     rc |= goya_iatu_write(hdev, 0x104, 0xC0080000);
> > +
> > +     /* Inbound Region 1 - Bar 4 - Point to DDR */
> > +     rc |= goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
> > +
> > +     /* Outbound Region 0 - Point to Host */
> > +     rc |= goya_iatu_write(hdev, 0x008, lower_32_bits(HOST_PHYS_BASE));
> > +     rc |= goya_iatu_write(hdev, 0x00C, upper_32_bits(HOST_PHYS_BASE));
> > +     rc |= goya_iatu_write(hdev, 0x010,
> > +             lower_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
> > +     rc |= goya_iatu_write(hdev, 0x014, 0);
> > +     rc |= goya_iatu_write(hdev, 0x018, 0);
> > +     rc |= goya_iatu_write(hdev, 0x020,
> > +             upper_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
> > +     /* Increase region size */
> > +     rc |= goya_iatu_write(hdev, 0x000, 0x00002000);
> > +     /* Enable */
> > +     rc |= goya_iatu_write(hdev, 0x004, 0x80000000);
> > +
> > +     /* Return the DBI window to the default location */
> > +     rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
> > +     rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
> > +
> > +     return rc;
>
> Ditto
Fixed
>
> > +}
> > +
> > +/**
> > + * goya_early_init - GOYA early initialization code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Verify PCI bars
> > + * Set DMA masks
> > + * PCI controller initialization
> > + * Map PCI bars
> > + *
> > + */
> > +static int goya_early_init(struct hl_device *hdev)
> > +{
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +     struct pci_dev *pdev = hdev->pdev;
> > +     u32 val;
> > +     int rc;
> > +
> > +     goya_get_fixed_properties(hdev);
> > +
> > +     /* Check BAR sizes */
> > +     if (pci_resource_len(pdev, SRAM_CFG_BAR_ID) != CFG_BAR_SIZE) {
> > +             dev_err(hdev->dev,
> > +                     "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> > +                     SRAM_CFG_BAR_ID,
> > +                     pci_resource_len(pdev, SRAM_CFG_BAR_ID),
> > +                     CFG_BAR_SIZE);
> > +             return -ENODEV;
> > +     }
> > +
> > +     if (pci_resource_len(pdev, MSIX_BAR_ID) != MSIX_BAR_SIZE) {
> > +             dev_err(hdev->dev,
> > +                     "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> > +                     MSIX_BAR_ID, pci_resource_len(pdev, MSIX_BAR_ID),
> > +                     MSIX_BAR_SIZE);
> > +             return -ENODEV;
> > +     }
> > +
> > +     prop->dram_pci_bar_size = pci_resource_len(pdev, DDR_BAR_ID);
> > +
> > +     /* set DMA mask for GOYA */
> > +     rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(39));
> > +     if (rc) {
> > +             dev_warn(hdev->dev, "Unable to set pci dma mask to 39 bits\n");
> > +             rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "Unable to set pci dma mask to 32 bits\n");
> > +                     return rc;
> > +             }
> > +     }
> > +
> > +     rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(39));
> > +     if (rc) {
> > +             dev_warn(hdev->dev,
> > +                     "Unable to set pci consistent dma mask to 39 bits\n");
> > +             rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "Unable to set pci consistent dma mask to 32 bits\n");
> > +                     return rc;
> > +             }
> > +     }
> > +
> > +     if (hdev->reset_pcilink)
> > +             goya_reset_link_through_bridge(hdev);
> > +
> > +     rc = pci_enable_device_mem(pdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "can't enable PCI device\n");
> > +             return rc;
> > +     }
> > +
> > +     pci_set_master(pdev);
> > +
> > +     rc = goya_init_iatu(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to initialize iATU\n");
> > +             goto disable_device;
> > +     }
> > +
> > +     rc = goya_pci_bars_map(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to initialize PCI BARS\n");
> > +             goto disable_device;
> > +     }
> > +
> > +     val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
> > +     if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
> > +             dev_warn(hdev->dev,
> > +                     "PCI strap is not configured correctly, PCI bus errors may occur\n");
> > +
> > +     return 0;
> > +
> > +disable_device:
> > +     pci_clear_master(pdev);
> > +     pci_disable_device(pdev);
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * goya_early_fini - GOYA early finalization code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Unmap PCI bars
> > + *
> > + */
> > +int goya_early_fini(struct hl_device *hdev)
> > +{
> > +     goya_pci_bars_unmap(hdev);
> > +
> > +     pci_clear_master(hdev->pdev);
> > +     pci_disable_device(hdev->pdev);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * goya_sw_init - Goya software initialization code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static int goya_sw_init(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya;
> > +     int rc;
> > +
> > +     /* Allocate device structure */
> > +     goya = kzalloc(sizeof(*goya), GFP_KERNEL);
>
> Consider using devm_k[mz]alloc() for memory allocations throughout the
> driver. I didn't check all the spots where it can be applicable.
I honestly wasn't aware of that. We never used that in AMD drivers
(which is where I spent most of my kernel time).
I'll look into that offline, but for now I don't really want to change
to it blindly in all locations, unless there is some hard kernel
rule for using that in drivers.

>
> > +     if (!goya)
> > +             return -ENOMEM;
> > +
> > +     /* according to goya_init_iatu */
> > +     goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> > +     hdev->asic_specific = goya;
> > +
> > +     /* Create DMA pool for small allocations */
> > +     hdev->dma_pool = dma_pool_create(dev_name(hdev->dev),
> > +                     &hdev->pdev->dev, GOYA_DMA_POOL_BLK_SIZE, 8, 0);
> > +     if (!hdev->dma_pool) {
> > +             dev_err(hdev->dev, "failed to create DMA pool\n");
> > +             rc = -ENOMEM;
> > +             goto free_goya_device;
> > +     }
> > +
> > +     hdev->cpu_accessible_dma_mem =
> > +                     hdev->asic_funcs->dma_alloc_coherent(hdev,
> > +                                     CPU_ACCESSIBLE_MEM_SIZE,
> > +                                     &hdev->cpu_accessible_dma_address,
> > +                                     GFP_KERNEL | __GFP_ZERO);
> > +
> > +     if (!hdev->cpu_accessible_dma_mem) {
> > +             dev_err(hdev->dev,
> > +                     "failed to allocate %d of dma memory for CPU accessible memory space\n",
> > +                     CPU_ACCESSIBLE_MEM_SIZE);
> > +             rc = -ENOMEM;
> > +             goto free_dma_pool;
> > +     }
> > +
> > +     hdev->cpu_accessible_dma_pool = gen_pool_create(CPU_PKT_SHIFT, -1);
> > +     if (!hdev->cpu_accessible_dma_pool) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to create CPU accessible DMA pool\n");
> > +             rc = -ENOMEM;
>
> You could init rc = -ENOMEM at the beginning and save the duplication.
Again, I don't agree with that programming paradigm. If I do that, and
later add code at the beginning of the function along the lines of:
rc = foo()
I will introduce a bug.
So I prefer the duplication, which makes the code more robust to future
changes.
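
A minimal userspace sketch of that failure mode, assuming hypothetical
names (foo, init_preset_rc, init_explicit_rc) that are not part of the
driver. The first function relies on rc being pre-set to -ENOMEM; the
second sets rc on each error path:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static char storage[16];

/* Hypothetical helper that succeeds and returns 0. */
static int foo(void)
{
	return 0;
}

/* rc is initialised once; the later "rc = foo()" silently resets it. */
static int init_preset_rc(int fail_alloc)
{
	int rc = -ENOMEM;
	void *buf;

	rc = foo();	/* added later: rc is now 0 */
	if (rc)
		return rc;

	buf = fail_alloc ? NULL : storage;
	if (!buf)
		goto out;	/* bug: returns 0 although allocation failed */

	rc = 0;
out:
	return rc;
}

/* rc is set explicitly on every error path. */
static int init_explicit_rc(int fail_alloc)
{
	int rc;
	void *buf;

	rc = foo();
	if (rc)
		return rc;

	buf = fail_alloc ? NULL : storage;
	if (!buf) {
		rc = -ENOMEM;
		goto out;
	}

	rc = 0;
out:
	return rc;
}
```

With a simulated allocation failure, init_preset_rc(1) reports success
while init_explicit_rc(1) returns -ENOMEM.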

>
> > +             goto free_cpu_pq_dma_mem;
> > +     }
> > +
> > +     rc = gen_pool_add(hdev->cpu_accessible_dma_pool,
> > +                             (u64) hdev->cpu_accessible_dma_mem,
> > +                             CPU_ACCESSIBLE_MEM_SIZE, -1);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to add memory to CPU accessible DMA pool\n");
> > +             rc = -EFAULT;
> > +             goto free_cpu_pq_pool;
> > +     }
> > +
> > +     spin_lock_init(&goya->hw_queues_lock);
> > +
> > +     return 0;
> > +
> > +free_cpu_pq_pool:
> > +     gen_pool_destroy(hdev->cpu_accessible_dma_pool);
> > +free_cpu_pq_dma_mem:
> > +     hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
> > +                     hdev->cpu_accessible_dma_mem,
> > +                     hdev->cpu_accessible_dma_address);
> > +free_dma_pool:
> > +     dma_pool_destroy(hdev->dma_pool);
> > +free_goya_device:
> > +     kfree(goya);
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * goya_sw_fini - Goya software tear-down code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +int goya_sw_fini(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     gen_pool_destroy(hdev->cpu_accessible_dma_pool);
> > +
> > +     hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
> > +                     hdev->cpu_accessible_dma_mem,
> > +                     hdev->cpu_accessible_dma_address);
> > +
> > +     dma_pool_destroy(hdev->dma_pool);
> > +
> > +     kfree(goya);
> > +
> > +     return 0;
> > +}
> > +
> > +int goya_suspend(struct hl_device *hdev)
> > +{
> > +     return 0;
> > +}
> > +
> > +int goya_resume(struct hl_device *hdev)
> > +{
> > +     return 0;
> > +}
> > +
> > +void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
> > +                                     dma_addr_t *dma_handle, gfp_t flags)
> > +{
> > +     return dma_alloc_coherent(&hdev->pdev->dev, size, dma_handle, flags);
> > +}
> > +
> > +void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
> > +                             dma_addr_t dma_handle)
> > +{
> > +     dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
> > +}
> > +
> > +static const struct hl_asic_funcs goya_funcs = {
> > +     .early_init = goya_early_init,
> > +     .early_fini = goya_early_fini,
> > +     .sw_init = goya_sw_init,
> > +     .sw_fini = goya_sw_fini,
> > +     .suspend = goya_suspend,
> > +     .resume = goya_resume,
> > +     .dma_alloc_coherent = goya_dma_alloc_coherent,
> > +     .dma_free_coherent = goya_dma_free_coherent,
>
> Is there any additional functionality that is planned in goya or gaudi in
> these two functions?
> It seems like they are not really needed, at least at the moment and for
> sure that don't need to be part of ASIC ops.

This relates to simulator support: there the implementation of these
two functions is totally different, because I don't have a PCI device.
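
Roughly, the idea is that common code only calls through the function
pointers and each backend decides how to allocate. The following is a
userspace sketch under that assumption; struct dev, struct dev_ops, and
both backends are hypothetical, and the simulator code itself is not
part of this patch-set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct dev;

/* Per-backend operations, loosely modelled on struct hl_asic_funcs. */
struct dev_ops {
	void *(*dma_alloc)(struct dev *d, size_t size);
	void (*dma_free)(struct dev *d, void *cpu_addr);
};

struct dev {
	const struct dev_ops *ops;
	int has_pci;		/* 0 for a simulator device */
	int live_allocs;	/* bookkeeping for this demo only */
};

/* Real-hardware backend; a driver would call dma_alloc_coherent() here. */
static void *pci_dma_alloc(struct dev *d, size_t size)
{
	void *p;

	if (!d->has_pci)
		return NULL;	/* no PCI device to allocate against */

	p = calloc(1, size);
	if (p)
		d->live_allocs++;
	return p;
}

/* Simulator backend: plain host memory, no PCI device required. */
static void *sim_dma_alloc(struct dev *d, size_t size)
{
	void *p = calloc(1, size);

	if (p)
		d->live_allocs++;
	return p;
}

/* Both backends release memory the same way in this demo. */
static void counted_free(struct dev *d, void *cpu_addr)
{
	if (!cpu_addr)
		return;
	free(cpu_addr);
	d->live_allocs--;
}

static const struct dev_ops pci_ops = { pci_dma_alloc, counted_free };
static const struct dev_ops sim_ops = { sim_dma_alloc, counted_free };

/* Common code never needs to know which backend it is talking to. */
static void *dev_alloc(struct dev *d, size_t size)
{
	return d->ops->dma_alloc(d, size);
}

static void dev_free(struct dev *d, void *cpu_addr)
{
	d->ops->dma_free(d, cpu_addr);
}
```

A device set up with sim_ops can allocate without a PCI device, whereas
pci_ops on the same device returns NULL.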

>
> > +};
> > +
> > +/**
> > + * goya_set_asic_funcs - set Goya function pointers
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +void goya_set_asic_funcs(struct hl_device *hdev)
> > +{
> > +     hdev->asic_funcs = &goya_funcs;
> > +}
> > diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> > new file mode 100644
> > index 000000000000..0e12c56472bd
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/goya/goyaP.h
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + */
> > +
> > +#ifndef GOYAP_H_
> > +#define GOYAP_H_
> > +
> > +#include "habanalabs.h"
> > +#include "include/goya/goya.h"
> > +
> > +#define NUMBER_OF_CMPLT_QUEUES               5
> > +#define NUMBER_OF_EXT_HW_QUEUES              5
> > +#define NUMBER_OF_CPU_HW_QUEUES              1
> > +#define NUMBER_OF_INT_HW_QUEUES              9
> > +#define NUMBER_OF_HW_QUEUES          (NUMBER_OF_EXT_HW_QUEUES + \
> > +                                     NUMBER_OF_CPU_HW_QUEUES + \
> > +                                     NUMBER_OF_INT_HW_QUEUES)
> > +
> > +/*
> > + * Number of MSIX interrupts IDS:
> > + * Each completion queue has 1 ID
> > + * The event queue has 1 ID
> > + * ArmCP reset has 1 ID
> > + */
> > +#define NUMBER_OF_INTERRUPTS         (NUMBER_OF_CMPLT_QUEUES + 2)
> > +
> > +#if (NUMBER_OF_HW_QUEUES >= HL_MAX_QUEUES)
> > +#error "Number of H/W queues must be smaller than HL_MAX_QUEUES"
> > +#endif
> > +
> > +#if (NUMBER_OF_INTERRUPTS > GOYA_MSIX_ENTRIES)
> > +#error "Number of MSIX interrupts must be smaller or equal to GOYA_MSIX_ENTRIES"
> > +#endif
> > +
> > +#define QMAN_FENCE_TIMEOUT_USEC              10000   /* 10 ms */
> > +
> > +#define QMAN_STOP_TIMEOUT_USEC               100000  /* 100 ms */
> > +
> > +#define TPC_MAX_NUM                  8
> > +#define TPC_ENABLED_MASK             0xFF
> > +
> > +#define DMA_MAX_NUM                  5
> > +
> > +#define PLL_HIGH_DEFAULT             1575000000      /* 1.575 GHz */
> > +
> > +#define GOYA_ARMCP_INFO_TIMEOUT              10000000        /* 10s */
> > +
> > +#define DRAM_PHYS_DEFAULT_SIZE               0x100000000ull  /* 4GB */
> > +
> > +/*
> > + * SRAM Memory Map for KMD
> > + *
> > + * KMD occupies KMD_SRAM_SIZE bytes from the start of SRAM. It is used for
> > + * MME/TPC QMANs
> > + *
> > + */
> > +
> > +#define MME_QMAN_BASE_OFFSET 0x000000        /* Must be 0 */
> > +#define MME_QMAN_LENGTH              64
> > +#define TPC_QMAN_LENGTH              64
> > +
> > +#define TPC0_QMAN_BASE_OFFSET        (MME_QMAN_BASE_OFFSET + \
> > +                             (MME_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC1_QMAN_BASE_OFFSET        (TPC0_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC2_QMAN_BASE_OFFSET        (TPC1_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC3_QMAN_BASE_OFFSET        (TPC2_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC4_QMAN_BASE_OFFSET        (TPC3_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC5_QMAN_BASE_OFFSET        (TPC4_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC6_QMAN_BASE_OFFSET        (TPC5_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +#define TPC7_QMAN_BASE_OFFSET        (TPC6_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +
> > +#define SRAM_KMD_RES_OFFSET  (TPC7_QMAN_BASE_OFFSET + \
> > +                             (TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
> > +
> > +#if (SRAM_KMD_RES_OFFSET >= KMD_SRAM_RESERVED_SIZE)
> > +#error "MME/TPC QMANs SRAM space exceeds limit"
> > +#endif
> > +
> > +#define SRAM_USER_BASE_OFFSET        KMD_SRAM_RESERVED_SIZE
> > +
> > +#define DMA_MAX_TRANSFER_SIZE        0xFFFFFFFF
> > +
> > +#define HW_CAP_PLL           0x00000001
> > +#define HW_CAP_DDR_0         0x00000002
> > +#define HW_CAP_DDR_1         0x00000004
> > +#define HW_CAP_MME           0x00000008
> > +#define HW_CAP_CPU           0x00000010
> > +#define HW_CAP_DMA           0x00000020
> > +#define HW_CAP_MSIX          0x00000040
> > +#define HW_CAP_CPU_Q         0x00000080
> > +#define HW_CAP_MMU           0x00000100
> > +#define HW_CAP_TPC_MBIST     0x00000200
> > +#define HW_CAP_GOLDEN                0x00000400
> > +#define HW_CAP_TPC           0x00000800
> > +
> > +#define CPU_PKT_SHIFT                5
> > +#define CPU_PKT_SIZE         (1 << CPU_PKT_SHIFT)
> > +#define CPU_PKT_MASK         (~((1 << CPU_PKT_SHIFT) - 1))
> > +#define CPU_MAX_PKTS_IN_CB   32
> > +#define CPU_CB_SIZE          (CPU_PKT_SIZE * CPU_MAX_PKTS_IN_CB)
> > +#define CPU_ACCESSIBLE_MEM_SIZE      (HL_QUEUE_LENGTH * CPU_CB_SIZE)
> > +
> > +enum goya_fw_component {
> > +     FW_COMP_UBOOT,
> > +     FW_COMP_PREBOOT
> > +};
> > +
> > +struct goya_device {
> > +     /* TODO: remove hw_queues_lock after moving to scheduler code */
> > +     spinlock_t      hw_queues_lock;
> > +     u64             ddr_bar_cur_addr;
> > +     u32             hw_cap_initialized;
> > +};
> > +
> > +#endif /* GOYAP_H_ */
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index 7e1b088b677c..97844825f7a8 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -21,11 +21,64 @@
> >
> >  #define HL_NAME                              "habanalabs"
> >
> > +#define HL_MAX_QUEUES                        128
> > +
> >  struct hl_device;
> >
> >
> >
> >
> > +/**
> > + * struct asic_fixed_properties - ASIC specific immutable properties.
> > + * @sram_base_address: SRAM physical start address.
> > + * @sram_end_address: SRAM physical end address.
> > + * @sram_user_base_address: SRAM physical start address for user access.
> > + * @dram_base_address: DRAM physical start address.
> > + * @dram_end_address: DRAM physical end address.
> > + * @dram_user_base_address: DRAM physical start address for user access.
> > + * @dram_size: DRAM total size.
> > + * @dram_pci_bar_size: size of PCI bar towards DRAM.
> > + * @host_phys_base_address: base physical address of host memory for
> > + *                           transactions that the device generates.
> > + * @va_space_host_start_address: base address of virtual memory range for
> > + *                               mapping host memory.
> > + * @va_space_host_end_address: end address of virtual memory range for
> > + *                             mapping host memory.
> > + * @va_space_dram_start_address: base address of virtual memory range for
> > + *                               mapping DRAM memory.
> > + * @va_space_dram_end_address: end address of virtual memory range for
> > + *                             mapping DRAM memory.
> > + * @cfg_size: configuration space size on SRAM.
> > + * @sram_size: total size of SRAM.
> > + * @max_asid: maximum number of open contexts (ASIDs).
> > + * @completion_queues_count: number of completion queues.
> > + * @high_pll: high PLL frequency used by the device.
> > + * @tpc_enabled_mask: which TPCs are enabled.
> > + */
> > +struct asic_fixed_properties {
> > +     u64                     sram_base_address;
> > +     u64                     sram_end_address;
> > +     u64                     sram_user_base_address;
> > +     u64                     dram_base_address;
> > +     u64                     dram_end_address;
> > +     u64                     dram_user_base_address;
> > +     u64                     dram_size;
> > +     u64                     dram_pci_bar_size;
> > +     u64                     host_phys_base_address;
> > +     u64                     va_space_host_start_address;
> > +     u64                     va_space_host_end_address;
> > +     u64                     va_space_dram_start_address;
> > +     u64                     va_space_dram_end_address;
> > +     u32                     cfg_size;
> > +     u32                     sram_size;
> > +     u32                     max_asid;
> > +     u32                     high_pll;
> > +     u8                      completion_queues_count;
> > +     u8                      tpc_enabled_mask;
> > +};
> > +
> > +
> > +#define HL_QUEUE_LENGTH                      256
> >
> >
> >  /*
> > @@ -47,6 +100,30 @@ enum hl_asic_type {
> >
> >
> >
> > +/**
> > + * struct hl_asic_funcs - ASIC specific functions that can be called from
> > + *                        common code.
> > + * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
> > + * @early_fini: tears down what was done in early_init.
> > + * @sw_init: sets up driver state, does not configure H/W.
> > + * @sw_fini: tears down driver state, does not configure H/W.
> > + * @suspend: handles IP specific H/W or SW changes for suspend.
> > + * @resume: handles IP specific H/W or SW changes for resume.
> > + * @dma_alloc_coherent: DMA allocate coherent memory.
> > + * @dma_free_coherent: free DMA allocation.
> > + */
> > +struct hl_asic_funcs {
> > +     int (*early_init)(struct hl_device *hdev);
> > +     int (*early_fini)(struct hl_device *hdev);
> > +     int (*sw_init)(struct hl_device *hdev);
> > +     int (*sw_fini)(struct hl_device *hdev);
> > +     int (*suspend)(struct hl_device *hdev);
> > +     int (*resume)(struct hl_device *hdev);
> > +     void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
> > +                                     dma_addr_t *dma_handle, gfp_t flag);
> > +     void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
> > +                                     void *cpu_addr, dma_addr_t dma_handle);
> > +};
> >
> >  /*
> >   * FILE PRIVATE STRUCTURE
> > @@ -78,26 +155,78 @@ struct hl_fpriv {
> >   */
> >  #define HL_MAX_MINORS        256
> >
> > +/*
> > + * Registers read & write functions.
> > + */
> > +
> > +u32 hl_rreg(struct hl_device *hdev, u32 reg);
> > +void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> > +
> > +#define hl_poll_timeout(hdev, addr, val, cond, sleep_us, timeout_us) \
> > +     readl_poll_timeout(hdev->rmmio + addr, val, cond, sleep_us, timeout_us)
> > +
> > +#define RREG32(reg) hl_rreg(hdev, (reg))
> > +#define WREG32(reg, v) hl_wreg(hdev, (reg), (v))
> > +#define DREG32(reg) pr_info("REGISTER: " #reg " : 0x%08X\n", \
> > +                             hl_rreg(hdev, (reg)))
> > +
> > +#define WREG32_P(reg, val, mask)                             \
> > +     do {                                                    \
> > +             u32 tmp_ = RREG32(reg);                         \
> > +             tmp_ &= (mask);                                 \
> > +             tmp_ |= ((val) & ~(mask));                      \
> > +             WREG32(reg, tmp_);                              \
> > +     } while (0)
> > +#define WREG32_AND(reg, and) WREG32_P(reg, 0, and)
> > +#define WREG32_OR(reg, or) WREG32_P(reg, or, ~(or))
> > +
> > +#define REG_FIELD_SHIFT(reg, field) reg##_##field##_SHIFT
> > +#define REG_FIELD_MASK(reg, field) reg##_##field##_MASK
> > +#define WREG32_FIELD(reg, field, val)        \
> > +     WREG32(mm##reg, (RREG32(mm##reg) & ~REG_FIELD_MASK(reg, field)) | \
> > +                     (val) << REG_FIELD_SHIFT(reg, field))
> > +
> >  /**
> >   * struct hl_device - habanalabs device structure.
> >   * @pdev: pointer to PCI device, can be NULL in case of simulator device.
> > + * @pcie_bar: array of available PCIe bars.
> > + * @rmmio: configuration area address on SRAM.
> >   * @cdev: related char device.
> >   * @dev: related kernel basic device structure.
> >   * @asic_name: ASIC specific name.
> >   * @asic_type: ASIC specific type.
> > + * @dma_pool: DMA pool for small allocations.
> > + * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> > + * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> > + * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
> > + * @asic_prop: ASIC specific immutable properties.
> > + * @asic_funcs: ASIC specific functions.
> > + * @asic_specific: ASIC specific information to use only from ASIC files.
> >   * @major: habanalabs KMD major.
> >   * @id: device minor.
> >   * @disabled: is device disabled.
> >   */
> >  struct hl_device {
> >       struct pci_dev                  *pdev;
> > +     void __iomem                    *pcie_bar[6];
> > +     void __iomem                    *rmmio;
> >       struct cdev                     cdev;
> >       struct device                   *dev;
> >       char                            asic_name[16];
> >       enum hl_asic_type               asic_type;
> > +     struct dma_pool                 *dma_pool;
> > +     void                            *cpu_accessible_dma_mem;
> > +     dma_addr_t                      cpu_accessible_dma_address;
> > +     struct gen_pool                 *cpu_accessible_dma_pool;
> > +     struct asic_fixed_properties    asic_prop;
> > +     const struct hl_asic_funcs      *asic_funcs;
> > +     void                            *asic_specific;
> >       u32                             major;
> >       u16                             id;
> >       u8                              disabled;
> > +
> > +     /* Parameters for bring-up */
> > +     u8                              reset_pcilink;
> >  };
> >
> >  /*
> > @@ -146,4 +275,6 @@ void hl_device_fini(struct hl_device *hdev);
> >  int hl_device_suspend(struct hl_device *hdev);
> >  int hl_device_resume(struct hl_device *hdev);
> >
> > +void goya_set_asic_funcs(struct hl_device *hdev);
> > +
> >  #endif /* HABANALABSP_H_ */
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > index 15217975327b..79545003b7c2 100644
> > --- a/drivers/misc/habanalabs/habanalabs_drv.c
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -136,6 +136,9 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> >
> >       hdev->major = hl_major;
> >
> > +     /* Parameters for bring-up - set them to defaults */
> > +     hdev->reset_pcilink = 0;
> > +
> >       hdev->disabled = true;
> >       hdev->pdev = pdev; /* can be NULL in case of simulator device */
> >
> > diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
> > new file mode 100644
> > index 000000000000..192a1450cbb1
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/goya/goya.h
> > @@ -0,0 +1,115 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> > + *
> > + */
> > +
> > +#ifndef GOYA_H
> > +#define GOYA_H
> > +
> > +#include "asic_reg/goya_regs.h"
> > +
> > +#include <linux/types.h>
> > +
> > +#define SRAM_CFG_BAR_ID              0
> > +#define MSIX_BAR_ID          2
> > +#define DDR_BAR_ID           4
> > +
> > +#define CFG_BAR_SIZE         0x10000000ull           /* 256MB */
> > +#define MSIX_BAR_SIZE                0x1000ull               /* 4KB */
> > +
> > +#define CFG_BASE             0x7FFC000000ull
> > +#define CFG_SIZE             0x4000000               /* 32MB CFG + 32MB DBG*/
> > +
> > +#define SRAM_BASE_ADDR               0x7FF0000000ull
> > +#define SRAM_SIZE            0x32A0000               /* 50.625MB */
> > +#define KMD_SRAM_RESERVED_SIZE       0x8000                  /* 32KB */
> > +
> > +#define SRAM_BASE_ADDR_USER  (0x7FF0000000ull + KMD_SRAM_RESERVED_SIZE)
> > +#define SRAM_SIZE_USER               (SRAM_SIZE - KMD_SRAM_RESERVED_SIZE)
> > +
> > +#define DRAM_PHYS_BASE               0x0ull
> > +
> > +#define CPU_FW_IMAGE_SIZE    0x10000000      /* 256MB */
> > +#define MMU_PAGE_TABLES_SIZE 0x0E000000      /* 224MB */
> > +#define CPU_PQ_PKT_SIZE              0x00001000      /* 4KB */
> > +#define CPU_PQ_DATA_SIZE     0x01FFF000      /* 32MB - 4KB  */
> > +
> > +#define CPU_FW_IMAGE_ADDR    DRAM_PHYS_BASE
> > +#define MMU_PAGE_TABLES_ADDR (CPU_FW_IMAGE_ADDR + CPU_FW_IMAGE_SIZE)
> > +#define CPU_PQ_PKT_ADDR              (MMU_PAGE_TABLES_ADDR + MMU_PAGE_TABLES_SIZE)
> > +#define CPU_PQ_DATA_ADDR     (CPU_PQ_PKT_ADDR + CPU_PQ_PKT_SIZE)
> > +#define DRAM_BASE_ADDR_USER  (CPU_PQ_DATA_ADDR + CPU_PQ_DATA_SIZE)
> > +
> > +#define HOST_PHYS_BASE               0x8000000000ull         /* 0.5TB */
> > +#define HOST_PHYS_SIZE               0x1000000000000ull      /* 0.25PB (48 bits) */
> > +
> > +#define VA_HOST_SPACE_START  0x1000000000000ull      /* 256TB */
> > +#define VA_HOST_SPACE_END    0x3FF8000000000ull      /* 1PB - 1TB */
> > +#define VA_HOST_SPACE_SIZE   (VA_HOST_SPACE_END - \
> > +                                     VA_HOST_SPACE_START) /* 767TB */
> > +
> > +#define VA_DDR_SPACE_START   0x800000000ull          /* 32GB */
> > +#define VA_DDR_SPACE_END     0x2000000000ull         /* 128GB */
> > +#define VA_DDR_SPACE_SIZE    (VA_DDR_SPACE_END - \
> > +                                     VA_DDR_SPACE_START)     /* 96GB */
> > +
> > +#define CPU_BOOT_ADDR                0x7FF8040000ull
> > +
> > +#define UBOOT_FW_OFFSET              0x100000                /* 1MB in SRAM */
> > +#define LINUX_FW_OFFSET              0x800000                /* 8MB in DDR */
> > +
> > +#define GOYA_MSIX_ENTRIES    8
> > +#define EVENT_QUEUE_MSIX_IDX 5
> > +#define ARMCP_RESET_MSIX_IDX 6
> > +
> > +#define QMAN_PQ_ENTRY_SIZE   16                      /* Bytes */
> > +
> > +#define MAX_ASID             1024
> > +
> > +#define PROT_BITS_OFFS               0xF80
> > +
> > +/*
> > + * Queue Numbering
> > + *
> > + * The external queues (DMA channels + CPU) MUST be before the internal queues
> > + * and each group (DMA channels + CPU and internal) must be contiguous inside
> > + * itself but there can be a gap between the two groups (although not
> > + * recommended)
> > + */
> > +
> > +enum goya_queue_id {
> > +     GOYA_QUEUE_ID_DMA_0 = 0,
> > +     GOYA_QUEUE_ID_DMA_1,
> > +     GOYA_QUEUE_ID_DMA_2,
> > +     GOYA_QUEUE_ID_DMA_3,
> > +     GOYA_QUEUE_ID_DMA_4,
> > +     GOYA_QUEUE_ID_CPU_PQ,
> > +     GOYA_QUEUE_ID_MME,
> > +     GOYA_QUEUE_ID_TPC0,
> > +     GOYA_QUEUE_ID_TPC1,
> > +     GOYA_QUEUE_ID_TPC2,
> > +     GOYA_QUEUE_ID_TPC3,
> > +     GOYA_QUEUE_ID_TPC4,
> > +     GOYA_QUEUE_ID_TPC5,
> > +     GOYA_QUEUE_ID_TPC6,
> > +     GOYA_QUEUE_ID_TPC7,
> > +     GOYA_QUEUE_ID_SIZE
> > +};
> > +
> > +enum goya_pll_index {
> > +     CPU_PLL = 0,
> > +     IC_PLL,
> > +     MC_PLL,
> > +     MME_PLL,
> > +     PCI_PLL,
> > +     EMMC_PLL,
> > +     TPC_PLL
> > +};
> > +
> > +#define GOYA_PLL_FREQ_LOW            50000000 /* 50 MHz */
> > +
> > +#endif /* GOYA_H */
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 04/15] habanalabs: add context and ASID modules
  2019-01-23 12:28   ` Mike Rapoport
@ 2019-01-25 21:07     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-25 21:07 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:46AM +0200, Oded Gabbay wrote:
> > This patch adds two modules - ASID and context.
> >
> > Each user process the opens a device's file must have at least one context
>
>                    ^that
>
> > before it is able to "work" with the device. Each context has its own
> > device address-space and contains information about its runtime state (its
> > active command submissions).
> >
> > To have address-space separation between contexts, each context is assigned
> > a unique ASID, which stands for "address-space id". Goya supports up to
> > 1024 ASIDs.
> >
> > Currently, the driver doesn't support multiple contexts. Therefore, the
> > user doesn't need to actively create a context. A "primary context" is
> > created automatically when the user opens the device's file.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/Makefile         |   2 +-
> >  drivers/misc/habanalabs/asid.c           |  58 +++++++++
> >  drivers/misc/habanalabs/context.c        | 155 +++++++++++++++++++++++
> >  drivers/misc/habanalabs/device.c         |  47 +++++++
> >  drivers/misc/habanalabs/habanalabs.h     |  70 ++++++++++
> >  drivers/misc/habanalabs/habanalabs_drv.c |  46 ++++++-
> >  6 files changed, 375 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/misc/habanalabs/asid.c
> >  create mode 100644 drivers/misc/habanalabs/context.c
> >
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > index 6f1ead69bd77..3ffbadc2ca01 100644
> > --- a/drivers/misc/habanalabs/Makefile
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -4,7 +4,7 @@
> >
> >  obj-m        := habanalabs.o
> >
> > -habanalabs-y := habanalabs_drv.o device.o
> > +habanalabs-y := habanalabs_drv.o device.o context.o asid.o
> >
> >  include $(src)/goya/Makefile
> >  habanalabs-y += $(HL_GOYA_FILES)
> > diff --git a/drivers/misc/habanalabs/asid.c b/drivers/misc/habanalabs/asid.c
> > new file mode 100644
> > index 000000000000..0ce84c8f5a47
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/asid.c
> > @@ -0,0 +1,58 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#include <linux/slab.h>
> > +#include <linux/types.h>
> > +
> > +int hl_asid_init(struct hl_device *hdev)
> > +{
> > +     hdev->asid_bitmap = kcalloc(BITS_TO_LONGS(hdev->asic_prop.max_asid),
> > +                                     sizeof(*hdev->asid_bitmap), GFP_KERNEL);
> > +     if (!hdev->asid_bitmap)
> > +             return -ENOMEM;
> > +
> > +     mutex_init(&hdev->asid_mutex);
> > +
> > +     /* ASID 0 is reserved for KMD */
> > +     set_bit(0, hdev->asid_bitmap);
> > +
> > +     return 0;
> > +}
> > +
> > +void hl_asid_fini(struct hl_device *hdev)
> > +{
> > +     mutex_destroy(&hdev->asid_mutex);
> > +     kfree(hdev->asid_bitmap);
> > +}
> > +
> > +unsigned long hl_asid_alloc(struct hl_device *hdev)
> > +{
> > +     unsigned long found;
> > +
> > +     mutex_lock(&hdev->asid_mutex);
> > +
> > +     found = find_first_zero_bit(hdev->asid_bitmap,
> > +                                     hdev->asic_prop.max_asid);
> > +     if (found == hdev->asic_prop.max_asid)
> > +             found = 0;
> > +     else
> > +             set_bit(found, hdev->asid_bitmap);
> > +
> > +     mutex_unlock(&hdev->asid_mutex);
> > +
> > +     return found;
> > +}
> > +
> > +void hl_asid_free(struct hl_device *hdev, unsigned long asid)
> > +{
> > +     if (WARN((asid == 0 || asid >= hdev->asic_prop.max_asid),
> > +                                             "Invalid ASID %lu", asid))
> > +             return;
> > +     clear_bit(asid, hdev->asid_bitmap);
> > +}
> > diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
> > new file mode 100644
> > index 000000000000..cdcad077e5cf
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/context.c
> > @@ -0,0 +1,155 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#include <linux/sched.h>
> > +#include <linux/delay.h>
> > +
> > +static void hl_ctx_fini(struct hl_ctx *ctx)
> > +{
> > +     struct hl_device *hdev = ctx->hdev;
> > +
> > +     if (ctx->asid != HL_KERNEL_ASID_ID)
> > +             hl_asid_free(hdev, ctx->asid);
> > +}
> > +
> > +void hl_ctx_do_release(struct kref *ref)
> > +{
> > +     struct hl_ctx *ctx;
> > +
> > +     ctx = container_of(ref, struct hl_ctx, refcount);
> > +
> > +     dev_dbg(ctx->hdev->dev, "Now really releasing context %d\n", ctx->asid);
> > +
> > +     hl_ctx_fini(ctx);
> > +
> > +     if (ctx->hpriv)
> > +             hl_hpriv_put(ctx->hpriv);
> > +
> > +     kfree(ctx);
> > +}
> > +
> > +int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
> > +{
> > +     struct hl_ctx_mgr *mgr = &hpriv->ctx_mgr;
> > +     struct hl_ctx *ctx;
> > +     int rc;
> > +
> > +     ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> > +     if (!ctx) {
> > +             rc = -ENOMEM;
> > +             goto out_err;
> > +     }
> > +
> > +     rc = hl_ctx_init(hdev, ctx, false);
> > +     if (rc)
> > +             goto free_ctx;
> > +
> > +     hl_hpriv_get(hpriv);
> > +     ctx->hpriv = hpriv;
> > +
> > +     /* TODO: remove for multiple contexts */
> > +     hpriv->ctx = ctx;
> > +     hdev->user_ctx = ctx;
> > +
> > +     mutex_lock(&mgr->ctx_lock);
> > +     rc = idr_alloc(&mgr->ctx_handles, ctx, 1, 0, GFP_KERNEL);
> > +     mutex_unlock(&mgr->ctx_lock);
> > +
> > +     if (rc < 0) {
> > +             dev_err(hdev->dev, "Failed to allocate IDR for a new CTX\n");
> > +             hl_ctx_free(hdev, ctx);
> > +             goto out_err;
> > +     }
> > +
> > +     return 0;
> > +
> > +free_ctx:
> > +     kfree(ctx);
> > +out_err:
> > +     return rc;
> > +}
> > +
> > +void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
> > +{
> > +     if (kref_put(&ctx->refcount, hl_ctx_do_release) == 1)
> > +             return;
> > +
> > +     dev_warn(hdev->dev,
> > +             "Context %d closed or terminated but its CS are executing\n",
> > +             ctx->asid);
> > +}
> > +
> > +int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
> > +{
> > +     ctx->hdev = hdev;
> > +
> > +     kref_init(&ctx->refcount);
> > +
> > +     if (is_kernel_ctx) {
> > +             ctx->asid = HL_KERNEL_ASID_ID; /* KMD gets ASID 0 */
> > +     } else {
> > +             ctx->asid = hl_asid_alloc(hdev);
> > +             if (!ctx->asid) {
> > +                     dev_err(hdev->dev, "No free ASID, failed to create context\n");
> > +                     return -ENOMEM;
> > +             }
> > +     }
> > +
> > +     dev_dbg(hdev->dev, "Created context with ASID %u\n", ctx->asid);
> > +
> > +     return 0;
> > +}
> > +
> > +void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
> > +{
> > +     kref_get(&ctx->refcount);
> > +}
> > +
> > +int hl_ctx_put(struct hl_ctx *ctx)
> > +{
> > +     return kref_put(&ctx->refcount, hl_ctx_do_release);
> > +}
> > +
> > +/**
> > + * hl_ctx_mgr_init - initialize the context manager
> > + *
> > + * @mgr: pointer to context manager structure
> > + *
> > + * This manager is an object inside the hpriv object of the user process.
> > + * The function is called when a user process opens the FD.
> > + */
> > +void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr)
> > +{
> > +     mutex_init(&mgr->ctx_lock);
> > +     idr_init(&mgr->ctx_handles);
> > +}
> > +
> > +/**
> > + * hl_ctx_mgr_fini - finalize the context manager
> > + *
> > + * @hdev: pointer to device structure
> > + * @mgr: pointer to context manager structure
> > + *
> > + * This function goes over all the contexts in the manager and frees them.
> > + * It is called when a process closes the FD.
> > + */
> > +void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr)
> > +{
> > +     struct hl_ctx *ctx;
> > +     struct idr *idp;
> > +     u32 id;
> > +
> > +     idp = &mgr->ctx_handles;
> > +
> > +     idr_for_each_entry(idp, ctx, id)
> > +             hl_ctx_free(hdev, ctx);
> > +
> > +     idr_destroy(&mgr->ctx_handles);
> > +     mutex_destroy(&mgr->ctx_lock);
> > +}
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index a4276ef559b3..84ce9fcb52da 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -23,6 +23,12 @@ static void hpriv_release(struct kref *ref)
> >       put_pid(hpriv->taskpid);
> >
> >       kfree(hpriv);
> > +
> > +     /* Now the FD is really closed */
> > +     atomic_dec(&hdev->fd_open_cnt);
> > +
> > +     /* This allows a new user context to open the device */
> > +     hdev->user_ctx = NULL;
> >  }
> >
> >  void hl_hpriv_get(struct hl_fpriv *hpriv)
> > @@ -47,6 +53,8 @@ static int hl_device_release(struct inode *inode, struct file *filp)
> >  {
> >       struct hl_fpriv *hpriv = filp->private_data;
> >
> > +     hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> > +
> >       filp->private_data = NULL;
> >
> >       hl_hpriv_put(hpriv);
> > @@ -133,7 +141,20 @@ static int device_early_init(struct hl_device *hdev)
> >       if (rc)
> >               return rc;
> >
> > +     rc = hl_asid_init(hdev);
> > +     if (rc)
> > +             goto early_fini;
> > +
> > +     mutex_init(&hdev->device_open);
> > +     atomic_set(&hdev->fd_open_cnt, 0);
> > +
> >       return 0;
> > +
> > +early_fini:
> > +     if (hdev->asic_funcs->early_fini)
> > +             hdev->asic_funcs->early_fini(hdev);
> > +
> > +     return rc;
> >  }
> >
> >  /**
> > @@ -145,9 +166,12 @@ static int device_early_init(struct hl_device *hdev)
> >  static void device_early_fini(struct hl_device *hdev)
> >  {
> >
> > +     hl_asid_fini(hdev);
> > +
> >       if (hdev->asic_funcs->early_fini)
> >               hdev->asic_funcs->early_fini(hdev);
> >
> > +     mutex_destroy(&hdev->device_open);
> >  }
> >
> >  /**
> > @@ -241,11 +265,30 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >       if (rc)
> >               goto early_fini;
> >
> > +     /* Allocate the kernel context */
> > +     hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
> > +     if (!hdev->kernel_ctx) {
> > +             rc = -ENOMEM;
> > +             goto sw_fini;
> > +     }
> > +
> > +     hdev->user_ctx = NULL;
> > +
> > +     rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize kernel context\n");
> > +             goto free_ctx;
> > +     }
> > +
> >       dev_notice(hdev->dev,
> >               "Successfully added device to habanalabs driver\n");
> >
> >       return 0;
> >
> > +free_ctx:
> > +     kfree(hdev->kernel_ctx);
> > +sw_fini:
> > +     hdev->asic_funcs->sw_fini(hdev);
> >  early_fini:
> >       device_early_fini(hdev);
> >  release_device:
> > @@ -278,6 +321,10 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Mark device as disabled */
> >       hdev->disabled = true;
> >
> > +     /* Release kernel context */
> > +     if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
> > +             dev_err(hdev->dev, "kernel ctx is still alive\n");
> > +
> >       /* Call ASIC S/W finalize function */
> >       hdev->asic_funcs->sw_fini(hdev);
> >
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index 97844825f7a8..d003a6af2131 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -125,6 +125,45 @@ struct hl_asic_funcs {
> >                                       void *cpu_addr, dma_addr_t dma_handle);
> >  };
> >
> > +
> > +
> > +
> > +
> > +/*
> > + * CONTEXTS
> > + */
> > +
> > +#define HL_KERNEL_ASID_ID    0
> > +
> > +/**
> > + * struct hl_ctx - user/kernel context.
> > + * @hpriv: pointer to the private (KMD) data of the process (fd).
> > + * @hdev: pointer to the device structure.
> > + * @refcount: reference counter for the context. Context is released only when
> > + *           this hits 0. It is incremented on CS and CS_WAIT.
> > + * @asid: context's unique address space ID in the device's MMU.
> > + */
> > +struct hl_ctx {
> > +     struct hl_fpriv         *hpriv;
> > +     struct hl_device        *hdev;
> > +     struct kref             refcount;
> > +     u32                     asid;
> > +};
> > +
> > +/**
> > + * struct hl_ctx_mgr - for handling multiple contexts.
> > + * @ctx_lock: protects ctx_handles.
> > + * @ctx_handles: idr to hold all ctx handles.
> > + */
> > +struct hl_ctx_mgr {
> > +     struct mutex            ctx_lock;
> > +     struct idr              ctx_handles;
> > +};
> > +
> > +
> > +
> > +
> > +
> >  /*
> >   * FILE PRIVATE STRUCTURE
> >   */
> > @@ -134,12 +173,16 @@ struct hl_asic_funcs {
> >   * @hdev: habanalabs device structure.
> >   * @filp: pointer to the given file structure.
> >   * @taskpid: current process ID.
> > + * @ctx: current executing context.
> > + * @ctx_mgr: context manager to handle multiple context for this FD.
> >   * @refcount: number of related contexts.
> >   */
> >  struct hl_fpriv {
> >       struct hl_device        *hdev;
> >       struct file             *filp;
> >       struct pid              *taskpid;
> > +     struct hl_ctx           *ctx; /* TODO: remove for multiple ctx */
> > +     struct hl_ctx_mgr       ctx_mgr;
> >       struct kref             refcount;
> >  };
> >
> > @@ -195,13 +238,19 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @dev: realted kernel basic device structure.
> >   * @asic_name: ASIC specific nmae.
> >   * @asic_type: ASIC specific type.
> > + * @kernel_ctx: KMD context structure.
> >   * @dma_pool: DMA pool for small allocations.
> >   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> >   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> >   * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
> > + * @asid_bitmap: holds used/available ASIDs.
> > + * @asid_mutex: protects asid_bitmap.
> > + * @device_open: lock for sanity checks upon FD open.
>
> device_open is an ambiguous name for a lock
>
> >   * @asic_prop: ASIC specific immutable properties.
> >   * @asic_funcs: ASIC specific functions.
> >   * @asic_specific: ASIC specific information to use only from ASIC files.
> > + * @user_ctx: current user context executing.
> > + * @fd_open_cnt: number of open context executing.
> >   * @major: habanalabs KMD major.
> >   * @id: device minor.
> >   * @disabled: is device disabled.
> > @@ -214,13 +263,21 @@ struct hl_device {
> >       struct device                   *dev;
> >       char                            asic_name[16];
> >       enum hl_asic_type               asic_type;
> > +     struct hl_ctx                   *kernel_ctx;
> >       struct dma_pool                 *dma_pool;
> >       void                            *cpu_accessible_dma_mem;
> >       dma_addr_t                      cpu_accessible_dma_address;
> >       struct gen_pool                 *cpu_accessible_dma_pool;
> > +     unsigned long                   *asid_bitmap;
> > +     struct mutex                    asid_mutex;
> > +     /* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
> > +     struct mutex                    device_open;
> >       struct asic_fixed_properties    asic_prop;
> >       const struct hl_asic_funcs      *asic_funcs;
> >       void                            *asic_specific;
> > +     /* TODO: The following fields should be moved for multi-context */
> > +     struct hl_ctx                   *user_ctx;
> > +     atomic_t                        fd_open_cnt;
> >       u32                             major;
> >       u16                             id;
> >       u8                              disabled;
> > @@ -270,10 +327,23 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
> >  int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> >                               u32 timeout_us, u32 *val);
> >
> > +int hl_asid_init(struct hl_device *hdev);
> > +void hl_asid_fini(struct hl_device *hdev);
> > +unsigned long hl_asid_alloc(struct hl_device *hdev);
> > +void hl_asid_free(struct hl_device *hdev, unsigned long asid);
> > +
> > +int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv);
> > +void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx);
> > +int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx);
> > +int hl_ctx_put(struct hl_ctx *ctx);
> > +void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr);
> > +void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr);
> >  int hl_device_init(struct hl_device *hdev, struct class *hclass);
> >  void hl_device_fini(struct hl_device *hdev);
> >  int hl_device_suspend(struct hl_device *hdev);
> >  int hl_device_resume(struct hl_device *hdev);
> > +void hl_hpriv_get(struct hl_fpriv *hpriv);
> > +void hl_hpriv_put(struct hl_fpriv *hpriv);
> >
> >  void goya_set_asic_funcs(struct hl_device *hdev);
> >
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > index 79545003b7c2..0646da83eb53 100644
> > --- a/drivers/misc/habanalabs/habanalabs_drv.c
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -77,6 +77,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
> >  {
> >       struct hl_device *hdev;
> >       struct hl_fpriv *hpriv;
> > +     int rc;
> >
> >       mutex_lock(&hl_devs_idr_lock);
> >       hdev = idr_find(&hl_devs_idr, iminor(inode));
> > @@ -88,9 +89,33 @@ int hl_device_open(struct inode *inode, struct file *filp)
> >               return -ENXIO;
> >       }
> >
> > +     mutex_lock(&hdev->device_open);
> > +
> > +     if (hdev->disabled) {
> > +             dev_err_ratelimited(hdev->dev,
> > +                     "Can't open %s because it is disabled\n",
> > +                     dev_name(hdev->dev));
> > +             mutex_unlock(&hdev->device_open);
> > +             return -EPERM;
> > +     }
> > +
> > +     if (hdev->user_ctx) {
> > +             dev_info_ratelimited(hdev->dev,
> > +                     "Device %s is already attached to application\n",
> > +                     dev_name(hdev->dev));
> > +             mutex_unlock(&hdev->device_open);
> > +             return -EBUSY;
> > +     }
> > +
> > +     atomic_inc(&hdev->fd_open_cnt);
> > +
> > +     mutex_unlock(&hdev->device_open);
> > +
> >       hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
> > -     if (!hpriv)
> > -             return -ENOMEM;
> > +     if (!hpriv) {
> > +             rc = -ENOMEM;
> > +             goto close_device;
> > +     }
> >
> >       hpriv->hdev = hdev;
> >       filp->private_data = hpriv;
> > @@ -98,9 +123,26 @@ int hl_device_open(struct inode *inode, struct file *filp)
> >       kref_init(&hpriv->refcount);
> >       nonseekable_open(inode, filp);
> >
> > +     hl_ctx_mgr_init(&hpriv->ctx_mgr);
> > +
> > +     rc = hl_ctx_create(hdev, hpriv);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to open FD (CTX fail)\n");
> > +             goto out_err;
> > +     }
> > +
> >       hpriv->taskpid = find_get_pid(current->pid);
> >
> >       return 0;
> > +
> > +out_err:
> > +     filp->private_data = NULL;
> > +     hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> > +     kfree(hpriv);
> > +
> > +close_device:
> > +     atomic_dec(&hdev->fd_open_cnt);
> > +     return rc;
> >  }
> >
> >  /**
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>
Fixed, thanks.
Oded
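
For readers following the quoted patch: the ASID scheme it describes (a
bitmap with ASID 0 reserved for the kernel driver, first-fit allocation,
and 0 returned when nothing is free) can be modelled in plain user-space
C. This is only a sketch under stated assumptions. The names asid_pool,
asid_pool_init, asid_alloc and asid_free are invented for illustration,
and the locking the driver does with asid_mutex is left out, so the model
is single-threaded.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_ASID      1024UL  /* Goya's limit, per the commit message */
#define KERNEL_ASID   0UL     /* reserved for the kernel driver */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS  ((MAX_ASID + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical user-space stand-in for hdev->asid_bitmap. */
struct asid_pool {
	unsigned long bitmap[BITMAP_WORDS];
};

void asid_pool_init(struct asid_pool *p)
{
	for (size_t i = 0; i < BITMAP_WORDS; i++)
		p->bitmap[i] = 0;

	/* Mark ASID 0 as used so it is never handed to a user context. */
	p->bitmap[0] |= 1UL;
}

/* Returns the first free ASID, or 0 when all ASIDs are taken. */
unsigned long asid_alloc(struct asid_pool *p)
{
	for (unsigned long id = 0; id < MAX_ASID; id++) {
		unsigned long *word = &p->bitmap[id / BITS_PER_WORD];
		unsigned long mask = 1UL << (id % BITS_PER_WORD);

		if (!(*word & mask)) {
			*word |= mask;
			return id;
		}
	}

	return 0;
}

/* Returns 0 on success, -1 for the reserved or an out-of-range ASID. */
int asid_free(struct asid_pool *p, unsigned long id)
{
	if (id == KERNEL_ASID || id >= MAX_ASID)
		return -1;

	p->bitmap[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
	return 0;
}
```

Because ASID 0 is always marked as used, 0 can double as the "no free
ASID" return value, which is the same convention hl_ctx_init() relies on
when it checks "if (!ctx->asid)".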

* [PATCH v2 1/5] drivers/accel: Introduce subsystem
  2019-01-25 18:16             ` [PATCH 1/5] drivers/accel: Introduce subsystem Olof Johansson
@ 2019-01-25 21:13               ` Olof Johansson
  2019-01-26 17:09                 ` Randy Dunlap
  2019-01-27  4:31                 ` Andrew Donnellan
  2019-01-25 22:23               ` [PATCH " Daniel Vetter
  1 sibling, 2 replies; 103+ messages in thread
From: Olof Johansson @ 2019-01-25 21:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Olof Johansson

We're starting to see more of these kinds of devices; the current
upcoming wave will likely be around machine learning and inference
engines. A few drivers have been added to drivers/misc for this, but
it's timely to make it into a separate group of drivers/subsystem, to
make it easier to find them, and to encourage collaboration between
contributors.

Over time, we expect to build shared frameworks that the drivers will
make use of, but what that framework needs to look like to fill the needs
is still unclear, and the best way to gain that knowledge is to give the
disparate implementations a shared location.

There has been some controversy around expectations for userspace
stacks being open. The clear preference is to see that happen, and any
driver and platform stack that is delivered like that will be given
preferential treatment, and at some point in the future it might
become the requirement. Until then, the bare minimum we need is an
open low-level userspace such that the driver and HW interfaces can be
exercised if someone is modifying the driver, even if the full details
of the workload are not always available.

Bootstrapping this with myself and Greg as maintainers (since the current
drivers will be moving out of drivers/misc). Looking forward to expanding
that group over time.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Olof Johansson <olof@lixom.net>
---

v2: 
I had missed a git add of the Documentation/ piece, which is the most
important piece of this anyway. Sigh.

 Documentation/accelerators/README.rst | 42 +++++++++++++++++++++++++++++++++++
 MAINTAINERS                           |  8 +++++++
 drivers/Kconfig                       |  2 ++
 drivers/Makefile                      |  1 +
 drivers/accel/Kconfig                 | 16 +++++++++++++
 drivers/accel/Makefile                |  5 +++++
 6 files changed, 74 insertions(+)
 create mode 100644 Documentation/accelerators/README.rst
 create mode 100644 drivers/accel/Kconfig
 create mode 100644 drivers/accel/Makefile

diff --git a/Documentation/accelerators/README.rst b/Documentation/accelerators/README.rst
new file mode 100644
index 0000000000000..79049ff99e93e
--- /dev/null
+++ b/Documentation/accelerators/README.rst
@@ -0,0 +1,42 @@
+.. _readme:
+
+Hardware offload accelerator subsystem
+======================================
+
+This is a brief overview of the subsystem (grouping) of hardware
+accelerators kept under drivers/accel
+
+Types of hardware supported
+---------------------------
+
+  The general types of hardware supported are hardware devices that have
+  general interactions of sending commands and buffers to the hardware,
+  returning completions and possible filled buffers back, together
+  with the usual driver pieces around hardware control, setup, error
+  handling, etc.
+
+  Drivers that fit into other subsystems are expected to be merged
+  there, and use the appropriate userspace interfaces of said functional
+  areas. We don't expect to see drivers for network, storage, graphics
+  and similar hardware implemented by drivers here.
+
+Expectations for contributions
+------------------------------
+
+ - Platforms and hardware that have fully open stacks, from Firmware to
+   Userspace, are always going to be given preferential treatment. These
+   platforms give the best insight for behavior and interaction of all
+   layers, including ability to improve implementation across the stack
+   over time.
+
+ - If a platform is partially proprietary, it is still expected that the
+   portions that interact with the driver can be shared in a form that allows
+   for exercising the hardware/driver and evolution of the interface over
+   time. This could be separated into a shared library and test/sample
+   programs, for example.
+
+ - Over time, there is an expectation to converge drivers over to shared
+   frameworks and interfaces. Until then, the general rule is that no
+   more than one driver per vendor will be acceptable. For vendors that
+   aren't participating in the work towards shared frameworks over time,
+   we reserve the right to phase out support for the hardware.
diff --git a/MAINTAINERS b/MAINTAINERS
index ddcdc29dfe1f6..8a9bbaf8f6e90 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7033,6 +7033,14 @@ W:	https://linuxtv.org
 S:	Supported
 F:	drivers/media/platform/sti/hva
 
+HW ACCELERATOR OFFLOAD SUBSYSTEM
+M:	Olof Johansson <olof@lixom.net>
+M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+L:	linux-accelerators@lists.ozlabs.org
+S:	Supported
+F:	drivers/accel/
+F:	Documentation/accelerators/
+
 HWPOISON MEMORY FAILURE HANDLING
 M:	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
 L:	linux-mm@kvack.org
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 4f9f99057ff85..3cc461f325569 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -228,4 +228,6 @@ source "drivers/siox/Kconfig"
 
 source "drivers/slimbus/Kconfig"
 
+source "drivers/accel/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 04da7876032cc..e4be06579cc5d 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -186,3 +186,4 @@ obj-$(CONFIG_MULTIPLEXER)	+= mux/
 obj-$(CONFIG_UNISYS_VISORBUS)	+= visorbus/
 obj-$(CONFIG_SIOX)		+= siox/
 obj-$(CONFIG_GNSS)		+= gnss/
+obj-$(CONFIG_ACCEL)		+= accel/
diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
new file mode 100644
index 0000000000000..13b36c0398895
--- /dev/null
+++ b/drivers/accel/Kconfig
@@ -0,0 +1,16 @@
+#
+# Drivers for hardware offload accelerators
+# See Documentation/accelerators/README.rst for more details
+#
+
+menuconfig ACCEL
+	bool "Hardware offload accelerator support"
+	help
+	  HW offload accelerators are used for high-bandwidth workloads
+	  where a higher-level kernel/userspace interface isn't suitable.
+
+if ACCEL
+
+comment "HW Accelerator drivers"
+
+endif
diff --git a/drivers/accel/Makefile b/drivers/accel/Makefile
new file mode 100644
index 0000000000000..343bbb8f45a14
--- /dev/null
+++ b/drivers/accel/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for accel devices
+#
+
-- 
2.11.0


* Re: [PATCH 05/15] habanalabs: add command buffer module
  2019-01-23 12:28   ` Mike Rapoport
@ 2019-01-25 21:47     ` Oded Gabbay
  2019-01-27  6:49       ` Mike Rapoport
  0 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-25 21:47 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:47AM +0200, Oded Gabbay wrote:
> > This patch adds the CB module, which allows the user to create and
> > destroy CBs and to map them to the user's process address-space.
>
> Can you please spell "command buffer" at least first time it's mentioned?
fixed
>
> > A command buffer is a memory block that resides in DMA-able address-space
> > and is physically contiguous so it can be accessed by the device without
> > MMU translation. The command buffer memory is allocated using the
> > coherent DMA API.
> >
> > When creating a new CB, the IOCTL returns a handle of it, and the
> > user-space process needs to use that handle to mmap the buffer to get a VA
> > in the user's address-space.
> >
> > Before destroying (freeing) a CB, the user must unmap the CB's VA using the
> > CB handle.
> >
> > Each CB has a reference counter, which tracks its usage in command
> > submissions and also its mmaps (only a single mmap is allowed).
> >
> > The driver maintains a pool of pre-allocated CBs in order to reduce
> > latency during command submissions. In case the pool is empty, the driver
> > will go to the slow-path of allocating a new CB, i.e. calling
> > dma_alloc_coherent.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/Makefile           |   3 +-
> >  drivers/misc/habanalabs/command_buffer.c   | 414 +++++++++++++++++++++
> >  drivers/misc/habanalabs/device.c           |  43 ++-
> >  drivers/misc/habanalabs/goya/goya.c        |  28 ++
> >  drivers/misc/habanalabs/habanalabs.h       |  95 ++++-
> >  drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
> >  drivers/misc/habanalabs/habanalabs_ioctl.c | 102 +++++
> >  include/uapi/misc/habanalabs.h             |  62 +++
> >  8 files changed, 746 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/misc/habanalabs/command_buffer.c
> >  create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
> >  create mode 100644 include/uapi/misc/habanalabs.h
> >
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > index 3ffbadc2ca01..2530c9b78ca4 100644
> > --- a/drivers/misc/habanalabs/Makefile
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -4,7 +4,8 @@
> >
> >  obj-m        := habanalabs.o
> >
> > -habanalabs-y := habanalabs_drv.o device.o context.o asid.o
> > +habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> > +             command_buffer.o
> >
> >  include $(src)/goya/Makefile
> >  habanalabs-y += $(HL_GOYA_FILES)
> > diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
> > new file mode 100644
> > index 000000000000..535ed6cc5bda
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/command_buffer.c
> > @@ -0,0 +1,414 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include <uapi/misc/habanalabs.h>
> > +#include "habanalabs.h"
> > +
> > +#include <linux/dma-mapping.h>
> > +
> > +static void cb_fini(struct hl_device *hdev, struct hl_cb *cb)
> > +{
> > +     hdev->asic_funcs->dma_free_coherent(hdev, cb->size,
> > +                     (void *) cb->kernel_address, cb->bus_address);
>
> As it seems, ASIC specific dma_free_coherent is a shortcut for a generic
> dma_free_coherent. Why not use it directly?
>
As I explained in a previous patch review, there is a different
implementation when I'm working with a simulator.

> > +     kfree(cb);
> > +}
> > +
> > +static void cb_do_release(struct hl_device *hdev, struct hl_cb *cb)
> > +{
> > +     if (cb->is_pool) {
> > +             spin_lock(&hdev->cb_pool_lock);
> > +             list_add(&cb->pool_list, &hdev->cb_pool);
> > +             spin_unlock(&hdev->cb_pool_lock);
> > +     } else {
> > +             cb_fini(hdev, cb);
> > +     }
> > +}
> > +
> > +static void cb_release(struct kref *ref)
> > +{
> > +     struct hl_device *hdev;
> > +     struct hl_cb *cb;
> > +
> > +     cb = container_of(ref, struct hl_cb, refcount);
> > +     hdev = cb->hdev;
> > +
> > +     cb_do_release(hdev, cb);
> > +}
> > +
> > +static struct hl_cb *hl_cb_alloc(struct hl_device *hdev, u32 cb_size,
> > +                                     int ctx_id)
> > +{
> > +     struct hl_cb *cb;
> > +     void *p;
> > +
> > +     if (ctx_id == HL_KERNEL_ASID_ID)
> > +             cb = kzalloc(sizeof(*cb), GFP_ATOMIC);
>
> The GFP_ATOMIC should be used when the caller cannot tolerate reclaim or
> sleep and it does not seem to be the case here.
>
Yes, I can see why this is misleading. The call of this function from
latency-sensitive code comes in a much later patch (command
submission).
In short, due to H/W limitations, I must copy the CB of the user to a
kernel allocated CB (in case of an external queue) and this must be
done during command submission. Hence the GFP_ATOMIC flag in all
memory allocations in this function.
I will add a comment in this patch explaining this.

btw, we solved this problem in future ASICs.

> > +     else
> > +             cb = kzalloc(sizeof(*cb), GFP_KERNEL);
> > +
> > +     if (!cb)
> > +             return NULL;
> > +
> > +     if (ctx_id == HL_KERNEL_ASID_ID)
> > +             p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
> > +                                             &cb->bus_address, GFP_ATOMIC);
>
> GFP_KERNEL?
Same explanation as above.
>
> > +     else
> > +             p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
> > +                                             &cb->bus_address,
> > +                                             GFP_USER | __GFP_ZERO);
> > +     if (!p) {
> > +             dev_err(hdev->dev,
> > +                     "failed to allocate %d of dma memory for CB\n",
> > +                     cb_size);
> > +             kfree(cb);
> > +             return NULL;
> > +     }
> > +
> > +     cb->kernel_address = (u64) p;
> > +     cb->size = cb_size;
> > +
> > +     return cb;
> > +}
> > +
> > +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> > +                     u32 cb_size, u64 *handle, int ctx_id)
> > +{
> > +     struct hl_cb *cb;
> > +     bool alloc_new_cb = true;
> > +     int rc;
> > +
> > +     if (hdev->disabled) {
> > +             dev_warn_ratelimited(hdev->dev,
> > +                     "Device is disabled !!! Can't create new CBs\n");
> > +             rc = -EBUSY;
> > +             goto out_err;
> > +     }
> > +
> > +     /* Minimum allocation must be PAGE SIZE */
> > +     if (cb_size < PAGE_SIZE)
> > +             cb_size = PAGE_SIZE;
> > +
> > +     if (ctx_id == HL_KERNEL_ASID_ID &&
> > +                     cb_size <= hdev->asic_prop.cb_pool_cb_size) {
> > +
> > +             spin_lock(&hdev->cb_pool_lock);
> > +             if (!list_empty(&hdev->cb_pool)) {
> > +                     cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
> > +                                     pool_list);
> > +                     list_del(&cb->pool_list);
> > +                     spin_unlock(&hdev->cb_pool_lock);
> > +                     alloc_new_cb = false;
> > +             } else {
> > +                     spin_unlock(&hdev->cb_pool_lock);
> > +                     dev_warn_once(hdev->dev, "CB pool is empty\n");
>
> Isn't it going to be a false alarm when you allocate the cb for the first
> time?
Why?
The cb_pool list holds the available CBs. See hl_cb_pool_init(): it adds
the newly allocated CBs to this pool list.

if (!list_empty(&hdev->cb_pool)) checks whether the pool is non-empty, so
that we can take an available CB from it. If the list is empty (hence the
pool is empty), we print the warning.
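
To make the flow above concrete, here is a minimal userspace sketch of the
"take from the pool, otherwise warn once and allocate" logic. It is not the
driver code: the sk_* names are hypothetical, calloc()/free() stand in for
the kernel allocators, and the cb_pool_lock spinlock is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace model of a CB; not the driver's struct hl_cb. */
struct sk_cb {
	struct sk_cb *next;	/* link in the pool's free list */
	bool is_pool;		/* true if the CB was preallocated for the pool */
};

struct sk_pool {
	struct sk_cb *free_list;	/* available CBs, NULL when the pool is empty */
	bool warned;			/* stands in for dev_warn_once() */
};

/* Release every CB that is still sitting in the pool. */
void sk_pool_fini(struct sk_pool *pool)
{
	assert(pool);
	while (pool->free_list) {
		struct sk_cb *cb = pool->free_list;

		pool->free_list = cb->next;
		free(cb);
	}
}

/* Preallocate cnt CBs, as hl_cb_pool_init() does. Returns 0 or -1. */
int sk_pool_init(struct sk_pool *pool, int cnt)
{
	assert(pool);
	pool->free_list = NULL;
	pool->warned = false;

	for (int i = 0; i < cnt; i++) {
		struct sk_cb *cb = calloc(1, sizeof(*cb));

		if (!cb) {
			sk_pool_fini(pool);
			return -1;
		}
		cb->is_pool = true;
		cb->next = pool->free_list;
		pool->free_list = cb;
	}
	return 0;
}

/* Take an available CB from the pool, or fall back to a fresh allocation. */
struct sk_cb *sk_cb_get(struct sk_pool *pool)
{
	struct sk_cb *cb;

	assert(pool);
	cb = pool->free_list;
	if (cb) {
		pool->free_list = cb->next;
		cb->next = NULL;
		return cb;
	}

	if (!pool->warned) {
		fprintf(stderr, "CB pool is empty\n");
		pool->warned = true;
	}
	return calloc(1, sizeof(*cb));	/* is_pool stays false */
}
```

With a pool initialized to N entries, the first N calls to sk_cb_get()
return pooled CBs without any warning. Only the call after that prints the
message and returns a freshly allocated CB.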

>
> > +             }
> > +     }
> > +
> > +     if (alloc_new_cb) {
> > +             cb = hl_cb_alloc(hdev, cb_size, ctx_id);
> > +             if (!cb) {
> > +                     rc = -ENOMEM;
> > +                     goto out_err;
> > +             }
> > +     }
> > +
> > +     cb->hdev = hdev;
> > +     cb->ctx_id = ctx_id;
> > +
> > +     spin_lock(&mgr->cb_lock);
> > +     rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);
>
> It seems the ID will remain dangling if the cb is reused.

I'm not sure what you mean by this comment. Reused by whom? In what way
is it reused?
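
For context on the ID being discussed: the value returned by idr_alloc() is
not given to user space directly. hl_cb_create() ORs it with HL_MMAP_CB_MASK
and shifts it by PAGE_SHIFT, and hl_mmap()/hl_cb_destroy() reverse that. A
minimal userspace sketch of the round trip follows. It assumes 4KB pages
(PAGE_SHIFT of 12), and the SK_*/sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12	/* assumption: 4KB pages */
/* Same shape as HL_MMAP_CB_MASK in the patch: top bit, in page units. */
#define SK_MMAP_CB_MASK	(0x8000000000000000ull >> SK_PAGE_SHIFT)

/* Build the 64-bit handle returned to user space, as in hl_cb_create(). */
uint64_t sk_cb_handle_encode(uint32_t id)
{
	return ((uint64_t)id | SK_MMAP_CB_MASK) << SK_PAGE_SHIFT;
}

/* mmap() hands the driver the offset in page units (vm_pgoff). */
uint64_t sk_handle_to_pgoff(uint64_t handle)
{
	return handle >> SK_PAGE_SHIFT;
}

/* The check hl_mmap() uses to route the call to hl_cb_mmap(). */
int sk_pgoff_is_cb(uint64_t pgoff)
{
	return (pgoff & SK_MMAP_CB_MASK) == SK_MMAP_CB_MASK;
}

/* Strip the mask to recover the idr ID, as hl_mmap() does with XOR. */
uint32_t sk_pgoff_to_id(uint64_t pgoff)
{
	assert(sk_pgoff_is_cb(pgoff));
	return (uint32_t)(pgoff ^ SK_MMAP_CB_MASK);
}

/* hl_cb_destroy() shifts back and truncates to 32 bits, dropping the mask. */
uint32_t sk_handle_to_id(uint64_t handle)
{
	return (uint32_t)(handle >> SK_PAGE_SHIFT);
}
```

Because the mask bit sits above bit 31 after the shift back, truncating to
u32 in the destroy path and XOR-ing in the mmap path recover the same ID.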

>
> > +     spin_unlock(&mgr->cb_lock);
> > +
> > +     if (rc < 0) {
> > +             dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
> > +             goto release_cb;
> > +     }
> > +
> > +     cb->id = rc;
> > +
> > +     kref_init(&cb->refcount);
> > +     spin_lock_init(&cb->lock);
> > +
> > +     /*
> > +      * idr is 32-bit so we can safely OR it with a mask that is above
> > +      * 32 bit
> > +      */
> > +     *handle = cb->id | HL_MMAP_CB_MASK;
> > +     *handle <<= PAGE_SHIFT;
> > +
> > +     return 0;
> > +
> > +release_cb:
> > +     cb_do_release(hdev, cb);
> > +out_err:
> > +     *handle = 0;
> > +
> > +     return rc;
> > +}
> > +
> > +int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle)
> > +{
> > +     struct hl_cb *cb;
> > +     u32 handle;
> > +     int rc = 0;
> > +
> > +     /*
> > +      * handle was given to user to do mmap, I need to shift it back to
> > +      * how the idr module gave it to me
> > +      */
> > +     cb_handle >>= PAGE_SHIFT;
> > +     handle = (u32) cb_handle;
> > +
> > +     spin_lock(&mgr->cb_lock);
> > +
> > +     cb = idr_find(&mgr->cb_handles, handle);
> > +     if (cb) {
> > +             idr_remove(&mgr->cb_handles, handle);
> > +             spin_unlock(&mgr->cb_lock);
> > +             kref_put(&cb->refcount, cb_release);
> > +     } else {
> > +             spin_unlock(&mgr->cb_lock);
> > +             dev_err(hdev->dev,
> > +                     "CB destroy failed, no match to handle 0x%x\n", handle);
> > +             rc = -EINVAL;
> > +     }
> > +
> > +     return rc;
> > +}
> > +
> > +int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
> > +{
> > +     union hl_cb_args *args = data;
> > +     struct hl_device *hdev = hpriv->hdev;
> > +     u64 handle;
> > +     int rc;
> > +
> > +     switch (args->in.op) {
> > +     case HL_CB_OP_CREATE:
> > +             rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
> > +                                     &handle, hpriv->ctx->asid);
> > +             memset(args, 0, sizeof(*args));
> > +             args->out.cb_handle = handle;
> > +             break;
> > +     case HL_CB_OP_DESTROY:
> > +             rc = hl_cb_destroy(hdev, &hpriv->cb_mgr,
> > +                                     args->in.cb_handle);
> > +             memset(args, 0, sizeof(*args));
> > +             break;
> > +     default:
> > +             rc = -EINVAL;
> > +             break;
> > +     }
> > +
> > +     return rc;
> > +}
> > +
> > +static void cb_vm_close(struct vm_area_struct *vma)
> > +{
> > +     struct hl_cb *cb = (struct hl_cb *) vma->vm_private_data;
> > +
> > +     hl_cb_put(cb);
> > +
> > +     spin_lock(&cb->lock);
> > +     cb->mmap = false;
> > +     cb->vm_start = 0;
> > +     cb->vm_end = 0;
> > +     spin_unlock(&cb->lock);
> > +
> > +     vma->vm_private_data = NULL;
> > +}
> > +
> > +static const struct vm_operations_struct cb_vm_ops = {
> > +     .close = cb_vm_close
> > +};
> > +
> > +int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
> > +{
> > +     struct hl_device *hdev = hpriv->hdev;
> > +     struct hl_cb *cb;
> > +     phys_addr_t address;
> > +     u32 handle;
> > +     int rc;
> > +
> > +     handle = vma->vm_pgoff;
> > +
> > +     /* reference was taken here */
> > +     cb = hl_cb_get(hdev, &hpriv->cb_mgr, handle);
> > +     if (!cb) {
> > +             dev_err(hdev->dev,
> > +                     "CB mmap failed, no match to handle %d\n", handle);
> > +             goto err_out;
>
> why not simply return -EINVAL?
>
fixed
> > +     }
> > +
> > +     /* Validation check */
> > +     if (vma->vm_end - vma->vm_start != cb->size) {
> > +             dev_err(hdev->dev,
> > +                     "CB mmap failed, mmap size 0x%lx != 0x%x cb size\n",
> > +                     vma->vm_end - vma->vm_start, cb->size);
> > +             goto put_cb;
> > +     }
> > +
> > +     spin_lock(&cb->lock);
> > +
> > +     if (cb->mmap) {
> > +             dev_err(hdev->dev,
> > +                     "CB mmap failed, CB already mmaped to user\n");
> > +             goto release_lock;
> > +     }
> > +
> > +     cb->mmap = true;
> > +
> > +     spin_unlock(&cb->lock);
> > +
> > +     vma->vm_ops = &cb_vm_ops;
> > +
> > +     /*
> > +      * Note: We're transferring the cb reference to
> > +      * vma->vm_private_data here.
> > +      */
> > +
> > +     vma->vm_private_data = cb;
> > +
> > +     /* Calculate address for CB */
> > +     address = virt_to_phys((void *) cb->kernel_address);
> > +
> > +     rc = hdev->asic_funcs->cb_mmap(hdev, vma, cb->kernel_address,
> > +                                     address, cb->size);
> > +
> > +     if (rc) {
> > +             spin_lock(&cb->lock);
> > +             cb->mmap = false;
> > +             goto release_lock;
> > +     }
> > +
> > +     cb->vm_start = vma->vm_start;
> > +     cb->vm_end = vma->vm_end;
> > +
> > +     return 0;
> > +
> > +release_lock:
> > +     spin_unlock(&cb->lock);
> > +put_cb:
> > +     hl_cb_put(cb);
> > +err_out:
> > +     return -EINVAL;
> > +}
> > +
> > +struct hl_cb *hl_cb_get(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> > +                     u32 handle)
> > +{
> > +     struct hl_cb *cb;
> > +
> > +     spin_lock(&mgr->cb_lock);
> > +     cb = idr_find(&mgr->cb_handles, handle);
> > +
> > +     if (!cb) {
> > +             spin_unlock(&mgr->cb_lock);
> > +             dev_warn(hdev->dev,
> > +                     "CB get failed, no match to handle %d\n", handle);
> > +             return NULL;
> > +     }
> > +
> > +     kref_get(&cb->refcount);
> > +
> > +     spin_unlock(&mgr->cb_lock);
> > +
> > +     return cb;
> > +
> > +}
> > +
> > +void hl_cb_put(struct hl_cb *cb)
> > +{
> > +     kref_put(&cb->refcount, cb_release);
> > +}
> > +
> > +void hl_cb_mgr_init(struct hl_cb_mgr *mgr)
> > +{
> > +     spin_lock_init(&mgr->cb_lock);
> > +     idr_init(&mgr->cb_handles);
> > +}
> > +
> > +void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr)
> > +{
> > +     struct hl_cb *cb;
> > +     struct idr *idp;
> > +     u32 id;
> > +
> > +     idp = &mgr->cb_handles;
> > +
> > +     idr_for_each_entry(idp, cb, id) {
> > +             if (kref_put(&cb->refcount, cb_release) != 1)
> > +                     dev_err(hdev->dev,
> > +                             "CB %d for CTX ID %d is still alive\n",
> > +                             id, cb->ctx_id);
> > +     }
> > +
> > +     idr_destroy(&mgr->cb_handles);
> > +}
> > +
> > +struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size)
> > +{
> > +     u64 cb_handle;
> > +     struct hl_cb *cb;
> > +     int rc;
> > +
> > +     rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr, cb_size, &cb_handle,
> > +                     HL_KERNEL_ASID_ID);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to allocate CB for KMD %d\n", rc);
> > +             return NULL;
> > +     }
> > +
> > +     cb_handle >>= PAGE_SHIFT;
> > +     cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr, (u32) cb_handle);
> > +     /* hl_cb_get should never fail here so use kernel WARN */
> > +     WARN(!cb, "Kernel CB handle invalid 0x%x\n", (u32) cb_handle);
> > +     if (!cb)
> > +             goto destroy_cb;
> > +
> > +     return cb;
> > +
> > +destroy_cb:
> > +     hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb_handle << PAGE_SHIFT);
> > +
> > +     return NULL;
> > +}
> > +
> > +int hl_cb_pool_init(struct hl_device *hdev)
> > +{
> > +     struct hl_cb *cb;
> > +     int i;
> > +
> > +     INIT_LIST_HEAD(&hdev->cb_pool);
> > +     spin_lock_init(&hdev->cb_pool_lock);
> > +
> > +     for (i = 0 ; i < hdev->asic_prop.cb_pool_cb_cnt ; i++) {
> > +             cb = hl_cb_alloc(hdev, hdev->asic_prop.cb_pool_cb_size,
> > +                             HL_KERNEL_ASID_ID);
> > +             if (cb) {
> > +                     cb->is_pool = true;
> > +                     list_add(&cb->pool_list, &hdev->cb_pool);
> > +             } else {
> > +                     hl_cb_pool_fini(hdev);
> > +                     return -ENOMEM;
> > +             }
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +int hl_cb_pool_fini(struct hl_device *hdev)
> > +{
> > +     struct hl_cb *cb, *tmp;
> > +
> > +     list_for_each_entry_safe(cb, tmp, &hdev->cb_pool, pool_list) {
> > +             list_del(&cb->pool_list);
> > +             cb_fini(hdev, cb);
> > +     }
> > +
> > +     return 0;
> > +}
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index 84ce9fcb52da..0bd86a7d34db 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -53,6 +53,7 @@ static int hl_device_release(struct inode *inode, struct file *filp)
> >  {
> >       struct hl_fpriv *hpriv = filp->private_data;
> >
> > +     hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
> >       hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> >
> >       filp->private_data = NULL;
> > @@ -62,10 +63,34 @@ static int hl_device_release(struct inode *inode, struct file *filp)
> >       return 0;
> >  }
> >
> > +/**
> > + * hl_mmap - mmap function for habanalabs device
> > + *
> > + * @*filp: pointer to file structure
> > + * @*vma: pointer to vm_area_struct of the process
> > + *
> > + * Called when process does an mmap on habanalabs device. Call the device's mmap
> > + * function at the end of the common code.
> > + */
> > +static int hl_mmap(struct file *filp, struct vm_area_struct *vma)
> > +{
> > +     struct hl_fpriv *hpriv = filp->private_data;
> > +
> > +     if ((vma->vm_pgoff & HL_MMAP_CB_MASK) == HL_MMAP_CB_MASK) {
> > +             vma->vm_pgoff ^= HL_MMAP_CB_MASK;
> > +             return hl_cb_mmap(hpriv, vma);
> > +     }
> > +
> > +     return hpriv->hdev->asic_funcs->mmap(hpriv, vma);
> > +}
> > +
> >  static const struct file_operations hl_ops = {
> >       .owner = THIS_MODULE,
> >       .open = hl_device_open,
> > -     .release = hl_device_release
> > +     .release = hl_device_release,
> > +     .mmap = hl_mmap,
> > +     .unlocked_ioctl = hl_ioctl,
> > +     .compat_ioctl = hl_ioctl
> >  };
> >
> >  /**
> > @@ -145,6 +170,8 @@ static int device_early_init(struct hl_device *hdev)
> >       if (rc)
> >               goto early_fini;
> >
> > +     hl_cb_mgr_init(&hdev->kernel_cb_mgr);
> > +
> >       mutex_init(&hdev->device_open);
> >       atomic_set(&hdev->fd_open_cnt, 0);
> >
> > @@ -166,6 +193,8 @@ static int device_early_init(struct hl_device *hdev)
> >  static void device_early_fini(struct hl_device *hdev)
> >  {
> >
> > +     hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
> > +
> >       hl_asid_fini(hdev);
> >
> >       if (hdev->asic_funcs->early_fini)
> > @@ -280,11 +309,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >               goto free_ctx;
> >       }
> >
> > +     rc = hl_cb_pool_init(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize CB pool\n");
> > +             goto release_ctx;
> > +     }
> > +
> >       dev_notice(hdev->dev,
> >               "Successfully added device to habanalabs driver\n");
> >
> >       return 0;
> >
> > +release_ctx:
> > +     if (hl_ctx_put(hdev->kernel_ctx) != 1)
> > +             dev_err(hdev->dev,
> > +                     "kernel ctx is still alive on initialization failure\n");
> >  free_ctx:
> >       kfree(hdev->kernel_ctx);
> >  sw_fini:
> > @@ -321,6 +360,8 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Mark device as disabled */
> >       hdev->disabled = true;
> >
> > +     hl_cb_pool_fini(hdev);
> > +
> >       /* Release kernel context */
> >       if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
> >               dev_err(hdev->dev, "kernel ctx is still alive\n");
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > index b2952296b890..341ac085af82 100644
> > --- a/drivers/misc/habanalabs/goya/goya.c
> > +++ b/drivers/misc/habanalabs/goya/goya.c
> > @@ -92,6 +92,9 @@
> >
> >  #define GOYA_MAX_INITIATORS          20
> >
> > +#define GOYA_CB_POOL_CB_CNT          512
> > +#define GOYA_CB_POOL_CB_SIZE         0x20000         /* 128KB */
> > +
> >  static void goya_get_fixed_properties(struct hl_device *hdev)
> >  {
> >       struct asic_fixed_properties *prop = &hdev->asic_prop;
> > @@ -119,6 +122,8 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
> >       prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> >
> >       prop->high_pll = PLL_HIGH_DEFAULT;
> > +     prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> > +     prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
> >  }
> >
> >  /**
> > @@ -598,6 +603,27 @@ int goya_resume(struct hl_device *hdev)
> >       return 0;
> >  }
> >
> > +int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
> > +{
> > +     return -EINVAL;
> > +}
> > +
> > +int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
> > +             u64 kaddress, phys_addr_t paddress, u32 size)
> > +{
> > +     int rc;
> > +
> > +     vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
> > +                     VM_DONTCOPY | VM_NORESERVE;
> > +
> > +     rc = remap_pfn_range(vma, vma->vm_start, paddress >> PAGE_SHIFT,
> > +                             size, vma->vm_page_prot);
> > +     if (rc)
> > +             dev_err(hdev->dev, "remap_pfn_range error %d", rc);
> > +
> > +     return rc;
> > +}
> > +
> >  void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
> >                                       dma_addr_t *dma_handle, gfp_t flags)
> >  {
> > @@ -617,6 +643,8 @@ static const struct hl_asic_funcs goya_funcs = {
> >       .sw_fini = goya_sw_fini,
> >       .suspend = goya_suspend,
> >       .resume = goya_resume,
> > +     .mmap = goya_mmap,
> > +     .cb_mmap = goya_cb_mmap,
> >       .dma_alloc_coherent = goya_dma_alloc_coherent,
> >       .dma_free_coherent = goya_dma_free_coherent,
> >  };
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index d003a6af2131..6ad476df65b0 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -21,10 +21,12 @@
> >
> >  #define HL_NAME                              "habanalabs"
> >
> > +#define HL_MMAP_CB_MASK                      (0x8000000000000000ull >> PAGE_SHIFT)
> > +
> >  #define HL_MAX_QUEUES                        128
> >
> >  struct hl_device;
> > -
> > +struct hl_fpriv;
> >
> >
> >
> > @@ -53,6 +55,8 @@ struct hl_device;
> >   * @max_asid: maximum number of open contexts (ASIDs).
> >   * @completion_queues_count: number of completion queues.
> >   * @high_pll: high PLL frequency used by the device.
> > + * @cb_pool_cb_cnt: number of CBs in the CB pool.
> > + * @cb_pool_cb_size: size of each CB in the CB pool.
> >   * @tpc_enabled_mask: which TPCs are enabled.
> >   */
> >  struct asic_fixed_properties {
> > @@ -73,11 +77,68 @@ struct asic_fixed_properties {
> >       u32                     sram_size;
> >       u32                     max_asid;
> >       u32                     high_pll;
> > +     u32                     cb_pool_cb_cnt;
> > +     u32                     cb_pool_cb_size;
> >       u8                      completion_queues_count;
> >       u8                      tpc_enabled_mask;
> >  };
> >
> >
> > +
> > +
> > +
> > +
> > +/*
> > + * Command Buffers
> > + */
> > +
> > +/**
> > + * struct hl_cb_mgr - describes a Command Buffer Manager.
> > + * @cb_lock: protects cb_handles.
> > + * @cb_handles: an idr to hold all command buffer handles.
> > + */
> > +struct hl_cb_mgr {
> > +     spinlock_t              cb_lock;
> > +     struct idr              cb_handles; /* protected by cb_lock */
> > +};
> > +
> > +/**
> > + * struct hl_cb - describes a Command Buffer.
> > + * @refcount: reference counter for usage of the CB.
> > + * @hdev: pointer to device this CB belongs to.
> > + * @lock: spinlock to protect mmap/cs flows.
> > + * @pool_list: node in pool list of command buffers.
> > + * @kernel_address: Holds the CB's kernel virtual address.
> > + * @bus_address: Holds the CB's DMA address.
> > + * @vm_start: Holds the CB's user start virtual address (when mmaped).
> > + * @vm_end: Holds the CB's user end virtual address (when mmaped).
> > + * @size: holds the CB's size.
> > + * @id: the CB's ID.
> > + * @ctx_id: holds the ID of the owner's context.
> > + * @mmap: true if the CB is currently mmaped to user.
> > + * @is_pool: true if CB was acquired from the pool, false otherwise.
> > + */
> > +struct hl_cb {
> > +     struct kref             refcount;
> > +     struct hl_device        *hdev;
> > +     spinlock_t              lock;
> > +     struct list_head        pool_list;
> > +     u64                     kernel_address;
> > +     dma_addr_t              bus_address;
> > +     u64                     vm_start;
> > +     u64                     vm_end;
> > +     u32                     size;
> > +     u32                     id;
> > +     u32                     ctx_id;
> > +     u8                      mmap;
> > +     u8                      is_pool;
> > +};
> > +
> > +
> > +
> > +
> > +
> > +
> >  #define HL_QUEUE_LENGTH                      256
> >
> >
> > @@ -109,6 +170,8 @@ enum hl_asic_type {
> >   * @sw_fini: tears down driver state, does not configure H/W.
> >   * @suspend: handles IP specific H/W or SW changes for suspend.
> >   * @resume: handles IP specific H/W or SW changes for resume.
> > + * @mmap: mmap function, does nothing.
> > + * @cb_mmap: maps a CB.
> >   * @dma_alloc_coherent: DMA allocate coherent memory.
> >   * @dma_free_coherent: free DMA allocation.
> >   */
> > @@ -119,6 +182,9 @@ struct hl_asic_funcs {
> >       int (*sw_fini)(struct hl_device *hdev);
> >       int (*suspend)(struct hl_device *hdev);
> >       int (*resume)(struct hl_device *hdev);
> > +     int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> > +     int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
> > +                     u64 kaddress, phys_addr_t paddress, u32 size);
> >       void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
> >                                       dma_addr_t *dma_handle, gfp_t flag);
> >       void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
> > @@ -175,6 +241,7 @@ struct hl_ctx_mgr {
> >   * @taskpid: current process ID.
> >   * @ctx: current executing context.
> >   * @ctx_mgr: context manager to handle multiple context for this FD.
> > + * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
> >   * @refcount: number of related contexts.
> >   */
> >  struct hl_fpriv {
> > @@ -183,6 +250,7 @@ struct hl_fpriv {
> >       struct pid              *taskpid;
> >       struct hl_ctx           *ctx; /* TODO: remove for multiple ctx */
> >       struct hl_ctx_mgr       ctx_mgr;
> > +     struct hl_cb_mgr        cb_mgr;
> >       struct kref             refcount;
> >  };
> >
> > @@ -239,6 +307,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @asic_name: ASIC specific nmae.
> >   * @asic_type: ASIC specific type.
> >   * @kernel_ctx: KMD context structure.
> > + * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CBs.
> >   * @dma_pool: DMA pool for small allocations.
> >   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> >   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> > @@ -249,6 +318,8 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @asic_prop: ASIC specific immutable properties.
> >   * @asic_funcs: ASIC specific functions.
> >   * @asic_specific: ASIC specific information to use only from ASIC files.
> > + * @cb_pool: list of preallocated CBs.
> > + * @cb_pool_lock: protects the CB pool.
> >   * @user_ctx: current user context executing.
> >   * @fd_open_cnt: number of open context executing.
> >   * @major: habanalabs KMD major.
> > @@ -264,6 +335,7 @@ struct hl_device {
> >       char                            asic_name[16];
> >       enum hl_asic_type               asic_type;
> >       struct hl_ctx                   *kernel_ctx;
> > +     struct hl_cb_mgr                kernel_cb_mgr;
> >       struct dma_pool                 *dma_pool;
> >       void                            *cpu_accessible_dma_mem;
> >       dma_addr_t                      cpu_accessible_dma_address;
> > @@ -275,6 +347,10 @@ struct hl_device {
> >       struct asic_fixed_properties    asic_prop;
> >       const struct hl_asic_funcs      *asic_funcs;
> >       void                            *asic_specific;
> > +
> > +     struct list_head                cb_pool;
> > +     spinlock_t                      cb_pool_lock;
> > +
> >       /* TODO: The following fields should be moved for multi-context */
> >       struct hl_ctx                   *user_ctx;
> >       atomic_t                        fd_open_cnt;
> > @@ -345,6 +421,23 @@ int hl_device_resume(struct hl_device *hdev);
> >  void hl_hpriv_get(struct hl_fpriv *hpriv);
> >  void hl_hpriv_put(struct hl_fpriv *hpriv);
> >
> > +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
> > +             u64 *handle, int ctx_id);
> > +int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle);
> > +int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> > +struct hl_cb *hl_cb_get(struct hl_device *hdev,      struct hl_cb_mgr *mgr,
> > +                     u32 handle);
> > +void hl_cb_put(struct hl_cb *cb);
> > +void hl_cb_mgr_init(struct hl_cb_mgr *mgr);
> > +void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr);
> > +struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size);
> > +int hl_cb_pool_init(struct hl_device *hdev);
> > +int hl_cb_pool_fini(struct hl_device *hdev);
> > +
> >  void goya_set_asic_funcs(struct hl_device *hdev);
> >
> > +/* IOCTLs */
> > +long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
> > +int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
> > +
> >  #endif /* HABANALABSP_H_ */
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > index 0646da83eb53..5c312dd3aa50 100644
> > --- a/drivers/misc/habanalabs/habanalabs_drv.c
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -123,6 +123,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
> >       kref_init(&hpriv->refcount);
> >       nonseekable_open(inode, filp);
> >
> > +     hl_cb_mgr_init(&hpriv->cb_mgr);
> >       hl_ctx_mgr_init(&hpriv->ctx_mgr);
> >
> >       rc = hl_ctx_create(hdev, hpriv);
> > @@ -138,6 +139,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
> >  out_err:
> >       filp->private_data = NULL;
> >       hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
> > +     hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
> >       kfree(hpriv);
> >
> >  close_device:
> > diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
> > new file mode 100644
> > index 000000000000..fa2287569e0e
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include <uapi/misc/habanalabs.h>
> > +#include "habanalabs.h"
> > +
> > +#include <linux/fs.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/cred.h>
> > +
> > +#define HL_IOCTL_DEF(ioctl, _func) \
> > +     [_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
> > +
> > +static const struct hl_ioctl_desc hl_ioctls[] = {
> > +     HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl)
> > +};
> > +
> > +#define HL_CORE_IOCTL_COUNT  ARRAY_SIZE(hl_ioctls)
> > +
> > +long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> > +{
> > +     struct hl_fpriv *hpriv = filep->private_data;
> > +     struct hl_device *hdev = hpriv->hdev;
> > +     hl_ioctl_t *func;
> > +     const struct hl_ioctl_desc *ioctl = NULL;
> > +     unsigned int nr = _IOC_NR(cmd);
> > +     char stack_kdata[128];
> > +     char *kdata = NULL;
> > +     unsigned int usize, asize;
> > +     int retcode = -EINVAL;
> > +
> > +     if (nr >= HL_CORE_IOCTL_COUNT)
>
>         nr > HL_CORE_IOCTL_COUNT, isn't it?
>
> > +             goto err_i1;
>
> err_i1 is not very meaningful. Maybe invalid_ioctl?
Changed to out_err, as this label is used from other places in this function.
>
> > +
> > +     if ((nr >= HL_COMMAND_START) && (nr < HL_COMMAND_END)) {
>
> The HL_COMMAND_{START,END} do not seem to be defined.
They are defined in uapi/misc/habanalabs.h.

> Besides, this check seems to overlap with
>
>         if (nr > HL_CORE_IOCTL_COUNT)

Correct, removed the first if (the HL_CORE_IOCTL_COUNT check).
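
For reference, a minimal userspace sketch of the remaining range check in
front of a sparse dispatch table. The SK_*/sk_* names are hypothetical; the
table mirrors hl_ioctls[], which has a single entry at index 0x02 and
therefore three slots. Note that for a table of COUNT slots the valid
indices are 0..COUNT-1, so a bound written against COUNT has to use >=
rather than >.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_COMMAND_START	0x02
#define SK_COMMAND_END		0x03	/* exclusive upper bound */

typedef int (*sk_ioctl_fn)(void *data);

/* Hypothetical handler standing in for hl_cb_ioctl(). */
static int sk_cb_ioctl(void *data)
{
	(void)data;
	return 0;
}

/* Sparse table indexed by command number, like hl_ioctls[] in the patch. */
static const sk_ioctl_fn sk_ioctls[] = {
	[SK_COMMAND_START] = sk_cb_ioctl,
};

#define SK_IOCTL_COUNT	(sizeof(sk_ioctls) / sizeof(sk_ioctls[0]))

/* Validate nr against [START, END) before indexing the table. */
int sk_dispatch(unsigned int nr, void *data)
{
	if (nr < SK_COMMAND_START || nr >= SK_COMMAND_END)
		return -EINVAL;

	/* The range check above already implies nr is a valid index. */
	assert(nr < SK_IOCTL_COUNT);

	if (!sk_ioctls[nr])
		return -EINVAL;

	return sk_ioctls[nr](data);
}
```

As long as SK_COMMAND_END never exceeds the table size, the START/END test
alone is enough to keep the index in bounds.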
>
> > +             u32 hl_size;
> > +
> > +             ioctl = &hl_ioctls[nr];
> > +
> > +             hl_size = _IOC_SIZE(ioctl->cmd);
> > +             usize = asize = _IOC_SIZE(cmd);
> > +             if (hl_size > asize)
> > +                     asize = hl_size;
> > +
> > +             cmd = ioctl->cmd;
> > +     } else {
> > +             goto err_i1;
> > +     }
> > +
> > +     /* Do not trust userspace, use our own definition */
> > +     func = ioctl->func;
> > +
> > +     if (unlikely(!func)) {
> > +             dev_dbg(hdev->dev, "no function\n");
> > +             retcode = -EINVAL;
> > +             goto err_i1;
> > +     }
> > +
> > +     if (cmd & (IOC_IN | IOC_OUT)) {
> > +             if (asize <= sizeof(stack_kdata)) {
> > +                     kdata = stack_kdata;
> > +             } else {
> > +                     kdata = kmalloc(asize, GFP_KERNEL);
> > +                     if (!kdata) {
> > +                             retcode = -ENOMEM;
> > +                             goto err_i1;
> > +                     }
> > +             }
> > +             if (asize > usize)
> > +                     memset(kdata + usize, 0, asize - usize);
>
> Just init stack_kdata to 0 and use kzalloc instead of kmalloc.
fixed
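
A minimal userspace sketch of what that change amounts to, assuming the
stack buffer is zero-initialized at its declaration and calloc() stands in
for kzalloc(). The sk_* names are hypothetical. With both sources already
zeroed, the separate memset() of the tail is no longer needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_STACK_KDATA_SIZE	128	/* same size as stack_kdata in the patch */

/*
 * Pick the buffer for an argument of asize bytes. stack_buf must have been
 * zero-initialized by the caller; larger requests get zeroed heap memory.
 * Returns NULL if the heap allocation fails.
 */
char *sk_kdata_get(char *stack_buf, size_t asize)
{
	assert(stack_buf);
	if (asize <= SK_STACK_KDATA_SIZE)
		return stack_buf;
	return calloc(1, asize);	/* userspace stand-in for kzalloc() */
}

/* Free the buffer only if it came from the heap. */
void sk_kdata_put(char *kdata, char *stack_buf)
{
	if (kdata != stack_buf)
		free(kdata);
}
```

The caller would declare the stack buffer as
char stack_kdata[SK_STACK_KDATA_SIZE] = {0}; and pass it to both helpers.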
>
> > +     }
> > +
> > +     if (cmd & IOC_IN) {
> > +             if (copy_from_user(kdata, (void __user *)arg, usize)) {
> > +                     retcode = -EFAULT;
> > +                     goto err_i1;
> > +             }
> > +     } else if (cmd & IOC_OUT) {
> > +             memset(kdata, 0, usize);
> > +     }
> > +
> > +     retcode = func(hpriv, kdata);
> > +
> > +     if (cmd & IOC_OUT)
> > +             if (copy_to_user((void __user *)arg, kdata, usize))
> > +                     retcode = -EFAULT;
> > +
> > +err_i1:
> > +     if (!ioctl)
> > +             dev_dbg(hdev->dev,
> > +                     "invalid ioctl: pid=%d, cmd=0x%02x, nr=0x%02x\n",
> > +                       task_pid_nr(current), cmd, nr);
>
> I think this can move right after the 'nr' sanity check and there you can
> simply return -EINVAL after dev_dbg().
But this label is reached from many different places, and I want to print
the ioctl information in each case.
I think the real mistake here is if (!ioctl). It should actually be
if (ioctl), because only then is the print relevant for most of the places
in this function that hit an error.
I'll also add a dedicated print for the case where nr is not correct.
>
> > +
> > +     if (kdata != stack_kdata)
> > +             kfree(kdata);
> > +
> > +     return retcode;
> > +}
> > diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
> > new file mode 100644
> > index 000000000000..b3f9213d4709
> > --- /dev/null
> > +++ b/include/uapi/misc/habanalabs.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
> > + *
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> > + *
> > + */
> > +
> > +#ifndef HABANALABS_H_
> > +#define HABANALABS_H_
> > +
> > +#include <linux/types.h>
> > +#include <linux/ioctl.h>
> > +
> > +/* Opcode to create a new command buffer */
> > +#define HL_CB_OP_CREATE              0
> > +/* Opcode to destroy previously created command buffer */
> > +#define HL_CB_OP_DESTROY     1
> > +
> > +struct hl_cb_in {
> > +     /* Handle of CB or 0 if we want to create one */
> > +     __u64 cb_handle;
> > +     /* HL_CB_OP_* */
> > +     __u32 op;
> > +     /* Size of CB. Minimum requested size must be PAGE_SIZE */
> > +     __u32 cb_size;
> > +     /* Context ID - Currently not in use */
> > +     __u32 ctx_id;
> > +     __u32 pad;
> > +};
> > +
> > +struct hl_cb_out {
> > +     /* Handle of CB */
> > +     __u64 cb_handle;
> > +};
> > +
> > +union hl_cb_args {
> > +     struct hl_cb_in in;
> > +     struct hl_cb_out out;
> > +};
> > +
> > +/*
> > + * Command Buffer
> > + * - Request a Command Buffer
> > + * - Destroy a Command Buffer
> > + *
> > + * The command buffers are memory blocks that reside in DMA-able address
> > + * space and are physically contiguous so they can be accessed by the device
> > + * directly. They are allocated using the coherent DMA API.
> > + *
> > + * When creating a new CB, the IOCTL returns a handle of it, and the user-space
> > + * process needs to use that handle to mmap the buffer so it can access them.
> > + *
> > + */
> > +#define HL_IOCTL_CB          \
> > +             _IOWR('H', 0x02, union hl_cb_args)
> > +
> > +#define HL_COMMAND_START     0x02
> > +#define HL_COMMAND_END               0x03
> > +
> > +#endif /* HABANALABS_H_ */
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>

* Re: [PATCH 1/5] drivers/accel: Introduce subsystem
  2019-01-25 18:16             ` [PATCH 1/5] drivers/accel: Introduce subsystem Olof Johansson
  2019-01-25 21:13               ` [PATCH v2 " Olof Johansson
@ 2019-01-25 22:23               ` Daniel Vetter
  2019-01-27 16:31                 ` Daniel Vetter
  1 sibling, 1 reply; 103+ messages in thread
From: Daniel Vetter @ 2019-01-25 22:23 UTC (permalink / raw)
  To: Olof Johansson
  Cc: linux-kernel, linux-accelerators, Greg Kroah-Hartman,
	Frederic Barrat, Andrew Donnellan, ogabbay, airlied, jglisse

On Fri, Jan 25, 2019 at 10:16:12AM -0800, Olof Johansson wrote:
> We're starting to see more of these kind of devices, the current
> upcoming wave will likely be around machine learning and inference
> engines. A few drivers have been added to drivers/misc for this, but
> it's timely to make it into a separate group of drivers/subsystem, to
> make it easier to find them, and to encourage collaboration between
> contributors.
> 
> Over time, we expect to build shared frameworks that the drivers will
> make use of, but how that framework needs to look like to fill the needs
> is still unclear, and the best way to gain that knowledge is to give the
> disparate implementations a shared location.
> 
> There has been some controversy around expectations for userspace
> stacks being open. The clear preference is to see that happen, and any
> driver and platform stack that is delivered like that will be given
> preferential treatment, and at some point in the future it might
> become the requirement. Until then, the bare minimum we need is an
> open low-level userspace such that the driver and HW interfaces can be
> exercised if someone is modifying the driver, even if the full details
> of the workload are not always available.
> 
> Bootstrapping this with myself and Greg as maintainers (since the current
> drivers will be moving out of drivers/misc). Looking forward to expanding
> that group over time.
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Olof Johansson <olof@lixom.net>

I spent a bit of time reading the proposed drivers, mostly just their uapi
(habanalabs and ocxl&cxl), and I think there's really no technical
difference between an acceleration driver sitting in drivers/gpu and an
acceleration driver sitting in drivers/accel. Except:

- drivers/gpu already has common interfaces for the things you'll probably
  want to standardize (buffer sharing, synchronization primitives,
  scheduler - right now we're working on figuring out some common
  tracepoints).

- Maybe even more important, drivers/gpu has the lessons learned in its
  codebase about what not to standardize between drivers (everything else,
  you'll regret it, we've been there).

- drivers/gpu is the subsystem with 20 years of experience writing tiny
  shim drivers in the kernel for high performance accelerators that need a
  pretty huge stack in userspace to make them do anything useful. 20 years
  ago the thing everyone was racing to make faster was graphics, now it's
  AI. Looks exactly the same from a kernel pov - command buffers, gigabytes
  of DMA and a security/long term support nightmare.

- drivers/gpu requires open source. The real thing, not some demo that
  does a few DMA operations.

And now we have drivers/accel and someone gets to explain to nvidia (or
arm or whatever) how their exact same drivers (and well run engineering
orgs really only invent command submission once) can be merged when they
say it's for a TPU, and will get rejected when they say it's for a GPU. Or
someone gets to explain to TPU+GPU vendors why their driver is not cool
(because we'd end up with two), while their startup-competition only doing
a TPU is totally fine and merged into upstream. Or we just stuff all the
kernel drivers for blobby userspace into drivers/accel and otherwise
ignore each other.

I guess that last option would at least somewhat help me, since I won't
ever have to explain anymore why we're the radical commies on dri-devel
:-)

Anyway, only reason I replied here again is because I accidentally started
a private thread (well was too lazy to download the mbox to properly
reply), and that's not good either. But I don't think anyone's going to
change their opinion here, I think this reply is just for the record.

Cheers, Daniel

PS: Seen that there's a v2 of this now with Documentation, hasn't reached
my inbox (yet). I don't think that one clarifies any of the tricky
questions between drivers/gpu and drivers/accel, so figured won't harm if
I leave the reply on v1.


> ---
>  MAINTAINERS            |  8 ++++++++
>  drivers/Kconfig        |  2 ++
>  drivers/Makefile       |  1 +
>  drivers/accel/Kconfig  | 16 ++++++++++++++++
>  drivers/accel/Makefile |  5 +++++
>  5 files changed, 32 insertions(+)
>  create mode 100644 drivers/accel/Kconfig
>  create mode 100644 drivers/accel/Makefile
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ddcdc29dfe1f6..8a9bbaf8f6e90 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7033,6 +7033,14 @@ W:	https://linuxtv.org
>  S:	Supported
>  F:	drivers/media/platform/sti/hva
>  
> +HW ACCELERATOR OFFLOAD SUBSYSTEM
> +M:	Olof Johansson <olof@lixom.net>
> +M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> +L:	linux-accelerators@lists.ozlabs.org
> +S:	Supported
> +F:	drivers/accel/
> +F:	Documentation/accelerators/
> +
>  HWPOISON MEMORY FAILURE HANDLING
>  M:	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>  L:	linux-mm@kvack.org
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 4f9f99057ff85..3cc461f325569 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -228,4 +228,6 @@ source "drivers/siox/Kconfig"
>  
>  source "drivers/slimbus/Kconfig"
>  
> +source "drivers/accel/Kconfig"
> +
>  endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 04da7876032cc..e4be06579cc5d 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -186,3 +186,4 @@ obj-$(CONFIG_MULTIPLEXER)	+= mux/
>  obj-$(CONFIG_UNISYS_VISORBUS)	+= visorbus/
>  obj-$(CONFIG_SIOX)		+= siox/
>  obj-$(CONFIG_GNSS)		+= gnss/
> +obj-$(CONFIG_ACCEL)		+= accel/
> diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
> new file mode 100644
> index 0000000000000..13b36c0398895
> --- /dev/null
> +++ b/drivers/accel/Kconfig
> @@ -0,0 +1,16 @@
> +#
> +# Drivers for hardware offload accelerators
> +# See Documentation/accel/README.rst for more details
> +#
> +
> +menuconfig ACCEL
> +	bool "Hardware offload accelerator support"
> +        help
> +	  HW offload accelerators are used for high-bandwidth workloads
> +	  where a higher-level kernel/userspace interface isn't suitable.
> +
> +if ACCEL
> +
> +comment "HW Accellerator drivers"
> +
> +endif
> diff --git a/drivers/accel/Makefile b/drivers/accel/Makefile
> new file mode 100644
> index 0000000000000..343bbb8f45a14
> --- /dev/null
> +++ b/drivers/accel/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for accel devices
> +#
> +
> -- 
> 2.11.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 5/5] drivers/accel: ocxl: Move non-uapi include files
  2019-01-25 18:16             ` [PATCH 5/5] drivers/accel: ocxl: Move non-uapi include files Olof Johansson
@ 2019-01-26 13:51               ` Greg Kroah-Hartman
  0 siblings, 0 replies; 103+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-26 13:51 UTC (permalink / raw)
  To: Olof Johansson
  Cc: linux-kernel, linux-accelerators, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse, Arnd Bergmann

On Fri, Jan 25, 2019 at 10:16:16AM -0800, Olof Johansson wrote:
> Separate to expose the edits vs pure moves.

Ugh, putting the uapi files in misc/ was a mistake, sorry I never caught
that before, my fault :(

greg k-h


* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-25 17:12         ` Olof Johansson
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
@ 2019-01-26 13:52           ` Greg Kroah-Hartman
  1 sibling, 0 replies; 103+ messages in thread
From: Greg Kroah-Hartman @ 2019-01-26 13:52 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Dave Airlie, Oded Gabbay, Jerome Glisse, LKML, ogabbay

On Fri, Jan 25, 2019 at 09:12:49AM -0800, Olof Johansson wrote:
> On Fri, Jan 25, 2019 at 8:06 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Fri, Jan 25, 2019 at 07:33:23AM -0800, Olof Johansson wrote:
> > > On Thu, Jan 24, 2019 at 11:37 PM Greg Kroah-Hartman
> > > > As for what directory the code should live in, I suggested "misc" as
> > > > there was no other universal location, and I hate to see new subsystems
> > > > be created with only one driver, as that's pretty sad.  But it's just a
> > > > name/location, I have no dog in the fight, so I really don't care where
> > > > it ends up in the tree, just as long as it gets merged somewhere :)
> > >
> > > I'm usually one to push back against new subsystems too, especially
> > > when I see a framework proposal with just one driver. In this case,
> > > given that we all know more vendors will come along, I think it makes
> > > sense to take the discussion and establish structure now. This should
> > > give some clarity to those who are out there that we haven't seen yet,
> > > and give them a chance to prepare for things such as the low-level
> > > userspace pieces mentioned above.
> > >
> > > So I think setting this up now is the right thing to do, we know there
> > > will be more material here and having a common aggregation of it makes
> > > sense.
> >
> > Ok, how about:
> >         drivers/deep_thought/
> >
> > as a first proposal.
> >
> > Let the bikeshedding begin!  :)
> 
> My original proposal upthread was driver/accel. I'm not sure whether
> Dave and/or anyone else wants to participate to start though, I hope
> they will at least join in once things are heading in the direction
> they want.
> 
> I'll post patches with the proposed moves and documentation of
> expectations shortly, hopefully to collect acks.

Patches all look good to me, I can merge them through my char-misc tree
once we get the acks from the relevant driver maintainers.

thanks,

greg k-h


* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
  2019-01-23  0:49   ` Joe Perches
  2019-01-23 12:28   ` Mike Rapoport
@ 2019-01-26 16:05   ` Arnd Bergmann
  2019-01-26 16:24     ` Oded Gabbay
  2 siblings, 1 reply; 103+ messages in thread
From: Arnd Bergmann @ 2019-01-26 16:05 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, Linux Kernel Mailing List, ogabbay

On Wed, Jan 23, 2019 at 1:01 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:

> diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> new file mode 100644
> index 000000000000..9dbb7077eabd
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h

Since this is apparently a user space ABI, the file should be in
include/uapi/linux/, not in the driver directory.

> +/* must be aligned to 4 bytes */
> +struct armcp_info {
> +       struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
> +       __u8 kernel_version[VERSION_MAX_LEN];
> +       __u32 reserved[3];
> +       __u32 cpld_version;
> +       __u32 infineon_version;
> +       __u8 fuse_version[VERSION_MAX_LEN];
> +       __u8 thermal_version[VERSION_MAX_LEN];
> +       __u8 armcp_version[VERSION_MAX_LEN];
> +       __u64 dram_size;
> +};

The compiler will align this to 8 bytes on most architectures, and
add another padding field before dram_size. Better remove the
'reserved' fields, or make them an even number.
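
A minimal sketch of the effect, using a hypothetical cut-down struct
(demo_info and demo_hidden_padding are illustrative names, not part of the
driver, and the sensor and version arrays are left out): with an odd number
of 32-bit words ahead of a 64-bit member, an ABI that aligns 64-bit types to
8 bytes leaves a 4-byte hole in front of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layout: five 32-bit words, then a 64-bit field. */
struct demo_info {
	uint32_t reserved[3];
	uint32_t cpld_version;
	uint32_t infineon_version;
	uint64_t dram_size;
};

/* The 32-bit members are packed back to back on every common ABI. */
static_assert(offsetof(struct demo_info, infineon_version) == 16,
	      "32-bit members are expected to be contiguous");

/* Bytes the compiler silently inserts in front of dram_size. */
static size_t demo_hidden_padding(void)
{
	size_t end_of_words = offsetof(struct demo_info, infineon_version) +
			      sizeof(uint32_t);

	return offsetof(struct demo_info, dram_size) - end_of_words;
}
```

Where 64-bit types are 8-byte aligned this helper is expected to report 4,
and 0 on ABIs that only require 4-byte alignment, which is exactly why the
implicit layout is not something to depend on.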

       Arnd


* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-26 16:05   ` Arnd Bergmann
@ 2019-01-26 16:24     ` Oded Gabbay
  2019-01-26 21:14       ` Arnd Bergmann
  0 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-26 16:24 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: gregkh, Linux Kernel Mailing List, ogabbay

On Sat, Jan 26, 2019 at 6:06 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Wed, Jan 23, 2019 at 1:01 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > new file mode 100644
> > index 000000000000..9dbb7077eabd
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
>
> Since this is a apparently a user space ABI, the file should be in
> include/uapi/linux/,
> not in the driver directory.

This is not a user space ABI. This is the ABI between the driver and the F/W.

>
> > +/* must be aligned to 4 bytes */
> > +struct armcp_info {
> > +       struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
> > +       __u8 kernel_version[VERSION_MAX_LEN];
> > +       __u32 reserved[3];
> > +       __u32 cpld_version;
> > +       __u32 infineon_version;
> > +       __u8 fuse_version[VERSION_MAX_LEN];
> > +       __u8 thermal_version[VERSION_MAX_LEN];
> > +       __u8 armcp_version[VERSION_MAX_LEN];
> > +       __u64 dram_size;
> > +};
>
> The compiler will align this to 8 bytes on most architectures, and
> add another padding field before dram_size. Better remove the
> 'reserved' fields, or make them an even number.

I can't do that, because those fields were once used by the F/W, and if
I change the order here, or add/remove those fields, it will
break compatibility with old F/W.

Thanks,
Oded

>
>        Arnd


* Re: [PATCH v2 1/5] drivers/accel: Introduce subsystem
  2019-01-25 21:13               ` [PATCH v2 " Olof Johansson
@ 2019-01-26 17:09                 ` Randy Dunlap
  2019-01-27  4:31                 ` Andrew Donnellan
  1 sibling, 0 replies; 103+ messages in thread
From: Randy Dunlap @ 2019-01-26 17:09 UTC (permalink / raw)
  To: Olof Johansson, linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat,
	Andrew Donnellan, ogabbay, airlied, jglisse

Hi,

Please see a few corrections inline...

On 1/25/19 1:13 PM, Olof Johansson wrote:
> 
>  Documentation/accelerators/README.rst | 42 +++++++++++++++++++++++++++++++++++
>  MAINTAINERS                           |  8 +++++++
>  drivers/Kconfig                       |  2 ++
>  drivers/Makefile                      |  1 +
>  drivers/accel/Kconfig                 | 16 +++++++++++++
>  drivers/accel/Makefile                |  5 +++++
>  6 files changed, 74 insertions(+)
>  create mode 100644 Documentation/accelerators/README.rst
>  create mode 100644 drivers/accel/Kconfig
>  create mode 100644 drivers/accel/Makefile
> 
> diff --git a/Documentation/accelerators/README.rst b/Documentation/accelerators/README.rst
> new file mode 100644
> index 0000000000000..79049ff99e93e
> --- /dev/null
> +++ b/Documentation/accelerators/README.rst
> @@ -0,0 +1,42 @@
> +.. _readme:
> +
> +Hardware offload accelerator subsystem
> +======================================
> +
> +This is a brief overview of the subsystem (grouping) of hardware
> +accelerators kept under drivers/accel
> +
> +Types of hardware supported
> +---------------------------
> +
> +  The general types of hardware supported are hardware devices that has

                                                                  that have

> +  general interactions of sending commands and buffers to the hardware,
> +  returning completions and possible filled buffers back, together
> +  with the usual driver pieces around hardware control, setup, error
> +  handling, etc.
> +
> +  Drivers that fit into other subsystems are expected to be merged
> +  there, and use the appropriate userspace interfaces of said functional
> +  areas. We don't expect to see drivers for network, storage, graphics
> +  and similar hardware implemented by drivers here.
> +
> +Expectations for contributions
> +------------------------------
> +
> + - Platforms and hardware that has fully open stacks, from Firmware to

                             that have

> +   Userspace, are always going to be given preferential treatment. These
> +   platforms give the best insight for behavior and interaction of all
> +   layers, including ability to improve implementation across the stack
> +   over time.
> +
> + - If a platform is partially proprietary, it is still expected that the
> +   portions that interact the driver can be shared in a form that allows

               that interact with the driver

> +   for exercising the hardware/driver and evolution of the interface over
> +   time. This could be separated into a shared library and test/sample
> +   programs, for example.
> +
> + - Over time, there is an expectation to converge drivers over to shared
> +   frameworks and interfaces. Until then, the general rule is that no
> +   more than one driver per vendor will be acceptable. For vendors that
> +   aren't participating in the work towards shared frameworks over time,
> +   we reserve the right to phase out support for the hardware.

> diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
> new file mode 100644
> index 0000000000000..13b36c0398895
> --- /dev/null
> +++ b/drivers/accel/Kconfig
> @@ -0,0 +1,16 @@
> +#
> +# Drivers for hardware offload accelerators
> +# See Documentation/accel/README.rst for more details
> +#
> +
> +menuconfig ACCEL
> +	bool "Hardware offload accelerator support"
> +        help

Use tab instead of spaces above.

> +	  HW offload accelerators are used for high-bandwidth workloads
> +	  where a higher-level kernel/userspace interface isn't suitable.
> +
> +if ACCEL
> +
> +comment "HW Accellerator drivers"
> +
> +endif


-- 
~Randy


* Re: [PATCH/RFC 0/5] HW accel subsystem
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
                               ` (4 preceding siblings ...)
  2019-01-25 18:16             ` [PATCH 5/5] drivers/accel: ocxl: Move non-uapi include files Olof Johansson
@ 2019-01-26 21:11             ` Arnd Bergmann
  2019-02-01  9:10             ` Kenneth Lee
  6 siblings, 0 replies; 103+ messages in thread
From: Arnd Bergmann @ 2019-01-26 21:11 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Linux Kernel Mailing List, linux-accelerators,
	Greg Kroah-Hartman, Frederic Barrat, Andrew Donnellan, ogabbay,
	Dave Airlie, Jérôme Glisse

On Fri, Jan 25, 2019 at 7:17 PM Olof Johansson <olof@lixom.net> wrote:
>
> Per discussion in on the Habana Labs driver submission
> (https://lore.kernel.org/lkml/20190123000057.31477-1-oded.gabbay@gmail.com/),
> there seems to be time to create a separate subsystem for hw accellerators
> instead of letting them proliferate around the tree (and/or in misc).
>
> There's difference in opinion on how stringent the requirements are for
> a fully open stack for these kind of drivers. I've documented the middle
> road approach in the first patch (requiring some sort of open low-level
> userspace for the kernel interaction, and a way to use/test it).
>
> Comments and suggestions for better approaches are definitely welcome.

We probably want to move drivers/misc/mic together with the others
as well. We could even move arch/powerpc/platforms/cell/spu*/
for another historic example.

        Arnd


* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-26 16:24     ` Oded Gabbay
@ 2019-01-26 21:14       ` Arnd Bergmann
  2019-01-26 21:48         ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Arnd Bergmann @ 2019-01-26 21:14 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, Linux Kernel Mailing List, ogabbay

On Sat, Jan 26, 2019 at 5:25 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Sat, Jan 26, 2019 at 6:06 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Wed, Jan 23, 2019 at 1:01 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > > new file mode 100644
> > > index 000000000000..9dbb7077eabd
> > > --- /dev/null
> > > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> >
> > Since this is a apparently a user space ABI, the file should be in
> > include/uapi/linux/,
> > not in the driver directory.
>
> This is not a user space ABI. This is the ABI between the driver and the F/W.

Ah, I see. In that case, you should get rid of all the bitfields and make the
struct members all __le32/__le64/... to make it work on big-endian kernels.
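
As a rough userspace illustration of the idea (in the kernel this would be
__le32 members read through le32_to_cpu(); demo_le32_to_host and demo_field
below are hypothetical helper names), the wire value is decoded byte by byte
and sub-fields are taken with shift and mask rather than C bitfields, so the
result does not depend on host endianness or on the compiler's bitfield
layout:

```c
#include <stdint.h>

/* Decode a 32-bit little-endian wire value, whatever the host byte order. */
static uint32_t demo_le32_to_host(const uint8_t b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}

/* Extract bits [hi:lo] of a word with shift and mask (0 <= lo <= hi <= 31). */
static uint32_t demo_field(uint32_t word, unsigned int hi, unsigned int lo)
{
	unsigned int width = hi - lo + 1;
	uint32_t mask = width >= 32 ? 0xffffffffu : ((1u << width) - 1u);

	return (word >> lo) & mask;
}
```

For example, the bytes 78 56 34 12 decode to 0x12345678 on any host, and
demo_field(0x12345678, 15, 8) yields 0x56.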

> >
> > > +/* must be aligned to 4 bytes */
> > > +struct armcp_info {
> > > +       struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
> > > +       __u8 kernel_version[VERSION_MAX_LEN];
> > > +       __u32 reserved[3];
> > > +       __u32 cpld_version;
> > > +       __u32 infineon_version;
> > > +       __u8 fuse_version[VERSION_MAX_LEN];
> > > +       __u8 thermal_version[VERSION_MAX_LEN];
> > > +       __u8 armcp_version[VERSION_MAX_LEN];
> > > +       __u64 dram_size;
> > > +};
> >
> > The compiler will align this to 8 bytes on most architectures, and
> > add another padding field before dram_size. Better remove the
> > 'reserved' fields, or make them an even number.
> I can't do that, because those fields were once used by the F/W and if
> I will change the order here, or add/remove those fields then it will
> break compatibility with old F/W.

Ok, I see. Then you should add an explicit padding field and fix the
comment to make the structure match the actual interface.
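
Something along these lines, as a sketch on a hypothetical cut-down struct
(demo_info_padded and the field name pad are made up for illustration, and
the sensor and version arrays are left out): the hole becomes a named member
and compile-time checks pin the layout, so it is the same on every ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layout with the padding spelled out. */
struct demo_info_padded {
	uint32_t reserved[3];
	uint32_t cpld_version;
	uint32_t infineon_version;
	uint32_t pad;		/* explicit; previously implicit compiler padding */
	uint64_t dram_size;
};

/* Fail the build if the layout ever drifts from the intended interface. */
static_assert(offsetof(struct demo_info_padded, dram_size) == 24,
	      "dram_size must sit at byte offset 24");
static_assert(sizeof(struct demo_info_padded) == 32,
	      "struct must be exactly 32 bytes");
```

With the pad member in place, the offsets no longer depend on whether the
target aligns 64-bit types to 4 or 8 bytes.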

       Arnd


* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-26 21:14       ` Arnd Bergmann
@ 2019-01-26 21:48         ` Oded Gabbay
  2019-01-27  8:32           ` gregkh
  0 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-26 21:48 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: gregkh, Linux Kernel Mailing List, ogabbay

On Sat, Jan 26, 2019 at 11:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, Jan 26, 2019 at 5:25 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > On Sat, Jan 26, 2019 at 6:06 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Wed, Jan 23, 2019 at 1:01 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > > > new file mode 100644
> > > > index 000000000000..9dbb7077eabd
> > > > --- /dev/null
> > > > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > >
> > > Since this is a apparently a user space ABI, the file should be in
> > > include/uapi/linux/,
> > > not in the driver directory.
> >
> > This is not a user space ABI. This is the ABI between the driver and the F/W.
>
> Ah, I see. In that case, you should get rid of all the bitfields and make the
> struct members all __le32/__le64/... to make it work on big-endian kernels.
>
I really don't want to start converting bitfields and structures to
use __le32/64.
As I wrote in one of the previous reviews, we don't support big-endian
architectures (what's left after POWER moved to support little endian?).
We actually do run on POWER9, but with the ppc64le architecture.
In any case, our software stack is so big that this minor change in
the driver won't have any impact on the overall ability to run
something on our H/W.

> > >
> > > > +/* must be aligned to 4 bytes */
> > > > +struct armcp_info {
> > > > +       struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
> > > > +       __u8 kernel_version[VERSION_MAX_LEN];
> > > > +       __u32 reserved[3];
> > > > +       __u32 cpld_version;
> > > > +       __u32 infineon_version;
> > > > +       __u8 fuse_version[VERSION_MAX_LEN];
> > > > +       __u8 thermal_version[VERSION_MAX_LEN];
> > > > +       __u8 armcp_version[VERSION_MAX_LEN];
> > > > +       __u64 dram_size;
> > > > +};
> > >
> > > The compiler will align this to 8 bytes on most architectures, and
> > > add another padding field before dram_size. Better remove the
> > > 'reserved' fields, or make them an even number.
> > I can't do that, because those fields were once used by the F/W and if
> > I will change the order here, or add/remove those fields then it will
> > break compatibility with old F/W.
>
> Ok, I see. Then you should add an explicit padding field and fix the
> comment to make the structure match the actual interface.
>
>        Arnd
Understood, will be fixed.
Thanks,
Oded


* Re: [PATCH v2 1/5] drivers/accel: Introduce subsystem
  2019-01-25 21:13               ` [PATCH v2 " Olof Johansson
  2019-01-26 17:09                 ` Randy Dunlap
@ 2019-01-27  4:31                 ` Andrew Donnellan
  2019-01-28 19:36                   ` Frederic Barrat
  1 sibling, 1 reply; 103+ messages in thread
From: Andrew Donnellan @ 2019-01-27  4:31 UTC (permalink / raw)
  To: Olof Johansson, linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, Frederic Barrat, ogabbay,
	airlied, jglisse, linuxppc-dev

[+ linuxppc-dev, because cxl/ocxl are handled through powerpc - please 
cc on future versions of this series]

On 26/1/19 8:13 am, Olof Johansson wrote:
> We're starting to see more of these kind of devices, the current
> upcoming wave will likely be around machine learning and inference
> engines. A few drivers have been added to drivers/misc for this, but
> it's timely to make it into a separate group of drivers/subsystem, to
> make it easier to find them, and to encourage collaboration between
> contributors.
> 
> Over time, we expect to build shared frameworks that the drivers will
> make use of, but how that framework needs to look like to fill the needs
> is still unclear, and the best way to gain that knowledge is to give the
> disparate implementations a shared location.
> 
> There has been some controversy around expectations for userspace
> stacks being open. The clear preference is to see that happen, and any
> driver and platform stack that is delivered like that will be given
> preferential treatment, and at some point in the future it might
> become the requirement. Until then, the bare minimum we need is an
> open low-level userspace such that the driver and HW interfaces can be
> exercised if someone is modifying the driver, even if the full details
> of the workload are not always available.
> 
> Bootstrapping this with myself and Greg as maintainers (since the current
> drivers will be moving out of drivers/misc). Looking forward to expanding
> that group over time.
> 

[snip]

> +
> +Hardware offload accelerator subsystem
> +======================================
> +
> +This is a brief overview of the subsystem (grouping) of hardware
> +accelerators kept under drivers/accel
> +
> +Types of hardware supported
> +---------------------------
> +
> +  The general types of hardware supported are hardware devices that has
> +  general interactions of sending commands and buffers to the hardware,
> +  returning completions and possible filled buffers back, together
> +  with the usual driver pieces around hardware control, setup, error
> +  handling, etc.
> +
> +  Drivers that fit into other subsystems are expected to be merged
> +  there, and use the appropriate userspace interfaces of said functional
> +  areas. We don't expect to see drivers for network, storage, graphics
> +  and similar hardware implemented by drivers here.
> +
> +Expectations for contributions
> +------------------------------
> +
> + - Platforms and hardware that has fully open stacks, from Firmware to
> +   Userspace, are always going to be given preferential treatment. These
> +   platforms give the best insight for behavior and interaction of all
> +   layers, including ability to improve implementation across the stack
> +   over time.
> +
> + - If a platform is partially proprietary, it is still expected that the
> +   portions that interact the driver can be shared in a form that allows
> +   for exercising the hardware/driver and evolution of the interface over
> +   time. This could be separated into a shared library and test/sample
> +   programs, for example.
> +
> + - Over time, there is an expectation to converge drivers over to shared
> +   frameworks and interfaces. Until then, the general rule is that no
> +   more than one driver per vendor will be acceptable. For vendors that
> +   aren't participating in the work towards shared frameworks over time,
> +   we reserve the right to phase out support for the hardware.

How exactly do generic drivers for interconnect protocols, such as 
cxl/ocxl, fit in here?

cxl and ocxl are not drivers for a specific device, they are generic 
drivers which can be used with any device implementing the CAPI or 
OpenCAPI protocol respectively - many of which will be FPGA boards 
flashed with customer-designed accelerator cores for specific workloads, 
some will be accelerators using ASICs or using FPGA images supplied by 
vendors, some will be driven from userspace, others using the cxl/ocxl 
kernel API, etc.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited



* Re: [PATCH 03/15] habanalabs: add basic Goya support
  2019-01-25 20:32     ` Oded Gabbay
@ 2019-01-27  6:39       ` Mike Rapoport
  2019-01-28  7:44         ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-27  6:39 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org

On Fri, Jan 25, 2019 at 10:32:55PM +0200, Oded Gabbay wrote:
> On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 02:00:45AM +0200, Oded Gabbay wrote:
> > > This patch adds a basic support for the Goya device. The code initializes
> > > the device's PCI controller and PCI bars. It also initializes various S/W
> > > structures and adds some basic helper functions.
> > >
> > > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > > ---
> > >  drivers/misc/habanalabs/Makefile            |   5 +-
> > >  drivers/misc/habanalabs/device.c            |  71 +++
> > >  drivers/misc/habanalabs/goya/Makefile       |   3 +
> > >  drivers/misc/habanalabs/goya/goya.c         | 633 ++++++++++++++++++++
> > >  drivers/misc/habanalabs/goya/goyaP.h        | 125 ++++
> > >  drivers/misc/habanalabs/habanalabs.h        | 131 ++++
> > >  drivers/misc/habanalabs/habanalabs_drv.c    |   3 +
> > >  drivers/misc/habanalabs/include/goya/goya.h | 115 ++++
> > >  8 files changed, 1085 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/misc/habanalabs/goya/Makefile
> > >  create mode 100644 drivers/misc/habanalabs/goya/goya.c
> > >  create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
> > >  create mode 100644 drivers/misc/habanalabs/include/goya/goya.h

[ ... ]

> > > +
> > > +/**
> > > + * goya_sw_init - Goya software initialization code
> > > + *
> > > + * @hdev: pointer to hl_device structure
> > > + *
> > > + */
> > > +static int goya_sw_init(struct hl_device *hdev)
> > > +{
> > > +     struct goya_device *goya;
> > > +     int rc;
> > > +
> > > +     /* Allocate device structure */
> > > +     goya = kzalloc(sizeof(*goya), GFP_KERNEL);
> >
> > Consider using devm_k[mz]alloc() for memory allocations throughout the
> > driver. I didn't check all the spots where it can be applicable.
> I honestly wasn't aware of that. We never used that in AMD drivers
> (which where I spent most of my kernel time).
> I'll look into that offline but for now I don't really want to change
> into it blindly in all locations, unless there is some hard kernel
> rule for using that in drivers.

AFAIK, there's no such rule. It's just supposed to make driver
developer/maintainer life easier ;-)
 
> >
> > > +     if (!goya)
> > > +             return -ENOMEM;
> > > +
> > > +     /* according to goya_init_iatu */
> > > +     goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> > > +     hdev->asic_specific = goya;
> > > +
> > > +     /* Create DMA pool for small allocations */
> > > +     hdev->dma_pool = dma_pool_create(dev_name(hdev->dev),
> > > +                     &hdev->pdev->dev, GOYA_DMA_POOL_BLK_SIZE, 8, 0);
> > > +     if (!hdev->dma_pool) {
> > > +             dev_err(hdev->dev, "failed to create DMA pool\n");
> > > +             rc = -ENOMEM;
> > > +             goto free_goya_device;
> > > +     }
> > > +

[ ... ]

> > > +
> > > +static const struct hl_asic_funcs goya_funcs = {
> > > +     .early_init = goya_early_init,
> > > +     .early_fini = goya_early_fini,
> > > +     .sw_init = goya_sw_init,
> > > +     .sw_fini = goya_sw_fini,
> > > +     .suspend = goya_suspend,
> > > +     .resume = goya_resume,
> > > +     .dma_alloc_coherent = goya_dma_alloc_coherent,
> > > +     .dma_free_coherent = goya_dma_free_coherent,
> >
> > Is there any additional functionality that is planned in goya or gaudi in
> > these two functions?
> > > It seems like they are not really needed, at least at the moment, and
> > > they certainly don't need to be part of the ASIC ops.
> 
> So this relates to the simulator support, because there the
> implementation of these two functions is totally different as I don't
> have pci device.

Can you please add a comment about it here?
 
-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 05/15] habanalabs: add command buffer module
  2019-01-25 21:47     ` Oded Gabbay
@ 2019-01-27  6:49       ` Mike Rapoport
  2019-01-28  7:55         ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-27  6:49 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Fri, Jan 25, 2019 at 11:47:03PM +0200, Oded Gabbay wrote:
> On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 02:00:47AM +0200, Oded Gabbay wrote:
> > > This patch adds the CB module, which allows the user to create and
> > > destroy CBs and to map them to the user's process address-space.
> >
> > Can you please spell "command buffer" at least first time it's mentioned?
> fixed
> >
> > > > A command buffer is a memory block that resides in DMA-able address-space
> > > and is physically contiguous so it can be accessed by the device without
> > > MMU translation. The command buffer memory is allocated using the
> > > coherent DMA API.
> > >
> > > When creating a new CB, the IOCTL returns a handle of it, and the
> > > user-space process needs to use that handle to mmap the buffer to get a VA
> > > in the user's address-space.
> > >
> > > Before destroying (freeing) a CB, the user must unmap the CB's VA using the
> > > CB handle.
> > >
> > > Each CB has a reference counter, which tracks its usage in command
> > > submissions and also its mmaps (only a single mmap is allowed).
> > >
> > > The driver maintains a pool of pre-allocated CBs in order to reduce
> > > latency during command submissions. In case the pool is empty, the driver
> > > will go to the slow-path of allocating a new CB, i.e. calling
> > > dma_alloc_coherent.
> > >
> > > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > > ---
> > >  drivers/misc/habanalabs/Makefile           |   3 +-
> > >  drivers/misc/habanalabs/command_buffer.c   | 414 +++++++++++++++++++++
> > >  drivers/misc/habanalabs/device.c           |  43 ++-
> > >  drivers/misc/habanalabs/goya/goya.c        |  28 ++
> > >  drivers/misc/habanalabs/habanalabs.h       |  95 ++++-
> > >  drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
> > >  drivers/misc/habanalabs/habanalabs_ioctl.c | 102 +++++
> > >  include/uapi/misc/habanalabs.h             |  62 +++
> > >  8 files changed, 746 insertions(+), 3 deletions(-)
> > >  create mode 100644 drivers/misc/habanalabs/command_buffer.c
> > >  create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
> > >  create mode 100644 include/uapi/misc/habanalabs.h

[ ... ]

> > > +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> > > +                     u32 cb_size, u64 *handle, int ctx_id)
> > > +{
> > > +     struct hl_cb *cb;
> > > +     bool alloc_new_cb = true;
> > > +     int rc;
> > > +
> > > +     if (hdev->disabled) {
> > > +             dev_warn_ratelimited(hdev->dev,
> > > +                     "Device is disabled !!! Can't create new CBs\n");
> > > +             rc = -EBUSY;
> > > +             goto out_err;
> > > +     }
> > > +
> > > +     /* Minimum allocation must be PAGE SIZE */
> > > +     if (cb_size < PAGE_SIZE)
> > > +             cb_size = PAGE_SIZE;
> > > +
> > > +     if (ctx_id == HL_KERNEL_ASID_ID &&
> > > +                     cb_size <= hdev->asic_prop.cb_pool_cb_size) {
> > > +
> > > +             spin_lock(&hdev->cb_pool_lock);
> > > +             if (!list_empty(&hdev->cb_pool)) {
> > > +                     cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
> > > +                                     pool_list);
> > > +                     list_del(&cb->pool_list);
> > > +                     spin_unlock(&hdev->cb_pool_lock);
> > > +                     alloc_new_cb = false;
> > > +             } else {
> > > +                     spin_unlock(&hdev->cb_pool_lock);
> > > +                     dev_warn_once(hdev->dev, "CB pool is empty\n");
> >
> > Isn't it going to be a false alarm when you allocate the cb for the first
> > time?
> Why ?
> The cb_pool list holds a list of available CBs. See hl_cb_pool_init()
> - it adds newly allocated CBs to this pool list.
> 
> if (!list_empty(&hdev->cb_pool)) {       -  this checks whether the
> pool is not empty so we can take an available CB from it. If the list
> is empty (hence the pool is empty), we print the warning.
 
Sorry if it's too much nitpicking, but why should the allocation of the
first cb be a warning? There's nothing wrong there... Maybe dev_dbg()
instead?

> > > +             }
> > > +     }
> > > +
> > > +     if (alloc_new_cb) {
> > > +             cb = hl_cb_alloc(hdev, cb_size, ctx_id);
> > > +             if (!cb) {
> > > +                     rc = -ENOMEM;
> > > +                     goto out_err;
> > > +             }
> > > +     }
> > > +
> > > +     cb->hdev = hdev;
> > > +     cb->ctx_id = ctx_id;
> > > +
> > > +     spin_lock(&mgr->cb_lock);
> > > +     rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);
> >
> > It seems the ID will remain dangling if the cb is reused.
> 
> I'm not sure what you mean by this comment. Reused by whom? In what
> fashion is it reused?
 
Sorry if I didn't explain it more clearly.
In case the cb is reused, you call idr_alloc() anyway and overwrite the
previous value of cb->id, so the old ID never gets idr_remove()'ed.

> >
> > > +     spin_unlock(&mgr->cb_lock);
> > > +
> > > +     if (rc < 0) {
> > > +             dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
> > > +             goto release_cb;
> > > +     }
> > > +
> > > +     cb->id = rc;
> > > +
> > > +     kref_init(&cb->refcount);
> > > +     spin_lock_init(&cb->lock);
> > > +
> > > +     /*
> > > +      * idr is 32-bit so we can safely OR it with a mask that is above
> > > +      * 32 bit
> > > +      */
> > > +     *handle = cb->id | HL_MMAP_CB_MASK;
> > > +     *handle <<= PAGE_SHIFT;
> > > +
> > > +     return 0;
> > > +
> > > +release_cb:
> > > +     cb_do_release(hdev, cb);
> > > +out_err:
> > > +     *handle = 0;
> > > +
> > > +     return rc;
> > > +}
> > > +

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 10/15] habanalabs: add device reset support
  2019-01-23  0:00 ` [PATCH 10/15] habanalabs: add device reset support Oded Gabbay
@ 2019-01-27  7:51   ` Mike Rapoport
  2019-01-28 12:53     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-27  7:51 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:52AM +0200, Oded Gabbay wrote:
> This patch adds support for doing various on-the-fly resets of Goya.
> 
> The driver supports two types of resets:
> 1. soft-reset
> 2. hard-reset
> 
> Soft-reset is done when the device detects a timeout of a command
> submission that was given to the device. The soft-reset process only resets
> the engines that are relevant for the submission of compute jobs, i.e. the
> DMA channels, the TPCs and the MME. The purpose is to bring the device as
> fast as possible to a working state.
> 
> Hard-reset is done in several cases:
> 1. After soft-reset is done but the device is not responding
> 2. When fatal errors occur inside the device, e.g. ECC error
> 3. When the driver is removed
> 
> Hard-reset performs a reset of the entire chip except for the PCI
> controller and the PLLs. It is a much longer process than soft-reset but it
> helps to recover the device without the need to reboot the Host.
> 
> After hard-reset, the driver will restore the max power attribute and in
> case of manual power management, the frequencies that were set.
> 
> This patch also adds two entries to the sysfs, which allows the root user
> to initiate a soft or hard reset.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/command_buffer.c  |  11 +-
>  drivers/misc/habanalabs/device.c          | 308 +++++++++++++++++++++-
>  drivers/misc/habanalabs/goya/goya.c       | 201 ++++++++++++++
>  drivers/misc/habanalabs/goya/goya_hwmgr.c |  18 +-
>  drivers/misc/habanalabs/habanalabs.h      |  35 +++
>  drivers/misc/habanalabs/habanalabs_drv.c  |   9 +-
>  drivers/misc/habanalabs/hwmon.c           |   4 +-
>  drivers/misc/habanalabs/irq.c             |  31 +++
>  drivers/misc/habanalabs/sysfs.c           | 120 ++++++++-
>  9 files changed, 712 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
> index 535ed6cc5bda..700c6da01188 100644
> --- a/drivers/misc/habanalabs/command_buffer.c
> +++ b/drivers/misc/habanalabs/command_buffer.c
> @@ -81,9 +81,10 @@ int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
>  	bool alloc_new_cb = true;
>  	int rc;
>  
> -	if (hdev->disabled) {
> +	if ((hdev->disabled) || ((atomic_read(&hdev->in_reset)) &&
> +					(ctx_id != HL_KERNEL_ASID_ID))) {
>  		dev_warn_ratelimited(hdev->dev,
> -			"Device is disabled !!! Can't create new CBs\n");
> +			"Device is disabled or in reset !!! Can't create new CBs\n");
>  		rc = -EBUSY;
>  		goto out_err;
>  	}
> @@ -187,6 +188,12 @@ int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
>  	u64 handle;
>  	int rc;
>  
> +	if (hdev->hard_reset_pending) {
> +		dev_crit_ratelimited(hdev->dev,
> +			"Device HARD reset pending !!! Please close FD\n");
> +		return -ENODEV;
> +	}

Probably this check should be done at the top-level ioctl()?
And what will happen if the device performs a hard reset, but the user
keeps the file descriptor open?
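
To illustrate the first question, here is a minimal userspace sketch of
checking the flag once in the dispatcher instead of in every handler. All
names in it (ioctl_hdev, toy_cb_ioctl, toy_ioctl_dispatch and the handler
table) are hypothetical stand-ins, not the driver's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct hl_device; only the flag matters here. */
struct ioctl_hdev {
	bool hard_reset_pending;
};

typedef int (*toy_ioctl_fn)(struct ioctl_hdev *hdev, void *data);

/* Per-module handler: it no longer needs its own reset check. */
static int toy_cb_ioctl(struct ioctl_hdev *hdev, void *data)
{
	(void)hdev;
	(void)data;
	return 0;
}

static const toy_ioctl_fn toy_ioctls[] = {
	toy_cb_ioctl,
};

/* A single check at the top level covers every handler in the table. */
static int toy_ioctl_dispatch(struct ioctl_hdev *hdev, unsigned int nr,
			      void *data)
{
	if (hdev->hard_reset_pending)
		return -ENODEV;

	if (nr >= sizeof(toy_ioctls) / sizeof(toy_ioctls[0]))
		return -ENOTTY;

	return toy_ioctls[nr](hdev, data);
}
```

With this shape, a handler added later gets the check for free.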

> +
>  	switch (args->in.op) {
>  	case HL_CB_OP_CREATE:
>  		rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index ff7b610f18c4..00fde57ce823 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -188,6 +188,7 @@ static int device_early_init(struct hl_device *hdev)
>  
>  	mutex_init(&hdev->device_open);
>  	mutex_init(&hdev->send_cpu_message_lock);
> +	atomic_set(&hdev->in_reset, 0);
>  	atomic_set(&hdev->fd_open_cnt, 0);
>  
>  	return 0;
> @@ -238,6 +239,27 @@ static void set_freq_to_low_job(struct work_struct *work)
>  			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
>  }
>  
> +static void hl_device_heartbeat(struct work_struct *work)
> +{
> +	struct hl_device *hdev = container_of(work, struct hl_device,
> +						work_heartbeat.work);
> +
> +	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
> +		goto reschedule;
> +
> +	if (!hdev->asic_funcs->send_heartbeat(hdev))
> +		goto reschedule;

AFAIU, asic_funcs->send_heartbeat() is set once at init time. The work
should not be scheduled if it's NULL, I suppose.

> +
> +	dev_err(hdev->dev, "Device heartbeat failed !!!\n");
> +	hl_device_reset(hdev, true, false);
> +
> +	return;
> +
> +reschedule:
> +	schedule_delayed_work(&hdev->work_heartbeat,
> +			usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
> +}
> +
>  /**
>   * device_late_init - do late stuff initialization for the habanalabs device
>   *
> @@ -273,6 +295,12 @@ static int device_late_init(struct hl_device *hdev)
>  	schedule_delayed_work(&hdev->work_freq,
>  			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
>  
> +	if (hdev->heartbeat) {
> +		INIT_DELAYED_WORK(&hdev->work_heartbeat, hl_device_heartbeat);
> +		schedule_delayed_work(&hdev->work_heartbeat,
> +				usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
> +	}
> +
>  	hdev->late_init_done = true;
>  
>  	return 0;
> @@ -290,6 +318,8 @@ static void device_late_fini(struct hl_device *hdev)
>  		return;
>  
>  	cancel_delayed_work_sync(&hdev->work_freq);
> +	if (hdev->heartbeat)
> +		cancel_delayed_work_sync(&hdev->work_heartbeat);
>  
>  	if (hdev->asic_funcs->late_fini)
>  		hdev->asic_funcs->late_fini(hdev);
> @@ -397,6 +427,254 @@ int hl_device_resume(struct hl_device *hdev)
>  	return 0;
>  }
>  
> +static void hl_device_hard_reset_pending(struct work_struct *work)
> +{
> +	struct hl_device_reset_work *device_reset_work =
> +		container_of(work, struct hl_device_reset_work, reset_work);
> +	struct hl_device *hdev = device_reset_work->hdev;
> +	u16 pending_cnt = HL_PENDING_RESET_PER_SEC;
> +	struct task_struct *task = NULL;
> +
> +	/* Flush all processes that are inside hl_open */
> +	mutex_lock(&hdev->device_open);
> +
> +	while ((atomic_read(&hdev->fd_open_cnt)) && (pending_cnt)) {
> +
> +		pending_cnt--;
> +
> +		dev_info(hdev->dev,
> +			"Can't HARD reset, waiting for user to close FD\n");
> +		ssleep(1);
> +	}
> +
> +	if (atomic_read(&hdev->fd_open_cnt)) {
> +		task = get_pid_task(hdev->user_ctx->hpriv->taskpid,
> +					PIDTYPE_PID);
> +		if (task) {
> +			dev_info(hdev->dev, "Killing user processes\n");
> +			send_sig(SIGKILL, task, 1);

Shouldn't the user get a chance for cleanup?

> +			msleep(100);
> +
> +			put_task_struct(task);
> +		}
> +	}
> +
> +	mutex_unlock(&hdev->device_open);
> +
> +	hl_device_reset(hdev, true, true);
> +
> +	kfree(device_reset_work);
> +}
> +

[ ... ]

> diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> index 866d1774b2e4..9482dbb2e03a 100644
> --- a/drivers/misc/habanalabs/goya/goya_hwmgr.c
> +++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> @@ -38,7 +38,7 @@ static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
>  	struct hl_device *hdev = dev_get_drvdata(dev);
>  	long value;
>  
> -	if (hdev->disabled)
> +	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
>  		return -ENODEV;
>  
>  	value = hl_get_frequency(hdev, MME_PLL, false);
> @@ -57,7 +57,7 @@ static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
>  	int rc;
>  	long value;
>  
> -	if (hdev->disabled) {
> +	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {

There are quite a few of those, maybe split this check to a helper
function?
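
Something along these lines is what I have in mind. It is only a userspace
sketch: rst_hdev and rst_device_unavailable() are made-up names, and C11
atomics stand in for the kernel's atomic_t and atomic_read().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two hl_device fields used by the check. */
struct rst_hdev {
	bool disabled;
	atomic_int in_reset;
};

/* True when the device must not be touched: disabled or mid-reset. */
static inline bool rst_device_unavailable(struct rst_hdev *hdev)
{
	return hdev->disabled || atomic_load(&hdev->in_reset);
}
```

Each sysfs show/store callback would then start with a single
"if (rst_device_unavailable(hdev))" test.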

>  		count = -ENODEV;
>  		goto fail;
>  	}
> @@ -87,7 +87,7 @@ static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
>  	struct hl_device *hdev = dev_get_drvdata(dev);
>  	long value;
>  
> -	if (hdev->disabled)
> +	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
>  		return -ENODEV;
>  
>  	value = hl_get_frequency(hdev, TPC_PLL, false);
> @@ -106,7 +106,7 @@ static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
>  	int rc;
>  	long value;
>  
> -	if (hdev->disabled) {
> +	if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
>  		count = -ENODEV;
>  		goto fail;
>  	}

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-26 21:48         ` Oded Gabbay
@ 2019-01-27  8:32           ` gregkh
  2019-01-29 22:49             ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: gregkh @ 2019-01-27  8:32 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: Arnd Bergmann, Linux Kernel Mailing List, ogabbay

On Sat, Jan 26, 2019 at 11:48:02PM +0200, Oded Gabbay wrote:
> On Sat, Jan 26, 2019 at 11:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Sat, Jan 26, 2019 at 5:25 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > On Sat, Jan 26, 2019 at 6:06 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Wed, Jan 23, 2019 at 1:01 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > >
> > > > > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > > > > new file mode 100644
> > > > > index 000000000000..9dbb7077eabd
> > > > > --- /dev/null
> > > > > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > > >
> > > > > Since this is apparently a user space ABI, the file should be in
> > > > include/uapi/linux/,
> > > > not in the driver directory.
> > >
> > > This is not a user space ABI. This is the ABI between the driver and the F/W.
> >
> > Ah, I see. In that case, you should get rid of all the bitfields and make the
> > struct members all __le32/__le64/... to make it work on big-endian kernels.
> >
> I really don't want to start converting bitfields and structures to
> use __le32/64.
> As I wrote in one of the previous reviews, we don't support big-endian
> architecture (what's left after POWER moved to support little endian
> ?).  We actually do run on POWER9 but with ppc64le architecture
> In any case, our software stack is so big that this minor change in
> the driver won't have any impact on the overall ability to run
> something on our H/W

You don't have to do anything at the moment to "convert" to use a
specific endian, but you do have to always mark such variables that are
in a specific endian that this is the format they are expected in.

Then, when you run a tool like sparse, you will be notified if you
happen to be making any assumptions that might not be correct about
those variables, and it's usually trivial to fix it up at that time.
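
As a rough userspace illustration of what replacing a bitfield with an
explicitly little-endian word looks like: the header layout, masks and toy_*
names below are invented for the example, and toy_le32_to_cpu() only mimics
what le32_to_cpu() does for a __le32 field in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header layout: bits 0-23 length, bits 24-28 opcode. */
#define TOY_HDR_LEN_MASK	0x00FFFFFFu
#define TOY_HDR_OPCODE_SHIFT	24
#define TOY_HDR_OPCODE_MASK	0x1Fu

/* Assemble a host-order value from four little-endian bytes. */
static uint32_t toy_le32_to_cpu(const uint8_t raw[4])
{
	return (uint32_t)raw[0] | ((uint32_t)raw[1] << 8) |
	       ((uint32_t)raw[2] << 16) | ((uint32_t)raw[3] << 24);
}

/* Fields are extracted with mask and shift, never with a C bitfield. */
static uint32_t toy_hdr_opcode(const uint8_t raw[4])
{
	return (toy_le32_to_cpu(raw) >> TOY_HDR_OPCODE_SHIFT) &
	       TOY_HDR_OPCODE_MASK;
}

static uint32_t toy_hdr_len(const uint8_t raw[4])
{
	return toy_le32_to_cpu(raw) & TOY_HDR_LEN_MASK;
}
```

The result is the same on any host byte order, which is the property the
annotations let sparse check for you.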

hope this helps,

greg k-h

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 11/15] habanalabs: add command submission module
  2019-01-23  0:00 ` [PATCH 11/15] habanalabs: add command submission module Oded Gabbay
@ 2019-01-27 15:11   ` Mike Rapoport
  2019-01-28 13:51     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-27 15:11 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay

On Wed, Jan 23, 2019 at 02:00:53AM +0200, Oded Gabbay wrote:
> This patch adds the main flow for the user to submit work to the device.
> 
> Each work is described by a command submission object (CS). The CS contains
> 3 arrays of command buffers: One for execution, and two for context-switch
> (store and restore).
> 
> For each CB, the user specifies on which queue to put that CB. In case of
> an internal queue, the entry doesn't contain a pointer to the CB but the
> address in the on-chip memory that the CB resides at.
> 
> The driver parses some of the CBs to enforce security restrictions.
> 
> The user receives a sequence number that represents the CS object. The user
> can then query the driver regarding the status of the CS, using that
> sequence number.
> 
> In case the CS doesn't finish before the timeout expires, the driver will
> perform a soft-reset of the device.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/Makefile             |    3 +-
>  drivers/misc/habanalabs/command_submission.c |  787 +++++++++++++
>  drivers/misc/habanalabs/context.c            |   52 +-
>  drivers/misc/habanalabs/device.c             |   16 +
>  drivers/misc/habanalabs/goya/goya.c          | 1082 ++++++++++++++++++
>  drivers/misc/habanalabs/habanalabs.h         |  274 +++++
>  drivers/misc/habanalabs/habanalabs_drv.c     |   23 +
>  drivers/misc/habanalabs/habanalabs_ioctl.c   |    4 +-
>  drivers/misc/habanalabs/hw_queue.c           |  250 ++++
>  drivers/misc/habanalabs/memory.c             |  200 ++++
>  include/uapi/misc/habanalabs.h               |  158 ++-
>  11 files changed, 2842 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/command_submission.c
>  create mode 100644 drivers/misc/habanalabs/memory.c
> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index b5607233d216..d2fd0e18b1eb 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -5,7 +5,8 @@
>  obj-m	:= habanalabs.o
>  
>  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> -		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
> +		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
> +		command_submission.o
>  
>  include $(src)/goya/Makefile
>  habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
> new file mode 100644
> index 000000000000..0116c2262f17
> --- /dev/null
> +++ b/drivers/misc/habanalabs/command_submission.c
> @@ -0,0 +1,787 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include <uapi/misc/habanalabs.h>
> +#include "habanalabs.h"
> +
> +#include <linux/sched/mm.h>
> +#include <linux/sched/task.h>
> +#include <linux/sched/signal.h>
> +#include <linux/wait.h>
> +#include <linux/mm.h>
> +#include <linux/highmem.h>

[ ... ]

> +static void cs_do_release(struct kref *ref)
> +{
> +	struct hl_cs *cs = container_of(ref, struct hl_cs,
> +						refcount);
> +	struct hl_device *hdev = cs->ctx->hdev;
> +	struct hl_cs_job *job, *tmp;
> +
> +	cs->completed = true;
> +
> +	/*
> +	 * Although if we reached here it means that all external jobs have
> +	 * finished, because each one of them took refcnt to CS, we still
> +	 * need to go over the internal jobs and free them. Otherwise, we
> +	 * will have leaked memory and what's worse, the CS object (and
> +	 * potentially the CTX object) could be released, while the JOB
> +	 * still holds a pointer to them (but no reference).
> +	 */
> +	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node)
> +		free_job(hdev, job);
> +
> +	/* We also need to update CI for internal queues */
> +	if (cs->submitted) {
> +		hl_int_hw_queue_update_ci(cs);
> +
> +		spin_lock(&hdev->hw_queues_mirror_lock);
> +		/* remove CS from hw_queues mirror list */
> +		list_del_init(&cs->mirror_node);
> +		spin_unlock(&hdev->hw_queues_mirror_lock);
> +
> +		/*
> +		 * Don't cancel TDR in case this CS was timedout because we
> +		 * might be running from the TDR context
> +		 */
> +		if ((!cs->timedout) &&
> +			(hdev->timeout_jiffies != MAX_SCHEDULE_TIMEOUT)) {
> +			struct hl_cs *next;
> +
> +			if (cs->tdr_active)
> +				cancel_delayed_work_sync(&cs->work_tdr);
> +
> +			spin_lock(&hdev->hw_queues_mirror_lock);
> +			/* queue TDR for next CS */
> +			next = list_first_entry_or_null(
> +					&hdev->hw_queues_mirror_list,
> +					struct hl_cs, mirror_node);
> +			if ((next) && (!next->tdr_active)) {
> +				next->tdr_active = true;
> +				schedule_delayed_work(&next->work_tdr,
> +							hdev->timeout_jiffies);
> +				spin_unlock(&hdev->hw_queues_mirror_lock);
> +			} else {
> +				spin_unlock(&hdev->hw_queues_mirror_lock);
> +			}

'else' can be dropped, just move spin_unlock() outside the 'if'
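
A small userspace sketch of the shape I mean. The toy_lock type only counts
nesting depth in place of a real spinlock, and toy_cs and
toy_queue_next_tdr() are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock that only tracks nesting depth, standing in for a spinlock. */
struct toy_lock {
	int depth;
};

static void toy_lock_acquire(struct toy_lock *l) { l->depth++; }
static void toy_lock_release(struct toy_lock *l) { l->depth--; }

struct toy_cs {
	bool tdr_active;
	int scheduled;	/* counts pretend schedule_delayed_work() calls */
};

/* Arm the timeout for the next CS, if there is one and it is not armed. */
static void toy_queue_next_tdr(struct toy_lock *lock, struct toy_cs *next)
{
	toy_lock_acquire(lock);
	if (next && !next->tdr_active) {
		next->tdr_active = true;
		next->scheduled++;
	}
	toy_lock_release(lock);	/* one unlock covers both outcomes */
}
```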

> +		}
> +	}
> +
> +	hl_ctx_put(cs->ctx);
> +
> +	if (cs->timedout)
> +		dma_fence_set_error(cs->fence, -ETIMEDOUT);
> +	else if (cs->aborted)
> +		dma_fence_set_error(cs->fence, -EIO);
> +
> +	dma_fence_signal(cs->fence);
> +	dma_fence_put(cs->fence);
> +
> +	kfree(cs);
> +}

[ ... ]

> +static int allocate_cs(struct hl_device *hdev, struct hl_ctx *ctx,
> +			struct hl_cs **cs_new)
> +{
> +	struct hl_dma_fence *fence;
> +	struct dma_fence *other = NULL;
> +	struct hl_cs *cs;
> +	int rc;
> +
> +	cs = kzalloc(sizeof(*cs), GFP_ATOMIC);
> +	if (!cs)
> +		return -ENOMEM;

Does this ever run from a context that cannot use GFP_KERNEL?
This applies to other allocations below.

> +
> +	cs->ctx = ctx;
> +	cs->submitted = false;
> +	cs->completed = false;
> +	INIT_LIST_HEAD(&cs->job_list);
> +	INIT_DELAYED_WORK(&cs->work_tdr, cs_timedout);
> +	kref_init(&cs->refcount);
> +	spin_lock_init(&cs->job_lock);
> +
> +	fence = kmalloc(sizeof(*fence), GFP_ATOMIC);

kzalloc?

> +	if (!fence) {
> +		rc = -ENOMEM;
> +		goto free_cs;
> +	}
> +
> +	fence->hdev = hdev;
> +	spin_lock_init(&fence->lock);
> +	cs->fence = &fence->base_fence;
> +
> +	spin_lock(&ctx->cs_lock);
> +
> +	fence->cs_seq = ctx->cs_sequence;
> +	other = ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)];
> +	if ((other) && (!dma_fence_is_signaled(other))) {
> +		spin_unlock(&ctx->cs_lock);
> +		rc = -EAGAIN;
> +		goto free_fence;
> +	}
> +
> +	dma_fence_init(&fence->base_fence, &hl_fence_ops, &fence->lock,
> +			ctx->asid, ctx->cs_sequence);
> +
> +	cs->sequence = fence->cs_seq;
> +
> +	ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)] =
> +							&fence->base_fence;
> +	ctx->cs_sequence++;
> +
> +	dma_fence_get(&fence->base_fence);
> +
> +	dma_fence_put(other);
> +
> +	spin_unlock(&ctx->cs_lock);
> +
> +	*cs_new = cs;
> +
> +	return 0;
> +
> +free_fence:
> +	kfree(fence);
> +free_cs:
> +	kfree(cs);
> +	return rc;
> +}
> +

[ ... ]

> +
> +static int goya_validate_cb(struct hl_device *hdev,
> +			struct hl_cs_parser *parser, bool is_mmu)
> +{
> +	u32 cb_parsed_length = 0;
> +	int rc = 0;
> +
> +	parser->patched_cb_size = 0;
> +
> +	/* cb_user_size is more than 0 so loop will always be executed */
> +	while ((cb_parsed_length < parser->user_cb_size) && (!rc)) {
> +		enum packet_id pkt_id;
> +		u16 pkt_size;
> +		void *user_pkt;
> +
> +		user_pkt = (void *) (parser->user_cb->kernel_address +
> +							cb_parsed_length);
> +
> +		pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
> +				PACKET_HEADER_PACKET_ID_MASK) >>
> +					PACKET_HEADER_PACKET_ID_SHIFT);
> +
> +		pkt_size = goya_packet_sizes[pkt_id];
> +		cb_parsed_length += pkt_size;
> +		if (cb_parsed_length > parser->user_cb_size) {
> +			dev_err(hdev->dev,
> +				"packet 0x%x is out of CB boundary\n", pkt_id);
> +			rc = -EINVAL;
> +			continue;

The !rc in the while condition was easy to miss for me. Please consider a
break here and

	if (rc)
		break;

after the switch
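
Roughly like the following userspace sketch, where the loop leaves as soon as
rc is non-zero. The packet IDs, the size table and the toy_* names are
invented for the example. The range check on the ID before it indexes the
size table is an addition of this sketch and is not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum toy_pkt_id { TOY_PKT_NOP = 0, TOY_PKT_STOP = 1, TOY_PKT_MAX };

/* Hypothetical fixed packet sizes in bytes, indexed by packet ID. */
static const uint16_t toy_pkt_sizes[TOY_PKT_MAX] = { 8, 8 };

/* Walk a buffer of packets; the first byte of each packet is its ID. */
static int toy_validate_cb(const uint8_t *cb, uint32_t cb_size)
{
	uint32_t parsed = 0;
	int rc = 0;

	while (parsed < cb_size) {
		uint8_t pkt_id = cb[parsed];

		if (pkt_id >= TOY_PKT_MAX) {
			rc = -EINVAL;	/* unknown ID, also guards the table */
			break;
		}

		parsed += toy_pkt_sizes[pkt_id];
		if (parsed > cb_size) {
			rc = -EINVAL;	/* packet crosses the CB boundary */
			break;
		}

		switch (pkt_id) {
		case TOY_PKT_NOP:
			break;
		case TOY_PKT_STOP:
			rc = -EPERM;	/* not allowed from user space */
			break;
		}

		if (rc)
			break;
	}

	return rc;
}
```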

> +		}
> +
> +		switch (pkt_id) {
> +		case PACKET_WREG_32:
> +			/*
> +			 * Although it is validated after copy in patch_cb(),
> +			 * need to validate here as well because patch_cb() is
> +			 * not called in MMU path while this function is called
> +			 */
> +			rc = goya_validate_wreg32(hdev, parser, user_pkt);
> +			break;
> +
> +		case PACKET_WREG_BULK:
> +			dev_err(hdev->dev,
> +				"User not allowed to use WREG_BULK\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_MSG_PROT:
> +			dev_err(hdev->dev,
> +				"User not allowed to use MSG_PROT\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_CP_DMA:
> +			dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_STOP:
> +			dev_err(hdev->dev, "User not allowed to use STOP\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_LIN_DMA:
> +			if (is_mmu)
> +				rc = goya_validate_dma_pkt_mmu(hdev, parser,
> +						user_pkt);
> +			else
> +				rc = goya_validate_dma_pkt_no_mmu(hdev, parser,
> +						user_pkt);
> +			break;
> +
> +		case PACKET_MSG_LONG:
> +		case PACKET_MSG_SHORT:
> +		case PACKET_FENCE:
> +		case PACKET_NOP:
> +			parser->patched_cb_size += pkt_size;
> +			break;
> +
> +		default:
> +			dev_err(hdev->dev, "Invalid packet header 0x%x\n",
> +				pkt_id);
> +			rc = -EINVAL;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * The new CB should have space at the end for two MSG_PROT packets:
> +	 * 1. A packet that will act as a completion packet
> +	 * 2. A packet that will generate MSI-X interrupt
> +	 */
> +	parser->patched_cb_size += sizeof(struct packet_msg_prot) * 2;
> +
> +	return rc;
> +}

[ ... ]

> +static int goya_patch_cb(struct hl_device *hdev,
> +				struct hl_cs_parser *parser)
> +{
> +	u32 cb_parsed_length = 0;
> +	u32 cb_patched_cur_length = 0;
> +	int rc = 0;
> +
> +	/* cb_user_size is more than 0 so loop will always be executed */
> +	while ((cb_parsed_length < parser->user_cb_size) && (!rc)) {
> +		enum packet_id pkt_id;
> +		u16 pkt_size;
> +		u32 new_pkt_size = 0;
> +		void *user_pkt, *kernel_pkt;
> +
> +		user_pkt = (void *) (parser->user_cb->kernel_address +
> +							cb_parsed_length);
> +		kernel_pkt = (void *) (parser->patched_cb->kernel_address +
> +							cb_patched_cur_length);
> +
> +		pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
> +				PACKET_HEADER_PACKET_ID_MASK) >>
> +					PACKET_HEADER_PACKET_ID_SHIFT);
> +
> +		pkt_size = goya_packet_sizes[pkt_id];
> +		cb_parsed_length += pkt_size;
> +		if (cb_parsed_length > parser->user_cb_size) {
> +			dev_err(hdev->dev,
> +				"packet 0x%x is out of CB boundary\n", pkt_id);
> +			rc = -EINVAL;
> +			continue;

Ditto

> +		}
> +
> +		switch (pkt_id) {
> +		case PACKET_LIN_DMA:
> +			rc = goya_patch_dma_packet(hdev, parser, user_pkt,
> +						kernel_pkt, &new_pkt_size);
> +			cb_patched_cur_length += new_pkt_size;
> +			break;
> +
> +		case PACKET_WREG_32:
> +			memcpy(kernel_pkt, user_pkt, pkt_size);
> +			cb_patched_cur_length += pkt_size;
> +			rc = goya_validate_wreg32(hdev, parser, kernel_pkt);
> +			break;
> +
> +		case PACKET_WREG_BULK:
> +			dev_err(hdev->dev,
> +				"User not allowed to use WREG_BULK\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_MSG_PROT:
> +			dev_err(hdev->dev,
> +				"User not allowed to use MSG_PROT\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_CP_DMA:
> +			dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_STOP:
> +			dev_err(hdev->dev, "User not allowed to use STOP\n");
> +			rc = -EPERM;
> +			break;
> +
> +		case PACKET_MSG_LONG:
> +		case PACKET_MSG_SHORT:
> +		case PACKET_FENCE:
> +		case PACKET_NOP:
> +			memcpy(kernel_pkt, user_pkt, pkt_size);
> +			cb_patched_cur_length += pkt_size;
> +			break;
> +
> +		default:
> +			dev_err(hdev->dev, "Invalid packet header 0x%x\n",
> +				pkt_id);
> +			rc = -EINVAL;
> +			break;
> +		}
> +	}
> +
> +	return rc;
> +}

[ ... ]

>  static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
>  		u16 event_type, char *axi_name, int len)
>  {
> @@ -4645,6 +5677,48 @@ static void goya_disable_clock_gating(struct hl_device *hdev)
>  
>  }
>  
> +static bool goya_is_device_idle(struct hl_device *hdev)
> +{
> +	u64 offset, dma_qm_reg, tpc_qm_reg, tpc_cmdq_reg, tpc_cfg_reg;
> +	bool val = true;
> +	int i;
> +
> +	offset = mmDMA_QM_1_GLBL_STS0 - mmDMA_QM_0_GLBL_STS0;
> +
> +	for (i = 0 ; i < DMA_MAX_NUM ; i++) {
> +		dma_qm_reg = mmDMA_QM_0_GLBL_STS0 + i * offset;
> +
> +		val = val && ((RREG32(dma_qm_reg) & DMA_QM_IDLE_MASK) ==
> +				DMA_QM_IDLE_MASK);
> +	}
> +
> +	offset = mmTPC1_QM_GLBL_STS0 - mmTPC0_QM_GLBL_STS0;
> +
> +	for (i = 0 ; i < TPC_MAX_NUM ; i++) {
> +		tpc_qm_reg = mmTPC0_QM_GLBL_STS0 + i * offset;
> +		tpc_cmdq_reg = mmTPC0_CMDQ_GLBL_STS0 + i * offset;
> +		tpc_cfg_reg = mmTPC0_CFG_STATUS + i * offset;
> +
> +		val = val && ((RREG32(tpc_qm_reg) & TPC_QM_IDLE_MASK) ==
> +				TPC_QM_IDLE_MASK);
> +		val = val && ((RREG32(tpc_cmdq_reg) & TPC_CMDQ_IDLE_MASK) ==
> +				TPC_CMDQ_IDLE_MASK);
> +		val = val && ((RREG32(tpc_cfg_reg) & TPC_CFG_IDLE_MASK) ==
> +				TPC_CFG_IDLE_MASK);
> +	}
> +
> +	val = val && ((RREG32(mmMME_QM_GLBL_STS0) & MME_QM_IDLE_MASK) ==
> +			MME_QM_IDLE_MASK);
> +	val = val && ((RREG32(mmMME_CMDQ_GLBL_STS0) & MME_CMDQ_IDLE_MASK) ==
> +			MME_CMDQ_IDLE_MASK);
> +	val = val && ((RREG32(mmMME_ARCH_STATUS) & MME_ARCH_IDLE_MASK) ==
> +			MME_ARCH_IDLE_MASK);
> +	val = val && ((RREG32(mmMME_SHADOW_0_STATUS) & MME_SHADOW_IDLE_MASK) ==
> +			0);

Huh, these are neat, but IMHO plain

	if ((RREG32(reg) & mask) != mask)
		return false;

are more readable...
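
For example, as a table-driven userspace sketch. The toy_idle_check struct
and toy_is_device_idle() are hypothetical, and the value field stands in for
a register read. The expected value is kept separate from the mask because
the last check in the patch compares against 0 rather than against the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One status check: the engine is idle when (value & mask) == expect. */
struct toy_idle_check {
	uint32_t value;		/* stands in for RREG32(reg) */
	uint32_t mask;
	uint32_t expect;
};

static bool toy_is_device_idle(const struct toy_idle_check *checks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if ((checks[i].value & checks[i].mask) != checks[i].expect)
			return false;	/* stop at the first busy engine */
	}

	return true;
}
```

The chained "val = val && ..." form also stops reading registers once val is
false, so the early return should not change which registers get read.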

> +
> +	return val;
> +}
> +

[ ... ]

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 12/15] habanalabs: add virtual memory and MMU modules
  2019-01-23  0:00 ` [PATCH 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
@ 2019-01-27 16:13   ` Mike Rapoport
  2019-01-30 10:34     ` Oded Gabbay
  0 siblings, 1 reply; 103+ messages in thread
From: Mike Rapoport @ 2019-01-27 16:13 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: gregkh, linux-kernel, ogabbay, Omer Shpigelman

On Wed, Jan 23, 2019 at 02:00:54AM +0200, Oded Gabbay wrote:
> From: Omer Shpigelman <oshpigelman@habana.ai>
> 
> This patch adds the Virtual Memory and MMU modules.
> 
> Goya has an internal MMU which provides process isolation on the internal
> DDR. The internal MMU also performs translations for transactions that go
> from Goya to the Host.
> 
> The driver is responsible for allocating and freeing memory on the DDR
> upon user request. It also provides an interface to map and unmap DDR and
> Host memory to the device address space.
> 
> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> ---
>  drivers/misc/habanalabs/Makefile              |    2 +-
>  drivers/misc/habanalabs/context.c             |   19 +-
>  drivers/misc/habanalabs/device.c              |   20 +-
>  drivers/misc/habanalabs/goya/goya.c           |  391 +++++
>  drivers/misc/habanalabs/habanalabs.h          |  195 +++
>  drivers/misc/habanalabs/habanalabs_drv.c      |    2 +-
>  drivers/misc/habanalabs/habanalabs_ioctl.c    |    3 +-
>  drivers/misc/habanalabs/include/goya/goya.h   |    6 +-
>  .../include/hw_ip/mmu/mmu_general.h           |   45 +
>  .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
>  drivers/misc/habanalabs/memory.c              | 1506 +++++++++++++++++
>  drivers/misc/habanalabs/mmu.c                 |  604 +++++++
>  include/uapi/misc/habanalabs.h                |  122 +-
>  13 files changed, 2922 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
>  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
>  create mode 100644 drivers/misc/habanalabs/mmu.c
 
[ ... ]

> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index e3867615b974..94ee4cb00a49 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c

[ ... ]

> @@ -265,6 +332,10 @@ static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
>  };
>  
>  static int goya_armcp_info_get(struct hl_device *hdev);
> +static void goya_mmu_prepare(struct hl_device *hdev, u32 asid);
> +static int goya_mmu_clear_pgt_range(struct hl_device *hdev);
> +static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
> +					u64 phys_addr);

Nit: are the static declarations necessary? Or is it just a matter of
moving code around?

>  
>  static void goya_get_fixed_properties(struct hl_device *hdev)
>  {
> @@ -303,6 +374,16 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->sram_user_base_address = prop->sram_base_address +
>  						SRAM_USER_BASE_OFFSET;
>  
> +	prop->mmu_pgt_addr = MMU_PAGE_TABLES_ADDR;
> +	if (hdev->pldm)
> +		prop->mmu_pgt_size = 0x800000; /* 8MB */
> +	else
> +		prop->mmu_pgt_size = MMU_PAGE_TABLES_SIZE;
> +	prop->mmu_pte_size = PTE_SIZE;
> +	prop->mmu_hop_table_size = HOP_TABLE_SIZE;
> +	prop->mmu_hop0_tables_total_size = HOP0_TABLES_TOTAL_SIZE;
> +	prop->dram_page_size = PAGE_SIZE_2MB;
> +
>  	prop->host_phys_base_address = HOST_PHYS_BASE;
>  	prop->va_space_host_start_address = VA_HOST_SPACE_START;
>  	prop->va_space_host_end_address = VA_HOST_SPACE_END;

[ ... ]

> diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> new file mode 100644
> index 000000000000..8d61ee4f2d17
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef INCLUDE_MMU_GENERAL_H_
> +#define INCLUDE_MMU_GENERAL_H_
> +
> +#define PAGE_SHIFT_4KB			12
> +#define PAGE_SHIFT_2MB			21
> +#define PAGE_SIZE_2MB			(_AC(1, UL) << PAGE_SHIFT_2MB)
> +#define PAGE_SIZE_4KB			(_AC(1, UL) << PAGE_SHIFT_4KB)
> +
> +#define PAGE_PRESENT_MASK		0x0000000000001
> +#define SWAP_OUT_MASK			0x0000000000004
> +#define LAST_MASK			0x0000000000800
> +#define PHYS_ADDR_MASK			0x3FFFFFFFFF000ull
> +#define HOP0_MASK			0x3000000000000ull
> +#define HOP1_MASK			0x0FF8000000000ull
> +#define HOP2_MASK			0x0007FC0000000ull
> +#define HOP3_MASK			0x000003FE00000
> +#define HOP4_MASK			0x00000001FF000
> +#define OFFSET_MASK			0x0000000000FFF
> +
> +#define HOP0_SHIFT			48
> +#define HOP1_SHIFT			39
> +#define HOP2_SHIFT			30
> +#define HOP3_SHIFT			21
> +#define HOP4_SHIFT			12
> +
> +#define PTE_PHYS_ADDR_SHIFT		12
> +#define PTE_PHYS_ADDR_MASK		~0xFFF
> +
> +#define PTE_SIZE			sizeof(u64)

I suspect some architectures define PTE_SIZE in arch/*/include/asm.
You'd probably want to namespace this.
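
As a rough sketch of what namespaced definitions could look like, and of
how the hop masks above split a device virtual address into per-level
indices: the HL_ prefix and the helper names below are hypothetical, only
the mask and shift values are taken from the patch.

```c
#include <stdint.h>

/* Values copied from the patch; the HL_ prefix is a hypothetical namespace. */
#define HL_HOP0_MASK	0x3000000000000ull
#define HL_HOP1_MASK	0x0FF8000000000ull
#define HL_HOP2_MASK	0x0007FC0000000ull
#define HL_HOP3_MASK	0x000003FE00000ull
#define HL_HOP4_MASK	0x00000001FF000ull
#define HL_OFFSET_MASK	0x0000000000FFFull

#define HL_HOP0_SHIFT	48
#define HL_HOP1_SHIFT	39
#define HL_HOP2_SHIFT	30
#define HL_HOP3_SHIFT	21
#define HL_HOP4_SHIFT	12

#define HL_PTE_SIZE	sizeof(uint64_t)

/* Index of the entry selected by 'va' in the hop table of one level. */
static unsigned int hl_hop_index(uint64_t va, uint64_t mask,
				 unsigned int shift)
{
	return (unsigned int)((va & mask) >> shift);
}

/* Byte offset of 'va' inside its 4KB page. */
static unsigned int hl_page_offset(uint64_t va)
{
	return (unsigned int)(va & HL_OFFSET_MASK);
}
```

Hops 1-4 each use a 9-bit field, i.e. 512 entries of 8 bytes, which is
consistent with HOP_TABLE_SIZE being 4KB.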

> +#define HOP_TABLE_SIZE			PAGE_SIZE_4KB
> +#define HOP0_TABLES_TOTAL_SIZE		(HOP_TABLE_SIZE * MAX_ASID)
> +
> +#define MMU_HOP0_PA43_12_SHIFT		12
> +#define MMU_HOP0_PA49_44_SHIFT		(12 + 32)
> +
> +#define MMU_CONFIG_TIMEOUT_USEC		2000 /* 2 ms */
> +
> +#endif /* INCLUDE_MMU_GENERAL_H_ */
> diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
> index 94cbb252656d..c41ea19502e5 100644
> --- a/drivers/misc/habanalabs/memory.c
> +++ b/drivers/misc/habanalabs/memory.c
> @@ -5,12 +5,1193 @@
>   * All Rights Reserved.
>   */
>  
> +#include <uapi/misc/habanalabs.h>
>  #include "habanalabs.h"
> +#include "include/hw_ip/mmu/mmu_general.h"
>  
>  #include <linux/sched.h>
>  #include <linux/uaccess.h>
>  #include <linux/genalloc.h>
>  
> +#define HL_MMU_DEBUG	0
> +
> +/*
> + * The va ranges in context object contain a list with the available chunks of
> + * device virtual memory.
> + * There is one range for host allocations and one for DRAM allocations.
> + *
> + * On initialization each range contains one chunk of all of its available
> + * virtual range which is a half of the total device virtual range.
> + *
> + * On each mapping of physical pages, a suitable virtual range chunk (with a
> + * minimum size) is selected from the list. If the chunk size equals the
> + * requested size, the chunk is returned. Otherwise, the chunk is split into
> + * two chunks - one to return as result and a remainder to stay in the list.
> + *
> + * On each Unmapping of a virtual address, the relevant virtual chunk is
> + * returned to the list. The chunk is added to the list and if its edges match
> + * the edges of the adjacent chunks (means a contiguous chunk can be created),
> + * the chunks are merged.
> + *
> + * On finish, the list is checked to have only one chunk of all the relevant
> + * virtual range (which is a half of the device total virtual range).
> + * If not (means not all mappings were unmapped), a warning is printed.
> + */
> +
> +/**
> + * alloc_device_memory - allocate device memory
> + *
> + * @ctx                 : current context
> + * @args                : host parameters containing the requested size
> + * @ret_handle          : result handle
> + *
> + * This function does the following:
> + * - Allocate the requested size rounded up to 2MB pages
> + * - Return unique handle
> + */
> +static int alloc_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args,
> +				u32 *ret_handle)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg_list *phys_pg_list;
> +	struct hl_vm_phys_pg *phys_pg, *tmp;
> +	u64 paddr = 0;
> +	u32 total_size, num_pgs, page_size, page_shift;
> +	int handle, rc, i;
> +	bool contiguous;
> +
> +	page_size = hdev->asic_prop.dram_page_size;
> +	page_shift = __ffs(page_size);

Maybe it's worth storing page_shift in the asic_prop and calculating
page_size from it.
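
A minimal sketch of that idea, assuming a hypothetical struct dram_props
in place of the driver's asic_prop (the helper names are made up too).
The shift is stored once, and both the page size and the rounded-up page
count are derived from it:

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of asic_prop. */
struct dram_props {
	uint32_t page_shift;	/* e.g. 21 for 2MB pages */
};

/* Page size derived from the stored shift. */
static uint64_t dram_page_size(const struct dram_props *p)
{
	return UINT64_C(1) << p->page_shift;
}

/* Number of pages needed to hold mem_size bytes, rounded up. */
static uint64_t dram_num_pages(const struct dram_props *p, uint64_t mem_size)
{
	return (mem_size + dram_page_size(p) - 1) >> p->page_shift;
}
```

With this layout the __ffs() calls in the allocation and free paths would
no longer be needed.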

> +	num_pgs = (args->alloc.mem_size + (page_size - 1)) >> page_shift;
> +	total_size = num_pgs << page_shift;
> +
> +	contiguous = args->flags & HL_MEM_CONTIGUOUS;
> +
> +	if (contiguous) {
> +		paddr = (u64) gen_pool_alloc(vm->dram_pg_pool, total_size);
> +		if (!paddr) {
> +			dev_err(hdev->dev,
> +				"failed to allocate %u huge contiguous pages\n",
> +				num_pgs);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	phys_pg_list = kzalloc(sizeof(*phys_pg_list), GFP_KERNEL);
> +	if (!phys_pg_list) {
> +		rc = -ENOMEM;
> +		goto page_list_err;
> +	}
> +
> +	phys_pg_list->vm_type = VM_TYPE_PHYS_LIST;
> +	phys_pg_list->asid = ctx->asid;
> +	phys_pg_list->total_size = total_size;
> +	phys_pg_list->flags = args->flags;
> +	phys_pg_list->contiguous = contiguous;
> +	INIT_LIST_HEAD(&phys_pg_list->list);
> +
> +	for (i = 0 ; i < num_pgs ; i++) {
> +		phys_pg = kzalloc(sizeof(*phys_pg), GFP_KERNEL);

Consider adding *phys_pgs to phys_pg_list using kcalloc() before the loop.

> +		if (!phys_pg) {
> +			rc = -ENOMEM;
> +			goto pb_err;
> +		}
> +
> +		phys_pg->page_size = page_size;
> +
> +		if (phys_pg_list->contiguous) {
> +			phys_pg->paddr = paddr + i * phys_pg->page_size;
> +		} else {
> +			phys_pg->paddr =
> +				(u64) gen_pool_alloc(vm->dram_pg_pool,
> +							phys_pg->page_size);
> +			if (!phys_pg->paddr) {
> +				dev_err(hdev->dev, "ioctl failed to allocate page\n");
> +				kfree(phys_pg);
> +				rc = -ENOMEM;
> +				goto pb_err;
> +			}
> +		}
> +
> +		list_add_tail(&phys_pg->node, &phys_pg_list->list);
> +	}
> +
> +	spin_lock(&vm->idr_lock);
> +	handle = idr_alloc(&vm->phys_pg_list_handles, phys_pg_list, 1, 0,
> +				GFP_ATOMIC);
> +	spin_unlock(&vm->idr_lock);
> +
> +	if (handle < 0) {
> +		dev_err(hdev->dev, "Failed to get handle for page\n");
> +		rc = -EFAULT;
> +		goto idr_err;
> +	}
> +
> +	for (i = 0; i < num_pgs ; i++)
> +		kref_get(&vm->dram_pg_pool_refcount);
> +
> +	phys_pg_list->handle = handle;
> +
> +	atomic64_add(phys_pg_list->total_size, &ctx->dram_phys_mem);
> +	atomic64_add(phys_pg_list->total_size, &hdev->dram_used_mem);
> +
> +	*ret_handle = handle;
> +
> +	return 0;
> +
> +idr_err:
> +pb_err:
> +	list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
> +		if (!phys_pg_list->contiguous)
> +			gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
> +					phys_pg->page_size);
> +
> +		list_del(&phys_pg->node);
> +		kfree(phys_pg);
> +	}
> +
> +	kfree(phys_pg_list);
> +page_list_err:
> +	if (contiguous)
> +		gen_pool_free(vm->dram_pg_pool, paddr, total_size);
> +
> +	return rc;
> +}

[ ... ]

> +/**
> + * free_phys_pg_list    - free physical page list
> + *
> + * @hdev                : habanalabs device structure
> + * @phys_pg_list        : physical page list to free
> + *
> + * This function does the following:
> + * - Iterate over the list and free each physical block structure
> + * - In case of allocated memory, return the physical memory to the general pool
> + * - Free the hl_vm_phys_pg_list structure
> + */
> +static void free_phys_pg_list(struct hl_device *hdev,
> +		struct hl_vm_phys_pg_list *phys_pg_list)
> +{
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg *phys_pg, *tmp;
> +	u32 num_pgs;
> +	bool first = true;
> +	int i;
> +
> +	list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
> +		/*
> +		 * this if statement is relevant only when called from
> +		 * hl_vm_ctx_fini() and free_device_memory()
> +		 */
> +		if (!phys_pg_list->created_from_userptr) {
> +			if ((phys_pg_list->contiguous) && (first)) {
> +				first = false;
> +				gen_pool_free(vm->dram_pg_pool,
> +						phys_pg->paddr,
> +						phys_pg_list->total_size);
> +
> +				num_pgs = phys_pg_list->total_size >>
> +					__ffs(hdev->asic_prop.dram_page_size);
> +
> +				for (i = 0; i < num_pgs ; i++)
> +					kref_put(&vm->dram_pg_pool_refcount,
> +						dram_pg_pool_do_release);
> +
> +			} else if (!phys_pg_list->contiguous) {
> +				gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
> +						phys_pg->page_size);
> +				kref_put(&vm->dram_pg_pool_refcount,
> +						dram_pg_pool_do_release);
> +			}
> +		}
> +
> +		list_del(&phys_pg->node);
> +		kfree(phys_pg);
> +	}

Unless I'm missing something this can be simplified a bit:

if (!phys_pg_list->created_from_userptr) {
	for (i = 0; i < num_pgs ; i++)
		kref_put(&vm->dram_pg_pool_refcount,
			 dram_pg_pool_do_release);
	if (phys_pg_list->contiguous)
		gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
			      phys_pg_list->total_size);
}

list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
	if (!phys_pg_list->created_from_userptr &&
	    !phys_pg_list->contiguous)
		gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
			      phys_pg->page_size);
	list_del(&phys_pg->node);
	kfree(phys_pg);
}

And with phys_pg's array hanging from phys_pg_list it would be even simpler
;-)
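
To make the array idea concrete, here is a small user-space sketch. It is
only an assumption of how such a layout might look: struct phys_pg_pack,
pack_alloc() and pack_free() are hypothetical names, calloc() stands in
for kcalloc(), and only the contiguous case is modelled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical replacement for the list: one array of page addresses. */
struct phys_pg_pack {
	uint64_t *paddrs;	/* one entry per page, allocated in one call */
	uint32_t num_pgs;
	uint32_t page_size;
	bool contiguous;
};

/* Allocate the pack and fill it with num_pgs contiguous pages from base. */
static struct phys_pg_pack *pack_alloc(uint32_t num_pgs, uint32_t page_size,
				       uint64_t base)
{
	struct phys_pg_pack *pack;
	uint32_t i;

	if (!num_pgs)
		return NULL;

	pack = calloc(1, sizeof(*pack));
	if (!pack)
		return NULL;

	/* Single array allocation instead of one allocation per page. */
	pack->paddrs = calloc(num_pgs, sizeof(*pack->paddrs));
	if (!pack->paddrs) {
		free(pack);
		return NULL;
	}

	pack->num_pgs = num_pgs;
	pack->page_size = page_size;
	pack->contiguous = true;

	for (i = 0; i < num_pgs; i++)
		pack->paddrs[i] = base + (uint64_t)i * page_size;

	return pack;
}

/* Teardown is two frees; there are no list nodes to walk and delete. */
static void pack_free(struct phys_pg_pack *pack)
{
	if (!pack)
		return;

	free(pack->paddrs);
	free(pack);
}
```

The error path in the allocation function shrinks as well, since a
failure part-way through no longer leaves a partially built list to
unwind.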

> +
> +	kfree(phys_pg_list);
> +}
> +

-- 
Sincerely yours,
Mike.



* Re: [PATCH 1/5] drivers/accel: Introduce subsystem
  2019-01-25 22:23               ` [PATCH " Daniel Vetter
@ 2019-01-27 16:31                 ` Daniel Vetter
  0 siblings, 0 replies; 103+ messages in thread
From: Daniel Vetter @ 2019-01-27 16:31 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Linux Kernel Mailing List, linux-accelerators,
	Greg Kroah-Hartman, Frederic Barrat, Andrew Donnellan, ogabbay,
	Dave Airlie, Jerome Glisse

Hi Olof & Greg,

Ok I thought about what this means in practice a bit more over the
w/e, and I think we need to drag this discussion on for a bit more.

On Fri, Jan 25, 2019 at 11:23 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, Jan 25, 2019 at 10:16:12AM -0800, Olof Johansson wrote:
> > We're starting to see more of these kind of devices, the current
> > upcoming wave will likely be around machine learning and inference
> > engines. A few drivers have been added to drivers/misc for this, but
> > it's timely to make it into a separate group of drivers/subsystem, to
> > make it easier to find them, and to encourage collaboration between
> > contributors.
> >
> > Over time, we expect to build shared frameworks that the drivers will
> > make use of, but how that framework needs to look like to fill the needs
> > is still unclear, and the best way to gain that knowledge is to give the
> > disparate implementations a shared location.
> >
> > There has been some controversy around expectations for userspace
> > stacks being open. The clear preference is to see that happen, and any
> > driver and platform stack that is delivered like that will be given
> > preferential treatment, and at some point in the future it might
> > become the requirement. Until then, the bare minimum we need is an
> > open low-level userspace such that the driver and HW interfaces can be
> > exercised if someone is modifying the driver, even if the full details
> > of the workload are not always available.
> >
> > Bootstrapping this with myself and Greg as maintainers (since the current
> > drivers will be moving out of drivers/misc). Looking forward to expanding
> > that group over time.
> >
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Signed-off-by: Olof Johansson <olof@lixom.net>
>
> I spent a bit of time reading the proposed drivers, mostly just their uapi
> (habanalabs and ocxl&cxl), and there's really no technical difference I
> think between an accelaration driver sitting in drivers/gpu and an
> accelaration driver sitting in drivers/accel. Except:
>
> - drivers/gpu already has common interfaces for the things you'll probably
>   want to standardize (buffer sharing, syncronization primitives,
>   scheduler - right now we're working on figuring out some common
>   tracepoints).
>
> - Maybe even more important, drivers/gpu has the lessons learned in its
>   codebase about what not to standardize between drivers (everything else,
>   you'll regret it, we've been there).
>
> - drivers/gpu is the subsystem with 20 years of experience writing tiny
>   shim drivers in the kernel for high performance accelarators that need a
>   pretty huge stack in userspace to make them do anything useful. 20 years
>   ago all the rage to make faster was graphics, now it's AI. Looks exactly
>   the same from a kernel pov - command buffers, gigabytes of DMA and a
>   security/long term support nightmare.
>
> - drivers/gpu requires open source. The real thing, not some demo that
>   does a few DMA operations.
>
> And now we have drivers/accel and someone gets to explain to nvidia (or
> arm or whatever) how their exact same drivers (and well run engineering
> orgs really only invent command submission once) can be merged when they
> say it's for a TPU, and will get rejected when they say it's for a GPU. Or
> someone gets to explain to TPU+GPU vendors why their driver is not cool
> (because we'd end up with two), while their startup-competition only doing
> a TPU is totally fine and merged into upstream. Or we just stuff all the
> kernel drivers for blobby userspace into drivers/accel and otherwise
> ignore each another.

One awkward scenario I've missed is the following:
1. the fully open stack for accelerating glow/tensorflow/DNN that
we're planning to make happen takes off. Rough idea is
llvm+spirv+clover (in mesa) + gallium backends (also mesa3d) + drm.

2. if that's the case and someone writes an r/e'ed driver for such a
TPU (with the vendor supporting only their blob stack) then the
reasonable thing would be to write that stack on top of the existing
open source accelerator infrastructure. So a new/2nd driver in the
kernel, using the drm+mesa infrastructure.

3. this means that, to not block r/e efforts, we need to be ok with
duplicated drivers between drivers/accel and drivers/gpu - not being
able to share code with all the other programmable accelerators will
make it nigh impossible to have a working r/e'ed stack. We already see
this on the gl/vk/cl side, where the mesa drivers tend to be 1-2
orders of magnitude smaller than the proprietary/vendor stacks
(whether open or closed).

> I guess that last option would at least somewhat help me, since I wont
> ever have to explain anymore why we're the radical commies on dri-devel
> :-)
>
> Anyway, only reason I replied here again is because I accidentally started
> a private thread (well was too lazy to download the mbox to properly
> reply), and that's not good either. But I don't think anyone's going to
> change their opinion here, I think this reply is just for the record.
>
> Cheers, Daniel
>
> PS: Seen that there's a v2 of this now with Documentation, hasn't reached
> my inbox (yet). I don't think that one clarifies any of the tricky
> questions between drivers/gpu and drivers/accel, so figured won't harm if
> I leave the reply on v1.

Given the above, I'm revising my stance: I think your item to exclude
overlap between drivers/accel and drivers/gpu is actually harmful.
Instead we need a line saying that yes, there is overlap, and we do
expect the occasional duplicated driver for the same hardware. I think
that's going to have the best outcome:

- It's not going to piss off the drivers/gpu people any more than
drivers/accel itself does, you already scored all the sighs you'll get
I think. So won't make things worse.

- It's going to be the technically sound decision, since no more "is
this a tpu or gpu" non-technical tricky questions that just depend
upon what you market a chip for, instead of what it is.

- I think it's going to be much quicker to prove whether drivers/gpu
is too strict with merging drivers and has harmed the overall
ecosystem, since you'll have many more drivers to potentially merge.
One argument in your favour is clearly that we've never tried to drop
the open source userspace requirement. Imo if we're going to do this
experiment, we should do it right.

- There's not going to be a problem for accelerators that need to work
together with atomic modeset drivers or v4l drivers for displaying
their computation results - all you need for that is dma-buf import,
and you'll need that sooner or later anyway in drivers/accel. dma-buf
compatibility extends to all gfx userspace protocols (I think we rev'ed
them all to add dma-buf support). There are also not going to be
coordination problems with drivers/gpu due to this, since all the
dma-buf bits you'll need are already in drivers/dma-buf.

- It might actually make r/e'ing gpus easier, since currently you need
to boot 2 different kernels: One with the upstream drm stack, the
other with downstream vendor stack. If you actually manage to get all
these drivers merged in a useable state that would become better.

- Essentially this would make drivers/accel the -staging for
drivers/gpu, at least from our pov. I don't think that's going to be
an issue - with Greg you already have the resident expert for
maintaining dumpster fires on the team. And as long as we don't try to
share code (beyond stuff in drivers/dma-buf) I don't think this will
result in the coordination pains we've had with display drivers in
-staging.

- This is not going to set a new precedent - there have been plenty of
duplicated subsystems, especially for drivers. Occasionally you just
need to burn it all down and start over.

- Like with your proposal here for drivers/accel we can always change
the rules in a few years, or whenever it's clear what to do. And I
think the positions of the drivers/gpu and drivers/accel folks
are fundamentally opposed; there's not really room for a compromise
that works both ways, so the best option seems to be to decide this by
actually trying it out.

- Plus, I can just point people at you instead of having to explain
why we insist on open source for accelerators in drivers/gpu. That
wasn't meant as a joke, not entirely at least, it's kinda tiring to
have this discussion at least once every year ...

tldr; If you want to do this, do it right.

All we need is to drop your paragraph about avoiding overlap with
drivers/gpu, and instead acknowledge that there is an overlap with the
accelerator drivers in drivers/gpu and clearly state that both
subsystems are going to be ok with that overlap and duplication. I'd
ack that version.

Cheers, Daniel

> > ---
> >  MAINTAINERS            |  8 ++++++++
> >  drivers/Kconfig        |  2 ++
> >  drivers/Makefile       |  1 +
> >  drivers/accel/Kconfig  | 16 ++++++++++++++++
> >  drivers/accel/Makefile |  5 +++++
> >  5 files changed, 32 insertions(+)
> >  create mode 100644 drivers/accel/Kconfig
> >  create mode 100644 drivers/accel/Makefile
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index ddcdc29dfe1f6..8a9bbaf8f6e90 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -7033,6 +7033,14 @@ W:     https://linuxtv.org
> >  S:   Supported
> >  F:   drivers/media/platform/sti/hva
> >
> > +HW ACCELERATOR OFFLOAD SUBSYSTEM
> > +M:   Olof Johansson <olof@lixom.net>
> > +M:   Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > +L:   linux-accelerators@lists.ozlabs.org
> > +S:   Supported
> > +F:   drivers/accel/
> > +F:   Documentation/accelerators/
> > +
> >  HWPOISON MEMORY FAILURE HANDLING
> >  M:   Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >  L:   linux-mm@kvack.org
> > diff --git a/drivers/Kconfig b/drivers/Kconfig
> > index 4f9f99057ff85..3cc461f325569 100644
> > --- a/drivers/Kconfig
> > +++ b/drivers/Kconfig
> > @@ -228,4 +228,6 @@ source "drivers/siox/Kconfig"
> >
> >  source "drivers/slimbus/Kconfig"
> >
> > +source "drivers/accel/Kconfig"
> > +
> >  endmenu
> > diff --git a/drivers/Makefile b/drivers/Makefile
> > index 04da7876032cc..e4be06579cc5d 100644
> > --- a/drivers/Makefile
> > +++ b/drivers/Makefile
> > @@ -186,3 +186,4 @@ obj-$(CONFIG_MULTIPLEXER) += mux/
> >  obj-$(CONFIG_UNISYS_VISORBUS)        += visorbus/
> >  obj-$(CONFIG_SIOX)           += siox/
> >  obj-$(CONFIG_GNSS)           += gnss/
> > +obj-$(CONFIG_ACCEL)          += accel/
> > diff --git a/drivers/accel/Kconfig b/drivers/accel/Kconfig
> > new file mode 100644
> > index 0000000000000..13b36c0398895
> > --- /dev/null
> > +++ b/drivers/accel/Kconfig
> > @@ -0,0 +1,16 @@
> > +#
> > +# Drivers for hardware offload accelerators
> > +# See Documentation/accel/README.rst for more details
> > +#
> > +
> > +menuconfig ACCEL
> > +     bool "Hardware offload accelerator support"
> > +        help
> > +       HW offload accelerators are used for high-bandwidth workloads
> > +       where a higher-level kernel/userspace interface isn't suitable.
> > +
> > +if ACCEL
> > +
> > +comment "HW Accellerator drivers"
> > +
> > +endif
> > diff --git a/drivers/accel/Makefile b/drivers/accel/Makefile
> > new file mode 100644
> > index 0000000000000..343bbb8f45a14
> > --- /dev/null
> > +++ b/drivers/accel/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +#
> > +# Makefile for accel devices
> > +#
> > +
> > --
> > 2.11.0
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH 03/15] habanalabs: add basic Goya support
  2019-01-27  6:39       ` Mike Rapoport
@ 2019-01-28  7:44         ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28  7:44 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org

On Sun, Jan 27, 2019 at 8:39 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Fri, Jan 25, 2019 at 10:32:55PM +0200, Oded Gabbay wrote:
> > On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > >
> > > On Wed, Jan 23, 2019 at 02:00:45AM +0200, Oded Gabbay wrote:
> > > > This patch adds a basic support for the Goya device. The code initializes
> > > > the device's PCI controller and PCI bars. It also initializes various S/W
> > > > structures and adds some basic helper functions.
> > > >
> > > > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > > > ---
> > > >  drivers/misc/habanalabs/Makefile            |   5 +-
> > > >  drivers/misc/habanalabs/device.c            |  71 +++
> > > >  drivers/misc/habanalabs/goya/Makefile       |   3 +
> > > >  drivers/misc/habanalabs/goya/goya.c         | 633 ++++++++++++++++++++
> > > >  drivers/misc/habanalabs/goya/goyaP.h        | 125 ++++
> > > >  drivers/misc/habanalabs/habanalabs.h        | 131 ++++
> > > >  drivers/misc/habanalabs/habanalabs_drv.c    |   3 +
> > > >  drivers/misc/habanalabs/include/goya/goya.h | 115 ++++
> > > >  8 files changed, 1085 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/misc/habanalabs/goya/Makefile
> > > >  create mode 100644 drivers/misc/habanalabs/goya/goya.c
> > > >  create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
> > > >  create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
>
> [ ... ]
>
> > > > +
> > > > +/**
> > > > + * goya_sw_init - Goya software initialization code
> > > > + *
> > > > + * @hdev: pointer to hl_device structure
> > > > + *
> > > > + */
> > > > +static int goya_sw_init(struct hl_device *hdev)
> > > > +{
> > > > +     struct goya_device *goya;
> > > > +     int rc;
> > > > +
> > > > +     /* Allocate device structure */
> > > > +     goya = kzalloc(sizeof(*goya), GFP_KERNEL);
> > >
> > > Consider using devm_k[mz]alloc() for memory allocations throughout the
> > > driver. I didn't check all the spots where it can be applicable.
> > I honestly wasn't aware of that. We never used that in AMD drivers
> > (which where I spent most of my kernel time).
> > I'll look into that offline but for now I don't really want to change
> > into it blindly in all locations, unless there is some hard kernel
> > rule for using that in drivers.
>
> AFAIK, there's no such rule. It's just supposed to make driver
> developer/maintainer life easier ;-)
>
> > >
> > > > +     if (!goya)
> > > > +             return -ENOMEM;
> > > > +
> > > > +     /* according to goya_init_iatu */
> > > > +     goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> > > > +     hdev->asic_specific = goya;
> > > > +
> > > > +     /* Create DMA pool for small allocations */
> > > > +     hdev->dma_pool = dma_pool_create(dev_name(hdev->dev),
> > > > +                     &hdev->pdev->dev, GOYA_DMA_POOL_BLK_SIZE, 8, 0);
> > > > +     if (!hdev->dma_pool) {
> > > > +             dev_err(hdev->dev, "failed to create DMA pool\n");
> > > > +             rc = -ENOMEM;
> > > > +             goto free_goya_device;
> > > > +     }
> > > > +
>
> [ ... ]
>
> > > > +
> > > > +static const struct hl_asic_funcs goya_funcs = {
> > > > +     .early_init = goya_early_init,
> > > > +     .early_fini = goya_early_fini,
> > > > +     .sw_init = goya_sw_init,
> > > > +     .sw_fini = goya_sw_fini,
> > > > +     .suspend = goya_suspend,
> > > > +     .resume = goya_resume,
> > > > +     .dma_alloc_coherent = goya_dma_alloc_coherent,
> > > > +     .dma_free_coherent = goya_dma_free_coherent,
> > >
> > > Is there any additional functionality that is planned in goya or gaudi in
> > > these two functions?
> > > It seems like they are not really needed, at least at the moment and for
> > > sure that don't need to be part of ASIC ops.
> >
> > So this relates to the simulator support, because there the
> > implementation of these two functions is totally different as I don't
> > have pci device.
>
> Can you please add a comment about it here?
Of course, done.
Thanks,
Oded

>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 05/15] habanalabs: add command buffer module
  2019-01-27  6:49       ` Mike Rapoport
@ 2019-01-28  7:55         ` Oded Gabbay
  2019-01-28  8:41           ` Mike Rapoport
  0 siblings, 1 reply; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28  7:55 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Sun, Jan 27, 2019 at 8:49 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Fri, Jan 25, 2019 at 11:47:03PM +0200, Oded Gabbay wrote:
> > On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > >
> > > On Wed, Jan 23, 2019 at 02:00:47AM +0200, Oded Gabbay wrote:
> > > > This patch adds the CB module, which allows the user to create and
> > > > destroy CBs and to map them to the user's process address-space.
> > >
> > > Can you please spell "command buffer" at least first time it's mentioned?
> > fixed
> > >
> > > > A command buffer is a memory blocks that reside in DMA-able address-space
> > > > and is physically contiguous so it can be accessed by the device without
> > > > MMU translation. The command buffer memory is allocated using the
> > > > coherent DMA API.
> > > >
> > > > When creating a new CB, the IOCTL returns a handle of it, and the
> > > > user-space process needs to use that handle to mmap the buffer to get a VA
> > > > in the user's address-space.
> > > >
> > > > Before destroying (freeing) a CB, the user must unmap the CB's VA using the
> > > > CB handle.
> > > >
> > > > Each CB has a reference counter, which tracks its usage in command
> > > > submissions and also its mmaps (only a single mmap is allowed).
> > > >
> > > > The driver maintains a pool of pre-allocated CBs in order to reduce
> > > > latency during command submissions. In case the pool is empty, the driver
> > > > will go to the slow-path of allocating a new CB, i.e. calling
> > > > dma_alloc_coherent.
> > > >
> > > > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > > > ---
> > > >  drivers/misc/habanalabs/Makefile           |   3 +-
> > > >  drivers/misc/habanalabs/command_buffer.c   | 414 +++++++++++++++++++++
> > > >  drivers/misc/habanalabs/device.c           |  43 ++-
> > > >  drivers/misc/habanalabs/goya/goya.c        |  28 ++
> > > >  drivers/misc/habanalabs/habanalabs.h       |  95 ++++-
> > > >  drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
> > > >  drivers/misc/habanalabs/habanalabs_ioctl.c | 102 +++++
> > > >  include/uapi/misc/habanalabs.h             |  62 +++
> > > >  8 files changed, 746 insertions(+), 3 deletions(-)
> > > >  create mode 100644 drivers/misc/habanalabs/command_buffer.c
> > > >  create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
> > > >  create mode 100644 include/uapi/misc/habanalabs.h
>
> [ ... ]
>
> > > > +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> > > > +                     u32 cb_size, u64 *handle, int ctx_id)
> > > > +{
> > > > +     struct hl_cb *cb;
> > > > +     bool alloc_new_cb = true;
> > > > +     int rc;
> > > > +
> > > > +     if (hdev->disabled) {
> > > > +             dev_warn_ratelimited(hdev->dev,
> > > > +                     "Device is disabled !!! Can't create new CBs\n");
> > > > +             rc = -EBUSY;
> > > > +             goto out_err;
> > > > +     }
> > > > +
> > > > +     /* Minimum allocation must be PAGE SIZE */
> > > > +     if (cb_size < PAGE_SIZE)
> > > > +             cb_size = PAGE_SIZE;
> > > > +
> > > > +     if (ctx_id == HL_KERNEL_ASID_ID &&
> > > > +                     cb_size <= hdev->asic_prop.cb_pool_cb_size) {
> > > > +
> > > > +             spin_lock(&hdev->cb_pool_lock);
> > > > +             if (!list_empty(&hdev->cb_pool)) {
> > > > +                     cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
> > > > +                                     pool_list);
> > > > +                     list_del(&cb->pool_list);
> > > > +                     spin_unlock(&hdev->cb_pool_lock);
> > > > +                     alloc_new_cb = false;
> > > > +             } else {
> > > > +                     spin_unlock(&hdev->cb_pool_lock);
> > > > +                     dev_warn_once(hdev->dev, "CB pool is empty\n");
> > >
> > > Isn't it going to be a false alarm when you allocate the cb for the first
> > > time?
> > Why ?
> > The cb_pool list holds a list of available CBs. See hl_cb_pool_init()
> > - it adds newly allocated CBs to this pool list.
> >
> > if (!list_empty(&hdev->cb_pool)) {       -  this checks whether the
> > pool is not empty so we can take an available CB from it. If the list
> > is empty (hence the pool is empty), we print the warning.
>
> Sorry if it's too much nitpicking, but why should the allocation of the
> first cb be a warning? There's nothing wrong there... Maybe dev_dbg()
> instead?
Yeah, that's a fair point. The issue is that I would like to know if we
reach this state, and dev_dbg isn't usually enabled.
Still, I get what you are saying and I'll change this to dev_dbg.
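
A minimal user-space sketch of the pool behaviour discussed here, assuming
a simplified model: the names (struct cb, cb_pool, cb_get, cb_put,
empty_hits) are hypothetical and are not the driver's API, and the locking
and the kernel-context check of the real hl_cb_create() are left out. It
shows the fast path taking a pre-allocated buffer, and the empty pool being
treated as an expected fall-back that is counted rather than warned about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct hl_cb; only what the sketch needs. */
struct cb {
	struct cb *next;   /* link while the buffer sits in the pool */
	int from_pool;     /* 1: return to the pool on release, 0: free it */
};

struct cb_pool {
	struct cb *head;          /* available pre-allocated buffers */
	size_t cb_size;           /* size each pooled buffer can serve */
	unsigned long empty_hits; /* times the pool was found empty */
};

static void cb_pool_push(struct cb_pool *pool, struct cb *cb)
{
	cb->next = pool->head;
	pool->head = cb;
}

/* Pre-allocate 'count' buffers up front. */
static int cb_pool_init(struct cb_pool *pool, size_t count, size_t cb_size)
{
	pool->head = NULL;
	pool->cb_size = cb_size;
	pool->empty_hits = 0;
	for (size_t i = 0; i < count; i++) {
		struct cb *cb = calloc(1, sizeof(*cb));
		if (!cb)
			return -1;
		cb->from_pool = 1;
		cb_pool_push(pool, cb);
	}
	return 0;
}

/* Fast path: pop from the pool. Slow path: allocate a fresh buffer. */
static struct cb *cb_get(struct cb_pool *pool, size_t size)
{
	assert(size > 0);
	if (size <= pool->cb_size) {
		struct cb *cb = pool->head;
		if (cb) {
			pool->head = cb->next;
			cb->next = NULL;
			return cb;
		}
		/* Expected fall-back, not an error: count it, don't warn. */
		pool->empty_hits++;
	}
	return calloc(1, sizeof(struct cb)); /* from_pool stays 0 */
}

/* Release: pooled buffers go back to the pool, others are freed. */
static void cb_put(struct cb_pool *pool, struct cb *cb)
{
	if (cb->from_pool)
		cb_pool_push(pool, cb);
	else
		free(cb);
}
```

A counter like empty_hits is one way to keep the "did we ever hit the slow
path" information available without a warning in the log; whether the
driver exposes such a counter is not stated in this thread.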

>
> > > > +             }
> > > > +     }
> > > > +
> > > > +     if (alloc_new_cb) {
> > > > +             cb = hl_cb_alloc(hdev, cb_size, ctx_id);
> > > > +             if (!cb) {
> > > > +                     rc = -ENOMEM;
> > > > +                     goto out_err;
> > > > +             }
> > > > +     }
> > > > +
> > > > +     cb->hdev = hdev;
> > > > +     cb->ctx_id = ctx_id;
> > > > +
> > > > +     spin_lock(&mgr->cb_lock);
> > > > +     rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);
> > >
> > > It seems the ID will remain dangling if the cb is reused.
> >
> > I'm not sure what you mean by this comment. Reused by whom? In what
> > fashion is it reused?
>
> Sorry if I didn't explain it more clearly.
> In case the cb is reused, you call idr_alloc() anyway and overwrite the
> previous value of cb->id, and it never gets idr_remove()'ed
I don't think that is the case.
Please look at hl_cb_destroy(). There, we do the idr_remove and then
we kref_put the CB. In its release code path, we check if this is a
CB from the pool, and if so, we return it to the pool. When it is
alloc'ed again, it will get a new id.
The problem in this patch is that hl_cb_destroy is not used yet for
CBs from the pool, because the command submission code that uses it
comes in a later patch, so indeed it might be confusing. But if you
take a look at the entire code and check when hl_cb_destroy is
called, I think you will agree with me.
But if you still think otherwise, please tell me. I might be missing
something here.

Thanks,
Oded

>
> > >
> > > > +     spin_unlock(&mgr->cb_lock);
> > > > +
> > > > +     if (rc < 0) {
> > > > +             dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
> > > > +             goto release_cb;
> > > > +     }
> > > > +
> > > > +     cb->id = rc;
> > > > +
> > > > +     kref_init(&cb->refcount);
> > > > +     spin_lock_init(&cb->lock);
> > > > +
> > > > +     /*
> > > > +      * idr is 32-bit so we can safely OR it with a mask that is above
> > > > +      * 32 bit
> > > > +      */
> > > > +     *handle = cb->id | HL_MMAP_CB_MASK;
> > > > +     *handle <<= PAGE_SHIFT;
> > > > +
> > > > +     return 0;
> > > > +
> > > > +release_cb:
> > > > +     cb_do_release(hdev, cb);
> > > > +out_err:
> > > > +     *handle = 0;
> > > > +
> > > > +     return rc;
> > > > +}
> > > > +
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 05/15] habanalabs: add command buffer module
  2019-01-28  7:55         ` Oded Gabbay
@ 2019-01-28  8:41           ` Mike Rapoport
  0 siblings, 0 replies; 103+ messages in thread
From: Mike Rapoport @ 2019-01-28  8:41 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Mon, Jan 28, 2019 at 09:55:23AM +0200, Oded Gabbay wrote:
> On Sun, Jan 27, 2019 at 8:49 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Fri, Jan 25, 2019 at 11:47:03PM +0200, Oded Gabbay wrote:
> > > On Wed, Jan 23, 2019 at 2:28 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > >
> > > > On Wed, Jan 23, 2019 at 02:00:47AM +0200, Oded Gabbay wrote:
> > > > > This patch adds the CB module, which allows the user to create and
> > > > > destroy CBs and to map them to the user's process address-space.
> > > >
> > > > Can you please spell out "command buffer" at least the first time it's mentioned?
> > > fixed
> > > >
> > > > > A command buffer is a memory block that resides in DMA-able address-space
> > > > > and is physically contiguous so it can be accessed by the device without
> > > > > MMU translation. The command buffer memory is allocated using the
> > > > > coherent DMA API.
> > > > >
> > > > > When creating a new CB, the IOCTL returns a handle of it, and the
> > > > > user-space process needs to use that handle to mmap the buffer to get a VA
> > > > > in the user's address-space.
> > > > >
> > > > > Before destroying (freeing) a CB, the user must unmap the CB's VA using the
> > > > > CB handle.
> > > > >
> > > > > Each CB has a reference counter, which tracks its usage in command
> > > > > submissions and also its mmaps (only a single mmap is allowed).
> > > > >
> > > > > The driver maintains a pool of pre-allocated CBs in order to reduce
> > > > > latency during command submissions. In case the pool is empty, the driver
> > > > > will go to the slow-path of allocating a new CB, i.e. calling
> > > > > dma_alloc_coherent.
> > > > >
> > > > > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > > > > ---
> > > > >  drivers/misc/habanalabs/Makefile           |   3 +-
> > > > >  drivers/misc/habanalabs/command_buffer.c   | 414 +++++++++++++++++++++
> > > > >  drivers/misc/habanalabs/device.c           |  43 ++-
> > > > >  drivers/misc/habanalabs/goya/goya.c        |  28 ++
> > > > >  drivers/misc/habanalabs/habanalabs.h       |  95 ++++-
> > > > >  drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
> > > > >  drivers/misc/habanalabs/habanalabs_ioctl.c | 102 +++++
> > > > >  include/uapi/misc/habanalabs.h             |  62 +++
> > > > >  8 files changed, 746 insertions(+), 3 deletions(-)
> > > > >  create mode 100644 drivers/misc/habanalabs/command_buffer.c
> > > > >  create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
> > > > >  create mode 100644 include/uapi/misc/habanalabs.h
> >
> > [ ... ]
> >
> > > > > +int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> > > > > +                     u32 cb_size, u64 *handle, int ctx_id)
> > > > > +{
> > > > > +     struct hl_cb *cb;
> > > > > +     bool alloc_new_cb = true;
> > > > > +     int rc;
> > > > > +
> > > > > +     if (hdev->disabled) {
> > > > > +             dev_warn_ratelimited(hdev->dev,
> > > > > +                     "Device is disabled !!! Can't create new CBs\n");
> > > > > +             rc = -EBUSY;
> > > > > +             goto out_err;
> > > > > +     }
> > > > > +
> > > > > +     /* Minimum allocation must be PAGE SIZE */
> > > > > +     if (cb_size < PAGE_SIZE)
> > > > > +             cb_size = PAGE_SIZE;
> > > > > +
> > > > > +     if (ctx_id == HL_KERNEL_ASID_ID &&
> > > > > +                     cb_size <= hdev->asic_prop.cb_pool_cb_size) {
> > > > > +
> > > > > +             spin_lock(&hdev->cb_pool_lock);
> > > > > +             if (!list_empty(&hdev->cb_pool)) {
> > > > > +                     cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
> > > > > +                                     pool_list);
> > > > > +                     list_del(&cb->pool_list);
> > > > > +                     spin_unlock(&hdev->cb_pool_lock);
> > > > > +                     alloc_new_cb = false;
> > > > > +             } else {
> > > > > +                     spin_unlock(&hdev->cb_pool_lock);
> > > > > +                     dev_warn_once(hdev->dev, "CB pool is empty\n");
> > > >
> > > > Isn't it going to be a false alarm when you allocate the cb for the first
> > > > time?
> > > Why ?
> > > The cb_pool list holds a list of available CBs. See hl_cb_pool_init()
> > > - it adds newly allocated CBs to this pool list.
> > >
> > > if (!list_empty(&hdev->cb_pool)) {       -  this checks whether the
> > > pool is not empty so we can take an available CB from it. If the list
> > > is empty (hence the pool is empty), we print the warning.
> >
> > Sorry if it's too much nitpicking, but why should the allocation of the
> > first cb be a warning? There's nothing wrong there... Maybe dev_dbg()
> > instead?
> Yeah, that's a fair point. The issue is that I would like to know if we
> reach this state, and dev_dbg isn't usually enabled.
> Still, I get what you are saying and I'll change this to dev_dbg.
> 
> >
> > > > > +             }
> > > > > +     }
> > > > > +
> > > > > +     if (alloc_new_cb) {
> > > > > +             cb = hl_cb_alloc(hdev, cb_size, ctx_id);
> > > > > +             if (!cb) {
> > > > > +                     rc = -ENOMEM;
> > > > > +                     goto out_err;
> > > > > +             }
> > > > > +     }
> > > > > +
> > > > > +     cb->hdev = hdev;
> > > > > +     cb->ctx_id = ctx_id;
> > > > > +
> > > > > +     spin_lock(&mgr->cb_lock);
> > > > > +     rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);
> > > >
> > > > It seems the ID will remain dangling if the cb is reused.
> > >
> > > I'm not sure what you mean by this comment. Reused by whom? In what
> > > fashion is it reused?
> >
> > Sorry if I didn't explain it more clearly.
> > In case the cb is reused, you call idr_alloc() anyway and overwrite the
> > previous value of cb->id, and it never gets idr_remove()'ed
> I don't think that is the case.
> Please look at hl_cb_destroy(). There, we do the idr_remove and then
> we kref_put the CB. In its release code path, we check if this is a
> CB from the pool, and if so, we return it to the pool. When it is
> alloc'ed again, it will get a new id.
> The problem in this patch is that hl_cb_destroy is not used yet for
> CBs from the pool, because the command submission code that uses it
> comes in a later patch, so indeed it might be confusing. But if you
> take a look at the entire code and check when hl_cb_destroy is
> called, I think you will agree with me.
> But if you still think otherwise, please tell me. I might be missing
> something here.

Right, hl_cb_create and hl_cb_destroy are indeed paired. Frankly, I was too
lazy to thoroughly check the hl_device_release() case, where userspace didn't
free all the cb's, but apparently it also does the required cleanup.
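
The pairing can be illustrated with a small user-space sketch. It is only a
model under stated assumptions: id_table, id_alloc, id_remove, cb_create
and cb_destroy are hypothetical stand-ins (a fixed array instead of the
kernel IDR, no refcount, no locking). It shows that when destroy removes
the ID before the buffer goes back to the pool, a reused buffer gets a
fresh allocation and no stale ID is left behind.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDS 8 /* arbitrary capacity for the sketch */

/* Hypothetical stand-in for struct hl_cb. */
struct cb {
	int id;      /* 0 means "no ID assigned" */
	int in_pool; /* 1 while the buffer is available for reuse */
};

/* Tiny ID table standing in for the IDR; slot 0 is unused so IDs start at 1. */
struct id_table {
	struct cb *slot[MAX_IDS];
};

/* Allocate the lowest free ID >= 1, or return -1 if the table is full. */
static int id_alloc(struct id_table *t, struct cb *cb)
{
	for (int id = 1; id < MAX_IDS; id++) {
		if (!t->slot[id]) {
			t->slot[id] = cb;
			return id;
		}
	}
	return -1;
}

static void id_remove(struct id_table *t, int id)
{
	assert(id >= 1 && id < MAX_IDS);
	t->slot[id] = NULL;
}

static int ids_in_use(const struct id_table *t)
{
	int n = 0;

	for (int id = 1; id < MAX_IDS; id++)
		n += t->slot[id] != NULL;
	return n;
}

/* Create: take the buffer out of the pool and give it an ID. */
static int cb_create(struct id_table *t, struct cb *cb)
{
	int id = id_alloc(t, cb);

	if (id < 0)
		return -1;
	cb->id = id;
	cb->in_pool = 0;
	return 0;
}

/* Destroy: drop the ID first, then hand the buffer back to the pool. */
static void cb_destroy(struct id_table *t, struct cb *cb)
{
	id_remove(t, cb->id);
	cb->id = 0;
	cb->in_pool = 1;
}
```

After a create/destroy/create cycle on the same buffer, ids_in_use() is 1,
not 2, which is the property being checked in this exchange.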
 
> Thanks,
> Oded
> 
> >
> > > >
> > > > > +     spin_unlock(&mgr->cb_lock);
> > > > > +
> > > > > +     if (rc < 0) {
> > > > > +             dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
> > > > > +             goto release_cb;
> > > > > +     }
> > > > > +
> > > > > +     cb->id = rc;
> > > > > +
> > > > > +     kref_init(&cb->refcount);
> > > > > +     spin_lock_init(&cb->lock);
> > > > > +
> > > > > +     /*
> > > > > +      * idr is 32-bit so we can safely OR it with a mask that is above
> > > > > +      * 32 bit
> > > > > +      */
> > > > > +     *handle = cb->id | HL_MMAP_CB_MASK;
> > > > > +     *handle <<= PAGE_SHIFT;
> > > > > +
> > > > > +     return 0;
> > > > > +
> > > > > +release_cb:
> > > > > +     cb_do_release(hdev, cb);
> > > > > +out_err:
> > > > > +     *handle = 0;
> > > > > +
> > > > > +     return rc;
> > > > > +}
> > > > > +
> >
> > --
> > Sincerely yours,
> > Mike.
> >
> 

-- 
Sincerely yours,
Mike.



* Re: [PATCH 06/15] habanalabs: add basic Goya h/w initialization
  2019-01-25  7:46   ` Mike Rapoport
@ 2019-01-28 10:35     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28 10:35 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Fri, Jan 25, 2019 at 9:46 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> Hi,
>
> This starts the 6-9 review :)
>
> These were more difficult to review because small pieces of code are interleaved with
> large sequences of register writes. Making these register writes data
> rather than code would probably help.
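
A rough sketch of what "register data rather than code" could look like,
written as a user-space model. Everything here is an assumption made for
illustration: struct reg_write, write_reg_table, the fake register file
and the offsets are hypothetical; only the three values are copied from the
DDR CH0 sequence in the patch (MSTR, MRCTRL0, MRCTRL1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* One register write expressed as data. */
struct reg_write {
	uint32_t offset; /* byte offset of the register */
	uint32_t value;  /* value to write */
};

/* Fake register file so the sketch runs without hardware. */
#define FAKE_REG_COUNT 64
static uint32_t fake_regs[FAKE_REG_COUNT];

static void fake_wreg32(uint32_t offset, uint32_t value)
{
	assert(offset % 4 == 0 && offset / 4 < FAKE_REG_COUNT);
	fake_regs[offset / 4] = value;
}

/* Offsets are made up; values come from the quoted DDR CH0 sequence. */
static const struct reg_write ddr_ch0_init[] = {
	{ 0x00, 0x81040210 }, /* MSTR */
	{ 0x10, 0x4000a0f0 }, /* MRCTRL0 */
	{ 0x14, 0x00022ad0 }, /* MRCTRL1 */
};

/* Apply a table of writes in order. */
static void write_reg_table(const struct reg_write *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fake_wreg32(tbl[i].offset, tbl[i].value);
}
```

A plain table like this only covers straight runs of writes. The sequences
in the patch also contain interleaved reads and delays, which would need
either separate tables around those steps or an extra opcode field per
entry.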
>
> On Wed, Jan 23, 2019 at 02:00:48AM +0200, Oded Gabbay wrote:
> > This patch adds the basic part of Goya's H/W initialization. It adds code
> > that initializes Goya's internal CPU, various registers that are related to
> > internal routing, scrambling, workarounds for H/W bugs, etc.
> >
> > It also initializes Goya's security scheme that prevents the user from
> > abusing Goya to steal data from the host, crash the host, change
> > Goya's F/W, etc.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/device.c              |   12 +
> >  drivers/misc/habanalabs/goya/Makefile         |    2 +-
> >  drivers/misc/habanalabs/goya/goya.c           | 1892 ++++++++++-
> >  drivers/misc/habanalabs/goya/goyaP.h          |    3 +
> >  drivers/misc/habanalabs/goya/goya_security.c  | 2999 +++++++++++++++++
> >  drivers/misc/habanalabs/habanalabs.h          |   16 +
> >  drivers/misc/habanalabs/habanalabs_drv.c      |    8 +
> >  drivers/misc/habanalabs/include/goya/goya.h   |    1 +
> >  .../include/goya/goya_async_events.h          |  186 +
> >  .../habanalabs/include/goya/goya_boot_if.h    |   32 +
> >  10 files changed, 5144 insertions(+), 7 deletions(-)
> >  create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
> >  create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
> >  create mode 100644 drivers/misc/habanalabs/include/goya/goya_boot_if.h
> >
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index 0bd86a7d34db..9fc7218a973c 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -315,6 +315,15 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >               goto release_ctx;
> >       }
> >
> > +     rc = hdev->asic_funcs->hw_init(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize the H/W\n");
> > +             rc = 0;
>
> Mistype, I suppose.
Actually no :) From a certain point in the init process, I would like
the device to stay present with its sysfs/debugfs interface, but it
will be in a disabled ("malfunctioned" in sysfs) state so the user can't
submit workloads. The user/sysadmin will be able to try to reset the
device to make it work again, or read registers/memory through the debugfs
interface. So I need to "cheat" and set the return code to 0 to make that
work.
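
A minimal user-space sketch of this pattern, under simplified assumptions:
struct dev, hw_init_stub, device_init and submit_work are hypothetical
names, and the "registered" flag merely stands in for the sysfs/debugfs
interfaces. The init function reports success even when the H/W init
fails, but leaves the device disabled so that submissions are rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for struct hl_device. */
struct dev {
	bool disabled;   /* true: user may not submit workloads */
	bool registered; /* stand-in for the sysfs/debugfs interfaces */
};

/* Pretend H/W init; the caller chooses whether it succeeds. */
static int hw_init_stub(bool hw_ok)
{
	return hw_ok ? 0 : -EIO;
}

static int device_init(struct dev *d, bool hw_ok)
{
	d->disabled = true;
	d->registered = true; /* interfaces exist before H/W init runs */

	if (hw_init_stub(hw_ok)) {
		/*
		 * Deliberately report success: the device stays present
		 * for reset/debug, but remains disabled.
		 */
		return 0;
	}

	d->disabled = false;
	return 0;
}

/* Submissions are refused while the device is disabled. */
static int submit_work(const struct dev *d)
{
	assert(d->registered);
	return d->disabled ? -EBUSY : 0;
}
```

The -EBUSY here mirrors what hl_cb_create() in the earlier patch returns
when hdev->disabled is set; the rest is illustrative only.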

>
> > +             goto out_disabled;
> > +     }
> > +
> > +     hdev->disabled = false;
> > +
> >       dev_notice(hdev->dev,
> >               "Successfully added device to habanalabs driver\n");
> >
> > @@ -366,6 +375,9 @@ void hl_device_fini(struct hl_device *hdev)
> >       if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
> >               dev_err(hdev->dev, "kernel ctx is still alive\n");
> >
> > +     /* Reset the H/W. It will be in idle state after this returns */
> > +     hdev->asic_funcs->hw_fini(hdev, true);
> > +
> >       /* Call ASIC S/W finalize function */
> >       hdev->asic_funcs->sw_fini(hdev);
> >
> > diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
> > index 5ebf3d0d5794..a57096fa41b6 100644
> > --- a/drivers/misc/habanalabs/goya/Makefile
> > +++ b/drivers/misc/habanalabs/goya/Makefile
> > @@ -1,3 +1,3 @@
> >  subdir-ccflags-y += -I$(src)
> >
> > -HL_GOYA_FILES :=  goya/goya.o
> > \ No newline at end of file
> > +HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
> > \ No newline at end of file
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > index 341ac085af82..f715e01838b3 100644
> > --- a/drivers/misc/habanalabs/goya/goya.c
> > +++ b/drivers/misc/habanalabs/goya/goya.c
> > @@ -119,11 +119,11 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
> >       prop->va_space_dram_end_address = VA_DDR_SPACE_END;
> >       prop->cfg_size = CFG_SIZE;
> >       prop->max_asid = MAX_ASID;
> > +     prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> > +     prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
> >       prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> >
> >       prop->high_pll = PLL_HIGH_DEFAULT;
> > -     prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> > -     prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
> >  }
> >
> >  /**
> > @@ -459,10 +459,12 @@ static int goya_early_init(struct hl_device *hdev)
> >               goto disable_device;
> >       }
> >
> > -     val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
> > -     if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
> > -             dev_warn(hdev->dev,
> > -                     "PCI strap is not configured correctly, PCI bus errors may occur\n");
> > +     if (!hdev->pldm) {
> > +             val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
>
> What is the purpose of the 'mm' prefix in register names?
>
memory-mapped. It is a convention that I like (taken from AMD - see the
register files of the amdgpu code)
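
As a small illustration of the convention, here is a user-space sketch in
which the mm-prefixed names are register offsets and the unprefixed _MASK
names are bit fields inside them. The register names, offsets and the fake
BAR array are hypothetical; RREG32/WREG32 are redefined locally over that
array and are not the driver's macros.

```c
#include <assert.h>
#include <stdint.h>

/* mm prefix: offset of a memory-mapped register (hypothetical values). */
#define mmEXAMPLE_CTRL   0x0000
#define mmEXAMPLE_STATUS 0x0004

/* No prefix: a field mask within the register's value. */
#define EXAMPLE_STATUS_READY_MASK 0x1u

/* Fake BAR so the sketch runs without hardware. */
#define FAKE_BAR_WORDS 16
static uint32_t fake_bar[FAKE_BAR_WORDS];

static uint32_t *fake_reg_ptr(uint32_t reg)
{
	assert(reg % 4 == 0 && reg / 4 < FAKE_BAR_WORDS);
	return &fake_bar[reg / 4];
}

/* Local stand-ins for the driver's register accessors. */
#define RREG32(reg)      (*fake_reg_ptr(reg))
#define WREG32(reg, val) (*fake_reg_ptr(reg) = (val))

/* Read a register by its mm offset and test a field with its mask. */
static int example_is_ready(void)
{
	return (RREG32(mmEXAMPLE_STATUS) & EXAMPLE_STATUS_READY_MASK) != 0;
}
```

This matches the shape of the strap check in the patch, where
mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS is read and then tested against
PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK.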

> > +             if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
> > +                     dev_warn(hdev->dev,
> > +                             "PCI strap is not configured correctly, PCI bus errors may occur\n");
> > +     }
> >
> >       return 0;
> >
> > @@ -593,6 +595,1882 @@ int goya_sw_fini(struct hl_device *hdev)
> >       return 0;
> >  }
> >
> > +/**
> > + * goya_init_pll - Initialize pll registers
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static void goya_init_pll(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u16 hbw_nr, hbw_nf, hbw_od, hbw_nb;
> > +     u16 cpu_nr, cpu_nf, cpu_od, cpu_nb;
> > +     u16 mc_nr, mc_nf, mc_od, mc_nb;
> > +     u16 pci_nr, pci_nf, pci_od, pci_nb;
> > +     u16 emmc_nr, emmc_nf, emmc_od, emmc_nb;
> > +
> > +     if (!hdev->config_pll)
> > +             return;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_PLL)
> > +             return;
> > +
> > +     if (hdev->cpu_enable) {
> > +             dev_info(hdev->dev,
> > +                     "Waiting 5s for u-boot before configuring PLLs\n");
> > +             ssleep(5);
> > +     }
> > +
> > +/*
> > + * PLL possible configuration values:
> > +     {50000000,1,16,16,8},
> > +     {100000000,1,32,16,16},
> > +     {150000000,1,48,16,24},
> > +     {200000000,1,64,16,32},
> > +     {250000000,1,70,14,35},
> > +     {300000000,1,60,10,30},
> > +     {350000000,1,70,10,35},
> > +     {400000000,1,64,8,32},
> > +     {450000000,1,54,6,27},
> > +     {500000000,1,60,6,30},
> > +     {550000000,1,66,6,33},
> > +     {600000000,1,48,4,24},
> > +     {650000000,1,52,4,26},
> > +     {700000000,1,56,4,28},
> > +     {750000000,1,60,4,30},
> > +     {800000000,1,64,4,32},
> > +     {850000000,1,68,4,34},
> > +     {900000000,1,36,2,18},
> > +     {950000000,1,38,2,19},
> > +     {1000000000,1,40,2,20},
> > +     {1050000000,1,42,2,21},
> > +     {1100000000,1,44,2,22},
> > +     {1150000000,1,46,2,23},
> > +     {1200000000,1,48,2,24},
> > +     {1250000000,1,50,2,25},
> > +     {1300000000,1,52,2,26},
> > +     {1350000000,1,54,2,27},
> > +     {1400000000,1,56,2,28},
> > +     {1450000000,1,58,2,29},
> > +     {1500000000,1,60,2,30},
> > +     {1550000000,1,62,2,31},
>
> Some explanation about the correspondence of these values to _nr, _nf, _od
> and _nb would be helpful.

So actually this function is only relevant when working on Palladium. I
think I will just remove it.
PLLs are initialized and maintained by the F/W.
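
For what it's worth, the rows of the table in the comment are numerically
consistent with reading each row as {freq, nr, nf, od, nb}, with
freq = 50 MHz * nf / (nr * od) and nb = nf / 2. Both the column order and
the 50 MHz reference clock are inferences from the numbers, not something
stated in this thread, and the values actually assigned in the code (for
example hbw_nf = 63, hbw_od = 15) are not modelled here. A small sketch
that checks a few rows under those assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed reference clock; inferred from the table, not from the thread. */
#define ASSUMED_REF_HZ 50000000ULL

#define PLL_ROW_COUNT (sizeof(pll_rows) / sizeof(pll_rows[0]))

/* One table row, read as {freq, nr, nf, od, nb} (assumed column order). */
struct pll_cfg {
	uint64_t freq_hz;
	unsigned int nr, nf, od, nb;
};

/* A few rows copied from the comment in the patch. */
static const struct pll_cfg pll_rows[] = {
	{   50000000ULL, 1, 16, 16,  8 },
	{  250000000ULL, 1, 70, 14, 35 },
	{  900000000ULL, 1, 36,  2, 18 },
	{ 1550000000ULL, 1, 62,  2, 31 },
};

/* Output frequency under the assumed relation ref * nf / (nr * od). */
static uint64_t pll_freq_hz(const struct pll_cfg *c)
{
	assert(c->nr != 0 && c->od != 0);
	return ASSUMED_REF_HZ * c->nf / ((uint64_t)c->nr * c->od);
}

/* A row is consistent if the frequency matches and nb is half of nf. */
static int pll_row_consistent(const struct pll_cfg *c)
{
	return pll_freq_hz(c) == c->freq_hz && c->nb * 2 == c->nf;
}
```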

>
> > +*/
> > +
> > +     if (hdev->pldm) {
>
>                 /* ? MHz */
>
> > +             hbw_nr  = 4, hbw_nf  = 302, hbw_od  = 1, hbw_nb  = 151;
> > +             cpu_nr  = 0, cpu_nf  = 47, cpu_od  = 1, cpu_nb  = 32;
> > +             mc_nr   = 1, mc_nf   = 159, mc_od   = 9, mc_nb   = 79;
> > +             pci_nr  = 4, pci_nf  = 343, pci_od  = 3, pci_nb  = 171;
> > +             emmc_nr = 24, emmc_nf = 415, emmc_od = 15, emmc_nb = 207;
> > +     } else {
> > +             /* 200MHz */
> > +             hbw_nr  = 0, hbw_nf  = 63, hbw_od  = 15, hbw_nb  = 31;
> > +             cpu_nr  = 0, cpu_nf  = 47, cpu_od  = 1, cpu_nb  = 23;
> > +             mc_nr   = 2, mc_nf   = 0x9f, mc_od   = 3, mc_nb   = 0x4f;
>
> The hex here looks inconsistent.
>
> > +             pci_nr  = 4, pci_nf  = 343, pci_od  = 3, pci_nb  = 171;
> > +             emmc_nr = 24, emmc_nf = 415, emmc_od = 15, emmc_nb = 207;
> > +     }
> > +
> > +     /* Adjust divider for SPI */
> > +     WREG32(mmPSOC_SPI_BAUDR, 8);
> > +
> > +     WREG32(mmCPU_PLL_RST, 1);
> > +     WREG32(mmCPU_PLL_NR, cpu_nr);
> > +     WREG32(mmCPU_PLL_NF, cpu_nf);
> > +     WREG32(mmCPU_PLL_OD, cpu_od);
> > +     WREG32(mmCPU_PLL_NB, cpu_nb);
> > +     WREG32(mmCPU_PLL_DATA_CHNG, 0x11);
> > +
> > +     /* delay before taking PLL out of reset */
> > +     udelay(100);
> > +
>
> [ ... ]
>
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_PLL;
> > +}
> > +
> > +static void goya_set_pll_refclk(struct hl_device *hdev)
> > +{
> > +     WREG32(mmCPU_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmCPU_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmCPU_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmCPU_PLL_DIV_SEL_3, 0x0);
> > +
> > +     WREG32(mmIC_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmIC_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmIC_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmIC_PLL_DIV_SEL_3, 0x0);
> > +
> > +     WREG32(mmMC_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmMC_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmMC_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmMC_PLL_DIV_SEL_3, 0x0);
> > +
> > +     WREG32(mmPSOC_MME_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmPSOC_MME_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmPSOC_MME_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmPSOC_MME_PLL_DIV_SEL_3, 0x0);
> > +
> > +     WREG32(mmPSOC_PCI_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmPSOC_PCI_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmPSOC_PCI_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmPSOC_PCI_PLL_DIV_SEL_3, 0x0);
> > +
> > +     WREG32(mmPSOC_EMMC_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmPSOC_EMMC_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmPSOC_EMMC_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmPSOC_EMMC_PLL_DIV_SEL_3, 0x0);
> > +
> > +     WREG32(mmTPC_PLL_DIV_SEL_0, 0x0);
> > +     WREG32(mmTPC_PLL_DIV_SEL_1, 0x0);
> > +     WREG32(mmTPC_PLL_DIV_SEL_2, 0x0);
> > +     WREG32(mmTPC_PLL_DIV_SEL_3, 0x0);
> > +}
> > +
> > +static void goya_disable_clk_rlx(struct hl_device *hdev)
> > +{
> > +     WREG32(mmPSOC_MME_PLL_CLK_RLX_0, 0x100010);
> > +     WREG32(mmIC_PLL_CLK_RLX_0, 0x100010);
> > +}
> > +
> > +/**
> > + * goya_init_ddr_ch0 - Initialize DDR CH0 controller of the chip
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static void goya_init_ddr_ch0(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 val;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_DDR_0)
> > +             return;
> > +
> > +     val = RREG32(mmDDR_MISC_CH0_CFG_DONE);
> > +     if (val & DDR_MISC_CH0_CFG_DONE_CFG_DONE_MASK) {
> > +             goya->hw_cap_initialized |= HW_CAP_DDR_0;
> > +             return;
> > +     }
> > +
> > +     WREG32(mmDDR_MC_CH0_DBG1, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000001);
> > +
> > +     val = RREG32(mmDDR_MC_CH0_STAT);
> > +
> > +     WREG32(mmDDR_MC_CH0_MSTR, 0x81040210);
> > +     WREG32(mmDDR_MC_CH0_MRCTRL0, 0x4000a0f0);
> > +     WREG32(mmDDR_MC_CH0_MRCTRL1, 0x00022ad0);
> > +     WREG32(mmDDR_MC_CH0_MRCTRL2, 0x091629e1);
> > +     WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000008);
> > +     WREG32(mmDDR_MC_CH0_PWRTMG, 0x00040002);
> > +     WREG32(mmDDR_MC_CH0_HWLPCTL, 0x00be0002);
> > +     WREG32(mmDDR_MC_CH0_RFSHCTL0, 0x0091f020);
> > +     WREG32(mmDDR_MC_CH0_RFSHCTL1, 0x00120018);
> > +     WREG32((mmDDR_MC_CH0_MSTR + 0x00000058), 0x00160005);
> > +     WREG32(mmDDR_MC_CH0_RFSHCTL3, 0x00000020);
> > +     WREG32(mmDDR_MC_CH0_RFSHTMG, 0x003000d0);
> > +     WREG32(mmDDR_MC_CH0_ECCCFG0, 0x00000010);
> > +     WREG32(mmDDR_MC_CH0_ECCCFG1, 0x00000002);
> > +     WREG32(mmDDR_MC_CH0_ECCCTL, 0x00000300);
> > +     WREG32(mmDDR_MC_CH0_ECCPOISONADDR0, 0x00000078);
> > +     WREG32(mmDDR_MC_CH0_ECCPOISONADDR1, 0x100062f7);
> > +     WREG32(mmDDR_MC_CH0_CRCPARCTL0, 0x00008000);
> > +     WREG32(mmDDR_MC_CH0_CRCPARCTL1, 0x0e088301);
> > +     WREG32(mmDDR_MC_CH0_CRCPARCTL2, 0x00600527);
> > +     WREG32(mmDDR_MC_CH0_INIT0, 0x00070002);
> > +     WREG32(mmDDR_MC_CH0_INIT1, 0x0001000e);
> > +     WREG32(mmDDR_MC_CH0_INIT3, 0x0c510001);
> > +     WREG32(mmDDR_MC_CH0_INIT4, 0x00280400);
> > +     WREG32(mmDDR_MC_CH0_INIT5, 0x00110000);
> > +     WREG32(mmDDR_MC_CH0_INIT6, 0x02000643);
> > +     WREG32(mmDDR_MC_CH0_INIT7, 0x00001000);
> > +     WREG32(mmDDR_MC_CH0_DIMMCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_RANKCTL, 0x000009a0);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG0, 0x1918361a);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG1, 0x00080724);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG2, 0x080d0713);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG3, 0x00012012);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG4, 0x0b04060b);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG5, 0x0a0c0804);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG8, 0x0606490c);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG9, 0x0002050f);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG10, 0x000e0d0f);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG11, 0x270b011f);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG12, 0x00000010);
> > +     WREG32(mmDDR_MC_CH0_DRAMTMG15, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_ZQCTL0, 0x31000040);
> > +     WREG32(mmDDR_MC_CH0_ZQCTL1, 0x00000070);
> > +     WREG32(mmDDR_MC_CH0_DFITMG0, 0x05978211);
> > +     WREG32(mmDDR_MC_CH0_DFITMG1, 0x00080101);
> > +     WREG32(mmDDR_MC_CH0_DFILPCFG0, 0x07006031);
> > +     WREG32(mmDDR_MC_CH0_DFILPCFG1, 0x00000010);
> > +     WREG32(mmDDR_MC_CH0_DFIUPD0, 0x40400018);
> > +     WREG32(mmDDR_MC_CH0_DFIUPD1, 0x000b0046);
> > +     WREG32(mmDDR_MC_CH0_DFIUPD2, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH0_DFITMG2, 0x00001711);
> > +     WREG32(mmDDR_MC_CH0_DFITMG3, 0x0000001e);
> > +     WREG32(mmDDR_MC_CH0_DBICTL, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_DFIPHYMSTR, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP0, 0x00001f1f);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP1, 0x003f1503);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP2, 0x01000400);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP3, 0x04000505);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP4, 0x00001f1f);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP5, 0x06060303);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP6, 0x0f050709);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP7, 0x00000f0f);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP8, 0x00003f01);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP9, 0x09000606);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP10, 0x02090105);
> > +     WREG32(mmDDR_MC_CH0_ADDRMAP11, 0x0000000a);
> > +     WREG32(mmDDR_MC_CH0_ODTCFG, 0x09090a08);
> > +     WREG32(mmDDR_MC_CH0_ODTMAP, 0x9ae1b5fe);
> > +     WREG32(mmDDR_MC_CH0_SCHED, 0x664d3700);
> > +     WREG32(mmDDR_MC_CH0_SCHED1, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_PERFHPR1, 0x1700e024);
> > +     WREG32(mmDDR_MC_CH0_PERFLPR1, 0x1e00836c);
> > +     WREG32(mmDDR_MC_CH0_PERFWR1, 0x260046c9);
> > +     WREG32(mmDDR_MC_CH0_DQMAP0, 0x0d2b3503);
> > +     WREG32(mmDDR_MC_CH0_DQMAP1, 0x042a0537);
> > +     WREG32(mmDDR_MC_CH0_DQMAP2, 0x330b2806);
> > +     WREG32(mmDDR_MC_CH0_DQMAP3, 0x27013803);
> > +     WREG32(mmDDR_MC_CH0_DQMAP4, 0x0000022c);
> > +     WREG32(mmDDR_MC_CH0_DQMAP5, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_DBG0, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_DBG1, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_DBGCMD, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_SWCTL, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_POISONCFG, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_ADVECCINDEX, 0x00000004);
> > +     WREG32(mmDDR_MC_CH0_ECCPOISONPAT0, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_ECCPOISONPAT1, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_ECCPOISONPAT2, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_CAPARPOISONCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_PCCFG, 0x00000011);
> > +     WREG32(mmDDR_MC_CH0_PCFGR_0, 0x0000518c);
> > +     WREG32(mmDDR_MC_CH0_PCFGW_0, 0x00001263);
> > +     WREG32(mmDDR_MC_CH0_PCTRL_0, 0x00000001);
> > +     WREG32(mmDDR_MC_CH0_PCFGQOS0_0, 0x0011000e);
> > +     WREG32(mmDDR_MC_CH0_SBRCTL, 0x0016b540);
> > +     WREG32(mmDDR_MC_CH0_SBRWDATA0, 0x8c1d1786);
> > +     WREG32(mmDDR_MC_CH0_SBRWDATA1, 0x265f03dd);
> > +
> > +     val = RREG32(mmDDR_MC_CH0_RFSHCTL3);
> > +
> > +     WREG32(mmDDR_MISC_CH0_CFG_DONE, 0x00000001);
> > +
> > +     WREG32(mmDDR_MC_CH0_DBG1, 0x00000000);
> > +
> > +     val = RREG32(mmDDR_MC_CH0_PWRCTL);
> > +
> > +     WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000002);
> > +
> > +     val = RREG32(mmDDR_MC_CH0_PWRCTL);
> > +
> > +     WREG32(mmDDR_MC_CH0_PWRCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_SWCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000060);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000040);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH0_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH0_PCTRL_0, 0x00000001);
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_DDR_0;
> > +}
> > +
> > +/**
> > + * goya_init_ddr_ch1 - Initialize DDR CH1 controller of the chip
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static void goya_init_ddr_ch1(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 val;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_DDR_1)
> > +             return;
> > +
> > +     val = RREG32(mmDDR_MISC_CH1_CFG_DONE);
> > +     if (val & DDR_MISC_CH1_CFG_DONE_CFG_DONE_MASK) {
> > +             goya->hw_cap_initialized |= HW_CAP_DDR_1;
> > +             return;
> > +     }
> > +
> > +     WREG32(mmDDR_MC_CH1_DBG1, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000001);
> > +
> > +     val = RREG32(mmDDR_MC_CH1_STAT);
> > +
> > +     WREG32(mmDDR_MC_CH1_MSTR, 0x81040210);
> > +     WREG32(mmDDR_MC_CH1_MRCTRL0, 0x4000a0f0);
> > +     WREG32(mmDDR_MC_CH1_MRCTRL1, 0x00022ad0);
> > +     WREG32(mmDDR_MC_CH1_MRCTRL2, 0x091629e1);
> > +     WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000008);
> > +     WREG32(mmDDR_MC_CH1_PWRTMG, 0x00040002);
> > +     WREG32(mmDDR_MC_CH1_HWLPCTL, 0x00be0002);
> > +     WREG32(mmDDR_MC_CH1_RFSHCTL0, 0x0091f020);
> > +     WREG32(mmDDR_MC_CH1_RFSHCTL1, 0x00120018);
> > +     WREG32((mmDDR_MC_CH1_MSTR + 0x00000058), 0x00160005);
> > +     WREG32(mmDDR_MC_CH1_RFSHCTL3, 0x00000020);
> > +     WREG32(mmDDR_MC_CH1_RFSHTMG, 0x003000d0);
> > +     WREG32(mmDDR_MC_CH1_ECCCFG0, 0x00000010);
> > +     WREG32(mmDDR_MC_CH1_ECCCFG1, 0x00000002);
> > +     WREG32(mmDDR_MC_CH1_ECCCTL, 0x00000300);
> > +     WREG32(mmDDR_MC_CH1_ECCPOISONADDR0, 0x00000078);
> > +     WREG32(mmDDR_MC_CH1_ECCPOISONADDR1, 0x100062f7);
> > +     WREG32(mmDDR_MC_CH1_CRCPARCTL0, 0x00008000);
> > +     WREG32(mmDDR_MC_CH1_CRCPARCTL1, 0x0e088301);
> > +     WREG32(mmDDR_MC_CH1_CRCPARCTL2, 0x00600527);
> > +     WREG32(mmDDR_MC_CH1_INIT0, 0x00070002);
> > +     WREG32(mmDDR_MC_CH1_INIT1, 0x0001000e);
> > +     WREG32(mmDDR_MC_CH1_INIT3, 0x0c510001);
> > +     WREG32(mmDDR_MC_CH1_INIT4, 0x00280400);
> > +     WREG32(mmDDR_MC_CH1_INIT5, 0x00110000);
> > +     WREG32(mmDDR_MC_CH1_INIT6, 0x02000643);
> > +     WREG32(mmDDR_MC_CH1_INIT7, 0x00001000);
> > +     WREG32(mmDDR_MC_CH1_DIMMCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_RANKCTL, 0x000009a0);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG0, 0x1918361a);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG1, 0x00080724);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG2, 0x080d0713);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG3, 0x00012012);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG4, 0x0b04060b);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG5, 0x0a0c0804);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG8, 0x0606490c);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG9, 0x0002050f);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG10, 0x000e0d0f);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG11, 0x270b011f);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG12, 0x00000010);
> > +     WREG32(mmDDR_MC_CH1_DRAMTMG15, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_ZQCTL0, 0x31000040);
> > +     WREG32(mmDDR_MC_CH1_ZQCTL1, 0x00000070);
> > +     WREG32(mmDDR_MC_CH1_DFITMG0, 0x05978211);
> > +     WREG32(mmDDR_MC_CH1_DFITMG1, 0x00080101);
> > +     WREG32(mmDDR_MC_CH1_DFILPCFG0, 0x07006031);
> > +     WREG32(mmDDR_MC_CH1_DFILPCFG1, 0x00000010);
> > +     WREG32(mmDDR_MC_CH1_DFIUPD0, 0x40400018);
> > +     WREG32(mmDDR_MC_CH1_DFIUPD1, 0x000b0046);
> > +     WREG32(mmDDR_MC_CH1_DFIUPD2, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH1_DFITMG2, 0x00001711);
> > +     WREG32(mmDDR_MC_CH1_DFITMG3, 0x0000001e);
> > +     WREG32(mmDDR_MC_CH1_DBICTL, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_DFIPHYMSTR, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP0, 0x00001f1f);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP1, 0x003f1503);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP2, 0x01000400);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP3, 0x04000505);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP4, 0x00001f1f);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP5, 0x06060303);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP6, 0x0f050709);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP7, 0x00000f0f);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP8, 0x00003f01);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP9, 0x09000606);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP10, 0x02090105);
> > +     WREG32(mmDDR_MC_CH1_ADDRMAP11, 0x0000000a);
> > +     WREG32(mmDDR_MC_CH1_ODTCFG, 0x09090a08);
> > +     WREG32(mmDDR_MC_CH1_ODTMAP, 0x9ae1b5fe);
> > +     WREG32(mmDDR_MC_CH1_SCHED, 0x664d3700);
> > +     WREG32(mmDDR_MC_CH1_SCHED1, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_PERFHPR1, 0x1700e024);
> > +     WREG32(mmDDR_MC_CH1_PERFLPR1, 0x1e00836c);
> > +     WREG32(mmDDR_MC_CH1_PERFWR1, 0x260046c9);
> > +     WREG32(mmDDR_MC_CH1_DQMAP0, 0x0d2b3503);
> > +     WREG32(mmDDR_MC_CH1_DQMAP1, 0x042a0537);
> > +     WREG32(mmDDR_MC_CH1_DQMAP2, 0x330b2806);
> > +     WREG32(mmDDR_MC_CH1_DQMAP3, 0x27013803);
> > +     WREG32(mmDDR_MC_CH1_DQMAP4, 0x0000022c);
> > +     WREG32(mmDDR_MC_CH1_DQMAP5, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_DBG0, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_DBG1, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_DBGCMD, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_SWCTL, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_POISONCFG, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_ADVECCINDEX, 0x00000004);
> > +     WREG32(mmDDR_MC_CH1_ECCPOISONPAT0, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_ECCPOISONPAT1, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_ECCPOISONPAT2, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_CAPARPOISONCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_PCCFG, 0x00000011);
> > +     WREG32(mmDDR_MC_CH1_PCFGR_0, 0x0000518c);
> > +     WREG32(mmDDR_MC_CH1_PCFGW_0, 0x00001263);
> > +     WREG32(mmDDR_MC_CH1_PCTRL_0, 0x00000001);
> > +     WREG32(mmDDR_MC_CH1_PCFGQOS0_0, 0x0011000e);
> > +     WREG32(mmDDR_MC_CH1_SBRCTL, 0x0016b540);
> > +     WREG32(mmDDR_MC_CH1_SBRWDATA0, 0x8c1d1786);
> > +     WREG32(mmDDR_MC_CH1_SBRWDATA1, 0x265f03dd);
> > +
> > +     val = RREG32(mmDDR_MC_CH1_RFSHCTL3);
> > +
> > +     WREG32(mmDDR_MISC_CH1_CFG_DONE, 0x00000001);
> > +
> > +     WREG32(mmDDR_MC_CH1_DBG1, 0x00000000);
> > +
> > +     val = RREG32(mmDDR_MC_CH1_PWRCTL);
> > +
> > +     WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000002);
> > +
> > +     val = RREG32(mmDDR_MC_CH1_PWRCTL);
> > +
> > +     WREG32(mmDDR_MC_CH1_PWRCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_SWCTL, 0x00000000);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000060);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000040);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH1_DFIMISC, 0x00000041);
> > +     WREG32(mmDDR_MC_CH1_PCTRL_0, 0x00000001);
>
> The initialization sequence for the second DDR channel looks really similar
> to that of the first channel.
> I would guess their control registers have identical offsets from some base
> address. If this is the case the DDR initialization can be factored out and
> get that base address as a parameter.
>
Again, this function is only relevant when working on Palladium. I will
just remove it. The DDR is initialized by the F/W.
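
For illustration only, here is a rough user-space sketch of the factoring
suggested above, assuming both channels really do share one register layout
relative to a per-channel base. The base addresses, the register offsets and
the fake register file are all hypothetical, not the real Goya register map.
Only the three values are taken from the quoted channel 1 sequence.

```c
#include <assert.h>
#include <stdint.h>

/*
 * All offsets below are hypothetical stand-ins, NOT the real Goya register
 * map; they only illustrate "same layout, different base".
 */
#define FAKE_REG_BYTES		0x2000u
#define DDR_MC_CH0_BASE		0x0000u
#define DDR_MC_CH1_BASE		0x1000u
#define DDR_MC_MSTR		0x000u
#define DDR_MC_PWRCTL		0x030u
#define DDR_MC_DFIMISC		0x1b0u

/* User-space stand-in for the device register space. */
static uint32_t fake_regs[FAKE_REG_BYTES / 4];

static void wreg32(uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	fake_regs[off / 4] = val;
}

static uint32_t rreg32(uint32_t off)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	return fake_regs[off / 4];
}

/* One init routine for both channels; only the base address differs. */
static void ddr_init_channel(uint32_t base)
{
	wreg32(base + DDR_MC_MSTR, 0x81040210);
	wreg32(base + DDR_MC_PWRCTL, 0x00000008);
	wreg32(base + DDR_MC_DFIMISC, 0x00000041);
}
```

With this shape the two per-channel functions would collapse into two calls,
ddr_init_channel(DDR_MC_CH0_BASE) and ddr_init_channel(DDR_MC_CH1_BASE).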


> > +
> > +     goya->hw_cap_initialized |= HW_CAP_DDR_1;
> > +}
> > +
> > +static void _goya_tpc_mbist_workaround(struct hl_device *hdev, u8 tpc_id)
> > +{
> > +     u64 tpc_eml_address;
> > +     u32 val, tpc_offset, tpc_eml_offset, tpc_slm_offset;
> > +     int err, slm_index;
> > +
> > +     WARN_ON(tpc_id >= TPC_MAX_NUM);
>
> Is it safe to continue if tpc_id >= TPC_MAX_NUM?
No, but I also think the check is not needed, because this is a static
function that is called from only one place, inside a well-defined for
loop. If I check this parameter, I will need to check every parameter
of every static function. Bottom line: I will remove this line.
>
> > +     tpc_offset = tpc_id * 0x40000;
> > +     tpc_eml_offset = tpc_id * 0x200000;
> > +     tpc_eml_address = (mmTPC0_EML_CFG_BASE + tpc_eml_offset - CFG_BASE);
> > +     tpc_slm_offset = tpc_eml_address + 0x100000;
> > +
> > +     /*
> > +      * Workaround for Bug H2 #2443 :
> > +      * "TPC SB is not initialized on chip reset"
> > +      */
> > +
> > +     val = RREG32(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset);
> > +     if (val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_ACTIVE_MASK)
> > +             dev_warn(hdev->dev, "TPC%d MBIST ACTIVE is not cleared\n",
> > +                     tpc_id);
> > +
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_PAT + tpc_offset, val & 0xFFFFF000);
> > +
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_0 + tpc_offset, 0x37FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_1 + tpc_offset, 0x303F);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_2 + tpc_offset, 0x71FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_3 + tpc_offset, 0x71FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_4 + tpc_offset, 0x70FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_5 + tpc_offset, 0x70FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_6 + tpc_offset, 0x70FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_7 + tpc_offset, 0x70FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_8 + tpc_offset, 0x70FF);
> > +     WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_9 + tpc_offset, 0x70FF);
> > +
> > +     WREG32_OR(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
> > +             1 << TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_START_SHIFT);
> > +
> > +     err = hl_poll_timeout(
> > +             hdev,
> > +             mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
> > +             val,
> > +             (val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_DONE_MASK),
> > +             1000,
> > +             HL_DEVICE_TIMEOUT_USEC);
> > +
> > +     if (err)
> > +             dev_err(hdev->dev,
> > +                     "Timeout while waiting for TPC%d MBIST DONE\n", tpc_id);
> > +
> > +     WREG32_OR(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
> > +             1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT);
> > +
> > +     msleep(GOYA_RESET_WAIT_MSEC);
> > +
> > +     WREG32_AND(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
> > +             ~(1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT));
> > +
> > +     msleep(GOYA_RESET_WAIT_MSEC);
> > +
> > +     for (slm_index = 0 ; slm_index < 256 ; slm_index++)
> > +             WREG32(tpc_slm_offset + (slm_index << 2), 0);
> > +
> > +     val = RREG32(tpc_slm_offset);
> > +
> > +     WREG32(mmTPC0_CFG_BASE + tpc_offset + 0xF40 - CFG_BASE, 0x100);
> > +}
> > +
> > +static void goya_tpc_mbist_workaround(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int i;
> > +
> > +     if (hdev->pldm)
> > +             return;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_TPC_MBIST)
> > +             return;
> > +
> > +     /* Workaround for H2 #2443 */
> > +
> > +     for (i = 0 ; i < TPC_MAX_NUM ; i++)
> > +             _goya_tpc_mbist_workaround(hdev, i);
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_TPC_MBIST;
> > +}
> > +
> > +/**
> > + * goya_init_golden_registers - Initialize golden registers
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Initialize the H/W registers of the device
> > + *
> > + */
> > +static void goya_init_golden_registers(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 polynom[10], tpc_intr_mask;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_GOLDEN)
> > +             return;
> > +
> > +     polynom[0] = 0x00020080;
> > +     polynom[1] = 0x00401000;
> > +     polynom[2] = 0x00200800;
> > +     polynom[3] = 0x00002000;
> > +     polynom[4] = 0x00080200;
> > +     polynom[5] = 0x00040100;
> > +     polynom[6] = 0x00100400;
> > +     polynom[7] = 0x00004000;
> > +     polynom[8] = 0x00010000;
> > +     polynom[9] = 0x00008000;
> > +
> > +     /* Mask all arithmetic interrupts from TPC */
> > +     tpc_intr_mask = 0x7FFF;
> > +
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmDMA_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmDMA_NRTR_SCRAMB_EN, 1 << DMA_NRTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmDMA_NRTR_NON_LIN_SCRAMB,
> > +                     1 << DMA_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmSRAM_Y5_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y4_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y3_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y2_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y1_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y0_X0_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y5_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y4_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y3_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y2_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y1_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y0_X1_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y5_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y4_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y3_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y2_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y1_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y0_X2_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y5_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y4_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y3_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y2_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y1_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y0_X3_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y5_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y4_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y3_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y2_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y1_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
> > +     WREG32(mmSRAM_Y0_X4_RTR_HBW_RD_RQ_L_ARB, 0x302);
>
> Any chance this can be done in a loop?
fixed
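
A minimal user-space sketch of what such a loop could look like, assuming
the SRAM router blocks sit at a regular stride in both X and Y. The strides,
the register offset and the fake register file are hypothetical and were
chosen only to keep the example small. The 5x6 grid and the 0x302 value come
from the quoted writes.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_REG_BYTES			0x800u
#define SRAM_RTR_NUM_X			5u	/* X0..X4, as in the quoted writes */
#define SRAM_RTR_NUM_Y			6u	/* Y0..Y5, as in the quoted writes */
#define SRAM_RTR_X_STRIDE		0x100u	/* assumed, not the real stride */
#define SRAM_RTR_Y_STRIDE		0x10u	/* assumed, not the real stride */
#define SRAM_RTR_HBW_RD_RQ_L_ARB	0x0u	/* assumed offset inside a router */

/* User-space stand-in for the device register space. */
static uint32_t fake_regs[FAKE_REG_BYTES / 4];

static void wreg32(uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	fake_regs[off / 4] = val;
}

static uint32_t rreg32(uint32_t off)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	return fake_regs[off / 4];
}

/* Write the same value to one register in every SRAM router of the grid. */
static void sram_rtr_write_all(uint32_t reg, uint32_t val)
{
	uint32_t x, y;

	for (x = 0; x < SRAM_RTR_NUM_X; x++)
		for (y = 0; y < SRAM_RTR_NUM_Y; y++)
			wreg32(reg + x * SRAM_RTR_X_STRIDE +
			       y * SRAM_RTR_Y_STRIDE, val);
}
```

A single call such as sram_rtr_write_all(SRAM_RTR_HBW_RD_RQ_L_ARB, 0x302)
would then stand in for the 30 individual writes.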
>
> > +     WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_L_ARB, 0x204);
> > +     WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_L_ARB, 0x204);
>
> Ditto.
fixed
>
> > +     WREG32(mmSRAM_Y5_X0_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y4_X0_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y3_X0_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y2_X0_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y1_X0_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y5_X1_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y4_X1_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y3_X1_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y2_X1_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y1_X1_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y5_X2_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y4_X2_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y3_X2_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y2_X2_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y1_X2_RTR_HBW_DATA_E_ARB, 0x206);
> > +     WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_E_ARB, 0x206);
>
> And here and below as well.
fixed
>
> > +     WREG32(mmSRAM_Y5_X3_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y4_X3_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y3_X3_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y2_X3_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y1_X3_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y5_X4_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y4_X4_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y3_X4_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y2_X4_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y1_X4_RTR_HBW_DATA_E_ARB, 0x207);
> > +     WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_E_ARB, 0x207);
>
fixed
> [ ... ]
>
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmMME1_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmMME2_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmMME3_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmMME4_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmMME5_RTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmMME6_RTR_SPLIT_COEF_9, polynom[9] >> 7);
>
> This sequence seems to repeat itself. If the register map permits I'd
> suggest splitting writes of the polynom[] to registers into a helper
> function.
>
fixed with a loop
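
A rough user-space sketch of such a helper, assuming SPLIT_COEF_0..9 are
consecutive 32-bit registers inside each router block. The two block offsets
and the fake register file are hypothetical. The polynom[] values and the
shift by 7 are taken from the quoted code.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_REG_BYTES		0x400u
#define NUM_SPLIT_COEFS		10u
#define MME1_RTR_SPLIT_COEF_0	0x100u	/* assumed offset, illustration only */
#define MME2_RTR_SPLIT_COEF_0	0x200u	/* assumed offset, illustration only */

/* Same values as polynom[] in goya_init_golden_registers(). */
static const uint32_t polynom[NUM_SPLIT_COEFS] = {
	0x00020080, 0x00401000, 0x00200800, 0x00002000, 0x00080200,
	0x00040100, 0x00100400, 0x00004000, 0x00010000, 0x00008000,
};

/* User-space stand-in for the device register space. */
static uint32_t fake_regs[FAKE_REG_BYTES / 4];

static void wreg32(uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	fake_regs[off / 4] = val;
}

static uint32_t rreg32(uint32_t off)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	return fake_regs[off / 4];
}

/* Write all ten split coefficients, starting at the block's COEF_0. */
static void write_split_coefs(uint32_t coef0)
{
	uint32_t i;

	for (i = 0; i < NUM_SPLIT_COEFS; i++)
		wreg32(coef0 + 4 * i, polynom[i] >> 7);
}
```

Each router block would then need one call, for example
write_split_coefs(MME1_RTR_SPLIT_COEF_0), instead of ten WREG32() lines.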
> > +
> > +     WREG32(mmMME1_RTR_SCRAMB_EN, 1 << MME1_RTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmMME1_RTR_NON_LIN_SCRAMB,
> > +                     1 << MME1_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmMME2_RTR_SCRAMB_EN, 1 << MME2_RTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmMME2_RTR_NON_LIN_SCRAMB,
> > +                     1 << MME2_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmMME3_RTR_SCRAMB_EN, 1 << MME3_RTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmMME3_RTR_NON_LIN_SCRAMB,
> > +                     1 << MME3_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmMME4_RTR_SCRAMB_EN, 1 << MME4_RTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmMME4_RTR_NON_LIN_SCRAMB,
> > +                     1 << MME4_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmMME5_RTR_SCRAMB_EN, 1 << MME5_RTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmMME5_RTR_NON_LIN_SCRAMB,
> > +                     1 << MME5_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmMME6_RTR_SCRAMB_EN, 1 << MME6_RTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmMME6_RTR_NON_LIN_SCRAMB,
> > +                     1 << MME6_RTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmTPC0_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmTPC0_NRTR_SCRAMB_EN, 1 << TPC0_NRTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmTPC0_NRTR_NON_LIN_SCRAMB,
> > +                     1 << TPC0_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
>
fixed
> [ ... ]
>
> > +     /*
> > +      * Workaround for Bug H2 #2441 :
> > +      * "ST.NOP set trace event illegal opcode"
> > +      */
> > +     WREG32(mmTPC6_CFG_TPC_INTR_MASK, tpc_intr_mask);
> > +
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmTPC7_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmTPC7_NRTR_SCRAMB_EN, 1 << TPC7_NRTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmTPC7_NRTR_NON_LIN_SCRAMB,
> > +                     1 << TPC7_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
> > +     /*
> > +      * Workaround for Bug H2 #2441 :
> > +      * "ST.NOP set trace event illegal opcode"
> > +      */
> > +     WREG32(mmTPC7_CFG_TPC_INTR_MASK, tpc_intr_mask);
> > +
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_0, polynom[0] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_1, polynom[1] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_2, polynom[2] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_3, polynom[3] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_4, polynom[4] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_5, polynom[5] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_6, polynom[6] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_7, polynom[7] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_8, polynom[8] >> 7);
> > +     WREG32(mmPCI_NRTR_SPLIT_COEF_9, polynom[9] >> 7);
> > +
> > +     WREG32(mmPCI_NRTR_SCRAMB_EN, 1 << PCI_NRTR_SCRAMB_EN_VAL_SHIFT);
> > +     WREG32(mmPCI_NRTR_NON_LIN_SCRAMB,
> > +                     1 << PCI_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
> > +
>
> I think all these long sequences of register writes could be grouped into
> something like
>
> struct regs_write_seq {
>         unsigned long addr;
>         unsigned long val;
> };
>
> const struct regs_write_seq golden_regs1[] = {
>         ...
> };
>
> const struct regs_write_seq workaround_bug_2411[] = {
>         ...
> };
>
> and written with a helper function looping over such array.
>
Personally, I don't like this method very much. I converted the writes to
loops wherever possible. I hope that is good enough.
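
For comparison, a minimal user-space sketch of the table-driven variant
proposed above. The struct name follows the suggestion; write_reg_seq(),
the register offsets and the fake register file are hypothetical. The two
values are the DMA_CH_0_CFG0 and DMA_CH_1_CFG0 settings from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_REG_BYTES	0x100u
#define DMA_CH_0_CFG0	0x00u	/* assumed offset, illustration only */
#define DMA_CH_1_CFG0	0x40u	/* assumed offset, illustration only */
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

struct regs_write_seq {
	uint32_t addr;
	uint32_t val;
};

/* One table per logical group of writes. */
static const struct regs_write_seq dma_cfg_seq[] = {
	{ DMA_CH_0_CFG0, 0x0fff0010 },
	{ DMA_CH_1_CFG0, 0x0fff00F0 },
};

/* User-space stand-in for the device register space. */
static uint32_t fake_regs[FAKE_REG_BYTES / 4];

static void wreg32(uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	fake_regs[off / 4] = val;
}

static uint32_t rreg32(uint32_t off)
{
	assert(off % 4 == 0 && off < FAKE_REG_BYTES);
	return fake_regs[off / 4];
}

/* Apply the table entries in order. */
static void write_reg_seq(const struct regs_write_seq *seq, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		wreg32(seq[i].addr, seq[i].val);
}
```

The trade-off is that the table keeps the values in one reviewable place,
while plain loops keep the address arithmetic visible at the call site.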

> > +     /*
> > +      * Workaround for H2 #HW-23 bug
> > +      * Set DMA max outstanding read requests to 240 on DMA CH 1. Set it
> > +      * to 16 on KMD DMA
> > +      * We need to limit only these DMAs because the user can only read
> > +      * from Host using DMA CH 1
> > +      */
> > +     WREG32(mmDMA_CH_0_CFG0, 0x0fff0010);
> > +     WREG32(mmDMA_CH_1_CFG0, 0x0fff00F0);
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_GOLDEN;
> > +}
> > +
> > +
> > +/**
> > + * goya_push_uboot_to_device - Push u-boot FW code to device
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Copy u-boot fw code from firmware file to SRAM BAR.
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_push_uboot_to_device(struct hl_device *hdev)
> > +{
> > +     char fw_name[200];
> > +     const u64 *fw_data;
> > +     void __iomem *dst;
> > +     size_t fw_size, i;
> > +     int rc;
> > +
> > +     snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
> > +
> > +     rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to request u-boot fw image\n");
> > +             goto out;
> > +     }
> > +
> > +     fw_size = hdev->spl_fw->size;
> > +     if ((fw_size % 4) != 0) {
> > +             dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
> > +                     fw_size);
> > +             rc = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
> > +
> > +     fw_data = (const u64 *) hdev->spl_fw->data;
> > +     dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             fw_size -= 8;
> > +
> > +     for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> > +             if (!(i & (0x80000 - 1)))
> > +                     dev_dbg(hdev->dev,
> > +                             "u-boot copied so far %lu out of %lu",
> > +                             i, fw_size);
> > +
> > +             writeq(*fw_data, dst);
> > +     }
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             writel(*(const u32 *) fw_data, dst);
> > +
> > +out:
> > +     release_firmware(hdev->spl_fw);
> > +     return rc;
> > +}
> > +
> > +/**
> > + * goya_push_linux_to_device - Push LINUX FW code to device
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Copy LINXU fw code from firmware file to DDR BAR.
>
>           ^ Linux
>
fixed
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_push_linux_to_device(struct hl_device *hdev)
> > +{
> > +     char fw_name[200];
> > +     const u64 *fw_data;
> > +     void __iomem *dst;
> > +     size_t fw_size, i;
> > +     int rc;
> > +
> > +     snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
> > +
> > +     rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to request Linux fw image\n");
> > +             goto out;
> > +     }
> > +
> > +     fw_size = hdev->spl_fw->size;
> > +     if ((fw_size % 4) != 0) {
> > +             dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
> > +                     fw_size);
> > +             rc = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
> > +
> > +     fw_data = (const u64 *) hdev->spl_fw->data;
> > +     dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             fw_size -= 8;
> > +
> > +     for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> > +             if (!(i & (0x80000 - 1))) {
> > +                     dev_dbg(hdev->dev,
> > +                             "Linux copied so far %lu out of %lu",
> > +                             i, fw_size);
> > +                     usleep_range(20, 100);
> > +             }
> > +             writeq(*fw_data, dst);
> > +     }
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             writel(*(const u32 *) fw_data, dst);
> > +
> > +out:
> > +     release_firmware(hdev->spl_fw);
> > +     return rc;
>
> The U-Boot and Linux loading functions seem almost identical. I think
> they can be merged into a single function declared as
>
> static int goya_push_fw_to_device(struct hl_device *hdev, const char *name,
>                                   void __iomem *dst)
>
> and called twice.
>
fixed
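
A rough user-space sketch of the shared copy loop, for illustration only.
push_fw_image() and the fake_writeq()/fake_writel() helpers are hypothetical
stand-ins that copy into an ordinary buffer instead of a BAR, and the
request_firmware() part is left out. The sketch computes the bulk length by
rounding the size down to a multiple of 8, which, as far as I can tell, also
avoids the unsigned wraparound that the quoted "fw_size -= 8" would hit for
a 4-byte image.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for writeq()/writel(): copy 8 or 4 bytes to the "device". */
static void fake_writeq(const uint8_t *src, uint8_t *dst)
{
	memcpy(dst, src, 8);
}

static void fake_writel(const uint8_t *src, uint8_t *dst)
{
	memcpy(dst, src, 4);
}

/*
 * Copy a firmware image in 8-byte chunks, plus one 4-byte tail if the
 * size is not a multiple of 8. Sizes that are not a multiple of 4 are
 * rejected, as in the quoted functions.
 */
static int push_fw_image(const uint8_t *fw, size_t fw_size, uint8_t *dst)
{
	size_t bulk, i;

	assert(fw != NULL && dst != NULL);

	if (fw_size % 4 != 0)
		return -EINVAL;

	/* Largest multiple of 8 that fits in the image. */
	bulk = fw_size - (fw_size % 8);

	for (i = 0; i < bulk; i += 8)
		fake_writeq(fw + i, dst + i);

	/* After the loop i == bulk, so this writes the last 4 bytes. */
	if (fw_size % 8 != 0)
		fake_writel(fw + i, dst + i);

	return 0;
}
```

In the driver, the file name and the destination address would be the only
per-image parameters, as in the prototype suggested above.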

> > +}
> > +
> > +static int goya_pldm_init_cpu(struct hl_device *hdev)
> > +{
> > +     u32 val, unit_rst_val;
> > +     int rc;
> > +
> > +     /* Must initialize SRAM scrambler before pushing u-boot to SRAM */
> > +     goya_init_golden_registers(hdev);
> > +
> > +     /* Put ARM cores into reset */
> > +     WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
> > +     val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> > +
> > +     /* Reset the CA53 MACRO */
> > +     unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +     WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
> > +     val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +     WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
> > +     val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +
> > +     rc = goya_push_uboot_to_device(hdev);
> > +     if (rc)
> > +             return rc;
> > +
> > +     rc = goya_push_linux_to_device(hdev);
> > +     if (rc)
> > +             return rc;
> > +
> > +     WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
> > +     WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
> > +
> > +     WREG32(mmCPU_CA53_CFG_RST_ADDR_LSB_0,
> > +             lower_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
> > +     WREG32(mmCPU_CA53_CFG_RST_ADDR_MSB_0,
> > +             upper_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
> > +
> > +     /* Release ARM core 0 from reset */
> > +     WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL,
> > +                                     CPU_RESET_CORE0_DEASSERT);
> > +     val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * FW component passes an offset from SRAM_BASE_ADDR in SCRATCHPAD_xx.
> > + * The version string should be located by that offset.
> > + */
> > +static void goya_read_device_fw_version(struct hl_device *hdev,
> > +                                     enum goya_fw_component fwc)
> > +{
> > +     const char *name;
> > +     u32 ver_off;
> > +     char *dest;
> > +
> > +     switch (fwc) {
> > +     case FW_COMP_UBOOT:
> > +             ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_29);
> > +             dest = hdev->asic_prop.uboot_ver;
> > +             name = "U-Boot";
> > +             break;
> > +     case FW_COMP_PREBOOT:
> > +             ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_28);
> > +             dest = hdev->asic_prop.preboot_ver;
> > +             name = "Preboot";
> > +             break;
> > +     default:
> > +             dev_warn(hdev->dev, "Undefined FW component: %d\n", fwc);
> > +             return;
> > +     }
> > +
> > +     ver_off &= ~((u32)SRAM_BASE_ADDR);
> > +
> > +     if (ver_off < SRAM_SIZE - VERSION_MAX_LEN) {
> > +             memcpy_fromio(dest, hdev->pcie_bar[SRAM_CFG_BAR_ID] + ver_off,
> > +                                                     VERSION_MAX_LEN);
> > +     } else {
> > +             dev_err(hdev->dev, "%s version offset (0x%x) is above SRAM\n",
> > +                                                             name, ver_off);
> > +             strcpy(dest, "unavailable");
> > +     }
> > +}
> > +
> > +static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 status;
> > +     int rc;
> > +
> > +     if (!hdev->cpu_enable)
> > +             return 0;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_CPU)
> > +             return 0;
> > +
> > +     /*
> > +      * Before pushing u-boot/linux to device, need to set the ddr bar to
> > +      * base address of dram
> > +      */
> > +     rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to map DDR bar to DRAM base address\n");
> > +             return rc;
> > +     }
> > +
> > +     if (hdev->pldm) {
> > +             rc = goya_pldm_init_cpu(hdev);
> > +             if (rc)
> > +                     return rc;
> > +
> > +             goto out;
> > +     }
> > +
> > +     /* Make sure CPU boot-loader is running */
> > +     rc = hl_poll_timeout(
> > +             hdev,
> > +             mmPSOC_GLOBAL_CONF_WARM_REBOOT,
> > +             status,
> > +             (status == CPU_BOOT_STATUS_DRAM_RDY) ||
> > +             (status == CPU_BOOT_STATUS_SRAM_AVAIL),
> > +             10000,
> > +             cpu_timeout);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Error in ARM u-boot !!!\n");
> > +             switch (status) {
> > +             case CPU_BOOT_STATUS_NA:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - BTL did NOT run\n", status);
> > +                     break;
> > +             case CPU_BOOT_STATUS_IN_WFE:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - Inside WFE loop\n", status);
> > +                     break;
> > +             case CPU_BOOT_STATUS_IN_BTL:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - Stuck in BTL\n", status);
> > +                     break;
> > +             case CPU_BOOT_STATUS_IN_PREBOOT:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - Stuck in Preboot\n", status);
> > +                     break;
> > +             case CPU_BOOT_STATUS_IN_SPL:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - Stuck in SPL\n", status);
> > +                     break;
> > +             case CPU_BOOT_STATUS_IN_UBOOT:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - Stuck in u-boot\n", status);
> > +                     break;
> > +             case CPU_BOOT_STATUS_DRAM_INIT_FAIL:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - DDR initialization failed\n",
> > +                             status);
> > +                     break;
> > +             default:
> > +                     dev_err(hdev->dev,
> > +                             "ARM status %d - Invalid status code\n",
> > +                             status);
> > +                     break;
> > +             }
> > +             return -EIO;
> > +     }
> > +
> > +     /* Read U-Boot version now in case we will later fail */
> > +     goya_read_device_fw_version(hdev, FW_COMP_UBOOT);
> > +     goya_read_device_fw_version(hdev, FW_COMP_PREBOOT);
> > +
> > +     if (status == CPU_BOOT_STATUS_SRAM_AVAIL)
> > +             goto out;
> > +
> > +     if (!hdev->fw_loading) {
> > +             dev_info(hdev->dev, "Skip loading FW\n");
> > +             goto out;
> > +     }
> > +
> > +     rc = goya_push_linux_to_device(hdev);
> > +     if (rc)
> > +             return rc;
> > +
> > +     WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
> > +
> > +     rc = hl_poll_timeout(
> > +             hdev,
> > +             mmPSOC_GLOBAL_CONF_WARM_REBOOT,
> > +             status,
> > +             (status == CPU_BOOT_STATUS_SRAM_AVAIL),
> > +             10000,
> > +             cpu_timeout);
> > +
> > +     if (rc) {
> > +             if (status == CPU_BOOT_STATUS_FIT_CORRUPTED)
> > +                     dev_err(hdev->dev,
> > +                             "ARM u-boot reports FIT image is corrupted\n");
> > +             else
> > +                     dev_err(hdev->dev,
> > +                             "ARM Linux failed to load, %d\n", status);
> > +             WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_NA);
> > +             return -EIO;
> > +     }
> > +
> > +     dev_info(hdev->dev, "Successfully loaded firmware to device\n");
> > +
> > +out:
> > +     goya->hw_cap_initialized |= HW_CAP_CPU;
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * goya_hw_init - Goya hardware initialization code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_hw_init(struct hl_device *hdev)
> > +{
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +     u32 val;
> > +     int rc;
> > +
> > +     dev_info(hdev->dev, "Starting initialization of H/W\n");
> > +
> > +     /* Perform read from the device to make sure device is up */
> > +     val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
> > +
> > +     goya_init_pll(hdev);
> > +
> > +     if (hdev->pldm) {
> > +             goya_init_ddr_ch0(hdev);
> > +             goya_init_ddr_ch1(hdev);
> > +     }
> > +
> > +     rc = goya_init_cpu(hdev, GOYA_CPU_TIMEOUT_USEC);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize CPU\n");
> > +             return rc;
> > +     }
> > +
> > +     goya_tpc_mbist_workaround(hdev);
> > +
> > +     goya_init_golden_registers(hdev);
> > +
> > +     /*
> > +      * After CPU initialization is finished, change DDR bar mapping inside
> > +      * iATU to point to the start address of the MMU page tables
> > +      */
> > +     rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
> > +             (MMU_PAGE_TABLES_ADDR & ~(prop->dram_pci_bar_size - 0x1ull)));
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to map DDR bar to MMU page tables\n");
> > +             return rc;
> > +     }
> > +
> > +     goya_init_security(hdev);
> > +
> > +     /* CPU initialization is finished, we can now move to 48 bit DMA mask */
> > +     rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
> > +     if (rc) {
> > +             dev_warn(hdev->dev, "Unable to set pci dma mask to 48 bits\n");
> > +             rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "Unable to set pci dma mask to 32 bits\n");
> > +                     return rc;
> > +             }
> > +     }
> > +
> > +     rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
> > +     if (rc) {
> > +             dev_warn(hdev->dev,
> > +                     "Unable to set pci consistent dma mask to 48 bits\n");
> > +             rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "Unable to set pci consistent dma mask to 32 bits\n");
> > +                     return rc;
> > +             }
> > +     }
> > +
> > +     /* Perform read from the device to flush all MSI-X configuration */
> > +     val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * goya_hw_fini - Goya hardware tear-down code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + * @hard_reset: should we do hard reset to all engines or just reset the
> > + *              compute/dma engines
> > + *
> > + * The function does the following:
> > + * - Send interrupt to CPU to go into "quiet" mode
> > + * - Stall MME, TPC
> > + * - Stop External, Internal QMANs
> > + * - Disable MSI-X
> > + * - Issue reset command
> > + * - Wait until reset is done
> > + * - Start device BTL
> > + *
> > + */
> > +static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 reset_timeout_ms, status;
> > +
> > +     if (hdev->pldm)
> > +             reset_timeout_ms = GOYA_PLDM_RESET_TIMEOUT_MSEC;
> > +     else
> > +             reset_timeout_ms = GOYA_RESET_TIMEOUT_MSEC;
> > +
> > +     if (hard_reset) {
> > +             goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
> > +             goya_disable_clk_rlx(hdev);
> > +             goya_set_pll_refclk(hdev);
> > +
> > +             WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, RESET_ALL);
> > +             dev_info(hdev->dev,
> > +                     "Issued HARD reset command, going to wait %dms\n",
> > +                     reset_timeout_ms);
> > +     } else {
> > +             WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, DMA_MME_TPC_RESET);
> > +             dev_info(hdev->dev,
> > +                     "Issued SOFT reset command, going to wait %dms\n",
> > +                     reset_timeout_ms);
> > +     }
> > +
> > +     /*
> > +      * After hard reset, we can't poll the BTM_FSM register because the PSOC
> > +      * itself is in reset. In either reset we need to wait until the reset
> > +      * is deasserted
> > +      */
> > +     msleep(reset_timeout_ms);
> > +
> > +     status = RREG32(mmPSOC_GLOBAL_CONF_BTM_FSM);
> > +     if (status & PSOC_GLOBAL_CONF_BTM_FSM_STATE_MASK)
> > +             dev_err(hdev->dev,
> > +                     "Timeout while waiting for device to reset 0x%x\n",
> > +                     status);
> > +
> > +     if (!hard_reset) {
> > +             goya->hw_cap_initialized &= ~(HW_CAP_DMA | HW_CAP_MME |
> > +                                             HW_CAP_GOLDEN | HW_CAP_TPC);
> > +             WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> > +                             GOYA_ASYNC_EVENT_ID_SOFT_RESET);
> > +             return;
> > +     }
> > +
> > +     /* Chicken bit to re-initiate boot sequencer flow */
> > +     WREG32(mmPSOC_GLOBAL_CONF_BOOT_SEQ_RE_START,
> > +             1 << PSOC_GLOBAL_CONF_BOOT_SEQ_RE_START_IND_SHIFT);
> > +     /* Move boot manager FSM to pre boot sequencer init state */
> > +     WREG32(mmPSOC_GLOBAL_CONF_SW_BTM_FSM,
> > +                     0xA << PSOC_GLOBAL_CONF_SW_BTM_FSM_CTRL_SHIFT);
> > +
> > +     goya->hw_cap_initialized &= ~(HW_CAP_CPU | HW_CAP_CPU_Q |
> > +                                     HW_CAP_DDR_0 | HW_CAP_DDR_1 |
> > +                                     HW_CAP_DMA | HW_CAP_MME |
> > +                                     HW_CAP_MMU | HW_CAP_TPC_MBIST |
> > +                                     HW_CAP_GOLDEN | HW_CAP_TPC);
> > +
> > +     if (!hdev->pldm) {
> > +             int rc;
> > +             /* In case we are running inside VM and the VM is
> > +              * shutting down, we need to make sure CPU boot-loader
> > +              * is running before we can continue the VM shutdown.
> > +              * That is because the VM will send an FLR signal that
> > +              * we must answer
> > +              */
> > +             dev_info(hdev->dev,
> > +                     "Going to wait up to %ds for CPU boot loader\n",
> > +                     GOYA_CPU_TIMEOUT_USEC / 1000 / 1000);
> > +
> > +             rc = hl_poll_timeout(
> > +                     hdev,
> > +                     mmPSOC_GLOBAL_CONF_WARM_REBOOT,
> > +                     status,
> > +                     (status == CPU_BOOT_STATUS_DRAM_RDY),
> > +                     10000,
> > +                     GOYA_CPU_TIMEOUT_USEC);
> > +             if (rc)
> > +                     dev_err(hdev->dev,
> > +                             "failed to wait for CPU boot loader\n");
> > +     }
> > +}
> > +
> >  int goya_suspend(struct hl_device *hdev)
> >  {
> >       return 0;
> > @@ -641,6 +2519,8 @@ static const struct hl_asic_funcs goya_funcs = {
> >       .early_fini = goya_early_fini,
> >       .sw_init = goya_sw_init,
> >       .sw_fini = goya_sw_fini,
> > +     .hw_init = goya_hw_init,
> > +     .hw_fini = goya_hw_fini,
> >       .suspend = goya_suspend,
> >       .resume = goya_resume,
> >       .mmap = goya_mmap,
> > diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> > index 0e12c56472bd..45a6d2ca2752 100644
> > --- a/drivers/misc/habanalabs/goya/goyaP.h
> > +++ b/drivers/misc/habanalabs/goya/goyaP.h
> > @@ -9,6 +9,7 @@
> >  #define GOYAP_H_
> >
> >  #include "habanalabs.h"
> > +#include "include/goya/goya_boot_if.h"
> >  #include "include/goya/goya.h"
> >
> >  #define NUMBER_OF_CMPLT_QUEUES               5
> > @@ -122,4 +123,6 @@ struct goya_device {
> >       u32             hw_cap_initialized;
> >  };
> >
> > +void goya_init_security(struct hl_device *hdev);
> > +
> >  #endif /* GOYAP_H_ */
> > diff --git a/drivers/misc/habanalabs/goya/goya_security.c b/drivers/misc/habanalabs/goya/goya_security.c
> > new file mode 100644
> > index 000000000000..99ad9aacf49e
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/goya/goya_security.c
> > @@ -0,0 +1,2999 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "goyaP.h"
> > +
> > +/**
> > + * goya_pb_set_block - set the given block as protected
> > + *
> > + * @hdev: pointer to hl_device structure
> > + * @base: block base address
> > + *
> > + */
> > +static void goya_pb_set_block(struct hl_device *hdev, u64 base)
> > +{
> > +     u32 pb_addr = base - CFG_BASE + PROT_BITS_OFFS;
> > +
> > +     while (pb_addr & 0xFFF) {
> > +             WREG32(pb_addr, 0);
> > +             pb_addr += 4;
> > +     }
> > +}
> > +
> > +static void goya_init_mme_protection_bits(struct hl_device *hdev)
> > +{
> > +     u32 pb_addr, mask;
> > +     u8 word_offset;
> > +
> > +     /* TODO: change to real reg name when Soc Online is updated */
> > +     u64 mmMME_SBB_POWER_ECO1 = 0xDFF60,
> > +             mmMME_SBB_POWER_ECO2 = 0xDFF64;
> > +
> > +     goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_0_BASE);
> > +     goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_1_BASE);
> > +     goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_2_BASE);
> > +     goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_3_BASE);
> > +
> > +     goya_pb_set_block(hdev, mmSBA_ECC_MEM_BASE);
> > +     goya_pb_set_block(hdev, mmSBB_ECC_MEM_BASE);
> > +
> > +     goya_pb_set_block(hdev, mmMME1_RTR_BASE);
> > +     goya_pb_set_block(hdev, mmMME1_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME1_WR_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME2_RTR_BASE);
> > +     goya_pb_set_block(hdev, mmMME2_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME2_WR_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME3_RTR_BASE);
> > +     goya_pb_set_block(hdev, mmMME3_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME3_WR_REGULATOR_BASE);
> > +
> > +     goya_pb_set_block(hdev, mmMME4_RTR_BASE);
> > +     goya_pb_set_block(hdev, mmMME4_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME4_WR_REGULATOR_BASE);
> > +
> > +     goya_pb_set_block(hdev, mmMME5_RTR_BASE);
> > +     goya_pb_set_block(hdev, mmMME5_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME5_WR_REGULATOR_BASE);
> > +
> > +     goya_pb_set_block(hdev, mmMME6_RTR_BASE);
> > +     goya_pb_set_block(hdev, mmMME6_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmMME6_WR_REGULATOR_BASE);
> > +
> > +     pb_addr = (mmMME_DUMMY & ~0xFFF) + PROT_BITS_OFFS;
> > +     word_offset = ((mmMME_DUMMY & PROT_BITS_OFFS) >> 7) << 2;
> > +     mask = 1 << ((mmMME_DUMMY & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_RESET & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_STALL & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_DBGMEM_ADD & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_DBGMEM_DATA_WR & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_DBGMEM_DATA_RD & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_DBGMEM_CTRL & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_DBGMEM_RC & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_LOG_SHADOW & 0x7F) >> 2);
> > +
>
> The mask here and below seems to be a constant.
> A #define could suffice, no?
>
> > +     WREG32(pb_addr + word_offset, ~mask);
> > +
> > +     pb_addr = (mmMME_STORE_MAX_CREDIT & ~0xFFF) + PROT_BITS_OFFS;
> > +     word_offset = ((mmMME_STORE_MAX_CREDIT & PROT_BITS_OFFS) >> 7) << 2;
> > +     mask = 1 << ((mmMME_STORE_MAX_CREDIT & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_AGU & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SBA & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SBB & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SBC & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_WBC & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SBA_CONTROL_DATA & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SBB_CONTROL_DATA & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SBC_CONTROL_DATA & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_WBC_CONTROL_DATA & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_TE & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_TE2DEC & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_REI_STATUS & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_REI_MASK & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SEI_STATUS & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SEI_MASK & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SPI_STATUS & 0x7F) >> 2);
> > +     mask |= 1 << ((mmMME_SPI_MASK & 0x7F) >> 2);
> > +
> > +     WREG32(pb_addr + word_offset, ~mask);
> > +
>
> [ ... ]
>
> > +
> > +/**
> > + * goya_init_protection_bits - Initialize protection bits for specific registers
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * All protection bits are 1 by default, means not protected. Need to set to 0
> > + * each bit that belongs to a protected register.
> > + *
> > + */
> > +static void goya_init_protection_bits(struct hl_device *hdev)
> > +{
> > +     /*
> > +      * In each 4K block of registers, the last 128 bytes are protection
> > +      * bits - total of 1024 bits, one for each register. Each bit is related
> > +      * to a specific register, by the order of the registers.
> > +      * So in order to calculate the bit that is related to a given register,
> > +      * we need to calculate its word offset and then the exact bit inside
> > +      * the word (which is 4 bytes).
> > +      *
> > +      * Register address:
> > +      *
> > +      * 31                 12 11           7   6             2  1      0
> > +      * -----------------------------------------------------------------
> > +      * |      Don't         |    word       |  bit location  |    0    |
> > +      * |      care          |   offset      |  inside word   |         |
> > +      * -----------------------------------------------------------------
> > +      *
> > +      * Bits 7-11 represents the word offset inside the 128 bytes.
> > +      * Bits 2-6 represents the bit location inside the word.
> > +      */
> > +
> > +     goya_pb_set_block(hdev, mmPCI_NRTR_BASE);
> > +     goya_pb_set_block(hdev, mmPCI_RD_REGULATOR_BASE);
> > +     goya_pb_set_block(hdev, mmPCI_WR_REGULATOR_BASE);
>
> [ ... ]
>
> > +     goya_init_mme_protection_bits(hdev);
> > +
> > +     goya_init_dma_protection_bits(hdev);
> > +
> > +     goya_init_tpc_protection_bits(hdev);
> > +}
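To make the bit layout described in the comment above concrete, here is a
minimal user-space sketch of the same arithmetic. It is not driver code:
pb_locate(), pb_word_protecting() and struct pb_loc are hypothetical names
used only for illustration, and PROT_BITS_OFFS is assumed to be 0xF80,
which is what "the last 128 bytes of each 4K block" implies.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: protection bits occupy the last 128 bytes of each 4K block */
#define PROT_BITS_OFFS 0xF80u

struct pb_loc {
	uint32_t word_addr;	/* address of the 32-bit word holding the bit */
	uint32_t bit;		/* bit index inside that word (0-31) */
};

/* Hypothetical helper: locate the protection bit of register @reg */
static struct pb_loc pb_locate(uint32_t reg)
{
	struct pb_loc loc;
	/* Start of the protection area in the register's 4K block */
	uint32_t pb_addr = (reg & ~0xFFFu) + PROT_BITS_OFFS;
	/* Address bits 7-11 select the word; scale to a byte offset */
	uint32_t word_offset = ((reg & PROT_BITS_OFFS) >> 7) << 2;

	loc.word_addr = pb_addr + word_offset;
	/* Address bits 2-6 select the bit inside the word */
	loc.bit = (reg & 0x7Fu) >> 2;
	return loc;
}

/* Word value that protects only @bit (0 = protected, 1 = not protected) */
static uint32_t pb_word_protecting(uint32_t bit)
{
	return ~(UINT32_C(1) << bit);
}

/* Worked example: 0x1234 is register number 141 of the block at 0x1000 */
static void pb_example(void)
{
	struct pb_loc loc = pb_locate(0x1234u);

	assert(loc.word_addr == 0x1F90u);	/* 0x1000 + 0xF80 + 4 * 4 */
	assert(loc.bit == 13u);			/* 141 == 4 * 32 + 13 */
}
```

Because every input is a compile-time register address, the resulting
word address and mask are constants as well, which is the point the
review comment about using a #define raises.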
> > +
> > +/**
> > + * goya_init_security - Initialize security model
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Initialize the security model of the device
> > + * That includes range registers and protection bit per register
> > + *
> > + */
> > +void goya_init_security(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     u32 dram_addr_lo = lower_32_bits(DRAM_PHYS_BASE);
> > +     u32 dram_addr_hi = upper_32_bits(DRAM_PHYS_BASE);
> > +
> > +     u32 lbw_rng0_base = 0xFC440000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng0_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
>
> These are anyway magic numbers, why not include the mask in them directly?
> BTW, I couldn't find DMA_MACRO_LBW_RANGE_BASE_R_MASK anywhere in the
> driver.
The define is at drivers/misc/habanalabs/include/goya/asic_reg/dma_macro_regs.h

I prefer to keep the full addresses here so that, if we ever need to
change a range, it is easy to see the real address inside our chip and
what the range covers.
So it's for the sake of readability.
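
As a rough illustration of the readability argument, the sketch below
derives the span a base/mask pair covers and the value that would land
in the register field. It rests on two assumptions that are not taken
from the patch: that an address hits a range when
(addr & mask) == (base & mask), and that the field mask is 0x03FFFFFF
(EXAMPLE_LBW_FIELD_MASK is a stand-in, not the real define). All names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field width; the real mask is defined in the asic_reg headers */
#define EXAMPLE_LBW_FIELD_MASK 0x03FFFFFFu

struct addr_span {
	uint32_t first;	/* lowest address covered by the range */
	uint32_t last;	/* highest address covered by the range */
};

/* Assumed semantics: an address hits when (addr & mask) == (base & mask) */
static struct addr_span range_span(uint32_t base, uint32_t mask)
{
	struct addr_span s;

	s.first = base & mask;
	s.last = s.first | ~mask;
	return s;
}

/* Value actually programmed once the register field mask is applied */
static uint32_t range_field(uint32_t full_value)
{
	return full_value & EXAMPLE_LBW_FIELD_MASK;
}

/* Worked example using the first base/mask pair from the patch */
static void range_example(void)
{
	struct addr_span s = range_span(0xFC440000u, 0xFFFF0000u);

	assert(s.first == 0xFC440000u);
	assert(s.last == 0xFC44FFFFu);
	/* The full address stays visible in the source; only the
	 * masked field reaches the hardware.
	 */
	assert(range_field(0xFC440000u) == 0x00440000u);
}
```

With the full address in the source, the covered span can be read off
directly; a pre-masked constant such as 0x00440000 would hide it.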

>
> > +
> > +     u32 lbw_rng1_base = 0xFC480000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng1_mask = 0xFFF80000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng2_base = 0xFC600000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng2_mask = 0xFFE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng3_base = 0xFC800000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng3_mask = 0xFFF00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng4_base = 0xFCC02000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng4_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng5_base = 0xFCC40000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng5_mask = 0xFFFF8000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng6_base = 0xFCC48000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng6_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng7_base = 0xFCC4A000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng7_mask = 0xFFFFE000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng8_base = 0xFCC4C000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng8_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng9_base = 0xFCC50000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng9_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng10_base = 0xFCC60000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng10_mask = 0xFFFE0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng11_base = 0xFCE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng11_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng12_base = 0xFE484000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng12_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     u32 lbw_rng13_base = 0xFEC43000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +     u32 lbw_rng13_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
> > +
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_HIT_BLOCK, 0xFFFF);
> > +     WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFF);
> > +
> > +     if (!(goya->hw_cap_initialized & HW_CAP_MMU)) {
> > +             WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFE);
> > +
> > +             /* Protect HOST */
> > +             WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_0, 0);
> > +             WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_0, 0);
> > +             WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_0, 0);
> > +             WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_0, 0xFFF80);
> > +     }
> > +
> > +     /*
> > +      * Protect DDR @
> > +      * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
> > +      * The mask protects the first 512MB
> > +      */
> > +     WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_1, dram_addr_lo);
> > +     WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_1, dram_addr_hi);
> > +     WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_1, 0xE0000000);
> > +     WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_1, 0x3FFFF);
> > +
> > +     /* Protect registers */
> > +
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_BASE_0, lbw_rng0_base);
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_MASK_0, lbw_rng0_mask);
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_BASE_1, lbw_rng1_base);
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_MASK_1, lbw_rng1_mask);
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_BASE_2, lbw_rng2_base);
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_MASK_2, lbw_rng2_mask);
> > +     WREG32(mmDMA_MACRO_LBW_RANGE_BASE_3, lbw_rng3_base);
>
> [ ... ]
>
> > +     goya_init_protection_bits(hdev);
> > +}
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index 6ad476df65b0..adda281ec2af 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -23,6 +23,8 @@
> >
> >  #define HL_MMAP_CB_MASK                      (0x8000000000000000ull >> PAGE_SHIFT)
> >
> > +#define HL_DEVICE_TIMEOUT_USEC               1000000 /* 1 s */
> > +
> >  #define HL_MAX_QUEUES                        128
> >
> >  struct hl_device;
> > @@ -32,6 +34,8 @@ struct hl_fpriv;
> >
> >  /**
> >   * struct asic_fixed_properties - ASIC specific immutable properties.
> > + * @uboot_ver: F/W U-boot version.
> > + * @preboot_ver: F/W Preboot version.
> >   * @sram_base_address: SRAM physical start address.
> >   * @sram_end_address: SRAM physical end address.
> >   * @sram_user_base_address - SRAM physical start address for user access.
> > @@ -60,6 +64,8 @@ struct hl_fpriv;
> >   * @tpc_enabled_mask: which TPCs are enabled.
> >   */
> >  struct asic_fixed_properties {
> > +     char                    uboot_ver[VERSION_MAX_LEN];
> > +     char                    preboot_ver[VERSION_MAX_LEN];
> >       u64                     sram_base_address;
> >       u64                     sram_end_address;
> >       u64                     sram_user_base_address;
> > @@ -168,6 +174,8 @@ enum hl_asic_type {
> >   * @early_fini: tears down what was done in early_init.
> >   * @sw_init: sets up driver state, does not configure H/W.
> >   * @sw_fini: tears down driver state, does not configure H/W.
> > + * @hw_init: sets up the H/W state.
> > + * @hw_fini: tears down the H/W state.
> >   * @suspend: handles IP specific H/W or SW changes for suspend.
> >   * @resume: handles IP specific H/W or SW changes for resume.
> >   * @mmap: mmap function, does nothing.
> > @@ -180,6 +188,8 @@ struct hl_asic_funcs {
> >       int (*early_fini)(struct hl_device *hdev);
> >       int (*sw_init)(struct hl_device *hdev);
> >       int (*sw_fini)(struct hl_device *hdev);
> > +     int (*hw_init)(struct hl_device *hdev);
> > +     void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
> >       int (*suspend)(struct hl_device *hdev);
> >       int (*resume)(struct hl_device *hdev);
> >       int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> > @@ -312,6 +322,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> >   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> >   * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
> > + * @spl_fw: image to load to ArmCP.
> >   * @asid_bitmap: holds used/available ASIDs.
> >   * @asid_mutex: protects asid_bitmap.
> >   * @device_open: lock for sanity checks upon FD open.
> > @@ -340,6 +351,7 @@ struct hl_device {
> >       void                            *cpu_accessible_dma_mem;
> >       dma_addr_t                      cpu_accessible_dma_address;
> >       struct gen_pool                 *cpu_accessible_dma_pool;
> > +     const struct firmware           *spl_fw;
> >       unsigned long                   *asid_bitmap;
> >       struct mutex                    asid_mutex;
> >       /* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
> > @@ -359,7 +371,11 @@ struct hl_device {
> >       u8                              disabled;
> >
> >       /* Parameters for bring-up */
> > +     u8                              cpu_enable;
> >       u8                              reset_pcilink;
> > +     u8                              config_pll;
> > +     u8                              fw_loading;
> > +     u8                              pldm;
> >  };
> >
> >  /*
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > index 5c312dd3aa50..bd80683118d3 100644
> > --- a/drivers/misc/habanalabs/habanalabs_drv.c
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -181,7 +181,15 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> >       hdev->major = hl_major;
> >
> >       /* Parameters for bring-up - set them to defaults */
> > +     hdev->cpu_enable = 1;
> >       hdev->reset_pcilink = 0;
> > +     hdev->config_pll = 0;
> > +     hdev->fw_loading = 1;
> > +     hdev->pldm = 0;
> > +
> > +     /* If CPU is disabled, no point in loading FW */
> > +     if (!hdev->cpu_enable)
> > +             hdev->fw_loading = 0;
>
> The CPU was enabled just a couple of lines above, wasn't it?
> I've noticed there are a lot of checks for hdev->cpu_enabled and hdev->pldm
> but I didn't see them ever change.
Nope, CPU is enabled in goya_hw_init.

All the parameters in hl_device under the /* Parameters for bring-up */
comment are hard-coded in the upstream version.
If I had to remove them completely from the code, it would make my life
harder when bringing code from our internal driver to the open source
one.
I removed most of that code, but I left some of the parameters because
they have a minimal "signature".
These parameters are actually kernel module parameters in our internal
driver, but here I hard-code them to the correct values.
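
A small user-space model of the idea, in case it helps: the same
consistency check serves both the hard-coded upstream defaults and a
build where the values can be overridden. struct bringup_params and the
bringup_*() helpers are hypothetical and exist only in this sketch; the
field names and defaults mirror the ones in create_hdev().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the bring-up fields of struct hl_device */
struct bringup_params {
	uint8_t cpu_enable;
	uint8_t reset_pcilink;
	uint8_t config_pll;
	uint8_t fw_loading;
	uint8_t pldm;
};

/* Upstream style: every parameter is fixed to its default value */
static void bringup_set_defaults(struct bringup_params *p)
{
	p->cpu_enable = 1;
	p->reset_pcilink = 0;
	p->config_pll = 0;
	p->fw_loading = 1;
	p->pldm = 0;
}

/* Shared rule: if the CPU is disabled, there is no point in loading FW */
static void bringup_fixup(struct bringup_params *p)
{
	if (!p->cpu_enable)
		p->fw_loading = 0;
}

static void bringup_example(void)
{
	struct bringup_params p;

	/* With the hard-coded defaults the rule never fires */
	bringup_set_defaults(&p);
	bringup_fixup(&p);
	assert(p.fw_loading == 1);

	/* If a parameter overrides a default, the same rule matters */
	p.cpu_enable = 0;
	bringup_fixup(&p);
	assert(p.fw_loading == 0);
}
```

Keeping the check in place costs a couple of lines upstream and avoids
diverging from code that does take the values from outside.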

>
> >
> >       hdev->disabled = true;
> >       hdev->pdev = pdev; /* can be NULL in case of simulator device */
> > diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
> > index 192a1450cbb1..2d0efb7b44bb 100644
> > --- a/drivers/misc/habanalabs/include/goya/goya.h
> > +++ b/drivers/misc/habanalabs/include/goya/goya.h
> > @@ -11,6 +11,7 @@
> >  #define GOYA_H
> >
> >  #include "asic_reg/goya_regs.h"
> > +#include "goya_async_events.h"
> >
> >  #include <linux/types.h>
> >
> > diff --git a/drivers/misc/habanalabs/include/goya/goya_async_events.h b/drivers/misc/habanalabs/include/goya/goya_async_events.h
> > new file mode 100644
> > index 000000000000..497937a17ee9
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/goya/goya_async_events.h
>
> This, apparently, should have been a part of patch 8 (habanalabs: add event
> queue and interrupts)
Fixed
>
> > @@ -0,0 +1,186 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + */
> > +
> > +#ifndef __GOYA_ASYNC_EVENTS_H_
> > +#define __GOYA_ASYNC_EVENTS_H_
> > +
> > +enum goya_async_event_id {
> > +     GOYA_ASYNC_EVENT_ID_PCIE_IF = 33,
> > +     GOYA_ASYNC_EVENT_ID_TPC0_ECC = 36,
> > +     GOYA_ASYNC_EVENT_ID_TPC1_ECC = 39,
> > +     GOYA_ASYNC_EVENT_ID_TPC2_ECC = 42,
> > +     GOYA_ASYNC_EVENT_ID_TPC3_ECC = 45,
> > +     GOYA_ASYNC_EVENT_ID_TPC4_ECC = 48,
> > +     GOYA_ASYNC_EVENT_ID_TPC5_ECC = 51,
> > +     GOYA_ASYNC_EVENT_ID_TPC6_ECC = 54,
> > +     GOYA_ASYNC_EVENT_ID_TPC7_ECC = 57,
> > +     GOYA_ASYNC_EVENT_ID_MME_ECC = 60,
> > +     GOYA_ASYNC_EVENT_ID_MME_ECC_EXT = 61,
> > +     GOYA_ASYNC_EVENT_ID_MMU_ECC = 63,
> > +     GOYA_ASYNC_EVENT_ID_DMA_MACRO = 64,
> > +     GOYA_ASYNC_EVENT_ID_DMA_ECC = 66,
> > +     GOYA_ASYNC_EVENT_ID_CPU_IF_ECC = 75,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_MEM = 78,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT = 79,
> > +     GOYA_ASYNC_EVENT_ID_SRAM0 = 81,
> > +     GOYA_ASYNC_EVENT_ID_SRAM1 = 82,
> > +     GOYA_ASYNC_EVENT_ID_SRAM2 = 83,
> > +     GOYA_ASYNC_EVENT_ID_SRAM3 = 84,
> > +     GOYA_ASYNC_EVENT_ID_SRAM4 = 85,
> > +     GOYA_ASYNC_EVENT_ID_SRAM5 = 86,
> > +     GOYA_ASYNC_EVENT_ID_SRAM6 = 87,
> > +     GOYA_ASYNC_EVENT_ID_SRAM7 = 88,
> > +     GOYA_ASYNC_EVENT_ID_SRAM8 = 89,
> > +     GOYA_ASYNC_EVENT_ID_SRAM9 = 90,
> > +     GOYA_ASYNC_EVENT_ID_SRAM10 = 91,
> > +     GOYA_ASYNC_EVENT_ID_SRAM11 = 92,
> > +     GOYA_ASYNC_EVENT_ID_SRAM12 = 93,
> > +     GOYA_ASYNC_EVENT_ID_SRAM13 = 94,
> > +     GOYA_ASYNC_EVENT_ID_SRAM14 = 95,
> > +     GOYA_ASYNC_EVENT_ID_SRAM15 = 96,
> > +     GOYA_ASYNC_EVENT_ID_SRAM16 = 97,
> > +     GOYA_ASYNC_EVENT_ID_SRAM17 = 98,
> > +     GOYA_ASYNC_EVENT_ID_SRAM18 = 99,
> > +     GOYA_ASYNC_EVENT_ID_SRAM19 = 100,
> > +     GOYA_ASYNC_EVENT_ID_SRAM20 = 101,
> > +     GOYA_ASYNC_EVENT_ID_SRAM21 = 102,
> > +     GOYA_ASYNC_EVENT_ID_SRAM22 = 103,
> > +     GOYA_ASYNC_EVENT_ID_SRAM23 = 104,
> > +     GOYA_ASYNC_EVENT_ID_SRAM24 = 105,
> > +     GOYA_ASYNC_EVENT_ID_SRAM25 = 106,
> > +     GOYA_ASYNC_EVENT_ID_SRAM26 = 107,
> > +     GOYA_ASYNC_EVENT_ID_SRAM27 = 108,
> > +     GOYA_ASYNC_EVENT_ID_SRAM28 = 109,
> > +     GOYA_ASYNC_EVENT_ID_SRAM29 = 110,
> > +     GOYA_ASYNC_EVENT_ID_GIC500 = 112,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_DEC = 115,
> > +     GOYA_ASYNC_EVENT_ID_TPC0_DEC = 117,
> > +     GOYA_ASYNC_EVENT_ID_TPC1_DEC = 120,
> > +     GOYA_ASYNC_EVENT_ID_TPC2_DEC = 123,
> > +     GOYA_ASYNC_EVENT_ID_TPC3_DEC = 126,
> > +     GOYA_ASYNC_EVENT_ID_TPC4_DEC = 129,
> > +     GOYA_ASYNC_EVENT_ID_TPC5_DEC = 132,
> > +     GOYA_ASYNC_EVENT_ID_TPC6_DEC = 135,
> > +     GOYA_ASYNC_EVENT_ID_TPC7_DEC = 138,
> > +     GOYA_ASYNC_EVENT_ID_AXI_ECC = 139,
> > +     GOYA_ASYNC_EVENT_ID_L2_RAM_ECC = 140,
> > +     GOYA_ASYNC_EVENT_ID_MME_WACS = 141,
> > +     GOYA_ASYNC_EVENT_ID_MME_WACSD = 142,
> > +     GOYA_ASYNC_EVENT_ID_PLL0 = 143,
> > +     GOYA_ASYNC_EVENT_ID_PLL1 = 144,
> > +     GOYA_ASYNC_EVENT_ID_PLL3 = 146,
> > +     GOYA_ASYNC_EVENT_ID_PLL4 = 147,
> > +     GOYA_ASYNC_EVENT_ID_PLL5 = 148,
> > +     GOYA_ASYNC_EVENT_ID_PLL6 = 149,
> > +     GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER = 155,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC = 159,
> > +     GOYA_ASYNC_EVENT_ID_PSOC = 160,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_FLR = 171,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_HOT_RESET = 172,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG0 = 174,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG1 = 175,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG2 = 176,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG3 = 177,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG0 = 178,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG1 = 179,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG2 = 180,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG3 = 181,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_APB = 182,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_QDB = 183,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_BM_D_P_WR = 184,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_BM_D_RD = 185,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_BM_U_P_WR = 186,
> > +     GOYA_ASYNC_EVENT_ID_PCIE_BM_U_RD = 187,
> > +     GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU = 190,
> > +     GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR = 191,
> > +     GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU = 200,
> > +     GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR = 201,
> > +     GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU = 210,
> > +     GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR = 211,
> > +     GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU = 220,
> > +     GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR = 221,
> > +     GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU = 230,
> > +     GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR = 231,
> > +     GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU = 240,
> > +     GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR = 241,
> > +     GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU = 250,
> > +     GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR = 251,
> > +     GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU = 260,
> > +     GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR = 261,
> > +     GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU0 = 270,
> > +     GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU1 = 271,
> > +     GOYA_ASYNC_EVENT_ID_MME_WACS_UP = 272,
> > +     GOYA_ASYNC_EVENT_ID_MME_WACS_DOWN = 273,
> > +     GOYA_ASYNC_EVENT_ID_MMU_PAGE_FAULT = 280,
> > +     GOYA_ASYNC_EVENT_ID_MMU_WR_PERM = 281,
> > +     GOYA_ASYNC_EVENT_ID_MMU_DBG_BM = 282,
> > +     GOYA_ASYNC_EVENT_ID_DMA_BM_CH0 = 290,
> > +     GOYA_ASYNC_EVENT_ID_DMA_BM_CH1 = 291,
> > +     GOYA_ASYNC_EVENT_ID_DMA_BM_CH2 = 292,
> > +     GOYA_ASYNC_EVENT_ID_DMA_BM_CH3 = 293,
> > +     GOYA_ASYNC_EVENT_ID_DMA_BM_CH4 = 294,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_PHY_DFI = 300,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_ECC_SCRUB = 301,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_DB_ECC = 302,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC = 303,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC_MC = 304,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_AXI_RD = 305,
> > +     GOYA_ASYNC_EVENT_ID_DDR0_AXI_WR = 306,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_PHY_DFI = 310,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_ECC_SCRUB = 311,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_DB_ECC = 312,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC = 313,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC_MC = 314,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_AXI_RD = 315,
> > +     GOYA_ASYNC_EVENT_ID_DDR1_AXI_WR = 316,
> > +     GOYA_ASYNC_EVENT_ID_CPU_BMON = 320,
> > +     GOYA_ASYNC_EVENT_ID_TS_EAST = 322,
> > +     GOYA_ASYNC_EVENT_ID_TS_WEST = 323,
> > +     GOYA_ASYNC_EVENT_ID_TS_NORTH = 324,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_0 = 330,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_1 = 331,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_2 = 332,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET = 356,
> > +     GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT = 361,
> > +     GOYA_ASYNC_EVENT_ID_TPC0_CMDQ = 430,
> > +     GOYA_ASYNC_EVENT_ID_TPC1_CMDQ = 431,
> > +     GOYA_ASYNC_EVENT_ID_TPC2_CMDQ = 432,
> > +     GOYA_ASYNC_EVENT_ID_TPC3_CMDQ = 433,
> > +     GOYA_ASYNC_EVENT_ID_TPC4_CMDQ = 434,
> > +     GOYA_ASYNC_EVENT_ID_TPC5_CMDQ = 435,
> > +     GOYA_ASYNC_EVENT_ID_TPC6_CMDQ = 436,
> > +     GOYA_ASYNC_EVENT_ID_TPC7_CMDQ = 437,
> > +     GOYA_ASYNC_EVENT_ID_TPC0_QM = 438,
> > +     GOYA_ASYNC_EVENT_ID_TPC1_QM = 439,
> > +     GOYA_ASYNC_EVENT_ID_TPC2_QM = 440,
> > +     GOYA_ASYNC_EVENT_ID_TPC3_QM = 441,
> > +     GOYA_ASYNC_EVENT_ID_TPC4_QM = 442,
> > +     GOYA_ASYNC_EVENT_ID_TPC5_QM = 443,
> > +     GOYA_ASYNC_EVENT_ID_TPC6_QM = 444,
> > +     GOYA_ASYNC_EVENT_ID_TPC7_QM = 445,
> > +     GOYA_ASYNC_EVENT_ID_MME_QM = 447,
> > +     GOYA_ASYNC_EVENT_ID_MME_CMDQ = 448,
> > +     GOYA_ASYNC_EVENT_ID_DMA0_QM = 449,
> > +     GOYA_ASYNC_EVENT_ID_DMA1_QM = 450,
> > +     GOYA_ASYNC_EVENT_ID_DMA2_QM = 451,
> > +     GOYA_ASYNC_EVENT_ID_DMA3_QM = 452,
> > +     GOYA_ASYNC_EVENT_ID_DMA4_QM = 453,
> > +     GOYA_ASYNC_EVENT_ID_DMA_ON_HBW = 454,
> > +     GOYA_ASYNC_EVENT_ID_DMA0_CH = 455,
> > +     GOYA_ASYNC_EVENT_ID_DMA1_CH = 456,
> > +     GOYA_ASYNC_EVENT_ID_DMA2_CH = 457,
> > +     GOYA_ASYNC_EVENT_ID_DMA3_CH = 458,
> > +     GOYA_ASYNC_EVENT_ID_DMA4_CH = 459,
> > +     GOYA_ASYNC_EVENT_ID_PI_UPDATE = 484,
> > +     GOYA_ASYNC_EVENT_ID_HALT_MACHINE = 485,
> > +     GOYA_ASYNC_EVENT_ID_INTS_REGISTER = 486,
> > +     GOYA_ASYNC_EVENT_ID_SOFT_RESET = 487,
> > +     GOYA_ASYNC_EVENT_ID_LAST_VALID_ID = 1023,
> > +     GOYA_ASYNC_EVENT_ID_SIZE
> > +};
> > +
> > +#endif /* __GOYA_ASYNC_EVENTS_H_ */
> > diff --git a/drivers/misc/habanalabs/include/goya/goya_boot_if.h b/drivers/misc/habanalabs/include/goya/goya_boot_if.h
> > new file mode 100644
> > index 000000000000..2e39578ec795
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/goya/goya_boot_if.h
> > @@ -0,0 +1,32 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + * Author: Oded Gabbay <oded.gabbay@gmail.com>
> > + *
> > + */
> > +
> > +#ifndef GOYA_BOOT_IF_H
> > +#define GOYA_BOOT_IF_H
> > +
> > +enum cpu_boot_status {
> > +     CPU_BOOT_STATUS_NA = 0,         /* Default value after reset of chip */
> > +     CPU_BOOT_STATUS_IN_WFE,
> > +     CPU_BOOT_STATUS_DRAM_RDY,
> > +     CPU_BOOT_STATUS_SRAM_AVAIL,
> > +     CPU_BOOT_STATUS_IN_BTL,         /* BTL is H/W FSM */
> > +     CPU_BOOT_STATUS_IN_PREBOOT,
> > +     CPU_BOOT_STATUS_IN_SPL,
> > +     CPU_BOOT_STATUS_IN_UBOOT,
> > +     CPU_BOOT_STATUS_DRAM_INIT_FAIL,
> > +     CPU_BOOT_STATUS_FIT_CORRUPTED
> > +};
> > +
> > +enum kmd_msg {
> > +     KMD_MSG_NA = 0,
> > +     KMD_MSG_GOTO_WFE,
> > +     KMD_MSG_FIT_RDY
> > +};
> > +
> > +#endif /* GOYA_BOOT_IF_H */
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 07/15] habanalabs: add h/w queues module
  2019-01-25  7:50   ` Mike Rapoport
@ 2019-01-28 10:50     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28 10:50 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org

On Fri, Jan 25, 2019 at 9:51 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:49AM +0200, Oded Gabbay wrote:
> > This patch adds the H/W queues module and the code to initialize Goya's
> > various compute and DMA engines and their queues.
> >
> > Goya has 5 DMA channels, 8 TPC engines and a single MME engine. For each
> > channel/engine, there is a H/W queue logic which is used to pass commands
> > from the user to the H/W. That logic is called QMAN.
> >
> > There are two types of QMANs: external and internal. The DMA QMANs are
> > considered external while the TPC and MME QMANs are considered internal.
> > For each external queue there is a completion queue, which is located on
> > the Host memory.
> >
> > The differences between external and internal QMANs are:
> >
> > 1. The location of the queue's memory. External QMANs are located on the
> >    Host memory while internal QMANs are located on the on-chip memory.
> >
> > 2. The external QMAN writes an entry to a completion queue and sends an
> >    MSI-X interrupt upon completion of a command buffer that was given to
> >    it. The internal QMAN doesn't do that.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/Makefile              |    2 +-
> >  drivers/misc/habanalabs/device.c              |   74 +-
> >  drivers/misc/habanalabs/goya/goya.c           | 1518 +++++++++++++++--
> >  drivers/misc/habanalabs/goya/goyaP.h          |    6 +
> >  drivers/misc/habanalabs/habanalabs.h          |  176 +-
> >  drivers/misc/habanalabs/habanalabs_drv.c      |    6 +
> >  drivers/misc/habanalabs/hw_queue.c            |  404 +++++
> >  .../habanalabs/include/goya/goya_packets.h    |  234 +++
> >  .../habanalabs/include/habanalabs_device_if.h |  272 +++
> >  drivers/misc/habanalabs/irq.c                 |  150 ++
> >  10 files changed, 2721 insertions(+), 121 deletions(-)
> >  create mode 100644 drivers/misc/habanalabs/hw_queue.c
> >  create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
> >  create mode 100644 drivers/misc/habanalabs/irq.c
> >
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > index 2530c9b78ca4..c07f3ccb57dc 100644
> > --- a/drivers/misc/habanalabs/Makefile
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -5,7 +5,7 @@
> >  obj-m        := habanalabs.o
> >
> >  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> > -             command_buffer.o
> > +             command_buffer.o hw_queue.o irq.o
> >
> >  include $(src)/goya/Makefile
> >  habanalabs-y += $(HL_GOYA_FILES)
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index 9fc7218a973c..98220628a467 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -170,13 +170,22 @@ static int device_early_init(struct hl_device *hdev)
> >       if (rc)
> >               goto early_fini;
> >
> > +     hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
> > +     if (hdev->cq_wq == NULL) {
> > +             dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
> > +             goto asid_fini;
> > +     }
> > +
> >       hl_cb_mgr_init(&hdev->kernel_cb_mgr);
> >
> >       mutex_init(&hdev->device_open);
> > +     mutex_init(&hdev->send_cpu_message_lock);
> >       atomic_set(&hdev->fd_open_cnt, 0);
> >
> >       return 0;
> >
> > +asid_fini:
> > +     hl_asid_fini(hdev);
> >  early_fini:
> >       if (hdev->asic_funcs->early_fini)
> >               hdev->asic_funcs->early_fini(hdev);
> > @@ -192,9 +201,12 @@ static int device_early_init(struct hl_device *hdev)
> >   */
> >  static void device_early_fini(struct hl_device *hdev)
> >  {
> > +     mutex_destroy(&hdev->send_cpu_message_lock);
> >
> >       hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
> >
> > +     destroy_workqueue(hdev->cq_wq);
> > +
> >       hl_asid_fini(hdev);
> >
> >       if (hdev->asic_funcs->early_fini)
> > @@ -273,7 +285,7 @@ int hl_device_resume(struct hl_device *hdev)
> >   */
> >  int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >  {
> > -     int rc;
> > +     int i, rc, cq_ready_cnt;
> >
> >       /* Create device */
> >       rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
> > @@ -294,11 +306,48 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >       if (rc)
> >               goto early_fini;
> >
> > +     /*
> > +      * Initialize the H/W queues. Must be done before hw_init, because
> > +      * there the addresses of the kernel queue are being written to the
> > +      * registers of the device
> > +      */
> > +     rc = hl_hw_queues_create(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize kernel queues\n");
> > +             goto sw_fini;
> > +     }
> > +
> > +     /*
> > +      * Initialize the completion queues. Must be done before hw_init,
> > +      * because there the addresses of the completion queues are being
> > +      * passed as arguments to request_irq
> > +      */
> > +     hdev->completion_queue =
> > +                     kcalloc(hdev->asic_prop.completion_queues_count,
> > +                             sizeof(*hdev->completion_queue), GFP_KERNEL);
> > +
> > +     if (!hdev->completion_queue) {
> > +             dev_err(hdev->dev, "failed to allocate completion queues\n");
> > +             rc = -ENOMEM;
> > +             goto hw_queues_destroy;
> > +     }
> > +
> > +     for (i = 0, cq_ready_cnt = 0;
> > +                     i < hdev->asic_prop.completion_queues_count;
> > +                     i++, cq_ready_cnt++) {
> > +             rc = hl_cq_init(hdev, &hdev->completion_queue[i], i);
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "failed to initialize completion queue\n");
> > +                     goto cq_fini;
> > +             }
> > +     }
> > +
> >       /* Allocate the kernel context */
> >       hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
> >       if (!hdev->kernel_ctx) {
> >               rc = -ENOMEM;
> > -             goto sw_fini;
> > +             goto cq_fini;
> >       }
> >
> >       hdev->user_ctx = NULL;
> > @@ -324,6 +373,14 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >
> >       hdev->disabled = false;
> >
> > +     /* Check that the communication with the device is working */
> > +     rc = hdev->asic_funcs->test_queues(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to detect if device is alive\n");
> > +             rc = 0;
>
> Why rc is 0 here?
>
See my explanation in the previous patch. Returning 0 here makes the
device stay registered in Linux in a "disabled/malfunction" state, which
gives the user the ability to reset it or debug it.
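
To illustrate the idea, here is a minimal user-space sketch, not the
driver code. The names (sketch_device, sketch_test_queues,
sketch_device_init) are made up for the example, and it assumes the
out_disabled path does nothing more than mark the device as disabled
before returning.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct hl_device; only what the sketch needs. */
struct sketch_device {
	bool disabled;     /* true: device stays registered but refuses work */
	bool hw_responds;  /* simulated outcome of the queue self-test */
};

/* Simulated test_queues(): 0 on success, negative value on failure. */
static int sketch_test_queues(const struct sketch_device *dev)
{
	return dev->hw_responds ? 0 : -5; /* -5 mirrors -EIO */
}

/*
 * Init returns 0 even when the self-test fails. The failure is recorded
 * in dev->disabled instead of the return code, so the caller keeps the
 * device node around and the user can still reset or inspect the device.
 */
static int sketch_device_init(struct sketch_device *dev)
{
	int rc;

	dev->disabled = false;

	rc = sketch_test_queues(dev);
	if (rc) {
		/* Assumed equivalent of the out_disabled label. */
		dev->disabled = true;
		return 0;
	}

	return 0;
}
```

A caller therefore has to look at the disabled flag, not at the return
value, to know whether the device is usable.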

> > +             goto out_disabled;
> > +     }
> > +
> >       dev_notice(hdev->dev,
> >               "Successfully added device to habanalabs driver\n");
> >
> > @@ -335,6 +392,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >                       "kernel ctx is still alive on initialization failure\n");
> >  free_ctx:
> >       kfree(hdev->kernel_ctx);
> > +cq_fini:
> > +     for (i = 0 ; i < cq_ready_cnt ; i++)
> > +             hl_cq_fini(hdev, &hdev->completion_queue[i]);
> > +     kfree(hdev->completion_queue);
> > +hw_queues_destroy:
> > +     hl_hw_queues_destroy(hdev);
> >  sw_fini:
> >       hdev->asic_funcs->sw_fini(hdev);
> >  early_fini:
> > @@ -364,6 +427,7 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >   */
> >  void hl_device_fini(struct hl_device *hdev)
> >  {
> > +     int i;
> >       dev_info(hdev->dev, "Removing device\n");
> >
> >       /* Mark device as disabled */
> > @@ -378,6 +442,12 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Reset the H/W. It will be in idle state after this returns */
> >       hdev->asic_funcs->hw_fini(hdev, true);
> >
> > +     for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
> > +             hl_cq_fini(hdev, &hdev->completion_queue[i]);
> > +     kfree(hdev->completion_queue);
> > +
> > +     hl_hw_queues_destroy(hdev);
> > +
> >       /* Call ASIC S/W finalize function */
> >       hdev->asic_funcs->sw_fini(hdev);
> >
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > index f715e01838b3..08d5227eaf1d 100644
> > --- a/drivers/misc/habanalabs/goya/goya.c
> > +++ b/drivers/misc/habanalabs/goya/goya.c
> > @@ -98,6 +98,26 @@
> >  static void goya_get_fixed_properties(struct hl_device *hdev)
> >  {
> >       struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +     int i;
> > +
> > +     for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
> > +             prop->hw_queues_props[i].type = QUEUE_TYPE_EXT;
> > +             prop->hw_queues_props[i].kmd_only = 0;
> > +     }
> > +
> > +     for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES ; i++) {
> > +             prop->hw_queues_props[i].type = QUEUE_TYPE_CPU;
> > +             prop->hw_queues_props[i].kmd_only = 1;
> > +     }
> > +
> > +     for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES +
> > +                     NUMBER_OF_INT_HW_QUEUES; i++) {
> > +             prop->hw_queues_props[i].type = QUEUE_TYPE_INT;
> > +             prop->hw_queues_props[i].kmd_only = 0;
> > +     }
> > +
> > +     for (; i < HL_MAX_QUEUES; i++)
> > +             prop->hw_queues_props[i].type = QUEUE_TYPE_NA;
> >
> >       prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
> >
> > @@ -126,6 +146,18 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
> >       prop->high_pll = PLL_HIGH_DEFAULT;
> >  }
> >
> > +int goya_send_pci_access_msg(struct hl_device *hdev, u32 opcode)
> > +{
> > +     struct armcp_packet pkt;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = opcode;
> > +
> > +     return hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt,
> > +                     sizeof(pkt), HL_DEVICE_TIMEOUT_USEC, NULL);
> > +}
> > +
> >  /**
> >   * goya_pci_bars_map - Map PCI BARS of Goya device
> >   *
> > @@ -509,6 +541,8 @@ static int goya_sw_init(struct hl_device *hdev)
> >       if (!goya)
> >               return -ENOMEM;
> >
> > +     goya->test_cpu_queue = goya_test_cpu_queue;
> > +
> >       /* according to goya_init_iatu */
> >       goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> >       hdev->asic_specific = goya;
> > @@ -595,6 +629,299 @@ int goya_sw_fini(struct hl_device *hdev)
> >       return 0;
> >  }
> >
> > +static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
> > +             dma_addr_t bus_address)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 mtr_base_lo, mtr_base_hi;
> > +     u32 so_base_lo, so_base_hi;
> > +     u32 gic_base_lo, gic_base_hi;
> > +     u32 reg_off = dma_id * (mmDMA_QM_1_PQ_PI - mmDMA_QM_0_PQ_PI);
> > +
> > +     mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +
> > +     gic_base_lo =
> > +             lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +     gic_base_hi =
> > +             upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +
> > +     WREG32(mmDMA_QM_0_PQ_BASE_LO + reg_off, lower_32_bits(bus_address));
> > +     WREG32(mmDMA_QM_0_PQ_BASE_HI + reg_off, upper_32_bits(bus_address));
> > +
> > +     WREG32(mmDMA_QM_0_PQ_SIZE + reg_off, ilog2(HL_QUEUE_LENGTH));
> > +     WREG32(mmDMA_QM_0_PQ_PI + reg_off, 0);
> > +     WREG32(mmDMA_QM_0_PQ_CI + reg_off, 0);
> > +
> > +     WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
> > +     WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
> > +     WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
> > +     WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
> > +     WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
> > +     WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
> > +     WREG32(mmDMA_QM_0_GLBL_ERR_WDATA + reg_off,
> > +                     GOYA_ASYNC_EVENT_ID_DMA0_QM + dma_id);
> > +
> > +     /* PQ has buffer of 2 cache lines, while CQ has 8 lines */
> > +     WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
> > +     WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
> > +
> > +     if (dma_id == 0)
> > +             WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
> > +     else
> > +             if (goya->hw_cap_initialized & HW_CAP_MMU)
> > +                     WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> > +                                     QMAN_DMA_PARTLY_TRUSTED);
> > +             else
> > +                     WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> > +                                     QMAN_DMA_FULLY_TRUSTED);
> > +
> > +     WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
> > +     WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
> > +}
> > +
> > +static void goya_init_dma_ch(struct hl_device *hdev, int dma_id)
> > +{
> > +     u32 gic_base_lo, gic_base_hi;
> > +     u64 sob_addr;
> > +     u32 reg_off = dma_id * (mmDMA_CH_1_CFG1 - mmDMA_CH_0_CFG1);
> > +
> > +     gic_base_lo =
> > +             lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +     gic_base_hi =
> > +             upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +
> > +     WREG32(mmDMA_CH_0_ERRMSG_ADDR_LO + reg_off, gic_base_lo);
> > +     WREG32(mmDMA_CH_0_ERRMSG_ADDR_HI + reg_off, gic_base_hi);
> > +     WREG32(mmDMA_CH_0_ERRMSG_WDATA + reg_off,
> > +                     GOYA_ASYNC_EVENT_ID_DMA0_CH + dma_id);
> > +
> > +     if (dma_id) {
> > +             sob_addr = CFG_BASE + mmSYNC_MNGR_SOB_OBJ_1000 +
> > +                             (dma_id - 1) * 4;
> > +             WREG32(mmDMA_CH_0_WR_COMP_ADDR_LO + reg_off,
> > +                             lower_32_bits(sob_addr));
> > +             WREG32(mmDMA_CH_0_WR_COMP_ADDR_HI + reg_off,
> > +                             upper_32_bits(sob_addr));
> > +             WREG32(mmDMA_CH_0_WR_COMP_WDATA + reg_off, 0x80000001);
> > +     }
> > +}
> > +
> > +/**
> > + * goya_init_dma_qmans - Initialize QMAN DMA registers
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Initialize the H/W registers of the QMAN DMA channels
> > + *
> > + */
> > +static void goya_init_dma_qmans(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     struct hl_hw_queue *q;
> > +     dma_addr_t bus_address;
> > +     int i;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_DMA)
> > +             return;
> > +
> > +     q = &hdev->kernel_queues[0];
> > +
> > +     for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++, q++) {
> > +             bus_address = q->bus_address +
> > +                             hdev->asic_prop.host_phys_base_address;
> > +
> > +             goya_init_dma_qman(hdev, i, bus_address);
> > +             goya_init_dma_ch(hdev, i);
> > +     }
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_DMA;
> > +}
> > +
> > +/**
> > + * goya_disable_external_queues - Disable external queues
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static void goya_disable_external_queues(struct hl_device *hdev)
> > +{
> > +     WREG32(mmDMA_QM_0_GLBL_CFG0, 0);
> > +     WREG32(mmDMA_QM_1_GLBL_CFG0, 0);
> > +     WREG32(mmDMA_QM_2_GLBL_CFG0, 0);
> > +     WREG32(mmDMA_QM_3_GLBL_CFG0, 0);
> > +     WREG32(mmDMA_QM_4_GLBL_CFG0, 0);
> > +}
> > +
> > +static int goya_stop_queue(struct hl_device *hdev, u32 cfg_reg,
> > +                             u32 cp_sts_reg, u32 glbl_sts0_reg)
> > +{
> > +     int rc;
> > +     u32 status;
> > +
> > +     /* use the values of TPC0 as they are all the same*/
> > +
> > +     WREG32(cfg_reg, 1 << TPC0_QM_GLBL_CFG1_CP_STOP_SHIFT);
> > +
> > +     status = RREG32(cp_sts_reg);
> > +     if (status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK) {
> > +             rc = hl_poll_timeout(
> > +                     hdev,
> > +                     cp_sts_reg,
> > +                     status,
> > +                     !(status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK),
> > +                     1000,
> > +                     QMAN_FENCE_TIMEOUT_USEC);
> > +
> > +             /* if QMAN is stuck in fence no need to check for stop */
> > +             if (rc)
> > +                     return 0;
>
> Isn't it an error?
Nope, that's how our H/W works :( If the QMAN is stuck in a fence, the
stop indication will never be set, so there is no point in checking it.
But a QMAN that is stuck in a fence is almost equivalent to a stopped
one, and that is good enough for reset.
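
A rough user-space model of that decision, in case it helps. Everything
here is hypothetical (the sk_ names, the bit values, and the poll counts
that stand in for the real timeouts); it only mirrors the control flow of
goya_stop_queue and does not touch any real registers.

```c
#include <stdint.h>

#define SK_FENCE_IN_PROGRESS 0x1u /* hypothetical bit in the CP status reg */
#define SK_CP_IS_STOP        0x2u /* hypothetical bit in the global status reg */

/*
 * Hypothetical model of one QMAN: number of register reads until each
 * condition changes. A negative value means "never".
 */
struct sk_qman {
	int polls_until_fence_clears;
	int polls_until_stopped;
};

static uint32_t sk_read_cp_sts(struct sk_qman *q)
{
	if (q->polls_until_fence_clears == 0)
		return 0;
	if (q->polls_until_fence_clears > 0)
		q->polls_until_fence_clears--;
	return SK_FENCE_IN_PROGRESS;
}

static uint32_t sk_read_glbl_sts0(struct sk_qman *q)
{
	if (q->polls_until_stopped == 0)
		return SK_CP_IS_STOP;
	if (q->polls_until_stopped > 0)
		q->polls_until_stopped--;
	return 0;
}

/*
 * Read a register until (value & mask) == want, at most max_polls times.
 * Returns 0 on success and -1 on timeout.
 */
static int sk_poll(struct sk_qman *q, uint32_t (*read_reg)(struct sk_qman *),
		   uint32_t mask, uint32_t want, int max_polls)
{
	for (int i = 0; i < max_polls; i++)
		if ((read_reg(q) & mask) == want)
			return 0;
	return -1;
}

static int sk_stop_queue(struct sk_qman *q, int max_polls)
{
	/* The real code writes the stop bit to the config register here. */

	if (sk_read_cp_sts(q) & SK_FENCE_IN_PROGRESS) {
		/*
		 * Fence never cleared: the stop bit will never be set either,
		 * so treat the queue as stopped and report success.
		 */
		if (sk_poll(q, sk_read_cp_sts, SK_FENCE_IN_PROGRESS, 0,
			    max_polls))
			return 0;
	}

	/* Not in a fence (or the fence cleared): wait for the stop bit. */
	if (sk_poll(q, sk_read_glbl_sts0, SK_CP_IS_STOP, SK_CP_IS_STOP,
		    max_polls))
		return -1; /* the driver returns -EINVAL at this point */

	return 0;
}
```

So only the second timeout is an error. The first one just tells us that
waiting for the stop bit would be pointless.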
>
> > +     }
> > +
> > +     rc = hl_poll_timeout(
> > +             hdev,
> > +             glbl_sts0_reg,
> > +             status,
> > +             (status & TPC0_QM_GLBL_STS0_CP_IS_STOP_MASK),
> > +             1000,
> > +             QMAN_STOP_TIMEOUT_USEC);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Timeout while waiting for QMAN to stop\n");
> > +             return -EINVAL;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * goya_stop_external_queues - Stop external queues
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_stop_external_queues(struct hl_device *hdev)
> > +{
> > +     int rc = goya_stop_queue(hdev,
> > +                     mmDMA_QM_0_GLBL_CFG1,
> > +                     mmDMA_QM_0_CP_STS,
> > +                     mmDMA_QM_0_GLBL_STS0);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "failed to stop DMA QMAN 0\n");
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmDMA_QM_1_GLBL_CFG1,
> > +                     mmDMA_QM_1_CP_STS,
> > +                     mmDMA_QM_1_GLBL_STS0);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "failed to stop DMA QMAN 1\n");
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmDMA_QM_2_GLBL_CFG1,
> > +                     mmDMA_QM_2_CP_STS,
> > +                     mmDMA_QM_2_GLBL_STS0);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "failed to stop DMA QMAN 2\n");
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmDMA_QM_3_GLBL_CFG1,
> > +                     mmDMA_QM_3_CP_STS,
> > +                     mmDMA_QM_3_GLBL_STS0);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "failed to stop DMA QMAN 3\n");
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmDMA_QM_4_GLBL_CFG1,
> > +                     mmDMA_QM_4_CP_STS,
> > +                     mmDMA_QM_4_GLBL_STS0);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "failed to stop DMA QMAN 4\n");
> > +
> > +     return rc;
> > +}
> > +
> > +static void goya_resume_external_queues(struct hl_device *hdev)
> > +{
> > +     WREG32(mmDMA_QM_0_GLBL_CFG1, 0);
> > +     WREG32(mmDMA_QM_1_GLBL_CFG1, 0);
> > +     WREG32(mmDMA_QM_2_GLBL_CFG1, 0);
> > +     WREG32(mmDMA_QM_3_GLBL_CFG1, 0);
> > +     WREG32(mmDMA_QM_4_GLBL_CFG1, 0);
> > +}
> > +
> > +/**
> > + * goya_init_cpu_queues - Initialize PQ/CQ/EQ of CPU
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Returns 0 on success
> > + *
> > + */
> > +int goya_init_cpu_queues(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     dma_addr_t bus_address;
> > +     u32 status;
> > +     struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
> > +     int err;
> > +
> > +     if (!hdev->cpu_queues_enable)
> > +             return 0;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
> > +             return 0;
> > +
> > +     bus_address = cpu_pq->bus_address +
> > +                     hdev->asic_prop.host_phys_base_address;
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
> > +
> > +     bus_address = hdev->cpu_accessible_dma_address +
> > +                     hdev->asic_prop.host_phys_base_address;
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
> > +
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
> > +
> > +     /* Used for EQ CI */
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, 0);
> > +
> > +     WREG32(mmCPU_IF_PF_PQ_PI, 0);
> > +
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_7, PQ_INIT_STATUS_READY_FOR_CP);
> > +
> > +     WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> > +                     GOYA_ASYNC_EVENT_ID_PI_UPDATE);
> > +
> > +     err = hl_poll_timeout(
> > +             hdev,
> > +             mmPSOC_GLOBAL_CONF_SCRATCHPAD_7,
> > +             status,
> > +             (status == PQ_INIT_STATUS_READY_FOR_HOST),
> > +             1000,
> > +             GOYA_CPU_TIMEOUT_USEC);
> > +
> > +     if (err) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to communicate with ARM CPU (ArmCP timeout)\n");
> > +             return -EIO;
> > +     }
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_CPU_Q;
> > +     return 0;
> > +}
> > +
> >  /**
> >   * goya_init_pll - Initialize pll registers
> >   *
> > @@ -1960,152 +2287,646 @@ static void goya_init_golden_registers(struct hl_device *hdev)
> >       goya->hw_cap_initialized |= HW_CAP_GOLDEN;
> >  }
> >
> > -
> > -/**
> > - * goya_push_uboot_to_device - Push u-boot FW code to device
> > - *
> > - * @hdev: pointer to hl_device structure
> > - *
> > - * Copy u-boot fw code from firmware file to SRAM BAR.
> > - * Returns 0 on success
> > - *
> > - */
> > -static int goya_push_uboot_to_device(struct hl_device *hdev)
> > +static void goya_init_mme_qman(struct hl_device *hdev)
> >  {
> > -     char fw_name[200];
> > -     const u64 *fw_data;
> > -     void __iomem *dst;
> > -     size_t fw_size, i;
> > -     int rc;
> > +     u32 mtr_base_lo, mtr_base_hi;
> > +     u32 so_base_lo, so_base_hi;
> > +     u32 gic_base_lo, gic_base_hi;
> > +     u64 qman_base_addr;
> >
> > -     snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
> > +     mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> >
> > -     rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> > +     gic_base_lo =
> > +             lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +     gic_base_hi =
> > +             upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> >
> > -     if (rc) {
> > -             dev_err(hdev->dev, "Failed to request u-boot fw image\n");
> > -             goto out;
> > -     }
> > +     qman_base_addr = hdev->asic_prop.sram_base_address +
> > +                             MME_QMAN_BASE_OFFSET;
> >
> > -     fw_size = hdev->spl_fw->size;
> > -     if ((fw_size % 4) != 0) {
> > -             dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
> > -                     fw_size);
> > -             rc = -EINVAL;
> > -             goto out;
> > -     }
> > +     WREG32(mmMME_QM_PQ_BASE_LO, lower_32_bits(qman_base_addr));
> > +     WREG32(mmMME_QM_PQ_BASE_HI, upper_32_bits(qman_base_addr));
> > +     WREG32(mmMME_QM_PQ_SIZE, ilog2(MME_QMAN_LENGTH));
> > +     WREG32(mmMME_QM_PQ_PI, 0);
> > +     WREG32(mmMME_QM_PQ_CI, 0);
> > +     WREG32(mmMME_QM_CP_LDMA_SRC_BASE_LO_OFFSET, 0x10C0);
> > +     WREG32(mmMME_QM_CP_LDMA_SRC_BASE_HI_OFFSET, 0x10C4);
> > +     WREG32(mmMME_QM_CP_LDMA_TSIZE_OFFSET, 0x10C8);
> > +     WREG32(mmMME_QM_CP_LDMA_COMMIT_OFFSET, 0x10CC);
> >
> > -     dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
> > +     WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
> > +     WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
> > +     WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_LO, so_base_lo);
> > +     WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_HI, so_base_hi);
> >
> > -     fw_data = (const u64 *) hdev->spl_fw->data;
> > -     dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
> > +     /* QMAN CQ has 8 cache lines */
> > +     WREG32(mmMME_QM_CQ_CFG1, 0x00080008);
> >
> > -     if ((hdev->spl_fw->size % 8) != 0)
> > -             fw_size -= 8;
> > +     WREG32(mmMME_QM_GLBL_ERR_ADDR_LO, gic_base_lo);
> > +     WREG32(mmMME_QM_GLBL_ERR_ADDR_HI, gic_base_hi);
> >
> > -     for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> > -             if (!(i & (0x80000 - 1)))
> > -                     dev_dbg(hdev->dev,
> > -                             "u-boot copied so far %lu out of %lu",
> > -                             i, fw_size);
> > +     WREG32(mmMME_QM_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_QM);
> >
> > -             writeq(*fw_data, dst);
> > -     }
> > +     WREG32(mmMME_QM_GLBL_ERR_CFG, QMAN_MME_ERR_MSG_EN);
> >
> > -     if ((hdev->spl_fw->size % 8) != 0)
> > -             writel(*(const u32 *) fw_data, dst);
> > +     WREG32(mmMME_QM_GLBL_PROT, QMAN_MME_ERR_PROT);
> >
> > -out:
> > -     release_firmware(hdev->spl_fw);
> > -     return rc;
> > +     WREG32(mmMME_QM_GLBL_CFG0, QMAN_MME_ENABLE);
> >  }
> >
> > -/**
> > - * goya_push_linux_to_device - Push LINUX FW code to device
> > - *
> > - * @hdev: pointer to hl_device structure
> > - *
> > - * Copy Linux fw code from firmware file to DDR BAR.
> > - * Returns 0 on success
> > - *
> > - */
> > -static int goya_push_linux_to_device(struct hl_device *hdev)
> > +static void goya_init_mme_cmdq(struct hl_device *hdev)
> >  {
> > -     char fw_name[200];
> > -     const u64 *fw_data;
> > -     void __iomem *dst;
> > -     size_t fw_size, i;
> > -     int rc;
> > +     u32 mtr_base_lo, mtr_base_hi;
> > +     u32 so_base_lo, so_base_hi;
> > +     u32 gic_base_lo, gic_base_hi;
> > +     u64 qman_base_addr;
> >
> > -     snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
> > +     mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> >
> > -     rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> > +     gic_base_lo =
> > +             lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +     gic_base_hi =
> > +             upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> >
> > -     if (rc) {
> > -             dev_err(hdev->dev, "Failed to request Linux fw image\n");
> > -             goto out;
> > -     }
> > +     qman_base_addr = hdev->asic_prop.sram_base_address +
> > +                             MME_QMAN_BASE_OFFSET;
> >
> > -     fw_size = hdev->spl_fw->size;
> > -     if ((fw_size % 4) != 0) {
> > -             dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
> > -                     fw_size);
> > -             rc = -EINVAL;
> > -             goto out;
> > -     }
> > +     WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
> > +     WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
> > +     WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_LO, so_base_lo);
> > +     WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_HI, so_base_hi);
> >
> > -     dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
> > +     /* CMDQ CQ has 20 cache lines */
> > +     WREG32(mmMME_CMDQ_CQ_CFG1, 0x00140014);
> >
> > -     fw_data = (const u64 *) hdev->spl_fw->data;
> > -     dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
> > +     WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_LO, gic_base_lo);
> > +     WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_HI, gic_base_hi);
> >
> > -     if ((hdev->spl_fw->size % 8) != 0)
> > -             fw_size -= 8;
> > +     WREG32(mmMME_CMDQ_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_CMDQ);
> >
> > -     for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> > -             if (!(i & (0x80000 - 1))) {
> > -                     dev_dbg(hdev->dev,
> > -                             "Linux copied so far %lu out of %lu",
> > -                             i, fw_size);
> > -                     usleep_range(20, 100);
> > -             }
> > -             writeq(*fw_data, dst);
> > -     }
> > +     WREG32(mmMME_CMDQ_GLBL_ERR_CFG, CMDQ_MME_ERR_MSG_EN);
> >
> > -     if ((hdev->spl_fw->size % 8) != 0)
> > -             writel(*(const u32 *) fw_data, dst);
> > +     WREG32(mmMME_CMDQ_GLBL_PROT, CMDQ_MME_ERR_PROT);
> >
> > -out:
> > -     release_firmware(hdev->spl_fw);
> > -     return rc;
> > +     WREG32(mmMME_CMDQ_GLBL_CFG0, CMDQ_MME_ENABLE);
> >  }
> >
> > -static int goya_pldm_init_cpu(struct hl_device *hdev)
> > +static void goya_init_mme_qmans(struct hl_device *hdev)
> >  {
> > -     u32 val, unit_rst_val;
> > -     int rc;
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 so_base_lo, so_base_hi;
> >
> > -     /* Must initialize SRAM scrambler before pushing u-boot to SRAM */
> > -     goya_init_golden_registers(hdev);
> > +     if (goya->hw_cap_initialized & HW_CAP_MME)
> > +             return;
> >
> > -     /* Put ARM cores into reset */
> > -     WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
> > -     val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> >
> > -     /* Reset the CA53 MACRO */
> > -     unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > -     WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
> > -     val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > -     WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
> > -     val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +     WREG32(mmMME_SM_BASE_ADDRESS_LOW, so_base_lo);
> > +     WREG32(mmMME_SM_BASE_ADDRESS_HIGH, so_base_hi);
> >
> > -     rc = goya_push_uboot_to_device(hdev);
> > -     if (rc)
> > -             return rc;
> > +     goya_init_mme_qman(hdev);
> > +     goya_init_mme_cmdq(hdev);
> >
> > -     rc = goya_push_linux_to_device(hdev);
> > -     if (rc)
> > -             return rc;
> > +     goya->hw_cap_initialized |= HW_CAP_MME;
> > +}
> > +
> > +static void goya_init_tpc_qman(struct hl_device *hdev, u32 base_off, int tpc_id)
> > +{
> > +     u32 mtr_base_lo, mtr_base_hi;
> > +     u32 so_base_lo, so_base_hi;
> > +     u32 gic_base_lo, gic_base_hi;
> > +     u64 qman_base_addr;
> > +     u32 reg_off = tpc_id * (mmTPC1_QM_PQ_PI - mmTPC0_QM_PQ_PI);
> > +
> > +     mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +
> > +     gic_base_lo =
> > +             lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +     gic_base_hi =
> > +             upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +
> > +     qman_base_addr = hdev->asic_prop.sram_base_address + base_off;
> > +
> > +     WREG32(mmTPC0_QM_PQ_BASE_LO + reg_off, lower_32_bits(qman_base_addr));
> > +     WREG32(mmTPC0_QM_PQ_BASE_HI + reg_off, upper_32_bits(qman_base_addr));
> > +     WREG32(mmTPC0_QM_PQ_SIZE + reg_off, ilog2(TPC_QMAN_LENGTH));
> > +     WREG32(mmTPC0_QM_PQ_PI + reg_off, 0);
> > +     WREG32(mmTPC0_QM_PQ_CI + reg_off, 0);
> > +     WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_LO_OFFSET + reg_off, 0x10C0);
> > +     WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_HI_OFFSET + reg_off, 0x10C4);
> > +     WREG32(mmTPC0_QM_CP_LDMA_TSIZE_OFFSET + reg_off, 0x10C8);
> > +     WREG32(mmTPC0_QM_CP_LDMA_COMMIT_OFFSET + reg_off, 0x10CC);
> > +
> > +     WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
> > +     WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
> > +     WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
> > +     WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
> > +
> > +     WREG32(mmTPC0_QM_CQ_CFG1 + reg_off, 0x00080008);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
> > +     WREG32(mmTPC0_QM_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_ERR_WDATA + reg_off,
> > +                     GOYA_ASYNC_EVENT_ID_TPC0_QM + tpc_id);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_ERR_CFG + reg_off, QMAN_TPC_ERR_MSG_EN);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_PROT + reg_off, QMAN_TPC_ERR_PROT);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_CFG0 + reg_off, QMAN_TPC_ENABLE);
> > +}
> > +
> > +static void goya_init_tpc_cmdq(struct hl_device *hdev, int tpc_id)
> > +{
> > +     u32 mtr_base_lo, mtr_base_hi;
> > +     u32 so_base_lo, so_base_hi;
> > +     u32 gic_base_lo, gic_base_hi;
> > +     u32 reg_off = tpc_id * (mmTPC1_CMDQ_CQ_CFG1 - mmTPC0_CMDQ_CQ_CFG1);
> > +
> > +     mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +
> > +     gic_base_lo =
> > +             lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +     gic_base_hi =
> > +             upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
> > +
> > +     WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
> > +     WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
> > +     WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
> > +     WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
> > +
> > +     WREG32(mmTPC0_CMDQ_CQ_CFG1 + reg_off, 0x00140014);
> > +
> > +     WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
> > +     WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
> > +
> > +     WREG32(mmTPC0_CMDQ_GLBL_ERR_WDATA + reg_off,
> > +                     GOYA_ASYNC_EVENT_ID_TPC0_CMDQ + tpc_id);
> > +
> > +     WREG32(mmTPC0_CMDQ_GLBL_ERR_CFG + reg_off, CMDQ_TPC_ERR_MSG_EN);
> > +
> > +     WREG32(mmTPC0_CMDQ_GLBL_PROT + reg_off, CMDQ_TPC_ERR_PROT);
> > +
> > +     WREG32(mmTPC0_CMDQ_GLBL_CFG0 + reg_off, CMDQ_TPC_ENABLE);
> > +}
> > +
> > +static void goya_init_tpc_qmans(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 so_base_lo, so_base_hi;
> > +     u32 cfg_off = mmTPC1_CFG_SM_BASE_ADDRESS_LOW -
> > +                     mmTPC0_CFG_SM_BASE_ADDRESS_LOW;
> > +     int i;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_TPC)
> > +             return;
> > +
> > +     so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +     so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
> > +
> > +     for (i = 0 ; i < TPC_MAX_NUM ; i++) {
> > +             WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_LOW + i * cfg_off,
> > +                             so_base_lo);
> > +             WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_HIGH + i * cfg_off,
> > +                             so_base_hi);
> > +     }
> > +
> > +     goya_init_tpc_qman(hdev, TPC0_QMAN_BASE_OFFSET, 0);
> > +     goya_init_tpc_qman(hdev, TPC1_QMAN_BASE_OFFSET, 1);
> > +     goya_init_tpc_qman(hdev, TPC2_QMAN_BASE_OFFSET, 2);
> > +     goya_init_tpc_qman(hdev, TPC3_QMAN_BASE_OFFSET, 3);
> > +     goya_init_tpc_qman(hdev, TPC4_QMAN_BASE_OFFSET, 4);
> > +     goya_init_tpc_qman(hdev, TPC5_QMAN_BASE_OFFSET, 5);
> > +     goya_init_tpc_qman(hdev, TPC6_QMAN_BASE_OFFSET, 6);
> > +     goya_init_tpc_qman(hdev, TPC7_QMAN_BASE_OFFSET, 7);
> > +
> > +     for (i = 0 ; i < TPC_MAX_NUM ; i++)
> > +             goya_init_tpc_cmdq(hdev, i);
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_TPC;
> > +}
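
The per-TPC helpers above reach engine N's registers by adding N times the TPC0-to-TPC1 register distance to the TPC0 offset. A minimal user-space sketch of that addressing, assuming the stride is uniform across all eight engines; the numeric offsets and the tpc_reg() helper are hypothetical and only illustrate the arithmetic:

```c
#include <stdint.h>

/* Hypothetical register offsets; the real values live in the ASIC headers. */
#define TPC0_QM_GLBL_CFG0	0xE08000u
#define TPC0_QM_PQ_PI		0xE08060u
#define TPC1_QM_PQ_PI		0xE48060u

/*
 * Offset of a TPC0 register as seen by engine tpc_id, assuming every
 * engine's register block is laid out identically at a fixed stride.
 */
static uint32_t tpc_reg(uint32_t tpc0_reg, unsigned int tpc_id)
{
	const uint32_t stride = TPC1_QM_PQ_PI - TPC0_QM_PQ_PI;

	return tpc0_reg + tpc_id * stride;
}
```

With such a helper the eight explicit per-engine writes could become a loop over tpc_id, at the cost of relying on the uniform-stride assumption.
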
> > +
> > +/**
> > + * goya_disable_internal_queues - Disable internal queues
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static void goya_disable_internal_queues(struct hl_device *hdev)
> > +{
> > +     WREG32(mmMME_QM_GLBL_CFG0, 0);
> > +     WREG32(mmMME_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC0_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC1_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC1_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC2_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC2_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC3_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC3_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC4_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC4_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC5_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC5_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC6_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC6_CMDQ_GLBL_CFG0, 0);
> > +
> > +     WREG32(mmTPC7_QM_GLBL_CFG0, 0);
> > +     WREG32(mmTPC7_CMDQ_GLBL_CFG0, 0);
> > +}
> > +
> > +/**
> > + * goya_stop_internal_queues - Stop internal queues
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_stop_internal_queues(struct hl_device *hdev)
> > +{
> > +     int rc, retval = 0;
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmMME_QM_GLBL_CFG1,
> > +                     mmMME_QM_CP_STS,
> > +                     mmMME_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop MME QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmMME_CMDQ_GLBL_CFG1,
> > +                     mmMME_CMDQ_CP_STS,
> > +                     mmMME_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop MME CMDQ\n");
> > +             retval = -EIO;
> > +     }
>
> If I understand correctly, the queues can and should be stopped independently, and a
> failure to stop one of them shouldn't prevent stopping the others.
> If that's the case, a comment explaining that would be nice.

Correct, I added a comment.
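
For illustration, the pattern under discussion can be sketched in plain C as below. This is a user-space sketch, not the driver code: struct queue_regs, stop_queue() and stop_all_queues() are hypothetical stand-ins, and returning the accumulated error rather than the last call's result is an assumption about the intended behaviour.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one queue; the driver passes three registers. */
struct queue_regs {
	const char *name;
	int will_fail;	/* simulation knob: nonzero makes the stop fail */
};

/* Hypothetical stand-in for goya_stop_queue(); it only simulates a result. */
static int stop_queue(const struct queue_regs *q)
{
	return q->will_fail ? -ETIMEDOUT : 0;
}

/*
 * Try to stop every queue, even after an earlier failure, and report
 * -EIO if any of them failed. *attempted receives the number of tries.
 */
static int stop_all_queues(const struct queue_regs *queues, size_t count,
			   size_t *attempted)
{
	int retval = 0;
	size_t i;

	*attempted = 0;
	for (i = 0; i < count; i++) {
		(*attempted)++;
		if (stop_queue(&queues[i]))
			retval = -EIO;	/* remember the failure, keep going */
	}

	return retval;	/* sticky error, not the last queue's result */
}
```

A table of queues also keeps the "stop independently" rule in one place instead of repeating it per engine.
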
>
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC0_QM_GLBL_CFG1,
> > +                     mmTPC0_QM_CP_STS,
> > +                     mmTPC0_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 0 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC0_CMDQ_GLBL_CFG1,
> > +                     mmTPC0_CMDQ_CP_STS,
> > +                     mmTPC0_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 0 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC1_QM_GLBL_CFG1,
> > +                     mmTPC1_QM_CP_STS,
> > +                     mmTPC1_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 1 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC1_CMDQ_GLBL_CFG1,
> > +                     mmTPC1_CMDQ_CP_STS,
> > +                     mmTPC1_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 1 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC2_QM_GLBL_CFG1,
> > +                     mmTPC2_QM_CP_STS,
> > +                     mmTPC2_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 2 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC2_CMDQ_GLBL_CFG1,
> > +                     mmTPC2_CMDQ_CP_STS,
> > +                     mmTPC2_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 2 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC3_QM_GLBL_CFG1,
> > +                     mmTPC3_QM_CP_STS,
> > +                     mmTPC3_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 3 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC3_CMDQ_GLBL_CFG1,
> > +                     mmTPC3_CMDQ_CP_STS,
> > +                     mmTPC3_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 3 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC4_QM_GLBL_CFG1,
> > +                     mmTPC4_QM_CP_STS,
> > +                     mmTPC4_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 4 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC4_CMDQ_GLBL_CFG1,
> > +                     mmTPC4_CMDQ_CP_STS,
> > +                     mmTPC4_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 4 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC5_QM_GLBL_CFG1,
> > +                     mmTPC5_QM_CP_STS,
> > +                     mmTPC5_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 5 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC5_CMDQ_GLBL_CFG1,
> > +                     mmTPC5_CMDQ_CP_STS,
> > +                     mmTPC5_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 5 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC6_QM_GLBL_CFG1,
> > +                     mmTPC6_QM_CP_STS,
> > +                     mmTPC6_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 6 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC6_CMDQ_GLBL_CFG1,
> > +                     mmTPC6_CMDQ_CP_STS,
> > +                     mmTPC6_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 6 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC7_QM_GLBL_CFG1,
> > +                     mmTPC7_QM_CP_STS,
> > +                     mmTPC7_QM_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 7 QMAN\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     rc = goya_stop_queue(hdev,
> > +                     mmTPC7_CMDQ_GLBL_CFG1,
> > +                     mmTPC7_CMDQ_CP_STS,
> > +                     mmTPC7_CMDQ_GLBL_STS0);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop TPC 7 CMDQ\n");
> > +             retval = -EIO;
> > +     }
> > +
> > +     return retval;
> > +}
> > +
> > +static void goya_resume_internal_queues(struct hl_device *hdev)
> > +{
> > +     WREG32(mmMME_QM_GLBL_CFG1, 0);
> > +     WREG32(mmMME_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC0_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC0_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC1_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC1_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC2_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC2_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC3_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC3_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC4_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC4_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC5_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC5_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC6_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC6_CMDQ_GLBL_CFG1, 0);
> > +
> > +     WREG32(mmTPC7_QM_GLBL_CFG1, 0);
> > +     WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
> > +}
> > +
> > +
> > +/**
> > + * goya_push_uboot_to_device - Push u-boot FW code to device
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Copy u-boot fw code from firmware file to SRAM BAR.
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_push_uboot_to_device(struct hl_device *hdev)
> > +{
> > +     char fw_name[200];
> > +     const u64 *fw_data;
> > +     void __iomem *dst;
> > +     size_t fw_size, i;
> > +     int rc;
> > +
> > +     snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
> > +
> > +     rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to request u-boot fw image\n");
> > +             goto out;
> > +     }
> > +
> > +     fw_size = hdev->spl_fw->size;
> > +     if ((fw_size % 4) != 0) {
> > +             dev_err(hdev->dev, "illegal u-boot firmware size %lu\n",
> > +                     fw_size);
> > +             rc = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     dev_dbg(hdev->dev, "u-boot firmware size == %lu\n", fw_size);
> > +
> > +     fw_data = (const u64 *) hdev->spl_fw->data;
> > +     dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             fw_size -= 8;
> > +
> > +     for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> > +             if (!(i & (0x80000 - 1)))
> > +                     dev_dbg(hdev->dev,
> > +                             "u-boot copied so far %lu out of %lu",
> > +                             i, fw_size);
> > +
> > +             writeq(*fw_data, dst);
> > +     }
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             writel(*(const u32 *) fw_data, dst);
> > +
> > +out:
> > +     release_firmware(hdev->spl_fw);
> > +     return rc;
> > +}
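
The copy loop above moves the image in 8-byte units and finishes with one 4-byte write when the size is a multiple of 4 but not of 8. A minimal user-space sketch of the same split, assuming plain memory as the destination: memcpy() stands in for writeq()/writel(), and copy_fw_image() is a hypothetical name. The sketch derives the bulk length by masking instead of subtracting 8, so a 4-byte image cannot wrap the unsigned size.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy size bytes from src to dst in 8-byte units, followed by one
 * 4-byte unit when size is a multiple of 4 but not of 8.
 * dst is ordinary memory here, not an __iomem BAR mapping.
 */
static int copy_fw_image(uint8_t *dst, const uint8_t *src, size_t size)
{
	size_t bulk = size & ~(size_t)7;	/* largest multiple of 8 <= size */
	size_t i;

	if (size % 4)
		return -EINVAL;	/* same size rule as the quoted code */

	for (i = 0; i < bulk; i += 8)
		memcpy(dst + i, src + i, 8);	/* writeq() in the driver */

	if (size - bulk == 4)
		memcpy(dst + bulk, src + bulk, 4);	/* writel() in the driver */

	return 0;
}
```
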
> > +
> > +/**
> > + * goya_push_linux_to_device - Push LINUX FW code to device
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Copy Linux fw code from firmware file to DDR BAR.
> > + * Returns 0 on success
> > + *
> > + */
> > +static int goya_push_linux_to_device(struct hl_device *hdev)
> > +{
> > +     char fw_name[200];
> > +     const u64 *fw_data;
> > +     void __iomem *dst;
> > +     size_t fw_size, i;
> > +     int rc;
> > +
> > +     snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
> > +
> > +     rc = request_firmware(&hdev->spl_fw, fw_name, hdev->dev);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to request Linux fw image\n");
> > +             goto out;
> > +     }
> > +
> > +     fw_size = hdev->spl_fw->size;
> > +     if ((fw_size % 4) != 0) {
> > +             dev_err(hdev->dev, "illegal Linux firmware size %lu\n",
> > +                     fw_size);
> > +             rc = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     dev_dbg(hdev->dev, "Linux firmware size == %lu\n", fw_size);
> > +
> > +     fw_data = (const u64 *) hdev->spl_fw->data;
> > +     dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             fw_size -= 8;
> > +
> > +     for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
> > +             if (!(i & (0x80000 - 1))) {
> > +                     dev_dbg(hdev->dev,
> > +                             "Linux copied so far %lu out of %lu",
> > +                             i, fw_size);
> > +                     usleep_range(20, 100);
> > +             }
> > +             writeq(*fw_data, dst);
> > +     }
> > +
> > +     if ((hdev->spl_fw->size % 8) != 0)
> > +             writel(*(const u32 *) fw_data, dst);
> > +
> > +out:
> > +     release_firmware(hdev->spl_fw);
> > +     return rc;
> > +}
> > +
> > +static int goya_pldm_init_cpu(struct hl_device *hdev)
> > +{
> > +     u32 val, unit_rst_val;
> > +     int rc;
> > +
> > +     /* Must initialize SRAM scrambler before pushing u-boot to SRAM */
> > +     goya_init_golden_registers(hdev);
> > +
> > +     /* Put ARM cores into reset */
> > +     WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
> > +     val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
> > +
> > +     /* Reset the CA53 MACRO */
> > +     unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +     WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
> > +     val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +     WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
> > +     val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
> > +
> > +     rc = goya_push_uboot_to_device(hdev);
> > +     if (rc)
> > +             return rc;
> > +
> > +     rc = goya_push_linux_to_device(hdev);
> > +     if (rc)
> > +             return rc;
> >
> >       WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
> >       WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
> > @@ -2339,6 +3160,19 @@ static int goya_hw_init(struct hl_device *hdev)
> >
> >       goya_init_security(hdev);
> >
> > +     goya_init_dma_qmans(hdev);
> > +
> > +     goya_init_mme_qmans(hdev);
> > +
> > +     goya_init_tpc_qmans(hdev);
> > +
> > +     rc = goya_init_cpu_queues(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
> > +                     rc);
> > +             goto disable_queues;
> > +     }
> > +
> >       /* CPU initialization is finished, we can now move to 48 bit DMA mask */
> >       rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
> >       if (rc) {
> > @@ -2347,7 +3181,7 @@ static int goya_hw_init(struct hl_device *hdev)
> >               if (rc) {
> >                       dev_err(hdev->dev,
> >                               "Unable to set pci dma mask to 32 bits\n");
> > -                     return rc;
> > +                     goto disable_pci_access;
> >               }
> >       }
> >
> > @@ -2359,7 +3193,7 @@ static int goya_hw_init(struct hl_device *hdev)
> >               if (rc) {
> >                       dev_err(hdev->dev,
> >                               "Unable to set pci consistent dma mask to 32 bits\n");
> > -                     return rc;
> > +                     goto disable_pci_access;
> >               }
> >       }
> >
> > @@ -2367,6 +3201,14 @@ static int goya_hw_init(struct hl_device *hdev)
> >       val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
> >
> >       return 0;
> > +
> > +disable_pci_access:
> > +     goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> > +disable_queues:
> > +     goya_disable_internal_queues(hdev);
> > +     goya_disable_external_queues(hdev);
> > +
> > +     return rc;
> >  }
> >
> >  /**
> > @@ -2473,12 +3315,40 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
> >
> >  int goya_suspend(struct hl_device *hdev)
> >  {
> > -     return 0;
> > +     int rc;
> > +
> > +     rc = goya_stop_internal_queues(hdev);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop internal queues\n");
> > +             return rc;
> > +     }
> > +
> > +     rc = goya_stop_external_queues(hdev);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to stop external queues\n");
> > +             return rc;
> > +     }
> > +
> > +     rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> > +     if (rc)
> > +             dev_err(hdev->dev, "Failed to disable PCI access from CPU\n");
> > +
> > +     return rc;
> >  }
> >
> >  int goya_resume(struct hl_device *hdev)
> >  {
> > -     return 0;
> > +     int rc;
> > +
> > +     goya_resume_external_queues(hdev);
> > +     goya_resume_internal_queues(hdev);
> > +
> > +     rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
> > +     if (rc)
> > +             dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
> > +     return rc;
> >  }
> >
> >  int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
> > @@ -2502,6 +3372,104 @@ int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
> >       return rc;
> >  }
> >
> > +void goya_ring_doorbell(struct hl_device *hdev, u32 hw_queue_id, u32 pi)
> > +{
> > +     u32 db_reg_offset, db_value;
> > +     bool invalid_queue = false;
> > +
> > +     switch (hw_queue_id) {
> > +     case GOYA_QUEUE_ID_DMA_0:
> > +             db_reg_offset = mmDMA_QM_0_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_DMA_1:
> > +             db_reg_offset = mmDMA_QM_1_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_DMA_2:
> > +             db_reg_offset = mmDMA_QM_2_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_DMA_3:
> > +             db_reg_offset = mmDMA_QM_3_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_DMA_4:
> > +             db_reg_offset = mmDMA_QM_4_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_CPU_PQ:
> > +             if (hdev->cpu_queues_enable)
> > +                     db_reg_offset = mmCPU_IF_PF_PQ_PI;
> > +             else
> > +                     invalid_queue = true;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_MME:
> > +             db_reg_offset = mmMME_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC0:
> > +             db_reg_offset = mmTPC0_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC1:
> > +             db_reg_offset = mmTPC1_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC2:
> > +             db_reg_offset = mmTPC2_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC3:
> > +             db_reg_offset = mmTPC3_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC4:
> > +             db_reg_offset = mmTPC4_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC5:
> > +             db_reg_offset = mmTPC5_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC6:
> > +             db_reg_offset = mmTPC6_QM_PQ_PI;
> > +             break;
> > +
> > +     case GOYA_QUEUE_ID_TPC7:
> > +             db_reg_offset = mmTPC7_QM_PQ_PI;
> > +             break;
> > +
> > +     default:
> > +             invalid_queue = true;
> > +     }
> > +
> > +     if (invalid_queue) {
> > +             /* Should never get here */
> > +             dev_err(hdev->dev, "h/w queue %d is invalid. Can't set pi\n",
> > +                     hw_queue_id);
> > +             return;
> > +     }
> > +
> > +     db_value = pi;
> > +
> > +     if (hdev->ifh)
> > +             return;
> > +
> > +     /* ring the doorbell */
> > +     WREG32(db_reg_offset, db_value);
> > +
> > +     if (hw_queue_id == GOYA_QUEUE_ID_CPU_PQ)
> > +             WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> > +                             GOYA_ASYNC_EVENT_ID_PI_UPDATE);
> > +}
> > +
> > +void goya_flush_pq_write(struct hl_device *hdev, u64 *pq, u64 exp_val)
> > +{
> > +     /* Not needed in Goya */
> > +}
> > +
> >  void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
> >                                       dma_addr_t *dma_handle, gfp_t flags)
> >  {
> > @@ -2514,6 +3482,311 @@ void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
> >       dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
> >  }
> >
> > +void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
> > +                             dma_addr_t *dma_handle, u16 *queue_len)
> > +{
> > +     void *base;
> > +     u32 offset;
> > +
> > +     *dma_handle = hdev->asic_prop.sram_base_address;
> > +
> > +     base = hdev->pcie_bar[SRAM_CFG_BAR_ID];
> > +
> > +     switch (queue_id) {
> > +     case GOYA_QUEUE_ID_MME:
> > +             offset = MME_QMAN_BASE_OFFSET;
> > +             *queue_len = MME_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC0:
> > +             offset = TPC0_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC1:
> > +             offset = TPC1_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC2:
> > +             offset = TPC2_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC3:
> > +             offset = TPC3_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC4:
> > +             offset = TPC4_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC5:
> > +             offset = TPC5_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC6:
> > +             offset = TPC6_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     case GOYA_QUEUE_ID_TPC7:
> > +             offset = TPC7_QMAN_BASE_OFFSET;
> > +             *queue_len = TPC_QMAN_LENGTH;
> > +             break;
> > +     default:
> > +             dev_err(hdev->dev, "Got invalid queue id %d\n", queue_id);
> > +             return NULL;
> > +     }
> > +
> > +     base += offset;
> > +     *dma_handle += offset;
> > +
> > +     return base;
> > +}
> > +
> > +int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
> > +                             u32 timeout, long *result)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     struct armcp_packet *pkt;
> > +     dma_addr_t pkt_dma_addr;
> > +     u32 tmp;
> > +     int rc = 0;
> > +
> > +     if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q)) {
> > +             if (result)
> > +                     *result = 0;
> > +             return 0;
> > +     }
> > +
> > +     if (len > CPU_CB_SIZE) {
> > +             dev_err(hdev->dev, "Invalid CPU message size of %d bytes\n",
> > +                     len);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     pkt = hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev, len,
> > +                                                             &pkt_dma_addr);
> > +     if (!pkt) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to allocate DMA memory for packet to CPU\n");
> > +             return -ENOMEM;
> > +     }
> > +
> > +     memcpy(pkt, msg, len);
> > +
> > +     mutex_lock(&hdev->send_cpu_message_lock);
> > +
> > +     if (hdev->disabled)
> > +             goto out;
> > +
> > +     rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_CPU_PQ, len,
> > +                     pkt_dma_addr);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to send CB on CPU PQ (%d)\n", rc);
> > +             goto out;
> > +     }
> > +
> > +     rc = hl_poll_timeout_memory(hdev, (u64) &pkt->fence, timeout, &tmp);
> > +
> > +     hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_CPU_PQ);
> > +
> > +     if (rc == -ETIMEDOUT) {
> > +             dev_err(hdev->dev,
> > +                     "Timeout while waiting for CPU packet fence\n");
> > +             goto out;
> > +     }
> > +
> > +     if (tmp == ARMCP_PACKET_FENCE_VAL) {
> > +             if (pkt->rc) {
> > +                     dev_err(hdev->dev,
> > +                             "failed to execute CPU packet, rc: %d\n",
> > +                                     pkt->rc);
> > +                     rc = -EINVAL;
> > +             } else if (result) {
> > +                     *result = pkt->result;
>
> In some of the error paths above, *result is never initialized.
>
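One way to address the comment above would be to give *result a defined
value before any early return. The following is only a minimal,
self-contained sketch of that idea. send_message_sketch(), struct
fake_reply and the alloc_ok/timed_out flags are hypothetical stand-ins
for the driver's DMA pool allocation and fence polling; none of them
come from the patch:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the firmware reply (not struct armcp_packet). */
struct fake_reply {
	int rc;		/* non-zero means the firmware rejected the packet */
	long result;	/* payload returned on success */
};

/*
 * Simplified model of the control flow in goya_send_cpu_message().
 * 'alloc_ok' and 'timed_out' are illustrative flags that replace the
 * real allocation and the fence polling.
 */
static int send_message_sketch(int alloc_ok, int timed_out,
			       const struct fake_reply *reply, long *result)
{
	/* Give the caller a defined value on every path, including errors. */
	if (result)
		*result = 0;

	if (!alloc_ok)
		return -ENOMEM;

	if (timed_out)
		return -ETIMEDOUT;

	if (reply->rc)
		return -EINVAL;

	if (result)
		*result = reply->result;

	return 0;
}
```

This matters for callers such as goya_test_cpu_queue(), which prints
result in its failure branch.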
> > +             }
> > +     } else {
> > +             dev_err(hdev->dev, "CPU packet wrong fence value\n");
> > +             rc = -EINVAL;
> > +     }
> > +
> > +out:
> > +     mutex_unlock(&hdev->send_cpu_message_lock);
> > +
> > +     hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, len, pkt);
> > +
> > +     return rc;
> > +}
> > +
> > +int goya_test_queue(struct hl_device *hdev, u32 hw_queue_id)
> > +{
> > +     struct packet_msg_prot *fence_pkt;
> > +     dma_addr_t pkt_dma_addr;
> > +     u32 fence_val, tmp;
> > +     dma_addr_t fence_dma_addr;
> > +     u32 *fence_ptr;
> > +     int rc;
> > +
> > +     fence_val = GOYA_QMAN0_FENCE_VAL;
> > +
> > +     fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
> > +                                                     &fence_dma_addr);
> > +     if (!fence_ptr) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to allocate memory for queue testing\n");
> > +             return -ENOMEM;
> > +     }
> > +
> > +     *fence_ptr = 0;
> > +
> > +     fence_pkt = hdev->asic_funcs->dma_pool_zalloc(hdev,
> > +                                     sizeof(struct packet_msg_prot),
> > +                                     GFP_KERNEL, &pkt_dma_addr);
> > +     if (!fence_pkt) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to allocate packet for queue testing\n");
> > +             rc = -ENOMEM;
> > +             goto free_fence_ptr;
> > +     }
> > +
> > +     fence_pkt->opcode = PACKET_MSG_PROT;
> > +     fence_pkt->value = fence_val;
> > +     fence_pkt->addr = fence_dma_addr +
> > +                             hdev->asic_prop.host_phys_base_address;
> > +
> > +     rc = hl_hw_queue_send_cb_no_cmpl(hdev, hw_queue_id,
> > +                                     sizeof(struct packet_msg_prot),
> > +                                     pkt_dma_addr);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to send fence packet\n");
> > +             goto free_pkt;
> > +     }
> > +
> > +     rc = hl_poll_timeout_memory(hdev, (u64) fence_ptr,
> > +                                     GOYA_TEST_QUEUE_WAIT_USEC, &tmp);
> > +
> > +     hl_hw_queue_inc_ci_kernel(hdev, hw_queue_id);
> > +
> > +     if ((!rc) && (tmp == fence_val)) {
> > +             dev_info(hdev->dev,
> > +                     "queue test on H/W queue %d succeeded\n",
> > +                     hw_queue_id);
> > +     } else {
> > +             dev_err(hdev->dev,
> > +                     "H/W queue %d test failed (scratch(0x%08llX) == 0x%08X)\n",
> > +                     hw_queue_id, fence_dma_addr, tmp);
> > +             rc = -EINVAL;
> > +     }
> > +
> > +free_pkt:
> > +     hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_pkt,
> > +                                     pkt_dma_addr);
> > +free_fence_ptr:
> > +     hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_ptr,
> > +                                     fence_dma_addr);
> > +     return rc;
> > +}
> > +
> > +int goya_test_cpu_queue(struct hl_device *hdev)
> > +{
> > +     struct armcp_packet test_pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     /* cpu_queues_enable flag is always checked in send cpu message */
> > +
> > +     memset(&test_pkt, 0, sizeof(test_pkt));
> > +
> > +     test_pkt.opcode = ARMCP_PACKET_TEST;
> > +     test_pkt.value = ARMCP_PACKET_FENCE_VAL;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &test_pkt,
> > +                     sizeof(test_pkt), HL_DEVICE_TIMEOUT_USEC, &result);
> > +
> > +     if (!rc)
> > +             dev_info(hdev->dev, "queue test on CPU queue succeeded\n");
> > +     else
> > +             dev_err(hdev->dev, "CPU queue test failed (0x%08lX)\n", result);
> > +
> > +     return rc;
> > +}
> > +
> > +static int goya_test_queues(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int i, rc, ret_val = 0;
> > +
> > +     if (hdev->ifh)
> > +             return 0;
> > +
> > +     for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
> > +             rc = goya_test_queue(hdev, i);
> > +             if (rc)
> > +                     ret_val = -EINVAL;
> > +     }
> > +
> > +     if (hdev->cpu_queues_enable) {
> > +             rc = goya->test_cpu_queue(hdev);
> > +             if (rc)
> > +                     ret_val = -EINVAL;
> > +     }
> > +
> > +     return ret_val;
> > +}
> > +
> > +void *goya_dma_pool_zalloc(struct hl_device *hdev, size_t size, gfp_t mem_flags,
> > +                             dma_addr_t *dma_handle)
> > +{
> > +     if (size > GOYA_DMA_POOL_BLK_SIZE)
> > +             return NULL;
> > +
> > +     return dma_pool_zalloc(hdev->dma_pool, mem_flags, dma_handle);
> > +}
> > +
> > +void goya_dma_pool_free(struct hl_device *hdev, void *vaddr,
> > +                     dma_addr_t dma_addr)
> > +{
> > +     dma_pool_free(hdev->dma_pool, vaddr, dma_addr);
> > +}
> > +
> > +void *goya_cpu_accessible_dma_pool_alloc(struct hl_device *hdev, size_t size,
> > +                     dma_addr_t *dma_handle)
> > +{
> > +     u64 kernel_addr;
> > +
> > +     /* roundup to CPU_PKT_SIZE */
> > +     size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
> > +
> > +     kernel_addr = gen_pool_alloc(hdev->cpu_accessible_dma_pool, size);
> > +
> > +     *dma_handle = hdev->cpu_accessible_dma_address +
> > +                     (kernel_addr - (u64) hdev->cpu_accessible_dma_mem);
> > +
> > +     return (void *) kernel_addr;
> > +}
> > +
> > +void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
> > +                     void *vaddr)
> > +{
> > +     /* roundup to CPU_PKT_SIZE */
> > +     size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
> > +
> > +     gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
> > +}
> > +
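For readers following the two pool helpers above: both round the
requested size up to a multiple of the packet size with a mask, and the
alloc side derives the DMA handle from the offset of the CPU address
inside the shared region. A small user-space sketch of that arithmetic,
assuming a power-of-two packet size. PKT_SIZE and PKT_MASK below are
illustrative values only, since the definitions of CPU_PKT_SIZE and
CPU_PKT_MASK are not shown in this hunk:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; not the driver's definitions. */
#define PKT_SIZE 64u
#define PKT_MASK (~((size_t)PKT_SIZE - 1))

/* Round 'size' up to the next multiple of PKT_SIZE (power of two only). */
static size_t round_up_pkt(size_t size)
{
	return (size + (PKT_SIZE - 1)) & PKT_MASK;
}

/*
 * Translate a CPU address inside the shared region to its DMA address:
 * same offset, different base.
 */
static uint64_t to_dma_addr(uint64_t dma_base, uint64_t cpu_base,
			    uint64_t cpu_addr)
{
	return dma_base + (cpu_addr - cpu_base);
}
```

The mask trick only works when the packet size is a power of two; for
other sizes a divide-and-multiply round-up would be needed instead.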
> > +
> > +static void goya_hw_queues_lock(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     spin_lock(&goya->hw_queues_lock);
> > +}
> > +
> > +static void goya_hw_queues_unlock(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     spin_unlock(&goya->hw_queues_lock);
> > +}
> > +
> >  static const struct hl_asic_funcs goya_funcs = {
> >       .early_init = goya_early_init,
> >       .early_fini = goya_early_fini,
> > @@ -2525,8 +3798,19 @@ static const struct hl_asic_funcs goya_funcs = {
> >       .resume = goya_resume,
> >       .mmap = goya_mmap,
> >       .cb_mmap = goya_cb_mmap,
> > +     .ring_doorbell = goya_ring_doorbell,
> > +     .flush_pq_write = goya_flush_pq_write,
> >       .dma_alloc_coherent = goya_dma_alloc_coherent,
> >       .dma_free_coherent = goya_dma_free_coherent,
> > +     .get_int_queue_base = goya_get_int_queue_base,
> > +     .test_queues = goya_test_queues,
> > +     .dma_pool_zalloc = goya_dma_pool_zalloc,
> > +     .dma_pool_free = goya_dma_pool_free,
> > +     .cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
> > +     .cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
> > +     .hw_queues_lock = goya_hw_queues_lock,
> > +     .hw_queues_unlock = goya_hw_queues_unlock,
> > +     .send_cpu_message = goya_send_cpu_message
> >  };
> >
> >  /**
> > diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> > index 45a6d2ca2752..598a718d3df1 100644
> > --- a/drivers/misc/habanalabs/goya/goyaP.h
> > +++ b/drivers/misc/habanalabs/goya/goyaP.h
> > @@ -9,6 +9,7 @@
> >  #define GOYAP_H_
> >
> >  #include "habanalabs.h"
> > +#include "include/goya/goya_packets.h"
> >  #include "include/goya/goya_boot_if.h"
> >  #include "include/goya/goya.h"
> >
> > @@ -117,12 +118,17 @@ enum goya_fw_component {
> >  };
> >
> >  struct goya_device {
> > +     int (*test_cpu_queue)(struct hl_device *hdev);
> > +
> >       /* TODO: remove hw_queues_lock after moving to scheduler code */
> >       spinlock_t      hw_queues_lock;
> >       u64             ddr_bar_cur_addr;
> >       u32             hw_cap_initialized;
> >  };
> >
> > +int goya_test_cpu_queue(struct hl_device *hdev);
> > +int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
> > +                             u32 timeout, long *result);
> >  void goya_init_security(struct hl_device *hdev);
> >
> >  #endif /* GOYAP_H_ */
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index adda281ec2af..8232e2259463 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -30,10 +30,36 @@
> >  struct hl_device;
> >  struct hl_fpriv;
> >
> > +/**
> > + * enum hl_queue_type - Supported QUEUE types.
> > + * @QUEUE_TYPE_NA: queue is not available.
> > + * @QUEUE_TYPE_EXT: external queue which is a DMA channel that may access the
> > + *                  host.
> > + * @QUEUE_TYPE_INT: internal queue that performs DMA inside the device's
> > + *                   memories and/or operates the compute engines.
> > + * @QUEUE_TYPE_CPU: S/W queue for communication with the device's CPU.
> > + */
> > +enum hl_queue_type {
> > +     QUEUE_TYPE_NA,
> > +     QUEUE_TYPE_EXT,
> > +     QUEUE_TYPE_INT,
> > +     QUEUE_TYPE_CPU
> > +};
> >
> > +/**
> > + * struct hw_queue_properties - queue information.
> > + * @type: queue type.
> > + * @kmd_only: true if only KMD is allowed to send a job to this queue, false
> > + *            otherwise.
> > + */
> > +struct hw_queue_properties {
> > +     enum hl_queue_type      type;
> > +     u8                      kmd_only;
> > +};
> >
> >  /**
> >   * struct asic_fixed_properties - ASIC specific immutable properties.
> > + * @hw_queues_props: H/W queues properties.
> >   * @uboot_ver: F/W U-boot version.
> >   * @preboot_ver: F/W Preboot version.
> >   * @sram_base_address: SRAM physical start address.
> > @@ -64,6 +90,7 @@ struct hl_fpriv;
> >   * @tpc_enabled_mask: which TPCs are enabled.
> >   */
> >  struct asic_fixed_properties {
> > +     struct hw_queue_properties      hw_queues_props[HL_MAX_QUEUES];
> >       char                    uboot_ver[VERSION_MAX_LEN];
> >       char                    preboot_ver[VERSION_MAX_LEN];
> >       u64                     sram_base_address;
> > @@ -145,7 +172,92 @@ struct hl_cb {
> >
> >
> >
> > +/*
> > + * QUEUES
> > + */
> > +
> > +struct hl_cs_job;
> > +
> > +/*
> > + * Currently, there are two limitations on the maximum length of a queue:
> > + *
> > + * 1. The memory footprint of the queue. The current allocated space for the
> > + *    queue is PAGE_SIZE. Because each entry in the queue is HL_BD_SIZE,
> > + *    the maximum length of the queue can be PAGE_SIZE / HL_BD_SIZE,
> > + *    which currently is 4096/16 = 256 entries.
> > + *
> > + *    To increase that, we need either to decrease the size of the
> > + *    BD (difficult), or allocate more than a single page (easier).
> > + *
> > + * 2. Because the size of the JOB handle field in the BD CTL / completion queue
> > + *    is 10-bit, we can have up to 1024 open jobs per hardware queue.
> > + *    Therefore, each queue can hold up to 1024 entries.
> > + *
> > + * HL_QUEUE_LENGTH is in units of struct hl_bd.
> > + * HL_QUEUE_LENGTH * sizeof(struct hl_bd) should be <= HL_PAGE_SIZE
> > + */
> > +
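The constraints listed in the comment above could also be checked at
compile time. A minimal sketch in plain C11, using the numbers from the
comment itself (4096-byte page, 16-byte BD, 10-bit job handle). The
SKETCH_* names and max_queue_entries() are made up for the example; in
the kernel, BUILD_BUG_ON() would be the usual tool:

```c
/* Values taken from the comment in the patch. */
#define SKETCH_PAGE_SIZE	4096
#define SKETCH_BD_SIZE		16
#define SKETCH_QUEUE_LENGTH	256
#define SKETCH_MAX_JOBS		1024	/* 10-bit job handle field */

/* Limitation 1: the whole queue must fit in a single page. */
_Static_assert(SKETCH_QUEUE_LENGTH * SKETCH_BD_SIZE <= SKETCH_PAGE_SIZE,
	       "queue does not fit in one page");

/* The pi-to-offset mask requires a power-of-two length. */
_Static_assert((SKETCH_QUEUE_LENGTH & (SKETCH_QUEUE_LENGTH - 1)) == 0,
	       "queue length must be a power of 2");

/* Limitation 2: no more entries than the job handle can address. */
_Static_assert(SKETCH_QUEUE_LENGTH <= SKETCH_MAX_JOBS,
	       "queue length exceeds job handle range");

/* Largest number of entries that fits in 'page_size' bytes. */
static int max_queue_entries(int page_size, int bd_size)
{
	return page_size / bd_size;
}
```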
> > +#define HL_PAGE_SIZE                 4096 /* minimum page size */
> > +/* Must be power of 2 (HL_PAGE_SIZE / HL_BD_SIZE) */
> >  #define HL_QUEUE_LENGTH                      256
> > +#define HL_QUEUE_SIZE_IN_BYTES               (HL_QUEUE_LENGTH * HL_BD_SIZE)
> > +
> > +/*
> > + * HL_CQ_LENGTH is in units of struct hl_cq_entry.
> > + * HL_CQ_LENGTH should be <= HL_PAGE_SIZE
> > + */
> > +#define HL_CQ_LENGTH                 HL_QUEUE_LENGTH
> > +#define HL_CQ_SIZE_IN_BYTES          (HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
> > +
> > +
> > +
> > +/**
> > + * struct hl_hw_queue - describes a H/W transport queue.
> > + * @shadow_queue: pointer to a shadow queue that holds pointers to jobs.
> > + * @queue_type: type of queue.
> > + * @kernel_address: holds the queue's kernel virtual address.
> > + * @bus_address: holds the queue's DMA address.
> > + * @pi: holds the queue's pi value.
> > + * @ci: holds the queue's ci value, AS CALCULATED BY THE DRIVER (not real ci).
> > + * @hw_queue_id: the id of the H/W queue.
> > + * @int_queue_len: length of internal queue (number of entries).
> > + * @valid: is the queue valid (we have an array of 32 queues, not all of them
> > + *           exist).
> > + */
> > +struct hl_hw_queue {
> > +     struct hl_cs_job        **shadow_queue;
> > +     enum hl_queue_type      queue_type;
> > +     u64                     kernel_address;
> > +     dma_addr_t              bus_address;
> > +     u32                     pi;
> > +     u32                     ci;
> > +     u32                     hw_queue_id;
> > +     u16                     int_queue_len;
> > +     u8                      valid;
> > +};
> > +
> > +/**
> > + * struct hl_cq - describes a completion queue
> > + * @hdev: pointer to the device structure
> > + * @kernel_address: holds the queue's kernel virtual address
> > + * @bus_address: holds the queue's DMA address
> > + * @hw_queue_id: the id of the matching H/W queue
> > + * @ci: ci inside the queue
> > + * @pi: pi inside the queue
> > + * @free_slots_cnt: counter of free slots in queue
> > + */
> > +struct hl_cq {
> > +     struct hl_device        *hdev;
> > +     u64                     kernel_address;
> > +     dma_addr_t              bus_address;
> > +     u32                     hw_queue_id;
> > +     u32                     ci;
> > +     u32                     pi;
> > +     atomic_t                free_slots_cnt;
> > +};
> > +
> > +
> > +
> >
> >
> >  /*
> > @@ -180,8 +292,20 @@ enum hl_asic_type {
> >   * @resume: handles IP specific H/W or SW changes for resume.
> >   * @mmap: mmap function, does nothing.
> >   * @cb_mmap: maps a CB.
> > + * @ring_doorbell: increment PI on a given QMAN.
> > + * @flush_pq_write: flush PQ entry write if necessary, WARN if flushing failed.
> >   * @dma_alloc_coherent: DMA allocate coherent memory.
> >   * @dma_free_coherent: free DMA allocation.
> > + * @get_int_queue_base: get the internal queue base address.
> > + * @test_queues: run simple test on all queues for sanity check.
> > + * @dma_pool_zalloc: small DMA allocation of coherent memory from DMA pool.
> > + *                   size of allocation is HL_DMA_POOL_BLK_SIZE.
> > + * @dma_pool_free: free small DMA allocation from pool.
> > + * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
> > + * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
> > + * @hw_queues_lock: acquire H/W queues lock.
> > + * @hw_queues_unlock: release H/W queues lock.
> > + * @send_cpu_message: send buffer to ArmCP.
> >   */
> >  struct hl_asic_funcs {
> >       int (*early_init)(struct hl_device *hdev);
> > @@ -195,10 +319,27 @@ struct hl_asic_funcs {
> >       int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> >       int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
> >                       u64 kaddress, phys_addr_t paddress, u32 size);
> > +     void (*ring_doorbell)(struct hl_device *hdev, u32 hw_queue_id, u32 pi);
> > +     void (*flush_pq_write)(struct hl_device *hdev, u64 *pq, u64 exp_val);
> >       void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
> >                                       dma_addr_t *dma_handle, gfp_t flag);
> >       void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
> >                                       void *cpu_addr, dma_addr_t dma_handle);
> > +     void* (*get_int_queue_base)(struct hl_device *hdev, u32 queue_id,
> > +                             dma_addr_t *dma_handle, u16 *queue_len);
> > +     int (*test_queues)(struct hl_device *hdev);
> > +     void* (*dma_pool_zalloc)(struct hl_device *hdev, size_t size,
> > +                             gfp_t mem_flags, dma_addr_t *dma_handle);
> > +     void (*dma_pool_free)(struct hl_device *hdev, void *vaddr,
> > +                             dma_addr_t dma_addr);
> > +     void* (*cpu_accessible_dma_pool_alloc)(struct hl_device *hdev,
> > +                             size_t size, dma_addr_t *dma_handle);
> > +     void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
> > +                             size_t size, void *vaddr);
> > +     void (*hw_queues_lock)(struct hl_device *hdev);
> > +     void (*hw_queues_unlock)(struct hl_device *hdev);
> > +     int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
> > +                             u16 len, u32 timeout, long *result);
> >  };
> >
> >
> > @@ -240,6 +381,17 @@ struct hl_ctx_mgr {
> >
> >
> >
> > +/**
> > + * struct hl_cs_job - command submission job.
> > + * @finish_work: workqueue object to run when job is completed.
> > + * @id: the id of this job inside a CS.
> > + */
> > +struct hl_cs_job {
> > +     struct work_struct      finish_work;
> > +     u32                     id;
> > +};
> > +
> > +
> >  /*
> >   * FILE PRIVATE STRUCTURE
> >   */
> > @@ -316,7 +468,11 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @dev: related kernel basic device structure.
> >   * @asic_name: ASIC specific name.
> >   * @asic_type: ASIC specific type.
> > + * @completion_queue: array of hl_cq.
> > + * @cq_wq: work queue of completion queues for executing work in process context
> > + * @eq_wq: work queue of event queue for executing work in process context.
> >   * @kernel_ctx: KMD context structure.
> > + * @kernel_queues: array of hl_hw_queue.
> >   * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CBs.
> >   * @dma_pool: DMA pool for small allocations.
> >   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> > @@ -326,6 +482,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @asid_bitmap: holds used/available ASIDs.
> >   * @asid_mutex: protects asid_bitmap.
> >   * @device_open: lock for sanity checks upon FD open.
> > + * @send_cpu_message_lock: enforces only one message in KMD <-> ArmCP queue.
> >   * @asic_prop: ASIC specific immutable properties.
> >   * @asic_funcs: ASIC specific functions.
> >   * @asic_specific: ASIC specific information to use only from ASIC files.
> > @@ -345,7 +502,10 @@ struct hl_device {
> >       struct device                   *dev;
> >       char                            asic_name[16];
> >       enum hl_asic_type               asic_type;
> > +     struct hl_cq                    *completion_queue;
> > +     struct workqueue_struct         *cq_wq;
> >       struct hl_ctx                   *kernel_ctx;
> > +     struct hl_hw_queue              *kernel_queues;
> >       struct hl_cb_mgr                kernel_cb_mgr;
> >       struct dma_pool                 *dma_pool;
> >       void                            *cpu_accessible_dma_mem;
> > @@ -356,6 +516,7 @@ struct hl_device {
> >       struct mutex                    asid_mutex;
> >       /* TODO: change to rw_sem for multiple contexts (same as other IOCTL) */
> >       struct mutex                    device_open;
> > +     struct mutex                    send_cpu_message_lock;
> >       struct asic_fixed_properties    asic_prop;
> >       const struct hl_asic_funcs      *asic_funcs;
> >       void                            *asic_specific;
> > @@ -374,7 +535,9 @@ struct hl_device {
> >       u8                              cpu_enable;
> >       u8                              reset_pcilink;
> >       u8                              config_pll;
> > +     u8                              cpu_queues_enable;
> >       u8                              fw_loading;
> > +     u8                              ifh;
> >       u8                              pldm;
> >  };
> >
> > @@ -418,7 +581,18 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
> >                               u32 *val);
> >  int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
> >                               u32 timeout_us, u32 *val);
> > -
> > +int hl_hw_queues_create(struct hl_device *hdev);
> > +void hl_hw_queues_destroy(struct hl_device *hdev);
> > +int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
> > +                             u32 cb_size, u64 cb_ptr);
> > +u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
> > +void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
> > +
> > +#define hl_queue_inc_ptr(p)          hl_hw_queue_add_ptr(p, 1)
> > +#define hl_pi_2_offset(pi)           ((pi) & (HL_QUEUE_LENGTH - 1))
> > +
> > +int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
> > +void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
> >  int hl_asid_init(struct hl_device *hdev);
> >  void hl_asid_fini(struct hl_device *hdev);
> >  unsigned long hl_asid_alloc(struct hl_device *hdev);
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > index bd80683118d3..b64f58ad0f5d 100644
> > --- a/drivers/misc/habanalabs/habanalabs_drv.c
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -184,13 +184,19 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
> >       hdev->cpu_enable = 1;
> >       hdev->reset_pcilink = 0;
> >       hdev->config_pll = 0;
> > +     hdev->cpu_queues_enable = 1;
> >       hdev->fw_loading = 1;
> > +     hdev->ifh = 0;
> >       hdev->pldm = 0;
> >
> >       /* If CPU is disabled, no point in loading FW */
> >       if (!hdev->cpu_enable)
> >               hdev->fw_loading = 0;
> >
> > +     /* If we don't load FW, no need to initialize CPU queues */
> > +     if (!hdev->fw_loading)
> > +             hdev->cpu_queues_enable = 0;
> > +
> >       hdev->disabled = true;
> >       hdev->pdev = pdev; /* can be NULL in case of simulator device */
> >
> > diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
> > new file mode 100644
> > index 000000000000..65102a5bc2ca
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/hw_queue.c
> > @@ -0,0 +1,404 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#include <linux/dma-mapping.h>
> > +#include <linux/sched.h>
> > +#include <linux/wait.h>
> > +#include <linux/delay.h>
> > +
> > +/**
> > + * hl_hw_queue_add_ptr - add to pi or ci and wrap around if needed
> > + *
> > + * @ptr: the current pi/ci value
> > + * @val: the amount to add
> > + *
> > + * Add val to ptr. It can go until twice the queue length.
> > + */
> > +inline u32 hl_hw_queue_add_ptr(u32 ptr, u16 val)
> > +{
> > +     ptr += val;
> > +     ptr &= ((HL_QUEUE_LENGTH << 1) - 1);
> > +     return ptr;
> > +}
> > +
> > +static inline int queue_free_slots(struct hl_hw_queue *q, u32 queue_len)
> > +{
> > +     int delta = (q->pi - q->ci);
> > +
> > +     if (delta >= 0)
> > +             return (queue_len - delta);
> > +     else
> > +             return (abs(delta) - queue_len);
> > +}
> > +
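Since pi and ci wrap at twice the queue length while the slot index is
taken modulo the queue length, a full queue (pi - ci == length) can be
told apart from an empty one (pi == ci). A stand-alone sketch of the
two helpers above, useful for checking the arithmetic in user space.
QUEUE_LEN mirrors HL_QUEUE_LENGTH from the patch; add_ptr(),
ptr_to_offset() and free_slots() are illustrative names, not driver
functions:

```c
#include <stdint.h>
#include <stdlib.h>

#define QUEUE_LEN 256u	/* same value as HL_QUEUE_LENGTH in the patch */

/* Advance a pi/ci value; it wraps at twice the queue length. */
static uint32_t add_ptr(uint32_t ptr, uint16_t val)
{
	return (ptr + val) & ((QUEUE_LEN << 1) - 1);
}

/* Map a pi/ci value to a slot index inside the ring. */
static uint32_t ptr_to_offset(uint32_t ptr)
{
	return ptr & (QUEUE_LEN - 1);
}

/* Free slots, following the arithmetic of queue_free_slots(). */
static int free_slots(uint32_t pi, uint32_t ci)
{
	int delta = (int)pi - (int)ci;

	if (delta >= 0)
		return (int)QUEUE_LEN - delta;

	/* pi has already wrapped past 2 * QUEUE_LEN, ci has not. */
	return abs(delta) - (int)QUEUE_LEN;
}
```

For example, with pi wrapped back to 0 and ci at 300, 212 entries are
in flight and 44 are free, which is what the negative-delta branch
computes.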
> > +/**
> > + * ext_queue_submit_bd - Submit a buffer descriptor to an external queue
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + * @q: pointer to habanalabs queue structure
> > + * @ctl: BD's control word
> > + * @len: BD's length
> > + * @ptr: BD's pointer
> > + *
> > + * This function assumes there is enough space on the queue to submit a new
> > + * BD to it. It initializes the next BD and calls the device specific
> > + * function to set the pi (and doorbell)
> > + *
> > + * This function must be called when the scheduler mutex is taken
> > + *
> > + */
> > +static void ext_queue_submit_bd(struct hl_device *hdev, struct hl_hw_queue *q,
> > +                             u32 ctl, u32 len, u64 ptr)
> > +{
> > +     struct hl_bd *bd;
> > +
> > +     bd = (struct hl_bd *) q->kernel_address;
> > +     bd += hl_pi_2_offset(q->pi);
> > +     bd->ctl = ctl;
> > +     bd->len = len;
> > +     bd->ptr = ptr + hdev->asic_prop.host_phys_base_address;
> > +
> > +     q->pi = hl_queue_inc_ptr(q->pi);
> > +     hdev->asic_funcs->ring_doorbell(hdev, q->hw_queue_id, q->pi);
> > +}
> > +
> > +/**
> > + * ext_queue_sanity_checks - perform some sanity checks on external queue
> > + *
> > + * @hdev              : pointer to hl_device structure
> > + * @q                 :      pointer to hl_hw_queue structure
> > + * @num_of_entries    : how many entries to check for space
> > + * @reserve_cq_entry  :      whether to reserve an entry in the cq
> > + *
> > + * H/W queues spinlock should be taken before calling this function
> > + *
> > + * Perform the following:
> > + * - Make sure we have enough space in the h/w queue
> > + * - Make sure we have enough space in the completion queue
> > + * - Reserve space in the completion queue (needs to be reversed if there
> > + *   is a failure down the road before the actual submission of work). Only
> > + *   do this action if reserve_cq_entry is true
> > + *
> > + */
> > +static int ext_queue_sanity_checks(struct hl_device *hdev,
> > +                             struct hl_hw_queue *q, int num_of_entries,
> > +                             bool reserve_cq_entry)
> > +{
> > +     atomic_t *free_slots =
> > +                     &hdev->completion_queue[q->hw_queue_id].free_slots_cnt;
> > +     int free_slots_cnt;
> > +
> > +     /* Check we have enough space in the queue */
> > +     free_slots_cnt = queue_free_slots(q, HL_QUEUE_LENGTH);
> > +
> > +     if (free_slots_cnt < num_of_entries) {
> > +             dev_dbg(hdev->dev, "Queue %d doesn't have room for %d CBs\n",
> > +                     q->hw_queue_id, num_of_entries);
> > +             return -EAGAIN;
> > +     }
> > +
> > +     if (reserve_cq_entry) {
> > +             /*
> > +              * Check we have enough space in the completion queue.
> > +              * Subtract num_of_entries from the free slots counter.
> > +              * If the result is negative, the CQ is full, so we can't
> > +              * submit a new CB because we won't get ack on its completion.
> > +              * atomic_add_negative returns true if the result is negative.
> > +              */
> > +             if (atomic_add_negative(num_of_entries * -1, free_slots)) {
> > +                     dev_dbg(hdev->dev, "No space for %d on CQ %d\n",
> > +                             num_of_entries, q->hw_queue_id);
> > +                     atomic_add(num_of_entries, free_slots);
> > +                     return -EAGAIN;
> > +             }
> > +     }
> > +
> > +     return 0;
> > +}
> > +
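The completion-queue reservation above is a subtract-then-roll-back
pattern: take the slots optimistically, and give them back if the
counter went negative. A user-space sketch of the same pattern with C11
atomics, for illustration only. reserve_cq_slots() is a hypothetical
name, and atomic_fetch_sub()/atomic_fetch_add() stand in for the
kernel's atomic_add_negative()/atomic_add():

```c
#include <errno.h>
#include <stdatomic.h>

/*
 * Try to reserve 'n' completion-queue slots.
 * Returns 0 on success, -EAGAIN if fewer than 'n' slots were free.
 */
static int reserve_cq_slots(atomic_int *free_slots, int n)
{
	/* atomic_fetch_sub() returns the value before the subtraction. */
	int remaining = atomic_fetch_sub(free_slots, n) - n;

	if (remaining < 0) {
		/* Not enough room: undo the reservation. */
		atomic_fetch_add(free_slots, n);
		return -EAGAIN;
	}

	return 0;
}
```

Note that between the subtraction and the roll-back the counter is
briefly lower than the real number of free slots, so a concurrent
caller may see a spurious -EAGAIN; it never over-commits the queue.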
> > +/**
> > + * hl_hw_queue_send_cb_no_cmpl - send a single CB (not a JOB) without completion
> > + *
> > + * @hdev: pointer to hl_device structure
> > + * @hw_queue_id: Queue's ID
> > + * @cb_size: size of CB
> > + * @cb_ptr: pointer to CB location
> > + *
> > + * This function sends a single CB, that must NOT generate a completion entry
> > + *
> > + */
> > +int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
> > +                             u32 cb_size, u64 cb_ptr)
> > +{
> > +     struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
> > +     int rc;
> > +
> > +     /*
> > +      * The CPU queue is a synchronous queue with an effective depth of
> > +      * a single entry (although it is allocated with room for multiple
> > +      * entries). Therefore, there is a different lock, called
> > +      * send_cpu_message_lock, that serializes accesses to the CPU queue.
> > +      * As a result, we don't need to lock the access to the entire H/W
> > +      * queues module when submitting a JOB to the CPU queue
> > +      */
> > +     if (q->queue_type != QUEUE_TYPE_CPU)
> > +             hdev->asic_funcs->hw_queues_lock(hdev);
> > +
> > +     if (hdev->disabled) {
> > +             rc = -EPERM;
> > +             goto out;
> > +     }
> > +
> > +     rc = ext_queue_sanity_checks(hdev, q, 1, false);
> > +     if (rc)
> > +             goto out;
> > +
> > +     ext_queue_submit_bd(hdev, q, 0, cb_size, cb_ptr);
> > +
> > +out:
> > +     if (q->queue_type != QUEUE_TYPE_CPU)
> > +             hdev->asic_funcs->hw_queues_unlock(hdev);
> > +
> > +     return rc;
> > +}
> > +
> > +/**
> > + * hl_hw_queue_inc_ci_kernel - increment ci for kernel's queue
> > + *
> > + * @hdev: pointer to hl_device structure
> > + * @hw_queue_id: which queue to increment its ci
> > + */
> > +void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id)
> > +{
> > +     struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
> > +
> > +     q->ci = hl_queue_inc_ptr(q->ci);
> > +}
> > +
> > +static int ext_and_cpu_hw_queue_init(struct hl_device *hdev,
> > +                                     struct hl_hw_queue *q)
> > +{
> > +     void *p;
> > +     int rc;
> > +
> > +     p = hdev->asic_funcs->dma_alloc_coherent(hdev,
> > +                             HL_QUEUE_SIZE_IN_BYTES,
> > +                             &q->bus_address, GFP_KERNEL | __GFP_ZERO);
> > +     if (!p)
> > +             return -ENOMEM;
> > +
> > +     q->kernel_address = (u64) p;
> > +
> > +     q->shadow_queue = kmalloc_array(HL_QUEUE_LENGTH,
> > +                                     sizeof(*q->shadow_queue),
> > +                                     GFP_KERNEL);
> > +     if (!q->shadow_queue) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to allocate shadow queue for H/W queue %d\n",
> > +                     q->hw_queue_id);
> > +             rc = -ENOMEM;
> > +             goto free_queue;
> > +     }
> > +
> > +     /* Make sure read/write pointers are initialized to start of queue */
> > +     q->ci = 0;
> > +     q->pi = 0;
> > +
> > +     return 0;
> > +
> > +free_queue:
> > +     hdev->asic_funcs->dma_free_coherent(hdev, HL_QUEUE_SIZE_IN_BYTES,
> > +                     (void *) q->kernel_address, q->bus_address);
> > +
> > +     return rc;
> > +}
> > +
> > +static int int_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
> > +{
> > +     void *p;
> > +
> > +     p = hdev->asic_funcs->get_int_queue_base(hdev, q->hw_queue_id,
> > +                                     &q->bus_address, &q->int_queue_len);
> > +     if (!p) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get base address for internal queue %d\n",
> > +                     q->hw_queue_id);
> > +             return -EFAULT;
> > +     }
> > +
> > +     q->kernel_address = (u64) p;
> > +     q->pi = 0;
> > +     q->ci = 0;
> > +
> > +     return 0;
> > +}
> > +
> > +static int cpu_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
> > +{
> > +     return ext_and_cpu_hw_queue_init(hdev, q);
> > +}
> > +
> > +static int ext_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
> > +{
> > +     return ext_and_cpu_hw_queue_init(hdev, q);
> > +}
> > +
> > +/**
> > + * hw_queue_init - main initialization function for H/W queue object
> > + *
> > + * @hdev: pointer to hl_device device structure
> > + * @q: pointer to hl_hw_queue queue structure
> > + * @hw_queue_id: The id of the H/W queue
> > + *
> > + * Allocate dma-able memory for the queue and initialize fields
> > + * Returns 0 on success
> > + */
> > +static int hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q,
> > +                     u32 hw_queue_id)
> > +{
> > +     int rc;
> > +
> > +     BUILD_BUG_ON(HL_QUEUE_SIZE_IN_BYTES > HL_PAGE_SIZE);
> > +
> > +     q->hw_queue_id = hw_queue_id;
> > +
> > +     switch (q->queue_type) {
> > +     case QUEUE_TYPE_EXT:
> > +             rc = ext_hw_queue_init(hdev, q);
> > +             break;
> > +
> > +     case QUEUE_TYPE_INT:
> > +             rc = int_hw_queue_init(hdev, q);
> > +             break;
> > +
> > +     case QUEUE_TYPE_CPU:
> > +             rc = cpu_hw_queue_init(hdev, q);
> > +             break;
> > +
> > +     case QUEUE_TYPE_NA:
> > +             q->valid = 0;
> > +             return 0;
> > +
> > +     default:
> > +             dev_crit(hdev->dev, "wrong queue type %d during init\n",
> > +                     q->queue_type);
> > +             rc = -EINVAL;
> > +             break;
> > +     }
> > +
> > +     if (rc)
> > +             return rc;
> > +
> > +     q->valid = 1;
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * hw_queue_fini - destroy queue
> > + *
> > + * @hdev: pointer to hl_device device structure
> > + * @q: pointer to hl_hw_queue queue structure
> > + *
> > + * Free the queue memory
> > + */
> > +static void hw_queue_fini(struct hl_device *hdev, struct hl_hw_queue *q)
> > +{
> > +     if (!q->valid)
> > +             return;
> > +
> > +     /*
> > +      * If we arrived here, there are no jobs waiting on this queue
> > +      * so we can safely remove it.
> > +      * This is because this function can only be called when:
> > +      * 1. Either a context is deleted, which only can occur if all its
> > +      *    jobs were finished
> > +      * 2. A context wasn't able to be created due to failure or timeout,
> > +      *    which means there are no jobs on the queue yet
> > +      *
> > +      * The only exception are the queues of the kernel context, but
> > +      * if they are being destroyed, it means that the entire module is
> > +      * being removed. If the module is removed, it means there is no open
> > +      * user context. It also means that if a job was submitted by
> > +      * the kernel driver (e.g. context creation), the job itself was
> > +      * released by the kernel driver when a timeout occurred on its
> > +      * completion. Thus, we don't need to release it again.
> > +      */
> > +
> > +     if (q->queue_type == QUEUE_TYPE_INT)
> > +             return;
> > +
> > +     kfree(q->shadow_queue);
> > +
> > +     hdev->asic_funcs->dma_free_coherent(hdev,
> > +                     HL_QUEUE_SIZE_IN_BYTES,
> > +                     (void *) q->kernel_address, q->bus_address);
> > +}
> > +
> > +int hl_hw_queues_create(struct hl_device *hdev)
> > +{
> > +     struct asic_fixed_properties *asic = &hdev->asic_prop;
> > +     struct hl_hw_queue *q;
> > +     int i, rc, q_ready_cnt;
> > +
> > +     hdev->kernel_queues = kcalloc(HL_MAX_QUEUES,
> > +                             sizeof(*hdev->kernel_queues), GFP_KERNEL);
> > +
> > +     if (!hdev->kernel_queues) {
> > +             dev_err(hdev->dev, "Not enough memory for H/W queues\n");
> > +             return -ENOMEM;
> > +     }
> > +
> > +     /* Initialize the H/W queues */
> > +     for (i = 0, q_ready_cnt = 0, q = hdev->kernel_queues;
> > +                     i < HL_MAX_QUEUES ; i++, q_ready_cnt++, q++) {
> > +
> > +             q->queue_type = asic->hw_queues_props[i].type;
> > +             rc = hw_queue_init(hdev, q, i);
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "failed to initialize queue %d\n", i);
> > +                     goto release_queues;
> > +             }
> > +     }
> > +
> > +     return 0;
> > +
> > +release_queues:
> > +     for (i = 0, q = hdev->kernel_queues ; i < q_ready_cnt ; i++, q++)
> > +             hw_queue_fini(hdev, q);
> > +
> > +     kfree(hdev->kernel_queues);
> > +
> > +     return rc;
> > +}
> > +
> > +void hl_hw_queues_destroy(struct hl_device *hdev)
> > +{
> > +     struct hl_hw_queue *q;
> > +     int i;
> > +
> > +     for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++)
> > +             hw_queue_fini(hdev, q);
> > +
> > +     kfree(hdev->kernel_queues);
> > +}
> > +
> > +void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset)
> > +{
> > +     struct hl_hw_queue *q;
> > +     int i;
> > +
> > +     for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++) {
> > +             if ((!q->valid) ||
> > +                     ((!hard_reset) && (q->queue_type == QUEUE_TYPE_CPU)))
> > +                     continue;
> > +             q->pi = q->ci = 0;
> > +     }
> > +}
> > diff --git a/drivers/misc/habanalabs/include/goya/goya_packets.h b/drivers/misc/habanalabs/include/goya/goya_packets.h
> > new file mode 100644
> > index 000000000000..669a3f37ccb7
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/goya/goya_packets.h
> > @@ -0,0 +1,234 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2017-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + * Authors:
> > + *
> > + * Oded Gabbay <oded.gabbay@gmail.com>
> > + * Guy Eilat <geilat@habana.ai>
> > + *
> > + */
> > +
> > +#ifndef GOYA_PACKETS_H
> > +#define GOYA_PACKETS_H
> > +
> > +#include <linux/types.h>
> > +
> > +#define PACKET_HEADER_PACKET_ID_SHIFT                56
> > +#define PACKET_HEADER_PACKET_ID_MASK         0x1F00000000000000ull
> > +
> > +enum packet_id {
> > +     PACKET_WREG_32 = 0x1,
> > +     PACKET_WREG_BULK = 0x2,
> > +     PACKET_MSG_LONG = 0x3,
> > +     PACKET_MSG_SHORT = 0x4,
> > +     PACKET_CP_DMA = 0x5,
> > +     PACKET_MSG_PROT = 0x7,
> > +     PACKET_FENCE = 0x8,
> > +     PACKET_LIN_DMA = 0x9,
> > +     PACKET_NOP = 0xA,
> > +     PACKET_STOP = 0xB,
> > +     MAX_PACKET_ID = (PACKET_HEADER_PACKET_ID_MASK >>
> > +                             PACKET_HEADER_PACKET_ID_SHIFT) + 1
> > +};
> > +
> > +enum goya_dma_direction {
> > +     DMA_HOST_TO_DRAM,
> > +     DMA_HOST_TO_SRAM,
> > +     DMA_DRAM_TO_SRAM,
> > +     DMA_SRAM_TO_DRAM,
> > +     DMA_SRAM_TO_HOST,
> > +     DMA_DRAM_TO_HOST,
> > +     DMA_DRAM_TO_DRAM,
> > +     DMA_SRAM_TO_SRAM,
> > +     DMA_ENUM_MAX
> > +};
> > +
> > +struct packet_nop {
> > +     __u32 reserved;
> > +     union {
> > +             struct {
> > +                     __u32:24;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1;
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +};
> > +
> > +struct packet_stop {
> > +     __u32 reserved;
> > +     union {
> > +             struct {
> > +                     __u32:24;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1; /* must be 0 */
> > +                     __u32 msg_barrier :1; /* must be 0 */
> > +             };
> > +             __u32 ctl;
> > +     };
> > +};
> > +
> > +struct packet_wreg32 {
> > +     __u32 value;
> > +     union {
> > +             struct {
> > +                     __u32 reg_offset :16;
> > +                     __u32:7;
> > +                     __u32 local :1; /* 0: write to TCL regs,
> > +                                      * 1: write to CMDQ regs
> > +                                      */
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1; /* must be 1 */
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +};
> > +
> > +struct packet_wreg_bulk {
> > +     __u32 size64 :16;
> > +     __u32:16;
> > +     __u32 reg_offset :16;
> > +     __u32:8;
> > +     __u32 opcode :5;
> > +     __u32 eng_barrier :1;
> > +     __u32 reg_barrier :1; /* must be 1 */
> > +     __u32 msg_barrier :1;
> > +     __u64 values[0]; /* data starts here */
> > +};
> > +
> > +struct packet_msg_long {
> > +     __u32 value;
> > +     union {
> > +             struct {
> > +                     __u32:16;
> > +                     __u32 weakly_ordered :1;
> > +                     __u32 no_snoop :1;
> > +                     __u32:2;
> > +                     __u32 op :2; /* 0: write <value>. 1: write timestamp. */
> > +                     __u32:2;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1;
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +     __u64 addr;
> > +};
> > +
> > +struct packet_msg_short {
> > +     union {
> > +             struct {
> > +                     __u32 sync_id :10;
> > +                     __u32:5;
> > +                     __u32 mode : 1;
> > +                     __u32 sync_value :16;
> > +             } mon_arm_register;
> > +             struct {
> > +                     __u32 sync_value :16;
> > +                     __u32:15;
> > +                     __u32 mode :1;
> > +             } so_upd;
> > +             __u32 value;
> > +     };
> > +     union {
> > +             struct {
> > +                     __u32 msg_addr_offset :16;
> > +                     __u32 weakly_ordered :1;
> > +                     __u32 no_snoop :1;
> > +                     __u32:2;
> > +                     __u32 op :2;
> > +                     __u32 base :2;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1;
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +};
> > +
> > +struct packet_msg_prot {
> > +     __u32 value;
> > +     union {
> > +             struct {
> > +                     __u32:16;
> > +                     __u32 weakly_ordered :1;
> > +                     __u32 no_snoop :1;
> > +                     __u32:2;
> > +                     __u32 op :2; /* 0: write <value>. 1: write timestamp. */
> > +                     __u32:2;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1;
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +     __u64 addr;
> > +};
> > +
> > +struct packet_fence {
> > +     __u32 dec_val :4;
> > +     __u32:12;
> > +     __u32 gate_val :8;
> > +     __u32:6;
> > +     __u32 id :2;
> > +     __u32:24;
> > +     __u32 opcode :5;
> > +     __u32 eng_barrier :1;
> > +     __u32 reg_barrier :1;
> > +     __u32 msg_barrier :1;
> > +};
> > +
> > +struct packet_lin_dma {
> > +     __u32 tsize;
> > +     union {
> > +             struct {
> > +                     __u32 weakly_ordered :1; /* H/W bug, must be 1 */
> > +                     __u32 rdcomp :1;
> > +                     __u32 wrcomp :1;
> > +                     __u32 no_snoop :1;
> > +                     __u32 src_disable :1;
> > +                     __u32 dst_disable :1;
> > +                     __u32 memset_mode :1;
> > +                     __u32 tensor_dma :1; /* N/A, must be 0 */
> > +                     __u32 cntrl :12;
> > +                     __u32 dma_dir :3; /* S/W only, no effect on HW */
> > +                     __u32:1;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1; /* must be 1 */
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +     __u64 src_addr;
> > +     __u64 dst_addr;
> > +};
> > +
> > +struct packet_cp_dma {
> > +     __u32 tsize;
> > +     union {
> > +             struct {
> > +                     __u32 weakly_ordered :1;
> > +                     __u32 no_snoop :1;
> > +                     __u32:22;
> > +                     __u32 opcode :5;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1; /* must be 1 */
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +     __u64 src_addr;
> > +};
> > +
> > +#endif /* GOYA_PACKETS_H */
> > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > index 9dbb7077eabd..62df9981f68a 100644
> > --- a/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > @@ -97,6 +97,278 @@ enum pq_init_status {
> >       PQ_INIT_STATUS_READY_FOR_HOST
> >  };
> >
> > +/*
> > + * ArmCP Primary Queue Packets
> > + *
> > + * During normal operation, KMD needs to send various messages to ArmCP,
> > + * usually either to SET some value into a H/W periphery or to GET the current
> > + * value of some H/W periphery. For example, SET the frequency of MME/TPC and
> > + * GET the value of the thermal sensor.
> > + *
> > + * These messages can be initiated either by the User application or by KMD
> > + * itself, e.g. power management code. In either case, the communication from
> > + * KMD to ArmCP will *always* be in synchronous mode, meaning that KMD will
> > + * send a single message and poll until the message was acknowledged and the
> > + * results are ready (if results are needed).
> > + *
> > + * This means that only a single message can be sent at a time and KMD must
> > + * wait for its result before sending the next message. Having said that,
> > + * because these are control messages which are sent at a relatively low
> > + * frequency, this limitation seems acceptable. It's important to note that
> > + * in case of multiple devices, messages to different devices *can* be sent
> > + * at the same time.
> > + *
> > + * The message, inputs/outputs (if relevant) and fence object will be located
> > + * on the device DDR at an address that will be determined by KMD. During
> > + * device initialization phase, KMD will pass that address to ArmCP. Most of
> > + * the message types will contain inputs/outputs inside the message itself.
> > + * The common part of each message will contain the opcode of the message (its
> > + * type) and a field representing a fence object.
> > + *
> > + * When KMD wishes to send a message to ArmCP, it will write the message
> > + * contents to the device DDR, clear the fence object and then write the
> > + * value 484 to the mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR register to issue
> > + * the 484 interrupt-id to the ARM core.
> > + *
> > + * Upon receiving the 484 interrupt-id, ArmCP will read the message from the
> > + * DDR. In case the message is a SET operation, ArmCP will first perform the
> > + * operation and then write to the fence object on the device DDR. In case the
> > + * message is a GET operation, ArmCP will first fill the results section on the
> > + * device DDR and then write to the fence object. If an error occurred, ArmCP
> > + * will fill the rc field with the right error code.
> > + *
> > + * In the meantime, KMD will poll on the fence object. Once KMD sees that the
> > + * fence object is signaled, it will read the results from the device DDR
> > + * (if relevant) and resume the code execution in KMD.
> > + *
> > + * To use QMAN packets, the opcode must be the QMAN opcode, shifted by 8
> > + * so the value being put by the KMD matches the value read by ArmCP
> > + *
> > + * Non-QMAN packets should be limited to values 1 through (2^8 - 1)
> > + *
> > + * Detailed description:
> > + *
> > + * ARMCP_PACKET_DISABLE_PCI_ACCESS -
> > + *       After receiving this packet the embedded CPU must NOT issue PCI
> > + *       transactions (read/write) towards the Host CPU. This also includes
> > + *       sending MSI-X interrupts.
> > + *       This packet is usually sent before the device is moved to D3Hot state.
> > + *
> > + * ARMCP_PACKET_ENABLE_PCI_ACCESS -
> > + *       After receiving this packet the embedded CPU is allowed to issue PCI
> > + *       transactions towards the Host CPU, including sending MSI-X interrupts.
> > + *       This packet is usually sent after the device is moved to D0 state.
> > + *
> > + * ARMCP_PACKET_TEMPERATURE_GET -
> > + *       Fetch the current temperature / Max / Max Hyst / Critical /
> > + *       Critical Hyst of a specified thermal sensor. The packet's
> > + *       arguments specify the desired sensor and the field to get.
> > + *
> > + * ARMCP_PACKET_VOLTAGE_GET -
> > + *       Fetch the voltage / Max / Min of a specified sensor. The packet's
> > + *       arguments specify the sensor and type.
> > + *
> > + * ARMCP_PACKET_CURRENT_GET -
> > + *       Fetch the current / Max / Min of a specified sensor. The packet's
> > + *       arguments specify the sensor and type.
> > + *
> > + * ARMCP_PACKET_FAN_SPEED_GET -
> > + *       Fetch the speed / Max / Min of a specified fan. The packet's
> > + *       arguments specify the sensor and type.
> > + *
> > + * ARMCP_PACKET_PWM_GET -
> > + *       Fetch the pwm value / mode of a specified pwm. The packet's
> > + *       arguments specify the sensor and type.
> > + *
> > + * ARMCP_PACKET_PWM_SET -
> > + *       Set the pwm value / mode of a specified pwm. The packet's
> > + *       arguments specify the sensor, type and value.
> > + *
> > + * ARMCP_PACKET_FREQUENCY_SET -
> > + *       Set the frequency of a specified PLL. The packet's arguments specify
> > + *       the PLL and the desired frequency. The actual frequency in the device
> > + *       might differ from the requested frequency.
> > + *
> > + * ARMCP_PACKET_FREQUENCY_GET -
> > + *       Fetch the frequency of a specified PLL. The packet's arguments specify
> > + *       the PLL.
> > + *
> > + * ARMCP_PACKET_LED_SET -
> > + *       Set the state of a specified led. The packet's arguments
> > + *       specify the led and the desired state.
> > + *
> > + * ARMCP_PACKET_I2C_WR -
> > + *       Write 32-bit value to I2C device. The packet's arguments specify the
> > + *       I2C bus, address and value.
> > + *
> > + * ARMCP_PACKET_I2C_RD -
> > + *       Read 32-bit value from I2C device. The packet's arguments specify the
> > + *       I2C bus and address.
> > + *
> > + * ARMCP_PACKET_INFO_GET -
> > + *       Fetch information from the device as specified in the packet's
> > + *       structure. KMD passes the max size it allows the ArmCP to write to
> > + *       the structure, to prevent data corruption in case of mismatched
> > + *       KMD/FW versions.
> > + *
> > + * ARMCP_PACKET_FLASH_PROGRAM_REMOVED - this packet was removed
> > + *
> > + * ARMCP_PACKET_UNMASK_RAZWI_IRQ -
> > + *       Unmask the given IRQ. The IRQ number is specified in the value field.
> > + *       The packet is sent after receiving an interrupt and printing its
> > + *       relevant information.
> > + *
> > + * ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY -
> > + *       Unmask the given IRQs. The IRQ numbers are specified in an array right
> > + *       after the armcp_packet structure, where its first element is the array
> > + *       length. The packet is sent after a soft reset was done in order to
> > + *       handle any interrupts that were sent during the reset process.
> > + *
> > + * ARMCP_PACKET_TEST -
> > + *       Test packet for ArmCP connectivity. The CPU will put the fence value
> > + *       in the result field.
> > + *
> > + * ARMCP_PACKET_FREQUENCY_CURR_GET -
> > + *       Fetch the current frequency of a specified PLL. The packet's arguments
> > + *       specify the PLL.
> > + *
> > + * ARMCP_PACKET_MAX_POWER_GET -
> > + *       Fetch the maximal power of the device.
> > + *
> > + * ARMCP_PACKET_MAX_POWER_SET -
> > + *       Set the maximal power of the device. The packet's arguments specify
> > + *       the power.
> > + *
> > + * ARMCP_PACKET_EEPROM_DATA_GET -
> > + *       Get EEPROM data from the ArmCP kernel. The buffer is specified in the
> > + *       addr field. The CPU will put the returned data size in the result
> > + *       field. In addition, KMD passes the max size it allows the ArmCP to
> > + *       write to the structure, to prevent data corruption in case of
> > + *       mismatched KMD/FW versions.
> > + *
> > + */
> > +
> > +enum armcp_packet_id {
> > +     ARMCP_PACKET_DISABLE_PCI_ACCESS = 1,    /* internal */
> > +     ARMCP_PACKET_ENABLE_PCI_ACCESS,         /* internal */
> > +     ARMCP_PACKET_TEMPERATURE_GET,           /* sysfs */
> > +     ARMCP_PACKET_VOLTAGE_GET,               /* sysfs */
> > +     ARMCP_PACKET_CURRENT_GET,               /* sysfs */
> > +     ARMCP_PACKET_FAN_SPEED_GET,             /* sysfs */
> > +     ARMCP_PACKET_PWM_GET,                   /* sysfs */
> > +     ARMCP_PACKET_PWM_SET,                   /* sysfs */
> > +     ARMCP_PACKET_FREQUENCY_SET,             /* sysfs */
> > +     ARMCP_PACKET_FREQUENCY_GET,             /* sysfs */
> > +     ARMCP_PACKET_LED_SET,                   /* debugfs */
> > +     ARMCP_PACKET_I2C_WR,                    /* debugfs */
> > +     ARMCP_PACKET_I2C_RD,                    /* debugfs */
> > +     ARMCP_PACKET_INFO_GET,                  /* IOCTL */
> > +     ARMCP_PACKET_FLASH_PROGRAM_REMOVED,
> > +     ARMCP_PACKET_UNMASK_RAZWI_IRQ,          /* internal */
> > +     ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY,    /* internal */
> > +     ARMCP_PACKET_TEST,                      /* internal */
> > +     ARMCP_PACKET_FREQUENCY_CURR_GET,        /* sysfs */
> > +     ARMCP_PACKET_MAX_POWER_GET,             /* sysfs */
> > +     ARMCP_PACKET_MAX_POWER_SET,             /* sysfs */
> > +     ARMCP_PACKET_EEPROM_DATA_GET,           /* sysfs */
> > +};
> > +
> > +#define ARMCP_PACKET_FENCE_VAL       0xFE8CE7A5
> > +
> > +struct armcp_packet {
> > +     union {
> > +             __u64 value;    /* For SET packets */
> > +             __u64 result;   /* For GET packets */
> > +             __u64 addr;     /* For PQ */
> > +     };
> > +
> > +     union {
> > +             struct {
> > +                     __u32:12;
> > +                     __u32 rc :4;
> > +                     __u32 opcode :13;
> > +                     __u32 eng_barrier :1;
> > +                     __u32 reg_barrier :1;
> > +                     __u32 msg_barrier :1;
> > +             };
> > +             __u32 ctl;
> > +     };
> > +
> > +     __u32 fence;            /* Signal to KMD that message is completed */
> > +
> > +     union {
> > +             struct {/* For temperature/current/voltage/fan/pwm get/set */
> > +                     __u16 sensor_index;
> > +                     __u16 type;
> > +             };
> > +
> > +             struct {        /* For I2C read/write */
> > +                     __u8 i2c_bus;
> > +                     __u8 i2c_addr;
> > +                     __u8 i2c_reg;
> > +                     __u8 pad; /* unused */
> > +             };
> > +
> > +             /* For frequency get/set */
> > +             __u32 pll_index;
> > +
> > +             /* For led set */
> > +             __u32 led_index;
> > +
> > +             /* For get Armcp info/EEPROM data */
> > +             __u32 data_max_size;
> > +     };
> > +};
> > +
> > +struct armcp_unmask_irq_arr_packet {
> > +     struct armcp_packet armcp_pkt;
> > +     __u32 length;
> > +     __u32 irqs[0];
> > +};
> > +
> > +enum armcp_packet_rc {
> > +     armcp_packet_success,
> > +     armcp_packet_invalid,
> > +     armcp_packet_fault
> > +};
> > +
> > +enum armcp_temp_type {
> > +     armcp_temp_input,
> > +     armcp_temp_max = 6,
> > +     armcp_temp_max_hyst,
> > +     armcp_temp_crit,
> > +     armcp_temp_crit_hyst
> > +};
> > +
> > +enum armcp_in_attributes {
> > +     armcp_in_input,
> > +     armcp_in_min,
> > +     armcp_in_max
> > +};
> > +
> > +enum armcp_curr_attributes {
> > +     armcp_curr_input,
> > +     armcp_curr_min,
> > +     armcp_curr_max
> > +};
> > +
> > +enum armcp_fan_attributes {
> > +     armcp_fan_input,
> > +     armcp_fan_min = 2,
> > +     armcp_fan_max
> > +};
> > +
> > +enum armcp_pwm_attributes {
> > +     armcp_pwm_input,
> > +     armcp_pwm_enable
> > +};
> > +
> > +/* Event Queue Packets */
> > +
> > +struct eq_generic_event {
> > +     __u64 data[7];
> > +};
> > +
> >  /*
> >   * ArmCP info
> >   */
> > diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
> > new file mode 100644
> > index 000000000000..97b0de7ea5c2
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/irq.c
> > @@ -0,0 +1,150 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#include <linux/dma-mapping.h>
> > +
> > +
> > +/**
> > + * hl_cq_inc_ptr - increment ci or pi of cq
> > + *
> > + * @ptr: the current ci or pi value of the completion queue
> > + *
> > + * Increment ptr by 1. If it reaches the number of completion queue
> > + * entries, set it to 0
> > + */
> > +inline u32 hl_cq_inc_ptr(u32 ptr)
> > +{
> > +     ptr++;
> > +     if (unlikely(ptr == HL_CQ_LENGTH))
> > +             ptr = 0;
> > +     return ptr;
> > +}
> > +
> > +/**
> > + * hl_irq_handler_cq - irq handler for completion queue
> > + *
> > + * @irq: irq number
> > + * @arg: pointer to completion queue structure
> > + *
> > + */
> > +irqreturn_t hl_irq_handler_cq(int irq, void *arg)
> > +{
> > +     struct hl_cq *cq = arg;
> > +     struct hl_device *hdev = cq->hdev;
> > +     struct hl_hw_queue *queue;
> > +     struct hl_cs_job *job;
> > +     bool shadow_index_valid;
> > +     u16 shadow_index;
> > +     u32 *cq_entry;
> > +     u32 *cq_base;
> > +
> > +     if (hdev->disabled) {
> > +             dev_dbg(hdev->dev,
> > +                     "Device disabled but received IRQ %d for CQ %d\n",
> > +                     irq, cq->hw_queue_id);
> > +             return IRQ_HANDLED;
> > +     }
> > +
> > +     cq_base = (u32 *) cq->kernel_address;
> > +
> > +     while (1) {
> > +             bool entry_ready = ((cq_base[cq->ci] & CQ_ENTRY_READY_MASK)
> > +                                             >> CQ_ENTRY_READY_SHIFT);
> > +
> > +             if (!entry_ready)
> > +                     break;
> > +
> > +             cq_entry = (u32 *) &cq_base[cq->ci];
> > +
> > +             /*
> > +              * Make sure we read CQ entry contents after we've
> > +              * checked the ownership bit.
> > +              */
> > +             dma_rmb();
> > +
> > +             shadow_index_valid =
> > +                     ((*cq_entry & CQ_ENTRY_SHADOW_INDEX_VALID_MASK)
> > +                                     >> CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT);
> > +
> > +             shadow_index = (u16)
> > +                     ((*cq_entry & CQ_ENTRY_SHADOW_INDEX_MASK)
> > +                                     >> CQ_ENTRY_SHADOW_INDEX_SHIFT);
> > +
> > +             queue = &hdev->kernel_queues[cq->hw_queue_id];
> > +
> > +             if ((shadow_index_valid) && (!hdev->disabled)) {
> > +                     job = queue->shadow_queue[hl_pi_2_offset(shadow_index)];
> > +                     queue_work(hdev->cq_wq, &job->finish_work);
> > +             }
> > +
> > +             /*
> > +              * Update ci of the context's queue. There is no
> > +              * need to protect it with spinlock because this update is
> > +              * done only inside IRQ and there is a different IRQ per
> > +              * queue
> > +              */
> > +             queue->ci = hl_queue_inc_ptr(queue->ci);
> > +
> > +             /* Clear CQ entry ready bit */
> > +             cq_base[cq->ci] &= ~CQ_ENTRY_READY_MASK;
> > +
> > +             cq->ci = hl_cq_inc_ptr(cq->ci);
> > +
> > +             /* Increment free slots */
> > +             atomic_inc(&cq->free_slots_cnt);
> > +     }
> > +
> > +     return IRQ_HANDLED;
> > +}
> > +
> > +/**
> > + * hl_cq_init - main initialization function for a cq object
> > + *
> > + * @hdev: pointer to device structure
> > + * @q: pointer to cq structure
> > + * @hw_queue_id: The H/W queue ID this completion queue belongs to
> > + *
> > + * Allocate dma-able memory for the completion queue and initialize fields
> > + * Returns 0 on success
> > + */
> > +int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id)
> > +{
> > +     void *p;
> > +
> > +     BUILD_BUG_ON(HL_CQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
> > +
> > +     p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
> > +                             &q->bus_address, GFP_KERNEL | __GFP_ZERO);
> > +     if (!p)
> > +             return -ENOMEM;
> > +
> > +     q->hdev = hdev;
> > +     q->kernel_address = (u64) p;
> > +     q->hw_queue_id = hw_queue_id;
> > +     q->ci = 0;
> > +     q->pi = 0;
> > +
> > +     atomic_set(&q->free_slots_cnt, HL_CQ_LENGTH);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * hl_cq_fini - destroy completion queue
> > + *
> > + * @hdev: pointer to device structure
> > + * @q: pointer to cq structure
> > + *
> > + * Free the completion queue memory
> > + */
> > +void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
> > +{
> > +     hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
> > +                     (void *) q->kernel_address, q->bus_address);
> > +}
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>
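
For readers following the completion-queue handler quoted above, here is a
minimal user-space sketch of the same consume loop: test the ready bit, read
the payload, clear the ready bit, advance the consumer index. The entry
layout, the DEMO_* names and the power-of-two ring length are assumptions
made for illustration and are not taken from the driver. dma_rmb() exists
only in the kernel, so it appears here as a comment.

```c
#include <stdint.h>

#define DEMO_CQ_LENGTH   8u            /* hypothetical, must be a power of two */
#define DEMO_READY_MASK  0x80000000u   /* hypothetical ownership (ready) bit */
#define DEMO_INDEX_MASK  0x0000FFFFu   /* hypothetical payload field */

struct demo_cq {
	uint32_t entries[DEMO_CQ_LENGTH];
	uint32_t ci;                   /* consumer index */
};

/*
 * Drain every ready entry, copying at most max payloads into out.
 * Returns the number of entries consumed.
 */
static unsigned int demo_cq_drain(struct demo_cq *cq, uint16_t *out,
				  unsigned int max)
{
	unsigned int n = 0;

	while (n < max && (cq->entries[cq->ci] & DEMO_READY_MASK)) {
		/*
		 * In the kernel handler a dma_rmb() sits here, so the entry
		 * contents are read only after the ready bit was observed.
		 */
		out[n++] = (uint16_t)(cq->entries[cq->ci] & DEMO_INDEX_MASK);

		/* Hand the slot back to the producer */
		cq->entries[cq->ci] &= ~DEMO_READY_MASK;

		/* Advance and wrap the consumer index */
		cq->ci = (cq->ci + 1) & (DEMO_CQ_LENGTH - 1);
	}

	return n;
}
```

The loop stops at the first entry whose ready bit is clear, so a second call
right after a full drain consumes nothing.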

* Re: [PATCH 08/15] habanalabs: add event queue and interrupts
  2019-01-25  7:51   ` Mike Rapoport
@ 2019-01-28 11:14     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28 11:14 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org

On Fri, Jan 25, 2019 at 9:51 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:50AM +0200, Oded Gabbay wrote:
> > This patch adds support for receiving events from Goya's control CPU and
> > for receiving MSI-X interrupts from Goya's DMA engines and CPU.
> >
> > Goya's PCI controller supports up to 8 MSI-X interrupts, of which only 6
> > are currently used. The first 5 interrupts are dedicated to Goya's DMA
> > engine queues. The 6th interrupt is dedicated to Goya's control CPU.
> >
> > The DMA queue will signal its MSI-X entry upon each completion of a command
> > buffer that was placed on its primary queue. The driver will then mark that
> > CB as completed and free the related resources. It will also update the
> > command submission object to which that CB belongs.
> >
> > There is a dedicated event queue (EQ) between the driver and Goya's control
> > CPU. The EQ is located in host memory. The control CPU writes a new entry
> > to the EQ for various reasons, such as an ECC error, an MMU page fault or
> > high temperature. After writing the new entry to the EQ, the control CPU
> > will trigger its dedicated MSI-X entry to signal the driver that there is a
> > new entry in the EQ. The driver will then read the entry and act accordingly.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/device.c            |  35 +-
> >  drivers/misc/habanalabs/goya/goya.c         | 522 +++++++++++++++++++-
> >  drivers/misc/habanalabs/goya/goyaP.h        |   1 +
> >  drivers/misc/habanalabs/habanalabs.h        |  37 ++
> >  drivers/misc/habanalabs/include/goya/goya.h |   1 -
> >  drivers/misc/habanalabs/irq.c               | 144 ++++++
> >  6 files changed, 729 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index 98220628a467..9199e070e79e 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -173,9 +173,17 @@ static int device_early_init(struct hl_device *hdev)
> >       hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
> >       if (hdev->cq_wq == NULL) {
> >               dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
> > +             rc = -ENOMEM;
>
> Apparently, it should have been in one of the earlier patches
>
Correct, fixed
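
For context, the point of the rc assignment is that every goto into the
unwind labels must set the error code first, otherwise the function can
return a stale or zero value on failure. A minimal user-space sketch of the
pattern follows; the demo_* names are hypothetical and malloc() merely stands
in for alloc_workqueue().

```c
#include <errno.h>
#include <stdlib.h>

struct demo_dev {
	void *cq_wq;
	void *eq_wq;
};

/* fail_eq forces the second allocation to fail, for demonstration only */
static int demo_early_init(struct demo_dev *dev, int fail_eq)
{
	int rc;

	dev->cq_wq = malloc(16);
	if (!dev->cq_wq) {
		rc = -ENOMEM;           /* set rc before every goto */
		goto out;
	}

	dev->eq_wq = fail_eq ? NULL : malloc(16);
	if (!dev->eq_wq) {
		rc = -ENOMEM;
		goto free_cq_wq;
	}

	return 0;

free_cq_wq:
	/* Undo in reverse order of acquisition */
	free(dev->cq_wq);
	dev->cq_wq = NULL;
out:
	return rc;
}
```
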
> >               goto asid_fini;
> >       }
> >
> > +     hdev->eq_wq = alloc_workqueue("hl-events", WQ_UNBOUND, 0);
> > +     if (hdev->eq_wq == NULL) {
> > +             dev_err(hdev->dev, "Failed to allocate EQ workqueue\n");
> > +             rc = -ENOMEM;
> > +             goto free_cq_wq;
> > +     }
> > +
> >       hl_cb_mgr_init(&hdev->kernel_cb_mgr);
> >
> >       mutex_init(&hdev->device_open);
> > @@ -184,6 +192,8 @@ static int device_early_init(struct hl_device *hdev)
> >
> >       return 0;
> >
> > +free_cq_wq:
> > +     destroy_workqueue(hdev->cq_wq);
> >  asid_fini:
> >       hl_asid_fini(hdev);
> >  early_fini:
> > @@ -205,6 +215,7 @@ static void device_early_fini(struct hl_device *hdev)
> >
> >       hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
> >
> > +     destroy_workqueue(hdev->eq_wq);
> >       destroy_workqueue(hdev->cq_wq);
> >
> >       hl_asid_fini(hdev);
> > @@ -343,11 +354,22 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >               }
> >       }
> >
> > +     /*
> > +      * Initialize the event queue. Must be done before hw_init,
> > +      * because hw_init passes the address of the event queue
> > +      * as an argument to request_irq
> > +      */
> > +     rc = hl_eq_init(hdev, &hdev->event_queue);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize event queue\n");
> > +             goto cq_fini;
> > +     }
> > +
> >       /* Allocate the kernel context */
> >       hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
> >       if (!hdev->kernel_ctx) {
> >               rc = -ENOMEM;
> > -             goto cq_fini;
> > +             goto eq_fini;
> >       }
> >
> >       hdev->user_ctx = NULL;
> > @@ -392,6 +414,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >                       "kernel ctx is still alive on initialization failure\n");
> >  free_ctx:
> >       kfree(hdev->kernel_ctx);
> > +eq_fini:
> > +     hl_eq_fini(hdev, &hdev->event_queue);
> >  cq_fini:
> >       for (i = 0 ; i < cq_ready_cnt ; i++)
> >               hl_cq_fini(hdev, &hdev->completion_queue[i]);
> > @@ -433,6 +457,13 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Mark device as disabled */
> >       hdev->disabled = true;
> >
> > +     /*
> > +      * Halt the engines and disable interrupts so we won't get any more
> > +      * completions from H/W and we won't have any accesses from the
> > +      * H/W to the host machine
> > +      */
> > +     hdev->asic_funcs->halt_engines(hdev, true);
> > +
> >       hl_cb_pool_fini(hdev);
> >
> >       /* Release kernel context */
> > @@ -442,6 +473,8 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Reset the H/W. It will be in idle state after this returns */
> >       hdev->asic_funcs->hw_fini(hdev, true);
> >
> > +     hl_eq_fini(hdev, &hdev->event_queue);
> > +
> >       for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
> >               hl_cq_fini(hdev, &hdev->completion_queue[i]);
> >       kfree(hdev->completion_queue);
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > index 08d5227eaf1d..6c04277ae0fa 100644
> > --- a/drivers/misc/habanalabs/goya/goya.c
> > +++ b/drivers/misc/habanalabs/goya/goya.c
> > @@ -92,9 +92,41 @@
> >
> >  #define GOYA_MAX_INITIATORS          20
> >
> > +#define GOYA_MAX_STRING_LEN          20
> > +
> >  #define GOYA_CB_POOL_CB_CNT          512
> >  #define GOYA_CB_POOL_CB_SIZE         0x20000         /* 128KB */
> >
> > +static const char goya_irq_name[GOYA_MSIX_ENTRIES][GOYA_MAX_STRING_LEN] = {
> > +             "goya cq 0", "goya cq 1", "goya cq 2", "goya cq 3",
> > +             "goya cq 4", "goya cpu eq"
> > +};
> > +
> > +static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
> > +     "MME0",
> > +     "MME1",
> > +     "MME2",
> > +     "MME3",
> > +     "MME4",
> > +     "MME5",
> > +     "TPC0",
> > +     "TPC1",
> > +     "TPC2",
> > +     "TPC3",
> > +     "TPC4",
> > +     "TPC5",
> > +     "TPC6",
> > +     "TPC7",
> > +     "PCI",
> > +     "DMA", /* HBW */
> > +     "DMA", /* LBW */
> > +     "PSOC",
> > +     "CPU",
> > +     "MMU"
> > +};
> > +
> > +#define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
> > +
> >  static void goya_get_fixed_properties(struct hl_device *hdev)
> >  {
> >       struct asic_fixed_properties *prop = &hdev->asic_prop;
> > @@ -139,6 +171,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
> >       prop->va_space_dram_end_address = VA_DDR_SPACE_END;
> >       prop->cfg_size = CFG_SIZE;
> >       prop->max_asid = MAX_ASID;
> > +     prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
> >       prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> >       prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
> >       prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> > @@ -668,15 +701,10 @@ static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
> >       WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
> >       WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
> >
> > -     if (dma_id == 0)
> > -             WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
> > +     if (goya->hw_cap_initialized & HW_CAP_MMU)
> > +             WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_PARTLY_TRUSTED);
> >       else
> > -             if (goya->hw_cap_initialized & HW_CAP_MMU)
> > -                     WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> > -                                     QMAN_DMA_PARTLY_TRUSTED);
> > -             else
> > -                     WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
> > -                                     QMAN_DMA_FULLY_TRUSTED);
> > +             WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
> >
> >       WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
> >       WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
> > @@ -870,6 +898,7 @@ static void goya_resume_external_queues(struct hl_device *hdev)
> >  int goya_init_cpu_queues(struct hl_device *hdev)
> >  {
> >       struct goya_device *goya = hdev->asic_specific;
> > +     struct hl_eq *eq;
> >       dma_addr_t bus_address;
> >       u32 status;
> >       struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
> > @@ -881,17 +910,24 @@ int goya_init_cpu_queues(struct hl_device *hdev)
> >       if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
> >               return 0;
> >
> > +     eq = &hdev->event_queue;
> > +
> >       bus_address = cpu_pq->bus_address +
> >                       hdev->asic_prop.host_phys_base_address;
> >       WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
> >       WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
> >
> > +     bus_address = eq->bus_address + hdev->asic_prop.host_phys_base_address;
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_2, lower_32_bits(bus_address));
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_3, upper_32_bits(bus_address));
> > +
> >       bus_address = hdev->cpu_accessible_dma_address +
> >                       hdev->asic_prop.host_phys_base_address;
> >       WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
> >       WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
> >
> >       WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_4, HL_EQ_SIZE_IN_BYTES);
> >       WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
> >
> >       /* Used for EQ CI */
> > @@ -2781,6 +2817,163 @@ static void goya_resume_internal_queues(struct hl_device *hdev)
> >       WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
> >  }
> >
> > +static void goya_dma_stall(struct hl_device *hdev)
> > +{
> > +     WREG32(mmDMA_QM_0_GLBL_CFG1, 1 << DMA_QM_0_GLBL_CFG1_DMA_STOP_SHIFT);
> > +     WREG32(mmDMA_QM_1_GLBL_CFG1, 1 << DMA_QM_1_GLBL_CFG1_DMA_STOP_SHIFT);
> > +     WREG32(mmDMA_QM_2_GLBL_CFG1, 1 << DMA_QM_2_GLBL_CFG1_DMA_STOP_SHIFT);
> > +     WREG32(mmDMA_QM_3_GLBL_CFG1, 1 << DMA_QM_3_GLBL_CFG1_DMA_STOP_SHIFT);
> > +     WREG32(mmDMA_QM_4_GLBL_CFG1, 1 << DMA_QM_4_GLBL_CFG1_DMA_STOP_SHIFT);
> > +}
> > +
> > +static void goya_tpc_stall(struct hl_device *hdev)
> > +{
> > +     WREG32(mmTPC0_CFG_TPC_STALL, 1 << TPC0_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC1_CFG_TPC_STALL, 1 << TPC1_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC2_CFG_TPC_STALL, 1 << TPC2_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC3_CFG_TPC_STALL, 1 << TPC3_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC4_CFG_TPC_STALL, 1 << TPC4_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC5_CFG_TPC_STALL, 1 << TPC5_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC6_CFG_TPC_STALL, 1 << TPC6_CFG_TPC_STALL_V_SHIFT);
> > +     WREG32(mmTPC7_CFG_TPC_STALL, 1 << TPC7_CFG_TPC_STALL_V_SHIFT);
> > +}
> > +
> > +static void goya_mme_stall(struct hl_device *hdev)
> > +{
> > +     WREG32(mmMME_STALL, 0xFFFFFFFF);
> > +}
> > +
> > +static int goya_enable_msix(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int cq_cnt = hdev->asic_prop.completion_queues_count;
> > +     int rc, i, irq_cnt_init, irq;
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_MSIX)
> > +             return 0;
> > +
> > +     rc = pci_alloc_irq_vectors(hdev->pdev, GOYA_MSIX_ENTRIES,
> > +                             GOYA_MSIX_ENTRIES, PCI_IRQ_MSIX);
> > +     if (rc < 0) {
> > +             dev_err(hdev->dev,
> > +                     "MSI-X: Failed to enable support -- %d/%d\n",
> > +                     GOYA_MSIX_ENTRIES, rc);
> > +             return rc;
> > +     }
> > +
> > +     for (i = 0, irq_cnt_init = 0 ; i < cq_cnt ; i++, irq_cnt_init++) {
> > +             irq = pci_irq_vector(hdev->pdev, i);
> > +             rc = request_irq(irq, hl_irq_handler_cq, 0, goya_irq_name[i],
> > +                             &hdev->completion_queue[i]);
> > +             if (rc) {
> > +                     dev_err(hdev->dev, "Failed to request IRQ %d\n", irq);
> > +                     goto free_irqs;
> > +             }
> > +     }
> > +
> > +     irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
> > +
> > +     rc = request_irq(irq, hl_irq_handler_eq, 0,
> > +                     goya_irq_name[EVENT_QUEUE_MSIX_IDX],
> > +                     &hdev->event_queue);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to request IRQ %d\n", irq);
> > +             goto free_irqs;
> > +     }
> > +
> > +     goya->hw_cap_initialized |= HW_CAP_MSIX;
> > +     return 0;
> > +
> > +free_irqs:
> > +     for (i = 0 ; i < irq_cnt_init ; i++)
> > +             free_irq(pci_irq_vector(hdev->pdev, i),
> > +                     &hdev->completion_queue[i]);
> > +
> > +     pci_free_irq_vectors(hdev->pdev);
> > +     return rc;
> > +}
> > +
> > +static void goya_sync_irqs(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int i;
> > +
> > +     if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
> > +             return;
> > +
> > +     /* Wait for all pending IRQs to be finished */
> > +     for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
> > +             synchronize_irq(pci_irq_vector(hdev->pdev, i));
> > +
> > +     synchronize_irq(pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX));
> > +}
> > +
> > +static void goya_disable_msix(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int i, irq;
> > +
> > +     if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
> > +             return;
> > +
> > +     goya_sync_irqs(hdev);
> > +
> > +     irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
> > +     free_irq(irq, &hdev->event_queue);
> > +
> > +     for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++) {
> > +             irq = pci_irq_vector(hdev->pdev, i);
> > +             free_irq(irq, &hdev->completion_queue[i]);
> > +     }
> > +
> > +     pci_free_irq_vectors(hdev->pdev);
> > +
> > +     goya->hw_cap_initialized &= ~HW_CAP_MSIX;
> > +}
> > +
> > +static void goya_halt_engines(struct hl_device *hdev, bool hard_reset)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     u32 wait_timeout_ms, cpu_timeout_ms;
> > +
> > +     dev_info(hdev->dev,
> > +             "Halting compute engines and disabling interrupts\n");
> > +
> > +     if (hdev->pldm) {
> > +             wait_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
> > +             cpu_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
> > +     } else {
> > +             wait_timeout_ms = GOYA_RESET_WAIT_MSEC;
> > +             cpu_timeout_ms = GOYA_CPU_RESET_WAIT_MSEC;
> > +     }
> > +
> > +     if ((hard_reset) && (goya->hw_cap_initialized & HW_CAP_CPU)) {
> > +             WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_GOTO_WFE);
> > +             if (hdev->fw_loading)
> > +                     WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> > +                             GOYA_ASYNC_EVENT_ID_HALT_MACHINE);
> > +             msleep(cpu_timeout_ms);
> > +     }
> > +
> > +     goya_stop_external_queues(hdev);
> > +     goya_stop_internal_queues(hdev);
> > +
> > +     msleep(wait_timeout_ms);
> > +
> > +     goya_dma_stall(hdev);
> > +     goya_tpc_stall(hdev);
> > +     goya_mme_stall(hdev);
> > +
> > +     msleep(wait_timeout_ms);
> > +
> > +     goya_disable_external_queues(hdev);
> > +     goya_disable_internal_queues(hdev);
> > +
> > +     if (hard_reset)
> > +             goya_disable_msix(hdev);
> > +     else
> > +             goya_sync_irqs(hdev);
> > +}
> >
> >  /**
> >   * goya_push_uboot_to_device - Push u-boot FW code to device
> > @@ -3166,11 +3359,16 @@ static int goya_hw_init(struct hl_device *hdev)
> >
> >       goya_init_tpc_qmans(hdev);
> >
> > +     /* MSI-X must be enabled before CPU queues are initialized */
> > +     rc = goya_enable_msix(hdev);
> > +     if (rc)
> > +             goto disable_queues;
> > +
> >       rc = goya_init_cpu_queues(hdev);
> >       if (rc) {
> >               dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
> >                       rc);
> > -             goto disable_queues;
> > +             goto disable_msix;
> >       }
> >
> >       /* CPU initialization is finished, we can now move to 48 bit DMA mask */
> > @@ -3204,6 +3402,8 @@ static int goya_hw_init(struct hl_device *hdev)
> >
> >  disable_pci_access:
> >       goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> > +disable_msix:
> > +     goya_disable_msix(hdev);
> >  disable_queues:
> >       goya_disable_internal_queues(hdev);
> >       goya_disable_external_queues(hdev);
> > @@ -3287,6 +3487,7 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
> >                                       HW_CAP_DMA | HW_CAP_MME |
> >                                       HW_CAP_MMU | HW_CAP_TPC_MBIST |
> >                                       HW_CAP_GOLDEN | HW_CAP_TPC);
> > +     memset(goya->events_stat, 0, sizeof(goya->events_stat));
> >
> >       if (!hdev->pldm) {
> >               int rc;
> > @@ -3772,6 +3973,305 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
> >       gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
> >  }
> >
> > +static void goya_update_eq_ci(struct hl_device *hdev, u32 val)
> > +{
> > +     WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, val);
> > +}
> > +
> > +static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
> > +             u16 event_type, char *axi_name, int len)
> > +{
> > +     if (!strcmp(goya_axi_name[agent_id], "DMA"))
> > +             if (event_type >= GOYA_ASYNC_EVENT_ID_DMA0_CH)
> > +                     snprintf(axi_name, len, "DMA %d",
> > +                             event_type - GOYA_ASYNC_EVENT_ID_DMA0_CH);
> > +             else
> > +                     snprintf(axi_name, len, "DMA %d",
> > +                             event_type - GOYA_ASYNC_EVENT_ID_DMA0_QM);
> > +     else
> > +             snprintf(axi_name, len, "%s", goya_axi_name[agent_id]);
> > +}
> > +
> > +static void goya_print_razwi_info(struct hl_device *hdev, u64 reg,
> > +             bool is_hbw, bool is_read, u16 event_type)
> > +{
> > +     u32 val, id, internal_id, agent_id, y, x;
> > +     char axi_name[10] = {0};
> > +
> > +     val = RREG32(reg);
> > +
> > +     if (is_hbw) {
> > +             id = (val & GOYA_IRQ_HBW_ID_MASK) >> GOYA_IRQ_HBW_ID_SHIFT;
> > +             internal_id = (val & GOYA_IRQ_HBW_INTERNAL_ID_MASK) >>
> > +                             GOYA_IRQ_HBW_INTERNAL_ID_SHIFT;
> > +             agent_id = (val & GOYA_IRQ_HBW_AGENT_ID_MASK) >>
> > +                             GOYA_IRQ_HBW_AGENT_ID_SHIFT;
> > +             y = (val & GOYA_IRQ_HBW_Y_MASK) >> GOYA_IRQ_HBW_Y_SHIFT;
> > +             x = (val & GOYA_IRQ_HBW_X_MASK) >> GOYA_IRQ_HBW_X_SHIFT;
> > +     } else {
> > +             id = (val & GOYA_IRQ_LBW_ID_MASK) >> GOYA_IRQ_LBW_ID_SHIFT;
> > +             internal_id = (val & GOYA_IRQ_LBW_INTERNAL_ID_MASK) >>
> > +                             GOYA_IRQ_LBW_INTERNAL_ID_SHIFT;
> > +             agent_id = (val & GOYA_IRQ_LBW_AGENT_ID_MASK) >>
> > +                             GOYA_IRQ_LBW_AGENT_ID_SHIFT;
> > +             y = (val & GOYA_IRQ_LBW_Y_MASK) >> GOYA_IRQ_LBW_Y_SHIFT;
> > +             x = (val & GOYA_IRQ_LBW_X_MASK) >> GOYA_IRQ_LBW_X_SHIFT;
> > +     }
>
> It seems that only agent_id is used
>
Fixed
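
As an illustration of what remains once the unused fields are dropped, a
single mask-and-shift is enough to pull the agent ID out of the register
value. The sketch below is stand-alone; the DEMO_* mask and shift values are
invented for the example and are not the GOYA_IRQ_* definitions.

```c
#include <stdint.h>

/* Hypothetical field layout: agent ID occupies bits 8..12 */
#define DEMO_AGENT_ID_MASK   0x00001F00u
#define DEMO_AGENT_ID_SHIFT  8

/* Isolate the field with the mask, then shift it down to bit 0 */
static uint32_t demo_agent_id(uint32_t val)
{
	return (val & DEMO_AGENT_ID_MASK) >> DEMO_AGENT_ID_SHIFT;
}
```
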

> > +
> > +     if (agent_id >= GOYA_MAX_INITIATORS) {
> > +             dev_err(hdev->dev,
> > +                     "Illegal %s %s with wrong initiator id %d, H/W IRQ %d\n",
> > +                             is_read ? "read from" : "write to",
> > +                             is_hbw ? "HBW" : "LBW",
> > +                             agent_id,
> > +                             event_type);
> > +     } else {
> > +             goya_get_axi_name(hdev, agent_id, event_type, axi_name,
> > +                             sizeof(axi_name));
> > +             dev_err(hdev->dev, "Illegal %s by %s %s %s, H/W IRQ %d\n",
> > +                             is_read ? "read" : "write",
> > +                             axi_name,
> > +                             is_read ? "from" : "to",
> > +                             is_hbw ? "HBW" : "LBW",
> > +                             event_type);
> > +     }
> > +}
> > +
> > +static void goya_print_irq_info(struct hl_device *hdev, u16 event_type)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     bool is_hbw = false, is_read = false, is_info = false;
> > +
> > +     if (RREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD)) {
> > +             goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_WT_ID, is_hbw,
> > +                             is_read, event_type);
> > +             WREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD, 0);
> > +             is_info = true;
> > +     }
> > +     if (RREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD)) {
> > +             is_read = true;
> > +             goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_RD_ID, is_hbw,
> > +                             is_read, event_type);
> > +             WREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD, 0);
> > +             is_info = true;
> > +     }
> > +     if (RREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD)) {
> > +             is_hbw = true;
> > +             goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_WT_ID, is_hbw,
> > +                             is_read, event_type);
> > +             WREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD, 0);
> > +             is_info = true;
> > +     }
> > +     if (RREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD)) {
> > +             is_hbw = true;
> > +             is_read = true;
> > +             goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_RD_ID, is_hbw,
> > +                             is_read, event_type);
> > +             WREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD, 0);
> > +             is_info = true;
> > +     }
> > +     if (!is_info) {
> > +             dev_err(hdev->dev,
> > +                     "Received H/W interrupt %d, no additional info\n",
> > +                     event_type);
> > +             return;
> > +     }
> > +
> > +     if (goya->hw_cap_initialized & HW_CAP_MMU) {
> > +             u32 val = RREG32(mmMMU_PAGE_ERROR_CAPTURE);
> > +             u64 addr;
> > +
> > +             if (val & MMU_PAGE_ERROR_CAPTURE_ENTRY_VALID_MASK) {
> > +                     addr = val & MMU_PAGE_ERROR_CAPTURE_VA_49_32_MASK;
> > +                     addr <<= 32;
> > +                     addr |= RREG32(mmMMU_PAGE_ERROR_CAPTURE_VA);
> > +
> > +                     dev_err(hdev->dev, "MMU page fault on va 0x%llx\n",
> > +                                     addr);
> > +
> > +                     WREG32(mmMMU_PAGE_ERROR_CAPTURE, 0);
> > +             }
> > +     }
> > +}
> > +
> > +static int goya_unmask_irq(struct hl_device *hdev, u16 event_type)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_UNMASK_RAZWI_IRQ;
> > +     pkt.value = event_type;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                     HL_DEVICE_TIMEOUT_USEC, &result);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "failed to unmask RAZWI IRQ %d\n", event_type);
> > +
> > +     return rc;
> > +}
> > +
> > +void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry)
> > +{
> > +     u16 event_type = ((eq_entry->hdr.ctl & EQ_CTL_EVENT_TYPE_MASK)
> > +                     >> EQ_CTL_EVENT_TYPE_SHIFT);
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     goya->events_stat[event_type]++;
> > +
> > +     switch (event_type) {
> > +     case GOYA_ASYNC_EVENT_ID_PCIE_IF:
> > +     case GOYA_ASYNC_EVENT_ID_TPC0_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC1_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC2_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC3_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC4_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC5_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC6_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC7_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_MME_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_MME_ECC_EXT:
> > +     case GOYA_ASYNC_EVENT_ID_MMU_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_MACRO:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_CPU_IF_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_PSOC_MEM:
> > +     case GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM0:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM1:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM2:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM3:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM4:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM5:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM6:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM7:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM8:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM9:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM10:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM11:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM12:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM13:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM14:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM15:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM16:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM17:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM18:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM19:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM20:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM21:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM22:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM23:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM24:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM25:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM26:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM27:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM28:
> > +     case GOYA_ASYNC_EVENT_ID_SRAM29:
> > +     case GOYA_ASYNC_EVENT_ID_GIC500:
> > +     case GOYA_ASYNC_EVENT_ID_PLL0:
> > +     case GOYA_ASYNC_EVENT_ID_PLL1:
> > +     case GOYA_ASYNC_EVENT_ID_PLL3:
> > +     case GOYA_ASYNC_EVENT_ID_PLL4:
> > +     case GOYA_ASYNC_EVENT_ID_PLL5:
> > +     case GOYA_ASYNC_EVENT_ID_PLL6:
> > +     case GOYA_ASYNC_EVENT_ID_AXI_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_L2_RAM_ECC:
> > +     case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET:
> > +     case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT:
> > +             dev_err(hdev->dev,
> > +                     "Received H/W interrupt %d, reset the chip\n",
> > +                     event_type);
> > +             break;
>
> Looks tough. Any chance some of these values are consecutive and can be
> grouped, e.g
>
>         case GOYA_ASYNC_EVENT_ID_SRAM0 ... GOYA_ASYNC_EVENT_ID_SRAM29:
> ?

Fixed (did what I could)
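
For reference, a small sketch of the case-range form suggested above. Case
ranges are a GNU C extension accepted by GCC and Clang, and they only help
where the enum values really are consecutive. The DEMO_EVENT_* IDs below are
made up for the example and do not match the real GOYA_ASYNC_EVENT_ID_*
values.

```c
#include <stdbool.h>

/* Hypothetical event IDs; the SRAM block is contiguous by construction */
enum demo_event_id {
	DEMO_EVENT_PCIE_IF = 1,
	DEMO_EVENT_SRAM0   = 10,
	DEMO_EVENT_SRAM29  = DEMO_EVENT_SRAM0 + 29,
	DEMO_EVENT_PLL0    = 50,
};

/* Returns true for events that would lead to a chip reset in this demo */
static bool demo_event_needs_reset(unsigned int event_type)
{
	switch (event_type) {
	case DEMO_EVENT_PCIE_IF:
	case DEMO_EVENT_SRAM0 ... DEMO_EVENT_SRAM29:   /* 30 labels in one */
	case DEMO_EVENT_PLL0:
		return true;
	default:
		return false;
	}
}
```

Note the spaces around the "..." token; without them the preprocessor can
misparse integer operands.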
>
> > +
> > +     case GOYA_ASYNC_EVENT_ID_PCIE_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC0_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC1_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC2_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC3_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC4_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC5_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC6_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC7_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_MME_WACS:
> > +     case GOYA_ASYNC_EVENT_ID_MME_WACSD:
> > +     case GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER:
> > +     case GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC:
> > +     case GOYA_ASYNC_EVENT_ID_PSOC:
> > +     case GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR:
> > +     case GOYA_ASYNC_EVENT_ID_TPC0_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC1_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC2_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC3_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC4_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC5_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC6_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC7_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_TPC0_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC1_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC2_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC3_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC4_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC5_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC6_QM:
> > +     case GOYA_ASYNC_EVENT_ID_TPC7_QM:
> > +     case GOYA_ASYNC_EVENT_ID_MME_QM:
> > +     case GOYA_ASYNC_EVENT_ID_MME_CMDQ:
> > +     case GOYA_ASYNC_EVENT_ID_DMA0_QM:
> > +     case GOYA_ASYNC_EVENT_ID_DMA1_QM:
> > +     case GOYA_ASYNC_EVENT_ID_DMA2_QM:
> > +     case GOYA_ASYNC_EVENT_ID_DMA3_QM:
> > +     case GOYA_ASYNC_EVENT_ID_DMA4_QM:
> > +     case GOYA_ASYNC_EVENT_ID_DMA0_CH:
> > +     case GOYA_ASYNC_EVENT_ID_DMA1_CH:
> > +     case GOYA_ASYNC_EVENT_ID_DMA2_CH:
> > +     case GOYA_ASYNC_EVENT_ID_DMA3_CH:
> > +     case GOYA_ASYNC_EVENT_ID_DMA4_CH:
> > +             goya_print_irq_info(hdev, event_type);
> > +             goya_unmask_irq(hdev, event_type);
> > +             break;
> > +
> > +     case GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_BM_CH0:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_BM_CH1:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_BM_CH2:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_BM_CH3:
> > +     case GOYA_ASYNC_EVENT_ID_DMA_BM_CH4:
> > +             dev_info(hdev->dev, "Received H/W interrupt %d\n", event_type);
> > +             break;
> > +
> > +     default:
> > +             dev_err(hdev->dev, "Received invalid H/W interrupt %d\n",
> > +                             event_type);
> > +             break;
> > +     }
> > +}
> > +
> > +void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     *size = (u32) sizeof(goya->events_stat);
> > +
> > +     return goya->events_stat;
> > +}
> > +
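The stats getter above hands back the whole counter array plus its size in bytes (sizeof on the array member), so a caller has to divide by sizeof(u32) to get the number of counters. A minimal user-space sketch of that shape, assuming the counters are bumped once per handled event (the increment itself is not in the hunk quoted here); EVENT_ID_SIZE, struct dev and record_event() are made-up stand-ins:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EVENT_ID_SIZE 16u  /* hypothetical; stands in for GOYA_ASYNC_EVENT_ID_SIZE */

struct dev {
    uint32_t events_stat[EVENT_ID_SIZE];  /* one counter per event id */
};

/* Count one occurrence of an event; out-of-range ids are ignored. */
static void record_event(struct dev *d, uint32_t event_type)
{
    if (event_type < EVENT_ID_SIZE)
        d->events_stat[event_type]++;
}

/* Same shape as goya_get_events_stat(): pointer out, size in bytes via *size. */
static void *get_events_stat(struct dev *d, uint32_t *size)
{
    assert(size != NULL);
    *size = (uint32_t)sizeof(d->events_stat);
    return d->events_stat;
}
```

The bounds check in record_event() is a choice made for the sketch, not something taken from the patch.
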
> >
> >  static void goya_hw_queues_lock(struct hl_device *hdev)
> >  {
> > @@ -3794,6 +4294,7 @@ static const struct hl_asic_funcs goya_funcs = {
> >       .sw_fini = goya_sw_fini,
> >       .hw_init = goya_hw_init,
> >       .hw_fini = goya_hw_fini,
> > +     .halt_engines = goya_halt_engines,
> >       .suspend = goya_suspend,
> >       .resume = goya_resume,
> >       .mmap = goya_mmap,
> > @@ -3808,6 +4309,9 @@ static const struct hl_asic_funcs goya_funcs = {
> >       .dma_pool_free = goya_dma_pool_free,
> >       .cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
> >       .cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
> > +     .update_eq_ci = goya_update_eq_ci,
> > +     .handle_eqe = goya_handle_eqe,
> > +     .get_events_stat = goya_get_events_stat,
> >       .hw_queues_lock = goya_hw_queues_lock,
> >       .hw_queues_unlock = goya_hw_queues_unlock,
> >       .send_cpu_message = goya_send_cpu_message
> > diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> > index 598a718d3df1..c6bfcb6c6905 100644
> > --- a/drivers/misc/habanalabs/goya/goyaP.h
> > +++ b/drivers/misc/habanalabs/goya/goyaP.h
> > @@ -123,6 +123,7 @@ struct goya_device {
> >       /* TODO: remove hw_queues_lock after moving to scheduler code */
> >       spinlock_t      hw_queues_lock;
> >       u64             ddr_bar_cur_addr;
> > +     u32             events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
> >       u32             hw_cap_initialized;
> >  };
> >
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index 8232e2259463..899bf98eb002 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -83,6 +83,7 @@ struct hw_queue_properties {
> >   * @cfg_size: configuration space size on SRAM.
> >   * @sram_size: total size of SRAM.
> >   * @max_asid: maximum number of open contexts (ASIDs).
> > + * @num_of_events: number of possible internal H/W IRQs.
> >   * @completion_queues_count: number of completion queues.
> >   * @high_pll: high PLL frequency used by the device.
> >   * @cb_pool_cb_cnt: number of CBs in the CB pool.
> > @@ -109,6 +110,7 @@ struct asic_fixed_properties {
> >       u32                     cfg_size;
> >       u32                     sram_size;
> >       u32                     max_asid;
> > +     u32                     num_of_events;
> >       u32                     high_pll;
> >       u32                     cb_pool_cb_cnt;
> >       u32                     cb_pool_cb_size;
> > @@ -209,6 +211,9 @@ struct hl_cs_job;
> >  #define HL_CQ_LENGTH                 HL_QUEUE_LENGTH
> >  #define HL_CQ_SIZE_IN_BYTES          (HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
> >
> > +/* Must be power of 2 (HL_PAGE_SIZE / HL_EQ_ENTRY_SIZE) */
> > +#define HL_EQ_LENGTH                 64
> > +#define HL_EQ_SIZE_IN_BYTES          (HL_EQ_LENGTH * HL_EQ_ENTRY_SIZE)
> >
> >
> >  /**
> > @@ -256,6 +261,20 @@ struct hl_cq {
> >       atomic_t                free_slots_cnt;
> >  };
> >
> > +/**
> > + * struct hl_eq - describes the event queue (single one per device)
> > + * @hdev: pointer to the device structure
> > + * @kernel_address: holds the queue's kernel virtual address
> > + * @bus_address: holds the queue's DMA address
> > + * @ci: ci inside the queue
> > + */
> > +struct hl_eq {
> > +     struct hl_device        *hdev;
> > +     u64                     kernel_address;
> > +     dma_addr_t              bus_address;
> > +     u32                     ci;
> > +};
> > +
> >
> >
> >
> > @@ -288,6 +307,9 @@ enum hl_asic_type {
> >   * @sw_fini: tears down driver state, does not configure H/W.
> >   * @hw_init: sets up the H/W state.
> >   * @hw_fini: tears down the H/W state.
> > + * @halt_engines: halt engines, needed for reset sequence. This also disables
> > + *                interrupts from the device. Should be called before
> > + *                hw_fini and before CS rollback.
> >   * @suspend: handles IP specific H/W or SW changes for suspend.
> >   * @resume: handles IP specific H/W or SW changes for resume.
> >   * @mmap: mmap function, does nothing.
> > @@ -303,6 +325,9 @@ enum hl_asic_type {
> >   * @dma_pool_free: free small DMA allocation from pool.
> >   * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
> >   * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
> > + * @update_eq_ci: update event queue CI.
> > + * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
> > + * @get_events_stat: retrieve event queue entries histogram.
> >   * @hw_queues_lock: acquire H/W queues lock.
> >   * @hw_queues_unlock: release H/W queues lock.
> >   * @send_cpu_message: send buffer to ArmCP.
> > @@ -314,6 +339,7 @@ struct hl_asic_funcs {
> >       int (*sw_fini)(struct hl_device *hdev);
> >       int (*hw_init)(struct hl_device *hdev);
> >       void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
> > +     void (*halt_engines)(struct hl_device *hdev, bool hard_reset);
> >       int (*suspend)(struct hl_device *hdev);
> >       int (*resume)(struct hl_device *hdev);
> >       int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
> > @@ -336,6 +362,10 @@ struct hl_asic_funcs {
> >                               size_t size, dma_addr_t *dma_handle);
> >       void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
> >                               size_t size, void *vaddr);
> > +     void (*update_eq_ci)(struct hl_device *hdev, u32 val);
> > +     void (*handle_eqe)(struct hl_device *hdev,
> > +                             struct hl_eq_entry *eq_entry);
> > +     void *(*get_events_stat)(struct hl_device *hdev, u32 *size);
> >       void (*hw_queues_lock)(struct hl_device *hdev);
> >       void (*hw_queues_unlock)(struct hl_device *hdev);
> >       int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
> > @@ -474,6 +504,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @kernel_ctx: KMD context structure.
> >   * @kernel_queues: array of hl_hw_queue.
> >   * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CBs.
> > + * @event_queue: event queue for IRQ from ArmCP.
> >   * @dma_pool: DMA pool for small allocations.
> >   * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
> >   * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
> > @@ -504,9 +535,11 @@ struct hl_device {
> >       enum hl_asic_type               asic_type;
> >       struct hl_cq                    *completion_queue;
> >       struct workqueue_struct         *cq_wq;
> > +     struct workqueue_struct         *eq_wq;
> >       struct hl_ctx                   *kernel_ctx;
> >       struct hl_hw_queue              *kernel_queues;
> >       struct hl_cb_mgr                kernel_cb_mgr;
> > +     struct hl_eq                    event_queue;
> >       struct dma_pool                 *dma_pool;
> >       void                            *cpu_accessible_dma_mem;
> >       dma_addr_t                      cpu_accessible_dma_address;
> > @@ -593,6 +626,10 @@ void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
> >
> >  int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
> >  void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
> > +int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
> > +void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
> > +irqreturn_t hl_irq_handler_cq(int irq, void *arg);
> > +irqreturn_t hl_irq_handler_eq(int irq, void *arg);
> >  int hl_asid_init(struct hl_device *hdev);
> >  void hl_asid_fini(struct hl_device *hdev);
> >  unsigned long hl_asid_alloc(struct hl_device *hdev);
> > diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
> > index 2d0efb7b44bb..bcc461760e5f 100644
> > --- a/drivers/misc/habanalabs/include/goya/goya.h
> > +++ b/drivers/misc/habanalabs/include/goya/goya.h
> > @@ -65,7 +65,6 @@
> >
> >  #define GOYA_MSIX_ENTRIES    8
> >  #define EVENT_QUEUE_MSIX_IDX 5
> > -#define ARMCP_RESET_MSIX_IDX 6
> >
> >  #define QMAN_PQ_ENTRY_SIZE   16                      /* Bytes */
> >
> > diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
> > index 97b0de7ea5c2..9586323e7dfb 100644
> > --- a/drivers/misc/habanalabs/irq.c
> > +++ b/drivers/misc/habanalabs/irq.c
> > @@ -9,6 +9,18 @@
> >
> >  #include <linux/dma-mapping.h>
> >
> > +/**
> > + * This structure is used to schedule work of EQ entry and armcp_reset event
> > + *
> > + * @eq_work          - workqueue object to run when EQ entry is received
> > + * @hdev             - pointer to device structure
> > + * @eq_entry         - copy of the EQ entry
> > + */
> > +struct hl_eqe_work {
> > +     struct work_struct      eq_work;
> > +     struct hl_device        *hdev;
> > +     struct hl_eq_entry      eq_entry;
> > +};
> >
> >  /**
> >   * hl_cq_inc_ptr - increment ci or pi of cq
> > @@ -26,6 +38,33 @@ inline u32 hl_cq_inc_ptr(u32 ptr)
> >       return ptr;
> >  }
> >
> > +/**
> > + * hl_eq_inc_ptr - increment ci of eq
> > + *
> > + * @ptr: the current ci value of the event queue
> > + *
> > + * Increment ptr by 1. If it reaches the number of event queue
> > + * entries, set it to 0
> > + */
> > +inline u32 hl_eq_inc_ptr(u32 ptr)
> > +{
> > +     ptr++;
> > +     if (unlikely(ptr == HL_EQ_LENGTH))
> > +             ptr = 0;
> > +     return ptr;
> > +}
> > +
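The wrap-around in hl_eq_inc_ptr() is easy to try out in user space. A minimal sketch, assuming the queue length of 64 from HL_EQ_LENGTH; eq_inc_ptr() and eq_inc_ptr_masked() are names made up for illustration. Because the length is a power of 2, a mask would give the same result as the compare-and-reset:

```c
#include <assert.h>
#include <stdint.h>

#define EQ_LENGTH 64u /* mirrors HL_EQ_LENGTH in the patch */

/* A mask-based wrap is only valid for power-of-2 lengths. */
_Static_assert((EQ_LENGTH & (EQ_LENGTH - 1u)) == 0u,
               "EQ_LENGTH must be a power of 2");

/* Compare-and-reset form, as in hl_eq_inc_ptr(). */
static uint32_t eq_inc_ptr(uint32_t ptr)
{
    assert(ptr < EQ_LENGTH);
    ptr++;
    if (ptr == EQ_LENGTH)
        ptr = 0;
    return ptr;
}

/* Hypothetical mask form; equivalent for any ptr < EQ_LENGTH. */
static uint32_t eq_inc_ptr_masked(uint32_t ptr)
{
    return (ptr + 1u) & (EQ_LENGTH - 1u);
}
```

The compare form keeps working if the length ever stops being a power of 2, while the mask form would silently break, which is what the static assertion guards against.
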
> > +static void irq_handle_eqe(struct work_struct *work)
> > +{
> > +     struct hl_eqe_work *eqe_work = container_of(work, struct hl_eqe_work,
> > +                                                     eq_work);
> > +     struct hl_device *hdev = eqe_work->hdev;
> > +
> > +     hdev->asic_funcs->handle_eqe(hdev, &eqe_work->eq_entry);
> > +
> > +     kfree(eqe_work);
> > +}
> > +
> >  /**
> >   * hl_irq_handler_cq - irq handler for completion queue
> >   *
> > @@ -103,6 +142,68 @@ irqreturn_t hl_irq_handler_cq(int irq, void *arg)
> >       return IRQ_HANDLED;
> >  }
> >
> > +/**
> > + * hl_irq_handler_eq - irq handler for event queue
> > + *
> > + * @irq: irq number
> > + * @arg: pointer to event queue structure
> > + *
> > + */
> > +irqreturn_t hl_irq_handler_eq(int irq, void *arg)
> > +{
> > +     struct hl_eq *eq = arg;
> > +     struct hl_device *hdev = eq->hdev;
> > +     struct hl_eq_entry *eq_entry;
> > +     struct hl_eq_entry *eq_base;
> > +     struct hl_eqe_work *handle_eqe_work;
> > +
> > +     eq_base = (struct hl_eq_entry *) eq->kernel_address;
> > +
> > +     while (1) {
> > +             bool entry_ready =
> > +                             ((eq_base[eq->ci].hdr.ctl & EQ_CTL_READY_MASK)
> > +                                             >> EQ_CTL_READY_SHIFT);
> > +
> > +             if (!entry_ready)
> > +                     break;
> > +
> > +             eq_entry = &eq_base[eq->ci];
> > +
> > +             /*
> > +              * Make sure we read EQ entry contents after we've
> > +              * checked the ownership bit.
> > +              */
> > +             dma_rmb();
> > +
> > +             if (hdev->disabled) {
> > +                     dev_warn(hdev->dev,
> > +                             "Device disabled but received IRQ %d for EQ\n",
> > +                                     irq);
> > +                     goto skip_irq;
> > +             }
> > +
> > +             handle_eqe_work = kmalloc(sizeof(*handle_eqe_work), GFP_ATOMIC);
> > +             if (handle_eqe_work) {
>
> I couldn't find where it is freed
It is freed in irq_handle_eqe(), right after the handle_eqe() callback returns.
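
The hand-off is: the IRQ handler allocates the work item and copies the entry into it, and the work function owns it from then on and frees it. A rough user-space sketch of that ownership pattern, with malloc()/free() standing in for kmalloc()/kfree() and all type and function names invented for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* User-space stand-ins for the kernel types; all names are illustrative. */
struct work { void (*fn)(struct work *w); };
struct eq_entry { unsigned int ctl; unsigned int data; };

struct eqe_work {
    struct work     work;   /* embedded, like eq_work in hl_eqe_work */
    struct eq_entry entry;  /* private copy of the queue entry */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static unsigned int last_handled;  /* records what the worker saw */

/* Worker side: recover the container, use the copy, then free it. */
static void handle_eqe(struct work *w)
{
    struct eqe_work *ew = container_of(w, struct eqe_work, work);

    last_handled = ew->entry.data;
    free(ew);               /* matches kfree() in irq_handle_eqe() */
}

/* "IRQ" side: allocate, copy the entry, hand ownership to the worker. */
static struct work *post_eqe(const struct eq_entry *src)
{
    struct eqe_work *ew;

    assert(src != NULL);
    ew = malloc(sizeof(*ew));
    if (!ew)
        return NULL;        /* entry is dropped, as in the patch */
    ew->work.fn = handle_eqe;
    memcpy(&ew->entry, src, sizeof(*src));
    return &ew->work;       /* the caller would queue this */
}
```

Calling the returned work's fn once plays the role of the workqueue running the item; after that call the allocation is gone and must not be touched.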

>
> > +                     INIT_WORK(&handle_eqe_work->eq_work, irq_handle_eqe);
> > +                     handle_eqe_work->hdev = hdev;
> > +
> > +                     memcpy(&handle_eqe_work->eq_entry, eq_entry,
> > +                                     sizeof(*eq_entry));
> > +
> > +                     queue_work(hdev->eq_wq, &handle_eqe_work->eq_work);
> > +             }
> > +skip_irq:
> > +             /* Clear EQ entry ready bit */
> > +             eq_entry->hdr.ctl &= ~EQ_CTL_READY_MASK;
> > +
> > +             eq->ci = hl_eq_inc_ptr(eq->ci);
> > +
> > +             hdev->asic_funcs->update_eq_ci(hdev, eq->ci);
> > +     }
> > +
> > +     return IRQ_HANDLED;
> > +}
> > +
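The consumer loop in hl_irq_handler_eq() boils down to: test the ready bit, copy the entry, clear the bit, advance ci. A simplified user-space sketch of that loop, assuming a made-up ready-bit position and a small ring; unlike the driver it copies into a caller-supplied array (bounded by max) instead of queueing work items:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EQ_LENGTH      8u          /* small ring for the sketch */
#define EQ_READY_MASK  0x80000000u /* hypothetical position of the ready bit */

struct eq_entry { uint32_t ctl; uint32_t data; };

struct eq {
    struct eq_entry ring[EQ_LENGTH];
    uint32_t ci;                   /* consumer index */
};

/*
 * Drain every entry the producer has marked ready. Each entry is copied
 * out, its ready bit is cleared so the slot can be reused, and ci advances
 * with wrap-around. Returns the number of entries consumed.
 */
static unsigned int eq_drain(struct eq *q, struct eq_entry *out,
                             unsigned int max)
{
    unsigned int n = 0;

    assert(q->ci < EQ_LENGTH);

    while (n < max && (q->ring[q->ci].ctl & EQ_READY_MASK)) {
        /* In the driver a dma_rmb() sits here, before the payload read. */
        memcpy(&out[n++], &q->ring[q->ci], sizeof(out[0]));
        q->ring[q->ci].ctl &= ~EQ_READY_MASK;
        q->ci = (q->ci + 1u) % EQ_LENGTH;
    }
    return n;
}
```

The sketch leaves out the update_eq_ci() call that tells the device how far the driver has consumed.
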
> >  /**
> >   * hl_cq_init - main initialization function for an cq object
> >   *
> > @@ -148,3 +249,46 @@ void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
> >       hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
> >                       (void *) q->kernel_address, q->bus_address);
> >  }
> > +
> > +/**
> > + * hl_eq_init - main initialization function for an event queue object
> > + *
> > + * @hdev: pointer to device structure
> > + * @q: pointer to eq structure
> > + *
> > + * Allocate dma-able memory for the event queue and initialize fields
> > + * Returns 0 on success
> > + */
> > +int hl_eq_init(struct hl_device *hdev, struct hl_eq *q)
> > +{
> > +     void *p;
> > +
> > +     BUILD_BUG_ON(HL_EQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
> > +
> > +     p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
> > +                             &q->bus_address, GFP_KERNEL | __GFP_ZERO);
> > +     if (!p)
> > +             return -ENOMEM;
> > +
> > +     q->hdev = hdev;
> > +     q->kernel_address = (u64) p;
> > +     q->ci = 0;
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * hl_eq_fini - destroy event queue
> > + *
> > + * @hdev: pointer to device structure
> > + * @q: pointer to eq structure
> > + *
> > + * Free the event queue memory
> > + */
> > +void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q)
> > +{
> > +     flush_workqueue(hdev->eq_wq);
> > +
> > +     hdev->asic_funcs->dma_free_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
> > +                     (void *) q->kernel_address, q->bus_address);
> > +}
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 09/15] habanalabs: add sysfs and hwmon support
  2019-01-25  7:54   ` Mike Rapoport
@ 2019-01-28 11:26     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28 11:26 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, ogabbay

On Fri, Jan 25, 2019 at 9:54 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:51AM +0200, Oded Gabbay wrote:
> > This patch adds the sysfs and hwmon entries that are exposed by the driver.
> >
> > Goya has several sensors, from various categories such as temperature,
> > voltage, current, etc. The driver exposes those sensors in the standard
> > hwmon mechanism.
> >
> > In addition, the driver exposes a couple of interfaces in sysfs, both for
> > configuration and for providing status of the device or driver.
> >
> > The configuration attributes are for power management:
> > - Automatic or manual
> > - Frequency value when moving to high frequency mode
> > - Maximum power the device is allowed to consume
> >
> > The rest of the attributes are read-only and provide the following
> > information:
> > - Versions of the various firmwares running on the device
> > - Contents of the device's EEPROM
> > - The device type (currently only Goya is supported)
> > - PCI address of the device (to allow user-space to map /dev/hlX to its
> >   PCI address)
> > - Status of the device (operational, malfunction, in_reset)
> > - How many processes are open on the device's file
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  .../ABI/testing/sysfs-driver-habanalabs       | 190 ++++++
> >  drivers/misc/habanalabs/Makefile              |   2 +-
> >  drivers/misc/habanalabs/device.c              | 146 +++++
> >  drivers/misc/habanalabs/goya/Makefile         |   2 +-
> >  drivers/misc/habanalabs/goya/goya.c           | 230 +++++++
> >  drivers/misc/habanalabs/goya/goyaP.h          |  21 +
> >  drivers/misc/habanalabs/goya/goya_hwmgr.c     | 306 +++++++++
> >  drivers/misc/habanalabs/habanalabs.h          |  97 +++
> >  drivers/misc/habanalabs/habanalabs_drv.c      |   7 +
> >  drivers/misc/habanalabs/hwmon.c               | 449 +++++++++++++
> >  drivers/misc/habanalabs/sysfs.c               | 588 ++++++++++++++++++
> >  11 files changed, 2036 insertions(+), 2 deletions(-)
> >  create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
> >  create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
> >  create mode 100644 drivers/misc/habanalabs/hwmon.c
> >  create mode 100644 drivers/misc/habanalabs/sysfs.c
> >
> > diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
> > new file mode 100644
> > index 000000000000..19edd4da87c1
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
> > @@ -0,0 +1,190 @@
> > +What:           /sys/class/habanalabs/hl<n>/armcp_kernel_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the Linux kernel running on the device's CPU
> > +
> > +What:           /sys/class/habanalabs/hl<n>/armcp_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the application running on the device's CPU
> > +
> > +What:           /sys/class/habanalabs/hl<n>/cpld_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the Device's CPLD F/W
> > +
> > +What:           /sys/class/habanalabs/hl<n>/device_type
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the code name of the device according to its type.
> > +                The supported values are: "GOYA"
> > +
> > +What:           /sys/class/habanalabs/hl<n>/eeprom
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    A binary file attribute that contains the contents of the
> > +                on-board EEPROM
> > +
> > +What:           /sys/class/habanalabs/hl<n>/fuse_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the device's version from the eFuse
> > +
> > +What:           /sys/class/habanalabs/hl<n>/hard_reset
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Interface to trigger a hard-reset operation for the device.
> > +                Hard-reset will reset ALL internal components of the device
> > +                except for the PCI interface and the internal PLLs
> > +
> > +What:           /sys/class/habanalabs/hl<n>/hard_reset_cnt
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays how many times the device has undergone a hard-reset
> > +                operation
> > +
> > +What:           /sys/class/habanalabs/hl<n>/high_pll
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Allows the user to set the maximum clock frequency for MME, TPC
> > +                and IC when the power management profile is set to "automatic".
> > +
> > +What:           /sys/class/habanalabs/hl<n>/ic_clk
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Allows the user to set the maximum clock frequency of the
> > +                Interconnect fabric. Writes to this parameter affect the device
> > +                only when the power management profile is set to "manual" mode.
> > +                The device IC clock might be set to a lower value than the
> > +                maximum. The user should read the ic_clk_curr to see the actual
> > +                frequency value of the IC
> > +
> > +What:           /sys/class/habanalabs/hl<n>/ic_clk_curr
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the current clock frequency of the Interconnect fabric
> > +
> > +What:           /sys/class/habanalabs/hl<n>/infineon_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the Device's power supply F/W code
> > +
> > +What:           /sys/class/habanalabs/hl<n>/max_power
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Allows the user to set the maximum power consumption of the
> > +                device in milliwatts.
> > +
> > +What:           /sys/class/habanalabs/hl<n>/mme_clk
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Allows the user to set the maximum clock frequency of the
> > +                MME compute engine. Writes to this parameter affect the device
> > +                only when the power management profile is set to "manual" mode.
> > +                The device MME clock might be set to a lower value than the
> > +                maximum. The user should read the mme_clk_curr to see the actual
> > +                frequency value of the MME
> > +
> > +What:           /sys/class/habanalabs/hl<n>/mme_clk_curr
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the current clock frequency of the MME compute engine
> > +
> > +What:           /sys/class/habanalabs/hl<n>/pci_addr
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the PCI address of the device. This is needed so the
> > +                user would be able to open a device based on its PCI address
> > +
> > +What:           /sys/class/habanalabs/hl<n>/pm_mng_profile
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Power management profile. Values are "auto", "manual". In "auto"
> > +                mode, the driver will set the maximum clock frequency to a high
> > +                value when a user-space process opens the device's file (unless
> > +                it was already opened by another process). The driver will set
> > +                the max clock frequency to a low value when no user
> > +                processes have the device's file open. In "manual"
> > +                mode, the user sets the maximum clock frequency by writing to
> > +                ic_clk, mme_clk and tpc_clk
> > +
> > +
> > +What:           /sys/class/habanalabs/hl<n>/preboot_btl_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the device's preboot F/W code
> > +
> > +What:           /sys/class/habanalabs/hl<n>/soft_reset
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Interface to trigger a soft-reset operation for the device.
> > +                Soft-reset will reset only the compute and DMA engines of the
> > +                device
> > +
> > +What:           /sys/class/habanalabs/hl<n>/soft_reset_cnt
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays how many times the device has undergone a soft-reset
> > +                operation
> > +
> > +What:           /sys/class/habanalabs/hl<n>/status
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Status of the card: "Operational", "Malfunction", "In reset".
> > +
> > +What:           /sys/class/habanalabs/hl<n>/thermal_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the Device's thermal daemon
> > +
> > +What:           /sys/class/habanalabs/hl<n>/tpc_clk
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Allows the user to set the maximum clock frequency of the
> > +                TPC compute engines. Writes to this parameter affect the device
> > +                only when the power management profile is set to "manual" mode.
> > +                The device TPC clock might be set to a lower value than the
> > +                maximum. The user should read the tpc_clk_curr to see the actual
> > +                frequency value of the TPC
> > +
> > +What:           /sys/class/habanalabs/hl<n>/tpc_clk_curr
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the current clock frequency of the TPC compute engines
> > +
> > +What:           /sys/class/habanalabs/hl<n>/uboot_ver
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Version of the u-boot running on the device's CPU
> > +
> > +What:           /sys/class/habanalabs/hl<n>/write_open_cnt
> > +Date:           Jan 2019
> > +KernelVersion:  5.1
> > +Contact:        oded.gabbay@gmail.com
> > +Description:    Displays the total number of user processes that currently
> > +                have the device's file open
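For the text attributes documented above, user space only needs to open the file and read a line. A small sketch of such a reader, assuming each attribute returns a single newline-terminated line; read_sysfs_attr() is a made-up helper, and hl0 in the usage note is just a guess at the first device's name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one line from a sysfs attribute into buf and strip the trailing
 * newline. Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int read_sysfs_attr(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(buf != NULL && len > 0);

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A monitoring tool could call read_sysfs_attr("/sys/class/habanalabs/hl0/status", buf, sizeof(buf)) and compare the result against the documented status strings. The binary eeprom attribute would need fread() instead.
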
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > index c07f3ccb57dc..b5607233d216 100644
> > --- a/drivers/misc/habanalabs/Makefile
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -5,7 +5,7 @@
> >  obj-m        := habanalabs.o
> >
> >  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> > -             command_buffer.o hw_queue.o irq.o
> > +             command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
> >
> >  include $(src)/goya/Makefile
> >  habanalabs-y += $(HL_GOYA_FILES)
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index 9199e070e79e..ff7b610f18c4 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -226,6 +226,118 @@ static void device_early_fini(struct hl_device *hdev)
> >       mutex_destroy(&hdev->device_open);
> >  }
> >
> > +static void set_freq_to_low_job(struct work_struct *work)
> > +{
> > +     struct hl_device *hdev = container_of(work, struct hl_device,
> > +                                             work_freq.work);
> > +
> > +     if (atomic_read(&hdev->fd_open_cnt) == 0)
> > +             hl_device_set_frequency(hdev, PLL_LOW);
> > +
> > +     schedule_delayed_work(&hdev->work_freq,
> > +                     usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
> > +}
> > +
> > +/**
> > + * device_late_init - do late initialization for the habanalabs device
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + * Do stuff that either needs the device H/W queues to be active or needs
> > + * to happen after all the rest of the initialization is finished
> > + */
> > +static int device_late_init(struct hl_device *hdev)
> > +{
> > +     int rc;
> > +
> > +     INIT_DELAYED_WORK(&hdev->work_freq, set_freq_to_low_job);
> > +     hdev->high_pll = hdev->asic_prop.high_pll;
> > +
> > +     /* force setting to low frequency */
> > +     atomic_set(&hdev->curr_pll_profile, PLL_LOW);
> > +
> > +     if (hdev->pm_mng_profile == PM_AUTO)
> > +             hdev->asic_funcs->set_pll_profile(hdev, PLL_LOW);
> > +     else
> > +             hdev->asic_funcs->set_pll_profile(hdev, PLL_LAST);
> > +
> > +     if (hdev->asic_funcs->late_init) {
> > +             rc = hdev->asic_funcs->late_init(hdev);
> > +             if (rc) {
> > +                     dev_err(hdev->dev,
> > +                             "failed late initialization for the H/W\n");
> > +                     return rc;
> > +             }
> > +     }
> > +
> > +     schedule_delayed_work(&hdev->work_freq,
> > +                     usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
> > +
> > +     hdev->late_init_done = true;
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * device_late_fini - finalize all that was done in device_late_init
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + *
> > + */
> > +static void device_late_fini(struct hl_device *hdev)
> > +{
> > +     if (!hdev->late_init_done)
> > +             return;
> > +
> > +     cancel_delayed_work_sync(&hdev->work_freq);
> > +
> > +     if (hdev->asic_funcs->late_fini)
> > +             hdev->asic_funcs->late_fini(hdev);
> > +
> > +     hdev->late_init_done = false;
> > +}
> > +
> > +/**
> > + * hl_device_set_frequency - set the frequency of the device
> > + *
> > + * @hdev: pointer to habanalabs device structure
> > + * @freq: the new frequency value
> > + *
> > + * Change the frequency if needed.
> > + * We allow setting the PLL to low only if there is no user process.
> > + * Returns 0 if no change was made, otherwise returns 1.
> > + */
> > +int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq)
> > +{
> > +     enum hl_pll_frequency old_freq =
> > +                     (freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;
> > +     int ret;
> > +
> > +     if (hdev->pm_mng_profile == PM_MANUAL)
> > +             return 0;
> > +
> > +     ret = atomic_cmpxchg(&hdev->curr_pll_profile, old_freq, freq);
> > +     if (ret == freq)
> > +             return 0;
> > +
> > +     /*
> > +      * In case we want to lower the frequency, check that the device is
> > +      * not open. We must check here to work around a race condition with
> > +      * hl_device_open.
> > +      */
> > +     if ((freq == PLL_LOW) && (atomic_read(&hdev->fd_open_cnt) > 0)) {
> > +             atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
> > +             return 0;
> > +     }
> > +
> > +     dev_dbg(hdev->dev, "Changing device frequency to %s\n",
> > +             freq == PLL_HIGH ? "high" : "low");
> > +
> > +     hdev->asic_funcs->set_pll_profile(hdev, freq);
> > +
> > +     return 1;
> > +}
> > +
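The state change in hl_device_set_frequency() is a two-state compare-and-swap followed by a re-check of the open count. A user-space sketch with C11 atomics, assuming only the two profiles LOW and HIGH; note that C11's atomic_compare_exchange_strong() returns a success flag, whereas the kernel's atomic_cmpxchg() returns the old value. The manual-mode early return is left out, and struct dev and hw_writes are invented for the example:

```c
#include <assert.h>
#include <stdatomic.h>

enum pll { PLL_LOW, PLL_HIGH };

struct dev {
    atomic_int cur_pll;   /* current profile */
    atomic_int open_cnt;  /* number of open file descriptors */
    int hw_writes;        /* counts calls that would reach the H/W */
};

/* Returns 1 if the profile was changed, 0 otherwise. */
static int set_frequency(struct dev *d, enum pll freq)
{
    int expected = (freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;

    assert(freq == PLL_LOW || freq == PLL_HIGH);

    /* Swap only when the current value is the opposite profile. */
    if (!atomic_compare_exchange_strong(&d->cur_pll, &expected, freq))
        return 0;         /* already at the requested profile */

    /* A late open() may have raced with a request to go low: undo it. */
    if (freq == PLL_LOW && atomic_load(&d->open_cnt) > 0) {
        atomic_store(&d->cur_pll, PLL_HIGH);
        return 0;
    }

    d->hw_writes++;       /* stands in for set_pll_profile() */
    return 1;
}
```

With only two states, a failed swap can only mean the device is already at the requested profile, which is why the early return is safe.
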
> >  /**
> >   * hl_device_suspend - initiate device suspend
> >   *
> > @@ -386,6 +498,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >               goto release_ctx;
> >       }
> >
> > +     rc = hl_sysfs_init(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to initialize sysfs\n");
> > +             goto free_cb_pool;
> > +     }
> > +
> >       rc = hdev->asic_funcs->hw_init(hdev);
> >       if (rc) {
> >               dev_err(hdev->dev, "failed to initialize the H/W\n");
> > @@ -403,11 +521,33 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
> >               goto out_disabled;
> >       }
> >
> > +     /* After test_queues, KMD can start sending messages to device CPU */
> > +
> > +     rc = device_late_init(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed late initialization\n");
> > +             rc = 0;
>
> Isn't this an error?
Nope, same explanation as in the previous patches.
>
> > +             goto out_disabled;
> > +     }
> > +
> > +     dev_info(hdev->dev, "Found %s device with %lluGB DRAM\n",
> > +             hdev->asic_name,
> > +             hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
> > +
> > +     rc = hl_hwmon_init(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to initialize hwmon\n");
> > +             rc = 0;
>
> Ditto
>
ditto for the answer :)

> > +             goto out_disabled;
> > +     }
> > +
> >       dev_notice(hdev->dev,
> >               "Successfully added device to habanalabs driver\n");
> >
> >       return 0;
> >
> > +free_cb_pool:
> > +     hl_cb_pool_fini(hdev);
> >  release_ctx:
> >       if (hl_ctx_put(hdev->kernel_ctx) != 1)
> >               dev_err(hdev->dev,
> > @@ -457,6 +597,12 @@ void hl_device_fini(struct hl_device *hdev)
> >       /* Mark device as disabled */
> >       hdev->disabled = true;
> >
> > +     hl_hwmon_fini(hdev);
> > +
> > +     device_late_fini(hdev);
> > +
> > +     hl_sysfs_fini(hdev);
> > +
> >       /*
> >        * Halt the engines and disable interrupts so we won't get any more
> >        * completions from H/W and we won't have any accesses from the
> > diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
> > index a57096fa41b6..ada8518ec215 100644
> > --- a/drivers/misc/habanalabs/goya/Makefile
> > +++ b/drivers/misc/habanalabs/goya/Makefile
> > @@ -1,3 +1,3 @@
> >  subdir-ccflags-y += -I$(src)
> >
> > -HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
> > \ No newline at end of file
> > +HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o goya/goya_hwmgr.o
> > \ No newline at end of file
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > index 6c04277ae0fa..7899ff762e0b 100644
> > --- a/drivers/misc/habanalabs/goya/goya.c
> > +++ b/drivers/misc/habanalabs/goya/goya.c
> > @@ -127,6 +127,8 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
> >
> >  #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
> >
> > +static int goya_armcp_info_get(struct hl_device *hdev);
> > +
> >  static void goya_get_fixed_properties(struct hl_device *hdev)
> >  {
> >       struct asic_fixed_properties *prop = &hdev->asic_prop;
> > @@ -174,6 +176,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
> >       prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
> >       prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
> >       prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
> > +     prop->max_power_default = MAX_POWER_DEFAULT;
> >       prop->tpc_enabled_mask = TPC_ENABLED_MASK;
> >
> >       prop->high_pll = PLL_HIGH_DEFAULT;
> > @@ -558,6 +561,89 @@ int goya_early_fini(struct hl_device *hdev)
> >       return 0;
> >  }
> >
> > +/**
> > + * goya_fetch_psoc_frequency - Fetch PSOC frequency values
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + */
> > +static void goya_fetch_psoc_frequency(struct hl_device *hdev)
> > +{
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +
> > +     prop->psoc_pci_pll_nr = RREG32(mmPSOC_PCI_PLL_NR);
> > +     prop->psoc_pci_pll_nf = RREG32(mmPSOC_PCI_PLL_NF);
> > +     prop->psoc_pci_pll_od = RREG32(mmPSOC_PCI_PLL_OD);
> > +     prop->psoc_pci_pll_div_factor = RREG32(mmPSOC_PCI_PLL_DIV_FACTOR_1);
> > +}
> > +
> > +/**
> > + * goya_late_init - GOYA late initialization code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Get ArmCP info and send message to CPU to enable PCI access
> > + */
> > +static int goya_late_init(struct hl_device *hdev)
> > +{
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int rc;
> > +
> > +     rc = goya->armcp_info_get(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to get armcp info\n");
> > +             return rc;
> > +     }
> > +
> > +     /* Now that we have the DRAM size in ASIC prop, we need to check
> > +      * its size and configure the DMA_IF DDR wrap protection (which is in
> > +      * the MMU block) accordingly. The value is the log2 of the DRAM size
> > +      */
> > +     WREG32(mmMMU_LOG2_DDR_SIZE, ilog2(prop->dram_size));
> > +
> > +     rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
> > +             return rc;
> > +     }
> > +
> > +     WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
> > +                     GOYA_ASYNC_EVENT_ID_INTS_REGISTER);
> > +
> > +     goya_fetch_psoc_frequency(hdev);
> > +
> > +     return 0;
> > +}
> > +
> > +/**
> > + * goya_late_fini - GOYA late tear-down code
> > + *
> > + * @hdev: pointer to hl_device structure
> > + *
> > + * Free sensors allocated structures
> > + */
> > +void goya_late_fini(struct hl_device *hdev)
> > +{
> > +     const struct hwmon_channel_info **channel_info_arr;
> > +     int i = 0;
> > +
> > +     if (!hdev->hl_chip_info.info)
> > +             return;
> > +
> > +     channel_info_arr = hdev->hl_chip_info.info;
> > +
> > +     while (channel_info_arr[i]) {
> > +             kfree(channel_info_arr[i]->config);
> > +             kfree(channel_info_arr[i]);
> > +             i++;
> > +     }
> > +
> > +     kfree(channel_info_arr);
> > +
> > +     hdev->hl_chip_info.info = NULL;
> > +}
> > +
> >  /**
> >   * goya_sw_init - Goya software initialization code
> >   *
> > @@ -575,9 +661,15 @@ static int goya_sw_init(struct hl_device *hdev)
> >               return -ENOMEM;
> >
> >       goya->test_cpu_queue = goya_test_cpu_queue;
> > +     goya->armcp_info_get = goya_armcp_info_get;
> >
> >       /* according to goya_init_iatu */
> >       goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
> > +
> > +     goya->mme_clk = GOYA_PLL_FREQ_LOW;
> > +     goya->tpc_clk = GOYA_PLL_FREQ_LOW;
> > +     goya->ic_clk = GOYA_PLL_FREQ_LOW;
> > +
> >       hdev->asic_specific = goya;
> >
> >       /* Create DMA pool for small allocations */
> > @@ -4272,6 +4364,87 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
> >       return goya->events_stat;
> >  }
> >
> > +static int goya_armcp_info_get(struct hl_device *hdev)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +     struct armcp_packet pkt;
> > +     void *armcp_info_cpu_addr;
> > +     dma_addr_t armcp_info_dma_addr;
> > +     u64 dram_size;
> > +     long result;
> > +     int rc;
> > +
> > +     if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
> > +             return 0;
> > +
> > +     armcp_info_cpu_addr =
> > +                     hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
> > +                     sizeof(struct armcp_info), &armcp_info_dma_addr);
> > +     if (!armcp_info_cpu_addr) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to allocate DMA memory for ArmCP info packet\n");
> > +             return -ENOMEM;
> > +     }
> > +
> > +     memset(armcp_info_cpu_addr, 0, sizeof(struct armcp_info));
>
> Do you expect usage of cpu_accessible_dma_pool_alloc() without the need to
> clear the memory?
> If not memset(0) can be moved inside that function.
Yes. When we allocate a pkt from it, we just memcpy over the entire pkt,
so there is no need to memset it first.
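
To make the two usage patterns concrete, here is a minimal user-space
sketch. All names in it (pool_alloc(), struct demo_pkt, stage_packet(),
stage_reply_buffer()) are hypothetical stand-ins, not the driver's API,
and pool_alloc() only imitates an allocator that returns non-zeroed
memory. The reason given for zeroing the reply buffer (the writer may fill
fewer bytes than were allocated) is an assumption as well.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the pool allocator: returns memory that is
 * NOT zeroed (filled with 0xA5 here to simulate stale contents). */
static void *pool_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, 0xA5, size);
	return p;
}

/* Hypothetical packet layout, for illustration only */
struct demo_pkt {
	uint32_t opcode;
	uint32_t len;
	uint64_t addr;
};

/* Case 1: staging a packet to send. memcpy() overwrites every byte of
 * the buffer, so zeroing it first would be redundant work. */
static void *stage_packet(const struct demo_pkt *pkt)
{
	void *buf = pool_alloc(sizeof(*pkt));

	if (buf)
		memcpy(buf, pkt, sizeof(*pkt));
	return buf;
}

/* Case 2: a buffer that someone else fills in later. The writer may
 * fill fewer bytes than were allocated (assumption), so the caller
 * zeroes the buffer to avoid reading stale data afterwards. */
static void *stage_reply_buffer(size_t size)
{
	void *buf = pool_alloc(size);

	if (buf)
		memset(buf, 0, size);
	return buf;
}
```

Keeping the memset() in the caller means only the second case pays for
it, which matches the answer above.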
>
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_INFO_GET;
> > +     pkt.addr = armcp_info_dma_addr + prop->host_phys_base_address;
> > +     pkt.data_max_size = sizeof(struct armcp_info);
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                     GOYA_ARMCP_INFO_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to send armcp info pkt, error %d\n", rc);
> > +             goto out;
> > +     }
> > +
> > +     memcpy(&prop->armcp_info, armcp_info_cpu_addr,
> > +                     sizeof(prop->armcp_info));
> > +
> > +     dram_size = prop->armcp_info.dram_size;
> > +     if (dram_size) {
> > +             if ((!is_power_of_2(dram_size)) ||
> > +                             (dram_size < DRAM_PHYS_DEFAULT_SIZE)) {
> > +                     dev_err(hdev->dev,
> > +                             "F/W reported invalid DRAM size %llu. Trying to use default size\n",
> > +                             dram_size);
> > +                     dram_size = DRAM_PHYS_DEFAULT_SIZE;
> > +             }
> > +
> > +             prop->dram_size = dram_size;
> > +             prop->dram_end_address = prop->dram_base_address + dram_size;
> > +     }
> > +
> > +     rc = hl_build_hwmon_channel_info(hdev, prop->armcp_info.sensors);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to build hwmon channel info, error %d\n", rc);
> > +             rc = -EFAULT;
> > +             goto out;
> > +     }
> > +
> > +out:
> > +     hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev,
> > +                     sizeof(struct armcp_info), armcp_info_cpu_addr);
> > +
> > +     return rc;
> > +}
> > +
> > +static void goya_init_clock_gating(struct hl_device *hdev)
> > +{
> > +
> > +}
> > +
> > +static void goya_disable_clock_gating(struct hl_device *hdev)
> > +{
> > +
> > +}
> >
> >  static void goya_hw_queues_lock(struct hl_device *hdev)
> >  {
> > @@ -4287,9 +4460,60 @@ static void goya_hw_queues_unlock(struct hl_device *hdev)
> >       spin_unlock(&goya->hw_queues_lock);
> >  }
> >
> > +int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     struct asic_fixed_properties *prop = &hdev->asic_prop;
> > +     struct armcp_packet pkt;
> > +     void *eeprom_info_cpu_addr;
> > +     dma_addr_t eeprom_info_dma_addr;
> > +     long result;
> > +     int rc;
> > +
> > +     if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
> > +             return 0;
> > +
> > +     eeprom_info_cpu_addr =
> > +                     hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
> > +                                     max_size, &eeprom_info_dma_addr);
> > +     if (!eeprom_info_cpu_addr) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to allocate DMA memory for EEPROM info packet\n");
> > +             return -ENOMEM;
> > +     }
> > +
> > +     memset(eeprom_info_cpu_addr, 0, max_size);
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_EEPROM_DATA_GET;
> > +     pkt.addr = eeprom_info_dma_addr + prop->host_phys_base_address;
> > +     pkt.data_max_size = max_size;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                     GOYA_ARMCP_EEPROM_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to send armcp EEPROM pkt, error %d\n", rc);
> > +             goto out;
> > +     }
> > +
> > +     /* result contains the actual size */
> > +     memcpy(data, eeprom_info_cpu_addr, min((size_t)result, max_size));
> > +
> > +out:
> > +     hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, max_size,
> > +                     eeprom_info_cpu_addr);
> > +
> > +     return rc;
> > +}
> > +
> >  static const struct hl_asic_funcs goya_funcs = {
> >       .early_init = goya_early_init,
> >       .early_fini = goya_early_fini,
> > +     .late_init = goya_late_init,
> > +     .late_fini = goya_late_fini,
> >       .sw_init = goya_sw_init,
> >       .sw_fini = goya_sw_fini,
> >       .hw_init = goya_hw_init,
> > @@ -4310,10 +4534,16 @@ static const struct hl_asic_funcs goya_funcs = {
> >       .cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
> >       .cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
> >       .update_eq_ci = goya_update_eq_ci,
> > +     .add_device_attr = goya_add_device_attr,
> > +     .remove_device_attr = goya_remove_device_attr,
> >       .handle_eqe = goya_handle_eqe,
> > +     .set_pll_profile = goya_set_pll_profile,
> >       .get_events_stat = goya_get_events_stat,
> > +     .enable_clock_gating = goya_init_clock_gating,
> > +     .disable_clock_gating = goya_disable_clock_gating,
> >       .hw_queues_lock = goya_hw_queues_lock,
> >       .hw_queues_unlock = goya_hw_queues_unlock,
> > +     .get_eeprom_data = goya_get_eeprom_data,
> >       .send_cpu_message = goya_send_cpu_message
> >  };
> >
> > diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
> > index c6bfcb6c6905..42e8b1baef2f 100644
> > --- a/drivers/misc/habanalabs/goya/goyaP.h
> > +++ b/drivers/misc/habanalabs/goya/goyaP.h
> > @@ -48,7 +48,10 @@
> >
> >  #define PLL_HIGH_DEFAULT             1575000000      /* 1.575 GHz */
> >
> > +#define MAX_POWER_DEFAULT            200000          /* 200W */
> > +
> >  #define GOYA_ARMCP_INFO_TIMEOUT              10000000        /* 10s */
> > +#define GOYA_ARMCP_EEPROM_TIMEOUT    10000000        /* 10s */
> >
> >  #define DRAM_PHYS_DEFAULT_SIZE               0x100000000ull  /* 4GB */
> >
> > @@ -119,9 +122,15 @@ enum goya_fw_component {
> >
> >  struct goya_device {
> >       int (*test_cpu_queue)(struct hl_device *hdev);
> > +     int (*armcp_info_get)(struct hl_device *hdev);
> >
> >       /* TODO: remove hw_queues_lock after moving to scheduler code */
> >       spinlock_t      hw_queues_lock;
> > +
> > +     u64             mme_clk;
> > +     u64             tpc_clk;
> > +     u64             ic_clk;
> > +
> >       u64             ddr_bar_cur_addr;
> >       u32             events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
> >       u32             hw_cap_initialized;
> > @@ -130,6 +139,18 @@ struct goya_device {
> >  int goya_test_cpu_queue(struct hl_device *hdev);
> >  int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
> >                               u32 timeout, long *result);
> > +long goya_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long goya_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long goya_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long goya_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long goya_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
> > +void goya_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
> > +                     long value);
> > +void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq);
> > +int goya_add_device_attr(struct hl_device *hdev);
> > +void goya_remove_device_attr(struct hl_device *hdev);
> >  void goya_init_security(struct hl_device *hdev);
> > +u64 goya_get_max_power(struct hl_device *hdev);
> > +void goya_set_max_power(struct hl_device *hdev, u64 value);
> >
> >  #endif /* GOYAP_H_ */
> > diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> > new file mode 100644
> > index 000000000000..866d1774b2e4
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> > @@ -0,0 +1,306 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "goyaP.h"
> > +
> > +void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq)
> > +{
> > +     struct goya_device *goya = hdev->asic_specific;
> > +
> > +     switch (freq) {
> > +     case PLL_HIGH:
> > +             hl_set_frequency(hdev, MME_PLL, hdev->high_pll);
> > +             hl_set_frequency(hdev, TPC_PLL, hdev->high_pll);
> > +             hl_set_frequency(hdev, IC_PLL, hdev->high_pll);
> > +             break;
> > +     case PLL_LOW:
> > +             hl_set_frequency(hdev, MME_PLL, GOYA_PLL_FREQ_LOW);
> > +             hl_set_frequency(hdev, TPC_PLL, GOYA_PLL_FREQ_LOW);
> > +             hl_set_frequency(hdev, IC_PLL, GOYA_PLL_FREQ_LOW);
> > +             break;
> > +     case PLL_LAST:
> > +             hl_set_frequency(hdev, MME_PLL, goya->mme_clk);
> > +             hl_set_frequency(hdev, TPC_PLL, goya->tpc_clk);
> > +             hl_set_frequency(hdev, IC_PLL, goya->ic_clk);
> > +             break;
> > +     default:
> > +             dev_err(hdev->dev, "unknown frequency setting\n");
> > +     }
> > +}
> > +
> > +static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     long value;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     value = hl_get_frequency(hdev, MME_PLL, false);
> > +
> > +     if (value < 0)
> > +             return value;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> > +}
> > +
> > +static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
> > +                             const char *buf, size_t count)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int rc;
> > +     long value;
> > +
> > +     if (hdev->disabled) {
> > +             count = -ENODEV;
> > +             goto fail;
> > +     }
> > +
> > +     if (hdev->pm_mng_profile == PM_AUTO) {
> > +             count = -EPERM;
> > +             goto fail;
> > +     }
> > +
> > +     rc = kstrtoul(buf, 0, &value);
> > +
> > +     if (rc) {
> > +             count = -EINVAL;
> > +             goto fail;
> > +     }
> > +
> > +     hl_set_frequency(hdev, MME_PLL, value);
> > +     goya->mme_clk = value;
> > +
> > +fail:
> > +     return count;
> > +}
> > +
> > +static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     long value;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     value = hl_get_frequency(hdev, TPC_PLL, false);
> > +
> > +     if (value < 0)
> > +             return value;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> > +}
> > +
> > +static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
> > +                             const char *buf, size_t count)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int rc;
> > +     long value;
> > +
> > +     if (hdev->disabled) {
> > +             count = -ENODEV;
> > +             goto fail;
> > +     }
> > +
> > +     if (hdev->pm_mng_profile == PM_AUTO) {
> > +             count = -EPERM;
> > +             goto fail;
> > +     }
> > +
> > +     rc = kstrtoul(buf, 0, &value);
> > +
> > +     if (rc) {
> > +             count = -EINVAL;
> > +             goto fail;
> > +     }
> > +
> > +     hl_set_frequency(hdev, TPC_PLL, value);
> > +     goya->tpc_clk = value;
> > +
> > +fail:
> > +     return count;
> > +}
> > +
> > +static ssize_t ic_clk_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     long value;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     value = hl_get_frequency(hdev, IC_PLL, false);
> > +
> > +     if (value < 0)
> > +             return value;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> > +}
> > +
> > +static ssize_t ic_clk_store(struct device *dev, struct device_attribute *attr,
> > +                             const char *buf, size_t count)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     struct goya_device *goya = hdev->asic_specific;
> > +     int rc;
> > +     long value;
> > +
> > +     if (hdev->disabled) {
> > +             count = -ENODEV;
> > +             goto fail;
> > +     }
> > +
> > +     if (hdev->pm_mng_profile == PM_AUTO) {
> > +             count = -EPERM;
> > +             goto fail;
> > +     }
> > +
> > +     rc = kstrtoul(buf, 0, &value);
> > +
> > +     if (rc) {
> > +             count = -EINVAL;
> > +             goto fail;
> > +     }
> > +
> > +     hl_set_frequency(hdev, IC_PLL, value);
> > +     goya->ic_clk = value;
> > +
> > +fail:
> > +     return count;
> > +}
> > +
> > +static ssize_t mme_clk_curr_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     long value;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     value = hl_get_frequency(hdev, MME_PLL, true);
> > +
> > +     if (value < 0)
> > +             return value;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> > +}
> > +
> > +static ssize_t tpc_clk_curr_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     long value;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     value = hl_get_frequency(hdev, TPC_PLL, true);
> > +
> > +     if (value < 0)
> > +             return value;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> > +}
> > +
> > +static ssize_t ic_clk_curr_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     long value;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     value = hl_get_frequency(hdev, IC_PLL, true);
> > +
> > +     if (value < 0)
> > +             return value;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", value);
> > +}
> > +
> > +static DEVICE_ATTR_RW(mme_clk);
> > +static DEVICE_ATTR_RW(tpc_clk);
> > +static DEVICE_ATTR_RW(ic_clk);
> > +static DEVICE_ATTR_RO(mme_clk_curr);
> > +static DEVICE_ATTR_RO(tpc_clk_curr);
> > +static DEVICE_ATTR_RO(ic_clk_curr);
> > +
> > +int goya_add_device_attr(struct hl_device *hdev)
> > +{
> > +     int rc;
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_mme_clk);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file mme_clk\n");
> > +             return rc;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_tpc_clk);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file tpc_clk\n");
> > +             goto remove_mme_clk;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_ic_clk);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file ic_clk\n");
> > +             goto remove_tpc_clk;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_mme_clk_curr);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file mme_clk_curr\n");
> > +             goto remove_ic_clk;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_tpc_clk_curr);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file tpc_clk_curr\n");
> > +             goto remove_mme_clk_curr;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_ic_clk_curr);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file ic_clk_curr\n");
> > +             goto remove_tpc_clk_curr;
> > +     }
> > +
> > +     return 0;
> > +
> > +remove_tpc_clk_curr:
> > +     device_remove_file(hdev->dev, &dev_attr_tpc_clk_curr);
> > +remove_mme_clk_curr:
> > +     device_remove_file(hdev->dev, &dev_attr_mme_clk_curr);
> > +remove_ic_clk:
> > +     device_remove_file(hdev->dev, &dev_attr_ic_clk);
> > +remove_tpc_clk:
> > +     device_remove_file(hdev->dev, &dev_attr_tpc_clk);
> > +remove_mme_clk:
> > +     device_remove_file(hdev->dev, &dev_attr_mme_clk);
> > +     return rc;
> > +}
> > +
> > +void goya_remove_device_attr(struct hl_device *hdev)
> > +{
> > +     device_remove_file(hdev->dev, &dev_attr_ic_clk_curr);
> > +     device_remove_file(hdev->dev, &dev_attr_tpc_clk_curr);
> > +     device_remove_file(hdev->dev, &dev_attr_mme_clk_curr);
> > +     device_remove_file(hdev->dev, &dev_attr_ic_clk);
> > +     device_remove_file(hdev->dev, &dev_attr_tpc_clk);
> > +     device_remove_file(hdev->dev, &dev_attr_mme_clk);
> > +}
> > diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> > index 899bf98eb002..49b84b3ff864 100644
> > --- a/drivers/misc/habanalabs/habanalabs.h
> > +++ b/drivers/misc/habanalabs/habanalabs.h
> > @@ -25,6 +25,8 @@
> >
> >  #define HL_DEVICE_TIMEOUT_USEC               1000000 /* 1 s */
> >
> > +#define HL_PLL_LOW_JOB_FREQ_USEC     5000000 /* 5 s */
> > +
> >  #define HL_MAX_QUEUES                        128
> >
> >  struct hl_device;
> > @@ -60,6 +62,8 @@ struct hw_queue_properties {
> >  /**
> >   * struct asic_fixed_properties - ASIC specific immutable properties.
> >   * @hw_queues_props: H/W queues properties.
> > + * @armcp_info: various information received from ArmCP regarding the H/W, e.g.
> > + *           available sensors.
> >   * @uboot_ver: F/W U-boot version.
> >   * @preboot_ver: F/W Preboot version.
> >   * @sram_base_address: SRAM physical start address.
> > @@ -72,6 +76,7 @@ struct hw_queue_properties {
> >   * @dram_pci_bar_size: size of PCI bar towards DRAM.
> >   * @host_phys_base_address: base physical address of host memory for
> >   *                           transactions that the device generates.
> > + * @max_power_default: max power of the device after reset
> >   * @va_space_host_start_address: base address of virtual memory range for
> >   *                               mapping host memory.
> >   * @va_space_host_end_address: end address of virtual memory range for
> > @@ -84,6 +89,10 @@ struct hw_queue_properties {
> >   * @sram_size: total size of SRAM.
> >   * @max_asid: maximum number of open contexts (ASIDs).
> >   * @num_of_events: number of possible internal H/W IRQs.
> > + * @psoc_pci_pll_nr: PCI PLL NR value.
> > + * @psoc_pci_pll_nf: PCI PLL NF value.
> > + * @psoc_pci_pll_od: PCI PLL OD value.
> > + * @psoc_pci_pll_div_factor: PCI PLL DIV FACTOR 1 value.
> >   * @completion_queues_count: number of completion queues.
> >   * @high_pll: high PLL frequency used by the device.
> >   * @cb_pool_cb_cnt: number of CBs in the CB pool.
> > @@ -92,6 +101,7 @@ struct hw_queue_properties {
> >   */
> >  struct asic_fixed_properties {
> >       struct hw_queue_properties      hw_queues_props[HL_MAX_QUEUES];
> > +     struct armcp_info       armcp_info;
> >       char                    uboot_ver[VERSION_MAX_LEN];
> >       char                    preboot_ver[VERSION_MAX_LEN];
> >       u64                     sram_base_address;
> > @@ -103,6 +113,7 @@ struct asic_fixed_properties {
> >       u64                     dram_size;
> >       u64                     dram_pci_bar_size;
> >       u64                     host_phys_base_address;
> > +     u64                     max_power_default;
> >       u64                     va_space_host_start_address;
> >       u64                     va_space_host_end_address;
> >       u64                     va_space_dram_start_address;
> > @@ -111,6 +122,10 @@ struct asic_fixed_properties {
> >       u32                     sram_size;
> >       u32                     max_asid;
> >       u32                     num_of_events;
> > +     u32                     psoc_pci_pll_nr;
> > +     u32                     psoc_pci_pll_nf;
> > +     u32                     psoc_pci_pll_od;
> > +     u32                     psoc_pci_pll_div_factor;
> >       u32                     high_pll;
> >       u32                     cb_pool_cb_cnt;
> >       u32                     cb_pool_cb_size;
> > @@ -296,13 +311,37 @@ enum hl_asic_type {
> >  };
> >
> >
> > +/**
> > + * enum hl_pm_mng_profile - power management profile.
> > + * @PM_AUTO: internal clock is set by KMD.
> > + * @PM_MANUAL: internal clock is set by the user.
> > + * @PM_LAST: last power management type.
> > + */
> > +enum hl_pm_mng_profile {
> > +     PM_AUTO = 1,
> > +     PM_MANUAL,
> > +     PM_LAST
> > +};
> >
> > +/**
> > + * enum hl_pll_frequency - PLL frequency.
> > + * @PLL_HIGH: high frequency.
> > + * @PLL_LOW: low frequency.
> > + * @PLL_LAST: last frequency values that were configured by the user.
> > + */
> > +enum hl_pll_frequency {
> > +     PLL_HIGH = 1,
> > +     PLL_LOW,
> > +     PLL_LAST
> > +};
> >
> >  /**
> >   * struct hl_asic_funcs - ASIC specific functions that are can be called from
> >   *                        common code.
> >   * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
> >   * @early_fini: tears down what was done in early_init.
> > + * @late_init: sets up late driver/hw state (post hw_init) - Optional.
> > + * @late_fini: tears down what was done in late_init (pre hw_fini) - Optional.
> >   * @sw_init: sets up driver state, does not configure H/W.
> >   * @sw_fini: tears down driver state, does not configure H/W.
> >   * @hw_init: sets up the H/W state.
> > @@ -326,15 +365,23 @@ enum hl_asic_type {
> >   * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
> >   * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
> >   * @update_eq_ci: update event queue CI.
> > + * @add_device_attr: add ASIC specific device attributes.
> > + * @remove_device_attr: remove ASIC specific device attributes.
> >   * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
> > + * @set_pll_profile: change PLL profile (manual/automatic).
> >   * @get_events_stat: retrieve event queue entries histogram.
> > + * @enable_clock_gating: enable clock gating for reducing power consumption.
> > + * @disable_clock_gating: disable clock gating for accessing registers on HBW.
> >   * @hw_queues_lock: acquire H/W queues lock.
> >   * @hw_queues_unlock: release H/W queues lock.
> > + * @get_eeprom_data: retrieve EEPROM data from F/W.
> >   * @send_cpu_message: send buffer to ArmCP.
> >   */
> >  struct hl_asic_funcs {
> >       int (*early_init)(struct hl_device *hdev);
> >       int (*early_fini)(struct hl_device *hdev);
> > +     int (*late_init)(struct hl_device *hdev);
> > +     void (*late_fini)(struct hl_device *hdev);
> >       int (*sw_init)(struct hl_device *hdev);
> >       int (*sw_fini)(struct hl_device *hdev);
> >       int (*hw_init)(struct hl_device *hdev);
> > @@ -363,11 +410,19 @@ struct hl_asic_funcs {
> >       void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
> >                               size_t size, void *vaddr);
> >       void (*update_eq_ci)(struct hl_device *hdev, u32 val);
> > +     int (*add_device_attr)(struct hl_device *hdev);
> > +     void (*remove_device_attr)(struct hl_device *hdev);
> >       void (*handle_eqe)(struct hl_device *hdev,
> >                               struct hl_eq_entry *eq_entry);
> > +     void (*set_pll_profile)(struct hl_device *hdev,
> > +                     enum hl_pll_frequency freq);
> >       void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
> > +     void (*enable_clock_gating)(struct hl_device *hdev);
> > +     void (*disable_clock_gating)(struct hl_device *hdev);
> >       void (*hw_queues_lock)(struct hl_device *hdev);
> >       void (*hw_queues_unlock)(struct hl_device *hdev);
> > +     int (*get_eeprom_data)(struct hl_device *hdev, void *data,
> > +                             size_t max_size);
> >       int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
> >                               u16 len, u32 timeout, long *result);
> >  };
> > @@ -496,6 +551,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @rmmio: configuration area address on SRAM.
> >   * @cdev: related char device.
> >   * @dev: realted kernel basic device structure.
> > + * @work_freq: delayed work to lower device frequency if possible.
> >   * @asic_name: ASIC specific nmae.
> >   * @asic_type: ASIC specific type.
> >   * @completion_queue: array of hl_cq.
> > @@ -517,13 +573,23 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
> >   * @asic_prop: ASIC specific immutable properties.
> >   * @asic_funcs: ASIC specific functions.
> >   * @asic_specific: ASIC specific information to use only from ASIC files.
> > + * @hwmon_dev: H/W monitor device.
> > + * @pm_mng_profile: current power management profile.
> > + * @hl_chip_info: ASIC's sensors information.
> >   * @cb_pool: list of preallocated CBs.
> >   * @cb_pool_lock: protects the CB pool.
> >   * @user_ctx: current user context executing.
> > + * @curr_pll_profile: current PLL profile.
> >   * @fd_open_cnt: number of open contexts executing.
> > + * @max_power: the max power of the device, as configured by the sysadmin. This
> > + *             value is saved so in case of hard-reset, KMD will restore this
> > + *             value and update the F/W after the re-initialization
> >   * @major: habanalabs KMD major.
> > + * @high_pll: high PLL profile frequency.
> >   * @id: device minor.
> >   * @disabled: is device disabled.
> > + * @late_init_done: whether the late init stage was done during initialization.
> > + * @hwmon_initialized: whether the H/W monitor sensors were initialized.
> >   */
> >  struct hl_device {
> >       struct pci_dev                  *pdev;
> > @@ -531,6 +597,7 @@ struct hl_device {
> >       void __iomem                    *rmmio;
> >       struct cdev                     cdev;
> >       struct device                   *dev;
> > +     struct delayed_work             work_freq;
> >       char                            asic_name[16];
> >       enum hl_asic_type               asic_type;
> >       struct hl_cq                    *completion_queue;
> > @@ -553,16 +620,25 @@ struct hl_device {
> >       struct asic_fixed_properties    asic_prop;
> >       const struct hl_asic_funcs      *asic_funcs;
> >       void                            *asic_specific;
> > +     struct device                   *hwmon_dev;
> > +     enum hl_pm_mng_profile          pm_mng_profile;
> > +     struct hwmon_chip_info          hl_chip_info;
> >
> >       struct list_head                cb_pool;
> >       spinlock_t                      cb_pool_lock;
> >
> >       /* TODO: The following fields should be moved for multi-context */
> >       struct hl_ctx                   *user_ctx;
> > +
> > +     atomic_t                        curr_pll_profile;
> >       atomic_t                        fd_open_cnt;
> > +     u64                             max_power;
> >       u32                             major;
> > +     u32                             high_pll;
> >       u16                             id;
> >       u8                              disabled;
> > +     u8                              late_init_done;
> > +     u8                              hwmon_initialized;
> >
> >       /* Parameters for bring-up */
> >       u8                              cpu_enable;
> > @@ -647,6 +723,15 @@ int hl_device_suspend(struct hl_device *hdev);
> >  int hl_device_resume(struct hl_device *hdev);
> >  void hl_hpriv_get(struct hl_fpriv *hpriv);
> >  void hl_hpriv_put(struct hl_fpriv *hpriv);
> > +int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
> > +int hl_build_hwmon_channel_info(struct hl_device *hdev,
> > +             struct armcp_sensor *sensors_arr);
> > +
> > +int hl_sysfs_init(struct hl_device *hdev);
> > +void hl_sysfs_fini(struct hl_device *hdev);
> > +
> > +int hl_hwmon_init(struct hl_device *hdev);
> > +void hl_hwmon_fini(struct hl_device *hdev);
> >
> >  int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
> >               u64 *handle, int ctx_id);
> > @@ -663,6 +748,18 @@ int hl_cb_pool_fini(struct hl_device *hdev);
> >
> >  void goya_set_asic_funcs(struct hl_device *hdev);
> >
> > +long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
> > +void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
> > +long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
> > +long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
> > +void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
> > +                     long value);
> > +u64 hl_get_max_power(struct hl_device *hdev);
> > +void hl_set_max_power(struct hl_device *hdev, u64 value);
> > +
> >  /* IOCTLs */
> >  long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
> >  int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
> > diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> > index b64f58ad0f5d..47a9ab458b43 100644
> > --- a/drivers/misc/habanalabs/habanalabs_drv.c
> > +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> > @@ -134,6 +134,13 @@ int hl_device_open(struct inode *inode, struct file *filp)
> >
> >       hpriv->taskpid = find_get_pid(current->pid);
> >
> > +     /*
> > +      * Device is IDLE at this point so it is legal to change PLLs. There
> > +      * is no need to check anything because if the PLL is already HIGH, the
> > +      * set function will return without doing anything
> > +      */
> > +     hl_device_set_frequency(hdev, PLL_HIGH);
> > +
> >       return 0;
> >
> >  out_err:
> > diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
> > new file mode 100644
> > index 000000000000..6ca0decb7490
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/hwmon.c
> > @@ -0,0 +1,449 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +
> > +#define SENSORS_PKT_TIMEOUT          100000  /* 100ms */
> > +#define HWMON_NR_SENSOR_TYPES                (hwmon_pwm + 1)
> > +
> > +int hl_build_hwmon_channel_info(struct hl_device *hdev,
> > +                             struct armcp_sensor *sensors_arr)
> > +{
> > +     u32 counts[HWMON_NR_SENSOR_TYPES] = {0};
> > +     u32 *sensors_by_type[HWMON_NR_SENSOR_TYPES] = {0};
> > +     u32 sensors_by_type_next_index[HWMON_NR_SENSOR_TYPES] = {0};
> > +     struct hwmon_channel_info **channels_info;
> > +     u32 num_sensors_for_type, num_active_sensor_types = 0,
> > +                     arr_size = 0, *curr_arr;
> > +     enum hwmon_sensor_types type;
> > +     int rc, i, j;
> > +
> > +     for (i = 0 ; i < ARMCP_MAX_SENSORS ; i++) {
> > +             type = sensors_arr[i].type;
> > +
> > +             if ((type == 0) && (sensors_arr[i].flags == 0))
> > +                     break;
> > +
> > +             if (type >= HWMON_NR_SENSOR_TYPES) {
> > +                     dev_err(hdev->dev,
> > +                             "Got wrong sensor type %d from device\n", type);
> > +                     return -EINVAL;
> > +             }
> > +
> > +             counts[type]++;
> > +             arr_size++;
> > +     }
> > +
> > +     for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
> > +             if (counts[i] == 0)
> > +                     continue;
> > +
> > +             num_sensors_for_type = counts[i] + 1;
> > +             curr_arr = kcalloc(num_sensors_for_type, sizeof(*curr_arr),
> > +                             GFP_KERNEL);
> > +             if (!curr_arr) {
> > +                     rc = -ENOMEM;
> > +                     goto sensors_type_err;
> > +             }
> > +
> > +             num_active_sensor_types++;
> > +             sensors_by_type[i] = curr_arr;
> > +     }
> > +
> > +     for (i = 0 ; i < arr_size ; i++) {
> > +             type = sensors_arr[i].type;
> > +             curr_arr = sensors_by_type[type];
> > +             curr_arr[sensors_by_type_next_index[type]++] =
> > +                             sensors_arr[i].flags;
> > +     }
> > +
> > +     channels_info = kcalloc(num_active_sensor_types + 1,
> > +                     sizeof(*channels_info), GFP_KERNEL);
> > +     if (!channels_info) {
> > +             rc = -ENOMEM;
> > +             goto channels_info_array_err;
> > +     }
> > +
> > +     for (i = 0 ; i < num_active_sensor_types ; i++) {
> > +             channels_info[i] = kzalloc(sizeof(*channels_info[i]),
> > +                             GFP_KERNEL);
> > +             if (!channels_info[i]) {
> > +                     rc = -ENOMEM;
> > +                     goto channel_info_err;
> > +             }
> > +     }
> > +
> > +     for (i = 0, j = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
> > +             if (!sensors_by_type[i])
> > +                     continue;
> > +
> > +             channels_info[j]->type = i;
> > +             channels_info[j]->config = sensors_by_type[i];
> > +             j++;
> > +     }
> > +
> > +     hdev->hl_chip_info.info =
> > +                     (const struct hwmon_channel_info **)channels_info;
> > +
> > +     return 0;
> > +
> > +channel_info_err:
> > +     for (i = 0 ; i < num_active_sensor_types ; i++)
> > +             if (channels_info[i]) {
> > +                     kfree(channels_info[i]->config);
> > +                     kfree(channels_info[i]);
> > +             }
> > +     kfree(channels_info);
> > +channels_info_array_err:
> > +sensors_type_err:
> > +     for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++)
> > +             kfree(sensors_by_type[i]);
> > +
> > +     return rc;
> > +}
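
For readers following hl_build_hwmon_channel_info() above: it makes two
passes over the sensor array. The first counts entries per sensor type, the
second fills one zero-terminated config array per type. Below is a minimal
userspace sketch of that count-then-bucket idea, not the driver code.
`struct sensor`, `NR_TYPES` and `group_by_type()` are hypothetical stand-ins
for `struct armcp_sensor`, `HWMON_NR_SENSOR_TYPES` and the kernel function,
and `calloc()` stands in for `kcalloc()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_TYPES 4 /* hypothetical; stands in for HWMON_NR_SENSOR_TYPES */

struct sensor { /* hypothetical stand-in for struct armcp_sensor */
	uint32_t type;
	uint32_t flags;
};

/*
 * Group flags by sensor type. On success returns 0 and fills out[] with one
 * zero-terminated array per type that has sensors (NULL for unused types).
 * Returns -1 on a bad type or allocation failure, leaving out[] all NULL.
 */
int group_by_type(const struct sensor *arr, size_t n, uint32_t *out[NR_TYPES])
{
	size_t counts[NR_TYPES] = {0};
	size_t next[NR_TYPES] = {0};
	size_t i;

	assert(arr || n == 0);

	for (i = 0; i < NR_TYPES; i++)
		out[i] = NULL;

	/* Pass 1: count entries per type, rejecting out-of-range types. */
	for (i = 0; i < n; i++) {
		if (arr[i].type >= NR_TYPES)
			return -1;
		counts[arr[i].type]++;
	}

	/* Allocate count + 1 slots so each array ends with a 0 terminator. */
	for (i = 0; i < NR_TYPES; i++) {
		if (!counts[i])
			continue;
		out[i] = calloc(counts[i] + 1, sizeof(*out[i]));
		if (!out[i])
			goto err;
	}

	/* Pass 2: append each sensor's flags to the array for its type. */
	for (i = 0; i < n; i++)
		out[arr[i].type][next[arr[i].type]++] = arr[i].flags;

	return 0;

err:
	for (i = 0; i < NR_TYPES; i++) {
		free(out[i]);
		out[i] = NULL;
	}
	return -1;
}
```

The single error label frees every bucket, which works because free(NULL)
is a no-op, the same property the patch relies on with kfree().
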
> > +
> > +static int hl_read(struct device *dev, enum hwmon_sensor_types type,
> > +                     u32 attr, int channel, long *val)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     switch (type) {
> > +     case hwmon_temp:
> > +             switch (attr) {
> > +             case hwmon_temp_input:
> > +             case hwmon_temp_max:
> > +             case hwmon_temp_crit:
> > +             case hwmon_temp_max_hyst:
> > +             case hwmon_temp_crit_hyst:
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +
> > +             *val = hl_get_temperature(hdev, channel, attr);
> > +             break;
> > +     case hwmon_in:
> > +             switch (attr) {
> > +             case hwmon_in_input:
> > +             case hwmon_in_min:
> > +             case hwmon_in_max:
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +
> > +             *val = hl_get_voltage(hdev, channel, attr);
> > +             break;
> > +     case hwmon_curr:
> > +             switch (attr) {
> > +             case hwmon_curr_input:
> > +             case hwmon_curr_min:
> > +             case hwmon_curr_max:
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +
> > +             *val = hl_get_current(hdev, channel, attr);
> > +             break;
> > +     case hwmon_fan:
> > +             switch (attr) {
> > +             case hwmon_fan_input:
> > +             case hwmon_fan_min:
> > +             case hwmon_fan_max:
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +             *val = hl_get_fan_speed(hdev, channel, attr);
> > +             break;
> > +     case hwmon_pwm:
> > +             switch (attr) {
> > +             case hwmon_pwm_input:
> > +             case hwmon_pwm_enable:
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +             *val = hl_get_pwm_info(hdev, channel, attr);
> > +             break;
> > +     default:
> > +             return -EINVAL;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static int hl_write(struct device *dev, enum hwmon_sensor_types type,
> > +                     u32 attr, int channel, long val)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     switch (type) {
> > +     case hwmon_pwm:
> > +             switch (attr) {
> > +             case hwmon_pwm_input:
> > +             case hwmon_pwm_enable:
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +             hl_set_pwm_info(hdev, channel, attr, val);
> > +             break;
> > +     default:
> > +             return -EINVAL;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static umode_t hl_is_visible(const void *data, enum hwmon_sensor_types type,
> > +                             u32 attr, int channel)
> > +{
> > +     switch (type) {
> > +     case hwmon_temp:
> > +             switch (attr) {
> > +             case hwmon_temp_input:
> > +             case hwmon_temp_max:
> > +             case hwmon_temp_max_hyst:
> > +             case hwmon_temp_crit:
> > +             case hwmon_temp_crit_hyst:
> > +                     return 0444;
> > +             }
> > +             break;
> > +     case hwmon_in:
> > +             switch (attr) {
> > +             case hwmon_in_input:
> > +             case hwmon_in_min:
> > +             case hwmon_in_max:
> > +                     return 0444;
> > +             }
> > +             break;
> > +     case hwmon_curr:
> > +             switch (attr) {
> > +             case hwmon_curr_input:
> > +             case hwmon_curr_min:
> > +             case hwmon_curr_max:
> > +                     return 0444;
> > +             }
> > +             break;
> > +     case hwmon_fan:
> > +             switch (attr) {
> > +             case hwmon_fan_input:
> > +             case hwmon_fan_min:
> > +             case hwmon_fan_max:
> > +                     return 0444;
> > +             }
> > +             break;
> > +     case hwmon_pwm:
> > +             switch (attr) {
> > +             case hwmon_pwm_input:
> > +             case hwmon_pwm_enable:
> > +                     return 0644;
> > +             }
> > +             break;
> > +     default:
> > +             break;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static const struct hwmon_ops hl_hwmon_ops = {
> > +     .is_visible = hl_is_visible,
> > +     .read = hl_read,
> > +     .write = hl_write
> > +};
> > +
> > +long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_TEMPERATURE_GET;
> > +     pkt.sensor_index = sensor_index;
> > +     pkt.type = attr;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                     SENSORS_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get temperature from sensor %d, error %d\n",
> > +                     sensor_index, rc);
> > +             result = 0;
> > +     }
> > +
> > +     return result;
> > +}
> > +
> > +long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_VOLTAGE_GET;
> > +     pkt.sensor_index = sensor_index;
> > +     pkt.type = attr;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SENSORS_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get voltage from sensor %d, error %d\n",
> > +                     sensor_index, rc);
> > +             result = 0;
> > +     }
> > +
> > +     return result;
> > +}
> > +
> > +long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_CURRENT_GET;
> > +     pkt.sensor_index = sensor_index;
> > +     pkt.type = attr;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SENSORS_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get current from sensor %d, error %d\n",
> > +                     sensor_index, rc);
> > +             result = 0;
> > +     }
> > +
> > +     return result;
> > +}
> > +
> > +long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_FAN_SPEED_GET;
> > +     pkt.sensor_index = sensor_index;
> > +     pkt.type = attr;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SENSORS_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get fan speed from sensor %d, error %d\n",
> > +                     sensor_index, rc);
> > +             result = 0;
> > +     }
> > +
> > +     return result;
> > +}
> > +
> > +long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_PWM_GET;
> > +     pkt.sensor_index = sensor_index;
> > +     pkt.type = attr;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SENSORS_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get pwm info from sensor %d, error %d\n",
> > +                     sensor_index, rc);
> > +             result = 0;
> > +     }
> > +
> > +     return result;
> > +}
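
The five getters above return 0 when send_cpu_message() fails, and hl_read()
then stores that 0 in *val and reports success, so userspace cannot tell a
failed read from a genuine reading of 0. One possible alternative, sketched
in userspace C under stated assumptions, is to return the status and pass
the reading through an out-parameter. `send_fn`, `read_sensor()` and the two
`fake_send_*` functions are hypothetical names for illustration only; they
are not part of the driver.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical transport: returns 0 and fills *result, or a negative errno. */
typedef int (*send_fn)(int sensor_index, long *result);

/* Stand-in that always succeeds and reports a genuine reading of 0. */
int fake_send_ok(int sensor_index, long *result)
{
	(void)sensor_index;
	*result = 0;
	return 0;
}

/* Stand-in that always times out and leaves *result untouched. */
int fake_send_timeout(int sensor_index, long *result)
{
	(void)sensor_index;
	(void)result;
	return -ETIMEDOUT;
}

/*
 * Read one sensor. The status is the return value and the reading goes
 * through *val, so a failed read cannot be mistaken for a reading of 0.
 * On failure *val is left unchanged.
 */
int read_sensor(send_fn send, int sensor_index, long *val)
{
	long result = 0;
	int rc;

	assert(send && val);

	rc = send(sensor_index, &result);
	if (rc)
		return rc;

	*val = result;
	return 0;
}
```

With this shape, a hwmon read callback could hand the error code straight
back to its caller instead of always returning 0.
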
> > +
> > +void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
> > +                     long value)
> > +{
> > +     struct armcp_packet pkt;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_PWM_SET;
> > +     pkt.sensor_index = sensor_index;
> > +     pkt.type = attr;
> > +     pkt.value = value;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SENSORS_PKT_TIMEOUT, NULL);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev,
> > +                     "Failed to set pwm info to sensor %d, error %d\n",
> > +                     sensor_index, rc);
> > +}
> > +
> > +int hl_hwmon_init(struct hl_device *hdev)
> > +{
> > +     struct device *dev = hdev->pdev ? &hdev->pdev->dev : hdev->dev;
> > +     int rc;
> > +
> > +     if ((hdev->hwmon_initialized) || !(hdev->fw_loading))
> > +             return 0;
> > +
> > +     if (hdev->hl_chip_info.info) {
> > +             hdev->hl_chip_info.ops = &hl_hwmon_ops;
> > +
> > +             hdev->hwmon_dev = hwmon_device_register_with_info(dev,
> > +                             "habanalabs", hdev, &hdev->hl_chip_info, NULL);
> > +             if (IS_ERR(hdev->hwmon_dev)) {
> > +                     rc = PTR_ERR(hdev->hwmon_dev);
> > +                     dev_err(hdev->dev,
> > +                             "Unable to register hwmon device: %d\n", rc);
> > +                     return rc;
> > +             }
> > +
> > +             dev_info(hdev->dev, "%s: add sensors information\n",
> > +                     dev_name(hdev->hwmon_dev));
> > +
> > +             hdev->hwmon_initialized = true;
> > +     } else {
> > +             dev_info(hdev->dev, "no available sensors\n");
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +void hl_hwmon_fini(struct hl_device *hdev)
> > +{
> > +     if (!hdev->hwmon_initialized)
> > +             return;
> > +
> > +     hwmon_device_unregister(hdev->hwmon_dev);
> > +}
> > diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
> > new file mode 100644
> > index 000000000000..edd5f7159de0
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/sysfs.c
> > @@ -0,0 +1,588 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include "habanalabs.h"
> > +#include "include/habanalabs_device_if.h"
> > +
> > +#include <linux/hwmon-sysfs.h>
> > +#include <linux/hwmon.h>
> > +
> > +#define SET_CLK_PKT_TIMEOUT  200000  /* 200ms */
> > +#define SET_PWR_PKT_TIMEOUT  400000  /* 400ms */
> > +
> > +long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     if (curr)
> > +             pkt.opcode = ARMCP_PACKET_FREQUENCY_CURR_GET;
> > +     else
> > +             pkt.opcode = ARMCP_PACKET_FREQUENCY_GET;
> > +     pkt.pll_index = pll_index;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                             SET_CLK_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "Failed to get frequency of PLL %d, error %d\n",
> > +                     pll_index, rc);
> > +             result = rc;
> > +     }
> > +
> > +     return result;
> > +}
> > +
> > +void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq)
> > +{
> > +     struct armcp_packet pkt;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_FREQUENCY_SET;
> > +     pkt.pll_index = pll_index;
> > +     pkt.value = freq;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SET_CLK_PKT_TIMEOUT, NULL);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev,
> > +                     "Failed to set frequency to PLL %d, error %d\n",
> > +                     pll_index, rc);
> > +}
> > +
> > +u64 hl_get_max_power(struct hl_device *hdev)
> > +{
> > +     struct armcp_packet pkt;
> > +     long result;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_MAX_POWER_GET;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                             SET_PWR_PKT_TIMEOUT, &result);
> > +
> > +     if (rc) {
> > +             dev_err(hdev->dev, "Failed to get max power, error %d\n", rc);
> > +             result = rc;
> > +     }
> > +
> > +     return result;
> > +}
> > +
> > +void hl_set_max_power(struct hl_device *hdev, u64 value)
> > +{
> > +     struct armcp_packet pkt;
> > +     int rc;
> > +
> > +     memset(&pkt, 0, sizeof(pkt));
> > +
> > +     pkt.opcode = ARMCP_PACKET_MAX_POWER_SET;
> > +     pkt.value = value;
> > +
> > +     rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
> > +                                     SET_PWR_PKT_TIMEOUT, NULL);
> > +
> > +     if (rc)
> > +             dev_err(hdev->dev, "Failed to set max power, error %d\n", rc);
> > +}
> > +
> > +static ssize_t pm_mng_profile_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n",
> > +                     (hdev->pm_mng_profile == PM_AUTO) ? "auto" :
> > +                     (hdev->pm_mng_profile == PM_MANUAL) ? "manual" :
> > +                     "unknown");
> > +}
> > +
> > +static ssize_t pm_mng_profile_store(struct device *dev,
> > +             struct device_attribute *attr, const char *buf, size_t count)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     if (hdev->disabled) {
> > +             count = -ENODEV;
> > +             goto out;
> > +     }
> > +
> > +     mutex_lock(&hdev->device_open);
> > +
> > +     if (atomic_read(&hdev->fd_open_cnt) > 0) {
> > +             dev_err(hdev->dev,
> > +                     "Can't change PM profile while user process is opened on the device\n");
> > +             count = -EPERM;
> > +             goto unlock_mutex;
> > +     }
> > +
> > +     if (strncmp("auto", buf, strlen("auto")) == 0) {
> > +             /* Make sure we are in LOW PLL when changing modes */
> > +             if (hdev->pm_mng_profile == PM_MANUAL) {
> > +                     atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
> > +                     hl_device_set_frequency(hdev, PLL_LOW);
> > +                     hdev->pm_mng_profile = PM_AUTO;
> > +             }
> > +     } else if (strncmp("manual", buf, strlen("manual")) == 0) {
> > +             /* Make sure we are in LOW PLL when changing modes */
> > +             if (hdev->pm_mng_profile == PM_AUTO) {
> > +                     flush_delayed_work(&hdev->work_freq);
> > +                     hdev->pm_mng_profile = PM_MANUAL;
> > +             }
> > +     } else {
> > +             dev_err(hdev->dev, "value should be auto or manual\n");
> > +             count = -EINVAL;
> > +             goto unlock_mutex;
> > +     }
> > +
> > +unlock_mutex:
> > +     mutex_unlock(&hdev->device_open);
> > +out:
> > +     return count;
> > +}
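
Note that strncmp("auto", buf, strlen("auto")) is a prefix match, so input
such as "automatic" is also accepted as "auto". The kernel provides
sysfs_streq() for exact comparison that tolerates the trailing newline
written by `echo`. A small userspace sketch of that comparison follows;
`streq_nl()` is a hypothetical helper written for illustration, not the
kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace helper: true if input equals word exactly,
 * allowing a single trailing '\n' on input.
 */
bool streq_nl(const char *input, const char *word)
{
	size_t len;

	assert(input && word);

	len = strlen(word);
	if (strncmp(input, word, len) != 0)
		return false;

	/* Only end-of-string or one final newline may follow the word. */
	input += len;
	return input[0] == '\0' || (input[0] == '\n' && input[1] == '\0');
}
```
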
> > +
> > +static ssize_t high_pll_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%u\n", hdev->high_pll);
> > +}
> > +
> > +static ssize_t high_pll_store(struct device *dev, struct device_attribute *attr,
> > +                             const char *buf, size_t count)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     unsigned long value;
> > +     int rc;
> > +
> > +     if (hdev->disabled) {
> > +             count = -ENODEV;
> > +             goto out;
> > +     }
> > +
> > +     rc = kstrtoul(buf, 0, &value);
> > +
> > +     if (rc) {
> > +             count = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     hdev->high_pll = value;
> > +
> > +out:
> > +     return count;
> > +}
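
high_pll_store() above assigns the parsed value to hdev->high_pll, which is
a u32, without a range check, so on 64-bit systems a value above U32_MAX is
silently truncated. Below is a userspace sketch of parsing with an explicit
range check. `parse_u32()` is a hypothetical helper built on the C library's
strtoul(); it is not kstrtoul() and its error codes are an assumption made
for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse s into *out, auto-detecting the base
 * (decimal, 0x hex, leading-0 octal). The first character must be a digit,
 * and one trailing '\n' is allowed.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE if the value
 * does not fit in 32 bits. *out is only written on success.
 */
int parse_u32(const char *s, uint32_t *out)
{
	unsigned long v;
	char *end;

	assert(s && out);

	/* strtoul() would accept whitespace and a sign; reject them here. */
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	v = strtoul(s, &end, 0);
	if (end == s)
		return -EINVAL;

	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (errno == ERANGE || v > UINT32_MAX)
		return -ERANGE;

	*out = (uint32_t)v;
	return 0;
}
```
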
> > +
> > +static ssize_t uboot_ver_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.uboot_ver);
> > +}
> > +
> > +static ssize_t armcp_kernel_ver_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s",
> > +                     hdev->asic_prop.armcp_info.kernel_version);
> > +}
> > +
> > +static ssize_t armcp_ver_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n",
> > +                     hdev->asic_prop.armcp_info.armcp_version);
> > +}
> > +
> > +static ssize_t cpld_ver_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "0x%08x\n",
> > +                     hdev->asic_prop.armcp_info.cpld_version);
> > +}
> > +
> > +static ssize_t infineon_ver_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "0x%04x\n",
> > +                     hdev->asic_prop.armcp_info.infineon_version);
> > +}
> > +
> > +static ssize_t fuse_ver_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n",
> > +                     hdev->asic_prop.armcp_info.fuse_version);
> > +}
> > +
> > +static ssize_t thermal_ver_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s",
> > +                     hdev->asic_prop.armcp_info.thermal_version);
> > +}
> > +
> > +static ssize_t preboot_btl_ver_show(struct device *dev,
> > +                             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n", hdev->asic_prop.preboot_ver);
> > +}
> > +
> > +static ssize_t device_type_show(struct device *dev,
> > +             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     char *str;
> > +
> > +     switch (hdev->asic_type) {
> > +     case ASIC_GOYA:
> > +             str = "GOYA";
> > +             break;
> > +     default:
> > +             dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
> > +                             hdev->asic_type);
> > +             return -EINVAL;
> > +     }
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n", str);
> > +}
> > +
> > +static ssize_t pci_addr_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     /* Use dummy, fixed address for simulator */
> > +     if (!hdev->pdev)
> > +             return snprintf(buf, PAGE_SIZE, "0000:%02d:00.0\n", hdev->id);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%04x:%02x:%02x.%x\n",
> > +                     pci_domain_nr(hdev->pdev->bus),
> > +                     hdev->pdev->bus->number,
> > +                     PCI_SLOT(hdev->pdev->devfn),
> > +                     PCI_FUNC(hdev->pdev->devfn));
> > +}
> > +
> > +static ssize_t status_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     char *str;
> > +
> > +     if (hdev->disabled)
> > +             str = "Malfunction";
> > +     else
> > +             str = "Operational";
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%s\n", str);
> > +}
> > +
> > +static ssize_t write_open_cnt_show(struct device *dev,
> > +             struct device_attribute *attr, char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%d\n", hdev->user_ctx ? 1 : 0);
> > +}
> > +
> > +static ssize_t max_power_show(struct device *dev, struct device_attribute *attr,
> > +                             char *buf)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     unsigned long val;
> > +
> > +     if (hdev->disabled)
> > +             return -ENODEV;
> > +
> > +     val = hl_get_max_power(hdev);
> > +
> > +     return snprintf(buf, PAGE_SIZE, "%lu\n", val);
> > +}
> > +
> > +static ssize_t max_power_store(struct device *dev,
> > +             struct device_attribute *attr, const char *buf, size_t count)
> > +{
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     unsigned long value;
> > +     int rc;
> > +
> > +     if (hdev->disabled) {
> > +             count = -ENODEV;
> > +             goto out;
> > +     }
> > +
> > +     rc = kstrtoul(buf, 0, &value);
> > +
> > +     if (rc) {
> > +             count = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     hdev->max_power = value;
> > +     hl_set_max_power(hdev, value);
> > +
> > +out:
> > +     return count;
> > +}
> > +
> > +static ssize_t eeprom_read_handler(struct file *filp, struct kobject *kobj,
> > +                     struct bin_attribute *attr, char *buf, loff_t offset,
> > +                     size_t max_size)
> > +{
> > +     struct device *dev = container_of(kobj, struct device, kobj);
> > +     struct hl_device *hdev = dev_get_drvdata(dev);
> > +     char *data;
> > +     int rc;
> > +
> > +     if (!max_size)
> > +             return -EINVAL;
> > +
> > +     data = kzalloc(max_size, GFP_KERNEL);
> > +     if (!data)
> > +             return -ENOMEM;
> > +
> > +     rc = hdev->asic_funcs->get_eeprom_data(hdev, data, max_size);
> > +     if (rc)
> > +             goto out;
> > +
> > +     memcpy(buf, data, max_size);
> > +
> > +out:
> > +     kfree(data);
> > +
> > +     return max_size;
> > +}
> > +
> > +static DEVICE_ATTR_RW(pm_mng_profile);
> > +static DEVICE_ATTR_RW(high_pll);
> > +static DEVICE_ATTR_RO(uboot_ver);
> > +static DEVICE_ATTR_RO(armcp_kernel_ver);
> > +static DEVICE_ATTR_RO(armcp_ver);
> > +static DEVICE_ATTR_RO(cpld_ver);
> > +static DEVICE_ATTR_RO(infineon_ver);
> > +static DEVICE_ATTR_RO(fuse_ver);
> > +static DEVICE_ATTR_RO(thermal_ver);
> > +static DEVICE_ATTR_RO(preboot_btl_ver);
> > +static DEVICE_ATTR_RO(device_type);
> > +static DEVICE_ATTR_RO(pci_addr);
> > +static DEVICE_ATTR_RO(status);
> > +static DEVICE_ATTR_RO(write_open_cnt);
> > +static DEVICE_ATTR_RW(max_power);
> > +
> > +static const struct bin_attribute bin_attr_eeprom = {
> > +     .attr = {.name = "eeprom", .mode = (0444)},
> > +     .size = PAGE_SIZE,
> > +     .read = eeprom_read_handler
> > +};
> > +
> > +int hl_sysfs_init(struct hl_device *hdev)
> > +{
> > +     int rc;
> > +
> > +     rc = hdev->asic_funcs->add_device_attr(hdev);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to add device attributes\n");
> > +             return rc;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_pm_mng_profile);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file pm_mng_profile\n");
> > +             goto remove_device_attr;
> > +     }
> > +
> > +     hdev->pm_mng_profile = PM_AUTO;
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_high_pll);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file pll_profile\n");
> > +             goto remove_pm_mng_profile;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_uboot_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file uboot_ver\n");
> > +             goto remove_pll_profile;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_armcp_kernel_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file armcp_kernel_ver\n");
> > +             goto remove_uboot_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_armcp_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file armcp_ver\n");
> > +             goto remove_armcp_kernel_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_cpld_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file cpld_ver\n");
> > +             goto remove_armcp_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_infineon_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file infineon_ver\n");
> > +             goto remove_cpld_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_fuse_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file fuse_ver\n");
> > +             goto remove_infineon_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_thermal_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file thermal_ver\n");
> > +             goto remove_fuse_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_preboot_btl_ver);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file preboot_btl_ver\n");
> > +             goto remove_thermal_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_device_type);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file device_type\n");
> > +             goto remove_preboot_ver;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_pci_addr);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file pci_addr\n");
> > +             goto remove_device_type;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_status);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create device file status\n");
> > +             goto remove_pci_addr;
> > +     }
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_write_open_cnt);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file write_open_count\n");
> > +             goto remove_status;
> > +     }
> > +
> > +     hdev->max_power = hdev->asic_prop.max_power_default;
> > +
> > +     rc = device_create_file(hdev->dev, &dev_attr_max_power);
> > +     if (rc) {
> > +             dev_err(hdev->dev,
> > +                     "failed to create device file max_power\n");
> > +             goto remove_write_open_cnt;
> > +     }
> > +
> > +     rc = sysfs_create_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
> > +     if (rc) {
> > +             dev_err(hdev->dev, "failed to create EEPROM sysfs entry\n");
> > +             goto remove_attr_max_power;
> > +     }
> > +
> > +     return 0;
> > +
> > +remove_attr_max_power:
> > +     device_remove_file(hdev->dev, &dev_attr_max_power);
> > +remove_write_open_cnt:
> > +     device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
> > +remove_status:
> > +     device_remove_file(hdev->dev, &dev_attr_status);
> > +remove_pci_addr:
> > +     device_remove_file(hdev->dev, &dev_attr_pci_addr);
> > +remove_device_type:
> > +     device_remove_file(hdev->dev, &dev_attr_device_type);
> > +remove_preboot_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
> > +remove_thermal_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_thermal_ver);
> > +remove_fuse_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_fuse_ver);
> > +remove_infineon_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_infineon_ver);
> > +remove_cpld_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_cpld_ver);
> > +remove_armcp_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_armcp_ver);
> > +remove_armcp_kernel_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_armcp_kernel_ver);
> > +remove_uboot_ver:
> > +     device_remove_file(hdev->dev, &dev_attr_uboot_ver);
> > +remove_pll_profile:
> > +     device_remove_file(hdev->dev, &dev_attr_high_pll);
> > +remove_pm_mng_profile:
> > +     device_remove_file(hdev->dev, &dev_attr_pm_mng_profile);
> > +remove_device_attr:
> > +     hdev->asic_funcs->remove_device_attr(hdev);
> > +
> > +     return rc;
> > +}
> > +
> > +void hl_sysfs_fini(struct hl_device *hdev)
> > +{
> > +     sysfs_remove_bin_file(&hdev->dev->kobj, &bin_attr_eeprom);
> > +     device_remove_file(hdev->dev, &dev_attr_max_power);
> > +     device_remove_file(hdev->dev, &dev_attr_write_open_cnt);
> > +     device_remove_file(hdev->dev, &dev_attr_status);
> > +     device_remove_file(hdev->dev, &dev_attr_pci_addr);
> > +     device_remove_file(hdev->dev, &dev_attr_device_type);
> > +     device_remove_file(hdev->dev, &dev_attr_preboot_btl_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_thermal_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_fuse_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_infineon_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_cpld_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_armcp_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_armcp_kernel_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_uboot_ver);
> > +     device_remove_file(hdev->dev, &dev_attr_high_pll);
> > +     device_remove_file(hdev->dev, &dev_attr_pm_mng_profile);
> > +     hdev->asic_funcs->remove_device_attr(hdev);
> > +}
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 10/15] habanalabs: add device reset support
  2019-01-27  7:51   ` Mike Rapoport
@ 2019-01-28 12:53     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28 12:53 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Sun, Jan 27, 2019 at 9:51 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:52AM +0200, Oded Gabbay wrote:
> > This patch adds support for performing various on-the-fly resets of Goya.
> >
> > The driver supports two types of resets:
> > 1. soft-reset
> > 2. hard-reset
> >
> > Soft-reset is done when the device detects a timeout of a command
> > submission that was given to the device. The soft-reset process only resets
> > the engines that are relevant for the submission of compute jobs, i.e. the
> > DMA channels, the TPCs and the MME. The purpose is to bring the device as
> > fast as possible to a working state.
> >
> > Hard-reset is done in several cases:
> > 1. After soft-reset is done but the device is not responding
> > 2. When fatal errors occur inside the device, e.g. ECC error
> > 3. When the driver is removed
> >
> > Hard-reset performs a reset of the entire chip except for the PCI
> > controller and the PLLs. It is a much longer process than soft-reset but it
> > helps to recover the device without the need to reboot the Host.
> >
> > After hard-reset, the driver will restore the max power attribute and in
> > case of manual power management, the frequencies that were set.
> >
> > This patch also adds two entries to the sysfs, which allows the root user
> > to initiate a soft or hard reset.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/command_buffer.c  |  11 +-
> >  drivers/misc/habanalabs/device.c          | 308 +++++++++++++++++++++-
> >  drivers/misc/habanalabs/goya/goya.c       | 201 ++++++++++++++
> >  drivers/misc/habanalabs/goya/goya_hwmgr.c |  18 +-
> >  drivers/misc/habanalabs/habanalabs.h      |  35 +++
> >  drivers/misc/habanalabs/habanalabs_drv.c  |   9 +-
> >  drivers/misc/habanalabs/hwmon.c           |   4 +-
> >  drivers/misc/habanalabs/irq.c             |  31 +++
> >  drivers/misc/habanalabs/sysfs.c           | 120 ++++++++-
> >  9 files changed, 712 insertions(+), 25 deletions(-)
> >
> > diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
> > index 535ed6cc5bda..700c6da01188 100644
> > --- a/drivers/misc/habanalabs/command_buffer.c
> > +++ b/drivers/misc/habanalabs/command_buffer.c
> > @@ -81,9 +81,10 @@ int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
> >       bool alloc_new_cb = true;
> >       int rc;
> >
> > -     if (hdev->disabled) {
> > +     if ((hdev->disabled) || ((atomic_read(&hdev->in_reset)) &&
> > +                                     (ctx_id != HL_KERNEL_ASID_ID))) {
> >               dev_warn_ratelimited(hdev->dev,
> > -                     "Device is disabled !!! Can't create new CBs\n");
> > +                     "Device is disabled or in reset !!! Can't create new CBs\n");
> >               rc = -EBUSY;
> >               goto out_err;
> >       }
> > @@ -187,6 +188,12 @@ int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
> >       u64 handle;
> >       int rc;
> >
> > +     if (hdev->hard_reset_pending) {
> > +             dev_crit_ratelimited(hdev->dev,
> > +                     "Device HARD reset pending !!! Please close FD\n");
> > +             return -ENODEV;
> > +     }
>
> Probably this check should be done at the top-level ioctl()?
Fixed
> And what will happen if the device performs a hard reset, but the user
> keeps the file descriptor open?
I take care of that in the reset function. Basically, I don't do the
hard-reset until all user processes (currently I only support a single
one) close their FDs.
If they don't close them before a timeout expires, I kill the user
processes. Take a look at hl_device_hard_reset_pending().
>
> > +
> >       switch (args->in.op) {
> >       case HL_CB_OP_CREATE:
> >               rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
> > diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> > index ff7b610f18c4..00fde57ce823 100644
> > --- a/drivers/misc/habanalabs/device.c
> > +++ b/drivers/misc/habanalabs/device.c
> > @@ -188,6 +188,7 @@ static int device_early_init(struct hl_device *hdev)
> >
> >       mutex_init(&hdev->device_open);
> >       mutex_init(&hdev->send_cpu_message_lock);
> > +     atomic_set(&hdev->in_reset, 0);
> >       atomic_set(&hdev->fd_open_cnt, 0);
> >
> >       return 0;
> > @@ -238,6 +239,27 @@ static void set_freq_to_low_job(struct work_struct *work)
> >                       usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
> >  }
> >
> > +static void hl_device_heartbeat(struct work_struct *work)
> > +{
> > +     struct hl_device *hdev = container_of(work, struct hl_device,
> > +                                             work_heartbeat.work);
> > +
> > +     if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
> > +             goto reschedule;
> > +
> > +     if (!hdev->asic_funcs->send_heartbeat(hdev))
> > +             goto reschedule;
>
> AFAIU, asic_funcs->send_heartbeat() is set once at init time. The work
> should not be scheduled if it's NULL, I suppose.
I don't check here whether the function pointer is NULL. I check the
return value of the call to the function. The function itself is always
implemented.

>
> > +
> > +     dev_err(hdev->dev, "Device heartbeat failed !!!\n");
> > +     hl_device_reset(hdev, true, false);
> > +
> > +     return;
> > +
> > +reschedule:
> > +     schedule_delayed_work(&hdev->work_heartbeat,
> > +                     usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
> > +}
> > +
> >  /**
> >   * device_late_init - do late stuff initialization for the habanalabs device
> >   *
> > @@ -273,6 +295,12 @@ static int device_late_init(struct hl_device *hdev)
> >       schedule_delayed_work(&hdev->work_freq,
> >                       usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
> >
> > +     if (hdev->heartbeat) {
> > +             INIT_DELAYED_WORK(&hdev->work_heartbeat, hl_device_heartbeat);
> > +             schedule_delayed_work(&hdev->work_heartbeat,
> > +                             usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
> > +     }
> > +
> >       hdev->late_init_done = true;
> >
> >       return 0;
> > @@ -290,6 +318,8 @@ static void device_late_fini(struct hl_device *hdev)
> >               return;
> >
> >       cancel_delayed_work_sync(&hdev->work_freq);
> > +     if (hdev->heartbeat)
> > +             cancel_delayed_work_sync(&hdev->work_heartbeat);
> >
> >       if (hdev->asic_funcs->late_fini)
> >               hdev->asic_funcs->late_fini(hdev);
> > @@ -397,6 +427,254 @@ int hl_device_resume(struct hl_device *hdev)
> >       return 0;
> >  }
> >
> > +static void hl_device_hard_reset_pending(struct work_struct *work)
> > +{
> > +     struct hl_device_reset_work *device_reset_work =
> > +             container_of(work, struct hl_device_reset_work, reset_work);
> > +     struct hl_device *hdev = device_reset_work->hdev;
> > +     u16 pending_cnt = HL_PENDING_RESET_PER_SEC;
> > +     struct task_struct *task = NULL;
> > +
> > +     /* Flush all processes that are inside hl_open */
> > +     mutex_lock(&hdev->device_open);
> > +
> > +     while ((atomic_read(&hdev->fd_open_cnt)) && (pending_cnt)) {
> > +
> > +             pending_cnt--;
> > +
> > +             dev_info(hdev->dev,
> > +                     "Can't HARD reset, waiting for user to close FD\n");
> > +             ssleep(1);
> > +     }
> > +
> > +     if (atomic_read(&hdev->fd_open_cnt)) {
> > +             task = get_pid_task(hdev->user_ctx->hpriv->taskpid,
> > +                                     PIDTYPE_PID);
> > +             if (task) {
> > +                     dev_info(hdev->dev, "Killing user processes\n");
> > +                     send_sig(SIGKILL, task, 1);
>
> Shouldn't the user get a chance for cleanup?
I give them 5 seconds - that's an eternity :)
I deliberated over this question a lot: should I kill the process so
that the hard-reset happens automatically, or wait until the FD is
closed and potentially never hard-reset, because the user may never
close the FD?
For now I decided to do the former. If users don't like this behavior,
I may add a kernel parameter to control it.

>
> > +                     msleep(100);
> > +
> > +                     put_task_struct(task);
> > +             }
> > +     }
> > +
> > +     mutex_unlock(&hdev->device_open);
> > +
> > +     hl_device_reset(hdev, true, true);
> > +
> > +     kfree(device_reset_work);
> > +}
> > +
>
> [ ... ]
>
> > diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> > index 866d1774b2e4..9482dbb2e03a 100644
> > --- a/drivers/misc/habanalabs/goya/goya_hwmgr.c
> > +++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
> > @@ -38,7 +38,7 @@ static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
> >       struct hl_device *hdev = dev_get_drvdata(dev);
> >       long value;
> >
> > -     if (hdev->disabled)
> > +     if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
> >               return -ENODEV;
> >
> >       value = hl_get_frequency(hdev, MME_PLL, false);
> > @@ -57,7 +57,7 @@ static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
> >       int rc;
> >       long value;
> >
> > -     if (hdev->disabled) {
> > +     if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
>
> There are quite a few of those, maybe split this check to a helper
> function?
Fixed
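For illustration only, a minimal userspace sketch of what such a helper
could look like. The struct and the helper name below are hypothetical
stand-ins (they are not taken from the posted patch), and C11 atomics
replace the kernel's atomic_read():

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the two fields the check reads.
 * This is NOT the driver's struct hl_device.
 */
struct hl_device_sketch {
	bool disabled;
	atomic_int in_reset;
};

/*
 * Returns true when the device must not be accessed, i.e. it is either
 * disabled or currently going through a reset.
 */
static bool hl_device_disabled_or_in_reset(struct hl_device_sketch *hdev)
{
	return hdev->disabled || atomic_load(&hdev->in_reset);
}
```

With a helper of this shape, each show/store handler would reduce the
open-coded condition to a single call followed by the -ENODEV return.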
>
> >               count = -ENODEV;
> >               goto fail;
> >       }
> > @@ -87,7 +87,7 @@ static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
> >       struct hl_device *hdev = dev_get_drvdata(dev);
> >       long value;
> >
> > -     if (hdev->disabled)
> > +     if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
> >               return -ENODEV;
> >
> >       value = hl_get_frequency(hdev, TPC_PLL, false);
> > @@ -106,7 +106,7 @@ static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
> >       int rc;
> >       long value;
> >
> > -     if (hdev->disabled) {
> > +     if ((hdev->disabled) || (atomic_read(&hdev->in_reset))) {
> >               count = -ENODEV;
> >               goto fail;
> >       }
>
> --
> Sincerely yours,
> Mike.
>


* Re: [PATCH 11/15] habanalabs: add command submission module
  2019-01-27 15:11   ` Mike Rapoport
@ 2019-01-28 13:51     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-28 13:51 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay

On Sun, Jan 27, 2019 at 5:11 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:53AM +0200, Oded Gabbay wrote:
> > This patch adds the main flow for the user to submit work to the device.
> >
> > Each work is described by a command submission object (CS). The CS contains
> > 3 arrays of command buffers: One for execution, and two for context-switch
> > (store and restore).
> >
> > For each CB, the user specifies on which queue to put that CB. In case of
> > an internal queue, the entry doesn't contain a pointer to the CB but the
> > address in the on-chip memory that the CB resides at.
> >
> > The driver parses some of the CBs to enforce security restrictions.
> >
> > The user receives a sequence number that represents the CS object. The user
> > can then query the driver regarding the status of the CS, using that
> > sequence number.
> >
> > In case the CS doesn't finish before the timeout expires, the driver will
> > perform a soft-reset of the device.
> >
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/Makefile             |    3 +-
> >  drivers/misc/habanalabs/command_submission.c |  787 +++++++++++++
> >  drivers/misc/habanalabs/context.c            |   52 +-
> >  drivers/misc/habanalabs/device.c             |   16 +
> >  drivers/misc/habanalabs/goya/goya.c          | 1082 ++++++++++++++++++
> >  drivers/misc/habanalabs/habanalabs.h         |  274 +++++
> >  drivers/misc/habanalabs/habanalabs_drv.c     |   23 +
> >  drivers/misc/habanalabs/habanalabs_ioctl.c   |    4 +-
> >  drivers/misc/habanalabs/hw_queue.c           |  250 ++++
> >  drivers/misc/habanalabs/memory.c             |  200 ++++
> >  include/uapi/misc/habanalabs.h               |  158 ++-
> >  11 files changed, 2842 insertions(+), 7 deletions(-)
> >  create mode 100644 drivers/misc/habanalabs/command_submission.c
> >  create mode 100644 drivers/misc/habanalabs/memory.c
> >
> > diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> > index b5607233d216..d2fd0e18b1eb 100644
> > --- a/drivers/misc/habanalabs/Makefile
> > +++ b/drivers/misc/habanalabs/Makefile
> > @@ -5,7 +5,8 @@
> >  obj-m        := habanalabs.o
> >
> >  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
> > -             command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
> > +             command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
> > +             command_submission.o
> >
> >  include $(src)/goya/Makefile
> >  habanalabs-y += $(HL_GOYA_FILES)
> > diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
> > new file mode 100644
> > index 000000000000..0116c2262f17
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/command_submission.c
> > @@ -0,0 +1,787 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + */
> > +
> > +#include <uapi/misc/habanalabs.h>
> > +#include "habanalabs.h"
> > +
> > +#include <linux/sched/mm.h>
> > +#include <linux/sched/task.h>
> > +#include <linux/sched/signal.h>
> > +#include <linux/wait.h>
> > +#include <linux/mm.h>
> > +#include <linux/highmem.h>
>
> [ ... ]
>
> > +static void cs_do_release(struct kref *ref)
> > +{
> > +     struct hl_cs *cs = container_of(ref, struct hl_cs,
> > +                                             refcount);
> > +     struct hl_device *hdev = cs->ctx->hdev;
> > +     struct hl_cs_job *job, *tmp;
> > +
> > +     cs->completed = true;
> > +
> > +     /*
> > +      * Although if we reached here it means that all external jobs have
> > +      * finished, because each one of them took refcnt to CS, we still
> > +      * need to go over the internal jobs and free them. Otherwise, we
> > +      * will have leaked memory and what's worse, the CS object (and
> > +      * potentially the CTX object) could be released, while the JOB
> > +      * still holds a pointer to them (but no reference).
> > +      */
> > +     list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node)
> > +             free_job(hdev, job);
> > +
> > +     /* We also need to update CI for internal queues */
> > +     if (cs->submitted) {
> > +             hl_int_hw_queue_update_ci(cs);
> > +
> > +             spin_lock(&hdev->hw_queues_mirror_lock);
> > +             /* remove CS from hw_queues mirror list */
> > +             list_del_init(&cs->mirror_node);
> > +             spin_unlock(&hdev->hw_queues_mirror_lock);
> > +
> > +             /*
> > +              * Don't cancel TDR in case this CS was timedout because we
> > +              * might be running from the TDR context
> > +              */
> > +             if ((!cs->timedout) &&
> > +                     (hdev->timeout_jiffies != MAX_SCHEDULE_TIMEOUT)) {
> > +                     struct hl_cs *next;
> > +
> > +                     if (cs->tdr_active)
> > +                             cancel_delayed_work_sync(&cs->work_tdr);
> > +
> > +                     spin_lock(&hdev->hw_queues_mirror_lock);
> > +                     /* queue TDR for next CS */
> > +                     next = list_first_entry_or_null(
> > +                                     &hdev->hw_queues_mirror_list,
> > +                                     struct hl_cs, mirror_node);
> > +                     if ((next) && (!next->tdr_active)) {
> > +                             next->tdr_active = true;
> > +                             schedule_delayed_work(&next->work_tdr,
> > +                                                     hdev->timeout_jiffies);
> > +                             spin_unlock(&hdev->hw_queues_mirror_lock);
> > +                     } else {
> > +                             spin_unlock(&hdev->hw_queues_mirror_lock);
> > +                     }
>
> 'else' can be dropped, just move spin_unlock() outside the 'if'
>
Fixed
> > +             }
> > +     }
> > +
> > +     hl_ctx_put(cs->ctx);
> > +
> > +     if (cs->timedout)
> > +             dma_fence_set_error(cs->fence, -ETIMEDOUT);
> > +     else if (cs->aborted)
> > +             dma_fence_set_error(cs->fence, -EIO);
> > +
> > +     dma_fence_signal(cs->fence);
> > +     dma_fence_put(cs->fence);
> > +
> > +     kfree(cs);
> > +}
>
> [ ... ]
>
> > +static int allocate_cs(struct hl_device *hdev, struct hl_ctx *ctx,
> > +                     struct hl_cs **cs_new)
> > +{
> > +     struct hl_dma_fence *fence;
> > +     struct dma_fence *other = NULL;
> > +     struct hl_cs *cs;
> > +     int rc;
> > +
> > +     cs = kzalloc(sizeof(*cs), GFP_ATOMIC);
> > +     if (!cs)
> > +             return -ENOMEM;
>
> Does this ever run from a context that cannot use GFP_KERNEL?
> This applies to other allocations below.
>
It *always* runs from a context that cannot use GFP_KERNEL, because we
mustn't sleep during command submission due to low latency
requirements.

> > +
> > +     cs->ctx = ctx;
> > +     cs->submitted = false;
> > +     cs->completed = false;
> > +     INIT_LIST_HEAD(&cs->job_list);
> > +     INIT_DELAYED_WORK(&cs->work_tdr, cs_timedout);
> > +     kref_init(&cs->refcount);
> > +     spin_lock_init(&cs->job_lock);
> > +
> > +     fence = kmalloc(sizeof(*fence), GFP_ATOMIC);
>
> kzalloc?
Can't waste time on memset

>
> > +     if (!fence) {
> > +             rc = -ENOMEM;
> > +             goto free_cs;
> > +     }
> > +
> > +     fence->hdev = hdev;
> > +     spin_lock_init(&fence->lock);
> > +     cs->fence = &fence->base_fence;
> > +
> > +     spin_lock(&ctx->cs_lock);
> > +
> > +     fence->cs_seq = ctx->cs_sequence;
> > +     other = ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)];
> > +     if ((other) && (!dma_fence_is_signaled(other))) {
> > +             spin_unlock(&ctx->cs_lock);
> > +             rc = -EAGAIN;
> > +             goto free_fence;
> > +     }
> > +
> > +     dma_fence_init(&fence->base_fence, &hl_fence_ops, &fence->lock,
> > +                     ctx->asid, ctx->cs_sequence);
> > +
> > +     cs->sequence = fence->cs_seq;
> > +
> > +     ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)] =
> > +                                                     &fence->base_fence;
> > +     ctx->cs_sequence++;
> > +
> > +     dma_fence_get(&fence->base_fence);
> > +
> > +     dma_fence_put(other);
> > +
> > +     spin_unlock(&ctx->cs_lock);
> > +
> > +     *cs_new = cs;
> > +
> > +     return 0;
> > +
> > +free_fence:
> > +     kfree(fence);
> > +free_cs:
> > +     kfree(cs);
> > +     return rc;
> > +}
> > +
>
> [ ... ]
>
> > +
> > +static int goya_validate_cb(struct hl_device *hdev,
> > +                     struct hl_cs_parser *parser, bool is_mmu)
> > +{
> > +     u32 cb_parsed_length = 0;
> > +     int rc = 0;
> > +
> > +     parser->patched_cb_size = 0;
> > +
> > +     /* user_cb_size is more than 0 so loop will always be executed */
> > +     while ((cb_parsed_length < parser->user_cb_size) && (!rc)) {
> > +             enum packet_id pkt_id;
> > +             u16 pkt_size;
> > +             void *user_pkt;
> > +
> > +             user_pkt = (void *) (parser->user_cb->kernel_address +
> > +                                                     cb_parsed_length);
> > +
> > +             pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
> > +                             PACKET_HEADER_PACKET_ID_MASK) >>
> > +                                     PACKET_HEADER_PACKET_ID_SHIFT);
> > +
> > +             pkt_size = goya_packet_sizes[pkt_id];
> > +             cb_parsed_length += pkt_size;
> > +             if (cb_parsed_length > parser->user_cb_size) {
> > +                     dev_err(hdev->dev,
> > +                             "packet 0x%x is out of CB boundary\n", pkt_id);
> > +                     rc = -EINVAL;
> > +                     continue;
>
> I missed the !rc in the while condition at first. Please consider a
> break here and
>
>         if (rc)
>                 break;
>
> after the switch.
Fixed
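For illustration only, a minimal userspace sketch of the loop shape
suggested above: the error state is no longer part of the while
condition, and every failure leaves the loop through an explicit break.
The packet IDs, the size table, the one-byte header and the extra bounds
check on the ID are all assumptions made for this sketch. They are not
the Goya packet format or the driver's code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet IDs, for this sketch only. */
enum sketch_pkt_id { SKETCH_PKT_NOP = 0, SKETCH_PKT_STOP = 1, SKETCH_PKT_MAX };

/* Made-up packet sizes in bytes, indexed by packet ID. */
static const uint16_t sketch_pkt_sizes[SKETCH_PKT_MAX] = { 8, 8 };

/*
 * Walk a command buffer packet by packet. The first byte of each packet
 * is treated as its ID (an assumption of this sketch). Returns 0 on
 * success or a negative errno; *patched_size accumulates the size of
 * the packets that were accepted.
 */
static int sketch_validate_cb(const uint8_t *cb, size_t cb_size,
			      size_t *patched_size)
{
	size_t parsed = 0;
	int rc = 0;

	*patched_size = 0;

	while (parsed < cb_size) {
		uint8_t pkt_id = cb[parsed];
		uint16_t pkt_size;

		/* Guard the table lookup (added for the sketch). */
		if (pkt_id >= SKETCH_PKT_MAX) {
			rc = -EINVAL;
			break;
		}

		pkt_size = sketch_pkt_sizes[pkt_id];
		parsed += pkt_size;
		if (parsed > cb_size) {
			/* Packet runs past the end of the buffer. */
			rc = -EINVAL;
			break;
		}

		switch (pkt_id) {
		case SKETCH_PKT_NOP:
			*patched_size += pkt_size;
			break;
		case SKETCH_PKT_STOP:
			/* Packet type the user may not submit. */
			rc = -EPERM;
			break;
		}

		/* Leave the loop as soon as any packet is rejected. */
		if (rc)
			break;
	}

	return rc;
}
```

The break inside the switch only leaves the switch, which is why the
separate check on rc after it is needed.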
>
> > +             }
> > +
> > +             switch (pkt_id) {
> > +             case PACKET_WREG_32:
> > +                     /*
> > +                      * Although it is validated after copy in patch_cb(),
> > +                      * need to validate here as well because patch_cb() is
> > +                      * not called in MMU path while this function is called
> > +                      */
> > +                     rc = goya_validate_wreg32(hdev, parser, user_pkt);
> > +                     break;
> > +
> > +             case PACKET_WREG_BULK:
> > +                     dev_err(hdev->dev,
> > +                             "User not allowed to use WREG_BULK\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_MSG_PROT:
> > +                     dev_err(hdev->dev,
> > +                             "User not allowed to use MSG_PROT\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_CP_DMA:
> > +                     dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_STOP:
> > +                     dev_err(hdev->dev, "User not allowed to use STOP\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_LIN_DMA:
> > +                     if (is_mmu)
> > +                             rc = goya_validate_dma_pkt_mmu(hdev, parser,
> > +                                             user_pkt);
> > +                     else
> > +                             rc = goya_validate_dma_pkt_no_mmu(hdev, parser,
> > +                                             user_pkt);
> > +                     break;
> > +
> > +             case PACKET_MSG_LONG:
> > +             case PACKET_MSG_SHORT:
> > +             case PACKET_FENCE:
> > +             case PACKET_NOP:
> > +                     parser->patched_cb_size += pkt_size;
> > +                     break;
> > +
> > +             default:
> > +                     dev_err(hdev->dev, "Invalid packet header 0x%x\n",
> > +                             pkt_id);
> > +                     rc = -EINVAL;
> > +                     break;
> > +             }
> > +     }
> > +
> > +     /*
> > +      * The new CB should have space at the end for two MSG_PROT packets:
> > +      * 1. A packet that will act as a completion packet
> > +      * 2. A packet that will generate MSI-X interrupt
> > +      */
> > +     parser->patched_cb_size += sizeof(struct packet_msg_prot) * 2;
> > +
> > +     return rc;
> > +}
>
> [ ... ]
>
> > +static int goya_patch_cb(struct hl_device *hdev,
> > +                             struct hl_cs_parser *parser)
> > +{
> > +     u32 cb_parsed_length = 0;
> > +     u32 cb_patched_cur_length = 0;
> > +     int rc = 0;
> > +
> > +     /* cb_user_size is more than 0 so loop will always be executed */
> > +     while ((cb_parsed_length < parser->user_cb_size) && (!rc)) {
> > +             enum packet_id pkt_id;
> > +             u16 pkt_size;
> > +             u32 new_pkt_size = 0;
> > +             void *user_pkt, *kernel_pkt;
> > +
> > +             user_pkt = (void *) (parser->user_cb->kernel_address +
> > +                                                     cb_parsed_length);
> > +             kernel_pkt = (void *) (parser->patched_cb->kernel_address +
> > +                                                     cb_patched_cur_length);
> > +
> > +             pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
> > +                             PACKET_HEADER_PACKET_ID_MASK) >>
> > +                                     PACKET_HEADER_PACKET_ID_SHIFT);
> > +
> > +             pkt_size = goya_packet_sizes[pkt_id];
> > +             cb_parsed_length += pkt_size;
> > +             if (cb_parsed_length > parser->user_cb_size) {
> > +                     dev_err(hdev->dev,
> > +                             "packet 0x%x is out of CB boundary\n", pkt_id);
> > +                     rc = -EINVAL;
> > +                     continue;
>
> Ditto
Fixed
>
> > +             }
> > +
> > +             switch (pkt_id) {
> > +             case PACKET_LIN_DMA:
> > +                     rc = goya_patch_dma_packet(hdev, parser, user_pkt,
> > +                                             kernel_pkt, &new_pkt_size);
> > +                     cb_patched_cur_length += new_pkt_size;
> > +                     break;
> > +
> > +             case PACKET_WREG_32:
> > +                     memcpy(kernel_pkt, user_pkt, pkt_size);
> > +                     cb_patched_cur_length += pkt_size;
> > +                     rc = goya_validate_wreg32(hdev, parser, kernel_pkt);
> > +                     break;
> > +
> > +             case PACKET_WREG_BULK:
> > +                     dev_err(hdev->dev,
> > +                             "User not allowed to use WREG_BULK\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_MSG_PROT:
> > +                     dev_err(hdev->dev,
> > +                             "User not allowed to use MSG_PROT\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_CP_DMA:
> > +                     dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_STOP:
> > +                     dev_err(hdev->dev, "User not allowed to use STOP\n");
> > +                     rc = -EPERM;
> > +                     break;
> > +
> > +             case PACKET_MSG_LONG:
> > +             case PACKET_MSG_SHORT:
> > +             case PACKET_FENCE:
> > +             case PACKET_NOP:
> > +                     memcpy(kernel_pkt, user_pkt, pkt_size);
> > +                     cb_patched_cur_length += pkt_size;
> > +                     break;
> > +
> > +             default:
> > +                     dev_err(hdev->dev, "Invalid packet header 0x%x\n",
> > +                             pkt_id);
> > +                     rc = -EINVAL;
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return rc;
> > +}
>
> [ ... ]
>
> >  static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
> >               u16 event_type, char *axi_name, int len)
> >  {
> > @@ -4645,6 +5677,48 @@ static void goya_disable_clock_gating(struct hl_device *hdev)
> >
> >  }
> >
> > +static bool goya_is_device_idle(struct hl_device *hdev)
> > +{
> > +     u64 offset, dma_qm_reg, tpc_qm_reg, tpc_cmdq_reg, tpc_cfg_reg;
> > +     bool val = true;
> > +     int i;
> > +
> > +     offset = mmDMA_QM_1_GLBL_STS0 - mmDMA_QM_0_GLBL_STS0;
> > +
> > +     for (i = 0 ; i < DMA_MAX_NUM ; i++) {
> > +             dma_qm_reg = mmDMA_QM_0_GLBL_STS0 + i * offset;
> > +
> > +             val = val && ((RREG32(dma_qm_reg) & DMA_QM_IDLE_MASK) ==
> > +                             DMA_QM_IDLE_MASK);
> > +     }
> > +
> > +     offset = mmTPC1_QM_GLBL_STS0 - mmTPC0_QM_GLBL_STS0;
> > +
> > +     for (i = 0 ; i < TPC_MAX_NUM ; i++) {
> > +             tpc_qm_reg = mmTPC0_QM_GLBL_STS0 + i * offset;
> > +             tpc_cmdq_reg = mmTPC0_CMDQ_GLBL_STS0 + i * offset;
> > +             tpc_cfg_reg = mmTPC0_CFG_STATUS + i * offset;
> > +
> > +             val = val && ((RREG32(tpc_qm_reg) & TPC_QM_IDLE_MASK) ==
> > +                             TPC_QM_IDLE_MASK);
> > +             val = val && ((RREG32(tpc_cmdq_reg) & TPC_CMDQ_IDLE_MASK) ==
> > +                             TPC_CMDQ_IDLE_MASK);
> > +             val = val && ((RREG32(tpc_cfg_reg) & TPC_CFG_IDLE_MASK) ==
> > +                             TPC_CFG_IDLE_MASK);
> > +     }
> > +
> > +     val = val && ((RREG32(mmMME_QM_GLBL_STS0) & MME_QM_IDLE_MASK) ==
> > +                     MME_QM_IDLE_MASK);
> > +     val = val && ((RREG32(mmMME_CMDQ_GLBL_STS0) & MME_CMDQ_IDLE_MASK) ==
> > +                     MME_CMDQ_IDLE_MASK);
> > +     val = val && ((RREG32(mmMME_ARCH_STATUS) & MME_ARCH_IDLE_MASK) ==
> > +                     MME_ARCH_IDLE_MASK);
> > +     val = val && ((RREG32(mmMME_SHADOW_0_STATUS) & MME_SHADOW_IDLE_MASK) ==
> > +                     0);
>
> Huh, these are neat, but IMHO plain
>
>         if ((RREG(reg) & mask) != mask)
>                 return false;
>
> are more readable...
>
Fixed
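
For reference, a standalone sketch of the early-return style suggested
above. This is not the driver code: fake_regs, rreg32(), struct idle_check
and all_engines_idle() are made-up names, and the register file is
simulated with an array so the snippet compiles on its own. The expect
field is there because one of the checks in the patch (the MME shadow
status) compares against 0 rather than against the mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated register file; the real driver would use RREG32() on MMIO. */
static uint32_t fake_regs[4];

static uint32_t rreg32(uint32_t reg)
{
	return fake_regs[reg];
}

/* One idle condition: (register & mask) must equal expect. */
struct idle_check {
	uint32_t reg;
	uint32_t mask;
	uint32_t expect;
};

/* Return false at the first engine that is not idle. */
static bool all_engines_idle(const struct idle_check *checks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if ((rreg32(checks[i].reg) & checks[i].mask) != checks[i].expect)
			return false;
	}

	return true;
}
```

A table of checks keeps the per-engine loops and the MME special cases in
one place, at the cost of building the table at init time.
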
> > +
> > +     return val;
> > +}
> > +
>
> [ ... ]
>
> --
> Sincerely yours,
> Mike.
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 1/5] drivers/accel: Introduce subsystem
  2019-01-27  4:31                 ` Andrew Donnellan
@ 2019-01-28 19:36                   ` Frederic Barrat
  0 siblings, 0 replies; 103+ messages in thread
From: Frederic Barrat @ 2019-01-28 19:36 UTC (permalink / raw)
  To: Andrew Donnellan, Olof Johansson, linux-kernel
  Cc: linux-accelerators, Greg Kroah-Hartman, ogabbay, airlied,
	jglisse, linuxppc-dev



On 27/01/2019 at 05:31, Andrew Donnellan wrote:
> [+ linuxppc-dev, because cxl/ocxl are handled through powerpc - please 
> cc on future versions of this series]
> 
> On 26/1/19 8:13 am, Olof Johansson wrote:
>> We're starting to see more of these kind of devices, the current
>> upcoming wave will likely be around machine learning and inference
>> engines. A few drivers have been added to drivers/misc for this, but
>> it's timely to make it into a separate group of drivers/subsystem, to
>> make it easier to find them, and to encourage collaboration between
>> contributors.
>>
>> Over time, we expect to build shared frameworks that the drivers will
>> make use of, but how that framework needs to look like to fill the needs
>> is still unclear, and the best way to gain that knowledge is to give the
>> disparate implementations a shared location.
>>
>> There has been some controversy around expectations for userspace
>> stacks being open. The clear preference is to see that happen, and any
>> driver and platform stack that is delivered like that will be given
>> preferential treatment, and at some point in the future it might
>> become the requirement. Until then, the bare minimum we need is an
>> open low-level userspace such that the driver and HW interfaces can be
>> exercised if someone is modifying the driver, even if the full details
>> of the workload are not always available.
>>
>> Bootstrapping this with myself and Greg as maintainers (since the current
>> drivers will be moving out of drivers/misc). Looking forward to expanding
>> that group over time.
>>
> 
> [snip]
> 
>> +
>> +Hardware offload accelerator subsystem
>> +======================================
>> +
>> +This is a brief overview of the subsystem (grouping) of hardware
>> +accelerators kept under drivers/accel
>> +
>> +Types of hardware supported
>> +---------------------------
>> +
>> +  The general types of hardware supported are hardware devices that has
>> +  general interactions of sending commands and buffers to the hardware,
>> +  returning completions and possible filled buffers back, together
>> +  with the usual driver pieces around hardware control, setup, error
>> +  handling, etc.
>> +
>> +  Drivers that fit into other subsystems are expected to be merged
>> +  there, and use the appropriate userspace interfaces of said functional
>> +  areas. We don't expect to see drivers for network, storage, graphics
>> +  and similar hardware implemented by drivers here.
>> +
>> +Expectations for contributions
>> +------------------------------
>> +
>> + - Platforms and hardware that has fully open stacks, from Firmware to
>> +   Userspace, are always going to be given preferential treatment. These
>> +   platforms give the best insight for behavior and interaction of all
>> +   layers, including ability to improve implementation across the stack
>> +   over time.
>> +
>> + - If a platform is partially proprietary, it is still expected that the
>> +   portions that interact the driver can be shared in a form that allows
>> +   for exercising the hardware/driver and evolution of the interface over
>> +   time. This could be separated into a shared library and test/sample
>> +   programs, for example.
>> +
>> + - Over time, there is an expectation to converge drivers over to shared
>> +   frameworks and interfaces. Until then, the general rule is that no
>> +   more than one driver per vendor will be acceptable. For vendors that
>> +   aren't participating in the work towards shared frameworks over time,
>> +   we reserve the right to phase out support for the hardware.
> How exactly do generic drivers for interconnect protocols, such as 
> cxl/ocxl, fit in here?
> 
> cxl and ocxl are not drivers for a specific device, they are generic 
> drivers which can be used with any device implementing the CAPI or 
> OpenCAPI protocol respectively - many of which will be FPGA boards 
> flashed with customer-designed accelerator cores for specific workloads, 
> some will be accelerators using ASICs or using FPGA images supplied by 
> vendors, some will be driven from userspace, others using the cxl/ocxl 
> kernel API, etc.


I have the same reservation as Andrew. While my first reaction was to 
think that cxl and ocxl should be part of the accel subsystem, they 
hardly seem to fit the stated goals.
Furthermore, there are implications there, as all the distros currently 
shipping cxl and ocxl as modules on powerpc would need to have their 
config modified to enable CONFIG_ACCEL.

   Fred


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 01/15] habanalabs: add skeleton driver
  2019-01-27  8:32           ` gregkh
@ 2019-01-29 22:49             ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-29 22:49 UTC (permalink / raw)
  To: gregkh; +Cc: Arnd Bergmann, Linux Kernel Mailing List, ogabbay

On Sun, Jan 27, 2019 at 10:32 AM gregkh <gregkh@linuxfoundation.org> wrote:
>
> On Sat, Jan 26, 2019 at 11:48:02PM +0200, Oded Gabbay wrote:
> > On Sat, Jan 26, 2019 at 11:14 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Sat, Jan 26, 2019 at 5:25 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > >
> > > > On Sat, Jan 26, 2019 at 6:06 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > >
> > > > > On Wed, Jan 23, 2019 at 1:01 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > >
> > > > > > diff --git a/drivers/misc/habanalabs/include/habanalabs_device_if.h b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..9dbb7077eabd
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/misc/habanalabs/include/habanalabs_device_if.h
> > > > >
> > > > > Since this is apparently a user space ABI, the file should be in
> > > > > include/uapi/linux/,
> > > > > not in the driver directory.
> > > >
> > > > This is not a user space ABI. This is the ABI between the driver and the F/W.
> > >
> > > Ah, I see. In that case, you should get rid of all the bitfields and make the
> > > struct members all __le32/__le64/... to make it work on big-endian kernels.
> > >
> > I really don't want to start converting bitfields and structures to
> > use __le32/64.
> > As I wrote in one of the previous reviews, we don't support big-endian
> > architecture (what's left after POWER moved to support little endian
> > ?).  We actually do run on POWER9 but with ppc64le architecture
> > In any case, our software stack is so big that this minor change in
> > the driver won't have any impact on the overall ability to run
> > something on our H/W
>
> You don't have to do anything at the moment to "convert" to use a
> specific endian, but you do have to always mark such variables that are
> in a specific endian that this is the format they are expected in.
>
> Then, when you run a tool like sparse, you will be notified if you
> happen to be making any assumptions that might not be correct about
> those variables, and it's trivial to usually fix it up at that time.
>
> hope this helps,
>
> greg k-h

ok, understood, fixed.
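
For the archive, a small userspace sketch of what marking the endianness
buys. In the kernel the fields would be declared __le32 and read with
le32_to_cpu(), and sparse would flag any direct use. Here le32_t,
le32_to_host(), host_to_le32(), struct fw_pkt and the opcode mask/shift
are hypothetical names and values, chosen only so the snippet builds
standalone. They are not the actual Goya F/W interface.

```c
#include <stdint.h>

/* Stand-in for the kernel's __le32: a distinct type holding LE bytes. */
typedef struct {
	uint8_t b[4];
} le32_t;

/* Analogous to le32_to_cpu(): same result on any host byte order. */
static uint32_t le32_to_host(le32_t v)
{
	return (uint32_t)v.b[0] |
	       ((uint32_t)v.b[1] << 8) |
	       ((uint32_t)v.b[2] << 16) |
	       ((uint32_t)v.b[3] << 24);
}

/* Analogous to cpu_to_le32(). */
static le32_t host_to_le32(uint32_t x)
{
	le32_t v = { { (uint8_t)x, (uint8_t)(x >> 8),
		       (uint8_t)(x >> 16), (uint8_t)(x >> 24) } };

	return v;
}

/* Hypothetical layout: opcode in bits 24..28 of the control word. */
#define FW_PKT_OPCODE_SHIFT	24
#define FW_PKT_OPCODE_MASK	0x1f000000u

/* Explicit little-endian words instead of C bitfields. */
struct fw_pkt {
	le32_t ctl;
	le32_t value;
};

static uint32_t fw_pkt_opcode(const struct fw_pkt *pkt)
{
	return (le32_to_host(pkt->ctl) & FW_PKT_OPCODE_MASK) >>
		FW_PKT_OPCODE_SHIFT;
}
```

Because le32_t is a separate type, using pkt->ctl as a plain integer does
not compile, which is the same kind of mistake sparse reports for __le32.
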
Thanks,
Oded

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 12/15] habanalabs: add virtual memory and MMU modules
  2019-01-27 16:13   ` Mike Rapoport
@ 2019-01-30 10:34     ` Oded Gabbay
  0 siblings, 0 replies; 103+ messages in thread
From: Oded Gabbay @ 2019-01-30 10:34 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Greg Kroah-Hartman, Linux-Kernel@Vger. Kernel. Org, ogabbay,
	Omer Shpigelman

On Sun, Jan 27, 2019 at 6:13 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Jan 23, 2019 at 02:00:54AM +0200, Oded Gabbay wrote:
> > From: Omer Shpigelman <oshpigelman@habana.ai>
> >
> > This patch adds the Virtual Memory and MMU modules.
> >
> > Goya has an internal MMU which provides process isolation on the internal
> > DDR. The internal MMU also performs translations for transactions that go
> > from Goya to the Host.
> >
> > The driver is responsible for allocating and freeing memory on the DDR
> > upon user request. It also provides an interface to map and unmap DDR and
> > Host memory to the device address space.
> >
> > Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> > Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
> > ---
> >  drivers/misc/habanalabs/Makefile              |    2 +-
> >  drivers/misc/habanalabs/context.c             |   19 +-
> >  drivers/misc/habanalabs/device.c              |   20 +-
> >  drivers/misc/habanalabs/goya/goya.c           |  391 +++++
> >  drivers/misc/habanalabs/habanalabs.h          |  195 +++
> >  drivers/misc/habanalabs/habanalabs_drv.c      |    2 +-
> >  drivers/misc/habanalabs/habanalabs_ioctl.c    |    3 +-
> >  drivers/misc/habanalabs/include/goya/goya.h   |    6 +-
> >  .../include/hw_ip/mmu/mmu_general.h           |   45 +
> >  .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
> >  drivers/misc/habanalabs/memory.c              | 1506 +++++++++++++++++
> >  drivers/misc/habanalabs/mmu.c                 |  604 +++++++
> >  include/uapi/misc/habanalabs.h                |  122 +-
> >  13 files changed, 2922 insertions(+), 8 deletions(-)
> >  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> >  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
> >  create mode 100644 drivers/misc/habanalabs/mmu.c
>
> [ ... ]
>
> > diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> > index e3867615b974..94ee4cb00a49 100644
> > --- a/drivers/misc/habanalabs/goya/goya.c
> > +++ b/drivers/misc/habanalabs/goya/goya.c
>
> [ ... ]
>
> > @@ -265,6 +332,10 @@ static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
> >  };
> >
> >  static int goya_armcp_info_get(struct hl_device *hdev);
> > +static void goya_mmu_prepare(struct hl_device *hdev, u32 asid);
> > +static int goya_mmu_clear_pgt_range(struct hl_device *hdev);
> > +static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
> > +                                     u64 phys_addr);
>
> Nit: are the static declarations necessary? Or is it a matter of moving
> code around?
Honestly, it's a bit of an issue moving them now. Those functions call
other functions, and if I move the first ones I will need to move the
others as well and may end up with a circular dependency.

>
> >
> >  static void goya_get_fixed_properties(struct hl_device *hdev)
> >  {
> > @@ -303,6 +374,16 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
> >       prop->sram_user_base_address = prop->sram_base_address +
> >                                               SRAM_USER_BASE_OFFSET;
> >
> > +     prop->mmu_pgt_addr = MMU_PAGE_TABLES_ADDR;
> > +     if (hdev->pldm)
> > +             prop->mmu_pgt_size = 0x800000; /* 8MB */
> > +     else
> > +             prop->mmu_pgt_size = MMU_PAGE_TABLES_SIZE;
> > +     prop->mmu_pte_size = PTE_SIZE;
> > +     prop->mmu_hop_table_size = HOP_TABLE_SIZE;
> > +     prop->mmu_hop0_tables_total_size = HOP0_TABLES_TOTAL_SIZE;
> > +     prop->dram_page_size = PAGE_SIZE_2MB;
> > +
> >       prop->host_phys_base_address = HOST_PHYS_BASE;
> >       prop->va_space_host_start_address = VA_HOST_SPACE_START;
> >       prop->va_space_host_end_address = VA_HOST_SPACE_END;
>
> [ ... ]
>
> > diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> > new file mode 100644
> > index 000000000000..8d61ee4f2d17
> > --- /dev/null
> > +++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> > @@ -0,0 +1,45 @@
> > +/* SPDX-License-Identifier: GPL-2.0
> > + *
> > + * Copyright 2016-2018 HabanaLabs, Ltd.
> > + * All Rights Reserved.
> > + *
> > + */
> > +
> > +#ifndef INCLUDE_MMU_GENERAL_H_
> > +#define INCLUDE_MMU_GENERAL_H_
> > +
> > +#define PAGE_SHIFT_4KB                       12
> > +#define PAGE_SHIFT_2MB                       21
> > +#define PAGE_SIZE_2MB                        (_AC(1, UL) << PAGE_SHIFT_2MB)
> > +#define PAGE_SIZE_4KB                        (_AC(1, UL) << PAGE_SHIFT_4KB)
> > +
> > +#define PAGE_PRESENT_MASK            0x0000000000001
> > +#define SWAP_OUT_MASK                        0x0000000000004
> > +#define LAST_MASK                    0x0000000000800
> > +#define PHYS_ADDR_MASK                       0x3FFFFFFFFF000ull
> > +#define HOP0_MASK                    0x3000000000000ull
> > +#define HOP1_MASK                    0x0FF8000000000ull
> > +#define HOP2_MASK                    0x0007FC0000000ull
> > +#define HOP3_MASK                    0x000003FE00000
> > +#define HOP4_MASK                    0x00000001FF000
> > +#define OFFSET_MASK                  0x0000000000FFF
> > +
> > +#define HOP0_SHIFT                   48
> > +#define HOP1_SHIFT                   39
> > +#define HOP2_SHIFT                   30
> > +#define HOP3_SHIFT                   21
> > +#define HOP4_SHIFT                   12
> > +
> > +#define PTE_PHYS_ADDR_SHIFT          12
> > +#define PTE_PHYS_ADDR_MASK           ~0xFFF
> > +
> > +#define PTE_SIZE                     sizeof(u64)
>
> I suspect some architectures define PTE_SIZE in arch/*/include/asm
> Probably you'd want to namespace this.
>
Fixed
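
A sketch of how masks and shifts like the ones above split a device
virtual address into per-hop table indices. It is plain userspace C; the
mask and shift values are copied from the quoted header, while the HL_
prefix and hl_hop_idx() are illustrative names only and not necessarily
what the driver ends up using.

```c
#include <stdint.h>

/* Values copied from the header quoted above; the HL_ prefix is illustrative. */
#define HL_HOP0_MASK	0x3000000000000ull
#define HL_HOP1_MASK	0x0FF8000000000ull
#define HL_HOP2_MASK	0x0007FC0000000ull
#define HL_HOP3_MASK	0x000003FE00000ull
#define HL_HOP4_MASK	0x00000001FF000ull
#define HL_OFFSET_MASK	0x0000000000FFFull

#define HL_HOP0_SHIFT	48
#define HL_HOP1_SHIFT	39
#define HL_HOP2_SHIFT	30
#define HL_HOP3_SHIFT	21
#define HL_HOP4_SHIFT	12

/* Index of the entry to follow in the hop table selected by mask/shift. */
static uint64_t hl_hop_idx(uint64_t va, uint64_t mask, unsigned int shift)
{
	return (va & mask) >> shift;
}
```

For example, an address built as (3ull << 39) | (9ull << 12) | 0x123
yields index 3 at hop 1, index 9 at hop 4 and a page offset of 0x123.
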


> > +#define HOP_TABLE_SIZE                       PAGE_SIZE_4KB
> > +#define HOP0_TABLES_TOTAL_SIZE               (HOP_TABLE_SIZE * MAX_ASID)
> > +
> > +#define MMU_HOP0_PA43_12_SHIFT               12
> > +#define MMU_HOP0_PA49_44_SHIFT               (12 + 32)
> > +
> > +#define MMU_CONFIG_TIMEOUT_USEC              2000 /* 2 ms */
> > +
> > +#endif /* INCLUDE_MMU_GENERAL_H_ */
> > diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
> > index 94cbb252656d..c41ea19502e5 100644
> > --- a/drivers/misc/habanalabs/memory.c
> > +++ b/drivers/misc/habanalabs/memory.c
> > @@ -5,12 +5,1193 @@
> >   * All Rights Reserved.
> >   */
> >
> > +#include <uapi/misc/habanalabs.h>
> >  #include "habanalabs.h"
> > +#include "include/hw_ip/mmu/mmu_general.h"
> >
> >  #include <linux/sched.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/genalloc.h>
> >
> > +#define HL_MMU_DEBUG 0
> > +
> > +/*
> > + * The va ranges in context object contain a list with the available chunks of
> > + * device virtual memory.
> > + * There is one range for host allocations and one for DRAM allocations.
> > + *
> > + * On initialization each range contains one chunk of all of its available
> > + * virtual range which is a half of the total device virtual range.
> > + *
> > + * On each mapping of physical pages, a suitable virtual range chunk (with a
> > + * minimum size) is selected from the list. If the chunk size equals the
> > + * requested size, the chunk is returned. Otherwise, the chunk is split into
> > + * two chunks - one to return as result and a remainder to stay in the list.
> > + *
> > + * On each Unmapping of a virtual address, the relevant virtual chunk is
> > + * returned to the list. The chunk is added to the list and if its edges match
> > + * the edges of the adjacent chunks (means a contiguous chunk can be created),
> > + * the chunks are merged.
> > + *
> > + * On finish, the list is checked to have only one chunk of all the relevant
> > + * virtual range (which is a half of the device total virtual range).
> > + * If not (means not all mappings were unmapped), a warning is printed.
> > + */
> > +
> > +/**
> > + * alloc_device_memory - allocate device memory
> > + *
> > + * @ctx                 : current context
> > + * @args                : host parameters containing the requested size
> > + * @ret_handle          : result handle
> > + *
> > + * This function does the following:
> > + * - Allocate the requested size rounded up to 2MB pages
> > + * - Return unique handle
> > + */
> > +static int alloc_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args,
> > +                             u32 *ret_handle)
> > +{
> > +     struct hl_device *hdev = ctx->hdev;
> > +     struct hl_vm *vm = &hdev->vm;
> > +     struct hl_vm_phys_pg_list *phys_pg_list;
> > +     struct hl_vm_phys_pg *phys_pg, *tmp;
> > +     u64 paddr = 0;
> > +     u32 total_size, num_pgs, page_size, page_shift;
> > +     int handle, rc, i;
> > +     bool contiguous;
> > +
> > +     page_size = hdev->asic_prop.dram_page_size;
> > +     page_shift = __ffs(page_size);
>
> Maybe it's worth storing page_shift in the asic_prop and calculating
> page_size.
>
> > +     num_pgs = (args->alloc.mem_size + (page_size - 1)) >> page_shift;
> > +     total_size = num_pgs << page_shift;
> > +
> > +     contiguous = args->flags & HL_MEM_CONTIGUOUS;
> > +
> > +     if (contiguous) {
> > +             paddr = (u64) gen_pool_alloc(vm->dram_pg_pool, total_size);
> > +             if (!paddr) {
> > +                     dev_err(hdev->dev,
> > +                             "failed to allocate %u huge contiguous pages\n",
> > +                             num_pgs);
> > +                     return -ENOMEM;
> > +             }
> > +     }
> > +
> > +     phys_pg_list = kzalloc(sizeof(*phys_pg_list), GFP_KERNEL);
> > +     if (!phys_pg_list) {
> > +             rc = -ENOMEM;
> > +             goto page_list_err;
> > +     }
> > +
> > +     phys_pg_list->vm_type = VM_TYPE_PHYS_LIST;
> > +     phys_pg_list->asid = ctx->asid;
> > +     phys_pg_list->total_size = total_size;
> > +     phys_pg_list->flags = args->flags;
> > +     phys_pg_list->contiguous = contiguous;
> > +     INIT_LIST_HEAD(&phys_pg_list->list);
> > +
> > +     for (i = 0 ; i < num_pgs ; i++) {
> > +             phys_pg = kzalloc(sizeof(*phys_pg), GFP_KERNEL);
>
> Consider adding *phys_pgs to phys_pg_list using kcalloc() before the loop.
Fixed in a refactor that makes it an array
>
> > +             if (!phys_pg) {
> > +                     rc = -ENOMEM;
> > +                     goto pb_err;
> > +             }
> > +
> > +             phys_pg->page_size = page_size;
> > +
> > +             if (phys_pg_list->contiguous) {
> > +                     phys_pg->paddr = paddr + i * phys_pg->page_size;
> > +             } else {
> > +                     phys_pg->paddr =
> > +                             (u64) gen_pool_alloc(vm->dram_pg_pool,
> > +                                                     phys_pg->page_size);
> > +                     if (!phys_pg->paddr) {
> > +                             dev_err(hdev->dev, "ioctl failed to allocate page\n");
> > +                             kfree(phys_pg);
> > +                             rc = -ENOMEM;
> > +                             goto pb_err;
> > +                     }
> > +             }
> > +
> > +             list_add_tail(&phys_pg->node, &phys_pg_list->list);
> > +     }
> > +
> > +     spin_lock(&vm->idr_lock);
> > +     handle = idr_alloc(&vm->phys_pg_list_handles, phys_pg_list, 1, 0,
> > +                             GFP_ATOMIC);
> > +     spin_unlock(&vm->idr_lock);
> > +
> > +     if (handle < 0) {
> > +             dev_err(hdev->dev, "Failed to get handle for page\n");
> > +             rc = -EFAULT;
> > +             goto idr_err;
> > +     }
> > +
> > +     for (i = 0; i < num_pgs ; i++)
> > +             kref_get(&vm->dram_pg_pool_refcount);
> > +
> > +     phys_pg_list->handle = handle;
> > +
> > +     atomic64_add(phys_pg_list->total_size, &ctx->dram_phys_mem);
> > +     atomic64_add(phys_pg_list->total_size, &hdev->dram_used_mem);
> > +
> > +     *ret_handle = handle;
> > +
> > +     return 0;
> > +
> > +idr_err:
> > +pb_err:
> > +     list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
> > +             if (!phys_pg_list->contiguous)
> > +                     gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
> > +                                     phys_pg->page_size);
> > +
> > +             list_del(&phys_pg->node);
> > +             kfree(phys_pg);
> > +     }
> > +
> > +     kfree(phys_pg_list);
> > +page_list_err:
> > +     if (contiguous)
> > +             gen_pool_free(vm->dram_pg_pool, paddr, total_size);
> > +
> > +     return rc;
> > +}
>
> [ ... ]
>
> > +/**
> > + * free_phys_pg_list    - free physical page list
> > + *
> > + * @hdev                : habanalabs device structure
> > + * @phys_pg_list        : physical page list to free
> > + *
> > + * This function does the following:
> > + * - Iterate over the list and free each physical block structure
> > + * - In case of allocated memory, return the physical memory to the general pool
> > + * - Free the hl_vm_phys_pg_list structure
> > + */
> > +static void free_phys_pg_list(struct hl_device *hdev,
> > +             struct hl_vm_phys_pg_list *phys_pg_list)
> > +{
> > +     struct hl_vm *vm = &hdev->vm;
> > +     struct hl_vm_phys_pg *phys_pg, *tmp;
> > +     u32 num_pgs;
> > +     bool first = true;
> > +     int i;
> > +
> > +     list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
> > +             /*
> > +              * this if statement is relevant only when called from
> > +              * hl_vm_ctx_fini() and free_device_memory()
> > +              */
> > +             if (!phys_pg_list->created_from_userptr) {
> > +                     if ((phys_pg_list->contiguous) && (first)) {
> > +                             first = false;
> > +                             gen_pool_free(vm->dram_pg_pool,
> > +                                             phys_pg->paddr,
> > +                                             phys_pg_list->total_size);
> > +
> > +                             num_pgs = phys_pg_list->total_size >>
> > +                                     __ffs(hdev->asic_prop.dram_page_size);
> > +
> > +                             for (i = 0; i < num_pgs ; i++)
> > +                                     kref_put(&vm->dram_pg_pool_refcount,
> > +                                             dram_pg_pool_do_release);
> > +
> > +                     } else if (!phys_pg_list->contiguous) {
> > +                             gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
> > +                                             phys_pg->page_size);
> > +                             kref_put(&vm->dram_pg_pool_refcount,
> > +                                             dram_pg_pool_do_release);
> > +                     }
> > +             }
> > +
> > +             list_del(&phys_pg->node);
> > +             kfree(phys_pg);
> > +     }
>
> Unless I'm missing something this can be simplified a bit:
>
> if (!phys_pg_list->created_from_userptr) {
>         for (i = 0; i < num_pgs ; i++)
>                 kref_put(&vm->dram_pg_pool_refcount,
>                          dram_pg_pool_do_release);
>         if (phys_pg_list->contiguous)
>                 gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
>                               phys_pg_list->total_size);
> }
>
> list_for_each_entry_safe(phys_pg, tmp, &phys_pg_list->list, node) {
>         if (!phys_pg_list->created_from_userptr &&
>             !phys_pg_list->contiguous)
>                 gen_pool_free(vm->dram_pg_pool, phys_pg->paddr,
>                               phys_pg->page_size);
>         list_del(&phys_pg->node);
>         kfree(phys_pg);
> }
>
> And with phys_pg's array hanging from phys_pg_list it would be even simpler
> ;-)
Definitely agree. Fixed with a refactor to make it an array instead of a
list, as there is no real need for a list here.
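
Roughly, an array-based layout could look like the userspace sketch below.
It is only an illustration under stated assumptions: struct phys_pg_pack,
pack_alloc() and pack_free() are hypothetical names, calloc() stands in
for kcalloc(), and the gen_pool and refcount handling is left out. It
also keeps the page shift and derives the page size from it, as suggested
earlier in the review.

```c
#include <stdint.h>
#include <stdlib.h>

struct phys_pg_pack {
	uint64_t *pages;	/* physical address of each page */
	uint64_t npages;
	uint64_t total_size;	/* npages << page_shift */
	uint32_t page_shift;
};

/* Allocate bookkeeping for 'size' bytes rounded up to whole pages. */
static struct phys_pg_pack *pack_alloc(uint64_t size, uint32_t page_shift)
{
	uint64_t page_size = 1ull << page_shift;
	struct phys_pg_pack *pack;

	if (!size)
		return NULL;

	pack = calloc(1, sizeof(*pack));
	if (!pack)
		return NULL;

	pack->page_shift = page_shift;
	pack->npages = (size + page_size - 1) >> page_shift;
	pack->total_size = pack->npages << page_shift;

	/* One allocation for all entries instead of one per list node. */
	pack->pages = calloc(pack->npages, sizeof(*pack->pages));
	if (!pack->pages) {
		free(pack);
		return NULL;
	}

	return pack;
}

static void pack_free(struct phys_pg_pack *pack)
{
	if (!pack)
		return;

	free(pack->pages);
	free(pack);
}
```

With a single array the error path collapses to two frees, and the free
path no longer needs the "first" flag for the contiguous case.
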

>
> > +
> > +     kfree(phys_pg_list);
> > +}
> > +
>
> --
> Sincerely yours,
> Mike.
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH/RFC 0/5] HW accel subsystem
  2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
                               ` (5 preceding siblings ...)
  2019-01-26 21:11             ` [PATCH/RFC 0/5] HW accel subsystem Arnd Bergmann
@ 2019-02-01  9:10             ` Kenneth Lee
  2019-02-01 10:07               ` Greg Kroah-Hartman
  6 siblings, 1 reply; 103+ messages in thread
From: Kenneth Lee @ 2019-02-01  9:10 UTC (permalink / raw)
  To: Olof Johansson
  Cc: linux-kernel, ogabbay, Greg Kroah-Hartman, jglisse,
	Andrew Donnellan, Frederic Barrat, airlied, linux-accelerators,
	nek.in.cn

On Fri, Jan 25, 2019 at 10:16:11AM -0800, Olof Johansson wrote:
> Date: Fri, 25 Jan 2019 10:16:11 -0800
> From: Olof Johansson <olof@lixom.net>
> To: linux-kernel@vger.kernel.org
> CC: ogabbay@habana.ai, Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
>  jglisse@redhat.com, Andrew Donnellan <andrew.donnellan@au1.ibm.com>,
>  Frederic Barrat <fbarrat@linux.ibm.com>, airlied@redhat.com,
>  linux-accelerators@lists.ozlabs.org
> Subject: [PATCH/RFC 0/5] HW accel subsystem
> X-Mailer: git-send-email 2.11.0
> Message-ID: <20190125181616.62609-1-olof@lixom.net>
> 
> Per discussion on the Habana Labs driver submission
> (https://lore.kernel.org/lkml/20190123000057.31477-1-oded.gabbay@gmail.com/),
> it seems to be time to create a separate subsystem for hw accelerators
> instead of letting them proliferate around the tree (and/or in misc).
> 
> There's a difference of opinion on how stringent the requirements are for
> a fully open stack for these kinds of drivers. I've documented the middle
> road approach in the first patch (requiring some sort of open low-level
> userspace for the kernel interaction, and a way to use/test it).
> 
> Comments and suggestions for better approaches are definitely welcome.

Dear Olof,

How are you? Let me introduce myself. My name is Kenneth Lee, and I work for
Hisilicon. Our company provides server, AI, networking and terminal SoCs to the
market. We tried to create an accelerator framework a year ago, and we are now
working on the branch here (there is documentation in the Documentation/warpdrive
directory):

https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/wdprd-v1

The user space framework is here:

https://github.com/Kenneth-Lee/warpdrive/tree/wdprd-v1

We tried to build it on VFIO at the very beginning. The RFCv1 is here:

https://lwn.net/Articles/763990/

But it does not seem to fit. There are two major issues:

1. The VFIO framework enforces the concept of separating the resource into
   devices before using it. This is not the accelerator style. An accelerator
   is another CPU that is shared among its users.
2. The way VFIO pins memory in place has a flaw. In the current kernel, if
   you fork a sub-process after pinning the DMA memory, you may lose the
   physical pages. (You can get more detail in the threads.)
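
The fork problem in item 2 comes down to copy-on-write. Below is a minimal
sketch, assuming a Linux/POSIX system. An ordinary private anonymous page stands
in for the pinned DMA buffer, the function name child_view_after_parent_write
is hypothetical, and no real device or VFIO call is involved:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the byte the child reads from a private
 * anonymous page after the parent has overwritten it post-fork, or -1 on
 * error. The page plays the role of a buffer pinned before fork().
 */
int child_view_after_parent_write(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	char token = 'x';
	int go[2];
	int status = 0;
	pid_t pid;

	if (page <= 0 || pipe(go) != 0)
		return -1;

	buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		close(go[0]);
		close(go[1]);
		return -1;
	}
	buf[0] = 1;	/* touch the page so it has backing memory before fork */

	pid = fork();
	if (pid < 0) {
		close(go[0]);
		close(go[1]);
		munmap(buf, (size_t)page);
		return -1;
	}
	if (pid == 0) {
		/* Child: wait until the parent has written, then report. */
		close(go[1]);
		if (read(go[0], &token, 1) != 1)
			_exit(255);
		_exit(buf[0]);
	}

	close(go[0]);
	buf[0] = 2;		/* write after fork: the views now diverge */
	assert(buf[0] == 2);	/* the parent sees its own write */
	if (write(go[1], &token, 1) != 1)
		status = -1;	/* child will then exit with 255 */
	close(go[1]);

	if (waitpid(pid, &status, 0) != pid)
		status = -1;
	munmap(buf, (size_t)page);
	return (status != -1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

The child still reads 1 while the parent reads 2, so after the write the two
processes are no longer backed by the same physical page. A device that pinned
the page before the fork keeps using the original page, which may no longer be
the one the parent maps.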

So we tried RFCv2 and built the solution directly on the IOMMU. We call our
solution WarpDrive, and the kernel module is called uacce. Our assumptions are:

1. Most users of the accelerator are in user space.
2. An accelerator is always another heterogeneous processor. It waits for and
   processes workloads sent from the CPU.
3. The data structure in the CPU may be complex. It is no good to wrap the data
   and send it to hardware again and again. The better way is to keep the data
   in place and give a pointer to the accelerator, leaving it to finish the job.

So we create a pipe (we call it a queue) between the user process and the
hardware directly. It is presented to user space as a file. The user process
mmaps the queue file to address the MMIO space of the hardware, share memory,
and so on. With the capability of the IOMMU, we can share the whole or part of
the process address space with the hardware. This can make the software
solution easier.
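
From the user side, the queue-as-a-file model could look roughly like the
sketch below. It assumes a POSIX system. map_queue and queue_roundtrip are
hypothetical names, not the uacce or WarpDrive API, and a temporary regular
file stands in for the real queue file so that the example runs without
hardware:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map len bytes of the queue file; NULL on failure. */
void *map_queue(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the descriptor is closed */
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Stand-in check: a temporary regular file plays the role of the queue file.
 * Returns 0 if a store made through the mapping is visible through pread().
 */
int queue_roundtrip(void)
{
	char path[] = "/tmp/queue-XXXXXX";
	const char msg[] = "doorbell";
	char back[sizeof(msg)] = { 0 };
	const size_t len = 4096;
	unsigned char *q;
	int ok = 0;
	int fd = mkstemp(path);

	assert(sizeof(msg) <= len);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) == 0) {
		q = map_queue(path, len);
		if (q) {
			/* A plain store; no write() call is involved. */
			memcpy(q, msg, sizeof(msg));
			msync(q, len, MS_SYNC);
			ok = pread(fd, back, sizeof(back), 0) ==
				     (ssize_t)sizeof(back) &&
			     memcmp(back, msg, sizeof(msg)) == 0;
			munmap(q, len);
		}
	}
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

With a real accelerator the mapped region would be device MMIO or memory shared
through the IOMMU rather than file data, but the user-side pattern (open the
file, mmap it, then use plain loads and stores) is the same idea.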

After the RFCv2 was sent to the lkml, we did not get much feedback. But the
InfiniBand guys said they did not like it. They think the solution is a
re-invention of ib-verbs.

But we do not think so. ib-verbs maintains the semantics of "REMOTE memory",
whereas UACCE maintains the semantics of "LOCAL memory". We don't need to send
or sync memory with other parties. We share that memory with all processes
that share the local bus.

But we know we need a more "complete" solution to let people understand and
accept our idea. So now we are working on it with our Compression and RSA
accelerators on the Hi1620 Server SoC. We are also planning to port our AI
framework to it.

Do you think we can cooperate to create a framework in Linux together? Please
feel free to ask for more information. We are happy to answer.


Cheers
-- 
			-Kenneth(Hisilicon)


* Re: [PATCH/RFC 0/5] HW accel subsystem
  2019-02-01  9:10             ` Kenneth Lee
@ 2019-02-01 10:07               ` Greg Kroah-Hartman
  2019-02-01 12:09                 ` Kenneth Lee
  0 siblings, 1 reply; 103+ messages in thread
From: Greg Kroah-Hartman @ 2019-02-01 10:07 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Olof Johansson, linux-kernel, ogabbay, jglisse, Andrew Donnellan,
	Frederic Barrat, airlied, linux-accelerators, nek.in.cn

On Fri, Feb 01, 2019 at 05:10:40PM +0800, Kenneth Lee wrote:
> After the RFCv2 was sent to the lkml, we do not get much feedback. But the
> Infini-band guys said they did not like it. They think the solution is
> re-invention of ib-verbs.

No one needs to re-invent a monstrosity that is ib-verbs.  If anything,
that is a model that should never be recreated again, showing that we
can learn from past mistakes :)

> But we do not think so. ib-verbs maintains semantics of "REMOTE memory". But
> UACCE maintains semantics of "LOCAL memory". We don't need to send, or sync
> memory with other parties. We share those memory with all processes who share
> the local bus.

I agree, don't try to duplicate the mess that people moved away from
(hint, everyone sane wraps ib-verbs in another model that can actually
be used and understood...)

> But we know we need more "complete" solution to let people understand and accept
> our idea. So now we are working on it with our Compression and RSA accelerator
> on Hi1620 Server SoC. We are also planning to port our AI framework on it.
> 
> Do you think we can cooperate to create an framework in Linux together? Please
> feel free to ask for more information. We are happy to answer it.

Sure, that sounds like a great goal!

thanks,

greg k-h

* Re: [PATCH/RFC 0/5] HW accel subsystem
  2019-02-01 10:07               ` Greg Kroah-Hartman
@ 2019-02-01 12:09                 ` Kenneth Lee
  0 siblings, 0 replies; 103+ messages in thread
From: Kenneth Lee @ 2019-02-01 12:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Kenneth Lee
  Cc: Olof Johansson, linux-kernel, ogabbay, jglisse, Andrew Donnellan,
	Frederic Barrat, airlied, linux-accelerators


On 2019/2/1 at 6:07 PM, Greg Kroah-Hartman wrote:
> On Fri, Feb 01, 2019 at 05:10:40PM +0800, Kenneth Lee wrote:
>> After the RFCv2 was sent to the lkml, we do not get much feedback. But the
>> Infini-band guys said they did not like it. They think the solution is
>> re-invention of ib-verbs.
> No one needs to re-invent a monstrosity that is ib-verbs.  If anything,
> that is a model that should never be recreated again, showing that we
> can learn from past mistakes :)
>
>> But we do not think so. ib-verbs maintains semantics of "REMOTE memory". But
>> UACCE maintains semantics of "LOCAL memory". We don't need to send, or sync
>> memory with other parties. We share those memory with all processes who share
>> the local bus.
> I agree, don't try to duplicate the mess that people moved away from
> (hint, everyone sane wraps ib-verbs in another model that can actually
> be used and understood...)
>
>> But we know we need more "complete" solution to let people understand and accept
>> our idea. So now we are working on it with our Compression and RSA accelerator
>> on Hi1620 Server SoC. We are also planning to port our AI framework on it.
>>
>> Do you think we can cooperate to create an framework in Linux together? Please
>> feel free to ask for more information. We are happy to answer it.
> Sure, that sounds like a great goal!

Thank you very much for your encouragement:)

Kenneth Lee

>
> thanks,
>
> greg k-h
>

* Re: [PATCH 00/15] Habana Labs kernel driver
  2019-01-23 21:52 ` Olof Johansson
  2019-01-23 22:40   ` Oded Gabbay
  2019-01-24  1:03   ` Andrew Donnellan
@ 2019-02-24 22:23   ` Pavel Machek
  2 siblings, 0 replies; 103+ messages in thread
From: Pavel Machek @ 2019-02-24 22:23 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Oded Gabbay, Dave Airlie, Greg Kroah-Hartman,
	Linux Kernel Mailing List, ogabbay, Arnd Bergmann, fbarrat,
	andrew.donnellan

Hi!

> So, I'd like to propose a drivers/accel drivers subtree, and I'd be
> happy to bootstrap it with a small group (@Dave Airlie: I think your
> input from GPU land be very useful, want to join in?). Individual
> drivers maintained by existing maintainers, of course.

Does this sound similar?

menu "Remoteproc drivers"

config REMOTEPROC
	tristate "Support for Remote Processor subsystem"
	depends on HAS_DMA
	select CRC32
	select FW_LOADER
	select VIRTIO
	select WANT_DEV_COREDUMP
	help
	  Support for remote processors (such as DSP coprocessors). These
	  are mainly used on embedded systems.

...we already have that in tree...

								Pavel
								
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

end of thread, other threads:[~2019-02-24 22:28 UTC | newest]

Thread overview: 103+ messages
2019-01-23  0:00 [PATCH 00/15] Habana Labs kernel driver Oded Gabbay
2019-01-23  0:00 ` [PATCH 01/15] habanalabs: add skeleton driver Oded Gabbay
2019-01-23  0:49   ` Joe Perches
2019-01-25 19:18     ` Oded Gabbay
2019-01-23 12:28   ` Mike Rapoport
2019-01-23 12:40     ` Greg KH
2019-01-23 12:55       ` Mike Rapoport
2019-01-25 20:09         ` Oded Gabbay
2019-01-25 20:05     ` Oded Gabbay
2019-01-26 16:05   ` Arnd Bergmann
2019-01-26 16:24     ` Oded Gabbay
2019-01-26 21:14       ` Arnd Bergmann
2019-01-26 21:48         ` Oded Gabbay
2019-01-27  8:32           ` gregkh
2019-01-29 22:49             ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 03/15] habanalabs: add basic Goya support Oded Gabbay
2019-01-23 12:28   ` Mike Rapoport
2019-01-25 20:32     ` Oded Gabbay
2019-01-27  6:39       ` Mike Rapoport
2019-01-28  7:44         ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 04/15] habanalabs: add context and ASID modules Oded Gabbay
2019-01-23 12:28   ` Mike Rapoport
2019-01-25 21:07     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 05/15] habanalabs: add command buffer module Oded Gabbay
2019-01-23 12:28   ` Mike Rapoport
2019-01-25 21:47     ` Oded Gabbay
2019-01-27  6:49       ` Mike Rapoport
2019-01-28  7:55         ` Oded Gabbay
2019-01-28  8:41           ` Mike Rapoport
2019-01-23  0:00 ` [PATCH 06/15] habanalabs: add basic Goya h/w initialization Oded Gabbay
2019-01-25  7:46   ` Mike Rapoport
2019-01-28 10:35     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 07/15] habanalabs: add h/w queues module Oded Gabbay
2019-01-25  7:50   ` Mike Rapoport
2019-01-28 10:50     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 08/15] habanalabs: add event queue and interrupts Oded Gabbay
2019-01-25  7:51   ` Mike Rapoport
2019-01-28 11:14     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 09/15] habanalabs: add sysfs and hwmon support Oded Gabbay
2019-01-25  7:54   ` Mike Rapoport
2019-01-28 11:26     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 10/15] habanalabs: add device reset support Oded Gabbay
2019-01-27  7:51   ` Mike Rapoport
2019-01-28 12:53     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 11/15] habanalabs: add command submission module Oded Gabbay
2019-01-27 15:11   ` Mike Rapoport
2019-01-28 13:51     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
2019-01-27 16:13   ` Mike Rapoport
2019-01-30 10:34     ` Oded Gabbay
2019-01-23  0:00 ` [PATCH 13/15] habanalabs: implement INFO IOCTL Oded Gabbay
2019-01-23  0:00 ` [PATCH 14/15] habanalabs: add debugfs support Oded Gabbay
2019-01-23  0:00 ` [PATCH 15/15] Update MAINTAINERS and CREDITS with habanalabs info Oded Gabbay
2019-01-23 12:27 ` [PATCH 00/15] Habana Labs kernel driver Mike Rapoport
2019-01-23 22:43   ` Oded Gabbay
2019-01-23 21:52 ` Olof Johansson
2019-01-23 22:40   ` Oded Gabbay
2019-01-23 23:16     ` Olof Johansson
2019-01-24  1:03   ` Andrew Donnellan
2019-01-24 11:59     ` Jonathan Cameron
2019-01-25 17:13     ` Olof Johansson
2019-02-24 22:23   ` Pavel Machek
2019-01-23 21:57 ` Dave Airlie
2019-01-23 22:02   ` Dave Airlie
2019-01-23 22:31     ` Oded Gabbay
2019-01-23 22:45       ` Dave Airlie
2019-01-23 23:04         ` Olof Johansson
2019-01-23 23:20           ` Jerome Glisse
2019-01-23 23:35             ` Oded Gabbay
2019-01-23 23:41               ` Olof Johansson
2019-01-23 23:40             ` Olof Johansson
2019-01-23 23:48               ` Jerome Glisse
2019-01-24  7:35                 ` Daniel Vetter
2019-01-24  9:50                   ` Oded Gabbay
2019-01-24 10:22                     ` Dave Airlie
2019-01-25  0:13                       ` Olof Johansson
2019-01-25  7:43                         ` Daniel Vetter
2019-01-25 15:02                           ` Olof Johansson
2019-01-25 16:00                             ` Daniel Vetter
2019-01-24 23:51                   ` Olof Johansson
2019-01-23 23:23           ` Oded Gabbay
2019-01-25  7:37   ` Greg Kroah-Hartman
2019-01-25 15:33     ` Olof Johansson
2019-01-25 16:06       ` Greg Kroah-Hartman
2019-01-25 17:12         ` Olof Johansson
2019-01-25 18:16           ` [PATCH/RFC 0/5] HW accel subsystem Olof Johansson
2019-01-25 18:16             ` [PATCH 1/5] drivers/accel: Introduce subsystem Olof Johansson
2019-01-25 21:13               ` [PATCH v2 " Olof Johansson
2019-01-26 17:09                 ` Randy Dunlap
2019-01-27  4:31                 ` Andrew Donnellan
2019-01-28 19:36                   ` Frederic Barrat
2019-01-25 22:23               ` [PATCH " Daniel Vetter
2019-01-27 16:31                 ` Daniel Vetter
2019-01-25 18:16             ` [PATCH 2/5] cxl: Move to drivers/accel Olof Johansson
2019-01-25 18:16             ` [PATCH 3/5] drivers/accel: cxl: Move non-uapi include files Olof Johansson
2019-01-25 18:16             ` [PATCH 4/5] ocxl: Move to drivers/accel Olof Johansson
2019-01-25 18:16             ` [PATCH 5/5] drivers/accel: ocxl: Move non-uapi include files Olof Johansson
2019-01-26 13:51               ` Greg Kroah-Hartman
2019-01-26 21:11             ` [PATCH/RFC 0/5] HW accel subsystem Arnd Bergmann
2019-02-01  9:10             ` Kenneth Lee
2019-02-01 10:07               ` Greg Kroah-Hartman
2019-02-01 12:09                 ` Kenneth Lee
2019-01-26 13:52           ` [PATCH 00/15] Habana Labs kernel driver Greg Kroah-Hartman
