LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCH v4 00/15] Habana Labs kernel driver
@ 2019-02-11 15:17 Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 01/15] habanalabs: add skeleton driver Oded Gabbay
                   ` (14 more replies)
  0 siblings, 15 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

Hello,
This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
according to reviews done on v3, mainly for the command buffer, sysfs and MMU
patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.

The patch-set is rebased on v5.0-rc6.

Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033

Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003

Link to v1 cover letter: https://lwn.net/Articles/777342/

I would appricate any feedback, question and/or review.

Thanks,
Oded

p.s. for those who prefer to clone the tree instead of looking at the
emails, you can grab a copy from our company's page in GitHub:

https://github.com/HabanaAI/linux/releases/tag/hl_patchset_v4_20190211

Oded Gabbay (14):
  habanalabs: add skeleton driver
  habanalabs: add Goya registers header files
  habanalabs: add basic Goya support
  habanalabs: add context and ASID modules
  habanalabs: add command buffer module
  habanalabs: add basic Goya h/w initialization
  habanalabs: add h/w queues module
  habanalabs: add event queue and interrupts
  habanalabs: add sysfs and hwmon support
  habanalabs: add device reset support
  habanalabs: add command submission module
  habanalabs: implement INFO IOCTL
  habanalabs: add debugfs support
  Update MAINTAINERS and CREDITS with habanalabs info

Omer Shpigelman (1):
  habanalabs: add virtual memory and MMU modules

 CREDITS                                       |    2 +-
 .../ABI/testing/debugfs-driver-habanalabs     |  126 +
 .../ABI/testing/sysfs-driver-habanalabs       |  190 +
 MAINTAINERS                                   |    9 +
 drivers/misc/Kconfig                          |    1 +
 drivers/misc/Makefile                         |    1 +
 drivers/misc/habanalabs/Kconfig               |   22 +
 drivers/misc/habanalabs/Makefile              |   14 +
 drivers/misc/habanalabs/asid.c                |   58 +
 drivers/misc/habanalabs/command_buffer.c      |  440 ++
 drivers/misc/habanalabs/command_submission.c  |  782 +++
 drivers/misc/habanalabs/context.c             |  216 +
 drivers/misc/habanalabs/debugfs.c             | 1064 ++++
 drivers/misc/habanalabs/device.c              | 1117 ++++
 drivers/misc/habanalabs/goya/Makefile         |    3 +
 drivers/misc/habanalabs/goya/goya.c           | 5359 +++++++++++++++++
 drivers/misc/habanalabs/goya/goyaP.h          |  193 +
 drivers/misc/habanalabs/goya/goya_hwmgr.c     |  254 +
 drivers/misc/habanalabs/goya/goya_security.c  | 2999 +++++++++
 drivers/misc/habanalabs/habanalabs.h          | 1453 +++++
 drivers/misc/habanalabs/habanalabs_drv.c      |  465 ++
 drivers/misc/habanalabs/habanalabs_ioctl.c    |  234 +
 drivers/misc/habanalabs/hw_queue.c            |  636 ++
 drivers/misc/habanalabs/hwmon.c               |  449 ++
 drivers/misc/habanalabs/include/armcp_if.h    |  335 ++
 .../goya/asic_reg/cpu_ca53_cfg_masks.h        |  191 +
 .../include/goya/asic_reg/cpu_ca53_cfg_regs.h |   61 +
 .../include/goya/asic_reg/cpu_if_regs.h       |   49 +
 .../include/goya/asic_reg/cpu_pll_regs.h      |  105 +
 .../include/goya/asic_reg/dma_ch_0_regs.h     |  209 +
 .../include/goya/asic_reg/dma_ch_1_regs.h     |  209 +
 .../include/goya/asic_reg/dma_ch_2_regs.h     |  209 +
 .../include/goya/asic_reg/dma_ch_3_regs.h     |  209 +
 .../include/goya/asic_reg/dma_ch_4_regs.h     |  209 +
 .../include/goya/asic_reg/dma_macro_masks.h   |  105 +
 .../include/goya/asic_reg/dma_macro_regs.h    |  181 +
 .../include/goya/asic_reg/dma_nrtr_masks.h    |  209 +
 .../include/goya/asic_reg/dma_nrtr_regs.h     |  227 +
 .../include/goya/asic_reg/dma_qm_0_masks.h    |  465 ++
 .../include/goya/asic_reg/dma_qm_0_regs.h     |  179 +
 .../include/goya/asic_reg/dma_qm_1_regs.h     |  179 +
 .../include/goya/asic_reg/dma_qm_2_regs.h     |  179 +
 .../include/goya/asic_reg/dma_qm_3_regs.h     |  179 +
 .../include/goya/asic_reg/dma_qm_4_regs.h     |  179 +
 .../include/goya/asic_reg/goya_blocks.h       | 1372 +++++
 .../include/goya/asic_reg/goya_masks.h        |  275 +
 .../include/goya/asic_reg/goya_regs.h         |  117 +
 .../include/goya/asic_reg/ic_pll_regs.h       |  105 +
 .../include/goya/asic_reg/mc_pll_regs.h       |  105 +
 .../include/goya/asic_reg/mme1_rtr_masks.h    |  653 ++
 .../include/goya/asic_reg/mme1_rtr_regs.h     |  331 +
 .../include/goya/asic_reg/mme2_rtr_regs.h     |  331 +
 .../include/goya/asic_reg/mme3_rtr_regs.h     |  331 +
 .../include/goya/asic_reg/mme4_rtr_regs.h     |  331 +
 .../include/goya/asic_reg/mme5_rtr_regs.h     |  331 +
 .../include/goya/asic_reg/mme6_rtr_regs.h     |  331 +
 .../include/goya/asic_reg/mme_cmdq_masks.h    |  373 ++
 .../include/goya/asic_reg/mme_cmdq_regs.h     |  139 +
 .../include/goya/asic_reg/mme_masks.h         | 1537 +++++
 .../include/goya/asic_reg/mme_qm_masks.h      |  465 ++
 .../include/goya/asic_reg/mme_qm_regs.h       |  179 +
 .../include/goya/asic_reg/mme_regs.h          | 1153 ++++
 .../include/goya/asic_reg/mmu_masks.h         |  143 +
 .../include/goya/asic_reg/mmu_regs.h          |   53 +
 .../include/goya/asic_reg/pci_nrtr_masks.h    |  209 +
 .../include/goya/asic_reg/pci_nrtr_regs.h     |  227 +
 .../include/goya/asic_reg/pcie_aux_regs.h     |  243 +
 .../goya/asic_reg/psoc_emmc_pll_regs.h        |  105 +
 .../goya/asic_reg/psoc_global_conf_masks.h    |  447 ++
 .../goya/asic_reg/psoc_global_conf_regs.h     |  745 +++
 .../include/goya/asic_reg/psoc_mme_pll_regs.h |  105 +
 .../include/goya/asic_reg/psoc_pci_pll_regs.h |  105 +
 .../include/goya/asic_reg/psoc_spi_regs.h     |  143 +
 .../goya/asic_reg/sram_y0_x0_rtr_regs.h       |   83 +
 .../goya/asic_reg/sram_y0_x1_rtr_regs.h       |   83 +
 .../goya/asic_reg/sram_y0_x2_rtr_regs.h       |   83 +
 .../goya/asic_reg/sram_y0_x3_rtr_regs.h       |   83 +
 .../goya/asic_reg/sram_y0_x4_rtr_regs.h       |   83 +
 .../include/goya/asic_reg/stlb_masks.h        |  117 +
 .../include/goya/asic_reg/stlb_regs.h         |   55 +
 .../include/goya/asic_reg/tpc0_cfg_masks.h    | 1607 +++++
 .../include/goya/asic_reg/tpc0_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc0_cmdq_masks.h   |  373 ++
 .../include/goya/asic_reg/tpc0_cmdq_regs.h    |  139 +
 .../goya/asic_reg/tpc0_eml_cfg_masks.h        |  347 ++
 .../include/goya/asic_reg/tpc0_eml_cfg_regs.h |  313 +
 .../include/goya/asic_reg/tpc0_nrtr_masks.h   |  209 +
 .../include/goya/asic_reg/tpc0_nrtr_regs.h    |  227 +
 .../include/goya/asic_reg/tpc0_qm_masks.h     |  465 ++
 .../include/goya/asic_reg/tpc0_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc1_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc1_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc1_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc1_rtr_regs.h     |  323 +
 .../include/goya/asic_reg/tpc2_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc2_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc2_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc2_rtr_regs.h     |  323 +
 .../include/goya/asic_reg/tpc3_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc3_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc3_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc3_rtr_regs.h     |  323 +
 .../include/goya/asic_reg/tpc4_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc4_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc4_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc4_rtr_regs.h     |  323 +
 .../include/goya/asic_reg/tpc5_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc5_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc5_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc5_rtr_regs.h     |  323 +
 .../include/goya/asic_reg/tpc6_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc6_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc6_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc6_rtr_regs.h     |  323 +
 .../include/goya/asic_reg/tpc7_cfg_regs.h     |  887 +++
 .../include/goya/asic_reg/tpc7_cmdq_regs.h    |  139 +
 .../include/goya/asic_reg/tpc7_nrtr_regs.h    |  227 +
 .../include/goya/asic_reg/tpc7_qm_regs.h      |  179 +
 .../include/goya/asic_reg/tpc_pll_regs.h      |  105 +
 drivers/misc/habanalabs/include/goya/goya.h   |   41 +
 .../include/goya/goya_async_events.h          |  186 +
 .../misc/habanalabs/include/goya/goya_fw_if.h |   28 +
 .../habanalabs/include/goya/goya_packets.h    |  129 +
 drivers/misc/habanalabs/include/hl_boot_if.h  |   30 +
 .../include/hw_ip/mmu/mmu_general.h           |   45 +
 .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
 drivers/misc/habanalabs/include/qman_if.h     |   56 +
 drivers/misc/habanalabs/irq.c                 |  325 +
 drivers/misc/habanalabs/memory.c              | 1722 ++++++
 drivers/misc/habanalabs/mmu.c                 |  690 +++
 drivers/misc/habanalabs/sysfs.c               |  537 ++
 include/uapi/misc/habanalabs.h                |  444 ++
 132 files changed, 51224 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/debugfs-driver-habanalabs
 create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
 create mode 100644 drivers/misc/habanalabs/Kconfig
 create mode 100644 drivers/misc/habanalabs/Makefile
 create mode 100644 drivers/misc/habanalabs/asid.c
 create mode 100644 drivers/misc/habanalabs/command_buffer.c
 create mode 100644 drivers/misc/habanalabs/command_submission.c
 create mode 100644 drivers/misc/habanalabs/context.c
 create mode 100644 drivers/misc/habanalabs/debugfs.c
 create mode 100644 drivers/misc/habanalabs/device.c
 create mode 100644 drivers/misc/habanalabs/goya/Makefile
 create mode 100644 drivers/misc/habanalabs/goya/goya.c
 create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
 create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
 create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
 create mode 100644 drivers/misc/habanalabs/habanalabs.h
 create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c
 create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c
 create mode 100644 drivers/misc/habanalabs/hw_queue.c
 create mode 100644 drivers/misc/habanalabs/hwmon.c
 create mode 100644 drivers/misc/habanalabs/include/armcp_if.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_ca53_cfg_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_ca53_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_if_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/cpu_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_0_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_1_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_2_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_3_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_ch_4_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_macro_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_macro_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_nrtr_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_0_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_0_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_1_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_2_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_3_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/dma_qm_4_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_blocks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/goya_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/ic_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mc_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme1_rtr_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme5_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme6_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_cmdq_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_qm_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mme_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mmu_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/mmu_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pci_nrtr_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pci_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/pcie_aux_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_emmc_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_global_conf_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_global_conf_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_mme_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_pci_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/psoc_spi_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x0_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/sram_y0_x4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/stlb_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/stlb_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cfg_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cmdq_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_eml_cfg_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_eml_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_nrtr_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_qm_masks.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc0_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc1_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc2_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc3_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc4_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc5_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc6_rtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_cfg_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_cmdq_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_nrtr_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc7_qm_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/asic_reg/tpc_pll_regs.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_fw_if.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
 create mode 100644 drivers/misc/habanalabs/include/hl_boot_if.h
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
 create mode 100644 drivers/misc/habanalabs/include/qman_if.h
 create mode 100644 drivers/misc/habanalabs/irq.c
 create mode 100644 drivers/misc/habanalabs/memory.c
 create mode 100644 drivers/misc/habanalabs/mmu.c
 create mode 100644 drivers/misc/habanalabs/sysfs.c
 create mode 100644 include/uapi/misc/habanalabs.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 01/15] habanalabs: add skeleton driver
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 03/15] habanalabs: add basic Goya support Oded Gabbay
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds the habanalabs skeleton driver. The driver does nothing at
this stage except very basic operations. It contains the minimal code to
insmod and rmmod the driver and to create a /dev/hlX file per PCI device.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/Kconfig                     |   1 +
 drivers/misc/Makefile                    |   1 +
 drivers/misc/habanalabs/Kconfig          |  22 ++
 drivers/misc/habanalabs/Makefile         |   7 +
 drivers/misc/habanalabs/device.c         | 336 +++++++++++++++++++++
 drivers/misc/habanalabs/habanalabs.h     | 138 +++++++++
 drivers/misc/habanalabs/habanalabs_drv.c | 362 +++++++++++++++++++++++
 7 files changed, 867 insertions(+)
 create mode 100644 drivers/misc/habanalabs/Kconfig
 create mode 100644 drivers/misc/habanalabs/Makefile
 create mode 100644 drivers/misc/habanalabs/device.c
 create mode 100644 drivers/misc/habanalabs/habanalabs.h
 create mode 100644 drivers/misc/habanalabs/habanalabs_drv.c

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index f417b06e11c5..fecab53c4f21 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -535,4 +535,5 @@ source "drivers/misc/echo/Kconfig"
 source "drivers/misc/cxl/Kconfig"
 source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
+source "drivers/misc/habanalabs/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index e39ccbbc1b3a..ae77dfd790a4 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_PCI_ENDPOINT_TEST)	+= pci_endpoint_test.o
 obj-$(CONFIG_OCXL)		+= ocxl/
 obj-y				+= cardreader/
 obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
+obj-$(CONFIG_HABANA_AI)		+= habanalabs/
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
new file mode 100644
index 000000000000..b7f38a14caf5
--- /dev/null
+++ b/drivers/misc/habanalabs/Kconfig
@@ -0,0 +1,22 @@
+#
+# HabanaLabs AI accelerators driver
+#
+
+config HABANA_AI
+	tristate "HabanaAI accelerators (habanalabs)"
+	depends on PCI
+	select FRAME_VECTOR
+	help
+	  Enables PCIe card driver for Habana's AI Processors (AIP) that are
+	  designed to accelerate Deep Learning inference and training workloads.
+
+	  The driver manages the PCIe devices and provides IOCTL interface for
+	  the user to submit workloads to the devices.
+
+	  The user-space interface is described in
+	  include/uapi/misc/habanalabs.h
+
+	  If unsure, say N.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called habanalabs.
diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
new file mode 100644
index 000000000000..0910f7fa34ec
--- /dev/null
+++ b/drivers/misc/habanalabs/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for HabanaLabs AI accelerators driver
+#
+
+obj-m	:= habanalabs.o
+
+habanalabs-y := habanalabs_drv.o device.o
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
new file mode 100644
index 000000000000..3797a4281974
--- /dev/null
+++ b/drivers/misc/habanalabs/device.c
@@ -0,0 +1,336 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/fs.h>
+#include <linux/kthread.h>
+#include <linux/sched/signal.h>
+
+static void hpriv_release(struct kref *ref)
+{
+	struct hl_fpriv *hpriv;
+	struct hl_device *hdev;
+
+	hpriv = container_of(ref, struct hl_fpriv, refcount);
+
+	hdev = hpriv->hdev;
+
+	put_pid(hpriv->taskpid);
+
+	kfree(hpriv);
+}
+
+void hl_hpriv_get(struct hl_fpriv *hpriv)
+{
+	kref_get(&hpriv->refcount);
+}
+
+void hl_hpriv_put(struct hl_fpriv *hpriv)
+{
+	kref_put(&hpriv->refcount, hpriv_release);
+}
+
+/*
+ * hl_device_release - release function for habanalabs device
+ *
+ * @inode: pointer to inode structure
+ * @filp: pointer to file structure
+ *
+ * Called when process closes an habanalabs device
+ */
+static int hl_device_release(struct inode *inode, struct file *filp)
+{
+	struct hl_fpriv *hpriv = filp->private_data;
+
+	filp->private_data = NULL;
+
+	hl_hpriv_put(hpriv);
+
+	return 0;
+}
+
+static const struct file_operations hl_ops = {
+	.owner = THIS_MODULE,
+	.open = hl_device_open,
+	.release = hl_device_release
+};
+
+/*
+ * device_setup_cdev - setup cdev and device for habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @hclass: pointer to the class object of the device
+ * @minor: minor number of the specific device
+ * @fpos : file operations to install for this device
+ *
+ * Create a cdev and a Linux device for habanalabs's device. Need to be
+ * called at the end of the habanalabs device initialization process,
+ * because this function exposes the device to the user
+ */
+static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
+				int minor, const struct file_operations *fops)
+{
+	int err, devno = MKDEV(hdev->major, minor);
+	struct cdev *hdev_cdev = &hdev->cdev;
+	char *name;
+
+	name = kasprintf(GFP_KERNEL, "hl%d", hdev->id);
+	if (!name)
+		return -ENOMEM;
+
+	cdev_init(hdev_cdev, fops);
+	hdev_cdev->owner = THIS_MODULE;
+	err = cdev_add(hdev_cdev, devno, 1);
+	if (err) {
+		pr_err("Failed to add char device %s\n", name);
+		goto err_cdev_add;
+	}
+
+	hdev->dev = device_create(hclass, NULL, devno, NULL, "%s", name);
+	if (IS_ERR(hdev->dev)) {
+		pr_err("Failed to create device %s\n", name);
+		err = PTR_ERR(hdev->dev);
+		goto err_device_create;
+	}
+
+	dev_set_drvdata(hdev->dev, hdev);
+
+	kfree(name);
+
+	return 0;
+
+err_device_create:
+	cdev_del(hdev_cdev);
+err_cdev_add:
+	kfree(name);
+	return err;
+}
+
+/*
+ * device_early_init - do some early initialization for the habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Install the relevant function pointers and call the early_init function,
+ * if such a function exists
+ */
+static int device_early_init(struct hl_device *hdev)
+{
+	switch (hdev->asic_type) {
+	case ASIC_GOYA:
+		strlcpy(hdev->asic_name, "GOYA", sizeof(hdev->asic_name));
+		break;
+	default:
+		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
+			hdev->asic_type);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * device_early_fini - finalize all that was done in device_early_init
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ */
+static void device_early_fini(struct hl_device *hdev)
+{
+}
+
+/*
+ * hl_device_suspend - initiate device suspend
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Puts the hw in the suspend state (all asics).
+ * Returns 0 for success or an error on failure.
+ * Called at driver suspend.
+ */
+int hl_device_suspend(struct hl_device *hdev)
+{
+	pci_save_state(hdev->pdev);
+
+	/* Shut down the device */
+	pci_disable_device(hdev->pdev);
+	pci_set_power_state(hdev->pdev, PCI_D3hot);
+
+	return 0;
+}
+
+/*
+ * hl_device_resume - initiate device resume
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Bring the hw back to operating state (all asics).
+ * Returns 0 for success or an error on failure.
+ * Called at driver resume.
+ */
+int hl_device_resume(struct hl_device *hdev)
+{
+	int rc;
+
+	pci_set_power_state(hdev->pdev, PCI_D0);
+	pci_restore_state(hdev->pdev);
+	rc = pci_enable_device(hdev->pdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to enable PCI device in resume\n");
+		return rc;
+	}
+
+	return 0;
+}
+
+/*
+ * hl_device_init - main initialization function for habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Allocate an id for the device, do early initialization and then call the
+ * ASIC specific initialization functions. Finally, create the cdev and the
+ * Linux device to expose it to the user
+ */
+int hl_device_init(struct hl_device *hdev, struct class *hclass)
+{
+	int rc;
+
+	/* Create device */
+	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
+
+	if (rc)
+		goto out_disabled;
+
+	/* Initialize ASIC function pointers and perform early init */
+	rc = device_early_init(hdev);
+	if (rc)
+		goto release_device;
+
+	dev_notice(hdev->dev,
+		"Successfully added device to habanalabs driver\n");
+
+	return 0;
+
+release_device:
+	device_destroy(hclass, hdev->dev->devt);
+	cdev_del(&hdev->cdev);
+out_disabled:
+	hdev->disabled = true;
+	if (hdev->pdev)
+		dev_err(&hdev->pdev->dev,
+			"Failed to initialize hl%d. Device is NOT usable !\n",
+			hdev->id);
+	else
+		pr_err("Failed to initialize hl%d. Device is NOT usable !\n",
+			hdev->id);
+
+	return rc;
+}
+
+/*
+ * hl_device_fini - main tear-down function for habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Destroy the device, call ASIC fini functions and release the id
+ */
+void hl_device_fini(struct hl_device *hdev)
+{
+	dev_info(hdev->dev, "Removing device\n");
+
+	/* Mark device as disabled */
+	hdev->disabled = true;
+
+	device_early_fini(hdev);
+
+	/* Hide device from user */
+	device_destroy(hdev->dev->class, hdev->dev->devt);
+	cdev_del(&hdev->cdev);
+
+	pr_info("removed device successfully\n");
+}
+
+/*
+ * hl_poll_timeout_memory - Periodically poll a host memory address
+ *                              until it is not zero or a timeout occurs
+ * @hdev: pointer to habanalabs device structure
+ * @addr: Address to poll
+ * @timeout_us: timeout in us
+ * @val: Variable to read the value into
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
+ * case, the last read value at @addr is stored in @val. Must not
+ * be called from atomic context if sleep_us or timeout_us are used.
+ *
+ * The function sleeps for 100us with timeout value of
+ * timeout_us
+ */
+int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr,
+				u32 timeout_us, u32 *val)
+{
+	/*
+	 * paddr is defined as volatile because it points to HOST memory,
+	 * which is being written to by the device. Therefore, we can't use
+	 * locks to synchronize it and it is not a memory-mapped register space
+	 */
+	volatile u32 *paddr = (volatile u32 *) addr;
+	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
+
+	might_sleep();
+
+	for (;;) {
+		*val = *paddr;
+		if (*val)
+			break;
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			*val = *paddr;
+			break;
+		}
+		usleep_range((100 >> 2) + 1, 100);
+	}
+
+	return *val ? 0 : -ETIMEDOUT;
+}
+
+/*
+ * hl_poll_timeout_devicememory - Periodically poll a device memory address
+ *                                until it is not zero or a timeout occurs
+ * @hdev: pointer to habanalabs device structure
+ * @addr: Device address to poll
+ * @timeout_us: timeout in us
+ * @val: Variable to read the value into
+ *
+ * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
+ * case, the last read value at @addr is stored in @val. Must not
+ * be called from atomic context if sleep_us or timeout_us are used.
+ *
+ * The function sleeps for 100us with timeout value of
+ * timeout_us
+ */
+int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
+				u32 timeout_us, u32 *val)
+{
+	ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);
+
+	might_sleep();
+
+	for (;;) {
+		*val = readl(addr);
+		if (*val)
+			break;
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			*val = readl(addr);
+			break;
+		}
+		usleep_range((100 >> 2) + 1, 100);
+	}
+
+	return *val ? 0 : -ETIMEDOUT;
+}
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
new file mode 100644
index 000000000000..02386dcaa3a3
--- /dev/null
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef HABANALABSP_H_
+#define HABANALABSP_H_
+
+#define pr_fmt(fmt)			"habanalabs: " fmt
+
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/cdev.h>
+#include <linux/interrupt.h>
+#include <linux/iopoll.h>
+#include <linux/dma-fence.h>
+#include <linux/hashtable.h>
+#include <linux/hwmon.h>
+
+#define HL_NAME				"habanalabs"
+
+struct hl_device;
+
+
+/*
+ * ASICs
+ */
+
+/**
+ * enum hl_asic_type - supported ASIC types.
+ * @ASIC_AUTO_DETECT: ASIC type will be automatically set.
+ * @ASIC_GOYA: Goya device.
+ * @ASIC_INVALID: Invalid ASIC type.
+ */
+enum hl_asic_type {
+	ASIC_AUTO_DETECT,
+	ASIC_GOYA,
+	ASIC_INVALID
+};
+
+
+/*
+ * FILE PRIVATE STRUCTURE
+ */
+
+/**
+ * struct hl_fpriv - process information stored in FD private data.
+ * @hdev: habanalabs device structure.
+ * @filp: pointer to the given file structure.
+ * @taskpid: current process ID.
+ * @refcount: number of related contexts.
+ */
+struct hl_fpriv {
+	struct hl_device	*hdev;
+	struct file		*filp;
+	struct pid		*taskpid;
+	struct kref		refcount;
+};
+
+
+/*
+ * DEVICES
+ */
+
+/* Theoretical limit only. A single host can only contain up to 4 or 8 PCIe
+ * x16 cards. In extereme cases, there are hosts that can accommodate 16 cards
+ */
+#define HL_MAX_MINORS	256
+
+/**
+ * struct hl_device - habanalabs device structure.
+ * @pdev: pointer to PCI device, can be NULL in case of simulator device.
+ * @cdev: related char device.
+ * @dev: realted kernel basic device structure.
+ * @asic_name: ASIC specific nmae.
+ * @asic_type: ASIC specific type.
+ * @major: habanalabs KMD major.
+ * @id: device minor.
+ * @disabled: is device disabled.
+ */
+struct hl_device {
+	struct pci_dev			*pdev;
+	struct cdev			cdev;
+	struct device			*dev;
+	char				asic_name[16];
+	enum hl_asic_type		asic_type;
+	u32				major;
+	u16				id;
+	u8				disabled;
+};
+
+
+/*
+ * IOCTLs
+ */
+
+/**
+ * typedef hl_ioctl_t - typedef for ioctl function in the driver
+ * @hpriv: pointer to the FD's private data, which contains state of
+ *		user process
+ * @data: pointer to the input/output arguments structure of the IOCTL
+ *
+ * Return: 0 for success, negative value for error
+ */
+typedef int hl_ioctl_t(struct hl_fpriv *hpriv, void *data);
+
+/**
+ * struct hl_ioctl_desc - describes an IOCTL entry of the driver.
+ * @cmd: the IOCTL code as created by the kernel macros.
+ * @func: pointer to the driver's function that should be called for this IOCTL.
+ */
+struct hl_ioctl_desc {
+	unsigned int cmd;
+	hl_ioctl_t *func;
+};
+
+
+/*
+ * Kernel module functions that can be accessed by entire module
+ */
+
+int hl_device_open(struct inode *inode, struct file *filp);
+int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
+		enum hl_asic_type asic_type, int minor);
+void destroy_hdev(struct hl_device *hdev);
+int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
+				u32 *val);
+int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
+				u32 timeout_us, u32 *val);
+
+int hl_device_init(struct hl_device *hdev, struct class *hclass);
+void hl_device_fini(struct hl_device *hdev);
+int hl_device_suspend(struct hl_device *hdev);
+int hl_device_resume(struct hl_device *hdev);
+
+#endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
new file mode 100644
index 000000000000..573349eedce4
--- /dev/null
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -0,0 +1,362 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#include "habanalabs.h"
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+
+#include <linux/fs.h>
+
+#define HL_DRIVER_AUTHOR	"HabanaLabs Kernel Driver Team"
+
+#define HL_DRIVER_DESC		"Driver for HabanaLabs's AI Accelerators"
+
+MODULE_AUTHOR(HL_DRIVER_AUTHOR);
+MODULE_DESCRIPTION(HL_DRIVER_DESC);
+MODULE_LICENSE("GPL v2");
+
+static int hl_major;
+static struct class *hl_class;
+DEFINE_IDR(hl_devs_idr);
+DEFINE_MUTEX(hl_devs_idr_lock);
+
+#define PCI_VENDOR_ID_HABANALABS	0x1da3
+
+#define PCI_IDS_GOYA			0x0001
+
+static const struct pci_device_id ids[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_HABANALABS, PCI_IDS_GOYA), },
+	{ 0, }
+};
+MODULE_DEVICE_TABLE(pci, ids);
+
+/*
+ * get_asic_type - translate device id to asic type
+ *
+ * @device: id of the PCI device
+ *
+ * Translate device id to asic type.
+ * In case of unidentified device, return -1
+ */
+static enum hl_asic_type get_asic_type(u16 device)
+{
+	enum hl_asic_type asic_type;
+
+	switch (device) {
+	case PCI_IDS_GOYA:
+		asic_type = ASIC_GOYA;
+		break;
+	default:
+		asic_type = ASIC_INVALID;
+		break;
+	}
+
+	return asic_type;
+}
+
+/*
+ * hl_device_open - open function for habanalabs device
+ *
+ * @inode: pointer to inode structure
+ * @filp: pointer to file structure
+ *
+ * Called when process opens an habanalabs device.
+ */
+int hl_device_open(struct inode *inode, struct file *filp)
+{
+	struct hl_device *hdev;
+	struct hl_fpriv *hpriv;
+
+	mutex_lock(&hl_devs_idr_lock);
+	hdev = idr_find(&hl_devs_idr, iminor(inode));
+	mutex_unlock(&hl_devs_idr_lock);
+
+	if (!hdev) {
+		pr_err("Couldn't find device %d:%d\n",
+			imajor(inode), iminor(inode));
+		return -ENXIO;
+	}
+
+	hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
+	if (!hpriv)
+		return -ENOMEM;
+
+	hpriv->hdev = hdev;
+	filp->private_data = hpriv;
+	hpriv->filp = filp;
+	kref_init(&hpriv->refcount);
+	nonseekable_open(inode, filp);
+
+	hpriv->taskpid = find_get_pid(current->pid);
+
+	return 0;
+}
+
+/*
+ * create_hdev - create habanalabs device instance
+ *
+ * @dev: will hold the pointer to the new habanalabs device structure
+ * @pdev: pointer to the pci device
+ * @asic_type: in case of simulator device, which device is it
+ * @minor: in case of simulator device, the minor of the device
+ *
+ * Allocate memory for habanalabs device and initialize basic fields
+ * Identify the ASIC type
+ * Allocate ID (minor) for the device (only for real devices)
+ */
+int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
+		enum hl_asic_type asic_type, int minor)
+{
+	struct hl_device *hdev;
+	int rc;
+
+	*dev = NULL;
+
+	hdev = kzalloc(sizeof(*hdev), GFP_KERNEL);
+	if (!hdev)
+		return -ENOMEM;
+
+	hdev->major = hl_major;
+
+	hdev->disabled = true;
+	hdev->pdev = pdev; /* can be NULL in case of simulator device */
+
+	if (asic_type == ASIC_AUTO_DETECT) {
+		hdev->asic_type = get_asic_type(pdev->device);
+		if (hdev->asic_type == ASIC_INVALID) {
+			dev_err(&pdev->dev, "Unsupported ASIC\n");
+			rc = -ENODEV;
+			goto free_hdev;
+		}
+	} else {
+		hdev->asic_type = asic_type;
+	}
+
+	mutex_lock(&hl_devs_idr_lock);
+
+	if (minor == -1) {
+		rc = idr_alloc(&hl_devs_idr, hdev, 0, HL_MAX_MINORS,
+				GFP_KERNEL);
+	} else {
+		void *old_idr = idr_replace(&hl_devs_idr, hdev, minor);
+
+		if (IS_ERR_VALUE(old_idr)) {
+			rc = PTR_ERR(old_idr);
+			pr_err("Error %d when trying to replace minor %d\n",
+				rc, minor);
+			mutex_unlock(&hl_devs_idr_lock);
+			goto free_hdev;
+		}
+		rc = minor;
+	}
+
+	mutex_unlock(&hl_devs_idr_lock);
+
+	if (rc < 0) {
+		if (rc == -ENOSPC) {
+			pr_err("too many devices in the system\n");
+			rc = -EBUSY;
+		}
+		goto free_hdev;
+	}
+
+	hdev->id = rc;
+
+	*dev = hdev;
+
+	return 0;
+
+free_hdev:
+	kfree(hdev);
+	return rc;
+}
+
+/*
+ * destroy_hdev - destroy habanalabs device instance
+ *
+ * @dev: pointer to the habanalabs device structure
+ *
+ */
+void destroy_hdev(struct hl_device *hdev)
+{
+	/* Remove device from the device list */
+	mutex_lock(&hl_devs_idr_lock);
+	idr_remove(&hl_devs_idr, hdev->id);
+	mutex_unlock(&hl_devs_idr_lock);
+
+	kfree(hdev);
+}
+
+static int hl_pmops_suspend(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct hl_device *hdev = pci_get_drvdata(pdev);
+
+	pr_debug("Going to suspend PCI device\n");
+
+	if (!hdev) {
+		pr_err("device pointer is NULL in suspend\n");
+		return 0;
+	}
+
+	return hl_device_suspend(hdev);
+}
+
+static int hl_pmops_resume(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct hl_device *hdev = pci_get_drvdata(pdev);
+
+	pr_debug("Going to resume PCI device\n");
+
+	if (!hdev) {
+		pr_err("device pointer is NULL in resume\n");
+		return 0;
+	}
+
+	return hl_device_resume(hdev);
+}
+
+/*
+ * hl_pci_probe - probe PCI habanalabs devices
+ *
+ * @pdev: pointer to pci device
+ * @id: pointer to pci device id structure
+ *
+ * Standard PCI probe function for habanalabs device.
+ * Create a new habanalabs device and initialize it according to the
+ * device's type
+ */
+static int hl_pci_probe(struct pci_dev *pdev,
+				const struct pci_device_id *id)
+{
+	struct hl_device *hdev;
+	int rc;
+
+	dev_info(&pdev->dev, HL_NAME
+		 " device found [%04x:%04x] (rev %x)\n",
+		 (int)pdev->vendor, (int)pdev->device, (int)pdev->revision);
+
+	rc = create_hdev(&hdev, pdev, ASIC_AUTO_DETECT, -1);
+	if (rc)
+		return rc;
+
+	pci_set_drvdata(pdev, hdev);
+
+	rc = hl_device_init(hdev, hl_class);
+	if (rc) {
+		dev_err(&pdev->dev, "Fatal error during habanalabs device init\n");
+		rc = -ENODEV;
+		goto disable_device;
+	}
+
+	return 0;
+
+disable_device:
+	pci_set_drvdata(pdev, NULL);
+	destroy_hdev(hdev);
+
+	return rc;
+}
+
+/*
+ * hl_pci_remove - remove PCI habanalabs devices
+ *
+ * @pdev: pointer to pci device
+ *
+ * Standard PCI remove function for habanalabs device
+ */
+static void hl_pci_remove(struct pci_dev *pdev)
+{
+	struct hl_device *hdev;
+
+	hdev = pci_get_drvdata(pdev);
+	if (!hdev)
+		return;
+
+	hl_device_fini(hdev);
+	pci_set_drvdata(pdev, NULL);
+
+	destroy_hdev(hdev);
+}
+
+static const struct dev_pm_ops hl_pm_ops = {
+	.suspend = hl_pmops_suspend,
+	.resume = hl_pmops_resume,
+};
+
+static struct pci_driver hl_pci_driver = {
+	.name = HL_NAME,
+	.id_table = ids,
+	.probe = hl_pci_probe,
+	.remove = hl_pci_remove,
+	.driver.pm = &hl_pm_ops,
+};
+
+/*
+ * hl_init - Initialize the habanalabs kernel driver
+ */
+static int __init hl_init(void)
+{
+	int rc;
+	dev_t dev;
+
+	pr_info("loading driver\n");
+
+	rc = alloc_chrdev_region(&dev, 0, HL_MAX_MINORS, HL_NAME);
+	if (rc < 0) {
+		pr_err("unable to get major\n");
+		return rc;
+	}
+
+	hl_major = MAJOR(dev);
+
+	hl_class = class_create(THIS_MODULE, HL_NAME);
+	if (IS_ERR(hl_class)) {
+		pr_err("failed to allocate class\n");
+		rc = PTR_ERR(hl_class);
+		goto remove_major;
+	}
+
+	rc = pci_register_driver(&hl_pci_driver);
+	if (rc) {
+		pr_err("failed to register pci device\n");
+		goto remove_class;
+	}
+
+	pr_debug("driver loaded\n");
+
+	return 0;
+
+remove_class:
+	class_destroy(hl_class);
+remove_major:
+	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
+	return rc;
+}
+
+/*
+ * hl_exit - Release all resources of the habanalabs kernel driver
+ */
+static void __exit hl_exit(void)
+{
+	pci_unregister_driver(&hl_pci_driver);
+
+	class_destroy(hl_class);
+	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
+
+	idr_destroy(&hl_devs_idr);
+
+	pr_debug("driver removed\n");
+}
+
+module_init(hl_init);
+module_exit(hl_exit);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 03/15] habanalabs: add basic Goya support
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 01/15] habanalabs: add skeleton driver Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 04/15] habanalabs: add context and ASID modules Oded Gabbay
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds a basic support for the Goya device. The code initializes
the device's PCI controller and PCI bars. It also initializes various S/W
structures and adds some basic helper functions.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Remove empty line at end of Makefile
 
 drivers/misc/habanalabs/Makefile            |   3 +
 drivers/misc/habanalabs/device.c            |  71 +++
 drivers/misc/habanalabs/goya/Makefile       |   3 +
 drivers/misc/habanalabs/goya/goya.c         | 639 ++++++++++++++++++++
 drivers/misc/habanalabs/goya/goyaP.h        | 155 +++++
 drivers/misc/habanalabs/habanalabs.h        | 137 +++++
 drivers/misc/habanalabs/habanalabs_drv.c    |   3 +
 drivers/misc/habanalabs/include/goya/goya.h |  41 ++
 include/uapi/misc/habanalabs.h              |  20 +
 9 files changed, 1072 insertions(+)
 create mode 100644 drivers/misc/habanalabs/goya/Makefile
 create mode 100644 drivers/misc/habanalabs/goya/goya.c
 create mode 100644 drivers/misc/habanalabs/goya/goyaP.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya.h
 create mode 100644 include/uapi/misc/habanalabs.h

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 0910f7fa34ec..6f1ead69bd77 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,3 +5,6 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o
+
+include $(src)/goya/Makefile
+habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 3797a4281974..d77e070ae389 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -121,8 +121,11 @@ static int device_setup_cdev(struct hl_device *hdev, struct class *hclass,
  */
 static int device_early_init(struct hl_device *hdev)
 {
+	int rc;
+
 	switch (hdev->asic_type) {
 	case ASIC_GOYA:
+		goya_set_asic_funcs(hdev);
 		strlcpy(hdev->asic_name, "GOYA", sizeof(hdev->asic_name));
 		break;
 	default:
@@ -131,6 +134,10 @@ static int device_early_init(struct hl_device *hdev)
 		return -EINVAL;
 	}
 
+	rc = hdev->asic_funcs->early_init(hdev);
+	if (rc)
+		return rc;
+
 	return 0;
 }
 
@@ -142,6 +149,10 @@ static int device_early_init(struct hl_device *hdev)
  */
 static void device_early_fini(struct hl_device *hdev)
 {
+
+	if (hdev->asic_funcs->early_fini)
+		hdev->asic_funcs->early_fini(hdev);
+
 }
 
 /*
@@ -155,8 +166,15 @@ static void device_early_fini(struct hl_device *hdev)
  */
 int hl_device_suspend(struct hl_device *hdev)
 {
+	int rc;
+
 	pci_save_state(hdev->pdev);
 
+	rc = hdev->asic_funcs->suspend(hdev);
+	if (rc)
+		dev_err(hdev->dev,
+			"Failed to disable PCI access of device CPU\n");
+
 	/* Shut down the device */
 	pci_disable_device(hdev->pdev);
 	pci_set_power_state(hdev->pdev, PCI_D3hot);
@@ -186,6 +204,13 @@ int hl_device_resume(struct hl_device *hdev)
 		return rc;
 	}
 
+	rc = hdev->asic_funcs->resume(hdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to enable PCI access from device CPU\n");
+		return rc;
+	}
+
 	return 0;
 }
 
@@ -213,11 +238,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	if (rc)
 		goto release_device;
 
+	/*
+	 * Start calling ASIC initialization. First S/W then H/W and finally
+	 * late init
+	 */
+	rc = hdev->asic_funcs->sw_init(hdev);
+	if (rc)
+		goto early_fini;
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+early_fini:
+	device_early_fini(hdev);
 release_device:
 	device_destroy(hclass, hdev->dev->devt);
 	cdev_del(&hdev->cdev);
@@ -248,6 +283,9 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	/* Call ASIC S/W finalize function */
+	hdev->asic_funcs->sw_fini(hdev);
+
 	device_early_fini(hdev);
 
 	/* Hide device from user */
@@ -334,3 +372,36 @@ int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
 
 	return *val ? 0 : -ETIMEDOUT;
 }
+
+/*
+ * MMIO register access helper functions.
+ */
+
+/*
+ * hl_rreg - Read an MMIO register
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @reg: MMIO register offset (in bytes)
+ *
+ * Returns the value of the MMIO register we are asked to read
+ *
+ */
+inline u32 hl_rreg(struct hl_device *hdev, u32 reg)
+{
+	return readl(hdev->rmmio + reg);
+}
+
+/*
+ * hl_wreg - Write to an MMIO register
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @reg: MMIO register offset (in bytes)
+ * @val: 32-bit value
+ *
+ * Writes the 32-bit value into the MMIO register
+ *
+ */
+inline void hl_wreg(struct hl_device *hdev, u32 reg, u32 val)
+{
+	writel(val, hdev->rmmio + reg);
+}
diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
new file mode 100644
index 000000000000..38d43006386d
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/Makefile
@@ -0,0 +1,3 @@
+subdir-ccflags-y += -I$(src)
+
+HL_GOYA_FILES :=  goya/goya.o
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
new file mode 100644
index 000000000000..47e7472b7472
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -0,0 +1,639 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "goyaP.h"
+#include "include/goya/asic_reg/goya_masks.h"
+
+#include <linux/fs.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/genalloc.h>
+#include <linux/sysfs.h>
+#include <linux/kfifo.h>
+#include <linux/dma-mapping.h>
+#include <linux/firmware.h>
+#include <linux/log2.h>
+#include <linux/hwmon.h>
+#include <linux/string.h>
+#include <linux/io.h>
+
+/*
+ * GOYA security scheme:
+ *
+ * 1. Host is protected by:
+ *        - Range registers (When MMU is enabled, DMA RR does NOT protect host)
+ *        - MMU
+ *
+ * 2. DRAM is protected by:
+ *        - Range registers (protect the first 512MB)
+ *        - MMU (isolation between users)
+ *
+ * 3. Configuration is protected by:
+ *        - Range registers
+ *        - Protection bits
+ *
+ * When MMU is disabled:
+ *
+ * QMAN DMA: PQ, CQ, CP, DMA are secured.
+ * PQ, CB and the data are on the host.
+ *
+ * QMAN TPC/MME:
+ * PQ, CQ and CP are not secured.
+ * PQ, CB and the data are on the SRAM/DRAM.
+ *
+ * Since QMAN DMA is secured, KMD is parsing the DMA CB:
+ *     - KMD checks DMA pointer
+ *     - WREG, MSG_PROT are not allowed.
+ *     - MSG_LONG/SHORT are allowed.
+ *
+ * A read/write transaction by the QMAN to a protected area will succeed if
+ * and only if the QMAN's CP is secured and MSG_PROT is used
+ *
+ *
+ * When MMU is enabled:
+ *
+ * QMAN DMA: PQ, CQ and CP are secured.
+ * MMU is set to bypass on the Secure props register of the QMAN.
+ * The reasons we don't enable MMU for PQ, CQ and CP are:
+ *     - PQ entry is in kernel address space and KMD doesn't map it.
+ *     - CP writes to MSIX register and to kernel address space (completion
+ *       queue).
+ *
+ * DMA is not secured but because CP is secured, KMD still needs to parse the
+ * CB, but doesn't need to check the DMA addresses.
+ *
+ * For QMAN DMA 0, DMA is also secured because only KMD uses this DMA and KMD
+ * doesn't map memory in MMU.
+ *
+ * QMAN TPC/MME: PQ, CQ and CP aren't secured (no change from MMU disabled mode)
+ *
+ * DMA RR does NOT protect host because DMA is not secured
+ *
+ */
+
+#define GOYA_MMU_REGS_NUM		61
+
+#define GOYA_DMA_POOL_BLK_SIZE		0x100		/* 256 bytes */
+
+#define GOYA_RESET_TIMEOUT_MSEC		500		/* 500ms */
+#define GOYA_PLDM_RESET_TIMEOUT_MSEC	20000		/* 20s */
+#define GOYA_RESET_WAIT_MSEC		1		/* 1ms */
+#define GOYA_CPU_RESET_WAIT_MSEC	100		/* 100ms */
+#define GOYA_PLDM_RESET_WAIT_MSEC	1000		/* 1s */
+#define GOYA_CPU_TIMEOUT_USEC		10000000	/* 10s */
+#define GOYA_TEST_QUEUE_WAIT_USEC	100000		/* 100ms */
+
+#define GOYA_QMAN0_FENCE_VAL		0xD169B243
+
+#define GOYA_MAX_INITIATORS		20
+
+static void goya_get_fixed_properties(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+
+	prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
+
+	prop->dram_base_address = DRAM_PHYS_BASE;
+	prop->dram_size = DRAM_PHYS_DEFAULT_SIZE;
+	prop->dram_end_address = prop->dram_base_address + prop->dram_size;
+	prop->dram_user_base_address = DRAM_BASE_ADDR_USER;
+
+	prop->sram_base_address = SRAM_BASE_ADDR;
+	prop->sram_size = SRAM_SIZE;
+	prop->sram_end_address = prop->sram_base_address + prop->sram_size;
+	prop->sram_user_base_address = prop->sram_base_address +
+						SRAM_USER_BASE_OFFSET;
+
+	prop->host_phys_base_address = HOST_PHYS_BASE;
+	prop->va_space_host_start_address = VA_HOST_SPACE_START;
+	prop->va_space_host_end_address = VA_HOST_SPACE_END;
+	prop->va_space_dram_start_address = VA_DDR_SPACE_START;
+	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
+	prop->cfg_size = CFG_SIZE;
+	prop->max_asid = MAX_ASID;
+	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
+
+	prop->high_pll = PLL_HIGH_DEFAULT;
+}
+
+/*
+ * goya_pci_bars_map - Map PCI BARS of Goya device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Request PCI regions and map them to kernel virtual addresses.
+ * Returns 0 on success
+ *
+ */
+int goya_pci_bars_map(struct hl_device *hdev)
+{
+	struct pci_dev *pdev = hdev->pdev;
+	int rc;
+
+	rc = pci_request_regions(pdev, HL_NAME);
+	if (rc) {
+		dev_err(hdev->dev, "Cannot obtain PCI resources\n");
+		return rc;
+	}
+
+	hdev->pcie_bar[SRAM_CFG_BAR_ID] =
+			pci_ioremap_bar(pdev, SRAM_CFG_BAR_ID);
+	if (!hdev->pcie_bar[SRAM_CFG_BAR_ID]) {
+		dev_err(hdev->dev, "pci_ioremap_bar failed for CFG\n");
+		rc = -ENODEV;
+		goto err_release_regions;
+	}
+
+	hdev->pcie_bar[MSIX_BAR_ID] = pci_ioremap_bar(pdev, MSIX_BAR_ID);
+	if (!hdev->pcie_bar[MSIX_BAR_ID]) {
+		dev_err(hdev->dev, "pci_ioremap_bar failed for MSIX\n");
+		rc = -ENODEV;
+		goto err_unmap_sram_cfg;
+	}
+
+	hdev->pcie_bar[DDR_BAR_ID] = pci_ioremap_wc_bar(pdev, DDR_BAR_ID);
+	if (!hdev->pcie_bar[DDR_BAR_ID]) {
+		dev_err(hdev->dev, "pci_ioremap_bar failed for DDR\n");
+		rc = -ENODEV;
+		goto err_unmap_msix;
+	}
+
+	hdev->rmmio = hdev->pcie_bar[SRAM_CFG_BAR_ID] +
+				(CFG_BASE - SRAM_BASE_ADDR);
+
+	return 0;
+
+err_unmap_msix:
+	iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
+err_unmap_sram_cfg:
+	iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
+err_release_regions:
+	pci_release_regions(pdev);
+
+	return rc;
+}
+
+/*
+ * goya_pci_bars_unmap - Unmap PCI BARS of Goya device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Release all PCI BARS and unmap their virtual addresses
+ *
+ */
+static void goya_pci_bars_unmap(struct hl_device *hdev)
+{
+	struct pci_dev *pdev = hdev->pdev;
+
+	iounmap(hdev->pcie_bar[DDR_BAR_ID]);
+	iounmap(hdev->pcie_bar[MSIX_BAR_ID]);
+	iounmap(hdev->pcie_bar[SRAM_CFG_BAR_ID]);
+	pci_release_regions(pdev);
+}
+
+/*
+ * goya_elbi_write - Write through the ELBI interface
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * return 0 on success, -1 on failure
+ *
+ */
+static int goya_elbi_write(struct hl_device *hdev, u64 addr, u32 data)
+{
+	struct pci_dev *pdev = hdev->pdev;
+	ktime_t timeout;
+	u32 val;
+
+	/* Clear previous status */
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, 0);
+
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_ADDR, (u32) addr);
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_DATA, data);
+	pci_write_config_dword(pdev, mmPCI_CONFIG_ELBI_CTRL,
+				PCI_CONFIG_ELBI_CTRL_WRITE);
+
+	timeout = ktime_add_ms(ktime_get(), 10);
+	for (;;) {
+		pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS, &val);
+		if (val & PCI_CONFIG_ELBI_STS_MASK)
+			break;
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_STS,
+						&val);
+			break;
+		}
+		usleep_range(300, 500);
+	}
+
+	if ((val & PCI_CONFIG_ELBI_STS_MASK) == PCI_CONFIG_ELBI_STS_DONE)
+		return 0;
+
+	if (val & PCI_CONFIG_ELBI_STS_ERR) {
+		dev_err(hdev->dev, "Error writing to ELBI\n");
+		return -EIO;
+	}
+
+	if (!(val & PCI_CONFIG_ELBI_STS_MASK)) {
+		dev_err(hdev->dev, "ELBI write didn't finish in time\n");
+		return -EIO;
+	}
+
+	dev_err(hdev->dev, "ELBI write has undefined bits in status\n");
+	return -EIO;
+}
+
+/*
+ * goya_iatu_write - iatu write routine
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static int goya_iatu_write(struct hl_device *hdev, u32 addr, u32 data)
+{
+	u32 dbi_offset;
+	int rc;
+
+	dbi_offset = addr & 0xFFF;
+
+	rc = goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0x00300000);
+	rc |= goya_elbi_write(hdev, mmPCIE_DBI_BASE + dbi_offset, data);
+
+	if (rc)
+		return -EIO;
+
+	return 0;
+}
+
+void goya_reset_link_through_bridge(struct hl_device *hdev)
+{
+	struct pci_dev *pdev = hdev->pdev;
+	struct pci_dev *parent_port;
+	u16 val;
+
+	parent_port = pdev->bus->self;
+	pci_read_config_word(parent_port, PCI_BRIDGE_CONTROL, &val);
+	val |= PCI_BRIDGE_CTL_BUS_RESET;
+	pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
+	ssleep(1);
+
+	val &= ~(PCI_BRIDGE_CTL_BUS_RESET);
+	pci_write_config_word(parent_port, PCI_BRIDGE_CONTROL, val);
+	ssleep(3);
+}
+
+/*
+ * goya_set_ddr_bar_base - set DDR bar to map specific device address
+ *
+ * @hdev: pointer to hl_device structure
+ * @addr: address in DDR. Must be aligned to DDR bar size
+ *
+ * This function configures the iATU so that the DDR bar will start at the
+ * specified addr.
+ *
+ */
+static int goya_set_ddr_bar_base(struct hl_device *hdev, u64 addr)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+
+	if ((goya) && (goya->ddr_bar_cur_addr == addr))
+		return 0;
+
+	/* Inbound Region 1 - Bar 4 - Point to DDR */
+	rc = goya_iatu_write(hdev, 0x314, lower_32_bits(addr));
+	rc |= goya_iatu_write(hdev, 0x318, upper_32_bits(addr));
+	rc |= goya_iatu_write(hdev, 0x300, 0);
+	/* Enable + Bar match + match enable + Bar 4 */
+	rc |= goya_iatu_write(hdev, 0x304, 0xC0080400);
+
+	/* Return the DBI window to the default location */
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to map DDR bar to 0x%08llx\n", addr);
+		return -EIO;
+	}
+
+	if (goya)
+		goya->ddr_bar_cur_addr = addr;
+
+	return 0;
+}
+
+/*
+ * goya_init_iatu - Initialize the iATU unit inside the PCI controller
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * This is needed in case the firmware doesn't initialize the iATU
+ *
+ */
+static int goya_init_iatu(struct hl_device *hdev)
+{
+	int rc;
+
+	/* Inbound Region 0 - Bar 0 - Point to SRAM_BASE_ADDR */
+	rc  = goya_iatu_write(hdev, 0x114, lower_32_bits(SRAM_BASE_ADDR));
+	rc |= goya_iatu_write(hdev, 0x118, upper_32_bits(SRAM_BASE_ADDR));
+	rc |= goya_iatu_write(hdev, 0x100, 0);
+	/* Enable + Bar match + match enable */
+	rc |= goya_iatu_write(hdev, 0x104, 0xC0080000);
+
+	/* Inbound Region 1 - Bar 4 - Point to DDR */
+	rc |= goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
+
+	/* Outbound Region 0 - Point to Host */
+	rc |= goya_iatu_write(hdev, 0x008, lower_32_bits(HOST_PHYS_BASE));
+	rc |= goya_iatu_write(hdev, 0x00C, upper_32_bits(HOST_PHYS_BASE));
+	rc |= goya_iatu_write(hdev, 0x010,
+		lower_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
+	rc |= goya_iatu_write(hdev, 0x014, 0);
+	rc |= goya_iatu_write(hdev, 0x018, 0);
+	rc |= goya_iatu_write(hdev, 0x020,
+		upper_32_bits(HOST_PHYS_BASE + HOST_PHYS_SIZE - 1));
+	/* Increase region size */
+	rc |= goya_iatu_write(hdev, 0x000, 0x00002000);
+	/* Enable */
+	rc |= goya_iatu_write(hdev, 0x004, 0x80000000);
+
+	/* Return the DBI window to the default location */
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI, 0);
+	rc |= goya_elbi_write(hdev, CFG_BASE + mmPCIE_AUX_DBI_32, 0);
+
+	if (rc)
+		return -EIO;
+
+	return 0;
+}
+
+/*
+ * goya_early_init - GOYA early initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Verify PCI bars
+ * Set DMA masks
+ * PCI controller initialization
+ * Map PCI bars
+ *
+ */
+static int goya_early_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct pci_dev *pdev = hdev->pdev;
+	u32 val;
+	int rc;
+
+	goya_get_fixed_properties(hdev);
+
+	/* Check BAR sizes */
+	if (pci_resource_len(pdev, SRAM_CFG_BAR_ID) != CFG_BAR_SIZE) {
+		dev_err(hdev->dev,
+			"Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
+			SRAM_CFG_BAR_ID,
+			pci_resource_len(pdev, SRAM_CFG_BAR_ID),
+			CFG_BAR_SIZE);
+		return -ENODEV;
+	}
+
+	if (pci_resource_len(pdev, MSIX_BAR_ID) != MSIX_BAR_SIZE) {
+		dev_err(hdev->dev,
+			"Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
+			MSIX_BAR_ID, pci_resource_len(pdev, MSIX_BAR_ID),
+			MSIX_BAR_SIZE);
+		return -ENODEV;
+	}
+
+	prop->dram_pci_bar_size = pci_resource_len(pdev, DDR_BAR_ID);
+
+	/* set DMA mask for GOYA */
+	rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(39));
+	if (rc) {
+		dev_warn(hdev->dev, "Unable to set pci dma mask to 39 bits\n");
+		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(39));
+	if (rc) {
+		dev_warn(hdev->dev,
+			"Unable to set pci consistent dma mask to 39 bits\n");
+		rc = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci consistent dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	if (hdev->reset_pcilink)
+		goya_reset_link_through_bridge(hdev);
+
+	rc = pci_enable_device_mem(pdev);
+	if (rc) {
+		dev_err(hdev->dev, "can't enable PCI device\n");
+		return rc;
+	}
+
+	pci_set_master(pdev);
+
+	rc = goya_init_iatu(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize iATU\n");
+		goto disable_device;
+	}
+
+	rc = goya_pci_bars_map(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize PCI BARS\n");
+		goto disable_device;
+	}
+
+	val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
+	if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
+		dev_warn(hdev->dev,
+			"PCI strap is not configured correctly, PCI bus errors may occur\n");
+
+	return 0;
+
+disable_device:
+	pci_clear_master(pdev);
+	pci_disable_device(pdev);
+
+	return rc;
+}
+
+/*
+ * goya_early_fini - GOYA early finalization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Unmap PCI bars
+ *
+ */
+int goya_early_fini(struct hl_device *hdev)
+{
+	goya_pci_bars_unmap(hdev);
+
+	pci_clear_master(hdev->pdev);
+	pci_disable_device(hdev->pdev);
+
+	return 0;
+}
+
+/*
+ * goya_sw_init - Goya software initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static int goya_sw_init(struct hl_device *hdev)
+{
+	struct goya_device *goya;
+	int rc;
+
+	/* Allocate device structure */
+	goya = kzalloc(sizeof(*goya), GFP_KERNEL);
+	if (!goya)
+		return -ENOMEM;
+
+	/* according to goya_init_iatu */
+	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
+	hdev->asic_specific = goya;
+
+	/* Create DMA pool for small allocations */
+	hdev->dma_pool = dma_pool_create(dev_name(hdev->dev),
+			&hdev->pdev->dev, GOYA_DMA_POOL_BLK_SIZE, 8, 0);
+	if (!hdev->dma_pool) {
+		dev_err(hdev->dev, "failed to create DMA pool\n");
+		rc = -ENOMEM;
+		goto free_goya_device;
+	}
+
+	hdev->cpu_accessible_dma_mem =
+			hdev->asic_funcs->dma_alloc_coherent(hdev,
+					CPU_ACCESSIBLE_MEM_SIZE,
+					&hdev->cpu_accessible_dma_address,
+					GFP_KERNEL | __GFP_ZERO);
+
+	if (!hdev->cpu_accessible_dma_mem) {
+		dev_err(hdev->dev,
+			"failed to allocate %d of dma memory for CPU accessible memory space\n",
+			CPU_ACCESSIBLE_MEM_SIZE);
+		rc = -ENOMEM;
+		goto free_dma_pool;
+	}
+
+	hdev->cpu_accessible_dma_pool = gen_pool_create(CPU_PKT_SHIFT, -1);
+	if (!hdev->cpu_accessible_dma_pool) {
+		dev_err(hdev->dev,
+			"Failed to create CPU accessible DMA pool\n");
+		rc = -ENOMEM;
+		goto free_cpu_pq_dma_mem;
+	}
+
+	rc = gen_pool_add(hdev->cpu_accessible_dma_pool,
+				(u64) hdev->cpu_accessible_dma_mem,
+				CPU_ACCESSIBLE_MEM_SIZE, -1);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to add memory to CPU accessible DMA pool\n");
+		rc = -EFAULT;
+		goto free_cpu_pq_pool;
+	}
+
+	spin_lock_init(&goya->hw_queues_lock);
+
+	return 0;
+
+free_cpu_pq_pool:
+	gen_pool_destroy(hdev->cpu_accessible_dma_pool);
+free_cpu_pq_dma_mem:
+	hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
+			hdev->cpu_accessible_dma_mem,
+			hdev->cpu_accessible_dma_address);
+free_dma_pool:
+	dma_pool_destroy(hdev->dma_pool);
+free_goya_device:
+	kfree(goya);
+
+	return rc;
+}
+
+/*
+ * goya_sw_fini - Goya software tear-down code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+int goya_sw_fini(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	gen_pool_destroy(hdev->cpu_accessible_dma_pool);
+
+	hdev->asic_funcs->dma_free_coherent(hdev, CPU_ACCESSIBLE_MEM_SIZE,
+			hdev->cpu_accessible_dma_mem,
+			hdev->cpu_accessible_dma_address);
+
+	dma_pool_destroy(hdev->dma_pool);
+
+	kfree(goya);
+
+	return 0;
+}
+
+int goya_suspend(struct hl_device *hdev)
+{
+	return 0;
+}
+
+int goya_resume(struct hl_device *hdev)
+{
+	return 0;
+}
+
+void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
+					dma_addr_t *dma_handle, gfp_t flags)
+{
+	return dma_alloc_coherent(&hdev->pdev->dev, size, dma_handle, flags);
+}
+
+void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
+				dma_addr_t dma_handle)
+{
+	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
+}
+
+static const struct hl_asic_funcs goya_funcs = {
+	.early_init = goya_early_init,
+	.early_fini = goya_early_fini,
+	.sw_init = goya_sw_init,
+	.sw_fini = goya_sw_fini,
+	.suspend = goya_suspend,
+	.resume = goya_resume,
+	.dma_alloc_coherent = goya_dma_alloc_coherent,
+	.dma_free_coherent = goya_dma_free_coherent,
+};
+
+/*
+ * goya_set_asic_funcs - set Goya function pointers
+ *
+ * @*hdev: pointer to hl_device structure
+ *
+ */
+void goya_set_asic_funcs(struct hl_device *hdev)
+{
+	hdev->asic_funcs = &goya_funcs;
+}
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
new file mode 100644
index 000000000000..d65ac54ba89b
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef GOYAP_H_
+#define GOYAP_H_
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+#include "include/goya/goya.h"
+
+#define NUMBER_OF_CMPLT_QUEUES		5
+#define NUMBER_OF_EXT_HW_QUEUES		5
+#define NUMBER_OF_CPU_HW_QUEUES		1
+#define NUMBER_OF_INT_HW_QUEUES		9
+#define NUMBER_OF_HW_QUEUES		(NUMBER_OF_EXT_HW_QUEUES + \
+					NUMBER_OF_CPU_HW_QUEUES + \
+					NUMBER_OF_INT_HW_QUEUES)
+
+/*
+ * Number of MSIX interrupts IDS:
+ * Each completion queue has 1 ID
+ * The event queue has 1 ID
+ */
+#define NUMBER_OF_INTERRUPTS		(NUMBER_OF_CMPLT_QUEUES + 1)
+
+#if (NUMBER_OF_HW_QUEUES >= HL_MAX_QUEUES)
+#error "Number of H/W queues must be smaller than HL_MAX_QUEUES"
+#endif
+
+#if (NUMBER_OF_INTERRUPTS > GOYA_MSIX_ENTRIES)
+#error "Number of MSIX interrupts must be smaller or equal to GOYA_MSIX_ENTRIES"
+#endif
+
+#define QMAN_FENCE_TIMEOUT_USEC		10000	/* 10 ms */
+
+#define QMAN_STOP_TIMEOUT_USEC		100000	/* 100 ms */
+
+#define TPC_MAX_NUM			8
+#define TPC_ENABLED_MASK		0xFF
+
+#define DMA_MAX_NUM			5
+
+#define PLL_HIGH_DEFAULT		1575000000	/* 1.575 GHz */
+
+#define GOYA_ARMCP_INFO_TIMEOUT		10000000	/* 10s */
+
+#define DRAM_PHYS_DEFAULT_SIZE		0x100000000ull	/* 4GB */
+
+/* DRAM Memory Map */
+
+#define CPU_FW_IMAGE_SIZE	0x10000000	/* 256MB */
+#define MMU_PAGE_TABLES_SIZE	0x0E000000	/* 224MB */
+#define MMU_CACHE_MNG_SIZE	0x00001000	/* 4KB */
+#define CPU_PQ_PKT_SIZE		0x00001000	/* 4KB */
+#define CPU_PQ_DATA_SIZE	0x01FFE000	/* 32MB - 8KB  */
+
+#define CPU_FW_IMAGE_ADDR	DRAM_PHYS_BASE
+#define MMU_PAGE_TABLES_ADDR	(CPU_FW_IMAGE_ADDR + CPU_FW_IMAGE_SIZE)
+#define MMU_CACHE_MNG_ADDR	(MMU_PAGE_TABLES_ADDR + MMU_PAGE_TABLES_SIZE)
+#define CPU_PQ_PKT_ADDR		(MMU_CACHE_MNG_ADDR + MMU_CACHE_MNG_SIZE)
+#define CPU_PQ_DATA_ADDR	(CPU_PQ_PKT_ADDR + CPU_PQ_PKT_SIZE)
+#define DRAM_BASE_ADDR_USER	(CPU_PQ_DATA_ADDR + CPU_PQ_DATA_SIZE)
+
+#if (DRAM_BASE_ADDR_USER != 0x20000000)
+#error "KMD must reserve 512MB"
+#endif
+
+/*
+ * SRAM Memory Map for KMD
+ *
+ * KMD occupies KMD_SRAM_SIZE bytes from the start of SRAM. It is used for
+ * MME/TPC QMANs
+ *
+ */
+
+#define MME_QMAN_BASE_OFFSET	0x000000	/* Must be 0 */
+#define MME_QMAN_LENGTH		64
+#define TPC_QMAN_LENGTH		64
+
+#define TPC0_QMAN_BASE_OFFSET	(MME_QMAN_BASE_OFFSET + \
+				(MME_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC1_QMAN_BASE_OFFSET	(TPC0_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC2_QMAN_BASE_OFFSET	(TPC1_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC3_QMAN_BASE_OFFSET	(TPC2_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC4_QMAN_BASE_OFFSET	(TPC3_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC5_QMAN_BASE_OFFSET	(TPC4_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC6_QMAN_BASE_OFFSET	(TPC5_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+#define TPC7_QMAN_BASE_OFFSET	(TPC6_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+
+#define SRAM_KMD_RES_OFFSET	(TPC7_QMAN_BASE_OFFSET + \
+				(TPC_QMAN_LENGTH * QMAN_PQ_ENTRY_SIZE))
+
+#if (SRAM_KMD_RES_OFFSET >= GOYA_KMD_SRAM_RESERVED_SIZE_FROM_START)
+#error "MME/TPC QMANs SRAM space exceeds limit"
+#endif
+
+#define SRAM_USER_BASE_OFFSET	GOYA_KMD_SRAM_RESERVED_SIZE_FROM_START
+
+/* Virtual address space */
+#define VA_HOST_SPACE_START	0x1000000000000ull	/* 256TB */
+#define VA_HOST_SPACE_END	0x3FF8000000000ull	/* 1PB - 1TB */
+#define VA_HOST_SPACE_SIZE	(VA_HOST_SPACE_END - \
+					VA_HOST_SPACE_START) /* 767TB */
+
+#define VA_DDR_SPACE_START	0x800000000ull		/* 32GB */
+#define VA_DDR_SPACE_END	0x2000000000ull		/* 128GB */
+#define VA_DDR_SPACE_SIZE	(VA_DDR_SPACE_END - \
+					VA_DDR_SPACE_START)	/* 128GB */
+
+#define DMA_MAX_TRANSFER_SIZE	0xFFFFFFFF
+
+#define HW_CAP_PLL		0x00000001
+#define HW_CAP_DDR_0		0x00000002
+#define HW_CAP_DDR_1		0x00000004
+#define HW_CAP_MME		0x00000008
+#define HW_CAP_CPU		0x00000010
+#define HW_CAP_DMA		0x00000020
+#define HW_CAP_MSIX		0x00000040
+#define HW_CAP_CPU_Q		0x00000080
+#define HW_CAP_MMU		0x00000100
+#define HW_CAP_TPC_MBIST	0x00000200
+#define HW_CAP_GOLDEN		0x00000400
+#define HW_CAP_TPC		0x00000800
+
+#define CPU_PKT_SHIFT		5
+#define CPU_PKT_SIZE		(1 << CPU_PKT_SHIFT)
+#define CPU_PKT_MASK		(~((1 << CPU_PKT_SHIFT) - 1))
+#define CPU_MAX_PKTS_IN_CB	32
+#define CPU_CB_SIZE		(CPU_PKT_SIZE * CPU_MAX_PKTS_IN_CB)
+#define CPU_ACCESSIBLE_MEM_SIZE	(HL_QUEUE_LENGTH * CPU_CB_SIZE)
+
+enum goya_fw_component {
+	FW_COMP_UBOOT,
+	FW_COMP_PREBOOT
+};
+
+struct goya_device {
+	/* TODO: remove hw_queues_lock after moving to scheduler code */
+	spinlock_t	hw_queues_lock;
+	u64		ddr_bar_cur_addr;
+	u32		hw_cap_initialized;
+};
+
+#endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 02386dcaa3a3..644624fa378b 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -21,9 +21,62 @@
 
 #define HL_NAME				"habanalabs"
 
+#define HL_MAX_QUEUES			128
+
 struct hl_device;
 
 
+/**
+ * struct asic_fixed_properties - ASIC specific immutable properties.
+ * @sram_base_address: SRAM physical start address.
+ * @sram_end_address: SRAM physical end address.
+ * @sram_user_base_address - SRAM physical start address for user access.
+ * @dram_base_address: DRAM physical start address.
+ * @dram_end_address: DRAM physical end address.
+ * @dram_user_base_address: DRAM physical start address for user access.
+ * @dram_size: DRAM total size.
+ * @dram_pci_bar_size: size of PCI bar towards DRAM.
+ * @host_phys_base_address: base physical address of host memory for
+ *				transactions that the device generates.
+ * @va_space_host_start_address: base address of virtual memory range for
+ *                               mapping host memory.
+ * @va_space_host_end_address: end address of virtual memory range for
+ *                             mapping host memory.
+ * @va_space_dram_start_address: base address of virtual memory range for
+ *                               mapping DRAM memory.
+ * @va_space_dram_end_address: end address of virtual memory range for
+ *                             mapping DRAM memory.
+ * @cfg_size: configuration space size on SRAM.
+ * @sram_size: total size of SRAM.
+ * @max_asid: maximum number of open contexts (ASIDs).
+ * @completion_queues_count: number of completion queues.
+ * @high_pll: high PLL frequency used by the device.
+ * @tpc_enabled_mask: which TPCs are enabled.
+ */
+struct asic_fixed_properties {
+	u64			sram_base_address;
+	u64			sram_end_address;
+	u64			sram_user_base_address;
+	u64			dram_base_address;
+	u64			dram_end_address;
+	u64			dram_user_base_address;
+	u64			dram_size;
+	u64			dram_pci_bar_size;
+	u64			host_phys_base_address;
+	u64			va_space_host_start_address;
+	u64			va_space_host_end_address;
+	u64			va_space_dram_start_address;
+	u64			va_space_dram_end_address;
+	u32			cfg_size;
+	u32			sram_size;
+	u32			max_asid;
+	u32			high_pll;
+	u8			completion_queues_count;
+	u8			tpc_enabled_mask;
+};
+
+
+#define HL_QUEUE_LENGTH			256
 /*
  * ASICs
  */
@@ -40,6 +93,36 @@ enum hl_asic_type {
 	ASIC_INVALID
 };
 
+/**
+ * struct hl_asic_funcs - ASIC specific functions that are can be called from
+ *                        common code.
+ * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
+ * @early_fini: tears down what was done in early_init.
+ * @sw_init: sets up driver state, does not configure H/W.
+ * @sw_fini: tears down driver state, does not configure H/W.
+ * @suspend: handles IP specific H/W or SW changes for suspend.
+ * @resume: handles IP specific H/W or SW changes for resume.
+ * @dma_alloc_coherent: Allocate coherent DMA memory by calling
+ *                      dma_alloc_coherent(). This is ASIC function because its
+ *                      implementation is not trivial when the driver is loaded
+ *                      in simulation mode (not upstreamed).
+ * @dma_free_coherent: Free coherent DMA memory by calling dma_free_coherent().
+ *                     This is ASIC function because its implementation is not
+ *                     trivial when the driver is loaded in simulation mode
+ *                     (not upstreamed).
+ */
+struct hl_asic_funcs {
+	int (*early_init)(struct hl_device *hdev);
+	int (*early_fini)(struct hl_device *hdev);
+	int (*sw_init)(struct hl_device *hdev);
+	int (*sw_fini)(struct hl_device *hdev);
+	int (*suspend)(struct hl_device *hdev);
+	int (*resume)(struct hl_device *hdev);
+	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
+					dma_addr_t *dma_handle, gfp_t flag);
+	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
+					void *cpu_addr, dma_addr_t dma_handle);
+};
 
 /*
  * FILE PRIVATE STRUCTURE
@@ -69,26 +152,78 @@ struct hl_fpriv {
  */
 #define HL_MAX_MINORS	256
 
+/*
+ * Registers read & write functions.
+ */
+
+u32 hl_rreg(struct hl_device *hdev, u32 reg);
+void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
+
+#define hl_poll_timeout(hdev, addr, val, cond, sleep_us, timeout_us) \
+	readl_poll_timeout(hdev->rmmio + addr, val, cond, sleep_us, timeout_us)
+
+#define RREG32(reg) hl_rreg(hdev, (reg))
+#define WREG32(reg, v) hl_wreg(hdev, (reg), (v))
+#define DREG32(reg) pr_info("REGISTER: " #reg " : 0x%08X\n",	\
+				hl_rreg(hdev, (reg)))
+
+#define WREG32_P(reg, val, mask)				\
+	do {							\
+		u32 tmp_ = RREG32(reg);				\
+		tmp_ &= (mask);					\
+		tmp_ |= ((val) & ~(mask));			\
+		WREG32(reg, tmp_);				\
+	} while (0)
+#define WREG32_AND(reg, and) WREG32_P(reg, 0, and)
+#define WREG32_OR(reg, or) WREG32_P(reg, or, ~(or))
+
+#define REG_FIELD_SHIFT(reg, field) reg##_##field##_SHIFT
+#define REG_FIELD_MASK(reg, field) reg##_##field##_MASK
+#define WREG32_FIELD(reg, field, val)	\
+	WREG32(mm##reg, (RREG32(mm##reg) & ~REG_FIELD_MASK(reg, field)) | \
+			(val) << REG_FIELD_SHIFT(reg, field))
+
 /**
  * struct hl_device - habanalabs device structure.
  * @pdev: pointer to PCI device, can be NULL in case of simulator device.
+ * @pcie_bar: array of available PCIe bars.
+ * @rmmio: configuration area address on SRAM.
  * @cdev: related char device.
  * @dev: realted kernel basic device structure.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
+ * @dma_pool: DMA pool for small allocations.
+ * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
+ * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
+ * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
+ * @asic_prop: ASIC specific immutable properties.
+ * @asic_funcs: ASIC specific functions.
+ * @asic_specific: ASIC specific information to use only from ASIC files.
  * @major: habanalabs KMD major.
  * @id: device minor.
  * @disabled: is device disabled.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
+	void __iomem			*pcie_bar[6];
+	void __iomem			*rmmio;
 	struct cdev			cdev;
 	struct device			*dev;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
+	struct dma_pool			*dma_pool;
+	void				*cpu_accessible_dma_mem;
+	dma_addr_t			cpu_accessible_dma_address;
+	struct gen_pool			*cpu_accessible_dma_pool;
+	struct asic_fixed_properties	asic_prop;
+	const struct hl_asic_funcs	*asic_funcs;
+	void				*asic_specific;
 	u32				major;
 	u16				id;
 	u8				disabled;
+
+	/* Parameters for bring-up */
+	u8				reset_pcilink;
 };
 
 
@@ -135,4 +270,6 @@ void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
 
+void goya_set_asic_funcs(struct hl_device *hdev);
+
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 573349eedce4..1596cf38f6c0 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -126,6 +126,9 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 
 	hdev->major = hl_major;
 
+	/* Parameters for bring-up - set them to defaults */
+	hdev->reset_pcilink = 0;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/include/goya/goya.h b/drivers/misc/habanalabs/include/goya/goya.h
new file mode 100644
index 000000000000..76d8a9c376e2
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef GOYA_H
+#define GOYA_H
+
+#include "asic_reg/goya_regs.h"
+
+#include <linux/types.h>
+
+#define SRAM_CFG_BAR_ID		0
+#define MSIX_BAR_ID		2
+#define DDR_BAR_ID		4
+
+#define CFG_BAR_SIZE		0x10000000ull		/* 256MB */
+#define MSIX_BAR_SIZE		0x1000ull		/* 4KB */
+
+#define CFG_BASE		0x7FFC000000ull
+#define CFG_SIZE		0x4000000		/* 32MB CFG + 32MB DBG*/
+
+#define SRAM_BASE_ADDR		0x7FF0000000ull
+#define SRAM_SIZE		0x32A0000		/* 50.625MB */
+
+#define DRAM_PHYS_BASE		0x0ull
+
+#define HOST_PHYS_BASE		0x8000000000ull		/* 0.5TB */
+#define HOST_PHYS_SIZE		0x1000000000000ull	/* 0.25PB (48 bits) */
+
+#define GOYA_MSIX_ENTRIES	8
+
+#define QMAN_PQ_ENTRY_SIZE	16			/* Bytes */
+
+#define MAX_ASID		1024
+
+#define PROT_BITS_OFFS		0xF80
+
+#endif /* GOYA_H */
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
new file mode 100644
index 000000000000..a0ec23adf8f5
--- /dev/null
+++ b/include/uapi/misc/habanalabs.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef HABANALABS_H_
+#define HABANALABS_H_
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Defines that are asic-specific but constitutes as ABI between kernel driver
+ * and userspace
+ */
+#define GOYA_KMD_SRAM_RESERVED_SIZE_FROM_START	0x8000	/* 32KB */
+
+#endif /* HABANALABS_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 04/15] habanalabs: add context and ASID modules
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 01/15] habanalabs: add skeleton driver Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 03/15] habanalabs: add basic Goya support Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 05/15] habanalabs: add command buffer module Oded Gabbay
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds two modules - ASID and context.

Each user process that opens a device's file must have at least one
context before it is able to "work" with the device. Each context has its
own device address-space and contains information about its runtime state
(its active command submissions).

To have address-space separation between contexts, each context is assigned
a unique ASID, which stands for "address-space id". Goya supports up to
1024 ASIDs.

Currently, the driver doesn't support multiple contexts. Therefore, the
user doesn't need to actively create a context. A "primary context" is
created automatically when the user opens the device's file.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile         |   2 +-
 drivers/misc/habanalabs/asid.c           |  58 +++++++++
 drivers/misc/habanalabs/context.c        | 155 +++++++++++++++++++++++
 drivers/misc/habanalabs/device.c         |  47 +++++++
 drivers/misc/habanalabs/habanalabs.h     |  71 +++++++++++
 drivers/misc/habanalabs/habanalabs_drv.c |  46 ++++++-
 6 files changed, 376 insertions(+), 3 deletions(-)
 create mode 100644 drivers/misc/habanalabs/asid.c
 create mode 100644 drivers/misc/habanalabs/context.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 6f1ead69bd77..3ffbadc2ca01 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -4,7 +4,7 @@
 
 obj-m	:= habanalabs.o
 
-habanalabs-y := habanalabs_drv.o device.o
+habanalabs-y := habanalabs_drv.o device.o context.o asid.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/asid.c b/drivers/misc/habanalabs/asid.c
new file mode 100644
index 000000000000..0ce84c8f5a47
--- /dev/null
+++ b/drivers/misc/habanalabs/asid.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/slab.h>
+#include <linux/types.h>
+
+int hl_asid_init(struct hl_device *hdev)
+{
+	hdev->asid_bitmap = kcalloc(BITS_TO_LONGS(hdev->asic_prop.max_asid),
+					sizeof(*hdev->asid_bitmap), GFP_KERNEL);
+	if (!hdev->asid_bitmap)
+		return -ENOMEM;
+
+	mutex_init(&hdev->asid_mutex);
+
+	/* ASID 0 is reserved for KMD */
+	set_bit(0, hdev->asid_bitmap);
+
+	return 0;
+}
+
+void hl_asid_fini(struct hl_device *hdev)
+{
+	mutex_destroy(&hdev->asid_mutex);
+	kfree(hdev->asid_bitmap);
+}
+
+unsigned long hl_asid_alloc(struct hl_device *hdev)
+{
+	unsigned long found;
+
+	mutex_lock(&hdev->asid_mutex);
+
+	found = find_first_zero_bit(hdev->asid_bitmap,
+					hdev->asic_prop.max_asid);
+	if (found == hdev->asic_prop.max_asid)
+		found = 0;
+	else
+		set_bit(found, hdev->asid_bitmap);
+
+	mutex_unlock(&hdev->asid_mutex);
+
+	return found;
+}
+
+void hl_asid_free(struct hl_device *hdev, unsigned long asid)
+{
+	if (WARN((asid == 0 || asid >= hdev->asic_prop.max_asid),
+						"Invalid ASID %lu", asid))
+		return;
+	clear_bit(asid, hdev->asid_bitmap);
+}
diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
new file mode 100644
index 000000000000..4d46e51e1a82
--- /dev/null
+++ b/drivers/misc/habanalabs/context.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/sched.h>
+#include <linux/delay.h>
+
+static void hl_ctx_fini(struct hl_ctx *ctx)
+{
+	struct hl_device *hdev = ctx->hdev;
+
+	if (ctx->asid != HL_KERNEL_ASID_ID)
+		hl_asid_free(hdev, ctx->asid);
+}
+
+void hl_ctx_do_release(struct kref *ref)
+{
+	struct hl_ctx *ctx;
+
+	ctx = container_of(ref, struct hl_ctx, refcount);
+
+	dev_dbg(ctx->hdev->dev, "Now really releasing context %d\n", ctx->asid);
+
+	hl_ctx_fini(ctx);
+
+	if (ctx->hpriv)
+		hl_hpriv_put(ctx->hpriv);
+
+	kfree(ctx);
+}
+
+int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
+{
+	struct hl_ctx_mgr *mgr = &hpriv->ctx_mgr;
+	struct hl_ctx *ctx;
+	int rc;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx) {
+		rc = -ENOMEM;
+		goto out_err;
+	}
+
+	rc = hl_ctx_init(hdev, ctx, false);
+	if (rc)
+		goto free_ctx;
+
+	hl_hpriv_get(hpriv);
+	ctx->hpriv = hpriv;
+
+	/* TODO: remove for multiple contexts */
+	hpriv->ctx = ctx;
+	hdev->user_ctx = ctx;
+
+	mutex_lock(&mgr->ctx_lock);
+	rc = idr_alloc(&mgr->ctx_handles, ctx, 1, 0, GFP_KERNEL);
+	mutex_unlock(&mgr->ctx_lock);
+
+	if (rc < 0) {
+		dev_err(hdev->dev, "Failed to allocate IDR for a new CTX\n");
+		hl_ctx_free(hdev, ctx);
+		goto out_err;
+	}
+
+	return 0;
+
+free_ctx:
+	kfree(ctx);
+out_err:
+	return rc;
+}
+
+void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	if (kref_put(&ctx->refcount, hl_ctx_do_release) == 1)
+		return;
+
+	dev_warn(hdev->dev,
+		"Context %d closed or terminated but its CS are executing\n",
+		ctx->asid);
+}
+
+int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
+{
+	ctx->hdev = hdev;
+
+	kref_init(&ctx->refcount);
+
+	if (is_kernel_ctx) {
+		ctx->asid = HL_KERNEL_ASID_ID; /* KMD gets ASID 0 */
+	} else {
+		ctx->asid = hl_asid_alloc(hdev);
+		if (!ctx->asid) {
+			dev_err(hdev->dev, "No free ASID, failed to create context\n");
+			return -ENOMEM;
+		}
+	}
+
+	dev_dbg(hdev->dev, "Created context with ASID %u\n", ctx->asid);
+
+	return 0;
+}
+
+void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	kref_get(&ctx->refcount);
+}
+
+int hl_ctx_put(struct hl_ctx *ctx)
+{
+	return kref_put(&ctx->refcount, hl_ctx_do_release);
+}
+
+/*
+ * hl_ctx_mgr_init - initialize the context manager
+ *
+ * @mgr: pointer to context manager structure
+ *
+ * This manager is an object inside the hpriv object of the user process.
+ * The function is called when a user process opens the FD.
+ */
+void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr)
+{
+	mutex_init(&mgr->ctx_lock);
+	idr_init(&mgr->ctx_handles);
+}
+
+/*
+ * hl_ctx_mgr_fini - finalize the context manager
+ *
+ * @hdev: pointer to device structure
+ * @mgr: pointer to context manager structure
+ *
+ * This function goes over all the contexts in the manager and frees them.
+ * It is called when a process closes the FD.
+ */
+void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr)
+{
+	struct hl_ctx *ctx;
+	struct idr *idp;
+	u32 id;
+
+	idp = &mgr->ctx_handles;
+
+	idr_for_each_entry(idp, ctx, id)
+		hl_ctx_free(hdev, ctx);
+
+	idr_destroy(&mgr->ctx_handles);
+	mutex_destroy(&mgr->ctx_lock);
+}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index d77e070ae389..81acb1de9c7e 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -23,6 +23,12 @@ static void hpriv_release(struct kref *ref)
 	put_pid(hpriv->taskpid);
 
 	kfree(hpriv);
+
+	/* Now the FD is really closed */
+	atomic_dec(&hdev->fd_open_cnt);
+
+	/* This allows a new user context to open the device */
+	hdev->user_ctx = NULL;
 }
 
 void hl_hpriv_get(struct hl_fpriv *hpriv)
@@ -47,6 +53,8 @@ static int hl_device_release(struct inode *inode, struct file *filp)
 {
 	struct hl_fpriv *hpriv = filp->private_data;
 
+	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
+
 	filp->private_data = NULL;
 
 	hl_hpriv_put(hpriv);
@@ -138,7 +146,20 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		return rc;
 
+	rc = hl_asid_init(hdev);
+	if (rc)
+		goto early_fini;
+
+	mutex_init(&hdev->fd_open_cnt_lock);
+	atomic_set(&hdev->fd_open_cnt, 0);
+
 	return 0;
+
+early_fini:
+	if (hdev->asic_funcs->early_fini)
+		hdev->asic_funcs->early_fini(hdev);
+
+	return rc;
 }
 
 /*
@@ -150,9 +171,12 @@ static int device_early_init(struct hl_device *hdev)
 static void device_early_fini(struct hl_device *hdev)
 {
 
+	hl_asid_fini(hdev);
+
 	if (hdev->asic_funcs->early_fini)
 		hdev->asic_funcs->early_fini(hdev);
 
+	mutex_destroy(&hdev->fd_open_cnt_lock);
 }
 
 /*
@@ -246,11 +270,30 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	if (rc)
 		goto early_fini;
 
+	/* Allocate the kernel context */
+	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
+	if (!hdev->kernel_ctx) {
+		rc = -ENOMEM;
+		goto sw_fini;
+	}
+
+	hdev->user_ctx = NULL;
+
+	rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize kernel context\n");
+		goto free_ctx;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+free_ctx:
+	kfree(hdev->kernel_ctx);
+sw_fini:
+	hdev->asic_funcs->sw_fini(hdev);
 early_fini:
 	device_early_fini(hdev);
 release_device:
@@ -283,6 +326,10 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	/* Release kernel context */
+	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
+		dev_err(hdev->dev, "kernel ctx is still alive\n");
+
 	/* Call ASIC S/W finalize function */
 	hdev->asic_funcs->sw_fini(hdev);
 
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 644624fa378b..9f76373b3e2c 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -77,6 +77,8 @@ struct asic_fixed_properties {
 
 
 #define HL_QUEUE_LENGTH			256
+
+
 /*
  * ASICs
  */
@@ -124,6 +126,39 @@ struct hl_asic_funcs {
 					void *cpu_addr, dma_addr_t dma_handle);
 };
 
+
+/*
+ * CONTEXTS
+ */
+
+#define HL_KERNEL_ASID_ID	0
+
+/**
+ * struct hl_ctx - user/kernel context.
+ * @hpriv: pointer to the private (KMD) data of the process (fd).
+ * @hdev: pointer to the device structure.
+ * @refcount: reference counter for the context. Context is released only when
+ *		this hits 0l. It is incremented on CS and CS_WAIT.
+ * @asid: context's unique address space ID in the device's MMU.
+ */
+struct hl_ctx {
+	struct hl_fpriv		*hpriv;
+	struct hl_device	*hdev;
+	struct kref		refcount;
+	u32			asid;
+};
+
+/**
+ * struct hl_ctx_mgr - for handling multiple contexts.
+ * @ctx_lock: protects ctx_handles.
+ * @ctx_handles: idr to hold all ctx handles.
+ */
+struct hl_ctx_mgr {
+	struct mutex		ctx_lock;
+	struct idr		ctx_handles;
+};
+
+
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -133,12 +168,16 @@ struct hl_asic_funcs {
  * @hdev: habanalabs device structure.
  * @filp: pointer to the given file structure.
  * @taskpid: current process ID.
+ * @ctx: current executing context.
+ * @ctx_mgr: context manager to handle multiple context for this FD.
  * @refcount: number of related contexts.
  */
 struct hl_fpriv {
 	struct hl_device	*hdev;
 	struct file		*filp;
 	struct pid		*taskpid;
+	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
+	struct hl_ctx_mgr	ctx_mgr;
 	struct kref		refcount;
 };
 
@@ -192,13 +231,24 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @dev: realted kernel basic device structure.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
+ * @kernel_ctx: KMD context structure.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
  * @cpu_accessible_dma_pool: KMD <-> ArmCP shared memory pool.
+ * @asid_bitmap: holds used/available ASIDs.
+ * @asid_mutex: protects asid_bitmap.
+ * @fd_open_cnt_lock: lock for updating fd_open_cnt in hl_device_open. Although
+ *                    fd_open_cnt is atomic, we need this lock to serialize
+ *                    the open function because the driver currently supports
+ *                    only a single process at a time. In addition, we need a
+ *                    lock here so we can flush user processes which are opening
+ *                    the device while we are trying to hard reset it
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @user_ctx: current user context executing.
+ * @fd_open_cnt: number of open user processes.
  * @major: habanalabs KMD major.
  * @id: device minor.
  * @disabled: is device disabled.
@@ -211,13 +261,21 @@ struct hl_device {
 	struct device			*dev;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
+	struct hl_ctx			*kernel_ctx;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
 	struct gen_pool			*cpu_accessible_dma_pool;
+	unsigned long			*asid_bitmap;
+	struct mutex			asid_mutex;
+	/* TODO: remove fd_open_cnt_lock for multiple process support */
+	struct mutex			fd_open_cnt_lock;
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+	/* TODO: remove user_ctx for multiple process support */
+	struct hl_ctx			*user_ctx;
+	atomic_t			fd_open_cnt;
 	u32				major;
 	u16				id;
 	u8				disabled;
@@ -265,10 +323,23 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
 int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
 				u32 timeout_us, u32 *val);
 
+int hl_asid_init(struct hl_device *hdev);
+void hl_asid_fini(struct hl_device *hdev);
+unsigned long hl_asid_alloc(struct hl_device *hdev);
+void hl_asid_free(struct hl_device *hdev, unsigned long asid);
+
+int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv);
+void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx);
+int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx);
+int hl_ctx_put(struct hl_ctx *ctx);
+void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr);
+void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr);
 int hl_device_init(struct hl_device *hdev, struct class *hclass);
 void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
+void hl_hpriv_get(struct hl_fpriv *hpriv);
+void hl_hpriv_put(struct hl_fpriv *hpriv);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 1596cf38f6c0..8ecb51fbb133 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -74,6 +74,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 {
 	struct hl_device *hdev;
 	struct hl_fpriv *hpriv;
+	int rc;
 
 	mutex_lock(&hl_devs_idr_lock);
 	hdev = idr_find(&hl_devs_idr, iminor(inode));
@@ -85,9 +86,33 @@ int hl_device_open(struct inode *inode, struct file *filp)
 		return -ENXIO;
 	}
 
+	mutex_lock(&hdev->fd_open_cnt_lock);
+
+	if (hdev->disabled) {
+		dev_err_ratelimited(hdev->dev,
+			"Can't open %s because it is disabled\n",
+			dev_name(hdev->dev));
+		mutex_unlock(&hdev->fd_open_cnt_lock);
+		return -EPERM;
+	}
+
+	if (atomic_read(&hdev->fd_open_cnt)) {
+		dev_info_ratelimited(hdev->dev,
+			"Device %s is already attached to application\n",
+			dev_name(hdev->dev));
+		mutex_unlock(&hdev->fd_open_cnt_lock);
+		return -EBUSY;
+	}
+
+	atomic_inc(&hdev->fd_open_cnt);
+
+	mutex_unlock(&hdev->fd_open_cnt_lock);
+
 	hpriv = kzalloc(sizeof(*hpriv), GFP_KERNEL);
-	if (!hpriv)
-		return -ENOMEM;
+	if (!hpriv) {
+		rc = -ENOMEM;
+		goto close_device;
+	}
 
 	hpriv->hdev = hdev;
 	filp->private_data = hpriv;
@@ -95,9 +120,26 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	kref_init(&hpriv->refcount);
 	nonseekable_open(inode, filp);
 
+	hl_ctx_mgr_init(&hpriv->ctx_mgr);
+
+	rc = hl_ctx_create(hdev, hpriv);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to open FD (CTX fail)\n");
+		goto out_err;
+	}
+
 	hpriv->taskpid = find_get_pid(current->pid);
 
 	return 0;
+
+out_err:
+	filp->private_data = NULL;
+	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
+	kfree(hpriv);
+
+close_device:
+	atomic_dec(&hdev->fd_open_cnt);
+	return rc;
 }
 
 /*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 05/15] habanalabs: add command buffer module
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (2 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 04/15] habanalabs: add context and ASID modules Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 06/15] habanalabs: add basic Goya h/w initialization Oded Gabbay
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds the command buffer (CB) module, which allows the user to
create and destroy CBs and to map them to the user's process
address-space.

A command buffer is a memory blocks that reside in DMA-able address-space
and is physically contiguous so it can be accessed by the device without
MMU translation. The command buffer memory is allocated using the
coherent DMA API.

When creating a new CB, the IOCTL returns a handle of it, and the
user-space process needs to use that handle to mmap the buffer to get a VA
in the user's address-space.

Before destroying (freeing) a CB, the user must unmap the CB's VA using the
CB handle.

Each CB has a reference counter, which tracks its usage in command
submissions and also its mmaps (only a single mmap is allowed).

The driver maintains a pool of pre-allocated CBs in order to reduce
latency during command submissions. In case the pool is empty, the driver
will go to the slow-path of allocating a new CB, i.e. calling
dma_alloc_coherent.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Use -ENOTTY to signal bad ioctl parameter
  - Call put_cb() after updating its values (to prevent dereference after free)
  - Correctly account cb unmap calls (fix bug in case of partial munmap)
  - Check for maximum CB size allowed by driver
 
 drivers/misc/habanalabs/Makefile           |   3 +-
 drivers/misc/habanalabs/command_buffer.c   | 431 +++++++++++++++++++++
 drivers/misc/habanalabs/device.c           |  43 +-
 drivers/misc/habanalabs/goya/goya.c        |  28 ++
 drivers/misc/habanalabs/habanalabs.h       |  86 ++++
 drivers/misc/habanalabs/habanalabs_drv.c   |   2 +
 drivers/misc/habanalabs/habanalabs_ioctl.c |  99 +++++
 include/uapi/misc/habanalabs.h             |  46 +++
 8 files changed, 736 insertions(+), 2 deletions(-)
 create mode 100644 drivers/misc/habanalabs/command_buffer.c
 create mode 100644 drivers/misc/habanalabs/habanalabs_ioctl.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 3ffbadc2ca01..2530c9b78ca4 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -4,7 +4,8 @@
 
 obj-m	:= habanalabs.o
 
-habanalabs-y := habanalabs_drv.o device.o context.o asid.o
+habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
+		command_buffer.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
new file mode 100644
index 000000000000..cd19f00f85f2
--- /dev/null
+++ b/drivers/misc/habanalabs/command_buffer.c
@@ -0,0 +1,431 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+
+#include <linux/dma-mapping.h>
+
+static void cb_fini(struct hl_device *hdev, struct hl_cb *cb)
+{
+	hdev->asic_funcs->dma_free_coherent(hdev, cb->size,
+			(void *) cb->kernel_address, cb->bus_address);
+	kfree(cb);
+}
+
+static void cb_do_release(struct hl_device *hdev, struct hl_cb *cb)
+{
+	if (cb->is_pool) {
+		spin_lock(&hdev->cb_pool_lock);
+		list_add(&cb->pool_list, &hdev->cb_pool);
+		spin_unlock(&hdev->cb_pool_lock);
+	} else {
+		cb_fini(hdev, cb);
+	}
+}
+
+static void cb_release(struct kref *ref)
+{
+	struct hl_device *hdev;
+	struct hl_cb *cb;
+
+	cb = container_of(ref, struct hl_cb, refcount);
+	hdev = cb->hdev;
+
+	cb_do_release(hdev, cb);
+}
+
+static struct hl_cb *hl_cb_alloc(struct hl_device *hdev, u32 cb_size,
+					int ctx_id)
+{
+	struct hl_cb *cb;
+	void *p;
+
+	/*
+	 * We use of GFP_ATOMIC here because this function can be called from
+	 * the latency-sensitive code path for command submission. Due to H/W
+	 * limitations in some of the ASICs, the kernel must copy the user CB
+	 * that is designated for an external queue and actually enqueue
+	 * the kernel's copy. Hence, we must never sleep in this code section
+	 * and must use GFP_ATOMIC for all memory allocations.
+	 */
+	if (ctx_id == HL_KERNEL_ASID_ID)
+		cb = kzalloc(sizeof(*cb), GFP_ATOMIC);
+	else
+		cb = kzalloc(sizeof(*cb), GFP_KERNEL);
+
+	if (!cb)
+		return NULL;
+
+	if (ctx_id == HL_KERNEL_ASID_ID)
+		p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
+						&cb->bus_address, GFP_ATOMIC);
+	else
+		p = hdev->asic_funcs->dma_alloc_coherent(hdev, cb_size,
+						&cb->bus_address,
+						GFP_USER | __GFP_ZERO);
+	if (!p) {
+		dev_err(hdev->dev,
+			"failed to allocate %d of dma memory for CB\n",
+			cb_size);
+		kfree(cb);
+		return NULL;
+	}
+
+	cb->kernel_address = (u64) p;
+	cb->size = cb_size;
+
+	return cb;
+}
+
+int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
+			u32 cb_size, u64 *handle, int ctx_id)
+{
+	struct hl_cb *cb;
+	bool alloc_new_cb = true;
+	int rc;
+
+	if (hdev->disabled) {
+		dev_warn_ratelimited(hdev->dev,
+			"Device is disabled. Can't create new CBs\n");
+		rc = -EBUSY;
+		goto out_err;
+	}
+
+	if (cb_size > HL_MAX_CB_SIZE) {
+		dev_err(hdev->dev,
+			"CB size %d must be less then %d\n",
+			cb_size, HL_MAX_CB_SIZE);
+		rc = -EINVAL;
+		goto out_err;
+	}
+
+	/* Minimum allocation must be PAGE SIZE */
+	if (cb_size < PAGE_SIZE)
+		cb_size = PAGE_SIZE;
+
+	if (ctx_id == HL_KERNEL_ASID_ID &&
+			cb_size <= hdev->asic_prop.cb_pool_cb_size) {
+
+		spin_lock(&hdev->cb_pool_lock);
+		if (!list_empty(&hdev->cb_pool)) {
+			cb = list_first_entry(&hdev->cb_pool, typeof(*cb),
+					pool_list);
+			list_del(&cb->pool_list);
+			spin_unlock(&hdev->cb_pool_lock);
+			alloc_new_cb = false;
+		} else {
+			spin_unlock(&hdev->cb_pool_lock);
+			dev_dbg(hdev->dev, "CB pool is empty\n");
+		}
+	}
+
+	if (alloc_new_cb) {
+		cb = hl_cb_alloc(hdev, cb_size, ctx_id);
+		if (!cb) {
+			rc = -ENOMEM;
+			goto out_err;
+		}
+	}
+
+	cb->hdev = hdev;
+	cb->ctx_id = ctx_id;
+
+	spin_lock(&mgr->cb_lock);
+	rc = idr_alloc(&mgr->cb_handles, cb, 1, 0, GFP_ATOMIC);
+	spin_unlock(&mgr->cb_lock);
+
+	if (rc < 0) {
+		dev_err(hdev->dev, "Failed to allocate IDR for a new CB\n");
+		goto release_cb;
+	}
+
+	cb->id = rc;
+
+	kref_init(&cb->refcount);
+	spin_lock_init(&cb->lock);
+
+	/*
+	 * idr is 32-bit so we can safely OR it with a mask that is above
+	 * 32 bit
+	 */
+	*handle = cb->id | HL_MMAP_CB_MASK;
+	*handle <<= PAGE_SHIFT;
+
+	return 0;
+
+release_cb:
+	cb_do_release(hdev, cb);
+out_err:
+	*handle = 0;
+
+	return rc;
+}
+
+int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle)
+{
+	struct hl_cb *cb;
+	u32 handle;
+	int rc = 0;
+
+	/*
+	 * handle was given to user to do mmap, I need to shift it back to
+	 * how the idr module gave it to me
+	 */
+	cb_handle >>= PAGE_SHIFT;
+	handle = (u32) cb_handle;
+
+	spin_lock(&mgr->cb_lock);
+
+	cb = idr_find(&mgr->cb_handles, handle);
+	if (cb) {
+		idr_remove(&mgr->cb_handles, handle);
+		spin_unlock(&mgr->cb_lock);
+		kref_put(&cb->refcount, cb_release);
+	} else {
+		spin_unlock(&mgr->cb_lock);
+		dev_err(hdev->dev,
+			"CB destroy failed, no match to handle 0x%x\n", handle);
+		rc = -EINVAL;
+	}
+
+	return rc;
+}
+
+int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	union hl_cb_args *args = data;
+	struct hl_device *hdev = hpriv->hdev;
+	u64 handle;
+	int rc;
+
+	switch (args->in.op) {
+	case HL_CB_OP_CREATE:
+		rc = hl_cb_create(hdev, &hpriv->cb_mgr, args->in.cb_size,
+					&handle, hpriv->ctx->asid);
+		memset(args, 0, sizeof(*args));
+		args->out.cb_handle = handle;
+		break;
+	case HL_CB_OP_DESTROY:
+		rc = hl_cb_destroy(hdev, &hpriv->cb_mgr,
+					args->in.cb_handle);
+		break;
+	default:
+		rc = -ENOTTY;
+		break;
+	}
+
+	return rc;
+}
+
+static void cb_vm_close(struct vm_area_struct *vma)
+{
+	struct hl_cb *cb = (struct hl_cb *) vma->vm_private_data;
+
+	cb->mmap_size -= vma->vm_end - vma->vm_start;
+
+	if (cb->mmap_size)
+		return;
+
+	spin_lock(&cb->lock);
+	cb->mmap = false;
+	spin_unlock(&cb->lock);
+
+	hl_cb_put(cb);
+	vma->vm_private_data = NULL;
+}
+
+static const struct vm_operations_struct cb_vm_ops = {
+	.close = cb_vm_close
+};
+
+int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_cb *cb;
+	phys_addr_t address;
+	u32 handle;
+	int rc;
+
+	handle = vma->vm_pgoff;
+
+	/* reference was taken here */
+	cb = hl_cb_get(hdev, &hpriv->cb_mgr, handle);
+	if (!cb) {
+		dev_err(hdev->dev,
+			"CB mmap failed, no match to handle %d\n", handle);
+		return -EINVAL;
+	}
+
+	/* Validation check */
+	if ((vma->vm_end - vma->vm_start) != cb->size) {
+		dev_err(hdev->dev,
+			"CB mmap failed, mmap size 0x%lx != 0x%x cb size\n",
+			vma->vm_end - vma->vm_start, cb->size);
+		rc = -EINVAL;
+		goto put_cb;
+	}
+
+	spin_lock(&cb->lock);
+
+	if (cb->mmap) {
+		dev_err(hdev->dev,
+			"CB mmap failed, CB already mmaped to user\n");
+		rc = -EINVAL;
+		goto release_lock;
+	}
+
+	cb->mmap = true;
+
+	spin_unlock(&cb->lock);
+
+	vma->vm_ops = &cb_vm_ops;
+
+	/*
+	 * Note: We're transferring the cb reference to
+	 * vma->vm_private_data here.
+	 */
+
+	vma->vm_private_data = cb;
+
+	/* Calculate address for CB */
+	address = virt_to_phys((void *) cb->kernel_address);
+
+	rc = hdev->asic_funcs->cb_mmap(hdev, vma, cb->kernel_address,
+					address, cb->size);
+
+	if (rc) {
+		spin_lock(&cb->lock);
+		cb->mmap = false;
+		goto release_lock;
+	}
+
+	cb->mmap_size = cb->size;
+
+	return 0;
+
+release_lock:
+	spin_unlock(&cb->lock);
+put_cb:
+	hl_cb_put(cb);
+	return rc;
+}
+
+struct hl_cb *hl_cb_get(struct hl_device *hdev, struct hl_cb_mgr *mgr,
+			u32 handle)
+{
+	struct hl_cb *cb;
+
+	spin_lock(&mgr->cb_lock);
+	cb = idr_find(&mgr->cb_handles, handle);
+
+	if (!cb) {
+		spin_unlock(&mgr->cb_lock);
+		dev_warn(hdev->dev,
+			"CB get failed, no match to handle %d\n", handle);
+		return NULL;
+	}
+
+	kref_get(&cb->refcount);
+
+	spin_unlock(&mgr->cb_lock);
+
+	return cb;
+
+}
+
+void hl_cb_put(struct hl_cb *cb)
+{
+	kref_put(&cb->refcount, cb_release);
+}
+
+void hl_cb_mgr_init(struct hl_cb_mgr *mgr)
+{
+	spin_lock_init(&mgr->cb_lock);
+	idr_init(&mgr->cb_handles);
+}
+
+void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr)
+{
+	struct hl_cb *cb;
+	struct idr *idp;
+	u32 id;
+
+	idp = &mgr->cb_handles;
+
+	idr_for_each_entry(idp, cb, id) {
+		if (kref_put(&cb->refcount, cb_release) != 1)
+			dev_err(hdev->dev,
+				"CB %d for CTX ID %d is still alive\n",
+				id, cb->ctx_id);
+	}
+
+	idr_destroy(&mgr->cb_handles);
+}
+
+struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size)
+{
+	u64 cb_handle;
+	struct hl_cb *cb;
+	int rc;
+
+	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr, cb_size, &cb_handle,
+			HL_KERNEL_ASID_ID);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to allocate CB for KMD %d\n", rc);
+		return NULL;
+	}
+
+	cb_handle >>= PAGE_SHIFT;
+	cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr, (u32) cb_handle);
+	/* hl_cb_get should never fail here so use kernel WARN */
+	WARN(!cb, "Kernel CB handle invalid 0x%x\n", (u32) cb_handle);
+	if (!cb)
+		goto destroy_cb;
+
+	return cb;
+
+destroy_cb:
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb_handle << PAGE_SHIFT);
+
+	return NULL;
+}
+
+int hl_cb_pool_init(struct hl_device *hdev)
+{
+	struct hl_cb *cb;
+	int i;
+
+	INIT_LIST_HEAD(&hdev->cb_pool);
+	spin_lock_init(&hdev->cb_pool_lock);
+
+	for (i = 0 ; i < hdev->asic_prop.cb_pool_cb_cnt ; i++) {
+		cb = hl_cb_alloc(hdev, hdev->asic_prop.cb_pool_cb_size,
+				HL_KERNEL_ASID_ID);
+		if (cb) {
+			cb->is_pool = true;
+			list_add(&cb->pool_list, &hdev->cb_pool);
+		} else {
+			hl_cb_pool_fini(hdev);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+int hl_cb_pool_fini(struct hl_device *hdev)
+{
+	struct hl_cb *cb, *tmp;
+
+	list_for_each_entry_safe(cb, tmp, &hdev->cb_pool, pool_list) {
+		list_del(&cb->pool_list);
+		cb_fini(hdev, cb);
+	}
+
+	return 0;
+}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 81acb1de9c7e..c5296b068aaf 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -53,6 +53,7 @@ static int hl_device_release(struct inode *inode, struct file *filp)
 {
 	struct hl_fpriv *hpriv = filp->private_data;
 
+	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
 
 	filp->private_data = NULL;
@@ -62,10 +63,34 @@ static int hl_device_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+/*
+ * hl_mmap - mmap function for habanalabs device
+ *
+ * @*filp: pointer to file structure
+ * @*vma: pointer to vm_area_struct of the process
+ *
+ * Called when process does an mmap on habanalabs device. Call the device's mmap
+ * function at the end of the common code.
+ */
+static int hl_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct hl_fpriv *hpriv = filp->private_data;
+
+	if ((vma->vm_pgoff & HL_MMAP_CB_MASK) == HL_MMAP_CB_MASK) {
+		vma->vm_pgoff ^= HL_MMAP_CB_MASK;
+		return hl_cb_mmap(hpriv, vma);
+	}
+
+	return hpriv->hdev->asic_funcs->mmap(hpriv, vma);
+}
+
 static const struct file_operations hl_ops = {
 	.owner = THIS_MODULE,
 	.open = hl_device_open,
-	.release = hl_device_release
+	.release = hl_device_release,
+	.mmap = hl_mmap,
+	.unlocked_ioctl = hl_ioctl,
+	.compat_ioctl = hl_ioctl
 };
 
 /*
@@ -150,6 +175,8 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		goto early_fini;
 
+	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
+
 	mutex_init(&hdev->fd_open_cnt_lock);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
@@ -171,6 +198,8 @@ static int device_early_init(struct hl_device *hdev)
 static void device_early_fini(struct hl_device *hdev)
 {
 
+	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
+
 	hl_asid_fini(hdev);
 
 	if (hdev->asic_funcs->early_fini)
@@ -285,11 +314,21 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto free_ctx;
 	}
 
+	rc = hl_cb_pool_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize CB pool\n");
+		goto release_ctx;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+release_ctx:
+	if (hl_ctx_put(hdev->kernel_ctx) != 1)
+		dev_err(hdev->dev,
+			"kernel ctx is still alive on initialization failure\n");
 free_ctx:
 	kfree(hdev->kernel_ctx);
 sw_fini:
@@ -326,6 +365,8 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	hl_cb_pool_fini(hdev);
+
 	/* Release kernel context */
 	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
 		dev_err(hdev->dev, "kernel ctx is still alive\n");
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 47e7472b7472..6626dead7231 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -92,6 +92,9 @@
 
 #define GOYA_MAX_INITIATORS		20
 
+#define GOYA_CB_POOL_CB_CNT		512
+#define GOYA_CB_POOL_CB_SIZE		0x20000		/* 128KB */
+
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
@@ -119,6 +122,8 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
 
 	prop->high_pll = PLL_HIGH_DEFAULT;
+	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
+	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 }
 
 /*
@@ -604,6 +609,27 @@ int goya_resume(struct hl_device *hdev)
 	return 0;
 }
 
+int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
+{
+	return -EINVAL;
+}
+
+int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
+		u64 kaddress, phys_addr_t paddress, u32 size)
+{
+	int rc;
+
+	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
+			VM_DONTCOPY | VM_NORESERVE;
+
+	rc = remap_pfn_range(vma, vma->vm_start, paddress >> PAGE_SHIFT,
+				size, vma->vm_page_prot);
+	if (rc)
+		dev_err(hdev->dev, "remap_pfn_range error %d", rc);
+
+	return rc;
+}
+
 void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flags)
 {
@@ -623,6 +649,8 @@ static const struct hl_asic_funcs goya_funcs = {
 	.sw_fini = goya_sw_fini,
 	.suspend = goya_suspend,
 	.resume = goya_resume,
+	.mmap = goya_mmap,
+	.cb_mmap = goya_cb_mmap,
 	.dma_alloc_coherent = goya_dma_alloc_coherent,
 	.dma_free_coherent = goya_dma_free_coherent,
 };
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 9f76373b3e2c..ed79a0be58d7 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -21,9 +21,12 @@
 
 #define HL_NAME				"habanalabs"
 
+#define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
+
 #define HL_MAX_QUEUES			128
 
 struct hl_device;
+struct hl_fpriv;
 
 
 /**
@@ -51,6 +54,8 @@ struct hl_device;
  * @max_asid: maximum number of open contexts (ASIDs).
  * @completion_queues_count: number of completion queues.
  * @high_pll: high PLL frequency used by the device.
+ * @cb_pool_cb_cnt: number of CBs in the CB pool.
+ * @cb_pool_cb_size: size of each CB in the CB pool.
  * @tpc_enabled_mask: which TPCs are enabled.
  */
 struct asic_fixed_properties {
@@ -71,11 +76,60 @@ struct asic_fixed_properties {
 	u32			sram_size;
 	u32			max_asid;
 	u32			high_pll;
+	u32			cb_pool_cb_cnt;
+	u32			cb_pool_cb_size;
 	u8			completion_queues_count;
 	u8			tpc_enabled_mask;
 };
 
 
+/*
+ * Command Buffers
+ */
+
+#define HL_MAX_CB_SIZE		0x200000	/* 2MB */
+
+/**
+ * struct hl_cb_mgr - describes a Command Buffer Manager.
+ * @cb_lock: protects cb_handles.
+ * @cb_handles: an idr to hold all command buffer handles.
+ */
+struct hl_cb_mgr {
+	spinlock_t		cb_lock;
+	struct idr		cb_handles; /* protected by cb_lock */
+};
+
+/**
+ * struct hl_cb - describes a Command Buffer.
+ * @refcount: reference counter for usage of the CB.
+ * @hdev: pointer to device this CB belongs to.
+ * @lock: spinlock to protect mmap/cs flows.
+ * @pool_list: node in pool list of command buffers.
+ * @kernel_address: Holds the CB's kernel virtual address.
+ * @bus_address: Holds the CB's DMA address.
+ * @mmap_size: Holds the CB's size that was mmaped.
+ * @size: holds the CB's size.
+ * @id: the CB's ID.
+ * @ctx_id: holds the ID of the owner's context.
+ * @mmap: true if the CB is currently mmaped to user.
+ * @is_pool: true if CB was acquired from the pool, false otherwise.
+ */
+struct hl_cb {
+	struct kref		refcount;
+	struct hl_device	*hdev;
+	spinlock_t		lock;
+	struct list_head	pool_list;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			mmap_size;
+	u32			size;
+	u32			id;
+	u32			ctx_id;
+	u8			mmap;
+	u8			is_pool;
+};
+
+
 #define HL_QUEUE_LENGTH			256
 
 
@@ -104,6 +158,8 @@ enum hl_asic_type {
  * @sw_fini: tears down driver state, does not configure H/W.
  * @suspend: handles IP specific H/W or SW changes for suspend.
  * @resume: handles IP specific H/W or SW changes for resume.
+ * @mmap: mmap function, does nothing.
+ * @cb_mmap: maps a CB.
  * @dma_alloc_coherent: Allocate coherent DMA memory by calling
  *                      dma_alloc_coherent(). This is ASIC function because its
  *                      implementation is not trivial when the driver is loaded
@@ -120,6 +176,9 @@ struct hl_asic_funcs {
 	int (*sw_fini)(struct hl_device *hdev);
 	int (*suspend)(struct hl_device *hdev);
 	int (*resume)(struct hl_device *hdev);
+	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
+	int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
+			u64 kaddress, phys_addr_t paddress, u32 size);
 	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flag);
 	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
@@ -170,6 +229,7 @@ struct hl_ctx_mgr {
  * @taskpid: current process ID.
  * @ctx: current executing context.
  * @ctx_mgr: context manager to handle multiple context for this FD.
+ * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
  * @refcount: number of related contexts.
  */
 struct hl_fpriv {
@@ -178,6 +238,7 @@ struct hl_fpriv {
 	struct pid		*taskpid;
 	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
 	struct hl_ctx_mgr	ctx_mgr;
+	struct hl_cb_mgr	cb_mgr;
 	struct kref		refcount;
 };
 
@@ -232,6 +293,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @kernel_ctx: KMD context structure.
+ * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
@@ -247,6 +309,8 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @cb_pool: list of preallocated CBs.
+ * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
  * @fd_open_cnt: number of open user processes.
  * @major: habanalabs KMD major.
@@ -262,6 +326,7 @@ struct hl_device {
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_ctx			*kernel_ctx;
+	struct hl_cb_mgr		kernel_cb_mgr;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
@@ -273,6 +338,10 @@ struct hl_device {
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+
+	struct list_head		cb_pool;
+	spinlock_t			cb_pool_lock;
+
 	/* TODO: remove user_ctx for multiple process support */
 	struct hl_ctx			*user_ctx;
 	atomic_t			fd_open_cnt;
@@ -341,6 +410,23 @@ int hl_device_resume(struct hl_device *hdev);
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
 
+int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
+		u64 *handle, int ctx_id);
+int hl_cb_destroy(struct hl_device *hdev, struct hl_cb_mgr *mgr, u64 cb_handle);
+int hl_cb_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
+struct hl_cb *hl_cb_get(struct hl_device *hdev,	struct hl_cb_mgr *mgr,
+			u32 handle);
+void hl_cb_put(struct hl_cb *cb);
+void hl_cb_mgr_init(struct hl_cb_mgr *mgr);
+void hl_cb_mgr_fini(struct hl_device *hdev, struct hl_cb_mgr *mgr);
+struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size);
+int hl_cb_pool_init(struct hl_device *hdev);
+int hl_cb_pool_fini(struct hl_device *hdev);
+
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+/* IOCTLs */
+long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
+int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
+
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 8ecb51fbb133..b07979c599e2 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -120,6 +120,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	kref_init(&hpriv->refcount);
 	nonseekable_open(inode, filp);
 
+	hl_cb_mgr_init(&hpriv->cb_mgr);
 	hl_ctx_mgr_init(&hpriv->ctx_mgr);
 
 	rc = hl_ctx_create(hdev, hpriv);
@@ -135,6 +136,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 out_err:
 	filp->private_data = NULL;
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
+	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
 	kfree(hpriv);
 
 close_device:
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
new file mode 100644
index 000000000000..71dd1e760a22
--- /dev/null
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/cred.h>
+
+#define HL_IOCTL_DEF(ioctl, _func) \
+	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
+
+static const struct hl_ioctl_desc hl_ioctls[] = {
+	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl)
+};
+
+#define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
+
+long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+	struct hl_fpriv *hpriv = filep->private_data;
+	struct hl_device *hdev = hpriv->hdev;
+	hl_ioctl_t *func;
+	const struct hl_ioctl_desc *ioctl = NULL;
+	unsigned int nr = _IOC_NR(cmd);
+	char stack_kdata[128] = {0};
+	char *kdata = NULL;
+	unsigned int usize, asize;
+	int retcode;
+
+	if ((nr >= HL_COMMAND_START) && (nr < HL_COMMAND_END)) {
+		u32 hl_size;
+
+		ioctl = &hl_ioctls[nr];
+
+		hl_size = _IOC_SIZE(ioctl->cmd);
+		usize = asize = _IOC_SIZE(cmd);
+		if (hl_size > asize)
+			asize = hl_size;
+
+		cmd = ioctl->cmd;
+	} else {
+		dev_err(hdev->dev, "invalid ioctl: pid=%d, nr=0x%02x\n",
+			  task_pid_nr(current), nr);
+		return -ENOTTY;
+	}
+
+	/* Do not trust userspace, use our own definition */
+	func = ioctl->func;
+
+	if (unlikely(!func)) {
+		dev_dbg(hdev->dev, "no function\n");
+		retcode = -ENOTTY;
+		goto out_err;
+	}
+
+	if (cmd & (IOC_IN | IOC_OUT)) {
+		if (asize <= sizeof(stack_kdata)) {
+			kdata = stack_kdata;
+		} else {
+			kdata = kzalloc(asize, GFP_KERNEL);
+			if (!kdata) {
+				retcode = -ENOMEM;
+				goto out_err;
+			}
+		}
+	}
+
+	if (cmd & IOC_IN) {
+		if (copy_from_user(kdata, (void __user *)arg, usize)) {
+			retcode = -EFAULT;
+			goto out_err;
+		}
+	} else if (cmd & IOC_OUT) {
+		memset(kdata, 0, usize);
+	}
+
+	retcode = func(hpriv, kdata);
+
+	if (cmd & IOC_OUT)
+		if (copy_to_user((void __user *)arg, kdata, usize))
+			retcode = -EFAULT;
+
+out_err:
+	if (retcode)
+		dev_dbg(hdev->dev,
+			"error in ioctl: pid=%d, cmd=0x%02x, nr=0x%02x\n",
+			  task_pid_nr(current), cmd, nr);
+
+	if (kdata != stack_kdata)
+		kfree(kdata);
+
+	return retcode;
+}
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index a0ec23adf8f5..a8edfd3e9c95 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -17,4 +17,50 @@
  */
 #define GOYA_KMD_SRAM_RESERVED_SIZE_FROM_START	0x8000	/* 32KB */
 
+/* Opcode to create a new command buffer */
+#define HL_CB_OP_CREATE		0
+/* Opcode to destroy previously created command buffer */
+#define HL_CB_OP_DESTROY	1
+
+struct hl_cb_in {
+	/* Handle of CB or 0 if we want to create one */
+	__u64 cb_handle;
+	/* HL_CB_OP_* */
+	__u32 op;
+	/* Size of CB. Minimum requested size must be PAGE_SIZE */
+	__u32 cb_size;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
+struct hl_cb_out {
+	/* Handle of CB */
+	__u64 cb_handle;
+};
+
+union hl_cb_args {
+	struct hl_cb_in in;
+	struct hl_cb_out out;
+};
+
+/*
+ * Command Buffer
+ * - Request a Command Buffer
+ * - Destroy a Command Buffer
+ *
+ * The command buffers are memory blocks that reside in DMA-able address
+ * space and are physically contiguous so they can be accessed by the device
+ * directly. They are allocated using the coherent DMA API.
+ *
+ * When creating a new CB, the IOCTL returns a handle of it, and the user-space
+ * process needs to use that handle to mmap the buffer so it can access them.
+ *
+ */
+#define HL_IOCTL_CB		\
+		_IOWR('H', 0x02, union hl_cb_args)
+
+#define HL_COMMAND_START	0x02
+#define HL_COMMAND_END		0x03
+
 #endif /* HABANALABS_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 06/15] habanalabs: add basic Goya h/w initialization
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (3 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 05/15] habanalabs: add command buffer module Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 07/15] habanalabs: add h/w queues module Oded Gabbay
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds the basic part of Goya's H/W initialization. It adds code
that initializes Goya's internal CPU, various registers that are related to
internal routing, scrambling, workarounds for H/W bugs, etc.

It also initializes Goya's security scheme that prevents the user from
abusing Goya to steal data from the host, crash the host, change
Goya's F/W, etc.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Remove empty line at end of Makefile
 
 drivers/misc/habanalabs/device.c              |   12 +
 drivers/misc/habanalabs/goya/Makefile         |    2 +-
 drivers/misc/habanalabs/goya/goya.c           |  878 ++++-
 drivers/misc/habanalabs/goya/goyaP.h          |    4 +
 drivers/misc/habanalabs/goya/goya_security.c  | 2999 +++++++++++++++++
 drivers/misc/habanalabs/habanalabs.h          |   15 +
 drivers/misc/habanalabs/habanalabs_drv.c      |    7 +
 drivers/misc/habanalabs/include/armcp_if.h    |   19 +
 .../misc/habanalabs/include/goya/goya_fw_if.h |   28 +
 drivers/misc/habanalabs/include/hl_boot_if.h  |   30 +
 10 files changed, 3987 insertions(+), 7 deletions(-)
 create mode 100644 drivers/misc/habanalabs/goya/goya_security.c
 create mode 100644 drivers/misc/habanalabs/include/armcp_if.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_fw_if.h
 create mode 100644 drivers/misc/habanalabs/include/hl_boot_if.h

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index c5296b068aaf..9dcfe2cd3a0f 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -320,6 +320,15 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto release_ctx;
 	}
 
+	rc = hdev->asic_funcs->hw_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize the H/W\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
+	hdev->disabled = false;
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
@@ -371,6 +380,9 @@ void hl_device_fini(struct hl_device *hdev)
 	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
 		dev_err(hdev->dev, "kernel ctx is still alive\n");
 
+	/* Reset the H/W. It will be in idle state after this returns */
+	hdev->asic_funcs->hw_fini(hdev, true);
+
 	/* Call ASIC S/W finalize function */
 	hdev->asic_funcs->sw_fini(hdev);
 
diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
index 38d43006386d..7ae221696de9 100644
--- a/drivers/misc/habanalabs/goya/Makefile
+++ b/drivers/misc/habanalabs/goya/Makefile
@@ -1,3 +1,3 @@
 subdir-ccflags-y += -I$(src)
 
-HL_GOYA_FILES :=  goya/goya.o
+HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 6626dead7231..142232370d07 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -119,11 +119,11 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
 	prop->cfg_size = CFG_SIZE;
 	prop->max_asid = MAX_ASID;
+	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
+	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
 
 	prop->high_pll = PLL_HIGH_DEFAULT;
-	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
-	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 }
 
 /*
@@ -465,10 +465,12 @@ static int goya_early_init(struct hl_device *hdev)
 		goto disable_device;
 	}
 
-	val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
-	if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
-		dev_warn(hdev->dev,
-			"PCI strap is not configured correctly, PCI bus errors may occur\n");
+	if (!hdev->pldm) {
+		val = RREG32(mmPSOC_GLOBAL_CONF_BOOT_STRAP_PINS);
+		if (val & PSOC_GLOBAL_CONF_BOOT_STRAP_PINS_SRIOV_EN_MASK)
+			dev_warn(hdev->dev,
+				"PCI strap is not configured correctly, PCI bus errors may occur\n");
+	}
 
 	return 0;
 
@@ -599,6 +601,868 @@ int goya_sw_fini(struct hl_device *hdev)
 	return 0;
 }
 
+static void goya_set_pll_refclk(struct hl_device *hdev)
+{
+	WREG32(mmCPU_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmCPU_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmCPU_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmCPU_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmIC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmIC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmIC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmIC_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmMC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmMC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmMC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmMC_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmPSOC_MME_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmPSOC_PCI_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmPSOC_EMMC_PLL_DIV_SEL_3, 0x0);
+
+	WREG32(mmTPC_PLL_DIV_SEL_0, 0x0);
+	WREG32(mmTPC_PLL_DIV_SEL_1, 0x0);
+	WREG32(mmTPC_PLL_DIV_SEL_2, 0x0);
+	WREG32(mmTPC_PLL_DIV_SEL_3, 0x0);
+}
+
+static void goya_disable_clk_rlx(struct hl_device *hdev)
+{
+	WREG32(mmPSOC_MME_PLL_CLK_RLX_0, 0x100010);
+	WREG32(mmIC_PLL_CLK_RLX_0, 0x100010);
+}
+
+static void _goya_tpc_mbist_workaround(struct hl_device *hdev, u8 tpc_id)
+{
+	u64 tpc_eml_address;
+	u32 val, tpc_offset, tpc_eml_offset, tpc_slm_offset;
+	int err, slm_index;
+
+	tpc_offset = tpc_id * 0x40000;
+	tpc_eml_offset = tpc_id * 0x200000;
+	tpc_eml_address = (mmTPC0_EML_CFG_BASE + tpc_eml_offset - CFG_BASE);
+	tpc_slm_offset = tpc_eml_address + 0x100000;
+
+	/*
+	 * Workaround for Bug H2 #2443 :
+	 * "TPC SB is not initialized on chip reset"
+	 */
+
+	val = RREG32(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset);
+	if (val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_ACTIVE_MASK)
+		dev_warn(hdev->dev, "TPC%d MBIST ACTIVE is not cleared\n",
+			tpc_id);
+
+	WREG32(mmTPC0_CFG_FUNC_MBIST_PAT + tpc_offset, val & 0xFFFFF000);
+
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_0 + tpc_offset, 0x37FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_1 + tpc_offset, 0x303F);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_2 + tpc_offset, 0x71FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_3 + tpc_offset, 0x71FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_4 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_5 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_6 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_7 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_8 + tpc_offset, 0x70FF);
+	WREG32(mmTPC0_CFG_FUNC_MBIST_MEM_9 + tpc_offset, 0x70FF);
+
+	WREG32_OR(mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
+		1 << TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_START_SHIFT);
+
+	err = hl_poll_timeout(
+		hdev,
+		mmTPC0_CFG_FUNC_MBIST_CNTRL + tpc_offset,
+		val,
+		(val & TPC0_CFG_FUNC_MBIST_CNTRL_MBIST_DONE_MASK),
+		1000,
+		HL_DEVICE_TIMEOUT_USEC);
+
+	if (err)
+		dev_err(hdev->dev,
+			"Timeout while waiting for TPC%d MBIST DONE\n", tpc_id);
+
+	WREG32_OR(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
+		1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT);
+
+	msleep(GOYA_RESET_WAIT_MSEC);
+
+	WREG32_AND(mmTPC0_EML_CFG_DBG_CNT + tpc_eml_offset,
+		~(1 << TPC0_EML_CFG_DBG_CNT_CORE_RST_SHIFT));
+
+	msleep(GOYA_RESET_WAIT_MSEC);
+
+	for (slm_index = 0 ; slm_index < 256 ; slm_index++)
+		WREG32(tpc_slm_offset + (slm_index << 2), 0);
+
+	val = RREG32(tpc_slm_offset);
+}
+
+static void goya_tpc_mbist_workaround(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i;
+
+	if (hdev->pldm)
+		return;
+
+	if (goya->hw_cap_initialized & HW_CAP_TPC_MBIST)
+		return;
+
+	/* Workaround for H2 #2443 */
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++)
+		_goya_tpc_mbist_workaround(hdev, i);
+
+	goya->hw_cap_initialized |= HW_CAP_TPC_MBIST;
+}
+
+/*
+ * goya_init_golden_registers - Initialize golden registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Initialize the H/W registers of the device
+ *
+ */
+static void goya_init_golden_registers(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 polynom[10], tpc_intr_mask, offset;
+	int i;
+
+	if (goya->hw_cap_initialized & HW_CAP_GOLDEN)
+		return;
+
+	polynom[0] = 0x00020080;
+	polynom[1] = 0x00401000;
+	polynom[2] = 0x00200800;
+	polynom[3] = 0x00002000;
+	polynom[4] = 0x00080200;
+	polynom[5] = 0x00040100;
+	polynom[6] = 0x00100400;
+	polynom[7] = 0x00004000;
+	polynom[8] = 0x00010000;
+	polynom[9] = 0x00008000;
+
+	/* Mask all arithmetic interrupts from TPC */
+	tpc_intr_mask = 0x7FFF;
+
+	for (i = 0, offset = 0 ; i < 6 ; i++, offset += 0x20000) {
+		WREG32(mmSRAM_Y0_X0_RTR_HBW_RD_RQ_L_ARB + offset, 0x302);
+		WREG32(mmSRAM_Y0_X1_RTR_HBW_RD_RQ_L_ARB + offset, 0x302);
+		WREG32(mmSRAM_Y0_X2_RTR_HBW_RD_RQ_L_ARB + offset, 0x302);
+		WREG32(mmSRAM_Y0_X3_RTR_HBW_RD_RQ_L_ARB + offset, 0x302);
+		WREG32(mmSRAM_Y0_X4_RTR_HBW_RD_RQ_L_ARB + offset, 0x302);
+
+		WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_L_ARB + offset, 0x204);
+		WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_L_ARB + offset, 0x204);
+		WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_L_ARB + offset, 0x204);
+		WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_L_ARB + offset, 0x204);
+		WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_L_ARB + offset, 0x204);
+
+
+		WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_E_ARB + offset, 0x206);
+		WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_E_ARB + offset, 0x206);
+		WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_E_ARB + offset, 0x206);
+		WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_E_ARB + offset, 0x207);
+		WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_E_ARB + offset, 0x207);
+
+		WREG32(mmSRAM_Y0_X0_RTR_HBW_DATA_W_ARB + offset, 0x207);
+		WREG32(mmSRAM_Y0_X1_RTR_HBW_DATA_W_ARB + offset, 0x207);
+		WREG32(mmSRAM_Y0_X2_RTR_HBW_DATA_W_ARB + offset, 0x206);
+		WREG32(mmSRAM_Y0_X3_RTR_HBW_DATA_W_ARB + offset, 0x206);
+		WREG32(mmSRAM_Y0_X4_RTR_HBW_DATA_W_ARB + offset, 0x206);
+
+		WREG32(mmSRAM_Y0_X0_RTR_HBW_WR_RS_E_ARB + offset, 0x101);
+		WREG32(mmSRAM_Y0_X1_RTR_HBW_WR_RS_E_ARB + offset, 0x102);
+		WREG32(mmSRAM_Y0_X2_RTR_HBW_WR_RS_E_ARB + offset, 0x103);
+		WREG32(mmSRAM_Y0_X3_RTR_HBW_WR_RS_E_ARB + offset, 0x104);
+		WREG32(mmSRAM_Y0_X4_RTR_HBW_WR_RS_E_ARB + offset, 0x105);
+
+		WREG32(mmSRAM_Y0_X0_RTR_HBW_WR_RS_W_ARB + offset, 0x105);
+		WREG32(mmSRAM_Y0_X1_RTR_HBW_WR_RS_W_ARB + offset, 0x104);
+		WREG32(mmSRAM_Y0_X2_RTR_HBW_WR_RS_W_ARB + offset, 0x103);
+		WREG32(mmSRAM_Y0_X3_RTR_HBW_WR_RS_W_ARB + offset, 0x102);
+		WREG32(mmSRAM_Y0_X4_RTR_HBW_WR_RS_W_ARB + offset, 0x101);
+	}
+
+	WREG32(mmMME_STORE_MAX_CREDIT, 0x21);
+	WREG32(mmMME_AGU, 0x0f0f0f10);
+	WREG32(mmMME_SEI_MASK, ~0x0);
+
+	WREG32(mmMME6_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmMME5_RTR_HBW_RD_RQ_N_ARB, 0x01040101);
+	WREG32(mmMME4_RTR_HBW_RD_RQ_N_ARB, 0x01030101);
+	WREG32(mmMME3_RTR_HBW_RD_RQ_N_ARB, 0x01020101);
+	WREG32(mmMME2_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmMME1_RTR_HBW_RD_RQ_N_ARB, 0x07010701);
+	WREG32(mmMME6_RTR_HBW_RD_RQ_S_ARB, 0x04010401);
+	WREG32(mmMME5_RTR_HBW_RD_RQ_S_ARB, 0x04050401);
+	WREG32(mmMME4_RTR_HBW_RD_RQ_S_ARB, 0x03070301);
+	WREG32(mmMME3_RTR_HBW_RD_RQ_S_ARB, 0x01030101);
+	WREG32(mmMME2_RTR_HBW_RD_RQ_S_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_RD_RQ_S_ARB, 0x01050105);
+	WREG32(mmMME6_RTR_HBW_RD_RQ_W_ARB, 0x01010501);
+	WREG32(mmMME5_RTR_HBW_RD_RQ_W_ARB, 0x01010501);
+	WREG32(mmMME4_RTR_HBW_RD_RQ_W_ARB, 0x01040301);
+	WREG32(mmMME3_RTR_HBW_RD_RQ_W_ARB, 0x01030401);
+	WREG32(mmMME2_RTR_HBW_RD_RQ_W_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_RD_RQ_W_ARB, 0x01050101);
+	WREG32(mmMME6_RTR_HBW_WR_RQ_N_ARB, 0x02020202);
+	WREG32(mmMME5_RTR_HBW_WR_RQ_N_ARB, 0x01070101);
+	WREG32(mmMME4_RTR_HBW_WR_RQ_N_ARB, 0x02020201);
+	WREG32(mmMME3_RTR_HBW_WR_RQ_N_ARB, 0x07020701);
+	WREG32(mmMME2_RTR_HBW_WR_RQ_N_ARB, 0x01020101);
+	WREG32(mmMME1_RTR_HBW_WR_RQ_S_ARB, 0x01010101);
+	WREG32(mmMME6_RTR_HBW_WR_RQ_S_ARB, 0x01070101);
+	WREG32(mmMME5_RTR_HBW_WR_RQ_S_ARB, 0x01070101);
+	WREG32(mmMME4_RTR_HBW_WR_RQ_S_ARB, 0x07020701);
+	WREG32(mmMME3_RTR_HBW_WR_RQ_S_ARB, 0x02020201);
+	WREG32(mmMME2_RTR_HBW_WR_RQ_S_ARB, 0x01070101);
+	WREG32(mmMME1_RTR_HBW_WR_RQ_S_ARB, 0x01020102);
+	WREG32(mmMME6_RTR_HBW_WR_RQ_W_ARB, 0x01020701);
+	WREG32(mmMME5_RTR_HBW_WR_RQ_W_ARB, 0x01020701);
+	WREG32(mmMME4_RTR_HBW_WR_RQ_W_ARB, 0x07020707);
+	WREG32(mmMME3_RTR_HBW_WR_RQ_W_ARB, 0x01020201);
+	WREG32(mmMME2_RTR_HBW_WR_RQ_W_ARB, 0x01070201);
+	WREG32(mmMME1_RTR_HBW_WR_RQ_W_ARB, 0x01070201);
+	WREG32(mmMME6_RTR_HBW_RD_RS_N_ARB, 0x01070102);
+	WREG32(mmMME5_RTR_HBW_RD_RS_N_ARB, 0x01070102);
+	WREG32(mmMME4_RTR_HBW_RD_RS_N_ARB, 0x01060102);
+	WREG32(mmMME3_RTR_HBW_RD_RS_N_ARB, 0x01040102);
+	WREG32(mmMME2_RTR_HBW_RD_RS_N_ARB, 0x01020102);
+	WREG32(mmMME1_RTR_HBW_RD_RS_N_ARB, 0x01020107);
+	WREG32(mmMME6_RTR_HBW_RD_RS_S_ARB, 0x01020106);
+	WREG32(mmMME5_RTR_HBW_RD_RS_S_ARB, 0x01020102);
+	WREG32(mmMME4_RTR_HBW_RD_RS_S_ARB, 0x01040102);
+	WREG32(mmMME3_RTR_HBW_RD_RS_S_ARB, 0x01060102);
+	WREG32(mmMME2_RTR_HBW_RD_RS_S_ARB, 0x01070102);
+	WREG32(mmMME1_RTR_HBW_RD_RS_S_ARB, 0x01070102);
+	WREG32(mmMME6_RTR_HBW_RD_RS_E_ARB, 0x01020702);
+	WREG32(mmMME5_RTR_HBW_RD_RS_E_ARB, 0x01020702);
+	WREG32(mmMME4_RTR_HBW_RD_RS_E_ARB, 0x01040602);
+	WREG32(mmMME3_RTR_HBW_RD_RS_E_ARB, 0x01060402);
+	WREG32(mmMME2_RTR_HBW_RD_RS_E_ARB, 0x01070202);
+	WREG32(mmMME1_RTR_HBW_RD_RS_E_ARB, 0x01070102);
+	WREG32(mmMME6_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME5_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME4_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME3_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME2_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME1_RTR_HBW_RD_RS_W_ARB, 0x01060401);
+	WREG32(mmMME6_RTR_HBW_WR_RS_N_ARB, 0x01050101);
+	WREG32(mmMME5_RTR_HBW_WR_RS_N_ARB, 0x01040101);
+	WREG32(mmMME4_RTR_HBW_WR_RS_N_ARB, 0x01030101);
+	WREG32(mmMME3_RTR_HBW_WR_RS_N_ARB, 0x01020101);
+	WREG32(mmMME2_RTR_HBW_WR_RS_N_ARB, 0x01010101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_N_ARB, 0x01010107);
+	WREG32(mmMME6_RTR_HBW_WR_RS_S_ARB, 0x01010107);
+	WREG32(mmMME5_RTR_HBW_WR_RS_S_ARB, 0x01010101);
+	WREG32(mmMME4_RTR_HBW_WR_RS_S_ARB, 0x01020101);
+	WREG32(mmMME3_RTR_HBW_WR_RS_S_ARB, 0x01030101);
+	WREG32(mmMME2_RTR_HBW_WR_RS_S_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_S_ARB, 0x01050101);
+	WREG32(mmMME6_RTR_HBW_WR_RS_E_ARB, 0x01010501);
+	WREG32(mmMME5_RTR_HBW_WR_RS_E_ARB, 0x01010501);
+	WREG32(mmMME4_RTR_HBW_WR_RS_E_ARB, 0x01040301);
+	WREG32(mmMME3_RTR_HBW_WR_RS_E_ARB, 0x01030401);
+	WREG32(mmMME2_RTR_HBW_WR_RS_E_ARB, 0x01040101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_E_ARB, 0x01050101);
+	WREG32(mmMME6_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME5_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME4_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME3_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME2_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+	WREG32(mmMME1_RTR_HBW_WR_RS_W_ARB, 0x01010101);
+
+	WREG32(mmTPC1_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_RD_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_RD_RQ_E_ARB, 0x01060101);
+	WREG32(mmTPC1_RTR_HBW_WR_RQ_N_ARB, 0x02020102);
+	WREG32(mmTPC1_RTR_HBW_WR_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_WR_RQ_E_ARB, 0x02070202);
+	WREG32(mmTPC1_RTR_HBW_RD_RS_N_ARB, 0x01020201);
+	WREG32(mmTPC1_RTR_HBW_RD_RS_S_ARB, 0x01070201);
+	WREG32(mmTPC1_RTR_HBW_RD_RS_W_ARB, 0x01070202);
+	WREG32(mmTPC1_RTR_HBW_WR_RS_N_ARB, 0x01010101);
+	WREG32(mmTPC1_RTR_HBW_WR_RS_S_ARB, 0x01050101);
+	WREG32(mmTPC1_RTR_HBW_WR_RS_W_ARB, 0x01050101);
+
+	WREG32(mmTPC2_RTR_HBW_RD_RQ_N_ARB, 0x01020101);
+	WREG32(mmTPC2_RTR_HBW_RD_RQ_S_ARB, 0x01050101);
+	WREG32(mmTPC2_RTR_HBW_RD_RQ_E_ARB, 0x01010201);
+	WREG32(mmTPC2_RTR_HBW_WR_RQ_N_ARB, 0x02040102);
+	WREG32(mmTPC2_RTR_HBW_WR_RQ_S_ARB, 0x01050101);
+	WREG32(mmTPC2_RTR_HBW_WR_RQ_E_ARB, 0x02060202);
+	WREG32(mmTPC2_RTR_HBW_RD_RS_N_ARB, 0x01020201);
+	WREG32(mmTPC2_RTR_HBW_RD_RS_S_ARB, 0x01070201);
+	WREG32(mmTPC2_RTR_HBW_RD_RS_W_ARB, 0x01070202);
+	WREG32(mmTPC2_RTR_HBW_WR_RS_N_ARB, 0x01010101);
+	WREG32(mmTPC2_RTR_HBW_WR_RS_S_ARB, 0x01040101);
+	WREG32(mmTPC2_RTR_HBW_WR_RS_W_ARB, 0x01040101);
+
+	WREG32(mmTPC3_RTR_HBW_RD_RQ_N_ARB, 0x01030101);
+	WREG32(mmTPC3_RTR_HBW_RD_RQ_S_ARB, 0x01040101);
+	WREG32(mmTPC3_RTR_HBW_RD_RQ_E_ARB, 0x01040301);
+	WREG32(mmTPC3_RTR_HBW_WR_RQ_N_ARB, 0x02060102);
+	WREG32(mmTPC3_RTR_HBW_WR_RQ_S_ARB, 0x01040101);
+	WREG32(mmTPC3_RTR_HBW_WR_RQ_E_ARB, 0x01040301);
+	WREG32(mmTPC3_RTR_HBW_RD_RS_N_ARB, 0x01040201);
+	WREG32(mmTPC3_RTR_HBW_RD_RS_S_ARB, 0x01060201);
+	WREG32(mmTPC3_RTR_HBW_RD_RS_W_ARB, 0x01060402);
+	WREG32(mmTPC3_RTR_HBW_WR_RS_N_ARB, 0x01020101);
+	WREG32(mmTPC3_RTR_HBW_WR_RS_S_ARB, 0x01030101);
+	WREG32(mmTPC3_RTR_HBW_WR_RS_W_ARB, 0x01030401);
+
+	WREG32(mmTPC4_RTR_HBW_RD_RQ_N_ARB, 0x01040101);
+	WREG32(mmTPC4_RTR_HBW_RD_RQ_S_ARB, 0x01030101);
+	WREG32(mmTPC4_RTR_HBW_RD_RQ_E_ARB, 0x01030401);
+	WREG32(mmTPC4_RTR_HBW_WR_RQ_N_ARB, 0x02070102);
+	WREG32(mmTPC4_RTR_HBW_WR_RQ_S_ARB, 0x01030101);
+	WREG32(mmTPC4_RTR_HBW_WR_RQ_E_ARB, 0x02060702);
+	WREG32(mmTPC4_RTR_HBW_RD_RS_N_ARB, 0x01060201);
+	WREG32(mmTPC4_RTR_HBW_RD_RS_S_ARB, 0x01040201);
+	WREG32(mmTPC4_RTR_HBW_RD_RS_W_ARB, 0x01040602);
+	WREG32(mmTPC4_RTR_HBW_WR_RS_N_ARB, 0x01030101);
+	WREG32(mmTPC4_RTR_HBW_WR_RS_S_ARB, 0x01020101);
+	WREG32(mmTPC4_RTR_HBW_WR_RS_W_ARB, 0x01040301);
+
+	WREG32(mmTPC5_RTR_HBW_RD_RQ_N_ARB, 0x01050101);
+	WREG32(mmTPC5_RTR_HBW_RD_RQ_S_ARB, 0x01020101);
+	WREG32(mmTPC5_RTR_HBW_RD_RQ_E_ARB, 0x01200501);
+	WREG32(mmTPC5_RTR_HBW_WR_RQ_N_ARB, 0x02070102);
+	WREG32(mmTPC5_RTR_HBW_WR_RQ_S_ARB, 0x01020101);
+	WREG32(mmTPC5_RTR_HBW_WR_RQ_E_ARB, 0x02020602);
+	WREG32(mmTPC5_RTR_HBW_RD_RS_N_ARB, 0x01070201);
+	WREG32(mmTPC5_RTR_HBW_RD_RS_S_ARB, 0x01020201);
+	WREG32(mmTPC5_RTR_HBW_RD_RS_W_ARB, 0x01020702);
+	WREG32(mmTPC5_RTR_HBW_WR_RS_N_ARB, 0x01040101);
+	WREG32(mmTPC5_RTR_HBW_WR_RS_S_ARB, 0x01010101);
+	WREG32(mmTPC5_RTR_HBW_WR_RS_W_ARB, 0x01010501);
+
+	WREG32(mmTPC6_RTR_HBW_RD_RQ_N_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RQ_E_ARB, 0x01010601);
+	WREG32(mmTPC6_RTR_HBW_WR_RQ_N_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_WR_RQ_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_WR_RQ_E_ARB, 0x02020702);
+	WREG32(mmTPC6_RTR_HBW_RD_RS_N_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RS_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_RD_RS_W_ARB, 0x01020702);
+	WREG32(mmTPC6_RTR_HBW_WR_RS_N_ARB, 0x01050101);
+	WREG32(mmTPC6_RTR_HBW_WR_RS_S_ARB, 0x01010101);
+	WREG32(mmTPC6_RTR_HBW_WR_RS_W_ARB, 0x01010501);
+
+	for (i = 0, offset = 0 ; i < 10 ; i++, offset += 4) {
+		WREG32(mmMME1_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmMME2_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmMME3_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmMME4_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmMME5_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmMME6_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+
+		WREG32(mmTPC0_NRTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC1_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC2_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC3_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC4_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC5_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC6_RTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmTPC7_NRTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+
+		WREG32(mmPCI_NRTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+		WREG32(mmDMA_NRTR_SPLIT_COEF_0 + offset, polynom[i] >> 7);
+	}
+
+	for (i = 0, offset = 0 ; i < 6 ; i++, offset += 0x40000) {
+		WREG32(mmMME1_RTR_SCRAMB_EN + offset,
+				1 << MME1_RTR_SCRAMB_EN_VAL_SHIFT);
+		WREG32(mmMME1_RTR_NON_LIN_SCRAMB + offset,
+				1 << MME1_RTR_NON_LIN_SCRAMB_EN_SHIFT);
+	}
+
+	for (i = 0, offset = 0 ; i < 8 ; i++, offset += 0x40000) {
+		/*
+		 * Workaround for Bug H2 #2441 :
+		 * "ST.NOP set trace event illegal opcode"
+		 */
+		WREG32(mmTPC0_CFG_TPC_INTR_MASK + offset, tpc_intr_mask);
+
+		WREG32(mmTPC0_NRTR_SCRAMB_EN + offset,
+				1 << TPC0_NRTR_SCRAMB_EN_VAL_SHIFT);
+		WREG32(mmTPC0_NRTR_NON_LIN_SCRAMB + offset,
+				1 << TPC0_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+	}
+
+	WREG32(mmDMA_NRTR_SCRAMB_EN, 1 << DMA_NRTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmDMA_NRTR_NON_LIN_SCRAMB,
+			1 << DMA_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	WREG32(mmPCI_NRTR_SCRAMB_EN, 1 << PCI_NRTR_SCRAMB_EN_VAL_SHIFT);
+	WREG32(mmPCI_NRTR_NON_LIN_SCRAMB,
+			1 << PCI_NRTR_NON_LIN_SCRAMB_EN_SHIFT);
+
+	/*
+	 * Workaround for H2 #HW-23 bug
+	 * Set DMA max outstanding read requests to 240 on DMA CH 1. Set it
+	 * to 16 on KMD DMA
+	 * We need to limit only these DMAs because the user can only read
+	 * from Host using DMA CH 1
+	 */
+	WREG32(mmDMA_CH_0_CFG0, 0x0fff0010);
+	WREG32(mmDMA_CH_1_CFG0, 0x0fff00F0);
+
+	goya->hw_cap_initialized |= HW_CAP_GOLDEN;
+}
+
+
+/*
+ * goya_push_fw_to_device - Push FW code to device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Copy fw code from firmware file to device memory.
+ * Returns 0 on success
+ *
+ */
+static int goya_push_fw_to_device(struct hl_device *hdev, const char *fw_name,
+					void __iomem *dst)
+{
+	const struct firmware *fw;
+	const u64 *fw_data;
+	size_t fw_size, i;
+	int rc;
+
+	rc = request_firmware(&fw, fw_name, hdev->dev);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request %s\n", fw_name);
+		goto out;
+	}
+
+	fw_size = fw->size;
+	if ((fw_size % 4) != 0) {
+		dev_err(hdev->dev, "illegal %s firmware size %lu\n",
+			fw_name, fw_size);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	dev_dbg(hdev->dev, "%s firmware size == %lu\n", fw_name, fw_size);
+
+	fw_data = (const u64 *) fw->data;
+
+	if ((fw->size % 8) != 0)
+		fw_size -= 8;
+
+	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
+		if (!(i & (0x80000 - 1))) {
+			dev_dbg(hdev->dev,
+				"copied so far %lu out of %lu for %s firmware",
+				i, fw_size, fw_name);
+			usleep_range(20, 100);
+		}
+
+		writeq(*fw_data, dst);
+	}
+
+	if ((fw->size % 8) != 0)
+		writel(*(const u32 *) fw_data, dst);
+
+out:
+	release_firmware(fw);
+	return rc;
+}
+
+static int goya_pldm_init_cpu(struct hl_device *hdev)
+{
+	char fw_name[200];
+	void __iomem *dst;
+	u32 val, unit_rst_val;
+	int rc;
+
+	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
+	goya_init_golden_registers(hdev);
+
+	/* Put ARM cores into reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	/* Reset the CA53 MACRO */
+	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
+	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
+	rc = goya_push_fw_to_device(hdev, fw_name, dst);
+	if (rc)
+		return rc;
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
+	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
+	rc = goya_push_fw_to_device(hdev, fw_name, dst);
+	if (rc)
+		return rc;
+
+	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
+	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
+
+	WREG32(mmCPU_CA53_CFG_RST_ADDR_LSB_0,
+		lower_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+	WREG32(mmCPU_CA53_CFG_RST_ADDR_MSB_0,
+		upper_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+
+	/* Release ARM core 0 from reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL,
+					CPU_RESET_CORE0_DEASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	return 0;
+}
+
+/*
+ * FW component passes an offset from SRAM_BASE_ADDR in SCRATCHPAD_xx.
+ * The version string should be located by that offset.
+ */
+static void goya_read_device_fw_version(struct hl_device *hdev,
+					enum goya_fw_component fwc)
+{
+	const char *name;
+	u32 ver_off;
+	char *dest;
+
+	switch (fwc) {
+	case FW_COMP_UBOOT:
+		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_29);
+		dest = hdev->asic_prop.uboot_ver;
+		name = "U-Boot";
+		break;
+	case FW_COMP_PREBOOT:
+		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_28);
+		dest = hdev->asic_prop.preboot_ver;
+		name = "Preboot";
+		break;
+	default:
+		dev_warn(hdev->dev, "Undefined FW component: %d\n", fwc);
+		return;
+	}
+
+	ver_off &= ~((u32)SRAM_BASE_ADDR);
+
+	if (ver_off < SRAM_SIZE - VERSION_MAX_LEN) {
+		memcpy_fromio(dest, hdev->pcie_bar[SRAM_CFG_BAR_ID] + ver_off,
+							VERSION_MAX_LEN);
+	} else {
+		dev_err(hdev->dev, "%s version offset (0x%x) is above SRAM\n",
+								name, ver_off);
+		strcpy(dest, "unavailable");
+	}
+}
+
+static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	char fw_name[200];
+	void __iomem *dst;
+	u32 status;
+	int rc;
+
+	if (!hdev->cpu_enable)
+		return 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_CPU)
+		return 0;
+
+	/*
+	 * Before pushing u-boot/linux to device, need to set the ddr bar to
+	 * base address of dram
+	 */
+	rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to map DDR bar to DRAM base address\n");
+		return rc;
+	}
+
+	if (hdev->pldm) {
+		rc = goya_pldm_init_cpu(hdev);
+		if (rc)
+			return rc;
+
+		goto out;
+	}
+
+	/* Make sure CPU boot-loader is running */
+	rc = hl_poll_timeout(
+		hdev,
+		mmPSOC_GLOBAL_CONF_WARM_REBOOT,
+		status,
+		(status == CPU_BOOT_STATUS_DRAM_RDY) ||
+		(status == CPU_BOOT_STATUS_SRAM_AVAIL),
+		10000,
+		cpu_timeout);
+
+	if (rc) {
+		dev_err(hdev->dev, "Error in ARM u-boot!");
+		switch (status) {
+		case CPU_BOOT_STATUS_NA:
+			dev_err(hdev->dev,
+				"ARM status %d - BTL did NOT run\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_WFE:
+			dev_err(hdev->dev,
+				"ARM status %d - Inside WFE loop\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_BTL:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in BTL\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_PREBOOT:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in Preboot\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_SPL:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in SPL\n", status);
+			break;
+		case CPU_BOOT_STATUS_IN_UBOOT:
+			dev_err(hdev->dev,
+				"ARM status %d - Stuck in u-boot\n", status);
+			break;
+		case CPU_BOOT_STATUS_DRAM_INIT_FAIL:
+			dev_err(hdev->dev,
+				"ARM status %d - DDR initialization failed\n",
+				status);
+			break;
+		default:
+			dev_err(hdev->dev,
+				"ARM status %d - Invalid status code\n",
+				status);
+			break;
+		}
+		return -EIO;
+	}
+
+	/* Read U-Boot version now in case we will later fail */
+	goya_read_device_fw_version(hdev, FW_COMP_UBOOT);
+	goya_read_device_fw_version(hdev, FW_COMP_PREBOOT);
+
+	if (status == CPU_BOOT_STATUS_SRAM_AVAIL)
+		goto out;
+
+	if (!hdev->fw_loading) {
+		dev_info(hdev->dev, "Skip loading FW\n");
+		goto out;
+	}
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
+	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
+	rc = goya_push_fw_to_device(hdev, fw_name, dst);
+	if (rc)
+		return rc;
+
+	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
+
+	rc = hl_poll_timeout(
+		hdev,
+		mmPSOC_GLOBAL_CONF_WARM_REBOOT,
+		status,
+		(status == CPU_BOOT_STATUS_SRAM_AVAIL),
+		10000,
+		cpu_timeout);
+
+	if (rc) {
+		if (status == CPU_BOOT_STATUS_FIT_CORRUPTED)
+			dev_err(hdev->dev,
+				"ARM u-boot reports FIT image is corrupted\n");
+		else
+			dev_err(hdev->dev,
+				"ARM Linux failed to load, %d\n", status);
+		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_NA);
+		return -EIO;
+	}
+
+	dev_info(hdev->dev, "Successfully loaded firmware to device\n");
+
+out:
+	goya->hw_cap_initialized |= HW_CAP_CPU;
+
+	return 0;
+}
+
+/*
+ * goya_hw_init - Goya hardware initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+static int goya_hw_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u32 val;
+	int rc;
+
+	dev_info(hdev->dev, "Starting initialization of H/W\n");
+
+	/* Perform read from the device to make sure device is up */
+	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
+
+	rc = goya_init_cpu(hdev, GOYA_CPU_TIMEOUT_USEC);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize CPU\n");
+		return rc;
+	}
+
+	goya_tpc_mbist_workaround(hdev);
+
+	goya_init_golden_registers(hdev);
+
+	/*
+	 * After CPU initialization is finished, change DDR bar mapping inside
+	 * iATU to point to the start address of the MMU page tables
+	 */
+	rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
+		(MMU_PAGE_TABLES_ADDR & ~(prop->dram_pci_bar_size - 0x1ull)));
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to map DDR bar to MMU page tables\n");
+		return rc;
+	}
+
+	goya_init_security(hdev);
+
+	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
+	rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
+	if (rc) {
+		dev_warn(hdev->dev, "Unable to set pci dma mask to 48 bits\n");
+		rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
+	if (rc) {
+		dev_warn(hdev->dev,
+			"Unable to set pci consistent dma mask to 48 bits\n");
+		rc = pci_set_consistent_dma_mask(hdev->pdev, DMA_BIT_MASK(32));
+		if (rc) {
+			dev_err(hdev->dev,
+				"Unable to set pci consistent dma mask to 32 bits\n");
+			return rc;
+		}
+	}
+
+	/* Perform read from the device to flush all MSI-X configuration */
+	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
+
+	return 0;
+}
+
+/*
+ * goya_hw_fini - Goya hardware tear-down code
+ *
+ * @hdev: pointer to hl_device structure
+ * @hard_reset: should we do hard reset to all engines or just reset the
+ *              compute/dma engines
+ */
+static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 reset_timeout_ms, status;
+
+	if (hdev->pldm)
+		reset_timeout_ms = GOYA_PLDM_RESET_TIMEOUT_MSEC;
+	else
+		reset_timeout_ms = GOYA_RESET_TIMEOUT_MSEC;
+
+	if (hard_reset) {
+		goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE);
+		goya_disable_clk_rlx(hdev);
+		goya_set_pll_refclk(hdev);
+
+		WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, RESET_ALL);
+		dev_info(hdev->dev,
+			"Issued HARD reset command, going to wait %dms\n",
+			reset_timeout_ms);
+	} else {
+		WREG32(mmPSOC_GLOBAL_CONF_SW_ALL_RST_CFG, DMA_MME_TPC_RESET);
+		dev_info(hdev->dev,
+			"Issued SOFT reset command, going to wait %dms\n",
+			reset_timeout_ms);
+	}
+
+	/*
+	 * After hard reset, we can't poll the BTM_FSM register because the PSOC
+	 * itself is in reset. In either reset we need to wait until the reset
+	 * is deasserted
+	 */
+	msleep(reset_timeout_ms);
+
+	status = RREG32(mmPSOC_GLOBAL_CONF_BTM_FSM);
+	if (status & PSOC_GLOBAL_CONF_BTM_FSM_STATE_MASK)
+		dev_err(hdev->dev,
+			"Timeout while waiting for device to reset 0x%x\n",
+			status);
+
+	/* Chicken bit to re-initiate boot sequencer flow */
+	WREG32(mmPSOC_GLOBAL_CONF_BOOT_SEQ_RE_START,
+		1 << PSOC_GLOBAL_CONF_BOOT_SEQ_RE_START_IND_SHIFT);
+	/* Move boot manager FSM to pre boot sequencer init state */
+	WREG32(mmPSOC_GLOBAL_CONF_SW_BTM_FSM,
+			0xA << PSOC_GLOBAL_CONF_SW_BTM_FSM_CTRL_SHIFT);
+
+	goya->hw_cap_initialized &= ~(HW_CAP_CPU | HW_CAP_CPU_Q |
+					HW_CAP_DDR_0 | HW_CAP_DDR_1 |
+					HW_CAP_DMA | HW_CAP_MME |
+					HW_CAP_MMU | HW_CAP_TPC_MBIST |
+					HW_CAP_GOLDEN | HW_CAP_TPC);
+
+	if (!hdev->pldm) {
+		int rc;
+		/* In case we are running inside VM and the VM is
+		 * shutting down, we need to make sure CPU boot-loader
+		 * is running before we can continue the VM shutdown.
+		 * That is because the VM will send an FLR signal that
+		 * we must answer
+		 */
+		dev_info(hdev->dev,
+			"Going to wait up to %ds for CPU boot loader\n",
+			GOYA_CPU_TIMEOUT_USEC / 1000 / 1000);
+
+		rc = hl_poll_timeout(
+			hdev,
+			mmPSOC_GLOBAL_CONF_WARM_REBOOT,
+			status,
+			(status == CPU_BOOT_STATUS_DRAM_RDY),
+			10000,
+			GOYA_CPU_TIMEOUT_USEC);
+		if (rc)
+			dev_err(hdev->dev,
+				"failed to wait for CPU boot loader\n");
+	}
+}
+
 int goya_suspend(struct hl_device *hdev)
 {
 	return 0;
@@ -647,6 +1511,8 @@ static const struct hl_asic_funcs goya_funcs = {
 	.early_fini = goya_early_fini,
 	.sw_init = goya_sw_init,
 	.sw_fini = goya_sw_fini,
+	.hw_init = goya_hw_init,
+	.hw_fini = goya_hw_fini,
 	.suspend = goya_suspend,
 	.resume = goya_resume,
 	.mmap = goya_mmap,
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index d65ac54ba89b..c3e17de41ab3 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -10,7 +10,9 @@
 
 #include <uapi/misc/habanalabs.h>
 #include "habanalabs.h"
+#include "include/hl_boot_if.h"
 #include "include/goya/goya.h"
+#include "include/goya/goya_fw_if.h"
 
 #define NUMBER_OF_CMPLT_QUEUES		5
 #define NUMBER_OF_EXT_HW_QUEUES		5
@@ -152,4 +154,6 @@ struct goya_device {
 	u32		hw_cap_initialized;
 };
 
+void goya_init_security(struct hl_device *hdev);
+
 #endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/goya/goya_security.c b/drivers/misc/habanalabs/goya/goya_security.c
new file mode 100644
index 000000000000..7315bbc02569
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goya_security.c
@@ -0,0 +1,2999 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "goyaP.h"
+
+/*
+ * goya_set_block_as_protected - set the given block as protected
+ *
+ * @hdev: pointer to hl_device structure
+ * @block: block base address
+ *
+ */
+static void goya_pb_set_block(struct hl_device *hdev, u64 base)
+{
+	u32 pb_addr = base - CFG_BASE + PROT_BITS_OFFS;
+
+	while (pb_addr & 0xFFF) {
+		WREG32(pb_addr, 0);
+		pb_addr += 4;
+	}
+}
+
+static void goya_init_mme_protection_bits(struct hl_device *hdev)
+{
+	u32 pb_addr, mask;
+	u8 word_offset;
+
+	/* TODO: change to real reg name when Soc Online is updated */
+	u64 mmMME_SBB_POWER_ECO1 = 0xDFF60,
+		mmMME_SBB_POWER_ECO2 = 0xDFF64;
+
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_0_BASE);
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_1_BASE);
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_2_BASE);
+	goya_pb_set_block(hdev, mmACC_MS_ECC_MEM_3_BASE);
+
+	goya_pb_set_block(hdev, mmSBA_ECC_MEM_BASE);
+	goya_pb_set_block(hdev, mmSBB_ECC_MEM_BASE);
+
+	goya_pb_set_block(hdev, mmMME1_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME1_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME1_WR_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME2_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME2_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME2_WR_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME3_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME3_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME3_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmMME4_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME4_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME4_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmMME5_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME5_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME5_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmMME6_RTR_BASE);
+	goya_pb_set_block(hdev, mmMME6_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmMME6_WR_REGULATOR_BASE);
+
+	pb_addr = (mmMME_DUMMY & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_DUMMY & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_DUMMY & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_RESET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_STALL & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_ADD & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_DATA_WR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_DATA_RD & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_CTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_DBGMEM_RC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_LOG_SHADOW & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_STORE_MAX_CREDIT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_STORE_MAX_CREDIT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_STORE_MAX_CREDIT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_AGU & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBB & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_WBC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBA_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBB_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBC_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_WBC_CONTROL_DATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_TE & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_TE2DEC & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_REI_STATUS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_REI_MASK & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SEI_STATUS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SEI_MASK & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SPI_STATUS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SPI_MASK & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_QM_CP_STS & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_QM_CP_STS & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_QM_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_CURRENT_INST_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_PQ_BUF_RDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_QM_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_CQ_IFIFO_CNT &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmMME_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmMME_SBB_POWER_ECO1 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmMME_SBB_POWER_ECO1 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmMME_SBB_POWER_ECO1 & 0x7F) >> 2);
+	mask |= 1 << ((mmMME_SBB_POWER_ECO2 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+}
+
+static void goya_init_dma_protection_bits(struct hl_device *hdev)
+{
+	u32 pb_addr, mask;
+	u8 word_offset;
+
+	goya_pb_set_block(hdev, mmDMA_NRTR_BASE);
+	goya_pb_set_block(hdev, mmDMA_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmDMA_WR_REGULATOR_BASE);
+
+	pb_addr = (mmDMA_QM_0_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_0_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_0_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_0_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_0_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_0_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_0_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_0_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_0_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_0_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_0_BASE);
+
+	pb_addr = (mmDMA_QM_1_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_1_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_1_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_1_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_1_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_1_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_1_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_1_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_1_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_1_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_1_BASE);
+
+	pb_addr = (mmDMA_QM_2_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_2_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_2_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_2_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_2_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_2_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_2_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_2_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_2_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_2_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_2_BASE);
+
+	pb_addr = (mmDMA_QM_3_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_3_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_3_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_3_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_3_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_3_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_3_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_3_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_3_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_3_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_3_BASE);
+
+	pb_addr = (mmDMA_QM_4_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_4_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_4_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_4_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_4_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_4_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmDMA_QM_4_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmDMA_QM_4_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmDMA_QM_4_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmDMA_QM_4_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmDMA_CH_4_BASE);
+}
+
+static void goya_init_tpc_protection_bits(struct hl_device *hdev)
+{
+	u32 pb_addr, mask;
+	u8 word_offset;
+
+	goya_pb_set_block(hdev, mmTPC0_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC0_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC0_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC0_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC0_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC0_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC0_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC0_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC1_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC1_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC1_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC1_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CFG_FUNC_MBIST_CNTRL & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC1_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC1_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC1_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC1_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC1_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC1_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC2_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC2_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC2_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC2_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CFG_FUNC_MBIST_CNTRL & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC2_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC2_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC2_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC2_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC2_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC2_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC3_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC3_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC3_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC3_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CFG_CFG_BASE_ADDRESS_HIGH
+			& PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CFG_FUNC_MBIST_CNTRL
+			& PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC3_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC3_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC3_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC3_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC3_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC4_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC4_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC4_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC4_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC4_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC4_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC4_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC4_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC4_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC5_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC5_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC5_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC5_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC5_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC5_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC5_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC5_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC5_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC6_RTR_BASE);
+	goya_pb_set_block(hdev, mmTPC6_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC6_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC6_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC6_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC6_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC6_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC6_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC6_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	goya_pb_set_block(hdev, mmTPC7_NRTR_BASE);
+	goya_pb_set_block(hdev, mmTPC7_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmTPC7_WR_REGULATOR_BASE);
+
+	pb_addr = (mmTPC7_CFG_CFG_BASE_ADDRESS_HIGH & ~0xFFF) +	PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CFG_CFG_BASE_ADDRESS_HIGH &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CFG_CFG_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_CFG_SUBTRACT_VALUE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_SM_BASE_ADDRESS_LOW & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_SM_BASE_ADDRESS_HIGH & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CFG_ARUSER & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CFG_ARUSER & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CFG_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_AWUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CFG_FUNC_MBIST_CNTRL & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CFG_FUNC_MBIST_CNTRL &
+			PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CFG_FUNC_MBIST_CNTRL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_PAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_4 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_5 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_6 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_7 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_8 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CFG_FUNC_MBIST_MEM_9 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_QM_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_QM_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_QM_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_GLBL_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_BASE_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_BASE_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_SIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_CI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_ARUSER & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_QM_PQ_PUSH0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_QM_PQ_PUSH0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_QM_PQ_PUSH0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PUSH1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PUSH2 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_PUSH3 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_PQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_TSIZE & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CTL & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_QM_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_QM_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_QM_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_QM_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_GLBL_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_GLBL_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CMDQ_GLBL_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_PROT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_ERR_WDATA & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_NON_SECURE_PROPS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_GLBL_STS1 & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_CQ_CFG0 & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_CQ_CFG0 & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CMDQ_CQ_CFG0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_CFG1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_ARUSER & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_PTR_LO_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_PTR_HI_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_TSIZE_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_CTL_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_STS0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_STS1 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_EN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_RST_TOKEN & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_SAT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_RD_RATE_LIM_TOUT & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_CQ_IFIFO_CNT & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_CQ_IFIFO_CNT & PROT_BITS_OFFS) >> 7) << 2;
+	mask = 1 << ((mmTPC7_CMDQ_CQ_IFIFO_CNT & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE0_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE0_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE1_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE1_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE2_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE2_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE3_ADDR_LO & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_MSG_BASE3_ADDR_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_TSIZE_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_SRC_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_SRC_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_DST_BASE_LO_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_DST_BASE_HI_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_LDMA_COMMIT_OFFSET & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_STS & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_CURRENT_INST_LO & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+
+	pb_addr = (mmTPC7_CMDQ_CP_CURRENT_INST_HI & ~0xFFF) + PROT_BITS_OFFS;
+	word_offset = ((mmTPC7_CMDQ_CP_CURRENT_INST_HI & PROT_BITS_OFFS) >> 7)
+			<< 2;
+	mask = 1 << ((mmTPC7_CMDQ_CP_CURRENT_INST_HI & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_BARRIER_CFG & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CP_DBG_0 & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_BUF_ADDR & 0x7F) >> 2);
+	mask |= 1 << ((mmTPC7_CMDQ_CQ_BUF_RDATA & 0x7F) >> 2);
+
+	WREG32(pb_addr + word_offset, ~mask);
+}
+
+/*
+ * goya_init_protection_bits - Initialize protection bits for specific registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * All protection bits are 1 by default, means not protected. Need to set to 0
+ * each bit that belongs to a protected register.
+ *
+ */
+static void goya_init_protection_bits(struct hl_device *hdev)
+{
+	/*
+	 * In each 4K block of registers, the last 128 bytes are protection
+	 * bits - total of 1024 bits, one for each register. Each bit is related
+	 * to a specific register, by the order of the registers.
+	 * So in order to calculate the bit that is related to a given register,
+	 * we need to calculate its word offset and then the exact bit inside
+	 * the word (which is 4 bytes).
+	 *
+	 * Register address:
+	 *
+	 * 31                 12 11           7   6             2  1      0
+	 * -----------------------------------------------------------------
+	 * |      Don't         |    word       |  bit location  |    0    |
+	 * |      care          |   offset      |  inside word   |         |
+	 * -----------------------------------------------------------------
+	 *
+	 * Bits 7-11 represents the word offset inside the 128 bytes.
+	 * Bits 2-6 represents the bit location inside the word.
+	 */
+
+	goya_pb_set_block(hdev, mmPCI_NRTR_BASE);
+	goya_pb_set_block(hdev, mmPCI_RD_REGULATOR_BASE);
+	goya_pb_set_block(hdev, mmPCI_WR_REGULATOR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y0_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y0_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y1_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y1_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y2_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y2_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y3_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y3_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y4_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y4_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmSRAM_Y5_X0_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X0_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X1_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X1_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X2_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X2_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X3_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X3_RTR_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X4_BANK_BASE);
+	goya_pb_set_block(hdev, mmSRAM_Y5_X4_RTR_BASE);
+
+	goya_pb_set_block(hdev, mmPCIE_WRAP_BASE);
+	goya_pb_set_block(hdev, mmPCIE_CORE_BASE);
+	goya_pb_set_block(hdev, mmPCIE_DB_CFG_BASE);
+	goya_pb_set_block(hdev, mmPCIE_DB_CMD_BASE);
+	goya_pb_set_block(hdev, mmPCIE_AUX_BASE);
+	goya_pb_set_block(hdev, mmPCIE_DB_RSV_BASE);
+	goya_pb_set_block(hdev, mmPCIE_PHY_BASE);
+
+	goya_init_mme_protection_bits(hdev);
+
+	goya_init_dma_protection_bits(hdev);
+
+	goya_init_tpc_protection_bits(hdev);
+}
+
+/*
+ * goya_init_security - Initialize security model
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Initialize the security model of the device
+ * That includes range registers and protection bit per register
+ *
+ */
+void goya_init_security(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	u32 dram_addr_lo = lower_32_bits(DRAM_PHYS_BASE);
+	u32 dram_addr_hi = upper_32_bits(DRAM_PHYS_BASE);
+
+	u32 lbw_rng0_base = 0xFC440000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng0_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng1_base = 0xFC480000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng1_mask = 0xFFF80000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng2_base = 0xFC600000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng2_mask = 0xFFE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng3_base = 0xFC800000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng3_mask = 0xFFF00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng4_base = 0xFCC02000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng4_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng5_base = 0xFCC40000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng5_mask = 0xFFFF8000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng6_base = 0xFCC48000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng6_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng7_base = 0xFCC4A000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng7_mask = 0xFFFFE000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng8_base = 0xFCC4C000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng8_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng9_base = 0xFCC50000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng9_mask = 0xFFFF0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng10_base = 0xFCC60000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng10_mask = 0xFFFE0000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng11_base = 0xFCE00000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng11_mask = 0xFFFFC000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng12_base = 0xFE484000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng12_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	u32 lbw_rng13_base = 0xFEC43000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+	u32 lbw_rng13_mask = 0xFFFFF000 & DMA_MACRO_LBW_RANGE_BASE_R_MASK;
+
+	WREG32(mmDMA_MACRO_LBW_RANGE_HIT_BLOCK, 0xFFFF);
+	WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFF);
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU)) {
+		WREG32(mmDMA_MACRO_HBW_RANGE_HIT_BLOCK, 0xFE);
+
+		/* Protect HOST */
+		WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_0, 0);
+		WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_0, 0);
+		WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_0, 0);
+		WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_0, 0xFFF80);
+	}
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmDMA_MACRO_HBW_RANGE_BASE_31_0_1, dram_addr_lo);
+	WREG32(mmDMA_MACRO_HBW_RANGE_BASE_49_32_1, dram_addr_hi);
+	WREG32(mmDMA_MACRO_HBW_RANGE_MASK_31_0_1, 0xE0000000);
+	WREG32(mmDMA_MACRO_HBW_RANGE_MASK_49_32_1, 0x3FFFF);
+
+	/* Protect registers */
+
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmDMA_MACRO_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmDMA_MACRO_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME1_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME2_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME3_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME4_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME5_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmMME6_RTR_LBW_RANGE_HIT, 0xFFFF);
+
+	WREG32(mmMME1_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME2_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME3_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME4_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME5_RTR_HBW_RANGE_HIT, 0xFE);
+	WREG32(mmMME6_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME1_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME1_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME2_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME2_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME3_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME3_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME4_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME4_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME5_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME5_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmMME6_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmMME6_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME1_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME1_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME2_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME2_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME3_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME3_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME4_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME4_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME5_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME5_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmMME6_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmMME6_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC0_NRTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC0_NRTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC0_NRTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC1_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC1_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC1_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC1_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC1_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC1_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC2_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC2_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC2_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC2_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC2_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC2_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC3_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC3_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC3_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC3_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC3_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC3_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC4_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC4_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC4_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC4_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC4_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC4_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC5_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC5_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC5_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC5_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC5_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC5_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC6_RTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC6_RTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC6_RTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC6_RTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC6_RTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC6_RTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	WREG32(mmTPC7_NRTR_LBW_RANGE_HIT, 0xFFFF);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_HIT, 0xFE);
+
+	/* Protect HOST */
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_L_0, 0);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_H_0, 0);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_L_0, 0);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_H_0, 0xFFF80);
+
+	/*
+	 * Protect DDR @
+	 * DRAM_VIRT_BASE : DRAM_VIRT_BASE + DRAM_VIRT_END
+	 * The mask protects the first 512MB
+	 */
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_L_1, dram_addr_lo);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_BASE_H_1, dram_addr_hi);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_L_1, 0xE0000000);
+	WREG32(mmTPC7_NRTR_HBW_RANGE_MASK_H_1, 0x3FFFF);
+
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_0, lbw_rng0_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_0, lbw_rng0_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_1, lbw_rng1_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_1, lbw_rng1_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_2, lbw_rng2_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_2, lbw_rng2_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_3, lbw_rng3_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_3, lbw_rng3_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_4, lbw_rng4_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_4, lbw_rng4_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_5, lbw_rng5_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_5, lbw_rng5_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_6, lbw_rng6_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_6, lbw_rng6_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_7, lbw_rng7_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_7, lbw_rng7_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_8, lbw_rng8_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_8, lbw_rng8_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_9, lbw_rng9_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_9, lbw_rng9_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_10, lbw_rng10_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_10, lbw_rng10_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_11, lbw_rng11_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_11, lbw_rng11_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_12, lbw_rng12_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_12, lbw_rng12_mask);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_BASE_13, lbw_rng13_base);
+	WREG32(mmTPC7_NRTR_LBW_RANGE_MASK_13, lbw_rng13_mask);
+
+	goya_init_protection_bits(hdev);
+}
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index ed79a0be58d7..8b0e8796628b 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -8,6 +8,8 @@
 #ifndef HABANALABSP_H_
 #define HABANALABSP_H_
 
+#include "include/armcp_if.h"
+
 #define pr_fmt(fmt)			"habanalabs: " fmt
 
 #include <linux/pci.h>
@@ -23,6 +25,8 @@
 
 #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
 
+#define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
+
 #define HL_MAX_QUEUES			128
 
 struct hl_device;
@@ -31,6 +35,8 @@ struct hl_fpriv;
 
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
+ * @uboot_ver: F/W U-boot version.
+ * @preboot_ver: F/W Preboot version.
  * @sram_base_address: SRAM physical start address.
  * @sram_end_address: SRAM physical end address.
  * @sram_user_base_address - SRAM physical start address for user access.
@@ -59,6 +65,8 @@ struct hl_fpriv;
  * @tpc_enabled_mask: which TPCs are enabled.
  */
 struct asic_fixed_properties {
+	char			uboot_ver[VERSION_MAX_LEN];
+	char			preboot_ver[VERSION_MAX_LEN];
 	u64			sram_base_address;
 	u64			sram_end_address;
 	u64			sram_user_base_address;
@@ -156,6 +164,8 @@ enum hl_asic_type {
  * @early_fini: tears down what was done in early_init.
  * @sw_init: sets up driver state, does not configure H/W.
  * @sw_fini: tears down driver state, does not configure H/W.
+ * @hw_init: sets up the H/W state.
+ * @hw_fini: tears down the H/W state.
  * @suspend: handles IP specific H/W or SW changes for suspend.
  * @resume: handles IP specific H/W or SW changes for resume.
  * @mmap: mmap function, does nothing.
@@ -174,6 +184,8 @@ struct hl_asic_funcs {
 	int (*early_fini)(struct hl_device *hdev);
 	int (*sw_init)(struct hl_device *hdev);
 	int (*sw_fini)(struct hl_device *hdev);
+	int (*hw_init)(struct hl_device *hdev);
+	void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
 	int (*suspend)(struct hl_device *hdev);
 	int (*resume)(struct hl_device *hdev);
 	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
@@ -350,7 +362,10 @@ struct hl_device {
 	u8				disabled;
 
 	/* Parameters for bring-up */
+	u8				cpu_enable;
 	u8				reset_pcilink;
+	u8				fw_loading;
+	u8				pldm;
 };
 
 
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index b07979c599e2..2611883eab11 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -171,7 +171,14 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->major = hl_major;
 
 	/* Parameters for bring-up - set them to defaults */
+	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
+	hdev->fw_loading = 1;
+	hdev->pldm = 0;
+
+	/* If CPU is disabled, no point in loading FW */
+	if (!hdev->cpu_enable)
+		hdev->fw_loading = 0;
 
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
diff --git a/drivers/misc/habanalabs/include/armcp_if.h b/drivers/misc/habanalabs/include/armcp_if.h
new file mode 100644
index 000000000000..85fc2efe144b
--- /dev/null
+++ b/drivers/misc/habanalabs/include/armcp_if.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef ARMCP_IF_H
+#define ARMCP_IF_H
+
+#include <linux/types.h>
+
+/*
+ * ArmCP info
+ */
+
+#define VERSION_MAX_LEN			128
+
+#endif /* ARMCP_IF_H */
diff --git a/drivers/misc/habanalabs/include/goya/goya_fw_if.h b/drivers/misc/habanalabs/include/goya/goya_fw_if.h
new file mode 100644
index 000000000000..a9920cb4a07b
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya_fw_if.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef GOYA_FW_IF_H
+#define GOYA_FW_IF_H
+
+#define CPU_BOOT_ADDR		0x7FF8040000ull
+
+#define UBOOT_FW_OFFSET		0x100000		/* 1MB in SRAM */
+#define LINUX_FW_OFFSET		0x800000		/* 8MB in DDR */
+
+enum goya_pll_index {
+	CPU_PLL = 0,
+	IC_PLL,
+	MC_PLL,
+	MME_PLL,
+	PCI_PLL,
+	EMMC_PLL,
+	TPC_PLL
+};
+
+#define GOYA_PLL_FREQ_LOW		50000000 /* 50 MHz */
+
+#endif /* GOYA_FW_IF_H */
diff --git a/drivers/misc/habanalabs/include/hl_boot_if.h b/drivers/misc/habanalabs/include/hl_boot_if.h
new file mode 100644
index 000000000000..7475732b9996
--- /dev/null
+++ b/drivers/misc/habanalabs/include/hl_boot_if.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef HL_BOOT_IF_H
+#define HL_BOOT_IF_H
+
+enum cpu_boot_status {
+	CPU_BOOT_STATUS_NA = 0,		/* Default value after reset of chip */
+	CPU_BOOT_STATUS_IN_WFE,
+	CPU_BOOT_STATUS_DRAM_RDY,
+	CPU_BOOT_STATUS_SRAM_AVAIL,
+	CPU_BOOT_STATUS_IN_BTL,		/* BTL is H/W FSM */
+	CPU_BOOT_STATUS_IN_PREBOOT,
+	CPU_BOOT_STATUS_IN_SPL,
+	CPU_BOOT_STATUS_IN_UBOOT,
+	CPU_BOOT_STATUS_DRAM_INIT_FAIL,
+	CPU_BOOT_STATUS_FIT_CORRUPTED
+};
+
+enum kmd_msg {
+	KMD_MSG_NA = 0,
+	KMD_MSG_GOTO_WFE,
+	KMD_MSG_FIT_RDY
+};
+
+#endif /* HL_BOOT_IF_H */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 07/15] habanalabs: add h/w queues module
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (4 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 06/15] habanalabs: add basic Goya h/w initialization Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 08/15] habanalabs: add event queue and interrupts Oded Gabbay
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds the H/W queues module and the code to initialize Goya's
various compute and DMA engines and their queues.

Goya has 5 DMA channels, 8 TPC engines and a single MME engine. For each
channel/engine, there is a H/W queue logic which is used to pass commands
from the user to the H/W. That logic is called QMAN.

There are two types of QMANs: external and internal. The DMA QMANs are
considered external while the TPC and MME QMANs are considered internal.
For each external queue there is a completion queue, which is located on
the Host memory.

The differences between external and internal QMANs are:

1. The location of the queue's memory. External QMANs are located on the
   Host memory while internal QMANs are located on the on-chip memory.

2. The external QMAN write an entry to a completion queue and sends an
   MSI-X interrupt upon completion of a command buffer that was given to
   it. The internal QMAN doesn't do that.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/Makefile              |    2 +-
 drivers/misc/habanalabs/device.c              |   75 +-
 drivers/misc/habanalabs/goya/goya.c           | 1525 +++++++++++++++--
 drivers/misc/habanalabs/goya/goyaP.h          |    7 +
 drivers/misc/habanalabs/habanalabs.h          |  174 +-
 drivers/misc/habanalabs/habanalabs_drv.c      |    5 +
 drivers/misc/habanalabs/hw_queue.c            |  404 +++++
 drivers/misc/habanalabs/include/armcp_if.h    |  292 ++++
 .../include/goya/goya_async_events.h          |  186 ++
 .../habanalabs/include/goya/goya_packets.h    |  129 ++
 drivers/misc/habanalabs/include/qman_if.h     |   56 +
 drivers/misc/habanalabs/irq.c                 |  150 ++
 include/uapi/misc/habanalabs.h                |   29 +
 13 files changed, 2918 insertions(+), 116 deletions(-)
 create mode 100644 drivers/misc/habanalabs/hw_queue.c
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_async_events.h
 create mode 100644 drivers/misc/habanalabs/include/goya/goya_packets.h
 create mode 100644 drivers/misc/habanalabs/include/qman_if.h
 create mode 100644 drivers/misc/habanalabs/irq.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index 2530c9b78ca4..c07f3ccb57dc 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,7 +5,7 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
-		command_buffer.o
+		command_buffer.o hw_queue.o irq.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 9dcfe2cd3a0f..3f807e1419a6 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -175,13 +175,23 @@ static int device_early_init(struct hl_device *hdev)
 	if (rc)
 		goto early_fini;
 
+	hdev->cq_wq = alloc_workqueue("hl-free-jobs", WQ_UNBOUND, 0);
+	if (hdev->cq_wq == NULL) {
+		dev_err(hdev->dev, "Failed to allocate CQ workqueue\n");
+		rc = -ENOMEM;
+		goto asid_fini;
+	}
+
 	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
 
 	mutex_init(&hdev->fd_open_cnt_lock);
+	mutex_init(&hdev->send_cpu_message_lock);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
 	return 0;
 
+asid_fini:
+	hl_asid_fini(hdev);
 early_fini:
 	if (hdev->asic_funcs->early_fini)
 		hdev->asic_funcs->early_fini(hdev);
@@ -197,9 +207,12 @@ static int device_early_init(struct hl_device *hdev)
  */
 static void device_early_fini(struct hl_device *hdev)
 {
+	mutex_destroy(&hdev->send_cpu_message_lock);
 
 	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
 
+	destroy_workqueue(hdev->cq_wq);
+
 	hl_asid_fini(hdev);
 
 	if (hdev->asic_funcs->early_fini)
@@ -278,7 +291,7 @@ int hl_device_resume(struct hl_device *hdev)
  */
 int hl_device_init(struct hl_device *hdev, struct class *hclass)
 {
-	int rc;
+	int i, rc, cq_ready_cnt;
 
 	/* Create device */
 	rc = device_setup_cdev(hdev, hclass, hdev->id, &hl_ops);
@@ -299,11 +312,48 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	if (rc)
 		goto early_fini;
 
+	/*
+	 * Initialize the H/W queues. Must be done before hw_init, because
+	 * there the addresses of the kernel queue are being written to the
+	 * registers of the device
+	 */
+	rc = hl_hw_queues_create(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize kernel queues\n");
+		goto sw_fini;
+	}
+
+	/*
+	 * Initialize the completion queues. Must be done before hw_init,
+	 * because there the addresses of the completion queues are being
+	 * passed as arguments to request_irq
+	 */
+	hdev->completion_queue =
+			kcalloc(hdev->asic_prop.completion_queues_count,
+				sizeof(*hdev->completion_queue), GFP_KERNEL);
+
+	if (!hdev->completion_queue) {
+		dev_err(hdev->dev, "failed to allocate completion queues\n");
+		rc = -ENOMEM;
+		goto hw_queues_destroy;
+	}
+
+	for (i = 0, cq_ready_cnt = 0;
+			i < hdev->asic_prop.completion_queues_count;
+			i++, cq_ready_cnt++) {
+		rc = hl_cq_init(hdev, &hdev->completion_queue[i], i);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to initialize completion queue\n");
+			goto cq_fini;
+		}
+	}
+
 	/* Allocate the kernel context */
 	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
 	if (!hdev->kernel_ctx) {
 		rc = -ENOMEM;
-		goto sw_fini;
+		goto cq_fini;
 	}
 
 	hdev->user_ctx = NULL;
@@ -329,6 +379,14 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 
 	hdev->disabled = false;
 
+	/* Check that the communication with the device is working */
+	rc = hdev->asic_funcs->test_queues(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to detect if device is alive\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
@@ -340,6 +398,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 			"kernel ctx is still alive on initialization failure\n");
 free_ctx:
 	kfree(hdev->kernel_ctx);
+cq_fini:
+	for (i = 0 ; i < cq_ready_cnt ; i++)
+		hl_cq_fini(hdev, &hdev->completion_queue[i]);
+	kfree(hdev->completion_queue);
+hw_queues_destroy:
+	hl_hw_queues_destroy(hdev);
 sw_fini:
 	hdev->asic_funcs->sw_fini(hdev);
 early_fini:
@@ -369,6 +433,7 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
  */
 void hl_device_fini(struct hl_device *hdev)
 {
+	int i;
 	dev_info(hdev->dev, "Removing device\n");
 
 	/* Mark device as disabled */
@@ -383,6 +448,12 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, true);
 
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		hl_cq_fini(hdev, &hdev->completion_queue[i]);
+	kfree(hdev->completion_queue);
+
+	hl_hw_queues_destroy(hdev);
+
 	/* Call ASIC S/W finalize function */
 	hdev->asic_funcs->sw_fini(hdev);
 
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 142232370d07..046c4221af37 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -98,6 +98,26 @@
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int i;
+
+	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
+		prop->hw_queues_props[i].type = QUEUE_TYPE_EXT;
+		prop->hw_queues_props[i].kmd_only = 0;
+	}
+
+	for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES ; i++) {
+		prop->hw_queues_props[i].type = QUEUE_TYPE_CPU;
+		prop->hw_queues_props[i].kmd_only = 1;
+	}
+
+	for (; i < NUMBER_OF_EXT_HW_QUEUES + NUMBER_OF_CPU_HW_QUEUES +
+			NUMBER_OF_INT_HW_QUEUES; i++) {
+		prop->hw_queues_props[i].type = QUEUE_TYPE_INT;
+		prop->hw_queues_props[i].kmd_only = 0;
+	}
+
+	for (; i < HL_MAX_QUEUES; i++)
+		prop->hw_queues_props[i].type = QUEUE_TYPE_NA;
 
 	prop->completion_queues_count = NUMBER_OF_CMPLT_QUEUES;
 
@@ -126,6 +146,18 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->high_pll = PLL_HIGH_DEFAULT;
 }
 
+int goya_send_pci_access_msg(struct hl_device *hdev, u32 opcode)
+{
+	struct armcp_packet pkt;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = opcode << ARMCP_PKT_CTL_OPCODE_SHIFT;
+
+	return hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt,
+			sizeof(pkt), HL_DEVICE_TIMEOUT_USEC, NULL);
+}
+
 /*
  * goya_pci_bars_map - Map PCI BARS of Goya device
  *
@@ -515,6 +547,8 @@ static int goya_sw_init(struct hl_device *hdev)
 	if (!goya)
 		return -ENOMEM;
 
+	goya->test_cpu_queue = goya_test_cpu_queue;
+
 	/* according to goya_init_iatu */
 	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
 	hdev->asic_specific = goya;
@@ -601,6 +635,311 @@ int goya_sw_fini(struct hl_device *hdev)
 	return 0;
 }
 
+static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
+		dma_addr_t bus_address)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u32 reg_off = dma_id * (mmDMA_QM_1_PQ_PI - mmDMA_QM_0_PQ_PI);
+
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	WREG32(mmDMA_QM_0_PQ_BASE_LO + reg_off, lower_32_bits(bus_address));
+	WREG32(mmDMA_QM_0_PQ_BASE_HI + reg_off, upper_32_bits(bus_address));
+
+	WREG32(mmDMA_QM_0_PQ_SIZE + reg_off, ilog2(HL_QUEUE_LENGTH));
+	WREG32(mmDMA_QM_0_PQ_PI + reg_off, 0);
+	WREG32(mmDMA_QM_0_PQ_CI + reg_off, 0);
+
+	WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
+	WREG32(mmDMA_QM_0_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
+	WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
+	WREG32(mmDMA_QM_0_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
+	WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmDMA_QM_0_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
+	WREG32(mmDMA_QM_0_GLBL_ERR_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_DMA0_QM + dma_id);
+
+	/* PQ has buffer of 2 cache lines, while CQ has 8 lines */
+	WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
+	WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
+
+	if (dma_id == 0)
+		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
+	else
+		if (goya->hw_cap_initialized & HW_CAP_MMU)
+			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
+					QMAN_DMA_PARTLY_TRUSTED);
+		else
+			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
+					QMAN_DMA_FULLY_TRUSTED);
+
+	WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
+	WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
+}
+
+static void goya_init_dma_ch(struct hl_device *hdev, int dma_id)
+{
+	u32 gic_base_lo, gic_base_hi;
+	u64 sob_addr;
+	u32 reg_off = dma_id * (mmDMA_CH_1_CFG1 - mmDMA_CH_0_CFG1);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	WREG32(mmDMA_CH_0_ERRMSG_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmDMA_CH_0_ERRMSG_ADDR_HI + reg_off, gic_base_hi);
+	WREG32(mmDMA_CH_0_ERRMSG_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_DMA0_CH + dma_id);
+
+	if (dma_id) {
+		sob_addr = CFG_BASE + mmSYNC_MNGR_SOB_OBJ_1000 +
+				(dma_id - 1) * 4;
+		WREG32(mmDMA_CH_0_WR_COMP_ADDR_LO + reg_off,
+				lower_32_bits(sob_addr));
+		WREG32(mmDMA_CH_0_WR_COMP_ADDR_HI + reg_off,
+				upper_32_bits(sob_addr));
+		WREG32(mmDMA_CH_0_WR_COMP_WDATA + reg_off, 0x80000001);
+	}
+}
+
+/*
+ * goya_init_dma_qmans - Initialize QMAN DMA registers
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Initialize the H/W registers of the QMAN DMA channels
+ *
+ */
+static void goya_init_dma_qmans(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct hl_hw_queue *q;
+	dma_addr_t bus_address;
+	int i;
+
+	if (goya->hw_cap_initialized & HW_CAP_DMA)
+		return;
+
+	q = &hdev->kernel_queues[0];
+
+	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++, q++) {
+		bus_address = q->bus_address +
+				hdev->asic_prop.host_phys_base_address;
+
+		goya_init_dma_qman(hdev, i, bus_address);
+		goya_init_dma_ch(hdev, i);
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_DMA;
+}
+
+/*
+ * goya_disable_external_queues - Disable external queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_disable_external_queues(struct hl_device *hdev)
+{
+	WREG32(mmDMA_QM_0_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_1_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_2_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_3_GLBL_CFG0, 0);
+	WREG32(mmDMA_QM_4_GLBL_CFG0, 0);
+}
+
+static int goya_stop_queue(struct hl_device *hdev, u32 cfg_reg,
+				u32 cp_sts_reg, u32 glbl_sts0_reg)
+{
+	int rc;
+	u32 status;
+
+	/* use the values of TPC0 as they are all the same*/
+
+	WREG32(cfg_reg, 1 << TPC0_QM_GLBL_CFG1_CP_STOP_SHIFT);
+
+	status = RREG32(cp_sts_reg);
+	if (status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK) {
+		rc = hl_poll_timeout(
+			hdev,
+			cp_sts_reg,
+			status,
+			!(status & TPC0_QM_CP_STS_FENCE_IN_PROGRESS_MASK),
+			1000,
+			QMAN_FENCE_TIMEOUT_USEC);
+
+		/* if QMAN is stuck in fence no need to check for stop */
+		if (rc)
+			return 0;
+	}
+
+	rc = hl_poll_timeout(
+		hdev,
+		glbl_sts0_reg,
+		status,
+		(status & TPC0_QM_GLBL_STS0_CP_IS_STOP_MASK),
+		1000,
+		QMAN_STOP_TIMEOUT_USEC);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Timeout while waiting for QMAN to stop\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * goya_stop_external_queues - Stop external queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+static int goya_stop_external_queues(struct hl_device *hdev)
+{
+	int rc, retval = 0;
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_0_GLBL_CFG1,
+			mmDMA_QM_0_CP_STS,
+			mmDMA_QM_0_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop DMA QMAN 0\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_1_GLBL_CFG1,
+			mmDMA_QM_1_CP_STS,
+			mmDMA_QM_1_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop DMA QMAN 1\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_2_GLBL_CFG1,
+			mmDMA_QM_2_CP_STS,
+			mmDMA_QM_2_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop DMA QMAN 2\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_3_GLBL_CFG1,
+			mmDMA_QM_3_CP_STS,
+			mmDMA_QM_3_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop DMA QMAN 3\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmDMA_QM_4_GLBL_CFG1,
+			mmDMA_QM_4_CP_STS,
+			mmDMA_QM_4_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop DMA QMAN 4\n");
+		retval = -EIO;
+	}
+
+	return retval;
+}
+
+static void goya_resume_external_queues(struct hl_device *hdev)
+{
+	WREG32(mmDMA_QM_0_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_1_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_2_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_3_GLBL_CFG1, 0);
+	WREG32(mmDMA_QM_4_GLBL_CFG1, 0);
+}
+
+/*
+ * goya_init_cpu_queues - Initialize PQ/CQ/EQ of CPU
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+int goya_init_cpu_queues(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	dma_addr_t bus_address;
+	u32 status;
+	struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
+	int err;
+
+	if (!hdev->cpu_queues_enable)
+		return 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
+		return 0;
+
+	bus_address = cpu_pq->bus_address +
+			hdev->asic_prop.host_phys_base_address;
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
+
+	bus_address = hdev->cpu_accessible_dma_address +
+			hdev->asic_prop.host_phys_base_address;
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
+
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
+
+	/* Used for EQ CI */
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, 0);
+
+	WREG32(mmCPU_IF_PF_PQ_PI, 0);
+
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_7, PQ_INIT_STATUS_READY_FOR_CP);
+
+	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+			GOYA_ASYNC_EVENT_ID_PI_UPDATE);
+
+	err = hl_poll_timeout(
+		hdev,
+		mmPSOC_GLOBAL_CONF_SCRATCHPAD_7,
+		status,
+		(status == PQ_INIT_STATUS_READY_FOR_HOST),
+		1000,
+		GOYA_CPU_TIMEOUT_USEC);
+
+	if (err) {
+		dev_err(hdev->dev,
+			"Failed to communicate with ARM CPU (ArmCP timeout)\n");
+		return -EIO;
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_CPU_Q;
+	return 0;
+}
+
 static void goya_set_pll_refclk(struct hl_device *hdev)
 {
 	WREG32(mmCPU_PLL_DIV_SEL_0, 0x0);
@@ -1028,144 +1367,644 @@ static void goya_init_golden_registers(struct hl_device *hdev)
 	goya->hw_cap_initialized |= HW_CAP_GOLDEN;
 }
 
-
-/*
- * goya_push_fw_to_device - Push FW code to device
- *
- * @hdev: pointer to hl_device structure
- *
- * Copy fw code from firmware file to device memory.
- * Returns 0 on success
- *
- */
-static int goya_push_fw_to_device(struct hl_device *hdev, const char *fw_name,
-					void __iomem *dst)
+static void goya_init_mme_qman(struct hl_device *hdev)
 {
-	const struct firmware *fw;
-	const u64 *fw_data;
-	size_t fw_size, i;
-	int rc;
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u64 qman_base_addr;
 
-	rc = request_firmware(&fw, fw_name, hdev->dev);
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
 
-	if (rc) {
-		dev_err(hdev->dev, "Failed to request %s\n", fw_name);
-		goto out;
-	}
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
 
-	fw_size = fw->size;
-	if ((fw_size % 4) != 0) {
-		dev_err(hdev->dev, "illegal %s firmware size %lu\n",
-			fw_name, fw_size);
-		rc = -EINVAL;
-		goto out;
-	}
+	qman_base_addr = hdev->asic_prop.sram_base_address +
+				MME_QMAN_BASE_OFFSET;
 
-	dev_dbg(hdev->dev, "%s firmware size == %lu\n", fw_name, fw_size);
+	WREG32(mmMME_QM_PQ_BASE_LO, lower_32_bits(qman_base_addr));
+	WREG32(mmMME_QM_PQ_BASE_HI, upper_32_bits(qman_base_addr));
+	WREG32(mmMME_QM_PQ_SIZE, ilog2(MME_QMAN_LENGTH));
+	WREG32(mmMME_QM_PQ_PI, 0);
+	WREG32(mmMME_QM_PQ_CI, 0);
+	WREG32(mmMME_QM_CP_LDMA_SRC_BASE_LO_OFFSET, 0x10C0);
+	WREG32(mmMME_QM_CP_LDMA_SRC_BASE_HI_OFFSET, 0x10C4);
+	WREG32(mmMME_QM_CP_LDMA_TSIZE_OFFSET, 0x10C8);
+	WREG32(mmMME_QM_CP_LDMA_COMMIT_OFFSET, 0x10CC);
 
-	fw_data = (const u64 *) fw->data;
+	WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
+	WREG32(mmMME_QM_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
+	WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_LO, so_base_lo);
+	WREG32(mmMME_QM_CP_MSG_BASE1_ADDR_HI, so_base_hi);
 
-	if ((fw->size % 8) != 0)
-		fw_size -= 8;
+	/* QMAN CQ has 8 cache lines */
+	WREG32(mmMME_QM_CQ_CFG1, 0x00080008);
 
-	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
-		if (!(i & (0x80000 - 1))) {
-			dev_dbg(hdev->dev,
-				"copied so far %lu out of %lu for %s firmware",
-				i, fw_size, fw_name);
-			usleep_range(20, 100);
-		}
+	WREG32(mmMME_QM_GLBL_ERR_ADDR_LO, gic_base_lo);
+	WREG32(mmMME_QM_GLBL_ERR_ADDR_HI, gic_base_hi);
 
-		writeq(*fw_data, dst);
-	}
+	WREG32(mmMME_QM_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_QM);
 
-	if ((fw->size % 8) != 0)
-		writel(*(const u32 *) fw_data, dst);
+	WREG32(mmMME_QM_GLBL_ERR_CFG, QMAN_MME_ERR_MSG_EN);
 
-out:
-	release_firmware(fw);
-	return rc;
+	WREG32(mmMME_QM_GLBL_PROT, QMAN_MME_ERR_PROT);
+
+	WREG32(mmMME_QM_GLBL_CFG0, QMAN_MME_ENABLE);
 }
 
-static int goya_pldm_init_cpu(struct hl_device *hdev)
+static void goya_init_mme_cmdq(struct hl_device *hdev)
 {
-	char fw_name[200];
-	void __iomem *dst;
-	u32 val, unit_rst_val;
-	int rc;
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u64 qman_base_addr;
 
-	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
-	goya_init_golden_registers(hdev);
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
 
-	/* Put ARM cores into reset */
-	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
-	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
 
-	/* Reset the CA53 MACRO */
-	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
-	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
-	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
-	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
-	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	qman_base_addr = hdev->asic_prop.sram_base_address +
+				MME_QMAN_BASE_OFFSET;
 
-	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
-	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
-	rc = goya_push_fw_to_device(hdev, fw_name, dst);
-	if (rc)
-		return rc;
+	WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_LO, mtr_base_lo);
+	WREG32(mmMME_CMDQ_CP_MSG_BASE0_ADDR_HI, mtr_base_hi);
+	WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_LO,	so_base_lo);
+	WREG32(mmMME_CMDQ_CP_MSG_BASE1_ADDR_HI, so_base_hi);
 
-	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
-	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
-	rc = goya_push_fw_to_device(hdev, fw_name, dst);
-	if (rc)
-		return rc;
+	/* CMDQ CQ has 20 cache lines */
+	WREG32(mmMME_CMDQ_CQ_CFG1, 0x00140014);
 
-	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
-	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
+	WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_LO, gic_base_lo);
+	WREG32(mmMME_CMDQ_GLBL_ERR_ADDR_HI, gic_base_hi);
 
-	WREG32(mmCPU_CA53_CFG_RST_ADDR_LSB_0,
-		lower_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
-	WREG32(mmCPU_CA53_CFG_RST_ADDR_MSB_0,
-		upper_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+	WREG32(mmMME_CMDQ_GLBL_ERR_WDATA, GOYA_ASYNC_EVENT_ID_MME_CMDQ);
 
-	/* Release ARM core 0 from reset */
-	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL,
-					CPU_RESET_CORE0_DEASSERT);
-	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+	WREG32(mmMME_CMDQ_GLBL_ERR_CFG, CMDQ_MME_ERR_MSG_EN);
 
-	return 0;
+	WREG32(mmMME_CMDQ_GLBL_PROT, CMDQ_MME_ERR_PROT);
+
+	WREG32(mmMME_CMDQ_GLBL_CFG0, CMDQ_MME_ENABLE);
 }
 
-/*
- * FW component passes an offset from SRAM_BASE_ADDR in SCRATCHPAD_xx.
- * The version string should be located by that offset.
- */
-static void goya_read_device_fw_version(struct hl_device *hdev,
-					enum goya_fw_component fwc)
+static void goya_init_mme_qmans(struct hl_device *hdev)
 {
-	const char *name;
-	u32 ver_off;
-	char *dest;
+	struct goya_device *goya = hdev->asic_specific;
+	u32 so_base_lo, so_base_hi;
 
-	switch (fwc) {
-	case FW_COMP_UBOOT:
-		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_29);
-		dest = hdev->asic_prop.uboot_ver;
-		name = "U-Boot";
-		break;
-	case FW_COMP_PREBOOT:
-		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_28);
-		dest = hdev->asic_prop.preboot_ver;
-		name = "Preboot";
-		break;
-	default:
-		dev_warn(hdev->dev, "Undefined FW component: %d\n", fwc);
+	if (goya->hw_cap_initialized & HW_CAP_MME)
 		return;
-	}
 
-	ver_off &= ~((u32)SRAM_BASE_ADDR);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
 
-	if (ver_off < SRAM_SIZE - VERSION_MAX_LEN) {
+	WREG32(mmMME_SM_BASE_ADDRESS_LOW, so_base_lo);
+	WREG32(mmMME_SM_BASE_ADDRESS_HIGH, so_base_hi);
+
+	goya_init_mme_qman(hdev);
+	goya_init_mme_cmdq(hdev);
+
+	goya->hw_cap_initialized |= HW_CAP_MME;
+}
+
+static void goya_init_tpc_qman(struct hl_device *hdev, u32 base_off, int tpc_id)
+{
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u64 qman_base_addr;
+	u32 reg_off = tpc_id * (mmTPC1_QM_PQ_PI - mmTPC0_QM_PQ_PI);
+
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	qman_base_addr = hdev->asic_prop.sram_base_address + base_off;
+
+	WREG32(mmTPC0_QM_PQ_BASE_LO + reg_off, lower_32_bits(qman_base_addr));
+	WREG32(mmTPC0_QM_PQ_BASE_HI + reg_off, upper_32_bits(qman_base_addr));
+	WREG32(mmTPC0_QM_PQ_SIZE + reg_off, ilog2(TPC_QMAN_LENGTH));
+	WREG32(mmTPC0_QM_PQ_PI + reg_off, 0);
+	WREG32(mmTPC0_QM_PQ_CI + reg_off, 0);
+	WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_LO_OFFSET + reg_off, 0x10C0);
+	WREG32(mmTPC0_QM_CP_LDMA_SRC_BASE_HI_OFFSET + reg_off, 0x10C4);
+	WREG32(mmTPC0_QM_CP_LDMA_TSIZE_OFFSET + reg_off, 0x10C8);
+	WREG32(mmTPC0_QM_CP_LDMA_COMMIT_OFFSET + reg_off, 0x10CC);
+
+	WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
+	WREG32(mmTPC0_QM_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
+	WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
+	WREG32(mmTPC0_QM_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
+
+	WREG32(mmTPC0_QM_CQ_CFG1 + reg_off, 0x00080008);
+
+	WREG32(mmTPC0_QM_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmTPC0_QM_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
+
+	WREG32(mmTPC0_QM_GLBL_ERR_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_TPC0_QM + tpc_id);
+
+	WREG32(mmTPC0_QM_GLBL_ERR_CFG + reg_off, QMAN_TPC_ERR_MSG_EN);
+
+	WREG32(mmTPC0_QM_GLBL_PROT + reg_off, QMAN_TPC_ERR_PROT);
+
+	WREG32(mmTPC0_QM_GLBL_CFG0 + reg_off, QMAN_TPC_ENABLE);
+}
+
+static void goya_init_tpc_cmdq(struct hl_device *hdev, int tpc_id)
+{
+	u32 mtr_base_lo, mtr_base_hi;
+	u32 so_base_lo, so_base_hi;
+	u32 gic_base_lo, gic_base_hi;
+	u32 reg_off = tpc_id * (mmTPC1_CMDQ_CQ_CFG1 - mmTPC0_CMDQ_CQ_CFG1);
+
+	mtr_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	mtr_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_MON_PAY_ADDRL_0);
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	gic_base_lo =
+		lower_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+	gic_base_hi =
+		upper_32_bits(CFG_BASE + mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR);
+
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_LO + reg_off, mtr_base_lo);
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE0_ADDR_HI + reg_off, mtr_base_hi);
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_LO + reg_off, so_base_lo);
+	WREG32(mmTPC0_CMDQ_CP_MSG_BASE1_ADDR_HI + reg_off, so_base_hi);
+
+	WREG32(mmTPC0_CMDQ_CQ_CFG1 + reg_off, 0x00140014);
+
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_LO + reg_off, gic_base_lo);
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_ADDR_HI + reg_off, gic_base_hi);
+
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_WDATA + reg_off,
+			GOYA_ASYNC_EVENT_ID_TPC0_CMDQ + tpc_id);
+
+	WREG32(mmTPC0_CMDQ_GLBL_ERR_CFG + reg_off, CMDQ_TPC_ERR_MSG_EN);
+
+	WREG32(mmTPC0_CMDQ_GLBL_PROT + reg_off, CMDQ_TPC_ERR_PROT);
+
+	WREG32(mmTPC0_CMDQ_GLBL_CFG0 + reg_off, CMDQ_TPC_ENABLE);
+}
+
+static void goya_init_tpc_qmans(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 so_base_lo, so_base_hi;
+	u32 cfg_off = mmTPC1_CFG_SM_BASE_ADDRESS_LOW -
+			mmTPC0_CFG_SM_BASE_ADDRESS_LOW;
+	int i;
+
+	if (goya->hw_cap_initialized & HW_CAP_TPC)
+		return;
+
+	so_base_lo = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	so_base_hi = upper_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++) {
+		WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_LOW + i * cfg_off,
+				so_base_lo);
+		WREG32(mmTPC0_CFG_SM_BASE_ADDRESS_HIGH + i * cfg_off,
+				so_base_hi);
+	}
+
+	goya_init_tpc_qman(hdev, TPC0_QMAN_BASE_OFFSET, 0);
+	goya_init_tpc_qman(hdev, TPC1_QMAN_BASE_OFFSET, 1);
+	goya_init_tpc_qman(hdev, TPC2_QMAN_BASE_OFFSET, 2);
+	goya_init_tpc_qman(hdev, TPC3_QMAN_BASE_OFFSET, 3);
+	goya_init_tpc_qman(hdev, TPC4_QMAN_BASE_OFFSET, 4);
+	goya_init_tpc_qman(hdev, TPC5_QMAN_BASE_OFFSET, 5);
+	goya_init_tpc_qman(hdev, TPC6_QMAN_BASE_OFFSET, 6);
+	goya_init_tpc_qman(hdev, TPC7_QMAN_BASE_OFFSET, 7);
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++)
+		goya_init_tpc_cmdq(hdev, i);
+
+	goya->hw_cap_initialized |= HW_CAP_TPC;
+}
+
+/*
+ * goya_disable_internal_queues - Disable internal queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_disable_internal_queues(struct hl_device *hdev)
+{
+	WREG32(mmMME_QM_GLBL_CFG0, 0);
+	WREG32(mmMME_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC0_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC0_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC1_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC1_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC2_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC2_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC3_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC3_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC4_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC4_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC5_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC5_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC6_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC6_CMDQ_GLBL_CFG0, 0);
+
+	WREG32(mmTPC7_QM_GLBL_CFG0, 0);
+	WREG32(mmTPC7_CMDQ_GLBL_CFG0, 0);
+}
+
+/*
+ * goya_stop_internal_queues - Stop internal queues
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Returns 0 on success
+ *
+ */
+static int goya_stop_internal_queues(struct hl_device *hdev)
+{
+	int rc, retval = 0;
+
+	/*
+	 * Each queue (QMAN) is a separate H/W logic. That means that each
+	 * QMAN can be stopped independently and failure to stop one does NOT
+	 * mandate we should not try to stop other QMANs
+	 */
+
+	rc = goya_stop_queue(hdev,
+			mmMME_QM_GLBL_CFG1,
+			mmMME_QM_CP_STS,
+			mmMME_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop MME QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmMME_CMDQ_GLBL_CFG1,
+			mmMME_CMDQ_CP_STS,
+			mmMME_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop MME CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC0_QM_GLBL_CFG1,
+			mmTPC0_QM_CP_STS,
+			mmTPC0_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 0 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC0_CMDQ_GLBL_CFG1,
+			mmTPC0_CMDQ_CP_STS,
+			mmTPC0_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 0 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC1_QM_GLBL_CFG1,
+			mmTPC1_QM_CP_STS,
+			mmTPC1_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 1 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC1_CMDQ_GLBL_CFG1,
+			mmTPC1_CMDQ_CP_STS,
+			mmTPC1_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 1 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC2_QM_GLBL_CFG1,
+			mmTPC2_QM_CP_STS,
+			mmTPC2_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 2 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC2_CMDQ_GLBL_CFG1,
+			mmTPC2_CMDQ_CP_STS,
+			mmTPC2_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 2 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC3_QM_GLBL_CFG1,
+			mmTPC3_QM_CP_STS,
+			mmTPC3_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 3 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC3_CMDQ_GLBL_CFG1,
+			mmTPC3_CMDQ_CP_STS,
+			mmTPC3_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 3 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC4_QM_GLBL_CFG1,
+			mmTPC4_QM_CP_STS,
+			mmTPC4_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 4 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC4_CMDQ_GLBL_CFG1,
+			mmTPC4_CMDQ_CP_STS,
+			mmTPC4_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 4 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC5_QM_GLBL_CFG1,
+			mmTPC5_QM_CP_STS,
+			mmTPC5_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 5 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC5_CMDQ_GLBL_CFG1,
+			mmTPC5_CMDQ_CP_STS,
+			mmTPC5_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 5 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC6_QM_GLBL_CFG1,
+			mmTPC6_QM_CP_STS,
+			mmTPC6_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 6 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC6_CMDQ_GLBL_CFG1,
+			mmTPC6_CMDQ_CP_STS,
+			mmTPC6_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 6 CMDQ\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC7_QM_GLBL_CFG1,
+			mmTPC7_QM_CP_STS,
+			mmTPC7_QM_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 7 QMAN\n");
+		retval = -EIO;
+	}
+
+	rc = goya_stop_queue(hdev,
+			mmTPC7_CMDQ_GLBL_CFG1,
+			mmTPC7_CMDQ_CP_STS,
+			mmTPC7_CMDQ_GLBL_STS0);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop TPC 7 CMDQ\n");
+		retval = -EIO;
+	}
+
+	return retval;
+}
+
+static void goya_resume_internal_queues(struct hl_device *hdev)
+{
+	WREG32(mmMME_QM_GLBL_CFG1, 0);
+	WREG32(mmMME_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC0_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC0_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC1_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC1_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC2_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC2_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC3_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC3_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC4_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC4_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC5_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC5_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC6_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC6_CMDQ_GLBL_CFG1, 0);
+
+	WREG32(mmTPC7_QM_GLBL_CFG1, 0);
+	WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
+}
+
+
+/*
+ * goya_push_fw_to_device - Push FW code to device
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Copy fw code from firmware file to device memory.
+ * Returns 0 on success
+ *
+ */
+static int goya_push_fw_to_device(struct hl_device *hdev, const char *fw_name,
+					void __iomem *dst)
+{
+	const struct firmware *fw;
+	const u64 *fw_data;
+	size_t fw_size, i;
+	int rc;
+
+	rc = request_firmware(&fw, fw_name, hdev->dev);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request %s\n", fw_name);
+		goto out;
+	}
+
+	fw_size = fw->size;
+	if ((fw_size % 4) != 0) {
+		dev_err(hdev->dev, "illegal %s firmware size %lu\n",
+			fw_name, fw_size);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	dev_dbg(hdev->dev, "%s firmware size == %lu\n", fw_name, fw_size);
+
+	fw_data = (const u64 *) fw->data;
+
+	if ((fw->size % 8) != 0)
+		fw_size -= 8;
+
+	for (i = 0 ; i < fw_size ; i += 8, fw_data++, dst += 8) {
+		if (!(i & (0x80000 - 1))) {
+			dev_dbg(hdev->dev,
+				"copied so far %lu out of %lu for %s firmware",
+				i, fw_size, fw_name);
+			usleep_range(20, 100);
+		}
+
+		writeq(*fw_data, dst);
+	}
+
+	if ((fw->size % 8) != 0)
+		writel(*(const u32 *) fw_data, dst);
+
+out:
+	release_firmware(fw);
+	return rc;
+}
+
+static int goya_pldm_init_cpu(struct hl_device *hdev)
+{
+	char fw_name[200];
+	void __iomem *dst;
+	u32 val, unit_rst_val;
+	int rc;
+
+	/* Must initialize SRAM scrambler before pushing u-boot to SRAM */
+	goya_init_golden_registers(hdev);
+
+	/* Put ARM cores into reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL, CPU_RESET_ASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	/* Reset the CA53 MACRO */
+	unit_rst_val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, CA53_RESET);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+	WREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N, unit_rst_val);
+	val = RREG32(mmPSOC_GLOBAL_CONF_UNIT_RST_N);
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-u-boot.bin");
+	dst = hdev->pcie_bar[SRAM_CFG_BAR_ID] + UBOOT_FW_OFFSET;
+	rc = goya_push_fw_to_device(hdev, fw_name, dst);
+	if (rc)
+		return rc;
+
+	snprintf(fw_name, sizeof(fw_name), "habanalabs/goya/goya-fit.itb");
+	dst = hdev->pcie_bar[DDR_BAR_ID] + LINUX_FW_OFFSET;
+	rc = goya_push_fw_to_device(hdev, fw_name, dst);
+	if (rc)
+		return rc;
+
+	WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_FIT_RDY);
+	WREG32(mmPSOC_GLOBAL_CONF_WARM_REBOOT, CPU_BOOT_STATUS_NA);
+
+	WREG32(mmCPU_CA53_CFG_RST_ADDR_LSB_0,
+		lower_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+	WREG32(mmCPU_CA53_CFG_RST_ADDR_MSB_0,
+		upper_32_bits(SRAM_BASE_ADDR + UBOOT_FW_OFFSET));
+
+	/* Release ARM core 0 from reset */
+	WREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL,
+					CPU_RESET_CORE0_DEASSERT);
+	val = RREG32(mmCPU_CA53_CFG_ARM_RST_CONTROL);
+
+	return 0;
+}
+
+/*
+ * FW component passes an offset from SRAM_BASE_ADDR in SCRATCHPAD_xx.
+ * The version string should be located by that offset.
+ */
+static void goya_read_device_fw_version(struct hl_device *hdev,
+					enum goya_fw_component fwc)
+{
+	const char *name;
+	u32 ver_off;
+	char *dest;
+
+	switch (fwc) {
+	case FW_COMP_UBOOT:
+		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_29);
+		dest = hdev->asic_prop.uboot_ver;
+		name = "U-Boot";
+		break;
+	case FW_COMP_PREBOOT:
+		ver_off = RREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_28);
+		dest = hdev->asic_prop.preboot_ver;
+		name = "Preboot";
+		break;
+	default:
+		dev_warn(hdev->dev, "Undefined FW component: %d\n", fwc);
+		return;
+	}
+
+	ver_off &= ~((u32)SRAM_BASE_ADDR);
+
+	if (ver_off < SRAM_SIZE - VERSION_MAX_LEN) {
 		memcpy_fromio(dest, hdev->pcie_bar[SRAM_CFG_BAR_ID] + ver_off,
 							VERSION_MAX_LEN);
 	} else {
@@ -1349,6 +2188,19 @@ static int goya_hw_init(struct hl_device *hdev)
 
 	goya_init_security(hdev);
 
+	goya_init_dma_qmans(hdev);
+
+	goya_init_mme_qmans(hdev);
+
+	goya_init_tpc_qmans(hdev);
+
+	rc = goya_init_cpu_queues(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
+			rc);
+		goto disable_queues;
+	}
+
 	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
 	rc = pci_set_dma_mask(hdev->pdev, DMA_BIT_MASK(48));
 	if (rc) {
@@ -1357,7 +2209,7 @@ static int goya_hw_init(struct hl_device *hdev)
 		if (rc) {
 			dev_err(hdev->dev,
 				"Unable to set pci dma mask to 32 bits\n");
-			return rc;
+			goto disable_pci_access;
 		}
 	}
 
@@ -1369,7 +2221,7 @@ static int goya_hw_init(struct hl_device *hdev)
 		if (rc) {
 			dev_err(hdev->dev,
 				"Unable to set pci consistent dma mask to 32 bits\n");
-			return rc;
+			goto disable_pci_access;
 		}
 	}
 
@@ -1377,6 +2229,14 @@ static int goya_hw_init(struct hl_device *hdev)
 	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
 
 	return 0;
+
+disable_pci_access:
+	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+disable_queues:
+	goya_disable_internal_queues(hdev);
+	goya_disable_external_queues(hdev);
+
+	return rc;
 }
 
 /*
@@ -1465,12 +2325,40 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
 
 int goya_suspend(struct hl_device *hdev)
 {
-	return 0;
+	int rc;
+
+	rc = goya_stop_internal_queues(hdev);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop internal queues\n");
+		return rc;
+	}
+
+	rc = goya_stop_external_queues(hdev);
+
+	if (rc) {
+		dev_err(hdev->dev, "failed to stop external queues\n");
+		return rc;
+	}
+
+	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+	if (rc)
+		dev_err(hdev->dev, "Failed to disable PCI access from CPU\n");
+
+	return rc;
 }
 
 int goya_resume(struct hl_device *hdev)
 {
-	return 0;
+	int rc;
+
+	goya_resume_external_queues(hdev);
+	goya_resume_internal_queues(hdev);
+
+	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
+	if (rc)
+		dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
+	return rc;
 }
 
 int goya_mmap(struct hl_fpriv *hpriv, struct vm_area_struct *vma)
@@ -1494,6 +2382,101 @@ int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 	return rc;
 }
 
+void goya_ring_doorbell(struct hl_device *hdev, u32 hw_queue_id, u32 pi)
+{
+	u32 db_reg_offset, db_value;
+	bool invalid_queue = false;
+
+	switch (hw_queue_id) {
+	case GOYA_QUEUE_ID_DMA_0:
+		db_reg_offset = mmDMA_QM_0_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_1:
+		db_reg_offset = mmDMA_QM_1_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_2:
+		db_reg_offset = mmDMA_QM_2_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_3:
+		db_reg_offset = mmDMA_QM_3_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_DMA_4:
+		db_reg_offset = mmDMA_QM_4_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_CPU_PQ:
+		if (hdev->cpu_queues_enable)
+			db_reg_offset = mmCPU_IF_PF_PQ_PI;
+		else
+			invalid_queue = true;
+		break;
+
+	case GOYA_QUEUE_ID_MME:
+		db_reg_offset = mmMME_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC0:
+		db_reg_offset = mmTPC0_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC1:
+		db_reg_offset = mmTPC1_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC2:
+		db_reg_offset = mmTPC2_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC3:
+		db_reg_offset = mmTPC3_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC4:
+		db_reg_offset = mmTPC4_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC5:
+		db_reg_offset = mmTPC5_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC6:
+		db_reg_offset = mmTPC6_QM_PQ_PI;
+		break;
+
+	case GOYA_QUEUE_ID_TPC7:
+		db_reg_offset = mmTPC7_QM_PQ_PI;
+		break;
+
+	default:
+		invalid_queue = true;
+	}
+
+	if (invalid_queue) {
+		/* Should never get here */
+		dev_err(hdev->dev, "h/w queue %d is invalid. Can't set pi\n",
+			hw_queue_id);
+		return;
+	}
+
+	db_value = pi;
+
+	/* ring the doorbell */
+	WREG32(db_reg_offset, db_value);
+
+	if (hw_queue_id == GOYA_QUEUE_ID_CPU_PQ)
+		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+				GOYA_ASYNC_EVENT_ID_PI_UPDATE);
+}
+
+void goya_flush_pq_write(struct hl_device *hdev, u64 *pq, u64 exp_val)
+{
+	/* Not needed in Goya */
+}
+
 void *goya_dma_alloc_coherent(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flags)
 {
@@ -1506,6 +2489,313 @@ void goya_dma_free_coherent(struct hl_device *hdev, size_t size, void *cpu_addr,
 	dma_free_coherent(&hdev->pdev->dev, size, cpu_addr, dma_handle);
 }
 
+void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
+				dma_addr_t *dma_handle,	u16 *queue_len)
+{
+	void *base;
+	u32 offset;
+
+	*dma_handle = hdev->asic_prop.sram_base_address;
+
+	base = hdev->pcie_bar[SRAM_CFG_BAR_ID];
+
+	switch (queue_id) {
+	case GOYA_QUEUE_ID_MME:
+		offset = MME_QMAN_BASE_OFFSET;
+		*queue_len = MME_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC0:
+		offset = TPC0_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC1:
+		offset = TPC1_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC2:
+		offset = TPC2_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC3:
+		offset = TPC3_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC4:
+		offset = TPC4_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC5:
+		offset = TPC5_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC6:
+		offset = TPC6_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	case GOYA_QUEUE_ID_TPC7:
+		offset = TPC7_QMAN_BASE_OFFSET;
+		*queue_len = TPC_QMAN_LENGTH;
+		break;
+	default:
+		dev_err(hdev->dev, "Got invalid queue id %d\n", queue_id);
+		return NULL;
+	}
+
+	base += offset;
+	*dma_handle += offset;
+
+	return base;
+}
+
+int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
+				u32 timeout, long *result)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct armcp_packet *pkt;
+	dma_addr_t pkt_dma_addr;
+	u32 tmp;
+	int rc = 0;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q)) {
+		if (result)
+			*result = 0;
+		return 0;
+	}
+
+	if (len > CPU_CB_SIZE) {
+		dev_err(hdev->dev, "Invalid CPU message size of %d bytes\n",
+			len);
+		return -ENOMEM;
+	}
+
+	pkt = hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev, len,
+								&pkt_dma_addr);
+	if (!pkt) {
+		dev_err(hdev->dev,
+			"Failed to allocate DMA memory for packet to CPU\n");
+		return -ENOMEM;
+	}
+
+	memcpy(pkt, msg, len);
+
+	mutex_lock(&hdev->send_cpu_message_lock);
+
+	if (hdev->disabled)
+		goto out;
+
+	rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_CPU_PQ, len,
+			pkt_dma_addr);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to send CB on CPU PQ (%d)\n", rc);
+		goto out;
+	}
+
+	rc = hl_poll_timeout_memory(hdev, (u64) &pkt->fence, timeout, &tmp);
+
+	hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_CPU_PQ);
+
+	if (rc == -ETIMEDOUT) {
+		dev_err(hdev->dev,
+			"Timeout while waiting for CPU packet fence\n");
+		goto out;
+	}
+
+	if (tmp == ARMCP_PACKET_FENCE_VAL) {
+		rc = (pkt->ctl & ARMCP_PKT_CTL_RC_MASK) >>
+						ARMCP_PKT_CTL_RC_SHIFT;
+		if (rc) {
+			dev_err(hdev->dev,
+				"F/W ERROR %d for CPU packet %d\n",
+				rc, (pkt->ctl & ARMCP_PKT_CTL_OPCODE_MASK)
+						>> ARMCP_PKT_CTL_OPCODE_SHIFT);
+			rc = -EINVAL;
+		} else if (result) {
+			*result = pkt->result;
+		}
+	} else {
+		dev_err(hdev->dev, "CPU packet wrong fence value\n");
+		rc = -EINVAL;
+	}
+
+out:
+	mutex_unlock(&hdev->send_cpu_message_lock);
+
+	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, len, pkt);
+
+	return rc;
+}
+
+int goya_test_queue(struct hl_device *hdev, u32 hw_queue_id)
+{
+	struct packet_msg_prot *fence_pkt;
+	dma_addr_t pkt_dma_addr;
+	u32 fence_val, tmp;
+	dma_addr_t fence_dma_addr;
+	u32 *fence_ptr;
+	int rc;
+
+	fence_val = GOYA_QMAN0_FENCE_VAL;
+
+	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
+							&fence_dma_addr);
+	if (!fence_ptr) {
+		dev_err(hdev->dev,
+			"Failed to allocate memory for queue testing\n");
+		return -ENOMEM;
+	}
+
+	*fence_ptr = 0;
+
+	fence_pkt = hdev->asic_funcs->dma_pool_zalloc(hdev,
+					sizeof(struct packet_msg_prot),
+					GFP_KERNEL, &pkt_dma_addr);
+	if (!fence_pkt) {
+		dev_err(hdev->dev,
+			"Failed to allocate packet for queue testing\n");
+		rc = -ENOMEM;
+		goto free_fence_ptr;
+	}
+
+	fence_pkt->ctl = (PACKET_MSG_PROT << GOYA_PKT_CTL_OPCODE_SHIFT) |
+			(1 << GOYA_PKT_CTL_EB_SHIFT) |
+			(1 << GOYA_PKT_CTL_MB_SHIFT);
+	fence_pkt->value = fence_val;
+	fence_pkt->addr = fence_dma_addr +
+				hdev->asic_prop.host_phys_base_address;
+
+	rc = hl_hw_queue_send_cb_no_cmpl(hdev, hw_queue_id,
+					sizeof(struct packet_msg_prot),
+					pkt_dma_addr);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to send fence packet\n");
+		goto free_pkt;
+	}
+
+	rc = hl_poll_timeout_memory(hdev, (u64) fence_ptr,
+					GOYA_TEST_QUEUE_WAIT_USEC, &tmp);
+
+	hl_hw_queue_inc_ci_kernel(hdev, hw_queue_id);
+
+	if ((!rc) && (tmp == fence_val)) {
+		dev_info(hdev->dev,
+			"queue test on H/W queue %d succeeded\n",
+			hw_queue_id);
+	} else {
+		dev_err(hdev->dev,
+			"H/W queue %d test failed (scratch(0x%08llX) == 0x%08X)\n",
+			hw_queue_id, fence_dma_addr, tmp);
+		rc = -EINVAL;
+	}
+
+free_pkt:
+	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_pkt,
+					pkt_dma_addr);
+free_fence_ptr:
+	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_ptr,
+					fence_dma_addr);
+	return rc;
+}
+
+int goya_test_cpu_queue(struct hl_device *hdev)
+{
+	struct armcp_packet test_pkt;
+	long result;
+	int rc;
+
+	/* cpu_queues_enable flag is always checked in send cpu message */
+
+	memset(&test_pkt, 0, sizeof(test_pkt));
+
+	test_pkt.ctl = ARMCP_PACKET_TEST << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	test_pkt.value = ARMCP_PACKET_FENCE_VAL;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &test_pkt,
+			sizeof(test_pkt), HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if (!rc)
+		dev_info(hdev->dev, "queue test on CPU queue succeeded\n");
+	else
+		dev_err(hdev->dev, "CPU queue test failed (0x%08lX)\n", result);
+
+	return rc;
+}
+
+static int goya_test_queues(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i, rc, ret_val = 0;
+
+	for (i = 0 ; i < NUMBER_OF_EXT_HW_QUEUES ; i++) {
+		rc = goya_test_queue(hdev, i);
+		if (rc)
+			ret_val = -EINVAL;
+	}
+
+	if (hdev->cpu_queues_enable) {
+		rc = goya->test_cpu_queue(hdev);
+		if (rc)
+			ret_val = -EINVAL;
+	}
+
+	return ret_val;
+}
+
+void *goya_dma_pool_zalloc(struct hl_device *hdev, size_t size, gfp_t mem_flags,
+				dma_addr_t *dma_handle)
+{
+	if (size > GOYA_DMA_POOL_BLK_SIZE)
+		return NULL;
+
+	return dma_pool_zalloc(hdev->dma_pool, mem_flags, dma_handle);
+}
+
+void goya_dma_pool_free(struct hl_device *hdev, void *vaddr,
+			dma_addr_t dma_addr)
+{
+	dma_pool_free(hdev->dma_pool, vaddr, dma_addr);
+}
+
+void *goya_cpu_accessible_dma_pool_alloc(struct hl_device *hdev, size_t size,
+			dma_addr_t *dma_handle)
+{
+	u64 kernel_addr;
+
+	/* roundup to CPU_PKT_SIZE */
+	size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
+
+	kernel_addr = gen_pool_alloc(hdev->cpu_accessible_dma_pool, size);
+
+	*dma_handle = hdev->cpu_accessible_dma_address +
+			(kernel_addr - (u64) hdev->cpu_accessible_dma_mem);
+
+	return (void *) kernel_addr;
+}
+
+void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
+			void *vaddr)
+{
+	/* roundup to CPU_PKT_SIZE */
+	size = (size + (CPU_PKT_SIZE - 1)) & CPU_PKT_MASK;
+
+	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
+}
+
+
+static void goya_hw_queues_lock(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	spin_lock(&goya->hw_queues_lock);
+}
+
+static void goya_hw_queues_unlock(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	spin_unlock(&goya->hw_queues_lock);
+}
+
 static const struct hl_asic_funcs goya_funcs = {
 	.early_init = goya_early_init,
 	.early_fini = goya_early_fini,
@@ -1517,8 +2807,19 @@ static const struct hl_asic_funcs goya_funcs = {
 	.resume = goya_resume,
 	.mmap = goya_mmap,
 	.cb_mmap = goya_cb_mmap,
+	.ring_doorbell = goya_ring_doorbell,
+	.flush_pq_write = goya_flush_pq_write,
 	.dma_alloc_coherent = goya_dma_alloc_coherent,
 	.dma_free_coherent = goya_dma_free_coherent,
+	.get_int_queue_base = goya_get_int_queue_base,
+	.test_queues = goya_test_queues,
+	.dma_pool_zalloc = goya_dma_pool_zalloc,
+	.dma_pool_free = goya_dma_pool_free,
+	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
+	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
+	.hw_queues_lock = goya_hw_queues_lock,
+	.hw_queues_unlock = goya_hw_queues_unlock,
+	.send_cpu_message = goya_send_cpu_message
 };
 
 /*
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index c3e17de41ab3..7badde68aef0 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -11,7 +11,9 @@
 #include <uapi/misc/habanalabs.h>
 #include "habanalabs.h"
 #include "include/hl_boot_if.h"
+#include "include/goya/goya_packets.h"
 #include "include/goya/goya.h"
+#include "include/goya/goya_async_events.h"
 #include "include/goya/goya_fw_if.h"
 
 #define NUMBER_OF_CMPLT_QUEUES		5
@@ -148,12 +150,17 @@ enum goya_fw_component {
 };
 
 struct goya_device {
+	int (*test_cpu_queue)(struct hl_device *hdev);
+
 	/* TODO: remove hw_queues_lock after moving to scheduler code */
 	spinlock_t	hw_queues_lock;
 	u64		ddr_bar_cur_addr;
 	u32		hw_cap_initialized;
 };
 
+int goya_test_cpu_queue(struct hl_device *hdev);
+int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
+				u32 timeout, long *result);
 void goya_init_security(struct hl_device *hdev);
 
 #endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 8b0e8796628b..12e4ee6eb45e 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -9,6 +9,7 @@
 #define HABANALABSP_H_
 
 #include "include/armcp_if.h"
+#include "include/qman_if.h"
 
 #define pr_fmt(fmt)			"habanalabs: " fmt
 
@@ -32,9 +33,36 @@
 struct hl_device;
 struct hl_fpriv;
 
+/**
+ * enum hl_queue_type - Supported QUEUE types.
+ * @QUEUE_TYPE_NA: queue is not available.
+ * @QUEUE_TYPE_EXT: external queue which is a DMA channel that may access the
+ *                  host.
+ * @QUEUE_TYPE_INT: internal queue that performs DMA inside the device's
+ *			memories and/or operates the compute engines.
+ * @QUEUE_TYPE_CPU: S/W queue for communication with the device's CPU.
+ */
+enum hl_queue_type {
+	QUEUE_TYPE_NA,
+	QUEUE_TYPE_EXT,
+	QUEUE_TYPE_INT,
+	QUEUE_TYPE_CPU
+};
+
+/**
+ * struct hw_queue_properties - queue information.
+ * @type: queue type.
+ * @kmd_only: true if only KMD is allowed to send a job to this queue, false
+ *            otherwise.
+ */
+struct hw_queue_properties {
+	enum hl_queue_type	type;
+	u8			kmd_only;
+};
 
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
+ * @hw_queues_props: H/W queues properties.
  * @uboot_ver: F/W U-boot version.
  * @preboot_ver: F/W Preboot version.
  * @sram_base_address: SRAM physical start address.
@@ -65,6 +93,7 @@ struct hl_fpriv;
  * @tpc_enabled_mask: which TPCs are enabled.
  */
 struct asic_fixed_properties {
+	struct hw_queue_properties	hw_queues_props[HL_MAX_QUEUES];
 	char			uboot_ver[VERSION_MAX_LEN];
 	char			preboot_ver[VERSION_MAX_LEN];
 	u64			sram_base_address;
@@ -138,7 +167,89 @@ struct hl_cb {
 };
 
 
+/*
+ * QUEUES
+ */
+
+struct hl_cs_job;
+
+/*
+ * Currently, there are two limitations on the maximum length of a queue:
+ *
+ * 1. The memory footprint of the queue. The current allocated space for the
+ *    queue is PAGE_SIZE. Because each entry in the queue is HL_BD_SIZE,
+ *    the maximum length of the queue can be PAGE_SIZE / HL_BD_SIZE,
+ *    which currently is 4096/16 = 256 entries.
+ *
+ *    To increase that, we need either to decrease the size of the
+ *    BD (difficult), or allocate more than a single page (easier).
+ *
+ * 2. Because the size of the JOB handle field in the BD CTL / completion queue
+ *    is 10-bit, we can have up to 1024 open jobs per hardware queue.
+ *    Therefore, each queue can hold up to 1024 entries.
+ *
+ * HL_QUEUE_LENGTH is in units of struct hl_bd.
+ * HL_QUEUE_LENGTH * sizeof(struct hl_bd) should be <= HL_PAGE_SIZE
+ */
+
+#define HL_PAGE_SIZE			4096 /* minimum page size */
+/* Must be power of 2 (HL_PAGE_SIZE / HL_BD_SIZE) */
 #define HL_QUEUE_LENGTH			256
+#define HL_QUEUE_SIZE_IN_BYTES		(HL_QUEUE_LENGTH * HL_BD_SIZE)
+
+/*
+ * HL_CQ_LENGTH is in units of struct hl_cq_entry.
+ * HL_CQ_LENGTH should be <= HL_PAGE_SIZE
+ */
+#define HL_CQ_LENGTH			HL_QUEUE_LENGTH
+#define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
+
+
+
+/**
+ * struct hl_hw_queue - describes a H/W transport queue.
+ * @shadow_queue: pointer to a shadow queue that holds pointers to jobs.
+ * @queue_type: type of queue.
+ * @kernel_address: holds the queue's kernel virtual address.
+ * @bus_address: holds the queue's DMA address.
+ * @pi: holds the queue's pi value.
+ * @ci: holds the queue's ci value, AS CALCULATED BY THE DRIVER (not real ci).
+ * @hw_queue_id: the id of the H/W queue.
+ * @int_queue_len: length of internal queue (number of entries).
+ * @valid: is the queue valid (we have array of 32 queues, not all of them
+ *		exists).
+ */
+struct hl_hw_queue {
+	struct hl_cs_job	**shadow_queue;
+	enum hl_queue_type	queue_type;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			pi;
+	u32			ci;
+	u32			hw_queue_id;
+	u16			int_queue_len;
+	u8			valid;
+};
+
+/**
+ * struct hl_cq - describes a completion queue
+ * @hdev: pointer to the device structure
+ * @kernel_address: holds the queue's kernel virtual address
+ * @bus_address: holds the queue's DMA address
+ * @hw_queue_id: the id of the matching H/W queue
+ * @ci: ci inside the queue
+ * @pi: pi inside the queue
+ * @free_slots_cnt: counter of free slots in queue
+ */
+struct hl_cq {
+	struct hl_device	*hdev;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			hw_queue_id;
+	u32			ci;
+	u32			pi;
+	atomic_t		free_slots_cnt;
+};
 
 
 /*
@@ -170,6 +281,8 @@ enum hl_asic_type {
  * @resume: handles IP specific H/W or SW changes for resume.
  * @mmap: mmap function, does nothing.
  * @cb_mmap: maps a CB.
+ * @ring_doorbell: increment PI on a given QMAN.
+ * @flush_pq_write: flush PQ entry write if necessary, WARN if flushing failed.
  * @dma_alloc_coherent: Allocate coherent DMA memory by calling
  *                      dma_alloc_coherent(). This is ASIC function because its
  *                      implementation is not trivial when the driver is loaded
@@ -178,6 +291,16 @@ enum hl_asic_type {
  *                     This is ASIC function because its implementation is not
  *                     trivial when the driver is loaded in simulation mode
  *                     (not upstreamed).
+ * @get_int_queue_base: get the internal queue base address.
+ * @test_queues: run simple test on all queues for sanity check.
+ * @dma_pool_zalloc: small DMA allocation of coherent memory from DMA pool.
+ *                   size of allocation is HL_DMA_POOL_BLK_SIZE.
+ * @dma_pool_free: free small DMA allocation from pool.
+ * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
+ * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
+ * @hw_queues_lock: acquire H/W queues lock.
+ * @hw_queues_unlock: release H/W queues lock.
+ * @send_cpu_message: send buffer to ArmCP.
  */
 struct hl_asic_funcs {
 	int (*early_init)(struct hl_device *hdev);
@@ -191,10 +314,27 @@ struct hl_asic_funcs {
 	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
 	int (*cb_mmap)(struct hl_device *hdev, struct vm_area_struct *vma,
 			u64 kaddress, phys_addr_t paddress, u32 size);
+	void (*ring_doorbell)(struct hl_device *hdev, u32 hw_queue_id, u32 pi);
+	void (*flush_pq_write)(struct hl_device *hdev, u64 *pq, u64 exp_val);
 	void* (*dma_alloc_coherent)(struct hl_device *hdev, size_t size,
 					dma_addr_t *dma_handle, gfp_t flag);
 	void (*dma_free_coherent)(struct hl_device *hdev, size_t size,
 					void *cpu_addr, dma_addr_t dma_handle);
+	void* (*get_int_queue_base)(struct hl_device *hdev, u32 queue_id,
+				dma_addr_t *dma_handle, u16 *queue_len);
+	int (*test_queues)(struct hl_device *hdev);
+	void* (*dma_pool_zalloc)(struct hl_device *hdev, size_t size,
+				gfp_t mem_flags, dma_addr_t *dma_handle);
+	void (*dma_pool_free)(struct hl_device *hdev, void *vaddr,
+				dma_addr_t dma_addr);
+	void* (*cpu_accessible_dma_pool_alloc)(struct hl_device *hdev,
+				size_t size, dma_addr_t *dma_handle);
+	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
+				size_t size, void *vaddr);
+	void (*hw_queues_lock)(struct hl_device *hdev);
+	void (*hw_queues_unlock)(struct hl_device *hdev);
+	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
+				u16 len, u32 timeout, long *result);
 };
 
 
@@ -230,6 +370,17 @@ struct hl_ctx_mgr {
 };
 
 
+
+
+/**
+ * struct hl_cs_job - command submission job.
+ * @finish_work: workqueue object to run when job is completed.
+ * @id: the id of this job inside a CS.
+ */
+struct hl_cs_job {
+	struct work_struct	finish_work;
+	u32			id;
+};
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -304,7 +455,11 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @dev: realted kernel basic device structure.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
+ * @completion_queue: array of hl_cq.
+ * @cq_wq: work queue of completion queues for executing work in process context
+ * @eq_wq: work queue of event queue for executing work in process context.
  * @kernel_ctx: KMD context structure.
+ * @kernel_queues: array of hl_hw_queue.
  * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
@@ -318,6 +473,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  *                    only a single process at a time. In addition, we need a
  *                    lock here so we can flush user processes which are opening
  *                    the device while we are trying to hard reset it
+ * @send_cpu_message_lock: enforces only one message in KMD <-> ArmCP queue.
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
@@ -337,7 +493,10 @@ struct hl_device {
 	struct device			*dev;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
+	struct hl_cq			*completion_queue;
+	struct workqueue_struct		*cq_wq;
 	struct hl_ctx			*kernel_ctx;
+	struct hl_hw_queue		*kernel_queues;
 	struct hl_cb_mgr		kernel_cb_mgr;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
@@ -347,6 +506,7 @@ struct hl_device {
 	struct mutex			asid_mutex;
 	/* TODO: remove fd_open_cnt_lock for multiple process support */
 	struct mutex			fd_open_cnt_lock;
+	struct mutex			send_cpu_message_lock;
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
@@ -364,6 +524,7 @@ struct hl_device {
 	/* Parameters for bring-up */
 	u8				cpu_enable;
 	u8				reset_pcilink;
+	u8				cpu_queues_enable;
 	u8				fw_loading;
 	u8				pldm;
 };
@@ -406,7 +567,18 @@ int hl_poll_timeout_memory(struct hl_device *hdev, u64 addr, u32 timeout_us,
 				u32 *val);
 int hl_poll_timeout_device_memory(struct hl_device *hdev, void __iomem *addr,
 				u32 timeout_us, u32 *val);
-
+int hl_hw_queues_create(struct hl_device *hdev);
+void hl_hw_queues_destroy(struct hl_device *hdev);
+int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
+				u32 cb_size, u64 cb_ptr);
+u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
+void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
+
+#define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
+#define hl_pi_2_offset(pi)		((pi) & (HL_QUEUE_LENGTH - 1))
+
+int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
+void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
 int hl_asid_init(struct hl_device *hdev);
 void hl_asid_fini(struct hl_device *hdev);
 unsigned long hl_asid_alloc(struct hl_device *hdev);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 2611883eab11..14775a7022f0 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -173,6 +173,7 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	/* Parameters for bring-up - set them to defaults */
 	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
+	hdev->cpu_queues_enable = 1;
 	hdev->fw_loading = 1;
 	hdev->pldm = 0;
 
@@ -180,6 +181,10 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	if (!hdev->cpu_enable)
 		hdev->fw_loading = 0;
 
+	/* If we don't load FW, no need to initialize CPU queues */
+	if (!hdev->fw_loading)
+		hdev->cpu_queues_enable = 0;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
new file mode 100644
index 000000000000..6841e23f1b30
--- /dev/null
+++ b/drivers/misc/habanalabs/hw_queue.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/dma-mapping.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/delay.h>
+
+/*
+ * hl_queue_add_ptr - add to pi or ci and checks if it wraps around
+ *
+ * @ptr: the current pi/ci value
+ * @val: the amount to add
+ *
+ * Add val to ptr. It can go until twice the queue length.
+ */
+inline u32 hl_hw_queue_add_ptr(u32 ptr, u16 val)
+{
+	ptr += val;
+	ptr &= ((HL_QUEUE_LENGTH << 1) - 1);
+	return ptr;
+}
+
+static inline int queue_free_slots(struct hl_hw_queue *q, u32 queue_len)
+{
+	int delta = (q->pi - q->ci);
+
+	if (delta >= 0)
+		return (queue_len - delta);
+	else
+		return (abs(delta) - queue_len);
+}
+
+/*
+ * ext_queue_submit_bd - Submit a buffer descriptor to an external queue
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @q: pointer to habanalabs queue structure
+ * @ctl: BD's control word
+ * @len: BD's length
+ * @ptr: BD's pointer
+ *
+ * This function assumes there is enough space on the queue to submit a new
+ * BD to it. It initializes the next BD and calls the device specific
+ * function to set the pi (and doorbell)
+ *
+ * This function must be called when the scheduler mutex is taken
+ *
+ */
+static void ext_queue_submit_bd(struct hl_device *hdev, struct hl_hw_queue *q,
+				u32 ctl, u32 len, u64 ptr)
+{
+	struct hl_bd *bd;
+
+	bd = (struct hl_bd *) q->kernel_address;
+	bd += hl_pi_2_offset(q->pi);
+	bd->ctl = ctl;
+	bd->len = len;
+	bd->ptr = ptr + hdev->asic_prop.host_phys_base_address;
+
+	q->pi = hl_queue_inc_ptr(q->pi);
+	hdev->asic_funcs->ring_doorbell(hdev, q->hw_queue_id, q->pi);
+}
+
+/*
+ * ext_queue_sanity_checks - perform some sanity checks on external queue
+ *
+ * @hdev              : pointer to hl_device structure
+ * @q                 :	pointer to hl_hw_queue structure
+ * @num_of_entries    : how many entries to check for space
+ * @reserve_cq_entry  :	whether to reserve an entry in the cq
+ *
+ * H/W queues spinlock should be taken before calling this function
+ *
+ * Perform the following:
+ * - Make sure we have enough space in the h/w queue
+ * - Make sure we have enough space in the completion queue
+ * - Reserve space in the completion queue (needs to be reversed if there
+ *   is a failure down the road before the actual submission of work). Only
+ *   do this action if reserve_cq_entry is true
+ *
+ */
+static int ext_queue_sanity_checks(struct hl_device *hdev,
+				struct hl_hw_queue *q, int num_of_entries,
+				bool reserve_cq_entry)
+{
+	atomic_t *free_slots =
+			&hdev->completion_queue[q->hw_queue_id].free_slots_cnt;
+	int free_slots_cnt;
+
+	/* Check we have enough space in the queue */
+	free_slots_cnt = queue_free_slots(q, HL_QUEUE_LENGTH);
+
+	if (free_slots_cnt < num_of_entries) {
+		dev_dbg(hdev->dev, "Queue %d doesn't have room for %d CBs\n",
+			q->hw_queue_id, num_of_entries);
+		return -EAGAIN;
+	}
+
+	if (reserve_cq_entry) {
+		/*
+		 * Check we have enough space in the completion queue
+		 * Add -1 to counter (decrement) unless counter was already 0
+		 * In that case, CQ is full so we can't submit a new CB because
+		 * we won't get ack on its completion
+		 * atomic_add_unless will return 0 if counter was already 0
+		 */
+		if (atomic_add_negative(num_of_entries * -1, free_slots)) {
+			dev_dbg(hdev->dev, "No space for %d on CQ %d\n",
+				num_of_entries, q->hw_queue_id);
+			atomic_add(num_of_entries, free_slots);
+			return -EAGAIN;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * hl_hw_queue_send_cb_no_cmpl - send a single CB (not a JOB) without completion
+ *
+ * @hdev: pointer to hl_device structure
+ * @hw_queue_id: Queue's type
+ * @cb_size: size of CB
+ * @cb_ptr: pointer to CB location
+ *
+ * This function sends a single CB, that must NOT generate a completion entry
+ *
+ */
+int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
+				u32 cb_size, u64 cb_ptr)
+{
+	struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
+	int rc;
+
+	/*
+	 * The CPU queue is a synchronous queue with an effective depth of
+	 * a single entry (although it is allocated with room for multiple
+	 * entries). Therefore, there is a different lock, called
+	 * send_cpu_message_lock, that serializes accesses to the CPU queue.
+	 * As a result, we don't need to lock the access to the entire H/W
+	 * queues module when submitting a JOB to the CPU queue
+	 */
+	if (q->queue_type != QUEUE_TYPE_CPU)
+		hdev->asic_funcs->hw_queues_lock(hdev);
+
+	if (hdev->disabled) {
+		rc = -EPERM;
+		goto out;
+	}
+
+	rc = ext_queue_sanity_checks(hdev, q, 1, false);
+	if (rc)
+		goto out;
+
+	ext_queue_submit_bd(hdev, q, 0, cb_size, cb_ptr);
+
+out:
+	if (q->queue_type != QUEUE_TYPE_CPU)
+		hdev->asic_funcs->hw_queues_unlock(hdev);
+
+	return rc;
+}
+
+/*
+ * hl_hw_queue_inc_ci_kernel - increment ci for kernel's queue
+ *
+ * @hdev: pointer to hl_device structure
+ * @hw_queue_id: which queue to increment its ci
+ */
+void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id)
+{
+	struct hl_hw_queue *q = &hdev->kernel_queues[hw_queue_id];
+
+	q->ci = hl_queue_inc_ptr(q->ci);
+}
+
+static int ext_and_cpu_hw_queue_init(struct hl_device *hdev,
+					struct hl_hw_queue *q)
+{
+	void *p;
+	int rc;
+
+	p = hdev->asic_funcs->dma_alloc_coherent(hdev,
+				HL_QUEUE_SIZE_IN_BYTES,
+				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	q->kernel_address = (u64) p;
+
+	q->shadow_queue = kmalloc_array(HL_QUEUE_LENGTH,
+					sizeof(*q->shadow_queue),
+					GFP_KERNEL);
+	if (!q->shadow_queue) {
+		dev_err(hdev->dev,
+			"Failed to allocate shadow queue for H/W queue %d\n",
+			q->hw_queue_id);
+		rc = -ENOMEM;
+		goto free_queue;
+	}
+
+	/* Make sure read/write pointers are initialized to start of queue */
+	q->ci = 0;
+	q->pi = 0;
+
+	return 0;
+
+free_queue:
+	hdev->asic_funcs->dma_free_coherent(hdev, HL_QUEUE_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+
+	return rc;
+}
+
+static int int_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	void *p;
+
+	p = hdev->asic_funcs->get_int_queue_base(hdev, q->hw_queue_id,
+					&q->bus_address, &q->int_queue_len);
+	if (!p) {
+		dev_err(hdev->dev,
+			"Failed to get base address for internal queue %d\n",
+			q->hw_queue_id);
+		return -EFAULT;
+	}
+
+	q->kernel_address = (u64) p;
+	q->pi = 0;
+	q->ci = 0;
+
+	return 0;
+}
+
+static int cpu_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	return ext_and_cpu_hw_queue_init(hdev, q);
+}
+
+static int ext_hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	return ext_and_cpu_hw_queue_init(hdev, q);
+}
+
+/*
+ * hw_queue_init - main initialization function for H/W queue object
+ *
+ * @hdev: pointer to hl_device device structure
+ * @q: pointer to hl_hw_queue queue structure
+ * @hw_queue_id: The id of the H/W queue
+ *
+ * Allocate dma-able memory for the queue and initialize fields
+ * Returns 0 on success
+ */
+static int hw_queue_init(struct hl_device *hdev, struct hl_hw_queue *q,
+			u32 hw_queue_id)
+{
+	int rc;
+
+	BUILD_BUG_ON(HL_QUEUE_SIZE_IN_BYTES > HL_PAGE_SIZE);
+
+	q->hw_queue_id = hw_queue_id;
+
+	switch (q->queue_type) {
+	case QUEUE_TYPE_EXT:
+		rc = ext_hw_queue_init(hdev, q);
+		break;
+
+	case QUEUE_TYPE_INT:
+		rc = int_hw_queue_init(hdev, q);
+		break;
+
+	case QUEUE_TYPE_CPU:
+		rc = cpu_hw_queue_init(hdev, q);
+		break;
+
+	case QUEUE_TYPE_NA:
+		q->valid = 0;
+		return 0;
+
+	default:
+		dev_crit(hdev->dev, "wrong queue type %d during init\n",
+			q->queue_type);
+		rc = -EINVAL;
+		break;
+	}
+
+	if (rc)
+		return rc;
+
+	q->valid = 1;
+
+	return 0;
+}
+
+/*
+ * hw_queue_fini - destroy queue
+ *
+ * @hdev: pointer to hl_device device structure
+ * @q: pointer to hl_hw_queue queue structure
+ *
+ * Free the queue memory
+ */
+static void hw_queue_fini(struct hl_device *hdev, struct hl_hw_queue *q)
+{
+	if (!q->valid)
+		return;
+
+	/*
+	 * If we arrived here, there are no jobs waiting on this queue
+	 * so we can safely remove it.
+	 * This is because this function can only called when:
+	 * 1. Either a context is deleted, which only can occur if all its
+	 *    jobs were finished
+	 * 2. A context wasn't able to be created due to failure or timeout,
+	 *    which means there are no jobs on the queue yet
+	 *
+	 * The only exception are the queues of the kernel context, but
+	 * if they are being destroyed, it means that the entire module is
+	 * being removed. If the module is removed, it means there is no open
+	 * user context. It also means that if a job was submitted by
+	 * the kernel driver (e.g. context creation), the job itself was
+	 * released by the kernel driver when a timeout occurred on its
+	 * Completion. Thus, we don't need to release it again.
+	 */
+
+	if (q->queue_type == QUEUE_TYPE_INT)
+		return;
+
+	kfree(q->shadow_queue);
+
+	hdev->asic_funcs->dma_free_coherent(hdev,
+			HL_QUEUE_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+}
+
+int hl_hw_queues_create(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *asic = &hdev->asic_prop;
+	struct hl_hw_queue *q;
+	int i, rc, q_ready_cnt;
+
+	hdev->kernel_queues = kcalloc(HL_MAX_QUEUES,
+				sizeof(*hdev->kernel_queues), GFP_KERNEL);
+
+	if (!hdev->kernel_queues) {
+		dev_err(hdev->dev, "Not enough memory for H/W queues\n");
+		return -ENOMEM;
+	}
+
+	/* Initialize the H/W queues */
+	for (i = 0, q_ready_cnt = 0, q = hdev->kernel_queues;
+			i < HL_MAX_QUEUES ; i++, q_ready_cnt++, q++) {
+
+		q->queue_type = asic->hw_queues_props[i].type;
+		rc = hw_queue_init(hdev, q, i);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to initialize queue %d\n", i);
+			goto release_queues;
+		}
+	}
+
+	return 0;
+
+release_queues:
+	for (i = 0, q = hdev->kernel_queues ; i < q_ready_cnt ; i++, q++)
+		hw_queue_fini(hdev, q);
+
+	kfree(hdev->kernel_queues);
+
+	return rc;
+}
+
+void hl_hw_queues_destroy(struct hl_device *hdev)
+{
+	struct hl_hw_queue *q;
+	int i;
+
+	for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++)
+		hw_queue_fini(hdev, q);
+
+	kfree(hdev->kernel_queues);
+}
+
+void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset)
+{
+	struct hl_hw_queue *q;
+	int i;
+
+	for (i = 0, q = hdev->kernel_queues ; i < HL_MAX_QUEUES ; i++, q++) {
+		if ((!q->valid) ||
+			((!hard_reset) && (q->queue_type == QUEUE_TYPE_CPU)))
+			continue;
+		q->pi = q->ci = 0;
+	}
+}
diff --git a/drivers/misc/habanalabs/include/armcp_if.h b/drivers/misc/habanalabs/include/armcp_if.h
index 85fc2efe144b..cc37003aa6b7 100644
--- a/drivers/misc/habanalabs/include/armcp_if.h
+++ b/drivers/misc/habanalabs/include/armcp_if.h
@@ -10,10 +10,302 @@
 
 #include <linux/types.h>
 
+enum pq_init_status {
+	PQ_INIT_STATUS_NA = 0,
+	PQ_INIT_STATUS_READY_FOR_CP,
+	PQ_INIT_STATUS_READY_FOR_HOST
+};
+
+/*
+ * ArmCP Primary Queue Packets
+ *
+ * During normal operation, KMD needs to send various messages to ArmCP,
+ * usually either to SET some value into a H/W periphery or to GET the current
+ * value of some H/W periphery. For example, SET the frequency of MME/TPC and
+ * GET the value of the thermal sensor.
+ *
+ * These messages can be initiated either by the User application or by KMD
+ * itself, e.g. power management code. In either case, the communication from
+ * KMD to ArmCP will *always* be in synchronous mode, meaning that KMD will
+ * send a single message and poll until the message was acknowledged and the
+ * results are ready (if results are needed).
+ *
+ * This means that only a single message can be sent at a time and KMD must
+ * wait for its result before sending the next message. Having said that,
+ * because these are control messages which are sent in a relatively low
+ * frequency, this limitation seems acceptable. It's important to note that
+ * in case of multiple devices, messages to different devices *can* be sent
+ * at the same time.
+ *
+ * The message, inputs/outputs (if relevant) and fence object will be located
+ * on the device DDR at an address that will be determined by KMD. During
+ * device initialization phase, KMD will pass to ArmCP that address.  Most of
+ * the message types will contain inputs/outputs inside the message itself.
+ * The common part of each message will contain the opcode of the message (its
+ * type) and a field representing a fence object.
+ *
+ * When KMD wishes to send a message to ArmCP, it will write the message
+ * contents to the device DDR, clear the fence object and then write the
+ * value 484 to the mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR register to issue
+ * the 484 interrupt-id to the ARM core.
+ *
+ * Upon receiving the 484 interrupt-id, ArmCP will read the message from the
+ * DDR. In case the message is a SET operation, ArmCP will first perform the
+ * operation and then write to the fence object on the device DDR. In case the
+ * message is a GET operation, ArmCP will first fill the results section on the
+ * device DDR and then write to the fence object. If an error occurred, ArmCP
+ * will fill the rc field with the right error code.
+ *
+ * In the meantime, KMD will poll on the fence object. Once KMD sees that the
+ * fence object is signaled, it will read the results from the device DDR
+ * (if relevant) and resume the code execution in KMD.
+ *
+ * To use QMAN packets, the opcode must be the QMAN opcode, shifted by 8
+ * so the value being put by the KMD matches the value read by ArmCP
+ *
+ * Non-QMAN packets should be limited to values 1 through (2^8 - 1)
+ *
+ * Detailed description:
+ *
+ * ARMCP_PACKET_DISABLE_PCI_ACCESS -
+ *       After receiving this packet the embedded CPU must NOT issue PCI
+ *       transactions (read/write) towards the Host CPU. This also include
+ *       sending MSI-X interrupts.
+ *       This packet is usually sent before the device is moved to D3Hot state.
+ *
+ * ARMCP_PACKET_ENABLE_PCI_ACCESS -
+ *       After receiving this packet the embedded CPU is allowed to issue PCI
+ *       transactions towards the Host CPU, including sending MSI-X interrupts.
+ *       This packet is usually send after the device is moved to D0 state.
+ *
+ * ARMCP_PACKET_TEMPERATURE_GET -
+ *       Fetch the current temperature / Max / Max Hyst / Critical /
+ *       Critical Hyst of a specified thermal sensor. The packet's
+ *       arguments specify the desired sensor and the field to get.
+ *
+ * ARMCP_PACKET_VOLTAGE_GET -
+ *       Fetch the voltage / Max / Min of a specified sensor. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_CURRENT_GET -
+ *       Fetch the current / Max / Min of a specified sensor. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_FAN_SPEED_GET -
+ *       Fetch the speed / Max / Min of a specified fan. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_PWM_GET -
+ *       Fetch the pwm value / mode of a specified pwm. The packet's
+ *       arguments specify the sensor and type.
+ *
+ * ARMCP_PACKET_PWM_SET -
+ *       Set the pwm value / mode of a specified pwm. The packet's
+ *       arguments specify the sensor, type and value.
+ *
+ * ARMCP_PACKET_FREQUENCY_SET -
+ *       Set the frequency of a specified PLL. The packet's arguments specify
+ *       the PLL and the desired frequency. The actual frequency in the device
+ *       might differ from the requested frequency.
+ *
+ * ARMCP_PACKET_FREQUENCY_GET -
+ *       Fetch the frequency of a specified PLL. The packet's arguments specify
+ *       the PLL.
+ *
+ * ARMCP_PACKET_LED_SET -
+ *       Set the state of a specified led. The packet's arguments
+ *       specify the led and the desired state.
+ *
+ * ARMCP_PACKET_I2C_WR -
+ *       Write 32-bit value to I2C device. The packet's arguments specify the
+ *       I2C bus, address and value.
+ *
+ * ARMCP_PACKET_I2C_RD -
+ *       Read 32-bit value from I2C device. The packet's arguments specify the
+ *       I2C bus and address.
+ *
+ * ARMCP_PACKET_INFO_GET -
+ *       Fetch information from the device as specified in the packet's
+ *       structure. KMD passes the max size it allows the ArmCP to write to
+ *       the structure, to prevent data corruption in case of mismatched
+ *       KMD/FW versions.
+ *
+ * ARMCP_PACKET_FLASH_PROGRAM_REMOVED - this packet was removed
+ *
+ * ARMCP_PACKET_UNMASK_RAZWI_IRQ -
+ *       Unmask the given IRQ. The IRQ number is specified in the value field.
+ *       The packet is sent after receiving an interrupt and printing its
+ *       relevant information.
+ *
+ * ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY -
+ *       Unmask the given IRQs. The IRQs numbers are specified in an array right
+ *       after the armcp_packet structure, where its first element is the array
+ *       length. The packet is sent after a soft reset was done in order to
+ *       handle any interrupts that were sent during the reset process.
+ *
+ * ARMCP_PACKET_TEST -
+ *       Test packet for ArmCP connectivity. The CPU will put the fence value
+ *       in the result field.
+ *
+ * ARMCP_PACKET_FREQUENCY_CURR_GET -
+ *       Fetch the current frequency of a specified PLL. The packet's arguments
+ *       specify the PLL.
+ *
+ * ARMCP_PACKET_MAX_POWER_GET -
+ *       Fetch the maximal power of the device.
+ *
+ * ARMCP_PACKET_MAX_POWER_SET -
+ *       Set the maximal power of the device. The packet's arguments specify
+ *       the power.
+ *
+ * ARMCP_PACKET_EEPROM_DATA_GET -
+ *       Get EEPROM data from the ArmCP kernel. The buffer is specified in the
+ *       addr field. The CPU will put the returned data size in the result
+ *       field. In addition, KMD passes the max size it allows the ArmCP to
+ *       write to the structure, to prevent data corruption in case of
+ *       mismatched KMD/FW versions.
+ *
+ */
+
+enum armcp_packet_id {
+	ARMCP_PACKET_DISABLE_PCI_ACCESS = 1,	/* internal */
+	ARMCP_PACKET_ENABLE_PCI_ACCESS,		/* internal */
+	ARMCP_PACKET_TEMPERATURE_GET,		/* sysfs */
+	ARMCP_PACKET_VOLTAGE_GET,		/* sysfs */
+	ARMCP_PACKET_CURRENT_GET,		/* sysfs */
+	ARMCP_PACKET_FAN_SPEED_GET,		/* sysfs */
+	ARMCP_PACKET_PWM_GET,			/* sysfs */
+	ARMCP_PACKET_PWM_SET,			/* sysfs */
+	ARMCP_PACKET_FREQUENCY_SET,		/* sysfs */
+	ARMCP_PACKET_FREQUENCY_GET,		/* sysfs */
+	ARMCP_PACKET_LED_SET,			/* debugfs */
+	ARMCP_PACKET_I2C_WR,			/* debugfs */
+	ARMCP_PACKET_I2C_RD,			/* debugfs */
+	ARMCP_PACKET_INFO_GET,			/* IOCTL */
+	ARMCP_PACKET_FLASH_PROGRAM_REMOVED,
+	ARMCP_PACKET_UNMASK_RAZWI_IRQ,		/* internal */
+	ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY,	/* internal */
+	ARMCP_PACKET_TEST,			/* internal */
+	ARMCP_PACKET_FREQUENCY_CURR_GET,	/* sysfs */
+	ARMCP_PACKET_MAX_POWER_GET,		/* sysfs */
+	ARMCP_PACKET_MAX_POWER_SET,		/* sysfs */
+	ARMCP_PACKET_EEPROM_DATA_GET,		/* sysfs */
+};
+
+#define ARMCP_PACKET_FENCE_VAL	0xFE8CE7A5
+
+#define ARMCP_PKT_CTL_RC_SHIFT		12
+#define ARMCP_PKT_CTL_RC_MASK		0x0000F000
+
+#define ARMCP_PKT_CTL_OPCODE_SHIFT	16
+#define ARMCP_PKT_CTL_OPCODE_MASK	0x1FFF0000
+
+struct armcp_packet {
+	union {
+		__le64 value;	/* For SET packets */
+		__le64 result;	/* For GET packets */
+		__le64 addr;	/* For PQ */
+	};
+
+	__le32 ctl;
+
+	__le32 fence;		/* Signal to KMD that message is completed */
+
+	union {
+		struct {/* For temperature/current/voltage/fan/pwm get/set */
+			__le16 sensor_index;
+			__le16 type;
+		};
+
+		struct {	/* For I2C read/write */
+			__u8 i2c_bus;
+			__u8 i2c_addr;
+			__u8 i2c_reg;
+			__u8 pad; /* unused */
+		};
+
+		/* For frequency get/set */
+		__le32 pll_index;
+
+		/* For led set */
+		__le32 led_index;
+
+		/* For get Armcp info/EEPROM data */
+		__le32 data_max_size;
+	};
+};
+
+struct armcp_unmask_irq_arr_packet {
+	struct armcp_packet armcp_pkt;
+	__le32 length;
+	__le32 irqs[0];
+};
+
+enum armcp_packet_rc {
+	armcp_packet_success,
+	armcp_packet_invalid,
+	armcp_packet_fault
+};
+
+enum armcp_temp_type {
+	armcp_temp_input,
+	armcp_temp_max = 6,
+	armcp_temp_max_hyst,
+	armcp_temp_crit,
+	armcp_temp_crit_hyst
+};
+
+enum armcp_in_attributes {
+	armcp_in_input,
+	armcp_in_min,
+	armcp_in_max
+};
+
+enum armcp_curr_attributes {
+	armcp_curr_input,
+	armcp_curr_min,
+	armcp_curr_max
+};
+
+enum armcp_fan_attributes {
+	armcp_fan_input,
+	armcp_fan_min = 2,
+	armcp_fan_max
+};
+
+enum armcp_pwm_attributes {
+	armcp_pwm_input,
+	armcp_pwm_enable
+};
+
+/* Event Queue Packets */
+
+struct eq_generic_event {
+	__le64 data[7];
+};
+
 /*
  * ArmCP info
  */
 
 #define VERSION_MAX_LEN			128
+#define ARMCP_MAX_SENSORS		128
+
+struct armcp_sensor {
+	__le32 type;
+	__le32 flags;
+};
+
+struct armcp_info {
+	struct armcp_sensor sensors[ARMCP_MAX_SENSORS];
+	__u8 kernel_version[VERSION_MAX_LEN];
+	__le32 reserved[3];
+	__le32 cpld_version;
+	__le32 infineon_version;
+	__u8 fuse_version[VERSION_MAX_LEN];
+	__u8 thermal_version[VERSION_MAX_LEN];
+	__u8 armcp_version[VERSION_MAX_LEN];
+	__le64 dram_size;
+};
 
 #endif /* ARMCP_IF_H */
diff --git a/drivers/misc/habanalabs/include/goya/goya_async_events.h b/drivers/misc/habanalabs/include/goya/goya_async_events.h
new file mode 100644
index 000000000000..497937a17ee9
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya_async_events.h
@@ -0,0 +1,186 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef __GOYA_ASYNC_EVENTS_H_
+#define __GOYA_ASYNC_EVENTS_H_
+
+enum goya_async_event_id {
+	GOYA_ASYNC_EVENT_ID_PCIE_IF = 33,
+	GOYA_ASYNC_EVENT_ID_TPC0_ECC = 36,
+	GOYA_ASYNC_EVENT_ID_TPC1_ECC = 39,
+	GOYA_ASYNC_EVENT_ID_TPC2_ECC = 42,
+	GOYA_ASYNC_EVENT_ID_TPC3_ECC = 45,
+	GOYA_ASYNC_EVENT_ID_TPC4_ECC = 48,
+	GOYA_ASYNC_EVENT_ID_TPC5_ECC = 51,
+	GOYA_ASYNC_EVENT_ID_TPC6_ECC = 54,
+	GOYA_ASYNC_EVENT_ID_TPC7_ECC = 57,
+	GOYA_ASYNC_EVENT_ID_MME_ECC = 60,
+	GOYA_ASYNC_EVENT_ID_MME_ECC_EXT = 61,
+	GOYA_ASYNC_EVENT_ID_MMU_ECC = 63,
+	GOYA_ASYNC_EVENT_ID_DMA_MACRO = 64,
+	GOYA_ASYNC_EVENT_ID_DMA_ECC = 66,
+	GOYA_ASYNC_EVENT_ID_CPU_IF_ECC = 75,
+	GOYA_ASYNC_EVENT_ID_PSOC_MEM = 78,
+	GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT = 79,
+	GOYA_ASYNC_EVENT_ID_SRAM0 = 81,
+	GOYA_ASYNC_EVENT_ID_SRAM1 = 82,
+	GOYA_ASYNC_EVENT_ID_SRAM2 = 83,
+	GOYA_ASYNC_EVENT_ID_SRAM3 = 84,
+	GOYA_ASYNC_EVENT_ID_SRAM4 = 85,
+	GOYA_ASYNC_EVENT_ID_SRAM5 = 86,
+	GOYA_ASYNC_EVENT_ID_SRAM6 = 87,
+	GOYA_ASYNC_EVENT_ID_SRAM7 = 88,
+	GOYA_ASYNC_EVENT_ID_SRAM8 = 89,
+	GOYA_ASYNC_EVENT_ID_SRAM9 = 90,
+	GOYA_ASYNC_EVENT_ID_SRAM10 = 91,
+	GOYA_ASYNC_EVENT_ID_SRAM11 = 92,
+	GOYA_ASYNC_EVENT_ID_SRAM12 = 93,
+	GOYA_ASYNC_EVENT_ID_SRAM13 = 94,
+	GOYA_ASYNC_EVENT_ID_SRAM14 = 95,
+	GOYA_ASYNC_EVENT_ID_SRAM15 = 96,
+	GOYA_ASYNC_EVENT_ID_SRAM16 = 97,
+	GOYA_ASYNC_EVENT_ID_SRAM17 = 98,
+	GOYA_ASYNC_EVENT_ID_SRAM18 = 99,
+	GOYA_ASYNC_EVENT_ID_SRAM19 = 100,
+	GOYA_ASYNC_EVENT_ID_SRAM20 = 101,
+	GOYA_ASYNC_EVENT_ID_SRAM21 = 102,
+	GOYA_ASYNC_EVENT_ID_SRAM22 = 103,
+	GOYA_ASYNC_EVENT_ID_SRAM23 = 104,
+	GOYA_ASYNC_EVENT_ID_SRAM24 = 105,
+	GOYA_ASYNC_EVENT_ID_SRAM25 = 106,
+	GOYA_ASYNC_EVENT_ID_SRAM26 = 107,
+	GOYA_ASYNC_EVENT_ID_SRAM27 = 108,
+	GOYA_ASYNC_EVENT_ID_SRAM28 = 109,
+	GOYA_ASYNC_EVENT_ID_SRAM29 = 110,
+	GOYA_ASYNC_EVENT_ID_GIC500 = 112,
+	GOYA_ASYNC_EVENT_ID_PCIE_DEC = 115,
+	GOYA_ASYNC_EVENT_ID_TPC0_DEC = 117,
+	GOYA_ASYNC_EVENT_ID_TPC1_DEC = 120,
+	GOYA_ASYNC_EVENT_ID_TPC2_DEC = 123,
+	GOYA_ASYNC_EVENT_ID_TPC3_DEC = 126,
+	GOYA_ASYNC_EVENT_ID_TPC4_DEC = 129,
+	GOYA_ASYNC_EVENT_ID_TPC5_DEC = 132,
+	GOYA_ASYNC_EVENT_ID_TPC6_DEC = 135,
+	GOYA_ASYNC_EVENT_ID_TPC7_DEC = 138,
+	GOYA_ASYNC_EVENT_ID_AXI_ECC = 139,
+	GOYA_ASYNC_EVENT_ID_L2_RAM_ECC = 140,
+	GOYA_ASYNC_EVENT_ID_MME_WACS = 141,
+	GOYA_ASYNC_EVENT_ID_MME_WACSD = 142,
+	GOYA_ASYNC_EVENT_ID_PLL0 = 143,
+	GOYA_ASYNC_EVENT_ID_PLL1 = 144,
+	GOYA_ASYNC_EVENT_ID_PLL3 = 146,
+	GOYA_ASYNC_EVENT_ID_PLL4 = 147,
+	GOYA_ASYNC_EVENT_ID_PLL5 = 148,
+	GOYA_ASYNC_EVENT_ID_PLL6 = 149,
+	GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER = 155,
+	GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC = 159,
+	GOYA_ASYNC_EVENT_ID_PSOC = 160,
+	GOYA_ASYNC_EVENT_ID_PCIE_FLR = 171,
+	GOYA_ASYNC_EVENT_ID_PCIE_HOT_RESET = 172,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG0 = 174,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG1 = 175,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG2 = 176,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID0_ENG3 = 177,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG0 = 178,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG1 = 179,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG2 = 180,
+	GOYA_ASYNC_EVENT_ID_PCIE_QID1_ENG3 = 181,
+	GOYA_ASYNC_EVENT_ID_PCIE_APB = 182,
+	GOYA_ASYNC_EVENT_ID_PCIE_QDB = 183,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_D_P_WR = 184,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_D_RD = 185,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_U_P_WR = 186,
+	GOYA_ASYNC_EVENT_ID_PCIE_BM_U_RD = 187,
+	GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU = 190,
+	GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR = 191,
+	GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU = 200,
+	GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR = 201,
+	GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU = 210,
+	GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR = 211,
+	GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU = 220,
+	GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR = 221,
+	GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU = 230,
+	GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR = 231,
+	GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU = 240,
+	GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR = 241,
+	GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU = 250,
+	GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR = 251,
+	GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU = 260,
+	GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR = 261,
+	GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU0 = 270,
+	GOYA_ASYNC_EVENT_ID_MMU_SBA_SPMU1 = 271,
+	GOYA_ASYNC_EVENT_ID_MME_WACS_UP = 272,
+	GOYA_ASYNC_EVENT_ID_MME_WACS_DOWN = 273,
+	GOYA_ASYNC_EVENT_ID_MMU_PAGE_FAULT = 280,
+	GOYA_ASYNC_EVENT_ID_MMU_WR_PERM = 281,
+	GOYA_ASYNC_EVENT_ID_MMU_DBG_BM = 282,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH0 = 290,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH1 = 291,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH2 = 292,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH3 = 293,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH4 = 294,
+	GOYA_ASYNC_EVENT_ID_DDR0_PHY_DFI = 300,
+	GOYA_ASYNC_EVENT_ID_DDR0_ECC_SCRUB = 301,
+	GOYA_ASYNC_EVENT_ID_DDR0_DB_ECC = 302,
+	GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC = 303,
+	GOYA_ASYNC_EVENT_ID_DDR0_SB_ECC_MC = 304,
+	GOYA_ASYNC_EVENT_ID_DDR0_AXI_RD = 305,
+	GOYA_ASYNC_EVENT_ID_DDR0_AXI_WR = 306,
+	GOYA_ASYNC_EVENT_ID_DDR1_PHY_DFI = 310,
+	GOYA_ASYNC_EVENT_ID_DDR1_ECC_SCRUB = 311,
+	GOYA_ASYNC_EVENT_ID_DDR1_DB_ECC = 312,
+	GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC = 313,
+	GOYA_ASYNC_EVENT_ID_DDR1_SB_ECC_MC = 314,
+	GOYA_ASYNC_EVENT_ID_DDR1_AXI_RD = 315,
+	GOYA_ASYNC_EVENT_ID_DDR1_AXI_WR = 316,
+	GOYA_ASYNC_EVENT_ID_CPU_BMON = 320,
+	GOYA_ASYNC_EVENT_ID_TS_EAST = 322,
+	GOYA_ASYNC_EVENT_ID_TS_WEST = 323,
+	GOYA_ASYNC_EVENT_ID_TS_NORTH = 324,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_0 = 330,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_1 = 331,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_U16_2 = 332,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET = 356,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT = 361,
+	GOYA_ASYNC_EVENT_ID_TPC0_CMDQ = 430,
+	GOYA_ASYNC_EVENT_ID_TPC1_CMDQ = 431,
+	GOYA_ASYNC_EVENT_ID_TPC2_CMDQ = 432,
+	GOYA_ASYNC_EVENT_ID_TPC3_CMDQ = 433,
+	GOYA_ASYNC_EVENT_ID_TPC4_CMDQ = 434,
+	GOYA_ASYNC_EVENT_ID_TPC5_CMDQ = 435,
+	GOYA_ASYNC_EVENT_ID_TPC6_CMDQ = 436,
+	GOYA_ASYNC_EVENT_ID_TPC7_CMDQ = 437,
+	GOYA_ASYNC_EVENT_ID_TPC0_QM = 438,
+	GOYA_ASYNC_EVENT_ID_TPC1_QM = 439,
+	GOYA_ASYNC_EVENT_ID_TPC2_QM = 440,
+	GOYA_ASYNC_EVENT_ID_TPC3_QM = 441,
+	GOYA_ASYNC_EVENT_ID_TPC4_QM = 442,
+	GOYA_ASYNC_EVENT_ID_TPC5_QM = 443,
+	GOYA_ASYNC_EVENT_ID_TPC6_QM = 444,
+	GOYA_ASYNC_EVENT_ID_TPC7_QM = 445,
+	GOYA_ASYNC_EVENT_ID_MME_QM = 447,
+	GOYA_ASYNC_EVENT_ID_MME_CMDQ = 448,
+	GOYA_ASYNC_EVENT_ID_DMA0_QM = 449,
+	GOYA_ASYNC_EVENT_ID_DMA1_QM = 450,
+	GOYA_ASYNC_EVENT_ID_DMA2_QM = 451,
+	GOYA_ASYNC_EVENT_ID_DMA3_QM = 452,
+	GOYA_ASYNC_EVENT_ID_DMA4_QM = 453,
+	GOYA_ASYNC_EVENT_ID_DMA_ON_HBW = 454,
+	GOYA_ASYNC_EVENT_ID_DMA0_CH = 455,
+	GOYA_ASYNC_EVENT_ID_DMA1_CH = 456,
+	GOYA_ASYNC_EVENT_ID_DMA2_CH = 457,
+	GOYA_ASYNC_EVENT_ID_DMA3_CH = 458,
+	GOYA_ASYNC_EVENT_ID_DMA4_CH = 459,
+	GOYA_ASYNC_EVENT_ID_PI_UPDATE = 484,
+	GOYA_ASYNC_EVENT_ID_HALT_MACHINE = 485,
+	GOYA_ASYNC_EVENT_ID_INTS_REGISTER = 486,
+	GOYA_ASYNC_EVENT_ID_SOFT_RESET = 487,
+	GOYA_ASYNC_EVENT_ID_LAST_VALID_ID = 1023,
+	GOYA_ASYNC_EVENT_ID_SIZE
+};
+
+#endif /* __GOYA_ASYNC_EVENTS_H_ */
diff --git a/drivers/misc/habanalabs/include/goya/goya_packets.h b/drivers/misc/habanalabs/include/goya/goya_packets.h
new file mode 100644
index 000000000000..a14407b975e4
--- /dev/null
+++ b/drivers/misc/habanalabs/include/goya/goya_packets.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2017-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef GOYA_PACKETS_H
+#define GOYA_PACKETS_H
+
+#include <linux/types.h>
+
+#define PACKET_HEADER_PACKET_ID_SHIFT		56
+#define PACKET_HEADER_PACKET_ID_MASK		0x1F00000000000000ull
+
+enum packet_id {
+	PACKET_WREG_32 = 0x1,
+	PACKET_WREG_BULK = 0x2,
+	PACKET_MSG_LONG = 0x3,
+	PACKET_MSG_SHORT = 0x4,
+	PACKET_CP_DMA = 0x5,
+	PACKET_MSG_PROT = 0x7,
+	PACKET_FENCE = 0x8,
+	PACKET_LIN_DMA = 0x9,
+	PACKET_NOP = 0xA,
+	PACKET_STOP = 0xB,
+	MAX_PACKET_ID = (PACKET_HEADER_PACKET_ID_MASK >>
+				PACKET_HEADER_PACKET_ID_SHIFT) + 1
+};
+
+enum goya_dma_direction {
+	DMA_HOST_TO_DRAM,
+	DMA_HOST_TO_SRAM,
+	DMA_DRAM_TO_SRAM,
+	DMA_SRAM_TO_DRAM,
+	DMA_SRAM_TO_HOST,
+	DMA_DRAM_TO_HOST,
+	DMA_DRAM_TO_DRAM,
+	DMA_SRAM_TO_SRAM,
+	DMA_ENUM_MAX
+};
+
+#define GOYA_PKT_CTL_OPCODE_SHIFT	24
+#define GOYA_PKT_CTL_OPCODE_MASK	0x1F000000
+
+#define GOYA_PKT_CTL_EB_SHIFT		29
+#define GOYA_PKT_CTL_EB_MASK		0x20000000
+
+#define GOYA_PKT_CTL_RB_SHIFT		30
+#define GOYA_PKT_CTL_RB_MASK		0x40000000
+
+#define GOYA_PKT_CTL_MB_SHIFT		31
+#define GOYA_PKT_CTL_MB_MASK		0x80000000
+
+struct packet_nop {
+	__le32 reserved;
+	__le32 ctl;
+};
+
+struct packet_stop {
+	__le32 reserved;
+	__le32 ctl;
+};
+
+#define GOYA_PKT_WREG32_CTL_REG_OFFSET_SHIFT	0
+#define GOYA_PKT_WREG32_CTL_REG_OFFSET_MASK	0x0000FFFF
+
+struct packet_wreg32 {
+	__le32 value;
+	__le32 ctl;
+};
+
+struct packet_wreg_bulk {
+	__le32 size64;
+	__le32 ctl;
+	__le64 values[0]; /* data starts here */
+};
+
+struct packet_msg_long {
+	__le32 value;
+	__le32 ctl;
+	__le64 addr;
+};
+
+struct packet_msg_short {
+	__le32 value;
+	__le32 ctl;
+};
+
+struct packet_msg_prot {
+	__le32 value;
+	__le32 ctl;
+	__le64 addr;
+};
+
+struct packet_fence {
+	__le32 cfg;
+	__le32 ctl;
+};
+
+#define GOYA_PKT_LIN_DMA_CTL_WO_SHIFT		0
+#define GOYA_PKT_LIN_DMA_CTL_WO_MASK		0x00000001
+
+#define GOYA_PKT_LIN_DMA_CTL_RDCOMP_SHIFT	1
+#define GOYA_PKT_LIN_DMA_CTL_RDCOMP_MASK	0x00000002
+
+#define GOYA_PKT_LIN_DMA_CTL_WRCOMP_SHIFT	2
+#define GOYA_PKT_LIN_DMA_CTL_WRCOMP_MASK	0x00000004
+
+#define GOYA_PKT_LIN_DMA_CTL_MEMSET_SHIFT	6
+#define GOYA_PKT_LIN_DMA_CTL_MEMSET_MASK	0x00000040
+
+#define GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT	20
+#define GOYA_PKT_LIN_DMA_CTL_DMA_DIR_MASK	0x00700000
+
+struct packet_lin_dma {
+	__le32 tsize;
+	__le32 ctl;
+	__le64 src_addr;
+	__le64 dst_addr;
+};
+
+struct packet_cp_dma {
+	__le32 tsize;
+	__le32 ctl;
+	__le64 src_addr;
+};
+
+#endif /* GOYA_PACKETS_H */
diff --git a/drivers/misc/habanalabs/include/qman_if.h b/drivers/misc/habanalabs/include/qman_if.h
new file mode 100644
index 000000000000..bf59bbe27fdc
--- /dev/null
+++ b/drivers/misc/habanalabs/include/qman_if.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef QMAN_IF_H
+#define QMAN_IF_H
+
+#include <linux/types.h>
+
+/*
+ * PRIMARY QUEUE
+ */
+
+struct hl_bd {
+	__le64	ptr;
+	__le32	len;
+	__le32	ctl;
+};
+
+#define HL_BD_SIZE			sizeof(struct hl_bd)
+
+/*
+ * BD_CTL_REPEAT_VALID tells the CP whether the repeat field in the BD CTL is
+ * valid. 1 means the repeat field is valid, 0 means not-valid,
+ * i.e. repeat == 1
+ */
+#define BD_CTL_REPEAT_VALID_SHIFT	24
+#define BD_CTL_REPEAT_VALID_MASK	0x01000000
+
+#define BD_CTL_SHADOW_INDEX_SHIFT	0
+#define BD_CTL_SHADOW_INDEX_MASK	0x00000FFF
+
+/*
+ * COMPLETION QUEUE
+ */
+
+struct hl_cq_entry {
+	__le32	data;
+};
+
+#define HL_CQ_ENTRY_SIZE		sizeof(struct hl_cq_entry)
+
+#define CQ_ENTRY_READY_SHIFT			31
+#define CQ_ENTRY_READY_MASK			0x80000000
+
+#define CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT	30
+#define CQ_ENTRY_SHADOW_INDEX_VALID_MASK	0x40000000
+
+#define CQ_ENTRY_SHADOW_INDEX_SHIFT		BD_CTL_SHADOW_INDEX_SHIFT
+#define CQ_ENTRY_SHADOW_INDEX_MASK		BD_CTL_SHADOW_INDEX_MASK
+
+
+#endif /* QMAN_IF_H */
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
new file mode 100644
index 000000000000..acf9a5a55476
--- /dev/null
+++ b/drivers/misc/habanalabs/irq.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/dma-mapping.h>
+
+
+/*
+ * hl_cq_inc_ptr - increment ci or pi of cq
+ *
+ * @ptr: the current ci or pi value of the completion queue
+ *
+ * Increment ptr by 1. If it reaches the number of completion queue
+ * entries, set it to 0
+ */
+inline u32 hl_cq_inc_ptr(u32 ptr)
+{
+	ptr++;
+	if (unlikely(ptr == HL_CQ_LENGTH))
+		ptr = 0;
+	return ptr;
+}
+
+/*
+ * hl_irq_handler_cq - irq handler for completion queue
+ *
+ * @irq: irq number
+ * @arg: pointer to completion queue structure
+ *
+ */
+irqreturn_t hl_irq_handler_cq(int irq, void *arg)
+{
+	struct hl_cq *cq = arg;
+	struct hl_device *hdev = cq->hdev;
+	struct hl_hw_queue *queue;
+	struct hl_cs_job *job;
+	bool shadow_index_valid;
+	u16 shadow_index;
+	u32 *cq_entry;
+	u32 *cq_base;
+
+	if (hdev->disabled) {
+		dev_dbg(hdev->dev,
+			"Device disabled but received IRQ %d for CQ %d\n",
+			irq, cq->hw_queue_id);
+		return IRQ_HANDLED;
+	}
+
+	cq_base = (u32 *) cq->kernel_address;
+
+	while (1) {
+		bool entry_ready = ((cq_base[cq->ci] & CQ_ENTRY_READY_MASK)
+						>> CQ_ENTRY_READY_SHIFT);
+
+		if (!entry_ready)
+			break;
+
+		cq_entry = (u32 *) &cq_base[cq->ci];
+
+		/*
+		 * Make sure we read CQ entry contents after we've
+		 * checked the ownership bit.
+		 */
+		dma_rmb();
+
+		shadow_index_valid =
+			((*cq_entry & CQ_ENTRY_SHADOW_INDEX_VALID_MASK)
+					>> CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT);
+
+		shadow_index = (u16)
+			((*cq_entry & CQ_ENTRY_SHADOW_INDEX_MASK)
+					>> CQ_ENTRY_SHADOW_INDEX_SHIFT);
+
+		queue = &hdev->kernel_queues[cq->hw_queue_id];
+
+		if ((shadow_index_valid) && (!hdev->disabled)) {
+			job = queue->shadow_queue[hl_pi_2_offset(shadow_index)];
+			queue_work(hdev->cq_wq, &job->finish_work);
+		}
+
+		/*
+		 * Update ci of the context's queue. There is no
+		 * need to protect it with spinlock because this update is
+		 * done only inside IRQ and there is a different IRQ per
+		 * queue
+		 */
+		queue->ci = hl_queue_inc_ptr(queue->ci);
+
+		/* Clear CQ entry ready bit */
+		cq_base[cq->ci] &= ~CQ_ENTRY_READY_MASK;
+
+		cq->ci = hl_cq_inc_ptr(cq->ci);
+
+		/* Increment free slots */
+		atomic_inc(&cq->free_slots_cnt);
+	}
+
+	return IRQ_HANDLED;
+}
+
+/*
+ * hl_cq_init - main initialization function for an cq object
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to cq structure
+ * @hw_queue_id: The H/W queue ID this completion queue belongs to
+ *
+ * Allocate dma-able memory for the completion queue and initialize fields
+ * Returns 0 on success
+ */
+int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id)
+{
+	void *p;
+
+	BUILD_BUG_ON(HL_CQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
+
+	p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
+				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	q->hdev = hdev;
+	q->kernel_address = (u64) p;
+	q->hw_queue_id = hw_queue_id;
+	q->ci = 0;
+	q->pi = 0;
+
+	atomic_set(&q->free_slots_cnt, HL_CQ_LENGTH);
+
+	return 0;
+}
+
+/*
+ * hl_cq_fini - destroy completion queue
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to cq structure
+ *
+ * Free the completion queue memory
+ */
+void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
+{
+	hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+}
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index a8edfd3e9c95..756266cf0416 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -17,6 +17,35 @@
  */
 #define GOYA_KMD_SRAM_RESERVED_SIZE_FROM_START	0x8000	/* 32KB */
 
+/*
+ * Queue Numbering
+ *
+ * The external queues (DMA channels + CPU) MUST be before the internal queues
+ * and each group (DMA channels + CPU and internal) must be contiguous inside
+ * itself but there can be a gap between the two groups (although not
+ * recommended)
+ */
+
+enum goya_queue_id {
+	GOYA_QUEUE_ID_DMA_0 = 0,
+	GOYA_QUEUE_ID_DMA_1,
+	GOYA_QUEUE_ID_DMA_2,
+	GOYA_QUEUE_ID_DMA_3,
+	GOYA_QUEUE_ID_DMA_4,
+	GOYA_QUEUE_ID_CPU_PQ,
+	GOYA_QUEUE_ID_MME,
+	GOYA_QUEUE_ID_TPC0,
+	GOYA_QUEUE_ID_TPC1,
+	GOYA_QUEUE_ID_TPC2,
+	GOYA_QUEUE_ID_TPC3,
+	GOYA_QUEUE_ID_TPC4,
+	GOYA_QUEUE_ID_TPC5,
+	GOYA_QUEUE_ID_TPC6,
+	GOYA_QUEUE_ID_TPC7,
+	GOYA_QUEUE_ID_SIZE
+};
+
+
 /* Opcode to create a new command buffer */
 #define HL_CB_OP_CREATE		0
 /* Opcode to destroy previously created command buffer */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 08/15] habanalabs: add event queue and interrupts
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (5 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 07/15] habanalabs: add h/w queues module Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 09/15] habanalabs: add sysfs and hwmon support Oded Gabbay
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds support for receiving events from Goya's control CPU and
for receiving MSI-X interrupts from Goya's DMA engines and CPU.

Goya's PCI controller supports up to 8 MSI-X interrupts, which only 6 of
them are currently used. The first 5 interrupts are dedicated for Goya's
DMA engine queues. The 6th interrupt is dedicated for Goya's control CPU.

The DMA queue will signal its MSI-X entry upon each completion of a command
buffer that was placed on its primary queue. The driver will then mark that
CB as completed and free the related resources. It will also update the
command submission object which that CB belongs to.

There is a dedicated event queue (EQ) between the driver and Goya's control
CPU. The EQ is located on the Host memory. The control CPU writes a new
entry to the EQ for various reasons, such as ECC error, MMU page fault, Hot
temperature. After writing the new entry to the EQ, the control CPU will
trigger its dedicated MSI-X entry to signal the driver that there is a new
entry in the EQ. The driver will then read the entry and act accordingly.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Do not assume state of device CPU when trying to halt it
 
 drivers/misc/habanalabs/device.c           |  34 +-
 drivers/misc/habanalabs/goya/goya.c        | 461 ++++++++++++++++++++-
 drivers/misc/habanalabs/goya/goyaP.h       |   1 +
 drivers/misc/habanalabs/habanalabs.h       |  39 +-
 drivers/misc/habanalabs/include/armcp_if.h |  24 ++
 drivers/misc/habanalabs/irq.c              | 144 +++++++
 6 files changed, 691 insertions(+), 12 deletions(-)

diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 3f807e1419a6..e4c939899042 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -182,6 +182,13 @@ static int device_early_init(struct hl_device *hdev)
 		goto asid_fini;
 	}
 
+	hdev->eq_wq = alloc_workqueue("hl-events", WQ_UNBOUND, 0);
+	if (hdev->eq_wq == NULL) {
+		dev_err(hdev->dev, "Failed to allocate EQ workqueue\n");
+		rc = -ENOMEM;
+		goto free_cq_wq;
+	}
+
 	hl_cb_mgr_init(&hdev->kernel_cb_mgr);
 
 	mutex_init(&hdev->fd_open_cnt_lock);
@@ -190,6 +197,8 @@ static int device_early_init(struct hl_device *hdev)
 
 	return 0;
 
+free_cq_wq:
+	destroy_workqueue(hdev->cq_wq);
 asid_fini:
 	hl_asid_fini(hdev);
 early_fini:
@@ -211,6 +220,7 @@ static void device_early_fini(struct hl_device *hdev)
 
 	hl_cb_mgr_fini(hdev, &hdev->kernel_cb_mgr);
 
+	destroy_workqueue(hdev->eq_wq);
 	destroy_workqueue(hdev->cq_wq);
 
 	hl_asid_fini(hdev);
@@ -349,11 +359,22 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		}
 	}
 
+	/*
+	 * Initialize the event queue. Must be done before hw_init,
+	 * because there the address of the event queue is being
+	 * passed as argument to request_irq
+	 */
+	rc = hl_eq_init(hdev, &hdev->event_queue);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize event queue\n");
+		goto cq_fini;
+	}
+
 	/* Allocate the kernel context */
 	hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx), GFP_KERNEL);
 	if (!hdev->kernel_ctx) {
 		rc = -ENOMEM;
-		goto cq_fini;
+		goto eq_fini;
 	}
 
 	hdev->user_ctx = NULL;
@@ -398,6 +419,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 			"kernel ctx is still alive on initialization failure\n");
 free_ctx:
 	kfree(hdev->kernel_ctx);
+eq_fini:
+	hl_eq_fini(hdev, &hdev->event_queue);
 cq_fini:
 	for (i = 0 ; i < cq_ready_cnt ; i++)
 		hl_cq_fini(hdev, &hdev->completion_queue[i]);
@@ -439,6 +462,13 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	/*
+	 * Halt the engines and disable interrupts so we won't get any more
+	 * completions from H/W and we won't have any accesses from the
+	 * H/W to the host machine
+	 */
+	hdev->asic_funcs->halt_engines(hdev, true);
+
 	hl_cb_pool_fini(hdev);
 
 	/* Release kernel context */
@@ -448,6 +478,8 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, true);
 
+	hl_eq_fini(hdev, &hdev->event_queue);
+
 	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
 		hl_cq_fini(hdev, &hdev->completion_queue[i]);
 	kfree(hdev->completion_queue);
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 046c4221af37..d7206cbb812e 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -92,9 +92,41 @@
 
 #define GOYA_MAX_INITIATORS		20
 
+#define GOYA_MAX_STRING_LEN		20
+
 #define GOYA_CB_POOL_CB_CNT		512
 #define GOYA_CB_POOL_CB_SIZE		0x20000		/* 128KB */
 
+static const char goya_irq_name[GOYA_MSIX_ENTRIES][GOYA_MAX_STRING_LEN] = {
+		"goya cq 0", "goya cq 1", "goya cq 2", "goya cq 3",
+		"goya cq 4", "goya cpu eq"
+};
+
+static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
+	"MME0",
+	"MME1",
+	"MME2",
+	"MME3",
+	"MME4",
+	"MME5",
+	"TPC0",
+	"TPC1",
+	"TPC2",
+	"TPC3",
+	"TPC4",
+	"TPC5",
+	"TPC6",
+	"TPC7",
+	"PCI",
+	"DMA", /* HBW */
+	"DMA", /* LBW */
+	"PSOC",
+	"CPU",
+	"MMU"
+};
+
+#define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
+
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
@@ -139,6 +171,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->va_space_dram_end_address = VA_DDR_SPACE_END;
 	prop->cfg_size = CFG_SIZE;
 	prop->max_asid = MAX_ASID;
+	prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
 	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
 	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
@@ -674,15 +707,10 @@ static void goya_init_dma_qman(struct hl_device *hdev, int dma_id,
 	WREG32(mmDMA_QM_0_PQ_CFG1 + reg_off, 0x00020002);
 	WREG32(mmDMA_QM_0_CQ_CFG1 + reg_off, 0x00080008);
 
-	if (dma_id == 0)
-		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
+	if (goya->hw_cap_initialized & HW_CAP_MMU)
+		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_PARTLY_TRUSTED);
 	else
-		if (goya->hw_cap_initialized & HW_CAP_MMU)
-			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
-					QMAN_DMA_PARTLY_TRUSTED);
-		else
-			WREG32(mmDMA_QM_0_GLBL_PROT + reg_off,
-					QMAN_DMA_FULLY_TRUSTED);
+		WREG32(mmDMA_QM_0_GLBL_PROT + reg_off, QMAN_DMA_FULLY_TRUSTED);
 
 	WREG32(mmDMA_QM_0_GLBL_ERR_CFG + reg_off, QMAN_DMA_ERR_MSG_EN);
 	WREG32(mmDMA_QM_0_GLBL_CFG0 + reg_off, QMAN_DMA_ENABLE);
@@ -888,6 +916,7 @@ static void goya_resume_external_queues(struct hl_device *hdev)
 int goya_init_cpu_queues(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
+	struct hl_eq *eq;
 	dma_addr_t bus_address;
 	u32 status;
 	struct hl_hw_queue *cpu_pq = &hdev->kernel_queues[GOYA_QUEUE_ID_CPU_PQ];
@@ -899,17 +928,24 @@ int goya_init_cpu_queues(struct hl_device *hdev)
 	if (goya->hw_cap_initialized & HW_CAP_CPU_Q)
 		return 0;
 
+	eq = &hdev->event_queue;
+
 	bus_address = cpu_pq->bus_address +
 			hdev->asic_prop.host_phys_base_address;
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_0, lower_32_bits(bus_address));
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_1, upper_32_bits(bus_address));
 
+	bus_address = eq->bus_address + hdev->asic_prop.host_phys_base_address;
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_2, lower_32_bits(bus_address));
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_3, upper_32_bits(bus_address));
+
 	bus_address = hdev->cpu_accessible_dma_address +
 			hdev->asic_prop.host_phys_base_address;
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_8, lower_32_bits(bus_address));
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_9, upper_32_bits(bus_address));
 
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_5, HL_QUEUE_SIZE_IN_BYTES);
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_4, HL_EQ_SIZE_IN_BYTES);
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_10, CPU_ACCESSIBLE_MEM_SIZE);
 
 	/* Used for EQ CI */
@@ -1867,6 +1903,165 @@ static void goya_resume_internal_queues(struct hl_device *hdev)
 	WREG32(mmTPC7_CMDQ_GLBL_CFG1, 0);
 }
 
+static void goya_dma_stall(struct hl_device *hdev)
+{
+	WREG32(mmDMA_QM_0_GLBL_CFG1, 1 << DMA_QM_0_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_1_GLBL_CFG1, 1 << DMA_QM_1_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_2_GLBL_CFG1, 1 << DMA_QM_2_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_3_GLBL_CFG1, 1 << DMA_QM_3_GLBL_CFG1_DMA_STOP_SHIFT);
+	WREG32(mmDMA_QM_4_GLBL_CFG1, 1 << DMA_QM_4_GLBL_CFG1_DMA_STOP_SHIFT);
+}
+
+static void goya_tpc_stall(struct hl_device *hdev)
+{
+	WREG32(mmTPC0_CFG_TPC_STALL, 1 << TPC0_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC1_CFG_TPC_STALL, 1 << TPC1_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC2_CFG_TPC_STALL, 1 << TPC2_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC3_CFG_TPC_STALL, 1 << TPC3_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC4_CFG_TPC_STALL, 1 << TPC4_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC5_CFG_TPC_STALL, 1 << TPC5_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC6_CFG_TPC_STALL, 1 << TPC6_CFG_TPC_STALL_V_SHIFT);
+	WREG32(mmTPC7_CFG_TPC_STALL, 1 << TPC7_CFG_TPC_STALL_V_SHIFT);
+}
+
+static void goya_mme_stall(struct hl_device *hdev)
+{
+	WREG32(mmMME_STALL, 0xFFFFFFFF);
+}
+
+static int goya_enable_msix(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int cq_cnt = hdev->asic_prop.completion_queues_count;
+	int rc, i, irq_cnt_init, irq;
+
+	if (goya->hw_cap_initialized & HW_CAP_MSIX)
+		return 0;
+
+	rc = pci_alloc_irq_vectors(hdev->pdev, GOYA_MSIX_ENTRIES,
+				GOYA_MSIX_ENTRIES, PCI_IRQ_MSIX);
+	if (rc < 0) {
+		dev_err(hdev->dev,
+			"MSI-X: Failed to enable support -- %d/%d\n",
+			GOYA_MSIX_ENTRIES, rc);
+		return rc;
+	}
+
+	for (i = 0, irq_cnt_init = 0 ; i < cq_cnt ; i++, irq_cnt_init++) {
+		irq = pci_irq_vector(hdev->pdev, i);
+		rc = request_irq(irq, hl_irq_handler_cq, 0, goya_irq_name[i],
+				&hdev->completion_queue[i]);
+		if (rc) {
+			dev_err(hdev->dev, "Failed to request IRQ %d", irq);
+			goto free_irqs;
+		}
+	}
+
+	irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
+
+	rc = request_irq(irq, hl_irq_handler_eq, 0,
+			goya_irq_name[EVENT_QUEUE_MSIX_IDX],
+			&hdev->event_queue);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to request IRQ %d", irq);
+		goto free_irqs;
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_MSIX;
+	return 0;
+
+free_irqs:
+	for (i = 0 ; i < irq_cnt_init ; i++)
+		free_irq(pci_irq_vector(hdev->pdev, i),
+			&hdev->completion_queue[i]);
+
+	pci_free_irq_vectors(hdev->pdev);
+	return rc;
+}
+
+static void goya_sync_irqs(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
+		return;
+
+	/* Wait for all pending IRQs to be finished */
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		synchronize_irq(pci_irq_vector(hdev->pdev, i));
+
+	synchronize_irq(pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX));
+}
+
+static void goya_disable_msix(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i, irq;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MSIX))
+		return;
+
+	goya_sync_irqs(hdev);
+
+	irq = pci_irq_vector(hdev->pdev, EVENT_QUEUE_MSIX_IDX);
+	free_irq(irq, &hdev->event_queue);
+
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++) {
+		irq = pci_irq_vector(hdev->pdev, i);
+		free_irq(irq, &hdev->completion_queue[i]);
+	}
+
+	pci_free_irq_vectors(hdev->pdev);
+
+	goya->hw_cap_initialized &= ~HW_CAP_MSIX;
+}
+
+static void goya_halt_engines(struct hl_device *hdev, bool hard_reset)
+{
+	u32 wait_timeout_ms, cpu_timeout_ms;
+
+	dev_info(hdev->dev,
+		"Halting compute engines and disabling interrupts\n");
+
+	if (hdev->pldm) {
+		wait_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
+		cpu_timeout_ms = GOYA_PLDM_RESET_WAIT_MSEC;
+	} else {
+		wait_timeout_ms = GOYA_RESET_WAIT_MSEC;
+		cpu_timeout_ms = GOYA_CPU_RESET_WAIT_MSEC;
+	}
+
+	if (hard_reset) {
+		/*
+		 * I don't know what is the state of the CPU so make sure it is
+		 * stopped in any means necessary
+		 */
+		WREG32(mmPSOC_GLOBAL_CONF_UBOOT_MAGIC, KMD_MSG_GOTO_WFE);
+		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+			GOYA_ASYNC_EVENT_ID_HALT_MACHINE);
+		msleep(cpu_timeout_ms);
+	}
+
+	goya_stop_external_queues(hdev);
+	goya_stop_internal_queues(hdev);
+
+	msleep(wait_timeout_ms);
+
+	goya_dma_stall(hdev);
+	goya_tpc_stall(hdev);
+	goya_mme_stall(hdev);
+
+	msleep(wait_timeout_ms);
+
+	goya_disable_external_queues(hdev);
+	goya_disable_internal_queues(hdev);
+
+	if (hard_reset)
+		goya_disable_msix(hdev);
+	else
+		goya_sync_irqs(hdev);
+}
 
 /*
  * goya_push_fw_to_device - Push FW code to device
@@ -2194,11 +2389,16 @@ static int goya_hw_init(struct hl_device *hdev)
 
 	goya_init_tpc_qmans(hdev);
 
+	/* MSI-X must be enabled before CPU queues are initialized */
+	rc = goya_enable_msix(hdev);
+	if (rc)
+		goto disable_queues;
+
 	rc = goya_init_cpu_queues(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize CPU H/W queues %d\n",
 			rc);
-		goto disable_queues;
+		goto disable_msix;
 	}
 
 	/* CPU initialization is finished, we can now move to 48 bit DMA mask */
@@ -2232,6 +2432,8 @@ static int goya_hw_init(struct hl_device *hdev)
 
 disable_pci_access:
 	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+disable_msix:
+	goya_disable_msix(hdev);
 disable_queues:
 	goya_disable_internal_queues(hdev);
 	goya_disable_external_queues(hdev);
@@ -2297,6 +2499,7 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
 					HW_CAP_DMA | HW_CAP_MME |
 					HW_CAP_MMU | HW_CAP_TPC_MBIST |
 					HW_CAP_GOLDEN | HW_CAP_TPC);
+	memset(goya->events_stat, 0, sizeof(goya->events_stat));
 
 	if (!hdev->pldm) {
 		int rc;
@@ -2781,6 +2984,242 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
 	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
 }
 
+static void goya_update_eq_ci(struct hl_device *hdev, u32 val)
+{
+	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, val);
+}
+
+static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
+		u16 event_type, char *axi_name, int len)
+{
+	if (!strcmp(goya_axi_name[agent_id], "DMA"))
+		if (event_type >= GOYA_ASYNC_EVENT_ID_DMA0_CH)
+			snprintf(axi_name, len, "DMA %d",
+				event_type - GOYA_ASYNC_EVENT_ID_DMA0_CH);
+		else
+			snprintf(axi_name, len, "DMA %d",
+				event_type - GOYA_ASYNC_EVENT_ID_DMA0_QM);
+	else
+		snprintf(axi_name, len, "%s", goya_axi_name[agent_id]);
+}
+
+static void goya_print_razwi_info(struct hl_device *hdev, u64 reg,
+		bool is_hbw, bool is_read, u16 event_type)
+{
+	u32 val, agent_id;
+	char axi_name[10] = {0};
+
+	val = RREG32(reg);
+
+	if (is_hbw)
+		agent_id = (val & GOYA_IRQ_HBW_AGENT_ID_MASK) >>
+				GOYA_IRQ_HBW_AGENT_ID_SHIFT;
+	else
+		agent_id = (val & GOYA_IRQ_LBW_AGENT_ID_MASK) >>
+				GOYA_IRQ_LBW_AGENT_ID_SHIFT;
+
+	if (agent_id >= GOYA_MAX_INITIATORS) {
+		dev_err(hdev->dev,
+			"Illegal %s %s with wrong initiator id %d, H/W IRQ %d\n",
+				is_read ? "read from" : "write to",
+				is_hbw ? "HBW" : "LBW",
+				agent_id,
+				event_type);
+	} else {
+		goya_get_axi_name(hdev, agent_id, event_type, axi_name,
+				sizeof(axi_name));
+		dev_err(hdev->dev, "Illegal %s by %s %s %s, H/W IRQ %d\n",
+				is_read ? "read" : "write",
+				axi_name,
+				is_read ? "from" : "to",
+				is_hbw ? "HBW" : "LBW",
+				event_type);
+	}
+}
+
+static void goya_print_irq_info(struct hl_device *hdev, u16 event_type)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	bool is_hbw = false, is_read = false, is_info = false;
+
+	if (RREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD)) {
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_WT_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_LBW_WT_VLD, 0);
+		is_info = true;
+	}
+	if (RREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD)) {
+		is_read = true;
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_LBW_RD_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_LBW_RD_VLD, 0);
+		is_info = true;
+	}
+	if (RREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD)) {
+		is_hbw = true;
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_WT_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_HBW_WT_VLD, 0);
+		is_info = true;
+	}
+	if (RREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD)) {
+		is_hbw = true;
+		is_read = true;
+		goya_print_razwi_info(hdev, mmDMA_MACRO_RAZWI_HBW_RD_ID, is_hbw,
+				is_read, event_type);
+		WREG32(mmDMA_MACRO_RAZWI_HBW_RD_VLD, 0);
+		is_info = true;
+	}
+	if (!is_info) {
+		dev_err(hdev->dev,
+			"Received H/W interrupt %d, no additional info\n",
+			event_type);
+		return;
+	}
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU) {
+		u32 val = RREG32(mmMMU_PAGE_ERROR_CAPTURE);
+		u64 addr;
+
+		if (val & MMU_PAGE_ERROR_CAPTURE_ENTRY_VALID_MASK) {
+			addr = val & MMU_PAGE_ERROR_CAPTURE_VA_49_32_MASK;
+			addr <<= 32;
+			addr |= RREG32(mmMMU_PAGE_ERROR_CAPTURE_VA);
+
+			dev_err(hdev->dev, "MMU page fault on va 0x%llx\n",
+					addr);
+
+			WREG32(mmMMU_PAGE_ERROR_CAPTURE, 0);
+		}
+	}
+}
+
+static int goya_unmask_irq(struct hl_device *hdev, u16 event_type)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_UNMASK_RAZWI_IRQ << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.value = event_type;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to unmask RAZWI IRQ %d", event_type);
+
+	return rc;
+}
+
+void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry)
+{
+	u16 event_type = ((eq_entry->hdr.ctl & EQ_CTL_EVENT_TYPE_MASK)
+			>> EQ_CTL_EVENT_TYPE_SHIFT);
+	struct goya_device *goya = hdev->asic_specific;
+
+	goya->events_stat[event_type]++;
+
+	switch (event_type) {
+	case GOYA_ASYNC_EVENT_ID_PCIE_IF:
+	case GOYA_ASYNC_EVENT_ID_TPC0_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC1_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC2_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC3_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC4_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC5_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC6_ECC:
+	case GOYA_ASYNC_EVENT_ID_TPC7_ECC:
+	case GOYA_ASYNC_EVENT_ID_MME_ECC:
+	case GOYA_ASYNC_EVENT_ID_MME_ECC_EXT:
+	case GOYA_ASYNC_EVENT_ID_MMU_ECC:
+	case GOYA_ASYNC_EVENT_ID_DMA_MACRO:
+	case GOYA_ASYNC_EVENT_ID_DMA_ECC:
+	case GOYA_ASYNC_EVENT_ID_CPU_IF_ECC:
+	case GOYA_ASYNC_EVENT_ID_PSOC_MEM:
+	case GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT:
+	case GOYA_ASYNC_EVENT_ID_SRAM0 ... GOYA_ASYNC_EVENT_ID_SRAM29:
+	case GOYA_ASYNC_EVENT_ID_GIC500:
+	case GOYA_ASYNC_EVENT_ID_PLL0:
+	case GOYA_ASYNC_EVENT_ID_PLL1:
+	case GOYA_ASYNC_EVENT_ID_PLL3:
+	case GOYA_ASYNC_EVENT_ID_PLL4:
+	case GOYA_ASYNC_EVENT_ID_PLL5:
+	case GOYA_ASYNC_EVENT_ID_PLL6:
+	case GOYA_ASYNC_EVENT_ID_AXI_ECC:
+	case GOYA_ASYNC_EVENT_ID_L2_RAM_ECC:
+	case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET:
+	case GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT:
+		dev_err(hdev->dev,
+			"Received H/W interrupt %d, reset the chip\n",
+			event_type);
+		break;
+
+	case GOYA_ASYNC_EVENT_ID_PCIE_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC0_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC1_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC2_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC3_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC4_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC5_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC6_DEC:
+	case GOYA_ASYNC_EVENT_ID_TPC7_DEC:
+	case GOYA_ASYNC_EVENT_ID_MME_WACS:
+	case GOYA_ASYNC_EVENT_ID_MME_WACSD:
+	case GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER:
+	case GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC:
+	case GOYA_ASYNC_EVENT_ID_PSOC:
+	case GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR:
+	case GOYA_ASYNC_EVENT_ID_TPC0_CMDQ ... GOYA_ASYNC_EVENT_ID_TPC7_QM:
+	case GOYA_ASYNC_EVENT_ID_MME_QM:
+	case GOYA_ASYNC_EVENT_ID_MME_CMDQ:
+	case GOYA_ASYNC_EVENT_ID_DMA0_QM ... GOYA_ASYNC_EVENT_ID_DMA4_QM:
+	case GOYA_ASYNC_EVENT_ID_DMA0_CH ... GOYA_ASYNC_EVENT_ID_DMA4_CH:
+		goya_print_irq_info(hdev, event_type);
+		goya_unmask_irq(hdev, event_type);
+		break;
+
+	case GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH0:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH1:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH2:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH3:
+	case GOYA_ASYNC_EVENT_ID_DMA_BM_CH4:
+		dev_info(hdev->dev, "Received H/W interrupt %d\n", event_type);
+		break;
+
+	default:
+		dev_err(hdev->dev, "Received invalid H/W interrupt %d\n",
+				event_type);
+		break;
+	}
+}
+
+void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	*size = (u32) sizeof(goya->events_stat);
+
+	return goya->events_stat;
+}
+
 
 static void goya_hw_queues_lock(struct hl_device *hdev)
 {
@@ -2803,6 +3242,7 @@ static const struct hl_asic_funcs goya_funcs = {
 	.sw_fini = goya_sw_fini,
 	.hw_init = goya_hw_init,
 	.hw_fini = goya_hw_fini,
+	.halt_engines = goya_halt_engines,
 	.suspend = goya_suspend,
 	.resume = goya_resume,
 	.mmap = goya_mmap,
@@ -2817,6 +3257,9 @@ static const struct hl_asic_funcs goya_funcs = {
 	.dma_pool_free = goya_dma_pool_free,
 	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
 	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
+	.update_eq_ci = goya_update_eq_ci,
+	.handle_eqe = goya_handle_eqe,
+	.get_events_stat = goya_get_events_stat,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
 	.send_cpu_message = goya_send_cpu_message
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 7badde68aef0..58f1e316f8d9 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -155,6 +155,7 @@ struct goya_device {
 	/* TODO: remove hw_queues_lock after moving to scheduler code */
 	spinlock_t	hw_queues_lock;
 	u64		ddr_bar_cur_addr;
+	u32		events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
 	u32		hw_cap_initialized;
 };
 
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 12e4ee6eb45e..5152b6d3bb4a 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -86,6 +86,7 @@ struct hw_queue_properties {
  * @cfg_size: configuration space size on SRAM.
  * @sram_size: total size of SRAM.
  * @max_asid: maximum number of open contexts (ASIDs).
+ * @num_of_events: number of possible internal H/W IRQs.
  * @completion_queues_count: number of completion queues.
  * @high_pll: high PLL frequency used by the device.
  * @cb_pool_cb_cnt: number of CBs in the CB pool.
@@ -112,6 +113,7 @@ struct asic_fixed_properties {
 	u32			cfg_size;
 	u32			sram_size;
 	u32			max_asid;
+	u32			num_of_events;
 	u32			high_pll;
 	u32			cb_pool_cb_cnt;
 	u32			cb_pool_cb_size;
@@ -204,6 +206,9 @@ struct hl_cs_job;
 #define HL_CQ_LENGTH			HL_QUEUE_LENGTH
 #define HL_CQ_SIZE_IN_BYTES		(HL_CQ_LENGTH * HL_CQ_ENTRY_SIZE)
 
+/* Must be power of 2 (HL_PAGE_SIZE / HL_EQ_ENTRY_SIZE) */
+#define HL_EQ_LENGTH			64
+#define HL_EQ_SIZE_IN_BYTES		(HL_EQ_LENGTH * HL_EQ_ENTRY_SIZE)
 
 
 /**
@@ -251,6 +256,20 @@ struct hl_cq {
 	atomic_t		free_slots_cnt;
 };
 
+/**
+ * struct hl_eq - describes the event queue (single one per device)
+ * @hdev: pointer to the device structure
+ * @kernel_address: holds the queue's kernel virtual address
+ * @bus_address: holds the queue's DMA address
+ * @ci: ci inside the queue
+ */
+struct hl_eq {
+	struct hl_device	*hdev;
+	u64			kernel_address;
+	dma_addr_t		bus_address;
+	u32			ci;
+};
+
 
 /*
  * ASICs
@@ -277,6 +296,9 @@ enum hl_asic_type {
  * @sw_fini: tears down driver state, does not configure H/W.
  * @hw_init: sets up the H/W state.
  * @hw_fini: tears down the H/W state.
+ * @halt_engines: halt engines, needed for reset sequence. This also disables
+ *                interrupts from the device. Should be called before
+ *                hw_fini and before CS rollback.
  * @suspend: handles IP specific H/W or SW changes for suspend.
  * @resume: handles IP specific H/W or SW changes for resume.
  * @mmap: mmap function, does nothing.
@@ -298,6 +320,9 @@ enum hl_asic_type {
  * @dma_pool_free: free small DMA allocation from pool.
  * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
  * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
+ * @update_eq_ci: update event queue CI.
+ * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
+ * @get_events_stat: retrieve event queue entries histogram.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
  * @send_cpu_message: send buffer to ArmCP.
@@ -309,6 +334,7 @@ struct hl_asic_funcs {
 	int (*sw_fini)(struct hl_device *hdev);
 	int (*hw_init)(struct hl_device *hdev);
 	void (*hw_fini)(struct hl_device *hdev, bool hard_reset);
+	void (*halt_engines)(struct hl_device *hdev, bool hard_reset);
 	int (*suspend)(struct hl_device *hdev);
 	int (*resume)(struct hl_device *hdev);
 	int (*mmap)(struct hl_fpriv *hpriv, struct vm_area_struct *vma);
@@ -331,6 +357,10 @@ struct hl_asic_funcs {
 				size_t size, dma_addr_t *dma_handle);
 	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
 				size_t size, void *vaddr);
+	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
+	void (*handle_eqe)(struct hl_device *hdev,
+				struct hl_eq_entry *eq_entry);
+	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
@@ -370,8 +400,6 @@ struct hl_ctx_mgr {
 };
 
 
-
-
 /**
  * struct hl_cs_job - command submission job.
  * @finish_work: workqueue object to run when job is completed.
@@ -461,6 +489,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @kernel_ctx: KMD context structure.
  * @kernel_queues: array of hl_hw_queue.
  * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
+ * @event_queue: event queue for IRQ from ArmCP.
  * @dma_pool: DMA pool for small allocations.
  * @cpu_accessible_dma_mem: KMD <-> ArmCP shared memory CPU address.
  * @cpu_accessible_dma_address: KMD <-> ArmCP shared memory DMA address.
@@ -495,9 +524,11 @@ struct hl_device {
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
 	struct workqueue_struct		*cq_wq;
+	struct workqueue_struct		*eq_wq;
 	struct hl_ctx			*kernel_ctx;
 	struct hl_hw_queue		*kernel_queues;
 	struct hl_cb_mgr		kernel_cb_mgr;
+	struct hl_eq			event_queue;
 	struct dma_pool			*dma_pool;
 	void				*cpu_accessible_dma_mem;
 	dma_addr_t			cpu_accessible_dma_address;
@@ -579,6 +610,10 @@ void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
 
 int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
 void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
+int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
+void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
+irqreturn_t hl_irq_handler_cq(int irq, void *arg);
+irqreturn_t hl_irq_handler_eq(int irq, void *arg);
 int hl_asid_init(struct hl_device *hdev);
 void hl_asid_fini(struct hl_device *hdev);
 unsigned long hl_asid_alloc(struct hl_device *hdev);
diff --git a/drivers/misc/habanalabs/include/armcp_if.h b/drivers/misc/habanalabs/include/armcp_if.h
index cc37003aa6b7..9dddb917e72c 100644
--- a/drivers/misc/habanalabs/include/armcp_if.h
+++ b/drivers/misc/habanalabs/include/armcp_if.h
@@ -10,6 +10,30 @@
 
 #include <linux/types.h>
 
+/*
+ * EVENT QUEUE
+ */
+
+struct hl_eq_header {
+	__le32 reserved;
+	__le32 ctl;
+};
+
+struct hl_eq_entry {
+	struct hl_eq_header hdr;
+	__le64 data[7];
+};
+
+#define HL_EQ_ENTRY_SIZE		sizeof(struct hl_eq_entry)
+
+#define EQ_CTL_READY_SHIFT		31
+#define EQ_CTL_READY_MASK		0x80000000
+
+#define EQ_CTL_EVENT_TYPE_SHIFT		16
+#define EQ_CTL_EVENT_TYPE_MASK		0x03FF0000
+
+#define EVENT_QUEUE_MSIX_IDX		5
+
 enum pq_init_status {
 	PQ_INIT_STATUS_NA = 0,
 	PQ_INIT_STATUS_READY_FOR_CP,
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
index acf9a5a55476..5a7cb85ab50a 100644
--- a/drivers/misc/habanalabs/irq.c
+++ b/drivers/misc/habanalabs/irq.c
@@ -9,6 +9,18 @@
 
 #include <linux/dma-mapping.h>
 
+/**
+ * This structure is used to schedule work of EQ entry and armcp_reset event
+ *
+ * @eq_work          - workqueue object to run when EQ entry is received
+ * @hdev             - pointer to device structure
+ * @eq_entry         - copy of the EQ entry
+ */
+struct hl_eqe_work {
+	struct work_struct	eq_work;
+	struct hl_device	*hdev;
+	struct hl_eq_entry	eq_entry;
+};
 
 /*
  * hl_cq_inc_ptr - increment ci or pi of cq
@@ -26,6 +38,33 @@ inline u32 hl_cq_inc_ptr(u32 ptr)
 	return ptr;
 }
 
+/*
+ * hl_eq_inc_ptr - increment ci of eq
+ *
+ * @ptr: the current ci value of the event queue
+ *
+ * Increment ptr by 1. If it reaches the number of event queue
+ * entries, set it to 0
+ */
+inline u32 hl_eq_inc_ptr(u32 ptr)
+{
+	ptr++;
+	if (unlikely(ptr == HL_EQ_LENGTH))
+		ptr = 0;
+	return ptr;
+}
+
+static void irq_handle_eqe(struct work_struct *work)
+{
+	struct hl_eqe_work *eqe_work = container_of(work, struct hl_eqe_work,
+							eq_work);
+	struct hl_device *hdev = eqe_work->hdev;
+
+	hdev->asic_funcs->handle_eqe(hdev, &eqe_work->eq_entry);
+
+	kfree(eqe_work);
+}
+
 /*
  * hl_irq_handler_cq - irq handler for completion queue
  *
@@ -103,6 +142,68 @@ irqreturn_t hl_irq_handler_cq(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
+/*
+ * hl_irq_handler_eq - irq handler for event queue
+ *
+ * @irq: irq number
+ * @arg: pointer to event queue structure
+ *
+ */
+irqreturn_t hl_irq_handler_eq(int irq, void *arg)
+{
+	struct hl_eq *eq = arg;
+	struct hl_device *hdev = eq->hdev;
+	struct hl_eq_entry *eq_entry;
+	struct hl_eq_entry *eq_base;
+	struct hl_eqe_work *handle_eqe_work;
+
+	eq_base = (struct hl_eq_entry *) eq->kernel_address;
+
+	while (1) {
+		bool entry_ready =
+				((eq_base[eq->ci].hdr.ctl & EQ_CTL_READY_MASK)
+						>> EQ_CTL_READY_SHIFT);
+
+		if (!entry_ready)
+			break;
+
+		eq_entry = &eq_base[eq->ci];
+
+		/*
+		 * Make sure we read EQ entry contents after we've
+		 * checked the ownership bit.
+		 */
+		dma_rmb();
+
+		if (hdev->disabled) {
+			dev_warn(hdev->dev,
+				"Device disabled but received IRQ %d for EQ\n",
+					irq);
+			goto skip_irq;
+		}
+
+		handle_eqe_work = kmalloc(sizeof(*handle_eqe_work), GFP_ATOMIC);
+		if (handle_eqe_work) {
+			INIT_WORK(&handle_eqe_work->eq_work, irq_handle_eqe);
+			handle_eqe_work->hdev = hdev;
+
+			memcpy(&handle_eqe_work->eq_entry, eq_entry,
+					sizeof(*eq_entry));
+
+			queue_work(hdev->eq_wq, &handle_eqe_work->eq_work);
+		}
+skip_irq:
+		/* Clear EQ entry ready bit */
+		eq_entry->hdr.ctl &= ~EQ_CTL_READY_MASK;
+
+		eq->ci = hl_eq_inc_ptr(eq->ci);
+
+		hdev->asic_funcs->update_eq_ci(hdev, eq->ci);
+	}
+
+	return IRQ_HANDLED;
+}
+
 /*
  * hl_cq_init - main initialization function for an cq object
  *
@@ -148,3 +249,46 @@ void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
 	hdev->asic_funcs->dma_free_coherent(hdev, HL_CQ_SIZE_IN_BYTES,
 			(void *) q->kernel_address, q->bus_address);
 }
+
+/*
+ * hl_eq_init - main initialization function for an event queue object
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to eq structure
+ *
+ * Allocate dma-able memory for the event queue and initialize fields
+ * Returns 0 on success
+ */
+int hl_eq_init(struct hl_device *hdev, struct hl_eq *q)
+{
+	void *p;
+
+	BUILD_BUG_ON(HL_EQ_SIZE_IN_BYTES > HL_PAGE_SIZE);
+
+	p = hdev->asic_funcs->dma_alloc_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
+				&q->bus_address, GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	q->hdev = hdev;
+	q->kernel_address = (u64) p;
+	q->ci = 0;
+
+	return 0;
+}
+
+/*
+ * hl_eq_fini - destroy event queue
+ *
+ * @hdev: pointer to device structure
+ * @q: pointer to eq structure
+ *
+ * Free the event queue memory
+ */
+void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q)
+{
+	flush_workqueue(hdev->eq_wq);
+
+	hdev->asic_funcs->dma_free_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
+			(void *) q->kernel_address, q->bus_address);
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 09/15] habanalabs: add sysfs and hwmon support
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (6 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 08/15] habanalabs: add event queue and interrupts Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 10/15] habanalabs: add device reset support Oded Gabbay
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch add the sysfs and hwmon entries that are exposed by the driver.

Goya has several sensors, from various categories such as temperature,
voltage, current, etc. The driver exposes those sensors in the standard
hwmon mechanism.

In addition, the driver exposes a couple of interfaces in sysfs, both for
configuration and for providing status of the device or driver.

The configuration attributes is for Power Management:
- Automatic or manual
- Frequency value when moving to high frequency mode
- Maximum power the device is allowed to consume

The rest of the attributes are read-only and provide the following
information:
- Versions of the various firmwares running on the device
- Contents of the device's EEPROM
- The device type (currently only Goya is supported)
- PCI address of the device (to allow user-space to connect between
  /dev/hlX to PCI address)
- Status of the device (operational, malfunction, in_reset)
- How many processes are open on the device's file

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Use sprintf instead of snprintf in sysfs
  - Use device attribute groups to manage sysfs attributes
 
 .../ABI/testing/sysfs-driver-habanalabs       | 190 +++++++
 drivers/misc/habanalabs/Makefile              |   2 +-
 drivers/misc/habanalabs/device.c              | 146 ++++++
 drivers/misc/habanalabs/goya/Makefile         |   2 +-
 drivers/misc/habanalabs/goya/goya.c           | 229 +++++++++
 drivers/misc/habanalabs/goya/goyaP.h          |  21 +
 drivers/misc/habanalabs/goya/goya_hwmgr.c     | 254 ++++++++++
 drivers/misc/habanalabs/habanalabs.h          | 100 ++++
 drivers/misc/habanalabs/habanalabs_drv.c      |   7 +
 drivers/misc/habanalabs/hwmon.c               | 449 +++++++++++++++++
 drivers/misc/habanalabs/sysfs.c               | 469 ++++++++++++++++++
 11 files changed, 1867 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-driver-habanalabs
 create mode 100644 drivers/misc/habanalabs/goya/goya_hwmgr.c
 create mode 100644 drivers/misc/habanalabs/hwmon.c
 create mode 100644 drivers/misc/habanalabs/sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
new file mode 100644
index 000000000000..78b2bcf316a3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
@@ -0,0 +1,190 @@
+What:           /sys/class/habanalabs/hl<n>/armcp_kernel_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Linux kernel running on the device's CPU
+
+What:           /sys/class/habanalabs/hl<n>/armcp_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the application running on the device's CPU
+
+What:           /sys/class/habanalabs/hl<n>/cpld_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Device's CPLD F/W
+
+What:           /sys/class/habanalabs/hl<n>/device_type
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the code name of the device according to its type.
+                The supported values are: "GOYA"
+
+What:           /sys/class/habanalabs/hl<n>/eeprom
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    A binary file attribute that contains the contents of the
+                on-board EEPROM
+
+What:           /sys/class/habanalabs/hl<n>/fuse_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the device's version from the eFuse
+
+What:           /sys/class/habanalabs/hl<n>/hard_reset
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Interface to trigger a hard-reset operation for the device.
+                Hard-reset will reset ALL internal components of the device
+                except for the PCI interface and the internal PLLs
+
+What:           /sys/class/habanalabs/hl<n>/hard_reset_cnt
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays how many times the device have undergone a hard-reset
+                operation since the driver was loaded
+
+What:           /sys/class/habanalabs/hl<n>/high_pll
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency for MME, TPC
+                and IC when the power management profile is set to "automatic".
+
+What:           /sys/class/habanalabs/hl<n>/ic_clk
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency of the
+                Interconnect fabric. Writes to this parameter affect the device
+                only when the power management profile is set to "manual" mode.
+                The device IC clock might be set to lower value then the
+                maximum. The user should read the ic_clk_curr to see the actual
+                frequency value of the IC
+
+What:           /sys/class/habanalabs/hl<n>/ic_clk_curr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the current clock frequency of the Interconnect fabric
+
+What:           /sys/class/habanalabs/hl<n>/infineon_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Device's power supply F/W code
+
+What:           /sys/class/habanalabs/hl<n>/max_power
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum power consumption of the
+                device in milliwatts.
+
+What:           /sys/class/habanalabs/hl<n>/mme_clk
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency of the
+                MME compute engine. Writes to this parameter affect the device
+                only when the power management profile is set to "manual" mode.
+                The device MME clock might be set to lower value then the
+                maximum. The user should read the mme_clk_curr to see the actual
+                frequency value of the MME
+
+What:           /sys/class/habanalabs/hl<n>/mme_clk_curr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the current clock frequency of the MME compute engine
+
+What:           /sys/class/habanalabs/hl<n>/pci_addr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the PCI address of the device. This is needed so the
+                user would be able to open a device based on its PCI address
+
+What:           /sys/class/habanalabs/hl<n>/pm_mng_profile
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Power management profile. Values are "auto", "manual". In "auto"
+                mode, the driver will set the maximum clock frequency to a high
+                value when a user-space process opens the device's file (unless
+                it was already opened by another process). The driver will set
+                the max clock frequency to a low value when there are no user
+                processes that are opened on the device's file. In "manual"
+                mode, the user sets the maximum clock frequency by writing to
+                ic_clk, mme_clk and tpc_clk
+
+
+What:           /sys/class/habanalabs/hl<n>/preboot_btl_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the device's preboot F/W code
+
+What:           /sys/class/habanalabs/hl<n>/soft_reset
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Interface to trigger a soft-reset operation for the device.
+                Soft-reset will reset only the compute and DMA engines of the
+                device
+
+What:           /sys/class/habanalabs/hl<n>/soft_reset_cnt
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays how many times the device have undergone a soft-reset
+                operation since the driver was loaded
+
+What:           /sys/class/habanalabs/hl<n>/status
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Status of the card: "Operational", "Malfunction", "In reset".
+
+What:           /sys/class/habanalabs/hl<n>/thermal_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the Device's thermal daemon
+
+What:           /sys/class/habanalabs/hl<n>/tpc_clk
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the user to set the maximum clock frequency of the
+                TPC compute engines. Writes to this parameter affect the device
+                only when the power management profile is set to "manual" mode.
+                The device TPC clock might be set to lower value then the
+                maximum. The user should read the tpc_clk_curr to see the actual
+                frequency value of the TPC
+
+What:           /sys/class/habanalabs/hl<n>/tpc_clk_curr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the current clock frequency of the TPC compute engines
+
+What:           /sys/class/habanalabs/hl<n>/uboot_ver
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Version of the u-boot running on the device's CPU
+
+What:           /sys/class/habanalabs/hl<n>/write_open_cnt
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the total number of user processes that are currently
+                opened on the device's file
diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index c07f3ccb57dc..b5607233d216 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,7 +5,7 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
-		command_buffer.o hw_queue.o irq.o
+		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index e4c939899042..5cb54cbd8502 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -231,6 +231,118 @@ static void device_early_fini(struct hl_device *hdev)
 	mutex_destroy(&hdev->fd_open_cnt_lock);
 }
 
+static void set_freq_to_low_job(struct work_struct *work)
+{
+	struct hl_device *hdev = container_of(work, struct hl_device,
+						work_freq.work);
+
+	if (atomic_read(&hdev->fd_open_cnt) == 0)
+		hl_device_set_frequency(hdev, PLL_LOW);
+
+	schedule_delayed_work(&hdev->work_freq,
+			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
+}
+
+/*
+ * device_late_init - do late stuff initialization for the habanalabs device
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ * Do stuff that either needs the device H/W queues to be active or needs
+ * to happen after all the rest of the initialization is finished
+ */
+static int device_late_init(struct hl_device *hdev)
+{
+	int rc;
+
+	INIT_DELAYED_WORK(&hdev->work_freq, set_freq_to_low_job);
+	hdev->high_pll = hdev->asic_prop.high_pll;
+
+	/* force setting to low frequency */
+	atomic_set(&hdev->curr_pll_profile, PLL_LOW);
+
+	if (hdev->pm_mng_profile == PM_AUTO)
+		hdev->asic_funcs->set_pll_profile(hdev, PLL_LOW);
+	else
+		hdev->asic_funcs->set_pll_profile(hdev, PLL_LAST);
+
+	if (hdev->asic_funcs->late_init) {
+		rc = hdev->asic_funcs->late_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed late initialization for the H/W\n");
+			return rc;
+		}
+	}
+
+	schedule_delayed_work(&hdev->work_freq,
+			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
+
+	hdev->late_init_done = true;
+
+	return 0;
+}
+
+/*
+ * device_late_fini - finalize all that was done in device_late_init
+ *
+ * @hdev: pointer to habanalabs device structure
+ *
+ */
+static void device_late_fini(struct hl_device *hdev)
+{
+	if (!hdev->late_init_done)
+		return;
+
+	cancel_delayed_work_sync(&hdev->work_freq);
+
+	if (hdev->asic_funcs->late_fini)
+		hdev->asic_funcs->late_fini(hdev);
+
+	hdev->late_init_done = false;
+}
+
+/*
+ * hl_device_set_frequency - set the frequency of the device
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @freq: the new frequency value
+ *
+ * Change the frequency if needed.
+ * We allose to set PLL to low only if there is no user process
+ * Returns 0 if no change was done, otherwise returns 1;
+ */
+int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq)
+{
+	enum hl_pll_frequency old_freq =
+			(freq == PLL_HIGH) ? PLL_LOW : PLL_HIGH;
+	int ret;
+
+	if (hdev->pm_mng_profile == PM_MANUAL)
+		return 0;
+
+	ret = atomic_cmpxchg(&hdev->curr_pll_profile, old_freq, freq);
+	if (ret == freq)
+		return 0;
+
+	/*
+	 * in case we want to lower frequency, check if device is not
+	 * opened. We must have a check here to workaround race condition with
+	 * hl_device_open
+	 */
+	if ((freq == PLL_LOW) && (atomic_read(&hdev->fd_open_cnt) > 0)) {
+		atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
+		return 0;
+	}
+
+	dev_dbg(hdev->dev, "Changing device frequency to %s\n",
+		freq == PLL_HIGH ? "high" : "low");
+
+	hdev->asic_funcs->set_pll_profile(hdev, freq);
+
+	return 1;
+}
+
 /*
  * hl_device_suspend - initiate device suspend
  *
@@ -391,6 +503,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto release_ctx;
 	}
 
+	rc = hl_sysfs_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize sysfs\n");
+		goto free_cb_pool;
+	}
+
 	rc = hdev->asic_funcs->hw_init(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize the H/W\n");
@@ -408,11 +526,33 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto out_disabled;
 	}
 
+	/* After test_queues, KMD can start sending messages to device CPU */
+
+	rc = device_late_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed late initialization\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
+	dev_info(hdev->dev, "Found %s device with %lluGB DRAM\n",
+		hdev->asic_name,
+		hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
+
+	rc = hl_hwmon_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize hwmon\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
 	return 0;
 
+free_cb_pool:
+	hl_cb_pool_fini(hdev);
 release_ctx:
 	if (hl_ctx_put(hdev->kernel_ctx) != 1)
 		dev_err(hdev->dev,
@@ -462,6 +602,12 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
+	hl_hwmon_fini(hdev);
+
+	device_late_fini(hdev);
+
+	hl_sysfs_fini(hdev);
+
 	/*
 	 * Halt the engines and disable interrupts so we won't get any more
 	 * completions from H/W and we won't have any accesses from the
diff --git a/drivers/misc/habanalabs/goya/Makefile b/drivers/misc/habanalabs/goya/Makefile
index 7ae221696de9..e458e5ba500b 100644
--- a/drivers/misc/habanalabs/goya/Makefile
+++ b/drivers/misc/habanalabs/goya/Makefile
@@ -1,3 +1,3 @@
 subdir-ccflags-y += -I$(src)
 
-HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o
+HL_GOYA_FILES :=  goya/goya.o goya/goya_security.o goya/goya_hwmgr.o
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index d7206cbb812e..6e0a815744c3 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -127,6 +127,8 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 
 #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
 
+static int goya_armcp_info_get(struct hl_device *hdev);
+
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
@@ -174,6 +176,7 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->num_of_events = GOYA_ASYNC_EVENT_ID_SIZE;
 	prop->cb_pool_cb_cnt = GOYA_CB_POOL_CB_CNT;
 	prop->cb_pool_cb_size = GOYA_CB_POOL_CB_SIZE;
+	prop->max_power_default = MAX_POWER_DEFAULT;
 	prop->tpc_enabled_mask = TPC_ENABLED_MASK;
 
 	prop->high_pll = PLL_HIGH_DEFAULT;
@@ -564,6 +567,89 @@ int goya_early_fini(struct hl_device *hdev)
 	return 0;
 }
 
+/*
+ * goya_fetch_psoc_frequency - Fetch PSOC frequency values
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ */
+static void goya_fetch_psoc_frequency(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+
+	prop->psoc_pci_pll_nr = RREG32(mmPSOC_PCI_PLL_NR);
+	prop->psoc_pci_pll_nf = RREG32(mmPSOC_PCI_PLL_NF);
+	prop->psoc_pci_pll_od = RREG32(mmPSOC_PCI_PLL_OD);
+	prop->psoc_pci_pll_div_factor = RREG32(mmPSOC_PCI_PLL_DIV_FACTOR_1);
+}
+
+/*
+ * goya_late_init - GOYA late initialization code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Get ArmCP info and send message to CPU to enable PCI access
+ */
+static int goya_late_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+
+	rc = goya->armcp_info_get(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to get armcp info\n");
+		return rc;
+	}
+
+	/* Now that we have the DRAM size in ASIC prop, we need to check
+	 * its size and configure the DMA_IF DDR wrap protection (which is in
+	 * the MMU block) accordingly. The value is the log2 of the DRAM size
+	 */
+	WREG32(mmMMU_LOG2_DDR_SIZE, ilog2(prop->dram_size));
+
+	rc = goya_send_pci_access_msg(hdev, ARMCP_PACKET_ENABLE_PCI_ACCESS);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to enable PCI access from CPU\n");
+		return rc;
+	}
+
+	WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+			GOYA_ASYNC_EVENT_ID_INTS_REGISTER);
+
+	goya_fetch_psoc_frequency(hdev);
+
+	return 0;
+}
+
+/*
+ * goya_late_fini - GOYA late tear-down code
+ *
+ * @hdev: pointer to hl_device structure
+ *
+ * Free sensors allocated structures
+ */
+void goya_late_fini(struct hl_device *hdev)
+{
+	const struct hwmon_channel_info **channel_info_arr;
+	int i = 0;
+
+	if (!hdev->hl_chip_info.info)
+		return;
+
+	channel_info_arr = hdev->hl_chip_info.info;
+
+	while (channel_info_arr[i]) {
+		kfree(channel_info_arr[i]->config);
+		kfree(channel_info_arr[i]);
+		i++;
+	}
+
+	kfree(channel_info_arr);
+
+	hdev->hl_chip_info.info = NULL;
+}
+
 /*
  * goya_sw_init - Goya software initialization code
  *
@@ -581,9 +667,15 @@ static int goya_sw_init(struct hl_device *hdev)
 		return -ENOMEM;
 
 	goya->test_cpu_queue = goya_test_cpu_queue;
+	goya->armcp_info_get = goya_armcp_info_get;
 
 	/* according to goya_init_iatu */
 	goya->ddr_bar_cur_addr = DRAM_PHYS_BASE;
+
+	goya->mme_clk = GOYA_PLL_FREQ_LOW;
+	goya->tpc_clk = GOYA_PLL_FREQ_LOW;
+	goya->ic_clk = GOYA_PLL_FREQ_LOW;
+
 	hdev->asic_specific = goya;
 
 	/* Create DMA pool for small allocations */
@@ -3220,6 +3312,87 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
 	return goya->events_stat;
 }
 
+static int goya_armcp_info_get(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct armcp_packet pkt;
+	void *armcp_info_cpu_addr;
+	dma_addr_t armcp_info_dma_addr;
+	u64 dram_size;
+	long result;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
+		return 0;
+
+	armcp_info_cpu_addr =
+			hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
+			sizeof(struct armcp_info), &armcp_info_dma_addr);
+	if (!armcp_info_cpu_addr) {
+		dev_err(hdev->dev,
+			"Failed to allocate DMA memory for ArmCP info packet\n");
+		return -ENOMEM;
+	}
+
+	memset(armcp_info_cpu_addr, 0, sizeof(struct armcp_info));
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_INFO_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.addr = armcp_info_dma_addr + prop->host_phys_base_address;
+	pkt.data_max_size = sizeof(struct armcp_info);
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			GOYA_ARMCP_INFO_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to send armcp info pkt, error %d\n", rc);
+		goto out;
+	}
+
+	memcpy(&prop->armcp_info, armcp_info_cpu_addr,
+			sizeof(prop->armcp_info));
+
+	dram_size = prop->armcp_info.dram_size;
+	if (dram_size) {
+		if ((!is_power_of_2(dram_size)) ||
+				(dram_size < DRAM_PHYS_DEFAULT_SIZE)) {
+			dev_err(hdev->dev,
+				"F/W reported invalid DRAM size %llu. Trying to use default size\n",
+				dram_size);
+			dram_size = DRAM_PHYS_DEFAULT_SIZE;
+		}
+
+		prop->dram_size = dram_size;
+		prop->dram_end_address = prop->dram_base_address + dram_size;
+	}
+
+	rc = hl_build_hwmon_channel_info(hdev, prop->armcp_info.sensors);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to build hwmon channel info, error %d\n", rc);
+		rc = -EFAULT;
+		goto out;
+	}
+
+out:
+	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev,
+			sizeof(struct armcp_info), armcp_info_cpu_addr);
+
+	return rc;
+}
+
+static void goya_init_clock_gating(struct hl_device *hdev)
+{
+
+}
+
+static void goya_disable_clock_gating(struct hl_device *hdev)
+{
+
+}
 
 static void goya_hw_queues_lock(struct hl_device *hdev)
 {
@@ -3235,9 +3408,60 @@ static void goya_hw_queues_unlock(struct hl_device *hdev)
 	spin_unlock(&goya->hw_queues_lock);
 }
 
+int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct armcp_packet pkt;
+	void *eeprom_info_cpu_addr;
+	dma_addr_t eeprom_info_dma_addr;
+	long result;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
+		return 0;
+
+	eeprom_info_cpu_addr =
+			hdev->asic_funcs->cpu_accessible_dma_pool_alloc(hdev,
+					max_size, &eeprom_info_dma_addr);
+	if (!eeprom_info_cpu_addr) {
+		dev_err(hdev->dev,
+			"Failed to allocate DMA memory for EEPROM info packet\n");
+		return -ENOMEM;
+	}
+
+	memset(eeprom_info_cpu_addr, 0, max_size);
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_EEPROM_DATA_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.addr = eeprom_info_dma_addr + prop->host_phys_base_address;
+	pkt.data_max_size = max_size;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			GOYA_ARMCP_EEPROM_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to send armcp EEPROM pkt, error %d\n", rc);
+		goto out;
+	}
+
+	/* result contains the actual size */
+	memcpy(data, eeprom_info_cpu_addr, min((size_t)result, max_size));
+
+out:
+	hdev->asic_funcs->cpu_accessible_dma_pool_free(hdev, max_size,
+			eeprom_info_cpu_addr);
+
+	return rc;
+}
+
 static const struct hl_asic_funcs goya_funcs = {
 	.early_init = goya_early_init,
 	.early_fini = goya_early_fini,
+	.late_init = goya_late_init,
+	.late_fini = goya_late_fini,
 	.sw_init = goya_sw_init,
 	.sw_fini = goya_sw_fini,
 	.hw_init = goya_hw_init,
@@ -3258,10 +3482,15 @@ static const struct hl_asic_funcs goya_funcs = {
 	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
 	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
 	.update_eq_ci = goya_update_eq_ci,
+	.add_device_attr = goya_add_device_attr,
 	.handle_eqe = goya_handle_eqe,
+	.set_pll_profile = goya_set_pll_profile,
 	.get_events_stat = goya_get_events_stat,
+	.enable_clock_gating = goya_init_clock_gating,
+	.disable_clock_gating = goya_disable_clock_gating,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
+	.get_eeprom_data = goya_get_eeprom_data,
 	.send_cpu_message = goya_send_cpu_message
 };
 
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index 58f1e316f8d9..ba8596b07403 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -50,7 +50,10 @@
 
 #define PLL_HIGH_DEFAULT		1575000000	/* 1.575 GHz */
 
+#define MAX_POWER_DEFAULT		200000		/* 200W */
+
 #define GOYA_ARMCP_INFO_TIMEOUT		10000000	/* 10s */
+#define GOYA_ARMCP_EEPROM_TIMEOUT	10000000	/* 10s */
 
 #define DRAM_PHYS_DEFAULT_SIZE		0x100000000ull	/* 4GB */
 
@@ -151,9 +154,15 @@ enum goya_fw_component {
 
 struct goya_device {
 	int (*test_cpu_queue)(struct hl_device *hdev);
+	int (*armcp_info_get)(struct hl_device *hdev);
 
 	/* TODO: remove hw_queues_lock after moving to scheduler code */
 	spinlock_t	hw_queues_lock;
+
+	u64		mme_clk;
+	u64		tpc_clk;
+	u64		ic_clk;
+
 	u64		ddr_bar_cur_addr;
 	u32		events_stat[GOYA_ASYNC_EVENT_ID_SIZE];
 	u32		hw_cap_initialized;
@@ -162,6 +171,18 @@ struct goya_device {
 int goya_test_cpu_queue(struct hl_device *hdev);
 int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 				u32 timeout, long *result);
+long goya_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
+long goya_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
+void goya_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
+			long value);
+void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq);
+void goya_add_device_attr(struct hl_device *hdev,
+			struct attribute_group *dev_attr_grp);
 void goya_init_security(struct hl_device *hdev);
+u64 goya_get_max_power(struct hl_device *hdev);
+void goya_set_max_power(struct hl_device *hdev, u64 value);
 
 #endif /* GOYAP_H_ */
diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
new file mode 100644
index 000000000000..b524799c62b7
--- /dev/null
+++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "goyaP.h"
+
+void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	switch (freq) {
+	case PLL_HIGH:
+		hl_set_frequency(hdev, MME_PLL, hdev->high_pll);
+		hl_set_frequency(hdev, TPC_PLL, hdev->high_pll);
+		hl_set_frequency(hdev, IC_PLL, hdev->high_pll);
+		break;
+	case PLL_LOW:
+		hl_set_frequency(hdev, MME_PLL, GOYA_PLL_FREQ_LOW);
+		hl_set_frequency(hdev, TPC_PLL, GOYA_PLL_FREQ_LOW);
+		hl_set_frequency(hdev, IC_PLL, GOYA_PLL_FREQ_LOW);
+		break;
+	case PLL_LAST:
+		hl_set_frequency(hdev, MME_PLL, goya->mme_clk);
+		hl_set_frequency(hdev, TPC_PLL, goya->tpc_clk);
+		hl_set_frequency(hdev, IC_PLL, goya->ic_clk);
+		break;
+	default:
+		dev_err(hdev->dev, "unknown frequency setting\n");
+	}
+}
+
+static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, MME_PLL, false);
+
+	if (value < 0)
+		return value;
+
+	return sprintf(buf, "%lu\n", value);
+}
+
+static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+	long value;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto fail;
+	}
+
+	if (hdev->pm_mng_profile == PM_AUTO) {
+		count = -EPERM;
+		goto fail;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto fail;
+	}
+
+	hl_set_frequency(hdev, MME_PLL, value);
+	goya->mme_clk = value;
+
+fail:
+	return count;
+}
+
+static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, TPC_PLL, false);
+
+	if (value < 0)
+		return value;
+
+	return sprintf(buf, "%lu\n", value);
+}
+
+static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+	long value;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto fail;
+	}
+
+	if (hdev->pm_mng_profile == PM_AUTO) {
+		count = -EPERM;
+		goto fail;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto fail;
+	}
+
+	hl_set_frequency(hdev, TPC_PLL, value);
+	goya->tpc_clk = value;
+
+fail:
+	return count;
+}
+
+static ssize_t ic_clk_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, IC_PLL, false);
+
+	if (value < 0)
+		return value;
+
+	return sprintf(buf, "%lu\n", value);
+}
+
+static ssize_t ic_clk_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	struct goya_device *goya = hdev->asic_specific;
+	int rc;
+	long value;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto fail;
+	}
+
+	if (hdev->pm_mng_profile == PM_AUTO) {
+		count = -EPERM;
+		goto fail;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto fail;
+	}
+
+	hl_set_frequency(hdev, IC_PLL, value);
+	goya->ic_clk = value;
+
+fail:
+	return count;
+}
+
+static ssize_t mme_clk_curr_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, MME_PLL, true);
+
+	if (value < 0)
+		return value;
+
+	return sprintf(buf, "%lu\n", value);
+}
+
+static ssize_t tpc_clk_curr_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, TPC_PLL, true);
+
+	if (value < 0)
+		return value;
+
+	return sprintf(buf, "%lu\n", value);
+}
+
+static ssize_t ic_clk_curr_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	value = hl_get_frequency(hdev, IC_PLL, true);
+
+	if (value < 0)
+		return value;
+
+	return sprintf(buf, "%lu\n", value);
+}
+
+static DEVICE_ATTR_RW(ic_clk);
+static DEVICE_ATTR_RO(ic_clk_curr);
+static DEVICE_ATTR_RW(mme_clk);
+static DEVICE_ATTR_RO(mme_clk_curr);
+static DEVICE_ATTR_RW(tpc_clk);
+static DEVICE_ATTR_RO(tpc_clk_curr);
+
+static struct attribute *goya_dev_attrs[] = {
+	&dev_attr_ic_clk.attr,
+	&dev_attr_ic_clk_curr.attr,
+	&dev_attr_mme_clk.attr,
+	&dev_attr_mme_clk_curr.attr,
+	&dev_attr_tpc_clk.attr,
+	&dev_attr_tpc_clk_curr.attr,
+	NULL,
+};
+
+void goya_add_device_attr(struct hl_device *hdev,
+			struct attribute_group *dev_attr_grp)
+{
+	dev_attr_grp->attrs = goya_dev_attrs;
+}
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 5152b6d3bb4a..e260586f3bdb 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -28,6 +28,8 @@
 
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
 
+#define HL_PLL_LOW_JOB_FREQ_USEC	5000000 /* 5 s */
+
 #define HL_MAX_QUEUES			128
 
 struct hl_device;
@@ -63,6 +65,8 @@ struct hw_queue_properties {
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
  * @hw_queues_props: H/W queues properties.
+ * @armcp_info: received various information from ArmCP regarding the H/W. e.g.
+ *		available sensors.
  * @uboot_ver: F/W U-boot version.
  * @preboot_ver: F/W Preboot version.
  * @sram_base_address: SRAM physical start address.
@@ -75,6 +79,7 @@ struct hw_queue_properties {
  * @dram_pci_bar_size: size of PCI bar towards DRAM.
  * @host_phys_base_address: base physical address of host memory for
  *				transactions that the device generates.
+ * @max_power_default: max power of the device after reset
  * @va_space_host_start_address: base address of virtual memory range for
  *                               mapping host memory.
  * @va_space_host_end_address: end address of virtual memory range for
@@ -87,6 +92,10 @@ struct hw_queue_properties {
  * @sram_size: total size of SRAM.
  * @max_asid: maximum number of open contexts (ASIDs).
  * @num_of_events: number of possible internal H/W IRQs.
+ * @psoc_pci_pll_nr: PCI PLL NR value.
+ * @psoc_pci_pll_nf: PCI PLL NF value.
+ * @psoc_pci_pll_od: PCI PLL OD value.
+ * @psoc_pci_pll_div_factor: PCI PLL DIV FACTOR 1 value.
  * @completion_queues_count: number of completion queues.
  * @high_pll: high PLL frequency used by the device.
  * @cb_pool_cb_cnt: number of CBs in the CB pool.
@@ -95,6 +104,7 @@ struct hw_queue_properties {
  */
 struct asic_fixed_properties {
 	struct hw_queue_properties	hw_queues_props[HL_MAX_QUEUES];
+	struct armcp_info	armcp_info;
 	char			uboot_ver[VERSION_MAX_LEN];
 	char			preboot_ver[VERSION_MAX_LEN];
 	u64			sram_base_address;
@@ -106,6 +116,7 @@ struct asic_fixed_properties {
 	u64			dram_size;
 	u64			dram_pci_bar_size;
 	u64			host_phys_base_address;
+	u64			max_power_default;
 	u64			va_space_host_start_address;
 	u64			va_space_host_end_address;
 	u64			va_space_dram_start_address;
@@ -114,6 +125,10 @@ struct asic_fixed_properties {
 	u32			sram_size;
 	u32			max_asid;
 	u32			num_of_events;
+	u32			psoc_pci_pll_nr;
+	u32			psoc_pci_pll_nf;
+	u32			psoc_pci_pll_od;
+	u32			psoc_pci_pll_div_factor;
 	u32			high_pll;
 	u32			cb_pool_cb_cnt;
 	u32			cb_pool_cb_size;
@@ -287,11 +302,37 @@ enum hl_asic_type {
 	ASIC_INVALID
 };
 
+/**
+ * enum hl_pm_mng_profile - power management profile.
+ * @PM_AUTO: internal clock is set by KMD.
+ * @PM_MANUAL: internal clock is set by the user.
+ * @PM_LAST: last power management type.
+ */
+enum hl_pm_mng_profile {
+	PM_AUTO = 1,
+	PM_MANUAL,
+	PM_LAST
+};
+
+/**
+ * enum hl_pll_frequency - PLL frequency.
+ * @PLL_HIGH: high frequency.
+ * @PLL_LOW: low frequency.
+ * @PLL_LAST: last frequency values that were configured by the user.
+ */
+enum hl_pll_frequency {
+	PLL_HIGH = 1,
+	PLL_LOW,
+	PLL_LAST
+};
+
 /**
  * struct hl_asic_funcs - ASIC specific functions that are can be called from
  *                        common code.
  * @early_init: sets up early driver state (pre sw_init), doesn't configure H/W.
  * @early_fini: tears down what was done in early_init.
+ * @late_init: sets up late driver/hw state (post hw_init) - Optional.
+ * @late_fini: tears down what was done in late_init (pre hw_fini) - Optional.
  * @sw_init: sets up driver state, does not configure H/W.
  * @sw_fini: tears down driver state, does not configure H/W.
  * @hw_init: sets up the H/W state.
@@ -321,15 +362,22 @@ enum hl_asic_type {
  * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
  * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
  * @update_eq_ci: update event queue CI.
+ * @add_device_attr: add ASIC specific device attributes.
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
+ * @set_pll_profile: change PLL profile (manual/automatic).
  * @get_events_stat: retrieve event queue entries histogram.
+ * @enable_clock_gating: enable clock gating for reducing power consumption.
+ * @disable_clock_gating: disable clock for accessing registers on HBW.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
+ * @get_eeprom_data: retrieve EEPROM data from F/W.
  * @send_cpu_message: send buffer to ArmCP.
  */
 struct hl_asic_funcs {
 	int (*early_init)(struct hl_device *hdev);
 	int (*early_fini)(struct hl_device *hdev);
+	int (*late_init)(struct hl_device *hdev);
+	void (*late_fini)(struct hl_device *hdev);
 	int (*sw_init)(struct hl_device *hdev);
 	int (*sw_fini)(struct hl_device *hdev);
 	int (*hw_init)(struct hl_device *hdev);
@@ -358,11 +406,19 @@ struct hl_asic_funcs {
 	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
 				size_t size, void *vaddr);
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
+	void (*add_device_attr)(struct hl_device *hdev,
+				struct attribute_group *dev_attr_grp);
 	void (*handle_eqe)(struct hl_device *hdev,
 				struct hl_eq_entry *eq_entry);
+	void (*set_pll_profile)(struct hl_device *hdev,
+			enum hl_pll_frequency freq);
 	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
+	void (*enable_clock_gating)(struct hl_device *hdev);
+	void (*disable_clock_gating)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
+	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
+				size_t max_size);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
 				u16 len, u32 timeout, long *result);
 };
@@ -409,6 +465,8 @@ struct hl_cs_job {
 	struct work_struct	finish_work;
 	u32			id;
 };
+
+
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -481,6 +539,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @rmmio: configuration area address on SRAM.
  * @cdev: related char device.
  * @dev: realted kernel basic device structure.
+ * @work_freq: delayed work to lower device frequency if possible.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @completion_queue: array of hl_cq.
@@ -506,13 +565,23 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @hwmon_dev: H/W monitor device.
+ * @pm_mng_profile: current power management profile.
+ * @hl_chip_info: ASIC's sensors information.
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
+ * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open user processes.
+ * @max_power: the max power of the device, as configured by the sysadmin. This
+ *             value is saved so in case of hard-reset, KMD will restore this
+ *             value and update the F/W after the re-initialization
  * @major: habanalabs KMD major.
+ * @high_pll: high PLL profile frequency.
  * @id: device minor.
  * @disabled: is device disabled.
+ * @late_init_done: is late init stage was done during initialization.
+ * @hwmon_initialized: is H/W monitor sensors was initialized.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -520,6 +589,7 @@ struct hl_device {
 	void __iomem			*rmmio;
 	struct cdev			cdev;
 	struct device			*dev;
+	struct delayed_work		work_freq;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
@@ -541,16 +611,25 @@ struct hl_device {
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+	struct device			*hwmon_dev;
+	enum hl_pm_mng_profile		pm_mng_profile;
+	struct hwmon_chip_info		hl_chip_info;
 
 	struct list_head		cb_pool;
 	spinlock_t			cb_pool_lock;
 
 	/* TODO: remove user_ctx for multiple process support */
 	struct hl_ctx			*user_ctx;
+
+	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
+	u64				max_power;
 	u32				major;
+	u32				high_pll;
 	u16				id;
 	u8				disabled;
+	u8				late_init_done;
+	u8				hwmon_initialized;
 
 	/* Parameters for bring-up */
 	u8				cpu_enable;
@@ -631,6 +710,15 @@ int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
+int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
+int hl_build_hwmon_channel_info(struct hl_device *hdev,
+		struct armcp_sensor *sensors_arr);
+
+int hl_sysfs_init(struct hl_device *hdev);
+void hl_sysfs_fini(struct hl_device *hdev);
+
+int hl_hwmon_init(struct hl_device *hdev);
+void hl_hwmon_fini(struct hl_device *hdev);
 
 int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr, u32 cb_size,
 		u64 *handle, int ctx_id);
@@ -647,6 +735,18 @@ int hl_cb_pool_fini(struct hl_device *hdev);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
+void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
+long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
+long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
+void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
+			long value);
+u64 hl_get_max_power(struct hl_device *hdev);
+void hl_set_max_power(struct hl_device *hdev, u64 value);
+
 /* IOCTLs */
 long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 14775a7022f0..6260d99ef1ca 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -131,6 +131,13 @@ int hl_device_open(struct inode *inode, struct file *filp)
 
 	hpriv->taskpid = find_get_pid(current->pid);
 
+	/*
+	 * Device is IDLE at this point so it is legal to change PLLs. There
+	 * is no need to check anything because if the PLL is already HIGH, the
+	 * set function will return without doing anything
+	 */
+	hl_device_set_frequency(hdev, PLL_HIGH);
+
 	return 0;
 
 out_err:
diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
new file mode 100644
index 000000000000..4d60b36dfa21
--- /dev/null
+++ b/drivers/misc/habanalabs/hwmon.c
@@ -0,0 +1,449 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#define SENSORS_PKT_TIMEOUT		100000	/* 100ms */
+#define HWMON_NR_SENSOR_TYPES		(hwmon_pwm + 1)
+
+int hl_build_hwmon_channel_info(struct hl_device *hdev,
+				struct armcp_sensor *sensors_arr)
+{
+	u32 counts[HWMON_NR_SENSOR_TYPES] = {0};
+	u32 *sensors_by_type[HWMON_NR_SENSOR_TYPES] = {0};
+	u32 sensors_by_type_next_index[HWMON_NR_SENSOR_TYPES] = {0};
+	struct hwmon_channel_info **channels_info;
+	u32 num_sensors_for_type, num_active_sensor_types = 0,
+			arr_size = 0, *curr_arr;
+	enum hwmon_sensor_types type;
+	int rc, i, j;
+
+	for (i = 0 ; i < ARMCP_MAX_SENSORS ; i++) {
+		type = sensors_arr[i].type;
+
+		if ((type == 0) && (sensors_arr[i].flags == 0))
+			break;
+
+		if (type >= HWMON_NR_SENSOR_TYPES) {
+			dev_err(hdev->dev,
+				"Got wrong sensor type %d from device\n", type);
+			return -EINVAL;
+		}
+
+		counts[type]++;
+		arr_size++;
+	}
+
+	for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
+		if (counts[i] == 0)
+			continue;
+
+		num_sensors_for_type = counts[i] + 1;
+		curr_arr = kcalloc(num_sensors_for_type, sizeof(*curr_arr),
+				GFP_KERNEL);
+		if (!curr_arr) {
+			rc = -ENOMEM;
+			goto sensors_type_err;
+		}
+
+		num_active_sensor_types++;
+		sensors_by_type[i] = curr_arr;
+	}
+
+	for (i = 0 ; i < arr_size ; i++) {
+		type = sensors_arr[i].type;
+		curr_arr = sensors_by_type[type];
+		curr_arr[sensors_by_type_next_index[type]++] =
+				sensors_arr[i].flags;
+	}
+
+	channels_info = kcalloc(num_active_sensor_types + 1,
+			sizeof(*channels_info), GFP_KERNEL);
+	if (!channels_info) {
+		rc = -ENOMEM;
+		goto channels_info_array_err;
+	}
+
+	for (i = 0 ; i < num_active_sensor_types ; i++) {
+		channels_info[i] = kzalloc(sizeof(*channels_info[i]),
+				GFP_KERNEL);
+		if (!channels_info[i]) {
+			rc = -ENOMEM;
+			goto channel_info_err;
+		}
+	}
+
+	for (i = 0, j = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++) {
+		if (!sensors_by_type[i])
+			continue;
+
+		channels_info[j]->type = i;
+		channels_info[j]->config = sensors_by_type[i];
+		j++;
+	}
+
+	hdev->hl_chip_info.info =
+			(const struct hwmon_channel_info **)channels_info;
+
+	return 0;
+
+channel_info_err:
+	for (i = 0 ; i < num_active_sensor_types ; i++)
+		if (channels_info[i]) {
+			kfree(channels_info[i]->config);
+			kfree(channels_info[i]);
+		}
+	kfree(channels_info);
+channels_info_array_err:
+sensors_type_err:
+	for (i = 0 ; i < HWMON_NR_SENSOR_TYPES ; i++)
+		kfree(sensors_by_type[i]);
+
+	return rc;
+}
+
+static int hl_read(struct device *dev, enum hwmon_sensor_types type,
+			u32 attr, int channel, long *val)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	switch (type) {
+	case hwmon_temp:
+		switch (attr) {
+		case hwmon_temp_input:
+		case hwmon_temp_max:
+		case hwmon_temp_crit:
+		case hwmon_temp_max_hyst:
+		case hwmon_temp_crit_hyst:
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		*val = hl_get_temperature(hdev, channel, attr);
+		break;
+	case hwmon_in:
+		switch (attr) {
+		case hwmon_in_input:
+		case hwmon_in_min:
+		case hwmon_in_max:
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		*val = hl_get_voltage(hdev, channel, attr);
+		break;
+	case hwmon_curr:
+		switch (attr) {
+		case hwmon_curr_input:
+		case hwmon_curr_min:
+		case hwmon_curr_max:
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		*val = hl_get_current(hdev, channel, attr);
+		break;
+	case hwmon_fan:
+		switch (attr) {
+		case hwmon_fan_input:
+		case hwmon_fan_min:
+		case hwmon_fan_max:
+			break;
+		default:
+			return -EINVAL;
+		}
+		*val = hl_get_fan_speed(hdev, channel, attr);
+		break;
+	case hwmon_pwm:
+		switch (attr) {
+		case hwmon_pwm_input:
+		case hwmon_pwm_enable:
+			break;
+		default:
+			return -EINVAL;
+		}
+		*val = hl_get_pwm_info(hdev, channel, attr);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int hl_write(struct device *dev, enum hwmon_sensor_types type,
+			u32 attr, int channel, long val)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	switch (type) {
+	case hwmon_pwm:
+		switch (attr) {
+		case hwmon_pwm_input:
+		case hwmon_pwm_enable:
+			break;
+		default:
+			return -EINVAL;
+		}
+		hl_set_pwm_info(hdev, channel, attr, val);
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static umode_t hl_is_visible(const void *data, enum hwmon_sensor_types type,
+				u32 attr, int channel)
+{
+	switch (type) {
+	case hwmon_temp:
+		switch (attr) {
+		case hwmon_temp_input:
+		case hwmon_temp_max:
+		case hwmon_temp_max_hyst:
+		case hwmon_temp_crit:
+		case hwmon_temp_crit_hyst:
+			return 0444;
+		}
+		break;
+	case hwmon_in:
+		switch (attr) {
+		case hwmon_in_input:
+		case hwmon_in_min:
+		case hwmon_in_max:
+			return 0444;
+		}
+		break;
+	case hwmon_curr:
+		switch (attr) {
+		case hwmon_curr_input:
+		case hwmon_curr_min:
+		case hwmon_curr_max:
+			return 0444;
+		}
+		break;
+	case hwmon_fan:
+		switch (attr) {
+		case hwmon_fan_input:
+		case hwmon_fan_min:
+		case hwmon_fan_max:
+			return 0444;
+		}
+		break;
+	case hwmon_pwm:
+		switch (attr) {
+		case hwmon_pwm_input:
+		case hwmon_pwm_enable:
+			return 0644;
+		}
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+
+static const struct hwmon_ops hl_hwmon_ops = {
+	.is_visible = hl_is_visible,
+	.read = hl_read,
+	.write = hl_write
+};
+
+long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_TEMPERATURE_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+			SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get temperature from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_voltage(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_VOLTAGE_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get voltage from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_current(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_CURRENT_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get current from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_FAN_SPEED_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get fan speed from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+long hl_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_PWM_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get pwm info from sensor %d, error %d\n",
+			sensor_index, rc);
+		result = 0;
+	}
+
+	return result;
+}
+
+void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
+			long value)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_PWM_SET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.sensor_index = sensor_index;
+	pkt.type = attr;
+	pkt.value = value;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SENSORS_PKT_TIMEOUT, NULL);
+
+	if (rc)
+		dev_err(hdev->dev,
+			"Failed to set pwm info to sensor %d, error %d\n",
+			sensor_index, rc);
+}
+
+int hl_hwmon_init(struct hl_device *hdev)
+{
+	struct device *dev = hdev->pdev ? &hdev->pdev->dev : hdev->dev;
+	int rc;
+
+	if ((hdev->hwmon_initialized) || !(hdev->fw_loading))
+		return 0;
+
+	if (hdev->hl_chip_info.info) {
+		hdev->hl_chip_info.ops = &hl_hwmon_ops;
+
+		hdev->hwmon_dev = hwmon_device_register_with_info(dev,
+				"habanalabs", hdev, &hdev->hl_chip_info, NULL);
+		if (IS_ERR(hdev->hwmon_dev)) {
+			rc = PTR_ERR(hdev->hwmon_dev);
+			dev_err(hdev->dev,
+				"Unable to register hwmon device: %d\n", rc);
+			return rc;
+		}
+
+		dev_info(hdev->dev, "%s: add sensors information\n",
+			dev_name(hdev->hwmon_dev));
+
+		hdev->hwmon_initialized = true;
+	} else {
+		dev_info(hdev->dev, "no available sensors\n");
+	}
+
+	return 0;
+}
+
+void hl_hwmon_fini(struct hl_device *hdev)
+{
+	if (!hdev->hwmon_initialized)
+		return;
+
+	hwmon_device_unregister(hdev->hwmon_dev);
+}
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
new file mode 100644
index 000000000000..fe865b2e8bab
--- /dev/null
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -0,0 +1,469 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/hwmon-sysfs.h>
+#include <linux/hwmon.h>
+
+#define SET_CLK_PKT_TIMEOUT	200000	/* 200ms */
+#define SET_PWR_PKT_TIMEOUT	400000	/* 400ms */
+
+long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	if (curr)
+		pkt.ctl = ARMCP_PACKET_FREQUENCY_CURR_GET <<
+						ARMCP_PKT_CTL_OPCODE_SHIFT;
+	else
+		pkt.ctl = ARMCP_PACKET_FREQUENCY_GET <<
+						ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.pll_index = pll_index;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+						SET_CLK_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to get frequency of PLL %d, error %d\n",
+			pll_index, rc);
+		result = rc;
+	}
+
+	return result;
+}
+
+void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_FREQUENCY_SET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.pll_index = pll_index;
+	pkt.value = freq;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SET_CLK_PKT_TIMEOUT, NULL);
+
+	if (rc)
+		dev_err(hdev->dev,
+			"Failed to set frequency to PLL %d, error %d\n",
+			pll_index, rc);
+}
+
+u64 hl_get_max_power(struct hl_device *hdev)
+{
+	struct armcp_packet pkt;
+	long result;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_MAX_POWER_GET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+						SET_PWR_PKT_TIMEOUT, &result);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to get max power, error %d\n", rc);
+		result = rc;
+	}
+
+	return result;
+}
+
+void hl_set_max_power(struct hl_device *hdev, u64 value)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_MAX_POWER_SET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.value = value;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					SET_PWR_PKT_TIMEOUT, NULL);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to set max power, error %d\n", rc);
+}
+
+static ssize_t pm_mng_profile_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	return sprintf(buf, "%s\n",
+			(hdev->pm_mng_profile == PM_AUTO) ? "auto" :
+			(hdev->pm_mng_profile == PM_MANUAL) ? "manual" :
+			"unknown");
+}
+
+static ssize_t pm_mng_profile_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto out;
+	}
+
+	mutex_lock(&hdev->fd_open_cnt_lock);
+
+	if (atomic_read(&hdev->fd_open_cnt) > 0) {
+		dev_err(hdev->dev,
+			"Can't change PM profile while user process is opened on the device\n");
+		count = -EPERM;
+		goto unlock_mutex;
+	}
+
+	if (strncmp("auto", buf, strlen("auto")) == 0) {
+		/* Make sure we are in LOW PLL when changing modes */
+		if (hdev->pm_mng_profile == PM_MANUAL) {
+			atomic_set(&hdev->curr_pll_profile, PLL_HIGH);
+			hl_device_set_frequency(hdev, PLL_LOW);
+			hdev->pm_mng_profile = PM_AUTO;
+		}
+	} else if (strncmp("manual", buf, strlen("manual")) == 0) {
+		/* Make sure we are in LOW PLL when changing modes */
+		if (hdev->pm_mng_profile == PM_AUTO) {
+			flush_delayed_work(&hdev->work_freq);
+			hdev->pm_mng_profile = PM_MANUAL;
+		}
+	} else {
+		dev_err(hdev->dev, "value should be auto or manual\n");
+		count = -EINVAL;
+		goto unlock_mutex;
+	}
+
+unlock_mutex:
+	mutex_unlock(&hdev->fd_open_cnt_lock);
+out:
+	return count;
+}
+
+static ssize_t high_pll_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	return sprintf(buf, "%u\n", hdev->high_pll);
+}
+
+static ssize_t high_pll_store(struct device *dev, struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+	int rc;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto out;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hdev->high_pll = value;
+
+out:
+	return count;
+}
+
+static ssize_t uboot_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s\n", hdev->asic_prop.uboot_ver);
+}
+
+static ssize_t armcp_kernel_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s", hdev->asic_prop.armcp_info.kernel_version);
+}
+
+static ssize_t armcp_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s\n", hdev->asic_prop.armcp_info.armcp_version);
+}
+
+static ssize_t cpld_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "0x%08x\n",
+			hdev->asic_prop.armcp_info.cpld_version);
+}
+
+static ssize_t infineon_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "0x%04x\n",
+			hdev->asic_prop.armcp_info.infineon_version);
+}
+
+static ssize_t fuse_ver_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s\n", hdev->asic_prop.armcp_info.fuse_version);
+}
+
+static ssize_t thermal_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s", hdev->asic_prop.armcp_info.thermal_version);
+}
+
+static ssize_t preboot_btl_ver_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%s\n", hdev->asic_prop.preboot_ver);
+}
+
+static ssize_t device_type_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	char *str;
+
+	switch (hdev->asic_type) {
+	case ASIC_GOYA:
+		str = "GOYA";
+		break;
+	default:
+		dev_err(hdev->dev, "Unrecognized ASIC type %d\n",
+				hdev->asic_type);
+		return -EINVAL;
+	}
+
+	return sprintf(buf, "%s\n", str);
+}
+
+static ssize_t pci_addr_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	/* Use dummy, fixed address for simulator */
+	if (!hdev->pdev)
+		return sprintf(buf, "0000:%02d:00.0\n", hdev->id);
+
+	return sprintf(buf, "%04x:%02x:%02x.%x\n",
+			pci_domain_nr(hdev->pdev->bus),
+			hdev->pdev->bus->number,
+			PCI_SLOT(hdev->pdev->devfn),
+			PCI_FUNC(hdev->pdev->devfn));
+}
+
+static ssize_t status_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	char *str;
+
+	if (hdev->disabled)
+		str = "Malfunction";
+	else
+		str = "Operational";
+
+	return sprintf(buf, "%s\n", str);
+}
+
+static ssize_t write_open_cnt_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%d\n", hdev->user_ctx ? 1 : 0);
+}
+
+static ssize_t max_power_show(struct device *dev, struct device_attribute *attr,
+				char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long val;
+
+	if (hdev->disabled)
+		return -ENODEV;
+
+	val = hl_get_max_power(hdev);
+
+	return sprintf(buf, "%lu\n", val);
+}
+
+static ssize_t max_power_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	unsigned long value;
+	int rc;
+
+	if (hdev->disabled) {
+		count = -ENODEV;
+		goto out;
+	}
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hdev->max_power = value;
+	hl_set_max_power(hdev, value);
+
+out:
+	return count;
+}
+
+static ssize_t eeprom_read_handler(struct file *filp, struct kobject *kobj,
+			struct bin_attribute *attr, char *buf, loff_t offset,
+			size_t max_size)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	char *data;
+	int rc;
+
+	if (!max_size)
+		return -EINVAL;
+
+	data = kzalloc(max_size, GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	rc = hdev->asic_funcs->get_eeprom_data(hdev, data, max_size);
+	if (rc)
+		goto out;
+
+	memcpy(buf, data, max_size);
+
+out:
+	kfree(data);
+
+	return max_size;
+}
+
+static DEVICE_ATTR_RO(armcp_kernel_ver);
+static DEVICE_ATTR_RO(armcp_ver);
+static DEVICE_ATTR_RO(cpld_ver);
+static DEVICE_ATTR_RO(device_type);
+static DEVICE_ATTR_RO(fuse_ver);
+static DEVICE_ATTR_RW(high_pll);
+static DEVICE_ATTR_RO(infineon_ver);
+static DEVICE_ATTR_RW(max_power);
+static DEVICE_ATTR_RO(pci_addr);
+static DEVICE_ATTR_RW(pm_mng_profile);
+static DEVICE_ATTR_RO(preboot_btl_ver);
+static DEVICE_ATTR_RO(status);
+static DEVICE_ATTR_RO(thermal_ver);
+static DEVICE_ATTR_RO(uboot_ver);
+static DEVICE_ATTR_RO(write_open_cnt);
+
+static struct bin_attribute bin_attr_eeprom = {
+	.attr = {.name = "eeprom", .mode = (0444)},
+	.size = PAGE_SIZE,
+	.read = eeprom_read_handler
+};
+
+static struct attribute *hl_dev_attrs[] = {
+	&dev_attr_armcp_kernel_ver.attr,
+	&dev_attr_armcp_ver.attr,
+	&dev_attr_cpld_ver.attr,
+	&dev_attr_device_type.attr,
+	&dev_attr_fuse_ver.attr,
+	&dev_attr_high_pll.attr,
+	&dev_attr_infineon_ver.attr,
+	&dev_attr_max_power.attr,
+	&dev_attr_pci_addr.attr,
+	&dev_attr_pm_mng_profile.attr,
+	&dev_attr_preboot_btl_ver.attr,
+	&dev_attr_status.attr,
+	&dev_attr_thermal_ver.attr,
+	&dev_attr_uboot_ver.attr,
+	&dev_attr_write_open_cnt.attr,
+	NULL,
+};
+
+static struct bin_attribute *hl_dev_bin_attrs[] = {
+	&bin_attr_eeprom,
+	NULL
+};
+
+static struct attribute_group hl_dev_attr_group = {
+	.attrs = hl_dev_attrs,
+	.bin_attrs = hl_dev_bin_attrs,
+};
+
+static struct attribute_group hl_dev_clks_attr_group;
+
+static const struct attribute_group *hl_dev_attr_groups[] = {
+	&hl_dev_attr_group,
+	&hl_dev_clks_attr_group,
+	NULL,
+};
+
+int hl_sysfs_init(struct hl_device *hdev)
+{
+	int rc;
+
+	hdev->pm_mng_profile = PM_AUTO;
+	hdev->max_power = hdev->asic_prop.max_power_default;
+
+	hdev->asic_funcs->add_device_attr(hdev, &hl_dev_clks_attr_group);
+
+	rc = device_add_groups(hdev->dev, hl_dev_attr_groups);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to add groups to device, error %d\n", rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+void hl_sysfs_fini(struct hl_device *hdev)
+{
+	device_remove_groups(hdev->dev, hl_dev_attr_groups);
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 10/15] habanalabs: add device reset support
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (7 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 09/15] habanalabs: add sysfs and hwmon support Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 11/15] habanalabs: add command submission module Oded Gabbay
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds support for doing various on-the-fly reset of Goya.

The driver supports two types of resets:
1. soft-reset
2. hard-reset

Soft-reset is done when the device detects a timeout of a command
submission that was given to the device. The soft-reset process only resets
the engines that are relevant for the submission of compute jobs, i.e. the
DMA channels, the TPCs and the MME. The purpose is to bring the device as
fast as possible to a working state.

Hard-reset is done in several cases:
1. After soft-reset is done but the device is not responding
2. When fatal errors occur inside the device, e.g. ECC error
3. When the driver is removed

Hard-reset performs a reset of the entire chip except for the PCI
controller and the PLLs. It is a much longer process then soft-reset but it
helps to recover the device without the need to reboot the Host.

After hard-reset, the driver will restore the max power attribute and in
case of manual power management, the frequencies that were set.

This patch also adds two entries to the sysfs, which allows the root user
to initiate a soft or hard reset.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Use sprintf instead of snprintf in sysfs entries
  - Detect whether H/W state of device is dirty during initialization. If so,
    reset device before initialization
 
 drivers/misc/habanalabs/command_buffer.c   |   9 +-
 drivers/misc/habanalabs/device.c           | 323 ++++++++++++++++++++-
 drivers/misc/habanalabs/goya/goya.c        | 226 +++++++++++++-
 drivers/misc/habanalabs/goya/goya_hwmgr.c  |  18 +-
 drivers/misc/habanalabs/habanalabs.h       |  52 ++++
 drivers/misc/habanalabs/habanalabs_drv.c   |   9 +-
 drivers/misc/habanalabs/habanalabs_ioctl.c |   6 +
 drivers/misc/habanalabs/hwmon.c            |   4 +-
 drivers/misc/habanalabs/irq.c              |  31 ++
 drivers/misc/habanalabs/sysfs.c            |  82 +++++-
 10 files changed, 736 insertions(+), 24 deletions(-)

diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
index cd19f00f85f2..bec951d42c98 100644
--- a/drivers/misc/habanalabs/command_buffer.c
+++ b/drivers/misc/habanalabs/command_buffer.c
@@ -89,9 +89,14 @@ int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
 	bool alloc_new_cb = true;
 	int rc;
 
-	if (hdev->disabled) {
+	/*
+	 * Can't use generic function to check this because of special case
+	 * where we create a CB as part of the reset process
+	 */
+	if ((hdev->disabled) || ((atomic_read(&hdev->in_reset)) &&
+					(ctx_id != HL_KERNEL_ASID_ID))) {
 		dev_warn_ratelimited(hdev->dev,
-			"Device is disabled. Can't create new CBs\n");
+			"Device is disabled or in reset. Can't create new CBs\n");
 		rc = -EBUSY;
 		goto out_err;
 	}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 5cb54cbd8502..c244be37bc55 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -11,6 +11,14 @@
 #include <linux/kthread.h>
 #include <linux/sched/signal.h>
 
+bool hl_device_disabled_or_in_reset(struct hl_device *hdev)
+{
+	if ((hdev->disabled) || (atomic_read(&hdev->in_reset)))
+		return true;
+	else
+		return false;
+}
+
 static void hpriv_release(struct kref *ref)
 {
 	struct hl_fpriv *hpriv;
@@ -193,6 +201,7 @@ static int device_early_init(struct hl_device *hdev)
 
 	mutex_init(&hdev->fd_open_cnt_lock);
 	mutex_init(&hdev->send_cpu_message_lock);
+	atomic_set(&hdev->in_reset, 0);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
 	return 0;
@@ -243,6 +252,27 @@ static void set_freq_to_low_job(struct work_struct *work)
 			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
 }
 
+static void hl_device_heartbeat(struct work_struct *work)
+{
+	struct hl_device *hdev = container_of(work, struct hl_device,
+						work_heartbeat.work);
+
+	if (hl_device_disabled_or_in_reset(hdev))
+		goto reschedule;
+
+	if (!hdev->asic_funcs->send_heartbeat(hdev))
+		goto reschedule;
+
+	dev_err(hdev->dev, "Device heartbeat failed!\n");
+	hl_device_reset(hdev, true, false);
+
+	return;
+
+reschedule:
+	schedule_delayed_work(&hdev->work_heartbeat,
+			usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
+}
+
 /*
  * device_late_init - do late stuff initialization for the habanalabs device
  *
@@ -278,6 +308,12 @@ static int device_late_init(struct hl_device *hdev)
 	schedule_delayed_work(&hdev->work_freq,
 			usecs_to_jiffies(HL_PLL_LOW_JOB_FREQ_USEC));
 
+	if (hdev->heartbeat) {
+		INIT_DELAYED_WORK(&hdev->work_heartbeat, hl_device_heartbeat);
+		schedule_delayed_work(&hdev->work_heartbeat,
+				usecs_to_jiffies(HL_HEARTBEAT_PER_USEC));
+	}
+
 	hdev->late_init_done = true;
 
 	return 0;
@@ -295,6 +331,8 @@ static void device_late_fini(struct hl_device *hdev)
 		return;
 
 	cancel_delayed_work_sync(&hdev->work_freq);
+	if (hdev->heartbeat)
+		cancel_delayed_work_sync(&hdev->work_heartbeat);
 
 	if (hdev->asic_funcs->late_fini)
 		hdev->asic_funcs->late_fini(hdev);
@@ -402,6 +440,260 @@ int hl_device_resume(struct hl_device *hdev)
 	return 0;
 }
 
+static void hl_device_hard_reset_pending(struct work_struct *work)
+{
+	struct hl_device_reset_work *device_reset_work =
+		container_of(work, struct hl_device_reset_work, reset_work);
+	struct hl_device *hdev = device_reset_work->hdev;
+	u16 pending_cnt = HL_PENDING_RESET_PER_SEC;
+	struct task_struct *task = NULL;
+
+	/* Flush all processes that are inside hl_open */
+	mutex_lock(&hdev->fd_open_cnt_lock);
+
+	while ((atomic_read(&hdev->fd_open_cnt)) && (pending_cnt)) {
+
+		pending_cnt--;
+
+		dev_info(hdev->dev,
+			"Can't HARD reset, waiting for user to close FD\n");
+		ssleep(1);
+	}
+
+	if (atomic_read(&hdev->fd_open_cnt)) {
+		task = get_pid_task(hdev->user_ctx->hpriv->taskpid,
+					PIDTYPE_PID);
+		if (task) {
+			dev_info(hdev->dev, "Killing user processes\n");
+			send_sig(SIGKILL, task, 1);
+			msleep(100);
+
+			put_task_struct(task);
+		}
+	}
+
+	mutex_unlock(&hdev->fd_open_cnt_lock);
+
+	hl_device_reset(hdev, true, true);
+
+	kfree(device_reset_work);
+}
+
+/*
+ * hl_device_reset - reset the device
+ *
+ * @hdev: pointer to habanalabs device structure
+ * @hard_reset: should we do hard reset to all engines or just reset the
+ *              compute/dma engines
+ *
+ * Block future CS and wait for pending CS to be enqueued
+ * Call ASIC H/W fini
+ * Flush all completions
+ * Re-initialize all internal data structures
+ * Call ASIC H/W init, late_init
+ * Test queues
+ * Enable device
+ *
+ * Returns 0 for success or an error on failure.
+ */
+int hl_device_reset(struct hl_device *hdev, bool hard_reset,
+			bool from_hard_reset_thread)
+{
+	int i, rc;
+
+	if (!hdev->init_done) {
+		dev_err(hdev->dev,
+			"Can't reset before initialization is done\n");
+		return 0;
+	}
+
+	/*
+	 * Prevent concurrency in this function - only one reset should be
+	 * done at any given time. Only need to perform this if we didn't
+	 * get from the dedicated hard reset thread
+	 */
+	if (!from_hard_reset_thread) {
+		/* Block future CS/VM/JOB completion operations */
+		rc = atomic_cmpxchg(&hdev->in_reset, 0, 1);
+		if (rc)
+			return 0;
+
+		/* This also blocks future CS/VM/JOB completion operations */
+		hdev->disabled = true;
+
+		/*
+		 * Flush anyone that is inside the critical section of enqueue
+		 * jobs to the H/W
+		 */
+		hdev->asic_funcs->hw_queues_lock(hdev);
+		hdev->asic_funcs->hw_queues_unlock(hdev);
+
+		dev_err(hdev->dev, "Going to RESET device!\n");
+	}
+
+again:
+	if ((hard_reset) && (!from_hard_reset_thread)) {
+		struct hl_device_reset_work *device_reset_work;
+
+		if (!hdev->pdev) {
+			dev_err(hdev->dev,
+				"Reset action is NOT supported in simulator\n");
+			rc = -EINVAL;
+			goto out_err;
+		}
+
+		hdev->hard_reset_pending = true;
+
+		device_reset_work = kzalloc(sizeof(*device_reset_work),
+						GFP_ATOMIC);
+		if (!device_reset_work) {
+			rc = -ENOMEM;
+			goto out_err;
+		}
+
+		/*
+		 * Because the reset function can't run from interrupt or
+		 * from heartbeat work, we need to call the reset function
+		 * from a dedicated work
+		 */
+		INIT_WORK(&device_reset_work->reset_work,
+				hl_device_hard_reset_pending);
+		device_reset_work->hdev = hdev;
+		schedule_work(&device_reset_work->reset_work);
+
+		return 0;
+	}
+
+	if (hard_reset) {
+		device_late_fini(hdev);
+
+		/*
+		 * Now that the heartbeat thread is closed, flush processes
+		 * which are sending messages to CPU
+		 */
+		mutex_lock(&hdev->send_cpu_message_lock);
+		mutex_unlock(&hdev->send_cpu_message_lock);
+	}
+
+	/*
+	 * Halt the engines and disable interrupts so we won't get any more
+	 * completions from H/W and we won't have any accesses from the
+	 * H/W to the host machine
+	 */
+	hdev->asic_funcs->halt_engines(hdev, hard_reset);
+
+	if (hard_reset) {
+		/* Release kernel context */
+		if (hl_ctx_put(hdev->kernel_ctx) != 1) {
+			dev_err(hdev->dev,
+				"kernel ctx is alive during hard reset\n");
+			rc = -EBUSY;
+			goto out_err;
+		}
+
+		hdev->kernel_ctx = NULL;
+	}
+
+	/* Reset the H/W. It will be in idle state after this returns */
+	hdev->asic_funcs->hw_fini(hdev, hard_reset);
+
+	if (hard_reset)
+		hl_eq_reset(hdev, &hdev->event_queue);
+
+	/* Re-initialize PI,CI to 0 in all queues (hw queue, cq) */
+	hl_hw_queue_reset(hdev, hard_reset);
+	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
+		hl_cq_reset(hdev, &hdev->completion_queue[i]);
+
+	/* Finished tear-down, starting to re-initialize */
+
+	if (hard_reset) {
+		/* Allocate the kernel context */
+		hdev->kernel_ctx = kzalloc(sizeof(*hdev->kernel_ctx),
+						GFP_KERNEL);
+		if (!hdev->kernel_ctx) {
+			rc = -ENOMEM;
+			goto out_err;
+		}
+
+		hdev->user_ctx = NULL;
+
+		rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to init kernel ctx in hard reset\n");
+			kfree(hdev->kernel_ctx);
+			hdev->kernel_ctx = NULL;
+			goto out_err;
+		}
+	}
+
+	rc = hdev->asic_funcs->hw_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"failed to initialize the H/W after reset\n");
+		goto out_err;
+	}
+
+	hdev->disabled = false;
+
+	/* Check that the communication with the device is working */
+	rc = hdev->asic_funcs->test_queues(hdev);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to detect if device is alive after reset\n");
+		goto out_err;
+	}
+
+	if (hard_reset) {
+		rc = device_late_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed late init after hard reset\n");
+			goto out_err;
+		}
+
+		hl_set_max_power(hdev, hdev->max_power);
+
+		hdev->hard_reset_pending = false;
+	} else {
+		rc = hdev->asic_funcs->soft_reset_late_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed late init after soft reset\n");
+			goto out_err;
+		}
+	}
+
+	atomic_set(&hdev->in_reset, 0);
+
+	if (hard_reset)
+		hdev->hard_reset_cnt++;
+	else
+		hdev->soft_reset_cnt++;
+
+	return 0;
+
+out_err:
+	hdev->disabled = true;
+
+	if (hard_reset) {
+		dev_err(hdev->dev,
+			"Failed to reset! Device is NOT usable\n");
+		hdev->hard_reset_cnt++;
+	} else {
+		dev_err(hdev->dev,
+			"Failed to do soft-reset, trying hard reset\n");
+		hdev->soft_reset_cnt++;
+		hard_reset = true;
+		goto again;
+	}
+
+	atomic_set(&hdev->in_reset, 0);
+
+	return rc;
+}
+
 /*
  * hl_device_init - main initialization function for habanalabs device
  *
@@ -509,6 +801,12 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto free_cb_pool;
 	}
 
+	if (hdev->asic_funcs->get_hw_state(hdev) == HL_DEVICE_HW_STATE_DIRTY) {
+		dev_info(hdev->dev,
+			"H/W state is dirty, must reset before initializing\n");
+		hdev->asic_funcs->hw_fini(hdev, true);
+	}
+
 	rc = hdev->asic_funcs->hw_init(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize the H/W\n");
@@ -549,6 +847,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 	dev_notice(hdev->dev,
 		"Successfully added device to habanalabs driver\n");
 
+	hdev->init_done = true;
+
 	return 0;
 
 free_cb_pool:
@@ -596,9 +896,30 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
  */
 void hl_device_fini(struct hl_device *hdev)
 {
-	int i;
+	int i, rc;
+	ktime_t timeout;
+
 	dev_info(hdev->dev, "Removing device\n");
 
+	/*
+	 * This function is competing with the reset function, so try to
+	 * take the reset atomic and if we are already in middle of reset,
+	 * wait until reset function is finished. Reset function is designed
+	 * to always finish (could take up to a few seconds in worst case).
+	 */
+
+	timeout = ktime_add_us(ktime_get(),
+				HL_PENDING_RESET_PER_SEC * 1000 * 1000 * 4);
+	rc = atomic_cmpxchg(&hdev->in_reset, 0, 1);
+	while (rc) {
+		usleep_range(50, 200);
+		rc = atomic_cmpxchg(&hdev->in_reset, 0, 1);
+		if (ktime_compare(ktime_get(), timeout) > 0) {
+			WARN(1, "Failed to remove device because reset function did not finish\n");
+			return;
+		}
+	};
+
 	/* Mark device as disabled */
 	hdev->disabled = true;
 
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 6e0a815744c3..1c6cce81c496 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -127,6 +127,130 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 
 #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
 
+static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
+	GOYA_ASYNC_EVENT_ID_PCIE_IF,
+	GOYA_ASYNC_EVENT_ID_TPC0_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC1_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC2_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC3_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC4_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC5_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC6_ECC,
+	GOYA_ASYNC_EVENT_ID_TPC7_ECC,
+	GOYA_ASYNC_EVENT_ID_MME_ECC,
+	GOYA_ASYNC_EVENT_ID_MME_ECC_EXT,
+	GOYA_ASYNC_EVENT_ID_MMU_ECC,
+	GOYA_ASYNC_EVENT_ID_DMA_MACRO,
+	GOYA_ASYNC_EVENT_ID_DMA_ECC,
+	GOYA_ASYNC_EVENT_ID_CPU_IF_ECC,
+	GOYA_ASYNC_EVENT_ID_PSOC_MEM,
+	GOYA_ASYNC_EVENT_ID_PSOC_CORESIGHT,
+	GOYA_ASYNC_EVENT_ID_SRAM0,
+	GOYA_ASYNC_EVENT_ID_SRAM1,
+	GOYA_ASYNC_EVENT_ID_SRAM2,
+	GOYA_ASYNC_EVENT_ID_SRAM3,
+	GOYA_ASYNC_EVENT_ID_SRAM4,
+	GOYA_ASYNC_EVENT_ID_SRAM5,
+	GOYA_ASYNC_EVENT_ID_SRAM6,
+	GOYA_ASYNC_EVENT_ID_SRAM7,
+	GOYA_ASYNC_EVENT_ID_SRAM8,
+	GOYA_ASYNC_EVENT_ID_SRAM9,
+	GOYA_ASYNC_EVENT_ID_SRAM10,
+	GOYA_ASYNC_EVENT_ID_SRAM11,
+	GOYA_ASYNC_EVENT_ID_SRAM12,
+	GOYA_ASYNC_EVENT_ID_SRAM13,
+	GOYA_ASYNC_EVENT_ID_SRAM14,
+	GOYA_ASYNC_EVENT_ID_SRAM15,
+	GOYA_ASYNC_EVENT_ID_SRAM16,
+	GOYA_ASYNC_EVENT_ID_SRAM17,
+	GOYA_ASYNC_EVENT_ID_SRAM18,
+	GOYA_ASYNC_EVENT_ID_SRAM19,
+	GOYA_ASYNC_EVENT_ID_SRAM20,
+	GOYA_ASYNC_EVENT_ID_SRAM21,
+	GOYA_ASYNC_EVENT_ID_SRAM22,
+	GOYA_ASYNC_EVENT_ID_SRAM23,
+	GOYA_ASYNC_EVENT_ID_SRAM24,
+	GOYA_ASYNC_EVENT_ID_SRAM25,
+	GOYA_ASYNC_EVENT_ID_SRAM26,
+	GOYA_ASYNC_EVENT_ID_SRAM27,
+	GOYA_ASYNC_EVENT_ID_SRAM28,
+	GOYA_ASYNC_EVENT_ID_SRAM29,
+	GOYA_ASYNC_EVENT_ID_GIC500,
+	GOYA_ASYNC_EVENT_ID_PLL0,
+	GOYA_ASYNC_EVENT_ID_PLL1,
+	GOYA_ASYNC_EVENT_ID_PLL3,
+	GOYA_ASYNC_EVENT_ID_PLL4,
+	GOYA_ASYNC_EVENT_ID_PLL5,
+	GOYA_ASYNC_EVENT_ID_PLL6,
+	GOYA_ASYNC_EVENT_ID_AXI_ECC,
+	GOYA_ASYNC_EVENT_ID_L2_RAM_ECC,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_05_SW_RESET,
+	GOYA_ASYNC_EVENT_ID_PSOC_GPIO_10_VRHOT_ICRIT,
+	GOYA_ASYNC_EVENT_ID_PCIE_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC0_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC1_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC2_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC3_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC4_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC5_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC6_DEC,
+	GOYA_ASYNC_EVENT_ID_TPC7_DEC,
+	GOYA_ASYNC_EVENT_ID_MME_WACS,
+	GOYA_ASYNC_EVENT_ID_MME_WACSD,
+	GOYA_ASYNC_EVENT_ID_CPU_AXI_SPLITTER,
+	GOYA_ASYNC_EVENT_ID_PSOC_AXI_DEC,
+	GOYA_ASYNC_EVENT_ID_PSOC,
+	GOYA_ASYNC_EVENT_ID_TPC0_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC1_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC2_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC3_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC4_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC5_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC6_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC7_KRN_ERR,
+	GOYA_ASYNC_EVENT_ID_TPC0_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC1_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC2_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC3_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC4_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC5_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC6_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC7_CMDQ,
+	GOYA_ASYNC_EVENT_ID_TPC0_QM,
+	GOYA_ASYNC_EVENT_ID_TPC1_QM,
+	GOYA_ASYNC_EVENT_ID_TPC2_QM,
+	GOYA_ASYNC_EVENT_ID_TPC3_QM,
+	GOYA_ASYNC_EVENT_ID_TPC4_QM,
+	GOYA_ASYNC_EVENT_ID_TPC5_QM,
+	GOYA_ASYNC_EVENT_ID_TPC6_QM,
+	GOYA_ASYNC_EVENT_ID_TPC7_QM,
+	GOYA_ASYNC_EVENT_ID_MME_QM,
+	GOYA_ASYNC_EVENT_ID_MME_CMDQ,
+	GOYA_ASYNC_EVENT_ID_DMA0_QM,
+	GOYA_ASYNC_EVENT_ID_DMA1_QM,
+	GOYA_ASYNC_EVENT_ID_DMA2_QM,
+	GOYA_ASYNC_EVENT_ID_DMA3_QM,
+	GOYA_ASYNC_EVENT_ID_DMA4_QM,
+	GOYA_ASYNC_EVENT_ID_DMA0_CH,
+	GOYA_ASYNC_EVENT_ID_DMA1_CH,
+	GOYA_ASYNC_EVENT_ID_DMA2_CH,
+	GOYA_ASYNC_EVENT_ID_DMA3_CH,
+	GOYA_ASYNC_EVENT_ID_DMA4_CH,
+	GOYA_ASYNC_EVENT_ID_TPC0_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC1_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC2_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC3_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC4_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC5_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC6_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_TPC7_BMON_SPMU,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH0,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH1,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH2,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH3,
+	GOYA_ASYNC_EVENT_ID_DMA_BM_CH4
+};
+
 static int goya_armcp_info_get(struct hl_device *hdev);
 
 static void goya_get_fixed_properties(struct hl_device *hdev)
@@ -2451,6 +2575,14 @@ static int goya_hw_init(struct hl_device *hdev)
 	/* Perform read from the device to make sure device is up */
 	val = RREG32(mmPCIE_DBI_DEVICE_ID_VENDOR_ID_REG);
 
+	/*
+	 * Let's mark in the H/W that we have reached this point. We check
+	 * this value in the reset_before_init function to understand whether
+	 * we need to reset the chip before doing H/W init. This register is
+	 * cleared by the H/W upon H/W reset
+	 */
+	WREG32(mmPSOC_GLOBAL_CONF_APP_STATUS, HL_DEVICE_HW_STATE_DIRTY);
+
 	rc = goya_init_cpu(hdev, GOYA_CPU_TIMEOUT_USEC);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize CPU\n");
@@ -2579,6 +2711,14 @@ static void goya_hw_fini(struct hl_device *hdev, bool hard_reset)
 			"Timeout while waiting for device to reset 0x%x\n",
 			status);
 
+	if (!hard_reset) {
+		goya->hw_cap_initialized &= ~(HW_CAP_DMA | HW_CAP_MME |
+						HW_CAP_GOLDEN | HW_CAP_TPC);
+		WREG32(mmGIC_DISTRIBUTOR__5_GICD_SETSPI_NSR,
+				GOYA_ASYNC_EVENT_ID_SOFT_RESET);
+		return;
+	}
+
 	/* Chicken bit to re-initiate boot sequencer flow */
 	WREG32(mmPSOC_GLOBAL_CONF_BOOT_SEQ_RE_START,
 		1 << PSOC_GLOBAL_CONF_BOOT_SEQ_RE_START_IND_SHIFT);
@@ -3186,6 +3326,57 @@ static void goya_print_irq_info(struct hl_device *hdev, u16 event_type)
 	}
 }
 
+static int goya_unmask_irq_arr(struct hl_device *hdev, u32 *irq_arr,
+		size_t irq_arr_size)
+{
+	struct armcp_unmask_irq_arr_packet *pkt;
+	size_t total_pkt_size;
+	long result;
+	int rc;
+
+	total_pkt_size = sizeof(struct armcp_unmask_irq_arr_packet) +
+			irq_arr_size;
+
+	/* data should be aligned to 8 bytes in order to ArmCP to copy it */
+	total_pkt_size = (total_pkt_size + 0x7) & ~0x7;
+
+	/* total_pkt_size is casted to u16 later on */
+	if (total_pkt_size > USHRT_MAX) {
+		dev_err(hdev->dev, "too many elements in IRQ array\n");
+		return -EINVAL;
+	}
+
+	pkt = kzalloc(total_pkt_size, GFP_KERNEL);
+	if (!pkt)
+		return -ENOMEM;
+
+	pkt->length = irq_arr_size / sizeof(irq_arr[0]);
+	memcpy(&pkt->irqs, irq_arr, irq_arr_size);
+
+	pkt->armcp_pkt.ctl = ARMCP_PACKET_UNMASK_RAZWI_IRQ_ARRAY <<
+						ARMCP_PKT_CTL_OPCODE_SHIFT;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) pkt,
+			total_pkt_size, HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if (rc)
+		dev_err(hdev->dev, "failed to unmask IRQ array\n");
+
+	kfree(pkt);
+
+	return rc;
+}
+
+static int goya_soft_reset_late_init(struct hl_device *hdev)
+{
+	/*
+	 * Unmask all IRQs since some could have been received
+	 * during the soft reset
+	 */
+	return goya_unmask_irq_arr(hdev, goya_non_fatal_events,
+			sizeof(goya_non_fatal_events));
+}
+
 static int goya_unmask_irq(struct hl_device *hdev, u16 event_type)
 {
 	struct armcp_packet pkt;
@@ -3247,6 +3438,7 @@ void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry)
 		dev_err(hdev->dev,
 			"Received H/W interrupt %d, reset the chip\n",
 			event_type);
+		hl_device_reset(hdev, true, false);
 		break;
 
 	case GOYA_ASYNC_EVENT_ID_PCIE_DEC:
@@ -3312,6 +3504,30 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
 	return goya->events_stat;
 }
 
+int goya_send_heartbeat(struct hl_device *hdev)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct armcp_packet hb_pkt;
+	long result;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_CPU_Q))
+		return 0;
+
+	memset(&hb_pkt, 0, sizeof(hb_pkt));
+
+	hb_pkt.ctl = ARMCP_PACKET_TEST << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	hb_pkt.value = ARMCP_PACKET_FENCE_VAL;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &hb_pkt,
+			sizeof(hb_pkt), HL_DEVICE_TIMEOUT_USEC, &result);
+
+	if ((rc) || (result != ARMCP_PACKET_FENCE_VAL))
+		rc = -EIO;
+
+	return rc;
+}
+
 static int goya_armcp_info_get(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -3457,6 +3673,11 @@ int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
 	return rc;
 }
 
+static enum hl_device_hw_state goya_get_hw_state(struct hl_device *hdev)
+{
+	return RREG32(mmPSOC_GLOBAL_CONF_APP_STATUS);
+}
+
 static const struct hl_asic_funcs goya_funcs = {
 	.early_init = goya_early_init,
 	.early_fini = goya_early_fini,
@@ -3486,12 +3707,15 @@ static const struct hl_asic_funcs goya_funcs = {
 	.handle_eqe = goya_handle_eqe,
 	.set_pll_profile = goya_set_pll_profile,
 	.get_events_stat = goya_get_events_stat,
+	.send_heartbeat = goya_send_heartbeat,
 	.enable_clock_gating = goya_init_clock_gating,
 	.disable_clock_gating = goya_disable_clock_gating,
+	.soft_reset_late_init = goya_soft_reset_late_init,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
 	.get_eeprom_data = goya_get_eeprom_data,
-	.send_cpu_message = goya_send_cpu_message
+	.send_cpu_message = goya_send_cpu_message,
+	.get_hw_state = goya_get_hw_state
 };
 
 /*
diff --git a/drivers/misc/habanalabs/goya/goya_hwmgr.c b/drivers/misc/habanalabs/goya/goya_hwmgr.c
index b524799c62b7..c885b7f9feb3 100644
--- a/drivers/misc/habanalabs/goya/goya_hwmgr.c
+++ b/drivers/misc/habanalabs/goya/goya_hwmgr.c
@@ -38,7 +38,7 @@ static ssize_t mme_clk_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, MME_PLL, false);
@@ -57,7 +57,7 @@ static ssize_t mme_clk_store(struct device *dev, struct device_attribute *attr,
 	int rc;
 	long value;
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		count = -ENODEV;
 		goto fail;
 	}
@@ -87,7 +87,7 @@ static ssize_t tpc_clk_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, TPC_PLL, false);
@@ -106,7 +106,7 @@ static ssize_t tpc_clk_store(struct device *dev, struct device_attribute *attr,
 	int rc;
 	long value;
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		count = -ENODEV;
 		goto fail;
 	}
@@ -136,7 +136,7 @@ static ssize_t ic_clk_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, IC_PLL, false);
@@ -155,7 +155,7 @@ static ssize_t ic_clk_store(struct device *dev, struct device_attribute *attr,
 	int rc;
 	long value;
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		count = -ENODEV;
 		goto fail;
 	}
@@ -185,7 +185,7 @@ static ssize_t mme_clk_curr_show(struct device *dev,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, MME_PLL, true);
@@ -202,7 +202,7 @@ static ssize_t tpc_clk_curr_show(struct device *dev,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, TPC_PLL, true);
@@ -219,7 +219,7 @@ static ssize_t ic_clk_curr_show(struct device *dev,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long value;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	value = hl_get_frequency(hdev, IC_PLL, true);
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index e260586f3bdb..4d5cba225700 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -26,8 +26,12 @@
 
 #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
 
+#define HL_PENDING_RESET_PER_SEC	5
+
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
 
+#define HL_HEARTBEAT_PER_USEC		5000000 /* 5 s */
+
 #define HL_PLL_LOW_JOB_FREQ_USEC	5000000 /* 5 s */
 
 #define HL_MAX_QUEUES			128
@@ -62,6 +66,18 @@ struct hw_queue_properties {
 	u8			kmd_only;
 };
 
+/**
+ * enum hl_device_hw_state - H/W device state. use this to understand whether
+ *                           to do reset before hw_init or not
+ * @HL_DEVICE_HW_STATE_CLEAN: H/W state is clean. i.e. after hard reset
+ * @HL_DEVICE_HW_STATE_DIRTY: H/W state is dirty. i.e. we started to execute
+ *                            hw_init
+ */
+enum hl_device_hw_state {
+	HL_DEVICE_HW_STATE_CLEAN = 0,
+	HL_DEVICE_HW_STATE_DIRTY
+};
+
 /**
  * struct asic_fixed_properties - ASIC specific immutable properties.
  * @hw_queues_props: H/W queues properties.
@@ -366,12 +382,15 @@ enum hl_pll_frequency {
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
  * @set_pll_profile: change PLL profile (manual/automatic).
  * @get_events_stat: retrieve event queue entries histogram.
+ * @send_heartbeat: send is-alive packet to ArmCP and verify response.
  * @enable_clock_gating: enable clock gating for reducing power consumption.
  * @disable_clock_gating: disable clock for accessing registers on HBW.
+ * @soft_reset_late_init: perform certain actions needed after soft reset.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
  * @get_eeprom_data: retrieve EEPROM data from F/W.
  * @send_cpu_message: send buffer to ArmCP.
+ * @get_hw_state: retrieve the H/W state
  */
 struct hl_asic_funcs {
 	int (*early_init)(struct hl_device *hdev);
@@ -413,14 +432,17 @@ struct hl_asic_funcs {
 	void (*set_pll_profile)(struct hl_device *hdev,
 			enum hl_pll_frequency freq);
 	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
+	int (*send_heartbeat)(struct hl_device *hdev);
 	void (*enable_clock_gating)(struct hl_device *hdev);
 	void (*disable_clock_gating)(struct hl_device *hdev);
+	int (*soft_reset_late_init)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
 	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
 				size_t max_size);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
 				u16 len, u32 timeout, long *result);
+	enum hl_device_hw_state (*get_hw_state)(struct hl_device *hdev);
 };
 
 
@@ -532,6 +554,16 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
 	WREG32(mm##reg, (RREG32(mm##reg) & ~REG_FIELD_MASK(reg, field)) | \
 			(val) << REG_FIELD_SHIFT(reg, field))
 
+/**
+ * struct hl_device_reset_work - reset workqueue task wrapper.
+ * @reset_work: reset work to be done.
+ * @hdev: habanalabs device structure.
+ */
+struct hl_device_reset_work {
+	struct work_struct		reset_work;
+	struct hl_device		*hdev;
+};
+
 /**
  * struct hl_device - habanalabs device structure.
  * @pdev: pointer to PCI device, can be NULL in case of simulator device.
@@ -540,6 +572,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @cdev: related char device.
  * @dev: realted kernel basic device structure.
  * @work_freq: delayed work to lower device frequency if possible.
+ * @work_heartbeat: delayed work for ArmCP is-alive check.
  * @asic_name: ASIC specific nmae.
  * @asic_type: ASIC specific type.
  * @completion_queue: array of hl_cq.
@@ -571,6 +604,7 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
+ * @in_reset: is device in reset flow.
  * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open user processes.
  * @max_power: the max power of the device, as configured by the sysadmin. This
@@ -578,10 +612,15 @@ void hl_wreg(struct hl_device *hdev, u32 reg, u32 val);
  *             value and update the F/W after the re-initialization
  * @major: habanalabs KMD major.
  * @high_pll: high PLL profile frequency.
+ * @soft_reset_cnt: number of soft reset since KMD loading.
+ * @hard_reset_cnt: number of hard reset since KMD loading.
  * @id: device minor.
  * @disabled: is device disabled.
  * @late_init_done: is late init stage was done during initialization.
  * @hwmon_initialized: is H/W monitor sensors was initialized.
+ * @hard_reset_pending: is there a hard reset work pending.
+ * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
+ * @init_done: is the initialization of the device done.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -590,6 +629,7 @@ struct hl_device {
 	struct cdev			cdev;
 	struct device			*dev;
 	struct delayed_work		work_freq;
+	struct delayed_work		work_heartbeat;
 	char				asic_name[16];
 	enum hl_asic_type		asic_type;
 	struct hl_cq			*completion_queue;
@@ -621,15 +661,21 @@ struct hl_device {
 	/* TODO: remove user_ctx for multiple process support */
 	struct hl_ctx			*user_ctx;
 
+	atomic_t			in_reset;
 	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
 	u64				max_power;
 	u32				major;
 	u32				high_pll;
+	u32				soft_reset_cnt;
+	u32				hard_reset_cnt;
 	u16				id;
 	u8				disabled;
 	u8				late_init_done;
 	u8				hwmon_initialized;
+	u8				hard_reset_pending;
+	u8				heartbeat;
+	u8				init_done;
 
 	/* Parameters for bring-up */
 	u8				cpu_enable;
@@ -670,6 +716,7 @@ struct hl_ioctl_desc {
  */
 
 int hl_device_open(struct inode *inode, struct file *filp);
+bool hl_device_disabled_or_in_reset(struct hl_device *hdev);
 int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 		enum hl_asic_type asic_type, int minor);
 void destroy_hdev(struct hl_device *hdev);
@@ -683,6 +730,7 @@ int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
 				u32 cb_size, u64 cb_ptr);
 u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
 void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
+void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset);
 
 #define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
 #define hl_pi_2_offset(pi)		((pi) & (HL_QUEUE_LENGTH - 1))
@@ -691,6 +739,8 @@ int hl_cq_init(struct hl_device *hdev, struct hl_cq *q, u32 hw_queue_id);
 void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q);
 int hl_eq_init(struct hl_device *hdev, struct hl_eq *q);
 void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q);
+void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q);
+void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q);
 irqreturn_t hl_irq_handler_cq(int irq, void *arg);
 irqreturn_t hl_irq_handler_eq(int irq, void *arg);
 int hl_asid_init(struct hl_device *hdev);
@@ -708,6 +758,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass);
 void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
 int hl_device_resume(struct hl_device *hdev);
+int hl_device_reset(struct hl_device *hdev, bool hard_reset,
+			bool from_hard_reset_thread);
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
 int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 6260d99ef1ca..0702f6f974eb 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -88,9 +88,9 @@ int hl_device_open(struct inode *inode, struct file *filp)
 
 	mutex_lock(&hdev->fd_open_cnt_lock);
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		dev_err_ratelimited(hdev->dev,
-			"Can't open %s because it is disabled\n",
+			"Can't open %s because it is disabled or in reset\n",
 			dev_name(hdev->dev));
 		mutex_unlock(&hdev->fd_open_cnt_lock);
 		return -EPERM;
@@ -183,6 +183,7 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->cpu_queues_enable = 1;
 	hdev->fw_loading = 1;
 	hdev->pldm = 0;
+	hdev->heartbeat = 1;
 
 	/* If CPU is disabled, no point in loading FW */
 	if (!hdev->cpu_enable)
@@ -192,6 +193,10 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	if (!hdev->fw_loading)
 		hdev->cpu_queues_enable = 0;
 
+	/* If CPU queues not enabled, no way to do heartbeat */
+	if (!hdev->cpu_queues_enable)
+		hdev->heartbeat = 0;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index 71dd1e760a22..f406c3eb6316 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -33,6 +33,12 @@ long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
 	unsigned int usize, asize;
 	int retcode;
 
+	if (hdev->hard_reset_pending) {
+		dev_crit_ratelimited(hdev->dev,
+			"Device HARD reset pending! Please close FD\n");
+		return -ENODEV;
+	}
+
 	if ((nr >= HL_COMMAND_START) && (nr < HL_COMMAND_END)) {
 		u32 hl_size;
 
diff --git a/drivers/misc/habanalabs/hwmon.c b/drivers/misc/habanalabs/hwmon.c
index 4d60b36dfa21..61b0a2b26c19 100644
--- a/drivers/misc/habanalabs/hwmon.c
+++ b/drivers/misc/habanalabs/hwmon.c
@@ -111,7 +111,7 @@ static int hl_read(struct device *dev, enum hwmon_sensor_types type,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	switch (type) {
@@ -185,7 +185,7 @@ static int hl_write(struct device *dev, enum hwmon_sensor_types type,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	switch (type) {
diff --git a/drivers/misc/habanalabs/irq.c b/drivers/misc/habanalabs/irq.c
index 5a7cb85ab50a..29dc24c63606 100644
--- a/drivers/misc/habanalabs/irq.c
+++ b/drivers/misc/habanalabs/irq.c
@@ -250,6 +250,23 @@ void hl_cq_fini(struct hl_device *hdev, struct hl_cq *q)
 			(void *) q->kernel_address, q->bus_address);
 }
 
+void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q)
+{
+	q->ci = 0;
+	q->pi = 0;
+
+	atomic_set(&q->free_slots_cnt, HL_CQ_LENGTH);
+
+	/*
+	 * It's not enough to just reset the PI/CI because the H/W may have
+	 * written valid completion entries before it was halted and therefore
+	 * we need to clean the actual queues so we won't process old entries
+	 * when the device is operational again
+	 */
+
+	memset((void *) q->kernel_address, 0, HL_CQ_SIZE_IN_BYTES);
+}
+
 /*
  * hl_eq_init - main initialization function for an event queue object
  *
@@ -292,3 +309,17 @@ void hl_eq_fini(struct hl_device *hdev, struct hl_eq *q)
 	hdev->asic_funcs->dma_free_coherent(hdev, HL_EQ_SIZE_IN_BYTES,
 			(void *) q->kernel_address, q->bus_address);
 }
+
+void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q)
+{
+	q->ci = 0;
+
+	/*
+	 * It's not enough to just reset the PI/CI because the H/W may have
+	 * written valid completion entries before it was halted and therefore
+	 * we need to clean the actual queues so we won't process old entries
+	 * when the device is operational again
+	 */
+
+	memset((void *) q->kernel_address, 0, HL_EQ_SIZE_IN_BYTES);
+}
diff --git a/drivers/misc/habanalabs/sysfs.c b/drivers/misc/habanalabs/sysfs.c
index fe865b2e8bab..443045ac3c57 100644
--- a/drivers/misc/habanalabs/sysfs.c
+++ b/drivers/misc/habanalabs/sysfs.c
@@ -105,7 +105,7 @@ static ssize_t pm_mng_profile_show(struct device *dev,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	return sprintf(buf, "%s\n",
@@ -119,7 +119,7 @@ static ssize_t pm_mng_profile_store(struct device *dev,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		count = -ENODEV;
 		goto out;
 	}
@@ -163,7 +163,7 @@ static ssize_t high_pll_show(struct device *dev, struct device_attribute *attr,
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	return sprintf(buf, "%u\n", hdev->high_pll);
@@ -176,7 +176,7 @@ static ssize_t high_pll_store(struct device *dev, struct device_attribute *attr,
 	long value;
 	int rc;
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		count = -ENODEV;
 		goto out;
 	}
@@ -260,6 +260,48 @@ static ssize_t preboot_btl_ver_show(struct device *dev,
 	return sprintf(buf, "%s\n", hdev->asic_prop.preboot_ver);
 }
 
+static ssize_t soft_reset_store(struct device *dev,
+				struct device_attribute *attr, const char *buf,
+				size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+	int rc;
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hl_device_reset(hdev, false, false);
+
+out:
+	return count;
+}
+
+static ssize_t hard_reset_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+	long value;
+	int rc;
+
+	rc = kstrtoul(buf, 0, &value);
+
+	if (rc) {
+		count = -EINVAL;
+		goto out;
+	}
+
+	hl_device_reset(hdev, true, false);
+
+out:
+	return count;
+}
+
 static ssize_t device_type_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -301,7 +343,9 @@ static ssize_t status_show(struct device *dev, struct device_attribute *attr,
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	char *str;
 
-	if (hdev->disabled)
+	if (atomic_read(&hdev->in_reset))
+		str = "In reset";
+	else if (hdev->disabled)
 		str = "Malfunction";
 	else
 		str = "Operational";
@@ -317,13 +361,29 @@ static ssize_t write_open_cnt_show(struct device *dev,
 	return sprintf(buf, "%d\n", hdev->user_ctx ? 1 : 0);
 }
 
+static ssize_t soft_reset_cnt_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%d\n", hdev->soft_reset_cnt);
+}
+
+static ssize_t hard_reset_cnt_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct hl_device *hdev = dev_get_drvdata(dev);
+
+	return sprintf(buf, "%d\n", hdev->hard_reset_cnt);
+}
+
 static ssize_t max_power_show(struct device *dev, struct device_attribute *attr,
 				char *buf)
 {
 	struct hl_device *hdev = dev_get_drvdata(dev);
 	long val;
 
-	if (hdev->disabled)
+	if (hl_device_disabled_or_in_reset(hdev))
 		return -ENODEV;
 
 	val = hl_get_max_power(hdev);
@@ -338,7 +398,7 @@ static ssize_t max_power_store(struct device *dev,
 	unsigned long value;
 	int rc;
 
-	if (hdev->disabled) {
+	if (hl_device_disabled_or_in_reset(hdev)) {
 		count = -ENODEV;
 		goto out;
 	}
@@ -390,12 +450,16 @@ static DEVICE_ATTR_RO(armcp_ver);
 static DEVICE_ATTR_RO(cpld_ver);
 static DEVICE_ATTR_RO(device_type);
 static DEVICE_ATTR_RO(fuse_ver);
+static DEVICE_ATTR_WO(hard_reset);
+static DEVICE_ATTR_RO(hard_reset_cnt);
 static DEVICE_ATTR_RW(high_pll);
 static DEVICE_ATTR_RO(infineon_ver);
 static DEVICE_ATTR_RW(max_power);
 static DEVICE_ATTR_RO(pci_addr);
 static DEVICE_ATTR_RW(pm_mng_profile);
 static DEVICE_ATTR_RO(preboot_btl_ver);
+static DEVICE_ATTR_WO(soft_reset);
+static DEVICE_ATTR_RO(soft_reset_cnt);
 static DEVICE_ATTR_RO(status);
 static DEVICE_ATTR_RO(thermal_ver);
 static DEVICE_ATTR_RO(uboot_ver);
@@ -413,12 +477,16 @@ static struct attribute *hl_dev_attrs[] = {
 	&dev_attr_cpld_ver.attr,
 	&dev_attr_device_type.attr,
 	&dev_attr_fuse_ver.attr,
+	&dev_attr_hard_reset.attr,
+	&dev_attr_hard_reset_cnt.attr,
 	&dev_attr_high_pll.attr,
 	&dev_attr_infineon_ver.attr,
 	&dev_attr_max_power.attr,
 	&dev_attr_pci_addr.attr,
 	&dev_attr_pm_mng_profile.attr,
 	&dev_attr_preboot_btl_ver.attr,
+	&dev_attr_soft_reset.attr,
+	&dev_attr_soft_reset_cnt.attr,
 	&dev_attr_status.attr,
 	&dev_attr_thermal_ver.attr,
 	&dev_attr_uboot_ver.attr,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 11/15] habanalabs: add command submission module
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (8 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 10/15] habanalabs: add device reset support Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds the main flow for the user to submit work to the device.

Each work is described by a command submission object (CS). The CS contains
3 arrays of command buffers: One for execution, and two for context-switch
(store and restore).

For each CB, the user specifies on which queue to put that CB. In case of
an internal queue, the entry doesn't contain a pointer to the CB but the
address in the on-chip memory that the CB resides at.

The driver parses some of the CBs to enforce security restrictions.

The user receives a sequence number that represents the CS object. The user
can then query the driver regarding the status of the CS, using that
sequence number.

In case the CS doesn't finish before the timeout expires, the driver will
perform a soft-reset of the device.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Remove empty line at end of memory.c
 
 drivers/misc/habanalabs/Makefile             |    3 +-
 drivers/misc/habanalabs/command_submission.c |  770 ++++++++++++
 drivers/misc/habanalabs/context.c            |   52 +-
 drivers/misc/habanalabs/device.c             |   16 +
 drivers/misc/habanalabs/goya/goya.c          | 1122 ++++++++++++++++++
 drivers/misc/habanalabs/habanalabs.h         |  272 +++++
 drivers/misc/habanalabs/habanalabs_drv.c     |   20 +
 drivers/misc/habanalabs/habanalabs_ioctl.c   |    4 +-
 drivers/misc/habanalabs/hw_queue.c           |  232 ++++
 drivers/misc/habanalabs/memory.c             |  199 ++++
 include/uapi/misc/habanalabs.h               |  158 ++-
 11 files changed, 2841 insertions(+), 7 deletions(-)
 create mode 100644 drivers/misc/habanalabs/command_submission.c
 create mode 100644 drivers/misc/habanalabs/memory.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index b5607233d216..d2fd0e18b1eb 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -5,7 +5,8 @@
 obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
-		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o
+		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
+		command_submission.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
new file mode 100644
index 000000000000..288dddc14873
--- /dev/null
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -0,0 +1,770 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include <uapi/misc/habanalabs.h>
+#include "habanalabs.h"
+
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
+#include <linux/sched/signal.h>
+#include <linux/wait.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+
+static void job_wq_completion(struct work_struct *work);
+static long _hl_cs_wait_ioctl(struct hl_device *hdev,
+		struct hl_ctx *ctx, u64 timeout_us, u64 seq);
+static void cs_do_release(struct kref *ref);
+
+static const char *hl_fence_get_driver_name(struct dma_fence *fence)
+{
+	return "HabanaLabs";
+}
+
+static const char *hl_fence_get_timeline_name(struct dma_fence *fence)
+{
+	struct hl_dma_fence *hl_fence =
+		container_of(fence, struct hl_dma_fence, base_fence);
+
+	return dev_name(hl_fence->hdev->dev);
+}
+
+static bool hl_fence_enable_signaling(struct dma_fence *fence)
+{
+	return true;
+}
+
+static void hl_fence_release(struct dma_fence *fence)
+{
+	struct hl_dma_fence *hl_fence =
+		container_of(fence, struct hl_dma_fence, base_fence);
+
+	kfree_rcu(hl_fence, base_fence.rcu);
+}
+
+static const struct dma_fence_ops hl_fence_ops = {
+	.get_driver_name = hl_fence_get_driver_name,
+	.get_timeline_name = hl_fence_get_timeline_name,
+	.enable_signaling = hl_fence_enable_signaling,
+	.wait = dma_fence_default_wait,
+	.release = hl_fence_release
+};
+
+static void cs_get(struct hl_cs *cs)
+{
+	kref_get(&cs->refcount);
+}
+
+static int cs_get_unless_zero(struct hl_cs *cs)
+{
+	return kref_get_unless_zero(&cs->refcount);
+}
+
+static void cs_put(struct hl_cs *cs)
+{
+	kref_put(&cs->refcount, cs_do_release);
+}
+
+/*
+ * cs_parser - parse the user command submission
+ *
+ * @hpriv	: pointer to the private data of the fd
+ * @job        : pointer to the job that holds the command submission info
+ *
+ * The function parses the command submission of the user. It calls the
+ * ASIC specific parser, which returns a list of memory blocks to send
+ * to the device as different command buffers
+ *
+ */
+static int cs_parser(struct hl_fpriv *hpriv, struct hl_cs_job *job)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_cs_parser parser;
+	int rc;
+
+	parser.ctx_id = job->cs->ctx->asid;
+	parser.cs_sequence = job->cs->sequence;
+	parser.job_id = job->id;
+
+	parser.hw_queue_id = job->hw_queue_id;
+	parser.job_userptr_list = &job->userptr_list;
+	parser.patched_cb = NULL;
+	parser.user_cb = job->user_cb;
+	parser.user_cb_size = job->user_cb_size;
+	parser.ext_queue = job->ext_queue;
+	job->patched_cb = NULL;
+	parser.use_virt_addr = hdev->mmu_enable;
+
+	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
+	if (job->ext_queue) {
+		if (!rc) {
+			job->patched_cb = parser.patched_cb;
+			job->job_cb_size = parser.patched_cb_size;
+
+			spin_lock(&job->patched_cb->lock);
+			job->patched_cb->cs_cnt++;
+			spin_unlock(&job->patched_cb->lock);
+		}
+
+		/*
+		 * Whether the parsing worked or not, we don't need the
+		 * original CB anymore because it was already parsed and
+		 * won't be accessed again for this CS
+		 */
+		spin_lock(&job->user_cb->lock);
+		job->user_cb->cs_cnt--;
+		spin_unlock(&job->user_cb->lock);
+		hl_cb_put(job->user_cb);
+		job->user_cb = NULL;
+	}
+
+	return rc;
+}
+
+static void free_job(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct hl_cs *cs = job->cs;
+
+	if (job->ext_queue) {
+		hl_userptr_delete_list(hdev, &job->userptr_list);
+
+		/*
+		 * We might arrive here from rollback and patched CB wasn't
+		 * created, so we need to check it's not NULL
+		 */
+		if (job->patched_cb) {
+			spin_lock(&job->patched_cb->lock);
+			job->patched_cb->cs_cnt--;
+			spin_unlock(&job->patched_cb->lock);
+
+			hl_cb_put(job->patched_cb);
+		}
+	}
+
+	/*
+	 * This is the only place where there can be multiple threads
+	 * modifying the list at the same time
+	 */
+	spin_lock(&cs->job_lock);
+	list_del(&job->cs_node);
+	spin_unlock(&cs->job_lock);
+
+	if (job->ext_queue)
+		cs_put(cs);
+
+	kfree(job);
+}
+
+static void cs_do_release(struct kref *ref)
+{
+	struct hl_cs *cs = container_of(ref, struct hl_cs,
+						refcount);
+	struct hl_device *hdev = cs->ctx->hdev;
+	struct hl_cs_job *job, *tmp;
+
+	cs->completed = true;
+
+	/*
+	 * Although if we reached here it means that all external jobs have
+	 * finished, because each one of them took refcnt to CS, we still
+	 * need to go over the internal jobs and free them. Otherwise, we
+	 * will have leaked memory and what's worse, the CS object (and
+	 * potentially the CTX object) could be released, while the JOB
+	 * still holds a pointer to them (but no reference).
+	 */
+	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node)
+		free_job(hdev, job);
+
+	/* We also need to update CI for internal queues */
+	if (cs->submitted) {
+		hl_int_hw_queue_update_ci(cs);
+
+		spin_lock(&hdev->hw_queues_mirror_lock);
+		/* remove CS from hw_queues mirror list */
+		list_del_init(&cs->mirror_node);
+		spin_unlock(&hdev->hw_queues_mirror_lock);
+
+		/*
+		 * Don't cancel TDR in case this CS was timedout because we
+		 * might be running from the TDR context
+		 */
+		if ((!cs->timedout) &&
+			(hdev->timeout_jiffies != MAX_SCHEDULE_TIMEOUT)) {
+			struct hl_cs *next;
+
+			if (cs->tdr_active)
+				cancel_delayed_work_sync(&cs->work_tdr);
+
+			spin_lock(&hdev->hw_queues_mirror_lock);
+
+			/* queue TDR for next CS */
+			next = list_first_entry_or_null(
+					&hdev->hw_queues_mirror_list,
+					struct hl_cs, mirror_node);
+
+			if ((next) && (!next->tdr_active)) {
+				next->tdr_active = true;
+				schedule_delayed_work(&next->work_tdr,
+							hdev->timeout_jiffies);
+			}
+
+			spin_unlock(&hdev->hw_queues_mirror_lock);
+		}
+	}
+
+	hl_ctx_put(cs->ctx);
+
+	if (cs->timedout)
+		dma_fence_set_error(cs->fence, -ETIMEDOUT);
+	else if (cs->aborted)
+		dma_fence_set_error(cs->fence, -EIO);
+
+	dma_fence_signal(cs->fence);
+	dma_fence_put(cs->fence);
+
+	kfree(cs);
+}
+
+static void cs_timedout(struct work_struct *work)
+{
+	struct hl_device *hdev;
+	int ctx_asid, rc;
+	struct hl_cs *cs = container_of(work, struct hl_cs,
+						 work_tdr.work);
+	rc = cs_get_unless_zero(cs);
+	if (!rc)
+		return;
+
+	if ((!cs->submitted) || (cs->completed)) {
+		cs_put(cs);
+		return;
+	}
+
+	/* Mark the CS is timed out so we won't try to cancel its TDR */
+	cs->timedout = true;
+
+	hdev = cs->ctx->hdev;
+	ctx_asid = cs->ctx->asid;
+
+	/* TODO: add information about last signaled seq and last emitted seq */
+	dev_err(hdev->dev, "CS %d.%llu got stuck!\n", ctx_asid, cs->sequence);
+
+	cs_put(cs);
+
+	if (hdev->reset_on_lockup)
+		hl_device_reset(hdev, false, false);
+}
+
+static int allocate_cs(struct hl_device *hdev, struct hl_ctx *ctx,
+			struct hl_cs **cs_new)
+{
+	struct hl_dma_fence *fence;
+	struct dma_fence *other = NULL;
+	struct hl_cs *cs;
+	int rc;
+
+	cs = kzalloc(sizeof(*cs), GFP_ATOMIC);
+	if (!cs)
+		return -ENOMEM;
+
+	cs->ctx = ctx;
+	cs->submitted = false;
+	cs->completed = false;
+	INIT_LIST_HEAD(&cs->job_list);
+	INIT_DELAYED_WORK(&cs->work_tdr, cs_timedout);
+	kref_init(&cs->refcount);
+	spin_lock_init(&cs->job_lock);
+
+	fence = kmalloc(sizeof(*fence), GFP_ATOMIC);
+	if (!fence) {
+		rc = -ENOMEM;
+		goto free_cs;
+	}
+
+	fence->hdev = hdev;
+	spin_lock_init(&fence->lock);
+	cs->fence = &fence->base_fence;
+
+	spin_lock(&ctx->cs_lock);
+
+	fence->cs_seq = ctx->cs_sequence;
+	other = ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)];
+	if ((other) && (!dma_fence_is_signaled(other))) {
+		spin_unlock(&ctx->cs_lock);
+		rc = -EAGAIN;
+		goto free_fence;
+	}
+
+	dma_fence_init(&fence->base_fence, &hl_fence_ops, &fence->lock,
+			ctx->asid, ctx->cs_sequence);
+
+	cs->sequence = fence->cs_seq;
+
+	ctx->cs_pending[fence->cs_seq & (HL_MAX_PENDING_CS - 1)] =
+							&fence->base_fence;
+	ctx->cs_sequence++;
+
+	dma_fence_get(&fence->base_fence);
+
+	dma_fence_put(other);
+
+	spin_unlock(&ctx->cs_lock);
+
+	*cs_new = cs;
+
+	return 0;
+
+free_fence:
+	kfree(fence);
+free_cs:
+	kfree(cs);
+	return rc;
+}
+
+static void cs_rollback(struct hl_device *hdev, struct hl_cs *cs)
+{
+	struct hl_cs_job *job, *tmp;
+
+	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node)
+		free_job(hdev, job);
+}
+
+void hl_cs_rollback_all(struct hl_device *hdev)
+{
+	struct hl_cs *cs, *tmp;
+
+	/* flush all completions */
+	flush_workqueue(hdev->cq_wq);
+
+	/* Make sure we don't have leftovers in the H/W queues mirror list */
+	list_for_each_entry_safe(cs, tmp, &hdev->hw_queues_mirror_list,
+				mirror_node) {
+		cs_get(cs);
+		cs->aborted = true;
+		dev_warn_ratelimited(hdev->dev, "Killing CS %d.%llu\n",
+					cs->ctx->asid, cs->sequence);
+		cs_rollback(hdev, cs);
+		cs_put(cs);
+	}
+}
+
+static void job_wq_completion(struct work_struct *work)
+{
+	struct hl_cs_job *job = container_of(work, struct hl_cs_job,
+						finish_work);
+	struct hl_cs *cs = job->cs;
+	struct hl_device *hdev = cs->ctx->hdev;
+
+	/* job is no longer needed */
+	free_job(hdev, job);
+}
+
+static struct hl_cb *validate_queue_index(struct hl_device *hdev,
+					struct hl_cb_mgr *cb_mgr,
+					struct hl_cs_chunk *chunk,
+					bool *ext_queue)
+{
+	struct asic_fixed_properties *asic = &hdev->asic_prop;
+	struct hw_queue_properties *hw_queue_prop;
+	u32 cb_handle;
+	struct hl_cb *cb;
+
+	/* Assume external queue */
+	*ext_queue = true;
+
+	hw_queue_prop = &asic->hw_queues_props[chunk->queue_index];
+
+	if ((chunk->queue_index >= HL_MAX_QUEUES) ||
+			(hw_queue_prop->type == QUEUE_TYPE_NA)) {
+		dev_err(hdev->dev, "Queue index %d is invalid\n",
+			chunk->queue_index);
+		return NULL;
+	}
+
+	if (hw_queue_prop->kmd_only) {
+		dev_err(hdev->dev, "Queue index %d is restricted for KMD\n",
+			chunk->queue_index);
+		return NULL;
+	} else if (hw_queue_prop->type == QUEUE_TYPE_INT) {
+		*ext_queue = false;
+		return (struct hl_cb *) chunk->cb_handle;
+	}
+
+	/* Retrieve CB object */
+	cb_handle = (u32) (chunk->cb_handle >> PAGE_SHIFT);
+
+	cb = hl_cb_get(hdev, cb_mgr, cb_handle);
+	if (!cb) {
+		dev_err(hdev->dev, "CB handle 0x%x invalid\n", cb_handle);
+		return NULL;
+	}
+
+	if ((chunk->cb_size < 8) || (chunk->cb_size > cb->size)) {
+		dev_err(hdev->dev, "CB size %u invalid\n", chunk->cb_size);
+		goto release_cb;
+	}
+
+	spin_lock(&cb->lock);
+	cb->cs_cnt++;
+	spin_unlock(&cb->lock);
+
+	return cb;
+
+release_cb:
+	hl_cb_put(cb);
+	return NULL;
+}
+
+struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue)
+{
+	struct hl_cs_job *job;
+
+	job = kzalloc(sizeof(*job), GFP_ATOMIC);
+	if (!job)
+		return NULL;
+
+	job->ext_queue = ext_queue;
+
+	if (job->ext_queue) {
+		INIT_LIST_HEAD(&job->userptr_list);
+		INIT_WORK(&job->finish_work, job_wq_completion);
+	}
+
+	return job;
+}
+
+static int _hl_cs_ioctl(struct hl_fpriv *hpriv, void __user *chunks,
+			u32 num_chunks, u64 *cs_seq)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_cs_chunk *cs_chunk_array;
+	struct hl_cs_job *job;
+	struct hl_cs *cs;
+	struct hl_cb *cb;
+	bool ext_queue_present = false;
+	u32 size_to_copy;
+	int rc, i, parse_cnt;
+
+	*cs_seq = ULLONG_MAX;
+
+	if (num_chunks > HL_MAX_JOBS_PER_CS) {
+		dev_err(hdev->dev,
+			"Number of chunks can NOT be larger than %d\n",
+			HL_MAX_JOBS_PER_CS);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	cs_chunk_array = kmalloc_array(num_chunks, sizeof(*cs_chunk_array),
+					GFP_ATOMIC);
+	if (!cs_chunk_array) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	size_to_copy = num_chunks * sizeof(struct hl_cs_chunk);
+	if (copy_from_user(cs_chunk_array, chunks, size_to_copy)) {
+		dev_err(hdev->dev, "Failed to copy cs chunk array from user\n");
+		rc = -EFAULT;
+		goto free_cs_chunk_array;
+	}
+
+	/* increment refcnt for context */
+	hl_ctx_get(hdev, hpriv->ctx);
+
+	rc = allocate_cs(hdev, hpriv->ctx, &cs);
+	if (rc) {
+		hl_ctx_put(hpriv->ctx);
+		goto free_cs_chunk_array;
+	}
+
+	*cs_seq = cs->sequence;
+
+	/* Validate ALL the CS chunks before submitting the CS */
+	for (i = 0, parse_cnt = 0 ; i < num_chunks ; i++, parse_cnt++) {
+		struct hl_cs_chunk *chunk = &cs_chunk_array[i];
+		bool ext_queue;
+
+		cb = validate_queue_index(hdev, &hpriv->cb_mgr, chunk,
+					&ext_queue);
+		if (ext_queue) {
+			ext_queue_present = true;
+			if (!cb) {
+				rc = -EINVAL;
+				goto free_cs_object;
+			}
+		}
+
+		job = hl_cs_allocate_job(hdev, ext_queue);
+		if (!job) {
+			dev_err(hdev->dev, "Failed to allocate a new job\n");
+			rc = -ENOMEM;
+			if (ext_queue)
+				goto release_cb;
+			else
+				goto free_cs_object;
+		}
+
+		job->id = i + 1;
+		job->cs = cs;
+		job->user_cb = cb;
+		job->user_cb_size = chunk->cb_size;
+		if (job->ext_queue)
+			job->job_cb_size = cb->size;
+		else
+			job->job_cb_size = chunk->cb_size;
+		job->hw_queue_id = chunk->queue_index;
+
+		cs->jobs_in_queue_cnt[job->hw_queue_id]++;
+
+		list_add_tail(&job->cs_node, &cs->job_list);
+
+		/*
+		 * Increment CS reference. When CS reference is 0, CS is
+		 * done and can be signaled to user and free all its resources
+		 * Only increment for JOB on external queues, because only
+		 * for those JOBs we get completion
+		 */
+		if (job->ext_queue)
+			cs_get(cs);
+
+		rc = cs_parser(hpriv, job);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed to parse JOB %d.%llu.%d, err %d, rejecting the CS\n",
+				cs->ctx->asid, cs->sequence, job->id, rc);
+			goto free_cs_object;
+		}
+	}
+
+	if (!ext_queue_present) {
+		dev_err(hdev->dev,
+			"Reject CS %d.%llu because no external queues jobs\n",
+			cs->ctx->asid, cs->sequence);
+		rc = -EINVAL;
+		goto free_cs_object;
+	}
+
+	rc = hl_hw_queue_schedule_cs(cs);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to submit CS %d.%llu to H/W queues, error %d\n",
+			cs->ctx->asid, cs->sequence, rc);
+		goto free_cs_object;
+	}
+
+	rc = HL_CS_STATUS_SUCCESS;
+	goto put_cs;
+
+release_cb:
+	spin_lock(&cb->lock);
+	cb->cs_cnt--;
+	spin_unlock(&cb->lock);
+	hl_cb_put(cb);
+free_cs_object:
+	cs_rollback(hdev, cs);
+	*cs_seq = ULLONG_MAX;
+	/* The path below is both for good and erroneous exits */
+put_cs:
+	/* We finished with the CS in this function, so put the ref */
+	cs_put(cs);
+free_cs_chunk_array:
+	kfree(cs_chunk_array);
+out:
+	return rc;
+}
+
+int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	union hl_cs_args *args = data;
+	void __user *chunks;
+	u32 num_chunks;
+	u64 cs_seq = ULONG_MAX;
+	int rc, do_restore;
+	bool need_soft_reset = false;
+
+	if (hl_device_disabled_or_in_reset(hdev)) {
+		dev_warn(hdev->dev,
+			"Device is %s. Can't submit new CS\n",
+			atomic_read(&hdev->in_reset) ? "in_reset" : "disabled");
+		rc = -EBUSY;
+		goto out;
+	}
+
+	do_restore = atomic_cmpxchg(&hpriv->ctx->thread_restore_token, 1, 0);
+
+	if (do_restore || (args->in.cs_flags & HL_CS_FLAGS_FORCE_RESTORE)) {
+		long ret;
+
+		chunks = (void __user *)(uintptr_t)args->in.chunks_restore;
+		num_chunks = args->in.num_chunks_restore;
+
+		mutex_lock(&hpriv->restore_phase_mutex);
+
+		if (do_restore) {
+			rc = hdev->asic_funcs->context_switch(hdev,
+					hpriv->ctx->asid);
+			if (rc) {
+				dev_err_ratelimited(hdev->dev,
+					"Failed to switch to context %d, rejecting CS! %d\n",
+					hpriv->ctx->asid, rc);
+				/*
+				 * If we timedout, we need to soft-reset because
+				 * QMAN is probably stuck. However, we can't
+				 * call to reset here directly because of
+				 * deadlock, so need to do it at the very end
+				 * of this function
+				 */
+				if (rc == -ETIMEDOUT)
+					need_soft_reset = true;
+				mutex_unlock(&hpriv->restore_phase_mutex);
+				goto out;
+			}
+		}
+
+		hdev->asic_funcs->restore_phase_topology(hdev);
+
+		if (num_chunks == 0) {
+			dev_dbg(hdev->dev,
+			"Need to run restore phase but restore CS is empty\n");
+			rc = 0;
+		} else {
+			rc = _hl_cs_ioctl(hpriv, chunks, num_chunks,
+						&cs_seq);
+		}
+
+		mutex_unlock(&hpriv->restore_phase_mutex);
+
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed to submit restore CS for context %d (%d)\n",
+				hpriv->ctx->asid, rc);
+			goto out;
+		}
+
+		/* Need to wait for restore completion before execution phase */
+		if (num_chunks > 0) {
+			ret = _hl_cs_wait_ioctl(hdev, hpriv->ctx,
+					jiffies_to_usecs(hdev->timeout_jiffies),
+					cs_seq);
+			if (ret <= 0) {
+				dev_err(hdev->dev,
+					"Restore CS for context %d failed to complete %ld\n",
+					hpriv->ctx->asid, ret);
+				rc = -ENOEXEC;
+				goto out;
+			}
+		}
+
+		hpriv->ctx->thread_restore_wait_token = 1;
+	} else if (!hpriv->ctx->thread_restore_wait_token) {
+		u32 tmp;
+
+		rc = hl_poll_timeout_memory(hdev,
+			(u64) &hpriv->ctx->thread_restore_wait_token,
+			jiffies_to_usecs(hdev->timeout_jiffies),
+			&tmp);
+
+		if (rc || !tmp) {
+			dev_err(hdev->dev,
+				"restore phase hasn't finished in time\n");
+			rc = -ETIMEDOUT;
+			goto out;
+		}
+	}
+
+	chunks = (void __user *)(uintptr_t)args->in.chunks_execute;
+	num_chunks = args->in.num_chunks_execute;
+
+	if (num_chunks == 0) {
+		dev_err(hdev->dev,
+			"Got execute CS with 0 chunks, context %d\n",
+			hpriv->ctx->asid);
+		rc = -EINVAL;
+		goto out;
+	}
+
+	rc = _hl_cs_ioctl(hpriv, chunks, num_chunks, &cs_seq);
+
+out:
+	if (rc != -EAGAIN) {
+		memset(args, 0, sizeof(*args));
+		args->out.status = rc;
+		args->out.seq = cs_seq;
+	}
+
+	if ((rc == -ETIMEDOUT) && (need_soft_reset))
+		hl_device_reset(hdev, false, false);
+
+	return rc;
+}
+
+static long _hl_cs_wait_ioctl(struct hl_device *hdev,
+		struct hl_ctx *ctx, u64 timeout_us, u64 seq)
+{
+	struct dma_fence *fence;
+	unsigned long timeout;
+	long rc;
+
+	if (timeout_us == MAX_SCHEDULE_TIMEOUT)
+		timeout = timeout_us;
+	else
+		timeout = usecs_to_jiffies(timeout_us);
+
+	hl_ctx_get(hdev, ctx);
+
+	fence = hl_ctx_get_fence(ctx, seq);
+	if (IS_ERR(fence)) {
+		rc = PTR_ERR(fence);
+	} else if (fence) {
+		rc = dma_fence_wait_timeout(fence, true, timeout);
+		if (fence->error == -ETIMEDOUT)
+			rc = -ETIMEDOUT;
+		else if (fence->error == -EIO)
+			rc = -EIO;
+		dma_fence_put(fence);
+	} else
+		rc = 1;
+
+	hl_ctx_put(ctx);
+
+	return rc;
+}
+
+int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	union hl_wait_cs_args *args = data;
+	u64 seq = args->in.seq;
+	long rc;
+
+	rc = _hl_cs_wait_ioctl(hdev, hpriv->ctx, args->in.timeout_us, seq);
+
+	memset(args, 0, sizeof(*args));
+
+	if (rc < 0) {
+		dev_err(hdev->dev, "Error %ld on waiting for CS handle %llu\n",
+			rc, seq);
+		if (rc == -ERESTARTSYS) {
+			args->out.status = HL_WAIT_CS_STATUS_INTERRUPTED;
+			rc = -EINTR;
+		} else if (rc == -ETIMEDOUT) {
+			args->out.status = HL_WAIT_CS_STATUS_TIMEDOUT;
+		} else if (rc == -EIO) {
+			args->out.status = HL_WAIT_CS_STATUS_ABORTED;
+		}
+		return rc;
+	}
+
+	if (rc == 0)
+		args->out.status = HL_WAIT_CS_STATUS_BUSY;
+	else
+		args->out.status = HL_WAIT_CS_STATUS_COMPLETED;
+
+	return 0;
+}
diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
index 4d46e51e1a82..98710646de6c 100644
--- a/drivers/misc/habanalabs/context.c
+++ b/drivers/misc/habanalabs/context.c
@@ -13,6 +13,18 @@
 static void hl_ctx_fini(struct hl_ctx *ctx)
 {
 	struct hl_device *hdev = ctx->hdev;
+	int i;
+
+	/*
+	 * If we arrived here, there are no jobs waiting for this context
+	 * on its queues so we can safely remove it.
+	 * This is because for each CS, we increment the ref count and for
+	 * every CS that was finished we decrement it and we won't arrive
+	 * to this function unless the ref count is 0
+	 */
+
+	for (i = 0 ; i < HL_MAX_PENDING_CS ; i++)
+		dma_fence_put(ctx->cs_pending[i]);
 
 	if (ctx->asid != HL_KERNEL_ASID_ID)
 		hl_asid_free(hdev, ctx->asid);
@@ -24,8 +36,6 @@ void hl_ctx_do_release(struct kref *ref)
 
 	ctx = container_of(ref, struct hl_ctx, refcount);
 
-	dev_dbg(ctx->hdev->dev, "Now really releasing context %d\n", ctx->asid);
-
 	hl_ctx_fini(ctx);
 
 	if (ctx->hpriv)
@@ -91,6 +101,11 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 
 	kref_init(&ctx->refcount);
 
+	ctx->cs_sequence = 1;
+	spin_lock_init(&ctx->cs_lock);
+	atomic_set(&ctx->thread_restore_token, 1);
+	ctx->thread_restore_wait_token = 0;
+
 	if (is_kernel_ctx) {
 		ctx->asid = HL_KERNEL_ASID_ID; /* KMD gets ASID 0 */
 	} else {
@@ -101,8 +116,6 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 		}
 	}
 
-	dev_dbg(hdev->dev, "Created context with ASID %u\n", ctx->asid);
-
 	return 0;
 }
 
@@ -116,6 +129,37 @@ int hl_ctx_put(struct hl_ctx *ctx)
 	return kref_put(&ctx->refcount, hl_ctx_do_release);
 }
 
+struct dma_fence *hl_ctx_get_fence(struct hl_ctx *ctx, u64 seq)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct dma_fence *fence;
+
+	spin_lock(&ctx->cs_lock);
+
+	if (seq >= ctx->cs_sequence) {
+		dev_notice(hdev->dev,
+			"Can't wait on seq %llu because current CS is at seq %llu\n",
+			seq, ctx->cs_sequence);
+		spin_unlock(&ctx->cs_lock);
+		return ERR_PTR(-EINVAL);
+	}
+
+
+	if (seq + HL_MAX_PENDING_CS < ctx->cs_sequence) {
+		dev_dbg(hdev->dev,
+			"Can't wait on seq %llu because current CS is at seq %llu (Fence is gone)\n",
+			seq, ctx->cs_sequence);
+		spin_unlock(&ctx->cs_lock);
+		return NULL;
+	}
+
+	fence = dma_fence_get(
+			ctx->cs_pending[seq & (HL_MAX_PENDING_CS - 1)]);
+	spin_unlock(&ctx->cs_lock);
+
+	return fence;
+}
+
 /*
  * hl_ctx_mgr_init - initialize the context manager
  *
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index c244be37bc55..21496882be3a 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -30,6 +30,8 @@ static void hpriv_release(struct kref *ref)
 
 	put_pid(hpriv->taskpid);
 
+	mutex_destroy(&hpriv->restore_phase_mutex);
+
 	kfree(hpriv);
 
 	/* Now the FD is really closed */
@@ -201,6 +203,8 @@ static int device_early_init(struct hl_device *hdev)
 
 	mutex_init(&hdev->fd_open_cnt_lock);
 	mutex_init(&hdev->send_cpu_message_lock);
+	INIT_LIST_HEAD(&hdev->hw_queues_mirror_list);
+	spin_lock_init(&hdev->hw_queues_mirror_lock);
 	atomic_set(&hdev->in_reset, 0);
 	atomic_set(&hdev->fd_open_cnt, 0);
 
@@ -582,6 +586,9 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	 */
 	hdev->asic_funcs->halt_engines(hdev, hard_reset);
 
+	/* Go over all the queues, release all CS and their jobs */
+	hl_cs_rollback_all(hdev);
+
 	if (hard_reset) {
 		/* Release kernel context */
 		if (hl_ctx_put(hdev->kernel_ctx) != 1) {
@@ -605,6 +612,12 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
 		hl_cq_reset(hdev, &hdev->completion_queue[i]);
 
+	/* Make sure the setup phase for the user context will run again */
+	if (hdev->user_ctx) {
+		atomic_set(&hdev->user_ctx->thread_restore_token, 1);
+		hdev->user_ctx->thread_restore_wait_token = 0;
+	}
+
 	/* Finished tear-down, starting to re-initialize */
 
 	if (hard_reset) {
@@ -936,6 +949,9 @@ void hl_device_fini(struct hl_device *hdev)
 	 */
 	hdev->asic_funcs->halt_engines(hdev, true);
 
+	/* Go over all the queues, release all CS and their jobs */
+	hl_cs_rollback_all(hdev);
+
 	hl_cb_pool_fini(hdev);
 
 	/* Release kernel context */
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 1c6cce81c496..9fda232139ea 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -102,6 +102,19 @@ static const char goya_irq_name[GOYA_MSIX_ENTRIES][GOYA_MAX_STRING_LEN] = {
 		"goya cq 4", "goya cpu eq"
 };
 
+static u16 goya_packet_sizes[MAX_PACKET_ID] = {
+	[PACKET_WREG_32]	= sizeof(struct packet_wreg32),
+	[PACKET_WREG_BULK]	= sizeof(struct packet_wreg_bulk),
+	[PACKET_MSG_LONG]	= sizeof(struct packet_msg_long),
+	[PACKET_MSG_SHORT]	= sizeof(struct packet_msg_short),
+	[PACKET_CP_DMA]		= sizeof(struct packet_cp_dma),
+	[PACKET_MSG_PROT]	= sizeof(struct packet_msg_prot),
+	[PACKET_FENCE]		= sizeof(struct packet_fence),
+	[PACKET_LIN_DMA]	= sizeof(struct packet_lin_dma),
+	[PACKET_NOP]		= sizeof(struct packet_nop),
+	[PACKET_STOP]		= sizeof(struct packet_stop)
+};
+
 static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 	"MME0",
 	"MME1",
@@ -2982,6 +2995,84 @@ void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id,
 	return base;
 }
 
+int goya_send_job_on_qman0(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	struct packet_msg_prot *fence_pkt;
+	u32 *fence_ptr;
+	dma_addr_t fence_dma_addr;
+	struct hl_cb *cb;
+	u32 tmp;
+	int rc;
+
+	if (!hdev->asic_funcs->is_device_idle(hdev)) {
+		dev_err_ratelimited(hdev->dev,
+			"Can't send KMD job on QMAN0 if device is not idle\n");
+		return -EFAULT;
+	}
+
+	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
+							&fence_dma_addr);
+	if (!fence_ptr) {
+		dev_err(hdev->dev,
+			"Failed to allocate fence memory for QMAN0\n");
+		return -ENOMEM;
+	}
+
+	*fence_ptr = 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU) {
+		WREG32(mmDMA_QM_0_GLBL_PROT, QMAN_DMA_FULLY_TRUSTED);
+		RREG32(mmDMA_QM_0_GLBL_PROT);
+	}
+
+	/*
+	 * goya cs parser saves space for 2xpacket_msg_prot at end of CB. For
+	 * synchronized kernel jobs we only need space for 1 packet_msg_prot
+	 */
+	job->job_cb_size -= sizeof(struct packet_msg_prot);
+
+	cb = job->patched_cb;
+
+	fence_pkt = (struct packet_msg_prot *) (cb->kernel_address +
+			job->job_cb_size - sizeof(struct packet_msg_prot));
+
+	fence_pkt->ctl = (PACKET_MSG_PROT << GOYA_PKT_CTL_OPCODE_SHIFT) |
+			(1 << GOYA_PKT_CTL_EB_SHIFT) |
+			(1 << GOYA_PKT_CTL_MB_SHIFT);
+	fence_pkt->value = GOYA_QMAN0_FENCE_VAL;
+	fence_pkt->addr = fence_dma_addr +
+			hdev->asic_prop.host_phys_base_address;
+
+	rc = hl_hw_queue_send_cb_no_cmpl(hdev, GOYA_QUEUE_ID_DMA_0,
+					job->job_cb_size, cb->bus_address);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to send CB on QMAN0, %d\n", rc);
+		goto free_fence_ptr;
+	}
+
+	rc = hl_poll_timeout_memory(hdev, (u64) fence_ptr,
+					HL_DEVICE_TIMEOUT_USEC, &tmp);
+
+	hl_hw_queue_inc_ci_kernel(hdev, GOYA_QUEUE_ID_DMA_0);
+
+	if ((rc) || (tmp != GOYA_QMAN0_FENCE_VAL)) {
+		dev_err(hdev->dev, "QMAN0 Job hasn't finished in time\n");
+		rc = -ETIMEDOUT;
+	}
+
+free_fence_ptr:
+	hdev->asic_funcs->dma_pool_free(hdev, (void *) fence_ptr,
+					fence_dma_addr);
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU) {
+		WREG32(mmDMA_QM_0_GLBL_PROT, QMAN_DMA_PARTLY_TRUSTED);
+		RREG32(mmDMA_QM_0_GLBL_PROT);
+	}
+
+	return rc;
+}
+
 int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 				u32 timeout, long *result)
 {
@@ -3216,11 +3307,981 @@ void goya_cpu_accessible_dma_pool_free(struct hl_device *hdev, size_t size,
 	gen_pool_free(hdev->cpu_accessible_dma_pool, (u64) vaddr, size);
 }
 
+int goya_dma_map_sg(struct hl_device *hdev, struct scatterlist *sg, int nents,
+			enum dma_data_direction dir)
+{
+	if (!dma_map_sg(&hdev->pdev->dev, sg, nents, dir))
+		return -ENOMEM;
+
+	return 0;
+}
+
+void goya_dma_unmap_sg(struct hl_device *hdev, struct scatterlist *sg,
+			int nents, enum dma_data_direction dir)
+{
+	dma_unmap_sg(&hdev->pdev->dev, sg, nents, dir);
+}
+
+u32 goya_get_dma_desc_list_size(struct hl_device *hdev,
+					struct sg_table *sgt)
+{
+	struct scatterlist *sg, *sg_next_iter;
+	u32 count, len, dma_desc_cnt, len_next;
+	dma_addr_t addr, addr_next;
+
+	dma_desc_cnt = 0;
+
+	for_each_sg(sgt->sgl, sg, sgt->nents, count) {
+
+		len = sg_dma_len(sg);
+		addr = sg_dma_address(sg);
+
+		if (len == 0)
+			break;
+
+		while ((count + 1) < sgt->nents) {
+			sg_next_iter = sg_next(sg);
+			len_next = sg_dma_len(sg_next_iter);
+			addr_next = sg_dma_address(sg_next_iter);
+
+			if (len_next == 0)
+				break;
+
+			if ((addr + len == addr_next) &&
+				(len + len_next <= DMA_MAX_TRANSFER_SIZE)) {
+				len += len_next;
+				count++;
+				sg = sg_next_iter;
+			} else {
+				break;
+			}
+		}
+
+		dma_desc_cnt++;
+	}
+
+	return dma_desc_cnt * sizeof(struct packet_lin_dma);
+}
+
+static int goya_pin_memory_before_cs(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt,
+				u64 addr, enum dma_data_direction dir)
+{
+	struct hl_userptr *userptr;
+	int rc;
+
+	if (hl_userptr_is_pinned(hdev, addr, user_dma_pkt->tsize,
+			parser->job_userptr_list, &userptr))
+		goto already_pinned;
+
+	userptr = kzalloc(sizeof(*userptr), GFP_ATOMIC);
+	if (!userptr)
+		return -ENOMEM;
+
+	rc = hl_pin_host_memory(hdev, addr, user_dma_pkt->tsize, userptr);
+	if (rc)
+		goto free_userptr;
+
+	list_add_tail(&userptr->job_node, parser->job_userptr_list);
+
+	rc = hdev->asic_funcs->asic_dma_map_sg(hdev, userptr->sgt->sgl,
+					userptr->sgt->nents, dir);
+	if (rc) {
+		dev_err(hdev->dev, "failed to map sgt with DMA region\n");
+		goto unpin_memory;
+	}
+
+	userptr->dma_mapped = true;
+	userptr->dir = dir;
+
+already_pinned:
+	parser->patched_cb_size +=
+			goya_get_dma_desc_list_size(hdev, userptr->sgt);
+
+	return 0;
+
+unpin_memory:
+	hl_unpin_host_memory(hdev, userptr);
+free_userptr:
+	kfree(userptr);
+	return rc;
+}
+
+static int goya_validate_dma_pkt_host(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	u64 device_memory_addr, addr;
+	enum dma_data_direction dir;
+	enum goya_dma_direction user_dir;
+	bool sram_addr = true;
+	bool skip_host_mem_pin = false;
+	bool user_memset;
+	int rc = 0;
+
+	user_dir = (user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_DMA_DIR_MASK) >>
+			GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT;
+
+	user_memset = (user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_MEMSET_MASK) >>
+			GOYA_PKT_LIN_DMA_CTL_MEMSET_SHIFT;
+
+	switch (user_dir) {
+	case DMA_HOST_TO_DRAM:
+		dev_dbg(hdev->dev, "DMA direction is HOST --> DRAM\n");
+		dir = DMA_TO_DEVICE;
+		sram_addr = false;
+		addr = user_dma_pkt->src_addr;
+		device_memory_addr = user_dma_pkt->dst_addr;
+		if (user_memset)
+			skip_host_mem_pin = true;
+		break;
+
+	case DMA_DRAM_TO_HOST:
+		dev_dbg(hdev->dev, "DMA direction is DRAM --> HOST\n");
+		dir = DMA_FROM_DEVICE;
+		sram_addr = false;
+		addr = user_dma_pkt->dst_addr;
+		device_memory_addr = user_dma_pkt->src_addr;
+		break;
+
+	case DMA_HOST_TO_SRAM:
+		dev_dbg(hdev->dev, "DMA direction is HOST --> SRAM\n");
+		dir = DMA_TO_DEVICE;
+		addr = user_dma_pkt->src_addr;
+		device_memory_addr = user_dma_pkt->dst_addr;
+		if (user_memset)
+			skip_host_mem_pin = true;
+		break;
+
+	case DMA_SRAM_TO_HOST:
+		dev_dbg(hdev->dev, "DMA direction is SRAM --> HOST\n");
+		dir = DMA_FROM_DEVICE;
+		addr = user_dma_pkt->dst_addr;
+		device_memory_addr = user_dma_pkt->src_addr;
+		break;
+	default:
+		dev_err(hdev->dev, "DMA direction is undefined\n");
+		return -EFAULT;
+	}
+
+	if (parser->ctx_id != HL_KERNEL_ASID_ID) {
+		if (sram_addr) {
+			if (!hl_mem_area_inside_range(device_memory_addr,
+					user_dma_pkt->tsize,
+					hdev->asic_prop.sram_user_base_address,
+					hdev->asic_prop.sram_end_address)) {
+
+				dev_err(hdev->dev,
+					"SRAM address 0x%llx + 0x%x is invalid\n",
+					device_memory_addr,
+					user_dma_pkt->tsize);
+				return -EFAULT;
+			}
+		} else {
+			if (!hl_mem_area_inside_range(device_memory_addr,
+					user_dma_pkt->tsize,
+					hdev->asic_prop.dram_user_base_address,
+					hdev->asic_prop.dram_end_address)) {
+
+				dev_err(hdev->dev,
+					"DRAM address 0x%llx + 0x%x is invalid\n",
+					device_memory_addr,
+					user_dma_pkt->tsize);
+				return -EFAULT;
+			}
+		}
+	}
+
+	if (skip_host_mem_pin)
+		parser->patched_cb_size += sizeof(*user_dma_pkt);
+	else {
+		if ((dir == DMA_TO_DEVICE) &&
+				(parser->hw_queue_id > GOYA_QUEUE_ID_DMA_1)) {
+			dev_err(hdev->dev,
+				"Can't DMA from host on queue other then 1\n");
+			return -EFAULT;
+		}
+
+		rc = goya_pin_memory_before_cs(hdev, parser, user_dma_pkt,
+						addr, dir);
+	}
+
+	return rc;
+}
+
+static int goya_validate_dma_pkt_no_host(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	u64 sram_memory_addr, dram_memory_addr;
+	enum goya_dma_direction user_dir;
+
+	user_dir = (user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_DMA_DIR_MASK) >>
+			GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT;
+
+	if (user_dir == DMA_DRAM_TO_SRAM) {
+		dev_dbg(hdev->dev, "DMA direction is DRAM --> SRAM\n");
+		dram_memory_addr = user_dma_pkt->src_addr;
+		sram_memory_addr = user_dma_pkt->dst_addr;
+	} else {
+		dev_dbg(hdev->dev, "DMA direction is SRAM --> DRAM\n");
+		sram_memory_addr = user_dma_pkt->src_addr;
+		dram_memory_addr = user_dma_pkt->dst_addr;
+	}
+
+	if (!hl_mem_area_inside_range(sram_memory_addr, user_dma_pkt->tsize,
+				hdev->asic_prop.sram_user_base_address,
+				hdev->asic_prop.sram_end_address)) {
+		dev_err(hdev->dev, "SRAM address 0x%llx + 0x%x is invalid\n",
+			sram_memory_addr, user_dma_pkt->tsize);
+		return -EFAULT;
+	}
+
+	if (!hl_mem_area_inside_range(dram_memory_addr, user_dma_pkt->tsize,
+				hdev->asic_prop.dram_user_base_address,
+				hdev->asic_prop.dram_end_address)) {
+		dev_err(hdev->dev, "DRAM address 0x%llx + 0x%x is invalid\n",
+			dram_memory_addr, user_dma_pkt->tsize);
+		return -EFAULT;
+	}
+
+	parser->patched_cb_size += sizeof(*user_dma_pkt);
+
+	return 0;
+}
+
+static int goya_validate_dma_pkt_no_mmu(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	enum goya_dma_direction user_dir;
+	int rc;
+
+	dev_dbg(hdev->dev, "DMA packet details:\n");
+	dev_dbg(hdev->dev, "source == 0x%llx\n", user_dma_pkt->src_addr);
+	dev_dbg(hdev->dev, "destination == 0x%llx\n", user_dma_pkt->dst_addr);
+	dev_dbg(hdev->dev, "size == %u\n", user_dma_pkt->tsize);
+
+	user_dir = (user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_DMA_DIR_MASK) >>
+			GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT;
+
+	/*
+	 * Special handling for DMA with size 0. The H/W has a bug where
+	 * this can cause the QMAN DMA to get stuck, so block it here.
+	 */
+	if (user_dma_pkt->tsize == 0) {
+		dev_err(hdev->dev,
+			"Got DMA with size 0, might reset the device\n");
+		return -EINVAL;
+	}
+
+	if ((user_dir == DMA_DRAM_TO_SRAM) || (user_dir == DMA_SRAM_TO_DRAM))
+		rc = goya_validate_dma_pkt_no_host(hdev, parser, user_dma_pkt);
+	else
+		rc = goya_validate_dma_pkt_host(hdev, parser, user_dma_pkt);
+
+	return rc;
+}
+
+static int goya_validate_dma_pkt_mmu(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt)
+{
+	dev_dbg(hdev->dev, "DMA packet details:\n");
+	dev_dbg(hdev->dev, "source == 0x%llx\n", user_dma_pkt->src_addr);
+	dev_dbg(hdev->dev, "destination == 0x%llx\n", user_dma_pkt->dst_addr);
+	dev_dbg(hdev->dev, "size == %u\n", user_dma_pkt->tsize);
+
+	/*
+	 * WA for HW-23.
+	 * We can't allow user to read from Host using QMANs other than 1.
+	 */
+	if (parser->hw_queue_id > GOYA_QUEUE_ID_DMA_1 &&
+		hl_mem_area_inside_range(user_dma_pkt->src_addr,
+				user_dma_pkt->tsize,
+				hdev->asic_prop.va_space_host_start_address,
+				hdev->asic_prop.va_space_host_end_address)) {
+		dev_err(hdev->dev,
+			"Can't DMA from host on queue other then 1\n");
+		return -EFAULT;
+	}
+
+	if (user_dma_pkt->tsize == 0) {
+		dev_err(hdev->dev,
+			"Got DMA with size 0, might reset the device\n");
+		return -EINVAL;
+	}
+
+	parser->patched_cb_size += sizeof(*user_dma_pkt);
+
+	return 0;
+}
+
+static int goya_validate_wreg32(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_wreg32 *wreg_pkt)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 sob_start_addr, sob_end_addr;
+	u16 reg_offset;
+
+	reg_offset = wreg_pkt->ctl & GOYA_PKT_WREG32_CTL_REG_OFFSET_MASK;
+
+	dev_dbg(hdev->dev, "WREG32 packet details:\n");
+	dev_dbg(hdev->dev, "reg_offset == 0x%x\n", reg_offset);
+	dev_dbg(hdev->dev, "value      == 0x%x\n", wreg_pkt->value);
+
+	if (reg_offset != (mmDMA_CH_1_WR_COMP_ADDR_LO & 0xFFFF)) {
+		dev_err(hdev->dev, "WREG32 packet with illegal address 0x%x\n",
+			reg_offset);
+		return -EPERM;
+	}
+
+	/*
+	 * With MMU, DMA channels are not secured, so it doesn't matter where
+	 * the WR COMP will be written to because it will go out with
+	 * non-secured property
+	 */
+	if (goya->hw_cap_initialized & HW_CAP_MMU)
+		return 0;
+
+	sob_start_addr = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_0);
+	sob_end_addr = lower_32_bits(CFG_BASE + mmSYNC_MNGR_SOB_OBJ_1023);
+
+	if ((wreg_pkt->value < sob_start_addr) ||
+			(wreg_pkt->value > sob_end_addr)) {
+
+		dev_err(hdev->dev, "WREG32 packet with illegal value 0x%x\n",
+			wreg_pkt->value);
+		return -EPERM;
+	}
+
+	return 0;
+}
+
+static int goya_validate_cb(struct hl_device *hdev,
+			struct hl_cs_parser *parser, bool is_mmu)
+{
+	u32 cb_parsed_length = 0;
+	int rc = 0;
+
+	parser->patched_cb_size = 0;
+
+	/* cb_user_size is more than 0 so loop will always be executed */
+	while (cb_parsed_length < parser->user_cb_size) {
+		enum packet_id pkt_id;
+		u16 pkt_size;
+		void *user_pkt;
+
+		user_pkt = (void *) (parser->user_cb->kernel_address +
+							cb_parsed_length);
+
+		pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
+				PACKET_HEADER_PACKET_ID_MASK) >>
+					PACKET_HEADER_PACKET_ID_SHIFT);
+
+		pkt_size = goya_packet_sizes[pkt_id];
+		cb_parsed_length += pkt_size;
+		if (cb_parsed_length > parser->user_cb_size) {
+			dev_err(hdev->dev,
+				"packet 0x%x is out of CB boundary\n", pkt_id);
+			rc = -EINVAL;
+			break;
+		}
+
+		switch (pkt_id) {
+		case PACKET_WREG_32:
+			/*
+			 * Although it is validated after copy in patch_cb(),
+			 * need to validate here as well because patch_cb() is
+			 * not called in MMU path while this function is called
+			 */
+			rc = goya_validate_wreg32(hdev, parser, user_pkt);
+			break;
+
+		case PACKET_WREG_BULK:
+			dev_err(hdev->dev,
+				"User not allowed to use WREG_BULK\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_MSG_PROT:
+			dev_err(hdev->dev,
+				"User not allowed to use MSG_PROT\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_CP_DMA:
+			dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_STOP:
+			dev_err(hdev->dev, "User not allowed to use STOP\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_LIN_DMA:
+			if (is_mmu)
+				rc = goya_validate_dma_pkt_mmu(hdev, parser,
+						user_pkt);
+			else
+				rc = goya_validate_dma_pkt_no_mmu(hdev, parser,
+						user_pkt);
+			break;
+
+		case PACKET_MSG_LONG:
+		case PACKET_MSG_SHORT:
+		case PACKET_FENCE:
+		case PACKET_NOP:
+			parser->patched_cb_size += pkt_size;
+			break;
+
+		default:
+			dev_err(hdev->dev, "Invalid packet header 0x%x\n",
+				pkt_id);
+			rc = -EINVAL;
+			break;
+		}
+
+		if (rc)
+			break;
+	}
+
+	/*
+	 * The new CB should have space at the end for two MSG_PROT packets:
+	 * 1. A packet that will act as a completion packet
+	 * 2. A packet that will generate MSI-X interrupt
+	 */
+	parser->patched_cb_size += sizeof(struct packet_msg_prot) * 2;
+
+	return rc;
+}
+
+static int goya_patch_dma_packet(struct hl_device *hdev,
+				struct hl_cs_parser *parser,
+				struct packet_lin_dma *user_dma_pkt,
+				struct packet_lin_dma *new_dma_pkt,
+				u32 *new_dma_pkt_size)
+{
+	struct hl_userptr *userptr;
+	struct scatterlist *sg, *sg_next_iter;
+	u32 count, len, dma_desc_cnt, len_next;
+	dma_addr_t dma_addr, dma_addr_next;
+	enum goya_dma_direction user_dir;
+	u64 device_memory_addr, addr;
+	enum dma_data_direction dir;
+	struct sg_table *sgt;
+	bool skip_host_mem_pin = false;
+	bool user_memset;
+	u32 user_rdcomp_mask, user_wrcomp_mask;
+
+	user_dir = (user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_DMA_DIR_MASK) >>
+			GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT;
+
+	user_memset = (user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_MEMSET_MASK) >>
+			GOYA_PKT_LIN_DMA_CTL_MEMSET_SHIFT;
+
+	if ((user_dir == DMA_DRAM_TO_SRAM) || (user_dir == DMA_SRAM_TO_DRAM) ||
+			(user_dma_pkt->tsize == 0)) {
+		memcpy(new_dma_pkt, user_dma_pkt, sizeof(*new_dma_pkt));
+		*new_dma_pkt_size = sizeof(*new_dma_pkt);
+		return 0;
+	}
+
+	if ((user_dir == DMA_HOST_TO_DRAM) || (user_dir == DMA_HOST_TO_SRAM)) {
+		addr = user_dma_pkt->src_addr;
+		device_memory_addr = user_dma_pkt->dst_addr;
+		dir = DMA_TO_DEVICE;
+		if (user_memset)
+			skip_host_mem_pin = true;
+	} else {
+		addr = user_dma_pkt->dst_addr;
+		device_memory_addr = user_dma_pkt->src_addr;
+		dir = DMA_FROM_DEVICE;
+	}
+
+	if ((!skip_host_mem_pin) &&
+		(hl_userptr_is_pinned(hdev, addr, user_dma_pkt->tsize,
+			parser->job_userptr_list, &userptr) == false)) {
+		dev_err(hdev->dev, "Userptr 0x%llx + 0x%x NOT mapped\n",
+				addr, user_dma_pkt->tsize);
+		return -EFAULT;
+	}
+
+	if ((user_memset) && (dir == DMA_TO_DEVICE)) {
+		memcpy(new_dma_pkt, user_dma_pkt, sizeof(*user_dma_pkt));
+		*new_dma_pkt_size = sizeof(*user_dma_pkt);
+		return 0;
+	}
+
+	user_rdcomp_mask =
+			(user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_RDCOMP_MASK);
+
+	user_wrcomp_mask =
+			(user_dma_pkt->ctl & GOYA_PKT_LIN_DMA_CTL_WRCOMP_MASK);
+
+	sgt = userptr->sgt;
+	dma_desc_cnt = 0;
+
+	for_each_sg(sgt->sgl, sg, sgt->nents, count) {
+		len = sg_dma_len(sg);
+		dma_addr = sg_dma_address(sg);
+
+		if (len == 0)
+			break;
+
+		while ((count + 1) < sgt->nents) {
+			sg_next_iter = sg_next(sg);
+			len_next = sg_dma_len(sg_next_iter);
+			dma_addr_next = sg_dma_address(sg_next_iter);
+
+			if (len_next == 0)
+				break;
+
+			if ((dma_addr + len == dma_addr_next) &&
+				(len + len_next <= DMA_MAX_TRANSFER_SIZE)) {
+				len += len_next;
+				count++;
+				sg = sg_next_iter;
+			} else {
+				break;
+			}
+		}
+
+		new_dma_pkt->ctl = user_dma_pkt->ctl;
+		if (likely(dma_desc_cnt))
+			new_dma_pkt->ctl &= ~GOYA_PKT_CTL_EB_MASK;
+		new_dma_pkt->ctl &= ~(GOYA_PKT_LIN_DMA_CTL_RDCOMP_MASK |
+					GOYA_PKT_LIN_DMA_CTL_WRCOMP_MASK);
+		new_dma_pkt->tsize = len;
+
+		dma_addr += hdev->asic_prop.host_phys_base_address;
+
+		if (dir == DMA_TO_DEVICE) {
+			new_dma_pkt->src_addr = dma_addr;
+			new_dma_pkt->dst_addr = device_memory_addr;
+		} else {
+			new_dma_pkt->src_addr = device_memory_addr;
+			new_dma_pkt->dst_addr = dma_addr;
+		}
+
+		if (!user_memset)
+			device_memory_addr += len;
+		dma_desc_cnt++;
+		new_dma_pkt++;
+	}
+
+	if (!dma_desc_cnt) {
+		dev_err(hdev->dev,
+			"Error of 0 SG entries when patching DMA packet\n");
+		return -EFAULT;
+	}
+
+	/* Fix the last dma packet - rdcomp/wrcomp must be as user set them */
+	new_dma_pkt--;
+	new_dma_pkt->ctl |= (user_rdcomp_mask | user_wrcomp_mask);
+
+	*new_dma_pkt_size = dma_desc_cnt * sizeof(struct packet_lin_dma);
+
+	return 0;
+}
+
+static int goya_patch_cb(struct hl_device *hdev,
+				struct hl_cs_parser *parser)
+{
+	u32 cb_parsed_length = 0;
+	u32 cb_patched_cur_length = 0;
+	int rc = 0;
+
+	/* cb_user_size is more than 0 so loop will always be executed */
+	while (cb_parsed_length < parser->user_cb_size) {
+		enum packet_id pkt_id;
+		u16 pkt_size;
+		u32 new_pkt_size = 0;
+		void *user_pkt, *kernel_pkt;
+
+		user_pkt = (void *) (parser->user_cb->kernel_address +
+							cb_parsed_length);
+		kernel_pkt = (void *) (parser->patched_cb->kernel_address +
+							cb_patched_cur_length);
+
+		pkt_id = (enum packet_id) (((*(u64 *) user_pkt) &
+				PACKET_HEADER_PACKET_ID_MASK) >>
+					PACKET_HEADER_PACKET_ID_SHIFT);
+
+		pkt_size = goya_packet_sizes[pkt_id];
+		cb_parsed_length += pkt_size;
+		if (cb_parsed_length > parser->user_cb_size) {
+			dev_err(hdev->dev,
+				"packet 0x%x is out of CB boundary\n", pkt_id);
+			rc = -EINVAL;
+			break;
+		}
+
+		switch (pkt_id) {
+		case PACKET_LIN_DMA:
+			rc = goya_patch_dma_packet(hdev, parser, user_pkt,
+						kernel_pkt, &new_pkt_size);
+			cb_patched_cur_length += new_pkt_size;
+			break;
+
+		case PACKET_WREG_32:
+			memcpy(kernel_pkt, user_pkt, pkt_size);
+			cb_patched_cur_length += pkt_size;
+			rc = goya_validate_wreg32(hdev, parser, kernel_pkt);
+			break;
+
+		case PACKET_WREG_BULK:
+			dev_err(hdev->dev,
+				"User not allowed to use WREG_BULK\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_MSG_PROT:
+			dev_err(hdev->dev,
+				"User not allowed to use MSG_PROT\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_CP_DMA:
+			dev_err(hdev->dev, "User not allowed to use CP_DMA\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_STOP:
+			dev_err(hdev->dev, "User not allowed to use STOP\n");
+			rc = -EPERM;
+			break;
+
+		case PACKET_MSG_LONG:
+		case PACKET_MSG_SHORT:
+		case PACKET_FENCE:
+		case PACKET_NOP:
+			memcpy(kernel_pkt, user_pkt, pkt_size);
+			cb_patched_cur_length += pkt_size;
+			break;
+
+		default:
+			dev_err(hdev->dev, "Invalid packet header 0x%x\n",
+				pkt_id);
+			rc = -EINVAL;
+			break;
+		}
+
+		if (rc)
+			break;
+	}
+
+	return rc;
+}
+
+static int goya_parse_cb_mmu(struct hl_device *hdev,
+		struct hl_cs_parser *parser)
+{
+	u64 patched_cb_handle;
+	u32 patched_cb_size;
+	struct hl_cb *user_cb;
+	int rc;
+
+	/*
+	 * The new CB should have space at the end for two MSG_PROT pkt:
+	 * 1. A packet that will act as a completion packet
+	 * 2. A packet that will generate MSI-X interrupt
+	 */
+	parser->patched_cb_size = parser->user_cb_size +
+			sizeof(struct packet_msg_prot) * 2;
+
+	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr,
+				parser->patched_cb_size,
+				&patched_cb_handle, HL_KERNEL_ASID_ID);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to allocate patched CB for DMA CS %d\n",
+			rc);
+		return rc;
+	}
+
+	patched_cb_handle >>= PAGE_SHIFT;
+	parser->patched_cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr,
+				(u32) patched_cb_handle);
+	/* hl_cb_get should never fail here so use kernel WARN */
+	WARN(!parser->patched_cb, "DMA CB handle invalid 0x%x\n",
+			(u32) patched_cb_handle);
+	if (!parser->patched_cb) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	/*
+	 * The check that parser->user_cb_size <= parser->user_cb->size was done
+	 * in validate_queue_index().
+	 */
+	memcpy((void *) parser->patched_cb->kernel_address,
+		(void *) parser->user_cb->kernel_address,
+		parser->user_cb_size);
+
+	patched_cb_size = parser->patched_cb_size;
+
+	/* validate patched CB instead of user CB */
+	user_cb = parser->user_cb;
+	parser->user_cb = parser->patched_cb;
+	rc = goya_validate_cb(hdev, parser, true);
+	parser->user_cb = user_cb;
+
+	if (rc) {
+		hl_cb_put(parser->patched_cb);
+		goto out;
+	}
+
+	if (patched_cb_size != parser->patched_cb_size) {
+		dev_err(hdev->dev, "user CB size mismatch\n");
+		hl_cb_put(parser->patched_cb);
+		rc = -EINVAL;
+		goto out;
+	}
+
+out:
+	/*
+	 * Always call cb destroy here because we still have 1 reference
+	 * to it by calling cb_get earlier. After the job will be completed,
+	 * cb_put will release it, but here we want to remove it from the
+	 * idr
+	 */
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr,
+					patched_cb_handle << PAGE_SHIFT);
+
+	return rc;
+}
+
+int goya_parse_cb_no_mmu(struct hl_device *hdev, struct hl_cs_parser *parser)
+{
+	u64 patched_cb_handle;
+	int rc;
+
+	rc = goya_validate_cb(hdev, parser, false);
+
+	if (rc)
+		goto free_userptr;
+
+	rc = hl_cb_create(hdev, &hdev->kernel_cb_mgr,
+				parser->patched_cb_size,
+				&patched_cb_handle, HL_KERNEL_ASID_ID);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to allocate patched CB for DMA CS %d\n", rc);
+		goto free_userptr;
+	}
+
+	patched_cb_handle >>= PAGE_SHIFT;
+	parser->patched_cb = hl_cb_get(hdev, &hdev->kernel_cb_mgr,
+				(u32) patched_cb_handle);
+	/* hl_cb_get should never fail here so use kernel WARN */
+	WARN(!parser->patched_cb, "DMA CB handle invalid 0x%x\n",
+			(u32) patched_cb_handle);
+	if (!parser->patched_cb) {
+		rc = -EFAULT;
+		goto out;
+	}
+
+	rc = goya_patch_cb(hdev, parser);
+
+	if (rc)
+		hl_cb_put(parser->patched_cb);
+
+out:
+	/*
+	 * Always call cb destroy here because we still have 1 reference
+	 * to it by calling cb_get earlier. After the job will be completed,
+	 * cb_put will release it, but here we want to remove it from the
+	 * idr
+	 */
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr,
+				patched_cb_handle << PAGE_SHIFT);
+
+free_userptr:
+	if (rc)
+		hl_userptr_delete_list(hdev, parser->job_userptr_list);
+	return rc;
+}
+
+int goya_parse_cb_no_ext_quque(struct hl_device *hdev,
+		struct hl_cs_parser *parser)
+{
+	struct asic_fixed_properties *asic_prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU)) {
+		/* For internal queue jobs, just check if cb address is valid */
+		if (hl_mem_area_inside_range(
+				(u64) parser->user_cb,
+				parser->user_cb_size,
+				asic_prop->sram_user_base_address,
+				asic_prop->sram_end_address))
+			return 0;
+
+		if (hl_mem_area_inside_range(
+				(u64) parser->user_cb,
+				parser->user_cb_size,
+				asic_prop->dram_user_base_address,
+				asic_prop->dram_end_address))
+			return 0;
+
+		dev_err(hdev->dev,
+			"Internal CB address 0x%llx + 0x%x is not in SRAM nor in DRAM\n",
+			(u64) parser->user_cb, parser->user_cb_size);
+
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	if (!parser->ext_queue)
+		return goya_parse_cb_no_ext_quque(hdev, parser);
+
+	if ((goya->hw_cap_initialized & HW_CAP_MMU) && parser->use_virt_addr)
+		return goya_parse_cb_mmu(hdev, parser);
+	else
+		return goya_parse_cb_no_mmu(hdev, parser);
+}
+
+void goya_add_end_of_cb_packets(u64 kernel_address, u32 len, u64 cq_addr,
+				u32 cq_val, u32 msix_vec)
+{
+	struct packet_msg_prot *cq_pkt;
+
+	cq_pkt = (struct packet_msg_prot *) (kernel_address + len -
+					(sizeof(struct packet_msg_prot) * 2));
+
+	cq_pkt->ctl = (PACKET_MSG_PROT << GOYA_PKT_CTL_OPCODE_SHIFT) |
+			(1 << GOYA_PKT_CTL_EB_SHIFT) |
+			(1 << GOYA_PKT_CTL_MB_SHIFT);
+	cq_pkt->value = cq_val;
+	cq_pkt->addr = cq_addr;
+
+	cq_pkt++;
+
+	cq_pkt->ctl = (PACKET_MSG_PROT << GOYA_PKT_CTL_OPCODE_SHIFT) |
+			(1 << GOYA_PKT_CTL_MB_SHIFT);
+	cq_pkt->value = msix_vec & 0x7FF;
+	cq_pkt->addr = CFG_BASE + mmPCIE_DBI_MSIX_DOORBELL_OFF;
+}
+
 static void goya_update_eq_ci(struct hl_device *hdev, u32 val)
 {
 	WREG32(mmPSOC_GLOBAL_CONF_SCRATCHPAD_6, val);
 }
 
+int goya_context_switch(struct hl_device *hdev, u32 asid)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct packet_lin_dma *clear_sram_pkt;
+	struct hl_cs_parser parser;
+	struct hl_cs_job *job;
+	u32 cb_size;
+	struct hl_cb *cb;
+	int rc;
+
+	cb = hl_cb_kernel_create(hdev, PAGE_SIZE);
+	if (!cb)
+		return -EFAULT;
+
+	clear_sram_pkt = (struct packet_lin_dma *) cb->kernel_address;
+	memset(clear_sram_pkt, 0, sizeof(*clear_sram_pkt));
+	cb_size = sizeof(*clear_sram_pkt);
+
+	clear_sram_pkt->ctl = ((PACKET_LIN_DMA << GOYA_PKT_CTL_OPCODE_SHIFT) |
+		(DMA_HOST_TO_SRAM << GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT) |
+		(1 << GOYA_PKT_LIN_DMA_CTL_MEMSET_SHIFT) |
+		(1 << GOYA_PKT_LIN_DMA_CTL_WO_SHIFT) |
+		(1 << GOYA_PKT_CTL_RB_SHIFT) |
+		(1 << GOYA_PKT_CTL_MB_SHIFT));
+
+	clear_sram_pkt->src_addr = 0x7777777777777777ull;
+	clear_sram_pkt->dst_addr = prop->sram_base_address;
+	if (hdev->pldm)
+		clear_sram_pkt->tsize = 0x10000;
+	else
+		clear_sram_pkt->tsize = prop->sram_size;
+
+	job = hl_cs_allocate_job(hdev, true);
+	if (!job) {
+		dev_err(hdev->dev, "Failed to allocate a new job\n");
+		rc = -ENOMEM;
+		goto release_cb;
+	}
+
+	job->id = 0;
+	job->user_cb = cb;
+	job->user_cb->cs_cnt++;
+	job->user_cb_size = cb_size;
+	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
+
+	parser.ctx_id = HL_KERNEL_ASID_ID;
+	parser.cs_sequence = 0;
+	parser.job_id = job->id;
+	parser.hw_queue_id = job->hw_queue_id;
+	parser.job_userptr_list = &job->userptr_list;
+	parser.user_cb = job->user_cb;
+	parser.user_cb_size = job->user_cb_size;
+	parser.ext_queue = job->ext_queue;
+	parser.use_virt_addr = hdev->mmu_enable;
+
+	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to parse kernel CB during context switch\n");
+		goto free_job;
+	}
+
+	job->patched_cb = parser.patched_cb;
+	job->job_cb_size = parser.patched_cb_size;
+	job->patched_cb->cs_cnt++;
+
+	rc = goya_send_job_on_qman0(hdev, job);
+
+	job->patched_cb->cs_cnt--;
+	hl_cb_put(job->patched_cb);
+
+free_job:
+	hl_userptr_delete_list(hdev, &job->userptr_list);
+	kfree(job);
+	cb->cs_cnt--;
+
+release_cb:
+	hl_cb_put(cb);
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb->id << PAGE_SHIFT);
+
+	return rc;
+}
+
+void goya_restore_phase_topology(struct hl_device *hdev)
+{
+	int i, num_of_sob_in_longs, num_of_mon_in_longs;
+
+	num_of_sob_in_longs =
+		((mmSYNC_MNGR_SOB_OBJ_1023 - mmSYNC_MNGR_SOB_OBJ_0) + 4);
+
+	num_of_mon_in_longs =
+		((mmSYNC_MNGR_MON_STATUS_255 - mmSYNC_MNGR_MON_STATUS_0) + 4);
+
+	for (i = 0 ; i < num_of_sob_in_longs ; i += 4)
+		WREG32(mmSYNC_MNGR_SOB_OBJ_0 + i, 0);
+
+	for (i = 0 ; i < num_of_mon_in_longs ; i += 4)
+		WREG32(mmSYNC_MNGR_MON_STATUS_0 + i, 0);
+
+	/* Flush all WREG to prevent race */
+	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
+}
+
 static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
 		u16 event_type, char *axi_name, int len)
 {
@@ -3610,6 +4671,59 @@ static void goya_disable_clock_gating(struct hl_device *hdev)
 
 }
 
+static bool goya_is_device_idle(struct hl_device *hdev)
+{
+	u64 offset, dma_qm_reg, tpc_qm_reg, tpc_cmdq_reg, tpc_cfg_reg;
+	int i;
+
+	offset = mmDMA_QM_1_GLBL_STS0 - mmDMA_QM_0_GLBL_STS0;
+
+	for (i = 0 ; i < DMA_MAX_NUM ; i++) {
+		dma_qm_reg = mmDMA_QM_0_GLBL_STS0 + i * offset;
+
+		if ((RREG32(dma_qm_reg) & DMA_QM_IDLE_MASK) !=
+				DMA_QM_IDLE_MASK)
+			return false;
+	}
+
+	offset = mmTPC1_QM_GLBL_STS0 - mmTPC0_QM_GLBL_STS0;
+
+	for (i = 0 ; i < TPC_MAX_NUM ; i++) {
+		tpc_qm_reg = mmTPC0_QM_GLBL_STS0 + i * offset;
+		tpc_cmdq_reg = mmTPC0_CMDQ_GLBL_STS0 + i * offset;
+		tpc_cfg_reg = mmTPC0_CFG_STATUS + i * offset;
+
+		if ((RREG32(tpc_qm_reg) & TPC_QM_IDLE_MASK) !=
+				TPC_QM_IDLE_MASK)
+			return false;
+
+		if ((RREG32(tpc_cmdq_reg) & TPC_CMDQ_IDLE_MASK) !=
+				TPC_CMDQ_IDLE_MASK)
+			return false;
+
+		if ((RREG32(tpc_cfg_reg) & TPC_CFG_IDLE_MASK) !=
+				TPC_CFG_IDLE_MASK)
+			return false;
+	}
+
+	if ((RREG32(mmMME_QM_GLBL_STS0) & MME_QM_IDLE_MASK) !=
+			MME_QM_IDLE_MASK)
+		return false;
+
+	if ((RREG32(mmMME_CMDQ_GLBL_STS0) & MME_CMDQ_IDLE_MASK) !=
+			MME_CMDQ_IDLE_MASK)
+		return false;
+
+	if ((RREG32(mmMME_ARCH_STATUS) & MME_ARCH_IDLE_MASK) !=
+			MME_ARCH_IDLE_MASK)
+		return false;
+
+	if (RREG32(mmMME_SHADOW_0_STATUS) & MME_SHADOW_IDLE_MASK)
+		return false;
+
+	return true;
+}
+
 static void goya_hw_queues_lock(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -3702,7 +4816,14 @@ static const struct hl_asic_funcs goya_funcs = {
 	.dma_pool_free = goya_dma_pool_free,
 	.cpu_accessible_dma_pool_alloc = goya_cpu_accessible_dma_pool_alloc,
 	.cpu_accessible_dma_pool_free = goya_cpu_accessible_dma_pool_free,
+	.hl_dma_unmap_sg = goya_dma_unmap_sg,
+	.cs_parser = goya_cs_parser,
+	.asic_dma_map_sg = goya_dma_map_sg,
+	.get_dma_desc_list_size = goya_get_dma_desc_list_size,
+	.add_end_of_cb_packets = goya_add_end_of_cb_packets,
 	.update_eq_ci = goya_update_eq_ci,
+	.context_switch = goya_context_switch,
+	.restore_phase_topology = goya_restore_phase_topology,
 	.add_device_attr = goya_add_device_attr,
 	.handle_eqe = goya_handle_eqe,
 	.set_pll_profile = goya_set_pll_profile,
@@ -3710,6 +4831,7 @@ static const struct hl_asic_funcs goya_funcs = {
 	.send_heartbeat = goya_send_heartbeat,
 	.enable_clock_gating = goya_init_clock_gating,
 	.disable_clock_gating = goya_disable_clock_gating,
+	.is_device_idle = goya_is_device_idle,
 	.soft_reset_late_init = goya_soft_reset_late_init,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 4d5cba225700..e8f71f6c8fb4 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -36,6 +36,11 @@
 
 #define HL_MAX_QUEUES			128
 
+#define HL_MAX_JOBS_PER_CS		64
+
+/* MUST BE POWER OF 2 and larger than 1 */
+#define HL_MAX_PENDING_CS		64
+
 struct hl_device;
 struct hl_fpriv;
 
@@ -66,6 +71,16 @@ struct hw_queue_properties {
 	u8			kmd_only;
 };
 
+/**
+ * enum vm_type_t - virtual memory mapping request information.
+ * @VM_TYPE_USERPTR: mapping of user memory to device virtual address.
+ * @VM_TYPE_PHYS_LIST: mapping of DRAM memory to device virtual address.
+ */
+enum vm_type_t {
+	VM_TYPE_USERPTR,
+	VM_TYPE_PHYS_LIST
+};
+
 /**
  * enum hl_device_hw_state - H/W device state. use this to understand whether
  *                           to do reset before hw_init or not
@@ -152,6 +167,19 @@ struct asic_fixed_properties {
 	u8			tpc_enabled_mask;
 };
 
+/**
+ * struct hl_dma_fence - wrapper for fence object used by command submissions.
+ * @base_fence: kernel fence object.
+ * @lock: spinlock to protect fence.
+ * @hdev: habanalabs device structure.
+ * @cs_seq: command submission sequence number.
+ */
+struct hl_dma_fence {
+	struct dma_fence	base_fence;
+	spinlock_t		lock;
+	struct hl_device	*hdev;
+	u64			cs_seq;
+};
 
 /*
  * Command Buffers
@@ -180,6 +208,7 @@ struct hl_cb_mgr {
  * @mmap_size: Holds the CB's size that was mmaped.
  * @size: holds the CB's size.
  * @id: the CB's ID.
+ * @cs_cnt: holds number of CS that this CB participates in.
  * @ctx_id: holds the ID of the owner's context.
  * @mmap: true if the CB is currently mmaped to user.
  * @is_pool: true if CB was acquired from the pool, false otherwise.
@@ -194,6 +223,7 @@ struct hl_cb {
 	u32			mmap_size;
 	u32			size;
 	u32			id;
+	u32			cs_cnt;
 	u32			ctx_id;
 	u8			mmap;
 	u8			is_pool;
@@ -318,6 +348,8 @@ enum hl_asic_type {
 	ASIC_INVALID
 };
 
+struct hl_cs_parser;
+
 /**
  * enum hl_pm_mng_profile - power management profile.
  * @PM_AUTO: internal clock is set by KMD.
@@ -377,7 +409,14 @@ enum hl_pll_frequency {
  * @dma_pool_free: free small DMA allocation from pool.
  * @cpu_accessible_dma_pool_alloc: allocate CPU PQ packet from DMA pool.
  * @cpu_accessible_dma_pool_free: free CPU PQ packet from DMA pool.
+ * @hl_dma_unmap_sg: DMA unmap scatter-gather list.
+ * @cs_parser: parse Command Submission.
+ * @asic_dma_map_sg: DMA map scatter-gather list.
+ * @get_dma_desc_list_size: get number of LIN_DMA packets required for CB.
+ * @add_end_of_cb_packets: Add packets to the end of CB, if device requires it.
  * @update_eq_ci: update event queue CI.
+ * @context_switch: called upon ASID context switch.
+ * @restore_phase_topology: clear all SOBs amd MONs.
  * @add_device_attr: add ASIC specific device attributes.
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
  * @set_pll_profile: change PLL profile (manual/automatic).
@@ -385,6 +424,7 @@ enum hl_pll_frequency {
  * @send_heartbeat: send is-alive packet to ArmCP and verify response.
  * @enable_clock_gating: enable clock gating for reducing power consumption.
  * @disable_clock_gating: disable clock for accessing registers on HBW.
+ * @is_device_idle: return true if device is idle, false otherwise.
  * @soft_reset_late_init: perform certain actions needed after soft reset.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
@@ -424,7 +464,20 @@ struct hl_asic_funcs {
 				size_t size, dma_addr_t *dma_handle);
 	void (*cpu_accessible_dma_pool_free)(struct hl_device *hdev,
 				size_t size, void *vaddr);
+	void (*hl_dma_unmap_sg)(struct hl_device *hdev,
+				struct scatterlist *sg, int nents,
+				enum dma_data_direction dir);
+	int (*cs_parser)(struct hl_device *hdev, struct hl_cs_parser *parser);
+	int (*asic_dma_map_sg)(struct hl_device *hdev,
+				struct scatterlist *sg, int nents,
+				enum dma_data_direction dir);
+	u32 (*get_dma_desc_list_size)(struct hl_device *hdev,
+					struct sg_table *sgt);
+	void (*add_end_of_cb_packets)(u64 kernel_address, u32 len, u64 cq_addr,
+					u32 cq_val, u32 msix_num);
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
+	int (*context_switch)(struct hl_device *hdev, u32 asid);
+	void (*restore_phase_topology)(struct hl_device *hdev);
 	void (*add_device_attr)(struct hl_device *hdev,
 				struct attribute_group *dev_attr_grp);
 	void (*handle_eqe)(struct hl_device *hdev,
@@ -435,6 +488,7 @@ struct hl_asic_funcs {
 	int (*send_heartbeat)(struct hl_device *hdev);
 	void (*enable_clock_gating)(struct hl_device *hdev);
 	void (*disable_clock_gating)(struct hl_device *hdev);
+	bool (*is_device_idle)(struct hl_device *hdev);
 	int (*soft_reset_late_init)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
@@ -458,12 +512,28 @@ struct hl_asic_funcs {
  * @hdev: pointer to the device structure.
  * @refcount: reference counter for the context. Context is released only when
  *		this hits 0l. It is incremented on CS and CS_WAIT.
+ * @cs_pending: array of DMA fence objects representing pending CS.
+ * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
+ *			to user so user could inquire about CS. It is used as
+ *			index to cs_pending array.
+ * @cs_lock: spinlock to protect cs_sequence.
+ * @thread_restore_token: token to prevent multiple threads of the same context
+ *				from running the restore phase. Only one thread
+ *				should run it.
+ * @thread_restore_wait_token: token to prevent the threads that didn't run
+ *				the restore phase from moving to their execution
+ *				phase before the restore phase has finished.
  * @asid: context's unique address space ID in the device's MMU.
  */
 struct hl_ctx {
 	struct hl_fpriv		*hpriv;
 	struct hl_device	*hdev;
 	struct kref		refcount;
+	struct dma_fence	*cs_pending[HL_MAX_PENDING_CS];
+	u64			cs_sequence;
+	spinlock_t		cs_lock;
+	atomic_t		thread_restore_token;
+	u32			thread_restore_wait_token;
 	u32			asid;
 };
 
@@ -478,14 +548,129 @@ struct hl_ctx_mgr {
 };
 
 
+
+/*
+ * COMMAND SUBMISSIONS
+ */
+
+/**
+ * struct hl_userptr - memory mapping chunk information
+ * @vm_type: type of the VM.
+ * @job_node: linked-list node for hanging the object on the Job's list.
+ * @vec: pointer to the frame vector.
+ * @sgt: pointer to the scatter-gather table that holds the pages.
+ * @dir: for DMA unmapping, the direction must be supplied, so save it.
+ * @debugfs_list: node in debugfs list of command submissions.
+ * @addr: user-space virtual pointer to the start of the memory area.
+ * @size: size of the memory area to pin & map.
+ * @dma_mapped: true if the SG was mapped to DMA addresses, false otherwise.
+ */
+struct hl_userptr {
+	enum vm_type_t		vm_type; /* must be first */
+	struct list_head	job_node;
+	struct frame_vector	*vec;
+	struct sg_table		*sgt;
+	enum dma_data_direction dir;
+	struct list_head	debugfs_list;
+	u64			addr;
+	u32			size;
+	u8			dma_mapped;
+};
+
+/**
+ * struct hl_cs - command submission.
+ * @jobs_in_queue_cnt: per each queue, maintain counter of submitted jobs.
+ * @ctx: the context this CS belongs to.
+ * @job_list: list of the CS's jobs in the various queues.
+ * @job_lock: spinlock for the CS's jobs list. Needed for free_job.
+ * @refcount: reference counter for usage of the CS.
+ * @fence: pointer to the fence object of this CS.
+ * @work_tdr: delayed work node for TDR.
+ * @mirror_node : node in device mirror list of command submissions.
+ * @sequence: the sequence number of this CS.
+ * @submitted: true if CS was submitted to H/W.
+ * @completed: true if CS was completed by device.
+ * @timedout : true if CS was timedout.
+ * @tdr_active: true if TDR was activated for this CS (to prevent
+ *		double TDR activation).
+ * @aborted: true if CS was aborted due to some device error.
+ */
+struct hl_cs {
+	u8			jobs_in_queue_cnt[HL_MAX_QUEUES];
+	struct hl_ctx		*ctx;
+	struct list_head	job_list;
+	spinlock_t		job_lock;
+	struct kref		refcount;
+	struct dma_fence	*fence;
+	struct delayed_work	work_tdr;
+	struct list_head	mirror_node;
+	u64			sequence;
+	u8			submitted;
+	u8			completed;
+	u8			timedout;
+	u8			tdr_active;
+	u8			aborted;
+};
+
 /**
  * struct hl_cs_job - command submission job.
+ * @cs_node: the node to hang on the CS jobs list.
+ * @cs: the CS this job belongs to.
+ * @user_cb: the CB we got from the user.
+ * @patched_cb: in case of patching, this is internal CB which is submitted on
+ *		the queue instead of the CB we got from the IOCTL.
  * @finish_work: workqueue object to run when job is completed.
+ * @userptr_list: linked-list of userptr mappings that belong to this job and
+ *			wait for completion.
  * @id: the id of this job inside a CS.
+ * @hw_queue_id: the id of the H/W queue this job is submitted to.
+ * @user_cb_size: the actual size of the CB we got from the user.
+ * @job_cb_size: the actual size of the CB that we put on the queue.
+ * @ext_queue: whether the job is for external queue or internal queue.
  */
 struct hl_cs_job {
+	struct list_head	cs_node;
+	struct hl_cs		*cs;
+	struct hl_cb		*user_cb;
+	struct hl_cb		*patched_cb;
 	struct work_struct	finish_work;
+	struct list_head	userptr_list;
 	u32			id;
+	u32			hw_queue_id;
+	u32			user_cb_size;
+	u32			job_cb_size;
+	u8			ext_queue;
+};
+
+/**
+ * struct hl_cs_parser - command submission paerser properties.
+ * @user_cb: the CB we got from the user.
+ * @patched_cb: in case of patching, this is internal CB which is submitted on
+ *		the queue instead of the CB we got from the IOCTL.
+ * @job_userptr_list: linked-list of userptr mappings that belong to the related
+ *			job and wait for completion.
+ * @cs_sequence: the sequence number of the related CS.
+ * @ctx_id: the ID of the context the related CS belongs to.
+ * @hw_queue_id: the id of the H/W queue this job is submitted to.
+ * @user_cb_size: the actual size of the CB we got from the user.
+ * @patched_cb_size: the size of the CB after parsing.
+ * @ext_queue: whether the job is for external queue or internal queue.
+ * @job_id: the id of the related job inside the related CS.
+ * @use_virt_addr: whether to treat the addresses in the CB as virtual during
+ *			parsing.
+ */
+struct hl_cs_parser {
+	struct hl_cb		*user_cb;
+	struct hl_cb		*patched_cb;
+	struct list_head	*job_userptr_list;
+	u64			cs_sequence;
+	u32			ctx_id;
+	u32			hw_queue_id;
+	u32			user_cb_size;
+	u32			patched_cb_size;
+	u8			ext_queue;
+	u8			job_id;
+	u8			use_virt_addr;
 };
 
 
@@ -502,6 +687,7 @@ struct hl_cs_job {
  * @ctx_mgr: context manager to handle multiple context for this FD.
  * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
  * @refcount: number of related contexts.
+ * @restore_phase_mutex: lock for context switch and restore phase.
  */
 struct hl_fpriv {
 	struct hl_device	*hdev;
@@ -511,6 +697,7 @@ struct hl_fpriv {
 	struct hl_ctx_mgr	ctx_mgr;
 	struct hl_cb_mgr	cb_mgr;
 	struct kref		refcount;
+	struct mutex		restore_phase_mutex;
 };
 
 
@@ -580,6 +767,8 @@ struct hl_device_reset_work {
  * @eq_wq: work queue of event queue for executing work in process context.
  * @kernel_ctx: KMD context structure.
  * @kernel_queues: array of hl_hw_queue.
+ * @hw_queues_mirror_list: CS mirror list for TDR.
+ * @hw_queues_mirror_lock: protects hw_queues_mirror_list.
  * @kernel_cb_mgr: command buffer manager for creating/destroying/handling CGs.
  * @event_queue: event queue for IRQ from ArmCP.
  * @dma_pool: DMA pool for small allocations.
@@ -607,6 +796,7 @@ struct hl_device_reset_work {
  * @in_reset: is device in reset flow.
  * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open user processes.
+ * @timeout_jiffies: device CS timeout value.
  * @max_power: the max power of the device, as configured by the sysadmin. This
  *             value is saved so in case of hard-reset, KMD will restore this
  *             value and update the F/W after the re-initialization
@@ -620,7 +810,10 @@ struct hl_device_reset_work {
  * @hwmon_initialized: is H/W monitor sensors was initialized.
  * @hard_reset_pending: is there a hard reset work pending.
  * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
+ * @reset_on_lockup: true if a reset should be done in case of stuck CS, false
+ *                   otherwise.
  * @init_done: is the initialization of the device done.
+ * @mmu_enable: is MMU enabled.
  */
 struct hl_device {
 	struct pci_dev			*pdev;
@@ -637,6 +830,8 @@ struct hl_device {
 	struct workqueue_struct		*eq_wq;
 	struct hl_ctx			*kernel_ctx;
 	struct hl_hw_queue		*kernel_queues;
+	struct list_head		hw_queues_mirror_list;
+	spinlock_t			hw_queues_mirror_lock;
 	struct hl_cb_mgr		kernel_cb_mgr;
 	struct hl_eq			event_queue;
 	struct dma_pool			*dma_pool;
@@ -664,6 +859,7 @@ struct hl_device {
 	atomic_t			in_reset;
 	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
+	u64				timeout_jiffies;
 	u64				max_power;
 	u32				major;
 	u32				high_pll;
@@ -675,9 +871,11 @@ struct hl_device {
 	u8				hwmon_initialized;
 	u8				hard_reset_pending;
 	u8				heartbeat;
+	u8				reset_on_lockup;
 	u8				init_done;
 
 	/* Parameters for bring-up */
+	u8				mmu_enable;
 	u8				cpu_enable;
 	u8				reset_pcilink;
 	u8				cpu_queues_enable;
@@ -715,6 +913,58 @@ struct hl_ioctl_desc {
  * Kernel module functions that can be accessed by entire module
  */
 
+/**
+ * hl_mem_area_inside_range() - Checks whether address+size are inside a range.
+ * @address: The start address of the area we want to validate.
+ * @size: The size in bytes of the area we want to validate.
+ * @range_start_address: The start address of the valid range.
+ * @range_end_address: The end address of the valid range.
+ *
+ * Return: true if the area is inside the valid range, false otherwise.
+ */
+static inline bool hl_mem_area_inside_range(u64 address, u32 size,
+				u64 range_start_address, u64 range_end_address)
+{
+	u64 end_address = address + size;
+
+	if ((address >= range_start_address) &&
+			(end_address <= range_end_address) &&
+			(end_address > address))
+		return true;
+
+	return false;
+}
+
+/**
+ * hl_mem_area_crosses_range() - Checks whether address+size crossing a range.
+ * @address: The start address of the area we want to validate.
+ * @size: The size in bytes of the area we want to validate.
+ * @range_start_address: The start address of the valid range.
+ * @range_end_address: The end address of the valid range.
+ *
+ * Return: true if the area overlaps part or all of the valid range,
+ *		false otherwise.
+ */
+static inline bool hl_mem_area_crosses_range(u64 address, u32 size,
+				u64 range_start_address, u64 range_end_address)
+{
+	u64 end_address = address + size;
+
+	if ((address >= range_start_address) &&
+			(address < range_end_address))
+		return true;
+
+	if ((end_address >= range_start_address) &&
+			(end_address < range_end_address))
+		return true;
+
+	if ((address < range_start_address) &&
+			(end_address >= range_end_address))
+		return true;
+
+	return false;
+}
+
 int hl_device_open(struct inode *inode, struct file *filp);
 bool hl_device_disabled_or_in_reset(struct hl_device *hdev);
 int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
@@ -728,8 +978,10 @@ int hl_hw_queues_create(struct hl_device *hdev);
 void hl_hw_queues_destroy(struct hl_device *hdev);
 int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
 				u32 cb_size, u64 cb_ptr);
+int hl_hw_queue_schedule_cs(struct hl_cs *cs);
 u32 hl_hw_queue_add_ptr(u32 ptr, u16 val);
 void hl_hw_queue_inc_ci_kernel(struct hl_device *hdev, u32 hw_queue_id);
+void hl_int_hw_queue_update_ci(struct hl_cs *cs);
 void hl_hw_queue_reset(struct hl_device *hdev, bool hard_reset);
 
 #define hl_queue_inc_ptr(p)		hl_hw_queue_add_ptr(p, 1)
@@ -743,6 +995,8 @@ void hl_cq_reset(struct hl_device *hdev, struct hl_cq *q);
 void hl_eq_reset(struct hl_device *hdev, struct hl_eq *q);
 irqreturn_t hl_irq_handler_cq(int irq, void *arg);
 irqreturn_t hl_irq_handler_eq(int irq, void *arg);
+u32 hl_cq_inc_ptr(u32 ptr);
+
 int hl_asid_init(struct hl_device *hdev);
 void hl_asid_fini(struct hl_device *hdev);
 unsigned long hl_asid_alloc(struct hl_device *hdev);
@@ -751,9 +1005,13 @@ void hl_asid_free(struct hl_device *hdev, unsigned long asid);
 int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv);
 void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx);
 int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx);
+void hl_ctx_do_release(struct kref *ref);
+void hl_ctx_get(struct hl_device *hdev,	struct hl_ctx *ctx);
 int hl_ctx_put(struct hl_ctx *ctx);
+struct dma_fence *hl_ctx_get_fence(struct hl_ctx *ctx, u64 seq);
 void hl_ctx_mgr_init(struct hl_ctx_mgr *mgr);
 void hl_ctx_mgr_fini(struct hl_device *hdev, struct hl_ctx_mgr *mgr);
+
 int hl_device_init(struct hl_device *hdev, struct class *hclass);
 void hl_device_fini(struct hl_device *hdev);
 int hl_device_suspend(struct hl_device *hdev);
@@ -785,8 +1043,20 @@ struct hl_cb *hl_cb_kernel_create(struct hl_device *hdev, u32 cb_size);
 int hl_cb_pool_init(struct hl_device *hdev);
 int hl_cb_pool_fini(struct hl_device *hdev);
 
+void hl_cs_rollback_all(struct hl_device *hdev);
+struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue);
+
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
+			struct hl_userptr *userptr);
+int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr);
+void hl_userptr_delete_list(struct hl_device *hdev,
+				struct list_head *userptr_list);
+bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr, u32 size,
+				struct list_head *userptr_list,
+				struct hl_userptr **userptr);
+
 long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
 void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
 long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
@@ -802,5 +1072,7 @@ void hl_set_max_power(struct hl_device *hdev, u64 value);
 /* IOCTLs */
 long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
+int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data);
+int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data);
 
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 0702f6f974eb..dbc0c9c2c99a 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -28,6 +28,17 @@ static struct class *hl_class;
 DEFINE_IDR(hl_devs_idr);
 DEFINE_MUTEX(hl_devs_idr_lock);
 
+static int timeout_locked = 5;
+static int reset_on_lockup = 1;
+
+module_param(timeout_locked, int, 0444);
+MODULE_PARM_DESC(timeout_locked,
+	"Device lockup timeout in seconds (0 = disabled, default 5s)");
+
+module_param(reset_on_lockup, int, 0444);
+MODULE_PARM_DESC(reset_on_lockup,
+	"Do device reset on lockup (0 = no, 1 = yes, default yes)");
+
 #define PCI_VENDOR_ID_HABANALABS	0x1da3
 
 #define PCI_IDS_GOYA			0x0001
@@ -117,6 +128,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	hpriv->hdev = hdev;
 	filp->private_data = hpriv;
 	hpriv->filp = filp;
+	mutex_init(&hpriv->restore_phase_mutex);
 	kref_init(&hpriv->refcount);
 	nonseekable_open(inode, filp);
 
@@ -144,6 +156,7 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	filp->private_data = NULL;
 	hl_ctx_mgr_fini(hpriv->hdev, &hpriv->ctx_mgr);
 	hl_cb_mgr_fini(hpriv->hdev, &hpriv->cb_mgr);
+	mutex_destroy(&hpriv->restore_phase_mutex);
 	kfree(hpriv);
 
 close_device:
@@ -176,8 +189,10 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 		return -ENOMEM;
 
 	hdev->major = hl_major;
+	hdev->reset_on_lockup = reset_on_lockup;
 
 	/* Parameters for bring-up - set them to defaults */
+	hdev->mmu_enable = 0;
 	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
 	hdev->cpu_queues_enable = 1;
@@ -197,6 +212,11 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	if (!hdev->cpu_queues_enable)
 		hdev->heartbeat = 0;
 
+	if (timeout_locked)
+		hdev->timeout_jiffies = msecs_to_jiffies(timeout_locked * 1000);
+	else
+		hdev->timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
+
 	hdev->disabled = true;
 	hdev->pdev = pdev; /* can be NULL in case of simulator device */
 
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index f406c3eb6316..f93649a63a9e 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -16,7 +16,9 @@
 	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
 
 static const struct hl_ioctl_desc hl_ioctls[] = {
-	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl)
+	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
+	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
+	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl)
 };
 
 #define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
diff --git a/drivers/misc/habanalabs/hw_queue.c b/drivers/misc/habanalabs/hw_queue.c
index 6841e23f1b30..29bd525bb3bd 100644
--- a/drivers/misc/habanalabs/hw_queue.c
+++ b/drivers/misc/habanalabs/hw_queue.c
@@ -37,6 +37,29 @@ static inline int queue_free_slots(struct hl_hw_queue *q, u32 queue_len)
 		return (abs(delta) - queue_len);
 }
 
+void hl_int_hw_queue_update_ci(struct hl_cs *cs)
+{
+	struct hl_device *hdev = cs->ctx->hdev;
+	struct hl_hw_queue *q;
+	int i;
+
+	hdev->asic_funcs->hw_queues_lock(hdev);
+
+	if (hdev->disabled)
+		goto out;
+
+	q = &hdev->kernel_queues[0];
+	for (i = 0 ; i < HL_MAX_QUEUES ; i++, q++) {
+		if (q->queue_type == QUEUE_TYPE_INT) {
+			q->ci += cs->jobs_in_queue_cnt[i];
+			q->ci &= ((q->int_queue_len << 1) - 1);
+		}
+	}
+
+out:
+	hdev->asic_funcs->hw_queues_unlock(hdev);
+}
+
 /*
  * ext_queue_submit_bd - Submit a buffer descriptor to an external queue
  *
@@ -122,6 +145,37 @@ static int ext_queue_sanity_checks(struct hl_device *hdev,
 	return 0;
 }
 
+/*
+ * int_queue_sanity_checks - perform some sanity checks on internal queue
+ *
+ * @hdev              : pointer to hl_device structure
+ * @q                 :	pointer to hl_hw_queue structure
+ * @num_of_entries    : how many entries to check for space
+ *
+ * H/W queues spinlock should be taken before calling this function
+ *
+ * Perform the following:
+ * - Make sure we have enough space in the h/w queue
+ *
+ */
+static int int_queue_sanity_checks(struct hl_device *hdev,
+					struct hl_hw_queue *q,
+					int num_of_entries)
+{
+	int free_slots_cnt;
+
+	/* Check we have enough space in the queue */
+	free_slots_cnt = queue_free_slots(q, q->int_queue_len);
+
+	if (free_slots_cnt < num_of_entries) {
+		dev_dbg(hdev->dev, "Queue %d doesn't have room for %d CBs\n",
+			q->hw_queue_id, num_of_entries);
+		return -EAGAIN;
+	}
+
+	return 0;
+}
+
 /*
  * hl_hw_queue_send_cb_no_cmpl - send a single CB (not a JOB) without completion
  *
@@ -168,6 +222,184 @@ int hl_hw_queue_send_cb_no_cmpl(struct hl_device *hdev, u32 hw_queue_id,
 	return rc;
 }
 
+/*
+ * ext_hw_queue_schedule_job - submit an JOB to an external queue
+ *
+ * @job: pointer to the job that needs to be submitted to the queue
+ *
+ * This function must be called when the scheduler mutex is taken
+ *
+ */
+static void ext_hw_queue_schedule_job(struct hl_cs_job *job)
+{
+	struct hl_device *hdev = job->cs->ctx->hdev;
+	struct hl_hw_queue *q = &hdev->kernel_queues[job->hw_queue_id];
+	struct hl_cq_entry cq_pkt;
+	struct hl_cq *cq;
+	u64 cq_addr;
+	struct hl_cb *cb;
+	u32 ctl;
+	u32 len;
+	u64 ptr;
+
+	/*
+	 * Update the JOB ID inside the BD CTL so the device would know what
+	 * to write in the completion queue
+	 */
+	ctl = ((q->pi << BD_CTL_SHADOW_INDEX_SHIFT) & BD_CTL_SHADOW_INDEX_MASK);
+
+	cb = job->patched_cb;
+	len = job->job_cb_size;
+	ptr = cb->bus_address;
+
+	cq_pkt.data = (q->pi << CQ_ENTRY_SHADOW_INDEX_SHIFT)
+					& CQ_ENTRY_SHADOW_INDEX_MASK;
+	cq_pkt.data |= 1 << CQ_ENTRY_SHADOW_INDEX_VALID_SHIFT;
+	cq_pkt.data |= 1 << CQ_ENTRY_READY_SHIFT;
+
+	/*
+	 * No need to protect pi_offset because scheduling to the
+	 * H/W queues is done under the scheduler mutex
+	 *
+	 * No need to check if CQ is full because it was already
+	 * checked in hl_queue_sanity_checks
+	 */
+	cq = &hdev->completion_queue[q->hw_queue_id];
+	cq_addr = cq->bus_address +
+			hdev->asic_prop.host_phys_base_address;
+	cq_addr += cq->pi * sizeof(struct hl_cq_entry);
+
+	hdev->asic_funcs->add_end_of_cb_packets(cb->kernel_address, len,
+				cq_addr, cq_pkt.data, q->hw_queue_id);
+
+	q->shadow_queue[hl_pi_2_offset(q->pi)] = job;
+
+	cq->pi = hl_cq_inc_ptr(cq->pi);
+
+	ext_queue_submit_bd(hdev, q, ctl, len, ptr);
+}
+
+/*
+ * int_hw_queue_schedule_job - submit an JOB to an internal queue
+ *
+ * @job: pointer to the job that needs to be submitted to the queue
+ *
+ * This function must be called when the scheduler mutex is taken
+ *
+ */
+static void int_hw_queue_schedule_job(struct hl_cs_job *job)
+{
+	struct hl_device *hdev = job->cs->ctx->hdev;
+	struct hl_hw_queue *q = &hdev->kernel_queues[job->hw_queue_id];
+	struct hl_bd bd;
+	u64 *pi, *pbd = (u64 *) &bd;
+
+	bd.ctl = 0;
+	bd.len = job->job_cb_size;
+	bd.ptr = (u64) job->user_cb;
+
+	pi = (u64 *) (q->kernel_address +
+		((q->pi & (q->int_queue_len - 1)) * sizeof(bd)));
+
+	pi[0] = pbd[0];
+	pi[1] = pbd[1];
+
+	q->pi++;
+	q->pi &= ((q->int_queue_len << 1) - 1);
+
+	/* Flush PQ entry write. Relevant only for specific ASICs */
+	hdev->asic_funcs->flush_pq_write(hdev, pi, pbd[0]);
+
+	hdev->asic_funcs->ring_doorbell(hdev, q->hw_queue_id, q->pi);
+}
+
+/*
+ * hl_hw_queue_schedule_cs - schedule a command submission
+ *
+ * @job        : pointer to the CS
+ *
+ */
+int hl_hw_queue_schedule_cs(struct hl_cs *cs)
+{
+	struct hl_device *hdev = cs->ctx->hdev;
+	struct hl_cs_job *job, *tmp;
+	struct hl_hw_queue *q;
+	int rc = 0, i, cq_cnt;
+
+	hdev->asic_funcs->hw_queues_lock(hdev);
+
+	if (hl_device_disabled_or_in_reset(hdev)) {
+		dev_err(hdev->dev,
+			"device is disabled or in reset, CS rejected!\n");
+		rc = -EPERM;
+		goto out;
+	}
+
+	q = &hdev->kernel_queues[0];
+	/* This loop assumes all external queues are consecutive */
+	for (i = 0, cq_cnt = 0 ; i < HL_MAX_QUEUES ; i++, q++) {
+		if (q->queue_type == QUEUE_TYPE_EXT) {
+			if (cs->jobs_in_queue_cnt[i]) {
+				rc = ext_queue_sanity_checks(hdev, q,
+					cs->jobs_in_queue_cnt[i], true);
+				if (rc)
+					goto unroll_cq_resv;
+				cq_cnt++;
+			}
+		} else if (q->queue_type == QUEUE_TYPE_INT) {
+			if (cs->jobs_in_queue_cnt[i]) {
+				rc = int_queue_sanity_checks(hdev, q,
+					cs->jobs_in_queue_cnt[i]);
+				if (rc)
+					goto unroll_cq_resv;
+			}
+		}
+	}
+
+	spin_lock(&hdev->hw_queues_mirror_lock);
+	list_add_tail(&cs->mirror_node, &hdev->hw_queues_mirror_list);
+
+	/* Queue TDR if the CS is the first entry and if timeout is wanted */
+	if ((hdev->timeout_jiffies != MAX_SCHEDULE_TIMEOUT) &&
+			(list_first_entry(&hdev->hw_queues_mirror_list,
+					struct hl_cs, mirror_node) == cs)) {
+		cs->tdr_active = true;
+		schedule_delayed_work(&cs->work_tdr, hdev->timeout_jiffies);
+		spin_unlock(&hdev->hw_queues_mirror_lock);
+	} else {
+		spin_unlock(&hdev->hw_queues_mirror_lock);
+	}
+
+	list_for_each_entry_safe(job, tmp, &cs->job_list, cs_node) {
+		if (job->ext_queue)
+			ext_hw_queue_schedule_job(job);
+		else
+			int_hw_queue_schedule_job(job);
+	}
+
+	cs->submitted = true;
+
+	goto out;
+
+unroll_cq_resv:
+	/* This loop assumes all external queues are consecutive */
+	q = &hdev->kernel_queues[0];
+	for (i = 0 ; (i < HL_MAX_QUEUES) && (cq_cnt > 0) ; i++, q++) {
+		if ((q->queue_type == QUEUE_TYPE_EXT) &&
+				(cs->jobs_in_queue_cnt[i])) {
+			atomic_t *free_slots =
+				&hdev->completion_queue[i].free_slots_cnt;
+			atomic_add(cs->jobs_in_queue_cnt[i], free_slots);
+			cq_cnt--;
+		}
+	}
+
+out:
+	hdev->asic_funcs->hw_queues_unlock(hdev);
+
+	return rc;
+}
+
 /*
  * hl_hw_queue_inc_ci_kernel - increment ci for kernel's queue
  *
diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
new file mode 100644
index 000000000000..47e110b4f76e
--- /dev/null
+++ b/drivers/misc/habanalabs/memory.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/genalloc.h>
+
+/*
+ * hl_pin_host_memory - pins a chunk of host memory
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @addr                : the user-space virtual address of the memory area
+ * @size                : the size of the memory area
+ * @userptr	        : pointer to hl_userptr structure
+ *
+ * This function does the following:
+ * - Pins the physical pages
+ * - Create a SG list from those pages
+ */
+int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
+			struct hl_userptr *userptr)
+{
+	u64 start, end;
+	u32 npages, offset;
+	int rc;
+
+	if (!size) {
+		dev_err(hdev->dev, "size to pin is invalid - %d\n",
+			size);
+		return -EINVAL;
+	}
+
+	if (!access_ok((void __user *) addr, size)) {
+		dev_err(hdev->dev, "user pointer is invalid - 0x%llx\n",
+			addr);
+		return -EFAULT;
+	}
+
+	/*
+	 * If the combination of the address and size requested for this memory
+	 * region causes an integer overflow, return error.
+	 */
+	if (((addr + size) < addr) ||
+			PAGE_ALIGN(addr + size) < (addr + size)) {
+		dev_err(hdev->dev,
+			"user pointer 0x%llx + %u causes integer overflow\n",
+			addr, size);
+		return -EINVAL;
+	}
+
+	start = addr & PAGE_MASK;
+	offset = addr & ~PAGE_MASK;
+	end = PAGE_ALIGN(addr + size);
+	npages = (end - start) >> PAGE_SHIFT;
+
+	userptr->size = size;
+	userptr->addr = addr;
+	userptr->dma_mapped = false;
+	INIT_LIST_HEAD(&userptr->job_node);
+
+	userptr->vec = frame_vector_create(npages);
+	if (!userptr->vec) {
+		dev_err(hdev->dev, "Failed to create frame vector\n");
+		return -ENOMEM;
+	}
+
+	rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
+				userptr->vec);
+
+	if (rc != npages) {
+		dev_err(hdev->dev,
+			"Failed to map host memory, user ptr probably wrong\n");
+		if (rc < 0)
+			goto destroy_framevec;
+		rc = -EFAULT;
+		goto put_framevec;
+	}
+
+	if (frame_vector_to_pages(userptr->vec) < 0) {
+		dev_err(hdev->dev,
+			"Failed to translate frame vector to pages\n");
+		rc = -EFAULT;
+		goto put_framevec;
+	}
+
+	userptr->sgt = kzalloc(sizeof(*userptr->sgt), GFP_ATOMIC);
+	if (!userptr->sgt) {
+		rc = -ENOMEM;
+		goto put_framevec;
+	}
+
+	rc = sg_alloc_table_from_pages(userptr->sgt,
+					frame_vector_pages(userptr->vec),
+					npages, offset, size, GFP_ATOMIC);
+	if (rc < 0) {
+		dev_err(hdev->dev, "failed to create SG table from pages\n");
+		goto free_sgt;
+	}
+
+	return 0;
+
+free_sgt:
+	kfree(userptr->sgt);
+put_framevec:
+	put_vaddr_frames(userptr->vec);
+destroy_framevec:
+	frame_vector_destroy(userptr->vec);
+	return rc;
+}
+
+/*
+ * hl_unpin_host_memory - unpins a chunk of host memory
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @userptr             : pointer to hl_userptr structure
+ *
+ * This function does the following:
+ * - Unpins the physical pages related to the host memory
+ * - Free the SG list
+ */
+int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
+{
+	struct page **pages;
+
+	if (userptr->dma_mapped)
+		hdev->asic_funcs->hl_dma_unmap_sg(hdev,
+				userptr->sgt->sgl,
+				userptr->sgt->nents,
+				userptr->dir);
+
+	pages = frame_vector_pages(userptr->vec);
+	if (!IS_ERR(pages)) {
+		int i;
+
+		for (i = 0; i < frame_vector_count(userptr->vec); i++)
+			set_page_dirty_lock(pages[i]);
+	}
+	put_vaddr_frames(userptr->vec);
+	frame_vector_destroy(userptr->vec);
+
+	list_del(&userptr->job_node);
+
+	sg_free_table(userptr->sgt);
+	kfree(userptr->sgt);
+
+	return 0;
+}
+
+/*
+ * hl_userptr_delete_list - clear userptr list
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @userptr_list        : pointer to the list to clear
+ *
+ * This function does the following:
+ * - Iterates over the list and unpins the host memory and frees the userptr
+ *   structure.
+ */
+void hl_userptr_delete_list(struct hl_device *hdev,
+				struct list_head *userptr_list)
+{
+	struct hl_userptr *userptr, *tmp;
+
+	list_for_each_entry_safe(userptr, tmp, userptr_list, job_node) {
+		hl_unpin_host_memory(hdev, userptr);
+		kfree(userptr);
+	}
+
+	INIT_LIST_HEAD(userptr_list);
+}
+
+/*
+ * hl_userptr_is_pinned - returns whether the given userptr is pinned
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @userptr_list        : pointer to the list to clear
+ * @userptr             : pointer to userptr to check
+ *
+ * This function does the following:
+ * - Iterates over the list and checks if the given userptr is in it, means is
+ *   pinned. If so, returns true, otherwise returns false.
+ */
+bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr,
+				u32 size, struct list_head *userptr_list,
+				struct hl_userptr **userptr)
+{
+	list_for_each_entry((*userptr), userptr_list, job_node) {
+		if ((addr == (*userptr)->addr) && (size == (*userptr)->size))
+			return true;
+	}
+
+	return false;
+}
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index 756266cf0416..fba49417f607 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -73,6 +73,95 @@ union hl_cb_args {
 	struct hl_cb_out out;
 };
 
+/*
+ * This structure size must always be fixed to 64-bytes for backward
+ * compatibility
+ */
+struct hl_cs_chunk {
+	/*
+	 * For external queue, this represents a Handle of CB on the Host
+	 * For internal queue, this represents an SRAM or DRAM address of the
+	 * internal CB
+	 */
+	__u64 cb_handle;
+	/* Index of queue to put the CB on */
+	__u32 queue_index;
+	/*
+	 * Size of command buffer with valid packets
+	 * Can be smaller then actual CB size
+	 */
+	__u32 cb_size;
+	/* HL_CS_CHUNK_FLAGS_* */
+	__u32 cs_chunk_flags;
+	/* Align structure to 64 bytes */
+	__u32 pad[11];
+};
+
+#define HL_CS_FLAGS_FORCE_RESTORE	0x1
+
+#define HL_CS_STATUS_SUCCESS		0
+
+struct hl_cs_in {
+	/* this holds address of array of hl_cs_chunk for restore phase */
+	__u64 chunks_restore;
+	/* this holds address of array of hl_cs_chunk for execution phase */
+	__u64 chunks_execute;
+	/* this holds address of array of hl_cs_chunk for store phase -
+	 * Currently not in use
+	 */
+	__u64 chunks_store;
+	/* Number of chunks in restore phase array */
+	__u32 num_chunks_restore;
+	/* Number of chunks in execution array */
+	__u32 num_chunks_execute;
+	/* Number of chunks in restore phase array - Currently not in use */
+	__u32 num_chunks_store;
+	/* HL_CS_FLAGS_* */
+	__u32 cs_flags;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+};
+
+struct hl_cs_out {
+	/* this holds the sequence number of the CS to pass to wait ioctl */
+	__u64 seq;
+	/* HL_CS_STATUS_* */
+	__u32 status;
+	__u32 pad;
+};
+
+union hl_cs_args {
+	struct hl_cs_in in;
+	struct hl_cs_out out;
+};
+
+struct hl_wait_cs_in {
+	/* Command submission sequence number */
+	__u64 seq;
+	/* Absolute timeout to wait in microseconds */
+	__u64 timeout_us;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
+#define HL_WAIT_CS_STATUS_COMPLETED	0
+#define HL_WAIT_CS_STATUS_BUSY		1
+#define HL_WAIT_CS_STATUS_TIMEDOUT	2
+#define HL_WAIT_CS_STATUS_ABORTED	3
+#define HL_WAIT_CS_STATUS_INTERRUPTED	4
+
+struct hl_wait_cs_out {
+	/* HL_WAIT_CS_STATUS_* */
+	__u32 status;
+	__u32 pad;
+};
+
+union hl_wait_cs_args {
+	struct hl_wait_cs_in in;
+	struct hl_wait_cs_out out;
+};
+
 /*
  * Command Buffer
  * - Request a Command Buffer
@@ -89,7 +178,74 @@ union hl_cb_args {
 #define HL_IOCTL_CB		\
 		_IOWR('H', 0x02, union hl_cb_args)
 
+/*
+ * Command Submission
+ *
+ * To submit work to the device, the user need to call this IOCTL with a set
+ * of JOBS. That set of JOBS constitutes a CS object.
+ * Each JOB will be enqueued on a specific queue, according to the user's input.
+ * There can be more then one JOB per queue.
+ *
+ * There are two types of queues - external and internal. External queues
+ * are DMA queues which transfer data from/to the Host. All other queues are
+ * internal. The driver will get completion notifications from the device only
+ * on JOBS which are enqueued in the external queues.
+ *
+ * This IOCTL is asynchronous in regard to the actual execution of the CS. This
+ * means it returns immediately after ALL the JOBS were enqueued on their
+ * relevant queues. Therefore, the user mustn't assume the CS has been completed
+ * or has even started to execute.
+ *
+ * Upon successful enqueue, the IOCTL returns an opaque handle which the user
+ * can use with the "Wait for CS" IOCTL to check whether the handle's CS
+ * external JOBS have been completed. Note that if the CS has internal JOBS
+ * which can execute AFTER the external JOBS have finished, the driver might
+ * report that the CS has finished executing BEFORE the internal JOBS have
+ * actually finish executing.
+ *
+ * The CS IOCTL will receive three sets of JOBS. One set is for "restore" phase,
+ * a second set is for "execution" phase and a third set is for "store" phase.
+ * The JOBS on the "restore" phase are enqueued only after context-switch
+ * (or if its the first CS for this context). The user can also order the
+ * driver to run the "restore" phase explicitly
+ *
+ */
+#define HL_IOCTL_CS			\
+		_IOWR('H', 0x03, union hl_cs_args)
+
+/*
+ * Wait for Command Submission
+ *
+ * The user can call this IOCTL with a handle it received from the CS IOCTL
+ * to wait until the handle's CS has finished executing. The user will wait
+ * inside the kernel until the CS has finished or until the user-requeusted
+ * timeout has expired.
+ *
+ * The return value of the IOCTL is a standard Linux error code. The possible
+ * values are:
+ *
+ * EINTR     - Kernel waiting has been interrupted, e.g. due to OS signal
+ *             that the user process received
+ * ETIMEDOUT - The CS has caused a timeout on the device
+ * EIO       - The CS was aborted (usually because the device was reset)
+ * ENODEV    - The device wants to do hard-reset (so user need to close FD)
+ *
+ * The driver also returns a custom define inside the IOCTL which can be:
+ *
+ * HL_WAIT_CS_STATUS_COMPLETED   - The CS has been completed successfully (0)
+ * HL_WAIT_CS_STATUS_BUSY        - The CS is still executing (0)
+ * HL_WAIT_CS_STATUS_TIMEDOUT    - The CS has caused a timeout on the device
+ *                                 (ETIMEDOUT)
+ * HL_WAIT_CS_STATUS_ABORTED     - The CS was aborted, usually because the
+ *                                 device was reset (EIO)
+ * HL_WAIT_CS_STATUS_INTERRUPTED - Waiting for the CS was interrupted (EINTR)
+ *
+ */
+
+#define HL_IOCTL_WAIT_CS			\
+		_IOWR('H', 0x04, union hl_wait_cs_args)
+
 #define HL_COMMAND_START	0x02
-#define HL_COMMAND_END		0x03
+#define HL_COMMAND_END		0x05
 
 #endif /* HABANALABS_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 12/15] habanalabs: add virtual memory and MMU modules
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (9 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 11/15] habanalabs: add command submission module Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 17:03   ` Mike Rapoport
  2019-02-11 15:17 ` [PATCH v4 13/15] habanalabs: implement INFO IOCTL Oded Gabbay
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe, Omer Shpigelman

From: Omer Shpigelman <oshpigelman@habana.ai>

This patch adds the Virtual Memory and MMU modules.

Goya has an internal MMU which provides process isolation on the internal
DDR. The internal MMU also performs translations for transactions that go
from Goya to the Host.

The driver is responsible for allocating and freeing memory on the DDR
upon user request. It also provides an interface to map and unmap DDR and
Host memory to the device address space.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Return number of ptes in dec pte
  - Rename dec/inc_num_of_ptes to put/get_pte
  - Use common function to get next hop address
  - Fold alloc of new hop with finding next hop
  - Invert logic for freeing ptes
  - Support more pages sizes in MMU
  - Use -ENOTTY to signal bad ioctl parameter in memory ioctl
 
 drivers/misc/habanalabs/Makefile              |    2 +-
 drivers/misc/habanalabs/context.c             |   19 +-
 drivers/misc/habanalabs/device.c              |   20 +-
 drivers/misc/habanalabs/goya/goya.c           |  393 +++++
 drivers/misc/habanalabs/habanalabs.h          |  188 +-
 drivers/misc/habanalabs/habanalabs_drv.c      |    2 +-
 drivers/misc/habanalabs/habanalabs_ioctl.c    |    3 +-
 .../include/hw_ip/mmu/mmu_general.h           |   45 +
 .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
 drivers/misc/habanalabs/memory.c              | 1515 +++++++++++++++++
 drivers/misc/habanalabs/mmu.c                 |  690 ++++++++
 include/uapi/misc/habanalabs.h                |  122 +-
 12 files changed, 3006 insertions(+), 8 deletions(-)
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
 create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
 create mode 100644 drivers/misc/habanalabs/mmu.c

diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index d2fd0e18b1eb..fd46f8b48bab 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -6,7 +6,7 @@ obj-m	:= habanalabs.o
 
 habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
 		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
-		command_submission.o
+		command_submission.o mmu.o
 
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
index 98710646de6c..6ff0f2103d8d 100644
--- a/drivers/misc/habanalabs/context.c
+++ b/drivers/misc/habanalabs/context.c
@@ -26,8 +26,10 @@ static void hl_ctx_fini(struct hl_ctx *ctx)
 	for (i = 0 ; i < HL_MAX_PENDING_CS ; i++)
 		dma_fence_put(ctx->cs_pending[i]);
 
-	if (ctx->asid != HL_KERNEL_ASID_ID)
+	if (ctx->asid != HL_KERNEL_ASID_ID) {
+		hl_vm_ctx_fini(ctx);
 		hl_asid_free(hdev, ctx->asid);
+	}
 }
 
 void hl_ctx_do_release(struct kref *ref)
@@ -97,6 +99,8 @@ void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
 
 int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 {
+	int rc = 0;
+
 	ctx->hdev = hdev;
 
 	kref_init(&ctx->refcount);
@@ -114,9 +118,22 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
 			dev_err(hdev->dev, "No free ASID, failed to create context\n");
 			return -ENOMEM;
 		}
+
+		rc = hl_vm_ctx_init(ctx);
+		if (rc) {
+			dev_err(hdev->dev, "Failed to init mem ctx module\n");
+			rc = -ENOMEM;
+			goto mem_ctx_err;
+		}
 	}
 
 	return 0;
+
+mem_ctx_err:
+	if (ctx->asid != HL_KERNEL_ASID_ID)
+		hl_asid_free(hdev, ctx->asid);
+
+	return rc;
 }
 
 void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 21496882be3a..5c850d3574eb 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -604,8 +604,10 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, hard_reset);
 
-	if (hard_reset)
+	if (hard_reset) {
+		hl_vm_fini(hdev);
 		hl_eq_reset(hdev, &hdev->event_queue);
+	}
 
 	/* Re-initialize PI,CI to 0 in all queues (hw queue, cq) */
 	hl_hw_queue_reset(hdev, hard_reset);
@@ -666,6 +668,13 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 			goto out_err;
 		}
 
+		rc = hl_vm_init(hdev);
+		if (rc) {
+			dev_err(hdev->dev,
+				"Failed to init memory module after hard reset\n");
+			goto out_err;
+		}
+
 		hl_set_max_power(hdev, hdev->max_power);
 
 		hdev->hard_reset_pending = false;
@@ -850,6 +859,13 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		hdev->asic_name,
 		hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
 
+	rc = hl_vm_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to initialize memory module\n");
+		rc = 0;
+		goto out_disabled;
+	}
+
 	rc = hl_hwmon_init(hdev);
 	if (rc) {
 		dev_err(hdev->dev, "Failed to initialize hwmon\n");
@@ -961,6 +977,8 @@ void hl_device_fini(struct hl_device *hdev)
 	/* Reset the H/W. It will be in idle state after this returns */
 	hdev->asic_funcs->hw_fini(hdev, true);
 
+	hl_vm_fini(hdev);
+
 	hl_eq_fini(hdev, &hdev->event_queue);
 
 	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 9fda232139ea..bba159fd7755 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -6,6 +6,8 @@
  */
 
 #include "goyaP.h"
+#include "include/hw_ip/mmu/mmu_general.h"
+#include "include/hw_ip/mmu/mmu_v1_0.h"
 #include "include/goya/asic_reg/goya_masks.h"
 
 #include <linux/fs.h>
@@ -87,6 +89,7 @@
 #define GOYA_PLDM_RESET_WAIT_MSEC	1000		/* 1s */
 #define GOYA_CPU_TIMEOUT_USEC		10000000	/* 10s */
 #define GOYA_TEST_QUEUE_WAIT_USEC	100000		/* 100ms */
+#define GOYA_PLDM_MMU_TIMEOUT_USEC	(MMU_CONFIG_TIMEOUT_USEC * 100)
 
 #define GOYA_QMAN0_FENCE_VAL		0xD169B243
 
@@ -138,6 +141,70 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
 	"MMU"
 };
 
+static u64 goya_mmu_regs[GOYA_MMU_REGS_NUM] = {
+	mmDMA_QM_0_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_1_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_2_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_3_GLBL_NON_SECURE_PROPS,
+	mmDMA_QM_4_GLBL_NON_SECURE_PROPS,
+	mmTPC0_QM_GLBL_SECURE_PROPS,
+	mmTPC0_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC0_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC0_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC0_CFG_ARUSER,
+	mmTPC0_CFG_AWUSER,
+	mmTPC1_QM_GLBL_SECURE_PROPS,
+	mmTPC1_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC1_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC1_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC1_CFG_ARUSER,
+	mmTPC1_CFG_AWUSER,
+	mmTPC2_QM_GLBL_SECURE_PROPS,
+	mmTPC2_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC2_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC2_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC2_CFG_ARUSER,
+	mmTPC2_CFG_AWUSER,
+	mmTPC3_QM_GLBL_SECURE_PROPS,
+	mmTPC3_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC3_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC3_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC3_CFG_ARUSER,
+	mmTPC3_CFG_AWUSER,
+	mmTPC4_QM_GLBL_SECURE_PROPS,
+	mmTPC4_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC4_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC4_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC4_CFG_ARUSER,
+	mmTPC4_CFG_AWUSER,
+	mmTPC5_QM_GLBL_SECURE_PROPS,
+	mmTPC5_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC5_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC5_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC5_CFG_ARUSER,
+	mmTPC5_CFG_AWUSER,
+	mmTPC6_QM_GLBL_SECURE_PROPS,
+	mmTPC6_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC6_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC6_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC6_CFG_ARUSER,
+	mmTPC6_CFG_AWUSER,
+	mmTPC7_QM_GLBL_SECURE_PROPS,
+	mmTPC7_QM_GLBL_NON_SECURE_PROPS,
+	mmTPC7_CMDQ_GLBL_SECURE_PROPS,
+	mmTPC7_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmTPC7_CFG_ARUSER,
+	mmTPC7_CFG_AWUSER,
+	mmMME_QM_GLBL_SECURE_PROPS,
+	mmMME_QM_GLBL_NON_SECURE_PROPS,
+	mmMME_CMDQ_GLBL_SECURE_PROPS,
+	mmMME_CMDQ_GLBL_NON_SECURE_PROPS,
+	mmMME_SBA_CONTROL_DATA,
+	mmMME_SBB_CONTROL_DATA,
+	mmMME_SBC_CONTROL_DATA,
+	mmMME_WBC_CONTROL_DATA
+};
+
 #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
 
 static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
@@ -265,6 +332,10 @@ static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
 };
 
 static int goya_armcp_info_get(struct hl_device *hdev);
+static void goya_mmu_prepare(struct hl_device *hdev, u32 asid);
+static int goya_mmu_clear_pgt_range(struct hl_device *hdev);
+static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
+					u64 phys_addr);
 
 static void goya_get_fixed_properties(struct hl_device *hdev)
 {
@@ -303,6 +374,16 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
 	prop->sram_user_base_address = prop->sram_base_address +
 						SRAM_USER_BASE_OFFSET;
 
+	prop->mmu_pgt_addr = MMU_PAGE_TABLES_ADDR;
+	if (hdev->pldm)
+		prop->mmu_pgt_size = 0x800000; /* 8MB */
+	else
+		prop->mmu_pgt_size = MMU_PAGE_TABLES_SIZE;
+	prop->mmu_pte_size = HL_PTE_SIZE;
+	prop->mmu_hop_table_size = HOP_TABLE_SIZE;
+	prop->mmu_hop0_tables_total_size = HOP0_TABLES_TOTAL_SIZE;
+	prop->dram_page_size = PAGE_SIZE_2MB;
+
 	prop->host_phys_base_address = HOST_PHYS_BASE;
 	prop->va_space_host_start_address = VA_HOST_SPACE_START;
 	prop->va_space_host_end_address = VA_HOST_SPACE_END;
@@ -756,7 +837,18 @@ static int goya_late_init(struct hl_device *hdev)
 
 	goya_fetch_psoc_frequency(hdev);
 
+	rc = goya_mmu_clear_pgt_range(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to clear MMU page tables range\n");
+		goto disable_pci_access;
+	}
+
 	return 0;
+
+disable_pci_access:
+	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
+
+	return rc;
 }
 
 /*
@@ -2569,6 +2661,54 @@ static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
 	return 0;
 }
 
+static int goya_mmu_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+	u64 hop0_addr;
+	int rc, i;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	if (goya->hw_cap_initialized & HW_CAP_MMU)
+		return 0;
+
+	hdev->dram_supports_virtual_memory = true;
+
+	for (i = 0 ; i < prop->max_asid ; i++) {
+		hop0_addr = prop->mmu_pgt_addr +
+				(i * prop->mmu_hop_table_size);
+
+		rc = goya_mmu_update_asid_hop0_addr(hdev, i, hop0_addr);
+		if (rc) {
+			dev_err(hdev->dev,
+				"failed to set hop0 addr for asid %d\n", i);
+			goto err;
+		}
+	}
+
+	goya->hw_cap_initialized |= HW_CAP_MMU;
+
+	/* init MMU cache manage page */
+	WREG32(mmSTLB_CACHE_INV_BASE_39_8, MMU_CACHE_MNG_ADDR >> 8);
+	WREG32(mmSTLB_CACHE_INV_BASE_49_40, MMU_CACHE_MNG_ADDR << 40);
+
+	/* Remove follower feature due to performance bug */
+	WREG32_AND(mmSTLB_STLB_FEATURE_EN,
+			(~STLB_STLB_FEATURE_EN_FOLLOWER_EN_MASK));
+
+	hdev->asic_funcs->mmu_invalidate_cache(hdev, true);
+
+	WREG32(mmMMU_MMU_ENABLE, 1);
+	WREG32(mmMMU_SPI_MASK, 0xF);
+
+	return 0;
+
+err:
+	return rc;
+}
+
 /*
  * goya_hw_init - Goya hardware initialization code
  *
@@ -2618,6 +2758,10 @@ static int goya_hw_init(struct hl_device *hdev)
 		return rc;
 	}
 
+	rc = goya_mmu_init(hdev);
+	if (rc)
+		return rc;
+
 	goya_init_security(hdev);
 
 	goya_init_dma_qmans(hdev);
@@ -4247,6 +4391,10 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
 
 	rc = goya_send_job_on_qman0(hdev, job);
 
+	/* no point in setting the asid in case of failure */
+	if (!rc)
+		goya_mmu_prepare(hdev, asid);
+
 	job->patched_cb->cs_cnt--;
 	hl_cb_put(job->patched_cb);
 
@@ -4282,6 +4430,22 @@ void goya_restore_phase_topology(struct hl_device *hdev)
 	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
 }
 
+static u64 goya_read_pte(struct hl_device *hdev, u64 addr)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	return readq(hdev->pcie_bar[DDR_BAR_ID] +
+			(addr - goya->ddr_bar_cur_addr));
+}
+
+static void goya_write_pte(struct hl_device *hdev, u64 addr, u64 val)
+{
+	struct goya_device *goya = hdev->asic_specific;
+
+	writeq(val, hdev->pcie_bar[DDR_BAR_ID] +
+			(addr - goya->ddr_bar_cur_addr));
+}
+
 static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
 		u16 event_type, char *axi_name, int len)
 {
@@ -4565,6 +4729,231 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
 	return goya->events_stat;
 }
 
+static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct goya_device *goya = hdev->asic_specific;
+	struct packet_lin_dma *clear_pgt_range_pkt;
+	struct hl_cs_parser parser;
+	struct hl_cs_job *job;
+	u32 cb_size;
+	struct hl_cb *cb;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return 0;
+
+	cb = hl_cb_kernel_create(hdev, PAGE_SIZE);
+	if (!cb)
+		return -EFAULT;
+
+	clear_pgt_range_pkt = (struct packet_lin_dma *) cb->kernel_address;
+	memset(clear_pgt_range_pkt, 0, sizeof(*clear_pgt_range_pkt));
+	cb_size = sizeof(*clear_pgt_range_pkt);
+
+	clear_pgt_range_pkt->ctl =
+		((PACKET_LIN_DMA << GOYA_PKT_CTL_OPCODE_SHIFT) |
+		(DMA_HOST_TO_DRAM << GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT) |
+		(1 << GOYA_PKT_LIN_DMA_CTL_MEMSET_SHIFT) |
+		(1 << GOYA_PKT_LIN_DMA_CTL_WO_SHIFT) |
+		(1 << GOYA_PKT_CTL_RB_SHIFT) |
+		(1 << GOYA_PKT_CTL_MB_SHIFT));
+
+	clear_pgt_range_pkt->src_addr = 0;
+	clear_pgt_range_pkt->dst_addr = prop->mmu_pgt_addr;
+	clear_pgt_range_pkt->tsize = prop->mmu_pgt_size + MMU_CACHE_MNG_SIZE;
+
+	job = hl_cs_allocate_job(hdev, true);
+	if (!job) {
+		dev_err(hdev->dev, "Failed to allocate a new job\n");
+		rc = -ENOMEM;
+		goto release_cb;
+	}
+
+	job->id = 0;
+	job->user_cb = cb;
+	job->user_cb->cs_cnt++;
+	job->user_cb_size = cb_size;
+	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
+
+	parser.ctx_id = HL_KERNEL_ASID_ID;
+	parser.cs_sequence = 0;
+	parser.job_id = job->id;
+	parser.hw_queue_id = job->hw_queue_id;
+	parser.job_userptr_list = &job->userptr_list;
+	parser.user_cb = job->user_cb;
+	parser.user_cb_size = job->user_cb_size;
+	parser.ext_queue = job->ext_queue;
+	parser.use_virt_addr = hdev->mmu_enable;
+
+	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to parse kernel CB when clearing pgt\n");
+		goto free_job;
+	}
+
+	job->patched_cb = parser.patched_cb;
+	job->job_cb_size = parser.patched_cb_size;
+	job->patched_cb->cs_cnt++;
+
+	rc = goya_send_job_on_qman0(hdev, job);
+
+	job->patched_cb->cs_cnt--;
+	hl_cb_put(job->patched_cb);
+
+free_job:
+	hl_userptr_delete_list(hdev, &job->userptr_list);
+	kfree(job);
+	cb->cs_cnt--;
+
+release_cb:
+	hl_cb_put(cb);
+	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb->id << PAGE_SHIFT);
+
+	return rc;
+}
+
+static void goya_mmu_prepare(struct hl_device *hdev, u32 asid)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	int i;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return;
+
+	if (asid & ~MME_QM_GLBL_SECURE_PROPS_ASID_MASK) {
+		WARN(1, "asid %u is too big\n", asid);
+		return;
+	}
+
+	/* zero the MMBP and ASID bits and then set the ASID */
+	for (i = 0 ; i < GOYA_MMU_REGS_NUM ; i++) {
+		WREG32_AND(goya_mmu_regs[i], ~0x7FF);
+		WREG32_OR(goya_mmu_regs[i], asid);
+	}
+}
+
+static void goya_mmu_invalidate_cache(struct hl_device *hdev, bool is_hard)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 status, timeout_usec;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return;
+
+	/* no need in L1 only invalidation in Goya */
+	if (!is_hard)
+		return;
+
+	if (hdev->pldm)
+		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
+	else
+		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
+
+	mutex_lock(&hdev->mmu_cache_lock);
+
+	/* L0 & L1 invalidation */
+	WREG32(mmSTLB_INV_ALL_START, 1);
+
+	rc = hl_poll_timeout(
+		hdev,
+		mmSTLB_INV_ALL_START,
+		status,
+		!status,
+		1000,
+		timeout_usec);
+
+	mutex_unlock(&hdev->mmu_cache_lock);
+
+	if (rc)
+		dev_notice_ratelimited(hdev->dev,
+			"Timeout when waiting for MMU cache invalidation\n");
+}
+
+static void goya_mmu_invalidate_cache_range(struct hl_device *hdev,
+		bool is_hard, u32 asid, u64 va, u64 size)
+{
+	struct goya_device *goya = hdev->asic_specific;
+	u32 status, timeout_usec, inv_data, pi;
+	int rc;
+
+	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
+		return;
+
+	/* no need in L1 only invalidation in Goya */
+	if (!is_hard)
+		return;
+
+	if (hdev->pldm)
+		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
+	else
+		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
+
+	mutex_lock(&hdev->mmu_cache_lock);
+
+	/*
+	 * TODO: currently invalidate entire L0 & L1 as in regular hard
+	 * invalidation. Need to apply invalidation of specific cache lines with
+	 * mask of ASID & VA & size.
+	 * Note that L1 with be flushed entirely in any case.
+	 */
+
+	/* L0 & L1 invalidation */
+	inv_data = RREG32(mmSTLB_CACHE_INV);
+	/* PI is 8 bit */
+	pi = ((inv_data & STLB_CACHE_INV_PRODUCER_INDEX_MASK) + 1) & 0xFF;
+	WREG32(mmSTLB_CACHE_INV,
+			(inv_data & STLB_CACHE_INV_INDEX_MASK_MASK) | pi);
+
+	rc = hl_poll_timeout(
+		hdev,
+		mmSTLB_INV_CONSUMER_INDEX,
+		status,
+		status == pi,
+		1000,
+		timeout_usec);
+
+	mutex_unlock(&hdev->mmu_cache_lock);
+
+	if (rc)
+		dev_notice_ratelimited(hdev->dev,
+			"Timeout when waiting for MMU cache invalidation\n");
+}
+
+static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
+						u64 phys_addr)
+{
+	u32 status, timeout_usec;
+	int rc;
+
+	if (hdev->pldm)
+		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
+	else
+		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
+
+	WREG32(MMU_HOP0_PA43_12, phys_addr >> MMU_HOP0_PA43_12_SHIFT);
+	WREG32(MMU_HOP0_PA49_44, phys_addr >> MMU_HOP0_PA49_44_SHIFT);
+	WREG32(MMU_ASID_BUSY, 0x80000000 | asid);
+
+	rc = hl_poll_timeout(
+		hdev,
+		MMU_ASID_BUSY,
+		status,
+		!(status & 0x80000000),
+		1000,
+		timeout_usec);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Timeout during MMU hop0 config of asid %d\n", asid);
+		return rc;
+	}
+
+	return 0;
+}
+
 int goya_send_heartbeat(struct hl_device *hdev)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -4828,6 +5217,10 @@ static const struct hl_asic_funcs goya_funcs = {
 	.handle_eqe = goya_handle_eqe,
 	.set_pll_profile = goya_set_pll_profile,
 	.get_events_stat = goya_get_events_stat,
+	.read_pte = goya_read_pte,
+	.write_pte = goya_write_pte,
+	.mmu_invalidate_cache = goya_mmu_invalidate_cache,
+	.mmu_invalidate_cache_range = goya_mmu_invalidate_cache_range,
 	.send_heartbeat = goya_send_heartbeat,
 	.enable_clock_gating = goya_init_clock_gating,
 	.disable_clock_gating = goya_disable_clock_gating,
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index e8f71f6c8fb4..bf2d5cba6148 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -41,6 +41,31 @@
 /* MUST BE POWER OF 2 and larger than 1 */
 #define HL_MAX_PENDING_CS		64
 
+/* Memory */
+#define MEM_HASH_TABLE_BITS		7 /* 1 << 7 buckets */
+
+/* MMU */
+#define MMU_HASH_TABLE_BITS		7 /* 1 << 7 buckets */
+
+/**
+ * struct pgt_info - MMU hop page info.
+ * @node: hash linked-list node for the pgts hash of pgts.
+ * @addr: physical address of the pgt.
+ * @ctx: pointer to the owner ctx.
+ * @num_of_ptes: indicates how many ptes are used in the pgt.
+ *
+ * The MMU page tables hierarchy is placed on the DRAM. When a new level (hop)
+ * is needed during mapping, a new page is allocated and this structure holds
+ * its essential information. During unmapping, if no valid PTEs remained in the
+ * page, it is freed with its pgt_info structure.
+ */
+struct pgt_info {
+	struct hlist_node node;
+	u64 addr;
+	struct hl_ctx *ctx;
+	int num_of_ptes;
+};
+
 struct hl_device;
 struct hl_fpriv;
 
@@ -74,11 +99,11 @@ struct hw_queue_properties {
 /**
  * enum vm_type_t - virtual memory mapping request information.
  * @VM_TYPE_USERPTR: mapping of user memory to device virtual address.
- * @VM_TYPE_PHYS_LIST: mapping of DRAM memory to device virtual address.
+ * @VM_TYPE_PHYS_PACK: mapping of DRAM memory to device virtual address.
  */
 enum vm_type_t {
 	VM_TYPE_USERPTR,
-	VM_TYPE_PHYS_LIST
+	VM_TYPE_PHYS_PACK
 };
 
 /**
@@ -119,6 +144,12 @@ enum hl_device_hw_state {
  *                               mapping DRAM memory.
  * @va_space_dram_end_address: end address of virtual memory range for
  *                             mapping DRAM memory.
+ * @mmu_pgt_addr: base physical address in DRAM of MMU page tables.
+ * @mmu_pgt_size: MMU page tables total size.
+ * @mmu_pte_size: PTE size in MMU page tables.
+ * @mmu_hop_table_size: MMU hop table size.
+ * @mmu_hop0_tables_total_size: total size of MMU hop0 tables.
+ * @dram_page_size: page size for MMU DRAM allocation.
  * @cfg_size: configuration space size on SRAM.
  * @sram_size: total size of SRAM.
  * @max_asid: maximum number of open contexts (ASIDs).
@@ -152,6 +183,12 @@ struct asic_fixed_properties {
 	u64			va_space_host_end_address;
 	u64			va_space_dram_start_address;
 	u64			va_space_dram_end_address;
+	u64			mmu_pgt_addr;
+	u32			mmu_pgt_size;
+	u32			mmu_pte_size;
+	u32			mmu_hop_table_size;
+	u32			mmu_hop0_tables_total_size;
+	u32			dram_page_size;
 	u32			cfg_size;
 	u32			sram_size;
 	u32			max_asid;
@@ -421,6 +458,12 @@ enum hl_pll_frequency {
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
  * @set_pll_profile: change PLL profile (manual/automatic).
  * @get_events_stat: retrieve event queue entries histogram.
+ * @read_pte: read MMU page table entry from DRAM.
+ * @write_pte: write MMU page table entry to DRAM.
+ * @mmu_invalidate_cache: flush MMU STLB cache, either with soft (L1 only) or
+ *                        hard (L0 & L1) flush.
+ * @mmu_invalidate_cache_range: flush specific MMU STLB cache lines with
+ *                              ASID-VA-size mask.
  * @send_heartbeat: send is-alive packet to ArmCP and verify response.
  * @enable_clock_gating: enable clock gating for reducing power consumption.
  * @disable_clock_gating: disable clock for accessing registers on HBW.
@@ -485,6 +528,11 @@ struct hl_asic_funcs {
 	void (*set_pll_profile)(struct hl_device *hdev,
 			enum hl_pll_frequency freq);
 	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
+	u64 (*read_pte)(struct hl_device *hdev, u64 addr);
+	void (*write_pte)(struct hl_device *hdev, u64 addr, u64 val);
+	void (*mmu_invalidate_cache)(struct hl_device *hdev, bool is_hard);
+	void (*mmu_invalidate_cache_range)(struct hl_device *hdev, bool is_hard,
+			u32 asid, u64 va, u64 size);
 	int (*send_heartbeat)(struct hl_device *hdev);
 	void (*enable_clock_gating)(struct hl_device *hdev);
 	void (*disable_clock_gating)(struct hl_device *hdev);
@@ -506,17 +554,40 @@ struct hl_asic_funcs {
 
 #define HL_KERNEL_ASID_ID	0
 
+/**
+ * struct hl_va_range - virtual addresses range.
+ * @lock: protects the virtual addresses list.
+ * @list: list of virtual addresses blocks available for mappings.
+ * @start_addr: range start address.
+ * @end_addr: range end address.
+ */
+struct hl_va_range {
+	struct mutex		lock;
+	struct list_head	list;
+	u64			start_addr;
+	u64			end_addr;
+};
+
 /**
  * struct hl_ctx - user/kernel context.
+ * @mem_hash: holds mapping from virtual address to virtual memory area
+ *		descriptor (hl_vm_phys_pg_list or hl_userptr).
+ * @mmu_hash: holds a mapping from virtual address to pgt_info structure.
  * @hpriv: pointer to the private (KMD) data of the process (fd).
  * @hdev: pointer to the device structure.
  * @refcount: reference counter for the context. Context is released only when
  *		this hits 0l. It is incremented on CS and CS_WAIT.
  * @cs_pending: array of DMA fence objects representing pending CS.
+ * @host_va_range: holds available virtual addresses for host mappings.
+ * @dram_va_range: holds available virtual addresses for DRAM mappings.
+ * @mem_hash_lock: protects the mem_hash.
+ * @mmu_lock: protects the MMU page tables. Any change to the PGT, modifing the
+ *            MMU hash or walking the PGT requires talking this lock
  * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
  *			to user so user could inquire about CS. It is used as
  *			index to cs_pending array.
  * @cs_lock: spinlock to protect cs_sequence.
+ * @dram_phys_mem: amount of used physical DRAM memory by this context.
  * @thread_restore_token: token to prevent multiple threads of the same context
  *				from running the restore phase. Only one thread
  *				should run it.
@@ -526,12 +597,19 @@ struct hl_asic_funcs {
  * @asid: context's unique address space ID in the device's MMU.
  */
 struct hl_ctx {
+	DECLARE_HASHTABLE(mem_hash, MEM_HASH_TABLE_BITS);
+	DECLARE_HASHTABLE(mmu_hash, MMU_HASH_TABLE_BITS);
 	struct hl_fpriv		*hpriv;
 	struct hl_device	*hdev;
 	struct kref		refcount;
 	struct dma_fence	*cs_pending[HL_MAX_PENDING_CS];
+	struct hl_va_range	host_va_range;
+	struct hl_va_range	dram_va_range;
+	struct mutex		mem_hash_lock;
+	struct mutex		mmu_lock;
 	u64			cs_sequence;
 	spinlock_t		cs_lock;
+	atomic64_t		dram_phys_mem;
 	atomic_t		thread_restore_token;
 	u32			thread_restore_wait_token;
 	u32			asid;
@@ -674,6 +752,85 @@ struct hl_cs_parser {
 };
 
 
+/*
+ * MEMORY STRUCTURE
+ */
+
+/**
+ * struct hl_vm_hash_node - hash element from virtual address to virtual
+ *				memory area descriptor (hl_vm_phys_pg_list or
+ *				hl_userptr).
+ * @node: node to hang on the hash table in context object.
+ * @vaddr: key virtual address.
+ * @ptr: value pointer (hl_vm_phys_pg_list or hl_userptr).
+ */
+struct hl_vm_hash_node {
+	struct hlist_node	node;
+	u64			vaddr;
+	void			*ptr;
+};
+
+/**
+ * struct hl_vm_phys_pg_pack - physical page pack.
+ * @vm_type: describes the type of the virtual area descriptor.
+ * @pages: the physical page array.
+ * @mapping_cnt: number of shared mappings.
+ * @asid: the context related to this list.
+ * @npages: num physical pages in the pack.
+ * @page_size: size of each page in the pack.
+ * @total_size: total size of all the pages in this list.
+ * @flags: HL_MEM_* flags related to this list.
+ * @handle: the provided handle related to this list.
+ * @offset: offset from the first page.
+ * @contiguous: is contiguous physical memory.
+ * @created_from_userptr: is product of host virtual address.
+ */
+struct hl_vm_phys_pg_pack {
+	enum vm_type_t		vm_type; /* must be first */
+	u64			*pages;
+	atomic_t		mapping_cnt;
+	u32			asid;
+	u32			npages;
+	u32			page_size;
+	u32			total_size;
+	u32			flags;
+	u32			handle;
+	u32			offset;
+	u8			contiguous;
+	u8			created_from_userptr;
+};
+
+/**
+ * struct hl_vm_va_block - virtual range block information.
+ * @node: node to hang on the virtual range list in context object.
+ * @start: virtual range start address.
+ * @end: virtual range end address.
+ * @size: virtual range size.
+ */
+struct hl_vm_va_block {
+	struct list_head	node;
+	u64			start;
+	u64			end;
+	u64			size;
+};
+
+/**
+ * struct hl_vm - virtual memory manager for MMU.
+ * @dram_pg_pool: pool for DRAM physical pages of 2MB.
+ * @dram_pg_pool_refcount: reference counter for the pool usage.
+ * @idr_lock: protects the phys_pg_list_handles.
+ * @phys_pg_pack_handles: idr to hold all device allocations handles.
+ * @init_done: whether initialization was done. We need this because VM
+ *		initialization might be skipped during device initialization.
+ */
+struct hl_vm {
+	struct gen_pool		*dram_pg_pool;
+	struct kref		dram_pg_pool_refcount;
+	spinlock_t		idr_lock;
+	struct idr		phys_pg_pack_handles;
+	u8			init_done;
+};
+
 /*
  * FILE PRIVATE STRUCTURE
  */
@@ -787,12 +944,16 @@ struct hl_device_reset_work {
  * @asic_prop: ASIC specific immutable properties.
  * @asic_funcs: ASIC specific functions.
  * @asic_specific: ASIC specific information to use only from ASIC files.
+ * @mmu_pgt_pool: pool of available MMU hops.
+ * @vm: virtual memory manager for MMU.
+ * @mmu_cache_lock: protects MMU cache invalidation as it can serve one context
  * @hwmon_dev: H/W monitor device.
  * @pm_mng_profile: current power management profile.
  * @hl_chip_info: ASIC's sensors information.
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
+ * @dram_used_mem: current DRAM memory consumption.
  * @in_reset: is device in reset flow.
  * @curr_pll_profile: current PLL profile.
  * @fd_open_cnt: number of open user processes.
@@ -812,6 +973,7 @@ struct hl_device_reset_work {
  * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
  * @reset_on_lockup: true if a reset should be done in case of stuck CS, false
  *                   otherwise.
+ * @dram_supports_virtual_memory: is MMU enabled towards DRAM.
  * @init_done: is the initialization of the device done.
  * @mmu_enable: is MMU enabled.
  */
@@ -846,6 +1008,9 @@ struct hl_device {
 	struct asic_fixed_properties	asic_prop;
 	const struct hl_asic_funcs	*asic_funcs;
 	void				*asic_specific;
+	struct gen_pool			*mmu_pgt_pool;
+	struct hl_vm			vm;
+	struct mutex			mmu_cache_lock;
 	struct device			*hwmon_dev;
 	enum hl_pm_mng_profile		pm_mng_profile;
 	struct hwmon_chip_info		hl_chip_info;
@@ -856,6 +1021,7 @@ struct hl_device {
 	/* TODO: remove user_ctx for multiple process support */
 	struct hl_ctx			*user_ctx;
 
+	atomic64_t			dram_used_mem;
 	atomic_t			in_reset;
 	atomic_t			curr_pll_profile;
 	atomic_t			fd_open_cnt;
@@ -872,6 +1038,7 @@ struct hl_device {
 	u8				hard_reset_pending;
 	u8				heartbeat;
 	u8				reset_on_lockup;
+	u8				dram_supports_virtual_memory;
 	u8				init_done;
 
 	/* Parameters for bring-up */
@@ -1021,6 +1188,7 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 void hl_hpriv_get(struct hl_fpriv *hpriv);
 void hl_hpriv_put(struct hl_fpriv *hpriv);
 int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
+
 int hl_build_hwmon_channel_info(struct hl_device *hdev,
 		struct armcp_sensor *sensors_arr);
 
@@ -1048,6 +1216,12 @@ struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 
+int hl_vm_ctx_init(struct hl_ctx *ctx);
+void hl_vm_ctx_fini(struct hl_ctx *ctx);
+
+int hl_vm_init(struct hl_device *hdev);
+void hl_vm_fini(struct hl_device *hdev);
+
 int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
 			struct hl_userptr *userptr);
 int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr);
@@ -1057,6 +1231,15 @@ bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr, u32 size,
 				struct list_head *userptr_list,
 				struct hl_userptr **userptr);
 
+int hl_mmu_init(struct hl_device *hdev);
+void hl_mmu_fini(struct hl_device *hdev);
+void hl_mmu_ctx_init(struct hl_ctx *ctx);
+void hl_mmu_ctx_fini(struct hl_ctx *ctx);
+int hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr, u32 page_size);
+int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr, u32 page_size);
+void hl_mmu_swap_out(struct hl_ctx *ctx);
+void hl_mmu_swap_in(struct hl_ctx *ctx);
+
 long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
 void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
 long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
@@ -1074,5 +1257,6 @@ long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
 int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data);
 int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data);
+int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data);
 
 #endif /* HABANALABSP_H_ */
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index dbc0c9c2c99a..658dc4588a36 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -192,7 +192,7 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
 	hdev->reset_on_lockup = reset_on_lockup;
 
 	/* Parameters for bring-up - set them to defaults */
-	hdev->mmu_enable = 0;
+	hdev->mmu_enable = 1;
 	hdev->cpu_enable = 1;
 	hdev->reset_pcilink = 0;
 	hdev->cpu_queues_enable = 1;
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index f93649a63a9e..71ef0c91668b 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -18,7 +18,8 @@
 static const struct hl_ioctl_desc hl_ioctls[] = {
 	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
-	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl)
+	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl),
+	HL_IOCTL_DEF(HL_IOCTL_MEMORY, hl_mem_ioctl)
 };
 
 #define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
new file mode 100644
index 000000000000..01483a581561
--- /dev/null
+++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef INCLUDE_MMU_GENERAL_H_
+#define INCLUDE_MMU_GENERAL_H_
+
+#define PAGE_SHIFT_4KB			12
+#define PAGE_SHIFT_2MB			21
+#define PAGE_SIZE_2MB			(_AC(1, UL) << PAGE_SHIFT_2MB)
+#define PAGE_SIZE_4KB			(_AC(1, UL) << PAGE_SHIFT_4KB)
+
+#define PAGE_PRESENT_MASK		0x0000000000001
+#define SWAP_OUT_MASK			0x0000000000004
+#define LAST_MASK			0x0000000000800
+#define PHYS_ADDR_MASK			0x3FFFFFFFFF000ull
+#define HOP0_MASK			0x3000000000000ull
+#define HOP1_MASK			0x0FF8000000000ull
+#define HOP2_MASK			0x0007FC0000000ull
+#define HOP3_MASK			0x000003FE00000
+#define HOP4_MASK			0x00000001FF000
+#define OFFSET_MASK			0x0000000000FFF
+
+#define HOP0_SHIFT			48
+#define HOP1_SHIFT			39
+#define HOP2_SHIFT			30
+#define HOP3_SHIFT			21
+#define HOP4_SHIFT			12
+
+#define PTE_PHYS_ADDR_SHIFT		12
+#define PTE_PHYS_ADDR_MASK		~0xFFF
+
+#define HL_PTE_SIZE			sizeof(u64)
+#define HOP_TABLE_SIZE			PAGE_SIZE_4KB
+#define HOP0_TABLES_TOTAL_SIZE		(HOP_TABLE_SIZE * MAX_ASID)
+
+#define MMU_HOP0_PA43_12_SHIFT		12
+#define MMU_HOP0_PA49_44_SHIFT		(12 + 32)
+
+#define MMU_CONFIG_TIMEOUT_USEC		2000 /* 2 ms */
+
+#endif /* INCLUDE_MMU_GENERAL_H_ */
diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
new file mode 100644
index 000000000000..8539dd041f2c
--- /dev/null
+++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ *
+ */
+
+#ifndef INCLUDE_MMU_V1_0_H_
+#define INCLUDE_MMU_V1_0_H_
+
+#define MMU_HOP0_PA43_12	0x490004
+#define MMU_HOP0_PA49_44	0x490008
+#define MMU_ASID_BUSY		0x490000
+
+#endif /* INCLUDE_MMU_V1_0_H_ */
diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
index 47e110b4f76e..98a7cc700fd5 100644
--- a/drivers/misc/habanalabs/memory.c
+++ b/drivers/misc/habanalabs/memory.c
@@ -5,12 +5,1198 @@
  * All Rights Reserved.
  */
 
+#include <uapi/misc/habanalabs.h>
 #include "habanalabs.h"
+#include "include/hw_ip/mmu/mmu_general.h"
 
 #include <linux/sched.h>
 #include <linux/uaccess.h>
 #include <linux/genalloc.h>
 
+#define PGS_IN_HPAGE   (HPAGE_SIZE >> PAGE_SHIFT)
+#define HL_MMU_DEBUG	0
+
+/*
+ * The va ranges in context object contain a list with the available chunks of
+ * device virtual memory.
+ * There is one range for host allocations and one for DRAM allocations.
+ *
+ * On initialization each range contains one chunk of all of its available
+ * virtual range which is a half of the total device virtual range.
+ *
+ * On each mapping of physical pages, a suitable virtual range chunk (with a
+ * minimum size) is selected from the list. If the chunk size equals the
+ * requested size, the chunk is returned. Otherwise, the chunk is split into
+ * two chunks - one to return as result and a remainder to stay in the list.
+ *
+ * On each Unmapping of a virtual address, the relevant virtual chunk is
+ * returned to the list. The chunk is added to the list and if its edges match
+ * the edges of the adjacent chunks (means a contiguous chunk can be created),
+ * the chunks are merged.
+ *
+ * On finish, the list is checked to have only one chunk of all the relevant
+ * virtual range (which is a half of the device total virtual range).
+ * If not (means not all mappings were unmapped), a warning is printed.
+ */
+
+/*
+ * alloc_device_memory - allocate device memory
+ *
+ * @ctx                 : current context
+ * @args                : host parameters containing the requested size
+ * @ret_handle          : result handle
+ *
+ * This function does the following:
+ * - Allocate the requested size rounded up to 2MB pages
+ * - Return unique handle
+ */
+static int alloc_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args,
+				u32 *ret_handle)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_pack *phys_pg_pack;
+	u64 paddr = 0;
+	u32 total_size, num_pgs, num_curr_pgs, page_size, page_shift;
+	int handle, rc, i;
+	bool contiguous;
+
+	num_curr_pgs = 0;
+	page_size = hdev->asic_prop.dram_page_size;
+	page_shift = __ffs(page_size);
+	num_pgs = (args->alloc.mem_size + (page_size - 1)) >> page_shift;
+	total_size = num_pgs << page_shift;
+
+	contiguous = args->flags & HL_MEM_CONTIGUOUS;
+
+	if (contiguous) {
+		paddr = (u64) gen_pool_alloc(vm->dram_pg_pool, total_size);
+		if (!paddr) {
+			dev_err(hdev->dev,
+				"failed to allocate %u huge contiguous pages\n",
+				num_pgs);
+			return -ENOMEM;
+		}
+	}
+
+	phys_pg_pack = kzalloc(sizeof(*phys_pg_pack), GFP_KERNEL);
+	if (!phys_pg_pack) {
+		rc = -ENOMEM;
+		goto pages_pack_err;
+	}
+
+	phys_pg_pack->vm_type = VM_TYPE_PHYS_PACK;
+	phys_pg_pack->asid = ctx->asid;
+	phys_pg_pack->npages = num_pgs;
+	phys_pg_pack->page_size = page_size;
+	phys_pg_pack->total_size = total_size;
+	phys_pg_pack->flags = args->flags;
+	phys_pg_pack->contiguous = contiguous;
+
+	phys_pg_pack->pages = kcalloc(num_pgs, sizeof(u64), GFP_KERNEL);
+	if (!phys_pg_pack->pages) {
+		rc = -ENOMEM;
+		goto pages_arr_err;
+	}
+
+	if (phys_pg_pack->contiguous) {
+		for (i = 0 ; i < num_pgs ; i++)
+			phys_pg_pack->pages[i] = paddr + i * page_size;
+	} else {
+		for (i = 0 ; i < num_pgs ; i++) {
+			phys_pg_pack->pages[i] = (u64) gen_pool_alloc(
+							vm->dram_pg_pool,
+							page_size);
+			if (!phys_pg_pack->pages[i]) {
+				dev_err(hdev->dev,
+					"ioctl failed to allocate page\n");
+				rc = -ENOMEM;
+				goto page_err;
+			}
+
+			num_curr_pgs++;
+		}
+	}
+
+	spin_lock(&vm->idr_lock);
+	handle = idr_alloc(&vm->phys_pg_pack_handles, phys_pg_pack, 1, 0,
+				GFP_KERNEL);
+	spin_unlock(&vm->idr_lock);
+
+	if (handle < 0) {
+		dev_err(hdev->dev, "Failed to get handle for page\n");
+		rc = -EFAULT;
+		goto idr_err;
+	}
+
+	for (i = 0 ; i < num_pgs ; i++)
+		kref_get(&vm->dram_pg_pool_refcount);
+
+	phys_pg_pack->handle = handle;
+
+	atomic64_add(phys_pg_pack->total_size, &ctx->dram_phys_mem);
+	atomic64_add(phys_pg_pack->total_size, &hdev->dram_used_mem);
+
+	*ret_handle = handle;
+
+	return 0;
+
+idr_err:
+page_err:
+	if (!phys_pg_pack->contiguous)
+		for (i = 0 ; i < num_curr_pgs ; i++)
+			gen_pool_free(vm->dram_pg_pool, phys_pg_pack->pages[i],
+					page_size);
+
+	kfree(phys_pg_pack->pages);
+pages_arr_err:
+	kfree(phys_pg_pack);
+pages_pack_err:
+	if (contiguous)
+		gen_pool_free(vm->dram_pg_pool, paddr, total_size);
+
+	return rc;
+}
+
+/*
+ * get_userptr_from_host_va - initialize userptr structure from given host
+ *                            virtual address
+ *
+ * @hdev                : habanalabs device structure
+ * @args                : parameters containing the virtual address and size
+ * @p_userptr           : pointer to result userptr structure
+ *
+ * This function does the following:
+ * - Allocate userptr structure
+ * - Pin the given host memory using the userptr structure
+ * - Perform DMA mapping to have the DMA addresses of the pages
+ */
+static int get_userptr_from_host_va(struct hl_device *hdev,
+		struct hl_mem_in *args, struct hl_userptr **p_userptr)
+{
+	struct hl_userptr *userptr;
+	int rc;
+
+	userptr = kzalloc(sizeof(*userptr), GFP_KERNEL);
+	if (!userptr) {
+		rc = -ENOMEM;
+		goto userptr_err;
+	}
+
+	rc = hl_pin_host_memory(hdev, args->map_host.host_virt_addr,
+			args->map_host.mem_size, userptr);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to pin host memory\n");
+		goto pin_err;
+	}
+
+	rc = hdev->asic_funcs->asic_dma_map_sg(hdev, userptr->sgt->sgl,
+					userptr->sgt->nents, DMA_BIDIRECTIONAL);
+	if (rc) {
+		dev_err(hdev->dev, "failed to map sgt with DMA region\n");
+		goto dma_map_err;
+	}
+
+	userptr->dma_mapped = true;
+	userptr->dir = DMA_BIDIRECTIONAL;
+	userptr->vm_type = VM_TYPE_USERPTR;
+
+	*p_userptr = userptr;
+
+	return 0;
+
+dma_map_err:
+	hl_unpin_host_memory(hdev, userptr);
+pin_err:
+	kfree(userptr);
+userptr_err:
+
+	return rc;
+}
+
+/*
+ * free_userptr - free userptr structure
+ *
+ * @hdev                : habanalabs device structure
+ * @userptr             : userptr to free
+ *
+ * This function does the following:
+ * - Unpins the physical pages
+ * - Frees the userptr structure
+ */
+static void free_userptr(struct hl_device *hdev, struct hl_userptr *userptr)
+{
+	hl_unpin_host_memory(hdev, userptr);
+	kfree(userptr);
+}
+
+/*
+ * dram_pg_pool_do_release - free DRAM pages pool
+ *
+ * @ref                 : pointer to reference object
+ *
+ * This function does the following:
+ * - Frees the idr structure of physical pages handles
+ * - Frees the generic pool of DRAM physical pages
+ */
+static void dram_pg_pool_do_release(struct kref *ref)
+{
+	struct hl_vm *vm = container_of(ref, struct hl_vm,
+			dram_pg_pool_refcount);
+
+	/*
+	 * free the idr here as only here we know for sure that there are no
+	 * allocated physical pages and hence there are no handles in use
+	 */
+	idr_destroy(&vm->phys_pg_pack_handles);
+	gen_pool_destroy(vm->dram_pg_pool);
+}
+
+/*
+ * free_phys_pg_pack   - free physical page pack
+ *
+ * @hdev               : habanalabs device structure
+ * @phys_pg_pack       : physical page pack to free
+ *
+ * This function does the following:
+ * - For DRAM memory only, iterate over the pack and free each physical block
+ *   structure by returning it to the general pool
+ * - Free the hl_vm_phys_pg_pack structure
+ */
+static void free_phys_pg_pack(struct hl_device *hdev,
+		struct hl_vm_phys_pg_pack *phys_pg_pack)
+{
+	struct hl_vm *vm = &hdev->vm;
+	int i;
+
+	if (!phys_pg_pack->created_from_userptr) {
+		if (phys_pg_pack->contiguous) {
+			gen_pool_free(vm->dram_pg_pool, phys_pg_pack->pages[0],
+					phys_pg_pack->total_size);
+
+			for (i = 0; i < phys_pg_pack->npages ; i++)
+				kref_put(&vm->dram_pg_pool_refcount,
+					dram_pg_pool_do_release);
+		} else {
+			for (i = 0 ; i < phys_pg_pack->npages ; i++) {
+				gen_pool_free(vm->dram_pg_pool,
+						phys_pg_pack->pages[i],
+						phys_pg_pack->page_size);
+				kref_put(&vm->dram_pg_pool_refcount,
+					dram_pg_pool_do_release);
+			}
+		}
+	}
+
+	kfree(phys_pg_pack->pages);
+	kfree(phys_pg_pack);
+}
+
+/*
+ * free_device_memory - free device memory
+ *
+ * @ctx                  : current context
+ * @handle              : handle of the memory chunk to free
+ *
+ * This function does the following:
+ * - Free the device memory related to the given handle
+ */
+static int free_device_memory(struct hl_ctx *ctx, u32 handle)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_pack *phys_pg_pack;
+
+	spin_lock(&vm->idr_lock);
+	phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, handle);
+	if (phys_pg_pack) {
+		if (atomic_read(&phys_pg_pack->mapping_cnt) > 0) {
+			dev_err(hdev->dev, "handle %u is mapped, cannot free\n",
+				handle);
+			spin_unlock(&vm->idr_lock);
+			return -EINVAL;
+		}
+
+		/*
+		 * must remove from idr before the freeing of the physical
+		 * pages as the refcount of the pool is also the trigger of the
+		 * idr destroy
+		 */
+		idr_remove(&vm->phys_pg_pack_handles, handle);
+		spin_unlock(&vm->idr_lock);
+
+		atomic64_sub(phys_pg_pack->total_size, &ctx->dram_phys_mem);
+		atomic64_sub(phys_pg_pack->total_size, &hdev->dram_used_mem);
+
+		free_phys_pg_pack(hdev, phys_pg_pack);
+	} else {
+		spin_unlock(&vm->idr_lock);
+		dev_err(hdev->dev,
+			"free device memory failed, no match for handle %u\n",
+			handle);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * clear_va_list_locked - free virtual addresses list
+ *
+ * @hdev                : habanalabs device structure
+ * @va_list             : list of virtual addresses to free
+ *
+ * This function does the following:
+ * - Iterate over the list and free each virtual addresses block
+ *
+ * This function should be called only when va_list lock is taken
+ */
+static void clear_va_list_locked(struct hl_device *hdev,
+		struct list_head *va_list)
+{
+	struct hl_vm_va_block *va_block, *tmp;
+
+	list_for_each_entry_safe(va_block, tmp, va_list, node) {
+		list_del(&va_block->node);
+		kfree(va_block);
+	}
+}
+
+/*
+ * print_va_list_locked    - print virtual addresses list
+ *
+ * @hdev                : habanalabs device structure
+ * @va_list             : list of virtual addresses to print
+ *
+ * This function does the following:
+ * - Iterate over the list and print each virtual addresses block
+ *
+ * This function should be called only when va_list lock is taken
+ */
+static void print_va_list_locked(struct hl_device *hdev,
+		struct list_head *va_list)
+{
+#if HL_MMU_DEBUG
+	struct hl_vm_va_block *va_block;
+
+	dev_dbg(hdev->dev, "print va list:\n");
+
+	list_for_each_entry(va_block, va_list, node)
+		dev_dbg(hdev->dev,
+			"va block, start: 0x%llx, end: 0x%llx, size: %llu\n",
+			va_block->start, va_block->end, va_block->size);
+#endif
+}
+
+/*
+ * merge_va_blocks_locked - merge a virtual block if possible
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_list             : pointer to the virtual addresses block list
+ * @va_block            : virtual block to merge with adjacent blocks
+ *
+ * This function does the following:
+ * - Merge the given blocks with the adjacent blocks if their virtual ranges
+ *   create a contiguous virtual range
+ *
+ * This Function should be called only when va_list lock is taken
+ */
+static void merge_va_blocks_locked(struct hl_device *hdev,
+		struct list_head *va_list, struct hl_vm_va_block *va_block)
+{
+	struct hl_vm_va_block *prev, *next;
+
+	prev = list_prev_entry(va_block, node);
+	if (&prev->node != va_list && prev->end + 1 == va_block->start) {
+		prev->end = va_block->end;
+		prev->size = prev->end - prev->start;
+		list_del(&va_block->node);
+		kfree(va_block);
+		va_block = prev;
+	}
+
+	next = list_next_entry(va_block, node);
+	if (&next->node != va_list && va_block->end + 1 == next->start) {
+		next->start = va_block->start;
+		next->size = next->end - next->start;
+		list_del(&va_block->node);
+		kfree(va_block);
+	}
+}
+
+/*
+ * add_va_block_locked - add a virtual block to the virtual addresses list
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_list             : pointer to the virtual addresses block list
+ * @start               : start virtual address
+ * @end                 : end virtual address
+ *
+ * This function does the following:
+ * - Add the given block to the virtual blocks list and merge with other
+ * blocks if a contiguous virtual block can be created
+ *
+ * This Function should be called only when va_list lock is taken
+ */
+static int add_va_block_locked(struct hl_device *hdev,
+		struct list_head *va_list, u64 start, u64 end)
+{
+	struct hl_vm_va_block *va_block, *res = NULL;
+	u64 size = end - start;
+
+	print_va_list_locked(hdev, va_list);
+
+	list_for_each_entry(va_block, va_list, node) {
+		/* TODO: remove upon matureness */
+		if (hl_mem_area_crosses_range(start, size, va_block->start,
+				va_block->end)) {
+			dev_err(hdev->dev,
+				"block crossing ranges at start 0x%llx, end 0x%llx\n",
+				va_block->start, va_block->end);
+			return -EINVAL;
+		}
+
+		if (va_block->end < start)
+			res = va_block;
+	}
+
+	va_block = kmalloc(sizeof(*va_block), GFP_KERNEL);
+	if (!va_block)
+		return -ENOMEM;
+
+	va_block->start = start;
+	va_block->end = end;
+	va_block->size = size;
+
+	if (!res)
+		list_add(&va_block->node, va_list);
+	else
+		list_add(&va_block->node, &res->node);
+
+	merge_va_blocks_locked(hdev, va_list, va_block);
+
+	print_va_list_locked(hdev, va_list);
+
+	return 0;
+}
+
+/*
+ * add_va_block - wrapper for add_va_block_locked
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_list             : pointer to the virtual addresses block list
+ * @start               : start virtual address
+ * @end                 : end virtual address
+ *
+ * This function does the following:
+ * - Takes the list lock and calls add_va_block_locked
+ */
+static inline int add_va_block(struct hl_device *hdev,
+		struct hl_va_range *va_range, u64 start, u64 end)
+{
+	int rc;
+
+	mutex_lock(&va_range->lock);
+	rc = add_va_block_locked(hdev, &va_range->list, start, end);
+	mutex_unlock(&va_range->lock);
+
+	return rc;
+}
+
+/*
+ * get_va_block - get a virtual block with the requested size
+ *
+ * @hdev            : pointer to the habanalabs device structure
+ * @va_range        : pointer to the virtual addresses range
+ * @size            : requested block size
+ * @hint_addr       : hint for request address by the user
+ * @is_userptr      : is host or DRAM memory
+ *
+ * This function does the following:
+ * - Iterate on the virtual block list to find a suitable virtual block for the
+ *   requested size
+ * - Reserve the requested block and update the list
+ * - Return the start address of the virtual block
+ */
+static u64 get_va_block(struct hl_device *hdev,
+		struct hl_va_range *va_range, u32 size, u64 hint_addr,
+		bool is_userptr)
+{
+	struct hl_vm_va_block *va_block, *new_va_block = NULL;
+	u64 valid_start, valid_size, prev_start, prev_end, page_mask,
+		res_valid_start = 0, res_valid_size = 0;
+	u32 page_size;
+	bool add_prev = false;
+
+	if (is_userptr) {
+		/*
+		 * We cannot know if the user allocated memory with huge pages
+		 * or not, hence we continue with the biggest possible
+		 * granularity.
+		 */
+		page_size = HPAGE_SIZE;
+		page_mask = HPAGE_MASK;
+	} else {
+		page_size = hdev->asic_prop.dram_page_size;
+		page_mask = ~((u64)page_size - 1);
+	}
+
+	mutex_lock(&va_range->lock);
+
+	print_va_list_locked(hdev, &va_range->list);
+
+	list_for_each_entry(va_block, &va_range->list, node) {
+		/* calc the first possible aligned addr */
+		valid_start = va_block->start;
+
+
+		if (valid_start & (page_size - 1)) {
+			valid_start &= page_mask;
+			valid_start += page_size;
+			if (valid_start > va_block->end)
+				continue;
+		}
+
+		valid_size = va_block->end - valid_start;
+
+		if (valid_size >= size &&
+			(!new_va_block || valid_size < res_valid_size)) {
+
+			new_va_block = va_block;
+			res_valid_start = valid_start;
+			res_valid_size = valid_size;
+		}
+
+		if (hint_addr && hint_addr >= valid_start &&
+				((hint_addr + size) <= va_block->end)) {
+			new_va_block = va_block;
+			res_valid_start = hint_addr;
+			res_valid_size = valid_size;
+			break;
+		}
+	}
+
+	if (!new_va_block) {
+		dev_err(hdev->dev, "no available va block for size %u\n", size);
+		goto out;
+	}
+
+	if (res_valid_start > new_va_block->start) {
+		prev_start = new_va_block->start;
+		prev_end = res_valid_start - 1;
+
+		new_va_block->start = res_valid_start;
+		new_va_block->size = res_valid_size;
+
+		add_prev = true;
+	}
+
+	if (new_va_block->size > size) {
+		new_va_block->start += size;
+		new_va_block->size = new_va_block->end - new_va_block->start;
+	} else {
+		list_del(&new_va_block->node);
+		kfree(new_va_block);
+	}
+
+	if (add_prev)
+		add_va_block_locked(hdev, &va_range->list, prev_start,
+				prev_end);
+
+	print_va_list_locked(hdev, &va_range->list);
+out:
+	mutex_unlock(&va_range->lock);
+
+	return res_valid_start;
+}
+
+/*
+ * get_sg_info - get number of pages and the DMA address from SG list
+ *
+ * @sg                 : the SG list
+ * @dma_addr           : pointer to DMA address to return
+ *
+ * Calculate the number of consecutive pages described by the SG list. Take the
+ * offset of the address in the first page, add to it the length and round it up
+ * to the number of needed pages.
+ */
+static u32 get_sg_info(struct scatterlist *sg, dma_addr_t *dma_addr)
+{
+	*dma_addr = sg_dma_address(sg);
+
+	return ((((*dma_addr) & (PAGE_SIZE - 1)) + sg_dma_len(sg)) +
+			(PAGE_SIZE - 1)) >> PAGE_SHIFT;
+}
+
+/*
+ * init_phys_pg_pack_from_userptr - initialize physical page pack from host
+ *                                   memory
+ *
+ * @ctx                : current context
+ * @userptr            : userptr to initialize from
+ * @pphys_pg_pack      : res pointer
+ *
+ * This function does the following:
+ * - Pin the physical pages related to the given virtual block
+ * - Create a physical page pack from the physical pages related to the given
+ *   virtual block
+ */
+static int init_phys_pg_pack_from_userptr(struct hl_ctx *ctx,
+		struct hl_userptr *userptr,
+		struct hl_vm_phys_pg_pack **pphys_pg_pack)
+{
+	struct hl_vm_phys_pg_pack *phys_pg_pack;
+	struct scatterlist *sg;
+	dma_addr_t dma_addr;
+	u64 page_mask;
+	u32 npages, total_npages, page_size = PAGE_SIZE;
+	bool first = true, is_huge_page_opt = true;
+	int rc, i, j;
+
+	phys_pg_pack = kzalloc(sizeof(*phys_pg_pack), GFP_KERNEL);
+	if (!phys_pg_pack)
+		return -ENOMEM;
+
+	phys_pg_pack->vm_type = userptr->vm_type;
+	phys_pg_pack->created_from_userptr = true;
+	phys_pg_pack->asid = ctx->asid;
+	atomic_set(&phys_pg_pack->mapping_cnt, 1);
+
+	/* Only if all dma_addrs are aligned to 2MB and their
+	 * sizes is at least 2MB, we can use huge page mapping.
+	 * We limit the 2MB optimization to this condition,
+	 * since later on we acquire the related VA range as one
+	 * consecutive block.
+	 */
+	total_npages = 0;
+	for_each_sg(userptr->sgt->sgl, sg, userptr->sgt->nents, i) {
+		npages = get_sg_info(sg, &dma_addr);
+
+		total_npages += npages;
+
+		if (first) {
+			first = false;
+			dma_addr &= HPAGE_MASK;
+		}
+
+		if ((npages % PGS_IN_HPAGE) || (dma_addr & (HPAGE_SIZE - 1)))
+			is_huge_page_opt = false;
+	}
+
+	if (is_huge_page_opt) {
+		page_size = HPAGE_SIZE;
+		total_npages /= PGS_IN_HPAGE;
+	}
+
+	page_mask = ~(((u64) page_size) - 1);
+
+	phys_pg_pack->pages = kcalloc(total_npages, sizeof(u64), GFP_KERNEL);
+	if (!phys_pg_pack->pages) {
+		rc = -ENOMEM;
+		goto page_pack_arr_mem_err;
+	}
+
+	phys_pg_pack->npages = total_npages;
+	phys_pg_pack->page_size = page_size;
+	phys_pg_pack->total_size = total_npages * page_size;
+
+	j = 0;
+	first = true;
+	for_each_sg(userptr->sgt->sgl, sg, userptr->sgt->nents, i) {
+		npages = get_sg_info(sg, &dma_addr);
+
+		/* align down to physical page size and save the offset */
+		if (first) {
+			first = false;
+			phys_pg_pack->offset = dma_addr & (page_size - 1);
+			dma_addr &= page_mask;
+		}
+
+		while (npages) {
+			phys_pg_pack->pages[j++] = dma_addr;
+			dma_addr += page_size;
+
+			if (is_huge_page_opt)
+				npages -= PGS_IN_HPAGE;
+			else
+				npages--;
+		}
+	}
+
+	*pphys_pg_pack = phys_pg_pack;
+
+	return 0;
+
+page_pack_arr_mem_err:
+	kfree(phys_pg_pack);
+
+	return rc;
+}
+
+/*
+ * map_phys_page_pack - maps the physical page pack
+ *
+ * @ctx                : current context
+ * @vaddr              : start address of the virtual area to map from
+ * @phys_pg_pack       : the pack of physical pages to map to
+ *
+ * This function does the following:
+ * - Maps each chunk of virtual memory to matching physical chunk
+ * - Stores number of successful mappings in the given argument
+ * - Returns 0 on success, error code otherwise.
+ */
+static int map_phys_page_pack(struct hl_ctx *ctx, u64 vaddr,
+		struct hl_vm_phys_pg_pack *phys_pg_pack)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 next_vaddr = vaddr, paddr;
+	u32 page_size = phys_pg_pack->page_size;
+	int i, rc = 0, mapped_pg_cnt = 0;
+
+	for (i = 0 ; i < phys_pg_pack->npages ; i++) {
+		paddr = phys_pg_pack->pages[i];
+
+		/* For accessing the host we need to turn on bit 39 */
+		if (phys_pg_pack->created_from_userptr)
+			paddr += hdev->asic_prop.host_phys_base_address;
+
+		rc = hl_mmu_map(ctx, next_vaddr, paddr, page_size);
+		if (rc) {
+			dev_err(hdev->dev,
+				"map failed for handle %u, npages: %d, mapped: %d",
+				phys_pg_pack->handle, phys_pg_pack->npages,
+				mapped_pg_cnt);
+			goto err;
+		}
+
+		mapped_pg_cnt++;
+		next_vaddr += page_size;
+	}
+
+	return 0;
+
+err:
+	next_vaddr = vaddr;
+	for (i = 0 ; i < mapped_pg_cnt ; i++) {
+		if (hl_mmu_unmap(ctx, next_vaddr, page_size))
+			dev_warn_ratelimited(hdev->dev,
+				"failed to unmap handle %u, va: 0x%llx, pa: 0x%llx, page size: %u\n",
+					phys_pg_pack->handle, next_vaddr,
+					phys_pg_pack->pages[i], page_size);
+
+		next_vaddr += page_size;
+	}
+
+	return rc;
+}
+
+static int get_paddr_from_handle(struct hl_ctx *ctx, struct hl_mem_in *args,
+				u64 *paddr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_pack *phys_pg_pack;
+	u32 handle;
+
+	handle = lower_32_bits(args->map_device.handle);
+	spin_lock(&vm->idr_lock);
+	phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, handle);
+	if (!phys_pg_pack) {
+		spin_unlock(&vm->idr_lock);
+		dev_err(hdev->dev, "no match for handle %u\n", handle);
+		return -EINVAL;
+	}
+
+	*paddr = phys_pg_pack->pages[0];
+
+	spin_unlock(&vm->idr_lock);
+
+	return 0;
+}
+
+/*
+ * map_device_va - map the given memory
+ *
+ * @ctx	         : current context
+ * @args         : host parameters with handle/host virtual address
+ * @device_addr	 : pointer to result device virtual address
+ *
+ * This function does the following:
+ * - If given a physical device memory handle, map to a device virtual block
+ *   and return the start address of this block
+ * - If given a host virtual address and size, find the related physical pages,
+ *   map a device virtual block to this pages and return the start address of
+ *   this block
+ */
+static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args,
+		u64 *device_addr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_pack *phys_pg_pack;
+	struct hl_userptr *userptr = NULL;
+	struct hl_vm_hash_node *hnode;
+	enum vm_type_t *vm_type;
+	u64 ret_vaddr, hint_addr;
+	u32 handle = 0;
+	int rc;
+	bool is_userptr = args->flags & HL_MEM_USERPTR;
+
+	/* Assume failure */
+	*device_addr = 0;
+
+	if (is_userptr) {
+		rc = get_userptr_from_host_va(hdev, args, &userptr);
+		if (rc) {
+			dev_err(hdev->dev, "failed to get userptr from va\n");
+			return rc;
+		}
+
+		rc = init_phys_pg_pack_from_userptr(ctx, userptr,
+				&phys_pg_pack);
+		if (rc) {
+			dev_err(hdev->dev,
+				"unable to init page pack for vaddr 0x%llx\n",
+				args->map_host.host_virt_addr);
+			goto init_page_pack_err;
+		}
+
+		vm_type = (enum vm_type_t *) userptr;
+		hint_addr = args->map_host.hint_addr;
+	} else {
+		handle = lower_32_bits(args->map_device.handle);
+
+		spin_lock(&vm->idr_lock);
+		phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, handle);
+		if (!phys_pg_pack) {
+			spin_unlock(&vm->idr_lock);
+			dev_err(hdev->dev,
+				"no match for handle %u\n", handle);
+			return -EINVAL;
+		}
+
+		/* increment now to avoid freeing device memory while mapping */
+		atomic_inc(&phys_pg_pack->mapping_cnt);
+
+		spin_unlock(&vm->idr_lock);
+
+		vm_type = (enum vm_type_t *) phys_pg_pack;
+
+		hint_addr = args->map_device.hint_addr;
+	}
+
+	/*
+	 * relevant for mapping device physical memory only, as host memory is
+	 * implicitly shared
+	 */
+	if (!is_userptr && !(phys_pg_pack->flags & HL_MEM_SHARED) &&
+			phys_pg_pack->asid != ctx->asid) {
+		dev_err(hdev->dev,
+			"Failed to map memory, handle %u is not shared\n",
+			handle);
+		rc = -EPERM;
+		goto shared_err;
+	}
+
+	hnode = kzalloc(sizeof(*hnode), GFP_KERNEL);
+	if (!hnode) {
+		rc = -ENOMEM;
+		goto hnode_err;
+	}
+
+	ret_vaddr = get_va_block(hdev,
+			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
+			phys_pg_pack->total_size, hint_addr, is_userptr);
+	if (!ret_vaddr) {
+		dev_err(hdev->dev, "no available va block for handle %u\n",
+				handle);
+		rc = -ENOMEM;
+		goto va_block_err;
+	}
+
+	mutex_lock(&ctx->mmu_lock);
+
+	rc = map_phys_page_pack(ctx, ret_vaddr, phys_pg_pack);
+	if (rc) {
+		mutex_unlock(&ctx->mmu_lock);
+		dev_err(hdev->dev, "mapping page pack failed for handle %u\n",
+				handle);
+		goto map_err;
+	}
+
+	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, false, ctx->asid,
+			ret_vaddr, phys_pg_pack->total_size);
+
+	mutex_unlock(&ctx->mmu_lock);
+
+	ret_vaddr += phys_pg_pack->offset;
+
+	hnode->ptr = vm_type;
+	hnode->vaddr = ret_vaddr;
+
+	mutex_lock(&ctx->mem_hash_lock);
+	hash_add(ctx->mem_hash, &hnode->node, ret_vaddr);
+	mutex_unlock(&ctx->mem_hash_lock);
+
+	*device_addr = ret_vaddr;
+
+	if (is_userptr)
+		free_phys_pg_pack(hdev, phys_pg_pack);
+
+	return 0;
+
+map_err:
+	if (add_va_block(hdev,
+			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
+			ret_vaddr,
+			ret_vaddr + phys_pg_pack->total_size - 1))
+		dev_warn(hdev->dev,
+			"release va block failed for handle 0x%x, vaddr: 0x%llx\n",
+				handle, ret_vaddr);
+
+va_block_err:
+	kfree(hnode);
+hnode_err:
+shared_err:
+	atomic_dec(&phys_pg_pack->mapping_cnt);
+	if (is_userptr)
+		free_phys_pg_pack(hdev, phys_pg_pack);
+init_page_pack_err:
+	if (is_userptr)
+		free_userptr(hdev, userptr);
+
+	return rc;
+}
+
+/*
+ * unmap_device_va      - unmap the given device virtual address
+ *
+ * @ctx                 : current context
+ * @vaddr               : device virtual address to unmap
+ *
+ * This function does the following:
+ * - Unmap the physical pages related to the given virtual address
+ * - return the device virtual block to the virtual block list
+ */
+static int unmap_device_va(struct hl_ctx *ctx, u64 vaddr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm_phys_pg_pack *phys_pg_pack = NULL;
+	struct hl_vm_hash_node *hnode = NULL;
+	struct hl_userptr *userptr = NULL;
+	enum vm_type_t *vm_type;
+	u64 next_vaddr;
+	u32 page_size;
+	bool is_userptr;
+	int i, rc;
+
+	/* protect from double entrance */
+	mutex_lock(&ctx->mem_hash_lock);
+	hash_for_each_possible(ctx->mem_hash, hnode, node, (unsigned long)vaddr)
+		if (vaddr == hnode->vaddr)
+			break;
+
+	if (!hnode) {
+		mutex_unlock(&ctx->mem_hash_lock);
+		dev_err(hdev->dev,
+			"unmap failed, no mem hnode for vaddr 0x%llx\n",
+			vaddr);
+		return -EINVAL;
+	}
+
+	hash_del(&hnode->node);
+	mutex_unlock(&ctx->mem_hash_lock);
+
+	vm_type = hnode->ptr;
+
+	if (*vm_type == VM_TYPE_USERPTR) {
+		is_userptr = true;
+		userptr = hnode->ptr;
+		rc = init_phys_pg_pack_from_userptr(ctx, userptr,
+				&phys_pg_pack);
+		if (rc) {
+			dev_err(hdev->dev,
+				"unable to init page pack for vaddr 0x%llx\n",
+				vaddr);
+			goto vm_type_err;
+		}
+	} else if (*vm_type == VM_TYPE_PHYS_PACK) {
+		is_userptr = false;
+		phys_pg_pack = hnode->ptr;
+	} else {
+		dev_warn(hdev->dev,
+			"unmap failed, unknown vm desc for vaddr 0x%llx\n",
+				vaddr);
+		rc = -EFAULT;
+		goto vm_type_err;
+	}
+
+	if (atomic_read(&phys_pg_pack->mapping_cnt) == 0) {
+		dev_err(hdev->dev, "vaddr 0x%llx is not mapped\n", vaddr);
+		rc = -EINVAL;
+		goto mapping_cnt_err;
+	}
+
+	page_size = phys_pg_pack->page_size;
+	vaddr &= ~(((u64) page_size) - 1);
+
+	next_vaddr = vaddr;
+
+	mutex_lock(&ctx->mmu_lock);
+
+	for (i = 0 ; i < phys_pg_pack->npages ; i++, next_vaddr += page_size)
+		if (hl_mmu_unmap(ctx, next_vaddr, page_size))
+			dev_warn_ratelimited(hdev->dev,
+				"unmap failed for vaddr: 0x%llx\n", next_vaddr);
+
+	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, true, ctx->asid,
+			vaddr, phys_pg_pack->total_size);
+
+	mutex_unlock(&ctx->mmu_lock);
+
+	if (add_va_block(hdev,
+			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
+			vaddr,
+			vaddr + phys_pg_pack->total_size - 1))
+		dev_warn(hdev->dev, "add va block failed for vaddr: 0x%llx\n",
+				vaddr);
+
+	atomic_dec(&phys_pg_pack->mapping_cnt);
+	kfree(hnode);
+
+	if (is_userptr) {
+		free_phys_pg_pack(hdev, phys_pg_pack);
+		free_userptr(hdev, userptr);
+	}
+
+	return 0;
+
+mapping_cnt_err:
+	if (is_userptr)
+		free_phys_pg_pack(hdev, phys_pg_pack);
+vm_type_err:
+	mutex_lock(&ctx->mem_hash_lock);
+	hash_add(ctx->mem_hash, &hnode->node, vaddr);
+	mutex_unlock(&ctx->mem_hash_lock);
+
+	return rc;
+}
+
+int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	union hl_mem_args *args = data;
+	struct hl_device *hdev = hpriv->hdev;
+	struct hl_ctx *ctx = hpriv->ctx;
+	u64 device_addr = 0;
+	u32 handle = 0;
+	int rc;
+
+	if (hl_device_disabled_or_in_reset(hdev)) {
+		dev_warn_ratelimited(hdev->dev,
+			"Device is disabled or in reset. Can't execute memory IOCTL\n");
+		return -EBUSY;
+	}
+
+	if (hdev->mmu_enable) {
+		switch (args->in.op) {
+		case HL_MEM_OP_ALLOC:
+			if (!hdev->dram_supports_virtual_memory) {
+				dev_err(hdev->dev,
+					"DRAM alloc is not supported\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			if (args->in.alloc.mem_size == 0) {
+				dev_err(hdev->dev,
+					"alloc size must be larger than 0\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			rc = alloc_device_memory(ctx, &args->in, &handle);
+
+			memset(args, 0, sizeof(*args));
+			args->out.handle = (__u64) handle;
+			break;
+
+		case HL_MEM_OP_FREE:
+			if (!hdev->dram_supports_virtual_memory) {
+				dev_err(hdev->dev,
+					"DRAM free is not supported\n");
+				rc = -EINVAL;
+				goto out;
+			}
+			rc = free_device_memory(ctx, args->in.free.handle);
+			break;
+
+		case HL_MEM_OP_MAP:
+			rc = map_device_va(ctx, &args->in, &device_addr);
+
+			memset(args, 0, sizeof(*args));
+			args->out.device_virt_addr = device_addr;
+			break;
+
+		case HL_MEM_OP_UNMAP:
+			rc = unmap_device_va(ctx,
+					args->in.unmap.device_virt_addr);
+			break;
+
+		default:
+			dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
+			rc = -ENOTTY;
+			break;
+		}
+	} else {
+		switch (args->in.op) {
+		case HL_MEM_OP_ALLOC:
+			if (args->in.alloc.mem_size == 0) {
+				dev_err(hdev->dev,
+					"alloc size must be larger than 0\n");
+				rc = -EINVAL;
+				goto out;
+			}
+
+			/* Force contiguous as there are no real MMU
+			 * translations to overcome physical memory gaps
+			 */
+			args->in.flags |= HL_MEM_CONTIGUOUS;
+			rc = alloc_device_memory(ctx, &args->in, &handle);
+
+			memset(args, 0, sizeof(*args));
+			args->out.handle = (__u64) handle;
+			break;
+
+		case HL_MEM_OP_FREE:
+			rc = free_device_memory(ctx, args->in.free.handle);
+			break;
+
+		case HL_MEM_OP_MAP:
+			if (args->in.flags & HL_MEM_USERPTR) {
+				device_addr = args->in.map_host.host_virt_addr;
+				rc = 0;
+			} else {
+				rc = get_paddr_from_handle(ctx, &args->in,
+						&device_addr);
+			}
+
+			memset(args, 0, sizeof(*args));
+			args->out.device_virt_addr = device_addr;
+			break;
+
+		case HL_MEM_OP_UNMAP:
+			rc = 0;
+			break;
+
+		default:
+			dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
+			rc = -ENOTTY;
+			break;
+		}
+	}
+
+out:
+	return rc;
+}
+
 /*
  * hl_pin_host_memory - pins a chunk of host memory
  *
@@ -197,3 +1383,332 @@ bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr,
 
 	return false;
 }
+
+/*
+ * hl_va_range_init - initialize virtual addresses range
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ * @va_range            : pointer to the range to initialize
+ * @start               : range start address
+ * @end                 : range end address
+ *
+ * This function does the following:
+ * - Initializes the virtual addresses list of the given range with the given
+ *   addresses.
+ */
+static int hl_va_range_init(struct hl_device *hdev,
+		struct hl_va_range *va_range, u64 start, u64 end)
+{
+	int rc;
+
+	INIT_LIST_HEAD(&va_range->list);
+
+	/* PAGE_SIZE alignment */
+
+	if (start & (PAGE_SIZE - 1)) {
+		start &= PAGE_MASK;
+		start += PAGE_SIZE;
+	}
+
+	if (end & (PAGE_SIZE - 1))
+		end &= PAGE_MASK;
+
+	if (start >= end) {
+		dev_err(hdev->dev, "too small vm range for va list\n");
+		return -EFAULT;
+	}
+
+	rc = add_va_block(hdev, va_range, start, end);
+
+	if (rc) {
+		dev_err(hdev->dev, "Failed to init host va list\n");
+		return rc;
+	}
+
+	va_range->start_addr = start;
+	va_range->end_addr = end;
+
+	return 0;
+}
+
+/*
+ * hl_vm_ctx_init_with_ranges - initialize virtual memory for context
+ *
+ * @ctx                 : pointer to the habanalabs context structure
+ * @host_range_start    : host virtual addresses range start
+ * @host_range_end      : host virtual addresses range end
+ * @dram_range_start    : dram virtual addresses range start
+ * @dram_range_end      : dram virtual addresses range end
+ *
+ * This function initializes the following:
+ * - MMU for context
+ * - Virtual address to area descriptor hashtable
+ * - Virtual block list of available virtual memory
+ */
+int hl_vm_ctx_init_with_ranges(struct hl_ctx *ctx, u64 host_range_start,
+				u64 host_range_end, u64 dram_range_start,
+				u64 dram_range_end)
+{
+	struct hl_device *hdev = ctx->hdev;
+	int rc;
+
+	hl_mmu_ctx_init(ctx);
+
+	mutex_init(&ctx->mem_hash_lock);
+	hash_init(ctx->mem_hash);
+
+	mutex_init(&ctx->host_va_range.lock);
+
+	rc = hl_va_range_init(hdev, &ctx->host_va_range, host_range_start,
+			host_range_end);
+	if (rc) {
+		dev_err(hdev->dev, "failed to init host vm range\n");
+		goto host_vm_err;
+	}
+
+	mutex_init(&ctx->dram_va_range.lock);
+
+	rc = hl_va_range_init(hdev, &ctx->dram_va_range, dram_range_start,
+			dram_range_end);
+	if (rc) {
+		dev_err(hdev->dev, "failed to init dram vm range\n");
+		goto dram_vm_err;
+	}
+
+	return 0;
+
+dram_vm_err:
+	mutex_destroy(&ctx->dram_va_range.lock);
+
+	mutex_lock(&ctx->host_va_range.lock);
+	clear_va_list_locked(hdev, &ctx->host_va_range.list);
+	mutex_unlock(&ctx->host_va_range.lock);
+host_vm_err:
+	mutex_destroy(&ctx->host_va_range.lock);
+	mutex_destroy(&ctx->mem_hash_lock);
+	hl_mmu_ctx_fini(ctx);
+
+	return rc;
+}
+
+int hl_vm_ctx_init(struct hl_ctx *ctx)
+{
+	struct asic_fixed_properties *prop = &ctx->hdev->asic_prop;
+	u64 host_range_start, host_range_end, dram_range_start,
+		dram_range_end;
+
+	atomic64_set(&ctx->dram_phys_mem, 0);
+
+	/*
+	 * - If MMU is enabled, init the ranges as usual.
+	 * - If MMU is disabled, in case of host mapping, the returned address
+	 *   is the given one.
+	 *   In case of DRAM mapping, the returned address is the physical
+	 *   address of the memory related to the given handle.
+	 */
+	if (ctx->hdev->mmu_enable) {
+		dram_range_start = prop->va_space_dram_start_address;
+		dram_range_end = prop->va_space_dram_end_address;
+		host_range_start = prop->va_space_host_start_address;
+		host_range_end = prop->va_space_host_end_address;
+	} else {
+		dram_range_start = prop->dram_user_base_address;
+		dram_range_end = prop->dram_end_address;
+		host_range_start = prop->dram_user_base_address;
+		host_range_end = prop->dram_end_address;
+	}
+
+	return hl_vm_ctx_init_with_ranges(ctx, host_range_start, host_range_end,
+			dram_range_start, dram_range_end);
+}
+
+/*
+ * hl_va_range_fini     - clear a virtual addresses range
+ *
+ * @hdev                : pointer to the habanalabs structure
+ * va_range             : pointer to virtual addresses range
+ *
+ * This function initializes the following:
+ * - Checks that the given range contains the whole initial range
+ * - Frees the virtual addresses block list and its lock
+ */
+static void hl_va_range_fini(struct hl_device *hdev,
+		struct hl_va_range *va_range)
+{
+	struct hl_vm_va_block *va_block;
+
+	if (list_empty(&va_range->list)) {
+		dev_warn(hdev->dev,
+				"va list should not be empty on cleanup!\n");
+		goto out;
+	}
+
+	if (!list_is_singular(&va_range->list)) {
+		dev_warn(hdev->dev,
+			"va list should not contain multiple blocks on cleanup!\n");
+		goto free_va_list;
+	}
+
+	va_block = list_first_entry(&va_range->list, typeof(*va_block), node);
+
+	if (va_block->start != va_range->start_addr ||
+		va_block->end != va_range->end_addr) {
+		dev_warn(hdev->dev,
+			"wrong va block on cleanup, from 0x%llx to 0x%llx\n",
+				va_block->start, va_block->end);
+		goto free_va_list;
+	}
+
+free_va_list:
+	mutex_lock(&va_range->lock);
+	clear_va_list_locked(hdev, &va_range->list);
+	mutex_unlock(&va_range->lock);
+
+out:
+	mutex_destroy(&va_range->lock);
+}
+
+/*
+ * hl_vm_ctx_fini       - virtual memory teardown of context
+ *
+ * @ctx                 : pointer to the habanalabs context structure
+ *
+ * This function perform teardown the following:
+ * - Virtual block list of available virtual memory
+ * - Virtual address to area descriptor hashtable
+ * - MMU for context
+ *
+ * In addition this function does the following:
+ * - Unmaps the existing hashtable nodes if the hashtable is not empty. The
+ *   hashtable should be empty as no valid mappings should exist at this
+ *   point.
+ * - Frees any existing physical page list from the idr which relates to the
+ *   current context asid.
+ * - This function checks the virtual block list for correctness. At this point
+ *   the list should contain one element which describes the whole virtual
+ *   memory range of the context. Otherwise, a warning is printed.
+ */
+void hl_vm_ctx_fini(struct hl_ctx *ctx)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct hl_vm *vm = &hdev->vm;
+	struct hl_vm_phys_pg_pack *phys_pg_list;
+	struct hl_vm_hash_node *hnode;
+	struct hlist_node *tmp_node;
+	int i;
+
+	if (!hash_empty(ctx->mem_hash))
+		dev_notice(hdev->dev, "ctx is freed while it has va in use\n");
+
+	hash_for_each_safe(ctx->mem_hash, i, tmp_node, hnode, node) {
+		dev_dbg(hdev->dev,
+			"hl_mem_hash_node of vaddr 0x%llx of asid %d is still alive\n",
+			hnode->vaddr, ctx->asid);
+		unmap_device_va(ctx, hnode->vaddr);
+	}
+
+	spin_lock(&vm->idr_lock);
+	idr_for_each_entry(&vm->phys_pg_pack_handles, phys_pg_list, i)
+		if (phys_pg_list->asid == ctx->asid) {
+			dev_dbg(hdev->dev,
+				"page list 0x%p of asid %d is still alive\n",
+				phys_pg_list, ctx->asid);
+			free_phys_pg_pack(hdev, phys_pg_list);
+			idr_remove(&vm->phys_pg_pack_handles, i);
+		}
+	spin_unlock(&vm->idr_lock);
+
+	hl_va_range_fini(hdev, &ctx->dram_va_range);
+	hl_va_range_fini(hdev, &ctx->host_va_range);
+
+	mutex_destroy(&ctx->mem_hash_lock);
+	hl_mmu_ctx_fini(ctx);
+}
+
+/*
+ * hl_vm_init           - initialize virtual memory module
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ *
+ * This function initializes the following:
+ * - MMU module
+ * - DRAM physical pages pool of 2MB
+ * - Idr for device memory allocation handles
+ */
+int hl_vm_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	struct hl_vm *vm = &hdev->vm;
+	int rc;
+
+	rc = hl_mmu_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to init MMU\n");
+		return rc;
+	}
+
+	vm->dram_pg_pool = gen_pool_create(__ffs(prop->dram_page_size), -1);
+	if (!vm->dram_pg_pool) {
+		dev_err(hdev->dev, "Failed to create dram page pool\n");
+		rc = -ENOMEM;
+		goto pool_create_err;
+	}
+
+	kref_init(&vm->dram_pg_pool_refcount);
+
+	rc = gen_pool_add(vm->dram_pg_pool, prop->dram_user_base_address,
+			prop->dram_end_address - prop->dram_user_base_address,
+			-1);
+
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to add memory to dram page pool %d\n", rc);
+		goto pool_add_err;
+	}
+
+	spin_lock_init(&vm->idr_lock);
+	idr_init(&vm->phys_pg_pack_handles);
+
+	atomic64_set(&hdev->dram_used_mem, 0);
+
+	vm->init_done = true;
+
+	return 0;
+
+pool_add_err:
+	gen_pool_destroy(vm->dram_pg_pool);
+pool_create_err:
+	hl_mmu_fini(hdev);
+
+	return rc;
+}
+
+/*
+ * hl_vm_fini           - virtual memory module teardown
+ *
+ * @hdev                : pointer to the habanalabs device structure
+ *
+ * This function perform teardown to the following:
+ * - Idr for device memory allocation handles
+ * - DRAM physical pages pool of 2MB
+ * - MMU module
+ */
+void hl_vm_fini(struct hl_device *hdev)
+{
+	struct hl_vm *vm = &hdev->vm;
+
+	if (!vm->init_done)
+		return;
+
+	/*
+	 * At this point all the contexts should be freed and hence no DRAM
+	 * memory should be in use. Hence the DRAM pool should be freed here.
+	 */
+	if (kref_put(&vm->dram_pg_pool_refcount, dram_pg_pool_do_release) != 1)
+		dev_warn(hdev->dev, "dram_pg_pool was not destroyed on %s\n",
+				__func__);
+
+	hl_mmu_fini(hdev);
+
+	vm->init_done = false;
+}
diff --git a/drivers/misc/habanalabs/mmu.c b/drivers/misc/habanalabs/mmu.c
new file mode 100644
index 000000000000..e6fa9d81933b
--- /dev/null
+++ b/drivers/misc/habanalabs/mmu.c
@@ -0,0 +1,690 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+#include "include/hw_ip/mmu/mmu_general.h"
+
+#include <linux/genalloc.h>
+
+static struct pgt_info *get_pgt_info(struct hl_ctx *ctx, u64 addr)
+{
+	struct pgt_info *pgt_info = NULL;
+
+	hash_for_each_possible(ctx->mmu_hash, pgt_info, node,
+				(unsigned long) addr)
+		if (addr == pgt_info->addr)
+			break;
+
+	return pgt_info;
+}
+
+static void free_hop(struct hl_ctx *ctx, u64 hop_addr)
+{
+	struct pgt_info *pgt_info = get_pgt_info(ctx, hop_addr);
+
+	gen_pool_free(pgt_info->ctx->hdev->mmu_pgt_pool, pgt_info->addr,
+			ctx->hdev->asic_prop.mmu_hop_table_size);
+	hash_del(&pgt_info->node);
+
+	kfree(pgt_info);
+}
+
+static u64 alloc_hop(struct hl_ctx *ctx)
+{
+	struct hl_device *hdev = ctx->hdev;
+	struct pgt_info *pgt_info;
+	u64 addr;
+
+	pgt_info = kmalloc(sizeof(*pgt_info), GFP_KERNEL);
+	if (!pgt_info)
+		return ULLONG_MAX;
+
+	addr = (u64) gen_pool_alloc(hdev->mmu_pgt_pool,
+			hdev->asic_prop.mmu_hop_table_size);
+	if (!addr) {
+		dev_err(hdev->dev, "failed to allocate page\n");
+		kfree(pgt_info);
+		return ULLONG_MAX;
+	}
+
+	pgt_info->addr = addr;
+	pgt_info->ctx = ctx;
+	pgt_info->num_of_ptes = 0;
+	hash_add(ctx->mmu_hash, &pgt_info->node, addr);
+
+	return addr;
+}
+
+static inline void clear_pte(struct hl_device *hdev, u64 pte_addr)
+{
+	/* clear the last and present bits */
+	hdev->asic_funcs->write_pte(hdev, pte_addr, 0);
+}
+
+static inline void get_pte(struct hl_ctx *ctx, u64 hop_addr)
+{
+	get_pgt_info(ctx, hop_addr)->num_of_ptes++;
+}
+
+/*
+ * put_pte - decrement the num of ptes and free the hop if possible
+ *
+ * @ctx: pointer to the context structure
+ * @hop_addr: addr of the hop
+ *
+ * This function returns the number of ptes left on this hop. If the number is
+ * 0, it means the pte was freed.
+ */
+static inline int put_pte(struct hl_ctx *ctx, u64 hop_addr)
+{
+	struct pgt_info *pgt_info = get_pgt_info(ctx, hop_addr);
+	int num_of_ptes_left;
+
+	pgt_info->num_of_ptes--;
+
+	/*
+	 * Need to save the number of ptes left because free_hop might free
+	 * the pgt_info
+	 */
+	num_of_ptes_left = pgt_info->num_of_ptes;
+	if (!num_of_ptes_left)
+		free_hop(ctx, hop_addr);
+
+	return num_of_ptes_left;
+}
+
+static inline u64 get_hop0_addr(struct hl_ctx *ctx)
+{
+	return ctx->hdev->asic_prop.mmu_pgt_addr +
+			(ctx->asid * ctx->hdev->asic_prop.mmu_hop_table_size);
+}
+
+static inline u64 get_hopN_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+					u64 virt_addr, u64 mask, u64 shift)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & mask) >> shift);
+}
+
+static inline u64 get_hop0_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
+{
+	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP0_MASK, HOP0_SHIFT);
+}
+
+static inline u64 get_hop1_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
+{
+	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP1_MASK, HOP1_SHIFT);
+}
+
+static inline u64 get_hop2_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
+{
+	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP2_MASK, HOP2_SHIFT);
+}
+
+static inline u64 get_hop3_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
+{
+	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP3_MASK, HOP3_SHIFT);
+}
+
+static inline u64 get_hop4_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
+{
+	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP4_MASK, HOP4_SHIFT);
+}
+
+static inline u64 get_next_hop_addr(u64 curr_pte)
+{
+	if (curr_pte & PAGE_PRESENT_MASK)
+		return curr_pte & PHYS_ADDR_MASK;
+	else
+		return ULLONG_MAX;
+}
+
+static inline u64 get_alloc_next_hop_addr(struct hl_ctx *ctx, u64 curr_pte,
+						bool *is_new_hop)
+{
+	u64 hop_addr = get_next_hop_addr(curr_pte);
+
+	if (hop_addr == ULLONG_MAX) {
+		hop_addr = alloc_hop(ctx);
+		*is_new_hop = true;
+	}
+
+	return hop_addr;
+}
+
+/*
+ * hl_mmu_init - init the mmu module
+ *
+ * @hdev: pointer to the habanalabs device structure
+ *
+ * This function does the following:
+ * - Allocate max_asid zeroed hop0 pgts so no mapping is available
+ * - Enable mmu in hw
+ * - Invalidate the mmu cache
+ * - Create a pool of pages for pgts
+ * - Returns 0 on success
+ *
+ * This function depends on DMA QMAN to be working!
+ */
+int hl_mmu_init(struct hl_device *hdev)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int rc;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	/* MMU HW init was already done in device hw_init() */
+
+	mutex_init(&hdev->mmu_cache_lock);
+
+	hdev->mmu_pgt_pool =
+			gen_pool_create(__ffs(prop->mmu_hop_table_size), -1);
+
+	if (!hdev->mmu_pgt_pool) {
+		dev_err(hdev->dev, "Failed to create page gen pool\n");
+		rc = -ENOMEM;
+		goto err_pool_create;
+	}
+
+	rc = gen_pool_add(hdev->mmu_pgt_pool, prop->mmu_pgt_addr +
+			prop->mmu_hop0_tables_total_size,
+			prop->mmu_pgt_size - prop->mmu_hop0_tables_total_size,
+			-1);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to add memory to page gen pool\n");
+		goto err_pool_add;
+	}
+
+	return 0;
+
+err_pool_add:
+	gen_pool_destroy(hdev->mmu_pgt_pool);
+err_pool_create:
+	mutex_destroy(&hdev->mmu_cache_lock);
+
+	return rc;
+}
+
+/*
+ * hl_mmu_fini - release the mmu module.
+ *
+ * @hdev: pointer to the habanalabs device structure
+ *
+ * This function does the following:
+ * - Disable mmu in hw
+ * - free the pgts pool
+ *
+ * All ctxs should be freed before calling this func
+ */
+void hl_mmu_fini(struct hl_device *hdev)
+{
+	if (!hdev->mmu_enable)
+		return;
+
+	gen_pool_destroy(hdev->mmu_pgt_pool);
+
+	mutex_destroy(&hdev->mmu_cache_lock);
+
+	/* MMU HW fini will be done in device hw_fini() */
+}
+
+/*
+ * hl_mmu_ctx_init - init a ctx for using the mmu module
+ *
+ * @ctx: pointer to the context structure
+ *
+ * This function does the following:
+ * - Init a mutex to protect the concurrent mapping flow
+ * - Init a hash to hold all pgts related to this ctx
+ */
+void hl_mmu_ctx_init(struct hl_ctx *ctx)
+{
+	if (!ctx->hdev->mmu_enable)
+		return;
+
+	mutex_init(&ctx->mmu_lock);
+	hash_init(ctx->mmu_hash);
+}
+
+/*
+ * hl_mmu_ctx_fini - disable a ctx from using the mmu module
+ *
+ * @ctx: pointer to the context structure
+ *
+ * This function does the following:
+ * - Free any pgts which were not freed yet
+ * - Free the mutex
+ */
+void hl_mmu_ctx_fini(struct hl_ctx *ctx)
+{
+	struct pgt_info *pgt_info;
+	struct hlist_node *tmp;
+	int i;
+
+	if (!ctx->hdev->mmu_enable)
+		return;
+
+	if (!hash_empty(ctx->mmu_hash))
+		dev_err(ctx->hdev->dev,
+				"ctx is freed while it has pgts in use\n");
+
+	hash_for_each_safe(ctx->mmu_hash, i, tmp, pgt_info, node) {
+		dev_err(ctx->hdev->dev,
+			"pgt_info of addr 0x%llx of asid %d was not destroyed, num_ptes: %d\n",
+			pgt_info->addr, ctx->asid, pgt_info->num_of_ptes);
+		free_hop(ctx, pgt_info->addr);
+	}
+
+	mutex_destroy(&ctx->mmu_lock);
+}
+
+static int _hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 hop0_addr = 0, hop0_pte_addr = 0,
+		hop1_addr = 0, hop1_pte_addr = 0,
+		hop2_addr = 0, hop2_pte_addr = 0,
+		hop3_addr = 0, hop3_pte_addr = 0,
+		hop4_addr = 0, hop4_pte_addr = 0,
+		curr_pte;
+	int clear_hop3 = 1;
+
+	hop0_addr = get_hop0_addr(ctx);
+
+	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
+
+	hop1_addr = get_next_hop_addr(curr_pte);
+
+	if (hop1_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
+
+	hop2_addr = get_next_hop_addr(curr_pte);
+
+	if (hop2_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
+
+	hop3_addr = get_next_hop_addr(curr_pte);
+
+	if (hop3_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
+
+	if (!(curr_pte & LAST_MASK)) {
+		hop4_addr = get_next_hop_addr(curr_pte);
+
+		if (hop4_addr == ULLONG_MAX)
+			goto not_mapped;
+
+		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
+
+		curr_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
+
+		clear_hop3 = 0;
+	}
+
+	if (!(curr_pte & PAGE_PRESENT_MASK))
+		goto not_mapped;
+
+	clear_pte(hdev, hop4_addr ? hop4_pte_addr : hop3_pte_addr);
+
+	if (hop4_addr && !put_pte(ctx, hop4_addr))
+		clear_hop3 = 1;
+
+	if (!clear_hop3)
+		goto flush;
+	clear_pte(hdev, hop3_pte_addr);
+
+	if (put_pte(ctx, hop3_addr))
+		goto flush;
+	clear_pte(hdev, hop2_pte_addr);
+
+	if (put_pte(ctx, hop2_addr))
+		goto flush;
+	clear_pte(hdev, hop1_pte_addr);
+
+	if (put_pte(ctx, hop1_addr))
+		goto flush;
+	clear_pte(hdev, hop0_pte_addr);
+
+flush:
+	/* flush all writes from all cores to reach PCI */
+	mb();
+
+	hdev->asic_funcs->read_pte(hdev,
+				hop4_addr ? hop4_pte_addr : hop3_pte_addr);
+
+	return 0;
+
+not_mapped:
+	dev_err(hdev->dev, "virt addr 0x%llx is not mapped to phys addr\n",
+		virt_addr);
+
+	return -EINVAL;
+}
+
+/*
+ * hl_mmu_unmap - unmaps a virtual addr
+ *
+ * @ctx: pointer to the context structure
+ * @virt_addr: virt addr to map from
+ * @page_size: size of the page to unmap
+ *
+ * This function does the following:
+ * - Check that the virt addr is mapped
+ * - Unmap the virt addr and frees pgts if possible
+ * - Returns 0 on success, -EINVAL if the given addr is not mapped
+ *
+ * Because this function changes the page tables in the device and because it
+ * changes the MMU hash, it must be protected by a lock.
+ * However, because it maps only a single page, the lock should be implemented
+ * in a higher level in order to protect the entire mapping of the memory area
+ */
+int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr, u32 page_size)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 real_virt_addr;
+	u32 real_page_size, npages;
+	int i, rc;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	/*
+	 * The H/W handles mapping of 4KB/2MB page. Hence if the host page size
+	 * is bigger, we break it to sub-pages and unmap them separately.
+	 */
+	if ((page_size % PAGE_SIZE_2MB) == 0) {
+		real_page_size = PAGE_SIZE_2MB;
+	} else if ((page_size % PAGE_SIZE_4KB) == 0) {
+		real_page_size = PAGE_SIZE_4KB;
+	} else {
+		dev_err(hdev->dev,
+			"page size of %u is not 4KB nor 2MB aligned, can't unmap\n",
+				page_size);
+
+		return -EFAULT;
+	}
+
+	npages = page_size / real_page_size;
+	real_virt_addr = virt_addr;
+
+	for (i = 0 ; i < npages ; i++) {
+		rc = _hl_mmu_unmap(ctx, real_virt_addr);
+		if (rc)
+			return rc;
+
+		real_virt_addr += real_page_size;
+	}
+
+	return 0;
+}
+
+static int _hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr,
+		u32 page_size)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 hop0_addr = 0, hop0_pte_addr = 0,
+		hop1_addr = 0, hop1_pte_addr = 0,
+		hop2_addr = 0, hop2_pte_addr = 0,
+		hop3_addr = 0, hop3_pte_addr = 0,
+		hop4_addr = 0, hop4_pte_addr = 0,
+		curr_pte = 0;
+	bool hop1_new = false, hop2_new = false, hop3_new = false,
+		hop4_new = false, is_huge;
+	int rc = -ENOMEM;
+
+	/*
+	 * This mapping function can map a 4KB/2MB page. For 2MB page there are
+	 * only 3 hops rather than 4. Currently the DRAM allocation uses 2MB
+	 * pages only but user memory could have been allocated with one of the
+	 * two page sizes. Since this is a common code for all the three cases,
+	 * we need this hugs page check.
+	 */
+	is_huge = page_size == PAGE_SIZE_2MB;
+
+	hop0_addr = get_hop0_addr(ctx);
+
+	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
+
+	hop1_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop1_new);
+
+	if (hop1_addr == ULLONG_MAX)
+		goto err;
+
+	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
+
+	hop2_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop2_new);
+
+	if (hop2_addr == ULLONG_MAX)
+		goto err;
+
+	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
+
+	hop3_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop3_new);
+
+	if (hop3_addr == ULLONG_MAX)
+		goto err;
+
+	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
+
+	curr_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
+
+	if (!is_huge) {
+		hop4_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop4_new);
+
+		if (hop4_addr == ULLONG_MAX)
+			goto err;
+
+		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
+
+		curr_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
+	}
+
+	if (curr_pte & PAGE_PRESENT_MASK) {
+		dev_err(hdev->dev,
+				"mapping already exists for virt_addr 0x%llx\n",
+					virt_addr);
+
+		dev_dbg(hdev->dev, "hop0 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop0_pte_addr),
+				hop0_pte_addr);
+		dev_dbg(hdev->dev, "hop1 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop1_pte_addr),
+				hop1_pte_addr);
+		dev_dbg(hdev->dev, "hop2 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop2_pte_addr),
+				hop2_pte_addr);
+		dev_dbg(hdev->dev, "hop3 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev, hop3_pte_addr),
+				hop3_pte_addr);
+
+		if (!is_huge)
+			dev_dbg(hdev->dev, "hop4 pte: 0x%llx (0x%llx)\n",
+				hdev->asic_funcs->read_pte(hdev,
+							hop4_pte_addr),
+							hop4_pte_addr);
+
+		rc = EINVAL;
+		goto err;
+	}
+
+	curr_pte = (phys_addr & PTE_PHYS_ADDR_MASK) | LAST_MASK
+			| PAGE_PRESENT_MASK;
+
+	hdev->asic_funcs->write_pte(hdev,
+				is_huge ? hop3_pte_addr : hop4_pte_addr,
+				curr_pte);
+
+	if (hop1_new) {
+		curr_pte = (hop1_addr & PTE_PHYS_ADDR_MASK) |
+				PAGE_PRESENT_MASK;
+		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop0_pte_addr,
+				curr_pte);
+	}
+	if (hop2_new) {
+		curr_pte = (hop2_addr & PTE_PHYS_ADDR_MASK) |
+				PAGE_PRESENT_MASK;
+		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop1_pte_addr,
+				curr_pte);
+		get_pte(ctx, hop1_addr);
+	}
+	if (hop3_new) {
+		curr_pte = (hop3_addr & PTE_PHYS_ADDR_MASK) |
+				PAGE_PRESENT_MASK;
+		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop2_pte_addr,
+				curr_pte);
+		get_pte(ctx, hop2_addr);
+	}
+
+	if (!is_huge) {
+		if (hop4_new) {
+			curr_pte = (hop4_addr & PTE_PHYS_ADDR_MASK) |
+					PAGE_PRESENT_MASK;
+			ctx->hdev->asic_funcs->write_pte(ctx->hdev,
+					hop3_pte_addr, curr_pte);
+			get_pte(ctx, hop3_addr);
+		}
+
+		get_pte(ctx, hop4_addr);
+	} else {
+		get_pte(ctx, hop3_addr);
+	}
+
+	/* flush all writes from all cores to reach PCI */
+	mb();
+
+	hdev->asic_funcs->read_pte(hdev,
+				is_huge ? hop3_pte_addr : hop4_pte_addr);
+
+	return 0;
+
+err:
+	if (hop4_new)
+		free_hop(ctx, hop4_addr);
+	if (hop3_new)
+		free_hop(ctx, hop3_addr);
+	if (hop2_new)
+		free_hop(ctx, hop2_addr);
+	if (hop1_new)
+		free_hop(ctx, hop1_addr);
+
+	return rc;
+}
+
+/*
+ * hl_mmu_map - maps a virtual addr to physical addr
+ *
+ * @ctx: pointer to the context structure
+ * @virt_addr: virt addr to map from
+ * @phys_addr: phys addr to map to
+ * @page_size: physical page size
+ *
+ * This function does the following:
+ * - Check that the virt addr is not mapped
+ * - Allocate pgts as necessary in order to map the virt addr to the phys
+ * - Returns 0 on success, -EINVAL if addr is already mapped, or -ENOMEM.
+ *
+ * Because this function changes the page tables in the device and because it
+ * changes the MMU hash, it must be protected by a lock.
+ * However, because it maps only a single page, the lock should be implemented
+ * in a higher level in order to protect the entire mapping of the memory area
+ */
+int hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr, u32 page_size)
+{
+	struct hl_device *hdev = ctx->hdev;
+	u64 real_virt_addr;
+	u32 real_page_size, npages;
+	int i, rc, mapped_cnt = 0;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	/*
+	 * The H/W handles mapping of 4KB/2MB page. Hence if the host page size
+	 * is bigger, we break it to sub-pages and map them separately.
+	 */
+	if ((page_size % PAGE_SIZE_2MB) == 0) {
+		real_page_size = PAGE_SIZE_2MB;
+	} else if ((page_size % PAGE_SIZE_4KB) == 0) {
+		real_page_size = PAGE_SIZE_4KB;
+	} else {
+		dev_err(hdev->dev,
+			"page size of %u is not 4KB nor 2MB aligned, can't map\n",
+				page_size);
+
+		return -EFAULT;
+	}
+
+	npages = page_size / real_page_size;
+	real_virt_addr = virt_addr;
+
+	for (i = 0 ; i < npages ; i++) {
+		rc = _hl_mmu_map(ctx, real_virt_addr, phys_addr,
+				real_page_size);
+		if (rc)
+			goto err;
+
+		real_virt_addr += real_page_size;
+		mapped_cnt++;
+	}
+
+	return 0;
+
+err:
+	real_virt_addr = virt_addr;
+	for (i = 0 ; i < mapped_cnt ; i++) {
+		if (_hl_mmu_unmap(ctx, real_virt_addr))
+			dev_warn_ratelimited(hdev->dev,
+				"failed to unmap va: 0x%llx\n", real_virt_addr);
+
+		real_virt_addr += real_page_size;
+	}
+
+	return rc;
+}
+
+/*
+ * hl_mmu_swap_out - marks all mapping of the given ctx as swapped out
+ *
+ * @ctx: pointer to the context structure
+ *
+ */
+void hl_mmu_swap_out(struct hl_ctx *ctx)
+{
+
+}
+
+/*
+ * hl_mmu_swap_in - marks all mapping of the given ctx as swapped in
+ *
+ * @ctx: pointer to the context structure
+ *
+ */
+void hl_mmu_swap_in(struct hl_ctx *ctx)
+{
+
+}
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index fba49417f607..9015043887d1 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -162,6 +162,108 @@ union hl_wait_cs_args {
 	struct hl_wait_cs_out out;
 };
 
+/* Opcode to alloc device memory */
+#define HL_MEM_OP_ALLOC			0
+/* Opcode to free previously allocated device memory */
+#define HL_MEM_OP_FREE			1
+/* Opcode to map host memory */
+#define HL_MEM_OP_MAP			2
+/* Opcode to unmap previously mapped host memory */
+#define HL_MEM_OP_UNMAP			3
+
+/* Memory flags */
+#define HL_MEM_CONTIGUOUS	0x1
+#define HL_MEM_SHARED		0x2
+#define HL_MEM_USERPTR		0x4
+
+struct hl_mem_in {
+	union {
+		/* HL_MEM_OP_ALLOC- allocate device memory */
+		struct {
+			/* Size to alloc */
+			__u32 mem_size;
+			__u32 pad;
+		} alloc;
+
+		/* HL_MEM_OP_FREE - free device memory */
+		struct {
+			/* Handle returned from HL_MEM_OP_ALLOC */
+			__u64 handle;
+		} free;
+
+		/* HL_MEM_OP_MAP - map device memory */
+		struct {
+			/*
+			 * Requested virtual address of mapped memory.
+			 * KMD will try to map the requested region to this
+			 * hint address, as long as the address is valid and
+			 * not already mapped. The user should check the
+			 * returned address of the IOCTL to make sure he got
+			 * the hint address. Passing 0 here means that KMD
+			 * will choose the address itself.
+			 */
+			__u64 hint_addr;
+			/* Handle returned from HL_MEM_OP_ALLOC */
+			__u64 handle;
+		} map_device;
+
+		/* HL_MEM_OP_MAP - map host memory */
+		struct {
+			/* Address of allocated host memory */
+			__u64 host_virt_addr;
+			/*
+			 * Requested virtual address of mapped memory.
+			 * KMD will try to map the requested region to this
+			 * hint address, as long as the address is valid and
+			 * not already mapped. The user should check the
+			 * returned address of the IOCTL to make sure he got
+			 * the hint address. Passing 0 here means that KMD
+			 * will choose the address itself.
+			 */
+			__u64 hint_addr;
+			/* Size of allocated host memory */
+			__u32 mem_size;
+			__u32 pad;
+		} map_host;
+
+		/* HL_MEM_OP_UNMAP - unmap host memory */
+		struct {
+			/* Virtual address returned from HL_MEM_OP_MAP */
+			__u64 device_virt_addr;
+		} unmap;
+	};
+
+	/* HL_MEM_OP_* */
+	__u32 op;
+	/* HL_MEM_* flags */
+	__u32 flags;
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
+
+struct hl_mem_out {
+	union {
+		/*
+		 * Used for HL_MEM_OP_MAP as the virtual address that was
+		 * assigned in the device VA space.
+		 * A value of 0 means the requested operation failed.
+		 */
+		__u64 device_virt_addr;
+
+		/*
+		 * Used for HL_MEM_OP_ALLOC. This is the assigned
+		 * handle for the allocated memory
+		 */
+		__u64 handle;
+	};
+};
+
+union hl_mem_args {
+	struct hl_mem_in in;
+	struct hl_mem_out out;
+};
+
 /*
  * Command Buffer
  * - Request a Command Buffer
@@ -245,7 +347,25 @@ union hl_wait_cs_args {
 #define HL_IOCTL_WAIT_CS			\
 		_IOWR('H', 0x04, union hl_wait_cs_args)
 
+/*
+ * Memory
+ * - Map host memory to device MMU
+ * - Unmap host memory from device MMU
+ *
+ * This IOCTL allows the user to map host memory to the device MMU
+ *
+ * For host memory, the IOCTL doesn't allocate memory. The user is supposed
+ * to allocate the memory in user-space (malloc/new). The driver pins the
+ * physical pages (up to the allowed limit by the OS), assigns a virtual
+ * address in the device VA space and initializes the device MMU.
+ *
+ * There is an option for the user to specify the requested virtual address.
+ *
+ */
+#define HL_IOCTL_MEMORY		\
+		_IOWR('H', 0x05, union hl_mem_args)
+
 #define HL_COMMAND_START	0x02
-#define HL_COMMAND_END		0x05
+#define HL_COMMAND_END		0x06
 
 #endif /* HABANALABS_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 13/15] habanalabs: implement INFO IOCTL
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (10 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 14/15] habanalabs: add debugfs support Oded Gabbay
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch implements the INFO IOCTL. That IOCTL is used by the user to
query information that is relevant/needed by the user in order to submit
deep learning jobs to Goya.

The information is divided into several categories, such as H/W IP, Events
that happened, DDR usage and more.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - Use -ENOTTY to signal bad ioctl parameter in info ioctl
 
 drivers/misc/habanalabs/goya/goya.c        |   6 +
 drivers/misc/habanalabs/habanalabs.h       |   2 +
 drivers/misc/habanalabs/habanalabs_ioctl.c | 126 +++++++++++++++++++++
 include/uapi/misc/habanalabs.h             |  75 +++++++++++-
 4 files changed, 208 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index bba159fd7755..7cce308243ad 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -5127,6 +5127,11 @@ static void goya_hw_queues_unlock(struct hl_device *hdev)
 	spin_unlock(&goya->hw_queues_lock);
 }
 
+static u32 goya_get_pci_id(struct hl_device *hdev)
+{
+	return hdev->pdev->device;
+}
+
 int goya_get_eeprom_data(struct hl_device *hdev, void *data, size_t max_size)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -5228,6 +5233,7 @@ static const struct hl_asic_funcs goya_funcs = {
 	.soft_reset_late_init = goya_soft_reset_late_init,
 	.hw_queues_lock = goya_hw_queues_lock,
 	.hw_queues_unlock = goya_hw_queues_unlock,
+	.get_pci_id = goya_get_pci_id,
 	.get_eeprom_data = goya_get_eeprom_data,
 	.send_cpu_message = goya_send_cpu_message,
 	.get_hw_state = goya_get_hw_state
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index bf2d5cba6148..759510c96034 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -471,6 +471,7 @@ enum hl_pll_frequency {
  * @soft_reset_late_init: perform certain actions needed after soft reset.
  * @hw_queues_lock: acquire H/W queues lock.
  * @hw_queues_unlock: release H/W queues lock.
+ * @get_pci_id: retrieve PCI ID.
  * @get_eeprom_data: retrieve EEPROM data from F/W.
  * @send_cpu_message: send buffer to ArmCP.
  * @get_hw_state: retrieve the H/W state
@@ -540,6 +541,7 @@ struct hl_asic_funcs {
 	int (*soft_reset_late_init)(struct hl_device *hdev);
 	void (*hw_queues_lock)(struct hl_device *hdev);
 	void (*hw_queues_unlock)(struct hl_device *hdev);
+	u32 (*get_pci_id)(struct hl_device *hdev);
 	int (*get_eeprom_data)(struct hl_device *hdev, void *data,
 				size_t max_size);
 	int (*send_cpu_message)(struct hl_device *hdev, u32 *msg,
diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
index 71ef0c91668b..12f2f354324e 100644
--- a/drivers/misc/habanalabs/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
@@ -12,10 +12,136 @@
 #include <linux/uaccess.h>
 #include <linux/cred.h>
 
+static int hw_ip_info(struct hl_device *hdev, struct hl_info_args *args)
+{
+	struct hl_info_hw_ip_info hw_ip = {0};
+	u32 size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u64 sram_kmd_size, dram_kmd_size;
+
+	if ((!size) || (!out))
+		return -EINVAL;
+
+	sram_kmd_size = (prop->sram_user_base_address -
+				prop->sram_base_address);
+	dram_kmd_size = (prop->dram_user_base_address -
+				prop->dram_base_address);
+
+	hw_ip.device_id = hdev->asic_funcs->get_pci_id(hdev);
+	hw_ip.sram_base_address = prop->sram_user_base_address;
+	hw_ip.dram_base_address = prop->dram_user_base_address;
+	hw_ip.tpc_enabled_mask = prop->tpc_enabled_mask;
+	hw_ip.sram_size = prop->sram_size - sram_kmd_size;
+	hw_ip.dram_size = prop->dram_size - dram_kmd_size;
+	if (hw_ip.dram_size > 0)
+		hw_ip.dram_enabled = 1;
+	hw_ip.num_of_events = prop->num_of_events;
+	memcpy(hw_ip.armcp_version,
+		prop->armcp_info.armcp_version, VERSION_MAX_LEN);
+	hw_ip.armcp_cpld_version = prop->armcp_info.cpld_version;
+	hw_ip.psoc_pci_pll_nr = prop->psoc_pci_pll_nr;
+	hw_ip.psoc_pci_pll_nf = prop->psoc_pci_pll_nf;
+	hw_ip.psoc_pci_pll_od = prop->psoc_pci_pll_od;
+	hw_ip.psoc_pci_pll_div_factor = prop->psoc_pci_pll_div_factor;
+
+	return copy_to_user(out, &hw_ip,
+		min((size_t)size, sizeof(hw_ip))) ? -EFAULT : 0;
+}
+
+static int hw_events_info(struct hl_device *hdev, struct hl_info_args *args)
+{
+	u32 size, max_size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	void *arr;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	arr = hdev->asic_funcs->get_events_stat(hdev, &size);
+
+	return copy_to_user(out, arr, min(max_size, size)) ? -EFAULT : 0;
+}
+
+static int dram_usage_info(struct hl_device *hdev, struct hl_info_args *args)
+{
+	struct hl_info_dram_usage dram_usage = {0};
+	u32 max_size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u64 dram_kmd_size;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	dram_kmd_size = (prop->dram_user_base_address -
+				prop->dram_base_address);
+	dram_usage.dram_free_mem = (prop->dram_size - dram_kmd_size) -
+					atomic64_read(&hdev->dram_used_mem);
+	dram_usage.ctx_dram_mem = atomic64_read(&hdev->user_ctx->dram_phys_mem);
+
+	return copy_to_user(out, &dram_usage,
+		min((size_t) max_size, sizeof(dram_usage))) ? -EFAULT : 0;
+}
+
+static int hw_idle(struct hl_device *hdev, struct hl_info_args *args)
+{
+	struct hl_info_hw_idle hw_idle = {0};
+	u32 max_size = args->return_size;
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	hw_idle.is_idle = hdev->asic_funcs->is_device_idle(hdev);
+
+	return copy_to_user(out, &hw_idle,
+		min((size_t) max_size, sizeof(hw_idle))) ? -EFAULT : 0;
+}
+
+static int hl_info_ioctl(struct hl_fpriv *hpriv, void *data)
+{
+	struct hl_info_args *args = data;
+	struct hl_device *hdev = hpriv->hdev;
+	int rc;
+
+	if (hl_device_disabled_or_in_reset(hdev)) {
+		dev_err(hdev->dev,
+			"Device is disabled or in reset. Can't execute INFO IOCTL\n");
+		return -EBUSY;
+	}
+
+	switch (args->op) {
+	case HL_INFO_HW_IP_INFO:
+		rc = hw_ip_info(hdev, args);
+		break;
+
+	case HL_INFO_HW_EVENTS:
+		rc = hw_events_info(hdev, args);
+		break;
+
+	case HL_INFO_DRAM_USAGE:
+		rc = dram_usage_info(hdev, args);
+		break;
+
+	case HL_INFO_HW_IDLE:
+		rc = hw_idle(hdev, args);
+		break;
+
+	default:
+		dev_err(hdev->dev, "Invalid request %d\n", args->op);
+		rc = -ENOTTY;
+		break;
+	}
+
+	return rc;
+}
+
 #define HL_IOCTL_DEF(ioctl, _func) \
 	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func}
 
 static const struct hl_ioctl_desc hl_ioctls[] = {
+	HL_IOCTL_DEF(HL_IOCTL_INFO, hl_info_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
 	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl),
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index 9015043887d1..4afc1891ece8 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -45,6 +45,62 @@ enum goya_queue_id {
 	GOYA_QUEUE_ID_SIZE
 };
 
+/* Opcode for management ioctl */
+#define HL_INFO_HW_IP_INFO	0
+#define HL_INFO_HW_EVENTS	1
+#define HL_INFO_DRAM_USAGE	2
+#define HL_INFO_HW_IDLE		3
+
+#define HL_INFO_VERSION_MAX_LEN	128
+
+struct hl_info_hw_ip_info {
+	__u64 sram_base_address;
+	__u64 dram_base_address;
+	__u64 dram_size;
+	__u32 sram_size;
+	__u32 num_of_events;
+	__u32 device_id; /* PCI Device ID */
+	__u32 reserved[3];
+	__u32 armcp_cpld_version;
+	__u32 psoc_pci_pll_nr;
+	__u32 psoc_pci_pll_nf;
+	__u32 psoc_pci_pll_od;
+	__u32 psoc_pci_pll_div_factor;
+	__u8 tpc_enabled_mask;
+	__u8 dram_enabled;
+	__u8 pad[2];
+	__u8 armcp_version[HL_INFO_VERSION_MAX_LEN];
+};
+
+struct hl_info_dram_usage {
+	__u64 dram_free_mem;
+	__u64 ctx_dram_mem;
+};
+
+struct hl_info_hw_idle {
+	__u32 is_idle;
+	__u32 pad;
+};
+
+struct hl_info_args {
+	/* Location of relevant struct in userspace */
+	__u64 return_pointer;
+	/*
+	 * The size of the return value. Just like "size" in "snprintf",
+	 * it limits how many bytes the kernel can write
+	 *
+	 * For hw_events array, the size should be
+	 * hl_info_hw_ip_info.num_of_events * sizeof(__u32)
+	 */
+	__u32 return_size;
+
+	/* HL_INFO_* */
+	__u32 op;
+
+	/* Context ID - Currently not in use */
+	__u32 ctx_id;
+	__u32 pad;
+};
 
 /* Opcode to create a new command buffer */
 #define HL_CB_OP_CREATE		0
@@ -264,6 +320,23 @@ union hl_mem_args {
 	struct hl_mem_out out;
 };
 
+/*
+ * Various information operations such as:
+ * - H/W IP information
+ * - Current dram usage
+ *
+ * The user calls this IOCTL with an opcode that describes the required
+ * information. The user should supply a pointer to a user-allocated memory
+ * chunk, which will be filled by the driver with the requested information.
+ *
+ * The user supplies the maximum amount of size to copy into the user's memory,
+ * in order to prevent data corruption in case of differences between the
+ * definitions of structures in kernel and userspace, e.g. in case of old
+ * userspace and new kernel driver
+ */
+#define HL_IOCTL_INFO	\
+		_IOWR('H', 0x01, struct hl_info_args)
+
 /*
  * Command Buffer
  * - Request a Command Buffer
@@ -365,7 +438,7 @@ union hl_mem_args {
 #define HL_IOCTL_MEMORY		\
 		_IOWR('H', 0x05, union hl_mem_args)
 
-#define HL_COMMAND_START	0x02
+#define HL_COMMAND_START	0x01
 #define HL_COMMAND_END		0x06
 
 #endif /* HABANALABS_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 14/15] habanalabs: add debugfs support
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (11 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 13/15] habanalabs: implement INFO IOCTL Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-11 15:17 ` [PATCH v4 15/15] Update MAINTAINERS and CREDITS with habanalabs info Oded Gabbay
  2019-02-14  7:11 ` [PATCH v4 00/15] Habana Labs kernel driver Greg KH
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

This patch adds debugfs support to the driver. It allows the user-space to
display information that is contained in the internal structures of the
driver, such as:
- active command submissions
- active user virtual memory mappings
- number of allocated command buffers

It also enables the user to perform reads and writes through Goya's PCI
bars.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
Changes in v4:
  - debugfs init shouldn't check for errors
  - Remove empty line at end of debugfs-driver-habanalabs
 
 .../ABI/testing/debugfs-driver-habanalabs     |  126 ++
 drivers/misc/habanalabs/Makefile              |    2 +
 drivers/misc/habanalabs/command_buffer.c      |    4 +
 drivers/misc/habanalabs/command_submission.c  |   12 +
 drivers/misc/habanalabs/debugfs.c             | 1064 +++++++++++++++++
 drivers/misc/habanalabs/device.c              |    6 +
 drivers/misc/habanalabs/goya/goya.c           |  108 ++
 drivers/misc/habanalabs/goya/goyaP.h          |    5 +
 drivers/misc/habanalabs/habanalabs.h          |  189 +++
 drivers/misc/habanalabs/habanalabs_drv.c      |   16 +-
 drivers/misc/habanalabs/memory.c              |    8 +
 11 files changed, 1538 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/debugfs-driver-habanalabs
 create mode 100644 drivers/misc/habanalabs/debugfs.c

diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
new file mode 100644
index 000000000000..2f5b80be07a3
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -0,0 +1,126 @@
+What:           /sys/kernel/debug/habanalabs/hl<n>/addr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the device address to be used for read or write through
+                PCI bar. The acceptable value is a string that starts with "0x"
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/command_buffers
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about the currently allocated
+                command buffers
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/command_submission
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about the currently active
+                command submissions
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/command_submission_jobs
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with detailed information about each JOB (CB) of
+                each active command submission
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/data32
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Allows the root user to read or write directly through the
+                device's PCI bar. Writing to this file generates a write
+                transaction while reading from the file generates a read
+                transcation. This custom interface is needed (instead of using
+                the generic Linux user-space PCI mapping) because the DDR bar
+                is very small compared to the DDR memory and only the driver can
+                move the bar before and after the transaction
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/device
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Enables the root user to set the device to specific state.
+                Valid values are "disable", "enable", "suspend", "resume".
+                User can read this property to see the valid values
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_addr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets I2C device address for I2C transaction that is generated
+                by the device's CPU
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_bus
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets I2C bus address for I2C transaction that is generated by
+                the device's CPU
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_data
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Triggers an I2C transaction that is generated by the device's
+                CPU. Writing to this file generates a write transaction while
+                reading from the file generates a read transcation
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/i2c_reg
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets I2C register id for I2C transaction that is generated by
+                the device's CPU
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/led0
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the state of the first S/W led on the device
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/led1
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the state of the second S/W led on the device
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/led2
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the state of the third S/W led on the device
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/mmu
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays the hop values and physical address for a given ASID
+                and virtual address. The user should write the ASID and VA into
+                the file and then read the file to get the result.
+                e.g. to display info about VA 0x1000 for ASID 1 you need to do:
+                echo "1 0x1000" > /sys/kernel/debug/habanalabs/hl0/mmu
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/set_power_state
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Sets the PCI power state. Valid values are "1" for D0 and "2"
+                for D3Hot
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/userptr
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about the currently user
+                pointers (user virtual addresses) that are pinned and mapped
+                to DMA addresses
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/vm
+Date:           Jan 2019
+KernelVersion:  5.1
+Contact:        oded.gabbay@gmail.com
+Description:    Displays a list with information about all the active virtual
+                address mappings per ASID
diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
index fd46f8b48bab..c6592db59b25 100644
--- a/drivers/misc/habanalabs/Makefile
+++ b/drivers/misc/habanalabs/Makefile
@@ -8,5 +8,7 @@ habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
 		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
 		command_submission.o mmu.o
 
+habanalabs-$(CONFIG_DEBUG_FS) += debugfs.o
+
 include $(src)/goya/Makefile
 habanalabs-y += $(HL_GOYA_FILES)
diff --git a/drivers/misc/habanalabs/command_buffer.c b/drivers/misc/habanalabs/command_buffer.c
index bec951d42c98..7ec9175e6023 100644
--- a/drivers/misc/habanalabs/command_buffer.c
+++ b/drivers/misc/habanalabs/command_buffer.c
@@ -36,6 +36,8 @@ static void cb_release(struct kref *ref)
 	cb = container_of(ref, struct hl_cb, refcount);
 	hdev = cb->hdev;
 
+	hl_debugfs_remove_cb(cb);
+
 	cb_do_release(hdev, cb);
 }
 
@@ -161,6 +163,8 @@ int hl_cb_create(struct hl_device *hdev, struct hl_cb_mgr *mgr,
 	*handle = cb->id | HL_MMAP_CB_MASK;
 	*handle <<= PAGE_SHIFT;
 
+	hl_debugfs_add_cb(cb);
+
 	return 0;
 
 release_cb:
diff --git a/drivers/misc/habanalabs/command_submission.c b/drivers/misc/habanalabs/command_submission.c
index 288dddc14873..e539ac0a4ca3 100644
--- a/drivers/misc/habanalabs/command_submission.c
+++ b/drivers/misc/habanalabs/command_submission.c
@@ -153,6 +153,8 @@ static void free_job(struct hl_device *hdev, struct hl_cs_job *job)
 	list_del(&job->cs_node);
 	spin_unlock(&cs->job_lock);
 
+	hl_debugfs_remove_job(hdev, job);
+
 	if (job->ext_queue)
 		cs_put(cs);
 
@@ -216,6 +218,12 @@ static void cs_do_release(struct kref *ref)
 		}
 	}
 
+	/*
+	 * Must be called before hl_ctx_put because inside we use ctx to get
+	 * the device
+	 */
+	hl_debugfs_remove_cs(cs);
+
 	hl_ctx_put(cs->ctx);
 
 	if (cs->timedout)
@@ -484,6 +492,8 @@ static int _hl_cs_ioctl(struct hl_fpriv *hpriv, void __user *chunks,
 
 	*cs_seq = cs->sequence;
 
+	hl_debugfs_add_cs(cs);
+
 	/* Validate ALL the CS chunks before submitting the CS */
 	for (i = 0, parse_cnt = 0 ; i < num_chunks ; i++, parse_cnt++) {
 		struct hl_cs_chunk *chunk = &cs_chunk_array[i];
@@ -532,6 +542,8 @@ static int _hl_cs_ioctl(struct hl_fpriv *hpriv, void __user *chunks,
 		if (job->ext_queue)
 			cs_get(cs);
 
+		hl_debugfs_add_job(hdev, job);
+
 		rc = cs_parser(hpriv, job);
 		if (rc) {
 			dev_err(hdev->dev,
diff --git a/drivers/misc/habanalabs/debugfs.c b/drivers/misc/habanalabs/debugfs.c
new file mode 100644
index 000000000000..6f9708a1c88e
--- /dev/null
+++ b/drivers/misc/habanalabs/debugfs.c
@@ -0,0 +1,1064 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright 2016-2018 HabanaLabs, Ltd.
+ * All Rights Reserved.
+ */
+
+#include "habanalabs.h"
+#include "include/hw_ip/mmu/mmu_general.h"
+
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+
+#define MMU_ADDR_BUF_SIZE	40
+#define MMU_ASID_BUF_SIZE	10
+#define MMU_KBUF_SIZE		(MMU_ADDR_BUF_SIZE + MMU_ASID_BUF_SIZE)
+
+static struct dentry *hl_debug_root;
+
+static int hl_debugfs_i2c_read(struct hl_device *hdev, u8 i2c_bus, u8 i2c_addr,
+				u8 i2c_reg, u32 *val)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	if (hl_device_disabled_or_in_reset(hdev))
+		return 0;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_I2C_RD << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.i2c_bus = i2c_bus;
+	pkt.i2c_addr = i2c_addr;
+	pkt.i2c_reg = i2c_reg;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					HL_DEVICE_TIMEOUT_USEC, (long *) val);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to read from I2C, error %d\n", rc);
+
+	return rc;
+}
+
+static int hl_debugfs_i2c_write(struct hl_device *hdev, u8 i2c_bus, u8 i2c_addr,
+				u8 i2c_reg, u32 val)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	if (hl_device_disabled_or_in_reset(hdev))
+		return 0;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_I2C_WR << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.i2c_bus = i2c_bus;
+	pkt.i2c_addr = i2c_addr;
+	pkt.i2c_reg = i2c_reg;
+	pkt.value = val;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+					HL_DEVICE_TIMEOUT_USEC, NULL);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to write to I2C, error %d\n", rc);
+
+	return rc;
+}
+
+static void hl_debugfs_led_set(struct hl_device *hdev, u8 led, u8 state)
+{
+	struct armcp_packet pkt;
+	int rc;
+
+	if (hl_device_disabled_or_in_reset(hdev))
+		return;
+
+	memset(&pkt, 0, sizeof(pkt));
+
+	pkt.ctl = ARMCP_PACKET_LED_SET << ARMCP_PKT_CTL_OPCODE_SHIFT;
+	pkt.led_index = led;
+	pkt.value = state;
+
+	rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) &pkt, sizeof(pkt),
+						HL_DEVICE_TIMEOUT_USEC, NULL);
+
+	if (rc)
+		dev_err(hdev->dev, "Failed to set LED %d, error %d\n", led, rc);
+}
+
+static int command_buffers_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_cb *cb;
+	bool first = true;
+
+	spin_lock(&dev_entry->cb_spinlock);
+
+	list_for_each_entry(cb, &dev_entry->cb_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " CB ID   CTX ID   CB size    CB RefCnt    mmap?   CS counter\n");
+			seq_puts(s, "---------------------------------------------------------------\n");
+		}
+		seq_printf(s,
+			"   %03d        %d    0x%08x      %d          %d          %d\n",
+			cb->id, cb->ctx_id, cb->size,
+			kref_read(&cb->refcount),
+			cb->mmap, cb->cs_cnt);
+	}
+
+	spin_unlock(&dev_entry->cb_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int command_submission_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_cs *cs;
+	bool first = true;
+
+	spin_lock(&dev_entry->cs_spinlock);
+
+	list_for_each_entry(cs, &dev_entry->cs_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " CS ID   CTX ASID   CS RefCnt   Submitted    Completed\n");
+			seq_puts(s, "------------------------------------------------------\n");
+		}
+		seq_printf(s,
+			"   %llu       %d          %d           %d            %d\n",
+			cs->sequence, cs->ctx->asid,
+			kref_read(&cs->refcount),
+			cs->submitted, cs->completed);
+	}
+
+	spin_unlock(&dev_entry->cs_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int command_submission_jobs_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_cs_job *job;
+	bool first = true;
+
+	spin_lock(&dev_entry->cs_job_spinlock);
+
+	list_for_each_entry(job, &dev_entry->cs_job_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " JOB ID   CS ID    CTX ASID   H/W Queue\n");
+			seq_puts(s, "---------------------------------------\n");
+		}
+		if (job->cs)
+			seq_printf(s,
+				"    %02d       %llu         %d         %d\n",
+				job->id, job->cs->sequence, job->cs->ctx->asid,
+				job->hw_queue_id);
+		else
+			seq_printf(s,
+				"    %02d       0         %d         %d\n",
+				job->id, HL_KERNEL_ASID_ID, job->hw_queue_id);
+	}
+
+	spin_unlock(&dev_entry->cs_job_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int userptr_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_userptr *userptr;
+	char dma_dir[4][30] = {"DMA_BIDIRECTIONAL", "DMA_TO_DEVICE",
+				"DMA_FROM_DEVICE", "DMA_NONE"};
+	bool first = true;
+
+	spin_lock(&dev_entry->userptr_spinlock);
+
+	list_for_each_entry(userptr, &dev_entry->userptr_list, debugfs_list) {
+		if (first) {
+			first = false;
+			seq_puts(s, "\n");
+			seq_puts(s, " user virtual address     size             dma dir\n");
+			seq_puts(s, "----------------------------------------------------------\n");
+		}
+		seq_printf(s,
+			"    0x%-14llx      %-10u    %-30s\n",
+			userptr->addr, userptr->size, dma_dir[userptr->dir]);
+	}
+
+	spin_unlock(&dev_entry->userptr_spinlock);
+
+	if (!first)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int vm_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_ctx *ctx;
+	struct hl_vm *vm;
+	struct hl_vm_hash_node *hnode;
+	struct hl_userptr *userptr;
+	struct hl_vm_phys_pg_pack *phys_pg_pack = NULL;
+	enum vm_type_t *vm_type;
+	bool once = true;
+	int i;
+
+	if (!dev_entry->hdev->mmu_enable)
+		return 0;
+
+	spin_lock(&dev_entry->ctx_mem_hash_spinlock);
+
+	list_for_each_entry(ctx, &dev_entry->ctx_mem_hash_list, debugfs_list) {
+		once = false;
+		seq_puts(s, "\n\n----------------------------------------------------");
+		seq_puts(s, "\n----------------------------------------------------\n\n");
+		seq_printf(s, "ctx asid: %u\n", ctx->asid);
+
+		seq_puts(s, "\nmappings:\n\n");
+		seq_puts(s, "    virtual address        size          handle\n");
+		seq_puts(s, "----------------------------------------------------\n");
+		mutex_lock(&ctx->mem_hash_lock);
+		hash_for_each(ctx->mem_hash, i, hnode, node) {
+			vm_type = hnode->ptr;
+
+			if (*vm_type == VM_TYPE_USERPTR) {
+				userptr = hnode->ptr;
+				seq_printf(s,
+					"    0x%-14llx      %-10u\n",
+					hnode->vaddr, userptr->size);
+			} else {
+				phys_pg_pack = hnode->ptr;
+				seq_printf(s,
+					"    0x%-14llx      %-10u       %-4u\n",
+					hnode->vaddr, phys_pg_pack->total_size,
+					phys_pg_pack->handle);
+			}
+		}
+		mutex_unlock(&ctx->mem_hash_lock);
+
+		vm = &ctx->hdev->vm;
+		spin_lock(&vm->idr_lock);
+
+		if (!idr_is_empty(&vm->phys_pg_pack_handles))
+			seq_puts(s, "\n\nallocations:\n");
+
+		idr_for_each_entry(&vm->phys_pg_pack_handles, phys_pg_pack, i) {
+			if (phys_pg_pack->asid != ctx->asid)
+				continue;
+
+			seq_printf(s, "\nhandle: %u\n", phys_pg_pack->handle);
+			seq_printf(s, "page size: %u\n\n",
+						phys_pg_pack->page_size);
+			seq_puts(s, "   physical address\n");
+			seq_puts(s, "---------------------\n");
+			for (i = 0 ; i < phys_pg_pack->npages ; i++) {
+				seq_printf(s, "    0x%-14llx\n",
+						phys_pg_pack->pages[i]);
+			}
+		}
+		spin_unlock(&vm->idr_lock);
+
+	}
+
+	spin_unlock(&dev_entry->ctx_mem_hash_spinlock);
+
+	if (!once)
+		seq_puts(s, "\n");
+
+	return 0;
+}
+
+/* these inline functions are copied from mmu.c */
+static inline u64 get_hop0_addr(struct hl_ctx *ctx)
+{
+	return ctx->hdev->asic_prop.mmu_pgt_addr +
+			(ctx->asid * ctx->hdev->asic_prop.mmu_hop_table_size);
+}
+
+static inline u64 get_hop0_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP0_MASK) >> HOP0_SHIFT);
+}
+
+static inline u64 get_hop1_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP1_MASK) >> HOP1_SHIFT);
+}
+
+static inline u64 get_hop2_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP2_MASK) >> HOP2_SHIFT);
+}
+
+static inline u64 get_hop3_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP3_MASK) >> HOP3_SHIFT);
+}
+
+static inline u64 get_hop4_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
+		u64 virt_addr)
+{
+	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
+			((virt_addr & HOP4_MASK) >> HOP4_SHIFT);
+}
+
+static inline u64 get_next_hop_addr(u64 curr_pte)
+{
+	if (curr_pte & PAGE_PRESENT_MASK)
+		return curr_pte & PHYS_ADDR_MASK;
+	else
+		return ULLONG_MAX;
+}
+
+static int mmu_show(struct seq_file *s, void *data)
+{
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_device *hdev = dev_entry->hdev;
+	struct hl_ctx *ctx = hdev->user_ctx;
+
+	u64 hop0_addr = 0, hop0_pte_addr = 0, hop0_pte = 0,
+		hop1_addr = 0, hop1_pte_addr = 0, hop1_pte = 0,
+		hop2_addr = 0, hop2_pte_addr = 0, hop2_pte = 0,
+		hop3_addr = 0, hop3_pte_addr = 0, hop3_pte = 0,
+		hop4_addr = 0, hop4_pte_addr = 0, hop4_pte = 0,
+		virt_addr = dev_entry->mmu_addr;
+
+	if (!hdev->mmu_enable)
+		return 0;
+
+	if (!ctx) {
+		dev_err(hdev->dev, "no ctx available\n");
+		return 0;
+	}
+
+	mutex_lock(&ctx->mmu_lock);
+
+	/* the following lookup is copied from unmap() in mmu.c */
+
+	hop0_addr = get_hop0_addr(ctx);
+	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
+	hop0_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
+	hop1_addr = get_next_hop_addr(hop0_pte);
+
+	if (hop1_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
+	hop1_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
+	hop2_addr = get_next_hop_addr(hop1_pte);
+
+	if (hop2_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
+	hop2_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
+	hop3_addr = get_next_hop_addr(hop2_pte);
+
+	if (hop3_addr == ULLONG_MAX)
+		goto not_mapped;
+
+	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
+	hop3_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
+
+	if (!(hop3_pte & LAST_MASK)) {
+		hop4_addr = get_next_hop_addr(hop3_pte);
+
+		if (hop4_addr == ULLONG_MAX)
+			goto not_mapped;
+
+		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
+		hop4_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
+		if (!(hop4_pte & PAGE_PRESENT_MASK))
+			goto not_mapped;
+	} else {
+		if (!(hop3_pte & PAGE_PRESENT_MASK))
+			goto not_mapped;
+	}
+
+	seq_printf(s, "asid: %u, virt_addr: 0x%llx\n",
+			dev_entry->mmu_asid, dev_entry->mmu_addr);
+
+	seq_printf(s, "hop0_addr: 0x%llx\n", hop0_addr);
+	seq_printf(s, "hop0_pte_addr: 0x%llx\n", hop0_pte_addr);
+	seq_printf(s, "hop0_pte: 0x%llx\n", hop0_pte);
+
+	seq_printf(s, "hop1_addr: 0x%llx\n", hop1_addr);
+	seq_printf(s, "hop1_pte_addr: 0x%llx\n", hop1_pte_addr);
+	seq_printf(s, "hop1_pte: 0x%llx\n", hop1_pte);
+
+	seq_printf(s, "hop2_addr: 0x%llx\n", hop2_addr);
+	seq_printf(s, "hop2_pte_addr: 0x%llx\n", hop2_pte_addr);
+	seq_printf(s, "hop2_pte: 0x%llx\n", hop2_pte);
+
+	seq_printf(s, "hop3_addr: 0x%llx\n", hop3_addr);
+	seq_printf(s, "hop3_pte_addr: 0x%llx\n", hop3_pte_addr);
+	seq_printf(s, "hop3_pte: 0x%llx\n", hop3_pte);
+
+	if (!(hop3_pte & LAST_MASK)) {
+		seq_printf(s, "hop4_addr: 0x%llx\n", hop4_addr);
+		seq_printf(s, "hop4_pte_addr: 0x%llx\n", hop4_pte_addr);
+		seq_printf(s, "hop4_pte: 0x%llx\n", hop4_pte);
+	}
+
+	goto out;
+
+not_mapped:
+	dev_err(hdev->dev, "virt addr 0x%llx is not mapped to phys addr\n",
+			virt_addr);
+out:
+	mutex_unlock(&ctx->mmu_lock);
+
+	return 0;
+}
+
+static ssize_t mmu_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *f_pos)
+{
+	struct seq_file *s = file->private_data;
+	struct hl_debugfs_entry *entry = s->private;
+	struct hl_dbg_device_entry *dev_entry = entry->dev_entry;
+	struct hl_device *hdev = dev_entry->hdev;
+	char kbuf[MMU_KBUF_SIZE], asid_kbuf[MMU_ASID_BUF_SIZE],
+		addr_kbuf[MMU_ADDR_BUF_SIZE];
+	char *c;
+	ssize_t rc;
+
+	if (!hdev->mmu_enable)
+		return count;
+
+	memset(kbuf, 0, sizeof(kbuf));
+	memset(asid_kbuf, 0, sizeof(asid_kbuf));
+	memset(addr_kbuf, 0, sizeof(addr_kbuf));
+
+	if (copy_from_user(kbuf, buf, count))
+		goto err;
+
+	kbuf[MMU_KBUF_SIZE - 1] = 0;
+
+	c = strchr(kbuf, ' ');
+	if (!c)
+		goto err;
+
+	memcpy(asid_kbuf, kbuf, c - kbuf);
+
+	rc = kstrtouint(asid_kbuf, 10, &dev_entry->mmu_asid);
+	if (rc)
+		goto err;
+
+	c = strstr(kbuf, " 0x");
+	if (!c)
+		goto err;
+
+	c += 3;
+	memcpy(addr_kbuf, c, (kbuf + count) - c);
+
+	rc = kstrtoull(addr_kbuf, 16, &dev_entry->mmu_addr);
+	if (rc)
+		goto err;
+
+	return count;
+
+err:
+	dev_err(hdev->dev, "usage: echo <asid> <0xaddr> > mmu\n");
+
+	return -EINVAL;
+}
+
+static ssize_t hl_data_read32(struct file *f, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	char tmp_buf[32];
+	u32 val;
+	ssize_t rc;
+
+	if (*ppos)
+		return 0;
+
+	rc = hdev->asic_funcs->debugfs_read32(hdev, entry->addr, &val);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to read from 0x%010llx\n",
+			entry->addr);
+		return rc;
+	}
+
+	sprintf(tmp_buf, "0x%08x\n", val);
+	rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_data_write32(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 16, &value);
+	if (rc)
+		return rc;
+
+	rc = hdev->asic_funcs->debugfs_write32(hdev, entry->addr, value);
+	if (rc) {
+		dev_err(hdev->dev, "Failed to write 0x%08x to 0x%010llx\n",
+			value, entry->addr);
+		return rc;
+	}
+
+	return count;
+}
+
+static ssize_t hl_get_power_state(struct file *f, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	char tmp_buf[200];
+	ssize_t rc;
+	int i;
+
+	if (*ppos)
+		return 0;
+
+	if (hdev->pdev->current_state == PCI_D0)
+		i = 1;
+	else if (hdev->pdev->current_state == PCI_D3hot)
+		i = 2;
+	else
+		i = 3;
+
+	sprintf(tmp_buf,
+		"current power state: %d\n1 - D0\n2 - D3hot\n3 - Unknown\n", i);
+	rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_set_power_state(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	if (value == 1) {
+		pci_set_power_state(hdev->pdev, PCI_D0);
+		pci_restore_state(hdev->pdev);
+		rc = pci_enable_device(hdev->pdev);
+	} else if (value == 2) {
+		pci_save_state(hdev->pdev);
+		pci_disable_device(hdev->pdev);
+		pci_set_power_state(hdev->pdev, PCI_D3hot);
+	} else {
+		dev_dbg(hdev->dev, "invalid power state value %u\n", value);
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static ssize_t hl_i2c_data_read(struct file *f, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	char tmp_buf[32];
+	u32 val;
+	ssize_t rc;
+
+	if (*ppos)
+		return 0;
+
+	rc = hl_debugfs_i2c_read(hdev, entry->i2c_bus, entry->i2c_addr,
+			entry->i2c_reg, &val);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to read from I2C bus %d, addr %d, reg %d\n",
+			entry->i2c_bus, entry->i2c_addr, entry->i2c_reg);
+		return rc;
+	}
+
+	sprintf(tmp_buf, "0x%02x\n", val);
+	rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_i2c_data_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 16, &value);
+	if (rc)
+		return rc;
+
+	rc = hl_debugfs_i2c_write(hdev, entry->i2c_bus, entry->i2c_addr,
+			entry->i2c_reg, value);
+	if (rc) {
+		dev_err(hdev->dev,
+			"Failed to write 0x%02x to I2C bus %d, addr %d, reg %d\n",
+			value, entry->i2c_bus, entry->i2c_addr, entry->i2c_reg);
+		return rc;
+	}
+
+	return count;
+}
+
+static ssize_t hl_led0_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	value = value ? 1 : 0;
+
+	hl_debugfs_led_set(hdev, 0, value);
+
+	return count;
+}
+
+static ssize_t hl_led1_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	value = value ? 1 : 0;
+
+	hl_debugfs_led_set(hdev, 1, value);
+
+	return count;
+}
+
+static ssize_t hl_led2_write(struct file *f, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+	u32 value;
+	ssize_t rc;
+
+	rc = kstrtouint_from_user(buf, count, 10, &value);
+	if (rc)
+		return rc;
+
+	value = value ? 1 : 0;
+
+	hl_debugfs_led_set(hdev, 2, value);
+
+	return count;
+}
+
+static ssize_t hl_device_read(struct file *f, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	char tmp_buf[200];
+	ssize_t rc;
+
+	if (*ppos)
+		return 0;
+
+	sprintf(tmp_buf,
+		"Valid values are: disable, enable, suspend, resume\n");
+	rc = simple_read_from_buffer(buf, strlen(tmp_buf) + 1, ppos, tmp_buf,
+			strlen(tmp_buf) + 1);
+
+	return rc;
+}
+
+static ssize_t hl_device_write(struct file *f, const char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct hl_dbg_device_entry *entry = file_inode(f)->i_private;
+	struct hl_device *hdev = entry->hdev;
+
+	if (strncmp("disable", buf, strlen("disable")) == 0) {
+		hdev->disabled = true;
+	} else if (strncmp("enable", buf, strlen("enable")) == 0) {
+		hdev->disabled = false;
+	} else if (strncmp("suspend", buf, strlen("suspend")) == 0) {
+		hdev->asic_funcs->suspend(hdev);
+	} else if (strncmp("resume", buf, strlen("resume")) == 0) {
+		hdev->asic_funcs->resume(hdev);
+	} else {
+		dev_err(hdev->dev,
+			"Valid values are: disable, enable, suspend, resume\n");
+		count = -EINVAL;
+	}
+
+	return count;
+}
+
+static const struct file_operations hl_data32b_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_data_read32,
+	.write = hl_data_write32
+};
+
+static const struct file_operations hl_i2c_data_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_i2c_data_read,
+	.write = hl_i2c_data_write
+};
+
+static const struct file_operations hl_power_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_get_power_state,
+	.write = hl_set_power_state
+};
+
+static const struct file_operations hl_led0_fops = {
+	.owner = THIS_MODULE,
+	.write = hl_led0_write
+};
+
+static const struct file_operations hl_led1_fops = {
+	.owner = THIS_MODULE,
+	.write = hl_led1_write
+};
+
+static const struct file_operations hl_led2_fops = {
+	.owner = THIS_MODULE,
+	.write = hl_led2_write
+};
+
+static const struct file_operations hl_device_fops = {
+	.owner = THIS_MODULE,
+	.read = hl_device_read,
+	.write = hl_device_write
+};
+
+static const struct hl_info_list hl_debugfs_list[] = {
+	{"command_buffers", command_buffers_show, NULL},
+	{"command_submission", command_submission_show, NULL},
+	{"command_submission_jobs", command_submission_jobs_show, NULL},
+	{"userptr", userptr_show, NULL},
+	{"vm", vm_show, NULL},
+	{"mmu", mmu_show, mmu_write},
+};
+
+static int hl_debugfs_open(struct inode *inode, struct file *file)
+{
+	struct hl_debugfs_entry *node = inode->i_private;
+
+	return single_open(file, node->info_ent->show, node);
+}
+
+static ssize_t hl_debugfs_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *f_pos)
+{
+	struct hl_debugfs_entry *node = file->f_inode->i_private;
+
+	if (node->info_ent->write)
+		return node->info_ent->write(file, buf, count, f_pos);
+	else
+		return -EINVAL;
+
+}
+
+static const struct file_operations hl_debugfs_fops = {
+	.owner = THIS_MODULE,
+	.open = hl_debugfs_open,
+	.read = seq_read,
+	.write = hl_debugfs_write,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+void hl_debugfs_add_device(struct hl_device *hdev)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+	int count = ARRAY_SIZE(hl_debugfs_list);
+	struct hl_debugfs_entry *entry;
+	struct dentry *ent;
+	int i;
+
+	dev_entry->hdev = hdev;
+	dev_entry->entry_arr = kmalloc_array(count,
+					sizeof(struct hl_debugfs_entry),
+					GFP_KERNEL);
+	if (!dev_entry->entry_arr)
+		return;
+
+	INIT_LIST_HEAD(&dev_entry->file_list);
+	INIT_LIST_HEAD(&dev_entry->cb_list);
+	INIT_LIST_HEAD(&dev_entry->cs_list);
+	INIT_LIST_HEAD(&dev_entry->cs_job_list);
+	INIT_LIST_HEAD(&dev_entry->userptr_list);
+	INIT_LIST_HEAD(&dev_entry->ctx_mem_hash_list);
+	mutex_init(&dev_entry->file_mutex);
+	spin_lock_init(&dev_entry->cb_spinlock);
+	spin_lock_init(&dev_entry->cs_spinlock);
+	spin_lock_init(&dev_entry->cs_job_spinlock);
+	spin_lock_init(&dev_entry->userptr_spinlock);
+	spin_lock_init(&dev_entry->ctx_mem_hash_spinlock);
+
+	dev_entry->root = debugfs_create_dir(dev_name(hdev->dev),
+						hl_debug_root);
+
+	debugfs_create_x64("addr",
+				0644,
+				dev_entry->root,
+				&dev_entry->addr);
+
+	debugfs_create_file("data32",
+				0644,
+				dev_entry->root,
+				dev_entry,
+				&hl_data32b_fops);
+
+	debugfs_create_file("set_power_state",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_power_fops);
+
+	debugfs_create_u8("i2c_bus",
+				0644,
+				dev_entry->root,
+				&dev_entry->i2c_bus);
+
+	debugfs_create_u8("i2c_addr",
+				0644,
+				dev_entry->root,
+				&dev_entry->i2c_addr);
+
+	debugfs_create_u8("i2c_reg",
+				0644,
+				dev_entry->root,
+				&dev_entry->i2c_reg);
+
+	debugfs_create_file("i2c_data",
+				0644,
+				dev_entry->root,
+				dev_entry,
+				&hl_i2c_data_fops);
+
+	debugfs_create_file("led0",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_led0_fops);
+
+	debugfs_create_file("led1",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_led1_fops);
+
+	debugfs_create_file("led2",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_led2_fops);
+
+	debugfs_create_file("device",
+				0200,
+				dev_entry->root,
+				dev_entry,
+				&hl_device_fops);
+
+	for (i = 0, entry = dev_entry->entry_arr ; i < count ; i++, entry++) {
+
+		ent = debugfs_create_file(hl_debugfs_list[i].name,
+					0444,
+					dev_entry->root,
+					entry,
+					&hl_debugfs_fops);
+		entry->dent = ent;
+		entry->info_ent = &hl_debugfs_list[i];
+		entry->dev_entry = dev_entry;
+	}
+}
+
+void hl_debugfs_remove_device(struct hl_device *hdev)
+{
+	struct hl_dbg_device_entry *entry = &hdev->hl_debugfs;
+
+	debugfs_remove_recursive(entry->root);
+
+	mutex_destroy(&entry->file_mutex);
+	kfree(entry->entry_arr);
+}
+
+void hl_debugfs_add_file(struct hl_fpriv *hpriv)
+{
+	struct hl_dbg_device_entry *dev_entry = &hpriv->hdev->hl_debugfs;
+
+	mutex_lock(&dev_entry->file_mutex);
+	list_add(&hpriv->debugfs_list, &dev_entry->file_list);
+	mutex_unlock(&dev_entry->file_mutex);
+}
+
+void hl_debugfs_remove_file(struct hl_fpriv *hpriv)
+{
+	struct hl_dbg_device_entry *dev_entry = &hpriv->hdev->hl_debugfs;
+
+	mutex_lock(&dev_entry->file_mutex);
+	list_del(&hpriv->debugfs_list);
+	mutex_unlock(&dev_entry->file_mutex);
+}
+
+void hl_debugfs_add_cb(struct hl_cb *cb)
+{
+	struct hl_dbg_device_entry *dev_entry = &cb->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cb_spinlock);
+	list_add(&cb->debugfs_list, &dev_entry->cb_list);
+	spin_unlock(&dev_entry->cb_spinlock);
+}
+
+void hl_debugfs_remove_cb(struct hl_cb *cb)
+{
+	struct hl_dbg_device_entry *dev_entry = &cb->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cb_spinlock);
+	list_del(&cb->debugfs_list);
+	spin_unlock(&dev_entry->cb_spinlock);
+}
+
+void hl_debugfs_add_cs(struct hl_cs *cs)
+{
+	struct hl_dbg_device_entry *dev_entry = &cs->ctx->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_spinlock);
+	list_add(&cs->debugfs_list, &dev_entry->cs_list);
+	spin_unlock(&dev_entry->cs_spinlock);
+}
+
+void hl_debugfs_remove_cs(struct hl_cs *cs)
+{
+	struct hl_dbg_device_entry *dev_entry = &cs->ctx->hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_spinlock);
+	list_del(&cs->debugfs_list);
+	spin_unlock(&dev_entry->cs_spinlock);
+}
+
+void hl_debugfs_add_job(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_job_spinlock);
+	list_add(&job->debugfs_list, &dev_entry->cs_job_list);
+	spin_unlock(&dev_entry->cs_job_spinlock);
+}
+
+void hl_debugfs_remove_job(struct hl_device *hdev, struct hl_cs_job *job)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->cs_job_spinlock);
+	list_del(&job->debugfs_list);
+	spin_unlock(&dev_entry->cs_job_spinlock);
+}
+
+void hl_debugfs_add_userptr(struct hl_device *hdev, struct hl_userptr *userptr)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->userptr_spinlock);
+	list_add(&userptr->debugfs_list, &dev_entry->userptr_list);
+	spin_unlock(&dev_entry->userptr_spinlock);
+}
+
+void hl_debugfs_remove_userptr(struct hl_device *hdev,
+				struct hl_userptr *userptr)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->userptr_spinlock);
+	list_del(&userptr->debugfs_list);
+	spin_unlock(&dev_entry->userptr_spinlock);
+}
+
+void hl_debugfs_add_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->ctx_mem_hash_spinlock);
+	list_add(&ctx->debugfs_list, &dev_entry->ctx_mem_hash_list);
+	spin_unlock(&dev_entry->ctx_mem_hash_spinlock);
+}
+
+void hl_debugfs_remove_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	spin_lock(&dev_entry->ctx_mem_hash_spinlock);
+	list_del(&ctx->debugfs_list);
+	spin_unlock(&dev_entry->ctx_mem_hash_spinlock);
+}
+
+void __init hl_debugfs_init(void)
+{
+	hl_debug_root = debugfs_create_dir("habanalabs", NULL);
+}
+
+void hl_debugfs_fini(void)
+{
+	debugfs_remove_recursive(hl_debug_root);
+}
diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
index 5c850d3574eb..b74f734947ac 100644
--- a/drivers/misc/habanalabs/device.c
+++ b/drivers/misc/habanalabs/device.c
@@ -30,6 +30,8 @@ static void hpriv_release(struct kref *ref)
 
 	put_pid(hpriv->taskpid);
 
+	hl_debugfs_remove_file(hpriv);
+
 	mutex_destroy(&hpriv->restore_phase_mutex);
 
 	kfree(hpriv);
@@ -823,6 +825,8 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
 		goto free_cb_pool;
 	}
 
+	hl_debugfs_add_device(hdev);
+
 	if (hdev->asic_funcs->get_hw_state(hdev) == HL_DEVICE_HW_STATE_DIRTY) {
 		dev_info(hdev->dev,
 			"H/W state is dirty, must reset before initializing\n");
@@ -956,6 +960,8 @@ void hl_device_fini(struct hl_device *hdev)
 
 	device_late_fini(hdev);
 
+	hl_debugfs_remove_device(hdev);
+
 	hl_sysfs_fini(hdev);
 
 	/*
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index 7cce308243ad..5a41b821dd17 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -4368,6 +4368,8 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
 	job->user_cb_size = cb_size;
 	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
 
+	hl_debugfs_add_job(hdev, job);
+
 	parser.ctx_id = HL_KERNEL_ASID_ID;
 	parser.cs_sequence = 0;
 	parser.job_id = job->id;
@@ -4400,6 +4402,7 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
 
 free_job:
 	hl_userptr_delete_list(hdev, &job->userptr_list);
+	hl_debugfs_remove_job(hdev, job);
 	kfree(job);
 	cb->cs_cnt--;
 
@@ -4430,6 +4433,106 @@ void goya_restore_phase_topology(struct hl_device *hdev)
 	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
 }
 
+/*
+ * goya_debugfs_read32 - read a 32bit value from a given device address
+ *
+ * @hdev:	pointer to hl_device structure
+ * @addr:	address in device
+ * @val:	returned value
+ *
+ * In case of DDR address that is not mapped into the default aperture that
+ * the DDR bar exposes, the function will configure the iATU so that the DDR
+ * bar will be positioned at a base address that allows reading from the
+ * required address. Configuring the iATU during normal operation can
+ * lead to undefined behavior and therefore, should be done with extreme care
+ *
+ */
+int goya_debugfs_read32(struct hl_device *hdev, u64 addr, u32 *val)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int rc = 0;
+
+	if ((addr >= CFG_BASE) && (addr < CFG_BASE + CFG_SIZE)) {
+		*val = RREG32(addr - CFG_BASE);
+
+	} else if ((addr >= SRAM_BASE_ADDR) &&
+			(addr < SRAM_BASE_ADDR + SRAM_SIZE)) {
+
+		*val = readl(hdev->pcie_bar[SRAM_CFG_BAR_ID] +
+				(addr - SRAM_BASE_ADDR));
+
+	} else if ((addr >= DRAM_PHYS_BASE) &&
+			(addr < DRAM_PHYS_BASE + hdev->asic_prop.dram_size)) {
+
+		u64 bar_base_addr = DRAM_PHYS_BASE +
+				(addr & ~(prop->dram_pci_bar_size - 0x1ull));
+
+		rc = goya_set_ddr_bar_base(hdev, bar_base_addr);
+		if (!rc) {
+			*val = readl(hdev->pcie_bar[DDR_BAR_ID] +
+						(addr - bar_base_addr));
+
+			rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
+				(MMU_PAGE_TABLES_ADDR &
+					~(prop->dram_pci_bar_size - 0x1ull)));
+		}
+	} else {
+		rc = -EFAULT;
+	}
+
+	return rc;
+}
+
+/*
+ * goya_debugfs_write32 - write a 32bit value to a given device address
+ *
+ * @hdev:	pointer to hl_device structure
+ * @addr:	address in device
+ * @val:	returned value
+ *
+ * In case of DDR address that is not mapped into the default aperture that
+ * the DDR bar exposes, the function will configure the iATU so that the DDR
+ * bar will be positioned at a base address that allows writing to the
+ * required address. Configuring the iATU during normal operation can
+ * lead to undefined behavior and therefore, should be done with extreme care
+ *
+ */
+int goya_debugfs_write32(struct hl_device *hdev, u64 addr, u32 val)
+{
+	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	int rc = 0;
+
+	if ((addr >= CFG_BASE) && (addr < CFG_BASE + CFG_SIZE)) {
+		WREG32(addr - CFG_BASE, val);
+
+	} else if ((addr >= SRAM_BASE_ADDR) &&
+			(addr < SRAM_BASE_ADDR + SRAM_SIZE)) {
+
+		writel(val, hdev->pcie_bar[SRAM_CFG_BAR_ID] +
+					(addr - SRAM_BASE_ADDR));
+
+	} else if ((addr >= DRAM_PHYS_BASE) &&
+			(addr < DRAM_PHYS_BASE + hdev->asic_prop.dram_size)) {
+
+		u64 bar_base_addr = DRAM_PHYS_BASE +
+				(addr & ~(prop->dram_pci_bar_size - 0x1ull));
+
+		rc = goya_set_ddr_bar_base(hdev, bar_base_addr);
+		if (!rc) {
+			writel(val, hdev->pcie_bar[DDR_BAR_ID] +
+						(addr - bar_base_addr));
+
+			rc = goya_set_ddr_bar_base(hdev, DRAM_PHYS_BASE +
+				(MMU_PAGE_TABLES_ADDR &
+					~(prop->dram_pci_bar_size - 0x1ull)));
+		}
+	} else {
+		rc = -EFAULT;
+	}
+
+	return rc;
+}
+
 static u64 goya_read_pte(struct hl_device *hdev, u64 addr)
 {
 	struct goya_device *goya = hdev->asic_specific;
@@ -4776,6 +4879,8 @@ static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
 	job->user_cb_size = cb_size;
 	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
 
+	hl_debugfs_add_job(hdev, job);
+
 	parser.ctx_id = HL_KERNEL_ASID_ID;
 	parser.cs_sequence = 0;
 	parser.job_id = job->id;
@@ -4804,6 +4909,7 @@ static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
 
 free_job:
 	hl_userptr_delete_list(hdev, &job->userptr_list);
+	hl_debugfs_remove_job(hdev, job);
 	kfree(job);
 	cb->cs_cnt--;
 
@@ -5218,6 +5324,8 @@ static const struct hl_asic_funcs goya_funcs = {
 	.update_eq_ci = goya_update_eq_ci,
 	.context_switch = goya_context_switch,
 	.restore_phase_topology = goya_restore_phase_topology,
+	.debugfs_read32 = goya_debugfs_read32,
+	.debugfs_write32 = goya_debugfs_write32,
 	.add_device_attr = goya_add_device_attr,
 	.handle_eqe = goya_handle_eqe,
 	.set_pll_profile = goya_set_pll_profile,
diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h
index ba8596b07403..bb78fa7e5e04 100644
--- a/drivers/misc/habanalabs/goya/goyaP.h
+++ b/drivers/misc/habanalabs/goya/goyaP.h
@@ -168,6 +168,10 @@ struct goya_device {
 	u32		hw_cap_initialized;
 };
 
+int goya_debugfs_i2c_read(struct hl_device *hdev, u8 i2c_bus,
+			u8 i2c_addr, u8 i2c_reg, u32 *val);
+int goya_debugfs_i2c_write(struct hl_device *hdev, u8 i2c_bus,
+			u8 i2c_addr, u8 i2c_reg, u32 val);
 int goya_test_cpu_queue(struct hl_device *hdev);
 int goya_send_cpu_message(struct hl_device *hdev, u32 *msg, u16 len,
 				u32 timeout, long *result);
@@ -178,6 +182,7 @@ long goya_get_fan_speed(struct hl_device *hdev, int sensor_index, u32 attr);
 long goya_get_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr);
 void goya_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
 			long value);
+void goya_debugfs_led_set(struct hl_device *hdev, u8 led, u8 state);
 void goya_set_pll_profile(struct hl_device *hdev, enum hl_pll_frequency freq);
 void goya_add_device_attr(struct hl_device *hdev,
 			struct attribute_group *dev_attr_grp);
diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 759510c96034..23a291d36f95 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -239,6 +239,7 @@ struct hl_cb_mgr {
  * @refcount: reference counter for usage of the CB.
  * @hdev: pointer to device this CB belongs to.
  * @lock: spinlock to protect mmap/cs flows.
+ * @debugfs_list: node in debugfs list of command buffers.
  * @pool_list: node in pool list of command buffers.
  * @kernel_address: Holds the CB's kernel virtual address.
  * @bus_address: Holds the CB's DMA address.
@@ -254,6 +255,7 @@ struct hl_cb {
 	struct kref		refcount;
 	struct hl_device	*hdev;
 	spinlock_t		lock;
+	struct list_head	debugfs_list;
 	struct list_head	pool_list;
 	u64			kernel_address;
 	dma_addr_t		bus_address;
@@ -454,6 +456,8 @@ enum hl_pll_frequency {
  * @update_eq_ci: update event queue CI.
  * @context_switch: called upon ASID context switch.
  * @restore_phase_topology: clear all SOBs amd MONs.
+ * @debugfs_read32: debug interface for reading u32 from DRAM/SRAM.
+ * @debugfs_write32: debug interface for writing u32 to DRAM/SRAM.
  * @add_device_attr: add ASIC specific device attributes.
  * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
  * @set_pll_profile: change PLL profile (manual/automatic).
@@ -522,6 +526,8 @@ struct hl_asic_funcs {
 	void (*update_eq_ci)(struct hl_device *hdev, u32 val);
 	int (*context_switch)(struct hl_device *hdev, u32 asid);
 	void (*restore_phase_topology)(struct hl_device *hdev);
+	int (*debugfs_read32)(struct hl_device *hdev, u64 addr, u32 *val);
+	int (*debugfs_write32)(struct hl_device *hdev, u64 addr, u32 val);
 	void (*add_device_attr)(struct hl_device *hdev,
 				struct attribute_group *dev_attr_grp);
 	void (*handle_eqe)(struct hl_device *hdev,
@@ -585,6 +591,7 @@ struct hl_va_range {
  * @mem_hash_lock: protects the mem_hash.
  * @mmu_lock: protects the MMU page tables. Any change to the PGT, modifing the
  *            MMU hash or walking the PGT requires talking this lock
+ * @debugfs_list: node in debugfs list of contexts.
  * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
  *			to user so user could inquire about CS. It is used as
  *			index to cs_pending array.
@@ -609,6 +616,7 @@ struct hl_ctx {
 	struct hl_va_range	dram_va_range;
 	struct mutex		mem_hash_lock;
 	struct mutex		mmu_lock;
+	struct list_head	debugfs_list;
 	u64			cs_sequence;
 	spinlock_t		cs_lock;
 	atomic64_t		dram_phys_mem;
@@ -667,6 +675,7 @@ struct hl_userptr {
  * @fence: pointer to the fence object of this CS.
  * @work_tdr: delayed work node for TDR.
  * @mirror_node : node in device mirror list of command submissions.
+ * @debugfs_list: node in debugfs list of command submissions.
  * @sequence: the sequence number of this CS.
  * @submitted: true if CS was submitted to H/W.
  * @completed: true if CS was completed by device.
@@ -684,6 +693,7 @@ struct hl_cs {
 	struct dma_fence	*fence;
 	struct delayed_work	work_tdr;
 	struct list_head	mirror_node;
+	struct list_head	debugfs_list;
 	u64			sequence;
 	u8			submitted;
 	u8			completed;
@@ -702,6 +712,7 @@ struct hl_cs {
  * @finish_work: workqueue object to run when job is completed.
  * @userptr_list: linked-list of userptr mappings that belong to this job and
  *			wait for completion.
+ * @debugfs_list: node in debugfs list of command submission jobs.
  * @id: the id of this job inside a CS.
  * @hw_queue_id: the id of the H/W queue this job is submitted to.
  * @user_cb_size: the actual size of the CB we got from the user.
@@ -715,6 +726,7 @@ struct hl_cs_job {
 	struct hl_cb		*patched_cb;
 	struct work_struct	finish_work;
 	struct list_head	userptr_list;
+	struct list_head	debugfs_list;
 	u32			id;
 	u32			hw_queue_id;
 	u32			user_cb_size;
@@ -845,6 +857,7 @@ struct hl_vm {
  * @ctx: current executing context.
  * @ctx_mgr: context manager to handle multiple context for this FD.
  * @cb_mgr: command buffer manager to handle multiple buffers for this FD.
+ * @debugfs_list: list of relevant ASIC debugfs.
  * @refcount: number of related contexts.
  * @restore_phase_mutex: lock for context switch and restore phase.
  */
@@ -855,11 +868,90 @@ struct hl_fpriv {
 	struct hl_ctx		*ctx; /* TODO: remove for multiple ctx */
 	struct hl_ctx_mgr	ctx_mgr;
 	struct hl_cb_mgr	cb_mgr;
+	struct list_head	debugfs_list;
 	struct kref		refcount;
 	struct mutex		restore_phase_mutex;
 };
 
 
+/*
+ * DebugFS
+ */
+
+/**
+ * struct hl_info_list - debugfs file ops.
+ * @name: file name.
+ * @show: function to output information.
+ * @write: function to write to the file.
+ */
+struct hl_info_list {
+	const char	*name;
+	int		(*show)(struct seq_file *s, void *data);
+	ssize_t		(*write)(struct file *file, const char __user *buf,
+				size_t count, loff_t *f_pos);
+};
+
+/**
+ * struct hl_debugfs_entry - debugfs dentry wrapper.
+ * @dent: base debugfs entry structure.
+ * @info_ent: dentry realted ops.
+ * @dev_entry: ASIC specific debugfs manager.
+ */
+struct hl_debugfs_entry {
+	struct dentry			*dent;
+	const struct hl_info_list	*info_ent;
+	struct hl_dbg_device_entry	*dev_entry;
+};
+
+/**
+ * struct hl_dbg_device_entry - ASIC specific debugfs manager.
+ * @root: root dentry.
+ * @hdev: habanalabs device structure.
+ * @entry_arr: array of available hl_debugfs_entry.
+ * @file_list: list of available debugfs files.
+ * @file_mutex: protects file_list.
+ * @cb_list: list of available CBs.
+ * @cb_spinlock: protects cb_list.
+ * @cs_list: list of available CSs.
+ * @cs_spinlock: protects cs_list.
+ * @cs_job_list: list of available CB jobs.
+ * @cs_job_spinlock: protects cs_job_list.
+ * @userptr_list: list of available userptrs (virtual memory chunk descriptor).
+ * @userptr_spinlock: protects userptr_list.
+ * @ctx_mem_hash_list: list of available contexts with MMU mappings.
+ * @ctx_mem_hash_spinlock: protects cb_list.
+ * @addr: next address to read/write from/to in read/write32.
+ * @mmu_addr: next virtual address to translate to physical address in mmu_show.
+ * @mmu_asid: ASID to use while translating in mmu_show.
+ * @i2c_bus: generic u8 debugfs file for bus value to use in i2c_data_read.
+ * @i2c_bus: generic u8 debugfs file for address value to use in i2c_data_read.
+ * @i2c_bus: generic u8 debugfs file for register value to use in i2c_data_read.
+ */
+struct hl_dbg_device_entry {
+	struct dentry			*root;
+	struct hl_device		*hdev;
+	struct hl_debugfs_entry		*entry_arr;
+	struct list_head		file_list;
+	struct mutex			file_mutex;
+	struct list_head		cb_list;
+	spinlock_t			cb_spinlock;
+	struct list_head		cs_list;
+	spinlock_t			cs_spinlock;
+	struct list_head		cs_job_list;
+	spinlock_t			cs_job_spinlock;
+	struct list_head		userptr_list;
+	spinlock_t			userptr_spinlock;
+	struct list_head		ctx_mem_hash_list;
+	spinlock_t			ctx_mem_hash_spinlock;
+	u64				addr;
+	u64				mmu_addr;
+	u32				mmu_asid;
+	u8				i2c_bus;
+	u8				i2c_addr;
+	u8				i2c_reg;
+};
+
+
 /*
  * DEVICES
  */
@@ -952,6 +1044,7 @@ struct hl_device_reset_work {
  * @hwmon_dev: H/W monitor device.
  * @pm_mng_profile: current power management profile.
  * @hl_chip_info: ASIC's sensors information.
+ * @hl_debugfs: device's debugfs manager.
  * @cb_pool: list of preallocated CBs.
  * @cb_pool_lock: protects the CB pool.
  * @user_ctx: current user context executing.
@@ -1017,6 +1110,8 @@ struct hl_device {
 	enum hl_pm_mng_profile		pm_mng_profile;
 	struct hwmon_chip_info		hl_chip_info;
 
+	struct hl_dbg_device_entry	hl_debugfs;
+
 	struct list_head		cb_pool;
 	spinlock_t			cb_pool_lock;
 
@@ -1254,6 +1349,100 @@ void hl_set_pwm_info(struct hl_device *hdev, int sensor_index, u32 attr,
 u64 hl_get_max_power(struct hl_device *hdev);
 void hl_set_max_power(struct hl_device *hdev, u64 value);
 
+#ifdef CONFIG_DEBUG_FS
+
+void hl_debugfs_init(void);
+void hl_debugfs_fini(void);
+void hl_debugfs_add_device(struct hl_device *hdev);
+void hl_debugfs_remove_device(struct hl_device *hdev);
+void hl_debugfs_add_file(struct hl_fpriv *hpriv);
+void hl_debugfs_remove_file(struct hl_fpriv *hpriv);
+void hl_debugfs_add_cb(struct hl_cb *cb);
+void hl_debugfs_remove_cb(struct hl_cb *cb);
+void hl_debugfs_add_cs(struct hl_cs *cs);
+void hl_debugfs_remove_cs(struct hl_cs *cs);
+void hl_debugfs_add_job(struct hl_device *hdev, struct hl_cs_job *job);
+void hl_debugfs_remove_job(struct hl_device *hdev, struct hl_cs_job *job);
+void hl_debugfs_add_userptr(struct hl_device *hdev, struct hl_userptr *userptr);
+void hl_debugfs_remove_userptr(struct hl_device *hdev,
+				struct hl_userptr *userptr);
+void hl_debugfs_add_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx);
+void hl_debugfs_remove_ctx_mem_hash(struct hl_device *hdev, struct hl_ctx *ctx);
+
+#else
+
+static inline void __init hl_debugfs_init(void)
+{
+}
+
+static inline void hl_debugfs_fini(void)
+{
+}
+
+static inline void hl_debugfs_add_device(struct hl_device *hdev)
+{
+}
+
+static inline void hl_debugfs_remove_device(struct hl_device *hdev)
+{
+}
+
+static inline void hl_debugfs_add_file(struct hl_fpriv *hpriv)
+{
+}
+
+static inline void hl_debugfs_remove_file(struct hl_fpriv *hpriv)
+{
+}
+
+static inline void hl_debugfs_add_cb(struct hl_cb *cb)
+{
+}
+
+static inline void hl_debugfs_remove_cb(struct hl_cb *cb)
+{
+}
+
+static inline void hl_debugfs_add_cs(struct hl_cs *cs)
+{
+}
+
+static inline void hl_debugfs_remove_cs(struct hl_cs *cs)
+{
+}
+
+static inline void hl_debugfs_add_job(struct hl_device *hdev,
+					struct hl_cs_job *job)
+{
+}
+
+static inline void hl_debugfs_remove_job(struct hl_device *hdev,
+					struct hl_cs_job *job)
+{
+}
+
+static inline void hl_debugfs_add_userptr(struct hl_device *hdev,
+					struct hl_userptr *userptr)
+{
+}
+
+static inline void hl_debugfs_remove_userptr(struct hl_device *hdev,
+					struct hl_userptr *userptr)
+{
+}
+
+static inline void hl_debugfs_add_ctx_mem_hash(struct hl_device *hdev,
+					struct hl_ctx *ctx)
+{
+}
+
+static inline void hl_debugfs_remove_ctx_mem_hash(struct hl_device *hdev,
+					struct hl_ctx *ctx)
+{
+}
+
+#endif
+
 /* IOCTLs */
 long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
index 658dc4588a36..96e0fc9c401f 100644
--- a/drivers/misc/habanalabs/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/habanalabs_drv.c
@@ -150,6 +150,8 @@ int hl_device_open(struct inode *inode, struct file *filp)
 	 */
 	hl_device_set_frequency(hdev, PLL_HIGH);
 
+	hl_debugfs_add_file(hpriv);
+
 	return 0;
 
 out_err:
@@ -417,17 +419,20 @@ static int __init hl_init(void)
 		goto remove_major;
 	}
 
+	hl_debugfs_init();
+
 	rc = pci_register_driver(&hl_pci_driver);
 	if (rc) {
 		pr_err("failed to register pci device\n");
-		goto remove_class;
+		goto remove_debugfs;
 	}
 
 	pr_debug("driver loaded\n");
 
 	return 0;
 
-remove_class:
+remove_debugfs:
+	hl_debugfs_fini();
 	class_destroy(hl_class);
 remove_major:
 	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
@@ -441,6 +446,13 @@ static void __exit hl_exit(void)
 {
 	pci_unregister_driver(&hl_pci_driver);
 
+	/*
+	 * Removing debugfs must be after all devices or simulator devices
+	 * have been removed because otherwise we get a bug in the
+	 * debugfs module for referencing NULL objects
+	 */
+	hl_debugfs_fini();
+
 	class_destroy(hl_class);
 	unregister_chrdev_region(MKDEV(hl_major, 0), HL_MAX_MINORS);
 
diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
index 98a7cc700fd5..c0d0ff357436 100644
--- a/drivers/misc/habanalabs/memory.c
+++ b/drivers/misc/habanalabs/memory.c
@@ -1289,6 +1289,8 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
 		goto free_sgt;
 	}
 
+	hl_debugfs_add_userptr(hdev, userptr);
+
 	return 0;
 
 free_sgt:
@@ -1314,6 +1316,8 @@ int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 {
 	struct page **pages;
 
+	hl_debugfs_remove_userptr(hdev, userptr);
+
 	if (userptr->dma_mapped)
 		hdev->asic_funcs->hl_dma_unmap_sg(hdev,
 				userptr->sgt->sgl,
@@ -1475,6 +1479,8 @@ int hl_vm_ctx_init_with_ranges(struct hl_ctx *ctx, u64 host_range_start,
 		goto dram_vm_err;
 	}
 
+	hl_debugfs_add_ctx_mem_hash(hdev, ctx);
+
 	return 0;
 
 dram_vm_err:
@@ -1597,6 +1603,8 @@ void hl_vm_ctx_fini(struct hl_ctx *ctx)
 	struct hlist_node *tmp_node;
 	int i;
 
+	hl_debugfs_remove_ctx_mem_hash(hdev, ctx);
+
 	if (!hash_empty(ctx->mem_hash))
 		dev_notice(hdev->dev, "ctx is freed while it has va in use\n");
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 15/15] Update MAINTAINERS and CREDITS with habanalabs info
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (12 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 14/15] habanalabs: add debugfs support Oded Gabbay
@ 2019-02-11 15:17 ` Oded Gabbay
  2019-02-14  7:11 ` [PATCH v4 00/15] Habana Labs kernel driver Greg KH
  14 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-11 15:17 UTC (permalink / raw)
  To: gregkh, linux-kernel, rppt, olof; +Cc: ogabbay, arnd, joe

The habanalabs driver was written from scratch from the very first days
of Habana and is maintained by Oded Gabbay.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 CREDITS     | 2 +-
 MAINTAINERS | 9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/CREDITS b/CREDITS
index e818eb6a3e71..03f3d67126fc 100644
--- a/CREDITS
+++ b/CREDITS
@@ -1222,7 +1222,7 @@ S: Brazil
 
 N: Oded Gabbay
 E: oded.gabbay@gmail.com
-D: AMD KFD maintainer
+D: HabanaLabs and AMD KFD maintainer
 S: 12 Shraga Raphaeli
 S: Petah-Tikva, 4906418
 S: Israel
diff --git a/MAINTAINERS b/MAINTAINERS
index 9919840d54cd..ac6fde220d2f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6647,6 +6647,15 @@ F:	drivers/clocksource/h8300_*.c
 F:	drivers/clk/h8300/
 F:	drivers/irqchip/irq-renesas-h8*.c
 
+HABANALABS PCI DRIVER
+M:	Oded Gabbay <oded.gabbay@gmail.com>
+T:	git https://github.com/HabanaAI/linux.git
+S:	Supported
+F:	drivers/misc/habanalabs/
+F:	include/uapi/misc/habanalabs.h
+F:	Documentation/ABI/testing/sysfs-driver-habanalabs
+F:	Documentation/ABI/testing/debugfs-driver-habanalabs
+
 HACKRF MEDIA DRIVER
 M:	Antti Palosaari <crope@iki.fi>
 L:	linux-media@vger.kernel.org
-- 
2.17.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 12/15] habanalabs: add virtual memory and MMU modules
  2019-02-11 15:17 ` [PATCH v4 12/15] habanalabs: add virtual memory and MMU modules Oded Gabbay
@ 2019-02-11 17:03   ` Mike Rapoport
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2019-02-11 17:03 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: gregkh, linux-kernel, olof, ogabbay, arnd, joe, Omer Shpigelman

On Mon, Feb 11, 2019 at 05:17:48PM +0200, Oded Gabbay wrote:
> From: Omer Shpigelman <oshpigelman@habana.ai>
> 
> This patch adds the Virtual Memory and MMU modules.
> 
> Goya has an internal MMU which provides process isolation on the internal
> DDR. The internal MMU also performs translations for transactions that go
> from Goya to the Host.
> 
> The driver is responsible for allocating and freeing memory on the DDR
> upon user request. It also provides an interface to map and unmap DDR and
> Host memory to the device address space.
> 
> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
> Changes in v4:
>   - Return number of ptes in dec pte
>   - Rename dec/inc_num_of_ptes to put/get_pte
>   - Use common function to get next hop address
>   - Fold alloc of new hop with finding next hop
>   - Invert logic for freeing ptes
>   - Support more pages sizes in MMU
>   - Use -ENOTTY to signal bad ioctl parameter in memory ioctl
>  
>  drivers/misc/habanalabs/Makefile              |    2 +-
>  drivers/misc/habanalabs/context.c             |   19 +-
>  drivers/misc/habanalabs/device.c              |   20 +-
>  drivers/misc/habanalabs/goya/goya.c           |  393 +++++
>  drivers/misc/habanalabs/habanalabs.h          |  188 +-
>  drivers/misc/habanalabs/habanalabs_drv.c      |    2 +-
>  drivers/misc/habanalabs/habanalabs_ioctl.c    |    3 +-
>  .../include/hw_ip/mmu/mmu_general.h           |   45 +
>  .../habanalabs/include/hw_ip/mmu/mmu_v1_0.h   |   15 +
>  drivers/misc/habanalabs/memory.c              | 1515 +++++++++++++++++
>  drivers/misc/habanalabs/mmu.c                 |  690 ++++++++
>  include/uapi/misc/habanalabs.h                |  122 +-
>  12 files changed, 3006 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
>  create mode 100644 drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
>  create mode 100644 drivers/misc/habanalabs/mmu.c
> 
> diff --git a/drivers/misc/habanalabs/Makefile b/drivers/misc/habanalabs/Makefile
> index d2fd0e18b1eb..fd46f8b48bab 100644
> --- a/drivers/misc/habanalabs/Makefile
> +++ b/drivers/misc/habanalabs/Makefile
> @@ -6,7 +6,7 @@ obj-m	:= habanalabs.o
>  
>  habanalabs-y := habanalabs_drv.o device.o context.o asid.o habanalabs_ioctl.o \
>  		command_buffer.o hw_queue.o irq.o sysfs.o hwmon.o memory.o \
> -		command_submission.o
> +		command_submission.o mmu.o
>  
>  include $(src)/goya/Makefile
>  habanalabs-y += $(HL_GOYA_FILES)
> diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
> index 98710646de6c..6ff0f2103d8d 100644
> --- a/drivers/misc/habanalabs/context.c
> +++ b/drivers/misc/habanalabs/context.c
> @@ -26,8 +26,10 @@ static void hl_ctx_fini(struct hl_ctx *ctx)
>  	for (i = 0 ; i < HL_MAX_PENDING_CS ; i++)
>  		dma_fence_put(ctx->cs_pending[i]);
>  
> -	if (ctx->asid != HL_KERNEL_ASID_ID)
> +	if (ctx->asid != HL_KERNEL_ASID_ID) {
> +		hl_vm_ctx_fini(ctx);
>  		hl_asid_free(hdev, ctx->asid);
> +	}
>  }
>  
>  void hl_ctx_do_release(struct kref *ref)
> @@ -97,6 +99,8 @@ void hl_ctx_free(struct hl_device *hdev, struct hl_ctx *ctx)
>  
>  int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
>  {
> +	int rc = 0;
> +
>  	ctx->hdev = hdev;
>  
>  	kref_init(&ctx->refcount);
> @@ -114,9 +118,22 @@ int hl_ctx_init(struct hl_device *hdev, struct hl_ctx *ctx, bool is_kernel_ctx)
>  			dev_err(hdev->dev, "No free ASID, failed to create context\n");
>  			return -ENOMEM;
>  		}
> +
> +		rc = hl_vm_ctx_init(ctx);
> +		if (rc) {
> +			dev_err(hdev->dev, "Failed to init mem ctx module\n");
> +			rc = -ENOMEM;
> +			goto mem_ctx_err;
> +		}
>  	}
>  
>  	return 0;
> +
> +mem_ctx_err:
> +	if (ctx->asid != HL_KERNEL_ASID_ID)
> +		hl_asid_free(hdev, ctx->asid);
> +
> +	return rc;
>  }
>  
>  void hl_ctx_get(struct hl_device *hdev, struct hl_ctx *ctx)
> diff --git a/drivers/misc/habanalabs/device.c b/drivers/misc/habanalabs/device.c
> index 21496882be3a..5c850d3574eb 100644
> --- a/drivers/misc/habanalabs/device.c
> +++ b/drivers/misc/habanalabs/device.c
> @@ -604,8 +604,10 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
>  	/* Reset the H/W. It will be in idle state after this returns */
>  	hdev->asic_funcs->hw_fini(hdev, hard_reset);
>  
> -	if (hard_reset)
> +	if (hard_reset) {
> +		hl_vm_fini(hdev);
>  		hl_eq_reset(hdev, &hdev->event_queue);
> +	}
>  
>  	/* Re-initialize PI,CI to 0 in all queues (hw queue, cq) */
>  	hl_hw_queue_reset(hdev, hard_reset);
> @@ -666,6 +668,13 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
>  			goto out_err;
>  		}
>  
> +		rc = hl_vm_init(hdev);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"Failed to init memory module after hard reset\n");
> +			goto out_err;
> +		}
> +
>  		hl_set_max_power(hdev, hdev->max_power);
>  
>  		hdev->hard_reset_pending = false;
> @@ -850,6 +859,13 @@ int hl_device_init(struct hl_device *hdev, struct class *hclass)
>  		hdev->asic_name,
>  		hdev->asic_prop.dram_size / 1024 / 1024 / 1024);
>  
> +	rc = hl_vm_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to initialize memory module\n");
> +		rc = 0;
> +		goto out_disabled;
> +	}
> +
>  	rc = hl_hwmon_init(hdev);
>  	if (rc) {
>  		dev_err(hdev->dev, "Failed to initialize hwmon\n");
> @@ -961,6 +977,8 @@ void hl_device_fini(struct hl_device *hdev)
>  	/* Reset the H/W. It will be in idle state after this returns */
>  	hdev->asic_funcs->hw_fini(hdev, true);
>  
> +	hl_vm_fini(hdev);
> +
>  	hl_eq_fini(hdev, &hdev->event_queue);
>  
>  	for (i = 0 ; i < hdev->asic_prop.completion_queues_count ; i++)
> diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
> index 9fda232139ea..bba159fd7755 100644
> --- a/drivers/misc/habanalabs/goya/goya.c
> +++ b/drivers/misc/habanalabs/goya/goya.c
> @@ -6,6 +6,8 @@
>   */
>  
>  #include "goyaP.h"
> +#include "include/hw_ip/mmu/mmu_general.h"
> +#include "include/hw_ip/mmu/mmu_v1_0.h"
>  #include "include/goya/asic_reg/goya_masks.h"
>  
>  #include <linux/fs.h>
> @@ -87,6 +89,7 @@
>  #define GOYA_PLDM_RESET_WAIT_MSEC	1000		/* 1s */
>  #define GOYA_CPU_TIMEOUT_USEC		10000000	/* 10s */
>  #define GOYA_TEST_QUEUE_WAIT_USEC	100000		/* 100ms */
> +#define GOYA_PLDM_MMU_TIMEOUT_USEC	(MMU_CONFIG_TIMEOUT_USEC * 100)
>  
>  #define GOYA_QMAN0_FENCE_VAL		0xD169B243
>  
> @@ -138,6 +141,70 @@ static const char *goya_axi_name[GOYA_MAX_INITIATORS] = {
>  	"MMU"
>  };
>  
> +static u64 goya_mmu_regs[GOYA_MMU_REGS_NUM] = {
> +	mmDMA_QM_0_GLBL_NON_SECURE_PROPS,
> +	mmDMA_QM_1_GLBL_NON_SECURE_PROPS,
> +	mmDMA_QM_2_GLBL_NON_SECURE_PROPS,
> +	mmDMA_QM_3_GLBL_NON_SECURE_PROPS,
> +	mmDMA_QM_4_GLBL_NON_SECURE_PROPS,
> +	mmTPC0_QM_GLBL_SECURE_PROPS,
> +	mmTPC0_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC0_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC0_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC0_CFG_ARUSER,
> +	mmTPC0_CFG_AWUSER,
> +	mmTPC1_QM_GLBL_SECURE_PROPS,
> +	mmTPC1_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC1_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC1_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC1_CFG_ARUSER,
> +	mmTPC1_CFG_AWUSER,
> +	mmTPC2_QM_GLBL_SECURE_PROPS,
> +	mmTPC2_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC2_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC2_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC2_CFG_ARUSER,
> +	mmTPC2_CFG_AWUSER,
> +	mmTPC3_QM_GLBL_SECURE_PROPS,
> +	mmTPC3_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC3_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC3_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC3_CFG_ARUSER,
> +	mmTPC3_CFG_AWUSER,
> +	mmTPC4_QM_GLBL_SECURE_PROPS,
> +	mmTPC4_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC4_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC4_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC4_CFG_ARUSER,
> +	mmTPC4_CFG_AWUSER,
> +	mmTPC5_QM_GLBL_SECURE_PROPS,
> +	mmTPC5_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC5_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC5_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC5_CFG_ARUSER,
> +	mmTPC5_CFG_AWUSER,
> +	mmTPC6_QM_GLBL_SECURE_PROPS,
> +	mmTPC6_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC6_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC6_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC6_CFG_ARUSER,
> +	mmTPC6_CFG_AWUSER,
> +	mmTPC7_QM_GLBL_SECURE_PROPS,
> +	mmTPC7_QM_GLBL_NON_SECURE_PROPS,
> +	mmTPC7_CMDQ_GLBL_SECURE_PROPS,
> +	mmTPC7_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmTPC7_CFG_ARUSER,
> +	mmTPC7_CFG_AWUSER,
> +	mmMME_QM_GLBL_SECURE_PROPS,
> +	mmMME_QM_GLBL_NON_SECURE_PROPS,
> +	mmMME_CMDQ_GLBL_SECURE_PROPS,
> +	mmMME_CMDQ_GLBL_NON_SECURE_PROPS,
> +	mmMME_SBA_CONTROL_DATA,
> +	mmMME_SBB_CONTROL_DATA,
> +	mmMME_SBC_CONTROL_DATA,
> +	mmMME_WBC_CONTROL_DATA
> +};
> +
>  #define GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE 121
>  
>  static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
> @@ -265,6 +332,10 @@ static u32 goya_non_fatal_events[GOYA_ASYC_EVENT_GROUP_NON_FATAL_SIZE] = {
>  };
>  
>  static int goya_armcp_info_get(struct hl_device *hdev);
> +static void goya_mmu_prepare(struct hl_device *hdev, u32 asid);
> +static int goya_mmu_clear_pgt_range(struct hl_device *hdev);
> +static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
> +					u64 phys_addr);
>  
>  static void goya_get_fixed_properties(struct hl_device *hdev)
>  {
> @@ -303,6 +374,16 @@ static void goya_get_fixed_properties(struct hl_device *hdev)
>  	prop->sram_user_base_address = prop->sram_base_address +
>  						SRAM_USER_BASE_OFFSET;
>  
> +	prop->mmu_pgt_addr = MMU_PAGE_TABLES_ADDR;
> +	if (hdev->pldm)
> +		prop->mmu_pgt_size = 0x800000; /* 8MB */
> +	else
> +		prop->mmu_pgt_size = MMU_PAGE_TABLES_SIZE;
> +	prop->mmu_pte_size = HL_PTE_SIZE;
> +	prop->mmu_hop_table_size = HOP_TABLE_SIZE;
> +	prop->mmu_hop0_tables_total_size = HOP0_TABLES_TOTAL_SIZE;
> +	prop->dram_page_size = PAGE_SIZE_2MB;
> +
>  	prop->host_phys_base_address = HOST_PHYS_BASE;
>  	prop->va_space_host_start_address = VA_HOST_SPACE_START;
>  	prop->va_space_host_end_address = VA_HOST_SPACE_END;
> @@ -756,7 +837,18 @@ static int goya_late_init(struct hl_device *hdev)
>  
>  	goya_fetch_psoc_frequency(hdev);
>  
> +	rc = goya_mmu_clear_pgt_range(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to clear MMU page tables range\n");
> +		goto disable_pci_access;
> +	}
> +
>  	return 0;
> +
> +disable_pci_access:
> +	goya_send_pci_access_msg(hdev, ARMCP_PACKET_DISABLE_PCI_ACCESS);
> +
> +	return rc;
>  }
>  
>  /*
> @@ -2569,6 +2661,54 @@ static int goya_init_cpu(struct hl_device *hdev, u32 cpu_timeout)
>  	return 0;
>  }
>  
> +static int goya_mmu_init(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct goya_device *goya = hdev->asic_specific;
> +	u64 hop0_addr;
> +	int rc, i;
> +
> +	if (!hdev->mmu_enable)
> +		return 0;
> +
> +	if (goya->hw_cap_initialized & HW_CAP_MMU)
> +		return 0;
> +
> +	hdev->dram_supports_virtual_memory = true;
> +
> +	for (i = 0 ; i < prop->max_asid ; i++) {
> +		hop0_addr = prop->mmu_pgt_addr +
> +				(i * prop->mmu_hop_table_size);
> +
> +		rc = goya_mmu_update_asid_hop0_addr(hdev, i, hop0_addr);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"failed to set hop0 addr for asid %d\n", i);
> +			goto err;
> +		}
> +	}
> +
> +	goya->hw_cap_initialized |= HW_CAP_MMU;
> +
> +	/* init MMU cache manage page */
> +	WREG32(mmSTLB_CACHE_INV_BASE_39_8, MMU_CACHE_MNG_ADDR >> 8);
> +	WREG32(mmSTLB_CACHE_INV_BASE_49_40, MMU_CACHE_MNG_ADDR << 40);
> +
> +	/* Remove follower feature due to performance bug */
> +	WREG32_AND(mmSTLB_STLB_FEATURE_EN,
> +			(~STLB_STLB_FEATURE_EN_FOLLOWER_EN_MASK));
> +
> +	hdev->asic_funcs->mmu_invalidate_cache(hdev, true);
> +
> +	WREG32(mmMMU_MMU_ENABLE, 1);
> +	WREG32(mmMMU_SPI_MASK, 0xF);
> +
> +	return 0;
> +
> +err:
> +	return rc;
> +}
> +
>  /*
>   * goya_hw_init - Goya hardware initialization code
>   *
> @@ -2618,6 +2758,10 @@ static int goya_hw_init(struct hl_device *hdev)
>  		return rc;
>  	}
>  
> +	rc = goya_mmu_init(hdev);
> +	if (rc)
> +		return rc;
> +
>  	goya_init_security(hdev);
>  
>  	goya_init_dma_qmans(hdev);
> @@ -4247,6 +4391,10 @@ int goya_context_switch(struct hl_device *hdev, u32 asid)
>  
>  	rc = goya_send_job_on_qman0(hdev, job);
>  
> +	/* no point in setting the asid in case of failure */
> +	if (!rc)
> +		goya_mmu_prepare(hdev, asid);
> +
>  	job->patched_cb->cs_cnt--;
>  	hl_cb_put(job->patched_cb);
>  
> @@ -4282,6 +4430,22 @@ void goya_restore_phase_topology(struct hl_device *hdev)
>  	i = RREG32(mmSYNC_MNGR_SOB_OBJ_0);
>  }
>  
> +static u64 goya_read_pte(struct hl_device *hdev, u64 addr)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	return readq(hdev->pcie_bar[DDR_BAR_ID] +
> +			(addr - goya->ddr_bar_cur_addr));
> +}
> +
> +static void goya_write_pte(struct hl_device *hdev, u64 addr, u64 val)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +
> +	writeq(val, hdev->pcie_bar[DDR_BAR_ID] +
> +			(addr - goya->ddr_bar_cur_addr));
> +}
> +
>  static void goya_get_axi_name(struct hl_device *hdev, u32 agent_id,
>  		u16 event_type, char *axi_name, int len)
>  {
> @@ -4565,6 +4729,231 @@ void *goya_get_events_stat(struct hl_device *hdev, u32 *size)
>  	return goya->events_stat;
>  }
>  
> +static int goya_mmu_clear_pgt_range(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct goya_device *goya = hdev->asic_specific;
> +	struct packet_lin_dma *clear_pgt_range_pkt;
> +	struct hl_cs_parser parser;
> +	struct hl_cs_job *job;
> +	u32 cb_size;
> +	struct hl_cb *cb;
> +	int rc;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
> +		return 0;
> +
> +	cb = hl_cb_kernel_create(hdev, PAGE_SIZE);
> +	if (!cb)
> +		return -EFAULT;
> +
> +	clear_pgt_range_pkt = (struct packet_lin_dma *) cb->kernel_address;
> +	memset(clear_pgt_range_pkt, 0, sizeof(*clear_pgt_range_pkt));
> +	cb_size = sizeof(*clear_pgt_range_pkt);
> +
> +	clear_pgt_range_pkt->ctl =
> +		((PACKET_LIN_DMA << GOYA_PKT_CTL_OPCODE_SHIFT) |
> +		(DMA_HOST_TO_DRAM << GOYA_PKT_LIN_DMA_CTL_DMA_DIR_SHIFT) |
> +		(1 << GOYA_PKT_LIN_DMA_CTL_MEMSET_SHIFT) |
> +		(1 << GOYA_PKT_LIN_DMA_CTL_WO_SHIFT) |
> +		(1 << GOYA_PKT_CTL_RB_SHIFT) |
> +		(1 << GOYA_PKT_CTL_MB_SHIFT));
> +
> +	clear_pgt_range_pkt->src_addr = 0;
> +	clear_pgt_range_pkt->dst_addr = prop->mmu_pgt_addr;
> +	clear_pgt_range_pkt->tsize = prop->mmu_pgt_size + MMU_CACHE_MNG_SIZE;
> +
> +	job = hl_cs_allocate_job(hdev, true);
> +	if (!job) {
> +		dev_err(hdev->dev, "Failed to allocate a new job\n");
> +		rc = -ENOMEM;
> +		goto release_cb;
> +	}
> +
> +	job->id = 0;
> +	job->user_cb = cb;
> +	job->user_cb->cs_cnt++;
> +	job->user_cb_size = cb_size;
> +	job->hw_queue_id = GOYA_QUEUE_ID_DMA_0;
> +
> +	parser.ctx_id = HL_KERNEL_ASID_ID;
> +	parser.cs_sequence = 0;
> +	parser.job_id = job->id;
> +	parser.hw_queue_id = job->hw_queue_id;
> +	parser.job_userptr_list = &job->userptr_list;
> +	parser.user_cb = job->user_cb;
> +	parser.user_cb_size = job->user_cb_size;
> +	parser.ext_queue = job->ext_queue;
> +	parser.use_virt_addr = hdev->mmu_enable;
> +
> +	rc = hdev->asic_funcs->cs_parser(hdev, &parser);
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to parse kernel CB when clearing pgt\n");
> +		goto free_job;
> +	}
> +
> +	job->patched_cb = parser.patched_cb;
> +	job->job_cb_size = parser.patched_cb_size;
> +	job->patched_cb->cs_cnt++;
> +
> +	rc = goya_send_job_on_qman0(hdev, job);
> +
> +	job->patched_cb->cs_cnt--;
> +	hl_cb_put(job->patched_cb);
> +
> +free_job:
> +	hl_userptr_delete_list(hdev, &job->userptr_list);
> +	kfree(job);
> +	cb->cs_cnt--;
> +
> +release_cb:
> +	hl_cb_put(cb);
> +	hl_cb_destroy(hdev, &hdev->kernel_cb_mgr, cb->id << PAGE_SHIFT);
> +
> +	return rc;
> +}
> +
> +static void goya_mmu_prepare(struct hl_device *hdev, u32 asid)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	int i;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
> +		return;
> +
> +	if (asid & ~MME_QM_GLBL_SECURE_PROPS_ASID_MASK) {
> +		WARN(1, "asid %u is too big\n", asid);
> +		return;
> +	}
> +
> +	/* zero the MMBP and ASID bits and then set the ASID */
> +	for (i = 0 ; i < GOYA_MMU_REGS_NUM ; i++) {
> +		WREG32_AND(goya_mmu_regs[i], ~0x7FF);
> +		WREG32_OR(goya_mmu_regs[i], asid);
> +	}
> +}
> +
> +static void goya_mmu_invalidate_cache(struct hl_device *hdev, bool is_hard)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 status, timeout_usec;
> +	int rc;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
> +		return;
> +
> +	/* no need in L1 only invalidation in Goya */
> +	if (!is_hard)
> +		return;
> +
> +	if (hdev->pldm)
> +		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
> +	else
> +		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
> +
> +	mutex_lock(&hdev->mmu_cache_lock);
> +
> +	/* L0 & L1 invalidation */
> +	WREG32(mmSTLB_INV_ALL_START, 1);
> +
> +	rc = hl_poll_timeout(
> +		hdev,
> +		mmSTLB_INV_ALL_START,
> +		status,
> +		!status,
> +		1000,
> +		timeout_usec);
> +
> +	mutex_unlock(&hdev->mmu_cache_lock);
> +
> +	if (rc)
> +		dev_notice_ratelimited(hdev->dev,
> +			"Timeout when waiting for MMU cache invalidation\n");
> +}
> +
> +static void goya_mmu_invalidate_cache_range(struct hl_device *hdev,
> +		bool is_hard, u32 asid, u64 va, u64 size)
> +{
> +	struct goya_device *goya = hdev->asic_specific;
> +	u32 status, timeout_usec, inv_data, pi;
> +	int rc;
> +
> +	if (!(goya->hw_cap_initialized & HW_CAP_MMU))
> +		return;
> +
> +	/* no need in L1 only invalidation in Goya */
> +	if (!is_hard)
> +		return;
> +
> +	if (hdev->pldm)
> +		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
> +	else
> +		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
> +
> +	mutex_lock(&hdev->mmu_cache_lock);
> +
> +	/*
> +	 * TODO: currently invalidate entire L0 & L1 as in regular hard
> +	 * invalidation. Need to apply invalidation of specific cache lines with
> +	 * mask of ASID & VA & size.
> +	 * Note that L1 with be flushed entirely in any case.
> +	 */
> +
> +	/* L0 & L1 invalidation */
> +	inv_data = RREG32(mmSTLB_CACHE_INV);
> +	/* PI is 8 bit */
> +	pi = ((inv_data & STLB_CACHE_INV_PRODUCER_INDEX_MASK) + 1) & 0xFF;
> +	WREG32(mmSTLB_CACHE_INV,
> +			(inv_data & STLB_CACHE_INV_INDEX_MASK_MASK) | pi);
> +
> +	rc = hl_poll_timeout(
> +		hdev,
> +		mmSTLB_INV_CONSUMER_INDEX,
> +		status,
> +		status == pi,
> +		1000,
> +		timeout_usec);
> +
> +	mutex_unlock(&hdev->mmu_cache_lock);
> +
> +	if (rc)
> +		dev_notice_ratelimited(hdev->dev,
> +			"Timeout when waiting for MMU cache invalidation\n");
> +}
> +
> +static int goya_mmu_update_asid_hop0_addr(struct hl_device *hdev, u32 asid,
> +						u64 phys_addr)
> +{
> +	u32 status, timeout_usec;
> +	int rc;
> +
> +	if (hdev->pldm)
> +		timeout_usec = GOYA_PLDM_MMU_TIMEOUT_USEC;
> +	else
> +		timeout_usec = MMU_CONFIG_TIMEOUT_USEC;
> +
> +	WREG32(MMU_HOP0_PA43_12, phys_addr >> MMU_HOP0_PA43_12_SHIFT);
> +	WREG32(MMU_HOP0_PA49_44, phys_addr >> MMU_HOP0_PA49_44_SHIFT);
> +	WREG32(MMU_ASID_BUSY, 0x80000000 | asid);
> +
> +	rc = hl_poll_timeout(
> +		hdev,
> +		MMU_ASID_BUSY,
> +		status,
> +		!(status & 0x80000000),
> +		1000,
> +		timeout_usec);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Timeout during MMU hop0 config of asid %d\n", asid);
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
>  int goya_send_heartbeat(struct hl_device *hdev)
>  {
>  	struct goya_device *goya = hdev->asic_specific;
> @@ -4828,6 +5217,10 @@ static const struct hl_asic_funcs goya_funcs = {
>  	.handle_eqe = goya_handle_eqe,
>  	.set_pll_profile = goya_set_pll_profile,
>  	.get_events_stat = goya_get_events_stat,
> +	.read_pte = goya_read_pte,
> +	.write_pte = goya_write_pte,
> +	.mmu_invalidate_cache = goya_mmu_invalidate_cache,
> +	.mmu_invalidate_cache_range = goya_mmu_invalidate_cache_range,
>  	.send_heartbeat = goya_send_heartbeat,
>  	.enable_clock_gating = goya_init_clock_gating,
>  	.disable_clock_gating = goya_disable_clock_gating,
> diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
> index e8f71f6c8fb4..bf2d5cba6148 100644
> --- a/drivers/misc/habanalabs/habanalabs.h
> +++ b/drivers/misc/habanalabs/habanalabs.h
> @@ -41,6 +41,31 @@
>  /* MUST BE POWER OF 2 and larger than 1 */
>  #define HL_MAX_PENDING_CS		64
>  
> +/* Memory */
> +#define MEM_HASH_TABLE_BITS		7 /* 1 << 7 buckets */
> +
> +/* MMU */
> +#define MMU_HASH_TABLE_BITS		7 /* 1 << 7 buckets */
> +
> +/**
> + * struct pgt_info - MMU hop page info.
> + * @node: hash linked-list node for the pgts hash of pgts.
> + * @addr: physical address of the pgt.
> + * @ctx: pointer to the owner ctx.
> + * @num_of_ptes: indicates how many ptes are used in the pgt.
> + *
> + * The MMU page tables hierarchy is placed on the DRAM. When a new level (hop)
> + * is needed during mapping, a new page is allocated and this structure holds
> + * its essential information. During unmapping, if no valid PTEs remained in the
> + * page, it is freed with its pgt_info structure.
> + */
> +struct pgt_info {
> +	struct hlist_node node;
> +	u64 addr;
> +	struct hl_ctx *ctx;
> +	int num_of_ptes;
> +};
> +
>  struct hl_device;
>  struct hl_fpriv;
>  
> @@ -74,11 +99,11 @@ struct hw_queue_properties {
>  /**
>   * enum vm_type_t - virtual memory mapping request information.
>   * @VM_TYPE_USERPTR: mapping of user memory to device virtual address.
> - * @VM_TYPE_PHYS_LIST: mapping of DRAM memory to device virtual address.
> + * @VM_TYPE_PHYS_PACK: mapping of DRAM memory to device virtual address.
>   */
>  enum vm_type_t {
>  	VM_TYPE_USERPTR,
> -	VM_TYPE_PHYS_LIST
> +	VM_TYPE_PHYS_PACK
>  };
>  
>  /**
> @@ -119,6 +144,12 @@ enum hl_device_hw_state {
>   *                               mapping DRAM memory.
>   * @va_space_dram_end_address: end address of virtual memory range for
>   *                             mapping DRAM memory.
> + * @mmu_pgt_addr: base physical address in DRAM of MMU page tables.
> + * @mmu_pgt_size: MMU page tables total size.
> + * @mmu_pte_size: PTE size in MMU page tables.
> + * @mmu_hop_table_size: MMU hop table size.
> + * @mmu_hop0_tables_total_size: total size of MMU hop0 tables.
> + * @dram_page_size: page size for MMU DRAM allocation.
>   * @cfg_size: configuration space size on SRAM.
>   * @sram_size: total size of SRAM.
>   * @max_asid: maximum number of open contexts (ASIDs).
> @@ -152,6 +183,12 @@ struct asic_fixed_properties {
>  	u64			va_space_host_end_address;
>  	u64			va_space_dram_start_address;
>  	u64			va_space_dram_end_address;
> +	u64			mmu_pgt_addr;
> +	u32			mmu_pgt_size;
> +	u32			mmu_pte_size;
> +	u32			mmu_hop_table_size;
> +	u32			mmu_hop0_tables_total_size;
> +	u32			dram_page_size;
>  	u32			cfg_size;
>  	u32			sram_size;
>  	u32			max_asid;
> @@ -421,6 +458,12 @@ enum hl_pll_frequency {
>   * @handle_eqe: handle event queue entry (IRQ) from ArmCP.
>   * @set_pll_profile: change PLL profile (manual/automatic).
>   * @get_events_stat: retrieve event queue entries histogram.
> + * @read_pte: read MMU page table entry from DRAM.
> + * @write_pte: write MMU page table entry to DRAM.
> + * @mmu_invalidate_cache: flush MMU STLB cache, either with soft (L1 only) or
> + *                        hard (L0 & L1) flush.
> + * @mmu_invalidate_cache_range: flush specific MMU STLB cache lines with
> + *                              ASID-VA-size mask.
>   * @send_heartbeat: send is-alive packet to ArmCP and verify response.
>   * @enable_clock_gating: enable clock gating for reducing power consumption.
>   * @disable_clock_gating: disable clock for accessing registers on HBW.
> @@ -485,6 +528,11 @@ struct hl_asic_funcs {
>  	void (*set_pll_profile)(struct hl_device *hdev,
>  			enum hl_pll_frequency freq);
>  	void* (*get_events_stat)(struct hl_device *hdev, u32 *size);
> +	u64 (*read_pte)(struct hl_device *hdev, u64 addr);
> +	void (*write_pte)(struct hl_device *hdev, u64 addr, u64 val);
> +	void (*mmu_invalidate_cache)(struct hl_device *hdev, bool is_hard);
> +	void (*mmu_invalidate_cache_range)(struct hl_device *hdev, bool is_hard,
> +			u32 asid, u64 va, u64 size);
>  	int (*send_heartbeat)(struct hl_device *hdev);
>  	void (*enable_clock_gating)(struct hl_device *hdev);
>  	void (*disable_clock_gating)(struct hl_device *hdev);
> @@ -506,17 +554,40 @@ struct hl_asic_funcs {
>  
>  #define HL_KERNEL_ASID_ID	0
>  
> +/**
> + * struct hl_va_range - virtual addresses range.
> + * @lock: protects the virtual addresses list.
> + * @list: list of virtual addresses blocks available for mappings.
> + * @start_addr: range start address.
> + * @end_addr: range end address.
> + */
> +struct hl_va_range {
> +	struct mutex		lock;
> +	struct list_head	list;
> +	u64			start_addr;
> +	u64			end_addr;
> +};
> +
>  /**
>   * struct hl_ctx - user/kernel context.
> + * @mem_hash: holds mapping from virtual address to virtual memory area
> + *		descriptor (hl_vm_phys_pg_list or hl_userptr).
> + * @mmu_hash: holds a mapping from virtual address to pgt_info structure.
>   * @hpriv: pointer to the private (KMD) data of the process (fd).
>   * @hdev: pointer to the device structure.
>   * @refcount: reference counter for the context. Context is released only when
>   *		this hits 0l. It is incremented on CS and CS_WAIT.
>   * @cs_pending: array of DMA fence objects representing pending CS.
> + * @host_va_range: holds available virtual addresses for host mappings.
> + * @dram_va_range: holds available virtual addresses for DRAM mappings.
> + * @mem_hash_lock: protects the mem_hash.
> + * @mmu_lock: protects the MMU page tables. Any change to the PGT, modifing the
> + *            MMU hash or walking the PGT requires talking this lock
>   * @cs_sequence: sequence number for CS. Value is assigned to a CS and passed
>   *			to user so user could inquire about CS. It is used as
>   *			index to cs_pending array.
>   * @cs_lock: spinlock to protect cs_sequence.
> + * @dram_phys_mem: amount of used physical DRAM memory by this context.
>   * @thread_restore_token: token to prevent multiple threads of the same context
>   *				from running the restore phase. Only one thread
>   *				should run it.
> @@ -526,12 +597,19 @@ struct hl_asic_funcs {
>   * @asid: context's unique address space ID in the device's MMU.
>   */
>  struct hl_ctx {
> +	DECLARE_HASHTABLE(mem_hash, MEM_HASH_TABLE_BITS);
> +	DECLARE_HASHTABLE(mmu_hash, MMU_HASH_TABLE_BITS);
>  	struct hl_fpriv		*hpriv;
>  	struct hl_device	*hdev;
>  	struct kref		refcount;
>  	struct dma_fence	*cs_pending[HL_MAX_PENDING_CS];
> +	struct hl_va_range	host_va_range;
> +	struct hl_va_range	dram_va_range;
> +	struct mutex		mem_hash_lock;
> +	struct mutex		mmu_lock;
>  	u64			cs_sequence;
>  	spinlock_t		cs_lock;
> +	atomic64_t		dram_phys_mem;
>  	atomic_t		thread_restore_token;
>  	u32			thread_restore_wait_token;
>  	u32			asid;
> @@ -674,6 +752,85 @@ struct hl_cs_parser {
>  };
>  
>  
> +/*
> + * MEMORY STRUCTURE
> + */
> +
> +/**
> + * struct hl_vm_hash_node - hash element from virtual address to virtual
> + *				memory area descriptor (hl_vm_phys_pg_list or
> + *				hl_userptr).
> + * @node: node to hang on the hash table in context object.
> + * @vaddr: key virtual address.
> + * @ptr: value pointer (hl_vm_phys_pg_list or hl_userptr).
> + */
> +struct hl_vm_hash_node {
> +	struct hlist_node	node;
> +	u64			vaddr;
> +	void			*ptr;
> +};
> +
> +/**
> + * struct hl_vm_phys_pg_pack - physical page pack.
> + * @vm_type: describes the type of the virtual area descriptor.
> + * @pages: the physical page array.
> + * @mapping_cnt: number of shared mappings.
> + * @asid: the context related to this list.
> + * @npages: num physical pages in the pack.
> + * @page_size: size of each page in the pack.
> + * @total_size: total size of all the pages in this list.
> + * @flags: HL_MEM_* flags related to this list.
> + * @handle: the provided handle related to this list.
> + * @offset: offset from the first page.
> + * @contiguous: is contiguous physical memory.
> + * @created_from_userptr: is product of host virtual address.
> + */
> +struct hl_vm_phys_pg_pack {
> +	enum vm_type_t		vm_type; /* must be first */
> +	u64			*pages;
> +	atomic_t		mapping_cnt;
> +	u32			asid;
> +	u32			npages;
> +	u32			page_size;
> +	u32			total_size;
> +	u32			flags;
> +	u32			handle;
> +	u32			offset;
> +	u8			contiguous;
> +	u8			created_from_userptr;
> +};
> +
> +/**
> + * struct hl_vm_va_block - virtual range block information.
> + * @node: node to hang on the virtual range list in context object.
> + * @start: virtual range start address.
> + * @end: virtual range end address.
> + * @size: virtual range size.
> + */
> +struct hl_vm_va_block {
> +	struct list_head	node;
> +	u64			start;
> +	u64			end;
> +	u64			size;
> +};
> +
> +/**
> + * struct hl_vm - virtual memory manager for MMU.
> + * @dram_pg_pool: pool for DRAM physical pages of 2MB.
> + * @dram_pg_pool_refcount: reference counter for the pool usage.
> + * @idr_lock: protects the phys_pg_list_handles.
> + * @phys_pg_pack_handles: idr to hold all device allocations handles.
> + * @init_done: whether initialization was done. We need this because VM
> + *		initialization might be skipped during device initialization.
> + */
> +struct hl_vm {
> +	struct gen_pool		*dram_pg_pool;
> +	struct kref		dram_pg_pool_refcount;
> +	spinlock_t		idr_lock;
> +	struct idr		phys_pg_pack_handles;
> +	u8			init_done;
> +};
> +
>  /*
>   * FILE PRIVATE STRUCTURE
>   */
> @@ -787,12 +944,16 @@ struct hl_device_reset_work {
>   * @asic_prop: ASIC specific immutable properties.
>   * @asic_funcs: ASIC specific functions.
>   * @asic_specific: ASIC specific information to use only from ASIC files.
> + * @mmu_pgt_pool: pool of available MMU hops.
> + * @vm: virtual memory manager for MMU.
> + * @mmu_cache_lock: protects MMU cache invalidation as it can serve one context
>   * @hwmon_dev: H/W monitor device.
>   * @pm_mng_profile: current power management profile.
>   * @hl_chip_info: ASIC's sensors information.
>   * @cb_pool: list of preallocated CBs.
>   * @cb_pool_lock: protects the CB pool.
>   * @user_ctx: current user context executing.
> + * @dram_used_mem: current DRAM memory consumption.
>   * @in_reset: is device in reset flow.
>   * @curr_pll_profile: current PLL profile.
>   * @fd_open_cnt: number of open user processes.
> @@ -812,6 +973,7 @@ struct hl_device_reset_work {
>   * @heartbeat: is heartbeat sanity check towards ArmCP enabled.
>   * @reset_on_lockup: true if a reset should be done in case of stuck CS, false
>   *                   otherwise.
> + * @dram_supports_virtual_memory: is MMU enabled towards DRAM.
>   * @init_done: is the initialization of the device done.
>   * @mmu_enable: is MMU enabled.
>   */
> @@ -846,6 +1008,9 @@ struct hl_device {
>  	struct asic_fixed_properties	asic_prop;
>  	const struct hl_asic_funcs	*asic_funcs;
>  	void				*asic_specific;
> +	struct gen_pool			*mmu_pgt_pool;
> +	struct hl_vm			vm;
> +	struct mutex			mmu_cache_lock;
>  	struct device			*hwmon_dev;
>  	enum hl_pm_mng_profile		pm_mng_profile;
>  	struct hwmon_chip_info		hl_chip_info;
> @@ -856,6 +1021,7 @@ struct hl_device {
>  	/* TODO: remove user_ctx for multiple process support */
>  	struct hl_ctx			*user_ctx;
>  
> +	atomic64_t			dram_used_mem;
>  	atomic_t			in_reset;
>  	atomic_t			curr_pll_profile;
>  	atomic_t			fd_open_cnt;
> @@ -872,6 +1038,7 @@ struct hl_device {
>  	u8				hard_reset_pending;
>  	u8				heartbeat;
>  	u8				reset_on_lockup;
> +	u8				dram_supports_virtual_memory;
>  	u8				init_done;
>  
>  	/* Parameters for bring-up */
> @@ -1021,6 +1188,7 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
>  void hl_hpriv_get(struct hl_fpriv *hpriv);
>  void hl_hpriv_put(struct hl_fpriv *hpriv);
>  int hl_device_set_frequency(struct hl_device *hdev, enum hl_pll_frequency freq);
> +
>  int hl_build_hwmon_channel_info(struct hl_device *hdev,
>  		struct armcp_sensor *sensors_arr);
>  
> @@ -1048,6 +1216,12 @@ struct hl_cs_job *hl_cs_allocate_job(struct hl_device *hdev, bool ext_queue);
>  
>  void goya_set_asic_funcs(struct hl_device *hdev);
>  
> +int hl_vm_ctx_init(struct hl_ctx *ctx);
> +void hl_vm_ctx_fini(struct hl_ctx *ctx);
> +
> +int hl_vm_init(struct hl_device *hdev);
> +void hl_vm_fini(struct hl_device *hdev);
> +
>  int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u32 size,
>  			struct hl_userptr *userptr);
>  int hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr);
> @@ -1057,6 +1231,15 @@ bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr, u32 size,
>  				struct list_head *userptr_list,
>  				struct hl_userptr **userptr);
>  
> +int hl_mmu_init(struct hl_device *hdev);
> +void hl_mmu_fini(struct hl_device *hdev);
> +void hl_mmu_ctx_init(struct hl_ctx *ctx);
> +void hl_mmu_ctx_fini(struct hl_ctx *ctx);
> +int hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr, u32 page_size);
> +int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr, u32 page_size);
> +void hl_mmu_swap_out(struct hl_ctx *ctx);
> +void hl_mmu_swap_in(struct hl_ctx *ctx);
> +
>  long hl_get_frequency(struct hl_device *hdev, u32 pll_index, bool curr);
>  void hl_set_frequency(struct hl_device *hdev, u32 pll_index, u64 freq);
>  long hl_get_temperature(struct hl_device *hdev, int sensor_index, u32 attr);
> @@ -1074,5 +1257,6 @@ long hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
>  int hl_cb_ioctl(struct hl_fpriv *hpriv, void *data);
>  int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data);
>  int hl_cs_wait_ioctl(struct hl_fpriv *hpriv, void *data);
> +int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data);
>  
>  #endif /* HABANALABSP_H_ */
> diff --git a/drivers/misc/habanalabs/habanalabs_drv.c b/drivers/misc/habanalabs/habanalabs_drv.c
> index dbc0c9c2c99a..658dc4588a36 100644
> --- a/drivers/misc/habanalabs/habanalabs_drv.c
> +++ b/drivers/misc/habanalabs/habanalabs_drv.c
> @@ -192,7 +192,7 @@ int create_hdev(struct hl_device **dev, struct pci_dev *pdev,
>  	hdev->reset_on_lockup = reset_on_lockup;
>  
>  	/* Parameters for bring-up - set them to defaults */
> -	hdev->mmu_enable = 0;
> +	hdev->mmu_enable = 1;
>  	hdev->cpu_enable = 1;
>  	hdev->reset_pcilink = 0;
>  	hdev->cpu_queues_enable = 1;
> diff --git a/drivers/misc/habanalabs/habanalabs_ioctl.c b/drivers/misc/habanalabs/habanalabs_ioctl.c
> index f93649a63a9e..71ef0c91668b 100644
> --- a/drivers/misc/habanalabs/habanalabs_ioctl.c
> +++ b/drivers/misc/habanalabs/habanalabs_ioctl.c
> @@ -18,7 +18,8 @@
>  static const struct hl_ioctl_desc hl_ioctls[] = {
>  	HL_IOCTL_DEF(HL_IOCTL_CB, hl_cb_ioctl),
>  	HL_IOCTL_DEF(HL_IOCTL_CS, hl_cs_ioctl),
> -	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl)
> +	HL_IOCTL_DEF(HL_IOCTL_WAIT_CS, hl_cs_wait_ioctl),
> +	HL_IOCTL_DEF(HL_IOCTL_MEMORY, hl_mem_ioctl)
>  };
>  
>  #define HL_CORE_IOCTL_COUNT	ARRAY_SIZE(hl_ioctls)
> diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> new file mode 100644
> index 000000000000..01483a581561
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_general.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef INCLUDE_MMU_GENERAL_H_
> +#define INCLUDE_MMU_GENERAL_H_
> +
> +#define PAGE_SHIFT_4KB			12
> +#define PAGE_SHIFT_2MB			21
> +#define PAGE_SIZE_2MB			(_AC(1, UL) << PAGE_SHIFT_2MB)
> +#define PAGE_SIZE_4KB			(_AC(1, UL) << PAGE_SHIFT_4KB)
> +
> +#define PAGE_PRESENT_MASK		0x0000000000001
> +#define SWAP_OUT_MASK			0x0000000000004
> +#define LAST_MASK			0x0000000000800
> +#define PHYS_ADDR_MASK			0x3FFFFFFFFF000ull
> +#define HOP0_MASK			0x3000000000000ull
> +#define HOP1_MASK			0x0FF8000000000ull
> +#define HOP2_MASK			0x0007FC0000000ull
> +#define HOP3_MASK			0x000003FE00000
> +#define HOP4_MASK			0x00000001FF000
> +#define OFFSET_MASK			0x0000000000FFF
> +
> +#define HOP0_SHIFT			48
> +#define HOP1_SHIFT			39
> +#define HOP2_SHIFT			30
> +#define HOP3_SHIFT			21
> +#define HOP4_SHIFT			12
> +
> +#define PTE_PHYS_ADDR_SHIFT		12
> +#define PTE_PHYS_ADDR_MASK		~0xFFF
> +
> +#define HL_PTE_SIZE			sizeof(u64)
> +#define HOP_TABLE_SIZE			PAGE_SIZE_4KB
> +#define HOP0_TABLES_TOTAL_SIZE		(HOP_TABLE_SIZE * MAX_ASID)
> +
> +#define MMU_HOP0_PA43_12_SHIFT		12
> +#define MMU_HOP0_PA49_44_SHIFT		(12 + 32)
> +
> +#define MMU_CONFIG_TIMEOUT_USEC		2000 /* 2 ms */
> +
> +#endif /* INCLUDE_MMU_GENERAL_H_ */
> diff --git a/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
> new file mode 100644
> index 000000000000..8539dd041f2c
> --- /dev/null
> +++ b/drivers/misc/habanalabs/include/hw_ip/mmu/mmu_v1_0.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + *
> + */
> +
> +#ifndef INCLUDE_MMU_V1_0_H_
> +#define INCLUDE_MMU_V1_0_H_
> +
> +#define MMU_HOP0_PA43_12	0x490004
> +#define MMU_HOP0_PA49_44	0x490008
> +#define MMU_ASID_BUSY		0x490000
> +
> +#endif /* INCLUDE_MMU_V1_0_H_ */
> diff --git a/drivers/misc/habanalabs/memory.c b/drivers/misc/habanalabs/memory.c
> index 47e110b4f76e..98a7cc700fd5 100644
> --- a/drivers/misc/habanalabs/memory.c
> +++ b/drivers/misc/habanalabs/memory.c
> @@ -5,12 +5,1198 @@
>   * All Rights Reserved.
>   */
>  
> +#include <uapi/misc/habanalabs.h>
>  #include "habanalabs.h"
> +#include "include/hw_ip/mmu/mmu_general.h"
>  
>  #include <linux/sched.h>
>  #include <linux/uaccess.h>
>  #include <linux/genalloc.h>
>  
> +#define PGS_IN_HPAGE   (HPAGE_SIZE >> PAGE_SHIFT)
> +#define HL_MMU_DEBUG	0
> +
> +/*
> + * The va ranges in context object contain a list with the available chunks of
> + * device virtual memory.
> + * There is one range for host allocations and one for DRAM allocations.
> + *
> + * On initialization each range contains one chunk of all of its available
> + * virtual range which is a half of the total device virtual range.
> + *
> + * On each mapping of physical pages, a suitable virtual range chunk (with a
> + * minimum size) is selected from the list. If the chunk size equals the
> + * requested size, the chunk is returned. Otherwise, the chunk is split into
> + * two chunks - one to return as result and a remainder to stay in the list.
> + *
> + * On each Unmapping of a virtual address, the relevant virtual chunk is
> + * returned to the list. The chunk is added to the list and if its edges match
> + * the edges of the adjacent chunks (means a contiguous chunk can be created),
> + * the chunks are merged.
> + *
> + * On finish, the list is checked to have only one chunk of all the relevant
> + * virtual range (which is a half of the device total virtual range).
> + * If not (means not all mappings were unmapped), a warning is printed.
> + */
> +
> +/*
> + * alloc_device_memory - allocate device memory
> + *
> + * @ctx                 : current context
> + * @args                : host parameters containing the requested size
> + * @ret_handle          : result handle
> + *
> + * This function does the following:
> + * - Allocate the requested size rounded up to 2MB pages
> + * - Return unique handle
> + */
> +static int alloc_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args,
> +				u32 *ret_handle)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg_pack *phys_pg_pack;
> +	u64 paddr = 0;
> +	u32 total_size, num_pgs, num_curr_pgs, page_size, page_shift;
> +	int handle, rc, i;
> +	bool contiguous;
> +
> +	num_curr_pgs = 0;
> +	page_size = hdev->asic_prop.dram_page_size;
> +	page_shift = __ffs(page_size);
> +	num_pgs = (args->alloc.mem_size + (page_size - 1)) >> page_shift;
> +	total_size = num_pgs << page_shift;
> +
> +	contiguous = args->flags & HL_MEM_CONTIGUOUS;
> +
> +	if (contiguous) {
> +		paddr = (u64) gen_pool_alloc(vm->dram_pg_pool, total_size);
> +		if (!paddr) {
> +			dev_err(hdev->dev,
> +				"failed to allocate %u huge contiguous pages\n",
> +				num_pgs);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	phys_pg_pack = kzalloc(sizeof(*phys_pg_pack), GFP_KERNEL);
> +	if (!phys_pg_pack) {
> +		rc = -ENOMEM;
> +		goto pages_pack_err;
> +	}
> +
> +	phys_pg_pack->vm_type = VM_TYPE_PHYS_PACK;
> +	phys_pg_pack->asid = ctx->asid;
> +	phys_pg_pack->npages = num_pgs;
> +	phys_pg_pack->page_size = page_size;
> +	phys_pg_pack->total_size = total_size;
> +	phys_pg_pack->flags = args->flags;
> +	phys_pg_pack->contiguous = contiguous;
> +
> +	phys_pg_pack->pages = kcalloc(num_pgs, sizeof(u64), GFP_KERNEL);
> +	if (!phys_pg_pack->pages) {
> +		rc = -ENOMEM;
> +		goto pages_arr_err;
> +	}
> +
> +	if (phys_pg_pack->contiguous) {
> +		for (i = 0 ; i < num_pgs ; i++)
> +			phys_pg_pack->pages[i] = paddr + i * page_size;
> +	} else {
> +		for (i = 0 ; i < num_pgs ; i++) {
> +			phys_pg_pack->pages[i] = (u64) gen_pool_alloc(
> +							vm->dram_pg_pool,
> +							page_size);
> +			if (!phys_pg_pack->pages[i]) {
> +				dev_err(hdev->dev,
> +					"ioctl failed to allocate page\n");
> +				rc = -ENOMEM;
> +				goto page_err;
> +			}
> +
> +			num_curr_pgs++;
> +		}
> +	}
> +
> +	spin_lock(&vm->idr_lock);
> +	handle = idr_alloc(&vm->phys_pg_pack_handles, phys_pg_pack, 1, 0,
> +				GFP_KERNEL);
> +	spin_unlock(&vm->idr_lock);
> +
> +	if (handle < 0) {
> +		dev_err(hdev->dev, "Failed to get handle for page\n");
> +		rc = -EFAULT;
> +		goto idr_err;
> +	}
> +
> +	for (i = 0 ; i < num_pgs ; i++)
> +		kref_get(&vm->dram_pg_pool_refcount);
> +
> +	phys_pg_pack->handle = handle;
> +
> +	atomic64_add(phys_pg_pack->total_size, &ctx->dram_phys_mem);
> +	atomic64_add(phys_pg_pack->total_size, &hdev->dram_used_mem);
> +
> +	*ret_handle = handle;
> +
> +	return 0;
> +
> +idr_err:
> +page_err:
> +	if (!phys_pg_pack->contiguous)
> +		for (i = 0 ; i < num_curr_pgs ; i++)
> +			gen_pool_free(vm->dram_pg_pool, phys_pg_pack->pages[i],
> +					page_size);
> +
> +	kfree(phys_pg_pack->pages);
> +pages_arr_err:
> +	kfree(phys_pg_pack);
> +pages_pack_err:
> +	if (contiguous)
> +		gen_pool_free(vm->dram_pg_pool, paddr, total_size);
> +
> +	return rc;
> +}
> +
> +/*
> + * get_userptr_from_host_va - initialize userptr structure from given host
> + *                            virtual address
> + *
> + * @hdev                : habanalabs device structure
> + * @args                : parameters containing the virtual address and size
> + * @p_userptr           : pointer to result userptr structure
> + *
> + * This function does the following:
> + * - Allocate userptr structure
> + * - Pin the given host memory using the userptr structure
> + * - Perform DMA mapping to have the DMA addresses of the pages
> + */
> +static int get_userptr_from_host_va(struct hl_device *hdev,
> +		struct hl_mem_in *args, struct hl_userptr **p_userptr)
> +{
> +	struct hl_userptr *userptr;
> +	int rc;
> +
> +	userptr = kzalloc(sizeof(*userptr), GFP_KERNEL);
> +	if (!userptr) {
> +		rc = -ENOMEM;
> +		goto userptr_err;
> +	}
> +
> +	rc = hl_pin_host_memory(hdev, args->map_host.host_virt_addr,
> +			args->map_host.mem_size, userptr);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to pin host memory\n");
> +		goto pin_err;
> +	}
> +
> +	rc = hdev->asic_funcs->asic_dma_map_sg(hdev, userptr->sgt->sgl,
> +					userptr->sgt->nents, DMA_BIDIRECTIONAL);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to map sgt with DMA region\n");
> +		goto dma_map_err;
> +	}
> +
> +	userptr->dma_mapped = true;
> +	userptr->dir = DMA_BIDIRECTIONAL;
> +	userptr->vm_type = VM_TYPE_USERPTR;
> +
> +	*p_userptr = userptr;
> +
> +	return 0;
> +
> +dma_map_err:
> +	hl_unpin_host_memory(hdev, userptr);
> +pin_err:
> +	kfree(userptr);
> +userptr_err:
> +
> +	return rc;
> +}
> +
> +/*
> + * free_userptr - free userptr structure
> + *
> + * @hdev                : habanalabs device structure
> + * @userptr             : userptr to free
> + *
> + * This function does the following:
> + * - Unpins the physical pages
> + * - Frees the userptr structure
> + */
> +static void free_userptr(struct hl_device *hdev, struct hl_userptr *userptr)
> +{
> +	hl_unpin_host_memory(hdev, userptr);
> +	kfree(userptr);
> +}
> +
> +/*
> + * dram_pg_pool_do_release - free DRAM pages pool
> + *
> + * @ref                 : pointer to reference object
> + *
> + * This function does the following:
> + * - Frees the idr structure of physical pages handles
> + * - Frees the generic pool of DRAM physical pages
> + */
> +static void dram_pg_pool_do_release(struct kref *ref)
> +{
> +	struct hl_vm *vm = container_of(ref, struct hl_vm,
> +			dram_pg_pool_refcount);
> +
> +	/*
> +	 * free the idr here as only here we know for sure that there are no
> +	 * allocated physical pages and hence there are no handles in use
> +	 */
> +	idr_destroy(&vm->phys_pg_pack_handles);
> +	gen_pool_destroy(vm->dram_pg_pool);
> +}
> +
> +/*
> + * free_phys_pg_pack   - free physical page pack
> + *
> + * @hdev               : habanalabs device structure
> + * @phys_pg_pack       : physical page pack to free
> + *
> + * This function does the following:
> + * - For DRAM memory only, iterate over the pack and free each physical block
> + *   structure by returning it to the general pool
> + * - Free the hl_vm_phys_pg_pack structure
> + */
> +static void free_phys_pg_pack(struct hl_device *hdev,
> +		struct hl_vm_phys_pg_pack *phys_pg_pack)
> +{
> +	struct hl_vm *vm = &hdev->vm;
> +	int i;
> +
> +	if (!phys_pg_pack->created_from_userptr) {
> +		if (phys_pg_pack->contiguous) {
> +			gen_pool_free(vm->dram_pg_pool, phys_pg_pack->pages[0],
> +					phys_pg_pack->total_size);
> +
> +			for (i = 0; i < phys_pg_pack->npages ; i++)
> +				kref_put(&vm->dram_pg_pool_refcount,
> +					dram_pg_pool_do_release);
> +		} else {
> +			for (i = 0 ; i < phys_pg_pack->npages ; i++) {
> +				gen_pool_free(vm->dram_pg_pool,
> +						phys_pg_pack->pages[i],
> +						phys_pg_pack->page_size);
> +				kref_put(&vm->dram_pg_pool_refcount,
> +					dram_pg_pool_do_release);
> +			}
> +		}
> +	}
> +
> +	kfree(phys_pg_pack->pages);
> +	kfree(phys_pg_pack);
> +}
> +
> +/*
> + * free_device_memory - free device memory
> + *
> + * @ctx                  : current context
> + * @handle              : handle of the memory chunk to free
> + *
> + * This function does the following:
> + * - Free the device memory related to the given handle
> + */
> +static int free_device_memory(struct hl_ctx *ctx, u32 handle)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg_pack *phys_pg_pack;
> +
> +	spin_lock(&vm->idr_lock);
> +	phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, handle);
> +	if (phys_pg_pack) {
> +		if (atomic_read(&phys_pg_pack->mapping_cnt) > 0) {
> +			dev_err(hdev->dev, "handle %u is mapped, cannot free\n",
> +				handle);
> +			spin_unlock(&vm->idr_lock);
> +			return -EINVAL;
> +		}
> +
> +		/*
> +		 * must remove from idr before the freeing of the physical
> +		 * pages as the refcount of the pool is also the trigger of the
> +		 * idr destroy
> +		 */
> +		idr_remove(&vm->phys_pg_pack_handles, handle);
> +		spin_unlock(&vm->idr_lock);
> +
> +		atomic64_sub(phys_pg_pack->total_size, &ctx->dram_phys_mem);
> +		atomic64_sub(phys_pg_pack->total_size, &hdev->dram_used_mem);
> +
> +		free_phys_pg_pack(hdev, phys_pg_pack);
> +	} else {
> +		spin_unlock(&vm->idr_lock);
> +		dev_err(hdev->dev,
> +			"free device memory failed, no match for handle %u\n",
> +			handle);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * clear_va_list_locked - free virtual addresses list
> + *
> + * @hdev                : habanalabs device structure
> + * @va_list             : list of virtual addresses to free
> + *
> + * This function does the following:
> + * - Iterate over the list and free each virtual addresses block
> + *
> + * This function should be called only when va_list lock is taken
> + */
> +static void clear_va_list_locked(struct hl_device *hdev,
> +		struct list_head *va_list)
> +{
> +	struct hl_vm_va_block *va_block, *tmp;
> +
> +	list_for_each_entry_safe(va_block, tmp, va_list, node) {
> +		list_del(&va_block->node);
> +		kfree(va_block);
> +	}
> +}
> +
> +/*
> + * print_va_list_locked    - print virtual addresses list
> + *
> + * @hdev                : habanalabs device structure
> + * @va_list             : list of virtual addresses to print
> + *
> + * This function does the following:
> + * - Iterate over the list and print each virtual addresses block
> + *
> + * This function should be called only when va_list lock is taken
> + */
> +static void print_va_list_locked(struct hl_device *hdev,
> +		struct list_head *va_list)
> +{
> +#if HL_MMU_DEBUG
> +	struct hl_vm_va_block *va_block;
> +
> +	dev_dbg(hdev->dev, "print va list:\n");
> +
> +	list_for_each_entry(va_block, va_list, node)
> +		dev_dbg(hdev->dev,
> +			"va block, start: 0x%llx, end: 0x%llx, size: %llu\n",
> +			va_block->start, va_block->end, va_block->size);
> +#endif
> +}
> +
> +/*
> + * merge_va_blocks_locked - merge a virtual block if possible
> + *
> + * @hdev                : pointer to the habanalabs device structure
> + * @va_list             : pointer to the virtual addresses block list
> + * @va_block            : virtual block to merge with adjacent blocks
> + *
> + * This function does the following:
> + * - Merge the given blocks with the adjacent blocks if their virtual ranges
> + *   create a contiguous virtual range
> + *
> + * This Function should be called only when va_list lock is taken
> + */
> +static void merge_va_blocks_locked(struct hl_device *hdev,
> +		struct list_head *va_list, struct hl_vm_va_block *va_block)
> +{
> +	struct hl_vm_va_block *prev, *next;
> +
> +	prev = list_prev_entry(va_block, node);
> +	if (&prev->node != va_list && prev->end + 1 == va_block->start) {
> +		prev->end = va_block->end;
> +		prev->size = prev->end - prev->start;
> +		list_del(&va_block->node);
> +		kfree(va_block);
> +		va_block = prev;
> +	}
> +
> +	next = list_next_entry(va_block, node);
> +	if (&next->node != va_list && va_block->end + 1 == next->start) {
> +		next->start = va_block->start;
> +		next->size = next->end - next->start;
> +		list_del(&va_block->node);
> +		kfree(va_block);
> +	}
> +}
> +
> +/*
> + * add_va_block_locked - add a virtual block to the virtual addresses list
> + *
> + * @hdev                : pointer to the habanalabs device structure
> + * @va_list             : pointer to the virtual addresses block list
> + * @start               : start virtual address
> + * @end                 : end virtual address
> + *
> + * This function does the following:
> + * - Add the given block to the virtual blocks list and merge with other
> + * blocks if a contiguous virtual block can be created
> + *
> + * This Function should be called only when va_list lock is taken
> + */
> +static int add_va_block_locked(struct hl_device *hdev,
> +		struct list_head *va_list, u64 start, u64 end)
> +{
> +	struct hl_vm_va_block *va_block, *res = NULL;
> +	u64 size = end - start;
> +
> +	print_va_list_locked(hdev, va_list);
> +
> +	list_for_each_entry(va_block, va_list, node) {
> +		/* TODO: remove upon matureness */
> +		if (hl_mem_area_crosses_range(start, size, va_block->start,
> +				va_block->end)) {
> +			dev_err(hdev->dev,
> +				"block crossing ranges at start 0x%llx, end 0x%llx\n",
> +				va_block->start, va_block->end);
> +			return -EINVAL;
> +		}
> +
> +		if (va_block->end < start)
> +			res = va_block;
> +	}
> +
> +	va_block = kmalloc(sizeof(*va_block), GFP_KERNEL);
> +	if (!va_block)
> +		return -ENOMEM;
> +
> +	va_block->start = start;
> +	va_block->end = end;
> +	va_block->size = size;
> +
> +	if (!res)
> +		list_add(&va_block->node, va_list);
> +	else
> +		list_add(&va_block->node, &res->node);
> +
> +	merge_va_blocks_locked(hdev, va_list, va_block);
> +
> +	print_va_list_locked(hdev, va_list);
> +
> +	return 0;
> +}
> +
> +/*
> + * add_va_block - wrapper for add_va_block_locked
> + *
> + * @hdev                : pointer to the habanalabs device structure
> + * @va_list             : pointer to the virtual addresses block list
> + * @start               : start virtual address
> + * @end                 : end virtual address
> + *
> + * This function does the following:
> + * - Takes the list lock and calls add_va_block_locked
> + */
> +static inline int add_va_block(struct hl_device *hdev,
> +		struct hl_va_range *va_range, u64 start, u64 end)
> +{
> +	int rc;
> +
> +	mutex_lock(&va_range->lock);
> +	rc = add_va_block_locked(hdev, &va_range->list, start, end);
> +	mutex_unlock(&va_range->lock);
> +
> +	return rc;
> +}
> +
> +/*
> + * get_va_block - get a virtual block with the requested size
> + *
> + * @hdev            : pointer to the habanalabs device structure
> + * @va_range        : pointer to the virtual addresses range
> + * @size            : requested block size
> + * @hint_addr       : hint for request address by the user
> + * @is_userptr      : is host or DRAM memory
> + *
> + * This function does the following:
> + * - Iterate on the virtual block list to find a suitable virtual block for the
> + *   requested size
> + * - Reserve the requested block and update the list
> + * - Return the start address of the virtual block
> + */
> +static u64 get_va_block(struct hl_device *hdev,
> +		struct hl_va_range *va_range, u32 size, u64 hint_addr,
> +		bool is_userptr)
> +{
> +	struct hl_vm_va_block *va_block, *new_va_block = NULL;
> +	u64 valid_start, valid_size, prev_start, prev_end, page_mask,
> +		res_valid_start = 0, res_valid_size = 0;
> +	u32 page_size;
> +	bool add_prev = false;
> +
> +	if (is_userptr) {
> +		/*
> +		 * We cannot know if the user allocated memory with huge pages
> +		 * or not, hence we continue with the biggest possible
> +		 * granularity.
> +		 */
> +		page_size = HPAGE_SIZE;
> +		page_mask = HPAGE_MASK;
> +	} else {
> +		page_size = hdev->asic_prop.dram_page_size;
> +		page_mask = ~((u64)page_size - 1);
> +	}
> +
> +	mutex_lock(&va_range->lock);
> +
> +	print_va_list_locked(hdev, &va_range->list);
> +
> +	list_for_each_entry(va_block, &va_range->list, node) {
> +		/* calc the first possible aligned addr */
> +		valid_start = va_block->start;
> +
> +
> +		if (valid_start & (page_size - 1)) {
> +			valid_start &= page_mask;
> +			valid_start += page_size;
> +			if (valid_start > va_block->end)
> +				continue;
> +		}
> +
> +		valid_size = va_block->end - valid_start;
> +
> +		if (valid_size >= size &&
> +			(!new_va_block || valid_size < res_valid_size)) {
> +
> +			new_va_block = va_block;
> +			res_valid_start = valid_start;
> +			res_valid_size = valid_size;
> +		}
> +
> +		if (hint_addr && hint_addr >= valid_start &&
> +				((hint_addr + size) <= va_block->end)) {
> +			new_va_block = va_block;
> +			res_valid_start = hint_addr;
> +			res_valid_size = valid_size;
> +			break;
> +		}
> +	}
> +
> +	if (!new_va_block) {
> +		dev_err(hdev->dev, "no available va block for size %u\n", size);
> +		goto out;
> +	}
> +
> +	if (res_valid_start > new_va_block->start) {
> +		prev_start = new_va_block->start;
> +		prev_end = res_valid_start - 1;
> +
> +		new_va_block->start = res_valid_start;
> +		new_va_block->size = res_valid_size;
> +
> +		add_prev = true;
> +	}
> +
> +	if (new_va_block->size > size) {
> +		new_va_block->start += size;
> +		new_va_block->size = new_va_block->end - new_va_block->start;
> +	} else {
> +		list_del(&new_va_block->node);
> +		kfree(new_va_block);
> +	}
> +
> +	if (add_prev)
> +		add_va_block_locked(hdev, &va_range->list, prev_start,
> +				prev_end);
> +
> +	print_va_list_locked(hdev, &va_range->list);
> +out:
> +	mutex_unlock(&va_range->lock);
> +
> +	return res_valid_start;
> +}
> +
> +/*
> + * get_sg_info - get number of pages and the DMA address from SG list
> + *
> + * @sg                 : the SG list
> + * @dma_addr           : pointer to DMA address to return
> + *
> + * Calculate the number of consecutive pages described by the SG list. Take the
> + * offset of the address in the first page, add to it the length and round it up
> + * to the number of needed pages.
> + */
> +static u32 get_sg_info(struct scatterlist *sg, dma_addr_t *dma_addr)
> +{
> +	*dma_addr = sg_dma_address(sg);
> +
> +	return ((((*dma_addr) & (PAGE_SIZE - 1)) + sg_dma_len(sg)) +
> +			(PAGE_SIZE - 1)) >> PAGE_SHIFT;
> +}
> +
> +/*
> + * init_phys_pg_pack_from_userptr - initialize physical page pack from host
> + *                                   memory
> + *
> + * @ctx                : current context
> + * @userptr            : userptr to initialize from
> + * @pphys_pg_pack      : res pointer
> + *
> + * This function does the following:
> + * - Pin the physical pages related to the given virtual block
> + * - Create a physical page pack from the physical pages related to the given
> + *   virtual block
> + */
> +static int init_phys_pg_pack_from_userptr(struct hl_ctx *ctx,
> +		struct hl_userptr *userptr,
> +		struct hl_vm_phys_pg_pack **pphys_pg_pack)
> +{
> +	struct hl_vm_phys_pg_pack *phys_pg_pack;
> +	struct scatterlist *sg;
> +	dma_addr_t dma_addr;
> +	u64 page_mask;
> +	u32 npages, total_npages, page_size = PAGE_SIZE;
> +	bool first = true, is_huge_page_opt = true;
> +	int rc, i, j;
> +
> +	phys_pg_pack = kzalloc(sizeof(*phys_pg_pack), GFP_KERNEL);
> +	if (!phys_pg_pack)
> +		return -ENOMEM;
> +
> +	phys_pg_pack->vm_type = userptr->vm_type;
> +	phys_pg_pack->created_from_userptr = true;
> +	phys_pg_pack->asid = ctx->asid;
> +	atomic_set(&phys_pg_pack->mapping_cnt, 1);
> +
> +	/* Only if all dma_addrs are aligned to 2MB and their
> +	 * sizes is at least 2MB, we can use huge page mapping.
> +	 * We limit the 2MB optimization to this condition,
> +	 * since later on we acquire the related VA range as one
> +	 * consecutive block.
> +	 */
> +	total_npages = 0;
> +	for_each_sg(userptr->sgt->sgl, sg, userptr->sgt->nents, i) {
> +		npages = get_sg_info(sg, &dma_addr);
> +
> +		total_npages += npages;
> +
> +		if (first) {
> +			first = false;
> +			dma_addr &= HPAGE_MASK;
> +		}
> +
> +		if ((npages % PGS_IN_HPAGE) || (dma_addr & (HPAGE_SIZE - 1)))
> +			is_huge_page_opt = false;
> +	}
> +
> +	if (is_huge_page_opt) {
> +		page_size = HPAGE_SIZE;
> +		total_npages /= PGS_IN_HPAGE;
> +	}
> +
> +	page_mask = ~(((u64) page_size) - 1);
> +
> +	phys_pg_pack->pages = kcalloc(total_npages, sizeof(u64), GFP_KERNEL);
> +	if (!phys_pg_pack->pages) {
> +		rc = -ENOMEM;
> +		goto page_pack_arr_mem_err;
> +	}
> +
> +	phys_pg_pack->npages = total_npages;
> +	phys_pg_pack->page_size = page_size;
> +	phys_pg_pack->total_size = total_npages * page_size;
> +
> +	j = 0;
> +	first = true;
> +	for_each_sg(userptr->sgt->sgl, sg, userptr->sgt->nents, i) {
> +		npages = get_sg_info(sg, &dma_addr);
> +
> +		/* align down to physical page size and save the offset */
> +		if (first) {
> +			first = false;
> +			phys_pg_pack->offset = dma_addr & (page_size - 1);
> +			dma_addr &= page_mask;
> +		}
> +
> +		while (npages) {
> +			phys_pg_pack->pages[j++] = dma_addr;
> +			dma_addr += page_size;
> +
> +			if (is_huge_page_opt)
> +				npages -= PGS_IN_HPAGE;
> +			else
> +				npages--;
> +		}
> +	}
> +
> +	*pphys_pg_pack = phys_pg_pack;
> +
> +	return 0;
> +
> +page_pack_arr_mem_err:
> +	kfree(phys_pg_pack);
> +
> +	return rc;
> +}
> +
> +/*
> + * map_phys_page_pack - maps the physical page pack
> + *
> + * @ctx                : current context
> + * @vaddr              : start address of the virtual area to map from
> + * @phys_pg_pack       : the pack of physical pages to map to
> + *
> + * This function does the following:
> + * - Maps each chunk of virtual memory to matching physical chunk
> + * - Stores number of successful mappings in the given argument
> + * - Returns 0 on success, error code otherwise.
> + */
> +static int map_phys_page_pack(struct hl_ctx *ctx, u64 vaddr,
> +		struct hl_vm_phys_pg_pack *phys_pg_pack)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	u64 next_vaddr = vaddr, paddr;
> +	u32 page_size = phys_pg_pack->page_size;
> +	int i, rc = 0, mapped_pg_cnt = 0;
> +
> +	for (i = 0 ; i < phys_pg_pack->npages ; i++) {
> +		paddr = phys_pg_pack->pages[i];
> +
> +		/* For accessing the host we need to turn on bit 39 */
> +		if (phys_pg_pack->created_from_userptr)
> +			paddr += hdev->asic_prop.host_phys_base_address;
> +
> +		rc = hl_mmu_map(ctx, next_vaddr, paddr, page_size);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"map failed for handle %u, npages: %d, mapped: %d",
> +				phys_pg_pack->handle, phys_pg_pack->npages,
> +				mapped_pg_cnt);
> +			goto err;
> +		}
> +
> +		mapped_pg_cnt++;
> +		next_vaddr += page_size;
> +	}
> +
> +	return 0;
> +
> +err:
> +	next_vaddr = vaddr;
> +	for (i = 0 ; i < mapped_pg_cnt ; i++) {
> +		if (hl_mmu_unmap(ctx, next_vaddr, page_size))
> +			dev_warn_ratelimited(hdev->dev,
> +				"failed to unmap handle %u, va: 0x%llx, pa: 0x%llx, page size: %u\n",
> +					phys_pg_pack->handle, next_vaddr,
> +					phys_pg_pack->pages[i], page_size);
> +
> +		next_vaddr += page_size;
> +	}
> +
> +	return rc;
> +}
> +
> +static int get_paddr_from_handle(struct hl_ctx *ctx, struct hl_mem_in *args,
> +				u64 *paddr)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg_pack *phys_pg_pack;
> +	u32 handle;
> +
> +	handle = lower_32_bits(args->map_device.handle);
> +	spin_lock(&vm->idr_lock);
> +	phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, handle);
> +	if (!phys_pg_pack) {
> +		spin_unlock(&vm->idr_lock);
> +		dev_err(hdev->dev, "no match for handle %u\n", handle);
> +		return -EINVAL;
> +	}
> +
> +	*paddr = phys_pg_pack->pages[0];
> +
> +	spin_unlock(&vm->idr_lock);
> +
> +	return 0;
> +}
> +
> +/*
> + * map_device_va - map the given memory
> + *
> + * @ctx	         : current context
> + * @args         : host parameters with handle/host virtual address
> + * @device_addr	 : pointer to result device virtual address
> + *
> + * This function does the following:
> + * - If given a physical device memory handle, map to a device virtual block
> + *   and return the start address of this block
> + * - If given a host virtual address and size, find the related physical pages,
> + *   map a device virtual block to this pages and return the start address of
> + *   this block
> + */
> +static int map_device_va(struct hl_ctx *ctx, struct hl_mem_in *args,
> +		u64 *device_addr)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg_pack *phys_pg_pack;
> +	struct hl_userptr *userptr = NULL;
> +	struct hl_vm_hash_node *hnode;
> +	enum vm_type_t *vm_type;
> +	u64 ret_vaddr, hint_addr;
> +	u32 handle = 0;
> +	int rc;
> +	bool is_userptr = args->flags & HL_MEM_USERPTR;
> +
> +	/* Assume failure */
> +	*device_addr = 0;
> +
> +	if (is_userptr) {
> +		rc = get_userptr_from_host_va(hdev, args, &userptr);
> +		if (rc) {
> +			dev_err(hdev->dev, "failed to get userptr from va\n");
> +			return rc;
> +		}
> +
> +		rc = init_phys_pg_pack_from_userptr(ctx, userptr,
> +				&phys_pg_pack);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"unable to init page pack for vaddr 0x%llx\n",
> +				args->map_host.host_virt_addr);
> +			goto init_page_pack_err;
> +		}
> +
> +		vm_type = (enum vm_type_t *) userptr;
> +		hint_addr = args->map_host.hint_addr;
> +	} else {
> +		handle = lower_32_bits(args->map_device.handle);
> +
> +		spin_lock(&vm->idr_lock);
> +		phys_pg_pack = idr_find(&vm->phys_pg_pack_handles, handle);
> +		if (!phys_pg_pack) {
> +			spin_unlock(&vm->idr_lock);
> +			dev_err(hdev->dev,
> +				"no match for handle %u\n", handle);
> +			return -EINVAL;
> +		}
> +
> +		/* increment now to avoid freeing device memory while mapping */
> +		atomic_inc(&phys_pg_pack->mapping_cnt);
> +
> +		spin_unlock(&vm->idr_lock);
> +
> +		vm_type = (enum vm_type_t *) phys_pg_pack;
> +
> +		hint_addr = args->map_device.hint_addr;
> +	}
> +
> +	/*
> +	 * relevant for mapping device physical memory only, as host memory is
> +	 * implicitly shared
> +	 */
> +	if (!is_userptr && !(phys_pg_pack->flags & HL_MEM_SHARED) &&
> +			phys_pg_pack->asid != ctx->asid) {
> +		dev_err(hdev->dev,
> +			"Failed to map memory, handle %u is not shared\n",
> +			handle);
> +		rc = -EPERM;
> +		goto shared_err;
> +	}
> +
> +	hnode = kzalloc(sizeof(*hnode), GFP_KERNEL);
> +	if (!hnode) {
> +		rc = -ENOMEM;
> +		goto hnode_err;
> +	}
> +
> +	ret_vaddr = get_va_block(hdev,
> +			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
> +			phys_pg_pack->total_size, hint_addr, is_userptr);
> +	if (!ret_vaddr) {
> +		dev_err(hdev->dev, "no available va block for handle %u\n",
> +				handle);
> +		rc = -ENOMEM;
> +		goto va_block_err;
> +	}
> +
> +	mutex_lock(&ctx->mmu_lock);
> +
> +	rc = map_phys_page_pack(ctx, ret_vaddr, phys_pg_pack);
> +	if (rc) {
> +		mutex_unlock(&ctx->mmu_lock);
> +		dev_err(hdev->dev, "mapping page pack failed for handle %u\n",
> +				handle);
> +		goto map_err;
> +	}
> +
> +	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, false, ctx->asid,
> +			ret_vaddr, phys_pg_pack->total_size);
> +
> +	mutex_unlock(&ctx->mmu_lock);
> +
> +	ret_vaddr += phys_pg_pack->offset;
> +
> +	hnode->ptr = vm_type;
> +	hnode->vaddr = ret_vaddr;
> +
> +	mutex_lock(&ctx->mem_hash_lock);
> +	hash_add(ctx->mem_hash, &hnode->node, ret_vaddr);
> +	mutex_unlock(&ctx->mem_hash_lock);
> +
> +	*device_addr = ret_vaddr;
> +
> +	if (is_userptr)
> +		free_phys_pg_pack(hdev, phys_pg_pack);
> +
> +	return 0;
> +
> +map_err:
> +	if (add_va_block(hdev,
> +			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
> +			ret_vaddr,
> +			ret_vaddr + phys_pg_pack->total_size - 1))
> +		dev_warn(hdev->dev,
> +			"release va block failed for handle 0x%x, vaddr: 0x%llx\n",
> +				handle, ret_vaddr);
> +
> +va_block_err:
> +	kfree(hnode);
> +hnode_err:
> +shared_err:
> +	atomic_dec(&phys_pg_pack->mapping_cnt);
> +	if (is_userptr)
> +		free_phys_pg_pack(hdev, phys_pg_pack);
> +init_page_pack_err:
> +	if (is_userptr)
> +		free_userptr(hdev, userptr);
> +
> +	return rc;
> +}
> +
> +/*
> + * unmap_device_va      - unmap the given device virtual address
> + *
> + * @ctx                 : current context
> + * @vaddr               : device virtual address to unmap
> + *
> + * This function does the following:
> + * - Unmap the physical pages related to the given virtual address
> + * - return the device virtual block to the virtual block list
> + */
> +static int unmap_device_va(struct hl_ctx *ctx, u64 vaddr)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm_phys_pg_pack *phys_pg_pack = NULL;
> +	struct hl_vm_hash_node *hnode = NULL;
> +	struct hl_userptr *userptr = NULL;
> +	enum vm_type_t *vm_type;
> +	u64 next_vaddr;
> +	u32 page_size;
> +	bool is_userptr;
> +	int i, rc;
> +
> +	/* protect from double entrance */
> +	mutex_lock(&ctx->mem_hash_lock);
> +	hash_for_each_possible(ctx->mem_hash, hnode, node, (unsigned long)vaddr)
> +		if (vaddr == hnode->vaddr)
> +			break;
> +
> +	if (!hnode) {
> +		mutex_unlock(&ctx->mem_hash_lock);
> +		dev_err(hdev->dev,
> +			"unmap failed, no mem hnode for vaddr 0x%llx\n",
> +			vaddr);
> +		return -EINVAL;
> +	}
> +
> +	hash_del(&hnode->node);
> +	mutex_unlock(&ctx->mem_hash_lock);
> +
> +	vm_type = hnode->ptr;
> +
> +	if (*vm_type == VM_TYPE_USERPTR) {
> +		is_userptr = true;
> +		userptr = hnode->ptr;
> +		rc = init_phys_pg_pack_from_userptr(ctx, userptr,
> +				&phys_pg_pack);
> +		if (rc) {
> +			dev_err(hdev->dev,
> +				"unable to init page pack for vaddr 0x%llx\n",
> +				vaddr);
> +			goto vm_type_err;
> +		}
> +	} else if (*vm_type == VM_TYPE_PHYS_PACK) {
> +		is_userptr = false;
> +		phys_pg_pack = hnode->ptr;
> +	} else {
> +		dev_warn(hdev->dev,
> +			"unmap failed, unknown vm desc for vaddr 0x%llx\n",
> +				vaddr);
> +		rc = -EFAULT;
> +		goto vm_type_err;
> +	}
> +
> +	if (atomic_read(&phys_pg_pack->mapping_cnt) == 0) {
> +		dev_err(hdev->dev, "vaddr 0x%llx is not mapped\n", vaddr);
> +		rc = -EINVAL;
> +		goto mapping_cnt_err;
> +	}
> +
> +	page_size = phys_pg_pack->page_size;
> +	vaddr &= ~(((u64) page_size) - 1);
> +
> +	next_vaddr = vaddr;
> +
> +	mutex_lock(&ctx->mmu_lock);
> +
> +	for (i = 0 ; i < phys_pg_pack->npages ; i++, next_vaddr += page_size)
> +		if (hl_mmu_unmap(ctx, next_vaddr, page_size))
> +			dev_warn_ratelimited(hdev->dev,
> +				"unmap failed for vaddr: 0x%llx\n", next_vaddr);
> +
> +	hdev->asic_funcs->mmu_invalidate_cache_range(hdev, true, ctx->asid,
> +			vaddr, phys_pg_pack->total_size);
> +
> +	mutex_unlock(&ctx->mmu_lock);
> +
> +	if (add_va_block(hdev,
> +			is_userptr ? &ctx->host_va_range : &ctx->dram_va_range,
> +			vaddr,
> +			vaddr + phys_pg_pack->total_size - 1))
> +		dev_warn(hdev->dev, "add va block failed for vaddr: 0x%llx\n",
> +				vaddr);
> +
> +	atomic_dec(&phys_pg_pack->mapping_cnt);
> +	kfree(hnode);
> +
> +	if (is_userptr) {
> +		free_phys_pg_pack(hdev, phys_pg_pack);
> +		free_userptr(hdev, userptr);
> +	}
> +
> +	return 0;
> +
> +mapping_cnt_err:
> +	if (is_userptr)
> +		free_phys_pg_pack(hdev, phys_pg_pack);
> +vm_type_err:
> +	mutex_lock(&ctx->mem_hash_lock);
> +	hash_add(ctx->mem_hash, &hnode->node, vaddr);
> +	mutex_unlock(&ctx->mem_hash_lock);
> +
> +	return rc;
> +}
> +
> +int hl_mem_ioctl(struct hl_fpriv *hpriv, void *data)
> +{
> +	union hl_mem_args *args = data;
> +	struct hl_device *hdev = hpriv->hdev;
> +	struct hl_ctx *ctx = hpriv->ctx;
> +	u64 device_addr = 0;
> +	u32 handle = 0;
> +	int rc;
> +
> +	if (hl_device_disabled_or_in_reset(hdev)) {
> +		dev_warn_ratelimited(hdev->dev,
> +			"Device is disabled or in reset. Can't execute memory IOCTL\n");
> +		return -EBUSY;
> +	}
> +
> +	if (hdev->mmu_enable) {
> +		switch (args->in.op) {
> +		case HL_MEM_OP_ALLOC:
> +			if (!hdev->dram_supports_virtual_memory) {
> +				dev_err(hdev->dev,
> +					"DRAM alloc is not supported\n");
> +				rc = -EINVAL;
> +				goto out;
> +			}
> +			if (args->in.alloc.mem_size == 0) {
> +				dev_err(hdev->dev,
> +					"alloc size must be larger than 0\n");
> +				rc = -EINVAL;
> +				goto out;
> +			}
> +			rc = alloc_device_memory(ctx, &args->in, &handle);
> +
> +			memset(args, 0, sizeof(*args));
> +			args->out.handle = (__u64) handle;
> +			break;
> +
> +		case HL_MEM_OP_FREE:
> +			if (!hdev->dram_supports_virtual_memory) {
> +				dev_err(hdev->dev,
> +					"DRAM free is not supported\n");
> +				rc = -EINVAL;
> +				goto out;
> +			}
> +			rc = free_device_memory(ctx, args->in.free.handle);
> +			break;
> +
> +		case HL_MEM_OP_MAP:
> +			rc = map_device_va(ctx, &args->in, &device_addr);
> +
> +			memset(args, 0, sizeof(*args));
> +			args->out.device_virt_addr = device_addr;
> +			break;
> +
> +		case HL_MEM_OP_UNMAP:
> +			rc = unmap_device_va(ctx,
> +					args->in.unmap.device_virt_addr);
> +			break;
> +
> +		default:
> +			dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
> +			rc = -ENOTTY;
> +			break;
> +		}
> +	} else {
> +		switch (args->in.op) {
> +		case HL_MEM_OP_ALLOC:
> +			if (args->in.alloc.mem_size == 0) {
> +				dev_err(hdev->dev,
> +					"alloc size must be larger than 0\n");
> +				rc = -EINVAL;
> +				goto out;
> +			}
> +
> +			/* Force contiguous as there are no real MMU
> +			 * translations to overcome physical memory gaps
> +			 */
> +			args->in.flags |= HL_MEM_CONTIGUOUS;
> +			rc = alloc_device_memory(ctx, &args->in, &handle);
> +
> +			memset(args, 0, sizeof(*args));
> +			args->out.handle = (__u64) handle;
> +			break;
> +
> +		case HL_MEM_OP_FREE:
> +			rc = free_device_memory(ctx, args->in.free.handle);
> +			break;
> +
> +		case HL_MEM_OP_MAP:
> +			if (args->in.flags & HL_MEM_USERPTR) {
> +				device_addr = args->in.map_host.host_virt_addr;
> +				rc = 0;
> +			} else {
> +				rc = get_paddr_from_handle(ctx, &args->in,
> +						&device_addr);
> +			}
> +
> +			memset(args, 0, sizeof(*args));
> +			args->out.device_virt_addr = device_addr;
> +			break;
> +
> +		case HL_MEM_OP_UNMAP:
> +			rc = 0;
> +			break;
> +
> +		default:
> +			dev_err(hdev->dev, "Unknown opcode for memory IOCTL\n");
> +			rc = -ENOTTY;
> +			break;
> +		}
> +	}
> +
> +out:
> +	return rc;
> +}
> +
>  /*
>   * hl_pin_host_memory - pins a chunk of host memory
>   *
> @@ -197,3 +1383,332 @@ bool hl_userptr_is_pinned(struct hl_device *hdev, u64 addr,
>  
>  	return false;
>  }
> +
> +/*
> + * hl_va_range_init - initialize virtual addresses range
> + *
> + * @hdev                : pointer to the habanalabs device structure
> + * @va_range            : pointer to the range to initialize
> + * @start               : range start address
> + * @end                 : range end address
> + *
> + * This function does the following:
> + * - Initializes the virtual addresses list of the given range with the given
> + *   addresses.
> + */
> +static int hl_va_range_init(struct hl_device *hdev,
> +		struct hl_va_range *va_range, u64 start, u64 end)
> +{
> +	int rc;
> +
> +	INIT_LIST_HEAD(&va_range->list);
> +
> +	/* PAGE_SIZE alignment */
> +
> +	if (start & (PAGE_SIZE - 1)) {
> +		start &= PAGE_MASK;
> +		start += PAGE_SIZE;
> +	}
> +
> +	if (end & (PAGE_SIZE - 1))
> +		end &= PAGE_MASK;
> +
> +	if (start >= end) {
> +		dev_err(hdev->dev, "too small vm range for va list\n");
> +		return -EFAULT;
> +	}
> +
> +	rc = add_va_block(hdev, va_range, start, end);
> +
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to init host va list\n");
> +		return rc;
> +	}
> +
> +	va_range->start_addr = start;
> +	va_range->end_addr = end;
> +
> +	return 0;
> +}
> +
> +/*
> + * hl_vm_ctx_init_with_ranges - initialize virtual memory for context
> + *
> + * @ctx                 : pointer to the habanalabs context structure
> + * @host_range_start    : host virtual addresses range start
> + * @host_range_end      : host virtual addresses range end
> + * @dram_range_start    : dram virtual addresses range start
> + * @dram_range_end      : dram virtual addresses range end
> + *
> + * This function initializes the following:
> + * - MMU for context
> + * - Virtual address to area descriptor hashtable
> + * - Virtual block list of available virtual memory
> + */
> +int hl_vm_ctx_init_with_ranges(struct hl_ctx *ctx, u64 host_range_start,
> +				u64 host_range_end, u64 dram_range_start,
> +				u64 dram_range_end)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	int rc;
> +
> +	hl_mmu_ctx_init(ctx);
> +
> +	mutex_init(&ctx->mem_hash_lock);
> +	hash_init(ctx->mem_hash);
> +
> +	mutex_init(&ctx->host_va_range.lock);
> +
> +	rc = hl_va_range_init(hdev, &ctx->host_va_range, host_range_start,
> +			host_range_end);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to init host vm range\n");
> +		goto host_vm_err;
> +	}
> +
> +	mutex_init(&ctx->dram_va_range.lock);
> +
> +	rc = hl_va_range_init(hdev, &ctx->dram_va_range, dram_range_start,
> +			dram_range_end);
> +	if (rc) {
> +		dev_err(hdev->dev, "failed to init dram vm range\n");
> +		goto dram_vm_err;
> +	}
> +
> +	return 0;
> +
> +dram_vm_err:
> +	mutex_destroy(&ctx->dram_va_range.lock);
> +
> +	mutex_lock(&ctx->host_va_range.lock);
> +	clear_va_list_locked(hdev, &ctx->host_va_range.list);
> +	mutex_unlock(&ctx->host_va_range.lock);
> +host_vm_err:
> +	mutex_destroy(&ctx->host_va_range.lock);
> +	mutex_destroy(&ctx->mem_hash_lock);
> +	hl_mmu_ctx_fini(ctx);
> +
> +	return rc;
> +}
> +
> +int hl_vm_ctx_init(struct hl_ctx *ctx)
> +{
> +	struct asic_fixed_properties *prop = &ctx->hdev->asic_prop;
> +	u64 host_range_start, host_range_end, dram_range_start,
> +		dram_range_end;
> +
> +	atomic64_set(&ctx->dram_phys_mem, 0);
> +
> +	/*
> +	 * - If MMU is enabled, init the ranges as usual.
> +	 * - If MMU is disabled, in case of host mapping, the returned address
> +	 *   is the given one.
> +	 *   In case of DRAM mapping, the returned address is the physical
> +	 *   address of the memory related to the given handle.
> +	 */
> +	if (ctx->hdev->mmu_enable) {
> +		dram_range_start = prop->va_space_dram_start_address;
> +		dram_range_end = prop->va_space_dram_end_address;
> +		host_range_start = prop->va_space_host_start_address;
> +		host_range_end = prop->va_space_host_end_address;
> +	} else {
> +		dram_range_start = prop->dram_user_base_address;
> +		dram_range_end = prop->dram_end_address;
> +		host_range_start = prop->dram_user_base_address;
> +		host_range_end = prop->dram_end_address;
> +	}
> +
> +	return hl_vm_ctx_init_with_ranges(ctx, host_range_start, host_range_end,
> +			dram_range_start, dram_range_end);
> +}
> +
> +/*
> + * hl_va_range_fini     - clear a virtual addresses range
> + *
> + * @hdev                : pointer to the habanalabs structure
> + * va_range             : pointer to virtual addresses range
> + *
> + * This function initializes the following:
> + * - Checks that the given range contains the whole initial range
> + * - Frees the virtual addresses block list and its lock
> + */
> +static void hl_va_range_fini(struct hl_device *hdev,
> +		struct hl_va_range *va_range)
> +{
> +	struct hl_vm_va_block *va_block;
> +
> +	if (list_empty(&va_range->list)) {
> +		dev_warn(hdev->dev,
> +				"va list should not be empty on cleanup!\n");
> +		goto out;
> +	}
> +
> +	if (!list_is_singular(&va_range->list)) {
> +		dev_warn(hdev->dev,
> +			"va list should not contain multiple blocks on cleanup!\n");
> +		goto free_va_list;
> +	}
> +
> +	va_block = list_first_entry(&va_range->list, typeof(*va_block), node);
> +
> +	if (va_block->start != va_range->start_addr ||
> +		va_block->end != va_range->end_addr) {
> +		dev_warn(hdev->dev,
> +			"wrong va block on cleanup, from 0x%llx to 0x%llx\n",
> +				va_block->start, va_block->end);
> +		goto free_va_list;
> +	}
> +
> +free_va_list:
> +	mutex_lock(&va_range->lock);
> +	clear_va_list_locked(hdev, &va_range->list);
> +	mutex_unlock(&va_range->lock);
> +
> +out:
> +	mutex_destroy(&va_range->lock);
> +}
> +
> +/*
> + * hl_vm_ctx_fini       - virtual memory teardown of context
> + *
> + * @ctx                 : pointer to the habanalabs context structure
> + *
> + * This function perform teardown the following:
> + * - Virtual block list of available virtual memory
> + * - Virtual address to area descriptor hashtable
> + * - MMU for context
> + *
> + * In addition this function does the following:
> + * - Unmaps the existing hashtable nodes if the hashtable is not empty. The
> + *   hashtable should be empty as no valid mappings should exist at this
> + *   point.
> + * - Frees any existing physical page list from the idr which relates to the
> + *   current context asid.
> + * - This function checks the virtual block list for correctness. At this point
> + *   the list should contain one element which describes the whole virtual
> + *   memory range of the context. Otherwise, a warning is printed.
> + */
> +void hl_vm_ctx_fini(struct hl_ctx *ctx)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct hl_vm *vm = &hdev->vm;
> +	struct hl_vm_phys_pg_pack *phys_pg_list;
> +	struct hl_vm_hash_node *hnode;
> +	struct hlist_node *tmp_node;
> +	int i;
> +
> +	if (!hash_empty(ctx->mem_hash))
> +		dev_notice(hdev->dev, "ctx is freed while it has va in use\n");
> +
> +	hash_for_each_safe(ctx->mem_hash, i, tmp_node, hnode, node) {
> +		dev_dbg(hdev->dev,
> +			"hl_mem_hash_node of vaddr 0x%llx of asid %d is still alive\n",
> +			hnode->vaddr, ctx->asid);
> +		unmap_device_va(ctx, hnode->vaddr);
> +	}
> +
> +	spin_lock(&vm->idr_lock);
> +	idr_for_each_entry(&vm->phys_pg_pack_handles, phys_pg_list, i)
> +		if (phys_pg_list->asid == ctx->asid) {
> +			dev_dbg(hdev->dev,
> +				"page list 0x%p of asid %d is still alive\n",
> +				phys_pg_list, ctx->asid);
> +			free_phys_pg_pack(hdev, phys_pg_list);
> +			idr_remove(&vm->phys_pg_pack_handles, i);
> +		}
> +	spin_unlock(&vm->idr_lock);
> +
> +	hl_va_range_fini(hdev, &ctx->dram_va_range);
> +	hl_va_range_fini(hdev, &ctx->host_va_range);
> +
> +	mutex_destroy(&ctx->mem_hash_lock);
> +	hl_mmu_ctx_fini(ctx);
> +}
> +
> +/*
> + * hl_vm_init           - initialize virtual memory module
> + *
> + * @hdev                : pointer to the habanalabs device structure
> + *
> + * This function initializes the following:
> + * - MMU module
> + * - DRAM physical pages pool of 2MB
> + * - Idr for device memory allocation handles
> + */
> +int hl_vm_init(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	struct hl_vm *vm = &hdev->vm;
> +	int rc;
> +
> +	rc = hl_mmu_init(hdev);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to init MMU\n");
> +		return rc;
> +	}
> +
> +	vm->dram_pg_pool = gen_pool_create(__ffs(prop->dram_page_size), -1);
> +	if (!vm->dram_pg_pool) {
> +		dev_err(hdev->dev, "Failed to create dram page pool\n");
> +		rc = -ENOMEM;
> +		goto pool_create_err;
> +	}
> +
> +	kref_init(&vm->dram_pg_pool_refcount);
> +
> +	rc = gen_pool_add(vm->dram_pg_pool, prop->dram_user_base_address,
> +			prop->dram_end_address - prop->dram_user_base_address,
> +			-1);
> +
> +	if (rc) {
> +		dev_err(hdev->dev,
> +			"Failed to add memory to dram page pool %d\n", rc);
> +		goto pool_add_err;
> +	}
> +
> +	spin_lock_init(&vm->idr_lock);
> +	idr_init(&vm->phys_pg_pack_handles);
> +
> +	atomic64_set(&hdev->dram_used_mem, 0);
> +
> +	vm->init_done = true;
> +
> +	return 0;
> +
> +pool_add_err:
> +	gen_pool_destroy(vm->dram_pg_pool);
> +pool_create_err:
> +	hl_mmu_fini(hdev);
> +
> +	return rc;
> +}
> +
> +/*
> + * hl_vm_fini           - virtual memory module teardown
> + *
> + * @hdev                : pointer to the habanalabs device structure
> + *
> + * This function perform teardown to the following:
> + * - Idr for device memory allocation handles
> + * - DRAM physical pages pool of 2MB
> + * - MMU module
> + */
> +void hl_vm_fini(struct hl_device *hdev)
> +{
> +	struct hl_vm *vm = &hdev->vm;
> +
> +	if (!vm->init_done)
> +		return;
> +
> +	/*
> +	 * At this point all the contexts should be freed and hence no DRAM
> +	 * memory should be in use. Hence the DRAM pool should be freed here.
> +	 */
> +	if (kref_put(&vm->dram_pg_pool_refcount, dram_pg_pool_do_release) != 1)
> +		dev_warn(hdev->dev, "dram_pg_pool was not destroyed on %s\n",
> +				__func__);
> +
> +	hl_mmu_fini(hdev);
> +
> +	vm->init_done = false;
> +}
> diff --git a/drivers/misc/habanalabs/mmu.c b/drivers/misc/habanalabs/mmu.c
> new file mode 100644
> index 000000000000..e6fa9d81933b
> --- /dev/null
> +++ b/drivers/misc/habanalabs/mmu.c
> @@ -0,0 +1,690 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2016-2018 HabanaLabs, Ltd.
> + * All Rights Reserved.
> + */
> +
> +#include "habanalabs.h"
> +#include "include/hw_ip/mmu/mmu_general.h"
> +
> +#include <linux/genalloc.h>
> +
> +static struct pgt_info *get_pgt_info(struct hl_ctx *ctx, u64 addr)
> +{
> +	struct pgt_info *pgt_info = NULL;
> +
> +	hash_for_each_possible(ctx->mmu_hash, pgt_info, node,
> +				(unsigned long) addr)
> +		if (addr == pgt_info->addr)
> +			break;
> +
> +	return pgt_info;
> +}
> +
> +static void free_hop(struct hl_ctx *ctx, u64 hop_addr)
> +{
> +	struct pgt_info *pgt_info = get_pgt_info(ctx, hop_addr);
> +
> +	gen_pool_free(pgt_info->ctx->hdev->mmu_pgt_pool, pgt_info->addr,
> +			ctx->hdev->asic_prop.mmu_hop_table_size);
> +	hash_del(&pgt_info->node);
> +
> +	kfree(pgt_info);
> +}
> +
> +static u64 alloc_hop(struct hl_ctx *ctx)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	struct pgt_info *pgt_info;
> +	u64 addr;
> +
> +	pgt_info = kmalloc(sizeof(*pgt_info), GFP_KERNEL);
> +	if (!pgt_info)
> +		return ULLONG_MAX;
> +
> +	addr = (u64) gen_pool_alloc(hdev->mmu_pgt_pool,
> +			hdev->asic_prop.mmu_hop_table_size);
> +	if (!addr) {
> +		dev_err(hdev->dev, "failed to allocate page\n");
> +		kfree(pgt_info);
> +		return ULLONG_MAX;
> +	}
> +
> +	pgt_info->addr = addr;
> +	pgt_info->ctx = ctx;
> +	pgt_info->num_of_ptes = 0;
> +	hash_add(ctx->mmu_hash, &pgt_info->node, addr);
> +
> +	return addr;
> +}
> +
> +static inline void clear_pte(struct hl_device *hdev, u64 pte_addr)
> +{
> +	/* clear the last and present bits */
> +	hdev->asic_funcs->write_pte(hdev, pte_addr, 0);
> +}
> +
> +static inline void get_pte(struct hl_ctx *ctx, u64 hop_addr)
> +{
> +	get_pgt_info(ctx, hop_addr)->num_of_ptes++;
> +}
> +
> +/*
> + * put_pte - decrement the num of ptes and free the hop if possible
> + *
> + * @ctx: pointer to the context structure
> + * @hop_addr: addr of the hop
> + *
> + * This function returns the number of ptes left on this hop. If the number is
> + * 0, it means the pte was freed.
> + */
> +static inline int put_pte(struct hl_ctx *ctx, u64 hop_addr)
> +{
> +	struct pgt_info *pgt_info = get_pgt_info(ctx, hop_addr);
> +	int num_of_ptes_left;
> +
> +	pgt_info->num_of_ptes--;
> +
> +	/*
> +	 * Need to save the number of ptes left because free_hop might free
> +	 * the pgt_info
> +	 */
> +	num_of_ptes_left = pgt_info->num_of_ptes;
> +	if (!num_of_ptes_left)
> +		free_hop(ctx, hop_addr);
> +
> +	return num_of_ptes_left;
> +}
> +
> +static inline u64 get_hop0_addr(struct hl_ctx *ctx)
> +{
> +	return ctx->hdev->asic_prop.mmu_pgt_addr +
> +			(ctx->asid * ctx->hdev->asic_prop.mmu_hop_table_size);
> +}
> +
> +static inline u64 get_hopN_pte_addr(struct hl_ctx *ctx, u64 hop_addr,
> +					u64 virt_addr, u64 mask, u64 shift)
> +{
> +	return hop_addr + ctx->hdev->asic_prop.mmu_pte_size *
> +			((virt_addr & mask) >> shift);
> +}
> +
> +static inline u64 get_hop0_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
> +{
> +	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP0_MASK, HOP0_SHIFT);
> +}
> +
> +static inline u64 get_hop1_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
> +{
> +	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP1_MASK, HOP1_SHIFT);
> +}
> +
> +static inline u64 get_hop2_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
> +{
> +	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP2_MASK, HOP2_SHIFT);
> +}
> +
> +static inline u64 get_hop3_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
> +{
> +	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP3_MASK, HOP3_SHIFT);
> +}
> +
> +static inline u64 get_hop4_pte_addr(struct hl_ctx *ctx, u64 hop_addr, u64 vaddr)
> +{
> +	return get_hopN_pte_addr(ctx, hop_addr, vaddr, HOP4_MASK, HOP4_SHIFT);
> +}
> +
> +static inline u64 get_next_hop_addr(u64 curr_pte)
> +{
> +	if (curr_pte & PAGE_PRESENT_MASK)
> +		return curr_pte & PHYS_ADDR_MASK;
> +	else
> +		return ULLONG_MAX;
> +}
> +
> +static inline u64 get_alloc_next_hop_addr(struct hl_ctx *ctx, u64 curr_pte,
> +						bool *is_new_hop)
> +{
> +	u64 hop_addr = get_next_hop_addr(curr_pte);
> +
> +	if (hop_addr == ULLONG_MAX) {
> +		hop_addr = alloc_hop(ctx);
> +		*is_new_hop = true;
> +	}
> +
> +	return hop_addr;
> +}
> +
> +/*
> + * hl_mmu_init - init the mmu module
> + *
> + * @hdev: pointer to the habanalabs device structure
> + *
> + * This function does the following:
> + * - Allocate max_asid zeroed hop0 pgts so no mapping is available
> + * - Enable mmu in hw
> + * - Invalidate the mmu cache
> + * - Create a pool of pages for pgts
> + * - Returns 0 on success
> + *
> + * This function depends on DMA QMAN to be working!
> + */
> +int hl_mmu_init(struct hl_device *hdev)
> +{
> +	struct asic_fixed_properties *prop = &hdev->asic_prop;
> +	int rc;
> +
> +	if (!hdev->mmu_enable)
> +		return 0;
> +
> +	/* MMU HW init was already done in device hw_init() */
> +
> +	mutex_init(&hdev->mmu_cache_lock);
> +
> +	hdev->mmu_pgt_pool =
> +			gen_pool_create(__ffs(prop->mmu_hop_table_size), -1);
> +
> +	if (!hdev->mmu_pgt_pool) {
> +		dev_err(hdev->dev, "Failed to create page gen pool\n");
> +		rc = -ENOMEM;
> +		goto err_pool_create;
> +	}
> +
> +	rc = gen_pool_add(hdev->mmu_pgt_pool, prop->mmu_pgt_addr +
> +			prop->mmu_hop0_tables_total_size,
> +			prop->mmu_pgt_size - prop->mmu_hop0_tables_total_size,
> +			-1);
> +	if (rc) {
> +		dev_err(hdev->dev, "Failed to add memory to page gen pool\n");
> +		goto err_pool_add;
> +	}
> +
> +	return 0;
> +
> +err_pool_add:
> +	gen_pool_destroy(hdev->mmu_pgt_pool);
> +err_pool_create:
> +	mutex_destroy(&hdev->mmu_cache_lock);
> +
> +	return rc;
> +}
> +
> +/*
> + * hl_mmu_fini - release the mmu module.
> + *
> + * @hdev: pointer to the habanalabs device structure
> + *
> + * This function does the following:
> + * - Disable mmu in hw
> + * - free the pgts pool
> + *
> + * All ctxs should be freed before calling this func
> + */
> +void hl_mmu_fini(struct hl_device *hdev)
> +{
> +	if (!hdev->mmu_enable)
> +		return;
> +
> +	gen_pool_destroy(hdev->mmu_pgt_pool);
> +
> +	mutex_destroy(&hdev->mmu_cache_lock);
> +
> +	/* MMU HW fini will be done in device hw_fini() */
> +}
> +
> +/*
> + * hl_mmu_ctx_init - init a ctx for using the mmu module
> + *
> + * @ctx: pointer to the context structure
> + *
> + * This function does the following:
> + * - Init a mutex to protect the concurrent mapping flow
> + * - Init a hash to hold all pgts related to this ctx
> + */
> +void hl_mmu_ctx_init(struct hl_ctx *ctx)
> +{
> +	if (!ctx->hdev->mmu_enable)
> +		return;
> +
> +	mutex_init(&ctx->mmu_lock);
> +	hash_init(ctx->mmu_hash);
> +}
> +
> +/*
> + * hl_mmu_ctx_fini - disable a ctx from using the mmu module
> + *
> + * @ctx: pointer to the context structure
> + *
> + * This function does the following:
> + * - Free any pgts which were not freed yet
> + * - Free the mutex
> + */
> +void hl_mmu_ctx_fini(struct hl_ctx *ctx)
> +{
> +	struct pgt_info *pgt_info;
> +	struct hlist_node *tmp;
> +	int i;
> +
> +	if (!ctx->hdev->mmu_enable)
> +		return;
> +
> +	if (!hash_empty(ctx->mmu_hash))
> +		dev_err(ctx->hdev->dev,
> +				"ctx is freed while it has pgts in use\n");
> +
> +	hash_for_each_safe(ctx->mmu_hash, i, tmp, pgt_info, node) {
> +		dev_err(ctx->hdev->dev,
> +			"pgt_info of addr 0x%llx of asid %d was not destroyed, num_ptes: %d\n",
> +			pgt_info->addr, ctx->asid, pgt_info->num_of_ptes);
> +		free_hop(ctx, pgt_info->addr);
> +	}
> +
> +	mutex_destroy(&ctx->mmu_lock);
> +}
> +
> +static int _hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	u64 hop0_addr = 0, hop0_pte_addr = 0,
> +		hop1_addr = 0, hop1_pte_addr = 0,
> +		hop2_addr = 0, hop2_pte_addr = 0,
> +		hop3_addr = 0, hop3_pte_addr = 0,
> +		hop4_addr = 0, hop4_pte_addr = 0,
> +		curr_pte;
> +	int clear_hop3 = 1;
> +
> +	hop0_addr = get_hop0_addr(ctx);
> +
> +	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
> +
> +	hop1_addr = get_next_hop_addr(curr_pte);
> +
> +	if (hop1_addr == ULLONG_MAX)
> +		goto not_mapped;
> +
> +	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
> +
> +	hop2_addr = get_next_hop_addr(curr_pte);
> +
> +	if (hop2_addr == ULLONG_MAX)
> +		goto not_mapped;
> +
> +	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
> +
> +	hop3_addr = get_next_hop_addr(curr_pte);
> +
> +	if (hop3_addr == ULLONG_MAX)
> +		goto not_mapped;
> +
> +	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
> +
> +	if (!(curr_pte & LAST_MASK)) {
> +		hop4_addr = get_next_hop_addr(curr_pte);
> +
> +		if (hop4_addr == ULLONG_MAX)
> +			goto not_mapped;
> +
> +		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
> +
> +		curr_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
> +
> +		clear_hop3 = 0;
> +	}
> +
> +	if (!(curr_pte & PAGE_PRESENT_MASK))
> +		goto not_mapped;
> +
> +	clear_pte(hdev, hop4_addr ? hop4_pte_addr : hop3_pte_addr);
> +
> +	if (hop4_addr && !put_pte(ctx, hop4_addr))
> +		clear_hop3 = 1;
> +
> +	if (!clear_hop3)
> +		goto flush;
> +	clear_pte(hdev, hop3_pte_addr);
> +
> +	if (put_pte(ctx, hop3_addr))
> +		goto flush;
> +	clear_pte(hdev, hop2_pte_addr);
> +
> +	if (put_pte(ctx, hop2_addr))
> +		goto flush;
> +	clear_pte(hdev, hop1_pte_addr);
> +
> +	if (put_pte(ctx, hop1_addr))
> +		goto flush;
> +	clear_pte(hdev, hop0_pte_addr);
> +
> +flush:
> +	/* flush all writes from all cores to reach PCI */
> +	mb();
> +
> +	hdev->asic_funcs->read_pte(hdev,
> +				hop4_addr ? hop4_pte_addr : hop3_pte_addr);
> +
> +	return 0;
> +
> +not_mapped:
> +	dev_err(hdev->dev, "virt addr 0x%llx is not mapped to phys addr\n",
> +		virt_addr);
> +
> +	return -EINVAL;
> +}
> +
> +/*
> + * hl_mmu_unmap - unmaps a virtual addr
> + *
> + * @ctx: pointer to the context structure
> + * @virt_addr: virt addr to map from
> + * @page_size: size of the page to unmap
> + *
> + * This function does the following:
> + * - Check that the virt addr is mapped
> + * - Unmap the virt addr and frees pgts if possible
> + * - Returns 0 on success, -EINVAL if the given addr is not mapped
> + *
> + * Because this function changes the page tables in the device and because it
> + * changes the MMU hash, it must be protected by a lock.
> + * However, because it maps only a single page, the lock should be implemented
> + * in a higher level in order to protect the entire mapping of the memory area
> + */
> +int hl_mmu_unmap(struct hl_ctx *ctx, u64 virt_addr, u32 page_size)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	u64 real_virt_addr;
> +	u32 real_page_size, npages;
> +	int i, rc;
> +
> +	if (!hdev->mmu_enable)
> +		return 0;
> +
> +	/*
> +	 * The H/W handles mapping of 4KB/2MB page. Hence if the host page size
> +	 * is bigger, we break it to sub-pages and unmap them separately.
> +	 */
> +	if ((page_size % PAGE_SIZE_2MB) == 0) {
> +		real_page_size = PAGE_SIZE_2MB;
> +	} else if ((page_size % PAGE_SIZE_4KB) == 0) {
> +		real_page_size = PAGE_SIZE_4KB;
> +	} else {
> +		dev_err(hdev->dev,
> +			"page size of %u is not 4KB nor 2MB aligned, can't unmap\n",
> +				page_size);
> +
> +		return -EFAULT;
> +	}
> +
> +	npages = page_size / real_page_size;
> +	real_virt_addr = virt_addr;
> +
> +	for (i = 0 ; i < npages ; i++) {
> +		rc = _hl_mmu_unmap(ctx, real_virt_addr);
> +		if (rc)
> +			return rc;
> +
> +		real_virt_addr += real_page_size;
> +	}
> +
> +	return 0;
> +}
> +
> +static int _hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr,
> +		u32 page_size)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	u64 hop0_addr = 0, hop0_pte_addr = 0,
> +		hop1_addr = 0, hop1_pte_addr = 0,
> +		hop2_addr = 0, hop2_pte_addr = 0,
> +		hop3_addr = 0, hop3_pte_addr = 0,
> +		hop4_addr = 0, hop4_pte_addr = 0,
> +		curr_pte = 0;
> +	bool hop1_new = false, hop2_new = false, hop3_new = false,
> +		hop4_new = false, is_huge;
> +	int rc = -ENOMEM;
> +
> +	/*
> +	 * This mapping function can map a 4KB/2MB page. For 2MB page there are
> +	 * only 3 hops rather than 4. Currently the DRAM allocation uses 2MB
> +	 * pages only but user memory could have been allocated with one of the
> +	 * two page sizes. Since this is a common code for all the three cases,
> +	 * we need this hugs page check.
> +	 */
> +	is_huge = page_size == PAGE_SIZE_2MB;
> +
> +	hop0_addr = get_hop0_addr(ctx);
> +
> +	hop0_pte_addr = get_hop0_pte_addr(ctx, hop0_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop0_pte_addr);
> +
> +	hop1_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop1_new);
> +
> +	if (hop1_addr == ULLONG_MAX)
> +		goto err;
> +
> +	hop1_pte_addr = get_hop1_pte_addr(ctx, hop1_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop1_pte_addr);
> +
> +	hop2_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop2_new);
> +
> +	if (hop2_addr == ULLONG_MAX)
> +		goto err;
> +
> +	hop2_pte_addr = get_hop2_pte_addr(ctx, hop2_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop2_pte_addr);
> +
> +	hop3_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop3_new);
> +
> +	if (hop3_addr == ULLONG_MAX)
> +		goto err;
> +
> +	hop3_pte_addr = get_hop3_pte_addr(ctx, hop3_addr, virt_addr);
> +
> +	curr_pte = hdev->asic_funcs->read_pte(hdev, hop3_pte_addr);
> +
> +	if (!is_huge) {
> +		hop4_addr = get_alloc_next_hop_addr(ctx, curr_pte, &hop4_new);
> +
> +		if (hop4_addr == ULLONG_MAX)
> +			goto err;
> +
> +		hop4_pte_addr = get_hop4_pte_addr(ctx, hop4_addr, virt_addr);
> +
> +		curr_pte = hdev->asic_funcs->read_pte(hdev, hop4_pte_addr);
> +	}
> +
> +	if (curr_pte & PAGE_PRESENT_MASK) {
> +		dev_err(hdev->dev,
> +				"mapping already exists for virt_addr 0x%llx\n",
> +					virt_addr);
> +
> +		dev_dbg(hdev->dev, "hop0 pte: 0x%llx (0x%llx)\n",
> +				hdev->asic_funcs->read_pte(hdev, hop0_pte_addr),
> +				hop0_pte_addr);
> +		dev_dbg(hdev->dev, "hop1 pte: 0x%llx (0x%llx)\n",
> +				hdev->asic_funcs->read_pte(hdev, hop1_pte_addr),
> +				hop1_pte_addr);
> +		dev_dbg(hdev->dev, "hop2 pte: 0x%llx (0x%llx)\n",
> +				hdev->asic_funcs->read_pte(hdev, hop2_pte_addr),
> +				hop2_pte_addr);
> +		dev_dbg(hdev->dev, "hop3 pte: 0x%llx (0x%llx)\n",
> +				hdev->asic_funcs->read_pte(hdev, hop3_pte_addr),
> +				hop3_pte_addr);
> +
> +		if (!is_huge)
> +			dev_dbg(hdev->dev, "hop4 pte: 0x%llx (0x%llx)\n",
> +				hdev->asic_funcs->read_pte(hdev,
> +							hop4_pte_addr),
> +							hop4_pte_addr);
> +
> +		rc = EINVAL;
> +		goto err;
> +	}
> +
> +	curr_pte = (phys_addr & PTE_PHYS_ADDR_MASK) | LAST_MASK
> +			| PAGE_PRESENT_MASK;
> +
> +	hdev->asic_funcs->write_pte(hdev,
> +				is_huge ? hop3_pte_addr : hop4_pte_addr,
> +				curr_pte);
> +
> +	if (hop1_new) {
> +		curr_pte = (hop1_addr & PTE_PHYS_ADDR_MASK) |
> +				PAGE_PRESENT_MASK;
> +		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop0_pte_addr,
> +				curr_pte);
> +	}
> +	if (hop2_new) {
> +		curr_pte = (hop2_addr & PTE_PHYS_ADDR_MASK) |
> +				PAGE_PRESENT_MASK;
> +		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop1_pte_addr,
> +				curr_pte);
> +		get_pte(ctx, hop1_addr);
> +	}
> +	if (hop3_new) {
> +		curr_pte = (hop3_addr & PTE_PHYS_ADDR_MASK) |
> +				PAGE_PRESENT_MASK;
> +		ctx->hdev->asic_funcs->write_pte(ctx->hdev, hop2_pte_addr,
> +				curr_pte);
> +		get_pte(ctx, hop2_addr);
> +	}
> +
> +	if (!is_huge) {
> +		if (hop4_new) {
> +			curr_pte = (hop4_addr & PTE_PHYS_ADDR_MASK) |
> +					PAGE_PRESENT_MASK;
> +			ctx->hdev->asic_funcs->write_pte(ctx->hdev,
> +					hop3_pte_addr, curr_pte);
> +			get_pte(ctx, hop3_addr);
> +		}
> +
> +		get_pte(ctx, hop4_addr);
> +	} else {
> +		get_pte(ctx, hop3_addr);
> +	}
> +
> +	/* flush all writes from all cores to reach PCI */
> +	mb();
> +
> +	hdev->asic_funcs->read_pte(hdev,
> +				is_huge ? hop3_pte_addr : hop4_pte_addr);
> +
> +	return 0;
> +
> +err:
> +	if (hop4_new)
> +		free_hop(ctx, hop4_addr);
> +	if (hop3_new)
> +		free_hop(ctx, hop3_addr);
> +	if (hop2_new)
> +		free_hop(ctx, hop2_addr);
> +	if (hop1_new)
> +		free_hop(ctx, hop1_addr);
> +
> +	return rc;
> +}
> +
> +/*
> + * hl_mmu_map - maps a virtual addr to physical addr
> + *
> + * @ctx: pointer to the context structure
> + * @virt_addr: virt addr to map from
> + * @phys_addr: phys addr to map to
> + * @page_size: physical page size
> + *
> + * This function does the following:
> + * - Check that the virt addr is not mapped
> + * - Allocate pgts as necessary in order to map the virt addr to the phys
> + * - Returns 0 on success, -EINVAL if addr is already mapped, or -ENOMEM.
> + *
> + * Because this function changes the page tables in the device and because it
> + * changes the MMU hash, it must be protected by a lock.
> + * However, because it maps only a single page, the lock should be implemented
> + * in a higher level in order to protect the entire mapping of the memory area
> + */
> +int hl_mmu_map(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr, u32 page_size)
> +{
> +	struct hl_device *hdev = ctx->hdev;
> +	u64 real_virt_addr;
> +	u32 real_page_size, npages;
> +	int i, rc, mapped_cnt = 0;
> +
> +	if (!hdev->mmu_enable)
> +		return 0;
> +
> +	/*
> +	 * The H/W handles mapping of 4KB/2MB page. Hence if the host page size
> +	 * is bigger, we break it to sub-pages and map them separately.
> +	 */
> +	if ((page_size % PAGE_SIZE_2MB) == 0) {
> +		real_page_size = PAGE_SIZE_2MB;
> +	} else if ((page_size % PAGE_SIZE_4KB) == 0) {
> +		real_page_size = PAGE_SIZE_4KB;
> +	} else {
> +		dev_err(hdev->dev,
> +			"page size of %u is not 4KB nor 2MB aligned, can't map\n",
> +				page_size);
> +
> +		return -EFAULT;
> +	}
> +
> +	npages = page_size / real_page_size;
> +	real_virt_addr = virt_addr;
> +
> +	for (i = 0 ; i < npages ; i++) {
> +		rc = _hl_mmu_map(ctx, real_virt_addr, phys_addr,
> +				real_page_size);
> +		if (rc)
> +			goto err;
> +
> +		real_virt_addr += real_page_size;
> +		mapped_cnt++;
> +	}
> +
> +	return 0;
> +
> +err:
> +	real_virt_addr = virt_addr;
> +	for (i = 0 ; i < mapped_cnt ; i++) {
> +		if (_hl_mmu_unmap(ctx, real_virt_addr))
> +			dev_warn_ratelimited(hdev->dev,
> +				"failed to unmap va: 0x%llx\n", real_virt_addr);
> +
> +		real_virt_addr += real_page_size;
> +	}
> +
> +	return rc;
> +}
> +
> +/*
> + * hl_mmu_swap_out - marks all mapping of the given ctx as swapped out
> + *
> + * @ctx: pointer to the context structure
> + *
> + */
> +void hl_mmu_swap_out(struct hl_ctx *ctx)
> +{
> +
> +}
> +
> +/*
> + * hl_mmu_swap_in - marks all mapping of the given ctx as swapped in
> + *
> + * @ctx: pointer to the context structure
> + *
> + */
> +void hl_mmu_swap_in(struct hl_ctx *ctx)
> +{
> +
> +}
> diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
> index fba49417f607..9015043887d1 100644
> --- a/include/uapi/misc/habanalabs.h
> +++ b/include/uapi/misc/habanalabs.h
> @@ -162,6 +162,108 @@ union hl_wait_cs_args {
>  	struct hl_wait_cs_out out;
>  };
>  
> +/* Opcode to alloc device memory */
> +#define HL_MEM_OP_ALLOC			0
> +/* Opcode to free previously allocated device memory */
> +#define HL_MEM_OP_FREE			1
> +/* Opcode to map host memory */
> +#define HL_MEM_OP_MAP			2
> +/* Opcode to unmap previously mapped host memory */
> +#define HL_MEM_OP_UNMAP			3
> +
> +/* Memory flags */
> +#define HL_MEM_CONTIGUOUS	0x1
> +#define HL_MEM_SHARED		0x2
> +#define HL_MEM_USERPTR		0x4
> +
> +struct hl_mem_in {
> +	union {
> +		/* HL_MEM_OP_ALLOC- allocate device memory */
> +		struct {
> +			/* Size to alloc */
> +			__u32 mem_size;
> +			__u32 pad;
> +		} alloc;
> +
> +		/* HL_MEM_OP_FREE - free device memory */
> +		struct {
> +			/* Handle returned from HL_MEM_OP_ALLOC */
> +			__u64 handle;
> +		} free;
> +
> +		/* HL_MEM_OP_MAP - map device memory */
> +		struct {
> +			/*
> +			 * Requested virtual address of mapped memory.
> +			 * KMD will try to map the requested region to this
> +			 * hint address, as long as the address is valid and
> +			 * not already mapped. The user should check the
> +			 * returned address of the IOCTL to make sure he got
> +			 * the hint address. Passing 0 here means that KMD
> +			 * will choose the address itself.
> +			 */
> +			__u64 hint_addr;
> +			/* Handle returned from HL_MEM_OP_ALLOC */
> +			__u64 handle;
> +		} map_device;
> +
> +		/* HL_MEM_OP_MAP - map host memory */
> +		struct {
> +			/* Address of allocated host memory */
> +			__u64 host_virt_addr;
> +			/*
> +			 * Requested virtual address of mapped memory.
> +			 * KMD will try to map the requested region to this
> +			 * hint address, as long as the address is valid and
> +			 * not already mapped. The user should check the
> +			 * returned address of the IOCTL to make sure he got
> +			 * the hint address. Passing 0 here means that KMD
> +			 * will choose the address itself.
> +			 */
> +			__u64 hint_addr;
> +			/* Size of allocated host memory */
> +			__u32 mem_size;
> +			__u32 pad;
> +		} map_host;
> +
> +		/* HL_MEM_OP_UNMAP - unmap host memory */
> +		struct {
> +			/* Virtual address returned from HL_MEM_OP_MAP */
> +			__u64 device_virt_addr;
> +		} unmap;
> +	};
> +
> +	/* HL_MEM_OP_* */
> +	__u32 op;
> +	/* HL_MEM_* flags */
> +	__u32 flags;
> +	/* Context ID - Currently not in use */
> +	__u32 ctx_id;
> +	__u32 pad;
> +};
> +
> +struct hl_mem_out {
> +	union {
> +		/*
> +		 * Used for HL_MEM_OP_MAP as the virtual address that was
> +		 * assigned in the device VA space.
> +		 * A value of 0 means the requested operation failed.
> +		 */
> +		__u64 device_virt_addr;
> +
> +		/*
> +		 * Used for HL_MEM_OP_ALLOC. This is the assigned
> +		 * handle for the allocated memory
> +		 */
> +		__u64 handle;
> +	};
> +};
> +
> +union hl_mem_args {
> +	struct hl_mem_in in;
> +	struct hl_mem_out out;
> +};
> +
>  /*
>   * Command Buffer
>   * - Request a Command Buffer
> @@ -245,7 +347,25 @@ union hl_wait_cs_args {
>  #define HL_IOCTL_WAIT_CS			\
>  		_IOWR('H', 0x04, union hl_wait_cs_args)
>  
> +/*
> + * Memory
> + * - Map host memory to device MMU
> + * - Unmap host memory from device MMU
> + *
> + * This IOCTL allows the user to map host memory to the device MMU
> + *
> + * For host memory, the IOCTL doesn't allocate memory. The user is supposed
> + * to allocate the memory in user-space (malloc/new). The driver pins the
> + * physical pages (up to the allowed limit by the OS), assigns a virtual
> + * address in the device VA space and initializes the device MMU.
> + *
> + * There is an option for the user to specify the requested virtual address.
> + *
> + */
> +#define HL_IOCTL_MEMORY		\
> +		_IOWR('H', 0x05, union hl_mem_args)
> +
>  #define HL_COMMAND_START	0x02
> -#define HL_COMMAND_END		0x05
> +#define HL_COMMAND_END		0x06
>  
>  #endif /* HABANALABS_H_ */
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-11 15:17 [PATCH v4 00/15] Habana Labs kernel driver Oded Gabbay
                   ` (13 preceding siblings ...)
  2019-02-11 15:17 ` [PATCH v4 15/15] Update MAINTAINERS and CREDITS with habanalabs info Oded Gabbay
@ 2019-02-14  7:11 ` Greg KH
  2019-02-14  7:13   ` Oded Gabbay
  14 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2019-02-14  7:11 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: linux-kernel, rppt, olof, ogabbay, arnd, joe

On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> Hello,
> This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> 
> The patch-set is rebased on v5.0-rc6.
> 
> Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> 
> Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> 
> Link to v1 cover letter: https://lwn.net/Articles/777342/
> 
> I would appricate any feedback, question and/or review.

There's been some 0-day bot feedback on some of these patches now that I
put them in my -testing branch.  So I'm going to drop the patch series
from there now and wait for a v5 of the series that hopefully will have
those issues fixed :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14  7:11 ` [PATCH v4 00/15] Habana Labs kernel driver Greg KH
@ 2019-02-14  7:13   ` Oded Gabbay
  2019-02-14  9:58     ` Oded Gabbay
  0 siblings, 1 reply; 25+ messages in thread
From: Oded Gabbay @ 2019-02-14  7:13 UTC (permalink / raw)
  To: Greg KH
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > Hello,
> > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> >
> > The patch-set is rebased on v5.0-rc6.
> >
> > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> >
> > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> >
> > Link to v1 cover letter: https://lwn.net/Articles/777342/
> >
> > I would appricate any feedback, question and/or review.
>
> There's been some 0-day bot feedback on some of these patches now that I
> put them in my -testing branch.  So I'm going to drop the patch series
> from there now and wait for a v5 of the series that hopefully will have
> those issues fixed :)
>
> thanks,
>
> greg k-h

Sure, np.
Thanks,
Oded

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14  7:13   ` Oded Gabbay
@ 2019-02-14  9:58     ` Oded Gabbay
  2019-02-14 10:07       ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: Oded Gabbay @ 2019-02-14  9:58 UTC (permalink / raw)
  To: Greg KH
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > Hello,
> > > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> > >
> > > The patch-set is rebased on v5.0-rc6.
> > >
> > > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> > >
> > > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> > >
> > > Link to v1 cover letter: https://lwn.net/Articles/777342/
> > >
> > > I would appricate any feedback, question and/or review.
> >
> > There's been some 0-day bot feedback on some of these patches now that I
> > put them in my -testing branch.  So I'm going to drop the patch series
> > from there now and wait for a v5 of the series that hopefully will have
> > those issues fixed :)
> >
Hi Greg,
I looked at the 4 warnings I received from your emails, and they all
appear in i386 architecture.
I don't want to support 32-bit kernel and I don't intend to support it.
Can we just specify in kconfig that we don't support it, and then you
won't get these warnings ?
I initially set in kconfig to support only x86_64, and you told me
(and you were right) not to limit to that. But I do think I would like
to disable the driver on i386.

Thanks,
Oded


> > thanks,
> >
> > greg k-h
>
> Sure, np.
> Thanks,
> Oded

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14  9:58     ` Oded Gabbay
@ 2019-02-14 10:07       ` Greg KH
  2019-02-14 10:15         ` Oded Gabbay
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2019-02-14 10:07 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 11:58:41AM +0200, Oded Gabbay wrote:
> On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > > Hello,
> > > > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > > > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > > > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> > > >
> > > > The patch-set is rebased on v5.0-rc6.
> > > >
> > > > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> > > >
> > > > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> > > >
> > > > Link to v1 cover letter: https://lwn.net/Articles/777342/
> > > >
> > > > I would appricate any feedback, question and/or review.
> > >
> > > There's been some 0-day bot feedback on some of these patches now that I
> > > put them in my -testing branch.  So I'm going to drop the patch series
> > > from there now and wait for a v5 of the series that hopefully will have
> > > those issues fixed :)
> > >
> Hi Greg,
> I looked at the 4 warnings I received from your emails, and they all
> appear in i386 architecture.
> I don't want to support 32-bit kernel and I don't intend to support it.
> Can we just specify in kconfig that we don't support it, and then you
> won't get these warnings ?

No, if you use the correct kernel types and castings, you should be
fine.

> I initially set in kconfig to support only x86_64, and you told me
> (and you were right) not to limit to that. But I do think I would like
> to disable the driver on i386.

You might want to not support it on 32bit kernels, but even then, I
think all you need to do here is use the proper kernel types and you
will be ok.

As an example:
   drivers/misc/habanalabs/goya/goya.c: In function 'goya_early_init':
   drivers/misc/habanalabs/goya/goya.c:404:4: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=]
       "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
       ^~~~~~

Use the correct printk type for a resource_size_t.

You got that warning twice.

Another one is:
>> drivers/misc/habanalabs/device.c:283:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     volatile u32 *paddr = (volatile u32 *) addr;

Now using a volatile makes me want to say "you are doing it wrong!", as
yes, you shouldn't be reading directly from a memory pointer, you need
to use the correct iomem accessors, right?

So I think just fixing this stuff up should be simple, the
resource_size_t fix is needed no matter what size kernel you run on.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14 10:07       ` Greg KH
@ 2019-02-14 10:15         ` Oded Gabbay
  2019-02-14 10:37           ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: Oded Gabbay @ 2019-02-14 10:15 UTC (permalink / raw)
  To: Greg KH
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 12:07 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 14, 2019 at 11:58:41AM +0200, Oded Gabbay wrote:
> > On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > > > Hello,
> > > > > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > > > > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > > > > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> > > > >
> > > > > The patch-set is rebased on v5.0-rc6.
> > > > >
> > > > > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> > > > >
> > > > > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> > > > >
> > > > > Link to v1 cover letter: https://lwn.net/Articles/777342/
> > > > >
> > > > > I would appricate any feedback, question and/or review.
> > > >
> > > > There's been some 0-day bot feedback on some of these patches now that I
> > > > put them in my -testing branch.  So I'm going to drop the patch series
> > > > from there now and wait for a v5 of the series that hopefully will have
> > > > those issues fixed :)
> > > >
> > Hi Greg,
> > I looked at the 4 warnings I received from your emails, and they all
> > appear in i386 architecture.
> > I don't want to support 32-bit kernel and I don't intend to support it.
> > Can we just specify in kconfig that we don't support it, and then you
> > won't get these warnings ?
>
> No, if you use the correct kernel types and castings, you should be
> fine.
>
> > I initially set in kconfig to support only x86_64, and you told me
> > (and you were right) not to limit to that. But I do think I would like
> > to disable the driver on i386.
>
> You might want to not support it on 32bit kernels, but even then, I
> think all you need to do here is use the proper kernel types and you
> will be ok.
>
> As an example:
>    drivers/misc/habanalabs/goya/goya.c: In function 'goya_early_init':
>    drivers/misc/habanalabs/goya/goya.c:404:4: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=]
>        "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
>        ^~~~~~
>
> Use the correct printk type for a resource_size_t.
>
> You got that warning twice.
>
> Another one is:
> >> drivers/misc/habanalabs/device.c:283:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>      volatile u32 *paddr = (volatile u32 *) addr;
>
> Now using a volatile makes me want to say "you are doing it wrong!", as
> yes, you shouldn't be reading directly from a memory pointer, you need
> to use the correct iomem accessors, right?
>
> So I think just fixing this stuff up should be simple, the
> resource_size_t fix is needed no matter what size kernel you run on.
>
> thanks,
>
> greg k-h

ok, got it, will be fixed.

Regarding the volatile, this is not an I/O memory. This is host memory
that is changed by the device. That's why I wrote in the comment
there:
/*
* paddr is defined as volatile because it points to HOST memory,
* which is being written to by the device. Therefore, we can't use
* locks to synchronize it and it is not a memory-mapped register space
*/

Am I missing something here ? I don't think I should use the iomem
accessors on host memory, right ? Assuming I'm right, is there another
way to ensure the compiler won't optimize this without using the
volatile keyword ?

Thanks,
Oded

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14 10:15         ` Oded Gabbay
@ 2019-02-14 10:37           ` Greg KH
  2019-02-14 10:45             ` Oded Gabbay
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2019-02-14 10:37 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 12:15:19PM +0200, Oded Gabbay wrote:
> On Thu, Feb 14, 2019 at 12:07 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Feb 14, 2019 at 11:58:41AM +0200, Oded Gabbay wrote:
> > > On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > >
> > > > On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > > > > Hello,
> > > > > > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > > > > > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > > > > > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> > > > > >
> > > > > > The patch-set is rebased on v5.0-rc6.
> > > > > >
> > > > > > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> > > > > >
> > > > > > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> > > > > >
> > > > > > Link to v1 cover letter: https://lwn.net/Articles/777342/
> > > > > >
> > > > > > I would appricate any feedback, question and/or review.
> > > > >
> > > > > There's been some 0-day bot feedback on some of these patches now that I
> > > > > put them in my -testing branch.  So I'm going to drop the patch series
> > > > > from there now and wait for a v5 of the series that hopefully will have
> > > > > those issues fixed :)
> > > > >
> > > Hi Greg,
> > > I looked at the 4 warnings I received from your emails, and they all
> > > appear in i386 architecture.
> > > I don't want to support 32-bit kernel and I don't intend to support it.
> > > Can we just specify in kconfig that we don't support it, and then you
> > > won't get these warnings ?
> >
> > No, if you use the correct kernel types and castings, you should be
> > fine.
> >
> > > I initially set in kconfig to support only x86_64, and you told me
> > > (and you were right) not to limit to that. But I do think I would like
> > > to disable the driver on i386.
> >
> > You might want to not support it on 32bit kernels, but even then, I
> > think all you need to do here is use the proper kernel types and you
> > will be ok.
> >
> > As an example:
> >    drivers/misc/habanalabs/goya/goya.c: In function 'goya_early_init':
> >    drivers/misc/habanalabs/goya/goya.c:404:4: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=]
> >        "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> >        ^~~~~~
> >
> > Use the correct printk type for a resource_size_t.
> >
> > You got that warning twice.
> >
> > Another one is:
> > >> drivers/misc/habanalabs/device.c:283:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
> >      volatile u32 *paddr = (volatile u32 *) addr;
> >
> > Now using a volatile makes me want to say "you are doing it wrong!", as
> > yes, you shouldn't be reading directly from a memory pointer, you need
> > to use the correct iomem accessors, right?
> >
> > So I think just fixing this stuff up should be simple, the
> > resource_size_t fix is needed no matter what size kernel you run on.
> >
> > thanks,
> >
> > greg k-h
> 
> ok, got it, will be fixed.
> 
> Regarding the volatile, this is not an I/O memory. This is host memory
> that is changed by the device. That's why I wrote in the comment
> there:
> /*
> * paddr is defined as volatile because it points to HOST memory,
> * which is being written to by the device. Therefore, we can't use
> * locks to synchronize it and it is not a memory-mapped register space

What do you mean by "HOST" memory?  The memory that the processor is
running on?

> */
> 
> Am I missing something here ? I don't think I should use the iomem
> accessors on host memory, right ? Assuming I'm right, is there another
> way to ensure the compiler won't optimize this without using the
> volatile keyword ?

What are you trying to prevent from being "optimized" here?

Are you sure you just don't need a correct memory barrier?  That's the
only way to ensure that if you write to a location from one thread/cpu,
it will show up to the other thread/cpu correctly.  volatile will not
ensure that for you at all (hint, the compiler just ignores it for the
most part.)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14 10:37           ` Greg KH
@ 2019-02-14 10:45             ` Oded Gabbay
  2019-02-14 11:04               ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: Oded Gabbay @ 2019-02-14 10:45 UTC (permalink / raw)
  To: Greg KH
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 12:37 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 14, 2019 at 12:15:19PM +0200, Oded Gabbay wrote:
> > On Thu, Feb 14, 2019 at 12:07 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Thu, Feb 14, 2019 at 11:58:41AM +0200, Oded Gabbay wrote:
> > > > On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > >
> > > > > On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > >
> > > > > > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > > > > > Hello,
> > > > > > > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > > > > > > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > > > > > > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> > > > > > >
> > > > > > > The patch-set is rebased on v5.0-rc6.
> > > > > > >
> > > > > > > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> > > > > > >
> > > > > > > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> > > > > > >
> > > > > > > Link to v1 cover letter: https://lwn.net/Articles/777342/
> > > > > > >
> > > > > > > I would appricate any feedback, question and/or review.
> > > > > >
> > > > > > There's been some 0-day bot feedback on some of these patches now that I
> > > > > > put them in my -testing branch.  So I'm going to drop the patch series
> > > > > > from there now and wait for a v5 of the series that hopefully will have
> > > > > > those issues fixed :)
> > > > > >
> > > > Hi Greg,
> > > > I looked at the 4 warnings I received from your emails, and they all
> > > > appear in i386 architecture.
> > > > I don't want to support 32-bit kernel and I don't intend to support it.
> > > > Can we just specify in kconfig that we don't support it, and then you
> > > > won't get these warnings ?
> > >
> > > No, if you use the correct kernel types and castings, you should be
> > > fine.
> > >
> > > > I initially set in kconfig to support only x86_64, and you told me
> > > > (and you were right) not to limit to that. But I do think I would like
> > > > to disable the driver on i386.
> > >
> > > You might want to not support it on 32bit kernels, but even then, I
> > > think all you need to do here is use the proper kernel types and you
> > > will be ok.
> > >
> > > As an example:
> > >    drivers/misc/habanalabs/goya/goya.c: In function 'goya_early_init':
> > >    drivers/misc/habanalabs/goya/goya.c:404:4: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=]
> > >        "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> > >        ^~~~~~
> > >
> > > Use the correct printk type for a resource_size_t.
> > >
> > > You got that warning twice.
> > >
> > > Another one is:
> > > >> drivers/misc/habanalabs/device.c:283:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
> > >      volatile u32 *paddr = (volatile u32 *) addr;
> > >
> > > Now using a volatile makes me want to say "you are doing it wrong!", as
> > > yes, you shouldn't be reading directly from a memory pointer, you need
> > > to use the correct iomem accessors, right?
> > >
> > > So I think just fixing this stuff up should be simple, the
> > > resource_size_t fix is needed no matter what size kernel you run on.
> > >
> > > thanks,
> > >
> > > greg k-h
> >
> > ok, got it, will be fixed.
> >
> > Regarding the volatile, this is not an I/O memory. This is host memory
> > that is changed by the device. That's why I wrote in the comment
> > there:
> > /*
> > * paddr is defined as volatile because it points to HOST memory,
> > * which is being written to by the device. Therefore, we can't use
> > * locks to synchronize it and it is not a memory-mapped register space
>
> What do you mean by "HOST" memory?  The memory that the processor is
> running on?
>
Yes, exactly. The memory of the server. Not a memory on my device.

> > */
> >
> > Am I missing something here ? I don't think I should use the iomem
> > accessors on host memory, right ? Assuming I'm right, is there another
> > way to ensure the compiler won't optimize this without using the
> > volatile keyword ?
>
> What are you trying to prevent from being "optimized" here?
>
> Are you sure you just don't need a correct memory barrier?  That's the
> only way to ensure that if you write to a location from one thread/cpu,
> it will show up to the other thread/cpu correctly.  volatile will not
> ensure that for you at all (hint, the compiler just ignores it for the
> most part.)

But the writing entity in this case is NOT another thread/cpu. The
writing entity is the device. So a memory barrier, IMO, won't help me
here, because memory barriers affect only on the CPU. Not on external
initiators.

AFAIK, the volatile keyword tells the compiler that the value of the
variable may change at any time--without any action being taken by the
code the compiler finds nearby. And this is exactly what happens here.
I poll on a memory location of the CPU, and that memory can change at
any time by the device.

Thanks,
Oded


>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14 10:45             ` Oded Gabbay
@ 2019-02-14 11:04               ` Greg KH
  2019-02-14 11:40                 ` Oded Gabbay
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2019-02-14 11:04 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 12:45:00PM +0200, Oded Gabbay wrote:
> On Thu, Feb 14, 2019 at 12:37 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Feb 14, 2019 at 12:15:19PM +0200, Oded Gabbay wrote:
> > > On Thu, Feb 14, 2019 at 12:07 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Thu, Feb 14, 2019 at 11:58:41AM +0200, Oded Gabbay wrote:
> > > > > On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > > > > > > Hello,
> > > > > > > > This is v4 of the Habana Labs kernel driver patch-set. It contains fixes
> > > > > > > > according to reviews done on v3, mainly for the command buffer, sysfs and MMU
> > > > > > > > patches. In addition, patch 2/15 was reduced in size from 4.3MB to 1.4MB.
> > > > > > > >
> > > > > > > > The patch-set is rebased on v5.0-rc6.
> > > > > > > >
> > > > > > > > Link to v3 cover letter: https://lkml.org/lkml/2019/2/4/1033
> > > > > > > >
> > > > > > > > Link to v2 cover letter: https://lkml.org/lkml/2019/1/30/1003
> > > > > > > >
> > > > > > > > Link to v1 cover letter: https://lwn.net/Articles/777342/
> > > > > > > >
> > > > > > > > I would appricate any feedback, question and/or review.
> > > > > > >
> > > > > > > There's been some 0-day bot feedback on some of these patches now that I
> > > > > > > put them in my -testing branch.  So I'm going to drop the patch series
> > > > > > > from there now and wait for a v5 of the series that hopefully will have
> > > > > > > those issues fixed :)
> > > > > > >
> > > > > Hi Greg,
> > > > > I looked at the 4 warnings I received from your emails, and they all
> > > > > appear in i386 architecture.
> > > > > I don't want to support 32-bit kernel and I don't intend to support it.
> > > > > Can we just specify in kconfig that we don't support it, and then you
> > > > > won't get these warnings ?
> > > >
> > > > No, if you use the correct kernel types and castings, you should be
> > > > fine.
> > > >
> > > > > I initially set in kconfig to support only x86_64, and you told me
> > > > > (and you were right) not to limit to that. But I do think I would like
> > > > > to disable the driver on i386.
> > > >
> > > > You might want to not support it on 32bit kernels, but even then, I
> > > > think all you need to do here is use the proper kernel types and you
> > > > will be ok.
> > > >
> > > > As an example:
> > > >    drivers/misc/habanalabs/goya/goya.c: In function 'goya_early_init':
> > > >    drivers/misc/habanalabs/goya/goya.c:404:4: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'resource_size_t' {aka 'unsigned int'} [-Wformat=]
> > > >        "Not " HL_NAME "? BAR %d size %llu, expecting %llu\n",
> > > >        ^~~~~~
> > > >
> > > > Use the correct printk type for a resource_size_t.
> > > >
> > > > You got that warning twice.
> > > >
> > > > Another one is:
> > > > >> drivers/misc/habanalabs/device.c:283:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
> > > >      volatile u32 *paddr = (volatile u32 *) addr;
> > > >
> > > > Now using a volatile makes me want to say "you are doing it wrong!", as
> > > > yes, you shouldn't be reading directly from a memory pointer, you need
> > > > to use the correct iomem accessors, right?
> > > >
> > > > So I think just fixing this stuff up should be simple, the
> > > > resource_size_t fix is needed no matter what size kernel you run on.
> > > >
> > > > thanks,
> > > >
> > > > greg k-h
> > >
> > > ok, got it, will be fixed.
> > >
> > > Regarding the volatile, this is not an I/O memory. This is host memory
> > > that is changed by the device. That's why I wrote in the comment
> > > there:
> > > /*
> > > * paddr is defined as volatile because it points to HOST memory,
> > > * which is being written to by the device. Therefore, we can't use
> > > * locks to synchronize it and it is not a memory-mapped register space
> >
> > What do you mean by "HOST" memory?  The memory that the processor is
> > running on?
> >
> Yes, exactly. The memory of the server. Not a memory on my device.
> 
> > > */
> > >
> > > Am I missing something here ? I don't think I should use the iomem
> > > accessors on host memory, right ? Assuming I'm right, is there another
> > > way to ensure the compiler won't optimize this without using the
> > > volatile keyword ?
> >
> > What are you trying to prevent from being "optimized" here?
> >
> > Are you sure you just don't need a correct memory barrier?  That's the
> > only way to ensure that if you write to a location from one thread/cpu,
> > it will show up to the other thread/cpu correctly.  volatile will not
> > ensure that for you at all (hint, the compiler just ignores it for the
> > most part.)
> 
> But the writing entity in this case is NOT another thread/cpu. The
> writing entity is the device. So a memory barrier, IMO, won't help me
> here, because memory barriers affect only on the CPU. Not on external
> initiators.
> 
> AFAIK, the volatile keyword tells the compiler that the value of the
> variable may change at any time--without any action being taken by the
> code the compiler finds nearby. And this is exactly what happens here.
> I poll on a memory location of the CPU, and that memory can change at
> any time by the device.

And how is that memory location mapped into the device memory?  As such,
it's iomemory, right?

volatile doesn't tell the compiler much, if anything, anymore.

I can't seem to trace back the code here (it's split across multiple
emails), but it seems that the memory location is coming from struct
armcp_packet in one location, and a dma pool in another one.  I don't
know what the rules are for dma mapped memory, but it feels like
'volatile' is not the way to use it :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 00/15] Habana Labs kernel driver
  2019-02-14 11:04               ` Greg KH
@ 2019-02-14 11:40                 ` Oded Gabbay
  0 siblings, 0 replies; 25+ messages in thread
From: Oded Gabbay @ 2019-02-14 11:40 UTC (permalink / raw)
  To: Greg KH
  Cc: Linux-Kernel@Vger. Kernel. Org, Mike Rapoport, Olof Johansson,
	ogabbay, Arnd Bergmann, Joe Perches

On Thu, Feb 14, 2019 at 1:04 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 14, 2019 at 12:45:00PM +0200, Oded Gabbay wrote:
> > On Thu, Feb 14, 2019 at 12:37 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Thu, Feb 14, 2019 at 12:15:19PM +0200, Oded Gabbay wrote:
> > > > On Thu, Feb 14, 2019 at 12:07 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Thu, Feb 14, 2019 at 11:58:41AM +0200, Oded Gabbay wrote:
> > > > > > On Thu, Feb 14, 2019 at 9:13 AM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Feb 14, 2019 at 9:11 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Feb 11, 2019 at 05:17:36PM +0200, Oded Gabbay wrote:
> > > > > > > > > Hello,
> > > > > > > > > This is v4 of the Habana Labs