* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-25 17:17 ` tom.cooksey at arm.com 0 siblings, 0 replies; 131+ messages in thread From: tom.cooksey @ 2013-07-25 17:17 UTC (permalink / raw) To: dri-devel; +Cc: linux-fbdev, pawel.moll, linux-arm-kernel, Tom Cooksey From: Tom Cooksey <tom.cooksey@arm.com> Please find below the current state of our pl111 DRM/KMS driver. This is lightly tested on a Versatile Express using X running the xf86-video-armsoc DDX driver[i] with the patches applied to drm-next as of ~last week. To actually see anything on the DVI output, you must also apply Pawel Moll's VExpress DVI mux driver[ii] to select the video signal from the ca9x4 core tile. [i] <https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary> [ii] <https://patchwork.kernel.org/patch/1765981/> Known issues: * It uses KDS. We intend to switch to whatever implicit per-buffer synchronisation mechanism gets merged, once something is merged. * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also allocate buffers for the GPU. Still not sure how to resolve this as we don't use DRM for our GPU driver. * Doesn't handle page flip event sequence numbers and timestamps * The v_sync handling needs work in general - a work queue is a little overkill * Doesn't support the import half of PRIME properly, only export * Need to validate src rectangle size in pl111_drm_cursor_plane_update() * Only supports 640x480 mode, which is hard-coded. We intend to rebase on top of CDF once it is merged, which hopefully will handle a lot of the EDID parsing & mode setting for us (once Pawel's CDF patches for VExpress also land). I appreciate that's a fairly hefty list of known issues already! However, we're waiting for both CDF & dma_buf sync mechanisms to land before we can address some of those. So in the mean-time, I thought someone might be interested in taking a look at what we have so far, which is why I'm posting this now. Needless to say the code will need to be refactored a fair bit, however I'm keen to get and additional feedback anyone cares to give. Cheers, Tom Tom Cooksey (1): drm/pl111: Initial drm/kms driver for pl111 display controller drivers/gpu/drm/Kconfig | 2 + drivers/gpu/drm/Makefile | 1 + drivers/gpu/drm/pl111/Kbuild | 14 + drivers/gpu/drm/pl111/Kconfig | 9 + drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ 19 files changed, 3268 insertions(+) create mode 100644 drivers/gpu/drm/pl111/Kbuild create mode 100644 drivers/gpu/drm/pl111/Kconfig create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c -- 1.7.9.5 ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-25 17:17 ` tom.cooksey at arm.com 0 siblings, 0 replies; 131+ messages in thread From: tom.cooksey @ 2013-07-25 17:17 UTC (permalink / raw) To: dri-devel; +Cc: linux-fbdev, pawel.moll, linux-arm-kernel, Tom Cooksey From: Tom Cooksey <tom.cooksey@arm.com> Please find below the current state of our pl111 DRM/KMS driver. This is lightly tested on a Versatile Express using X running the xf86-video-armsoc DDX driver[i] with the patches applied to drm-next as of ~last week. To actually see anything on the DVI output, you must also apply Pawel Moll's VExpress DVI mux driver[ii] to select the video signal from the ca9x4 core tile. [i] <https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary> [ii] <https://patchwork.kernel.org/patch/1765981/> Known issues: * It uses KDS. We intend to switch to whatever implicit per-buffer synchronisation mechanism gets merged, once something is merged. * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also allocate buffers for the GPU. Still not sure how to resolve this as we don't use DRM for our GPU driver. * Doesn't handle page flip event sequence numbers and timestamps * The v_sync handling needs work in general - a work queue is a little overkill * Doesn't support the import half of PRIME properly, only export * Need to validate src rectangle size in pl111_drm_cursor_plane_update() * Only supports 640x480 mode, which is hard-coded. We intend to rebase on top of CDF once it is merged, which hopefully will handle a lot of the EDID parsing & mode setting for us (once Pawel's CDF patches for VExpress also land). I appreciate that's a fairly hefty list of known issues already! However, we're waiting for both CDF & dma_buf sync mechanisms to land before we can address some of those. So in the mean-time, I thought someone might be interested in taking a look at what we have so far, which is why I'm posting this now. Needless to say the code will need to be refactored a fair bit, however I'm keen to get and additional feedback anyone cares to give. Cheers, Tom Tom Cooksey (1): drm/pl111: Initial drm/kms driver for pl111 display controller drivers/gpu/drm/Kconfig | 2 + drivers/gpu/drm/Makefile | 1 + drivers/gpu/drm/pl111/Kbuild | 14 + drivers/gpu/drm/pl111/Kconfig | 9 + drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ 19 files changed, 3268 insertions(+) create mode 100644 drivers/gpu/drm/pl111/Kbuild create mode 100644 drivers/gpu/drm/pl111/Kconfig create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c -- 1.7.9.5 ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-25 17:17 ` tom.cooksey at arm.com 0 siblings, 0 replies; 131+ messages in thread From: tom.cooksey at arm.com @ 2013-07-25 17:17 UTC (permalink / raw) To: linux-arm-kernel From: Tom Cooksey <tom.cooksey@arm.com> Please find below the current state of our pl111 DRM/KMS driver. This is lightly tested on a Versatile Express using X running the xf86-video-armsoc DDX driver[i] with the patches applied to drm-next as of ~last week. To actually see anything on the DVI output, you must also apply Pawel Moll's VExpress DVI mux driver[ii] to select the video signal from the ca9x4 core tile. [i] <https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary> [ii] <https://patchwork.kernel.org/patch/1765981/> Known issues: * It uses KDS. We intend to switch to whatever implicit per-buffer synchronisation mechanism gets merged, once something is merged. * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also allocate buffers for the GPU. Still not sure how to resolve this as we don't use DRM for our GPU driver. * Doesn't handle page flip event sequence numbers and timestamps * The v_sync handling needs work in general - a work queue is a little overkill * Doesn't support the import half of PRIME properly, only export * Need to validate src rectangle size in pl111_drm_cursor_plane_update() * Only supports 640x480 mode, which is hard-coded. We intend to rebase on top of CDF once it is merged, which hopefully will handle a lot of the EDID parsing & mode setting for us (once Pawel's CDF patches for VExpress also land). I appreciate that's a fairly hefty list of known issues already! However, we're waiting for both CDF & dma_buf sync mechanisms to land before we can address some of those. So in the mean-time, I thought someone might be interested in taking a look at what we have so far, which is why I'm posting this now. Needless to say the code will need to be refactored a fair bit, however I'm keen to get and additional feedback anyone cares to give. Cheers, Tom Tom Cooksey (1): drm/pl111: Initial drm/kms driver for pl111 display controller drivers/gpu/drm/Kconfig | 2 + drivers/gpu/drm/Makefile | 1 + drivers/gpu/drm/pl111/Kbuild | 14 + drivers/gpu/drm/pl111/Kconfig | 9 + drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ 19 files changed, 3268 insertions(+) create mode 100644 drivers/gpu/drm/pl111/Kbuild create mode 100644 drivers/gpu/drm/pl111/Kconfig create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c -- 1.7.9.5 ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 17:17 ` tom.cooksey at arm.com (?) (?) @ 2013-07-25 17:17 ` tom.cooksey 2013-08-07 16:17 ` Rob Clark -1 siblings, 1 reply; 131+ messages in thread From: tom.cooksey @ 2013-07-25 17:17 UTC (permalink / raw) To: dri-devel; +Cc: linux-fbdev, pawel.moll, linux-arm-kernel From: Tom Cooksey <tom.cooksey@arm.com> This is a mode-setting driver for the pl111 CLCD display controller found on various ARM reference platforms such as the Versatile Express. The driver supports setting of a single mode (640x480) and has only been tested on Versatile Express with a Cortex-A9 core tile. Known issues: * It still includes code to use KDS, which is not going upstream. * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also allocate buffers for the GPU. * The v_sync handling needs work - a work queue is a little overkill. * Doesn't support the import half of PRIME properly, only export * Need to validate src rectangle size in when updating the cursor plane. * Only supports 640x480 mode, which is hard-coded. * We register both an amba & platform driver, only need one. Signed-off-by: Tom Cooksey <tom.cooksey@arm.com> --- drivers/gpu/drm/Kconfig | 2 + drivers/gpu/drm/Makefile | 1 + drivers/gpu/drm/pl111/Kbuild | 14 + drivers/gpu/drm/pl111/Kconfig | 9 + drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ 19 files changed, 3268 insertions(+) create mode 100644 drivers/gpu/drm/pl111/Kbuild create mode 100644 drivers/gpu/drm/pl111/Kconfig create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index a7c54c8..4f743f3 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -223,3 +223,5 @@ source "drivers/gpu/drm/omapdrm/Kconfig" source "drivers/gpu/drm/tilcdc/Kconfig" source "drivers/gpu/drm/qxl/Kconfig" + +source "drivers/gpu/drm/pl111/Kconfig" diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile index 801bcaf..2ec0181 100644 --- a/drivers/gpu/drm/Makefile +++ b/drivers/gpu/drm/Makefile @@ -54,4 +54,5 @@ obj-$(CONFIG_DRM_SHMOBILE) +=shmobile/ obj-$(CONFIG_DRM_OMAP) += omapdrm/ obj-$(CONFIG_DRM_TILCDC) += tilcdc/ obj-$(CONFIG_DRM_QXL) += qxl/ +obj-$(CONFIG_DRM_PL111) += pl111/ obj-y += i2c/ diff --git a/drivers/gpu/drm/pl111/Kbuild b/drivers/gpu/drm/pl111/Kbuild new file mode 100644 index 0000000..5dbd333 --- /dev/null +++ b/drivers/gpu/drm/pl111/Kbuild @@ -0,0 +1,14 @@ +pl111_drm-y += pl111_drm_device.o \ + pl111_drm_connector.o \ + pl111_drm_crtc.o \ + pl111_drm_cursor.o \ + pl111_drm_dma_buf.o \ + pl111_drm_encoder.o \ + pl111_drm_fb.o \ + pl111_drm_gem.o \ + pl111_drm_pl111.o \ + pl111_drm_platform.o \ + pl111_drm_suspend.o \ + pl111_drm_vma.o + +obj-$(CONFIG_DRM_PL111) += pl111_drm.o diff --git a/drivers/gpu/drm/pl111/Kconfig b/drivers/gpu/drm/pl111/Kconfig new file mode 100644 index 0000000..6aa4739 --- /dev/null +++ b/drivers/gpu/drm/pl111/Kconfig @@ -0,0 +1,9 @@ +config DRM_PL111 + tristate "DRM Support for PL111 CLCD Controller" + depends on DRM + select DRM_KMS_HELPER + select VT_HW_CONSOLE_BINDING if FRAMEBUFFER_CONSOLE + help + Choose this option for DRM support for the PL111 CLCD controller. + If M is selected the module will be called pl111_drm. + diff --git a/drivers/gpu/drm/pl111/pl111_clcd_ext.h b/drivers/gpu/drm/pl111/pl111_clcd_ext.h new file mode 100644 index 0000000..06e424c --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_clcd_ext.h @@ -0,0 +1,78 @@ +/* + * Support for PL111 + * + * Portions (C) COPYRIGHT 2011-2013 ARM Limited. All rights reserved. + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ +/** + * pl111_clcd_ext.h + * Extended CLCD register definitions + */ + +#ifndef PL111_CLCD_EXT_H_ +#define PL111_CLCD_EXT_H_ + +/* PL111 cursor register definitions not defined in the kernel's clcd header */ + +#define CLCD_CRSR_IMAGE 0x00000800 + +#define CLCD_CRSR_IMAGE_MAX_WORDS 256 + +#define CLCD_CRSR_CTRL 0x00000c00 +#define CLCD_CRSR_CONFIG 0x00000c04 +#define CLCD_CRSR_PALETTE_0 0x00000c08 +#define CLCD_CRSR_PALETTE_1 0x00000c0c +#define CLCD_CRSR_XY 0x00000c10 +#define CLCD_CRSR_CLIP 0x00000c14 +#define CLCD_CRSR_IMSC 0x00000c20 +#define CLCD_CRSR_ICR 0x00000c24 +#define CLCD_CRSR_RIS 0x00000c28 +#define CLCD_MIS 0x00000c2c + +#define CRSR_CTRL_CRSR_ON (1 << 0) +#define CRSR_CTRL_CRSR_MAX 3 +#define CRSR_CTRL_CRSR_NUM_SHIFT 4 +#define CRSR_CTRL_CRSR_NUM_MASK \ + (CRSR_CTRL_CRSR_MAX << CRSR_CTRL_CRSR_NUM_SHIFT) +#define CRSR_CTRL_CURSOR_0 0 +#define CRSR_CTRL_CURSOR_1 1 +#define CRSR_CTRL_CURSOR_2 2 +#define CRSR_CTRL_CURSOR_3 3 + +#define CRSR_CONFIG_CRSR_SIZE (1 << 0) +#define CRSR_CONFIG_CRSR_FRAME_SYNC (1 << 1) + +#define CRSR_PALETTE_RED_SHIFT 0 +#define CRSR_PALETTE_GREEN_SHIFT 8 +#define CRSR_PALETTE_BLUE_SHIFT 16 + +#define CRSR_PALETTE_RED_MASK 0x000000ff +#define CRSR_PALETTE_GREEN_MASK 0x0000ff00 +#define CRSR_PALETTE_BLUE_MASK 0x00ff0000 +#define CRSR_PALETTE_MASK (~0xff000000) + +#define CRSR_XY_MASK 0x000003ff +#define CRSR_XY_X_SHIFT 0 +#define CRSR_XY_Y_SHIFT 16 + +#define CRSR_XY_X_MASK CRSR_XY_MASK +#define CRSR_XY_Y_MASK (CRSR_XY_MASK << CRSR_XY_Y_SHIFT) + +#define CRSR_CLIP_MASK 0x3f +#define CRSR_CLIP_X_SHIFT 0 +#define CRSR_CLIP_Y_SHIFT 8 + +#define CRSR_CLIP_X_MASK CRSR_CLIP_MASK +#define CRSR_CLIP_Y_MASK (CRSR_CLIP_MASK << CRSR_CLIP_Y_SHIFT) + +#define CRSR_IMSC_CRSR_IM (1<<0) +#define CRSR_ICR_CRSR_IC (1<<0) +#define CRSR_RIS_CRSR_RIS (1<<0) +#define CRSR_MIS_CRSR_MIS (1<<0) + +#endif /* PL111_CLCD_EXT_H_ */ diff --git a/drivers/gpu/drm/pl111/pl111_drm.h b/drivers/gpu/drm/pl111/pl111_drm.h new file mode 100644 index 0000000..6d39a8b --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm.h @@ -0,0 +1,227 @@ +/* + * + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +#ifndef _PL111_DRM_H_ +#define _PL111_DRM_H_ + +/* Defines for drm_mode_create_dumb flags settings */ +#define PL111_BO_SCANOUT 0x00000001 /* scanout compatible buffer requested */ + +#define DRIVER_AUTHOR "ARM Ltd." +#define DRIVER_NAME "pl111_drm" +#define DRIVER_DESC "DRM module for PL111" +#define DRIVER_LICENCE "GPL" +#define DRIVER_ALIAS "platform:pl111_drm" +#define DRIVER_DATE "20101111" +#define DRIVER_VERSION "0.2" +#define DRIVER_MAJOR 2 +#define DRIVER_MINOR 1 +#define DRIVER_PATCHLEVEL 1 + +/* + * Number of flips allowed in flight at any one time. Any more flips requested + * beyond this value will cause the caller to block until earlier flips have + * completed. + * + * For performance reasons, this must be greater than the number of buffers + * used in the rendering pipeline. Note that the rendering pipeline can contain + * different types of buffer, e.g.: + * - 2 final framebuffers + * - >2 geometry buffers for GPU use-cases + * - >2 vertex buffers for GPU use-cases + * + * For example, a system using 5 geometry buffers could have 5 flips in flight, + * and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 5 or greater. + * + * Whilst there may be more intermediate buffers (such as vertex/geometry) than + * final framebuffers, KDS is used to ensure that GPU rendering waits for the + * next off-screen buffer, so it doesn't overwrite an on-screen buffer and + * produce tearing. + */ + +/* + * Here, we choose a conservative value. A lower value is most likely + * suitable for GPU use-cases. + */ +#define NR_FLIPS_IN_FLIGHT_THRESHOLD 16 + +#define CLCD_IRQ_NEXTBASE_UPDATE (1u<<2) + +struct pl111_drm_flip_resource; +struct pl111_drm_cursor_plane; + +enum pl111_bo_type { + PL111_BOT_DMA, + PL111_BOT_SHM +}; + +struct pl111_gem_bo_dma { + dma_addr_t fb_dev_addr; + void *fb_cpu_addr; +}; + +struct pl111_gem_bo_shm { + struct page **pages; + dma_addr_t *dma_addrs; +}; + +struct pl111_gem_bo { + struct drm_gem_object gem_object; + enum pl111_bo_type type; + union { + struct pl111_gem_bo_dma dma; + struct pl111_gem_bo_shm shm; + } backing_data; + struct drm_framebuffer *fb; +}; + +extern struct pl111_drm_dev_private priv; + +struct pl111_drm_framebuffer { + struct drm_framebuffer fb; + struct pl111_gem_bo *bo; +}; + +struct pl111_drm_flip_resource { +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + struct kds_resource_set *kds_res_set; + int worker_release_kds; +#endif + struct drm_framebuffer *fb; + struct drm_crtc *crtc; + struct work_struct vsync_work; + struct list_head link; + bool page_flip; + struct drm_pending_vblank_event *event; +}; + +#define MAX_CURSOR_FORMATS (1) + +struct pl111_drm_cursor_plane { + struct drm_plane base; + uint32_t *formats; + uint32_t num_formats_supported; +}; + +struct pl111_drm_crtc { + struct drm_crtc crtc; + int crtc_index; + + spinlock_t current_displaying_lock; + spinlock_t base_update_lock; +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + struct kds_resource_set *old_kds_res_set; +#endif + struct drm_framebuffer *displaying_fb; + + struct drm_display_mode *new_mode; + struct drm_display_mode *current_mode; + int last_bpp; + + /* + * The resource that caused a base address update. Only one can be + * pending, hence it's != NULL if there's a pending update + */ + struct pl111_drm_flip_resource *current_update_res; + /* Queue of things waiting to update the base address */ + struct list_head update_queue; + + struct workqueue_struct *vsync_wq; + + struct pl111_drm_cursor_plane cursor; + + void (*show_framebuffer_cb)(struct pl111_drm_flip_resource *flip_res, + struct drm_framebuffer *fb); +}; + +struct pl111_drm_connector { + struct drm_connector connector; +}; + +struct pl111_drm_encoder { + struct drm_encoder encoder; +}; + +struct pl111_drm_dev_private { + struct pl111_drm_crtc *pl111_crtc; + + struct amba_device *amba_dev; + unsigned long mmio_start; + __u32 mmio_len; + void *regs; + struct clk *clk; +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + struct kds_callback kds_cb; + struct kds_callback kds_obtain_current_cb; +#endif + /* + * Number of flips that were started in show_framebuffer_on_crtc(), + * but haven't completed yet - because we do deferred flipping + */ + atomic_t nr_flips_in_flight; + wait_queue_head_t wait_for_flips; + + /* + * Used to prevent race between pl111_dma_buf_release and + * drm_gem_prime_handle_to_fd + */ + struct mutex export_dma_buf_lock; + + uint32_t number_crtcs; + + /* Cache for flip resources used to avoid kmalloc on each page flip */ + struct kmem_cache *page_flip_slab; +}; + +enum pl111_cursor_size { + CURSOR_32X32, + CURSOR_64X64 +}; + +enum pl111_cursor_sync { + CURSOR_SYNC_NONE, + CURSOR_SYNC_VSYNC +}; + +#define PL111_FB_FROM_FRAMEBUFFER(drm_fb) \ + (container_of(drm_fb, struct pl111_drm_framebuffer, fb)) + +#define PL111_BO_FROM_FRAMEBUFFER(drm_fb) \ + (container_of(drm_fb, struct pl111_drm_framebuffer, fb)->bo) + +#define PL111_BO_TO_FRAMEBUFFER(drm_fb, bo) \ + do { \ + container_of(drm_fb, \ + struct pl111_drm_framebuffer, fb)->bo = bo; \ + bo->fb = fb; \ + } while (0) + +#define PL111_BO_FROM_GEM(gem_obj) \ + container_of(gem_obj, struct pl111_gem_bo, gem_object) + +#define to_pl111_crtc(x) container_of(x, struct pl111_drm_crtc, crtc) + +#define PL111_ENCODER_FROM_ENCODER(x) \ + container_of(x, struct pl111_drm_encoder, encoder) + +#define PL111_CONNECTOR_FROM_CONNECTOR(x) \ + container_of(x, struct pl111_drm_connector, connector) + +#include "pl111_drm_funcs.h" + +#endif /* _PL111_DRM_H_ */ diff --git a/drivers/gpu/drm/pl111/pl111_drm_connector.c b/drivers/gpu/drm/pl111/pl111_drm_connector.c new file mode 100644 index 0000000..304a5be --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_connector.c @@ -0,0 +1,166 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_connector.c + * Implementation of the connector functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + + +static struct { + int w, h, type; +} pl111_drm_modes[] = { + { 640, 480, DRM_MODE_TYPE_PREFERRED}, + { 800, 600, 0}, + {1024, 768, 0}, + { -1, -1, -1} +}; + +void pl111_connector_destroy(struct drm_connector *connector) +{ + struct pl111_drm_connector *pl111_connector = + PL111_CONNECTOR_FROM_CONNECTOR(connector); + + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); + + drm_sysfs_connector_remove(connector); + drm_connector_cleanup(connector); + kfree(pl111_connector); +} + +enum drm_connector_status pl111_connector_detect(struct drm_connector + *connector, bool force) +{ + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); + return connector_status_connected; +} + +void pl111_connector_dpms(struct drm_connector *connector, int mode) +{ + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); +} + +struct drm_encoder * +pl111_connector_helper_best_encoder(struct drm_connector *connector) +{ + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); + + if (connector->encoder != NULL) { + return connector->encoder; /* Return attached encoder */ + } else { + /* + * If there is no attached encoder we choose the best candidate + * from the list. + * For PL111 there is only one encoder so we return the first + * one we find. + * Other h/w would require a suitable criterion below. + */ + struct drm_encoder *encoder = NULL; + struct drm_device *dev = connector->dev; + + list_for_each_entry(encoder, &dev->mode_config.encoder_list, + head) { + if (1) { /* criterion ? */ + break; + } + } + return encoder; /* return best candidate encoder */ + } +} + +int pl111_connector_helper_get_modes(struct drm_connector *connector) +{ + int i = 0; + int count = 0; + + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); + + while (pl111_drm_modes[i].w != -1) { + struct drm_display_mode *mode = + drm_mode_find_dmt(connector->dev, + pl111_drm_modes[i].w, + pl111_drm_modes[i].h, + 60, + false); + + if (mode != NULL) { + mode->type |= pl111_drm_modes[i].type; + drm_mode_probed_add(connector, mode); + count++; + } + + i++; + } + + DRM_DEBUG_KMS("found %d modes\n", count); + + return count; +} + +int pl111_connector_helper_mode_valid(struct drm_connector *connector, + struct drm_display_mode *mode) +{ + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); + return MODE_OK; +} + +const struct drm_connector_funcs connector_funcs = { + .fill_modes = drm_helper_probe_single_connector_modes, + .destroy = pl111_connector_destroy, + .detect = pl111_connector_detect, + .dpms = pl111_connector_dpms, +}; + +const struct drm_connector_helper_funcs connector_helper_funcs = { + .get_modes = pl111_connector_helper_get_modes, + .mode_valid = pl111_connector_helper_mode_valid, + .best_encoder = pl111_connector_helper_best_encoder, +}; + +struct pl111_drm_connector *pl111_connector_create(struct drm_device *dev) +{ + struct pl111_drm_connector *pl111_connector; + + pl111_connector = kzalloc(sizeof(struct pl111_drm_connector), + GFP_KERNEL); + + if (pl111_connector == NULL) { + pr_err("Failed to allocated pl111_drm_connector\n"); + return NULL; + } + + drm_connector_init(dev, &pl111_connector->connector, &connector_funcs, + DRM_MODE_CONNECTOR_DVII); + + drm_connector_helper_add(&pl111_connector->connector, + &connector_helper_funcs); + + drm_sysfs_connector_add(&pl111_connector->connector); + + return pl111_connector; +} + diff --git a/drivers/gpu/drm/pl111/pl111_drm_crtc.c b/drivers/gpu/drm/pl111/pl111_drm_crtc.c new file mode 100644 index 0000000..1f8efbe --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_crtc.c @@ -0,0 +1,432 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_crtc.c + * Implementation of the CRTC functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + +static int pl111_crtc_num; + +static void vsync_worker(struct work_struct *work) +{ + struct pl111_drm_flip_resource *flip_res; + struct pl111_gem_bo *bo; + struct pl111_drm_crtc *pl111_crtc; + struct drm_device *dev; + int flips_in_flight; + flip_res = + container_of(work, struct pl111_drm_flip_resource, vsync_work); + + pl111_crtc = to_pl111_crtc(flip_res->crtc); + dev = pl111_crtc->crtc.dev; + + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); + + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + if (flip_res->worker_release_kds == true) { + spin_lock(&pl111_crtc->current_displaying_lock); + release_kds_resource_and_display(flip_res); + spin_unlock(&pl111_crtc->current_displaying_lock); + } +#endif + /* Release DMA buffer on this flip */ + if (bo->gem_object.export_dma_buf != NULL) + dma_buf_put(bo->gem_object.export_dma_buf); + + drm_handle_vblank(dev, pl111_crtc->crtc_index); + + /* Wake up any processes waiting for page flip event */ + if (flip_res->event) { + spin_lock_bh(&dev->event_lock); + drm_send_vblank_event(dev, pl111_crtc->crtc_index, + flip_res->event); + spin_unlock_bh(&dev->event_lock); + } + + drm_vblank_put(dev, pl111_crtc->crtc_index); + + /* + * workqueue.c:process_one_work(): + * "It is permissible to free the struct work_struct from + * inside the function that is called from it" + */ + kmem_cache_free(priv.page_flip_slab, flip_res); + + flips_in_flight = atomic_dec_return(&priv.nr_flips_in_flight); + if (flips_in_flight == 0 || + flips_in_flight == (NR_FLIPS_IN_FLIGHT_THRESHOLD - 1)) + wake_up(&priv.wait_for_flips); + + DRM_DEBUG_KMS("DRM release flip_res=%p\n", flip_res); +} + +void pl111_common_irq(struct pl111_drm_crtc *pl111_crtc) +{ + unsigned long irq_flags; + + spin_lock_irqsave(&pl111_crtc->base_update_lock, irq_flags); + + if (pl111_crtc->current_update_res != NULL) { +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + /* + * If the lock is not acquired defer completion of the + * resource that caused the buffer update + */ + pl111_crtc->current_update_res->worker_release_kds = + true; + if (0 != spin_trylock( + &pl111_crtc->current_displaying_lock)) { + /* release the resource immediately */ + release_kds_resource_and_display( + pl111_crtc->current_update_res); + /* + * prevent worker from attempting to release + * resource again + */ + pl111_crtc->current_update_res-> + worker_release_kds = false; + spin_unlock(&pl111_crtc-> + current_displaying_lock); + } +#endif + /* + * Release dma_buf and resource + * (if not already released) + */ + queue_work(pl111_crtc->vsync_wq, + &pl111_crtc->current_update_res->vsync_work); + pl111_crtc->current_update_res = NULL; + } + + if (!list_empty(&pl111_crtc->update_queue)) { + struct pl111_drm_flip_resource *flip_res; + /* Remove the head of the list */ + flip_res = list_first_entry(&pl111_crtc->update_queue, + struct pl111_drm_flip_resource, link); + list_del(&flip_res->link); + do_flip_to_res(flip_res); + /* + * current_update_res will be set, so guarentees that + * another flip_res coming in gets queued instead of + * handled immediately + */ + } + + spin_unlock_irqrestore(&pl111_crtc->base_update_lock, irq_flags); +} + +void show_framebuffer_on_crtc_cb(void *cb1, void *cb2) +{ + struct pl111_drm_flip_resource *flip_res = cb1; + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); + + pl111_crtc->show_framebuffer_cb(cb1, cb2); +} + +int show_framebuffer_on_crtc(struct drm_crtc *crtc, + struct drm_framebuffer *fb, bool page_flip, + struct drm_pending_vblank_event *event) +{ + struct pl111_gem_bo *bo; + struct pl111_drm_flip_resource *flip_res; + int flips_in_flight; + int old_flips_in_flight; + + crtc->fb = fb; + + bo = PL111_BO_FROM_FRAMEBUFFER(fb); + if (bo == NULL) { + DRM_DEBUG_KMS("Failed to get pl111_gem_bo object\n"); + return -EINVAL; + } + + /* If this is a full modeset, wait for all outstanding flips to complete + * before continuing. This avoids unnecessary complication from being + * able to queue up multiple modesets and queues of mixed modesets and + * page flips. + * + * Modesets should be uncommon and will not be performant anyway, so + * making them synchronous should have negligible performance impact. + */ + if (!page_flip) { + int ret = wait_event_killable(priv.wait_for_flips, + atomic_read(&priv.nr_flips_in_flight) == 0); + if (ret) + return ret; + } + + /* + * There can be more 'early display' flips in flight than there are + * buffers, and there is (currently) no explicit bound on the number of + * flips. Hence, we need a new allocation for each one. + * + * Note: this could be optimized down if we knew a bound on the flips, + * since an application can only have so many buffers in flight to be + * useful/not hog all the memory + */ + flip_res = kmem_cache_alloc(priv.page_flip_slab, GFP_KERNEL); + if (flip_res == NULL) { + pr_err("kmem_cache_alloc failed to alloc - flip ignored\n"); + return -ENOMEM; + } + + /* + * increment flips in flight, whilst blocking when we reach + * NR_FLIPS_IN_FLIGHT_THRESHOLD + */ + do { + /* + * Note: use of assign-and-then-compare in the condition to set + * flips_in_flight + */ + int ret = wait_event_killable(priv.wait_for_flips, + (flips_in_flight = + atomic_read(&priv.nr_flips_in_flight)) + < NR_FLIPS_IN_FLIGHT_THRESHOLD); + if (ret != 0) { + kmem_cache_free(priv.page_flip_slab, flip_res); + return ret; + } + + old_flips_in_flight = atomic_cmpxchg(&priv.nr_flips_in_flight, + flips_in_flight, flips_in_flight + 1); + } while (old_flips_in_flight != flips_in_flight); + + flip_res->fb = fb; + flip_res->crtc = crtc; + flip_res->page_flip = page_flip; + flip_res->event = event; + INIT_WORK(&flip_res->vsync_work, vsync_worker); + INIT_LIST_HEAD(&flip_res->link); + DRM_DEBUG_KMS("DRM alloc flip_res=%p\n", flip_res); +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + if (bo->gem_object.export_dma_buf != NULL) { + struct dma_buf *buf = bo->gem_object.export_dma_buf; + unsigned long shared[1] = { 0 }; + struct kds_resource *resource_list[1] = { + get_dma_buf_kds_resource(buf) }; + int err; + + get_dma_buf(buf); + DRM_DEBUG_KMS("Got dma_buf %p\n", buf); + + /* Wait for the KDS resource associated with this buffer */ + err = kds_async_waitall(&flip_res->kds_res_set, + &priv.kds_cb, flip_res, fb, 1, shared, + resource_list); + BUG_ON(err); + } else { + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + + DRM_DEBUG_KMS("No dma_buf for this flip\n"); + + /* No dma-buf attached so just call the callback directly */ + flip_res->kds_res_set = NULL; + pl111_crtc->show_framebuffer_cb(flip_res, fb); + } +#else + if (bo->gem_object.export_dma_buf != NULL) { + struct dma_buf *buf = bo->gem_object.export_dma_buf; + + get_dma_buf(buf); + DRM_DEBUG_KMS("Got dma_buf %p\n", buf); + } else { + DRM_DEBUG_KMS("No dma_buf for this flip\n"); + } + + /* No dma-buf attached to this so just call the callback directly */ + { + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + pl111_crtc->show_framebuffer_cb(flip_res, fb); + } +#endif + + /* For the same reasons as the wait at the start of this function, + * wait for the modeset to complete before continuing. + */ + if (!page_flip) { + int ret = wait_event_killable(priv.wait_for_flips, + flips_in_flight == 0); + if (ret) + return ret; + } + + return 0; +} + +int pl111_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb, + struct drm_pending_vblank_event *event) +{ + DRM_DEBUG_KMS("%s: crtc=%p, fb=%p, event=%p\n", + __func__, crtc, fb, event); + return show_framebuffer_on_crtc(crtc, fb, true, event); +} + +int pl111_crtc_helper_mode_set(struct drm_crtc *crtc, + struct drm_display_mode *mode, + struct drm_display_mode *adjusted_mode, + int x, int y, struct drm_framebuffer *old_fb) +{ + int ret; + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + struct drm_display_mode *duplicated_mode; + + DRM_DEBUG_KMS("DRM crtc_helper_mode_set, x=%d y=%d bpp=%d\n", + adjusted_mode->hdisplay, adjusted_mode->vdisplay, + crtc->fb->bits_per_pixel); + + duplicated_mode = drm_mode_duplicate(crtc->dev, adjusted_mode); + if (!duplicated_mode) + return -ENOMEM; + + pl111_crtc->new_mode = duplicated_mode; + ret = show_framebuffer_on_crtc(crtc, crtc->fb, false, NULL); + if (ret != 0) { + pl111_crtc->new_mode = pl111_crtc->current_mode; + drm_mode_destroy(crtc->dev, duplicated_mode); + } + + return ret; +} + +void pl111_crtc_helper_prepare(struct drm_crtc *crtc) +{ + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); +} + +void pl111_crtc_helper_commit(struct drm_crtc *crtc) +{ + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); +} + +bool pl111_crtc_helper_mode_fixup(struct drm_crtc *crtc, + struct drm_display_mode *mode, + struct drm_display_mode *adjusted_mode) +{ + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); + +#ifdef CONFIG_ARCH_VEXPRESS + /* + * 1024x768 with more than 16 bits per pixel does not work correctly + * on Versatile Express + */ + if (mode->hdisplay == 1024 && mode->vdisplay == 768 && + crtc->fb->bits_per_pixel > 16) { + return false; + } +#endif + + return true; +} + +void pl111_crtc_helper_disable(struct drm_crtc *crtc) +{ + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); + clcd_disable(crtc); +} + +void pl111_crtc_destroy(struct drm_crtc *crtc) +{ + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); + + destroy_workqueue(pl111_crtc->vsync_wq); + drm_crtc_cleanup(crtc); + kfree(pl111_crtc); +} + +const struct drm_crtc_funcs crtc_funcs = { + .set_config = drm_crtc_helper_set_config, + .page_flip = pl111_crtc_page_flip, + .destroy = pl111_crtc_destroy +}; + +const struct drm_crtc_helper_funcs crtc_helper_funcs = { + .mode_set = pl111_crtc_helper_mode_set, + .prepare = pl111_crtc_helper_prepare, + .commit = pl111_crtc_helper_commit, + .mode_fixup = pl111_crtc_helper_mode_fixup, + .disable = pl111_crtc_helper_disable, +}; + +bool pl111_crtc_is_fb_currently_displayed(struct drm_device *dev, + struct drm_framebuffer *fb) +{ + struct drm_crtc *crtc; + + if (fb == NULL) + return false; + + list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) { + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + if (pl111_crtc->displaying_fb == fb) + return true; + } + return false; +} + +struct pl111_drm_crtc *pl111_crtc_create(struct drm_device *dev) +{ + struct pl111_drm_crtc *pl111_crtc; + + pl111_crtc = kzalloc(sizeof(struct pl111_drm_crtc), GFP_KERNEL); + if (pl111_crtc == NULL) { + pr_err("Failed to allocated pl111_drm_crtc\n"); + return NULL; + } + + drm_crtc_init(dev, &pl111_crtc->crtc, &crtc_funcs); + drm_crtc_helper_add(&pl111_crtc->crtc, &crtc_helper_funcs); + + pl111_crtc->crtc_index = pl111_crtc_num; + pl111_crtc_num++; + pl111_crtc->vsync_wq = alloc_ordered_workqueue("pl111_drm_vsync_%d", + WQ_HIGHPRI, pl111_crtc->crtc_index); + if (!pl111_crtc->vsync_wq) { + pr_err("Failed to allocate vsync workqueue\n"); + drm_crtc_cleanup(&pl111_crtc->crtc); + return NULL; + } + + pl111_crtc->crtc.enabled = 0; + pl111_crtc->displaying_fb = NULL; + pl111_crtc->last_bpp = 0; + pl111_crtc->current_update_res = NULL; +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + pl111_crtc->old_kds_res_set = NULL; +#endif + pl111_crtc->show_framebuffer_cb = show_framebuffer_on_crtc_cb_internal; + INIT_LIST_HEAD(&pl111_crtc->update_queue); + spin_lock_init(&pl111_crtc->current_displaying_lock); + spin_lock_init(&pl111_crtc->base_update_lock); + + return pl111_crtc; +} + diff --git a/drivers/gpu/drm/pl111/pl111_drm_cursor.c b/drivers/gpu/drm/pl111/pl111_drm_cursor.c new file mode 100644 index 0000000..6be2a55 --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_cursor.c @@ -0,0 +1,97 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_cursor.c + * Implementation of cursor functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> +#include "pl111_drm.h" + +#define PL111_MAX_CURSOR_WIDTH (64) +#define PL111_MAX_CURSOR_HEIGHT (64) + +static int pl111_drm_cursor_plane_disable(struct drm_plane *plane) +{ + pl111_cursor_disable(); + return 0; +} + +static int pl111_drm_cursor_plane_update(struct drm_plane *plane, + struct drm_crtc *crtc, struct drm_framebuffer *fb, + int crtc_x, int crtc_y, + unsigned int crtc_w, unsigned int crtc_h, + uint32_t src_x, uint32_t src_y, + uint32_t src_w, uint32_t src_h) +{ + struct pl111_gem_bo *bo = PL111_BO_FROM_FRAMEBUFFER(fb); + + /* + * TODO Find out if there is a way to know if the image needs changing. + * The cursor API might be better for us than planes as that has + * distinct set cursor image and set cursor position call backs. + */ + + pl111_set_cursor_image(bo->backing_data.dma.fb_cpu_addr); + + pl111_cursor_enable(); + pl111_set_cursor_position(crtc_x, crtc_y); + + return 0; +} + +void pl111_drm_cursor_plane_destroy(struct drm_plane *plane) +{ + pl111_drm_cursor_plane_disable(plane); + drm_plane_cleanup(plane); +} + +static const struct drm_plane_funcs pl111_drm_cursor_plane_funcs = { + .update_plane = pl111_drm_cursor_plane_update, + .disable_plane = pl111_drm_cursor_plane_disable, + .destroy = pl111_drm_cursor_plane_destroy, +}; + +/* + * We don't actually support ARGB8888 for the cursor only PL111 LBBP, the + * rasult of setting this is that it creates a buffer larger than we actually + * need. But there are no compatible formats defined in fourcc.h, so we will + * only read 256 32 bits words from the buffer to set the cursor image. + * We expect user space to have formatted the buffer correctly to LBBP. + */ +static uint32_t pl111_cursor_formats[] = { DRM_FORMAT_ARGB8888 }; + +int pl111_cursor_plane_init(struct drm_device *dev, + struct pl111_drm_cursor_plane *cursor, + unsigned long possible_crtcs) +{ + cursor->formats = pl111_cursor_formats; + cursor->num_formats_supported = ARRAY_SIZE(pl111_cursor_formats); + + return drm_plane_init(dev, &cursor->base, possible_crtcs, + &pl111_drm_cursor_plane_funcs, cursor->formats, + cursor->num_formats_supported, false); +} + + diff --git a/drivers/gpu/drm/pl111/pl111_drm_device.c b/drivers/gpu/drm/pl111/pl111_drm_device.c new file mode 100644 index 0000000..4ade09a --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_device.c @@ -0,0 +1,319 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_device.c + * Implementation of the Linux device driver entrypoints for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + +struct pl111_drm_dev_private priv; + +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS +static void initial_kds_obtained(void *cb1, void *cb2) +{ + wait_queue_head_t *wait = (wait_queue_head_t *) cb1; + bool *cb_has_called = (bool *) cb2; + + *cb_has_called = true; + wake_up(wait); +} + +/* Must be called from within current_displaying_lock spinlock */ +void release_kds_resource_and_display(struct pl111_drm_flip_resource *flip_res) +{ + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); + pl111_crtc->displaying_fb = flip_res->fb; + + /* Release the previous buffer */ + if (pl111_crtc->old_kds_res_set != NULL) { + /* + * Can flip to the same buffer, but must not release the current + * resource set + */ + BUG_ON(pl111_crtc->old_kds_res_set == flip_res->kds_res_set); + kds_resource_set_release(&pl111_crtc->old_kds_res_set); + } + /* Record the current buffer, to release on the next buffer flip */ + pl111_crtc->old_kds_res_set = flip_res->kds_res_set; +} +#endif + +void pl111_drm_preclose(struct drm_device *dev, struct drm_file *file_priv) +{ + DRM_DEBUG_KMS("DRM %s on dev=%p\n", __func__, dev); +} + +void pl111_drm_lastclose(struct drm_device *dev) +{ + DRM_DEBUG_KMS("DRM %s on dev=%p\n", __func__, dev); +} + +/* + * pl111 does not have a proper HW counter for vblank IRQs so enable_vblank + * and disable_vblank are just no op callbacks. + */ +static int pl111_enable_vblank(struct drm_device *dev, int crtc) +{ + DRM_DEBUG_KMS("%s: dev=%p, crtc=%d", __func__, dev, crtc); + return 0; +} + +static void pl111_disable_vblank(struct drm_device *dev, int crtc) +{ + DRM_DEBUG_KMS("%s: dev=%p, crtc=%d", __func__, dev, crtc); +} + +struct drm_mode_config_funcs mode_config_funcs = { + .fb_create = pl111_fb_create, +}; + +static int pl111_modeset_init(struct drm_device *dev) +{ + struct drm_mode_config *mode_config; + struct pl111_drm_dev_private *priv = dev->dev_private; + struct pl111_drm_connector *pl111_connector; + struct pl111_drm_encoder *pl111_encoder; + int ret = 0; + + if (priv == NULL) + return -EINVAL; + + drm_mode_config_init(dev); + mode_config = &dev->mode_config; + mode_config->funcs = &mode_config_funcs; + mode_config->min_width = 1; + mode_config->max_width = 1024; + mode_config->min_height = 1; + mode_config->max_height = 768; + + priv->pl111_crtc = pl111_crtc_create(dev); + if (priv->pl111_crtc == NULL) { + pr_err("Failed to create pl111_drm_crtc\n"); + ret = -ENOMEM; + goto out_config; + } + + priv->number_crtcs = 1; + + pl111_connector = pl111_connector_create(dev); + if (pl111_connector == NULL) { + pr_err("Failed to create pl111_drm_connector\n"); + ret = -ENOMEM; + goto out_config; + } + + pl111_encoder = pl111_encoder_create(dev, 1); + if (pl111_encoder == NULL) { + pr_err("Failed to create pl111_drm_encoder\n"); + ret = -ENOMEM; + goto out_config; + } + + ret = drm_mode_connector_attach_encoder(&pl111_connector->connector, + &pl111_encoder->encoder); + if (ret != 0) { + DRM_ERROR("Failed to attach encoder\n"); + goto out_config; + } + + pl111_connector->connector.encoder = &pl111_encoder->encoder; + + ret = pl111_cursor_plane_init(dev, &priv->pl111_crtc->cursor, 1); + if (ret != 0) { + pr_err("Failed to init cursor plane\n"); + goto out_config; + } + + goto finish; + +out_config: + drm_mode_config_cleanup(dev); +finish: + DRM_DEBUG("%s returned %d\n", __func__, ret); + return ret; +} + +static void pl111_modeset_fini(struct drm_device *dev) +{ + drm_mode_config_cleanup(dev); +} + +static int pl111_drm_load(struct drm_device *dev, unsigned long chipset) +{ + int ret = 0; + + pr_info("DRM %s\n", __func__); + + mutex_init(&priv.export_dma_buf_lock); + atomic_set(&priv.nr_flips_in_flight, 0); + init_waitqueue_head(&priv.wait_for_flips); +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + ret = kds_callback_init(&priv.kds_cb, 1, show_framebuffer_on_crtc_cb); + if (ret != 0) { + pr_err("Failed to initialise KDS callback\n"); + goto finish; + } + + ret = kds_callback_init(&priv.kds_obtain_current_cb, 1, + initial_kds_obtained); + if (ret != 0) { + pr_err("Failed to init KDS obtain callback\n"); + kds_callback_term(&priv.kds_cb); + goto finish; + } +#endif + + /* Create a cache for page flips */ + priv.page_flip_slab = kmem_cache_create("page flip slab", + sizeof(struct pl111_drm_flip_resource), 0, 0, NULL); + if (priv.page_flip_slab == NULL) { + DRM_ERROR("Failed to create slab\n"); + ret = -ENOMEM; + goto out_kds_callbacks; + } + + dev->dev_private = &priv; + + ret = pl111_modeset_init(dev); + if (ret != 0) { + pr_err("Failed to init modeset\n"); + goto out_slab; + } + + ret = pl111_device_init(dev); + if (ret != 0) { + DRM_ERROR("Failed to init MMIO and IRQ\n"); + goto out_modeset; + } + + ret = drm_vblank_init(dev, 1); + if (ret != 0) { + DRM_ERROR("Failed to init vblank\n"); + goto out_vblank; + } + + goto finish; + +out_vblank: + pl111_device_fini(dev); +out_modeset: + pl111_modeset_fini(dev); +out_slab: + kmem_cache_destroy(priv.page_flip_slab); +out_kds_callbacks: +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + kds_callback_term(&priv.kds_obtain_current_cb); + kds_callback_term(&priv.kds_cb); +#endif +finish: + DRM_DEBUG_KMS("pl111_drm_load returned %d\n", ret); + return ret; +} + +static int pl111_drm_unload(struct drm_device *dev) +{ + pr_info("DRM %s\n", __func__); + + kmem_cache_destroy(priv.page_flip_slab); + + drm_vblank_cleanup(dev); + pl111_modeset_fini(dev); + pl111_device_fini(dev); + +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + kds_callback_term(&priv.kds_obtain_current_cb); + kds_callback_term(&priv.kds_cb); +#endif + return 0; +} + +static struct vm_operations_struct pl111_gem_vm_ops = { + .fault = pl111_gem_fault, + .open = drm_gem_vm_open, + .close = drm_gem_vm_close, +}; + +static const struct file_operations drm_fops = { + .owner = THIS_MODULE, + .open = drm_open, + .release = drm_release, + .unlocked_ioctl = drm_ioctl, + .mmap = pl111_gem_mmap, + .poll = drm_poll, + .read = drm_read, + .fasync = drm_fasync, +}; + +static struct drm_driver driver = { + .driver_features = + DRIVER_MODESET | DRIVER_FB_DMA | DRIVER_GEM | DRIVER_PRIME, + .load = pl111_drm_load, + .unload = pl111_drm_unload, + .context_dtor = NULL, + .preclose = pl111_drm_preclose, + .lastclose = pl111_drm_lastclose, + .suspend = pl111_drm_suspend, + .resume = pl111_drm_resume, + .get_vblank_counter = drm_vblank_count, + .enable_vblank = pl111_enable_vblank, + .disable_vblank = pl111_disable_vblank, + .ioctls = NULL, + .fops = &drm_fops, + .name = DRIVER_NAME, + .desc = DRIVER_DESC, + .date = DRIVER_DATE, + .major = DRIVER_MAJOR, + .minor = DRIVER_MINOR, + .patchlevel = DRIVER_PATCHLEVEL, + .dumb_create = pl111_dumb_create, + .dumb_destroy = pl111_dumb_destroy, + .dumb_map_offset = pl111_dumb_map_offset, + .gem_free_object = pl111_gem_free_object, + .gem_vm_ops = &pl111_gem_vm_ops, + .prime_handle_to_fd = &pl111_prime_handle_to_fd, + .gem_prime_export = &pl111_gem_prime_export, +}; + +int pl111_drm_init(struct platform_device *dev) +{ + int ret; + pr_info("DRM %s\n", __func__); + pr_info("PL111 DRM initialize, driver name: %s, version %d.%d\n", + DRIVER_NAME, DRIVER_MAJOR, DRIVER_MINOR); + driver.num_ioctls = 0; + ret = 0; + driver.kdriver.platform_device = dev; + return drm_platform_init(&driver, dev); + +} + +void pl111_drm_exit(struct platform_device *dev) +{ + pr_info("DRM %s\n", __func__); + drm_platform_exit(&driver, dev); +} diff --git a/drivers/gpu/drm/pl111/pl111_drm_dma_buf.c b/drivers/gpu/drm/pl111/pl111_drm_dma_buf.c new file mode 100644 index 0000000..6800100 --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_dma_buf.c @@ -0,0 +1,339 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_dma_buf.c + * Implementation of the dma_buf functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS +static void obtain_kds_if_currently_displayed(struct drm_device *dev, + struct drm_framebuffer *fb, + struct dma_buf *dma_buf) +{ + unsigned long shared[1] = { 0 }; + struct kds_resource *resource_list[1]; + struct kds_resource_set *kds_res_set; + struct drm_crtc *crtc; + bool cb_has_called = false; + int err; + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wake); + + /* + * Not all pl111_gem_bo structures have a framebuffer attached - early + * out in those cases + */ + if (fb == NULL) + return; + + DRM_DEBUG_KMS("Obtaining initial KDS res for fb:%p bo:%p dma_buf:%p\n", + fb, PL111_BO_FROM_FRAMEBUFFER(fb), dma_buf); + + resource_list[0] = get_dma_buf_kds_resource(dma_buf); + get_dma_buf(dma_buf); + + /* + * Can't use kds_waitall(), because kbase will be let through due to + * locked ignore' + */ + err = kds_async_waitall(&kds_res_set, + &priv.kds_obtain_current_cb, &wake, + &cb_has_called, 1, shared, resource_list); + BUG_ON(err); + wait_event(wake, cb_has_called == true); + + list_for_each_entry(crtc, &fb->dev->mode_config.crtc_list, head) { + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + spin_lock(&pl111_crtc->current_displaying_lock); + if (pl111_crtc->displaying_fb == fb) { + DRM_DEBUG_KMS("Initial KDS resource for fb %p", fb); + DRM_DEBUG_KMS(" is being displayed, keeping\n"); + /* There shouldn't be a previous buffer to release */ + BUG_ON(pl111_crtc->old_kds_res_set); + + if (kds_res_set == NULL) { + err = kds_async_waitall(&kds_res_set, + &priv.kds_obtain_current_cb, + &wake, &cb_has_called, + 1, shared, resource_list); + BUG_ON(err); + wait_event(wake, cb_has_called == true); + } + + /* Current buffer will need releasing on next flip */ + pl111_crtc->old_kds_res_set = kds_res_set; + + /* + * Clear kds_res_set, so a new kds_res_set is allocated + * for additional CRTCs + */ + kds_res_set = NULL; + } + spin_unlock(&pl111_crtc->current_displaying_lock); + } + + /* kds_res_set will be NULL here if any CRTCs are displaying fb */ + if (kds_res_set != NULL) { + DRM_DEBUG_KMS("Initial KDS resource for fb %p", fb); + DRM_DEBUG_KMS(" not being displayed, discarding\n"); + /* They're not being displayed, release them */ + kds_resource_set_release(&kds_res_set); + } + + dma_buf_put(dma_buf); +} +#endif + +static int pl111_dma_buf_mmap(struct dma_buf *buffer, + struct vm_area_struct *vma) +{ + struct drm_gem_object *obj = buffer->priv; + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); + DRM_DEBUG_KMS("DRM %s on dma_buf=%p\n", __func__, buffer); + + return pl111_bo_mmap(obj, bo, vma, buffer->size); +} + +static void pl111_dma_buf_release(struct dma_buf *buf) +{ + /* + * Need to release the dma_buf's reference on the gem object it was + * exported from, and also clear the gem object's export_dma_buf + * pointer to this dma_buf as it no longer exists + */ + struct drm_gem_object *obj = (struct drm_gem_object *)buf->priv; + struct pl111_gem_bo *bo; +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + struct drm_crtc *crtc; +#endif + bo = PL111_BO_FROM_GEM(obj); + + DRM_DEBUG_KMS("Releasing dma_buf %p, drm_gem_obj=%p\n", buf, obj); + +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + list_for_each_entry(crtc, &bo->gem_object.dev->mode_config.crtc_list, + head) { + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + spin_lock(&pl111_crtc->current_displaying_lock); + if (pl111_crtc->displaying_fb == bo->fb) { + kds_resource_set_release(&pl111_crtc->old_kds_res_set); + pl111_crtc->old_kds_res_set = NULL; + } + spin_unlock(&pl111_crtc->current_displaying_lock); + } +#endif + mutex_lock(&priv.export_dma_buf_lock); + + obj->export_dma_buf = NULL; + drm_gem_object_unreference_unlocked(obj); + + mutex_unlock(&priv.export_dma_buf_lock); +} + +static int pl111_dma_buf_attach(struct dma_buf *buf, struct device *dev, + struct dma_buf_attachment *attach) +{ + DRM_DEBUG_KMS("Attaching dma_buf %p to device %p attach=%p\n", buf, + dev, attach); + + attach->priv = dev; + + return 0; +} + +static void pl111_dma_buf_detach(struct dma_buf *buf, + struct dma_buf_attachment *attach) +{ + DRM_DEBUG_KMS("Detaching dma_buf %p attach=%p\n", attach->dmabuf, + attach); +} + +/* Heavily from exynos_drm_dmabuf.c */ +static struct sg_table *pl111_dma_buf_map_dma_buf(struct dma_buf_attachment + *attach, + enum dma_data_direction + direction) +{ + struct drm_gem_object *obj = attach->dmabuf->priv; + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); + struct sg_table *sgt; + int ret; + int size, n_pages, nents; + + DRM_DEBUG_KMS("Mapping dma_buf %p from attach=%p\n", attach->dmabuf, + attach); + + size = obj->size; + n_pages = PAGE_ALIGN(size) >> PAGE_SHIFT; + + if (bo->type == PL111_BOT_DMA) { + sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); + if (!sgt) { + DRM_ERROR("Failed to allocate sg_table\n"); + return ERR_PTR(-ENOMEM); + } + + ret = sg_alloc_table(sgt, 1, GFP_KERNEL); + if (ret < 0) { + DRM_ERROR("Failed to allocate page table\n"); + return ERR_PTR(-ENOMEM); + } + sg_dma_len(sgt->sgl) = size; + sg_set_page(sgt->sgl, + pfn_to_page(PFN_DOWN + (bo->backing_data.dma.fb_dev_addr)), + size, 0); + sg_dma_address(sgt->sgl) = bo->backing_data.dma.fb_dev_addr; + + } else { + struct page **pages; + + pages = get_pages(obj); + if (IS_ERR(pages)) { + dev_err(obj->dev->dev, "could not get pages: %ld\n", + PTR_ERR(pages)); + return ERR_CAST(pages); + } + sgt = drm_prime_pages_to_sg(pages, n_pages); + if (sgt == NULL) + return ERR_PTR(-ENOMEM); + nents = dma_map_sg(attach->dev, sgt->sgl, sgt->nents, + direction); + if (!nents) { + DRM_ERROR("Failed to map dma buffer\n"); + sg_free_table(sgt); + kfree(sgt); + return ERR_PTR(-ENOMEM); + } + if (nents < sgt->nents) { + /* dma_map_sg() may merge sglist entries (e.g. if + * they are contiguous) so nents may be less than + * sgt->nents. If this happens we need to fix + * sgt->nents as it is used by the caller */ + DRM_DEBUG_KMS( + "sg list entries merged during mapping\n"); + sgt->nents = nents; + } + } + return sgt; +} + +static void pl111_dma_buf_unmap_dma_buf(struct dma_buf_attachment *attach, + struct sg_table *sgt, + enum dma_data_direction direction) +{ + struct drm_gem_object *obj = attach->dmabuf->priv; + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); + + DRM_DEBUG_KMS("Unmapping dma_buf %p from attach=%p\n", attach->dmabuf, + attach); + + if (PL111_BOT_SHM == bo->type) { + /* use orig_nents here as nents may have been + * modified in pl111_dma_buf_map_dma_buf() */ + dma_unmap_sg(attach->dev, sgt->sgl, sgt->orig_nents, + direction); + } + sg_free_table(sgt); + kfree(sgt); + sgt = NULL; +} + +static void *pl111_dma_buf_kmap_atomic(struct dma_buf *dma_buf, + unsigned long what) +{ + DRM_ERROR("pl111_dma_buf_kmap_atomic not implemented, dma_buf=%p\n", + dma_buf); + return NULL; +} + +static void *pl111_dma_buf_kmap(struct dma_buf *dma_buf, unsigned long what) +{ + DRM_ERROR("pl111_dma_buf_kmap not implemented, dma_buf=%p\n", dma_buf); + return NULL; +} + +static struct dma_buf_ops pl111_dma_buf_ops = { + .release = &pl111_dma_buf_release, + .attach = &pl111_dma_buf_attach, + .detach = &pl111_dma_buf_detach, + .map_dma_buf = &pl111_dma_buf_map_dma_buf, + .unmap_dma_buf = &pl111_dma_buf_unmap_dma_buf, + .kmap_atomic = &pl111_dma_buf_kmap_atomic, + .kmap = &pl111_dma_buf_kmap, + .mmap = &pl111_dma_buf_mmap, +}; + +struct dma_buf *pl111_gem_prime_export(struct drm_device *dev, + struct drm_gem_object *obj, int flags) +{ + struct dma_buf *new_buf; + struct pl111_gem_bo *bo; + size_t size; + + DRM_DEBUG_KMS("DRM %s on dev=%p drm_gem_obj=%p\n", __func__, dev, obj); + size = obj->size; + + new_buf = dma_buf_export(obj /*priv */ , &pl111_dma_buf_ops, size, + flags | O_RDWR); + bo = PL111_BO_FROM_GEM(new_buf->priv); + + /* + * bo->gem_object.export_dma_buf not setup until after gem_prime_export + * finishes + */ + +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + /* + * Ensure that we hold the kds resource if it's the currently + * displayed buffer. + */ + obtain_kds_if_currently_displayed(dev, bo->fb, new_buf); +#endif + + DRM_DEBUG_KMS("Created dma_buf %p\n", new_buf); + return new_buf; +} + +int pl111_prime_handle_to_fd(struct drm_device *dev, struct drm_file *file_priv, + uint32_t handle, uint32_t flags, int *prime_fd) +{ + int result; + /* + * This will re-use any existing exports, and calls + * driver->gem_prime_export to do the first export when needed + */ + DRM_DEBUG_KMS("DRM %s on file_priv=%p, handle=0x%.8x\n", __func__, + file_priv, handle); + + mutex_lock(&priv.export_dma_buf_lock); + result = drm_gem_prime_handle_to_fd(dev, file_priv, handle, flags, + prime_fd); + mutex_unlock(&priv.export_dma_buf_lock); + + return result; +} diff --git a/drivers/gpu/drm/pl111/pl111_drm_encoder.c b/drivers/gpu/drm/pl111/pl111_drm_encoder.c new file mode 100644 index 0000000..028b366 --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_encoder.c @@ -0,0 +1,106 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_encoder.c + * Implementation of the encoder functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + +bool pl111_encoder_helper_mode_fixup(struct drm_encoder *encoder, + struct drm_display_mode *mode, + struct drm_display_mode *adjusted_mode) +{ + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); + return true; +} + +void pl111_encoder_helper_prepare(struct drm_encoder *encoder) +{ + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); +} + +void pl111_encoder_helper_commit(struct drm_encoder *encoder) +{ + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); +} + +void pl111_encoder_helper_mode_set(struct drm_encoder *encoder, + struct drm_display_mode *mode, + struct drm_display_mode *adjusted_mode) +{ + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); +} + +void pl111_encoder_helper_disable(struct drm_encoder *encoder) +{ + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); +} + +void pl111_encoder_destroy(struct drm_encoder *encoder) +{ + struct pl111_drm_encoder *pl111_encoder = + PL111_ENCODER_FROM_ENCODER(encoder); + + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); + + drm_encoder_cleanup(encoder); + kfree(pl111_encoder); +} + +const struct drm_encoder_funcs encoder_funcs = { + .destroy = pl111_encoder_destroy, +}; + +const struct drm_encoder_helper_funcs encoder_helper_funcs = { + .mode_fixup = pl111_encoder_helper_mode_fixup, + .prepare = pl111_encoder_helper_prepare, + .commit = pl111_encoder_helper_commit, + .mode_set = pl111_encoder_helper_mode_set, + .disable = pl111_encoder_helper_disable, +}; + +struct pl111_drm_encoder *pl111_encoder_create(struct drm_device *dev, + int possible_crtcs) +{ + struct pl111_drm_encoder *pl111_encoder; + + pl111_encoder = kzalloc(sizeof(struct pl111_drm_encoder), GFP_KERNEL); + if (pl111_encoder == NULL) { + pr_err("Failed to allocated pl111_drm_encoder\n"); + return NULL; + } + + drm_encoder_init(dev, &pl111_encoder->encoder, &encoder_funcs, + DRM_MODE_ENCODER_DAC); + + drm_encoder_helper_add(&pl111_encoder->encoder, &encoder_helper_funcs); + + pl111_encoder->encoder.possible_crtcs = possible_crtcs; + + return pl111_encoder; +} + diff --git a/drivers/gpu/drm/pl111/pl111_drm_fb.c b/drivers/gpu/drm/pl111/pl111_drm_fb.c new file mode 100644 index 0000000..fa37623 --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_fb.c @@ -0,0 +1,152 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_fb.c + * Implementation of the framebuffer functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + +static void pl111_fb_destroy(struct drm_framebuffer *framebuffer) +{ + struct pl111_drm_framebuffer *pl111_fb; +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + struct drm_crtc *crtc; +#endif + DRM_DEBUG_KMS("Destroying framebuffer 0x%p...\n", framebuffer); + + pl111_fb = PL111_FB_FROM_FRAMEBUFFER(framebuffer); + + /* + * Because flips are deferred, wait for all previous flips to complete + */ + wait_event(priv.wait_for_flips, + atomic_read(&priv.nr_flips_in_flight) == 0); +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + /* + * Release KDS resources if it's currently being displayed. Only occurs + * when the last framebuffer is destroyed. + */ + list_for_each_entry(crtc, &framebuffer->dev->mode_config.crtc_list, + head) { + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + spin_lock(&pl111_crtc->current_displaying_lock); + if (pl111_crtc->displaying_fb == framebuffer) { + /* Release the current buffers */ + if (pl111_crtc->old_kds_res_set != NULL) { + DRM_DEBUG_KMS("Releasing KDS resources for "); + DRM_DEBUG_KMS("displayed 0x%p\n", framebuffer); + kds_resource_set_release( + &pl111_crtc->old_kds_res_set); + } + pl111_crtc->old_kds_res_set = NULL; + } + spin_unlock(&pl111_crtc->current_displaying_lock); + } +#endif + drm_framebuffer_cleanup(framebuffer); + + if ((pl111_fb->bo != NULL) && (&pl111_fb->bo->gem_object != NULL)) + drm_gem_object_unreference_unlocked(&pl111_fb->bo->gem_object); + + kfree(pl111_fb); + + DRM_DEBUG_KMS("Destroyed framebuffer 0x%p\n", framebuffer); +} + +static int pl111_fb_create_handle(struct drm_framebuffer *fb, + struct drm_file *file_priv, + unsigned int *handle) +{ + struct pl111_gem_bo *bo = PL111_BO_FROM_FRAMEBUFFER(fb); + DRM_DEBUG_KMS("DRM %s on fb=%p\n", __func__, fb); + + if (bo == NULL) + return -EINVAL; + + return drm_gem_handle_create(file_priv, &bo->gem_object, handle); +} + +const struct drm_framebuffer_funcs fb_funcs = { + .destroy = pl111_fb_destroy, + .create_handle = pl111_fb_create_handle, +}; + +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, + struct drm_file *file_priv, + struct drm_mode_fb_cmd2 *mode_cmd) +{ + struct pl111_drm_framebuffer *pl111_fb = NULL; + struct drm_framebuffer *fb = NULL; + struct drm_gem_object *gem_obj; + struct pl111_gem_bo *bo; + + pr_info("DRM %s\n", __func__); + gem_obj = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); + if (gem_obj == NULL) { + DRM_ERROR("Could not get gem obj from handle to create fb\n"); + goto out; + } + + bo = PL111_BO_FROM_GEM(gem_obj); + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ + BUG_ON(bo->type != PL111_BOT_DMA); + + switch ((char)(mode_cmd->pixel_format & 0xFF)) { + case 'Y': + case 'U': + case 'V': + case 'N': + case 'T': + DRM_ERROR("YUV formats not supported\n"); + goto out; + } + + pl111_fb = kzalloc(sizeof(struct pl111_drm_framebuffer), GFP_KERNEL); + if (pl111_fb == NULL) { + DRM_ERROR("Could not allocate pl111_drm_framebuffer\n"); + goto out; + } + fb = &pl111_fb->fb; + + if (drm_framebuffer_init(dev, fb, &fb_funcs)) { + DRM_ERROR("drm_framebuffer_init failed\n"); + kfree(fb); + fb = NULL; + goto out; + } + + drm_helper_mode_fill_fb_struct(fb, mode_cmd); + + PL111_BO_TO_FRAMEBUFFER(fb, bo); + + DRM_DEBUG_KMS("Created fb 0x%p for gem_obj 0x%p physaddr=0x%.8x\n", + fb, gem_obj, bo->backing_data.dma.fb_dev_addr); + +out: + return fb; +} diff --git a/drivers/gpu/drm/pl111/pl111_drm_funcs.h b/drivers/gpu/drm/pl111/pl111_drm_funcs.h new file mode 100644 index 0000000..de8a826 --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_funcs.h @@ -0,0 +1,127 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_funcs.h + * Function prototypes for PL111 DRM + */ + +#ifndef PL111_DRM_FUNCS_H_ +#define PL111_DRM_FUNCS_H_ + +/* Platform Initialisation */ +int pl111_drm_init(struct platform_device *dev); +void pl111_drm_exit(struct platform_device *dev); + +/* KDS Callbacks */ +void show_framebuffer_on_crtc_cb(void *cb1, void *cb2); +void release_kds_resource_and_display(struct pl111_drm_flip_resource *flip_res); + +/* CRTC Functions */ +struct pl111_drm_crtc *pl111_crtc_create(struct drm_device *dev); +struct pl111_drm_crtc *pl111_crtc_dummy_create(struct drm_device *dev); +void pl111_crtc_destroy(struct drm_crtc *crtc); + +bool pl111_crtc_is_fb_currently_displayed(struct drm_device *dev, + struct drm_framebuffer *fb); + +int show_framebuffer_on_crtc(struct drm_crtc *crtc, + struct drm_framebuffer *fb, bool page_flip, + struct drm_pending_vblank_event *event); + +/* Common IRQ handler */ +void pl111_common_irq(struct pl111_drm_crtc *pl111_crtc); + +int pl111_cursor_plane_init(struct drm_device *dev, + struct pl111_drm_cursor_plane *cursor, + unsigned long possible_crtcs); +void pl111_drm_cursor_plane_destroy(struct drm_plane *plane); + +/* Connector Functions */ +struct pl111_drm_connector *pl111_connector_create(struct drm_device *dev); +void pl111_connector_destroy(struct drm_connector *connector); +struct pl111_drm_connector *pl111_connector_dummy_create(struct drm_device + *dev); + +/* Encoder Functions */ +struct pl111_drm_encoder *pl111_encoder_create(struct drm_device *dev, + int possible_crtcs); +struct pl111_drm_encoder *pl111_encoder_dummy_create(struct drm_device *dev, + int possible_crtcs); +void pl111_encoder_destroy(struct drm_encoder *encoder); + +/* Frame Buffer Functions */ +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, + struct drm_file *file_priv, + struct drm_mode_fb_cmd2 *mode_cmd); + +/* VMA Functions */ +int pl111_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf); +int pl111_gem_mmap(struct file *file_priv, struct vm_area_struct *vma); +struct page **get_pages(struct drm_gem_object *obj); +void put_pages(struct drm_gem_object *obj, struct page **pages); + +/* Suspend Functions */ +int pl111_drm_resume(struct drm_device *dev); +int pl111_drm_suspend(struct drm_device *dev, pm_message_t state); + +/* GEM Functions */ +int pl111_dumb_create(struct drm_file *file_priv, + struct drm_device *dev, + struct drm_mode_create_dumb *args); +int pl111_dumb_destroy(struct drm_file *file_priv, + struct drm_device *dev, uint32_t handle); +int pl111_dumb_map_offset(struct drm_file *file_priv, + struct drm_device *dev, uint32_t handle, + uint64_t *offset); +void pl111_gem_free_object(struct drm_gem_object *obj); + +int pl111_bo_mmap(struct drm_gem_object *obj, struct pl111_gem_bo *bo, + struct vm_area_struct *vma, size_t size); + +/* DMA BUF Functions */ +int pl111_prime_handle_to_fd(struct drm_device *dev, struct drm_file *file_priv, + uint32_t handle, uint32_t flags, int *prime_fd); +struct dma_buf *pl111_gem_prime_export(struct drm_device *dev, + struct drm_gem_object *obj, int flags); + +/* Pl111 Functions */ +void show_framebuffer_on_crtc_cb_internal(struct pl111_drm_flip_resource + *flip_res, struct drm_framebuffer *fb); +int clcd_disable(struct drm_crtc *crtc); +void do_flip_to_res(struct pl111_drm_flip_resource *flip_res); +int pl111_amba_probe(struct amba_device *dev, const struct amba_id *id); +int pl111_amba_remove(struct amba_device *dev); + +int pl111_device_init(struct drm_device *dev); +void pl111_device_fini(struct drm_device *dev); + +void pl111_set_cursor_size(enum pl111_cursor_size size); +void pl111_set_cursor_sync(enum pl111_cursor_sync sync); +void pl111_set_cursor_index(u32 cursor); +void pl111_set_cursor_enable(bool enable); +void pl111_set_cursor_position(u32 x, u32 y); +void pl111_set_cursor_clipping(u32 x, u32 y); +void pl111_set_cursor_palette(u32 color0, u32 color1); +void pl111_cursor_enable(void); +void pl111_cursor_disable(void); +void pl111_set_cursor_image(u32 *data); + +void pl111_convert_drm_mode_to_timing(struct drm_display_mode *mode, + struct clcd_regs *timing); +void pl111_convert_timing_to_drm_mode(struct clcd_regs *timing, + struct drm_display_mode *mode); +#endif /* PL111_DRM_FUNCS_H_ */ diff --git a/drivers/gpu/drm/pl111/pl111_drm_gem.c b/drivers/gpu/drm/pl111/pl111_drm_gem.c new file mode 100644 index 0000000..01989ec --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_gem.c @@ -0,0 +1,287 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_gem.c + * Implementation of the GEM functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> +#include "pl111_drm.h" + +void pl111_gem_free_object(struct drm_gem_object *obj) +{ + struct pl111_gem_bo *bo; + struct drm_device *dev = obj->dev; + DRM_DEBUG_KMS("DRM %s on drm_gem_object=%p\n", __func__, obj); + + bo = PL111_BO_FROM_GEM(obj); + + if (obj->map_list.map != NULL) + drm_gem_free_mmap_offset(obj); + + if (bo->type == PL111_BOT_DMA) { + dma_free_writecombine(dev->dev, obj->size, + bo->backing_data.dma.fb_cpu_addr, + bo->backing_data.dma.fb_dev_addr); + } else if (bo->backing_data.shm.pages != NULL) { + put_pages(obj, bo->backing_data.shm.pages); + } + drm_gem_object_release(obj); + + kfree(bo); + + DRM_DEBUG_KMS("Destroyed dumb_bo handle 0x%p\n", bo); +} + +int pl111_dumb_create(struct drm_file *file_priv, + struct drm_device *dev, struct drm_mode_create_dumb *args) +{ + int ret = 0; + struct pl111_gem_bo *bo = NULL; + uint32_t bytes_pp; + bool create_contig_buffer; + + bo = kzalloc(sizeof(*bo), GFP_KERNEL); + if (bo == NULL) { + ret = -ENOMEM; + goto finish; + } + + /* Round bpp up, to allow for case where bpp<8 */ + bytes_pp = args->bpp >> 3; + if (args->bpp & ((1 << 3) - 1)) + bytes_pp++; + + args->pitch = ALIGN(args->width * bytes_pp, 64); + args->size = PAGE_ALIGN(args->pitch * args->height); + + DRM_DEBUG_KMS("dumb_create w=%d h=%d p=%d bpp=%d b=%d s=%llu f=0x%x\n", + args->width, args->height, args->pitch, args->bpp, + bytes_pp, args->size, args->flags); + + create_contig_buffer = args->flags & PL111_BO_SCANOUT; +#ifndef ARCH_HAS_SG_CHAIN + /* + * If the ARCH can't chain we can't have non-contiguous allocs larger + * than a single sg can hold. + * In this case we fall back to using contiguous memory + */ + if (!create_contig_buffer) { + long unsigned int n_pages = + PAGE_ALIGN(args->size) >> PAGE_SHIFT; + if (n_pages > SG_MAX_SINGLE_ALLOC) { + create_contig_buffer = true; + /* + * Non-contiguous allocation request changed to + * contigous + */ + DRM_INFO("non-contig alloc to contig %lu > %lu pages.", + n_pages, SG_MAX_SINGLE_ALLOC); + } + } +#endif + if (!create_contig_buffer) { + /* not scanout compatible - use non-contiguous buffer */ + bo->type = PL111_BOT_SHM; + ret = drm_gem_object_init(dev, &bo->gem_object, args->size); + if (ret != 0) { + DRM_ERROR("DRM could not init SHM backed GEM obj\n"); + kfree(bo); + ret = -ENOMEM; + goto finish; + } + DRM_DEBUG_KMS("Num bytes: %d\n", bo->gem_object.size); + } else { + /* scanout compatible - use contiguous buffer */ + bo->type = PL111_BOT_DMA; + + bo->backing_data.dma.fb_cpu_addr = + dma_alloc_writecombine(dev->dev, args->size, + &bo->backing_data.dma.fb_dev_addr, + GFP_KERNEL); + if (bo->backing_data.dma.fb_cpu_addr == NULL) { + DRM_ERROR("dma_alloc_writecombine failed\n"); + kfree(bo); + ret = -ENOMEM; + goto finish; + } + + ret = drm_gem_private_object_init(dev, &bo->gem_object, + args->size); + if (ret != 0) { + DRM_ERROR("DRM could not initialise GEM object\n"); + dma_free_writecombine(dev->dev, args->size, + bo->backing_data.dma.fb_cpu_addr, + bo->backing_data.dma.fb_dev_addr); + kfree(bo); + ret = -ENOMEM; + goto finish; + } + } + + DRM_DEBUG_KMS("dumb_create: 0x%p with w=%d, h=%d, p=%d, bpp=%d,", + bo, args->width, args->height, args->pitch, args->bpp); + DRM_DEBUG_KMS("bytes_pp=%d, s=%llu, flags=0x%x, %s 0x%.8lx, type=%d\n", + bytes_pp, args->size, args->flags, + (bo->type == PL111_BOT_DMA) ? "physaddr" : "shared page array", + (bo->type == PL111_BOT_DMA) + ? (unsigned long)bo->backing_data.dma.fb_dev_addr + : (unsigned long)bo->backing_data.shm.pages, bo->type); + + /* omap_gem_new_handle() */ + ret = drm_gem_handle_create(file_priv, &bo->gem_object, &args->handle); + if (ret != 0) { + DRM_ERROR("DRM failed to create GEM handle\n"); + drm_gem_object_release(&bo->gem_object); + if (bo->type == PL111_BOT_DMA) { + dma_free_writecombine(dev->dev, args->size, + bo->backing_data.dma.fb_cpu_addr, + bo->backing_data.dma.fb_dev_addr); + } + kfree(bo); + return ret; + } + /* drop reference from allocate - handle holds it now */ + drm_gem_object_unreference_unlocked(&bo->gem_object); + DRM_DEBUG_KMS("dumb_create completed: fp=%p h=0x%.8x gem_object=%p", + file_priv, args->handle, &bo->gem_object); + +finish: + return ret; +} + +int pl111_dumb_destroy(struct drm_file *file_priv, struct drm_device *dev, + uint32_t handle) +{ + DRM_DEBUG_KMS("DRM %s on file_priv=%p handle=0x%.8x\n", __func__, + file_priv, handle); + return drm_gem_handle_delete(file_priv, handle); +} + +int pl111_dumb_map_offset(struct drm_file *file_priv, + struct drm_device *dev, uint32_t handle, + uint64_t *offset) +{ + /* omap_gem_dump_map_offset */ + struct drm_gem_object *obj; + int ret = 0; + DRM_DEBUG_KMS("DRM %s on file_priv=%p handle=0x%.8x\n", __func__, + file_priv, handle); + + /* GEM does all our handle to object mapping */ + obj = drm_gem_object_lookup(dev, file_priv, handle); + if (obj == NULL) { + ret = -ENOENT; + goto fail; + } + + if (obj->map_list.map == NULL) { + ret = drm_gem_create_mmap_offset(obj); + if (ret != 0) + goto fail; + } + + *offset = (uint64_t) obj->map_list.hash.key << PAGE_SHIFT; + + drm_gem_object_unreference_unlocked(obj); +fail: + return ret; +} + +/* Based on drm_vm.c and omapdrm driver */ +int pl111_bo_mmap(struct drm_gem_object *obj, struct pl111_gem_bo *bo, + struct vm_area_struct *vma, size_t size) +{ + int ret = 0; + + DRM_DEBUG_KMS("DRM %s on drm_gem_object=%p, pl111_gem_bo=%p\n", + __func__, obj, bo); + + if (obj->size < vma->vm_end - vma->vm_start) { + ret = -EINVAL; + goto done; + } + + if (bo->type == PL111_BOT_DMA) { + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_page_prot = + pgprot_noncached(vm_get_page_prot(vma->vm_flags)); + + if (remap_pfn_range(vma, vma->vm_start, + (bo->backing_data.dma.fb_dev_addr) >> PAGE_SHIFT, + vma->vm_end - vma->vm_start, vma->vm_page_prot)) + ret = -EAGAIN; + } else { + unsigned long uaddr = vma->vm_start; + long usize = obj->size; + int i = 0; + struct page **pages; + vma->vm_flags &= ~VM_PFNMAP; + vma->vm_flags |= VM_MIXEDMAP; + vma->vm_page_prot = + pgprot_noncached(vm_get_page_prot(vma->vm_flags)); + pages = get_pages(obj); + if (IS_ERR(pages)) { + dev_err(obj->dev->dev, "could not get pages: %ld\n", + PTR_ERR(pages)); + return PTR_ERR(pages); + } + do { + ret = vm_insert_page(vma, uaddr, pages[i]); + if (ret != 0) { + DRM_ERROR("failed to remap user space.\n"); + return ret; + } + uaddr += PAGE_SIZE; + usize -= PAGE_SIZE; + i++; + } while (usize > 0); + } + +done: + return ret; +} + +int pl111_gem_mmap(struct file *file_priv, struct vm_area_struct *vma) +{ + int ret; + struct drm_file *priv = file_priv->private_data; + struct drm_device *dev = priv->minor->dev; + struct drm_gem_mm *mm = dev->mm_private; + struct drm_local_map *map = NULL; + struct drm_hash_item *hash; + struct drm_gem_object *obj; + struct pl111_gem_bo *bo; + DRM_DEBUG_KMS("DRM %s\n", __func__); + + ret = drm_gem_mmap(file_priv, vma); + + drm_ht_find_item(&mm->offset_hash, vma->vm_pgoff, &hash); + map = drm_hash_entry(hash, struct drm_map_list, hash)->map; + obj = map->handle; + bo = PL111_BO_FROM_GEM(obj); + + DRM_DEBUG_KMS("DRM %s on pl111_gem_bo %p\n", __func__, bo); + + return pl111_bo_mmap(obj, bo, vma, vma->vm_end - vma->vm_start); +} diff --git a/drivers/gpu/drm/pl111/pl111_drm_pl111.c b/drivers/gpu/drm/pl111/pl111_drm_pl111.c new file mode 100644 index 0000000..daaa5ba --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_pl111.c @@ -0,0 +1,513 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_pl111.c + * PL111 specific functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> +#include "pl111_clcd_ext.h" + +#include "pl111_drm.h" + +/* This can't be called from IRQ context, due to clk_get() and board->enable */ +static int clcd_enable(struct drm_framebuffer *fb) +{ + __u32 cntl; + struct clcd_board *board; + + pr_info("DRM %s\n", __func__); + + clk_prepare_enable(priv.clk); + + /* Enable and Power Up */ + cntl = CNTL_LCDEN | CNTL_LCDTFT | CNTL_LCDPWR | CNTL_LCDVCOMP(1); + DRM_DEBUG_KMS("fb->bits_per_pixel = %d\n", fb->bits_per_pixel); + if (fb->bits_per_pixel == 16) + cntl |= CNTL_LCDBPP16_565; + else if (fb->bits_per_pixel == 32 && fb->depth == 24) + cntl |= CNTL_LCDBPP24; + else + BUG_ON(1); + + cntl |= CNTL_BGR; + + writel(cntl, priv.regs + CLCD_PL111_CNTL); + + board = priv.amba_dev->dev.platform_data; + + if (board->enable) + board->enable(NULL); + + /* Enable Interrupts */ + writel(CLCD_IRQ_NEXTBASE_UPDATE, priv.regs + CLCD_PL111_IENB); + + return 0; +} + +int clcd_disable(struct drm_crtc *crtc) +{ + struct clcd_board *board; + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); + + pr_info("DRM %s\n", __func__); + + /* Disable Interrupts */ + writel(0x00000000, priv.regs + CLCD_PL111_IENB); + + board = priv.amba_dev->dev.platform_data; + + if (board->disable) + board->disable(NULL); + + /* Disable and Power Down */ + writel(0, priv.regs + CLCD_PL111_CNTL); + + /* Disable clock */ + clk_disable_unprepare(priv.clk); + + pl111_crtc->last_bpp = 0; +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS + /* Release the previous buffers */ + if (pl111_crtc->old_kds_res_set != NULL) + kds_resource_set_release(&pl111_crtc->old_kds_res_set); + + pl111_crtc->old_kds_res_set = NULL; +#endif + return 0; +} + +void do_flip_to_res(struct pl111_drm_flip_resource *flip_res) +{ + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); + struct drm_framebuffer *fb; + struct pl111_gem_bo *bo; + + fb = flip_res->fb; + bo = PL111_BO_FROM_FRAMEBUFFER(fb); + + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ + BUG_ON(bo->type != PL111_BOT_DMA); + + /* + * Note the buffer for releasing after IRQ, and don't allow any more + * updates until then. + * + * This clcd controller latches the new address on next vsync. Address + * latching is indicated by CLCD_IRQ_NEXTBASE_UPDATE, and so we must + * wait for that before releasing the previous buffer's kds + * resources. Otherwise, we'll allow writers to write to the old buffer + * whilst it is still being displayed + */ + pl111_crtc->current_update_res = flip_res; + + DRM_DEBUG_KMS("Displaying fb 0x%p, dumb_bo 0x%p, physaddr %.8x\n", + fb, bo, bo->backing_data.dma.fb_dev_addr); + + if (drm_vblank_get(pl111_crtc->crtc.dev, pl111_crtc->crtc_index) < 0) + DRM_ERROR("Could not get vblank reference for crtc %d\n", + pl111_crtc->crtc_index); + + /* Set the scanout buffer */ + writel(bo->backing_data.dma.fb_dev_addr, priv.regs + CLCD_UBAS); + writel(bo->backing_data.dma.fb_dev_addr + + ((fb->height - 1) * fb->pitches[0]), priv.regs + CLCD_LBAS); +} + +void +show_framebuffer_on_crtc_cb_internal(struct pl111_drm_flip_resource *flip_res, + struct drm_framebuffer *fb) +{ + unsigned long irq_flags; + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); + + spin_lock_irqsave(&pl111_crtc->base_update_lock, irq_flags); + if (list_empty(&pl111_crtc->update_queue) && + !pl111_crtc->current_update_res) { + do_flip_to_res(flip_res); + + /* + * Catch a potential race with the IRQ handler: + * - We may've updated the Base Address just after it was + * latched, but before the OS ran our IRQ handler + * - Hence, the CLCD controller is now scanning out the + * previous buffer, not our new buffer. + * - However, as soon as the IRQ handler runs, it'll inspect + * pl111_crtc->current_update_res, and use that to cause the + * previous buffer to be released on the workqueue (even + * though the CLCD controller is still scanning it out) + * Instead, we must wait until the *next* IRQ to allow + * releasing of the previous buffer: + */ + if (readl(priv.regs + CLCD_PL111_MIS) & + CLCD_IRQ_NEXTBASE_UPDATE) { + DRM_DEBUG_KMS("Redoing flip to fb %p on next IRQ\n", + fb); + pl111_crtc->current_update_res = NULL; + list_add_tail(&flip_res->link, + &pl111_crtc->update_queue); + } + } else { + /* + * Enqueue the update to occur on a future IRQ + * This only happens on triple-or-greater buffering + */ + DRM_DEBUG_KMS("Deferring 3+ buffered flip to fb %p to IRQ\n", + fb); + list_add_tail(&flip_res->link, &pl111_crtc->update_queue); + } + + spin_unlock_irqrestore(&pl111_crtc->base_update_lock, irq_flags); + + if (!flip_res->page_flip && (pl111_crtc->last_bpp == 0 || + pl111_crtc->last_bpp != fb->bits_per_pixel || + !drm_mode_equal(pl111_crtc->new_mode, + pl111_crtc->current_mode))) { + struct clcd_regs timing; + + pl111_convert_drm_mode_to_timing(pl111_crtc->new_mode, &timing); + + DRM_DEBUG_KMS("Set timing: %08X:%08X:%08X:%08X clk=%ldHz\n", + timing.tim0, timing.tim1, timing.tim2, + timing.tim3, timing.pixclock); + + /* This is the actual mode setting part */ + clk_set_rate(priv.clk, timing.pixclock); + + writel(timing.tim0, priv.regs + CLCD_TIM0); + writel(timing.tim1, priv.regs + CLCD_TIM1); + writel(timing.tim2, priv.regs + CLCD_TIM2); + writel(timing.tim3, priv.regs + CLCD_TIM3); + + clcd_enable(fb); + pl111_crtc->last_bpp = fb->bits_per_pixel; + } + + if (!flip_res->page_flip) { + drm_mode_destroy(flip_res->crtc->dev, pl111_crtc->current_mode); + pl111_crtc->current_mode = pl111_crtc->new_mode; + pl111_crtc->new_mode = NULL; + } + + BUG_ON(pl111_crtc->new_mode); + BUG_ON(!pl111_crtc->current_mode); + + /* + * If IRQs weren't enabled before, they are now. This will eventually + * cause flip_res to be released via vsync_worker(), which updates + * every time the Base Address is latched (i.e. every frame, regardless + * of whether we update the base address or not) + */ +} + +irqreturn_t pl111_irq(int irq, void *data) +{ + u32 irq_stat; + struct pl111_drm_crtc *pl111_crtc = priv.pl111_crtc; + + irq_stat = readl(priv.regs + CLCD_PL111_MIS); + + if (!irq_stat) + return IRQ_NONE; + + if (irq_stat & CLCD_IRQ_NEXTBASE_UPDATE) { + if (pl111_crtc->current_update_res || + !list_empty(&pl111_crtc->update_queue)) + DRM_DEBUG_KMS("DRM irq %x after base update\n", + irq_stat); + + /* + * We don't need to lock here as we don't do any flip-specific + * processing in this function. All these, including locks, is + * done in common_irq handler + */ + pl111_common_irq(pl111_crtc); + } + + /* Clear the interrupt once done */ + writel(irq_stat, priv.regs + CLCD_PL111_ICR); + + return IRQ_HANDLED; +} + +int pl111_device_init(struct drm_device *dev) +{ + struct pl111_drm_dev_private *priv = dev->dev_private; + int ret; + + if (priv == NULL || priv->amba_dev == NULL) + return -EINVAL; + + /* set up MMIO for register access */ + priv->mmio_start = priv->amba_dev->res.start; + priv->mmio_len = resource_size(&priv->amba_dev->res); + + DRM_DEBUG_KMS("mmio_start=%lu, mmio_len=%u\n", priv->mmio_start, + priv->mmio_len); + + priv->regs = ioremap(priv->mmio_start, priv->mmio_len); + if (priv->regs == NULL) { + pr_err("%s failed mmio\n", __func__); + return -EINVAL; + } + + /* turn off interrupts */ + writel(0, priv->regs + CLCD_PL111_IENB); + + ret = request_irq(priv->amba_dev->irq[0], pl111_irq, 0, + "pl111_irq_handler", NULL); + if (ret != 0) { + pr_err("%s failed %d\n", __func__, ret); + goto out_mmio; + } + + goto finish; + +out_mmio: + iounmap(priv->regs); +finish: + DRM_DEBUG_KMS("pl111_device_init returned %d\n", ret); + return ret; +} + +void pl111_device_fini(struct drm_device *dev) +{ + struct pl111_drm_dev_private *priv = dev->dev_private; + u32 cntl; + + if (priv == NULL || priv->regs == NULL) + return; + + free_irq(priv->amba_dev->irq[0], NULL); + + cntl = readl(priv->regs + CLCD_PL111_CNTL); + + cntl &= ~CNTL_LCDEN; + writel(cntl, priv->regs + CLCD_PL111_CNTL); + + cntl &= ~CNTL_LCDPWR; + writel(cntl, priv->regs + CLCD_PL111_CNTL); + + iounmap(priv->regs); +} + +int pl111_amba_probe(struct amba_device *dev, const struct amba_id *id) +{ + struct clcd_board *board = dev->dev.platform_data; + int ret; + pr_info("DRM %s\n", __func__); + + if (board == NULL) + return -EINVAL; + + ret = amba_request_regions(dev, NULL); + if (ret != 0) { + DRM_ERROR("CLCD: unable to reserve regs region\n"); + goto out; + } + + priv.amba_dev = dev; + + priv.clk = clk_get(&priv.amba_dev->dev, NULL); + if (IS_ERR(priv.clk)) { + DRM_ERROR("CLCD: unable to get clk.\n"); + ret = PTR_ERR(priv.clk); + goto clk_err; + } + + return 0; + +clk_err: + amba_release_regions(dev); +out: + return ret; +} + +int pl111_amba_remove(struct amba_device *dev) +{ + DRM_DEBUG_KMS("DRM %s\n", __func__); + + clk_put(priv.clk); + + amba_release_regions(dev); + + priv.amba_dev = NULL; + + return 0; +} + +void pl111_set_cursor_size(enum pl111_cursor_size size) +{ + u32 reg_data = readl(priv.regs + CLCD_CRSR_CONFIG); + + if (size == CURSOR_64X64) + reg_data |= CRSR_CONFIG_CRSR_SIZE; + else + reg_data &= ~CRSR_CONFIG_CRSR_SIZE; + + writel(reg_data, priv.regs + CLCD_CRSR_CONFIG); +} + +void pl111_set_cursor_sync(enum pl111_cursor_sync sync) +{ + u32 reg_data = readl(priv.regs + CLCD_CRSR_CONFIG); + + if (sync == CURSOR_SYNC_VSYNC) + reg_data |= CRSR_CONFIG_CRSR_FRAME_SYNC; + else + reg_data &= ~CRSR_CONFIG_CRSR_FRAME_SYNC; + + writel(reg_data, priv.regs + CLCD_CRSR_CONFIG); +} + +void pl111_set_cursor(u32 cursor) +{ + u32 reg_data = readl(priv.regs + CLCD_CRSR_CTRL); + + reg_data &= ~(CRSR_CTRL_CRSR_MAX << CRSR_CTRL_CRSR_NUM_SHIFT); + reg_data |= (cursor & CRSR_CTRL_CRSR_MAX) << CRSR_CTRL_CRSR_NUM_SHIFT; + + writel(reg_data, priv.regs + CLCD_CRSR_CTRL); +} + +void pl111_set_cursor_enable(bool enable) +{ + u32 reg_data = readl(priv.regs + CLCD_CRSR_CTRL); + + if (enable) + reg_data |= CRSR_CTRL_CRSR_ON; + else + reg_data &= ~CRSR_CTRL_CRSR_ON; + + writel(reg_data, priv.regs + CLCD_CRSR_CTRL); +} + +void pl111_set_cursor_position(u32 x, u32 y) +{ + u32 reg_data = (x & CRSR_XY_MASK) | + ((y & CRSR_XY_MASK) << CRSR_XY_Y_SHIFT); + /* could optimise out if same values */ + writel(reg_data, priv.regs + CLCD_CRSR_XY); +} + +void pl111_set_cursor_clipping(u32 x, u32 y) +{ + u32 reg_data = (x & CRSR_CLIP_MASK) | + ((y & CRSR_CLIP_MASK) << CRSR_CLIP_Y_SHIFT); + /* could optimise out if same values */ + writel(reg_data, priv.regs + CLCD_CRSR_CLIP); +} + +void pl111_set_cursor_palette(u32 color0, u32 color1) +{ + writel(color0 & CRSR_PALETTE_MASK, priv.regs + CLCD_CRSR_PALETTE_0); + writel(color1 & CRSR_PALETTE_MASK, priv.regs + CLCD_CRSR_PALETTE_1); +} + +void pl111_cursor_enable(void) +{ + pl111_set_cursor_sync(CURSOR_SYNC_VSYNC); + pl111_set_cursor_size(CURSOR_64X64); + pl111_set_cursor_palette(0x0, 0x00ffffff); + pl111_set_cursor_enable(true); +} + +void pl111_cursor_disable(void) +{ + pl111_set_cursor_enable(false); +} + +void pl111_set_cursor_image(u32 *data) +{ + u32 *cursor_ram = priv.regs + CLCD_CRSR_IMAGE; + int i; + + for (i = 0; i < CLCD_CRSR_IMAGE_MAX_WORDS; i++, data++, cursor_ram++) + writel(*data, cursor_ram); +} + +void pl111_convert_drm_mode_to_timing(struct drm_display_mode *mode, + struct clcd_regs *timing) +{ + unsigned int ppl, hsw, hfp, hbp; + unsigned int lpp, vsw, vfp, vbp; + unsigned int cpl; + + memset(timing, 0, sizeof(struct clcd_regs)); + + ppl = (mode->hdisplay / 16) - 1; + hsw = mode->hsync_end - mode->hsync_start - 1; + hfp = mode->hsync_start - mode->hdisplay - 1; + hbp = mode->htotal - mode->hsync_end - 1; + + lpp = mode->vdisplay - 1; + vsw = mode->vsync_end - mode->vsync_start - 1; + vfp = mode->vsync_start - mode->vdisplay; + vbp = mode->vtotal - mode->vsync_end; + + cpl = mode->hdisplay - 1; + + timing->tim0 = (ppl << 2) | (hsw << 8) | (hfp << 16) | (hbp << 24); + timing->tim1 = lpp | (vsw << 10) | (vfp << 16) | (vbp << 24); + timing->tim2 = TIM2_IVS | TIM2_IHS | TIM2_IPC | TIM2_BCD | (cpl << 16); + timing->tim3 = 0; + + timing->pixclock = mode->clock * 1000; +} + +void pl111_convert_timing_to_drm_mode(struct clcd_regs *timing, + struct drm_display_mode *mode) +{ + unsigned int ppl, hsw, hfp, hbp; + unsigned int lpp, vsw, vfp, vbp; + + ppl = (timing->tim0 >> 2) & 0x3f; + hsw = (timing->tim0 >> 8) & 0xff; + hfp = (timing->tim0 >> 16) & 0xff; + hbp = (timing->tim0 >> 24) & 0xff; + + lpp = timing->tim1 & 0x3ff; + vsw = (timing->tim1 >> 10) & 0x3f; + vfp = (timing->tim1 >> 16) & 0xff; + vbp = (timing->tim1 >> 24) & 0xff; + + mode->hdisplay = (ppl + 1) * 16; + mode->hsync_start = ((ppl + 1) * 16) + hfp + 1; + mode->hsync_end = ((ppl + 1) * 16) + hfp + hsw + 2; + mode->htotal = ((ppl + 1) * 16) + hfp + hsw + hbp + 3; + mode->hskew = 0; + + mode->vdisplay = lpp + 1; + mode->vsync_start = lpp + vfp + 1; + mode->vsync_end = lpp + vfp + vsw + 2; + mode->vtotal = lpp + vfp + vsw + vbp + 2; + + mode->flags = 0; + + mode->width_mm = 0; + mode->height_mm = 0; + + mode->clock = timing->pixclock / 1000; + mode->hsync = timing->pixclock / mode->htotal; + mode->vrefresh = mode->hsync / mode->vtotal; +} diff --git a/drivers/gpu/drm/pl111/pl111_drm_platform.c b/drivers/gpu/drm/pl111/pl111_drm_platform.c new file mode 100644 index 0000000..a0b9e50 --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_platform.c @@ -0,0 +1,150 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_platform.c + * Implementation of the Linux platform device entrypoints for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> +#include <linux/init.h> +#include <linux/platform_device.h> +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> +#include "pl111_drm.h" + +static int pl111_platform_drm_suspend(struct platform_device *dev, + pm_message_t state) +{ + pr_info("DRM %s\n", __func__); + return 0; +} + +static int pl111_platform_drm_resume(struct platform_device *dev) +{ + pr_info("DRM %s\n", __func__); + return 0; +} + +int pl111_platform_drm_probe(struct platform_device *dev) +{ + pr_info("DRM %s\n", __func__); + return pl111_drm_init(dev); +} + +static int pl111_platform_drm_remove(struct platform_device *dev) +{ + pr_info("DRM %s\n", __func__); + pl111_drm_exit(dev); + + return 0; +} + +static struct amba_id pl111_id_table[] = { + { + .id = 0x00041110, + .mask = 0x000ffffe, + }, + {0, 0}, +}; + +static struct amba_driver pl111_amba_driver = { + .drv = { + .name = "clcd-pl11x", + }, + .probe = pl111_amba_probe, + .remove = pl111_amba_remove, + .id_table = pl111_id_table, +}; + +static struct platform_driver platform_drm_driver = { + .probe = pl111_platform_drm_probe, + .remove = pl111_platform_drm_remove, + .suspend = pl111_platform_drm_suspend, + .resume = pl111_platform_drm_resume, + .driver = { + .owner = THIS_MODULE, + .name = DRIVER_NAME, + }, +}; + +static const struct platform_device_info pl111_drm_pdevinfo = { + .name = DRIVER_NAME, + .id = -1, + .dma_mask = ~0UL +}; + +static struct platform_device *pl111_drm_device; + +static int __init pl111_platform_drm_init(void) +{ + int ret; + + pr_info("DRM %s\n", __func__); + + pl111_drm_device = platform_device_register_full(&pl111_drm_pdevinfo); + if (pl111_drm_device == NULL) { + pr_err("DRM platform_device_register_full() failed\n"); + return -ENOMEM; + } + + ret = amba_driver_register(&pl111_amba_driver); + if (ret != 0) { + pr_err("DRM amba_driver_register() failed %d\n", ret); + goto err_amba_reg; + } + + ret = platform_driver_register(&platform_drm_driver); + if (ret != 0) { + pr_err("DRM platform_driver_register() failed %d\n", ret); + goto err_pdrv_reg; + } + + return 0; + +err_pdrv_reg: + amba_driver_unregister(&pl111_amba_driver); +err_amba_reg: + platform_device_unregister(pl111_drm_device); + + return ret; +} + +static void __exit pl111_platform_drm_exit(void) +{ + pr_info("DRM %s\n", __func__); + + platform_device_unregister(pl111_drm_device); + amba_driver_unregister(&pl111_amba_driver); + platform_driver_unregister(&platform_drm_driver); +} + +#ifdef MODULE +module_init(pl111_platform_drm_init); +#else +late_initcall(pl111_platform_drm_init); +#endif +module_exit(pl111_platform_drm_exit); + +MODULE_DESCRIPTION(DRIVER_DESC); +MODULE_VERSION(DRIVER_VERSION); +MODULE_AUTHOR(DRIVER_AUTHOR); +MODULE_LICENSE(DRIVER_LICENCE); +MODULE_ALIAS(DRIVER_ALIAS); diff --git a/drivers/gpu/drm/pl111/pl111_drm_suspend.c b/drivers/gpu/drm/pl111/pl111_drm_suspend.c new file mode 100644 index 0000000..d4da60f --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_suspend.c @@ -0,0 +1,35 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + */ + +/** + * pl111_drm_suspend.c + * Implementation of the suspend/resume functions for PL111 DRM + */ + +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> +#include "pl111_drm.h" + +int pl111_drm_suspend(struct drm_device *dev, pm_message_t state) +{ + pr_info("DRM %s\n", __func__); + return 0; +} + +int pl111_drm_resume(struct drm_device *dev) +{ + pr_info("DRM %s\n", __func__); + return 0; +} diff --git a/drivers/gpu/drm/pl111/pl111_drm_vma.c b/drivers/gpu/drm/pl111/pl111_drm_vma.c new file mode 100644 index 0000000..a3c78fa --- /dev/null +++ b/drivers/gpu/drm/pl111/pl111_drm_vma.c @@ -0,0 +1,214 @@ +/* + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. + * + * Parts of this file were based on sources as follows: + * + * Copyright (c) 2006-2008 Intel Corporation + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> + * Copyright (C) 2011 Texas Instruments + * + * This program is free software and is provided to you under the terms of the + * GNU General Public License version 2 as published by the Free Software + * Foundation, and any use by you of this program is subject to the terms of + * such GNU licence. + * + */ + +/** + * pl111_drm_vma.c + * Implementation of the VM functions for PL111 DRM + */ +#include <linux/amba/bus.h> +#include <linux/amba/clcd.h> +#include <linux/version.h> +#include <linux/shmem_fs.h> +#include <linux/dma-buf.h> +#include <linux/module.h> + +#include <drm/drmP.h> +#include <drm/drm_crtc_helper.h> + +#include "pl111_drm.h" + +/* BEGIN drivers/staging/omapdrm/omap_gem_helpers.c */ +/** + * drm_gem_put_pages - helper to free backing pages for a GEM object + * @obj: obj in question + * @pages: pages to free + */ +static void _drm_gem_put_pages(struct drm_gem_object *obj, struct page **pages, + bool dirty, bool accessed) +{ + int i, npages; + struct pl111_gem_bo *bo; + npages = obj->size >> PAGE_SHIFT; + bo = PL111_BO_FROM_GEM(obj); + for (i = 0; i < npages; i++) { + if (dirty) + set_page_dirty(pages[i]); + if (accessed) + mark_page_accessed(pages[i]); + /* Undo the reference we took when populating the table */ + page_cache_release(pages[i]); + } + drm_free_large(pages); +} + +void put_pages(struct drm_gem_object *obj, struct page **pages) +{ + int i, npages; + struct pl111_gem_bo *bo; + npages = obj->size >> PAGE_SHIFT; + bo = PL111_BO_FROM_GEM(obj); + _drm_gem_put_pages(obj, pages, true, true); + if (bo->backing_data.shm.dma_addrs) { + for (i = 0; i < npages; i++) { + if (!dma_mapping_error(obj->dev->dev, + bo->backing_data.shm.dma_addrs[i])) { + dma_unmap_page(obj->dev->dev, + bo->backing_data.shm.dma_addrs[i], + PAGE_SIZE, + DMA_BIDIRECTIONAL); + } + } + kfree(bo->backing_data.shm.dma_addrs); + bo->backing_data.shm.dma_addrs = NULL; + } +} + +/** + * drm_gem_get_pages - helper to allocate backing pages for a GEM object + * @obj: obj in question + * @gfpmask: gfp mask of requested pages + */ +static struct page **_drm_gem_get_pages(struct drm_gem_object *obj, + gfp_t gfpmask) +{ + struct inode *inode; + struct address_space *mapping; + struct page *p, **pages; + int i, npages; + + /* This is the shared memory object that backs the GEM resource */ + inode = obj->filp->f_path.dentry->d_inode; + mapping = inode->i_mapping; + + npages = obj->size >> PAGE_SHIFT; + + pages = drm_malloc_ab(npages, sizeof(struct page *)); + if (pages == NULL) + return ERR_PTR(-ENOMEM); + + gfpmask |= mapping_gfp_mask(mapping); + + for (i = 0; i < npages; i++) { + p = shmem_read_mapping_page_gfp(mapping, i, gfpmask); + if (IS_ERR(p)) + goto fail; + pages[i] = p; + + /* + * There is a hypothetical issue w/ drivers that require + * buffer memory in the low 4GB.. if the pages are un- + * pinned, and swapped out, they can end up swapped back + * in above 4GB. If pages are already in memory, then + * shmem_read_mapping_page_gfp will ignore the gfpmask, + * even if the already in-memory page disobeys the mask. + * + * It is only a theoretical issue today, because none of + * the devices with this limitation can be populated with + * enough memory to trigger the issue. But this BUG_ON() + * is here as a reminder in case the problem with + * shmem_read_mapping_page_gfp() isn't solved by the time + * it does become a real issue. + * + * See this thread: http://lkml.org/lkml/2011/7/11/238 + */ + BUG_ON((gfpmask & __GFP_DMA32) && + (page_to_pfn(p) >= 0x00100000UL)); + } + + return pages; + +fail: + while (i--) + page_cache_release(pages[i]); + + drm_free_large(pages); + return ERR_PTR(PTR_ERR(p)); +} + +struct page **get_pages(struct drm_gem_object *obj) +{ + struct pl111_gem_bo *bo; + bo = PL111_BO_FROM_GEM(obj); + + if (bo->backing_data.shm.pages == NULL) { + struct page **p; + int npages = obj->size >> PAGE_SHIFT; + int i; + + p = _drm_gem_get_pages(obj, GFP_KERNEL); + if (IS_ERR(p)) + return ERR_PTR(-ENOMEM); + + bo->backing_data.shm.pages = p; + + if (bo->backing_data.shm.dma_addrs == NULL) { + bo->backing_data.shm.dma_addrs = + kzalloc(npages * sizeof(dma_addr_t), + GFP_KERNEL); + if (bo->backing_data.shm.dma_addrs == NULL) + goto error_out; + } + + for (i = 0; i < npages; ++i) { + bo->backing_data.shm.dma_addrs[i] = + dma_map_page(obj->dev->dev, p[i], 0, PAGE_SIZE, + DMA_BIDIRECTIONAL); + if (dma_mapping_error(obj->dev->dev, + bo->backing_data.shm.dma_addrs[i])) + goto error_out; + } + } + + return bo->backing_data.shm.pages; + +error_out: + put_pages(obj, bo->backing_data.shm.pages); + bo->backing_data.shm.pages = NULL; + return ERR_PTR(-ENOMEM); +} + +/* END drivers/staging/omapdrm/omap_gem_helpers.c */ + +int pl111_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct page **pages; + pgoff_t pgoff; + unsigned long pfn; + struct drm_gem_object *obj = vma->vm_private_data; + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); + DRM_DEBUG_KMS("DRM %s on pl111_gem_bo %p\n", __func__, bo); + + /* We don't use vmf->pgoff since that has the fake offset: */ + pgoff = ((unsigned long)vmf->virtual_address - + vma->vm_start) >> PAGE_SHIFT; + if (bo->type == PL111_BOT_SHM) { + pages = get_pages(obj); + if (IS_ERR(pages)) { + dev_err(obj->dev->dev, + "could not get pages: %ld\n", PTR_ERR(pages)); + return PTR_ERR(pages); + } + pfn = page_to_pfn(pages[pgoff]); + DRM_DEBUG_KMS("physaddr 0x%.8x for offset 0x%x\n", + PFN_PHYS(pfn), PFN_PHYS(pgoff)); + vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn); + return VM_FAULT_NOPAGE; + } else { + DRM_DEBUG_KMS("Fault on non-shared memory %p\n", + vmf->virtual_address); + return VM_FAULT_SIGBUS; + } +} -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 17:17 ` [RFC 1/1] " tom.cooksey @ 2013-08-07 16:17 ` Rob Clark 2013-08-07 16:46 ` Daniel Vetter 0 siblings, 1 reply; 131+ messages in thread From: Rob Clark @ 2013-08-07 16:17 UTC (permalink / raw) To: tom.cooksey; +Cc: linux-arm-kernel, linux-fbdev, pawel.moll, dri-devel On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > From: Tom Cooksey <tom.cooksey@arm.com> > > This is a mode-setting driver for the pl111 CLCD display controller > found on various ARM reference platforms such as the Versatile > Express. The driver supports setting of a single mode (640x480) and > has only been tested on Versatile Express with a Cortex-A9 core tile. > > Known issues: > * It still includes code to use KDS, which is not going upstream. review's on http://lists.freedesktop.org/archives/dri-devel/2013-July/042462.html can't hurt although you might consider submitting a reduced functionality driver w/ KDS bits removed in the mean time.. then when the fence stuff is merged it is just an incremental patch rather than a whole driver ;-) > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > allocate buffers for the GPU. > * The v_sync handling needs work - a work queue is a little > overkill. > * Doesn't support the import half of PRIME properly, only export > * Need to validate src rectangle size in when updating the cursor > plane. > * Only supports 640x480 mode, which is hard-coded. > * We register both an amba & platform driver, only need one. > > Signed-off-by: Tom Cooksey <tom.cooksey@arm.com> > --- > drivers/gpu/drm/Kconfig | 2 + > drivers/gpu/drm/Makefile | 1 + > drivers/gpu/drm/pl111/Kbuild | 14 + > drivers/gpu/drm/pl111/Kconfig | 9 + > drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ > drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ > drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ > drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ > drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ > drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ > drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ > 19 files changed, 3268 insertions(+) > create mode 100644 drivers/gpu/drm/pl111/Kbuild > create mode 100644 drivers/gpu/drm/pl111/Kconfig > create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c > > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig > index a7c54c8..4f743f3 100644 > --- a/drivers/gpu/drm/Kconfig > +++ b/drivers/gpu/drm/Kconfig > @@ -223,3 +223,5 @@ source "drivers/gpu/drm/omapdrm/Kconfig" > source "drivers/gpu/drm/tilcdc/Kconfig" > > source "drivers/gpu/drm/qxl/Kconfig" > + > +source "drivers/gpu/drm/pl111/Kconfig" > diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile > index 801bcaf..2ec0181 100644 > --- a/drivers/gpu/drm/Makefile > +++ b/drivers/gpu/drm/Makefile > @@ -54,4 +54,5 @@ obj-$(CONFIG_DRM_SHMOBILE) +=shmobile/ > obj-$(CONFIG_DRM_OMAP) += omapdrm/ > obj-$(CONFIG_DRM_TILCDC) += tilcdc/ > obj-$(CONFIG_DRM_QXL) += qxl/ > +obj-$(CONFIG_DRM_PL111) += pl111/ > obj-y += i2c/ > diff --git a/drivers/gpu/drm/pl111/Kbuild b/drivers/gpu/drm/pl111/Kbuild > new file mode 100644 > index 0000000..5dbd333 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/Kbuild > @@ -0,0 +1,14 @@ > +pl111_drm-y += pl111_drm_device.o \ > + pl111_drm_connector.o \ > + pl111_drm_crtc.o \ > + pl111_drm_cursor.o \ > + pl111_drm_dma_buf.o \ > + pl111_drm_encoder.o \ > + pl111_drm_fb.o \ > + pl111_drm_gem.o \ > + pl111_drm_pl111.o \ > + pl111_drm_platform.o \ > + pl111_drm_suspend.o \ > + pl111_drm_vma.o > + > +obj-$(CONFIG_DRM_PL111) += pl111_drm.o > diff --git a/drivers/gpu/drm/pl111/Kconfig b/drivers/gpu/drm/pl111/Kconfig > new file mode 100644 > index 0000000..6aa4739 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/Kconfig > @@ -0,0 +1,9 @@ > +config DRM_PL111 > + tristate "DRM Support for PL111 CLCD Controller" > + depends on DRM > + select DRM_KMS_HELPER > + select VT_HW_CONSOLE_BINDING if FRAMEBUFFER_CONSOLE > + help > + Choose this option for DRM support for the PL111 CLCD controller. > + If M is selected the module will be called pl111_drm. > + > diff --git a/drivers/gpu/drm/pl111/pl111_clcd_ext.h b/drivers/gpu/drm/pl111/pl111_clcd_ext.h > new file mode 100644 > index 0000000..06e424c > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_clcd_ext.h > @@ -0,0 +1,78 @@ > +/* > + * Support for PL111 > + * > + * Portions (C) COPYRIGHT 2011-2013 ARM Limited. All rights reserved. > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > +/** > + * pl111_clcd_ext.h > + * Extended CLCD register definitions > + */ > + > +#ifndef PL111_CLCD_EXT_H_ > +#define PL111_CLCD_EXT_H_ > + > +/* PL111 cursor register definitions not defined in the kernel's clcd header */ > + > +#define CLCD_CRSR_IMAGE 0x00000800 > + > +#define CLCD_CRSR_IMAGE_MAX_WORDS 256 > + > +#define CLCD_CRSR_CTRL 0x00000c00 > +#define CLCD_CRSR_CONFIG 0x00000c04 > +#define CLCD_CRSR_PALETTE_0 0x00000c08 > +#define CLCD_CRSR_PALETTE_1 0x00000c0c > +#define CLCD_CRSR_XY 0x00000c10 > +#define CLCD_CRSR_CLIP 0x00000c14 > +#define CLCD_CRSR_IMSC 0x00000c20 > +#define CLCD_CRSR_ICR 0x00000c24 > +#define CLCD_CRSR_RIS 0x00000c28 > +#define CLCD_MIS 0x00000c2c > + > +#define CRSR_CTRL_CRSR_ON (1 << 0) > +#define CRSR_CTRL_CRSR_MAX 3 > +#define CRSR_CTRL_CRSR_NUM_SHIFT 4 > +#define CRSR_CTRL_CRSR_NUM_MASK \ > + (CRSR_CTRL_CRSR_MAX << CRSR_CTRL_CRSR_NUM_SHIFT) > +#define CRSR_CTRL_CURSOR_0 0 > +#define CRSR_CTRL_CURSOR_1 1 > +#define CRSR_CTRL_CURSOR_2 2 > +#define CRSR_CTRL_CURSOR_3 3 > + > +#define CRSR_CONFIG_CRSR_SIZE (1 << 0) > +#define CRSR_CONFIG_CRSR_FRAME_SYNC (1 << 1) > + > +#define CRSR_PALETTE_RED_SHIFT 0 > +#define CRSR_PALETTE_GREEN_SHIFT 8 > +#define CRSR_PALETTE_BLUE_SHIFT 16 > + > +#define CRSR_PALETTE_RED_MASK 0x000000ff > +#define CRSR_PALETTE_GREEN_MASK 0x0000ff00 > +#define CRSR_PALETTE_BLUE_MASK 0x00ff0000 > +#define CRSR_PALETTE_MASK (~0xff000000) > + > +#define CRSR_XY_MASK 0x000003ff > +#define CRSR_XY_X_SHIFT 0 > +#define CRSR_XY_Y_SHIFT 16 > + > +#define CRSR_XY_X_MASK CRSR_XY_MASK > +#define CRSR_XY_Y_MASK (CRSR_XY_MASK << CRSR_XY_Y_SHIFT) > + > +#define CRSR_CLIP_MASK 0x3f > +#define CRSR_CLIP_X_SHIFT 0 > +#define CRSR_CLIP_Y_SHIFT 8 > + > +#define CRSR_CLIP_X_MASK CRSR_CLIP_MASK > +#define CRSR_CLIP_Y_MASK (CRSR_CLIP_MASK << CRSR_CLIP_Y_SHIFT) > + > +#define CRSR_IMSC_CRSR_IM (1<<0) > +#define CRSR_ICR_CRSR_IC (1<<0) > +#define CRSR_RIS_CRSR_RIS (1<<0) > +#define CRSR_MIS_CRSR_MIS (1<<0) > + > +#endif /* PL111_CLCD_EXT_H_ */ > diff --git a/drivers/gpu/drm/pl111/pl111_drm.h b/drivers/gpu/drm/pl111/pl111_drm.h > new file mode 100644 > index 0000000..6d39a8b > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm.h > @@ -0,0 +1,227 @@ > +/* > + * > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +#ifndef _PL111_DRM_H_ > +#define _PL111_DRM_H_ > + > +/* Defines for drm_mode_create_dumb flags settings */ > +#define PL111_BO_SCANOUT 0x00000001 /* scanout compatible buffer requested */ > + > +#define DRIVER_AUTHOR "ARM Ltd." > +#define DRIVER_NAME "pl111_drm" > +#define DRIVER_DESC "DRM module for PL111" > +#define DRIVER_LICENCE "GPL" > +#define DRIVER_ALIAS "platform:pl111_drm" > +#define DRIVER_DATE "20101111" > +#define DRIVER_VERSION "0.2" > +#define DRIVER_MAJOR 2 > +#define DRIVER_MINOR 1 > +#define DRIVER_PATCHLEVEL 1 > + > +/* > + * Number of flips allowed in flight at any one time. Any more flips requested > + * beyond this value will cause the caller to block until earlier flips have > + * completed. > + * > + * For performance reasons, this must be greater than the number of buffers > + * used in the rendering pipeline. Note that the rendering pipeline can contain > + * different types of buffer, e.g.: > + * - 2 final framebuffers > + * - >2 geometry buffers for GPU use-cases > + * - >2 vertex buffers for GPU use-cases > + * > + * For example, a system using 5 geometry buffers could have 5 flips in flight, > + * and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 5 or greater. > + * > + * Whilst there may be more intermediate buffers (such as vertex/geometry) than > + * final framebuffers, KDS is used to ensure that GPU rendering waits for the > + * next off-screen buffer, so it doesn't overwrite an on-screen buffer and > + * produce tearing. > + */ > + fwiw, this is at least different from how other drivers do triple (or > double) buffering. In other drivers (intel, omap, and msm/freedreno, that I know of, maybe others too) the xorg driver dri2 bits implement the double buffering (ie. send flip event back to client immediately and queue up the flip and call page-flip after the pageflip event back from kernel. I'm not saying not to do it this way, I guess I'd like to hear what other folks think. I kinda prefer doing this in userspace as it keeps the kernel bits simpler (plus it would then work properly on exynosdrm or other kms drivers). > +/* > + * Here, we choose a conservative value. A lower value is most likely > + * suitable for GPU use-cases. > + */ > +#define NR_FLIPS_IN_FLIGHT_THRESHOLD 16 > + > +#define CLCD_IRQ_NEXTBASE_UPDATE (1u<<2) > + > +struct pl111_drm_flip_resource; > +struct pl111_drm_cursor_plane; > + > +enum pl111_bo_type { > + PL111_BOT_DMA, > + PL111_BOT_SHM > +}; > + > +struct pl111_gem_bo_dma { > + dma_addr_t fb_dev_addr; > + void *fb_cpu_addr; > +}; > + > +struct pl111_gem_bo_shm { > + struct page **pages; > + dma_addr_t *dma_addrs; > +}; > + > +struct pl111_gem_bo { > + struct drm_gem_object gem_object; > + enum pl111_bo_type type; > + union { > + struct pl111_gem_bo_dma dma; > + struct pl111_gem_bo_shm shm; > + } backing_data; > + struct drm_framebuffer *fb; this is at least a bit odd.. normally the fb has ref to the bo(s) and not the other way around. And the same bo could be referenced by multiple fb's which would kinda fall down with this approach. > +}; > + > +extern struct pl111_drm_dev_private priv; > + > +struct pl111_drm_framebuffer { > + struct drm_framebuffer fb; > + struct pl111_gem_bo *bo; > +}; > + > +struct pl111_drm_flip_resource { > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + struct kds_resource_set *kds_res_set; > + int worker_release_kds; > +#endif > + struct drm_framebuffer *fb; > + struct drm_crtc *crtc; > + struct work_struct vsync_work; > + struct list_head link; > + bool page_flip; > + struct drm_pending_vblank_event *event; > +}; > + > +#define MAX_CURSOR_FORMATS (1) > + > +struct pl111_drm_cursor_plane { > + struct drm_plane base; > + uint32_t *formats; > + uint32_t num_formats_supported; > +}; btw, if you have proper hw cursors, you probably want to expose that via crtc cursor API (and then userspace bits should try cursor first and then fallback to plane.. this is what weston does, fwiw) > + > +struct pl111_drm_crtc { > + struct drm_crtc crtc; > + int crtc_index; > + > + spinlock_t current_displaying_lock; > + spinlock_t base_update_lock; > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + struct kds_resource_set *old_kds_res_set; > +#endif > + struct drm_framebuffer *displaying_fb; > + > + struct drm_display_mode *new_mode; > + struct drm_display_mode *current_mode; > + int last_bpp; > + > + /* > + * The resource that caused a base address update. Only one can be > + * pending, hence it's != NULL if there's a pending update > + */ > + struct pl111_drm_flip_resource *current_update_res; > + /* Queue of things waiting to update the base address */ > + struct list_head update_queue; > + > + struct workqueue_struct *vsync_wq; > + > + struct pl111_drm_cursor_plane cursor; > + > + void (*show_framebuffer_cb)(struct pl111_drm_flip_resource *flip_res, > + struct drm_framebuffer *fb); > +}; > + > +struct pl111_drm_connector { > + struct drm_connector connector; > +}; > + > +struct pl111_drm_encoder { > + struct drm_encoder encoder; > +}; > + > +struct pl111_drm_dev_private { > + struct pl111_drm_crtc *pl111_crtc; > + > + struct amba_device *amba_dev; > + unsigned long mmio_start; > + __u32 mmio_len; > + void *regs; > + struct clk *clk; > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + struct kds_callback kds_cb; > + struct kds_callback kds_obtain_current_cb; > +#endif > + /* > + * Number of flips that were started in show_framebuffer_on_crtc(), > + * but haven't completed yet - because we do deferred flipping > + */ > + atomic_t nr_flips_in_flight; > + wait_queue_head_t wait_for_flips; if the end result is to keep queue'ing in the kernel, this should probably be moved into the crtc. Not sure if you ever have hw that can support more than one crtc, but if you do you probably want this in the crtc. > + > + /* > + * Used to prevent race between pl111_dma_buf_release and > + * drm_gem_prime_handle_to_fd > + */ > + struct mutex export_dma_buf_lock; hmm, seems a bit suspicious.. the handle reference should keep the object live. Ie. either drm_gem_object_lookup() will fail because the object is gone (userspace has closed it's handle ref and dmabuf->release() already dropped it's ref) or it will succeed and you'll have a reference to the bo keeping it from going away if the release() comes after. > + uint32_t number_crtcs; > + > + /* Cache for flip resources used to avoid kmalloc on each page flip */ > + struct kmem_cache *page_flip_slab; > +}; > + > +enum pl111_cursor_size { > + CURSOR_32X32, > + CURSOR_64X64 > +}; > + > +enum pl111_cursor_sync { > + CURSOR_SYNC_NONE, > + CURSOR_SYNC_VSYNC > +}; > + > +#define PL111_FB_FROM_FRAMEBUFFER(drm_fb) \ > + (container_of(drm_fb, struct pl111_drm_framebuffer, fb)) > + > +#define PL111_BO_FROM_FRAMEBUFFER(drm_fb) \ > + (container_of(drm_fb, struct pl111_drm_framebuffer, fb)->bo) > + > +#define PL111_BO_TO_FRAMEBUFFER(drm_fb, bo) \ > + do { \ > + container_of(drm_fb, \ > + struct pl111_drm_framebuffer, fb)->bo = bo; \ > + bo->fb = fb; \ > + } while (0) > + > +#define PL111_BO_FROM_GEM(gem_obj) \ > + container_of(gem_obj, struct pl111_gem_bo, gem_object) > + > +#define to_pl111_crtc(x) container_of(x, struct pl111_drm_crtc, crtc) > + > +#define PL111_ENCODER_FROM_ENCODER(x) \ > + container_of(x, struct pl111_drm_encoder, encoder) > + > +#define PL111_CONNECTOR_FROM_CONNECTOR(x) \ > + container_of(x, struct pl111_drm_connector, connector) > + > +#include "pl111_drm_funcs.h" > + > +#endif /* _PL111_DRM_H_ */ > diff --git a/drivers/gpu/drm/pl111/pl111_drm_connector.c b/drivers/gpu/drm/pl111/pl111_drm_connector.c > new file mode 100644 > index 0000000..304a5be > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_connector.c > @@ -0,0 +1,166 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_connector.c > + * Implementation of the connector functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > + > +static struct { > + int w, h, type; > +} pl111_drm_modes[] = { > + { 640, 480, DRM_MODE_TYPE_PREFERRED}, > + { 800, 600, 0}, > + {1024, 768, 0}, > + { -1, -1, -1} > +}; > + > +void pl111_connector_destroy(struct drm_connector *connector) > +{ > + struct pl111_drm_connector *pl111_connector = > + PL111_CONNECTOR_FROM_CONNECTOR(connector); > + > + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); > + > + drm_sysfs_connector_remove(connector); > + drm_connector_cleanup(connector); > + kfree(pl111_connector); > +} > + > +enum drm_connector_status pl111_connector_detect(struct drm_connector > + *connector, bool force) > +{ > + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); > + return connector_status_connected; > +} > + > +void pl111_connector_dpms(struct drm_connector *connector, int mode) > +{ > + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); > +} > + > +struct drm_encoder * > +pl111_connector_helper_best_encoder(struct drm_connector *connector) > +{ > + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); > + > + if (connector->encoder != NULL) { > + return connector->encoder; /* Return attached encoder */ > + } else { > + /* > + * If there is no attached encoder we choose the best candidate > + * from the list. > + * For PL111 there is only one encoder so we return the first > + * one we find. > + * Other h/w would require a suitable criterion below. > + */ > + struct drm_encoder *encoder = NULL; > + struct drm_device *dev = connector->dev; > + > + list_for_each_entry(encoder, &dev->mode_config.encoder_list, > + head) { > + if (1) { /* criterion ? */ > + break; > + } > + } > + return encoder; /* return best candidate encoder */ > + } > +} > + > +int pl111_connector_helper_get_modes(struct drm_connector *connector) > +{ > + int i = 0; > + int count = 0; > + > + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); > + > + while (pl111_drm_modes[i].w != -1) { > + struct drm_display_mode *mode = > + drm_mode_find_dmt(connector->dev, > + pl111_drm_modes[i].w, > + pl111_drm_modes[i].h, > + 60, > + false); > + > + if (mode != NULL) { > + mode->type |= pl111_drm_modes[i].type; > + drm_mode_probed_add(connector, mode); > + count++; > + } > + > + i++; > + } > + > + DRM_DEBUG_KMS("found %d modes\n", count); > + > + return count; > +} > + > +int pl111_connector_helper_mode_valid(struct drm_connector *connector, > + struct drm_display_mode *mode) > +{ > + DRM_DEBUG_KMS("DRM %s on connector=%p\n", __func__, connector); > + return MODE_OK; > +} > + > +const struct drm_connector_funcs connector_funcs = { > + .fill_modes = drm_helper_probe_single_connector_modes, > + .destroy = pl111_connector_destroy, > + .detect = pl111_connector_detect, > + .dpms = pl111_connector_dpms, > +}; > + > +const struct drm_connector_helper_funcs connector_helper_funcs = { > + .get_modes = pl111_connector_helper_get_modes, > + .mode_valid = pl111_connector_helper_mode_valid, > + .best_encoder = pl111_connector_helper_best_encoder, > +}; > + > +struct pl111_drm_connector *pl111_connector_create(struct drm_device *dev) > +{ > + struct pl111_drm_connector *pl111_connector; > + > + pl111_connector = kzalloc(sizeof(struct pl111_drm_connector), > + GFP_KERNEL); > + > + if (pl111_connector == NULL) { > + pr_err("Failed to allocated pl111_drm_connector\n"); > + return NULL; > + } > + > + drm_connector_init(dev, &pl111_connector->connector, &connector_funcs, > + DRM_MODE_CONNECTOR_DVII); > + > + drm_connector_helper_add(&pl111_connector->connector, > + &connector_helper_funcs); > + > + drm_sysfs_connector_add(&pl111_connector->connector); > + > + return pl111_connector; > +} > + > diff --git a/drivers/gpu/drm/pl111/pl111_drm_crtc.c b/drivers/gpu/drm/pl111/pl111_drm_crtc.c > new file mode 100644 > index 0000000..1f8efbe > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_crtc.c > @@ -0,0 +1,432 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_crtc.c > + * Implementation of the CRTC functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > +static int pl111_crtc_num; > + > +static void vsync_worker(struct work_struct *work) > +{ > + struct pl111_drm_flip_resource *flip_res; > + struct pl111_gem_bo *bo; > + struct pl111_drm_crtc *pl111_crtc; > + struct drm_device *dev; > + int flips_in_flight; > + flip_res = > + container_of(work, struct pl111_drm_flip_resource, vsync_work); > + > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > + dev = pl111_crtc->crtc.dev; > + > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > + > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + if (flip_res->worker_release_kds == true) { > + spin_lock(&pl111_crtc->current_displaying_lock); > + release_kds_resource_and_display(flip_res); > + spin_unlock(&pl111_crtc->current_displaying_lock); > + } > +#endif > + /* Release DMA buffer on this flip */ > + if (bo->gem_object.export_dma_buf != NULL) > + dma_buf_put(bo->gem_object.export_dma_buf); I think you just want to unref the outgoing bo, and let it drop the dmabuf ref when the file ref of the imported bo goes. Or actually, it would be better to hold/drop ref's to the fb, rather than the bo. At least this will make things simpler if you ever have multi-planar support. Note you probably want to just do all in the irq. The drm-flip-work patches I sent recently give you an easy way to punt off the fb or bo unref to a workqueue. And, well, the whole thing gets simpler if you aren't queuing flips in the kernel. If your hw's scanout register is double buffered, you could probably look at the mdp4_crtc bits in the msm driver patches I recently sent. Otherwise tilcdc is (with the flip-work conversion patch I sent) is a reasonably simple example. > + > + drm_handle_vblank(dev, pl111_crtc->crtc_index); > + > + /* Wake up any processes waiting for page flip event */ > + if (flip_res->event) { > + spin_lock_bh(&dev->event_lock); > + drm_send_vblank_event(dev, pl111_crtc->crtc_index, > + flip_res->event); > + spin_unlock_bh(&dev->event_lock); > + } > + > + drm_vblank_put(dev, pl111_crtc->crtc_index); > + > + /* > + * workqueue.c:process_one_work(): > + * "It is permissible to free the struct work_struct from > + * inside the function that is called from it" > + */ > + kmem_cache_free(priv.page_flip_slab, flip_res); > + > + flips_in_flight = atomic_dec_return(&priv.nr_flips_in_flight); > + if (flips_in_flight == 0 || > + flips_in_flight == (NR_FLIPS_IN_FLIGHT_THRESHOLD - 1)) > + wake_up(&priv.wait_for_flips); > + > + DRM_DEBUG_KMS("DRM release flip_res=%p\n", flip_res); > +} > + > +void pl111_common_irq(struct pl111_drm_crtc *pl111_crtc) > +{ > + unsigned long irq_flags; > + > + spin_lock_irqsave(&pl111_crtc->base_update_lock, irq_flags); > + > + if (pl111_crtc->current_update_res != NULL) { > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + /* > + * If the lock is not acquired defer completion of the > + * resource that caused the buffer update > + */ > + pl111_crtc->current_update_res->worker_release_kds = > + true; > + if (0 != spin_trylock( > + &pl111_crtc->current_displaying_lock)) { > + /* release the resource immediately */ > + release_kds_resource_and_display( > + pl111_crtc->current_update_res); > + /* > + * prevent worker from attempting to release > + * resource again > + */ > + pl111_crtc->current_update_res-> > + worker_release_kds = false; > + spin_unlock(&pl111_crtc-> > + current_displaying_lock); > + } > +#endif > + /* > + * Release dma_buf and resource > + * (if not already released) > + */ > + queue_work(pl111_crtc->vsync_wq, > + &pl111_crtc->current_update_res->vsync_work); > + pl111_crtc->current_update_res = NULL; > + } > + > + if (!list_empty(&pl111_crtc->update_queue)) { > + struct pl111_drm_flip_resource *flip_res; > + /* Remove the head of the list */ > + flip_res = list_first_entry(&pl111_crtc->update_queue, > + struct pl111_drm_flip_resource, link); > + list_del(&flip_res->link); > + do_flip_to_res(flip_res); > + /* > + * current_update_res will be set, so guarentees that > + * another flip_res coming in gets queued instead of > + * handled immediately > + */ > + } > + > + spin_unlock_irqrestore(&pl111_crtc->base_update_lock, irq_flags); > +} > + > +void show_framebuffer_on_crtc_cb(void *cb1, void *cb2) > +{ > + struct pl111_drm_flip_resource *flip_res = cb1; > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); > + > + pl111_crtc->show_framebuffer_cb(cb1, cb2); > +} > + > +int show_framebuffer_on_crtc(struct drm_crtc *crtc, > + struct drm_framebuffer *fb, bool page_flip, > + struct drm_pending_vblank_event *event) > +{ > + struct pl111_gem_bo *bo; > + struct pl111_drm_flip_resource *flip_res; > + int flips_in_flight; > + int old_flips_in_flight; > + > + crtc->fb = fb; > + > + bo = PL111_BO_FROM_FRAMEBUFFER(fb); > + if (bo == NULL) { > + DRM_DEBUG_KMS("Failed to get pl111_gem_bo object\n"); > + return -EINVAL; > + } > + > + /* If this is a full modeset, wait for all outstanding flips to complete > + * before continuing. This avoids unnecessary complication from being > + * able to queue up multiple modesets and queues of mixed modesets and > + * page flips. > + * > + * Modesets should be uncommon and will not be performant anyway, so > + * making them synchronous should have negligible performance impact. > + */ > + if (!page_flip) { > + int ret = wait_event_killable(priv.wait_for_flips, > + atomic_read(&priv.nr_flips_in_flight) == 0); > + if (ret) > + return ret; > + } > + > + /* > + * There can be more 'early display' flips in flight than there are > + * buffers, and there is (currently) no explicit bound on the number of > + * flips. Hence, we need a new allocation for each one. > + * > + * Note: this could be optimized down if we knew a bound on the flips, > + * since an application can only have so many buffers in flight to be > + * useful/not hog all the memory > + */ > + flip_res = kmem_cache_alloc(priv.page_flip_slab, GFP_KERNEL); > + if (flip_res == NULL) { > + pr_err("kmem_cache_alloc failed to alloc - flip ignored\n"); > + return -ENOMEM; > + } > + > + /* > + * increment flips in flight, whilst blocking when we reach > + * NR_FLIPS_IN_FLIGHT_THRESHOLD > + */ > + do { > + /* > + * Note: use of assign-and-then-compare in the condition to set > + * flips_in_flight > + */ > + int ret = wait_event_killable(priv.wait_for_flips, > + (flips_in_flight = > + atomic_read(&priv.nr_flips_in_flight)) > + < NR_FLIPS_IN_FLIGHT_THRESHOLD); > + if (ret != 0) { > + kmem_cache_free(priv.page_flip_slab, flip_res); > + return ret; > + } > + > + old_flips_in_flight = atomic_cmpxchg(&priv.nr_flips_in_flight, > + flips_in_flight, flips_in_flight + 1); > + } while (old_flips_in_flight != flips_in_flight); this isn't really making me a fan of queue'ing the flips in the kernel.. > + flip_res->fb = fb; > + flip_res->crtc = crtc; > + flip_res->page_flip = page_flip; > + flip_res->event = event; > + INIT_WORK(&flip_res->vsync_work, vsync_worker); > + INIT_LIST_HEAD(&flip_res->link); > + DRM_DEBUG_KMS("DRM alloc flip_res=%p\n", flip_res); > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + if (bo->gem_object.export_dma_buf != NULL) { > + struct dma_buf *buf = bo->gem_object.export_dma_buf; > + unsigned long shared[1] = { 0 }; > + struct kds_resource *resource_list[1] = { > + get_dma_buf_kds_resource(buf) }; > + int err; > + > + get_dma_buf(buf); > + DRM_DEBUG_KMS("Got dma_buf %p\n", buf); > + > + /* Wait for the KDS resource associated with this buffer */ > + err = kds_async_waitall(&flip_res->kds_res_set, > + &priv.kds_cb, flip_res, fb, 1, shared, > + resource_list); > + BUG_ON(err); > + } else { > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + > + DRM_DEBUG_KMS("No dma_buf for this flip\n"); > + > + /* No dma-buf attached so just call the callback directly */ > + flip_res->kds_res_set = NULL; > + pl111_crtc->show_framebuffer_cb(flip_res, fb); > + } > +#else > + if (bo->gem_object.export_dma_buf != NULL) { > + struct dma_buf *buf = bo->gem_object.export_dma_buf; > + > + get_dma_buf(buf); > + DRM_DEBUG_KMS("Got dma_buf %p\n", buf); see earlier comment about holding ref to fb.. this just seems really wrong > + } else { > + DRM_DEBUG_KMS("No dma_buf for this flip\n"); > + } > + > + /* No dma-buf attached to this so just call the callback directly */ > + { > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + pl111_crtc->show_framebuffer_cb(flip_res, fb); > + } > +#endif > + > + /* For the same reasons as the wait at the start of this function, > + * wait for the modeset to complete before continuing. > + */ > + if (!page_flip) { > + int ret = wait_event_killable(priv.wait_for_flips, > + flips_in_flight == 0); > + if (ret) > + return ret; > + } > + > + return 0; > +} > + > +int pl111_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb, > + struct drm_pending_vblank_event *event) > +{ > + DRM_DEBUG_KMS("%s: crtc=%p, fb=%p, event=%p\n", > + __func__, crtc, fb, event); > + return show_framebuffer_on_crtc(crtc, fb, true, event); > +} > + > +int pl111_crtc_helper_mode_set(struct drm_crtc *crtc, > + struct drm_display_mode *mode, > + struct drm_display_mode *adjusted_mode, > + int x, int y, struct drm_framebuffer *old_fb) > +{ > + int ret; > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + struct drm_display_mode *duplicated_mode; > + > + DRM_DEBUG_KMS("DRM crtc_helper_mode_set, x=%d y=%d bpp=%d\n", > + adjusted_mode->hdisplay, adjusted_mode->vdisplay, > + crtc->fb->bits_per_pixel); > + > + duplicated_mode = drm_mode_duplicate(crtc->dev, adjusted_mode); > + if (!duplicated_mode) > + return -ENOMEM; > + > + pl111_crtc->new_mode = duplicated_mode; > + ret = show_framebuffer_on_crtc(crtc, crtc->fb, false, NULL); > + if (ret != 0) { > + pl111_crtc->new_mode = pl111_crtc->current_mode; > + drm_mode_destroy(crtc->dev, duplicated_mode); > + } > + > + return ret; > +} > + > +void pl111_crtc_helper_prepare(struct drm_crtc *crtc) > +{ > + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); > +} > + > +void pl111_crtc_helper_commit(struct drm_crtc *crtc) > +{ > + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); > +} > + > +bool pl111_crtc_helper_mode_fixup(struct drm_crtc *crtc, > + struct drm_display_mode *mode, > + struct drm_display_mode *adjusted_mode) > +{ > + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); > + > +#ifdef CONFIG_ARCH_VEXPRESS > + /* > + * 1024x768 with more than 16 bits per pixel does not work correctly > + * on Versatile Express > + */ > + if (mode->hdisplay == 1024 && mode->vdisplay == 768 && > + crtc->fb->bits_per_pixel > 16) { > + return false; > + } > +#endif > + > + return true; > +} > + > +void pl111_crtc_helper_disable(struct drm_crtc *crtc) > +{ > + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); > + clcd_disable(crtc); > +} > + > +void pl111_crtc_destroy(struct drm_crtc *crtc) > +{ > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + > + DRM_DEBUG_KMS("DRM %s on crtc=%p\n", __func__, crtc); > + > + destroy_workqueue(pl111_crtc->vsync_wq); > + drm_crtc_cleanup(crtc); > + kfree(pl111_crtc); > +} > + > +const struct drm_crtc_funcs crtc_funcs = { > + .set_config = drm_crtc_helper_set_config, > + .page_flip = pl111_crtc_page_flip, > + .destroy = pl111_crtc_destroy > +}; > + > +const struct drm_crtc_helper_funcs crtc_helper_funcs = { > + .mode_set = pl111_crtc_helper_mode_set, > + .prepare = pl111_crtc_helper_prepare, > + .commit = pl111_crtc_helper_commit, > + .mode_fixup = pl111_crtc_helper_mode_fixup, > + .disable = pl111_crtc_helper_disable, > +}; > + > +bool pl111_crtc_is_fb_currently_displayed(struct drm_device *dev, > + struct drm_framebuffer *fb) > +{ > + struct drm_crtc *crtc; > + > + if (fb == NULL) > + return false; > + > + list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) { > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + if (pl111_crtc->displaying_fb == fb) > + return true; > + } > + return false; > +} > + > +struct pl111_drm_crtc *pl111_crtc_create(struct drm_device *dev) > +{ > + struct pl111_drm_crtc *pl111_crtc; > + > + pl111_crtc = kzalloc(sizeof(struct pl111_drm_crtc), GFP_KERNEL); > + if (pl111_crtc == NULL) { > + pr_err("Failed to allocated pl111_drm_crtc\n"); > + return NULL; > + } > + > + drm_crtc_init(dev, &pl111_crtc->crtc, &crtc_funcs); > + drm_crtc_helper_add(&pl111_crtc->crtc, &crtc_helper_funcs); > + > + pl111_crtc->crtc_index = pl111_crtc_num; > + pl111_crtc_num++; > + pl111_crtc->vsync_wq = alloc_ordered_workqueue("pl111_drm_vsync_%d", > + WQ_HIGHPRI, pl111_crtc->crtc_index); > + if (!pl111_crtc->vsync_wq) { > + pr_err("Failed to allocate vsync workqueue\n"); > + drm_crtc_cleanup(&pl111_crtc->crtc); > + return NULL; > + } > + > + pl111_crtc->crtc.enabled = 0; > + pl111_crtc->displaying_fb = NULL; > + pl111_crtc->last_bpp = 0; > + pl111_crtc->current_update_res = NULL; > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + pl111_crtc->old_kds_res_set = NULL; > +#endif > + pl111_crtc->show_framebuffer_cb = show_framebuffer_on_crtc_cb_internal; > + INIT_LIST_HEAD(&pl111_crtc->update_queue); > + spin_lock_init(&pl111_crtc->current_displaying_lock); > + spin_lock_init(&pl111_crtc->base_update_lock); > + > + return pl111_crtc; > +} > + > diff --git a/drivers/gpu/drm/pl111/pl111_drm_cursor.c b/drivers/gpu/drm/pl111/pl111_drm_cursor.c > new file mode 100644 > index 0000000..6be2a55 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_cursor.c > @@ -0,0 +1,97 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_cursor.c > + * Implementation of cursor functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > +#include "pl111_drm.h" > + > +#define PL111_MAX_CURSOR_WIDTH (64) > +#define PL111_MAX_CURSOR_HEIGHT (64) > + > +static int pl111_drm_cursor_plane_disable(struct drm_plane *plane) > +{ > + pl111_cursor_disable(); > + return 0; > +} > + > +static int pl111_drm_cursor_plane_update(struct drm_plane *plane, > + struct drm_crtc *crtc, struct drm_framebuffer *fb, > + int crtc_x, int crtc_y, > + unsigned int crtc_w, unsigned int crtc_h, > + uint32_t src_x, uint32_t src_y, > + uint32_t src_w, uint32_t src_h) > +{ > + struct pl111_gem_bo *bo = PL111_BO_FROM_FRAMEBUFFER(fb); > + > + /* > + * TODO Find out if there is a way to know if the image needs changing. > + * The cursor API might be better for us than planes as that has > + * distinct set cursor image and set cursor position call backs. > + */ > + > + pl111_set_cursor_image(bo->backing_data.dma.fb_cpu_addr); > + > + pl111_cursor_enable(); > + pl111_set_cursor_position(crtc_x, crtc_y); > + > + return 0; > +} > + > +void pl111_drm_cursor_plane_destroy(struct drm_plane *plane) > +{ > + pl111_drm_cursor_plane_disable(plane); > + drm_plane_cleanup(plane); > +} > + > +static const struct drm_plane_funcs pl111_drm_cursor_plane_funcs = { > + .update_plane = pl111_drm_cursor_plane_update, > + .disable_plane = pl111_drm_cursor_plane_disable, > + .destroy = pl111_drm_cursor_plane_destroy, > +}; > + > +/* > + * We don't actually support ARGB8888 for the cursor only PL111 LBBP, the > + * rasult of setting this is that it creates a buffer larger than we actually > + * need. But there are no compatible formats defined in fourcc.h, so we will > + * only read 256 32 bits words from the buffer to set the cursor image. > + * We expect user space to have formatted the buffer correctly to LBBP. what is LBBP? You probably want to use cursor APIs in crtc and then copy/convert ARGB to whatever format you need into a different bo (ie. treat bo that userspace gives you as a shadow buffer). Weston or any other wayland compositor is going to try to use the cursor APIs and expect ARGB.. > + */ > +static uint32_t pl111_cursor_formats[] = { DRM_FORMAT_ARGB8888 }; > + > +int pl111_cursor_plane_init(struct drm_device *dev, > + struct pl111_drm_cursor_plane *cursor, > + unsigned long possible_crtcs) > +{ > + cursor->formats = pl111_cursor_formats; > + cursor->num_formats_supported = ARRAY_SIZE(pl111_cursor_formats); > + > + return drm_plane_init(dev, &cursor->base, possible_crtcs, > + &pl111_drm_cursor_plane_funcs, cursor->formats, > + cursor->num_formats_supported, false); > +} > + > + > diff --git a/drivers/gpu/drm/pl111/pl111_drm_device.c b/drivers/gpu/drm/pl111/pl111_drm_device.c > new file mode 100644 > index 0000000..4ade09a > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_device.c > @@ -0,0 +1,319 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_device.c > + * Implementation of the Linux device driver entrypoints for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > +#include <linux/slab.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > +struct pl111_drm_dev_private priv; > + > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > +static void initial_kds_obtained(void *cb1, void *cb2) > +{ > + wait_queue_head_t *wait = (wait_queue_head_t *) cb1; > + bool *cb_has_called = (bool *) cb2; > + > + *cb_has_called = true; > + wake_up(wait); > +} > + > +/* Must be called from within current_displaying_lock spinlock */ > +void release_kds_resource_and_display(struct pl111_drm_flip_resource *flip_res) > +{ > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); > + pl111_crtc->displaying_fb = flip_res->fb; > + > + /* Release the previous buffer */ > + if (pl111_crtc->old_kds_res_set != NULL) { > + /* > + * Can flip to the same buffer, but must not release the current > + * resource set > + */ > + BUG_ON(pl111_crtc->old_kds_res_set == flip_res->kds_res_set); > + kds_resource_set_release(&pl111_crtc->old_kds_res_set); > + } > + /* Record the current buffer, to release on the next buffer flip */ > + pl111_crtc->old_kds_res_set = flip_res->kds_res_set; > +} > +#endif > + > +void pl111_drm_preclose(struct drm_device *dev, struct drm_file *file_priv) > +{ > + DRM_DEBUG_KMS("DRM %s on dev=%p\n", __func__, dev); > +} > + > +void pl111_drm_lastclose(struct drm_device *dev) > +{ > + DRM_DEBUG_KMS("DRM %s on dev=%p\n", __func__, dev); > +} > + > +/* > + * pl111 does not have a proper HW counter for vblank IRQs so enable_vblank > + * and disable_vblank are just no op callbacks. > + */ > +static int pl111_enable_vblank(struct drm_device *dev, int crtc) > +{ > + DRM_DEBUG_KMS("%s: dev=%p, crtc=%d", __func__, dev, crtc); > + return 0; > +} > + > +static void pl111_disable_vblank(struct drm_device *dev, int crtc) > +{ > + DRM_DEBUG_KMS("%s: dev=%p, crtc=%d", __func__, dev, crtc); > +} > + > +struct drm_mode_config_funcs mode_config_funcs = { > + .fb_create = pl111_fb_create, > +}; > + > +static int pl111_modeset_init(struct drm_device *dev) > +{ > + struct drm_mode_config *mode_config; > + struct pl111_drm_dev_private *priv = dev->dev_private; > + struct pl111_drm_connector *pl111_connector; > + struct pl111_drm_encoder *pl111_encoder; > + int ret = 0; > + > + if (priv == NULL) > + return -EINVAL; > + > + drm_mode_config_init(dev); > + mode_config = &dev->mode_config; > + mode_config->funcs = &mode_config_funcs; > + mode_config->min_width = 1; > + mode_config->max_width = 1024; > + mode_config->min_height = 1; > + mode_config->max_height = 768; > + > + priv->pl111_crtc = pl111_crtc_create(dev); > + if (priv->pl111_crtc == NULL) { > + pr_err("Failed to create pl111_drm_crtc\n"); > + ret = -ENOMEM; > + goto out_config; > + } > + > + priv->number_crtcs = 1; > + > + pl111_connector = pl111_connector_create(dev); > + if (pl111_connector == NULL) { > + pr_err("Failed to create pl111_drm_connector\n"); > + ret = -ENOMEM; > + goto out_config; > + } > + > + pl111_encoder = pl111_encoder_create(dev, 1); > + if (pl111_encoder == NULL) { > + pr_err("Failed to create pl111_drm_encoder\n"); > + ret = -ENOMEM; > + goto out_config; > + } > + > + ret = drm_mode_connector_attach_encoder(&pl111_connector->connector, > + &pl111_encoder->encoder); > + if (ret != 0) { > + DRM_ERROR("Failed to attach encoder\n"); > + goto out_config; > + } > + > + pl111_connector->connector.encoder = &pl111_encoder->encoder; > + > + ret = pl111_cursor_plane_init(dev, &priv->pl111_crtc->cursor, 1); > + if (ret != 0) { > + pr_err("Failed to init cursor plane\n"); > + goto out_config; > + } > + > + goto finish; > + > +out_config: > + drm_mode_config_cleanup(dev); > +finish: > + DRM_DEBUG("%s returned %d\n", __func__, ret); > + return ret; > +} > + > +static void pl111_modeset_fini(struct drm_device *dev) > +{ > + drm_mode_config_cleanup(dev); > +} > + > +static int pl111_drm_load(struct drm_device *dev, unsigned long chipset) > +{ > + int ret = 0; > + > + pr_info("DRM %s\n", __func__); > + > + mutex_init(&priv.export_dma_buf_lock); > + atomic_set(&priv.nr_flips_in_flight, 0); > + init_waitqueue_head(&priv.wait_for_flips); > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + ret = kds_callback_init(&priv.kds_cb, 1, show_framebuffer_on_crtc_cb); > + if (ret != 0) { > + pr_err("Failed to initialise KDS callback\n"); > + goto finish; > + } > + > + ret = kds_callback_init(&priv.kds_obtain_current_cb, 1, > + initial_kds_obtained); > + if (ret != 0) { > + pr_err("Failed to init KDS obtain callback\n"); > + kds_callback_term(&priv.kds_cb); > + goto finish; > + } > +#endif > + > + /* Create a cache for page flips */ > + priv.page_flip_slab = kmem_cache_create("page flip slab", > + sizeof(struct pl111_drm_flip_resource), 0, 0, NULL); > + if (priv.page_flip_slab == NULL) { > + DRM_ERROR("Failed to create slab\n"); > + ret = -ENOMEM; > + goto out_kds_callbacks; > + } > + > + dev->dev_private = &priv; > + > + ret = pl111_modeset_init(dev); > + if (ret != 0) { > + pr_err("Failed to init modeset\n"); > + goto out_slab; > + } > + > + ret = pl111_device_init(dev); > + if (ret != 0) { > + DRM_ERROR("Failed to init MMIO and IRQ\n"); > + goto out_modeset; > + } > + > + ret = drm_vblank_init(dev, 1); > + if (ret != 0) { > + DRM_ERROR("Failed to init vblank\n"); > + goto out_vblank; > + } > + > + goto finish; > + > +out_vblank: > + pl111_device_fini(dev); > +out_modeset: > + pl111_modeset_fini(dev); > +out_slab: > + kmem_cache_destroy(priv.page_flip_slab); > +out_kds_callbacks: > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + kds_callback_term(&priv.kds_obtain_current_cb); > + kds_callback_term(&priv.kds_cb); > +#endif > +finish: > + DRM_DEBUG_KMS("pl111_drm_load returned %d\n", ret); > + return ret; > +} > + > +static int pl111_drm_unload(struct drm_device *dev) > +{ > + pr_info("DRM %s\n", __func__); > + > + kmem_cache_destroy(priv.page_flip_slab); > + > + drm_vblank_cleanup(dev); > + pl111_modeset_fini(dev); > + pl111_device_fini(dev); > + > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + kds_callback_term(&priv.kds_obtain_current_cb); > + kds_callback_term(&priv.kds_cb); > +#endif > + return 0; > +} > + > +static struct vm_operations_struct pl111_gem_vm_ops = { > + .fault = pl111_gem_fault, > + .open = drm_gem_vm_open, > + .close = drm_gem_vm_close, > +}; > + > +static const struct file_operations drm_fops = { > + .owner = THIS_MODULE, > + .open = drm_open, > + .release = drm_release, > + .unlocked_ioctl = drm_ioctl, > + .mmap = pl111_gem_mmap, > + .poll = drm_poll, > + .read = drm_read, > + .fasync = drm_fasync, > +}; > + > +static struct drm_driver driver = { > + .driver_features = > + DRIVER_MODESET | DRIVER_FB_DMA | DRIVER_GEM | DRIVER_PRIME, > + .load = pl111_drm_load, > + .unload = pl111_drm_unload, > + .context_dtor = NULL, > + .preclose = pl111_drm_preclose, > + .lastclose = pl111_drm_lastclose, > + .suspend = pl111_drm_suspend, > + .resume = pl111_drm_resume, > + .get_vblank_counter = drm_vblank_count, > + .enable_vblank = pl111_enable_vblank, > + .disable_vblank = pl111_disable_vblank, > + .ioctls = NULL, > + .fops = &drm_fops, > + .name = DRIVER_NAME, > + .desc = DRIVER_DESC, > + .date = DRIVER_DATE, > + .major = DRIVER_MAJOR, > + .minor = DRIVER_MINOR, > + .patchlevel = DRIVER_PATCHLEVEL, > + .dumb_create = pl111_dumb_create, > + .dumb_destroy = pl111_dumb_destroy, > + .dumb_map_offset = pl111_dumb_map_offset, > + .gem_free_object = pl111_gem_free_object, > + .gem_vm_ops = &pl111_gem_vm_ops, > + .prime_handle_to_fd = &pl111_prime_handle_to_fd, > + .gem_prime_export = &pl111_gem_prime_export, > +}; > + > +int pl111_drm_init(struct platform_device *dev) > +{ > + int ret; > + pr_info("DRM %s\n", __func__); > + pr_info("PL111 DRM initialize, driver name: %s, version %d.%d\n", > + DRIVER_NAME, DRIVER_MAJOR, DRIVER_MINOR); > + driver.num_ioctls = 0; > + ret = 0; > + driver.kdriver.platform_device = dev; > + return drm_platform_init(&driver, dev); > + > +} > + > +void pl111_drm_exit(struct platform_device *dev) > +{ > + pr_info("DRM %s\n", __func__); > + drm_platform_exit(&driver, dev); > +} > diff --git a/drivers/gpu/drm/pl111/pl111_drm_dma_buf.c b/drivers/gpu/drm/pl111/pl111_drm_dma_buf.c > new file mode 100644 > index 0000000..6800100 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_dma_buf.c > @@ -0,0 +1,339 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_dma_buf.c > + * Implementation of the dma_buf functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > +static void obtain_kds_if_currently_displayed(struct drm_device *dev, > + struct drm_framebuffer *fb, > + struct dma_buf *dma_buf) > +{ > + unsigned long shared[1] = { 0 }; > + struct kds_resource *resource_list[1]; > + struct kds_resource_set *kds_res_set; > + struct drm_crtc *crtc; > + bool cb_has_called = false; > + int err; > + DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wake); > + > + /* > + * Not all pl111_gem_bo structures have a framebuffer attached - early > + * out in those cases > + */ > + if (fb == NULL) > + return; > + > + DRM_DEBUG_KMS("Obtaining initial KDS res for fb:%p bo:%p dma_buf:%p\n", > + fb, PL111_BO_FROM_FRAMEBUFFER(fb), dma_buf); > + > + resource_list[0] = get_dma_buf_kds_resource(dma_buf); > + get_dma_buf(dma_buf); > + > + /* > + * Can't use kds_waitall(), because kbase will be let through due to > + * locked ignore' > + */ > + err = kds_async_waitall(&kds_res_set, > + &priv.kds_obtain_current_cb, &wake, > + &cb_has_called, 1, shared, resource_list); > + BUG_ON(err); > + wait_event(wake, cb_has_called == true); > + > + list_for_each_entry(crtc, &fb->dev->mode_config.crtc_list, head) { > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + spin_lock(&pl111_crtc->current_displaying_lock); > + if (pl111_crtc->displaying_fb == fb) { > + DRM_DEBUG_KMS("Initial KDS resource for fb %p", fb); > + DRM_DEBUG_KMS(" is being displayed, keeping\n"); > + /* There shouldn't be a previous buffer to release */ > + BUG_ON(pl111_crtc->old_kds_res_set); > + > + if (kds_res_set == NULL) { > + err = kds_async_waitall(&kds_res_set, > + &priv.kds_obtain_current_cb, > + &wake, &cb_has_called, > + 1, shared, resource_list); > + BUG_ON(err); > + wait_event(wake, cb_has_called == true); > + } > + > + /* Current buffer will need releasing on next flip */ > + pl111_crtc->old_kds_res_set = kds_res_set; > + > + /* > + * Clear kds_res_set, so a new kds_res_set is allocated > + * for additional CRTCs > + */ > + kds_res_set = NULL; > + } > + spin_unlock(&pl111_crtc->current_displaying_lock); > + } > + > + /* kds_res_set will be NULL here if any CRTCs are displaying fb */ > + if (kds_res_set != NULL) { > + DRM_DEBUG_KMS("Initial KDS resource for fb %p", fb); > + DRM_DEBUG_KMS(" not being displayed, discarding\n"); > + /* They're not being displayed, release them */ > + kds_resource_set_release(&kds_res_set); > + } > + > + dma_buf_put(dma_buf); > +} > +#endif > + > +static int pl111_dma_buf_mmap(struct dma_buf *buffer, > + struct vm_area_struct *vma) > +{ > + struct drm_gem_object *obj = buffer->priv; > + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); > + DRM_DEBUG_KMS("DRM %s on dma_buf=%p\n", __func__, buffer); > + > + return pl111_bo_mmap(obj, bo, vma, buffer->size); > +} > + > +static void pl111_dma_buf_release(struct dma_buf *buf) > +{ > + /* > + * Need to release the dma_buf's reference on the gem object it was > + * exported from, and also clear the gem object's export_dma_buf > + * pointer to this dma_buf as it no longer exists > + */ > + struct drm_gem_object *obj = (struct drm_gem_object *)buf->priv; > + struct pl111_gem_bo *bo; > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + struct drm_crtc *crtc; > +#endif > + bo = PL111_BO_FROM_GEM(obj); > + > + DRM_DEBUG_KMS("Releasing dma_buf %p, drm_gem_obj=%p\n", buf, obj); > + > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + list_for_each_entry(crtc, &bo->gem_object.dev->mode_config.crtc_list, > + head) { > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + spin_lock(&pl111_crtc->current_displaying_lock); > + if (pl111_crtc->displaying_fb == bo->fb) { > + kds_resource_set_release(&pl111_crtc->old_kds_res_set); > + pl111_crtc->old_kds_res_set = NULL; > + } > + spin_unlock(&pl111_crtc->current_displaying_lock); > + } > +#endif > + mutex_lock(&priv.export_dma_buf_lock); > + > + obj->export_dma_buf = NULL; > + drm_gem_object_unreference_unlocked(obj); > + > + mutex_unlock(&priv.export_dma_buf_lock); > +} > + > +static int pl111_dma_buf_attach(struct dma_buf *buf, struct device *dev, > + struct dma_buf_attachment *attach) > +{ > + DRM_DEBUG_KMS("Attaching dma_buf %p to device %p attach=%p\n", buf, > + dev, attach); > + > + attach->priv = dev; > + > + return 0; > +} > + > +static void pl111_dma_buf_detach(struct dma_buf *buf, > + struct dma_buf_attachment *attach) > +{ > + DRM_DEBUG_KMS("Detaching dma_buf %p attach=%p\n", attach->dmabuf, > + attach); > +} > + > +/* Heavily from exynos_drm_dmabuf.c */ > +static struct sg_table *pl111_dma_buf_map_dma_buf(struct dma_buf_attachment > + *attach, > + enum dma_data_direction > + direction) > +{ > + struct drm_gem_object *obj = attach->dmabuf->priv; > + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); > + struct sg_table *sgt; > + int ret; > + int size, n_pages, nents; > + > + DRM_DEBUG_KMS("Mapping dma_buf %p from attach=%p\n", attach->dmabuf, > + attach); > + > + size = obj->size; > + n_pages = PAGE_ALIGN(size) >> PAGE_SHIFT; > + > + if (bo->type == PL111_BOT_DMA) { > + sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); > + if (!sgt) { > + DRM_ERROR("Failed to allocate sg_table\n"); > + return ERR_PTR(-ENOMEM); > + } > + > + ret = sg_alloc_table(sgt, 1, GFP_KERNEL); > + if (ret < 0) { > + DRM_ERROR("Failed to allocate page table\n"); > + return ERR_PTR(-ENOMEM); > + } > + sg_dma_len(sgt->sgl) = size; > + sg_set_page(sgt->sgl, > + pfn_to_page(PFN_DOWN > + (bo->backing_data.dma.fb_dev_addr)), > + size, 0); > + sg_dma_address(sgt->sgl) = bo->backing_data.dma.fb_dev_addr; > + > + } else { > + struct page **pages; > + > + pages = get_pages(obj); > + if (IS_ERR(pages)) { > + dev_err(obj->dev->dev, "could not get pages: %ld\n", > + PTR_ERR(pages)); > + return ERR_CAST(pages); > + } > + sgt = drm_prime_pages_to_sg(pages, n_pages); > + if (sgt == NULL) > + return ERR_PTR(-ENOMEM); > + nents = dma_map_sg(attach->dev, sgt->sgl, sgt->nents, > + direction); > + if (!nents) { > + DRM_ERROR("Failed to map dma buffer\n"); > + sg_free_table(sgt); > + kfree(sgt); > + return ERR_PTR(-ENOMEM); > + } > + if (nents < sgt->nents) { > + /* dma_map_sg() may merge sglist entries (e.g. if > + * they are contiguous) so nents may be less than > + * sgt->nents. If this happens we need to fix > + * sgt->nents as it is used by the caller */ > + DRM_DEBUG_KMS( > + "sg list entries merged during mapping\n"); > + sgt->nents = nents; > + } > + } > + return sgt; > +} > + > +static void pl111_dma_buf_unmap_dma_buf(struct dma_buf_attachment *attach, > + struct sg_table *sgt, > + enum dma_data_direction direction) > +{ > + struct drm_gem_object *obj = attach->dmabuf->priv; > + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); > + > + DRM_DEBUG_KMS("Unmapping dma_buf %p from attach=%p\n", attach->dmabuf, > + attach); > + > + if (PL111_BOT_SHM == bo->type) { > + /* use orig_nents here as nents may have been > + * modified in pl111_dma_buf_map_dma_buf() */ > + dma_unmap_sg(attach->dev, sgt->sgl, sgt->orig_nents, > + direction); > + } > + sg_free_table(sgt); > + kfree(sgt); > + sgt = NULL; > +} > + > +static void *pl111_dma_buf_kmap_atomic(struct dma_buf *dma_buf, > + unsigned long what) > +{ > + DRM_ERROR("pl111_dma_buf_kmap_atomic not implemented, dma_buf=%p\n", > + dma_buf); > + return NULL; > +} > + > +static void *pl111_dma_buf_kmap(struct dma_buf *dma_buf, unsigned long what) > +{ > + DRM_ERROR("pl111_dma_buf_kmap not implemented, dma_buf=%p\n", dma_buf); > + return NULL; > +} > + > +static struct dma_buf_ops pl111_dma_buf_ops = { > + .release = &pl111_dma_buf_release, > + .attach = &pl111_dma_buf_attach, > + .detach = &pl111_dma_buf_detach, > + .map_dma_buf = &pl111_dma_buf_map_dma_buf, > + .unmap_dma_buf = &pl111_dma_buf_unmap_dma_buf, > + .kmap_atomic = &pl111_dma_buf_kmap_atomic, > + .kmap = &pl111_dma_buf_kmap, > + .mmap = &pl111_dma_buf_mmap, > +}; > + > +struct dma_buf *pl111_gem_prime_export(struct drm_device *dev, > + struct drm_gem_object *obj, int flags) > +{ > + struct dma_buf *new_buf; > + struct pl111_gem_bo *bo; > + size_t size; > + > + DRM_DEBUG_KMS("DRM %s on dev=%p drm_gem_obj=%p\n", __func__, dev, obj); > + size = obj->size; > + > + new_buf = dma_buf_export(obj /*priv */ , &pl111_dma_buf_ops, size, > + flags | O_RDWR); > + bo = PL111_BO_FROM_GEM(new_buf->priv); > + > + /* > + * bo->gem_object.export_dma_buf not setup until after gem_prime_export > + * finishes > + */ > + > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + /* > + * Ensure that we hold the kds resource if it's the currently > + * displayed buffer. > + */ > + obtain_kds_if_currently_displayed(dev, bo->fb, new_buf); > +#endif > + > + DRM_DEBUG_KMS("Created dma_buf %p\n", new_buf); > + return new_buf; > +} > + > +int pl111_prime_handle_to_fd(struct drm_device *dev, struct drm_file *file_priv, > + uint32_t handle, uint32_t flags, int *prime_fd) > +{ > + int result; > + /* > + * This will re-use any existing exports, and calls > + * driver->gem_prime_export to do the first export when needed > + */ > + DRM_DEBUG_KMS("DRM %s on file_priv=%p, handle=0x%.8x\n", __func__, > + file_priv, handle); > + > + mutex_lock(&priv.export_dma_buf_lock); > + result = drm_gem_prime_handle_to_fd(dev, file_priv, handle, flags, > + prime_fd); > + mutex_unlock(&priv.export_dma_buf_lock); > + > + return result; > +} > diff --git a/drivers/gpu/drm/pl111/pl111_drm_encoder.c b/drivers/gpu/drm/pl111/pl111_drm_encoder.c > new file mode 100644 > index 0000000..028b366 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_encoder.c > @@ -0,0 +1,106 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_encoder.c > + * Implementation of the encoder functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > +bool pl111_encoder_helper_mode_fixup(struct drm_encoder *encoder, > + struct drm_display_mode *mode, > + struct drm_display_mode *adjusted_mode) > +{ > + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); > + return true; > +} > + > +void pl111_encoder_helper_prepare(struct drm_encoder *encoder) > +{ > + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); > +} > + > +void pl111_encoder_helper_commit(struct drm_encoder *encoder) > +{ > + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); > +} > + > +void pl111_encoder_helper_mode_set(struct drm_encoder *encoder, > + struct drm_display_mode *mode, > + struct drm_display_mode *adjusted_mode) > +{ > + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); > +} > + > +void pl111_encoder_helper_disable(struct drm_encoder *encoder) > +{ > + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); > +} > + > +void pl111_encoder_destroy(struct drm_encoder *encoder) > +{ > + struct pl111_drm_encoder *pl111_encoder = > + PL111_ENCODER_FROM_ENCODER(encoder); > + > + DRM_DEBUG_KMS("DRM %s on encoder=%p\n", __func__, encoder); > + > + drm_encoder_cleanup(encoder); > + kfree(pl111_encoder); > +} > + > +const struct drm_encoder_funcs encoder_funcs = { > + .destroy = pl111_encoder_destroy, > +}; > + > +const struct drm_encoder_helper_funcs encoder_helper_funcs = { > + .mode_fixup = pl111_encoder_helper_mode_fixup, > + .prepare = pl111_encoder_helper_prepare, > + .commit = pl111_encoder_helper_commit, > + .mode_set = pl111_encoder_helper_mode_set, > + .disable = pl111_encoder_helper_disable, > +}; > + > +struct pl111_drm_encoder *pl111_encoder_create(struct drm_device *dev, > + int possible_crtcs) > +{ > + struct pl111_drm_encoder *pl111_encoder; > + > + pl111_encoder = kzalloc(sizeof(struct pl111_drm_encoder), GFP_KERNEL); > + if (pl111_encoder == NULL) { > + pr_err("Failed to allocated pl111_drm_encoder\n"); > + return NULL; > + } > + > + drm_encoder_init(dev, &pl111_encoder->encoder, &encoder_funcs, > + DRM_MODE_ENCODER_DAC); > + > + drm_encoder_helper_add(&pl111_encoder->encoder, &encoder_helper_funcs); > + > + pl111_encoder->encoder.possible_crtcs = possible_crtcs; > + > + return pl111_encoder; > +} > + > diff --git a/drivers/gpu/drm/pl111/pl111_drm_fb.c b/drivers/gpu/drm/pl111/pl111_drm_fb.c > new file mode 100644 > index 0000000..fa37623 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_fb.c > @@ -0,0 +1,152 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_fb.c > + * Implementation of the framebuffer functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > +static void pl111_fb_destroy(struct drm_framebuffer *framebuffer) > +{ > + struct pl111_drm_framebuffer *pl111_fb; > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + struct drm_crtc *crtc; > +#endif > + DRM_DEBUG_KMS("Destroying framebuffer 0x%p...\n", framebuffer); > + > + pl111_fb = PL111_FB_FROM_FRAMEBUFFER(framebuffer); > + > + /* > + * Because flips are deferred, wait for all previous flips to complete > + */ > + wait_event(priv.wait_for_flips, > + atomic_read(&priv.nr_flips_in_flight) == 0); > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + /* > + * Release KDS resources if it's currently being displayed. Only occurs > + * when the last framebuffer is destroyed. > + */ > + list_for_each_entry(crtc, &framebuffer->dev->mode_config.crtc_list, > + head) { > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + spin_lock(&pl111_crtc->current_displaying_lock); > + if (pl111_crtc->displaying_fb == framebuffer) { > + /* Release the current buffers */ > + if (pl111_crtc->old_kds_res_set != NULL) { > + DRM_DEBUG_KMS("Releasing KDS resources for "); > + DRM_DEBUG_KMS("displayed 0x%p\n", framebuffer); > + kds_resource_set_release( > + &pl111_crtc->old_kds_res_set); > + } > + pl111_crtc->old_kds_res_set = NULL; > + } > + spin_unlock(&pl111_crtc->current_displaying_lock); > + } > +#endif > + drm_framebuffer_cleanup(framebuffer); > + > + if ((pl111_fb->bo != NULL) && (&pl111_fb->bo->gem_object != NULL)) > + drm_gem_object_unreference_unlocked(&pl111_fb->bo->gem_object); > + > + kfree(pl111_fb); > + > + DRM_DEBUG_KMS("Destroyed framebuffer 0x%p\n", framebuffer); > +} > + > +static int pl111_fb_create_handle(struct drm_framebuffer *fb, > + struct drm_file *file_priv, > + unsigned int *handle) > +{ > + struct pl111_gem_bo *bo = PL111_BO_FROM_FRAMEBUFFER(fb); > + DRM_DEBUG_KMS("DRM %s on fb=%p\n", __func__, fb); > + > + if (bo == NULL) > + return -EINVAL; > + > + return drm_gem_handle_create(file_priv, &bo->gem_object, handle); > +} > + > +const struct drm_framebuffer_funcs fb_funcs = { > + .destroy = pl111_fb_destroy, > + .create_handle = pl111_fb_create_handle, > +}; > + > +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, > + struct drm_file *file_priv, > + struct drm_mode_fb_cmd2 *mode_cmd) > +{ > + struct pl111_drm_framebuffer *pl111_fb = NULL; > + struct drm_framebuffer *fb = NULL; > + struct drm_gem_object *gem_obj; > + struct pl111_gem_bo *bo; > + > + pr_info("DRM %s\n", __func__); > + gem_obj = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); > + if (gem_obj == NULL) { > + DRM_ERROR("Could not get gem obj from handle to create fb\n"); > + goto out; > + } > + > + bo = PL111_BO_FROM_GEM(gem_obj); > + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ > + BUG_ON(bo->type != PL111_BOT_DMA); umm, no BUG_ON() is not really a good way to validate userspace input.. if (bo->type != ...) return ERR_PTR(-EINVAL); > + > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > + case 'Y': > + case 'U': > + case 'V': > + case 'N': > + case 'T': perhaps we should instead add a drm_format_is_yuv().. or you could (ab)use drm_fb_get_bpp_depth().. > + DRM_ERROR("YUV formats not supported\n"); > + goto out; > + } > + > + pl111_fb = kzalloc(sizeof(struct pl111_drm_framebuffer), GFP_KERNEL); > + if (pl111_fb == NULL) { > + DRM_ERROR("Could not allocate pl111_drm_framebuffer\n"); > + goto out; > + } > + fb = &pl111_fb->fb; > + > + if (drm_framebuffer_init(dev, fb, &fb_funcs)) { > + DRM_ERROR("drm_framebuffer_init failed\n"); > + kfree(fb); > + fb = NULL; > + goto out; > + } > + > + drm_helper_mode_fill_fb_struct(fb, mode_cmd); > + > + PL111_BO_TO_FRAMEBUFFER(fb, bo); > + > + DRM_DEBUG_KMS("Created fb 0x%p for gem_obj 0x%p physaddr=0x%.8x\n", > + fb, gem_obj, bo->backing_data.dma.fb_dev_addr); > + > +out: > + return fb; > +} > diff --git a/drivers/gpu/drm/pl111/pl111_drm_funcs.h b/drivers/gpu/drm/pl111/pl111_drm_funcs.h > new file mode 100644 > index 0000000..de8a826 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_funcs.h > @@ -0,0 +1,127 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_funcs.h > + * Function prototypes for PL111 DRM > + */ > + > +#ifndef PL111_DRM_FUNCS_H_ > +#define PL111_DRM_FUNCS_H_ > + > +/* Platform Initialisation */ > +int pl111_drm_init(struct platform_device *dev); > +void pl111_drm_exit(struct platform_device *dev); > + > +/* KDS Callbacks */ > +void show_framebuffer_on_crtc_cb(void *cb1, void *cb2); > +void release_kds_resource_and_display(struct pl111_drm_flip_resource *flip_res); > + > +/* CRTC Functions */ > +struct pl111_drm_crtc *pl111_crtc_create(struct drm_device *dev); > +struct pl111_drm_crtc *pl111_crtc_dummy_create(struct drm_device *dev); > +void pl111_crtc_destroy(struct drm_crtc *crtc); > + > +bool pl111_crtc_is_fb_currently_displayed(struct drm_device *dev, > + struct drm_framebuffer *fb); > + > +int show_framebuffer_on_crtc(struct drm_crtc *crtc, > + struct drm_framebuffer *fb, bool page_flip, > + struct drm_pending_vblank_event *event); > + > +/* Common IRQ handler */ > +void pl111_common_irq(struct pl111_drm_crtc *pl111_crtc); > + > +int pl111_cursor_plane_init(struct drm_device *dev, > + struct pl111_drm_cursor_plane *cursor, > + unsigned long possible_crtcs); > +void pl111_drm_cursor_plane_destroy(struct drm_plane *plane); > + > +/* Connector Functions */ > +struct pl111_drm_connector *pl111_connector_create(struct drm_device *dev); > +void pl111_connector_destroy(struct drm_connector *connector); > +struct pl111_drm_connector *pl111_connector_dummy_create(struct drm_device > + *dev); > + > +/* Encoder Functions */ > +struct pl111_drm_encoder *pl111_encoder_create(struct drm_device *dev, > + int possible_crtcs); > +struct pl111_drm_encoder *pl111_encoder_dummy_create(struct drm_device *dev, > + int possible_crtcs); > +void pl111_encoder_destroy(struct drm_encoder *encoder); > + > +/* Frame Buffer Functions */ > +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, > + struct drm_file *file_priv, > + struct drm_mode_fb_cmd2 *mode_cmd); > + > +/* VMA Functions */ > +int pl111_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf); > +int pl111_gem_mmap(struct file *file_priv, struct vm_area_struct *vma); > +struct page **get_pages(struct drm_gem_object *obj); > +void put_pages(struct drm_gem_object *obj, struct page **pages); > + > +/* Suspend Functions */ > +int pl111_drm_resume(struct drm_device *dev); > +int pl111_drm_suspend(struct drm_device *dev, pm_message_t state); > + > +/* GEM Functions */ > +int pl111_dumb_create(struct drm_file *file_priv, > + struct drm_device *dev, > + struct drm_mode_create_dumb *args); > +int pl111_dumb_destroy(struct drm_file *file_priv, > + struct drm_device *dev, uint32_t handle); > +int pl111_dumb_map_offset(struct drm_file *file_priv, > + struct drm_device *dev, uint32_t handle, > + uint64_t *offset); > +void pl111_gem_free_object(struct drm_gem_object *obj); > + > +int pl111_bo_mmap(struct drm_gem_object *obj, struct pl111_gem_bo *bo, > + struct vm_area_struct *vma, size_t size); > + > +/* DMA BUF Functions */ > +int pl111_prime_handle_to_fd(struct drm_device *dev, struct drm_file *file_priv, > + uint32_t handle, uint32_t flags, int *prime_fd); > +struct dma_buf *pl111_gem_prime_export(struct drm_device *dev, > + struct drm_gem_object *obj, int flags); > + > +/* Pl111 Functions */ > +void show_framebuffer_on_crtc_cb_internal(struct pl111_drm_flip_resource > + *flip_res, struct drm_framebuffer *fb); > +int clcd_disable(struct drm_crtc *crtc); > +void do_flip_to_res(struct pl111_drm_flip_resource *flip_res); > +int pl111_amba_probe(struct amba_device *dev, const struct amba_id *id); > +int pl111_amba_remove(struct amba_device *dev); > + > +int pl111_device_init(struct drm_device *dev); > +void pl111_device_fini(struct drm_device *dev); > + > +void pl111_set_cursor_size(enum pl111_cursor_size size); > +void pl111_set_cursor_sync(enum pl111_cursor_sync sync); > +void pl111_set_cursor_index(u32 cursor); > +void pl111_set_cursor_enable(bool enable); > +void pl111_set_cursor_position(u32 x, u32 y); > +void pl111_set_cursor_clipping(u32 x, u32 y); > +void pl111_set_cursor_palette(u32 color0, u32 color1); > +void pl111_cursor_enable(void); > +void pl111_cursor_disable(void); > +void pl111_set_cursor_image(u32 *data); > + > +void pl111_convert_drm_mode_to_timing(struct drm_display_mode *mode, > + struct clcd_regs *timing); > +void pl111_convert_timing_to_drm_mode(struct clcd_regs *timing, > + struct drm_display_mode *mode); > +#endif /* PL111_DRM_FUNCS_H_ */ > diff --git a/drivers/gpu/drm/pl111/pl111_drm_gem.c b/drivers/gpu/drm/pl111/pl111_drm_gem.c > new file mode 100644 > index 0000000..01989ec > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_gem.c > @@ -0,0 +1,287 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_gem.c > + * Implementation of the GEM functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > +#include "pl111_drm.h" > + > +void pl111_gem_free_object(struct drm_gem_object *obj) > +{ > + struct pl111_gem_bo *bo; > + struct drm_device *dev = obj->dev; > + DRM_DEBUG_KMS("DRM %s on drm_gem_object=%p\n", __func__, obj); > + > + bo = PL111_BO_FROM_GEM(obj); > + > + if (obj->map_list.map != NULL) > + drm_gem_free_mmap_offset(obj); > + > + if (bo->type == PL111_BOT_DMA) { > + dma_free_writecombine(dev->dev, obj->size, > + bo->backing_data.dma.fb_cpu_addr, > + bo->backing_data.dma.fb_dev_addr); > + } else if (bo->backing_data.shm.pages != NULL) { > + put_pages(obj, bo->backing_data.shm.pages); > + } > + drm_gem_object_release(obj); > + > + kfree(bo); > + > + DRM_DEBUG_KMS("Destroyed dumb_bo handle 0x%p\n", bo); > +} > + > +int pl111_dumb_create(struct drm_file *file_priv, > + struct drm_device *dev, struct drm_mode_create_dumb *args) > +{ > + int ret = 0; > + struct pl111_gem_bo *bo = NULL; > + uint32_t bytes_pp; > + bool create_contig_buffer; > + > + bo = kzalloc(sizeof(*bo), GFP_KERNEL); > + if (bo == NULL) { > + ret = -ENOMEM; > + goto finish; > + } > + > + /* Round bpp up, to allow for case where bpp<8 */ > + bytes_pp = args->bpp >> 3; > + if (args->bpp & ((1 << 3) - 1)) > + bytes_pp++; > + > + args->pitch = ALIGN(args->width * bytes_pp, 64); > + args->size = PAGE_ALIGN(args->pitch * args->height); > + > + DRM_DEBUG_KMS("dumb_create w=%d h=%d p=%d bpp=%d b=%d s=%llu f=0x%x\n", > + args->width, args->height, args->pitch, args->bpp, > + bytes_pp, args->size, args->flags); > + > + create_contig_buffer = args->flags & PL111_BO_SCANOUT; > +#ifndef ARCH_HAS_SG_CHAIN > + /* > + * If the ARCH can't chain we can't have non-contiguous allocs larger > + * than a single sg can hold. > + * In this case we fall back to using contiguous memory > + */ > + if (!create_contig_buffer) { > + long unsigned int n_pages = > + PAGE_ALIGN(args->size) >> PAGE_SHIFT; > + if (n_pages > SG_MAX_SINGLE_ALLOC) { > + create_contig_buffer = true; > + /* > + * Non-contiguous allocation request changed to > + * contigous > + */ > + DRM_INFO("non-contig alloc to contig %lu > %lu pages.", > + n_pages, SG_MAX_SINGLE_ALLOC); > + } > + } > +#endif > + if (!create_contig_buffer) { > + /* not scanout compatible - use non-contiguous buffer */ > + bo->type = PL111_BOT_SHM; > + ret = drm_gem_object_init(dev, &bo->gem_object, args->size); > + if (ret != 0) { > + DRM_ERROR("DRM could not init SHM backed GEM obj\n"); > + kfree(bo); > + ret = -ENOMEM; > + goto finish; > + } > + DRM_DEBUG_KMS("Num bytes: %d\n", bo->gem_object.size); > + } else { > + /* scanout compatible - use contiguous buffer */ > + bo->type = PL111_BOT_DMA; > + > + bo->backing_data.dma.fb_cpu_addr = > + dma_alloc_writecombine(dev->dev, args->size, > + &bo->backing_data.dma.fb_dev_addr, > + GFP_KERNEL); > + if (bo->backing_data.dma.fb_cpu_addr == NULL) { > + DRM_ERROR("dma_alloc_writecombine failed\n"); > + kfree(bo); > + ret = -ENOMEM; > + goto finish; > + } > + > + ret = drm_gem_private_object_init(dev, &bo->gem_object, > + args->size); > + if (ret != 0) { > + DRM_ERROR("DRM could not initialise GEM object\n"); > + dma_free_writecombine(dev->dev, args->size, > + bo->backing_data.dma.fb_cpu_addr, > + bo->backing_data.dma.fb_dev_addr); > + kfree(bo); > + ret = -ENOMEM; > + goto finish; > + } > + } > + > + DRM_DEBUG_KMS("dumb_create: 0x%p with w=%d, h=%d, p=%d, bpp=%d,", > + bo, args->width, args->height, args->pitch, args->bpp); > + DRM_DEBUG_KMS("bytes_pp=%d, s=%llu, flags=0x%x, %s 0x%.8lx, type=%d\n", > + bytes_pp, args->size, args->flags, > + (bo->type == PL111_BOT_DMA) ? "physaddr" : "shared page array", > + (bo->type == PL111_BOT_DMA) > + ? (unsigned long)bo->backing_data.dma.fb_dev_addr > + : (unsigned long)bo->backing_data.shm.pages, bo->type); > + > + /* omap_gem_new_handle() */ > + ret = drm_gem_handle_create(file_priv, &bo->gem_object, &args->handle); > + if (ret != 0) { > + DRM_ERROR("DRM failed to create GEM handle\n"); > + drm_gem_object_release(&bo->gem_object); > + if (bo->type == PL111_BOT_DMA) { > + dma_free_writecombine(dev->dev, args->size, > + bo->backing_data.dma.fb_cpu_addr, > + bo->backing_data.dma.fb_dev_addr); > + } > + kfree(bo); > + return ret; > + } > + /* drop reference from allocate - handle holds it now */ > + drm_gem_object_unreference_unlocked(&bo->gem_object); > + DRM_DEBUG_KMS("dumb_create completed: fp=%p h=0x%.8x gem_object=%p", > + file_priv, args->handle, &bo->gem_object); > + > +finish: > + return ret; > +} > + > +int pl111_dumb_destroy(struct drm_file *file_priv, struct drm_device *dev, > + uint32_t handle) > +{ > + DRM_DEBUG_KMS("DRM %s on file_priv=%p handle=0x%.8x\n", __func__, > + file_priv, handle); > + return drm_gem_handle_delete(file_priv, handle); > +} > + > +int pl111_dumb_map_offset(struct drm_file *file_priv, > + struct drm_device *dev, uint32_t handle, > + uint64_t *offset) > +{ > + /* omap_gem_dump_map_offset */ > + struct drm_gem_object *obj; > + int ret = 0; > + DRM_DEBUG_KMS("DRM %s on file_priv=%p handle=0x%.8x\n", __func__, > + file_priv, handle); > + > + /* GEM does all our handle to object mapping */ > + obj = drm_gem_object_lookup(dev, file_priv, handle); > + if (obj == NULL) { > + ret = -ENOENT; > + goto fail; > + } > + > + if (obj->map_list.map == NULL) { > + ret = drm_gem_create_mmap_offset(obj); > + if (ret != 0) > + goto fail; > + } > + > + *offset = (uint64_t) obj->map_list.hash.key << PAGE_SHIFT; > + > + drm_gem_object_unreference_unlocked(obj); > +fail: > + return ret; > +} > + > +/* Based on drm_vm.c and omapdrm driver */ > +int pl111_bo_mmap(struct drm_gem_object *obj, struct pl111_gem_bo *bo, > + struct vm_area_struct *vma, size_t size) > +{ > + int ret = 0; > + > + DRM_DEBUG_KMS("DRM %s on drm_gem_object=%p, pl111_gem_bo=%p\n", > + __func__, obj, bo); > + > + if (obj->size < vma->vm_end - vma->vm_start) { > + ret = -EINVAL; > + goto done; > + } > + > + if (bo->type == PL111_BOT_DMA) { > + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; > + vma->vm_page_prot = > + pgprot_noncached(vm_get_page_prot(vma->vm_flags)); > + > + if (remap_pfn_range(vma, vma->vm_start, > + (bo->backing_data.dma.fb_dev_addr) >> PAGE_SHIFT, > + vma->vm_end - vma->vm_start, vma->vm_page_prot)) > + ret = -EAGAIN; > + } else { > + unsigned long uaddr = vma->vm_start; > + long usize = obj->size; > + int i = 0; > + struct page **pages; > + vma->vm_flags &= ~VM_PFNMAP; > + vma->vm_flags |= VM_MIXEDMAP; > + vma->vm_page_prot = > + pgprot_noncached(vm_get_page_prot(vma->vm_flags)); > + pages = get_pages(obj); > + if (IS_ERR(pages)) { > + dev_err(obj->dev->dev, "could not get pages: %ld\n", > + PTR_ERR(pages)); > + return PTR_ERR(pages); > + } > + do { > + ret = vm_insert_page(vma, uaddr, pages[i]); > + if (ret != 0) { > + DRM_ERROR("failed to remap user space.\n"); > + return ret; > + } > + uaddr += PAGE_SIZE; > + usize -= PAGE_SIZE; > + i++; > + } while (usize > 0); > + } > + > +done: > + return ret; > +} > + > +int pl111_gem_mmap(struct file *file_priv, struct vm_area_struct *vma) > +{ > + int ret; > + struct drm_file *priv = file_priv->private_data; > + struct drm_device *dev = priv->minor->dev; > + struct drm_gem_mm *mm = dev->mm_private; > + struct drm_local_map *map = NULL; > + struct drm_hash_item *hash; > + struct drm_gem_object *obj; > + struct pl111_gem_bo *bo; > + DRM_DEBUG_KMS("DRM %s\n", __func__); > + > + ret = drm_gem_mmap(file_priv, vma); > + > + drm_ht_find_item(&mm->offset_hash, vma->vm_pgoff, &hash); > + map = drm_hash_entry(hash, struct drm_map_list, hash)->map; > + obj = map->handle; > + bo = PL111_BO_FROM_GEM(obj); > + > + DRM_DEBUG_KMS("DRM %s on pl111_gem_bo %p\n", __func__, bo); > + > + return pl111_bo_mmap(obj, bo, vma, vma->vm_end - vma->vm_start); > +} > diff --git a/drivers/gpu/drm/pl111/pl111_drm_pl111.c b/drivers/gpu/drm/pl111/pl111_drm_pl111.c > new file mode 100644 > index 0000000..daaa5ba > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_pl111.c > @@ -0,0 +1,513 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_pl111.c > + * PL111 specific functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > +#include "pl111_clcd_ext.h" > + > +#include "pl111_drm.h" > + > +/* This can't be called from IRQ context, due to clk_get() and board->enable */ > +static int clcd_enable(struct drm_framebuffer *fb) > +{ > + __u32 cntl; > + struct clcd_board *board; > + > + pr_info("DRM %s\n", __func__); > + > + clk_prepare_enable(priv.clk); > + > + /* Enable and Power Up */ > + cntl = CNTL_LCDEN | CNTL_LCDTFT | CNTL_LCDPWR | CNTL_LCDVCOMP(1); > + DRM_DEBUG_KMS("fb->bits_per_pixel = %d\n", fb->bits_per_pixel); > + if (fb->bits_per_pixel == 16) > + cntl |= CNTL_LCDBPP16_565; > + else if (fb->bits_per_pixel == 32 && fb->depth == 24) > + cntl |= CNTL_LCDBPP24; > + else > + BUG_ON(1); > + > + cntl |= CNTL_BGR; > + > + writel(cntl, priv.regs + CLCD_PL111_CNTL); > + > + board = priv.amba_dev->dev.platform_data; > + > + if (board->enable) > + board->enable(NULL); > + > + /* Enable Interrupts */ > + writel(CLCD_IRQ_NEXTBASE_UPDATE, priv.regs + CLCD_PL111_IENB); > + > + return 0; > +} > + > +int clcd_disable(struct drm_crtc *crtc) > +{ > + struct clcd_board *board; > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(crtc); > + > + pr_info("DRM %s\n", __func__); > + > + /* Disable Interrupts */ > + writel(0x00000000, priv.regs + CLCD_PL111_IENB); > + > + board = priv.amba_dev->dev.platform_data; > + > + if (board->disable) > + board->disable(NULL); > + > + /* Disable and Power Down */ > + writel(0, priv.regs + CLCD_PL111_CNTL); > + > + /* Disable clock */ > + clk_disable_unprepare(priv.clk); > + > + pl111_crtc->last_bpp = 0; > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > + /* Release the previous buffers */ > + if (pl111_crtc->old_kds_res_set != NULL) > + kds_resource_set_release(&pl111_crtc->old_kds_res_set); > + > + pl111_crtc->old_kds_res_set = NULL; > +#endif > + return 0; > +} > + > +void do_flip_to_res(struct pl111_drm_flip_resource *flip_res) > +{ > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); > + struct drm_framebuffer *fb; > + struct pl111_gem_bo *bo; > + > + fb = flip_res->fb; > + bo = PL111_BO_FROM_FRAMEBUFFER(fb); > + > + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ > + BUG_ON(bo->type != PL111_BOT_DMA); > + > + /* > + * Note the buffer for releasing after IRQ, and don't allow any more > + * updates until then. > + * > + * This clcd controller latches the new address on next vsync. Address > + * latching is indicated by CLCD_IRQ_NEXTBASE_UPDATE, and so we must > + * wait for that before releasing the previous buffer's kds > + * resources. Otherwise, we'll allow writers to write to the old buffer > + * whilst it is still being displayed > + */ > + pl111_crtc->current_update_res = flip_res; > + > + DRM_DEBUG_KMS("Displaying fb 0x%p, dumb_bo 0x%p, physaddr %.8x\n", > + fb, bo, bo->backing_data.dma.fb_dev_addr); > + > + if (drm_vblank_get(pl111_crtc->crtc.dev, pl111_crtc->crtc_index) < 0) > + DRM_ERROR("Could not get vblank reference for crtc %d\n", > + pl111_crtc->crtc_index); > + > + /* Set the scanout buffer */ > + writel(bo->backing_data.dma.fb_dev_addr, priv.regs + CLCD_UBAS); > + writel(bo->backing_data.dma.fb_dev_addr + > + ((fb->height - 1) * fb->pitches[0]), priv.regs + CLCD_LBAS); > +} > + > +void > +show_framebuffer_on_crtc_cb_internal(struct pl111_drm_flip_resource *flip_res, > + struct drm_framebuffer *fb) > +{ > + unsigned long irq_flags; > + struct pl111_drm_crtc *pl111_crtc = to_pl111_crtc(flip_res->crtc); > + > + spin_lock_irqsave(&pl111_crtc->base_update_lock, irq_flags); > + if (list_empty(&pl111_crtc->update_queue) && > + !pl111_crtc->current_update_res) { > + do_flip_to_res(flip_res); > + > + /* > + * Catch a potential race with the IRQ handler: > + * - We may've updated the Base Address just after it was > + * latched, but before the OS ran our IRQ handler > + * - Hence, the CLCD controller is now scanning out the > + * previous buffer, not our new buffer. > + * - However, as soon as the IRQ handler runs, it'll inspect > + * pl111_crtc->current_update_res, and use that to cause the > + * previous buffer to be released on the workqueue (even > + * though the CLCD controller is still scanning it out) > + * Instead, we must wait until the *next* IRQ to allow > + * releasing of the previous buffer: > + */ > + if (readl(priv.regs + CLCD_PL111_MIS) & > + CLCD_IRQ_NEXTBASE_UPDATE) { > + DRM_DEBUG_KMS("Redoing flip to fb %p on next IRQ\n", > + fb); > + pl111_crtc->current_update_res = NULL; > + list_add_tail(&flip_res->link, > + &pl111_crtc->update_queue); > + } > + } else { > + /* > + * Enqueue the update to occur on a future IRQ > + * This only happens on triple-or-greater buffering > + */ > + DRM_DEBUG_KMS("Deferring 3+ buffered flip to fb %p to IRQ\n", > + fb); > + list_add_tail(&flip_res->link, &pl111_crtc->update_queue); > + } > + > + spin_unlock_irqrestore(&pl111_crtc->base_update_lock, irq_flags); > + > + if (!flip_res->page_flip && (pl111_crtc->last_bpp == 0 || > + pl111_crtc->last_bpp != fb->bits_per_pixel || > + !drm_mode_equal(pl111_crtc->new_mode, > + pl111_crtc->current_mode))) { > + struct clcd_regs timing; > + > + pl111_convert_drm_mode_to_timing(pl111_crtc->new_mode, &timing); > + > + DRM_DEBUG_KMS("Set timing: %08X:%08X:%08X:%08X clk=%ldHz\n", > + timing.tim0, timing.tim1, timing.tim2, > + timing.tim3, timing.pixclock); > + > + /* This is the actual mode setting part */ > + clk_set_rate(priv.clk, timing.pixclock); > + > + writel(timing.tim0, priv.regs + CLCD_TIM0); > + writel(timing.tim1, priv.regs + CLCD_TIM1); > + writel(timing.tim2, priv.regs + CLCD_TIM2); > + writel(timing.tim3, priv.regs + CLCD_TIM3); > + > + clcd_enable(fb); > + pl111_crtc->last_bpp = fb->bits_per_pixel; > + } > + > + if (!flip_res->page_flip) { > + drm_mode_destroy(flip_res->crtc->dev, pl111_crtc->current_mode); > + pl111_crtc->current_mode = pl111_crtc->new_mode; > + pl111_crtc->new_mode = NULL; > + } > + > + BUG_ON(pl111_crtc->new_mode); > + BUG_ON(!pl111_crtc->current_mode); > + > + /* > + * If IRQs weren't enabled before, they are now. This will eventually > + * cause flip_res to be released via vsync_worker(), which updates > + * every time the Base Address is latched (i.e. every frame, regardless > + * of whether we update the base address or not) > + */ > +} > + > +irqreturn_t pl111_irq(int irq, void *data) > +{ > + u32 irq_stat; > + struct pl111_drm_crtc *pl111_crtc = priv.pl111_crtc; > + > + irq_stat = readl(priv.regs + CLCD_PL111_MIS); > + > + if (!irq_stat) > + return IRQ_NONE; > + > + if (irq_stat & CLCD_IRQ_NEXTBASE_UPDATE) { > + if (pl111_crtc->current_update_res || > + !list_empty(&pl111_crtc->update_queue)) > + DRM_DEBUG_KMS("DRM irq %x after base update\n", > + irq_stat); > + > + /* > + * We don't need to lock here as we don't do any flip-specific > + * processing in this function. All these, including locks, is > + * done in common_irq handler > + */ > + pl111_common_irq(pl111_crtc); > + } > + > + /* Clear the interrupt once done */ > + writel(irq_stat, priv.regs + CLCD_PL111_ICR); > + > + return IRQ_HANDLED; > +} > + > +int pl111_device_init(struct drm_device *dev) > +{ > + struct pl111_drm_dev_private *priv = dev->dev_private; > + int ret; > + > + if (priv == NULL || priv->amba_dev == NULL) > + return -EINVAL; > + > + /* set up MMIO for register access */ > + priv->mmio_start = priv->amba_dev->res.start; > + priv->mmio_len = resource_size(&priv->amba_dev->res); > + > + DRM_DEBUG_KMS("mmio_start=%lu, mmio_len=%u\n", priv->mmio_start, > + priv->mmio_len); > + > + priv->regs = ioremap(priv->mmio_start, priv->mmio_len); > + if (priv->regs == NULL) { > + pr_err("%s failed mmio\n", __func__); > + return -EINVAL; > + } > + > + /* turn off interrupts */ > + writel(0, priv->regs + CLCD_PL111_IENB); > + > + ret = request_irq(priv->amba_dev->irq[0], pl111_irq, 0, > + "pl111_irq_handler", NULL); > + if (ret != 0) { > + pr_err("%s failed %d\n", __func__, ret); > + goto out_mmio; > + } > + > + goto finish; > + > +out_mmio: > + iounmap(priv->regs); > +finish: > + DRM_DEBUG_KMS("pl111_device_init returned %d\n", ret); > + return ret; > +} > + > +void pl111_device_fini(struct drm_device *dev) > +{ > + struct pl111_drm_dev_private *priv = dev->dev_private; > + u32 cntl; > + > + if (priv == NULL || priv->regs == NULL) > + return; > + > + free_irq(priv->amba_dev->irq[0], NULL); > + > + cntl = readl(priv->regs + CLCD_PL111_CNTL); > + > + cntl &= ~CNTL_LCDEN; > + writel(cntl, priv->regs + CLCD_PL111_CNTL); > + > + cntl &= ~CNTL_LCDPWR; > + writel(cntl, priv->regs + CLCD_PL111_CNTL); > + > + iounmap(priv->regs); > +} > + > +int pl111_amba_probe(struct amba_device *dev, const struct amba_id *id) > +{ > + struct clcd_board *board = dev->dev.platform_data; > + int ret; > + pr_info("DRM %s\n", __func__); > + > + if (board == NULL) > + return -EINVAL; > + > + ret = amba_request_regions(dev, NULL); > + if (ret != 0) { > + DRM_ERROR("CLCD: unable to reserve regs region\n"); > + goto out; > + } > + > + priv.amba_dev = dev; > + > + priv.clk = clk_get(&priv.amba_dev->dev, NULL); > + if (IS_ERR(priv.clk)) { > + DRM_ERROR("CLCD: unable to get clk.\n"); > + ret = PTR_ERR(priv.clk); > + goto clk_err; > + } > + > + return 0; > + > +clk_err: > + amba_release_regions(dev); > +out: > + return ret; > +} > + > +int pl111_amba_remove(struct amba_device *dev) > +{ > + DRM_DEBUG_KMS("DRM %s\n", __func__); > + > + clk_put(priv.clk); > + > + amba_release_regions(dev); > + > + priv.amba_dev = NULL; > + > + return 0; > +} > + > +void pl111_set_cursor_size(enum pl111_cursor_size size) > +{ > + u32 reg_data = readl(priv.regs + CLCD_CRSR_CONFIG); > + > + if (size == CURSOR_64X64) > + reg_data |= CRSR_CONFIG_CRSR_SIZE; > + else > + reg_data &= ~CRSR_CONFIG_CRSR_SIZE; > + > + writel(reg_data, priv.regs + CLCD_CRSR_CONFIG); > +} > + > +void pl111_set_cursor_sync(enum pl111_cursor_sync sync) > +{ > + u32 reg_data = readl(priv.regs + CLCD_CRSR_CONFIG); > + > + if (sync == CURSOR_SYNC_VSYNC) > + reg_data |= CRSR_CONFIG_CRSR_FRAME_SYNC; > + else > + reg_data &= ~CRSR_CONFIG_CRSR_FRAME_SYNC; > + > + writel(reg_data, priv.regs + CLCD_CRSR_CONFIG); > +} > + > +void pl111_set_cursor(u32 cursor) > +{ > + u32 reg_data = readl(priv.regs + CLCD_CRSR_CTRL); > + > + reg_data &= ~(CRSR_CTRL_CRSR_MAX << CRSR_CTRL_CRSR_NUM_SHIFT); > + reg_data |= (cursor & CRSR_CTRL_CRSR_MAX) << CRSR_CTRL_CRSR_NUM_SHIFT; > + > + writel(reg_data, priv.regs + CLCD_CRSR_CTRL); > +} > + > +void pl111_set_cursor_enable(bool enable) > +{ > + u32 reg_data = readl(priv.regs + CLCD_CRSR_CTRL); > + > + if (enable) > + reg_data |= CRSR_CTRL_CRSR_ON; > + else > + reg_data &= ~CRSR_CTRL_CRSR_ON; > + > + writel(reg_data, priv.regs + CLCD_CRSR_CTRL); > +} > + > +void pl111_set_cursor_position(u32 x, u32 y) > +{ > + u32 reg_data = (x & CRSR_XY_MASK) | > + ((y & CRSR_XY_MASK) << CRSR_XY_Y_SHIFT); > + /* could optimise out if same values */ > + writel(reg_data, priv.regs + CLCD_CRSR_XY); > +} > + > +void pl111_set_cursor_clipping(u32 x, u32 y) > +{ > + u32 reg_data = (x & CRSR_CLIP_MASK) | > + ((y & CRSR_CLIP_MASK) << CRSR_CLIP_Y_SHIFT); > + /* could optimise out if same values */ > + writel(reg_data, priv.regs + CLCD_CRSR_CLIP); > +} > + > +void pl111_set_cursor_palette(u32 color0, u32 color1) > +{ > + writel(color0 & CRSR_PALETTE_MASK, priv.regs + CLCD_CRSR_PALETTE_0); > + writel(color1 & CRSR_PALETTE_MASK, priv.regs + CLCD_CRSR_PALETTE_1); > +} > + > +void pl111_cursor_enable(void) > +{ > + pl111_set_cursor_sync(CURSOR_SYNC_VSYNC); > + pl111_set_cursor_size(CURSOR_64X64); > + pl111_set_cursor_palette(0x0, 0x00ffffff); > + pl111_set_cursor_enable(true); > +} > + > +void pl111_cursor_disable(void) > +{ > + pl111_set_cursor_enable(false); > +} > + > +void pl111_set_cursor_image(u32 *data) > +{ > + u32 *cursor_ram = priv.regs + CLCD_CRSR_IMAGE; > + int i; > + > + for (i = 0; i < CLCD_CRSR_IMAGE_MAX_WORDS; i++, data++, cursor_ram++) > + writel(*data, cursor_ram); > +} > + > +void pl111_convert_drm_mode_to_timing(struct drm_display_mode *mode, > + struct clcd_regs *timing) > +{ > + unsigned int ppl, hsw, hfp, hbp; > + unsigned int lpp, vsw, vfp, vbp; > + unsigned int cpl; > + > + memset(timing, 0, sizeof(struct clcd_regs)); > + > + ppl = (mode->hdisplay / 16) - 1; > + hsw = mode->hsync_end - mode->hsync_start - 1; > + hfp = mode->hsync_start - mode->hdisplay - 1; > + hbp = mode->htotal - mode->hsync_end - 1; > + > + lpp = mode->vdisplay - 1; > + vsw = mode->vsync_end - mode->vsync_start - 1; > + vfp = mode->vsync_start - mode->vdisplay; > + vbp = mode->vtotal - mode->vsync_end; > + > + cpl = mode->hdisplay - 1; > + > + timing->tim0 = (ppl << 2) | (hsw << 8) | (hfp << 16) | (hbp << 24); > + timing->tim1 = lpp | (vsw << 10) | (vfp << 16) | (vbp << 24); > + timing->tim2 = TIM2_IVS | TIM2_IHS | TIM2_IPC | TIM2_BCD | (cpl << 16); > + timing->tim3 = 0; > + > + timing->pixclock = mode->clock * 1000; > +} > + > +void pl111_convert_timing_to_drm_mode(struct clcd_regs *timing, > + struct drm_display_mode *mode) > +{ > + unsigned int ppl, hsw, hfp, hbp; > + unsigned int lpp, vsw, vfp, vbp; > + > + ppl = (timing->tim0 >> 2) & 0x3f; > + hsw = (timing->tim0 >> 8) & 0xff; > + hfp = (timing->tim0 >> 16) & 0xff; > + hbp = (timing->tim0 >> 24) & 0xff; > + > + lpp = timing->tim1 & 0x3ff; > + vsw = (timing->tim1 >> 10) & 0x3f; > + vfp = (timing->tim1 >> 16) & 0xff; > + vbp = (timing->tim1 >> 24) & 0xff; > + > + mode->hdisplay = (ppl + 1) * 16; > + mode->hsync_start = ((ppl + 1) * 16) + hfp + 1; > + mode->hsync_end = ((ppl + 1) * 16) + hfp + hsw + 2; > + mode->htotal = ((ppl + 1) * 16) + hfp + hsw + hbp + 3; > + mode->hskew = 0; > + > + mode->vdisplay = lpp + 1; > + mode->vsync_start = lpp + vfp + 1; > + mode->vsync_end = lpp + vfp + vsw + 2; > + mode->vtotal = lpp + vfp + vsw + vbp + 2; > + > + mode->flags = 0; > + > + mode->width_mm = 0; > + mode->height_mm = 0; > + > + mode->clock = timing->pixclock / 1000; > + mode->hsync = timing->pixclock / mode->htotal; > + mode->vrefresh = mode->hsync / mode->vtotal; > +} > diff --git a/drivers/gpu/drm/pl111/pl111_drm_platform.c b/drivers/gpu/drm/pl111/pl111_drm_platform.c > new file mode 100644 > index 0000000..a0b9e50 > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_platform.c > @@ -0,0 +1,150 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_platform.c > + * Implementation of the Linux platform device entrypoints for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > +#include <linux/init.h> > +#include <linux/platform_device.h> > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > +#include "pl111_drm.h" > + > +static int pl111_platform_drm_suspend(struct platform_device *dev, > + pm_message_t state) > +{ > + pr_info("DRM %s\n", __func__); > + return 0; > +} > + > +static int pl111_platform_drm_resume(struct platform_device *dev) > +{ > + pr_info("DRM %s\n", __func__); > + return 0; > +} > + > +int pl111_platform_drm_probe(struct platform_device *dev) > +{ > + pr_info("DRM %s\n", __func__); > + return pl111_drm_init(dev); > +} > + > +static int pl111_platform_drm_remove(struct platform_device *dev) > +{ > + pr_info("DRM %s\n", __func__); > + pl111_drm_exit(dev); > + > + return 0; > +} > + > +static struct amba_id pl111_id_table[] = { > + { > + .id = 0x00041110, > + .mask = 0x000ffffe, > + }, > + {0, 0}, > +}; > + > +static struct amba_driver pl111_amba_driver = { > + .drv = { > + .name = "clcd-pl11x", > + }, > + .probe = pl111_amba_probe, > + .remove = pl111_amba_remove, > + .id_table = pl111_id_table, > +}; > + > +static struct platform_driver platform_drm_driver = { > + .probe = pl111_platform_drm_probe, > + .remove = pl111_platform_drm_remove, > + .suspend = pl111_platform_drm_suspend, > + .resume = pl111_platform_drm_resume, > + .driver = { > + .owner = THIS_MODULE, > + .name = DRIVER_NAME, > + }, > +}; > + > +static const struct platform_device_info pl111_drm_pdevinfo = { > + .name = DRIVER_NAME, > + .id = -1, > + .dma_mask = ~0UL > +}; > + > +static struct platform_device *pl111_drm_device; > + > +static int __init pl111_platform_drm_init(void) > +{ > + int ret; > + > + pr_info("DRM %s\n", __func__); > + > + pl111_drm_device = platform_device_register_full(&pl111_drm_pdevinfo); > + if (pl111_drm_device == NULL) { > + pr_err("DRM platform_device_register_full() failed\n"); > + return -ENOMEM; > + } > + > + ret = amba_driver_register(&pl111_amba_driver); > + if (ret != 0) { > + pr_err("DRM amba_driver_register() failed %d\n", ret); > + goto err_amba_reg; > + } > + > + ret = platform_driver_register(&platform_drm_driver); > + if (ret != 0) { > + pr_err("DRM platform_driver_register() failed %d\n", ret); > + goto err_pdrv_reg; > + } > + > + return 0; > + > +err_pdrv_reg: > + amba_driver_unregister(&pl111_amba_driver); > +err_amba_reg: > + platform_device_unregister(pl111_drm_device); > + > + return ret; > +} > + > +static void __exit pl111_platform_drm_exit(void) > +{ > + pr_info("DRM %s\n", __func__); > + > + platform_device_unregister(pl111_drm_device); > + amba_driver_unregister(&pl111_amba_driver); > + platform_driver_unregister(&platform_drm_driver); > +} > + > +#ifdef MODULE > +module_init(pl111_platform_drm_init); > +#else > +late_initcall(pl111_platform_drm_init); > +#endif > +module_exit(pl111_platform_drm_exit); > + > +MODULE_DESCRIPTION(DRIVER_DESC); > +MODULE_VERSION(DRIVER_VERSION); > +MODULE_AUTHOR(DRIVER_AUTHOR); > +MODULE_LICENSE(DRIVER_LICENCE); > +MODULE_ALIAS(DRIVER_ALIAS); > diff --git a/drivers/gpu/drm/pl111/pl111_drm_suspend.c b/drivers/gpu/drm/pl111/pl111_drm_suspend.c > new file mode 100644 > index 0000000..d4da60f > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_suspend.c > @@ -0,0 +1,35 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + */ > + > +/** > + * pl111_drm_suspend.c > + * Implementation of the suspend/resume functions for PL111 DRM > + */ > + > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > +#include "pl111_drm.h" > + > +int pl111_drm_suspend(struct drm_device *dev, pm_message_t state) > +{ > + pr_info("DRM %s\n", __func__); > + return 0; > +} > + > +int pl111_drm_resume(struct drm_device *dev) > +{ > + pr_info("DRM %s\n", __func__); > + return 0; > +} > diff --git a/drivers/gpu/drm/pl111/pl111_drm_vma.c b/drivers/gpu/drm/pl111/pl111_drm_vma.c > new file mode 100644 > index 0000000..a3c78fa > --- /dev/null > +++ b/drivers/gpu/drm/pl111/pl111_drm_vma.c > @@ -0,0 +1,214 @@ > +/* > + * (C) COPYRIGHT 2012-2013 ARM Limited. All rights reserved. > + * > + * Parts of this file were based on sources as follows: > + * > + * Copyright (c) 2006-2008 Intel Corporation > + * Copyright (c) 2007 Dave Airlie <airlied@linux.ie> > + * Copyright (C) 2011 Texas Instruments > + * > + * This program is free software and is provided to you under the terms of the > + * GNU General Public License version 2 as published by the Free Software > + * Foundation, and any use by you of this program is subject to the terms of > + * such GNU licence. > + * > + */ > + > +/** > + * pl111_drm_vma.c > + * Implementation of the VM functions for PL111 DRM > + */ > +#include <linux/amba/bus.h> > +#include <linux/amba/clcd.h> > +#include <linux/version.h> > +#include <linux/shmem_fs.h> > +#include <linux/dma-buf.h> > +#include <linux/module.h> > + > +#include <drm/drmP.h> > +#include <drm/drm_crtc_helper.h> > + > +#include "pl111_drm.h" > + > +/* BEGIN drivers/staging/omapdrm/omap_gem_helpers.c */ note that I (finally) sent patches to move these into drm core (I need them also for msm), so you can drop this bit now BR, -R > +/** > + * drm_gem_put_pages - helper to free backing pages for a GEM object > + * @obj: obj in question > + * @pages: pages to free > + */ > +static void _drm_gem_put_pages(struct drm_gem_object *obj, struct page **pages, > + bool dirty, bool accessed) > +{ > + int i, npages; > + struct pl111_gem_bo *bo; > + npages = obj->size >> PAGE_SHIFT; > + bo = PL111_BO_FROM_GEM(obj); > + for (i = 0; i < npages; i++) { > + if (dirty) > + set_page_dirty(pages[i]); > + if (accessed) > + mark_page_accessed(pages[i]); > + /* Undo the reference we took when populating the table */ > + page_cache_release(pages[i]); > + } > + drm_free_large(pages); > +} > + > +void put_pages(struct drm_gem_object *obj, struct page **pages) > +{ > + int i, npages; > + struct pl111_gem_bo *bo; > + npages = obj->size >> PAGE_SHIFT; > + bo = PL111_BO_FROM_GEM(obj); > + _drm_gem_put_pages(obj, pages, true, true); > + if (bo->backing_data.shm.dma_addrs) { > + for (i = 0; i < npages; i++) { > + if (!dma_mapping_error(obj->dev->dev, > + bo->backing_data.shm.dma_addrs[i])) { > + dma_unmap_page(obj->dev->dev, > + bo->backing_data.shm.dma_addrs[i], > + PAGE_SIZE, > + DMA_BIDIRECTIONAL); > + } > + } > + kfree(bo->backing_data.shm.dma_addrs); > + bo->backing_data.shm.dma_addrs = NULL; > + } > +} > + > +/** > + * drm_gem_get_pages - helper to allocate backing pages for a GEM object > + * @obj: obj in question > + * @gfpmask: gfp mask of requested pages > + */ > +static struct page **_drm_gem_get_pages(struct drm_gem_object *obj, > + gfp_t gfpmask) > +{ > + struct inode *inode; > + struct address_space *mapping; > + struct page *p, **pages; > + int i, npages; > + > + /* This is the shared memory object that backs the GEM resource */ > + inode = obj->filp->f_path.dentry->d_inode; > + mapping = inode->i_mapping; > + > + npages = obj->size >> PAGE_SHIFT; > + > + pages = drm_malloc_ab(npages, sizeof(struct page *)); > + if (pages == NULL) > + return ERR_PTR(-ENOMEM); > + > + gfpmask |= mapping_gfp_mask(mapping); > + > + for (i = 0; i < npages; i++) { > + p = shmem_read_mapping_page_gfp(mapping, i, gfpmask); > + if (IS_ERR(p)) > + goto fail; > + pages[i] = p; > + > + /* > + * There is a hypothetical issue w/ drivers that require > + * buffer memory in the low 4GB.. if the pages are un- > + * pinned, and swapped out, they can end up swapped back > + * in above 4GB. If pages are already in memory, then > + * shmem_read_mapping_page_gfp will ignore the gfpmask, > + * even if the already in-memory page disobeys the mask. > + * > + * It is only a theoretical issue today, because none of > + * the devices with this limitation can be populated with > + * enough memory to trigger the issue. But this BUG_ON() > + * is here as a reminder in case the problem with > + * shmem_read_mapping_page_gfp() isn't solved by the time > + * it does become a real issue. > + * > + * See this thread: http://lkml.org/lkml/2011/7/11/238 > + */ > + BUG_ON((gfpmask & __GFP_DMA32) && > + (page_to_pfn(p) >= 0x00100000UL)); > + } > + > + return pages; > + > +fail: > + while (i--) > + page_cache_release(pages[i]); > + > + drm_free_large(pages); > + return ERR_PTR(PTR_ERR(p)); > +} > + > +struct page **get_pages(struct drm_gem_object *obj) > +{ > + struct pl111_gem_bo *bo; > + bo = PL111_BO_FROM_GEM(obj); > + > + if (bo->backing_data.shm.pages == NULL) { > + struct page **p; > + int npages = obj->size >> PAGE_SHIFT; > + int i; > + > + p = _drm_gem_get_pages(obj, GFP_KERNEL); > + if (IS_ERR(p)) > + return ERR_PTR(-ENOMEM); > + > + bo->backing_data.shm.pages = p; > + > + if (bo->backing_data.shm.dma_addrs == NULL) { > + bo->backing_data.shm.dma_addrs = > + kzalloc(npages * sizeof(dma_addr_t), > + GFP_KERNEL); > + if (bo->backing_data.shm.dma_addrs == NULL) > + goto error_out; > + } > + > + for (i = 0; i < npages; ++i) { > + bo->backing_data.shm.dma_addrs[i] = > + dma_map_page(obj->dev->dev, p[i], 0, PAGE_SIZE, > + DMA_BIDIRECTIONAL); > + if (dma_mapping_error(obj->dev->dev, > + bo->backing_data.shm.dma_addrs[i])) > + goto error_out; > + } > + } > + > + return bo->backing_data.shm.pages; > + > +error_out: > + put_pages(obj, bo->backing_data.shm.pages); > + bo->backing_data.shm.pages = NULL; > + return ERR_PTR(-ENOMEM); > +} > + > +/* END drivers/staging/omapdrm/omap_gem_helpers.c */ > + > +int pl111_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + struct page **pages; > + pgoff_t pgoff; > + unsigned long pfn; > + struct drm_gem_object *obj = vma->vm_private_data; > + struct pl111_gem_bo *bo = PL111_BO_FROM_GEM(obj); > + DRM_DEBUG_KMS("DRM %s on pl111_gem_bo %p\n", __func__, bo); > + > + /* We don't use vmf->pgoff since that has the fake offset: */ > + pgoff = ((unsigned long)vmf->virtual_address - > + vma->vm_start) >> PAGE_SHIFT; > + if (bo->type == PL111_BOT_SHM) { > + pages = get_pages(obj); > + if (IS_ERR(pages)) { > + dev_err(obj->dev->dev, > + "could not get pages: %ld\n", PTR_ERR(pages)); > + return PTR_ERR(pages); > + } > + pfn = page_to_pfn(pages[pgoff]); > + DRM_DEBUG_KMS("physaddr 0x%.8x for offset 0x%x\n", > + PFN_PHYS(pfn), PFN_PHYS(pgoff)); > + vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn); > + return VM_FAULT_NOPAGE; > + } else { > + DRM_DEBUG_KMS("Fault on non-shared memory %p\n", > + vmf->virtual_address); > + return VM_FAULT_SIGBUS; > + } > +} > -- > 1.7.9.5 > > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 16:17 ` Rob Clark 2013-08-07 16:46 ` Daniel Vetter @ 2013-08-07 16:46 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-07 16:46 UTC (permalink / raw) To: Rob Clark; +Cc: dri-devel, linux-fbdev, pawel.moll, linux-arm-kernel Just comment a bit on Rob's review with my own opinion. On Wed, Aug 07, 2013 at 12:17:21PM -0400, Rob Clark wrote: > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > From: Tom Cooksey <tom.cooksey@arm.com> > > > > This is a mode-setting driver for the pl111 CLCD display controller > > found on various ARM reference platforms such as the Versatile > > Express. The driver supports setting of a single mode (640x480) and > > has only been tested on Versatile Express with a Cortex-A9 core tile. > > > > Known issues: > > * It still includes code to use KDS, which is not going upstream. > > review's on http://lists.freedesktop.org/archives/dri-devel/2013-July/042462.html > can't hurt > > although you might consider submitting a reduced functionality driver > w/ KDS bits removed in the mean time.. then when the fence stuff is > merged it is just an incremental patch rather than a whole driver ;-) Yeah, I think the KDS bits and comments need to go first before merginge. > > +/* > > + * Number of flips allowed in flight at any one time. Any more flips requested > > + * beyond this value will cause the caller to block until earlier flips have > > + * completed. > > + * > > + * For performance reasons, this must be greater than the number of buffers > > + * used in the rendering pipeline. Note that the rendering pipeline can contain > > + * different types of buffer, e.g.: > > + * - 2 final framebuffers > > + * - >2 geometry buffers for GPU use-cases > > + * - >2 vertex buffers for GPU use-cases > > + * > > + * For example, a system using 5 geometry buffers could have 5 flips in flight, > > + * and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 5 or greater. > > + * > > + * Whilst there may be more intermediate buffers (such as vertex/geometry) than > > + * final framebuffers, KDS is used to ensure that GPU rendering waits for the > > + * next off-screen buffer, so it doesn't overwrite an on-screen buffer and > > + * produce tearing. > > + */ > > + > > fwiw, this is at least different from how other drivers do triple (or > > double) buffering. In other drivers (intel, omap, and > msm/freedreno, that I know of, maybe others too) the xorg driver dri2 > bits implement the double buffering (ie. send flip event back to > client immediately and queue up the flip and call page-flip after the > pageflip event back from kernel. > > I'm not saying not to do it this way, I guess I'd like to hear what > other folks think. I kinda prefer doing this in userspace as it keeps > the kernel bits simpler (plus it would then work properly on exynosdrm > or other kms drivers). Yeah, if this is just a sw queue then I don't think it makes sense to have it in the kernel. Afaik the current pageflip interface drm exposes allows one oustanding flip only, and you _must_ wait for the flip complete event before you can submit the second one. Ofc if your hardware as a hw-based flip queue (maybe even with frame targets) that's a different matter, but currently we don't have a drm interface to expose this. I'd say for merging the basic driver first we should go with the existing simple pageflip semantics. And tbh I don't understand why the amount of buffers you keep in the render pipeline side of things matters here at all. But I also haven't read the details of your driver code. > > > +/* > > + * Here, we choose a conservative value. A lower value is most likely > > + * suitable for GPU use-cases. > > + */ > > +#define NR_FLIPS_IN_FLIGHT_THRESHOLD 16 > > + > > +#define CLCD_IRQ_NEXTBASE_UPDATE (1u<<2) > > + > > +struct pl111_drm_flip_resource; > > +struct pl111_drm_cursor_plane; > > + > > +enum pl111_bo_type { > > + PL111_BOT_DMA, > > + PL111_BOT_SHM > > +}; > > + > > +struct pl111_gem_bo_dma { > > + dma_addr_t fb_dev_addr; > > + void *fb_cpu_addr; > > +}; > > + > > +struct pl111_gem_bo_shm { > > + struct page **pages; > > + dma_addr_t *dma_addrs; > > +}; > > + > > +struct pl111_gem_bo { > > + struct drm_gem_object gem_object; > > + enum pl111_bo_type type; > > + union { > > + struct pl111_gem_bo_dma dma; > > + struct pl111_gem_bo_shm shm; > > + } backing_data; > > + struct drm_framebuffer *fb; > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > not the other way around. And the same bo could be referenced by > multiple fb's which would kinda fall down with this approach. I'd say that's just backwards, framebuffers are created from backing storage objects (which for a gem based driver is a gem object), not the other way round. What's this exactly used for? [snip] > > + > > + /* > > + * Used to prevent race between pl111_dma_buf_release and > > + * drm_gem_prime_handle_to_fd > > + */ > > + struct mutex export_dma_buf_lock; > > hmm, seems a bit suspicious.. the handle reference should keep the > object live. Ie. either drm_gem_object_lookup() will fail because the > object is gone (userspace has closed it's handle ref and > dmabuf->release() already dropped it's ref) or it will succeed and > you'll have a reference to the bo keeping it from going away if the > release() comes after. The race is real, I have an evil testcase here which Oopses my kernel. I'm working on a fix (v1 of my patches is submitted a few weeks back, awaiting review), but I need to rework a few things since now I've also spotted a leak or two ;-) [snip] > > +static void vsync_worker(struct work_struct *work) > > +{ > > + struct pl111_drm_flip_resource *flip_res; > > + struct pl111_gem_bo *bo; > > + struct pl111_drm_crtc *pl111_crtc; > > + struct drm_device *dev; > > + int flips_in_flight; > > + flip_res > > + container_of(work, struct pl111_drm_flip_resource, vsync_work); > > + > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > + dev = pl111_crtc->crtc.dev; > > + > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > + > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > + if (flip_res->worker_release_kds = true) { > > + spin_lock(&pl111_crtc->current_displaying_lock); > > + release_kds_resource_and_display(flip_res); > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > + } > > +#endif > > + /* Release DMA buffer on this flip */ > > + if (bo->gem_object.export_dma_buf != NULL) > > + dma_buf_put(bo->gem_object.export_dma_buf); > > I think you just want to unref the outgoing bo, and let it drop the > dmabuf ref when the file ref of the imported bo goes. Or actually, it > would be better to hold/drop ref's to the fb, rather than the bo. At > least this will make things simpler if you ever have multi-planar > support. Drivers have no business frobbing around the dma-buf refcount of imported objects imo, at least if they use all the standard drm prime infrastructure. And if they're bugs they need to be fixed there, not in drivers. [snip] > > +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, > > + struct drm_file *file_priv, > > + struct drm_mode_fb_cmd2 *mode_cmd) > > +{ > > + struct pl111_drm_framebuffer *pl111_fb = NULL; > > + struct drm_framebuffer *fb = NULL; > > + struct drm_gem_object *gem_obj; > > + struct pl111_gem_bo *bo; > > + > > + pr_info("DRM %s\n", __func__); > > + gem_obj = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); > > + if (gem_obj = NULL) { > > + DRM_ERROR("Could not get gem obj from handle to create fb\n"); > > + goto out; > > + } > > + > > + bo = PL111_BO_FROM_GEM(gem_obj); > > + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ > > + BUG_ON(bo->type != PL111_BOT_DMA); > > umm, no BUG_ON() is not really a good way to validate userspace input.. > > if (bo->type != ...) > return ERR_PTR(-EINVAL); Yep. > > + > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > + case 'Y': > > + case 'U': > > + case 'V': > > + case 'N': > > + case 'T': > > perhaps we should instead add a drm_format_is_yuv().. or you could > (ab)use drm_fb_get_bpp_depth().. Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger question is why you need this, since the drm core should filter out formats not in your list of supported ones. Or at least it should ... Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 16:46 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-07 16:46 UTC (permalink / raw) To: Rob Clark; +Cc: dri-devel, linux-fbdev, pawel.moll, linux-arm-kernel Just comment a bit on Rob's review with my own opinion. On Wed, Aug 07, 2013 at 12:17:21PM -0400, Rob Clark wrote: > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > From: Tom Cooksey <tom.cooksey@arm.com> > > > > This is a mode-setting driver for the pl111 CLCD display controller > > found on various ARM reference platforms such as the Versatile > > Express. The driver supports setting of a single mode (640x480) and > > has only been tested on Versatile Express with a Cortex-A9 core tile. > > > > Known issues: > > * It still includes code to use KDS, which is not going upstream. > > review's on http://lists.freedesktop.org/archives/dri-devel/2013-July/042462.html > can't hurt > > although you might consider submitting a reduced functionality driver > w/ KDS bits removed in the mean time.. then when the fence stuff is > merged it is just an incremental patch rather than a whole driver ;-) Yeah, I think the KDS bits and comments need to go first before merginge. > > +/* > > + * Number of flips allowed in flight at any one time. Any more flips requested > > + * beyond this value will cause the caller to block until earlier flips have > > + * completed. > > + * > > + * For performance reasons, this must be greater than the number of buffers > > + * used in the rendering pipeline. Note that the rendering pipeline can contain > > + * different types of buffer, e.g.: > > + * - 2 final framebuffers > > + * - >2 geometry buffers for GPU use-cases > > + * - >2 vertex buffers for GPU use-cases > > + * > > + * For example, a system using 5 geometry buffers could have 5 flips in flight, > > + * and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 5 or greater. > > + * > > + * Whilst there may be more intermediate buffers (such as vertex/geometry) than > > + * final framebuffers, KDS is used to ensure that GPU rendering waits for the > > + * next off-screen buffer, so it doesn't overwrite an on-screen buffer and > > + * produce tearing. > > + */ > > + > > fwiw, this is at least different from how other drivers do triple (or > > double) buffering. In other drivers (intel, omap, and > msm/freedreno, that I know of, maybe others too) the xorg driver dri2 > bits implement the double buffering (ie. send flip event back to > client immediately and queue up the flip and call page-flip after the > pageflip event back from kernel. > > I'm not saying not to do it this way, I guess I'd like to hear what > other folks think. I kinda prefer doing this in userspace as it keeps > the kernel bits simpler (plus it would then work properly on exynosdrm > or other kms drivers). Yeah, if this is just a sw queue then I don't think it makes sense to have it in the kernel. Afaik the current pageflip interface drm exposes allows one oustanding flip only, and you _must_ wait for the flip complete event before you can submit the second one. Ofc if your hardware as a hw-based flip queue (maybe even with frame targets) that's a different matter, but currently we don't have a drm interface to expose this. I'd say for merging the basic driver first we should go with the existing simple pageflip semantics. And tbh I don't understand why the amount of buffers you keep in the render pipeline side of things matters here at all. But I also haven't read the details of your driver code. > > > +/* > > + * Here, we choose a conservative value. A lower value is most likely > > + * suitable for GPU use-cases. > > + */ > > +#define NR_FLIPS_IN_FLIGHT_THRESHOLD 16 > > + > > +#define CLCD_IRQ_NEXTBASE_UPDATE (1u<<2) > > + > > +struct pl111_drm_flip_resource; > > +struct pl111_drm_cursor_plane; > > + > > +enum pl111_bo_type { > > + PL111_BOT_DMA, > > + PL111_BOT_SHM > > +}; > > + > > +struct pl111_gem_bo_dma { > > + dma_addr_t fb_dev_addr; > > + void *fb_cpu_addr; > > +}; > > + > > +struct pl111_gem_bo_shm { > > + struct page **pages; > > + dma_addr_t *dma_addrs; > > +}; > > + > > +struct pl111_gem_bo { > > + struct drm_gem_object gem_object; > > + enum pl111_bo_type type; > > + union { > > + struct pl111_gem_bo_dma dma; > > + struct pl111_gem_bo_shm shm; > > + } backing_data; > > + struct drm_framebuffer *fb; > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > not the other way around. And the same bo could be referenced by > multiple fb's which would kinda fall down with this approach. I'd say that's just backwards, framebuffers are created from backing storage objects (which for a gem based driver is a gem object), not the other way round. What's this exactly used for? [snip] > > + > > + /* > > + * Used to prevent race between pl111_dma_buf_release and > > + * drm_gem_prime_handle_to_fd > > + */ > > + struct mutex export_dma_buf_lock; > > hmm, seems a bit suspicious.. the handle reference should keep the > object live. Ie. either drm_gem_object_lookup() will fail because the > object is gone (userspace has closed it's handle ref and > dmabuf->release() already dropped it's ref) or it will succeed and > you'll have a reference to the bo keeping it from going away if the > release() comes after. The race is real, I have an evil testcase here which Oopses my kernel. I'm working on a fix (v1 of my patches is submitted a few weeks back, awaiting review), but I need to rework a few things since now I've also spotted a leak or two ;-) [snip] > > +static void vsync_worker(struct work_struct *work) > > +{ > > + struct pl111_drm_flip_resource *flip_res; > > + struct pl111_gem_bo *bo; > > + struct pl111_drm_crtc *pl111_crtc; > > + struct drm_device *dev; > > + int flips_in_flight; > > + flip_res = > > + container_of(work, struct pl111_drm_flip_resource, vsync_work); > > + > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > + dev = pl111_crtc->crtc.dev; > > + > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > + > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > + if (flip_res->worker_release_kds == true) { > > + spin_lock(&pl111_crtc->current_displaying_lock); > > + release_kds_resource_and_display(flip_res); > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > + } > > +#endif > > + /* Release DMA buffer on this flip */ > > + if (bo->gem_object.export_dma_buf != NULL) > > + dma_buf_put(bo->gem_object.export_dma_buf); > > I think you just want to unref the outgoing bo, and let it drop the > dmabuf ref when the file ref of the imported bo goes. Or actually, it > would be better to hold/drop ref's to the fb, rather than the bo. At > least this will make things simpler if you ever have multi-planar > support. Drivers have no business frobbing around the dma-buf refcount of imported objects imo, at least if they use all the standard drm prime infrastructure. And if they're bugs they need to be fixed there, not in drivers. [snip] > > +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, > > + struct drm_file *file_priv, > > + struct drm_mode_fb_cmd2 *mode_cmd) > > +{ > > + struct pl111_drm_framebuffer *pl111_fb = NULL; > > + struct drm_framebuffer *fb = NULL; > > + struct drm_gem_object *gem_obj; > > + struct pl111_gem_bo *bo; > > + > > + pr_info("DRM %s\n", __func__); > > + gem_obj = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); > > + if (gem_obj == NULL) { > > + DRM_ERROR("Could not get gem obj from handle to create fb\n"); > > + goto out; > > + } > > + > > + bo = PL111_BO_FROM_GEM(gem_obj); > > + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ > > + BUG_ON(bo->type != PL111_BOT_DMA); > > umm, no BUG_ON() is not really a good way to validate userspace input.. > > if (bo->type != ...) > return ERR_PTR(-EINVAL); Yep. > > + > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > + case 'Y': > > + case 'U': > > + case 'V': > > + case 'N': > > + case 'T': > > perhaps we should instead add a drm_format_is_yuv().. or you could > (ab)use drm_fb_get_bpp_depth().. Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger question is why you need this, since the drm core should filter out formats not in your list of supported ones. Or at least it should ... Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 16:46 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-07 16:46 UTC (permalink / raw) To: linux-arm-kernel Just comment a bit on Rob's review with my own opinion. On Wed, Aug 07, 2013 at 12:17:21PM -0400, Rob Clark wrote: > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > From: Tom Cooksey <tom.cooksey@arm.com> > > > > This is a mode-setting driver for the pl111 CLCD display controller > > found on various ARM reference platforms such as the Versatile > > Express. The driver supports setting of a single mode (640x480) and > > has only been tested on Versatile Express with a Cortex-A9 core tile. > > > > Known issues: > > * It still includes code to use KDS, which is not going upstream. > > review's on http://lists.freedesktop.org/archives/dri-devel/2013-July/042462.html > can't hurt > > although you might consider submitting a reduced functionality driver > w/ KDS bits removed in the mean time.. then when the fence stuff is > merged it is just an incremental patch rather than a whole driver ;-) Yeah, I think the KDS bits and comments need to go first before merginge. > > +/* > > + * Number of flips allowed in flight at any one time. Any more flips requested > > + * beyond this value will cause the caller to block until earlier flips have > > + * completed. > > + * > > + * For performance reasons, this must be greater than the number of buffers > > + * used in the rendering pipeline. Note that the rendering pipeline can contain > > + * different types of buffer, e.g.: > > + * - 2 final framebuffers > > + * - >2 geometry buffers for GPU use-cases > > + * - >2 vertex buffers for GPU use-cases > > + * > > + * For example, a system using 5 geometry buffers could have 5 flips in flight, > > + * and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 5 or greater. > > + * > > + * Whilst there may be more intermediate buffers (such as vertex/geometry) than > > + * final framebuffers, KDS is used to ensure that GPU rendering waits for the > > + * next off-screen buffer, so it doesn't overwrite an on-screen buffer and > > + * produce tearing. > > + */ > > + > > fwiw, this is at least different from how other drivers do triple (or > > double) buffering. In other drivers (intel, omap, and > msm/freedreno, that I know of, maybe others too) the xorg driver dri2 > bits implement the double buffering (ie. send flip event back to > client immediately and queue up the flip and call page-flip after the > pageflip event back from kernel. > > I'm not saying not to do it this way, I guess I'd like to hear what > other folks think. I kinda prefer doing this in userspace as it keeps > the kernel bits simpler (plus it would then work properly on exynosdrm > or other kms drivers). Yeah, if this is just a sw queue then I don't think it makes sense to have it in the kernel. Afaik the current pageflip interface drm exposes allows one oustanding flip only, and you _must_ wait for the flip complete event before you can submit the second one. Ofc if your hardware as a hw-based flip queue (maybe even with frame targets) that's a different matter, but currently we don't have a drm interface to expose this. I'd say for merging the basic driver first we should go with the existing simple pageflip semantics. And tbh I don't understand why the amount of buffers you keep in the render pipeline side of things matters here at all. But I also haven't read the details of your driver code. > > > +/* > > + * Here, we choose a conservative value. A lower value is most likely > > + * suitable for GPU use-cases. > > + */ > > +#define NR_FLIPS_IN_FLIGHT_THRESHOLD 16 > > + > > +#define CLCD_IRQ_NEXTBASE_UPDATE (1u<<2) > > + > > +struct pl111_drm_flip_resource; > > +struct pl111_drm_cursor_plane; > > + > > +enum pl111_bo_type { > > + PL111_BOT_DMA, > > + PL111_BOT_SHM > > +}; > > + > > +struct pl111_gem_bo_dma { > > + dma_addr_t fb_dev_addr; > > + void *fb_cpu_addr; > > +}; > > + > > +struct pl111_gem_bo_shm { > > + struct page **pages; > > + dma_addr_t *dma_addrs; > > +}; > > + > > +struct pl111_gem_bo { > > + struct drm_gem_object gem_object; > > + enum pl111_bo_type type; > > + union { > > + struct pl111_gem_bo_dma dma; > > + struct pl111_gem_bo_shm shm; > > + } backing_data; > > + struct drm_framebuffer *fb; > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > not the other way around. And the same bo could be referenced by > multiple fb's which would kinda fall down with this approach. I'd say that's just backwards, framebuffers are created from backing storage objects (which for a gem based driver is a gem object), not the other way round. What's this exactly used for? [snip] > > + > > + /* > > + * Used to prevent race between pl111_dma_buf_release and > > + * drm_gem_prime_handle_to_fd > > + */ > > + struct mutex export_dma_buf_lock; > > hmm, seems a bit suspicious.. the handle reference should keep the > object live. Ie. either drm_gem_object_lookup() will fail because the > object is gone (userspace has closed it's handle ref and > dmabuf->release() already dropped it's ref) or it will succeed and > you'll have a reference to the bo keeping it from going away if the > release() comes after. The race is real, I have an evil testcase here which Oopses my kernel. I'm working on a fix (v1 of my patches is submitted a few weeks back, awaiting review), but I need to rework a few things since now I've also spotted a leak or two ;-) [snip] > > +static void vsync_worker(struct work_struct *work) > > +{ > > + struct pl111_drm_flip_resource *flip_res; > > + struct pl111_gem_bo *bo; > > + struct pl111_drm_crtc *pl111_crtc; > > + struct drm_device *dev; > > + int flips_in_flight; > > + flip_res = > > + container_of(work, struct pl111_drm_flip_resource, vsync_work); > > + > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > + dev = pl111_crtc->crtc.dev; > > + > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > + > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > + if (flip_res->worker_release_kds == true) { > > + spin_lock(&pl111_crtc->current_displaying_lock); > > + release_kds_resource_and_display(flip_res); > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > + } > > +#endif > > + /* Release DMA buffer on this flip */ > > + if (bo->gem_object.export_dma_buf != NULL) > > + dma_buf_put(bo->gem_object.export_dma_buf); > > I think you just want to unref the outgoing bo, and let it drop the > dmabuf ref when the file ref of the imported bo goes. Or actually, it > would be better to hold/drop ref's to the fb, rather than the bo. At > least this will make things simpler if you ever have multi-planar > support. Drivers have no business frobbing around the dma-buf refcount of imported objects imo, at least if they use all the standard drm prime infrastructure. And if they're bugs they need to be fixed there, not in drivers. [snip] > > +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, > > + struct drm_file *file_priv, > > + struct drm_mode_fb_cmd2 *mode_cmd) > > +{ > > + struct pl111_drm_framebuffer *pl111_fb = NULL; > > + struct drm_framebuffer *fb = NULL; > > + struct drm_gem_object *gem_obj; > > + struct pl111_gem_bo *bo; > > + > > + pr_info("DRM %s\n", __func__); > > + gem_obj = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); > > + if (gem_obj == NULL) { > > + DRM_ERROR("Could not get gem obj from handle to create fb\n"); > > + goto out; > > + } > > + > > + bo = PL111_BO_FROM_GEM(gem_obj); > > + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ > > + BUG_ON(bo->type != PL111_BOT_DMA); > > umm, no BUG_ON() is not really a good way to validate userspace input.. > > if (bo->type != ...) > return ERR_PTR(-EINVAL); Yep. > > + > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > + case 'Y': > > + case 'U': > > + case 'V': > > + case 'N': > > + case 'T': > > perhaps we should instead add a drm_format_is_yuv().. or you could > (ab)use drm_fb_get_bpp_depth().. Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger question is why you need this, since the drm core should filter out formats not in your list of supported ones. Or at least it should ... Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 16:46 ` Daniel Vetter (?) (?) @ 2013-08-09 16:15 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 16:15 UTC (permalink / raw) To: 'Daniel Vetter', Rob Clark Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel Hi Daniel, Rob. Thank you both for your reviews - greatly appreciated! > > > Known issues: > > > * It still includes code to use KDS, which is not going upstream. > > > > review's on <http://lists.freedesktop.org/archives/dri-devel/2013- > > July/042462.html> can't hurt > > > > although you might consider submitting a reduced functionality driver > > w/ KDS bits removed in the mean time.. then when the fence stuff is > > merged it is just an incremental patch rather than a whole driver ;-) > > Yeah, I think the KDS bits and comments need to go first before > merginge. Right, as I expected really. Though as I said we'll probably wait for fences to land and switch over to that before asking for it to be merged. A pl111 KMS driver with neither KDS nor implicit fences is useless to us. Having said that, if someone else would have a use for a fence/KDS-less pl111 KMS driver, please let me know! > > > +/* > > > + * Number of flips allowed in flight at any one time. Any more > > > + * flips requested beyond this value will cause the caller to > > > + * block until earlier flips have completed. > > > + * > > > + * For performance reasons, this must be greater than the number > > > + * of buffers used in the rendering pipeline. Note that the > > > + * rendering pipeline can contain different types of buffer, e.g.: > > > + * - 2 final framebuffers > > > + * - >2 geometry buffers for GPU use-cases > > > + * - >2 vertex buffers for GPU use-cases > > > + * > > > + * For example, a system using 5 geometry buffers could have 5 > > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be > > > + * 5 or greater. > > > + * > > > + * Whilst there may be more intermediate buffers (such as > > > + * vertex/geometry) than final framebuffers, KDS is used to > > > + * ensure that GPU rendering waits for the next off-screen > > > + * buffer, so it doesn't overwrite an on-screen buffer and > > > + * produce tearing. > > > + */ > > > + > > > > fwiw, this is at least different from how other drivers do triple > > (or > double) buffering. In other drivers (intel, omap, and > > msm/freedreno, that I know of, maybe others too) the xorg driver > > dri2 bits implement the double buffering (ie. send flip event back > > to client immediately and queue up the flip and call page-flip > > after the pageflip event back from kernel. > > > > I'm not saying not to do it this way, I guess I'd like to hear > > what other folks think. I kinda prefer doing this in userspace > > as it keeps the kernel bits simpler (plus it would then work > > properly on exynosdrm or other kms drivers). > > Yeah, if this is just a sw queue then I don't think it makes sense > to have it in the kernel. Afaik the current pageflip interface drm > exposes allows one oustanding flip only, and you _must_ wait for > the flip complete event before you can submit the second one. Right, I'll have a think about this. I think our idea was to issue enough page-flips into the kernel to make sure that any process scheduling latencies on a heavily loaded system don't cause us to miss a v_sync deadline. At the moment we issue the page flip from DRI2 schedule_swap. If we were to move that to the page flip event handler of the previous page-flip, we're potentially adding in extra latency. I.e. Currently we have: DRI2SwapBuffers - drm_mode_page_flip to buffer B DRI2SwapBuffers - drm_mode_page_flip to buffer A (gets queued in kernel) ... v_sync! (at this point buffer B is scanned out) - release buffer A's KDS resource/signal buffer A's fence - queued GPU job to render next frame to buffer A scheduled on HW ... GPU interrupt! (at this point buffer A is ready to be scanned out) - release buffer A's KDS resource/signal buffer A's fence - second page flip executed, buffer A's address written to scanout register, takes effect on next v_sync. So in the above, after X receives the second DRI2SwapBuffers, it doesn't need to get scheduled again for the next frame to be both rendered by the GPU and issued to the display for scanout. If we were to move to a user-space queue, I think we have something like this: DRI2SwapBuffers - drm_mode_page_flip to buffer B DRI2SwapBuffers - queue page flip to buffer A in DDX ... v_sync! (at this point buffer B is scanned out) - release buffer A's KDS resource/signal buffer A's fence - queued GPU job to render next frame to buffer A scheduled on HW - Send page flip event to X ... GPU interrupt! (at this point buffer A is ready to be scanned out) - Release buffer A's KDS resource/signal buffer A's fence - but nothing is waiting on it.... ... X gets scheduled, runs page flip handler - drm_mode_page_flip to buffer A - buffer A's address written to scanout register, takes effect on next v_sync. So here, X must get scheduled again after processing the second DRI2SwapBuffers in order to have the next frame displayed. This increases the likely-hood that we're not able to write the address of buffer A to the display HW's scan-out buffer in time to catch the next v_sync, especially on a loaded system. Anyway, I think that's our rational for keeping the queue in kernel space, but I don't see there's much value in queuing more than 2 page flips in kernel space. > Ofc if your hardware as a hw-based flip queue (maybe even with frame > targets) that's a different matter, but currently we don't have a drm > interface to expose this. I'd say for merging the basic driver first we > should go with the existing simple pageflip semantics. Sure - I think it would mean slightly increased jank, but probably something we can address later. > > > +enum pl111_bo_type { > > > + PL111_BOT_DMA, > > > + PL111_BOT_SHM > > > +}; > > > + > > > +struct pl111_gem_bo_dma { > > > + dma_addr_t fb_dev_addr; > > > + void *fb_cpu_addr; > > > +}; > > > + > > > +struct pl111_gem_bo_shm { > > > + struct page **pages; > > > + dma_addr_t *dma_addrs; > > > +}; > > > + > > > +struct pl111_gem_bo { > > > + struct drm_gem_object gem_object; > > > + enum pl111_bo_type type; > > > + union { > > > + struct pl111_gem_bo_dma dma; > > > + struct pl111_gem_bo_shm shm; > > > + } backing_data; > > > + struct drm_framebuffer *fb; > > > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > > not the other way around. And the same bo could be referenced by > > multiple fb's which would kinda fall down with this approach. > > I'd say that's just backwards, framebuffers are created from backing > storage objects (which for a gem based driver is a gem object), not the > other way round. What's this exactly used for? Yup. > > > +static void vsync_worker(struct work_struct *work) > > > +{ > > > + struct pl111_drm_flip_resource *flip_res; > > > + struct pl111_gem_bo *bo; > > > + struct pl111_drm_crtc *pl111_crtc; > > > + struct drm_device *dev; > > > + int flips_in_flight; > > > + flip_res = > > > + container_of(work, struct pl111_drm_flip_resource, > > > + vsync_work); > > > + > > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > > + dev = pl111_crtc->crtc.dev; > > > + > > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > > + > > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > > + if (flip_res->worker_release_kds == true) { > > > + spin_lock(&pl111_crtc->current_displaying_lock); > > > + release_kds_resource_and_display(flip_res); > > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > > + } > > > +#endif > > > + /* Release DMA buffer on this flip */ > > > + if (bo->gem_object.export_dma_buf != NULL) > > > + dma_buf_put(bo->gem_object.export_dma_buf); > > > > I think you just want to unref the outgoing bo, and let it drop the > > dmabuf ref when the file ref of the imported bo goes. Or actually, > > it would be better to hold/drop ref's to the fb, rather than the bo. > > At least this will make things simpler if you ever have multi-planar > > support. > > Drivers have no business frobbing around the dma-buf refcount of > imported objects imo, at least if they use all the standard drm > prime infrastructure. And if they're bugs they need to be fixed > there, not in drivers. Good point. I guess the fb holds a ref on the bo and the bo holds a ref on the imported dma_buf. Don't know what this was for... > > > + BUG_ON(bo->type != PL111_BOT_DMA); > > > > umm, no BUG_ON() is not really a good way to validate userspace > > input.. > > Yep. :-D > > > + > > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > > + case 'Y': > > > + case 'U': > > > + case 'V': > > > + case 'N': > > > + case 'T': > > > > perhaps we should instead add a drm_format_is_yuv().. or you could > > (ab)use drm_fb_get_bpp_depth().. > > Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger > question is why you need this, since the drm core should filter out > formats not in your list of supported ones. Or at least it should ... Probably unnecessary belts & braces. I'll see if I can find some DRM test which tries to create an fb using a yuv format and see where, if anywhere, it gets rejected. Thanks again!! Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 16:46 ` Daniel Vetter ` (2 preceding siblings ...) (?) @ 2013-08-09 16:15 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 16:15 UTC (permalink / raw) To: linux-arm-kernel Hi Daniel, Rob. Thank you both for your reviews - greatly appreciated! > > > Known issues: > > > * It still includes code to use KDS, which is not going upstream. > > > > review's on <http://lists.freedesktop.org/archives/dri-devel/2013- > > July/042462.html> can't hurt > > > > although you might consider submitting a reduced functionality driver > > w/ KDS bits removed in the mean time.. then when the fence stuff is > > merged it is just an incremental patch rather than a whole driver ;-) > > Yeah, I think the KDS bits and comments need to go first before > merginge. Right, as I expected really. Though as I said we'll probably wait for fences to land and switch over to that before asking for it to be merged. A pl111 KMS driver with neither KDS nor implicit fences is useless to us. Having said that, if someone else would have a use for a fence/KDS-less pl111 KMS driver, please let me know! > > > +/* > > > + * Number of flips allowed in flight at any one time. Any more > > > + * flips requested beyond this value will cause the caller to > > > + * block until earlier flips have completed. > > > + * > > > + * For performance reasons, this must be greater than the number > > > + * of buffers used in the rendering pipeline. Note that the > > > + * rendering pipeline can contain different types of buffer, e.g.: > > > + * - 2 final framebuffers > > > + * - >2 geometry buffers for GPU use-cases > > > + * - >2 vertex buffers for GPU use-cases > > > + * > > > + * For example, a system using 5 geometry buffers could have 5 > > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be > > > + * 5 or greater. > > > + * > > > + * Whilst there may be more intermediate buffers (such as > > > + * vertex/geometry) than final framebuffers, KDS is used to > > > + * ensure that GPU rendering waits for the next off-screen > > > + * buffer, so it doesn't overwrite an on-screen buffer and > > > + * produce tearing. > > > + */ > > > + > > > > fwiw, this is at least different from how other drivers do triple > > (or > double) buffering. In other drivers (intel, omap, and > > msm/freedreno, that I know of, maybe others too) the xorg driver > > dri2 bits implement the double buffering (ie. send flip event back > > to client immediately and queue up the flip and call page-flip > > after the pageflip event back from kernel. > > > > I'm not saying not to do it this way, I guess I'd like to hear > > what other folks think. I kinda prefer doing this in userspace > > as it keeps the kernel bits simpler (plus it would then work > > properly on exynosdrm or other kms drivers). > > Yeah, if this is just a sw queue then I don't think it makes sense > to have it in the kernel. Afaik the current pageflip interface drm > exposes allows one oustanding flip only, and you _must_ wait for > the flip complete event before you can submit the second one. Right, I'll have a think about this. I think our idea was to issue enough page-flips into the kernel to make sure that any process scheduling latencies on a heavily loaded system don't cause us to miss a v_sync deadline. At the moment we issue the page flip from DRI2 schedule_swap. If we were to move that to the page flip event handler of the previous page-flip, we're potentially adding in extra latency. I.e. Currently we have: DRI2SwapBuffers - drm_mode_page_flip to buffer B DRI2SwapBuffers - drm_mode_page_flip to buffer A (gets queued in kernel) ... v_sync! (at this point buffer B is scanned out) - release buffer A's KDS resource/signal buffer A's fence - queued GPU job to render next frame to buffer A scheduled on HW ... GPU interrupt! (at this point buffer A is ready to be scanned out) - release buffer A's KDS resource/signal buffer A's fence - second page flip executed, buffer A's address written to scanout register, takes effect on next v_sync. So in the above, after X receives the second DRI2SwapBuffers, it doesn't need to get scheduled again for the next frame to be both rendered by the GPU and issued to the display for scanout. If we were to move to a user-space queue, I think we have something like this: DRI2SwapBuffers - drm_mode_page_flip to buffer B DRI2SwapBuffers - queue page flip to buffer A in DDX ... v_sync! (at this point buffer B is scanned out) - release buffer A's KDS resource/signal buffer A's fence - queued GPU job to render next frame to buffer A scheduled on HW - Send page flip event to X ... GPU interrupt! (at this point buffer A is ready to be scanned out) - Release buffer A's KDS resource/signal buffer A's fence - but nothing is waiting on it.... ... X gets scheduled, runs page flip handler - drm_mode_page_flip to buffer A - buffer A's address written to scanout register, takes effect on next v_sync. So here, X must get scheduled again after processing the second DRI2SwapBuffers in order to have the next frame displayed. This increases the likely-hood that we're not able to write the address of buffer A to the display HW's scan-out buffer in time to catch the next v_sync, especially on a loaded system. Anyway, I think that's our rational for keeping the queue in kernel space, but I don't see there's much value in queuing more than 2 page flips in kernel space. > Ofc if your hardware as a hw-based flip queue (maybe even with frame > targets) that's a different matter, but currently we don't have a drm > interface to expose this. I'd say for merging the basic driver first we > should go with the existing simple pageflip semantics. Sure - I think it would mean slightly increased jank, but probably something we can address later. > > > +enum pl111_bo_type { > > > + PL111_BOT_DMA, > > > + PL111_BOT_SHM > > > +}; > > > + > > > +struct pl111_gem_bo_dma { > > > + dma_addr_t fb_dev_addr; > > > + void *fb_cpu_addr; > > > +}; > > > + > > > +struct pl111_gem_bo_shm { > > > + struct page **pages; > > > + dma_addr_t *dma_addrs; > > > +}; > > > + > > > +struct pl111_gem_bo { > > > + struct drm_gem_object gem_object; > > > + enum pl111_bo_type type; > > > + union { > > > + struct pl111_gem_bo_dma dma; > > > + struct pl111_gem_bo_shm shm; > > > + } backing_data; > > > + struct drm_framebuffer *fb; > > > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > > not the other way around. And the same bo could be referenced by > > multiple fb's which would kinda fall down with this approach. > > I'd say that's just backwards, framebuffers are created from backing > storage objects (which for a gem based driver is a gem object), not the > other way round. What's this exactly used for? Yup. > > > +static void vsync_worker(struct work_struct *work) > > > +{ > > > + struct pl111_drm_flip_resource *flip_res; > > > + struct pl111_gem_bo *bo; > > > + struct pl111_drm_crtc *pl111_crtc; > > > + struct drm_device *dev; > > > + int flips_in_flight; > > > + flip_res = > > > + container_of(work, struct pl111_drm_flip_resource, > > > + vsync_work); > > > + > > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > > + dev = pl111_crtc->crtc.dev; > > > + > > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > > + > > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > > + if (flip_res->worker_release_kds == true) { > > > + spin_lock(&pl111_crtc->current_displaying_lock); > > > + release_kds_resource_and_display(flip_res); > > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > > + } > > > +#endif > > > + /* Release DMA buffer on this flip */ > > > + if (bo->gem_object.export_dma_buf != NULL) > > > + dma_buf_put(bo->gem_object.export_dma_buf); > > > > I think you just want to unref the outgoing bo, and let it drop the > > dmabuf ref when the file ref of the imported bo goes. Or actually, > > it would be better to hold/drop ref's to the fb, rather than the bo. > > At least this will make things simpler if you ever have multi-planar > > support. > > Drivers have no business frobbing around the dma-buf refcount of > imported objects imo, at least if they use all the standard drm > prime infrastructure. And if they're bugs they need to be fixed > there, not in drivers. Good point. I guess the fb holds a ref on the bo and the bo holds a ref on the imported dma_buf. Don't know what this was for... > > > + BUG_ON(bo->type != PL111_BOT_DMA); > > > > umm, no BUG_ON() is not really a good way to validate userspace > > input.. > > Yep. :-D > > > + > > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > > + case 'Y': > > > + case 'U': > > > + case 'V': > > > + case 'N': > > > + case 'T': > > > > perhaps we should instead add a drm_format_is_yuv().. or you could > > (ab)use drm_fb_get_bpp_depth().. > > Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger > question is why you need this, since the drm core should filter out > formats not in your list of supported ones. Or at least it should ... Probably unnecessary belts & braces. I'll see if I can find some DRM test which tries to create an fb using a yuv format and see where, if anywhere, it gets rejected. Thanks again!! Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 16:46 ` Daniel Vetter ` (3 preceding siblings ...) (?) @ 2013-08-09 16:15 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 16:15 UTC (permalink / raw) To: 'Daniel Vetter', Rob Clark Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel Hi Daniel, Rob. Thank you both for your reviews - greatly appreciated! > > > Known issues: > > > * It still includes code to use KDS, which is not going upstream. > > > > review's on <http://lists.freedesktop.org/archives/dri-devel/2013- > > July/042462.html> can't hurt > > > > although you might consider submitting a reduced functionality driver > > w/ KDS bits removed in the mean time.. then when the fence stuff is > > merged it is just an incremental patch rather than a whole driver ;-) > > Yeah, I think the KDS bits and comments need to go first before > merginge. Right, as I expected really. Though as I said we'll probably wait for fences to land and switch over to that before asking for it to be merged. A pl111 KMS driver with neither KDS nor implicit fences is useless to us. Having said that, if someone else would have a use for a fence/KDS-less pl111 KMS driver, please let me know! > > > +/* > > > + * Number of flips allowed in flight at any one time. Any more > > > + * flips requested beyond this value will cause the caller to > > > + * block until earlier flips have completed. > > > + * > > > + * For performance reasons, this must be greater than the number > > > + * of buffers used in the rendering pipeline. Note that the > > > + * rendering pipeline can contain different types of buffer, e.g.: > > > + * - 2 final framebuffers > > > + * - >2 geometry buffers for GPU use-cases > > > + * - >2 vertex buffers for GPU use-cases > > > + * > > > + * For example, a system using 5 geometry buffers could have 5 > > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be > > > + * 5 or greater. > > > + * > > > + * Whilst there may be more intermediate buffers (such as > > > + * vertex/geometry) than final framebuffers, KDS is used to > > > + * ensure that GPU rendering waits for the next off-screen > > > + * buffer, so it doesn't overwrite an on-screen buffer and > > > + * produce tearing. > > > + */ > > > + > > > > fwiw, this is at least different from how other drivers do triple > > (or > double) buffering. In other drivers (intel, omap, and > > msm/freedreno, that I know of, maybe others too) the xorg driver > > dri2 bits implement the double buffering (ie. send flip event back > > to client immediately and queue up the flip and call page-flip > > after the pageflip event back from kernel. > > > > I'm not saying not to do it this way, I guess I'd like to hear > > what other folks think. I kinda prefer doing this in userspace > > as it keeps the kernel bits simpler (plus it would then work > > properly on exynosdrm or other kms drivers). > > Yeah, if this is just a sw queue then I don't think it makes sense > to have it in the kernel. Afaik the current pageflip interface drm > exposes allows one oustanding flip only, and you _must_ wait for > the flip complete event before you can submit the second one. Right, I'll have a think about this. I think our idea was to issue enough page-flips into the kernel to make sure that any process scheduling latencies on a heavily loaded system don't cause us to miss a v_sync deadline. At the moment we issue the page flip from DRI2 schedule_swap. If we were to move that to the page flip event handler of the previous page-flip, we're potentially adding in extra latency. I.e. Currently we have: DRI2SwapBuffers - drm_mode_page_flip to buffer B DRI2SwapBuffers - drm_mode_page_flip to buffer A (gets queued in kernel) ... v_sync! (at this point buffer B is scanned out) - release buffer A's KDS resource/signal buffer A's fence - queued GPU job to render next frame to buffer A scheduled on HW ... GPU interrupt! (at this point buffer A is ready to be scanned out) - release buffer A's KDS resource/signal buffer A's fence - second page flip executed, buffer A's address written to scanout register, takes effect on next v_sync. So in the above, after X receives the second DRI2SwapBuffers, it doesn't need to get scheduled again for the next frame to be both rendered by the GPU and issued to the display for scanout. If we were to move to a user-space queue, I think we have something like this: DRI2SwapBuffers - drm_mode_page_flip to buffer B DRI2SwapBuffers - queue page flip to buffer A in DDX ... v_sync! (at this point buffer B is scanned out) - release buffer A's KDS resource/signal buffer A's fence - queued GPU job to render next frame to buffer A scheduled on HW - Send page flip event to X ... GPU interrupt! (at this point buffer A is ready to be scanned out) - Release buffer A's KDS resource/signal buffer A's fence - but nothing is waiting on it.... ... X gets scheduled, runs page flip handler - drm_mode_page_flip to buffer A - buffer A's address written to scanout register, takes effect on next v_sync. So here, X must get scheduled again after processing the second DRI2SwapBuffers in order to have the next frame displayed. This increases the likely-hood that we're not able to write the address of buffer A to the display HW's scan-out buffer in time to catch the next v_sync, especially on a loaded system. Anyway, I think that's our rational for keeping the queue in kernel space, but I don't see there's much value in queuing more than 2 page flips in kernel space. > Ofc if your hardware as a hw-based flip queue (maybe even with frame > targets) that's a different matter, but currently we don't have a drm > interface to expose this. I'd say for merging the basic driver first we > should go with the existing simple pageflip semantics. Sure - I think it would mean slightly increased jank, but probably something we can address later. > > > +enum pl111_bo_type { > > > + PL111_BOT_DMA, > > > + PL111_BOT_SHM > > > +}; > > > + > > > +struct pl111_gem_bo_dma { > > > + dma_addr_t fb_dev_addr; > > > + void *fb_cpu_addr; > > > +}; > > > + > > > +struct pl111_gem_bo_shm { > > > + struct page **pages; > > > + dma_addr_t *dma_addrs; > > > +}; > > > + > > > +struct pl111_gem_bo { > > > + struct drm_gem_object gem_object; > > > + enum pl111_bo_type type; > > > + union { > > > + struct pl111_gem_bo_dma dma; > > > + struct pl111_gem_bo_shm shm; > > > + } backing_data; > > > + struct drm_framebuffer *fb; > > > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > > not the other way around. And the same bo could be referenced by > > multiple fb's which would kinda fall down with this approach. > > I'd say that's just backwards, framebuffers are created from backing > storage objects (which for a gem based driver is a gem object), not the > other way round. What's this exactly used for? Yup. > > > +static void vsync_worker(struct work_struct *work) > > > +{ > > > + struct pl111_drm_flip_resource *flip_res; > > > + struct pl111_gem_bo *bo; > > > + struct pl111_drm_crtc *pl111_crtc; > > > + struct drm_device *dev; > > > + int flips_in_flight; > > > + flip_res = > > > + container_of(work, struct pl111_drm_flip_resource, > > > + vsync_work); > > > + > > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > > + dev = pl111_crtc->crtc.dev; > > > + > > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > > + > > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > > + if (flip_res->worker_release_kds == true) { > > > + spin_lock(&pl111_crtc->current_displaying_lock); > > > + release_kds_resource_and_display(flip_res); > > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > > + } > > > +#endif > > > + /* Release DMA buffer on this flip */ > > > + if (bo->gem_object.export_dma_buf != NULL) > > > + dma_buf_put(bo->gem_object.export_dma_buf); > > > > I think you just want to unref the outgoing bo, and let it drop the > > dmabuf ref when the file ref of the imported bo goes. Or actually, > > it would be better to hold/drop ref's to the fb, rather than the bo. > > At least this will make things simpler if you ever have multi-planar > > support. > > Drivers have no business frobbing around the dma-buf refcount of > imported objects imo, at least if they use all the standard drm > prime infrastructure. And if they're bugs they need to be fixed > there, not in drivers. Good point. I guess the fb holds a ref on the bo and the bo holds a ref on the imported dma_buf. Don't know what this was for... > > > + BUG_ON(bo->type != PL111_BOT_DMA); > > > > umm, no BUG_ON() is not really a good way to validate userspace > > input.. > > Yep. :-D > > > + > > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > > + case 'Y': > > > + case 'U': > > > + case 'V': > > > + case 'N': > > > + case 'T': > > > > perhaps we should instead add a drm_format_is_yuv().. or you could > > (ab)use drm_fb_get_bpp_depth().. > > Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger > question is why you need this, since the drm core should filter out > formats not in your list of supported ones. Or at least it should ... Probably unnecessary belts & braces. I'll see if I can find some DRM test which tries to create an fb using a yuv format and see where, if anywhere, it gets rejected. Thanks again!! Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <520515b0.88b70e0a.3ecd.1004SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <520515b0.88b70e0a.3ecd.1004SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-09 16:34 ` Rob Clark @ 2013-08-09 16:34 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-09 16:34 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, Daniel Vetter On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Daniel, Rob. > > Thank you both for your reviews - greatly appreciated! > >> > > Known issues: >> > > * It still includes code to use KDS, which is not going upstream. >> > >> > review's on <http://lists.freedesktop.org/archives/dri-devel/2013- >> > July/042462.html> can't hurt >> > >> > although you might consider submitting a reduced functionality driver >> > w/ KDS bits removed in the mean time.. then when the fence stuff is >> > merged it is just an incremental patch rather than a whole driver ;-) >> >> Yeah, I think the KDS bits and comments need to go first before >> merginge. > > Right, as I expected really. Though as I said we'll probably wait for > fences to land and switch over to that before asking for it to be > merged. A pl111 KMS driver with neither KDS nor implicit fences is > useless to us. Having said that, if someone else would have a use for > a fence/KDS-less pl111 KMS driver, please let me know! > well, it would make it easier to review the patches adding fence support if it was on top of basic KMS support. So there still is some benefit to a fence-less pl111, even if it is just for purposes of git history and patch review ;-) > >> > > +/* >> > > + * Number of flips allowed in flight at any one time. Any more >> > > + * flips requested beyond this value will cause the caller to >> > > + * block until earlier flips have completed. >> > > + * >> > > + * For performance reasons, this must be greater than the number >> > > + * of buffers used in the rendering pipeline. Note that the >> > > + * rendering pipeline can contain different types of buffer, e.g.: >> > > + * - 2 final framebuffers >> > > + * - >2 geometry buffers for GPU use-cases >> > > + * - >2 vertex buffers for GPU use-cases >> > > + * >> > > + * For example, a system using 5 geometry buffers could have 5 >> > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be >> > > + * 5 or greater. >> > > + * >> > > + * Whilst there may be more intermediate buffers (such as >> > > + * vertex/geometry) than final framebuffers, KDS is used to >> > > + * ensure that GPU rendering waits for the next off-screen >> > > + * buffer, so it doesn't overwrite an on-screen buffer and >> > > + * produce tearing. >> > > + */ >> > > + >> > >> > fwiw, this is at least different from how other drivers do triple >> > (or > double) buffering. In other drivers (intel, omap, and >> > msm/freedreno, that I know of, maybe others too) the xorg driver >> > dri2 bits implement the double buffering (ie. send flip event back >> > to client immediately and queue up the flip and call page-flip >> > after the pageflip event back from kernel. >> > >> > I'm not saying not to do it this way, I guess I'd like to hear >> > what other folks think. I kinda prefer doing this in userspace >> > as it keeps the kernel bits simpler (plus it would then work >> > properly on exynosdrm or other kms drivers). >> >> Yeah, if this is just a sw queue then I don't think it makes sense >> to have it in the kernel. Afaik the current pageflip interface drm >> exposes allows one oustanding flip only, and you _must_ wait for >> the flip complete event before you can submit the second one. > > Right, I'll have a think about this. I think our idea was to issue > enough page-flips into the kernel to make sure that any process > scheduling latencies on a heavily loaded system don't cause us to > miss a v_sync deadline. At the moment we issue the page flip from DRI2 > schedule_swap. If we were to move that to the page flip event handler > of the previous page-flip, we're potentially adding in extra latency. > > I.e. Currently we have: > > DRI2SwapBuffers > - drm_mode_page_flip to buffer B > DRI2SwapBuffers > - drm_mode_page_flip to buffer A (gets queued in kernel) > ... > v_sync! (at this point buffer B is scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - queued GPU job to render next frame to buffer A scheduled on HW > ... > GPU interrupt! (at this point buffer A is ready to be scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - second page flip executed, buffer A's address written to scanout > register, takes effect on next v_sync. > > > So in the above, after X receives the second DRI2SwapBuffers, it > doesn't need to get scheduled again for the next frame to be both > rendered by the GPU and issued to the display for scanout. well, this is really only an issue if you are so loaded that you don't get a chance to schedule for ~16ms.. which is pretty long time. If you are triple buffering, it should not end up in the critical path (since the gpu already has the 3rd buffer to start on the next frame). And, well, if you do it all in the kernel you probably need to toss things over to a workqueue anyways. > > > If we were to move to a user-space queue, I think we have something > like this: > > DRI2SwapBuffers > - drm_mode_page_flip to buffer B > DRI2SwapBuffers > - queue page flip to buffer A in DDX > ... > v_sync! (at this point buffer B is scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - queued GPU job to render next frame to buffer A scheduled on HW > - Send page flip event to X > ... > GPU interrupt! (at this point buffer A is ready to be scanned out) > - Release buffer A's KDS resource/signal buffer A's fence - but nothing > is waiting on it.... > ... > X gets scheduled, runs page flip handler > - drm_mode_page_flip to buffer A > - buffer A's address written to scanout register, takes effect on > next v_sync. > > > So here, X must get scheduled again after processing the second > DRI2SwapBuffers in order to have the next frame displayed. This > increases the likely-hood that we're not able to write the address of > buffer A to the display HW's scan-out buffer in time to catch the next > v_sync, especially on a loaded system. > > Anyway, I think that's our rational for keeping the queue in kernel > space, but I don't see there's much value in queuing more than 2 page > flips in kernel space. > >> Ofc if your hardware as a hw-based flip queue (maybe even with frame >> targets) that's a different matter, but currently we don't have a drm >> interface to expose this. I'd say for merging the basic driver first we >> should go with the existing simple pageflip semantics. > > Sure - I think it would mean slightly increased jank, but probably > something we can address later. > > >> > > +enum pl111_bo_type { >> > > + PL111_BOT_DMA, >> > > + PL111_BOT_SHM >> > > +}; >> > > + >> > > +struct pl111_gem_bo_dma { >> > > + dma_addr_t fb_dev_addr; >> > > + void *fb_cpu_addr; >> > > +}; >> > > + >> > > +struct pl111_gem_bo_shm { >> > > + struct page **pages; >> > > + dma_addr_t *dma_addrs; >> > > +}; >> > > + >> > > +struct pl111_gem_bo { >> > > + struct drm_gem_object gem_object; >> > > + enum pl111_bo_type type; >> > > + union { >> > > + struct pl111_gem_bo_dma dma; >> > > + struct pl111_gem_bo_shm shm; >> > > + } backing_data; >> > > + struct drm_framebuffer *fb; >> > >> > this is at least a bit odd.. normally the fb has ref to the bo(s) and >> > not the other way around. And the same bo could be referenced by >> > multiple fb's which would kinda fall down with this approach. >> >> I'd say that's just backwards, framebuffers are created from backing >> storage objects (which for a gem based driver is a gem object), not the >> other way round. What's this exactly used for? > > Yup. > > >> > > +static void vsync_worker(struct work_struct *work) >> > > +{ >> > > + struct pl111_drm_flip_resource *flip_res; >> > > + struct pl111_gem_bo *bo; >> > > + struct pl111_drm_crtc *pl111_crtc; >> > > + struct drm_device *dev; >> > > + int flips_in_flight; >> > > + flip_res >> > > + container_of(work, struct pl111_drm_flip_resource, >> > > + vsync_work); >> > > + >> > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); >> > > + dev = pl111_crtc->crtc.dev; >> > > + >> > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); >> > > + >> > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); >> > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS >> > > + if (flip_res->worker_release_kds = true) { >> > > + spin_lock(&pl111_crtc->current_displaying_lock); >> > > + release_kds_resource_and_display(flip_res); >> > > + spin_unlock(&pl111_crtc->current_displaying_lock); >> > > + } >> > > +#endif >> > > + /* Release DMA buffer on this flip */ >> > > + if (bo->gem_object.export_dma_buf != NULL) >> > > + dma_buf_put(bo->gem_object.export_dma_buf); >> > >> > I think you just want to unref the outgoing bo, and let it drop the >> > dmabuf ref when the file ref of the imported bo goes. Or actually, >> > it would be better to hold/drop ref's to the fb, rather than the bo. >> > At least this will make things simpler if you ever have multi-planar >> > support. >> >> Drivers have no business frobbing around the dma-buf refcount of >> imported objects imo, at least if they use all the standard drm >> prime infrastructure. And if they're bugs they need to be fixed >> there, not in drivers. > > Good point. I guess the fb holds a ref on the bo and the bo holds a > ref on the imported dma_buf. Don't know what this was for... > > >> > > + BUG_ON(bo->type != PL111_BOT_DMA); >> > >> > umm, no BUG_ON() is not really a good way to validate userspace >> > input.. >> >> Yep. > > :-D > > >> > > + >> > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { >> > > + case 'Y': >> > > + case 'U': >> > > + case 'V': >> > > + case 'N': >> > > + case 'T': >> > >> > perhaps we should instead add a drm_format_is_yuv().. or you could >> > (ab)use drm_fb_get_bpp_depth().. >> >> Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger >> question is why you need this, since the drm core should filter out >> formats not in your list of supported ones. Or at least it should ... > > Probably unnecessary belts & braces. I'll see if I can find some DRM > test which tries to create an fb using a yuv format and see where, if > anywhere, it gets rejected. > fwiw, it should fail, not when you create the (at least for most drivers) but when you try to attach it to a plane or crtc. If we had primary plane's in drm core, we could do a bit better error checking in the core and bail out on fb creation for a format that no crtc/plane could scanout. BR, -R > > Thanks again!! > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-09 16:34 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-09 16:34 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, Daniel Vetter On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Daniel, Rob. > > Thank you both for your reviews - greatly appreciated! > >> > > Known issues: >> > > * It still includes code to use KDS, which is not going upstream. >> > >> > review's on <http://lists.freedesktop.org/archives/dri-devel/2013- >> > July/042462.html> can't hurt >> > >> > although you might consider submitting a reduced functionality driver >> > w/ KDS bits removed in the mean time.. then when the fence stuff is >> > merged it is just an incremental patch rather than a whole driver ;-) >> >> Yeah, I think the KDS bits and comments need to go first before >> merginge. > > Right, as I expected really. Though as I said we'll probably wait for > fences to land and switch over to that before asking for it to be > merged. A pl111 KMS driver with neither KDS nor implicit fences is > useless to us. Having said that, if someone else would have a use for > a fence/KDS-less pl111 KMS driver, please let me know! > well, it would make it easier to review the patches adding fence support if it was on top of basic KMS support. So there still is some benefit to a fence-less pl111, even if it is just for purposes of git history and patch review ;-) > >> > > +/* >> > > + * Number of flips allowed in flight at any one time. Any more >> > > + * flips requested beyond this value will cause the caller to >> > > + * block until earlier flips have completed. >> > > + * >> > > + * For performance reasons, this must be greater than the number >> > > + * of buffers used in the rendering pipeline. Note that the >> > > + * rendering pipeline can contain different types of buffer, e.g.: >> > > + * - 2 final framebuffers >> > > + * - >2 geometry buffers for GPU use-cases >> > > + * - >2 vertex buffers for GPU use-cases >> > > + * >> > > + * For example, a system using 5 geometry buffers could have 5 >> > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be >> > > + * 5 or greater. >> > > + * >> > > + * Whilst there may be more intermediate buffers (such as >> > > + * vertex/geometry) than final framebuffers, KDS is used to >> > > + * ensure that GPU rendering waits for the next off-screen >> > > + * buffer, so it doesn't overwrite an on-screen buffer and >> > > + * produce tearing. >> > > + */ >> > > + >> > >> > fwiw, this is at least different from how other drivers do triple >> > (or > double) buffering. In other drivers (intel, omap, and >> > msm/freedreno, that I know of, maybe others too) the xorg driver >> > dri2 bits implement the double buffering (ie. send flip event back >> > to client immediately and queue up the flip and call page-flip >> > after the pageflip event back from kernel. >> > >> > I'm not saying not to do it this way, I guess I'd like to hear >> > what other folks think. I kinda prefer doing this in userspace >> > as it keeps the kernel bits simpler (plus it would then work >> > properly on exynosdrm or other kms drivers). >> >> Yeah, if this is just a sw queue then I don't think it makes sense >> to have it in the kernel. Afaik the current pageflip interface drm >> exposes allows one oustanding flip only, and you _must_ wait for >> the flip complete event before you can submit the second one. > > Right, I'll have a think about this. I think our idea was to issue > enough page-flips into the kernel to make sure that any process > scheduling latencies on a heavily loaded system don't cause us to > miss a v_sync deadline. At the moment we issue the page flip from DRI2 > schedule_swap. If we were to move that to the page flip event handler > of the previous page-flip, we're potentially adding in extra latency. > > I.e. Currently we have: > > DRI2SwapBuffers > - drm_mode_page_flip to buffer B > DRI2SwapBuffers > - drm_mode_page_flip to buffer A (gets queued in kernel) > ... > v_sync! (at this point buffer B is scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - queued GPU job to render next frame to buffer A scheduled on HW > ... > GPU interrupt! (at this point buffer A is ready to be scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - second page flip executed, buffer A's address written to scanout > register, takes effect on next v_sync. > > > So in the above, after X receives the second DRI2SwapBuffers, it > doesn't need to get scheduled again for the next frame to be both > rendered by the GPU and issued to the display for scanout. well, this is really only an issue if you are so loaded that you don't get a chance to schedule for ~16ms.. which is pretty long time. If you are triple buffering, it should not end up in the critical path (since the gpu already has the 3rd buffer to start on the next frame). And, well, if you do it all in the kernel you probably need to toss things over to a workqueue anyways. > > > If we were to move to a user-space queue, I think we have something > like this: > > DRI2SwapBuffers > - drm_mode_page_flip to buffer B > DRI2SwapBuffers > - queue page flip to buffer A in DDX > ... > v_sync! (at this point buffer B is scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - queued GPU job to render next frame to buffer A scheduled on HW > - Send page flip event to X > ... > GPU interrupt! (at this point buffer A is ready to be scanned out) > - Release buffer A's KDS resource/signal buffer A's fence - but nothing > is waiting on it.... > ... > X gets scheduled, runs page flip handler > - drm_mode_page_flip to buffer A > - buffer A's address written to scanout register, takes effect on > next v_sync. > > > So here, X must get scheduled again after processing the second > DRI2SwapBuffers in order to have the next frame displayed. This > increases the likely-hood that we're not able to write the address of > buffer A to the display HW's scan-out buffer in time to catch the next > v_sync, especially on a loaded system. > > Anyway, I think that's our rational for keeping the queue in kernel > space, but I don't see there's much value in queuing more than 2 page > flips in kernel space. > >> Ofc if your hardware as a hw-based flip queue (maybe even with frame >> targets) that's a different matter, but currently we don't have a drm >> interface to expose this. I'd say for merging the basic driver first we >> should go with the existing simple pageflip semantics. > > Sure - I think it would mean slightly increased jank, but probably > something we can address later. > > >> > > +enum pl111_bo_type { >> > > + PL111_BOT_DMA, >> > > + PL111_BOT_SHM >> > > +}; >> > > + >> > > +struct pl111_gem_bo_dma { >> > > + dma_addr_t fb_dev_addr; >> > > + void *fb_cpu_addr; >> > > +}; >> > > + >> > > +struct pl111_gem_bo_shm { >> > > + struct page **pages; >> > > + dma_addr_t *dma_addrs; >> > > +}; >> > > + >> > > +struct pl111_gem_bo { >> > > + struct drm_gem_object gem_object; >> > > + enum pl111_bo_type type; >> > > + union { >> > > + struct pl111_gem_bo_dma dma; >> > > + struct pl111_gem_bo_shm shm; >> > > + } backing_data; >> > > + struct drm_framebuffer *fb; >> > >> > this is at least a bit odd.. normally the fb has ref to the bo(s) and >> > not the other way around. And the same bo could be referenced by >> > multiple fb's which would kinda fall down with this approach. >> >> I'd say that's just backwards, framebuffers are created from backing >> storage objects (which for a gem based driver is a gem object), not the >> other way round. What's this exactly used for? > > Yup. > > >> > > +static void vsync_worker(struct work_struct *work) >> > > +{ >> > > + struct pl111_drm_flip_resource *flip_res; >> > > + struct pl111_gem_bo *bo; >> > > + struct pl111_drm_crtc *pl111_crtc; >> > > + struct drm_device *dev; >> > > + int flips_in_flight; >> > > + flip_res = >> > > + container_of(work, struct pl111_drm_flip_resource, >> > > + vsync_work); >> > > + >> > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); >> > > + dev = pl111_crtc->crtc.dev; >> > > + >> > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); >> > > + >> > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); >> > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS >> > > + if (flip_res->worker_release_kds == true) { >> > > + spin_lock(&pl111_crtc->current_displaying_lock); >> > > + release_kds_resource_and_display(flip_res); >> > > + spin_unlock(&pl111_crtc->current_displaying_lock); >> > > + } >> > > +#endif >> > > + /* Release DMA buffer on this flip */ >> > > + if (bo->gem_object.export_dma_buf != NULL) >> > > + dma_buf_put(bo->gem_object.export_dma_buf); >> > >> > I think you just want to unref the outgoing bo, and let it drop the >> > dmabuf ref when the file ref of the imported bo goes. Or actually, >> > it would be better to hold/drop ref's to the fb, rather than the bo. >> > At least this will make things simpler if you ever have multi-planar >> > support. >> >> Drivers have no business frobbing around the dma-buf refcount of >> imported objects imo, at least if they use all the standard drm >> prime infrastructure. And if they're bugs they need to be fixed >> there, not in drivers. > > Good point. I guess the fb holds a ref on the bo and the bo holds a > ref on the imported dma_buf. Don't know what this was for... > > >> > > + BUG_ON(bo->type != PL111_BOT_DMA); >> > >> > umm, no BUG_ON() is not really a good way to validate userspace >> > input.. >> >> Yep. > > :-D > > >> > > + >> > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { >> > > + case 'Y': >> > > + case 'U': >> > > + case 'V': >> > > + case 'N': >> > > + case 'T': >> > >> > perhaps we should instead add a drm_format_is_yuv().. or you could >> > (ab)use drm_fb_get_bpp_depth().. >> >> Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger >> question is why you need this, since the drm core should filter out >> formats not in your list of supported ones. Or at least it should ... > > Probably unnecessary belts & braces. I'll see if I can find some DRM > test which tries to create an fb using a yuv format and see where, if > anywhere, it gets rejected. > fwiw, it should fail, not when you create the (at least for most drivers) but when you try to attach it to a plane or crtc. If we had primary plane's in drm core, we could do a bit better error checking in the core and bail out on fb creation for a format that no crtc/plane could scanout. BR, -R > > Thanks again!! > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-09 16:34 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-09 16:34 UTC (permalink / raw) To: linux-arm-kernel On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Daniel, Rob. > > Thank you both for your reviews - greatly appreciated! > >> > > Known issues: >> > > * It still includes code to use KDS, which is not going upstream. >> > >> > review's on <http://lists.freedesktop.org/archives/dri-devel/2013- >> > July/042462.html> can't hurt >> > >> > although you might consider submitting a reduced functionality driver >> > w/ KDS bits removed in the mean time.. then when the fence stuff is >> > merged it is just an incremental patch rather than a whole driver ;-) >> >> Yeah, I think the KDS bits and comments need to go first before >> merginge. > > Right, as I expected really. Though as I said we'll probably wait for > fences to land and switch over to that before asking for it to be > merged. A pl111 KMS driver with neither KDS nor implicit fences is > useless to us. Having said that, if someone else would have a use for > a fence/KDS-less pl111 KMS driver, please let me know! > well, it would make it easier to review the patches adding fence support if it was on top of basic KMS support. So there still is some benefit to a fence-less pl111, even if it is just for purposes of git history and patch review ;-) > >> > > +/* >> > > + * Number of flips allowed in flight at any one time. Any more >> > > + * flips requested beyond this value will cause the caller to >> > > + * block until earlier flips have completed. >> > > + * >> > > + * For performance reasons, this must be greater than the number >> > > + * of buffers used in the rendering pipeline. Note that the >> > > + * rendering pipeline can contain different types of buffer, e.g.: >> > > + * - 2 final framebuffers >> > > + * - >2 geometry buffers for GPU use-cases >> > > + * - >2 vertex buffers for GPU use-cases >> > > + * >> > > + * For example, a system using 5 geometry buffers could have 5 >> > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be >> > > + * 5 or greater. >> > > + * >> > > + * Whilst there may be more intermediate buffers (such as >> > > + * vertex/geometry) than final framebuffers, KDS is used to >> > > + * ensure that GPU rendering waits for the next off-screen >> > > + * buffer, so it doesn't overwrite an on-screen buffer and >> > > + * produce tearing. >> > > + */ >> > > + >> > >> > fwiw, this is at least different from how other drivers do triple >> > (or > double) buffering. In other drivers (intel, omap, and >> > msm/freedreno, that I know of, maybe others too) the xorg driver >> > dri2 bits implement the double buffering (ie. send flip event back >> > to client immediately and queue up the flip and call page-flip >> > after the pageflip event back from kernel. >> > >> > I'm not saying not to do it this way, I guess I'd like to hear >> > what other folks think. I kinda prefer doing this in userspace >> > as it keeps the kernel bits simpler (plus it would then work >> > properly on exynosdrm or other kms drivers). >> >> Yeah, if this is just a sw queue then I don't think it makes sense >> to have it in the kernel. Afaik the current pageflip interface drm >> exposes allows one oustanding flip only, and you _must_ wait for >> the flip complete event before you can submit the second one. > > Right, I'll have a think about this. I think our idea was to issue > enough page-flips into the kernel to make sure that any process > scheduling latencies on a heavily loaded system don't cause us to > miss a v_sync deadline. At the moment we issue the page flip from DRI2 > schedule_swap. If we were to move that to the page flip event handler > of the previous page-flip, we're potentially adding in extra latency. > > I.e. Currently we have: > > DRI2SwapBuffers > - drm_mode_page_flip to buffer B > DRI2SwapBuffers > - drm_mode_page_flip to buffer A (gets queued in kernel) > ... > v_sync! (at this point buffer B is scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - queued GPU job to render next frame to buffer A scheduled on HW > ... > GPU interrupt! (at this point buffer A is ready to be scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - second page flip executed, buffer A's address written to scanout > register, takes effect on next v_sync. > > > So in the above, after X receives the second DRI2SwapBuffers, it > doesn't need to get scheduled again for the next frame to be both > rendered by the GPU and issued to the display for scanout. well, this is really only an issue if you are so loaded that you don't get a chance to schedule for ~16ms.. which is pretty long time. If you are triple buffering, it should not end up in the critical path (since the gpu already has the 3rd buffer to start on the next frame). And, well, if you do it all in the kernel you probably need to toss things over to a workqueue anyways. > > > If we were to move to a user-space queue, I think we have something > like this: > > DRI2SwapBuffers > - drm_mode_page_flip to buffer B > DRI2SwapBuffers > - queue page flip to buffer A in DDX > ... > v_sync! (at this point buffer B is scanned out) > - release buffer A's KDS resource/signal buffer A's fence > - queued GPU job to render next frame to buffer A scheduled on HW > - Send page flip event to X > ... > GPU interrupt! (at this point buffer A is ready to be scanned out) > - Release buffer A's KDS resource/signal buffer A's fence - but nothing > is waiting on it.... > ... > X gets scheduled, runs page flip handler > - drm_mode_page_flip to buffer A > - buffer A's address written to scanout register, takes effect on > next v_sync. > > > So here, X must get scheduled again after processing the second > DRI2SwapBuffers in order to have the next frame displayed. This > increases the likely-hood that we're not able to write the address of > buffer A to the display HW's scan-out buffer in time to catch the next > v_sync, especially on a loaded system. > > Anyway, I think that's our rational for keeping the queue in kernel > space, but I don't see there's much value in queuing more than 2 page > flips in kernel space. > >> Ofc if your hardware as a hw-based flip queue (maybe even with frame >> targets) that's a different matter, but currently we don't have a drm >> interface to expose this. I'd say for merging the basic driver first we >> should go with the existing simple pageflip semantics. > > Sure - I think it would mean slightly increased jank, but probably > something we can address later. > > >> > > +enum pl111_bo_type { >> > > + PL111_BOT_DMA, >> > > + PL111_BOT_SHM >> > > +}; >> > > + >> > > +struct pl111_gem_bo_dma { >> > > + dma_addr_t fb_dev_addr; >> > > + void *fb_cpu_addr; >> > > +}; >> > > + >> > > +struct pl111_gem_bo_shm { >> > > + struct page **pages; >> > > + dma_addr_t *dma_addrs; >> > > +}; >> > > + >> > > +struct pl111_gem_bo { >> > > + struct drm_gem_object gem_object; >> > > + enum pl111_bo_type type; >> > > + union { >> > > + struct pl111_gem_bo_dma dma; >> > > + struct pl111_gem_bo_shm shm; >> > > + } backing_data; >> > > + struct drm_framebuffer *fb; >> > >> > this is at least a bit odd.. normally the fb has ref to the bo(s) and >> > not the other way around. And the same bo could be referenced by >> > multiple fb's which would kinda fall down with this approach. >> >> I'd say that's just backwards, framebuffers are created from backing >> storage objects (which for a gem based driver is a gem object), not the >> other way round. What's this exactly used for? > > Yup. > > >> > > +static void vsync_worker(struct work_struct *work) >> > > +{ >> > > + struct pl111_drm_flip_resource *flip_res; >> > > + struct pl111_gem_bo *bo; >> > > + struct pl111_drm_crtc *pl111_crtc; >> > > + struct drm_device *dev; >> > > + int flips_in_flight; >> > > + flip_res = >> > > + container_of(work, struct pl111_drm_flip_resource, >> > > + vsync_work); >> > > + >> > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); >> > > + dev = pl111_crtc->crtc.dev; >> > > + >> > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); >> > > + >> > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); >> > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS >> > > + if (flip_res->worker_release_kds == true) { >> > > + spin_lock(&pl111_crtc->current_displaying_lock); >> > > + release_kds_resource_and_display(flip_res); >> > > + spin_unlock(&pl111_crtc->current_displaying_lock); >> > > + } >> > > +#endif >> > > + /* Release DMA buffer on this flip */ >> > > + if (bo->gem_object.export_dma_buf != NULL) >> > > + dma_buf_put(bo->gem_object.export_dma_buf); >> > >> > I think you just want to unref the outgoing bo, and let it drop the >> > dmabuf ref when the file ref of the imported bo goes. Or actually, >> > it would be better to hold/drop ref's to the fb, rather than the bo. >> > At least this will make things simpler if you ever have multi-planar >> > support. >> >> Drivers have no business frobbing around the dma-buf refcount of >> imported objects imo, at least if they use all the standard drm >> prime infrastructure. And if they're bugs they need to be fixed >> there, not in drivers. > > Good point. I guess the fb holds a ref on the bo and the bo holds a > ref on the imported dma_buf. Don't know what this was for... > > >> > > + BUG_ON(bo->type != PL111_BOT_DMA); >> > >> > umm, no BUG_ON() is not really a good way to validate userspace >> > input.. >> >> Yep. > > :-D > > >> > > + >> > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { >> > > + case 'Y': >> > > + case 'U': >> > > + case 'V': >> > > + case 'N': >> > > + case 'T': >> > >> > perhaps we should instead add a drm_format_is_yuv().. or you could >> > (ab)use drm_fb_get_bpp_depth().. >> >> Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger >> question is why you need this, since the drm core should filter out >> formats not in your list of supported ones. Or at least it should ... > > Probably unnecessary belts & braces. I'll see if I can find some DRM > test which tries to create an fb using a yuv format and see where, if > anywhere, it gets rejected. > fwiw, it should fail, not when you create the (at least for most drivers) but when you try to attach it to a plane or crtc. If we had primary plane's in drm core, we could do a bit better error checking in the core and bail out on fb creation for a format that no crtc/plane could scanout. BR, -R > > Thanks again!! > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 16:34 ` Rob Clark (?) @ 2013-08-09 16:57 ` Daniel Vetter -1 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-09 16:57 UTC (permalink / raw) To: Rob Clark; +Cc: linux-fbdev, Pawel Moll, dri-devel, linux-arm-kernel On Fri, Aug 09, 2013 at 12:34:55PM -0400, Rob Clark wrote: > On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > fwiw, this is at least different from how other drivers do triple > >> > (or > double) buffering. In other drivers (intel, omap, and > >> > msm/freedreno, that I know of, maybe others too) the xorg driver > >> > dri2 bits implement the double buffering (ie. send flip event back > >> > to client immediately and queue up the flip and call page-flip > >> > after the pageflip event back from kernel. > >> > > >> > I'm not saying not to do it this way, I guess I'd like to hear > >> > what other folks think. I kinda prefer doing this in userspace > >> > as it keeps the kernel bits simpler (plus it would then work > >> > properly on exynosdrm or other kms drivers). > >> > >> Yeah, if this is just a sw queue then I don't think it makes sense > >> to have it in the kernel. Afaik the current pageflip interface drm > >> exposes allows one oustanding flip only, and you _must_ wait for > >> the flip complete event before you can submit the second one. > > > > Right, I'll have a think about this. I think our idea was to issue > > enough page-flips into the kernel to make sure that any process > > scheduling latencies on a heavily loaded system don't cause us to > > miss a v_sync deadline. At the moment we issue the page flip from DRI2 > > schedule_swap. If we were to move that to the page flip event handler > > of the previous page-flip, we're potentially adding in extra latency. > > > > I.e. Currently we have: > > > > DRI2SwapBuffers > > - drm_mode_page_flip to buffer B > > DRI2SwapBuffers > > - drm_mode_page_flip to buffer A (gets queued in kernel) > > ... > > v_sync! (at this point buffer B is scanned out) > > - release buffer A's KDS resource/signal buffer A's fence > > - queued GPU job to render next frame to buffer A scheduled on HW > > ... > > GPU interrupt! (at this point buffer A is ready to be scanned out) > > - release buffer A's KDS resource/signal buffer A's fence > > - second page flip executed, buffer A's address written to scanout > > register, takes effect on next v_sync. > > > > > > So in the above, after X receives the second DRI2SwapBuffers, it > > doesn't need to get scheduled again for the next frame to be both > > rendered by the GPU and issued to the display for scanout. > > well, this is really only an issue if you are so loaded that you don't > get a chance to schedule for ~16ms.. which is pretty long time. If > you are triple buffering, it should not end up in the critical path > (since the gpu already has the 3rd buffer to start on the next frame). > And, well, if you do it all in the kernel you probably need to toss > things over to a workqueue anyways. Just a quick comment on the kernel flip queue issue. 16 ms scheduling latency sounds awful but totally doable with a less than stellar ddx driver going into limbo land and so preventing your single threaded X from doing more useful stuff. Is this really the linux scheduler being stupid? At least my impression was that the hw/kernel flip queue is to save power so that you can queue up a few frames and everything goes to sleep for half a second or so (at 24fps or whatever movie your showing). Needing to schedule 5 frames ahead with pageflips under load is just guaranteed to result in really horrible interactivity and so awful user experience ... -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-09 16:57 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-09 16:57 UTC (permalink / raw) To: Rob Clark; +Cc: linux-fbdev, Pawel Moll, dri-devel, linux-arm-kernel On Fri, Aug 09, 2013 at 12:34:55PM -0400, Rob Clark wrote: > On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > fwiw, this is at least different from how other drivers do triple > >> > (or > double) buffering. In other drivers (intel, omap, and > >> > msm/freedreno, that I know of, maybe others too) the xorg driver > >> > dri2 bits implement the double buffering (ie. send flip event back > >> > to client immediately and queue up the flip and call page-flip > >> > after the pageflip event back from kernel. > >> > > >> > I'm not saying not to do it this way, I guess I'd like to hear > >> > what other folks think. I kinda prefer doing this in userspace > >> > as it keeps the kernel bits simpler (plus it would then work > >> > properly on exynosdrm or other kms drivers). > >> > >> Yeah, if this is just a sw queue then I don't think it makes sense > >> to have it in the kernel. Afaik the current pageflip interface drm > >> exposes allows one oustanding flip only, and you _must_ wait for > >> the flip complete event before you can submit the second one. > > > > Right, I'll have a think about this. I think our idea was to issue > > enough page-flips into the kernel to make sure that any process > > scheduling latencies on a heavily loaded system don't cause us to > > miss a v_sync deadline. At the moment we issue the page flip from DRI2 > > schedule_swap. If we were to move that to the page flip event handler > > of the previous page-flip, we're potentially adding in extra latency. > > > > I.e. Currently we have: > > > > DRI2SwapBuffers > > - drm_mode_page_flip to buffer B > > DRI2SwapBuffers > > - drm_mode_page_flip to buffer A (gets queued in kernel) > > ... > > v_sync! (at this point buffer B is scanned out) > > - release buffer A's KDS resource/signal buffer A's fence > > - queued GPU job to render next frame to buffer A scheduled on HW > > ... > > GPU interrupt! (at this point buffer A is ready to be scanned out) > > - release buffer A's KDS resource/signal buffer A's fence > > - second page flip executed, buffer A's address written to scanout > > register, takes effect on next v_sync. > > > > > > So in the above, after X receives the second DRI2SwapBuffers, it > > doesn't need to get scheduled again for the next frame to be both > > rendered by the GPU and issued to the display for scanout. > > well, this is really only an issue if you are so loaded that you don't > get a chance to schedule for ~16ms.. which is pretty long time. If > you are triple buffering, it should not end up in the critical path > (since the gpu already has the 3rd buffer to start on the next frame). > And, well, if you do it all in the kernel you probably need to toss > things over to a workqueue anyways. Just a quick comment on the kernel flip queue issue. 16 ms scheduling latency sounds awful but totally doable with a less than stellar ddx driver going into limbo land and so preventing your single threaded X from doing more useful stuff. Is this really the linux scheduler being stupid? At least my impression was that the hw/kernel flip queue is to save power so that you can queue up a few frames and everything goes to sleep for half a second or so (at 24fps or whatever movie your showing). Needing to schedule 5 frames ahead with pageflips under load is just guaranteed to result in really horrible interactivity and so awful user experience ... -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-09 16:57 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-09 16:57 UTC (permalink / raw) To: linux-arm-kernel On Fri, Aug 09, 2013 at 12:34:55PM -0400, Rob Clark wrote: > On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > fwiw, this is at least different from how other drivers do triple > >> > (or > double) buffering. In other drivers (intel, omap, and > >> > msm/freedreno, that I know of, maybe others too) the xorg driver > >> > dri2 bits implement the double buffering (ie. send flip event back > >> > to client immediately and queue up the flip and call page-flip > >> > after the pageflip event back from kernel. > >> > > >> > I'm not saying not to do it this way, I guess I'd like to hear > >> > what other folks think. I kinda prefer doing this in userspace > >> > as it keeps the kernel bits simpler (plus it would then work > >> > properly on exynosdrm or other kms drivers). > >> > >> Yeah, if this is just a sw queue then I don't think it makes sense > >> to have it in the kernel. Afaik the current pageflip interface drm > >> exposes allows one oustanding flip only, and you _must_ wait for > >> the flip complete event before you can submit the second one. > > > > Right, I'll have a think about this. I think our idea was to issue > > enough page-flips into the kernel to make sure that any process > > scheduling latencies on a heavily loaded system don't cause us to > > miss a v_sync deadline. At the moment we issue the page flip from DRI2 > > schedule_swap. If we were to move that to the page flip event handler > > of the previous page-flip, we're potentially adding in extra latency. > > > > I.e. Currently we have: > > > > DRI2SwapBuffers > > - drm_mode_page_flip to buffer B > > DRI2SwapBuffers > > - drm_mode_page_flip to buffer A (gets queued in kernel) > > ... > > v_sync! (at this point buffer B is scanned out) > > - release buffer A's KDS resource/signal buffer A's fence > > - queued GPU job to render next frame to buffer A scheduled on HW > > ... > > GPU interrupt! (at this point buffer A is ready to be scanned out) > > - release buffer A's KDS resource/signal buffer A's fence > > - second page flip executed, buffer A's address written to scanout > > register, takes effect on next v_sync. > > > > > > So in the above, after X receives the second DRI2SwapBuffers, it > > doesn't need to get scheduled again for the next frame to be both > > rendered by the GPU and issued to the display for scanout. > > well, this is really only an issue if you are so loaded that you don't > get a chance to schedule for ~16ms.. which is pretty long time. If > you are triple buffering, it should not end up in the critical path > (since the gpu already has the 3rd buffer to start on the next frame). > And, well, if you do it all in the kernel you probably need to toss > things over to a workqueue anyways. Just a quick comment on the kernel flip queue issue. 16 ms scheduling latency sounds awful but totally doable with a less than stellar ddx driver going into limbo land and so preventing your single threaded X from doing more useful stuff. Is this really the linux scheduler being stupid? At least my impression was that the hw/kernel flip queue is to save power so that you can queue up a few frames and everything goes to sleep for half a second or so (at 24fps or whatever movie your showing). Needing to schedule 5 frames ahead with pageflips under load is just guaranteed to result in really horrible interactivity and so awful user experience ... -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 16:57 ` Daniel Vetter (?) (?) @ 2013-08-09 17:31 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey ` (2 more replies) -1 siblings, 3 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 17:31 UTC (permalink / raw) To: linux-arm-kernel > > > So in the above, after X receives the second DRI2SwapBuffers, it > > > doesn't need to get scheduled again for the next frame to be both > > > rendered by the GPU and issued to the display for scanout. > > > > well, this is really only an issue if you are so loaded that you > > don't get a chance to schedule for ~16ms.. which is pretty long time. Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? Hmmm, that does sound very long. Will try out some experiments and see. > > If you are triple buffering, it should not end up in the critical > > path (since the gpu already has the 3rd buffer to start on the next > > frame). And, well, if you do it all in the kernel you probably need > > to toss things over to a workqueue anyways. > > Just a quick comment on the kernel flip queue issue. > > 16 ms scheduling latency sounds awful but totally doable with a less > than stellar ddx driver going into limbo land and so preventing your > single threaded X from doing more useful stuff. Is this really the > linux scheduler being stupid? Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and so rely on pixman to perform all 2D operations which does indeed tie up that thread for fairly long periods of time. We've had internal discussions about introducing a thread (gulp) in the DDX to off-load drawing operations to. I think we were all a bit scared by that idea though. BTW: I wasn't suggesting it was the linux scheduler being stupid, just that there is sometimes lots of contention over the CPU cores and X is just one thread among many wanting to run. > At least my impression was that the hw/kernel flip queue is to save > power so that you can queue up a few frames and everything goes to > sleep for half a second or so (at 24fps or whatever movie your > showing). Needing to schedule 5 frames ahead with pageflips under > load is just guaranteed to result in really horrible interactivity > and so awful user experience Agreed. There's always a tradeoff between tolerance to variable frame rendering time/system latency (lot of buffers) and UI latency (few buffers). As a side note, video playback is one use-case for explicit sync objects which implicit/buffer-based sync doesn't handle: Queue up lots of video frames for display, but mark those "display buffer" operations as depending on explicit sync objects which get signalled by the audio clock. Not sure Android actually does that yet though. Anyway, off topic. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 17:31 ` Tom Cooksey @ 2013-08-13 14:35 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-13 14:35 UTC (permalink / raw) To: Tom Cooksey, 'Daniel Vetter', Rob Clark Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel > > > > So in the above, after X receives the second DRI2SwapBuffers, it > > > > doesn't need to get scheduled again for the next frame to be both > > > > rendered by the GPU and issued to the display for scanout. > > > > > > well, this is really only an issue if you are so loaded that you > > > don't get a chance to schedule for ~16ms.. which is pretty long > > > time. > > Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? > Hmmm, that does sound very long. Will try out some experiments and see. We're looking at moving the flip queue into the DDX driver, however it's not as straight-forward as I thought. With the current design, all rate-limiting happens on the client side. So even if you only have double buffering, using KDS you can queue up as many asynchronous GPU-render/scan-out pairs as you want. It's up to EGL in the client application to figure out there's a lot of frames in-flight and so should probably block the application's render thread in eglSwapBuffers to let the GPU and/or display catch up a bit. If we only allow a single outstanding page-flip job in DRM, there'd be a race if we returned a buffer to the client which had an outstanding page-flip queued up in the DDX: The client could issue a render job to the buffer just as the DDX processed the page-flip from the queue, making the scan-out block until the GPU rendered the next frame. It would also mean the previous frame would have been lost as it never got scanned out before the GPU rendered the next-next frame to it. So instead, I think we'll have to block (suspend?) a client in ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers has an outstanding page-flip in the user-space queue. We then wake the client up again _after_ we get the page-flip event for the previous page flip and have issued the page-flip to the next buffer to the DRM. That way the DRM display driver has already registered its intention to use the buffer with KDS before the client ever gets hold of it. Note: I say KDS here, but I assume the same issues will apply on any implicit buffer-based synchronization. I.e. dma-fence. It's not really a problem I don't think, but mention it to see if you can see a reason why the above wouldn't work before we go and implement it - it's a fairly big change to the DDX. Can you see any issues with it? PrepareAccess gets interesting... Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 17:31 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey @ 2013-08-13 14:35 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-13 14:35 UTC (permalink / raw) To: linux-arm-kernel > > > > So in the above, after X receives the second DRI2SwapBuffers, it > > > > doesn't need to get scheduled again for the next frame to be both > > > > rendered by the GPU and issued to the display for scanout. > > > > > > well, this is really only an issue if you are so loaded that you > > > don't get a chance to schedule for ~16ms.. which is pretty long > > > time. > > Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? > Hmmm, that does sound very long. Will try out some experiments and see. We're looking at moving the flip queue into the DDX driver, however it's not as straight-forward as I thought. With the current design, all rate-limiting happens on the client side. So even if you only have double buffering, using KDS you can queue up as many asynchronous GPU-render/scan-out pairs as you want. It's up to EGL in the client application to figure out there's a lot of frames in-flight and so should probably block the application's render thread in eglSwapBuffers to let the GPU and/or display catch up a bit. If we only allow a single outstanding page-flip job in DRM, there'd be a race if we returned a buffer to the client which had an outstanding page-flip queued up in the DDX: The client could issue a render job to the buffer just as the DDX processed the page-flip from the queue, making the scan-out block until the GPU rendered the next frame. It would also mean the previous frame would have been lost as it never got scanned out before the GPU rendered the next-next frame to it. So instead, I think we'll have to block (suspend?) a client in ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers has an outstanding page-flip in the user-space queue. We then wake the client up again _after_ we get the page-flip event for the previous page flip and have issued the page-flip to the next buffer to the DRM. That way the DRM display driver has already registered its intention to use the buffer with KDS before the client ever gets hold of it. Note: I say KDS here, but I assume the same issues will apply on any implicit buffer-based synchronization. I.e. dma-fence. It's not really a problem I don't think, but mention it to see if you can see a reason why the above wouldn't work before we go and implement it - it's a fairly big change to the DDX. Can you see any issues with it? PrepareAccess gets interesting... Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 17:31 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey @ 2013-08-13 14:35 ` Tom Cooksey 2 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-13 14:35 UTC (permalink / raw) To: Tom Cooksey, 'Daniel Vetter', Rob Clark Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel > > > > So in the above, after X receives the second DRI2SwapBuffers, it > > > > doesn't need to get scheduled again for the next frame to be both > > > > rendered by the GPU and issued to the display for scanout. > > > > > > well, this is really only an issue if you are so loaded that you > > > don't get a chance to schedule for ~16ms.. which is pretty long > > > time. > > Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? > Hmmm, that does sound very long. Will try out some experiments and see. We're looking at moving the flip queue into the DDX driver, however it's not as straight-forward as I thought. With the current design, all rate-limiting happens on the client side. So even if you only have double buffering, using KDS you can queue up as many asynchronous GPU-render/scan-out pairs as you want. It's up to EGL in the client application to figure out there's a lot of frames in-flight and so should probably block the application's render thread in eglSwapBuffers to let the GPU and/or display catch up a bit. If we only allow a single outstanding page-flip job in DRM, there'd be a race if we returned a buffer to the client which had an outstanding page-flip queued up in the DDX: The client could issue a render job to the buffer just as the DDX processed the page-flip from the queue, making the scan-out block until the GPU rendered the next frame. It would also mean the previous frame would have been lost as it never got scanned out before the GPU rendered the next-next frame to it. So instead, I think we'll have to block (suspend?) a client in ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers has an outstanding page-flip in the user-space queue. We then wake the client up again _after_ we get the page-flip event for the previous page flip and have issued the page-flip to the next buffer to the DRM. That way the DRM display driver has already registered its intention to use the buffer with KDS before the client ever gets hold of it. Note: I say KDS here, but I assume the same issues will apply on any implicit buffer-based synchronization. I.e. dma-fence. It's not really a problem I don't think, but mention it to see if you can see a reason why the above wouldn't work before we go and implement it - it's a fairly big change to the DDX. Can you see any issues with it? PrepareAccess gets interesting... Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 16:57 ` Daniel Vetter ` (2 preceding siblings ...) (?) @ 2013-08-09 17:31 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 17:31 UTC (permalink / raw) To: 'Daniel Vetter', Rob Clark Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel > > > So in the above, after X receives the second DRI2SwapBuffers, it > > > doesn't need to get scheduled again for the next frame to be both > > > rendered by the GPU and issued to the display for scanout. > > > > well, this is really only an issue if you are so loaded that you > > don't get a chance to schedule for ~16ms.. which is pretty long time. Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? Hmmm, that does sound very long. Will try out some experiments and see. > > If you are triple buffering, it should not end up in the critical > > path (since the gpu already has the 3rd buffer to start on the next > > frame). And, well, if you do it all in the kernel you probably need > > to toss things over to a workqueue anyways. > > Just a quick comment on the kernel flip queue issue. > > 16 ms scheduling latency sounds awful but totally doable with a less > than stellar ddx driver going into limbo land and so preventing your > single threaded X from doing more useful stuff. Is this really the > linux scheduler being stupid? Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and so rely on pixman to perform all 2D operations which does indeed tie up that thread for fairly long periods of time. We've had internal discussions about introducing a thread (gulp) in the DDX to off-load drawing operations to. I think we were all a bit scared by that idea though. BTW: I wasn't suggesting it was the linux scheduler being stupid, just that there is sometimes lots of contention over the CPU cores and X is just one thread among many wanting to run. > At least my impression was that the hw/kernel flip queue is to save > power so that you can queue up a few frames and everything goes to > sleep for half a second or so (at 24fps or whatever movie your > showing). Needing to schedule 5 frames ahead with pageflips under > load is just guaranteed to result in really horrible interactivity > and so awful user experience Agreed. There's always a tradeoff between tolerance to variable frame rendering time/system latency (lot of buffers) and UI latency (few buffers). As a side note, video playback is one use-case for explicit sync objects which implicit/buffer-based sync doesn't handle: Queue up lots of video frames for display, but mark those "display buffer" operations as depending on explicit sync objects which get signalled by the audio clock. Not sure Android actually does that yet though. Anyway, off topic. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 16:57 ` Daniel Vetter ` (3 preceding siblings ...) (?) @ 2013-08-09 17:31 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 17:31 UTC (permalink / raw) To: 'Daniel Vetter', Rob Clark Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel > > > So in the above, after X receives the second DRI2SwapBuffers, it > > > doesn't need to get scheduled again for the next frame to be both > > > rendered by the GPU and issued to the display for scanout. > > > > well, this is really only an issue if you are so loaded that you > > don't get a chance to schedule for ~16ms.. which is pretty long time. Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? Hmmm, that does sound very long. Will try out some experiments and see. > > If you are triple buffering, it should not end up in the critical > > path (since the gpu already has the 3rd buffer to start on the next > > frame). And, well, if you do it all in the kernel you probably need > > to toss things over to a workqueue anyways. > > Just a quick comment on the kernel flip queue issue. > > 16 ms scheduling latency sounds awful but totally doable with a less > than stellar ddx driver going into limbo land and so preventing your > single threaded X from doing more useful stuff. Is this really the > linux scheduler being stupid? Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and so rely on pixman to perform all 2D operations which does indeed tie up that thread for fairly long periods of time. We've had internal discussions about introducing a thread (gulp) in the DDX to off-load drawing operations to. I think we were all a bit scared by that idea though. BTW: I wasn't suggesting it was the linux scheduler being stupid, just that there is sometimes lots of contention over the CPU cores and X is just one thread among many wanting to run. > At least my impression was that the hw/kernel flip queue is to save > power so that you can queue up a few frames and everything goes to > sleep for half a second or so (at 24fps or whatever movie your > showing). Needing to schedule 5 frames ahead with pageflips under > load is just guaranteed to result in really horrible interactivity > and so awful user experience Agreed. There's always a tradeoff between tolerance to variable frame rendering time/system latency (lot of buffers) and UI latency (few buffers). As a side note, video playback is one use-case for explicit sync objects which implicit/buffer-based sync doesn't handle: Queue up lots of video frames for display, but mark those "display buffer" operations as depending on explicit sync objects which get signalled by the audio clock. Not sure Android actually does that yet though. Anyway, off topic. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <5205277e.84320f0a.1cdf.ffff8816SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <5205277e.84320f0a.1cdf.ffff8816SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-10 12:30 ` Rob Clark @ 2013-08-10 12:30 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-10 12:30 UTC (permalink / raw) To: Tom Cooksey; +Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel On Fri, Aug 9, 2013 at 1:31 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> > > So in the above, after X receives the second DRI2SwapBuffers, it >> > > doesn't need to get scheduled again for the next frame to be both >> > > rendered by the GPU and issued to the display for scanout. >> > >> > well, this is really only an issue if you are so loaded that you >> > don't get a chance to schedule for ~16ms.. which is pretty long time. > > Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? > Hmmm, that does sound very long. Will try out some experiments and see. > yeah > >> > If you are triple buffering, it should not end up in the critical >> > path (since the gpu already has the 3rd buffer to start on the next >> > frame). And, well, if you do it all in the kernel you probably need >> > to toss things over to a workqueue anyways. >> >> Just a quick comment on the kernel flip queue issue. >> >> 16 ms scheduling latency sounds awful but totally doable with a less >> than stellar ddx driver going into limbo land and so preventing your >> single threaded X from doing more useful stuff. Is this really the >> linux scheduler being stupid? > > Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and > so rely on pixman to perform all 2D operations which does indeed tie > up that thread for fairly long periods of time. > > We've had internal discussions about introducing a thread (gulp) in > the DDX to off-load drawing operations to. I think we were all a bit > scared by that idea though. > thread does sound a bit scary.. it probably could be done if you treat it like a virtual cpu and have WaitMarker or PrepareAccess for sw fallbacks synchronize properly.. I bet you'd be much better off just making non-scanout pixmaps cached and doing cache sync ops when needed for dri2 buffers. Sw fallbacks on uncached buffers probably aren't exactly the hot ticket. > > BTW: I wasn't suggesting it was the linux scheduler being stupid, just > that there is sometimes lots of contention over the CPU cores and X > is just one thread among many wanting to run. > > >> At least my impression was that the hw/kernel flip queue is to save >> power so that you can queue up a few frames and everything goes to >> sleep for half a second or so (at 24fps or whatever movie your >> showing). Needing to schedule 5 frames ahead with pageflips under >> load is just guaranteed to result in really horrible interactivity >> and so awful user experience > > Agreed. There's always a tradeoff between tolerance to variable frame > rendering time/system latency (lot of buffers) and UI latency (few > buffers). > > As a side note, video playback is one use-case for explicit sync > objects which implicit/buffer-based sync doesn't handle: Queue up lots > of video frames for display, but mark those "display buffer" > operations as depending on explicit sync objects which get signalled > by the audio clock. Not sure Android actually does that yet though. > Anyway, off topic. > w/ dmafence, rather than explicit fences, I suppose you could add some way to queue the buffer to the audio device and have the audio device signal the fence. I suppose it does sound a bit funny for ALSA to have a DMA_BUF_AV_SYNC ioctl for this sort of case? I don't think there is anything like it in EGL, but there is oml_sync_control extension for more precise control of presentation time. But this is all implemented in userspace and doesn't really work out w/ >double buffering. This is part of the reason for the timing information in vblank events. Of course it doesn't have any tie in to audio subsystem, but in practice this really shouldn't be needed. Audio samples are either rendered at a very predictable rate, or sound like sh** with lots of pops and cut outs. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-10 12:30 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-10 12:30 UTC (permalink / raw) To: Tom Cooksey; +Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel On Fri, Aug 9, 2013 at 1:31 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> > > So in the above, after X receives the second DRI2SwapBuffers, it >> > > doesn't need to get scheduled again for the next frame to be both >> > > rendered by the GPU and issued to the display for scanout. >> > >> > well, this is really only an issue if you are so loaded that you >> > don't get a chance to schedule for ~16ms.. which is pretty long time. > > Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? > Hmmm, that does sound very long. Will try out some experiments and see. > yeah > >> > If you are triple buffering, it should not end up in the critical >> > path (since the gpu already has the 3rd buffer to start on the next >> > frame). And, well, if you do it all in the kernel you probably need >> > to toss things over to a workqueue anyways. >> >> Just a quick comment on the kernel flip queue issue. >> >> 16 ms scheduling latency sounds awful but totally doable with a less >> than stellar ddx driver going into limbo land and so preventing your >> single threaded X from doing more useful stuff. Is this really the >> linux scheduler being stupid? > > Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and > so rely on pixman to perform all 2D operations which does indeed tie > up that thread for fairly long periods of time. > > We've had internal discussions about introducing a thread (gulp) in > the DDX to off-load drawing operations to. I think we were all a bit > scared by that idea though. > thread does sound a bit scary.. it probably could be done if you treat it like a virtual cpu and have WaitMarker or PrepareAccess for sw fallbacks synchronize properly.. I bet you'd be much better off just making non-scanout pixmaps cached and doing cache sync ops when needed for dri2 buffers. Sw fallbacks on uncached buffers probably aren't exactly the hot ticket. > > BTW: I wasn't suggesting it was the linux scheduler being stupid, just > that there is sometimes lots of contention over the CPU cores and X > is just one thread among many wanting to run. > > >> At least my impression was that the hw/kernel flip queue is to save >> power so that you can queue up a few frames and everything goes to >> sleep for half a second or so (at 24fps or whatever movie your >> showing). Needing to schedule 5 frames ahead with pageflips under >> load is just guaranteed to result in really horrible interactivity >> and so awful user experience > > Agreed. There's always a tradeoff between tolerance to variable frame > rendering time/system latency (lot of buffers) and UI latency (few > buffers). > > As a side note, video playback is one use-case for explicit sync > objects which implicit/buffer-based sync doesn't handle: Queue up lots > of video frames for display, but mark those "display buffer" > operations as depending on explicit sync objects which get signalled > by the audio clock. Not sure Android actually does that yet though. > Anyway, off topic. > w/ dmafence, rather than explicit fences, I suppose you could add some way to queue the buffer to the audio device and have the audio device signal the fence. I suppose it does sound a bit funny for ALSA to have a DMA_BUF_AV_SYNC ioctl for this sort of case? I don't think there is anything like it in EGL, but there is oml_sync_control extension for more precise control of presentation time. But this is all implemented in userspace and doesn't really work out w/ >double buffering. This is part of the reason for the timing information in vblank events. Of course it doesn't have any tie in to audio subsystem, but in practice this really shouldn't be needed. Audio samples are either rendered at a very predictable rate, or sound like sh** with lots of pops and cut outs. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-10 12:30 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-10 12:30 UTC (permalink / raw) To: linux-arm-kernel On Fri, Aug 9, 2013 at 1:31 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> > > So in the above, after X receives the second DRI2SwapBuffers, it >> > > doesn't need to get scheduled again for the next frame to be both >> > > rendered by the GPU and issued to the display for scanout. >> > >> > well, this is really only an issue if you are so loaded that you >> > don't get a chance to schedule for ~16ms.. which is pretty long time. > > Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? > Hmmm, that does sound very long. Will try out some experiments and see. > yeah > >> > If you are triple buffering, it should not end up in the critical >> > path (since the gpu already has the 3rd buffer to start on the next >> > frame). And, well, if you do it all in the kernel you probably need >> > to toss things over to a workqueue anyways. >> >> Just a quick comment on the kernel flip queue issue. >> >> 16 ms scheduling latency sounds awful but totally doable with a less >> than stellar ddx driver going into limbo land and so preventing your >> single threaded X from doing more useful stuff. Is this really the >> linux scheduler being stupid? > > Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and > so rely on pixman to perform all 2D operations which does indeed tie > up that thread for fairly long periods of time. > > We've had internal discussions about introducing a thread (gulp) in > the DDX to off-load drawing operations to. I think we were all a bit > scared by that idea though. > thread does sound a bit scary.. it probably could be done if you treat it like a virtual cpu and have WaitMarker or PrepareAccess for sw fallbacks synchronize properly.. I bet you'd be much better off just making non-scanout pixmaps cached and doing cache sync ops when needed for dri2 buffers. Sw fallbacks on uncached buffers probably aren't exactly the hot ticket. > > BTW: I wasn't suggesting it was the linux scheduler being stupid, just > that there is sometimes lots of contention over the CPU cores and X > is just one thread among many wanting to run. > > >> At least my impression was that the hw/kernel flip queue is to save >> power so that you can queue up a few frames and everything goes to >> sleep for half a second or so (at 24fps or whatever movie your >> showing). Needing to schedule 5 frames ahead with pageflips under >> load is just guaranteed to result in really horrible interactivity >> and so awful user experience > > Agreed. There's always a tradeoff between tolerance to variable frame > rendering time/system latency (lot of buffers) and UI latency (few > buffers). > > As a side note, video playback is one use-case for explicit sync > objects which implicit/buffer-based sync doesn't handle: Queue up lots > of video frames for display, but mark those "display buffer" > operations as depending on explicit sync objects which get signalled > by the audio clock. Not sure Android actually does that yet though. > Anyway, off topic. > w/ dmafence, rather than explicit fences, I suppose you could add some way to queue the buffer to the audio device and have the audio device signal the fence. I suppose it does sound a bit funny for ALSA to have a DMA_BUF_AV_SYNC ioctl for this sort of case? I don't think there is anything like it in EGL, but there is oml_sync_control extension for more precise control of presentation time. But this is all implemented in userspace and doesn't really work out w/ >double buffering. This is part of the reason for the timing information in vblank events. Of course it doesn't have any tie in to audio subsystem, but in practice this really shouldn't be needed. Audio samples are either rendered at a very predictable rate, or sound like sh** with lots of pops and cut outs. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <520a4435.070a0e0a.4fce.ffff9731SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <520a4435.070a0e0a.4fce.ffff9731SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-13 14:58 ` Rob Clark @ 2013-08-13 14:58 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-13 14:58 UTC (permalink / raw) To: Tom Cooksey Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel, Daniel Vetter On Tue, Aug 13, 2013 at 10:35 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> > > > So in the above, after X receives the second DRI2SwapBuffers, it >> > > > doesn't need to get scheduled again for the next frame to be both >> > > > rendered by the GPU and issued to the display for scanout. >> > > >> > > well, this is really only an issue if you are so loaded that you >> > > don't get a chance to schedule for ~16ms.. which is pretty long >> > > time. >> >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? >> Hmmm, that does sound very long. Will try out some experiments and see. > > We're looking at moving the flip queue into the DDX driver, however > it's not as straight-forward as I thought. With the current design, > all rate-limiting happens on the client side. So even if you only have > double buffering, using KDS you can queue up as many asynchronous > GPU-render/scan-out pairs as you want. It's up to EGL in the client > application to figure out there's a lot of frames in-flight and so > should probably block the application's render thread in > eglSwapBuffers to let the GPU and/or display catch up a bit. > > If we only allow a single outstanding page-flip job in DRM, there'd be > a race if we returned a buffer to the client which had an outstanding > page-flip queued up in the DDX: The client could issue a render job to > the buffer just as the DDX processed the page-flip from the queue, > making the scan-out block until the GPU rendered the next frame. It > would also mean the previous frame would have been lost as it never > got scanned out before the GPU rendered the next-next frame to it. You wouldn't unconditionally send the swap-done event to the client when the queue is "full". (Well, for omap and msm, the queue depth is 1, for triple buffer.. I think usually you don't want to do more than triple buffer.) The client would never get a buffer that wasn't already done being scanned out, so there shouldn't be a race. Basically, in DDX, when you get a ScheduleSwap, there are two cases: 1) you are still waiting for previous page-flip event from kernel, in which case you queue the swap and don't immediately send the event back to the client. When the previous page flip completes, you schedule the new one and then send back the event to the client. 2) you are not waiting for a previous page-flip, in which case you schedule the new page-flip and send the event to the client. (I hope that is clear.. I suppose maybe a picture here would help, but sadly I don't have anything handy) The potential drawback is that the client doesn't necessarily have any control over double vs triple buffering. In omap ddx I solved this by adding a special attachment point that the client could request to tell the DDX that it wanted to triple buffer. But the upside is that you never need to worry about a modeset when there is more than one flip pending. BR, -R > So instead, I think we'll have to block (suspend?) a client in > ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers > has an outstanding page-flip in the user-space queue. We then wake > the client up again _after_ we get the page-flip event for the > previous page flip and have issued the page-flip to the next buffer > to the DRM. That way the DRM display driver has already registered its > intention to use the buffer with KDS before the client ever gets hold > of it. > > Note: I say KDS here, but I assume the same issues will apply on any > implicit buffer-based synchronization. I.e. dma-fence. > > It's not really a problem I don't think, but mention it to see if you > can see a reason why the above wouldn't work before we go and > implement it - it's a fairly big change to the DDX. Can you see any > issues with it? PrepareAccess gets interesting... > > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-13 14:58 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-13 14:58 UTC (permalink / raw) To: Tom Cooksey Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel, Daniel Vetter On Tue, Aug 13, 2013 at 10:35 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> > > > So in the above, after X receives the second DRI2SwapBuffers, it >> > > > doesn't need to get scheduled again for the next frame to be both >> > > > rendered by the GPU and issued to the display for scanout. >> > > >> > > well, this is really only an issue if you are so loaded that you >> > > don't get a chance to schedule for ~16ms.. which is pretty long >> > > time. >> >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? >> Hmmm, that does sound very long. Will try out some experiments and see. > > We're looking at moving the flip queue into the DDX driver, however > it's not as straight-forward as I thought. With the current design, > all rate-limiting happens on the client side. So even if you only have > double buffering, using KDS you can queue up as many asynchronous > GPU-render/scan-out pairs as you want. It's up to EGL in the client > application to figure out there's a lot of frames in-flight and so > should probably block the application's render thread in > eglSwapBuffers to let the GPU and/or display catch up a bit. > > If we only allow a single outstanding page-flip job in DRM, there'd be > a race if we returned a buffer to the client which had an outstanding > page-flip queued up in the DDX: The client could issue a render job to > the buffer just as the DDX processed the page-flip from the queue, > making the scan-out block until the GPU rendered the next frame. It > would also mean the previous frame would have been lost as it never > got scanned out before the GPU rendered the next-next frame to it. You wouldn't unconditionally send the swap-done event to the client when the queue is "full". (Well, for omap and msm, the queue depth is 1, for triple buffer.. I think usually you don't want to do more than triple buffer.) The client would never get a buffer that wasn't already done being scanned out, so there shouldn't be a race. Basically, in DDX, when you get a ScheduleSwap, there are two cases: 1) you are still waiting for previous page-flip event from kernel, in which case you queue the swap and don't immediately send the event back to the client. When the previous page flip completes, you schedule the new one and then send back the event to the client. 2) you are not waiting for a previous page-flip, in which case you schedule the new page-flip and send the event to the client. (I hope that is clear.. I suppose maybe a picture here would help, but sadly I don't have anything handy) The potential drawback is that the client doesn't necessarily have any control over double vs triple buffering. In omap ddx I solved this by adding a special attachment point that the client could request to tell the DDX that it wanted to triple buffer. But the upside is that you never need to worry about a modeset when there is more than one flip pending. BR, -R > So instead, I think we'll have to block (suspend?) a client in > ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers > has an outstanding page-flip in the user-space queue. We then wake > the client up again _after_ we get the page-flip event for the > previous page flip and have issued the page-flip to the next buffer > to the DRM. That way the DRM display driver has already registered its > intention to use the buffer with KDS before the client ever gets hold > of it. > > Note: I say KDS here, but I assume the same issues will apply on any > implicit buffer-based synchronization. I.e. dma-fence. > > It's not really a problem I don't think, but mention it to see if you > can see a reason why the above wouldn't work before we go and > implement it - it's a fairly big change to the DDX. Can you see any > issues with it? PrepareAccess gets interesting... > > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-13 14:58 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-13 14:58 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 13, 2013 at 10:35 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> > > > So in the above, after X receives the second DRI2SwapBuffers, it >> > > > doesn't need to get scheduled again for the next frame to be both >> > > > rendered by the GPU and issued to the display for scanout. >> > > >> > > well, this is really only an issue if you are so loaded that you >> > > don't get a chance to schedule for ~16ms.. which is pretty long >> > > time. >> >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it? >> Hmmm, that does sound very long. Will try out some experiments and see. > > We're looking at moving the flip queue into the DDX driver, however > it's not as straight-forward as I thought. With the current design, > all rate-limiting happens on the client side. So even if you only have > double buffering, using KDS you can queue up as many asynchronous > GPU-render/scan-out pairs as you want. It's up to EGL in the client > application to figure out there's a lot of frames in-flight and so > should probably block the application's render thread in > eglSwapBuffers to let the GPU and/or display catch up a bit. > > If we only allow a single outstanding page-flip job in DRM, there'd be > a race if we returned a buffer to the client which had an outstanding > page-flip queued up in the DDX: The client could issue a render job to > the buffer just as the DDX processed the page-flip from the queue, > making the scan-out block until the GPU rendered the next frame. It > would also mean the previous frame would have been lost as it never > got scanned out before the GPU rendered the next-next frame to it. You wouldn't unconditionally send the swap-done event to the client when the queue is "full". (Well, for omap and msm, the queue depth is 1, for triple buffer.. I think usually you don't want to do more than triple buffer.) The client would never get a buffer that wasn't already done being scanned out, so there shouldn't be a race. Basically, in DDX, when you get a ScheduleSwap, there are two cases: 1) you are still waiting for previous page-flip event from kernel, in which case you queue the swap and don't immediately send the event back to the client. When the previous page flip completes, you schedule the new one and then send back the event to the client. 2) you are not waiting for a previous page-flip, in which case you schedule the new page-flip and send the event to the client. (I hope that is clear.. I suppose maybe a picture here would help, but sadly I don't have anything handy) The potential drawback is that the client doesn't necessarily have any control over double vs triple buffering. In omap ddx I solved this by adding a special attachment point that the client could request to tell the DDX that it wanted to triple buffer. But the upside is that you never need to worry about a modeset when there is more than one flip pending. BR, -R > So instead, I think we'll have to block (suspend?) a client in > ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers > has an outstanding page-flip in the user-space queue. We then wake > the client up again _after_ we get the page-flip event for the > previous page flip and have issued the page-flip to the next buffer > to the DRM. That way the DRM display driver has already registered its > intention to use the buffer with KDS before the client ever gets hold > of it. > > Note: I say KDS here, but I assume the same issues will apply on any > implicit buffer-based synchronization. I.e. dma-fence. > > It's not really a problem I don't think, but mention it to see if you > can see a reason why the above wouldn't work before we go and > implement it - it's a fairly big change to the DDX. Can you see any > issues with it? PrepareAccess gets interesting... > > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-13 14:58 ` Rob Clark (?) (?) @ 2013-08-14 14:59 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-14 14:59 UTC (permalink / raw) To: linux-arm-kernel > >> > > > So in the above, after X receives the second DRI2SwapBuffers, > >> > > > it doesn't need to get scheduled again for the next frame to > >> > > > be both rendered by the GPU and issued to the display for > >> > > > scanout. > >> > > > >> > > well, this is really only an issue if you are so loaded that you > >> > > don't get a chance to schedule for ~16ms.. which is pretty long > >> > > time. > >> > >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't > >> it? Hmmm, that does sound very long. Will try out some experiments > >> and see. > > > > We're looking at moving the flip queue into the DDX driver, however > > it's not as straight-forward as I thought. With the current design, > > all rate-limiting happens on the client side. So even if you only > > > have double buffering, using KDS you can queue up as many > > asynchronous GPU-render/scan-out pairs as you want. It's up to EGL > > in the client application to figure out there's a lot of frames in- > > flight and so should probably block the application's render thread > > in eglSwapBuffers to let the GPU and/or display catch up a bit. > > > > If we only allow a single outstanding page-flip job in DRM, there'd > > > be a race if we returned a buffer to the client which had an > > outstanding page-flip queued up in the DDX: The client could issue > > > a render job to the buffer just as the DDX processed the page-flip > > from the queue, making the scan-out block until the GPU rendered > > the next frame. It would also mean the previous frame would have > > been lost as it never got scanned out before the GPU rendered the > > next-next frame to it. > > You wouldn't unconditionally send the swap-done event to the client > when the queue is "full". (Well, for omap and msm, the queue depth is > 1, for triple buffer.. I think usually you don't want to do more than > triple buffer.) The client would never get a buffer that wasn't > already done being scanned out, so there shouldn't be a race. > > Basically, in DDX, when you get a ScheduleSwap, there are two cases: > 1) you are still waiting for previous page-flip event from kernel, in > which case you queue the swap and don't immediately send the event > back to the client. When the previous page flip completes, you > schedule the new one and then send back the event to the client. > 2) you are not waiting for a previous page-flip, in which case you > schedule the new page-flip and send the event to the client. > > (I hope that is clear.. I suppose maybe a picture here would help, > but sadly I don't have anything handy) So your solution depends on the client-side EGL using page flip events to figure out when to block the application thread when CPU is running ahead of the GPU/display. We (currently) use the number of uncompleted frames sent to the GPU to block the application thread. So there is a race if we move the flip queue into the DDX and did nothing else. However, I'm not proposing we do nothing else. :-) Our proposal was to instead use waiting on the reply of the DRI2GetBuffers request to block the application thread when the client is submitting frames faster than the display can display them. I've not really looked into using the DRI2BufferSwapComplete in our EGL implementation - it always felt like we'd be at risk of the application somehow stealing the event and causing us to dead-lock. But - that may well be a completely irrational fear. :-) Anyway, I'll take a look, thanks for the pointer! Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-13 14:58 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-14 14:59 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-14 14:59 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel, Daniel Vetter > >> > > > So in the above, after X receives the second DRI2SwapBuffers, > >> > > > it doesn't need to get scheduled again for the next frame to > >> > > > be both rendered by the GPU and issued to the display for > >> > > > scanout. > >> > > > >> > > well, this is really only an issue if you are so loaded that you > >> > > don't get a chance to schedule for ~16ms.. which is pretty long > >> > > time. > >> > >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't > >> it? Hmmm, that does sound very long. Will try out some experiments > >> and see. > > > > We're looking at moving the flip queue into the DDX driver, however > > it's not as straight-forward as I thought. With the current design, > > all rate-limiting happens on the client side. So even if you only > > > have double buffering, using KDS you can queue up as many > > asynchronous GPU-render/scan-out pairs as you want. It's up to EGL > > in the client application to figure out there's a lot of frames in- > > flight and so should probably block the application's render thread > > in eglSwapBuffers to let the GPU and/or display catch up a bit. > > > > If we only allow a single outstanding page-flip job in DRM, there'd > > > be a race if we returned a buffer to the client which had an > > outstanding page-flip queued up in the DDX: The client could issue > > > a render job to the buffer just as the DDX processed the page-flip > > from the queue, making the scan-out block until the GPU rendered > > the next frame. It would also mean the previous frame would have > > been lost as it never got scanned out before the GPU rendered the > > next-next frame to it. > > You wouldn't unconditionally send the swap-done event to the client > when the queue is "full". (Well, for omap and msm, the queue depth is > 1, for triple buffer.. I think usually you don't want to do more than > triple buffer.) The client would never get a buffer that wasn't > already done being scanned out, so there shouldn't be a race. > > Basically, in DDX, when you get a ScheduleSwap, there are two cases: > 1) you are still waiting for previous page-flip event from kernel, in > which case you queue the swap and don't immediately send the event > back to the client. When the previous page flip completes, you > schedule the new one and then send back the event to the client. > 2) you are not waiting for a previous page-flip, in which case you > schedule the new page-flip and send the event to the client. > > (I hope that is clear.. I suppose maybe a picture here would help, > but sadly I don't have anything handy) So your solution depends on the client-side EGL using page flip events to figure out when to block the application thread when CPU is running ahead of the GPU/display. We (currently) use the number of uncompleted frames sent to the GPU to block the application thread. So there is a race if we move the flip queue into the DDX and did nothing else. However, I'm not proposing we do nothing else. :-) Our proposal was to instead use waiting on the reply of the DRI2GetBuffers request to block the application thread when the client is submitting frames faster than the display can display them. I've not really looked into using the DRI2BufferSwapComplete in our EGL implementation - it always felt like we'd be at risk of the application somehow stealing the event and causing us to dead-lock. But - that may well be a completely irrational fear. :-) Anyway, I'll take a look, thanks for the pointer! Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-13 14:58 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-14 14:59 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-14 14:59 UTC (permalink / raw) To: 'Rob Clark'; +Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel > >> > > > So in the above, after X receives the second DRI2SwapBuffers, > >> > > > it doesn't need to get scheduled again for the next frame to > >> > > > be both rendered by the GPU and issued to the display for > >> > > > scanout. > >> > > > >> > > well, this is really only an issue if you are so loaded that you > >> > > don't get a chance to schedule for ~16ms.. which is pretty long > >> > > time. > >> > >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't > >> it? Hmmm, that does sound very long. Will try out some experiments > >> and see. > > > > We're looking at moving the flip queue into the DDX driver, however > > it's not as straight-forward as I thought. With the current design, > > all rate-limiting happens on the client side. So even if you only > > > have double buffering, using KDS you can queue up as many > > asynchronous GPU-render/scan-out pairs as you want. It's up to EGL > > in the client application to figure out there's a lot of frames in- > > flight and so should probably block the application's render thread > > in eglSwapBuffers to let the GPU and/or display catch up a bit. > > > > If we only allow a single outstanding page-flip job in DRM, there'd > > > be a race if we returned a buffer to the client which had an > > outstanding page-flip queued up in the DDX: The client could issue > > > a render job to the buffer just as the DDX processed the page-flip > > from the queue, making the scan-out block until the GPU rendered > > the next frame. It would also mean the previous frame would have > > been lost as it never got scanned out before the GPU rendered the > > next-next frame to it. > > You wouldn't unconditionally send the swap-done event to the client > when the queue is "full". (Well, for omap and msm, the queue depth is > 1, for triple buffer.. I think usually you don't want to do more than > triple buffer.) The client would never get a buffer that wasn't > already done being scanned out, so there shouldn't be a race. > > Basically, in DDX, when you get a ScheduleSwap, there are two cases: > 1) you are still waiting for previous page-flip event from kernel, in > which case you queue the swap and don't immediately send the event > back to the client. When the previous page flip completes, you > schedule the new one and then send back the event to the client. > 2) you are not waiting for a previous page-flip, in which case you > schedule the new page-flip and send the event to the client. > > (I hope that is clear.. I suppose maybe a picture here would help, > but sadly I don't have anything handy) So your solution depends on the client-side EGL using page flip events to figure out when to block the application thread when CPU is running ahead of the GPU/display. We (currently) use the number of uncompleted frames sent to the GPU to block the application thread. So there is a race if we move the flip queue into the DDX and did nothing else. However, I'm not proposing we do nothing else. :-) Our proposal was to instead use waiting on the reply of the DRI2GetBuffers request to block the application thread when the client is submitting frames faster than the display can display them. I've not really looked into using the DRI2BufferSwapComplete in our EGL implementation - it always felt like we'd be at risk of the application somehow stealing the event and causing us to dead-lock. But - that may well be a completely irrational fear. :-) Anyway, I'll take a look, thanks for the pointer! Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 16:46 ` Daniel Vetter ` (5 preceding siblings ...) (?) @ 2013-08-10 4:56 ` Inki Dae -1 siblings, 0 replies; 131+ messages in thread From: Inki Dae @ 2013-08-10 4:56 UTC (permalink / raw) To: Daniel Vetter; +Cc: linux-fbdev, pawel.moll, dri-devel, linux-arm-kernel [-- Attachment #1.1: Type: text/plain, Size: 10704 bytes --] 2013/8/8 Daniel Vetter <daniel@ffwll.ch> > Just comment a bit on Rob's review with my own opinion. > > On Wed, Aug 07, 2013 at 12:17:21PM -0400, Rob Clark wrote: > > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > > From: Tom Cooksey <tom.cooksey@arm.com> > > > > > > This is a mode-setting driver for the pl111 CLCD display controller > > > found on various ARM reference platforms such as the Versatile > > > Express. The driver supports setting of a single mode (640x480) and > > > has only been tested on Versatile Express with a Cortex-A9 core tile. > > > > > > Known issues: > > > * It still includes code to use KDS, which is not going upstream. > > > > review's on > http://lists.freedesktop.org/archives/dri-devel/2013-July/042462.html > > can't hurt > > > > although you might consider submitting a reduced functionality driver > > w/ KDS bits removed in the mean time.. then when the fence stuff is > > merged it is just an incremental patch rather than a whole driver ;-) > > Yeah, I think the KDS bits and comments need to go first before merginge. > > > > > +/* > > > + * Number of flips allowed in flight at any one time. Any more flips > requested > > > + * beyond this value will cause the caller to block until earlier > flips have > > > + * completed. > > > + * > > > + * For performance reasons, this must be greater than the number of > buffers > > > + * used in the rendering pipeline. Note that the rendering pipeline > can contain > > > + * different types of buffer, e.g.: > > > + * - 2 final framebuffers > > > + * - >2 geometry buffers for GPU use-cases > > > + * - >2 vertex buffers for GPU use-cases > > > + * > > > + * For example, a system using 5 geometry buffers could have 5 flips > in flight, > > > + * and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 5 or greater. > > > + * > > > + * Whilst there may be more intermediate buffers (such as > vertex/geometry) than > > > + * final framebuffers, KDS is used to ensure that GPU rendering waits > for the > > > + * next off-screen buffer, so it doesn't overwrite an on-screen > buffer and > > > + * produce tearing. > > > + */ > > > + > > > > fwiw, this is at least different from how other drivers do triple (or > > > double) buffering. In other drivers (intel, omap, and > > msm/freedreno, that I know of, maybe others too) the xorg driver dri2 > > bits implement the double buffering (ie. send flip event back to > > client immediately and queue up the flip and call page-flip after the > > pageflip event back from kernel. > > > > I'm not saying not to do it this way, I guess I'd like to hear what > > other folks think. I kinda prefer doing this in userspace as it keeps > > the kernel bits simpler (plus it would then work properly on exynosdrm > > or other kms drivers). > > Yeah, if this is just a sw queue then I don't think it makes sense to have > it in the kernel. Afaik the current pageflip interface drm exposes allows > one oustanding flip only, and you _must_ wait for the flip complete event > before you can submit the second one. > Agree. Tizen platform using exynos drm driver also does in same way. And there is another issue we are facing with. Please, assume CPU and GPU are sharing a same buffer. The issue is that in case of using glFlush GL API, 3d app cannot be aware of when GPU access to the buffer is completed: the completion event is sent only to GPU specific API; in our case, MALI specific DDK, so the buffer could be broken if CPU accesses the buffer at once after glFlush. Of course we can use glFinish instead of glFlush but glFinish makes GPU more idle: CPU should wait for the completion of GPU access to the buffer so CPU cannot do anything until that time. So I'd like to know how other folks take care of this issue. In our case, we are using dmabuf sync framework I posted before because I thought we may need buffer access control between CPU and DMA: the user land interfaces are fcntl and select system calls so no having any new additional api. with this feature, 3d app, only using standard GL API, can be aware of the completion event from GPU driver without DRM or other driver API. For this, I will introduce the more stable patch set soon with more features. For this, I'd happy to give me other opinions and advices if there is my missing point. Thanks, Inki Dae > > Ofc if your hardware as a hw-based flip queue (maybe even with frame > targets) that's a different matter, but currently we don't have a drm > interface to expose this. I'd say for merging the basic driver first we > should go with the existing simple pageflip semantics. > > And tbh I don't understand why the amount of buffers you keep in the > render pipeline side of things matters here at all. But I also haven't > read the details of your driver code. > > > > > > +/* > > > + * Here, we choose a conservative value. A lower value is most likely > > > + * suitable for GPU use-cases. > > > + */ > > > +#define NR_FLIPS_IN_FLIGHT_THRESHOLD 16 > > > + > > > +#define CLCD_IRQ_NEXTBASE_UPDATE (1u<<2) > > > + > > > +struct pl111_drm_flip_resource; > > > +struct pl111_drm_cursor_plane; > > > + > > > +enum pl111_bo_type { > > > + PL111_BOT_DMA, > > > + PL111_BOT_SHM > > > +}; > > > + > > > +struct pl111_gem_bo_dma { > > > + dma_addr_t fb_dev_addr; > > > + void *fb_cpu_addr; > > > +}; > > > + > > > +struct pl111_gem_bo_shm { > > > + struct page **pages; > > > + dma_addr_t *dma_addrs; > > > +}; > > > + > > > +struct pl111_gem_bo { > > > + struct drm_gem_object gem_object; > > > + enum pl111_bo_type type; > > > + union { > > > + struct pl111_gem_bo_dma dma; > > > + struct pl111_gem_bo_shm shm; > > > + } backing_data; > > > + struct drm_framebuffer *fb; > > > > this is at least a bit odd.. normally the fb has ref to the bo(s) and > > not the other way around. And the same bo could be referenced by > > multiple fb's which would kinda fall down with this approach. > > I'd say that's just backwards, framebuffers are created from backing > storage objects (which for a gem based driver is a gem object), not the > other way round. What's this exactly used for? > > [snip] > > > > + > > > + /* > > > + * Used to prevent race between pl111_dma_buf_release and > > > + * drm_gem_prime_handle_to_fd > > > + */ > > > + struct mutex export_dma_buf_lock; > > > > hmm, seems a bit suspicious.. the handle reference should keep the > > object live. Ie. either drm_gem_object_lookup() will fail because the > > object is gone (userspace has closed it's handle ref and > > dmabuf->release() already dropped it's ref) or it will succeed and > > you'll have a reference to the bo keeping it from going away if the > > release() comes after. > > The race is real, I have an evil testcase here which Oopses my kernel. I'm > working on a fix (v1 of my patches is submitted a few weeks back, awaiting > review), but I need to rework a few things since now I've also spotted a > leak or two ;-) > > [snip] > > > > +static void vsync_worker(struct work_struct *work) > > > +{ > > > + struct pl111_drm_flip_resource *flip_res; > > > + struct pl111_gem_bo *bo; > > > + struct pl111_drm_crtc *pl111_crtc; > > > + struct drm_device *dev; > > > + int flips_in_flight; > > > + flip_res = > > > + container_of(work, struct pl111_drm_flip_resource, > vsync_work); > > > + > > > + pl111_crtc = to_pl111_crtc(flip_res->crtc); > > > + dev = pl111_crtc->crtc.dev; > > > + > > > + DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res); > > > + > > > + bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb); > > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS > > > + if (flip_res->worker_release_kds == true) { > > > + spin_lock(&pl111_crtc->current_displaying_lock); > > > + release_kds_resource_and_display(flip_res); > > > + spin_unlock(&pl111_crtc->current_displaying_lock); > > > + } > > > +#endif > > > + /* Release DMA buffer on this flip */ > > > + if (bo->gem_object.export_dma_buf != NULL) > > > + dma_buf_put(bo->gem_object.export_dma_buf); > > > > I think you just want to unref the outgoing bo, and let it drop the > > dmabuf ref when the file ref of the imported bo goes. Or actually, it > > would be better to hold/drop ref's to the fb, rather than the bo. At > > least this will make things simpler if you ever have multi-planar > > support. > > Drivers have no business frobbing around the dma-buf refcount of imported > objects imo, at least if they use all the standard drm prime > infrastructure. And if they're bugs they need to be fixed there, not in > drivers. > > [snip] > > > > +struct drm_framebuffer *pl111_fb_create(struct drm_device *dev, > > > + struct drm_file *file_priv, > > > + struct drm_mode_fb_cmd2 > *mode_cmd) > > > +{ > > > + struct pl111_drm_framebuffer *pl111_fb = NULL; > > > + struct drm_framebuffer *fb = NULL; > > > + struct drm_gem_object *gem_obj; > > > + struct pl111_gem_bo *bo; > > > + > > > + pr_info("DRM %s\n", __func__); > > > + gem_obj = drm_gem_object_lookup(dev, file_priv, > mode_cmd->handles[0]); > > > + if (gem_obj == NULL) { > > > + DRM_ERROR("Could not get gem obj from handle to create > fb\n"); > > > + goto out; > > > + } > > > + > > > + bo = PL111_BO_FROM_GEM(gem_obj); > > > + /* Don't even attempt PL111_BOT_SHM, it's not contiguous */ > > > + BUG_ON(bo->type != PL111_BOT_DMA); > > > > umm, no BUG_ON() is not really a good way to validate userspace input.. > > > > if (bo->type != ...) > > return ERR_PTR(-EINVAL); > > Yep. > > > > + > > > + switch ((char)(mode_cmd->pixel_format & 0xFF)) { > > > + case 'Y': > > > + case 'U': > > > + case 'V': > > > + case 'N': > > > + case 'T': > > > > perhaps we should instead add a drm_format_is_yuv().. or you could > > (ab)use drm_fb_get_bpp_depth().. > > Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger > question is why you need this, since the drm core should filter out > formats not in your list of supported ones. Or at least it should ... > > Cheers, Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch > -- > To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > [-- Attachment #1.2: Type: text/html, Size: 13731 bytes --] [-- Attachment #2: Type: text/plain, Size: 159 bytes --] _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 17:17 ` tom.cooksey at arm.com (?) @ 2013-07-25 18:21 ` Rob Clark -1 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-25 18:21 UTC (permalink / raw) To: tom.cooksey; +Cc: linux-arm-kernel, linux-fbdev, pawel.moll, dri-devel On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > From: Tom Cooksey <tom.cooksey@arm.com> > > Please find below the current state of our pl111 DRM/KMS driver. This > is lightly tested on a Versatile Express using X running the > xf86-video-armsoc DDX driver[i] with the patches applied to drm-next > as of ~last week. To actually see anything on the DVI output, you > must also apply Pawel Moll's VExpress DVI mux driver[ii] to select > the video signal from the ca9x4 core tile. > > [i] <https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary> > [ii] <https://patchwork.kernel.org/patch/1765981/> > > > Known issues: > * It uses KDS. We intend to switch to whatever implicit per-buffer > synchronisation mechanism gets merged, once something is merged. > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > allocate buffers for the GPU. Still not sure how to resolve this > as we don't use DRM for our GPU driver. any thoughts/plans about a DRM GPU driver? Ideally long term (esp. once the dma-fence stuff is in place), we'd have gpu-specific drm (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, using prime/dmabuf to share between the two. > * Doesn't handle page flip event sequence numbers and timestamps > * The v_sync handling needs work in general - a work queue is a > little overkill > * Doesn't support the import half of PRIME properly, only export > * Need to validate src rectangle size in > pl111_drm_cursor_plane_update() > * Only supports 640x480 mode, which is hard-coded. We intend to > rebase on top of CDF once it is merged, which hopefully will > handle a lot of the EDID parsing & mode setting for us (once > Pawel's CDF patches for VExpress also land). note that drm core already handles EDID parsing quite nicely.. I'm not entirely sure why CDF would be needed for this? > > I appreciate that's a fairly hefty list of known issues already! > However, we're waiting for both CDF & dma_buf sync mechanisms to land > before we can address some of those. So in the mean-time, I thought > someone might be interested in taking a look at what we have so far, > which is why I'm posting this now. Needless to say the code will need > to be refactored a fair bit, however I'm keen to get and additional > feedback anyone cares to give. > What is the dependency on CDF? Is there an external encoder/bridge that could be shared between platforms? btw, I think that having some share-able KMS objects is a good idea. I'm not entirely sure that the directions that the current CDF proposals are headed is necessarily the right way forward. I'd prefer to see small/incremental evolution of KMS (ie. add drm_bridge and drm_panel, and refactor the existing encoder-slave). Keeping it inside drm means that we can evolve it more easily, and avoid layers of glue code for no good reason. BR, -R > > Cheers, > > Tom > > Tom Cooksey (1): > drm/pl111: Initial drm/kms driver for pl111 display controller > > drivers/gpu/drm/Kconfig | 2 + > drivers/gpu/drm/Makefile | 1 + > drivers/gpu/drm/pl111/Kbuild | 14 + > drivers/gpu/drm/pl111/Kconfig | 9 + > drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ > drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ > drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ > drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ > drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ > drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ > drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ > 19 files changed, 3268 insertions(+) > create mode 100644 drivers/gpu/drm/pl111/Kbuild > create mode 100644 drivers/gpu/drm/pl111/Kconfig > create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c > > -- > 1.7.9.5 > > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-25 18:21 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-25 18:21 UTC (permalink / raw) To: tom.cooksey; +Cc: linux-arm-kernel, linux-fbdev, pawel.moll, dri-devel On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > From: Tom Cooksey <tom.cooksey@arm.com> > > Please find below the current state of our pl111 DRM/KMS driver. This > is lightly tested on a Versatile Express using X running the > xf86-video-armsoc DDX driver[i] with the patches applied to drm-next > as of ~last week. To actually see anything on the DVI output, you > must also apply Pawel Moll's VExpress DVI mux driver[ii] to select > the video signal from the ca9x4 core tile. > > [i] <https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary> > [ii] <https://patchwork.kernel.org/patch/1765981/> > > > Known issues: > * It uses KDS. We intend to switch to whatever implicit per-buffer > synchronisation mechanism gets merged, once something is merged. > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > allocate buffers for the GPU. Still not sure how to resolve this > as we don't use DRM for our GPU driver. any thoughts/plans about a DRM GPU driver? Ideally long term (esp. once the dma-fence stuff is in place), we'd have gpu-specific drm (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, using prime/dmabuf to share between the two. > * Doesn't handle page flip event sequence numbers and timestamps > * The v_sync handling needs work in general - a work queue is a > little overkill > * Doesn't support the import half of PRIME properly, only export > * Need to validate src rectangle size in > pl111_drm_cursor_plane_update() > * Only supports 640x480 mode, which is hard-coded. We intend to > rebase on top of CDF once it is merged, which hopefully will > handle a lot of the EDID parsing & mode setting for us (once > Pawel's CDF patches for VExpress also land). note that drm core already handles EDID parsing quite nicely.. I'm not entirely sure why CDF would be needed for this? > > I appreciate that's a fairly hefty list of known issues already! > However, we're waiting for both CDF & dma_buf sync mechanisms to land > before we can address some of those. So in the mean-time, I thought > someone might be interested in taking a look at what we have so far, > which is why I'm posting this now. Needless to say the code will need > to be refactored a fair bit, however I'm keen to get and additional > feedback anyone cares to give. > What is the dependency on CDF? Is there an external encoder/bridge that could be shared between platforms? btw, I think that having some share-able KMS objects is a good idea. I'm not entirely sure that the directions that the current CDF proposals are headed is necessarily the right way forward. I'd prefer to see small/incremental evolution of KMS (ie. add drm_bridge and drm_panel, and refactor the existing encoder-slave). Keeping it inside drm means that we can evolve it more easily, and avoid layers of glue code for no good reason. BR, -R > > Cheers, > > Tom > > Tom Cooksey (1): > drm/pl111: Initial drm/kms driver for pl111 display controller > > drivers/gpu/drm/Kconfig | 2 + > drivers/gpu/drm/Makefile | 1 + > drivers/gpu/drm/pl111/Kbuild | 14 + > drivers/gpu/drm/pl111/Kconfig | 9 + > drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ > drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ > drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ > drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ > drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ > drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ > drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ > 19 files changed, 3268 insertions(+) > create mode 100644 drivers/gpu/drm/pl111/Kbuild > create mode 100644 drivers/gpu/drm/pl111/Kconfig > create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c > > -- > 1.7.9.5 > > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-25 18:21 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-25 18:21 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > From: Tom Cooksey <tom.cooksey@arm.com> > > Please find below the current state of our pl111 DRM/KMS driver. This > is lightly tested on a Versatile Express using X running the > xf86-video-armsoc DDX driver[i] with the patches applied to drm-next > as of ~last week. To actually see anything on the DVI output, you > must also apply Pawel Moll's VExpress DVI mux driver[ii] to select > the video signal from the ca9x4 core tile. > > [i] <https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary> > [ii] <https://patchwork.kernel.org/patch/1765981/> > > > Known issues: > * It uses KDS. We intend to switch to whatever implicit per-buffer > synchronisation mechanism gets merged, once something is merged. > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > allocate buffers for the GPU. Still not sure how to resolve this > as we don't use DRM for our GPU driver. any thoughts/plans about a DRM GPU driver? Ideally long term (esp. once the dma-fence stuff is in place), we'd have gpu-specific drm (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, using prime/dmabuf to share between the two. > * Doesn't handle page flip event sequence numbers and timestamps > * The v_sync handling needs work in general - a work queue is a > little overkill > * Doesn't support the import half of PRIME properly, only export > * Need to validate src rectangle size in > pl111_drm_cursor_plane_update() > * Only supports 640x480 mode, which is hard-coded. We intend to > rebase on top of CDF once it is merged, which hopefully will > handle a lot of the EDID parsing & mode setting for us (once > Pawel's CDF patches for VExpress also land). note that drm core already handles EDID parsing quite nicely.. I'm not entirely sure why CDF would be needed for this? > > I appreciate that's a fairly hefty list of known issues already! > However, we're waiting for both CDF & dma_buf sync mechanisms to land > before we can address some of those. So in the mean-time, I thought > someone might be interested in taking a look at what we have so far, > which is why I'm posting this now. Needless to say the code will need > to be refactored a fair bit, however I'm keen to get and additional > feedback anyone cares to give. > What is the dependency on CDF? Is there an external encoder/bridge that could be shared between platforms? btw, I think that having some share-able KMS objects is a good idea. I'm not entirely sure that the directions that the current CDF proposals are headed is necessarily the right way forward. I'd prefer to see small/incremental evolution of KMS (ie. add drm_bridge and drm_panel, and refactor the existing encoder-slave). Keeping it inside drm means that we can evolve it more easily, and avoid layers of glue code for no good reason. BR, -R > > Cheers, > > Tom > > Tom Cooksey (1): > drm/pl111: Initial drm/kms driver for pl111 display controller > > drivers/gpu/drm/Kconfig | 2 + > drivers/gpu/drm/Makefile | 1 + > drivers/gpu/drm/pl111/Kbuild | 14 + > drivers/gpu/drm/pl111/Kconfig | 9 + > drivers/gpu/drm/pl111/pl111_clcd_ext.h | 78 ++++ > drivers/gpu/drm/pl111/pl111_drm.h | 227 ++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_connector.c | 166 +++++++++ > drivers/gpu/drm/pl111/pl111_drm_crtc.c | 432 ++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_cursor.c | 97 +++++ > drivers/gpu/drm/pl111/pl111_drm_device.c | 319 +++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_dma_buf.c | 339 ++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_encoder.c | 106 ++++++ > drivers/gpu/drm/pl111/pl111_drm_fb.c | 152 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_funcs.h | 127 +++++++ > drivers/gpu/drm/pl111/pl111_drm_gem.c | 287 +++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_pl111.c | 513 +++++++++++++++++++++++++++ > drivers/gpu/drm/pl111/pl111_drm_platform.c | 150 ++++++++ > drivers/gpu/drm/pl111/pl111_drm_suspend.c | 35 ++ > drivers/gpu/drm/pl111/pl111_drm_vma.c | 214 +++++++++++ > 19 files changed, 3268 insertions(+) > create mode 100644 drivers/gpu/drm/pl111/Kbuild > create mode 100644 drivers/gpu/drm/pl111/Kconfig > create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c > create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c > > -- > 1.7.9.5 > > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel at lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 18:21 ` Rob Clark (?) @ 2013-07-26 14:06 ` Pawel Moll -1 siblings, 0 replies; 131+ messages in thread From: Pawel Moll @ 2013-07-26 14:06 UTC (permalink / raw) To: Rob Clark, Laurent Pinchart Cc: linux-fbdev, linux-arm-kernel, dri-devel, Tom Cooksey On Thu, 2013-07-25 at 19:21 +0100, Rob Clark wrote: > > * Only supports 640x480 mode, which is hard-coded. We intend to > > rebase on top of CDF once it is merged, which hopefully will > > handle a lot of the EDID parsing & mode setting for us (once > > Pawel's CDF patches for VExpress also land). > > note that drm core already handles EDID parsing quite nicely.. I'm not > entirely sure why CDF would be needed for this? > > What is the dependency on CDF? Is there an external encoder/bridge > that could be shared between platforms? By all means. The platform we have here at ARM - Versatile Express - has a pretty generic idea of "multimedia bus", carrying a set of RGB pixel data and digital audio. There are three possible source of those: an FPGA on the motherboard and two daughterboards which can take random combination of FPGAs or test chips. In other words: the pixel data can be generated by anything. And some of those things can be driven by existing fbdev drivers. Now Tom is trying to make the most modern driver for the oldest of the things ;-) So that's about sources. All of them are connected to yet another FPGA which acts as a muxer (switch if you wish). And from there things are fed to SiI9022 HDMI/DVI formatter which is pretty common in the "embedded world". In other words: loads of other fbdev-driven display controllers are driving them as well (googling for sii9022 reveals at least 3 different more or less custom drivers for it). And this chip is also a reason Tom mentioned EDID. In order to get (just access, not even parse) it we must jump through hoops - the chip acts as a I2C master itself and must be kindly asked to switch into a sort of I2C bridge mode. So yes, the display pipeline is exactly straight-forward... > btw, I think that having some share-able KMS objects is a good idea. > I'm not entirely sure that the directions that the current CDF > proposals are headed is necessarily the right way forward. I'd prefer > to see small/incremental evolution of KMS (ie. add drm_bridge and > drm_panel, and refactor the existing encoder-slave). Keeping it > inside drm means that we can evolve it more easily, and avoid layers > of glue code for no good reason. To keep the story short - I'm a great enthusiast of CDF because it caters for both DRM *and* fbdev. Today. I'm looking forward to the day when the last fbdev driver is kicked off the kernel, but it doesn't look like happening soon. At the same time I couldn't care less about naming, so if you prefer to call it drm_panel (but keep the API abstract enough so non-DRM drivers can use them as well) - it's fine for me :-) There are already generic mode/timing structures and DT bidnings available (with helpers for the interesting parties), so this doesn't seem like a unreasonable request? Cheers! Pawel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:06 ` Pawel Moll 0 siblings, 0 replies; 131+ messages in thread From: Pawel Moll @ 2013-07-26 14:06 UTC (permalink / raw) To: Rob Clark, Laurent Pinchart Cc: linux-fbdev, linux-arm-kernel, dri-devel, Tom Cooksey On Thu, 2013-07-25 at 19:21 +0100, Rob Clark wrote: > > * Only supports 640x480 mode, which is hard-coded. We intend to > > rebase on top of CDF once it is merged, which hopefully will > > handle a lot of the EDID parsing & mode setting for us (once > > Pawel's CDF patches for VExpress also land). > > note that drm core already handles EDID parsing quite nicely.. I'm not > entirely sure why CDF would be needed for this? > > What is the dependency on CDF? Is there an external encoder/bridge > that could be shared between platforms? By all means. The platform we have here at ARM - Versatile Express - has a pretty generic idea of "multimedia bus", carrying a set of RGB pixel data and digital audio. There are three possible source of those: an FPGA on the motherboard and two daughterboards which can take random combination of FPGAs or test chips. In other words: the pixel data can be generated by anything. And some of those things can be driven by existing fbdev drivers. Now Tom is trying to make the most modern driver for the oldest of the things ;-) So that's about sources. All of them are connected to yet another FPGA which acts as a muxer (switch if you wish). And from there things are fed to SiI9022 HDMI/DVI formatter which is pretty common in the "embedded world". In other words: loads of other fbdev-driven display controllers are driving them as well (googling for sii9022 reveals at least 3 different more or less custom drivers for it). And this chip is also a reason Tom mentioned EDID. In order to get (just access, not even parse) it we must jump through hoops - the chip acts as a I2C master itself and must be kindly asked to switch into a sort of I2C bridge mode. So yes, the display pipeline is exactly straight-forward... > btw, I think that having some share-able KMS objects is a good idea. > I'm not entirely sure that the directions that the current CDF > proposals are headed is necessarily the right way forward. I'd prefer > to see small/incremental evolution of KMS (ie. add drm_bridge and > drm_panel, and refactor the existing encoder-slave). Keeping it > inside drm means that we can evolve it more easily, and avoid layers > of glue code for no good reason. To keep the story short - I'm a great enthusiast of CDF because it caters for both DRM *and* fbdev. Today. I'm looking forward to the day when the last fbdev driver is kicked off the kernel, but it doesn't look like happening soon. At the same time I couldn't care less about naming, so if you prefer to call it drm_panel (but keep the API abstract enough so non-DRM drivers can use them as well) - it's fine for me :-) There are already generic mode/timing structures and DT bidnings available (with helpers for the interesting parties), so this doesn't seem like a unreasonable request? Cheers! Pawel ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:06 ` Pawel Moll 0 siblings, 0 replies; 131+ messages in thread From: Pawel Moll @ 2013-07-26 14:06 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-07-25 at 19:21 +0100, Rob Clark wrote: > > * Only supports 640x480 mode, which is hard-coded. We intend to > > rebase on top of CDF once it is merged, which hopefully will > > handle a lot of the EDID parsing & mode setting for us (once > > Pawel's CDF patches for VExpress also land). > > note that drm core already handles EDID parsing quite nicely.. I'm not > entirely sure why CDF would be needed for this? > > What is the dependency on CDF? Is there an external encoder/bridge > that could be shared between platforms? By all means. The platform we have here at ARM - Versatile Express - has a pretty generic idea of "multimedia bus", carrying a set of RGB pixel data and digital audio. There are three possible source of those: an FPGA on the motherboard and two daughterboards which can take random combination of FPGAs or test chips. In other words: the pixel data can be generated by anything. And some of those things can be driven by existing fbdev drivers. Now Tom is trying to make the most modern driver for the oldest of the things ;-) So that's about sources. All of them are connected to yet another FPGA which acts as a muxer (switch if you wish). And from there things are fed to SiI9022 HDMI/DVI formatter which is pretty common in the "embedded world". In other words: loads of other fbdev-driven display controllers are driving them as well (googling for sii9022 reveals at least 3 different more or less custom drivers for it). And this chip is also a reason Tom mentioned EDID. In order to get (just access, not even parse) it we must jump through hoops - the chip acts as a I2C master itself and must be kindly asked to switch into a sort of I2C bridge mode. So yes, the display pipeline is exactly straight-forward... > btw, I think that having some share-able KMS objects is a good idea. > I'm not entirely sure that the directions that the current CDF > proposals are headed is necessarily the right way forward. I'd prefer > to see small/incremental evolution of KMS (ie. add drm_bridge and > drm_panel, and refactor the existing encoder-slave). Keeping it > inside drm means that we can evolve it more easily, and avoid layers > of glue code for no good reason. To keep the story short - I'm a great enthusiast of CDF because it caters for both DRM *and* fbdev. Today. I'm looking forward to the day when the last fbdev driver is kicked off the kernel, but it doesn't look like happening soon. At the same time I couldn't care less about naming, so if you prefer to call it drm_panel (but keep the API abstract enough so non-DRM drivers can use them as well) - it's fine for me :-) There are already generic mode/timing structures and DT bidnings available (with helpers for the interesting parties), so this doesn't seem like a unreasonable request? Cheers! Pawel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-26 14:06 ` Pawel Moll (?) @ 2013-07-26 14:21 ` Rob Clark -1 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-26 14:21 UTC (permalink / raw) To: Pawel Moll; +Cc: linux-arm-kernel, linux-fbdev, Laurent Pinchart, dri-devel On Fri, Jul 26, 2013 at 10:06 AM, Pawel Moll <pawel.moll@arm.com> wrote: > On Thu, 2013-07-25 at 19:21 +0100, Rob Clark wrote: >> > * Only supports 640x480 mode, which is hard-coded. We intend to >> > rebase on top of CDF once it is merged, which hopefully will >> > handle a lot of the EDID parsing & mode setting for us (once >> > Pawel's CDF patches for VExpress also land). >> >> note that drm core already handles EDID parsing quite nicely.. I'm not >> entirely sure why CDF would be needed for this? >> >> What is the dependency on CDF? Is there an external encoder/bridge >> that could be shared between platforms? > > By all means. The platform we have here at ARM - Versatile Express - has > a pretty generic idea of "multimedia bus", carrying a set of RGB pixel > data and digital audio. There are three possible source of those: an > FPGA on the motherboard and two daughterboards which can take random > combination of FPGAs or test chips. In other words: the pixel data can > be generated by anything. And some of those things can be driven by > existing fbdev drivers. Now Tom is trying to make the most modern driver > for the oldest of the things ;-) > > So that's about sources. All of them are connected to yet another FPGA > which acts as a muxer (switch if you wish). And from there things are > fed to SiI9022 HDMI/DVI formatter which is pretty common in the > "embedded world". In other words: loads of other fbdev-driven display > controllers are driving them as well (googling for sii9022 reveals at > least 3 different more or less custom drivers for it). And this chip is > also a reason Tom mentioned EDID. In order to get (just access, not even > parse) it we must jump through hoops - the chip acts as a I2C master > itself and must be kindly asked to switch into a sort of I2C bridge > mode. btw, you might want to have a look at some of the existing i2c slave encoders, since that sounds a lot like what the si9022 is. There is already drivers/gpu/drm/i2c/sil164_drv.c for some other SI part. And tda998x is another more recently added hdmi encoder/formatter (the edid probing code in there might serve as a reasonable source of inspiration) > So yes, the display pipeline is exactly straight-forward... > >> btw, I think that having some share-able KMS objects is a good idea. >> I'm not entirely sure that the directions that the current CDF >> proposals are headed is necessarily the right way forward. I'd prefer >> to see small/incremental evolution of KMS (ie. add drm_bridge and >> drm_panel, and refactor the existing encoder-slave). Keeping it >> inside drm means that we can evolve it more easily, and avoid layers >> of glue code for no good reason. > > To keep the story short - I'm a great enthusiast of CDF because it > caters for both DRM *and* fbdev. Today. I'm looking forward to the day > when the last fbdev driver is kicked off the kernel, but it doesn't look > like happening soon. At the same time I couldn't care less about naming, > so if you prefer to call it drm_panel (but keep the API abstract enough > so non-DRM drivers can use them as well) - it's fine for me :-) There > are already generic mode/timing structures and DT bidnings available > (with helpers for the interesting parties), so this doesn't seem like a > unreasonable request? Well, if you have something complex enough to benefit from CDF, then you probably ought to be looking at drm/kms. If someone wants to somehow come up with some glue to re-use shareable drm/kms components from fbdev, well.. I guess I don't really care about what goes in fbdev. My main concern is that we have more than enough work to do, and have more than enough complexity. If we end up w/ bunch of glue to tie things in to drm properties, how drm helpers sequence things during modeset, locking, etc.. I just don't see that turning out well. And also, just on the logistical side, having something shared across subsystems makes changes more of a pain during the merge window. Not to mention backports to stable kernels, distro kernels, etc. It all just seems like it will be a lot of unnecessary headache. > Cheers! > > Pawel > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:21 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-26 14:21 UTC (permalink / raw) To: Pawel Moll; +Cc: linux-arm-kernel, linux-fbdev, Laurent Pinchart, dri-devel On Fri, Jul 26, 2013 at 10:06 AM, Pawel Moll <pawel.moll@arm.com> wrote: > On Thu, 2013-07-25 at 19:21 +0100, Rob Clark wrote: >> > * Only supports 640x480 mode, which is hard-coded. We intend to >> > rebase on top of CDF once it is merged, which hopefully will >> > handle a lot of the EDID parsing & mode setting for us (once >> > Pawel's CDF patches for VExpress also land). >> >> note that drm core already handles EDID parsing quite nicely.. I'm not >> entirely sure why CDF would be needed for this? >> >> What is the dependency on CDF? Is there an external encoder/bridge >> that could be shared between platforms? > > By all means. The platform we have here at ARM - Versatile Express - has > a pretty generic idea of "multimedia bus", carrying a set of RGB pixel > data and digital audio. There are three possible source of those: an > FPGA on the motherboard and two daughterboards which can take random > combination of FPGAs or test chips. In other words: the pixel data can > be generated by anything. And some of those things can be driven by > existing fbdev drivers. Now Tom is trying to make the most modern driver > for the oldest of the things ;-) > > So that's about sources. All of them are connected to yet another FPGA > which acts as a muxer (switch if you wish). And from there things are > fed to SiI9022 HDMI/DVI formatter which is pretty common in the > "embedded world". In other words: loads of other fbdev-driven display > controllers are driving them as well (googling for sii9022 reveals at > least 3 different more or less custom drivers for it). And this chip is > also a reason Tom mentioned EDID. In order to get (just access, not even > parse) it we must jump through hoops - the chip acts as a I2C master > itself and must be kindly asked to switch into a sort of I2C bridge > mode. btw, you might want to have a look at some of the existing i2c slave encoders, since that sounds a lot like what the si9022 is. There is already drivers/gpu/drm/i2c/sil164_drv.c for some other SI part. And tda998x is another more recently added hdmi encoder/formatter (the edid probing code in there might serve as a reasonable source of inspiration) > So yes, the display pipeline is exactly straight-forward... > >> btw, I think that having some share-able KMS objects is a good idea. >> I'm not entirely sure that the directions that the current CDF >> proposals are headed is necessarily the right way forward. I'd prefer >> to see small/incremental evolution of KMS (ie. add drm_bridge and >> drm_panel, and refactor the existing encoder-slave). Keeping it >> inside drm means that we can evolve it more easily, and avoid layers >> of glue code for no good reason. > > To keep the story short - I'm a great enthusiast of CDF because it > caters for both DRM *and* fbdev. Today. I'm looking forward to the day > when the last fbdev driver is kicked off the kernel, but it doesn't look > like happening soon. At the same time I couldn't care less about naming, > so if you prefer to call it drm_panel (but keep the API abstract enough > so non-DRM drivers can use them as well) - it's fine for me :-) There > are already generic mode/timing structures and DT bidnings available > (with helpers for the interesting parties), so this doesn't seem like a > unreasonable request? Well, if you have something complex enough to benefit from CDF, then you probably ought to be looking at drm/kms. If someone wants to somehow come up with some glue to re-use shareable drm/kms components from fbdev, well.. I guess I don't really care about what goes in fbdev. My main concern is that we have more than enough work to do, and have more than enough complexity. If we end up w/ bunch of glue to tie things in to drm properties, how drm helpers sequence things during modeset, locking, etc.. I just don't see that turning out well. And also, just on the logistical side, having something shared across subsystems makes changes more of a pain during the merge window. Not to mention backports to stable kernels, distro kernels, etc. It all just seems like it will be a lot of unnecessary headache. > Cheers! > > Pawel > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:21 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-26 14:21 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jul 26, 2013 at 10:06 AM, Pawel Moll <pawel.moll@arm.com> wrote: > On Thu, 2013-07-25 at 19:21 +0100, Rob Clark wrote: >> > * Only supports 640x480 mode, which is hard-coded. We intend to >> > rebase on top of CDF once it is merged, which hopefully will >> > handle a lot of the EDID parsing & mode setting for us (once >> > Pawel's CDF patches for VExpress also land). >> >> note that drm core already handles EDID parsing quite nicely.. I'm not >> entirely sure why CDF would be needed for this? >> >> What is the dependency on CDF? Is there an external encoder/bridge >> that could be shared between platforms? > > By all means. The platform we have here at ARM - Versatile Express - has > a pretty generic idea of "multimedia bus", carrying a set of RGB pixel > data and digital audio. There are three possible source of those: an > FPGA on the motherboard and two daughterboards which can take random > combination of FPGAs or test chips. In other words: the pixel data can > be generated by anything. And some of those things can be driven by > existing fbdev drivers. Now Tom is trying to make the most modern driver > for the oldest of the things ;-) > > So that's about sources. All of them are connected to yet another FPGA > which acts as a muxer (switch if you wish). And from there things are > fed to SiI9022 HDMI/DVI formatter which is pretty common in the > "embedded world". In other words: loads of other fbdev-driven display > controllers are driving them as well (googling for sii9022 reveals at > least 3 different more or less custom drivers for it). And this chip is > also a reason Tom mentioned EDID. In order to get (just access, not even > parse) it we must jump through hoops - the chip acts as a I2C master > itself and must be kindly asked to switch into a sort of I2C bridge > mode. btw, you might want to have a look at some of the existing i2c slave encoders, since that sounds a lot like what the si9022 is. There is already drivers/gpu/drm/i2c/sil164_drv.c for some other SI part. And tda998x is another more recently added hdmi encoder/formatter (the edid probing code in there might serve as a reasonable source of inspiration) > So yes, the display pipeline is exactly straight-forward... > >> btw, I think that having some share-able KMS objects is a good idea. >> I'm not entirely sure that the directions that the current CDF >> proposals are headed is necessarily the right way forward. I'd prefer >> to see small/incremental evolution of KMS (ie. add drm_bridge and >> drm_panel, and refactor the existing encoder-slave). Keeping it >> inside drm means that we can evolve it more easily, and avoid layers >> of glue code for no good reason. > > To keep the story short - I'm a great enthusiast of CDF because it > caters for both DRM *and* fbdev. Today. I'm looking forward to the day > when the last fbdev driver is kicked off the kernel, but it doesn't look > like happening soon. At the same time I couldn't care less about naming, > so if you prefer to call it drm_panel (but keep the API abstract enough > so non-DRM drivers can use them as well) - it's fine for me :-) There > are already generic mode/timing structures and DT bidnings available > (with helpers for the interesting parties), so this doesn't seem like a > unreasonable request? Well, if you have something complex enough to benefit from CDF, then you probably ought to be looking at drm/kms. If someone wants to somehow come up with some glue to re-use shareable drm/kms components from fbdev, well.. I guess I don't really care about what goes in fbdev. My main concern is that we have more than enough work to do, and have more than enough complexity. If we end up w/ bunch of glue to tie things in to drm properties, how drm helpers sequence things during modeset, locking, etc.. I just don't see that turning out well. And also, just on the logistical side, having something shared across subsystems makes changes more of a pain during the merge window. Not to mention backports to stable kernels, distro kernels, etc. It all just seems like it will be a lot of unnecessary headache. > Cheers! > > Pawel > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-26 14:21 ` Rob Clark (?) @ 2013-07-26 14:41 ` Pawel Moll -1 siblings, 0 replies; 131+ messages in thread From: Pawel Moll @ 2013-07-26 14:41 UTC (permalink / raw) To: Rob Clark Cc: linux-arm-kernel, linux-fbdev, Laurent Pinchart, dri-devel, Tom Cooksey On Fri, 2013-07-26 at 15:21 +0100, Rob Clark wrote: > Well, if you have something complex enough to benefit from CDF, then > you probably ought to be looking at drm/kms. That's what we're doing - hope you appreciate Tom's effort of re-writing a driver for something that pre-dates DRM/KMS by generations ;-) > I guess I don't really care about what goes in fbdev. Well, I've guessed that ;-) > My main concern is that we have more than enough work to do, > and have more than enough complexity. If we end up w/ bunch of glue > to tie things in to drm properties, how drm helpers sequence things > during modeset, locking, etc.. I just don't see that turning out well. > And also, just on the logistical side, having something shared across > subsystems makes changes more of a pain during the merge window. Not > to mention backports to stable kernels, distro kernels, etc. It all > just seems like it will be a lot of unnecessary headache. I won't argue that, however it seems to me that CDF is such a small variable in the equation that it would disappear in the noise. Surely you have (or will have) standard DRM interface to drive the display pipeline where CDF would be one of many back-ends? In the end you'll have to solve the same problem - how to represent a random pipeline in a generic way so different components can be plugged one into another. Anyway, I'm not planning to stick my nose between DRM and CDF, really. I'm just hoping for a solution that will fit my needs :-) Pawel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:41 ` Pawel Moll 0 siblings, 0 replies; 131+ messages in thread From: Pawel Moll @ 2013-07-26 14:41 UTC (permalink / raw) To: Rob Clark Cc: linux-arm-kernel, linux-fbdev, Laurent Pinchart, dri-devel, Tom Cooksey On Fri, 2013-07-26 at 15:21 +0100, Rob Clark wrote: > Well, if you have something complex enough to benefit from CDF, then > you probably ought to be looking at drm/kms. That's what we're doing - hope you appreciate Tom's effort of re-writing a driver for something that pre-dates DRM/KMS by generations ;-) > I guess I don't really care about what goes in fbdev. Well, I've guessed that ;-) > My main concern is that we have more than enough work to do, > and have more than enough complexity. If we end up w/ bunch of glue > to tie things in to drm properties, how drm helpers sequence things > during modeset, locking, etc.. I just don't see that turning out well. > And also, just on the logistical side, having something shared across > subsystems makes changes more of a pain during the merge window. Not > to mention backports to stable kernels, distro kernels, etc. It all > just seems like it will be a lot of unnecessary headache. I won't argue that, however it seems to me that CDF is such a small variable in the equation that it would disappear in the noise. Surely you have (or will have) standard DRM interface to drive the display pipeline where CDF would be one of many back-ends? In the end you'll have to solve the same problem - how to represent a random pipeline in a generic way so different components can be plugged one into another. Anyway, I'm not planning to stick my nose between DRM and CDF, really. I'm just hoping for a solution that will fit my needs :-) Pawel ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:41 ` Pawel Moll 0 siblings, 0 replies; 131+ messages in thread From: Pawel Moll @ 2013-07-26 14:41 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2013-07-26 at 15:21 +0100, Rob Clark wrote: > Well, if you have something complex enough to benefit from CDF, then > you probably ought to be looking at drm/kms. That's what we're doing - hope you appreciate Tom's effort of re-writing a driver for something that pre-dates DRM/KMS by generations ;-) > I guess I don't really care about what goes in fbdev. Well, I've guessed that ;-) > My main concern is that we have more than enough work to do, > and have more than enough complexity. If we end up w/ bunch of glue > to tie things in to drm properties, how drm helpers sequence things > during modeset, locking, etc.. I just don't see that turning out well. > And also, just on the logistical side, having something shared across > subsystems makes changes more of a pain during the merge window. Not > to mention backports to stable kernels, distro kernels, etc. It all > just seems like it will be a lot of unnecessary headache. I won't argue that, however it seems to me that CDF is such a small variable in the equation that it would disappear in the noise. Surely you have (or will have) standard DRM interface to drive the display pipeline where CDF would be one of many back-ends? In the end you'll have to solve the same problem - how to represent a random pipeline in a generic way so different components can be plugged one into another. Anyway, I'm not planning to stick my nose between DRM and CDF, really. I'm just hoping for a solution that will fit my needs :-) Pawel ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 18:21 ` Rob Clark (?) @ 2013-07-26 14:14 ` Russell King - ARM Linux -1 siblings, 0 replies; 131+ messages in thread From: Russell King - ARM Linux @ 2013-07-26 14:14 UTC (permalink / raw) To: Rob Clark Cc: dri-devel, linux-fbdev, pawel.moll, linux-arm-kernel, tom.cooksey On Thu, Jul 25, 2013 at 02:21:59PM -0400, Rob Clark wrote: > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > Known issues: > > * It uses KDS. We intend to switch to whatever implicit per-buffer > > synchronisation mechanism gets merged, once something is merged. > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > allocate buffers for the GPU. Still not sure how to resolve this > > as we don't use DRM for our GPU driver. > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > once the dma-fence stuff is in place), we'd have gpu-specific drm > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > using prime/dmabuf to share between the two. The PL111 primecell is just a scanout device. It has no GPU embedded in it. It's just like the Armada/Dove DRM stuff - if a GPU is provided it's an entirely separate piece of IP, and will likely remain a separate non-DRM driver. In the case of PL111, PL111 is not manufacturer specific so the GPU can be any standalone GPU. ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:14 ` Russell King - ARM Linux 0 siblings, 0 replies; 131+ messages in thread From: Russell King - ARM Linux @ 2013-07-26 14:14 UTC (permalink / raw) To: Rob Clark Cc: dri-devel, linux-fbdev, pawel.moll, linux-arm-kernel, tom.cooksey On Thu, Jul 25, 2013 at 02:21:59PM -0400, Rob Clark wrote: > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > Known issues: > > * It uses KDS. We intend to switch to whatever implicit per-buffer > > synchronisation mechanism gets merged, once something is merged. > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > allocate buffers for the GPU. Still not sure how to resolve this > > as we don't use DRM for our GPU driver. > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > once the dma-fence stuff is in place), we'd have gpu-specific drm > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > using prime/dmabuf to share between the two. The PL111 primecell is just a scanout device. It has no GPU embedded in it. It's just like the Armada/Dove DRM stuff - if a GPU is provided it's an entirely separate piece of IP, and will likely remain a separate non-DRM driver. In the case of PL111, PL111 is not manufacturer specific so the GPU can be any standalone GPU. ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:14 ` Russell King - ARM Linux 0 siblings, 0 replies; 131+ messages in thread From: Russell King - ARM Linux @ 2013-07-26 14:14 UTC (permalink / raw) To: linux-arm-kernel On Thu, Jul 25, 2013 at 02:21:59PM -0400, Rob Clark wrote: > On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: > > Known issues: > > * It uses KDS. We intend to switch to whatever implicit per-buffer > > synchronisation mechanism gets merged, once something is merged. > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > allocate buffers for the GPU. Still not sure how to resolve this > > as we don't use DRM for our GPU driver. > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > once the dma-fence stuff is in place), we'd have gpu-specific drm > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > using prime/dmabuf to share between the two. The PL111 primecell is just a scanout device. It has no GPU embedded in it. It's just like the Armada/Dove DRM stuff - if a GPU is provided it's an entirely separate piece of IP, and will likely remain a separate non-DRM driver. In the case of PL111, PL111 is not manufacturer specific so the GPU can be any standalone GPU. ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-26 14:14 ` Russell King - ARM Linux (?) @ 2013-07-26 14:24 ` Rob Clark -1 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-26 14:24 UTC (permalink / raw) To: Russell King - ARM Linux Cc: dri-devel, linux-fbdev, pawel.moll, linux-arm-kernel, tom.cooksey On Fri, Jul 26, 2013 at 10:14 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, Jul 25, 2013 at 02:21:59PM -0400, Rob Clark wrote: >> On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: >> > Known issues: >> > * It uses KDS. We intend to switch to whatever implicit per-buffer >> > synchronisation mechanism gets merged, once something is merged. >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> > allocate buffers for the GPU. Still not sure how to resolve this >> > as we don't use DRM for our GPU driver. >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> using prime/dmabuf to share between the two. > > The PL111 primecell is just a scanout device. It has no GPU embedded > in it. It's just like the Armada/Dove DRM stuff - if a GPU is provided > it's an entirely separate piece of IP, and will likely remain a separate > non-DRM driver. > > In the case of PL111, PL111 is not manufacturer specific so the GPU can > be any standalone GPU. Oh, yeah, I know PL111 is different block independent from GPU. I wasn't advocating adding mali support in this driver. But instead creating a mali DRM driver which had GEM and gpu bits, but not KMS. Then buffer sharing w/ prime/dmabuf just like you would, for example on a laptop w/ both integrated graphics and discrete GPU. BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:24 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-26 14:24 UTC (permalink / raw) To: Russell King - ARM Linux Cc: dri-devel, linux-fbdev, pawel.moll, linux-arm-kernel, tom.cooksey On Fri, Jul 26, 2013 at 10:14 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, Jul 25, 2013 at 02:21:59PM -0400, Rob Clark wrote: >> On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: >> > Known issues: >> > * It uses KDS. We intend to switch to whatever implicit per-buffer >> > synchronisation mechanism gets merged, once something is merged. >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> > allocate buffers for the GPU. Still not sure how to resolve this >> > as we don't use DRM for our GPU driver. >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> using prime/dmabuf to share between the two. > > The PL111 primecell is just a scanout device. It has no GPU embedded > in it. It's just like the Armada/Dove DRM stuff - if a GPU is provided > it's an entirely separate piece of IP, and will likely remain a separate > non-DRM driver. > > In the case of PL111, PL111 is not manufacturer specific so the GPU can > be any standalone GPU. Oh, yeah, I know PL111 is different block independent from GPU. I wasn't advocating adding mali support in this driver. But instead creating a mali DRM driver which had GEM and gpu bits, but not KMS. Then buffer sharing w/ prime/dmabuf just like you would, for example on a laptop w/ both integrated graphics and discrete GPU. BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 14:24 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-26 14:24 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jul 26, 2013 at 10:14 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, Jul 25, 2013 at 02:21:59PM -0400, Rob Clark wrote: >> On Thu, Jul 25, 2013 at 1:17 PM, <tom.cooksey@arm.com> wrote: >> > Known issues: >> > * It uses KDS. We intend to switch to whatever implicit per-buffer >> > synchronisation mechanism gets merged, once something is merged. >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> > allocate buffers for the GPU. Still not sure how to resolve this >> > as we don't use DRM for our GPU driver. >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> using prime/dmabuf to share between the two. > > The PL111 primecell is just a scanout device. It has no GPU embedded > in it. It's just like the Armada/Dove DRM stuff - if a GPU is provided > it's an entirely separate piece of IP, and will likely remain a separate > non-DRM driver. > > In the case of PL111, PL111 is not manufacturer specific so the GPU can > be any standalone GPU. Oh, yeah, I know PL111 is different block independent from GPU. I wasn't advocating adding mali support in this driver. But instead creating a mali DRM driver which had GEM and gpu bits, but not KMS. Then buffer sharing w/ prime/dmabuf just like you would, for example on a laptop w/ both integrated graphics and discrete GPU. BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 18:21 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-07-26 15:58 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-07-26 15:58 UTC (permalink / raw) To: 'Rob Clark'; +Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel Hi Rob, > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > allocate buffers for the GPU. Still not sure how to resolve this > > as we don't use DRM for our GPU driver. > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > once the dma-fence stuff is in place), we'd have gpu-specific drm > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > using prime/dmabuf to share between the two. The "extra" buffers we were allocating from armsoc DDX were really being allocated through DRM/GEM so we could get an flink name for them and pass a reference to them back to our GPU driver on the client side. If it weren't for our need to access those extra off-screen buffers with the GPU we wouldn't need to allocate them with DRM at all. So, given they are really "GPU" buffers, it does absolutely make sense to allocate them in a different driver to the display driver. However, to avoid unnecessary memcpys & related cache maintenance ops, we'd also like the GPU to render into buffers which are scanned out by the display controller. So let's say we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan out buffers with the display's DRM driver but a custom ioctl on the GPU's DRM driver to allocate non scanout, off-screen buffers. Sounds great, but I don't think that really works with DRI2. If we used two drivers to allocate buffers, which of those drivers do we return in DRI2ConnectReply? Even if we solve that somehow, GEM flink names are name-spaced to a single device node (AFAIK). So when we do a DRI2GetBuffers, how does the EGL in the client know which DRM device owns GEM flink name "1234"? We'd need some pretty dirty hacks. So then we looked at allocating _all_ buffers with the GPU's DRM driver. That solves the DRI2 single-device-name and single name-space issue. It also means the GPU would _never_ render into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. One thing I wasn't sure about is if there was an objection to using PRIME to export scanout buffers allocated with DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU driver to be rendered into? Is that a concern? Anyway, that latter case also gets quite difficult. The "GPU" DRM driver would need to know the constraints of the display controller when allocating buffers intended to be scanned out. For example, pl111 typically isn't behind an IOMMU and so requires physically contiguous memory. We'd have to teach the GPU's DRM driver about the constraints of the display HW. Not exactly a clean driver model. :-( I'm still a little stuck on how to proceed, so any ideas would greatly appreciated! My current train of thought is having a kind of SoC-specific DRM driver which allocates buffers for both display and GPU within a single GEM namespace. That SoC-specific DRM driver could then know the constraints of both the GPU and the display HW. We could then use PRIME to export buffers allocated with the SoC DRM driver and import them into the GPU and/or display DRM driver. Note: While it doesn't use the DRM framework, the Mali T6xx kernel driver has supported importing buffers through dma_buf for some time. I've even written an EGL extension :-): <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im port.txt> > I'm not entirely sure that the directions that the current CDF > proposals are headed is necessarily the right way forward. I'd prefer > to see small/incremental evolution of KMS (ie. add drm_bridge and > drm_panel, and refactor the existing encoder-slave). Keeping it > inside drm means that we can evolve it more easily, and avoid layers > of glue code for no good reason. I think CDF could allow vendors to re-use code they've written for their Android driver stack in DRM drivers more easily. Though I guess ideally KMS would evolve to a point where it could be used by an Android driver stack. I.e. Support explicit fences. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 18:21 ` Rob Clark ` (4 preceding siblings ...) (?) @ 2013-07-26 15:58 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-07-26 15:58 UTC (permalink / raw) To: linux-arm-kernel Hi Rob, > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > allocate buffers for the GPU. Still not sure how to resolve this > > as we don't use DRM for our GPU driver. > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > once the dma-fence stuff is in place), we'd have gpu-specific drm > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > using prime/dmabuf to share between the two. The "extra" buffers we were allocating from armsoc DDX were really being allocated through DRM/GEM so we could get an flink name for them and pass a reference to them back to our GPU driver on the client side. If it weren't for our need to access those extra off-screen buffers with the GPU we wouldn't need to allocate them with DRM at all. So, given they are really "GPU" buffers, it does absolutely make sense to allocate them in a different driver to the display driver. However, to avoid unnecessary memcpys & related cache maintenance ops, we'd also like the GPU to render into buffers which are scanned out by the display controller. So let's say we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan out buffers with the display's DRM driver but a custom ioctl on the GPU's DRM driver to allocate non scanout, off-screen buffers. Sounds great, but I don't think that really works with DRI2. If we used two drivers to allocate buffers, which of those drivers do we return in DRI2ConnectReply? Even if we solve that somehow, GEM flink names are name-spaced to a single device node (AFAIK). So when we do a DRI2GetBuffers, how does the EGL in the client know which DRM device owns GEM flink name "1234"? We'd need some pretty dirty hacks. So then we looked at allocating _all_ buffers with the GPU's DRM driver. That solves the DRI2 single-device-name and single name-space issue. It also means the GPU would _never_ render into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. One thing I wasn't sure about is if there was an objection to using PRIME to export scanout buffers allocated with DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU driver to be rendered into? Is that a concern? Anyway, that latter case also gets quite difficult. The "GPU" DRM driver would need to know the constraints of the display controller when allocating buffers intended to be scanned out. For example, pl111 typically isn't behind an IOMMU and so requires physically contiguous memory. We'd have to teach the GPU's DRM driver about the constraints of the display HW. Not exactly a clean driver model. :-( I'm still a little stuck on how to proceed, so any ideas would greatly appreciated! My current train of thought is having a kind of SoC-specific DRM driver which allocates buffers for both display and GPU within a single GEM namespace. That SoC-specific DRM driver could then know the constraints of both the GPU and the display HW. We could then use PRIME to export buffers allocated with the SoC DRM driver and import them into the GPU and/or display DRM driver. Note: While it doesn't use the DRM framework, the Mali T6xx kernel driver has supported importing buffers through dma_buf for some time. I've even written an EGL extension :-): <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im port.txt> > I'm not entirely sure that the directions that the current CDF > proposals are headed is necessarily the right way forward. I'd prefer > to see small/incremental evolution of KMS (ie. add drm_bridge and > drm_panel, and refactor the existing encoder-slave). Keeping it > inside drm means that we can evolve it more easily, and avoid layers > of glue code for no good reason. I think CDF could allow vendors to re-use code they've written for their Android driver stack in DRM drivers more easily. Though I guess ideally KMS would evolve to a point where it could be used by an Android driver stack. I.e. Support explicit fences. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-25 18:21 ` Rob Clark ` (5 preceding siblings ...) (?) @ 2013-07-26 15:58 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-07-26 15:58 UTC (permalink / raw) To: 'Rob Clark'; +Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel Hi Rob, > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > allocate buffers for the GPU. Still not sure how to resolve this > > as we don't use DRM for our GPU driver. > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > once the dma-fence stuff is in place), we'd have gpu-specific drm > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > using prime/dmabuf to share between the two. The "extra" buffers we were allocating from armsoc DDX were really being allocated through DRM/GEM so we could get an flink name for them and pass a reference to them back to our GPU driver on the client side. If it weren't for our need to access those extra off-screen buffers with the GPU we wouldn't need to allocate them with DRM at all. So, given they are really "GPU" buffers, it does absolutely make sense to allocate them in a different driver to the display driver. However, to avoid unnecessary memcpys & related cache maintenance ops, we'd also like the GPU to render into buffers which are scanned out by the display controller. So let's say we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan out buffers with the display's DRM driver but a custom ioctl on the GPU's DRM driver to allocate non scanout, off-screen buffers. Sounds great, but I don't think that really works with DRI2. If we used two drivers to allocate buffers, which of those drivers do we return in DRI2ConnectReply? Even if we solve that somehow, GEM flink names are name-spaced to a single device node (AFAIK). So when we do a DRI2GetBuffers, how does the EGL in the client know which DRM device owns GEM flink name "1234"? We'd need some pretty dirty hacks. So then we looked at allocating _all_ buffers with the GPU's DRM driver. That solves the DRI2 single-device-name and single name-space issue. It also means the GPU would _never_ render into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. One thing I wasn't sure about is if there was an objection to using PRIME to export scanout buffers allocated with DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU driver to be rendered into? Is that a concern? Anyway, that latter case also gets quite difficult. The "GPU" DRM driver would need to know the constraints of the display controller when allocating buffers intended to be scanned out. For example, pl111 typically isn't behind an IOMMU and so requires physically contiguous memory. We'd have to teach the GPU's DRM driver about the constraints of the display HW. Not exactly a clean driver model. :-( I'm still a little stuck on how to proceed, so any ideas would greatly appreciated! My current train of thought is having a kind of SoC-specific DRM driver which allocates buffers for both display and GPU within a single GEM namespace. That SoC-specific DRM driver could then know the constraints of both the GPU and the display HW. We could then use PRIME to export buffers allocated with the SoC DRM driver and import them into the GPU and/or display DRM driver. Note: While it doesn't use the DRM framework, the Mali T6xx kernel driver has supported importing buffers through dma_buf for some time. I've even written an EGL extension :-): <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im port.txt> > I'm not entirely sure that the directions that the current CDF > proposals are headed is necessarily the right way forward. I'd prefer > to see small/incremental evolution of KMS (ie. add drm_bridge and > drm_panel, and refactor the existing encoder-slave). Keeping it > inside drm means that we can evolve it more easily, and avoid layers > of glue code for no good reason. I think CDF could allow vendors to re-use code they've written for their Android driver stack in DRM drivers more easily. Though I guess ideally KMS would evolve to a point where it could be used by an Android driver stack. I.e. Support explicit fences. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <51f29cd1.e686440a.66b2.fffff9d5SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <51f29cd1.e686440a.66b2.fffff9d5SMTPIN_ADDED_BROKEN@mx.google.com> 2013-07-26 18:56 ` Daniel Vetter @ 2013-07-26 18:56 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-07-26 18:56 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, 'Rob Clark', Pawel Moll, linux-fbdev, linux-arm-kernel On Fri, Jul 26, 2013 at 04:58:55PM +0100, Tom Cooksey wrote: > Hi Rob, > > > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > > allocate buffers for the GPU. Still not sure how to resolve this > > > as we don't use DRM for our GPU driver. > > > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > > once the dma-fence stuff is in place), we'd have gpu-specific drm > > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > > using prime/dmabuf to share between the two. > > The "extra" buffers we were allocating from armsoc DDX were really > being allocated through DRM/GEM so we could get an flink name > for them and pass a reference to them back to our GPU driver on > the client side. If it weren't for our need to access those > extra off-screen buffers with the GPU we wouldn't need to > allocate them with DRM at all. So, given they are really "GPU" > buffers, it does absolutely make sense to allocate them in a > different driver to the display driver. > > However, to avoid unnecessary memcpys & related cache > maintenance ops, we'd also like the GPU to render into buffers > which are scanned out by the display controller. So let's say > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > out buffers with the display's DRM driver but a custom ioctl > on the GPU's DRM driver to allocate non scanout, off-screen > buffers. Sounds great, but I don't think that really works > with DRI2. If we used two drivers to allocate buffers, which > of those drivers do we return in DRI2ConnectReply? Even if we > solve that somehow, GEM flink names are name-spaced to a > single device node (AFAIK). So when we do a DRI2GetBuffers, > how does the EGL in the client know which DRM device owns GEM > flink name "1234"? We'd need some pretty dirty hacks. I don't know the details, but having different gem driver nodes is exactly what prime support in X allows. The X server then passes the buffer object between the different ddx drivers using dma-buf sharing. So if we have two drm drivers, one a kms-only scanout driver and one a gem-only mali driver, then the userspace gl mali driver would only ever talk to the mali drm node. So flink for DRI2 would use that node exclusively. The scanout drm node would then be treated like e.g. an usb display-link device which is also display-only. > So then we looked at allocating _all_ buffers with the GPU's > DRM driver. That solves the DRI2 single-device-name and single > name-space issue. It also means the GPU would _never_ render > into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. > One thing I wasn't sure about is if there was an objection > to using PRIME to export scanout buffers allocated with > DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU > driver to be rendered into? Is that a concern? Imo sharing dumb buffers should be ok. The "dumb" concept is only really relevant when the display and render part are integrated into one IP block and driver. > Anyway, that latter case also gets quite difficult. The "GPU" > DRM driver would need to know the constraints of the display > controller when allocating buffers intended to be scanned out. > For example, pl111 typically isn't behind an IOMMU and so > requires physically contiguous memory. We'd have to teach the > GPU's DRM driver about the constraints of the display HW. Not > exactly a clean driver model. :-( Well the current dma-buf sharing code essentially only really works on x86 where everyone has a decent graphics tt and no cache flushing is required. For ARM I expect that we need to have a common dma-buf backing storage layer which walks all attached devices on the dma-buf, checks allocation constraints and then picks a suitable pool to allocate the buffer. Since dma-bufs should only be allocated once they're getting used (and not when establishing the sharing) that should work out. But with the current dma apis exposed to drivers that's not possible really. Essentially we need ion but please not expose the heaps explicitly to userspace. The kernel already knows all this, so could take care of this for userspace without breaking the current dma api abstractions we have. Of course if you start to share buffers between different userspace drives they need to talk to each another about what stride/pixel layout/tiling is suitable. Since I'm a kernel guy I'll punt on this problem ;-) > I'm still a little stuck on how to proceed, so any ideas > would greatly appreciated! My current train of thought is > having a kind of SoC-specific DRM driver which allocates > buffers for both display and GPU within a single GEM > namespace. That SoC-specific DRM driver could then know the > constraints of both the GPU and the display HW. We could then > use PRIME to export buffers allocated with the SoC DRM driver > and import them into the GPU and/or display DRM driver. That's pretty much how ION works and I'm not in favour of leaking allocation constraints to userspace like that ... > Note: While it doesn't use the DRM framework, the Mali T6xx > kernel driver has supported importing buffers through dma_buf > for some time. I've even written an EGL extension :-): > > <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im > port.txt> > > > > I'm not entirely sure that the directions that the current CDF > > proposals are headed is necessarily the right way forward. I'd prefer > > to see small/incremental evolution of KMS (ie. add drm_bridge and > > drm_panel, and refactor the existing encoder-slave). Keeping it > > inside drm means that we can evolve it more easily, and avoid layers > > of glue code for no good reason. > > I think CDF could allow vendors to re-use code they've written > for their Android driver stack in DRM drivers more easily. Though > I guess ideally KMS would evolve to a point where it could be used > by an Android driver stack. I.e. Support explicit fences. Just fyi with intel we have some DSI/MIPI support patches floating around. Our plan is to just merge them and then once we have a 2nd drm driver with DSI support grow some common infrastructure out of this. The same approach seems to work neatly for hdmi (infoframes) and dp (although the dp helpers can be seriously extended). Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 18:56 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-07-26 18:56 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, 'Rob Clark', Pawel Moll, linux-fbdev, linux-arm-kernel On Fri, Jul 26, 2013 at 04:58:55PM +0100, Tom Cooksey wrote: > Hi Rob, > > > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > > allocate buffers for the GPU. Still not sure how to resolve this > > > as we don't use DRM for our GPU driver. > > > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > > once the dma-fence stuff is in place), we'd have gpu-specific drm > > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > > using prime/dmabuf to share between the two. > > The "extra" buffers we were allocating from armsoc DDX were really > being allocated through DRM/GEM so we could get an flink name > for them and pass a reference to them back to our GPU driver on > the client side. If it weren't for our need to access those > extra off-screen buffers with the GPU we wouldn't need to > allocate them with DRM at all. So, given they are really "GPU" > buffers, it does absolutely make sense to allocate them in a > different driver to the display driver. > > However, to avoid unnecessary memcpys & related cache > maintenance ops, we'd also like the GPU to render into buffers > which are scanned out by the display controller. So let's say > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > out buffers with the display's DRM driver but a custom ioctl > on the GPU's DRM driver to allocate non scanout, off-screen > buffers. Sounds great, but I don't think that really works > with DRI2. If we used two drivers to allocate buffers, which > of those drivers do we return in DRI2ConnectReply? Even if we > solve that somehow, GEM flink names are name-spaced to a > single device node (AFAIK). So when we do a DRI2GetBuffers, > how does the EGL in the client know which DRM device owns GEM > flink name "1234"? We'd need some pretty dirty hacks. I don't know the details, but having different gem driver nodes is exactly what prime support in X allows. The X server then passes the buffer object between the different ddx drivers using dma-buf sharing. So if we have two drm drivers, one a kms-only scanout driver and one a gem-only mali driver, then the userspace gl mali driver would only ever talk to the mali drm node. So flink for DRI2 would use that node exclusively. The scanout drm node would then be treated like e.g. an usb display-link device which is also display-only. > So then we looked at allocating _all_ buffers with the GPU's > DRM driver. That solves the DRI2 single-device-name and single > name-space issue. It also means the GPU would _never_ render > into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. > One thing I wasn't sure about is if there was an objection > to using PRIME to export scanout buffers allocated with > DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU > driver to be rendered into? Is that a concern? Imo sharing dumb buffers should be ok. The "dumb" concept is only really relevant when the display and render part are integrated into one IP block and driver. > Anyway, that latter case also gets quite difficult. The "GPU" > DRM driver would need to know the constraints of the display > controller when allocating buffers intended to be scanned out. > For example, pl111 typically isn't behind an IOMMU and so > requires physically contiguous memory. We'd have to teach the > GPU's DRM driver about the constraints of the display HW. Not > exactly a clean driver model. :-( Well the current dma-buf sharing code essentially only really works on x86 where everyone has a decent graphics tt and no cache flushing is required. For ARM I expect that we need to have a common dma-buf backing storage layer which walks all attached devices on the dma-buf, checks allocation constraints and then picks a suitable pool to allocate the buffer. Since dma-bufs should only be allocated once they're getting used (and not when establishing the sharing) that should work out. But with the current dma apis exposed to drivers that's not possible really. Essentially we need ion but please not expose the heaps explicitly to userspace. The kernel already knows all this, so could take care of this for userspace without breaking the current dma api abstractions we have. Of course if you start to share buffers between different userspace drives they need to talk to each another about what stride/pixel layout/tiling is suitable. Since I'm a kernel guy I'll punt on this problem ;-) > I'm still a little stuck on how to proceed, so any ideas > would greatly appreciated! My current train of thought is > having a kind of SoC-specific DRM driver which allocates > buffers for both display and GPU within a single GEM > namespace. That SoC-specific DRM driver could then know the > constraints of both the GPU and the display HW. We could then > use PRIME to export buffers allocated with the SoC DRM driver > and import them into the GPU and/or display DRM driver. That's pretty much how ION works and I'm not in favour of leaking allocation constraints to userspace like that ... > Note: While it doesn't use the DRM framework, the Mali T6xx > kernel driver has supported importing buffers through dma_buf > for some time. I've even written an EGL extension :-): > > <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im > port.txt> > > > > I'm not entirely sure that the directions that the current CDF > > proposals are headed is necessarily the right way forward. I'd prefer > > to see small/incremental evolution of KMS (ie. add drm_bridge and > > drm_panel, and refactor the existing encoder-slave). Keeping it > > inside drm means that we can evolve it more easily, and avoid layers > > of glue code for no good reason. > > I think CDF could allow vendors to re-use code they've written > for their Android driver stack in DRM drivers more easily. Though > I guess ideally KMS would evolve to a point where it could be used > by an Android driver stack. I.e. Support explicit fences. Just fyi with intel we have some DSI/MIPI support patches floating around. Our plan is to just merge them and then once we have a 2nd drm driver with DSI support grow some common infrastructure out of this. The same approach seems to work neatly for hdmi (infoframes) and dp (although the dp helpers can be seriously extended). Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-26 18:56 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-07-26 18:56 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jul 26, 2013 at 04:58:55PM +0100, Tom Cooksey wrote: > Hi Rob, > > > > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > > > allocate buffers for the GPU. Still not sure how to resolve this > > > as we don't use DRM for our GPU driver. > > > > any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > > once the dma-fence stuff is in place), we'd have gpu-specific drm > > (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > > using prime/dmabuf to share between the two. > > The "extra" buffers we were allocating from armsoc DDX were really > being allocated through DRM/GEM so we could get an flink name > for them and pass a reference to them back to our GPU driver on > the client side. If it weren't for our need to access those > extra off-screen buffers with the GPU we wouldn't need to > allocate them with DRM at all. So, given they are really "GPU" > buffers, it does absolutely make sense to allocate them in a > different driver to the display driver. > > However, to avoid unnecessary memcpys & related cache > maintenance ops, we'd also like the GPU to render into buffers > which are scanned out by the display controller. So let's say > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > out buffers with the display's DRM driver but a custom ioctl > on the GPU's DRM driver to allocate non scanout, off-screen > buffers. Sounds great, but I don't think that really works > with DRI2. If we used two drivers to allocate buffers, which > of those drivers do we return in DRI2ConnectReply? Even if we > solve that somehow, GEM flink names are name-spaced to a > single device node (AFAIK). So when we do a DRI2GetBuffers, > how does the EGL in the client know which DRM device owns GEM > flink name "1234"? We'd need some pretty dirty hacks. I don't know the details, but having different gem driver nodes is exactly what prime support in X allows. The X server then passes the buffer object between the different ddx drivers using dma-buf sharing. So if we have two drm drivers, one a kms-only scanout driver and one a gem-only mali driver, then the userspace gl mali driver would only ever talk to the mali drm node. So flink for DRI2 would use that node exclusively. The scanout drm node would then be treated like e.g. an usb display-link device which is also display-only. > So then we looked at allocating _all_ buffers with the GPU's > DRM driver. That solves the DRI2 single-device-name and single > name-space issue. It also means the GPU would _never_ render > into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. > One thing I wasn't sure about is if there was an objection > to using PRIME to export scanout buffers allocated with > DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU > driver to be rendered into? Is that a concern? Imo sharing dumb buffers should be ok. The "dumb" concept is only really relevant when the display and render part are integrated into one IP block and driver. > Anyway, that latter case also gets quite difficult. The "GPU" > DRM driver would need to know the constraints of the display > controller when allocating buffers intended to be scanned out. > For example, pl111 typically isn't behind an IOMMU and so > requires physically contiguous memory. We'd have to teach the > GPU's DRM driver about the constraints of the display HW. Not > exactly a clean driver model. :-( Well the current dma-buf sharing code essentially only really works on x86 where everyone has a decent graphics tt and no cache flushing is required. For ARM I expect that we need to have a common dma-buf backing storage layer which walks all attached devices on the dma-buf, checks allocation constraints and then picks a suitable pool to allocate the buffer. Since dma-bufs should only be allocated once they're getting used (and not when establishing the sharing) that should work out. But with the current dma apis exposed to drivers that's not possible really. Essentially we need ion but please not expose the heaps explicitly to userspace. The kernel already knows all this, so could take care of this for userspace without breaking the current dma api abstractions we have. Of course if you start to share buffers between different userspace drives they need to talk to each another about what stride/pixel layout/tiling is suitable. Since I'm a kernel guy I'll punt on this problem ;-) > I'm still a little stuck on how to proceed, so any ideas > would greatly appreciated! My current train of thought is > having a kind of SoC-specific DRM driver which allocates > buffers for both display and GPU within a single GEM > namespace. That SoC-specific DRM driver could then know the > constraints of both the GPU and the display HW. We could then > use PRIME to export buffers allocated with the SoC DRM driver > and import them into the GPU and/or display DRM driver. That's pretty much how ION works and I'm not in favour of leaking allocation constraints to userspace like that ... > Note: While it doesn't use the DRM framework, the Mali T6xx > kernel driver has supported importing buffers through dma_buf > for some time. I've even written an EGL extension :-): > > <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im > port.txt> > > > > I'm not entirely sure that the directions that the current CDF > > proposals are headed is necessarily the right way forward. I'd prefer > > to see small/incremental evolution of KMS (ie. add drm_bridge and > > drm_panel, and refactor the existing encoder-slave). Keeping it > > inside drm means that we can evolve it more easily, and avoid layers > > of glue code for no good reason. > > I think CDF could allow vendors to re-use code they've written > for their Android driver stack in DRM drivers more easily. Though > I guess ideally KMS would evolve to a point where it could be used > by an Android driver stack. I.e. Support explicit fences. Just fyi with intel we have some DSI/MIPI support patches floating around. Our plan is to just merge them and then once we have a 2nd drm driver with DSI support grow some common infrastructure out of this. The same approach seems to work neatly for hdmi (infoframes) and dp (although the dp helpers can be seriously extended). Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <51f29ccd.f014b40a.34cc.ffffca2aSMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <51f29ccd.f014b40a.34cc.ffffca2aSMTPIN_ADDED_BROKEN@mx.google.com> 2013-07-29 14:58 ` Rob Clark @ 2013-07-29 14:58 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-29 14:58 UTC (permalink / raw) To: Tom Cooksey; +Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> > allocate buffers for the GPU. Still not sure how to resolve this >> > as we don't use DRM for our GPU driver. >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> using prime/dmabuf to share between the two. > > The "extra" buffers we were allocating from armsoc DDX were really > being allocated through DRM/GEM so we could get an flink name > for them and pass a reference to them back to our GPU driver on > the client side. If it weren't for our need to access those > extra off-screen buffers with the GPU we wouldn't need to > allocate them with DRM at all. So, given they are really "GPU" > buffers, it does absolutely make sense to allocate them in a > different driver to the display driver. > > However, to avoid unnecessary memcpys & related cache > maintenance ops, we'd also like the GPU to render into buffers > which are scanned out by the display controller. So let's say > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > out buffers with the display's DRM driver but a custom ioctl > on the GPU's DRM driver to allocate non scanout, off-screen > buffers. Sounds great, but I don't think that really works > with DRI2. If we used two drivers to allocate buffers, which > of those drivers do we return in DRI2ConnectReply? Even if we > solve that somehow, GEM flink names are name-spaced to a > single device node (AFAIK). So when we do a DRI2GetBuffers, > how does the EGL in the client know which DRM device owns GEM > flink name "1234"? We'd need some pretty dirty hacks. You would return the name of the display driver allocating the buffers. On the client side you can use generic ioctls to go from flink -> handle -> dmabuf. So the client side would end up opening both the display drm device and the gpu, but without needing to know too much about the display. (Probably in (for example) mesa there needs to be a bit of work to handle this better, but I think that would be needed as well for sharing between, say, nouveau and udl displaylink driver.. which is really the same scenario.) > So then we looked at allocating _all_ buffers with the GPU's > DRM driver. That solves the DRI2 single-device-name and single > name-space issue. It also means the GPU would _never_ render > into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. Well, I think we can differentiate between shared buffers and internal buffers (textures, vertex upload, etc, etc). For example, in mesa/gallium driver there are two paths to getting a buffer... ->resource_create() and ->resource_from_handle(), the latter is the path you go for dri2 buffers shared w/ x11. The former is buffers that are just internal to gpu (if !(bind & PIPE_BIND_SHARED)). I guess you must have something similar in mali driver. > One thing I wasn't sure about is if there was an objection > to using PRIME to export scanout buffers allocated with > DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU > driver to be rendered into? Is that a concern? well.. it wasn't quite the original intention of the 'dumb' ioctls. But I guess in the end if you hand the gpu a buffer with a layout/format that it can't grok, then gpu does a staging texture plus copy. If you had, for example, some setup w/ gpu that could only render to tiled format, plus a display that could scanout that format, then your DDX driver would need to allocate the dri2 buffers with something other than dumb ioctl. (But at that point, you probably need to implement your own EXA so you could hook in your own custom upload-to/download-from screen fxns for sw fallbacks, so you're not talking about using generic DDX anyways.) > Anyway, that latter case also gets quite difficult. The "GPU" > DRM driver would need to know the constraints of the display > controller when allocating buffers intended to be scanned out. > For example, pl111 typically isn't behind an IOMMU and so > requires physically contiguous memory. We'd have to teach the > GPU's DRM driver about the constraints of the display HW. Not > exactly a clean driver model. :-( > > I'm still a little stuck on how to proceed, so any ideas > would greatly appreciated! My current train of thought is > having a kind of SoC-specific DRM driver which allocates > buffers for both display and GPU within a single GEM > namespace. That SoC-specific DRM driver could then know the > constraints of both the GPU and the display HW. We could then > use PRIME to export buffers allocated with the SoC DRM driver > and import them into the GPU and/or display DRM driver. Usually if the display drm driver is allocating the buffers that might be scanned out, it just needs to have minimal knowledge of the GPU (pitch alignment constraints). I don't think we need a 3rd device just to allocate buffers. Really I don't think the separate display drm and gpu drm device is much different from desktop udl/displaylink case. Or desktop integrated-gpu + external gpu. > Note: While it doesn't use the DRM framework, the Mali T6xx > kernel driver has supported importing buffers through dma_buf > for some time. I've even written an EGL extension :-): > > <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im > port.txt> > > >> I'm not entirely sure that the directions that the current CDF >> proposals are headed is necessarily the right way forward. I'd prefer >> to see small/incremental evolution of KMS (ie. add drm_bridge and >> drm_panel, and refactor the existing encoder-slave). Keeping it >> inside drm means that we can evolve it more easily, and avoid layers >> of glue code for no good reason. > > I think CDF could allow vendors to re-use code they've written > for their Android driver stack in DRM drivers more easily. Though > I guess ideally KMS would evolve to a point where it could be used > by an Android driver stack. I.e. Support explicit fences. > yeah, the best would be evolving DRM/KMS to the point that it meets android requirements. Because I really don't think android has any special requirements compared to, say, a wayland compositor. (Other than the way they do synchronization. But that doesn't really *need* to be different, as far as I can tell.) It would certainly be easier if android dev's actually participated in upstream linux graphics, and talked about what they needed, and maybe even contributed a patch or two. But that is a whole different topic. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-29 14:58 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-29 14:58 UTC (permalink / raw) To: Tom Cooksey; +Cc: linux-arm-kernel, linux-fbdev, Pawel Moll, dri-devel On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> > allocate buffers for the GPU. Still not sure how to resolve this >> > as we don't use DRM for our GPU driver. >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> using prime/dmabuf to share between the two. > > The "extra" buffers we were allocating from armsoc DDX were really > being allocated through DRM/GEM so we could get an flink name > for them and pass a reference to them back to our GPU driver on > the client side. If it weren't for our need to access those > extra off-screen buffers with the GPU we wouldn't need to > allocate them with DRM at all. So, given they are really "GPU" > buffers, it does absolutely make sense to allocate them in a > different driver to the display driver. > > However, to avoid unnecessary memcpys & related cache > maintenance ops, we'd also like the GPU to render into buffers > which are scanned out by the display controller. So let's say > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > out buffers with the display's DRM driver but a custom ioctl > on the GPU's DRM driver to allocate non scanout, off-screen > buffers. Sounds great, but I don't think that really works > with DRI2. If we used two drivers to allocate buffers, which > of those drivers do we return in DRI2ConnectReply? Even if we > solve that somehow, GEM flink names are name-spaced to a > single device node (AFAIK). So when we do a DRI2GetBuffers, > how does the EGL in the client know which DRM device owns GEM > flink name "1234"? We'd need some pretty dirty hacks. You would return the name of the display driver allocating the buffers. On the client side you can use generic ioctls to go from flink -> handle -> dmabuf. So the client side would end up opening both the display drm device and the gpu, but without needing to know too much about the display. (Probably in (for example) mesa there needs to be a bit of work to handle this better, but I think that would be needed as well for sharing between, say, nouveau and udl displaylink driver.. which is really the same scenario.) > So then we looked at allocating _all_ buffers with the GPU's > DRM driver. That solves the DRI2 single-device-name and single > name-space issue. It also means the GPU would _never_ render > into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. Well, I think we can differentiate between shared buffers and internal buffers (textures, vertex upload, etc, etc). For example, in mesa/gallium driver there are two paths to getting a buffer... ->resource_create() and ->resource_from_handle(), the latter is the path you go for dri2 buffers shared w/ x11. The former is buffers that are just internal to gpu (if !(bind & PIPE_BIND_SHARED)). I guess you must have something similar in mali driver. > One thing I wasn't sure about is if there was an objection > to using PRIME to export scanout buffers allocated with > DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU > driver to be rendered into? Is that a concern? well.. it wasn't quite the original intention of the 'dumb' ioctls. But I guess in the end if you hand the gpu a buffer with a layout/format that it can't grok, then gpu does a staging texture plus copy. If you had, for example, some setup w/ gpu that could only render to tiled format, plus a display that could scanout that format, then your DDX driver would need to allocate the dri2 buffers with something other than dumb ioctl. (But at that point, you probably need to implement your own EXA so you could hook in your own custom upload-to/download-from screen fxns for sw fallbacks, so you're not talking about using generic DDX anyways.) > Anyway, that latter case also gets quite difficult. The "GPU" > DRM driver would need to know the constraints of the display > controller when allocating buffers intended to be scanned out. > For example, pl111 typically isn't behind an IOMMU and so > requires physically contiguous memory. We'd have to teach the > GPU's DRM driver about the constraints of the display HW. Not > exactly a clean driver model. :-( > > I'm still a little stuck on how to proceed, so any ideas > would greatly appreciated! My current train of thought is > having a kind of SoC-specific DRM driver which allocates > buffers for both display and GPU within a single GEM > namespace. That SoC-specific DRM driver could then know the > constraints of both the GPU and the display HW. We could then > use PRIME to export buffers allocated with the SoC DRM driver > and import them into the GPU and/or display DRM driver. Usually if the display drm driver is allocating the buffers that might be scanned out, it just needs to have minimal knowledge of the GPU (pitch alignment constraints). I don't think we need a 3rd device just to allocate buffers. Really I don't think the separate display drm and gpu drm device is much different from desktop udl/displaylink case. Or desktop integrated-gpu + external gpu. > Note: While it doesn't use the DRM framework, the Mali T6xx > kernel driver has supported importing buffers through dma_buf > for some time. I've even written an EGL extension :-): > > <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im > port.txt> > > >> I'm not entirely sure that the directions that the current CDF >> proposals are headed is necessarily the right way forward. I'd prefer >> to see small/incremental evolution of KMS (ie. add drm_bridge and >> drm_panel, and refactor the existing encoder-slave). Keeping it >> inside drm means that we can evolve it more easily, and avoid layers >> of glue code for no good reason. > > I think CDF could allow vendors to re-use code they've written > for their Android driver stack in DRM drivers more easily. Though > I guess ideally KMS would evolve to a point where it could be used > by an Android driver stack. I.e. Support explicit fences. > yeah, the best would be evolving DRM/KMS to the point that it meets android requirements. Because I really don't think android has any special requirements compared to, say, a wayland compositor. (Other than the way they do synchronization. But that doesn't really *need* to be different, as far as I can tell.) It would certainly be easier if android dev's actually participated in upstream linux graphics, and talked about what they needed, and maybe even contributed a patch or two. But that is a whole different topic. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-07-29 14:58 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-07-29 14:58 UTC (permalink / raw) To: linux-arm-kernel On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> > allocate buffers for the GPU. Still not sure how to resolve this >> > as we don't use DRM for our GPU driver. >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> using prime/dmabuf to share between the two. > > The "extra" buffers we were allocating from armsoc DDX were really > being allocated through DRM/GEM so we could get an flink name > for them and pass a reference to them back to our GPU driver on > the client side. If it weren't for our need to access those > extra off-screen buffers with the GPU we wouldn't need to > allocate them with DRM at all. So, given they are really "GPU" > buffers, it does absolutely make sense to allocate them in a > different driver to the display driver. > > However, to avoid unnecessary memcpys & related cache > maintenance ops, we'd also like the GPU to render into buffers > which are scanned out by the display controller. So let's say > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > out buffers with the display's DRM driver but a custom ioctl > on the GPU's DRM driver to allocate non scanout, off-screen > buffers. Sounds great, but I don't think that really works > with DRI2. If we used two drivers to allocate buffers, which > of those drivers do we return in DRI2ConnectReply? Even if we > solve that somehow, GEM flink names are name-spaced to a > single device node (AFAIK). So when we do a DRI2GetBuffers, > how does the EGL in the client know which DRM device owns GEM > flink name "1234"? We'd need some pretty dirty hacks. You would return the name of the display driver allocating the buffers. On the client side you can use generic ioctls to go from flink -> handle -> dmabuf. So the client side would end up opening both the display drm device and the gpu, but without needing to know too much about the display. (Probably in (for example) mesa there needs to be a bit of work to handle this better, but I think that would be needed as well for sharing between, say, nouveau and udl displaylink driver.. which is really the same scenario.) > So then we looked at allocating _all_ buffers with the GPU's > DRM driver. That solves the DRI2 single-device-name and single > name-space issue. It also means the GPU would _never_ render > into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. Well, I think we can differentiate between shared buffers and internal buffers (textures, vertex upload, etc, etc). For example, in mesa/gallium driver there are two paths to getting a buffer... ->resource_create() and ->resource_from_handle(), the latter is the path you go for dri2 buffers shared w/ x11. The former is buffers that are just internal to gpu (if !(bind & PIPE_BIND_SHARED)). I guess you must have something similar in mali driver. > One thing I wasn't sure about is if there was an objection > to using PRIME to export scanout buffers allocated with > DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU > driver to be rendered into? Is that a concern? well.. it wasn't quite the original intention of the 'dumb' ioctls. But I guess in the end if you hand the gpu a buffer with a layout/format that it can't grok, then gpu does a staging texture plus copy. If you had, for example, some setup w/ gpu that could only render to tiled format, plus a display that could scanout that format, then your DDX driver would need to allocate the dri2 buffers with something other than dumb ioctl. (But at that point, you probably need to implement your own EXA so you could hook in your own custom upload-to/download-from screen fxns for sw fallbacks, so you're not talking about using generic DDX anyways.) > Anyway, that latter case also gets quite difficult. The "GPU" > DRM driver would need to know the constraints of the display > controller when allocating buffers intended to be scanned out. > For example, pl111 typically isn't behind an IOMMU and so > requires physically contiguous memory. We'd have to teach the > GPU's DRM driver about the constraints of the display HW. Not > exactly a clean driver model. :-( > > I'm still a little stuck on how to proceed, so any ideas > would greatly appreciated! My current train of thought is > having a kind of SoC-specific DRM driver which allocates > buffers for both display and GPU within a single GEM > namespace. That SoC-specific DRM driver could then know the > constraints of both the GPU and the display HW. We could then > use PRIME to export buffers allocated with the SoC DRM driver > and import them into the GPU and/or display DRM driver. Usually if the display drm driver is allocating the buffers that might be scanned out, it just needs to have minimal knowledge of the GPU (pitch alignment constraints). I don't think we need a 3rd device just to allocate buffers. Really I don't think the separate display drm and gpu drm device is much different from desktop udl/displaylink case. Or desktop integrated-gpu + external gpu. > Note: While it doesn't use the DRM framework, the Mali T6xx > kernel driver has supported importing buffers through dma_buf > for some time. I've even written an EGL extension :-): > > <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im > port.txt> > > >> I'm not entirely sure that the directions that the current CDF >> proposals are headed is necessarily the right way forward. I'd prefer >> to see small/incremental evolution of KMS (ie. add drm_bridge and >> drm_panel, and refactor the existing encoder-slave). Keeping it >> inside drm means that we can evolve it more easily, and avoid layers >> of glue code for no good reason. > > I think CDF could allow vendors to re-use code they've written > for their Android driver stack in DRM drivers more easily. Though > I guess ideally KMS would evolve to a point where it could be used > by an Android driver stack. I.e. Support explicit fences. > yeah, the best would be evolving DRM/KMS to the point that it meets android requirements. Because I really don't think android has any special requirements compared to, say, a wayland compositor. (Other than the way they do synchronization. But that doesn't really *need* to be different, as far as I can tell.) It would certainly be easier if android dev's actually participated in upstream linux graphics, and talked about what they needed, and maybe even contributed a patch or two. But that is a whole different topic. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-29 14:58 ` Rob Clark (?) (?) @ 2013-08-05 17:10 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-05 17:10 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi Rob, +linux-media, +linaro-mm-sig for discussion of video/camera buffer constraints... > On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > wrote: > >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > >> > allocate buffers for the GPU. Still not sure how to resolve > >> > this as we don't use DRM for our GPU driver. > >> > >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > >> once the dma-fence stuff is in place), we'd have gpu-specific drm > >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > >> using prime/dmabuf to share between the two. > > > > The "extra" buffers we were allocating from armsoc DDX were really > > being allocated through DRM/GEM so we could get an flink name > > for them and pass a reference to them back to our GPU driver on > > the client side. If it weren't for our need to access those > > extra off-screen buffers with the GPU we wouldn't need to > > allocate them with DRM at all. So, given they are really "GPU" > > buffers, it does absolutely make sense to allocate them in a > > different driver to the display driver. > > > > However, to avoid unnecessary memcpys & related cache > > maintenance ops, we'd also like the GPU to render into buffers > > which are scanned out by the display controller. So let's say > > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > out buffers with the display's DRM driver but a custom ioctl > > on the GPU's DRM driver to allocate non scanout, off-screen > > buffers. Sounds great, but I don't think that really works > > with DRI2. If we used two drivers to allocate buffers, which > > of those drivers do we return in DRI2ConnectReply? Even if we > > solve that somehow, GEM flink names are name-spaced to a > > single device node (AFAIK). So when we do a DRI2GetBuffers, > > how does the EGL in the client know which DRM device owns GEM > > flink name "1234"? We'd need some pretty dirty hacks. > > You would return the name of the display driver allocating the > buffers. On the client side you can use generic ioctls to go from > flink -> handle -> dmabuf. So the client side would end up opening > both the display drm device and the gpu, but without needing to know > too much about the display. I think the bit I was missing was that a GEM bo for a buffer imported using dma_buf/PRIME can still be flink'd. So the display controller's DRM driver allocates scan-out buffers via the DUMB buffer allocate ioctl. Those scan-out buffers than then be exported from the dispaly's DRM driver and imported into the GPU's DRM driver using PRIME. Once imported into the GPU's driver, we can use flink to get a name for that buffer within the GPU DRM driver's name-space to return to the DRI2 client. That same namespace is also what DRI2 back-buffers are allocated from, so I think that could work... Except... > > Anyway, that latter case also gets quite difficult. The "GPU" > > DRM driver would need to know the constraints of the display > > controller when allocating buffers intended to be scanned out. > > For example, pl111 typically isn't behind an IOMMU and so > > requires physically contiguous memory. We'd have to teach the > > GPU's DRM driver about the constraints of the display HW. Not > > exactly a clean driver model. :-( > > > > I'm still a little stuck on how to proceed, so any ideas > > would greatly appreciated! My current train of thought is > > having a kind of SoC-specific DRM driver which allocates > > buffers for both display and GPU within a single GEM > > namespace. That SoC-specific DRM driver could then know the > > constraints of both the GPU and the display HW. We could then > > use PRIME to export buffers allocated with the SoC DRM driver > > and import them into the GPU and/or display DRM driver. > > Usually if the display drm driver is allocating the buffers that might > be scanned out, it just needs to have minimal knowledge of the GPU > (pitch alignment constraints). I don't think we need a 3rd device > just to allocate buffers. While Mali can render to pretty much any buffer, there is a mild performance improvement to be had if the buffer stride is aligned to the AXI bus's max burst length when drawing to the buffer. So in some respects, there is a constraint on how buffers which will be drawn to using the GPU are allocated. I don't really like the idea of teaching the display controller DRM driver about the GPU buffer constraints, even if they are fairly trivial like this. If the same display HW IP is being used on several SoCs, it seems wrong somehow to enforce those GPU constraints if some of those SoCs don't have a GPU. We may also then have additional constraints when sharing buffers between the display HW and video decode or even camera ISP HW. Programmatically describing buffer allocation constraints is very difficult and I'm not sure you can actually do it - there's some pretty complex constraints out there! E.g. I believe there's a platform where Y and UV planes of the reference frame need to be in separate DRAM banks for real-time 1080p decode, or something like that? Anyway, I guess my point is that even if we solve how to allocate buffers which will be shared between the GPU and display HW such that both sets of constraints are satisfied, that may not be the end of the story. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-29 14:58 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-05 17:10 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-05 17:10 UTC (permalink / raw) To: linux-arm-kernel Hi Rob, +linux-media, +linaro-mm-sig for discussion of video/camera buffer constraints... > On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > wrote: > >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > >> > allocate buffers for the GPU. Still not sure how to resolve > >> > this as we don't use DRM for our GPU driver. > >> > >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > >> once the dma-fence stuff is in place), we'd have gpu-specific drm > >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > >> using prime/dmabuf to share between the two. > > > > The "extra" buffers we were allocating from armsoc DDX were really > > being allocated through DRM/GEM so we could get an flink name > > for them and pass a reference to them back to our GPU driver on > > the client side. If it weren't for our need to access those > > extra off-screen buffers with the GPU we wouldn't need to > > allocate them with DRM at all. So, given they are really "GPU" > > buffers, it does absolutely make sense to allocate them in a > > different driver to the display driver. > > > > However, to avoid unnecessary memcpys & related cache > > maintenance ops, we'd also like the GPU to render into buffers > > which are scanned out by the display controller. So let's say > > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > out buffers with the display's DRM driver but a custom ioctl > > on the GPU's DRM driver to allocate non scanout, off-screen > > buffers. Sounds great, but I don't think that really works > > with DRI2. If we used two drivers to allocate buffers, which > > of those drivers do we return in DRI2ConnectReply? Even if we > > solve that somehow, GEM flink names are name-spaced to a > > single device node (AFAIK). So when we do a DRI2GetBuffers, > > how does the EGL in the client know which DRM device owns GEM > > flink name "1234"? We'd need some pretty dirty hacks. > > You would return the name of the display driver allocating the > buffers. On the client side you can use generic ioctls to go from > flink -> handle -> dmabuf. So the client side would end up opening > both the display drm device and the gpu, but without needing to know > too much about the display. I think the bit I was missing was that a GEM bo for a buffer imported using dma_buf/PRIME can still be flink'd. So the display controller's DRM driver allocates scan-out buffers via the DUMB buffer allocate ioctl. Those scan-out buffers than then be exported from the dispaly's DRM driver and imported into the GPU's DRM driver using PRIME. Once imported into the GPU's driver, we can use flink to get a name for that buffer within the GPU DRM driver's name-space to return to the DRI2 client. That same namespace is also what DRI2 back-buffers are allocated from, so I think that could work... Except... > > Anyway, that latter case also gets quite difficult. The "GPU" > > DRM driver would need to know the constraints of the display > > controller when allocating buffers intended to be scanned out. > > For example, pl111 typically isn't behind an IOMMU and so > > requires physically contiguous memory. We'd have to teach the > > GPU's DRM driver about the constraints of the display HW. Not > > exactly a clean driver model. :-( > > > > I'm still a little stuck on how to proceed, so any ideas > > would greatly appreciated! My current train of thought is > > having a kind of SoC-specific DRM driver which allocates > > buffers for both display and GPU within a single GEM > > namespace. That SoC-specific DRM driver could then know the > > constraints of both the GPU and the display HW. We could then > > use PRIME to export buffers allocated with the SoC DRM driver > > and import them into the GPU and/or display DRM driver. > > Usually if the display drm driver is allocating the buffers that might > be scanned out, it just needs to have minimal knowledge of the GPU > (pitch alignment constraints). I don't think we need a 3rd device > just to allocate buffers. While Mali can render to pretty much any buffer, there is a mild performance improvement to be had if the buffer stride is aligned to the AXI bus's max burst length when drawing to the buffer. So in some respects, there is a constraint on how buffers which will be drawn to using the GPU are allocated. I don't really like the idea of teaching the display controller DRM driver about the GPU buffer constraints, even if they are fairly trivial like this. If the same display HW IP is being used on several SoCs, it seems wrong somehow to enforce those GPU constraints if some of those SoCs don't have a GPU. We may also then have additional constraints when sharing buffers between the display HW and video decode or even camera ISP HW. Programmatically describing buffer allocation constraints is very difficult and I'm not sure you can actually do it - there's some pretty complex constraints out there! E.g. I believe there's a platform where Y and UV planes of the reference frame need to be in separate DRAM banks for real-time 1080p decode, or something like that? Anyway, I guess my point is that even if we solve how to allocate buffers which will be shared between the GPU and display HW such that both sets of constraints are satisfied, that may not be the end of the story. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-29 14:58 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-05 17:10 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-05 17:10 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi Rob, +linux-media, +linaro-mm-sig for discussion of video/camera buffer constraints... > On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > wrote: > >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also > >> > allocate buffers for the GPU. Still not sure how to resolve > >> > this as we don't use DRM for our GPU driver. > >> > >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. > >> once the dma-fence stuff is in place), we'd have gpu-specific drm > >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, > >> using prime/dmabuf to share between the two. > > > > The "extra" buffers we were allocating from armsoc DDX were really > > being allocated through DRM/GEM so we could get an flink name > > for them and pass a reference to them back to our GPU driver on > > the client side. If it weren't for our need to access those > > extra off-screen buffers with the GPU we wouldn't need to > > allocate them with DRM at all. So, given they are really "GPU" > > buffers, it does absolutely make sense to allocate them in a > > different driver to the display driver. > > > > However, to avoid unnecessary memcpys & related cache > > maintenance ops, we'd also like the GPU to render into buffers > > which are scanned out by the display controller. So let's say > > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > out buffers with the display's DRM driver but a custom ioctl > > on the GPU's DRM driver to allocate non scanout, off-screen > > buffers. Sounds great, but I don't think that really works > > with DRI2. If we used two drivers to allocate buffers, which > > of those drivers do we return in DRI2ConnectReply? Even if we > > solve that somehow, GEM flink names are name-spaced to a > > single device node (AFAIK). So when we do a DRI2GetBuffers, > > how does the EGL in the client know which DRM device owns GEM > > flink name "1234"? We'd need some pretty dirty hacks. > > You would return the name of the display driver allocating the > buffers. On the client side you can use generic ioctls to go from > flink -> handle -> dmabuf. So the client side would end up opening > both the display drm device and the gpu, but without needing to know > too much about the display. I think the bit I was missing was that a GEM bo for a buffer imported using dma_buf/PRIME can still be flink'd. So the display controller's DRM driver allocates scan-out buffers via the DUMB buffer allocate ioctl. Those scan-out buffers than then be exported from the dispaly's DRM driver and imported into the GPU's DRM driver using PRIME. Once imported into the GPU's driver, we can use flink to get a name for that buffer within the GPU DRM driver's name-space to return to the DRI2 client. That same namespace is also what DRI2 back-buffers are allocated from, so I think that could work... Except... > > Anyway, that latter case also gets quite difficult. The "GPU" > > DRM driver would need to know the constraints of the display > > controller when allocating buffers intended to be scanned out. > > For example, pl111 typically isn't behind an IOMMU and so > > requires physically contiguous memory. We'd have to teach the > > GPU's DRM driver about the constraints of the display HW. Not > > exactly a clean driver model. :-( > > > > I'm still a little stuck on how to proceed, so any ideas > > would greatly appreciated! My current train of thought is > > having a kind of SoC-specific DRM driver which allocates > > buffers for both display and GPU within a single GEM > > namespace. That SoC-specific DRM driver could then know the > > constraints of both the GPU and the display HW. We could then > > use PRIME to export buffers allocated with the SoC DRM driver > > and import them into the GPU and/or display DRM driver. > > Usually if the display drm driver is allocating the buffers that might > be scanned out, it just needs to have minimal knowledge of the GPU > (pitch alignment constraints). I don't think we need a 3rd device > just to allocate buffers. While Mali can render to pretty much any buffer, there is a mild performance improvement to be had if the buffer stride is aligned to the AXI bus's max burst length when drawing to the buffer. So in some respects, there is a constraint on how buffers which will be drawn to using the GPU are allocated. I don't really like the idea of teaching the display controller DRM driver about the GPU buffer constraints, even if they are fairly trivial like this. If the same display HW IP is being used on several SoCs, it seems wrong somehow to enforce those GPU constraints if some of those SoCs don't have a GPU. We may also then have additional constraints when sharing buffers between the display HW and video decode or even camera ISP HW. Programmatically describing buffer allocation constraints is very difficult and I'm not sure you can actually do it - there's some pretty complex constraints out there! E.g. I believe there's a platform where Y and UV planes of the reference frame need to be in separate DRAM banks for real-time 1080p decode, or something like that? Anyway, I guess my point is that even if we solve how to allocate buffers which will be shared between the GPU and display HW such that both sets of constraints are satisfied, that may not be the end of the story. Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <51ffdc7e.06b8b40a.2cc8.0fe0SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <51ffdc7e.06b8b40a.2cc8.0fe0SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-05 18:07 ` Rob Clark @ 2013-08-05 18:07 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-05 18:07 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig On Mon, Aug 5, 2013 at 1:10 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > > +linux-media, +linaro-mm-sig for discussion of video/camera > buffer constraints... > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> wrote: >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> >> > allocate buffers for the GPU. Still not sure how to resolve >> >> > this as we don't use DRM for our GPU driver. >> >> >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> >> using prime/dmabuf to share between the two. >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> > being allocated through DRM/GEM so we could get an flink name >> > for them and pass a reference to them back to our GPU driver on >> > the client side. If it weren't for our need to access those >> > extra off-screen buffers with the GPU we wouldn't need to >> > allocate them with DRM at all. So, given they are really "GPU" >> > buffers, it does absolutely make sense to allocate them in a >> > different driver to the display driver. >> > >> > However, to avoid unnecessary memcpys & related cache >> > maintenance ops, we'd also like the GPU to render into buffers >> > which are scanned out by the display controller. So let's say >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> > out buffers with the display's DRM driver but a custom ioctl >> > on the GPU's DRM driver to allocate non scanout, off-screen >> > buffers. Sounds great, but I don't think that really works >> > with DRI2. If we used two drivers to allocate buffers, which >> > of those drivers do we return in DRI2ConnectReply? Even if we >> > solve that somehow, GEM flink names are name-spaced to a >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> > how does the EGL in the client know which DRM device owns GEM >> > flink name "1234"? We'd need some pretty dirty hacks. >> >> You would return the name of the display driver allocating the >> buffers. On the client side you can use generic ioctls to go from >> flink -> handle -> dmabuf. So the client side would end up opening >> both the display drm device and the gpu, but without needing to know >> too much about the display. > > I think the bit I was missing was that a GEM bo for a buffer imported > using dma_buf/PRIME can still be flink'd. So the display controller's > DRM driver allocates scan-out buffers via the DUMB buffer allocate > ioctl. Those scan-out buffers than then be exported from the > dispaly's DRM driver and imported into the GPU's DRM driver using > PRIME. Once imported into the GPU's driver, we can use flink to get a > name for that buffer within the GPU DRM driver's name-space to return > to the DRI2 client. That same namespace is also what DRI2 back-buffers > are allocated from, so I think that could work... Except... > (and.. the general direction is that things will move more to just use dmabuf directly, ie. wayland or dri3) > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> > DRM driver would need to know the constraints of the display >> > controller when allocating buffers intended to be scanned out. >> > For example, pl111 typically isn't behind an IOMMU and so >> > requires physically contiguous memory. We'd have to teach the >> > GPU's DRM driver about the constraints of the display HW. Not >> > exactly a clean driver model. :-( >> > >> > I'm still a little stuck on how to proceed, so any ideas >> > would greatly appreciated! My current train of thought is >> > having a kind of SoC-specific DRM driver which allocates >> > buffers for both display and GPU within a single GEM >> > namespace. That SoC-specific DRM driver could then know the >> > constraints of both the GPU and the display HW. We could then >> > use PRIME to export buffers allocated with the SoC DRM driver >> > and import them into the GPU and/or display DRM driver. >> >> Usually if the display drm driver is allocating the buffers that might >> be scanned out, it just needs to have minimal knowledge of the GPU >> (pitch alignment constraints). I don't think we need a 3rd device >> just to allocate buffers. > > While Mali can render to pretty much any buffer, there is a mild > performance improvement to be had if the buffer stride is aligned to > the AXI bus's max burst length when drawing to the buffer. I suspect the display controllers might frequently benefit if the pitch is aligned to AXI burst length too.. > So in some respects, there is a constraint on how buffers which will > be drawn to using the GPU are allocated. I don't really like the idea > of teaching the display controller DRM driver about the GPU buffer > constraints, even if they are fairly trivial like this. If the same > display HW IP is being used on several SoCs, it seems wrong somehow > to enforce those GPU constraints if some of those SoCs don't have a > GPU. Well, I suppose you could get min_pitch_alignment from devicetree, or something like this.. In the end, the easy solution is just to make the display allocate to the worst-case pitch alignment. In the early days of dma-buf discussions, we kicked around the idea of negotiating or programatically describing the constraints, but that didn't really seem like a bounded problem. > We may also then have additional constraints when sharing buffers > between the display HW and video decode or even camera ISP HW. > Programmatically describing buffer allocation constraints is very > difficult and I'm not sure you can actually do it - there's some > pretty complex constraints out there! E.g. I believe there's a > platform where Y and UV planes of the reference frame need to be in > separate DRAM banks for real-time 1080p decode, or something like > that? yes, this was discussed. This is different from pitch/format/size constraints.. it is really just a placement constraint (ie. where do the physical pages go). IIRC the conclusion was to use a dummy devices with it's own CMA pool for attaching the Y vs UV buffers. > Anyway, I guess my point is that even if we solve how to allocate > buffers which will be shared between the GPU and display HW such that > both sets of constraints are satisfied, that may not be the end of > the story. > that was part of the reason to punt this problem to userspace ;-) In practice, the kernel drivers doesn't usually know too much about the dimensions/format/etc.. that is really userspace level knowledge. There are a few exceptions when the kernel needs to know how to setup GTT/etc for tiled buffers, but normally this sort of information is up at the next level up (userspace, and drm_framebuffer in case of scanout). Userspace media frameworks like GStreamer already have a concept of format/caps negotiation. For non-display<->gpu sharing, I think this is probably where this sort of constraint negotiation should be handled. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-05 18:07 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-05 18:07 UTC (permalink / raw) To: linux-arm-kernel On Mon, Aug 5, 2013 at 1:10 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > > +linux-media, +linaro-mm-sig for discussion of video/camera > buffer constraints... > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> wrote: >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> >> > allocate buffers for the GPU. Still not sure how to resolve >> >> > this as we don't use DRM for our GPU driver. >> >> >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> >> using prime/dmabuf to share between the two. >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> > being allocated through DRM/GEM so we could get an flink name >> > for them and pass a reference to them back to our GPU driver on >> > the client side. If it weren't for our need to access those >> > extra off-screen buffers with the GPU we wouldn't need to >> > allocate them with DRM at all. So, given they are really "GPU" >> > buffers, it does absolutely make sense to allocate them in a >> > different driver to the display driver. >> > >> > However, to avoid unnecessary memcpys & related cache >> > maintenance ops, we'd also like the GPU to render into buffers >> > which are scanned out by the display controller. So let's say >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> > out buffers with the display's DRM driver but a custom ioctl >> > on the GPU's DRM driver to allocate non scanout, off-screen >> > buffers. Sounds great, but I don't think that really works >> > with DRI2. If we used two drivers to allocate buffers, which >> > of those drivers do we return in DRI2ConnectReply? Even if we >> > solve that somehow, GEM flink names are name-spaced to a >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> > how does the EGL in the client know which DRM device owns GEM >> > flink name "1234"? We'd need some pretty dirty hacks. >> >> You would return the name of the display driver allocating the >> buffers. On the client side you can use generic ioctls to go from >> flink -> handle -> dmabuf. So the client side would end up opening >> both the display drm device and the gpu, but without needing to know >> too much about the display. > > I think the bit I was missing was that a GEM bo for a buffer imported > using dma_buf/PRIME can still be flink'd. So the display controller's > DRM driver allocates scan-out buffers via the DUMB buffer allocate > ioctl. Those scan-out buffers than then be exported from the > dispaly's DRM driver and imported into the GPU's DRM driver using > PRIME. Once imported into the GPU's driver, we can use flink to get a > name for that buffer within the GPU DRM driver's name-space to return > to the DRI2 client. That same namespace is also what DRI2 back-buffers > are allocated from, so I think that could work... Except... > (and.. the general direction is that things will move more to just use dmabuf directly, ie. wayland or dri3) > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> > DRM driver would need to know the constraints of the display >> > controller when allocating buffers intended to be scanned out. >> > For example, pl111 typically isn't behind an IOMMU and so >> > requires physically contiguous memory. We'd have to teach the >> > GPU's DRM driver about the constraints of the display HW. Not >> > exactly a clean driver model. :-( >> > >> > I'm still a little stuck on how to proceed, so any ideas >> > would greatly appreciated! My current train of thought is >> > having a kind of SoC-specific DRM driver which allocates >> > buffers for both display and GPU within a single GEM >> > namespace. That SoC-specific DRM driver could then know the >> > constraints of both the GPU and the display HW. We could then >> > use PRIME to export buffers allocated with the SoC DRM driver >> > and import them into the GPU and/or display DRM driver. >> >> Usually if the display drm driver is allocating the buffers that might >> be scanned out, it just needs to have minimal knowledge of the GPU >> (pitch alignment constraints). I don't think we need a 3rd device >> just to allocate buffers. > > While Mali can render to pretty much any buffer, there is a mild > performance improvement to be had if the buffer stride is aligned to > the AXI bus's max burst length when drawing to the buffer. I suspect the display controllers might frequently benefit if the pitch is aligned to AXI burst length too.. > So in some respects, there is a constraint on how buffers which will > be drawn to using the GPU are allocated. I don't really like the idea > of teaching the display controller DRM driver about the GPU buffer > constraints, even if they are fairly trivial like this. If the same > display HW IP is being used on several SoCs, it seems wrong somehow > to enforce those GPU constraints if some of those SoCs don't have a > GPU. Well, I suppose you could get min_pitch_alignment from devicetree, or something like this.. In the end, the easy solution is just to make the display allocate to the worst-case pitch alignment. In the early days of dma-buf discussions, we kicked around the idea of negotiating or programatically describing the constraints, but that didn't really seem like a bounded problem. > We may also then have additional constraints when sharing buffers > between the display HW and video decode or even camera ISP HW. > Programmatically describing buffer allocation constraints is very > difficult and I'm not sure you can actually do it - there's some > pretty complex constraints out there! E.g. I believe there's a > platform where Y and UV planes of the reference frame need to be in > separate DRAM banks for real-time 1080p decode, or something like > that? yes, this was discussed. This is different from pitch/format/size constraints.. it is really just a placement constraint (ie. where do the physical pages go). IIRC the conclusion was to use a dummy devices with it's own CMA pool for attaching the Y vs UV buffers. > Anyway, I guess my point is that even if we solve how to allocate > buffers which will be shared between the GPU and display HW such that > both sets of constraints are satisfied, that may not be the end of > the story. > that was part of the reason to punt this problem to userspace ;-) In practice, the kernel drivers doesn't usually know too much about the dimensions/format/etc.. that is really userspace level knowledge. There are a few exceptions when the kernel needs to know how to setup GTT/etc for tiled buffers, but normally this sort of information is up at the next level up (userspace, and drm_framebuffer in case of scanout). Userspace media frameworks like GStreamer already have a concept of format/caps negotiation. For non-display<->gpu sharing, I think this is probably where this sort of constraint negotiation should be handled. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-05 18:07 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-05 18:07 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig On Mon, Aug 5, 2013 at 1:10 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > > +linux-media, +linaro-mm-sig for discussion of video/camera > buffer constraints... > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> wrote: >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >> >> > allocate buffers for the GPU. Still not sure how to resolve >> >> > this as we don't use DRM for our GPU driver. >> >> >> >> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >> >> once the dma-fence stuff is in place), we'd have gpu-specific drm >> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >> >> using prime/dmabuf to share between the two. >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> > being allocated through DRM/GEM so we could get an flink name >> > for them and pass a reference to them back to our GPU driver on >> > the client side. If it weren't for our need to access those >> > extra off-screen buffers with the GPU we wouldn't need to >> > allocate them with DRM at all. So, given they are really "GPU" >> > buffers, it does absolutely make sense to allocate them in a >> > different driver to the display driver. >> > >> > However, to avoid unnecessary memcpys & related cache >> > maintenance ops, we'd also like the GPU to render into buffers >> > which are scanned out by the display controller. So let's say >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> > out buffers with the display's DRM driver but a custom ioctl >> > on the GPU's DRM driver to allocate non scanout, off-screen >> > buffers. Sounds great, but I don't think that really works >> > with DRI2. If we used two drivers to allocate buffers, which >> > of those drivers do we return in DRI2ConnectReply? Even if we >> > solve that somehow, GEM flink names are name-spaced to a >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> > how does the EGL in the client know which DRM device owns GEM >> > flink name "1234"? We'd need some pretty dirty hacks. >> >> You would return the name of the display driver allocating the >> buffers. On the client side you can use generic ioctls to go from >> flink -> handle -> dmabuf. So the client side would end up opening >> both the display drm device and the gpu, but without needing to know >> too much about the display. > > I think the bit I was missing was that a GEM bo for a buffer imported > using dma_buf/PRIME can still be flink'd. So the display controller's > DRM driver allocates scan-out buffers via the DUMB buffer allocate > ioctl. Those scan-out buffers than then be exported from the > dispaly's DRM driver and imported into the GPU's DRM driver using > PRIME. Once imported into the GPU's driver, we can use flink to get a > name for that buffer within the GPU DRM driver's name-space to return > to the DRI2 client. That same namespace is also what DRI2 back-buffers > are allocated from, so I think that could work... Except... > (and.. the general direction is that things will move more to just use dmabuf directly, ie. wayland or dri3) > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> > DRM driver would need to know the constraints of the display >> > controller when allocating buffers intended to be scanned out. >> > For example, pl111 typically isn't behind an IOMMU and so >> > requires physically contiguous memory. We'd have to teach the >> > GPU's DRM driver about the constraints of the display HW. Not >> > exactly a clean driver model. :-( >> > >> > I'm still a little stuck on how to proceed, so any ideas >> > would greatly appreciated! My current train of thought is >> > having a kind of SoC-specific DRM driver which allocates >> > buffers for both display and GPU within a single GEM >> > namespace. That SoC-specific DRM driver could then know the >> > constraints of both the GPU and the display HW. We could then >> > use PRIME to export buffers allocated with the SoC DRM driver >> > and import them into the GPU and/or display DRM driver. >> >> Usually if the display drm driver is allocating the buffers that might >> be scanned out, it just needs to have minimal knowledge of the GPU >> (pitch alignment constraints). I don't think we need a 3rd device >> just to allocate buffers. > > While Mali can render to pretty much any buffer, there is a mild > performance improvement to be had if the buffer stride is aligned to > the AXI bus's max burst length when drawing to the buffer. I suspect the display controllers might frequently benefit if the pitch is aligned to AXI burst length too.. > So in some respects, there is a constraint on how buffers which will > be drawn to using the GPU are allocated. I don't really like the idea > of teaching the display controller DRM driver about the GPU buffer > constraints, even if they are fairly trivial like this. If the same > display HW IP is being used on several SoCs, it seems wrong somehow > to enforce those GPU constraints if some of those SoCs don't have a > GPU. Well, I suppose you could get min_pitch_alignment from devicetree, or something like this.. In the end, the easy solution is just to make the display allocate to the worst-case pitch alignment. In the early days of dma-buf discussions, we kicked around the idea of negotiating or programatically describing the constraints, but that didn't really seem like a bounded problem. > We may also then have additional constraints when sharing buffers > between the display HW and video decode or even camera ISP HW. > Programmatically describing buffer allocation constraints is very > difficult and I'm not sure you can actually do it - there's some > pretty complex constraints out there! E.g. I believe there's a > platform where Y and UV planes of the reference frame need to be in > separate DRAM banks for real-time 1080p decode, or something like > that? yes, this was discussed. This is different from pitch/format/size constraints.. it is really just a placement constraint (ie. where do the physical pages go). IIRC the conclusion was to use a dummy devices with it's own CMA pool for attaching the Y vs UV buffers. > Anyway, I guess my point is that even if we solve how to allocate > buffers which will be shared between the GPU and display HW such that > both sets of constraints are satisfied, that may not be the end of > the story. > that was part of the reason to punt this problem to userspace ;-) In practice, the kernel drivers doesn't usually know too much about the dimensions/format/etc.. that is really userspace level knowledge. There are a few exceptions when the kernel needs to know how to setup GTT/etc for tiled buffers, but normally this sort of information is up at the next level up (userspace, and drm_framebuffer in case of scanout). Userspace media frameworks like GStreamer already have a concept of format/caps negotiation. For non-display<->gpu sharing, I think this is probably where this sort of constraint negotiation should be handled. BR, -R > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-05 18:07 ` Rob Clark (?) (?) @ 2013-08-06 11:31 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 11:31 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi Rob, +lkml > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > >> wrote: > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > >> >> > also allocate buffers for the GPU. Still not sure how to > >> >> > resolve this as we don't use DRM for our GPU driver. > >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > >> >> (esp. once the dma-fence stuff is in place), we'd have > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > >> >> specific drm/kms driver, using prime/dmabuf to share between > >> >> the two. > >> > > >> > The "extra" buffers we were allocating from armsoc DDX were really > >> > being allocated through DRM/GEM so we could get an flink name > >> > for them and pass a reference to them back to our GPU driver on > >> > the client side. If it weren't for our need to access those > >> > extra off-screen buffers with the GPU we wouldn't need to > >> > allocate them with DRM at all. So, given they are really "GPU" > >> > buffers, it does absolutely make sense to allocate them in a > >> > different driver to the display driver. > >> > > >> > However, to avoid unnecessary memcpys & related cache > >> > maintenance ops, we'd also like the GPU to render into buffers > >> > which are scanned out by the display controller. So let's say > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > >> > out buffers with the display's DRM driver but a custom ioctl > >> > on the GPU's DRM driver to allocate non scanout, off-screen > >> > buffers. Sounds great, but I don't think that really works > >> > with DRI2. If we used two drivers to allocate buffers, which > >> > of those drivers do we return in DRI2ConnectReply? Even if we > >> > solve that somehow, GEM flink names are name-spaced to a > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > >> > how does the EGL in the client know which DRM device owns GEM > >> > flink name "1234"? We'd need some pretty dirty hacks. > >> > >> You would return the name of the display driver allocating the > >> buffers. On the client side you can use generic ioctls to go from > >> flink -> handle -> dmabuf. So the client side would end up opening > >> both the display drm device and the gpu, but without needing to know > >> too much about the display. > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > using dma_buf/PRIME can still be flink'd. So the display controller's > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > ioctl. Those scan-out buffers than then be exported from the > > dispaly's DRM driver and imported into the GPU's DRM driver using > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > name for that buffer within the GPU DRM driver's name-space to return > > to the DRI2 client. That same namespace is also what DRI2 back- > > buffers are allocated from, so I think that could work... Except... > > (and.. the general direction is that things will move more to just use > dmabuf directly, ie. wayland or dri3) I agree, DRI2 is the only reason why we need a system-wide ID. I also prefer buffers to be passed around by dma_buf fd, but we still need to support DRI2 and will do for some time I expect. > >> > Anyway, that latter case also gets quite difficult. The "GPU" > >> > DRM driver would need to know the constraints of the display > >> > controller when allocating buffers intended to be scanned out. > >> > For example, pl111 typically isn't behind an IOMMU and so > >> > requires physically contiguous memory. We'd have to teach the > >> > GPU's DRM driver about the constraints of the display HW. Not > >> > exactly a clean driver model. :-( > >> > > >> > I'm still a little stuck on how to proceed, so any ideas > >> > would greatly appreciated! My current train of thought is > >> > having a kind of SoC-specific DRM driver which allocates > >> > buffers for both display and GPU within a single GEM > >> > namespace. That SoC-specific DRM driver could then know the > >> > constraints of both the GPU and the display HW. We could then > >> > use PRIME to export buffers allocated with the SoC DRM driver > >> > and import them into the GPU and/or display DRM driver. > >> > >> Usually if the display drm driver is allocating the buffers that > >> might be scanned out, it just needs to have minimal knowledge of > >> the GPU (pitch alignment constraints). I don't think we need a > >> 3rd device just to allocate buffers. > > > > While Mali can render to pretty much any buffer, there is a mild > > performance improvement to be had if the buffer stride is aligned to > > the AXI bus's max burst length when drawing to the buffer. > > I suspect the display controllers might frequently benefit if the > pitch is aligned to AXI burst length too.. If the display controller is going to be reading from linear memory I don't think it will make much difference - you'll just get an extra 1-2 bus transactions per scanline. With a tile-based GPU like Mali, you get those extra transactions per _tile_ scan-line and as such, the overhead is more pronounced. > > So in some respects, there is a constraint on how buffers which will > > be drawn to using the GPU are allocated. I don't really like the idea > > of teaching the display controller DRM driver about the GPU buffer > > constraints, even if they are fairly trivial like this. If the same > > display HW IP is being used on several SoCs, it seems wrong somehow > > to enforce those GPU constraints if some of those SoCs don't have a > > GPU. > > Well, I suppose you could get min_pitch_alignment from devicetree, or > something like this.. > > In the end, the easy solution is just to make the display allocate to > the worst-case pitch alignment. In the early days of dma-buf > discussions, we kicked around the idea of negotiating or > programatically describing the constraints, but that didn't really > seem like a bounded problem. Yeah - I was around for some of those discussions and agree it's not really an easy problem to solve. > > We may also then have additional constraints when sharing buffers > > between the display HW and video decode or even camera ISP HW. > > Programmatically describing buffer allocation constraints is very > > difficult and I'm not sure you can actually do it - there's some > > pretty complex constraints out there! E.g. I believe there's a > > platform where Y and UV planes of the reference frame need to be in > > separate DRAM banks for real-time 1080p decode, or something like > > that? > > yes, this was discussed. This is different from pitch/format/size > constraints.. it is really just a placement constraint (ie. where do > the physical pages go). IIRC the conclusion was to use a dummy > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > Anyway, I guess my point is that even if we solve how to allocate > > buffers which will be shared between the GPU and display HW such that > > both sets of constraints are satisfied, that may not be the end of > > the story. > > > > that was part of the reason to punt this problem to userspace ;-) > > In practice, the kernel drivers doesn't usually know too much about > the dimensions/format/etc.. that is really userspace level knowledge. > There are a few exceptions when the kernel needs to know how to setup > GTT/etc for tiled buffers, but normally this sort of information is up > at the next level up (userspace, and drm_framebuffer in case of > scanout). Userspace media frameworks like GStreamer already have a > concept of format/caps negotiation. For non-display<->gpu sharing, I > think this is probably where this sort of constraint negotiation > should be handled. I agree that user-space will know which devices will access the buffer and thus can figure out at least a common pixel format. Though I'm not so sure userspace can figure out more low-level details like alignment and placement in physical memory, etc. Anyway, assuming user-space can figure out how a buffer should be stored in memory, how does it indicate this to a kernel driver and actually allocate it? Which ioctl on which device does user-space call, with what parameters? Are you suggesting using something like ION which exposes the low-level details of how buffers are laid out in physical memory to userspace? If not, what? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-05 18:07 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-06 11:31 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 11:31 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi Rob, +lkml > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > >> wrote: > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > >> >> > also allocate buffers for the GPU. Still not sure how to > >> >> > resolve this as we don't use DRM for our GPU driver. > >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > >> >> (esp. once the dma-fence stuff is in place), we'd have > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > >> >> specific drm/kms driver, using prime/dmabuf to share between > >> >> the two. > >> > > >> > The "extra" buffers we were allocating from armsoc DDX were really > >> > being allocated through DRM/GEM so we could get an flink name > >> > for them and pass a reference to them back to our GPU driver on > >> > the client side. If it weren't for our need to access those > >> > extra off-screen buffers with the GPU we wouldn't need to > >> > allocate them with DRM at all. So, given they are really "GPU" > >> > buffers, it does absolutely make sense to allocate them in a > >> > different driver to the display driver. > >> > > >> > However, to avoid unnecessary memcpys & related cache > >> > maintenance ops, we'd also like the GPU to render into buffers > >> > which are scanned out by the display controller. So let's say > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > >> > out buffers with the display's DRM driver but a custom ioctl > >> > on the GPU's DRM driver to allocate non scanout, off-screen > >> > buffers. Sounds great, but I don't think that really works > >> > with DRI2. If we used two drivers to allocate buffers, which > >> > of those drivers do we return in DRI2ConnectReply? Even if we > >> > solve that somehow, GEM flink names are name-spaced to a > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > >> > how does the EGL in the client know which DRM device owns GEM > >> > flink name "1234"? We'd need some pretty dirty hacks. > >> > >> You would return the name of the display driver allocating the > >> buffers. On the client side you can use generic ioctls to go from > >> flink -> handle -> dmabuf. So the client side would end up opening > >> both the display drm device and the gpu, but without needing to know > >> too much about the display. > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > using dma_buf/PRIME can still be flink'd. So the display controller's > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > ioctl. Those scan-out buffers than then be exported from the > > dispaly's DRM driver and imported into the GPU's DRM driver using > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > name for that buffer within the GPU DRM driver's name-space to return > > to the DRI2 client. That same namespace is also what DRI2 back- > > buffers are allocated from, so I think that could work... Except... > > (and.. the general direction is that things will move more to just use > dmabuf directly, ie. wayland or dri3) I agree, DRI2 is the only reason why we need a system-wide ID. I also prefer buffers to be passed around by dma_buf fd, but we still need to support DRI2 and will do for some time I expect. > >> > Anyway, that latter case also gets quite difficult. The "GPU" > >> > DRM driver would need to know the constraints of the display > >> > controller when allocating buffers intended to be scanned out. > >> > For example, pl111 typically isn't behind an IOMMU and so > >> > requires physically contiguous memory. We'd have to teach the > >> > GPU's DRM driver about the constraints of the display HW. Not > >> > exactly a clean driver model. :-( > >> > > >> > I'm still a little stuck on how to proceed, so any ideas > >> > would greatly appreciated! My current train of thought is > >> > having a kind of SoC-specific DRM driver which allocates > >> > buffers for both display and GPU within a single GEM > >> > namespace. That SoC-specific DRM driver could then know the > >> > constraints of both the GPU and the display HW. We could then > >> > use PRIME to export buffers allocated with the SoC DRM driver > >> > and import them into the GPU and/or display DRM driver. > >> > >> Usually if the display drm driver is allocating the buffers that > >> might be scanned out, it just needs to have minimal knowledge of > >> the GPU (pitch alignment constraints). I don't think we need a > >> 3rd device just to allocate buffers. > > > > While Mali can render to pretty much any buffer, there is a mild > > performance improvement to be had if the buffer stride is aligned to > > the AXI bus's max burst length when drawing to the buffer. > > I suspect the display controllers might frequently benefit if the > pitch is aligned to AXI burst length too.. If the display controller is going to be reading from linear memory I don't think it will make much difference - you'll just get an extra 1-2 bus transactions per scanline. With a tile-based GPU like Mali, you get those extra transactions per _tile_ scan-line and as such, the overhead is more pronounced. > > So in some respects, there is a constraint on how buffers which will > > be drawn to using the GPU are allocated. I don't really like the idea > > of teaching the display controller DRM driver about the GPU buffer > > constraints, even if they are fairly trivial like this. If the same > > display HW IP is being used on several SoCs, it seems wrong somehow > > to enforce those GPU constraints if some of those SoCs don't have a > > GPU. > > Well, I suppose you could get min_pitch_alignment from devicetree, or > something like this.. > > In the end, the easy solution is just to make the display allocate to > the worst-case pitch alignment. In the early days of dma-buf > discussions, we kicked around the idea of negotiating or > programatically describing the constraints, but that didn't really > seem like a bounded problem. Yeah - I was around for some of those discussions and agree it's not really an easy problem to solve. > > We may also then have additional constraints when sharing buffers > > between the display HW and video decode or even camera ISP HW. > > Programmatically describing buffer allocation constraints is very > > difficult and I'm not sure you can actually do it - there's some > > pretty complex constraints out there! E.g. I believe there's a > > platform where Y and UV planes of the reference frame need to be in > > separate DRAM banks for real-time 1080p decode, or something like > > that? > > yes, this was discussed. This is different from pitch/format/size > constraints.. it is really just a placement constraint (ie. where do > the physical pages go). IIRC the conclusion was to use a dummy > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > Anyway, I guess my point is that even if we solve how to allocate > > buffers which will be shared between the GPU and display HW such that > > both sets of constraints are satisfied, that may not be the end of > > the story. > > > > that was part of the reason to punt this problem to userspace ;-) > > In practice, the kernel drivers doesn't usually know too much about > the dimensions/format/etc.. that is really userspace level knowledge. > There are a few exceptions when the kernel needs to know how to setup > GTT/etc for tiled buffers, but normally this sort of information is up > at the next level up (userspace, and drm_framebuffer in case of > scanout). Userspace media frameworks like GStreamer already have a > concept of format/caps negotiation. For non-display<->gpu sharing, I > think this is probably where this sort of constraint negotiation > should be handled. I agree that user-space will know which devices will access the buffer and thus can figure out at least a common pixel format. Though I'm not so sure userspace can figure out more low-level details like alignment and placement in physical memory, etc. Anyway, assuming user-space can figure out how a buffer should be stored in memory, how does it indicate this to a kernel driver and actually allocate it? Which ioctl on which device does user-space call, with what parameters? Are you suggesting using something like ION which exposes the low-level details of how buffers are laid out in physical memory to userspace? If not, what? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-05 18:07 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-06 11:31 ` Tom Cooksey 2013-08-06 12:18 ` Lucas Stach 2013-09-14 21:33 ` Daniel Stone -1 siblings, 2 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 11:31 UTC (permalink / raw) To: linux-arm-kernel Hi Rob, +lkml > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > >> wrote: > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > >> >> > also allocate buffers for the GPU. Still not sure how to > >> >> > resolve this as we don't use DRM for our GPU driver. > >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > >> >> (esp. once the dma-fence stuff is in place), we'd have > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > >> >> specific drm/kms driver, using prime/dmabuf to share between > >> >> the two. > >> > > >> > The "extra" buffers we were allocating from armsoc DDX were really > >> > being allocated through DRM/GEM so we could get an flink name > >> > for them and pass a reference to them back to our GPU driver on > >> > the client side. If it weren't for our need to access those > >> > extra off-screen buffers with the GPU we wouldn't need to > >> > allocate them with DRM at all. So, given they are really "GPU" > >> > buffers, it does absolutely make sense to allocate them in a > >> > different driver to the display driver. > >> > > >> > However, to avoid unnecessary memcpys & related cache > >> > maintenance ops, we'd also like the GPU to render into buffers > >> > which are scanned out by the display controller. So let's say > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > >> > out buffers with the display's DRM driver but a custom ioctl > >> > on the GPU's DRM driver to allocate non scanout, off-screen > >> > buffers. Sounds great, but I don't think that really works > >> > with DRI2. If we used two drivers to allocate buffers, which > >> > of those drivers do we return in DRI2ConnectReply? Even if we > >> > solve that somehow, GEM flink names are name-spaced to a > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > >> > how does the EGL in the client know which DRM device owns GEM > >> > flink name "1234"? We'd need some pretty dirty hacks. > >> > >> You would return the name of the display driver allocating the > >> buffers. On the client side you can use generic ioctls to go from > >> flink -> handle -> dmabuf. So the client side would end up opening > >> both the display drm device and the gpu, but without needing to know > >> too much about the display. > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > using dma_buf/PRIME can still be flink'd. So the display controller's > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > ioctl. Those scan-out buffers than then be exported from the > > dispaly's DRM driver and imported into the GPU's DRM driver using > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > name for that buffer within the GPU DRM driver's name-space to return > > to the DRI2 client. That same namespace is also what DRI2 back- > > buffers are allocated from, so I think that could work... Except... > > (and.. the general direction is that things will move more to just use > dmabuf directly, ie. wayland or dri3) I agree, DRI2 is the only reason why we need a system-wide ID. I also prefer buffers to be passed around by dma_buf fd, but we still need to support DRI2 and will do for some time I expect. > >> > Anyway, that latter case also gets quite difficult. The "GPU" > >> > DRM driver would need to know the constraints of the display > >> > controller when allocating buffers intended to be scanned out. > >> > For example, pl111 typically isn't behind an IOMMU and so > >> > requires physically contiguous memory. We'd have to teach the > >> > GPU's DRM driver about the constraints of the display HW. Not > >> > exactly a clean driver model. :-( > >> > > >> > I'm still a little stuck on how to proceed, so any ideas > >> > would greatly appreciated! My current train of thought is > >> > having a kind of SoC-specific DRM driver which allocates > >> > buffers for both display and GPU within a single GEM > >> > namespace. That SoC-specific DRM driver could then know the > >> > constraints of both the GPU and the display HW. We could then > >> > use PRIME to export buffers allocated with the SoC DRM driver > >> > and import them into the GPU and/or display DRM driver. > >> > >> Usually if the display drm driver is allocating the buffers that > >> might be scanned out, it just needs to have minimal knowledge of > >> the GPU (pitch alignment constraints). I don't think we need a > >> 3rd device just to allocate buffers. > > > > While Mali can render to pretty much any buffer, there is a mild > > performance improvement to be had if the buffer stride is aligned to > > the AXI bus's max burst length when drawing to the buffer. > > I suspect the display controllers might frequently benefit if the > pitch is aligned to AXI burst length too.. If the display controller is going to be reading from linear memory I don't think it will make much difference - you'll just get an extra 1-2 bus transactions per scanline. With a tile-based GPU like Mali, you get those extra transactions per _tile_ scan-line and as such, the overhead is more pronounced. > > So in some respects, there is a constraint on how buffers which will > > be drawn to using the GPU are allocated. I don't really like the idea > > of teaching the display controller DRM driver about the GPU buffer > > constraints, even if they are fairly trivial like this. If the same > > display HW IP is being used on several SoCs, it seems wrong somehow > > to enforce those GPU constraints if some of those SoCs don't have a > > GPU. > > Well, I suppose you could get min_pitch_alignment from devicetree, or > something like this.. > > In the end, the easy solution is just to make the display allocate to > the worst-case pitch alignment. In the early days of dma-buf > discussions, we kicked around the idea of negotiating or > programatically describing the constraints, but that didn't really > seem like a bounded problem. Yeah - I was around for some of those discussions and agree it's not really an easy problem to solve. > > We may also then have additional constraints when sharing buffers > > between the display HW and video decode or even camera ISP HW. > > Programmatically describing buffer allocation constraints is very > > difficult and I'm not sure you can actually do it - there's some > > pretty complex constraints out there! E.g. I believe there's a > > platform where Y and UV planes of the reference frame need to be in > > separate DRAM banks for real-time 1080p decode, or something like > > that? > > yes, this was discussed. This is different from pitch/format/size > constraints.. it is really just a placement constraint (ie. where do > the physical pages go). IIRC the conclusion was to use a dummy > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > Anyway, I guess my point is that even if we solve how to allocate > > buffers which will be shared between the GPU and display HW such that > > both sets of constraints are satisfied, that may not be the end of > > the story. > > > > that was part of the reason to punt this problem to userspace ;-) > > In practice, the kernel drivers doesn't usually know too much about > the dimensions/format/etc.. that is really userspace level knowledge. > There are a few exceptions when the kernel needs to know how to setup > GTT/etc for tiled buffers, but normally this sort of information is up > at the next level up (userspace, and drm_framebuffer in case of > scanout). Userspace media frameworks like GStreamer already have a > concept of format/caps negotiation. For non-display<->gpu sharing, I > think this is probably where this sort of constraint negotiation > should be handled. I agree that user-space will know which devices will access the buffer and thus can figure out at least a common pixel format. Though I'm not so sure userspace can figure out more low-level details like alignment and placement in physical memory, etc. Anyway, assuming user-space can figure out how a buffer should be stored in memory, how does it indicate this to a kernel driver and actually allocate it? Which ioctl on which device does user-space call, with what parameters? Are you suggesting using something like ION which exposes the low-level details of how buffers are laid out in physical memory to userspace? If not, what? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 11:31 ` Tom Cooksey 2013-08-06 12:18 ` Lucas Stach @ 2013-08-06 12:18 ` Lucas Stach 1 sibling, 0 replies; 131+ messages in thread From: Lucas Stach @ 2013-08-06 12:18 UTC (permalink / raw) To: Tom Cooksey Cc: 'Rob Clark', linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: > Hi Rob, > > +lkml > > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > > >> wrote: > > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > > >> >> > also allocate buffers for the GPU. Still not sure how to > > >> >> > resolve this as we don't use DRM for our GPU driver. > > >> >> > > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > > >> >> (esp. once the dma-fence stuff is in place), we'd have > > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > > >> >> specific drm/kms driver, using prime/dmabuf to share between > > >> >> the two. > > >> > > > >> > The "extra" buffers we were allocating from armsoc DDX were really > > >> > being allocated through DRM/GEM so we could get an flink name > > >> > for them and pass a reference to them back to our GPU driver on > > >> > the client side. If it weren't for our need to access those > > >> > extra off-screen buffers with the GPU we wouldn't need to > > >> > allocate them with DRM at all. So, given they are really "GPU" > > >> > buffers, it does absolutely make sense to allocate them in a > > >> > different driver to the display driver. > > >> > > > >> > However, to avoid unnecessary memcpys & related cache > > >> > maintenance ops, we'd also like the GPU to render into buffers > > >> > which are scanned out by the display controller. So let's say > > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > >> > out buffers with the display's DRM driver but a custom ioctl > > >> > on the GPU's DRM driver to allocate non scanout, off-screen > > >> > buffers. Sounds great, but I don't think that really works > > >> > with DRI2. If we used two drivers to allocate buffers, which > > >> > of those drivers do we return in DRI2ConnectReply? Even if we > > >> > solve that somehow, GEM flink names are name-spaced to a > > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > > >> > how does the EGL in the client know which DRM device owns GEM > > >> > flink name "1234"? We'd need some pretty dirty hacks. > > >> > > >> You would return the name of the display driver allocating the > > >> buffers. On the client side you can use generic ioctls to go from > > >> flink -> handle -> dmabuf. So the client side would end up opening > > >> both the display drm device and the gpu, but without needing to know > > >> too much about the display. > > > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > > using dma_buf/PRIME can still be flink'd. So the display controller's > > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > > ioctl. Those scan-out buffers than then be exported from the > > > dispaly's DRM driver and imported into the GPU's DRM driver using > > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > > name for that buffer within the GPU DRM driver's name-space to return > > > to the DRI2 client. That same namespace is also what DRI2 back- > > > buffers are allocated from, so I think that could work... Except... > > > > (and.. the general direction is that things will move more to just use > > dmabuf directly, ie. wayland or dri3) > > I agree, DRI2 is the only reason why we need a system-wide ID. I also > prefer buffers to be passed around by dma_buf fd, but we still need to > support DRI2 and will do for some time I expect. > > > > > >> > Anyway, that latter case also gets quite difficult. The "GPU" > > >> > DRM driver would need to know the constraints of the display > > >> > controller when allocating buffers intended to be scanned out. > > >> > For example, pl111 typically isn't behind an IOMMU and so > > >> > requires physically contiguous memory. We'd have to teach the > > >> > GPU's DRM driver about the constraints of the display HW. Not > > >> > exactly a clean driver model. :-( > > >> > > > >> > I'm still a little stuck on how to proceed, so any ideas > > >> > would greatly appreciated! My current train of thought is > > >> > having a kind of SoC-specific DRM driver which allocates > > >> > buffers for both display and GPU within a single GEM > > >> > namespace. That SoC-specific DRM driver could then know the > > >> > constraints of both the GPU and the display HW. We could then > > >> > use PRIME to export buffers allocated with the SoC DRM driver > > >> > and import them into the GPU and/or display DRM driver. > > >> > > >> Usually if the display drm driver is allocating the buffers that > > >> might be scanned out, it just needs to have minimal knowledge of > > >> the GPU (pitch alignment constraints). I don't think we need a > > >> 3rd device just to allocate buffers. > > > > > > While Mali can render to pretty much any buffer, there is a mild > > > performance improvement to be had if the buffer stride is aligned to > > > the AXI bus's max burst length when drawing to the buffer. > > > > I suspect the display controllers might frequently benefit if the > > pitch is aligned to AXI burst length too.. > > If the display controller is going to be reading from linear memory > I don't think it will make much difference - you'll just get an extra > 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > you get those extra transactions per _tile_ scan-line and as such, > the overhead is more pronounced. > > > > > > So in some respects, there is a constraint on how buffers which will > > > be drawn to using the GPU are allocated. I don't really like the idea > > > of teaching the display controller DRM driver about the GPU buffer > > > constraints, even if they are fairly trivial like this. If the same > > > display HW IP is being used on several SoCs, it seems wrong somehow > > > to enforce those GPU constraints if some of those SoCs don't have a > > > GPU. > > > > Well, I suppose you could get min_pitch_alignment from devicetree, or > > something like this.. > > > > In the end, the easy solution is just to make the display allocate to > > the worst-case pitch alignment. In the early days of dma-buf > > discussions, we kicked around the idea of negotiating or > > programatically describing the constraints, but that didn't really > > seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > > > > We may also then have additional constraints when sharing buffers > > > between the display HW and video decode or even camera ISP HW. > > > Programmatically describing buffer allocation constraints is very > > > difficult and I'm not sure you can actually do it - there's some > > > pretty complex constraints out there! E.g. I believe there's a > > > platform where Y and UV planes of the reference frame need to be in > > > separate DRAM banks for real-time 1080p decode, or something like > > > that? > > > > yes, this was discussed. This is different from pitch/format/size > > constraints.. it is really just a placement constraint (ie. where do > > the physical pages go). IIRC the conclusion was to use a dummy > > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > > > Anyway, I guess my point is that even if we solve how to allocate > > > buffers which will be shared between the GPU and display HW such that > > > both sets of constraints are satisfied, that may not be the end of > > > the story. > > > > > > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. > > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? > I strongly disagree with exposing low-level hardware details like tiling to userspace. If we have to do the negotiation of those things in userspace we will end up with having to pipe those information through things like the wayland protocol. I don't see how this could ever be considered a good idea. I would rather see kernel drivers negotiating those things at dmabuf attach time in way invisible to userspace. I agree that this negotiation thing isn't easy to get right for the plethora of different hardware constraints we see today, but I would rather see this in-kernel, where we have the chance to fix things up if needed, than in a fixed userspace interface. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 12:18 ` Lucas Stach 0 siblings, 0 replies; 131+ messages in thread From: Lucas Stach @ 2013-08-06 12:18 UTC (permalink / raw) To: linux-arm-kernel Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: > Hi Rob, > > +lkml > > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > > >> wrote: > > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > > >> >> > also allocate buffers for the GPU. Still not sure how to > > >> >> > resolve this as we don't use DRM for our GPU driver. > > >> >> > > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > > >> >> (esp. once the dma-fence stuff is in place), we'd have > > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > > >> >> specific drm/kms driver, using prime/dmabuf to share between > > >> >> the two. > > >> > > > >> > The "extra" buffers we were allocating from armsoc DDX were really > > >> > being allocated through DRM/GEM so we could get an flink name > > >> > for them and pass a reference to them back to our GPU driver on > > >> > the client side. If it weren't for our need to access those > > >> > extra off-screen buffers with the GPU we wouldn't need to > > >> > allocate them with DRM at all. So, given they are really "GPU" > > >> > buffers, it does absolutely make sense to allocate them in a > > >> > different driver to the display driver. > > >> > > > >> > However, to avoid unnecessary memcpys & related cache > > >> > maintenance ops, we'd also like the GPU to render into buffers > > >> > which are scanned out by the display controller. So let's say > > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > >> > out buffers with the display's DRM driver but a custom ioctl > > >> > on the GPU's DRM driver to allocate non scanout, off-screen > > >> > buffers. Sounds great, but I don't think that really works > > >> > with DRI2. If we used two drivers to allocate buffers, which > > >> > of those drivers do we return in DRI2ConnectReply? Even if we > > >> > solve that somehow, GEM flink names are name-spaced to a > > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > > >> > how does the EGL in the client know which DRM device owns GEM > > >> > flink name "1234"? We'd need some pretty dirty hacks. > > >> > > >> You would return the name of the display driver allocating the > > >> buffers. On the client side you can use generic ioctls to go from > > >> flink -> handle -> dmabuf. So the client side would end up opening > > >> both the display drm device and the gpu, but without needing to know > > >> too much about the display. > > > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > > using dma_buf/PRIME can still be flink'd. So the display controller's > > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > > ioctl. Those scan-out buffers than then be exported from the > > > dispaly's DRM driver and imported into the GPU's DRM driver using > > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > > name for that buffer within the GPU DRM driver's name-space to return > > > to the DRI2 client. That same namespace is also what DRI2 back- > > > buffers are allocated from, so I think that could work... Except... > > > > (and.. the general direction is that things will move more to just use > > dmabuf directly, ie. wayland or dri3) > > I agree, DRI2 is the only reason why we need a system-wide ID. I also > prefer buffers to be passed around by dma_buf fd, but we still need to > support DRI2 and will do for some time I expect. > > > > > >> > Anyway, that latter case also gets quite difficult. The "GPU" > > >> > DRM driver would need to know the constraints of the display > > >> > controller when allocating buffers intended to be scanned out. > > >> > For example, pl111 typically isn't behind an IOMMU and so > > >> > requires physically contiguous memory. We'd have to teach the > > >> > GPU's DRM driver about the constraints of the display HW. Not > > >> > exactly a clean driver model. :-( > > >> > > > >> > I'm still a little stuck on how to proceed, so any ideas > > >> > would greatly appreciated! My current train of thought is > > >> > having a kind of SoC-specific DRM driver which allocates > > >> > buffers for both display and GPU within a single GEM > > >> > namespace. That SoC-specific DRM driver could then know the > > >> > constraints of both the GPU and the display HW. We could then > > >> > use PRIME to export buffers allocated with the SoC DRM driver > > >> > and import them into the GPU and/or display DRM driver. > > >> > > >> Usually if the display drm driver is allocating the buffers that > > >> might be scanned out, it just needs to have minimal knowledge of > > >> the GPU (pitch alignment constraints). I don't think we need a > > >> 3rd device just to allocate buffers. > > > > > > While Mali can render to pretty much any buffer, there is a mild > > > performance improvement to be had if the buffer stride is aligned to > > > the AXI bus's max burst length when drawing to the buffer. > > > > I suspect the display controllers might frequently benefit if the > > pitch is aligned to AXI burst length too.. > > If the display controller is going to be reading from linear memory > I don't think it will make much difference - you'll just get an extra > 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > you get those extra transactions per _tile_ scan-line and as such, > the overhead is more pronounced. > > > > > > So in some respects, there is a constraint on how buffers which will > > > be drawn to using the GPU are allocated. I don't really like the idea > > > of teaching the display controller DRM driver about the GPU buffer > > > constraints, even if they are fairly trivial like this. If the same > > > display HW IP is being used on several SoCs, it seems wrong somehow > > > to enforce those GPU constraints if some of those SoCs don't have a > > > GPU. > > > > Well, I suppose you could get min_pitch_alignment from devicetree, or > > something like this.. > > > > In the end, the easy solution is just to make the display allocate to > > the worst-case pitch alignment. In the early days of dma-buf > > discussions, we kicked around the idea of negotiating or > > programatically describing the constraints, but that didn't really > > seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > > > > We may also then have additional constraints when sharing buffers > > > between the display HW and video decode or even camera ISP HW. > > > Programmatically describing buffer allocation constraints is very > > > difficult and I'm not sure you can actually do it - there's some > > > pretty complex constraints out there! E.g. I believe there's a > > > platform where Y and UV planes of the reference frame need to be in > > > separate DRAM banks for real-time 1080p decode, or something like > > > that? > > > > yes, this was discussed. This is different from pitch/format/size > > constraints.. it is really just a placement constraint (ie. where do > > the physical pages go). IIRC the conclusion was to use a dummy > > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > > > Anyway, I guess my point is that even if we solve how to allocate > > > buffers which will be shared between the GPU and display HW such that > > > both sets of constraints are satisfied, that may not be the end of > > > the story. > > > > > > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. > > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? > I strongly disagree with exposing low-level hardware details like tiling to userspace. If we have to do the negotiation of those things in userspace we will end up with having to pipe those information through things like the wayland protocol. I don't see how this could ever be considered a good idea. I would rather see kernel drivers negotiating those things at dmabuf attach time in way invisible to userspace. I agree that this negotiation thing isn't easy to get right for the plethora of different hardware constraints we see today, but I would rather see this in-kernel, where we have the chance to fix things up if needed, than in a fixed userspace interface. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 12:18 ` Lucas Stach 0 siblings, 0 replies; 131+ messages in thread From: Lucas Stach @ 2013-08-06 12:18 UTC (permalink / raw) To: Tom Cooksey Cc: 'Rob Clark', linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: > Hi Rob, > > +lkml > > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > > >> wrote: > > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > > >> >> > also allocate buffers for the GPU. Still not sure how to > > >> >> > resolve this as we don't use DRM for our GPU driver. > > >> >> > > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > > >> >> (esp. once the dma-fence stuff is in place), we'd have > > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > > >> >> specific drm/kms driver, using prime/dmabuf to share between > > >> >> the two. > > >> > > > >> > The "extra" buffers we were allocating from armsoc DDX were really > > >> > being allocated through DRM/GEM so we could get an flink name > > >> > for them and pass a reference to them back to our GPU driver on > > >> > the client side. If it weren't for our need to access those > > >> > extra off-screen buffers with the GPU we wouldn't need to > > >> > allocate them with DRM at all. So, given they are really "GPU" > > >> > buffers, it does absolutely make sense to allocate them in a > > >> > different driver to the display driver. > > >> > > > >> > However, to avoid unnecessary memcpys & related cache > > >> > maintenance ops, we'd also like the GPU to render into buffers > > >> > which are scanned out by the display controller. So let's say > > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > > >> > out buffers with the display's DRM driver but a custom ioctl > > >> > on the GPU's DRM driver to allocate non scanout, off-screen > > >> > buffers. Sounds great, but I don't think that really works > > >> > with DRI2. If we used two drivers to allocate buffers, which > > >> > of those drivers do we return in DRI2ConnectReply? Even if we > > >> > solve that somehow, GEM flink names are name-spaced to a > > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > > >> > how does the EGL in the client know which DRM device owns GEM > > >> > flink name "1234"? We'd need some pretty dirty hacks. > > >> > > >> You would return the name of the display driver allocating the > > >> buffers. On the client side you can use generic ioctls to go from > > >> flink -> handle -> dmabuf. So the client side would end up opening > > >> both the display drm device and the gpu, but without needing to know > > >> too much about the display. > > > > > > I think the bit I was missing was that a GEM bo for a buffer imported > > > using dma_buf/PRIME can still be flink'd. So the display controller's > > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > > > ioctl. Those scan-out buffers than then be exported from the > > > dispaly's DRM driver and imported into the GPU's DRM driver using > > > PRIME. Once imported into the GPU's driver, we can use flink to get a > > > name for that buffer within the GPU DRM driver's name-space to return > > > to the DRI2 client. That same namespace is also what DRI2 back- > > > buffers are allocated from, so I think that could work... Except... > > > > (and.. the general direction is that things will move more to just use > > dmabuf directly, ie. wayland or dri3) > > I agree, DRI2 is the only reason why we need a system-wide ID. I also > prefer buffers to be passed around by dma_buf fd, but we still need to > support DRI2 and will do for some time I expect. > > > > > >> > Anyway, that latter case also gets quite difficult. The "GPU" > > >> > DRM driver would need to know the constraints of the display > > >> > controller when allocating buffers intended to be scanned out. > > >> > For example, pl111 typically isn't behind an IOMMU and so > > >> > requires physically contiguous memory. We'd have to teach the > > >> > GPU's DRM driver about the constraints of the display HW. Not > > >> > exactly a clean driver model. :-( > > >> > > > >> > I'm still a little stuck on how to proceed, so any ideas > > >> > would greatly appreciated! My current train of thought is > > >> > having a kind of SoC-specific DRM driver which allocates > > >> > buffers for both display and GPU within a single GEM > > >> > namespace. That SoC-specific DRM driver could then know the > > >> > constraints of both the GPU and the display HW. We could then > > >> > use PRIME to export buffers allocated with the SoC DRM driver > > >> > and import them into the GPU and/or display DRM driver. > > >> > > >> Usually if the display drm driver is allocating the buffers that > > >> might be scanned out, it just needs to have minimal knowledge of > > >> the GPU (pitch alignment constraints). I don't think we need a > > >> 3rd device just to allocate buffers. > > > > > > While Mali can render to pretty much any buffer, there is a mild > > > performance improvement to be had if the buffer stride is aligned to > > > the AXI bus's max burst length when drawing to the buffer. > > > > I suspect the display controllers might frequently benefit if the > > pitch is aligned to AXI burst length too.. > > If the display controller is going to be reading from linear memory > I don't think it will make much difference - you'll just get an extra > 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > you get those extra transactions per _tile_ scan-line and as such, > the overhead is more pronounced. > > > > > > So in some respects, there is a constraint on how buffers which will > > > be drawn to using the GPU are allocated. I don't really like the idea > > > of teaching the display controller DRM driver about the GPU buffer > > > constraints, even if they are fairly trivial like this. If the same > > > display HW IP is being used on several SoCs, it seems wrong somehow > > > to enforce those GPU constraints if some of those SoCs don't have a > > > GPU. > > > > Well, I suppose you could get min_pitch_alignment from devicetree, or > > something like this.. > > > > In the end, the easy solution is just to make the display allocate to > > the worst-case pitch alignment. In the early days of dma-buf > > discussions, we kicked around the idea of negotiating or > > programatically describing the constraints, but that didn't really > > seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > > > > We may also then have additional constraints when sharing buffers > > > between the display HW and video decode or even camera ISP HW. > > > Programmatically describing buffer allocation constraints is very > > > difficult and I'm not sure you can actually do it - there's some > > > pretty complex constraints out there! E.g. I believe there's a > > > platform where Y and UV planes of the reference frame need to be in > > > separate DRAM banks for real-time 1080p decode, or something like > > > that? > > > > yes, this was discussed. This is different from pitch/format/size > > constraints.. it is really just a placement constraint (ie. where do > > the physical pages go). IIRC the conclusion was to use a dummy > > devices with it's own CMA pool for attaching the Y vs UV buffers. > > > > > Anyway, I guess my point is that even if we solve how to allocate > > > buffers which will be shared between the GPU and display HW such that > > > both sets of constraints are satisfied, that may not be the end of > > > the story. > > > > > > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. > > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? > I strongly disagree with exposing low-level hardware details like tiling to userspace. If we have to do the negotiation of those things in userspace we will end up with having to pipe those information through things like the wayland protocol. I don't see how this could ever be considered a good idea. I would rather see kernel drivers negotiating those things at dmabuf attach time in way invisible to userspace. I agree that this negotiation thing isn't easy to get right for the plethora of different hardware constraints we see today, but I would rather see this in-kernel, where we have the chance to fix things up if needed, than in a fixed userspace interface. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 12:18 ` Lucas Stach (?) @ 2013-08-06 14:14 ` Rob Clark -1 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:14 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: >> Hi Rob, >> >> +lkml >> >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> > >> wrote: >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to >> > >> >> > also allocate buffers for the GPU. Still not sure how to >> > >> >> > resolve this as we don't use DRM for our GPU driver. >> > >> >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term >> > >> >> (esp. once the dma-fence stuff is in place), we'd have >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display >> > >> >> specific drm/kms driver, using prime/dmabuf to share between >> > >> >> the two. >> > >> > >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> > >> > being allocated through DRM/GEM so we could get an flink name >> > >> > for them and pass a reference to them back to our GPU driver on >> > >> > the client side. If it weren't for our need to access those >> > >> > extra off-screen buffers with the GPU we wouldn't need to >> > >> > allocate them with DRM at all. So, given they are really "GPU" >> > >> > buffers, it does absolutely make sense to allocate them in a >> > >> > different driver to the display driver. >> > >> > >> > >> > However, to avoid unnecessary memcpys & related cache >> > >> > maintenance ops, we'd also like the GPU to render into buffers >> > >> > which are scanned out by the display controller. So let's say >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> > >> > out buffers with the display's DRM driver but a custom ioctl >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen >> > >> > buffers. Sounds great, but I don't think that really works >> > >> > with DRI2. If we used two drivers to allocate buffers, which >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we >> > >> > solve that somehow, GEM flink names are name-spaced to a >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> > >> > how does the EGL in the client know which DRM device owns GEM >> > >> > flink name "1234"? We'd need some pretty dirty hacks. >> > >> >> > >> You would return the name of the display driver allocating the >> > >> buffers. On the client side you can use generic ioctls to go from >> > >> flink -> handle -> dmabuf. So the client side would end up opening >> > >> both the display drm device and the gpu, but without needing to know >> > >> too much about the display. >> > > >> > > I think the bit I was missing was that a GEM bo for a buffer imported >> > > using dma_buf/PRIME can still be flink'd. So the display controller's >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate >> > > ioctl. Those scan-out buffers than then be exported from the >> > > dispaly's DRM driver and imported into the GPU's DRM driver using >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a >> > > name for that buffer within the GPU DRM driver's name-space to return >> > > to the DRI2 client. That same namespace is also what DRI2 back- >> > > buffers are allocated from, so I think that could work... Except... >> > >> > (and.. the general direction is that things will move more to just use >> > dmabuf directly, ie. wayland or dri3) >> >> I agree, DRI2 is the only reason why we need a system-wide ID. I also >> prefer buffers to be passed around by dma_buf fd, but we still need to >> support DRI2 and will do for some time I expect. >> >> >> >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> > >> > DRM driver would need to know the constraints of the display >> > >> > controller when allocating buffers intended to be scanned out. >> > >> > For example, pl111 typically isn't behind an IOMMU and so >> > >> > requires physically contiguous memory. We'd have to teach the >> > >> > GPU's DRM driver about the constraints of the display HW. Not >> > >> > exactly a clean driver model. :-( >> > >> > >> > >> > I'm still a little stuck on how to proceed, so any ideas >> > >> > would greatly appreciated! My current train of thought is >> > >> > having a kind of SoC-specific DRM driver which allocates >> > >> > buffers for both display and GPU within a single GEM >> > >> > namespace. That SoC-specific DRM driver could then know the >> > >> > constraints of both the GPU and the display HW. We could then >> > >> > use PRIME to export buffers allocated with the SoC DRM driver >> > >> > and import them into the GPU and/or display DRM driver. >> > >> >> > >> Usually if the display drm driver is allocating the buffers that >> > >> might be scanned out, it just needs to have minimal knowledge of >> > >> the GPU (pitch alignment constraints). I don't think we need a >> > >> 3rd device just to allocate buffers. >> > > >> > > While Mali can render to pretty much any buffer, there is a mild >> > > performance improvement to be had if the buffer stride is aligned to >> > > the AXI bus's max burst length when drawing to the buffer. >> > >> > I suspect the display controllers might frequently benefit if the >> > pitch is aligned to AXI burst length too.. >> >> If the display controller is going to be reading from linear memory >> I don't think it will make much difference - you'll just get an extra >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, >> you get those extra transactions per _tile_ scan-line and as such, >> the overhead is more pronounced. >> >> >> >> > > So in some respects, there is a constraint on how buffers which will >> > > be drawn to using the GPU are allocated. I don't really like the idea >> > > of teaching the display controller DRM driver about the GPU buffer >> > > constraints, even if they are fairly trivial like this. If the same >> > > display HW IP is being used on several SoCs, it seems wrong somehow >> > > to enforce those GPU constraints if some of those SoCs don't have a >> > > GPU. >> > >> > Well, I suppose you could get min_pitch_alignment from devicetree, or >> > something like this.. >> > >> > In the end, the easy solution is just to make the display allocate to >> > the worst-case pitch alignment. In the early days of dma-buf >> > discussions, we kicked around the idea of negotiating or >> > programatically describing the constraints, but that didn't really >> > seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >> > > We may also then have additional constraints when sharing buffers >> > > between the display HW and video decode or even camera ISP HW. >> > > Programmatically describing buffer allocation constraints is very >> > > difficult and I'm not sure you can actually do it - there's some >> > > pretty complex constraints out there! E.g. I believe there's a >> > > platform where Y and UV planes of the reference frame need to be in >> > > separate DRAM banks for real-time 1080p decode, or something like >> > > that? >> > >> > yes, this was discussed. This is different from pitch/format/size >> > constraints.. it is really just a placement constraint (ie. where do >> > the physical pages go). IIRC the conclusion was to use a dummy >> > devices with it's own CMA pool for attaching the Y vs UV buffers. >> > >> > > Anyway, I guess my point is that even if we solve how to allocate >> > > buffers which will be shared between the GPU and display HW such that >> > > both sets of constraints are satisfied, that may not be the end of >> > > the story. >> > > >> > >> > that was part of the reason to punt this problem to userspace ;-) >> > >> > In practice, the kernel drivers doesn't usually know too much about >> > the dimensions/format/etc.. that is really userspace level knowledge. >> > There are a few exceptions when the kernel needs to know how to setup >> > GTT/etc for tiled buffers, but normally this sort of information is up >> > at the next level up (userspace, and drm_framebuffer in case of >> > scanout). Userspace media frameworks like GStreamer already have a >> > concept of format/caps negotiation. For non-display<->gpu sharing, I >> > think this is probably where this sort of constraint negotiation >> > should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. >> >> Anyway, assuming user-space can figure out how a buffer should be >> stored in memory, how does it indicate this to a kernel driver and >> actually allocate it? Which ioctl on which device does user-space >> call, with what parameters? Are you suggesting using something like >> ION which exposes the low-level details of how buffers are laid out in >> physical memory to userspace? If not, what? >> > > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. well, unless userspace mmap's via a de-tiling gart type thing, I don't think tiling can be invisible to userspace. But if two GPU's have some overlap in supportable tiled formats, and you have one gpu doing app and one doing compositor, and you want to use tiling for the shared buffers, you need something in the wayland protocol to figure out what the common supported formats are between the two sides. I suppose it shouldn't be too hard to add a standardized (cross-driver) format-negotiation protocol, and in the absence of that fallback to non tiled format for shared buffers. > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. Well, if you can think of a sane way to add that to dev->dma_params, and if it isn't visible if userspace mmap's a buffer, then we could handle that in the kernel. But I don't think that will be the case. BR, -R > > Regards, > Lucas > -- > Pengutronix e.K. | Lucas Stach | > Industrial Linux Solutions | http://www.pengutronix.de/ | > Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:14 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:14 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: >> Hi Rob, >> >> +lkml >> >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> > >> wrote: >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to >> > >> >> > also allocate buffers for the GPU. Still not sure how to >> > >> >> > resolve this as we don't use DRM for our GPU driver. >> > >> >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term >> > >> >> (esp. once the dma-fence stuff is in place), we'd have >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display >> > >> >> specific drm/kms driver, using prime/dmabuf to share between >> > >> >> the two. >> > >> > >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> > >> > being allocated through DRM/GEM so we could get an flink name >> > >> > for them and pass a reference to them back to our GPU driver on >> > >> > the client side. If it weren't for our need to access those >> > >> > extra off-screen buffers with the GPU we wouldn't need to >> > >> > allocate them with DRM at all. So, given they are really "GPU" >> > >> > buffers, it does absolutely make sense to allocate them in a >> > >> > different driver to the display driver. >> > >> > >> > >> > However, to avoid unnecessary memcpys & related cache >> > >> > maintenance ops, we'd also like the GPU to render into buffers >> > >> > which are scanned out by the display controller. So let's say >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> > >> > out buffers with the display's DRM driver but a custom ioctl >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen >> > >> > buffers. Sounds great, but I don't think that really works >> > >> > with DRI2. If we used two drivers to allocate buffers, which >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we >> > >> > solve that somehow, GEM flink names are name-spaced to a >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> > >> > how does the EGL in the client know which DRM device owns GEM >> > >> > flink name "1234"? We'd need some pretty dirty hacks. >> > >> >> > >> You would return the name of the display driver allocating the >> > >> buffers. On the client side you can use generic ioctls to go from >> > >> flink -> handle -> dmabuf. So the client side would end up opening >> > >> both the display drm device and the gpu, but without needing to know >> > >> too much about the display. >> > > >> > > I think the bit I was missing was that a GEM bo for a buffer imported >> > > using dma_buf/PRIME can still be flink'd. So the display controller's >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate >> > > ioctl. Those scan-out buffers than then be exported from the >> > > dispaly's DRM driver and imported into the GPU's DRM driver using >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a >> > > name for that buffer within the GPU DRM driver's name-space to return >> > > to the DRI2 client. That same namespace is also what DRI2 back- >> > > buffers are allocated from, so I think that could work... Except... >> > >> > (and.. the general direction is that things will move more to just use >> > dmabuf directly, ie. wayland or dri3) >> >> I agree, DRI2 is the only reason why we need a system-wide ID. I also >> prefer buffers to be passed around by dma_buf fd, but we still need to >> support DRI2 and will do for some time I expect. >> >> >> >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> > >> > DRM driver would need to know the constraints of the display >> > >> > controller when allocating buffers intended to be scanned out. >> > >> > For example, pl111 typically isn't behind an IOMMU and so >> > >> > requires physically contiguous memory. We'd have to teach the >> > >> > GPU's DRM driver about the constraints of the display HW. Not >> > >> > exactly a clean driver model. :-( >> > >> > >> > >> > I'm still a little stuck on how to proceed, so any ideas >> > >> > would greatly appreciated! My current train of thought is >> > >> > having a kind of SoC-specific DRM driver which allocates >> > >> > buffers for both display and GPU within a single GEM >> > >> > namespace. That SoC-specific DRM driver could then know the >> > >> > constraints of both the GPU and the display HW. We could then >> > >> > use PRIME to export buffers allocated with the SoC DRM driver >> > >> > and import them into the GPU and/or display DRM driver. >> > >> >> > >> Usually if the display drm driver is allocating the buffers that >> > >> might be scanned out, it just needs to have minimal knowledge of >> > >> the GPU (pitch alignment constraints). I don't think we need a >> > >> 3rd device just to allocate buffers. >> > > >> > > While Mali can render to pretty much any buffer, there is a mild >> > > performance improvement to be had if the buffer stride is aligned to >> > > the AXI bus's max burst length when drawing to the buffer. >> > >> > I suspect the display controllers might frequently benefit if the >> > pitch is aligned to AXI burst length too.. >> >> If the display controller is going to be reading from linear memory >> I don't think it will make much difference - you'll just get an extra >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, >> you get those extra transactions per _tile_ scan-line and as such, >> the overhead is more pronounced. >> >> >> >> > > So in some respects, there is a constraint on how buffers which will >> > > be drawn to using the GPU are allocated. I don't really like the idea >> > > of teaching the display controller DRM driver about the GPU buffer >> > > constraints, even if they are fairly trivial like this. If the same >> > > display HW IP is being used on several SoCs, it seems wrong somehow >> > > to enforce those GPU constraints if some of those SoCs don't have a >> > > GPU. >> > >> > Well, I suppose you could get min_pitch_alignment from devicetree, or >> > something like this.. >> > >> > In the end, the easy solution is just to make the display allocate to >> > the worst-case pitch alignment. In the early days of dma-buf >> > discussions, we kicked around the idea of negotiating or >> > programatically describing the constraints, but that didn't really >> > seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >> > > We may also then have additional constraints when sharing buffers >> > > between the display HW and video decode or even camera ISP HW. >> > > Programmatically describing buffer allocation constraints is very >> > > difficult and I'm not sure you can actually do it - there's some >> > > pretty complex constraints out there! E.g. I believe there's a >> > > platform where Y and UV planes of the reference frame need to be in >> > > separate DRAM banks for real-time 1080p decode, or something like >> > > that? >> > >> > yes, this was discussed. This is different from pitch/format/size >> > constraints.. it is really just a placement constraint (ie. where do >> > the physical pages go). IIRC the conclusion was to use a dummy >> > devices with it's own CMA pool for attaching the Y vs UV buffers. >> > >> > > Anyway, I guess my point is that even if we solve how to allocate >> > > buffers which will be shared between the GPU and display HW such that >> > > both sets of constraints are satisfied, that may not be the end of >> > > the story. >> > > >> > >> > that was part of the reason to punt this problem to userspace ;-) >> > >> > In practice, the kernel drivers doesn't usually know too much about >> > the dimensions/format/etc.. that is really userspace level knowledge. >> > There are a few exceptions when the kernel needs to know how to setup >> > GTT/etc for tiled buffers, but normally this sort of information is up >> > at the next level up (userspace, and drm_framebuffer in case of >> > scanout). Userspace media frameworks like GStreamer already have a >> > concept of format/caps negotiation. For non-display<->gpu sharing, I >> > think this is probably where this sort of constraint negotiation >> > should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. >> >> Anyway, assuming user-space can figure out how a buffer should be >> stored in memory, how does it indicate this to a kernel driver and >> actually allocate it? Which ioctl on which device does user-space >> call, with what parameters? Are you suggesting using something like >> ION which exposes the low-level details of how buffers are laid out in >> physical memory to userspace? If not, what? >> > > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. well, unless userspace mmap's via a de-tiling gart type thing, I don't think tiling can be invisible to userspace. But if two GPU's have some overlap in supportable tiled formats, and you have one gpu doing app and one doing compositor, and you want to use tiling for the shared buffers, you need something in the wayland protocol to figure out what the common supported formats are between the two sides. I suppose it shouldn't be too hard to add a standardized (cross-driver) format-negotiation protocol, and in the absence of that fallback to non tiled format for shared buffers. > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. Well, if you can think of a sane way to add that to dev->dma_params, and if it isn't visible if userspace mmap's a buffer, then we could handle that in the kernel. But I don't think that will be the case. BR, -R > > Regards, > Lucas > -- > Pengutronix e.K. | Lucas Stach | > Industrial Linux Solutions | http://www.pengutronix.de/ | > Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:14 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:14 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: >> Hi Rob, >> >> +lkml >> >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> > >> wrote: >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to >> > >> >> > also allocate buffers for the GPU. Still not sure how to >> > >> >> > resolve this as we don't use DRM for our GPU driver. >> > >> >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term >> > >> >> (esp. once the dma-fence stuff is in place), we'd have >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display >> > >> >> specific drm/kms driver, using prime/dmabuf to share between >> > >> >> the two. >> > >> > >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> > >> > being allocated through DRM/GEM so we could get an flink name >> > >> > for them and pass a reference to them back to our GPU driver on >> > >> > the client side. If it weren't for our need to access those >> > >> > extra off-screen buffers with the GPU we wouldn't need to >> > >> > allocate them with DRM at all. So, given they are really "GPU" >> > >> > buffers, it does absolutely make sense to allocate them in a >> > >> > different driver to the display driver. >> > >> > >> > >> > However, to avoid unnecessary memcpys & related cache >> > >> > maintenance ops, we'd also like the GPU to render into buffers >> > >> > which are scanned out by the display controller. So let's say >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> > >> > out buffers with the display's DRM driver but a custom ioctl >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen >> > >> > buffers. Sounds great, but I don't think that really works >> > >> > with DRI2. If we used two drivers to allocate buffers, which >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we >> > >> > solve that somehow, GEM flink names are name-spaced to a >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> > >> > how does the EGL in the client know which DRM device owns GEM >> > >> > flink name "1234"? We'd need some pretty dirty hacks. >> > >> >> > >> You would return the name of the display driver allocating the >> > >> buffers. On the client side you can use generic ioctls to go from >> > >> flink -> handle -> dmabuf. So the client side would end up opening >> > >> both the display drm device and the gpu, but without needing to know >> > >> too much about the display. >> > > >> > > I think the bit I was missing was that a GEM bo for a buffer imported >> > > using dma_buf/PRIME can still be flink'd. So the display controller's >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate >> > > ioctl. Those scan-out buffers than then be exported from the >> > > dispaly's DRM driver and imported into the GPU's DRM driver using >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a >> > > name for that buffer within the GPU DRM driver's name-space to return >> > > to the DRI2 client. That same namespace is also what DRI2 back- >> > > buffers are allocated from, so I think that could work... Except... >> > >> > (and.. the general direction is that things will move more to just use >> > dmabuf directly, ie. wayland or dri3) >> >> I agree, DRI2 is the only reason why we need a system-wide ID. I also >> prefer buffers to be passed around by dma_buf fd, but we still need to >> support DRI2 and will do for some time I expect. >> >> >> >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> > >> > DRM driver would need to know the constraints of the display >> > >> > controller when allocating buffers intended to be scanned out. >> > >> > For example, pl111 typically isn't behind an IOMMU and so >> > >> > requires physically contiguous memory. We'd have to teach the >> > >> > GPU's DRM driver about the constraints of the display HW. Not >> > >> > exactly a clean driver model. :-( >> > >> > >> > >> > I'm still a little stuck on how to proceed, so any ideas >> > >> > would greatly appreciated! My current train of thought is >> > >> > having a kind of SoC-specific DRM driver which allocates >> > >> > buffers for both display and GPU within a single GEM >> > >> > namespace. That SoC-specific DRM driver could then know the >> > >> > constraints of both the GPU and the display HW. We could then >> > >> > use PRIME to export buffers allocated with the SoC DRM driver >> > >> > and import them into the GPU and/or display DRM driver. >> > >> >> > >> Usually if the display drm driver is allocating the buffers that >> > >> might be scanned out, it just needs to have minimal knowledge of >> > >> the GPU (pitch alignment constraints). I don't think we need a >> > >> 3rd device just to allocate buffers. >> > > >> > > While Mali can render to pretty much any buffer, there is a mild >> > > performance improvement to be had if the buffer stride is aligned to >> > > the AXI bus's max burst length when drawing to the buffer. >> > >> > I suspect the display controllers might frequently benefit if the >> > pitch is aligned to AXI burst length too.. >> >> If the display controller is going to be reading from linear memory >> I don't think it will make much difference - you'll just get an extra >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, >> you get those extra transactions per _tile_ scan-line and as such, >> the overhead is more pronounced. >> >> >> >> > > So in some respects, there is a constraint on how buffers which will >> > > be drawn to using the GPU are allocated. I don't really like the idea >> > > of teaching the display controller DRM driver about the GPU buffer >> > > constraints, even if they are fairly trivial like this. If the same >> > > display HW IP is being used on several SoCs, it seems wrong somehow >> > > to enforce those GPU constraints if some of those SoCs don't have a >> > > GPU. >> > >> > Well, I suppose you could get min_pitch_alignment from devicetree, or >> > something like this.. >> > >> > In the end, the easy solution is just to make the display allocate to >> > the worst-case pitch alignment. In the early days of dma-buf >> > discussions, we kicked around the idea of negotiating or >> > programatically describing the constraints, but that didn't really >> > seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >> > > We may also then have additional constraints when sharing buffers >> > > between the display HW and video decode or even camera ISP HW. >> > > Programmatically describing buffer allocation constraints is very >> > > difficult and I'm not sure you can actually do it - there's some >> > > pretty complex constraints out there! E.g. I believe there's a >> > > platform where Y and UV planes of the reference frame need to be in >> > > separate DRAM banks for real-time 1080p decode, or something like >> > > that? >> > >> > yes, this was discussed. This is different from pitch/format/size >> > constraints.. it is really just a placement constraint (ie. where do >> > the physical pages go). IIRC the conclusion was to use a dummy >> > devices with it's own CMA pool for attaching the Y vs UV buffers. >> > >> > > Anyway, I guess my point is that even if we solve how to allocate >> > > buffers which will be shared between the GPU and display HW such that >> > > both sets of constraints are satisfied, that may not be the end of >> > > the story. >> > > >> > >> > that was part of the reason to punt this problem to userspace ;-) >> > >> > In practice, the kernel drivers doesn't usually know too much about >> > the dimensions/format/etc.. that is really userspace level knowledge. >> > There are a few exceptions when the kernel needs to know how to setup >> > GTT/etc for tiled buffers, but normally this sort of information is up >> > at the next level up (userspace, and drm_framebuffer in case of >> > scanout). Userspace media frameworks like GStreamer already have a >> > concept of format/caps negotiation. For non-display<->gpu sharing, I >> > think this is probably where this sort of constraint negotiation >> > should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. >> >> Anyway, assuming user-space can figure out how a buffer should be >> stored in memory, how does it indicate this to a kernel driver and >> actually allocate it? Which ioctl on which device does user-space >> call, with what parameters? Are you suggesting using something like >> ION which exposes the low-level details of how buffers are laid out in >> physical memory to userspace? If not, what? >> > > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. well, unless userspace mmap's via a de-tiling gart type thing, I don't think tiling can be invisible to userspace. But if two GPU's have some overlap in supportable tiled formats, and you have one gpu doing app and one doing compositor, and you want to use tiling for the shared buffers, you need something in the wayland protocol to figure out what the common supported formats are between the two sides. I suppose it shouldn't be too hard to add a standardized (cross-driver) format-negotiation protocol, and in the absence of that fallback to non tiled format for shared buffers. > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. Well, if you can think of a sane way to add that to dev->dma_params, and if it isn't visible if userspace mmap's a buffer, then we could handle that in the kernel. But I don't think that will be the case. BR, -R > > Regards, > Lucas > -- > Pengutronix e.K. | Lucas Stach | > Industrial Linux Solutions | http://www.pengutronix.de/ | > Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 14:14 ` Rob Clark (?) @ 2013-08-06 14:36 ` Lucas Stach -1 siblings, 0 replies; 131+ messages in thread From: Lucas Stach @ 2013-08-06 14:36 UTC (permalink / raw) To: Rob Clark Cc: Tom Cooksey, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Am Dienstag, den 06.08.2013, 10:14 -0400 schrieb Rob Clark: > On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: > >> Hi Rob, > >> > >> +lkml > >> > >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > >> > >> wrote: > >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > >> > >> >> > also allocate buffers for the GPU. Still not sure how to > >> > >> >> > resolve this as we don't use DRM for our GPU driver. > >> > >> >> > >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > >> > >> >> (esp. once the dma-fence stuff is in place), we'd have > >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > >> > >> >> specific drm/kms driver, using prime/dmabuf to share between > >> > >> >> the two. > >> > >> > > >> > >> > The "extra" buffers we were allocating from armsoc DDX were really > >> > >> > being allocated through DRM/GEM so we could get an flink name > >> > >> > for them and pass a reference to them back to our GPU driver on > >> > >> > the client side. If it weren't for our need to access those > >> > >> > extra off-screen buffers with the GPU we wouldn't need to > >> > >> > allocate them with DRM at all. So, given they are really "GPU" > >> > >> > buffers, it does absolutely make sense to allocate them in a > >> > >> > different driver to the display driver. > >> > >> > > >> > >> > However, to avoid unnecessary memcpys & related cache > >> > >> > maintenance ops, we'd also like the GPU to render into buffers > >> > >> > which are scanned out by the display controller. So let's say > >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > >> > >> > out buffers with the display's DRM driver but a custom ioctl > >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen > >> > >> > buffers. Sounds great, but I don't think that really works > >> > >> > with DRI2. If we used two drivers to allocate buffers, which > >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we > >> > >> > solve that somehow, GEM flink names are name-spaced to a > >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > >> > >> > how does the EGL in the client know which DRM device owns GEM > >> > >> > flink name "1234"? We'd need some pretty dirty hacks. > >> > >> > >> > >> You would return the name of the display driver allocating the > >> > >> buffers. On the client side you can use generic ioctls to go from > >> > >> flink -> handle -> dmabuf. So the client side would end up opening > >> > >> both the display drm device and the gpu, but without needing to know > >> > >> too much about the display. > >> > > > >> > > I think the bit I was missing was that a GEM bo for a buffer imported > >> > > using dma_buf/PRIME can still be flink'd. So the display controller's > >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > >> > > ioctl. Those scan-out buffers than then be exported from the > >> > > dispaly's DRM driver and imported into the GPU's DRM driver using > >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a > >> > > name for that buffer within the GPU DRM driver's name-space to return > >> > > to the DRI2 client. That same namespace is also what DRI2 back- > >> > > buffers are allocated from, so I think that could work... Except... > >> > > >> > (and.. the general direction is that things will move more to just use > >> > dmabuf directly, ie. wayland or dri3) > >> > >> I agree, DRI2 is the only reason why we need a system-wide ID. I also > >> prefer buffers to be passed around by dma_buf fd, but we still need to > >> support DRI2 and will do for some time I expect. > >> > >> > >> > >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" > >> > >> > DRM driver would need to know the constraints of the display > >> > >> > controller when allocating buffers intended to be scanned out. > >> > >> > For example, pl111 typically isn't behind an IOMMU and so > >> > >> > requires physically contiguous memory. We'd have to teach the > >> > >> > GPU's DRM driver about the constraints of the display HW. Not > >> > >> > exactly a clean driver model. :-( > >> > >> > > >> > >> > I'm still a little stuck on how to proceed, so any ideas > >> > >> > would greatly appreciated! My current train of thought is > >> > >> > having a kind of SoC-specific DRM driver which allocates > >> > >> > buffers for both display and GPU within a single GEM > >> > >> > namespace. That SoC-specific DRM driver could then know the > >> > >> > constraints of both the GPU and the display HW. We could then > >> > >> > use PRIME to export buffers allocated with the SoC DRM driver > >> > >> > and import them into the GPU and/or display DRM driver. > >> > >> > >> > >> Usually if the display drm driver is allocating the buffers that > >> > >> might be scanned out, it just needs to have minimal knowledge of > >> > >> the GPU (pitch alignment constraints). I don't think we need a > >> > >> 3rd device just to allocate buffers. > >> > > > >> > > While Mali can render to pretty much any buffer, there is a mild > >> > > performance improvement to be had if the buffer stride is aligned to > >> > > the AXI bus's max burst length when drawing to the buffer. > >> > > >> > I suspect the display controllers might frequently benefit if the > >> > pitch is aligned to AXI burst length too.. > >> > >> If the display controller is going to be reading from linear memory > >> I don't think it will make much difference - you'll just get an extra > >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > >> you get those extra transactions per _tile_ scan-line and as such, > >> the overhead is more pronounced. > >> > >> > >> > >> > > So in some respects, there is a constraint on how buffers which will > >> > > be drawn to using the GPU are allocated. I don't really like the idea > >> > > of teaching the display controller DRM driver about the GPU buffer > >> > > constraints, even if they are fairly trivial like this. If the same > >> > > display HW IP is being used on several SoCs, it seems wrong somehow > >> > > to enforce those GPU constraints if some of those SoCs don't have a > >> > > GPU. > >> > > >> > Well, I suppose you could get min_pitch_alignment from devicetree, or > >> > something like this.. > >> > > >> > In the end, the easy solution is just to make the display allocate to > >> > the worst-case pitch alignment. In the early days of dma-buf > >> > discussions, we kicked around the idea of negotiating or > >> > programatically describing the constraints, but that didn't really > >> > seem like a bounded problem. > >> > >> Yeah - I was around for some of those discussions and agree it's not > >> really an easy problem to solve. > >> > >> > >> > >> > > We may also then have additional constraints when sharing buffers > >> > > between the display HW and video decode or even camera ISP HW. > >> > > Programmatically describing buffer allocation constraints is very > >> > > difficult and I'm not sure you can actually do it - there's some > >> > > pretty complex constraints out there! E.g. I believe there's a > >> > > platform where Y and UV planes of the reference frame need to be in > >> > > separate DRAM banks for real-time 1080p decode, or something like > >> > > that? > >> > > >> > yes, this was discussed. This is different from pitch/format/size > >> > constraints.. it is really just a placement constraint (ie. where do > >> > the physical pages go). IIRC the conclusion was to use a dummy > >> > devices with it's own CMA pool for attaching the Y vs UV buffers. > >> > > >> > > Anyway, I guess my point is that even if we solve how to allocate > >> > > buffers which will be shared between the GPU and display HW such that > >> > > both sets of constraints are satisfied, that may not be the end of > >> > > the story. > >> > > > >> > > >> > that was part of the reason to punt this problem to userspace ;-) > >> > > >> > In practice, the kernel drivers doesn't usually know too much about > >> > the dimensions/format/etc.. that is really userspace level knowledge. > >> > There are a few exceptions when the kernel needs to know how to setup > >> > GTT/etc for tiled buffers, but normally this sort of information is up > >> > at the next level up (userspace, and drm_framebuffer in case of > >> > scanout). Userspace media frameworks like GStreamer already have a > >> > concept of format/caps negotiation. For non-display<->gpu sharing, I > >> > think this is probably where this sort of constraint negotiation > >> > should be handled. > >> > >> I agree that user-space will know which devices will access the buffer > >> and thus can figure out at least a common pixel format. Though I'm not > >> so sure userspace can figure out more low-level details like alignment > >> and placement in physical memory, etc. > >> > >> Anyway, assuming user-space can figure out how a buffer should be > >> stored in memory, how does it indicate this to a kernel driver and > >> actually allocate it? Which ioctl on which device does user-space > >> call, with what parameters? Are you suggesting using something like > >> ION which exposes the low-level details of how buffers are laid out in > >> physical memory to userspace? If not, what? > >> > > > > I strongly disagree with exposing low-level hardware details like tiling > > to userspace. If we have to do the negotiation of those things in > > userspace we will end up with having to pipe those information through > > things like the wayland protocol. I don't see how this could ever be > > considered a good idea. > > well, unless userspace mmap's via a de-tiling gart type thing, I don't > think tiling can be invisible to userspace. > Why is mmap considered to be such a strong use-case for DMABUFs? After all we are trying to _avoid_ mmapping shared buffers where ever possible. > But if two GPU's have some overlap in supportable tiled formats, and > you have one gpu doing app and one doing compositor, and you want to > use tiling for the shared buffers, you need something in the wayland > protocol to figure out what the common supported formats are between > the two sides. I suppose it shouldn't be too hard to add a > standardized (cross-driver) format-negotiation protocol, and in the > absence of that fallback to non tiled format for shared buffers. > I don't see how tiling format negotiation would be easier in userspace than in the kernel. If we can come up with a scheme for that, we can as well do it in the kernel. > > I would rather see kernel drivers negotiating those things at dmabuf > > attach time in way invisible to userspace. I agree that this negotiation > > thing isn't easy to get right for the plethora of different hardware > > constraints we see today, but I would rather see this in-kernel, where > > we have the chance to fix things up if needed, than in a fixed userspace > > interface. > > Well, if you can think of a sane way to add that to dev->dma_params, > and if it isn't visible if userspace mmap's a buffer, then we could > handle that in the kernel. But I don't think that will be the case. > I'm sure we can come up with something sane to put it in there. The userspace mmap thing is a bit complicated to deal with, but I have the feeling that going through a slow path if you really need mmap is a reasonable thing to do. For example we could just make mmap some form of attach that forces linear layout if the exporter isn't able to give userspace a linear mapping from a tiled buffer by using GART or VM. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:36 ` Lucas Stach 0 siblings, 0 replies; 131+ messages in thread From: Lucas Stach @ 2013-08-06 14:36 UTC (permalink / raw) To: linux-arm-kernel Am Dienstag, den 06.08.2013, 10:14 -0400 schrieb Rob Clark: > On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: > >> Hi Rob, > >> > >> +lkml > >> > >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > >> > >> wrote: > >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > >> > >> >> > also allocate buffers for the GPU. Still not sure how to > >> > >> >> > resolve this as we don't use DRM for our GPU driver. > >> > >> >> > >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > >> > >> >> (esp. once the dma-fence stuff is in place), we'd have > >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > >> > >> >> specific drm/kms driver, using prime/dmabuf to share between > >> > >> >> the two. > >> > >> > > >> > >> > The "extra" buffers we were allocating from armsoc DDX were really > >> > >> > being allocated through DRM/GEM so we could get an flink name > >> > >> > for them and pass a reference to them back to our GPU driver on > >> > >> > the client side. If it weren't for our need to access those > >> > >> > extra off-screen buffers with the GPU we wouldn't need to > >> > >> > allocate them with DRM at all. So, given they are really "GPU" > >> > >> > buffers, it does absolutely make sense to allocate them in a > >> > >> > different driver to the display driver. > >> > >> > > >> > >> > However, to avoid unnecessary memcpys & related cache > >> > >> > maintenance ops, we'd also like the GPU to render into buffers > >> > >> > which are scanned out by the display controller. So let's say > >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > >> > >> > out buffers with the display's DRM driver but a custom ioctl > >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen > >> > >> > buffers. Sounds great, but I don't think that really works > >> > >> > with DRI2. If we used two drivers to allocate buffers, which > >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we > >> > >> > solve that somehow, GEM flink names are name-spaced to a > >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > >> > >> > how does the EGL in the client know which DRM device owns GEM > >> > >> > flink name "1234"? We'd need some pretty dirty hacks. > >> > >> > >> > >> You would return the name of the display driver allocating the > >> > >> buffers. On the client side you can use generic ioctls to go from > >> > >> flink -> handle -> dmabuf. So the client side would end up opening > >> > >> both the display drm device and the gpu, but without needing to know > >> > >> too much about the display. > >> > > > >> > > I think the bit I was missing was that a GEM bo for a buffer imported > >> > > using dma_buf/PRIME can still be flink'd. So the display controller's > >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > >> > > ioctl. Those scan-out buffers than then be exported from the > >> > > dispaly's DRM driver and imported into the GPU's DRM driver using > >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a > >> > > name for that buffer within the GPU DRM driver's name-space to return > >> > > to the DRI2 client. That same namespace is also what DRI2 back- > >> > > buffers are allocated from, so I think that could work... Except... > >> > > >> > (and.. the general direction is that things will move more to just use > >> > dmabuf directly, ie. wayland or dri3) > >> > >> I agree, DRI2 is the only reason why we need a system-wide ID. I also > >> prefer buffers to be passed around by dma_buf fd, but we still need to > >> support DRI2 and will do for some time I expect. > >> > >> > >> > >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" > >> > >> > DRM driver would need to know the constraints of the display > >> > >> > controller when allocating buffers intended to be scanned out. > >> > >> > For example, pl111 typically isn't behind an IOMMU and so > >> > >> > requires physically contiguous memory. We'd have to teach the > >> > >> > GPU's DRM driver about the constraints of the display HW. Not > >> > >> > exactly a clean driver model. :-( > >> > >> > > >> > >> > I'm still a little stuck on how to proceed, so any ideas > >> > >> > would greatly appreciated! My current train of thought is > >> > >> > having a kind of SoC-specific DRM driver which allocates > >> > >> > buffers for both display and GPU within a single GEM > >> > >> > namespace. That SoC-specific DRM driver could then know the > >> > >> > constraints of both the GPU and the display HW. We could then > >> > >> > use PRIME to export buffers allocated with the SoC DRM driver > >> > >> > and import them into the GPU and/or display DRM driver. > >> > >> > >> > >> Usually if the display drm driver is allocating the buffers that > >> > >> might be scanned out, it just needs to have minimal knowledge of > >> > >> the GPU (pitch alignment constraints). I don't think we need a > >> > >> 3rd device just to allocate buffers. > >> > > > >> > > While Mali can render to pretty much any buffer, there is a mild > >> > > performance improvement to be had if the buffer stride is aligned to > >> > > the AXI bus's max burst length when drawing to the buffer. > >> > > >> > I suspect the display controllers might frequently benefit if the > >> > pitch is aligned to AXI burst length too.. > >> > >> If the display controller is going to be reading from linear memory > >> I don't think it will make much difference - you'll just get an extra > >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > >> you get those extra transactions per _tile_ scan-line and as such, > >> the overhead is more pronounced. > >> > >> > >> > >> > > So in some respects, there is a constraint on how buffers which will > >> > > be drawn to using the GPU are allocated. I don't really like the idea > >> > > of teaching the display controller DRM driver about the GPU buffer > >> > > constraints, even if they are fairly trivial like this. If the same > >> > > display HW IP is being used on several SoCs, it seems wrong somehow > >> > > to enforce those GPU constraints if some of those SoCs don't have a > >> > > GPU. > >> > > >> > Well, I suppose you could get min_pitch_alignment from devicetree, or > >> > something like this.. > >> > > >> > In the end, the easy solution is just to make the display allocate to > >> > the worst-case pitch alignment. In the early days of dma-buf > >> > discussions, we kicked around the idea of negotiating or > >> > programatically describing the constraints, but that didn't really > >> > seem like a bounded problem. > >> > >> Yeah - I was around for some of those discussions and agree it's not > >> really an easy problem to solve. > >> > >> > >> > >> > > We may also then have additional constraints when sharing buffers > >> > > between the display HW and video decode or even camera ISP HW. > >> > > Programmatically describing buffer allocation constraints is very > >> > > difficult and I'm not sure you can actually do it - there's some > >> > > pretty complex constraints out there! E.g. I believe there's a > >> > > platform where Y and UV planes of the reference frame need to be in > >> > > separate DRAM banks for real-time 1080p decode, or something like > >> > > that? > >> > > >> > yes, this was discussed. This is different from pitch/format/size > >> > constraints.. it is really just a placement constraint (ie. where do > >> > the physical pages go). IIRC the conclusion was to use a dummy > >> > devices with it's own CMA pool for attaching the Y vs UV buffers. > >> > > >> > > Anyway, I guess my point is that even if we solve how to allocate > >> > > buffers which will be shared between the GPU and display HW such that > >> > > both sets of constraints are satisfied, that may not be the end of > >> > > the story. > >> > > > >> > > >> > that was part of the reason to punt this problem to userspace ;-) > >> > > >> > In practice, the kernel drivers doesn't usually know too much about > >> > the dimensions/format/etc.. that is really userspace level knowledge. > >> > There are a few exceptions when the kernel needs to know how to setup > >> > GTT/etc for tiled buffers, but normally this sort of information is up > >> > at the next level up (userspace, and drm_framebuffer in case of > >> > scanout). Userspace media frameworks like GStreamer already have a > >> > concept of format/caps negotiation. For non-display<->gpu sharing, I > >> > think this is probably where this sort of constraint negotiation > >> > should be handled. > >> > >> I agree that user-space will know which devices will access the buffer > >> and thus can figure out at least a common pixel format. Though I'm not > >> so sure userspace can figure out more low-level details like alignment > >> and placement in physical memory, etc. > >> > >> Anyway, assuming user-space can figure out how a buffer should be > >> stored in memory, how does it indicate this to a kernel driver and > >> actually allocate it? Which ioctl on which device does user-space > >> call, with what parameters? Are you suggesting using something like > >> ION which exposes the low-level details of how buffers are laid out in > >> physical memory to userspace? If not, what? > >> > > > > I strongly disagree with exposing low-level hardware details like tiling > > to userspace. If we have to do the negotiation of those things in > > userspace we will end up with having to pipe those information through > > things like the wayland protocol. I don't see how this could ever be > > considered a good idea. > > well, unless userspace mmap's via a de-tiling gart type thing, I don't > think tiling can be invisible to userspace. > Why is mmap considered to be such a strong use-case for DMABUFs? After all we are trying to _avoid_ mmapping shared buffers where ever possible. > But if two GPU's have some overlap in supportable tiled formats, and > you have one gpu doing app and one doing compositor, and you want to > use tiling for the shared buffers, you need something in the wayland > protocol to figure out what the common supported formats are between > the two sides. I suppose it shouldn't be too hard to add a > standardized (cross-driver) format-negotiation protocol, and in the > absence of that fallback to non tiled format for shared buffers. > I don't see how tiling format negotiation would be easier in userspace than in the kernel. If we can come up with a scheme for that, we can as well do it in the kernel. > > I would rather see kernel drivers negotiating those things at dmabuf > > attach time in way invisible to userspace. I agree that this negotiation > > thing isn't easy to get right for the plethora of different hardware > > constraints we see today, but I would rather see this in-kernel, where > > we have the chance to fix things up if needed, than in a fixed userspace > > interface. > > Well, if you can think of a sane way to add that to dev->dma_params, > and if it isn't visible if userspace mmap's a buffer, then we could > handle that in the kernel. But I don't think that will be the case. > I'm sure we can come up with something sane to put it in there. The userspace mmap thing is a bit complicated to deal with, but I have the feeling that going through a slow path if you really need mmap is a reasonable thing to do. For example we could just make mmap some form of attach that forces linear layout if the exporter isn't able to give userspace a linear mapping from a tiled buffer by using GART or VM. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:36 ` Lucas Stach 0 siblings, 0 replies; 131+ messages in thread From: Lucas Stach @ 2013-08-06 14:36 UTC (permalink / raw) To: Rob Clark Cc: Tom Cooksey, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Am Dienstag, den 06.08.2013, 10:14 -0400 schrieb Rob Clark: > On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: > >> Hi Rob, > >> > >> +lkml > >> > >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > >> > >> wrote: > >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to > >> > >> >> > also allocate buffers for the GPU. Still not sure how to > >> > >> >> > resolve this as we don't use DRM for our GPU driver. > >> > >> >> > >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term > >> > >> >> (esp. once the dma-fence stuff is in place), we'd have > >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display > >> > >> >> specific drm/kms driver, using prime/dmabuf to share between > >> > >> >> the two. > >> > >> > > >> > >> > The "extra" buffers we were allocating from armsoc DDX were really > >> > >> > being allocated through DRM/GEM so we could get an flink name > >> > >> > for them and pass a reference to them back to our GPU driver on > >> > >> > the client side. If it weren't for our need to access those > >> > >> > extra off-screen buffers with the GPU we wouldn't need to > >> > >> > allocate them with DRM at all. So, given they are really "GPU" > >> > >> > buffers, it does absolutely make sense to allocate them in a > >> > >> > different driver to the display driver. > >> > >> > > >> > >> > However, to avoid unnecessary memcpys & related cache > >> > >> > maintenance ops, we'd also like the GPU to render into buffers > >> > >> > which are scanned out by the display controller. So let's say > >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan > >> > >> > out buffers with the display's DRM driver but a custom ioctl > >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen > >> > >> > buffers. Sounds great, but I don't think that really works > >> > >> > with DRI2. If we used two drivers to allocate buffers, which > >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we > >> > >> > solve that somehow, GEM flink names are name-spaced to a > >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, > >> > >> > how does the EGL in the client know which DRM device owns GEM > >> > >> > flink name "1234"? We'd need some pretty dirty hacks. > >> > >> > >> > >> You would return the name of the display driver allocating the > >> > >> buffers. On the client side you can use generic ioctls to go from > >> > >> flink -> handle -> dmabuf. So the client side would end up opening > >> > >> both the display drm device and the gpu, but without needing to know > >> > >> too much about the display. > >> > > > >> > > I think the bit I was missing was that a GEM bo for a buffer imported > >> > > using dma_buf/PRIME can still be flink'd. So the display controller's > >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate > >> > > ioctl. Those scan-out buffers than then be exported from the > >> > > dispaly's DRM driver and imported into the GPU's DRM driver using > >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a > >> > > name for that buffer within the GPU DRM driver's name-space to return > >> > > to the DRI2 client. That same namespace is also what DRI2 back- > >> > > buffers are allocated from, so I think that could work... Except... > >> > > >> > (and.. the general direction is that things will move more to just use > >> > dmabuf directly, ie. wayland or dri3) > >> > >> I agree, DRI2 is the only reason why we need a system-wide ID. I also > >> prefer buffers to be passed around by dma_buf fd, but we still need to > >> support DRI2 and will do for some time I expect. > >> > >> > >> > >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" > >> > >> > DRM driver would need to know the constraints of the display > >> > >> > controller when allocating buffers intended to be scanned out. > >> > >> > For example, pl111 typically isn't behind an IOMMU and so > >> > >> > requires physically contiguous memory. We'd have to teach the > >> > >> > GPU's DRM driver about the constraints of the display HW. Not > >> > >> > exactly a clean driver model. :-( > >> > >> > > >> > >> > I'm still a little stuck on how to proceed, so any ideas > >> > >> > would greatly appreciated! My current train of thought is > >> > >> > having a kind of SoC-specific DRM driver which allocates > >> > >> > buffers for both display and GPU within a single GEM > >> > >> > namespace. That SoC-specific DRM driver could then know the > >> > >> > constraints of both the GPU and the display HW. We could then > >> > >> > use PRIME to export buffers allocated with the SoC DRM driver > >> > >> > and import them into the GPU and/or display DRM driver. > >> > >> > >> > >> Usually if the display drm driver is allocating the buffers that > >> > >> might be scanned out, it just needs to have minimal knowledge of > >> > >> the GPU (pitch alignment constraints). I don't think we need a > >> > >> 3rd device just to allocate buffers. > >> > > > >> > > While Mali can render to pretty much any buffer, there is a mild > >> > > performance improvement to be had if the buffer stride is aligned to > >> > > the AXI bus's max burst length when drawing to the buffer. > >> > > >> > I suspect the display controllers might frequently benefit if the > >> > pitch is aligned to AXI burst length too.. > >> > >> If the display controller is going to be reading from linear memory > >> I don't think it will make much difference - you'll just get an extra > >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, > >> you get those extra transactions per _tile_ scan-line and as such, > >> the overhead is more pronounced. > >> > >> > >> > >> > > So in some respects, there is a constraint on how buffers which will > >> > > be drawn to using the GPU are allocated. I don't really like the idea > >> > > of teaching the display controller DRM driver about the GPU buffer > >> > > constraints, even if they are fairly trivial like this. If the same > >> > > display HW IP is being used on several SoCs, it seems wrong somehow > >> > > to enforce those GPU constraints if some of those SoCs don't have a > >> > > GPU. > >> > > >> > Well, I suppose you could get min_pitch_alignment from devicetree, or > >> > something like this.. > >> > > >> > In the end, the easy solution is just to make the display allocate to > >> > the worst-case pitch alignment. In the early days of dma-buf > >> > discussions, we kicked around the idea of negotiating or > >> > programatically describing the constraints, but that didn't really > >> > seem like a bounded problem. > >> > >> Yeah - I was around for some of those discussions and agree it's not > >> really an easy problem to solve. > >> > >> > >> > >> > > We may also then have additional constraints when sharing buffers > >> > > between the display HW and video decode or even camera ISP HW. > >> > > Programmatically describing buffer allocation constraints is very > >> > > difficult and I'm not sure you can actually do it - there's some > >> > > pretty complex constraints out there! E.g. I believe there's a > >> > > platform where Y and UV planes of the reference frame need to be in > >> > > separate DRAM banks for real-time 1080p decode, or something like > >> > > that? > >> > > >> > yes, this was discussed. This is different from pitch/format/size > >> > constraints.. it is really just a placement constraint (ie. where do > >> > the physical pages go). IIRC the conclusion was to use a dummy > >> > devices with it's own CMA pool for attaching the Y vs UV buffers. > >> > > >> > > Anyway, I guess my point is that even if we solve how to allocate > >> > > buffers which will be shared between the GPU and display HW such that > >> > > both sets of constraints are satisfied, that may not be the end of > >> > > the story. > >> > > > >> > > >> > that was part of the reason to punt this problem to userspace ;-) > >> > > >> > In practice, the kernel drivers doesn't usually know too much about > >> > the dimensions/format/etc.. that is really userspace level knowledge. > >> > There are a few exceptions when the kernel needs to know how to setup > >> > GTT/etc for tiled buffers, but normally this sort of information is up > >> > at the next level up (userspace, and drm_framebuffer in case of > >> > scanout). Userspace media frameworks like GStreamer already have a > >> > concept of format/caps negotiation. For non-display<->gpu sharing, I > >> > think this is probably where this sort of constraint negotiation > >> > should be handled. > >> > >> I agree that user-space will know which devices will access the buffer > >> and thus can figure out at least a common pixel format. Though I'm not > >> so sure userspace can figure out more low-level details like alignment > >> and placement in physical memory, etc. > >> > >> Anyway, assuming user-space can figure out how a buffer should be > >> stored in memory, how does it indicate this to a kernel driver and > >> actually allocate it? Which ioctl on which device does user-space > >> call, with what parameters? Are you suggesting using something like > >> ION which exposes the low-level details of how buffers are laid out in > >> physical memory to userspace? If not, what? > >> > > > > I strongly disagree with exposing low-level hardware details like tiling > > to userspace. If we have to do the negotiation of those things in > > userspace we will end up with having to pipe those information through > > things like the wayland protocol. I don't see how this could ever be > > considered a good idea. > > well, unless userspace mmap's via a de-tiling gart type thing, I don't > think tiling can be invisible to userspace. > Why is mmap considered to be such a strong use-case for DMABUFs? After all we are trying to _avoid_ mmapping shared buffers where ever possible. > But if two GPU's have some overlap in supportable tiled formats, and > you have one gpu doing app and one doing compositor, and you want to > use tiling for the shared buffers, you need something in the wayland > protocol to figure out what the common supported formats are between > the two sides. I suppose it shouldn't be too hard to add a > standardized (cross-driver) format-negotiation protocol, and in the > absence of that fallback to non tiled format for shared buffers. > I don't see how tiling format negotiation would be easier in userspace than in the kernel. If we can come up with a scheme for that, we can as well do it in the kernel. > > I would rather see kernel drivers negotiating those things at dmabuf > > attach time in way invisible to userspace. I agree that this negotiation > > thing isn't easy to get right for the plethora of different hardware > > constraints we see today, but I would rather see this in-kernel, where > > we have the chance to fix things up if needed, than in a fixed userspace > > interface. > > Well, if you can think of a sane way to add that to dev->dma_params, > and if it isn't visible if userspace mmap's a buffer, then we could > handle that in the kernel. But I don't think that will be the case. > I'm sure we can come up with something sane to put it in there. The userspace mmap thing is a bit complicated to deal with, but I have the feeling that going through a slow path if you really need mmap is a reasonable thing to do. For example we could just make mmap some form of attach that forces linear layout if the exporter isn't able to give userspace a linear mapping from a tiled buffer by using GART or VM. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ | Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 14:36 ` Lucas Stach (?) @ 2013-08-06 14:59 ` Rob Clark -1 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:59 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 10:36 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > Am Dienstag, den 06.08.2013, 10:14 -0400 schrieb Rob Clark: >> On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: >> > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: >> >> Hi Rob, >> >> >> >> +lkml >> >> >> >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> >> > >> wrote: >> >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to >> >> > >> >> > also allocate buffers for the GPU. Still not sure how to >> >> > >> >> > resolve this as we don't use DRM for our GPU driver. >> >> > >> >> >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term >> >> > >> >> (esp. once the dma-fence stuff is in place), we'd have >> >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display >> >> > >> >> specific drm/kms driver, using prime/dmabuf to share between >> >> > >> >> the two. >> >> > >> > >> >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> >> > >> > being allocated through DRM/GEM so we could get an flink name >> >> > >> > for them and pass a reference to them back to our GPU driver on >> >> > >> > the client side. If it weren't for our need to access those >> >> > >> > extra off-screen buffers with the GPU we wouldn't need to >> >> > >> > allocate them with DRM at all. So, given they are really "GPU" >> >> > >> > buffers, it does absolutely make sense to allocate them in a >> >> > >> > different driver to the display driver. >> >> > >> > >> >> > >> > However, to avoid unnecessary memcpys & related cache >> >> > >> > maintenance ops, we'd also like the GPU to render into buffers >> >> > >> > which are scanned out by the display controller. So let's say >> >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> >> > >> > out buffers with the display's DRM driver but a custom ioctl >> >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen >> >> > >> > buffers. Sounds great, but I don't think that really works >> >> > >> > with DRI2. If we used two drivers to allocate buffers, which >> >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we >> >> > >> > solve that somehow, GEM flink names are name-spaced to a >> >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> >> > >> > how does the EGL in the client know which DRM device owns GEM >> >> > >> > flink name "1234"? We'd need some pretty dirty hacks. >> >> > >> >> >> > >> You would return the name of the display driver allocating the >> >> > >> buffers. On the client side you can use generic ioctls to go from >> >> > >> flink -> handle -> dmabuf. So the client side would end up opening >> >> > >> both the display drm device and the gpu, but without needing to know >> >> > >> too much about the display. >> >> > > >> >> > > I think the bit I was missing was that a GEM bo for a buffer imported >> >> > > using dma_buf/PRIME can still be flink'd. So the display controller's >> >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate >> >> > > ioctl. Those scan-out buffers than then be exported from the >> >> > > dispaly's DRM driver and imported into the GPU's DRM driver using >> >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a >> >> > > name for that buffer within the GPU DRM driver's name-space to return >> >> > > to the DRI2 client. That same namespace is also what DRI2 back- >> >> > > buffers are allocated from, so I think that could work... Except... >> >> > >> >> > (and.. the general direction is that things will move more to just use >> >> > dmabuf directly, ie. wayland or dri3) >> >> >> >> I agree, DRI2 is the only reason why we need a system-wide ID. I also >> >> prefer buffers to be passed around by dma_buf fd, but we still need to >> >> support DRI2 and will do for some time I expect. >> >> >> >> >> >> >> >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> >> > >> > DRM driver would need to know the constraints of the display >> >> > >> > controller when allocating buffers intended to be scanned out. >> >> > >> > For example, pl111 typically isn't behind an IOMMU and so >> >> > >> > requires physically contiguous memory. We'd have to teach the >> >> > >> > GPU's DRM driver about the constraints of the display HW. Not >> >> > >> > exactly a clean driver model. :-( >> >> > >> > >> >> > >> > I'm still a little stuck on how to proceed, so any ideas >> >> > >> > would greatly appreciated! My current train of thought is >> >> > >> > having a kind of SoC-specific DRM driver which allocates >> >> > >> > buffers for both display and GPU within a single GEM >> >> > >> > namespace. That SoC-specific DRM driver could then know the >> >> > >> > constraints of both the GPU and the display HW. We could then >> >> > >> > use PRIME to export buffers allocated with the SoC DRM driver >> >> > >> > and import them into the GPU and/or display DRM driver. >> >> > >> >> >> > >> Usually if the display drm driver is allocating the buffers that >> >> > >> might be scanned out, it just needs to have minimal knowledge of >> >> > >> the GPU (pitch alignment constraints). I don't think we need a >> >> > >> 3rd device just to allocate buffers. >> >> > > >> >> > > While Mali can render to pretty much any buffer, there is a mild >> >> > > performance improvement to be had if the buffer stride is aligned to >> >> > > the AXI bus's max burst length when drawing to the buffer. >> >> > >> >> > I suspect the display controllers might frequently benefit if the >> >> > pitch is aligned to AXI burst length too.. >> >> >> >> If the display controller is going to be reading from linear memory >> >> I don't think it will make much difference - you'll just get an extra >> >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, >> >> you get those extra transactions per _tile_ scan-line and as such, >> >> the overhead is more pronounced. >> >> >> >> >> >> >> >> > > So in some respects, there is a constraint on how buffers which will >> >> > > be drawn to using the GPU are allocated. I don't really like the idea >> >> > > of teaching the display controller DRM driver about the GPU buffer >> >> > > constraints, even if they are fairly trivial like this. If the same >> >> > > display HW IP is being used on several SoCs, it seems wrong somehow >> >> > > to enforce those GPU constraints if some of those SoCs don't have a >> >> > > GPU. >> >> > >> >> > Well, I suppose you could get min_pitch_alignment from devicetree, or >> >> > something like this.. >> >> > >> >> > In the end, the easy solution is just to make the display allocate to >> >> > the worst-case pitch alignment. In the early days of dma-buf >> >> > discussions, we kicked around the idea of negotiating or >> >> > programatically describing the constraints, but that didn't really >> >> > seem like a bounded problem. >> >> >> >> Yeah - I was around for some of those discussions and agree it's not >> >> really an easy problem to solve. >> >> >> >> >> >> >> >> > > We may also then have additional constraints when sharing buffers >> >> > > between the display HW and video decode or even camera ISP HW. >> >> > > Programmatically describing buffer allocation constraints is very >> >> > > difficult and I'm not sure you can actually do it - there's some >> >> > > pretty complex constraints out there! E.g. I believe there's a >> >> > > platform where Y and UV planes of the reference frame need to be in >> >> > > separate DRAM banks for real-time 1080p decode, or something like >> >> > > that? >> >> > >> >> > yes, this was discussed. This is different from pitch/format/size >> >> > constraints.. it is really just a placement constraint (ie. where do >> >> > the physical pages go). IIRC the conclusion was to use a dummy >> >> > devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> > >> >> > > Anyway, I guess my point is that even if we solve how to allocate >> >> > > buffers which will be shared between the GPU and display HW such that >> >> > > both sets of constraints are satisfied, that may not be the end of >> >> > > the story. >> >> > > >> >> > >> >> > that was part of the reason to punt this problem to userspace ;-) >> >> > >> >> > In practice, the kernel drivers doesn't usually know too much about >> >> > the dimensions/format/etc.. that is really userspace level knowledge. >> >> > There are a few exceptions when the kernel needs to know how to setup >> >> > GTT/etc for tiled buffers, but normally this sort of information is up >> >> > at the next level up (userspace, and drm_framebuffer in case of >> >> > scanout). Userspace media frameworks like GStreamer already have a >> >> > concept of format/caps negotiation. For non-display<->gpu sharing, I >> >> > think this is probably where this sort of constraint negotiation >> >> > should be handled. >> >> >> >> I agree that user-space will know which devices will access the buffer >> >> and thus can figure out at least a common pixel format. Though I'm not >> >> so sure userspace can figure out more low-level details like alignment >> >> and placement in physical memory, etc. >> >> >> >> Anyway, assuming user-space can figure out how a buffer should be >> >> stored in memory, how does it indicate this to a kernel driver and >> >> actually allocate it? Which ioctl on which device does user-space >> >> call, with what parameters? Are you suggesting using something like >> >> ION which exposes the low-level details of how buffers are laid out in >> >> physical memory to userspace? If not, what? >> >> >> > >> > I strongly disagree with exposing low-level hardware details like tiling >> > to userspace. If we have to do the negotiation of those things in >> > userspace we will end up with having to pipe those information through >> > things like the wayland protocol. I don't see how this could ever be >> > considered a good idea. >> >> well, unless userspace mmap's via a de-tiling gart type thing, I don't >> think tiling can be invisible to userspace. >> > Why is mmap considered to be such a strong use-case for DMABUFs? After > all we are trying to _avoid_ mmapping shared buffers where ever > possible. I don't know if I'd call it a strong use-case, as much as a good worst-case to consider >> But if two GPU's have some overlap in supportable tiled formats, and >> you have one gpu doing app and one doing compositor, and you want to >> use tiling for the shared buffers, you need something in the wayland >> protocol to figure out what the common supported formats are between >> the two sides. I suppose it shouldn't be too hard to add a >> standardized (cross-driver) format-negotiation protocol, and in the >> absence of that fallback to non tiled format for shared buffers. >> > I don't see how tiling format negotiation would be easier in userspace > than in the kernel. If we can come up with a scheme for that, we can as > well do it in the kernel. well.. I guess I don't see how it would be easier in the kernel than in userspace ;-) But if you can come up with a good way to add it to dev->dma_params that everyone can agree on, I don't mind. I suppose I'm starting with the assumption that it will be difficult/impossible to get everyone to agree on this, but I don't mind being proved wrong. >> > I would rather see kernel drivers negotiating those things at dmabuf >> > attach time in way invisible to userspace. I agree that this negotiation >> > thing isn't easy to get right for the plethora of different hardware >> > constraints we see today, but I would rather see this in-kernel, where >> > we have the chance to fix things up if needed, than in a fixed userspace >> > interface. >> >> Well, if you can think of a sane way to add that to dev->dma_params, >> and if it isn't visible if userspace mmap's a buffer, then we could >> handle that in the kernel. But I don't think that will be the case. >> > I'm sure we can come up with something sane to put it in there. The > userspace mmap thing is a bit complicated to deal with, but I have the > feeling that going through a slow path if you really need mmap is a > reasonable thing to do. > For example we could just make mmap some form of attach that forces > linear layout if the exporter isn't able to give userspace a linear > mapping from a tiled buffer by using GART or VM. I guess I don't object to mmap being a slow path (as long is it doesn't end up requiring a slow path on hw where this otherwise wouldn't be needed). BR, -R > Regards, > Lucas > -- > Pengutronix e.K. | Lucas Stach | > Industrial Linux Solutions | http://www.pengutronix.de/ | > Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:59 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:59 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 10:36 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > Am Dienstag, den 06.08.2013, 10:14 -0400 schrieb Rob Clark: >> On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: >> > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: >> >> Hi Rob, >> >> >> >> +lkml >> >> >> >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> >> > >> wrote: >> >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to >> >> > >> >> > also allocate buffers for the GPU. Still not sure how to >> >> > >> >> > resolve this as we don't use DRM for our GPU driver. >> >> > >> >> >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term >> >> > >> >> (esp. once the dma-fence stuff is in place), we'd have >> >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display >> >> > >> >> specific drm/kms driver, using prime/dmabuf to share between >> >> > >> >> the two. >> >> > >> > >> >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> >> > >> > being allocated through DRM/GEM so we could get an flink name >> >> > >> > for them and pass a reference to them back to our GPU driver on >> >> > >> > the client side. If it weren't for our need to access those >> >> > >> > extra off-screen buffers with the GPU we wouldn't need to >> >> > >> > allocate them with DRM at all. So, given they are really "GPU" >> >> > >> > buffers, it does absolutely make sense to allocate them in a >> >> > >> > different driver to the display driver. >> >> > >> > >> >> > >> > However, to avoid unnecessary memcpys & related cache >> >> > >> > maintenance ops, we'd also like the GPU to render into buffers >> >> > >> > which are scanned out by the display controller. So let's say >> >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> >> > >> > out buffers with the display's DRM driver but a custom ioctl >> >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen >> >> > >> > buffers. Sounds great, but I don't think that really works >> >> > >> > with DRI2. If we used two drivers to allocate buffers, which >> >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we >> >> > >> > solve that somehow, GEM flink names are name-spaced to a >> >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> >> > >> > how does the EGL in the client know which DRM device owns GEM >> >> > >> > flink name "1234"? We'd need some pretty dirty hacks. >> >> > >> >> >> > >> You would return the name of the display driver allocating the >> >> > >> buffers. On the client side you can use generic ioctls to go from >> >> > >> flink -> handle -> dmabuf. So the client side would end up opening >> >> > >> both the display drm device and the gpu, but without needing to know >> >> > >> too much about the display. >> >> > > >> >> > > I think the bit I was missing was that a GEM bo for a buffer imported >> >> > > using dma_buf/PRIME can still be flink'd. So the display controller's >> >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate >> >> > > ioctl. Those scan-out buffers than then be exported from the >> >> > > dispaly's DRM driver and imported into the GPU's DRM driver using >> >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a >> >> > > name for that buffer within the GPU DRM driver's name-space to return >> >> > > to the DRI2 client. That same namespace is also what DRI2 back- >> >> > > buffers are allocated from, so I think that could work... Except... >> >> > >> >> > (and.. the general direction is that things will move more to just use >> >> > dmabuf directly, ie. wayland or dri3) >> >> >> >> I agree, DRI2 is the only reason why we need a system-wide ID. I also >> >> prefer buffers to be passed around by dma_buf fd, but we still need to >> >> support DRI2 and will do for some time I expect. >> >> >> >> >> >> >> >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> >> > >> > DRM driver would need to know the constraints of the display >> >> > >> > controller when allocating buffers intended to be scanned out. >> >> > >> > For example, pl111 typically isn't behind an IOMMU and so >> >> > >> > requires physically contiguous memory. We'd have to teach the >> >> > >> > GPU's DRM driver about the constraints of the display HW. Not >> >> > >> > exactly a clean driver model. :-( >> >> > >> > >> >> > >> > I'm still a little stuck on how to proceed, so any ideas >> >> > >> > would greatly appreciated! My current train of thought is >> >> > >> > having a kind of SoC-specific DRM driver which allocates >> >> > >> > buffers for both display and GPU within a single GEM >> >> > >> > namespace. That SoC-specific DRM driver could then know the >> >> > >> > constraints of both the GPU and the display HW. We could then >> >> > >> > use PRIME to export buffers allocated with the SoC DRM driver >> >> > >> > and import them into the GPU and/or display DRM driver. >> >> > >> >> >> > >> Usually if the display drm driver is allocating the buffers that >> >> > >> might be scanned out, it just needs to have minimal knowledge of >> >> > >> the GPU (pitch alignment constraints). I don't think we need a >> >> > >> 3rd device just to allocate buffers. >> >> > > >> >> > > While Mali can render to pretty much any buffer, there is a mild >> >> > > performance improvement to be had if the buffer stride is aligned to >> >> > > the AXI bus's max burst length when drawing to the buffer. >> >> > >> >> > I suspect the display controllers might frequently benefit if the >> >> > pitch is aligned to AXI burst length too.. >> >> >> >> If the display controller is going to be reading from linear memory >> >> I don't think it will make much difference - you'll just get an extra >> >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, >> >> you get those extra transactions per _tile_ scan-line and as such, >> >> the overhead is more pronounced. >> >> >> >> >> >> >> >> > > So in some respects, there is a constraint on how buffers which will >> >> > > be drawn to using the GPU are allocated. I don't really like the idea >> >> > > of teaching the display controller DRM driver about the GPU buffer >> >> > > constraints, even if they are fairly trivial like this. If the same >> >> > > display HW IP is being used on several SoCs, it seems wrong somehow >> >> > > to enforce those GPU constraints if some of those SoCs don't have a >> >> > > GPU. >> >> > >> >> > Well, I suppose you could get min_pitch_alignment from devicetree, or >> >> > something like this.. >> >> > >> >> > In the end, the easy solution is just to make the display allocate to >> >> > the worst-case pitch alignment. In the early days of dma-buf >> >> > discussions, we kicked around the idea of negotiating or >> >> > programatically describing the constraints, but that didn't really >> >> > seem like a bounded problem. >> >> >> >> Yeah - I was around for some of those discussions and agree it's not >> >> really an easy problem to solve. >> >> >> >> >> >> >> >> > > We may also then have additional constraints when sharing buffers >> >> > > between the display HW and video decode or even camera ISP HW. >> >> > > Programmatically describing buffer allocation constraints is very >> >> > > difficult and I'm not sure you can actually do it - there's some >> >> > > pretty complex constraints out there! E.g. I believe there's a >> >> > > platform where Y and UV planes of the reference frame need to be in >> >> > > separate DRAM banks for real-time 1080p decode, or something like >> >> > > that? >> >> > >> >> > yes, this was discussed. This is different from pitch/format/size >> >> > constraints.. it is really just a placement constraint (ie. where do >> >> > the physical pages go). IIRC the conclusion was to use a dummy >> >> > devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> > >> >> > > Anyway, I guess my point is that even if we solve how to allocate >> >> > > buffers which will be shared between the GPU and display HW such that >> >> > > both sets of constraints are satisfied, that may not be the end of >> >> > > the story. >> >> > > >> >> > >> >> > that was part of the reason to punt this problem to userspace ;-) >> >> > >> >> > In practice, the kernel drivers doesn't usually know too much about >> >> > the dimensions/format/etc.. that is really userspace level knowledge. >> >> > There are a few exceptions when the kernel needs to know how to setup >> >> > GTT/etc for tiled buffers, but normally this sort of information is up >> >> > at the next level up (userspace, and drm_framebuffer in case of >> >> > scanout). Userspace media frameworks like GStreamer already have a >> >> > concept of format/caps negotiation. For non-display<->gpu sharing, I >> >> > think this is probably where this sort of constraint negotiation >> >> > should be handled. >> >> >> >> I agree that user-space will know which devices will access the buffer >> >> and thus can figure out at least a common pixel format. Though I'm not >> >> so sure userspace can figure out more low-level details like alignment >> >> and placement in physical memory, etc. >> >> >> >> Anyway, assuming user-space can figure out how a buffer should be >> >> stored in memory, how does it indicate this to a kernel driver and >> >> actually allocate it? Which ioctl on which device does user-space >> >> call, with what parameters? Are you suggesting using something like >> >> ION which exposes the low-level details of how buffers are laid out in >> >> physical memory to userspace? If not, what? >> >> >> > >> > I strongly disagree with exposing low-level hardware details like tiling >> > to userspace. If we have to do the negotiation of those things in >> > userspace we will end up with having to pipe those information through >> > things like the wayland protocol. I don't see how this could ever be >> > considered a good idea. >> >> well, unless userspace mmap's via a de-tiling gart type thing, I don't >> think tiling can be invisible to userspace. >> > Why is mmap considered to be such a strong use-case for DMABUFs? After > all we are trying to _avoid_ mmapping shared buffers where ever > possible. I don't know if I'd call it a strong use-case, as much as a good worst-case to consider >> But if two GPU's have some overlap in supportable tiled formats, and >> you have one gpu doing app and one doing compositor, and you want to >> use tiling for the shared buffers, you need something in the wayland >> protocol to figure out what the common supported formats are between >> the two sides. I suppose it shouldn't be too hard to add a >> standardized (cross-driver) format-negotiation protocol, and in the >> absence of that fallback to non tiled format for shared buffers. >> > I don't see how tiling format negotiation would be easier in userspace > than in the kernel. If we can come up with a scheme for that, we can as > well do it in the kernel. well.. I guess I don't see how it would be easier in the kernel than in userspace ;-) But if you can come up with a good way to add it to dev->dma_params that everyone can agree on, I don't mind. I suppose I'm starting with the assumption that it will be difficult/impossible to get everyone to agree on this, but I don't mind being proved wrong. >> > I would rather see kernel drivers negotiating those things at dmabuf >> > attach time in way invisible to userspace. I agree that this negotiation >> > thing isn't easy to get right for the plethora of different hardware >> > constraints we see today, but I would rather see this in-kernel, where >> > we have the chance to fix things up if needed, than in a fixed userspace >> > interface. >> >> Well, if you can think of a sane way to add that to dev->dma_params, >> and if it isn't visible if userspace mmap's a buffer, then we could >> handle that in the kernel. But I don't think that will be the case. >> > I'm sure we can come up with something sane to put it in there. The > userspace mmap thing is a bit complicated to deal with, but I have the > feeling that going through a slow path if you really need mmap is a > reasonable thing to do. > For example we could just make mmap some form of attach that forces > linear layout if the exporter isn't able to give userspace a linear > mapping from a tiled buffer by using GART or VM. I guess I don't object to mmap being a slow path (as long is it doesn't end up requiring a slow path on hw where this otherwise wouldn't be needed). BR, -R > Regards, > Lucas > -- > Pengutronix e.K. | Lucas Stach | > Industrial Linux Solutions | http://www.pengutronix.de/ | > Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:59 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:59 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 10:36 AM, Lucas Stach <l.stach@pengutronix.de> wrote: > Am Dienstag, den 06.08.2013, 10:14 -0400 schrieb Rob Clark: >> On Tue, Aug 6, 2013 at 8:18 AM, Lucas Stach <l.stach@pengutronix.de> wrote: >> > Am Dienstag, den 06.08.2013, 12:31 +0100 schrieb Tom Cooksey: >> >> Hi Rob, >> >> >> >> +lkml >> >> >> >> > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> >> >> > >> wrote: >> >> > >> >> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to >> >> > >> >> > also allocate buffers for the GPU. Still not sure how to >> >> > >> >> > resolve this as we don't use DRM for our GPU driver. >> >> > >> >> >> >> > >> >> any thoughts/plans about a DRM GPU driver? Ideally long term >> >> > >> >> (esp. once the dma-fence stuff is in place), we'd have >> >> > >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display >> >> > >> >> specific drm/kms driver, using prime/dmabuf to share between >> >> > >> >> the two. >> >> > >> > >> >> > >> > The "extra" buffers we were allocating from armsoc DDX were really >> >> > >> > being allocated through DRM/GEM so we could get an flink name >> >> > >> > for them and pass a reference to them back to our GPU driver on >> >> > >> > the client side. If it weren't for our need to access those >> >> > >> > extra off-screen buffers with the GPU we wouldn't need to >> >> > >> > allocate them with DRM at all. So, given they are really "GPU" >> >> > >> > buffers, it does absolutely make sense to allocate them in a >> >> > >> > different driver to the display driver. >> >> > >> > >> >> > >> > However, to avoid unnecessary memcpys & related cache >> >> > >> > maintenance ops, we'd also like the GPU to render into buffers >> >> > >> > which are scanned out by the display controller. So let's say >> >> > >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> >> > >> > out buffers with the display's DRM driver but a custom ioctl >> >> > >> > on the GPU's DRM driver to allocate non scanout, off-screen >> >> > >> > buffers. Sounds great, but I don't think that really works >> >> > >> > with DRI2. If we used two drivers to allocate buffers, which >> >> > >> > of those drivers do we return in DRI2ConnectReply? Even if we >> >> > >> > solve that somehow, GEM flink names are name-spaced to a >> >> > >> > single device node (AFAIK). So when we do a DRI2GetBuffers, >> >> > >> > how does the EGL in the client know which DRM device owns GEM >> >> > >> > flink name "1234"? We'd need some pretty dirty hacks. >> >> > >> >> >> > >> You would return the name of the display driver allocating the >> >> > >> buffers. On the client side you can use generic ioctls to go from >> >> > >> flink -> handle -> dmabuf. So the client side would end up opening >> >> > >> both the display drm device and the gpu, but without needing to know >> >> > >> too much about the display. >> >> > > >> >> > > I think the bit I was missing was that a GEM bo for a buffer imported >> >> > > using dma_buf/PRIME can still be flink'd. So the display controller's >> >> > > DRM driver allocates scan-out buffers via the DUMB buffer allocate >> >> > > ioctl. Those scan-out buffers than then be exported from the >> >> > > dispaly's DRM driver and imported into the GPU's DRM driver using >> >> > > PRIME. Once imported into the GPU's driver, we can use flink to get a >> >> > > name for that buffer within the GPU DRM driver's name-space to return >> >> > > to the DRI2 client. That same namespace is also what DRI2 back- >> >> > > buffers are allocated from, so I think that could work... Except... >> >> > >> >> > (and.. the general direction is that things will move more to just use >> >> > dmabuf directly, ie. wayland or dri3) >> >> >> >> I agree, DRI2 is the only reason why we need a system-wide ID. I also >> >> prefer buffers to be passed around by dma_buf fd, but we still need to >> >> support DRI2 and will do for some time I expect. >> >> >> >> >> >> >> >> > >> > Anyway, that latter case also gets quite difficult. The "GPU" >> >> > >> > DRM driver would need to know the constraints of the display >> >> > >> > controller when allocating buffers intended to be scanned out. >> >> > >> > For example, pl111 typically isn't behind an IOMMU and so >> >> > >> > requires physically contiguous memory. We'd have to teach the >> >> > >> > GPU's DRM driver about the constraints of the display HW. Not >> >> > >> > exactly a clean driver model. :-( >> >> > >> > >> >> > >> > I'm still a little stuck on how to proceed, so any ideas >> >> > >> > would greatly appreciated! My current train of thought is >> >> > >> > having a kind of SoC-specific DRM driver which allocates >> >> > >> > buffers for both display and GPU within a single GEM >> >> > >> > namespace. That SoC-specific DRM driver could then know the >> >> > >> > constraints of both the GPU and the display HW. We could then >> >> > >> > use PRIME to export buffers allocated with the SoC DRM driver >> >> > >> > and import them into the GPU and/or display DRM driver. >> >> > >> >> >> > >> Usually if the display drm driver is allocating the buffers that >> >> > >> might be scanned out, it just needs to have minimal knowledge of >> >> > >> the GPU (pitch alignment constraints). I don't think we need a >> >> > >> 3rd device just to allocate buffers. >> >> > > >> >> > > While Mali can render to pretty much any buffer, there is a mild >> >> > > performance improvement to be had if the buffer stride is aligned to >> >> > > the AXI bus's max burst length when drawing to the buffer. >> >> > >> >> > I suspect the display controllers might frequently benefit if the >> >> > pitch is aligned to AXI burst length too.. >> >> >> >> If the display controller is going to be reading from linear memory >> >> I don't think it will make much difference - you'll just get an extra >> >> 1-2 bus transactions per scanline. With a tile-based GPU like Mali, >> >> you get those extra transactions per _tile_ scan-line and as such, >> >> the overhead is more pronounced. >> >> >> >> >> >> >> >> > > So in some respects, there is a constraint on how buffers which will >> >> > > be drawn to using the GPU are allocated. I don't really like the idea >> >> > > of teaching the display controller DRM driver about the GPU buffer >> >> > > constraints, even if they are fairly trivial like this. If the same >> >> > > display HW IP is being used on several SoCs, it seems wrong somehow >> >> > > to enforce those GPU constraints if some of those SoCs don't have a >> >> > > GPU. >> >> > >> >> > Well, I suppose you could get min_pitch_alignment from devicetree, or >> >> > something like this.. >> >> > >> >> > In the end, the easy solution is just to make the display allocate to >> >> > the worst-case pitch alignment. In the early days of dma-buf >> >> > discussions, we kicked around the idea of negotiating or >> >> > programatically describing the constraints, but that didn't really >> >> > seem like a bounded problem. >> >> >> >> Yeah - I was around for some of those discussions and agree it's not >> >> really an easy problem to solve. >> >> >> >> >> >> >> >> > > We may also then have additional constraints when sharing buffers >> >> > > between the display HW and video decode or even camera ISP HW. >> >> > > Programmatically describing buffer allocation constraints is very >> >> > > difficult and I'm not sure you can actually do it - there's some >> >> > > pretty complex constraints out there! E.g. I believe there's a >> >> > > platform where Y and UV planes of the reference frame need to be in >> >> > > separate DRAM banks for real-time 1080p decode, or something like >> >> > > that? >> >> > >> >> > yes, this was discussed. This is different from pitch/format/size >> >> > constraints.. it is really just a placement constraint (ie. where do >> >> > the physical pages go). IIRC the conclusion was to use a dummy >> >> > devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> > >> >> > > Anyway, I guess my point is that even if we solve how to allocate >> >> > > buffers which will be shared between the GPU and display HW such that >> >> > > both sets of constraints are satisfied, that may not be the end of >> >> > > the story. >> >> > > >> >> > >> >> > that was part of the reason to punt this problem to userspace ;-) >> >> > >> >> > In practice, the kernel drivers doesn't usually know too much about >> >> > the dimensions/format/etc.. that is really userspace level knowledge. >> >> > There are a few exceptions when the kernel needs to know how to setup >> >> > GTT/etc for tiled buffers, but normally this sort of information is up >> >> > at the next level up (userspace, and drm_framebuffer in case of >> >> > scanout). Userspace media frameworks like GStreamer already have a >> >> > concept of format/caps negotiation. For non-display<->gpu sharing, I >> >> > think this is probably where this sort of constraint negotiation >> >> > should be handled. >> >> >> >> I agree that user-space will know which devices will access the buffer >> >> and thus can figure out at least a common pixel format. Though I'm not >> >> so sure userspace can figure out more low-level details like alignment >> >> and placement in physical memory, etc. >> >> >> >> Anyway, assuming user-space can figure out how a buffer should be >> >> stored in memory, how does it indicate this to a kernel driver and >> >> actually allocate it? Which ioctl on which device does user-space >> >> call, with what parameters? Are you suggesting using something like >> >> ION which exposes the low-level details of how buffers are laid out in >> >> physical memory to userspace? If not, what? >> >> >> > >> > I strongly disagree with exposing low-level hardware details like tiling >> > to userspace. If we have to do the negotiation of those things in >> > userspace we will end up with having to pipe those information through >> > things like the wayland protocol. I don't see how this could ever be >> > considered a good idea. >> >> well, unless userspace mmap's via a de-tiling gart type thing, I don't >> think tiling can be invisible to userspace. >> > Why is mmap considered to be such a strong use-case for DMABUFs? After > all we are trying to _avoid_ mmapping shared buffers where ever > possible. I don't know if I'd call it a strong use-case, as much as a good worst-case to consider >> But if two GPU's have some overlap in supportable tiled formats, and >> you have one gpu doing app and one doing compositor, and you want to >> use tiling for the shared buffers, you need something in the wayland >> protocol to figure out what the common supported formats are between >> the two sides. I suppose it shouldn't be too hard to add a >> standardized (cross-driver) format-negotiation protocol, and in the >> absence of that fallback to non tiled format for shared buffers. >> > I don't see how tiling format negotiation would be easier in userspace > than in the kernel. If we can come up with a scheme for that, we can as > well do it in the kernel. well.. I guess I don't see how it would be easier in the kernel than in userspace ;-) But if you can come up with a good way to add it to dev->dma_params that everyone can agree on, I don't mind. I suppose I'm starting with the assumption that it will be difficult/impossible to get everyone to agree on this, but I don't mind being proved wrong. >> > I would rather see kernel drivers negotiating those things at dmabuf >> > attach time in way invisible to userspace. I agree that this negotiation >> > thing isn't easy to get right for the plethora of different hardware >> > constraints we see today, but I would rather see this in-kernel, where >> > we have the chance to fix things up if needed, than in a fixed userspace >> > interface. >> >> Well, if you can think of a sane way to add that to dev->dma_params, >> and if it isn't visible if userspace mmap's a buffer, then we could >> handle that in the kernel. But I don't think that will be the case. >> > I'm sure we can come up with something sane to put it in there. The > userspace mmap thing is a bit complicated to deal with, but I have the > feeling that going through a slow path if you really need mmap is a > reasonable thing to do. > For example we could just make mmap some form of attach that forces > linear layout if the exporter isn't able to give userspace a linear > mapping from a tiled buffer by using GART or VM. I guess I don't object to mmap being a slow path (as long is it doesn't end up requiring a slow path on hw where this otherwise wouldn't be needed). BR, -R > Regards, > Lucas > -- > Pengutronix e.K. | Lucas Stach | > Industrial Linux Solutions | http://www.pengutronix.de/ | > Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 | > Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 | > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 12:18 ` Lucas Stach (?) (?) @ 2013-08-06 15:28 ` Daniel Vetter -1 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-06 15:28 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, Linux Fbdev development list, Pawel Moll, Linux Kernel Mailing List, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 2:18 PM, Lucas Stach <l.stach@pengutronix.de> wrote: > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. > > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. The problem with negotiating tiling in the kernel for at least drm/i915 is that userspace needs to know the tiling and for allocating actually be in charge of it. The approach used by prime thus far is to simply use linear buffers between different gpus. If you want something better I guess you need to have a fairly tightly integrate userspace driver, so I don't see how passing around the tiling information should be an issue for those cases. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 15:28 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-06 15:28 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 2:18 PM, Lucas Stach <l.stach@pengutronix.de> wrote: > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. > > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. The problem with negotiating tiling in the kernel for at least drm/i915 is that userspace needs to know the tiling and for allocating actually be in charge of it. The approach used by prime thus far is to simply use linear buffers between different gpus. If you want something better I guess you need to have a fairly tightly integrate userspace driver, so I don't see how passing around the tiling information should be an issue for those cases. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 15:28 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-06 15:28 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, Linux Fbdev development list, Pawel Moll, Linux Kernel Mailing List, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 2:18 PM, Lucas Stach <l.stach@pengutronix.de> wrote: > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. > > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. The problem with negotiating tiling in the kernel for at least drm/i915 is that userspace needs to know the tiling and for allocating actually be in charge of it. The approach used by prime thus far is to simply use linear buffers between different gpus. If you want something better I guess you need to have a fairly tightly integrate userspace driver, so I don't see how passing around the tiling information should be an issue for those cases. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 15:28 ` Daniel Vetter 0 siblings, 0 replies; 131+ messages in thread From: Daniel Vetter @ 2013-08-06 15:28 UTC (permalink / raw) To: Lucas Stach Cc: Tom Cooksey, Linux Fbdev development list, Pawel Moll, Linux Kernel Mailing List, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 2:18 PM, Lucas Stach <l.stach@pengutronix.de> wrote: > I strongly disagree with exposing low-level hardware details like tiling > to userspace. If we have to do the negotiation of those things in > userspace we will end up with having to pipe those information through > things like the wayland protocol. I don't see how this could ever be > considered a good idea. > > I would rather see kernel drivers negotiating those things at dmabuf > attach time in way invisible to userspace. I agree that this negotiation > thing isn't easy to get right for the plethora of different hardware > constraints we see today, but I would rather see this in-kernel, where > we have the chance to fix things up if needed, than in a fixed userspace > interface. The problem with negotiating tiling in the kernel for at least drm/i915 is that userspace needs to know the tiling and for allocating actually be in charge of it. The approach used by prime thus far is to simply use linear buffers between different gpus. If you want something better I guess you need to have a fairly tightly integrate userspace driver, so I don't see how passing around the tiling information should be an issue for those cases. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 11:31 ` Tom Cooksey 2013-08-06 12:18 ` Lucas Stach @ 2013-09-14 21:33 ` Daniel Stone 1 sibling, 0 replies; 131+ messages in thread From: Daniel Stone @ 2013-09-14 21:33 UTC (permalink / raw) To: Tom Cooksey Cc: 'Rob Clark', linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi, On Tue, 2013-08-06 at 12:31 +0100, Tom Cooksey wrote: > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > > >> wrote: > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. > > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. Egads. GStreamer's caps negotiation is already close to unbounded time; seems like most of the optimisation work that goes into it these days is all about _reducing_ the complexity of caps negotiation! > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Hm, are you sure about that? The answer is yes for 'userspace' as a broad handwave, but not necessarily for individual processes. Take, for instance, media decode through GStreamer, being displayed by Wayland using a KMS plane/overlay/sprite/cursor/etc. The media player knows that the buffers are coming from the decode engine, and Wayland knows that the buffers are going to a KMS plane, but neither of them knows the full combination of the two. Though this kinda feeds into an idea I've been kicking around for a while, which is an 'optimal hint' mechanism in the Wayland protocol. So for our hypothetical dmabuf-using protocol, we'd start off with buffers which satisfied all the constraints of our media decode engine, but perhaps just the GPU rather than display controller. At this point, we'd note that we could place the video in a plane if only the buffers were better-allocated, and send an event to the client letting it know how to tweak its buffer allocation for more optimal display. But ... > Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? ... this is still rather unresolved. ;) Cheers, Daniel > > Cheers, > > Tom > > > > > > > _______________________________________________ > Linaro-mm-sig mailing list > Linaro-mm-sig@lists.linaro.org > http://lists.linaro.org/mailman/listinfo/linaro-mm-sig ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-09-14 21:33 ` Daniel Stone 0 siblings, 0 replies; 131+ messages in thread From: Daniel Stone @ 2013-09-14 21:33 UTC (permalink / raw) To: linux-arm-kernel Hi, On Tue, 2013-08-06 at 12:31 +0100, Tom Cooksey wrote: > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > > >> wrote: > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. > > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. Egads. GStreamer's caps negotiation is already close to unbounded time; seems like most of the optimisation work that goes into it these days is all about _reducing_ the complexity of caps negotiation! > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Hm, are you sure about that? The answer is yes for 'userspace' as a broad handwave, but not necessarily for individual processes. Take, for instance, media decode through GStreamer, being displayed by Wayland using a KMS plane/overlay/sprite/cursor/etc. The media player knows that the buffers are coming from the decode engine, and Wayland knows that the buffers are going to a KMS plane, but neither of them knows the full combination of the two. Though this kinda feeds into an idea I've been kicking around for a while, which is an 'optimal hint' mechanism in the Wayland protocol. So for our hypothetical dmabuf-using protocol, we'd start off with buffers which satisfied all the constraints of our media decode engine, but perhaps just the GPU rather than display controller. At this point, we'd note that we could place the video in a plane if only the buffers were better-allocated, and send an event to the client letting it know how to tweak its buffer allocation for more optimal display. But ... > Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? ... this is still rather unresolved. ;) Cheers, Daniel > > Cheers, > > Tom > > > > > > > _______________________________________________ > Linaro-mm-sig mailing list > Linaro-mm-sig at lists.linaro.org > http://lists.linaro.org/mailman/listinfo/linaro-mm-sig ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-09-14 21:33 ` Daniel Stone 0 siblings, 0 replies; 131+ messages in thread From: Daniel Stone @ 2013-09-14 21:33 UTC (permalink / raw) To: Tom Cooksey Cc: 'Rob Clark', linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi, On Tue, 2013-08-06 at 12:31 +0100, Tom Cooksey wrote: > > >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> > > >> wrote: > > that was part of the reason to punt this problem to userspace ;-) > > > > In practice, the kernel drivers doesn't usually know too much about > > the dimensions/format/etc.. that is really userspace level knowledge. > > There are a few exceptions when the kernel needs to know how to setup > > GTT/etc for tiled buffers, but normally this sort of information is up > > at the next level up (userspace, and drm_framebuffer in case of > > scanout). Userspace media frameworks like GStreamer already have a > > concept of format/caps negotiation. For non-display<->gpu sharing, I > > think this is probably where this sort of constraint negotiation > > should be handled. Egads. GStreamer's caps negotiation is already close to unbounded time; seems like most of the optimisation work that goes into it these days is all about _reducing_ the complexity of caps negotiation! > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Hm, are you sure about that? The answer is yes for 'userspace' as a broad handwave, but not necessarily for individual processes. Take, for instance, media decode through GStreamer, being displayed by Wayland using a KMS plane/overlay/sprite/cursor/etc. The media player knows that the buffers are coming from the decode engine, and Wayland knows that the buffers are going to a KMS plane, but neither of them knows the full combination of the two. Though this kinda feeds into an idea I've been kicking around for a while, which is an 'optimal hint' mechanism in the Wayland protocol. So for our hypothetical dmabuf-using protocol, we'd start off with buffers which satisfied all the constraints of our media decode engine, but perhaps just the GPU rather than display controller. At this point, we'd note that we could place the video in a plane if only the buffers were better-allocated, and send an event to the client letting it know how to tweak its buffer allocation for more optimal display. But ... > Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. > > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? ... this is still rather unresolved. ;) Cheers, Daniel > > Cheers, > > Tom > > > > > > > _______________________________________________ > Linaro-mm-sig mailing list > Linaro-mm-sig@lists.linaro.org > http://lists.linaro.org/mailman/listinfo/linaro-mm-sig ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <5200deb3.0b24b40a.3b26.ffffbadeSMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <5200deb3.0b24b40a.3b26.ffffbadeSMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-06 12:15 ` Rob Clark @ 2013-08-06 12:15 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 12:15 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > So in some respects, there is a constraint on how buffers which will >> > be drawn to using the GPU are allocated. I don't really like the idea >> > of teaching the display controller DRM driver about the GPU buffer >> > constraints, even if they are fairly trivial like this. If the same >> > display HW IP is being used on several SoCs, it seems wrong somehow >> > to enforce those GPU constraints if some of those SoCs don't have a >> > GPU. >> >> Well, I suppose you could get min_pitch_alignment from devicetree, or >> something like this.. >> >> In the end, the easy solution is just to make the display allocate to >> the worst-case pitch alignment. In the early days of dma-buf >> discussions, we kicked around the idea of negotiating or >> programatically describing the constraints, but that didn't really >> seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > >> > We may also then have additional constraints when sharing buffers >> > between the display HW and video decode or even camera ISP HW. >> > Programmatically describing buffer allocation constraints is very >> > difficult and I'm not sure you can actually do it - there's some >> > pretty complex constraints out there! E.g. I believe there's a >> > platform where Y and UV planes of the reference frame need to be in >> > separate DRAM banks for real-time 1080p decode, or something like >> > that? >> >> yes, this was discussed. This is different from pitch/format/size >> constraints.. it is really just a placement constraint (ie. where do >> the physical pages go). IIRC the conclusion was to use a dummy >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> > Anyway, I guess my point is that even if we solve how to allocate >> > buffers which will be shared between the GPU and display HW such that >> > both sets of constraints are satisfied, that may not be the end of >> > the story. >> > >> >> that was part of the reason to punt this problem to userspace ;-) >> >> In practice, the kernel drivers doesn't usually know too much about >> the dimensions/format/etc.. that is really userspace level knowledge. >> There are a few exceptions when the kernel needs to know how to setup >> GTT/etc for tiled buffers, but normally this sort of information is up >> at the next level up (userspace, and drm_framebuffer in case of >> scanout). Userspace media frameworks like GStreamer already have a >> concept of format/caps negotiation. For non-display<->gpu sharing, I >> think this is probably where this sort of constraint negotiation >> should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. well, let's divide things up into two categories: 1) the arrangement and format of pixels.. ie. what userspace would need to know if it mmap's a buffer. This includes pixel format, stride, etc. This should be negotiated in userspace, it would be crazy to try to do this in the kernel. 2) the physical placement of the pages. Ie. whether it is contiguous or not. Which bank the pages in the buffer are placed in, etc. This is not visible to userspace. This is the purpose of the attach step, so you know all the devices involved in sharing up front before allocating the backing pages. (Or in the worst case, if you have a "late attacher" you at least know when no device is doing dma access to a buffer and can reallocate and move the buffer.) A long time back, I had a patch that added a field or two to 'struct device_dma_parameters' so that it could be known if a device required contiguous buffers.. looks like that never got merged, so I'd need to dig that back up and resend it. But the idea was to have the 'struct device' encapsulate all the information that would be needed to do-the-right-thing when it comes to placement. > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? no, userspace should not need to know this. And having a central driver that knows this for all the other drivers in the system doesn't really solve anything and isn't really scalable. At best you might want, in some cases, a flag you can pass when allocating. For example, some of the drivers have a 'SCANOUT' flag that can be passed when allocating a GEM buffer, as a hint to the kernel that 'if this hw requires contig memory for scanout, allocate this buffer contig'. But really, when it comes to sharing buffers between devices, we want this sort of information in dev->dma_params of the importing device(s). BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 12:15 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 12:15 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > So in some respects, there is a constraint on how buffers which will >> > be drawn to using the GPU are allocated. I don't really like the idea >> > of teaching the display controller DRM driver about the GPU buffer >> > constraints, even if they are fairly trivial like this. If the same >> > display HW IP is being used on several SoCs, it seems wrong somehow >> > to enforce those GPU constraints if some of those SoCs don't have a >> > GPU. >> >> Well, I suppose you could get min_pitch_alignment from devicetree, or >> something like this.. >> >> In the end, the easy solution is just to make the display allocate to >> the worst-case pitch alignment. In the early days of dma-buf >> discussions, we kicked around the idea of negotiating or >> programatically describing the constraints, but that didn't really >> seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > >> > We may also then have additional constraints when sharing buffers >> > between the display HW and video decode or even camera ISP HW. >> > Programmatically describing buffer allocation constraints is very >> > difficult and I'm not sure you can actually do it - there's some >> > pretty complex constraints out there! E.g. I believe there's a >> > platform where Y and UV planes of the reference frame need to be in >> > separate DRAM banks for real-time 1080p decode, or something like >> > that? >> >> yes, this was discussed. This is different from pitch/format/size >> constraints.. it is really just a placement constraint (ie. where do >> the physical pages go). IIRC the conclusion was to use a dummy >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> > Anyway, I guess my point is that even if we solve how to allocate >> > buffers which will be shared between the GPU and display HW such that >> > both sets of constraints are satisfied, that may not be the end of >> > the story. >> > >> >> that was part of the reason to punt this problem to userspace ;-) >> >> In practice, the kernel drivers doesn't usually know too much about >> the dimensions/format/etc.. that is really userspace level knowledge. >> There are a few exceptions when the kernel needs to know how to setup >> GTT/etc for tiled buffers, but normally this sort of information is up >> at the next level up (userspace, and drm_framebuffer in case of >> scanout). Userspace media frameworks like GStreamer already have a >> concept of format/caps negotiation. For non-display<->gpu sharing, I >> think this is probably where this sort of constraint negotiation >> should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. well, let's divide things up into two categories: 1) the arrangement and format of pixels.. ie. what userspace would need to know if it mmap's a buffer. This includes pixel format, stride, etc. This should be negotiated in userspace, it would be crazy to try to do this in the kernel. 2) the physical placement of the pages. Ie. whether it is contiguous or not. Which bank the pages in the buffer are placed in, etc. This is not visible to userspace. This is the purpose of the attach step, so you know all the devices involved in sharing up front before allocating the backing pages. (Or in the worst case, if you have a "late attacher" you at least know when no device is doing dma access to a buffer and can reallocate and move the buffer.) A long time back, I had a patch that added a field or two to 'struct device_dma_parameters' so that it could be known if a device required contiguous buffers.. looks like that never got merged, so I'd need to dig that back up and resend it. But the idea was to have the 'struct device' encapsulate all the information that would be needed to do-the-right-thing when it comes to placement. > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? no, userspace should not need to know this. And having a central driver that knows this for all the other drivers in the system doesn't really solve anything and isn't really scalable. At best you might want, in some cases, a flag you can pass when allocating. For example, some of the drivers have a 'SCANOUT' flag that can be passed when allocating a GEM buffer, as a hint to the kernel that 'if this hw requires contig memory for scanout, allocate this buffer contig'. But really, when it comes to sharing buffers between devices, we want this sort of information in dev->dma_params of the importing device(s). BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 12:15 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 12:15 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > So in some respects, there is a constraint on how buffers which will >> > be drawn to using the GPU are allocated. I don't really like the idea >> > of teaching the display controller DRM driver about the GPU buffer >> > constraints, even if they are fairly trivial like this. If the same >> > display HW IP is being used on several SoCs, it seems wrong somehow >> > to enforce those GPU constraints if some of those SoCs don't have a >> > GPU. >> >> Well, I suppose you could get min_pitch_alignment from devicetree, or >> something like this.. >> >> In the end, the easy solution is just to make the display allocate to >> the worst-case pitch alignment. In the early days of dma-buf >> discussions, we kicked around the idea of negotiating or >> programatically describing the constraints, but that didn't really >> seem like a bounded problem. > > Yeah - I was around for some of those discussions and agree it's not > really an easy problem to solve. > > > >> > We may also then have additional constraints when sharing buffers >> > between the display HW and video decode or even camera ISP HW. >> > Programmatically describing buffer allocation constraints is very >> > difficult and I'm not sure you can actually do it - there's some >> > pretty complex constraints out there! E.g. I believe there's a >> > platform where Y and UV planes of the reference frame need to be in >> > separate DRAM banks for real-time 1080p decode, or something like >> > that? >> >> yes, this was discussed. This is different from pitch/format/size >> constraints.. it is really just a placement constraint (ie. where do >> the physical pages go). IIRC the conclusion was to use a dummy >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> > Anyway, I guess my point is that even if we solve how to allocate >> > buffers which will be shared between the GPU and display HW such that >> > both sets of constraints are satisfied, that may not be the end of >> > the story. >> > >> >> that was part of the reason to punt this problem to userspace ;-) >> >> In practice, the kernel drivers doesn't usually know too much about >> the dimensions/format/etc.. that is really userspace level knowledge. >> There are a few exceptions when the kernel needs to know how to setup >> GTT/etc for tiled buffers, but normally this sort of information is up >> at the next level up (userspace, and drm_framebuffer in case of >> scanout). Userspace media frameworks like GStreamer already have a >> concept of format/caps negotiation. For non-display<->gpu sharing, I >> think this is probably where this sort of constraint negotiation >> should be handled. > > I agree that user-space will know which devices will access the buffer > and thus can figure out at least a common pixel format. Though I'm not > so sure userspace can figure out more low-level details like alignment > and placement in physical memory, etc. well, let's divide things up into two categories: 1) the arrangement and format of pixels.. ie. what userspace would need to know if it mmap's a buffer. This includes pixel format, stride, etc. This should be negotiated in userspace, it would be crazy to try to do this in the kernel. 2) the physical placement of the pages. Ie. whether it is contiguous or not. Which bank the pages in the buffer are placed in, etc. This is not visible to userspace. This is the purpose of the attach step, so you know all the devices involved in sharing up front before allocating the backing pages. (Or in the worst case, if you have a "late attacher" you at least know when no device is doing dma access to a buffer and can reallocate and move the buffer.) A long time back, I had a patch that added a field or two to 'struct device_dma_parameters' so that it could be known if a device required contiguous buffers.. looks like that never got merged, so I'd need to dig that back up and resend it. But the idea was to have the 'struct device' encapsulate all the information that would be needed to do-the-right-thing when it comes to placement. > Anyway, assuming user-space can figure out how a buffer should be > stored in memory, how does it indicate this to a kernel driver and > actually allocate it? Which ioctl on which device does user-space > call, with what parameters? Are you suggesting using something like > ION which exposes the low-level details of how buffers are laid out in > physical memory to userspace? If not, what? no, userspace should not need to know this. And having a central driver that knows this for all the other drivers in the system doesn't really solve anything and isn't really scalable. At best you might want, in some cases, a flag you can pass when allocating. For example, some of the drivers have a 'SCANOUT' flag that can be passed when allocating a GEM buffer, as a hint to the kernel that 'if this hw requires contig memory for scanout, allocate this buffer contig'. But really, when it comes to sharing buffers between devices, we want this sort of information in dev->dma_params of the importing device(s). BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 12:15 ` Rob Clark (?) (?) @ 2013-08-06 14:03 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 14:03 UTC (permalink / raw) To: linux-arm-kernel Hi Rob, > >> > We may also then have additional constraints when sharing buffers > >> > between the display HW and video decode or even camera ISP HW. > >> > Programmatically describing buffer allocation constraints is very > >> > difficult and I'm not sure you can actually do it - there's some > >> > pretty complex constraints out there! E.g. I believe there's a > >> > platform where Y and UV planes of the reference frame need to be > >> > in separate DRAM banks for real-time 1080p decode, or something > >> > like that? > >> > >> yes, this was discussed. This is different from pitch/format/size > >> constraints.. it is really just a placement constraint (ie. where > >> do the physical pages go). IIRC the conclusion was to use a dummy > >> devices with it's own CMA pool for attaching the Y vs UV buffers. > >> > >> > Anyway, I guess my point is that even if we solve how to allocate > >> > buffers which will be shared between the GPU and display HW such > >> > that both sets of constraints are satisfied, that may not be the > >> > end of the story. > >> > > >> > >> that was part of the reason to punt this problem to userspace ;-) > >> > >> In practice, the kernel drivers doesn't usually know too much about > >> the dimensions/format/etc.. that is really userspace level > >> knowledge. There are a few exceptions when the kernel needs to know > >> how to setup GTT/etc for tiled buffers, but normally this sort of > >> information is up at the next level up (userspace, and > >> drm_framebuffer in case of scanout). Userspace media frameworks > >> like GStreamer already have a concept of format/caps negotiation. > >> For non-display<->gpu sharing, I think this is probably where this > >> sort of constraint negotiation should be handled. > > > > I agree that user-space will know which devices will access the > > buffer and thus can figure out at least a common pixel format. > > Though I'm not so sure userspace can figure out more low-level > > details like alignment and placement in physical memory, etc. > > > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. Absolutely. Pixel format has to be negotiated by user-space as in most cases, user-space can map the buffer and thus will need to know how to interpret the data. > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. Seems sensible to me. > ... This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time > back, I had a patch that added a field or two to 'struct > device_dma_parameters' so that it could be known if a device required > contiguous buffers.. looks like that never got merged, so I'd need to > dig that back up and resend it. But the idea was to have the 'struct > device' encapsulate all the information that would be needed to > do-the-right-thing when it comes to placement. As I understand it, it's up to the exporting device to allocate the memory backing the dma_buf buffer. I guess the latest possible point you can allocate the backing pages is when map_dma_buf is first called? At that point the exporter can iterate over the current set of attachments, programmatically determine the all the constraints of all the attached drivers and attempt to allocate the backing pages in such a way as to satisfy all those constraints? Didn't you say that programmatically describing device placement constraints was an unbounded problem? I guess we would have to accept that it's not possible to describe all possible constraints and instead find a way to describe the common ones? One problem with this is it duplicates a lot of logic in each driver which can export a dma_buf buffer. Each exporter will need to do pretty much the same thing: iterate over all the attachments, determine of all the constraints (assuming that can be done) and allocate pages such that the lowest-common-denominator is satisfied. Perhaps rather than duplicating that logic in every driver, we could Instead move allocation of the backing pages into dma_buf itself? > > Anyway, assuming user-space can figure out how a buffer should be > > stored in memory, how does it indicate this to a kernel driver and > > actually allocate it? Which ioctl on which device does user-space > > call, with what parameters? Are you suggesting using something like > > ION which exposes the low-level details of how buffers are laid out > in > > physical memory to userspace? If not, what? > > no, userspace should not need to know this. And having a central > driver that knows this for all the other drivers in the system doesn't > really solve anything and isn't really scalable. At best you might > want, in some cases, a flag you can pass when allocating. For > example, some of the drivers have a 'SCANOUT' flag that can be passed > when allocating a GEM buffer, as a hint to the kernel that 'if this hw > requires contig memory for scanout, allocate this buffer contig'. But > really, when it comes to sharing buffers between devices, we want this > sort of information in dev->dma_params of the importing device(s). If you had a single driver which knew the constraints of all devices on that particular SoC and the interface allowed user-space to specify which devices a buffer is intended to be used with, I guess it could pretty trivially allocate pages which satisfy those constraints? It wouldn't need a way to programmatically describe the constraints either: As you say, if userspace sets the "SCANOUT" flag, it would just "know" that on this SoC, that buffer needs to be physically contiguous for example. Though It would effectively mean you'd need an "allocation" driver per SoC, which as you say may not be scalable? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 12:15 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-06 14:03 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 14:03 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi Rob, > >> > We may also then have additional constraints when sharing buffers > >> > between the display HW and video decode or even camera ISP HW. > >> > Programmatically describing buffer allocation constraints is very > >> > difficult and I'm not sure you can actually do it - there's some > >> > pretty complex constraints out there! E.g. I believe there's a > >> > platform where Y and UV planes of the reference frame need to be > >> > in separate DRAM banks for real-time 1080p decode, or something > >> > like that? > >> > >> yes, this was discussed. This is different from pitch/format/size > >> constraints.. it is really just a placement constraint (ie. where > >> do the physical pages go). IIRC the conclusion was to use a dummy > >> devices with it's own CMA pool for attaching the Y vs UV buffers. > >> > >> > Anyway, I guess my point is that even if we solve how to allocate > >> > buffers which will be shared between the GPU and display HW such > >> > that both sets of constraints are satisfied, that may not be the > >> > end of the story. > >> > > >> > >> that was part of the reason to punt this problem to userspace ;-) > >> > >> In practice, the kernel drivers doesn't usually know too much about > >> the dimensions/format/etc.. that is really userspace level > >> knowledge. There are a few exceptions when the kernel needs to know > >> how to setup GTT/etc for tiled buffers, but normally this sort of > >> information is up at the next level up (userspace, and > >> drm_framebuffer in case of scanout). Userspace media frameworks > >> like GStreamer already have a concept of format/caps negotiation. > >> For non-display<->gpu sharing, I think this is probably where this > >> sort of constraint negotiation should be handled. > > > > I agree that user-space will know which devices will access the > > buffer and thus can figure out at least a common pixel format. > > Though I'm not so sure userspace can figure out more low-level > > details like alignment and placement in physical memory, etc. > > > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. Absolutely. Pixel format has to be negotiated by user-space as in most cases, user-space can map the buffer and thus will need to know how to interpret the data. > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. Seems sensible to me. > ... This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time > back, I had a patch that added a field or two to 'struct > device_dma_parameters' so that it could be known if a device required > contiguous buffers.. looks like that never got merged, so I'd need to > dig that back up and resend it. But the idea was to have the 'struct > device' encapsulate all the information that would be needed to > do-the-right-thing when it comes to placement. As I understand it, it's up to the exporting device to allocate the memory backing the dma_buf buffer. I guess the latest possible point you can allocate the backing pages is when map_dma_buf is first called? At that point the exporter can iterate over the current set of attachments, programmatically determine the all the constraints of all the attached drivers and attempt to allocate the backing pages in such a way as to satisfy all those constraints? Didn't you say that programmatically describing device placement constraints was an unbounded problem? I guess we would have to accept that it's not possible to describe all possible constraints and instead find a way to describe the common ones? One problem with this is it duplicates a lot of logic in each driver which can export a dma_buf buffer. Each exporter will need to do pretty much the same thing: iterate over all the attachments, determine of all the constraints (assuming that can be done) and allocate pages such that the lowest-common-denominator is satisfied. Perhaps rather than duplicating that logic in every driver, we could Instead move allocation of the backing pages into dma_buf itself? > > Anyway, assuming user-space can figure out how a buffer should be > > stored in memory, how does it indicate this to a kernel driver and > > actually allocate it? Which ioctl on which device does user-space > > call, with what parameters? Are you suggesting using something like > > ION which exposes the low-level details of how buffers are laid out > in > > physical memory to userspace? If not, what? > > no, userspace should not need to know this. And having a central > driver that knows this for all the other drivers in the system doesn't > really solve anything and isn't really scalable. At best you might > want, in some cases, a flag you can pass when allocating. For > example, some of the drivers have a 'SCANOUT' flag that can be passed > when allocating a GEM buffer, as a hint to the kernel that 'if this hw > requires contig memory for scanout, allocate this buffer contig'. But > really, when it comes to sharing buffers between devices, we want this > sort of information in dev->dma_params of the importing device(s). If you had a single driver which knew the constraints of all devices on that particular SoC and the interface allowed user-space to specify which devices a buffer is intended to be used with, I guess it could pretty trivially allocate pages which satisfy those constraints? It wouldn't need a way to programmatically describe the constraints either: As you say, if userspace sets the "SCANOUT" flag, it would just "know" that on this SoC, that buffer needs to be physically contiguous for example. Though It would effectively mean you'd need an "allocation" driver per SoC, which as you say may not be scalable? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 12:15 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-06 14:03 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 14:03 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media Hi Rob, > >> > We may also then have additional constraints when sharing buffers > >> > between the display HW and video decode or even camera ISP HW. > >> > Programmatically describing buffer allocation constraints is very > >> > difficult and I'm not sure you can actually do it - there's some > >> > pretty complex constraints out there! E.g. I believe there's a > >> > platform where Y and UV planes of the reference frame need to be > >> > in separate DRAM banks for real-time 1080p decode, or something > >> > like that? > >> > >> yes, this was discussed. This is different from pitch/format/size > >> constraints.. it is really just a placement constraint (ie. where > >> do the physical pages go). IIRC the conclusion was to use a dummy > >> devices with it's own CMA pool for attaching the Y vs UV buffers. > >> > >> > Anyway, I guess my point is that even if we solve how to allocate > >> > buffers which will be shared between the GPU and display HW such > >> > that both sets of constraints are satisfied, that may not be the > >> > end of the story. > >> > > >> > >> that was part of the reason to punt this problem to userspace ;-) > >> > >> In practice, the kernel drivers doesn't usually know too much about > >> the dimensions/format/etc.. that is really userspace level > >> knowledge. There are a few exceptions when the kernel needs to know > >> how to setup GTT/etc for tiled buffers, but normally this sort of > >> information is up at the next level up (userspace, and > >> drm_framebuffer in case of scanout). Userspace media frameworks > >> like GStreamer already have a concept of format/caps negotiation. > >> For non-display<->gpu sharing, I think this is probably where this > >> sort of constraint negotiation should be handled. > > > > I agree that user-space will know which devices will access the > > buffer and thus can figure out at least a common pixel format. > > Though I'm not so sure userspace can figure out more low-level > > details like alignment and placement in physical memory, etc. > > > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. Absolutely. Pixel format has to be negotiated by user-space as in most cases, user-space can map the buffer and thus will need to know how to interpret the data. > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. Seems sensible to me. > ... This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time > back, I had a patch that added a field or two to 'struct > device_dma_parameters' so that it could be known if a device required > contiguous buffers.. looks like that never got merged, so I'd need to > dig that back up and resend it. But the idea was to have the 'struct > device' encapsulate all the information that would be needed to > do-the-right-thing when it comes to placement. As I understand it, it's up to the exporting device to allocate the memory backing the dma_buf buffer. I guess the latest possible point you can allocate the backing pages is when map_dma_buf is first called? At that point the exporter can iterate over the current set of attachments, programmatically determine the all the constraints of all the attached drivers and attempt to allocate the backing pages in such a way as to satisfy all those constraints? Didn't you say that programmatically describing device placement constraints was an unbounded problem? I guess we would have to accept that it's not possible to describe all possible constraints and instead find a way to describe the common ones? One problem with this is it duplicates a lot of logic in each driver which can export a dma_buf buffer. Each exporter will need to do pretty much the same thing: iterate over all the attachments, determine of all the constraints (assuming that can be done) and allocate pages such that the lowest-common-denominator is satisfied. Perhaps rather than duplicating that logic in every driver, we could Instead move allocation of the backing pages into dma_buf itself? > > Anyway, assuming user-space can figure out how a buffer should be > > stored in memory, how does it indicate this to a kernel driver and > > actually allocate it? Which ioctl on which device does user-space > > call, with what parameters? Are you suggesting using something like > > ION which exposes the low-level details of how buffers are laid out > in > > physical memory to userspace? If not, what? > > no, userspace should not need to know this. And having a central > driver that knows this for all the other drivers in the system doesn't > really solve anything and isn't really scalable. At best you might > want, in some cases, a flag you can pass when allocating. For > example, some of the drivers have a 'SCANOUT' flag that can be passed > when allocating a GEM buffer, as a hint to the kernel that 'if this hw > requires contig memory for scanout, allocate this buffer contig'. But > really, when it comes to sharing buffers between devices, we want this > sort of information in dev->dma_params of the importing device(s). If you had a single driver which knew the constraints of all devices on that particular SoC and the interface allowed user-space to specify which devices a buffer is intended to be used with, I guess it could pretty trivially allocate pages which satisfy those constraints? It wouldn't need a way to programmatically describe the constraints either: As you say, if userspace sets the "SCANOUT" flag, it would just "know" that on this SoC, that buffer needs to be physically contiguous for example. Though It would effectively mean you'd need an "allocation" driver per SoC, which as you say may not be scalable? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <52010257.245fc20a.6ff8.1cfdSMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <52010257.245fc20a.6ff8.1cfdSMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-06 14:40 ` Rob Clark @ 2013-08-06 14:40 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:40 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Tue, Aug 6, 2013 at 10:03 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > >> >> > We may also then have additional constraints when sharing buffers >> >> > between the display HW and video decode or even camera ISP HW. >> >> > Programmatically describing buffer allocation constraints is very >> >> > difficult and I'm not sure you can actually do it - there's some >> >> > pretty complex constraints out there! E.g. I believe there's a >> >> > platform where Y and UV planes of the reference frame need to be >> >> > in separate DRAM banks for real-time 1080p decode, or something >> >> > like that? >> >> >> >> yes, this was discussed. This is different from pitch/format/size >> >> constraints.. it is really just a placement constraint (ie. where >> >> do the physical pages go). IIRC the conclusion was to use a dummy >> >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> >> >> > Anyway, I guess my point is that even if we solve how to allocate >> >> > buffers which will be shared between the GPU and display HW such >> >> > that both sets of constraints are satisfied, that may not be the >> >> > end of the story. >> >> > >> >> >> >> that was part of the reason to punt this problem to userspace ;-) >> >> >> >> In practice, the kernel drivers doesn't usually know too much about >> >> the dimensions/format/etc.. that is really userspace level >> >> knowledge. There are a few exceptions when the kernel needs to know >> >> how to setup GTT/etc for tiled buffers, but normally this sort of >> >> information is up at the next level up (userspace, and >> >> drm_framebuffer in case of scanout). Userspace media frameworks >> >> like GStreamer already have a concept of format/caps negotiation. >> >> For non-display<->gpu sharing, I think this is probably where this >> >> sort of constraint negotiation should be handled. >> > >> > I agree that user-space will know which devices will access the >> > buffer and thus can figure out at least a common pixel format. >> > Though I'm not so sure userspace can figure out more low-level >> > details like alignment and placement in physical memory, etc. >> > >> >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. > > Absolutely. Pixel format has to be negotiated by user-space as in > most cases, user-space can map the buffer and thus will need to > know how to interpret the data. > > > >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. > > Seems sensible to me. > > >> ... This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time >> back, I had a patch that added a field or two to 'struct >> device_dma_parameters' so that it could be known if a device required >> contiguous buffers.. looks like that never got merged, so I'd need to >> dig that back up and resend it. But the idea was to have the 'struct >> device' encapsulate all the information that would be needed to >> do-the-right-thing when it comes to placement. > > As I understand it, it's up to the exporting device to allocate the > memory backing the dma_buf buffer. I guess the latest possible point > you can allocate the backing pages is when map_dma_buf is first > called? At that point the exporter can iterate over the current set > of attachments, programmatically determine the all the constraints of > all the attached drivers and attempt to allocate the backing pages > in such a way as to satisfy all those constraints? yes, this is the idea.. possibly some room for some helpers to help out with this, but that is all under the hood from userspace perspective > Didn't you say that programmatically describing device placement > constraints was an unbounded problem? I guess we would have to > accept that it's not possible to describe all possible constraints > and instead find a way to describe the common ones? well, the point I'm trying to make, is by dividing your constraints into two groups, one that impacts and is handled by userspace, and one that is in the kernel (ie. where the pages go), you cut down the number of permutations that the kernel has to care about considerably. And kernel already cares about, for example, what range of addresses that a device can dma to/from. I think really the only thing missing is the max # of sglist entries (contiguous or not) > One problem with this is it duplicates a lot of logic in each > driver which can export a dma_buf buffer. Each exporter will need to > do pretty much the same thing: iterate over all the attachments, > determine of all the constraints (assuming that can be done) and > allocate pages such that the lowest-common-denominator is satisfied. > > Perhaps rather than duplicating that logic in every driver, we could > Instead move allocation of the backing pages into dma_buf itself? > I tend to think it is better to add helpers as we see common patterns emerge, which drivers can opt-in to using. I don't think that we should move allocation into dma_buf itself, but it would perhaps be useful to have dma_alloc_*() variants that could allocate for multiple devices. That would help for simple stuff, although I'd suspect eventually a GPU driver will move away from that. (Since you probably want to play tricks w/ pools of pages that are pre-zero'd and in the correct cache state, use spare cycles on the gpu or dma engine to pre-zero uncached pages, and games like that.) > >> > Anyway, assuming user-space can figure out how a buffer should be >> > stored in memory, how does it indicate this to a kernel driver and >> > actually allocate it? Which ioctl on which device does user-space >> > call, with what parameters? Are you suggesting using something like >> > ION which exposes the low-level details of how buffers are laid out >> in >> > physical memory to userspace? If not, what? >> >> no, userspace should not need to know this. And having a central >> driver that knows this for all the other drivers in the system doesn't >> really solve anything and isn't really scalable. At best you might >> want, in some cases, a flag you can pass when allocating. For >> example, some of the drivers have a 'SCANOUT' flag that can be passed >> when allocating a GEM buffer, as a hint to the kernel that 'if this hw >> requires contig memory for scanout, allocate this buffer contig'. But >> really, when it comes to sharing buffers between devices, we want this >> sort of information in dev->dma_params of the importing device(s). > > If you had a single driver which knew the constraints of all devices > on that particular SoC and the interface allowed user-space to specify > which devices a buffer is intended to be used with, I guess it could > pretty trivially allocate pages which satisfy those constraints? It keep in mind, even a number of SoC's come with pcie these days. You already have things like https://developer.nvidia.com/content/kayla-platform You probably want to get out of the SoC mindset, otherwise you are going to make bad assumptions that come back to bite you later on. > wouldn't need a way to programmatically describe the constraints > either: As you say, if userspace sets the "SCANOUT" flag, it would > just "know" that on this SoC, that buffer needs to be physically > contiguous for example. not really.. it just knows it wants to scanout the buffer, and tells this as a hint to the kernel. For example, on omapdrm, the SCANOUT flag does nothing on omap4+ (where phys contig is not required for scanout), but causes CMA (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just knows that it wants to be able to scanout that particular buffer. > Though It would effectively mean you'd need an "allocation" driver per > SoC, which as you say may not be scalable? Right.. and not actually even possible in the general sense (see SoC + external pcie gfx card) BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:40 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:40 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 10:03 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > >> >> > We may also then have additional constraints when sharing buffers >> >> > between the display HW and video decode or even camera ISP HW. >> >> > Programmatically describing buffer allocation constraints is very >> >> > difficult and I'm not sure you can actually do it - there's some >> >> > pretty complex constraints out there! E.g. I believe there's a >> >> > platform where Y and UV planes of the reference frame need to be >> >> > in separate DRAM banks for real-time 1080p decode, or something >> >> > like that? >> >> >> >> yes, this was discussed. This is different from pitch/format/size >> >> constraints.. it is really just a placement constraint (ie. where >> >> do the physical pages go). IIRC the conclusion was to use a dummy >> >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> >> >> > Anyway, I guess my point is that even if we solve how to allocate >> >> > buffers which will be shared between the GPU and display HW such >> >> > that both sets of constraints are satisfied, that may not be the >> >> > end of the story. >> >> > >> >> >> >> that was part of the reason to punt this problem to userspace ;-) >> >> >> >> In practice, the kernel drivers doesn't usually know too much about >> >> the dimensions/format/etc.. that is really userspace level >> >> knowledge. There are a few exceptions when the kernel needs to know >> >> how to setup GTT/etc for tiled buffers, but normally this sort of >> >> information is up at the next level up (userspace, and >> >> drm_framebuffer in case of scanout). Userspace media frameworks >> >> like GStreamer already have a concept of format/caps negotiation. >> >> For non-display<->gpu sharing, I think this is probably where this >> >> sort of constraint negotiation should be handled. >> > >> > I agree that user-space will know which devices will access the >> > buffer and thus can figure out at least a common pixel format. >> > Though I'm not so sure userspace can figure out more low-level >> > details like alignment and placement in physical memory, etc. >> > >> >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. > > Absolutely. Pixel format has to be negotiated by user-space as in > most cases, user-space can map the buffer and thus will need to > know how to interpret the data. > > > >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. > > Seems sensible to me. > > >> ... This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time >> back, I had a patch that added a field or two to 'struct >> device_dma_parameters' so that it could be known if a device required >> contiguous buffers.. looks like that never got merged, so I'd need to >> dig that back up and resend it. But the idea was to have the 'struct >> device' encapsulate all the information that would be needed to >> do-the-right-thing when it comes to placement. > > As I understand it, it's up to the exporting device to allocate the > memory backing the dma_buf buffer. I guess the latest possible point > you can allocate the backing pages is when map_dma_buf is first > called? At that point the exporter can iterate over the current set > of attachments, programmatically determine the all the constraints of > all the attached drivers and attempt to allocate the backing pages > in such a way as to satisfy all those constraints? yes, this is the idea.. possibly some room for some helpers to help out with this, but that is all under the hood from userspace perspective > Didn't you say that programmatically describing device placement > constraints was an unbounded problem? I guess we would have to > accept that it's not possible to describe all possible constraints > and instead find a way to describe the common ones? well, the point I'm trying to make, is by dividing your constraints into two groups, one that impacts and is handled by userspace, and one that is in the kernel (ie. where the pages go), you cut down the number of permutations that the kernel has to care about considerably. And kernel already cares about, for example, what range of addresses that a device can dma to/from. I think really the only thing missing is the max # of sglist entries (contiguous or not) > One problem with this is it duplicates a lot of logic in each > driver which can export a dma_buf buffer. Each exporter will need to > do pretty much the same thing: iterate over all the attachments, > determine of all the constraints (assuming that can be done) and > allocate pages such that the lowest-common-denominator is satisfied. > > Perhaps rather than duplicating that logic in every driver, we could > Instead move allocation of the backing pages into dma_buf itself? > I tend to think it is better to add helpers as we see common patterns emerge, which drivers can opt-in to using. I don't think that we should move allocation into dma_buf itself, but it would perhaps be useful to have dma_alloc_*() variants that could allocate for multiple devices. That would help for simple stuff, although I'd suspect eventually a GPU driver will move away from that. (Since you probably want to play tricks w/ pools of pages that are pre-zero'd and in the correct cache state, use spare cycles on the gpu or dma engine to pre-zero uncached pages, and games like that.) > >> > Anyway, assuming user-space can figure out how a buffer should be >> > stored in memory, how does it indicate this to a kernel driver and >> > actually allocate it? Which ioctl on which device does user-space >> > call, with what parameters? Are you suggesting using something like >> > ION which exposes the low-level details of how buffers are laid out >> in >> > physical memory to userspace? If not, what? >> >> no, userspace should not need to know this. And having a central >> driver that knows this for all the other drivers in the system doesn't >> really solve anything and isn't really scalable. At best you might >> want, in some cases, a flag you can pass when allocating. For >> example, some of the drivers have a 'SCANOUT' flag that can be passed >> when allocating a GEM buffer, as a hint to the kernel that 'if this hw >> requires contig memory for scanout, allocate this buffer contig'. But >> really, when it comes to sharing buffers between devices, we want this >> sort of information in dev->dma_params of the importing device(s). > > If you had a single driver which knew the constraints of all devices > on that particular SoC and the interface allowed user-space to specify > which devices a buffer is intended to be used with, I guess it could > pretty trivially allocate pages which satisfy those constraints? It keep in mind, even a number of SoC's come with pcie these days. You already have things like https://developer.nvidia.com/content/kayla-platform You probably want to get out of the SoC mindset, otherwise you are going to make bad assumptions that come back to bite you later on. > wouldn't need a way to programmatically describe the constraints > either: As you say, if userspace sets the "SCANOUT" flag, it would > just "know" that on this SoC, that buffer needs to be physically > contiguous for example. not really.. it just knows it wants to scanout the buffer, and tells this as a hint to the kernel. For example, on omapdrm, the SCANOUT flag does nothing on omap4+ (where phys contig is not required for scanout), but causes CMA (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just knows that it wants to be able to scanout that particular buffer. > Though It would effectively mean you'd need an "allocation" driver per > SoC, which as you say may not be scalable? Right.. and not actually even possible in the general sense (see SoC + external pcie gfx card) BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 14:40 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 14:40 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Tue, Aug 6, 2013 at 10:03 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: > Hi Rob, > >> >> > We may also then have additional constraints when sharing buffers >> >> > between the display HW and video decode or even camera ISP HW. >> >> > Programmatically describing buffer allocation constraints is very >> >> > difficult and I'm not sure you can actually do it - there's some >> >> > pretty complex constraints out there! E.g. I believe there's a >> >> > platform where Y and UV planes of the reference frame need to be >> >> > in separate DRAM banks for real-time 1080p decode, or something >> >> > like that? >> >> >> >> yes, this was discussed. This is different from pitch/format/size >> >> constraints.. it is really just a placement constraint (ie. where >> >> do the physical pages go). IIRC the conclusion was to use a dummy >> >> devices with it's own CMA pool for attaching the Y vs UV buffers. >> >> >> >> > Anyway, I guess my point is that even if we solve how to allocate >> >> > buffers which will be shared between the GPU and display HW such >> >> > that both sets of constraints are satisfied, that may not be the >> >> > end of the story. >> >> > >> >> >> >> that was part of the reason to punt this problem to userspace ;-) >> >> >> >> In practice, the kernel drivers doesn't usually know too much about >> >> the dimensions/format/etc.. that is really userspace level >> >> knowledge. There are a few exceptions when the kernel needs to know >> >> how to setup GTT/etc for tiled buffers, but normally this sort of >> >> information is up at the next level up (userspace, and >> >> drm_framebuffer in case of scanout). Userspace media frameworks >> >> like GStreamer already have a concept of format/caps negotiation. >> >> For non-display<->gpu sharing, I think this is probably where this >> >> sort of constraint negotiation should be handled. >> > >> > I agree that user-space will know which devices will access the >> > buffer and thus can figure out at least a common pixel format. >> > Though I'm not so sure userspace can figure out more low-level >> > details like alignment and placement in physical memory, etc. >> > >> >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. > > Absolutely. Pixel format has to be negotiated by user-space as in > most cases, user-space can map the buffer and thus will need to > know how to interpret the data. > > > >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. > > Seems sensible to me. > > >> ... This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time >> back, I had a patch that added a field or two to 'struct >> device_dma_parameters' so that it could be known if a device required >> contiguous buffers.. looks like that never got merged, so I'd need to >> dig that back up and resend it. But the idea was to have the 'struct >> device' encapsulate all the information that would be needed to >> do-the-right-thing when it comes to placement. > > As I understand it, it's up to the exporting device to allocate the > memory backing the dma_buf buffer. I guess the latest possible point > you can allocate the backing pages is when map_dma_buf is first > called? At that point the exporter can iterate over the current set > of attachments, programmatically determine the all the constraints of > all the attached drivers and attempt to allocate the backing pages > in such a way as to satisfy all those constraints? yes, this is the idea.. possibly some room for some helpers to help out with this, but that is all under the hood from userspace perspective > Didn't you say that programmatically describing device placement > constraints was an unbounded problem? I guess we would have to > accept that it's not possible to describe all possible constraints > and instead find a way to describe the common ones? well, the point I'm trying to make, is by dividing your constraints into two groups, one that impacts and is handled by userspace, and one that is in the kernel (ie. where the pages go), you cut down the number of permutations that the kernel has to care about considerably. And kernel already cares about, for example, what range of addresses that a device can dma to/from. I think really the only thing missing is the max # of sglist entries (contiguous or not) > One problem with this is it duplicates a lot of logic in each > driver which can export a dma_buf buffer. Each exporter will need to > do pretty much the same thing: iterate over all the attachments, > determine of all the constraints (assuming that can be done) and > allocate pages such that the lowest-common-denominator is satisfied. > > Perhaps rather than duplicating that logic in every driver, we could > Instead move allocation of the backing pages into dma_buf itself? > I tend to think it is better to add helpers as we see common patterns emerge, which drivers can opt-in to using. I don't think that we should move allocation into dma_buf itself, but it would perhaps be useful to have dma_alloc_*() variants that could allocate for multiple devices. That would help for simple stuff, although I'd suspect eventually a GPU driver will move away from that. (Since you probably want to play tricks w/ pools of pages that are pre-zero'd and in the correct cache state, use spare cycles on the gpu or dma engine to pre-zero uncached pages, and games like that.) > >> > Anyway, assuming user-space can figure out how a buffer should be >> > stored in memory, how does it indicate this to a kernel driver and >> > actually allocate it? Which ioctl on which device does user-space >> > call, with what parameters? Are you suggesting using something like >> > ION which exposes the low-level details of how buffers are laid out >> in >> > physical memory to userspace? If not, what? >> >> no, userspace should not need to know this. And having a central >> driver that knows this for all the other drivers in the system doesn't >> really solve anything and isn't really scalable. At best you might >> want, in some cases, a flag you can pass when allocating. For >> example, some of the drivers have a 'SCANOUT' flag that can be passed >> when allocating a GEM buffer, as a hint to the kernel that 'if this hw >> requires contig memory for scanout, allocate this buffer contig'. But >> really, when it comes to sharing buffers between devices, we want this >> sort of information in dev->dma_params of the importing device(s). > > If you had a single driver which knew the constraints of all devices > on that particular SoC and the interface allowed user-space to specify > which devices a buffer is intended to be used with, I guess it could > pretty trivially allocate pages which satisfy those constraints? It keep in mind, even a number of SoC's come with pcie these days. You already have things like https://developer.nvidia.com/content/kayla-platform You probably want to get out of the SoC mindset, otherwise you are going to make bad assumptions that come back to bite you later on. > wouldn't need a way to programmatically describe the constraints > either: As you say, if userspace sets the "SCANOUT" flag, it would > just "know" that on this SoC, that buffer needs to be physically > contiguous for example. not really.. it just knows it wants to scanout the buffer, and tells this as a hint to the kernel. For example, on omapdrm, the SCANOUT flag does nothing on omap4+ (where phys contig is not required for scanout), but causes CMA (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just knows that it wants to be able to scanout that particular buffer. > Though It would effectively mean you'd need an "allocation" driver per > SoC, which as you say may not be scalable? Right.. and not actually even possible in the general sense (see SoC + external pcie gfx card) BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 14:40 ` Rob Clark (?) (?) @ 2013-08-06 17:38 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 17:38 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > >> ... This is the purpose of the attach step, > >> so you know all the devices involved in sharing up front before > >> allocating the backing pages. (Or in the worst case, if you have a > >> "late attacher" you at least know when no device is doing dma access > >> to a buffer and can reallocate and move the buffer.) A long time > >> back, I had a patch that added a field or two to 'struct > >> device_dma_parameters' so that it could be known if a device > >> required contiguous buffers.. looks like that never got merged, so > >> I'd need to dig that back up and resend it. But the idea was to > >> have the 'struct device' encapsulate all the information that would > >> be needed to do-the-right-thing when it comes to placement. > > > > As I understand it, it's up to the exporting device to allocate the > > memory backing the dma_buf buffer. I guess the latest possible point > > you can allocate the backing pages is when map_dma_buf is first > > called? At that point the exporter can iterate over the current set > > of attachments, programmatically determine the all the constraints of > > all the attached drivers and attempt to allocate the backing pages > > in such a way as to satisfy all those constraints? > > yes, this is the idea.. possibly some room for some helpers to help > out with this, but that is all under the hood from userspace > perspective > > > Didn't you say that programmatically describing device placement > > constraints was an unbounded problem? I guess we would have to > > accept that it's not possible to describe all possible constraints > > and instead find a way to describe the common ones? > > well, the point I'm trying to make, is by dividing your constraints > into two groups, one that impacts and is handled by userspace, and one > that is in the kernel (ie. where the pages go), you cut down the > number of permutations that the kernel has to care about considerably. > And kernel already cares about, for example, what range of addresses > that a device can dma to/from. I think really the only thing missing > is the max # of sglist entries (contiguous or not) I think it's more than physically contiguous or not. For example, it can be more efficient to use large page sizes on devices with IOMMUs to reduce TLB traffic. I think the size and even the availability of large pages varies between different IOMMUs. There's also the issue of buffer stride alignment. As I say, if the buffer is to be written by a tile-based GPU like Mali, it's more efficient if the buffer's stride is aligned to the max AXI bus burst length. Though I guess a buffer stride only makes sense as a concept when interpreting the data as a linear-layout 2D image, so perhaps belongs in user-space along with format negotiation? > > One problem with this is it duplicates a lot of logic in each > > driver which can export a dma_buf buffer. Each exporter will need to > > do pretty much the same thing: iterate over all the attachments, > > determine of all the constraints (assuming that can be done) and > > allocate pages such that the lowest-common-denominator is satisfied. > > > > Perhaps rather than duplicating that logic in every driver, we could > > Instead move allocation of the backing pages into dma_buf itself? > > > > I tend to think it is better to add helpers as we see common patterns > emerge, which drivers can opt-in to using. I don't think that we > should move allocation into dma_buf itself, but it would perhaps be > useful to have dma_alloc_*() variants that could allocate for multiple > devices. A helper could work I guess, though I quite like the idea of having dma_alloc_*() variants which take a list of devices to allocate memory for. > That would help for simple stuff, although I'd suspect > eventually a GPU driver will move away from that. (Since you probably > want to play tricks w/ pools of pages that are pre-zero'd and in the > correct cache state, use spare cycles on the gpu or dma engine to > pre-zero uncached pages, and games like that.) So presumably you're talking about a GPU driver being the exporter here? If so, how could the GPU driver do these kind of tricks on memory shared with another device? > >> > Anyway, assuming user-space can figure out how a buffer should be > >> > stored in memory, how does it indicate this to a kernel driver and > >> > actually allocate it? Which ioctl on which device does user-space > >> > call, with what parameters? Are you suggesting using something > >> > like ION which exposes the low-level details of how buffers are > > >> laid out in physical memory to userspace? If not, what? > >> > > >> > >> no, userspace should not need to know this. And having a central > >> driver that knows this for all the other drivers in the system > >> doesn't really solve anything and isn't really scalable. At best > >> you might want, in some cases, a flag you can pass when allocating. > >> For example, some of the drivers have a 'SCANOUT' flag that can be > >> passed when allocating a GEM buffer, as a hint to the kernel that > >> 'if this hw requires contig memory for scanout, allocate this > >> buffer contig'. But really, when it comes to sharing buffers > >> between devices, we want this sort of information in dev->dma_params > >> of the importing device(s). > > > > If you had a single driver which knew the constraints of all devices > > on that particular SoC and the interface allowed user-space to > > specify which devices a buffer is intended to be used with, I guess > > it could pretty trivially allocate pages which satisfy those constraints? > > keep in mind, even a number of SoC's come with pcie these days. You > already have things like > > https://developer.nvidia.com/content/kayla-platform > > You probably want to get out of the SoC mindset, otherwise you are > going to make bad assumptions that come back to bite you later on. Sure - there are always going to be PC-like devices where the hardware configuration isn't fixed like it is on a traditional SoC. But I'd rather have a simple solution which works on traditional SoCs than no solution at all. Today our solution is to over-load the dumb buffer alloc functions of the display's DRM driver - For now I'm just looking for the next step up from that! ;-) > > wouldn't need a way to programmatically describe the constraints > > either: As you say, if userspace sets the "SCANOUT" flag, it would > > just "know" that on this SoC, that buffer needs to be physically > > contiguous for example. > > not really.. it just knows it wants to scanout the buffer, and tells > this as a hint to the kernel. > > For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > (where phys contig is not required for scanout), but causes CMA > (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just > knows that it wants to be able to scanout that particular buffer. I think that's the idea? The omap3's allocator driver would use contiguous memory when it detects the SCANOUT flag whereas the omap4 allocator driver wouldn't have to. No complex negotiation of constraints - it just "knows". Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 14:40 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-06 17:38 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 17:38 UTC (permalink / raw) To: linux-arm-kernel > >> ... This is the purpose of the attach step, > >> so you know all the devices involved in sharing up front before > >> allocating the backing pages. (Or in the worst case, if you have a > >> "late attacher" you at least know when no device is doing dma access > >> to a buffer and can reallocate and move the buffer.) A long time > >> back, I had a patch that added a field or two to 'struct > >> device_dma_parameters' so that it could be known if a device > >> required contiguous buffers.. looks like that never got merged, so > >> I'd need to dig that back up and resend it. But the idea was to > >> have the 'struct device' encapsulate all the information that would > >> be needed to do-the-right-thing when it comes to placement. > > > > As I understand it, it's up to the exporting device to allocate the > > memory backing the dma_buf buffer. I guess the latest possible point > > you can allocate the backing pages is when map_dma_buf is first > > called? At that point the exporter can iterate over the current set > > of attachments, programmatically determine the all the constraints of > > all the attached drivers and attempt to allocate the backing pages > > in such a way as to satisfy all those constraints? > > yes, this is the idea.. possibly some room for some helpers to help > out with this, but that is all under the hood from userspace > perspective > > > Didn't you say that programmatically describing device placement > > constraints was an unbounded problem? I guess we would have to > > accept that it's not possible to describe all possible constraints > > and instead find a way to describe the common ones? > > well, the point I'm trying to make, is by dividing your constraints > into two groups, one that impacts and is handled by userspace, and one > that is in the kernel (ie. where the pages go), you cut down the > number of permutations that the kernel has to care about considerably. > And kernel already cares about, for example, what range of addresses > that a device can dma to/from. I think really the only thing missing > is the max # of sglist entries (contiguous or not) I think it's more than physically contiguous or not. For example, it can be more efficient to use large page sizes on devices with IOMMUs to reduce TLB traffic. I think the size and even the availability of large pages varies between different IOMMUs. There's also the issue of buffer stride alignment. As I say, if the buffer is to be written by a tile-based GPU like Mali, it's more efficient if the buffer's stride is aligned to the max AXI bus burst length. Though I guess a buffer stride only makes sense as a concept when interpreting the data as a linear-layout 2D image, so perhaps belongs in user-space along with format negotiation? > > One problem with this is it duplicates a lot of logic in each > > driver which can export a dma_buf buffer. Each exporter will need to > > do pretty much the same thing: iterate over all the attachments, > > determine of all the constraints (assuming that can be done) and > > allocate pages such that the lowest-common-denominator is satisfied. > > > > Perhaps rather than duplicating that logic in every driver, we could > > Instead move allocation of the backing pages into dma_buf itself? > > > > I tend to think it is better to add helpers as we see common patterns > emerge, which drivers can opt-in to using. I don't think that we > should move allocation into dma_buf itself, but it would perhaps be > useful to have dma_alloc_*() variants that could allocate for multiple > devices. A helper could work I guess, though I quite like the idea of having dma_alloc_*() variants which take a list of devices to allocate memory for. > That would help for simple stuff, although I'd suspect > eventually a GPU driver will move away from that. (Since you probably > want to play tricks w/ pools of pages that are pre-zero'd and in the > correct cache state, use spare cycles on the gpu or dma engine to > pre-zero uncached pages, and games like that.) So presumably you're talking about a GPU driver being the exporter here? If so, how could the GPU driver do these kind of tricks on memory shared with another device? > >> > Anyway, assuming user-space can figure out how a buffer should be > >> > stored in memory, how does it indicate this to a kernel driver and > >> > actually allocate it? Which ioctl on which device does user-space > >> > call, with what parameters? Are you suggesting using something > >> > like ION which exposes the low-level details of how buffers are > > >> laid out in physical memory to userspace? If not, what? > >> > > >> > >> no, userspace should not need to know this. And having a central > >> driver that knows this for all the other drivers in the system > >> doesn't really solve anything and isn't really scalable. At best > >> you might want, in some cases, a flag you can pass when allocating. > >> For example, some of the drivers have a 'SCANOUT' flag that can be > >> passed when allocating a GEM buffer, as a hint to the kernel that > >> 'if this hw requires contig memory for scanout, allocate this > >> buffer contig'. But really, when it comes to sharing buffers > >> between devices, we want this sort of information in dev->dma_params > >> of the importing device(s). > > > > If you had a single driver which knew the constraints of all devices > > on that particular SoC and the interface allowed user-space to > > specify which devices a buffer is intended to be used with, I guess > > it could pretty trivially allocate pages which satisfy those constraints? > > keep in mind, even a number of SoC's come with pcie these days. You > already have things like > > https://developer.nvidia.com/content/kayla-platform > > You probably want to get out of the SoC mindset, otherwise you are > going to make bad assumptions that come back to bite you later on. Sure - there are always going to be PC-like devices where the hardware configuration isn't fixed like it is on a traditional SoC. But I'd rather have a simple solution which works on traditional SoCs than no solution at all. Today our solution is to over-load the dumb buffer alloc functions of the display's DRM driver - For now I'm just looking for the next step up from that! ;-) > > wouldn't need a way to programmatically describe the constraints > > either: As you say, if userspace sets the "SCANOUT" flag, it would > > just "know" that on this SoC, that buffer needs to be physically > > contiguous for example. > > not really.. it just knows it wants to scanout the buffer, and tells > this as a hint to the kernel. > > For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > (where phys contig is not required for scanout), but causes CMA > (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just > knows that it wants to be able to scanout that particular buffer. I think that's the idea? The omap3's allocator driver would use contiguous memory when it detects the SCANOUT flag whereas the omap4 allocator driver wouldn't have to. No complex negotiation of constraints - it just "knows". Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 14:40 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-06 17:38 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-06 17:38 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > >> ... This is the purpose of the attach step, > >> so you know all the devices involved in sharing up front before > >> allocating the backing pages. (Or in the worst case, if you have a > >> "late attacher" you at least know when no device is doing dma access > >> to a buffer and can reallocate and move the buffer.) A long time > >> back, I had a patch that added a field or two to 'struct > >> device_dma_parameters' so that it could be known if a device > >> required contiguous buffers.. looks like that never got merged, so > >> I'd need to dig that back up and resend it. But the idea was to > >> have the 'struct device' encapsulate all the information that would > >> be needed to do-the-right-thing when it comes to placement. > > > > As I understand it, it's up to the exporting device to allocate the > > memory backing the dma_buf buffer. I guess the latest possible point > > you can allocate the backing pages is when map_dma_buf is first > > called? At that point the exporter can iterate over the current set > > of attachments, programmatically determine the all the constraints of > > all the attached drivers and attempt to allocate the backing pages > > in such a way as to satisfy all those constraints? > > yes, this is the idea.. possibly some room for some helpers to help > out with this, but that is all under the hood from userspace > perspective > > > Didn't you say that programmatically describing device placement > > constraints was an unbounded problem? I guess we would have to > > accept that it's not possible to describe all possible constraints > > and instead find a way to describe the common ones? > > well, the point I'm trying to make, is by dividing your constraints > into two groups, one that impacts and is handled by userspace, and one > that is in the kernel (ie. where the pages go), you cut down the > number of permutations that the kernel has to care about considerably. > And kernel already cares about, for example, what range of addresses > that a device can dma to/from. I think really the only thing missing > is the max # of sglist entries (contiguous or not) I think it's more than physically contiguous or not. For example, it can be more efficient to use large page sizes on devices with IOMMUs to reduce TLB traffic. I think the size and even the availability of large pages varies between different IOMMUs. There's also the issue of buffer stride alignment. As I say, if the buffer is to be written by a tile-based GPU like Mali, it's more efficient if the buffer's stride is aligned to the max AXI bus burst length. Though I guess a buffer stride only makes sense as a concept when interpreting the data as a linear-layout 2D image, so perhaps belongs in user-space along with format negotiation? > > One problem with this is it duplicates a lot of logic in each > > driver which can export a dma_buf buffer. Each exporter will need to > > do pretty much the same thing: iterate over all the attachments, > > determine of all the constraints (assuming that can be done) and > > allocate pages such that the lowest-common-denominator is satisfied. > > > > Perhaps rather than duplicating that logic in every driver, we could > > Instead move allocation of the backing pages into dma_buf itself? > > > > I tend to think it is better to add helpers as we see common patterns > emerge, which drivers can opt-in to using. I don't think that we > should move allocation into dma_buf itself, but it would perhaps be > useful to have dma_alloc_*() variants that could allocate for multiple > devices. A helper could work I guess, though I quite like the idea of having dma_alloc_*() variants which take a list of devices to allocate memory for. > That would help for simple stuff, although I'd suspect > eventually a GPU driver will move away from that. (Since you probably > want to play tricks w/ pools of pages that are pre-zero'd and in the > correct cache state, use spare cycles on the gpu or dma engine to > pre-zero uncached pages, and games like that.) So presumably you're talking about a GPU driver being the exporter here? If so, how could the GPU driver do these kind of tricks on memory shared with another device? > >> > Anyway, assuming user-space can figure out how a buffer should be > >> > stored in memory, how does it indicate this to a kernel driver and > >> > actually allocate it? Which ioctl on which device does user-space > >> > call, with what parameters? Are you suggesting using something > >> > like ION which exposes the low-level details of how buffers are > > >> laid out in physical memory to userspace? If not, what? > >> > > >> > >> no, userspace should not need to know this. And having a central > >> driver that knows this for all the other drivers in the system > >> doesn't really solve anything and isn't really scalable. At best > >> you might want, in some cases, a flag you can pass when allocating. > >> For example, some of the drivers have a 'SCANOUT' flag that can be > >> passed when allocating a GEM buffer, as a hint to the kernel that > >> 'if this hw requires contig memory for scanout, allocate this > >> buffer contig'. But really, when it comes to sharing buffers > >> between devices, we want this sort of information in dev->dma_params > >> of the importing device(s). > > > > If you had a single driver which knew the constraints of all devices > > on that particular SoC and the interface allowed user-space to > > specify which devices a buffer is intended to be used with, I guess > > it could pretty trivially allocate pages which satisfy those constraints? > > keep in mind, even a number of SoC's come with pcie these days. You > already have things like > > https://developer.nvidia.com/content/kayla-platform > > You probably want to get out of the SoC mindset, otherwise you are > going to make bad assumptions that come back to bite you later on. Sure - there are always going to be PC-like devices where the hardware configuration isn't fixed like it is on a traditional SoC. But I'd rather have a simple solution which works on traditional SoCs than no solution at all. Today our solution is to over-load the dumb buffer alloc functions of the display's DRM driver - For now I'm just looking for the next step up from that! ;-) > > wouldn't need a way to programmatically describe the constraints > > either: As you say, if userspace sets the "SCANOUT" flag, it would > > just "know" that on this SoC, that buffer needs to be physically > > contiguous for example. > > not really.. it just knows it wants to scanout the buffer, and tells > this as a hint to the kernel. > > For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > (where phys contig is not required for scanout), but causes CMA > (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just > knows that it wants to be able to scanout that particular buffer. I think that's the idea? The omap3's allocator driver would use contiguous memory when it detects the SCANOUT flag whereas the omap4 allocator driver wouldn't have to. No complex negotiation of constraints - it just "knows". Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <52013482.e107c20a.27f9.ffffa718SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <52013482.e107c20a.27f9.ffffa718SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-06 19:06 ` Rob Clark @ 2013-08-06 19:06 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 19:06 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Tue, Aug 6, 2013 at 1:38 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> ... This is the purpose of the attach step, >> >> so you know all the devices involved in sharing up front before >> >> allocating the backing pages. (Or in the worst case, if you have a >> >> "late attacher" you at least know when no device is doing dma access >> >> to a buffer and can reallocate and move the buffer.) A long time >> >> back, I had a patch that added a field or two to 'struct >> >> device_dma_parameters' so that it could be known if a device >> >> required contiguous buffers.. looks like that never got merged, so >> >> I'd need to dig that back up and resend it. But the idea was to >> >> have the 'struct device' encapsulate all the information that would >> >> be needed to do-the-right-thing when it comes to placement. >> > >> > As I understand it, it's up to the exporting device to allocate the >> > memory backing the dma_buf buffer. I guess the latest possible point >> > you can allocate the backing pages is when map_dma_buf is first >> > called? At that point the exporter can iterate over the current set >> > of attachments, programmatically determine the all the constraints of >> > all the attached drivers and attempt to allocate the backing pages >> > in such a way as to satisfy all those constraints? >> >> yes, this is the idea.. possibly some room for some helpers to help >> out with this, but that is all under the hood from userspace >> perspective >> >> > Didn't you say that programmatically describing device placement >> > constraints was an unbounded problem? I guess we would have to >> > accept that it's not possible to describe all possible constraints >> > and instead find a way to describe the common ones? >> >> well, the point I'm trying to make, is by dividing your constraints >> into two groups, one that impacts and is handled by userspace, and one >> that is in the kernel (ie. where the pages go), you cut down the >> number of permutations that the kernel has to care about considerably. >> And kernel already cares about, for example, what range of addresses >> that a device can dma to/from. I think really the only thing missing >> is the max # of sglist entries (contiguous or not) > > I think it's more than physically contiguous or not. > > For example, it can be more efficient to use large page sizes on > devices with IOMMUs to reduce TLB traffic. I think the size and even > the availability of large pages varies between different IOMMUs. sure.. but I suppose if we can spiff out dma_params to express "I need contiguous", perhaps we can add some way to express "I prefer as-contiguous-as-possible".. either way, this is about where the pages are placed, and not about the layout of pixels within the page, so should be in kernel. It's something that is missing, but I believe that it belongs in dma_params and hidden behind dma_alloc_*() for simple drivers. > There's also the issue of buffer stride alignment. As I say, if the > buffer is to be written by a tile-based GPU like Mali, it's more > efficient if the buffer's stride is aligned to the max AXI bus burst > length. Though I guess a buffer stride only makes sense as a concept > when interpreting the data as a linear-layout 2D image, so perhaps > belongs in user-space along with format negotiation? > Yeah.. this isn't about where the pages go, but about the arrangement within a page. And, well, except for hw that supports the same tiling (or compressed-fb) in display+gpu, you probably aren't sharing tiled buffers. > >> > One problem with this is it duplicates a lot of logic in each >> > driver which can export a dma_buf buffer. Each exporter will need to >> > do pretty much the same thing: iterate over all the attachments, >> > determine of all the constraints (assuming that can be done) and >> > allocate pages such that the lowest-common-denominator is satisfied. >> > >> > Perhaps rather than duplicating that logic in every driver, we could >> > Instead move allocation of the backing pages into dma_buf itself? >> > >> >> I tend to think it is better to add helpers as we see common patterns >> emerge, which drivers can opt-in to using. I don't think that we >> should move allocation into dma_buf itself, but it would perhaps be >> useful to have dma_alloc_*() variants that could allocate for multiple >> devices. > > A helper could work I guess, though I quite like the idea of having > dma_alloc_*() variants which take a list of devices to allocate memory > for. > > >> That would help for simple stuff, although I'd suspect >> eventually a GPU driver will move away from that. (Since you probably >> want to play tricks w/ pools of pages that are pre-zero'd and in the >> correct cache state, use spare cycles on the gpu or dma engine to >> pre-zero uncached pages, and games like that.) > > So presumably you're talking about a GPU driver being the exporter > here? If so, how could the GPU driver do these kind of tricks on > memory shared with another device? > Yes, that is gpu-as-exporter. If someone else is allocating buffers, it is up to them to do these tricks or not. Probably there is a pretty good chance that if you aren't a GPU you don't need those sort of tricks for fast allocation of transient upload buffers, staging textures, temporary pixmaps, etc. Ie. I don't really think a v4l camera or video decoder would benefit from that sort of optimization. > >> >> > Anyway, assuming user-space can figure out how a buffer should be >> >> > stored in memory, how does it indicate this to a kernel driver and >> >> > actually allocate it? Which ioctl on which device does user-space >> >> > call, with what parameters? Are you suggesting using something >> >> > like ION which exposes the low-level details of how buffers are >> > >> laid out in physical memory to userspace? If not, what? >> >> > >> >> >> >> no, userspace should not need to know this. And having a central >> >> driver that knows this for all the other drivers in the system >> >> doesn't really solve anything and isn't really scalable. At best >> >> you might want, in some cases, a flag you can pass when allocating. >> >> For example, some of the drivers have a 'SCANOUT' flag that can be >> >> passed when allocating a GEM buffer, as a hint to the kernel that >> >> 'if this hw requires contig memory for scanout, allocate this >> >> buffer contig'. But really, when it comes to sharing buffers >> >> between devices, we want this sort of information in dev->dma_params >> >> of the importing device(s). >> > >> > If you had a single driver which knew the constraints of all devices >> > on that particular SoC and the interface allowed user-space to >> > specify which devices a buffer is intended to be used with, I guess >> > it could pretty trivially allocate pages which satisfy those > constraints? >> >> keep in mind, even a number of SoC's come with pcie these days. You >> already have things like >> >> https://developer.nvidia.com/content/kayla-platform >> >> You probably want to get out of the SoC mindset, otherwise you are >> going to make bad assumptions that come back to bite you later on. > > Sure - there are always going to be PC-like devices where the > hardware configuration isn't fixed like it is on a traditional SoC. > But I'd rather have a simple solution which works on traditional SoCs > than no solution at all. Today our solution is to over-load the dumb > buffer alloc functions of the display's DRM driver - For now I'm just > looking for the next step up from that! ;-) > True.. the original intention, which is perhaps a bit desktop-centric, really was for there to be a userspace component talking to the drm driver for allocation, ie. xf86-video-foo and/or src/gallium/drivers/foo (for example) ;-) Which means for x11 having a SoC vendor specific xf86-video-foo for x11.. or vendor specific gbm implementation for wayland. (Although at least in the latter case it is a pretty small piece of code.) But that is probably what you are trying to avoid. At any rate, for both xorg and wayland/gbm, you know when a buffer is going to be a scanout buffer. What I'd recommend is define a small userspace API that your customers (the SoC vendors) implement to allocate a scanout buffer and hand you back a dmabuf fd. That could be used both for x11 and for gbm. Inputs should be requested width/height and format. And outputs pitch plus dmabuf fd. (Actually you might even just want to use gbm as your starting point. You could probably just use gbm from xf86-video-armsoc for allocation, to have one thing that works for both wayland and x11. Scanout and cursor buffers should go to vendor/SoC specific fxn, rest can be allocated from mali kernel driver.) > >> > wouldn't need a way to programmatically describe the constraints >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> > just "know" that on this SoC, that buffer needs to be physically >> > contiguous for example. >> >> not really.. it just knows it wants to scanout the buffer, and tells >> this as a hint to the kernel. >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> (where phys contig is not required for scanout), but causes CMA >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just >> knows that it wants to be able to scanout that particular buffer. > > I think that's the idea? The omap3's allocator driver would use > contiguous memory when it detects the SCANOUT flag whereas the omap4 > allocator driver wouldn't have to. No complex negotiation of > constraints - it just "knows". > well, it is same allocating driver in both cases (although maybe that is unimportant). The "it" that just knows it wants to scanout is userspace. The "it" that just knows that scanout translates to contiguous (or not) is the kernel. Perhaps we are saying the same thing ;-) BR, -R > > Cheers, > > Tom > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 19:06 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 19:06 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 1:38 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> ... This is the purpose of the attach step, >> >> so you know all the devices involved in sharing up front before >> >> allocating the backing pages. (Or in the worst case, if you have a >> >> "late attacher" you at least know when no device is doing dma access >> >> to a buffer and can reallocate and move the buffer.) A long time >> >> back, I had a patch that added a field or two to 'struct >> >> device_dma_parameters' so that it could be known if a device >> >> required contiguous buffers.. looks like that never got merged, so >> >> I'd need to dig that back up and resend it. But the idea was to >> >> have the 'struct device' encapsulate all the information that would >> >> be needed to do-the-right-thing when it comes to placement. >> > >> > As I understand it, it's up to the exporting device to allocate the >> > memory backing the dma_buf buffer. I guess the latest possible point >> > you can allocate the backing pages is when map_dma_buf is first >> > called? At that point the exporter can iterate over the current set >> > of attachments, programmatically determine the all the constraints of >> > all the attached drivers and attempt to allocate the backing pages >> > in such a way as to satisfy all those constraints? >> >> yes, this is the idea.. possibly some room for some helpers to help >> out with this, but that is all under the hood from userspace >> perspective >> >> > Didn't you say that programmatically describing device placement >> > constraints was an unbounded problem? I guess we would have to >> > accept that it's not possible to describe all possible constraints >> > and instead find a way to describe the common ones? >> >> well, the point I'm trying to make, is by dividing your constraints >> into two groups, one that impacts and is handled by userspace, and one >> that is in the kernel (ie. where the pages go), you cut down the >> number of permutations that the kernel has to care about considerably. >> And kernel already cares about, for example, what range of addresses >> that a device can dma to/from. I think really the only thing missing >> is the max # of sglist entries (contiguous or not) > > I think it's more than physically contiguous or not. > > For example, it can be more efficient to use large page sizes on > devices with IOMMUs to reduce TLB traffic. I think the size and even > the availability of large pages varies between different IOMMUs. sure.. but I suppose if we can spiff out dma_params to express "I need contiguous", perhaps we can add some way to express "I prefer as-contiguous-as-possible".. either way, this is about where the pages are placed, and not about the layout of pixels within the page, so should be in kernel. It's something that is missing, but I believe that it belongs in dma_params and hidden behind dma_alloc_*() for simple drivers. > There's also the issue of buffer stride alignment. As I say, if the > buffer is to be written by a tile-based GPU like Mali, it's more > efficient if the buffer's stride is aligned to the max AXI bus burst > length. Though I guess a buffer stride only makes sense as a concept > when interpreting the data as a linear-layout 2D image, so perhaps > belongs in user-space along with format negotiation? > Yeah.. this isn't about where the pages go, but about the arrangement within a page. And, well, except for hw that supports the same tiling (or compressed-fb) in display+gpu, you probably aren't sharing tiled buffers. > >> > One problem with this is it duplicates a lot of logic in each >> > driver which can export a dma_buf buffer. Each exporter will need to >> > do pretty much the same thing: iterate over all the attachments, >> > determine of all the constraints (assuming that can be done) and >> > allocate pages such that the lowest-common-denominator is satisfied. >> > >> > Perhaps rather than duplicating that logic in every driver, we could >> > Instead move allocation of the backing pages into dma_buf itself? >> > >> >> I tend to think it is better to add helpers as we see common patterns >> emerge, which drivers can opt-in to using. I don't think that we >> should move allocation into dma_buf itself, but it would perhaps be >> useful to have dma_alloc_*() variants that could allocate for multiple >> devices. > > A helper could work I guess, though I quite like the idea of having > dma_alloc_*() variants which take a list of devices to allocate memory > for. > > >> That would help for simple stuff, although I'd suspect >> eventually a GPU driver will move away from that. (Since you probably >> want to play tricks w/ pools of pages that are pre-zero'd and in the >> correct cache state, use spare cycles on the gpu or dma engine to >> pre-zero uncached pages, and games like that.) > > So presumably you're talking about a GPU driver being the exporter > here? If so, how could the GPU driver do these kind of tricks on > memory shared with another device? > Yes, that is gpu-as-exporter. If someone else is allocating buffers, it is up to them to do these tricks or not. Probably there is a pretty good chance that if you aren't a GPU you don't need those sort of tricks for fast allocation of transient upload buffers, staging textures, temporary pixmaps, etc. Ie. I don't really think a v4l camera or video decoder would benefit from that sort of optimization. > >> >> > Anyway, assuming user-space can figure out how a buffer should be >> >> > stored in memory, how does it indicate this to a kernel driver and >> >> > actually allocate it? Which ioctl on which device does user-space >> >> > call, with what parameters? Are you suggesting using something >> >> > like ION which exposes the low-level details of how buffers are >> > >> laid out in physical memory to userspace? If not, what? >> >> > >> >> >> >> no, userspace should not need to know this. And having a central >> >> driver that knows this for all the other drivers in the system >> >> doesn't really solve anything and isn't really scalable. At best >> >> you might want, in some cases, a flag you can pass when allocating. >> >> For example, some of the drivers have a 'SCANOUT' flag that can be >> >> passed when allocating a GEM buffer, as a hint to the kernel that >> >> 'if this hw requires contig memory for scanout, allocate this >> >> buffer contig'. But really, when it comes to sharing buffers >> >> between devices, we want this sort of information in dev->dma_params >> >> of the importing device(s). >> > >> > If you had a single driver which knew the constraints of all devices >> > on that particular SoC and the interface allowed user-space to >> > specify which devices a buffer is intended to be used with, I guess >> > it could pretty trivially allocate pages which satisfy those > constraints? >> >> keep in mind, even a number of SoC's come with pcie these days. You >> already have things like >> >> https://developer.nvidia.com/content/kayla-platform >> >> You probably want to get out of the SoC mindset, otherwise you are >> going to make bad assumptions that come back to bite you later on. > > Sure - there are always going to be PC-like devices where the > hardware configuration isn't fixed like it is on a traditional SoC. > But I'd rather have a simple solution which works on traditional SoCs > than no solution at all. Today our solution is to over-load the dumb > buffer alloc functions of the display's DRM driver - For now I'm just > looking for the next step up from that! ;-) > True.. the original intention, which is perhaps a bit desktop-centric, really was for there to be a userspace component talking to the drm driver for allocation, ie. xf86-video-foo and/or src/gallium/drivers/foo (for example) ;-) Which means for x11 having a SoC vendor specific xf86-video-foo for x11.. or vendor specific gbm implementation for wayland. (Although at least in the latter case it is a pretty small piece of code.) But that is probably what you are trying to avoid. At any rate, for both xorg and wayland/gbm, you know when a buffer is going to be a scanout buffer. What I'd recommend is define a small userspace API that your customers (the SoC vendors) implement to allocate a scanout buffer and hand you back a dmabuf fd. That could be used both for x11 and for gbm. Inputs should be requested width/height and format. And outputs pitch plus dmabuf fd. (Actually you might even just want to use gbm as your starting point. You could probably just use gbm from xf86-video-armsoc for allocation, to have one thing that works for both wayland and x11. Scanout and cursor buffers should go to vendor/SoC specific fxn, rest can be allocated from mali kernel driver.) > >> > wouldn't need a way to programmatically describe the constraints >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> > just "know" that on this SoC, that buffer needs to be physically >> > contiguous for example. >> >> not really.. it just knows it wants to scanout the buffer, and tells >> this as a hint to the kernel. >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> (where phys contig is not required for scanout), but causes CMA >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just >> knows that it wants to be able to scanout that particular buffer. > > I think that's the idea? The omap3's allocator driver would use > contiguous memory when it detects the SCANOUT flag whereas the omap4 > allocator driver wouldn't have to. No complex negotiation of > constraints - it just "knows". > well, it is same allocating driver in both cases (although maybe that is unimportant). The "it" that just knows it wants to scanout is userspace. The "it" that just knows that scanout translates to contiguous (or not) is the kernel. Perhaps we are saying the same thing ;-) BR, -R > > Cheers, > > Tom > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-06 19:06 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-06 19:06 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Tue, Aug 6, 2013 at 1:38 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> ... This is the purpose of the attach step, >> >> so you know all the devices involved in sharing up front before >> >> allocating the backing pages. (Or in the worst case, if you have a >> >> "late attacher" you at least know when no device is doing dma access >> >> to a buffer and can reallocate and move the buffer.) A long time >> >> back, I had a patch that added a field or two to 'struct >> >> device_dma_parameters' so that it could be known if a device >> >> required contiguous buffers.. looks like that never got merged, so >> >> I'd need to dig that back up and resend it. But the idea was to >> >> have the 'struct device' encapsulate all the information that would >> >> be needed to do-the-right-thing when it comes to placement. >> > >> > As I understand it, it's up to the exporting device to allocate the >> > memory backing the dma_buf buffer. I guess the latest possible point >> > you can allocate the backing pages is when map_dma_buf is first >> > called? At that point the exporter can iterate over the current set >> > of attachments, programmatically determine the all the constraints of >> > all the attached drivers and attempt to allocate the backing pages >> > in such a way as to satisfy all those constraints? >> >> yes, this is the idea.. possibly some room for some helpers to help >> out with this, but that is all under the hood from userspace >> perspective >> >> > Didn't you say that programmatically describing device placement >> > constraints was an unbounded problem? I guess we would have to >> > accept that it's not possible to describe all possible constraints >> > and instead find a way to describe the common ones? >> >> well, the point I'm trying to make, is by dividing your constraints >> into two groups, one that impacts and is handled by userspace, and one >> that is in the kernel (ie. where the pages go), you cut down the >> number of permutations that the kernel has to care about considerably. >> And kernel already cares about, for example, what range of addresses >> that a device can dma to/from. I think really the only thing missing >> is the max # of sglist entries (contiguous or not) > > I think it's more than physically contiguous or not. > > For example, it can be more efficient to use large page sizes on > devices with IOMMUs to reduce TLB traffic. I think the size and even > the availability of large pages varies between different IOMMUs. sure.. but I suppose if we can spiff out dma_params to express "I need contiguous", perhaps we can add some way to express "I prefer as-contiguous-as-possible".. either way, this is about where the pages are placed, and not about the layout of pixels within the page, so should be in kernel. It's something that is missing, but I believe that it belongs in dma_params and hidden behind dma_alloc_*() for simple drivers. > There's also the issue of buffer stride alignment. As I say, if the > buffer is to be written by a tile-based GPU like Mali, it's more > efficient if the buffer's stride is aligned to the max AXI bus burst > length. Though I guess a buffer stride only makes sense as a concept > when interpreting the data as a linear-layout 2D image, so perhaps > belongs in user-space along with format negotiation? > Yeah.. this isn't about where the pages go, but about the arrangement within a page. And, well, except for hw that supports the same tiling (or compressed-fb) in display+gpu, you probably aren't sharing tiled buffers. > >> > One problem with this is it duplicates a lot of logic in each >> > driver which can export a dma_buf buffer. Each exporter will need to >> > do pretty much the same thing: iterate over all the attachments, >> > determine of all the constraints (assuming that can be done) and >> > allocate pages such that the lowest-common-denominator is satisfied. >> > >> > Perhaps rather than duplicating that logic in every driver, we could >> > Instead move allocation of the backing pages into dma_buf itself? >> > >> >> I tend to think it is better to add helpers as we see common patterns >> emerge, which drivers can opt-in to using. I don't think that we >> should move allocation into dma_buf itself, but it would perhaps be >> useful to have dma_alloc_*() variants that could allocate for multiple >> devices. > > A helper could work I guess, though I quite like the idea of having > dma_alloc_*() variants which take a list of devices to allocate memory > for. > > >> That would help for simple stuff, although I'd suspect >> eventually a GPU driver will move away from that. (Since you probably >> want to play tricks w/ pools of pages that are pre-zero'd and in the >> correct cache state, use spare cycles on the gpu or dma engine to >> pre-zero uncached pages, and games like that.) > > So presumably you're talking about a GPU driver being the exporter > here? If so, how could the GPU driver do these kind of tricks on > memory shared with another device? > Yes, that is gpu-as-exporter. If someone else is allocating buffers, it is up to them to do these tricks or not. Probably there is a pretty good chance that if you aren't a GPU you don't need those sort of tricks for fast allocation of transient upload buffers, staging textures, temporary pixmaps, etc. Ie. I don't really think a v4l camera or video decoder would benefit from that sort of optimization. > >> >> > Anyway, assuming user-space can figure out how a buffer should be >> >> > stored in memory, how does it indicate this to a kernel driver and >> >> > actually allocate it? Which ioctl on which device does user-space >> >> > call, with what parameters? Are you suggesting using something >> >> > like ION which exposes the low-level details of how buffers are >> > >> laid out in physical memory to userspace? If not, what? >> >> > >> >> >> >> no, userspace should not need to know this. And having a central >> >> driver that knows this for all the other drivers in the system >> >> doesn't really solve anything and isn't really scalable. At best >> >> you might want, in some cases, a flag you can pass when allocating. >> >> For example, some of the drivers have a 'SCANOUT' flag that can be >> >> passed when allocating a GEM buffer, as a hint to the kernel that >> >> 'if this hw requires contig memory for scanout, allocate this >> >> buffer contig'. But really, when it comes to sharing buffers >> >> between devices, we want this sort of information in dev->dma_params >> >> of the importing device(s). >> > >> > If you had a single driver which knew the constraints of all devices >> > on that particular SoC and the interface allowed user-space to >> > specify which devices a buffer is intended to be used with, I guess >> > it could pretty trivially allocate pages which satisfy those > constraints? >> >> keep in mind, even a number of SoC's come with pcie these days. You >> already have things like >> >> https://developer.nvidia.com/content/kayla-platform >> >> You probably want to get out of the SoC mindset, otherwise you are >> going to make bad assumptions that come back to bite you later on. > > Sure - there are always going to be PC-like devices where the > hardware configuration isn't fixed like it is on a traditional SoC. > But I'd rather have a simple solution which works on traditional SoCs > than no solution at all. Today our solution is to over-load the dumb > buffer alloc functions of the display's DRM driver - For now I'm just > looking for the next step up from that! ;-) > True.. the original intention, which is perhaps a bit desktop-centric, really was for there to be a userspace component talking to the drm driver for allocation, ie. xf86-video-foo and/or src/gallium/drivers/foo (for example) ;-) Which means for x11 having a SoC vendor specific xf86-video-foo for x11.. or vendor specific gbm implementation for wayland. (Although at least in the latter case it is a pretty small piece of code.) But that is probably what you are trying to avoid. At any rate, for both xorg and wayland/gbm, you know when a buffer is going to be a scanout buffer. What I'd recommend is define a small userspace API that your customers (the SoC vendors) implement to allocate a scanout buffer and hand you back a dmabuf fd. That could be used both for x11 and for gbm. Inputs should be requested width/height and format. And outputs pitch plus dmabuf fd. (Actually you might even just want to use gbm as your starting point. You could probably just use gbm from xf86-video-armsoc for allocation, to have one thing that works for both wayland and x11. Scanout and cursor buffers should go to vendor/SoC specific fxn, rest can be allocated from mali kernel driver.) > >> > wouldn't need a way to programmatically describe the constraints >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> > just "know" that on this SoC, that buffer needs to be physically >> > contiguous for example. >> >> not really.. it just knows it wants to scanout the buffer, and tells >> this as a hint to the kernel. >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> (where phys contig is not required for scanout), but causes CMA >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. It just >> knows that it wants to be able to scanout that particular buffer. > > I think that's the idea? The omap3's allocator driver would use > contiguous memory when it detects the SCANOUT flag whereas the omap4 > allocator driver wouldn't have to. No complex negotiation of > constraints - it just "knows". > well, it is same allocating driver in both cases (although maybe that is unimportant). The "it" that just knows it wants to scanout is userspace. The "it" that just knows that scanout translates to contiguous (or not) is the kernel. Perhaps we are saying the same thing ;-) BR, -R > > Cheers, > > Tom > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 19:06 ` Rob Clark (?) (?) @ 2013-08-07 17:33 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-07 17:33 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > >> > Didn't you say that programmatically describing device placement > >> > constraints was an unbounded problem? I guess we would have to > >> > accept that it's not possible to describe all possible constraints > >> > and instead find a way to describe the common ones? > >> > >> well, the point I'm trying to make, is by dividing your constraints > >> into two groups, one that impacts and is handled by userspace, and > >> one that is in the kernel (ie. where the pages go), you cut down > >> the number of permutations that the kernel has to care about > >> considerably. And kernel already cares about, for example, what > >> range of addresses that a device can dma to/from. I think really > >> the only thing missing is the max # of sglist entries (contiguous > >> or not) > > > > I think it's more than physically contiguous or not. > > > > For example, it can be more efficient to use large page sizes on > > devices with IOMMUs to reduce TLB traffic. I think the size and even > > the availability of large pages varies between different IOMMUs. > > sure.. but I suppose if we can spiff out dma_params to express "I need > contiguous", perhaps we can add some way to express "I prefer > as-contiguous-as-possible".. either way, this is about where the pages > are placed, and not about the layout of pixels within the page, so > should be in kernel. It's something that is missing, but I believe > that it belongs in dma_params and hidden behind dma_alloc_*() for > simple drivers. Thinking about it, isn't this more a property of the IOMMU? I mean, are there any cases where an IOMMU had a large page mode but you wouldn't want to use it? So when allocating the memory, you'd have to take into account not just the constraints of the devices themselves, but also of any IOMMUs any of the device sit behind? > > There's also the issue of buffer stride alignment. As I say, if the > > buffer is to be written by a tile-based GPU like Mali, it's more > > efficient if the buffer's stride is aligned to the max AXI bus burst > > length. Though I guess a buffer stride only makes sense as a concept > > when interpreting the data as a linear-layout 2D image, so perhaps > > belongs in user-space along with format negotiation? > > > > Yeah.. this isn't about where the pages go, but about the arrangement > within a page. > > And, well, except for hw that supports the same tiling (or > compressed-fb) in display+gpu, you probably aren't sharing tiled > buffers. You'd only want to share a buffer between devices if those devices can understand the same pixel format. That pixel format can't be device- specific or opaque, it has to be explicit. I think drm_fourcc.h is what defines all the possible pixel formats. This is the enum I used in EGL_EXT_image_dma_buf_import at least. So if we get to the point where multiple devices can understand a tiled or compressed format, I assume we could just add that format to drm_fourcc.h and possibly v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. For user-space to negotiate a common pixel format and now stride alignment, I guess it will obviously need a way to query what pixel formats a device supports and what its stride alignment requirements are. I don't know v4l2 very well, but it certainly seems the pixel format can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set a particular format. I couldn't however find a way to retrieve a list of supported formats - it seems the mechanism is to try out each format in turn to determine if it is supported. Is that right? There doesn't however seem a way to query what stride constraints a V4l2 device might have. Does HW abstracted by v4l2 typically have such constraints? If so, how can we query them such that a buffer allocated by a DRM driver can be imported into v4l2 and used with that HW? Turning to DRM/KMS, it seems the supported formats of a plane can be queried using drm_mode_get_plane. However, there doesn't seem to be a way to query the supported formats of a crtc? If display HW only supports scanning out from a single buffer (like pl111 does), I think it won't have any planes and a fb can only be set on the crtc. In which case, how should user-space query which pixel formats that crtc supports? Assuming user-space can query the supported formats and find a common one, it will need to allocate a buffer. Looks like drm_mode_create_dumb can do that, but it only takes a bpp parameter, there's no format parameter. I assume then that user-space defines the format and tells the DRM driver which format the buffer is in when creating the fb with drm_mode_fb_cmd2, which does take a format parameter? Is that right? As with v4l2, DRM doesn't appear to have a way to query the stride constraints? Assuming there is a way to query the stride constraints, there also isn't a way to specify them when creating a buffer with DRM, though perhaps the existing pitch parameter of drm_mode_create_dumb could be used to allow user-space to pass in a minimum stride as well as receive the allocated stride? > >> > One problem with this is it duplicates a lot of logic in each > >> > driver which can export a dma_buf buffer. Each exporter will > >> > need to do pretty much the same thing: iterate over all the > >> > attachments, determine of all the constraints (assuming that > >> > can be done) and allocate pages such that the lowest-common- > >> > denominator is satisfied. Perhaps rather than duplicating that > >> > logic in every driver, we could instead move allocation of the > >> > backing pages into dma_buf itself? > >> > >> I tend to think it is better to add helpers as we see common > > >> patterns emerge, which drivers can opt-in to using. I don't > >> think that we should move allocation into dma_buf itself, but > >> it would perhaps be useful to have dma_alloc_*() variants that > >> could allocate for multiple devices. > > > > A helper could work I guess, though I quite like the idea of > > having dma_alloc_*() variants which take a list of devices to > > allocate memory for. > > > > > >> That would help for simple stuff, although I'd suspect > >> eventually a GPU driver will move away from that. (Since you > >> probably want to play tricks w/ pools of pages that are > >> pre-zero'd and in the correct cache state, use spare cycles on > >> the gpu or dma engine to pre-zero uncached pages, and games > >> like that.) > > > > So presumably you're talking about a GPU driver being the exporter > > here? If so, how could the GPU driver do these kind of tricks on > > memory shared with another device? > > Yes, that is gpu-as-exporter. If someone else is allocating buffers, > it is up to them to do these tricks or not. Probably there is a > pretty good chance that if you aren't a GPU you don't need those sort > of tricks for fast allocation of transient upload buffers, staging > textures, temporary pixmaps, etc. Ie. I don't really think a v4l > camera or video decoder would benefit from that sort of optimization. Right - but none of those are really buffers you'd want to export with dma_buf to share with another device are they? In which case, why not just have dma_buf figure out the constraints and allocate the memory? If a driver needs to allocate memory in a special way for a particular device, I can't really imagine how it would be able to share that buffer with another device using dma_buf? I guess a driver is likely to need some magic voodoo to configure access to the buffer for its device, but surely that would be done by the dma_mapping framework when dma_buf_map happens? > >> You probably want to get out of the SoC mindset, otherwise you are > >> going to make bad assumptions that come back to bite you later on. > > > > Sure - there are always going to be PC-like devices where the > > hardware configuration isn't fixed like it is on a traditional SoC. > > But I'd rather have a simple solution which works on traditional SoCs > > than no solution at all. Today our solution is to over-load the dumb > > buffer alloc functions of the display's DRM driver - For now I'm just > > looking for the next step up from that! ;-) > > True.. the original intention, which is perhaps a bit desktop-centric, > really was for there to be a userspace component talking to the drm > driver for allocation, ie. xf86-video-foo and/or > src/gallium/drivers/foo (for example) ;-) > > Which means for x11 having a SoC vendor specific xf86-video-foo for > x11.. or vendor specific gbm implementation for wayland. (Although > at least in the latter case it is a pretty small piece of code.) But > that is probably what you are trying to avoid. I've been trying to get my head around how PRIME relates to DDX drivers. As I understand it (which is likely wrong), you have a laptop with both an Intel & an nVidia GPU. You have both the i915 & nouveau kernel drivers loaded. What I'm not sure about is which GPU's display controller is actually hooked up to the physical connector? Perhaps there is a MUX like there is on Versatile Express? What I also don't understand is what DDX driver is loaded? Is it xf86-video-intel, xf86-video-nouveau or both? I get the impression that there's a "master" DDX which implements 2D operations but can import buffers using PRIME from the other driver and draw to them. Or is it more that it's able to export rendered buffers to the "slave" DRM for scanout? Either way, it's pretty similar to an ARM SoC setup which has the GPU and the display as two totally independent devices. > At any rate, for both xorg and wayland/gbm, you know when a buffer is > going to be a scanout buffer. What I'd recommend is define a small > userspace API that your customers (the SoC vendors) implement to > allocate a scanout buffer and hand you back a dmabuf fd. That could > be used both for x11 and for gbm. Inputs should be requested > width/height and format. And outputs pitch plus dmabuf fd. > > (Actually you might even just want to use gbm as your starting point. > You could probably just use gbm from xf86-video-armsoc for allocation, > to have one thing that works for both wayland and x11. Scanout and > cursor buffers should go to vendor/SoC specific fxn, rest can be > allocated from mali kernel driver.) What does that buy us over just using drm_mode_create_dumb on the display's DRM driver? > >> > wouldn't need a way to programmatically describe the constraints > >> > either: As you say, if userspace sets the "SCANOUT" flag, it would > >> > just "know" that on this SoC, that buffer needs to be physically > >> > contiguous for example. > >> > >> not really.. it just knows it wants to scanout the buffer, and tells > >> this as a hint to the kernel. > >> > >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > >> (where phys contig is not required for scanout), but causes CMA > >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. > >> It just knows that it wants to be able to scanout that particular > >> buffer. > > > > I think that's the idea? The omap3's allocator driver would use > > contiguous memory when it detects the SCANOUT flag whereas the omap4 > > allocator driver wouldn't have to. No complex negotiation of > > constraints - it just "knows". > > > > well, it is same allocating driver in both cases (although maybe that > is unimportant). The "it" that just knows it wants to scanout is > userspace. The "it" that just knows that scanout translates to > contiguous (or not) is the kernel. Perhaps we are saying the same > thing ;-) Yeah - I think we are... so what's the issue with having a per-SoC allocation driver again? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 19:06 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-07 17:33 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-07 17:33 UTC (permalink / raw) To: linux-arm-kernel > >> > Didn't you say that programmatically describing device placement > >> > constraints was an unbounded problem? I guess we would have to > >> > accept that it's not possible to describe all possible constraints > >> > and instead find a way to describe the common ones? > >> > >> well, the point I'm trying to make, is by dividing your constraints > >> into two groups, one that impacts and is handled by userspace, and > >> one that is in the kernel (ie. where the pages go), you cut down > >> the number of permutations that the kernel has to care about > >> considerably. And kernel already cares about, for example, what > >> range of addresses that a device can dma to/from. I think really > >> the only thing missing is the max # of sglist entries (contiguous > >> or not) > > > > I think it's more than physically contiguous or not. > > > > For example, it can be more efficient to use large page sizes on > > devices with IOMMUs to reduce TLB traffic. I think the size and even > > the availability of large pages varies between different IOMMUs. > > sure.. but I suppose if we can spiff out dma_params to express "I need > contiguous", perhaps we can add some way to express "I prefer > as-contiguous-as-possible".. either way, this is about where the pages > are placed, and not about the layout of pixels within the page, so > should be in kernel. It's something that is missing, but I believe > that it belongs in dma_params and hidden behind dma_alloc_*() for > simple drivers. Thinking about it, isn't this more a property of the IOMMU? I mean, are there any cases where an IOMMU had a large page mode but you wouldn't want to use it? So when allocating the memory, you'd have to take into account not just the constraints of the devices themselves, but also of any IOMMUs any of the device sit behind? > > There's also the issue of buffer stride alignment. As I say, if the > > buffer is to be written by a tile-based GPU like Mali, it's more > > efficient if the buffer's stride is aligned to the max AXI bus burst > > length. Though I guess a buffer stride only makes sense as a concept > > when interpreting the data as a linear-layout 2D image, so perhaps > > belongs in user-space along with format negotiation? > > > > Yeah.. this isn't about where the pages go, but about the arrangement > within a page. > > And, well, except for hw that supports the same tiling (or > compressed-fb) in display+gpu, you probably aren't sharing tiled > buffers. You'd only want to share a buffer between devices if those devices can understand the same pixel format. That pixel format can't be device- specific or opaque, it has to be explicit. I think drm_fourcc.h is what defines all the possible pixel formats. This is the enum I used in EGL_EXT_image_dma_buf_import at least. So if we get to the point where multiple devices can understand a tiled or compressed format, I assume we could just add that format to drm_fourcc.h and possibly v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. For user-space to negotiate a common pixel format and now stride alignment, I guess it will obviously need a way to query what pixel formats a device supports and what its stride alignment requirements are. I don't know v4l2 very well, but it certainly seems the pixel format can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set a particular format. I couldn't however find a way to retrieve a list of supported formats - it seems the mechanism is to try out each format in turn to determine if it is supported. Is that right? There doesn't however seem a way to query what stride constraints a V4l2 device might have. Does HW abstracted by v4l2 typically have such constraints? If so, how can we query them such that a buffer allocated by a DRM driver can be imported into v4l2 and used with that HW? Turning to DRM/KMS, it seems the supported formats of a plane can be queried using drm_mode_get_plane. However, there doesn't seem to be a way to query the supported formats of a crtc? If display HW only supports scanning out from a single buffer (like pl111 does), I think it won't have any planes and a fb can only be set on the crtc. In which case, how should user-space query which pixel formats that crtc supports? Assuming user-space can query the supported formats and find a common one, it will need to allocate a buffer. Looks like drm_mode_create_dumb can do that, but it only takes a bpp parameter, there's no format parameter. I assume then that user-space defines the format and tells the DRM driver which format the buffer is in when creating the fb with drm_mode_fb_cmd2, which does take a format parameter? Is that right? As with v4l2, DRM doesn't appear to have a way to query the stride constraints? Assuming there is a way to query the stride constraints, there also isn't a way to specify them when creating a buffer with DRM, though perhaps the existing pitch parameter of drm_mode_create_dumb could be used to allow user-space to pass in a minimum stride as well as receive the allocated stride? > >> > One problem with this is it duplicates a lot of logic in each > >> > driver which can export a dma_buf buffer. Each exporter will > >> > need to do pretty much the same thing: iterate over all the > >> > attachments, determine of all the constraints (assuming that > >> > can be done) and allocate pages such that the lowest-common- > >> > denominator is satisfied. Perhaps rather than duplicating that > >> > logic in every driver, we could instead move allocation of the > >> > backing pages into dma_buf itself? > >> > >> I tend to think it is better to add helpers as we see common > > >> patterns emerge, which drivers can opt-in to using. I don't > >> think that we should move allocation into dma_buf itself, but > >> it would perhaps be useful to have dma_alloc_*() variants that > >> could allocate for multiple devices. > > > > A helper could work I guess, though I quite like the idea of > > having dma_alloc_*() variants which take a list of devices to > > allocate memory for. > > > > > >> That would help for simple stuff, although I'd suspect > >> eventually a GPU driver will move away from that. (Since you > >> probably want to play tricks w/ pools of pages that are > >> pre-zero'd and in the correct cache state, use spare cycles on > >> the gpu or dma engine to pre-zero uncached pages, and games > >> like that.) > > > > So presumably you're talking about a GPU driver being the exporter > > here? If so, how could the GPU driver do these kind of tricks on > > memory shared with another device? > > Yes, that is gpu-as-exporter. If someone else is allocating buffers, > it is up to them to do these tricks or not. Probably there is a > pretty good chance that if you aren't a GPU you don't need those sort > of tricks for fast allocation of transient upload buffers, staging > textures, temporary pixmaps, etc. Ie. I don't really think a v4l > camera or video decoder would benefit from that sort of optimization. Right - but none of those are really buffers you'd want to export with dma_buf to share with another device are they? In which case, why not just have dma_buf figure out the constraints and allocate the memory? If a driver needs to allocate memory in a special way for a particular device, I can't really imagine how it would be able to share that buffer with another device using dma_buf? I guess a driver is likely to need some magic voodoo to configure access to the buffer for its device, but surely that would be done by the dma_mapping framework when dma_buf_map happens? > >> You probably want to get out of the SoC mindset, otherwise you are > >> going to make bad assumptions that come back to bite you later on. > > > > Sure - there are always going to be PC-like devices where the > > hardware configuration isn't fixed like it is on a traditional SoC. > > But I'd rather have a simple solution which works on traditional SoCs > > than no solution at all. Today our solution is to over-load the dumb > > buffer alloc functions of the display's DRM driver - For now I'm just > > looking for the next step up from that! ;-) > > True.. the original intention, which is perhaps a bit desktop-centric, > really was for there to be a userspace component talking to the drm > driver for allocation, ie. xf86-video-foo and/or > src/gallium/drivers/foo (for example) ;-) > > Which means for x11 having a SoC vendor specific xf86-video-foo for > x11.. or vendor specific gbm implementation for wayland. (Although > at least in the latter case it is a pretty small piece of code.) But > that is probably what you are trying to avoid. I've been trying to get my head around how PRIME relates to DDX drivers. As I understand it (which is likely wrong), you have a laptop with both an Intel & an nVidia GPU. You have both the i915 & nouveau kernel drivers loaded. What I'm not sure about is which GPU's display controller is actually hooked up to the physical connector? Perhaps there is a MUX like there is on Versatile Express? What I also don't understand is what DDX driver is loaded? Is it xf86-video-intel, xf86-video-nouveau or both? I get the impression that there's a "master" DDX which implements 2D operations but can import buffers using PRIME from the other driver and draw to them. Or is it more that it's able to export rendered buffers to the "slave" DRM for scanout? Either way, it's pretty similar to an ARM SoC setup which has the GPU and the display as two totally independent devices. > At any rate, for both xorg and wayland/gbm, you know when a buffer is > going to be a scanout buffer. What I'd recommend is define a small > userspace API that your customers (the SoC vendors) implement to > allocate a scanout buffer and hand you back a dmabuf fd. That could > be used both for x11 and for gbm. Inputs should be requested > width/height and format. And outputs pitch plus dmabuf fd. > > (Actually you might even just want to use gbm as your starting point. > You could probably just use gbm from xf86-video-armsoc for allocation, > to have one thing that works for both wayland and x11. Scanout and > cursor buffers should go to vendor/SoC specific fxn, rest can be > allocated from mali kernel driver.) What does that buy us over just using drm_mode_create_dumb on the display's DRM driver? > >> > wouldn't need a way to programmatically describe the constraints > >> > either: As you say, if userspace sets the "SCANOUT" flag, it would > >> > just "know" that on this SoC, that buffer needs to be physically > >> > contiguous for example. > >> > >> not really.. it just knows it wants to scanout the buffer, and tells > >> this as a hint to the kernel. > >> > >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > >> (where phys contig is not required for scanout), but causes CMA > >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. > >> It just knows that it wants to be able to scanout that particular > >> buffer. > > > > I think that's the idea? The omap3's allocator driver would use > > contiguous memory when it detects the SCANOUT flag whereas the omap4 > > allocator driver wouldn't have to. No complex negotiation of > > constraints - it just "knows". > > > > well, it is same allocating driver in both cases (although maybe that > is unimportant). The "it" that just knows it wants to scanout is > userspace. The "it" that just knows that scanout translates to > contiguous (or not) is the kernel. Perhaps we are saying the same > thing ;-) Yeah - I think we are... so what's the issue with having a per-SoC allocation driver again? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 19:06 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-07 17:33 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-07 17:33 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > >> > Didn't you say that programmatically describing device placement > >> > constraints was an unbounded problem? I guess we would have to > >> > accept that it's not possible to describe all possible constraints > >> > and instead find a way to describe the common ones? > >> > >> well, the point I'm trying to make, is by dividing your constraints > >> into two groups, one that impacts and is handled by userspace, and > >> one that is in the kernel (ie. where the pages go), you cut down > >> the number of permutations that the kernel has to care about > >> considerably. And kernel already cares about, for example, what > >> range of addresses that a device can dma to/from. I think really > >> the only thing missing is the max # of sglist entries (contiguous > >> or not) > > > > I think it's more than physically contiguous or not. > > > > For example, it can be more efficient to use large page sizes on > > devices with IOMMUs to reduce TLB traffic. I think the size and even > > the availability of large pages varies between different IOMMUs. > > sure.. but I suppose if we can spiff out dma_params to express "I need > contiguous", perhaps we can add some way to express "I prefer > as-contiguous-as-possible".. either way, this is about where the pages > are placed, and not about the layout of pixels within the page, so > should be in kernel. It's something that is missing, but I believe > that it belongs in dma_params and hidden behind dma_alloc_*() for > simple drivers. Thinking about it, isn't this more a property of the IOMMU? I mean, are there any cases where an IOMMU had a large page mode but you wouldn't want to use it? So when allocating the memory, you'd have to take into account not just the constraints of the devices themselves, but also of any IOMMUs any of the device sit behind? > > There's also the issue of buffer stride alignment. As I say, if the > > buffer is to be written by a tile-based GPU like Mali, it's more > > efficient if the buffer's stride is aligned to the max AXI bus burst > > length. Though I guess a buffer stride only makes sense as a concept > > when interpreting the data as a linear-layout 2D image, so perhaps > > belongs in user-space along with format negotiation? > > > > Yeah.. this isn't about where the pages go, but about the arrangement > within a page. > > And, well, except for hw that supports the same tiling (or > compressed-fb) in display+gpu, you probably aren't sharing tiled > buffers. You'd only want to share a buffer between devices if those devices can understand the same pixel format. That pixel format can't be device- specific or opaque, it has to be explicit. I think drm_fourcc.h is what defines all the possible pixel formats. This is the enum I used in EGL_EXT_image_dma_buf_import at least. So if we get to the point where multiple devices can understand a tiled or compressed format, I assume we could just add that format to drm_fourcc.h and possibly v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. For user-space to negotiate a common pixel format and now stride alignment, I guess it will obviously need a way to query what pixel formats a device supports and what its stride alignment requirements are. I don't know v4l2 very well, but it certainly seems the pixel format can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set a particular format. I couldn't however find a way to retrieve a list of supported formats - it seems the mechanism is to try out each format in turn to determine if it is supported. Is that right? There doesn't however seem a way to query what stride constraints a V4l2 device might have. Does HW abstracted by v4l2 typically have such constraints? If so, how can we query them such that a buffer allocated by a DRM driver can be imported into v4l2 and used with that HW? Turning to DRM/KMS, it seems the supported formats of a plane can be queried using drm_mode_get_plane. However, there doesn't seem to be a way to query the supported formats of a crtc? If display HW only supports scanning out from a single buffer (like pl111 does), I think it won't have any planes and a fb can only be set on the crtc. In which case, how should user-space query which pixel formats that crtc supports? Assuming user-space can query the supported formats and find a common one, it will need to allocate a buffer. Looks like drm_mode_create_dumb can do that, but it only takes a bpp parameter, there's no format parameter. I assume then that user-space defines the format and tells the DRM driver which format the buffer is in when creating the fb with drm_mode_fb_cmd2, which does take a format parameter? Is that right? As with v4l2, DRM doesn't appear to have a way to query the stride constraints? Assuming there is a way to query the stride constraints, there also isn't a way to specify them when creating a buffer with DRM, though perhaps the existing pitch parameter of drm_mode_create_dumb could be used to allow user-space to pass in a minimum stride as well as receive the allocated stride? > >> > One problem with this is it duplicates a lot of logic in each > >> > driver which can export a dma_buf buffer. Each exporter will > >> > need to do pretty much the same thing: iterate over all the > >> > attachments, determine of all the constraints (assuming that > >> > can be done) and allocate pages such that the lowest-common- > >> > denominator is satisfied. Perhaps rather than duplicating that > >> > logic in every driver, we could instead move allocation of the > >> > backing pages into dma_buf itself? > >> > >> I tend to think it is better to add helpers as we see common > > >> patterns emerge, which drivers can opt-in to using. I don't > >> think that we should move allocation into dma_buf itself, but > >> it would perhaps be useful to have dma_alloc_*() variants that > >> could allocate for multiple devices. > > > > A helper could work I guess, though I quite like the idea of > > having dma_alloc_*() variants which take a list of devices to > > allocate memory for. > > > > > >> That would help for simple stuff, although I'd suspect > >> eventually a GPU driver will move away from that. (Since you > >> probably want to play tricks w/ pools of pages that are > >> pre-zero'd and in the correct cache state, use spare cycles on > >> the gpu or dma engine to pre-zero uncached pages, and games > >> like that.) > > > > So presumably you're talking about a GPU driver being the exporter > > here? If so, how could the GPU driver do these kind of tricks on > > memory shared with another device? > > Yes, that is gpu-as-exporter. If someone else is allocating buffers, > it is up to them to do these tricks or not. Probably there is a > pretty good chance that if you aren't a GPU you don't need those sort > of tricks for fast allocation of transient upload buffers, staging > textures, temporary pixmaps, etc. Ie. I don't really think a v4l > camera or video decoder would benefit from that sort of optimization. Right - but none of those are really buffers you'd want to export with dma_buf to share with another device are they? In which case, why not just have dma_buf figure out the constraints and allocate the memory? If a driver needs to allocate memory in a special way for a particular device, I can't really imagine how it would be able to share that buffer with another device using dma_buf? I guess a driver is likely to need some magic voodoo to configure access to the buffer for its device, but surely that would be done by the dma_mapping framework when dma_buf_map happens? > >> You probably want to get out of the SoC mindset, otherwise you are > >> going to make bad assumptions that come back to bite you later on. > > > > Sure - there are always going to be PC-like devices where the > > hardware configuration isn't fixed like it is on a traditional SoC. > > But I'd rather have a simple solution which works on traditional SoCs > > than no solution at all. Today our solution is to over-load the dumb > > buffer alloc functions of the display's DRM driver - For now I'm just > > looking for the next step up from that! ;-) > > True.. the original intention, which is perhaps a bit desktop-centric, > really was for there to be a userspace component talking to the drm > driver for allocation, ie. xf86-video-foo and/or > src/gallium/drivers/foo (for example) ;-) > > Which means for x11 having a SoC vendor specific xf86-video-foo for > x11.. or vendor specific gbm implementation for wayland. (Although > at least in the latter case it is a pretty small piece of code.) But > that is probably what you are trying to avoid. I've been trying to get my head around how PRIME relates to DDX drivers. As I understand it (which is likely wrong), you have a laptop with both an Intel & an nVidia GPU. You have both the i915 & nouveau kernel drivers loaded. What I'm not sure about is which GPU's display controller is actually hooked up to the physical connector? Perhaps there is a MUX like there is on Versatile Express? What I also don't understand is what DDX driver is loaded? Is it xf86-video-intel, xf86-video-nouveau or both? I get the impression that there's a "master" DDX which implements 2D operations but can import buffers using PRIME from the other driver and draw to them. Or is it more that it's able to export rendered buffers to the "slave" DRM for scanout? Either way, it's pretty similar to an ARM SoC setup which has the GPU and the display as two totally independent devices. > At any rate, for both xorg and wayland/gbm, you know when a buffer is > going to be a scanout buffer. What I'd recommend is define a small > userspace API that your customers (the SoC vendors) implement to > allocate a scanout buffer and hand you back a dmabuf fd. That could > be used both for x11 and for gbm. Inputs should be requested > width/height and format. And outputs pitch plus dmabuf fd. > > (Actually you might even just want to use gbm as your starting point. > You could probably just use gbm from xf86-video-armsoc for allocation, > to have one thing that works for both wayland and x11. Scanout and > cursor buffers should go to vendor/SoC specific fxn, rest can be > allocated from mali kernel driver.) What does that buy us over just using drm_mode_create_dumb on the display's DRM driver? > >> > wouldn't need a way to programmatically describe the constraints > >> > either: As you say, if userspace sets the "SCANOUT" flag, it would > >> > just "know" that on this SoC, that buffer needs to be physically > >> > contiguous for example. > >> > >> not really.. it just knows it wants to scanout the buffer, and tells > >> this as a hint to the kernel. > >> > >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > >> (where phys contig is not required for scanout), but causes CMA > >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. > >> It just knows that it wants to be able to scanout that particular > >> buffer. > > > > I think that's the idea? The omap3's allocator driver would use > > contiguous memory when it detects the SCANOUT flag whereas the omap4 > > allocator driver wouldn't have to. No complex negotiation of > > constraints - it just "knows". > > > > well, it is same allocating driver in both cases (although maybe that > is unimportant). The "it" that just knows it wants to scanout is > userspace. The "it" that just knows that scanout translates to > contiguous (or not) is the kernel. Perhaps we are saying the same > thing ;-) Yeah - I think we are... so what's the issue with having a per-SoC allocation driver again? Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <520284fe.a16ec20a.2d3c.6e19SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <520284fe.a16ec20a.2d3c.6e19SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-07 17:56 ` Alex Deucher @ 2013-08-07 17:56 ` Alex Deucher 0 siblings, 0 replies; 131+ messages in thread From: Alex Deucher @ 2013-08-07 17:56 UTC (permalink / raw) To: Tom Cooksey Cc: Rob Clark, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> > Didn't you say that programmatically describing device placement >> >> > constraints was an unbounded problem? I guess we would have to >> >> > accept that it's not possible to describe all possible constraints >> >> > and instead find a way to describe the common ones? >> >> >> >> well, the point I'm trying to make, is by dividing your constraints >> >> into two groups, one that impacts and is handled by userspace, and >> >> one that is in the kernel (ie. where the pages go), you cut down >> >> the number of permutations that the kernel has to care about >> >> considerably. And kernel already cares about, for example, what >> >> range of addresses that a device can dma to/from. I think really >> >> the only thing missing is the max # of sglist entries (contiguous >> >> or not) >> > >> > I think it's more than physically contiguous or not. >> > >> > For example, it can be more efficient to use large page sizes on >> > devices with IOMMUs to reduce TLB traffic. I think the size and even >> > the availability of large pages varies between different IOMMUs. >> >> sure.. but I suppose if we can spiff out dma_params to express "I need >> contiguous", perhaps we can add some way to express "I prefer >> as-contiguous-as-possible".. either way, this is about where the pages >> are placed, and not about the layout of pixels within the page, so >> should be in kernel. It's something that is missing, but I believe >> that it belongs in dma_params and hidden behind dma_alloc_*() for >> simple drivers. > > Thinking about it, isn't this more a property of the IOMMU? I mean, > are there any cases where an IOMMU had a large page mode but you > wouldn't want to use it? So when allocating the memory, you'd have to > take into account not just the constraints of the devices themselves, > but also of any IOMMUs any of the device sit behind? > > >> > There's also the issue of buffer stride alignment. As I say, if the >> > buffer is to be written by a tile-based GPU like Mali, it's more >> > efficient if the buffer's stride is aligned to the max AXI bus burst >> > length. Though I guess a buffer stride only makes sense as a concept >> > when interpreting the data as a linear-layout 2D image, so perhaps >> > belongs in user-space along with format negotiation? >> > >> >> Yeah.. this isn't about where the pages go, but about the arrangement >> within a page. >> >> And, well, except for hw that supports the same tiling (or >> compressed-fb) in display+gpu, you probably aren't sharing tiled >> buffers. > > You'd only want to share a buffer between devices if those devices can > understand the same pixel format. That pixel format can't be device- > specific or opaque, it has to be explicit. I think drm_fourcc.h is > what defines all the possible pixel formats. This is the enum I used > in EGL_EXT_image_dma_buf_import at least. So if we get to the point > where multiple devices can understand a tiled or compressed format, I > assume we could just add that format to drm_fourcc.h and possibly > v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. > > For user-space to negotiate a common pixel format and now stride > alignment, I guess it will obviously need a way to query what pixel > formats a device supports and what its stride alignment requirements > are. > > I don't know v4l2 very well, but it certainly seems the pixel format > can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set > a particular format. I couldn't however find a way to retrieve a list > of supported formats - it seems the mechanism is to try out each > format in turn to determine if it is supported. Is that right? > > There doesn't however seem a way to query what stride constraints a > V4l2 device might have. Does HW abstracted by v4l2 typically have > such constraints? If so, how can we query them such that a buffer > allocated by a DRM driver can be imported into v4l2 and used with > that HW? > > Turning to DRM/KMS, it seems the supported formats of a plane can be > queried using drm_mode_get_plane. However, there doesn't seem to be a > way to query the supported formats of a crtc? If display HW only > supports scanning out from a single buffer (like pl111 does), I think > it won't have any planes and a fb can only be set on the crtc. In > which case, how should user-space query which pixel formats that crtc > supports? > > Assuming user-space can query the supported formats and find a common > one, it will need to allocate a buffer. Looks like > drm_mode_create_dumb can do that, but it only takes a bpp parameter, > there's no format parameter. I assume then that user-space defines > the format and tells the DRM driver which format the buffer is in > when creating the fb with drm_mode_fb_cmd2, which does take a format > parameter? Is that right? > > As with v4l2, DRM doesn't appear to have a way to query the stride > constraints? Assuming there is a way to query the stride constraints, > there also isn't a way to specify them when creating a buffer with > DRM, though perhaps the existing pitch parameter of > drm_mode_create_dumb could be used to allow user-space to pass in a > minimum stride as well as receive the allocated stride? > > >> >> > One problem with this is it duplicates a lot of logic in each >> >> > driver which can export a dma_buf buffer. Each exporter will >> >> > need to do pretty much the same thing: iterate over all the >> >> > attachments, determine of all the constraints (assuming that >> >> > can be done) and allocate pages such that the lowest-common- >> >> > denominator is satisfied. Perhaps rather than duplicating that >> >> > logic in every driver, we could instead move allocation of the >> >> > backing pages into dma_buf itself? >> >> >> >> I tend to think it is better to add helpers as we see common >> >> >> patterns emerge, which drivers can opt-in to using. I don't >> >> think that we should move allocation into dma_buf itself, but >> >> it would perhaps be useful to have dma_alloc_*() variants that >> >> could allocate for multiple devices. >> > >> > A helper could work I guess, though I quite like the idea of >> > having dma_alloc_*() variants which take a list of devices to >> > allocate memory for. >> > >> > >> >> That would help for simple stuff, although I'd suspect >> >> eventually a GPU driver will move away from that. (Since you >> >> probably want to play tricks w/ pools of pages that are >> >> pre-zero'd and in the correct cache state, use spare cycles on >> >> the gpu or dma engine to pre-zero uncached pages, and games >> >> like that.) >> > >> > So presumably you're talking about a GPU driver being the exporter >> > here? If so, how could the GPU driver do these kind of tricks on >> > memory shared with another device? >> >> Yes, that is gpu-as-exporter. If someone else is allocating buffers, >> it is up to them to do these tricks or not. Probably there is a >> pretty good chance that if you aren't a GPU you don't need those sort >> of tricks for fast allocation of transient upload buffers, staging >> textures, temporary pixmaps, etc. Ie. I don't really think a v4l >> camera or video decoder would benefit from that sort of optimization. > > Right - but none of those are really buffers you'd want to export with > dma_buf to share with another device are they? In which case, why not > just have dma_buf figure out the constraints and allocate the memory? > > If a driver needs to allocate memory in a special way for a particular > device, I can't really imagine how it would be able to share that > buffer with another device using dma_buf? I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? > > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. > Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > In the early days, there was a MUX to switch the displays between the two GPUs, but most modern systems are MUX-less and the dGPU is either connected to no displays or in some cases the local panel is attached to the integrated GPU and the external displays are connected to the dGPU. In the MUX-less case, the dGPU can be used to render, and then the result is copied to the integrated GPU for display. Alex ^ permalink raw reply [flat|nested] 131+ messages in thread
* [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 17:56 ` Alex Deucher 0 siblings, 0 replies; 131+ messages in thread From: Alex Deucher @ 2013-08-07 17:56 UTC (permalink / raw) To: linux-arm-kernel On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> > Didn't you say that programmatically describing device placement >> >> > constraints was an unbounded problem? I guess we would have to >> >> > accept that it's not possible to describe all possible constraints >> >> > and instead find a way to describe the common ones? >> >> >> >> well, the point I'm trying to make, is by dividing your constraints >> >> into two groups, one that impacts and is handled by userspace, and >> >> one that is in the kernel (ie. where the pages go), you cut down >> >> the number of permutations that the kernel has to care about >> >> considerably. And kernel already cares about, for example, what >> >> range of addresses that a device can dma to/from. I think really >> >> the only thing missing is the max # of sglist entries (contiguous >> >> or not) >> > >> > I think it's more than physically contiguous or not. >> > >> > For example, it can be more efficient to use large page sizes on >> > devices with IOMMUs to reduce TLB traffic. I think the size and even >> > the availability of large pages varies between different IOMMUs. >> >> sure.. but I suppose if we can spiff out dma_params to express "I need >> contiguous", perhaps we can add some way to express "I prefer >> as-contiguous-as-possible".. either way, this is about where the pages >> are placed, and not about the layout of pixels within the page, so >> should be in kernel. It's something that is missing, but I believe >> that it belongs in dma_params and hidden behind dma_alloc_*() for >> simple drivers. > > Thinking about it, isn't this more a property of the IOMMU? I mean, > are there any cases where an IOMMU had a large page mode but you > wouldn't want to use it? So when allocating the memory, you'd have to > take into account not just the constraints of the devices themselves, > but also of any IOMMUs any of the device sit behind? > > >> > There's also the issue of buffer stride alignment. As I say, if the >> > buffer is to be written by a tile-based GPU like Mali, it's more >> > efficient if the buffer's stride is aligned to the max AXI bus burst >> > length. Though I guess a buffer stride only makes sense as a concept >> > when interpreting the data as a linear-layout 2D image, so perhaps >> > belongs in user-space along with format negotiation? >> > >> >> Yeah.. this isn't about where the pages go, but about the arrangement >> within a page. >> >> And, well, except for hw that supports the same tiling (or >> compressed-fb) in display+gpu, you probably aren't sharing tiled >> buffers. > > You'd only want to share a buffer between devices if those devices can > understand the same pixel format. That pixel format can't be device- > specific or opaque, it has to be explicit. I think drm_fourcc.h is > what defines all the possible pixel formats. This is the enum I used > in EGL_EXT_image_dma_buf_import at least. So if we get to the point > where multiple devices can understand a tiled or compressed format, I > assume we could just add that format to drm_fourcc.h and possibly > v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. > > For user-space to negotiate a common pixel format and now stride > alignment, I guess it will obviously need a way to query what pixel > formats a device supports and what its stride alignment requirements > are. > > I don't know v4l2 very well, but it certainly seems the pixel format > can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set > a particular format. I couldn't however find a way to retrieve a list > of supported formats - it seems the mechanism is to try out each > format in turn to determine if it is supported. Is that right? > > There doesn't however seem a way to query what stride constraints a > V4l2 device might have. Does HW abstracted by v4l2 typically have > such constraints? If so, how can we query them such that a buffer > allocated by a DRM driver can be imported into v4l2 and used with > that HW? > > Turning to DRM/KMS, it seems the supported formats of a plane can be > queried using drm_mode_get_plane. However, there doesn't seem to be a > way to query the supported formats of a crtc? If display HW only > supports scanning out from a single buffer (like pl111 does), I think > it won't have any planes and a fb can only be set on the crtc. In > which case, how should user-space query which pixel formats that crtc > supports? > > Assuming user-space can query the supported formats and find a common > one, it will need to allocate a buffer. Looks like > drm_mode_create_dumb can do that, but it only takes a bpp parameter, > there's no format parameter. I assume then that user-space defines > the format and tells the DRM driver which format the buffer is in > when creating the fb with drm_mode_fb_cmd2, which does take a format > parameter? Is that right? > > As with v4l2, DRM doesn't appear to have a way to query the stride > constraints? Assuming there is a way to query the stride constraints, > there also isn't a way to specify them when creating a buffer with > DRM, though perhaps the existing pitch parameter of > drm_mode_create_dumb could be used to allow user-space to pass in a > minimum stride as well as receive the allocated stride? > > >> >> > One problem with this is it duplicates a lot of logic in each >> >> > driver which can export a dma_buf buffer. Each exporter will >> >> > need to do pretty much the same thing: iterate over all the >> >> > attachments, determine of all the constraints (assuming that >> >> > can be done) and allocate pages such that the lowest-common- >> >> > denominator is satisfied. Perhaps rather than duplicating that >> >> > logic in every driver, we could instead move allocation of the >> >> > backing pages into dma_buf itself? >> >> >> >> I tend to think it is better to add helpers as we see common >> >> >> patterns emerge, which drivers can opt-in to using. I don't >> >> think that we should move allocation into dma_buf itself, but >> >> it would perhaps be useful to have dma_alloc_*() variants that >> >> could allocate for multiple devices. >> > >> > A helper could work I guess, though I quite like the idea of >> > having dma_alloc_*() variants which take a list of devices to >> > allocate memory for. >> > >> > >> >> That would help for simple stuff, although I'd suspect >> >> eventually a GPU driver will move away from that. (Since you >> >> probably want to play tricks w/ pools of pages that are >> >> pre-zero'd and in the correct cache state, use spare cycles on >> >> the gpu or dma engine to pre-zero uncached pages, and games >> >> like that.) >> > >> > So presumably you're talking about a GPU driver being the exporter >> > here? If so, how could the GPU driver do these kind of tricks on >> > memory shared with another device? >> >> Yes, that is gpu-as-exporter. If someone else is allocating buffers, >> it is up to them to do these tricks or not. Probably there is a >> pretty good chance that if you aren't a GPU you don't need those sort >> of tricks for fast allocation of transient upload buffers, staging >> textures, temporary pixmaps, etc. Ie. I don't really think a v4l >> camera or video decoder would benefit from that sort of optimization. > > Right - but none of those are really buffers you'd want to export with > dma_buf to share with another device are they? In which case, why not > just have dma_buf figure out the constraints and allocate the memory? > > If a driver needs to allocate memory in a special way for a particular > device, I can't really imagine how it would be able to share that > buffer with another device using dma_buf? I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? > > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. > Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > In the early days, there was a MUX to switch the displays between the two GPUs, but most modern systems are MUX-less and the dGPU is either connected to no displays or in some cases the local panel is attached to the integrated GPU and the external displays are connected to the dGPU. In the MUX-less case, the dGPU can be used to render, and then the result is copied to the integrated GPU for display. Alex ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [Linaro-mm-sig] [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 17:56 ` Alex Deucher 0 siblings, 0 replies; 131+ messages in thread From: Alex Deucher @ 2013-08-07 17:56 UTC (permalink / raw) To: Tom Cooksey Cc: Rob Clark, linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> > Didn't you say that programmatically describing device placement >> >> > constraints was an unbounded problem? I guess we would have to >> >> > accept that it's not possible to describe all possible constraints >> >> > and instead find a way to describe the common ones? >> >> >> >> well, the point I'm trying to make, is by dividing your constraints >> >> into two groups, one that impacts and is handled by userspace, and >> >> one that is in the kernel (ie. where the pages go), you cut down >> >> the number of permutations that the kernel has to care about >> >> considerably. And kernel already cares about, for example, what >> >> range of addresses that a device can dma to/from. I think really >> >> the only thing missing is the max # of sglist entries (contiguous >> >> or not) >> > >> > I think it's more than physically contiguous or not. >> > >> > For example, it can be more efficient to use large page sizes on >> > devices with IOMMUs to reduce TLB traffic. I think the size and even >> > the availability of large pages varies between different IOMMUs. >> >> sure.. but I suppose if we can spiff out dma_params to express "I need >> contiguous", perhaps we can add some way to express "I prefer >> as-contiguous-as-possible".. either way, this is about where the pages >> are placed, and not about the layout of pixels within the page, so >> should be in kernel. It's something that is missing, but I believe >> that it belongs in dma_params and hidden behind dma_alloc_*() for >> simple drivers. > > Thinking about it, isn't this more a property of the IOMMU? I mean, > are there any cases where an IOMMU had a large page mode but you > wouldn't want to use it? So when allocating the memory, you'd have to > take into account not just the constraints of the devices themselves, > but also of any IOMMUs any of the device sit behind? > > >> > There's also the issue of buffer stride alignment. As I say, if the >> > buffer is to be written by a tile-based GPU like Mali, it's more >> > efficient if the buffer's stride is aligned to the max AXI bus burst >> > length. Though I guess a buffer stride only makes sense as a concept >> > when interpreting the data as a linear-layout 2D image, so perhaps >> > belongs in user-space along with format negotiation? >> > >> >> Yeah.. this isn't about where the pages go, but about the arrangement >> within a page. >> >> And, well, except for hw that supports the same tiling (or >> compressed-fb) in display+gpu, you probably aren't sharing tiled >> buffers. > > You'd only want to share a buffer between devices if those devices can > understand the same pixel format. That pixel format can't be device- > specific or opaque, it has to be explicit. I think drm_fourcc.h is > what defines all the possible pixel formats. This is the enum I used > in EGL_EXT_image_dma_buf_import at least. So if we get to the point > where multiple devices can understand a tiled or compressed format, I > assume we could just add that format to drm_fourcc.h and possibly > v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. > > For user-space to negotiate a common pixel format and now stride > alignment, I guess it will obviously need a way to query what pixel > formats a device supports and what its stride alignment requirements > are. > > I don't know v4l2 very well, but it certainly seems the pixel format > can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set > a particular format. I couldn't however find a way to retrieve a list > of supported formats - it seems the mechanism is to try out each > format in turn to determine if it is supported. Is that right? > > There doesn't however seem a way to query what stride constraints a > V4l2 device might have. Does HW abstracted by v4l2 typically have > such constraints? If so, how can we query them such that a buffer > allocated by a DRM driver can be imported into v4l2 and used with > that HW? > > Turning to DRM/KMS, it seems the supported formats of a plane can be > queried using drm_mode_get_plane. However, there doesn't seem to be a > way to query the supported formats of a crtc? If display HW only > supports scanning out from a single buffer (like pl111 does), I think > it won't have any planes and a fb can only be set on the crtc. In > which case, how should user-space query which pixel formats that crtc > supports? > > Assuming user-space can query the supported formats and find a common > one, it will need to allocate a buffer. Looks like > drm_mode_create_dumb can do that, but it only takes a bpp parameter, > there's no format parameter. I assume then that user-space defines > the format and tells the DRM driver which format the buffer is in > when creating the fb with drm_mode_fb_cmd2, which does take a format > parameter? Is that right? > > As with v4l2, DRM doesn't appear to have a way to query the stride > constraints? Assuming there is a way to query the stride constraints, > there also isn't a way to specify them when creating a buffer with > DRM, though perhaps the existing pitch parameter of > drm_mode_create_dumb could be used to allow user-space to pass in a > minimum stride as well as receive the allocated stride? > > >> >> > One problem with this is it duplicates a lot of logic in each >> >> > driver which can export a dma_buf buffer. Each exporter will >> >> > need to do pretty much the same thing: iterate over all the >> >> > attachments, determine of all the constraints (assuming that >> >> > can be done) and allocate pages such that the lowest-common- >> >> > denominator is satisfied. Perhaps rather than duplicating that >> >> > logic in every driver, we could instead move allocation of the >> >> > backing pages into dma_buf itself? >> >> >> >> I tend to think it is better to add helpers as we see common >> >> >> patterns emerge, which drivers can opt-in to using. I don't >> >> think that we should move allocation into dma_buf itself, but >> >> it would perhaps be useful to have dma_alloc_*() variants that >> >> could allocate for multiple devices. >> > >> > A helper could work I guess, though I quite like the idea of >> > having dma_alloc_*() variants which take a list of devices to >> > allocate memory for. >> > >> > >> >> That would help for simple stuff, although I'd suspect >> >> eventually a GPU driver will move away from that. (Since you >> >> probably want to play tricks w/ pools of pages that are >> >> pre-zero'd and in the correct cache state, use spare cycles on >> >> the gpu or dma engine to pre-zero uncached pages, and games >> >> like that.) >> > >> > So presumably you're talking about a GPU driver being the exporter >> > here? If so, how could the GPU driver do these kind of tricks on >> > memory shared with another device? >> >> Yes, that is gpu-as-exporter. If someone else is allocating buffers, >> it is up to them to do these tricks or not. Probably there is a >> pretty good chance that if you aren't a GPU you don't need those sort >> of tricks for fast allocation of transient upload buffers, staging >> textures, temporary pixmaps, etc. Ie. I don't really think a v4l >> camera or video decoder would benefit from that sort of optimization. > > Right - but none of those are really buffers you'd want to export with > dma_buf to share with another device are they? In which case, why not > just have dma_buf figure out the constraints and allocate the memory? > > If a driver needs to allocate memory in a special way for a particular > device, I can't really imagine how it would be able to share that > buffer with another device using dma_buf? I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? > > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. > Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > In the early days, there was a MUX to switch the displays between the two GPUs, but most modern systems are MUX-less and the dGPU is either connected to no displays or in some cases the local panel is attached to the integrated GPU and the external displays are connected to the dGPU. In the MUX-less case, the dGPU can be used to render, and then the result is copied to the integrated GPU for display. Alex ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <520284d6.07300f0a.72a4.1623SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <520284d6.07300f0a.72a4.1623SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-07 18:12 ` Rob Clark @ 2013-08-07 18:12 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-07 18:12 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> > Didn't you say that programmatically describing device placement >> >> > constraints was an unbounded problem? I guess we would have to >> >> > accept that it's not possible to describe all possible constraints >> >> > and instead find a way to describe the common ones? >> >> >> >> well, the point I'm trying to make, is by dividing your constraints >> >> into two groups, one that impacts and is handled by userspace, and >> >> one that is in the kernel (ie. where the pages go), you cut down >> >> the number of permutations that the kernel has to care about >> >> considerably. And kernel already cares about, for example, what >> >> range of addresses that a device can dma to/from. I think really >> >> the only thing missing is the max # of sglist entries (contiguous >> >> or not) >> > >> > I think it's more than physically contiguous or not. >> > >> > For example, it can be more efficient to use large page sizes on >> > devices with IOMMUs to reduce TLB traffic. I think the size and even >> > the availability of large pages varies between different IOMMUs. >> >> sure.. but I suppose if we can spiff out dma_params to express "I need >> contiguous", perhaps we can add some way to express "I prefer >> as-contiguous-as-possible".. either way, this is about where the pages >> are placed, and not about the layout of pixels within the page, so >> should be in kernel. It's something that is missing, but I believe >> that it belongs in dma_params and hidden behind dma_alloc_*() for >> simple drivers. > > Thinking about it, isn't this more a property of the IOMMU? I mean, > are there any cases where an IOMMU had a large page mode but you > wouldn't want to use it? So when allocating the memory, you'd have to > take into account not just the constraints of the devices themselves, > but also of any IOMMUs any of the device sit behind? > perhaps yes. But the device is associated w/ the iommu it is attached to, so this shouldn't be a problem > >> > There's also the issue of buffer stride alignment. As I say, if the >> > buffer is to be written by a tile-based GPU like Mali, it's more >> > efficient if the buffer's stride is aligned to the max AXI bus burst >> > length. Though I guess a buffer stride only makes sense as a concept >> > when interpreting the data as a linear-layout 2D image, so perhaps >> > belongs in user-space along with format negotiation? >> > >> >> Yeah.. this isn't about where the pages go, but about the arrangement >> within a page. >> >> And, well, except for hw that supports the same tiling (or >> compressed-fb) in display+gpu, you probably aren't sharing tiled >> buffers. > > You'd only want to share a buffer between devices if those devices can > understand the same pixel format. That pixel format can't be device- > specific or opaque, it has to be explicit. I think drm_fourcc.h is > what defines all the possible pixel formats. This is the enum I used > in EGL_EXT_image_dma_buf_import at least. So if we get to the point > where multiple devices can understand a tiled or compressed format, I > assume we could just add that format to drm_fourcc.h and possibly > v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. > > For user-space to negotiate a common pixel format and now stride > alignment, I guess it will obviously need a way to query what pixel > formats a device supports and what its stride alignment requirements > are. > > I don't know v4l2 very well, but it certainly seems the pixel format > can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set > a particular format. I couldn't however find a way to retrieve a list > of supported formats - it seems the mechanism is to try out each > format in turn to determine if it is supported. Is that right? it is exposed for drm plane's. What is missing is to expose the primary-plane associated with the crtc. > There doesn't however seem a way to query what stride constraints a > V4l2 device might have. Does HW abstracted by v4l2 typically have > such constraints? If so, how can we query them such that a buffer > allocated by a DRM driver can be imported into v4l2 and used with > that HW? > > Turning to DRM/KMS, it seems the supported formats of a plane can be > queried using drm_mode_get_plane. However, there doesn't seem to be a > way to query the supported formats of a crtc? If display HW only > supports scanning out from a single buffer (like pl111 does), I think > it won't have any planes and a fb can only be set on the crtc. In > which case, how should user-space query which pixel formats that crtc > supports? > > Assuming user-space can query the supported formats and find a common > one, it will need to allocate a buffer. Looks like > drm_mode_create_dumb can do that, but it only takes a bpp parameter, > there's no format parameter. I assume then that user-space defines > the format and tells the DRM driver which format the buffer is in > when creating the fb with drm_mode_fb_cmd2, which does take a format > parameter? Is that right? Right, the gem object has no inherent format, it is just some bytes. The format/width/height/pitch are all attributes of the fb. > As with v4l2, DRM doesn't appear to have a way to query the stride > constraints? Assuming there is a way to query the stride constraints, > there also isn't a way to specify them when creating a buffer with > DRM, though perhaps the existing pitch parameter of > drm_mode_create_dumb could be used to allow user-space to pass in a > minimum stride as well as receive the allocated stride? > well, you really shouldn't be using create_dumb.. you should have a userspace piece that is specific to the drm driver, and knows how to use that driver's gem allocate ioctl. > >> >> > One problem with this is it duplicates a lot of logic in each >> >> > driver which can export a dma_buf buffer. Each exporter will >> >> > need to do pretty much the same thing: iterate over all the >> >> > attachments, determine of all the constraints (assuming that >> >> > can be done) and allocate pages such that the lowest-common- >> >> > denominator is satisfied. Perhaps rather than duplicating that >> >> > logic in every driver, we could instead move allocation of the >> >> > backing pages into dma_buf itself? >> >> >> >> I tend to think it is better to add helpers as we see common >> >> >> patterns emerge, which drivers can opt-in to using. I don't >> >> think that we should move allocation into dma_buf itself, but >> >> it would perhaps be useful to have dma_alloc_*() variants that >> >> could allocate for multiple devices. >> > >> > A helper could work I guess, though I quite like the idea of >> > having dma_alloc_*() variants which take a list of devices to >> > allocate memory for. >> > >> > >> >> That would help for simple stuff, although I'd suspect >> >> eventually a GPU driver will move away from that. (Since you >> >> probably want to play tricks w/ pools of pages that are >> >> pre-zero'd and in the correct cache state, use spare cycles on >> >> the gpu or dma engine to pre-zero uncached pages, and games >> >> like that.) >> > >> > So presumably you're talking about a GPU driver being the exporter >> > here? If so, how could the GPU driver do these kind of tricks on >> > memory shared with another device? >> >> Yes, that is gpu-as-exporter. If someone else is allocating buffers, >> it is up to them to do these tricks or not. Probably there is a >> pretty good chance that if you aren't a GPU you don't need those sort >> of tricks for fast allocation of transient upload buffers, staging >> textures, temporary pixmaps, etc. Ie. I don't really think a v4l >> camera or video decoder would benefit from that sort of optimization. > > Right - but none of those are really buffers you'd want to export with > dma_buf to share with another device are they? In which case, why not > just have dma_buf figure out the constraints and allocate the memory? maybe not.. but (a) you don't necessarily know at creation time if it is going to be exported (maybe you know if it is definitely not going to be exported, but the converse is not true), and (b) there isn't really any reason to special case the allocation in the driver because it is going to be exported. helpers that can be used by simple drivers, yes. Forcing the way the buffer is allocated, for sure not. Currently, for example, there is no issue to export a buffer allocated from stolen-mem. If we put the page allocation in dma-buf, this would not be possible. That is just one quick example off the top of my head, I'm sure there are plenty more. But we definitely do not want the allocate in dma_buf itself. > If a driver needs to allocate memory in a special way for a particular > device, I can't really imagine how it would be able to share that > buffer with another device using dma_buf? I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > if, what it has to configure actually manages to fit in the dma-mapping framework anyways, where the pages come from has nothing to do with whether a buffer can be shared or not > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? afaiu it can be a, b, or c (ie. either gpu can have the display or there can be a mux).. > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. > Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > > > >> At any rate, for both xorg and wayland/gbm, you know when a buffer is >> going to be a scanout buffer. What I'd recommend is define a small >> userspace API that your customers (the SoC vendors) implement to >> allocate a scanout buffer and hand you back a dmabuf fd. That could >> be used both for x11 and for gbm. Inputs should be requested >> width/height and format. And outputs pitch plus dmabuf fd. >> >> (Actually you might even just want to use gbm as your starting point. >> You could probably just use gbm from xf86-video-armsoc for allocation, >> to have one thing that works for both wayland and x11. Scanout and >> cursor buffers should go to vendor/SoC specific fxn, rest can be >> allocated from mali kernel driver.) > > What does that buy us over just using drm_mode_create_dumb on the > display's DRM driver? > well, for example, if there was actually some hw w/ omap's dss + mali, you could actually have mali render transparently to tiled buffers which could be scanned out rotated. Which would not be possible w/ dumb buffers. > > >> >> > wouldn't need a way to programmatically describe the constraints >> >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> >> > just "know" that on this SoC, that buffer needs to be physically >> >> > contiguous for example. >> >> >> >> not really.. it just knows it wants to scanout the buffer, and tells >> >> this as a hint to the kernel. >> >> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> >> (where phys contig is not required for scanout), but causes CMA >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. >> >> It just knows that it wants to be able to scanout that particular >> >> buffer. >> > >> > I think that's the idea? The omap3's allocator driver would use >> > contiguous memory when it detects the SCANOUT flag whereas the omap4 >> > allocator driver wouldn't have to. No complex negotiation of >> > constraints - it just "knows". >> > >> >> well, it is same allocating driver in both cases (although maybe that >> is unimportant). The "it" that just knows it wants to scanout is >> userspace. The "it" that just knows that scanout translates to >> contiguous (or not) is the kernel. Perhaps we are saying the same >> thing ;-) > > Yeah - I think we are... so what's the issue with having a per-SoC > allocation driver again? > In a way the display driver is a per-SoC allocator. But not necessarily the *central* allocator for everything. Ie. no need for display driver to allocate vertex buffers for a separate gpu driver, and that sort of thing. BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 18:12 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-07 18:12 UTC (permalink / raw) To: linux-arm-kernel On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> > Didn't you say that programmatically describing device placement >> >> > constraints was an unbounded problem? I guess we would have to >> >> > accept that it's not possible to describe all possible constraints >> >> > and instead find a way to describe the common ones? >> >> >> >> well, the point I'm trying to make, is by dividing your constraints >> >> into two groups, one that impacts and is handled by userspace, and >> >> one that is in the kernel (ie. where the pages go), you cut down >> >> the number of permutations that the kernel has to care about >> >> considerably. And kernel already cares about, for example, what >> >> range of addresses that a device can dma to/from. I think really >> >> the only thing missing is the max # of sglist entries (contiguous >> >> or not) >> > >> > I think it's more than physically contiguous or not. >> > >> > For example, it can be more efficient to use large page sizes on >> > devices with IOMMUs to reduce TLB traffic. I think the size and even >> > the availability of large pages varies between different IOMMUs. >> >> sure.. but I suppose if we can spiff out dma_params to express "I need >> contiguous", perhaps we can add some way to express "I prefer >> as-contiguous-as-possible".. either way, this is about where the pages >> are placed, and not about the layout of pixels within the page, so >> should be in kernel. It's something that is missing, but I believe >> that it belongs in dma_params and hidden behind dma_alloc_*() for >> simple drivers. > > Thinking about it, isn't this more a property of the IOMMU? I mean, > are there any cases where an IOMMU had a large page mode but you > wouldn't want to use it? So when allocating the memory, you'd have to > take into account not just the constraints of the devices themselves, > but also of any IOMMUs any of the device sit behind? > perhaps yes. But the device is associated w/ the iommu it is attached to, so this shouldn't be a problem > >> > There's also the issue of buffer stride alignment. As I say, if the >> > buffer is to be written by a tile-based GPU like Mali, it's more >> > efficient if the buffer's stride is aligned to the max AXI bus burst >> > length. Though I guess a buffer stride only makes sense as a concept >> > when interpreting the data as a linear-layout 2D image, so perhaps >> > belongs in user-space along with format negotiation? >> > >> >> Yeah.. this isn't about where the pages go, but about the arrangement >> within a page. >> >> And, well, except for hw that supports the same tiling (or >> compressed-fb) in display+gpu, you probably aren't sharing tiled >> buffers. > > You'd only want to share a buffer between devices if those devices can > understand the same pixel format. That pixel format can't be device- > specific or opaque, it has to be explicit. I think drm_fourcc.h is > what defines all the possible pixel formats. This is the enum I used > in EGL_EXT_image_dma_buf_import at least. So if we get to the point > where multiple devices can understand a tiled or compressed format, I > assume we could just add that format to drm_fourcc.h and possibly > v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. > > For user-space to negotiate a common pixel format and now stride > alignment, I guess it will obviously need a way to query what pixel > formats a device supports and what its stride alignment requirements > are. > > I don't know v4l2 very well, but it certainly seems the pixel format > can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set > a particular format. I couldn't however find a way to retrieve a list > of supported formats - it seems the mechanism is to try out each > format in turn to determine if it is supported. Is that right? it is exposed for drm plane's. What is missing is to expose the primary-plane associated with the crtc. > There doesn't however seem a way to query what stride constraints a > V4l2 device might have. Does HW abstracted by v4l2 typically have > such constraints? If so, how can we query them such that a buffer > allocated by a DRM driver can be imported into v4l2 and used with > that HW? > > Turning to DRM/KMS, it seems the supported formats of a plane can be > queried using drm_mode_get_plane. However, there doesn't seem to be a > way to query the supported formats of a crtc? If display HW only > supports scanning out from a single buffer (like pl111 does), I think > it won't have any planes and a fb can only be set on the crtc. In > which case, how should user-space query which pixel formats that crtc > supports? > > Assuming user-space can query the supported formats and find a common > one, it will need to allocate a buffer. Looks like > drm_mode_create_dumb can do that, but it only takes a bpp parameter, > there's no format parameter. I assume then that user-space defines > the format and tells the DRM driver which format the buffer is in > when creating the fb with drm_mode_fb_cmd2, which does take a format > parameter? Is that right? Right, the gem object has no inherent format, it is just some bytes. The format/width/height/pitch are all attributes of the fb. > As with v4l2, DRM doesn't appear to have a way to query the stride > constraints? Assuming there is a way to query the stride constraints, > there also isn't a way to specify them when creating a buffer with > DRM, though perhaps the existing pitch parameter of > drm_mode_create_dumb could be used to allow user-space to pass in a > minimum stride as well as receive the allocated stride? > well, you really shouldn't be using create_dumb.. you should have a userspace piece that is specific to the drm driver, and knows how to use that driver's gem allocate ioctl. > >> >> > One problem with this is it duplicates a lot of logic in each >> >> > driver which can export a dma_buf buffer. Each exporter will >> >> > need to do pretty much the same thing: iterate over all the >> >> > attachments, determine of all the constraints (assuming that >> >> > can be done) and allocate pages such that the lowest-common- >> >> > denominator is satisfied. Perhaps rather than duplicating that >> >> > logic in every driver, we could instead move allocation of the >> >> > backing pages into dma_buf itself? >> >> >> >> I tend to think it is better to add helpers as we see common >> >> >> patterns emerge, which drivers can opt-in to using. I don't >> >> think that we should move allocation into dma_buf itself, but >> >> it would perhaps be useful to have dma_alloc_*() variants that >> >> could allocate for multiple devices. >> > >> > A helper could work I guess, though I quite like the idea of >> > having dma_alloc_*() variants which take a list of devices to >> > allocate memory for. >> > >> > >> >> That would help for simple stuff, although I'd suspect >> >> eventually a GPU driver will move away from that. (Since you >> >> probably want to play tricks w/ pools of pages that are >> >> pre-zero'd and in the correct cache state, use spare cycles on >> >> the gpu or dma engine to pre-zero uncached pages, and games >> >> like that.) >> > >> > So presumably you're talking about a GPU driver being the exporter >> > here? If so, how could the GPU driver do these kind of tricks on >> > memory shared with another device? >> >> Yes, that is gpu-as-exporter. If someone else is allocating buffers, >> it is up to them to do these tricks or not. Probably there is a >> pretty good chance that if you aren't a GPU you don't need those sort >> of tricks for fast allocation of transient upload buffers, staging >> textures, temporary pixmaps, etc. Ie. I don't really think a v4l >> camera or video decoder would benefit from that sort of optimization. > > Right - but none of those are really buffers you'd want to export with > dma_buf to share with another device are they? In which case, why not > just have dma_buf figure out the constraints and allocate the memory? maybe not.. but (a) you don't necessarily know at creation time if it is going to be exported (maybe you know if it is definitely not going to be exported, but the converse is not true), and (b) there isn't really any reason to special case the allocation in the driver because it is going to be exported. helpers that can be used by simple drivers, yes. Forcing the way the buffer is allocated, for sure not. Currently, for example, there is no issue to export a buffer allocated from stolen-mem. If we put the page allocation in dma-buf, this would not be possible. That is just one quick example off the top of my head, I'm sure there are plenty more. But we definitely do not want the allocate in dma_buf itself. > If a driver needs to allocate memory in a special way for a particular > device, I can't really imagine how it would be able to share that > buffer with another device using dma_buf? I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > if, what it has to configure actually manages to fit in the dma-mapping framework anyways, where the pages come from has nothing to do with whether a buffer can be shared or not > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? afaiu it can be a, b, or c (ie. either gpu can have the display or there can be a mux).. > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. > Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > > > >> At any rate, for both xorg and wayland/gbm, you know when a buffer is >> going to be a scanout buffer. What I'd recommend is define a small >> userspace API that your customers (the SoC vendors) implement to >> allocate a scanout buffer and hand you back a dmabuf fd. That could >> be used both for x11 and for gbm. Inputs should be requested >> width/height and format. And outputs pitch plus dmabuf fd. >> >> (Actually you might even just want to use gbm as your starting point. >> You could probably just use gbm from xf86-video-armsoc for allocation, >> to have one thing that works for both wayland and x11. Scanout and >> cursor buffers should go to vendor/SoC specific fxn, rest can be >> allocated from mali kernel driver.) > > What does that buy us over just using drm_mode_create_dumb on the > display's DRM driver? > well, for example, if there was actually some hw w/ omap's dss + mali, you could actually have mali render transparently to tiled buffers which could be scanned out rotated. Which would not be possible w/ dumb buffers. > > >> >> > wouldn't need a way to programmatically describe the constraints >> >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> >> > just "know" that on this SoC, that buffer needs to be physically >> >> > contiguous for example. >> >> >> >> not really.. it just knows it wants to scanout the buffer, and tells >> >> this as a hint to the kernel. >> >> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> >> (where phys contig is not required for scanout), but causes CMA >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. >> >> It just knows that it wants to be able to scanout that particular >> >> buffer. >> > >> > I think that's the idea? The omap3's allocator driver would use >> > contiguous memory when it detects the SCANOUT flag whereas the omap4 >> > allocator driver wouldn't have to. No complex negotiation of >> > constraints - it just "knows". >> > >> >> well, it is same allocating driver in both cases (although maybe that >> is unimportant). The "it" that just knows it wants to scanout is >> userspace. The "it" that just knows that scanout translates to >> contiguous (or not) is the kernel. Perhaps we are saying the same >> thing ;-) > > Yeah - I think we are... so what's the issue with having a per-SoC > allocation driver again? > In a way the display driver is a per-SoC allocator. But not necessarily the *central* allocator for everything. Ie. no need for display driver to allocate vertex buffers for a separate gpu driver, and that sort of thing. BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 18:12 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-07 18:12 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Wed, Aug 7, 2013 at 1:33 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> >> > Didn't you say that programmatically describing device placement >> >> > constraints was an unbounded problem? I guess we would have to >> >> > accept that it's not possible to describe all possible constraints >> >> > and instead find a way to describe the common ones? >> >> >> >> well, the point I'm trying to make, is by dividing your constraints >> >> into two groups, one that impacts and is handled by userspace, and >> >> one that is in the kernel (ie. where the pages go), you cut down >> >> the number of permutations that the kernel has to care about >> >> considerably. And kernel already cares about, for example, what >> >> range of addresses that a device can dma to/from. I think really >> >> the only thing missing is the max # of sglist entries (contiguous >> >> or not) >> > >> > I think it's more than physically contiguous or not. >> > >> > For example, it can be more efficient to use large page sizes on >> > devices with IOMMUs to reduce TLB traffic. I think the size and even >> > the availability of large pages varies between different IOMMUs. >> >> sure.. but I suppose if we can spiff out dma_params to express "I need >> contiguous", perhaps we can add some way to express "I prefer >> as-contiguous-as-possible".. either way, this is about where the pages >> are placed, and not about the layout of pixels within the page, so >> should be in kernel. It's something that is missing, but I believe >> that it belongs in dma_params and hidden behind dma_alloc_*() for >> simple drivers. > > Thinking about it, isn't this more a property of the IOMMU? I mean, > are there any cases where an IOMMU had a large page mode but you > wouldn't want to use it? So when allocating the memory, you'd have to > take into account not just the constraints of the devices themselves, > but also of any IOMMUs any of the device sit behind? > perhaps yes. But the device is associated w/ the iommu it is attached to, so this shouldn't be a problem > >> > There's also the issue of buffer stride alignment. As I say, if the >> > buffer is to be written by a tile-based GPU like Mali, it's more >> > efficient if the buffer's stride is aligned to the max AXI bus burst >> > length. Though I guess a buffer stride only makes sense as a concept >> > when interpreting the data as a linear-layout 2D image, so perhaps >> > belongs in user-space along with format negotiation? >> > >> >> Yeah.. this isn't about where the pages go, but about the arrangement >> within a page. >> >> And, well, except for hw that supports the same tiling (or >> compressed-fb) in display+gpu, you probably aren't sharing tiled >> buffers. > > You'd only want to share a buffer between devices if those devices can > understand the same pixel format. That pixel format can't be device- > specific or opaque, it has to be explicit. I think drm_fourcc.h is > what defines all the possible pixel formats. This is the enum I used > in EGL_EXT_image_dma_buf_import at least. So if we get to the point > where multiple devices can understand a tiled or compressed format, I > assume we could just add that format to drm_fourcc.h and possibly > v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h. > > For user-space to negotiate a common pixel format and now stride > alignment, I guess it will obviously need a way to query what pixel > formats a device supports and what its stride alignment requirements > are. > > I don't know v4l2 very well, but it certainly seems the pixel format > can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set > a particular format. I couldn't however find a way to retrieve a list > of supported formats - it seems the mechanism is to try out each > format in turn to determine if it is supported. Is that right? it is exposed for drm plane's. What is missing is to expose the primary-plane associated with the crtc. > There doesn't however seem a way to query what stride constraints a > V4l2 device might have. Does HW abstracted by v4l2 typically have > such constraints? If so, how can we query them such that a buffer > allocated by a DRM driver can be imported into v4l2 and used with > that HW? > > Turning to DRM/KMS, it seems the supported formats of a plane can be > queried using drm_mode_get_plane. However, there doesn't seem to be a > way to query the supported formats of a crtc? If display HW only > supports scanning out from a single buffer (like pl111 does), I think > it won't have any planes and a fb can only be set on the crtc. In > which case, how should user-space query which pixel formats that crtc > supports? > > Assuming user-space can query the supported formats and find a common > one, it will need to allocate a buffer. Looks like > drm_mode_create_dumb can do that, but it only takes a bpp parameter, > there's no format parameter. I assume then that user-space defines > the format and tells the DRM driver which format the buffer is in > when creating the fb with drm_mode_fb_cmd2, which does take a format > parameter? Is that right? Right, the gem object has no inherent format, it is just some bytes. The format/width/height/pitch are all attributes of the fb. > As with v4l2, DRM doesn't appear to have a way to query the stride > constraints? Assuming there is a way to query the stride constraints, > there also isn't a way to specify them when creating a buffer with > DRM, though perhaps the existing pitch parameter of > drm_mode_create_dumb could be used to allow user-space to pass in a > minimum stride as well as receive the allocated stride? > well, you really shouldn't be using create_dumb.. you should have a userspace piece that is specific to the drm driver, and knows how to use that driver's gem allocate ioctl. > >> >> > One problem with this is it duplicates a lot of logic in each >> >> > driver which can export a dma_buf buffer. Each exporter will >> >> > need to do pretty much the same thing: iterate over all the >> >> > attachments, determine of all the constraints (assuming that >> >> > can be done) and allocate pages such that the lowest-common- >> >> > denominator is satisfied. Perhaps rather than duplicating that >> >> > logic in every driver, we could instead move allocation of the >> >> > backing pages into dma_buf itself? >> >> >> >> I tend to think it is better to add helpers as we see common >> >> >> patterns emerge, which drivers can opt-in to using. I don't >> >> think that we should move allocation into dma_buf itself, but >> >> it would perhaps be useful to have dma_alloc_*() variants that >> >> could allocate for multiple devices. >> > >> > A helper could work I guess, though I quite like the idea of >> > having dma_alloc_*() variants which take a list of devices to >> > allocate memory for. >> > >> > >> >> That would help for simple stuff, although I'd suspect >> >> eventually a GPU driver will move away from that. (Since you >> >> probably want to play tricks w/ pools of pages that are >> >> pre-zero'd and in the correct cache state, use spare cycles on >> >> the gpu or dma engine to pre-zero uncached pages, and games >> >> like that.) >> > >> > So presumably you're talking about a GPU driver being the exporter >> > here? If so, how could the GPU driver do these kind of tricks on >> > memory shared with another device? >> >> Yes, that is gpu-as-exporter. If someone else is allocating buffers, >> it is up to them to do these tricks or not. Probably there is a >> pretty good chance that if you aren't a GPU you don't need those sort >> of tricks for fast allocation of transient upload buffers, staging >> textures, temporary pixmaps, etc. Ie. I don't really think a v4l >> camera or video decoder would benefit from that sort of optimization. > > Right - but none of those are really buffers you'd want to export with > dma_buf to share with another device are they? In which case, why not > just have dma_buf figure out the constraints and allocate the memory? maybe not.. but (a) you don't necessarily know at creation time if it is going to be exported (maybe you know if it is definitely not going to be exported, but the converse is not true), and (b) there isn't really any reason to special case the allocation in the driver because it is going to be exported. helpers that can be used by simple drivers, yes. Forcing the way the buffer is allocated, for sure not. Currently, for example, there is no issue to export a buffer allocated from stolen-mem. If we put the page allocation in dma-buf, this would not be possible. That is just one quick example off the top of my head, I'm sure there are plenty more. But we definitely do not want the allocate in dma_buf itself. > If a driver needs to allocate memory in a special way for a particular > device, I can't really imagine how it would be able to share that > buffer with another device using dma_buf? I guess a driver is likely > to need some magic voodoo to configure access to the buffer for its > device, but surely that would be done by the dma_mapping framework > when dma_buf_map happens? > if, what it has to configure actually manages to fit in the dma-mapping framework anyways, where the pages come from has nothing to do with whether a buffer can be shared or not > > >> >> You probably want to get out of the SoC mindset, otherwise you are >> >> going to make bad assumptions that come back to bite you later on. >> > >> > Sure - there are always going to be PC-like devices where the >> > hardware configuration isn't fixed like it is on a traditional SoC. >> > But I'd rather have a simple solution which works on traditional SoCs >> > than no solution at all. Today our solution is to over-load the dumb >> > buffer alloc functions of the display's DRM driver - For now I'm just >> > looking for the next step up from that! ;-) >> >> True.. the original intention, which is perhaps a bit desktop-centric, >> really was for there to be a userspace component talking to the drm >> driver for allocation, ie. xf86-video-foo and/or >> src/gallium/drivers/foo (for example) ;-) >> >> Which means for x11 having a SoC vendor specific xf86-video-foo for >> x11.. or vendor specific gbm implementation for wayland. (Although >> at least in the latter case it is a pretty small piece of code.) But >> that is probably what you are trying to avoid. > > I've been trying to get my head around how PRIME relates to DDX > drivers. As I understand it (which is likely wrong), you have a laptop > with both an Intel & an nVidia GPU. You have both the i915 & nouveau > kernel drivers loaded. What I'm not sure about is which GPU's display > controller is actually hooked up to the physical connector? Perhaps > there is a MUX like there is on Versatile Express? afaiu it can be a, b, or c (ie. either gpu can have the display or there can be a mux).. > What I also don't understand is what DDX driver is loaded? Is it > xf86-video-intel, xf86-video-nouveau or both? I get the impression > that there's a "master" DDX which implements 2D operations but can > import buffers using PRIME from the other driver and draw to them. > Or is it more that it's able to export rendered buffers to the > "slave" DRM for scanout? Either way, it's pretty similar to an ARM > SoC setup which has the GPU and the display as two totally > independent devices. > > > >> At any rate, for both xorg and wayland/gbm, you know when a buffer is >> going to be a scanout buffer. What I'd recommend is define a small >> userspace API that your customers (the SoC vendors) implement to >> allocate a scanout buffer and hand you back a dmabuf fd. That could >> be used both for x11 and for gbm. Inputs should be requested >> width/height and format. And outputs pitch plus dmabuf fd. >> >> (Actually you might even just want to use gbm as your starting point. >> You could probably just use gbm from xf86-video-armsoc for allocation, >> to have one thing that works for both wayland and x11. Scanout and >> cursor buffers should go to vendor/SoC specific fxn, rest can be >> allocated from mali kernel driver.) > > What does that buy us over just using drm_mode_create_dumb on the > display's DRM driver? > well, for example, if there was actually some hw w/ omap's dss + mali, you could actually have mali render transparently to tiled buffers which could be scanned out rotated. Which would not be possible w/ dumb buffers. > > >> >> > wouldn't need a way to programmatically describe the constraints >> >> > either: As you say, if userspace sets the "SCANOUT" flag, it would >> >> > just "know" that on this SoC, that buffer needs to be physically >> >> > contiguous for example. >> >> >> >> not really.. it just knows it wants to scanout the buffer, and tells >> >> this as a hint to the kernel. >> >> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> >> (where phys contig is not required for scanout), but causes CMA >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. >> >> It just knows that it wants to be able to scanout that particular >> >> buffer. >> > >> > I think that's the idea? The omap3's allocator driver would use >> > contiguous memory when it detects the SCANOUT flag whereas the omap4 >> > allocator driver wouldn't have to. No complex negotiation of >> > constraints - it just "knows". >> > >> >> well, it is same allocating driver in both cases (although maybe that >> is unimportant). The "it" that just knows it wants to scanout is >> userspace. The "it" that just knows that scanout translates to >> contiguous (or not) is the kernel. Perhaps we are saying the same >> thing ;-) > > Yeah - I think we are... so what's the issue with having a per-SoC > allocation driver again? > In a way the display driver is a per-SoC allocator. But not necessarily the *central* allocator for everything. Ie. no need for display driver to allocate vertex buffers for a separate gpu driver, and that sort of thing. BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 18:12 ` Rob Clark (?) (?) @ 2013-08-09 16:15 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 16:15 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > > Turning to DRM/KMS, it seems the supported formats of a plane can be > > queried using drm_mode_get_plane. However, there doesn't seem to be a > > way to query the supported formats of a crtc? If display HW only > > supports scanning out from a single buffer (like pl111 does), I think > > it won't have any planes and a fb can only be set on the crtc. In > > which case, how should user-space query which pixel formats that crtc > > supports? > > it is exposed for drm plane's. What is missing is to expose the > primary-plane associated with the crtc. Cool - so a patch which adds a way to query the what formats a crtc supports would be welcome? What about a way to query the stride alignment constraints? Presumably using the drm_mode_get_property mechanism would be the right way to implement that? > > As with v4l2, DRM doesn't appear to have a way to query the stride > > constraints? Assuming there is a way to query the stride constraints, > > there also isn't a way to specify them when creating a buffer with > > DRM, though perhaps the existing pitch parameter of > > drm_mode_create_dumb could be used to allow user-space to pass in a > > minimum stride as well as receive the allocated stride? > > > > well, you really shouldn't be using create_dumb.. you should have a > userspace piece that is specific to the drm driver, and knows how to > use that driver's gem allocate ioctl. Sorry, why does this need a driver-specific allocation function? It's just a display controller driver and I just want to allocate a scan- out buffer - all I'm asking is for the display controller driver to use a minimum stride alignment so I can export the buffer and use another device to fill it with data. The whole point is to be able to allocate the buffer in such a way that another device can access it. So the driver _can't_ use a special, device specific format, nor can it allocate it from a private memory pool because doing so would preclude it from being shared with another device. That other device doesn't need to be a GPU wither, it could just as easily be a camera/ISP or video decoder. > >> > So presumably you're talking about a GPU driver being the exporter > >> > here? If so, how could the GPU driver do these kind of tricks on > >> > memory shared with another device? > >> > >> Yes, that is gpu-as-exporter. If someone else is allocating > >> buffers, it is up to them to do these tricks or not. Probably > >> there is a pretty good chance that if you aren't a GPU you don't > >> need those sort of tricks for fast allocation of transient upload > >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't > >> really think a v4l camera or video decoder would benefit from that > >> sort of optimization. > > > > Right - but none of those are really buffers you'd want to export > > > with dma_buf to share with another device are they? In which case, > > why not just have dma_buf figure out the constraints and allocate > > the memory? > > maybe not.. but (a) you don't necessarily know at creation time if it > is going to be exported (maybe you know if it is definitely not going > to be exported, but the converse is not true), I can't actually think of an example where you would not know if a buffer was going to be exported or not at allocation time? Do you have a case in mind? Regardless, you'd certainly have to know if a buffer will be exported pretty quickly, before it's used so that you can import it into whatever devices are going to access it. Otherwise if it gets allocated before you export it, the allocation won't satisfy the constraints of the other devices which will need to access it and importing will fail. Assuming of course deferred allocation of the backing pages as discussed earlier in the thread. > and (b) there isn't > really any reason to special case the allocation in the driver because > it is going to be exported. Not sure I follow you here? Surely you absolutely have to special-case the allocation if the buffer is to be exported because you have to take the other devices' constraints into account when you allocate? Or do you mean you don't need to special-case the GEM buffer object creation, only the allocation of the backing pages? Though I'm not sure how that distinction is useful - at the end of the day, you need to special-case allocation of the backing pages. > helpers that can be used by simple drivers, yes. Forcing the way the > buffer is allocated, for sure not. Currently, for example, there is > no issue to export a buffer allocated from stolen-mem. Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk of memory reserved at boot for the GPU which the OS can't touch? I guess I view such memory as accessible to all media devices on the system and as such, needs to be managed by a central allocator which dma_buf can use to allocate from. I guess if that stolen-mem is managed by a single device then in essence that device becomes the central allocator you have to use to be able to allocate from that stolen mem? > > If a driver needs to allocate memory in a special way for a > > particular device, I can't really imagine how it would be able > > to share that buffer with another device using dma_buf? I guess > > a driver is likely to need some magic voodoo to configure access > > to the buffer for its device, but surely that would be done by > > the dma_mapping framework when dma_buf_map happens? > > > > if, what it has to configure actually manages to fit in the > dma-mapping framework But if it doesn't, surely that's an issue which needs to be addressed in the dma_mapping framework or else you won't be able to import buffers for use by that device anyway? > anyways, where the pages come from has nothing to do with whether a > buffer can be shared or not Sure, but where they are located in physical memory really does matter. > >> At any rate, for both xorg and wayland/gbm, you know when a buffer > >> is going to be a scanout buffer. What I'd recommend is define a > >> small userspace API that your customers (the SoC vendors) implement > >> to allocate a scanout buffer and hand you back a dmabuf fd. That > >> could be used both for x11 and for gbm. Inputs should be requested > >> width/height and format. And outputs pitch plus dmabuf fd. > >> > >> (Actually you might even just want to use gbm as your starting > >> point. You could probably just use gbm from xf86-video-armsoc for > >> allocation, to have one thing that works for both wayland and x11. > >> Scanout and cursor buffers should go to vendor/SoC specific fxn, > >> rest can be allocated from mali kernel driver.) > > > > What does that buy us over just using drm_mode_create_dumb on the > > display's DRM driver? > > well, for example, if there was actually some hw w/ omap's dss + mali, > you could actually have mali render transparently to tiled buffers > which could be scanned out rotated. Which would not be possible w/ > dumb buffers. Why not? As you said earlier, the format is defined when you setup the fb with drm_mode_fb_cmd2. If you wanted to share the buffer between devices, you have to be explicit about what format that buffer is in, so you'd have to add an entry to drm_fourcc.h for the tiled format. So userspace queries what formats the GPU DRM supports and what formats the OMAP DSS DRM supports, selects the tiled format and then uses drm_mode_create_dumb to allocate a buffer of the correct size and sets the appropriate drm_fourcc.h enum value when creating an fb for that buffer. Or have I missed something? > >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > >> >> (where phys contig is not required for scanout), but causes CMA > >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. > >> >> It just knows that it wants to be able to scanout that particular > >> >> buffer. > >> > > >> > I think that's the idea? The omap3's allocator driver would use > >> > contiguous memory when it detects the SCANOUT flag whereas the > >> > omap4 allocator driver wouldn't have to. No complex negotiation > >> > of constraints - it just "knows". > >> > > >> > >> well, it is same allocating driver in both cases (although maybe > >> that is unimportant). The "it" that just knows it wants to scanout > >> is userspace. The "it" that just knows that scanout translates to > >> contiguous (or not) is the kernel. Perhaps we are saying the same > >> thing ;-) > > > > Yeah - I think we are... so what's the issue with having a per-SoC > > allocation driver again? > > > > In a way the display driver is a per-SoC allocator. But not > necessarily the *central* allocator for everything. Ie. no need for > display driver to allocate vertex buffers for a separate gpu driver, > and that sort of thing. Again, I'm only talking about allocating buffers which will be shared between different devices. At no point have I mentioned the allocation of buffers which aren't to be shared between devices. Sorry if that's not been clear. So for buffers which are to be shared between devices, your suggesting that the display driver is the per-SoC allocator? But as I say, and how this thread got started, the same display driver can be used on different SoCs, so having _it_ be the central allocator isn't ideal. Though this is our current solution and why we're "abusing" the dumb buffer allocation functions. :-) Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 18:12 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-09 16:15 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 16:15 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > > Turning to DRM/KMS, it seems the supported formats of a plane can be > > queried using drm_mode_get_plane. However, there doesn't seem to be a > > way to query the supported formats of a crtc? If display HW only > > supports scanning out from a single buffer (like pl111 does), I think > > it won't have any planes and a fb can only be set on the crtc. In > > which case, how should user-space query which pixel formats that crtc > > supports? > > it is exposed for drm plane's. What is missing is to expose the > primary-plane associated with the crtc. Cool - so a patch which adds a way to query the what formats a crtc supports would be welcome? What about a way to query the stride alignment constraints? Presumably using the drm_mode_get_property mechanism would be the right way to implement that? > > As with v4l2, DRM doesn't appear to have a way to query the stride > > constraints? Assuming there is a way to query the stride constraints, > > there also isn't a way to specify them when creating a buffer with > > DRM, though perhaps the existing pitch parameter of > > drm_mode_create_dumb could be used to allow user-space to pass in a > > minimum stride as well as receive the allocated stride? > > > > well, you really shouldn't be using create_dumb.. you should have a > userspace piece that is specific to the drm driver, and knows how to > use that driver's gem allocate ioctl. Sorry, why does this need a driver-specific allocation function? It's just a display controller driver and I just want to allocate a scan- out buffer - all I'm asking is for the display controller driver to use a minimum stride alignment so I can export the buffer and use another device to fill it with data. The whole point is to be able to allocate the buffer in such a way that another device can access it. So the driver _can't_ use a special, device specific format, nor can it allocate it from a private memory pool because doing so would preclude it from being shared with another device. That other device doesn't need to be a GPU wither, it could just as easily be a camera/ISP or video decoder. > >> > So presumably you're talking about a GPU driver being the exporter > >> > here? If so, how could the GPU driver do these kind of tricks on > >> > memory shared with another device? > >> > >> Yes, that is gpu-as-exporter. If someone else is allocating > >> buffers, it is up to them to do these tricks or not. Probably > >> there is a pretty good chance that if you aren't a GPU you don't > >> need those sort of tricks for fast allocation of transient upload > >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't > >> really think a v4l camera or video decoder would benefit from that > >> sort of optimization. > > > > Right - but none of those are really buffers you'd want to export > > > with dma_buf to share with another device are they? In which case, > > why not just have dma_buf figure out the constraints and allocate > > the memory? > > maybe not.. but (a) you don't necessarily know at creation time if it > is going to be exported (maybe you know if it is definitely not going > to be exported, but the converse is not true), I can't actually think of an example where you would not know if a buffer was going to be exported or not at allocation time? Do you have a case in mind? Regardless, you'd certainly have to know if a buffer will be exported pretty quickly, before it's used so that you can import it into whatever devices are going to access it. Otherwise if it gets allocated before you export it, the allocation won't satisfy the constraints of the other devices which will need to access it and importing will fail. Assuming of course deferred allocation of the backing pages as discussed earlier in the thread. > and (b) there isn't > really any reason to special case the allocation in the driver because > it is going to be exported. Not sure I follow you here? Surely you absolutely have to special-case the allocation if the buffer is to be exported because you have to take the other devices' constraints into account when you allocate? Or do you mean you don't need to special-case the GEM buffer object creation, only the allocation of the backing pages? Though I'm not sure how that distinction is useful - at the end of the day, you need to special-case allocation of the backing pages. > helpers that can be used by simple drivers, yes. Forcing the way the > buffer is allocated, for sure not. Currently, for example, there is > no issue to export a buffer allocated from stolen-mem. Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk of memory reserved at boot for the GPU which the OS can't touch? I guess I view such memory as accessible to all media devices on the system and as such, needs to be managed by a central allocator which dma_buf can use to allocate from. I guess if that stolen-mem is managed by a single device then in essence that device becomes the central allocator you have to use to be able to allocate from that stolen mem? > > If a driver needs to allocate memory in a special way for a > > particular device, I can't really imagine how it would be able > > to share that buffer with another device using dma_buf? I guess > > a driver is likely to need some magic voodoo to configure access > > to the buffer for its device, but surely that would be done by > > the dma_mapping framework when dma_buf_map happens? > > > > if, what it has to configure actually manages to fit in the > dma-mapping framework But if it doesn't, surely that's an issue which needs to be addressed in the dma_mapping framework or else you won't be able to import buffers for use by that device anyway? > anyways, where the pages come from has nothing to do with whether a > buffer can be shared or not Sure, but where they are located in physical memory really does matter. > >> At any rate, for both xorg and wayland/gbm, you know when a buffer > >> is going to be a scanout buffer. What I'd recommend is define a > >> small userspace API that your customers (the SoC vendors) implement > >> to allocate a scanout buffer and hand you back a dmabuf fd. That > >> could be used both for x11 and for gbm. Inputs should be requested > >> width/height and format. And outputs pitch plus dmabuf fd. > >> > >> (Actually you might even just want to use gbm as your starting > >> point. You could probably just use gbm from xf86-video-armsoc for > >> allocation, to have one thing that works for both wayland and x11. > >> Scanout and cursor buffers should go to vendor/SoC specific fxn, > >> rest can be allocated from mali kernel driver.) > > > > What does that buy us over just using drm_mode_create_dumb on the > > display's DRM driver? > > well, for example, if there was actually some hw w/ omap's dss + mali, > you could actually have mali render transparently to tiled buffers > which could be scanned out rotated. Which would not be possible w/ > dumb buffers. Why not? As you said earlier, the format is defined when you setup the fb with drm_mode_fb_cmd2. If you wanted to share the buffer between devices, you have to be explicit about what format that buffer is in, so you'd have to add an entry to drm_fourcc.h for the tiled format. So userspace queries what formats the GPU DRM supports and what formats the OMAP DSS DRM supports, selects the tiled format and then uses drm_mode_create_dumb to allocate a buffer of the correct size and sets the appropriate drm_fourcc.h enum value when creating an fb for that buffer. Or have I missed something? > >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > >> >> (where phys contig is not required for scanout), but causes CMA > >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. > >> >> It just knows that it wants to be able to scanout that particular > >> >> buffer. > >> > > >> > I think that's the idea? The omap3's allocator driver would use > >> > contiguous memory when it detects the SCANOUT flag whereas the > >> > omap4 allocator driver wouldn't have to. No complex negotiation > >> > of constraints - it just "knows". > >> > > >> > >> well, it is same allocating driver in both cases (although maybe > >> that is unimportant). The "it" that just knows it wants to scanout > >> is userspace. The "it" that just knows that scanout translates to > >> contiguous (or not) is the kernel. Perhaps we are saying the same > >> thing ;-) > > > > Yeah - I think we are... so what's the issue with having a per-SoC > > allocation driver again? > > > > In a way the display driver is a per-SoC allocator. But not > necessarily the *central* allocator for everything. Ie. no need for > display driver to allocate vertex buffers for a separate gpu driver, > and that sort of thing. Again, I'm only talking about allocating buffers which will be shared between different devices. At no point have I mentioned the allocation of buffers which aren't to be shared between devices. Sorry if that's not been clear. So for buffers which are to be shared between devices, your suggesting that the display driver is the per-SoC allocator? But as I say, and how this thread got started, the same display driver can be used on different SoCs, so having _it_ be the central allocator isn't ideal. Though this is our current solution and why we're "abusing" the dumb buffer allocation functions. :-) Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 18:12 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-09 16:15 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-09 16:15 UTC (permalink / raw) To: linux-arm-kernel > > Turning to DRM/KMS, it seems the supported formats of a plane can be > > queried using drm_mode_get_plane. However, there doesn't seem to be a > > way to query the supported formats of a crtc? If display HW only > > supports scanning out from a single buffer (like pl111 does), I think > > it won't have any planes and a fb can only be set on the crtc. In > > which case, how should user-space query which pixel formats that crtc > > supports? > > it is exposed for drm plane's. What is missing is to expose the > primary-plane associated with the crtc. Cool - so a patch which adds a way to query the what formats a crtc supports would be welcome? What about a way to query the stride alignment constraints? Presumably using the drm_mode_get_property mechanism would be the right way to implement that? > > As with v4l2, DRM doesn't appear to have a way to query the stride > > constraints? Assuming there is a way to query the stride constraints, > > there also isn't a way to specify them when creating a buffer with > > DRM, though perhaps the existing pitch parameter of > > drm_mode_create_dumb could be used to allow user-space to pass in a > > minimum stride as well as receive the allocated stride? > > > > well, you really shouldn't be using create_dumb.. you should have a > userspace piece that is specific to the drm driver, and knows how to > use that driver's gem allocate ioctl. Sorry, why does this need a driver-specific allocation function? It's just a display controller driver and I just want to allocate a scan- out buffer - all I'm asking is for the display controller driver to use a minimum stride alignment so I can export the buffer and use another device to fill it with data. The whole point is to be able to allocate the buffer in such a way that another device can access it. So the driver _can't_ use a special, device specific format, nor can it allocate it from a private memory pool because doing so would preclude it from being shared with another device. That other device doesn't need to be a GPU wither, it could just as easily be a camera/ISP or video decoder. > >> > So presumably you're talking about a GPU driver being the exporter > >> > here? If so, how could the GPU driver do these kind of tricks on > >> > memory shared with another device? > >> > >> Yes, that is gpu-as-exporter. If someone else is allocating > >> buffers, it is up to them to do these tricks or not. Probably > >> there is a pretty good chance that if you aren't a GPU you don't > >> need those sort of tricks for fast allocation of transient upload > >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't > >> really think a v4l camera or video decoder would benefit from that > >> sort of optimization. > > > > Right - but none of those are really buffers you'd want to export > > > with dma_buf to share with another device are they? In which case, > > why not just have dma_buf figure out the constraints and allocate > > the memory? > > maybe not.. but (a) you don't necessarily know at creation time if it > is going to be exported (maybe you know if it is definitely not going > to be exported, but the converse is not true), I can't actually think of an example where you would not know if a buffer was going to be exported or not at allocation time? Do you have a case in mind? Regardless, you'd certainly have to know if a buffer will be exported pretty quickly, before it's used so that you can import it into whatever devices are going to access it. Otherwise if it gets allocated before you export it, the allocation won't satisfy the constraints of the other devices which will need to access it and importing will fail. Assuming of course deferred allocation of the backing pages as discussed earlier in the thread. > and (b) there isn't > really any reason to special case the allocation in the driver because > it is going to be exported. Not sure I follow you here? Surely you absolutely have to special-case the allocation if the buffer is to be exported because you have to take the other devices' constraints into account when you allocate? Or do you mean you don't need to special-case the GEM buffer object creation, only the allocation of the backing pages? Though I'm not sure how that distinction is useful - at the end of the day, you need to special-case allocation of the backing pages. > helpers that can be used by simple drivers, yes. Forcing the way the > buffer is allocated, for sure not. Currently, for example, there is > no issue to export a buffer allocated from stolen-mem. Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk of memory reserved at boot for the GPU which the OS can't touch? I guess I view such memory as accessible to all media devices on the system and as such, needs to be managed by a central allocator which dma_buf can use to allocate from. I guess if that stolen-mem is managed by a single device then in essence that device becomes the central allocator you have to use to be able to allocate from that stolen mem? > > If a driver needs to allocate memory in a special way for a > > particular device, I can't really imagine how it would be able > > to share that buffer with another device using dma_buf? I guess > > a driver is likely to need some magic voodoo to configure access > > to the buffer for its device, but surely that would be done by > > the dma_mapping framework when dma_buf_map happens? > > > > if, what it has to configure actually manages to fit in the > dma-mapping framework But if it doesn't, surely that's an issue which needs to be addressed in the dma_mapping framework or else you won't be able to import buffers for use by that device anyway? > anyways, where the pages come from has nothing to do with whether a > buffer can be shared or not Sure, but where they are located in physical memory really does matter. > >> At any rate, for both xorg and wayland/gbm, you know when a buffer > >> is going to be a scanout buffer. What I'd recommend is define a > >> small userspace API that your customers (the SoC vendors) implement > >> to allocate a scanout buffer and hand you back a dmabuf fd. That > >> could be used both for x11 and for gbm. Inputs should be requested > >> width/height and format. And outputs pitch plus dmabuf fd. > >> > >> (Actually you might even just want to use gbm as your starting > >> point. You could probably just use gbm from xf86-video-armsoc for > >> allocation, to have one thing that works for both wayland and x11. > >> Scanout and cursor buffers should go to vendor/SoC specific fxn, > >> rest can be allocated from mali kernel driver.) > > > > What does that buy us over just using drm_mode_create_dumb on the > > display's DRM driver? > > well, for example, if there was actually some hw w/ omap's dss + mali, > you could actually have mali render transparently to tiled buffers > which could be scanned out rotated. Which would not be possible w/ > dumb buffers. Why not? As you said earlier, the format is defined when you setup the fb with drm_mode_fb_cmd2. If you wanted to share the buffer between devices, you have to be explicit about what format that buffer is in, so you'd have to add an entry to drm_fourcc.h for the tiled format. So userspace queries what formats the GPU DRM supports and what formats the OMAP DSS DRM supports, selects the tiled format and then uses drm_mode_create_dumb to allocate a buffer of the correct size and sets the appropriate drm_fourcc.h enum value when creating an fb for that buffer. Or have I missed something? > >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ > >> >> (where phys contig is not required for scanout), but causes CMA > >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. > >> >> It just knows that it wants to be able to scanout that particular > >> >> buffer. > >> > > >> > I think that's the idea? The omap3's allocator driver would use > >> > contiguous memory when it detects the SCANOUT flag whereas the > >> > omap4 allocator driver wouldn't have to. No complex negotiation > >> > of constraints - it just "knows". > >> > > >> > >> well, it is same allocating driver in both cases (although maybe > >> that is unimportant). The "it" that just knows it wants to scanout > >> is userspace. The "it" that just knows that scanout translates to > >> contiguous (or not) is the kernel. Perhaps we are saying the same > >> thing ;-) > > > > Yeah - I think we are... so what's the issue with having a per-SoC > > allocation driver again? > > > > In a way the display driver is a per-SoC allocator. But not > necessarily the *central* allocator for everything. Ie. no need for > display driver to allocate vertex buffers for a separate gpu driver, > and that sort of thing. Again, I'm only talking about allocating buffers which will be shared between different devices. At no point have I mentioned the allocation of buffers which aren't to be shared between devices. Sorry if that's not been clear. So for buffers which are to be shared between devices, your suggesting that the display driver is the per-SoC allocator? But as I say, and how this thread got started, the same display driver can be used on different SoCs, so having _it_ be the central allocator isn't ideal. Though this is our current solution and why we're "abusing" the dumb buffer allocation functions. :-) Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
[parent not found: <520515b9.87370f0a.16e6.2380SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 [not found] ` <520515b9.87370f0a.16e6.2380SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-09 17:12 ` Rob Clark @ 2013-08-09 17:12 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-09 17:12 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > Turning to DRM/KMS, it seems the supported formats of a plane can be >> > queried using drm_mode_get_plane. However, there doesn't seem to be a >> > way to query the supported formats of a crtc? If display HW only >> > supports scanning out from a single buffer (like pl111 does), I think >> > it won't have any planes and a fb can only be set on the crtc. In >> > which case, how should user-space query which pixel formats that crtc >> > supports? >> >> it is exposed for drm plane's. What is missing is to expose the >> primary-plane associated with the crtc. > > Cool - so a patch which adds a way to query the what formats a crtc > supports would be welcome? well, I kinda think we want something that exposes the "primary plane" of the crtc.. I'm thinking something roughly like: --------- diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h index 53db7ce..c7ffca8 100644 --- a/include/uapi/drm/drm_mode.h +++ b/include/uapi/drm/drm_mode.h @@ -157,6 +157,12 @@ struct drm_mode_get_plane { struct drm_mode_get_plane_res { __u64 plane_id_ptr; __u32 count_planes; + /* The primary planes are in matching order to crtc_id_ptr in + * drm_mode_card_res (and same length). For crtc_id[n], it's + * primary plane is given by primary_plane_id[n]. + */ + __u32 count_primary_planes; + __u64 primary_plane_id_ptr; }; #define DRM_MODE_ENCODER_NONE 0 --------- then use the existing GETPLANE ioctl to query the capabilities > What about a way to query the stride alignment constraints? > > Presumably using the drm_mode_get_property mechanism would be the > right way to implement that? > I suppose you could try that.. typically that is in userspace, however. It seems like get_property would get messy quickly (ie. is it a pitch alignment constraint, or stride alignment? What if this is different for different formats (in particular tiled)? etc) > >> > As with v4l2, DRM doesn't appear to have a way to query the stride >> > constraints? Assuming there is a way to query the stride constraints, >> > there also isn't a way to specify them when creating a buffer with >> > DRM, though perhaps the existing pitch parameter of >> > drm_mode_create_dumb could be used to allow user-space to pass in a >> > minimum stride as well as receive the allocated stride? >> > >> >> well, you really shouldn't be using create_dumb.. you should have a >> userspace piece that is specific to the drm driver, and knows how to >> use that driver's gem allocate ioctl. > > Sorry, why does this need a driver-specific allocation function? It's > just a display controller driver and I just want to allocate a scan- > out buffer - all I'm asking is for the display controller driver to > use a minimum stride alignment so I can export the buffer and use > another device to fill it with data. Sure.. but userspace has more information readily available to make a better choice. For example, for omapdrm I'd do things differently depending on whether I wanted to scan out that buffer (or a portion of it) rotated. This is something I know in the DDX driver, but not in the kernel. And it is quite likely something that is driver specific. Sure, we could add that to a generic "allocate me a buffer" ioctl. But that doesn't really seem better, and it becomes a problem as soon as we come across some hw that needs to know something different. In userspace, you have a lot more flexibility, since you don't really need to commit to an API for life. And to bring back the GStreamer argument (since that seems a fitting example when you start talking about sharing buffers between many devices, for example camera+codec+display), it would already be negotiating format between v4l2src + fooencoder + displaysink.. the pitch/stride is part of that format information. If the display isn't the one with the strictest requirements, we don't want the display driver deciding what pitch to use. > The whole point is to be able to allocate the buffer in such a way > that another device can access it. So the driver _can't_ use a > special, device specific format, nor can it allocate it from a > private memory pool because doing so would preclude it from being > shared with another device. > > That other device doesn't need to be a GPU wither, it could just as > easily be a camera/ISP or video decoder. > > > >> >> > So presumably you're talking about a GPU driver being the exporter >> >> > here? If so, how could the GPU driver do these kind of tricks on >> >> > memory shared with another device? >> >> >> >> Yes, that is gpu-as-exporter. If someone else is allocating >> >> buffers, it is up to them to do these tricks or not. Probably >> >> there is a pretty good chance that if you aren't a GPU you don't >> >> need those sort of tricks for fast allocation of transient upload >> >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't >> >> really think a v4l camera or video decoder would benefit from that >> >> sort of optimization. >> > >> > Right - but none of those are really buffers you'd want to export >> >> > with dma_buf to share with another device are they? In which case, >> > why not just have dma_buf figure out the constraints and allocate >> > the memory? >> >> maybe not.. but (a) you don't necessarily know at creation time if it >> is going to be exported (maybe you know if it is definitely not going >> to be exported, but the converse is not true), > > I can't actually think of an example where you would not know if a > buffer was going to be exported or not at allocation time? Do you have > a case in mind? yeah, dri2.. when the front buffer is allocated it is just a regular pixmap. If you swap/flip it becomes the back buffer and now shared ;-) And pixmaps are allocated w/ enough frequency that it is the sort of thing you might want to optimize. And even when you know it will be shared, you don't know with who. > Regardless, you'd certainly have to know if a buffer will be exported > pretty quickly, before it's used so that you can import it into > whatever devices are going to access it. Otherwise if it gets > allocated before you export it, the allocation won't satisfy the > constraints of the other devices which will need to access it and > importing will fail. Assuming of course deferred allocation of the > backing pages as discussed earlier in the thread. > > > >> and (b) there isn't >> really any reason to special case the allocation in the driver because >> it is going to be exported. > > Not sure I follow you here? Surely you absolutely have to special-case > the allocation if the buffer is to be exported because you have to > take the other devices' constraints into account when you allocate? Or > do you mean you don't need to special-case the GEM buffer object > creation, only the allocation of the backing pages? Though I'm not > sure how that distinction is useful - at the end of the day, you need > to special-case allocation of the backing pages. > well, you need to consider separately what is (a) in the pages, and (b) where the pages come from. By moving the allocation into dmabuf you restrict (b). For sharing buffers, (a) may be restricted, but there is at least some examples of hardware where (b) would not otherwise be restricted by sharing. > >> helpers that can be used by simple drivers, yes. Forcing the way the >> buffer is allocated, for sure not. Currently, for example, there is >> no issue to export a buffer allocated from stolen-mem. > > Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk > of memory reserved at boot for the GPU which the OS can't touch? I > guess I view such memory as accessible to all media devices on the > system and as such, needs to be managed by a central allocator which > dma_buf can use to allocate from. think carve-out created by bios. In all the cases I am aware of, the drm driver handles allocation of buffer(s) from the carveout. > I guess if that stolen-mem is managed by a single device then in > essence that device becomes the central allocator you have to use to > be able to allocate from that stolen mem? > > >> > If a driver needs to allocate memory in a special way for a >> > particular device, I can't really imagine how it would be able >> > to share that buffer with another device using dma_buf? I guess >> > a driver is likely to need some magic voodoo to configure access >> > to the buffer for its device, but surely that would be done by >> > the dma_mapping framework when dma_buf_map happens? >> > >> >> if, what it has to configure actually manages to fit in the >> dma-mapping framework > > But if it doesn't, surely that's an issue which needs to be addressed > in the dma_mapping framework or else you won't be able to import > buffers for use by that device anyway? > I'm not sure if we have to fit everything in dma-mapping framework, at least in cases where you have something that is specific to one platform. Currently dma-buf provides enough flexibility for other drivers to be able to import these buffers. > >> anyways, where the pages come from has nothing to do with whether a >> buffer can be shared or not > > Sure, but where they are located in physical memory really does > matter. > s/does/can/ it doesn't always matter. And in cases where it does matter, as long as we can express the restrictions in dma_parms (which we can already for the case of range-of-memory restrictions) we are covered > >> >> At any rate, for both xorg and wayland/gbm, you know when a buffer >> >> is going to be a scanout buffer. What I'd recommend is define a >> >> small userspace API that your customers (the SoC vendors) implement >> >> to allocate a scanout buffer and hand you back a dmabuf fd. That >> >> could be used both for x11 and for gbm. Inputs should be requested >> >> width/height and format. And outputs pitch plus dmabuf fd. >> >> >> >> (Actually you might even just want to use gbm as your starting >> >> point. You could probably just use gbm from xf86-video-armsoc for >> >> allocation, to have one thing that works for both wayland and x11. >> >> Scanout and cursor buffers should go to vendor/SoC specific fxn, >> >> rest can be allocated from mali kernel driver.) >> > >> > What does that buy us over just using drm_mode_create_dumb on the >> > display's DRM driver? >> >> well, for example, if there was actually some hw w/ omap's dss + mali, >> you could actually have mali render transparently to tiled buffers >> which could be scanned out rotated. Which would not be possible w/ >> dumb buffers. > > Why not? As you said earlier, the format is defined when you setup the > fb with drm_mode_fb_cmd2. If you wanted to share the buffer between > devices, you have to be explicit about what format that buffer is in, > so you'd have to add an entry to drm_fourcc.h for the tiled format. no, that doesn't really work in this case, the format to any device (or userspace) accessing the buffer is not tiled. (Ie. it would look like normal NV12 or whatever). But there are some different requirements on stride. And there are cases where you would prefer not to use tiled buffers, but the kernel doesn't know enough in the dumb-buffer alloc ioctl to make the correct decision. > So userspace queries what formats the GPU DRM supports and what > formats the OMAP DSS DRM supports, selects the tiled format and then > uses drm_mode_create_dumb to allocate a buffer of the correct size and > sets the appropriate drm_fourcc.h enum value when creating an fb for > that buffer. Or have I missed something? > > > >> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> >> >> (where phys contig is not required for scanout), but causes CMA >> >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. >> >> >> It just knows that it wants to be able to scanout that particular >> >> >> buffer. >> >> > >> >> > I think that's the idea? The omap3's allocator driver would use >> >> > contiguous memory when it detects the SCANOUT flag whereas the >> >> > omap4 allocator driver wouldn't have to. No complex negotiation >> >> > of constraints - it just "knows". >> >> > >> >> >> >> well, it is same allocating driver in both cases (although maybe >> >> that is unimportant). The "it" that just knows it wants to scanout >> >> is userspace. The "it" that just knows that scanout translates to >> >> contiguous (or not) is the kernel. Perhaps we are saying the same >> >> thing ;-) >> > >> > Yeah - I think we are... so what's the issue with having a per-SoC >> > allocation driver again? >> > >> >> In a way the display driver is a per-SoC allocator. But not >> necessarily the *central* allocator for everything. Ie. no need for >> display driver to allocate vertex buffers for a separate gpu driver, >> and that sort of thing. > > Again, I'm only talking about allocating buffers which will be shared > between different devices. At no point have I mentioned the allocation > of buffers which aren't to be shared between devices. Sorry if that's > not been clear. ok, I guess we were talking about slightly different things ;-) > So for buffers which are to be shared between devices, your suggesting > that the display driver is the per-SoC allocator? But as I say, and > how this thread got started, the same display driver can be used on > different SoCs, so having _it_ be the central allocator isn't ideal. > Though this is our current solution and why we're "abusing" the dumb > buffer allocation functions. :-) > which is why you want to let userspace figure out the pitch and then tell the display driver what size it wants, rather than using dumb buffer ioctl ;-) Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or property or what have you.. but I think that would be hard to get right for all cases, and most people don't really care about that because they already need a gpu/display specific xorg driver and/or gl/egl talking to their kernel driver. You are in a slightly special case, since you are providing GL driver independently of the display driver. But I think that is easier to handle by just telling your customers "here, fill out this function(s) to allocate buffer for scanout" (and, well, I guess you'd need one to query for pitch/stride), rather than trying to cram everything into the kernel. BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply related [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-09 17:12 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-09 17:12 UTC (permalink / raw) To: linux-arm-kernel On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > Turning to DRM/KMS, it seems the supported formats of a plane can be >> > queried using drm_mode_get_plane. However, there doesn't seem to be a >> > way to query the supported formats of a crtc? If display HW only >> > supports scanning out from a single buffer (like pl111 does), I think >> > it won't have any planes and a fb can only be set on the crtc. In >> > which case, how should user-space query which pixel formats that crtc >> > supports? >> >> it is exposed for drm plane's. What is missing is to expose the >> primary-plane associated with the crtc. > > Cool - so a patch which adds a way to query the what formats a crtc > supports would be welcome? well, I kinda think we want something that exposes the "primary plane" of the crtc.. I'm thinking something roughly like: --------- diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h index 53db7ce..c7ffca8 100644 --- a/include/uapi/drm/drm_mode.h +++ b/include/uapi/drm/drm_mode.h @@ -157,6 +157,12 @@ struct drm_mode_get_plane { struct drm_mode_get_plane_res { __u64 plane_id_ptr; __u32 count_planes; + /* The primary planes are in matching order to crtc_id_ptr in + * drm_mode_card_res (and same length). For crtc_id[n], it's + * primary plane is given by primary_plane_id[n]. + */ + __u32 count_primary_planes; + __u64 primary_plane_id_ptr; }; #define DRM_MODE_ENCODER_NONE 0 --------- then use the existing GETPLANE ioctl to query the capabilities > What about a way to query the stride alignment constraints? > > Presumably using the drm_mode_get_property mechanism would be the > right way to implement that? > I suppose you could try that.. typically that is in userspace, however. It seems like get_property would get messy quickly (ie. is it a pitch alignment constraint, or stride alignment? What if this is different for different formats (in particular tiled)? etc) > >> > As with v4l2, DRM doesn't appear to have a way to query the stride >> > constraints? Assuming there is a way to query the stride constraints, >> > there also isn't a way to specify them when creating a buffer with >> > DRM, though perhaps the existing pitch parameter of >> > drm_mode_create_dumb could be used to allow user-space to pass in a >> > minimum stride as well as receive the allocated stride? >> > >> >> well, you really shouldn't be using create_dumb.. you should have a >> userspace piece that is specific to the drm driver, and knows how to >> use that driver's gem allocate ioctl. > > Sorry, why does this need a driver-specific allocation function? It's > just a display controller driver and I just want to allocate a scan- > out buffer - all I'm asking is for the display controller driver to > use a minimum stride alignment so I can export the buffer and use > another device to fill it with data. Sure.. but userspace has more information readily available to make a better choice. For example, for omapdrm I'd do things differently depending on whether I wanted to scan out that buffer (or a portion of it) rotated. This is something I know in the DDX driver, but not in the kernel. And it is quite likely something that is driver specific. Sure, we could add that to a generic "allocate me a buffer" ioctl. But that doesn't really seem better, and it becomes a problem as soon as we come across some hw that needs to know something different. In userspace, you have a lot more flexibility, since you don't really need to commit to an API for life. And to bring back the GStreamer argument (since that seems a fitting example when you start talking about sharing buffers between many devices, for example camera+codec+display), it would already be negotiating format between v4l2src + fooencoder + displaysink.. the pitch/stride is part of that format information. If the display isn't the one with the strictest requirements, we don't want the display driver deciding what pitch to use. > The whole point is to be able to allocate the buffer in such a way > that another device can access it. So the driver _can't_ use a > special, device specific format, nor can it allocate it from a > private memory pool because doing so would preclude it from being > shared with another device. > > That other device doesn't need to be a GPU wither, it could just as > easily be a camera/ISP or video decoder. > > > >> >> > So presumably you're talking about a GPU driver being the exporter >> >> > here? If so, how could the GPU driver do these kind of tricks on >> >> > memory shared with another device? >> >> >> >> Yes, that is gpu-as-exporter. If someone else is allocating >> >> buffers, it is up to them to do these tricks or not. Probably >> >> there is a pretty good chance that if you aren't a GPU you don't >> >> need those sort of tricks for fast allocation of transient upload >> >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't >> >> really think a v4l camera or video decoder would benefit from that >> >> sort of optimization. >> > >> > Right - but none of those are really buffers you'd want to export >> >> > with dma_buf to share with another device are they? In which case, >> > why not just have dma_buf figure out the constraints and allocate >> > the memory? >> >> maybe not.. but (a) you don't necessarily know at creation time if it >> is going to be exported (maybe you know if it is definitely not going >> to be exported, but the converse is not true), > > I can't actually think of an example where you would not know if a > buffer was going to be exported or not at allocation time? Do you have > a case in mind? yeah, dri2.. when the front buffer is allocated it is just a regular pixmap. If you swap/flip it becomes the back buffer and now shared ;-) And pixmaps are allocated w/ enough frequency that it is the sort of thing you might want to optimize. And even when you know it will be shared, you don't know with who. > Regardless, you'd certainly have to know if a buffer will be exported > pretty quickly, before it's used so that you can import it into > whatever devices are going to access it. Otherwise if it gets > allocated before you export it, the allocation won't satisfy the > constraints of the other devices which will need to access it and > importing will fail. Assuming of course deferred allocation of the > backing pages as discussed earlier in the thread. > > > >> and (b) there isn't >> really any reason to special case the allocation in the driver because >> it is going to be exported. > > Not sure I follow you here? Surely you absolutely have to special-case > the allocation if the buffer is to be exported because you have to > take the other devices' constraints into account when you allocate? Or > do you mean you don't need to special-case the GEM buffer object > creation, only the allocation of the backing pages? Though I'm not > sure how that distinction is useful - at the end of the day, you need > to special-case allocation of the backing pages. > well, you need to consider separately what is (a) in the pages, and (b) where the pages come from. By moving the allocation into dmabuf you restrict (b). For sharing buffers, (a) may be restricted, but there is at least some examples of hardware where (b) would not otherwise be restricted by sharing. > >> helpers that can be used by simple drivers, yes. Forcing the way the >> buffer is allocated, for sure not. Currently, for example, there is >> no issue to export a buffer allocated from stolen-mem. > > Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk > of memory reserved at boot for the GPU which the OS can't touch? I > guess I view such memory as accessible to all media devices on the > system and as such, needs to be managed by a central allocator which > dma_buf can use to allocate from. think carve-out created by bios. In all the cases I am aware of, the drm driver handles allocation of buffer(s) from the carveout. > I guess if that stolen-mem is managed by a single device then in > essence that device becomes the central allocator you have to use to > be able to allocate from that stolen mem? > > >> > If a driver needs to allocate memory in a special way for a >> > particular device, I can't really imagine how it would be able >> > to share that buffer with another device using dma_buf? I guess >> > a driver is likely to need some magic voodoo to configure access >> > to the buffer for its device, but surely that would be done by >> > the dma_mapping framework when dma_buf_map happens? >> > >> >> if, what it has to configure actually manages to fit in the >> dma-mapping framework > > But if it doesn't, surely that's an issue which needs to be addressed > in the dma_mapping framework or else you won't be able to import > buffers for use by that device anyway? > I'm not sure if we have to fit everything in dma-mapping framework, at least in cases where you have something that is specific to one platform. Currently dma-buf provides enough flexibility for other drivers to be able to import these buffers. > >> anyways, where the pages come from has nothing to do with whether a >> buffer can be shared or not > > Sure, but where they are located in physical memory really does > matter. > s/does/can/ it doesn't always matter. And in cases where it does matter, as long as we can express the restrictions in dma_parms (which we can already for the case of range-of-memory restrictions) we are covered > >> >> At any rate, for both xorg and wayland/gbm, you know when a buffer >> >> is going to be a scanout buffer. What I'd recommend is define a >> >> small userspace API that your customers (the SoC vendors) implement >> >> to allocate a scanout buffer and hand you back a dmabuf fd. That >> >> could be used both for x11 and for gbm. Inputs should be requested >> >> width/height and format. And outputs pitch plus dmabuf fd. >> >> >> >> (Actually you might even just want to use gbm as your starting >> >> point. You could probably just use gbm from xf86-video-armsoc for >> >> allocation, to have one thing that works for both wayland and x11. >> >> Scanout and cursor buffers should go to vendor/SoC specific fxn, >> >> rest can be allocated from mali kernel driver.) >> > >> > What does that buy us over just using drm_mode_create_dumb on the >> > display's DRM driver? >> >> well, for example, if there was actually some hw w/ omap's dss + mali, >> you could actually have mali render transparently to tiled buffers >> which could be scanned out rotated. Which would not be possible w/ >> dumb buffers. > > Why not? As you said earlier, the format is defined when you setup the > fb with drm_mode_fb_cmd2. If you wanted to share the buffer between > devices, you have to be explicit about what format that buffer is in, > so you'd have to add an entry to drm_fourcc.h for the tiled format. no, that doesn't really work in this case, the format to any device (or userspace) accessing the buffer is not tiled. (Ie. it would look like normal NV12 or whatever). But there are some different requirements on stride. And there are cases where you would prefer not to use tiled buffers, but the kernel doesn't know enough in the dumb-buffer alloc ioctl to make the correct decision. > So userspace queries what formats the GPU DRM supports and what > formats the OMAP DSS DRM supports, selects the tiled format and then > uses drm_mode_create_dumb to allocate a buffer of the correct size and > sets the appropriate drm_fourcc.h enum value when creating an fb for > that buffer. Or have I missed something? > > > >> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> >> >> (where phys contig is not required for scanout), but causes CMA >> >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. >> >> >> It just knows that it wants to be able to scanout that particular >> >> >> buffer. >> >> > >> >> > I think that's the idea? The omap3's allocator driver would use >> >> > contiguous memory when it detects the SCANOUT flag whereas the >> >> > omap4 allocator driver wouldn't have to. No complex negotiation >> >> > of constraints - it just "knows". >> >> > >> >> >> >> well, it is same allocating driver in both cases (although maybe >> >> that is unimportant). The "it" that just knows it wants to scanout >> >> is userspace. The "it" that just knows that scanout translates to >> >> contiguous (or not) is the kernel. Perhaps we are saying the same >> >> thing ;-) >> > >> > Yeah - I think we are... so what's the issue with having a per-SoC >> > allocation driver again? >> > >> >> In a way the display driver is a per-SoC allocator. But not >> necessarily the *central* allocator for everything. Ie. no need for >> display driver to allocate vertex buffers for a separate gpu driver, >> and that sort of thing. > > Again, I'm only talking about allocating buffers which will be shared > between different devices. At no point have I mentioned the allocation > of buffers which aren't to be shared between devices. Sorry if that's > not been clear. ok, I guess we were talking about slightly different things ;-) > So for buffers which are to be shared between devices, your suggesting > that the display driver is the per-SoC allocator? But as I say, and > how this thread got started, the same display driver can be used on > different SoCs, so having _it_ be the central allocator isn't ideal. > Though this is our current solution and why we're "abusing" the dumb > buffer allocation functions. :-) > which is why you want to let userspace figure out the pitch and then tell the display driver what size it wants, rather than using dumb buffer ioctl ;-) Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or property or what have you.. but I think that would be hard to get right for all cases, and most people don't really care about that because they already need a gpu/display specific xorg driver and/or gl/egl talking to their kernel driver. You are in a slightly special case, since you are providing GL driver independently of the display driver. But I think that is easier to handle by just telling your customers "here, fill out this function(s) to allocate buffer for scanout" (and, well, I guess you'd need one to query for pitch/stride), rather than trying to cram everything into the kernel. BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply related [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-09 17:12 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-09 17:12 UTC (permalink / raw) To: Tom Cooksey Cc: dri-devel, linux-fbdev, Pawel Moll, linux-arm-kernel, linux-media, linaro-mm-sig, linux-kernel On Fri, Aug 9, 2013 at 12:15 PM, Tom Cooksey <tom.cooksey@arm.com> wrote: > >> > Turning to DRM/KMS, it seems the supported formats of a plane can be >> > queried using drm_mode_get_plane. However, there doesn't seem to be a >> > way to query the supported formats of a crtc? If display HW only >> > supports scanning out from a single buffer (like pl111 does), I think >> > it won't have any planes and a fb can only be set on the crtc. In >> > which case, how should user-space query which pixel formats that crtc >> > supports? >> >> it is exposed for drm plane's. What is missing is to expose the >> primary-plane associated with the crtc. > > Cool - so a patch which adds a way to query the what formats a crtc > supports would be welcome? well, I kinda think we want something that exposes the "primary plane" of the crtc.. I'm thinking something roughly like: --------- diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h index 53db7ce..c7ffca8 100644 --- a/include/uapi/drm/drm_mode.h +++ b/include/uapi/drm/drm_mode.h @@ -157,6 +157,12 @@ struct drm_mode_get_plane { struct drm_mode_get_plane_res { __u64 plane_id_ptr; __u32 count_planes; + /* The primary planes are in matching order to crtc_id_ptr in + * drm_mode_card_res (and same length). For crtc_id[n], it's + * primary plane is given by primary_plane_id[n]. + */ + __u32 count_primary_planes; + __u64 primary_plane_id_ptr; }; #define DRM_MODE_ENCODER_NONE 0 --------- then use the existing GETPLANE ioctl to query the capabilities > What about a way to query the stride alignment constraints? > > Presumably using the drm_mode_get_property mechanism would be the > right way to implement that? > I suppose you could try that.. typically that is in userspace, however. It seems like get_property would get messy quickly (ie. is it a pitch alignment constraint, or stride alignment? What if this is different for different formats (in particular tiled)? etc) > >> > As with v4l2, DRM doesn't appear to have a way to query the stride >> > constraints? Assuming there is a way to query the stride constraints, >> > there also isn't a way to specify them when creating a buffer with >> > DRM, though perhaps the existing pitch parameter of >> > drm_mode_create_dumb could be used to allow user-space to pass in a >> > minimum stride as well as receive the allocated stride? >> > >> >> well, you really shouldn't be using create_dumb.. you should have a >> userspace piece that is specific to the drm driver, and knows how to >> use that driver's gem allocate ioctl. > > Sorry, why does this need a driver-specific allocation function? It's > just a display controller driver and I just want to allocate a scan- > out buffer - all I'm asking is for the display controller driver to > use a minimum stride alignment so I can export the buffer and use > another device to fill it with data. Sure.. but userspace has more information readily available to make a better choice. For example, for omapdrm I'd do things differently depending on whether I wanted to scan out that buffer (or a portion of it) rotated. This is something I know in the DDX driver, but not in the kernel. And it is quite likely something that is driver specific. Sure, we could add that to a generic "allocate me a buffer" ioctl. But that doesn't really seem better, and it becomes a problem as soon as we come across some hw that needs to know something different. In userspace, you have a lot more flexibility, since you don't really need to commit to an API for life. And to bring back the GStreamer argument (since that seems a fitting example when you start talking about sharing buffers between many devices, for example camera+codec+display), it would already be negotiating format between v4l2src + fooencoder + displaysink.. the pitch/stride is part of that format information. If the display isn't the one with the strictest requirements, we don't want the display driver deciding what pitch to use. > The whole point is to be able to allocate the buffer in such a way > that another device can access it. So the driver _can't_ use a > special, device specific format, nor can it allocate it from a > private memory pool because doing so would preclude it from being > shared with another device. > > That other device doesn't need to be a GPU wither, it could just as > easily be a camera/ISP or video decoder. > > > >> >> > So presumably you're talking about a GPU driver being the exporter >> >> > here? If so, how could the GPU driver do these kind of tricks on >> >> > memory shared with another device? >> >> >> >> Yes, that is gpu-as-exporter. If someone else is allocating >> >> buffers, it is up to them to do these tricks or not. Probably >> >> there is a pretty good chance that if you aren't a GPU you don't >> >> need those sort of tricks for fast allocation of transient upload >> >> buffers, staging textures, temporary pixmaps, etc. Ie. I don't >> >> really think a v4l camera or video decoder would benefit from that >> >> sort of optimization. >> > >> > Right - but none of those are really buffers you'd want to export >> >> > with dma_buf to share with another device are they? In which case, >> > why not just have dma_buf figure out the constraints and allocate >> > the memory? >> >> maybe not.. but (a) you don't necessarily know at creation time if it >> is going to be exported (maybe you know if it is definitely not going >> to be exported, but the converse is not true), > > I can't actually think of an example where you would not know if a > buffer was going to be exported or not at allocation time? Do you have > a case in mind? yeah, dri2.. when the front buffer is allocated it is just a regular pixmap. If you swap/flip it becomes the back buffer and now shared ;-) And pixmaps are allocated w/ enough frequency that it is the sort of thing you might want to optimize. And even when you know it will be shared, you don't know with who. > Regardless, you'd certainly have to know if a buffer will be exported > pretty quickly, before it's used so that you can import it into > whatever devices are going to access it. Otherwise if it gets > allocated before you export it, the allocation won't satisfy the > constraints of the other devices which will need to access it and > importing will fail. Assuming of course deferred allocation of the > backing pages as discussed earlier in the thread. > > > >> and (b) there isn't >> really any reason to special case the allocation in the driver because >> it is going to be exported. > > Not sure I follow you here? Surely you absolutely have to special-case > the allocation if the buffer is to be exported because you have to > take the other devices' constraints into account when you allocate? Or > do you mean you don't need to special-case the GEM buffer object > creation, only the allocation of the backing pages? Though I'm not > sure how that distinction is useful - at the end of the day, you need > to special-case allocation of the backing pages. > well, you need to consider separately what is (a) in the pages, and (b) where the pages come from. By moving the allocation into dmabuf you restrict (b). For sharing buffers, (a) may be restricted, but there is at least some examples of hardware where (b) would not otherwise be restricted by sharing. > >> helpers that can be used by simple drivers, yes. Forcing the way the >> buffer is allocated, for sure not. Currently, for example, there is >> no issue to export a buffer allocated from stolen-mem. > > Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk > of memory reserved at boot for the GPU which the OS can't touch? I > guess I view such memory as accessible to all media devices on the > system and as such, needs to be managed by a central allocator which > dma_buf can use to allocate from. think carve-out created by bios. In all the cases I am aware of, the drm driver handles allocation of buffer(s) from the carveout. > I guess if that stolen-mem is managed by a single device then in > essence that device becomes the central allocator you have to use to > be able to allocate from that stolen mem? > > >> > If a driver needs to allocate memory in a special way for a >> > particular device, I can't really imagine how it would be able >> > to share that buffer with another device using dma_buf? I guess >> > a driver is likely to need some magic voodoo to configure access >> > to the buffer for its device, but surely that would be done by >> > the dma_mapping framework when dma_buf_map happens? >> > >> >> if, what it has to configure actually manages to fit in the >> dma-mapping framework > > But if it doesn't, surely that's an issue which needs to be addressed > in the dma_mapping framework or else you won't be able to import > buffers for use by that device anyway? > I'm not sure if we have to fit everything in dma-mapping framework, at least in cases where you have something that is specific to one platform. Currently dma-buf provides enough flexibility for other drivers to be able to import these buffers. > >> anyways, where the pages come from has nothing to do with whether a >> buffer can be shared or not > > Sure, but where they are located in physical memory really does > matter. > s/does/can/ it doesn't always matter. And in cases where it does matter, as long as we can express the restrictions in dma_parms (which we can already for the case of range-of-memory restrictions) we are covered > >> >> At any rate, for both xorg and wayland/gbm, you know when a buffer >> >> is going to be a scanout buffer. What I'd recommend is define a >> >> small userspace API that your customers (the SoC vendors) implement >> >> to allocate a scanout buffer and hand you back a dmabuf fd. That >> >> could be used both for x11 and for gbm. Inputs should be requested >> >> width/height and format. And outputs pitch plus dmabuf fd. >> >> >> >> (Actually you might even just want to use gbm as your starting >> >> point. You could probably just use gbm from xf86-video-armsoc for >> >> allocation, to have one thing that works for both wayland and x11. >> >> Scanout and cursor buffers should go to vendor/SoC specific fxn, >> >> rest can be allocated from mali kernel driver.) >> > >> > What does that buy us over just using drm_mode_create_dumb on the >> > display's DRM driver? >> >> well, for example, if there was actually some hw w/ omap's dss + mali, >> you could actually have mali render transparently to tiled buffers >> which could be scanned out rotated. Which would not be possible w/ >> dumb buffers. > > Why not? As you said earlier, the format is defined when you setup the > fb with drm_mode_fb_cmd2. If you wanted to share the buffer between > devices, you have to be explicit about what format that buffer is in, > so you'd have to add an entry to drm_fourcc.h for the tiled format. no, that doesn't really work in this case, the format to any device (or userspace) accessing the buffer is not tiled. (Ie. it would look like normal NV12 or whatever). But there are some different requirements on stride. And there are cases where you would prefer not to use tiled buffers, but the kernel doesn't know enough in the dumb-buffer alloc ioctl to make the correct decision. > So userspace queries what formats the GPU DRM supports and what > formats the OMAP DSS DRM supports, selects the tiled format and then > uses drm_mode_create_dumb to allocate a buffer of the correct size and > sets the appropriate drm_fourcc.h enum value when creating an fb for > that buffer. Or have I missed something? > > > >> >> >> For example, on omapdrm, the SCANOUT flag does nothing on omap4+ >> >> >> (where phys contig is not required for scanout), but causes CMA >> >> >> (dma_alloc_*()) to be used on omap3. Userspace doesn't care. >> >> >> It just knows that it wants to be able to scanout that particular >> >> >> buffer. >> >> > >> >> > I think that's the idea? The omap3's allocator driver would use >> >> > contiguous memory when it detects the SCANOUT flag whereas the >> >> > omap4 allocator driver wouldn't have to. No complex negotiation >> >> > of constraints - it just "knows". >> >> > >> >> >> >> well, it is same allocating driver in both cases (although maybe >> >> that is unimportant). The "it" that just knows it wants to scanout >> >> is userspace. The "it" that just knows that scanout translates to >> >> contiguous (or not) is the kernel. Perhaps we are saying the same >> >> thing ;-) >> > >> > Yeah - I think we are... so what's the issue with having a per-SoC >> > allocation driver again? >> > >> >> In a way the display driver is a per-SoC allocator. But not >> necessarily the *central* allocator for everything. Ie. no need for >> display driver to allocate vertex buffers for a separate gpu driver, >> and that sort of thing. > > Again, I'm only talking about allocating buffers which will be shared > between different devices. At no point have I mentioned the allocation > of buffers which aren't to be shared between devices. Sorry if that's > not been clear. ok, I guess we were talking about slightly different things ;-) > So for buffers which are to be shared between devices, your suggesting > that the display driver is the per-SoC allocator? But as I say, and > how this thread got started, the same display driver can be used on > different SoCs, so having _it_ be the central allocator isn't ideal. > Though this is our current solution and why we're "abusing" the dumb > buffer allocation functions. :-) > which is why you want to let userspace figure out the pitch and then tell the display driver what size it wants, rather than using dumb buffer ioctl ;-) Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or property or what have you.. but I think that would be hard to get right for all cases, and most people don't really care about that because they already need a gpu/display specific xorg driver and/or gl/egl talking to their kernel driver. You are in a slightly special case, since you are providing GL driver independently of the display driver. But I think that is easier to handle by just telling your customers "here, fill out this function(s) to allocate buffer for scanout" (and, well, I guess you'd need one to query for pitch/stride), rather than trying to cram everything into the kernel. BR, -R > > > Cheers, > > Tom > > > > > ^ permalink raw reply related [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 17:12 ` Rob Clark (?) (?) @ 2013-08-14 15:01 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-14 15:01 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > >> > Turning to DRM/KMS, it seems the supported formats of a plane > >> > can be queried using drm_mode_get_plane. However, there doesn't > >> > seem to be a way to query the supported formats of a crtc? If > >> > display HW only supports scanning out from a single buffer > >> > (like pl111 does), I think it won't have any planes and a fb can > >> > only be set on the crtc. In which case, how should user-space > > >> query which pixel formats that crtc supports? > >> > >> it is exposed for drm plane's. What is missing is to expose the > >> primary-plane associated with the crtc. > > > > Cool - so a patch which adds a way to query the what formats a crtc > > supports would be welcome? > > well, I kinda think we want something that exposes the "primary plane" > of the crtc.. I'm thinking something roughly like: > > --------- > diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h > index 53db7ce..c7ffca8 100644 > --- a/include/uapi/drm/drm_mode.h > +++ b/include/uapi/drm/drm_mode.h > @@ -157,6 +157,12 @@ struct drm_mode_get_plane { > struct drm_mode_get_plane_res { > __u64 plane_id_ptr; > __u32 count_planes; > + /* The primary planes are in matching order to crtc_id_ptr in > + * drm_mode_card_res (and same length). For crtc_id[n], it's > + * primary plane is given by primary_plane_id[n]. > + */ > + __u32 count_primary_planes; > + __u64 primary_plane_id_ptr; > }; Yup - I think that works and allows userspace to query the supported formats of the crtc. Great! <SNIP> > which is why you want to let userspace figure out the pitch and then > tell the display driver what size it wants, rather than using dumb > buffer ioctl ;-) > > Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or > property or what have you.. but I think that would be hard to get > right for all cases, and most people don't really care about that > because they already need a gpu/display specific xorg driver and/or > gl/egl talking to their kernel driver. You are in a slightly special > case, since you are providing GL driver independently of the display > driver. But I think that is easier to handle by just telling your > customers "here, fill out this function(s) to allocate buffer for > scanout" (and, well, I guess you'd need one to query for > pitch/stride), rather than trying to cram everything into the kernel. I fear we're going round in circles here, so time to step back a sec. My first goal is to figure out how to solve our immediate problem of how to allocate buffers in our DDX which doesn't abuse the dumb buffer interface. As stated at the start of this thread, we need to allocate two types of buffers: 1) Those which will be shared between GPU & Display. These must be allocated in such a way as to satisfy both devices' constraints. 2) Those which will only be used by the GPU (DRI2 buffers, pixmaps which have been imported into a client's EGL, etc.) It must be possible to obtain handles to both those types of buffers in a single DRM/GEM name-space to allow us to implement DRI2. I think we can satisfy the first buffer type by adjusting the display DRM's dumb buffer alloc function to always allocate buffers which also satisfy the GPU's constraints. In pl111_drm, that means all buffers are allocated with a 64-byte stride alignment, even on SoCs without a GPU. Not a big issue really and if it were we could make it a Kconfig option or something. >From what I have now understood, allocating GPU-only buffers should be done with a device-specific ioctl. We then have a choice about which DRM driver we should add that device-specific ioctl to. We could either add it to the display controller's DRM or we could add it to the GPU's DRM. Adding it to the GPU's DRM requires user-space to jump through quite a lot of hoops: In order to get both the scan-out GEM buffers and DRI2 GEM buffers in a single device's name-space, it would have to use PRIME to export, say dumb scan-out buffers from the display's DRM as dma_buf fds, then import those dma_buf fds into the GPU's DRM and then use flink to give those imported scan-out buffers a name in the GPU's DRM's namespace. Yuck. No, I think it is easier to just add allocating GPU DRI2 buffers as a device-specific ioctl on the display controller's DRM. Indeed, this appears to be what OMAP and Exynos DRM drivers (and maybe others) do. One device does all the allocations and thus all buffers are already in the same namespace, no faffing with exporting & importing buffers in the DDX required. We will need to figure out a way in the xf86-video-armsoc DDX to abstract those driver-specific allocation ioctls. Using GBM is an interesting idea - looking at the interface it seems to be very, _very_ similar to Android's gralloc! Though I don't see how to get a system-wide name for a buffer I can pass back to a client via DRI2? I assume gbm_bo_handle is process-local? In the short term, I think we'll just use run-time detection of the underlying DRM and bake knowledge of specific DRMs into the DDX. Anyway, I think we have a conclusion on the "how to allocate buffers for the X.Org/DRI stack" question. However, there are more advanced use-cases like streaming from v4l2 to DRM for which I remain convinced the current allocation model really doesn't work for. Or at a minimum causes significant code duplication in many DRM drivers and forces lots of per-SoC code in userspace which could otherwise be avoided. However, I'm keen to not go round in any more circles on this mail thread and suggest we defer that conversation to Linux Plumbers. :-) Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 17:12 ` Rob Clark ` (2 preceding siblings ...) (?) @ 2013-08-14 15:01 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-14 15:01 UTC (permalink / raw) To: linux-arm-kernel > >> > Turning to DRM/KMS, it seems the supported formats of a plane > >> > can be queried using drm_mode_get_plane. However, there doesn't > >> > seem to be a way to query the supported formats of a crtc? If > >> > display HW only supports scanning out from a single buffer > >> > (like pl111 does), I think it won't have any planes and a fb can > >> > only be set on the crtc. In which case, how should user-space > > >> query which pixel formats that crtc supports? > >> > >> it is exposed for drm plane's. What is missing is to expose the > >> primary-plane associated with the crtc. > > > > Cool - so a patch which adds a way to query the what formats a crtc > > supports would be welcome? > > well, I kinda think we want something that exposes the "primary plane" > of the crtc.. I'm thinking something roughly like: > > --------- > diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h > index 53db7ce..c7ffca8 100644 > --- a/include/uapi/drm/drm_mode.h > +++ b/include/uapi/drm/drm_mode.h > @@ -157,6 +157,12 @@ struct drm_mode_get_plane { > struct drm_mode_get_plane_res { > __u64 plane_id_ptr; > __u32 count_planes; > + /* The primary planes are in matching order to crtc_id_ptr in > + * drm_mode_card_res (and same length). For crtc_id[n], it's > + * primary plane is given by primary_plane_id[n]. > + */ > + __u32 count_primary_planes; > + __u64 primary_plane_id_ptr; > }; Yup - I think that works and allows userspace to query the supported formats of the crtc. Great! <SNIP> > which is why you want to let userspace figure out the pitch and then > tell the display driver what size it wants, rather than using dumb > buffer ioctl ;-) > > Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or > property or what have you.. but I think that would be hard to get > right for all cases, and most people don't really care about that > because they already need a gpu/display specific xorg driver and/or > gl/egl talking to their kernel driver. You are in a slightly special > case, since you are providing GL driver independently of the display > driver. But I think that is easier to handle by just telling your > customers "here, fill out this function(s) to allocate buffer for > scanout" (and, well, I guess you'd need one to query for > pitch/stride), rather than trying to cram everything into the kernel. I fear we're going round in circles here, so time to step back a sec. My first goal is to figure out how to solve our immediate problem of how to allocate buffers in our DDX which doesn't abuse the dumb buffer interface. As stated at the start of this thread, we need to allocate two types of buffers: 1) Those which will be shared between GPU & Display. These must be allocated in such a way as to satisfy both devices' constraints. 2) Those which will only be used by the GPU (DRI2 buffers, pixmaps which have been imported into a client's EGL, etc.) It must be possible to obtain handles to both those types of buffers in a single DRM/GEM name-space to allow us to implement DRI2. I think we can satisfy the first buffer type by adjusting the display DRM's dumb buffer alloc function to always allocate buffers which also satisfy the GPU's constraints. In pl111_drm, that means all buffers are allocated with a 64-byte stride alignment, even on SoCs without a GPU. Not a big issue really and if it were we could make it a Kconfig option or something. >From what I have now understood, allocating GPU-only buffers should be done with a device-specific ioctl. We then have a choice about which DRM driver we should add that device-specific ioctl to. We could either add it to the display controller's DRM or we could add it to the GPU's DRM. Adding it to the GPU's DRM requires user-space to jump through quite a lot of hoops: In order to get both the scan-out GEM buffers and DRI2 GEM buffers in a single device's name-space, it would have to use PRIME to export, say dumb scan-out buffers from the display's DRM as dma_buf fds, then import those dma_buf fds into the GPU's DRM and then use flink to give those imported scan-out buffers a name in the GPU's DRM's namespace. Yuck. No, I think it is easier to just add allocating GPU DRI2 buffers as a device-specific ioctl on the display controller's DRM. Indeed, this appears to be what OMAP and Exynos DRM drivers (and maybe others) do. One device does all the allocations and thus all buffers are already in the same namespace, no faffing with exporting & importing buffers in the DDX required. We will need to figure out a way in the xf86-video-armsoc DDX to abstract those driver-specific allocation ioctls. Using GBM is an interesting idea - looking at the interface it seems to be very, _very_ similar to Android's gralloc! Though I don't see how to get a system-wide name for a buffer I can pass back to a client via DRI2? I assume gbm_bo_handle is process-local? In the short term, I think we'll just use run-time detection of the underlying DRM and bake knowledge of specific DRMs into the DDX. Anyway, I think we have a conclusion on the "how to allocate buffers for the X.Org/DRI stack" question. However, there are more advanced use-cases like streaming from v4l2 to DRM for which I remain convinced the current allocation model really doesn't work for. Or at a minimum causes significant code duplication in many DRM drivers and forces lots of per-SoC code in userspace which could otherwise be avoided. However, I'm keen to not go round in any more circles on this mail thread and suggest we defer that conversation to Linux Plumbers. :-) Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-09 17:12 ` Rob Clark ` (3 preceding siblings ...) (?) @ 2013-08-14 15:01 ` Tom Cooksey -1 siblings, 0 replies; 131+ messages in thread From: Tom Cooksey @ 2013-08-14 15:01 UTC (permalink / raw) To: 'Rob Clark' Cc: linux-fbdev, Pawel Moll, linux-kernel, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media > >> > Turning to DRM/KMS, it seems the supported formats of a plane > >> > can be queried using drm_mode_get_plane. However, there doesn't > >> > seem to be a way to query the supported formats of a crtc? If > >> > display HW only supports scanning out from a single buffer > >> > (like pl111 does), I think it won't have any planes and a fb can > >> > only be set on the crtc. In which case, how should user-space > > >> query which pixel formats that crtc supports? > >> > >> it is exposed for drm plane's. What is missing is to expose the > >> primary-plane associated with the crtc. > > > > Cool - so a patch which adds a way to query the what formats a crtc > > supports would be welcome? > > well, I kinda think we want something that exposes the "primary plane" > of the crtc.. I'm thinking something roughly like: > > --------- > diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h > index 53db7ce..c7ffca8 100644 > --- a/include/uapi/drm/drm_mode.h > +++ b/include/uapi/drm/drm_mode.h > @@ -157,6 +157,12 @@ struct drm_mode_get_plane { > struct drm_mode_get_plane_res { > __u64 plane_id_ptr; > __u32 count_planes; > + /* The primary planes are in matching order to crtc_id_ptr in > + * drm_mode_card_res (and same length). For crtc_id[n], it's > + * primary plane is given by primary_plane_id[n]. > + */ > + __u32 count_primary_planes; > + __u64 primary_plane_id_ptr; > }; Yup - I think that works and allows userspace to query the supported formats of the crtc. Great! <SNIP> > which is why you want to let userspace figure out the pitch and then > tell the display driver what size it wants, rather than using dumb > buffer ioctl ;-) > > Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or > property or what have you.. but I think that would be hard to get > right for all cases, and most people don't really care about that > because they already need a gpu/display specific xorg driver and/or > gl/egl talking to their kernel driver. You are in a slightly special > case, since you are providing GL driver independently of the display > driver. But I think that is easier to handle by just telling your > customers "here, fill out this function(s) to allocate buffer for > scanout" (and, well, I guess you'd need one to query for > pitch/stride), rather than trying to cram everything into the kernel. I fear we're going round in circles here, so time to step back a sec. My first goal is to figure out how to solve our immediate problem of how to allocate buffers in our DDX which doesn't abuse the dumb buffer interface. As stated at the start of this thread, we need to allocate two types of buffers: 1) Those which will be shared between GPU & Display. These must be allocated in such a way as to satisfy both devices' constraints. 2) Those which will only be used by the GPU (DRI2 buffers, pixmaps which have been imported into a client's EGL, etc.) It must be possible to obtain handles to both those types of buffers in a single DRM/GEM name-space to allow us to implement DRI2. I think we can satisfy the first buffer type by adjusting the display DRM's dumb buffer alloc function to always allocate buffers which also satisfy the GPU's constraints. In pl111_drm, that means all buffers are allocated with a 64-byte stride alignment, even on SoCs without a GPU. Not a big issue really and if it were we could make it a Kconfig option or something. >From what I have now understood, allocating GPU-only buffers should be done with a device-specific ioctl. We then have a choice about which DRM driver we should add that device-specific ioctl to. We could either add it to the display controller's DRM or we could add it to the GPU's DRM. Adding it to the GPU's DRM requires user-space to jump through quite a lot of hoops: In order to get both the scan-out GEM buffers and DRI2 GEM buffers in a single device's name-space, it would have to use PRIME to export, say dumb scan-out buffers from the display's DRM as dma_buf fds, then import those dma_buf fds into the GPU's DRM and then use flink to give those imported scan-out buffers a name in the GPU's DRM's namespace. Yuck. No, I think it is easier to just add allocating GPU DRI2 buffers as a device-specific ioctl on the display controller's DRM. Indeed, this appears to be what OMAP and Exynos DRM drivers (and maybe others) do. One device does all the allocations and thus all buffers are already in the same namespace, no faffing with exporting & importing buffers in the DDX required. We will need to figure out a way in the xf86-video-armsoc DDX to abstract those driver-specific allocation ioctls. Using GBM is an interesting idea - looking at the interface it seems to be very, _very_ similar to Android's gralloc! Though I don't see how to get a system-wide name for a buffer I can pass back to a client via DRI2? I assume gbm_bo_handle is process-local? In the short term, I think we'll just use run-time detection of the underlying DRM and bake knowledge of specific DRMs into the DDX. Anyway, I think we have a conclusion on the "how to allocate buffers for the X.Org/DRI stack" question. However, there are more advanced use-cases like streaming from v4l2 to DRM for which I remain convinced the current allocation model really doesn't work for. Or at a minimum causes significant code duplication in many DRM drivers and forces lots of per-SoC code in userspace which could otherwise be avoided. However, I'm keen to not go round in any more circles on this mail thread and suggest we defer that conversation to Linux Plumbers. :-) Cheers, Tom ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-06 12:15 ` Rob Clark (?) (?) @ 2013-08-07 4:23 ` John Stultz -1 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 4:23 UTC (permalink / raw) To: Rob Clark Cc: Tom Cooksey, linux-fbdev, Pawel Moll, lkml, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media, Rebecca Schultz Zavin, Erik Gilling, Ross Oldfield On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: > On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> >>> > So in some respects, there is a constraint on how buffers which will >>> > be drawn to using the GPU are allocated. I don't really like the idea >>> > of teaching the display controller DRM driver about the GPU buffer >>> > constraints, even if they are fairly trivial like this. If the same >>> > display HW IP is being used on several SoCs, it seems wrong somehow >>> > to enforce those GPU constraints if some of those SoCs don't have a >>> > GPU. >>> >>> Well, I suppose you could get min_pitch_alignment from devicetree, or >>> something like this.. >>> >>> In the end, the easy solution is just to make the display allocate to >>> the worst-case pitch alignment. In the early days of dma-buf >>> discussions, we kicked around the idea of negotiating or >>> programatically describing the constraints, but that didn't really >>> seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >>> > We may also then have additional constraints when sharing buffers >>> > between the display HW and video decode or even camera ISP HW. >>> > Programmatically describing buffer allocation constraints is very >>> > difficult and I'm not sure you can actually do it - there's some >>> > pretty complex constraints out there! E.g. I believe there's a >>> > platform where Y and UV planes of the reference frame need to be in >>> > separate DRAM banks for real-time 1080p decode, or something like >>> > that? >>> >>> yes, this was discussed. This is different from pitch/format/size >>> constraints.. it is really just a placement constraint (ie. where do >>> the physical pages go). IIRC the conclusion was to use a dummy >>> devices with it's own CMA pool for attaching the Y vs UV buffers. >>> >>> > Anyway, I guess my point is that even if we solve how to allocate >>> > buffers which will be shared between the GPU and display HW such that >>> > both sets of constraints are satisfied, that may not be the end of >>> > the story. >>> > >>> >>> that was part of the reason to punt this problem to userspace ;-) >>> >>> In practice, the kernel drivers doesn't usually know too much about >>> the dimensions/format/etc.. that is really userspace level knowledge. >>> There are a few exceptions when the kernel needs to know how to setup >>> GTT/etc for tiled buffers, but normally this sort of information is up >>> at the next level up (userspace, and drm_framebuffer in case of >>> scanout). Userspace media frameworks like GStreamer already have a >>> concept of format/caps negotiation. For non-display<->gpu sharing, I >>> think this is probably where this sort of constraint negotiation >>> should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. > > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time One concern I know the Android folks have expressed previously (and correct me if its no longer an objection), is that this attach time in-kernel constraint solving / moving or reallocating buffers is likely to hurt determinism. If I understood, their perspective was that userland knows the device path the buffers will travel through, so why not leverage that knowledge, rather then having the kernel have to sort it out for itself after the fact. The concern about determinism even makes them hesitant about CMA, over things like carveout, as they don't want to be moving pages around at allocation time (which could hurt reasonable use cases like the time it takes to launch a camera app - which is quite important). Though maybe this concern will lessen as more CMA solutions ship. I worry some of this split between fully general solutions vs hard-coded known constraints is somewhat intractable. But what might make it easier to get android folks interested in approaches like the attach-time allocation / relocating on late-attach you're proposing is if there is maybe some way, as you suggested, to hint the allocation when they do know the device paths, so they can provide that insight and can avoid *all* reallocations. Then of course, there's the question of how to consistently hint things for all the different driver allocation interfaces (and that, maybe naively, leads to the central allocator approach). thanks -john ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 4:23 ` John Stultz 0 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 4:23 UTC (permalink / raw) To: Rob Clark Cc: linux-fbdev, Erik Gilling, Pawel Moll, lkml, dri-devel, linaro-mm-sig, Ross Oldfield, Tom Cooksey, Rebecca Schultz Zavin, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: > On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> >>> > So in some respects, there is a constraint on how buffers which will >>> > be drawn to using the GPU are allocated. I don't really like the idea >>> > of teaching the display controller DRM driver about the GPU buffer >>> > constraints, even if they are fairly trivial like this. If the same >>> > display HW IP is being used on several SoCs, it seems wrong somehow >>> > to enforce those GPU constraints if some of those SoCs don't have a >>> > GPU. >>> >>> Well, I suppose you could get min_pitch_alignment from devicetree, or >>> something like this.. >>> >>> In the end, the easy solution is just to make the display allocate to >>> the worst-case pitch alignment. In the early days of dma-buf >>> discussions, we kicked around the idea of negotiating or >>> programatically describing the constraints, but that didn't really >>> seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >>> > We may also then have additional constraints when sharing buffers >>> > between the display HW and video decode or even camera ISP HW. >>> > Programmatically describing buffer allocation constraints is very >>> > difficult and I'm not sure you can actually do it - there's some >>> > pretty complex constraints out there! E.g. I believe there's a >>> > platform where Y and UV planes of the reference frame need to be in >>> > separate DRAM banks for real-time 1080p decode, or something like >>> > that? >>> >>> yes, this was discussed. This is different from pitch/format/size >>> constraints.. it is really just a placement constraint (ie. where do >>> the physical pages go). IIRC the conclusion was to use a dummy >>> devices with it's own CMA pool for attaching the Y vs UV buffers. >>> >>> > Anyway, I guess my point is that even if we solve how to allocate >>> > buffers which will be shared between the GPU and display HW such that >>> > both sets of constraints are satisfied, that may not be the end of >>> > the story. >>> > >>> >>> that was part of the reason to punt this problem to userspace ;-) >>> >>> In practice, the kernel drivers doesn't usually know too much about >>> the dimensions/format/etc.. that is really userspace level knowledge. >>> There are a few exceptions when the kernel needs to know how to setup >>> GTT/etc for tiled buffers, but normally this sort of information is up >>> at the next level up (userspace, and drm_framebuffer in case of >>> scanout). Userspace media frameworks like GStreamer already have a >>> concept of format/caps negotiation. For non-display<->gpu sharing, I >>> think this is probably where this sort of constraint negotiation >>> should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. > > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time One concern I know the Android folks have expressed previously (and correct me if its no longer an objection), is that this attach time in-kernel constraint solving / moving or reallocating buffers is likely to hurt determinism. If I understood, their perspective was that userland knows the device path the buffers will travel through, so why not leverage that knowledge, rather then having the kernel have to sort it out for itself after the fact. The concern about determinism even makes them hesitant about CMA, over things like carveout, as they don't want to be moving pages around at allocation time (which could hurt reasonable use cases like the time it takes to launch a camera app - which is quite important). Though maybe this concern will lessen as more CMA solutions ship. I worry some of this split between fully general solutions vs hard-coded known constraints is somewhat intractable. But what might make it easier to get android folks interested in approaches like the attach-time allocation / relocating on late-attach you're proposing is if there is maybe some way, as you suggested, to hint the allocation when they do know the device paths, so they can provide that insight and can avoid *all* reallocations. Then of course, there's the question of how to consistently hint things for all the different driver allocation interfaces (and that, maybe naively, leads to the central allocator approach). thanks -john ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 4:23 ` John Stultz 0 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 4:23 UTC (permalink / raw) To: linux-arm-kernel On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: > On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> >>> > So in some respects, there is a constraint on how buffers which will >>> > be drawn to using the GPU are allocated. I don't really like the idea >>> > of teaching the display controller DRM driver about the GPU buffer >>> > constraints, even if they are fairly trivial like this. If the same >>> > display HW IP is being used on several SoCs, it seems wrong somehow >>> > to enforce those GPU constraints if some of those SoCs don't have a >>> > GPU. >>> >>> Well, I suppose you could get min_pitch_alignment from devicetree, or >>> something like this.. >>> >>> In the end, the easy solution is just to make the display allocate to >>> the worst-case pitch alignment. In the early days of dma-buf >>> discussions, we kicked around the idea of negotiating or >>> programatically describing the constraints, but that didn't really >>> seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >>> > We may also then have additional constraints when sharing buffers >>> > between the display HW and video decode or even camera ISP HW. >>> > Programmatically describing buffer allocation constraints is very >>> > difficult and I'm not sure you can actually do it - there's some >>> > pretty complex constraints out there! E.g. I believe there's a >>> > platform where Y and UV planes of the reference frame need to be in >>> > separate DRAM banks for real-time 1080p decode, or something like >>> > that? >>> >>> yes, this was discussed. This is different from pitch/format/size >>> constraints.. it is really just a placement constraint (ie. where do >>> the physical pages go). IIRC the conclusion was to use a dummy >>> devices with it's own CMA pool for attaching the Y vs UV buffers. >>> >>> > Anyway, I guess my point is that even if we solve how to allocate >>> > buffers which will be shared between the GPU and display HW such that >>> > both sets of constraints are satisfied, that may not be the end of >>> > the story. >>> > >>> >>> that was part of the reason to punt this problem to userspace ;-) >>> >>> In practice, the kernel drivers doesn't usually know too much about >>> the dimensions/format/etc.. that is really userspace level knowledge. >>> There are a few exceptions when the kernel needs to know how to setup >>> GTT/etc for tiled buffers, but normally this sort of information is up >>> at the next level up (userspace, and drm_framebuffer in case of >>> scanout). Userspace media frameworks like GStreamer already have a >>> concept of format/caps negotiation. For non-display<->gpu sharing, I >>> think this is probably where this sort of constraint negotiation >>> should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. > > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time One concern I know the Android folks have expressed previously (and correct me if its no longer an objection), is that this attach time in-kernel constraint solving / moving or reallocating buffers is likely to hurt determinism. If I understood, their perspective was that userland knows the device path the buffers will travel through, so why not leverage that knowledge, rather then having the kernel have to sort it out for itself after the fact. The concern about determinism even makes them hesitant about CMA, over things like carveout, as they don't want to be moving pages around at allocation time (which could hurt reasonable use cases like the time it takes to launch a camera app - which is quite important). Though maybe this concern will lessen as more CMA solutions ship. I worry some of this split between fully general solutions vs hard-coded known constraints is somewhat intractable. But what might make it easier to get android folks interested in approaches like the attach-time allocation / relocating on late-attach you're proposing is if there is maybe some way, as you suggested, to hint the allocation when they do know the device paths, so they can provide that insight and can avoid *all* reallocations. Then of course, there's the question of how to consistently hint things for all the different driver allocation interfaces (and that, maybe naively, leads to the central allocator approach). thanks -john ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 4:23 ` John Stultz 0 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 4:23 UTC (permalink / raw) To: Rob Clark Cc: linux-fbdev, Erik Gilling, Pawel Moll, lkml, dri-devel, linaro-mm-sig, Ross Oldfield, Tom Cooksey, Rebecca Schultz Zavin, linux-arm-kernel, linux-media On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: > On Tue, Aug 6, 2013 at 7:31 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> >>> > So in some respects, there is a constraint on how buffers which will >>> > be drawn to using the GPU are allocated. I don't really like the idea >>> > of teaching the display controller DRM driver about the GPU buffer >>> > constraints, even if they are fairly trivial like this. If the same >>> > display HW IP is being used on several SoCs, it seems wrong somehow >>> > to enforce those GPU constraints if some of those SoCs don't have a >>> > GPU. >>> >>> Well, I suppose you could get min_pitch_alignment from devicetree, or >>> something like this.. >>> >>> In the end, the easy solution is just to make the display allocate to >>> the worst-case pitch alignment. In the early days of dma-buf >>> discussions, we kicked around the idea of negotiating or >>> programatically describing the constraints, but that didn't really >>> seem like a bounded problem. >> >> Yeah - I was around for some of those discussions and agree it's not >> really an easy problem to solve. >> >> >> >>> > We may also then have additional constraints when sharing buffers >>> > between the display HW and video decode or even camera ISP HW. >>> > Programmatically describing buffer allocation constraints is very >>> > difficult and I'm not sure you can actually do it - there's some >>> > pretty complex constraints out there! E.g. I believe there's a >>> > platform where Y and UV planes of the reference frame need to be in >>> > separate DRAM banks for real-time 1080p decode, or something like >>> > that? >>> >>> yes, this was discussed. This is different from pitch/format/size >>> constraints.. it is really just a placement constraint (ie. where do >>> the physical pages go). IIRC the conclusion was to use a dummy >>> devices with it's own CMA pool for attaching the Y vs UV buffers. >>> >>> > Anyway, I guess my point is that even if we solve how to allocate >>> > buffers which will be shared between the GPU and display HW such that >>> > both sets of constraints are satisfied, that may not be the end of >>> > the story. >>> > >>> >>> that was part of the reason to punt this problem to userspace ;-) >>> >>> In practice, the kernel drivers doesn't usually know too much about >>> the dimensions/format/etc.. that is really userspace level knowledge. >>> There are a few exceptions when the kernel needs to know how to setup >>> GTT/etc for tiled buffers, but normally this sort of information is up >>> at the next level up (userspace, and drm_framebuffer in case of >>> scanout). Userspace media frameworks like GStreamer already have a >>> concept of format/caps negotiation. For non-display<->gpu sharing, I >>> think this is probably where this sort of constraint negotiation >>> should be handled. >> >> I agree that user-space will know which devices will access the buffer >> and thus can figure out at least a common pixel format. Though I'm not >> so sure userspace can figure out more low-level details like alignment >> and placement in physical memory, etc. > > well, let's divide things up into two categories: > > 1) the arrangement and format of pixels.. ie. what userspace would > need to know if it mmap's a buffer. This includes pixel format, > stride, etc. This should be negotiated in userspace, it would be > crazy to try to do this in the kernel. > > 2) the physical placement of the pages. Ie. whether it is contiguous > or not. Which bank the pages in the buffer are placed in, etc. This > is not visible to userspace. This is the purpose of the attach step, > so you know all the devices involved in sharing up front before > allocating the backing pages. (Or in the worst case, if you have a > "late attacher" you at least know when no device is doing dma access > to a buffer and can reallocate and move the buffer.) A long time One concern I know the Android folks have expressed previously (and correct me if its no longer an objection), is that this attach time in-kernel constraint solving / moving or reallocating buffers is likely to hurt determinism. If I understood, their perspective was that userland knows the device path the buffers will travel through, so why not leverage that knowledge, rather then having the kernel have to sort it out for itself after the fact. The concern about determinism even makes them hesitant about CMA, over things like carveout, as they don't want to be moving pages around at allocation time (which could hurt reasonable use cases like the time it takes to launch a camera app - which is quite important). Though maybe this concern will lessen as more CMA solutions ship. I worry some of this split between fully general solutions vs hard-coded known constraints is somewhat intractable. But what might make it easier to get android folks interested in approaches like the attach-time allocation / relocating on late-attach you're proposing is if there is maybe some way, as you suggested, to hint the allocation when they do know the device paths, so they can provide that insight and can avoid *all* reallocations. Then of course, there's the question of how to consistently hint things for all the different driver allocation interfaces (and that, maybe naively, leads to the central allocator approach). thanks -john ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-08-07 4:23 ` John Stultz (?) @ 2013-08-07 13:27 ` Rob Clark -1 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-07 13:27 UTC (permalink / raw) To: John Stultz Cc: Tom Cooksey, linux-fbdev, Pawel Moll, lkml, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media, Rebecca Schultz Zavin, Erik Gilling, Ross Oldfield On Wed, Aug 7, 2013 at 12:23 AM, John Stultz <john.stultz@linaro.org> wrote: > On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. >> >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time > > One concern I know the Android folks have expressed previously (and > correct me if its no longer an objection), is that this attach time > in-kernel constraint solving / moving or reallocating buffers is > likely to hurt determinism. If I understood, their perspective was > that userland knows the device path the buffers will travel through, > so why not leverage that knowledge, rather then having the kernel have > to sort it out for itself after the fact. If you know the device path, then attach the buffer at all the devices before you start using it. Problem solved.. kernel knows all devices before pages need be allocated ;-) BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 13:27 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-07 13:27 UTC (permalink / raw) To: linux-arm-kernel On Wed, Aug 7, 2013 at 12:23 AM, John Stultz <john.stultz@linaro.org> wrote: > On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. >> >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time > > One concern I know the Android folks have expressed previously (and > correct me if its no longer an objection), is that this attach time > in-kernel constraint solving / moving or reallocating buffers is > likely to hurt determinism. If I understood, their perspective was > that userland knows the device path the buffers will travel through, > so why not leverage that knowledge, rather then having the kernel have > to sort it out for itself after the fact. If you know the device path, then attach the buffer at all the devices before you start using it. Problem solved.. kernel knows all devices before pages need be allocated ;-) BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 13:27 ` Rob Clark 0 siblings, 0 replies; 131+ messages in thread From: Rob Clark @ 2013-08-07 13:27 UTC (permalink / raw) To: John Stultz Cc: Tom Cooksey, linux-fbdev, Pawel Moll, lkml, dri-devel, linaro-mm-sig, linux-arm-kernel, linux-media, Rebecca Schultz Zavin, Erik Gilling, Ross Oldfield On Wed, Aug 7, 2013 at 12:23 AM, John Stultz <john.stultz@linaro.org> wrote: > On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark <robdclark@gmail.com> wrote: >> well, let's divide things up into two categories: >> >> 1) the arrangement and format of pixels.. ie. what userspace would >> need to know if it mmap's a buffer. This includes pixel format, >> stride, etc. This should be negotiated in userspace, it would be >> crazy to try to do this in the kernel. >> >> 2) the physical placement of the pages. Ie. whether it is contiguous >> or not. Which bank the pages in the buffer are placed in, etc. This >> is not visible to userspace. This is the purpose of the attach step, >> so you know all the devices involved in sharing up front before >> allocating the backing pages. (Or in the worst case, if you have a >> "late attacher" you at least know when no device is doing dma access >> to a buffer and can reallocate and move the buffer.) A long time > > One concern I know the Android folks have expressed previously (and > correct me if its no longer an objection), is that this attach time > in-kernel constraint solving / moving or reallocating buffers is > likely to hurt determinism. If I understood, their perspective was > that userland knows the device path the buffers will travel through, > so why not leverage that knowledge, rather then having the kernel have > to sort it out for itself after the fact. If you know the device path, then attach the buffer at all the devices before you start using it. Problem solved.. kernel knows all devices before pages need be allocated ;-) BR, -R ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 2013-07-29 14:58 ` Rob Clark (?) @ 2013-08-07 3:57 ` John Stultz -1 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 3:57 UTC (permalink / raw) To: Rob Clark Cc: linux-fbdev, Erik Gilling, Pawel Moll, dri-devel, Ross Oldfield, Tom Cooksey, Rebecca Schultz Zavin, linux-arm-kernel On Mon, Jul 29, 2013 at 7:58 AM, Rob Clark <robdclark@gmail.com> wrote: > On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> Hi Rob, >> >>> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >>> > allocate buffers for the GPU. Still not sure how to resolve this >>> > as we don't use DRM for our GPU driver. >>> >>> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >>> once the dma-fence stuff is in place), we'd have gpu-specific drm >>> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >>> using prime/dmabuf to share between the two. >> >> The "extra" buffers we were allocating from armsoc DDX were really >> being allocated through DRM/GEM so we could get an flink name >> for them and pass a reference to them back to our GPU driver on >> the client side. If it weren't for our need to access those >> extra off-screen buffers with the GPU we wouldn't need to >> allocate them with DRM at all. So, given they are really "GPU" >> buffers, it does absolutely make sense to allocate them in a >> different driver to the display driver. >> >> However, to avoid unnecessary memcpys & related cache >> maintenance ops, we'd also like the GPU to render into buffers >> which are scanned out by the display controller. So let's say >> we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> out buffers with the display's DRM driver but a custom ioctl >> on the GPU's DRM driver to allocate non scanout, off-screen >> buffers. Sounds great, but I don't think that really works >> with DRI2. If we used two drivers to allocate buffers, which >> of those drivers do we return in DRI2ConnectReply? Even if we >> solve that somehow, GEM flink names are name-spaced to a >> single device node (AFAIK). So when we do a DRI2GetBuffers, >> how does the EGL in the client know which DRM device owns GEM >> flink name "1234"? We'd need some pretty dirty hacks. > > You would return the name of the display driver allocating the > buffers. On the client side you can use generic ioctls to go from > flink -> handle -> dmabuf. So the client side would end up opening > both the display drm device and the gpu, but without needing to know > too much about the display. > > (Probably in (for example) mesa there needs to be a bit of work to > handle this better, but I think that would be needed as well for > sharing between, say, nouveau and udl displaylink driver.. which is > really the same scenario.) > >> So then we looked at allocating _all_ buffers with the GPU's >> DRM driver. That solves the DRI2 single-device-name and single >> name-space issue. It also means the GPU would _never_ render >> into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. > > Well, I think we can differentiate between shared buffers and internal > buffers (textures, vertex upload, etc, etc). > > For example, in mesa/gallium driver there are two paths to getting a > buffer... ->resource_create() and ->resource_from_handle(), the > latter is the path you go for dri2 buffers shared w/ x11. The former > is buffers that are just internal to gpu (if !(bind & > PIPE_BIND_SHARED)). I guess you must have something similar in mali > driver. > >> One thing I wasn't sure about is if there was an objection >> to using PRIME to export scanout buffers allocated with >> DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU >> driver to be rendered into? Is that a concern? > > well.. it wasn't quite the original intention of the 'dumb' ioctls. > But I guess in the end if you hand the gpu a buffer with a > layout/format that it can't grok, then gpu does a staging texture plus > copy. If you had, for example, some setup w/ gpu that could only > render to tiled format, plus a display that could scanout that format, > then your DDX driver would need to allocate the dri2 buffers with > something other than dumb ioctl. (But at that point, you probably > need to implement your own EXA so you could hook in your own custom > upload-to/download-from screen fxns for sw fallbacks, so you're not > talking about using generic DDX anyways.) > >> Anyway, that latter case also gets quite difficult. The "GPU" >> DRM driver would need to know the constraints of the display >> controller when allocating buffers intended to be scanned out. >> For example, pl111 typically isn't behind an IOMMU and so >> requires physically contiguous memory. We'd have to teach the >> GPU's DRM driver about the constraints of the display HW. Not >> exactly a clean driver model. :-( >> >> I'm still a little stuck on how to proceed, so any ideas >> would greatly appreciated! My current train of thought is >> having a kind of SoC-specific DRM driver which allocates >> buffers for both display and GPU within a single GEM >> namespace. That SoC-specific DRM driver could then know the >> constraints of both the GPU and the display HW. We could then >> use PRIME to export buffers allocated with the SoC DRM driver >> and import them into the GPU and/or display DRM driver. > > Usually if the display drm driver is allocating the buffers that might > be scanned out, it just needs to have minimal knowledge of the GPU > (pitch alignment constraints). I don't think we need a 3rd device > just to allocate buffers. > > Really I don't think the separate display drm and gpu drm device is > much different from desktop udl/displaylink case. Or desktop > integrated-gpu + external gpu. Although on many SoCs, aren't the pipelines a bit more complicated? (ie: camera data -> doing face detection, -> applying filters w/ the gpu -> and displaying it). While ION does have interface issues (in that it requires userland to hard-code particular constraints for that specific SoC) just the idea of having the centralized allocator that manages the constraints does seem to make sense for these larger pipelines (and also the fact that there's different allocation APIs for various devices). One idea that was discussed in some of the hallway discussions at connect was if there was some way for the hardware to expose a "constraint cookie" via sysfs (basically a bit field where the meaning of each bit is opaque to userland - so each bit can mean a different constraint on different devices), userland could determine which combination of devices it wanted to pipe a buffer through, then AND the device constraint cookies together, and pass that to a centralized allocator. This allows the constraint solving to be pushed out to userland, but also avoids what I see as the biggest negatives in ION (ie: the heap mask is defined to userland, making it inflexible, as well as the lack of a method for userland to discover device constraints, so they're instead hard-coded in userland for each device). >> Note: While it doesn't use the DRM framework, the Mali T6xx >> kernel driver has supported importing buffers through dma_buf >> for some time. I've even written an EGL extension :-): >> >> <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im >> port.txt> >> >> >>> I'm not entirely sure that the directions that the current CDF >>> proposals are headed is necessarily the right way forward. I'd prefer >>> to see small/incremental evolution of KMS (ie. add drm_bridge and >>> drm_panel, and refactor the existing encoder-slave). Keeping it >>> inside drm means that we can evolve it more easily, and avoid layers >>> of glue code for no good reason. >> >> I think CDF could allow vendors to re-use code they've written >> for their Android driver stack in DRM drivers more easily. Though >> I guess ideally KMS would evolve to a point where it could be used >> by an Android driver stack. I.e. Support explicit fences. >> > > yeah, the best would be evolving DRM/KMS to the point that it meets > android requirements. Because I really don't think android has any > special requirements compared to, say, a wayland compositor. (Other > than the way they do synchronization. But that doesn't really *need* > to be different, as far as I can tell.) It would certainly be easier > if android dev's actually participated in upstream linux graphics, and > talked about what they needed, and maybe even contributed a patch or > two. But that is a whole different topic. It also couldn't hurt to cc them on these discussions (Rebecca and Erik cc'ed :) I think Ross covered much of the KMS issues at Linaro Connect (Video here if you want to spend the time: http://youtu.be/AuoFABKS_Dk), and indeed, the explicit syncing requirement is probably the biggest barrier. My understanding is the additional parallelism/performance that the explicit method provides is the key requirement (though maybe Erik or Ross can chime in with further details? - I've been told Erik has a long list as to why explicit is preferred over implicit and it would be good to get that documented somewhere :) I understand the implicit model is easier to program for, and Daniel mentioned dynamic buffer eviction as having a requirement on the implicit model, so I'm currently in the camp of assuming the synchronization needs are different enough that one method alone (of the two currently presented) likely won't work. My current hope is there's some hybrid approach that can be used (implicitly maybe for the majority of cases, but with the option of doing explicit syncing when necessary). thanks -john (who's unfortunately heading off on vacation soon, so might not be fast to reply) ^ permalink raw reply [flat|nested] 131+ messages in thread
* Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 3:57 ` John Stultz 0 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 3:57 UTC (permalink / raw) To: Rob Clark Cc: linux-fbdev, Erik Gilling, Pawel Moll, dri-devel, Ross Oldfield, Tom Cooksey, Rebecca Schultz Zavin, linux-arm-kernel On Mon, Jul 29, 2013 at 7:58 AM, Rob Clark <robdclark@gmail.com> wrote: > On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> Hi Rob, >> >>> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >>> > allocate buffers for the GPU. Still not sure how to resolve this >>> > as we don't use DRM for our GPU driver. >>> >>> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >>> once the dma-fence stuff is in place), we'd have gpu-specific drm >>> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >>> using prime/dmabuf to share between the two. >> >> The "extra" buffers we were allocating from armsoc DDX were really >> being allocated through DRM/GEM so we could get an flink name >> for them and pass a reference to them back to our GPU driver on >> the client side. If it weren't for our need to access those >> extra off-screen buffers with the GPU we wouldn't need to >> allocate them with DRM at all. So, given they are really "GPU" >> buffers, it does absolutely make sense to allocate them in a >> different driver to the display driver. >> >> However, to avoid unnecessary memcpys & related cache >> maintenance ops, we'd also like the GPU to render into buffers >> which are scanned out by the display controller. So let's say >> we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> out buffers with the display's DRM driver but a custom ioctl >> on the GPU's DRM driver to allocate non scanout, off-screen >> buffers. Sounds great, but I don't think that really works >> with DRI2. If we used two drivers to allocate buffers, which >> of those drivers do we return in DRI2ConnectReply? Even if we >> solve that somehow, GEM flink names are name-spaced to a >> single device node (AFAIK). So when we do a DRI2GetBuffers, >> how does the EGL in the client know which DRM device owns GEM >> flink name "1234"? We'd need some pretty dirty hacks. > > You would return the name of the display driver allocating the > buffers. On the client side you can use generic ioctls to go from > flink -> handle -> dmabuf. So the client side would end up opening > both the display drm device and the gpu, but without needing to know > too much about the display. > > (Probably in (for example) mesa there needs to be a bit of work to > handle this better, but I think that would be needed as well for > sharing between, say, nouveau and udl displaylink driver.. which is > really the same scenario.) > >> So then we looked at allocating _all_ buffers with the GPU's >> DRM driver. That solves the DRI2 single-device-name and single >> name-space issue. It also means the GPU would _never_ render >> into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. > > Well, I think we can differentiate between shared buffers and internal > buffers (textures, vertex upload, etc, etc). > > For example, in mesa/gallium driver there are two paths to getting a > buffer... ->resource_create() and ->resource_from_handle(), the > latter is the path you go for dri2 buffers shared w/ x11. The former > is buffers that are just internal to gpu (if !(bind & > PIPE_BIND_SHARED)). I guess you must have something similar in mali > driver. > >> One thing I wasn't sure about is if there was an objection >> to using PRIME to export scanout buffers allocated with >> DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU >> driver to be rendered into? Is that a concern? > > well.. it wasn't quite the original intention of the 'dumb' ioctls. > But I guess in the end if you hand the gpu a buffer with a > layout/format that it can't grok, then gpu does a staging texture plus > copy. If you had, for example, some setup w/ gpu that could only > render to tiled format, plus a display that could scanout that format, > then your DDX driver would need to allocate the dri2 buffers with > something other than dumb ioctl. (But at that point, you probably > need to implement your own EXA so you could hook in your own custom > upload-to/download-from screen fxns for sw fallbacks, so you're not > talking about using generic DDX anyways.) > >> Anyway, that latter case also gets quite difficult. The "GPU" >> DRM driver would need to know the constraints of the display >> controller when allocating buffers intended to be scanned out. >> For example, pl111 typically isn't behind an IOMMU and so >> requires physically contiguous memory. We'd have to teach the >> GPU's DRM driver about the constraints of the display HW. Not >> exactly a clean driver model. :-( >> >> I'm still a little stuck on how to proceed, so any ideas >> would greatly appreciated! My current train of thought is >> having a kind of SoC-specific DRM driver which allocates >> buffers for both display and GPU within a single GEM >> namespace. That SoC-specific DRM driver could then know the >> constraints of both the GPU and the display HW. We could then >> use PRIME to export buffers allocated with the SoC DRM driver >> and import them into the GPU and/or display DRM driver. > > Usually if the display drm driver is allocating the buffers that might > be scanned out, it just needs to have minimal knowledge of the GPU > (pitch alignment constraints). I don't think we need a 3rd device > just to allocate buffers. > > Really I don't think the separate display drm and gpu drm device is > much different from desktop udl/displaylink case. Or desktop > integrated-gpu + external gpu. Although on many SoCs, aren't the pipelines a bit more complicated? (ie: camera data -> doing face detection, -> applying filters w/ the gpu -> and displaying it). While ION does have interface issues (in that it requires userland to hard-code particular constraints for that specific SoC) just the idea of having the centralized allocator that manages the constraints does seem to make sense for these larger pipelines (and also the fact that there's different allocation APIs for various devices). One idea that was discussed in some of the hallway discussions at connect was if there was some way for the hardware to expose a "constraint cookie" via sysfs (basically a bit field where the meaning of each bit is opaque to userland - so each bit can mean a different constraint on different devices), userland could determine which combination of devices it wanted to pipe a buffer through, then AND the device constraint cookies together, and pass that to a centralized allocator. This allows the constraint solving to be pushed out to userland, but also avoids what I see as the biggest negatives in ION (ie: the heap mask is defined to userland, making it inflexible, as well as the lack of a method for userland to discover device constraints, so they're instead hard-coded in userland for each device). >> Note: While it doesn't use the DRM framework, the Mali T6xx >> kernel driver has supported importing buffers through dma_buf >> for some time. I've even written an EGL extension :-): >> >> <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im >> port.txt> >> >> >>> I'm not entirely sure that the directions that the current CDF >>> proposals are headed is necessarily the right way forward. I'd prefer >>> to see small/incremental evolution of KMS (ie. add drm_bridge and >>> drm_panel, and refactor the existing encoder-slave). Keeping it >>> inside drm means that we can evolve it more easily, and avoid layers >>> of glue code for no good reason. >> >> I think CDF could allow vendors to re-use code they've written >> for their Android driver stack in DRM drivers more easily. Though >> I guess ideally KMS would evolve to a point where it could be used >> by an Android driver stack. I.e. Support explicit fences. >> > > yeah, the best would be evolving DRM/KMS to the point that it meets > android requirements. Because I really don't think android has any > special requirements compared to, say, a wayland compositor. (Other > than the way they do synchronization. But that doesn't really *need* > to be different, as far as I can tell.) It would certainly be easier > if android dev's actually participated in upstream linux graphics, and > talked about what they needed, and maybe even contributed a patch or > two. But that is a whole different topic. It also couldn't hurt to cc them on these discussions (Rebecca and Erik cc'ed :) I think Ross covered much of the KMS issues at Linaro Connect (Video here if you want to spend the time: http://youtu.be/AuoFABKS_Dk), and indeed, the explicit syncing requirement is probably the biggest barrier. My understanding is the additional parallelism/performance that the explicit method provides is the key requirement (though maybe Erik or Ross can chime in with further details? - I've been told Erik has a long list as to why explicit is preferred over implicit and it would be good to get that documented somewhere :) I understand the implicit model is easier to program for, and Daniel mentioned dynamic buffer eviction as having a requirement on the implicit model, so I'm currently in the camp of assuming the synchronization needs are different enough that one method alone (of the two currently presented) likely won't work. My current hope is there's some hybrid approach that can be used (implicitly maybe for the majority of cases, but with the option of doing explicit syncing when necessary). thanks -john (who's unfortunately heading off on vacation soon, so might not be fast to reply) ^ permalink raw reply [flat|nested] 131+ messages in thread
* [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 @ 2013-08-07 3:57 ` John Stultz 0 siblings, 0 replies; 131+ messages in thread From: John Stultz @ 2013-08-07 3:57 UTC (permalink / raw) To: linux-arm-kernel On Mon, Jul 29, 2013 at 7:58 AM, Rob Clark <robdclark@gmail.com> wrote: > On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey <tom.cooksey@arm.com> wrote: >> Hi Rob, >> >>> > * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also >>> > allocate buffers for the GPU. Still not sure how to resolve this >>> > as we don't use DRM for our GPU driver. >>> >>> any thoughts/plans about a DRM GPU driver? Ideally long term (esp. >>> once the dma-fence stuff is in place), we'd have gpu-specific drm >>> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver, >>> using prime/dmabuf to share between the two. >> >> The "extra" buffers we were allocating from armsoc DDX were really >> being allocated through DRM/GEM so we could get an flink name >> for them and pass a reference to them back to our GPU driver on >> the client side. If it weren't for our need to access those >> extra off-screen buffers with the GPU we wouldn't need to >> allocate them with DRM at all. So, given they are really "GPU" >> buffers, it does absolutely make sense to allocate them in a >> different driver to the display driver. >> >> However, to avoid unnecessary memcpys & related cache >> maintenance ops, we'd also like the GPU to render into buffers >> which are scanned out by the display controller. So let's say >> we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan >> out buffers with the display's DRM driver but a custom ioctl >> on the GPU's DRM driver to allocate non scanout, off-screen >> buffers. Sounds great, but I don't think that really works >> with DRI2. If we used two drivers to allocate buffers, which >> of those drivers do we return in DRI2ConnectReply? Even if we >> solve that somehow, GEM flink names are name-spaced to a >> single device node (AFAIK). So when we do a DRI2GetBuffers, >> how does the EGL in the client know which DRM device owns GEM >> flink name "1234"? We'd need some pretty dirty hacks. > > You would return the name of the display driver allocating the > buffers. On the client side you can use generic ioctls to go from > flink -> handle -> dmabuf. So the client side would end up opening > both the display drm device and the gpu, but without needing to know > too much about the display. > > (Probably in (for example) mesa there needs to be a bit of work to > handle this better, but I think that would be needed as well for > sharing between, say, nouveau and udl displaylink driver.. which is > really the same scenario.) > >> So then we looked at allocating _all_ buffers with the GPU's >> DRM driver. That solves the DRI2 single-device-name and single >> name-space issue. It also means the GPU would _never_ render >> into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB. > > Well, I think we can differentiate between shared buffers and internal > buffers (textures, vertex upload, etc, etc). > > For example, in mesa/gallium driver there are two paths to getting a > buffer... ->resource_create() and ->resource_from_handle(), the > latter is the path you go for dri2 buffers shared w/ x11. The former > is buffers that are just internal to gpu (if !(bind & > PIPE_BIND_SHARED)). I guess you must have something similar in mali > driver. > >> One thing I wasn't sure about is if there was an objection >> to using PRIME to export scanout buffers allocated with >> DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU >> driver to be rendered into? Is that a concern? > > well.. it wasn't quite the original intention of the 'dumb' ioctls. > But I guess in the end if you hand the gpu a buffer with a > layout/format that it can't grok, then gpu does a staging texture plus > copy. If you had, for example, some setup w/ gpu that could only > render to tiled format, plus a display that could scanout that format, > then your DDX driver would need to allocate the dri2 buffers with > something other than dumb ioctl. (But at that point, you probably > need to implement your own EXA so you could hook in your own custom > upload-to/download-from screen fxns for sw fallbacks, so you're not > talking about using generic DDX anyways.) > >> Anyway, that latter case also gets quite difficult. The "GPU" >> DRM driver would need to know the constraints of the display >> controller when allocating buffers intended to be scanned out. >> For example, pl111 typically isn't behind an IOMMU and so >> requires physically contiguous memory. We'd have to teach the >> GPU's DRM driver about the constraints of the display HW. Not >> exactly a clean driver model. :-( >> >> I'm still a little stuck on how to proceed, so any ideas >> would greatly appreciated! My current train of thought is >> having a kind of SoC-specific DRM driver which allocates >> buffers for both display and GPU within a single GEM >> namespace. That SoC-specific DRM driver could then know the >> constraints of both the GPU and the display HW. We could then >> use PRIME to export buffers allocated with the SoC DRM driver >> and import them into the GPU and/or display DRM driver. > > Usually if the display drm driver is allocating the buffers that might > be scanned out, it just needs to have minimal knowledge of the GPU > (pitch alignment constraints). I don't think we need a 3rd device > just to allocate buffers. > > Really I don't think the separate display drm and gpu drm device is > much different from desktop udl/displaylink case. Or desktop > integrated-gpu + external gpu. Although on many SoCs, aren't the pipelines a bit more complicated? (ie: camera data -> doing face detection, -> applying filters w/ the gpu -> and displaying it). While ION does have interface issues (in that it requires userland to hard-code particular constraints for that specific SoC) just the idea of having the centralized allocator that manages the constraints does seem to make sense for these larger pipelines (and also the fact that there's different allocation APIs for various devices). One idea that was discussed in some of the hallway discussions at connect was if there was some way for the hardware to expose a "constraint cookie" via sysfs (basically a bit field where the meaning of each bit is opaque to userland - so each bit can mean a different constraint on different devices), userland could determine which combination of devices it wanted to pipe a buffer through, then AND the device constraint cookies together, and pass that to a centralized allocator. This allows the constraint solving to be pushed out to userland, but also avoids what I see as the biggest negatives in ION (ie: the heap mask is defined to userland, making it inflexible, as well as the lack of a method for userland to discover device constraints, so they're instead hard-coded in userland for each device). >> Note: While it doesn't use the DRM framework, the Mali T6xx >> kernel driver has supported importing buffers through dma_buf >> for some time. I've even written an EGL extension :-): >> >> <http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_im >> port.txt> >> >> >>> I'm not entirely sure that the directions that the current CDF >>> proposals are headed is necessarily the right way forward. I'd prefer >>> to see small/incremental evolution of KMS (ie. add drm_bridge and >>> drm_panel, and refactor the existing encoder-slave). Keeping it >>> inside drm means that we can evolve it more easily, and avoid layers >>> of glue code for no good reason. >> >> I think CDF could allow vendors to re-use code they've written >> for their Android driver stack in DRM drivers more easily. Though >> I guess ideally KMS would evolve to a point where it could be used >> by an Android driver stack. I.e. Support explicit fences. >> > > yeah, the best would be evolving DRM/KMS to the point that it meets > android requirements. Because I really don't think android has any > special requirements compared to, say, a wayland compositor. (Other > than the way they do synchronization. But that doesn't really *need* > to be different, as far as I can tell.) It would certainly be easier > if android dev's actually participated in upstream linux graphics, and > talked about what they needed, and maybe even contributed a patch or > two. But that is a whole different topic. It also couldn't hurt to cc them on these discussions (Rebecca and Erik cc'ed :) I think Ross covered much of the KMS issues at Linaro Connect (Video here if you want to spend the time: http://youtu.be/AuoFABKS_Dk), and indeed, the explicit syncing requirement is probably the biggest barrier. My understanding is the additional parallelism/performance that the explicit method provides is the key requirement (though maybe Erik or Ross can chime in with further details? - I've been told Erik has a long list as to why explicit is preferred over implicit and it would be good to get that documented somewhere :) I understand the implicit model is easier to program for, and Daniel mentioned dynamic buffer eviction as having a requirement on the implicit model, so I'm currently in the camp of assuming the synchronization needs are different enough that one method alone (of the two currently presented) likely won't work. My current hope is there's some hybrid approach that can be used (implicitly maybe for the majority of cases, but with the option of doing explicit syncing when necessary). thanks -john (who's unfortunately heading off on vacation soon, so might not be fast to reply) ^ permalink raw reply [flat|nested] 131+ messages in thread
end of thread, other threads:[~2013-09-14 21:34 UTC | newest] Thread overview: 131+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-07-25 17:17 [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111 tom.cooksey 2013-07-25 17:17 ` tom.cooksey 2013-07-25 17:17 ` tom.cooksey at arm.com 2013-07-25 17:17 ` [RFC 1/1] " tom.cooksey 2013-08-07 16:17 ` Rob Clark 2013-08-07 16:46 ` Daniel Vetter 2013-08-07 16:46 ` Daniel Vetter 2013-08-07 16:46 ` Daniel Vetter 2013-08-09 16:15 ` Tom Cooksey 2013-08-09 16:15 ` Tom Cooksey 2013-08-09 16:15 ` Tom Cooksey [not found] ` <520515b0.88b70e0a.3ecd.1004SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-09 16:34 ` Rob Clark 2013-08-09 16:34 ` Rob Clark 2013-08-09 16:34 ` Rob Clark 2013-08-09 16:57 ` Daniel Vetter 2013-08-09 16:57 ` Daniel Vetter 2013-08-09 16:57 ` Daniel Vetter 2013-08-09 17:31 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2013-08-13 14:35 ` Tom Cooksey 2013-08-09 17:31 ` Tom Cooksey 2013-08-09 17:31 ` Tom Cooksey [not found] ` <5205277e.84320f0a.1cdf.ffff8816SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-10 12:30 ` Rob Clark 2013-08-10 12:30 ` Rob Clark 2013-08-10 12:30 ` Rob Clark [not found] ` <520a4435.070a0e0a.4fce.ffff9731SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-13 14:58 ` Rob Clark 2013-08-13 14:58 ` Rob Clark 2013-08-13 14:58 ` Rob Clark 2013-08-14 14:59 ` Tom Cooksey 2013-08-14 14:59 ` Tom Cooksey 2013-08-14 14:59 ` Tom Cooksey 2013-08-10 4:56 ` Inki Dae 2013-07-25 18:21 ` [RFC 0/1] " Rob Clark 2013-07-25 18:21 ` Rob Clark 2013-07-25 18:21 ` Rob Clark 2013-07-26 14:06 ` Pawel Moll 2013-07-26 14:06 ` Pawel Moll 2013-07-26 14:06 ` Pawel Moll 2013-07-26 14:21 ` Rob Clark 2013-07-26 14:21 ` Rob Clark 2013-07-26 14:21 ` Rob Clark 2013-07-26 14:41 ` Pawel Moll 2013-07-26 14:41 ` Pawel Moll 2013-07-26 14:41 ` Pawel Moll 2013-07-26 14:14 ` Russell King - ARM Linux 2013-07-26 14:14 ` Russell King - ARM Linux 2013-07-26 14:14 ` Russell King - ARM Linux 2013-07-26 14:24 ` Rob Clark 2013-07-26 14:24 ` Rob Clark 2013-07-26 14:24 ` Rob Clark 2013-07-26 15:58 ` Tom Cooksey 2013-07-26 15:58 ` Tom Cooksey 2013-07-26 15:58 ` Tom Cooksey [not found] ` <51f29cd1.e686440a.66b2.fffff9d5SMTPIN_ADDED_BROKEN@mx.google.com> 2013-07-26 18:56 ` Daniel Vetter 2013-07-26 18:56 ` Daniel Vetter 2013-07-26 18:56 ` Daniel Vetter [not found] ` <51f29ccd.f014b40a.34cc.ffffca2aSMTPIN_ADDED_BROKEN@mx.google.com> 2013-07-29 14:58 ` Rob Clark 2013-07-29 14:58 ` Rob Clark 2013-07-29 14:58 ` Rob Clark 2013-08-05 17:10 ` Tom Cooksey 2013-08-05 17:10 ` Tom Cooksey 2013-08-05 17:10 ` Tom Cooksey [not found] ` <51ffdc7e.06b8b40a.2cc8.0fe0SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-05 18:07 ` Rob Clark 2013-08-05 18:07 ` Rob Clark 2013-08-05 18:07 ` Rob Clark 2013-08-06 11:31 ` Tom Cooksey 2013-08-06 11:31 ` Tom Cooksey 2013-08-06 11:31 ` Tom Cooksey 2013-08-06 12:18 ` [Linaro-mm-sig] " Lucas Stach 2013-08-06 12:18 ` Lucas Stach 2013-08-06 12:18 ` Lucas Stach 2013-08-06 14:14 ` Rob Clark 2013-08-06 14:14 ` Rob Clark 2013-08-06 14:14 ` Rob Clark 2013-08-06 14:36 ` Lucas Stach 2013-08-06 14:36 ` Lucas Stach 2013-08-06 14:36 ` Lucas Stach 2013-08-06 14:59 ` Rob Clark 2013-08-06 14:59 ` Rob Clark 2013-08-06 14:59 ` Rob Clark 2013-08-06 15:28 ` Daniel Vetter 2013-08-06 15:28 ` Daniel Vetter 2013-08-06 15:28 ` Daniel Vetter 2013-08-06 15:28 ` Daniel Vetter 2013-09-14 21:33 ` Daniel Stone 2013-09-14 21:33 ` Daniel Stone 2013-09-14 21:33 ` Daniel Stone [not found] ` <5200deb3.0b24b40a.3b26.ffffbadeSMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-06 12:15 ` Rob Clark 2013-08-06 12:15 ` Rob Clark 2013-08-06 12:15 ` Rob Clark 2013-08-06 14:03 ` Tom Cooksey 2013-08-06 14:03 ` Tom Cooksey 2013-08-06 14:03 ` Tom Cooksey [not found] ` <52010257.245fc20a.6ff8.1cfdSMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-06 14:40 ` Rob Clark 2013-08-06 14:40 ` Rob Clark 2013-08-06 14:40 ` Rob Clark 2013-08-06 17:38 ` Tom Cooksey 2013-08-06 17:38 ` Tom Cooksey 2013-08-06 17:38 ` Tom Cooksey [not found] ` <52013482.e107c20a.27f9.ffffa718SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-06 19:06 ` Rob Clark 2013-08-06 19:06 ` Rob Clark 2013-08-06 19:06 ` Rob Clark 2013-08-07 17:33 ` Tom Cooksey 2013-08-07 17:33 ` Tom Cooksey 2013-08-07 17:33 ` Tom Cooksey [not found] ` <520284fe.a16ec20a.2d3c.6e19SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-07 17:56 ` [Linaro-mm-sig] " Alex Deucher 2013-08-07 17:56 ` Alex Deucher 2013-08-07 17:56 ` Alex Deucher [not found] ` <520284d6.07300f0a.72a4.1623SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-07 18:12 ` Rob Clark 2013-08-07 18:12 ` Rob Clark 2013-08-07 18:12 ` Rob Clark 2013-08-09 16:15 ` Tom Cooksey 2013-08-09 16:15 ` Tom Cooksey 2013-08-09 16:15 ` Tom Cooksey [not found] ` <520515b9.87370f0a.16e6.2380SMTPIN_ADDED_BROKEN@mx.google.com> 2013-08-09 17:12 ` Rob Clark 2013-08-09 17:12 ` Rob Clark 2013-08-09 17:12 ` Rob Clark 2013-08-14 15:01 ` Tom Cooksey 2013-08-14 15:01 ` Tom Cooksey 2013-08-14 15:01 ` Tom Cooksey 2013-08-07 4:23 ` John Stultz 2013-08-07 4:23 ` John Stultz 2013-08-07 4:23 ` John Stultz 2013-08-07 4:23 ` John Stultz 2013-08-07 13:27 ` Rob Clark 2013-08-07 13:27 ` Rob Clark 2013-08-07 13:27 ` Rob Clark 2013-08-07 3:57 ` John Stultz 2013-08-07 3:57 ` John Stultz 2013-08-07 3:57 ` John Stultz
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.