FS threading brings performance improvements of 0-20% in glmark2.
The validation code checks for thread switch signals and ensures that
the registers of the other thread are not touched, and that our clamps
are not live across thread switches. It also checks that the
threading and branching instructions do not interfere.
(Original patch by Jonas, changes by anholt for style cleanup,
removing validation the kernel doesn't need to do, and adding the flag
for userspace).
v2: Minor style fixes from checkpatch.
Signed-off-by: Jonas Pfeil <pfeiljonas@gmx.de>
Signed-off-by: Eric Anholt <eric@anholt.net>
With the introduction of bin/render pipelining, the previous job may
not be completed when we start binning the next one. If the previous
job wrote our VBO, IB, or CS textures, then the binning stage might
get stale or uninitialized results.
Fixes the major rendering failure in glmark2 -b terrain.
Signed-off-by: Eric Anholt <eric@anholt.net>
Fixes: ca26d28bba ("drm/vc4: improve throughput by pipelining binning and rendering jobs")
Cc: stable@vger.kernel.org
The combo of list_empty() check and return list_first_entry()
can be replaced with list_first_entry_or_null().
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Overflow memory handling is tricky: While it's still referenced by the
BPO registers, we want to keep it from being freed. When we are
putting a new set of overflow memory in the registers, we need to
assign the old one to the last rendering job using it.
We were looking at "what's currently running in the binner", but since
the bin/render submission split, we may end up with the binner
completing and having no new job while the renderer is still
processing. So, if we don't find a bin job at all, look at the
highest-seqno (last) render job to attach our overflow to.
Signed-off-by: Eric Anholt <eric@anholt.net>
Fixes: ca26d28bba ("drm/vc4: improve throughput by pipelining binning and rendering jobs")
Cc: stable@vger.kernel.org
allowing GLSL shaders with non-unrolled loops.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXiWHaAAoJELXWKTbR/J7olkkP/3buVvzBdxQmT7yMeBL2Up+F
MZ+W/pr7+gWXT/rvyY8zbEZV1txQtZDnvu8cC3h9E66TVYpTx3dYypECwiRKFt5f
HXhrkiniu65zBPhtYxloswcr686SpFOhr9rwy6cJbCCvuZQZVpQ41Fu7BinaYJc2
uGllnNmRkB7tPG/6BYQ/MAWvR7AjTaaAmcdhNZDyEHqXHiUu8NPprL0BQIwS1GMn
65pQjni/WEek6/WxncxszaDWfsLMRJPKiuPV4kAsoQDluB3ortk3k1VVNJ+n3aXw
GjlMNbU2ganEGWJS16e7HXXz3I19SQHn5mnL74X3GGL8r2x9AIZ64Zk7Ru659NBq
CW/NsSWM/TfU6T4iflbZjSVlmtbvdX5T0rDFb+lMFNXG8EvmRIc3fkbSeh6gF9sF
hc3UDS3ScQBAm8QOpDiVAsXNBGJUTsvxdDYgVUj1mjPtbD1ZDmEWYdb2fzs/n7ja
qU/CW9Ap3tModkTjaD58RoenTkbTXymhxmRd1JHoxVztkAM00fNxtGP6fTxuGLOW
snOoE9Ahdq/IzV/yZrqvWRCLTDenQ/ZxcP+fYleJ6M3qLxfHkgZX/chy/n21Q80G
Qe40uqBWreTCz6SzCEOXeaJ9RDbZfSomVfWuLzm5IGMhOmh1rqkpdgAUKQW7OvxB
e/LnruAmY/K7Yin87gK6
=Qh0U
-----END PGP SIGNATURE-----
Merge tag 'drm-vc4-next-2016-07-15' of https://github.com/anholt/linux into drm-next
This pull request brings in vc4 shader validation for branching,
allowing GLSL shaders with non-unrolled loops.
* tag 'drm-vc4-next-2016-07-15' of https://github.com/anholt/linux:
drm/vc4: Fix a "the the" typo in a comment.
drm/vc4: Fix definition of QPU_R_MS_REV_FLAGS
drm/vc4: Add a getparam to signal support for branches.
drm/vc4: Add support for branching in shader validation.
drm/vc4: Add a bitmap of branch targets during shader validation.
drm/vc4: Move validation's current/max ip into the validation struct.
drm/vc4: Add a getparam ioctl for getting the V3D identity regs.
We're already checking that branch instructions are between the start
of the shader and the proper PROG_END sequence. The other thing we
need to make branching safe is to verify that the shader doesn't read
past the end of the uniforms stream.
To do that, we require that at any basic block reading uniforms have
the following instructions:
load_imm temp, <next offset within uniform stream>
add unif_addr, temp, unif
The instructions are generated by userspace, and the kernel verifies
that the load_imm is of the expected offset, and that the add adds it
to a uniform. We track which uniform in the stream that is, and at
draw call time fix up the uniform stream to have the address of the
start of the shader's uniforms at that location.
Signed-off-by: Eric Anholt <eric@anholt.net>
vblank timestamping, and a couple of small cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXhUULAAoJELXWKTbR/J7oZ1oQAKnTTJlnCbaSNrWbRBUuaMtO
S2RQKyMI2LOpf4XE13cNHm0IaYNOw6hJIXlnxeuzHbQSGvOrHnabluZuvAfLa3OQ
4bb8K0elWsPRbtmx2L8DjncLCmmmOsbczrTwo3efq70XZs4PyaCp2vHpCU8kI7HJ
xO/fk6E8sYDPFCxZxb82C7LTSjQBlPn58dD+cEFg1kGILOhyR1eZwRj8e4netjKU
gN/059R6VPPor+I8oIMrqoHAxAlZUGnVPsK/1rgR5cfxlC1NPKU71eI7xiBLNclt
86BcvU/LuwAYt81UirnGTQU/m47dCMK4mmpzD/9EgVW/KTUPLPubML+u9fL67+B4
BJ7T7N4t7APblqkO/iI9DcTAkZNmydq4sdsfDdXcVZ3umDTNpmb/PQuq+rFEc1CZ
qsBw13kJ4bHBYUpZvrxuCesSRLpXhUiSOg/uVnRHTdiALZqQQyzQlnw6Q4GCMKiz
R99/k4/7/cAgQkGElwLeHR8Aw0r26m+X4pnLN9lrafjQjlH04CcuvFwlak28jnit
Oi0eZ0VD9RTHwPzUIoe4M32Q/7oAe7y5b7gCNPbM/0s56WQv+cIHi7CdgdIYtUWs
BIyeBTzpLvzkEvrEMYoipgpv3scbudMssTd3bo8gsdOu0tllBzi59Poo5hl6pQbs
lseJ7LavmK5sPDPsIZb3
=aF3C
-----END PGP SIGNATURE-----
Merge tag 'drm-vc4-next-2016-07-12' of https://github.com/anholt/linux into drm-next
This pull request brings in new vc4 plane formats for Android, precise
vblank timestamping, and a couple of small cleanups.
* tag 'drm-vc4-next-2016-07-12' of https://github.com/anholt/linux:
drm/vc4: remove redundant ret status check
drm/vc4: Implement precise vblank timestamping.
drm/vc4: Bind the HVS before we bind the individual CRTCs.
gpu: drm: vc4_hdmi: add missing of_node_put after calling of_parse_phandle
drm: vc4: enable XBGR8888 and ABGR8888 pixel formats
drm/vc4: clean up error exit path on failed dpi_connector allocation
Precise vblank timestamping is implemented via the
usual scanout position based method. On VC4 the
pixelvalves PV do not have a scanout position
register. Only the hardware video scaler HVS has a
similar register which describes which scanline for
the output is currently composited and stored in the
HVS fifo for later consumption by the PV.
This causes a problem in that the HVS runs at a much
faster clock (system clock / audio gate) than the PV
which runs at video mode dot clock, so the unless the
fifo between HVS and PV is full, the HVS will progress
faster in its observable read line position than video
scan rate, so the HVS position reading can't be directly
translated into a scanout position for timestamp correction.
Additionally when the PV is in vblank, it doesn't consume
from the fifo, so the fifo gets full very quickly and then
the HVS stops compositing until the PV enters active scanout
and starts consuming scanlines from the fifo again, making
new space for the HVS to composite.
Therefore a simple translation of HVS read position into
elapsed time since (or to) start of active scanout does
not work, but for the most interesting cases we can still
get useful and sufficiently accurate results:
1. The PV enters active scanout of a new frame with the
fifo of the HVS completely full, and the HVS can refill
any fifo line which gets consumed and thereby freed up by
the PV during active scanout very quickly. Therefore the
PV and HVS work effectively in lock-step during active
scanout with the fifo never having more than 1 scanline
freed up by the PV before it gets refilled. The PV's
real scanout position is therefore trailing the HVS
compositing position as scanoutpos = hvspos - fifosize
and we can get the true scanoutpos as HVS readpos minus
fifo size, so precise timestamping works while in active
scanout, except for the last few scanlines of the frame,
when the HVS reaches end of frame, stops compositing and
the PV catches up and drains the fifo. This special case
would only introduce minor errors though.
2. If we are in vblank, then we can only guess something
reasonable. If called from vblank irq, we assume the irq is
usually dispatched with minimum delay, so we can take a
timestamp taken at entry into the vblank irq handler as a
baseline and then add a full vblank duration until the
guessed start of active scanout. As irq dispatch is usually
pretty low latency this works with relatively low jitter and
good results.
If we aren't called from vblank then we could be anywhere
within the vblank interval, so we return a neutral result,
simply the current system timestamp, and hope for the best.
Measurement shows the generated timestamps to be rather precise,
and at least never off more than 1 vblank duration worst-case.
Limitations: Doesn't work well yet for interlaced video modes,
therefore disabled in interlaced mode for now.
v2: Use the DISPBASE registers to determine the FIFO size (changes
by anholt)
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com> (v2)
... and use it in msm&vc4. Again just want to encapsulate
drm_atomic_state internals a bit.
The const threading is a bit awkward in vc4 since C sucks, but I still
think it's worth to enforce this. Eventually I want to make all the
obj->state pointers const too, but that's a lot more work ...
v2: Provide safe macro to wrap up the unsafe helper better, suggested
by Maarten.
v3: Fixup subject (Maarten) and spelling fixes (Eric Engestrom).
Cc: Eric Anholt <eric@anholt.net>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Eric Engestrom <eric.engestrom@imgtec.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1464877304-4213-1-git-send-email-daniel.vetter@ffwll.ch
The DPI interface involves taking a ton of our GPIOs to be used as
outputs, and routing display signals over them in parallel.
v2: Use display_info.bus_formats[] to replace our custom DT
properties.
v3: Rebase on V3D documentation changes.
v4: Fix rebase detritus from V3D documentation changes.
Signed-off-by: Eric Anholt <eric@anholt.net>
Acked-by: Rob Herring <robh@kernel.org>
The hardware provides us with separate threads for binning and
rendering, and the existing model waits for them both to complete
before submitting the next job.
Splitting the binning and rendering submissions reduces idle time and
gives us approx 20-30% speedup with some x11perf tests such as -line10
and -tilerect1. Improves openarena performance by 1.01897% +/-
0.247857% (n=16).
Thanks to anholt for suggesting this.
v2: Rebase on the spurious resets fix (change by anholt).
Signed-off-by: Varad Gautam <varadgautam@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Eric Anholt <eric@anholt.net>
This gets us functional GPU reset again, like we had until a refactor
at merge time. Tested with a little patch to stuff in a broken binner
job every 100 frames.
Signed-off-by: Eric Anholt <eric@anholt.net>
This may actually get us a feature that the closed driver didn't have:
turning off the GPU in between rendering jobs, while the V3D device is
still opened by the client.
There may be some tuning to be applied here to use autosuspend so that
we don't bounce the device's power so much, but in steady-state
GPU-bound rendering we keep the power on (since we keep multiple jobs
outstanding) and even if we power cycle on every job we can still
manage at least 680 fps.
More importantly, though, runtime PM will allow us to power off the
device to do a GPU reset.
v2: Switch #ifdef to CONFIG_PM not CONFIG_PM_SLEEP (caught by kbuild
test robot)
Signed-off-by: Eric Anholt <eric@anholt.net>
We were tracking the "where are the head pointers pointing" globally,
so if another job reused the same BOs and execution was at the same
point as last time we checked, we'd stop and trigger a reset even
though the GPU had made progress.
Signed-off-by: Eric Anholt <eric@anholt.net>
This implements a simple policy for choosing scaling modes
(trapezoidal for decimation, PPF for magnification), and a single PPF
filter (Mitchell/Netravali's recommendation).
Signed-off-by: Eric Anholt <eric@anholt.net>
So far, we've only ever lit up one CRTC, so this has been fine. To
extend to more displays or more planes, we need to make sure we don't
run our display lists into each other.
Signed-off-by: Eric Anholt <eric@anholt.net>
Again since the drm core takes care of event unlinking/disarming this
is now just needless code.
v2: Fixup misplaced hunk.
Cc: Eric Anholt <eric@anholt.net>
Acked-by: Daniel Stone <daniels@collabora.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (v1)
Acked-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1453756616-28942-14-git-send-email-daniel.vetter@ffwll.ch
This can be parsed with vc4-gpu-tools tools for trying to figure out
what was going on.
v2: Use __u32-style types.
Signed-off-by: Eric Anholt <eric@anholt.net>
An async pageflip stores the modeset to be done and executes it once
the BOs are ready to be displayed. This gets us about 3x performance
in full screen rendering with pageflipping.
Signed-off-by: Eric Anholt <eric@anholt.net>
The user submission is basically a pointer to a command list and a
pointer to uniforms. We copy those in to the kernel, validate and
relocate them, and store the result in a GPU BO which we queue for
execution.
v2: Drop support for NV shader recs (not necessary for GL), simplify
vc4_use_bo(), improve bin flush/semaphore checks, use __u32 style
types.
Signed-off-by: Eric Anholt <eric@anholt.net>
Since we have no MMU, the kernel needs to validate that the submitted
shader code won't make any accesses to memory that the user doesn't
control, which involves banning some operations (general purpose DMA
writes), and tracking where we need to write out pointers for other
operations (texture sampling). Once it's validated, we return a GEM
BO containing the shader, which doesn't allow mapping for write or
exporting to other subsystems.
v2: Use __u32-style types.
Signed-off-by: Eric Anholt <eric@anholt.net>
While there exist dumb APIs for creating and mapping BOs, one of the
rules is that drivers doing 3D acceleration have to provide their own
APIs for buffer allocation (besides, the pitch/height parameters of
the dumb alloc don't really make sense for a lot of 3D allocations).
v2: Use __u32-style types, use "drm.h" instead of <drm/drm.h>.
Signed-off-by: Eric Anholt <eric@anholt.net>
We need to allocate new BOs in the kernel as part of each frame, but
the CMA allocator is way too slow for that. As an optimization, keep
track of recently-freed BOs and reuse them, with a 1 second timeout to
fully free them back to the system.
This improves 3D performance by about 15%.
Signed-off-by: Eric Anholt <eric@anholt.net>
This pull request introduces the vc4 driver, for kernel modesetting on
the Raspberry Pi (bcm2835/bcm2836 architectures). It currently
supports a display plane and cursor on the HDMI output. The driver
doesn't do 3D, power management, or overlay planes yet.
[airlied: fixup the enable/disable vblank APIs]
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
* tag 'drm-vc4-next-2015-10-21' of http://github.com/anholt/linux:
drm/vc4: Allow vblank to be disabled
drm/vc4: Use the fbdev_cma helpers
drm/vc4: Add KMS support for Raspberry Pi.
drm/vc4: Add devicetree bindings for VC4.
Keep the fbdev_cma pointer around so we can use it on hotplog and close
to ensure the frame buffer console is in a useful state.
Signed-off-by: Derek Foreman <derekf@osg.samsung.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
This is enough for fbcon and bringing up X using
xf86-video-modesetting. It doesn't support the 3D accelerator or
power management yet.
v2: Drop FB_HELPER select thanks to Archit's patches. Do manual init
ordering instead of using the .load hook. Structure registration
more like tegra's, but still using the typical "component" code.
Drop no-op hooks for atomic_begin and mode_fixup() now that
they're optional. Drop sentinel in Makefile. Fix minor style
nits I noticed on another reread.
v3: Use the new bcm2835 clk driver to manage pixel/HSM clocks instead
of having a fixed video mode. Use exynos-style component driver
matching instead of devicetree nodes to list the component driver
instances. Rename compatibility strings to say bcm2835, and
distinguish pv0/1/2. Clean up some h/vsync code, and add in
interlaced mode setup. Fix up probe/bind error paths. Use
bitops.h macros for vc4_regs.h
v4: Include i2c.h, allow building under COMPILE_TEST, drop msleep now
that other bugs have been fixed, add timeouts to cpu_relax()
loops, rename hpd-gpio to hpd-gpios.
Signed-off-by: Eric Anholt <eric@anholt.net>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>