mesa/.gitlab-ci/tests
Deborah Brouwer 72c182f873 ci/lava: Detect a6xx gpu recovery failures
Sporadically, the a6xx gpu will fail to recover, causing the lava job
a660_vk_full to loop on error messages for three hours before timing
out.

The gpu may still recover after a few sporadic error messages, but when
multiple errors occur over a short period, successful recovery is unlikely.
Parse the logs to look for repeated error messages within a short time period.
If found, cancel the lava job and rerun it.
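
The detection described above can be sketched as a sliding-window counter over timestamped log lines. This is only an illustration, not the code from the merge request: the class name, the error pattern, the threshold of three errors, and the three-minute window are all assumptions chosen for the example.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical values: the real message text and thresholds are not stated above.
ERROR_PATTERN = re.compile(r"gpu recovery failed")  # placeholder pattern
MAX_ERRORS = 3
WATCH_PERIOD = timedelta(minutes=3)


class RecoveryFailureDetector:
    """Flag a job once max_errors matching lines fall within one watch period."""

    def __init__(self, pattern=ERROR_PATTERN, max_errors=MAX_ERRORS,
                 period=WATCH_PERIOD):
        self.pattern = pattern
        self.max_errors = max_errors
        self.period = period
        self.hits = deque()  # timestamps of recent matching lines

    def feed(self, timestamp: datetime, line: str) -> bool:
        """Return True when the job should be cancelled and rerun."""
        if not self.pattern.search(line):
            return False
        self.hits.append(timestamp)
        # Drop matches that are older than the watch period.
        while timestamp - self.hits[0] > self.period:
            self.hits.popleft()
        return len(self.hits) >= self.max_errors
```

A caller would pass each parsed log line to `feed()` and, on the first `True`, cancel the lava job and resubmit it. Isolated errors that are spaced further apart than the watch period never accumulate, so a gpu that recovers on its own is left alone.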

Also add unit tests for this behaviour.

cc: mesa-stable

Reported-by: Valentine Burley <valentine.burley@gmail.com>
Acked-by: Daniel Stone <daniel.stone@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: Deborah Brouwer <deborah.brouwer@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30032>
2024-07-19 23:41:13 +00:00
..
data                        ci/lava: Add unit tests covering job definition                       2023-11-02 03:31:50 +00:00
lava                        ci/lava: Update lavacli version                                       2023-01-10 20:10:49 +00:00
utils                       ci/lava: Detect a6xx gpu recovery failures                            2024-07-19 23:41:13 +00:00
__init__.py
conftest.py                 ci/lava: Add a simple Structural Logger into submitter                2023-04-19 14:36:37 +00:00
test_lava_job_submitter.py  ci/lava: Don't run jobs if the remaining execution time is too short  2024-04-22 21:20:07 +00:00