Set exit_code to 1 in case of an exception; otherwise,
the job exits with 0, and GitLab shows the job as successful.
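The idea can be sketched as follows. This is a minimal illustration and not the submitter's actual code: `run_job`, `job`, and `failing_job` are hypothetical names introduced only for this example.

```python
import traceback


def run_job(job) -> int:
    """Run `job`, a hypothetical callable that returns an exit code."""
    exit_code = 0
    try:
        exit_code = job()
    except Exception:
        # Without this assignment the script would fall through with 0,
        # and GitLab would show the job as successful.
        traceback.print_exc()
        exit_code = 1
    return exit_code


def failing_job() -> int:
    """Hypothetical job that dies with an exception."""
    raise RuntimeError("simulated failure")
```

A caller would then hand the result to `sys.exit(run_job(...))` so that the CI job status reflects the failure.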
Fixes: b9cee06f9e ("ci/lava: handle non-zero exit codes")
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31556>
The LAVA job submitter always exits with code 1, regardless
of the HWCI_TEST_SCRIPT's exit code. This commit fixes the
LAVA job submitter to exit with the actual code returned by
the test.
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31189>
Instead of unpacking the x86_64_build container and its billion build
dependencies every time, switch to using only what's in the minimal
pyutils container, and the Python scripts we get as an artifact from the
python-test job. Pulling the artifacts from S3 rather than using GitLab
is also much more efficient.
This should substantially reduce the runtime required to get to testing.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31151>
Sporadically, an a6xx GPU fails to recover, causing the LAVA job
a660_vk_full to loop on error messages for three hours before timing
out.
A few sporadic error messages may still be recoverable, but when multiple
errors occur over a short period, successful recovery is unlikely. Parse
the logs to look for repeated error messages within a short time period.
If found, cancel the lava job and rerun it.
Also add unit tests for this behaviour.
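A sliding-window counter is one way to express "repeated error messages within a short time period". The sketch below is an assumption-laden illustration: the pattern, the threshold, and the window length are hypothetical placeholders, not the values used by the submitter.

```python
import re
from collections import deque

# Hypothetical pattern and limits, chosen only for illustration.
ERROR_PATTERN = re.compile(r"gpu (hang|fault)", re.IGNORECASE)
MAX_ERRORS = 3
WINDOW_SECONDS = 60.0


class RepeatedErrorDetector:
    def __init__(self, max_errors=MAX_ERRORS, window=WINDOW_SECONDS):
        self.max_errors = max_errors
        self.window = window
        self.timestamps = deque()

    def feed(self, timestamp: float, line: str) -> bool:
        """Return True when `line` is an error and at least `max_errors`
        errors (including this one) fall inside the time window."""
        if not ERROR_PATTERN.search(line):
            return False
        self.timestamps.append(timestamp)
        # Forget errors that are older than the window.
        while timestamp - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_errors
```

When `feed` returns True, the caller would cancel the LAVA job and resubmit it; isolated errors never reach the threshold and are left to recover.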
cc: mesa-stable
Reported-by: Valentine Burley <valentine.burley@gmail.com>
Acked-by: Daniel Stone <daniel.stone@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: Deborah Brouwer <deborah.brouwer@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30032>
Fastboot devices need an indirection for creating a boot image via
`mkbootimg`, so we need to propagate the cmdline from LAVA and our extra
arguments to it properly.
This commit fixes it by retrieving the default cmdline from LAVA and
sending it, together with the `extra_nfsroot_args` to the `mkbootimg`
command.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29611>
A farm variable has been added for devices in the Collabora farm.
Add support in the LAVA job submitter for using it in
structured log files.
Signed-off-by: Vignesh Raman <vignesh.raman@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29583>
Because of the expiration time in `mesa-lava` (1m), the kernel used in Mesa now
comes from `mesa-rootfs` (1y). As a result, a fresh kernel image has been
prepared, and Mesa also needs a few changes to adapt to this redirection.
cc: mesa-stable
Signed-off-by: Sergi Blanch Torne <sergi.blanch.torne@collabora.com>
Co-developed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28979>
Improves the error logging in the LAVA job submitter by capturing and
logging the exception message rather than just the exception type when a
job fails to run.
Additionally, introduces a clearer script interruption
message to aid in debugging and immediate understanding of job
submission failures.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28778>
This commit refactors the exception hierarchy to differentiate between
retriable and fatal errors in the CI pipeline, specifically within the
LAVA job submission process.
A new base class, `MesaCIRetriableException`, is introduced for
exceptions that should trigger a retry of the CI job, while
`MesaCIFatalException` is added for non-recoverable errors that halt the
process immediately.
Additionally, the logic for deciding whether a job should be retried or
not is updated to check for instances of `MesaCIRetriableException`,
improving the robustness and reliability of the CI job execution
strategy.
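A minimal sketch of such a hierarchy and of the retry decision is shown below. The two class names come from this commit message; the shared base class `MesaCIException` and the helper `run_with_retries` are assumptions made for the example.

```python
class MesaCIException(Exception):
    """Assumed common base class (not named in this commit message)."""


class MesaCIRetriableException(MesaCIException):
    """Errors that should trigger a retry of the CI job."""


class MesaCIFatalException(MesaCIException):
    """Non-recoverable errors that halt the process immediately."""


def run_with_retries(attempt, max_attempts=3):
    """Hypothetical helper: call `attempt` until it succeeds,
    retrying only on MesaCIRetriableException."""
    for attempt_no in range(1, max_attempts + 1):
        try:
            return attempt()
        except MesaCIRetriableException:
            # Give up once the retry budget is exhausted.
            if attempt_no == max_attempts:
                raise
            # Otherwise loop again. Any other exception, including
            # MesaCIFatalException, is not caught and propagates.
```

Checking `isinstance` against a single retriable base class keeps the retry policy in one place: new error types opt in to retries simply by inheriting from it.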
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28778>
This workaround is temporary and will be dropped once the
farm is moved to its permanent location.
Signed-off-by: Martin Krastev <martin.krastev@broadcom.com>
Reviewed-by: David Heidelberg <david.heidelberg@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28261>
The r8152 error detection is now considering any order of the known
patterns to detect variations of the r8152 issues during the test phase.
This includes a small refactoring for eventual new issues.
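Order-independent detection can be sketched as a set of pending patterns that shrinks as lines match. The two patterns below are modelled on the r8152 and NFS log lines quoted elsewhere in this log; the actual pattern list and the function name `all_patterns_seen` are assumptions.

```python
import re

# Assumed pattern list, modelled on log lines quoted in this history.
KNOWN_PATTERNS = (
    re.compile(r"r8152 \S+ eth0: Tx status -71"),
    re.compile(r"nfs: server \S+ not responding, still trying"),
)


def all_patterns_seen(lines, patterns=KNOWN_PATTERNS) -> bool:
    """True if every pattern matches at least one line, in any order."""
    pending = set(patterns)
    for line in lines:
        # Remove every pattern that this line satisfies.
        pending -= {p for p in pending if p.search(line)}
        if not pending:
            return True
    return not pending
```

Supporting a new issue then only requires another tuple of patterns, which matches the "small refactoring for eventual new issues" mentioned above.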
Additionally, adjusted the timing for setting the `start_time` in
`test_lava_job_submitter.py` to ensure consistency and reliability in
test execution, aligning the start time closer to the job submission
process.
With this fix, the bad state shown in the following job will be
detected:
https://gitlab.freedesktop.org/drm/msm/-/jobs/55033953
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27688>
In our process of monitoring LAVA logs, we typically skip numerous
messages to enhance log clarity. We already exclude `feedback` messages
from display. These messages were just used as a heartbeat signal,
indicating that if we are receiving them, the Device Under Test (DUT) is
active.
Practically, if we continuously receive feedback messages without any
other message level (either `debug` or `target`) for several minutes,
this could be a cause for concern, as it likely indicates the device is
in a kind of livelock state.
Therefore, it is more prudent to ignore feedback messages, as they tend
to occur frequently in unstable scenarios. However, it's important to
note that any other message level will still be considered as a
heartbeat signal.
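The rule above can be sketched as a heartbeat tracker that refreshes only on non-feedback levels. The class name, the timeout parameter, and the injectable clock are assumptions made for illustration; only the level names `feedback`, `debug`, and `target` come from the text.

```python
import time


class Heartbeat:
    """Hypothetical tracker: only non-feedback messages count as a beat."""

    def __init__(self, timeout_s: float, now=time.monotonic):
        self.timeout_s = timeout_s
        self.now = now  # injectable clock, which simplifies unit testing
        self.last_beat = now()

    def on_message(self, level: str) -> None:
        # Feedback messages keep arriving even when the DUT is livelocked,
        # so they must not refresh the heartbeat.
        if level == "feedback":
            return
        self.last_beat = self.now()

    def is_alive(self) -> bool:
        return self.now() - self.last_beat <= self.timeout_s
```

With this shape, a stream consisting only of `feedback` messages eventually trips the timeout, while a single `debug` or `target` message resets it.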
Real case where several minutes of feedback messages indicate a hang:
https://gitlab.freedesktop.org/mesa/mesa/-/jobs/53546660
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26996>
This week we found that the r8152 issue can happen during the boot
phase. Make the necessary adjustments to detect it.
https://gitlab.freedesktop.org/vigneshraman/linux/-/jobs/53651940
Notes:
- The kernel messages during the boot phase are being redirected to the
feedback messages due to the namespaces from the SSH job.
- Update the unit tests:
- Add boot phase detection
- Correctly set the boot phase when mocking LogFollower
Reported-by: Vignesh Raman <vignesh.raman@collabora.com>
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
We were just detecting if a log like
[ 143.080663] r8152 2-1.3:1.0 eth0: Tx status -71
happened once before
[ 316.389695] nfs: server 192.168.201.1 not responding, still trying
But we can use a counter to be more assured that the device is
struggling to recover, and we can let this detection happen during
the boot phase.
This mimics how other freedreno devices deal with this problem, see
`cros_servo_run.py:64` for example.
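A counter-based variant might look like the sketch below. The regular expressions are derived from the two log lines quoted above; the threshold `MAX_TX_ERRORS` and the function name are hypothetical, and the submitter's real logic may differ.

```python
import re

TX_ERROR = re.compile(r"r8152 \S+ eth0: Tx status -71")
NFS_STALL = re.compile(r"nfs: server \S+ not responding, still trying")
MAX_TX_ERRORS = 2  # hypothetical threshold


def device_is_struggling(lines, max_tx_errors=MAX_TX_ERRORS) -> bool:
    """True if an NFS stall follows at least `max_tx_errors` Tx errors."""
    tx_errors = 0
    for line in lines:
        if TX_ERROR.search(line):
            tx_errors += 1
        elif NFS_STALL.search(line) and tx_errors >= max_tx_errors:
            return True
    return False
```

Requiring several Tx errors before the NFS stall reduces false positives from a single transient glitch.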
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
We need to update the kernel often, and we need to test kernel changes often.
Introduce `KERNEL_EXTERNAL_TAG` to distinguish it from `KERNEL_TAG`, which is
also used to rebuild the containers. We don't need to rebuild the containers
for the external kernel, so this way we don't have to.
Updating kernel goes wruuuuuum.
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23563>
Add two unit tests related to the LAVA job definition.
test_generate_lava_job_definition_sanity checks for the most important
fields, deploy actions, namespaces etc.
test_lava_job_definition compares the generated definition with static
skeleton YAML files committed inside tests/data folder.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Simplify the UART and SSH job definition modules so that they share
common building blocks.
- generate_lava_yaml_payload is now a LAVAJobDefinition method, so
dropped the Strategy pattern between both modules
- if SSH is supported and UART is not enforced, default to SSH
- when SSH is enabled, wrap the last deploy action to run the SSH server
and rewrite the test actions, which should not change due to the boot
method
- create a constants module to load environment variables
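The SSH-by-default rule from the list above reduces to a small predicate. The function and parameter names here are hypothetical; how the submitter actually learns whether SSH is supported or UART is enforced is not shown.

```python
def choose_channel(ssh_supported: bool, uart_enforced: bool) -> str:
    """Default to SSH when it is supported and UART is not enforced."""
    if ssh_supported and not uart_enforced:
        return "ssh"
    return "uart"
```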
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
To absorb complexity from the building blocks to generate job
definitions for each mode:
- fastboot-uart
- uboot-uart
- uboot-ssh
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Break it into smaller pieces with variable size (fastboot has 3 deploy
actions and uboot only one) to build the base definition nicely in the
end.
Extract kernel/dtb attachment and init_stage1 extraction into functions
to be later reused by SSH job definition.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
The LAVA job submitter is being used by other fd.o projects, such as
`drm/ci`, so let's make it generate more generic job definitions and
test cases.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
This should help us handle possibly slower downloads of the traces,
which lead to piglit not printing anything to the output.
This needs to be reverted once the infrastructure is stable again.
Acked-by: Helen Koike <helen.koike@collabora.com>
Acked-by: Daniel Stone <daniels@collabora.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25097>
At least these two can be easily used in bare-metal or Labgrid setups.
I already have an MR implementing these for Labgrid.
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24665>
Modify the image build process to build ci-kdl so that it is available in
the Mesa jobs. Also modify init-stage2 to launch, in the background, the
process that collects data and stores a JSON file with the relative
changes in the recorded data.
Signed-off-by: Sergi Blanch Torne <sergi.blanch.torne@collabora.com>
Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24177>
This brings a visible speedup while preparing the rootfs and containers.
Acked-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Acked-by: Erico Nunes <nunes.erico@gmail.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24079>
Use a lightweight container for ping, ssh, curl and bash support.
Also use an image located at fd.o infrastructure, since we are having
some issues with Collabora's gitlab one lately.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
- To keep things organized, create a base hidden job for alpine images,
as we have 2 now
- This image will have an SSH client run by LAVA dispatchers
(x86_64-only), which will serve as a bypass channel for output and dmesg
and as an alternative to UART.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
Our LAVA farm is currently experiencing issues with running and pulling
docker. LAVA has been detecting (with a low rate) timeouts during these
commands, causing some jobs to fail with infrastructure errors.
Increasing failure_retry makes the job retry running the container
when LAVA detects the failure, without losing its place in the job queue.
We are currently investigating why docker times out. But, when LAVA
fails to detect it, we cancel the job on our side and resubmit it to the
job queue. For more information, please refer to the following dashboard:
https://ci-stats-grafana.freedesktop.org/goto/VjZvaA_4z?orgId=1
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
Since we always use S3 artifacts for LAVA, keep only one codepath.
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23527>
We haven't used MINIO for a long time. Rename the variable accordingly.
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23527>