2023-05-21 06:47:05 +08:00
|
|
|
#!/usr/bin/env bash
|
CI: reduce bandwidth for git pull
Over the last 7 days, git pulls represented a total of 1.7 TB.
On those 1.7 TB, we can see:
- ~300 GB for the CI farm on hetzner
- ~730 GB for the CI farm on packet.net
- ~680 GB for the rest of the world
We can not really change the rest of the world*, but we can
certainly reduce the egress costs towards our CI farms.
Right now, the gitlab runners are not doing a good job at
caching the git trees for the various jobs we make, and
we end up with a lot of cache-misses. A typical pipeline
ends up with a good 2.8GB of git pull data. (a compressed
archive of the mesa folder accounts for 280MB)
In this patch, we implemented what was suggested in
https://gitlab.com/gitlab-org/gitlab/-/issues/215591#note_334642576
- we host a brand new MinIO server on packet
- jobs can upload files on 2 locations:
* git-cache/<namespace>/<project>/<branch-name>.tar.gz
* artifacts/<namespace>/<project>/<pipeline-id>/
- the authorization is handled by gitlab with short tokens
valid only for the time of the job is running
- whenever a job runs, the runner are configured to execute
(eval) $CI_PRE_CLONE_SCRIPT
- this variable is set globally to download the current cache
from the MinIO packet server, unpack it and replace the
possibly out of date cache found on the runner
- then git fetch is run by the runner, and only the delta
between the upstream tree and the local tree gets pulled.
We can rebuild the git cache in a schedule job (once a day
seems sufficient), and then we can stop the cache miss
entirely.
First results showed that instead of pulling 280MB of data
in my fork, I got a pull of only 250KB. That should help us.
* arguably, there are other farms in the rest of the world, so
hopefully we can change those too.
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Reviewed-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5428>
2020-06-11 23:16:28 +08:00
|
|
|
|
|
|
|
set +e
|
|
|
|
set -o xtrace
|
|
|
|
|
|
|
|
# if we run this script outside of gitlab-ci for testing, ensure
|
|
|
|
# we got meaningful variables
|
2022-03-17 22:09:18 +08:00
|
|
|
CI_PROJECT_DIR=${CI_PROJECT_DIR:-$(mktemp -d)/$CI_PROJECT_NAME}
|
CI: reduce bandwidth for git pull
Over the last 7 days, git pulls represented a total of 1.7 TB.
On those 1.7 TB, we can see:
- ~300 GB for the CI farm on hetzner
- ~730 GB for the CI farm on packet.net
- ~680 GB for the rest of the world
We can not really change the rest of the world*, but we can
certainly reduce the egress costs towards our CI farms.
Right now, the gitlab runners are not doing a good job at
caching the git trees for the various jobs we make, and
we end up with a lot of cache-misses. A typical pipeline
ends up with a good 2.8GB of git pull data. (a compressed
archive of the mesa folder accounts for 280MB)
In this patch, we implemented what was suggested in
https://gitlab.com/gitlab-org/gitlab/-/issues/215591#note_334642576
- we host a brand new MinIO server on packet
- jobs can upload files on 2 locations:
* git-cache/<namespace>/<project>/<branch-name>.tar.gz
* artifacts/<namespace>/<project>/<pipeline-id>/
- the authorization is handled by gitlab with short tokens
valid only for the time of the job is running
- whenever a job runs, the runner are configured to execute
(eval) $CI_PRE_CLONE_SCRIPT
- this variable is set globally to download the current cache
from the MinIO packet server, unpack it and replace the
possibly out of date cache found on the runner
- then git fetch is run by the runner, and only the delta
between the upstream tree and the local tree gets pulled.
We can rebuild the git cache in a schedule job (once a day
seems sufficient), and then we can stop the cache miss
entirely.
First results showed that instead of pulling 280MB of data
in my fork, I got a pull of only 250KB. That should help us.
* arguably, there are other farms in the rest of the world, so
hopefully we can change those too.
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Reviewed-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5428>
2020-06-11 23:16:28 +08:00
|
|
|
|
|
|
|
if [[ -e $CI_PROJECT_DIR/.git ]]
|
|
|
|
then
|
|
|
|
echo "Repository already present, skip cache download"
|
|
|
|
exit
|
|
|
|
fi
|
|
|
|
|
|
|
|
TMP_DIR=$(mktemp -d)
|
|
|
|
|
2023-08-02 23:48:53 +08:00
|
|
|
echo "$(date +"%F %T") Downloading archived master..."
|
2023-05-21 06:47:05 +08:00
|
|
|
if ! /usr/bin/wget \
|
2023-02-20 09:32:27 +08:00
|
|
|
-O "$TMP_DIR/$CI_PROJECT_NAME.tar.gz" \
|
2024-04-24 22:01:29 +08:00
|
|
|
"https://${S3_HOST}/${S3_GITCACHE_BUCKET}/${FDO_UPSTREAM_REPO}/$CI_PROJECT_NAME.tar.gz";
|
CI: reduce bandwidth for git pull
Over the last 7 days, git pulls represented a total of 1.7 TB.
On those 1.7 TB, we can see:
- ~300 GB for the CI farm on hetzner
- ~730 GB for the CI farm on packet.net
- ~680 GB for the rest of the world
We can not really change the rest of the world*, but we can
certainly reduce the egress costs towards our CI farms.
Right now, the gitlab runners are not doing a good job at
caching the git trees for the various jobs we make, and
we end up with a lot of cache-misses. A typical pipeline
ends up with a good 2.8GB of git pull data. (a compressed
archive of the mesa folder accounts for 280MB)
In this patch, we implemented what was suggested in
https://gitlab.com/gitlab-org/gitlab/-/issues/215591#note_334642576
- we host a brand new MinIO server on packet
- jobs can upload files on 2 locations:
* git-cache/<namespace>/<project>/<branch-name>.tar.gz
* artifacts/<namespace>/<project>/<pipeline-id>/
- the authorization is handled by gitlab with short tokens
valid only for the time of the job is running
- whenever a job runs, the runner are configured to execute
(eval) $CI_PRE_CLONE_SCRIPT
- this variable is set globally to download the current cache
from the MinIO packet server, unpack it and replace the
possibly out of date cache found on the runner
- then git fetch is run by the runner, and only the delta
between the upstream tree and the local tree gets pulled.
We can rebuild the git cache in a schedule job (once a day
seems sufficient), and then we can stop the cache miss
entirely.
First results showed that instead of pulling 280MB of data
in my fork, I got a pull of only 250KB. That should help us.
* arguably, there are other farms in the rest of the world, so
hopefully we can change those too.
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Reviewed-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5428>
2020-06-11 23:16:28 +08:00
|
|
|
then
|
|
|
|
echo "Repository cache not available"
|
|
|
|
exit
|
|
|
|
fi
|
|
|
|
|
|
|
|
set -e
|
|
|
|
|
|
|
|
rm -rf "$CI_PROJECT_DIR"
|
2023-08-02 23:48:53 +08:00
|
|
|
echo "$(date +"%F %T") Extracting tarball into '$CI_PROJECT_DIR'..."
|
CI: reduce bandwidth for git pull
Over the last 7 days, git pulls represented a total of 1.7 TB.
On those 1.7 TB, we can see:
- ~300 GB for the CI farm on hetzner
- ~730 GB for the CI farm on packet.net
- ~680 GB for the rest of the world
We can not really change the rest of the world*, but we can
certainly reduce the egress costs towards our CI farms.
Right now, the gitlab runners are not doing a good job at
caching the git trees for the various jobs we make, and
we end up with a lot of cache-misses. A typical pipeline
ends up with a good 2.8GB of git pull data. (a compressed
archive of the mesa folder accounts for 280MB)
In this patch, we implemented what was suggested in
https://gitlab.com/gitlab-org/gitlab/-/issues/215591#note_334642576
- we host a brand new MinIO server on packet
- jobs can upload files on 2 locations:
* git-cache/<namespace>/<project>/<branch-name>.tar.gz
* artifacts/<namespace>/<project>/<pipeline-id>/
- the authorization is handled by gitlab with short tokens
valid only for the time of the job is running
- whenever a job runs, the runner are configured to execute
(eval) $CI_PRE_CLONE_SCRIPT
- this variable is set globally to download the current cache
from the MinIO packet server, unpack it and replace the
possibly out of date cache found on the runner
- then git fetch is run by the runner, and only the delta
between the upstream tree and the local tree gets pulled.
We can rebuild the git cache in a schedule job (once a day
seems sufficient), and then we can stop the cache miss
entirely.
First results showed that instead of pulling 280MB of data
in my fork, I got a pull of only 250KB. That should help us.
* arguably, there are other farms in the rest of the world, so
hopefully we can change those too.
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Reviewed-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5428>
2020-06-11 23:16:28 +08:00
|
|
|
mkdir -p "$CI_PROJECT_DIR"
|
2022-03-17 22:09:18 +08:00
|
|
|
tar xzf "$TMP_DIR/$CI_PROJECT_NAME.tar.gz" -C "$CI_PROJECT_DIR"
|
CI: reduce bandwidth for git pull
Over the last 7 days, git pulls represented a total of 1.7 TB.
On those 1.7 TB, we can see:
- ~300 GB for the CI farm on hetzner
- ~730 GB for the CI farm on packet.net
- ~680 GB for the rest of the world
We can not really change the rest of the world*, but we can
certainly reduce the egress costs towards our CI farms.
Right now, the gitlab runners are not doing a good job at
caching the git trees for the various jobs we make, and
we end up with a lot of cache-misses. A typical pipeline
ends up with a good 2.8GB of git pull data. (a compressed
archive of the mesa folder accounts for 280MB)
In this patch, we implemented what was suggested in
https://gitlab.com/gitlab-org/gitlab/-/issues/215591#note_334642576
- we host a brand new MinIO server on packet
- jobs can upload files on 2 locations:
* git-cache/<namespace>/<project>/<branch-name>.tar.gz
* artifacts/<namespace>/<project>/<pipeline-id>/
- the authorization is handled by gitlab with short tokens
valid only for the time of the job is running
- whenever a job runs, the runner are configured to execute
(eval) $CI_PRE_CLONE_SCRIPT
- this variable is set globally to download the current cache
from the MinIO packet server, unpack it and replace the
possibly out of date cache found on the runner
- then git fetch is run by the runner, and only the delta
between the upstream tree and the local tree gets pulled.
We can rebuild the git cache in a schedule job (once a day
seems sufficient), and then we can stop the cache miss
entirely.
First results showed that instead of pulling 280MB of data
in my fork, I got a pull of only 250KB. That should help us.
* arguably, there are other farms in the rest of the world, so
hopefully we can change those too.
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Reviewed-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5428>
2020-06-11 23:16:28 +08:00
|
|
|
rm -rf "$TMP_DIR"
|
|
|
|
chmod a+w "$CI_PROJECT_DIR"
|
2023-08-02 23:48:53 +08:00
|
|
|
|
|
|
|
echo "$(date +"%F %T") Git cache download done"
|