Hi,
Quoting Johannes Schauer Marin Rodrigues (2024-09-11 15:38:13)
Post by Johannes Schauer Marin RodriguesPost by Stefano RiveraDuring a test rebuild, bmap-tools failed to rebuild.
I see it has hung in the same place in the past, on a buildd.
Filing at the important level as it apparently built after being
retried, on the buildd.
-------------------------------------------------------------------------------
[...]
debian/rules override_dh_auto_test
make[1]: Entering directory '/<<PKGBUILDDIR>>'
faketime '2022-06-26 17:10:50' dh_auto_test
I: pybuild base:311: cd /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build; python3.12 -m pytest tests
============================= test session starts ==============================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0
rootdir: /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build
configfile: pyproject.toml
plugins: typeguard-4.3.0
collected 19 items
tests/test_CLI.py ..... [ 26%]
tests/test_api_base.py
E: Build killed with signal TERM after 150 minutes of inactivity
my guess is, that this is due to the system starved of random numbers.
I'd like to change my guess. I think it's actually a deadlock created through
the use of the subprocess module in different Python threads.
I have had a look at the problem again in the context of
https://salsa.debian.org/debian/bmap-tools/-/merge_requests/3 as I got very
annoyed by the randomly happening build timeouts. I am still unable to
reproduce the issue locally but it persists on the buildds, on salsaci as well
as on tests.reproducible-builds.org. I now have a workaround patch but not a
full solution yet. I'd like to use this mail to write-up my findings before I
squash all my WIP commits into one.
Since I'm unable to reproduce this on my own machine, I patched debian/rules
like this:
override_dh_auto_test:
sh -c 'while sleep 10; do pstree -l -a $(shell echo $$PPID) >&2; done' & pid=$$!; \
ret=0; \
timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$$?; \
kill $$pid; \
wait $$pid || : ; \
exit $$ret
This is will run "pstree -l -a ${pid of make}" every 10 seconds and limits the
time that the test is allowed to run to 15 minutes to not hog salsaci too much.
In case that we run into a stall, messages like these are printed repeatedly to
the log:
bzip2 stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
| `-pstree -l -a 1641
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c bzip2 -c -k /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_vq64huef.img
xz stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 13086 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 13086 >&2; done
| `-pstree -l -a 13086
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c xz -c -k /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_kqm3eugf.img
tar stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 13126 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 13126 >&2; done
| `-pstree -l -a 13126
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c tar -c -j -O -P -C / /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_gn6_77ax.img
gzip stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1640 >&2; done' & pid=$!; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- -vvv; \\\012kill $pid; \\\012wait $pid || :
|-sh -c while sleep 10; do pstree -l -a 1640 >&2; done
| `-pstree -l -a 1640
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
`-dh_auto_test /usr/bin/dh_auto_test -- -vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 -vvv
`-sh -c cd /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest tests
`-python3.13 -m pytest tests
`-sh -c gzip -c /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_4up2jzoj.img
rules -f debian/rules override_dh_auto_test
All of these tools are invoked by _generate_compressed_files() in
tests/test_api_base.py so my guess was that the way that the tools are called
using subprocess.Popen(), followed by Popen.wait() somehow causes the kind of
deadlock that is described in the documentation of Popen.wait(). So I replaced
those subprocess calls with calls to subprocess.check_output() and slurped
the whole standard output into a Python string (it's only ~2 MB of data). But
this did not change the behaviour at all.
Searching online for deadlocks involving the subprocess module revealed that in
Python 2.7 there was a problem of deadlocks when the subprocess module was
called from Python threads. The fix was supposedly to add close_fds=True to
the Popen call. I tried that too but it resulted no change in behaviour.
Then I stumbled across a slightly different deadlock:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
| `-pstree -l -a 1641
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-df -PT -- /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_3wg1y132.img
The invocation of "df -PT" is from a completely different part of the code.
It comes from get_file_system_type() in bmaptool/BmapHelpers.py which just runs:
proc = subprocess.Popen(["df", "-PT", "--", abspath], stdout=PIPE, stderr=PIPE)
stdout, stderr = proc.communicate()
But this should be completely harmless, no???
So maybe the problem is not with tests/test_api_base.py but the problem is a
more fundamental one. I looked into other parts of the codebase and
_open_compressed_file() in bmaptool/TransRead.py does an aweful lot of
threading in combination with opening decompressors via the subprocess module.
I have no solution to the problem yet but I do have a workaround:
--- a/tests/helpers.py
+++ b/tests/helpers.py
@@ -284,29 +284,19 @@ def calculate_chksum(file_path):
return hash_obj.hexdigest()
+import subprocess
+
+
def copy_and_verify_image(image, dest, bmap, image_chksum, image_size):
"""
Copy image 'image' using bmap file 'bmap' to the destination file 'dest'
and verify the resulting image checksum.
"""
- f_image = TransRead.TransRead(image)
- f_dest = open(dest, "w+b")
- if bmap:
- f_bmap = open(bmap, "r")
- else:
- f_bmap = None
-
- writer = BmapCopy.BmapCopy(f_image, f_dest, f_bmap, image_size)
- # Randomly decide whether we want the progress bar or not
- if bool(random.getrandbits(1)) and sys.stdout.isatty():
- writer.set_progress_indicator(sys.stdout, None)
- writer.copy(bool(random.getrandbits(1)), bool(random.getrandbits(1)))
+ subprocess.check_call(
+ ["python3", "-m", "bmaptool", "copy"]
+ + (["--bmap", bmap] if bmap else ["--nobmap"])
+ + [image, dest]
+ )
- # Compare the original file and the copy are identical
assert calculate_chksum(dest) == image_chksum
-
- if f_bmap:
- f_bmap.close()
- f_dest.close()
- f_image.close()
Instead of calling BmapCopy.BmapCopy().copy(), spawn a new process running the
full bmaptool utility with the given options. This is slower than the existing
code because bmaptool gets spawned a lot but it will effectively test the same
thing. Except that with this change we do not anymore test using BmapCopy as a
library. But I do not believe that this test is useful for Debian because we do
not ship this as a library but we ship it as a program that is to be called
from the terminal. And when run like that, the deadlock problem never occurs.
The above workaround should not close this bug. I'd leave it open. Maybe at
some point somebody will figure out what is going on. But at least we have
something now which allows us to run the test suite without deadlocking. It
would be a bummer having to disable these tests.
Thanks!
cheers, josch