Discussion:
Bug#1081336: Intermittent FTBFS due to hang in tests/test_api_base.py
Add Reply
Stefano Rivera
2024-09-10 19:40:02 UTC
Reply
Permalink
Source: bmap-tools
Version: 3.8.0-1
Severity: important
Tags: ftbfs
Justification: FTBFS

Dear maintainer,

During a test rebuild, bmap-tools failed to rebuild.

I see it has hung in the same place in the past, on a buildd.
Filing at the important level as it apparently built after being
retried, on the buildd.

-------------------------------------------------------------------------------
[...]
debian/rules override_dh_auto_test
make[1]: Entering directory '/<<PKGBUILDDIR>>'
faketime '2022-06-26 17:10:50' dh_auto_test
I: pybuild base:311: cd /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build; python3.12 -m pytest tests
============================= test session starts ==============================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0
rootdir: /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build
configfile: pyproject.toml
plugins: typeguard-4.3.0
collected 19 items

tests/test_CLI.py ..... [ 26%]
tests/test_api_base.py
E: Build killed with signal TERM after 150 minutes of inactivity
--------------------------------------------------------------------------------
Build finished at 2024-09-07T17:02:38Z

-------------------------------------------------------------------------------

The above is just how the build ends and not necessarily the most relevant part.
If required, the full build log is available here (for the next 30 days):
https://debusine.debian.net/artifact/612253/

About the archive rebuild: The build was made on debusine.debian.net,
on arm64 using sbuild.

You can find the build task here:
https://debusine.debian.net/work-request/20398/

If this is really a bug in one of the build-depends, please use
reassign and affects, so that this is still visible in the BTS web
page for this package.

Thanks,

Stefano
Johannes Schauer Marin Rodrigues
2024-09-11 13:40:01 UTC
Reply
Permalink
Hi,
Post by Stefano Rivera
During a test rebuild, bmap-tools failed to rebuild.
I see it has hung in the same place in the past, on a buildd.
Filing at the important level as it apparently built after being
retried, on the buildd.
-------------------------------------------------------------------------------
[...]
debian/rules override_dh_auto_test
make[1]: Entering directory '/<<PKGBUILDDIR>>'
faketime '2022-06-26 17:10:50' dh_auto_test
I: pybuild base:311: cd /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build; python3.12 -m pytest tests
============================= test session starts ==============================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0
rootdir: /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build
configfile: pyproject.toml
plugins: typeguard-4.3.0
collected 19 items
tests/test_CLI.py ..... [ 26%]
tests/test_api_base.py
E: Build killed with signal TERM after 150 minutes of inactivity
my guess is, that this is due to the system starved of random numbers. The
function generate_test_files() makes heavy use of the python random module to
generate lots of data. Maybe the build system is just starved of entropy and
maybe this is also why it's not failing reliably?

I wasn't able to reproduce this issue locally with sbuild but saw in htop that
the test in question was not using much CPU but was generating a lot of I/O by
writing out all the temporary test files.

The effect also happened on the buildds where the build ran into a timeout on
x86-csail-02 but then built successfully within 2 minutes on x86-grnet-02 a
week later: https://buildd.debian.org/status/logs.php?pkg=bmap-tools&ver=3.8.0-1&arch=all

Thanks!

cheers, josch
Johannes Schauer Marin Rodrigues
2025-01-21 10:10:01 UTC
Reply
Permalink
Hi,

Quoting Johannes Schauer Marin Rodrigues (2024-09-11 15:38:13)
Post by Johannes Schauer Marin Rodrigues
Post by Stefano Rivera
During a test rebuild, bmap-tools failed to rebuild.
I see it has hung in the same place in the past, on a buildd.
Filing at the important level as it apparently built after being
retried, on the buildd.
-------------------------------------------------------------------------------
[...]
debian/rules override_dh_auto_test
make[1]: Entering directory '/<<PKGBUILDDIR>>'
faketime '2022-06-26 17:10:50' dh_auto_test
I: pybuild base:311: cd /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build; python3.12 -m pytest tests
============================= test session starts ==============================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0
rootdir: /<<PKGBUILDDIR>>/.pybuild/cpython3_3.12/build
configfile: pyproject.toml
plugins: typeguard-4.3.0
collected 19 items
tests/test_CLI.py ..... [ 26%]
tests/test_api_base.py
E: Build killed with signal TERM after 150 minutes of inactivity
my guess is, that this is due to the system starved of random numbers.
I'd like to change my guess. I think it's actually a deadlock created through
the use of the subprocess module in different Python threads.

I have had a look at the problem again in the context of
https://salsa.debian.org/debian/bmap-tools/-/merge_requests/3 as I got very
annoyed by the randomly happening build timeouts. I am still unable to
reproduce the issue locally but it persists on the buildds, on salsaci as well
as on tests.reproducible-builds.org. I now have a workaround patch but not a
full solution yet. I'd like to use this mail to write-up my findings before I
squash all my WIP commits into one.

Since I'm unable to reproduce this on my own machine, I patched debian/rules
like this:

override_dh_auto_test:
sh -c 'while sleep 10; do pstree -l -a $(shell echo $$PPID) >&2; done' & pid=$$!; \
ret=0; \
timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$$?; \
kill $$pid; \
wait $$pid || : ; \
exit $$ret

This is will run "pstree -l -a ${pid of make}" every 10 seconds and limits the
time that the test is allowed to run to 15 minutes to not hog salsaci too much.
In case that we run into a stall, messages like these are printed repeatedly to
the log:

bzip2 stalls:

rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
| `-pstree -l -a 1641
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c bzip2 -c -k /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_vq64huef.img

xz stalls:

rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 13086 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 13086 >&2; done
| `-pstree -l -a 13086
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c xz -c -k /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_kqm3eugf.img

tar stalls:

rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 13126 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 13126 >&2; done
| `-pstree -l -a 13126
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c tar -c -j -O -P -C / /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_gn6_77ax.img

gzip stalls:

rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1640 >&2; done' & pid=$!; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- -vvv; \\\012kill $pid; \\\012wait $pid || :
|-sh -c while sleep 10; do pstree -l -a 1640 >&2; done
| `-pstree -l -a 1640
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
`-dh_auto_test /usr/bin/dh_auto_test -- -vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 -vvv
`-sh -c cd /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest tests
`-python3.13 -m pytest tests
`-sh -c gzip -c /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_4up2jzoj.img
rules -f debian/rules override_dh_auto_test

All of these tools are invoked by _generate_compressed_files() in
tests/test_api_base.py so my guess was that the way that the tools are called
using subprocess.Popen(), followed by Popen.wait() somehow causes the kind of
deadlock that is described in the documentation of Popen.wait(). So I replaced
those subprocess calls with calls to subprocess.check_output() and slurped
the whole standard output into a Python string (it's only ~2 MB of data). But
this did not change the behaviour at all.

Searching online for deadlocks involving the subprocess module revealed that in
Python 2.7 there was a problem of deadlocks when the subprocess module was
called from Python threads. The fix was supposedly to add close_fds=True to
the Popen call. I tried that too but it resulted no change in behaviour.

Then I stumbled across a slightly different deadlock:

rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
| `-pstree -l -a 1641
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-df -PT -- /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_3wg1y132.img

The invocation of "df -PT" is from a completely different part of the code.
It comes from get_file_system_type() in bmaptool/BmapHelpers.py which just runs:

proc = subprocess.Popen(["df", "-PT", "--", abspath], stdout=PIPE, stderr=PIPE)
stdout, stderr = proc.communicate()

But this should be completely harmless, no???

So maybe the problem is not with tests/test_api_base.py but the problem is a
more fundamental one. I looked into other parts of the codebase and
_open_compressed_file() in bmaptool/TransRead.py does an aweful lot of
threading in combination with opening decompressors via the subprocess module.

I have no solution to the problem yet but I do have a workaround:

--- a/tests/helpers.py
+++ b/tests/helpers.py
@@ -284,29 +284,19 @@ def calculate_chksum(file_path):
return hash_obj.hexdigest()


+import subprocess
+
+
def copy_and_verify_image(image, dest, bmap, image_chksum, image_size):
"""
Copy image 'image' using bmap file 'bmap' to the destination file 'dest'
and verify the resulting image checksum.
"""

- f_image = TransRead.TransRead(image)
- f_dest = open(dest, "w+b")
- if bmap:
- f_bmap = open(bmap, "r")
- else:
- f_bmap = None
-
- writer = BmapCopy.BmapCopy(f_image, f_dest, f_bmap, image_size)
- # Randomly decide whether we want the progress bar or not
- if bool(random.getrandbits(1)) and sys.stdout.isatty():
- writer.set_progress_indicator(sys.stdout, None)
- writer.copy(bool(random.getrandbits(1)), bool(random.getrandbits(1)))
+ subprocess.check_call(
+ ["python3", "-m", "bmaptool", "copy"]
+ + (["--bmap", bmap] if bmap else ["--nobmap"])
+ + [image, dest]
+ )

- # Compare the original file and the copy are identical
assert calculate_chksum(dest) == image_chksum
-
- if f_bmap:
- f_bmap.close()
- f_dest.close()
- f_image.close()

Instead of calling BmapCopy.BmapCopy().copy(), spawn a new process running the
full bmaptool utility with the given options. This is slower than the existing
code because bmaptool gets spawned a lot but it will effectively test the same
thing. Except that with this change we do not anymore test using BmapCopy as a
library. But I do not believe that this test is useful for Debian because we do
not ship this as a library but we ship it as a program that is to be called
from the terminal. And when run like that, the deadlock problem never occurs.

The above workaround should not close this bug. I'd leave it open. Maybe at
some point somebody will figure out what is going on. But at least we have
something now which allows us to run the test suite without deadlocking. It
would be a bummer having to disable these tests.

Thanks!

cheers, josch
James Addison
2025-01-29 01:00:01 UTC
Reply
Permalink
Source: bmap-tools
Followup-For: Bug #1081336
X-Debbugs-Cc: ***@debian.org
Control: tags -1 ftbfs

Dear Maintainer (and including Josch as requested),

I'm able to replicate this using a podman-based Debian trixie container
(currently, using podman 5.3.1+ds1-7 on amd64, and running an amd64 container
image of debian:trixie with ID a1b129fd683d) when building repeatedly with
dpkg-buildpackage.

So far I don't have much to add in terms of findings; all I have achieved is
to seem to rule-out the possibility that random selection of the three
variables for progress-bar, sync, and verify prepared[1] for each test copy
operation are a factor -- because I've encountered stalled test_api_base
behaviour regardless of the runtime true/false value for each of those
variables, assuming they're independent.

(_strictly_ speaking there could remain complex boolean interactions between
them that are necessary the stall preconditions -- I can't disprove that
because I haven't encountered all eight permutations possible in the truth
table; but it seems likely to me that they are independent variables)


PSV (P = progressbar, S = sync, V = verify) stall counts
---
FFF 1
FTT 1
TFF 1
TFT 2
TTF 2


The stall encounter rate, across the total number of testsuite evaluations
attempted, seemed to be around 15-20% on my local machine.

Not much in the way of clues/indicators yet, but it leads me to believe that
focusing elsewhere than the random variables (e.g. on the file and pipe
operations) may be a good next step.

Regards,
James

[1] - https://sources.debian.org/src/bmap-tools/3.8.0-1/tests/helpers.py/#L301-L304
Loading...