Discussion:
Bug#1042029: graph-tool: FTBFS: collect2: fatal error: ld terminated with signal 9 [Killed]
Lucas Nussbaum
2023-07-25 21:20:13 UTC
Source: graph-tool
Version: 2.45+ds-10
Severity: serious
Justification: FTBFS
Tags: trixie sid ftbfs
User: ***@debian.org
Usertags: ftbfs-20230724 ftbfs-trixie

Hi,

During a rebuild of all packages in sid, your package failed to build
on amd64.
/bin/bash ./libtool --tag=CXX --mode=link g++ -std=gnu++17 -fopenmp -O3 -fvisibility=default -fvisibility-inlines-hidden -Wno-deprecated -Wall -Wextra -ftemplate-backtrace-limit=0 -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -module -avoid-version -export-dynamic -no-undefined -Wl,-E -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -o libgraph_tool_util.la -rpath /usr/lib/python3.11/dist-packages/graph_tool/util src/graph/util/graph_search.lo src/graph/util/graph_util_bind.lo -L/usr/lib/x86_64-linux-gnu -lpython3.11 -L/usr/lib/x86_64-linux-gnu -lboost_iostreams -lboost_python311 -lboost_regex -lboost_context -lboost_coroutine -lexpat -lgmp -lgmp
libtool: link: g++ -fPIC -DPIC -shared -nostdlib /usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/13/crtbeginS.o src/graph/util/.libs/graph_search.o src/graph/util/.libs/graph_util_bind.o -L/usr/lib/x86_64-linux-gnu -lpython3.11 -lboost_iostreams -lboost_python311 -lboost_regex -lboost_context -lboost_coroutine -lexpat -lgmp -L/usr/lib/gcc/x86_64-linux-gnu/13 -L/usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/13/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/13/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-linux-gnu/13/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu/crtn.o -fopenmp -O3 -g -O2 -fstack-protector-strong -Wl,-E -Wl,--as-needed -Wl,-z -Wl,relro -Wl,-z -Wl,now -fopenmp -Wl,-soname -Wl,libgraph_tool_util.so -o .libs/libgraph_tool_util.so
libtool: link: ( cd ".libs" && rm -f "libgraph_tool_util.la" && ln -s "../libgraph_tool_util.la" "libgraph_tool_util.la" )
collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.
make[2]: *** [Makefile:3339: libgraph_tool_inference.la] Error 1
The full build log is available from:
http://qa-logs.debian.net/2023/07/24/graph-tool_2.45+ds-10_unstable.log

All bugs filed during this archive rebuild are listed at:
https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=ftbfs-20230724;users=***@debian.org
or:
https://udd.debian.org/bugs/?release=na&merged=ign&fnewerval=7&flastmodval=7&fusertag=only&fusertagtag=ftbfs-20230724&fusertaguser=***@debian.org&allbugs=1&cseverity=1&ctags=1&caffected=1#results

A list of current common problems and possible solutions is available at
http://wiki.debian.org/qa.debian.org/FTBFS . You're welcome to contribute!

If you reassign this bug to another package, please mark it as 'affects'-ing
this package. See https://www.debian.org/Bugs/server-control#affects

If you fail to reproduce this, please provide a build log and diff it with mine
so that we can identify if something relevant changed in the meantime.
Santiago Vila
2023-10-08 20:10:01 UTC
found 1042029 2.45+ds-10
tags 1042029 + patch
thanks
Hi Lucas, I could not reproduce the issue in the current Sid environment
(I tried twice), so I am closing the bug. It is highly probable that
your build encountered a resource issue, as this software is really heavy to build.
Hello.

This is certainly heavy to build.

However, it is needlessly so.

For example, configure.ac unconditionally adds -O3 to CXXFLAGS.
This is already a bug, because packages should honor whatever comes
from dpkg-buildflags (usually -O2).
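As an aside, the override is easy to see in a shell: with g++ the last -O option on the command line wins, so appending -O3 after the distro flags silently replaces -O2. The flag values below are assumed typical dpkg-buildflags output, for illustration only:

```shell
# Assumed, typical dpkg-buildflags CXXFLAGS (illustration only):
CXXFLAGS="-g -O2 -fstack-protector-strong"
# configure.ac then appends -O3 unconditionally:
CXXFLAGS="$CXXFLAGS -O3"
# With g++, the last -O level wins, so this effectively compiles
# at -O3 despite the distro default of -O2.
echo "$CXXFLAGS"
```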

Trivial patch attached. While I have not specifically tested whether it solves
the issue for me, I think it is a patch that should be applied in any case.

[ For the record, I also tried to build this package and it also failed for me.
Last time I tried it was using a machine with 16 GB of RAM and 4 GB of swap.
To put things in perspective, I can build *all* the 35208 source packages currently
in trixie with 16 GB of RAM. But not this one ].

Thanks.
Adrian Bunk
2024-12-19 04:20:01 UTC
Post by Santiago Vila
...
For example, configure.ac unconditionally adds -O3 to CXXFLAGS.
This is already a bug, because packages should honor whatever comes
from dpkg-buildflags (usually -O2).
...
For the record, this is not considered to be a bug - using
non-standard flags is fine as long as they don't violate the
architecture baseline.

-O3 is regarded as the best choice for some applications;
there is nothing wrong with that.
Post by Santiago Vila
...
A) The simple and effective way: Disable parallel building,
i.e. add --no-parallel to the dh call. The package would still
build the same, and it would not sacrifice things like -O3,
which would be preserved.
...
This would simply be horrible.

The successful build of the previous version took a week on the
riscv64 buildd.

If the build still succeeded, your change would have caused a build
time of three weeks or a month on a release architecture.

This is a HUGE problem.

The packages requiring lots of RAM during the build are usually also the
ones with the largest build times; you cannot disable parallel building
on these (as your 10 GB change also effectively did).
Post by Santiago Vila
...
I've used 10000 MB / CPU as a threshold, based on my own measurements,
but of course we could lower that later if it proves not to be enough.
...
What exactly have you measured?

Looking at the riscv64 FTBFS your change caused, I am puzzled how you
ended up with such an insanely high number.

Please fix this regression you introduced.

Based on build times on other architectures, with 8 GB RAM and 2 CPUs on
s390x the build of graph-tool is much faster when using both CPUs, which
indicates that the limit must be below 4 GB for this package.
Post by Santiago Vila
...
+MEM_PER_CPU = 10000
+NCPUS := $(or $(subst parallel=,,$(filter parallel=%,$(DEB_BUILD_OPTIONS))),$(shell nproc))
+NJOBS := $(shell awk -vn=$(NCPUS) -vm=$(MEM_PER_CPU) '/^MemAvailable:/ { mt = $$2 } \
+ END { n2 = int(mt/1024/m); print (n==1 || n2<=1) ? 1 : (n2<=n ? n2 : n) }' /proc/meminfo)
...
+ dh_auto_build -- -j $(NJOBS)
...
--max-parallel would be a simpler approach.
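For what it's worth, the quoted NJOBS clamp can be replayed as plain shell, feeding it hypothetical MemAvailable values instead of the live /proc/meminfo (the awk logic is restructured slightly, assigning the result to a variable before printing, but is otherwise the same):

```shell
# Sketch of the quoted NJOBS clamp with hypothetical inputs.
# Arguments: CPU count, MemAvailable in kB.
njobs() {
  awk -v n="$1" -v m=10000 -v mt="$2" 'BEGIN {
    n2 = int(mt / 1024 / m)                      # jobs that fit in RAM
    r = (n == 1 || n2 <= 1) ? 1 : (n2 <= n ? n2 : n)
    print r
  }'
}
njobs 2 16000000   # ~15.6 GB available, 2 CPUs: RAM-limited, prints 1
njobs 4 67000000   # ~65 GB available, 4 CPUs: CPU-limited, prints 4
```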


cu
Adrian
Santiago Vila
2024-12-19 12:00:02 UTC
reopen 1042029
thanks
Post by Adrian Bunk
Please fix this regression you introduced.
We'll try, but given the long build times, we should think
carefully before the next move.
Post by Adrian Bunk
What exactly have you measured?
I've measured that -j2 is already too much on a system with 16 GB of RAM
and 4 GB of swap, as I was getting build failures due to processes being
killed in such a scenario.

Your suggestion to use -j2 on a machine with only 8GB of RAM does not seem
to be compatible with that (I'm assuming here a similar RAM usage pattern
across different architectures, but that may or may not be accurate).

So let's see what Jerome has to say about this.

Thanks.
Adrian Bunk
2024-12-19 14:00:02 UTC
Post by Santiago Vila
reopen 1042029
thanks
Post by Adrian Bunk
Please fix this regression you introduced.
We'll try, but given the long build times, we should think
carefully before the next move.
Post by Adrian Bunk
What exactly have you measured?
I've measured that -j2 is already too much on a system with 16 GB of RAM
and 4 GB of swap, as I was getting build failures due to processes being
killed in such a scenario.
Most buildds on release architectures have 4 GB RAM per core,
but they are not swap-starved.
Post by Santiago Vila
Your suggestion to use -j2 on a machine with only 8GB of RAM does not seem
to be compatible with that (I'm assuming here a similar RAM usage pattern
across different architectures, but that may or may not be accurate).
...
https://buildd.debian.org/status/logs.php?pkg=graph-tool&arch=s390x
2 cores
8 GB RAM
94 GB swap

Your change increased build time by ~ 40%.


https://buildd.debian.org/status/logs.php?pkg=graph-tool&arch=amd64
6 cores
12 GB RAM
125 GB swap

Building with 6 cores was successful, but 2 GB RAM per core apparently
has a negative effect on build time.


My impression is that on a typical buildd with 4 cores and 16 GB RAM
"-j 4" is desirable.
Post by Santiago Vila
Thanks.
cu
Adrian
Jerome BENOIT
2023-10-29 22:10:01 UTC
Hi all,
I downgraded the severity to normal as this issue is above all
a resource issue. I have no problem building it on a 24-CPU amd64 box
with 64 GB of RAM (and with ~ 4/3 of 64 GB of swap).
Please keep in mind that graph-tool is a scientific tool for big data,
so the choice of optimization option is relevant.
Best wishes, Jerome
--
Jerome BENOIT | calculus+at-rezozer^dot*net
https://qa.debian.org/developer.php?login=***@rezozer.net
AE28 AE15 710D FF1D 87E5 A762 3F92 19A6 7F36 C68B
Santiago Vila
2023-12-14 17:20:01 UTC
tags 1042029 - patch
thanks
Post by Jerome BENOIT
Please keep in mind that graph-tool is a scientific tool for big data,
so the choice of optimization option is relevant.
Fair enough. I'm removing the patch tag accordingly.
Post by Jerome BENOIT
I downgraded the severity to normal as this issue is above all
a resource issue. I have no problem building it on a 24-CPU amd64 box
with 64 GB of RAM (and with ~ 4/3 of 64 GB of swap).
Well, let me describe our build environments:

Lucas Nussbaum usually starts a bunch of machines of type
t3.2xlarge from AWS, which have 8 CPUs and 32 GB of RAM,
then runs three "workers" on each of them, building packages
in parallel (so each package has 10.66 GB of RAM to build with
on average, but since lots of them build with a lot less
than that, it is very rare that a package cannot be built
because of lack of memory).

I do it differently: I start machines with 4GB, 8GB or 16GB
and build a single package at a time. Because I monitor
/proc/meminfo during build and collect the data, I know
which packages build with only 4GB and which packages
need 8 GB or 16 GB. For all the 35000 packages in trixie,
I have never seen any of them requiring 32 GB so far.

So, if it were the case that this package needs 64 GB of RAM
to build (which honestly I don't think is really the case),
that would be a problem for us, because it would force us
to do our QA work differently, just for a single package.

In other words, if it were the case that this package needs
so much memory, requesting some change (not necessarily
removing the -O3 thing) would be legitimate, in my opinion.

In this case, I have successful build logs which were made
after I said that it failed for me, so it is still possible
that there was a compiler bug or something suboptimal
in the build chain which has been fixed since then.

So, I am adding to my todo list to check this package again
in both the environment by Lucas and my own one.

(Please Cc me and Lucas on replies, messages which are just
sent to the bug address are never forwarded to the bug
submitters or casual participants).

Thanks.
Adrian Bunk
2024-12-19 16:10:01 UTC
Post by Santiago Vila
...
that would be a problem for us, because that would force
to do our QA work differently, just for a single package.
...
If a QA environment differs significantly from the buildds where the
package is usually built, then this might be considered a bug in the
QA environment - you are not testing whether the package would still
build on the buildds.

If the problem is just a single package in your setup, the easiest
solution is to blacklist that package in your setup.

The reproducible CI that continuously rebuilds the whole archive also
has some packages blacklisted that are problematic in their environment;
that's better than having a negative impact on the build time on the
buildds or even the contents of the package.
Post by Santiago Vila
Thanks.
cu
Adrian
Santiago Vila
2024-12-19 16:30:01 UTC
Post by Adrian Bunk
If a QA environment differs significantly from the buildds where the
package is usually built, then this might be considered a bug in the
QA environment - you are not testing whether the package would still
build on the buildds.
We are a free software distribution. If the end user can't rebuild
a given package without a lot of trouble, that's also a problem.
Post by Adrian Bunk
If the problem is just a single package in your setup, the easiest
solution is to blacklist that package in your setup.
No way.

I want to ensure that the end user will be able to build all packages.
This includes Lucas Nussbaum in his setup, and also mine.

Skipping packages will not help to achieve that goal.

Please stop posting to this bug, you are not helping anymore
at this point with those kind of comments.

I'll try to find a solution which works for everybody.

Thanks.
Adrian Bunk
2024-12-19 17:40:01 UTC
Post by Santiago Vila
Post by Adrian Bunk
If a QA environment differs significantly from the buildds where the
package is usually built, then this might be considered a bug in the
QA environment - you are not testing whether the package would still
build on the buildds.
We are a free software distribution. If the end user can't rebuild
a given package without a lot of trouble, that's also a problem.
We had this discussion many times before, and I don't want to go to the
Debian Technical Committee for the same again.

Some packages will not build when a build machine differs from the
typical buildd setup.
Post by Santiago Vila
Post by Adrian Bunk
If the problem is just a single package in your setup, the easiest
solution is to blacklist that package in your setup.
No way.
I want to ensure that the end user will be able to build all packages.
This includes Lucas Nussbaum in his setup, and also mine.
We are a binary distribution.
End users are not supposed to rebuild our packages.

You were annoying everyone for years by trying to rebuild all packages
on a machine with only one core, claiming this would be so important for
QA work that your bugs had to be release critical, and now we have the
same again.
Post by Santiago Vila
Skipping packages will not help to achieve that goal.
Please stop posting to this bug, you are not helping anymore
at this point with those kind of comments.
I'll try to find a solution which works for everybody.
Add more swap to your setup.

The root cause of this bug (where you have caused an RC issue) is that
you are trying to build all packages on a build machine that significantly
differs from the buildd setup.

Why are you doing your rebuilds actually?

Most useful for Debian are rebuilds that are best at finding FTBFS on
the buildds. That is what actually matters, not trying to get the whole
archive to build on whatever different setup you have at some point in time.

Keeping the archive building on all buildds is hard enough; it is not
helpful to create work to get everything building in random different
setups.

And it is really horrible when you make statements like
"The simple and effective way: Disable parallel building".

This is super-painful for exactly the packages with long build times
that also use a lot of memory during the build.

In some cases people have spent a considerable amount of time making
parallel building work.

You might not care if non-parallel building is not a problem in your
setup, but disabling parallel building for large packages is really
harmful on the Debian buildd infrastructure.
Post by Santiago Vila
Thanks.
cu
Adrian
Santiago Vila
2024-10-13 23:40:01 UTC
Hi.

My summary of the problem is that this package uses too much RAM
per CPU, i.e. the ratio UsedRAM / AvailableCPUs is a lot
higher than in most packages.

I can think of two ways to fix this:

A) The simple and effective way: Disable parallel building,
i.e. add --no-parallel to the dh call. The package would still
build the same, and it would not sacrifice things like -O3,
which would be preserved.

If this package had a build time much lower than it has,
I believe that would be the best solution.

However, this package takes its time to build, in such a way
that parallel building makes a difference, and I understand
if you don't like the simple and effective solution.

B) The not-so-simple but also effective way: this package
is not the first one to have a problem with the RAM/CPU ratio.
Other packages have implemented solutions in debian/rules
to reduce the number of CPUs used according to the available RAM.
For example, the llvm-toolchain-NN packages do this:

https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/-/blob/17/debian/rules?ref_type=heads#L60-67

I would be willing to propose a patch like that. Would you
consider it? (In case you don't like solution A, that is).

Thanks.
Santiago Vila
2024-10-15 01:10:01 UTC
tags 1042029 + patch
thanks

Hi. This is the patch I had in mind, inspired
by the code in the llvm-toolchain-nn packages,
but simplified a little bit.

The code takes available memory and divides it by 10000 MB,
so that at most 10000 MB of RAM are used per CPU.

For example, on a machine of type r7a.large from AWS,
which has 16 GB of RAM and 2 CPUs, the code will make it
use 1 CPU, because there is not enough RAM to use 2 CPUs.
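That arithmetic can be checked in a few lines of shell; the numbers below are the ones from the example, using 16000 MB as a round figure for 16 GB:

```shell
# r7a.large example: 16 GB of RAM, 2 CPUs, 10000 MB-per-job threshold.
mem_mb=16000
mem_per_cpu=10000
ncpus=2
by_mem=$(( mem_mb / mem_per_cpu ))            # 1: only one 10000 MB job fits
njobs=$(( by_mem < ncpus ? by_mem : ncpus ))  # take the smaller limit
if [ "$njobs" -lt 1 ]; then njobs=1; fi       # always run at least one job
echo "$njobs"
```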

Thanks.
Santiago Vila
2024-10-16 15:00:01 UTC
Hi.

Please consider this revised patch instead. It's a little bit more readable
than the previous one.

Thanks.
Santiago Vila
2024-12-15 11:50:02 UTC
Hi.

After this problem bit me again today, I've made a team upload to fix it,
using the last patch I posted (the one that carefully calculates the number
of CPUs to use from the available RAM).

I've used 10000 MB / CPU as a threshold, based on my own measurements,
but of course we could lower that later if it proves not to be enough.

Thanks.
Jerome BENOIT
2024-12-19 14:00:02 UTC
Hello, thanks for your contribution.
I will try to have a look this week-end.
Best, Jerome
--
Jerome BENOIT | calculus+at-rezozer^dot*net
https://qa.debian.org/developer.php?login=***@rezozer.net
AE28 AE15 710D FF1D 87E5 A762 3F92 19A6 7F36 C68B
Jerome BENOIT
2024-12-20 19:00:01 UTC
Hi, thanks for your interest in the package.

Let me give your patch a try.

I will have a closer look this week end.

We need to keep in mind that graph-tool and its author
are oriented toward intensive big-data computations
on machines common in labs, operated by knowledgeable people.

Otherwise, I am in favor of keeping debian/rules neutral as far as resources
are concerned, and leaving it to the superusers of the build machines to tune them.
However, we can advertise that the package needs an unusual amount of resources
to build. This might ultimately be handled by the debhelper machinery
and/or the build environments.

Cheers,
Jerome

Hi.
I'd like to upload the attached diff as an intermediate release
so that version 2.77 can propagate to testing (even if the build
takes 7 days on the slower autobuilders).
This would buy some time that we can use to talk with the author
and maybe prepare the next release (2.80).
Please take a look and tell me if it's ok for you.
Thanks.
--
Jerome BENOIT | calculus+at-rezozer^dot*net
https://qa.debian.org/developer.php?login=***@rezozer.net
AE28 AE15 710D FF1D 87E5 A762 3F92 19A6 7F36 C68B