Discussion:
Bug#1072594: goobox: FTBFS: UTF-8 "\xFC" does not map to Unicode at /usr/share/perl5/Locale/Po4a/Po.pm line 353, <$fh> line 114.
Helge Kreutzmann
2024-06-05 15:40:01 UTC
Hello Santiago,
Thanks for notifying me.



po4a::sgml: Warning: onsgmls produced some errors. This is usually caused by po4a, which modifies the input and restores it afterwards, causing the input of onsgmls to be invalid. This is usually safe, but you may wish to verify the generated document with onsgmls -wno-valid.
-o debug=onsgmls
UTF-8 "\xFC" does not map to Unicode at /usr/share/perl5/Locale/Po4a/Po.pm line 353, <$fh> line 114.
(108 entries)
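
For context, the byte 0xFC named in this error is "ü" in ISO-8859-1 (Latin-1), but it is not a valid byte anywhere in UTF-8, so a strict UTF-8 decoder rejects it. A minimal Python sketch of the same effect follows; the sample string and the helper name are made up for illustration, and this is not po4a's code (po4a is written in Perl):

```python
# Hypothetical sample: "Menü" encoded as ISO-8859-1 (Latin-1).
raw = "Men\u00fc".encode("latin-1")  # b'Men\xfc'


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(raw, "utf-8"))    # False: 0xFC is never valid in UTF-8
print(decodes_as(raw, "latin-1"))  # True: 0xFC is 'ü' in Latin-1
```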
It looks like the updated po4a causes this. I'll investigate
in the next few days.
Post by Santiago Vila
If this is really a bug in one of the build-depends, please use
reassign and affects, so that this is still visible in the BTS web
page for this package.
At first sight I assume this is the case; after investigating, I'll take
care of it.

Greetings

Helge
--
Dr. Helge Kreutzmann ***@helgefjell.de
Dipl.-Phys. http://www.helgefjell.de/debian.php
64bit GNU powered gpg signed mail preferred
Help keep free software "libre": http://www.ffii.de/
Helge Kreutzmann
2024-06-05 16:50:02 UTC
clone 1072594 -1
reassign -1 po4a
retitle -1 Regression: po4a fails on valid non-utf8 file
tags 1072594 + pending
thanks

Sorry, forgot to actually use the real bug number 

Santiago Vila
2024-06-05 17:10:01 UTC
(Adding this note to the cloned bug)

Note: If you take a look at the FTBFS bugs I reported yesterday:

https://people.debian.org/~sanvila/build-logs/202406/?C=M;O=A

you can see that several of them are also a consequence of this change in po4a.

So, I fully support that this kind of behaviour change deserves
at least an entry in NEWS.Debian.

Thanks.
Helge Kreutzmann
2024-06-13 09:10:01 UTC
reopen 1072643
severity 1072643 important
found 1072643 0.72
thanks

Hello Martin,
Post by Martin Quinson
I think that the fix applied to #1072594 (recoding the input file from latin-1
to UTF-8) was not necessary. Changing the config of po4a to correctly specify
the used encoding would have worked.
I tried to improve the error messages upstream to help future users to debug
such issues, but in any case, this does not justify a RC bug against po4a, thus
closing.
I'm not arguing the severity (I left it intentionally to you after
closing), but there still is a bug. I leave this to you and Santiago,
but making several packages suddenly FTBFS is IMHO at least serious.

For several years (probably something like 10 years) this worked
without problems; now it fails (and with a very strange message). If
the previous po4a was buggy, i.e. allowed broken config files, then a
warning or NOTE during updates would be mandated, but switching this
(inadvertently, probably) to a strange or even fatal error message is
not sufficient.

Here is the statement from Santiago:
From: Santiago Vila <***@debian.org>
To: ***@bugs.debian.org, Helge Kreutzmann <***@helgefjell.de>
Subject: Regression: po4a fails on valid non-utf8 file
Date: Wed, 5 Jun 2024 19:03:48 +0200

(Adding this note to the cloned bug)

Note: If you take a look at the FTBFS bugs I reported yesterday:

https://people.debian.org/~sanvila/build-logs/202406/?C=M;O=A

you can see that several of them are also a consequence of this change in po4a.

So, I fully support that this kind of behaviour change deserves
at least an entry in NEWS.Debian.

Thanks.

So no, this bug is not closed.

Greetings

Helge
Santiago Vila
2024-06-13 10:10:01 UTC
Hi. I'm reading the upstream NEWS file and found this:

* Greatly simplify the code by using PerlIO instead of messing up
with encodings manually.
* This is a very intrusive change, and even if all tests of our
comprehensive suite pass, I still expect issues with this on some
corner cases, such as projects not using UTF-8 but a mixture of
encodings. Please report any issue, and accept my apologies.

Simple question: Am I right to think that the problem happens
when a .po file is encoded using a given charset but declares
(in the charset= line) another different charset?
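
The situation asked about here can be checked outside po4a. A rough Python sketch, assuming the usual "Content-Type: text/plain; charset=..." line in the PO header; the helper names and the sample data are hypothetical:

```python
import re

# Matches the charset in a PO header line such as
# "Content-Type: text/plain; charset=UTF-8\n"
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)")


def declared_charset(po_bytes: bytes):
    """Return the charset named in the PO header, or None if absent."""
    match = CHARSET_RE.search(po_bytes)
    return match.group(1).decode("ascii") if match else None


def matches_declaration(po_bytes: bytes) -> bool:
    """True if the bytes decode cleanly under the declared charset."""
    charset = declared_charset(po_bytes)
    if charset is None:
        return False
    try:
        po_bytes.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


# Made-up PO content: the header says UTF-8, the msgstr holds a Latin-1 byte.
sample = (
    b'msgid ""\nmsgstr ""\n'
    b'"Content-Type: text/plain; charset=UTF-8\\n"\n'
    b'\nmsgid "Menu"\nmsgstr "Men\xfc"\n'
)
print(declared_charset(sample))     # UTF-8
print(matches_declaration(sample))  # False
```

The check only works in one direction: a single-byte charset such as ISO-8859-1 accepts any byte sequence, so a file that wrongly declares it will still pass.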

Thanks.
Martin Quinson
2024-06-13 13:00:01 UTC
Post by Santiago Vila
Simple question: Am I right to think that the problem happens
when a .po file is encoded using a given charset but declares
(in the charset= line) another different charset?
By definition, I fixed all the known issues, but I'm not confident about corner
cases that are less tested. I mean that I don't know where the remaining issues
will come from.

Checking on the code about your question, if the charset=??? line does not
match the provided charset, the file is re-read with the charset from the file
(with a warning). That should be OK (and I never heard of issues here).
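
The behaviour described here, falling back to the charset declared in the file when it differs from the provided one, could be sketched roughly as follows in Python. This illustrates the idea only and is not po4a's actual Perl implementation; the function name, the regex, and the warning text are assumptions. The sketch reads the bytes once and picks the decoder afterwards, which has the same effect as re-reading the file.

```python
import re
import warnings


def read_po(path: str, provided_charset: str = "UTF-8") -> str:
    """Read a PO file, preferring the charset declared in its header."""
    with open(path, "rb") as fh:
        raw = fh.read()
    # The header line is ASCII, so it can be searched before decoding.
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", raw)
    declared = match.group(1).decode("ascii") if match else provided_charset
    if declared.lower() != provided_charset.lower():
        warnings.warn(
            f"{path}: header declares {declared}, not {provided_charset}; "
            "using the declared charset"
        )
    # Raises UnicodeDecodeError if the bytes do not match the declaration.
    return raw.decode(declared)
```

If the file's bytes match neither charset, the final decode still fails, which is the case reported in this bug.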

We have some issues when a msgid contains UTF chars, but they are related to
gettext, not po4a itself. https://savannah.gnu.org/bugs/?65104

In addition, I tried to draft the information I'll add to the README.Debian
here: https://github.com/mquinson/po4a/blob/master/NEWS#L14

Your advice about it is very welcome, please.
Mt
Helge Kreutzmann
2024-06-23 09:30:01 UTC
Hello Martin,
Post by Martin Quinson
In addition, I tried to draft the information I'll add to the README.Debian
here: https://github.com/mquinson/po4a/blob/master/NEWS#L14
Your advice about it is very welcome, please.
I think for README.Debian this is fine, but maybe you can add a NEWS
file as well, something like:

--- snip --- snip -- snip ---

Po4a (since version 0.70) is stricter about encoding issues. If you
experience trouble with translations, read README.Debian for more
information and double-check all encodings used in PO files.

You also might want to consider converting legacy encodings to UTF-8.

--- snip --- snip -- snip ---

You might of course drop the last sentence, but I think legacy
encodings should be phased out in 2024 wherever possible.
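
As an illustration of such a conversion, a PO file in a legacy encoding can be recoded to UTF-8 with a few lines of Python. This is a sketch under assumptions: the source encoding must be known and passed in explicitly, the function name is hypothetical, and the header is rewritten with a plain regex substitution on the first charset= declaration. GNU gettext's msgconv tool is the established way to do this for real PO files.

```python
import re


def recode_po_to_utf8(po_bytes: bytes, source_encoding: str) -> bytes:
    """Decode with the known legacy encoding, fix the header, emit UTF-8."""
    text = po_bytes.decode(source_encoding)
    # Update only the first charset= declaration (the PO header).
    text = re.sub(r"charset=[A-Za-z0-9_.:-]+", "charset=UTF-8", text, count=1)
    return text.encode("utf-8")


# Made-up Latin-1 PO content for demonstration.
legacy = (
    b'msgid ""\nmsgstr ""\n'
    b'"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
    b'\nmsgid "Menu"\nmsgstr "Men\xfc"\n'
)
converted = recode_po_to_utf8(legacy, "iso-8859-1")
print(b"charset=UTF-8" in converted)  # True
print(b"Men\xc3\xbc" in converted)    # True: 'ü' is now the two bytes C3 BC
```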

Maybe you could upload 0.73 to Debian with such a NEWS (and
README.Debian) in the next days?

Greetings

Helge