Discussion:
Bug#796170: lintian: [new check] warn on non-UTF8 text files
Add Reply
Adam Borowski
2015-08-19 21:50:02 UTC
Reply
Permalink
Package: lintian
Version: 2.5.36.1
Severity: wishlist
Tags: patch


Here's an experimental tag, a step towards elimination of mojibake
system-wide. It checks all text files in *bin/, /usr/share/doc/ and those
that look like a script file. "Text" is defined as not having any bytes in
the 0..31 range other than tabs, newlines (incl. Windows ones) or form
feeds. In practice, this definition appears to work pretty well, although
the list of files that should be skipped despite being text needs work.

It's a part of the "UTF-8 everywhere" release goal that I intend to
re-propose for Stretch.

This is only a preliminary version, let's discuss what you think. If you're
on DebConf, you can contact me in person.
Niels Thykier
2015-08-20 08:10:02 UTC
Reply
Permalink
Post by Adam Borowski
Package: lintian
Version: 2.5.36.1
Severity: wishlist
Tags: patch
Hi,

Thanks for the suggestion and the prototype patch.

I think it is an interesting proposal and I think we could try it as an
experimental check to see if it is a feasible check.

I have some minor technical remarks to the patch interleaved to below.
It will also need a test case in the t/tests suite.


Axel (XTaran) and I are at DebConf if you need help with anything, etc. :)
Post by Adam Borowski
[...]
0001-New-experimental-tag-text-file-uses-obsolete-encodin.patch
From 902283f122c71c88b968abfc3c778686200c9361 Mon Sep 17 00:00:00 2001
Date: Wed, 19 Aug 2015 23:32:39 +0200
Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding
---
checks/files.desc | 11 +++++++++++
checks/files.pm | 10 +++++++++-
lib/Lintian/Util.pm | 22 ++++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletion(-)
[...]
diff --git a/checks/files.pm b/checks/files.pm
index b816ed8..0940bbf 100644
--- a/checks/files.pm
+++ b/checks/files.pm
@@ -27,7 +27,7 @@ use Lintian::Data;
use Lintian::Output qw(warning);
use Lintian::Tags qw(tag);
use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz
- signal_number2name strip normalize_pkg_path);
+ signal_number2name strip normalize_pkg_path file_is_non_utf8_text);
use Lintian::SlidingWindow;
use constant BLOCKSIZE => 16_384;
@@ -1514,6 +1514,14 @@ sub run {
if $info->index($1);
}
+ # ---------------- encoding
+ if ( $fname =~ m{^(?:usr/)?s?bin/}
Is this already guarded by a "$file->is_regular_file" (or
$file->is_file)? Otherwise, this will break horribly with lintian
trying to open directories, named pipes, etc.
Post by Adam Borowski
+ or $fname =~ m{\.(?:pm|py|pl|txt)$}
Minor nit: Perhaps \.(?:p[myl]|txt(?:\.gz)?)$

* Would also test gzip-compressed text files
* p[myl] is slightly more optimial than pm|py|pl
Post by Adam Borowski
[...]
index 0b8fa5a..f68ea71 100644
--- a/lib/Lintian/Util.pm
+++ b/lib/Lintian/Util.pm
@@ -62,6 +62,7 @@ BEGIN {
slurp_entire_file
file_is_encoded_in_non_utf8
is_string_utf8_encoded
+ file_is_non_utf8_text
fail
strip
lstrip
@@ -859,6 +860,27 @@ sub file_is_encoded_in_non_utf8 {
return $line;
}
+=item file_is_non_utf8_text (...)
+
+Both binary files and text files encoded in proper UTF8 give a negative
+answer.
+
+=cut
+
+sub file_is_non_utf8_text {
+
+ my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : $file->open;
+ my $bad=0;
+ while (<$fd>) {
I would like to see this use a named variable here (to keep the uses of
"$_" down).
Post by Adam Borowski
+ return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/);
The $fd might be a pipe to a subprocess (in the open_gz case). It would
need a "drain_pipe" call to avoid spurious errors caused by gzip dying
by "broken pipe".
Post by Adam Borowski
[...]
Adam Borowski
2015-08-22 16:10:02 UTC
Reply
Permalink
Post by Niels Thykier
Thanks for the suggestion and the prototype patch.
I think it is an interesting proposal and I think we could try it as an
experimental check to see if it is a feasible check.
I did some runs on the whole Debian archive, but having it in Lintian as an
experimental tag means it's available to other people in an usable/linkable
way.
Post by Niels Thykier
I have some minor technical remarks to the patch interleaved to below.
It will also need a test case in the t/tests suite.
I'll try to learn how to write lintian test cases.
Post by Niels Thykier
Post by Adam Borowski
0001-New-experimental-tag-text-file-uses-obsolete-encodin.patch
+ # ---------------- encoding
+ if ( $fname =~ m{^(?:usr/)?s?bin/}
Is this already guarded by a "$file->is_regular_file" (or
$file->is_file)? Otherwise, this will break horribly with lintian
trying to open directories, named pipes, etc.
Yes, it's inside a $file->is_file block.
Post by Niels Thykier
Post by Adam Borowski
+ or $fname =~ m{\.(?:pm|py|pl|txt)$}
Minor nit: Perhaps \.(?:p[myl]|txt(?:\.gz)?)$
* Would also test gzip-compressed text files
* p[myl] is slightly more optimial than pm|py|pl
Done. I also added php, rb, tcl and sh to the list of extensions, probably
more should be added too.
Post by Niels Thykier
Post by Adam Borowski
+ while (<$fd>) {
I would like to see this use a named variable here (to keep the uses of
"$_" down).
Done.
Post by Niels Thykier
Post by Adam Borowski
+ return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/);
The $fd might be a pipe to a subprocess (in the open_gz case). It would
need a "drain_pipe" call to avoid spurious errors caused by gzip dying
by "broken pipe".
Ok, I added drain_pipe($fd) if /...gzip/.


On Thu, Aug 20, 2015 at 12:09:26PM +0200, Jakub Wilk wrote:
] >+Tag: text-file-uses-obsolete-encoding
]
] We have already a few tags named *-uses-obsolete-national-encoding, so it
] would be nice if this tag used the same scheme.

I wanted to keep the tag short, but you're probably right. Done.

] Also, please make sure that we don't complain twice about the same file.

This one is harder. There's several other checks on specific files (like
debian/changelog.Debian.gz) that differ by severity. Should I hunt down all
such checks and add exclusions?

] >+Severity: normal
]
] I'd say wishlist here.

Ok, it can always be jacked back up once I achieve world domination^W^W^Wget
it included as a part of a release goal.

] >+ characters (often called "mojibake"). You should convert it to UTF8 using
] >+ iconv or a similar tool.
]
] iconv should be good for plain text files in /usr/share/doc; but if applied
] blindly to code, or HTML documents or similar, it can cause more harm than
] good.

"Blindly" can break when the code in question deals with obsolete encodings,
but otherwise, I disagree. Using non-UTF-8 in HTML is always damage, which
leads to data loss once someone gives a form/etc input that's
unrepresentable in the encoding in question. I've got enough of ń in my
town name mangled when ordering stuff, etc. Unicode is the only charset
that's universal.

] It would be good to emphasise that the conversion should be done upstream,
] not by "you", the Debian maintainer.

I intend to submit a patch to debhelper that makes the whole process a
matter of writing a single value to a file under debian/ which would
massively reduce the amount of work in ~3000 affected packages. This should
make it easy enough. You're right that fixing encodings is best done by the
upstream, but if converting during package build is easy enough, there's no
reason to discourage that.

] >+Info: The given file is text but uses non-UTF8 encoding.
]
] s/UTF8/UTF-8/, here and elsewhere.

Done.

] >+ # ---------------- encoding
] >+ if ( $fname =~ m{^(?:usr/)?s?bin/}
]
] Also /usr/games?

/usr/games added. I initially skipped it for simplicity (there's exactly
1 package failing), but you're right, it's better to be complete.

] Though I'm not sure what's the rationale for choosing these directories...

Text files in /*bin are all scripts rather than random data files, and thus
comments and messages included need all to be in UTF-8.


On Thu, Aug 20, 2015 at 10:05:06AM -0700, Russ Allbery wrote:
} The last time I looked at this in a policy context, the distribution
} included a few documentation files that were intentionally provided
} upstream in multiple different encodings. In other words, there would be
} a README.sjis and a README.utf8, etc., side-by-side. In those cases, it
} feels bad to have Lintian tag the README.sjis file and have maintainers
} possibly just not install it, when it might still be a convenience to some
} users.

We default to UTF-8 for a decade, there are already plans to drop support
for running with an obsolete locale. Shipping these files is a waste of
space.

In any case, in the whole archive we have 3 packages with such files with
sjis inside the file name, 16 with koi8.

I wonder how to gauge if there are actually any users who run obsolete
encodings left. I'll try something like downloading X thousands of recent
bug reports and gathering Locale: lines reportbug includes.
--
⢎⣉⠂⠠⠀⡀⣄⠀⡀⠠⡅⠀⠀⡧⠄⡄⠀⡄⠀⠀⠀⠠⡅⠀⡠⠀⠄⠀⠀⠀⢎⠍⠀⡠⠀⡀⣄⠀⡀⠀⠀⠀⠀⡧⠄⣇⠀⡀⡠⠀⡀⠀⠀⠀⡄⠀⡄⡠⠀⡀⠠⠀⡀⡇⡠⠄⠀⠀⠀
⠢⠀⠃⠪⠭⠇⠇⠀⠇⠀⠣⠀⠀⠣⠄⠚⠭⠃⠀⠀⠀⠀⠣⠀⠬⠭⠂⠀⠀⠀⠞⠀⠀⠣⠀⠃⠇⠀⠀⠀⠀⠀⠀⠣⠄⠇⠀⠇⠫⠭⠁⠀⠀⠀⠣⠣⠃⠫⠭⠁⠪⠭⠇⠏⠢⠄⠀⠄⠀
(https://github.com/kilobyte/braillefont for this hack)
Jakub Wilk
2015-08-22 17:20:01 UTC
Reply
Permalink
Post by Adam Borowski
Post by Jakub Wilk
iconv should be good for plain text files in /usr/share/doc; but if
applied blindly to code, or HTML documents or similar, it can cause
more harm than good.
"Blindly" can break when the code in question deals with obsolete
encodings, but otherwise, I disagree.
Python, Perl's POD, HTML and XML all have a way to declare encoding. If
you apply iconv to a correctly-encoded file, but "forget" to update the
encoding, you get mojibake, or in the Python's case, syntax error.
Post by Adam Borowski
I intend to submit a patch to debhelper that makes the whole process a
matter of writing a single value to a file under debian/ which would
massively reduce the amount of work in ~3000 affected packages.
Heaven forbid...
--
Jakub Wilk
Jakub Wilk
2015-08-20 10:20:02 UTC
Reply
Permalink
Hi Adam!
+Tag: text-file-uses-obsolete-encoding
We have already a few tags named *-uses-obsolete-national-encoding, so
it would be nice if this tag used the same scheme.

Also, please make sure that we don't complain twice about the same file.
+Severity: normal
I'd say wishlist here.
+ characters (often called "mojibake"). You should convert it to UTF8 using
+ iconv or a similar tool.
iconv should be good for plain text files in /usr/share/doc; but if
applied blindly to code, or HTML documents or similar, it can cause more
harm than good.

It would be good to emphasise that the conversion should be done
upstream, not by "you", the Debian maintainer.
+Info: The given file is text but uses non-UTF8 encoding.
s/UTF8/UTF-8/, here and elsewhere.
+ # ---------------- encoding
+ if ( $fname =~ m{^(?:usr/)?s?bin/}
Also /usr/games? Though I'm not sure what's the rationale for choosing
these directories...
+ or $fname =~ m{\.(?:pm|py|pl|txt)$}
... or these extensions.
--
Jakub Wilk
Russ Allbery
2015-08-20 17:10:02 UTC
Reply
Permalink
Post by Adam Borowski
Here's an experimental tag, a step towards elimination of mojibake
system-wide. It checks all text files in *bin/, /usr/share/doc/ and
those that look like a script file. "Text" is defined as not having any
bytes in the 0..31 range other than tabs, newlines (incl. Windows ones)
or form feeds. In practice, this definition appears to work pretty
well, although the list of files that should be skipped despite being
text needs work.
It's a part of the "UTF-8 everywhere" release goal that I intend to
re-propose for Stretch.
This is only a preliminary version, let's discuss what you think. If
you're on DebConf, you can contact me in person.
The last time I looked at this in a policy context, the distribution
included a few documentation files that were intentionally provided
upstream in multiple different encodings. In other words, there would be
a README.sjis and a README.utf8, etc., side-by-side. In those cases, it
feels bad to have Lintian tag the README.sjis file and have maintainers
possibly just not install it, when it might still be a convenience to some
users.

Maybe this check should exclude files that have an extension that
indicates they were intentionally encoded in some other encoding?
--
Russ Allbery (***@debian.org) <http://www.eyrie.org/~eagle/>
Felix Lechner
2019-08-11 15:50:01 UTC
Reply
Permalink
Hi Adam,
Post by Adam Borowski
Here's an experimental tag, a step towards elimination of mojibake
system-wide. It checks all text files in *bin/, /usr/share/doc/ and
those that look like a script file. "Text" is defined as not having any
bytes in the 0..31 range other than tabs, newlines (incl. Windows ones)
or form feeds. In practice, this definition appears to work pretty
well, although the list of files that should be skipped despite being
text needs work.
The bug has attachments with proposed patches. Is this tag still in
the works? If so, would you please turn it into a merge request on
Salsa?
Post by Adam Borowski
I'll try to learn how to write lintian test cases.
I would be happy to assist with writing tests.

Kind regards
Felix Lechner

Loading...