Linux: my latest impediment to getting work done

At work I have to use Linux routinely. (They pay me enough, or at least they historically have.)

Today, I was working with a program that generated C code and wanted to find all places that generated comments. As a first cut, I tried (`backends' is a directory containing the code in question)

find backends -type f -print0 | xargs -0 grep -H /\* | egrep -a '^([^"]*"[^"]*")*[^"]*"[^"]*/[*]' | less
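The egrep pattern deserves a word: it matches `/*` only when it is preceded by an odd number of double quotes, i.e. when the `/*` sits inside a C string literal in the generator source. A sketch (the two sample lines are made up for illustration):

```shell
# The pattern is: any number of complete "..." pairs, then an opening
# quote, then /*.  It matches the first line (the /* is inside a string
# literal) but not the second (a division followed by a pointer
# dereference, with no quotes at all).
printf '%s\n' 'printf("/* generated */");' 'x = a / *p;' \
  | egrep '^([^"]*"[^"]*")*[^"]*"[^"]*/[*]'
```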

This worked, but included one very long line because there was a .png image there. So I tried to chop off long lines:

find backends -type f -print0 | xargs -0 grep -H /\* | egrep -a '^([^"]*"[^"]*")*[^"]*"[^"]*/[*]' | colrm 80 | less

This printed the same thing, except it cut off six bytes into the line printed for the .png — and didn't pick up again. The first dropped octet was the first non-ASCII octet (it was 0x96) and, I conjecture, colrm is getting upset because it insists on thinking that (a) I want it to treat its input as UTF-8 (which is incorrect) and (b) I want to get no output at all after the first non-UTF-8 sequence found (which is, if anything, even incorrecter).
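One byte-oriented workaround, assuming GNU coreutils' cut is available: `cut -b` selects raw bytes and does not care whether its input is valid UTF-8, so it can stand in for the column-counting colrm here. A minimal demonstration (the sample line is made up, with an octal `\226` standing in for the 0x96 octet):

```shell
# cut -b counts raw bytes and passes invalid UTF-8 through untouched,
# unlike a column-counting tool that insists on decoding its input.
# Keep the first 79 bytes of each line (what `colrm 80` was meant to do):
printf 'abc\226def and the rest of a very long line\n' | cut -b 1-79
```

In the pipeline above, that means replacing `colrm 80` with `cut -b 1-79`.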

$LANG is C in my environment (because, on a previous version of Linux, that shut off exactly this sort of idiocy). locale(1) prints

LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

so I have no idea why it thinks either of those.
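For what it's worth, LC_ALL outranks both LANG and the individual LC_* variables, so forcing it per-command is the usual bludgeon when a tool second-guesses the environment (whether colrm honours it at all is exactly what is in question here):

```shell
# sort is locale-aware: in the C locale it compares raw byte values,
# so all uppercase ASCII sorts before all lowercase.
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# -> A B a b (byte order)
```

If `LC_ALL=C colrm 80` still goes silent at the first 0x96 octet, colrm is ignoring the locale outright.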

But, with this, Linux has crossed from being just aesthetically offensive[1] to being a major practical obstacle to getting real work done. It has finally improved to the point where it can't do what I routinely did thirty years ago. It's so broken I am going to have to copy the data off Linux in order to do rudimentary processing on it.

Perhaps there actually is a way to fix this. But if so, (a) it is WAY too well hidden and (b) it is defaulting the wrong way, especially in view of my environment. It needs to be mentioned prominently, either in the error message from colrm when it fails (that it failed silently is ANOTHER major issue) or in its manpage.

And I am, strictly, guessing when I am blaming UTF-8 for this. All I know is that I've seen lots of text tools break on Linux as soon as I go outside ASCII. I suspect UTF-8 mostly because I see so much UTF-8 bigotry all over the Linux world.

I am forced to wonder how people can seriously advocate Linux as a development platform in the face of braindamage this crippling.


Actually, UTF-8 has been infesting the world beyond Linux, even, to some extent; I asked the SQLite people about using non-UTF-8 filenames for databases (the doc says that using non-UTF-8 filenames provokes undefined behaviour) and was asked — and this is a direct quote — "What's the problem? Are you wanting to name a database with some octet sequence that is not valid UTF-8? Why would you want to do that?". Absolutely jaw-dropping.

And I've mentioned elsewhere that SSH is, strictly, unimplementable on many Unix variants, because the spec has drunk the UTF-8 koolaid.

But those are only tangentially related to today's rant.


[1] Some of the major ones, in rough chronological order of my encountering them:
