Battery Powered Project

I think you are missing the point. If I pipe 4095 characters into mawk, nothing happens; if I pipe an extra char to make 4096, it prints out.

Reply to
Pancho

Agreed. It is easy to reproduce.

$ (seq 9999 | head -c 4095; sleep 2; echo) | mawk '{print}'

This pauses before printing anything whatsoever.

$ (seq 9999 | head -c 4096; sleep 2; echo) | mawk '{print}'

This immediately prints whole lines up to 1040, pauses, then prints 104 (i.e. 1041, truncated).

$ (seq 9999 | head -c 4095; sleep 2; echo) | mawk -Wi '{print}'

This immediately prints whole lines up to 1040, pauses, then prints 10 (i.e. 1041, truncated).
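For anyone who wants to check where those byte counts fall in the seq output, here is a small sketch. The helper name tail_of is made up for illustration; it relies on head -c, as the commands above do. The arithmetic: lines 1-9 take 18 bytes, 10-99 take 270, 100-999 take 3600, which is 3888 in total, so the remaining 208 (or 207) bytes hold 41 five-byte lines (1000 to 1040) plus a 3-byte (or 2-byte) fragment of 1041.

```sh
# tail_of N: show the last complete line and the trailing fragment
# of the first N bytes of "seq 9999" (hypothetical helper).
tail_of() { seq 9999 | head -c "$1" | tail -n 2; echo; }

tail_of 4096    # 1040, then the fragment 104
tail_of 4095    # 1040, then the fragment 10
```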

This is entirely down to mawk and is nothing to do with the kernel. The effect of -Wi is twofold.

First it disables output buffering. But this is not really relevant here.

Second it causes mawk to read from stdin rather than file descriptor 0. This is the key difference. With -Wi, it runs fgets on stdin, and gets stdio's buffering policy: read as much as possible, but don't block unless progress is impossible. Without -Wi, it uses its own internal buffering policy: always read a whole block even if this means blocking unnecessarily.

I have no idea what the benefit of the latter policy is; it seems to make the code a lot more complicated for no clear gain (and it breaks your use case). It's plainly deliberate, so in that sense not a bug, although it seems like a bizarre design decision to me.
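If a script has to stream line by line whichever awk happens to be installed, one possible workaround is a wrapper along these lines. This is only a sketch: stream_awk is a made-up name, and it assumes mawk identifies itself at the start of its -W version output (as the "mawk 1.3.3 Nov 1996" banner does) and that any other awk needs no special flag.

```sh
# stream_awk: run awk on piped input, adding mawk's interactive mode
# only when the awk on PATH reports itself as mawk (hypothetical helper).
stream_awk() {
    case $(awk -W version 2>/dev/null </dev/null | head -n 1) in
        mawk*) awk -W interactive "$@" ;;   # -Wi spelled out in full
        *)     awk "$@" ;;                  # gawk and others: no flag assumed necessary
    esac
}

printf 'a\nb\n' | stream_awk '{ print NR ": " $0 }'
```

It is used exactly like awk, e.g. tail -f testfile | stream_awk '{print}'.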

[...]

There is no such thing as mawk 5.0.1. Fedora is presumably using GNU Awk (which is also available in Debian and its derivatives). These are totally different programs and it does not make any sense to compare their version numbers.

--
https://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell

That's definitely faulty behaviour: pipe operation should not depend on how full the pipe is.

I've just run similar tests on my systems:

- Fedora 32 on an *GB Lenovo T420

- Raspbian Buster on a 512MB RPi 2B

Both systems had full updates 4 days ago.

Fedora 32 (awk 5.0.1) with 65Kb pipe buffer size

File size   Time to transfer
65535       7-10 ms
65536       7-11 ms
65537       10-31 ms

Raspbian (awk 1.3.3) with 40

65535       67 ms
65536       67 ms
65537       68 ms
4095        63 ms
4096        64 ms
4097        64 ms

If your system can't match that then you have a problem that doesn't show up here. How recently have you updated your system?

If the pipe hangs persist after a software update then you should raise a bug on it, which, if the Raspbian bug reporter is any good, also gives you the opportunity to see if anybody else has the same problem.

The Raspbian awk is also quite old: version 1.3.3 is dated November 1996. By contrast, the Fedora 32 awk is version 5.0.1 and dated May 2019.

I think that's old enough to be worth asking about a refresh, too.

--
--   
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Richard explained it better than me. It's mawk waiting until it has a block of 4096 bytes (or EOF). Clearly designed behaviour.

With 20.04, Ubuntu seems to have switched from mawk to gawk as default awk. I don't know if this is Ubuntu specific or Debian. So it's quite possible this will be reflected in the next version of Raspbian (Raspberry Pi OS).

This is exactly what I mean by fragile. Someone writes a script to do something using awk, an OS update comes along and the app completely changes.

Reply to
Pancho

Thanks for the seq example, simple, informative. I had seen that (or similar) done before, but couldn't quite remember/figure it out myself, I actually tried yesterday.

[snip]

Naively, I would guess prioritising performance for big files, as the default.

To be fair, until this thread there was just awk, as far as I was concerned, which is why I see different implementations as fragile.

It looks as if Ubuntu recently swapped default awk from mawk to gawk.

Reply to
Pancho

In Fedora systems the binary is called gawk with awk as an alias.

In Raspbian Buster the binary is /usr/bin/mawk with awk and nawk as aliases.

So, the same shell script should run in both places: my test scripts do exactly that *and* do not show the long delay you're seeing.

--
--   
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

In mine, Linux raspi-3plus 5.4.79-v7+ #1373 SMP Mon Nov 23 13:22:33 GMT 2020 armv7l GNU/Linux,

# ls -l /usr/bin/?awk

-rwxr-xr-x 1 root root 537K Sep 14 2018 /usr/bin/gawk*

-rwxr-xr-x 1 root root 93K Apr 8 2012 /usr/bin/mawk*

lrwxrwxrwx 1 root root 22 May 27 2020 /usr/bin/nawk -> /etc/alternatives/nawk*

so mawk is well out of date!

/usr/bin/awk -> /etc/alternatives/awk -> /usr/bin/gawk

/usr/bin/mawk -W version says: mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits: max NF 32767 sprintf buffer 1020

So really, well worth linking awk to gawk if you can.

--
Chris Elvidge 
England
Reply to
Chris Elvidge

Ah, you already said the same, sorry.

Reply to
A. Dumas

Not here, or did you not mean aliases to mawk? It's /usr/bin/awk -> /etc/alternatives/awk -> /usr/bin/gawk. /usr/bin/mawk is a separate binary with these properties:

$ /usr/bin/mawk -W version

mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits: max NF 32767 sprintf buffer 1020

Reply to
A. Dumas

I think that requires the Debian maintainers to switch from mawk to gawk and then everybody else to wait while the new code percolates up through the distro dependencies.

However, I have realised that the *content* of the input file may be affecting awk, since it's designed to work with lines of text.

I started out by grabbing a suitably large JPG image and adjusting its length using the 'truncate' utility, which makes files of exactly the specified size. Then I read its manpage a little more carefully and discovered that, if it is used to lengthen a file, the new space is filled with binary zeros: something awk may not handle well since it's expecting textual input.

As a result, I wrote another awk script for creating files of pseudo-text of exactly the specified length. Each line in a generated file is up to 37 characters:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\n

and the script can also generate a very short final line if the required amount of data isn't an exact multiple of 37:

empty    ''
1 char   '\n'
2 chars  '0\n'
3 chars  '01\n'
...

The test results I posted were all the result of running a command like this:

"cat filename.txt | awk -- 'script'"

that uses 'cat' to fill a pipeline with one of the files created with my awk file filling program and 'script' is an awk program that totals the lengths of the lines read and displays the total.
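A rough reconstruction of that setup, for anyone who wants to repeat the test, might look like the following. It is a sketch based only on the description above, not the actual scripts: make_text is a hypothetical name, and the totalling script assumes each line's newline should be counted as one byte.

```sh
# make_text N: write exactly N bytes of pseudo-text to stdout (hypothetical helper).
# Full lines are 36 characters plus a newline, i.e. 37 bytes each.
make_text() {
    awk -v n="$1" 'BEGIN {
        line = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        while (n >= 37) { printf "%s\n", line; n -= 37 }
        if (n > 0) printf "%s\n", substr(line, 1, n - 1)   # short final line
    }'
}

make_text 4096 > filename.txt

# Total the line lengths read through the pipe, counting one byte per newline.
cat filename.txt | awk -- '{ total += length($0) + 1 } END { print total }'
```

With a 4096-byte file the total printed should match the requested size, and wc -c filename.txt gives an independent check.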

So, questions for Pancho:

- how did you create your input files?

- Is there any chance that they don't contain any 'newline' characters?

I'm asking because it's quite possible that awk would stall, waiting for the next character, if it had read several KB of data without finding a newline character or EOF.

--
--   
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Well, that is effectively aliases, just selective ones, and is what I have here. I've never switched that alternative on my RPi, so I've been running mawk as awk.

Fedora does not use alternatives - just /usr/bin/gawk with /usr/bin/awk as an alias.

--
--   
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

/usr/bin/awk -> /etc/alternatives/awk -> /usr/bin/gawk

Not aliases, soft links, AIUI.

alias awk='gawk' would not work if the user is not logged in (e.g. in cron).

--
Chris Elvidge 
England
Reply to
Chris Elvidge

But I also haven't changed those aliases, or at least don't remember. I would find it very strange if mawk was the default awk anywhere, but especially on Debian/Raspbian which I have used for quite a few years and never noticed that awk was linked to mawk. Pretty sure it isn't, in a standard installation.

Reply to
A. Dumas

Oh right, what Chris said (again...): not aliases but symbolic links.

Reply to
A. Dumas

Indeed: that's what I meant.

--
--   
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

If both are installed, gawk wins (priority 10 vs priority 5). However mawk is Priority: required, while gawk is Priority: optional.

So mawk is the default in the sense that it's certain to be installed while gawk is not, but gawk is the default in the alternative sense that if you have both, awk->gawk.
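To see which of the two cases applies on a given machine, something like this should do. It is a sketch that assumes readlink -f is available (it is in GNU coreutils) and only calls update-alternatives where that tool exists, i.e. on Debian-family systems.

```sh
# Follow the symlink chain from the awk on PATH to the real binary.
awk_path=$(command -v awk)
if [ -n "$awk_path" ]; then
    readlink -f "$awk_path"
fi

# Debian-family only: list the registered candidates and their priorities.
if command -v update-alternatives >/dev/null 2>&1; then
    update-alternatives --display awk || echo "no awk alternative registered"
fi
```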

--
https://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell

Thanks for the explanation!

Reply to
A. Dumas

So, all in all, a TITSUP. (Total Inability To Support Users, Properly).

--
Chris Elvidge 
England
Reply to
Chris Elvidge

I used:

tail -f testfile | mawk...

I used vi to generate files, with newlines, which I appended into the testfile, or just echo "stuff..." appended into the test file.

But Richard's example is perfect to demonstrate the problem:

Not only does seq quickly generate a byte stream, but it allows you to see exactly what byte you are on. I'm not sure why he used the brackets () but I left them in as they don't hurt.

I guess he used the echo because he is habitually tidy :-).

Maybe the answer is that for reliability I should have used perl instead of awk? Maybe perl is more standard?

Reply to
Pancho

If you take them out it means something different.
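A tiny illustration of the difference, using echo and wc rather than the original commands (a sketch, nothing mawk-specific about it):

```sh
# Grouped: the output of both commands goes through the pipe.
grouped=$( (echo one; echo two) | wc -l )

# Ungrouped: only the last command feeds the pipe; "one" bypasses it.
ungrouped=$( echo one; echo two | wc -l )

echo "grouped:" $grouped        # wc sees 2 lines
echo "ungrouped:" $ungrouped    # "one" passes straight through, wc sees 1 line
```

In the seq example, without the brackets only the echo would be piped into mawk, and the seq output would go straight to the terminal.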

I am :-) but in this case I wanted one more byte to make up the 4096-byte input block.

Perl only has one implementation (albeit many versions of it) so it's more predictable in that sense. I doubt it has mawk's weird approach to IO, too.

--
https://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell
