high accuracy math library

There was once a math library in C, if memory serves, with the basic functions, i.e. +, -, * and /, and some others as well. The resolution was adjustable, so changing a reference variable (or was that a #define?) from 32 to 256 would change the size of the variables to 256 bits. Does anyone remember the name or location of that library?

Hul

Reply to
Hul Tytus

There seem to be a bunch:

formatting link

Reply to
jlarkin

I don't recall that particular one, but GCC can be fairly easily persuaded to go up to 128-bit reals, which are usually good enough for all but the most insane of floating point calculations.

I think your choices there are limited to 32, 64, 80 and 128 bits.

formatting link
It includes the most common transcendental functions as well.
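A minimal sketch of what that looks like with GCC's __float128 and libquadmath (link with -lquadmath; the 'q' literal suffix and the %Qe format are GNU extensions):

#include <stdio.h>
#include <quadmath.h>   /* GCC's quad-precision library */

int main(void)
{
    __float128 x = 2.0q;        /* 'q' suffix = quad literal */
    __float128 r = sqrtq(x);    /* quad-precision square root */
    char buf[128];
    /* printf can't format __float128 directly; quadmath_snprintf can */
    quadmath_snprintf(buf, sizeof buf, "%.33Qe", r);
    printf("sqrt(2) = %s\n", buf);
    return 0;
}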

Quad precision floating point runs slowly, so do as much as you can at a lower precision and then refine the answer using that as a seed value.
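As a sketch of that seed-and-refine idea (again assuming GCC's __float128): compute a cheap double-precision answer first, then polish it at full width:

#include <math.h>

/* Quad square root seeded from a double result. Each Newton step
   roughly doubles the correct bits: 53 -> ~106 -> 113 (full quad). */
static __float128 sqrt_seeded(__float128 a)
{
    __float128 x = sqrt((double)a);   /* cheap 53-bit seed */
    x = 0.5q * (x + a / x);           /* first refinement step */
    x = 0.5q * (x + a / x);           /* second step reaches quad */
    return x;
}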

I used to like having 80 bit reals available in the good old prehistoric days of MSC v6. Today it requires some effort to use them with MSC :(

Reply to
Martin Brown

In the old days, only VAX/VMS had hardware support for 128-bit floats (not IEEE format though). In the cited GCC list, which of these are directly supported in hardware, versus software emulation?

Most current machines directly support multi-precision integer arithmetic for power-of-2 lengths, but it is done as multiple coordinated machine-code operations, so it's partly in software.
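In C that multi-word pattern looks something like this sketch of a 128-bit add built from two 64-bit adds plus an explicit carry:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Two machine adds plus a carry test: coordinated ops in software */
static u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low word */
    return r;
}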

Of course, when the word size goes up, the various approximation polynomials must improve, which generally means using higher-order polynomials, so the slowdown isn't all due to slower computational hardware.
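A rough illustration (a truncated Taylor series rather than the minimax polynomials a real library would use): the same loop simply needs more terms as the target precision grows.

/* Truncated sine series via a term recurrence; a quad-precision
   target needs noticeably more terms than a double-precision one. */
static double sin_series(double x, int terms)
{
    double x2 = x * x, term = x, sum = x;
    for (int k = 1; k < terms; k++) {
        term *= -x2 / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}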

The only real application of 128-bit floats that I am aware of was the design of interferometers such as LIGO, where one is tracking very small fractions of an optical wavelength over path lengths of kilometers, with at least two spare decimal digits to absorb numerical noise from the ray-trace computations.
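A rough worked example of why (with assumed round numbers): resolving ~1e-18 m of mirror motion over a 4 km arm means a relative resolution of about 1e-18 / 4e3 = 2.5e-22, i.e. roughly 22 significant decimal digits. A 64-bit double carries only ~16 digits, while a 128-bit float carries ~33, which leaves the spare digits mentioned above to absorb accumulated rounding noise.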

Joe Gwinn

Reply to
Joe Gwinn

You might be remembering the GNU Multiple Precision Library:

formatting link

CH

Reply to
Clifford Heath

Thanks for the references everyone.

Hul


Reply to
Hul Tytus

32, 64: native x87 and full SSE floating point support
80: x87 only, but GCC does it fairly well
128: emulated, and slower
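For the 80-bit case, a quick sketch of what GCC gives you on x86, where long double maps to the x87 extended format:

#include <stdio.h>

int main(void)
{
    long double x = 1.0L / 3.0L;    /* computed in 80-bit x87 format */
    printf("1/3 = %.21Lg\n", x);    /* ~19 significant digits */
    /* storage is padded to 12 or 16 bytes depending on the ABI */
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    return 0;
}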

Always work in the hardware supported ones to obtain an approximate answer unless and until you need that extra precision.

Preferably frame it so you refine an approximate starting guess.

32, 64 and 128 bit integer support is sometimes native, at least on some platforms. +, - and * all execute in one nominal CPU cycle* too! (at least for 32, 64 bit - I have never bothered with 128 bit int)
* Sometimes they can appear to take less than one cycle due to out-of-order execution and the opportunities to do work while divides are in progress. Divides are always best avoided or, if that is impossible, their number minimised. Divide is between 10-20x slower than all the other primitive operations, and two divides close together can be *much* slower. Pipeline stalls typically cost around 90 cycles per hit.

Divide remains a PITA and is worth eliminating where possible.
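The classic elimination is hoisting a loop-invariant divide into a single reciprocal multiply, e.g.:

/* One divide instead of n: multiply by the reciprocal in the loop */
static void scale(double *a, int n, double y)
{
    double inv = 1.0 / y;
    for (int i = 0; i < n; i++)
        a[i] *= inv;
}

(The rounding differs slightly from dividing each element, which is usually acceptable.)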

I have an assembler implementation for a special case division that can be faster than the hardware divide for the situation it aims to solve.

Basically 1/(1 - x) = 1 + x + x^2 + x^3 + x^4 + ... = (1 + x)*(1 + x^2)*(1 + x^4)*(1 + x^8)*...

And for smallish x it converges faster than hardware FP divide.
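A minimal C sketch of that expansion (not the assembler version, just the idea): each extra factor doubles the number of series terms captured.

/* 1/(1-x) = (1+x)(1+x^2)(1+x^4)... ; after n factors the product
   equals the series summed through x^(2^n - 1). */
static double recip_one_minus_x(double x)   /* assumes |x| << 1 */
{
    double r = 1.0 + x;
    double p = x * x;                 /* x^2, then x^4, x^8, ... */
    for (int i = 0; i < 5; i++) {     /* 6 factors: terms through x^63 */
        r *= 1.0 + p;
        p *= p;
    }
    return r;
}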

There aren't all that many that need it.

Most planetary dynamics can be done with 80 bit reals with a bit to spare.

That might be a genuine application.

The only times I have played with them have been to investigate the weird constants that play a part in some chaotic equations. I was curious to see how much of the behaviour was due to finite mantissa length and how much was inherent in the mathematics. Doubling the length of the mantissa goes a long way to solving that particular problem. (but it is rather slow)

Reply to
Martin Brown

Power Basic has a native 80-bit float type and a 64-bit integer.

Reply to
jlarkin

Which corresponds to floating point types supported in the early Intel chips.

Reply to
Rick C

Probably a very wasteful decision that cost them dear. The requirement for anything above a 64 bit FP word length is very esoteric.

The most popular machine back in that era for high speed floating point was the CDC 7600 (60-bit word), which powered Manchester University's Jodrell Bank processing and the BMEWS early warning system, amongst other things.

I'm so impressed. NOT

MS C used to have it back in the old v6 days but they rationalised things to only have 64 bit FP support in C/C++ a very long time ago.

Most decent compilers *do* offer 80 bit reals. It is a pity that Mickeysoft don't, because their code optimiser is streets ahead of both Intel's and GCC's at handling out-of-order execution parallelism.

Intel C and GCC compilers still support 80 bit floating point.

On the code I have been testing recently Intel generates code that effectively *forces* a pipeline stall more often than not. MSC somehow manages the opposite. Pipeline stalls cost around 90 cycles which is not insignificant in a routine that should take 300 cycles.

Putting two divides close together with the second one dependent on the result of the other is one way to do it. MSC tries much harder to utilise the cycles where the divide hardware isn't ready to answer. (at least it does when you enable every possible speed optimisation)

Sometimes it generates loop unrolled code that is completely wrong too :(

Reply to
Martin Brown

Of course not. The word BASIC triggers too much emotion, facts not required.

We had a couple of cases where we wanted to do a signal processing routine that processed an array of ADC samples, on x86. My official programmer guys did it in gcc and I did it in Power Basic. Mine used subscripts in the most obvious loop and they used pointers. Mine ran 4x as fast. After a day of mucking with code and compiler switches, many combinations, they got within about 40%.

Python looks a lot like Basic to me. Some of the goofier features were added so that it couldn't be directly accused of being Basic syntax, which would have been toxic.

PB has wonderful string functions. It has TCP OPEN and such, and can send/receive emails if you really want to. The cool stuff is native, not libraries; make an EXE file in half a second and you're done.

We wrote MAX, our material control/BOM program, in PowerBasic. It's great. We couldn't find any commercial packages that actually understand electronics manufacturing.

Reply to
jlarkin

IBM's decision to go Microsoft+Intel was tragic.

Reply to
jlarkin

Agreed.

The current crop of x86 processors are fantastic engineering to get high throughputs despite the terrible ISA and often massively inefficient code written for them. They are fine demonstrations that you really /can/ polish a turd.

Reply to
David Brown

I suspect he simply means it is not a hard or exciting feature if you are making a language designed purely to run on a single target processor family and OS. It is not impressive that Power BASIC has support for 80-bit floats. It /would/ be impressive if it supported 128-bit floats, because that would require a lot of development effort.

BASIC is okay for small and simple programs. It is not uncommon to need something quick and easy - you want a language and tool that has minimal developer time overhead, is interpreted (to minimise the edit/run cycle time), has string handling, automatic memory management, garbage collection (or even just keeping all memory until the program ends), and is easy to understand and write even for people who don't do much coding. I personally don't see BASIC as a bad choice for that - though I do think Python is usually a better choice these days.

On the other hand, trying to write /big/ systems in BASIC is an exercise in madness. Pick the right tool for the job.

Reply to
David Brown

Isn't Power BASIC compiled?

Reply to
Lasse Langwadt Christensen

That leaves division, which was and still is "slow" compared to +/-/*.

Reply to
Lasse Langwadt Christensen

Last time I compared them directly was 2006ish, using almost all single-precision C++ code. Back then, Intel was streets ahead of Microsoft for vectorization and loop unrolling, and gcc was a distant, distant third.

What sorts of code are you comparing?

Yikes.

Cheers

Phil Hobbs

Reply to
Phil Hobbs

Only if you have the kind of funding that relies on a brutal monopoly of a major industry that has almost every other industry by the short-and-curlies.

Think how much amazing technology could have been produced by the same level of investment in an open market. The waste of talent is nothing short of tragic, on a global scale.

CH

Reply to
Clifford Heath

Yes, exactly.

(Somewhere in the development of the m68k processor family - I forget exactly where, but I /think/ it was the 68030 - the cpu designers realised that they could do a division in software faster than using the hardware division block they had. Removing the hardware division instruction saved significant die space.)

Reply to
David Brown

I think it will depend a lot on the application. Vectorising real*4, I expect the Intel compiler might well still have the edge. I'm mostly interested in real*8 and real*10 function computations.

The elderly compiler that impressed me the most was Apple's Clang for the M1 - that CPU really motors, and at very low power cf. Intel. The differences from classic standard C were a PITA, but once over that it was worth it. It was already fast enough on its safer default settings that I didn't notice /Ofast wasn't set!

My various Padé approximations didn't max out its ability to hide operations inside the latency time of the divide. They all took the same time irrespective of the polynomial orders, up to (7,6) (as high as I go). On Intel CPUs they get slower once the polynomial order goes above 4.
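For concreteness, a sketch of the shape of such a routine (the well-known [2/2] Padé approximant for exp(x), much lower order than the (7,6) ones above): two short Horner chains feed a single divide, so the polynomial work can hide inside the divide latency.

/* [2/2] Pade approximant: exp(x) ~ (12 + 6x + x^2)/(12 - 6x + x^2),
   accurate near x = 0; one divide at the end */
static double exp_pade22(double x)
{
    double num = 12.0 + x * (6.0 + x);
    double den = 12.0 - x * (6.0 - x);
    return num / den;
}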

Solving cubics, polynomials, a few transcendental functions and higher order correctors in the series that begins Newton-Raphson, Halley, D4 ...

Mostly they are snippets that seldom exceed 20 lines. They form a set of functional Lego bricks that solve a particular problem.

On modern optimising compilers NR and Halley take essentially the same elapsed time, for quadratic and cubic convergence respectively; D4 is ~15% slower for quartic convergence and D5 ~25% slower. After that they slow down. In one special case Halley is *faster* than NR!

Done the traditional way, D5 would be 2x slower, so it is quite a game changer in terms of which corrector you can use (or, conversely, how crude an initial guess you need to get full machine precision).
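A sketch of why the higher-order correctors can come almost for free, using f(x) = x^2 - a (square root) as a toy example: the Halley step costs a few more multiplies than Newton's, but both contain exactly one divide, and on an out-of-order core the extra multiplies hide behind it.

/* Newton step for sqrt(a): quadratic convergence */
static double newton_sqrt(double x, double a)
{
    return 0.5 * (x + a / x);
}

/* Halley step for sqrt(a): cubic convergence, still one divide */
static double halley_sqrt(double x, double a)
{
    double x2 = x * x;
    return x * (x2 + 3.0 * a) / (3.0 * x2 + a);
}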

Indeed. It was, to be fair, quite a pathological piece of code (but it still shouldn't happen).

The other thing I have had problems with is modern global optimisers spotting benchmark loops of a simple form and coding the algebraic answer! (their strength reduction tricks have become quite cunning)

x = 0; dx = 1e-6; for (i = 0; i < 100000; i++) x += dx;

They also have to return a result that might be printed out later so that there is a potential side effect.
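One way to keep the loop honest is to make the increment opaque and the result observable, e.g.:

volatile double dx_opaque = 1e-6;     /* optimiser can't see the value */

double bench(void)
{
    double x = 0.0, dx = dx_opaque;   /* load once before the loop */
    for (int i = 0; i < 100000; i++)
        x += dx;
    return x;                         /* used result = potential side effect */
}
/* Under strict IEEE semantics the repeated adds can't legally be
   folded into one multiply; -ffast-math may still do it. */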

Reply to
Martin Brown
