algorithm that can distinguish between a sequence of PI digits and a sequence of pseudorandom digits of equal length

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 12:55 AM

Hi,

I did more analysis, and added other constants besides pi.

There is still some unexplained divergence in the graphs for the constants compressions compared to each other between compressed sequence lengths 2000 to 300,000, so I will try to do more averaging to get rid of it, and compress longer sequences to see if the pattern of convergence of all the constants shown in the below graphs continues to converge at compressed sequence lengths greater than 1million.

Also the left side of the graphs are wildly divergent because there are only 20 samples averaged per constant's given compressed sequence length, ie pi compressed sequence length of 80, averaged 20 times.

However the main divergence in the graph occurs in the middle of the graph between compressed sequence lengths 20,000 to 80,000. These are also averaged 20 times, so for 20,000 that is 20,000 * 20 digits = 400,000 digits per constant, and there is still a wide divergence in that part of the graph that I can't explain.

Here is a graph for compressed E sequences compared to the other constants compressed sequences:

formatting link

Here is a graph for compressed pi sequences compared to the other constants compressed sequences:

formatting link

Here is a spreadsheet of all the averaged constants compression data:

formatting link

No idea how to explain the divergence, maybe it is something in the compression algorithm that causes it but I don't see how, I think more averaged data is needed still.

cheers, Jamie

- D
- Dave Platt
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 1:35 AM

I kinda like starting with Euler's identity and solving for pi.

- C
- Chris
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 6:00 AM

Try entering "convert pi to base e" in

formatting link

Or "convert pi to base pi/e" and even "convert pi to base pi/sqrt(2)"

Some interesting results ;-)

--
Cheers, 
Chris.

- M
- Martin Brown
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 9:19 AM

From the position of the feature I would hazard a guess that there is a

16kbyte to 64k search buffer in the compression algorithm being used.

--
Regards, 
Martin Brown

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 11:10 AM

Hi,

I did a bigger test, and plotted the results, there still seems to be a bias in regards to the pseudorandom number sequence used:

formatting link

For this test I wanted to eliminate some possible errors that could potentially be causing the unexpected bias, I can't really think of anything left that could be causing the bias, except either as you say something with the compression algorithm or an actual bias in the sequences. For the compression algorithm I can't see how it would be causing a bias more for the pseudorandom sequence than the other ones I tested though.

This test was done with sequences:

Pseudorandom Pi Sqrt2 E

All four sequences are stored in text files and have minimum 50million base 10 digits.

The script does this for each of the sequences:

ie for Pi:

compress 15 sequence lengths of this number of input digits:

10 100 1000 10000 11000 12000 13000 14000 15000 16000 17000 18000 20000 30000 40000

Grab the digits from these locations in the sequence, 0, 50000, 100000,

150000, 200000, .... , 4990000, 49950000.

So every 50000 digits in the sequence, up to just under 50 million digits, there are 15 different length sequences compressed for Pi.

That works out to 1000 samples of compressed data for each of the 15 sequence lengths. So for Pi I create 15,000 compressed sequences total, and then average them to end up with 15 average compressions for each of the 15 sequence lengths.

So for Pi, I end up with this:

(sequence length) (Pi compression average)

10 1.292 100 24.858 1000 91.066 10000 102.124 11000 104.462 12000 107.126 13000 109.938 14000 112.987 15000 116.292 16000 119.763 17000 123.41 18000 127.347 20000 135.91 30000 192.225 40000 271.081

And for the pseudorandom sequence I end up with this:

(sequence length) (pseudorandom compression average)

10 1.326 100 24.693 1000 91.051 10000 102.328 11000 104.706 12000 107.337 13000 110.172 14000 113.221 15000 116.635 16000 120.181 17000 123.944 18000 127.942 20000 136.687 30000 193.787 40000 273.021

The percent difference between Pi and the pseudorandom sequence for each averaged compressed sequence length (ie pseudorandom/pi)

(sequence length) (pseudorandom compression average/pi compression average)

10 1.026315789 100 0.993362298 1000 0.999835284 10000 1.001997572 11000 1.002335778 12000 1.001969643 13000 1.002128472 14000 1.002071035 15000 1.002949472 16000 1.003490227 17000 1.00432704 18000 1.004672273 20000 1.005717019 30000 1.008125894 40000 1.007156533

So there is a 0.716% bias at 40000 sequence length between the compressed Pi and pseudorandom sequence, and also the 40000 sequence is actually an average of one thousand 40,000 digit sequences equally distributed in the first 50million digits of the sequences. So a

0.716% bias from a total of 40million (40000*1000) digits sampled seems large, not to mention that the bias is larger than some of the earlier shorter averaged sequences as shown in the graph.

I'd like to test with larger than 40000 sequence lengths with this new high average sampling to see what happens but I think this is an interesting result for now too!

cheers, Jamie

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 11:33 AM

Here is the link to the spreadsheet showing the percent difference graph:

formatting link

It is interesting to compare the percent differences in compression for e to pi, and pi to pseudorandom. The percent difference in compression is a lot lower for e to pi, than e to pseudorandom or pi to pseudorandom.

ie from the spreadsheet:

(compressed sequence length) (e/pi)

10 1.006965944 100 1.001810282 1000 1.000065886 10000 1.000332929 11000 1.00026804 12000 0.99929989 13000 0.999818079 14000 0.999159195 15000 0.997884635 16000 0.998547131 17000 0.998995219 18000 0.99935609 20000 1.001155176 30000 1.001695929 40000 1.00065663

(compressed sequence length) (e/pseudorandom)

10 0.981146305 100 1.008504434 1000 1.00023064 10000 0.998338676 11000 0.997937081 12000 0.997335495 13000 0.997694514 14000 0.997094179 15000 0.994950058 16000 0.995074097 17000 0.994691151 18000 0.99470854 20000 0.995464089 30000 0.993621863 40000 0.993546284

The difference is compression is pretty evident, the pseudorandom sequence is compressing slightly more than e or pi do, which means that there is a slightly larger amount of periodicity in the e and pi sequence compared to the pseudorandom sequence I think, since my compression algorithm works only by finding periodicity, first order, second order, up to n-th order periodicity.

cheers, Jamie

- M
- Martin Brown
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 11:53 AM

OK. You made me curious enough to find allegedly 100k digits of pi and feed them to my byte wise entropy calculator. It turns out that the web page titled 100k digits of Pi actually has 200k digits on it!

formatting link

Anyway assuming that they really are the digits of pi there is an obvious curiosity in that there is a 1% bias in favour of "1" and a half % bias against 2 & 9 over this relatively short range whilst

5,6,3,7,0,8 & 4 are all suspiciously close to their ideal value!

Filename : pi100k.txt File size = 200002 Entropy = 2.303 ( max. 5.545 ) States used = 10.00 ( max. 256 )

Zero frequency : 0-47 58-255

Most frequent bytes: 49 31 "1" 20278 53 35 "5" 20054 54 36 "6" 20054 51 33 "3" 20052 55 37 "7" 20050 48 30 "0" 19998 56 38 "8" 19956 52 34 "4" 19940 50 32 "2" 19816 57 39 "9" 19804

I doubt if it is significant in number theory but it would be enough to skew compression in favour of higher compressibility. Someone else has already done this and the bias vanishes for longer sequences:

formatting link

Here 1 is least frequent in the first 10M digits.

--
Regards, 
Martin Brown

- J
- Jan Panteltje
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 12:13 PM

On a sunny day (Fri, 30 Oct 2015 11:53:02 +0000) it happened Martin Brown wrote in :

Pi is usually compressed to the symbol pi, so 2 x 7 bits ASCII, or 2 x 8 bits if ye must.

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Fri, Oct 30, 2015 8:16 PM

Hi,

I used a total of 40million digits of pi (and e, and a pseudorandom sequence) and still notice the bias or apparently an increasing bias in compression.

Using 1000 unique averaged compressed sequences each of 40,000 length, distributed within the first 50million digits of pi, e and pseudorandom sequences, the compression ratio differences are on an increasing trend from shorter averaged sequences ie:

e/pi compression ratio difference is 1.0006566 (0.06%) showing that e and pi have an approximate than 0.06% difference in their compression, while pseudorandom/e compression ratio difference is

1.0064956 (0.6%) showing that e and pseudorandom have an approximate 0.6% difference in their compression.

Both pi and e compress about 0.6% more than the pseudorandom sequence, even when compressing 40million digits! 40,000 digits compressed 1000 times in this case.

I should just compress 100million digits of pi, e and the pseudorandom sequence to see what the compression ratio differences are, however the compression algorithm REALLY slows down with increased sequence size :D The biggest single sequence I compressed so far is 10million digits for pi, I did it twice and here are the results:

compression1 pi sequence start digit 700000: 1866935 compressed lines compression2 pi sequence start digit 1000000: 1866597 compressed lines

Since it is a 10million digit compressed sequence there is a significant overlap since the start digits are only 300000 digits apart in the pi sequence, so I will compress another starting at pi sequence digit 20million to get two totally separate samples to average. I removed all input compressed sequence overlap in other recent tests as well.

I will compress e and pseudorandom at 10million digits and see if the pseudorandom sequence maintains a distinctively lower compression than pi and e. It took 10hours to do a single 10million digit compression before, but I think it should be slightly faster now that I switched to the 64bit compressor, was previously accidentally using the BigInteger arbitrary precision one I made for compressing very large numbers, but the algorithm itself scales very poorly with increasing sequence length.

Also you were mentioning it could be a sample buffer in the algorithm related to the compression divergence, but the algorithm has no buffer, it takes into account the whole sequence length for compressing each digit, which is also why it is very slow with longer sequences.

cheers, Jamie

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Sun, Nov 1, 2015 8:34 PM

Hi,

The test above on digits misses out on local sequence differences over the 10M digit range, as it only considers the digits from the start of the sequence at shorter sequence lengths.

There is a detectable local bias in the digits of pi, at least for the first 60million digits, ie for 80,000 digit sequential chunks of digits taken from the 60million sequence (750 sub-sequences), these sub sequences all have slightly higher periodicity than the corresponding digits from my 60million pseudorandom sequence generated in C#.

I also did a test to shift the initial compression offset in 10,000 digit increments for these 80,000 digit blocks, and it looks pretty clear than ANY block of 80,000 digits within the first 60million digits of pi will compress slightly more than any 80,000 digit block within the first 60million digits of my pseudorandom sequence.

I had to do A LOT of sequence compressions to get this result to be able to be confident it is in fact true. I compressed 80,000 digit sequences of pi 6000 different ways (8 offsets * 750 averages per

80,000 digits, within the first 60million digits), and same for the pseudorandom sequence.

The percent difference between pi and pseudorandom compression does converge at 1,000,000 compressed sequence lengths, but still there is the local order I found (ie within 80,000 digit blocks) above at any point in the first 60million digits of pi.

Here are some of the results:

formatting link

I want to figure out now how to utilize the local excess periodicity found in pi at 80,000 digit scales to improve my algorithm on longer ie 1million or longer sequences, to distinguish between pi and pseudorandom sequences for those and longer sequences. A simple starting point is just to consider the 80,000 digit blocks within the longer sequences I think.

cheers, Jamie

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Sun, Nov 1, 2015 8:38 PM

Here is the link to the digits I am using by the way:

formatting link

(pi to 160,000,000 decimal digits)

cheers, Jamie

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Mon, Nov 2, 2015 2:22 AM

Hi,

I make this guarantee for what I found for these base 10 sequences:

Catalan E Euler Golden Ratio Lemniscate Pi Pseudorandom Sqrt2

given two sequences at once of the above constants of 80,000 consecutive digits taken from the same sequence locations in both constant, ie pi digits 0 to 80,000 and pseudorandom from 0 to 80,000, or pi digits 240,000 to 320,000 and pseudorandom digits from 240,000 to 320,000 (80,000 digits of each constant) my compression algorithm can distinguish these patterns for these given sequences:

pi compression is always more than:

Catalan
E
Euler
Pseudorandom

lemniscate compression is always more than:

pseudorandom

sqrt2 compression is always more than:

pseudorandom

golden ratio compression is always more than:

pseudorandom

E (natural log) compression is always more than:

pseudorandom

So there is something apparently "special" about the pi sequence compared to the other sequences :D

Also if anyone wants to send me matching 80,000 digits from ANY location within the first 60million digits of any of the above sequences to test if my prediction is true, I will compress them and send back which sequence compressed more and which compressed less.

cheers, Jamie

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Mon, Nov 2, 2015 3:30 AM

The above is actually incorrect, :/ there is only about a 53% chance that the prediction that I stated is true at least for pi vs pseudorandom sequences, ie there is an approximate ~53% chance that for any matched 80,000 digit sequences of pi and pseudorandom, that pi will be more compressed.

ie in the test I just did, with 750 pi and pseudorandom compressed sequences of 80,000 digits each, pi was compressed more than the pseudorandom sequence 399 times. This is what gives the overall average extra compression seen in the pi sequence compared to the pseudorandom sequence.

total average compressed output lines comparison for 750 sequences of 80,000 digit lengths:

pi 804.0066 pseudorandom 809.1466

cheers, Jamie

- D
- David Eather
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Mon, Nov 2, 2015 8:10 AM

at 1% difference you would have to test more than 10,000 sequences to have any validity - have you tested that many?

- J
- Jamie M
  
  Contact options for registered users
Vote on answer
posted
8 years ago

Mon, Nov 2, 2015 8:52 PM

Hi,

I've tested 100,000's of sequences already :D

formatting link

There are my most recent results, there is defintely something about pi that causes it to compress a little more with my algorithm than the other sequences tested.

All compressed sequence lengths 20,000 to 2,000,000 have the maximum amount possible averaged from the first 60million digits of pi, ie for compressed sequence lengths of 20,000 I use 3000 individual compressed sequences of 20,000 length in the first

60,000,000 digits of pi and averaged them for each of the constants checked, and still there is a visible pattern showing that pi is a bit more compressed than other sequences.

As I stated in another thread the difference is most likely something to do with this (my first guesses):

the digits of pi have fewer "local" (ie within 1million consecutive digits occuring anywhere up to 60million digits) matched repeating pairs of digits ie 3,4 or 5,6 repeated at least twice.

the digits of pi have fewer "local" (ie within 1million consecutive digits occuring anywhere up to 60million digits) matched repeating pairs of digits with equal spacing occuring at least three times ie.

3,4 spaced in the sequence:

3,4,x,y,z,3,4,a,b,c,3,4 (matched spacing gap of 3 digits between 3,4)

Or could be something else or related to the above at least for the difference I think I found in pi.

cheers, Jamie