I did more analysis, and added other constants besides pi.
There is still some unexplained divergence between the constants' compression curves for compressed sequence lengths of 2,000 to 300,000, so I will try more averaging to get rid of it, and compress longer sequences to see whether the convergence of all the constants shown in the graphs below continues at compressed sequence lengths greater than 1 million.
Also, the left side of each graph is wildly divergent because only 20 samples are averaged per constant per compressed sequence length, e.g. a pi compressed sequence length of 80, averaged 20 times.
However, the main divergence occurs in the middle of the graph, between compressed sequence lengths of 20,000 and 80,000. These are also averaged 20 times, so for a length of 20,000 that is 20,000 * 20 = 400,000 digits per constant, yet there is still a wide divergence in that part of the graph that I can't explain.
Here is a graph of compressed e sequences compared to the other constants' compressed sequences:
formatting link
Here is a graph of compressed pi sequences compared to the other constants' compressed sequences:
formatting link
Here is a spreadsheet of all the averaged compression data for the constants:
formatting link
I have no idea how to explain the divergence. Maybe something in the compression algorithm causes it, but I don't see how; I think more averaged data is still needed.
I did a bigger test and plotted the results; there still seems to be a bias with respect to the pseudorandom number sequence used:
formatting link
For this test I wanted to eliminate some possible sources of error that could be causing the unexpected bias. I can't really think of anything left that could be causing it, except either, as you say, something in the compression algorithm, or an actual bias in the sequences. As for the compression algorithm, though, I can't see how it would bias the pseudorandom sequence more than the other sequences I tested.
This test was done with sequences:
Pseudorandom Pi Sqrt2 E
All four sequences are stored in text files and have a minimum of 50 million base-10 digits.
The script does the following for each of the sequences; taking Pi as the example:
it compresses sequences of 15 different lengths. For each length, it grabs the digits starting at these locations in the sequence: 0, 50,000, 100,000, 150,000, 200,000, ..., 49,900,000, 49,950,000.
So, starting every 50,000 digits, up to just under 50 million digits, sequences of 15 different lengths are compressed for Pi.
That works out to 1,000 samples of compressed data for each of the 15 sequence lengths. So for Pi I create 15,000 compressed sequences in total, and then average them to end up with one average compression for each of the 15 sequence lengths.
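The sampling-and-averaging scheme can be sketched roughly like this (an illustration only: zlib stands in for my periodicity compressor, and the names are made up, not the real script):

```python
import zlib

def compress_size(digits: str) -> int:
    # Stand-in for the periodicity-based compressor; zlib is used
    # here only so the sketch is runnable.
    return len(zlib.compress(digits.encode()))

def average_compressions(digits, lengths, start_step=50_000):
    # For each target length, compress a sample starting every
    # `start_step` digits and average the compressed sizes.
    averages = {}
    for length in lengths:
        sizes = [compress_size(digits[s:s + length])
                 for s in range(0, len(digits) - length + 1, start_step)]
        averages[length] = sum(sizes) / len(sizes)
    return averages
```

With a 50-million-digit file, 15 lengths, and a 50,000-digit step, this yields the 15,000 compressions and 15 averages described above.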
So there is a 0.716% bias at sequence length 40,000 between the compressed Pi and pseudorandom sequences, and the 40,000-length figure is actually an average over one thousand 40,000-digit sequences equally distributed within the first 50 million digits of each sequence. A 0.716% bias from a total of 40 million (40,000 * 1,000) digits sampled seems large, not to mention that the bias is larger than for some of the earlier, shorter averaged sequences, as shown in the graph.
I'd like to test sequence lengths larger than 40,000 with this new high-sample averaging to see what happens, but I think this is an interesting result for now too!
Here is the link to the spreadsheet showing the percent difference graph:
formatting link
It is interesting to compare the percent differences in compression for e vs. pi and for pi vs. pseudorandom: the percent difference in compression is a lot lower for e vs. pi than for e vs. pseudorandom or pi vs. pseudorandom.
The difference in compression is pretty evident: the pseudorandom sequence compresses slightly less than e or pi do, which I think means there is a slightly larger amount of periodicity in the e and pi sequences compared to the pseudorandom sequence, since my compression algorithm works only by finding periodicity: first-order, second-order, up to n-th-order periodicity.
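As a minimal illustration of what first-order periodicity detection might look like (the actual n-th-order algorithm isn't shown in this thread, so this is only an assumed reading):

```python
def smallest_period(s: str) -> int:
    # Smallest p such that s[i] == s[i - p] for all i >= p,
    # i.e. s is a repetition of its first p characters (possibly
    # truncated). Returns len(s) if no shorter period exists.
    for p in range(1, len(s)):
        if all(s[i] == s[i - p] for i in range(p, len(s))):
            return p
    return len(s)
```

A sequence with many short-period stretches would compress better under any scheme that encodes "repeat the last p digits" instead of the digits themselves.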
OK. You made me curious enough to find allegedly 100k digits of pi and feed them to my byte-wise entropy calculator. It turns out that the web page titled "100k digits of Pi" actually has 200k digits on it!
formatting link
Anyway, assuming they really are the digits of pi, there is an obvious curiosity: over this relatively short range there is a 1% bias in favour of "1" and a half-percent bias against 2 and 9, whilst 5, 6, 3, 7, 0, 8 and 4 are all suspiciously close to their ideal values!
I doubt it is significant in number theory, but it would be enough to skew compression in favour of higher compressibility. Someone else has already done this analysis, and the bias vanishes for longer sequences:
I used a total of 40 million digits of pi (and of e, and of a pseudorandom sequence) and still notice the bias, or apparently an increasing bias, in compression.
Using 1,000 unique averaged compressed sequences, each of length 40,000, distributed within the first 50 million digits of the pi, e, and pseudorandom sequences, the compression ratio differences continue the increasing trend from the shorter averaged sequences, i.e.:
the e/pi compression ratio difference is 1.0006566 (0.06%), showing that e and pi have an approximately 0.06% difference in their compression, while the pseudorandom/e compression ratio difference is 1.0064956 (0.6%), showing that e and pseudorandom have an approximately 0.6% difference in their compression.
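The arithmetic behind those figures is just "ratio minus one, times one hundred"; a one-line helper makes the conversion explicit (the ratio values below are the ones quoted above):

```python
def percent_difference(ratio_a: float, ratio_b: float = 1.0) -> float:
    # Percent difference implied by the compression ratio ratio_a/ratio_b.
    return (ratio_a / ratio_b - 1.0) * 100.0
```

So 1.0006566 maps to about 0.066% and 1.0064956 to about 0.65%, which round to the 0.06% and 0.6% quoted.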
Both pi and e compress about 0.6% more than the pseudorandom sequence, even when compressing 40 million digits (40,000 digits compressed 1,000 times in this case)!
I should just compress 100 million digits of pi, e, and the pseudorandom sequence to see what the compression ratio differences are; however, the compression algorithm REALLY slows down with increased sequence size :D The biggest single sequence I have compressed so far is 10 million digits of pi. I did it twice, and here are the results:
compression 1, pi sequence start digit 700,000: 1,866,935 compressed lines
compression 2, pi sequence start digit 1,000,000: 1,866,597 compressed lines
Since these are 10-million-digit compressed sequences and the start digits are only 300,000 digits apart in the pi sequence, there is significant overlap, so I will compress another sequence starting at pi digit 20 million to get two totally separate samples to average. I removed all overlap between input compressed sequences in other recent tests as well.
I will compress e and the pseudorandom sequence at 10 million digits and see whether the pseudorandom sequence maintains a distinctly lower compression than pi and e. It took 10 hours to do a single 10-million-digit compression before, but I think it should be slightly faster now that I have switched to the 64-bit compressor; I was previously, by accident, using the BigInteger arbitrary-precision one I made for compressing very large numbers. The algorithm itself still scales very poorly with increasing sequence length, though.
Also, you mentioned that a sample buffer in the algorithm could be related to the compression divergence, but the algorithm has no buffer: it takes the whole sequence length into account when compressing each digit, which is also why it is very slow on longer sequences.
The test above misses local sequence differences over the 10M-digit range, as at shorter sequence lengths it only considers the digits from the start of the sequence.
There is a detectable local bias in the digits of pi, at least for the first 60 million digits: for sequential 80,000-digit chunks taken from the 60-million-digit sequence (750 sub-sequences), these sub-sequences all have slightly higher periodicity than the corresponding digits from my 60-million-digit pseudorandom sequence generated in C#.
I also did a test that shifts the initial compression offset in 10,000-digit increments for these 80,000-digit blocks, and it looks pretty clear that ANY block of 80,000 digits within the first 60 million digits of pi will compress slightly more than any 80,000-digit block within the first 60 million digits of my pseudorandom sequence.
I had to do A LOT of sequence compressions to be confident that this result is in fact true. I compressed 80,000-digit sequences of pi 6,000 different ways (8 offsets * 750 averages per 80,000 digits, within the first 60 million digits), and did the same for the pseudorandom sequence.
The percent difference between pi and pseudorandom compression does converge at compressed sequence lengths of 1,000,000, but the local order I found above (i.e. within 80,000-digit blocks) still holds at any point in the first 60 million digits of pi.
Here are some of the results:
formatting link
I now want to figure out how to utilize the local excess periodicity found in pi at the 80,000-digit scale to improve my algorithm on longer sequences, i.e. 1 million digits or more, so it can distinguish between pi and pseudorandom sequences at those lengths too. A simple starting point, I think, is just to consider the 80,000-digit blocks within the longer sequences.
I make this guarantee for what I found for these base-10 sequences:
Catalan E Euler Golden Ratio Lemniscate Pi Pseudorandom Sqrt2
Given two of the above sequences at once, each of 80,000 consecutive digits taken from the same locations in both constants (e.g. pi digits 0 to 80,000 and pseudorandom digits 0 to 80,000, or pi digits 240,000 to 320,000 and pseudorandom digits 240,000 to 320,000), my compression algorithm can distinguish these patterns for the given sequences:
pi compression is always more than:
Catalan
E
Euler
Pseudorandom
lemniscate compression is always more than:
pseudorandom
sqrt2 compression is always more than:
pseudorandom
golden ratio compression is always more than:
pseudorandom
E (the base of the natural logarithm) compression is always more than:
pseudorandom
So there is something apparently "special" about the pi sequence compared to the other sequences :D
Also, if anyone wants to send me matching 80,000-digit stretches from ANY location within the first 60 million digits of any of the above sequences to test whether my prediction is true, I will compress them and send back which sequence compressed more and which compressed less.
The above is actually incorrect :/ There is only about a 53% chance that the stated prediction holds, at least for pi vs. pseudorandom sequences; i.e. for any matched 80,000-digit sequences of pi and pseudorandom, there is an approximately 53% chance that pi will compress more.
i.e. in the test I just did, with 750 compressed sequences of 80,000 digits each for both pi and pseudorandom, pi compressed more than the pseudorandom sequence 399 times out of 750. This is what gives the overall average extra compression seen in the pi sequence compared to the pseudorandom sequence.
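As a rough significance check (my illustration, not part of the test above): under a fair 50/50 null the expected win count is 375 with standard deviation sqrt(750 * 0.25), about 13.7, so 399 wins is roughly 1.75 standard deviations above chance, suggestive but not decisive on its own.

```python
import math

def binomial_z(successes: int, trials: int, p: float = 0.5) -> float:
    # Z-score of `successes` under a Binomial(trials, p) null,
    # using the normal approximation.
    mean = trials * p
    sd = math.sqrt(trials * p * (1 - p))
    return (successes - mean) / sd
```

Running binomial_z(399, 750) gives a z-score a little above 1.7, which is why many more block comparisons would be needed to call the effect confirmed.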
Total average compressed output lines comparison for 750 sequences of 80,000-digit length:
These are my most recent results; there is definitely something about pi that causes it to compress a little more with my algorithm than the other sequences tested.
All compressed sequence lengths from 20,000 to 2,000,000 use the maximum possible amount of averaging over the first 60 million digits, e.g. for a compressed sequence length of 20,000 I use 3,000 individual compressed sequences of length 20,000 within the first 60,000,000 digits and average them, for each of the constants checked, and still there is a visible pattern showing that pi is a bit more compressed than the other sequences.
As I stated in another thread, the difference is most likely something to do with one of these (my first guesses):
the digits of pi have more "local" (i.e. within 1 million consecutive digits, occurring anywhere up to 60 million digits in) matched repeating pairs of digits, e.g. 3,4 or 5,6 repeated at least twice;
the digits of pi have more "local" (in the same sense) matched repeating pairs of digits with equal spacing occurring at least three times, e.g.
3,4 spaced in the sequence:
3,4,x,y,z,3,4,a,b,c,3,4 (matched spacing gap of 3 digits between each 3,4 pair)
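The second guess, a digit pair recurring at a fixed gap at least three times, can be sketched as a direct scan (an illustrative check only, not my compression algorithm):

```python
def has_equally_spaced_pair(digits: str, pair: str, gap: int,
                            repeats: int = 3) -> bool:
    # True if `pair` occurs `repeats` times in a row with exactly
    # `gap` digits between the end of one occurrence and the start
    # of the next, e.g. "34xyz34abc34" for pair "34", gap 3.
    step = len(pair) + gap
    for start in range(len(digits)):
        if all(digits[start + k * step : start + k * step + len(pair)] == pair
               for k in range(repeats)):
            return True
    return False
```

Counting how often such patterns fire in matched pi and pseudorandom blocks would be one way to test this guess directly.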
Or it could be something else, or something related to the above, at least for the difference I think I found in pi.