Now, back to the original topic at hand, before Stuart and Ales so rudely created a gschem and gEDA turf war over schematics not being part of PCB's charter -- I was unaware that gEDA had the right to dictate design to the PCB project which existed long before gEDA. I'll let DJ, Harry, Stuart and Ales work their turf war out ... leave me out of it please.
I strongly considered adding schematics into PCB some four years back, while Harry was still maintaining it himself out of JHU. When Harry dropped off the face of the earth for a while, I even considered starting a PCB project at sf.net, until I saw one day that Harry had created one. Harry believed strongly that schematics do not belong in PCB, and as chief maintainer of the sf.net project, that was his choice. Before starting FpgaC on sf.net, I was strongly tempted to pick up and continue to support an Xaw version, and add schematics, as an sf.net project called PCB2 ... fork the project, since Harry and I have very different goals and expectations about UIs and the types of designs PCB should support/produce. I asked very clearly on the PCB user forum whether Xaw was dead, trying to get a clear idea if it would remain a crippled Gtk version ... and got no answer. I was actually surprised to find DJ had done a Lesstif variation, and had deviated strongly (forked) from the old/Gtk UI.
In the end I decided I would be more useful digging TMCC out of the grave and bringing it forward a decade to be useful with today's FPGA products.
TMCC/FpgaC suffers badly from the same working set problems I posed for PCB. Very small changes in a project's code can shift FpgaC compile times from a few minutes to hours ... and in one case from 45 minutes to over a day and a half, simply by exceeding the working set size of the L2 cache. Interestingly enough, the same C code does the same thing to GCC at a slightly different boundary point.
Student projects, and other toy projects, frequently contain simple algorithms that are fine inside a typical processor's L2 cache these days ... but when the data set grows just slightly, they fail horribly, performance-wise. In this case, linearly searching a linked list works fine up to about 90-95% of the L2 cache size. When you exceed that threshold, performance drops and run times increase roughly 10X or more because of the nature of LRU or pseudo-LRU cache replacement policies.
Consider, for example, a small cache of 4 "bins" of marbles taken from a bowl of 300 marbles. If we first reference a certain red marble, it's taken from the bowl and placed in a cache bin after searching the 300 marbles for it. We keep using and replacing the red marble, avoiding the search in the bowl. Later we also use a green, blue, and yellow marble, which take the three remaining bins in the cache. Because of the nature of the task, we always use red, green, blue, and yellow in that order, always taking from the cache and replacing in the cache.
When our working set expands to five marbles, we have a cache failure, which goes like this. We access the red, green, blue, and yellow marbles in order from the cache, then we need a white marble. The red marble is least recently used, so it's removed from the cache and replaced with the white marble. We then repeat our cycle, next needing the red marble, which is no longer in the cache, so we must fetch it from the bowl and, due to the LRU algorithm, replace the green marble with the red marble. However, next we need the green marble, which forces the blue out of the cache. Next we need the blue marble, forcing the yellow out of the cache. Next we need the yellow, forcing the white out of the cache ... and so on, with every cache access missing, requiring a lengthy access and search of the bowl.
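The marble walkthrough above can be simulated directly. This is a minimal sketch (the `LRUCache` class and marble names are illustrative, not anything from PCB or FpgaC): with four bins and four marbles, every pass after the first hits the cache; add a fifth marble and every single access misses.

```python
from collections import OrderedDict

class LRUCache:
    """A tiny LRU cache that counts hits and misses (fetches from the 'bowl')."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.bins = OrderedDict()   # key -> marble, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, marble):
        if marble in self.bins:
            self.hits += 1
            self.bins.move_to_end(marble)      # mark as most recently used
        else:
            self.misses += 1                   # lengthy search of the bowl
            if len(self.bins) >= self.capacity:
                self.bins.popitem(last=False)  # evict the LRU marble
            self.bins[marble] = True

def run(colors, cycles, capacity=4):
    """Access the colors cyclically and report (hits, misses)."""
    cache = LRUCache(capacity)
    for _ in range(cycles):
        for c in colors:
            cache.get(c)
    return cache.hits, cache.misses

# Four marbles fit: only the 4 cold-start accesses miss.
print(run(["red", "green", "blue", "yellow"], 100))           # (396, 4)
# Five marbles in a 4-bin cache: every one of the 500 accesses misses.
print(run(["red", "green", "blue", "yellow", "white"], 100))  # (0, 500)
```

The step from 100% hits to 100% misses happens at a one-marble increase in working set, which is exactly the sharp discontinuity described above.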
LRU algorithms fail horribly with sequential searches of the cached working set, resulting in a very sharp reduction in performance as the working set is exceeded. In FpgaC's case, the primary data structures are linked lists which are frequently searched completely to verify the lack of duplicate entries when a new item is created. When the working set under these linked lists exceeds the processor's L2 cache size, run times jump by more than a factor of 10 on many machines these days ... the ratio of L2 cache performance to memory performance. Thus, depending on the host processor's L2/L3 cache size, there are critical points for FpgaC where the run time to compile an incrementally small increase in program size jumps dramatically. The fix for this is relatively simple, and will occur soon: replace the linear searches with tree or hash searches, to avoid referencing the entire working set and invoking the LRU replacement failure mode.
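A sketch of that fix, under the stated assumption that the hot path is a duplicate check on insert (the function names here are hypothetical, not FpgaC's actual code): the linear version walks every entry, dragging the whole working set through the cache, while a hash-based membership test probes a single bucket.

```python
def add_symbol_linear(symbols, name):
    """Walk the whole list looking for a duplicate -- this references
    the entire working set on every insert, tripping the LRU failure."""
    for existing in symbols:
        if existing == name:
            return False              # duplicate, not added
    symbols.append(name)
    return True

def add_symbol_hashed(symbols, index, name):
    """Consult a hash index instead -- one bucket probe, so the cache
    footprint per insert stays small regardless of list length."""
    if name in index:                 # O(1) set membership test
        return False
    index.add(name)
    symbols.append(name)
    return True

# Both build the same symbol table; only the memory access pattern differs.
linear, hashed, index = [], [], set()
for n in ["a", "b", "a", "c", "b"]:
    add_symbol_linear(linear, n)
    add_symbol_hashed(hashed, index, n)
print(linear)   # ['a', 'b', 'c']
print(hashed)   # ['a', 'b', 'c']
```

A balanced tree gives the same benefit (O(log n) nodes touched instead of all n) while keeping the entries ordered, which is the other option mentioned above.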
Similar problems exist at several levels in the responsiveness of PCB. Any event which forces a search of the design space will require the working set to hold all the objects that must be searched. When that working set grows past various cache sizes, noticeable increases in latency will result, to the point that they are visible in the UI ... and that point will vary depending on the particular machine being used (L1/L2/L3 cache sizes, and the relative latency of references that fault). Developers who only use and test on fast processors, with large caches and fast native memory, will not notice the extremely jerky performance that someone using a P2 333 Celeron (128K cache) with a 66 MHz processor bus and fast page mode memory will encounter. Slightly larger designs running on 4GHz processors with 512K caches will fail equally noticeably with a design some 4-10 times larger.
Certain operations will fail harder: those which invoke a series of X output calls, as they will also incur the memory overhead of part of the kernel, some shared libraries, the X server, and the display driver in the "working set" for those operations. While a 512K cache is four times larger, the available working set is the cache size minus the X working set, meaning that for small cache sizes there might not be much working set left at all, while doubling the cache size may actually increase the usable working set by 10X or more.
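To make that arithmetic concrete (the 120K figure for the kernel + libraries + X path overhead is an assumption for illustration, not a measurement):

```python
X_OVERHEAD = 120 * 1024               # assumed fixed footprint of kernel + libs + X path

for cache_size in (128 * 1024, 256 * 1024):
    usable = cache_size - X_OVERHEAD  # what remains for PCB's own data
    print(f"{cache_size // 1024}K cache -> {usable // 1024}K usable")

# Doubling the cache from 128K to 256K grows the usable working
# set from 8K to 136K: a 17x increase, not merely 2x.
```

Because the overhead is subtracted off the top, the usable working set is hypersensitive to cache size near the overhead figure, which is why the same design can feel fine on one machine and unusable on another.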
Just taking a guess, PCB + kernel + Xaw + X server probably has a working set around, or slightly larger than, 128K for very basic PCB tasks. Thus we will see cache flushing and reloading between X calls, and locally OK cache performance at both ends. As the L2/L3 cache grows to 512K this is probably less of a problem.
What does become a problem is when the PCB/X working set gets continually flushed by every call, such as when making a long series of little calls to the X server, faulting all the way to the X server and faulting all the way back ... calling performance drops like a rock ... a factor of 10 or more. This happens when the task at either, or both, of the PCB and X server ends requires a slightly larger working set, pushing the total working set for LRU into the worst case failure mode.
I suspect that the Gtk failure modes do this, by including Gtk overhead in the working set, such that every PCB to Gtk to X server call faults round trip, and runs at native memory performance. The reason I believe this is that in my testing, a 550MHz PIII machine with SDRAM is only about twice as slow as a 2GHz P4 machine with DDR SDRAM in this failure mode ... rather than the 4-6X normal computation difference when running at CPU speed from L1 cache, or even L2 cache.
With synchronous calls to Gtk and the X server, it's difficult for PCB to keep its event processing in real time.
I have a several-day class I used to teach regularly that discusses, in detail, designing for hysteresis problems that occur with step discontinuities in the processor load vs. throughput function; it is quite useful for recognizing and designing architectural solutions to problems of this class.