I had a chance to think about this further, and I think the localized variation in path delays actually hurts the async device more than it does the sync device.
The idea behind the sync clocked design is to deal with all the issues that make the logic delay so variable. Instead of trying to match delays path by path, the entire issue is lumped into the clock domain: the clock period just has to be larger than the worst-case delay through the logic, plus an additional margin for clock skew. Minimizing that skew is the whole purpose of the clock tree, so the skew term is typically very small and only needs to be added to the worst-case logic delay to get the minimum clock period.
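As a back-of-the-envelope sketch (all numbers here are hypothetical, just to make the constraint concrete), the sync timing budget is a single sum, worst-case logic delay plus residual skew:

```python
# Sync design: the clock period only has to cover the single worst-case
# logic path plus whatever skew the clock tree failed to remove.
# All values are hypothetical illustration numbers, not real silicon data.
worst_case_logic_delay_ns = 2.00   # slowest path through the logic (assumed)
residual_clock_skew_ns = 0.05      # skew left over after the clock tree (assumed)

min_clock_period_ns = worst_case_logic_delay_ns + residual_clock_skew_ns
print(min_clock_period_ns)
```

The point is that per-path variation never shows up as a separate term: it is already folded into the single worst-case number.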
The async processor must match its clock (handshake) delay to the logic delay and always keep the clock delay slightly longer. Similar components always show statistical variation in timing; even at the 3 sigma point, with millions of transistors on a die you have to account for the few paths that come out fast or slow. The worst case is a fast clock path paired with a slow logic path, so this skewing must be budgeted at both the logic and the clock level. In the end you have to allow for the deviation in both directions, which means the margin is effectively doubled.
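A similar sketch (again with hypothetical numbers, using a 3 sigma design point as in the paragraph above) shows why the async margin effectively doubles: the matched delay line can come out fast at the same time the logic comes out slow, so the required margin is the sum of both deviations:

```python
# Async design: the handshake delay line must still be longer than the
# logic path when the delay line is fast and the logic is slow.
# All values are hypothetical illustration numbers.
nominal_delay_ns = 2.00   # delay line is nominally matched to the logic (assumed)
sigma_ns = 0.05           # per-path standard deviation of the delay (assumed)
k = 3                     # design to the 3 sigma point

fast_delay_line_ns = nominal_delay_ns - k * sigma_ns  # delay line at -3 sigma
slow_logic_ns = nominal_delay_ns + k * sigma_ns       # logic path at +3 sigma

# Margin needed so the delay line still exceeds the logic in the worst case:
# this works out to 2 * k * sigma -- the deviation counted in both directions.
required_margin_ns = slow_logic_ns - fast_delay_line_ns
print(required_margin_ns)
```

Compare this with the sync case, where the same variation only has to be absorbed once, inside the single worst-case delay number.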
So the async design likely has to carry larger margins in the handshake path, and the result is a slower maximum speed than a comparable sync design.