We have a requirement to develop an audit procedure that ensures the integrity of our data. We have a hierarchical, embedded, real-time system, and our data is distributed over multiple cards. Each higher-level card stores the data and executable image of its lower-level cards. All requests to update the data come from external management systems, and they arrive at the highest-level card, where some basic validation is performed.
Management update requests are processed in a trickle-down manner. An update request is routed to the target card, where it is processed. The target card is given an opportunity to validate the request against the run-time conditions that prevail on that card (which only the target card knows about, since run-time data is not maintained at higher levels). If the target card explicitly rejects the request, and the reject response is not lost before it reaches the top level, the update is rejected, and the rejection response is sent to the external management entity that issued the request. If the request times out (due to a link failure or card failure), the higher-level card goes ahead and updates its version of the data anyway.
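To make the semantics concrete, here is a minimal sketch of the top-level card's handling of a request that has passed basic validation. The names (process_update, send_to_target, update_local_data) are illustrative, not from any real system; the point is only that a timeout is treated the same as an accept at the top level.

```python
# Minimal sketch of the trickle-down update semantics described above.
# All names here are hypothetical.

REJECT, ACCEPT, TIMEOUT = "reject", "accept", "timeout"

def process_update(request, send_to_target, update_local_data):
    """Top-level handling of a request that passed basic validation."""
    outcome = send_to_target(request)   # may time out on link/card failure
    if outcome == REJECT:
        return "rejected"               # propagate rejection to the manager
    # On an explicit accept OR on a timeout, the higher-level card still
    # applies the update to its own copy (it is the data master).
    update_local_data(request)
    return "applied"
```

Note that this is exactly where the two copies can diverge: the top level applies the update on timeout, while the target card may never have seen it.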
Our requirement is to audit the management data, since that is the only data that survives a process restart or a card reset. We have looked at two approaches to handle this:
(1) Periodically obtain a checksum of the files at all levels and compare them. In case of a discrepancy, we always defer to the higher-level card. While this seems reasonable, the cards use different processors, and they may not produce identical checksums for the same data file.
(2) When the highest-level card successfully updates the data pertaining to the target card, it logs the management request. As mentioned above, the highest-level card always updates its version of the data, as long as the basic validation succeeds. The target card also logs the update requests that caused it to modify data. Periodically, the top-level card sends the target card a message listing all the updates it (the top-level card) has made on the target card's behalf. The target card compares this list with its own log of update commands. The comparison is keyed on a correlation tag generated by the top-level card.
If the top-level card's list of successful updates made since the last audit cycle contains more entries than the target card's log, the top-level card processed update commands that the target card missed. Moreover, the target card knows exactly which commands it missed, and it executes those commands on itself (albeit in a time-delayed manner). Upon success, both the top-level card and the target card delete these entries from the log files they maintain. In case of a failure, the entries are not deleted, and reconciliation should take place in the next audit cycle.
After reconciliation, if there are any intermediate cards between the top-level card and the target card, the data on those cards is blindly overwritten by the top-level card. This minimizes the risk of the top-level card (which is really the data master) and the intermediate cards getting out of sync.
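On the checksum inconsistency in approach (1): if the discrepancy comes from byte order or record layout differing between processors, it can usually be sidestepped by checksumming a canonical serialization rather than the in-memory representation. A sketch, assuming (hypothetically) that each record can be reduced to a pair of 32-bit integers:

```python
import struct
import zlib

def portable_checksum(records):
    """CRC-32 over records serialized in a fixed (network) byte order,
    so big- and little-endian cards produce the same value.
    `records` is assumed to be an iterable of (id, value) integer pairs."""
    crc = 0
    for rec_id, value in sorted(records):           # fixed iteration order
        packed = struct.pack("!II", rec_id, value)  # '!' = network byte order
        crc = zlib.crc32(packed, crc)               # running CRC over all records
    return crc & 0xFFFFFFFF
```

Any checksum will do here; the essential part is that every card serializes the same logical content into the same byte stream before checksumming.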
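The audit comparison in approach (2) can be sketched as follows. Assume (hypothetically) that each log is a mapping from correlation tag to command; the target replays what it missed and reports which tags are safe to purge from both logs:

```python
def reconcile(top_log, target_log, execute):
    """top_log / target_log: dicts mapping correlation tag -> command.
    `execute` applies a missed command on the target card and returns
    True on success. Returns the set of tags that may now be deleted
    from both logs; failed entries are retained for the next audit cycle."""
    missed = {tag: cmd for tag, cmd in top_log.items() if tag not in target_log}
    purgeable = set(target_log)        # already applied on both sides
    for tag in sorted(missed):         # replay in correlation-tag order
        if execute(missed[tag]):
            purgeable.add(tag)         # success: eligible for deletion
        # on failure the entry stays, so the next cycle retries it
    return purgeable
```

This assumes correlation tags are ordered in issue order, so delayed replay preserves the original command sequence.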
If anyone can think of other approaches that we can consider, kindly post them here.
Thanks, Zahid