To: fyodor_ who wrote (66145 ) 12/24/2001 10:03:56 AM From: pgerassi Respond to of 275872

Dear Fyo:

It is simple for some of us who used to do hardware checkout work. You simply set up a watchdog-like process that polls each resource every few seconds or so; the cost is small. When a CPU dies, one of two things happens. The first is that it stops responding entirely, and the watchdog process easily determines it is dead. All unanswered tasks are handed to other CPUs to be processed anew (many of those tasks were given to two or more different CPUs in the first place, in the name of redundancy and cross-checking). In the second case, the CPU is still responding but processing some things incorrectly: the comparison fails, so all the CPUs are rechecked with more stringent diagnostics and/or the task is rerun on other CPUs. Those that do not match the "correct" result are taken out of service and marked for replacement or repair (the former much more likely than the latter as time goes on). Ditto for storage (disk) or memory.

This process was used by IBM in the Shuttle Orbiter, but its development cost was probably never recovered in its entirety. Besides, the price of such software lets these companies make additional dollars of profit. After all, MS doesn't sell you software for the cost of making it (if it did, Windows XP would cost a single dollar or less).

I have programmed hot swap for I/O boards. They had internal CPUs, memory, I/O, and glue logic, and each was just a node in a (C)PCI bus tree network. So are the CPU card, the disk controller, etc. The memory was in the CPU module and was replaced as a package. In the IBM mainframe clusters, the CPU nodes have enough L1, L2, and L3 cache, in addition to the EAROM, to make a true working cluster node; these modules also contain the network logic. In the one Power design I looked at, the CPU module contained the local SDRAM memory as well (you could tell, because memory was added in chunks per CPU module at ordering time).
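To make the watchdog-plus-cross-check idea concrete, here is a minimal sketch in Python. All names (Worker, watchdog, cross_check) are hypothetical stand-ins, not anyone's actual fault-tolerance code: a dead CPU stops answering heartbeats, while a faulty one answers with wrong results and is caught by comparing redundant copies of the same task.

```python
import time

class Worker:
    """Hypothetical stand-in for one CPU node (names are illustrative)."""
    def __init__(self, name, alive=True, faulty=False):
        self.name = name
        self.alive = alive            # a dead CPU simply stops responding
        self.faulty = faulty          # a faulty CPU responds, but incorrectly
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # A dead worker never refreshes its timestamp.
        if self.alive:
            self.last_heartbeat = time.monotonic()
        return self.alive

    def run(self, task):
        if not self.alive:
            return None
        result = task()
        return result + 1 if self.faulty else result  # simulate a compute error

def watchdog(workers, timeout=5.0):
    """Poll every worker; return the ones that missed the heartbeat window."""
    dead = []
    for w in workers:
        w.heartbeat()
        if time.monotonic() - w.last_heartbeat > timeout:
            dead.append(w)
    return dead

def cross_check(workers, task):
    """Run the same task on several workers; the majority result is taken
    as "correct", and any worker disagreeing with it is flagged."""
    results = {w.name: w.run(task) for w in workers if w.alive}
    votes = {}
    for r in results.values():
        votes[r] = votes.get(r, 0) + 1
    correct = max(votes, key=votes.get)
    suspects = [name for name, r in results.items() if r != correct]
    return correct, suspects
```

Flagged workers would then be rerun under heavier diagnostics and, if they keep failing, pulled for replacement, exactly as described above; the redundant copies of each unanswered task are what let the survivors pick up the work anew.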
Thus, even IBM doesn't hot swap memory without swapping the CPUs linked to it. Note that I did not state which underlying topology is used in the network tying the nodes together in these clusters; it varies among the CPU families (some families use more than one topology), though ccNUMA appears to be the current favorite on the processing side.

The current reason for developing these capabilities is not that they replace clusters, but that they make field repair easy enough for untrained personnel. It costs the customer far less to use an already-paid-for $10-an-hour computer operator than to pay $200 an hour for a vendor technician, plus expenses and travel time (or the equivalent in service-contract costs). That feature is very valuable to both the vendor and the customer, and it is cheap to duplicate once developed (read: high-margin product add-ons).

Pete