Failures

July 22, 2017 | Author: Wendy Lile | Category: Computer Science


Failures
Wendy Lile
POS/355
September 2, 2013
Alicia Pearlman

Failures
Failures can be classified as either hardware or software failures. In a distributed system, there are four main types of general failures: crash, omission, timing, and response failures (Boszormenyi, n.d.). The most common kinds of hardware failure are link failures, site failures, and message loss. Decay of storage media, crashes of storage devices, and machine faults are also common at specific points in a site's lifetime (Silberschatz, Galvin, & Gagne, 2009). These hardware failures usually occur either shortly after manufacturing or after many years of continuous use because, like everything else, computers wear out over time. There are even more kinds of software failures; a few examples are requirement faults, coding faults, and data problems. According to Hamill and Goseva-Popstojanova (2009), a significant percentage of failures also occur in the later part of a machine's life cycle.
The three most common failures in a distributed system are network link failure, host failure, and storage medium failure. Host and storage medium failures can also occur in a centralized computing system, but a network link failure can happen only in a distributed system that is networked (Silberschatz, Galvin, & Gagne, 2009). A disaster recovery plan that can automatically isolate and fix most failures is one part of the answer; beyond that, implementing fault tolerance requires careful planning for the different failure scenarios that could occur.
Network Link Failure
To detect the failure of a network link, a procedure called handshaking is used. In handshaking, two sites send each other messages over the direct route to confirm that each can still be reached. If one site does not receive the expected message, it can send a message asking whether the other site is working. It can then send a message over an alternate route to help determine whether the failure is in the link or in the site itself. Every one of these messages is governed by a time-out scheme. If no reply arrives within the time-out period, the sending site knows that one of the following has occurred: the other site is down, the direct or alternate link is down, or the message was lost. Once the failure has been detected, the system is reconfigured so that it can continue with normal operations. Broadcasting to the other sites in the system that the site, direct link, or alternate link has failed avoids additional complications and failures, because communications can be re-routed. After the repair, the handshaking procedure is repeated to confirm that the link or site is working again, and it can then be reintegrated into the system (Silberschatz, Galvin, & Gagne, 2009).
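The handshaking decision described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure as the textbook specifies it: the names `Diagnosis`, `diagnose`, `send_direct`, and `send_alternate` are hypothetical, and the two `send_*` arguments are assumed to be functions that send an "are you up?" message over one route and return True only if a reply arrives within the time-out. Retrying a few times is one simple way to guard against a single lost message.

```python
from enum import Enum


class Diagnosis(Enum):
    # Hypothetical labels for the outcomes a site can infer from time-outs.
    ALL_OK = "site reachable over the direct link"
    DIRECT_LINK_DOWN = "direct link failed; site reachable by alternate route"
    UNREACHABLE = "site down, both routes down, or messages repeatedly lost"


def diagnose(send_direct, send_alternate, timeout=1.0, retries=3):
    """Classify a suspected failure using handshake messages with time-outs.

    send_direct and send_alternate are assumed callables: each takes a
    time-out in seconds and returns True if a reply arrived in time.
    """

    def reachable(send):
        # Retry so that one lost message is not mistaken for a failure.
        return any(send(timeout) for _ in range(retries))

    if reachable(send_direct):
        return Diagnosis.ALL_OK
    if reachable(send_alternate):
        # The site answered, so the fault must be in the direct link.
        return Diagnosis.DIRECT_LINK_DOWN
    # Time-outs alone cannot separate a dead site from two dead routes.
    return Diagnosis.UNREACHABLE


if __name__ == "__main__":
    # Simulated routes: the direct link never answers, the alternate does.
    print(diagnose(lambda t: False, lambda t: True).value)
```

In a real system the result would then be broadcast to the other sites so that they can re-route traffic, and the same check would be repeated after the repair before the link or site rejoins the system.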
Storage Media Failure
Fortunately, physical storage device failures are easier to recover from when an efficient fault tolerance system is in place. Redundant components, such as additional hard disks, can automatically take over when hardware fails. RAID (redundant array of independent disks) systems are helpful here because they keep the data from a failed component accessible. RAID stores data, together with redundant information, across a group of hard disks so that the failure of a single disk does not cause information to be lost. The commonly described RAID levels run from 0 through 5 and differ in cost, performance, and protection. RAID level 0 only stripes data across disks and provides no redundancy. RAID level 1 duplicates the information on a second disk, a procedure called mirroring. RAID level 5 stripes data across three or more disks and distributes parity information among them, so the contents of any one failed disk can be rebuilt from the remaining disks. Combined with other redundant hardware components, these arrays form the basis of a fault tolerant system (Russel & Crawford, 2013).
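To make the redundancy idea concrete, the following Python sketch shows how mirroring and parity let data survive the loss of one disk. It is a simplified illustration under stated assumptions: disks are modeled as byte strings, the helper names `mirror_write`, `xor_blocks`, and `rebuild_missing` are hypothetical, and a single stripe is used, whereas a real RAID level 5 array rotates the parity block across the disks.

```python
from functools import reduce


def mirror_write(block):
    """RAID 1 style mirroring: keep two identical copies of the block."""
    return [block, block]


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (the parity calculation)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one lost block in a stripe.

    XOR-ing all surviving data blocks with the parity block yields the
    missing block, because every other block cancels itself out.
    """
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    # Mirroring: either copy can serve the data if the other disk fails.
    copy_a, copy_b = mirror_write(b"DATA")
    print(copy_a == copy_b)

    # Parity: three data blocks plus one parity block form a stripe.
    stripe = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(stripe)

    # Simulate losing the second disk and rebuild it from the rest.
    rebuilt = rebuild_missing([stripe[0], stripe[2], parity])
    print(rebuilt == stripe[1])
```

Mirroring costs a full extra copy of the data, while parity costs only one extra block per stripe but needs a calculation to recover a lost block.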

References
Boszormenyi, L. (n.d.). Distributed systems: Fault tolerant systems. Retrieved from http://www-itec.uni-klu.ac.at/~laszlo/courses/DistSys_BP/FaultTolerance.pdf
Hamill, M., & Goseva-Popstojanova, K. (2009, July/August). Common trends in software fault and failure data. CS Digital Library, 35(4), 484-496. doi:http://doi.ieeecomputersociety.org/10.1109/TSE.2009.3
Russel, C., & Crawford, S. (2013). Planning fault tolerance and avoidance. Microsoft TechNet. Retrieved from http://technet.microsoft.com/en-us/library/bb742464.aspx
Silberschatz, A., Galvin, P. B., & Gagne, G. (2009). Operating system concepts: Update (8th
ed.). Hoboken, NJ: Wiley & Sons.
