Created By: Abdul at Dec. 11, 2020
In the next few videos, we will talk about clusters and job queues. Let’s start with clusters. What is a cluster? Why are clusters useful? What should we consider before choosing a cluster? And how can we convert a multiprocessing solution into a clustered one? Let’s try to answer these questions!
A cluster is commonly recognized to be a collection of computers working together to solve a common task. From the outside, it can be viewed as a single larger system. GCP has many good examples of how clusters of hardware infrastructure are used to ensure the reliability and availability of its services. Amazon Web Services (AWS) is commonly used both for engineering production clusters in the cloud and for building on-demand clusters for short-lived tasks like machine learning.
So, what are the things we should consider before moving to a clustered solution?
Before you move to a clustered solution, do make sure that you have:
But why are clusters useful, and what benefits will we get?
Let’s discuss the critical considerations you must keep in mind:
Requires a change in thinking: Moving to a clustered solution is an evolution of the change in thinking required when you move from serial to parallel code. Suddenly you have to consider what happens when you have more than one machine: you have latency between machines, you need to know whether your other machines are working, and you need to keep all the machines running the same version of your software. System administration is probably your biggest challenge.
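The two checks mentioned above (are the other machines working, and do they all run the same software version) can be sketched in a few lines. This is only an illustration: the host names, the heartbeat timestamps, the 30-second timeout, and both function names are assumptions, and a real cluster would collect this data over the network.

```python
from collections import Counter


def find_stale_hosts(last_heartbeat, now, timeout_s=30.0):
    """Return hosts whose last heartbeat is older than timeout_s seconds."""
    return sorted(
        host for host, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_s
    )


def find_version_mismatches(reported_versions):
    """Return hosts whose version differs from the most common one."""
    if not reported_versions:
        return {}
    majority, _count = Counter(reported_versions.values()).most_common(1)[0]
    return {
        host: version
        for host, version in reported_versions.items()
        if version != majority
    }


# Hypothetical data: seconds since some reference point, and version strings.
heartbeats = {"node-a": 100.0, "node-b": 95.0, "node-c": 40.0}
versions = {"node-a": "1.4.2", "node-b": "1.4.2", "node-c": "1.3.9"}

print(find_stale_hosts(heartbeats, now=105.0))  # ['node-c']
print(find_version_mismatches(versions))        # {'node-c': '1.3.9'}
```

Treating the most common version as the "right" one is a simplification; you might prefer to compare against a version you pin explicitly.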
System implementation requirements will change: You normally have to think hard about the algorithms you are implementing and what happens once you have all these additional moving parts that may need to stay in sync. This additional planning can impose a heavy mental tax; it is likely to distract you from your core task, and once a system grows large enough you’ll probably need a dedicated engineer to join your team.
Consider important questions: When designing a clustered solution, you’ll need to remember that each machine’s configuration might be different (each machine will have a different load and different local data). How will you get all the right data onto the machine that’s processing your job? Does the latency involved in moving the job and the data amount to a problem? Do your jobs need to communicate partial results to each other? What happens if a process fails, a machine dies, or some hardware wipes itself while several jobs are running? If you don’t consider these questions, you can introduce failures.
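One common answer to "what happens if a process fails or a machine dies" is to put failed jobs back on the queue and try again a limited number of times. Here is a minimal single-process sketch of that idea. The names `run_with_retries` and `make_flaky_worker` are made up for this example, and the worker only simulates failures; it does not talk to real machines.

```python
from collections import deque


def run_with_retries(jobs, worker, max_attempts=3):
    """Run every job through worker, requeueing failures.

    Returns (results, failed): results maps job -> return value,
    failed lists the jobs that still failed after max_attempts tries.
    """
    queue = deque((job, 1) for job in jobs)
    results, failed = {}, []
    while queue:
        job, attempt = queue.popleft()
        try:
            results[job] = worker(job)
        except Exception:
            if attempt < max_attempts:
                queue.append((job, attempt + 1))  # try again later
            else:
                failed.append(job)  # give up and report it
    return results, failed


def make_flaky_worker(fail_counts):
    """Hypothetical worker: job N raises fail_counts[N] times, then works."""
    remaining = dict(fail_counts)

    def worker(job):
        if remaining.get(job, 0) > 0:
            remaining[job] -= 1
            raise RuntimeError(f"simulated machine failure on job {job}")
        return job * job

    return worker


# Job 2 fails once and then succeeds; job 3 fails more often than we retry.
worker = make_flaky_worker({2: 1, 3: 5})
results, failed = run_with_retries([1, 2, 3], worker, max_attempts=3)
print(results)  # {1: 1, 2: 4}
print(failed)   # [3]
```

Retrying like this is only safe when running a job twice does no harm, so it is worth deciding early whether your jobs have that property.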
Accept the failure: You should also consider that failures can be acceptable. For example, you probably don’t need 99.999% reliability when you’re running a simple web service to serve your content—if on occasion a job fails (e.g., a picture doesn’t get resized quickly enough) and the user is required to reload a page, that’s something that everyone is already used to. It might not be the solution you want to give to the user, but accepting a little bit of failure typically reduces your engineering and management costs by a worthwhile margin. On the flip side, if a high-frequency trading system experiences failures, the cost of bad stock market trades could be considerable!
It can be expensive: Maintaining a fixed infrastructure can become expensive. Machines are relatively cheap to purchase, but they have an awful habit of going wrong: automatic software upgrades can glitch, network cards fail, disks have write errors, and power supplies can give spiky power that disrupts data. The more computers you have, the more time will be lost to dealing with these issues. Sooner or later you’ll want to add a system engineer who can deal with these problems, so extend your budget too. Using a cloud-based cluster can mitigate a lot of these problems (it costs more, but you don’t have to deal with the hardware maintenance), and some cloud providers also offer a spot-priced market for cheap but temporary computing resources.
There are too many examples to quote here, but one stands out: Skype suffered a 24-hour planet-wide failure in 2010. Behind the scenes, Skype is supported by a peer-to-peer network. An overload in one part of the system caused delayed responses from Windows clients; approximately 40% of the live clients crashed, including 25% of the public supernodes.
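To see why more computers mean more time lost to failures, here is a small back-of-the-envelope calculation. It assumes that machines fail independently and that each one has the same chance of failing in a given period; the 1% figure is a made-up number for illustration, not a measurement.

```python
def prob_any_failure(n_machines, p_single):
    """Chance that at least one of n independent machines fails."""
    return 1.0 - (1.0 - p_single) ** n_machines


# Hypothetical: each machine has a 1% chance of a problem in a given week.
for n in (1, 10, 100):
    print(f"{n:>3} machines: {prob_any_failure(n, 0.01):.1%}")
```

With these assumed numbers, the chance of having at least one problem to deal with rises from 1% for a single machine to roughly 63% for a hundred machines.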
With 25% of the routing offline (it came back on, but slowly), the network overall was under great strain. Skype became largely unavailable for 24 hours. The recovery process involved first setting up hundreds of new mega-supernodes configured to deal with the increased traffic, and then following up with thousands more. Over the coming days, the network recovered.
This incident caused a lot of embarrassment for Skype; clearly, it also changed their focus to damage limitation for several days. Customers were forced to look for alternative solutions for voice calls, which was likely a marketing boon for competitors.
I believe you now have a better idea of what a cluster is, why it can be useful, and which things are critical to consider. Feel free to post your questions and thoughts in the comments below. That's enough for this video. In the next video, we will talk about how to design a cluster and explore some clustering solutions specific to Python, so stay tuned. If you liked this video, give it a thumbs up, and be sure to subscribe to my channel and hit the bell icon so you never miss a video in the future.