Hi everyone, this is Abdul. We have learned a lot throughout this journey to make our Python code much more efficient. We have talked about the importance of profiling to pinpoint performance issues in our code, we understand the mechanics behind lists and tuples and how dictionaries and sets can help us, we had a very interesting discussion about vector and matrix computation, we talked about bringing the performance of C into Python, and we even got our hands dirty with concurrency and multiprocessing. What a great journey, full of great discussions! If you haven't watched that playlist yet, believe me, you are missing something great if you are a Python lover. Go ahead and watch it.
Now, from this video onward, we will continue our journey to achieve even more and provide a much faster experience to our customers and users.
So, in a few upcoming videos, we will talk about clusters and job queues. Let's start with clusters: what is a cluster? Why are clusters useful? What should we consider before choosing a cluster? And how can we convert a multiprocessing solution into a clustered solution? Let's try to answer these questions!
So, a cluster is commonly recognized as a collection of computers working together to solve a common task; from the outside, it can be viewed as one larger single system. Google Cloud Platform (GCP) provides a bunch of great examples of how clusters of hardware infrastructure can ensure the reliability and availability of services. Amazon Web Services (AWS) is commonly used both for engineering production clusters in the cloud and for building on-demand clusters for short-lived tasks like machine learning.
So, what are the things we should consider before moving to a clustered solution?
Before you move to a clustered solution, do make sure that you have:
- Profiled your system, so you understand the bottlenecks you have to overcome
- Utilized compiler solutions like Cython or Numba
- Exploited multiple cores on a single machine using the multiprocessing techniques we discussed earlier in this series.
- Exploited techniques for using less RAM.

Keeping your system to one machine will make your life easier. Move to a cluster only if you really need a lot of CPUs or the ability to process data from disks in parallel, or if you have production needs like high resiliency and rapid speed of response.
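Before scaling out, it is worth confirming you have exploited all the cores on one machine, as the checklist above says. Here is a minimal sketch (the `estimate` function and the job sizes are made up purely for illustration) of spreading a CPU-bound workload across local cores with `multiprocessing`, the technique from earlier in this series:

```python
from multiprocessing import Pool

def estimate(n):
    """A stand-in CPU-bound task: sum of squares below n."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [100_000, 200_000, 300_000, 400_000]

    # Serial baseline: one core does all the work.
    serial = [estimate(n) for n in jobs]

    # Same work spread across all available local cores.
    with Pool() as pool:
        parallel = pool.map(estimate, jobs)

    # The parallel run must produce identical results.
    print(serial == parallel)
```

If a run like this already saturates every core and it is still too slow, that is the point at which a cluster starts to make sense.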
But why are clusters useful? What benefits will we get?
- The most obvious benefit of a cluster is that you can easily scale computing requirements—for example, you can add more machines (or “nodes”) if you need to process more data or to get an answer faster.
- By adding machines, you can also improve reliability. Each machine’s components have a certain likelihood of failing, and with a good design, the failure of some components will not stop the operation of the cluster.
- Clusters are also used to create systems that scale dynamically.
- Dynamic scaling is a very cost-effective way of dealing with nonuniform usage patterns, as long as the machine activation time is fast enough to deal with the speed of changing demand.
Now let's discuss the critical considerations you will encounter:
Requires a change in thinking: Moving to a clustered solution requires a change in thinking. This is an evolution of the change in thinking required when you move from serial to parallel code. Suddenly you have to consider what happens when you have more than one machine—you have latency between machines, you need to know if your other machines are working, and you need to keep all the machines running the same version of your software. System administration is probably your biggest challenge.
System Implementation requirements will change: Also, you normally have to think hard about the algorithms you are implementing and what happens once you have all these additional moving parts that may need to stay in sync. This additional planning can impose a heavy mental tax; it is likely to distract you from your core task, and once a system grows large enough you’ll probably require a dedicated engineer to join your team.
Consider important questions: When designing a clustered solution, you'll need to remember that each machine's configuration might be different (each machine will have a different load and different local data). How will you get all the right data onto the machine that's processing your job? Does the latency involved in moving the job and the data amount to a problem? Do your jobs need to communicate partial results to each other? What happens if a process fails, or a machine dies, or some hardware wipes itself while several jobs are running? Failures can creep in if you don't consider these questions.
Accept the failure: You should also consider that failures can be acceptable. For example, you probably don’t need 99.999% reliability when you’re running a simple web service to serve your content—if on occasion a job fails (e.g., a picture doesn’t get resized quickly enough) and the user is required to reload a page, that’s something that everyone is already used to. It might not be the solution you want to give to the user, but accepting a little bit of failure typically reduces your engineering and management costs by a worthwhile margin. On the flip side, if a high-frequency trading system experiences failures, the cost of bad stock market trades could be considerable!
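One common way to put this "accept a little failure" idea into practice is a bounded retry: attempt the job a few times, and if it still fails, give up gracefully rather than crashing, leaving the user to reload. This is only a sketch under assumptions—the `flaky_resize` job here is hypothetical, standing in for something like the picture-resizing example above:

```python
import random

def flaky_resize(image_name, fail_rate=0.3):
    """Hypothetical job that fails some fraction of the time."""
    if random.random() < fail_rate:
        raise RuntimeError(f"resize of {image_name} failed")
    return f"{image_name} resized"

def run_with_retries(job, *args, attempts=3):
    """Try a job a bounded number of times; return None instead of crashing."""
    for _ in range(attempts):
        try:
            return job(*args)
        except RuntimeError:
            pass  # an occasional failed attempt is acceptable; try again
    return None  # give up: the caller (or the user's page reload) retries later
```

The key design choice is that the caller sees `None` rather than an exception, so one failed picture never takes down the whole service.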
It can be expensive: Maintaining a fixed infrastructure can become expensive. Machines are relatively cheap to purchase, but they have an awful habit of going wrong—automatic software upgrades can glitch, network cards fail, disks have write errors, and power supplies can deliver spiky power that corrupts data. The more computers you have, the more time will be lost to dealing with these issues. Sooner or later you'll want to add a system engineer who can deal with these problems, so budget for that too. Using a cloud-based cluster can mitigate a lot of these problems (it costs more, but you don't have to deal with the hardware maintenance), and some cloud providers also offer a spot-priced market for cheap but temporary computing resources.
There are too many examples to quote here, but one stands out: Skype suffered a 24-hour planet-wide failure in 2010. Behind the scenes, Skype is supported by a peer-to-peer network. An overload in one part of the system caused delayed responses from Windows clients; approximately 40% of the live clients crashed, including 25% of the public supernodes.
With 25% of the routing offline (it came back on, but slowly), the network overall was under great strain. Skype became largely unavailable for 24 hours. The recovery process involved first setting up hundreds of new mega-supernodes configured to deal with the increased traffic, and then following up with thousands more. Over the coming days, the network recovered.
This incident caused a lot of embarrassment for Skype; clearly, it also changed their focus to damage limitation for several days. Customers were forced to look for alternative solutions for voice calls, which was likely a marketing boon for competitors.
I believe you now have a better idea of what a cluster is, why it can be useful, and what the critical considerations are. Feel free to post your questions and thoughts in the comments below. I think that's enough for this video. In the next video, we will talk about how to design a cluster, and we will explore some clustering solutions specific to Python, so stay tuned. If you liked the content of this video, give it a thumbs up, and be sure to subscribe to my channel and hit the bell icon so you never miss any fantastic videos in the future.