Friday, November 16, 2007

Scaling your system - What I learnt from Dan Pritchett's (eBay) talk

I was lucky enough to attend both talks given by Dan Pritchett during the Colorado Software Summit this year. The first one was, "You Scaled your what". If you want to know what scalability is, here is an excellent introduction by Werner Vogels. After listening to what Dan had to say, the following points stood out in my mind
  • You need to understand and take complete control of your architecture. Read my post on Architecture is your responsibility for more thoughts on this.

  • You need to have scalability in mind right from the beginning. Trying to achieve scalability later can be time consuming and very costly. Quoting from Werner Vogels post
  • Why is scalability so hard? Because scalability cannot be an after-thought. It requires applications and platforms to be designed with scaling in mind, such that adding resources actually results in improving the performance or that if redundancy is introduced the system performance is not adversely affected. Many algorithms that perform reasonably well under low load and small datasets can explode in cost if either requests rates increase, the dataset grows or the number of nodes in the distributed system increases.
  • Transactional scaling is just one dimension. You need to think about Scalability of data, operational,deployment, power ..etc. This is a minimalistic set. Try to figure out what dimensions are important to your organization.

  • All scalability dimensions are related and impacts each other. Any dimension ignored can could evolve into a problem for your application

  • Prefer vertical over horizontal scaling. Vertical scaling is better for your vendors and is not a viable long term strategy. There is so much you can get by increasing memory, CPU etc..

Transactional Scaling
Usually measured in TPS and is a traditional indicator for application performance.

  • Keep asking the question "How long can the business survive?" based on,

    • Time-to-live on current resources.

    • Time-to-live on maximum plausible configuration.

  • These metrics should be taken regularly to anticipate possible production bottle necks and identify issues before they become a crisis.

Data Scaling
How well does your data scale? Think about,

  • Functional Decomposition, group data by logical relationships, business importance,transactional volumes etc.

  • Think about partitioning data (sharding).

  • Is all data equally important? prioritizing your data and allocating resources accordingly will help you scale better.

Operational Scalability
How hard is it to run your software? Operational scalability is a software problem and you need to think about operational concerns right from the beginning. Pay attention to,

  • Logging metrics, Monitoring.

  • Controlling/updating/tuning live apps without disrupting traffic.

Deployment Scalability
You need to design/architect your systems while keeping the following in mind,

  • Ability to do incremental roll outs (and rollback if there are problems) without disrupting live traffic.

  • Managing component dependencies during deployment without disrupting live traffic.

  • Your architecture shouldn't assume or decouple itself to any hardware,network topology or data center topology. This allows you to take advantage of new hardware, network topologies ..etc without significant changes.

Power Scalability
Power can be a limiting factor in a data center and may put bounds on transactional scaling.

  • How efficient is your software?, wasted clock cycles == wasted watts.

  • Consider vitalization for best utilization of your hardware resources.

Some good tips I managed to note down

  • Run old and new schema parallelly and then take out the old schema after a while when you gain enough confidence.

  • Prioritize services, willing to take a hit on certain services over others.

  • Incremental rollouts is a very good way to roll out new features while managing risk and also prevents taking the system offline.

  • Schedule deployment during working hours instead of weekends/nights as this enables your developers, support staff to attend to problems while they are alert and without being distracted by non work issues.If you do incremental rollouts this is possible as you are not disrupting traffic.

No comments: