Knowledge Your Way: Sharding Concept in My SQL Database

Sharding is a database scalability technique that has proven itself in some of the world’s most popular MySQL web sites, applications and games. Simply stated, with a sharded database, instead of storing data in only one database instance, you distribute the data across multiple database instances, and potentially multiple servers. It’s a scale-out technique used by many SaaS companies, many of the most popular web sites social and mobile apps, and by many big multi-player games and enterprises for their customer-facing applications.

ScaleBase enables applications to cost-effectively leverage multiple MySQL instances working together

Historically, sharding a database required manually coding data distribution policies directly into your applications. Application developers would write code that stipulates directly where specific data should be placed and found. In essence developers were creating work-around code to solve a database scalability problem so their applications could handle more users, more transactions and more data.

Instead, what if you could leverage the expertise of MySQL sharding engineers who draw experience from hundreds of cases? Now you can, with ScaleBase.

Wikipedia defines sharding as:

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.

Using an example can help explain MySQL sharding more clearly, so let’s take the following table:
What is Sharding?

This is a small table containing a list of customers. Any database can handle such a table. But happens if instead the table has to store seven million rows instead of just seven rows?

Theoretically, this should not be a problem. But usually there are lots of operations on such a large table – for example we may have many read and write operations on this table every second.

In practice, a very large customer table can become a database bottleneck. Why? Because it doesn’t al fit in the database server cache anymore, because of database isolation management, and for other reasons that cause the database to crawl under load.

How does sharding solve MySQL Scalability?

If we take the customers table, and split it into four different databases, each database will contain 1.75 million rows. That’s still a lot, but less than 7 million rows. This will result in improved database performance. In fact, ScaleBase tests have shown about 75% improvements in response time in some standard performance tests. You can see the results here. The following diagram shows how such a table can be split:

Every database will get some of the rows. In old-fashioned do-it-yourself sharding, it was the developer’s responsibility to create an efficient, application-specific data distribution policy that efficiently stipulated exactly where each row should be stored and found for each table. Nowadays, that work is simplified and automated using ScaleBase software.

Unlimited Scale – Sharding allows MySQL to cost-effectively scale out to an unlimited number of users, while increasing performance and data throughput. If you have hundreds of thousands (or hundreds of millions) of users, you may wonder when your MySQL database will failure.

“At scale, everything breaks” (Google VP Engineering, Urs Hölzle, says here).

Using ScaleBase software, a standard open-source MySQL database can scale out indefinitely.

Response time improvement – If your database is big (over 50GB) or has many writes or reads per second (anything above a few hundred per second), sharding will boost your database performance.

Writing do-it-yourself sharding code is difficult. It requires you to rewrite most of your Data Access Layer from scratch and your home-grown sharding code needs to address a range of concerns typically handled by a database. In the above example we’ve seen one table split across four databases, which seems simple. But with a real application, a real schema and real data, it can be more complicated.

For example, how will you maintain relations between the tables? Some tables are shard tables, some are global tables. Also, your sharding code needs to maintain global (broadcast, duplicate) tables across all DBs. For every INSERT the sharding code must know which database to update, and a mistake means bad data placement which translates into lost data. Additionally, for every SELECT the sharding code must know where to find the correct data. and now what about JOINs, GROUP BY, ORDER BY, UNIONs?…

And while it’s difficult to do when you write your own SQL code, it’s even more complex when using O/R mapping tools, as most are not “sharding oriented”.

And after writing the initial do-it-yourself sharding code, you can run into issues. For instance, a common problem occurs when scaling requires adding more shards. Usually, internally written sharding code supports a fixed number of shards, and adding shards requires massive code rewrites – as well as the major downtime required when moving data chunks from one shard to another.

Other parts of the infrastructure also change when using a manually sharded database. For example, the reporting application must now be aware of the sharding logic, since you want to collect data from multiple databases rather than just one. But reporting applications expect to interact with one IP address and one database. With a home-grown sharding deployment, you’ll have to write the reporting application from scratch that understands cross-shard joins and data aggregation to compile results from multiple databases back into a single result. ScaleBase software does this for you and allows you to continue using off-the-shelf third party reporting tools.

With home-grown sharding, database backups are also an issue. And the complexities continue to pile on. So, while, MySQL sharding is a great solution for database scaling, it is complex and costly if done manually, as many of the costs are hidden and only become realized after the initial sharding is performed.

ScaleBase gives you all the benefits of MySQL sharding, with none of the headaches and without changing a single line of your code. And since ScaleBase’s software not embedded inside your application, your BI, DBA team and backup tools can use it, too which means no ecosystem changes.

ScaleBase also provides the ability to scale MySQL with read/write splitting. This means copies of the database can now serve for read operations, while only one database is used for the writes. Database replicas are no longer useless machines only used for high availability – with ScaleBase they can be used for scaling as well, and your architecture will look something like this:

In this architecture you get database high availability and scale out via data distributionor read/write splitting – all in a fully redundant environment that also boosts application performance and database throughput. Several deployment options exist, so check out ourTechnical Whitepaper here to see which would be best for you.

What's the difference between sharding and partition?
Partitioning is a general term used to describe the act of breaking up your logical data elements into multiple entities for the purpose of performance, availability, or maintainability.

Sharding is the equivalent of "horizontal partitioning". When you shard a database, you create replica's of the schema, and then divide what data is stored in each shard based on a shard key. For example, I might shard my customer database using CustomerId as a shard key - I'd store ranges 0-10000 in one shard and 10001-20000 in a different shard. When choosing a shard key, the DBA will typically look at data-access patterns and space issues to ensure that they are distributing load and space across shards evenly.

"Vertical partitioning" is the act of splitting up the data stored in one entity into multiple entities - again for space and performance reasons. For example, a customer might only have one billing address, yet I might choose to put the billing address information into a separate table with a CustomerId reference so that I have the flexibility to move that information into a separate database, or different security context, etc.

To summarize - partitioning is a generic term that just means dividing your logical entities into different physical entities for performance, availability, or some other purpose. "Horizontal partitioning", or sharding, is replicating the schema, and then dividing the data based on a shard key. "Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).

Final note:  you can combine both horizontal and vertical partitioning techniques - sometimes required in big data, high traffic environments.

4 comments:

sriSeptember 30, 2020 at 1:03 AM
Thanks for sharing this informative content , Great work
Leanpitch provides online training in Agile coach during this lockdown period everyone can use it wisely.
Certified agile coaching Bangalore
sriOctober 28, 2020 at 4:31 AM
https://tcstechlounge.blogspot.com/2018/06/tcs-tech-lounge-java-final-13.html?showComment=1603884681018#c2987871865721574390
sriOctober 28, 2020 at 4:31 AM
Thanks for sharing this informative content , Great work
Leanpitch provides online training in Product prototyping during this lockdown period everyone can use it wisely.
icp-cat training
sriNovember 9, 2020 at 10:50 PM
Thanks for sharing this informative content , Great work
To crack scrum master interview : Scrum Master Interview Questions

Knowledge Your Way

Thursday, May 14, 2015

Sharding Concept in My SQL Database

4 comments:

Java 9 and Java11 and Java17, Java 21 Features

About Me

Labels