Table of Content
As data growth affects the CPU capacity of the machine, sharding plays a key role in the place where a single server cannot handle a very large data set exceeding the storage capacity.
Sharding – An Overview
Sharding is a method for distributing data across multiple machines. It supports deployments with very large data sets, high throughput operations, and horizontal scaling. It shards data at the collection level and distributes the collection data across the shards in the cluster. For aggregation operations running on multiple shards, if the operations do not require running on the database’s primary shard, these operations can route the results to any shard so as to merge the results. Thus, avoiding overloading of the primary shard for that database. It divides the data across multiple servers and reduces the amount of data stored by each server. A shard cluster can have shard/non-shard collections without causes.
- Provides an interface between client applications and the sharded cluster.
- Do not maintain persistent state.
- Restarts, retrieves a copy of the config database, and begins routing queries.
- Consumes minimal system resources.
- Reasonably lightweight, in terms of storage and CPU/memory usage, as the metadata for a sharded cluster is significantly smaller than the actual cluster data.
- Stores metadata and configuration settings for the cluster.
- Deployed as a replica set (config servers as replica sets) in MongoDB 3.4.
- Two types of config servers are:
- Sync Cluster Connection Configuration (SCCC) – Contains maximum 3 servers per cluster (older version is not supported since v3.4).
- Config Servers as Replica Sets (CSRS) – Provides advantage of the standard replica set read and write protocols for sharding config data.
- Contains a subset of the sharded data.
- Deployed as a replica set.
- Distributes the documents in a collection.
- Exists in every document in the target collection.
- Contains an immutable field and one shard key per-collection.
- Cannot be changed after creating collections.
For non-empty collection, an index that starts with the shard key is needed. For empty collections, MongoDB creates the index if the collection does not already have an appropriate index for the specified shard key. The choice of the shard key affects the performance, efficiency, and scalability of the sharded cluster and affects the sharding strategy used in the cluster.
A shard cluster can have both shard & non-shard collections without causes.
Connecting with Shard Cluster
- Should connect to a Mongos router (Query Router) to interact with any collection (sharded and unsharded collection) in the sharded cluster.
- Should never connect to a single shard to perform read or write operations.
Operational Restrictions on Shard Cluster
Few restrictions on shard cluster are:
- $group – does not work with sharding. Use mapReduce or aggregate instead.
- $where – does not permit references to the DB object from the $where function. This is uncommon in un-sharded collections.
- $isolated – update modifier does not work in sharded environments.
- $snapshot – queries do not work in sharded environments.
- $geoSearch – command is not supported in sharded environments.
- Single Document Modification Operations – updateOne(), removeOne(), and deleteOne() operations without the shard key or the _id field returning an error.
- Does not support unique indexes across shards, except when the unique index contains the full shard key as a prefix of the index.
Launching Shard/replica-set Cluster with Even Number of replica-set (Arbiter Server)
If there are equal number of servers on either side of the partition, the database cannot maintain CAP. An Arbiter is specifically designed to create an “imbalance” or “majority” on one side so that a primary can be elected in this case. An arbiter does not have a copy of data set and cannot become a primary. Replica sets may have arbiters to add a vote in elections of for primary. Arbiters allow replica sets to have uneven number of members without the overhead of a member that replicates data.
Advantages of Sharding
- Reads/writes scaled and distributed horizontally.
- High storage capacity and high availability.
- Can perform partial read / write operations even if one or more shards are unavailable.
- Can perform reads or writes on other shard servers even if a shard server is unavailable during the downtime.
- Can deploy config servers as replica sets from v3.2. A sharded cluster with a CSRS can continue to process reads and writes as long as majority of the replica set is available. SCCC config servers support is removed from v3.4.
- Should deploy individual shards as replica sets in production environments for providing increased redundancy and availability.
- MongoDB shard is very handful when larger data cannot be handled by single machine.
- With replica-set, larger or multiple queries/read operations can be performed on MongoDB databases.
- With replica-set integration in Shard, high availability is enabled.
- For aggregation operations running on multiple shards and not running on the database’s primary shard, the results can be routed to any shard to merge the results and to avoid overloading of the primary shard for that database.
- MongoDB Shard use case: please refer our next blog post – MongoDB Shard Setup – Part II
- Building an Inexpensive Petabyte Database with MongoDB and Amazon Web Services: