Designing Partitions

Partitioning works well if the tables in the database are naturally divisible into similar partitions where most of the rows accessed by any SQL statement can be placed on the same member server. Tables are clustered in related units. For example, suppose the entry of an order references the Orders, Customers, and Parts tables, along with all tables that record the relationships between customers, orders, and parts. Partitions work best if all the rows in a logical cluster can be placed on the same member server.

Symmetric Partitions

Partitioning is most effective if the tables in a database can be partitioned symmetrically:

Related data is placed on the same member server, so that most SQL statements routed to the correct member server will have minimal, if any, requirements for data on other member servers. A distributed partitioned view design goal can be stated as an 80/20 rule: Design partitions so that most SQL statements can be routed to a member server, where at least 80 percent of the data is on that server, and distributed queries are needed for 20 percent or less of the data. A good test of whether this can be achieved is to see whether the partition allows all rows to be placed on the same member server as all of their referencing foreign key rows. Database designs that support this goal are good candidates for partitioning.
The data is partitioned uniformly across the member servers.
For example, suppose a company has divided North America into regions. Each employee works in one region, and customers make most of their purchases in the state or province where they live. The region and employee tables are partitioned along regions. Customers are partitioned between regions by their state or province. While some queries require data from multiple regions, the data needed for most queries is on the server for one region. Applications route SQL statements to the member server containing the region inferred from the context of the user input.

Asymmetric Partitions

Although symmetric partitions are the ideal goal, most applications have complex data access patterns that prevent symmetrical partitioning. Asymmetric partitions result in some member servers assuming larger roles than others. For example, only some of the tables in a database may be partitioned, with the tables that have not been partitioned remaining on the original server. Asymmetric partitions can provide much of the performance of a symmetric partition, with these important benefits:

Dramatically improving the performance of a database that cannot be symmetrically partitioned by asymmetrically partitioning some of its tables.
Successfully partitioning a large existing system by making a series of iterative, asymmetric improvements. The tables chosen for partitioning in each step are usually those that will give the highest performance gain at that time.

In an asymmetric approach, the original server usually retains some tables that did not fit the partitioning scheme. The performance of these remaining tables is usually faster than in the original system because the member tables move to member servers, reducing the load on the original server.

Many databases can be partitioned in more than one way. The specific partitions chosen for implementation must be those that best meet the requirements of the typical range of SQL statements executed by the business services tier.

Distributed Partitioned Views

You should also design the partitioning in a way that produces routing rules that applications can use to determine which member server can most effectively process each SQL statement. The business services tier must be able to match a piece of user data against the routing rules to find which member server processes the SQL statement.

There are four areas to consider when designing a set of distributed partitioned views to implement a federation of database servers:

Determine the pattern of SQL statements executed by the application.
Develop a list of the SQL statements executed by the application during typical processing periods. Divide the list into SELECT, UPDATE, INSERT, and DELETE categories, and order the list in each category by frequency of execution. If the SQL statements reference stored procedures, use the base SELECT, INSERT, UPDATE, and DELETE statements from the stored procedure. If you are partitioning an existing Microsoft® SQL Server™ 2000 database, you can use SQL Profiler to get such a list.

The recommendation for using the frequency of SQL statements is a reasonable approximation in the typical online transaction processing (OLTP) or Web site database in which distributed partitioned views work best. These systems are characterized by having individual SQL statements that retrieve relatively small amounts of data when compared to the types of queries in a decision support, or OLAP, system. When each SQL statement references a small amount of data, simply studying the frequency of each statement yields a reasonable approximation of the data traffic in the system. Many systems, however, have some group of SQL statements that reference large amounts of data. You may want to take the extra step of weighting these queries to reflect their larger data requirements.
Determine how the tables are related to each other.
The intent is to find clusters of tables that can be partitioned along the same dimension (for example, part number or department number) so that all the rows related to individual occurrences of that dimension will end up on the same member server. For example, you may determine that one way to partition your database is by region. To support this, even tables that do not have a region number in their key must be capable of being partitioned in some manner related to a region. In such a database, even when the Customer table does not have a region number column, if regions are defined as collections of whole states or provinces, then the Customer.StateProvince column can be used to partition the customers in a manner related to region.

Because they define the relationships between tables, explicit and implicit foreign keys are the prime elements to review in looking for ways to partition data. Study the explicit foreign key definitions to determine how queries would usually use rows in one table to find rows in another table. Also study implicit foreign keys, or ways that SQL statements use values in the rows of one table to reference rows from another table in join operations, even when there is no specific foreign key definition. Because implicit foreign keys are not explicitly defined as part of the database schema, you must review the SQL statements generated by the application to understand whether there are statements that join tables using nonkey columns. These implicit foreign keys are typically indexed to improve join performance, so you should also review the indexes defined in the database.
Match the frequency of SQL statements against the partitions defined from analyzing the foreign keys.
Select the partitioning that will best support the mix of SQL statements in your application. If some sets of tables can be partitioned in more than one way, use the frequency of SQL statements to determine which of the partitions satisfies the largest number of SQL statements. The tables most frequently referenced by SQL statements are the ones you want to partition first. Prioritize the sequence in which you partition the tables based on the frequency in which the tables are referenced.

The pattern of SQL statements also influences the decision on whether a table should be partitioned:
- Partition a table if more than 5 percent of the statements referencing a table are INSERT, UDATE, or DELETE statements, and the table can be partitioned along the dimension you have chosen.
- Maintain complete copies of tables on each member server if less than 5 percent of the statements referencing the table are INSERT, UPDATE, or DELETE statements. You will also need to define how updates will be made so that all the copies of the table are updated. If high transactional integrity is required, you can code triggers that perform distributed updates of all the copies within the context of a distributed transaction. If you do not need high transactional integrity, you can use one of the SQL Server replication mechanisms to propagate updates from one copy of the table to all other copies.
- Do not partition or copy a table if more than 5 percent of the statements referencing a table are INSERT, UDATE, or DELETE statements, and the table cannot be partitioned along the dimension you have chosen.
Define the SQL statement routing rules. The routing rules must be able to define which member server can most effectively process each SQL statement. They must establish a relationship between the context of the input of the user and the member server that contains the bulk of the data required to complete the statement. The applications must be able to take a piece of data entered by the user, and match it against the routing rules to determine which member server should process the SQL statement.

Designing Partitions

Symmetric Partitions

Asymmetric Partitions

Distributed Partitioned Views

See Also