116 CHAPTER 6 Scalable Data Warehousing
You design the data layout on the appliance to avoid or minimize data movement for parallel
queries by using either a replicated or a distributed strategy for storage. When planning
which strategy to implement, you consider the types of joins that the parallel queries require.
Some tables require a replicated strategy, whereas others require a distributed strategy.
Replicated Strategy
For best performance, you can add small tables—such as dimension tables in a star schema—
to Parallel Data Warehouse by using a replicated strategy. Parallel Data Warehouse makes
a copy of the table on each compute node, as shown in Figure 6-3. You then perform the
initial load of the table, followed by any subsequent inserts, updates, or deletes, as if you were
working with a single table, without the need to manage each copy of the table. Parallel Data
Warehouse handles all changes to the table for you. When a query performs a join on a replicated
dimension, Parallel Data Warehouse joins the dimension to the portion of the fact table
that exists on the same compute node. All compute nodes run the query in parallel and can
find data very quickly because the complete dimension table is on each compute node.
FIGURE 6-3 Replicated strategy: all rows of the replicated table are copied to each compute node.
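The node-local join described above can be illustrated with a small simulation. This is only a sketch in Python, not Parallel Data Warehouse code: the table contents, the node count, and the helper names (`build_nodes`, `local_join`, `run_query`) are all hypothetical, and placing fact rows by `sale_id % num_nodes` is an assumed stand-in for however the appliance actually places them.

```python
from collections import defaultdict

NUM_NODES = 4  # assumed number of compute nodes, for illustration only

# Hypothetical small dimension table: product_key -> product name.
dim_product = {1: "Bike", 2: "Helmet", 3: "Gloves"}

# Hypothetical fact rows: (sale_id, product_key, amount).
fact_sales = [
    (101, 1, 500.0),
    (102, 2, 40.0),
    (103, 1, 450.0),
    (104, 3, 25.0),
    (105, 2, 35.0),
]


def build_nodes(dimension, facts, num_nodes):
    """Give every node a full copy of the dimension and a slice of the facts."""
    nodes = [{"dim": dict(dimension), "facts": []} for _ in range(num_nodes)]
    for row in facts:
        sale_id = row[0]
        # Assumption: a simple modulo stands in for the real row placement.
        nodes[sale_id % num_nodes]["facts"].append(row)
    return nodes


def local_join(node):
    """Join the node's fact rows to its own dimension copy; no other node is consulted."""
    totals = defaultdict(float)
    for _sale_id, product_key, amount in node["facts"]:
        totals[node["dim"][product_key]] += amount
    return dict(totals)


def run_query(nodes):
    """Combine the per-node partial results into the final answer."""
    combined = defaultdict(float)
    for node in nodes:
        for name, total in local_join(node).items():
            combined[name] += total
    return dict(combined)


if __name__ == "__main__":
    nodes = build_nodes(dim_product, fact_sales, NUM_NODES)
    print(run_query(nodes))
```

Because each simulated node holds the complete dimension, `local_join` never needs a row from another node; only the small per-node totals are combined at the end.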
Distributed Strategy
One of the keys to performance in an MPP architecture is the distribution of large tables
across multiple nodes, as shown in Figure 6-4. To distribute a fact table, you simply select a
column from the table to use as the distribution column, and when data is loaded into the
table, Parallel Data Warehouse automatically spreads the rows across all of the compute