This article is for the DBA's who are not familiar with the basics of join techniques like HASH Join, Nested Loops and SORT MERGE. Because any DBA who doesnt know how a join is happening in the background cannot work with the join order so understanding the concept is more important first. As it is an article on to explain the concepts so we will not go deep into it as of now. On this cold night here in New Delhi with some hot sweet corn soup on the table, oh im sorry let me come to the point now...
HASH JOINS
Lets start with hash join. As we know that the optimizer looks into the access path , join method and then the join order respectively, so yes hash join is a join method. The database uses hash joins to join large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows. This method is best available when the hash table is able to fit in memory. This method is best when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.
The optimizer shall use hash join when you are joining large data sets or large fraction of a small table must be joined. Use the USE_HASH hint to instruct the optimizer to use a hash join when joining two tables together.
However, in a production database with very large tables, it is not always easy to get your database to invoke hash joins without increasing the RAM regions that control hash joins. For large tables, hash joins requires lots of RAM. Make sure that the smaller table is the driving table (the first table in the from clause when using the ordered hint). This is because a hash join builds the memory array using the driving table.
Make sure that you have the proper settings for optimizer_index_cost_adj and optimizer_max_permutations (Oracle9i and earlier) to limit the number of table join evaluations. Also check your values for hash_area_size and hash_multiblock_io_count. If using pga_aggregate_target, remember that it will not use more than 5% of the space for any hash join and you may need to override it.
SORT MERGE JOINS
Sort merge joins can join rows from two independent sources. Hash joins generally perform better than sort merge joins. However, sort merge joins can perform better than hash joins if the row sources are already sorted and a sort operation will not at all be performed. Do keep in mind that there is no concept of driving table in sort merge join. As this post is only for basic understanding, the final thing i will say it is that this technique has two steps := SORT is done for both the inputs and then both the sorted lists are MERGED together. If the input is sorted by the join column, then a sort join operation is not performed for that row source. However, a sort merge join always creates a positionable sort buffer for the right side of the join so that it can seek back to the last match in the case where duplicate join key values come out of the left side of the join. Use USE_MERGE hint to instruct the optimizer to go for sort merge join.
Sort merge joins are useful when the join condition between two tables is an inequality condition (but not a nonequality) like <, <=, >, or >=. Sort merge joins perform better than nested loop joins for large data sets. You cannot use hash joins unless there is an
equality condition.
NESTED LOOP JOINS
This type of join should always be used when small subsets of data is being joined by the database and also if the join condition is an efficient way of accessing the second table. Hold on here... By second table here i mean is the driven table. So there is a driving table which drives the driven table in the nested loops offcourse. It is very important that inner table which is driven should have dependent data on the driving table otherwise if the access path of driving and driven table is independent then for every iteration for the outer loop the same rows will be retrived degrading performance considerably. It works something like this, well not exactly as a cartesion product, but yes if access path is independent then for every row of outer loop an inner loop row is evaluated which hits the performance badly if joining large data sets.
Oracle Database 11g introduces a new implementation for nested loop joins to reduce overall latency for physical I/O. When an index or a table block is not in the buffer cache and is needed to process the join, a physical I/O is required. Oracle Database 11g can batch multiple physical I/O requests and process them using a vector I/O instead of processing them one at a time.
Well the optimizer will always work on the stats which it has as it is a mathematical function and it works on numbers. What im trying to say is that lets say for example A larger multiblock read count is likely to decrease the cost for a sort merge join in relation to a nested loop join. If the database can read a large number of sequential blocks from disk in a single I/O, then an index on the inner table for the nested loop join is less likely to improve performance over a full table scan. The multiblock read count is specified by the initialization parameter DB_FILE_MULTIBLOCK_READ_COUNT.
Have a nice day ahead DBA's ..... My next post shall be more on how to improve the cost of joins as it is very important task for a DBA specially when it comes to determine the join order.
Related links:
HOME
No comments:
Post a Comment