Teradata SQL Reference: Collect Statistics

COLLECT STATISTICS is a Teradata sql command that scans columns and indexes of a table and records demographics of the data.

COLLECT STATISITICS is used to provide the Teradata Optimizer with as much information on data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values.

Collecting statistics can improve the execution of a sql. The optimizer can have more details about each column or index, and therefore determine a better join plan for resolving the query.

COLLECT STATISTICS COMMANDS SHOULD ONLY BE USED WHEN A SPECIFIC SQL (OR GROUP OF SQLS) IS BENEFITTED.

Because, Collecting statistics may not improve a sql’s efficiency. Therefore, the processing time used to collect the statistics was wasted, and this could be a large amount of processing time.

If a sql’s performance is shown to improve from a specific collected statistic, then the statistic should be initially collected and then refreshed on a regular basis.

To possibly avoid spoolspace errors in the application design phase, some common problem columns are listed below in "Collect Statistics - Problem Columns".

With COLLECT STATISTICS:

- COLLECT STATISTICS can make the usage of these more efficient:

Primary indexes

Secondary indexes

Join Columns or base tables

Join Indexes

Qualifying columns

-Statistics are especially informative if index values are distributed unevenly.

- When a query uses conditionals based on non-unique index values, then Teradata uses statistics to determine whether indexing or a full search of all table rows is more efficient.

- If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.

Without COLLECT STATISTICS:

- The Optimizer assumes:

Non-unique indexes are highly non-unique. (Lots of rows per value).

Non-Index columns are even more non-unique than non-unique indexes. (Lots of rows per value)

- Teradata derives row counts from a random AMP sample for:

Small tables (less than 1000 rows per amp),Unevenly distributed tables (skewed row distribution due to PI).

Random amp sample:

Look at data from 1 amp of table, and from this, estimate the total rows in the table.

Random amp samples may not represent the true total number of rows in the table because the rows in the table may not be distributed evenly. This occurs often with small tables. Asof 9/2000, per table, the random amp sample uses the same amp for each sql or query.

6 comments:

AnonymousMay 24, 2013 at 2:58 PM
IS THERE A DIFFERENCE BETWEEN
COLLECT STATISTICS VS COLLECT STATS?
UnknownJune 11, 2013 at 4:25 AM
No. There is no difference b/w COLLECT STATISTICS VS COLLECT STATS.
COLLECT STATS is short form of COLLECT STATISTICS.
UnknownNovember 18, 2013 at 6:39 PM
when will go for collect stats in real time environment ,and in case of loading fact and dimention tables what tables we should load first and how to do loading of about this fact and dimention tables and how to get these fact and dimention tables can u justify my doubts please....
UnknownNovember 18, 2013 at 6:47 PM
hi this is ramesh can u tell me briefly about data stage and teradata project flow,I think can u explain about where will we extract the data, work data base and etc ETL phases of both data stage and teradata please explain I had confused about that
AnonymousNovember 27, 2013 at 10:30 PM
we collected stats on one table. if we add one more column to the same table, previous collect stats will work for other columns or do we need to collect stats again?
AnonymousFebruary 26, 2014 at 9:13 PM
What is difference between the below :
collect stats on table column(col1)
vs
collect stats column(col1) on table