Collect Statistics

COLLECT STATISTICS is a Teradata sql command that scans columns and indexes of a table and records demographics of the data.

COLLECT STATISITICS is used to provide the Teradata Optimizer with as much information on data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values.

Collecting statistics can improve the execution of a sql. The optimizer can have more details about each column or index, and therefore determine a better join plan for resolving the query.

COLLECT STATISTICS COMMANDS SHOULD ONLY BE USED WHEN A SPECIFIC SQL (OR GROUP OF SQLS) IS BENEFITTED.

Because, Collecting statistics may not improve a sql’s efficiency. Therefore, the processing time used to collect the statistics was wasted, and this could be a large amount of processing time.
           
If a sql’s performance is shown to improve from a specific collected statistic, then the statistic should be initially collected and then refreshed on a regular basis.

To possibly avoid spoolspace errors in the application design phase, some common problem columns are listed below in "Collect Statistics - Problem Columns".

With COLLECT STATISTICS:

 - COLLECT STATISTICS can make the usage of these more efficient:
                        Primary indexes
                        Secondary indexes
                        Join Columns or base tables
                        Join Indexes
                        Qualifying columns

 -Statistics are especially informative if index values are distributed unevenly.

 - When a query uses conditionals based on non-unique index values, then Teradata uses statistics to determine whether indexing or a full search of all table rows is more efficient.

              
 - If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.
          
   
Without COLLECT STATISTICS:

- The Optimizer assumes:
  Non-unique indexes are highly non-unique. (Lots of rows per value).
 Non-Index columns are even more non-unique than non-unique indexes.     (Lots of rows per value)

- Teradata derives row counts from a random AMP sample for:
  Small tables (less than 1000 rows per amp),Unevenly distributed tables (skewed row distribution due to PI).

  Random amp sample:
  Look at data from 1 amp of table, and from this, estimate the total rows in the table.

  Random amp samples may not represent the true total number of rows in the table because the rows in the table may not be distributed evenly. This occurs often with small tables. Asof 9/2000, per table, the random amp sample uses the same amp for each sql or query.


6 comments:

  1. IS THERE A DIFFERENCE BETWEEN
    COLLECT STATISTICS VS COLLECT STATS?

    ReplyDelete
  2. No. There is no difference b/w COLLECT STATISTICS VS COLLECT STATS.
    COLLECT STATS is short form of COLLECT STATISTICS.

    ReplyDelete
  3. when will go for collect stats in real time environment ,and in case of loading fact and dimention tables what tables we should load first and how to do loading of about this fact and dimention tables and how to get these fact and dimention tables can u justify my doubts please....

    ReplyDelete
  4. hi this is ramesh can u tell me briefly about data stage and teradata project flow,I think can u explain about where will we extract the data, work data base and etc ETL phases of both data stage and teradata please explain I had confused about that

    ReplyDelete
  5. we collected stats on one table. if we add one more column to the same table, previous collect stats will work for other columns or do we need to collect stats again?

    ReplyDelete
  6. What is difference between the below :
    collect stats on table column(col1)
    vs
    collect stats column(col1) on table

    ReplyDelete