Collect Statistics

COLLECT STATISTICS is a Teradata sql command that scans columns and indexes of a table and records demographics of the data.

COLLECT STATISITICS is used to provide the Teradata Optimizer with as much information on data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values.

Collecting statistics can improve the execution of a sql. The optimizer can have more details about each column or index, and therefore determine a better join plan for resolving the query.

COLLECT STATISTICS COMMANDS SHOULD ONLY BE USED WHEN A SPECIFIC SQL (OR GROUP OF SQLS) IS BENEFITTED.

Because, Collecting statistics may not improve a sql’s efficiency. Therefore, the processing time used to collect the statistics was wasted, and this could be a large amount of processing time.
           
If a sql’s performance is shown to improve from a specific collected statistic, then the statistic should be initially collected and then refreshed on a regular basis.

To possibly avoid spoolspace errors in the application design phase, some common problem columns are listed below in "Collect Statistics - Problem Columns".

With COLLECT STATISTICS:

 - COLLECT STATISTICS can make the usage of these more efficient:
                        Primary indexes
                        Secondary indexes
                        Join Columns or base tables
                        Join Indexes
                        Qualifying columns

 -Statistics are especially informative if index values are distributed unevenly.

 - When a query uses conditionals based on non-unique index values, then Teradata uses statistics to determine whether indexing or a full search of all table rows is more efficient.

              
 - If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.
          
   
Without COLLECT STATISTICS:

- The Optimizer assumes:
  Non-unique indexes are highly non-unique. (Lots of rows per value).
 Non-Index columns are even more non-unique than non-unique indexes.     (Lots of rows per value)

- Teradata derives row counts from a random AMP sample for:
  Small tables (less than 1000 rows per amp),Unevenly distributed tables (skewed row distribution due to PI).

  Random amp sample:
  Look at data from 1 amp of table, and from this, estimate the total rows in the table.

  Random amp samples may not represent the true total number of rows in the table because the rows in the table may not be distributed evenly. This occurs often with small tables. Asof 9/2000, per table, the random amp sample uses the same amp for each sql or query.


Amp Offline in Teradata Database

The Teradata database system is impacted in the following ways when an amp is offline:

            1) System Performance                                                
            2) Table Updates/Offline Recovery
            3) Table Rebuild
            4) Spoolspace
            5) ASF Archives
            6) ASF Restore/Copy
            7) Amp Usage Statistics
            8) 2nd Amp Risk
            9) Fastload Configuration Change

1) System Performance

The system performance is impacted because the processing power of the down amp is missing. The application processing normally done on the down amp must be absorbed by the other amps in the system. This is accomplished with clustering.

When an amp is down, a table's primary copy of the data on that amp is not available, but the fallback copy is available on the other amps within the cluster. In our environment, the cluster size is 4 amps, so the table data from the down amp is spread across the other 3 amps. This means that when an amp is offline, the access to the table data on the other 3 online amps in the cluster is increased.

The amount of processing increase in the other 3 amps can be determined as follows: Assume that the system processing is spread evenly across all 4 original amps. Then if 1 amp is removed, then to do the same processing on the other 3 amps, the processing would have to increase by 33% on each of the other 3 amps.

Therefore, when an amp is offline, the system performance of the cluster decreases by 33%. And, if this particular cluster has amps that are very highly utilized, then the overall system performance degradation would be noticable.

2) Table Updates/Offline Recovery

Table data can be updated while an amp is down. The data slated for the down amp is directed to the other 3 amps instead. A recovery journal ("dbc.changedrowjournal") is used on the other 3 amps. But, when the amp is brought back online, then the table updates stored on the other 3 amps must be updated on the new amp when it is brought back online. This procedure is "offline recovery" or "catch-up".

TOS single threads each logon's "offline recovery".

3) Table Rebuild

If the down amp's portion of the table data is not available when the amp is brought back online, then it would be rebuilt with a procedure called "table rebuild". This procedure reads the fallback copy of the table data from the 3 online amps in the cluster, and creates both the primary and fallback copy of the table data on the new amp.

4) Spoolspace

The DBA sets the spoolspace limit per logon, but TOS sets the spoolspace limit per amp (See "Spoolspace" Tip). When an amp is down, the spoolspace limit per amp remains the same, but the spoolspace limit per logon is decreased by 1 amp's worth. Therefore, on the 3 online amps, the processing will increase, and the spoolspace usage will increase, but the spoolspace limit per amp will remain the same.

5) ASF Archives

When an amp is down, the ASF archive will use the fallback copy. If indexes are also archived, then NUSI's (non-unique secondary indexes) will be skipped, and a warning message displayed. The message also includes the amp number that is down.

6) ASF Restore/Copy

ASF Restore does not work when an amp is down. ASF Copy will work and use the online amps in a cluster.

7) Amp Usage Statistics

When the "table rebuild" is executed for the amp, the dbc.acctg table (dbc.ampusage view) is also rebuilt. But this dictionary table does not have fallback, therefore the cpu and diskio statistics for each logon on that amp are lost. This is seen in the Opermenu utility as an unusually high number of logons accessing the system during the hour.

8) 2nd Amp Risk

During the time that 1 amp is down in a cluster, there is a very small risk that a 2nd amp could also go down in the same cluster.

Within the same cluster as the down amp: If a 2nd amp goes down, but the table data is recoverable, then the system processing would be halted, and after the amp is fixed, the system would be restarted. But if the table data is not recoverable, then the table data of the entire system would be restored, and then the system would be restarted.

9) Fastload Configuration Change

Fastload cannot continue when an amp goes offline while the fastload is executing.One of the recovery options is to wait until the original configuration is attained, but this will probably never be used.


Secondary Index in Teradata

A minimum set of secondary indexes should be determined for each table. These would include those columns described above, which are subdivisions of primary indexes used in table joins. Also, an additional set of secondary indexes could include those columns used for set selection in the where clause, and those non-index columns used for table joins.

TOS(Teradata Operating System) will not subdivide a primary index into its components and do a partial index search or retrieval.

If a unique primary index is composed of several columns, then subsets of those columns are likely candidates for non-unique secondary indexes. When a query uses a join to a subset of the columns of the primary index, then the Teradata Operating System will not subdivide the primary index into a subdivision of columns, and use this subdivision as an index. Instead, a secondary index must be specifically created with the columns contained in the subdivision. Then the optimizer may select the secondary index for processing the query.

The selection of the additional set of indexes depends on the column usage within a query, and whether the optimizer will select the index, even if it is available. the explain command provides an insight to this selection.


Spool space in Teradata

Spool space is allocated to a logon with a spool space limit parameter. This spool space parameter assigns a spool space limit to each amp. Each amp gets an equal share of the spool space limit. Therefore,
                                                            Logon spool space limit
            Amp spoolspace limit =   ---------------------------------------------------
                                                            Number of amps

The message “Failure 2646 No more spool space in username” occurs whenever any amp exceeds its individual spoolspace limit.

For example, if the logon is assigned 150MB of spool space, and there are 50 amps, then each amp will get a 3MB spool space limit. And, the 2646 message will occur if any amp exceeds 3MB of spool space.



Teradata Multiload

Multiload is a Teradata utility that will read a unix file and update (insert, delete, update) a populated target table.

Multiload is faster than Bteq for updating a populated table. Bteq updates 1 row at a time, where Multiload updates blocks of rows at a time.

When Multiload is compared to the Fastload/delete/insert method, then Multiload is faster for volumes above 10,000 records. For volumes less than 10,000 records, the difference is seconds, and is negligible. Multiload is faster whether the target table has a unique primary index, or a non-unique primary index. 

The Fastload/delete/insert method of updating populated target tables consists of 3 steps:
 1) Fastload into a temporary table
 2) Bteq delete matching rows from the target table
 3) Bteq insert into the target table. 

Multiload skips the temporary table and directly updates the target table from the unix file.
When loading into an empty target table, both Multiload and Fastload are about the same speed.

Multiload’s speed is not affected by the number of rows already in the target table. The speed is affected by the number of  update records, and can be affected by the number of error records written to the error journals. 

Upsert performs the same functionality as the delete/insert.