Collect Statistics

COLLECT STATISTICS is a Teradata SQL command that scans the columns and indexes of a table and records demographics of the data.

COLLECT STATISTICS is used to provide the Teradata Optimizer with as much information about the data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values.

Collecting statistics can improve the execution of an SQL statement. The optimizer has more detail about each column or index, and can therefore determine a better join plan for resolving the query.

COLLECT STATISTICS COMMANDS SHOULD ONLY BE USED WHEN A SPECIFIC SQL (OR GROUP OF SQLS) IS BENEFITED.

This is because collecting statistics may not improve an SQL's efficiency. In that case, the processing time used to collect the statistics is wasted, and this can be a large amount of processing time.
           
If an SQL's performance is shown to improve from a specific collected statistic, then that statistic should be collected initially and then refreshed on a regular basis.
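As a sketch (the database, table, and column names here are hypothetical, and the exact syntax varies by Teradata release; older releases omit the ON keyword), statistics are collected and later refreshed like this:

    COLLECT STATISTICS ON mydb.orders COLUMN order_date;
    COLLECT STATISTICS ON mydb.orders INDEX (order_id);

    -- Re-running COLLECT STATISTICS on the table alone refreshes
    -- all previously defined statistics:
    COLLECT STATISTICS ON mydb.orders;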

To help avoid spoolspace errors in the application design phase, some common problem columns are listed below in "Collect Statistics - Problem Columns".

With COLLECT STATISTICS:

 - COLLECT STATISTICS can make the usage of these more efficient:
     Primary indexes
     Secondary indexes
     Join columns on base tables
     Join indexes
     Qualifying columns

 - Statistics are especially informative if index values are distributed unevenly.

 - When a query uses conditions based on non-unique index values, Teradata uses statistics to determine whether indexing or a full scan of all table rows is more efficient.

              
 - If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.
          
   
Without COLLECT STATISTICS:

- The Optimizer assumes:
    Non-unique indexes are highly non-unique (lots of rows per value).
    Non-index columns are even more non-unique than non-unique indexes (lots of rows per value).

- Teradata derives row counts from a random amp sample for:
    Small tables (less than 1000 rows per amp)
    Unevenly distributed tables (skewed row distribution due to the PI)

  Random amp sample:
  Teradata looks at the data on 1 amp of the table and, from this, estimates the total number of rows in the table.

  A random amp sample may not represent the true total number of rows in the table because the rows may not be distributed evenly across the amps. This occurs often with small tables. As of 9/2000, per table, the random amp sample uses the same amp for each SQL or query.


Amp Offline in Teradata Database

The Teradata database system is impacted in the following ways when an amp is offline:

            1) System Performance                                                
            2) Table Updates/Offline Recovery
            3) Table Rebuild
            4) Spoolspace
            5) ASF Archives
            6) ASF Restore/Copy
            7) Amp Usage Statistics
            8) 2nd Amp Risk
            9) Fastload Configuration Change

1) System Performance

The system performance is impacted because the processing power of the down amp is missing. The application processing normally done on the down amp must be absorbed by the other amps in the system. This is accomplished with clustering.

When an amp is down, a table's primary copy of the data on that amp is not available, but the fallback copy is available on the other amps within the cluster. In our environment, the cluster size is 4 amps, so the table data from the down amp is spread across the other 3 amps. This means that when an amp is offline, the access to the table data on the other 3 online amps in the cluster is increased.

The amount of processing increase on the other 3 amps can be determined as follows: assume that the system processing is spread evenly across all 4 original amps. If 1 amp is removed, then to do the same processing on the other 3 amps, the processing must increase by 33% on each of them.

Therefore, when an amp is offline, the system performance of the cluster decreases by 33%. And if this particular cluster has amps that are very highly utilized, the overall system performance degradation will be noticeable.

2) Table Updates/Offline Recovery

Table data can be updated while an amp is down. The data slated for the down amp is directed to the other 3 amps instead, using a recovery journal ("dbc.changedrowjournal") on those amps. When the amp is brought back online, the table updates stored on the other 3 amps must be applied to the returning amp. This procedure is called "offline recovery" or "catch-up".

TOS single-threads each logon's "offline recovery".

3) Table Rebuild

If the down amp's portion of the table data is not available when the amp is brought back online, then it would be rebuilt with a procedure called "table rebuild". This procedure reads the fallback copy of the table data from the 3 online amps in the cluster, and creates both the primary and fallback copy of the table data on the new amp.

4) Spoolspace

The DBA sets the spoolspace limit per logon, but TOS sets the spoolspace limit per amp (See "Spoolspace" Tip). When an amp is down, the spoolspace limit per amp remains the same, but the spoolspace limit per logon is decreased by 1 amp's worth. Therefore, on the 3 online amps, the processing will increase, and the spoolspace usage will increase, but the spoolspace limit per amp will remain the same.

5) ASF Archives

When an amp is down, the ASF archive will use the fallback copy. If indexes are also archived, then NUSIs (non-unique secondary indexes) are skipped, and a warning message is displayed. The message also includes the number of the amp that is down.

6) ASF Restore/Copy

ASF Restore does not work when an amp is down. ASF Copy will work and use the online amps in a cluster.

7) Amp Usage Statistics

When the "table rebuild" is executed for the amp, the dbc.acctg table (dbc.ampusage view) is also rebuilt. But this dictionary table does not have fallback, therefore the cpu and diskio statistics for each logon on that amp are lost. This is seen in the Opermenu utility as an unusually high number of logons accessing the system during the hour.

8) 2nd Amp Risk

During the time that 1 amp is down in a cluster, there is a very small risk that a 2nd amp could also go down in the same cluster.

Within the same cluster as the down amp: if a 2nd amp goes down but the table data is recoverable, then system processing is halted, and after the amp is fixed, the system is restarted. But if the table data is not recoverable, then the table data of the entire system must be restored, and then the system restarted.

9) Fastload Configuration Change

Fastload cannot continue when an amp goes offline while the fastload is executing. One of the recovery options is to wait until the original configuration is attained, but this option will probably never be used.


Secondary Index in Teradata

A minimum set of secondary indexes should be determined for each table. These would include the columns described below, which are subdivisions of primary indexes used in table joins. An additional set of secondary indexes could include the columns used for set selection in the where clause, and the non-index columns used for table joins.

TOS (Teradata Operating System) will not subdivide a primary index into its components and do a partial index search or retrieval.

If a unique primary index is composed of several columns, then subsets of those columns are likely candidates for non-unique secondary indexes. When a query joins on a subset of the columns of the primary index, TOS will not subdivide the primary index and use that subset of columns as an index. Instead, a secondary index must be explicitly created on the columns in the subset; the optimizer may then select the secondary index for processing the query.
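For example (a sketch; the database, table, and column names are hypothetical): if a table has the unique primary index (store_id, order_id) and queries join on store_id alone, a secondary index on that subset can be created with:

    CREATE INDEX (store_id) ON mydb.orders;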

The selection of the additional set of indexes depends on the column usage within a query, and on whether the optimizer will select the index even when it is available. The EXPLAIN command provides insight into this selection.


Spool space in Teradata

Spool space is allocated to a logon with a spool space limit parameter. This spool space parameter assigns a spool space limit to each amp; each amp gets an equal share of the logon's limit. Therefore:

                               Logon spool space limit
    Amp spoolspace limit  =  ---------------------------
                                   Number of amps

The message “Failure 2646 No more spool space in username” occurs whenever any amp exceeds its individual spoolspace limit.

For example, if the logon is assigned 150MB of spool space, and there are 50 amps, then each amp will get a 3MB spool space limit. And, the 2646 message will occur if any amp exceeds 3MB of spool space.
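For reference, a DBA sets or changes a logon's limit with a statement like the following (a sketch; the user name and value are hypothetical):

    MODIFY USER etl_user AS SPOOL = 150000000 BYTES;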



Teradata Multiload

Multiload is a Teradata utility that reads a unix file and applies updates (inserts, deletes, updates) to a populated target table.

Multiload is faster than Bteq for updating a populated table: Bteq updates 1 row at a time, whereas Multiload updates blocks of rows at a time.

When Multiload is compared to the Fastload/delete/insert method, Multiload is faster for volumes above 10,000 records. For volumes under 10,000 records, the difference is seconds, and is negligible. Multiload is faster whether the target table has a unique primary index or a non-unique primary index.

The Fastload/delete/insert method of updating populated target tables consists of 3 steps:
 1) Fastload into a temporary table
 2) Bteq delete matching rows from the target table
 3) Bteq insert into the target table. 

Multiload skips the temporary table and directly updates the target table from the unix file.
When loading into an empty target table, both Multiload and Fastload are about the same speed.

Multiload's speed is not affected by the number of rows already in the target table. The speed is affected by the number of update records, and can be affected by the number of error records written to the error journals.

Multiload's upsert performs the same functionality as the delete/insert pair.
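A minimal Multiload upsert script might look like the following. This is a sketch only: the logon string, file path, and all database, table, and column names are hypothetical.

    .LOGTABLE mydb.cust_log;
    .LOGON tdpid/etl_user,password;
    .BEGIN IMPORT MLOAD TABLES mydb.customer;
    .LAYOUT cust_layout;
    .FIELD cust_id   * INTEGER;
    .FIELD cust_name * CHAR(30);
    /* Upsert: update the row if it exists, insert it if it is missing */
    .DML LABEL upsert_cust DO INSERT FOR MISSING UPDATE ROWS;
    UPDATE mydb.customer SET cust_name = :cust_name WHERE cust_id = :cust_id;
    INSERT INTO mydb.customer (cust_id, cust_name) VALUES (:cust_id, :cust_name);
    .IMPORT INFILE /data/cust.dat LAYOUT cust_layout APPLY upsert_cust;
    .END MLOAD;
    .LOGOFF;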


Teradata Fastload


Teradata Fastload is a utility that loads records into an empty table using blocks of records and multiple amps. Fastload is composed of 2 phases: the first phase reads the unix records and writes them to TOS buffers on the amps; the second phase reads the TOS buffers and writes the rows to a TOS database table.

Fastload Performance:

CHECKPOINTS:

Fastload provides the capability to issue a checkpoint in the 2nd phase of the fastload. The checkpoint is an increment of rows being loaded into the table; a checkpoint is taken after each increment of rows. If a fastload process is aborted in the 2nd phase, the fastload can be restarted from the last completed checkpoint.
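A minimal Fastload script with a checkpoint might look like the following. This is a sketch only: the logon string, file path, checkpoint interval, and all database, table, and column names are hypothetical, and the DEFINE assumes a formatted data file.

    LOGON tdpid/etl_user,password;
    BEGIN LOADING mydb.cust_stage
       ERRORFILES mydb.cust_err1, mydb.cust_err2
       CHECKPOINT 100000;
    DEFINE cust_id   (INTEGER),
           cust_name (CHAR(30))
       FILE = /data/cust.dat;
    INSERT INTO mydb.cust_stage (cust_id, cust_name)
       VALUES (:cust_id, :cust_name);
    END LOADING;
    LOGOFF;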

TABLE UPDATE PROCESS:

The fastload table update process is typically composed of 3 steps (a sketch of steps 2 and 3 follows the list):
            1) Fastload:
                        Reads the unix records and loads them into a temporary database table.
            2) Delete:
                        Deletes the rows from the permanent table where the primary indexes of the
                        temporary and permanent tables match.
            3) Insert:
                        Inserts the rows from the temporary table into the permanent table.
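Steps 2 and 3 in Bteq might look like this (a sketch; the table and column names are hypothetical):

    DELETE FROM mydb.customer
    WHERE cust_id IN (SELECT cust_id FROM mydb.cust_stage);

    INSERT INTO mydb.customer
    SELECT * FROM mydb.cust_stage;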






What is Teradata?


Teradata is a Relational Database Management System (RDBMS) for the world's largest commercial databases. Teradata can store and manage many terabytes of data, which makes Teradata a market leader in data warehousing applications.

Through the concept of parallelism, Teradata obtains the ability to manage terabytes of data.


Difference between TOP and SAMPLE in Teradata SQL?

We can see the difference clearly when using an ORDER BY clause.

Eg:- SELECT TOP 10 cust_name FROM customer ORDER BY dept_id;

TOP displays the first 10 rows, in sorted order, from the table.
SAMPLE displays any 10 rows, chosen at random, from the table.
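For comparison, the equivalent SAMPLE query (same hypothetical table):

Eg:- SELECT cust_name FROM customer SAMPLE 10;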


Difference between subquery and correlated subquery in Teradata SQL?

Subquery :- The inner query is executed only once. The inner query executes first, and its output is used by the outer query. The inner query does not depend on the outer query.

Eg:- SELECT cust_name, dept_no FROM Customer WHERE cust_name IN (SELECT cust_name FROM Customer);

Correlated subquery :- The outer query executes first, and for every row of the outer query, the inner query is executed. So the inner query executes as many times as there are rows in the result of the outer query. The inner query can use values from the outer query for comparison, which means the inner and outer queries depend on each other.

Eg:- SELECT cust_name, dept_id FROM cust
WHERE cust_name IN (SELECT cust_name FROM dept WHERE cust.dept_id = dept.dept_id);


Difference between IN and EXISTS in Teradata SQL?

Performance-wise, both are about the same with a small number of records. With a larger number of records, EXISTS is faster than IN.


Mostly, IN is used in the case of subqueries, and EXISTS is used in the case of correlated subqueries.
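A sketch of the two forms (table and column names hypothetical):

Eg:- SELECT cust_name FROM customer
     WHERE dept_id IN (SELECT dept_id FROM dept);

Eg:- SELECT cust_name FROM customer
     WHERE EXISTS (SELECT 1 FROM dept WHERE dept.dept_id = customer.dept_id);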


Teradata SQL Query Optimization Or Performance Tuning

SQL and Indexes:

1) Primary indexes: Use primary indexes for joins whenever possible, and specify all the columns of the primary index in the where clause.

2) Secondary indexes (10% rule rumor): The optimizer does not actually use a 10% rule to determine whether a secondary index will be used. But it is a good estimate: if less than 10% of a table would be accessed via the secondary index, assume the SQL will use the secondary index; otherwise, the SQL execution will do a full table scan.


The optimizer actually uses a "least cost" method: it determines whether the cost of using a secondary index is cheaper than the cost of doing a full table scan. The cost involves cpu usage and diskio counts.

3) Constants: Use constants to specify index column contents whenever possible, instead of specifying the constant once and joining the tables. This may provide a small savings in performance.

4) Mathematical operations: Mathematical operations are faster than string operations (i.e. concatenation), if both can achieve the same result.

5) Variable length columns: The use of variable length columns should be minimized and reserved for exceptions. Fixed length columns should normally be used to define tables.

6) Union: The "union" command can be used to break up a large SQL process or statement into several smaller SQL processes or statements, which run in parallel. But these can then cause spoolspace limit problems. "Union all" executes the SQLs single-threaded.

7) Where in/where not in (subquery): The SQL "where in" is more efficient than the SQL "where not in". It is more efficient to specify constants in these, but if a subquery is specified, then the subquery has a direct impact on the SQL time.


If there is an SQL time problem with the subquery, the subquery can be separated from the original query, as sketched below. This requires 2 SQL statements and an intermediate table: 1) a new SQL statement, which performs the previous subquery function and inserts into the temporary table, and 2) the modified original SQL statement, which no longer has the subquery and instead reads the temporary table.
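A sketch of the separation (all names and the date filter are hypothetical):

    -- 1) New statement: performs the old subquery and saves its result
    INSERT INTO mydb.temp_keys
    SELECT cust_id FROM mydb.orders WHERE order_date > DATE '2000-01-01';

    -- 2) Modified original statement: reads the temporary table instead
    SELECT cust_name FROM mydb.customer
    WHERE cust_id IN (SELECT cust_id FROM mydb.temp_keys);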

8) Strategic semicolon: At the end of every SQL statement there is a semicolon. In some cases, the strategic placement of this semicolon can improve the SQL time of a group of SQL statements, though it will not improve an individual SQL statement's time. A couple of cases: 1) the group's SQL time could be improved if a group of SQL statements share the same tables (or spool files), and 2) the group's SQL time could be improved if several SQL statements use the same unix input file.
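In Bteq, for example, placing the semicolon at the front of the next statement's line bundles the statements into one multi-statement request (a sketch; the table and column names are hypothetical):

    UPDATE mydb.acct SET status = 'A' WHERE acct_id = 1
    ;UPDATE mydb.acct SET status = 'B' WHERE acct_id = 2;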

Reducing Large SQLs:

The following methods can be used to scope down the size of SQLs.

1) Table denormalization: Duplicating data in another table. This provides faster access to the duplicated data, but requires more update time.

2) Table summarization: The data from one or many tables is summarized into commonly used summary tables. This provides faster access to the summarized data, but requires more update time.

3) SQL union: The DBC/SQL Union can be used to break up a large SQL process or statement into several smaller SQL processes or statements, which would run in parallel.

4) Unix split: A large input unix file could be split into several smaller unix files, which could then be input in series, or in parallel, to create smaller SQL processing steps.

5) Unix concatenation: A large query could be broken up into smaller independent queries, whose output is written to several smaller unix files. Then these smaller files are concatenated together to provide a single unix file.

6) Trigger tables: A group of tables, each containing a subset of the keys of the index of an original table. The tables could be created based on some value in the index of the original table. This provides the ability to break up a large SQL statement into multiple smaller SQL statements, but creating the trigger tables requires more update time.

7) Sorts (order by): Although sorts take time, they are always done at the end of the query, and the sort time is directly dependent on the size of the result. Unnecessary sorts should be eliminated.

8) Export/Load: Table data could be exported (Bteq, Fastexport) to a unix file, and updated, and then reloaded into the table (Bteq, fastload, Multiload).

9) C programs/unix scripts: Some data manipulation is very difficult and time-consuming in SQL. These operations could be replaced with C programs/unix scripts. See the "C/Embedded sql" tip.

Reducing Table Update Time:

1) Table update time can be improved by first dropping the table's indexes and then doing the updates. After the updates complete, rebuild the indexes and recollect the table's statistics on the indexes. The best improvement is obtained when the volume of table updates is large in relation to the size of the table, e.g., when more than 5% of a large table is changed. A sketch of the sequence follows.
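A sketch of the drop/rebuild sequence (names hypothetical; exact syntax varies by release):

    DROP INDEX (order_date) ON mydb.orders;
    -- ... perform the large batch of updates ...
    CREATE INDEX (order_date) ON mydb.orders;
    COLLECT STATISTICS ON mydb.orders INDEX (order_date);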


2) Try to avoid dropping a table; instead, delete from the table. Table-related statements (i.e. create table, drop table) are single-threaded through a system permissions table and become a bottleneck. They can also cause deadlocks on the dictionary tables. Also, any user permissions specific to the table are dropped when the table is dropped, and those permissions must be recreated.


Fallback

Whenever we use fallback on a table, there are two copies of the table: one is the Primary copy and the other is the Fallback copy.

If an amp failure occurs while fallback is in use, access is switched immediately to the fallback copy, and any modifications made to the primary tables are saved in a journal.

Then, whenever the failure is fixed, the amp is brought back online and the primary tables are updated from the journal.

Suppose we are not using fallback and an amp failure occurs. Then access to the non-fallback tables must be stopped, a field engineer must be called so the failure can be fixed or the hardware replaced, and some intervention is needed to recreate the non-fallback tables.


If a non-fallback table was being read or updated when the failure occurred, then the restoration also requires some intervention to restart the update process.

The usage of disk space is doubled when fallback is turned on for a table, because every row in the primary copy is duplicated in the fallback copy.

If we are not using fallback, then the performance of SQL updates (while all amps are online), Fastloads, and Multiloads can be improved.

Non-fallback provides less disk space usage and performance improvements. In summary, fallback provides instant recovery, while non-fallback needs intervention for recovery.
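Fallback is declared when the table is created (a sketch; all names are hypothetical):

    CREATE TABLE mydb.customer, FALLBACK
      (cust_id   INTEGER,
       cust_name CHAR(30))
    PRIMARY INDEX (cust_id);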

The permspace for each table is distributed differently. Each row of the Primary copy is distributed by the primary index. Each row of the Fallback copy is distributed by the primary index to a different amp within the same cluster. This means that the permspace used by each copy is distributed differently.

The distribution efficiency of a table is measured by the following:

            Average Permspace across all amps
            -------------------------------------------------
            Maximum Permspace on 1 amp

The distribution efficiency of the total of the Primary and Fallback permspaces is not a true indicator of the distribution efficiency of the Primary copy. In fact, the distribution efficiency of the Primary copy could be perfect (100%), but the distribution efficiency of the Fallback copy could look bad (70%).

Therefore, the only way to truly measure the distribution efficiency of a table is to temporarily drop the Fallback copy, and measure the Primary permspace distribution.


Teradata Interview Questions

What is Teradata BTEQ?
What are the Teradata objects?
What are the Teradata tools and utilities?
What is the difference between V2R5 and V2R6?
How to find the average salary without using the AVG function?
Why does Multiload support only NUSI?
Any three differences between Teradata 12 and Teradata 13?
What is a join index? What are the types of join index?
What are Teradata Fastload, Multiload and Tpump?
What is error handling in Bteq?
What is skewness in Teradata?
Difference between Fastload and Multiload in Teradata?
Difference between a general DELETE and a Multiload DELETE in Teradata?
Difference between BTEQ export and Fastexport?
When will we use the NULLIF function?
What is the use of the COALESCE function?
What is the role of a Secondary Index (SI), and what are its types?
How many primary keys can we apply on a table?
Difference between SMP and MPP?
Explain the Shared Nothing architecture.
Difference between subquery and correlated subquery?
What are the join strategies?
What are the limitations of Fastload and Multiload?
Explain the different phases of Fastload and Multiload.


Teradata Corp Office Locations

The following are the Teradata Corporation office addresses for your reference.



Teradata Argentina Office:
Av. Corrientes 746 – Piso 14
C1043AAU, Buenos Aires, Argentina

Teradata Canada Office:
6303 Airport Road, Suite 300
Mississauga, On.
L4N 1R8

Teradata San Diego Office :
17095 Via del Campo
San Diego, CA 92127

Teradata Colombia Office:
Transversal 23 No. 97-73 Piso 2o.
Bogota, D.C.

Teradata Government Systems:
430 National Business Parkway
Suite 450
Annapolis Junction, MD 20701

Teradata World Headquarters Office:
10000 Innovation Drive
Dayton, OH 45342

Teradata Atlanta Office :
11695 Johns Creek Parkway
4th Floor
Johns Creek, GA 30097

Teradata Mexico Office:
Av. Paseo de las Palmas 239, Piso 5
Col. Lomas de Chapultepec
11000 Mexico, D.F.

Teradata Chile Office:
Cerro Colorado 5240 
Torre del Parque II, Piso 16
Las Condes, 7560995 
Santiago, Chile

Teradata Brazil Office :
Rua Olimpíadas, 205, 1o. Andar, Cj. 13
04551-000 São Paulo, Brasil

Teradata Singapore Office : 
300 Beach Road
#18-02 The Concourse
Singapore 199555

Teradata Pune Office:
Tower 12 Cyber City, 
Magarpatta City,
Hadapsar, Pune - 411028

Teradata Hyderabad Office : 
Auditorium, 3rd Floor,
Ascendes, "The V Park"
Plot #17, Software Units Layout,
Madhapur, Hyderabad 81

Teradata Bangalore Office :
No.9, Eterna Level - 2
Above HDFC Building -- Near Raheja Arcade
Koramangala, Bangalore-560095. India


Teradata Database Versions

The following are the Teradata Database versions.

1. Teradata v2r5
2. Teradata v2r6
3. Teradata 12
4. Teradata 13
5. Teradata 13.10
6. Teradata 14

Teradata 14 is the latest version; Teradata 14.10 is the upcoming version.


Teradata University Network

Teradata University Network (TUN) is a free, web-based portal and resource for teaching and learning business intelligence and data warehousing courses. You may only access this network if you are authorised by Teradata. If you are a Teradata-authorised person, you can register in this network as a faculty member or student.


Teradata Certification

The following are the list of Teradata Certification exams.

1. Teradata Basics
2. Teradata Physical Implementation
3. Teradata SQL
4. Teradata Database Administration(Teradata DBA)
5. Teradata Design Architecture
6. Teradata Application development

If you complete all these Teradata certification exams, you become a Teradata Master. If you decide to take these exams, log in to Prometric.com and select "Schedule a test", then select the area of study "Information Technology", and then select the testing program "NCR Teradata". This gives you the list of Prometric centers; select a nearby Prometric center and make an appointment to take the exam.

At present, Teradata 12 certifications are the latest ones in Prometric centers.


How to find average sal without using avg function?

We can find the average salary by using "SUM(salary)/COUNT(salary)", as sketched below.
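A sketch (table and column names hypothetical; note that dividing two integers in Teradata truncates, so a CAST may be needed to match AVG exactly):

    SELECT SUM(salary) / COUNT(salary) FROM employee;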


How many columns can be there in a table?

A maximum of 2048 columns per table.


Teradata Benchmark Query Driver

Teradata Benchmark Query Driver (TdBench) provides a set of tools that form a framework for executing benchmarks driven by a PC or Windows server and for reporting the results using the Database Query Log (DBQL). It also helps you compare performance within a data warehouse:

- before/after a new release of the database.
- before/after changes to the indexes.
- relative performance of a new database platform.

There are also tools for extracting a unified set of tables and queries from DBQL to define the benchmark.


About Teradata Database

Teradata provides Data warehousing, Business Intelligence and CRM solutions.