TPUMP in Teradata


TPUMP is a flexible multi-session load utility.  The only characteristic of TPUMP that limits its use is its transaction-oriented nature.  If the situation allows execution of FastLoad or MultiLoad, those utilities will always outperform TPUMP in a large batch operation when measuring Teradata resource consumption and response time.  Even so, consideration of TPUMP in a load scenario should occur before that of FastLoad or MultiLoad, because TPUMP has none of the disadvantages of MultiLoad (table locking, load slot usage, single-AMP operations), FastLoad (empty table only, load slot usage), or BTEQ (no checkpointing or restartability).

Best practices for the development of TPUMP applications:

  • Logtable and error tables are job specific.  
  • The BTEQ clean up step drops the logtable and prepares the target table.
  • The file layout section should completely define the input file.  .FILLER should be replaced with .FIELD, and any NULLIF logic should be completed in the DML section of the TPUMP script.
  • If the script is doing complex updates, ROBUST ON is in the "BEGIN LOAD" statement.  
  • If the script is doing a high volume (due to high pack and/or sessions) of simple inserts and the checkpoint is not 1, ROBUST ON is in the "BEGIN LOAD" statement.   
  • The PACK factor is lower than the maximum (32K block). You will receive a warning if it is too high.
  • SERIALIZE is on for UPSERT scripts.
  • If maintaining a table whose primary index is not unique (a NUPI), you probably need to use the SERIALIZE ON feature to prevent blocking.
  • When using SERIALIZE ON, be sure to type ‘KEY’ next to each of the primary index fields in the INFILE_LAYOUT section.
  • All dates are formatted.
  • All 6-digit dates in non-character fields are converted to CHAR(06) so the sliding date rule can be applied properly.
  • The number of sessions is appropriate.
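
The skeleton below ties these practices together. It is only a sketch: the log table, error table, database, field, and file names are hypothetical, and the option values and ordering (SESSIONS, PACK, etc.) would be tuned to the job and site standards.

.LOGTABLE  utildb.cust_tpump_log;          /* job-specific log table            */
.LOGON     tdpid/loaduser,password;

.BEGIN LOAD
     SESSIONS   8
     PACK       20
     ROBUST     ON                          /* complex updates or high volume    */
     SERIALIZE  ON                          /* required for UPSERT / NUPI work   */
     ERRORTABLE utildb.cust_tpump_err;      /* job-specific error table          */

.LAYOUT infile_layout;
     .FIELD cust_id    * CHAR(10) KEY;      /* KEY on each primary index field   */
     .FIELD cust_name  * CHAR(30);
     .FIELD updt_dt    * CHAR(06);          /* 6-digit date carried as CHAR(06)  */

.DML LABEL upsert_cust
     DO INSERT FOR MISSING UPDATE ROWS;     /* UPSERT logic                      */
     UPDATE targetdb.customer
        SET cust_name = :cust_name
      WHERE cust_id   = :cust_id;
     INSERT INTO targetdb.customer (cust_id, cust_name)
     VALUES (:cust_id, :cust_name);

.IMPORT INFILE custfile.dat
     LAYOUT infile_layout
     APPLY  upsert_cust;

.END LOAD;
.LOGOFF;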


No Primary Index Tables


The syntax for the CREATE TABLE statement has been changed to permit user data tables to be created without a primary index. Such tables are referred to as NoPI (No Primary Index) tables.

This feature provides a performance advantage when using FastLoad or TPump Array INSERT to load data into staging tables. Because NoPI staging tables have no row-ordering constraints, the system can always append rows to the end of a NoPI table. Rows in a NoPI table can also be stored on any AMP, which is advantageous for TPump Array INSERT operations because many rows can then be packed into a single AMP step, thus dramatically reducing the performance burden on both CPU and I/O. After a NoPI staging table has been populated, the table can be processed further using SQL DML statements such as DELETE, INSERT, and SELECT.

While using a NoPI table, you can:
• Manipulate rows directly using most SQL DML statements or you can move its rows into a primary-indexed target table using INSERT… SELECT, MERGE, or UPDATE…FROM SQLs.
• Create unique secondary indexes to avoid full-table scans during row access. For example, while single-AMP retrieval of NoPI rows by means of their primary index is not possible, you can work around this through the appropriate assignment of unique secondary indexes (USIs) to NoPI tables, and by careful construction of request conditions to specify those USIs to access individual NoPI table rows.
• Create nonunique secondary indexes (NUSIs) to facilitate set processing retrieval of rows from NoPI tables.
• Avoid full-table scans when deleting a set of rows from a NoPI table by assigning secondary indexes and specifying them in your request conditions.
Benefits:
• Enhanced performance for FastLoad bulk data loads into staging tables.
• Enhanced performance for TPump Array INSERT minibatch loads into staging tables.

Points to note:
• The absence of a primary index or secondary index in NoPI tables means that all row access is done using full-table scans.
• The drawback to using secondary indexes for NoPI tables is that while they can enhance query processing significantly, they can also reduce load performance.
• You cannot modify a NoPI table using SQL UPDATE or UPSERT.
• This feature introduces a new DBS Control flag, PrimaryIndexDefault, that determines the behavior of a CREATE TABLE statement that does not explicitly specify any of the following:
• PRIMARY INDEX clause
• NO PRIMARY INDEX clause
• PRIMARY KEY or UNIQUE constraints
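
As a minimal sketch of the NoPI staging pattern described above (the database, table, and column names are hypothetical): create the staging table with the NO PRIMARY INDEX clause, load it, then move the rows into a primary-indexed target with INSERT ... SELECT.

CREATE TABLE stage_db.sales_stage
( sale_id  INTEGER
, sale_dt  DATE
, amount   DECIMAL(12,2)
)
NO PRIMARY INDEX;

/* After FastLoad or TPump Array INSERT has populated the staging table */
INSERT INTO target_db.sales
SELECT sale_id, sale_dt, amount
FROM   stage_db.sales_stage;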



Derived Tables

Derived tables can be a powerful technique to produce efficient queries, but they can also cause major performance problems when used in inappropriate situations.


Here are some guidelines for the use of derived tables:

Never use a derived table to simply restrict the records of a large table prior to joining it to some other table. Doing this prevents the optimizer from using statistics on the table when it is subsequently joined to another table, since the derived table is pulled into a spool file, and this spool file will not have statistics available to the optimizer to prepare downstream joins.
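
For illustration (the orders and customer tables are hypothetical), compare the two forms; the restriction belongs in the main query, not in a derived table:

/* Avoid: the derived table is spooled, and the spool has no statistics
   for the optimizer to use when planning the join.                      */
SELECT dt.order_id, c.cust_name
FROM   ( SELECT order_id, cust_id
         FROM   orders
         WHERE  order_dt >= DATE '2020-01-01' ) AS dt
JOIN   customer c
  ON   c.cust_id = dt.cust_id;

/* Prefer: apply the restriction directly, so statistics collected on
   orders remain available for join planning.                            */
SELECT o.order_id, c.cust_name
FROM   orders o
JOIN   customer c
  ON   c.cust_id = o.cust_id
WHERE  o.order_dt >= DATE '2020-01-01';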



Explain Plan in Teradata

When developing queries, always perform an EXPLAIN on the query before you run it.  The EXPLAIN output gives a lot of information about the way the optimizer will execute the query.

To perform an EXPLAIN, simply add the EXPLAIN keyword in front of your SELECT/INSERT/UPDATE/DELETE statement and execute it.
The EXPLAIN statement helps identify potential performance issues; it analyzes the SQL and breaks it down into its low-level processing steps. Unfortunately the output can be difficult for an untrained person to understand, but there are some points to recognize: confidence levels and product joins.
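
For example (the orders and customer tables are hypothetical), prefix the statement with EXPLAIN and scan the output for phrases such as "no confidence", "low confidence", "high confidence", and "product join":

EXPLAIN
SELECT o.order_id, c.cust_name
FROM   orders o
JOIN   customer c
  ON   c.cust_id = o.cust_id
WHERE  o.order_dt >= DATE '2020-01-01';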



Advantages of OLAP Functions over Subqueries in Teradata


Significance of OLAP Features:
  1. The execution time of a query can be reduced significantly by using OLAP features.
  2. Less complex queries can be formulated with OLAP functions.

Description of the table [PRDT_RECVD_DT_TBL] used:

The table contains 5 columns, namely PRODUCT, BRAND, CATEGORY, SALES_CENTER (i.e. store), and PRDT_RCVD_DATE. Consider 'Product, Brand, Category' together as a MERCHANDISE.

High level Requirement Specification:

The ultimate goal is to compute the Earliest Product Received Date, which is the most recent date on which the majority of the sales centers received the goods at the merchandise level.
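
The full "majority of sales centers" rule is not reproduced here, but the general advantage can be sketched against PRDT_RECVD_DT_TBL: a per-merchandise date that would otherwise need a correlated subquery (an extra scan per group) can be computed in a single pass with a window (OLAP) function. The logic below is illustrative only.

/* Subquery style: correlated scan per merchandise group */
SELECT t.PRODUCT, t.BRAND, t.CATEGORY, t.PRDT_RCVD_DATE
FROM   PRDT_RECVD_DT_TBL t
WHERE  t.PRDT_RCVD_DATE =
       ( SELECT MIN(s.PRDT_RCVD_DATE)
         FROM   PRDT_RECVD_DT_TBL s
         WHERE  s.PRODUCT  = t.PRODUCT
           AND  s.BRAND    = t.BRAND
           AND  s.CATEGORY = t.CATEGORY );

/* OLAP style: one pass using a window aggregate */
SELECT PRODUCT, BRAND, CATEGORY, SALES_CENTER, PRDT_RCVD_DATE,
       MIN(PRDT_RCVD_DATE) OVER
         ( PARTITION BY PRODUCT, BRAND, CATEGORY ) AS earliest_rcvd_dt
FROM   PRDT_RECVD_DT_TBL;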



Teradata Date and Time functions

The following are the date and time functions in Teradata.

DATE or CURRENT_DATE - Both display the current date.
ADD_MONTHS(date, 3) - Adds 3 months to the current date.
ADD_MONTHS(date, -2) - Subtracts 2 months from the current date.
TIME or CURRENT_TIME - Both display the current time.
EXTRACT(DAY FROM date) - Extracts and displays the day.
EXTRACT(MONTH FROM date) - Extracts and displays the month.
EXTRACT(YEAR FROM date) - Extracts and displays the year.
CURRENT_TIMESTAMP - Displays the combination of both date and time.
EXTRACT(HOUR FROM time) - Extracts and displays the hour.
EXTRACT(MINUTE FROM time) - Extracts and displays the minute.
EXTRACT(SECOND FROM time) - Extracts and displays the second.
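
A quick way to try several of these functions in one query (no user tables are referenced):

SELECT CURRENT_DATE                      AS today
     , ADD_MONTHS(CURRENT_DATE, 3)       AS plus_3_months
     , ADD_MONTHS(CURRENT_DATE, -2)      AS minus_2_months
     , EXTRACT(YEAR FROM CURRENT_DATE)   AS yr
     , EXTRACT(MONTH FROM CURRENT_DATE)  AS mth
     , EXTRACT(DAY FROM CURRENT_DATE)    AS dy
     , CURRENT_TIMESTAMP                 AS right_now;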


What are the criteria for choosing the best Primary Index?

Choose the primary index carefully, because it affects data storage and performance.

The following are important tips for choosing the primary index.

1. Data Distribution.
You need to analyze the number of distinct values in the table. A primary index column with fewer null values and more distinct values gives better distribution and therefore better performance (see the distribution-check queries after this list).

2. Access frequency.
The column should be frequently used in the WHERE clause during row selection.
The column should also be one that is frequently used in the join process.

3. Volatility.
The column's values should not change frequently.
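
One way to evaluate a candidate column (the customer table and cust_id column are hypothetical) is to check its distinct-value count and how evenly its rows hash across AMPs, using the built-in HASHROW, HASHBUCKET, and HASHAMP functions:

/* Distinct values vs. total rows for the candidate column */
SELECT COUNT(DISTINCT cust_id) AS distinct_vals
     , COUNT(*)                AS total_rows
FROM   customer;

/* Rows per AMP for the candidate column; heavy skew suggests a poor PI */
SELECT HASHAMP(HASHBUCKET(HASHROW(cust_id))) AS amp_no
     , COUNT(*)                              AS row_cnt
FROM   customer
GROUP BY 1
ORDER BY 2 DESC;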


Decimal datatype issue with BTEQ

Suppose you have a table called employee with a column salary declared as DECIMAL(35,0).

Query: Select salary from employee;


The above query works fine in SQL Assistant, but if you run it through BTEQ, the output contains a decimal point (dot) at the end. See below.

salary
******
98765432109876543210987654321098765432.

If you try FORMAT '99999......999999', you will get an error because that format allows only 30 digits. An alternative solution is to use FORMAT 'Z(35)' or FORMAT '9(35)' instead, so that the DECIMAL(35,0) value is displayed without the dot at the end when the query is run through BTEQ.
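
A sketch of the workaround, using the employee table from the example above (the FORMAT phrase takes effect in BTEQ's field mode):

SELECT salary (FORMAT 'Z(35)') FROM employee;
SELECT salary (FORMAT '9(35)') FROM employee;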


NULLIF function

If its arguments are equal, NULLIF returns NULL; otherwise it returns its first argument, scalar_expression_1.

NULLIF is shorthand for the following full CASE expression:
CASE  WHEN scalar_expression_1=scalar_expression_2  THEN NULL  ELSE scalar_expression_1  END

The following examples show queries on the following table:
CREATE TABLE Membership
(FullName CHARACTER(39)
,Age SMALLINT
,Code CHARACTER(4) );

Example 1:
The following is the ANSI-compliant form of the Teradata SQL NULLIFZERO(Age) function and is more versatile.
SELECT FullName, NULLIF (Age,0) FROM Membership;

Example 2:
In the following query, blanks indicate no value.
SELECT FullName, NULLIF (Code, ' ') FROM Membership;

Example 3:
The following example uses NULLIF in an expression with an arithmetic operator.
SELECT NULLIF(Age,0) * 100;


List of Interview Questions On Collect Statistics

1. If statistics have been collected on a table and we delete the table's data using a DELETE statement, are those statistics deleted as well?
Ans: No, the collected statistics on the table will not be deleted. They must be dropped manually using the DROP STATISTICS statement.

2. Where are these collected statistics actually stored?
Ans: Collected statistics are stored in DBC and can be viewed through views such as IndexStats, ColumnStats, and MultiColumnStats.

3. What is the purpose of the COLLECT STATISTICS command?
Ans: The purpose is to gather and store demographic data for indexes, join indexes, or one or more columns of a table. If you want the optimizer to work with high confidence, you should COLLECT STATISTICS on those columns or indexes.


Advantages of Partition Primary Index

· Range queries can avoid full-table scans.
· PPI is an alternative to a secondary index and often provides a better solution than secondary indexes.
· PPI can make modeling and querying easier.
· FastLoad and MultiLoad cannot be used with secondary indexes, but both work with PPI tables.
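
A minimal sketch of a PPI table and a range query that benefits from partition elimination (the database, table, columns, and date range are hypothetical):

CREATE TABLE sales_db.daily_sales
( sale_id  INTEGER
, store_id INTEGER
, sale_dt  DATE
, amount   DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N(sale_dt BETWEEN DATE '2010-01-01'
                             AND     DATE '2010-12-31'
                             EACH INTERVAL '1' MONTH);

/* The range predicate lets the optimizer touch only the March partition */
SELECT store_id, SUM(amount)
FROM   sales_db.daily_sales
WHERE  sale_dt BETWEEN DATE '2010-03-01' AND DATE '2010-03-31'
GROUP BY 1;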


The POSITION Function

The POSITION function returns the position of the first occurrence of a character string within another string. POSITION is an ANSI-standard function.

Teradata has an equivalent function called INDEX.

Both the POSITION and INDEX functions return the position of a character string's first occurrence within another string.

Examples for the POSITION function

SELECT POSITION( 'u' IN 'formula');  Displays Result as '5' 
SELECT POSITION( 'fo' IN 'formula'); Displays Result as '1' 
SELECT POSITION( 'e' IN 'formula');  Displays Result as '0'  

Examples for the INDEX function.

SELECT INDEX('formula', 'u');    Displays Result as '5'   
SELECT INDEX('formula', 'mu');   Displays Result as '4'   
SELECT INDEX('formula', 'e');    Displays Result as '0'   


MERGE INTO Statement

The MERGE INTO statement 'merges' a source row into a target table based on whether any rows in the target table match a specified condition with the source row.

MERGE INTO replicates the functionality of the Upsert capability provided by the UPDATE with INSERT option. This option updates a row if it exists and inserts the row if it does not exist.

Example 1:

Add a new row for Department 700 using the MERGE INTO form of Upsert. 

MERGE INTO department
USING VALUES (700,'Shipping',800000.00)
      AS Dept (deptnum, dept_name, budgamt)
ON Dept.deptnum = dept_num
WHEN MATCHED THEN UPDATE
SET budget_amount = Dept.budgamt
WHEN NOT MATCHED THEN INSERT
VALUES (Dept.deptnum, Dept.dept_name,Dept.budgamt,NULL);

SELECT  dept_num AS Dept
, department_name (CHAR(10))AS Dept_Name
, budget_amount
, manager_employee_number
FROM department
WHERE  dept_num  = 700;

  Dept  Dept_Name    budget_amount  manager_employee_number
  ------  ----------      ------------------   -----------------------
   700  Shipping        800000.00                        ?

Things to notice about this example:
This example does an insert of a 'Shipping' department row. 
The USING clause defines the source row to be merged. 
The AS clause defines the name of the temporary source table and columns which contain the source row. This is also referred to as the 'correlation table'. 
The ON clause followed by condition returns the matching rows in both the source and target tables.
It must use the primary index of the target table. 
If the Primary Index is non-unique, additional qualifying conditions are required to reduce the test to a single target row.

Example 2:

Update the budget for Department 700 to be $9,900,000 using the MERGE INTO form of Upsert. 

MERGE INTO department
USING VALUES (700,'Shipping',9900000.00) AS Dept (deptnum, dept_name, budgamt)
ON Dept.deptnum = dept_num
WHEN MATCHED THEN UPDATE
SET budget_amount = Dept.budgamt
WHEN NOT MATCHED THEN INSERT
VALUES (Dept.deptnum, Dept.dept_name,Dept.budgamt,NULL);

SELECT  dept_num  AS Dept
, department_name (CHAR(10))AS Dept_Name
, budget_amount
, manager_employee_number
FROM department
WHERE  dept_num  = 700;

  Dept  Dept_Name    budget_amount  manager_employee_number
  ------  ------------    ------------------   ---------------------------------
  700    Shipping       9900000.00                        ?

Things to notice about this example:
This example does an update of the previously inserted 'Shipping' department row. 



Hashing Algorithm

When the primary index value of a row is input to the hashing algorithm, the output is called the “row hash”. The row hash is the logical storage address of the row and identifies the AMP that owns the row. The “table id” plus the row hash identifies the cylinder and data block, and is used for row distribution, placement, and retrieval of the row. Data distribution depends on the uniqueness of the row hash.

The “table id” is a sequential number assigned whenever a table is created. This number changes whenever a table is re-created.

“Hash code redistribution” is used in join operation. This is used when the foreign key (join column) of a table (i.e. table A) is joined to a primary index of another table (i.e. table B). For each table A row, the row hash of the foreign key is calculated. Then, the table A row is sent to the amp dictated by the row hash, which is the same amp that contains table B’s row for that row hash.

“Join column hash code sequence” is the result of a sorting. The row hash of the foreign key (join column) of a table (i.e. table A) is sorted into this sequence. These are matched in sequence to the other table (i.e. table B) on the same amp.
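
The built-in HASHROW, HASHBUCKET, and HASHAMP functions expose this mapping and are a handy way to see where rows land (the customer table and cust_id column are hypothetical):

SELECT HASHROW(cust_id)                      AS row_hash
     , HASHBUCKET(HASHROW(cust_id))          AS hash_bucket
     , HASHAMP(HASHBUCKET(HASHROW(cust_id))) AS owning_amp
FROM   customer
SAMPLE 5;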


User Defined Functions

CREATE FUNCTION: compiles, links, and adds a new function to a database. The CREATE FUNCTION privilege enables a user to create a function. This privilege is not granted automatically and must be granted explicitly.


Syntax: GRANT CREATE FUNCTION ON Developer1 TO Developer1; Allows ‘Developer1’ to create functions in their own database.

DROP FUNCTION: removes the function and unlinks it from its shared library.
The DROP FUNCTION privilege enables a user to DROP or REPLACE a previously created function. It is granted automatically WITH GRANT OPTION to the creator of a database or function, and granted automatically to a database or user when it is created.


Syntax: GRANT DROP ON SPECIFIC FUNCTION Parse_Text TO develop1;
Allows ‘develop1’ to drop or replace the Parse_Text function.


ALTER FUNCTION - performs either or both of the following:
 - Controls whether an existing function runs in protected mode as a separate process or in non-protected mode as part of the database.
 - Recompiles or relinks the function and redistributes it.
The ALTER FUNCTION privilege enables a user to ALTER a function. This privilege is not granted automatically and must be granted explicitly; it should be held only by the DBA.


Syntax: ALTER FUNCTION Parse_Text EXECUTE NOT PROTECTED;
Change the function from protected to non-protected mode.
ALTER FUNCTION Parse_Text COMPILE; Recompile the function without changing the protection mode.


EXECUTE FUNCTION privilege: enables a user to invoke a function in an SQL statement. It allows a user to execute either specific functions or all functions in a specified database. There is no EXECUTE FUNCTION command.
This privilege is not granted automatically to the creator of a database or user and must be granted explicitly. A user can execute any function they create, in protected mode.
GRANT EXECUTE FUNCTION On SYSLIB TO Usera;
Allows “Usera” to execute all functions in database SYSLIB.
GRANT EXECUTE FUNCTION ON SPECIFIC FUNCTION Develop.Parse_Text TO Userc;
Allows “Userc” to execute the one function called ‘Parse_Text’ located in database “Develop”


RENAME FUNCTION: permits renaming either a function name or a specific function name.
Syntax: RENAME SPECIFIC FUNCTION Match_Text TO Scan_Text;
RENAME FUNCTION Imagine_numbers(FLOAT, FLOAT) TO Imaginary_numbers;
COMMENT ON:
Syntax: COMMENT [ON] [ FUNCTION ] <object_name> [AS] ‘<comment>’
To create a comment on a function, use the FUNCTION keyword and specify the function name.
COMMENT ON FUNCTION Parse_Text ‘Scans for text using a pattern’;
To comment on a function parameter, use the COLUMN keyword option.
COMMENT ON COLUMN Parse_Text.source_text AS ‘The text to be scanned’;
Limitations of User Defined Functions:
 - UDFs cannot execute any SQL.
 - UDFs cannot execute any Standard I/O functions.
 - UDFs cannot be used to calculate the value of a Primary Index column for an INSERT statement.
 - UDFs cannot be used to calculate the value of a Primary Index column in a Where clause to Select a Row.



LOB in Teradata

It is possible to create functions that operate on large object data types. The following library functions exist to provide the ability to use LOBs as function input or output:

FNC_GetLobLength - acquire Lob length
FNC_LobAppend - add to Lob in memory
FNC_LobClose - close a context to a Lob
FNC_LobOpen - open a context to a Lob
FNC_LobRead - read some or all
FNC_LobRef2Loc - convert a Lob Reference to a Locator 
FNC_LobLoc2Ref - convert a Locator to a Lob Reference

Example :
The CREATE FUNCTION syntax permits BLOBs or CLOBs to be the data type of an argument or a return value:

CREATE FUNCTION JPEG_CROP
(
Image BLOB AS LOCATOR, 
Width INTEGER, 
Height INTEGER, 
X_Offset INTEGER, 
Y_Offset INTEGER
)
RETURNS BLOB AS LOCATOR
LANGUAGE C
NO SQL
EXTERNAL;


Recursive Triggers

A recursive trigger contains a triggered action which causes the triggering action to reoccur. For example, assume a row trigger is defined with a triggering action that inserts a row into a table. If the triggered action also inserts into the same table, it will cause the trigger to fire again. This will continue until one of two events occur:

The WHEN clause logic in the trigger definition causes the cycle to end. 
Sixteen levels of recursion occur. 

The system automatically stops the recursion after sixteen levels to prevent an infinite loop.
Example 1:
CREATE TABLE table1_recur
(col_a INT);
CREATE TRIGGER recur_1
AFTER INSERT ON table1_recur
REFERENCING NEW AS t1
FOR EACH ROW
WHEN (t1.col_a < 10)
INSERT INTO table1_recur (t1.col_a + 1);
INSERT INTO table1_recur( 7 );

SELECT * FROM table1_recur;
      col_a
-----------
          7
          8
          9
         10

DELETE FROM table1_recur;

Now, we will remove the WHEN clause, which controls the stopping of the recursion. In theory, this could produce an infinite loop; however, the Teradata trigger implementation forces the recursion to roll back with a failure after 16 iterations.

Example 2:
REPLACE TRIGGER recur_1
AFTER INSERT ON table1_recur
REFERENCING NEW AS t1
FOR EACH ROW
INSERT INTO table1_recur (t1.col_a + 1);
INSERT INTO table1_recur( 7 );
*** Failure 5848 Statement# 1, Info =0 

SELECT * FROM table1_recur;
*** Query completed. No rows found. 

Note that no rows are found in the table because the failure of the transaction has caused all activity to be rolled back. 


Limitations of Partition Primary Index

1. A partitioned primary index (PPI) requires two extra bytes per row to store the partition number.
2. A partitioned primary index cannot be created on volatile tables, global temporary tables, or compressed join indexes.
3. A PPI table can have a unique primary index (UPI) only if the partitioning columns are part of the primary index; otherwise the primary index must be a NUPI. If the partitioning columns were not part of the index, enforcing uniqueness would require checking for a duplicate key value in every partition, which would be very expensive; when they are included, the uniqueness check is confined to a single partition.


Teradata NULL functions


The following are the NULL functions in Teradata sql:

1) NULLIFZERO
This will change a zero value to a NULL value.

2) NULLIFSPACES
This will change a space value to a NULL value.

3) ZEROIFNULL
This will change a NULL value to a zero.

4) NULLIF = 0
This will change a zero value to a NULL.
ex: define c1(decimal (5,2), NULLIF = 0);

5) NULLIF = ' '
This will change a space value to a NULL.
ex: define c1(char (2), NULLIF = ' ');

6) NULLIF = '?'
This will change a '?' value to a NULL.
ex: define c1(char (2), NULLIF = '?');

7) NULLIF = '40616263'XC
The '40616263'XC is a hexadecimal representation of '@abc'. Therefore, this will change a '@abc' value to a NULL.
ex: define c1(char (4), NULLIF = '40616263'XC );

8) CASE WHEN :c1 = '?' THEN NULL ELSE :c1 END
This will change a '?' to a NULL.

9) NULLIF(:c1,'?')  
This will change a '?' to a NULL. This is an abbreviation of the case command.

10) CASE WHEN c1 IS NOT NULL THEN c1 ELSE 0 END
This changes a null value to a zero.
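
A sketch that combines several of these in one query (the orders table and its columns are hypothetical):

SELECT NULLIFZERO(discount_pct)                      AS discount_or_null   /* 0    -> NULL */
     , ZEROIFNULL(ship_cost)                         AS ship_cost_or_zero  /* NULL -> 0    */
     , NULLIF(status_cd, '?')                        AS status_or_null     /* '?'  -> NULL */
     , CASE WHEN qty IS NOT NULL THEN qty ELSE 0 END AS qty_or_zero        /* NULL -> 0    */
FROM   orders;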


Teradata data types

Teradata data types can be categorized as follows.

1) Character data:
This is character data that can be fixed or variable length.
The data types can be CHAR, VARCHAR, CHAR VARYING, LONG VARCHAR.

2) Arithmetic Numeric data:
This is numeric data that can have arithmetic operations performed on it.
This numeric data is usually money or quantity values.
The data types can be DATE, DECIMAL, NUMERIC, BYTEINT, SMALLINT, INTEGER, REAL, DOUBLE PRECISION, or FLOAT.
The arithmetic operations can be:
SUM(t1.c1), MAX(t1.c1), t1.c1 + t2.c1, etc.

3) Non-Arithmetic Numeric data:
This is numeric data that will not have arithmetic operations performed on it.
This numeric data usually consists of code or ID values, and usually is not money or quantity values.
The data types can be DATE, DECIMAL, NUMERIC, BYTEINT, SMALLINT, INTEGER, REAL, DOUBLE PRECISION, or FLOAT.

4) Byte  data:
This is bit oriented data, with hexadecimal orientation of values.
This data type is not currently used in the EDW application tables.
The data types can be BYTE, VARBYTE.
This data will not be covered as part of this NULL topic.

5) Graphic data:
This is Japanese Kanji oriented data, with a mixture of single and multiple bytes representing the Japanese language.
This data type is not currently used in the EDW application tables.
The data types can be GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC.
This data will not be covered as part of this NULL topic.
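
For illustration, a single table definition covering most of these categories (the database, table, and column names are hypothetical):

CREATE TABLE demo_db.datatype_demo
( item_cd     CHAR(4)         /* character data                      */
, item_desc   VARCHAR(100)    /* variable-length character data      */
, unit_price  DECIMAL(12,2)   /* arithmetic numeric (money)          */
, qty_on_hand INTEGER         /* arithmetic numeric (quantity)       */
, region_id   SMALLINT        /* non-arithmetic numeric (code value) */
, load_dt     DATE            /* date                                */
, row_flags   BYTE(2)         /* byte (bit-oriented) data            */
);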


Teradata Architecture

Teradata Architecture diagram



Here the basic and most important component is the PE (Parsing Engine), and a PE can support up to 120 sessions.

1) Session Control : It checks authorization of users before processing any SQL query.

2) Parser : It checks syntax of the SQL query and access rights to the user to access various database objects referred in the SQL query.

3) Optimizer : Determines the most effective way to perform the actions requested by the SQL query, by creating a plan of execution steps to follow against the database objects, and scans the SQL query to determine where to place locks.

4) Dispatcher : It controls the sequence in which the steps are executed and passes the execution steps to the AMPs via the BYNET. The Dispatcher also plays an important role by combining all the responses received and sending them back to the user.

The next component is the BYNET. It is used for communication between PEs and AMPs (Access Module Processors). There are two BYNETs available in any Teradata environment: BYNET 0 and BYNET 1. Two BYNETs allow continuous sending and receiving of messages between PEs and AMPs.

AMPs can be considered the workers in the Teradata architecture. All the calculations and actual execution take place in the AMPs. Each AMP has its own dedicated VDISK (Virtual Disk), which it queries in order to process the steps planned by the Optimizer. AMPs work only on their own VDISK and do not have access to other AMPs' VDISKs. Once an AMP performs its steps, it sends the response back to the PE via the BYNET, where the responses from the various AMPs are collected and returned to the user.


Drop Table and Create Table Commands in Teradata SQL

Avoiding unnecessary Drop Table and Create Table commands will address the following problems:

1. Reduce update activity on the DD/D (Data Dictionary/Directory), and improve overall system throughput.
2. Resolve the occasional deadlock problems which occur from simultaneous DD/D updates. (This could cause your application to be idle for multiple hours.)

The drop table command should be replaced with the “DELETE FROM table ALL” command (see the sketch after the exceptions below). The delete command will perform better than the drop command, and without the DD/D updates.

The create table command should only be executed when needed, that is, for table alterations or recovery situations.

There are exceptions to this requirement:
1. The fastload error tables must always be dropped, so that fastload can re-create them.
2. The DSS(Decision support system) Query tool creates tables for intermediate data storage.
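
A minimal sketch of the preferred pattern (work_db.stage_tbl and its columns are hypothetical):

/* Avoid: dropping and re-creating the work table on every run
   (each DROP/CREATE updates the Data Dictionary/Directory)     */
DROP TABLE work_db.stage_tbl;
CREATE TABLE work_db.stage_tbl
( load_id INTEGER
, load_dt DATE );

/* Prefer: keep the table and simply empty it (no DD/D updates) */
DELETE FROM work_db.stage_tbl ALL;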


Disadvantages of Compression

·          Compression does not apply to VARCHAR columns.
·          There should not be any spool space usage differences, because where compressed values are expanded back into spool, a spool file would have been needed anyway. However, the spool space usage differences may not be fully understood and tested yet.


Advantages of Compression

The following are the advantages of data compression in Teradata.

·      Compression results in shorter rows which are generally more performant.
·      Column compression provides the following capacity and performance benefits:
              Reduced capacity
            Reduced I/O traffic
            Moderate CPU savings
·    The I/O savings correlates to the percentage of data compressed out of a row.
·    Table backups use the compressed format, so backups will also be faster.


Compression in teradata

Here are some rules and guidelines of Teradata's data compression via the COMPRESS phrase:

1) COMPRESS:
COMPRESS is not a complex data reducing algorithm. It does not reduce repeating characters (or bits), or character (or bit) patterns within rows, columns, blocks or cylinders.

COMPRESS practically eliminates the data storage of nulls (or 256 constant values) for fixed-length, non-primary index columns.


2) Rules for compression:
·         Column must be fixed-length, 255 characters or less, and not part of the primary index.
·          The following types of data can be compressed:
          Nulls, Zeros, Blanks, Any numeric data type,
        DATE (expressed as COMPRESS (DATE 'yyyy-mm-dd')),
        CHARACTER(up to 255 characters).
·          The default setting is no compression.
·          Additional rules have been specified in the documentation.
·        Only FIXED WIDTH data types can be compressed. VARIABLE WIDTH (VARCHAR) data types cannot be compressed.
·   Compression values are case specific. The value "PRODUCT" does not get compressed if "product" is the specified compression value.
·     Compression values are stored in the table header, and in V2R5, there is a limit of 255 values, and a header size limit of 64KB.
·      There are 2 forms of compression:
       - Teradata RDBMS V2R4.1 (current version): the COMPRESS attribute compresses nulls and 1 column value.
       - Teradata RDBMS V2R5 (next version): the COMPRESS attribute compresses nulls plus up to 255 column values.
·         The syntax for the compression attribute is as follows:
            CREATE SET TABLE tbl1
            ( col1    CHAR(1)  COMPRESS,             /* compresses nulls */
              col2    CHAR(1)  COMPRESS 'product' )  /* compresses nulls plus the value 'product' */
            PRIMARY INDEX ...;

3) Guidelines for Compression:
·         Only compress columns when compression affects enough of the rows in the table to make compression worthwhile.
·         Measure the potential savings with the following calculation:
            CC * CW, where CC is the count of potential compressed nulls (or values) for the column, and CW is the column width.
            SQLs for calculating CC:
              - Get the actual null count:
                  select count(*) from tbl1 where col1 IS NULL;
              - Get the actual value list by value count:
                  select col1, count(*) from tbl1 group by 1 order by 1 desc;
·          Measure the actual savings with the following calculation:
            (TB - TA) / TA, where TB is the table size before the compression, and TA is the table size after the compression.
·          Presence bits: for every data column, there will be either 0, 1, or 2 presence bits:
            1) For each 8 compressible columns, add 1 byte of compression presence bits per row.
            2) For each 8 nullable columns, add 1 byte of nullable presence bits per row.
·         Convert VARCHAR to fixed:
            Since a VARCHAR cannot be compressed, it can be converted to a fixed-length CHAR and then compressed. Weigh these differences: a VARCHAR column contains a 2-byte VLI (variable length indicator) for each value, and a CHAR column contains the same number of characters for every value.
·          Implement the compressions on the test system first (DW TEST (t62tst)), and then migrate the changes to the production system (DW EDW (t61edw)).


Collect Statistics Syntax in Teradata

The following are the Collect Statistics Syntaxes in Teradata.

“COLLECT STATISTICS ON tablename COLUMN columnname;” will collect statistics  on a column.

“COLLECT STATISTICS ON tablename INDEX (columnname);” will collect statistics on an index.

“COLLECT STATISTICS ON tablename INDEX (col1, col2, ...);” will collect statistics on multiple columns of an index.

“HELP STATISTICS tablename;” will display the number of distinct values of the columns. 
                        
“COLLECT STATISTICS tablename;” refreshes (recollects) the table statistics.

 “DROP STATISTICS ON tablename ... ;" will drop the statistics.


Collect Statistics

COLLECT STATISTICS is a Teradata sql command that scans columns and indexes of a table and records demographics of the data.

COLLECT STATISTICS is used to provide the Teradata Optimizer with as much information on the data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values.

Collecting statistics can improve the execution of a sql. The optimizer can have more details about each column or index, and therefore determine a better join plan for resolving the query.

COLLECT STATISTICS COMMANDS SHOULD ONLY BE USED WHEN A SPECIFIC SQL (OR GROUP OF SQLS) IS BENEFITTED.

This is because collecting statistics may not improve a SQL's efficiency; in that case, the processing time used to collect the statistics is wasted, and this can be a large amount of processing time.
           
If a sql’s performance is shown to improve from a specific collected statistic, then the statistic should be initially collected and then refreshed on a regular basis.

To possibly avoid spoolspace errors in the application design phase, some common problem columns are listed below in "Collect Statistics - Problem Columns".

With COLLECT STATISTICS:

 - COLLECT STATISTICS can make the usage of these more efficient:
                        Primary indexes
                        Secondary indexes
                        Join Columns or base tables
                        Join Indexes
                        Qualifying columns

 -Statistics are especially informative if index values are distributed unevenly.

 - When a query uses conditionals based on non-unique index values, then Teradata uses statistics to determine whether indexing or a full search of all table rows is more efficient.

              
 - If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.
          
   
Without COLLECT STATISTICS:

 - The Optimizer assumes:
     Non-unique indexes are highly non-unique (lots of rows per value).
     Non-index columns are even more non-unique than non-unique indexes (lots of rows per value).

- Teradata derives row counts from a random AMP sample for:
  Small tables (less than 1000 rows per amp), and unevenly distributed tables (skewed row distribution due to the PI).

  Random amp sample:
  Look at the data from 1 amp of the table and, from this, estimate the total number of rows in the table.

  Random amp samples may not represent the true total number of rows in the table, because the rows in the table may not be distributed evenly. This occurs often with small tables. As of 9/2000, per table, the random amp sample uses the same amp for each SQL or query.


Amp Offline in Teradata Database

The Teradata database system is impacted in the following ways when an amp is offline:

            1) System Performance                                                
            2) Table Updates/Offline Recovery
            3) Table Rebuild
            4) Spoolspace
            5) ASF Archives
            6) ASF Restore/Copy
            7) Amp Usage Statistics
            8) 2nd Amp Risk
            9) Fastload Configuration Change

1) System Performance

The system performance is impacted because the processing power of the down amp is missing. The application processing normally done on the down amp must be absorbed by the other amps in the system. This is accomplished with clustering.

When an amp is down, a table's primary copy of the data on that amp is not available, but the fallback copy is available on the other amps within the cluster. In our environment, the cluster size is 4 amps, so the table data from the down amp is spread across the other 3 amps. This means that when an amp is offline, the access to the table data on the other 3 online amps in the cluster is increased.

The amount of processing increase in the other 3 amps can be determined as follows: Assume that the system processing is spread evenly across all 4 original amps. Then if 1 amp is removed, then to do the same processing on the other 3 amps, the processing would have to increase by 33% on each of the other 3 amps.

Therefore, when an amp is offline, the system performance of the cluster decreases by 33%. And if this particular cluster has amps that are very highly utilized, then the overall system performance degradation would be noticeable.

2) Table Updates/Offline Recovery

Table data can be updated while an amp is down. The data slated for the down amp is directed to the other 3 amps instead, and a recovery journal ("dbc.changedrowjournal") is used on those amps. When the down amp is brought back online, the table updates stored on the other 3 amps must be applied to it. This procedure is called "offline recovery" or "catch-up".

TOS single threads each logon's "offline recovery".

3) Table Rebuild

If the down amp's portion of the table data is not available when the amp is brought back online, then it would be rebuilt with a procedure called "table rebuild". This procedure reads the fallback copy of the table data from the 3 online amps in the cluster, and creates both the primary and fallback copy of the table data on the new amp.

4) Spoolspace

The DBA sets the spoolspace limit per logon, but TOS sets the spoolspace limit per amp (See "Spoolspace" Tip). When an amp is down, the spoolspace limit per amp remains the same, but the spoolspace limit per logon is decreased by 1 amp's worth. Therefore, on the 3 online amps, the processing will increase, and the spoolspace usage will increase, but the spoolspace limit per amp will remain the same.

5) ASF Archives

When an amp is down, the ASF archive will use the fallback copy. If indexes are also archived, then NUSI's (non-unique secondary indexes) will be skipped, and a warning message displayed. The message also includes the amp number that is down.

6) ASF Restore/Copy

ASF Restore does not work when an amp is down. ASF Copy will work and use the online amps in a cluster.

7) Amp Usage Statistics

When the "table rebuild" is executed for the amp, the dbc.acctg table (dbc.ampusage view) is also rebuilt. But this dictionary table does not have fallback, therefore the cpu and diskio statistics for each logon on that amp are lost. This is seen in the Opermenu utility as an unusually high number of logons accessing the system during the hour.

8) 2nd Amp Risk

During the time that 1 amp is down in a cluster, there is a very small risk that a 2nd amp could also go down in the same cluster.

Within the same cluster as the down amp: If a 2nd amp goes down, but the table data is recoverable, then the system processing would be halted, and after the amp is fixed, the system would be restarted. But if the table data is not recoverable, then the table data of the entire system would be restored, and then the system would be restarted.

9) Fastload Configuration Change

Fastload cannot continue when an amp goes offline while the fastload is executing.One of the recovery options is to wait until the original configuration is attained, but this will probably never be used.