Compression in teradata

Here are some rules and guidelines of Teradata's data compression via the COMPRESS phrase:

1) COMPRESS:
COMPRESS is not a complex data reducing algorithm. It does not reduce repeating characters (or bits), or character (or bit) patterns within rows, columns, blocks or cylinders.

COMPRESS practically eliminates the data storage of nulls (or 256 constant values) for fixed-length, non-primary index columns.


2) Rules for compression:
·         Column must be fixed-length, 255 characters or less, and not part of the primary index.
·          The following types of data can be compressed:
          Nulls, Zeros, Blanks, Any numeric data type,
        DATE (expressed as COMPRESS (DATE 'yyyy-mm-dd')),
        CHARACTER(up to 255 characters).
·          The default setting is no compression.
·          Additional rules have been specified in the documentation.
·        Only FIXED WIDTH data types can be compressed. VARIABLE WIDTH (VARCHAR) data types cannot be compressed.
·   The compress column values are case specific. The value "PRODUCT" does not get compressed, if "product" is the specified compression value.
·     Compression values are stored in the table header, and in V2R5, there is a limit of 255 values, and a header size limit of 64KB.
·      There are 2 forms of compression:
    - Teradata RDBMS V2R4.1 (current version): The compress attribute  
      compresses the nulls, and 1 column value.
      - Teradata RDBMS V2R5 (next version): The compress attribute compresses 
      the nulls, plus up to 256 column values.
·         The syntax for the compression attribute is as follows:
            CREATE SET TABLE tbl1,
            ( col1    CHAR(1)  COMPRESS,    /* Compresses nulls */
              col2    CHAR(1)  COMPRESS 'product' )  /* Compresses nulls plus the   
              value 'product' */
            PRIMARY INDEX ...;

3) Guidelines for Compression:
·         Only compress columns when compression affects enough of the rows in the table to make compression worthwhile.
·         Measure the potential savings with the following calculation:
       CC * CW , where CC is the count of potential compressed nulls (or values) 
     for the column, and CW is the column width.
        Sqls for calculating CC:
                  - Get actual null count:
                              select count(*) from tbl1 where col1 IS NULL;
               - Get actual value list by value count:
                        select col1, count(*) from tbl1 group by 1 order by 1 desc;
·          Measure the actual savings with the following calculation:
       (TB - TA) / TA, where TB is the tablesize before the compression, and TA 
     is the tablesize after the compression.
·          Presence bits:                                                                                                                                 For every data column, there will be either 0, 1, or 2 Presence Bits:                
      1) For each 8 columns compressable, add 1 byte of compression presence bits per row.
     2) And, for each 8 columns nullable, add 1 byte of nullable presence bits per row.
·         Convert VARCHAR to Fixed:      
     Since a VARCHAR cannot be compressed, then it could be converted to a fixed length CHAR, and then compressed. Weigh these differences: A VARCHAR column contains a 2 byte VLI (variable length indicator) for each value, and a CHAR column contains the same number of characters per value.
·          Implement the compressions on the test system first (DW TEST (t62tst)), and then migrate the changes to the production system (DW EDW (t61edw)).


5 comments:

  1. nice post ... very well explained and covered all the imp points of compression

    ReplyDelete
  2. How many values can be compressed for a given column? Say 1,,3,7,15 ??? (('A1','A2','A3') so on??

    ReplyDelete
  3. VARCHAR can be compressed in 13 and 14

    ReplyDelete
  4. varchar and varchar2 are included in the list of data types that could be compressed. Refer to TD documentation!!

    ReplyDelete