Here are some rules and
guidelines of Teradata's data compression via the COMPRESS phrase:
1) COMPRESS:
COMPRESS is not a complex
data reducing algorithm. It does not reduce repeating characters (or bits), or
character (or bit) patterns within rows, columns, blocks or cylinders.
COMPRESS practically
eliminates the data storage of nulls (or 256 constant values) for fixed-length,
non-primary index columns.
2) Rules for compression:
· Column must be fixed-length, 255 characters or less, and not part
of the primary index.
·
The following types of data can be compressed:
Nulls, Zeros, Blanks, Any numeric data type,
DATE (expressed as COMPRESS (DATE 'yyyy-mm-dd')),
CHARACTER(up to 255 characters).
·
The default setting is no compression.
·
Additional rules have been specified in the documentation.
· Only FIXED WIDTH data types can be compressed. VARIABLE WIDTH
(VARCHAR) data types cannot be compressed.
· The compress column values are case specific. The value
"PRODUCT" does not get compressed, if "product" is the
specified compression value.
· Compression values are stored in the table header, and in V2R5,
there is a limit of 255 values, and a header size limit of 64KB.
· There are 2 forms of compression:
- Teradata RDBMS V2R4.1 (current version): The compress
attribute
compresses the nulls, and 1 column value.
- Teradata RDBMS V2R5 (next version): The compress
attribute compresses
the nulls, plus up to 256 column values.
· The syntax for the compression attribute is as follows:
CREATE SET TABLE tbl1,
( col1 CHAR(1) COMPRESS, /*
Compresses nulls */
col2 CHAR(1)
COMPRESS 'product' ) /* Compresses
nulls plus the
value 'product' */
PRIMARY INDEX ...;
3) Guidelines for Compression:
· Only compress columns when compression affects enough of the rows
in the table to make compression worthwhile.
· Measure the potential savings with the following calculation:
CC * CW , where CC is the count of potential compressed
nulls (or values)
for the column, and CW is the column width.
Sqls for calculating CC:
- Get actual null count:
select count(*) from tbl1 where
col1 IS NULL;
- Get actual value list by value count:
select col1, count(*) from tbl1
group by 1 order by 1 desc;
·
Measure the actual savings with the following calculation:
(TB - TA) / TA, where TB is the tablesize before the
compression, and TA
is the tablesize after the compression.
·
Presence bits: For
every data column, there will be either 0, 1, or 2 Presence Bits:
1) For each 8 columns
compressable, add 1 byte of compression presence bits per row.
2) And, for each 8 columns nullable, add 1 byte
of nullable presence bits per row.
· Convert VARCHAR to Fixed:
Since
a VARCHAR cannot be compressed, then it could be converted to a fixed length CHAR, and then compressed. Weigh
these differences: A VARCHAR column contains a 2
byte VLI (variable length indicator) for each value, and a CHAR column contains
the same number of characters
per value.
·
Implement the compressions on the test system first (DW TEST
(t62tst)), and then migrate the changes to the production system (DW EDW
(t61edw)).
nice post ... very well explained and covered all the imp points of compression
ReplyDeleteThank you nitin
ReplyDeleteHow many values can be compressed for a given column? Say 1,,3,7,15 ??? (('A1','A2','A3') so on??
ReplyDeleteVARCHAR can be compressed in 13 and 14
ReplyDeletevarchar and varchar2 are included in the list of data types that could be compressed. Refer to TD documentation!!
ReplyDelete