Skew Factor

The measure of how unevenly a table's data is distributed among the AMPs is called the skew factor.

A NUPI generally allows duplicate values. The more duplicates a column contains, the more rows share the same row hash, so all of those rows are loaded onto the same AMP. This makes the data distribution uneven.

One AMP ends up storing more data than the others. When the full table is accessed, the AMP holding the most data takes longer to retrieve its rows, and the other AMPs sit idle waiting for it, which wastes processing capacity.

In this situation we should avoid full table scans.
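As a rough way to check how skewed a table actually is, two standard queries can help. DBC.TableSizeV and the HASHROW/HASHBUCKET/HASHAMP functions are part of Teradata, but the database, table, and column names below are placeholders:

    -- Approximate skew from per-AMP space usage (0% = perfectly even; higher = more uneven).
    SELECT DatabaseName, TableName,
           100 * (1 - AVG(CurrentPerm) / MAX(CurrentPerm)) AS SkewFactorPct
    FROM   DBC.TableSizeV
    WHERE  DatabaseName = 'MyDB' AND TableName = 'MyTable'
    GROUP BY 1, 2;

    -- Row counts per AMP for a candidate (NU)PI column, to see the distribution directly.
    SELECT HASHAMP(HASHBUCKET(HASHROW(cust_id))) AS amp_no, COUNT(*) AS row_cnt
    FROM   MyDB.MyTable
    GROUP BY 1
    ORDER BY 2 DESC;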


TPUMP How It Works

TPump is similar to MultiLoad: it modifies Teradata tables by processing inserts, updates, and deletes.

TPump performs updates on the Teradata Database in a synchronous manner. Changes are sent in conventional CLIv2 parcels and applied immediately to the target table(s). To improve its efficiency, TPump builds multiple statement requests and provides the serialize option to help reduce locking overhead.
Compared with MultiLoad, TPump does better on relatively low volumes of changes because there is no temporary-table overhead. It becomes expensive for large volumes of data, because multiple updates to a physical data block will most likely result in multiple reads and writes of the block.

The most important technique used by TPump to improve performance over MultiLoad is the multiple statement request. Placing more statements in a single request is beneficial for two reasons. First, it reduces network overhead because large messages are more efficient than small ones. Secondly, (in ROBUST mode) it reduces TPump recovery overhead, which amounts to one extra database row written for each request. TPump automatically packs multiple statements into a request based upon the PACK specification in the BEGIN LOAD command.
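As a hedged sketch of where the PACK specification fits, here is the skeleton of a simple TPump job script; the session count, pack factor, file, table, and field names are all invented for illustration:

    .LOGTABLE MyDB.tpump_restart_log;          /* restart log table (placeholder name)  */
    .LOGON    tdpid/username,password;
    .BEGIN LOAD
         SESSIONS 4                            /* illustrative value                    */
         PACK 20;                              /* pack up to 20 statements per request  */
    .LAYOUT cust_layout;
    .FIELD in_cust_id   * INTEGER;
    .FIELD in_cust_name * VARCHAR(30);
    .DML LABEL ins_cust;
         INSERT INTO MyDB.Customer (cust_id, cust_name)
         VALUES (:in_cust_id, :in_cust_name);
    .IMPORT INFILE customer.dat
         LAYOUT cust_layout
         APPLY ins_cust;
    .END LOAD;
    .LOGOFF;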

Macros: TPump uses macros to efficiently modify tables, rather than using the actual DML commands. The technique of changing statements into equivalent macros before beginning the job greatly improves performance. Specifically, the benefits of using macros are:
1. The size of the network (and channel) messages sent to the RDBMS by TPump is reduced.
2. RDBMS parsing-engine overhead is reduced, because the execution plans (or “steps”) for macros are cached and reused. This eliminates the “normal” parser handling, in which each request sent by TPump is planned and optimized.
Because the space required by macros is negligible, the only issue regarding them is where they are placed in the RDBMS. The macros are put into the database that contains the restart log table, or into the database specified with the MACRODB keyword in the BEGIN LOAD command.
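TPump generates these macros itself, but conceptually the result is similar to the hand-written sketch below (database, macro, table, and parameter names are invented):

    /* A parameterized insert wrapped in a macro; its plan is cached after the first use. */
    CREATE MACRO MacroDB.tpump_ins_cust (p_cust_id INTEGER, p_cust_name VARCHAR(30)) AS
      ( INSERT INTO MyDB.Customer (cust_id, cust_name)
        VALUES (:p_cust_id, :p_cust_name); );

    /* Each imported record then becomes a short EXEC request instead of full DML text. */
    EXEC MacroDB.tpump_ins_cust (1001, 'Smith');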

Locking and Transactional Logic
   In contrast to MultiLoad, TPump uses conventional row hash locking, which allows some amount of concurrent read and write access to its target tables. TPump can be stopped at any point, and the target tables remain fully accessible. Note, however, that depending on the nature of the update process, stopping TPump may leave the “relational” integrity of the data impaired.
   This differs from MultiLoad, which operates as a single logical update to one or more target tables. Once MultiLoad goes into phase two of its logic, the job is “essentially” irreversible and the entire set of target tables is locked for write access until it completes.
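For example, while a TPump job is applying changes, other users can still query the target table, optionally with an access lock so the query does not block on rows being updated (a minimal sketch; names are placeholders):

    LOCKING ROW FOR ACCESS
    SELECT cust_id, cust_name
    FROM   MyDB.Customer
    WHERE  cust_id = 1001;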

Recovery Logic and Overhead
   TPump, in “ROBUST mode”, writes one database row to the restart log table for every request it issues. This collection of rows can be referred to as the request log. Because a request is guaranteed by the RDBMS either to finish completely or to roll back completely, the request log always accurately reflects the completion status of a TPump import. The request-log overhead for restart logic therefore decreases as the number of statements packed per request increases.
   TPump also allows you to specify a checkpoint interval. During the checkpoint process, TPump flushes all pending changes from the import file to the database and also cleans out the request log. The larger the checkpoint interval, the larger the request log (and its table) will grow. Upon an unexpected restart, TPump scans the import data source along with the request log in order to re-execute the statements not found in the request log.
   TPump in “SIMPLE (non-ROBUST) mode” provides basic checkpoint protection. If a restart occurs between checkpoints, some requests will likely be reprocessed. This is adequate protection under some circumstances.
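Both behaviors are selected on the BEGIN LOAD command; a minimal fragment, with illustrative values only:

    .BEGIN LOAD
         SESSIONS 4
         PACK 20
         ROBUST ON          /* one request-log row per request, for exact restarts  */
         CHECKPOINT 15;     /* checkpoint interval in minutes (illustrative value)  */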

Serialization of Changes
   In certain uses of TPump or MultiLoad it is possible to have multiple changes to one row in the same job. For instance, the row may be inserted and then updated during the batch job, or it may be updated and then deleted. In either case, the correct ordering of these operations is obviously very important. MultiLoad automatically guarantees that this ordering of operations is maintained correctly. By using the serialization feature, TPump can also guarantee that this ordering of operations is maintained correctly, but it requires a small amount of scripting work and a small amount of utility overhead.
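In the script, this is requested with the SERIALIZE option on the BEGIN LOAD command together with marking the key fields of the record layout; the fragment below is a sketch based on typical usage, and the KEY marking and field names should be treated as assumptions to verify against the TPump reference:

    .BEGIN LOAD
         SESSIONS 4
         SERIALIZE ON;                       /* route all changes for the same key to one session */
    .LAYOUT cust_layout;
    .FIELD in_cust_id   * INTEGER KEY;       /* serialization key field (assumed syntax)          */
    .FIELD in_cust_name * VARCHAR(30);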

Resource Usage and Limitations
   A feature unique to TPump is the ability to constrain run-time resource usage through the statement rate feature. TPump gives you control over the rate per minute at which statements are sent to the RDBMS, and the statement rate correlates directly with resource usage both on the client and in the RDBMS. The statement rate can be controlled in two ways: dynamically while the job is running, or scripted into the job with the RATE keyword on the BEGIN LOAD command. Dynamic control over the statement rate is provided through updates to a table on the RDBMS.
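The scripted form is just another BEGIN LOAD option (the value is illustrative):

    .BEGIN LOAD
         SESSIONS 4
         RATE 1200;     /* send at most roughly 1200 statements per minute */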

Operating Modes:
  • Interactive – Interactive processing involves the more or less continuous participation of the user.
  • Batch – Batch programs process data in groups as part of scheduled operations, typically in a separate operation rather than interactively or in real time.



TPUMP Basics

TPump is a data loading utility that helps you maintain (update, delete, insert, and atomic upsert) the data in your Teradata Database. It is used to keep the target table updated continuously and helps you achieve near-real-time data in the data warehouse.
Concurrency: MultiLoad is limited to a maximum of 15 instances running concurrently. TPump does not impose this limit.

TPump uses row hash locks rather than table level locks. This allows the users to run queries while TPump is running. This also means that TPump can be stopped instantaneously.
Instead of updating Teradata Databases overnight, or in batches throughout the day, TPump updates information in real time, acquiring data from the client system with low processor utilization. It does this through a continuous feed of data into the data warehouse, rather than through traditional batch updates. Continuous updates result in more accurate, timely data.
TPump provides a dynamic throttling feature: you can specify the number of statements run per minute, and alter that throttle minute by minute while the job runs.

TPump’s main attributes are:
  • Simple, hassle-free setup – does not require staging of data, intermediary files, or special hardware.
  • Efficient, time-saving operation – jobs can continue running in spite of database restarts, dirty data, and network slowdowns. Jobs restart without intervention.
  • Flexible data management – accepts a wide variety of data forms from virtually any data source, including direct feeds from other databases.

Resource Consumption: TPump has a built-in resource governing facility. This allows the operator to specify how many updates occur (the statement rate) minute by minute, and then change the statement rate, while the job continues to run. Thus, this facility can be used to increase the statement rate during windows when TPump is running by itself, but then decrease the statement rate later on, if users log on for ad hoc query access.

The TPump task provides the acquisition of data from client files for application to target tables through INSERT, UPDATE, or DELETE statements that specify the full primary index.
  • TPump examines all commands and statements for a task, from the BEGIN LOAD command through the END LOAD command, before actually executing the task; execution begins only after all commands and statements involved in the task have been processed and validated.
  • Optionally, TPump supports data serialization for a given row, which guarantees that if a row insert is immediately followed by a row update, the insert is processed first. This is done by hashing records to a given session.
  • TPump supports bulletproof restartability using time-based checkpoints. Frequent checkpoints make restarts easier, at the expense of checkpointing overhead.
  • TPump supports upsert logic similar to MultiLoad (see the sketch after this list).
  • TPump uses macros to minimize network overhead. Before TPump begins a load, it sends statements to the Teradata Database to create an equivalent macro for every insert/update/delete statement used in the job script. The execute-macro requests, rather than lengthy text requests, are then executed iteratively during the job run.
  • TPump supports error treatment options, similar to MultiLoad.
  • TPump runs as a single process.
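As a rough illustration of the upsert logic mentioned above, Teradata's atomic upsert form updates a row identified by its primary index and inserts it if it does not exist (table, column, and literal values are invented):

    -- Atomic upsert: update the row if it exists, otherwise insert it.
    UPDATE MyDB.Customer
    SET    cust_name = 'Smith'
    WHERE  cust_id   = 1001
    ELSE INSERT INTO MyDB.Customer (cust_id, cust_name)
         VALUES (1001, 'Smith');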


Data Modeling and Normal Forms

Data modeling is a method used to define and analyze the data requirements needed to support the business processes of an organization. The data requirements are recorded as a conceptual data model with associated data definitions. Actual implementation of the conceptual model is called a logical data model; implementing one conceptual data model may require multiple logical data models. Data modeling defines the relationships between data elements and structures, and it is also a technique for defining the business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database. Three levels of schema are commonly distinguished:
Conceptual schema
Logical schema
Physical schema

Modeling methodologies:
Bottom-up models are often the result of a reengineering effort. They usually start with existing data structures: forms, fields on application screens, or reports. These models are usually physical, application-specific, and incomplete from an enterprise perspective.
Top-down logical data models are created in an abstract way by getting information from people who know the subject area. A system may not implement all the entities in a logical model, but the model serves as a reference point or template.

Evolution in organizational use of data warehouses:
Offline Operational Database
Data warehouses in this initial stage are developed by simply copying the data off an operational system to another server where the processing load of reporting against the copied data does not impact the operational system's performance.
Offline Data Warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data is stored in a data structure designed to facilitate reporting.
Real Time Data Warehouse 
Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order or a delivery or a booking.)
Integrated Data Warehouse 
Data warehouses at this stage are updated every time an operational system performs a transaction and then generate transactions that are passed back into the operational systems.

Dimension: A dimension is a data element that categorizes each item in a data set into non-overlapping regions.
The primary function of dimensions is threefold: to provide filtering, grouping, and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female, or unknown, a user could filter reports on the gender dimension or display results broken out by gender.
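A minimal sketch of that kind of query, with invented fact and dimension table names:

    -- Group a fact measure by the gender dimension, filtering out unknown values.
    SELECT d.gender, SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f
    JOIN   customer_dim d ON d.customer_key = f.customer_key
    WHERE  d.gender <> 'Unknown'
    GROUP BY d.gender;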

Types of dimension:
Conformed Dimension: A dimension that connects to (is shared by) multiple fact tables, with the same meaning in each, is called a conformed dimension. Common examples are the time, customer, and product dimensions.
Junk Dimension: A dimension consolidated from several (two or more) small, low-cardinality dimensions or flags is called a junk dimension.
Degenerate Dimension: A dimension attribute stored in the fact table itself is called a degenerate dimension. A fact table generally contains measures, but because one or more attributes of a would-be dimension are present inside the fact, they are treated as a dimension without a table of their own.
Role playing dimension:  Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".
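A sketch of a single date dimension playing two roles against one fact table (all names are invented):

    -- The same date_dim table is joined twice, once per role, via separate foreign keys.
    SELECT  o.order_id,
            sale_dt.calendar_date     AS date_of_sale,
            delivery_dt.calendar_date AS date_of_delivery
    FROM    order_fact o
    JOIN    date_dim sale_dt     ON sale_dt.date_key     = o.sale_date_key
    JOIN    date_dim delivery_dt ON delivery_dt.date_key = o.delivery_date_key;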

Codd’s Rules for normalization:
1NF:  Eliminate Repeating Groups - Make a separate table for each set of related attributes, and give each table a primary key.
2NF:  Eliminate Redundant Data - If an attribute depends on only part of a multi-valued key, remove it to a separate table. To avoid update and delete anomalies!
3NF:  Eliminate Columns Not Dependent On Key - If attributes do not contribute to a description of the key, remove them to a separate table.
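A compact sketch of these steps applied to a hypothetical order-entry design (all table and column names are invented):

    -- 1NF: repeating item groups become one row per order line, with a composite key.
    CREATE TABLE order_item (
        order_id   INTEGER NOT NULL,
        item_no    INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity   INTEGER,
        PRIMARY KEY (order_id, item_no)
    );

    -- 2NF: customer_id and order_date depend only on order_id (part of the key),
    -- so they move to their own table.
    CREATE TABLE orders (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  DATE
    );

    -- 3NF: product_desc describes product_id, not the order line's key,
    -- so it moves to a product table instead of being repeated on every line.
    CREATE TABLE product (
        product_id   INTEGER NOT NULL PRIMARY KEY,
        product_desc VARCHAR(60)
    );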


Bitmap Indexes

Oracle's two major index types are Bitmap indexes and B-Tree indexes. B-Tree indexes are the regular type that OLTP systems make much use of, and bitmap indexes are a highly compressed index type that tends to be used primarily for data warehouses.

Characteristics of Bitmap Indexes
  • Best for columns with very few unique values (low cardinality). Columns with low cardinality are good candidates (if the number of distinct values in a column is <= 0.1% of the row count, the column is an ideal candidate; 0.2% – 1% can also be considered).
  • Tables that have little or no insert/update activity are good candidates (static data in a warehouse).
  • A bitmap index is a stream of bits: each bit indicates whether a given column value occurs in a particular row of the table. For example:
create bitmap index person_region on person (region);

        Row     Region   North   East   West   South
        1       North        1      0      0       0
        2       East         0      1      0       0
        3       West         0      0      1       0
        4       West         0      0      1       0
        5       South        0      0      0       1
        6       North        1      0      0       0

Advantages of Bitmap Indexes
Their advantages are a highly compressed structure, which makes them fast to read, and a layout that makes it possible for the system to combine multiple indexes for fast access to the underlying table.
Compressed indexes, like bitmap indexes, represent a trade-off between CPU usage and disk space usage. A compressed structure is faster to read from disk but takes additional CPU cycles to decompress for access; an uncompressed structure imposes a lower CPU load but requires more disk bandwidth to read in the same amount of time.
One belief concerning bitmap indexes is that they are only suitable for indexing low-cardinality data. This is not necessarily true, and bitmap indexes can be used very successfully for indexing columns with many thousands of different values.
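A hedged example of the kind of query that benefits, assuming a bitmap index on a second low-cardinality column of the invented person table:

    create bitmap index person_gender on person (gender);

    -- Together with person_region from the earlier example, Oracle can AND the two
    -- bitmaps before touching any table rows.
    select count(*)
    from   person
    where  gender = 'F'
    and    region = 'West';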

Disadvantages of Bitmap Indexes
The reason for confining bitmap indexes to data warehouses is that the overhead of maintaining them is enormous. A modification to a bitmap index requires a great deal more work by the system than a modification to a B-tree index. In addition, concurrency for modifications on bitmap indexes is dreadful.