What is DataStage?
- An ETL tool used to extract, transform, and load data into a data mart or data warehouse
- Used for data integration projects such as data warehouses and ODS (Operational Data Store) builds; it can connect to major databases such as Teradata, Oracle, DB2, and SQL Server
- ETL jobs can be migrated across environments, such as Dev, UAT, and Prod, by importing and exporting DataStage components
- You can manage job metadata
- You can schedule, run, and monitor jobs in DataStage
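Scheduling and running jobs is not limited to the GUI: DataStage also ships with the `dsjob` command-line client, so jobs can be driven from shell scripts or an external scheduler. A minimal sketch; the project name `MyProject` and job name `LoadWarehouse` are placeholders, and `dsjob` must be run on a machine with the DataStage engine available:

```shell
# Run the job, wait for it to finish, and return the job's
# final status as the exit code (placeholder project/job names).
dsjob -run -mode NORMAL -wait -jobstatus MyProject LoadWarehouse
echo "dsjob exit status: $?"
```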
DataStage architecture:
DataStage lets you develop jobs in the Server or Parallel edition. The Parallel edition uses parallel processing to process data and is ideal for large data volumes.
Components:
- Designer
- Director
- Administrator
Administrator:
The following tasks are performed with the Administrator client:
- Add, delete and move projects
- Set user permissions for projects
- Purge job log files
- Set the engine timeout interval
- Trace engine activity
- Set job parameter defaults
- Issue WebSphere DataStage engine commands from the Administrator client
- Configure parallel processing job settings
- Create and set environment variables
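Several of these Administrator tasks can also be scripted with the `dsadmin` command-line tool. Exact flags vary by version, so treat the following as a sketch rather than a definitive reference; `MyProject` and the variable name are placeholders:

```shell
# List the environment variables defined for a project
dsadmin -listenv MyProject

# Add a new string environment variable to the project
# (verify the -envadd/-type/-prompt/-value options against
#  your installed DataStage version)
dsadmin -envadd STAGING_DIR -type STRING -prompt "Staging directory" \
        -value /data/staging MyProject
```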
Enabling job management on the Director client:
These functions allow WebSphere DataStage operators to release the resources of a job that has aborted or hung, returning the job to a state in which it can be run again.
This procedure enables two commands on the Director menu.
- Cleanup Resources
- Clear Status File
Designer:
- Design and develop jobs using the graphical design tool
- Various stages (General, Database, File, Processing, and so on) are used when developing jobs
- Table definitions can be imported directly from data source or data warehouse tables
- Jobs are compiled with the designer, and the designer checks main inputs, reference outputs, key expressions, transformations, and so on for compile errors.
- Import and/or export projects from different environments
- Server, mainframe and parallel jobs can be created using the designer
- Define parameters on the Parameters page under job properties; they can then be used throughout the job during development
- You can create custom routines
- Multiple jobs can be selected for compilation, and a report is produced after the build completes
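Parameters defined on the Parameters page can also be inspected and overridden outside the Designer via the `dsjob` client. A hedged sketch with placeholder project, job, and parameter names:

```shell
# List the parameters a job expects
dsjob -lparams MyProject LoadWarehouse

# Run the job, overriding a parameter default for this run only
dsjob -run -param TargetDB=DWPROD -wait -jobstatus MyProject LoadWarehouse
```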
Director:
- Validate, schedule, run, and monitor jobs run by the DataStage server
- The job status view displays the current status, such as Running, Compiled, Finished, Aborted, or Not Compiled
- Job Log displays the log file for the selected job
- Reset a job whose status is Aborted or Stopped before running it again
- Provides the execution times of the jobs.
- Ability to clean up resources (if administrator has enabled this option)
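Much of the status and log information the Director shows can also be pulled from the command line with `dsjob`, which is useful in monitoring scripts. A sketch with placeholder names:

```shell
# Current status, start/end times, and last-run details for a job
dsjob -jobinfo MyProject LoadWarehouse

# Summary of the job log; -max limits the number of entries returned
dsjob -logsum -max 20 MyProject LoadWarehouse
```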
Along with these jobs, DataStage provides containers (local containers and shared containers) and sequence jobs, which let you specify a sequence of server or parallel jobs to run.