In computing, ETL (extract, transform, load) is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source.
Why is ETL important?
ETL tools help break down data silos, making it easier for data scientists to access and analyze data and turn it into business intelligence (BI). In simple terms, ETL is an essential step in any data warehousing process, and it ultimately lets us make more informed decisions in less time.
If you are planning to attend an ETL interview, prepare thoroughly beforehand so that you can crack it easily. We have listed the most frequently asked ETL interview questions and answers below. Make sure to go through the entire article so that you do not miss any of the questions.
Top ETL Interview Questions and Answers
1. Define ETL?
In data warehousing, ETL (Extract, Transform, Load) plays an important role; it is the component that manages data for any business process. Extract reads data from a database, Transform converts the data into a format appropriate for reporting and analysis, and Load writes the data into the target database.
2. Explain the three-layer architecture of an ETL cycle?
In the ETL architecture, the data will originate from source files and databases, and then it is entered into the ETL transformation engine. From there, the data will be loaded into the following locations:
Landing area: The landing area is where the data first arrives after extraction from the source. Few transformations, if any, are applied to data in the landing area. If the ETL pipeline processes data in batches, the landing area stores each batch before it moves through the pipeline.
Staging area: It is a temporary, intermediate location where ETL transformations are performed. This area typically takes the form of a relational database, text files, or binary files.
Data warehouse area: It is the final destination for data in an ETL pipeline. From here, the data can be easily queried and analyzed to obtain valuable insights and make good business decisions.
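The flow through these three areas can be sketched in Python. This is a minimal illustration with made-up field names and a made-up cleansing rule, not a real ETL framework:

```python
# A minimal sketch (made-up field names) of data moving through the
# landing, staging, and warehouse areas of an ETL pipeline.

def extract(source_rows):
    """Landing area: raw rows exactly as they arrive from the source."""
    return list(source_rows)

def transform(landed_rows):
    """Staging area: cleanse and reshape the data before loading."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in landed_rows
        if r.get("amount") is not None
    ]

def load(staged_rows, warehouse):
    """Warehouse area: the final, queryable destination."""
    warehouse.extend(staged_rows)
    return warehouse

warehouse = []
raw = [{"name": "  alice ", "amount": "10.5"}, {"name": "bob", "amount": None}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'name': 'Alice', 'amount': 10.5}]
```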
3. Differentiate between ETL and ELT?
| ETL | ELT |
| --- | --- |
| ETL stands for Extract, Transform, Load. | ELT stands for Extract, Load, Transform. |
| Data moves from the data source to a staging area before loading. | There is no separate staging area. |
| It suits smaller data sets that require more complex transformations. | It is best when dealing with huge amounts of structured and unstructured data. |
4. Define ETL testing?
ETL (Extract, Transform, Load) is a technique that extracts data from source systems, transforms it into a consistent data type, and loads it into a single repository. ETL testing refers to validating, verifying, and qualifying that data while preventing duplicate records and data loss.
5. Explain ETL testing operations?
The ETL testing operations are listed below:
- It validates data movement from the source to the target system.
- It verifies that data counts match between the source and target systems.
- It verifies that extraction and transformation conform to requirements and expectations.
- It verifies that table relations, joins, and keys are preserved during transformation.
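The count-verification check above can be sketched with SQLite. Table names and data here are illustrative assumptions:

```python
import sqlite3

# Hedged sketch: verifying that row counts match between a source and a
# target table after an ETL run. Table names and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, total REAL)")
conn.execute("CREATE TABLE target_orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?)",
                 [(1, 9.99), (2, 25.00), (3, 7.50)])
# Simulate the load step copying every source row to the target.
conn.execute("INSERT INTO target_orders SELECT * FROM source_orders")

src = conn.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
tgt = conn.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]
assert src == tgt, f"count mismatch: source={src}, target={tgt}"
```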
6. Define BI?
Business Intelligence, i.e., BI, consists of strategies and technologies that are used by enterprises for data analysis of business information. BI technologies usually provide historical, predictive, and current views of business operations.
7. Define ETL process?
ETL is defined as the process of Extraction, Transformation, and Loading.
8. Name the smoke tests that are performed in ETL testing?
- Duplicate key check
- Primary key check
- CDC (change data capture) check
- Referential integrity check
- Table completeness check
9. List the types of data warehouse applications?
The types of data warehouse applications are:
- Info Processing
- Analytical Processing
- Data Mining
10. Differentiate between Data mining and Data warehousing?
| Data Mining | Data Warehousing |
| --- | --- |
| It is the process of discovering patterns in data. | A data warehouse is a database system designed for analytics. |
| It is usually carried out by business users with the help of engineers. | It is carried out entirely by engineers. |
| It uses pattern recognition techniques to identify patterns. | It is a technique of extracting and storing data that allows easy reporting. |
| These techniques are cost-efficient. | Its responsibility is to simplify every type of business data. |
11. Can you tell the differences between ETL and BI tools?
| ETL Tools | BI Tools |
| --- | --- |
| ETL tools extract data from multiple sources, transform it, and load it into a data warehouse system. | BI tools generate interactive and ad hoc reports for end users, and data visualizations for monthly, quarterly, and annual board meetings. |
| Examples: Informatica, SAP BO Data Services, Oracle Data Integrator (ODI), CloverETL (open source), Microsoft SSIS, etc. | Examples: SAP Lumira, IBM Cognos, Tableau, Oracle Business Intelligence Enterprise Edition, Microsoft BI platform, etc. |
12. Are there any sub-steps for each of the ETL steps?
Each of the steps involved in ETL has several sub-steps; the transform step has the most.
13. Can you list the responsibilities of an ETL tester?
- They have to design and develop UNIX commands as part of ETL processes, and automate the pulling and loading of data.
- They have to work with SQL, ANSI SQL, and PL/SQL queries to test database functionality.
- They have to validate mainframe PMS files migrating to DB2.
- They should have experience working with DB2 and Teradata.
- They have to generate reports from the Cognos reporting tool and compare them with the EDW database.
14. List the different tools used in ETL?
- Oracle Warehouse Builder
- Business Objects XI
- Cognos Decision Stream
- SAS Business Warehouse
- SAS Enterprise ETL Server
15. What is the purpose of the staging area in the ETL process?
The staging area is an intermediate area that sits between the data sources and the data warehouse or data mart systems. It is where data is stored temporarily during data integration, and where the data is cleansed and checked for duplicates. The staging area provides many benefits: it increases efficiency, ensures data integrity, and supports data quality operations.
16. Define Initial load and Full load in ETL?
Initial Load: It is the first run, where we process the historical load into the target; subsequent runs perform incremental loads (bringing in only new and modified records).
Full Load: The whole data dump takes place the first time when a data source is loaded into the warehouse.
17. Define fact and list the types of facts in ETL?
A fact in ETL is defined as a quantitative piece of information like a sale or a download.
Facts are usually stored in fact tables, which have foreign key relationships with a number of dimension tables.
The types of facts in ETL are:
- Additive Facts
- Semi-additive Facts
- Non-additive Facts
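As a rough illustration with made-up data: an additive fact, such as a sales amount, can be summed across any dimension, while a semi-additive fact, such as an account balance, cannot be meaningfully summed across time:

```python
# Illustrative, made-up data: a sales amount is an additive fact (safe to
# sum over any dimension), while an account balance is semi-additive
# (summing it over time double-counts; take the latest value instead).
sales = [  # (day, store, amount) -- additive
    ("mon", "A", 100.0), ("tue", "A", 50.0), ("mon", "B", 30.0),
]
total_sales = sum(amount for _, _, amount in sales)   # valid over all rows

balances = [  # (day, account, balance) -- semi-additive
    ("mon", "acct1", 500.0), ("tue", "acct1", 520.0),
]
# Summing over days would wrongly give 1020.0; report the latest instead.
latest_balance = balances[-1][2]

print(total_sales, latest_balance)  # 180.0 520.0
```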
18. Define the dimension table and how it is different from the fact table?
A Dimension table in ETL is defined as a table in a star schema of a data warehouse. Dimension tables mainly describe dimensions; they consist of dimension keys, values, and attributes.
The fact table mainly consists of measurements, metrics, and facts about a business process, whereas the dimension table is its counterpart, consisting of descriptive attributes used to constrain queries.
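The fact/dimension relationship can be sketched as a tiny star schema in SQLite. Table and column names here are illustrative assumptions:

```python
import sqlite3

# A minimal star-schema sketch: a fact table with a foreign key into a
# dimension table. Table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER, amount REAL)""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 2, 20.0), (1, 1, 10.0), (2, 3, 45.0)])

# The dimension constrains the query; the fact table supplies the measure.
row = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    WHERE d.name = 'Widget'
""").fetchone()
print(row)  # ('Widget', 30.0)
```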
19. Define Incremental Load?
Incremental load refers to applying only the changes (new and modified records) as required, within a given period and on predefined schedules.
20. Define Cubes and OLAP Cubes?
Cubes are defined as data processing units that are composed of fact tables and dimensions from the data warehouse. They aim to provide multidimensional views of data, analytical capabilities, and querying to clients.
An OLAP Cube can be defined as a data structure that will allow a fast analysis of the data according to multiple dimensions that will define a business problem.
21. Define Datamart?
A datamart is defined as a subset of the information content of a data warehouse that supports the needs of a particular department or business function. A datamart is built and controlled by a single department within the enterprise, and its data may or may not be sourced from the enterprise data warehouse.
22. How many layers are there in ETL, and what are they?
We have three layers:
- The first layer in ETL is the source layer, where the data lands.
- The second layer is the integration layer, where the transformed data is stored.
- The third layer is the dimension layer, which serves as the presentation layer.
23. Define tracing level, and what are the types?
The tracing level in ETL is the amount of data that is placed inside the log files.
The tracing levels are mainly classified into two types:
- Normal level: It logs summarized information about the session.
- Verbose: It logs detailed information for each and every row processed.
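The distinction can be sketched in Python. This is an analogy only, not Informatica's actual tracing API:

```python
# Analogy only (not Informatica's actual API): a normal tracing level logs
# a summary, while a verbose level also logs each and every row.

def trace(rows, level):
    """Return the log lines an ETL run would produce at a tracing level."""
    lines = []
    for row in rows:
        if level == "verbose":                 # verbose: one line per row
            lines.append(f"processed row {row}")
    lines.append(f"session finished: {len(rows)} rows")  # summary line
    return lines

normal = trace([1, 2, 3], "normal")
verbose = trace([1, 2, 3], "verbose")
print(len(normal), len(verbose))  # 1 4
```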
24. Can you tell the differences between Manual Testing and ETL Testing?
| Manual Testing | ETL Testing |
| --- | --- |
| It focuses on the functionality of the program. | It focuses on the database and its data counts. |
| It is a time-consuming process. | It is an automated testing process. |
| It requires technical knowledge. | It doesn't require as much technical knowledge, since it is automated. |
25. Define snapshots and their characteristics?
A snapshot is a complete visualization of the data at the time of extraction. It occupies less space and is used to back up and restore data quickly.
Snapshots are replicas of tables, usually located on remote nodes and refreshed periodically so that all changes to the master table are recorded.
26. Define Grain fact?
In terms of data warehousing, a fact table usually contains measurements, metrics, or facts of a given business process.
The grain of a fact table usually defines the most atomic level where the facts are defined. Example: The grain of a sales fact table can be stated as “sales volume by day by product by store.”
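The quoted grain can be illustrated in Python with made-up sales rows, where each (day, product, store) combination identifies one fact:

```python
from collections import defaultdict

# Sketch of the "sales volume by day by product by store" grain with
# made-up rows: each (day, product, store) combination is one fact.
raw_sales = [
    ("mon", "widget", "store1", 5),
    ("mon", "widget", "store1", 3),   # same grain key: rolls up together
    ("mon", "widget", "store2", 2),
]

fact = defaultdict(int)
for day, product, store, qty in raw_sales:
    fact[(day, product, store)] += qty   # aggregate to the declared grain

print(dict(fact))
# {('mon', 'widget', 'store1'): 8, ('mon', 'widget', 'store2'): 2}
```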
27. Why do we need ETL Testing?
- ETL testing keeps an eye on the data that is transferred from one system to another.
- It keeps track of the efficiency and speed of the process.
- It is essential to be familiar with, and verify, the ETL process before implementing it in production.
28. What are Views in ETL?
A view takes the output of a query and treats it as a table; it can therefore be thought of as a stored query or a virtual table. A simple view is created from one table, whereas a complex view is created from multiple tables.
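Both kinds of view can be demonstrated with SQLite. The tables and data are illustrative:

```python
import sqlite3

# Sketch of a simple view (one table) and a complex view (a join across
# tables) in SQLite; table names and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER)")
conn.execute("CREATE TABLE depts (dept_id INTEGER, dept_name TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'Ana', 10), (2, 'Raj', 20)")
conn.execute("INSERT INTO depts VALUES (10, 'Sales'), (20, 'IT')")

# Simple view: a stored query over a single table.
conn.execute("CREATE VIEW v_names AS SELECT name FROM employees")
# Complex view: a virtual table built from a join of two tables.
conn.execute("""CREATE VIEW v_emp_dept AS
    SELECT e.name, d.dept_name
    FROM employees e JOIN depts d USING (dept_id)""")

result = conn.execute("SELECT * FROM v_emp_dept ORDER BY name").fetchall()
print(result)  # [('Ana', 'Sales'), ('Raj', 'IT')]
```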
29. Define Transformation?
Transformation in ETL refers to the cleansing and aggregation that must happen to data to prepare it for analysis. Extracted data is moved to a staging area, where the transformations take place before the data is loaded into the warehouse.
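A minimal sketch of such a transform step, with made-up field names, might cleanse (drop duplicates, cast types) and then aggregate:

```python
# A hedged sketch of a transform step, with made-up field names: cleanse
# (drop duplicates, cast types) and then aggregate before loading.
def transform(extracted):
    seen, cleansed = set(), []
    for rec in extracted:
        if rec["order_id"] in seen:      # cleanse: drop duplicate records
            continue
        seen.add(rec["order_id"])
        cleansed.append({"region": rec["region"], "amount": float(rec["amount"])})
    totals = {}                          # aggregate: total amount per region
    for rec in cleansed:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals

extracted = [
    {"order_id": 1, "region": "east", "amount": "10"},
    {"order_id": 1, "region": "east", "amount": "10"},  # duplicate record
    {"order_id": 2, "region": "west", "amount": "5.5"},
]
result = transform(extracted)
print(result)  # {'east': 10.0, 'west': 5.5}
```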
30. Where can the users use ETL concepts?
- Mergers and Acquisitions
- Data warehousing
- Data Migration
31. Define a Materialized view and Materialized view log?
A materialized view in ETL is a physical structure that improves data access time by pre-computing intermediary results.
A materialized view log records changes to the master table; the ETL job deletes the materialized view log data once the job is completed. This requires the user to have DELETE permissions, which may put the source data at risk.
32. List the use of Lookup Transformation?
- To update the changes to the dimension table.
- To get a related value from the table by using a column value.
- To verify whether the records already exist in the table.
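These lookup uses can be sketched with a plain Python dictionary. The data is illustrative; in an ETL tool, the lookup source would be a table or flat file:

```python
# Lookup transformation sketched with a plain dictionary (illustrative
# data; in an ETL tool the lookup source is a table or flat file).
dim_customer = {101: "Alice", 102: "Bob"}         # key -> related value

incoming = [{"customer_id": 101}, {"customer_id": 999}]
for row in incoming:
    cid = row["customer_id"]
    if cid in dim_customer:                       # record already exists
        row["customer_name"] = dim_customer[cid]  # get the related value
    else:
        dim_customer[cid] = "UNKNOWN"             # update the dimension
        row["customer_name"] = "UNKNOWN"

print(incoming)
# [{'customer_id': 101, 'customer_name': 'Alice'},
#  {'customer_id': 999, 'customer_name': 'UNKNOWN'}]
```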
33. List the characteristics of Data Warehouse?
- Some of the data is denormalized for simplification and to improve performances.
- Huge amounts of historical data are used.
- Queries often retrieve huge amounts of data.
- The data load is controlled.
- Both the planned and ad hoc queries are common.
34. Define Partitioning and the types?
Partitioning is the process of subdividing transactions in order to improve performance. Partitioning enables the Informatica Server to create multiple connections to various sources.
Types of Partitioning:
Round-Robin Partitioning: Here, the data is distributed equally among all partitions. In each partition, the number of rows to process is approximately the same.
Hash Partitioning: It is a type of partitioning technique where a hash key is used to distribute the rows evenly across the various partitions. Hash partitioning is used where ranges are not appropriate.
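Both strategies can be sketched in Python over assumed row data; real partitioning happens inside the ETL engine:

```python
# Round-robin vs. hash partitioning over three partitions (assumed data;
# real partitioning happens inside the ETL engine).
rows = [{"id": i} for i in range(7)]
n = 3

round_robin = [[] for _ in range(n)]
for i, row in enumerate(rows):
    round_robin[i % n].append(row)           # even spread across partitions

hashed = [[] for _ in range(n)]
for row in rows:
    hashed[hash(row["id"]) % n].append(row)  # same key always lands together

print([len(p) for p in round_robin])  # [3, 2, 2]
```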
35. List the types of Data Warehouse systems?
- Predictive Analysis
- Online Analytical Processing (OLAP)
- Online Transactional Processing (OLTP)
- Data Mart
36. Differentiate between PowerCenter and PowerMart?
| PowerCenter | PowerMart |
| --- | --- |
| With PowerCenter, you receive all functionalities, including the ability to share metadata, register multiple servers across repositories, and partition data. | PowerMart includes all the features except distributed metadata, data partitioning, and multiple registered servers. Options available with PowerCenter, such as PowerConnect for Siebel and PeopleSoft, are also not available with PowerMart. |
37. By using SSIS ( SQL Server Integration Service), list the possible ways to update the table?
- Use a staging table
- Use a SQL command
- Use Cache
- Use the Script Task
- Use the full database name for the update if MSSQL is used.
38. List the steps followed in the ETL testing process?
The steps followed are:
- Requirement Analysis
- Validation and Test Estimation
- Test Planning and designing the testing environment
- Test Data Preparation and Execution
- Summary Report
39. List the apps to which the PowerCenter can be connected?
PowerCenter can be connected with ERP sources like:
- Oracle Apps
- PeopleSoft, etc.
40. Where do you use dynamic cache and static cache in connected and unconnected transformations?
- For flat files, Static cache is used.
- The dynamic cache is usually used when you have to update the master table and slowly change the dimensions (SCD) type 1
41. List the steps to choose the ETL process?
- Data Connectivity
- Transformation Flexibility
- Data Quality
- Flexible data action option
- Committed ETL vendor
42. Name the partition that is used to improve the performances of ETL transactions?
The session partition is used to improve the performance of ETL transactions.
43. Define the Data source view in ETL?
Data Source Views (DSVs) allow you to create a logical view of only the tables involved in your data warehouse design.
44. List the types of ETL bugs?
- Source bugs
- Load condition bugs
- Calculation bugs
- Equivalence Class Partitioning (ECP) bugs
- User-interface bugs
- Boundary value analysis bugs
45. List the types of ETL testing?
- Production Validation Testing
- Source to Target Testing (Validation)
- Application Upgrade
- Metadata Testing
- Data Accuracy Testing
- Data Transformation Testing
- Migration Testing
- New Data Warehouse Testing
46. Differentiate between Unconnected and Connected lookup?
| Unconnected Lookup | Connected Lookup |
| --- | --- |
| It receives input values from the result of a lookup expression in another transformation. | It receives input values directly from the mapping pipeline. |
| It has only one return port, so it returns one column from each row. | It can return multiple columns from the same row because it has multiple return ports. |
| It does not support user-defined default values. | It supports user-defined default values. |
47. Define ODS in ETL?
ODS stands for Operational Data Store. It sits between the staging area and the data warehouse. Data in the ODS is kept at a low level of granularity, and when data is inserted into the ODS, it is loaded into the EDW through the ODS.
48. Define Data Extraction and list the phases in ETL?
Data Extraction is defined as extracting data from various sources using ETL tools.
There are two types of data extraction:
- Partial Extraction: We get the notification from source systems to update the specified data. It is known as Delta Load.
- Full Extraction: All the extracted data from an operational system or source system loads into the staging area.
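The two extraction types can be sketched in Python, assuming an `updated_at` watermark column (an illustrative mechanism, not a standard):

```python
# Full vs. partial (delta) extraction, sketched with an assumed
# `updated_at` watermark column and made-up source rows.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-02-15"},
    {"id": 3, "updated_at": "2024-03-10"},
]

def full_extract(rows):
    """Full extraction: every row, every run."""
    return list(rows)

def partial_extract(rows, last_run):
    """Partial extraction (delta load): only rows changed since last run."""
    return [r for r in rows if r["updated_at"] > last_run]

print(len(full_extract(source)))                   # 3
print(len(partial_extract(source, "2024-02-01")))  # 2
```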
49. Explain the steps to extract SAP data using Informatica?
- SAP data can be extracted with Informatica by using the PowerConnect option.
- First, install and configure the PowerConnect tool.
- Then, import the source into the Source Analyzer; PowerConnect acts as a gateway between Informatica and SAP.
- Next, generate the ABAP code for the mapping; only then can Informatica pull data from SAP.
- Finally, use PowerConnect to connect to and import sources from external systems.
50. Define the terms Worklet, Session, Mapplet, and Workflow?
Worklet: It defines a specific set of tasks given.
Mapplet: It arranges or creates sets of transformations.
Workflow: It can be defined as a set of instructions that will tell the server how to execute tasks.
Session: It is defined as a set of parameters that tells the server how to move data from a source to the target.
Good luck with your ETL interview; we hope our ETL interview questions and answers were of some help to you. You can also check our Informatica Interview Questions, which might also be helpful.