Data-warehouse Concepts:

1. Data model:

Data model is an abstract model that organizes elements of data and standardizes how they relate to one another.

Data models are made up of entities, which are the objects or concepts we want to track data about, and they become the tables in a database.

2. Data Modelling:

Relational data modelling:

3NF:

--Faster writes.

--Highly normalised data.

--Every table has Primary key, Date timestamp

--cascading effect on child tables.

Dimensional data modelling:

Dimensional data modelling (Ralph Kimball) can result in a design called a star schema, which has denormalized tables, and it is used for building reporting and analytical systems.

--Fast reads.

--Good for reporting.

--Granularity is lowest level up to which data will be stored in a Fact Table in context with specific Dimension table.

Data Vault:

Hub: (emp_id)

Satellite: (emp_name, sal, DOB)

Link: link to hubs

--agile Datawarehouse.

3. Datawarehouse:

Data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW).

retail Datawarehouse

ELT, ETL based Datawarehouse.

4. Datamart:

data mart is a subset of the data warehouse (like sales data mart in

retail Datawarehouse)

5. Star schema:

consists of one or more fact tables referencing any number of dimension tables.

fact tables in a star schema are in third normal form(3NF) whereas dimensional tables are de-normalized.

3NF is used to reduce the data duplication. It is also used to achieve the data integrity.

6. Snowflake schema:

centralized fact tables which are connected to multiple dimensions.

"Snowflaking" is a method of normalizing the dimension tables in a star schema.

7. Fact Table:

Keys columns, Measures columns.

8. Metadata:

Metadata in data warehouse defines the warehouse Entities/objects.

9. Delete duplicate records from table?

Method 1:

begin;

create table table_new as select distinct * from table;

alter table table rename to table_old;

alter table table_new rename to table;

drop table table_old;

commit;

Method2

begin;

-- First identify all the rows that are duplicate

CREATE TEMP TABLE duplicate_saleids AS

SELECT saleid

FROM sales

WHERE saledateid BETWEEN 2224 AND 2231

GROUP BY saleid

HAVING COUNT(*) > 1;

-- Extract one copy of all the duplicate rows

CREATE TEMP TABLE new_sales(LIKE sales);

INSERT INTO new_sales

SELECT DISTINCT *

FROM sales

WHERE saledateid BETWEEN 2224 AND 2231

AND saleid IN(

SELECT saleid

FROM duplicate_saleids

);

-- Remove all rows that were duplicated (all copies).

DELETE FROM sales

WHERE saledateid BETWEEN 2224 AND 2231

AND saleid IN(

SELECT saleid

FROM duplicate_saleids

);

-- Insert back in the single copies

INSERT INTO sales

SELECT *

FROM new_sales;

-- Cleanup

DROP TABLE duplicate_saleids;

DROP TABLE new_sales;

COMMIT;

10. Find duplicate records from a table.

SELECT id, COUNT(*)

FROM users

GROUP BY id

HAVING COUNT(*) > 1;

HAVING is important here because unlike WHERE, HAVING filters on aggregate functions.

8. What is SCD in data warehouse?

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

--current table

CREATE TABLE basket(id int PRIMARY KEY,

fruit VARCHAR(50) NOT NULL);

INSERT INTO basket values(1,'apple');

INSERT INTO basket values(2,'orange');

INSERT INTO basket values(3,'banana');

--create stage table

create table stage_basket(like basket);

--load data to stage table

INSERT INTO stage_basket values(4,'pear');

INSERT INTO stage_basket values(3,'banana1');

--Start the transaction before deleting / inserting data

begin transaction;

delete from basket

using stage_basket

where basket.id = stage_basket.id;

insert into basket select * from stage_basket;

end transaction;

Slowly Changing Dimension (SCD):

SCD type1 - overwrite data in dimensions.

SCD type2 –

SCD type3 –

SCD type4 –

SCD type5 –

SCD type6 –

Delete Query:

DELETE FROM employees WHERE employee_ID = 3;

Alias in where, Having clause?

We can not use alias in Where, Having Clause.

What is View in a Database?

Views are a logical virtual table created by “select query” but the result is not stored

anywhere in the disk and every time we need to fire the query when we need data, so always we get updated or the latest data from original tables.

What is Materialized View in a Database?

Materialized views are also the logical view of our data-driven by the select query

but the result of the query will get stored in the table or disk, also the definition

of the query will also store in the database.

List duplicates in a fact table?

Select col1, col2, col3, count(*) as duplicate_col from fact_table

GROUP BY col1, col2, col3 HAVING count(*) > 1;

Queries:

Create Tables and insert data?

create TABLE test1 (id integer, Fname varchar(15), LName varchar(20));

insert INTO test1 values (1, 'subash', 'Bose');

insert INTO test1 values (2, 'vir', 'savarkar');

insert INTO test1 values (3, 'Bhagat', 'Singh');

insert INTO test1 values (4, 'Baji', 'Bhosale');

insert INTO test1 values (5, 'hk', 'Bose');

create TABLE test2 (id integer, dept varchar(15), Sal integer);

insert INTO test2 values (1, 'HR', 10000);

insert INTO test2 values (2, 'HR', 10000);

insert INTO test2 values (3, 'IT', 20000);

insert INTO test2 values (4, 'IT', 20000);

insert INTO test2 values (5, 'IT', 20000);

select all rows from both tables?

select * from test1 left join test2 on test1.id = test2.id;

select all rows from test1 table where dept=IT?

select t.id, t.Fname, t.LName from test1 as t left join test2 on t.id = test2.id where test2.dept = 'IT';

Select emp all details if salary greater than avg salary of all employees?

select * from test1 as t left join test2 as t2 on t.id = t2.id

where t2.sal > (select avg(sal) from test2);

create TABLE emp1 (emp_id INTEGER, emp_name varchar(15), job_name varchar(15),

manager_id integer, hire_date date, salary integer, dep_id integer)

insert into emp1 values(68319, 'emp1', 'PRESIDENT' ,NULL , 1991-11-18, 6000.00, 1001);

insert into emp1 values(66928, 'emp2', 'MANAGER' , 68319, 1991-05-01, 2750.00, 3001);

insert into emp1 values(67832, 'emp3', 'MANAGER' , 68319, 1991-06-09, 2550.00, 1001);

insert into emp1 values(65646, 'emp4', 'MANAGER' , 68319, 1991-04-02, 2957.00, 2001);

insert into emp1 values(67858, 'emp5', 'ANALYST_ ' , 65646, 1997-04-19, 3100.00, 2001);

insert into emp1 values(69062, 'emp6', 'ANALYST_ ' , 65646, 1991-12-03, 3100.00, 2001);

insert into emp1 values(63679, 'emp7', 'CLERK___ ' , 69062, 1990-12-18, 900.00 , 2001);

select employee details if he is aa manger

SELECT * FROM emp1 WHERE emp_id IN (SELECT manager_id FROM emp1);

Query to get counts for joins?

create TABLE t1 (emp_id integer, dept_id);

insert into t1 values(1, 1);

insert into t1 values(2, 2);

insert into t1 values(3, 3);

insert into t1 values(4, 1);

create TABLE t2 (emp_id integer, dept_id);

insert into t2 values(1, 1);

insert into t2 values(2, 2);

insert into t2 values(3, 3);

insert into t2 values(4, 1);

insert into t2 values(4, 4);

Python, PySpark, Cloud, Unix, SQL

Search This Blog

SQL & Data-warehouse

Data-warehouse Concepts:

No comments:

Post a Comment

Total Pageviews

Followers