Varenyk Shop is a fictional e-commerce store that sells varenyky, a traditional Ukrainian dish. In this tutorial we will use CoginitiScript to build a transformation pipeline that turns raw sales data into a model ready for analytics.
Our raw (source) data looks like the following:
We would like to load it and prepare customers and orders datasets to use for reporting and analysis purposes:
Please note that the solution proposed in this tutorial is just one recommended way of solving the given task. CoginitiScript doesn't force users to follow a particular approach; you are free to come up with your own solutions.
Instructions
This tutorial is based on a Postgres database. To get up and running, do the following:
- Download the Coginiti catalog package from here: varenyk_shop.coginitipkg
- Import this package into the root folder
- Open a new query tab, select the Postgres database where you want the demo data to be created, and run the following script:
#+import "tutorials/varenyk_shop/seed" {{ seed.Run() }};
- Open tutorials/varenyk_shop/schedule_nightly and execute it on the same Postgres database.
Once you are done with these steps, the customers and orders tables should be created in your Postgres database. You can query those tables to see the final result.
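For a quick sanity check, you can run a couple of simple queries against the published tables. This is just an illustrative sketch, assuming both the seed script and the schedule_nightly asset have completed successfully:

-- Quick sanity check: both tables should exist and contain the seeded demo data.
SELECT COUNT(*) AS customer_count FROM customers;
SELECT COUNT(*) AS order_count FROM orders;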
Reviewing project structure
Let's review the imported tutorial project and see how it is implemented. The overall structure is as follows:
+-- tutorials/
    +-- varenyk_shop/
        +-- reporting/
            +-- customers
            +-- orders
        +-- seed/
            +-- gen/
                +-- person
            +-- clean
            +-- source_data
        +-- sources/
            +-- sources
        +-- schedule_nightly
We break the project down into several packages, each representing a layer. The sources package contains the definitions of the source data, and the reporting package contains the definitions of the datasets we produce for further analysis. The seed package contains scripts to prepopulate the database with demo data; you have already used it in the instructions above.
Sources
When we work on projects like this, we always create a dedicated layer for source data. All downstream transformations then reference blocks from this layer rather than using the tables directly. This abstracts our transformation logic from the physical sources, which has the following benefits (see the sketch after this list):
- If your sources change, you only have to update the source layer and do not need to touch the transformation layer at all (which is usually much bigger).
- Your transformation logic becomes less polluted with cleansing logic, handling of inconsistencies in the source data, and so on, because these operations are done in the source layer.
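As a minimal sketch of the first point, suppose the physical src_customers table renamed its email column to email_address (a purely hypothetical change, not part of this tutorial). Only the source block would need to be adjusted; every downstream block keeps selecting email:

#+src sql Customers()
#+begin
SELECT
    id AS customer_id,
    first_name,
    last_name,
    email_address AS email  -- adapt to the renamed physical column here; downstream queries stay unchanged
FROM src_customers
;
#+end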
For this tutorial we have the following source blocks defined in the tutorials/varenyk_shop/sources/sources asset:
#+src sql Customers()
#+begin
SELECT
    id AS customer_id,
    first_name,
    last_name,
    email
FROM src_customers
;
#+end

#+src sql Products()
#+begin
SELECT
    id AS product_id,
    name
FROM src_products
;
#+end

#+src sql Orders()
#+begin
SELECT
    id AS order_id,
    order_date,
    customer_id,
    payment_type_id,
    order_status
FROM src_orders
;
#+end

#+src sql OrderLines()
#+begin
SELECT
    id AS order_line_id,
    order_id,
    product_id,
    quantity,
    price
FROM src_order_lines
;
#+end

#+src sql OrderStatuses()
#+begin
SELECT
    id AS order_status_id,
    status_name
FROM src_order_statuses
;
#+end

#+src sql PaymentTypes()
#+begin
SELECT
    id AS payment_type_id,
    payment_type_name
FROM src_payment_types
;
#+end
As you can see, these blocks have simple definitions: SELECT statements that extract only the columns we need and rename some of them.
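A handy way to inspect a source block is to reference it from an ad-hoc query, just as we invoked seed.Run() in the instructions above. The following is only a sketch and assumes it is run against the same Postgres database that holds the demo tables:

#+import "tutorials/varenyk_shop/sources"

{{ sources.Customers() }}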
Reporting Models
Customers
For analyzing customer data, we want to produce a dataset with the following metrics calculated for each customer:
- date of the first order
- date of the most recent order
- total sales amount
- total number of orders
Three of these metrics correspond to the so-called RFM values:
- Recency - How recently a customer has made a purchase.
- Frequency - How often a customer makes a purchase.
- Monetary Value - How much money a customer spends on purchases.
We decided to split the calculation logic into two blocks: the rfmValues block calculates the required metrics, and the Customers block produces the final result. The Customers block has a publication strategy defined on it, since we want to materialize its result as a table in the database.
Pay attention to how we use the sources here: we import the sources package first and then reference its blocks in our queries.
tutorials/varenyk_shop/reporting/customers
#+import "tutorials/varenyk_shop/sources" #+src sql rfmValues() #+meta { :doc "Calculates following metrics for each customer: - first order : when they first purchase was - recency : when their last purchase was - frequence : how often they purachase - monetary value : how much they spent " } #+begin SELECT customer_id, MIN(o.order_date) AS first_order, MAX(o.order_date) AS recent_order, COUNT(o.order_id) AS total_orders, SUM(ol.price) AS total_sales FROM {{ sources.Orders() }} AS o INNER JOIN {{ sources.OrderLines() }} AS ol ON ol.order_id = o.order_id GROUP BY customer_id #+end #+src sql Customers() #+meta { :publication { :type "table", :name "customers" } } #+begin SELECT c.customer_id, c.first_name, c.last_name, r.first_order, r.recent_order, r.total_orders, r.total_sales AS ltv FROM {{ sources.Customers() }} AS c LEFT JOIN {{ rfmValues() }} AS r ON r.customer_id = c.customer_id ; #+end
Orders
In the same manner, we define a block to calculate the orders dataset. The interesting detail here is that we dynamically generate the SQL query to build a column for each payment type our system supports. To do this, we first define the list of supported payment types in a constant named paymentTypes. Then we use the CoginitiScript loop statement to iterate over the values of the paymentTypes constant and produce the corresponding column expression.
tutorials/varenyk_shop/reporting/orders
#+import "tutorials/varenyk_shop/sources" #+const paymentTypes = ["credit_card", "paypal", "wire_transfer"]; #+end #+src sql Orders() #+meta { :publication { :type "table", :name "orders" } } #+begin SELECT o.order_id, o.customer_id, o.order_date, os.status_name AS order_status, #+for t : paymentTypes do SUM(CASE WHEN pt.payment_type_name = '{{ t }}' THEN ol.price ELSE 0 END) AS {{ t }}_amount, #+end SUM(ol.price) AS amount FROM {{ sources.Orders() }} AS o INNER JOIN {{ sources.OrderLines() }} AS ol ON ol.order_id = o.order_id INNER JOIN {{ sources.PaymentTypes() }} AS pt ON pt.payment_type_id = o.payment_type_id INNER JOIN {{ sources.OrderStatuses() }} AS os ON os.order_status_id = o.order_status GROUP BY o.order_id, o.customer_id, o.order_date, os.status_name ; #+end
Executing publication
To execute the publication, we created the asset tutorials/varenyk_shop/schedule_nightly, where we run the publication for the reporting package.
#+import "std/publication" #+import "tutorials/varenyk_shop/reporting" {{ publication.Run(packages = [reporting]) }};
We could schedule this asset to run nightly using the built-in scheduler. This functionality is only available in Coginiti Team and Enterprise.