Olist Data Analysis

Building a structured environment to extract business performance insights from 100k+ records.

Project Overview

This project explores the Olist dataset (the largest department store in Brazilian marketplaces) to uncover hidden patterns in sales performance, customer geography, and product profitability.
Facing a dataset of almost 100,000 orders spanning multiple tables and relationships, the goal was to build a scalable relational structure and extract high-level KPIs to drive executive decision-making.

Executive Overview

This panel provides a granular day-to-day view of the marketplace operation, highlighting daily volatility in sales and logistics.
Beyond temporal trends, it integrates geospatial distribution to identify regional clusters and a categorical breakdown to pinpoint the top-performing product niches. All metrics summarize 100k+ records through optimized SQL (detailed here).

Overview_Dashboard

Core Technologies

  • PostgreSQL: Advanced query optimization, CTEs, and Window Functions.
  • Relational Database Design: Data Polishing and Normalization with Entity-Relationship modeling.
  • Power BI & CSV Analytics: Data visualization with Business Intelligence & Storytelling.

Strategic Architecture

Everything should start with a solid foundation, as such, the first step was mapping the entity relationships to ensure that every join across the 100k+ orders remained accurate and efficient.
Below is the normalized database schema designed to ensure data integrity and optimized retrieval.

Entity Relationship Diagram

ER Diagram

Queries

  • Query 1 - How is the business performing over time?

    Description:

    Calculates daily revenue and computes a weekly average.

    SQL Code:
    With daily_revenue as (
        SELECT SUM(oi.price) AS revenue,
            o.order_approved_at::date AS p_date
        FROM orders o   
            JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY p_date
        )
    
    SELECT p_date, revenue,
        ROUND(AVG(revenue) OVER 
        (ORDER BY p_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)::numeric, 2) AS weekly_avg
    FROM daily_revenue;
    Results:
    Result_6

    Note: Profit calculations are based on available dataset attributes and do not include operational costs beyond freight.

    Insight:

    Analyzing the weekly moving average allows for smoothing out daily fluctuations and identifying the true growth or decline trends in sales by eliminating noise from isolated peaks.

    Business Value:

    It assists the planning team in forecasting inventory and staffing needs based on medium-term trends rather than just the previous day's performance.

    Visualization:

    The following graph illustrates the monthly revenue, helping to identify historical patterns and business evolution.

    Graph_1

    Graph made with CSV created from the query results.

  • Query 2 - Which categories have the highest product prices?

    Description:

    Ranks products within each category based on their prices.

    SQL Code:
    SELECT DISTINCT t.product_category_name_english AS Category,
        oi.price AS Price,
        DENSE_RANK() OVER (PARTITION BY p.product_category_name ORDER BY oi.price DESC) AS Ranking
    FROM products p 
        JOIN order_items oi ON p.product_id = oi.product_id 
        JOIN "translation" t ON p.product_category_name = t.product_category_name
    ORDER BY oi.price DESC, ranking ASC, t.product_category_name_english;
    Results:
    Result_5
    Insight:

    By using DENSE_RANK, it is possible to identify "premium" (high-end) products within each segment without skipping positions in the classification.

    Business Value:

    Enables the marketing team to create targeted campaigns for "Luxury/Premium" audiences within specific categories.

    Visualization:

    The following graph illustrates categories that have pricier products.

    Graph_2

    Graph made with CSV created from the query results.

  • Query 3 - Which product categories contribute most to total profit?

    Description:

    Retrieves the top 5 product categories that have generated the highest total profit.

    SQL Code:
    SELECT t.product_category_name_english AS category,
        ROUND(SUM(oi.price * oi.order_item_id)) AS income
    FROM products p 
        JOIN order_items oi ON p.product_id = oi.product_id 
        JOIN translation t ON p.product_category_name = t.product_category_name 
    GROUP BY t.product_category_name_english 
    ORDER BY income DESC
    LIMIT 5;
    Results:
    Result_1
    Insight:

    Identifies the primary revenue drivers for the company. Categories such as Health & Beauty and Watches & Gifts emerge as leaders, accounting for a significant portion of total profit.

    Business Value:

    Directs investment in advertising and partnerships toward the categories that yield the highest direct return on investment (ROI).

    Visualization:

    The following graph illustrates the top 5 most profitable categories.

    Graph_3

    Graph made with CSV created from the query results.

  • Query 4 - How are the categories distributed in percentage?

    Description:

    Calculates the percentage contribution of each product category to the total income.

    SQL Code:
    WITH income_pct AS (
    SELECT t.product_category_name_english AS category,
        ROUND(((SUM(oi.price) / SUM(SUM(oi.price)) OVER())::NUMERIC * 100),2) AS pct
    FROM order_items oi 
        JOIN products p ON oi.product_id = p.product_id 
        JOIN "translation" t ON p.product_category_name = t.product_category_name 
    GROUP BY t.product_category_name_english)
    
    SELECT * FROM income_pct
    WHERE pct > 0
    ORDER BY pct DESC;
    Results:
    Result_4
    Insight:

    This Pareto-style analysis reveals that the top categories represent a disproportionate percentage of total income, often following the 80/20 rule where a small group of products generates most of the revenue.

    Business Value:

    Critical for merchandising decisions and inventory management, as it highlights which categories to prioritize, it is also helpful for understanding the company's dependency on specific market niches.

    Visualization:

    The following graph illustrates the percentage of total income contributed by each product category.

    Graph_4

    Graph made with CSV created from the query results.

  • Query 5 - Where are the best places for distribution centers?

    Description:

    Ranks cities based on the number of customers in each city.

    SQL Code:
    SELECT g.geolocation_city || ',' || g.geolocation_state || ',' || 'Brasil' AS geolocation,
        COUNT(c.customer_city)
    FROM customers c 
        JOIN geolocation g ON c.customer_zip_code_prefix = g.geolocation_zip_code_prefix 
    GROUP BY g.geolocation_city ,g.geolocation_state
    ORDER BY COUNT(c.customer_city) DESC;
    Results:
    Result_5
    Insight:

    Customer volume is heavily concentrated in major hubs like Rio de Janeiro, São Paulo, and Belo Horizonte. As such, these cities are ideal locations for distribution centers.

    Business Value:

    Guides logistics strategies, such as establishing distribution centers (hubs) near these clusters to reduce delivery times and shipping costs (freight).

    Visualization:

    The following map shows the 15 cities with the highest customer counts.

    Graph_5

    Graph made with CSV created from the query results.

  • Query 6 - How does customer distribution vary across regions?

    Description:

    Ranks Brazilian states based on the number of customers in each state.

    SQL Code:
    SELECT g.geolocation_state, 
        COUNT(c.customer_state) 
    FROM customers c 
        JOIN geolocation g on c.customer_zip_code_prefix = g.geolocation_zip_code_prefix 
    GROUP BY g.geolocation_state
    ORDER BY COUNT(c.customer_state) DESC;
    Results:
    Result_6
    Insight:

    The state of São Paulo (SP) leads by a significant margin, followed by RJ and MG, solidifying the Southeast region as the primary consumer market.

    Business Value:

    Allows for personalized regional promotions and provides insights into brand penetration in states with lower customer representation, such as MT and GO, where there is growth potential.

    Visualization:

    The following map illustrates the customer distribution across Brazilian states.

    Graph_6

    Graph made with CSV created from the query results.

Key Strategic Takeaways

  • Revenue growth shows sustained trend despite daily volatility
  • Health & Beauty and Watches & Gifts concentrate a disproportionate share of profit
  • Customer density strongly favors Southeast region, reinforcing logistics hub prioritization
  • Pareto distribution confirms dependency on top-performing categories