The Evolution of Data Warehousing: What Has Changed and What Remains the Same

When I first started my career in data analytics, I went to a senior engineer on my team and asked him for resources I could use to learn to do what he does. He recommended this book to me. Of course, I bought it immediately and it still sits on my shelf.

I recently stumbled upon the book and started rereading it. I am struck by how much has changed in the past 15 years, especially with the technology available to us. At the same time, many of the challenges we face today are the same ones we've always encountered. Issues like data quality, resistance to adopting new tools, and gaps in team knowledge and skills were challenging then and remain so today. In other words, many of our "issues" stem from people rather than technology 😄.

What Has Changed

  1. Cloud Adoption:
    • Then: Data warehouses were primarily on-premises, requiring substantial hardware investments and maintenance. At my first job, there was a physical room with a server that hosted all of our data. It was always locked, and only once did I get to peek in there to see a mess of wires and servers.
    • Now: Cloud-based data warehousing solutions like Amazon Redshift, Google BigQuery, and Snowflake have become mainstream, offering scalability, flexibility, and cost-efficiency. Organizations can now scale their storage and compute resources on demand, reducing the need for large upfront investments. This has been particularly revolutionary for small and midsize companies that typically don't have much budget. It is also great for bigger companies, as it enables teams to start small with one division or team and scale out, adding resources only after the use case has been proven and stakeholders have bought in.
  2. Automated Data Management:
    • Then: Data management tasks, such as ETL (Extract, Transform, Load), were manually intensive and time-consuming.
    • Now: Automation tools and platforms streamline ETL processes, with ETL/ELT solutions like Talend, Apache NiFi, and cloud-native services (e.g., AWS Glue) reducing manual intervention and increasing efficiency. Additionally, data integration tools like Fivetran, Stitch, and Matillion make it significantly easier to pull in data from many popular SaaS tools your organization might be using.
  3. Self-Service BI:
    • Then: Business Intelligence (BI) was the domain of IT and specialized analysts who created reports for business users. This process was often time-consuming, and the turnaround time for reports was too long to be useful for real-time decision-making.
    • Now: Self-service BI tools like Tableau, Power BI, and Looker empower business users to create their own reports and dashboards, democratizing data access and enabling faster insights without relying heavily on IT.
  4. Data Governance and Security:
    • Then: Data governance was often an afterthought, and security measures were more basic, focusing primarily on perimeter defenses.
    • Now: There is a stronger emphasis on data governance, with frameworks to ensure data quality, compliance, and privacy. Advanced security measures, including data encryption, access controls, and auditing, are now standard to protect sensitive information.
  5. Advanced Analytics and AI:
    • Then: Data warehouses primarily supported historical reporting and descriptive analytics.
    • Now: They support advanced analytics, including predictive and prescriptive analytics powered by machine learning and AI. Integration with tools like Databricks, TensorFlow, and Azure Machine Learning allows for more sophisticated data analysis and model building on top of, and in some cases directly within, the data warehouse environment.
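
To make the ETL item above a bit more concrete, here is a minimal sketch of the extract, transform, and load steps that the tools I mentioned automate at much larger scale. It uses only the Python standard library, and everything specific in it is my own assumption for illustration: the `run_pipeline` function, the `orders` table, and the `order_id`, `email`, and `amount` columns are hypothetical and do not come from any particular product.

```python
import csv
import io
import sqlite3


def run_pipeline(csv_text: str, conn: sqlite3.Connection) -> int:
    """Extract rows from CSV text, normalize them, and load them into SQLite."""
    # Extract: parse the raw CSV export (hypothetical column names).
    reader = csv.DictReader(io.StringIO(csv_text))

    # Transform: trim whitespace, lowercase emails, cast amounts to float.
    cleaned = [
        (row["order_id"].strip(), row["email"].strip().lower(), float(row["amount"]))
        for row in reader
    ]

    # Load: write everything to the hypothetical orders table in one transaction.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT PRIMARY KEY, email TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", cleaned)
    return len(cleaned)
```

A real pipeline would also need scheduling, retries, and schema change handling, which is a large part of what the managed tools take off your plate.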

What Hasn't Changed

  1. The Importance of Data Quality:
    • Then and Now: Ensuring high data quality remains critical. Accurate, consistent, and reliable data is essential for making informed business decisions. Data is often entered into the source tools by humans, and if it is entered inconsistently or is missing at the source, the dirty data will flow down to the data warehouse and produce very little actionable insight for the organization. Having strict processing and data validation protocols in place, so that the data comes in as clean as possible, is essential. Getting this step right often requires foresight on the part of the data analytics team (i.e., "we'll likely want to do xyz analysis in the future, so we should start collecting that data now"). This type of foresight can come only when there is a very strong partnership between the analytics team and the rest of the organization. Building this type of partnership is a "soft skill" that many data analytics professionals lack. Meanwhile, understanding data schemas and ETL processes is a "hard skill" that many business professionals lack. It is important to find people who can be "translators" between the two, and these people remain rare and expensive.
  2. Difficulty in Hiring Skilled Professionals:
    • Then and Now: Skilled data professionals, including data engineers, data architects, and data analysts, are still in high demand. It is important for organizations to understand that they will likely have to deploy at least 2-3 people with complementary skill sets to have a successful data warehouse and data-driven structure. I continue to see organizations fumble here because they don't understand the depth of expertise needed to do each part of the process well, and they try to find a unicorn who can do "all the data things". These unicorns don't exist. You'll have to find data engineers who can handle the back end and data analysts who can handle the reporting and stakeholder management. These are typically not the same person.
  3. Resistance to Change:
    • Then and Now: In every organization I have supported, there have been tales of reports that no one looks at or uses. There are often also reports that people are supposed to look at but that are too complex for the stakeholder, don't have the right level of granularity to be useful, and so on. This type of resistance to adopting new tools and technology is human nature, and it persists no matter how much we improve the technology and tools.
  4. Strategic Importance:
    • Then and Now: Data warehousing continues to be a strategic asset for organizations. It provides the foundation for business intelligence, analytics, and data-driven decision-making, supporting organizational goals and competitive advantage.
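
To make the data validation point from the data quality item above a bit more concrete, here is a minimal sketch of the kind of check I have in mind, run before data is loaded into the warehouse. The `validate_rows` function, the `ValidationIssue` class, and the required field names are all hypothetical, chosen only for illustration. Real protocols would be shaped by your own sources and the analyses you expect to run.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ValidationIssue:
    row: int
    field: str
    problem: str


# Hypothetical rules for an illustrative "customers" feed.
REQUIRED_FIELDS = ("customer_id", "email", "signup_date")


def validate_rows(rows: list[dict[str, Any]]) -> list[ValidationIssue]:
    """Return one issue per missing required field or duplicate customer_id."""
    issues: list[ValidationIssue] = []
    seen_ids: set[str] = set()
    for index, row in enumerate(rows):
        # Flag required fields that are absent, None, or blank.
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                issues.append(ValidationIssue(index, field, "missing"))
        # Flag customer_id values that already appeared in an earlier row.
        raw_id = row.get("customer_id")
        customer_id = "" if raw_id is None else str(raw_id).strip()
        if customer_id:
            if customer_id in seen_ids:
                issues.append(ValidationIssue(index, "customer_id", "duplicate"))
            seen_ids.add(customer_id)
    return issues
```

In practice, I would route the returned issues back to whoever owns the source system rather than silently dropping the rows, since the fix usually belongs at the point of entry.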