Structural Evolutions in Data

I am inherently curious, always asking, “What’s the next big thing?” Sometimes, the answer turns out to be “more of the same.”

This thought crossed my mind recently when a friend mentioned the fractal nature of emerging technologies. Within a specific technology trend, there often exist several smaller-scale evolutions, mirroring the broader phenomenon.


For example, consider the progression of cloud computing. It started with “raw computing and storage,” then evolved into “re-imagining key services with a single click,” and eventually became the backbone of AI work. All of this falls under the umbrella of “renting computing power and storage from others.” Similarly, the journey of Web3 has taken us from “basic blockchain and cryptocurrency tokens” to “decentralized finance” and now “NFTs as loyalty cards.” Each step represents an innovative twist on the idea of using code to interact with tamper-resistant ledgers in real time.

Recently, I’ve been contemplating this phenomenon in the context of what we currently label as “AI.” I’ve previously discussed the rebranding efforts in the data field, acknowledging that these are more than just cosmetic changes. With each iteration, the underlying implementation evolves while still adhering to the overarching goal of “Analyzing Data for Fun and Profit.”

Let’s explore the structural evolutions of this theme:

Stage 1: Hadoop and Big Data™

Around 2008, companies found themselves at the intersection of a surge in online activity and a drastic decrease in storage and computing costs. Although they weren’t entirely sure about the nature of this “data” substance, they believed they possessed vast amounts that could be monetized. They needed a tool to handle this massive workload, and Hadoop entered the scene.

Hadoop became a must-have for data jobs and a prerequisite for data-related products. However, Hadoop’s value in processing large datasets often didn’t justify its costs, which included a substantial initial investment and the need to train teams to manage clusters and work with MapReduce.

Moreover, Hadoop essentially repackaged existing large-scale business intelligence (BI) practices. Despite its usefulness, it didn’t satisfy the data enthusiasts who sought to explore beyond the realms of the known.

Stage 2: Machine Learning Models

While Hadoop could handle large-scale workloads, it struggled with machine learning (ML). Early ML libraries, like Mahout, required data scientists to code in Java and offered limited algorithms. This led to frustration and, in many cases, a return to traditional tools.

Goodbye, Hadoop; hello, R and scikit-learn. Data job interviews shifted from MapReduce questions to discussions of k-means clustering or random forests on whiteboards. This phase served the field well for a while, but it eventually ran into hurdles of its own.
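For a sense of what those whiteboard conversations covered, here is a minimal scikit-learn sketch, using a synthetic dataset invented purely for illustration, that runs k-means clustering and trains a random forest classifier:

```python
# A minimal sketch of the "Stage 2" toolkit: scikit-learn's k-means and
# random forests on a small, synthetic dataset. Purely illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: 500 points in 2D, drawn from 3 clusters.
X, y = make_blobs(n_samples=500, centers=3, random_state=42)

# Unsupervised: group the points into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster sizes:", [list(kmeans.labels_).count(c) for c in range(3)])

# Supervised: predict the label from the coordinates with a random forest.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```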

As data scientists turned to “unstructured data” (often referred to as “soft numbers”), a new challenge arose: a single document or image translates into thousands to millions of features, far beyond what the existing tools could handle.
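To make the feature-count problem concrete, here is a small, hypothetical illustration: even a few short documents, once pushed through a bag-of-words vectorizer, become one column per distinct token, and real corpora easily reach tens or hundreds of thousands of columns.

```python
# Illustrative only: turning raw text into features produces one column per
# distinct token, so vocabulary size -- not row count -- drives dimensionality.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a structural evolution in how we analyze data",
    "neural networks thrive on high-dimensional input",
]

X = CountVectorizer().fit_transform(docs)
print(X.shape)  # (3 documents, ~22 distinct tokens); real corpora reach 10^4-10^6 columns
```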

This challenge paved the way for the next structural evolution, bringing us to the present day:

Stage 3: Neural Networks

The powerful graphics cards built for high-end video games turned out to be well suited to the matrix math behind neural networks. Suddenly, neural networks were feasible and commercially viable for a wide range of ML tasks, and frameworks like Keras, TensorFlow, and Torch came to dominate.
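As a rough sketch of what these frameworks made routine, here is a small feed-forward network in Keras; the data and architecture are placeholders, not a recipe from any particular project:

```python
# A minimal feed-forward network in Keras -- the kind of model that
# GPU-backed frameworks made routine. Data and architecture are placeholders.
import numpy as np
from tensorflow import keras

# Placeholder data: 1,000 samples with 20 features, binary labels.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

loss, accuracy = model.evaluate(X, y, verbose=0)
print(f"training-set accuracy: {accuracy:.2f}")
```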

These frameworks are as prevalent now as Hadoop was in 2010-2012. Job interviews for machine learning engineers now involve questions about these toolkits or higher-level abstractions like HuggingFace Transformers.
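For the higher-level abstractions, here is a quick illustration of the Hugging Face pipeline API, which hides the tokenizer, the pretrained model, and the post-processing behind a single call. It downloads a default sentiment model on first run; the example sentences are invented.

```python
# Illustrative use of the Hugging Face `pipeline` abstraction: one call wraps
# tokenizer, pretrained model, and post-processing.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier([
    "Cluster management in early Hadoop was painful.",
    "Prebuilt models make this part of the job much easier.",
]))
# -> e.g. [{'label': 'NEGATIVE', 'score': ...}, {'label': 'POSITIVE', 'score': ...}]
```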

Cloud providers, such as Google and Amazon, offer GPU resources, making them accessible to anyone. Google even provides specialized TPU hardware for advanced computing tasks. With abundant computing power and prebuilt models, the field of generative AI has gained significant momentum.

Generative models exemplify these capabilities: the large language models (LLMs) behind ChatGPT and the image models behind Midjourney create content that would fit right in alongside their training data, provided that dataset is sufficiently vast.

Considering the abundant compute power, tools, and prebuilt models available today, we must ask: What challenges remain in GPU-enabled ML? What will drive the next structural evolution in “Analyzing Data for Fun and Profit”?

Stage 4? Simulation

Given the pattern so far, I believe the next structural evolution will center on a newfound appreciation for randomness, specifically through simulation.

Simulations act as synthetic environments to test ideas, exploring “what if” scenarios at a massive scale. They can run millions of tests, varying parameters extensively and summarizing results. Simulation opens up several possibilities:

Moving Beyond Point Estimates: ML models typically provide a single-point prediction, but what we usually need is a range of likely outcomes. Techniques like Bayesian data analysis and Monte Carlo simulations vary the input parameters and produce curves showing how likely different outcomes are (a minimal sketch appears after this list).

New Ways of Exploring the Solution Space: Evolutionary algorithms, which lean on randomness much as Monte Carlo methods do, optimize parameters by repeatedly mutating, evaluating, and selecting candidate solutions. These algorithms are valuable for complex problems with numerous variables.

Taming Complexity: Complex adaptive systems, such as financial markets and economic networks, require understanding hidden connections and interactions. Agent-based modeling (ABM) simulates these systems, revealing unexpected interactions and aiding in risk mitigation.

Smoothing the On-Ramp: To make this structural evolution a reality, it needs a distinguishing name. I propose “synthetics” as an umbrella term that encompasses generative AI and simulation. Furthermore, improving simulation-specific frameworks and tools will be crucial for adoption.
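To make the first of those possibilities concrete, here is a toy Monte Carlo simulation in NumPy. The scenario (projecting revenue under uncertain growth and churn) and every number in it are invented for illustration; the point is that varying the inputs yields a distribution of outcomes rather than a single forecast.

```python
# Toy Monte Carlo simulation: instead of one point estimate of next year's
# revenue, vary the uncertain inputs and summarize the distribution of outcomes.
# Scenario and numbers are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(42)
n_runs = 100_000

current_revenue = 10.0                                   # $M, starting point
growth = rng.normal(loc=0.08, scale=0.05, size=n_runs)   # uncertain growth rate
churn = rng.beta(a=2, b=20, size=n_runs)                 # uncertain churn rate

projected = current_revenue * (1 + growth) * (1 - churn)

point_estimate = current_revenue * 1.08 * (1 - 0.09)     # the single-number answer
low, mid, high = np.percentile(projected, [5, 50, 95])
print(f"point estimate: {point_estimate:.2f}")
print(f"simulated 5th/50th/95th percentiles: {low:.2f} / {mid:.2f} / {high:.2f}")
```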

In a nutshell, the future of “Analyzing Data for Fun and Profit” holds exciting possibilities, marked by the next structural evolution. While challenges exist, a new wave of opportunities is on the horizon. Just as the data field has evolved over the years, so too should practitioners and job-seekers remain adaptable and open to the next big transformation.

Louis Jones