Data Ex Machina: Synthetic Data and the Future of AI

April 3, 2024

After maintaining impressive form last year, tech giant Adobe is now seeing some of its lustre rub off.

Following lacklustre forecasts and concerns over whether the firm’s incorporation and monetisation of artificial intelligence (AI) is occurring at the right pace, Adobe’s stock took a hit in mid-March. Many of the issues that Adobe is facing are relevant to the entire AI tech sector, where despite extreme hype and genuinely revolutionary potential alike, generative AI is experiencing growing pains.

This is in part stoked by the fact that the fresh data required to continue improving the revolutionary technology’s capabilities is in increasingly short supply. To maintain the brisk pace of advancement that AI technology has seen over recent years, new sources of data for training algorithms must become available, which has led to a focus on synthetic data – computer-generated sets of information created to train algorithms. Synthetic data is now among the most sought-after resources in the tech industry.

Corporate headaches

Problems are mounting at Adobe. The tech giant’s shares plummeted as much as 11% in March following a soft sales forecast. Despite registering double-digit year-on-year revenue growth, Adobe also announced a decrease in net income, from $2.71 per share last year to $1.36 in 2024. After nearly reaching a historic maximum at the beginning of the year, shares in the company are down 20% since the beginning of February. This has given rise to concerns that Adobe’s recent troubles are a red flag for the wider AI sector.

Part of the issue, naturally, is fallout from Adobe’s failed takeover of Figma. A $20 billion acquisition of the cloud-based design tool, shepherded by Adobe’s Chief Strategy Officer Scott Belsky, looked almost a done deal before Adobe pulled out of the transaction late last year due to EU and UK regulatory hurdles. Adobe also had to pay Figma $1 billion in breakup fees after the failed merger.

Following the failed Figma deal, Adobe—and its investors—doubled down on Adobe’s generative AI potential. “The company is innovating at a pace we’ve never seen,” Adobe CFO Dan Durn underlined last fall, before the Figma deal had officially fallen apart but while it was already facing intense regulatory scrutiny. “We’re natively, deeply integrating these technologies into those workflows and products that define how they operate. This is a seminal moment in Adobe’s history,” Durn explained. “There’s an opportunity in front of us.”

Yet Adobe Firefly is making mistakes similar to the Google Gemini mishaps that recently made headlines, which shows that the problem runs deeper than the collapsed Figma takeover. Attempts to train AI models to avoid racial stereotypes produced a deluge of ahistorical representations that quickly went viral, giving AI sceptics fresh evidence of the tech’s current limitations.

Data’s event horizon

Firefly is proving to be a formidable tool, but its public setbacks risk causing lasting damage to the app’s reputation. In some ways, Adobe is being punished for playing by the rules. The company trained its algorithm on stock images and openly licensed content to allay critics’ worries about the intellectual property implications of generative AI. This contrasts with competitors who often play fast and loose with copyright when training their algorithms.

Industry leader OpenAI, for example, which is facing multiple lawsuits over its use of copyrighted content, argues that it is “impossible” to train its AI tools without such content. OpenAI maintains that the limitation arises because “copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents”.

Even including copyrighted data, however, companies are fast approaching a wall in terms of the new data available for AI training. It’s not merely a licensing issue: even available copyrighted data is becoming too scarce to feed the hunger of Large Language Models (LLMs). This is a wave that could crash across the entire industry, but it is hitting Adobe earlier than most because of its reliance on licensed data.

This is particularly true given the importance of high-quality data for training algorithms. User-generated content like social media posts or low-quality photos is easily sourced, but makes no meaningful contribution to an AI model’s output. Worse, low-quality data may actively harm an algorithm’s output, just as burning bad fuel can ruin an engine. Alarms are already ringing across the industry about the looming lack of high-quality data: a data event horizon that could force AI tech to stagnate.

A synthetic future

To fully harness the emerging power of AI and to ensure exponential learning growth continues, there’s only one real solution: synthetic data, computer-generated sets of information which are explicitly created to train algorithms. This solution is particularly appealing because it not only offers the scalability needed for AI models to continue their exponential growth, but also addresses inherent copyright and privacy issues.

In some industries, synthetic data is already proving extremely effective. Companies developing self-driving car technologies, for instance, supplement real-world data with generated data. This approach allows them to simulate a far wider range of scenarios than road testing alone, including rare occurrences and extensive variations of each specific situation.

Using AI to identify fraud in banking transactions has so far proved challenging, as fraudulent transactions typically represent less than a hundredth of a percent of all dealings. But by using synthetic data sets which generate thousands of such edge cases, algorithms are fed enough information to make recognising similar patterns possible. Further applications can be found in healthcare, where AI training has so far been difficult due to strong privacy restrictions protecting medical data.
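To make the fraud example concrete, here is a minimal sketch of one simple way to generate extra minority-class examples: interpolating between pairs of real fraud records. It is loosely in the spirit of oversampling techniques such as SMOTE, but deliberately simplified, and it is not a description of what any bank or vendor actually does. The function name `synthesize_minority`, the two features (amount, hour of day) and all the numbers are hypothetical, chosen only for illustration.

```python
import random


def synthesize_minority(samples, n_new, seed=0):
    """Return n_new synthetic rows, each a random point on the line
    segment between two randomly chosen real minority-class rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct real rows
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical fraud rows: (transaction amount, hour of day).
fraud_rows = [(950.0, 3.0), (1200.0, 2.0), (875.0, 4.0)]
n_legit = 30_000  # assumed number of legitimate transactions

extra = synthesize_minority(fraud_rows, 2_997)

# Share of fraud examples before and after adding synthetic rows.
share_before = len(fraud_rows) / (len(fraud_rows) + n_legit)
n_fraud_after = len(fraud_rows) + len(extra)
share_after = n_fraud_after / (n_fraud_after + n_legit)

print(f"fraud share before: {share_before:.4%}")
print(f"fraud share after:  {share_after:.2%}")
```

In this toy setup, three real fraud records among roughly 30,000 transactions become 3,000 fraud examples, so a model sees the rare pattern often enough to learn from it. Real systems would need far richer features and checks that the synthetic rows remain plausible.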

Synthetic data naturally poses risks, including “inbreeding” whereby algorithms might replicate each other’s errors, a problem already present in AI training via web-scraping. As AI generates more online content, algorithms risk training on this AI-created content, often unbeknownst to developers. However, using custom synthetic datasets allows developers to better address errors and inconsistencies compared to using gathered data.

While the road ahead is still long and winding, synthetic data will without a doubt be a massive piece of the generative AI puzzle. From start-ups like Scale AI and Gretel.ai to established giants like OpenAI or Microsoft, the industry is catching up to this fact and an arms race for synthetic data is already under way. With the end of natural data already in sight, it might well be the race that saves artificial intelligence.