Suchir Balaji’s Death And What You Need to Know About OpenAI’s All-Consuming Hunger For Data

Balaji's arguments against OpenAI are part of the ongoing debate over the ethics and legality of AI-generated content and the data these generative models are trained on.

Suchir Balaji was found dead in his apartment in San Francisco on November 26, 2024. Authorities have ruled the death a suicide, and no evidence of foul play has been found so far. A former researcher at OpenAI, Balaji had become known as a whistleblower against the company. Disputing the supposedly “plausible defence” of “fair use” for generative AI products, he wrote a blog post asking machine learning researchers to engage with copyright law, both in this context and more broadly.

In his last tweet, he wrote, ‘To give some context: I was at OpenAI for nearly 4 years and worked on ChatGPT for the last 1.5 of them. I initially didn’t know much about copyright, fair use, etc. but became curious after seeing all the lawsuits filed against GenAI companies. When I tried to understand the issue better, I eventually came to the conclusion that fair use seems like a pretty implausible defense for a lot of generative AI products, for the basic reason that they can create substitutes that compete with the data they’re trained on. I’ve written up the more detailed reasons for why I believe this in my post. Obviously, I’m not a lawyer, but I still feel like it’s important for even non-lawyers to understand the law — both the letter of it, and also why it’s actually there in the first place.’

Balaji’s arguments are part of the ongoing debate over the ethics and legality of AI-generated content and the data these generative models are trained on. In his view, OpenAI’s training process not only breaches the fair use doctrine, it also begins by making a complete copy of the data. From there, a company like OpenAI can teach the system to generate an exact copy of that data, or it can teach the system to generate text that is in no way a copy. The reality, he said, is that companies teach their systems to do something in between.

Source: The Hindu

Eight days before Balaji’s death, The Times’ attorneys had proposed naming him as a “custodian” in the landmark lawsuit the newspaper has filed against OpenAI. Business Insider, which viewed the court documents, stated, ‘The attorneys’ letter described Balaji as someone with “unique and relevant documents” that could support their copyright infringement case against OpenAI and Microsoft.’ This is the same lawsuit in which OpenAI accidentally erased crucial evidence that the newspaper’s legal team had spent 150 hours sifting through.

All that we see is ours: OpenAI’s consistent infringement of copyrights [and bypassing of laws]

One must remember that OpenAI’s hunger for data is enormous. The learning principle, at its core, is simple: the more data the model has at its disposal, the more it learns and the more it can generate. Very little of this data is ethically sourced, and almost every website, every image, every document that is public [and private, but that is a completely different Pandora’s box] is fair game. This is where the YouTube debacle comes in. In late 2021, when its supply of data ran short, OpenAI decided to transcribe YouTube videos at scale.

Led by Greg Brockman, the president of OpenAI, a team within the company reportedly developed Whisper, a speech recognition tool, and used it to transcribe more than a million hours’ worth of YouTube videos and podcasts to feed into the technology. This raises a question, because there ought to be an impasse here: Google does not allow YouTube videos to be used for independent, third-party applications, even when done by bots, data scrapers and the like. Why, then, did it not oppose this mass transcription?

According to The New York Times, Google had itself used transcriptions of YouTube videos as datasets to train its own AI models, and therefore did not want to be mired in further controversy by wading into this discourse, especially when it was guilty of the self-same offence: a possible violation of the copyrights held by YouTube creators. While Sundar Pichai told CNBC’s Deirdre Bosa that if Google found its terms of service had been breached, it would “sort it out”, we could not find any reportage indicating any form of public “sorting out”.

This is not limited to YouTube, and it did not end in 2022, when ChatGPT was unleashed on society. It is an ongoing process, and the data-sourcing practices of OpenAI and other firms have not slowed down much.

Earlier this year, it was revealed that Taylor & Francis had granted Microsoft non-exclusive access to content it holds the rights to. As one might suspect, this move was made without prior consultation with the academics and authors whose works were being offered to Microsoft. Informa, the group that owns Taylor & Francis, published details of the agreement, from which the following is an excerpt [emphasis ours]:

The partnership will focus on four core areas:

  1. Improved Productivity: Explore how AI can enable more effective ways of working at Informa, streamlining operations, utilising Copilot for Microsoft 365 to enable Colleagues to work more efficiently, and enhancing the capabilities of Informa’s existing AI and data platforms (IIRIS);
  2. Citation Engine: Collaborate to further develop automated citation referencing, using the latest technology to improve speed and accuracy;
  3. Specialist Expert Agent: Explore the development of specialised expert agents for customers such as authors and librarians to assist with research, understanding and new knowledge creation/sharing;
  4. Data Access: Provide non-exclusive access to Advanced Learning content and data to help improve relevance and performance of AI systems.

The agreement includes payment to Informa of both an initial data access fee ($10m+) and a recurring payment across three years (2025, 2026, 2027).

This sets a precedent. Most of the dissent against AI platforms’ copyright infringement has come from academics, and while many are already aware of the predatory nature of the academic publishing industry, this is one more blow. One must also remember that Microsoft has invested several billion dollars in OpenAI.

It does not end there; in fact, it is only the tip of the iceberg. The problem extends further, with Sora, OpenAI’s text-to-video generator, plausibly having been trained on YouTube videos and gaming content. There are currently more than 25 copyright lawsuits against AI firms, many of them class actions. These have been filed by writers (the likes of George R. R. Martin and David Baldacci in Alter v. OpenAI), creatives (Sarah Silverman in In re OpenAI ChatGPT Litigation), journalists, and news publications.

Source: FII

One of the other sources that OpenAI and similar companies have been relying on to train their models is synthetic data. Instead of training AI models solely on human-generated information from the Internet, an alternative is to feed them data generated by AI itself and compound that knowledge. This often feeds the phenomenon of “hallucination”: an AI model dreaming up a source that does not exist, because it is built on distillations or distortions of data it produced itself. It also strengthens biases and perpetuates errors, because companies are cutting corners.

The political economy of digital labour: a brief look

With the rise of AI, there has also been a massive expansion in the economy of digital labor. The problems cut both ways. The intent is not just increased efficiency; it is also to ensure that the maximum amount of labor can be extracted at the least expense. Here is the first issue: OpenAI and other firms are in the process of developing “agent” technology. Moving past answering queries and generating information, it will also be able to perform the daily repetitive tasks that are currently carried out by the human workforce. Altman described his vision for the technology in a podcast interview: ‘a really smart senior coworker who you can collaborate on a project with … The agent can go do a two-day task — or two-week task — really well, and ping you when it has questions, but come back to you with a great work product.’

Until agent technology comes to fruition, cost-cutting labor practices are in full swing. This is entrenched in the structuring of the company as a whole, going right back to data training as a system. A Time exclusive revealed that OpenAI used Sama (formerly Samasource), a San Francisco-based “ethical AI” company, to train its AI to recognise biased and toxic content. OpenAI sent tens of thousands of snippets of text to the firm’s Kenyan operation, snippets that often contained graphic descriptions of sexual violence, bestiality, incest, and more.

All of this was annotated by Kenyan workers, who were paid roughly $1.32 to $2 per hour. Sama markets itself as an “ethical AI” company and claims to have lifted thousands of Kenyan workers out of poverty.

Source: Canva

“Ethical AI” companies like this exist all over the global South. The AI industry is heavily dependent on gig workers, and predictably, when an Oxford survey examined working conditions across 15 platforms that facilitate these services, none scored above an appalling 5 out of 10. The scoring was based on 5 principles: fair pay, fair conditions, fair contracts, fair management, and fair representation, with a maximum of two points for each. Some companies scored a 0. Even a 10 would only mean a company meeting the most basic requirements of a workplace.

Neo-colonialism and the (literal) weaponisation tactics of OpenAI

What emerges, therefore, is an insidious form of neo-colonialism in which pre-existing power dynamics between the Global North and the Global South are repeatedly reinforced. Assimilating a cheap labor force, introducing surveillance technologies in South Africa, and aggravating historical inequalities all sediment into a repetition of what has come before. OpenAI will also be partnering with defense tech company Anduril and data analytics firm Palantir, with access to defense data.

In a thread explaining their collaboration, Anduril Industries stated, ‘Most defense data collected at the tactical edge is never retained. Exabytes of valuable information are lost—data that could train world-class AI models and deliver the U.S. an advantage over adversaries. Anduril’s Lattice and Menace systems solve this problem by capturing, securing, and backhauling that data to enable AI. Even with the data retained, there’s no secure pipeline for turning it into actionable AI. Palantir’s AI Platform (AIP) changes that. It allows developers to structure, train, and deploy models at scale—handling everything from unclassified to SCI-level data… Together, these systems unlock the full potential of defense data—turning it into actionable intelligence and next-gen capabilities. This partnership also enables collaboration with leading AI developers, including @OpenAI. By connecting the tactical edge, secure cloud infrastructure, and cutting-edge AI models, we’re building the complete solution for operationalizing AI‘.

What we seek to highlight is the fact that this collaboration will focus on enhancing CUAS, the US’ counter-drone security system. Anduril, with the help of a large language model, is developing a cluster of aircraft that will translate natural language commands into instructions understandable to both human pilots and drones. Heidy Khlaaf, chief AI scientist at the AI Now Institute and a safety researcher, told the MIT Technology Review, ‘Defensive weapons are still indeed weapons … [they] can often be positioned offensively subject to the locale and aim of a mission.’ When all of this is contextualised within the US government’s long and continuing tradition of maintaining military presence around the world and enacting violence in the name of counter-terrorism, it assumes a huge, hideous form.

Source: FII

The problem with OpenAI and other big AI firms is not their failure to meet expectations of ethics. One rarely expects ethical conduct from corporate firms in the trenches of late-stage capitalism. The entire ethos of the AI machine is that it is built on the art of substitution: substituting the very working-class, poor people on whose shoulders it is built, substituting the work that we have created, and substituting the existing frameworks of violence with more mechanised, “mind”less forms of violence.

The idea, evidently, is to make more money and gain more power, and the cost is borne by the same people trying to outrun decades of colonialism, oppression, and other forms of violence brought upon them. The idea, at its core, has been termed anti-anthropos. Let us use the real, simpler wording here: it is anti-human. All of this is, and will continue to be, magnificently anti-human and anti-people.


SOURCES:

  1. https://suchir.net/fair_use.html [Suchir Balaji’s blog]
  2. https://x.com/suchirbalaji/status/1849192575758139733 [Suchir Balaji’s last tweet]
  3. https://www.nytimes.com/2024/10/23/technology/openai-copyright-law.html [New York Times: Former OpenAI Researcher Says the Company Broke Copyright Law]
  4. https://www.businessinsider.com/suchir-balaji-named-openai-copyright-court-case-ai-training-2024-12 [Business Insider: Suchir Balaji named in OpenAI copyright court case]
  5. https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html?unlocked_article_code=1.ik0.RaCf.bqzjdn7Qj4Gg [How Tech Giants Cut Corners to Harvest Data for A.I.]
  6. https://www.cnbc.com/2024/05/14/pichai-google-will-sort-it-out-if-openai-misused-youtube-for-ai.html [Google CEO Pichai says company will ‘sort it out’ if OpenAI misused YouTube for AI training]
  7. https://www.thebookseller.com/news/academic-authors-shocked-after-taylor–francis-sells-access-to-their-research-to-microsoft-ai [Academic authors ‘shocked’ after Taylor & Francis sells access to their research to Microsoft AI]
  8. https://www.informa.com/globalassets/documents/investor-relations/2024/informa-plc–market-update.pdf [Agreement between Informa and Microsoft]
  9. https://www.staffingindustry.com/news/global-daily-news/ai-agents-called-rise-of-digital-labor [AI agents called ‘rise of digital labor’]
  10. https://time.com/6247678/openai-chatgpt-kenya-workers/ [Time investigation into Kenyan Sama workers]
  11. https://www.oii.ox.ac.uk/news-events/report-on-the-digital-economy-warns-of-worseningworking-conditions-in-the-ai-age/ [Oxford Fairwork report on the digital economy and working conditions in the AI age]
  12. https://www.wired.com/story/openai-anduril-defense/ [Wired report on defense firm Anduril and its OpenAI partnership]
  13. https://x.com/anduriltech/status/1865157851167244418 [Tweet from Anduril Industries]
  14. https://www.technologyreview.com/2024/12/04/1107897/openais-new-defense-contract-completes-its-military-pivot/ [OpenAI’s New Defense Contract Completes Its Military Pivot]
  15. https://timesofindia.indiatimes.com/technology/tech-news/chatgpt-maker-openai-partners-with-anduril-to-safeguard-american-skies/articleshow/116048569.cms [Anduril partnership details, Times of India]
  16. https://archive.ph/UpFKB#selection-1521.0-1532.0 [Wired OpenAI report]
  17. https://www.context.news/ai/ai-boom-is-dream-and-nightmare-for-workers-in-global-south [AI boom is dream and nightmare for workers in Global South]
  18. https://www.forbes.com/sites/tomchavez/2024/01/31/openai–the-new-york-times-a-wake-up-call-for-ethical-data-practices/ [Forbes report on ethical data practices]
  19. https://www.theverge.com/2024/11/21/24302606/openai-erases-evidence-in-training-data-lawsuit [Verge report on OpenAI erasing crucial evidence]
  20. https://carnegieendowment.org/research/2024/10/transnational-ai-and-corporate-imperialism?lang=en [Transnational AI and Corporate Imperialism]
  21. https://fastcompanyme.com/technology/openai-anthropic-and-meta-tracking-the-lawsuits-filed-against-the-major-ai-companies/ [Lawsuits against OpenAI]