
The Digital Caste System: How LLMs Perpetuate Ancient Hierarchies In Modern India


Artificial intelligence has long promised to democratise knowledge and opportunity, yet in India, it has become the most efficient enforcer of caste discrimination the subcontinent has ever seen. Large Language Models (LLMs) have revolutionised the ways in which we interact with and harness technology, from delegating one’s daily quota of corporate tasks, to seeking therapeutic advice from perpetually sycophantic correspondents, to consciously outsourcing one’s capacity for independent critical thinking and analytical reasoning.

These systems of knowledge engineering, while offering us a wide array of Faustian trade-offs between improving the efficiency of professional tasks and heralding the loss of objective reason, have largely ensconced themselves within our intellectual routines and mental habits. Artificial intelligence may be a vaunted tool of automation, but as time progresses, the catalogue of repercussions fuelled by machine-learning models grows steadily.

A comprehensive study on the reinforcement of systematic caste-based discrimination within widely used LLMs, “DeCaste: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis,” has recently surfaced. The research reveals troubling insights into how well-known AI tools propagate dangerous stereotypes about citizens consigned to the lowest tiers of the caste structure. The models examined typically perpetuate existing power structures and systematically discriminate against lower castes, such as Dalits and Shudras, in favour of the higher castes, the chroniclers of their own eminence and the entrenchers of their own hierarchical advantages.

The algorithm’s caste system: digital hierarchies in code

Machine learning models exhibit certain characteristics that make them prone to further exacerbating the caste divide, and several data and model distortions are at play. Research on how these systems interact with societal oppression and social justice remains scant, and models are over-fitted to digitally rich profiles, typically middle-class men, further excluding the roughly 50% of Indians without internet access. Additionally, Indian users are treated as ‘bottom billion’ data subjects: subjected to intrusive models, non-consensual automation, poor tech policies, inadequate user research, and low-cost or free products of substandard quality, while being courted as ‘unsaturated markets’, a double standard maintained by the makers of these ML systems.
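To make that skew concrete, the short sketch below compares a population’s group shares with the group shares of a hypothetical training corpus scraped from the connected, digitally rich minority. Every figure and group label is invented purely for illustration; the point is only the shape of the gap, not the numbers themselves.

# Hypothetical sketch: invented population shares versus invented training-data
# shares, showing how a corpus scraped from the digitally rich minority
# under-represents everyone else. None of these figures are real.
population_share = {
    "dominant-caste urban": 0.18,
    "SC/ST rural": 0.30,
    "OBC rural": 0.35,
    "other": 0.17,
}
training_data_share = {
    "dominant-caste urban": 0.62,
    "SC/ST rural": 0.06,
    "OBC rural": 0.20,
    "other": 0.12,
}

for group, pop in population_share.items():
    data = training_data_share[group]
    status = "over-represented" if data > pop else "under-represented"
    print(f"{group:22s} population {pop:.0%}  vs  training data {data:.0%}  ({status})")

Whatever a model learns about the under-represented rows in such a corpus, it learns from a handful of examples, or from none at all.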


Moreover, the promise of an AI-driven breakthrough of tremendous scale usually means the adoption of AI into high-stakes domains. Without an ecosystem of tools, policies, and stakeholders such as journalists, researchers, and activists to hold it accountable, AI remains an impetus to exclusion, marginalisation, and stratification.

Invisible datasets, visible discrimination

Data considerations must also be taken into account. A report from the National Campaign on Dalit Human Rights (NCDHR) highlighted that 13% of hate posts on Facebook in India involved caste-based hate speech, including derogatory references to caste-based occupations and anti-Ambedkar content. Additionally, while 92% of dominant-caste urban households reported having access to the internet, only 71% of Scheduled Tribe households did so. The prominence of the upper castes on the internet skews the patterns that algorithms learn and regurgitate, singing devout paeans to the existing hierarchical dynamics.

Furthermore, an analysis by Google Research in 2021 showed that approximately half of the nation’s population, consisting mostly of women, rural communities, and Adivasis, lacks access to the internet. As a result, ‘entire communities may be missing or misrepresented in datasets… leading to wrong conclusions and residual unfairness.’ These connectivity gaps were laid bare during the pandemic, when it was argued that India’s mandatory COVID-19 contact tracing app excluded hundreds of millions due to access constraints, pointing to the futility of digital nationwide tracing.


Additionally, data-mapping safety apps promoted after the 2012 Nirbhaya case to enhance women’s safety suffer from unequal case visibility, marking out slums and areas populated by Dalits and Muslims as unsafe, which could potentially lead to hyper-patrolling and mass surveillance of those spaces.

‘The irony is that people who are not counted in these datasets are still subject to these data-driven systems which reproduce bias and discrimination,’ said Urvashi Aneja, founding director of Digital Futures Lab, a research collective.


The severe inadequacy of socio-economic and demographic datasets at the national, state, and municipal levels also makes fairness in LLMs difficult to ensure. Even when such data are collected, they may be withheld by the government, part of a growing pattern of infrastructure and transparency failures. As one public policy researcher put it, ‘The country has the ability to collect large amounts of data, but there is no access to it, and not in a machine-readable format.’ In particular, respondents shared how datasets on migration, incarceration, employment, or education, disaggregated by sub-group, were unavailable to the public. There is also scant political will to release data disaggregated by caste, class, and religion, since such statistics would only serve as evidence against the current regime.

Moreover, mis-recorded identities are a prevalent problem in Indian datasets. Ground truth on full names, locations, contact details, biometrics, and usage patterns can be inconsistent, especially for marginalised groups.

More specifically, ML systems deployed in India remain under-analysed for bias and end up replicating the existing public discourse on casteism. This happens because the technology does not account for the cultural context and practical realities of a nation as populous and diverse as India, and it manifests in numerous ways. Names can function as revelatory proxies for caste, religion, gender, and ethnicity, and have long contributed to discrimination in India. The same may be said of employment, with occupations such as manual scavenging and butchery often undertaken by the lower castes.
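A deliberately artificial sketch of this proxy effect follows. Even though caste never appears as a feature, a model trained to imitate biased historical screening decisions ends up keying on surnames rather than qualifications. The applications, the outcomes, and the choice of scikit-learn are illustrative assumptions, not anything drawn from the DeCaste study itself.

# Hypothetical illustration of proxy leakage: caste is never an explicit
# feature, yet a model imitating biased historical decisions learns to weight
# surnames. All applications and outcomes below are synthetic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

applications = [
    "Sharma BTech five years experience data analyst",
    "Iyer MBA four years experience product manager",
    "Valmiki BTech five years experience data analyst",
    "Paswan MBA four years experience product manager",
]
past_decisions = [1, 1, 0, 0]  # biased historical outcomes the model will copy

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(applications)
model = LogisticRegression().fit(X, past_decisions)

# Qualification tokens appear equally in both classes, so the largest weights
# fall on the surnames: the proxy the model has silently learned.
weights = zip(vectoriser.get_feature_names_out(), model.coef_[0])
for token, weight in sorted(weights, key=lambda kv: abs(kv[1]), reverse=True)[:4]:
    print(f"{token:10s} {weight:+.3f}")

In this toy setting the qualifications are identical across groups, so the only signal left for the model to exploit is the surname, which is precisely how a ‘caste-blind’ system reproduces caste.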


Skin tone also plays a role in this caste divide, with darker skin tones facing discrimination. Since only approximately 10% of the Indian populace understands English, the primary language AI systems offer, Indian society is further marginalised by systems that fail to account for the nation’s thirty languages with over a million speakers each. Numerous AI systems, particularly in finance, also require state-issued documentation, such as Aadhaar cards, which can be a substantial deterrent since, in India, the economically poor are often also document-poor.

When code becomes caste

We live in a near-technocratic space in which every corporation, whether a stolid testament to brick-and-mortar establishment or conjured safely in the murky depths of the cloud, is rushing to worship the newest AI deployment on the market and integrate it into recruitment processes, educational platforms, and even loan approval systems. These embedded biases risk amplifying real-world discrimination by adding another layer of stratification to an already historically complex web of social divisions.


The insidious nature of algorithmic discrimination lies in its façade of objectivity. Unlike human prejudice, which can be challenged and confronted, AI bias hides behind mathematical formulas and claims of data-driven neutrality. When a Dalit applicant is rejected for a job or a loan, there is no smoking gun and no explicit slur, only an algorithmic decision that appears scientifically sound. This technological laundering of bias makes discrimination both more pervasive and more difficult to combat, as victims struggle to prove what cannot easily be seen.
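One hedged illustration of how such invisible bias can nonetheless be surfaced is a group-level audit: comparing selection rates across groups rather than litigating any single rejection. The figures below are invented, and the 80% threshold is a common fairness-auditing heuristic, not any Indian legal standard.

# Hypothetical group-level audit: no single rejection reveals bias, but the
# ratio of selection rates across groups can. All numbers are invented.
def disparate_impact_ratio(selected_a, total_a, selected_b, total_b):
    """Selection rate of the disadvantaged group divided by that of the advantaged group."""
    return (selected_a / total_a) / (selected_b / total_b)

# e.g. group A = Dalit applicants, group B = dominant-caste applicants
ratio = disparate_impact_ratio(selected_a=30, total_a=400, selected_b=90, total_b=400)
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.33 in this invented example

if ratio < 0.8:  # the "four-fifths rule" heuristic used in fairness auditing
    print("Selection rates diverge sharply; the model's decisions warrant review.")

Audits of this kind presuppose exactly the disaggregated data that, as noted above, is rarely collected or released in India, which is part of why the bias stays invisible.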

Healthcare as an industry is not excluded from this unsettling phenomenon. The Google Research analysis noted that ‘rich people’s problems like cardiac disease and cancer, not poor people’s tuberculosis, are prioritised, exacerbating inequities among those who benefit from AI and those who do not.’ AI diagnostic tools trained primarily on data from upper-caste, urban populations may fail to recognise symptoms or diseases as they manifest in marginalised communities, leading to misdiagnoses and delayed treatment.


Telemedicine platforms, increasingly vital in rural areas, may struggle with accents and dialects associated with lower castes, creating data-driven barriers to basic healthcare access. A study in The Lancet showed that people aged 50 years and older belonging to Scheduled Tribes and Scheduled Castes reported poorer self-rated health and generally higher levels of disability than those in less impoverished groups, suggesting that the longer the exposure to poverty, the greater the effect on the ageing process. The composition of Indian criminal databases tells a similarly damning story, with Dalits, Muslims, and Adivasis vastly overrepresented in arrest, prosecution, and incarceration statistics.


In a nation where 800 million people depend on government welfare schemes, biased AI systems could systematically exclude the most vulnerable from social safety nets, deepening inequality and social unrest. These embedded prejudices can shut entire communities out of socioeconomic advancement. This technological reinforcement of historical oppression threatens to crystallise caste-based discrimination in digital amber, making it harder to challenge and dismantle. The systemic biases that prevail in the nexus of socio-political relations cannot be eradicated by a technical fix; more research into persistent biases and their societal consequences must be conducted to ensure a more equitable future and prevent the entrenchment of historical inequalities.


References:

https://arxiv.org/pdf/2101.09995

https://www.context.news/digital-rights/racist-sexist-casteist-is-ai-bad-news-for-india

http://www.ncdhr.org.in/wp-content/uploads/2024/04/Caste-Based-Abuse-Report.pdf

https://idronline.org/article/inequality/the-digital-wall-how-caste-shapes-access-to-technology-in-india
