As we continue to explore the possibilities of AI, particularly in its role as both a participant and an evaluator, we turn our focus to the deeper implications of AI in judgment-based tasks. Following our previous discussion on AI’s evolving role in software development, this blog post dives into a thought-provoking experiment: Can machines, like ChatGPT, evaluate human authenticity as effectively as human experts?
By revisiting Alan Turing’s groundbreaking test and examining its modern adaptations, we explore the capability of AI to not only simulate human responses but also judge the authenticity of conversations. This exploration is key to understanding how far AI has come and where it stands in tasks that rely heavily on nuanced decision-making.
Machines as Judges in the Turing Test
In 1950, Alan Turing introduced the Turing Test, where a human interrogator tries to distinguish between a human and a machine through conversation. If the machine can convincingly mimic a human, Turing argued it should be considered capable of thought.
Since then, variations of the Turing Test have explored the idea of machines as judges. Watt’s Inverse Turing Test (1996) introduced a scenario where a machine evaluates both human and machine responses, suggesting that if a machine can judge as accurately as a human, it demonstrates intelligence. Similarly, the Reverse Turing Test, exemplified by CAPTCHA (2003), focuses on whether machines can determine if a user is human based on tasks like recognizing distorted images.
As advanced AI models like GPT-3.5 and GPT-4 evolve, distinguishing between human and machine-generated content has become increasingly difficult. With AI now surpassing human abilities in areas like data analysis and pattern recognition, a natural question arises: can machines outperform humans as judges in the Turing Test? Because they can detect subtle patterns that humans may overlook, increasingly refined models could offer more accurate and scalable ways of recognizing machine-generated content.
Designing The Experiment
To explore whether AI can match or outperform human judgment in evaluating the authenticity of content, an experiment was designed around a specific case: HR interviews in the software engineering field. The experiment compares how well human experts (professionals working in software engineering) and ChatGPT can distinguish between human and machine-generated responses to typical HR interview questions.
A mix of responses is created, some written by real candidates and others generated by ChatGPT. Judges rate each response on a 1-4 scale reflecting their confidence that it is human-made or AI-generated, and provide a short explanation for their decision. These justifications are then analyzed to reveal the patterns and signs judges rely on, such as response structure, depth of reasoning, or subtle language nuances.
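For later analysis, each rating and its justification can be stored as a simple record. The sketch below is purely illustrative: the field names and the mapping of the 1-4 scale are assumptions, not the study's actual schema.

```python
# Hypothetical record of a single judgment; field names and the scale
# interpretation (1 = certainly machine-generated, 4 = certainly human)
# are assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class Judgment:
    conversation_id: str   # which interview transcript was rated
    judge: str             # "human expert" or "ChatGPT"
    rating: int            # 1-4, encoding both the verdict and its confidence
    justification: str     # free-text reason, later coded into sign categories
```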
Conducting the Experiment
During the data collection phase, human responses were collected through online interviews on the Teams platform. Twelve participants were interviewed, comprising eight men and four women aged 24 to 45. Responses from the ChatGPT-3.5 model were obtained using the OpenAI ChatCompletion API, guided by a Python script that followed best practices for optimal prompting. ChatGPT generated 12 interviews, matching the number of human conversations collected, resulting in a total of 24 interviews that were evaluated in the subsequent Turing experiment phase.
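As an illustration of the generation step, the sketch below uses the ChatCompletion endpoint of the legacy (pre-1.0) openai Python SDK to collect answers to a few HR questions. The question list, system prompt, and parameters are assumptions for illustration, not the script actually used in the study.

```python
# Minimal sketch of generating an AI-side interview with the legacy
# openai SDK's ChatCompletion endpoint. Questions and prompt wording
# are illustrative placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: key supplied via config or env

HR_QUESTIONS = [
    "Tell me about yourself and your background in software engineering.",
    "Describe a conflict you had with a colleague and how you resolved it.",
    "Where do you see yourself in five years?",
]

def generate_interview(questions, model="gpt-3.5-turbo"):
    """Ask the model each HR question in turn, keeping the running dialogue."""
    messages = [{
        "role": "system",
        "content": "You are a software engineer answering HR interview questions "
                   "naturally and concisely, as a real candidate would.",
    }]
    answers = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        response = openai.ChatCompletion.create(
            model=model,
            messages=messages,
            temperature=0.7,
        )
        answer = response["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})
        answers.append((question, answer))
    return answers
```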
The Turing experiment involved 24 participants, comprising 18 men and 6 women aged 22 to 46. Both human judges and ChatGPT were tasked with evaluating the authenticity of the collected conversations.
- Human Evaluation: Each human judge was presented with three randomly selected conversations and given five minutes to rate each one on a scale of 1 to 4. This rating indicated their confidence in determining whether the response was human or machine-generated. After providing their ratings, the judges were required to explain their reasoning, which was later analyzed for patterns and insights.
- ChatGPT Evaluation: Concurrently, ChatGPT assessed the same conversations using a zero-shot chain-of-thought prompting technique, which asks the model to articulate the reasoning behind its ratings without being given prior examples (a minimal prompt sketch follows this list).
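The snippet below sketches what such a zero-shot chain-of-thought judging prompt might look like. The wording, scale labels, and model identifier are assumptions for illustration, not the exact prompt used in the experiment.

```python
# Hypothetical zero-shot chain-of-thought judging prompt: the model is asked
# to reason step by step before committing to a 1-4 authenticity rating.
import openai

JUDGE_PROMPT = (
    "Below is a transcript of an HR interview for a software engineering role.\n"
    "Decide whether the candidate's answers were written by a human or "
    "generated by an AI model.\n"
    "Let's think step by step: first explain your reasoning, then give a "
    "rating from 1 (certainly AI-generated) to 4 (certainly human).\n\n"
    "Transcript:\n{transcript}"
)

def judge_transcript(transcript, model="gpt-4"):
    """Ask the model to reason about authenticity before rating it."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(transcript=transcript)}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```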
Results
How good are the experts, and how good is the ChatGPT model, at determining the authenticity of answers in HR interviews?
Experts outperformed ChatGPT-4 in the Turing Test but exhibited less confidence in their decisions, particularly regarding human-generated conversations. The most significant difference between human judges and ChatGPT-4 occurred in the evaluation of machine-generated content.
While experts performed well on both human- and machine-generated dialogues, ChatGPT-4 rated most human-generated conversations accurately but misjudged the majority of machine-generated ones. Specifically, ChatGPT-4 correctly identified only 2 out of 10 machine-generated conversations, and its overall accuracy of 59.1% was only slightly above random chance. In contrast, human judges achieved an accuracy rate of 90.91%.
Regarding confidence levels, ChatGPT-4 rated human-generated conversations correctly with high confidence 90.91% of the time, whereas human judges were less sure, delivering only 54.54% of their correct ratings with high confidence. The trend persisted for machine-generated conversations, where only 33.33% of the human judges' correct ratings came with high confidence. Although ChatGPT-4 rated only two machine-generated conversations correctly, it displayed low confidence (a rating of 2) in seven out of eight instances.
What signs do judges use when making decisions?
The analysis revealed an overlap in the signs judges used to accurately identify the origin of a text, particularly in recognizing human-like content through categories such as “Language style” and “Anecdotal evidence.” Participants drew on previous interactions with the model and evaluated content against familiar patterns. Notably, no universal or consistent indicators emerged for identifying artificially generated content, which underscores the importance of considering the broader context of a conversation.
Human judges often relied on a diverse range of indicators, and it was the interplay of those indicators that produced nuanced decisions. In contrast, the ChatGPT model frequently latched onto specific signs that biased its evaluations toward classifying content as human-generated, resulting in errors.
In evaluating human-generated conversations:
- Judges exhibited overlap in the signs used for accurate identification (“Language style” and “Anecdotal evidence”).
- ChatGPT identified “Consistent expression” as a sign for accurately recognizing human-generated content, a pattern not observed by the experts.
- Signs leading to correct identifications by ChatGPT also contributed to incorrect judgments.
In evaluating machine-generated conversations:
- Except for the “Superficiality” category, no universal or consistent reasons led to correct identifications.
- Experts drew on a broader range of indicators, and the interplay of those indicators shaped their nuanced decisions.
- For ChatGPT, reliance on categories such as “Anecdotal evidence,” “Emotional depth,” and “Cognitive depth” often resulted in incorrect identifications.
Conclusion
The idea of machines as judges in the Turing Test has evolved significantly since Alan Turing introduced his test in 1950, particularly with the advent of advanced AI models like GPT-3.5 and GPT-4. The original concept of evaluating human-like thinking has expanded into assessing how well machines can discern between human and machine-generated content.
Results from the experiment, which set out to determine whether AI can match or surpass human judgment in evaluating response authenticity, show that human experts outperformed ChatGPT-4 in accurately identifying responses, with the starkest contrast in the evaluation of machine-generated responses, where ChatGPT-4 struggled significantly. The analysis of the signs judges used highlights the necessity of contextual and nuanced understanding in assessing authenticity. It also reveals AI-specific categories, such as “Consistent expression,” suggesting that AI recognizes patterns humans may overlook. In conclusion, while advanced AI models demonstrate potential in evaluating conversational content, the intricate, multifaceted nature of human judgment remains superior.
TIAC’s Next Steps
As we look ahead, the role of AI as a judge in decision-making processes will continue to grow, but our findings underscore the irreplaceable value of human insight. At TIAC, we will keep challenging AI to push its boundaries, but with a clear understanding that human expertise will always have a fundamental role in guiding and shaping AI’s future applications.
For those interested in delving deeper into AI’s fascinating journey, don’t miss the upcoming presentation by our colleague Danijel Popović, “A journey through the world of AI”, where he will explore the beginnings of AI, how it “intelligently” thinks, and how it makes decisions.