There’s no arguing that artificial intelligence is still unreliable in many respects, but one would hope that at least its evaluations would be accurate. However, last week, Google allegedly instructed contract workers who evaluate Gemini not to skip any prompts, regardless of their expertise, TechCrunch reported based on internal instructions it reviewed. Earlier this month, Google shared a preview of Gemini 2.0.
Google has reportedly instructed GlobalLogic, an outsourcing firm whose contractors evaluate the model’s output, not to let reviewers skip prompts that fall outside their expertise. Previously, contractors could skip any prompt that went far beyond their competence, such as a doctor being asked about the law. The guidelines stated: “If you do not have the critical knowledge (e.g., coding, math) to evaluate this prompt, please skip this task.”
Now, contractors have allegedly been instructed: “You should not skip prompts that require specialized knowledge in a particular area” and told that they should “evaluate the parts of the prompt that you understand”, while adding a note that the topic is not an area in which they have expertise. Apparently, the only cases in which a prompt can be skipped are when a large chunk of information is missing or when it contains harmful content that requires special consent forms to evaluate.
One contractor aptly reacted to the changes by saying, “I thought the point of skipping was to improve accuracy by passing it on to someone better?”
Shortly after this article was first published, Google provided the following statement: “Reviewers perform a wide range of tasks across many different Google products and platforms. They provide valuable insights not only into the content of answers, but also into style, format, and other factors. The scores they provide don’t directly affect our algorithms, but collectively they are useful data that helps us assess how well our systems are performing.”
The Google representative also noted that the new language will not necessarily lead to changes in Gemini’s accuracy, because raters are asked to rate only the parts of the prompts that they understand. This can include providing feedback on things like formatting issues, even if the rater does not have specific expertise in the subject. The company also noted that this week it released the FACTS Grounding benchmark, which can be used to check LLM responses to ensure that “they are not only factually accurate with respect to the data provided, but also detailed enough to provide satisfactory answers to user queries.”