Large language models (LLMs) are increasingly used to predict human behavior from plain-text descriptions of experimental tasks, ranging from judgments of disease severity to consequential medical decisions. While these methods promise rapid insight without formal psychological theory, we identify a critical flaw: embedding-based predictors often exploit incidental regularities in the training data that appear predictive in-sample but collapse under novel experimental conditions. Across multiple behavioral studies, we show that such models can produce severely inaccurate predictions, in some cases reversing the direction of true effects, when applied beyond the conditions on which they were fit. Standard validation procedures, which evaluate on data drawn from the same conditions as the training set, fail to detect this failure mode and so create unwarranted confidence in model reliability. We introduce a simple diagnostic for identifying such extrapolation failures and argue that researchers should prioritize theoretical grounding over statistical convenience. Without such safeguards, LLM-based behavioral predictions risk being scientifically uninformative despite impressive in-sample performance.
Keywords: embedding-based regression; extrapolation failure; large language models; model adequacy; out-of-distribution generalization.
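To make the failure mode concrete, the following is a minimal sketch, not the paper's own implementation, of why standard validation can mask extrapolation failure in embedding-based regression. All names and the data-generating process are illustrative assumptions: synthetic features stand in for LLM text embeddings, a ridge regression stands in for the predictive model, and a hypothetical `condition` variable indexes experimental conditions. The sketch compares random k-fold cross-validation, where train and test items share conditions, with leave-one-condition-out evaluation, which forces the model to extrapolate.

```python
# Minimal sketch (assumed setup): random k-fold CV vs. leave-one-condition-out
# CV for a ridge regression on embedding-like features. A large gap between
# the two scores is the signature of extrapolation failure discussed above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: X plays the role of LLM embeddings of task
# descriptions, y the observed human responses, `condition` the experimental
# condition of each item (all hypothetical).
n_items, n_dims, n_conditions = 300, 64, 6
condition = rng.integers(n_conditions, size=n_items)

# Each condition leaves an incidental "signature" in the features, and also
# shifts the outcome. Within known conditions the signature predicts the
# shift; for unseen conditions it cannot.
condition_signature = rng.normal(scale=2.0, size=(n_conditions, n_dims))
condition_effect = rng.normal(scale=2.0, size=n_conditions)
X = rng.normal(size=(n_items, n_dims)) + condition_signature[condition]
y = 0.5 * X[:, 0] + condition_effect[condition] + rng.normal(scale=0.5, size=n_items)

model = Ridge(alpha=1.0)

# Standard validation: random folds mix conditions across train and test.
random_cv = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

# Diagnostic-style validation: hold out entire conditions at a time.
condition_cv = cross_val_score(
    model, X, y, groups=condition, cv=LeaveOneGroupOut(), scoring="r2"
)

print(f"random k-fold R^2:           {random_cv.mean():.2f}")
print(f"leave-one-condition-out R^2: {condition_cv.mean():.2f}")
```

Under these assumptions the random-fold score looks strong while the held-out-condition score degrades sharply, illustrating how a conventional cross-validation report can coexist with poor out-of-distribution generalization.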