Always, Sometimes & LLMs: Evaluating Temporal Logic in Large Language Models

Author: DAVOR PAVITS
Email: stud-pavits@ruc.dk

Supervisor: Torben Braüner
Email: torben@ruc.dk

Abstract

This study evaluates the ability of smaller-scale language models to reason about temporal inferences under Priorean tense logic. A dataset of temporal inference tasks was constructed, comprising valid and invalid inference patterns with both sensical and nonsensical predicates, in order to isolate logical structure from world knowledge. Four models (Llama 3.1 70B Instruct, Llama 3.1 8B Instruct, Qwen 3 30B-A3B, Qwen 3 8B) were evaluated under zero-shot and few-shot prompting and under deterministic and stochastic decoding (temperature 0 and 1, respectively). The results show that few-shot prompting is the strongest factor influencing performance, substantially improving accuracy across all models. Performance was similar on sensical and nonsensical predicates, suggesting that model responses are driven more by structure than by semantic content. No significant differences were observed between past-directed and future-directed operators. However, models had some difficulty with nested temporal constructions, revealing limitations in handling recursive temporal reasoning. Increasing the decoding temperature generally reduced performance. Beyond classification accuracy, the study applies statistical and psychometric analyses to assess whether model behavior reflects genuine reasoning or heuristic approximation. The findings suggest that temporal reasoning abilities in Large Language Models (LLMs) are conditional: they emerge with few-shot prompting and with scale. While larger models exhibit evidence of structural temporal competence, the results do not provide conclusive support for robust human-like temporal reasoning, highlighting the need for cognitively informed evaluation frameworks.

Keywords: Large Language Models, Temporal Reasoning, Priorean Temporal Logic, Natural Language Inference, Prompting, Temperature, Cognitive Biases