MINED

Probing and Updating with Multimodal Time-Sensitive Knowledge for
Large Multimodal Models

Teaser

"Temporal Awareness Evaluation, Comprehensive Benchmarking, and Multi-Dimensional Analysis!"

– Multimodal Time-Sensitive Knowledge




Introduction

Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs and inadequately evaluate LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along six key dimensions (cognition, awareness, trustworthiness, understanding, reasoning, and robustness) through 11 challenging tasks. MINED is constructed from Wikipedia by two professional annotators and contains 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack temporal understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport knowledge. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods, and observe that LMMs can effectively update knowledge via knowledge editing methods in single-editing scenarios.
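All quantitative results below are reported in CEM. As a rough reference only, the sketch below assumes CEM denotes a contains-exact-match criterion, i.e., a prediction counts as correct when a gold answer string appears in the model output; this is our reading, and the benchmark's actual normalization rules may differ.

# Minimal sketch of a contains-exact-match (CEM) style scorer.
# Assumption: a prediction is correct when any gold answer string
# appears in the model output (case-insensitive).
def cem_score(prediction: str, gold_answers: list[str]) -> int:
    pred = prediction.strip().lower()
    return int(any(gold.strip().lower() in pred for gold in gold_answers))

# Example: scoring a response about Lionel Messi's current club.
print(cem_score("He currently plays for Inter Miami CF.", ["Inter Miami CF"]))  # -> 1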

Multimodal tIme-seNsitive knowlEDge

[Left] Overall comparison with existing temporal knowledge benchmarks. [Right] We evaluate SOTA LMMs' temporal awareness of time-sensitive knowledge across six capability dimensions.

As shown in Table 2 and Figure 3, MINED comprises 4,208 questions spanning 6 dimensions and 6 types of fine-grained knowledge, demonstrating substantial diversity.

Probing Multimodal tIme-seNsitive knowlEDge

When evaluating the cognitive capacities of LMMs, we present queries conveying identical knowledge in three distinct temporal formats: Time-Agnostic, Temporal Interval-Aware, and Timestamp-Aware. For the knowledge "Lionel Messi played for Inter Miami CF", the Time-Agnostic, Temporal Interval-Aware, and Timestamp-Aware queries are formulated as follows: "Which club does the person in the image currently play for?", "Which club did the footballer play for between 2023 and 2024?", and "Which club did the footballer play for on 1 January 2024?", respectively. In Table 3, all LMMs perform best on Timestamp-Aware tasks. This phenomenon may stem from the narrower temporal context required: Timestamp-Aware queries only necessitate knowledge retrieval for a specific point in time, whereas Time-Agnostic and Temporal Interval-Aware tasks demand recalling broader or period-based information, which is more challenging. Despite this, the top-performing model, Gemini-2.5-Pro, still fails to recall approximately 25% of the knowledge, underscoring the importance of temporal sensitivity in model reasoning.
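To make the three formats concrete, the following sketch instantiates them from a single knowledge item; the templates and field names are our own illustrative assumptions, not the benchmark's actual construction code.

# Illustrative rendering of the three temporal query formats for one
# knowledge item (hypothetical templates).
TEMPLATES = {
    "time_agnostic": "Which club does the person in the image currently play for?",
    "interval_aware": "Which club did the footballer play for between {start} and {end}?",
    "timestamp_aware": "Which club did the footballer play for on {timestamp}?",
}

def build_queries(start: str, end: str, timestamp: str) -> dict[str, str]:
    # Render one query per temporal format for the same piece of knowledge.
    return {
        "time_agnostic": TEMPLATES["time_agnostic"],
        "interval_aware": TEMPLATES["interval_aware"].format(start=start, end=end),
        "timestamp_aware": TEMPLATES["timestamp_aware"].format(timestamp=timestamp),
    }

for fmt, query in build_queries("2023", "2024", "1 January 2024").items():
    print(f"{fmt}: {query}")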

Compared to the T.S.A. results in Table 3, LMMs' performance degrades when queries are accompanied by temporally misaligned context, which impedes correct knowledge recall. For the experiment in Figure 7, we use the same timestamp in the queries, with the only difference being whether the input query includes relevant but temporally misaligned text. We observe that more capable closed-source models and larger open-source models exhibit greater robustness to temporally misaligned context, whereas smaller open-source models suffer significant performance degradation. For instance, Qwen2-VL-I-7B shows declines of 43.84% on F.M.C and 56.43% on P.M.C. These results indicate that smaller models are more susceptible to misleading temporal context, with past misaligned information having a particularly strong negative impact.

As indicated by the P.U.D and F.U.D results in Table 3, most LMMs (except mPLUG-Owl2-7B) can effectively reject questions that contain unanswerable dates from either the past or the future. This is likely because such dates are absent from the training data, allowing the models to reject them with greater confidence. Furthermore, LMMs show a slightly stronger propensity to reject questions with unanswerable future dates, likely because these represent entirely unseen temporal concepts, resulting in even greater refusal certainty. Surprisingly, both Qwen2-VL-I-7B (average CEM score of 99.64) and Qwen2.5-VL-I-7B (average CEM score of 99.70) demonstrate exceptional performance in question refusal, a capability potentially attributable to defensive mechanisms strengthened during instruction tuning.

In the I.T.C column of Table 3, all LMMs perform poorly; even the top-performing model, Gemini-2.5-Pro, recalls less than 20% of relevant knowledge. This indicates a fundamental deficiency in understanding and utilizing implicit temporal concepts.

Unexpectedly, MiniCPM-V2.6-8B and InternVL2.5-8B achieve the highest performance on the ranking task, while models such as GPT-4.1 and Doubao-1.5-Vision-Pro score below 20% in CEM. Figure 5 further illustrates this phenomenon, showing a decline in ranking performance within the Qwen2.5-VL-I series as model size increases: 50.3 (3B) → 38.9 (7B) → 11.4 (72B), potentially due to overthinking. Larger models, despite their enhanced reasoning capabilities, may overcomplicate simple tasks like ranking, reducing their effectiveness. In contrast, on the more challenging calculation task, closed-source LMMs including Gemini-2.5-Pro and GPT-4.1 demonstrate superior performance.

According to the A.T.E results in Table 3, models such as Qwen-VL-7B, LLaVA-Next-M-7B, and InternVL2.5-8B fail to correct any prior errors, demonstrating severely limited robustness. Even the top-performing model, Gemini-2.5-Pro, corrects fewer than 40% of errors. These results indicate a significant need for improvement in temporal reasoning robustness across current models.

The Avg. results in Table 3 reveal a broad trend: more recently released LMMs generally achieve superior overall performance, suggesting a link between temporal awareness and recency of development.

Analysis of Exploratory Results

All LMMs show consistent trends in recalling time-sensitive knowledge across domains. As shown in Figure 4, LMMs perform better on queries related to organization, company, and country leaders, but worse on athletes and competition champions, likely due to the broader coverage of the former in public knowledge sources. Furthermore, closed-source models outperform open-source variants on university president queries, indicating potential discrepancies in their pretraining corpora.

Observing Figure 5, we have the following findings: (1) Larger model sizes generally lead to improved performance on most tasks, except for R.K, P.U.D, F.U.D, and A.T.E. (2) Even with an identical architecture, LMMs exhibit divergent performance when built on different foundation LLMs. For instance, while LLaVA-Next-L-8B and LLaVA-Next-M-7B perform poorly on the A.T.E task, LLaVA-Next-V-7B achieves a CEM score of 31.2.

In the Time-Agnostic task, we further categorize the model's outputs into fine-grained labels. Since Prompt Agreement is adopted, each knowledge item yields four outputs. If any output contains the most up-to-date value from the attribute list A, it is labeled Latest. If none includes the latest value but at least one contains an outdated answer, it is marked Outdated. All other cases are categorized as Irrelevant. In Table 4, open-source models not only produce a limited number of latest responses but also generate a substantial portion of irrelevant responses. In contrast, closed-source models reduce the frequency of irrelevant responses but still exhibit a high proportion of outdated responses. These statistics indicate that a significant portion of model-generated responses are either outdated or irrelevant, highlighting a pronounced issue of inaccurate time-sensitive knowledge. Figure 6 provides an approximate visualization of the temporal distribution of knowledge within LMMs. Closed-source models demonstrate broader temporal coverage, whereas the internal knowledge of open-source models is concentrated in more recent time periods, indicating a comparative difficulty in recalling information from distant historical contexts.
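The labeling rule above fits in a few lines; the sketch below is our own illustrative rendering, assuming the most up-to-date value in the attribute list A is known.

# Sketch of the fine-grained labeling rule for the Time-Agnostic task.
# Names are illustrative; matching here is simple case-insensitive containment.
def label_outputs(outputs: list[str], attribute_list: list[str], latest: str) -> str:
    texts = [o.lower() for o in outputs]          # the four Prompt Agreement outputs
    if any(latest.lower() in t for t in texts):   # any output has the latest value
        return "Latest"
    outdated = [v.lower() for v in attribute_list if v != latest]
    if any(v in t for v in outdated for t in texts):  # only earlier values appear
        return "Outdated"
    return "Irrelevant"                           # no value from A appears at all

# Example with Messi's club history as the attribute list A.
A = ["FC Barcelona", "Paris Saint-Germain", "Inter Miami CF"]
print(label_outputs(["He plays for Paris Saint-Germain."] * 4, A, latest="Inter Miami CF"))
# -> Outdated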

Table 5 provides a detailed error analysis of the awareness experiment. Red values in brackets denote a negative effect, green values a positive one; Con. refers to context-based answers, Oth. to other answers, and Irr. to irrelevant ones. Surprisingly, even when provided with relevant context, models still generate responses that are irrelevant to the query or contain incorrect values from the attribute list A, rather than leveraging the given context. This finding underscores the need to further investigate how models integrate external information with their internal knowledge.

Updating Multimodal tIme-seNsitive knowlEDge


By observing Table 6, we make the following observations: (1) FT-LLM demonstrates strong performance as a knowledge updating method, achieving superior results across all evaluated tasks. (2) In contrast, both SERAC and MEND exhibit comparatively weaker performance, demonstrating limited effectiveness in knowledge updating tasks. (3) With the exception of SERAC, all methods achieve excellent performance on the A.T.E task, demonstrating the strong robustness of current knowledge editing approaches. (4) Knowledge updating significantly enhances the model's performance on the complex I.T.C and C.A tasks.

By observing Table 7, we make the following observations: (1) Except for the P.U.D, F.U.D, and A.T.E tasks, the knowledge updating performance of FT-LLM, FT-VIS, and SERAC degrades to varying degrees. (2) SERAC maintains excellent performance in the lifelong editing scenario, with only a 10.35% loss; its memory-based architecture mitigates catastrophic forgetting through explicit caching, maintaining robust performance across sequential edits. (3) SERAC's performance on A.T.E improves by 12.55%, which may be because lifelong editing makes SERAC better suited to robustness tasks.
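The gap between Tables 6 and 7 comes down to the editing protocol: single editing scores each edit on a freshly edited model, while lifelong editing applies edits sequentially to one model, so earlier edits must survive later ones. A minimal sketch of the two protocols, with placeholder apply_edit and evaluate functions standing in for an actual editing method (e.g., FT-LLM or SERAC) and the MINED task metrics; this is our illustration, not the paper's code.

import copy

def single_editing(base_model, edits, apply_edit, evaluate):
    # Each edit is applied to a fresh copy of the model and scored in isolation.
    return [evaluate(apply_edit(copy.deepcopy(base_model), e), e) for e in edits]

def lifelong_editing(base_model, edits, apply_edit, evaluate):
    # Edits accumulate in a single model; catastrophic forgetting shows up
    # when earlier edits stop scoring well after later ones are applied.
    model = copy.deepcopy(base_model)
    scores = []
    for e in edits:
        model = apply_edit(model, e)
        scores.append(evaluate(model, e))
    return scores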

Qualitative Examples

Chat Templates

Our Team

BibTeX

@article{jiang2025mined,
  title  = {MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models},
  author = {Jiang, Kailin and Jiang, Ning and Ren, Yuchen and Li, Yuchen and Gao, Yifan and Bi, Jinhe and Ma, Yunpu and Liu, Qingqing and Wang, Xianhao and Jia, Yifan and Jiang, Hongbo and Hu, Yaocong and Li, Bin and Liu, Lei and Du, Yuntao},
  year   = {2025}
}