MINED

Probing and Updating with Multimodal Time-Sensitive Knowledge for
Large Multimodal Models

Teaser

"Temporal Awareness Evaluation, Comprehensive Benchmarking, and Multi-Dimensional Analysis!"

– Multimodal Time-Sensitive Knowledge




Introduction

Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs and inadequately evaluate LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along six key dimensions (cognition, awareness, trustworthiness, understanding, reasoning, and robustness) through 11 challenging tasks. MINED is constructed from Wikipedia by two professional annotators and contains 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack temporal understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport knowledge. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods, and observe that LMMs can effectively update knowledge via knowledge editing methods in single-editing scenarios.
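All quantitative results below are reported in CEM. As a rough reference only, the sketch below assumes CEM denotes a contains-exact-match criterion, i.e., a prediction counts as correct when a gold answer string appears in the model output; this is our reading, and the benchmark's actual normalization rules may differ.

# Minimal sketch of a contains-exact-match (CEM) style scorer.
# Assumption: a prediction is correct when any gold answer string
# appears in the model output (case-insensitive).
def cem_score(prediction: str, gold_answers: list[str]) -> int:
    pred = prediction.strip().lower()
    return int(any(gold.strip().lower() in pred for gold in gold_answers))

# Example: scoring a response about Lionel Messi's current club.
print(cem_score("He currently plays for Inter Miami CF.", ["Inter Miami CF"]))  # -> 1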

Multimodal tIme-seNsitive knowlEDge

[Left] Overall comparison with existing temporal knowledge benchmarks. [Right] We evaluate SOTA LMMs' temporal awareness of time-sensitive knowledge across six capability dimensions.

As shown in Table 2 and Figure 3, MINED comprises 4,208 questions spanning 6 dimensions and 6 types of fine-grained knowledge, demonstrating substantial diversity.

Probing Multimodal tIme-seNsitive knowlEDge

When evaluating the cognitive capacities of LMMs, we present queries conveying identical knowledge in three distinct temporal formats: Time-Agnostic, Temporal Interval-Aware, and Timestamp-Aware. For the knowledge "Lionel Messi played for Inter Miami CF", the Time-Agnostic, Temporal Interval-Aware, and Timestamp-Aware queries are formulated as follows: "Which club does the person in the image currently play for?", "Which club did the footballer play for between 2023 and 2024?", and "Which club did the footballer play for on 1 January 2024?", respectively. In Table 3, all LMMs perform best on Timestamp-Aware tasks. This phenomenon may stem from the narrower temporal context required: Timestamp-Aware queries only necessitate knowledge retrieval for a specific point in time, whereas Time-Agnostic and Temporal Interval-Aware tasks demand recalling broader or period-based information, which is more challenging. Despite this, the top-performing model, Gemini-2.5-Pro, still fails to recall approximately 25% of the knowledge, underscoring the importance of temporal sensitivity in model reasoning.
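To make the three formats concrete, the following sketch instantiates them from a single knowledge item; the templates and field names are our own illustrative assumptions, not the benchmark's actual construction code.

# Illustrative rendering of the three temporal query formats for one
# knowledge item (hypothetical templates).
TEMPLATES = {
    "time_agnostic": "Which club does the person in the image currently play for?",
    "interval_aware": "Which club did the footballer play for between {start} and {end}?",
    "timestamp_aware": "Which club did the footballer play for on {timestamp}?",
}

def build_queries(start: str, end: str, timestamp: str) -> dict[str, str]:
    # Render one query per temporal format for the same piece of knowledge.
    return {
        "time_agnostic": TEMPLATES["time_agnostic"],
        "interval_aware": TEMPLATES["interval_aware"].format(start=start, end=end),
        "timestamp_aware": TEMPLATES["timestamp_aware"].format(timestamp=timestamp),
    }

for fmt, query in build_queries("2023", "2024", "1 January 2024").items():
    print(f"{fmt}: {query}")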

Compared to the T.S.A. results in Table 3, LMMs' performance degrades when queries are accompanied by temporally misaligned context, which impedes correct knowledge recall. For the experiment in Figure 7, we use the same timestamp in the queries, with the only difference being whether the input query includes relevant but temporally misaligned text. We observe that more capable closed-source models and larger open-source models exhibit greater robustness to temporally misaligned context, whereas smaller open-source models suffer significant performance degradation. For instance, Qwen2-VL-I-7B shows declines of 43.84% on F.M.C and 56.43% on P.M.C. These results indicate that smaller models are more susceptible to misleading temporal context, with past misaligned information having a particularly strong negative impact.

As indicated by the P.U.D and F.U.D results in Table 3, most LMMs (except mPLUG-Owl2-7B) can effectively reject questions that contain unanswerable dates from either the past or the future. This is likely because such dates are absent from the training data, allowing the models to reject them with greater confidence. Furthermore, LMMs show a slightly stronger propensity to reject questions with unanswerable future dates, likely because these represent entirely unseen temporal concepts, resulting in even greater refusal certainty. Surprisingly, both Qwen2-VL-I-7B (average CEM score of 99.64) and Qwen2.5-VL-I-7B (average CEM score of 99.70) demonstrate exceptional performance in question refusal, a capability potentially attributable to defensive mechanisms strengthened during instruction tuning.

In the I.T.C column of Table 3, all LMMs perform poorly; even the top-performing model, Gemini-2.5-Pro, recalls less than 20% of relevant knowledge. This indicates a fundamental deficiency in understanding and utilizing implicit temporal concepts.

Unexpectedly, MiniCPM-V2.6-8B and InternVL2.5-8B achieve the highest performance on the ranking task, while models such as GPT-4.1 and Doubao-1.5-Vision-Pro score below 20% in CEM. Figure 5 further illustrates this phenomenon, showing a decline in ranking performance within the Qwen2.5-VL-I series as model size increases: 50.3 (3B) → 38.9 (7B) → 11.4 (72B), potentially due to overthinking. Larger models, despite their enhanced reasoning capabilities, may overcomplicate simple tasks like ranking, reducing their effectiveness. In contrast, on the more challenging calculation task, closed-source LMMs including Gemini-2.5-Pro and GPT-4.1 demonstrate superior performance.

According to the A.T.E results in Table 3, models such as Qwen-VL-7B, LLaVA-Next-M-7B, and InternVL2.5-8B fail to correct any prior errors, demonstrating severely limited robustness. Even the top-performing model, Gemini-2.5-Pro, corrects fewer than 40% of errors. These results indicate a significant need for improvement in temporal reasoning robustness across current models.

The Avg. results in Table 3 reveal a broad trend: more recently released LMMs generally achieve superior overall performance, suggesting a link between temporal awareness and recency of development.

Analysis of Exploratory Results

All LMMs show consistent trends in recalling time-sensitive knowledge across domains. As shown in Figure 4, LMMs perform better on queries related to organization, company, and country leaders, but worse on athletes and competition champions, likely due to the broader coverage of the former in public knowledge sources. Furthermore, closed-source models outperform open-source variants on university president queries, indicating potential discrepancies in their pretraining corpora.

Observing Figure 5, we have the following findings: (1) Larger model sizes generally lead to improved performance on most tasks, except for R.K, P.U.D, F.U.D, and A.T.E. (2) Even with an identical architecture, LMMs exhibit divergent performance when built on different foundation LLMs. For instance, while LLaVA-Next-L-8B and LLaVA-Next-M-7B perform poorly on the A.T.E task, LLaVA-Next-V-7B achieves a CEM score of 31.2.

In the Time-Agnostic task, we further categorize the model's outputs into fine-grained labels. Since Prompt Agreement is adopted, each knowledge item yields four outputs. If any output contains the most up-to-date value from the attribute list A, it is labeled Latest. If none includes the latest value but at least one contains an outdated answer, it is marked Outdated. All other cases are categorized as Irrelevant. In Table 4, open-source models not only produce a limited number of latest responses but also generate a substantial portion of irrelevant responses. In contrast, closed-source models reduce the frequency of irrelevant responses but still exhibit a high proportion of outdated responses. These statistics indicate that a significant portion of model-generated responses are either outdated or irrelevant, highlighting a pronounced issue of inaccurate time-sensitive knowledge. Figure 6 provides an approximate visualization of the temporal distribution of knowledge within LMMs. Closed-source models demonstrate broader temporal coverage, whereas the internal knowledge of open-source models is concentrated in more recent time periods, indicating a comparative difficulty in recalling information from distant historical contexts.
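The labeling rule above fits in a few lines; the sketch below is our own illustrative rendering, assuming the most up-to-date value in the attribute list A is known.

# Sketch of the fine-grained labeling rule for the Time-Agnostic task.
# Names are illustrative; matching here is simple case-insensitive containment.
def label_outputs(outputs: list[str], attribute_list: list[str], latest: str) -> str:
    texts = [o.lower() for o in outputs]          # the four Prompt Agreement outputs
    if any(latest.lower() in t for t in texts):   # any output has the latest value
        return "Latest"
    outdated = [v.lower() for v in attribute_list if v != latest]
    if any(v in t for v in outdated for t in texts):  # only earlier values appear
        return "Outdated"
    return "Irrelevant"                           # no value from A appears at all

# Example with Messi's club history as the attribute list A.
A = ["FC Barcelona", "Paris Saint-Germain", "Inter Miami CF"]
print(label_outputs(["He plays for Paris Saint-Germain."] * 4, A, latest="Inter Miami CF"))
# -> Outdated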

Table 5 provides a detailed error analysis of the awareness experiment. Red values in brackets denote a negative effect, green values a positive one; Con. refers to context-based answers, Oth. to other answers, and Irr. to irrelevant ones. Surprisingly, even when provided with relevant context, models still generate responses that are irrelevant to the query or contain incorrect values from the attribute list A, rather than leveraging the given context. This finding underscores the need to further investigate how models integrate external information with their internal knowledge.

Updating Multimodal tIme-seNsitive knowlEDge


By observing Table 6, we make the following observations: (1) FT-LLM demonstrates strong performance as a knowledge updating method, achieving superior results across all evaluated tasks. (2) In contrast, both SERAC and MEND exhibit comparatively weaker performance, demonstrating limited effectiveness in knowledge updating tasks. (3) With the exception of SERAC, all methods achieve excellent performance on the A.T.E task, demonstrating the strong robustness of current knowledge editing approaches. (4) Knowledge updating significantly enhances the model's performance on the complex I.T.C and C.A tasks.

By observing Table 7, we make the following observations: (1) Except for the P.U.D, F.U.D, and A.T.E tasks, the knowledge updating performance of FT-LLM, FT-VIS, and SERAC degrades to varying degrees. (2) SERAC maintains excellent performance in the lifelong editing scenario, with only a 10.35% loss; its memory-based architecture mitigates catastrophic forgetting through explicit caching, maintaining robust performance across sequential edits. (3) SERAC's performance on A.T.E improves by 12.55%, which may be because lifelong editing makes SERAC better suited to robustness tasks.
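The gap between Tables 6 and 7 comes down to the editing protocol: single editing scores each edit on a freshly edited model, while lifelong editing applies edits sequentially to one model, so earlier edits must survive later ones. A minimal sketch of the two protocols, with placeholder apply_edit and evaluate functions standing in for an actual editing method (e.g., FT-LLM or SERAC) and the MINED task metrics; this is our illustration, not the paper's code.

import copy

def single_editing(base_model, edits, apply_edit, evaluate):
    # Each edit is applied to a fresh copy of the model and scored in isolation.
    return [evaluate(apply_edit(copy.deepcopy(base_model), e), e) for e in edits]

def lifelong_editing(base_model, edits, apply_edit, evaluate):
    # Edits accumulate in a single model; catastrophic forgetting shows up
    # when earlier edits stop scoring well after later ones are applied.
    model = copy.deepcopy(base_model)
    scores = []
    for e in edits:
        model = apply_edit(model, e)
        scores.append(evaluate(model, e))
    return scores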

Qualitative Examples

Chat Templates

Our Team

BibTeX

@article{jiang2025mined,
  title  = {MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models},
  author = {Jiang, Kailin and Jiang, Ning and Ren, Yuchen and Li, Yuchen and Gao, Yifan and Bi, Jinhe and Ma, Yunpu and Liu, Qingqing and Wang, Xianhao and Jia, Yifan and Jiang, Hongbo and Hu, Yaocong and Li, Bin and Liu, Lei and Du, Yuntao},
  year   = {2025}
}