A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Shaden Alshammari1*, Kevin Wen1*, Abrar Zainal3*, Mark Hamilton1,
Navid Safaei4, Sultan Albarakati2, William T. Freeman1†, Antonio Torralba1†

1MIT
2KAUST
3HUMAIN
4Bulgarian Academy of Sciences
*,† Equal contribution
MathNet overview: large-scale multilingual data, high-quality solutions, diverse topics, and three evaluation tasks

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.

MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.

MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models remain far from ceiling (78.4% accuracy for Gemini-3.1-Pro and 69.3% for GPT-5), while embedding models struggle to retrieve mathematically equivalent problems. We further show that RAG performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and the benchmark.

The examples below make the benchmark concrete. MathNet contains standard text-only problems, diagram-heavy geometry, and retrieval pairs whose difficulty lies not in exact lexical matching but in recognizing mathematical equivalence or structural resonance.

Netherlands, First Round 2020 · Geometry
Square and rectangle diagram
Francisca has a square piece of paper whose sides have length $10$ cm. She also has a rectangular piece of paper having the exact same area. She puts the rectangle on top of the square, aligning the bottom-left corners. The bottom side of the rectangle lies on the bottom of the square; the top-right corner of the rectangle lies on the right side of the square. What fraction of the square is not covered by the rectangle?
Answer: E) $13\tfrac{1}{4}$
Romanian Mathematical Olympiad 2020 · Geometry
Two congruent squares diagram
Two congruent squares $ABCD$ and $EFGH$ have disjoint interiors with $C$ the midpoint of $EF$ and $B,F,G$ collinear. Line $BC$ meets $EH$ at $K$ and line $AC$ meets $GH$ at $M$. Let $L$ be the midpoint of $GH$. Prove that $K, L, M$ are collinear and $CL \perp KM$.
Triangle $CFB$ is right-angled at $F$, hence similar to triangle $CEK$. Since $CF=CE$, the triangles are congruent, so $CK=CB=CL$: the points $K$, $B$, $L$ lie on a circle centered at $C$. This circle argument establishes both the collinearity of $K$, $L$, $M$ and $CL \perp KM$.
Romanian Mathematical Olympiad 2025 · 3D Geometry
3D square pyramid projection diagram
From a point $O$ inside square $ABCD$, a perpendicular $OS$ is raised to the plane. Let $M,N,P,Q$ be the projections of $O$ onto planes $(SAB),(SBC),(SCD),(SDA)$. Prove that $M,N,P,Q$ are coplanar if and only if $O$ lies on a diagonal of the square.
If $O$ lies on diagonal $AC$ then $OE=OF$ where $OE\perp AB$, $OF\perp AD$. The symmetry forces $M,N,P,Q$ into a common plane. Conversely, coplanarity implies $OE=OF$, which holds only on a diagonal.
Example Conceptual Problem Pair from MathNet-RAG
Problem A
Chinese TST 2014
Show that there are no pairs $(x,y)$ of positive integers satisfying $$(x+1)(x+2)\cdots(x+2014) = (y+1)(y+2)\cdots(y+4028).$$
Problem B
Russia 2009
Alireza multiplied a billion consecutive natural numbers, and Matin multiplied two million consecutive natural numbers. Prove that they got different results, or one of them made a mistake.

MathNet-RAG pairs problems by three levels of mathematical similarity — from strict equivalence to loose thematic clustering.

| Mode | Problem A | Problem B |
|------|-----------|-----------|
| Invariance | | |
| Syntactic Equivalence | Find $f:\mathbb{R}\to\mathbb{R}$ such that $f(x^2-y^2)=(x-y)(f(x)+f(y))$. | Find $g:\mathbb{R}\to\mathbb{R}$ such that $(g(a)+g(b))(a-b)=g(a^2-b^2)$. |
| Reformulation | Let $a_i>0$. Prove $\sum_{i=1}^n \frac{a_i}{a_i^2+a_{i+1}a_{i+2}} \le \sum_{i=1}^n \frac{1}{a_i+a_{i+1}}$. | Let $a_i>0$. Prove $\sum_{i=1}^n \frac{a_i^2}{a_i^2+a_{i+1}a_{i+2}} \ge \tfrac{1}{2}$. |
| Transformational | Find all $x\in\mathbb{R}$ such that $4^x+6^x=9^x$. | Find all $x\in\mathbb{R}$ such that $(2/3)^x+(3/2)^x=5/2$. |
| Structural Resonance | | |
| Generalization | For $k\ge1$, prove that $\frac{n}{\gcd(n,k)}$ divides $\binom{n}{k}$ for all $n\ge k$. | Show $\binom{n}{m}\equiv\prod\binom{n_i}{m_i}\pmod{p}$, where $n=\sum n_i p^i$, $m=\sum m_i p^i$. |
| Common Lemma | If $a,b$ are positive integers with $ab+1\mid a^2+b^2$, show $\frac{a^2+b^2}{ab+1}$ is a perfect square. | If $a^2+b^2+c^2=k(ab+bc+ca)$, show $k\in\{1,2,3\}$. |
| Structural Reduction | Prove that if $4^n+2^n+1$ is prime, then $n$ is a power of $3$. | Prove $7 \mid 2^{2n}+2^n+1$ whenever $3\nmid n$. |
| Affinity (Thematic) | | |
| Affinity | Show the largest prime factor of $\binom{2n}{n}$ is greater than $n^{2/3}$. | For every $n>1$, there exists a prime $p$ with $n<p<2n$. |

Table: Taxonomy of mathematical similarity with Olympiad examples. Invariance captures strict equivalence under reformulation; Structural Resonance reflects shared lemmas or reductions; Affinity denotes looser thematic clustering.

Rather than treating evaluation as a single benchmark score, MathNet separates the problem into three linked tasks. This helps distinguish failures of reasoning from failures of retrieval, and makes it easier to study when external mathematical context is genuinely useful.

Task I: Problem Solving. Evaluate generative models on Olympiad-level problems spanning algebra, combinatorics, geometry, and number theory, with expert-authored solutions as ground truth.

Task II: Math-Aware Retrieval. Benchmark embedding models on their ability to retrieve mathematically equivalent and structurally similar problems from a large corpus.

Task III: Retrieval-Augmented Problem Solving. Assess how retrieval quality affects reasoning: retrieved similar problems are provided as context to generative models before solving.

The statistical view is useful for understanding why the benchmark is hard. MathNet is not only large; it is heterogeneous in language, topic, contest type, and solution length, which makes naive transfer from smaller math benchmarks unreliable.

MathNet dataset statistics

MathNet dataset statistics. (a) Contest type distribution. (b) Solution length compared to existing benchmarks — MathNet solutions are substantially longer. (c) Problems per year. (d) Topic and sub-topic distribution. (e) Language distribution: 74% English, 26% non-English across 17 languages.

The dataset is not just scraped from PDFs. The pipeline combines OCR, segmentation, normalization, and human verification so that the final benchmark preserves both the original problem statements and the long-form solutions needed for evaluation.

MathNet data extraction and curation pipeline

Data extraction and curation pipeline. PDF booklets are processed via OCR to extract markdown text, segmented into problem–solution blocks, normalized with GPT-4.1, and verified by human experts to produce the curated dataset.
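As a rough illustration of the segmentation step, the sketch below splits OCR'd markdown into problem–solution blocks. The "Problem N." and "Solution." markers are hypothetical stand-ins; the actual pipeline relies on GPT-4.1 normalization and human verification rather than fixed patterns:

```python
import re

# Hypothetical markers: one block per "Problem N." heading, with the
# solution introduced by "Solution." (not MathNet's real format).
BLOCK = re.compile(
    r"Problem\s+(\d+)\.\s*(.*?)\s*Solution\.\s*(.*?)(?=Problem\s+\d+\.|\Z)",
    re.S,
)

def segment_blocks(markdown: str) -> list[dict]:
    """Split OCR'd contest markdown into problem-solution records."""
    return [
        {"id": int(num), "problem": stmt.strip(), "solution": sol.strip()}
        for num, stmt, sol in BLOCK.findall(markdown)
    ]

page = """Problem 1. Compute 2+2.
Solution. 2+2=4.
Problem 2. Find x with x^2=9, x>0.
Solution. x=3."""
print(segment_blocks(page)[1]["solution"])  # -> x=3.
```

In practice OCR output is far messier than this, which is why the real pipeline adds a normalization pass and expert review after segmentation.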

We close with the main quantitative results, which summarize the benchmark at a glance: reasoning models still have meaningful headroom on Olympiad solving, retrieval remains difficult at low recall depths, and expert retrieval gives the most reliable gains in RAG.

Problem Solving on MathNet-Solve-Test
Using the paper's micro-average accuracy across all 6,400 test problems.
| Model | Accuracy |
|-------|----------|
| gemini-3.1-pro-preview | 78.4% |
| gemini-3-flash-preview | 70.4% |
| gpt-5 | 69.3% |
| gpt-5-mini | 57.0% |
| claude-opus-4.6 | 45.7% |
| gpt-5-nano | 42.2% |
| gemini-2.5-flash | 41.1% |
| DeepSeek-V3.2 | 40.1% |
| DeepSeek-R1 | 36.3% |
| grok-3 | 28.5% |
| gpt-4.1 | 21.4% |
| Llama-4-Maverick-17B* | 14.7% |

Problem Solving on MathNet-Solve-Test (6,400 problems), ranked by the paper's overall micro-average accuracy. Takeaway: LMMs with reasoning are clearly strongest overall, but even the top model remains well below perfect performance.
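For reference, a micro-average pools correct answers over all problems rather than averaging per-topic accuracies, so better-represented topics weigh more. A minimal sketch (the topic names and counts below are invented, not MathNet's actual distribution):

```python
def micro_average(per_topic: dict[str, tuple[int, int]]) -> float:
    """Micro-average accuracy in percent.

    per_topic maps topic -> (correct, total); counts are pooled across
    topics before dividing, unlike a macro average of per-topic rates.
    """
    correct = sum(c for c, _ in per_topic.values())
    total = sum(n for _, n in per_topic.values())
    return 100 * correct / total

# Hypothetical per-topic (correct, total) counts:
scores = {"algebra": (180, 200), "geometry": (90, 200)}
print(micro_average(scores))  # -> 67.5 (macro average would be 67.5 too
# only when topics are equal-sized, as here; they generally differ)
```

With unequal topic sizes the micro and macro numbers diverge, which is why the ranking above fixes one convention.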

Math-Aware Retrieval on MathNet-Retrieve
| Model | Recall@1 (%) | Recall@5 (%) |
|-------|--------------|--------------|
| gemini-embedding-001 | 4.83 | 68.88 |
| qwen3-embedding-4B | 4.96 | 64.95 |
| all-mpnet-base-v2 | 3.78 | 57.70 |
| multi-qa-mpnet-base-dot-v1 | 3.27 | 55.08 |
| text-embedding-3-large | 2.74 | 54.23 |
| cohere-embed-v4.0 | 2.24 | 44.81 |

Math-Aware Retrieval on MathNet-Retrieve (10,000 anchor problems), using the paper's aggregate "All" Recall@1 and Recall@5 values. Takeaway: Recall@1 stays very low even for the best models, while Recall@5 is much stronger, showing that retrieval of mathematically equivalent problems is still unreliable at shallow depths.
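Recall@k here reads as: for what fraction of queries does the gold equivalent problem appear among the top-k retrieved candidates? A minimal sketch on toy data (the IDs and rankings below are invented):

```python
def recall_at_k(ranked_ids: list[list[int]], gold_ids: list[int], k: int) -> float:
    """Percent of queries whose gold problem is in the top-k retrieved IDs."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return 100 * hits / len(gold_ids)

# Three toy queries; the gold problem sits at rank 1, rank 3, and
# outside the retrieved list, respectively.
ranked = [[5, 2, 9], [4, 8, 5], [1, 2, 3]]
gold = [5, 5, 7]
print(round(recall_at_k(ranked, gold, 1), 1))  # -> 33.3
print(round(recall_at_k(ranked, gold, 3), 1))  # -> 66.7
```

The gap between the two printed numbers mirrors the benchmark's pattern: deepening the cutoff recovers many pairs that rank-1 retrieval misses.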

Retrieval-Augmented Problem Solving on MathNet-RAG
| Model | Grading | Zero-shot | Embed-RAG | Expert-RAG |
|-------|---------|-----------|-----------|------------|
| DeepSeek-V3.2-Speciale | Human | 84.8% | 89.5% | 97.3% |
| Gemini-3-Pro | Human | 89.1% | 92.9% | 87.5% |
| GPT-5 | LLM | 87.1% | 81.8% | 85.8% |
| Claude-4.5-Opus | Human | 46.8% | 55.5% | 52.4% |

Retrieval-Augmented Problem Solving on MathNet-RAG (35 problems), showing the most legible comparisons from the paper across the zero-shot, Embed-RAG, and Expert-RAG settings. Takeaway: expert retrieval most often gives the strongest gains, but improvements remain model-dependent and grading-dependent.
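Mechanically, the three settings differ only in what gets prepended to the prompt: nothing (zero-shot), nearest neighbors from an embedding index (Embed-RAG), or expert-curated pairs (Expert-RAG). The template below is an illustrative stand-in, not the paper's actual prompt:

```python
def build_rag_prompt(query: str, retrieved: list[tuple[str, str]],
                     mode: str = "embed") -> str:
    """Assemble a solving prompt; wording here is hypothetical.

    retrieved holds (problem, solution) pairs from either an embedding
    index ("embed") or expert curation ("expert"); empty -> zero-shot.
    """
    if not retrieved:  # zero-shot: the bare problem
        return f"Solve the following problem.\n\n{query}"
    context = "\n\n".join(
        f"Related problem: {p}\nIts solution: {s}" for p, s in retrieved
    )
    return (
        f"Here are solved problems that may share useful ideas "
        f"({mode} retrieval):\n\n{context}\n\n"
        f"Now solve the following problem.\n\n{query}"
    )

# Usage: one expert-curated pair prepended to a toy query.
prompt = build_rag_prompt("Prove 3 | n^3 - n.",
                          [("Show 2 | n^2 - n.", "n^2 - n = n(n-1).")],
                          mode="expert")
```

Because the solver sees whatever the retriever surfaces, a wrong or irrelevant neighbor can actively hurt, which is consistent with the Embed-RAG drops observed for some models above.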

For implementation details, benchmark construction choices, and the full experimental setup, the paper remains the primary reference. The citation is included here at the end for convenience.

@inproceedings{alshammari2026mathnet,
  title     = {{MathNet}: A Global Multimodal Benchmark for Mathematical
               Reasoning and Retrieval},
  author    = {Alshammari, Shaden and Wen, Kevin and Zainal, Abrar and
               Hamilton, Mark and Safaei, Navid and Albarakati, Sultan and
               Freeman, William T. and Torralba, Antonio},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mathnet.csail.mit.edu}
}

For questions about the dataset, benchmark, or paper, reach out to shaden@mit.edu.