MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

MathNet overview: large-scale multilingual data, high-quality solutions, diverse topics, and three evaluation tasks

Why MathNet

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.

MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.

MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that RAG performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark.

Dataset Examples

Here are some examples from MathNet. The dataset contains standard text-only problems, diagram-heavy geometry, and retrieval pairs where the challenge is not exact lexical matching but mathematical equivalence or structural resonance.

First Round 2020 · Netherlands · Geometry

Francisca has a square piece of paper whose sides have length $10$ cm. She also has a rectangular piece of paper having the exact same area as the square piece of paper. She puts the rectangle right on top of the square, putting the left bottom corner of both pieces of paper in the same spot. Exactly one quarter of the square remains uncovered by the rectangle. What is the length in centimetres of the long side of the rectangle?

Romanian Mathematical Olympiad 2020 · Geometry

Two congruent squares $ABCD$ and $EFGH$ have disjoint interiors with $C$ the midpoint of $EF$ and $B,F,G$ collinear. Line $BC$ meets $EH$ at $K$ and line $AC$ meets $GH$ at $M$. Let $L$ be the midpoint of $GH$. Prove that $K, L, M$ are collinear and $CL \perp KM$.

Romanian Mathematical Olympiad 2025 · 3D Geometry

From a point $O$ inside square $ABCD$, a perpendicular $OS$ is raised to the plane. Let $M,N,P,Q$ be the projections of $O$ onto planes $(SAB),(SBC),(SCD),(SDA)$. Prove that $M,N,P,Q$ are coplanar if and only if $O$ lies on a diagonal of the square.

Example Conceptual Problem Pair from MathNet-RAG

Problem A

Chinese TST 2014

Show that there are no $2$-tuples $(x,y)$ of positive integers satisfying $$(x+1)(x+2)\cdots(x+2014) = (y+1)(y+2)\cdots(y+4028).$$

Problem B

Russia 2009

Alireza multiplied a billion consecutive natural numbers, and Matin multiplied two million consecutive natural numbers. Prove that they got different results, or one of them made a mistake.

MathNet-RAG pairs problems by three levels of mathematical similarity — from strict equivalence to loose thematic clustering.

Mode	Problem A	Problem B
Invariance
Syntactic Equivalence	Find $f:\mathbb{R}\to\mathbb{R}$ such that $f(x^2-y^2)=(x-y)(f(x)+f(y))$.	Find $g:\mathbb{R}\to\mathbb{R}$ such that $(g(a)+g(b))(a-b)=g(a^2-b^2)$.
Reformulation	Let $a_i>0$. Prove $\sum_{i=1}^n \frac{a_i}{a_i^2+a_{i+1}a_{i+2}} \le \sum_{i=1}^n \frac{1}{a_i+a_{i+1}}$.	Let $a_i>0$. Prove $\sum_{i=1}^n \frac{a_i^2}{a_i^2+a_{i+1}a_{i+2}} \ge \tfrac{1}{2}$.
Transformational	Find all $x\in\mathbb{R}$ such that $4^x+6^x=9^x$.	Find all $x\in\mathbb{R}$ such that $(2/3)^x+(3/2)^x=5/2$.
Structural Resonance
Generalization	For $k\ge1$, prove that $k$ divides $\binom{n}{k}$ for all $n\ge k$.	Show $\binom{n}{m}\equiv\prod\binom{n_i}{m_i}\pmod{p}$, where $n=\sum n_i p^i$, $m=\sum m_i p^i$.
Common Lemma	If $ab+1\mid a^2+b^2$, show $\frac{a^2+b^2}{ab+1}$ is a perfect square.	If $a^2+b^2+c^2=k(ab+bc+ca)$, show $k\in\{1,2,3\}$.
Structural Reduction	Prove that $4^n+2^n+1$ is never prime.	Prove $2^{2n}+2^n+1$ is divisible by $3$ for all $n$.
Affinity (Thematic)
Affinity	Show the largest prime factor of $\binom{2n}{n}$ is greater than $n^{2/3}$.	For every $n>1$, there exists a prime $p$ with $n<p<2n$.

Table: Three levels of mathematical similarity, illustrated with Olympiad examples. Invariance means the problems are equivalent under reformulation. Structural Resonance means they share key lemmas or reductions. Affinity is a looser connection — same topic area, different approach.
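
To make the Transformational row concrete, here is a short worked check of our own (it is not taken from the paper). Dividing the first equation by $4^x$ and writing $t=(3/2)^x>0$ in both equations reduces each one to a quadratic in $t$:

$$4^x+6^x=9^x \iff 1+t=t^2 \iff t=\tfrac{1+\sqrt{5}}{2}, \qquad \left(\tfrac{2}{3}\right)^x+\left(\tfrac{3}{2}\right)^x=\tfrac{5}{2} \iff 2t^2-5t+2=0 \iff t\in\left\{\tfrac{1}{2},\,2\right\}.$$

The link between the two problems is therefore the shared substitution, which a purely lexical comparison of the statements would not reveal.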

Tasks

MathNet supports three tasks. The first tests whether models can solve Olympiad problems outright. The second tests whether embedding models can retrieve mathematically equivalent problems from a large pool. The third combines both: does giving a model a similar problem as context actually help?

Task I

Problem Solving

Given a problem, can the model produce a correct solution? We test across algebra, combinatorics, geometry, and number theory, graded against expert-written solutions.

Task II

Math-Aware Retrieval

Given a query problem, can an embedding model find the mathematically equivalent or structurally similar problems in a pool of 30K?
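
As a minimal sketch of how a Recall@k score for this task could be computed, the snippet below ranks pool items by cosine similarity and counts a query as a hit if any of its relevant problems appears in the top k. This is one common definition of Recall@k and may differ from the paper's exact protocol. The function names, the toy vectors, and the `positives` structure are all illustrative assumptions, not part of MathNet.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(query_vecs, pool_vecs, positives, k):
    """Fraction of queries with at least one relevant pool item in the top k.

    positives[i] is the set of pool indices considered relevant to query i
    (an assumed format for the expert-curated pairs).
    """
    hits = 0
    for q, relevant in zip(query_vecs, positives):
        # Rank pool indices from most to least similar to the query.
        ranked = sorted(range(len(pool_vecs)),
                        key=lambda j: cosine(q, pool_vecs[j]),
                        reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return hits / len(query_vecs)

# Toy 2-D "embeddings"; a real run would use an embedding model's output.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1], [0.2, 1.0]]
positives = [{2}, {1}]

print(recall_at_k(queries, pool, positives, k=1))  # 0.5
print(recall_at_k(queries, pool, positives, k=2))  # 1.0
```

The function returns a fraction in [0, 1], whereas the retrieval results on this page are reported as percentages. In a real evaluation the query problem itself would also need to be excluded from the pool.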

Task III

Retrieval-Augmented Problem Solving

A retrieved problem is given to the model as context before it solves the query. This measures how much retrieval quality actually matters for final accuracy.
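
The setup above can be sketched as a prompt builder that places the retrieved problem before the query, with the zero-shot setting as the case where nothing is retrieved. The function name and the section labels are hypothetical, and the prompts actually used in the paper may be worded and structured differently.

```python
def build_prompt(query_problem, retrieved_problem=None):
    """Assemble a solver prompt; omit the context section for zero-shot."""
    sections = []
    if retrieved_problem is not None:
        # Retrieved context comes first so the model reads it before the query.
        sections.append("Related problem (context only):\n" + retrieved_problem)
    sections.append("Problem to solve:\n" + query_problem)
    return "\n\n".join(sections)

query = "Prove that the two products are different."
context = "Show that there are no positive integers (x, y) with equal products."

zero_shot_prompt = build_prompt(query)
rag_prompt = build_prompt(query, context)
print(rag_prompt)
```

Keeping the zero-shot and retrieval-augmented prompts identical apart from the context section makes any accuracy difference attributable to the retrieved problem alone.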

Dataset Statistics

MathNet covers 47 countries and 17 languages, with problems spanning two decades of competition math. The solutions are long (considerably longer than those in existing benchmarks), which is part of what makes evaluation harder.

MathNet dataset statistics. (a) Contest type distribution. (b) Solution length vs. existing benchmarks — MathNet solutions are much longer. (c) Problems per year. (d) Topic and sub-topic distribution. (e) Language distribution: 74% English, 26% non-English across 17 languages.

Data Pipeline

Each problem starts as a scanned competition booklet. We run OCR, split the text into problem–solution pairs, normalize the formatting, and have human experts verify the output before anything enters the dataset.

Data extraction and curation pipeline. Competition PDFs are converted to markdown via OCR, split into problem–solution blocks, normalized with GPT-4.1, and verified by human experts.
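
As a rough illustration of the splitting step only, the sketch below cuts OCR'd markdown into problem and solution blocks at heading lines. The heading pattern, the function name, and the output format are assumptions made for this example; real booklets vary widely in layout, and the actual pipeline also relies on GPT-4.1 normalization and expert verification, which are not shown.

```python
import re

# Hypothetical heading convention: a line starting with "Problem" or "Solution".
HEADING = re.compile(r"^(Problem|Solution)\b.*$", re.MULTILINE)

def split_pairs(markdown):
    """Split markdown text into a list of {'problem', 'solution'} dicts."""
    pairs = []
    current = None
    matches = list(HEADING.finditer(markdown))
    for i, m in enumerate(matches):
        # A block's body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[m.end():end].strip()
        if m.group(1) == "Problem":
            current = {"problem": body, "solution": None}
            pairs.append(current)
        elif current is not None and current["solution"] is None:
            current["solution"] = body
    return pairs

sample = (
    "Problem 1\nFind x with 2x = 4.\n\n"
    "Solution\nx = 2.\n\n"
    "Problem 2\nProve 1 + 1 = 2.\n"
)
for pair in split_pairs(sample):
    print(pair)
```

A problem left with `solution` set to `None` is the kind of case a human reviewer would need to resolve before the pair enters a dataset.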

Results

The top solving model reaches 78.4%, which is strong. However, retrieval shows the larger gap: Recall@1 stays below 5% for every model we tested. Expert-retrieved context helps solving accuracy, but only when the retrieval is actually good.

Problem Solving on MathNet-Solve-Test

Scores are the paper's micro-average accuracy across all 6,400 test problems.

Model	Accuracy
gemini-3.1-pro-preview	78.4%
gemini-3-flash-preview	70.4%
gpt-5	69.3%
gpt-5-mini	57.0%
claude-opus-4.6	45.7%
gpt-5-nano	42.2%
gemini-2.5-flash	41.1%
DeepSeek-V3.2	40.1%
DeepSeek-R1	36.3%
grok-3	28.5%
gpt-4.1	21.4%
Llama-4-Maverick-17B*	14.7%

Problem Solving on MathNet-Solve-Test (6,400 problems). The chart uses the paper's overall micro-average accuracy values to make the ranking legible on the page. Takeaway: LMMs with reasoning are clearly strongest overall, but even the top model remains well below perfect performance.
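
For readers unfamiliar with the term, a micro-average pools all problems and weights each one equally, whereas a macro-average first scores each group and then averages the group scores. The sketch below contrasts the two on made-up data. The topic labels, the outcomes, and the function names are illustrative assumptions, not MathNet results.

```python
from collections import defaultdict

def micro_average(results):
    """Accuracy over all problems pooled together."""
    return sum(correct for _, correct in results) / len(results)

def macro_average(results):
    """Mean of per-topic accuracies, so every topic counts equally."""
    by_topic = defaultdict(list)
    for topic, correct in results:
        by_topic[topic].append(correct)
    per_topic = [sum(v) / len(v) for v in by_topic.values()]
    return sum(per_topic) / len(per_topic)

# (topic, solved correctly?) for five hypothetical problems.
results = [
    ("algebra", True), ("algebra", True), ("algebra", True), ("algebra", False),
    ("geometry", False),
]

print(micro_average(results))  # 0.6   (3 of 5 problems)
print(macro_average(results))  # 0.375 (mean of 0.75 and 0.0)
```

When topics differ in size, the two averages can diverge noticeably, so a ranking reported under one should not be compared directly with numbers computed under the other.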

Math-Aware Retrieval on MathNet-Retrieve

Model	Recall@1	Recall@5
gemini-embedding-001	4.83	68.88
qwen3-embedding-4B	4.96	64.95
all-mpnet-base-v2	3.78	57.70
multi-qa-mpnet-base-dot-v1	3.27	55.08
text-embedding-3-large	2.74	54.23
cohere-embed-v4.0	2.24	44.81

Math-Aware Retrieval on MathNet-Retrieve (10,000 anchor problems). This chart uses the paper's aggregate “All” Recall@1 and Recall@5 values. Takeaway: Recall@1 stays very low even for the best models, while Recall@5 is much stronger, showing that mathematically equivalent retrieval is still unreliable at shallow depths.

Retrieval-Augmented Problem Solving on MathNet-RAG

Model	Grading	Zero-shot	Embed-RAG	Expert-RAG
DeepSeek-V3.2-Speciale	Human	84.8%	89.5%	97.3%
Gemini-3-Pro	Human	89.1%	92.9%	87.5%
GPT-5	LLM	87.1%	81.8%	85.8%
Claude-4.5-Opus	Human	46.8%	55.5%	52.4%

Retrieval-Augmented Problem Solving on MathNet-RAG (35 problems). These bars highlight the most legible comparisons from the paper across zero-shot, Embed-RAG, and Expert-RAG settings. Takeaway: expert retrieval most often gives the strongest gains, but improvements remain model-dependent and grading-dependent.
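
To read the comparison as changes relative to zero-shot, the short sketch below subtracts each model's zero-shot score from its Embed-RAG and Expert-RAG scores, using the values shown on this page. The dictionary layout and the function name are our own. The differences are in percentage points, and with 35 problems a single problem is worth roughly 2.9 points.

```python
# Scores (%) as reported on this page for MathNet-RAG.
rag_scores = {
    "DeepSeek-V3.2-Speciale": {"zero": 84.8, "embed": 89.5, "expert": 97.3},
    "Gemini-3-Pro": {"zero": 89.1, "embed": 92.9, "expert": 87.5},
    "GPT-5": {"zero": 87.1, "embed": 81.8, "expert": 85.8},
    "Claude-4.5-Opus": {"zero": 46.8, "embed": 55.5, "expert": 52.4},
}

def gains_over_zero_shot(scores):
    """Percentage-point change of each RAG setting relative to zero-shot."""
    return {
        model: {
            "embed": round(s["embed"] - s["zero"], 1),
            "expert": round(s["expert"] - s["zero"], 1),
        }
        for model, s in scores.items()
    }

for model, delta in gains_over_zero_shot(rag_scores).items():
    print(f"{model}: embed {delta['embed']:+.1f}, expert {delta['expert']:+.1f}")
```

On these four models the change ranges from +12.5 points (DeepSeek-V3.2-Speciale with Expert-RAG) to -5.3 points (GPT-5 with Embed-RAG), which illustrates how model-dependent the effect of retrieved context is.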

BibTeX

If you use MathNet in your work, please cite the paper.

@inproceedings{alshammari2026mathnet,
  title     = {MathNet: A Global Multimodal Benchmark for Mathematical
               Reasoning and Retrieval},
  author    = {Alshammari, Shaden and Wen, Kevin and Zainal, Abrar and
               Hamilton, Mark and Safaei, Navid and Albarakati, Sultan and
               Freeman, William T. and Torralba, Antonio},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mathnet.mit.edu}
}

Contact

For questions about the dataset, benchmark, or paper, reach out to shaden@mit.edu.