IPBench

Benchmarking the Knowledge of Large Language Models in Intellectual Property

Qiyao Wang1,2, Guhong Chen1, Hongbo Wang2, Huaren Liu2,
Minghui Zhu2, Zhifei Qin2, Linwei Li2, Yilin Yue2, Shiqiang Wang2, Jiayan Li2, Yihang Wu2,
Ziqiang Liu1, Longze Chen1, Run Luo1, Liyang Fan1, Jiaming Li1,
Lei Zhang1, Kan Xu2, Hongfei Lin2, Hamid Alinejad-Rokny4,
Shiwen Ni1,†, Yuan Lin2,†, Min Yang1,3,†

1SIAT-NLP, 2DUT-IR, 3SUAT and 4UNSW

† Corresponding Authors

Overview of IPBench.

πŸ””News

πŸŽ‰ [2025-04-23]: Our IPBench paper (IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property) can be accessed in arXiv!

πŸ”₯ [2025-04-17]: We will release our paper and benchmark soon.

Introduction

Intellectual property, especially patents, shares similarities with academic papers in that it encapsulates the essence of domain knowledge across various technical fields. However, it is also governed by the intellectual property legal frameworks of different countries and regions. As such, it carries technical, legal, and economic significance and is closely connected to real-world intellectual property services. Moreover, intellectual property data is a rich, multi-modal data type with immense potential for content mining and analysis. Focusing on the field of intellectual property, we propose a comprehensive four-level IP task taxonomy based on the DOK model. Building on this taxonomy, we develop IPBench, a large language model benchmark consisting of 10,374 data instances across 20 tasks and covering 8 types of IP mechanisms. Compared with existing related benchmarks, IPBench offers the largest data scale and the most comprehensive task coverage, spanning technical and legal tasks as well as understanding, reasoning, classification, and generation tasks.

Intellectual Property Benchmark

Overview

To bridge the gap between real-world demands and the application of LLMs in the IP field, we introduce the first comprehensive IP task taxonomy. Our taxonomy is based on Webb's Depth of Knowledge (DOK) Theory and is extended to include four hierarchical levels: Information Processing, Logical Reasoning, Discriminant Evaluation, and Creative Generation. It includes an evaluation of models' intrinsic knowledge of IP, along with a detailed analysis of IP text from both point-wise and pairwise perspectives, covering technical and legal aspects.

Building on this taxonomy, we develop IPBench, the first comprehensive Intellectual Property Benchmark for LLMs, consisting of 10,374 data points across 20 tasks aimed at evaluating the knowledge and capabilities of LLMs in real-world IP applications.

This holistic evaluation enables us to gain hierarchical, in-depth insight into LLMs, assessing their capabilities in in-domain memory, understanding, reasoning, discrimination, and creation across different IP mechanisms. Because IP is governed by jurisdiction-specific law, it differs across countries and regions; IPBench is therefore scoped to the legal frameworks of the United States and mainland China, making it a bilingual benchmark.
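
To make the task structure concrete, below is a minimal Python sketch of how an IPBench-style instance could be represented and routed by taxonomy level. The file format and field names (e.g. `level`, `task`, `question`, `options`, `answer`) are illustrative assumptions, not the released schema; please refer to the dataset files in this repository for the exact format.

```python
# A minimal sketch of loading IPBench-style instances and attaching their
# taxonomy level. File format and field names below are illustrative
# assumptions, not the official release schema.
import json

LEVELS = {
    1: "Information Processing",
    2: "Logical Reasoning",
    3: "Discriminant Evaluation",
    4: "Creative Generation",
}

def load_instances(path: str):
    """Read one JSON object per line (JSONL) and attach its level name."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            item["level_name"] = LEVELS.get(item.get("level"), "Unknown")
            yield item

if __name__ == "__main__":
    # Hypothetical multiple-choice instance for illustration only.
    example = {
        "level": 2,
        "task": "patent-claim-dependency",  # illustrative task name
        "question": "Which claim does claim 3 depend on?",
        "options": ["A. Claim 1", "B. Claim 2", "C. Claim 4", "D. None"],
        "answer": "B",
    }
    print(LEVELS[example["level"]], "-", example["task"])
```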

Comparisons with Existing Benchmarks

We provide a detailed comparison between IPBench and existing benchmarks. These features make IPBench the largest and most task-comprehensive benchmark for large language models in the IP domain.


Statistics

For more statistics, please see our paper.

Task Examples

Experiment Results

Leaderboard

We evaluated the capabilities of 16 advanced large language models, including 13 general-purpose models, 2 legal-domain models, and 1 intellectual property (IP)-domain model. The general-purpose models include chat models such as GPT-4o, DeepSeek-V3, Qwen, Llama, Gemma, and Mistral, as well as reasoning-oriented models represented by DeepSeek-R1 and QwQ. Each model was tested under five different settings: zero-shot, one-shot, two-shot, three-shot, and Chain-of-Thought. Detailed experimental parameters can be found in the paper.
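
For readers who want a feel for the prompting setup, the following is a hedged sketch of how the five settings could be assembled into prompts. The prompt wording, demonstration format, and function names are illustrative assumptions; the exact templates used in our experiments are described in the paper.

```python
# A sketch of turning the five evaluation settings (zero-shot, one-/two-/
# three-shot, Chain-of-Thought) into prompts. Wording and format are
# illustrative assumptions, not the paper's exact templates.

def build_prompt(question: str, options: list[str],
                 demos: list[tuple[str, str]] | None = None,
                 cot: bool = False) -> str:
    parts = []
    # k-shot: prepend k worked demonstrations (k = 0 gives zero-shot).
    for demo_q, demo_a in (demos or []):
        parts.append(f"Question: {demo_q}\nAnswer: {demo_a}\n")
    parts.append("Question: " + question)
    parts.append("\n".join(options))
    if cot:
        # Chain-of-Thought: ask the model to reason before answering.
        parts.append("Let's think step by step, then give the final option letter.")
    else:
        parts.append("Answer with the option letter only.")
    return "\n".join(parts)

if __name__ == "__main__":
    q = "Under U.S. law, what is the standard patent term measured from filing?"
    opts = ["A. 10 years", "B. 14 years", "C. 20 years", "D. 25 years"]
    print(build_prompt(q, opts, demos=None, cot=True))        # zero-shot CoT
    print(build_prompt(q, opts, demos=[("Example Q", "C")]))  # one-shot
```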



Error Analysis

We randomly selected 300 incorrect responses across all tasks from GPT-4o-mini under the CoT setting for error analysis. We classify the errors into 7 types: Consistency error, Hallucination error, Reasoning error, Refusing error, Priority error, Mathematical error, and Obsolescence error. Reasoning error is the most common type, accounting for 33% of the annotated cases.


Error distribution over 300 annotated GPT-4o-mini errors.
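
As a rough illustration of the bookkeeping behind this analysis, the sketch below samples incorrect responses and tallies them by annotated error type. The input format and field names are assumptions for illustration; in practice the error types were assigned by manual annotation.

```python
# A small sketch of the error-analysis bookkeeping: sample 300 incorrect
# responses and count them by annotated error type. Input format and field
# names are illustrative assumptions.
import random
from collections import Counter

ERROR_TYPES = [
    "Consistency", "Hallucination", "Reasoning", "Refusing",
    "Priority", "Mathematical", "Obsolescence",
]

def sample_and_tally(incorrect_responses: list[dict], k: int = 300) -> Counter:
    """Sample k incorrect responses and count their annotated error types."""
    sampled = random.sample(incorrect_responses, k=min(k, len(incorrect_responses)))
    return Counter(r["error_type"] for r in sampled)

if __name__ == "__main__":
    # Toy data with pre-assigned error types (annotation is manual in practice).
    toy = [{"error_type": random.choice(ERROR_TYPES)} for _ in range(1000)]
    counts = sample_and_tally(toy)
    total = sum(counts.values())
    for etype, n in counts.most_common():
        print(f"{etype:14s} {n:4d}  ({n / total:.0%})")
```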

Error Examples

BibTeX


Coming soon.