A Guide to 400+ Categorized Large Language Model Datasets

You can find useful datasets on numerous platforms such as Kaggle, Papers with Code, GitHub, and more. But what if I tell you there’s a goldmine: a repository packed with over 400 datasets, meticulously categorised across five key dimensions (Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, and Traditional NLP Datasets) and more? And to top it off, this collection receives regular updates. Sounds impressive, right?

These datasets were compiled by Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin in their survey paper “Datasets for Large Language Models: A Comprehensive Survey,” which was released in February 2024. It offers a groundbreaking look at the backbone of large language model (LLM) development: datasets.

Note: I’m providing a brief description of the datasets mentioned in the research paper; you can find all of the datasets in the repo.

Datasets for Your GenAI/LLMs Project: Abstract Overview of the Paper

Source: Datasets for Large Language Models: A Comprehensive Survey

This paper sets out to navigate the intricate landscape of LLM datasets, which are the cornerstone behind the stellar evolution of these models. Just as the roots of a tree provide the necessary support and nutrients for growth, datasets are fundamental to LLMs. Thus, studying these datasets isn’t just relevant; it’s essential.

Given the current gaps in comprehensive analysis and overview, this survey organises and categorises the essential types of LLM datasets into five main perspectives. The list below also includes the two additional categories covered later in this article, for seven in total:

Pre-training Corpora
Instruction Fine-tuning Datasets
Preference Datasets
Evaluation Datasets
Traditional Natural Language Processing (NLP) Datasets
Multi-modal Large Language Models (MLLMs) Datasets
Retrieval Augmented Generation (RAG) Datasets

The research outlines the key challenges that exist today and suggests potential directions for further exploration. It goes a step beyond mere discussion by compiling a thorough review of available dataset resources: statistics from 444 datasets spanning 32 domains and 8 language categories. This includes extensive data size metrics: more than 774.5 TB for pre-training corpora alone and 700 million instances across other dataset types.

This survey acts as a full roadmap to guide researchers, serve as a valuable resource, and inspire future studies in the LLM field.

[Figure: overall architecture of the survey]

LLM Text Datasets Across Seven Dimensions

Here are the key types of LLM text datasets, categorized into seven main dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal Large Language Models (MLLMs) Datasets, and Retrieval Augmented Generation (RAG) Datasets. These categories are regularly updated for comprehensive coverage.

Note: I’m using the same structure mentioned in the repo, and you can refer to the repo for the dataset information format.

It looks like this:

- Dataset name  Release Time | Public or Not | Language | Construction Method
| Paper | Github | Dataset | Website
- Publisher:
- Size:
- License:
- Source:

Repo Link: Awesome-LLMs-Datasets
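The entry layout above is regular enough to parse mechanically. Here is a minimal sketch in Python, assuming the five-field header from the template (name and release time, public or not, language, construction method) followed by dash-prefixed attribute lines. `DatasetEntry` and `parse_entry` are hypothetical names chosen for illustration and are not part of the repo. Entries with fewer header fields, such as the traditional NLP ones later in this article, would need a looser parser.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    release: str
    public: str
    language: str
    method: str
    fields: dict = field(default_factory=dict)


def parse_entry(lines):
    """Parse one catalog entry given as a list of text lines."""
    # Header looks like: "MADLAD-400 2023-9 | All | Multi (419) | HG |"
    header = [part.strip() for part in lines[0].split("|")]
    header = [part for part in header if part]
    # The release time is the last whitespace-separated token of the first cell.
    name, release = header[0].rsplit(maxsplit=1)
    public, language, method = header[1:4]
    entry = DatasetEntry(name, release, public, language, method)
    for line in lines[1:]:
        # Attribute lines start with "- " and contain "key: value";
        # the link line ("Paper | Github | Dataset") is skipped.
        if line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            entry.fields[key.strip()] = value.strip()
    return entry


sample = [
    "MADLAD-400 2023-9 | All | Multi (419) | HG |",
    "Paper | Github | Dataset",
    "- Publisher: Google DeepMind et al.",
    "- Size: 2.8 T Tokens",
    "- License: ODL-BY",
    "- Source: Common Crawl",
]
entry = parse_entry(sample)
print(entry.name, entry.release, entry.language, entry.fields["Size"])
```

Keeping the attribute lines in a plain dictionary means that extra keys appearing in later sections (domain, category, and so on) are picked up without changing the parser.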

1. Pre-training Corpora

These are extensive collections of text used during the initial training phase of LLMs.

A. General Pre-training Corpora: Large-scale datasets that include diverse text sources from various domains. They’re designed to train foundational models that can perform various tasks thanks to their broad data coverage.

Webpages

MADLAD-400 2023-9 | All | Multi (419) | HG |
Paper | Github | Dataset
- Publisher: Google DeepMind et al.
- Size: 2.8 T Tokens
- License: ODL-BY
- Source: Common Crawl
FineWeb 2024-4 | All | EN | CI |
Dataset
- Publisher: HuggingFaceFW
- Size: 15 T Tokens
- License: ODC-BY-1.0
- Source: Common Crawl
CCI 2.0 2024-4 | All | ZH | HG |
Dataset1 | Dataset2
- Publisher: BAAI
- Size: 501 GB
- License: CCI Usage Agreement
- Source: Chinese webpages
DCLM 2024-6 | All | EN | CI |
Paper | Github | Dataset | Website
- Publisher: University of Washington et al.
- Size: 279.6 TB
- License: Common Crawl Terms of Use
- Source: Common Crawl

Language Texts

ANC 2003-X | All | EN | HG |
Website
- Publisher: The US National Science Foundation et al.
- Size: –
- License: –
- Source: American English texts
BNC 1994-X | All | EN | HG |
Website
- Publisher: Oxford University Press et al.
- Size: 4124 Texts
- License: –
- Source: British English texts
News-crawl 2019-1 | All | Multi (59) | HG |
Dataset
- Publisher: UKRI et al.
- Size: 110 GB
- License: CC0
- Source: Newspapers

Books

Anna’s Archive 2023-X | All | Multi | HG |
Website
- Publisher: Anna
- Size: 586.3 TB
- License: –
- Source: Sci-Hub, Library Genesis, Z-Library, etc.
BookCorpusOpen 2021-5 | All | EN | CI |
Paper | Github | Dataset
- Publisher: Jack Bandy et al.
- Size: 17,868 Books
- License: Smashwords Terms of Service
- Source: Toronto Book Corpus
PG-19 2019-11 | All | EN | HG |
Paper | Github | Dataset
- Publisher: DeepMind
- Size: 11.74 GB
- License: Apache-2.0
- Source: Project Gutenberg
Project Gutenberg 1971-X | All | Multi | HG |
Website
- Publisher: Ibiblio et al.
- Size: –
- License: The Project Gutenberg
- Source: Ebook data

You can find more categories in this dimension here: General Pre-training Corpora

B. Domain-specific Pre-training Corpora: Customized datasets focused on specific fields or topics, used for targeted, incremental pre-training to enhance performance in specialized domains.

Financial

BBT-FinCorpus 2023-2 | Partial | ZH | HG |
Paper | Github | Website
- Publisher: Fudan University et al.
- Size: 256 GB
- License: –
- Source: Company announcements, research reports, financial
- Category: Multi
- Domain: Finance
FinCorpus 2023-9 | All | ZH | HG |
Paper | Github | Dataset
- Publisher: Du Xiaoman
- Size: 60.36 GB
- License: Apache-2.0
- Source: Company announcements, financial news, financial exam questions
- Category: Multi
- Domain: Finance
FinGLM 2023-7 | All | ZH | HG |
Github
- Publisher: Knowledge Atlas et al.
- Size: 69 GB
- License: Apache-2.0
- Source: Annual Reports of Listed Companies
- Category: Language Texts
- Domain: Finance

Medical

Medical-pt 2023-5 | All | ZH | CI |
Github | Dataset
- Publisher: Ming Xu
- Size: 632.78 MB
- License: Apache-2.0
- Source: Medical encyclopedia data, medical textbooks
- Category: Multi
- Domain: Medical
PubMed Central 2000-2 | All | EN | HG |
Website
- Publisher: NCBI
- Size: –
- License: PMC Copyright Notice
- Source: Biomedical scientific literature
- Category: Academic Materials
- Domain: Medical

Math

Proof-Pile-2 2023-10 | All | EN | HG & CI |
Paper | Github | Dataset | Website
- Publisher: Princeton University et al.
- Size: 55 B Tokens
- License: –
- Source: ArXiv, OpenWebMath, AlgebraicStack
- Category: Multi
- Domain: Mathematics
MathPile 2023-12 | All | EN | HG |
Paper | Github | Dataset
- Publisher: Shanghai Jiao Tong University et al.
- Size: 9.5 B Tokens
- License: CC-BY-NC-SA-4.0
- Source: Textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, arXiv
- Category: Multi
- Domain: Mathematics
OpenWebMath 2023-10 | All | EN | HG |
Paper | Github | Dataset
- Publisher: University of Toronto et al.
- Size: 14.7 B Tokens
- License: ODC-BY-1.0
- Source: Common Crawl
- Category: Webpages
- Domain: Mathematics

You can find more categories in this dimension here: Domain-specific Pre-training Corpora

2. Instruction Fine-tuning Datasets

These datasets consist of pairs of “instruction inputs” (requests made to the model) and corresponding “answer outputs” (model-generated responses).
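To make the pairing concrete, here is a minimal sketch of how one such record might be stored and turned into a single training string. The field names `instruction`, `input`, and `output`, the `### Response:` separator, and the helper `to_training_text` are all assumptions for illustration; individual datasets in the repo define their own schemas and prompt templates.

```python
import json

# Hypothetical record; the field names "instruction", "input" and "output"
# are an assumption for illustration, not a schema taken from the survey.
record = {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}


def to_training_text(rec):
    """Join the instruction input and the answer output into one training string."""
    prompt = rec["instruction"]
    if rec.get("input"):
        # Optional extra context is appended on its own line.
        prompt += "\n" + rec["input"]
    return prompt + "\n### Response:\n" + rec["output"]


line = json.dumps(record, ensure_ascii=False)  # one JSON object per line (JSONL)
text = to_training_text(json.loads(line))
print(text)
```

Storing one JSON object per line keeps large instruction sets easy to stream and to filter without loading the whole file.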

A. General Instruction Fine-tuning Datasets: Include a variety of instruction types without domain limitations. They aim to improve the model’s ability to follow instructions across general tasks.

Human Generated Datasets (HG)

databricks-dolly-15K 2023-4 | All | EN | HG |
Dataset | Website
- Publisher: Databricks
- Size: 15011 instances
- License: CC-BY-SA-3.0
- Source: Manually generated based on different instruction categories
- Instruction Category: Multi
InstructionWild_v2 2023-6 | All | EN & ZH | HG |
Github
- Publisher: National University of Singapore
- Size: 110K instances
- License: –
- Source: Collected on the web
- Instruction Category: Multi
LCCC 2020-8 | All | ZH | HG |
Paper | Github
- Publisher: Tsinghua University et al.
- Size: 12M instances
- License: MIT
- Source: Crawl user interactions on social media
- Instruction Category: Multi

Model Constructed Datasets (MC)

Alpaca_data 2023-3 | All | EN | MC |
Github
- Publisher: Stanford Alpaca
- Size: 52K instances
- License: Apache-2.0
- Source: Generated by Text-Davinci-003 with Alpaca_data prompts
- Instruction Category: Multi
BELLE_Generated_Chat 2023-5 | All | ZH | MC |
Github | Dataset
- Publisher: BELLE
- Size: 396004 instances
- License: GPL-3.0
- Source: Generated by ChatGPT
- Instruction Category: Generation
BELLE_Multiturn_Chat 2023-5 | All | ZH | MC |
Github | Dataset
- Publisher: BELLE
- Size: 831036 instances
- License: GPL-3.0
- Source: Generated by ChatGPT
- Instruction Category: Multi

You can find more categories in this dimension here: General Instruction Fine-tuning Datasets

B. Domain-specific Instruction Fine-tuning Datasets: Tailored for specific domains, containing instructions relevant to particular knowledge areas or task types.

Medical

ChatDoctor 2023-3 | All | EN | HG & MC |
Paper | Github | Dataset
- Publisher: University of Texas Southwestern Medical Center et al.
- Size: 115K instances
- License: Apache-2.0
- Source: Real conversations between doctors and patients & Generated by ChatGPT
- Instruction Category: Multi
- Domain: Medical
ChatMed_Consult_Dataset 2023-5 | All | ZH | MC |
Github | Dataset
- Publisher: michael-wzhu
- Size: 549326 instances
- License: CC-BY-NC-4.0
- Source: Generated by GPT-3.5-Turbo
- Instruction Category: Multi
- Domain: Medical
CMtMedQA 2023-8 | All | ZH | HG |
Paper | Github | Dataset
- Publisher: Zhengzhou University
- Size: 68023 instances
- License: MIT
- Source: Real conversations between doctors and patients
- Instruction Category: Multi
- Domain: Medical

Code

Code_Alpaca_20K 2023-3 | All | EN & PL | MC |
Github | Dataset
- Publisher: Sahil Chaudhary
- Size: 20K instances
- License: Apache-2.0
- Source: Generated by Text-Davinci-003
- Instruction Category: Code
- Domain: Code
CodeContest 2022-3 | All | EN & PL | CI |
Paper | Github
- Publisher: DeepMind
- Size: 13610 instances
- License: Apache-2.0
- Source: Collection and improvement of various datasets
- Instruction Category: Code
- Domain: Code
CommitPackFT 2023-8 | All | EN & PL (277) | HG |
Paper | Github | Dataset
- Publisher: Bigcode
- Size: 702062 instances
- License: MIT
- Source: GitHub Action dump
- Instruction Category: Code
- Domain: Code

You can find more categories in this dimension here: Domain-specific Instruction Fine-tuning Datasets

3. Preference Datasets

Preference datasets evaluate and refine model responses by providing comparative feedback on multiple outputs for the same input.

A. Preference Evaluation Methods: These can include methods such as voting, sorting, and scoring to establish how model responses align with human preferences.
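As a rough illustration of the voting method, the sketch below aggregates annotator votes over two candidate responses into a chosen/rejected pair per prompt. The vote data, the `chosen`/`rejected`/`agreement` keys, and the `majority_preference` helper are hypothetical; real preference datasets each document their own annotation protocol. With an even number of annotators a tie is possible, and this sketch simply keeps whichever label `Counter` lists first.

```python
from collections import Counter

# Hypothetical votes: each annotator names the response they prefer ("A" or "B").
votes = {
    "prompt-1": ["A", "A", "B"],
    "prompt-2": ["B", "B", "B"],
}


def majority_preference(labels):
    """Return (winner, share of votes won) for one prompt."""
    counts = Counter(labels)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(labels)


pairs = []
for prompt_id, labels in votes.items():
    winner, share = majority_preference(labels)
    loser = "B" if winner == "A" else "A"
    pairs.append(
        {"prompt": prompt_id, "chosen": winner, "rejected": loser, "agreement": share}
    )

print(pairs)
```

Keeping the agreement share alongside each pair makes it possible to drop low-consensus comparisons later.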

Vote

Chatbot_arena_conversations 2023-6 | All | Multi | HG & MC |
Paper | Dataset
- Publisher: UC Berkeley et al.
- Size: 33000 instances
- License: CC-BY-4.0 & CC-BY-NC-4.0
- Domain: General
- Instruction Category: Multi
- Preference Evaluation Method: VO-H
- Source: Generated by twenty LLMs & Manual judgment
hh-rlhf 2022-4 | All | EN | HG & MC |
Paper1 | Paper2 | Github | Dataset
- Publisher: Anthropic
- Size: 169352 instances
- License: MIT
- Domain: General
- Instruction Category: Multi
- Preference Evaluation Method: VO-H
- Source: Generated by LLMs & Manual judgment
MT-Bench_human_judgments 2023-6 | All | EN | HG & MC |
Paper | Github | Dataset | Website
- Publisher: UC Berkeley et al.
- Size: 3.3K instances
- License: CC-BY-4.0
- Domain: General
- Instruction Category: Multi
- Preference Evaluation Method: VO-H
- Source: Generated by LLMs & Manual judgment

You can find more categories in this dimension here: Preference Evaluation Methods

4. Evaluation Datasets

These datasets are meticulously curated and annotated to measure the performance of LLMs on various tasks. They’re categorized based on the domains they’re used to evaluate.

General

AlpacaEval 2023-5 | All | EN | CI & MC |
Paper | Github | Dataset | Website
- Publisher: Stanford et al.
- Size: 805 instances
- License: Apache-2.0
- Question Type: SQ
- Evaluation Method: ME
- Focus: The performance on open-ended question answering
- Numbers of Evaluation Categories/Subcategories: 1/-
- Evaluation Category: Open-ended question answering
BayLing-80 2023-6 | All | EN & ZH | HG & CI |
Paper | Github | Dataset
- Publisher: Chinese Academy of Sciences
- Size: 320 instances
- License: GPL-3.0
- Question Type: SQ
- Evaluation Method: ME
- Focus: Chinese-English language proficiency and multimodal interaction skills
- Numbers of Evaluation Categories/Subcategories: 9/-
- Evaluation Category: Writing, Roleplay, Common-sense, Fermi, Counterfactual, Coding, Math, Generic, Knowledge
BELLE_eval 2023-4 | All | ZH | HG & MC |
Paper | Github
- Publisher: BELLE
- Size: 1000 instances
- License: Apache-2.0
- Question Type: SQ
- Evaluation Method: ME
- Focus: The performance of Chinese language models in following instructions
- Numbers of Evaluation Categories/Subcategories: 9/-
- Evaluation Category: Extract, Closed qa, Rewrite, Summarization, Generation, Classification, Brainstorming, Open qa, Others

You can find more categories in this dimension here: Evaluation Dataset

5. Traditional NLP Datasets

These datasets cover text used for natural language processing tasks prior to the era of LLMs. They’re essential for tasks like language modelling, translation, and sentiment analysis in traditional NLP workflows.

Selection & Judgment

BoolQ 2019-5 | EN |
Paper | Github
- Publisher: University of Washington et al.
- Train/Dev/Test/All Size: 9427/3270/3245/15942
- License: CC-SA-3.0
CosmosQA 2019-9 | EN |
Paper | Github | Dataset | Website
- Publisher: University of Illinois Urbana-Champaign et al.
- Train/Dev/Test/All Size: 25588/3000/7000/35588
- License: CC-BY-4.0
CondaQA 2022-11 | EN |
Paper | Github | Dataset
- Publisher: Carnegie Mellon University et al.
- Train/Dev/Test/All Size: 5832/1110/7240/14182
- License: Apache-2.0
PubMedQA 2019-9 | EN |
Paper | Github | Dataset | Website
- Publisher: University of Pittsburgh et al.
- Train/Dev/Test/All Size: -/-/-/273.5K
- License: MIT
MultiRC 2018-6 | EN |
Paper | Github | Dataset
- Publisher: University of Pennsylvania et al.
- Train/Dev/Test/All Size: -/-/-/9872
- License: MultiRC License
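The four slash-separated numbers in each entry above give the train, dev, and test split sizes followed by the total. A quick sanity check is to confirm that the three splits add up to the total. The sketch below does that for plain integer values only; `parse_sizes` and `splits_consistent` are hypothetical helper names, and abbreviated totals such as 273.5K are deliberately left out of scope.

```python
def parse_sizes(text):
    """Split "train/dev/test/all" into integers; "-" (not reported) becomes None."""
    return [None if part == "-" else int(part) for part in text.split("/")]


def splits_consistent(text):
    """True if train + dev + test equals the total, None if any value is missing."""
    train, dev, test, total = parse_sizes(text)
    if None in (train, dev, test, total):
        return None
    return train + dev + test == total


print(splits_consistent("9427/3270/3245/15942"))  # BoolQ
print(splits_consistent("-/-/-/9872"))            # MultiRC: splits not reported
```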

You can find more categories in this dimension here: Traditional NLP Datasets

6. Multi-modal Large Language Models (MLLMs) Datasets

Datasets in this category combine multiple data types, such as text and images, to train models capable of processing and generating responses across different modalities.

Documents

mOSCAR: A large-scale multilingual and multimodal document-level corpus
OBELISC: An open web-scale filtered dataset of interleaved image-text documents

Instruction Fine-tuning Datasets:

Remote Sensing

MMRS-1M: Multi-sensor remote sensing instruction dataset

Images + Videos

VideoChat2-IT: Instruction fine-tuning dataset for images/videos

You can find more categories in this dimension here: Multi-modal Large Language Models (MLLMs) Datasets

7. Retrieval Augmented Generation (RAG) Datasets

These datasets enhance LLMs with retrieval capabilities, enabling models to access and integrate external data sources for more informed and contextually relevant responses.

CRUD-RAG: A comprehensive Chinese benchmark for RAG
WikiEval: To do correlation analysis of different metrics proposed in RAGAS
RGB: A benchmark for RAG
RAG-Instruct-Benchmark-Tester: An updated benchmarking test dataset for RAG use cases in the enterprise
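For readers new to the idea, the retrieve-then-generate pattern that these benchmarks exercise can be sketched in a few lines. This is a toy illustration, not how any of the listed benchmarks work internally: the three-sentence knowledge base, the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are all assumptions, and a production system would normally use an embedding index and an actual model call.

```python
import re

# Hypothetical mini knowledge base; a real system would use a vector index.
documents = [
    "FineWeb is an English web corpus built from Common Crawl.",
    "PG-19 is a books corpus sourced from Project Gutenberg.",
    "OpenWebMath collects mathematical webpages from Common Crawl.",
]


def tokens(text):
    """Lowercase word set used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=1):
    """Return the k documents sharing the most words with the query."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, docs):
    """Prepend retrieved context to the question before it is sent to a model."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = build_prompt("Which corpus is sourced from Project Gutenberg?", documents)
print(prompt)
```

RAG evaluation datasets typically probe both halves of this loop: whether the right passage was retrieved and whether the final answer stays faithful to it.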

You can find more categories in this dimension here: Retrieval Augmented Generation (RAG) Datasets

Conclusion

In conclusion, the survey “Datasets for Large Language Models: A Comprehensive Survey” provides a valuable roadmap for navigating the diverse and complex world of LLM datasets. This extensive review by Liu, Cao, Liu, Ding, and Jin showcases over 400 datasets, meticulously categorized into critical dimensions such as Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, and others, covering over 774.5 TB of data and 700 million instances. By breaking down these datasets and their uses, from broad foundational pre-training sets to highly specialized, domain-specific collections, this survey highlights existing resources and maps out current challenges and future research directions in developing and optimising LLMs. This resource serves as both a guide for researchers entering the field and a reference for those aiming to enhance generative AI’s capabilities and application scopes.

Frequently Asked Questions

Q1. What are the main types of datasets used for training LLMs?

Ans. Datasets for LLMs can be broadly categorized into structured data (e.g., tables, databases), unstructured data (e.g., text documents, books, articles), and semi-structured data (e.g., HTML, JSON). The most common are large-scale, diverse text datasets compiled from sources like websites, encyclopedias, and academic papers.

Q2. How do datasets impact the quality of an LLM?

Ans. The training dataset’s quality, diversity, and size heavily impact an LLM’s performance. A well-curated dataset improves the model’s generalizability, comprehension, and bias reduction, while a poorly curated one can lead to inaccuracies and biased outputs.

Q3. What are common sources for LLM datasets?

Ans. Common sources include web scrapes from platforms like Wikipedia, news sites, books, research journals, and large-scale repositories like Common Crawl. Publicly available datasets such as The Pile or OpenWebText are also frequently used.

Q4. How do you handle data bias in LLM datasets?

Ans. Mitigating data bias involves diversifying data sources, implementing fairness-aware data collection strategies, filtering content to reduce bias, and post-training fine-tuning. Regular audits and ethical reviews help identify and minimize biases during dataset creation.

Hi, I’m Pankaj Singh Negi – Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
