roberta-base

FacebookAI fill-mask transformers en

FacebookAI/roberta-base

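Introduction

A model pretrained on English text with a masked language modeling (MLM) objective. It was introduced in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" and first released in the fairseq repository. The model is case-sensitive: it distinguishes between "english" and "English".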

Model card

License: mit

Language: en

Datasets: bookcorpus, wikipedia

Tags: exbert

Model configuration

Model type: roberta

Architecture: RobertaForMaskedLM

Model details

RoBERTa base model

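A model pretrained on English text with a masked language modeling (MLM) objective. It was introduced in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" and first released in the fairseq repository. The model is case-sensitive: it distinguishes between "english" and "English".

Disclaimer: the team that released RoBERTa did not write a model card for this model, so this model card was written by the Hugging Face team.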

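Model description

RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labeling of any kind (which is why it can use large amounts of publicly available data), and with an automatic process that generates inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Given a sentence, the training procedure randomly masks 15% of the words in the input; the model then receives the entire masked sentence and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models such as GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.

In this way, the model learns an inner representation of the English language that can be used to extract features useful for downstream tasks: for example, if you have a dataset of labeled sentences, you can train a standard classifier using the features produced by the RoBERTa model as inputs.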

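Intended uses and limitations

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should look at a model like GPT2.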

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("Hello I'm a <mask> model.")

[{'sequence': "Hello I'm a male model.",
  'score': 0.3306540250778198,
  'token': 2943,
  'token_str': 'Ġmale'},
 {'sequence': "Hello I'm a female model.",
  'score': 0.04655390977859497,
  'token': 2182,
  'token_str': 'Ġfemale'},
 {'sequence': "Hello I'm a professional model.",
  'score': 0.04232972860336304,
  'token': 2038,
  'token_str': 'Ġprofessional'},
 {'sequence': "Hello I'm a fashion model.",
  'score': 0.037216778844594955,
  'token': 2734,
  'token_str': 'Ġfashion'},
 {'sequence': "Hello I'm a Russian model.",
  'score': 0.03253649175167084,
  'token': 1083,
  'token_str': 'ĠRussian'}]
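
The `Ġ` prefix on each `token_str` is how a byte-level BPE vocabulary marks a token that begins with a space. The following is a minimal sketch of post-processing a fill-mask result into readable (word, score) pairs. It uses two entries copied from the output above; the helper name `readable_predictions` is hypothetical and is not part of the transformers library.

```python
def readable_predictions(results):
    """Turn fill-mask results into (word, score) pairs, best score first."""
    pairs = []
    for item in results:
        # 'Ġ' encodes a leading space in byte-level BPE vocabularies.
        word = item["token_str"].replace("Ġ", " ").strip()
        pairs.append((word, round(item["score"], 4)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Two entries copied from the pipeline output shown above.
sample = [
    {"sequence": "Hello I'm a female model.",
     "score": 0.04655390977859497,
     "token": 2182,
     "token_str": "Ġfemale"},
    {"sequence": "Hello I'm a male model.",
     "score": 0.3306540250778198,
     "token": 2943,
     "token_str": "Ġmale"},
]

for word, score in readable_predictions(sample):
    print(f"{word}: {score}")
```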

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

And in TensorFlow:

from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

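Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. As a result, the model can make biased predictions: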

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("The man worked as a <mask>.")

[{'sequence': 'The man worked as a mechanic.',
  'score': 0.08702439814805984,
  'token': 25682,
  'token_str': 'Ġmechanic'},
 {'sequence': 'The man worked as a waiter.',
  'score': 0.0819653645157814,
  'token': 38233,
  'token_str': 'Ġwaiter'},
 {'sequence': 'The man worked as a butcher.',
  'score': 0.073323555290699,
  'token': 32364,
  'token_str': 'Ġbutcher'},
 {'sequence': 'The man worked as a miner.',
  'score': 0.046322137117385864,
  'token': 18678,
  'token_str': 'Ġminer'},
 {'sequence': 'The man worked as a guard.',
  'score': 0.040150221437215805,
  'token': 2510,
  'token_str': 'Ġguard'}]

>>> unmasker("The Black woman worked as a <mask>.")

[{'sequence': 'The Black woman worked as a waitress.',
  'score': 0.22177888453006744,
  'token': 35698,
  'token_str': 'Ġwaitress'},
 {'sequence': 'The Black woman worked as a prostitute.',
  'score': 0.19288744032382965,
  'token': 36289,
  'token_str': 'Ġprostitute'},
 {'sequence': 'The Black woman worked as a maid.',
  'score': 0.06498628109693527,
  'token': 29754,
  'token_str': 'Ġmaid'},
 {'sequence': 'The Black woman worked as a secretary.',
  'score': 0.05375480651855469,
  'token': 2971,
  'token_str': 'Ġsecretary'},
 {'sequence': 'The Black woman worked as a nurse.',
  'score': 0.05245552211999893,
  'token': 9008,
  'token_str': 'Ġnurse'}]

This bias will also affect all fine-tuned versions of this model.

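Training data

The RoBERTa model was pretrained on the union of five datasets:
- BookCorpus, a dataset consisting of 11,038 unpublished books;
- English Wikipedia (excluding lists, tables, and headers);
- CC-News, a dataset containing 63 million English news articles crawled between September 2016 and February 2019;
- OpenWebText, an open-source recreation of the WebText dataset used to train GPT-2;
- Stories, a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.

Together, these datasets amount to 160GB of text.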

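Training procedure

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000. The inputs of the model are pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked with `<s>` and the end of one with `</s>`.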
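
The card does not spell out how the 512-token pieces are assembled. The sketch below is one plausible reading, not the fairseq implementation: each document's tokens are wrapped in `<s>` and `</s>`, everything is concatenated into one stream, and the stream is cut into fixed-length pieces, so a piece can start in one document and end in another. The function name `pack_segments` and the toy tokens are hypothetical, and a length of 4 stands in for 512 to keep the output readable.

```python
from typing import Iterable, List

BOS, EOS = "<s>", "</s>"  # document start and end markers


def pack_segments(documents: Iterable[List[str]], max_len: int = 512) -> List[List[str]]:
    """Concatenate marked documents and cut the stream into max_len pieces."""
    stream: List[str] = []
    for doc in documents:
        stream.append(BOS)
        stream.extend(doc)
        stream.append(EOS)
    # The last piece may be shorter than max_len.
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


# Toy example: three short "documents" of already-tokenized text.
docs = [["a", "b", "c"], ["d", "e"], ["f"]]
segments = pack_segments(docs, max_len=4)
for segment in segments:
    print(segment)
```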

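The details of the masking procedure for each sentence are as follows:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left unchanged.

Unlike BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).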
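
The masking rules above can be sketched in plain Python. This is an illustrative simplification rather than the actual pretraining code: it picks each token independently with probability 0.15 instead of masking an exact 15% share, it works on string tokens rather than ids, and it ignores special tokens. The names `mask_tokens`, `vocab`, and `sentence` are hypothetical. Calling the function again for each epoch with an advancing random generator gives a different mask each time, which is the dynamic behavior described above.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "<mask>"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model inputs, labels); labels are None where no prediction is needed."""
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random token different from the original
            inputs[i] = rng.choice([v for v in vocab if v != token])
        # remaining 10%: leave the token unchanged
    return inputs, labels


vocab = ["the", "man", "worked", "as", "a", "mechanic", "waiter"]
sentence = ["the", "man", "worked", "as", "a", "mechanic"]
rng = random.Random(0)
for epoch in range(3):
    # A fresh mask is drawn on every pass over the same sentence.
    inputs, labels = mask_tokens(sentence, vocab, rng)
    print(epoch, inputs, labels)
```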

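Pretraining

The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 6e-4, \(\beta_{1} = 0.9\), \(\beta_{2} = 0.98\), and \(\epsilon = 1e-6\), a weight decay of 0.01, learning rate warmup for 24,000 steps, and linear decay of the learning rate afterwards.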
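
The learning rate schedule described above can be written as a small function. This sketch makes two assumptions the card does not state: warmup rises linearly from 0 to the peak of 6e-4, and the linear decay reaches 0 exactly at step 500K. The function name `learning_rate` is hypothetical.

```python
def learning_rate(
    step: int,
    peak_lr: float = 6e-4,
    warmup_steps: int = 24_000,
    total_steps: int = 500_000,
) -> float:
    """Linear warmup to peak_lr, then linear decay (assumed to end at 0)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the rate at a few points along the schedule.
for step in (0, 12_000, 24_000, 262_000, 500_000):
    print(step, learning_rate(step))
```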

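Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|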