# ProtoRE **Repository Path**: bobtuan/ProtoRE ## Basic Information - **Project Name**: ProtoRE - **Description**: Code for 'Prototypical Representation Learning for Relation Extraction'. - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-11-24 - **Last Updated**: 2021-11-24 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # ProtoRE ## Introduction This repo contains the code of the pretraining method proposed in **Prototypical Representation Learning for Relation Extraction** *Ning Ding\*, Xiaobin Wang\*, Yao Fu, Guangwei Xu, Rui Wang, Penguin Xie, Ying Shen, Fei Huang, Hai-Tao Zheng, Rui Zhang. ICLR 2021*. ## Prerequisites ### Environment - python : 3.7.5 - pytorch : 1.3.1 - tranformers : 2.1.1 ### Key Implementation - The relation representation produced by BERT is implemented in `models/bert_em.py`; - Equation (3) in the paper is implemented in `control/train.py` line 103-108; - Equation (4) (5) (6) in the paper is implemented in `control/train.py` line 78-96; - Prototypes are implemented in `models/proto_sim.py`. ### Data Distantly labeled training data is required for runing. Training data should have **five columns**, i.e. instance id, relation id, start position of the first entity, start position of the second entity and a sentence. The sentence is converted to a sequence of word ids by BertTokenizer. Special tokens like [CLS] and [SEP] are added (So a sequence starts with 101 and ends with 102). Reserved tokens are used as entity markers, [unused1] -> \, [unused2] -> \, [unused2] -> \, [unused4] -> \. There is a sample.txt in directory data for demo. ## Run First, modify the train_config.json according your data and setting. - train_data : path of train data - vocab : path of the vocabulary file for BertTokenizer - base_model : path of basic pretrained model, e.g., bert-base-cased - relations : number of relations in training data, - embedding_size : size of relation representation, - iterations : training iterations, - save_interval : the interval of saving checkpoints Second, create a directory for saving models. ``` mkdir ckpts ``` Then, run the code: ``` python train.py train_config.json ``` It will take a long time (serval days in one tesla P100) for training. ## How to use the pretrained encoder First, load the pretrained model ``` from models.bert_em import BERT_EM bert_encoder = BERT_EM.from_pretrained("path/to/saved_model_dir") ``` Second, feed inputs and get relation embeddings. ``` """ p_input_ids: Indices of input sequence tokens in the vocabulary. attention_mask: Mask to avoid performing attention on padding token indices. e1_pos: positiion of entity start marker of the first entity (zero based, CLS included). e2_pos: positiion of entity start marker of the second entity. p_input_ids, attention_mask is same as the input of BertModel in transformers. """ _, rel_emb = bert_encoder(p_input_ids, e1_pos, e2_pos, attention_mask = mask) ``` ### Citation Please cite our paper is you use prototypical method in your work: ``` @article{ding2021prototypical, title={Prototypical Representation Learning for Relation Extraction}, author={Ding, Ning and Wang, Xiaobin and Fu, Yao and Xu, Guangwei and Wang, Rui and Xie, Pengjun and Shen, Ying and Huang, Fei and Zheng, Hai-Tao and Zhang, Rui}, journal={arXiv preprint arXiv:2103.11647}, year={2021} } ``` ## Contact If you have any questions about this code, please write an e-mail to : xuanjie.wxb@alibaba-inc.com or [dingn18@mails.tsinghua.edu.cn](dingn18@mails.tsinghua.edu.cn)