This is an implementation of the following paper:
@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn and
      Atrasevych, Vitaliy and
      Chernodub, Artem and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.bea-1.16",
    doi = "10.18653/v1/2020.bea-1.16",
    pages = "163--170"
}
Confirmed to work on Python 3.11.0.
pip install git+https://github.com/gotutiyan/gector
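To check that the installation succeeded, the public entry points can be imported (a minimal sanity check; these are the same names used in the API example below):
python -c "from gector import GECToR, predict, load_verb_dict"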
# Download the verb dictionary in advance
mkdir data
cd data
wget https://github.com/grammarly/gector/raw/master/data/verb-form-vocab.txt
gector-predict \
    --input <raw text file> \
    --restore_dir gotutiyan/gector-roberta-base-5k \
    --out <path to output file>
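For example, given a plain-text input file with one sentence per line (the file names below are only illustrative):
echo "This are a wrong sentences" > input.txt
gector-predict \
    --input input.txt \
    --restore_dir gotutiyan/gector-roberta-base-5k \
    --out corrected.txt
cat corrected.txt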
from transformers import AutoTokenizer
from gector import GECToR, predict, load_verb_dict

model_id = 'gotutiyan/gector-roberta-base-5k'
model = GECToR.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Dictionaries that encode/decode verb-form transformations,
# built from the verb dictionary downloaded above.
encode, decode = load_verb_dict('data/verb-form-vocab.txt')
srcs = [
    'This is a correct sentence.',
    'This are a wrong sentences'
]
# Edit tags are predicted and applied iteratively,
# for at most n_iteration rounds per sentence.
corrected = predict(
    model, tokenizer, srcs,
    encode, decode,
    keep_confidence=0.0,
    min_error_prob=0.0,
    n_iteration=5,
    batch_size=2,
)
print(corrected)
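keep_confidence adds a positive bias to the probability of the $KEEP tag, and min_error_prob is a sentence-level threshold below which no edits are applied; these are the two inference tweaks described in the paper, and both default to 0.0 above. A sketch with precision-biased values (the numbers are illustrative, not tuned):
# Hypothetical precision-biased settings: correct only when the model is confident.
conservative = predict(
    model, tokenizer, srcs,
    encode, decode,
    keep_confidence=0.2,  # bias toward keeping tokens unchanged
    min_error_prob=0.5,   # skip sentences unlikely to contain an error
    n_iteration=5,
    batch_size=2,
)
print(conservative)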
The official GECToR models can also be used by specifying --from_official together with the related options prefixed with --official. The output vocabulary for the official models is included in this repository at data/output_vocabulary.
# An example of using the official BERT model.
# Download the official model.
wget https://grammarly-nlp-data-public.s3.amazonaws.com/gector/bert_0_gectorv2.th
# Predict with the official model.
python predict.py \
    --input <raw text file> \
    --restore bert_0_gectorv2.th \
    --out out.txt \
    --from_official \
    --official.vocab_path data/output_vocabulary \
    --official.transformer_model bert-base-cased \
    --official.special_tokens_fix 0 \
    --official.max_length 80
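The other official checkpoints should work the same way; for instance, a sketch for the official RoBERTa model. The checkpoint URL and special_tokens_fix=1 follow the official grammarly/gector release, while roberta-base as the transformer model is an assumption here:
# Download the official RoBERTa checkpoint (URL from the official release).
wget https://grammarly-nlp-data-public.s3.amazonaws.com/gector/roberta_1_gectorv2.th
python predict.py \
    --input <raw text file> \
    --restore roberta_1_gectorv2.th \
    --out out.txt \
    --from_official \
    --official.vocab_path data/output_vocabulary \
    --official.transformer_model roberta-base \
    --official.special_tokens_fix 1 \
    --official.max_length 80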