[SOLVED] Is there a way to optimize SpaCy training?

Issue

I’m currently training a SpaCy model for multi-label text classification. There are 8 labels: anger, anticipation, disgust, fear, joy, sadness, surprise and trust. The dataset has over 200k records. However, each epoch takes 4 hours. Is there a way to optimize the training and make it faster? Maybe I’m missing something here that could improve the model.


TRAINING_DATA

TRAIN_DATA = list(zip(train_texts, [{"cats": cats} for cats in final_train_cats]))

[...
  {'cats': {'anger': 1,
    'anticipation': 0,
    'disgust': 0,
    'fear': 0,
    'joy': 0,
    'sadness': 0,
    'surprise': 0,
    'trust': 0}}),
 ('mausoleum',
  {'cats': {'anger': 1,
    'anticipation': 0,
    'disgust': 0,
    'fear': 0,
    'joy': 0,
    'sadness': 0,
    'surprise': 0,
    'trust': 0}}),
 ...]

TRAINING

import random

import spacy
from spacy.util import minibatch

nlp = spacy.load("en_core_web_sm")
category = nlp.create_pipe("textcat", config={"exclusive_classes": True})
nlp.add_pipe(category)

# add label to text classifier
category.add_label("trust")
category.add_label("fear")
category.add_label("disgust")
category.add_label("surprise")
category.add_label("anticipation")
category.add_label("anger")
category.add_label("joy")
category.add_label("sadness")  # listed among the labels and in the data, but missing here

optimizer = nlp.begin_training()
losses = {}

for i in range(100):
    random.shuffle(TRAIN_DATA)

    print('...')
    for batch in minibatch(TRAIN_DATA, size=8):
        texts = [nlp(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(i, losses)

...
0 {'parser': 0.0, 'tagger': 27.018985521040854, 'textcat': 0.0, 'ner': 0.0}
...
1 {'parser': 0.0, 'tagger': 27.01898552104131, 'textcat': 0.0, 'ner': 0.0}
...

Solution

"200k-record dataset is taking 4 hours per-epoch" doesn’t tell us much:

  1. Make sure you’re not running out of memory (are you?). How much RAM is training using?
  2. You are presumably running single-threaded, due to the GIL. See e.g. this on how to turn off the GIL to run training multicore. How many cores do you have?
    • Putting texts = [nlp(text) ...] inside your inner loop (for batch in minibatch(TRAIN_DATA, size=8):) looks like trouble, because your code will always hold the GIL, even though you only need it for the C-library string calls that process the input text (i.e. the parsing stage), not for training.
    • Refactor your code so that you first run the nlp() pipeline on all your input, then save some intermediate representation (an array or similar). Keep that code separate from your training loop, so that training can be multithreaded.
  3. I can’t comment on your choice of minibatch() parameters, but a size of 8 seems very small, and those parameters seem to matter for performance, so try tweaking them (or grid-search a few values).
  4. Finally, once you have checked all of the above, find the fastest single-core/multicore machine you can, with enough RAM.
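
To make points 2 and 3 concrete, here is a minimal sketch in plain Python of the suggested refactor: do the expensive per-text work once, before the epoch loop, and make the batch size a parameter you can tune. It does not use SpaCy itself. The names train, preprocess and update are hypothetical stand-ins (preprocess for whatever you currently do with nlp(text), update for the nlp.update call), and this minibatch is a simplified local version written for the example, not the library's own helper.

```python
import random
from itertools import islice


def minibatch(items, size):
    """Yield successive lists of up to `size` items (simplified local helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def train(data, preprocess, update, epochs, batch_size, seed=0):
    """Run `epochs` passes over `data`, preprocessing every text only once.

    `data` is a list of (text, annotations) pairs, as in TRAIN_DATA.
    `preprocess` and `update` are hypothetical stand-ins for the expensive
    text-processing step and for the model update step.
    """
    # Expensive per-text work happens here, outside the epoch loop.
    prepared = [(preprocess(text), annotations) for text, annotations in data]

    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(prepared)
        for batch in minibatch(prepared, batch_size):
            # Unzip the batch; the annotations are passed through unchanged.
            docs, annotations = zip(*batch)
            update(docs, annotations)
```

With roughly 200k texts and 100 epochs, this calls preprocess about 200k times in total instead of about 20 million times. Raising batch_size from 8 to 64 also cuts the number of update calls per epoch from about 25,000 to about 3,125. Whether a larger batch still trains well is something you would have to check against your own data.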

Answered By – smci

Answer Checked By – Robin (BugsFixing Admin)
