[SOLVED] Group numpy array elements without for-loop

Issue

After some text processing, I have a list of tokens and a list of sentence indices, one per token. Now I'd like to reassemble the tokens into sentences. My NumPy version below works, but I feel there is a better, faster, more NumPy-idiomatic way to do this without a for loop. There could be many more than two sentences in the future.

import numpy as np

all_tokens = np.array(['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',', 'and', 'effort', 'childproofing', 'my', 'house', '.', 'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

new_sents = []
for unique_sent_id in np.unique(sent_ids):
    sent_tokens = all_tokens[sent_ids == unique_sent_id].tolist()
    new_sents.append(' '.join(sent_tokens))

Result: ["I spent a lot of time , money , and effort childproofing my house .", "However , the kids still get in ."]

Solution

Assuming sent_ids is ordered, so that each sentence's tokens are contiguous, you can find the positions where the id changes and split all_tokens at those positions:

list(map(" ".join, np.split(all_tokens, np.flatnonzero(np.diff(sent_ids) != 0)+1)))
# ['I spent a lot of time , money , and effort childproofing my house .', 'However , the kids still get in .']
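
The one-liner relies on sent_ids being ordered. As a rough sketch of how the same idea might be extended to unordered ids, the steps can be spelled out in a small helper. The name group_tokens and the interleaved sample data are hypothetical and not part of the original answer; the stable argsort step is an added assumption for handling ids that arrive out of order.

```python
import numpy as np

def group_tokens(tokens, ids):
    """Join tokens that share an id into one string per id (hypothetical helper)."""
    tokens = np.asarray(tokens)
    ids = np.asarray(ids)
    # A stable sort groups equal ids together while keeping the original
    # token order inside each group (assumption: ids may be unordered).
    order = np.argsort(ids, kind="stable")
    tokens, ids = tokens[order], ids[order]
    # np.diff is nonzero exactly where the id changes; adding 1 turns
    # each change position into the start index of the next group.
    boundaries = np.flatnonzero(np.diff(ids) != 0) + 1
    # np.split cuts the token array at those start indices.
    return [" ".join(chunk) for chunk in np.split(tokens, boundaries)]

# Made-up sample with the two sentences interleaved.
mixed_tokens = ['I', 'However', 'spent', ',', 'time', 'the', '.', 'kids', 'get', 'in', '.']
mixed_ids = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
print(group_tokens(mixed_tokens, mixed_ids))
```

If sent_ids is already ordered, the sort changes nothing and the helper reduces to the one-liner above. The for loop is gone from the grouping step, but the final join still runs once per sentence in Python.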

Answered By – Psidom

Answer Checked By – Candace Johnson (BugsFixing Volunteer)
