Issue
After doing some text processing, I’ve got a list of tokens and a list of sentence indices, one for each token. Now I’d like to reassemble the tokens into sentences. I’ve used Numpy, but I feel like there’s a better/faster/more-numpy-ish way to do this…without a for loop. There could be a lot more than two sentences in the future.
import numpy as np
all_tokens = np.array(['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',', 'and', 'effort', 'childproofing', 'my', 'house', '.', 'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
new_sents = []
for unique_sent_id in np.unique(sent_ids):
    sent_tokens = all_tokens[sent_ids == unique_sent_id].tolist()
    new_sents.append(' '.join(sent_tokens))
Result: ["I spent a lot of time , money , and effort childproofing my house .", "However , the kids still get in ."]
Solution
Assuming sent_ids is ordered, you can find the positions where sent_id changes and split the tokens at those points:
list(map(" ".join, np.split(all_tokens, np.flatnonzero(np.diff(sent_ids) != 0)+1)))
# ['I spent a lot of time , money , and effort childproofing my house .', 'However , the kids still get in .']
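For clarity, the same approach can be unpacked into intermediate steps (a sketch, using the arrays from the question):

```python
import numpy as np

all_tokens = np.array(['I', 'spent', 'a', 'lot', 'of', 'time', ',', 'money', ',',
                       'and', 'effort', 'childproofing', 'my', 'house', '.',
                       'However', ',', 'the', 'kids', 'still', 'get', 'in', '.'])
sent_ids = np.array([0] * 15 + [1] * 8)

# np.diff is nonzero wherever the sentence id changes; +1 converts the
# diff index into a split position in the original token array.
change_points = np.flatnonzero(np.diff(sent_ids) != 0) + 1  # array([15])

# Split the token array at each change point, then join each chunk.
sentences = [' '.join(chunk) for chunk in np.split(all_tokens, change_points)]
```

Note that np.split still iterates in Python under the hood, but it avoids the repeated boolean masking of the original loop and handles any number of sentences, as long as sent_ids is sorted.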
Answered By – Psidom
Answer Checked By – Candace Johnson (BugsFixing Volunteer)