NLP-Spell Correction

Last updated: October 26, 2024

Spell Correction

A spell-checking task: a language model and a noisy-channel model are implemented, with a GUI built in tkinter.

I.Environment

Python 3.10.9

Package: nltk 3.7 and numpy 1.26.4

II.Theory

Spell-checking involves distinguishing between non-word errors and real-word errors. Non-word errors occur when a word is spelled incorrectly to the extent that it does not exist, such as spelling “giraffe” as “graffe.” Real-word errors, on the other hand, involve correctly spelled words that are used incorrectly within a given context.

Non-word errors are detected by dictionary lookup: a word that does not appear in the vocabulary is flagged. Correcting them typically involves two steps: first, generating correctly spelled candidate words that are similar to the misspelled word; second, selecting the best correction from these candidates, either by computing a (weighted) edit distance to each candidate and choosing the closest one, or by computing noisy channel probabilities and choosing the most probable candidate.

Handling real-word spelling errors involves generating a set of candidate corrections for each word w. Candidates can be generated based on similar pronunciation or spelling to w, ensuring that w itself is included in the candidate set. The best correction is then chosen from these candidates, often using methods like edit distance to select the candidate with the smallest distance from w, or applying a noisy channel model to choose the candidate with the highest posterior probability. These approaches are combined to select the candidate with the highest probability among candidates with equal edit distances.

This framework ensures effective handling of spelling errors in natural language processing applications.
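
For intuition, the sketch below walks through the non-word case on a toy example. The vocabulary and the candidate generator here are purely illustrative; the actual system described later uses a trie over vocab.txt and a Reuters-trained language model.

# Toy sketch of non-word detection and candidate generation (illustrative only).
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one insertion, deletion, substitution, or transposition away from `word`
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + substitutes + inserts)

vocab = {'giraffe', 'gaffe', 'grade'}   # hypothetical dictionary
word = 'graffe'
if word not in vocab:                   # non-word error: not found by dictionary lookup
    # dictionary words within edit distance 2 of the misspelling
    candidates = {w for e in edits1(word) for w in ({e} | edits1(e)) if w in vocab}
    print(candidates)                   # {'gaffe', 'giraffe', 'grade'} (order may vary)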

III.Data

Training corpus

Reuters Corpus

Test dataset

testdata.txt

IV.Model Development

Language model

Language models evaluate the naturalness and grammatical correctness of a sentence by assigning a probability to its sequence of words. The probability of a sentence follows from the chain rule:

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

Introducing the Markov assumption, each word depends only on the k preceding words, and the formula becomes

$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k}, \dots, w_{i-1})$$

A model in which each word is conditioned on the preceding N−1 words is called an N-gram model: a unigram model uses no context (order 0), a bigram model conditions on one preceding word (order 1), and a trigram model on two (order 2).

In language modeling, the issue of encountering probabilities of zero often arises, necessitating the use of smoothing methods. Here, I combine Add-k smoothing and interpolation smoothing for handling this issue.

Interpolation smoothing:

$$P_{\text{interp}}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2} w_{i-1})$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

The unigram, bigram, and trigram probabilities themselves are estimated with Add-k smoothing, e.g. for the bigram case:

$$P_{\text{add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}$$

where $C(\cdot)$ is a corpus count and $V$ is the vocabulary size.
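
As a minimal sketch of how these two pieces combine, the function below computes an interpolated Add-k probability for a single word given two context words. The counts, vocabulary size, and weights are hypothetical placeholders; the project's interpolated_language_model in Section V does the same thing over real Reuters n-gram counts.

# Minimal sketch: interpolated Add-k probability of one word given two context words.
# Counts, V, lambdas, and k are hypothetical placeholders.
from collections import defaultdict

def interp_addk(w, context, counts, V, lambdas, k=0.0001):
    l1, l2, l3 = lambdas                                   # unigram, bigram, trigram weights (sum to 1)
    total = sum(counts.values())
    uni = (counts[w] + k) / (total + k * V)                                         # Add-k unigram
    bi = (counts[context[1] + ' ' + w] + k) / (counts[context[1]] + k * V)          # Add-k bigram
    tri = (counts[' '.join(context) + ' ' + w] + k) / (counts[' '.join(context)] + k * V)  # Add-k trigram
    return l1 * uni + l2 * bi + l3 * tri

counts = defaultdict(int, {'the': 50, 'cat': 5, 'the cat': 3, 'sat the': 2})
print(interp_addk('cat', ('sat', 'the'), counts, V=1000, lambdas=(0.01, 0.8, 0.19)))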

Edit Distance

Based on edit distance, the system generates a list of candidate words similar to the original word. Constraining the edit distance to a small range keeps the candidate list manageable while still covering the likely correct spellings; here, candidates are dictionary words whose edit distance from the original word is at most 2.
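
For reference, the sketch below is a standard dynamic-programming (Levenshtein) edit distance. It is only meant to make the "edit distance ≤ 2" constraint concrete; the actual system generates candidates directly from a trie rather than computing this distance against every dictionary word.

# Standard Levenshtein edit distance (insertions, deletions, substitutions) -- shown
# only to make the distance bound concrete; not the project's candidate generator.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))                 # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution (0 cost if equal)
        prev = cur
    return prev[-1]

print(edit_distance('graffe', 'giraffe'))  # 1
print(edit_distance('graffe', 'grade'))    # 2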

Noisy channel model


The system combines the candidate word list generated using edit distance with a noisy channel model to weight and evaluate the probability of each candidate word. This approach helps to exclude spelling suggestions that are unreasonable in context and improves the accuracy of selecting the correct correction by the system.

Assume word x is misspelled. The best correction is chosen from the candidate set V as

$$\hat{w} = \arg\max_{w \in V} P(x \mid w)\, P(w)$$

where P(x|w) comes from the noisy channel (error) model and P(w) from the language model.

The channel model also employs Add-k smoothing to smooth the probabilities.
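
Putting the two models together, candidate ranking amounts to maximizing log P(x|w) + log P(w). The sketch below shows only this combination step with stand-in scoring functions; in the actual implementation the two terms are weighted differently and the scores come from channelModel and interpolated_language_model (Section V).

import numpy as np

# Sketch of noisy-channel ranking: pick the candidate w maximising log P(x|w) + log P(w).
# channel_prob and lm_logprob are stand-ins for channelModel() and the language model.
def rank_candidates(x, candidates, channel_prob, lm_logprob):
    scored = [(np.log(channel_prob(x, w)) + lm_logprob(w), w) for w in candidates]
    return max(scored)[1]          # candidate with the highest combined log score

# Hypothetical scores: 'giraffe' wins on the language model despite equal channel scores.
best = rank_candidates(
    'graffe', ['giraffe', 'gaffe'],
    channel_prob=lambda x, w: 0.001,
    lm_logprob=lambda w: {'giraffe': -8.2, 'gaffe': -11.5}[w],
)
print(best)  # giraffe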

V.Code Implementation

Data preprocessing

preprocessing(ngram, cate)

Read the vocabulary from vocab.txt, read and preprocess the test data from testdata.txt, and process the corpus to form n-gram information.

Load confusion matrix

Here we use the confusion-matrix data from "A Spelling Correction Program Based on a Noisy Channel Model" [1].

loadConfusionMatrix():

Obtain addmatrix, submatrix, revmatrix, and delmatrix respectively.
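
Each matrix maps a pair of characters to how often the corresponding error was observed in Kernighan et al.'s data. The entry below is purely hypothetical and only illustrates the lookup pattern that channelModel uses later.

# Hypothetical confusion-matrix entry (real values come from the *.data files):
delmatrix = {'ac': 12}          # "ac" typed as "a" -- the 'c' after 'a' was deleted 12 times
print(delmatrix.get('ac', 0))   # 12, the raw count that channelModel feeds into its Add-k estimate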

Language model

interpolated_language_model(gram_count, V, data, lambdas, K=0.0001)

Language model combining Add-k smoothing and interpolated smoothing

Edit distance

editType(candidate, word)

Identify the type of edit relative to the original word for candidate words.


Retrieve candidate words

get_candidate(trie, word, edit_distance=2)

Retrieve words from the dictionary with an edit distance of less than or equal to 2 as candidate words.


Noisy channel model

channelModel(x, y, edit, corpus, k=0.01)

Calculate channel model error probability


Perform spell correction

spell_correct(vocab, testdata, gram_count, corpus, V, trie, lambdas)

Perform spell correction on the test data, first addressing non-word errors, then handling real-word errors.

VI.Result

Accuracies obtained after tuning the parameter values:

| Interpolation weights (λ₁, λ₂, λ₃) | Add-1 | Add-k |
| --- | --- | --- |
| 0.90, 0.05, 0.05 | 82.9% | 88.8% |
| 0.05, 0.90, 0.05 | 85.5% | 89.2% |
| 0.05, 0.05, 0.90 | 85.4% | 89.2% |
| 0.01, 0.80, 0.19 | 85.6% | 89.7% |
| 0.20, 0.30, 0.50 | 84.8% | 89.3% |
| 0.10, 0.30, 0.60 | 85.3% | 89.5% |
| 0.10, 0.60, 0.30 | 85.2% | 89.4% |
| 0.40, 0.40, 0.20 | 84.3% | 89.2% |

The highest accuracy is 89.7%, with interpolated combination weights of 0.01, 0.80, and 0.19, using Add-k smoothing.

VII.GUI

The packaged executable file “main_script.exe” is located in “/GUI/dist/main_script”. Double-click to run it.

(Screenshot of the spell-correction GUI.)

VIII.Reference

[1] Kernighan, M. D., Church, K., & Gale, W. A. (1990). A spelling correction program based on a noisy channel model. In COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics.

[2] Jayanthi, S. M., Pruthi, D., & Neubig, G. (2020). Neuspell: A neural spelling correction toolkit. arXiv preprint arXiv:2010.11085.

[3] How to Write a Spelling Corrector (如何写一个拼写纠错器), CSDN blog.

[4] https://blog.csdn.net/qq_36134437/article/details/103146390

[5] NLP Spell Correction in Practice (NLP-拼写纠错实战), CSDN blog.

IX. Code

main.py

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import reuters
from collections import Counter
from collections import defaultdict, deque
import numpy as np
import time
import ast


# Preprocessing: vocabulary, test data, and corpus
def preprocessing(ngram, cate):
    # Read the vocabulary
    vocabpath = './vocab.txt'
    with open(vocabpath, 'r') as vocabfile:
        # Read each line of the vocabulary file, strip whitespace, and store it in vocab_list
        vocab_list = [line.strip() for line in vocabfile]

    # Read the test data
    testpath = './testdata.txt'
    testdata = []
    with open(testpath, 'r') as testfile:
        for line in testfile:
            # Split each line on tabs into three fields: sentence id, error count, and the sentence
            item = line.split('\t')
            # Tokenize the sentence
            item[2] = word_tokenize(item[2])
            # Add the sentence-boundary markers <s> and </s>
            item[2] = ['<s>'] + item[2] + ['</s>']
            # Store the processed item
            testdata.append(item)

    # Preprocess the corpus
    corpus_raw_text = reuters.sents(categories=cate)
    corpus_text = []
    gram_count = defaultdict(int)
    vocab_corpus = set()

    for sents in corpus_raw_text:
        # Add the sentence-boundary markers <s> and </s>
        sents = ['<s>'] + sents + ['</s>']
        # Append the processed sentence to the corpus
        corpus_text.extend(sents)
        # Update the corpus vocabulary
        vocab_corpus.update(sents)

        # Count n-grams of each order
        for n in range(1, ngram + 2):
            for i in range(n, len(sents) + 1):
                # Extract the n-gram
                gram = ' '.join(sents[i - n: i])
                # Count its frequency
                gram_count[gram] += 1

    # Number of distinct tokens in the corpus
    V = len(vocab_corpus)
    # Return the vocabulary, test data, n-gram counts, corpus vocabulary, processed corpus, and vocabulary size
    return vocab_list, testdata, gram_count, list(vocab_corpus), corpus_text, V



# Reuters corpus categories
cate = reuters.categories()


# Preprocess with the corpus (n-gram counts up to trigrams)
vocab, testdata, gram_count, vocab_corpus, corpus_text, V = preprocessing(2, cate)


# Load the confusion matrices from external data files
def loadConfusionMatrix():
    # Insertion confusion matrix
    with open('addconfusion.data', 'r') as f:
        addmatrix = ast.literal_eval(f.read())

    # Substitution confusion matrix
    with open('subconfusion.data', 'r') as f:
        submatrix = ast.literal_eval(f.read())

    # Transposition (reversal) confusion matrix
    with open('revconfusion.data', 'r') as f:
        revmatrix = ast.literal_eval(f.read())

    # Deletion confusion matrix
    with open('delconfusion.data', 'r') as f:
        delmatrix = ast.literal_eval(f.read())

    # Return all confusion matrices
    return addmatrix, submatrix, revmatrix, delmatrix


# Load the confusion matrices
addmatrix, submatrix, revmatrix, delmatrix = loadConfusionMatrix()

END = '$'  # Special marker for the end of a word

# Build a trie from the vocabulary
def make_trie(vocab):
    trie = {}  # Empty trie
    for word in vocab:
        t = trie  # Start inserting from the root
        for c in word:
            if c not in t:
                t[c] = {}  # Create a child node for an unseen character
            t = t[c]  # Move to the child node
        t[END] = {}  # Mark the end of the word
    return trie  # Return the finished trie


# Build the trie over the vocabulary to speed up lookups
trie = make_trie(vocab)


# Determine the type of edit between a candidate word and the original word
def editType(candidate, word):
    # Same length: check for a character substitution
    if len(candidate) == len(word):
        for i in range(len(candidate)):
            if candidate[i] != word[i]:
                return 'sub', candidate[i], word[i], (candidate[i], word[i])
    # Candidate is one character shorter: a deletion
    elif len(candidate) + 1 == len(word):
        for i in range(len(candidate)):
            if candidate[i] != word[i]:
                return 'del', candidate[i], word[i], (candidate[i], word[i + 1])
        return 'del', candidate[-1], '', (candidate[-1], '')
    # Candidate is one character longer: an insertion
    elif len(candidate) == len(word) + 1:
        for i in range(len(word)):
            if candidate[i] != word[i]:
                return 'add', candidate[i], word[i], (candidate[i], word[i])
        return 'add', candidate[-1], '', (candidate[-1], '')
    # Equal-length transposition check (note: unreachable as written, since the
    # first branch already handles the equal-length case)
    elif len(candidate) == len(word):
        for i in range(len(candidate) - 1):
            if candidate[i] != word[i] and candidate[i + 1] != word[i + 1]:
                if candidate[i] == word[i + 1] and candidate[i + 1] == word[i]:
                    return 'rev', candidate[i], word[i], (candidate[i], candidate[i + 1])
    # No matching edit type
    return None


# Language model combining Add-k smoothing and interpolation
def interpolated_language_model(gram_count, V, data, lambdas, K=0.0001):
    # Interpolation weights
    unigram_lambda, bigram_lambda, trigram_lambda = lambdas
    total_log_prob = 0  # Accumulated log probability
    total_count = sum(gram_count.values())  # Total n-gram count

    for i in range(len(data)):
        unigram = data[i]  # Current word (unigram)
        bigram = ' '.join(data[i-1:i+1]) if i > 0 else None  # Previous word + current word (bigram)
        trigram = ' '.join(data[i-2:i+1]) if i > 1 else None  # Two previous words + current word (trigram)

        # Unigram probability with Add-k smoothing
        unigram_prob = (gram_count[unigram] + K) / (total_count + K * V)

        if bigram:
            # Bigram probability with Add-k smoothing
            bigram_prob = (gram_count[bigram] + K) / (gram_count[data[i-1]] + K * V) if data[i-1] in gram_count else K / V
        else:
            bigram_prob = 0  # No bigram context available

        if trigram:
            # Trigram probability with Add-k smoothing
            trigram_prob = (gram_count[trigram] + K) / (gram_count[' '.join(data[i-2:i])] + K * V) if ' '.join(data[i-2:i]) in gram_count else K / V
        else:
            trigram_prob = 0  # No trigram context available

        # Interpolated probability
        interpolated_prob = (unigram_lambda * unigram_prob +
                             bigram_lambda * bigram_prob +
                             trigram_lambda * trigram_prob)

        # Add the log of the interpolated probability to the total
        total_log_prob += np.log(interpolated_prob)

    return total_log_prob  # Return the total log probability


# Generate candidate words
def get_candidate(trie, word, edit_distance=2):
    # Queue of (trie node, remaining word, path so far, remaining edit distance)
    que = deque([(trie, word, '', edit_distance)])

    # Breadth-first search over the trie
    while que:
        trie, word, path, edit_distance = que.popleft()

        # The whole word has been consumed
        if word == '':
            # The path ends at a complete word in the trie
            if END in trie:
                yield path  # Emit the candidate

            # If edit distance remains, keep extending the path (insertions at the end)
            if edit_distance > 0:
                for k in trie:
                    if k != END:  # Skip the end marker
                        que.appendleft((trie[k], '', path + k, edit_distance - 1))
        else:
            # If the current character exists in the trie, follow it
            if word[0] in trie:
                que.appendleft((trie[word[0]], word[1:], path + word[0], edit_distance))

            # If edit distance remains, try the other edit operations
            if edit_distance > 0:
                edit_distance -= 1
                for k in trie.keys() - {word[0], END}:
                    # Substitution
                    que.append((trie[k], word[1:], path + k, edit_distance))
                    # Insertion
                    que.append((trie[k], word, path + k, edit_distance))

                # Deletion
                que.append((trie, word[1:], path, edit_distance))

                # Transposition
                if len(word) > 1:
                    que.append((trie, word[1] + word[0] + word[2:], path, edit_distance))


# Channel model error probability
def channelModel(x, y, edit, corpus, k=0.01):
    # Join the corpus into a single string for counting
    corpus_str = ' '.join(corpus)
    # Corpus length (number of tokens)
    corpus_len = len(corpus)

    # Insertion
    if edit == 'add':
        # Frequency in the insertion confusion matrix
        count_xy = addmatrix.get(x + y, 0)
        # Frequency of x, or of y at a word start when x is the boundary marker '#'
        count_y = corpus_str.count(' ' + y) if x == '#' else corpus_str.count(x)
        # Add-k smoothed insertion probability
        return (count_xy + k) / (count_y + k * corpus_len) if count_y else (count_xy + k) / corpus_len

    # Substitution
    if edit == 'sub':
        # Frequency in the substitution confusion matrix
        count_xy = submatrix.get((x + y)[:2], 0)
        # Frequency of y in the corpus
        count_y = corpus_str.count(y)
        # Add-k smoothed substitution probability
        return (count_xy + k) / (count_y + k * corpus_len) if count_y else (count_xy + k) / corpus_len

    # Transposition
    if edit == 'rev':
        # Frequency in the transposition confusion matrix
        count_xy = revmatrix.get(x + y, 0)
        # Frequency of the character pair xy in the corpus
        count_xy_in_corpus = corpus_str.count(x + y)
        # Add-k smoothed transposition probability
        return (count_xy + k) / (count_xy_in_corpus + k * corpus_len) if count_xy_in_corpus else (count_xy + k) / corpus_len

    # Deletion
    if edit == 'del':
        # Frequency in the deletion confusion matrix
        count_xy = delmatrix.get(x + y, 0)
        # Frequency of the character pair xy in the corpus
        count_xy_in_corpus = corpus_str.count(x + y)
        # Add-k smoothed deletion probability
        return (count_xy + k) / (count_xy_in_corpus + k * corpus_len) if count_xy_in_corpus else (count_xy + k) / corpus_len

    # Default probability for unknown edit types
    return 1 / corpus_len


# Spell correction
def spell_correct(vocab, testdata, gram_count, corpus, V, trie, lambdas):
    resultpath = './result.txt'
    with open(resultpath, 'w') as resultfile:
        for item in testdata:
            corrected_sentence = item[2][1:-1]  # Drop <s> and </s>, keep only the sentence tokens
            error_count = int(item[1])  # Number of errors in this sentence
            corrected_words = 0  # Number of words corrected so far
            non_word_errors = 0  # Number of non-word errors handled
            modified_indices = set()  # Indices of words already modified

            # Handle non-word errors
            for i, word in enumerate(corrected_sentence):
                if non_word_errors >= error_count:
                    break

                if word not in vocab:  # Not in the vocabulary: likely a non-word error
                    # Candidates within edit distance 1, falling back to edit distance 2
                    candidates = list(get_candidate(trie, word, edit_distance=1)) or list(get_candidate(trie, word, edit_distance=2))
                    candi_proba = []

                    # Score each candidate
                    for candidate in candidates:
                        edit = editType(candidate, word)  # Type of edit
                        if edit is None:
                            candi_proba.append(interpolated_language_model(gram_count, V, [candidate], lambdas))
                            continue

                        x, y = '', ''
                        if len(edit) == 4:
                            x, y = edit[3][0], edit[3][1]
                        channel_p = np.log(channelModel(x, y, edit[0].lower(), corpus)) if edit else 0

                        # Build the left and right context
                        word_index = i + 1
                        pre_phase = item[2][word_index - 2: word_index] + [candidate]
                        post_phase = [candidate] + item[2][word_index + 1: word_index + 3]

                        # Combine language-model and channel-model scores
                        p = (0.4 * interpolated_language_model(gram_count, V, pre_phase, lambdas) +
                             0.4 * interpolated_language_model(gram_count, V, post_phase, lambdas) +
                             0.2 * channel_p)
                        candi_proba.append(p)

                    # Pick the highest-scoring candidate
                    if candi_proba:
                        index = candi_proba.index(max(candi_proba))
                        corrected_sentence[i] = candidates[index]
                        corrected_words += 1
                        modified_indices.add(i)
                        non_word_errors += 1

            # Handle real-word errors
            if corrected_words < error_count:
                for i, word in enumerate(corrected_sentence):
                    if corrected_words >= error_count:
                        break

                    if i in modified_indices or word not in vocab:  # Skip already-corrected words and out-of-vocabulary words
                        continue

                    # Candidates within edit distance 1, falling back to edit distance 2
                    candidates = list(get_candidate(trie, word, edit_distance=1)) or list(get_candidate(trie, word, edit_distance=2))
                    candi_proba = []

                    # Score each candidate
                    for candidate in candidates:
                        if candidate == word:
                            continue

                        edit = editType(candidate, word)
                        if edit is None:
                            candi_proba.append(interpolated_language_model(gram_count, V, [candidate], lambdas))
                            continue

                        x, y = '', ''
                        if len(edit) == 4:
                            x, y = edit[3][0], edit[3][1]
                        channel_p = np.log(channelModel(x, y, edit[0].lower(), corpus)) if edit else 0

                        # Build the left and right context
                        word_index = i + 1
                        pre_phase = item[2][word_index - 2: word_index] + [candidate]
                        post_phase = [candidate] + item[2][word_index + 1: word_index + 3]

                        # Combine language-model and channel-model scores
                        p = (0.4 * interpolated_language_model(gram_count, V, pre_phase, lambdas) +
                             0.4 * interpolated_language_model(gram_count, V, post_phase, lambdas) +
                             0.2 * channel_p)
                        candi_proba.append(p)

                    # Pick the highest-scoring candidate
                    if candi_proba:
                        index = candi_proba.index(max(candi_proba))
                        corrected_sentence[i] = candidates[index]
                        corrected_words += 1

            # Write the corrected sentence to the result file
            corrected_sentence_str = ' '.join(corrected_sentence)
            # Fix common punctuation spacing issues
            corrected_sentence_str = corrected_sentence_str.replace(" 's", "'s").replace(" ,", ",").replace(" .", ".").replace(" ?", "?").replace(" !", "!").replace(" ;", ";").replace(" :", ":").replace(" (", "(").replace(" )", ")")
            resultfile.write(f"{item[0]}\t{corrected_sentence_str}\n")


# Run spell correction
lambdas = (0.01, 0.8, 0.19)  # Interpolation weights
spell_correct(vocab, testdata, gram_count, corpus_text, V, trie, lambdas)

eval.py

import nltk

anspath = './ans.txt'
resultpath = './result.txt'
ansfile = open(anspath, 'r')
resultfile = open(resultpath, 'r')
count = 0
for i in range(1000):
    ansline = ansfile.readline().split('\t')[1]
    ansset = set(nltk.word_tokenize(ansline))
    resultline = resultfile.readline().split('\t')[1]
    resultset = set(nltk.word_tokenize(resultline))
    if ansset == resultset:
        count += 1
print("Accuracy is : %.2f%%" % (count * 1.00 / 10))

GUI.py

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import reuters
from collections import Counter, defaultdict, deque
import numpy as np
import ast
import tkinter as tk
from tkinter import scrolledtext

# Preprocessing
def preprocessing(ngram, cate):
    # Read the vocabulary
    vocabpath = './vocab.txt'
    with open(vocabpath, 'r') as vocabfile:
        vocab_list = [line.strip() for line in vocabfile]

    corpus_raw_text = reuters.sents(categories=cate)
    corpus_text = []
    gram_count = defaultdict(int)
    vocab_corpus = set()

    for sents in corpus_raw_text:
        sents = ['<s>'] + sents + ['</s>']
        corpus_text.extend(sents)
        vocab_corpus.update(sents)
        for n in range(1, ngram + 2):
            for i in range(n, len(sents) + 1):
                gram = ' '.join(sents[i - n: i])
                gram_count[gram] += 1

    V = len(vocab_corpus)
    return vocab_list, gram_count, list(vocab_corpus), corpus_text, V

# Reuters corpus categories
cate = reuters.categories()

# Preprocess with the corpus
vocab, gram_count, vocab_corpus, corpus_text, V = preprocessing(2, cate)

# Load the confusion matrices from external data files
def loadConfusionMatrix():
    with open('addconfusion.data', 'r') as f:
        addmatrix = ast.literal_eval(f.read())
    with open('subconfusion.data', 'r') as f:
        submatrix = ast.literal_eval(f.read())
    with open('revconfusion.data', 'r') as f:
        revmatrix = ast.literal_eval(f.read())
    with open('delconfusion.data', 'r') as f:
        delmatrix = ast.literal_eval(f.read())
    return addmatrix, submatrix, revmatrix, delmatrix

addmatrix, submatrix, revmatrix, delmatrix = loadConfusionMatrix()

END = '$'  # Special marker for the end of a word

def make_trie(vocab):
    trie = {}
    for word in vocab:
        t = trie
        for c in word:
            if c not in t:
                t[c] = {}
            t = t[c]
        t[END] = {}
    return trie

trie = make_trie(vocab)

def editType(candidate, word):
    if len(candidate) == len(word):
        for i in range(len(candidate)):
            if candidate[i] != word[i]:
                return 'sub', candidate[i], word[i], (candidate[i], word[i])
    elif len(candidate) + 1 == len(word):
        for i in range(len(candidate)):
            if candidate[i] != word[i]:
                return 'del', candidate[i], word[i], (candidate[i], word[i + 1])
        return 'del', candidate[-1], '', (candidate[-1], '')
    elif len(candidate) == len(word) + 1:
        for i in range(len(word)):
            if candidate[i] != word[i]:
                return 'add', candidate[i], word[i], (candidate[i], word[i])
        return 'add', candidate[-1], '', (candidate[-1], '')
    elif len(candidate) == len(word):
        for i in range(len(candidate) - 1):
            if candidate[i] != word[i] and candidate[i + 1] != word[i + 1]:
                if candidate[i] == word[i + 1] and candidate[i + 1] == word[i]:
                    return 'rev', candidate[i], word[i], (candidate[i], candidate[i + 1])
    return None

def interpolated_language_model(gram_count, V, data, lambdas, K=0.001):
    unigram_lambda, bigram_lambda, trigram_lambda = lambdas
    total_log_prob = 0
    total_count = sum(gram_count.values())

    for i in range(len(data)):
        unigram = data[i]
        bigram = ' '.join(data[i-1:i+1]) if i > 0 else None
        trigram = ' '.join(data[i-2:i+1]) if i > 1 else None

        unigram_prob = (gram_count[unigram] + K) / (total_count + K * V)

        if bigram:
            bigram_prob = (gram_count[bigram] + K) / (gram_count[data[i-1]] + K * V) if data[i-1] in gram_count else K / V
        else:
            bigram_prob = 0

        if trigram:
            trigram_prob = (gram_count[trigram] + K) / (gram_count[' '.join(data[i-2:i])] + K * V) if ' '.join(data[i-2:i]) in gram_count else K / V
        else:
            trigram_prob = 0

        interpolated_prob = (unigram_lambda * unigram_prob +
                             bigram_lambda * bigram_prob +
                             trigram_lambda * trigram_prob)

        total_log_prob += np.log(interpolated_prob)

    return total_log_prob

def get_candidate(trie, word, edit_distance=1):
    que = deque([(trie, word, '', edit_distance)])

    while que:
        trie, word, path, edit_distance = que.popleft()

        if word == '':
            if END in trie:
                yield path

            if edit_distance > 0:
                for k in trie:
                    if k != END:
                        que.appendleft((trie[k], '', path + k, edit_distance - 1))
        else:
            if word[0] in trie:
                que.appendleft((trie[word[0]], word[1:], path + word[0], edit_distance))

            if edit_distance > 0:
                edit_distance -= 1
                for k in trie.keys() - {word[0], END}:
                    que.append((trie[k], word[1:], path + k, edit_distance))
                    que.append((trie[k], word, path + k, edit_distance))

                que.append((trie, word[1:], path, edit_distance))

                if len(word) > 1:
                    que.append((trie, word[1] + word[0] + word[2:], path, edit_distance))

def channelModel(x, y, edit, corpus, k=0.01):
    corpus_str = ' '.join(corpus)
    corpus_len = len(corpus)

    if edit == 'add':
        count_xy = addmatrix.get(x + y, 0)
        count_y = corpus_str.count(' ' + y) if x == '#' else corpus_str.count(x)
        return (count_xy + k) / (count_y + k * corpus_len) if count_y else (count_xy + k) / corpus_len

    if edit == 'sub':
        count_xy = submatrix.get((x + y)[:2], 0)
        count_y = corpus_str.count(y)
        return (count_xy + k) / (count_y + k * corpus_len) if count_y else (count_xy + k) / corpus_len

    if edit == 'rev':
        count_xy = revmatrix.get(x + y, 0)
        count_xy_in_corpus = corpus_str.count(x + y)
        return (count_xy + k) / (count_xy_in_corpus + k * corpus_len) if count_xy_in_corpus else (count_xy + k) / corpus_len

    if edit == 'del':
        count_xy = delmatrix.get(x + y, 0)
        count_xy_in_corpus = corpus_str.count(x + y)
        return (count_xy + k) / (count_xy_in_corpus + k * corpus_len) if count_xy_in_corpus else (count_xy + k) / corpus_len

    return 1 / corpus_len

def spell_correct_sentence(sentence, vocab, gram_count, corpus, V, lambdas):
    words = word_tokenize(sentence)
    corrected_sentence = []

    for word in words:
        if word in vocab:
            corrected_sentence.append(word)
        else:
            candidates = list(get_candidate(trie, word, edit_distance=2))
            best_candidate = word
            best_prob = float('-inf')

            for candidate in candidates:
                edit = editType(candidate, word)
                if edit:
                    edit_prob = channelModel(edit[1], edit[2], edit[0], corpus)
                else:
                    edit_prob = 1

                candidate_sentence = corrected_sentence + [candidate] + words[len(corrected_sentence) + 1:]
                lm_prob = interpolated_language_model(gram_count, V, candidate_sentence, lambdas)
                total_prob = np.log(edit_prob) + lm_prob

                if total_prob > best_prob:
                    best_prob = total_prob
                    best_candidate = candidate

            corrected_sentence.append(best_candidate)

    return ' '.join(corrected_sentence)

def correct_spelling():
    input_sentence = input_text.get("1.0", tk.END).strip()
    corrected_sentence = spell_correct_sentence(input_sentence, vocab, gram_count, corpus_text, V, (0.01, 0.8, 0.19))
    output_text.delete("1.0", tk.END)
    output_text.insert(tk.END, corrected_sentence)

# Create the main window
root = tk.Tk()
root.title("Spell Correction")

# Input text box
input_text = tk.Text(root, height=10, width=50)
input_text.pack()

# Correction button
correct_button = tk.Button(root, text="Correct", command=correct_spelling)
correct_button.pack()

# Output text box
output_text = scrolledtext.ScrolledText(root, height=10, width=50)
output_text.pack()

# Run the main loop
root.mainloop()

