python - Using re to sanitize a word file, allowing letters with hyphens and apostrophes
Here's what I have so far:
    import re

    def read_file(file):
        words = []
        for line in file:
            for word in line.split():
                # Strip everything except the lowercase letters a-z
                words.append(re.sub("[^a-z]", "", word.lower()))
        return words
As it stands, this reads in "can't" as "cant" and "co-ordinate" as "coordinate". I want to read in the words with these 2 punctuation marks allowed. How do I modify my code to do this?
There can be 2 approaches. The first one is as suggested by ritesht93 in a comment on the question, though I'd use
    words.append(re.sub("[^-'a-z]+", "", word.lower()))
                           ^^    ^ - one or more occurrences, removed in one go
                           |
                           - apostrophe and hyphen added
The + quantifier removes each run of unwanted characters matching the pattern in one go, instead of one character at a time.
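To see what "in one go" means in practice, here is a small sketch using re.subn, which returns the new string together with the number of substitutions made. The sample string is made up for illustration. Both patterns give the same result; the version with + simply performs fewer replacements.

```python
import re

text = "co-ordin@#$ate"  # hypothetical input with a run of unwanted characters

# Without +, every unwanted character is a separate match.
per_char = re.subn("[^-'a-z]", "", text)

# With +, the whole run of unwanted characters is a single match.
per_run = re.subn("[^-'a-z]+", "", text)

print(per_char)
print(per_run)
```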
Note that the hyphen is added at the beginning of the negated character class, so it does not have to be escaped. Note: it is still recommended to escape it if other, less regex-savvy developers are going to maintain this later.
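Putting the first approach together, a minimal sketch of the whole function could look like this. The sample lines are invented, and skipping tokens that end up empty (such as "42") is my own assumption, not something the question asked for.

```python
import re

# Hyphen placed first in the class, so it is literal without escaping.
# The escaped form r"[^\-'a-z]+" behaves the same and may be easier to read.
UNWANTED = re.compile(r"[^-'a-z]+")

def read_file(file):
    """Return lowercased words from an iterable of lines, keeping - and '."""
    words = []
    for line in file:
        for word in line.split():
            cleaned = UNWANTED.sub("", word.lower())
            if cleaned:  # assumption: drop tokens with nothing left in them
                words.append(cleaned)
    return words

# Any iterable of lines works, for example an open file or a list of strings.
sample = ["Can't we co-ordinate?", "Yes, we can! 42"]
print(read_file(sample))
```

With a real file you would call it as read_file(open(path)), where path is whatever file you are reading.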
The second approach will be helpful if you have Unicode letters, too.

    ur"((?![-'])[\W\d_])+"

See the regex demo (the pattern is to be compiled with the re.UNICODE flag).

The pattern matches any non-letter: any non-word character (except a hyphen or an apostrophe, due to the negative lookahead (?![-'])), any digit, or an underscore ([\W\d_]).
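The ur prefix only exists in Python 2. As a rough sketch of the same idea in Python 3, where str patterns are already Unicode-aware so neither the u prefix nor re.UNICODE is needed, something like this should work. The helper name clean and the sample words are made up for illustration.

```python
import re

# Matches runs of non-word characters, digits, or underscores,
# but never a hyphen or an apostrophe (excluded by the lookahead).
NON_LETTER = re.compile(r"((?![-'])[\W\d_])+")

def clean(word):
    """Lowercase a word and strip everything except letters, - and '."""
    return NON_LETTER.sub("", word.lower())

print(clean("Señor,"))
print(clean("co-ordinate_2"))
print(clean("can't!"))
```

Unlike [^-'a-z], this keeps accented and other non-ASCII letters, because \W is defined in terms of Unicode word characters.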