python - Using re to sanitize a word file, allowing letters with hyphens and apostrophes


Here's what I have so far:

import re

def read_file(file):
    words = []
    for line in file:
        for word in line.split():
            # Lowercase the word and strip everything except a-z
            words.append(re.sub("[^a-z]", "", word.lower()))

As it stands, this reads in "can't" as "cant" and "co-ordinate" as "coordinate". I want to read in the words with these two punctuation marks allowed. How do I modify my code to do this?

There can be two approaches. One is suggested by ritesht93 in a comment on the question, though I'd use:

words.append(re.sub("[^-'a-z]+", "", word.lower()))
                       ^^    ^ - one or more occurrences, removed in one go
                       | - apostrophe and hyphen added

The + quantifier makes each run of consecutive unwanted characters match as a whole, so the run is removed in one go rather than one character at a time.
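To make the effect of the + quantifier concrete, here is a small sketch using re.subn, which returns the number of substitutions alongside the result. The sample string is made up for illustration. Both patterns yield the same cleaned word, but the version with + handles the whole run of unwanted characters in a single substitution.

```python
import re

sample = "can't!!!"  # hypothetical input, not from the original question

# Without "+": each unwanted character is replaced separately.
single = re.subn("[^-'a-z]", "", sample)

# With "+": the whole run of unwanted characters is replaced at once.
run = re.subn("[^-'a-z]+", "", sample)

print(single)  # ("can't", 3)
print(run)     # ("can't", 1)
```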

Note that the hyphen is added at the beginning of the negated character class (right after the ^), so it does not have to be escaped. Still, escaping it is recommended if other, less regex-savvy developers are going to maintain the code later.
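As a minimal sketch of the first approach applied to the function from the question: the names UNESCAPED and ESCAPED, the pattern parameter and the return statement are additions for illustration, and a plain list of strings stands in for the open file. It also shows that the unescaped leading hyphen and an explicitly escaped hyphen behave the same.

```python
import re

# Hyphen placed first in the class (unescaped) versus an explicitly
# escaped hyphen: both describe the same set of characters.
UNESCAPED = re.compile(r"[^-'a-z]+")
ESCAPED = re.compile(r"[^a-z'\-]+")

def read_file(file, pattern=UNESCAPED):
    """Return lowercased words, keeping only a-z, hyphens and apostrophes."""
    words = []
    for line in file:
        for word in line.split():
            words.append(pattern.sub("", word.lower()))
    return words

lines = ["Can't we Co-ordinate, please?"]  # stand-in for an open file
print(read_file(lines))           # ["can't", 'we', 'co-ordinate', 'please']
print(read_file(lines, ESCAPED))  # same result
```

A token made up only of punctuation would come back as an empty string, so a real reader may want to skip empty results.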

The second approach will be helpful if you have Unicode letters.

r"((?![-'])[\W\d_])+"

See the regex demo (the pattern is to be compiled with the re.UNICODE flag; in Python 3 this is already the default for str patterns).

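The pattern matches any non-letter: a non-word character (except a hyphen or an apostrophe, which the negative lookahead (?![-']) excludes), a digit, or an underscore ([\W\d_]). Substituting its matches with an empty string therefore leaves only letters, hyphens and apostrophes.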
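A minimal sketch of the second approach, assuming Python 3, where str patterns are Unicode-aware without any extra flag. The names NON_LETTERS and clean, and the sample words, are made up for illustration.

```python
import re

# Runs of characters that are not letters; the negative lookahead
# spares hyphens and apostrophes.
NON_LETTERS = re.compile(r"((?![-'])[\W\d_])+")

def clean(word):
    """Lowercase a word and drop everything but letters, '-' and "'"."""
    return NON_LETTERS.sub("", word.lower())

print(clean("Naïve!"))         # naïve
print(clean("co-ordinate2,"))  # co-ordinate
print(clean("l'été_"))         # l'été
```

Only the ASCII apostrophe and hyphen are spared here. A typographic apostrophe (’) would be removed unless it is added to the lookahead.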

