Python Pandas - How to format and split text in a column?


I have a set of strings in the DataFrame below:

    id  textcolumn
    1   this is line number one
    2   i love pandas, they are so puffy
    3   [this $tring is with specia| characters, yes it is!]

a. I want to format the string to eliminate all special characters.
b. Once formatted, I'd like to get a list of unique words (splitting on spaces).

Here is the code I have written:

get_df_by_id is a DataFrame that holds one selected row, the one with id 3.

    # replace special characters
    formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]', '')

    # split into words
    results = set()
    get_df_by_id['title'].str.lower().str.split().apply(results.update)
    print results

But when I check the output, I see that the special characters are still in the list.

Output:

    set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])

The intended output should be as below:

    set([u'this', u'is', u'it', u'specia', u'tring', u'is', u'characters,', u'yes', u'with'])

Why does the formatted DataFrame still retain the special characters?
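Judging from the code in the question, a likely cause is that str.replace returns a new Series (stored in formatted_title) while the split is then run on the original get_df_by_id['title'] column, so the cleaning never reaches the output. The sketch below illustrates the difference. The one-row frame is a hypothetical stand-in for get_df_by_id, and regex=True is an assumption about running on a recent pandas, where patterns are treated literally by default.

```python
import pandas as pd

# Hypothetical one-row stand-in for get_df_by_id (the row with id 3).
get_df_by_id = pd.DataFrame(
    {"title": ["[this $tring is with specia| characters, yes it is!]"]},
    index=[3],
)

# Character class copied from the question (note: it has no \| in it).
pattern = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]'

# str.replace returns a new Series; the original column is left unchanged.
formatted_title = get_df_by_id["title"].str.replace(pattern, "", regex=True)

# Splitting the ORIGINAL column, as the question does, keeps the special characters.
from_original = set()
get_df_by_id["title"].str.lower().str.split().apply(from_original.update)

# Splitting the CLEANED Series gives tokens without them.
from_cleaned = set()
formatted_title.str.lower().str.split().apply(from_cleaned.update)

print(sorted(from_original))
print(sorted(from_cleaned))
```

Even with that change, 'specia|' keeps its pipe, because | is not part of the question's character class.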

I think you can first replace the special characters (I added \| to the end of the character class), then lowercase the text and split on \s+ (arbitrary whitespace). The output is a DataFrame, so you can stack it into a Series, call drop_duplicates, and finally tolist:

    # regex=True is needed on pandas 2.0+, where patterns are literal by default
    print (df['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
                      .str.lower()
                      .str.split(r'\s+', expand=True)
                      .stack()
                      .drop_duplicates()
                      .tolist())

    ['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are',
     'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
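For readers who want to try this end to end, here is a self-contained sketch of the same idea. It is not the answer's exact code: it swaps split(expand=True) plus stack() for a plain split() followed by explode(), which avoids the intermediate wide DataFrame. The sample rows are reconstructed from the question, the helper name unique_words is made up for illustration, and regex=True assumes a recent pandas.

```python
import pandas as pd

# Sample data reconstructed from the question; the column is named 'title'
# here to match the code above.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": [
        "this is line number one",
        "i love pandas, they are so puffy",
        "[this $tring is with specia| characters, yes it is!]",
    ],
})

# Same character class as in the answer, including the added \|.
SPECIAL_CHARS = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]'


def unique_words(column):
    """Return the unique lowercase words of a string column, in first-seen order."""
    cleaned = column.str.replace(SPECIAL_CHARS, "", regex=True).str.lower()
    # split() with no pattern splits on any run of whitespace and yields lists;
    # explode() then puts one word on each row.
    words = cleaned.str.split().explode()
    return words.drop_duplicates().tolist()


words = unique_words(df["title"])
print(words)
```

If a set is preferred over an ordered list, set(words) gives the same words without the ordering.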
