Python Pandas - How to format and split text in a column?
I have a set of strings in a dataframe, as below:

    id  textcolumn
    1   line number 1
    2   love pandas, puffy
    3   [this $tring specia| characters, yes is!]
a. I want to format the string to eliminate all special characters.
b. Once formatted, I'd like to get a list of unique words (splitting on spaces).

Here is the code I have written:

get_df_by_id is a dataframe that holds one selected row, the one with id 3.
    # replace special characters
    formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')

    # split into words
    results = set()
    get_df_by_id['title'].str.lower().str.split().apply(results.update)
    print results
But when I check the output, I see that the special characters are still in the list.
Output:

    set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])
The intended output should be as below:

    set([u'this', u'is', u'it', u'specia', u'tring', u'is', u'characters,', u'yes', u'with'])
Why does the formatted dataframe still retain the special characters?
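A likely cause, judging from the snippet in the question: str.replace returns a new Series and leaves the column untouched, and the split is then run on get_df_by_id['title'] again rather than on formatted_title. The | character is also missing from the character class. The sketch below illustrates this under some assumptions: the one-row frame is reconstructed from the output shown in the question, word_set is a hypothetical helper that stands in for the set/update idiom, and regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical one-row frame standing in for get_df_by_id (id 3);
# the text is reconstructed from the output printed in the question.
get_df_by_id = pd.DataFrame(
    {"id": [3], "title": ["[this $tring is with specia| characters, yes it is!]"]}
)

# Same character class as in the question, with \| added at the end.
pattern = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]'


def word_set(series):
    """Collect the unique lower-cased, whitespace-separated tokens of a Series."""
    return {word for tokens in series.str.lower().str.split() for word in tokens}


# str.replace returns a new Series; the 'title' column itself is unchanged.
formatted_title = get_df_by_id["title"].str.replace(pattern, "", regex=True)

# Splitting the original column still yields tokens with special characters.
raw_words = word_set(get_df_by_id["title"])

# Splitting the cleaned Series yields the intended words.
clean_words = word_set(formatted_title)

print(sorted(raw_words))
print(sorted(clean_words))
```

The only change that matters relative to the question's code is that the split runs on formatted_title and that | is part of the pattern.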
I think you can first replace the special characters (I added \| to the end of the character class), then lower the text and split on \s+ (arbitrary whitespace). The output is a DataFrame, so you can stack it into a Series, apply drop_duplicates, and finally call tolist:
    print(df['title']
          # regex=True is needed on recent pandas, where the default is a literal match
          .str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
          .str.lower()
          .str.split(r'\s+', expand=True)
          .stack()
          .drop_duplicates()
          .tolist())

    ['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are', 'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
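For anyone who wants to run this end to end, here is a minimal self-contained sketch. Several things in it are assumptions rather than part of the answer above: the sample rows are reconstructed from the printed word list, unique_words is a hypothetical helper name, the long character class is swapped for a negated class [^A-Za-z0-9\s] (which removes every character that is not an ASCII letter, digit, or whitespace, so it is broader than the original pattern), and Series.explode is used in place of expand=True followed by stack.

```python
import pandas as pd

# Hypothetical sample data, reconstructed for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
})


def unique_words(series):
    """Return the unique lower-cased words of a text Series, in order of first appearance."""
    cleaned = (
        series
        .str.replace(r"[^A-Za-z0-9\s]", "", regex=True)  # drop special characters
        .str.lower()
        .str.strip()
    )
    # split() with no argument splits on runs of whitespace;
    # explode() turns each list of words into one row per word.
    words = cleaned.str.split().explode()
    return words.dropna().drop_duplicates().tolist()


words = unique_words(df["title"])
print(words)
```

With these sample rows the result should match the list printed above. If underscores or non-ASCII letters need to survive, the negated class would have to be adjusted.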