Alphabets Benchmarks - How many ways to unaccent a text string? Turn AÄÁaäá into AAAaaa. And the winner is...

Hello,

Let's try out half a dozen ways to unaccent a text string. [1]

The challenge: what's the fastest way to turn `AÄÁaäá EÉeé IÍiíï NÑnñ OÖÓoöó Ssß UÜÚuüú`
into `AAAaaa EEee IIiii NNnn OOOooo Ssss UUUuuu`?

Let's benchmark. The winner (so far) is... Spoiler: `gsub`.

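    NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/   # use/try a regex constant for speed-up

    # Replace every character outside A-Z, a-z, 0-9 and space via the mapping;
    # characters without a mapping entry are kept unchanged.
    def unaccent_gsub( text, mapping )
      text.gsub( NON_ALPHA_CHAR_REGEX ) do |ch|
        mapping[ch] || ch
      end
    end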
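
To make the call concrete, here is a minimal usage sketch. The `MAPPING` table below is a hypothetical stand-in that covers only the characters in the challenge string; the mapping used in the actual benchmark may be larger or organized differently.

```ruby
# Hypothetical sample mapping (assumption); covers only the challenge string.
MAPPING = {
  'Ä' => 'A', 'Á' => 'A', 'ä' => 'a', 'á' => 'a',
  'É' => 'E', 'é' => 'e',
  'Í' => 'I', 'í' => 'i', 'ï' => 'i',
  'Ñ' => 'N', 'ñ' => 'n',
  'Ö' => 'O', 'Ó' => 'O', 'ö' => 'o', 'ó' => 'o',
  'ß' => 'ss',
  'Ü' => 'U', 'Ú' => 'U', 'ü' => 'u', 'ú' => 'u',
}.freeze

NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/

def unaccent_gsub( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX ) do |ch|
    mapping[ch] || ch     # fall back to the original character
  end
end

puts unaccent_gsub( 'AÄÁaäá EÉeé IÍiíï NÑnñ OÖÓoöó Ssß UÜÚuüú', MAPPING )
# prints: AAAaaa EEee IIiii NNnn OOOooo Ssss UUUuuu
```

Because of the `|| ch` fallback, a character that matches the regex but has no mapping entry (for example `!`) passes through untouched.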

Can you find a faster way? Show us.
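
If you want to try a candidate of your own, a rough harness along these lines might do. This is only a sketch using Ruby's standard `benchmark` library; the `each_char` variant, the abbreviated `SAMPLE_MAPPING`, the sample text and the iteration count are all assumptions made for illustration and are not taken from the benchmark repo.

```ruby
require 'benchmark'

# Hypothetical, abbreviated mapping for illustration only (assumption).
SAMPLE_MAPPING = { 'Ä' => 'A', 'ä' => 'a', 'Ö' => 'O', 'ö' => 'o',
                   'Ü' => 'U', 'ü' => 'u', 'ß' => 'ss' }.freeze

NON_ALPHA_REGEX = /[^A-Za-z0-9 ]/

# Baseline from the post: regex scan plus a block lookup per match.
def unaccent_gsub( text, mapping )
  text.gsub( NON_ALPHA_REGEX ) { |ch| mapping[ch] || ch }
end

# One possible challenger (assumption): look up every character in the hash.
def unaccent_each_char( text, mapping )
  text.each_char.map { |ch| mapping[ch] || ch }.join
end

sample = 'AÄaä OÖoö Ssß UÜuü'
n      = 10_000   # iteration count, chosen arbitrarily

Benchmark.bm( 12 ) do |x|
  x.report( 'gsub' )      { n.times { unaccent_gsub( sample, SAMPLE_MAPPING ) } }
  x.report( 'each_char' ) { n.times { unaccent_each_char( sample, SAMPLE_MAPPING ) } }
end
```

Timings depend on the Ruby version, the text and the mapping size, so check that every candidate returns the same string before comparing numbers.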

Happy data (and text) wrangling with ruby. Cheers. Prost.

[1]: https://github.com/sportdb/sport.db/tree/master/alphabets/benchmark

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

Comments

Re: Alphabets Benchmarks - How many ways to unaccent a text stri

By Rob Biedenharn at 08/13/2019 - 17:03

You missed that String#gsub can take a Hash as its second argument:

    def unaccent_gsub_v3( text, mapping )
      text.gsub( NON_ALPHA_CHAR_REGEX, mapping )
    end
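
One behavior worth keeping in mind with the Hash form: a matched character that is not a key in the hash is replaced by the hash's default value, which for a plain Hash is `nil` and therefore ends up as an empty string. The sketch below illustrates that and shows one possible workaround using a default block; the `ONE_ENTRY` mapping and the `with_passthrough` helper are hypothetical names introduced only for this example.

```ruby
NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/

def unaccent_gsub_v3( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX, mapping )
end

# Hypothetical one-entry mapping (assumption), just enough to show the effect.
ONE_ENTRY = { 'é' => 'e' }.freeze

# Hypothetical helper (assumption): copy the mapping into a Hash whose
# default block returns the looked-up character itself.
def with_passthrough( mapping )
  Hash.new { |_hash, ch| ch }.merge!( mapping )
end

p unaccent_gsub_v3( 'Olé!', ONE_ENTRY )                      # "Ole"  ('!' is dropped)
p unaccent_gsub_v3( 'Olé!', with_passthrough( ONE_ENTRY ) )  # "Ole!" ('!' is kept)
```

If the mapping covers every character the regex can match in your data, the plain Hash form behaves the same as the block version.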

-Rob

P.S. Pull request in your repo.
