DevHeads.net

Alphabets Benchmarks - How many ways to unaccent a text string? Turn AÄÁaäá into AAAaaa. And the winner is...

Hello,

Let's try out half a dozen ways to unaccent a text string. [1]

The challenge - What's the fastest way to turn `AÄÁaäá EÉeé IÍiíï
NÑnñ OÖÓoöó Ssß UÜÚuüú`
into `AAAaaa EEee IIiii NNnn OOOooo Ssss UUUuuu`?

Let's benchmark, and the winner (so far) is... Spoiler: `gsub`.

NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/    # use/try regex constant for speed-up

def unaccent_gsub( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX ) do |ch|
    mapping[ch] || ch
  end
end
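
To try the function outside the benchmark it needs a mapping table. Here is a minimal sketch, assuming a small hash called `SAMPLE_MAPPING` (a made-up name that covers only the challenge string; the benchmark repo ships its own, larger table):

```ruby
# Hypothetical sample mapping covering only the challenge string.
SAMPLE_MAPPING = {
  'Ä'=>'A', 'Á'=>'A', 'ä'=>'a', 'á'=>'a',
  'É'=>'E', 'é'=>'e',
  'Í'=>'I', 'í'=>'i', 'ï'=>'i',
  'Ñ'=>'N', 'ñ'=>'n',
  'Ö'=>'O', 'Ó'=>'O', 'ö'=>'o', 'ó'=>'o',
  'ß'=>'ss',
  'Ü'=>'U', 'Ú'=>'U', 'ü'=>'u', 'ú'=>'u',
}.freeze

NON_ALPHA_CHAR_REGEX = /[^A-Za-z0-9 ]/

def unaccent_gsub( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX ) do |ch|
    mapping[ch] || ch    # keep chars without a mapping as-is
  end
end

text = 'AÄÁaäá EÉeé IÍiíï NÑnñ OÖÓoöó Ssß UÜÚuüú'
puts unaccent_gsub( text, SAMPLE_MAPPING )
#=> AAAaaa EEee IIiii NNnn OOOooo Ssss UUUuuu
```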

Can you find a faster way? Show us.

Happy data (and text) wrangling with ruby. Cheers. Prost.

[1]: https://github.com/sportdb/sport.db/tree/master/alphabets/benchmark

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

Comments

Re: Alphabets Benchmarks - How many ways to unaccent a text stri

By Frank J. Cameron at 08/13/2019 - 19:17

Gerald Bauer wrote:
For the single-character mapping there's String#tr

text=>AÄÁaäá ...:
                          user     system      total        real
each_char             1.219647   0.021196   1.240843 (  1.587247)
each_char_v2          1.054844   0.011123   1.065967 (  1.312583)
each_char_reduce      1.372809   0.010580   1.383389 (  1.839789)
each_char_reduce_v2   1.226152   0.003887   1.230039 (  1.644493)
gsub                  1.124067   0.005926   1.129993 (  1.399212)
gsub_v2               0.949538   0.003917   0.953455 (  1.158131)
gsub_v3               0.804060   0.009833   0.813893 (  1.009054)
scan                  1.879271   0.006998   1.886269 (  2.305612)
iconv                 0.192035   0.001944   0.193979 (  0.224324)
tr                    0.154944   0.000978   0.155922 (  0.224245)
tr_v2                 0.095632   0.002961   0.098593 (  0.120770)

text=>Aa...:
                          user     system      total        real
each_char             0.332079   0.002956   0.335035 (  0.430095)
each_char_v2          0.336198   0.002921   0.339119 (  0.411377)
each_char_reduce      0.379635   0.003936   0.383571 (  0.474561)
each_char_reduce_v2   0.386494   0.003990   0.390484 (  0.488094)
gsub                  0.034031   0.000004   0.034035 (  0.039017)
gsub_v2               0.033728   0.000000   0.033728 (  0.037283)
gsub_v3               0.035162   0.000000   0.035162 (  0.058595)
scan                  0.566857   0.000904   0.567761 (  0.679017)
iconv                 0.032989   0.000842   0.033831 (  0.036642)
tr                    0.079383   0.000004   0.079387 (  0.135033)
tr_v2                 0.020683   0.000986   0.021669 (  0.023377)
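
The tables above have the layout printed by `Benchmark.bm` from Ruby's standard library. A minimal sketch of such a harness follows; the three-entry mapping, the `demo_` method names and the run count are all made up for illustration, and the actual benchmark script may well look different:

```ruby
require 'benchmark'

# Hypothetical stand-ins; the real benchmark uses a much larger table.
DEMO_MAPPING = { 'Ä'=>'A', 'ä'=>'a', 'ß'=>'ss' }.freeze
DEMO_REGEX   = /[^A-Za-z0-9 ]/
RUNS         = 10_000    # assumed run count, pick what gives stable timings

def demo_gsub( text, mapping )
  text.gsub( DEMO_REGEX ) { |ch| mapping[ch] || ch }
end

def demo_each_char( text, mapping )
  text.each_char.map { |ch| mapping[ch] || ch }.join
end

text = 'AÄaä Ssß'

# the label width (20) keeps the timing columns aligned
Benchmark.bm( 20 ) do |x|
  x.report( 'each_char' ) { RUNS.times { demo_each_char( text, DEMO_MAPPING ) } }
  x.report( 'gsub' )      { RUNS.times { demo_gsub( text, DEMO_MAPPING ) } }
end
```

Timings depend on machine and Ruby version, so only the relative order within one run is worth comparing.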

require 'iconv'
def unaccent_iconv( text, mapping )
  Iconv.iconv('ascii//translit//ignore', 'utf-8', text)
end
#=> ["AAAaaa ... UUUuuu"]

def unaccent_tr( text, mapping )
  text.tr( UNACCENT.keys.join, UNACCENT.values.join )
end
#=> "AAAaaa ... UsuuUU"

TR_KEYS = UNACCENT.keys  .join
TR_VALS = UNACCENT.values.join
def unaccent_tr_v2( text, mapping )
  text.tr( TR_KEYS, TR_VALS )
end
#=> "AAAaaa ... UsuuUU"


Re: Alphabets Benchmarks - How many ways to unaccent a text stri

By Rob Biedenharn at 08/13/2019 - 16:03

You missed that String#gsub can take a Hash as its second argument

def unaccent_gsub_v3( text, mapping )
  text.gsub( NON_ALPHA_CHAR_REGEX, mapping )
end
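
One behaviour worth checking before swapping the block for a hash: when the matched text is not a key, `gsub` still inserts `hash[match]`, and a `nil` becomes an empty string. Since `NON_ALPHA_CHAR_REGEX` also matches punctuation and any unmapped accented character, those get dropped unless the hash has a default. A small sketch, using a made-up two-entry mapping and hypothetical names:

```ruby
DEMO_NON_ALPHA = /[^A-Za-z0-9 ]/

# Hypothetical tiny mapping, for illustration only.
plain = { 'ä'=>'a', 'ö'=>'o' }

# Same entries, but unknown keys fall back to the key itself.
with_default = Hash.new { |_hash, ch| ch }.merge!( plain )

text = 'Köln-Süd!'

p text.gsub( DEMO_NON_ALPHA ) { |ch| plain[ch] || ch }   #=> "Koln-Süd!"
p text.gsub( DEMO_NON_ALPHA, plain )                     #=> "KolnSd"
p text.gsub( DEMO_NON_ALPHA, with_default )              #=> "Koln-Süd!"
```

Whether that matters depends on the input; for text that only ever contains mapped characters the two forms agree.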

-Rob

P.S. Pull request in your repo.


Re: Alphabets Benchmarks - How many ways to unaccent a text stri

By Gerald Bauer at 08/13/2019 - 19:41

Hello,

Great, thanks. Today I learned that String#gsub can take a Hash as
its second argument. I added your unaccent function.

About tr - that's great too, and I guess that's as fast as you can
get - but tr maps one character to one character, so unaccent will
not work with ligatures, e.g. 'æ' => 'ae', 'ß' => 'ss', or with German
umlaut transliteration, e.g. 'ä' => 'ae', 'ö' => 'oe', etc.
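
To see why, note that `String#tr` pairs the characters of its two arguments by position. Once one replacement is two characters long, every later pair shifts by one. A small sketch with a made-up three-entry mapping:

```ruby
# Hypothetical mapping with one two-character replacement ('ß' => 'ss').
mapping = { 'ä'=>'a', 'ß'=>'ss', 'ü'=>'u' }

from = mapping.keys.join      # "äßü"
to   = mapping.values.join    # "assu" (four chars for three keys)

# tr pairs by position: ä->a, ß->s, ü->s (the trailing "u" is never used)
p 'äßü'.tr( from, to )              #=> "ass"

# gsub with the hash inserts the whole value for each key
p 'äßü'.gsub( /[äßü]/, mapping )    #=> "assu"
```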

Some more new examples - to quote from the updated readme [1]:

Samuel Williams writes in with one more optimization.
Why not replace the `NON_ALPHA_CHAR_REGEX`, that is, `/[^A-Za-z0-9 ]/`
with a regex matching only known accented chars?

``` ruby
UNACCENT_REGEX = Regexp.union( UNACCENT.keys )
def unaccent_gsub_v3b( text, mapping=UNACCENT, regex=UNACCENT_REGEX )
  text.gsub( regex, mapping )
end
```
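
With a regex built from the keys, only characters that have a mapping can match, so everything else (punctuation, unknown accents) passes through untouched. A small sketch, assuming a made-up three-entry mapping and hypothetical names:

```ruby
# Hypothetical three-entry mapping, for illustration only.
SMALL_UNACCENT = { 'ä'=>'a', 'ö'=>'o', 'ß'=>'ss' }.freeze

# Regexp.union escapes each string and joins the alternatives with "|"
SMALL_UNACCENT_REGEX = Regexp.union( SMALL_UNACCENT.keys )

def unaccent_union( text, mapping=SMALL_UNACCENT, regex=SMALL_UNACCENT_REGEX )
  text.gsub( regex, mapping )
end

# 'ü' and '!' are not keys, so they are never matched and stay as they are
p unaccent_union( 'Schöne Grüße!' )    #=> "Schone Grüsse!"
```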

Hold on. Let's add some more optimizations to the humble `each_char`
version too.
For all 7-bit Unicode Basic Latin (also known as ASCII) characters,
that is, code points up to 0x7F, no mapping is (ever) needed. Let's try:

``` ruby
def unaccent_each_char_v2_7bit( text, mapping )
  buf = String.new
  text.each_char do |ch|
    buf << if ch.ord < 0x7F
             ch
           else
             mapping[ch] || ch
           end
  end
  buf
end
```

Maybe looking up the mapping in an array indexed by the integer code
point is faster than a hash lookup keyed by a single-character string?
Let's try:

``` ruby
UNACCENT_FASTER = UNACCENT.reduce( [] ) do |ary,(ch,value)|
  ary[ ch.ord ] = value
  ary
end

def unaccent_each_char_v2_7bit_faster( text, mapping_faster=UNACCENT_FASTER )
  buf = String.new
  text.each_char do |ch|
    buf << if ch.ord < 0x7F
             ch
           else
             mapping_faster[ ch.ord ] || ch
           end
  end
  buf
end
```
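
One thing to keep in mind with the array table: it is sparse. Its size follows the highest code point in the mapping, and every gap holds `nil`. A small sketch with a made-up three-entry mapping (hypothetical names) shows the shape of the table and checks that both lookups agree:

```ruby
# Hypothetical mapping, for illustration only.
small = { 'ä'=>'a', 'ß'=>'ss', 'ü'=>'u' }

# Same idea as the reduce above, written with each_with_object.
table = small.each_with_object( [] ) do |(ch, value), ary|
  ary[ ch.ord ] = value
end

p table.size            #=> 253  ('ü'.ord is 252, the highest index)
p table.compact.size    #=> 3    (all other slots are nil)

'aäß€ü'.each_char do |ch|
  by_hash  = small[ch] || ch
  by_array = table[ ch.ord ] || ch    # an index past the end returns nil
  raise "mismatch for #{ch}" unless by_hash == by_array
end
```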

Voila. And the winner is... Can you find a faster way? Show us.

Happy data (and text) wrangling with ruby. Cheers. Prost.

[1]: https://github.com/sportdb/sport.db/tree/master/alphabets/benchmark
