Debian, ICU, charlock_holmes – encoding detection for files in ruby

Recently I had a problem with uploading CSV files and invalid byte sequences. My original CSV files were encoded in UTF-16LE (wtf, right? :)), and I needed UTF-8. So I searched for an easy solution. Maybe this could work:

system("file #{uploaded_file.path}")
# prints: README.rdoc: ASCII text
#=> true (system returns true when the command exits successfully)
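Since system only hands back true or false, the text printed by file never reaches your Ruby code. A minimal sketch of capturing it instead, assuming the file(1) command is installed and supports the -b and --mime-encoding options (the helper name file_mime_encoding is made up for this example):

```ruby
require "open3"

# Hypothetical helper: returns the encoding name reported by file(1),
# or nil when the command is missing or exits with an error.
def file_mime_encoding(path)
  # Passing arguments separately avoids shell interpolation of the path.
  output, status = Open3.capture2("file", "-b", "--mime-encoding", path)
  status.success? ? output.strip : nil
rescue Errno::ENOENT
  # file(1) is not installed on this machine
  nil
end
```

Even with the output captured, this still depends on an external command being present on every server, which is the same kind of dependency problem as any other third-party tool.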

Shelling out like this wasn’t a good idea, so I kept digging and found a better solution:

Charlock_Holmes gem (https://github.com/brianmario/charlock_holmes)

It uses ICU to detect the file encoding, so at first I refused to use it because of the third-party software. I prefer pure-Ruby solutions without third-party dependencies: if you move your site to another server, you have to remember about those other dependencies. Of course you have tests covering everything, right? But in production you don’t run your tests…

Anyway, I decided to use this solution. On Debian you can install ICU with:

sudo apt-get install libicu-dev

At first I couldn’t find the installed files, so I compiled ICU myself
(http://www.linuxfromscratch.org/blfs/view/svn/general/icu.html):

sudo su -
wget http://download.icu-project.org/files/icu4c/56.1/icu4c-56_1-src.tgz
tar -xf icu4c-56_1-src.tgz
cd icu/source && ./configure --prefix=/usr && make
make install

The next part is installing the charlock_holmes gem. It should be super easy… but it wasn’t. I struggled with finding the icu4c directory (on Debian it is… /usr/lib/).

So I installed it:

gem install charlock_holmes -- --with-icu-dir=/usr/lib/

With Ruby 2.0.0 (via rvm) everything went smoothly. Of course, I struggled with the whole thing for a few hours until I figured out how to do it right ;).
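If you are also hunting for where ICU ended up, here is a rough sketch that checks a few install prefixes for both the ICU headers and the libicuuc shared library. The candidate list and the helper name icu_prefixes are my assumptions, so adjust them to your system:

```ruby
# Hypothetical helper: returns the prefixes under which ICU appears to be installed.
def icu_prefixes(candidates = %w[/usr /usr/local])
  candidates.select do |prefix|
    # ICU headers live in include/unicode
    headers = File.directory?(File.join(prefix, "include", "unicode"))
    libs = Dir.glob([
      File.join(prefix, "lib", "libicuuc.so*"),
      File.join(prefix, "lib", "*", "libicuuc.so*") # multiarch layout, e.g. lib/x86_64-linux-gnu
    ]).any?
    headers && libs
  end
end

puts icu_prefixes.inspect
```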

When I tested everything:

detector = CharlockHolmes::EncodingDetector.new
uploaded_file = detector.detect(File.read(file.tempfile.path))
#=> {:type=>:text, :encoding=>"UTF-16", :ruby_encoding=>"UTF-16", :confidence=>100}

It worked like a charm. But iconv has been deprecated since Ruby 1.9.3, and Ruby’s built-in encoding support (String#encode) should be used instead. To use it with a newer Ruby I switched to Ruby 2.2.2, bundled all the gems and… forgot to pass --with-icu-dir. So when I tested Charlock I got:

~ irb
2.2.2 :001 > require 'charlock_holmes'
LoadError: /home/vagrant/.rvm/gems/ruby-2.2.2/gems/charlock_holmes-0.7.3/lib/charlock_holmes/charlock_holmes.so: undefined symbol: _ZN6icu_568ByteSink15GetAppendBufferEiiPciPi - /home/vagrant/.rvm/gems/ruby-2.2.2/gems/charlock_holmes-0.7.3/lib/charlock_holmes/charlock_holmes.so
        from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from /home/vagrant/.rvm/gems/ruby-2.2.2/gems/charlock_holmes-0.7.3/lib/charlock_holmes.rb:1:in `<top (required)>'
        from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:128:in `require'
        from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:128:in `rescue in require'
        from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:39:in `require'
        from (irb):1
        from /home/vagrant/.rvm/rubies/ruby-2.2.2/bin/irb:11:in `<main>'
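The mangled symbol in that LoadError carries a hint. C++ name mangling prefixes each name with its length, so 6icu_56 is the six-character namespace icu_56, and ICU puts its major version into that namespace. This suggests the extension was compiled against ICU 56 headers, but the library found at load time did not provide that symbol. A small sketch that pulls the version out of such a symbol (the helper name is mine, not part of charlock_holmes):

```ruby
# Hypothetical helper: extracts the ICU major version from a mangled C++ symbol.
def icu_major_from_symbol(symbol)
  match = symbol.match(/(\d+)icu_/)
  return nil unless match

  # The digits before "icu_" give the length of the namespace name.
  length = match[1].to_i
  namespace = symbol[match.end(1), length] # e.g. "icu_56"
  version = namespace[/\d+\z/]
  version && version.to_i
end

puts icu_major_from_symbol("_ZN6icu_568ByteSink15GetAppendBufferEiiPciPi")
```

Using the length prefix matters here: a naive icu_(\d+) pattern would also swallow the 8 that belongs to the next name, 8ByteSink.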

OK, this should be super easy: just uninstall and reinstall with Ruby 2.2.2:

gem uninstall charlock_holmes
# uninstalled successfully 1 gem
gem install charlock_holmes -- --with-icu-dir=/usr/lib

I still got the LoadError… wtf? Setting the directory didn’t help. I dug into the gem directory and compared the Ruby 2.0.0 and Ruby 2.2.2 installs: the charlock_holmes.so files had different sizes. I copied the file from ruby-2.0.0 and it worked, so the problem was with the ruby-2.2.2 installation. I looked into:

charlock_holmes-0.7.3/ext/charlock_holmes/Makefile

It contained:

libpath = /usr/lib/lib . $(libdir)
LIBPATH =  -L/usr/lib/lib -Wl,-R/usr/lib/lib -L. -L$(libdir) -Wl,-R$(libdir)
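The doubled /usr/lib/lib is what you get when a --with-NAME-dir option is handled by mkmf’s dir_config: it treats the value as an install prefix and adds include and lib to it. I am assuming here that the gem’s extconf.rb uses dir_config for ICU. A tiny sketch of that expansion (expand_with_dir is a made-up name, not an mkmf function):

```ruby
# Hypothetical helper mirroring how mkmf expands --with-NAME-dir=PREFIX.
def expand_with_dir(prefix)
  {
    include: File.join(prefix, "include"),
    lib: File.join(prefix, "lib")
  }
end

puts expand_with_dir("/usr/lib/")[:lib] # the path that ended up in the Makefile
puts expand_with_dir("/usr/")[:lib]     # the path where the libraries actually are
```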

Something went wrong during the installation: lib had been appended to the directory I passed, so /usr/lib became /usr/lib/lib. So I changed the directory and installed like this:

gem install charlock_holmes -- --with-icu-dir=/usr/

It worked for me… and

raw = File.read(file.tempfile.path)
detector = CharlockHolmes::EncodingDetector.new
detection = detector.detect(raw)

# transcode the raw content from the detected encoding to UTF-8
File.open(new_csv, 'w') do |csv|
  csv.puts(raw.encode('UTF-8', detection[:encoding]))
end

works like it should 😉
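The transcoding step on its own can be tried without the gem. A sketch under a few assumptions: the detection hash has the same shape as the detector output shown earlier, the helper name to_utf8 is made up, and stripping a leading byte order mark and replacing invalid bytes are my additions, not something the code above does:

```ruby
# Hypothetical helper: converts raw bytes to UTF-8 based on a detection hash.
def to_utf8(raw, detection)
  source = detection[:ruby_encoding] || detection[:encoding]
  text = raw.dup.force_encoding(source)
            .encode("UTF-8", invalid: :replace, undef: :replace)
  # Drop a leading byte order mark if one survived the conversion.
  text.sub(/\A\uFEFF/, "")
end

# Simulated upload: UTF-16LE bytes with a BOM, read as binary.
raw = "\xFF\xFE".b + "name;city\n".encode("UTF-16LE").b
puts to_utf8(raw, { encoding: "UTF-16LE", ruby_encoding: "UTF-16LE" })
```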

Thanks for reading and I hope you found what you were looking for.

Sorry for my English, this is my first public post in this language ;). tl;dr 😀

 

Rafath Khan

