Debian, ICU, charlock_holmes – encoding detection for files in ruby
Recently I had a problem with uploading CSV files and invalid byte sequences. My original CSV file was encoded in UTF-16LE (wtf, right? :)), and I needed UTF-8. So I searched for an easy solution. Maybe this could work:
system("file #{uploaded_file.path}")
# README.rdoc: ASCII text
#=> true (system returns true when the command ends with success)
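As the comment says, system only reports whether the command succeeded; the text that file prints goes straight to the terminal. If you did want to go down this road, a minimal sketch with Open3 from the standard library could capture that text instead. The helper names capture_output and guess_encoding_with_file are made up for this example, and the --brief and --mime-encoding options assume the usual file(1) found on Debian.

```ruby
require 'open3'

# Runs a command and returns its standard output as a String,
# or nil if the command is missing or exits with a failure status.
def capture_output(*command)
  stdout, status = Open3.capture2(*command)
  status.success? ? stdout.strip : nil
rescue Errno::ENOENT
  # The executable was not found on this machine.
  nil
end

# Hypothetical helper: asks file(1) for its guess at the encoding of a file.
# Returns whatever file prints, or nil when file is unavailable or fails.
def guess_encoding_with_file(path)
  capture_output('file', '--brief', '--mime-encoding', path)
end
```

Even then the answer is only a guess made by an external tool, and it adds a dependency on that tool being installed on the server.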
Shelling out like this wasn’t a good idea, so I kept digging and found a better solution:
Charlock_Holmes gem (https://github.com/brianmario/charlock_holmes)
It uses ICU to detect the file encoding, so at first I refused to use it because it relies on third-party software. I’d rather use pure Ruby solutions without any third-party dependencies: if you move your site to another server, you have to remember about those other dependencies… Of course you have tests covering everything, right? But in production you don’t run your tests…
Anyway, I decided to use this solution. On Debian you can install ICU with:
sudo apt-get install libicu-dev
At first I couldn’t find the installed files, so I built ICU from source myself
(http://www.linuxfromscratch.org/blfs/view/svn/general/icu.html)
sudo su -
wget http://download.icu-project.org/files/icu4c/56.1/icu4c-56_1-src.tgz
tar -xf icu4c-56_1-src.tgz
cd icu/source && ./configure --prefix=/usr && make
make install
The next part is installing the charlock_holmes gem. It should have been super easy… but it wasn’t: I struggled to find the icu4c directory (on Debian it is… /usr/lib/).
So I installed it with:
gem install charlock_holmes -- --with-icu-dir=/usr/lib/
With Ruby 2.0.0 (with rvm) everything went smoothly. Of course, I struggled with the whole thing for a few hours until I figured out how to do it right ;).
When I tested everything:
detector = CharlockHolmes::EncodingDetector.new
uploaded_file = detector.detect(File.read(file.tempfile.path))
#=> {:type=>:text, :encoding=>"UTF-16", :ruby_encoding=>"UTF-16", :confidence=>100}
it worked like a charm. But iconv has been deprecated since Ruby 1.9.3 and String#encode should be used instead. To use it with a newer Ruby, I switched to Ruby 2.2.2, bundled all the gems and… forgot to use --with-icu-dir. So after testing Charlock I got:
~ irb
2.2.2 :001 > require 'charlock_holmes'
LoadError: /home/vagrant/.rvm/gems/ruby-2.2.2/gems/charlock_holmes-0.7.3/lib/charlock_holmes/charlock_holmes.so: undefined symbol: _ZN6icu_568ByteSink15GetAppendBufferEiiPciPi - /home/vagrant/.rvm/gems/ruby-2.2.2/gems/charlock_holmes-0.7.3/lib/charlock_holmes/charlock_holmes.so
    from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from /home/vagrant/.rvm/gems/ruby-2.2.2/gems/charlock_holmes-0.7.3/lib/charlock_holmes.rb:1:in `<top (required)>'
    from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:128:in `require'
    from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:128:in `rescue in require'
    from /home/vagrant/.rvm/rubies/ruby-2.2.2/lib/ruby/site_ruby/2.2.0/rubygems/core_ext/kernel_require.rb:39:in `require'
    from (irb):1
    from /home/vagrant/.rvm/rubies/ruby-2.2.2/bin/irb:11:in `<main>'
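The mangled symbol in that LoadError carries a hint: ICU puts its C++ classes in a namespace with the major version appended, so icu_56 points at ICU 56 (the version compiled by hand earlier). A likely reading, though only an assumption here, is that the extension was built against the ICU 56 headers but at run time loaded a libicu that does not export that symbol. The sketch below pulls the version out of such a symbol; icu_version_from_symbol is a hypothetical helper and only handles the icu_NN naming.

```ruby
# Extracts the ICU major version from an Itanium-mangled C++ symbol.
# Mangled names store each name as <length><name>, so "6icu_56" means a
# 6-character name "icu_56"; the digit after it belongs to the next name.
def icu_version_from_symbol(symbol)
  match = symbol.match(/(\d+)(icu_\d+)/)
  return nil unless match

  length = match[1].to_i
  namespace = match[2][0, length] # cut off digits that spill into the next name
  namespace.sub('icu_', '')
end

symbol = '_ZN6icu_568ByteSink15GetAppendBufferEiiPciPi'
puts icu_version_from_symbol(symbol) # prints 56
```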
OK, this should be super easy: just uninstall and reinstall with Ruby 2.2.2.
gem uninstall charlock_holmes
# uninstalled successfully 1 gem
gem install charlock_holmes -- --with-icu-dir=/usr/lib
I still got the LoadError… wtf? Setting the directory didn’t help. I dug into the gem directory and compared Ruby 2.0.0 with Ruby 2.2.2: the charlock_holmes.so files had different sizes. I copied the file over from ruby-2.0.0 and it worked, so the problem was in the ruby-2.2.2 installation. I looked into
charlock_holmes-0.7.3/ext/charlock_holmes/Makefile
It contained:
libpath = /usr/lib/lib . $(libdir)
LIBPATH = -L/usr/lib/lib -Wl,-R/usr/lib/lib -L. -L$(libdir) -Wl,-R$(libdir)
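The doubled /usr/lib/lib is what you would get if the build treats --with-icu-dir as an installation prefix and appends include and lib to it, which is how mkmf's dir_config generally handles --with-NAME-dir options. Assuming charlock_holmes configures ICU that way (a guess based on the Makefile, not checked against its extconf.rb), the effect can be modelled in a few lines. expand_icu_dir is a made-up name for illustration.

```ruby
# Simplified model of how a --with-icu-dir=PREFIX option is expanded:
# headers are looked up in PREFIX/include and libraries in PREFIX/lib.
def expand_icu_dir(prefix)
  {
    include: File.join(prefix, 'include'),
    lib: File.join(prefix, 'lib')
  }
end

puts expand_icu_dir('/usr/lib/')[:lib] # /usr/lib/lib, the path seen in the Makefile
puts expand_icu_dir('/usr/')[:lib]     # /usr/lib
```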
The path was wrong: /lib had been appended to the directory I passed, giving /usr/lib/lib. So I changed the directory and installed with:
gem install charlock_holmes -- --with-icu-dir=/usr/
It worked for me… and this code:
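# Read the raw bytes of the uploaded file
content = File.read(file.tempfile.path)
detector = CharlockHolmes::EncodingDetector.new
# detect returns a hash such as {:encoding=>"UTF-16", ...}; keep it separate from the content
detection = detector.detect(content)
csv = File.open(new_csv, 'w+')
# Transcode the content from the detected encoding to UTF-8
csv.puts(content.encode('UTF-8', detection[:encoding]))
csv.close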
works like it should 😉
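The conversion step itself can be tried without any uploaded file or detector. A small sketch, assuming the source encoding is already known (here hard-coded as UTF-16LE, where the real code would pass the detector's :encoding value). to_utf8 and the sample data are invented for the example, and the invalid: and undef: options are an extra precaution that the code above does not use.

```ruby
# Transcodes raw bytes from a known source encoding to UTF-8.
# Bytes that cannot be converted are replaced instead of raising an error.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.encode('UTF-8', source_encoding, invalid: :replace, undef: :replace)
end

sample = "id;name\n1;Zażółć\n"

# Simulate an uploaded UTF-16LE CSV: encode the sample, then drop the
# encoding tag (String#b) so it looks like bytes freshly read from disk.
raw = sample.encode('UTF-16LE').b

converted = to_utf8(raw, 'UTF-16LE')
puts converted.encoding  # UTF-8
puts converted == sample # true
```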
Thanks for reading and I hope you found what you were looking for.
Sorry for my English, this is my first public post in this language ;). tl;dr 😀