FastText is a free, open-source, lightweight library for learning text representations and text classifiers. It runs on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices.
It's written in C++ and builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. Python bindings are included.
Using this library, you can train models that represent words as multi-dimensional vectors. The models can then be queried to find correlations between these vectors in the multi-dimensional space. The quality of the results you get for your queries depends on the volume and quality of your input data, as well as on the training parameters you specify. For an introductory explanation of how it works, see "king - man + woman is queen; but why?".
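To make the vector arithmetic behind such queries concrete, here is a minimal C++ sketch. It is not FastText's implementation: the three-dimensional vectors are made up by hand rather than trained, and the names `cosine`, `analogy` and `toyVectors` are hypothetical. It only illustrates the idea of adding the "positive" word vectors, subtracting the "negative" ones, and ranking the remaining words by cosine similarity to the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Vocab = std::map<std::string, Vec>;
using Scored = std::pair<std::string, double>;

// Cosine similarity between two vectors of the same dimension.
// Returns 0 if either vector has zero length.
double cosine(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Builds the query vector sum(positive) - sum(negative) and ranks every
// other word in the vocabulary by cosine similarity to it, best first.
// Throws std::out_of_range if a query word is not in the vocabulary.
std::vector<Scored> analogy(const Vocab& vocab,
                            const std::vector<std::string>& positive,
                            const std::vector<std::string>& negative) {
    assert(!vocab.empty());
    Vec query(vocab.begin()->second.size(), 0.0);
    for (const auto& w : positive) {
        const Vec& v = vocab.at(w);
        for (std::size_t i = 0; i < query.size(); ++i) query[i] += v[i];
    }
    for (const auto& w : negative) {
        const Vec& v = vocab.at(w);
        for (std::size_t i = 0; i < query.size(); ++i) query[i] -= v[i];
    }

    std::vector<Scored> ranked;
    for (const auto& entry : vocab) {
        const std::string& word = entry.first;
        const bool isQueryWord =
            std::find(positive.begin(), positive.end(), word) != positive.end() ||
            std::find(negative.begin(), negative.end(), word) != negative.end();
        if (!isQueryWord) ranked.emplace_back(word, cosine(query, entry.second));
    }
    std::sort(ranked.begin(), ranked.end(),
              [](const Scored& l, const Scored& r) { return l.second > r.second; });
    return ranked;
}

// Hand-made toy vectors (assumption, NOT trained data). The three axes
// loosely stand for "royalty", "male" and "female".
Vocab toyVectors() {
    return {
        {"king",   {0.9, 0.8, 0.1}},
        {"queen",  {0.9, 0.1, 0.8}},
        {"man",    {0.1, 0.9, 0.1}},
        {"woman",  {0.1, 0.1, 0.9}},
        {"throne", {0.8, 0.4, 0.4}},
        {"apple",  {0.0, 0.3, 0.3}},
    };
}
```

Calling `analogy(toyVectors(), {"king", "woman"}, {"man"})` ranks "queen" first with these toy values. A trained model does the same kind of lookup over hundreds of dimensions and a vocabulary of many thousands of words, which is why the results depend so heavily on the training data.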
An unofficial FastText fork for Windows is available on GitHub. I'm using it as a starting point to create a DLL in Visual Studio that exposes the C++ classes through a "flattened" C-style API, as described in the article "Using C++ objects in Delphi" by Rudy Velthuis. Such a DLL can also be used from .NET via Platform Invoke.
As a result, once it's done I'll be able to use this library from Python, .NET, Delphi/Free Pascal, and JavaScript (by embedding ChakraCore).
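The "flattening" idea can be sketched roughly as follows. This is not the actual DLL code: the `Model` class is a hypothetical stand-in for a wrapped C++ class, and the exported names (`Model_Create`, `Model_GetDimension`, `Model_GetName`, `Model_Destroy`) and the `FT_API` macro are invented for illustration. The pattern is that each C++ object is handed out as an opaque handle, and each method becomes a plain `extern "C"` function taking that handle as its first argument, so that any language able to call C functions in a DLL can use it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a C++ class inside the library.
class Model {
public:
    Model(std::string name, int dimension)
        : name_(std::move(name)), dimension_(dimension) {
        assert(dimension_ > 0);  // class invariant; callers validate first
    }
    int dimension() const { return dimension_; }
    const std::string& name() const { return name_; }

private:
    std::string name_;
    int dimension_;
};

// Opaque handle seen by callers; they never look inside it.
typedef void* ModelHandle;

// Export with C linkage so names are not mangled. The dllexport
// attribute is only needed (and only valid) on Windows compilers.
#ifdef _WIN32
#define FT_API extern "C" __declspec(dllexport)
#else
#define FT_API extern "C"
#endif

// Constructor -> "create" function returning a handle (nullptr on failure).
// C++ exceptions must not cross the C boundary, so they are caught here.
FT_API ModelHandle Model_Create(const char* name, int dimension) {
    if (name == nullptr || dimension <= 0) return nullptr;
    try {
        return new Model(name, dimension);
    } catch (...) {
        return nullptr;
    }
}

// Method -> function taking the handle as its first parameter.
FT_API int Model_GetDimension(ModelHandle handle) {
    if (handle == nullptr) return -1;
    return static_cast<Model*>(handle)->dimension();
}

// The returned pointer stays valid only until Model_Destroy is called.
FT_API const char* Model_GetName(ModelHandle handle) {
    if (handle == nullptr) return nullptr;
    return static_cast<Model*>(handle)->name().c_str();
}

// Destructor -> "destroy" function; passing nullptr is a no-op.
FT_API void Model_Destroy(ModelHandle handle) {
    delete static_cast<Model*>(handle);
}
```

On the consuming side, a wrapper class in C#, Pascal or a JavaScript host would typically store the handle in a private field, forward its methods to these functions, and call the destroy function when the wrapper is disposed.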
Some examples of console output from my experiments:
King - Man + Woman = ? (JavaScript)
C:\Code\fasttextConsole\chakra\Win64\Debug>ftcc ft.js
Loading file "C:\Data\fasttext\wiki\enwik9.bin"...
done.
Computing vectors...
done.
positive words:
king woman
negative words:
man
"queen": 0.7796817421913147
"regnant": 0.7554017305374145
"consort": 0.7433754205703735
"daughter": 0.7231032848358154
"throne": 0.721994161605835
Berlin - Germany + Argentina = ? (C#)
C:\Code\fasttextConsole\cs\ftcs\bin\x64\Debug>ftcs
Loading file "C:\Data\fasttext\wiki\enwik9.bin"...done.
Computing vectors...done.
positive words:
berlin argentina
negative words:
germany
"aires": 0.8183396
"buenos": 0.8142648
"argentinan": 0.7616609
"argentinas": 0.7580159
"caracas": 0.740073
Berlin - Germany + Slovakia = ? (Free Pascal)
C:\Code\fasttextConsole\fpc\bin\x86_64-win64\Debug>ftc
Loading file 'C:\Data\fasttext\wiki\enwik9.bin'...done.
Computing vectors...done.
positive words:
berlin slovakia
negative words:
germany
'zagreb': 0.81 (Oops! ;-))
'bratislava': 0.79 (Yeah!)
'budapesti': 0.79
'slavonski': 0.79
'podgorica': 0.78
Playstation - Sony + Nintendo = ? (JavaScript)
C:\Code\fasttextConsole\chakra\Win64\Debug>ftcc ft.js
Loading file "C:\Data\fasttext\wiki\enwik9.bin"...
done.
Computing vectors...
done.
positive words:
playstation nintendo
negative words:
sony
"gamecube": 0.8729094862937927
"nintendogs": 0.8490696549415588
"playstationjapan": 0.840140163898468
"snes": 0.8312469720840454
"sega": 0.822517454624176
Here are some code examples:
1. C-style API (DLL exported functions)
2. Imports for .NET
3. Wrapper class for .NET
4. C# usage
5. Imports for Pascal
6. Pascal usage
7. Pascal class for ChakraCore host
8. JavaScript usage