About Kolokacje
Kolokacje is a combined web crawler and collocation finder, written by Aleksander Buczynski. The program has been corrected and modified by Tomasz Okninski and Piotr Milkowski.
The program was developed at the Institute of Informatics, Warsaw University (WU), for the M.Sc. thesis seminar entitled "Tools and methods of text processing", directed by Janusz S. Bien, Dr.Sc. (Chair of Formal Linguistics, WU) and Dr. Krzysztof Szafran (Institute of Informatics, WU).
Kolokacje is distributed for free under the GNU General Public License: www.gnu.org/copyleft/gpl.html
The program can be used to:
- build a corpus of texts from selected websites, with an option to filter out most of the HTML "noise" (duplicate pages, menus etc.);
- monitor changes on selected websites;
- find strong and/or frequent collocations;
- find keywords for a collection of documents;
- get sample contexts (concordances) for given words or collocations;
- compare 14 different statistical tests used for collocation detection.
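To give a feel for what a statistical collocation test computes, here is a minimal sketch of pointwise mutual information (PMI), a widely known association measure. This page does not list the 14 tests, so PMI is an assumption chosen only for illustration and is not confirmed to be one of them. The class name `PmiSketch` and its method are hypothetical and are not part of the Kolokacje API.

```java
// Hypothetical illustration, not part of Kolokacje.
final class PmiSketch {

    /**
     * Pointwise mutual information (base 2) of a word pair.
     *
     * @param pairCount   how many times the two words occur together
     * @param firstCount  how many times the first word occurs
     * @param secondCount how many times the second word occurs
     * @param total       total number of observed positions (e.g. bigrams)
     */
    static double pmi(long pairCount, long firstCount, long secondCount, long total) {
        if (pairCount <= 0 || firstCount <= 0 || secondCount <= 0 || total <= 0) {
            throw new IllegalArgumentException("all counts must be positive");
        }
        // Observed probability of seeing the pair together.
        double observed = (double) pairCount / total;
        // Probability expected if the two words were independent.
        double expected = ((double) firstCount / total) * ((double) secondCount / total);
        // log2(observed / expected): positive when the pair is more frequent than chance.
        return Math.log(observed / expected) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Made-up counts, for demonstration only.
        System.out.println(pmi(8, 16, 16, 64));
    }
}
```

A score above zero means the pair occurs more often than independence would predict. PMI is known to overrate rare pairs, which is one reason it is useful to compare several tests on the same data.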
You can access the program functionality in a number of ways:
- through a simple graphical interface, provided by kolokacje.standaloneNew.SAMain - this is the easiest way to get familiar with the basic functions;
- calling selected modules from the shell command line (see modules synopsis);
- calling selected methods from your own Java program (see API specification);
- using kolokacje.server.PrettyPrinter and kolokacje.server.QueryServer to build a web-based interface (see example built for emacs documentation);
- using kolokacje.server.PrettyPrinter to run queries from a console and then viewing the results in an HTML browser.
System requirements
| Functionality | Requirements |
|---|---|
| Browsing files generated by PrettyPrinter, accessing WWW interface | Any HTML browser (support for CSS and UTF-8 recommended) |
| Crawler, IndexBuilder, PrettyPrinter | JRE |
| QueryServer | JRE + Internet connection |
| Creating your own WWW interface for your archives | JRE + Internet connection + HTTP server + PHP |
| SAMain, SAManager | JRE + graphical environment (X Window System, MS Windows) |
| Modifying / creating new collocation tests | JDK |
JRE is the Java runtime environment (Java Virtual Machine) and JDK is the Java development kit (Java compiler); both should be consistent with the Sun J2SE 1.4.2 specification. You can download both from java.sun.com.
Kolokacje has been tested under Linux (PLD, Debian and Knoppix distributions; KDE, FVWM and IceWM window managers), Windows 98 and Windows XP. A minor glitch has been detected under FVWM; see the documentation for a workaround.
Download
Kolokacje 1.21, source + binaries (ZIP, 400 kB)
Kolokacje 1.21, source + binaries + documentation (ZIP, 2 MB)
Kolokacje 1.0df, source + binaries (1.0 fork, which also calculates DF and RIDF for single words; ZIP, 218 kB)
Collocatrix - bootable CD image (LiveCD), Knoppix with Kolokacje 1.10b, OmegaT (translation memory) etc. (ISO, 700 MB)
Installation and usage
1. Download the binaries;
2. Unzip the downloaded file;
3. Check that the directory containing the Java executables is included in your PATH variable; this makes running the commands below much easier.
Provided batch files:
start.bat - runs kolokacje.standaloneNew.SAMain, the new graphical interface.
Program modules - synopsis:
java kolokacje.standalone.SAMain [dir]
java kolokacje.crawler.Crawler dir [var1=value1 var2=value2...]
java kolokacje.index.IndexBuilder dir [var1=value1 var2=value2...]
java kolokacje.server.PrettyPrinter dir [var1=value1 var2=value2...]
java kolokacje.server.QueryServer dir [var1=value1 var2=value2...]
java kolokacje.standalone.SAManager dir [var1=value1 var2=value2...]
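The synopsis shows that most modules share one shape: a module class name, a directory, and optional `name=value` settings. As a rough sketch, such a command line could be assembled from another Java program as shown below. The class `ModuleCommandSketch`, the directory `myArchive` and the setting `exampleVar=exampleValue` are hypothetical placeholders, not real Kolokacje names. The code assumes a current JDK (it uses generics, which J2SE 1.4.2 does not support) and that the Kolokacje classes are on the classpath when the command is actually run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Kolokacje.
final class ModuleCommandSketch {

    /** Builds "java <moduleClass> <dir> name1=value1 name2=value2 ..." as an argument list. */
    static List<String> buildCommand(String moduleClass, String dir, Map<String, String> vars) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add(moduleClass);
        command.add(dir);
        // Each setting becomes a single "name=value" argument, in insertion order.
        for (Map.Entry<String, String> entry : vars.entrySet()) {
            command.add(entry.getKey() + "=" + entry.getValue());
        }
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("exampleVar", "exampleValue"); // placeholder, not a real Kolokacje variable

        List<String> command = buildCommand("kolokacje.crawler.Crawler", "myArchive", vars);
        System.out.println(String.join(" ", command));

        // To launch the module instead of printing the command (assumes the
        // Kolokacje classes are on the classpath of the child JVM):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list, rather than one concatenated string, avoids quoting problems when the directory name contains spaces.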
Documentation
Kolokacje 1.2 API (HTML)
User manual for Kolokacje 1.0 (PDF, 191 kB)
Manual addendum for Kolokacje 1.1 (PDF, 106 kB)
Pozyskiwanie z Internetu tekstów do badań lingwistycznych ("Acquiring texts from the Internet for linguistic research"; in Polish, PDF, 316 kB)
Narzędzia przetwarzania tekstów w języku Java ("Text processing tools in Java"; in Polish, PDF, 385 kB)
File changes.txt (changes between versions 1.0 and 1.21)
File changes-1.txt (older changes)
Contact
Bug reports, questions and comments can be sent to nmpt-l(at)mimuw.edu.pl.