Tools
Tools in MicroData
OverviewQuestions
What is a programing tool and why we use them?
How to read a tool manual and understand the outputs?
Objectives
Join a unique ID to a raw .csv with different input types
Choose the right cut offs and specify the good matches
Get to know how to run a tool on the server with good options
The problem
There are lots of occasions when we would like to automate some tasks by the help of programming tools.
https://en.wikipedia.org/wiki/Programming_tool
One task is when we want to add a unique ID to a raw input variable. An input data could be firm name, person name, governmental institution, foundation, association .etc. Firm and person names also could be foreign ones.
We are continuously developing tools to identify these inputs and give a unique ID to them.
Firm name search tool
Firm name search tool merges the eight-digit registration number (first 8 digits of tax number) to a Hungarian firm name. You can find, among others, the building and installing methods on github:
https://github.com/ceumicrodata/firm_name_search
Using the tool
Requirements:
Python2 and the tool must be available on the PATH
a proprietary database index file (available only to members of CEU MicroData)
where "firm name" is the field name for firm name in input.csv
and there is an index file in the current directory with name complex_firms.sqlite
.
The tool provides command line help, so for further details run
FirmFinder.find_complex()
expects a single unicode firm name parameter which it will resolve to a match object with at least these attributes:
org_score
text_score
found_name
tax_id
An example how to run firm name search tool in python2
The tool searches for tax numbers for items in the name
field. You must have to add the whitelist path manually: 'input/firm-name-index/complex_firms.sqlite'
The meaning of the outcome and how to choose the proper cut off scores
The tool outcomes and scoring system are based on Hungarian legal forms. If an input data has a valid legal form then more likely a Hungarian company that not.
https://www.ksh.hu/gfo_eng_menu
A good match is dependend on how well prepared the input data is. If the data pre-filtering is good, before we run the tool, a text_score with lower value could become a good hit. Good cleaning opportunity to drop person names and foreign firm names from the input data.
Can be said generally that most of the possible matches will be in org_score==2 and text_score 0 <= x < 1
category. If a data is well prepared 0.8
org_score could be a suggested cut-off score. Results above this are expected to be good. You must have to adjust the good match cut offs in every time in every category when you run the tool on a new input.
PIR name search tool
Pir name search tool is developed for identify Hungarian state organizations by name. Pir number is the registration number of budgetary institutions in the Financial Information System at Hungarian State Treasury.
TIP:
There is an online platform to find PIR numbers one by one.
http://www.allamkincstar.gov.hu/hu/ext/torzskonyv
The PIR search command line tool requires Python 3.6+.
Input: utf-8 encoded CSV file Output: utf-8 encoded CSV file, same fields as in input with additional fields for "official data"
https://github.com/ceumicrodata/pir_search/releases/tag/v0.8.0
This release is the first, that requires an external index file to work with. You can find this index.json file in the pir-index beads. The index file was separated, because it enables match quality to improve without new releases.
The match precision can be greatly improved by providing optional extra information besides organization name:
settlement
date
An extra tuning parameter is introduced with --idf-shift which tweaks the matcher's sensitivity to rare trigrams. Its default value might not be optimal, it changes match quality. Attached files are binary releases for all 3 major platforms: pir_search.cmd is for Windows, pir_search (without extension) is for unix-like systems (e.g. Linux and Mac)
An example how to run pir name search tool in python3
The pir_score output could be between 0 <= x < 1. Pir_score==1 AND pir_err==0 is the perfect match.
The bigger the pir_err score the match is more likely wrong.
Pir score bigger than 0.8 and pir_err<0.8 are potentially good matches
.
You must have to adjust the good match cut offs in every category, every time when you run the tool on a new input.
Complexweb
Complexweb is and internal searchable version of the raw Complex Registry Court database. VPN and password is required to log in.
TIP:
You can find downloadable official balance and income statements from e-beszámolo.hu:
https://e-beszamolo.im.gov.hu/oldal/kezdolap
You can easily find the firm you are searching for if you change the tax_id or the ceg_id in the html:
You can write Postgre SQL queries to request more complex searches:
https://www.postgresql.org/docs/13/index.html
Time machine tool
A tool for collapsing start and end dates and imputing missing dates.
https://github.com/ceumicrodata/time-machine
Make a new environment to run the tool
Required files:
You need these .py files to your code folder to run the tool:
timemachine.py
timemachine_mp.py
timemachine_tools.py
Required inputs:
An entity resolved csv file. Example: complex rovat csv files with person IDs.
Rovat 8 csv file, which contains the birth dates of firms.
Frame, which contains the death date of firms in the death_date column.
Usage: timemachine.py [-h] [-s START] [-e END] [-u] entity_resolved rovat_8 deaths order unique_id is_sorted fp out_path
Optional arguments:
-h: Shows a help message and exits.
-s START: Comma separated field preference list for start dates. e.g. hattol,valtk,bkelt. DEFAULT: hattol,valtk,bkelt,jogvk
-e END: Comma separated field preference list for end dates. e.g. hatig,valtv,tkelt. DEFAULT: hatig,valtv,tkelt,jogvv
-u: Unique flag. Should be used if only a single entry is valid at any given time.
Positional arguments:
entity_resolved: The path to the entity resolved input csv file.
rovat_8: The path to the rovat 8 csv file.
deaths: The path to the frame.
order: A column of the entity resolved csv file describing the order of records within a firm. It is usually the alrovat_id.
unique_id: A column of the entity resolved csv file which contains unique entity IDs. E.g.: person ID
fp: A comma separated list of column labels in the entity resolved csv file describing the path to a single firm. It is usually ceg_id.
out_path: Path where the output should be written.
An example how to run time machine tool on the server
In this example we would like to clean the raw NACE input dates by ceg_id:
You can see the -u
unique option means that we have one NACE main activity code at the same time. The unique column is the teaor
and the code is using the frame_with_dates.csv
which identify one frame_id-tax_id pair for each firm. The 25
means that we choose multiprocessing with maximum 25 cores.
Key Points
Tools are helping you to automate tasks like joining unique ID-s for an input variable.
Firm name tool good for firm matching and PIR tool is good for state organizations matching
Last updated