Tools

Tools in MicroData

Overview

Questions

  • What is a programming tool and why do we use one?

  • How do we read a tool manual and interpret its output?

Objectives

  • Join a unique ID to a raw .csv with different input types

  • Choose the right cut-offs and identify the good matches

  • Learn how to run a tool on the server with suitable options

The problem

There are many occasions when we would like to automate tasks with the help of programming tools.

https://en.wikipedia.org/wiki/Programming_tool

One such task is adding a unique ID to a raw input variable. The input could be a firm name, a person name, a governmental institution, a foundation, an association, etc. Firm and person names could also be foreign.

We are continuously developing tools to identify these inputs and give a unique ID to them.

Firm name search tool

The firm name search tool matches a Hungarian firm name to its eight-digit registration number (the first 8 digits of the tax number). Build and installation instructions, among other things, are available on GitHub:

https://github.com/ceumicrodata/firm_name_search

Using the tool

Requirements:

  • Python2 and the tool must be available on the PATH

  • a proprietary database index file (available only to members of CEU MicroData)

When the tool is invoked from the command line, "firm name" stands for the name of the field in input.csv that holds the firm names, and an index file named complex_firms.sqlite is expected in the current directory.

The tool provides command line help; consult it for further details.

FirmFinder.find_complex() expects a single unicode firm name parameter which it will resolve to a match object with at least these attributes:

  • org_score

  • text_score

  • found_name

  • tax_id
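As a rough illustration of how the attributes listed above might be consumed, the sketch below flattens a match object into a dictionary, for example to write it out as a CSV row. It does not import the real tool: `Match` is a hypothetical stand-in that only carries the four documented attribute names, `match_to_row` is a helper invented for this example, and the values are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for the match object. Only the four attribute
# names come from the text above; the values below are invented.
Match = namedtuple("Match", ["org_score", "text_score", "found_name", "tax_id"])

MATCH_FIELDS = ("org_score", "text_score", "found_name", "tax_id")


def match_to_row(input_name, match):
    """Flatten a match object into a dict suitable for csv.DictWriter."""
    row = {"input_name": input_name}
    for field in MATCH_FIELDS:
        # getattr works for any object exposing the documented attributes
        row[field] = getattr(match, field)
    return row


example = Match(org_score=2, text_score=0.93,
                found_name="EXAMPLE KFT", tax_id="12345678")
row = match_to_row("Example Kft.", example)
```

Keeping the original input name next to found_name makes it easier to inspect matches by eye when choosing cut-offs later.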

An example of how to run the firm name search tool in Python 2

The tool looks up tax numbers for the items in the name field. You have to add the whitelist path manually: 'input/firm-name-index/complex_firms.sqlite'

The meaning of the output and how to choose proper cut-off scores

The tool's output and scoring system are based on Hungarian legal forms. If an input name contains a valid legal form, it is more likely to be a Hungarian company than not.

https://www.ksh.hu/gfo_eng_menu

Whether a match is good depends on how well the input data is prepared. If the data is well pre-filtered before we run the tool, a match with a lower text_score can still be a good hit. A useful cleaning step is to drop person names and foreign firm names from the input data.

Generally, most of the possible matches fall in the category with org_score==2 and 0 <= text_score < 1. If the data is well prepared, a text_score of 0.8 could be a suggested cut-off; results above it are expected to be good. You have to adjust the good-match cut-offs in every category each time you run the tool on a new input.
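The cut-off logic can be sketched as a small filter. This is only an illustration under one reading of the paragraph above: it assumes the 0.8 cut-off applies to text_score within the org_score==2 category. The function name, its parameters and the sample records are hypothetical, and the thresholds are meant to be tuned for every new input.

```python
def is_good_match(org_score, text_score, text_cutoff=0.8, required_org_score=2):
    """Return True when a match passes the (adjustable) cut-offs.

    Assumption: the cut-off is applied to text_score inside the
    org_score == 2 category; adjust both values for each new input.
    """
    return org_score == required_org_score and text_score >= text_cutoff


# Invented sample records, not real tool output
candidates = [
    {"name": "A", "org_score": 2, "text_score": 0.95},
    {"name": "B", "org_score": 2, "text_score": 0.60},
    {"name": "C", "org_score": 1, "text_score": 0.99},
]
good = [c["name"] for c in candidates
        if is_good_match(c["org_score"], c["text_score"])]
```

Exposing the thresholds as parameters keeps the per-input tuning in one place instead of scattering magic numbers through the cleaning code.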

PIR name search tool

The PIR name search tool was developed to identify Hungarian state organizations by name. The PIR number is the registration number of budgetary institutions in the Financial Information System of the Hungarian State Treasury.

TIP:

There is an online platform to find PIR numbers one by one.

http://www.allamkincstar.gov.hu/hu/ext/torzskonyv

The PIR search command line tool requires Python 3.6+.

Input: utf-8 encoded CSV file

Output: utf-8 encoded CSV file, same fields as in input with additional fields for "official data"

https://github.com/ceumicrodata/pir_search/releases/tag/v0.8.0

This release is the first that requires an external index file. You can find this index.json file in the pir-index beads. The index file was separated so that match quality can improve without new releases.

The match precision can be greatly improved by providing optional extra information besides organization name:

  • settlement

  • date

An extra tuning parameter, --idf-shift, tweaks the matcher's sensitivity to rare trigrams. Its default value might not be optimal, and changing it affects match quality.

The attached files are binary releases for all 3 major platforms: pir_search.cmd is for Windows, pir_search (without extension) is for Unix-like systems (e.g. Linux and Mac).

An example of how to run the PIR name search tool in Python 3

The pir_score output is in the range 0 <= x <= 1. A match with pir_score==1 AND pir_err==0 is a perfect match.

The larger the pir_err score, the more likely the match is wrong.

Matches with pir_score above 0.8 and pir_err below 0.8 are potentially good.

You have to adjust the good-match cut-offs in every category each time you run the tool on a new input.
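The thresholds above can be applied to the tool's output CSV with a few lines of code. The sketch below is an illustration only: it assumes the output contains columns named pir_score and pir_err, as in the text, and it reads invented sample rows from memory instead of a real output file. `split_pir_matches` is a hypothetical helper, not part of pir_search.

```python
import csv
import io


def split_pir_matches(rows, min_score=0.8, max_err=0.8):
    """Split rows into potentially good matches and rows needing review."""
    good, review = [], []
    for row in rows:
        score = float(row["pir_score"])
        err = float(row["pir_err"])
        if score > min_score and err < max_err:
            good.append(row)
        else:
            review.append(row)
    return good, review


# Invented sample standing in for the tool's output file
sample = io.StringIO(
    "name,pir_score,pir_err\n"
    "org one,1.0,0.0\n"
    "org two,0.85,0.9\n"
    "org three,0.5,0.1\n"
)
good, review = split_pir_matches(csv.DictReader(sample))
```

Both thresholds are parameters because they have to be re-tuned for each new input; the review list is a natural place to start manual checks.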

Complexweb

Complexweb is an internal, searchable version of the raw Complex Registry Court database. A VPN connection and a password are required to log in.

TIP:

You can find downloadable official balance and income statements from e-beszámolo.hu:

https://e-beszamolo.im.gov.hu/oldal/kezdolap

You can easily find the firm you are searching for if you change the tax_id or the ceg_id in the html:

You can write PostgreSQL queries for more complex searches:

https://www.postgresql.org/docs/13/index.html

Time machine tool

A tool for collapsing start and end dates and imputing missing dates.

https://github.com/ceumicrodata/time-machine

Make a new environment to run the tool

Required files:

You need these .py files in your code folder to run the tool:

  • timemachine.py

  • timemachine_mp.py

  • timemachine_tools.py

Required inputs:

  • An entity resolved csv file. Example: complex rovat csv files with person IDs.

  • Rovat 8 csv file, which contains the birth dates of firms.

  • Frame, which contains the death date of firms in the death_date column.

Usage: timemachine.py [-h] [-s START] [-e END] [-u] entity_resolved rovat_8 deaths order unique_id is_sorted fp out_path

Optional arguments:

  • -h: Shows a help message and exits.

  • -s START: Comma separated field preference list for start dates. e.g. hattol,valtk,bkelt. DEFAULT: hattol,valtk,bkelt,jogvk

  • -e END: Comma separated field preference list for end dates. e.g. hatig,valtv,tkelt. DEFAULT: hatig,valtv,tkelt,jogvv

  • -u: Unique flag. Should be used if only a single entry is valid at any given time.
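One plausible reading of a "field preference list" is that the first non-empty field in the list supplies the date. The sketch below illustrates that idea only; it is an assumption about the semantics, not the tool's actual implementation, and `pick_date` and the sample record are hypothetical.

```python
DEFAULT_START = "hattol,valtk,bkelt,jogvk"  # default from the -s option


def pick_date(record, preference=DEFAULT_START):
    """Return the first non-empty value following the preference order.

    Assumption: an earlier field in the list wins over later ones,
    and empty strings count as missing.
    """
    for field in preference.split(","):
        value = record.get(field, "")
        if value:
            return value
    return None


# Invented record: hattol is missing, so valtk is used
record = {"hattol": "", "valtk": "2001-05-01", "bkelt": "2001-06-01"}
start_date = pick_date(record)
```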

Positional arguments:

  • entity_resolved: The path to the entity resolved input csv file.

  • rovat_8: The path to the rovat 8 csv file.

  • deaths: The path to the frame.

  • order: A column of the entity resolved csv file describing the order of records within a firm. It is usually the alrovat_id.

  • unique_id: A column of the entity resolved csv file which contains unique entity IDs. E.g.: person ID

  • fp: A comma separated list of column labels in the entity resolved csv file describing the path to a single firm. It is usually ceg_id.

  • out_path: Path where the output should be written.
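To keep the argument order straight, the usage line can be assembled programmatically. The sketch below builds an argument list that could be passed to subprocess.run; it only mirrors the usage string shown above. The helper name, the `python` interpreter name and the file names are hypothetical, and the value expected for is_sorted is not described here, so it is passed through unchanged as a placeholder.

```python
def build_timemachine_args(entity_resolved, rovat_8, deaths, order, unique_id,
                           is_sorted, fp, out_path,
                           start=None, end=None, unique=False,
                           script="timemachine.py"):
    """Assemble the command line in the order given by the usage string."""
    args = ["python", script]  # interpreter name is an assumption
    if start:
        args += ["-s", start]
    if end:
        args += ["-e", end]
    if unique:
        args.append("-u")
    # Positional arguments, in the documented order
    args += [entity_resolved, rovat_8, deaths, order, unique_id,
             is_sorted, fp, out_path]
    return args


# Hypothetical file names; IS_SORTED_VALUE is a placeholder, not a real value
args = build_timemachine_args(
    "entity_resolved.csv", "rovat_8.csv", "frame.csv",
    "alrovat_id", "person_id", "IS_SORTED_VALUE", "ceg_id", "out.csv",
    unique=True,
)
```

Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.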

An example of how to run the time machine tool on the server

In this example we would like to clean the raw NACE input dates by ceg_id:

Here the -u (unique) option means that a firm has only one NACE main activity code at any given time. The unique column is teaor, and the code uses frame_with_dates.csv, which identifies one frame_id-tax_id pair for each firm. The 25 means that we chose multiprocessing with a maximum of 25 cores.

Key Points

  • Tools help you automate tasks such as joining unique IDs to an input variable.

  • The firm name tool is good for matching firms, and the PIR tool is good for matching state organizations.
