Best practices

Overview

Questions

  • How to name files and variables?

  • What code style do we use?

  • How to ensure reproducible research?

Objectives

  • Use verbose, helpful file and variable names.

  • Read and write data in a way facilitating reproducibility.

These guidelines help us share data among ourselves. Always make the extra effort to make your work understandable to others: you are in the best position to explain what you did. If you follow these guidelines, you only have to do this once for every data product you create.

Naming files and folders

  1. Store data under /data/source-of-data/name-of-dataset/stage. For example, /data/mak/pir_entities/raw

  2. Use long, expressive filenames, e.g. balance-sheet-with-frame.dta rather than bs-merge.dta

  3. Use-dash-for-word-separation, not CamelCase, space or underscore.

  4. Do not indicate version in filename. File versions should be managed by a version control tool.

  5. Do not indicate author in filename. Authorship should be tracked by a version control tool.

  6. For folders and files containing multiples of the same entity, use plain English singular filenames, e.g. city.csv, county.csv, mayor.csv, exhibit/figure, output/table
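
The naming rules above can be checked mechanically. The following is a minimal sketch, not an official tool: the function name filename_problems, the lowercase-only requirement, and the list of version markers (v2, final, old, new) are assumptions made for illustration.

```python
import re

# Lowercase words separated by single dashes, followed by one extension.
# Requiring lowercase is an assumption; the rule only forbids CamelCase,
# spaces and underscores.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*\.[a-z0-9]+$")

# Hypothetical heuristic for version markers such as "v2" or "final".
VERSION_PATTERN = re.compile(r"(^|-)(v\d+|final|old|new)(-|\.|$)")


def filename_problems(filename: str) -> list:
    """Return a list of naming problems; an empty list means no problem found."""
    problems = []
    if not NAME_PATTERN.match(filename):
        problems.append("use lowercase words separated by dashes")
    if VERSION_PATTERN.search(filename):
        problems.append("do not indicate version in filename")
    return problems
```

Calling filename_problems("balance-sheet-with-frame.dta") returns an empty list, while "bs_merge.dta" is flagged for its underscore. The check cannot tell whether a name is expressive; that judgment stays with the author.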

Data format

  1. All text data files use UTF-8 encoding from the “consistent” stage upward. If the raw file is in any other encoding, your first task is to convert it to UTF-8.

  2. CSV files are generally the preferred format, with , as separator and " as quote character. Do NOT use ; or tab as separator. CSV is easily processed by all software (Python, Stata, Excel, ...); the only problems arise when separators and quote characters are chosen according to individual preferences.

  3. Missing values are missing. Do not include any special values (N/A, -99, 0) for missing values in fields; leave them blank. The user of the data should build in any logic to deal with missing values.

  4. Only share one format of a dataset. Do not include both .csv and .dta. In some cases, .dta might be preferred (mostly numerical variables, labeling). Then we do not need .csv with the same content.
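
As an illustration of the format rules above, here is a minimal Python sketch using only the standard library. The function names convert_to_utf8 and write_csv, and the sample column names in the usage note, are hypothetical; the source encoding of a raw file has to be known or found out beforehand.

```python
import csv


def convert_to_utf8(raw_bytes: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def write_csv(rows, fieldnames, handle) -> None:
    """Write dictionaries as CSV with comma separator and double-quote quoting.

    Missing values (None) are written as blank fields, not as special codes.
    """
    writer = csv.DictWriter(
        handle,
        fieldnames=fieldnames,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {name: "" if row.get(name) is None else row[name] for name in fieldnames}
        )
```

For example, writing the rows {"city": "Győr", "mayor": None} and {"city": "Pécs, Baranya", "mayor": "X"} leaves the mayor field of the first row blank and quotes only the city name that contains a comma. When writing to a file, open it with encoding="utf-8" and newline="".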

Input and output

  1. Every output you are sharing is a file saved by a script. Do not copy and paste figures and tables. If you want to share something, create a script that saves it in a file (.dta, .csv, .gph, .pdf, .png, etc.).

  2. Every script produces exactly one output, not more. If your code produces several outputs, break it up into smaller scripts.

  3. Use relative paths exclusively for accessing data: ../data/election/mayor.csv, not /home/koren/projects/political_connections/data/election/mayor.csv

  4. The output of a script should be named differently than any existing input or output file generated by other scripts. This rule ensures that the question “what script created this file?” can be answered.

  5. As a special case: never use the same file for input and output within a single script. If the script fails, nobody will know what the file actually contains.

  6. Leave no trace: use temporary files and variables. For Stata variables and files storing intermediate calculations that are not inputs to other scripts, use “tempvar” and “tempfile.” These are automatically deleted once your script finishes.

  7. Only put those data files in the output folder of a bead that will actually be used. Keep temporary files in the temp folder.
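
The input and output rules above might look like this in a Python script. This is a sketch under stated assumptions: the input path is the one from rule 3, while the output path, the function name count_mayor_by_city, and the column names city and mayor_count are hypothetical. Python's tempfile module plays the role that “tempfile” plays in Stata.

```python
import csv
import tempfile
from pathlib import Path

# Relative paths only (rule 3). The output path is a hypothetical example.
INPUT_PATH = Path("../data/election/mayor.csv")
OUTPUT_PATH = Path("../output/mayor-count.csv")


def count_mayor_by_city(input_path: Path, output_path: Path) -> None:
    """Read one input file and write exactly one output file (rule 2)."""
    # Rule 5: never use the same file for input and output.
    if input_path.resolve() == output_path.resolve():
        raise ValueError("input and output must be different files")

    # Rule 6: intermediate files live in a temporary folder that is
    # deleted automatically when the with-block ends.
    with tempfile.TemporaryDirectory() as temp_folder:
        city_list = Path(temp_folder) / "city.txt"
        with input_path.open(newline="", encoding="utf-8") as source, \
                city_list.open("w", encoding="utf-8") as scratch:
            for row in csv.DictReader(source):
                scratch.write(row["city"] + "\n")

        count = {}
        with city_list.open(encoding="utf-8") as scratch:
            for line in scratch:
                city = line.rstrip("\n")
                count[city] = count.get(city, 0) + 1

    # The single output of this script.
    with output_path.open("w", newline="", encoding="utf-8") as target:
        writer = csv.writer(target, lineterminator="\n")
        writer.writerow(["city", "mayor_count"])
        for city in sorted(count):
            writer.writerow([city, count[city]])
```

The script would end with the call count_mayor_by_city(INPUT_PATH, OUTPUT_PATH). Because the output name is unique to this script, the question “what script created this file?” has one answer.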

Code style guide

  1. Python code must follow PEP-8.

  2. Stata code must follow this guide.

Key Points

  • Name variables and datasets with nouns of what they contain.

  • Name scripts with verbs of what they do.

  • Every piece of data is written by a script.

  • Every piece of data is written by exactly one script.
