Bead
Changing your data and code together
Overview
Questions
How do you ensure that your data products are reproducible?
Objectives
Use bead, a command-line tool to create, update and share data products.
The problem
Researchers and analysts need to know the provenance of their data to trust its integrity and to quickly redo the analysis when requirements change. However, in a diverse and agile team, knowledge about data is often tacit, and is lost when a team member leaves. This leads to duplicated effort. Additionally, the multitude of software tools and work methods creates friction in the analytics process. Existing solutions are either too rigid to be widely adopted or involve too much face-to-face communication, reducing researcher and analyst productivity.
bead is a lightweight software tool with which a researcher can explicitly declare dependency on other data products, referring to them with a universally unique identifier. She can see how her work fits in the bigger picture and who to work with when redoing the analysis. Bead encapsulates data, the software script that created it and references to its dependencies in a single package, making knowledge about the data explicit. Bead is platform independent and agnostic to the specific software tools and workflows used to create data. It can be adapted to any workflow that uses the file system for both scripts and data. Sample use cases include social and business datasets of a few GB each, processed and analyzed with Perl, Python, R, Stata, MATLAB, Julia or other scripts.
Basic logic of bead
Given a discrete computation of the form
output = function(*inputs)
a BEAD captures all three named parts:
output - data files (results of the computation)
function - source code files that, when run, hopefully compute output from inputs
inputs - other beads' output, and thus stored as references to those beads
As a special case pure data can be thought of as constant computation having only output but neither inputs nor source code.
Bead concepts
Data packages can be in one of two states: under active development in a workspace, or packaged and stored as a bead. Beads are stored in a box, which is just a collection of completed beads.
Workspace
A workspace is a directory where the user works. It contains a prototype of a bead: a computation under active development. At some point, however, it is supposed to contain all the inputs, code and output of a computation.
The directory has a special structure and content, so it must be created with the bead tool:
Metadata is kept in a hidden .bead-meta directory, whose .txt file enumerates all the inputs and the bead's unique hash id.
There are 3 standard directories with special meanings:
input: where input data is read from. It is read-only and managed by the bead tool.
temp: temporary outputs. This area is ignored when saving the bead.
output: where the results of the computation are stored.
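The layout described above can be checked with a few lines of shell. This is only a sketch: looks_like_workspace is a hypothetical helper, not part of bead, and it assumes that .bead-meta and all three standard directories are always present. The demo directory is built by hand purely to exercise the check; a real workspace must be created with the bead tool.

```shell
# Hypothetical helper (not part of bead): succeeds if DIR has the
# layout described above.
looks_like_workspace() {
    dir=$1
    [ -d "$dir/.bead-meta" ] || return 1
    for sub in input temp output; do
        [ -d "$dir/$sub" ] || return 1
    done
}

# Stand-in directory, created by hand only so the check has something
# to look at. Real workspaces come from the bead tool.
demo=$(mktemp -d)
mkdir "$demo/.bead-meta" "$demo/input" "$demo/temp" "$demo/output"

if looks_like_workspace "$demo"; then
    echo "workspace layout found"
else
    echo "not a workspace"
fi
```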
Bead
A bead is a frozen, discrete computation, created from a workspace. It is currently a zip file.
A bead is intended to capture data with the code that produced it. The goal is transparency and semi-automatic reproducibility through extra tooling. Fully automatic reproducibility is assumed to be impractical or expensive; however, it can be approached through gradual process and quality improvements (learning through feedback).
The bead format is designed to
be resilient to change
be decentralized
keep enough information to be able to get both the details and the big picture (if all relevant beads are available)
The main technology involved is a combination of different probabilistic identifiers (UUID, secure hash, Merkle-tree).
Main properties of a bead:
kind, which is shared with other versions of a bead (book analogy: ISSN). It is a technical name whose existence allows the human name to change, evolve or diverge over time while still referring to conceptually the same computation.
content_id, which is unique for every bead (roughly a version; book analogy: ISBN). It is calculated, so a change in a bead makes it either invalid or a new version.
freeze time, used for ordering versions. This is fragile in theory, as it depends on correctly set clocks, but in practice it is expected to cause few problems.
freeze name
references to its inputs (kind, content_id)
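The idea behind a calculated content_id can be illustrated with an ordinary secure hash. The sketch below is an assumption-laden stand-in, not bead's actual algorithm: it uses sha256sum (from GNU coreutils) on a single hand-made file, whereas bead combines several identifiers over a whole package. It only shows why any change to the content produces a different identifier.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# Stand-in for the content of a bead.
echo World > name

# SHA-256 is used purely for illustration; bead's own hash may differ.
before=$(sha256sum name | cut -d' ' -f1)

# Any change to the content ...
echo "World, again" > name
after=$(sha256sum name | cut -d' ' -f1)

# ... yields a different identifier.
if [ "$before" != "$after" ]; then
    echo "content changed: identifier changed"
fi
```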
The main change from version 0.0.2 to 0.8.1 is that beads are now referenced by name.
For this reason, do not create a new bead with a name that is already in use.
Box
A box is where beads are saved to and loaded from. It also gives names to beads and provides minimal search functionality. Currently, boxes are implemented as flat directories on the file system.
Basic workflow
Bead help
The built-in help guides you through the usage of bead.
$ bead -h
Every positional argument is a subcommand with its own complete help.
$ bead new -h
Create a new bead
Initial setup: the latest bead-box has already been created on haflinger. This is where completed beads will be stored. Create an empty bead called name with bead new name.
Add some data to the output of this new bead which we can use later. This bead has no computation, only data.
Working with inputs in a new bead
Create a new data package with bead new.
Add data from an existing bead with bead input add; it is loaded under input/<input-name>/.
Create a program greet that produces a greeting, using input/name as an input, and run it. The script creates a text file in output/greeting, whose content we can then verify.
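The three steps above (write greet, run it, check the result) can be sketched as follows. To keep the sketch runnable without a bead box, it works in a scratch directory and creates the input by hand; in a real workspace input/ is read-only and filled by the bead tool. The file name input/name/name and its content World are assumptions made for illustration.

```shell
# Scratch directory, so that no real workspace is touched.
cd "$(mktemp -d)"

# Stand-in for an input that the bead tool would normally load
# (file name and content are assumptions).
mkdir -p input/name output
echo World > input/name/name

# The greet program: read the name from the input,
# write a greeting to the output.
cat > greet <<'EOF'
read name < input/name/name
echo "Hello $name!" > output/greeting
EOF

# Run the program, then verify what it produced.
sh greet
cat output/greeting
```

With the stand-in input above, the last command prints Hello World!.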
Load specific version of an input in bead
I want to roll back an analysis. I have cherry-picked my bead meta, but how do I load the exact bead versions into the input folder?
You need to bead input delete it first, and then bead input add the version you would like to use. bead input load and bead input unload do not touch the bead meta; they only load the data specified in the meta into the input folder, or delete it from the input folder (to free up disk space).
bead input has a time option. For example, suppose you have two versions with different timestamps:
address_balance_20180823T120149294589+0200.zip
address_balance_20200317T111341998516+0100.zip
If you want to use version 20180823 you can type:
bead input add address_balance -t 20180823
You have to use the bead name and the yyyymmdd part of the timestamp with the -t option.
(source: https://stackoverflow.com/c/ceu-microdata/questions/40)
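The value passed to -t above is simply the part of the file name between the bead name and the T. The following sketch pulls that part out of the two example file names with plain shell parameter expansion. It illustrates the naming convention only and makes no claim about how bead resolves -t internally; the empty files are stand-ins created in a scratch directory.

```shell
cd "$(mktemp -d)"

# Empty stand-ins named like the two versions mentioned above.
touch "address_balance_20180823T120149294589+0200.zip" \
      "address_balance_20200317T111341998516+0100.zip"

name=address_balance
for f in "$name"_*.zip; do
    stamp=${f#"$name"_}     # drop the bead name and the underscore
    day=${stamp%%T*}        # keep what precedes the T: yyyymmdd
    echo "$day  $f"
done
```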
Visually display the bead chain
Bead web is a new feature of version 0.8.1. You can check the details with bead web -h.
Auto-rewire is required for the new bead.
Color is optional.
Heads are optional: if used, only the latest version of each bead is plotted, plus any version that is referenced by another bead.
If you change the source bead to .. it plots the entire bead structure leading to the target bead. If you change the target bead to .. it plots the entire structure starting from the source bead. Note that there must be a space character both before and after each /.
Instead of png you can use svg filename.svg if you prefer that format.
Package the data and send it to an outside collaborator
Save our new bead to the box with bead save.
This stores output, computation and references to inputs. The box directory /somepath/BeadBox now contains one zip file per saved bead.
These are regular (and, in this case, small) zip files, which can be transferred by the usual means (e.g. emailed) to collaborators. The recipient can process them with the bead tool, keep the integrity of the provenance information, and add further dependencies as needed. Even without the tool, she can access the data by directly unzipping the file and inspecting its content.
The output of the computation is stored under data/*. An outside collaborator without access to bead can just ignore the computation and all other metadata.
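What this looks like for a collaborator without bead can be sketched with any zip tool. The example below uses the command-line interface of Python's zipfile module only so that it has no other dependency; unzip or a graphical archive manager would do equally well. The archive name hello_example.zip and the file data/greeting are hypothetical stand-ins built by hand, since a real archive would come out of a bead box and contain further code and metadata.

```shell
cd "$(mktemp -d)"

# Build a tiny stand-in archive with a data/ directory.
mkdir data
echo "Hello World!" > data/greeting
python3 -m zipfile -c hello_example.zip data

# What the recipient would do: list the archive, extract it,
# and read the data while ignoring everything else.
python3 -m zipfile -l hello_example.zip
python3 -m zipfile -e hello_example.zip unpacked
cat unpacked/data/greeting
```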
The following graph summarizes the internal structure of a workspace and the logical links to other beads.
Bead boxes
Key Points
Keep and share data together with the code that produced it.
When sharing your data, always do it in a bead.
Never refer to external data from a bead, use bead inputs.