Glossary
Accessibility (FAIR principles)
The ability of a data user to acquire data once they have found it. Does the data require login credentials? Is proprietary software required to access the data?
API
Application Programming Interface. For data, this is usually a way provided by the data publisher for programs or apps to read data directly over the web. The app sends the API a query asking for the specific data it needs, e.g. the time of the next bus leaving a particular stop. This allows the app to use the data without downloading the whole dataset, saving bandwidth and ensuring that the data used is the most up-to-date available.
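As a rough illustration of the kind of query an app might send, the sketch below builds a request URL for a bus-times API. The host example.org, the /api/next-departure path, and the parameter names stop and limit are all hypothetical and chosen only for this example; a real publisher's API will define its own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: example.org does not actually offer this API.
BASE_URL = "https://example.org/api/next-departure"


def build_query_url(stop_id: str, limit: int = 1) -> str:
    """Return the URL an app would request to ask for specific data only."""
    params = {"stop": stop_id, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("stop-42")
print(url)

# The receiving server can decode the same parameters from the query string.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Only the stop identifier and a result limit travel over the network, which is the bandwidth saving the definition describes.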
Anonymization
Processing data that includes personal information so that individuals can no longer be identified in the resulting data. Anonymization enables data to be published without breaching data protection principles. The principal techniques are aggregation and de-identification. Care must be taken to avoid data leakage that would result in individuals’ privacy being compromised.
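The two techniques named above can be sketched roughly as follows. The records are invented, and the choice to coarsen ages into ten-year bands is an assumption made for this example rather than a prescribed method. Steps like these do not by themselves guarantee that individuals cannot be re-identified.

```python
from collections import Counter

# Invented records containing personal information.
records = [
    {"name": "A. Rivera", "postcode": "87102", "age": 34},
    {"name": "B. Chen", "postcode": "87102", "age": 61},
    {"name": "C. Okafor", "postcode": "87104", "age": 47},
]


def de_identify(rows):
    """Drop the direct identifier (name) and coarsen age into a ten-year band."""
    result = []
    for row in rows:
        decade = (row["age"] // 10) * 10
        result.append(
            {"postcode": row["postcode"], "age_band": f"{decade}-{decade + 9}"}
        )
    return result


def aggregate(rows):
    """Publish only a count per postcode instead of one row per person."""
    return dict(Counter(row["postcode"] for row in rows))


print(de_identify(records))
print(aggregate(records))
```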
App / Application
A piece of software (short for ‘application’), especially one designed to run on the web or on mobile phones and similar platforms. Apps can make network connections to large databases and thus be a powerful way of consuming open data, which may be real-time, personalized, and (using a mobile phone’s GPS) location-specific information. Crowdsourcing apps can also be used to build or improve datasets.
Application Programming Interface
A way computer programs talk to one another. Can be understood in terms of how a programmer sends instructions between programs.
Bulk
Data is available in bulk if the entire dataset can be downloaded easily and efficiently to a user’s own system. Conversely, it is non-bulk if one is limited to getting small parts of the dataset, for example if access is restricted to a few elements of the data at a time, so that thousands or millions of requests are needed to get the entire dataset. The provision of bulk access is a requirement of open data.
CKAN
An open-source software platform for creating data portals, built and maintained by Open Knowledge. CKAN is used as the official data-publishing platform of around 20 national governments and powers many more local, community, scientific and other data portals. Notable features are configurable metadata, user-friendly web interface for publishers and data users, data preview, organization-based authorization levels, and APIs giving access to all features as well as data access.
CSV
‘Comma-separated values’, a standard format for spreadsheet data. Data is represented in a plain text file, with each data row on a new line and commas separating the values on each row. As a very simple open format it is easy to consume and is widely used for publishing open data.
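A minimal sketch of how a program might consume CSV, using Python's standard csv module. The column names and school names are made up for illustration.

```python
import csv
import io

# A small CSV document held in memory: one header row, then one data row per line.
text = "name,pupils\nNorth School,412\nRiver School,388\n"

# DictReader uses the first row as column names for every later row.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    # Every value arrives as text, so numbers must be converted explicitly.
    print(row["name"], int(row["pupils"]))
```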
Conversion
The process of automatically reading data in one file format and emitting the same data in a different format, thus making the data accessible to a wider range of applications.
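As one possible illustration, the sketch below converts CSV text to JSON with Python's standard library. The function name csv_to_json is hypothetical; note that the values stay as strings because CSV carries no type information.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Read CSV text and emit the same rows as a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


converted = csv_to_json("name,pupils\nNorth School,412\n")
print(converted)
```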
Creative Commons
A non-profit organization founded in 2001 that promotes re-usable content by publishing a number of standard licenses, some of them open (though others include a non-commercial clause), that can be used to release content for re-use, together with clear explanations of their meaning.
DOI
Digital Object Identifier, an identifier for a digital object (such as a document or dataset) that is assigned by a central registry and is therefore guaranteed to be a globally unique identifier: no two digital objects in the world will have the same DOI.
Data
Data may be thought of as unprocessed atomic statements of fact. It very often refers to systematic collections of numerical information in tables of numbers such as spreadsheets or databases. When data is structured and presented so as to be useful and relevant for a particular purpose, it becomes information available for human apprehension. See also knowledge.
Data cleaning
Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration), etc.
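A few of these steps might look like the sketch below. The raw record is invented, and the assumption that the incoming dates are in day/month/year order is specific to this example; real data would need its own rules.

```python
from datetime import datetime


def clean_number(value: str) -> float:
    """Strip display formatting such as thousands separators and currency signs."""
    return float(value.replace(",", "").replace("$", "").strip())


def clean_date(value: str) -> str:
    """Convert a day/month/year date (assumed input order) to ISO 8601."""
    return datetime.strptime(value.strip(), "%d/%m/%Y").date().isoformat()


# Invented raw record with inconsistent headings and formatted values.
raw = {"Budget ": " $1,250,000 ", "Start Date": "03/09/2021"}

# Use standard, lower-case labels and machine-friendly value representations.
cleaned = {
    "budget": clean_number(raw["Budget "]),
    "start_date": clean_date(raw["Start Date"]),
}
print(cleaned)
```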
Data integration
Almost any interesting use of data will combine data from different sources. To do this it is necessary to ensure that the different datasets are compatible: they must use the same names for the same objects, the same units or coordinates, etc. If the data quality is good this process of data integration may be straightforward but if not it is likely to be arduous. A key aim of linked data is to make data integration fully or nearly fully automatic. Non-open data is a barrier to data integration, as obtaining the data and establishing the necessary permission to use it is time-consuming and must be done afresh for each dataset.
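The sketch below illustrates the two compatibility problems mentioned above, names and units, on a pair of invented datasets. The place names, the figures, and the normalisation rule (drop anything after a comma, lower-case the rest) are assumptions for this example only.

```python
SQ_MILES_TO_SQ_KM = 2.58999

# Two invented datasets that label the same places differently
# and report area in different units from the target.
populations = {"Northtown": 1000, "Riverside": 2500}
areas_sq_miles = {"NORTHTOWN, XX": 10.0, "RIVERSIDE, XX": 4.0}


def normalise_name(label: str) -> str:
    """Reduce a place label to a shared form: no suffix, lower case."""
    return label.split(",")[0].strip().lower()


def integrate(pops, areas):
    """Join the two datasets on the normalised name, converting area to sq km."""
    areas_km = {
        normalise_name(name): value * SQ_MILES_TO_SQ_KM
        for name, value in areas.items()
    }
    merged = {}
    for name, population in pops.items():
        key = normalise_name(name)
        if key in areas_km:
            merged[key] = {
                "population": population,
                "area_sq_km": round(areas_km[key], 1),
            }
    return merged


merged = integrate(populations, areas_sq_miles)
print(merged)
```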
Data management
The policies, procedures, and technical choices used to handle data through its entire lifecycle from data collection to storage, preservation and use. A data management policy should take account of the needs of data quality, availability, data protection, data preservation, etc.
Data portal
A web platform for publishing data. The aim of a data portal is to provide a data catalogue, making data not only available but discoverable for data users, while offering a convenient publishing workflow for publishing organizations. Typical features are web interfaces for publishing and for searching and browsing the catalogue, machine interfaces (APIs) to enable automatic publishing from other systems, and data preview and visualization.
Data quality
A measure of the usefulness of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and its handling of e.g. missing data, and directly machine-readable (see data cleaning), conforms to standards of nomenclature in the field, and is published with sufficient metadata that users can easily understand, for example, who it is published by and the meaning of the variables in the dataset.
Database
(i) Any organized collection of data may be considered a database. In this sense the word is synonymous with dataset. (ii) A software system for processing and managing data, including features to extend or update, transform and query the data. Examples are the open source PostgreSQL, and the proprietary Microsoft Access.
Dataset
Any organized collection of data. ‘Dataset’ is a flexible term and may refer to an entire database, a spreadsheet or other data file, or a related collection of data resources.
DCAT
Data Catalog Vocabulary, a Resource Description Framework (RDF) vocabulary designed to increase the interoperability of data catalogs using standardized models and vocabularies, allowing metadata to be ingested from multiple sources. For more information, visit https://www.w3.org/TR/vocab-dcat-3/
Discoverable
It is not enough for open data to be published if potential users cannot find it, or even do not know that it exists. Rather than simply publishing data haphazardly on websites, governments and other large data publishers can help make their datasets discoverable by indexing them in catalogues or data portals.
FAIR Principles
Guidelines for scientific data management and stewardship that are meant to make data more Findable, Accessible, Interoperable, and Reusable, with an emphasis on the data being easily ingested by machines/programs. For more information, visit https://www.go-fair.org/fair-principles/
File format
The description of how a file is represented on a computer disk. The format usually corresponds to the last part of the file name (‘extension’), e.g. a file in CSV format might be called schools-list.csv. The file format refers to the internal format of the file, not how it is displayed to users. E.g. CSV and XLS files are structured very differently on disk, but may look similar or identical when opened in a spreadsheet program such as Excel.
Findability (FAIR principles)
The ability of a data user to find the data they are interested in.
GeoJSON
A dialect of JSON with specialized features for describing geodata, and hence a popular interchange format for geodata.
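A minimal sketch of a GeoJSON document built with Python's json module. The helper name point_feature, the property values, and the coordinates are illustrative; note that GeoJSON positions are written longitude first, then latitude.

```python
import json


def point_feature(lon: float, lat: float, properties: dict) -> dict:
    """Build a GeoJSON Feature holding a single Point geometry."""
    return {
        "type": "Feature",
        # GeoJSON coordinate order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


# Approximate, illustrative coordinates; the stop itself is made up.
feature = point_feature(-106.65, 35.08, {"name": "Example stop"})
collection = {"type": "FeatureCollection", "features": [feature]}

print(json.dumps(collection, indent=2))
```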
Geospatial data
Data associated with features and phenomena at one or more locations or areas on or near the Earth's surface. E.g. data with x,y,z coordinates.
GIS
Geographical Information System, any computer system designed to read, display, analyze and manipulate geodata.
GPS
The Global Positioning System, a satellite-based system which provides exact location information to any equipment with a suitable receiver (including modern smartphones). GPS is invaluable to many location-based apps, providing users with e.g. route-finding information or weather forecasts based on their current location. GPS is also a striking example of successful open data, as it is maintained by the US government and provided free of charge to anyone with a GPS receiver.
Groups (CKAN)
Categories describing CKAN datasets that are meant to help refine the results of one’s search. Adding datasets to one or more groups can help make the data more findable.
HTML
Hypertext Markup Language. The primary markup language for web pages, defining the structure of their contents.
Human Readable
Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.
Identifier
The name of an object or concept in a database. An identifier may be the object’s actual name (e.g. ‘Albuquerque’ or ‘87102’), or a word describing the concept (‘population’), or an arbitrary identifier such as ‘XY123’ that makes sense only in the context of the particular dataset. Careful choice of identifiers using relevant standards can facilitate data integration.
Information
A structured collection of data presented in a form that people can understand and process. Information is converted into knowledge when it is contextualized with the rest of a person’s knowledge and world model.
Interoperability (FAIR principles)
The ability of data to be merged or combined with other data, or be incorporated into the same workflows as other data.
JSON
JavaScript Object Notation, a simple but powerful format for data. It can describe complex data structures, is highly machine-readable as well as reasonably human-readable, and is independent of platform and programming language, and is therefore a popular format for data interchange between programs and systems.
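The short sketch below shows JSON describing a nested structure and surviving a round trip through Python's json module. The field names stop, departures, route, and minutes are invented for this example.

```python
import json

# A nested structure: an object containing a list of objects.
text = (
    '{"stop": "stop-42", "departures": ['
    '{"route": "5", "minutes": 4}, {"route": "8", "minutes": 11}]}'
)

# Parsing yields ordinary dictionaries, lists, strings, and numbers.
data = json.loads(text)
next_departure = min(data["departures"], key=lambda d: d["minutes"])
print(next_departure["route"])

# Serialising and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(data))
```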
KML
Keyhole Markup Language, an XML-based open format for geodata. KML was devised for Keyhole Earth Viewer, later acquired by Google and renamed Google Earth, but has been an international standard of the Open Geospatial Consortium since 2008.
License
A legal instrument by which a copyright holder may grant rights over the protected work. Data and content are open if they are subject to an explicitly-applied license that conforms to the Open Definition. A range of standard open licenses are available, such as the Creative Commons CC-BY license, which requires only attribution.
Machine readable
Data in a data format that can be automatically read and processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable. Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable. As another example, scans (photographs) of text are not machine-readable (but are human readable!), but the equivalent text in a format such as a simple ASCII text file can be machine-readable and processable. Note: The appropriate machine readable format may vary by type of data - so, for example, machine readable formats for geographic data may differ from those for tabular data.
Metadata
Information about a dataset such as its title and description, method of collection, author or publisher, area and time period covered, license, date and frequency of release, etc. It is essential to publish data with adequate metadata to aid both discoverability and usability of the data.
Open Data
Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose - subject only, at most, to requirements to provide attribution and/or share-alike. Specifically, open data is defined by the Open Definition and requires that the data be: A. Legally open, that is, available under an open (data) license that permits anyone freely to access, reuse and redistribute it; B. Technically open, that is, available for no more than the cost of reproduction and in machine-readable and bulk form.
Open Source
Software for which the source code is available under an open license. Not only can the software be used for free, but users with the necessary technical skills can inspect the source code, modify it and run their own versions of the code, helping to fix bugs, develop new features, etc. Some large open source software projects have thousands of volunteer contributors. The Open Definition was heavily based on the earlier Open Source Definition, which sets out the conditions under which software can be considered open source.
Open format
A file format that has no restrictions, monetary or otherwise, placed upon its use and that can be fully processed with at least one free/open-source software tool. Patents are a common source of restrictions that make a format proprietary. Often, but not necessarily, the structure of an open format is set out in agreed standards, overseen and published by a non-commercial expert body. A file in an open format enjoys the guarantee that it can be correctly read by a range of different software programs or used to pass information between them.
Package (CKAN)
A “package” in CKAN is another name for a CKAN “dataset;” it is a remnant of earlier versions of CKAN, and is often encountered in CKAN documentation.
PDF
Portable Document Format, a file format for representing the layout and appearance of documents on a page independent of the layout software, computer operating system, etc. Originally a proprietary format of Adobe Systems, PDF has been an open format since 2008. Data in PDF files is not machine-readable; see structured data.
Privacy
The right of individuals to a private life includes a right not to have personal information about themselves made public. A right to privacy is recognized by the Universal Declaration of Human Rights and the European Convention on Human Rights. See data protection legislation.
Proprietary
(i) Proprietary software is owned by a company which restricts the ways in which it can be used. Users normally need to pay to use the software, cannot read or modify the source code, and cannot copy the software or re-sell it as part of their own product. Common examples include Microsoft Excel and Adobe Acrobat. Non-proprietary software is usually open source. (ii) A proprietary file format is one that a company owns and controls. Data in this format may need proprietary software to be read reliably. Unlike an open format, the description of the format may be confidential or unpublished, and can be changed by the company at any time. Proprietary software usually reads and saves data in its own proprietary format. For example, different versions of Microsoft Excel use the proprietary XLS and XLSX formats.
Publisher
Anyone who distributes and makes available data or other content. Data publishers include government departments and agencies, research establishments, NGOs, media organizations, commercial companies, individuals, etc.
Query
A type of question accepted by a database about the data it holds. A complex query may ask the database to select records according to some criteria, aggregate certain quantities across those records, etc. Many databases accept queries in the specialized language SQL or dialects of it. A web API allows an app to send queries to a database over the web. Compared with downloading and processing the data, this reduces both the computation load on the app and the bandwidth needed.
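A query that selects records by a criterion and aggregates across them might be sketched as below, using Python's built-in sqlite3 module and an in-memory database. The schools table and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, district TEXT, pupils INTEGER)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [
        ("North School", "A", 412),
        ("River School", "A", 388),
        ("Hill School", "B", 150),
    ],
)

# Select schools above a size threshold, then aggregate per district.
query = """
    SELECT district, COUNT(*), SUM(pupils)
    FROM schools
    WHERE pupils > ?
    GROUP BY district
    ORDER BY district
"""
results = conn.execute(query, (200,)).fetchall()
conn.close()

print(results)
```

The database does the filtering and summing, so the caller receives only the summary rows rather than the whole table.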
Re-use
It is rare that data gathered for a particular purpose does not have other possible uses. Happily, data is an infinite resource (see tragedy of the anti-commons); once gathered, for whatever reason, it can be re-used again and again, in ways that were never envisaged when it was collected, provided only that the data-holder makes it available under an open license to enable such re-use.
Real time
Data (such as the current location of trains on a network) which is being constantly updated, where a query needs to be against the latest version of the data.
Resource
CKAN uses this term to denote one of the individual data objects (a file such as a spreadsheet, or an API) in a dataset.
Resource Description Framework (RDF)
A framework for representing information on the web, in which statements about resources are expressed as subject-predicate-object triples. For more information, visit https://www.w3.org/TR/rdf11-concepts/
Reusability (FAIR principles)
The ability of data to be reused, especially by users who have no prior experience with the data. The better described the data is, the more easily it can be reused by others.
SQL
Structured Query Language, a standard language used for interrogating many types of database. See query.
Shapefile
A popular file format for geodata, maintained and published by Esri, a manufacturer of GIS software. A Shapefile actually consists of several related files. Though the format is technically proprietary, Esri publishes a full specification standard and Shapefiles can be read by a wide range of software, so they function somewhat like an open standard in practice.
Source code
The files of computer code written by programmers that are used to produce a piece of software. The source code is usually converted or ‘compiled’ into a form that the user’s computer can execute. The user therefore never sees the original source code, unless it is published as open source.
Spreadsheet
A table of data and calculations that can be processed interactively with a specialized spreadsheet program such as Microsoft Excel or OpenOffice Calc.
Standard
A published specification for, e.g., the structure of a particular file format, recommended nomenclature to use in a particular domain, a common set of metadata fields, etc. Conforming to relevant standards greatly increases the value of published data by improving machine readability and easing data integration.
Structured data
All data has some structure, but ‘structured data’ refers to data where the structural relation between elements is explicit in the way the data is stored on a computer disk. XML and JSON are common formats that allow many types of structure to be represented. The internal representation of, for example, word-processing documents or PDF documents reflects the positioning of entities on the page, not their logical structure, which is correspondingly difficult or impossible to extract automatically.
Tab-separated values
Tab-separated values (TSV) are a very common form of text file format for sharing tabular data. The format is extremely simple and highly machine-readable.
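As a small illustration, Python's csv module can write and read TSV when the delimiter is set to a tab character. The rows are made up for this example.

```python
import csv
import io

rows = [["name", "pupils"], ["North School", "412"], ["River School", "388"]]

# Write the rows with a tab between values and a newline after each row.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Reading with the same delimiter recovers the original rows.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed)
```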
Tag (CKAN)
Tags are short descriptions of the contents of CKAN datasets and resources. They are meant to help users find data by providing terms that one might use to describe the information in a dataset or its resources. Tags are important because CKAN’s search function only looks at the titles, descriptions, and other metadata about the dataset, but not the contents of the dataset, such as the column headers of spreadsheets.
URI / URL
Uniform Resource Identifier / Uniform Resource Locator. A URL is the http://… web address of some page or resource. When a URL is used in linked data as the identifier for some object, it is not strictly a locator for the object (e.g. http://dbpedia.org/page/Paris is the location of a document about Paris, but not of Paris itself), so in this context it is referred to as a URI.
Unique identifier (or UID)
An identifier for an object which is guaranteed to be different from identifiers of all other objects in a collection. Within a database, every object will have a UID that is unique within the database. A UID assigned by a central registry (such as an ISBN for books, or a DOI for data) will be unique for all objects for which it is assigned. The http://… identifiers of linked data provide a technique for guaranteeing UIDs without a central authority.
XLS(X)
A proprietary spreadsheet format, the native format of the popular Microsoft Excel spreadsheet package. Older versions use .xls files, while more recent ones use the XML-based .xlsx variant.
XML
Extensible Markup Language, a simple and powerful standard for representing structured data.
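A brief sketch of reading structured data from XML with Python's standard xml.etree.ElementTree module. The element and attribute names (schools, school, district, name, pupils) are invented for this example.

```python
import xml.etree.ElementTree as ET

text = """
<schools>
  <school district="A"><name>North School</name><pupils>412</pupils></school>
  <school district="B"><name>Hill School</name><pupils>150</pupils></school>
</schools>
"""

root = ET.fromstring(text)

# The nesting of elements makes the structure of each record explicit.
schools = [
    {
        "name": school.findtext("name"),
        "district": school.get("district"),
        "pupils": int(school.findtext("pupils")),
    }
    for school in root.findall("school")
]
print(schools)
```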