21 September 2023
1 - Introduction
Data Fair and its ecosystem make it possible to implement a platform for sharing data (internal or opendata) and visualizations. This platform can be intended for the general public, who can access data through adapted interactive visualizations, as well as for a more expert public who can access data through APIs.
The word FAIR refers to data that meets the principles of “Findability, Accessibility, Interoperability and Reusability”. This is made possible by indexing the data on the platform. Indexing makes it possible to run complex searches over several million records and to reach the records of interest more easily and quickly. Access to data through standardized and documented APIs makes it possible to interface the platform with other information systems and facilitates data reusability.
Data users access the platform through one or more data portals. They allow you to access the dataset catalog and explore it in different ways. It is possible to consult the datasets directly, whether with generic views (tables, simple maps, etc.) or more specific preconfigured visualizations. Data is disseminated through pages that present it in the form of data storytelling, making it easier for anyone to understand. Users can subscribe to notifications about updates and developers can access interactive documentation for various platform APIs. The portals can be embellished with content pages presenting the approaches, contributors or reuses put forward, for example.
Administrators and data contributors have access to a back-office which allows them to manage the various elements of the platform: user accounts, datasets and visualizations. Administrators can set up the environment and manage access permissions to data and visualizations. According to their profile, back-office users will be able to create, edit, enrich, delete datasets, maps and graphs. The back-office makes it possible to create data portals (internal or open data) and also to access various portal usage metrics.
Datasets are generally created by users by loading tabular or geographic files: the service stores the file, analyzes it and derives a data schema. The data is then indexed according to this schema and can be queried through its own Web API.
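To make the "analyze the file and derive a schema" step more concrete, here is a minimal sketch of type inference on a CSV file. It is not Data Fair's actual implementation: the function names, the `key`/`type` field layout and the sample values are assumptions chosen for illustration.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "string"
    # Try the strictest parser first, then fall back to looser ones.
    for type_name, parse in (("integer", int), ("number", float)):
        try:
            for value in non_empty:
                parse(value)
            return type_name
        except ValueError:
            continue
    return "string"


def derive_schema(csv_text):
    """Build a list of {key, type} field descriptions from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    return [
        {"key": name, "type": infer_type([row[name] or "" for row in rows])}
        for name in rows[0].keys()
    ]


# Made-up sample data, only to exercise the sketch.
sample = "name,count,ratio\nalpha,3,0.5\nbeta,12,1.25\n"
schema = derive_schema(sample)
```

A real analyzer would also need to handle dates, booleans and identifier-like columns (codes with leading zeros should stay strings), which this sketch deliberately ignores.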
In addition to file-based datasets, Data Fair also allows the creation of datasets that can be edited through a form, and of virtual datasets, which are configurable views of one or more datasets.
Users can give semantic meaning to dataset fields by assigning concepts to them, for example by declaring that a column containing 5-digit values is a field of the Postal Code type. This semantic annotation enables two things: the data can be enriched from reference data (or itself become reference data used to enrich other datasets), and the data can be used in visualizations adapted to its concepts.
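The text describes concepts as being assigned by the user. As an illustration of the postal code example, the hypothetical helper below shows how a tool could suggest that concept for a column of 5-digit values. The function name and the `threshold` parameter are assumptions, not part of Data Fair.

```python
import re

# Exactly five ASCII digits, as in the postal code example.
FIVE_DIGITS = re.compile(r"[0-9]{5}")


def looks_like_postal_code(values, threshold=1.0):
    """Return True when enough non-empty values are 5-digit strings."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    matches = sum(1 for v in non_empty if FIVE_DIGITS.fullmatch(v))
    return matches / len(non_empty) >= threshold
```

For instance, `looks_like_postal_code(["75001", "69002", ""])` is true, while a column mixing codes and free text is rejected unless `threshold` is lowered.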
Visualizations help unlock the full potential of user data. A few examples: a dataset containing commune codes can be projected onto a map of the French administrative division, a dataset containing parcel codes can be projected onto the cadastre, etc.
Main advantages of the platform
Data Fair makes it possible to set up an organization centered around data:
- Ability to load data in different file formats or to enter it through a form, which even allows crowdsourcing
- Consultation of data through a wide choice of interactive visualizations (graphs, maps, search engines, ...)
- Possibility to create several portals according to the use cases (open data, internal exchanges, ...)
- Easy creation of data APIs and enrichment of data to give it even more value
- Implementation of periodic processing to automatically feed the platform with data
- Secure framework, open source code and use of standards
2 - Data portal
Data users access the platform through one or more data portals dedicated to different use cases. There may be, for example, a portal for public data (opendata), a portal for internal data and a pre-production portal showing data that is being consolidated or not yet ready for publication.
A data portal is made up of different pages (home, contact, etc.), multi-criteria search engines, datasets, visualizations and content pages (news, themes, contributors, approach, ...). It is possible to present key figures and to add editorial context to highlighted content.
The data portal has a responsive design and its display is suitable for a wide variety of terminals, allowing use on fixed or mobile workstations. It is possible to configure different elements to personalize the portal as much as possible (title and description, choice of main and secondary colors, logo, favicon and welcome image, links in the footer, etc.).
An authentication screen is accessible on all the pages of the portal to be able to connect or create an account. The account allows access to private data according to its rights in the case of internal portals, and to subscribe to notifications on data sets.
2.1 - Home page
The home page is the main entry point to the data portal. It allows quick access to data based on themes and a text search field. Key figures make it possible to quantify the data present on the portal.
It is possible to highlight editorial content (text explaining the approach or presenting the portal) as well as a visualization of data which can be used directly on this page. We can, for example, highlight a map showing the location of vaccination centers, or canteen menus for the week. As part of a communication process, it is also possible to display a Twitter feed on this page.
For regular visitors, sections present the latest datasets added as well as the latest visualizations made, with a quick navigation to see this new content in more detail.
It is possible to customize which elements are displayed on this page, to display a banner instead of editorialized text or even to use a visualization instead of the banner! This last option allows you to present a carousel with navigation links.
2.2 - Data catalog
The data catalog is a search engine allowing quick access to datasets likely to interest the user. In addition to the textual search field, it is possible to access the datasets by theme or by concept present in the data. For example, it is possible to list all geographic datasets by filtering by Latitude concept, or all data related to companies by filtering by SIREN.
The list of datasets is browsed using an infinite scroll mechanism, equally well suited for desktop or mobile use. It is also possible to sort the results according to different criteria (Alphabetical, creation date, etc.). The list of results obtained can be exported in CSV format with one click.
The results on this page are presented in the form of thumbnails that display information such as the title of the dataset, its date of update or the themes associated with it. A bit of the description is also displayed, but it can be replaced by an image to make a more "visual" catalog.
In addition to navigating to a dataset's details page, tiles provide action buttons for:
- Viewing the data in a tabular view in which one can sort the columns, paginate, carry out full-text searches and download the filtered data
- Viewing the data in a map view, when the data allows it
- Accessing the interactive API documentation
- Consulting the data schema
The catalog presents the datasets that the user has the right to see. Users who are not logged in only see open data datasets; users who are logged in and are members of the organization that owns the portal also see private datasets.
2.3 - Dataset details
To be accessible to as many people as possible, the datasets are presented with visualizations adapted to the data. This helps tell a story and highlight different aspects of the dataset or possible use cases. Visitors to the portal can quickly absorb the data and interact with it.
Various metadata (producer, license, date of last update, ...) are presented to the user, along with the same action buttons as on the catalog thumbnails (table view, access to the schema, API documentation). Additional actions are also offered. Users can download the source data uploaded by the producer. Authenticated users can subscribe to notifications for this dataset in order to be informed when the data is updated.
Users have access to a full-screen table view, which shows more rows and columns and, most importantly, provides a URL that reflects the applied filters and can be shared with other users. They can thus share data filtered on a particular municipality, for example. It is possible to display only certain columns and to download the filtered dataset in several formats (CSV, ODS, XLSX, GeoJSON, ...).
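The idea of a URL that carries the table filters can be sketched as follows. The query parameter names (`<field>_eq`, `select`) and the base URL are hypothetical and only illustrate how filters and a column selection could be encoded in a shareable link.

```python
from urllib.parse import urlencode


def build_share_url(base_url, filters, columns=None):
    """Encode equality filters and an optional column list in a URL."""
    # Sort filters so that the same view always yields the same link.
    params = {f"{field}_eq": value for field, value in sorted(filters.items())}
    if columns:
        params["select"] = ",".join(columns)
    return f"{base_url}?{urlencode(params)}"


url = build_share_url(
    "https://example.org/table",  # placeholder address
    {"city": "Saint Malo"},
    columns=["city", "name"],
)
```

Anyone opening such a link would see the same filtered rows and the same visible columns, without the sender having to export the data first.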
It is also possible to access the attachments of the dataset. These are generally files loaded in addition to the data files, such as a data user manual in pdf format.
In addition to portal visualizations, which are "internal" reuses of the data, other reuses may be mentioned. These reuses can be presented through a clickable thumbnail, or embedded directly in the page as an iframe.
Different icons are used to share the dataset page on social networks (Facebook, Twitter, LinkedIn, ...). The data page has enriched metadata which allows better indexing by search engines and display of a thumbnail on social networks. This thumbnail presents the first visualization associated with the dataset, and this graphical side helps to increase the engagement of social network users.
2.4 - Visualizations catalog
Just like for datasets, a catalog of visualizations is available with a search engine and tools to filter or sort the visualizations. You can thus quickly access the visualizations corresponding to certain themes or linked to data with specific concepts.
Visualizations are presented through image thumbnails. As with datasets, these tiles have different action buttons so you can quickly interact with the visualization or navigate to a full-screen view of it. The list of results is browsed using an infinite scroll mechanism, equally well suited for desktop or mobile use.
The platform offers a wide variety of visualizations, and new ones are added regularly. There are interactive maps to present geographical data, graphs, animated visualizations, word clouds, ... It is even possible to create search engines, more suitable than the tabular view for presenting data with long texts, or mini-games that allow you to discover the data by playing with it! The set of visualizations available is presented in more detail in section 3.2.
2.5 - Data Visualization
Each visualization is presented through a dedicated page which allows it to be associated with context, title and description, as well as to offer different actions to the user. It is possible to access the visualization in full screen (the visualizations can be consulted both on desktop and smartphone) and to recover an HTML code which allows the visualization to be embedded in another website.
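The HTML code offered for embedding is typically an iframe pointing at the visualization. The sketch below generates such a snippet. The exact markup produced by the portal is not given in this document, so the attributes and the example URL are assumptions.

```python
from html import escape


def embed_code(visualization_url, width="100%", height=500):
    """Return an iframe snippet for embedding a visualization elsewhere."""
    # Escape the URL so that characters such as & stay valid inside HTML.
    src = escape(visualization_url, quote=True)
    return (
        f'<iframe src="{src}" width="{width}" height="{height}" '
        'style="border: none;"></iframe>'
    )


snippet = embed_code("https://example.org/app?x=1&y=2")
```

The resulting string can be pasted into any page that accepts raw HTML.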
Just like for the datasets, there are different buttons at the bottom of the page to share the visualization page on different social networks (Facebook, Twitter, LinkedIn, ...).
The visualization page has enriched metadata which allows better indexing by search engines and the display of a thumbnail on social networks.
2.6 - Content Pages
The content pages can be of different types: articles, thematic pages around several datasets, news pages, data storytelling, licenses, conditions of use, ... It is thus possible to highlight data and give them even more context, or create dashboards integrating different data.
In addition to entering free text, it is possible to integrate different types of elements: table of a dataset, visualization, list of datasets, integration of external content, ...
To access the content pages created, it is possible to enter links in the navigation bar. Links can appear directly in the bar, or in a menu added to it. It is possible to create public pages or private pages.
The news pages allow you to communicate information about events concerning your portal or open data. They make it possible to establish a first phase of communication with the visitors of your portal.
The news list is available on the news page in the form of cards, each containing an image and the beginning of the news item (or its summary). By clicking on a card, visitors can read the full news item.
As for the content pages, different elements can be integrated such as images, videos, forms, tables of data sets, pdf, links, etc ...
To facilitate access to information, a news feed is also available on the home page.
2.7 - User Account
If the data portal is public, there is no obligation to create an account to use it. Users can, if they wish, create an account to subscribe to notifications and create API keys to use the APIs with fewer restrictions. In the case where the portal is private, users will need an account, but they will also need authorizations given by an administrator of the organization owning the portal.
To limit GDPR-related issues, a minimum amount of data is collected and the only data required is the user's email. He can enter a first and last name if he wishes, or put a pseudonym instead. If the user does not log in to their account for 3 years, it is automatically deleted. Users can also delete their account using a button, without having to make a request by email or otherwise.
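The retention rule above amounts to a simple date comparison. The sketch below approximates "3 years" as 3 × 365 days, which is an assumption: the exact cut-off used by the platform is not specified here.

```python
from datetime import datetime, timedelta

# Approximation of the 3-year inactivity period (leap days ignored).
RETENTION = timedelta(days=3 * 365)


def is_expired(last_login, now):
    """Return True when the account has been inactive for 3 years or more."""
    return now - last_login >= RETENTION
```

A periodic cleanup job could call `is_expired` for each account and delete those for which it returns true.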
Account creation is done by entering an email and a password, and there is also the possibility of going through a Gmail, Facebook, LinkedIn or GitHub account via the OAuth 2.0 protocol. A password renewal mechanism is available for users who have lost their password or wish to change it.
Users who create accounts by themselves have their data stored in databases. Their passwords are never stored in clear text: they are salted and hashed over multiple rounds. Input rules prevent the creation of weak passwords.
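Salting and repeated hashing can be illustrated with PBKDF2 from the Python standard library. The algorithm, iteration count and strength rule below are assumptions made for the sketch: this document does not state which ones the platform uses.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor, not a value from this document


def hash_password(password, salt=None):
    """Return (salt, digest) for a password, creating a random salt if needed."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


def is_strong_enough(password, min_length=8):
    """Hypothetical input rule: minimum length, one letter and one digit."""
    return (
        len(password) >= min_length
        and any(c.isalpha() for c in password)
        and any(c.isdigit() for c in password)
    )
```

Only the salt and the digest need to be stored. The password itself is never kept, and two users with the same password end up with different digests because their salts differ.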
It is also possible to configure a connection to an external user directory through the LDAP protocol.
Accounts created on the portal can be used to create partnerships.
A portal administrator can give contribution rights on one or more datasets to a user account (or an organization account) that has been created on the portal.
Partners will be able to modify the datasets by replacing the entire file or by editing the lines of the dataset from their personal space.
Portals can be configured to accept reuse submissions.
Reuse submissions are made from the user's personal account and are subject to moderation, with the administrator choosing whether to publish the reuse or not.
2.8 - Portal notification
Notifications allow registered users to receive alerts in the portal or by email when certain events occur. Alerts are visible on the bell in the navigation bar and if the user has chosen it, he can receive an email alert.
Users can subscribe to the addition of a dataset on the portal or in a particular theme. It is also possible to subscribe to the update of a particular dataset.
2.9 - Reuses
One of the main interests of a data portal is to allow different data to circulate better and therefore to be reused as much as possible. To highlight this approach, it is possible to publish pages listing the different uses of the published data.
The interest of these pages is twofold. For the data producer, they showcase the work of making data available and encourage continued sharing. For the data user, they offer a way to be referenced and to highlight a project that uses the portal's data.
Data reuse pages can be created by portal administrators, but it is also possible to allow registered users to submit new pages. These submissions are then subject to moderation and it is the administrator who chooses to publish the reuse or not.
2.10 - API Access
All of the platform's features are available through documented REST APIs. These APIs can be called outside the portal, but for restricted access it is necessary to use an API key. When adding an API key, it is possible to restrict access to a single function. It is then possible to restrict access to a specific IP or domain name.
The API documentation follows the OpenAPI 3.0 specification. This provides clear, interactive documentation, so developers can get started with the APIs more quickly.
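A client calling a restricted endpoint attaches its API key to each request. The sketch below only builds the request object without sending it. The `x-apiKey` header name and the example URL are assumptions and should be checked against the interactive API documentation of the portal.

```python
from urllib.request import Request


def build_api_request(url, api_key=None):
    """Prepare a GET request, adding the API key header when one is given."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumed header name; public endpoints can be called without it.
        headers["x-apiKey"] = api_key
    return Request(url, headers=headers)


request = build_api_request(
    "https://example.org/hypothetical-api/datasets",  # placeholder address
    api_key="demo-key",
)
```

The request can then be passed to `urllib.request.urlopen` (or an equivalent HTTP client) to perform the call.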
Another benefit of using this specification is increased interoperability, with some IT systems (e.g. API gateways) being able to understand this specification. APIs made with Data Fair can, for example, be directly integrated by sites such as https://api.gouv.fr.
3 - Back Office
The back-office allows you to manage different elements on the platform: data, visualizations, portals, members of organizations, permissions, connectors to other data catalogs (input or output) and periodic processing. It also provides access to usage metrics.
Access to the back-office is of course restricted to authenticated users, in particular to users with administrator or contributor roles.
3.3 - Configure a data portal
Data Fair allows you to configure several data portals which are places for publishing datasets, visualizations and content pages that users will be able to consult. There may be portals for different use cases: open data, internal data sharing, sharing with partners, data being consolidated (pre-production portal), ...
Just like for data visualizations, a data portal is configured graphically in two stages: changes are made to a draft, which is then published as the current version. The current version is the one presented to visitors, so the portal can be updated without affecting users until the draft has been validated. No knowledge of HTML or CSS is required: a portal is administered like a CMS such as WordPress.
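The draft-then-publish workflow can be modeled as two copies of the configuration, where only publication updates what visitors see. This is a simplified sketch with hypothetical names, not the platform's data model.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class PortalConfig:
    """Holds a working draft and the version currently shown to visitors."""

    draft: dict = field(default_factory=dict)
    published: dict = field(default_factory=dict)

    def edit(self, **changes):
        """Apply changes to the draft only."""
        self.draft.update(changes)

    def publish(self):
        """Copy the validated draft to the current version."""
        self.published = deepcopy(self.draft)


portal = PortalConfig()
portal.edit(title="Open data portal", primary_color="#1976d2")
portal.publish()
portal.edit(title="Work in progress")  # visitors still see the old title
```

Copying the draft on publication keeps later edits from leaking into the version that visitors see.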
Many elements are configurable: the logo, the home image, the favicon, the color of the navigation bar, the color of the footer, content elements (title, description, visualization to put forward, public or private visibility) and various communication elements (website, contact email, and accounts on social networks).
It is possible to enter a Google Analytics or Matomo (formerly Piwik) account for activity monitoring and thus have statistics on the most visited pages and the most downloaded data.
Editing content pages
A page is created in 3 steps. First, a page template is chosen. Then the different elements are filled in using a form adapted to the chosen template, with a preview of the result. Finally, the page is published; because publication is a separate step, pages can be prepared in advance and published later. In addition to entering free text, it is possible to integrate different types of elements: table of a dataset, visualization, list of datasets, integration of external content, ...
To access the content pages created, it is possible to enter links in the navigation bar. Links can appear directly in the bar, or in a menu added to it. It is possible to create public pages or private pages.
3.4 - User Management
By default, there are four different roles: User, Contributor, Admin, and Super Admin.
Super administrators of the platform can manage all the organizations, members and all the content of the platform. They have the possibility to configure the visualizations available, publish the portals on particular domain names, configure the periodic processing, or define which are the master data sets. It is planned to transfer the management of the last two elements to the administrators of the organizations in the near future.
The other 3 roles are defined by organization: it is for example possible to be administrator in one organization and simple user in another.
Roles and associated permissions
Organization admins can manage members:
- Invite new members by email
- Change member roles
- Exclude a member
|Action|User|Contributor|Admin|
|---|---|---|---|
|Add a dataset| |x|x|
|Read a dataset|x|x|x|
|Edit a dataset| |x|x|
|Administration of a dataset| | |x|
|Add a visualization| |x|x|
|Read a visualization|x|x|x|
|Edit a visualization| |x|x|
|Administration of a visualization| | |x|
|Access and change settings| | |x|
|Create and modify the portal| | |x|
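The permission table above can be read as a lookup from an action to the roles allowed to perform it. The sketch below encodes it that way, assuming that rows with three marks apply to all roles, rows with two marks to Contributor and Admin, and rows with one mark to Admin only. The action identifiers are hypothetical.

```python
ROLES = ("user", "contributor", "admin")

EVERYONE = {"user", "contributor", "admin"}
EDITORS = {"contributor", "admin"}
ADMINS = {"admin"}

PERMISSIONS = {
    "add_dataset": EDITORS,
    "read_dataset": EVERYONE,
    "edit_dataset": EDITORS,
    "administer_dataset": ADMINS,
    "add_visualization": EDITORS,
    "read_visualization": EVERYONE,
    "edit_visualization": EDITORS,
    "administer_visualization": ADMINS,
    "change_settings": ADMINS,
    "manage_portal": ADMINS,
}


def can(role, action):
    """Return True when the given organization role may perform the action."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    # Unknown actions are denied by default.
    return role in PERMISSIONS.get(action, set())
```

For example, `can("contributor", "edit_dataset")` is true while `can("contributor", "manage_portal")` is false.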
In addition to their role, users can be assigned to a department of the organization. This provides a form of partitioning, with groups of users who each manage their own data. Users who are not restricted to a department can see (or edit if they have a contributor or administrator role) all resources in the organization.
A contributor of a department can only update the datasets of this department, and when he creates a dataset, it is attached to his department. Similarly, an administrator attached to a department can only publish datasets on a portal attached to his department. On the other hand, a global administrator of the organization can publish this same dataset on a portal more global to the organization.
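The department rule can be expressed as one extra check on top of the role. In this sketch, `None` stands for a user who is not restricted to any department. The function and parameter names are hypothetical.

```python
def can_update_dataset(role, user_department, dataset_department):
    """Apply the role check, then the department restriction."""
    if role not in ("contributor", "admin"):
        return False
    # Users without a department can act on the whole organization.
    if user_department is None:
        return True
    return user_department == dataset_department
```

A contributor attached to a department can therefore update that department's datasets only, while an unrestricted contributor can update any dataset of the organization.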
3.5 - Periodic processings
Periodic processing fetches data from given locations, transforms it and imports it into the platform. Each processing is implemented as an open source plugin, and the list of available plugins is constantly evolving.
For example, the download-file plugin, which is quite generic, covers several use cases: it fetches files from a given location and publishes them on the platform. It can access files via the ftp, sftp or http(s) protocols. It generally runs after data processing carried out by ETLs that deliver their results as remotely accessible files.
Each plugin has its own settings (access code, data set to update, ...) but all processing has the same scheduling options. Processes can be triggered every hour, day, week, month or be set to be triggered manually.
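The shared scheduling options can be illustrated by a function that computes the next run from the last one. The frequency labels and the month-end handling are assumptions made for this sketch.

```python
import calendar
from datetime import datetime, timedelta


def next_run(last_run, frequency):
    """Return the next run time, or None for manually triggered processing."""
    if frequency == "manual":
        return None
    if frequency == "hourly":
        return last_run + timedelta(hours=1)
    if frequency == "daily":
        return last_run + timedelta(days=1)
    if frequency == "weekly":
        return last_run + timedelta(weeks=1)
    if frequency == "monthly":
        year = last_run.year + last_run.month // 12
        month = last_run.month % 12 + 1
        # Clamp the day for shorter months (31 January -> 28 or 29 February).
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return last_run.replace(year=year, month=month, day=day)
    raise ValueError(f"unknown frequency: {frequency}")
```

Returning `None` for the manual case makes it explicit that such a processing only runs when someone, or a webhook call, triggers it.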
It is possible to generate an API key specific to the processing in order to create a webhook that triggers it: a process in an ETL can for example create a file on a shared space, then call the URL of the webhook so that the import processing is triggered.
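The ETL scenario (write a file, then call the webhook) can be sketched as below. The webhook URL, the POST method and the `x-apiKey` header are assumptions, since the actual call format is not given in this document. The example runs with a fake sender so that nothing goes over the network.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def build_trigger_request(webhook_url, api_key):
    """Prepare the request that asks the platform to run the processing."""
    # Assumption: the processing-specific key travels in an "x-apiKey" header.
    return Request(webhook_url, method="POST", headers={"x-apiKey": api_key})


def export_then_trigger(lines, target, webhook_url, api_key, send=urlopen):
    """Write the ETL result to the shared space, then call the webhook."""
    Path(target).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return send(build_trigger_request(webhook_url, api_key))


# Dry run: the fake sender records the request instead of sending it.
sent = []
with tempfile.TemporaryDirectory() as shared_space:
    export_path = Path(shared_space) / "export.csv"
    export_then_trigger(
        ["id,value", "1,42"],
        export_path,
        "https://example.org/hypothetical-webhook",
        "demo-key",
        send=sent.append,
    )
    written = export_path.read_text(encoding="utf-8")
```

Writing the file before calling the webhook matters: the import must only start once the export is complete.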
A contributor can access the success status of the different executions of a job, as well as the detailed logs of its executions. He can subscribe to processing notifications to be informed when a processing fails, or when alerts are present in the logs.
3.6 - Catalog connectors
Connectors allow you to interact with other platforms or data services, both in reading and in writing.
In writing, the idea is to be able to push metadata into other catalogs. An example of a catalog is the national open data catalog data.gouv.fr: datasets published using Data Fair can be synchronized automatically and any change in the metadata is propagated to the remote catalog.
Pushing metadata to a catalog rather than being harvested by it offers several advantages including propagating changes immediately. Also, if there are changes to the Data Fair API, the connector will continue to work while a harvester might become inoperative.
Connectors can potentially push data to these catalogs as well, but it is best to avoid this because of data duplication and synchronization issues. As mentioned before, data is indexed in a very efficient way with Data Fair and it is better to query the data directly from the APIs it offers.
As far as reading is concerned, the approach is however different and the connectors behave more like metadata and data harvesters. It is thus possible for each connector to configure the collection frequencies and the types of sources that one wishes to harvest.
Integrating the data into the platform makes it possible to index it and to be able to centralize access controls, an essential prerequisite if you want to be able to merge the data or consult different sources on a visualization.
It is possible to add new catalog connectors by following the instructions in this section.
Compared to periodic processing, the main difference is that catalog connectors can process several data sources, which are referenced via an API that lists them. The synchronization frequency is generally lower than for periodic processing.
3.7 - Usage metrics
There are two modules to track the use of the platform. The first is analytics and corresponds to the monitoring of user journeys on the data portal. This allows you to see which pages are consulted, where the users come from, the time they spend on the pages, ... The second corresponds to measurements of API consumption and allows you to see how the platform is used by other information systems or external sites.
It is possible to use Matomo Analytics (formerly Piwik) or Google Analytics as a tracking system. This is done simply by configuring the data portal by filling in a few fields in a form.
For Matomo Analytics, the configuration is done with the URL of the tracker and the ID of your site. The statistics are available in different forms: tables, graphs and maps. By selecting the different representations of statistics, it is possible to build customized dashboards. It is also possible to anonymize data and record user paths while complying with the recommendations of the CNIL.
For Google Analytics, the configuration is done using the ID number. The statistics are also available in different forms: tables, graphs and maps. It is also possible to build customized dashboards.
Because Data Fair and the various associated services make extensive use of cache mechanisms to improve access times to resources, precise usage statistics for the various access points of the platform can only be collected by a service associated with the platform's reverse-proxy.
Regarding GDPR compliance, the data collected is anonymized and aggregated on a daily basis. You can access statistics for each dataset: number of API calls and number of downloads. The metrics are aggregated by user groups (owner organization, external authenticated users, anonymous, ...) or by call domain. Key figures are presented for the period requested, with a comparison to the previous period, which makes it possible to see whether the use of certain data is increasing or decreasing.
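Daily aggregation and period comparison can be sketched as follows. The event layout (day, user group, kind of access) and the group labels are assumptions for the example, and the sample events are made up.

```python
from collections import Counter
from datetime import date


def aggregate_daily(events):
    """Count events per (day, user group, kind); no user identifier is kept."""
    return Counter((day, group, kind) for day, group, kind in events)


def relative_change(current, previous):
    """Return the change against the previous period, or None if it was empty."""
    if previous == 0:
        return None
    return (current - previous) / previous


# Made-up events, only to exercise the sketch.
events = [
    (date(2023, 9, 1), "anonymous", "api_call"),
    (date(2023, 9, 1), "anonymous", "api_call"),
    (date(2023, 9, 1), "owner", "download"),
]
totals = aggregate_daily(events)
```

With 150 calls in the requested period and 100 in the previous one, `relative_change(150, 100)` gives 0.5, an increase of 50 %.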