Adopting a data-driven culture allows organisations to increase customer satisfaction, streamline operations and enhance decision making. As such, data is increasingly regarded as the new gold. But as with gold, data becomes valuable only after it has been processed and refined. To make that crucial stage of your precious commodity's journey happen efficiently and reliably, a proper infrastructure is required: one that brings your data where your engineers need it and enables them to collaborate on it confidently. Such an infrastructure should satisfy three important requirements:
- Frictionless accessibility: your engineers should be able to easily and unambiguously access the latest version of the data in their workflow
- Reliable dynamism: you should be able to reliably update your data while keeping track of the changes and synchronizing them with interested parties
- Fine-grained confidentiality: your data is not only valuable, it may also contain sensitive information; it should stay under your control and be accessible only to those who have the right to access it
In some sense, the infrastructure we are looking for is an engineering-oriented data versioning, synchronisation and collaboration service. What do organisations usually use to fulfil that need? Popular options include:
- Direct transfer options such as e-mail, file-sharing services or even a flash drive are probably the easiest to set up, and they provide fine-grained access control to the data, but they are suboptimal when it comes to accessibility and reliable dynamism.
- Version control systems such as Git, which are the standard when it comes to collaborating on codebases, are designed to handle modestly sized textual files. Data projects, on the other hand, will often involve huge (gigabytes and more) binary-encoded files such as images or even videos. Also, fine-grained access control is not part of what such tools offer: everyone with access to the code repository will have full access to the data.
- Several DataOps platforms were designed to solve some of these modern data challenges. However, they often introduce additional complexity and new workflows, sometimes requiring a significant initial investment before they work reliably. Moreover, adopting such a platform often means trusting its provider with your precious data. Depending on the requirements of the project, this may not be an option.
[Image: Code Version Control Systems]
For big projects involving complex requirements, it can be worthwhile (or even necessary) to invest in adopting a DataOps platform. But what about small to mid-sized projects where you mostly care about the requirements defined above? In that case, a simple tool with low overhead is more to the point. This is where the open source tool Data Version Control (DVC) can step in.
DVC can be thought of as a Version Control System for Data Science: it would be to Data Science what Git is to software engineering. In the words of its inventors, DVC "brings agility, reproducibility, and collaboration into your existing data science workflow". DVC offers the following features:
- It is an easy-to-set-up, easy-to-use tool for synchronising (potentially big) data that complements and runs on top of the ubiquitous version control system Git, thus providing frictionless accessibility
- It reliably tracks data, pipelines and Machine Learning experiments, thus providing reliable dynamism
- It is storage agnostic, which means you can deploy it on Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, or your own infrastructure, keeping full ownership and control over your data, thus providing fine-grained confidentiality
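To make these features concrete, here is a sketch of a minimal DVC workflow. It assumes `git` and `dvc` are installed; the file path, remote name and S3 bucket below are placeholders, not part of the original article.

```shell
# Initialise DVC inside an existing Git repository (DVC runs on top of Git)
git init
dvc init

# Track a large data file with DVC instead of Git.
# This creates a small metadata file (data/images.zip.dvc) and gitignores the data itself.
dvc add data/images.zip
git add data/images.zip.dvc data/.gitignore
git commit -m "Track raw images with DVC"

# Configure a storage-agnostic remote (S3 here; could be Azure, GDrive, SSH, ...)
dvc remote add -d storage s3://my-bucket/dvc-store

# Upload the data to the remote you control
dvc push

# A collaborator retrieves the matching data version with:
#   git pull && dvc pull
```

Git only ever versions the small `.dvc` metadata files, while the heavy data lives in storage you own; this is how DVC combines Git-style collaboration with fine-grained control over where the data actually resides.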
As such, DVC provides an interesting answer to the challenges posed by modern data-intensive workflows. Instead of replacing existing tools and workflows, it builds upon them to enhance the processes around data, thus enabling frictionless accessibility, reliable dynamism and fine-grained confidentiality. Will this be enough to make it the de facto tool for tracking and collaborating on Data Science projects?
If you need help improving your DataOps, don't hesitate to contact us!
Data Scientist – Uncovering gems with AI
Ali Hosseiny is a Data Scientist at Artifact. Passionate about Artificial Intelligence and Data Science, he’s keen on understanding their ins and outs to wisely leverage their potential for accelerating value realization. He has a broad expertise in AI and Data Science tools and technologies.