Dremio Authors: Insights and Perspectives

Guides

What Is a Data Lakehouse?

As the name suggests, a data lakehouse architecture combines a data lake and a data warehouse. It is more than a mere integration of the two, however; the idea is to bring out the best of both architectures: the reliable transactions of a data warehouse and the scalability and low cost of a data […]

Guides

What Is Data Lineage?

Data Lineage Definition: Data lineage refers to the data’s “line of descent.” In other words, it’s a record of how data got to a specific location and the intermediate steps and transformations that took place as it traveled through business systems. For organizations that depend on data, understanding where data comes from, evaluating its quality, […]

Guides

Data Virtualization vs. Data Lakes

To make data available to data consumers like analysts for analytics and reporting, businesses need to aggregate data sources. Data virtualization and data lakes are popular approaches to breaking down data silos and providing centralized data access. Your approach can significantly impact scalability, cost, and performance, so it’s important to understand the differences. What Is […]

Guides

What Is a Data Pipeline?

A data pipeline moves data between systems. Data pipelines involve a series of data processing steps to move data from source to target. These steps may involve copying data, moving it from an on-premises system to the cloud, standardizing it, joining it with other data sources, and more. Why Is a Data Pipeline Important? Businesses […]

Guides

Introduction to Data Engineering

Businesses produce a lot of data. Everything from customer feedback to sales performance and stock price influences how a company operates. But understanding what stories the data tells isn’t always easy or intuitive, which is why many businesses rely on data engineering. What Is Data Engineering? Data engineering is the process of designing and building […]

Guides

Intro: Data Lake vs Warehouse by Dremio

If your organization depends on data, you need a place to store it. Not only that — you need the right kind of data storage and management solution for the data you use and produce. Most organizations find that a data warehouse or data lake meets their needs. Many even use both. What Is a […]

Guides

What Is Apache Parquet?

Apache Parquet is an open source file format that stores data in columnar format (as opposed to row format). As a columnar data storage format, it offers several advantages over row-based formats for analytical workloads. Your choice of data format can have significant implications for query performance and cost, so it’s important to understand the […]

Guides

Introduction to Data Lakes

What Is a Data Lake? A data lake is a centralized repository that allows you to store all of your data, whether a little or a lot, in one place. Like a real lake, data lakes store large amounts of unrefined data coming from various streams and tributaries in its natural state. Also, like a […]

Guides

Introduction to Data Warehouses

What Is a Data Warehouse? A data warehouse is a system used for storing and reporting on data. The data typically originates in multiple systems and is then moved into the data warehouse for long-term storage and analysis. Data warehouses can run on-premises or in the cloud. This storage is structured so users from many divisions […]

Guides

What Is ETL & Types of ETL Tools

If you’ve ever discussed data warehousing, you’ve probably heard the term “ETL.” It refers to processes that allow businesses to access data, modify it, and store it. Organizations use ETL for a variety of reasons, including the efficient management of data and the ability to run business intelligence (BI) against their data. There are several […]

Blog Post

Dremio’s $135M Series D

This week we announced a $135M Series D at a billion-dollar valuation, making Dremio one of the top-funded companies in our space. Chief Product Officer Tomer Shiran highlights our vision in this blog.

Blog Post

Collecting App Metrics in your cloud data lake with Kafka

In this article, we will demonstrate how Kafka can be used to collect metrics from a web application on data lake storage such as Amazon S3.

AWS

Introducing Parallel Projects

What are Parallel Projects? Parallel projects are multi-tenant instances of Dremio where you get a service-like cluster experience with end-to-end lifecycle automation across deployment, configuration with best practices, and upgrades, all running in your own AWS account. Every time that you launch a new project, it comes with all the best practices already set up […]

AWS

Introducing Elastic Engines

Introduction Dremio AWS Edition supports the ability to provision multiple separate execution engines from a single Dremio coordinator node and to start and stop them at runtime based on predefined workload requirements. This provides several benefits, including: In this article we walk you through the steps to provision and manage Elastic Engines, we […]

Python

Data Science on the Data Lake using Dremio, NLTK and Spacy

Introduction Enterprises often need to work with data stored in different places; because of the variety of data being produced and stored, it is almost impossible to use SQL to query all these data sources. These two things represent a great challenge for the data science and BI community. Prior to working on […]

AWS

Using R to perform data science operations on AWS

Intro Amazon Web Services (AWS) is a cloud services platform with extensive functionality. AWS provides different opportunities and solutions for databases, storage, data management and analytics, computing, security, AI, etc. Among the databases and storage services offered are Amazon Redshift and Amazon S3. Amazon Redshift is one of the leading data warehouses. It is […]

ADLS

Multi-Source Time Series Data Prediction with Python

Introduction Modern businesses generate, store, and use huge amounts of data. Often, the data is stored in different data sources. Moreover, many data users are comfortable interacting with data using SQL, while many data sources don’t support SQL. For example, you may have data inside a data lake or NoSQL database like MongoDB, or […]

ADLS

Forecasting air quality with Dremio, Python and Kafka

Intro Forecasting air quality is a worthwhile investment on many levels, not only for individuals but also for communities in general. Having an idea of what the air quality will be at a certain point in time allows people to plan ahead and, as a result, decreases the effects on health and the costs associated […]

Tableau

Lightning Fast Analytics with Tableau Online and Dremio

Guide to setting up Tableau Online Bridge with Dremio Overview Tableau Bridge is a way to connect your Tableau instance to your data. Connecting to online data sources using Tableau Online is easy: you can connect to both live and extracted data, depending on your environment. But what if your data sources are constantly changing? […]

Kubernetes

Easily Deploy Dremio on MicroK8s

Overview One of the many advantages of Dremio is its deployment flexibility. You can deploy Dremio on any of your favorite cloud flavors, and also on-premises, using different methods such as YARN, Docker, and Kubernetes. In this article I will walk through the steps of evaluating Dremio by deploying it through Kubernetes using MicroK8s […]

AWS

Analyzing Multiple Stream Data Sources using Dremio and Python

Introduction New technologies, communication systems, and information processing algorithms place high demands on data rates, availability, and performance. Accordingly, the data processing procedures applied to data (messages) call for technologies capable of handling this high demand. One of these technologies is RabbitMQ, which is used to develop service-oriented architecture (SOA) services and distributed resource-intensive operations. However, […]

Amazon

Cluster Analysis The Cloud Data Lake with Dremio and Python

Introduction Today’s modern world is filled with a myriad of different devices, gadgets, and systems equipped with GPS modules. The main function of these modules is to locate the positions of the moving objects and record them to a file called a GPS track. The services for accounting and processing such files, which are generally […]

Amazon

Machine Learning Models on S3 and Redshift with Python

Introduction An important requirement for businesses large and small is proper resource management. Classical solutions to such tasks include various optimization and control methods. In the last few years, however, approaches have appeared that use mathematical tools, statistics, and probability theory. They allow solving optimization problems by detecting dependencies […]

Python

How to Analyze Student Performance with Dremio and Python

Intro Data analysis and data visualization are essential components of data science. In fact, before the machine learning era, all data science was about interpreting and visualizing data with different tools and drawing conclusions about the nature of the data. Nowadays, these tasks are still present. They have simply become one of many miscellaneous data science […]

AWS

Anomaly detection on cloud data with Dremio and Python

Introduction Datasets very often contain records that do not match the rest of the data, whether by error or by nature. These kinds of records are useless and even harmful to ML models. In other problems, the sole purpose is to detect anomalies, for example in hospital health-monitoring systems or credit fraud detection. Either […]

AWS

Querying Cloud Data Lakes Using Dremio and Python Seaborn

Introduction In the last few years, more and more companies have realized the value of data. Therefore, the popularity of data analytics has been growing rapidly. In general, data analysis can be performed in several ways, which are classified into subtypes depending on the analysis task: descriptive, exploratory, inferential, predictive, causal, and mechanistic. Each of […]

Python

Data Lake Machine Learning Models with Python and Dremio

Introduction Amazon Simple Storage Service (S3) is an object storage service that offers high availability and reliability, easy scaling, security, and performance. Many companies all around the world use Amazon S3 to store and protect their data. PostgreSQL is an open-source object-relational database system. In addition to many useful features, PostgreSQL is highly extensible, and […]

Python

Gensim Topic Modeling with Python, Dremio and S3

Intro Topic modeling is one of the most widespread tasks in natural language processing (NLP). It is a vivid example of unsupervised learning. The main goal of this task is the following: a machine learning model should be trained on a corpus of texts with no predefined labels. In other words, we don’t […]

ARP

How to Create an ARP Connector

How to Create an ARP Connector The storage plugin configuration file tells Dremio what the name of the plugin should be, what connection options should be displayed in the source UI such as host address, user credentials, etc., what the name of the ARP file is, which JDBC driver to use and how to make […]

Python

Visualizing Amazon SQS and S3 using Python and Dremio

Introduction Nowadays, relevant analysis of different data is an important stage of business and technical research and development. Often the data is received in the form of serial info messages (queues). This is typical for data loggers and recorders, IoT developments, live-tracking systems, communication and navigation systems, etc. After that, the following information is sent […]
