Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Here are 8,304 public repositories matching this topic...

apache / spark

Apache Spark - A unified analytics engine for large-scale data processing

python java r scala sql big-data spark jdbc

Updated Jun 2, 2024
Scala

ytsaurus / ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

sql big-data spark clickhouse distributed-database lakehouse olap-database ytsaurus

Updated Jun 2, 2024
C++

xuwenyihust / DataPulse

DataPulse is a platform for developers to build, schedule and monitor data pipelines.

kubernetes workflow spark jupyter-notebook gcp orchestration data-engineering data-platform mlflow delta-lake

Updated Jun 2, 2024
JavaScript

tobymao / sqlglot

Python SQL Parser and Transpiler

Updated Jun 2, 2024
Python

apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

bigquery real-time sql database spark hive hadoop etl snowflake olap query-engine redshift dbt elt iceberg hudi delta-lake lakehouse

Updated Jun 2, 2024
Java

kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

kubernetes spark apache-spark kubernetes-operator kubernetes-controller kubernetes-crd google-cloud-dataproc

Updated Jun 2, 2024
Go

mage-ai / mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

python data-science data machine-learning sql spark pipeline etl pipelines orchestration artificial-intelligence data-engineering data-integration dbt elt transformation data-pipelines reverse-etl

Updated Jun 2, 2024
Python

kentarokamiyajp / crypto-prediction-devops

Devops for DWH which is for Crypto data analysis (hadoop, hive, spark, kafka, cassandra, trino, etc.)

docker kafka spark cassandra hive hadoop docker-compose dbt trino

Updated Jun 2, 2024
Dockerfile

projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics

git java data spark aws-lambda iceberg

Updated Jun 2, 2024
Java

awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

kubernetes spark terraform ml jupyterhub ray kubeflow aws-eks eks mlflow

Updated Jun 2, 2024
HCL

adenletchworth / Developers-Data-Analytics

Developer data streamed through data pipeline for analytics

python docker natural-language-processing airflow kafka spark

Updated Jun 2, 2024
Python

kentarokamiyajp / crypto-prediction-dwh

DataWareHouse for Crypto Data Analysis using hive, cassandra, trino, kafka, spark, etc.

python airflow sql crypto kafka spark cassandra hive realtime dwh dbt trino

Updated Jun 2, 2024
Python

huwngnosleep / complete_lakehouse_techstack

This project implements an end-to-end techstack for a data platform, can be used on production.

kafka spark hadoop etl bigdata data-warehouse data-platform lambda-architecture data-lakehouse

Updated Jun 2, 2024
Python

uni-openai / uniai-maas

An opensource AI & model as a service platform.

ai spark gpt moonshot midjourney chatgpt stability-ai chatglm uniai kimichat

Updated Jun 2, 2024
TypeScript

HsiehShuJeng / cdk-emrserverless-with-delta-lake

This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the Emr Serverless job, you could also launch an EMR notebook via cluster template to check the outcome from the EMR Serverless application.

python java golang aws spark serverless dotnet javacript aws-cloudformation emr-notebooks delta-lake aws-service-catalog cdk-constructs projen emr-studio emr-serverless

Updated Jun 2, 2024
TypeScript

apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator

rust spark arrow datafusion

Updated Jun 1, 2024
Rust

getredash / redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

visualization javascript mysql python bigquery bi spark dashboard athena analytics postgresql business-intelligence redash redshift databricks hacktoberfest spark-sql

Updated Jun 1, 2024
Python

nielsbasjes / splittablegzip

Splittable Gzip codec for Hadoop

spark hadoop gzip pig codec splittable gzipped-files mapreduce-java gzip-codec

Updated Jun 1, 2024
Java

exacaster / lighter

REST API for Apache Spark on K8S or YARN

spark apache-spark yarn jupyter k8s livy sparkmagic

Updated Jun 1, 2024
Java

iimeta / fastapi-admin

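智元 Fast API is a one-stop API management system that brings the APIs of various large models under a unified format, a unified specification, and unified management, aiming for the best possible functionality, performance, and user experience.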

api fast spark openai glm gpt fastapi gpt-4 chatgpt ernie-bot qwen

Updated Jun 1, 2024
Go

Created by Matei Zaharia

Released May 26, 2014

Followers: 417
Repository: apache/spark
Website: spark.apache.org