GCP Data Life Cycle
The data life cycle has four main steps:
1. Ingest - (pull in the raw data)
2. Store - (store the data in a format that is durable and can be easily accessed)
3. Process and analyze - (transform the data from its raw form into actionable information)
4. Explore and visualize - (convert the results of the analysis into a format that is easy to draw insights from)
Ingest
Store
Cloud Storage:
* backing up and archiving
* storage and delivery of content
* Cloud Storage can be accessed by Dataflow for transformation and loading into other systems such as Bigtable or BigQuery.
* For Hadoop and Spark jobs, data from Cloud Storage can be natively accessed by using Dataproc.
* BigQuery natively supports importing CSV, JSON, and Avro files from a specified Cloud Storage bucket.
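For example, here is a minimal sketch of a BigQuery load job that imports a CSV file from a Cloud Storage bucket using the google-cloud-bigquery Python client; the project, bucket, dataset, and table names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project
table_id = "my-project.raw_data.events"          # hypothetical destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2024-01-01.csv",      # hypothetical object
    table_id,
    job_config=job_config,
)
load_job.result()          # block until the load job finishes
print(client.get_table(table_id).num_rows, "rows now in the table")
```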
Cloud Storage for Firebase:
* good fit for storing and retrieving assets such as images, audio, video, and other user-generated content in mobile and web apps.
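A minimal server-side sketch of storing user-generated content with the Firebase Admin SDK for Python (mobile and web clients would normally use the Firebase client SDKs instead); the bucket name and file paths are placeholders:

```python
import firebase_admin
from firebase_admin import credentials, storage

# Uses Application Default Credentials; the bucket name is a placeholder.
cred = credentials.ApplicationDefault()
firebase_admin.initialize_app(cred, {"storageBucket": "my-app.appspot.com"})

bucket = storage.bucket()                          # the app's default Cloud Storage bucket
blob = bucket.blob("user-uploads/profile-42.png")  # hypothetical object path
blob.upload_from_filename("profile-42.png")        # hypothetical local file
```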
Cloud SQL:
* fully managed, cloud-native RDBMS that offers both MySQL and PostgreSQL engines with built-in support for replication.
* offers built-in backup and restoration, high availability, and read replicas.
* Cloud SQL supports RDBMS workloads up to 30 TB for both MySQL and PostgreSQL
* Data stored in Cloud SQL is encrypted both in transit and at rest
* For OLTP workloads, Cloud SQL is appropriate
* For OLAP workloads, consider BigQuery
* If your workload requires dynamic schemas, consider Datastore.
* You can use Dataflow or Dataproc to create ETL jobs that pull data from Cloud SQL and insert it into other storage systems.
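A minimal sketch of a typical OLTP write against a Cloud SQL (MySQL) instance using the Cloud SQL Python Connector with pymysql; the instance connection name, credentials, database, and table are placeholders:

```python
from google.cloud.sql.connector import Connector

connector = Connector()

# "my-project:us-central1:my-instance" is the instance connection name (hypothetical).
conn = connector.connect(
    "my-project:us-central1:my-instance",
    "pymysql",              # MySQL driver; requires the pymysql package
    user="app_user",
    password="app_password",
    db="orders_db",
)

with conn.cursor() as cursor:
    # A small transactional insert, the kind of OLTP operation Cloud SQL is suited to.
    cursor.execute(
        "INSERT INTO orders (customer_id, total) VALUES (%s, %s)",
        (42, 19.99),
    )
conn.commit()
conn.close()
connector.close()
```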
Bigtable: Managed wide-column NoSQL
* managed, high-performance NoSQL database service designed for terabyte- to petabyte-scale workloads
* provides consistent, low-latency, and high-throughput storage for large-scale NoSQL data
* Bigtable is built for real-time app serving workloads, as well as large-scale analytical workloads.
* Use cases:
1) Real-time app data
2) Stream processing (Pub/Sub => Dataflow (transform) => Bigtable)
3) IoT time series data (sensor/streamed data => Bigtable (time-series schema); see the write/read sketch below)
4) AdTech workloads (store and track ad impressions, which Dataproc and Dataflow can then process and analyze)
5) Data ingestion (Cloud Storage => Dataflow/Dataproc => Bigtable)
6) Analytical workloads (Bigtable => Dataflow (complex aggregation) => Dataproc (Hadoop or Spark processing and machine-learning tasks))
7) Apache HBase replacement
note: While Bigtable is considered an OLTP system, it doesn't support multi-row transactions, SQL queries or joins. For those use cases, consider either Cloud SQL or Datastore.
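A minimal sketch of writing and reading a time-series style row with the google-cloud-bigtable Python client; the project, instance, table, and column family are placeholders:

```python
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")   # hypothetical project
instance = client.instance("my-instance")        # hypothetical instance
table = instance.table("sensor-readings")        # hypothetical table with column family "cf1"

# Row keys prefixed by the sensor ID keep one device's readings adjacent,
# a common time-series schema pattern.
row_key = b"sensor-42#2024-01-01T00:00:00Z"
row = table.direct_row(row_key)
row.set_cell("cf1", b"temperature", b"21.5",
             timestamp=datetime.datetime.now(datetime.timezone.utc))
row.commit()

# Point read of the row just written.
read = table.read_row(row_key)
print(read.cells["cf1"][b"temperature"][0].value)
```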
Spanner: Horizontally scalable relational database
* fully managed relational database service for mission-critical OLTP apps
* Like relational databases, Spanner supports schemas, ACID transactions, and SQL queries
* Spanner also performs automatic sharding while serving data with single-digit millisecond latencies
* Use cases for Spanner:
1) Financial services (strong consistency across read/write operations without sacrificing HA)
2) Ad tech (low-latency querying without compromising scale or availability.)
3) Retail and Global Supply Chain(Spanner offers automatic, global, synchronous replication with low latency, which means that data is always consistent and highly available.)
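A minimal sketch of a SQL read against Spanner with the google-cloud-spanner Python client; the project, instance, database, and table are placeholders:

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")    # hypothetical project
instance = client.instance("my-instance")        # hypothetical instance
database = instance.database("ledger-db")        # hypothetical database

# A parameterized SQL read; Spanner serves it with strong consistency.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT account_id, balance FROM Accounts WHERE balance > @min_balance",
        params={"min_balance": 100},
        param_types={"min_balance": spanner.param_types.INT64},
    )
    for account_id, balance in rows:
        print(account_id, balance)
```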
Firestore: Flexible, scalable NoSQL database
* NoSQL database that stores JSON data
* JSON data can be synchronized in real time to connected clients across different platforms
* Can be used to build a real-time experience serving millions of users without compromising responsiveness.
* Use cases:
1) Chat and social (store and retrieve images, audio, video, and other user-generated content)
2) Mobile Games (Keep track of game progress and statistics across devices and device platforms)
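A minimal sketch of storing and reading game progress as a Firestore document with the google-cloud-firestore Python client; the project, collection, and fields are placeholders:

```python
from google.cloud import firestore

db = firestore.Client(project="my-project")      # hypothetical project

# Write one player's progress as a JSON-like document.
db.collection("players").document("player-42").set(
    {"level": 7, "score": 12850, "platform": "android"}
)

# Read it back.
snapshot = db.collection("players").document("player-42").get()
print(snapshot.to_dict())
```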
Storing data warehouse data
BigQuery: Managed data warehouse
* Data can be stored directly in BigQuery for analysis
* Supports loading data through the web interface, command-line tools, and REST API calls.
* When loading data in bulk, the data should be in the form of CSV, JSON, or Avro files
* For streaming data, you can use Pub/Sub and Dataflow in combination to process incoming streams and store the resulting data in BigQuery.
* In some workloads, however, it might be appropriate to stream data directly into BigQuery without additional processing.
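A minimal sketch of streaming rows directly into BigQuery (without additional processing) using insert_rows_json from the google-cloud-bigquery Python client; the project, dataset, table, and row fields are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project
table_id = "my-project.analytics.page_views"     # hypothetical existing table

rows = [
    {"user_id": "u-1", "page": "/home", "ts": "2024-01-01T00:00:00Z"},
    {"user_id": "u-2", "page": "/checkout", "ts": "2024-01-01T00:00:05Z"},
]

# insert_rows_json streams the rows; an empty error list means every row was accepted.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert errors:", errors)
```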
=============================================================
Process and analyze
* To derive business value and insights from data, you must transform and analyze it.
Processing large-scale data
* Large-scale data processing typically involves reading data from source systems such as Cloud Storage, Bigtable, or Cloud SQL, and then conducting complex normalizations or aggregations of that data.
* In many cases, the data is too large to fit on a single machine so frameworks are used to manage distributed compute clusters and to provide software tools that aid processing.
Dataproc: Managed Apache Hadoop and Apache Spark
* Spark has gained popularity over the past few years as an alternative to Hadoop MapReduce
* With Dataproc, you can migrate your existing Hadoop or Spark deployments to a fully-managed service that automates cluster creation, simplifies configuration and management of your cluster, has built-in monitoring and utilization reports, and can be shut down when not in use.
* reduces the operational and cost overhead of managing a Spark or Hadoop deployment
* Dataproc provides the ease and flexibility to spin up Spark or Hadoop clusters on demand when they are needed, and to terminate clusters when they are no longer needed.
* simplifies operational activities such as installing software or resizing a cluster
* With Dataproc, you can natively read data and write results in Cloud Storage, Bigtable, or BigQuery, or the accompanying HDFS storage provided by the cluster.
Use cases:
* Log processing
* Reporting (Aggregate data into reports and store the data in BigQuery)
* On-demand Spark clusters
* Machine learning
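A minimal PySpark sketch of the log-processing use case, suitable for submitting to a Dataproc cluster (for example with gcloud dataproc jobs submit pyspark); the gs:// paths and the assumed JSON log fields are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-processing").getOrCreate()

# Assumes each line is a JSON log record with at least a "status" field.
logs = spark.read.json("gs://my-bucket/logs/2024-01-01/*.json")   # hypothetical path

# Aggregate request counts per HTTP status code.
counts = (
    logs.groupBy("status")
        .agg(F.count("*").alias("requests"))
        .orderBy(F.desc("requests"))
)

counts.write.mode("overwrite").csv("gs://my-bucket/reports/status-counts/")  # hypothetical path
spark.stop()
```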
Dataflow: Serverless, fully managed batch and stream processing
* Able to analyze streaming data so you can respond in real time
* Designed to handle both batch and streaming analytics
* Maintaining separate systems for batch and streaming increases complexity by necessitating two different pipelines
* Dataflow simplifies big data for both streaming and batch workloads by unifying the programming model and the execution model
* Instead of having to specify a cluster size and manage capacity, Dataflow is a managed service where on-demand resources are created, autoscaled, and parallelized.
* As a true zero-ops service, workers are added or removed based on the demands of the job
* Use cases:
1) MapReduce replacement
2) User analytics (Analyze high-volume user-behavior data, such as in-game events, click stream data, and retail sales data.)
3) Data Science (Process large amounts of data to make scientific discoveries and predictions, such as genomics, weather, and financial data.)
4) ETL: Ingest, transform, and load data into a data warehouse, such as BigQuery.
5) Log processing: Process continuous event-log data to build real-time dashboards, app metrics, and alerts.
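A minimal Apache Beam sketch of the ETL use case: read CSV lines from Cloud Storage, transform them, and load them into BigQuery. It runs locally with the DirectRunner or on Dataflow with --runner=DataflowRunner; the bucket, table, and field names are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Assumed CSV layout: user_id,event,timestamp
    user_id, event, ts = line.split(",")
    return {"user_id": user_id, "event": event, "ts": ts}


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(parse_line)
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",          # hypothetical existing table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```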