GCP - Big Data
Google Cloud Big Data Platform
- helps transform businesses and user experiences with meaningful data insights.
 - an integrated, serverless platform.
- serverless, so there is no need to provision Compute Engine instances to run the jobs.
- The services are fully managed
 
 - pay only for the resources you consume.
 - The platform is integrated
- so GCP data services work together to help create custom solutions.
 
 
 - Apache Hadoop
- an open source framework for big data.
 - It is based on the  MapReduce programming model  which Google invented and published.
-   "Map function" 
- runs in parallel across a massive dataset to produce intermediate results.
 
 -   "Reduce function" 
- builds a final result set based on all those intermediate results.
 
 
 -   "Map function" 
 - The term “Hadoop” is often used informally to encompass Apache Hadoop itself and related projects such as Apache Spark, Apache Pig, and Apache Hive.
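
A minimal sketch of the map/reduce idea in plain Python (an illustration of the model, not actual Hadoop code): the map step produces intermediate results per document, and the reduce step merges them into a final result set.

```python
from collections import Counter
from functools import reduce

documents = [
    "big data on gcp",
    "gcp bigquery big data",
]

# Map: each document is processed independently (in parallel on a real
# cluster) to produce intermediate word counts.
intermediate = [Counter(doc.split()) for doc in documents]

# Reduce: merge all intermediate results into one final result set.
word_counts = reduce(lambda a, b: a + b, intermediate, Counter())
print(word_counts)  # Counter({'big': 2, 'data': 2, 'gcp': 2, ...})
```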
 
 
Data
Ingest
Cloud Pub/Sub
- “Pub/Sub” is short for publishers/subscribers.
- simple, reliable, scalable foundation for stream analytics.
- foundation for Dataflow streaming
 
 - Analyzing streaming data
 - use it for IoT applications
 - lets  decoupled systems  communicate and scale independently.
- offers on-demand scalability to one million messages per second and beyond.
 
 - supports many-to-many asynchronous messaging.
- Push notifications for cloud-based applications
 - lets independent applications send and receive messages.
 - Applications can publish messages to Pub/Sub
 - and one or more subscribers receive them.
 
 - builds on the same technology Google uses internally.
- connects applications across Google Cloud Platform
 - push/pull messaging between Compute Engine and App Engine applications
 - works well with applications built on GCP’s compute platforms.
 - when analyzing streaming data, Cloud Dataflow is a natural pairing with Pub/Sub.
 
 - Receiving messages doesn’t have to be synchronous.
- That’s what makes Pub/Sub great for decoupling systems.
 - It’s designed to provide “at least once” delivery at low latency.
- so there is a small chance some messages might be delivered more than once.
 
 - keep this in mind when you write your application.
 
 You just choose the quota you want.
- an important building block for data ingestion in Dataflow
- for applications where data arrives at high and unpredictable rates,
 - such as Internet of Things systems and marketing analytics
 
 - application components make push/pull subscriptions to topics
- configure subscribers to receive messages on a push or pull basis.
 - get notified when new messages arrive for them
 - or check for new messages at intervals.
 
 - includes support for offline consumers
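
A minimal sketch of many-to-many messaging with the Cloud Pub/Sub Python client; the project, topic, and subscription names here are hypothetical, and delivery is “at least once”, so the handler should tolerate duplicates.

```python
from google.cloud import pubsub_v1

project_id = "my-project"        # hypothetical project
topic_id = "sensor-readings"     # hypothetical topic
subscription_id = "sensor-sub"   # hypothetical subscription

# Publish: any application component can publish messages to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"temperature=21.5", device="sensor-42")
print("Published message id:", future.result())

# Pull: a subscriber checks the subscription for new messages at intervals
# and acknowledges each one it has processed.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
for received in response.received_messages:
    print(received.message.data)
if response.received_messages:
    subscriber.acknowledge(request={
        "subscription": subscription_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    })
```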
 
Store
Cloud BigQuery
- use it when the need is more about exploring a vast sea of data
- than about running a dynamic pipeline
 
 - fully-managed, petabyte-scale, low-cost  data analytics warehouse 
- no infrastructure to manage
 - no cluster maintenance is required
 - focus on analyzing data to find meaningful insights using familiar SQL
 
 - run  ad-hoc SQL queries on massive datasets 
- provide near real-time interactive analysis of massive datasets (hundreds of TBs) using SQL syntax (SQL 2011)
 
 - used by all types of organizations
- smaller organizations like BigQuery’s free monthly quotas,
 - bigger organizations like its seamless scale,
- and its 99.9 percent availability service level agreement.
 
 
 - get data into BigQuery.
- load it from Cloud Storage or Cloud Datastore,
 - or stream it into BigQuery at up to 100,000 rows per second.
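
A minimal sketch of streaming rows into an existing table with the BigQuery Python client; the table and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.analytics.events"  # hypothetical dataset and table
rows = [
    {"user_id": "u123", "event": "click", "ts": "2024-01-01T00:00:00Z"},
]

# Streaming insert: rows become available for queries almost immediately.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Row insert errors:", errors)
```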
 
 - process data
- SQL queries
- run super-fast SQL queries against multiple terabytes of data in seconds
 - using the processing power of Google’s infrastructure.
 
 - or easily read and write data in BigQuery via Cloud Dataflow, Hadoop, and Spark.
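
A minimal sketch of an ad-hoc SQL query with the BigQuery Python client, run against one of Google's public datasets.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# The query runs on Google's infrastructure; iterating waits for the result.
for row in client.query(query):
    print(row.name, row.total)
```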
 
 - Google’s infrastructure is global and so is BigQuery.
- can specify the region where the data will be kept.
 - example
 - to keep data in Europe
- don’t have to set up a cluster in Europe.
 - Just specify the EU location when you create your dataset.
 
 - US and Asia locations are also available.
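
A short sketch of pinning a dataset to the EU location at creation time; the dataset ID is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset = bigquery.Dataset("my-project.eu_dataset")  # hypothetical dataset ID
dataset.location = "EU"   # data is kept in Europe; no cluster to set up
client.create_dataset(dataset)
```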
 
 -   pay-as-you-go model 
- separates storage and computation with a terabit network in between
 - pay for your data storage separately from queries.
 - pay for queries only when they are actually running.
 
 - have full control over who has access to the data stored in BigQuery,
- including sharing data sets with people in different projects.
 - If you share datasets, that won’t impact your cost or performance.
- People you share with pay for their own queries, not you.
 
 
 - Long-term storage pricing is an automatic discount for data residing in BigQuery for extended periods of time.
- once data has been in BigQuery for 90 days, the price of storage automatically drops.
 
 
Process
Cloud Dataproc
Running Hadoop jobs on-premises
- requires a capital investment in hardware.
 
Running Hadoop jobs in Cloud Dataproc
-   migrate on-premises Hadoop jobs to the cloud 
- a fast, easy, managed way to run and manage 
Hadoop, MapReduce, Spark, Hive, and Pig on Google Cloud Platform. 
 Data mining and analysis in datasets of known size
-   create clusters in 90 sec or less 
- just need to request a Hadoop cluster.
 - It will be built in 90 seconds or less
- on top of Compute Engine virtual machines whose number and type you control.
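
A minimal sketch of requesting a Dataproc cluster with the Python client library (the same request can be made from the console or the gcloud CLI); the project, region, and machine types are hypothetical.

```python
from google.cloud import dataproc_v1

project_id = "my-project"   # hypothetical project
region = "us-central1"      # hypothetical region

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "demo-cluster",
    "config": {
        # You control the number and type of the Compute Engine VMs.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Cluster created:", operation.result().cluster_name)
```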
 
 
 -   Scale clusters even when jobs are running 
- if you need more or less processing power while the cluster is running, scale it up or down.
 - use the default configuration for the Hadoop software in the cluster or customize it.
 - monitor the cluster using Stackdriver.
 
 -   save money with preemptible Compute Engine instances 
-   only pay for hardware resources used during the life of the cluster 
- the cost of the Compute Engine instances isn’t the only component of the cost of a Dataproc cluster, but it’s a significant one.
 - Although the rate for pricing is based on the hour,
- Cloud Dataproc is billed by the second.
 - billed in one-second clock-time increments, subject to a one minute minimum billing.
 
 - when done with the cluster, delete it, and billing stops.
 
 -   more agile use of resources  than on-premises hardware assets.
 - let Cloud Dataproc use  preemptible Compute Engine instances  for the batch processing.
- make sure the jobs can be restarted cleanly if they’re terminated, and you get a significant break in the cost of the instances.
 - preemptible instances are around 80 percent cheaper.
 
 
 
Once the data is in a cluster,
use Spark and Spark SQL to do data mining
use MLlib, Apache Spark’s machine learning library, to discover patterns through machine learning
Cloud Dataproc vs. Cloud Dataflow
| | Cloud Dataproc | Cloud Dataflow | 
|---|---|---|
| data size | datasets of known size | unpredictable size or rate | 
| management | manage the cluster size yourself | a unified programming model and a managed service | 
| real-time data | \ | choose Dataflow if data shows up in real time | 
Cloud Dataflow
both a unified programming model and a managed service
- develop and execute a wide range of data processing patterns
- extract/transform/load (ETL), batch computation, and continuous computation.
 
 - write code once and get both batch and streaming
- Transform-based programming model
 - use Dataflow to build data pipelines.
 - the same pipelines work for both batch and streaming data.
 
 no need to spin up a cluster or to size instances.
- fully automates the management of whatever processing resources are required.
- frees you from operational tasks
- like resource management and performance optimization.
 
 
 
- example,
- a Dataflow pipeline reads data from a BigQuery table (the source),
 - processes it in a variety of ways (the transforms),
 - and writes its output to Cloud Storage (the sink).
 - Some of those transforms are map operations and some are reduce operations.
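
A minimal Apache Beam sketch of a pipeline with that shape: a BigQuery source, a map transform and a reduce-style aggregation, and a Cloud Storage sink. The project and bucket names are hypothetical; switching the runner to "DataflowRunner" hands the same pipeline to the managed Dataflow service.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project and bucket.
options = PipelineOptions(
    runner="DirectRunner",            # use "DataflowRunner" for the managed service
    project="my-project",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Source: read rows from a BigQuery public table.
        | "Read" >> beam.io.ReadFromBigQuery(
            query="SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 1000",
            use_standard_sql=True,
        )
        # Transforms: a map step followed by a reduce-style count.
        | "ExtractName" >> beam.Map(lambda row: row["name"])
        | "CountPerName" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        # Sink: write the results to Cloud Storage.
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/name_counts")
    )
```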
 
 
pipelines
can build really expressive pipelines.
- Each step in the pipeline is elastically scaled.
- no need to launch and manage a cluster.
 - the service provides all resources on demand.
 
 - It has automated and optimized work partitioning built in
- can dynamically rebalance lagging work.
 - reduces the need to worry about hot keys,
 - situations where disproportionately large chunks of your input get mapped to the same cluster.
 
 - use cases.
- a general purpose ETL (extract/transform/load) tool
 - a data analysis engine
- batch computation or continuous computation using streaming.
 - handy in things like
 - fraud detection and financial services,
 - IoT analytics and manufacturing,
 - healthcare and logistics and click stream,
 - point of sale and segmentation analysis in retail.
 
 -   orchestration 
- create pipelines that coordinate multiple services, even external services.
 - can be used in real time applications such as personalizing gaming user experiences.
 
 
 - integrates with GCP services like Cloud Storage, Cloud Pub/Sub, BigQuery, and Bigtable
- Open source Java and Python SDKs
 
 
Visualize
Pipeline
Cloud Composer
Cloud Data Fusion
Cloud Datalab
Scientists have long used lab notebooks to organize their thoughts and explore their data.
- For data science, the lab notebook metaphor works really well
- because it feels natural to intersperse data analysis with comments about their results.
 
 - A popular environment for hosting those is Project Jupyter.
- create and maintain web-based notebooks containing Python code
 - and run that code interactively and view the results.
 
 
Cloud Datalab
- offers interactive data exploration
- an interactive tool for large-scale data exploration, transformation, analysis, and visualization
 
 - integrated and open source
- built on Jupyter (formerly IPython)
 
 - It’s integrated with BigQuery, Compute Engine, and Cloud Storage
- so accessing data doesn’t run into authentication hassles.
 - analyze data in BigQuery, Compute Engine, and Cloud Storage using Python, SQL, and JavaScript
 - easily deploy models to BigQuery
 
 - Cloud Datalab takes the management work out of this natural technique.
- It runs in a Compute Engine virtual machine.
 
 - To get started
- specify the virtual machine type
 - what GCP region it should run in.
 - When it launches
- it presents an interactive Python environment
 - it orchestrates multiple GCP services automatically, so you can focus on exploring the data.
 
 
 - only pay for the resources you use.
- no additional charge for Datalab itself.
 
 When you’re up and running, visualize your data with Google Charts or matplotlib, and because there’s a vibrant interactive Python community, you can learn from published notebooks.
- existing packages for statistics, machine learning, and so on.
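
A minimal sketch of the kind of cell you might run in a Datalab/Jupyter notebook, assuming the google-cloud-bigquery client, pandas, and matplotlib are available in the environment; the query uses a Google public dataset.

```python
# Run inside a notebook cell; Datalab handles the credentials for you.
from google.cloud import bigquery

client = bigquery.Client()

df = client.query("""
    SELECT state, COUNT(*) AS name_rows
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY state
    ORDER BY name_rows DESC
    LIMIT 10
""").to_dataframe()

# Quick visualization of the result right in the notebook.
df.plot(kind="bar", x="state", y="name_rows")
```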
 