Monday, 16 November 2020

WordPress Learning

WordPress is an open source Content Management System (CMS) that allows users to build dynamic websites and blogs. WordPress is the most popular blogging system on the web and allows you to update, customize and manage a website from its back-end CMS and components.

Tuesday, 10 November 2020

 

Explain how the buffer is used in Amazon web services?

The buffer is used to make the system more robust in managing traffic or load by synchronizing different components. Usually, components receive and process requests in an unbalanced way. With the help of a buffer, the components are balanced and work at the same speed to provide faster services.

What is the difference between Amazon S3 and EC2?

The difference between EC2 and Amazon S3 is that:

EC2:
  • It is a cloud web service used for hosting your application.
  • It is like a huge computer that can run either Linux or Windows and can handle applications such as PHP, Python, Apache or any database.

S3:
  • It is a data storage system where any amount of data can be stored.
  • It has a REST interface and uses secure HMAC-SHA1 authentication keys.

 

AWS interview preparation:

What is AWS?

Amazon Web Services (AWS) is a cloud service from Amazon that provides services in the form of building blocks; these building blocks can be used to create and deploy any type of application in the cloud.

What is a security group in AWS?

A security group is an AWS firewall solution that performs one primary function: filtering incoming and outgoing traffic to and from an EC2 instance. It accomplishes this filtering at the TCP/IP level, using protocols, ports and source/destination IP addresses.
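As a rough sketch of how this looks in practice (assuming boto3 is installed and credentials/region are configured; the group name, VPC ID and port below are made up for illustration), you could create a security group and open SSH like this:

import boto3

ec2 = boto3.client("ec2")  # uses the default credentials and region

# Hypothetical VPC ID and group name, for illustration only
resp = ec2.create_security_group(
    GroupName="demo-web-sg",
    Description="Allow SSH from anywhere (demo)",
    VpcId="vpc-0123456789abcdef0",
)
sg_id = resp["GroupId"]

# Add an inbound (ingress) rule: TCP port 22 from any IPv4 address
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)

Outbound (egress) rules are managed the same way with authorize_security_group_egress.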

What are policies in AWS?

You manage access in AWS by creating policies and attaching them to IAM identities (users, groups of users, or roles) or AWS resources. A policy is an object in AWS that, when associated with an identity or resource, defines their permissions. AWS evaluates these policies when an IAM principal (user or role) makes a request.
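As a sketch (the policy name, bucket name and user name below are hypothetical), an identity-based policy can be created and attached to an IAM user with boto3:

import json
import boto3

iam = boto3.client("iam")

# Hypothetical policy allowing read-only access to a single S3 bucket
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-bucket",
            "arn:aws:s3:::example-bucket/*",
        ],
    }],
}

resp = iam.create_policy(
    PolicyName="ExampleS3ReadOnly",
    PolicyDocument=json.dumps(policy_document),
)

# Attach the managed policy to an existing IAM user (the user name is made up)
iam.attach_user_policy(
    UserName="example-user",
    PolicyArn=resp["Policy"]["Arn"],
)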

What is an S3 bucket in AWS?

Amazon S3 sits near the top of any list of AWS services because storing and retrieving data plays a prominent role in cloud computing. AWS offers a service called Amazon Simple Storage Service, or Amazon S3, to store and retrieve data from the cloud. S3 allows users to store, upload and retrieve files of up to 5 TB. It is a scalable, low-cost and high-speed web-based service designed for archival and online backup of application programs and data. Using S3, users get access to the same storage system that Amazon uses to run its own website, and they have control over whether the data is publicly or privately accessible.
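A minimal sketch of basic bucket operations with boto3 (the bucket name and file names are made up, and the example assumes the default us-east-1 region, since other regions need a CreateBucketConfiguration):

import boto3

s3 = boto3.client("s3")

# Create a bucket (bucket names are globally unique; this one is hypothetical)
s3.create_bucket(Bucket="my-example-bucket-2020")

# Upload a local file to the bucket, then download it again
s3.upload_file("backup.tar.gz", "my-example-bucket-2020", "backups/backup.tar.gz")
s3.download_file("my-example-bucket-2020", "backups/backup.tar.gz", "restored.tar.gz")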

Which services are available?

To name a few: S3, EC2, DynamoDB, Lambda, CloudFront, etc.

What is Terraform used for?

It is an Infrastructure as Code (IaC) tool used to provision and automate cloud infrastructure without manual intervention in the environment.


What is a group in AWS?

Groups let you specify permissions for multiple users, which can make it easier to manage the permissions for those users. For example, you could have a group called Admins and give that group the types of permissions that administrators typically need.
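A small boto3 sketch of that Admins example (the user name is hypothetical; AdministratorAccess is an AWS managed policy):

import boto3

iam = boto3.client("iam")

# Create the group and give it administrator permissions
iam.create_group(GroupName="Admins")
iam.attach_group_policy(
    GroupName="Admins",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)

# Any user added to the group inherits those permissions
iam.add_user_to_group(GroupName="Admins", UserName="example-user")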


What are roles and policies in AWS?

A policy is an object in AWS that, when associated with an identity or resource, defines their permissions. AWS evaluates these policies when an IAM principal (user or role) makes a request. Permissions in the policies determine whether the request is allowed or denied.


What is the difference between groups and roles in AWS?

An IAM group is primarily a management convenience for managing the same set of permissions for a set of IAM users. An IAM role is an AWS Identity and Access Management (IAM) entity with permissions to make AWS service requests. Use IAM roles to delegate access within or between AWS accounts.


What are the EC2 instance types?

EC2 instances come in families optimized for different workloads (general purpose, compute optimized, memory optimized, storage optimized and accelerated computing), and each family is offered in several sizes, for example:

micro

medium

large


how many IAM users can I create?
The default maximum limit is 5000 users per AWS account.

What is the difference between roles and groups?
A group is a collection of users with a given set of permissions assigned to the group (and, transitively, to the users). A role is a collection of permissions, and a user effectively inherits those permissions when he or she acts under that role. A role, on the other hand, can be activated according to specific conditions.


Who uses AWS?
According to Intricately, the top ten AWS users based on EC2 monthly spend include: Netflix ($19 million), Twitch ($15 million) and LinkedIn ($13 million).


Amazon EC2 Instance Types
Amazon EC2 provides a wide selection of instance types optimized to fit different use cases. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity, and give you the flexibility to choose the appropriate mix of resources for your applications. Each instance type includes one or more instance sizes, allowing you to scale your resources to the requirements of your target workload.
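For illustration, launching a single small instance with boto3 might look like the sketch below (the AMI ID and key pair name are placeholders, not real values):

import boto3

ec2 = boto3.client("ec2")

# Launch one t2.micro instance from a hypothetical AMI
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key",                  # placeholder key pair name
)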


How can you send a request to Amazon S3?

Amazon S3 is a REST service, and you can send a request by using the REST API or the AWS SDK wrapper libraries that wrap the underlying Amazon S3 REST API.
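Under the hood, every SDK call is translated into a signed HTTP request against the S3 REST API. A small boto3 sketch (reusing the hypothetical bucket from the earlier example) that fetches an object and also produces a presigned URL, which makes the underlying REST request explicit:

import boto3

s3 = boto3.client("s3")

# The SDK call below becomes a signed HTTP GET against the S3 REST API
obj = s3.get_object(Bucket="my-example-bucket-2020", Key="backups/backup.tar.gz")
data = obj["Body"].read()

# A presigned URL lets anyone issue a plain HTTP GET for the next hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket-2020", "Key": "backups/backup.tar.gz"},
    ExpiresIn=3600,
)
print(url)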


What is Lambda?

  • Lambda encapsulates everything underneath your code: data centres, hardware, operating systems, language runtimes and the AWS APIs used to manage them, so you do not have to deal with any of it yourself.
  • Lambda is a compute service where you can upload your code and create a Lambda function (a minimal handler sketch follows this list).
  • Lambda takes care of provisioning and managing the servers used to run the code.
  • While using Lambda, you don't have to worry about scaling, patching, operating systems, etc.
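To make the points above concrete, here is a minimal Python handler sketch (the event shape and function name are made up; you would upload this file and point the function's handler setting at lambda_handler):

# handler.py - a minimal AWS Lambda handler in Python
import json

def lambda_handler(event, context):
    # 'event' carries the request payload; 'context' carries runtime metadata
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello, " + name + "!"}),
    }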

What is CloudWatch?

  • CloudWatch is a service used to monitor your AWS resources and applications that you run on AWS in real time. CloudWatch is used to collect and track metrics that measure your resources and applications.
  • It automatically displays metrics for every AWS service that you use.
  • You can create dashboards to display metrics for your custom applications, as well as metrics from any custom collections that you choose.
  • You can also create an alarm to watch metrics. For example, you can monitor the CPU usage, disk reads and disk writes of an Amazon EC2 instance to determine whether additional EC2 instances are required to handle the load, or to stop the instance and save money. A small boto3 sketch of publishing a custom metric and creating such an alarm follows.
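A hedged sketch with boto3 (the namespace, metric name and instance ID are hypothetical):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom metric data point
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{"MetricName": "SignupCount", "Value": 3, "Unit": "Count"}],
)

# Alarm when the average CPU of one EC2 instance stays above 80% for two 5-minute periods
cloudwatch.put_metric_alarm(
    AlarmName="HighCPU",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
)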

Following are the terms associated with CloudWatch:

  • Dashboards: CloudWatch is used to create dashboards to show what is happening with your AWS environment.
  • Alarms: It allows you to set alarms to notify you whenever a particular threshold is hit.
  • Logs: CloudWatch logs help you to aggregate, monitor, and store logs.
  • Events: CloudWatch Events helps you respond to state changes in your AWS resources.

What is Redshift?

  • Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud.

OLAP

OLAP (Online Analytical Processing) is the type of workload that Redshift is designed for.

OLAP transaction Example:

Suppose we want to calculate the net profit for EMEA and Pacific for the Digital Radio product. This requires pulling a large number of records. The following figures are required to calculate the net profit:

  • Sum of Radios sold in EMEA.
  • Sum of Radios sold in Pacific.
  • Unit cost of radio in each region.
  • Sales price of each radio
  • Sales price - unit cost

Complex queries are required to fetch the records listed above, which is why data warehousing databases use a different type of architecture, both from a database perspective and at the infrastructure layer. As an illustration, the sketch after this paragraph shows what such a query might look like when run from Python.
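A hedged sketch, assuming the psycopg2 driver is installed and using placeholder cluster endpoint, credentials, table and column names that mirror the net-profit example above (Redshift speaks the PostgreSQL wire protocol, so a standard PostgreSQL driver works):

import psycopg2

# Endpoint, database and credentials below are placeholders
conn = psycopg2.connect(
    host="examplecluster.abc123xyz.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="sales",
    user="awsuser",
    password="...",
)

# Hypothetical table/columns: net profit per region for the Digital Radio product
sql = """
SELECT region,
       SUM(sales_price) - SUM(unit_cost) AS net_profit
FROM   radio_sales
WHERE  product = 'Digital Radio'
  AND  region IN ('EMEA', 'Pacific')
GROUP  BY region;
"""

with conn.cursor() as cur:
    cur.execute(sql)
    for region, net_profit in cur.fetchall():
        print(region, net_profit)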

Redshift Configuration


Redshift consists of two types of nodes:

  • Single node
  • Multi-node

Single node: A single node stores up to 160 GB.

Multi-node: a cluster that consists of more than one node. It is of two types:

  • Leader Node
    It manages the client connections and receives queries. A leader node receives the queries from the client applications, parses the queries, and develops the execution plans. It coordinates the parallel execution of these plans across the compute nodes, combines the intermediate results from all the nodes, and then returns the final result to the client application.
  • Compute Node
    A compute node executes the execution plans and sends the intermediate results to the leader node for aggregation before they are returned to the client application. A cluster can have up to 128 compute nodes.

What is SQS?

  • SQS stands for Simple Queue Service.
  • SQS was the first service available in AWS.
  • Amazon SQS is a web service that gives you access to a message queue that can be used to store messages while waiting for a computer to process them.
  • Amazon SQS is a distributed queue system that enables web service applications to quickly and reliably queue messages that one component in the application generates to be consumed by another component. A queue is a temporary repository for messages that are awaiting processing.
  • With the help of SQS, you can send, store and receive messages between software components at any volume without losing messages.
Let's look at another example of SQS, i.e., Travel Website.

Suppose a user wants to book a package holiday and is looking for the best possible flight. The user types a query in a browser, and it hits an EC2 instance. The EC2 instance looks at what the user is searching for and puts a message on an SQS queue. Another EC2 instance continuously polls the queue looking for jobs to do; once it gets the job, it processes it by interrogating the airline service to get all the best possible flights. It sends the result to the web server, and the web server sends the result back to the user. The user then selects the best flight according to his or her budget.

What would happen if we didn't have SQS?

The web server would pass the information to an application server, and the application server would query the airline service. If the application server crashed, the user would lose the query. One of the great things about SQS is that the data stays queued in SQS even if the application server crashes: while a message is being processed it is marked as invisible for a visibility-timeout window, and when the timeout runs out the message reappears in the queue, so a new EC2 instance can pick it up and perform the job. Therefore, we can say that SQS removes the dependency on any single application server.
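A hedged boto3 sketch of that flow (the queue name and message body are made up):

import boto3

sqs = boto3.client("sqs")

# Create (or look up) a queue; the name is hypothetical
queue_url = sqs.create_queue(QueueName="flight-search-jobs")["QueueUrl"]

# The web tier enqueues a job...
sqs.send_message(QueueUrl=queue_url, MessageBody='{"query": "LHR->JFK, 2 adults"}')

# ...and a worker polls for it; the message stays invisible until it is deleted
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])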



 

What are the Top 10 most used AWS services?

Service #1 - Amazon S3

Service #2 - Amazon EC2 [Elastic Compute Cloud]

Service #3 - AWS Lambda

Service #4 - Amazon Glacier

Service #5 - Amazon SNS

Service #6 - Amazon CloudFront

Service #7 - Amazon EBS [Elastic Block Store]

Service #8 - Amazon Kinesis

Service #9 - Amazon VPC

Service #10 - Amazon SQS

Sunday, 26 April 2020

What is a heartbeat in HDFS?



A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive heartbeats, it concludes that there is a problem with the DataNode, or that the TaskTracker is unable to perform its assigned tasks.

How is Spark different than Hadoop?



Spark stores data in memory, thus running MapReduce-style operations much faster than Hadoop, which stores the data on disk. It also has interactive shells in Scala, Python, and R, and it includes a machine learning library, Spark ML, that is developed by the Spark project itself rather than separately, like Mahout.

Basic understanding about Spark

1.What is Apache Spark?

Apache Spark is a cluster computing framework that runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing a wide variety of data from multiple sources. In Spark, a task is an operation that can be a map task or a reduce task. The Spark Context handles the execution of the job and also provides APIs in different languages (Scala, Java and Python) to develop applications, with faster execution compared to MapReduce.


2. How is Spark different from MapReduce? Is Spark faster than MapReduce?

Yes, Spark is faster than MapReduce. There are a few important reasons why Spark is faster than MapReduce, and some of them are below:
  • There is no tight coupling in Spark i.e., there is no mandatory rule that reduce must come after map.
  • Spark tries to keep the data “in-memory” as much as possible.
In MapReduce, the intermediate data is stored in HDFS and hence it takes longer to get the data from the source, but this is not the case with Spark.

3. Explain the Apache Spark Architecture. How to Run Spark applications?

  • An Apache Spark application contains two programs, namely a Driver program and a Workers program.
  • A cluster manager will be there in-between to interact with these two cluster nodes. Spark Context will keep in touch with the worker nodes with the help of Cluster Manager.
  • Spark Context is like a master and Spark workers are like slaves.
  • Workers contain the executors to run the job. If any dependencies or arguments have to be passed then Spark Context will take care of that. RDD’s will reside on the Spark Executors.
  • You can also run Spark applications locally using a thread, and if you want to take advantage of distributed environments you can take the help of S3, HDFS or any other storage system.

4. What is RDD?

A. RDD stands for Resilient Distributed Dataset. If you have a large amount of data that is not necessarily stored in a single system, the data can be distributed across all the nodes; one subset of the data is called a partition, and it is processed by a particular task. RDDs are very close to input splits in MapReduce.

5. What is the role of coalesce () and repartition () in Spark?

A. Both coalesce and repartition are used to modify the number of partitions in an RDD but Coalesce avoids full shuffle.
If you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions and this does not require a shuffle.
Repartition performs a coalesce with a shuffle. Repartition will result in the specified number of partitions, with the data distributed using a hash partitioner.
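A small PySpark sketch of the difference (local mode, made-up data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()            # local SparkContext for the demo

rdd = sc.parallelize(range(1000), 100)     # 100 partitions
fewer = rdd.coalesce(10)                   # shrink to 10 partitions, no full shuffle
more = rdd.repartition(200)                # grow to 200 partitions, full shuffle

print(fewer.getNumPartitions(), more.getNumPartitions())   # 10 200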

6. How do you specify the number of partitions while creating an RDD? What are the functions?

A. You can specify the number of partitions while creating an RDD either by using sc.parallelize or by using the textFile function, as follows:
val rdd1 = sc.parallelize(data, 4)      // 4 partitions from an in-memory collection
val rdd2 = sc.textFile("path", 4)       // 4 partitions when reading a file

7. What are actions and transformations?

A. Transformations create new RDDs from existing RDDs; these transformations are lazy and will not be executed until you call an action.
E.g.: map(), filter(), flatMap(), etc.
Actions return results computed from an RDD.
E.g.: reduce(), count(), collect(), etc.
A short PySpark illustration follows.
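A minimal sketch (using a local SparkContext; the numbers are made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3, 4, 5])

squares = nums.map(lambda x: x * x)           # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

print(evens.count())                          # action: triggers the computation -> 2
print(squares.reduce(lambda a, b: a + b))     # action -> 55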

8. What is Lazy Evaluation?

A. If you create an RDD from an existing RDD, that is called a transformation, and unless you call an action your RDD will not be materialized. The reason is that Spark delays the computation until you really want the result: you may have typed something wrong and need to correct it interactively, and computing everything eagerly would only add unnecessary delays. Also, Spark optimizes the required calculations and takes intelligent decisions that are not possible with line-by-line code execution, and it recovers from failures and slow workers.

9. Mention some Transformations and Actions

A. Transformations: map(), filter(), flatMap()
Actions: reduce(), count(), collect()

10. What is the role of cache() and persist()?

A. Whenever you want to store an RDD in memory, because the RDD will be used multiple times or because it was created after lots of complex processing, you can take advantage of cache or persist.
You can make an RDD to be persisted using the persist() or cache() functions on it. The first time it is computed in an action, it will be kept in memory on the nodes.
When you call persist(), you can specify that you want to store the RDD on the disk or in the memory or both. If it is in-memory, whether it should be stored in serialized format or de-serialized format, you can define all those things.
cache() is the same as the persist() function with the storage level set to memory only.
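A hedged PySpark sketch (the HDFS path is a placeholder):

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

logs = sc.textFile("hdfs:///logs/2020/*")                 # placeholder path
errors = logs.filter(lambda line: "ERROR" in line)

errors.persist(StorageLevel.MEMORY_AND_DISK)              # keep it around for reuse
# errors.cache() would be the memory-only equivalent

print(errors.count())   # first action: computes and materializes the RDD
print(errors.take(5))   # reuses the cached partitions instead of re-reading the input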

11. What are Accumulators?

A. Accumulators are write-only variables (from the workers' point of view) which are initialized once and sent to the workers. The workers update them based on the logic written in the tasks, and the updates are sent back to the driver, which aggregates or processes them.
Only the driver can access an accumulator's value; for tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across the workers.
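A small PySpark sketch of that error-counting use case (the input data is made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

bad_records = sc.accumulator(0)

def parse(line):
    try:
        return [int(line)]
    except ValueError:
        bad_records.add(1)   # workers only add to it; they never read the value
        return []

parsed = sc.parallelize(["1", "2", "oops", "4"]).flatMap(parse)
parsed.count()               # run an action so the tasks actually execute
print(bad_records.value)     # only the driver reads the value -> 1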

12. What are Broadcast Variables?

A. Broadcast variables are read-only shared variables. Suppose there is a set of data that may have to be used multiple times by the workers at different phases; we can share all those variables with the workers from the driver, and every machine can read them.
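A minimal PySpark sketch (the lookup table is made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Ship a small lookup table to every worker once
country_names = sc.broadcast({"FR": "France", "DE": "Germany", "IN": "India"})

codes = sc.parallelize(["FR", "IN", "FR"])
named = codes.map(lambda c: country_names.value.get(c, "unknown"))
print(named.collect())   # ['France', 'India', 'France']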

13. What are the optimizations that a developer can make while working with Spark?

A. Spark is memory intensive; whatever you do, it does in memory.
Firstly, you can adjust how long Spark will wait before it times out on each of the phases of data locality (data local –> process local –> node local –> rack local –> any).
Filter out data as early as possible. For caching, choose wisely from various storage levels.
Tune the number of partitions in spark.

14. What is Spark SQL?

A. Spark SQL is a module for structured data processing where we take advantage of SQL queries running on the datasets.

15. What is a Data Frame?

A. A DataFrame is like a table: it has named columns, with the data organized into those columns. You can create a DataFrame from a file, from tables in Hive, from external databases (SQL or NoSQL) or from existing RDDs. It is analogous to a database table.
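A short PySpark sketch (the rows, column names and CSV path are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Build a small DataFrame from in-memory rows
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.show()

# A DataFrame can also be created from files (or Hive tables, JDBC sources, RDDs)
csv_df = spark.read.csv("people.csv", header=True, inferSchema=True)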

16. How can you connect Hive to Spark SQL?

A. The first important thing is that you have to place hive-site.xml file in conf directory of Spark.
Then, with the help of the Spark session object, we can run a query and get back a DataFrame:
result = spark.sql("select * from <hive_table>")
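A slightly fuller sketch, assuming hive-site.xml is already in Spark's conf directory (the table name below is a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()     # makes the Hive metastore visible to Spark SQL
         .getOrCreate())

result = spark.sql("SELECT * FROM some_hive_table LIMIT 10")   # placeholder table
result.show()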

17. What is GraphX?

A. Many times you have to process data in the form of graphs because you need to run some analysis on it. GraphX performs graph computation in Spark on data present in files or in RDDs.
GraphX is built on the top of Spark core, so it has got all the capabilities of Apache Spark like fault tolerance, scaling and there are many inbuilt graph algorithms also. GraphX unifies ETL, exploratory analysis and iterative graph computation within a single system.
You can view the same data as both graphs and collections, transform and join graphs with RDD efficiently and write custom iterative algorithms using the pregel API.
GraphX competes on performance with the fastest graph systems while retaining Spark’s flexibility, fault tolerance and ease of use.

18. What is PageRank Algorithm?

A. One of the algorithms in GraphX is the PageRank algorithm. PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v's importance by u.
For example, on Twitter, if a user is followed by many other users, that particular user will be ranked highly. GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object.

19. What is Spark Streaming?

A. Whenever there is data flowing continuously and you want to process the data as early as possible, in that case you can take the advantage of Spark Streaming. It is the API for stream processing of live data.
Data can flow from Kafka, Flume, TCP sockets, Kinesis, etc., and you can do complex processing on the data before pushing it to its destination. Destinations can be file systems, databases or other dashboards.
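A minimal PySpark Streaming sketch (a local socket source and a 10-second batch interval, both chosen just for illustration):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 10)                       # 10-second batch interval

lines = ssc.socketTextStream("localhost", 9999)      # illustrative source
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()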

20. What is Sliding Window?

A. In Spark Streaming, you have to specify the batch interval. For example, say your batch interval is 10 seconds: Spark will process whatever data it received in the last 10 seconds, i.e., in the last batch interval.
With a sliding window, you can instead specify how many of the last batches have to be processed together: you specify both the batch interval and how many batches you want in each window.
Apart from this, you can also specify how often the window slides. For example, you may want to process the last 3 batches every time 2 new batches arrive; that is, you choose when to slide and how many batches are processed in each window.
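Continuing the streaming sketch above (reusing its `words` DStream and 10-second batch interval), a 3-batch window that slides every 2 batches might look like this:

# 30-second window (3 batches) sliding every 20 seconds (2 batches);
# both durations must be multiples of the batch interval.
windowed = words.window(windowDuration=30, slideDuration=20)
window_counts = windowed.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
window_counts.pprint()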


Introduction to MESOS



Apache Mesos is an open source cluster management project designed to set up and optimize distributed systems. Mesos allows resources to be managed and shared in a fine-grained, dynamic way between different nodes and for various applications. This article covers the architecture of Mesos, its fundamentals and its support for NVIDIA GPUs.
Architecture of Mesos
Mesos consists of several elements:

Master daemon: runs on master nodes and controls the slave daemons.
Slave daemon: runs on slave nodes and allows tasks to be launched.
Framework: an application that runs on top of Mesos (for example Marathon or Chronos); it is made up of:

a scheduler, which asks the master for available resources
one or more executors, which launch the applications on the slave nodes.
Offer: a list of the available resources (CPU and memory) that the master sends to the frameworks.
Task: runs on slave nodes; it can be any type of application (a bash command, an SQL query, a Hadoop job, ...).
ZooKeeper: coordinates the master nodes.
High availability
In order to avoid a SPOF (Single Point of Failure), several masters must be used: a leading master (the leader) and backup masters. ZooKeeper replicates the masters across N nodes to form a ZooKeeper quorum, and it is ZooKeeper that coordinates the election of the leader. At least 3 masters are required for high availability.


Marathon
Marathon is a container orchestrator for Mesos that allows you to launch applications. It is equipped with a REST API to start and stop applications.
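As a hedged illustration (the Marathon endpoint and the application definition are made up), an application can be submitted to Marathon's REST API like this:

import requests

# Hypothetical app definition: two instances of a tiny Python web server
app = {
    "id": "/demo/hello",
    "cmd": "python3 -m http.server 8080",
    "cpus": 0.1,
    "mem": 64,
    "instances": 2,
}

resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
print(resp.status_code, resp.json())

Stopping the application is a DELETE on the same /v2/apps/<app id> resource.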

Chronos
Chronos is a framework for Mesos, developed by Airbnb to replace the standard crontab. It is a complete, distributed, fault-tolerant scheduler that facilitates the orchestration of tasks. Chronos has a REST API and a web interface for creating scheduled tasks.

Principle of operation



The typical flow for launching and orchestrating a task is the following:

Agent 1 informs the leading master of the resources available on the slave node with which it is associated. The master can then apply a placement policy and offer all the available resources to framework 1.
The master informs framework 1 of the resources available on agent 1.
The framework's scheduler responds to the master ("I will run two tasks on agent 1") depending on the resources available.
The master sends the two tasks to the agent, which allocates the resources to the two new tasks.
Containerizer
Containerizer is the Mesos component that launches containers; it is responsible for isolating and managing container resources.

Creation and launch of a containerizer:

The agent creates a containerizer, selected with the --containerizers option.
To run a container, you must specify the type of executor (mesos, docker, composing); otherwise the default is used. The executor in use is carried in the task's TaskInfo:

mesos-executor -> default executor
mesos-docker-executor -> Docker executor
Types of containers:

Mesos supports different types of containers:

Composing containerizer: combines several containerizers (for example Docker and Mesos) behind a single interface.
Docker containerizer: manages containers using the Docker engine.
Mesos containerizer: the native Mesos containerizer.
NVIDIA and Mesos GPUs
Using GPUs with Mesos is not a big problem. The agents must first be configured so that they take GPUs into account when they inform the master of the resources available. The masters must also be configured so that they can include the GPUs in the resource offers sent to frameworks.

Launching tasks is performed in the same way by adding a GPU resource type. However, unlike processors, memory and disks, only whole numbers of GPUs can be selected. If a fractional quantity is chosen, launching the task will cause a TASK_ERROR type error.

For the moment, only the Mesos containerizer is capable of launching tasks with NVIDIA GPUs. In practice this does not bring many limitations, because the Mesos containerizer natively supports Docker images.

In addition, Mesos incorporates the operating principle of the "nvidia-docker" image, exposing the CUDA Toolkit to developers and data scientists. This allows the drivers and tools necessary for GPUs to be mounted directly in the container. We can therefore build our container locally and deploy it easily with Mesos.

Conclusion
Mesos is a solution that allows companies to deploy and manage Docker containers, while sharing the available resources of their infrastructures. In addition, thanks to the Mesos containerizer, we can perform deep learning in a distributed way or share GPU resources between several users.