Ride Analytics on New York City Taxi Data


RIDE ANALYTICS ON NEW YORK CITY TAXI DATA
Sai Duth Deekshit G, Rohit Reddy G, Rohith Varma Jampana, Sumanth D.

CONTENTS
Introduction
Project Description
Background
Problem Definition and Solutions
Techniques Used
Future Work
References

INTRODUCTION
Transportation plays a vital role in large cities.

Taxis have become a key mode of transportation in large cities in the United States and other countries.
In NYC there are approximately 50,000 vehicles and 100,000 drivers.
Service providers include Uber, Yellow Taxi, Green Taxi, etc.
The data containing ride details was made available by the NYC Taxi and Limousine Commission.
We use these details to perform analytics on ride data that would benefit businesses of various types and the government.

PROJECT DESCRIPTION

In this project we perform analytics on NYC taxi data and answer queries such as:
Most common pick-up and drop-off locations
Busiest routes for taxis
Areas that generate the most revenue for cabs
Popularity of places
Whether the driver took the correct route
Popular places between pick-up and drop-off locations

BACKGROUND WORK

Yufei Tao, Dimitris Papadias, Qiongmao Shen. Continuous Nearest Neighbor Search. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

Apache Spark: an engine for processing big data quickly and efficiently
Contains several built-in modules for streaming, SQL, machine learning and graph processing
Provides an API known as the Resilient Distributed Dataset (RDD)
RDDs support both iterative algorithms, which visit the dataset several times in a loop, and exploratory data analysis (repeated database-style querying of data)
Processes and executes batch jobs much faster than MapReduce
Runs on Hadoop alongside other Hadoop tools such as Hive and Pig

CONT.
Yelp API: provides the business names and addresses around a given location (latitude and longitude)
Google Distance Matrix API: provides the road network distance between two locations, taking the coordinates of the two locations as input

SOFTWARE & HARDWARE REQUIREMENTS

Query languages: DBMS, SQL
Programming languages: Scala, Python, Java
Online tools: www.databricks.com (for running Scala or Python cells and storing the data)
APIs: Yelp API, Google Distance Matrix API
Windows or Mac OS, RAM: 4 GB or more, HDD: minimum 50 GB, Internet connection

PROBLEM DEFINITION
The problem is defined in three stages:

1st stage: Data cleaning and analysis on ride data
2nd stage: Finding popular places between pick-up and drop-off locations
3rd stage: Visualization of results

CLEANING AND ANALYSIS ON RIDE DATA
Data set link: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
The initial data set contains fields that are of no use in the analysis, such as VendorId, RateCodeID, Store_and_fwd_flag, Tolls_amount and Improvement
Also remove invalid data (check for blank entries and delete them)
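The cleaning step can be sketched in plain Python as follows. This is only an illustration, not the project's actual Spark job: the function name clean_rows, the sample rows, and the exact column spellings are assumptions (the columns follow the cleaned field list given in the slides, and real TLC files may name them differently).

```python
import csv
import io

# Columns to keep; spellings are an assumption based on the slides' field list.
KEEP = [
    "pickup_datetime", "dropoff_datetime", "passenger_count", "trip_distance",
    "pickup_longitude", "pickup_latitude", "dropoff_longitude",
    "dropoff_latitude", "payment_type", "fare_amount", "trip_amount",
]

def clean_rows(rows):
    """Keep only the KEEP columns and drop any row that has a blank entry."""
    cleaned = []
    for row in rows:
        kept = {name: (row.get(name) or "").strip() for name in KEEP}
        if all(kept.values()):  # every kept field must be non-empty
            cleaned.append(kept)
    return cleaned

# Made-up sample: the second row has a blank passenger_count and is dropped.
SAMPLE = (
    "VendorID,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,"
    "pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,"
    "payment_type,fare_amount,trip_amount\n"
    "1,2015-01-01 00:10:00,2015-01-01 00:25:00,1,3.2,"
    "-73.98,40.75,-73.95,40.78,1,12.5,15.3\n"
    "2,2015-01-01 00:12:00,2015-01-01 00:20:00,,1.1,"
    "-73.99,40.73,-73.98,40.74,2,6.0,7.3\n"
)

cleaned = clean_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(len(cleaned), "row(s) kept")
```

On the full data set the same filter would more likely be written as a Spark transformation; the row-by-row version above is only meant to show the logic.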

The final, cleaned data set contains the following fields, which are used for our analysis: pickup_datetime, dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, payment_type, fare_amount, trip_amount

CONT.
We then analyze the ride data to find:
Most common pick-up and drop-off locations
Busiest routes for taxis
Areas that generate the most revenue for cabs
Popularity of places
Whether the driver took the correct route
Popular places between pick-up and drop-off locations

TECHNIQUES USED
To find famous places between two points we use Continuous Nearest Neighbor search integrated with the Google Distance Matrix API
Google Distance Matrix API:

https://maps.googleapis.com/maps/api/distancematrix/outputFormat?parameters
outputFormat can be either json or xml
Parameters can be origins=latitude,longitude|latitude,longitude
For querying multiple points we can use the Polyline Algorithm Format

FINDING CONTINUOUS NEAREST NEIGHBOR (CNN)
It retrieves the nearest neighbor of every point on a line segment
A split point is a point on the line segment where the nearest neighbor changes

Use an R-tree as the data structure and take the MBR (minimum bounding rectangle) of each intermediate node
Given an entry E and a query line segment q, the subtree of E can contain qualifying points only if mindist(E, q) < SLmaxd, where SLmaxd is the largest distance between a split point and its nearest neighbor; otherwise E does not qualify
E also has to be searched only if dist(Si, Si.NN) > mindist(Si, E) for some split point Si
Entries that are closer to the line segment are more likely to qualify
Entries that satisfy the above conditions are accessed in increasing order of their minimum distances (distances are found using the Google Distance Matrix API)
We get the set of split points Scover = {split points} and their nearest neighbors
Finally, as a result we get a set of <Point, interval> pairs
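To make the idea of split points and <Point, interval> pairs concrete, here is a minimal brute-force sketch. It is not the R-tree algorithm from the paper and it does not call the Google Distance Matrix API: it assumes planar coordinates and straight-line (Euclidean) distance, and the names continuous_nn and sq_dist are hypothetical. Positions on the segment are expressed as a parameter t from 0 (start) to 1 (end).

```python
from typing import List, Tuple

Point = Tuple[float, float]

def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two planar points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def continuous_nn(start: Point, end: Point, sites: List[Point]):
    """Return [(site, (t_from, t_to)), ...] covering the segment start -> end.

    Each site is the nearest neighbor for every position t in its interval.
    Assumes a non-empty list of sites.
    """
    d = (end[0] - start[0], end[1] - start[1])

    def along(p: Point) -> float:
        # Projection of p onto the segment direction (dot product with d).
        return d[0] * p[0] + d[1] * p[1]

    # Nearest site at the start; ties go to the site further along the segment.
    current = min(sites, key=lambda p: (sq_dist(start, p), -along(p)))
    t = 0.0
    result = []
    while True:
        candidates = []
        for b in sites:
            # The gap between the squared distances to b and to current is
            # linear in t; b can only overtake current when slope > 0.
            slope = 2.0 * (along(b) - along(current))
            if slope <= 0:
                continue
            t_star = (sq_dist(start, b) - sq_dist(start, current)) / slope
            if t < t_star < 1.0:
                candidates.append((t_star, -slope, b))
        if not candidates:
            result.append((current, (t, 1.0)))
            return result
        t_star, _, nxt = min(candidates)  # earliest split point wins
        result.append((current, (t, t_star)))
        current, t = nxt, t_star

# Made-up example: three places near a ride from (0, 0) to (10, 0).
pairs = continuous_nn((0.0, 0.0), (10.0, 0.0),
                      [(1.0, 1.0), (5.0, -1.0), (9.0, 2.0)])
for site, interval in pairs:
    print(site, interval)
# The nearest neighbor changes at t = 0.3 and t = 0.7375 (the split points).
```

This scans every place at every split point, so it only suits small inputs; the R-tree pruning conditions described above exist to avoid exactly that scan.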

ALGORITHM FOR FINDING POPULAR PLACES BETWEEN TWO POINTS
Select the source and destination coordinates, which are the pick-up and drop-off coordinates
Call the findNeighbors() method, which returns a set of <Point, interval> pairs, where Point is the nearest neighbor and interval is the interval for which Point is the nearest neighbor
Store the result and use the findPopularity() method to find the popularity of each result

Display the top 5 results based on popularity

FUTURE WORK
The results of the analysis can help taxi drivers decide which areas to go to so that they get the most customers and boost their business
New taxi businesses can also gain from the analysis
Traffic analysis: find which routes and times of day have heavy traffic
Provide visualizations such as heat maps and route visualization
Decrease traffic congestion and reduce CO2 emissions by using public transport instead of individual transport

Find potential for a pooled/shared taxi business

REFERENCES
NYC Taxi & Limousine Commission. http://www.nyc.gov/html/tlc/html/about/about.shtml
Yufei Tao, Dimitris Papadias, Qiongmao Shen. Continuous Nearest Neighbor Search. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

THANK YOU

QUERIES?
