Ride Analytics on New York City Taxi Data

Sai Duth Deekshit G, Rohit Reddy G, Rohith Varma Jampana, Sumanth D.

CONTENTS

  • Introduction
  • Project Description
  • Background
  • Problem Definition and Solutions
  • Techniques Used
  • Future Work
  • References

INTRODUCTION

Transportation plays a vital role in large cities.

Taxis have become a key mode of transportation in large cities in the United States and other countries. NYC has approximately 50,000 vehicles and 100,000 drivers. Service providers include Uber, Yellow Taxi, Green Taxi, and others. The ride data was made available by the NYC Taxi and Limousine Commission. We use this data to perform analytics that could benefit businesses of various types as well as government.

PROJECT DESCRIPTION

In this project we perform analytics on NYC taxi data to answer queries such as:

  • Most common pick-up and drop-off locations
  • Busiest routes for taxis
  • Areas that generate the most revenue for cabs
  • Popularity of places
  • Whether the driver took the correct route
  • Popular places between pick-up and drop-off locations

BACKGROUND WORK

Yufei Tao, Dimitris Papadias, Qiongmao Shen. Continuous Nearest Neighbor Search. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

Apache Spark:

  • An engine for processing big data quickly and efficiently
  • Contains several built-in modules for streaming, SQL, machine learning, and graph processing
  • Provides an API known as the Resilient Distributed Dataset (RDD)
  • RDDs support both iterative algorithms, which visit the dataset several times in a loop, and exploratory data analysis (repeated database-style querying of data)

  • Typically processes and executes batch jobs faster than MapReduce
  • Runs on Hadoop alongside other tools in the Hadoop ecosystem, such as Hive and Pig

BACKGROUND WORK (CONT.)

  • Yelp API: provides the business names and addresses around a given location (latitude and longitude)
  • Google Distance Matrix API: provides the road network distance between two locations, taking the coordinates of the two locations as input

SOFTWARE & HARDWARE REQUIREMENTS

  • Database and query language: DBMS, SQL
  • Programming languages: Scala, Python, Java
  • Online tools: www.databricks.com (for running Scala or Python cells and storing the data)
  • APIs: Yelp API, Google Distance Matrix API
  • System: Windows or macOS, RAM 4 GB or more, HDD minimum 50 GB, Internet connection

PROBLEM DEFINITION

The problem is defined in three stages:

  • 1st stage: Data cleaning and analysis on ride data
  • 2nd stage: Finding popular places between pick-up and drop-off locations
  • 3rd stage: Visualization of results

CLEANING AND ANALYSIS ON RIDE DATA

Data set link: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

  • The initial data set contains fields that are of no use in the analysis, such as VendorId, RateCodeID, Store_and_fwd_flag, Tolls_amount, Improvement
  • Invalid data is also removed (blank entries are checked for and deleted)
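The cleaning step could look roughly like the following minimal sketch. It uses plain Python rather than Spark so that it runs anywhere. The helper name `clean_rows`, the `DROP_FIELDS` set, and the two sample rows are made up for illustration, and the column spellings follow this deck, so they may not match the headers in the actual TLC files.

```python
import csv
import io

# Columns named on the slide as not needed for the analysis.
# Spelling follows the deck; real file headers may differ (assumption).
DROP_FIELDS = {"VendorId", "RateCodeID", "Store_and_fwd_flag", "Tolls_amount"}


def clean_rows(rows):
    """Drop unused columns, then discard any row that has a blank value."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in DROP_FIELDS}
        # A row is valid only if every remaining value is a non-blank string.
        if all(isinstance(v, str) and v.strip() for v in kept.values()):
            cleaned.append(kept)
    return cleaned


# Two made-up rows: the second has a blank passenger_count and is removed.
sample = io.StringIO(
    "VendorId,pickup_datetime,passenger_count,trip_distance,fare_amount,Tolls_amount\n"
    "1,2016-01-01 00:05:00,2,1.8,9.5,0\n"
    "2,2016-01-01 00:07:00,,3.2,14.0,0\n"
)
cleaned = clean_rows(csv.DictReader(sample))
print(cleaned)
```

Blank values are checked after the unused columns are dropped, so a missing value in a discarded column does not cost a usable row.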

The cleaned data set contains the following fields, which are used for our analysis: pickup_datetime, dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, payment_type, fare_amount, trip_amount

CLEANING AND ANALYSIS ON RIDE DATA (CONT.)

We then perform analysis on the ride data to find:

  • Most common pick-up and drop-off locations
  • Busiest routes for taxis

  • Areas that generate the most revenue for cabs
  • Popularity of places
  • Whether the driver took the correct route
  • Popular places between pick-up and drop-off locations

TECHNIQUES USED

For finding the popular places between two points we use Continuous Nearest Neighbor (CNN) search integrated with the Google Distance Matrix API.

Google Distance Matrix API:

https://maps.googleapis.com/maps/api/distancematrix/outputFormat?parameters

  • outputFormat can be either json or xml
  • parameters can include origins=latitude,longitude|latitude,longitude
  • For querying multiple points, the Encoded Polyline Algorithm Format can be used

FINDING CONTINUOUS NEAREST NEIGHBOR (CNN)

  • CNN retrieves the nearest neighbor of every point on a line segment
  • A split point is a point on the line segment where the nearest neighbor changes
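As a worked illustration of a split point, assume straight-line (Euclidean) distance. This is a simplification: the project itself measures road-network distance through the Google Distance Matrix API. The symbols s, e, a, b and t are introduced here only for the example: s and e are the endpoints of the segment, a and b are two candidate neighbors, and t is the fraction travelled along the segment.

```latex
q(t) = s + t\,(e - s), \qquad 0 \le t \le 1

\lVert q(t) - a \rVert^{2} = \lVert q(t) - b \rVert^{2}
\;\Longrightarrow\;
t^{*} = \frac{\lVert b \rVert^{2} - \lVert a \rVert^{2} - 2\, s \cdot (b - a)}
             {2\,(e - s) \cdot (b - a)},
\qquad (e - s) \cdot (b - a) \neq 0
```

For s = (0, 0), e = (10, 0), a = (2, 1) and b = (8, 1), this gives t* = (65 - 5 - 0) / (2 · 60) = 0.5. The split point is therefore (5, 0): a is the nearest neighbor before it and b after it.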

  • Use an R-tree as the data structure and consider the MBR (minimum bounding rectangle) of each intermediate node
  • Given an entry E and the query line segment q, the subtree of E can contain qualifying points only if mindist(E, q) < SLmaxd, where SLmaxd is the largest distance between any current split point and its nearest neighbor; otherwise E is not qualified
  • The subtree of E must be searched if dist(Si, Si.NN) > mindist(Si, E) for some split point Si, where Si.NN is the current nearest neighbor of Si
  • Entries closer to the line segment are more likely to qualify, so entries that satisfy the above conditions are accessed in increasing order of their minimum distance (distances are found using the Google Distance Matrix API)
  • This yields the set of split points Scover = {split points} and their nearest neighbors
  • Finally, the result is a set of <point, interval> pairs
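To make the shape of the CNN output concrete, here is a minimal brute-force sketch. It is not the R-tree algorithm from the Tao et al. paper: it scans every candidate and uses straight-line (Euclidean) distance instead of the road-network distance from the Google Distance Matrix API. The function name `continuous_nearest_neighbors` is hypothetical, intervals are expressed as fractions of the segment (0 = start, 1 = end), and exact distance ties are assumed not to occur.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Interval = Tuple[float, float]


def continuous_nearest_neighbors(
    start: Point, end: Point, sites: List[Point]
) -> List[Tuple[Point, Interval]]:
    """Return (site, (t_from, t_to)) pairs covering the segment start -> end."""
    if not sites:
        return []
    dx, dy = end[0] - start[0], end[1] - start[1]

    # Nearest neighbor of the segment's start point.
    current = min(
        sites, key=lambda p: (p[0] - start[0]) ** 2 + (p[1] - start[1]) ** 2
    )
    t, eps = 0.0, 1e-12
    result = []
    while True:
        best_t, best_site = 1.0, None
        for p in sites:
            if p == current:
                continue
            # Positive when p gains on the current neighbor as t increases.
            gain = dx * (p[0] - current[0]) + dy * (p[1] - current[1])
            if gain <= 0:
                continue
            # Fraction t_star at which p and current are equally distant.
            num = (
                (p[0] ** 2 + p[1] ** 2)
                - (current[0] ** 2 + current[1] ** 2)
                - 2 * (start[0] * (p[0] - current[0]) + start[1] * (p[1] - current[1]))
            )
            t_star = num / (2 * gain)
            if t + eps < t_star < best_t:
                best_t, best_site = t_star, p
        result.append((current, (t, best_t)))
        if best_site is None:
            return result
        # best_t is a split point: the nearest neighbor changes here.
        current, t = best_site, best_t


pairs = continuous_nearest_neighbors(
    (0.0, 0.0), (10.0, 0.0), [(2.0, 1.0), (8.0, 1.0), (5.0, 5.0)]
)
print(pairs)  # [((2.0, 1.0), (0.0, 0.5)), ((8.0, 1.0), (0.5, 1.0))]
```

In this example the single split point lies halfway along the segment, and the site (5, 5) is never the nearest neighbor, so it does not appear in the result.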

ALGORITHM FOR FINDING POPULAR PLACES BETWEEN TWO POINTS

  • Select the source and destination coordinates, which are the pick-up and drop-off coordinates
  • Call the findNeighbors() method, which returns a set of <Point, interval> pairs, where Point is the nearest neighbor and interval is the interval for which Point is the nearest neighbor
  • Store the result and use the findPopularity() method to find the popularity of each returned point

  • Display the top 5 results based on popularity

FUTURE WORK

  • The results of the analysis can help taxi drivers decide which areas to go to in order to find the most customers and boost their business
  • New taxi businesses can also gain from the analysis
  • Traffic analysis: find which routes and times of day have heavy traffic
  • Provide visualizations such as heat maps and route visualization
  • Decrease traffic congestion and reduce CO2 emissions by encouraging the use of public transport instead of individual transport

  • Find the potential for a pool/shared taxi business

REFERENCES

  • NYC Taxi & Limousine Commission. http://www.nyc.gov/html/tlc/html/about/about.shtml
  • Yufei Tao, Dimitris Papadias, Qiongmao Shen. Continuous Nearest Neighbor Search. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

THANK YOU

QUERIES????
