Big Ambient Data - VLDB

Is it Still Big Data if it Fits in my Pocket?
Dave Campbell, Microsoft
8/31/2011 VLDB 2011

The Journey: Objective
  • Try to separate hype from reality
  • Identify unique new value
  • Is map-reduce a giant step backwards?
  • What are the dominant dimensions of Big Data?

The Journey: Process
  • Engaged and connected with many people
  • Many interesting debates
  • Created, tested, and refined a frame to explain the Big Data phenomenon
  • Major driving forces
  • Encountered independently evolved common patterns
  • Wrote code & prototyped

The Journey: Results
  • An (one) explanation for the phenomenon
  • A set of design & architecture patterns
  • Material to inform an R&D agenda

The Story

The Knowledge Hierarchy
[Figure: hierarchy levels Signal, Data, Information, Knowledge, Knowledge Application; axes Structure / Value and Effort / Latency]

The Current Paradigm
[Figure: sequence of process steps; the step labels are not recoverable from the extracted text. Caption: Time to Insight: Weeks to Months]

Lifecycle of a Question
[Figure labels: Question, Validation, Different Question, Not interesting, Worth asking again?, Make it repeatable, Bring it to production]

A Tightly Coupled System
  • Available data prepared on basis of scope of analysis
[Figure labels: Available Data; Model (Conceptual, Logical, Physical); Scope of Analysis]

Models have traditionally been coupled
  • Conceptual, Logical, Physical
  • Logical model has been scaffolding for physical:
      • Relational: Indexes
      • MOLAP: Aggregations
  • In-memory technologies breaking logical/physical knot!
  • Knowledge domain coupled to conceptual model

Today's Challenge
[Figure labels: Available Data, Model, Scope of Analysis]

  • Data are freely available
  • Ability to model it is much more of a gating factor than raw size
  • Particularly when considering new forms of data

Sensemaking: Intelligence Analysis
Reference: The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis, Peter Pirolli and Stuart Card, 2005

Sensemaking
[Figure labels: Explanation, Explanatory Frame, Data, Support]
  • Interdependent relationship
  • Supports abductive logic
  • Current systems support sensemaking within a modeled domain. Big Data expands this.
Reference: A Data-Frame Theory of Sensemaking, G.A. Klein, et al.

Key Value Proposition
[Figure labels: Available Data, Model (repeated), Traditional System, Traditional System, New System]
Key elements of Big Data value:
  • Reducing friction to produce valuable information
  • Enabling sensemaking over a broader space
  • Enabling model / algorithm generation

Reworking The Knowledge Hierarchy
[Figure: the knowledge hierarchy (Signal, Data, Information, Knowledge, Knowledge Application) drawn twice; axes Structure / Value and Effort / Latency; annotation: Time to insight improvement]

Objective: Change Shape of Two Curves

Emergent Architectural Patterns

Big Data Patterns
  • Have observed some common patterns
  • Many appear to occur via independent evolution
  • Prototyped over personal sources
Patterns:
  • Digital Shoebox
  • Information Production
  • Transform & Load
  • Model Development
  • Monitor, Mine, Manage

Pattern: Digital Shoebox
Intent: Retain all ambient data to enable sensemaking over all available signals
Applicability: Use to create a source data pool to bootstrap subsequent information generation
Description:
  • Enabling trends: cost of data acquisition → $0; cost of data storage → $0
  • Tipping point occurs if: [inequality not recoverable from the extracted text]
  • Must keep modeling and storage costs low to achieve this
Implementation: Augment raw data with sourceID and instanceID and retain on inexpensive but reliable storage

Pattern: Digital Shoebox
Source Model: The natural model in which the data are produced
Acquisition Model: An augmented source model which contains a source identifier and an instance identifier (typically a timestamp)
AcquisitionModel = {sourceID, instanceID, sourceData}
[Figure: table with columns SourceID, InstanceID, and Source; sources A, B, and C, each with instances 1, 2, 3]
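The acquisition model above can be illustrated with a small Python sketch. This is only one possible reading of the slide, not the author's implementation: the function names, the JSON-lines layout, and the sample payloads are assumptions made for illustration, and an in-memory stream stands in for "inexpensive but reliable storage".

```python
import io
import json
from datetime import datetime, timezone


def to_acquisition_record(source_id, source_data, instance_id=None):
    """Wrap one raw record as {sourceID, instanceID, sourceData}.

    instance_id defaults to a UTC timestamp, since the instance is
    typically a timestamp. The raw payload is stored unmodified.
    """
    if instance_id is None:
        instance_id = datetime.now(timezone.utc).isoformat()
    return {
        "sourceID": source_id,
        "instanceID": instance_id,
        "sourceData": source_data,
    }


def append_to_shoebox(shoebox, record):
    """Append one record as a JSON line to any writable text stream."""
    shoebox.write(json.dumps(record) + "\n")


def read_shoebox(shoebox, source_id=None):
    """Yield stored records, optionally restricted to one sourceID."""
    for line in shoebox:
        record = json.loads(line)
        if source_id is None or record["sourceID"] == source_id:
            yield record


# Usage with made-up payloads (hypothetical sample data).
shoebox = io.StringIO()
append_to_shoebox(
    shoebox,
    to_acquisition_record("GPS", "47.64,-122.13", "2011-06-10T06:16:18"),
)
append_to_shoebox(
    shoebox,
    to_acquisition_record("HA", "front door opened", "2011-06-10T06:18:26"),
)
shoebox.seek(0)
gps_records = list(read_shoebox(shoebox, source_id="GPS"))
```

Keeping the payload opaque is the point of the sketch: no modeling is done at acquisition time, so the cost of adding a new source stays low.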

Personal Example
[Figure: acquisition table with sources GPS, HA, and Outlook, each with instances 1, 2, 3]
  • GPS: Have been carrying a GPS data logger for 5 months
  • HA: Log file from home automation system
  • Outlook: Have a script that records when I send mail, to whom, and, if a reply, my response latency

Pattern: Information Production
Intent: Turn acquired data from digital shoebox into other events and states
Applicability: Used to transform raw data into information for subsequent processing
Description: Often requires temporal processing & correlation of acquired data

Key point: Cleansing often much easier in transformed domain
Implementation: Requires environment for parsing, grouping, aggregation, and often joining of acquired data

Information Production
  • Transform: transforms source data into events & states
  • Data cleanup, cleansing & imputation
  • Quite often cleansing happens in transformed domain
      • E.g. Nights on the road vs. @ Home
  • Wind up with a set of composable transforms
  • Produced information stored in Digital Shoebox or downstream system

Personal Example - GPS
[Figure: tree of transforms, Source feeding T1 through T5]
  • Tree of transforms and filters
  • Cleansing often happens in transformed domain
      • E.g. Where I slept each night
  • Can produce higher level information
      • [DwellAtHome], [RouteToWork], [DwellAtWork] = Commute to work
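As a rough illustration of composing lower-level states into higher-level information, the Python sketch below scans a time-ordered list of states for the sequence DwellAtHome, RouteToWork, DwellAtWork and reports each commute's leave time and duration. The tuple layout, the function names, the extra RouteToHome label, and the sample times are all assumptions for illustration; the slides do not describe how the transforms were actually implemented.

```python
from datetime import datetime


def find_commutes(states):
    """Return (leave_time, duration_minutes) for each commute to work.

    `states` is a time-ordered list of (label, start, end) tuples. A commute
    is three consecutive states: DwellAtHome, RouteToWork, DwellAtWork.
    """
    pattern = ("DwellAtHome", "RouteToWork", "DwellAtWork")
    commutes = []
    for i in range(len(states) - 2):
        window = states[i:i + 3]
        if tuple(label for label, _, _ in window) == pattern:
            _, route_start, route_end = window[1]
            minutes = (route_end - route_start).total_seconds() / 60
            commutes.append((route_start, minutes))
    return commutes


def t(text):
    """Parse a 'YYYY-MM-DD HH:MM' string into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


# Made-up sample states; RouteToHome is a hypothetical label.
states = [
    ("DwellAtHome", t("2011-06-09 19:00"), t("2011-06-10 06:45")),
    ("RouteToWork", t("2011-06-10 06:45"), t("2011-06-10 07:10")),
    ("DwellAtWork", t("2011-06-10 07:10"), t("2011-06-10 17:30")),
    ("RouteToHome", t("2011-06-10 17:30"), t("2011-06-10 18:05")),
    ("DwellAtHome", t("2011-06-10 18:05"), t("2011-06-11 06:50")),
]
commutes = find_commutes(states)
```

Each resulting pair is one point of commute duration as a function of leave time, which is the kind of higher-level information the slide describes.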

Using higher level information: Commute duration f(leavingTime)

Commute Time as f(leaveTime)

Event & State Correlation
Dwell geolocation + Outlook statistics = How much email do I send from home vs. at work?

2011-06-10 06:18:26, 2011-06-10 06:16:18, 0.04
2011-06-10 06:21:18, 2011-06-09 08:27:50, 21.89
2011-06-10 06:24:37, 2011-06-09 07:43:58, 22.68
2011-06-10 06:26:48, None, 0.00
2011-06-10 06:29:37, 2011-06-09 06:53:34, 23.60
2011-06-10 06:34:41, 2011-06-09 12:00:25, 18.57
2011-06-10 06:39:52, 2011-06-09 17:44:54, 12.92
2011-06-10 06:43:18, 2011-06-09 14:28:49, 16.24

Pattern: Transform & Load
Intent: Transform acquired data and produced information to load into traditional systems, e.g. Data Warehouse, OLAP cube, etc.

Applicability: Used to load other systems for production use or other analysis
Description:
  • Transformations and queries over the Digital Shoebox are used to load downstream systems
  • Jobs can be scheduled or invoked by other systems
Implementation:
  • Requires repeatable transform mechanism
  • Adapters to downstream systems
  • Scheduling mechanism

Transform & Load
[Figure labels: Acquisition Model, Information Model (repeated), Data Mart, Data Warehouse, CEP System]

Pattern: Model Development
Intent: Enable sensemaking directly over the Digital Shoebox without extensive up front modeling
Applicability: Used to create knowledge from Digital Shoebox contents
Description: Provide a suite of tools which operate efficiently to enable model discovery, refinement and validation
Implementation: Requires exploration, visualization, and statistical tools

Model Development Example
  • It's clear that I'm an early to bed, early to rise, guy
  • When not home, only activity is from the pet-sitter & cleaners
  • Marcia gets up after me and likes to read in bed

Pattern: Monitor, Mine, Manage
Intent: Develop and use generated models to perform active management or intervention

Applicability: Use for fraud detection, system alerting, intrusion detection, user classification
Description: Historical data is used to develop a model (algorithm) which is installed in an active system
Implementation: Requires model generation pattern, active monitoring system [e.g. Complex Event Processing (CEP)]

Pattern: Monitor, Mine, Manage
  1. Monitor & collect data
  2. Mine and create online model
  3. Deploy online model to actively manage
This is about reducing Time to Action!
Examples:
  • Financial fraud detection and prevention
  • Audience intelligence
  • Personal: Home & Away settings for home automation

Pattern Map
[Figure labels: Digital Shoebox, Model Development, Information Production, Monitor, Mine & Manage, Transform & Load]
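The three steps of the Monitor, Mine, Manage pattern (collect data, mine a model, deploy it online) could be sketched in Python as follows. The slides do not say which modeling technique was used, so a simple mean-and-standard-deviation threshold stands in for the mined model here; the function names, the parameter k, and the readings are assumptions for illustration only.

```python
import statistics


def mine_model(history, k=3.0):
    """Step 2: mine a model from collected data.

    The 'model' is just the historical mean plus a tolerance of
    k sample standard deviations (an assumed stand-in technique).
    """
    mean = statistics.fmean(history)
    tolerance = k * statistics.stdev(history)
    return {"mean": mean, "tolerance": tolerance}


def make_monitor(model, on_alert):
    """Step 3: deploy the model as an online check on each new value."""
    def observe(value):
        if abs(value - model["mean"]) > model["tolerance"]:
            on_alert(value)  # active management or intervention hook
            return True
        return False
    return observe


# Step 1: monitor & collect (made-up historical readings).
history = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0]

model = mine_model(history)
alerts = []
observe = make_monitor(model, alerts.append)

# New observations arrive one at a time; only outliers trigger the hook.
flags = [observe(value) for value in [10.5, 30.0, 9.0]]
```

In a real deployment the `observe` callback would sit inside the active monitoring system (for example a CEP engine), and the model would be re-mined as more history accumulates.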

Tying it Together
[Figure: the reworked knowledge hierarchy (Signal, Data, Information, Knowledge, Knowledge Application; axes Structure / Value and Effort / Latency) annotated with the patterns Digital Shoebox, Information Production, Transform & Load, Model Generation, and Monitor, Mine, Manage. Annotation: Time to Insight]

R&D Agenda
  • Improved sensemaking tools:
      • Visualization
      • Temporal and spatial correlation
  • Machine learning
      • Large Ambient Data can eclipse existing methods, e.g. language translation
  • Robust big-data query processing
      • Leverage various degrees of structure and modeling
      • General locality awareness
      • Checkpoint vs. restart tradeoff
  • Emergent intermediate structure: infer and reify dimensions
  • Re-stating history
      • Re-feed downstream systems sourcing from big-data environment
      • Re-think slowly changing dimensions

Wrap up
  • Big Data is multi-faceted
  • Interesting architecture/design patterns emerging
  • Realizing new value requires re-thinking existing system assumptions
  • Time to insight/action should be a driving metric
  • Complements existing data platform
  • Intersection with HPC/TC world
  • This is reshaping information management
