
Using Graph Structure Record Linkage on Irish Census Data with Neo4j


For just over a year, ever since I stayed in the town of Skibbereen, Ireland, I've been obsessed on-and-off with a project: taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together and identify the same residents.

Since then I’ve learned a bit about master data management and record linkage and so I thought I would give it another stab.

Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

The data model of the imported data is very linear:

See How We Used Graph Structure Record Linkage to Extract Insights on Irish Census Data with Neo4j


In this post, however, I’m going to be focusing on Houses and Residents and creating relationships between them based on their properties.

Relations to the Head


To see what these census records look like, have a look at the McCarthys of 1901 and 1911. Charles is the head of the family with his wife Hannah, mother Ellen, children (two in 1901 and seven in 1911), and a servant (Timothy Walsh in 1901 and William Regan in 1911).

McCarthy census data from 1901: The McCarthy family of Barnagowlane, Cloghdowell, Cork, 1901

Surname  | Forename | Age | Sex    | Relation to Head | Religion
McCarthy | Charles  | 37  | Male   | Head of Family   | Roman Catholic
McCarthy | Hannah   | 25  | Female | Wife             | Roman Catholic
McCarthy | William  | 1   | Male   | Son              | Roman Catholic
McCarthy | Bridget  |     | Female | Daughter         | Roman Catholic
McCarthy | Ellen    | 65  | Female | Mother           | Roman Catholic
Walsh    | Timothy  | 25  | Male   | Servant          | Roman Catholic

McCarthy census data from 1911: The McCarthy family of Barnagowlane, Cloghdonnell, Cork, 1911

Surname    | Forename      | Age | Sex    | Relation to Head | Religion
McCarthy   | Charles       | 47  | Male   | Head of Family   | Roman Catholic
McCarthy   | Hannah        | 35  | Female | Wife             | Roman Catholic
McCarthy   | William       | 11  | Male   | Son              | Roman Catholic
McCarthy   | Bridget       | 10  | Female | Daughter         | Roman Catholic
McCarthy   | Ellen         | 8   | Female | Daughter         | Roman Catholic
McCarthy   | Kate          | 6   | Female | Daughter         | Roman Catholic
McCarthy   | Florence      | 4   | Male   | Son              | Roman Catholic
McCarthy   | Charles Peter | 2   | Male   | Son              | Roman Catholic
McCarthy   | Annie         |     | Female | Daughter         | Roman Catholic
McCarthy ? | Ellen         | 75  | Female | Mother           | Roman Catholic
Regan      | William       | 24  | Male   | Servant          | Roman Catholic


The McCarthys are an almost exact match between the 1901 and 1911 census records: the names, ages, occupations and relationships all line up.

Unfortunately the story for other records is not so simple. Many times, houses – which to the human eye seem to be the same house – can have wildly varying details. For example, Hannah might be listed as Hana or Anne in a different census.

Likewise, ages vary a lot more than you might think. In examining the records I regularly found ages varying by a year or two and have even found a few houses with ages off by as much as 10-15 years.

In both censuses, there is a field for residents to fill out called “Relation to Head.” This gives us information about how each resident is related to the head of the house. In the case of the McCarthys, Charles is listed as “Head of Family” in both years. The rest of the family has a nice representation of things that we often see in the data: “Wife,” “Son,” “Daughter,” and “Servant.”

We might be tempted to say, “This person was the head in 1901, so they must be the same person who was the head in 1911.” Often, however, the head of the family can die or retire leaving the role of head of the family to their wife or child.

So, can the “Relation to Head” values still be useful to us to match any given resident from 1901 to another resident in 1911?

First, let’s cover the general process of record linkage I have been using. To find a match for a resident, I start by using an Elasticsearch server (which contains a duplicate of my Neo4j census data) to quickly find a list of other residents with a match on very rough criteria:

    • Is the resident in the other census?
    • Does the sex match (or is it NULL)?
    • Is the resident’s age within 15 years of what it would be expected to be in the other census?
    • Does the name match, roughly (within an edit distance of 4)?

This comes back with anywhere from zero to hundreds of results. I call these “similarity candidates” and for each one, I create a relationship between the original record and the candidate.

With this list, I can compare the attributes of the two records (using the record_linkage gem I created) to see how closely they match.

The closer their name, sex, age, etc. matches, the higher score they get. Ideally, the real match should have the highest score, but that isn’t always true and can take some tuning.

In addition to this simple comparison of attributes, I have now added a process to take advantage of the similarity candidate relationships to compare family relationships.

Let’s start with this example of a sub-graph pattern:

mccarthy_charles_comparison


The relationship CHILD_OF is created whenever there is a “Son” or “Daughter” in the “Relation to Head” field. Likewise, we can create other gender-neutral relationships like MARRIED_TO, SIBLING_OF, NIECE_NEPHEW_OF, etc….
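
A minimal Cypher sketch of how such relationships might be created (the Resident label and the relation_to_head property are assumptions for illustration; House and LIVES_IN appear in the code later in this post):

// Link each son or daughter to the head of their house with a gender-neutral CHILD_OF relationship
MATCH (house:House)<-[:LIVES_IN]-(head:Resident {relation_to_head: 'Head of Family'})
MATCH (house)<-[:LIVES_IN]-(child:Resident)
WHERE child.relation_to_head IN ['Son', 'Daughter']
MERGE (child)-[:CHILD_OF]->(head);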

In this case, the resident in question is the 1901 record for William. When we are evaluating the 1911 record of William as a potential match we can explore other residents in the same house as evidence of similarity.

The diagram above shows that both records have a CHILD_OF relationship to the two “Charles” records which furthermore are linked via a SIMILARITY_CANDIDATE relationship. Because of this we can say that there is a greater chance that the two “William” records represent the same person.

This only gives us the ability to find these relationships between the head of the family and other residents. What about generically matching based on the relationship of any two residents of a house?

Let’s say that Charles died sometime between 1901 and 1911. If his wife Hannah takes over as the head of the family, we would have a sub-graph which looks like this:

mccarthy_hannah_comparison


We could say that when we have the paths -CHILD_OF-><-MARRIED_TO- and -CHILD_OF-> on either side, we can build our case for a match a bit more. This kind of matching can be used on all of the other residents of the house with SIMILARITY_CANDIDATE relationships.

For example, -CHILD_OF-><-CHILD_OF- could be matched to -CHILD_OF-><-CHILD_OF- even in this case, where the wife becomes the head of the house. Or if a child becomes the head then it could be compared to a -SIBLING_OF- relationship.

The Code


So how do we actually do this with Neo4j? First let’s take our sub-graph and turn our nodes into variables:

Record Linkage in Irish Census Data


In this example let’s take resident h1 r1 (house 1, resident 1) as the resident in question and h2 r1 as the candidate that we want to compare it to. This is the sort of query that Neo4j is wonderful at both performing quickly and making easy to formulate.

Let’s look at part of the Ruby code:

def get_similarity_candidate_relationship_paths
  self.query_as(:h1_r1)
    .match('(h1:House), (h2:House)')
    .match('h1<-[:LIVES_IN]-h1_r1-[sc_1:similarity_candidate]-(h2_r1)-[:LIVES_IN]->h2')
    .match('h1<-[:LIVES_IN]-h1_r2-[sc_2:similarity_candidate]-(h2_r2)-[:LIVES_IN]->h2')
    .match('path1=h1_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of
     |cousin_of|child_in_law_of|step_child_of*1..2]-h1_r2')
    .match('path2=h2_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of
     |cousin_of|child_in_law_of|step_child_of*1..2]-h2_r2')
    .pluck(
      :h2_r1,
      'collect([path1, rels(path1), path2, rels(path2)])'
      ).each_with_object({}) do |(r2, data), result|

    result[r2] = data.inject(0) do |total, (path1, rels1, path2, rels2)|
      relations1 = relation_string_from_path_and_rels(path1, rels1)
      relations2 = relation_string_from_path_and_rels(path2, rels2)

      if relations1 == relations2
        1.0
      elsif score = (RELATION_EQUIVILENCE_SCORES[relations1] || {})[relations2]
        score
      else
        -2.0
      end + total
    end
  end
end


Here we start with a Cypher query using the Query API from neo4j.rb. The object upon which we’ve called get_similarity_candidate_relationship_paths is our h1_r1 anchor.

Note here that we match paths with a length of either one or two relationships long from between two residents of the same house. Then we return all residents found via the SIMILARITY_CANDIDATE relationship from our anchor and the family relationship paths aggregated into an Array.
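
For readers who don't use Ruby, the Cypher generated by this query builder looks roughly like the sketch below. The uppercase relationship types follow the prose above, and the id-based anchoring of h1_r1 is an assumption standing in for what query_as does:

// Roughly equivalent Cypher (a sketch; exact casing and anchoring may differ)
MATCH (h1_r1) WHERE id(h1_r1) = {resident_id}
MATCH (h1:House), (h2:House)
MATCH (h1)<-[:LIVES_IN]-(h1_r1)-[:SIMILARITY_CANDIDATE]-(h2_r1)-[:LIVES_IN]->(h2)
MATCH (h1)<-[:LIVES_IN]-(h1_r2)-[:SIMILARITY_CANDIDATE]-(h2_r2)-[:LIVES_IN]->(h2)
MATCH path1 = (h1_r1)-[:BORN_TO|MARRIED_TO|GRANDCHILD_OF|NIECE_NEPHEW_OF|SIBLING_OF|COUSIN_OF|CHILD_IN_LAW_OF|STEP_CHILD_OF*1..2]-(h1_r2)
MATCH path2 = (h2_r1)-[:BORN_TO|MARRIED_TO|GRANDCHILD_OF|NIECE_NEPHEW_OF|SIBLING_OF|COUSIN_OF|CHILD_IN_LAW_OF|STEP_CHILD_OF*1..2]-(h2_r2)
RETURN h2_r1, collect([path1, rels(path1), path2, rels(path2)]);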

Once the Cypher query returns data we call relation_string_from_path_and_rels which is a way of transforming the path into a string like -BORN_TO-><-BORN_TO. This string gives us a simple way to express the path between the two residents as a string.

We then can give a score based on the two paths. If the paths are the same then we say that the score is 1.0. If the pair of paths is something like -BORN_TO-><-BORN_TO and -SIBLING_OF-> then we can give a score based on a lookup.

We add these scores up to give us a total score comparing our anchor resident and each of its similarity candidates. All with just one query to the database.

Challenges


There are a couple of things that I needed to do to make this work:

Previously, I was simply grabbing one resident at a time, finding all of the similarity candidates, and then creating a set of relationships to link the resident with the candidates and to store the record linkage scores (both the individual scores for fields and the total score).

However, this approach requires all of the candidates in the house to have SIMILARITY_CANDIDATE relationships in order to compare family relationships. So now I first process all residents for a house to create the similarity candidate relationships and store the record linkage scores and then go through them again with the graph-based comparisons and store that score and update the total.

Beyond that, there is the conceptual problem of determining the scoring when comparing paths. For example, if somebody was BORN_TO the head one year but their spouse takes over as the head, could we say that they're BORN_TO the spouse if they are a step-child? Family relationships are complicated and don't always fit neatly into our properties and algorithms.

Conclusion


Most record linkage focuses on the properties of an object, but we need to remember that relationships are data about our entities too. With Neo4j, we have a powerful tool for analyzing those data relationships naturally and quickly.

Additionally I have found that the ability to create relationships on the fly to aggregate calculations like the ones discussed above is a wonderful way to find the best solution quickly.



Want to learn more about how to use Neo4j for your project? Click below to get your free copy of the Learning Neo4j ebook and get up to speed with the world’s leading graph database.

The post Using Graph Structure Record Linkage on Irish Census Data with Neo4j appeared first on Neo4j Graph Database.


The Future of Recommendation Engines: Graph-Aided Search

Learn How the Future of Real-Time Recommendation Engines Merges with Graph-Aided Search

Editor’s Note: GraphAware is a Silver sponsor of GraphConnect San Francisco. Register for GraphConnect to meet Michal and other sponsors in person.

For the last couple of years, Neo4j has been increasingly popular as the technology of choice for people building real-time recommendation engines.

Having been at the forefront of the graph movement through client engagements and open source software development, we have identified the next step in the natural evolution of graph-based recommendation engines. We call it Graph-Aided Search.

Recommendation Engines Everywhere


At first glance, it may seem that graph databases are only good for social networks, but it has been proven over and over again that the variety of domains and industries that need a graph database to store, analyse and query connected data could not be any wider.

Similarly, recommendation engines go far beyond retail – the most obvious industry. We’ve seen real-time recommendations with Neo4j applied to finding:

    • Matches on dating sites (Dating, Social)
    • People one may know in professional networks (Social)
    • Ideal candidates for clinical trials (Pharma)
    • Fraudsters (Banking, Insurance, Retail)
    • Criminals (Law Enforcement)
    • Events of interest (Event Planning)
    • And many more

Real-Time Recommendations


The reasons for wanting to implement a system that serves recommendations in real-time and for choosing a native graph database to do that have been well understood and written about.

Once the technology choice has been made, there are three main challenges to building such a recommendation engine. The first one is to discover the items to recommend. The second is to choose the most relevant ones to present to the user. Finally, the third challenge is to find relevant recommendations as quickly as possible.

Typically, the input to the recommendation engine is an object (e.g., a user) for which we would like to determine the recommendations. Such an object is represented in the graph as a node, so the whole process is effectively a traversal through the network, finding paths from the input node to other nodes, some of which will be deemed as the most relevant ones and served as recommendations.

Last year, GraphAware built an open source recommendation engine skeleton that runs as a Neo4j extension and provides a foundation to address the three challenges outlined above.

It does so by allowing developers to plug in their (path-finding) business logic into a best-practice architecture, resulting in a fast, flexible, yet simple and maintainable piece of software. The architecture imposes the separation of concerns between the plug-in components that:

    • Discover all possible recommendations
    • Apply a score to the identified recommendations
    • Filter out irrelevant or blacklisted recommendations
    • Optionally record why and how fast the recommendations were served
The skeleton is responsible for sorting by relevance, performance optimisations, thread-safety and other “frameworky” features.

Since its first release, the GraphAware Recommendation Engine has been used by teams all around the world to build production-ready recommendation functionality into their applications.

Search Engines


The vast majority of websites and other systems today provide some sort of search capability, allowing users to find what they are looking for very quickly. Lucene-based search engines, such as Elasticsearch and Apache Solr are the leading technologies in this space.

Like recommendation engines, search engines also serve results in real-time, sorted by decreasing relevance. However, the input to these systems is typically a string of characters and the results are matching documents (items).

Without adding extra complexity, the user performing the search is not taken into account. Hence, two users searching for the same thing will get the same results.

Graph-Aided Search


For the same reasons people are interested in personalising recommendations, they also want to personalise search results.

To see an example of such personalisation in practice, just head to LinkedIn and type the first name of one of your connections into the search box. That connection will appear at the top of the results. Not because they are the most important person with that first name on LinkedIn, but because they are most likely the person you are looking for.

One can treat such functionality as a recommendation engine with all candidate recommendations provided by an external system (search engine in this case), as opposed to discovered by the recommendation engine itself.

Applying the “right tool for the job” philosophy, we can use the search (S) and recommendation (R) engines together to achieve what we call Graph-Aided Search:

    • Discover all matching recommendations (S)
    • Apply a score to the recommendations based on textual match (S)
    • Apply a score to the recommendations based on the user’s graph (R)
    • Filter out irrelevant or blacklisted recommendations (R)
This way, the power of both systems can be used to build personalised search functionality.
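
As a rough illustration of the idea (not GraphAware's actual implementation), the graph-side scoring step could look something like this Cypher sketch, assuming User and Item nodes with id properties and the search engine's hits passed in as a parameter:

// Re-score search hits for a given user by how closely each hit is connected to them in the graph
WITH {searchHits} AS hits                      // e.g. [{id: 42, textScore: 3.1}, ...] from the search engine
MATCH (u:User {id: {userId}})
UNWIND hits AS hit
MATCH (item:Item {id: hit.id})
OPTIONAL MATCH path = shortestPath((u)-[*..3]-(item))
RETURN item,
       hit.textScore + (CASE WHEN path IS NULL THEN 0.0 ELSE 1.0 / length(path) END) AS score
ORDER BY score DESC;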

Learn More at GraphConnect


At GraphAware, we are currently finalising the development of enterprise-ready extensions to Neo4j and Elasticsearch for bi-directional integration of the two systems, so that they can be easily combined to provide Graph-Aided Search.

We will launch and open source both extensions at this year’s GraphConnect San Francisco.

If you are interested in real-time recommendations, personalising search results or integrating Neo4j with a search engine such as Elasticsearch, come see my presentation at GraphConnect, starting at 2:20 p.m.


Register to hear Michal Bachman’s presentation on real-time recommendation engines and the future of search – along with many other industry-leading presentations – at GraphConnect San Francisco on October 21st.

The post The Future of Recommendation Engines: Graph-Aided Search appeared first on Neo4j Graph Database.

GraphGrid Interview: Neo4j for Your Modern Graph Data Architecture

Read This Interview with the Co-Creators of GraphGrid about Neo4j for Your Graph Data Architecture

Editor’s Note: GraphGrid is a Bronze sponsor of GraphConnect San Francisco. Register for GraphConnect to meet Ben and Brad and other sponsors in person.

I recently sat down with Ben and Brad Nussbaum, the co-creators of GraphGrid to talk more about their role as one of our Neo4j solution partners and to dive deeper into their Neo4j Enterprise platform-as-a-service offering.

Here’s what we covered:

Talk to me about GraphGrid. What’s your story?


So to understand GraphGrid, let’s dive into a little back story: We co-founded AtomRain nearly seven years ago with the vision to create an elite engineering team capable of providing enterprises worldwide with real business value by solving their most complex business challenges. As we figured out what that looked like practically, we found ourselves moving deeper down the technology stack into the services and data layer where we handled all the heavy lifting necessary to integrate data sources and provide the functionality, performance and scale needed to deliver powerful enterprise service APIs.

In early 2012, we had our first exposure to Neo4j and experienced first hand the potential of graph databases and over the next couple of years refined the integration of Neo4j into our enterprise technology stack.

After delivering multiple enterprise solutions built around Neo4j that required the same foundation, we experienced the same pain of our customers, who often spent months laying a solid foundation of integration and operations around Neo4j before they could focus on the core services functionality their business needed.

This sparked many conversations over the next several months. We just needed to figure out the best way to meet the need so it would be accessible to all enterprises interested in using Neo4j. One morning, in February 2015 we called the folks at Neo Technology, pitched the idea and worked out the details for such an offering. We wanted to make sure we were in lock step with them on this to make sure it aligned with their objectives as well.

From there, we set out to define the initial requirements for a data integration and service platform that would help enterprises succeed on their Neo4j journey, picked a name, assembled the team of engineers that would be building it and began our big investment in what we see is an incredible future for the graph.

GraphGrid is the full suite of essential data import, export and routing capabilities for utilizing Neo4j within your modern data architecture. At its core, GraphGrid enables seamless multi-region global Neo4j Enterprise cluster management with automatic failover for disaster recovery.

A powerful job framework enables our graph analytics and job processing, which removes the need to move data out of Neo4j to do analytics and batch processing of data because you just deploy your algorithms as extensions and write the results back to the graph. Elasticsearch auto-indexing keeps your search cluster updated with the latest data from your graph with, at most, a few seconds of latency.

How does GraphGrid address global scalability and failover?


The GraphGrid platform is designed to solve high availability challenges. We support clusters that span multiple geo regions and multiple data centers within a region. The time to failover is usually a few seconds if an entire region goes offline and geo load balancing is implemented to take advantage of the second region capacity.

What is GraphGrid’s approach to security, especially in light of so many cloud providers recently being hacked?


Good security starts with your employees. Our team approaches security with respect and discipline. We design and develop with security as a first step.

One of the most important features of the GraphGrid platform is the ability to deploy clusters into segmented networks unreachable by other customer instances. In this way, customers gain a higher degree of security on an instance level.

What experience do you and your team bring to this endeavor?


We really can’t say enough about how much I appreciate our team.

We’ve been fortunate to work with engineers with outstanding character, great work ethic and a desire for continued learning and personal improvement. We’ve been working together for nearly four years solving complex engineering challenges in mission critical environments for global enterprises and every time we deliver a solution, I see tremendous growth taking place across the organization because getting across the finish line in a demanding enterprise environment is trial by fire.

It’s that very refinement in engineering rigor and discipline coming from detailed architecture defenses of the software systems we’ve delivered that has prepared us to build GraphGrid in a manner that provides real business value to enterprises worldwide by standing up under their work load day in and day out.

From a purely quantifiable perspective, we’ve been working with Neo4j since version 1.6 (early 2012) and eight of our engineers are certified Neo4j professionals.

Did timing have anything to do with your creation of GraphGrid?


We’ve been in the trenches building and managing enterprise Neo4j clusters for nearly four years as part of our custom enterprise software solutions and through this we’ve seen a recurring trend in needs across enterprises utilizing Neo4j to the point where it made sense to create a platform capable of meeting these needs for general enterprises without needing to rebuild the foundation every time.

Additionally, Forrester validated what we were seeing in the market with their projections. According to Forrester research, by 2017, 25% of the top 2000 enterprises will be utilizing a graph database. So for us it made sense to rally our team to bring this to market so enterprises would have a trusted foundation with proven patterns for taking advantage of graph in their architecture.

How does GraphGrid work together with Neo Technology?


We’ve worked closely with Neo Technology as a trusted solution partner consulting with many of their enterprise customers on their implementations from embedded on bare metal on-premise to stand alone on virtual instances on cloud infrastructure.

From the inception of GraphGrid, we’ve been engaged with Neo about the exact platform capabilities and service offerings to make sure it all aligns closely with their enterprise strategies and objectives. We’ve received great feedback from the team at Neo on the platform and continue to work closely with them going forward.

Why does the enterprise market need GraphGrid?


The corporate world is full of “safe” technology choices with over twenty years of successful production usage by global enterprises to justify the selection. This immediately puts Neo4j behind the eight ball with usage dependent on a value proposition worth the risk of choosing a less established technology.

It’s that very value proposition that we first experienced in 1.6 when we used Neo4j for a complex media workflow solution. This is the enterprise landscape and GraphGrid exists to make Neo4j a proven, safe, reliable and trusted technology choice by software and solution architects worldwide.

What about startups? Is GraphGrid appropriate for them too?


Definitely. In some cases, we’ve actually seen it provide an even greater boost to startups than established enterprises. Here’s why: A startup, whether well-funded or bootstrapping, is generally trying to run lean and allocate budget for personnel that will be building critical functionality that gets their product or service to market and maximizes the value of the company.

By offloading their DevOps requirements to the GraphGrid Data Platform and utilizing our Development Quick Start package with proven graph templates, enterprise patterns and direct access to our certified Neo4j professional engineers, they go from zero to 60 overnight instead of spending the first 9-12 months laying a solid foundation that will propel their company forward.

So as a startup, when you get to start building on an already proven foundation and focus on building functionality that maximizes your value, your potential for success and ROI increases exponentially.

What is the biggest benefit of using GraphGrid?


The biggest benefit of using GraphGrid is the wealth of enterprise software and enterprise Neo4j development and DevOps experience you have at your fingertips to guide you in your graph journey. Our team of certified Neo4j professional engineers has been delivering enterprise Neo4j solutions together for the last three years.

The biggest benefit of the platform itself is the data integration and scalable graph job processing frameworks that have been put in place around Neo4j. This gives enterprises tremendous flexibility to smoothly flow data into and retrieve data from Neo4j while seamlessly scaling up additional resources as needed to meet peak demand and perform graph analytics and job processing.

How does GraphGrid differ from other Neo4j cloud hosting companies?


We’ve poured our combined decades of experience delivering mission-critical enterprise software into every aspect of the architecture, design and development of the GraphGrid Data Platform to ensure it is able to withstand the rigor of an enterprise workload.

The two big practical differences are:
    1. We only deploy and manage Neo4j Enterprise clusters. It is not an option to deploy Neo4j Community on the platform because it’s never acceptable to go to production without high availability (HA).
    2. We deploy across 9 regions and 27 availability zones around the world so anyone using GraphGrid instantly has a global reach.
    3. A bonus one is our enterprise security architecture that is part of the foundation of the infrastructure: All instances by default are deployed into a VPC with dedicated subnets utilizing access control lists to manage infrastructure access within an organization.
(Ben chuckles) I guess that’s already more than two and I could keep going. We’re just very excited about the differential benefits that we’ve experienced with our customers’ solutions using our platform compared to before when we were delivering solutions without GraphGrid.

How can other solution partners or developers benefit from GraphGrid?


We have a Consulting Partner program, and one of our goals with it is to unify the other partners’ product offerings at the application framework and visualization layers by providing a platform that handles the heavy data lifting and operational concerns for them, so they can focus on the application features being built for their specific use case.

Great. Thanks so much for taking the time to interview and we look forward to seeing more of you guys at GraphConnect.


Our pleasure. Stop by the GraphGrid booth at GraphConnect San Francisco and say “Hi.”


Register below to meet and network with Ben and Brad Nussbaum of GraphGrid – and many other graph database leaders – at GraphConnect San Francisco on October 21st.

The post GraphGrid Interview: Neo4j for Your Modern Graph Data Architecture appeared first on Neo4j Graph Database.

From the Neo4j Community: February 2016

Explore All of the Great Articles & Blog Posts Created by the Neo4j Community in February 2016

In the Neo4j community last month, love was in the air.

That love expressed itself as more nodes than ever in our community content. From articles and podcasts to GraphGists and other projects, our global graph of community members keeps growing strong!

Below we’ve rolled out the red carpet for a few of our favorite pieces from the Neo4j community in February. Enjoy!

If you would like to see your post featured in April’s “From the Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.

Articles and Blog Posts


Podcasts and Audio


Slides and Presentations


Libraries, GraphGists and Code Repos


Other Projects




What’s better than the online Neo4j community? The Neo4j community in person! Click below to register for GraphConnect Europe to mix and mingle with world-changing graphistas from organizations across the globe!

The post From the Neo4j Community: February 2016 appeared first on Neo4j Graph Database.

APOC 1.1.0 Release: Awesome Procedures on Cypher

Learn what's new in the 1.1.0 release of the Awesome Procedures on Cypher (a.k.a. "APOC") library

I’m super thrilled to announce last weeks 1.1.0 release of the Awesome Procedures on Cypher (APOC). A lot of new and cool stuff has been added and some issues have been fixed.

Thanks to everyone who contributed to the procedure collection, especially Stefan Armbruster, Kees Vegter, Florent Biville, Sascha Peukert, Craig Taverner, Chris Willemsen and many more.

And of course my thanks go to everyone who tried APOC and gave feedback, so that we could improve the library.

If you are new to Neo4j’s procedures and APOC, please start by reading the first article of my introductory blog series.

The APOC library was first released as version 1.0 in conjunction with the Neo4j 3.0 release at the end of April with around 90 procedures and was mentioned in Emil’s Neo4j 3.0 release keynote.

In early May we had a 1.0.1 release with a number of new procedures especially around free text search, graph algorithms and geocoding, which was also used by the journalists of the ICIJ for their downloadable Neo4j database of the Panama Papers.

And now, two months later, we’ve reached 200 procedures that are provided by APOC. These cover a wide range of capabilities, some of which I want to discuss today. In each section of this post I’ll only list a small subset of the new procedures that were added.

If you want to get more detailed information, please check out the documentation with examples.

Notable Changes


As the 100 new procedures represent quite a change, I want to highlight the aspects of APOC that got extended or documented with more practical examples.

Metadata


Besides the apoc.meta.graph functionality that was there from the start, additional procedures to return and sample graph metadata have been added. Some, like apoc.meta.stats, access the transactional database statistics to quickly return information about label and relationship-type counts.

There are now also procedures to return and check the types of values and properties.

CALL apoc.meta.subGraph({config})

examines a sample sub graph to create the meta-graph, default sampleSize is 100
config is: {labels:[labels],rels:[rel-types],sample:sample}

CALL apoc.meta.stats YIELD labelCount, relTypeCount, propertyKeyCount, nodeCount, relCount, labels, relTypes, stats

returns the information stored in the transactional database statistics

CALL apoc.meta.type(value)

type name of a value (INTEGER,FLOAT,STRING,BOOLEAN,RELATIONSHIP,NODE,PATH,NULL,UNKNOWN,MAP,LIST)

CALL apoc.meta.isType(value, type)

returns a row if the type name matches, none if not
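
For example, a quick overview of the database can be pulled from the statistics store; the YIELD columns are the ones listed above:

CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypes
RETURN nodeCount, relCount, labels, relTypes;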

Data Import / Export


The first export procedures output the provided graph data as Cypher statements in the format that neo4j-shell understands and that can also be read with apoc.cypher.runFile.

Indexes and constraints as well as batched sets of CREATE statements for nodes and relationships will be written to the provided file-path.

apoc.export.cypherAll(file, config)

exports whole database incl. indexes as cypher statements to the provided file

apoc.export.cypherData(nodes, rels, file, config)

exports given nodes and relationships incl. indexes as cypher statements to the provided file

apoc.export.cypherQuery(query, file, config)

exports nodes and relationships from the cypher statement incl. indexes as cypher statements to the provided file
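
A minimal usage sketch (the file path and the batchSize config key are only examples):

CALL apoc.export.cypherAll('/tmp/export-all.cypher', {batchSize: 1000});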

Data Integration with Cassandra, MongoDB and RDBMS


Making integration with other databases easier is a big aspiration of APOC.

Being able to directly read and write data from these sources using Cypher statements is very powerful, as Cypher is an expressive data-processing language that allows a wide variety of filtering, cleansing, conversion and preparation of the original data.

APOC integrates with relational (RDBMS) and other tabular databases like Cassandra using JDBC. Each row returned from a table or statement is provided as a map value to Cypher to be processed.

And for Elasticsearch, the same is achieved by using the underlying JSON-over-HTTP functionality. For MongoDB, we support connecting via their official Java driver.

To avoid listing full database connection strings with usernames and passwords in your procedures, you can configure those in $NEO4J_HOME/conf/neo4j.conf using the apoc.{jdbc,mongodb,es}.<name>.url config parameters, and just pass name as the first parameter in the procedure call.

Here is a part of the Cassandra example from the data integration section of the docs using the Cassandra JDBC Wrapper.

Entry in neo4j.conf
apoc.jdbc.cassandra_songs.url=jdbc:cassandra://localhost:9042/playlist

CALL apoc.load.jdbc('cassandra_songs', 'track_by_artist') YIELD row
MERGE (a:Artist {name: row.artist})
MERGE (g:Genre {name: row.genre})
CREATE (t:Track {id: toString(row.track_id), title: row.track, length: row.track_length_in_seconds})
CREATE (a)-[:PERFORMED]->(t)
CREATE (t)-[:GENRE]->(g);

// Added 63213 labels, created 63213 nodes, set 182413 properties, created 119200 relationships.

For each data source that you want to connect to, just provide the relevant driver in the $NEO4J_HOME/plugins directory as well. It will then automatically be picked up by APOC.

Even if you just visualize what kinds of graphs are hidden in that data, there is already a big benefit in being able to do so without leaving the comfort of Cypher and the Neo4j Browser.

To render virtual nodes, relationships and graphs, you can use the appropriate procedures from the apoc.create.* package.

Controlled Cypher Execution


While individual Cypher statements can be run easily, more complex executions – like large data updates, background executions or parallel executions – are not yet possible out of the box.

These kinds of abilities are added by the apoc.periodic.* and apoc.cypher.* packages. Especially apoc.periodic.iterate and apoc.periodic.commit are useful for batched updates.

Procedures like apoc.cypher.runMany allow execution of semicolon-separated statements and apoc.cypher.mapParallel allows parallel execution of partial or whole Cypher statements driven by a collection of values.

CALL apoc.cypher.runFile(file or url) yield row, result

runs each statement in the file, all semicolon separated – currently no schema operations

CALL apoc.cypher.runMany('cypher;\nstatements;',{params})

runs each semicolon separated statement and returns summary – currently no schema operations

CALL apoc.cypher.mapParallel(fragment, params, list-to-parallelize) yield value

executes fragment in parallel batches with the list segments being assigned to _

CALL apoc.periodic.commit(statement, params)

repeats a batch update statement until it returns 0; this procedure is blocking

CALL apoc.periodic.countdown('name',statement,delay-in-seconds)

submits a repeatedly-called background statement until it returns 0

CALL apoc.periodic.iterate('statement returning items', 'statement per item', {batchSize:1000,parallel:true}) YIELD batches, total

runs the second statement for each item returned by the first statement. Returns the number of batches and the total of processed rows
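
For example, apoc.periodic.commit lets you run a large delete or update in batches without building up one huge transaction; the TempImport label here is just a placeholder:

CALL apoc.periodic.commit(
  'MATCH (n:TempImport) WITH n LIMIT {limit} DETACH DELETE n RETURN count(*)',
  {limit: 10000});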

Schema / Indexing


Besides the manual index update and query support that was already there in the APOC release 1.0, more manual index management operations have been added.

CALL apoc.index.list() - YIELD type,name,config

lists all manual indexes

CALL apoc.index.remove('name') YIELD type,name,config

removes manual indexes

CALL apoc.index.forNodes('name',{config}) YIELD type,name,config

gets or creates manual node index

CALL apoc.index.forRelationships('name',{config}) YIELD type,name,config

gets or creates manual relationship index

There is pretty neat support for free text search that is also detailed with examples in the documentation. It allows you, with apoc.index.addAllNodes, to add a number of properties of nodes with certain labels to a free text search index which is then easily searchable with apoc.index.search.

apoc.index.addAllNodes('index-name',{label1:['prop1',…​],…​})

add all nodes to this full text index with the given properties; additionally populates a 'search' index

apoc.index.search('index-name', 'query') YIELD node, weight

search for the first 100 nodes in the given full text index matching the given Lucene query, returned by relevance
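
A small sketch of the two calls together, assuming Person nodes with a name property:

CALL apoc.index.addAllNodes('people', {Person: ['name']});

CALL apoc.index.search('people', 'name:kate~') YIELD node, weight
RETURN node.name, weight
ORDER BY weight DESC;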

Collection & Map Functions


While Cypher already has great support for handling maps and collections, there are always some capabilities that are not possible yet. That’s where APOC’s map and collection functions come in. You can now dynamically create, clean and update maps.

apoc.map.fromPairs([[key,value],[key2,value2],…​])

creates map from list with key-value pairs

apoc.map.fromLists([keys],[values])

creates map from a keys and a values list

apoc.map.fromValues([key,value,key1,value1])

creates map from alternating keys and values in a list

apoc.map.setKey(map,key,value)

returns the map with the value for this key added or replaced

apoc.map.clean(map,[keys],[values]) yield value

removes the keys and values (e.g. null-placeholders) contained in those lists, good for data cleaning from CSV/JSON

There are means to convert and split collections to other shapes and much more.

apoc.coll.partition(list,batchSize)

partitions a list into sublists of batchSize

apoc.coll.zip([list1],[list2])

zips the two lists into a list of value pairs

apoc.coll.pairs([list])

returns adjacent pairs of the list: [first,second],[second,third], …

apoc.coll.toSet([list])

returns a unique list backed by a set

apoc.coll.split(list,value)

splits the collection on the given value into rows of lists; the value itself will not be part of the resulting lists

apoc.coll.indexOf(coll, value)

position of value in the list

You can take the union, intersection or disjunction of two collections, and much more.

apoc.coll.union(first, second)

creates the distinct union of the 2 lists

apoc.coll.intersection(first, second)

returns the unique intersection of the two lists

apoc.coll.disjunction(first, second)

returns the disjunct set of the two lists
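
In this release these are procedures rather than functions, so they are invoked with CALL … YIELD; a couple of minimal examples (assuming, as with most APOC procedures, that the yielded column is called value):

CALL apoc.map.fromPairs([['name','Ada'],['born',1815]]) YIELD value
RETURN value.name, value.born;

CALL apoc.coll.partition(range(1,10), 3) YIELD value
RETURN value;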

Graph Representation


There are a number of operations on a graph that return a subgraph of nodes and relationships. With the apoc.graph.* operations you can create such a named graph representation from a number of sources.

apoc.graph.from(data,'name',{properties}) yield graph

creates a virtual graph object for later processing; it tries its best to extract the graph information from the data you pass in

apoc.graph.fromPaths([paths],'name',{properties})

creates a virtual graph object for later processing

apoc.graph.fromDB('name',{properties})

creates a virtual graph object for later processing

apoc.graph.fromCypher('statement',{params},'name',{properties})

creates a virtual graph object for later processing

The idea is that other operations (like export or updates), but also graph algorithms, can be executed on top of this graph representation. The general structure of this representation is:

{
 name:"Graph name",
 nodes:[node1,node2],
 relationships: [rel1,rel2],
 properties:{key:"value1",key2:42}
}
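
For instance, a named virtual graph can be built straight from a Cypher statement; here is a sketch using the built-in movies example data:

CALL apoc.graph.fromCypher(
  'MATCH (m:Movie)<-[r:DIRECTED]-(p:Person) RETURN m, r, p',
  {}, 'directors', {}) YIELD graph
RETURN graph.name, size(graph.nodes) AS nodes, size(graph.relationships) AS relationships;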

Plans for the Future


Of course, it doesn’t stop here. As outlined in the readme, there are many ideas for future development of APOC.

One area to be expanded is graph algorithms, along with the quality and performance of their implementations. We also want to extend import and export capabilities, for instance for GraphML and binary formats.

Something that in the future should be more widely supported by APOC procedures is to work with a subgraph representation of a named set of nodes, relationships and properties.

Conclusion


There is a lot more to explore, just take a moment and have a look at the wide variety of procedures listed in the readme.

Going forward I want to achieve a more regular release cycle of APOC. Every two weeks there should be a new release so that everyone benefits from bug fixes and new features.

Now, please go and give it a try!

Cheers,

Michael


Want to take your Neo4j skills up a notch? Take our (newly revamped!) online training class, Neo4j in Production, and learn how to scale the world’s leading graph database to unprecedented levels.



Catch up with the rest of the Introduction to APOC blog series:

The post APOC 1.1.0 Release: Awesome Procedures on Cypher appeared first on Neo4j Graph Database.

From the Neo4j Community: July 2016

Explore all of the great articles created by the Neo4j community in July 2016

The Neo4j community has been very active this summer – that much is obvious. What you might not have noticed is how many new integrations and drivers have been published for new programming languages in the last month. Highlights include: Golang, Grails, PHP, Elixir and Elasticsearch.

Of course, there’s even more great content to explore this month around the Tour de France and even one on a Pokémon graph. Happy reading!

If you would like to see your post featured in August 2016’s “From the Community” blog post, follow us on Twitter and use the #Neo4j hashtag for your chance to get picked.

Articles and Blog Posts


Podcasts and Audio


Videos


Slides and Presentations


Libraries, GraphGists and Code Repos



Love the Neo4j community? Now’s your chance to meet them all in person! Click below to register for GraphConnect San Francisco for this year’s biggest – and best – graph technology event.

The post From the Neo4j Community: July 2016 appeared first on Neo4j Graph Database.

APOC: Database Integration, Import and Export with Awesome Procedures On Cypher


If you haven’t seen the first part of this series, make sure to check out the first article to get an introduction to Neo4j’s user defined procedures and check out our APOC procedure library.

New APOC Release


First of all, I want to announce that we just released APOC version 3.0.4.1. You might notice the new versioning scheme, which became necessary after SPI changes in Neo4j 3.0.4 caused earlier versions of APOC to break.

That’s why we decided to release APOC versions that are tied to the Neo4j version from which they are meant to work. The last number is an ever increasing APOC build number, starting with 1.

So if you are using Neo4j 3.0.4 please upgrade to the new version, which is available as usual from http://github.com/neo4j-contrib/neo4j-apoc-procedures/releases.

Notable changes since the last release (find more details in the docs):

    • Random graph generators (by Michal Bachman from GraphAware)
    • Added export (and import) for GraphML apoc.export.graphml.*
    • PageRank implementation that supports pulling the subgraph to run on with a Cypher statement: apoc.algo.pageRankCypher (by Atul Jangra from RightRelevance)
    • Basic weakly connected components implementation (by Tom Michiels and Sascha Peukert)
    • Better error messages for load.json and periodic.iterate
    • Support for leading wildcards “*foo” in apoc.index.search (by Stefan Armbruster)
    • apoc.schema.properties.distinct provides distinct values of indexed properties using the index (by Max de Marzi)
    • Timeboxed execution of Cypher statements (by Stefan Armbruster)
    • Linking of a collection of nodes with apoc.nodes.link in a chain
    • apoc.util.sleep e.g., for testing (by Stefan Armbruster)
    • Build switched to gradle, including release (by Stefan Armbruster)

We also got a number of documentation updates from active contributors like Dana, Chris, Kevin and Viksit.

Thanks so much to everyone for contributing to APOC. We’re now at 227 procedures and counting! 🙂

If you missed it, you can also see what was included in the previous release: APOC 1.1.0.

But now back to demonstrating the main topics for this blog post:

Database Integration & Data Import


Besides the flexibility of the graph data model, for me personally the ability to enrich your existing graph by relating data from other data sources is a key advantage of using a graph database.

And Neo4j data import has been a very enjoyable pastime of mine, as you know if you have followed my activities over the last six years.

With APOC, I got the ability to pull data import capabilities directly into Cypher so that a procedure can act as a data source providing a stream of values (e.g., rows). Those are then consumed by your regular Cypher statement to create, update and connect nodes and relationships in whichever way you want.

apoc.load.json


Because it is so close to my heart, I first started with apoc.load.json. Then I couldn’t stop anymore and added support for XML, CSV, GraphML and a lot of databases (including relational & Cassandra via JDBC, Elasticsearch, MongoDB and CouchBase (upcoming)).

All of these procedures are used in a similar manner. You provide some kind of URL or connection information and then, optionally, queries / statements to retrieve data in rows. Those rows are usually maps that map columns or fields to values; depending on the data source, these maps can also be deeply nested documents.

Those can be processed easily with Cypher. The map and collection lookups, functions, expressions and predicates help a lot with handling nested structures.

Let’s look at apoc.load.json. It takes a URL and optionally some configuration and returns the resulting JSON as one single map value, or if the source is an array of objects, then as a stream of maps.

The mentioned docs and previous blog posts show how to use it for loading data from Stack Overflow or Twitter search. (You have to pass in your Twitter bearer token or credentials).

Here I want to demonstrate how you could use it to load a graph from http://onodo.org, a graph visualization platform for journalists and other researchers that want to use the power of the graph to draw insights from the connections in their data.

I came across a tweet about it this week, and while checking out their really neat graph editing and visualization UI, I saw that both nodes and relationships for each publicly shared visualization are available as JSON.

To load the mentioned Game of Thrones graph, I just had to grab the URLs for nodes and relationships, have a quick look at the JSON structures and re-create the graph in Neo4j. Note that for creating dynamic relationship-types from the input data I use apoc.create.relationship.

call apoc.load.json("https://onodo.org/api/visualizations/21/nodes/") yield value
create (n:Person) set n+=value
with count(*) as nodes
call apoc.load.json("https://onodo.org/api/visualizations/21/relations/") yield value
match (a:Person {id:value.source_id})
match (b:Person {id:value.target_id})
call apoc.create.relationship(a,value.relation_type,{},b) yield rel
return nodes, count(*) as relationships

Learn all about how to use APOC for database integration as well as data import and export


apoc.load.xml


The procedure for loading XML works similarly, only that I had to convert the XML into a nested map structure to be returned.

While apoc.load.xml maintains the order of the original XML, apoc.load.xmlSimple aggregates child elements into entries with the element name as a key and all the children as a value or collection value.

book.xml from Microsoft:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <author>Arciniegas, Fabio</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
…

WITH "https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/master/src/test/resources/books.xml" as url
call apoc.load.xmlSimple(url)

{_type: "catalog", _book: [
  {_type: "book", id: "bk101",
    _author: [{_type: "author", _text: "Gambardella, Matthew"},{_type: author, _text: "Arciniegas, Fabio"}],
    _title: {_type: "title", _text: "XML Developer's Guide"},
    _genre: {_type: "genre", _text: "Computer"},
    _price: {_type: "price", _text: "44.95"},
    _publish_date: {_type: "publish_date", _text: "2000-10-01"},
    _description: {_type: description, _text: An in-depth look at creating applications ....

You will find more examples in the documentation.

Relational Databases and Cassandra via JDBC


In past articles and documentation, we demonstrated how to use apoc.load.jdbc with JDBC drivers, the workhorse of Java database connectivity, to connect to and retrieve data from relational databases.

The usage of apoc.load.jdbc mostly reduces to dropping the database vendor’s jdbc-jar file into the $NEO4J_HOME/plugins directory and providing a jdbc-url to the procedure. Then you can declare either a table name or full statement that determines which and how much data is pulled from the source.

To protect the auth information it is also possible to configure the jdbc-url in $NEO4J_HOME/conf/neo4j.conf under the apoc.jdbc.<alias>.url. Then instead of the full jdbc-url, you only provide the alias from the config.

As JDBC at its core is mostly about sending parametrized query strings to a server and returning tabular results, many non-relational databases also provide JDBC drivers. For example, Cassandra.

You can even use the Neo4j JDBC driver to connect to another Neo4j instance and retrieve data from there.
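
Here is a sketch of what that could look like, assuming the Neo4j JDBC (Bolt) driver jar has been dropped into the plugins directory and a connection alias has been configured:

// neo4j.conf (hypothetical alias): apoc.jdbc.othergraph.url=jdbc:neo4j:bolt://other-host:7687
CALL apoc.load.jdbc('othergraph', 'MATCH (p:Person) RETURN p.name AS name, p.born AS born') YIELD row
MERGE (p:Person {name: row.name})
SET p.born = row.born;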

It is always nice if the APIs you build have the right abstraction so that you can compose them to achieve something better.

Here is an example of how we can use apoc.load.jdbc with apoc.periodic.iterate to parallelize import from a JDBC data source:

CALL apoc.periodic.iterate('
call apoc.load.jdbc("jdbc:mysql://localhost:3306/northwind?user=northwind","company")',
'CREATE (c:Company) SET c += value', {batchSize:10000, parallel:true})
RETURN batches, total

As we already covered loading from relational databases before, I won’t bore you with it again (unless you ask me to). Instead, I’ll introduce two other database integrations that we added.

MongoDB


As many projects use MongoDB but have a hard time managing complex relationships between documents in an efficient manner, we thought it would be nice to support it out of the box.

The only thing you have to provide separately is the MongoDB Java driver jar in $NEO4J_HOME/plugins. APOC will pick it up and you’ll be able to use the MongoDB procedures:

CALL apoc.mongodb.get(host-or-port,db-or-null,collection-or-null,query-or-null) yield value

Perform a find operation on MongoDB collection

CALL apoc.mongodb.count(host-or-port,db-or-null,collection-or-null,query-or-null) yield value

Perform a count operation on MongoDB collection

CALL apoc.mongodb.first(host-or-port,db-or-null,collection-or-null,query-or-null) yield value

Perform a first operation on MongoDB collection

CALL apoc.mongodb.find(host-or-port,db-or-null,collection-or-null,query-or-null,projection-or-null,sort-or-null) yield value

Perform a find,project,sort operation on MongoDB collection

CALL apoc.mongodb.insert(host-or-port,db-or-null,collection-or-null,list-of-maps)

Inserts the given documents into the MongoDB collection

CALL apoc.mongodb.delete(host-or-port,db-or-null,collection-or-null,list-of-maps)

Deletes the given documents from the MongoDB collection

CALL apoc.mongodb.update(host-or-port,db-or-null,collection-or-null,list-of-maps)

Updates the given documents in the MongoDB collection

Copy these jars into the plugins directory:

mvn dependency:copy-dependencies
cp target/dependency/mongodb*.jar target/dependency/bson*.jar $NEO4J_HOME/plugins/

CALL apoc.mongodb.first('mongodb://localhost:27017','test','test',{name:'testDocument'})

If we import the example restaurants dataset into MongoDB, we can then access the documents from Neo4j using Cypher.

Retrieving one restaurant
CALL apoc.mongodb.get("localhost","test","restaurants",null) YIELD value
RETURN value LIMIT 1

{ name: Riviera Caterer,
 cuisine: American ,
 grades: [{date: 1402358400000, grade: A, score: 5}, {date: 1370390400000, grade: A, score: 7}, .... ],
 address: {building: 2780, coord: [-73.98241999999999, 40.579505], street: Stillwell Avenue, zipcode: 11224},
 restaurant_id: 40356018, borough: Brooklyn,
 _id: {timestamp: 1472211033, machineIdentifier: 16699148, processIdentifier: -10497, counter: 8897244, ....}
}

Retrieving 25359 restaurants and counting them
CALL apoc.mongodb.get("localhost","test","restaurants",null) YIELD value
RETURN count(*)

CALL apoc.mongodb.get("localhost","test","restaurants",{borough:"Brooklyn"}) YIELD value AS restaurant
RETURN restaurant.name, restaurant.cuisine LIMIT 3

╒══════════════════╤══════════════════╕
│restaurant.name   │restaurant.cuisine│
╞══════════════════╪══════════════════╡
│Riviera Caterer   │American          │
├──────────────────┼──────────────────┤
│Wendy'S           │Hamburgers        │
├──────────────────┼──────────────────┤
│Wilken'S Fine Food│Delicatessen      │
└──────────────────┴──────────────────┘

And then we can, for instance, extract addresses, cuisines and boroughs as separate nodes and connect them to the restaurants:

CALL apoc.mongodb.get("localhost","test","restaurants",{`$where`:"$avg(grades.score) > 5"}) YIELD value as doc
CREATE (r:Restaurant {name:doc.name, id:doc.restaurant_id})
CREATE (r)-[:LOCATED_AT]->(a:Address) SET a = doc.address
MERGE (b:Borough {name:doc.borough})
CREATE (a)-[:IN_BOROUGH]->(b)
MERGE (c:Cuisine {name: doc.cuisine})
CREATE (r)-[:CUISINE]->(c);

Added 50809 labels, created 50809 nodes, set 152245 properties, created 76077 relationships, statement executed in 14785 ms.

Here is a small part of the data showing a bunch of restaurants in NYC:

An example of an APOC database integration with MongoDB and Neo4j


Elasticsearch


Elasticsearch support is provided by calling their REST API. The general operations are similar to MongoDB.

apoc.es.stats(host-url-Key)

Elasticsearch statistics

apoc.es.get(host-or-port,index-or-null,type-or-null,id-or-null,query-or-null,payload-or-null) yield value

Perform a GET operation

apoc.es.query(host-or-port,index-or-null,type-or-null,query-or-null,payload-or-null) yield value

Perform a SEARCH operation

apoc.es.getRaw(host-or-port,path,payload-or-null) yield value

Perform a raw GET operation

apoc.es.postRaw(host-or-port,path,payload-or-null) yield value

Perform a raw POST operation

apoc.es.post(host-or-port,index-or-null,type-or-null,query-or-null,payload-or-null) yield value

Perform a POST operation

apoc.es.put(host-or-port,index-or-null,type-or-null,query-or-null,payload-or-null) yield value

Perform a PUT operation

After importing the example Shakespeare dataset, we can have a look at the Elasticsearch statistics.

call apoc.es.stats("localhost")

{ _shards:{
  total:10, successful:5, failed:0},
 _all:{
  primaries:{
   docs:{
    count:111396, deleted:13193
   },
   store:{
    size_in_bytes:42076701, throttle_time_in_millis:0
   },
   indexing:{
    index_total:111396, index_time_in_millis:54485, …
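Beyond statistics, the search procedure can run a full query against an index by passing the query body as the payload. Here is a hypothetical sketch against the imported Shakespeare data; the index and type names ("shakespeare", "line") and the text_entry field are assumptions about how the example dataset was loaded:

// search the Shakespeare lines for a famous phrase and return the hit count
CALL apoc.es.query('localhost','shakespeare','line',null,
  {query: {match: {text_entry: 'my kingdom for a horse'}}}) YIELD value
RETURN value.hits.total AS hits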

Couchbase support is upcoming with a contribution by Lorenzo Speranzoni from Larus IT, one of our Italian partners.

Data Export


Exporting your Neo4j database to a shareable format has always been a bit of a challenge, which is why I created the neo4j-import-tools for neo4j-shell a few years ago. Those support exporting your whole database or the results of a Cypher statement to:

    • Cypher scripts
    • CSV
    • GraphML
    • Binary (Kryo)
    • Geoff

I’m now moving that functionality to APOC one format at a time.

Cypher Script


Starting with export to Cypher, the apoc.export.cypher.* procedures export:

    • The whole database
    • The results of a Cypher query
    • A set of paths
    • A subgraph

The procedures also create a Cypher script file containing the statements to recreate your graph structure.

apoc.export.cypher.all(file,config)

Exports whole database including indexes as Cypher statements to the provided file

apoc.export.cypher.data(nodes,rels,file,config)

Exports given nodes and relationships including indexes as Cypher statements to the provided file

apoc.export.cypher.graph(graph,file,config)

Exports given graph object including indexes as Cypher statements to the provided file

apoc.export.cypher.query(query,file,config)

Exports nodes and relationships from the Cypher statement including indexes as Cypher statements to the provided file

It also creates indexes and constraints; currently only CREATE is used for nodes and relationships, as the exported file below shows. It also makes sure that nodes which do not have a uniquely constrained property get an additional artificial label and property (containing their node-id) so they can be matched again when the relationships are created. Both are pruned at the end of the import.

Relationships are created by matching the two nodes and creating the relationship between them, optionally setting parameters.

The node and relationship creation happens in batches wrapped with BEGIN and COMMIT commands. Currently, the generated code doesn’t use parameters, but that would be a future optimization. The current syntax only works for neo4j-shell and Cycli; support for cypher-shell will be added as well.

Here is a simple example from our movies graph:

:play movies
create index on :Movie(title);
create constraint on (p:Person) assert p.name is unique;

call apoc.export.cypher.query("MATCH (m:Movie)<-[r:DIRECTED]-(p:Person) RETURN m,r,p", "/tmp/directors.cypher", {batchSize:10});

╒═════════════════════╤══════════════════════════════╤══════╤═════╤═════════════╤══════════╤════╕
│file                 │source                        │format│nodes│relationships│properties│time│
╞═════════════════════╪══════════════════════════════╪══════╪═════╪═════════════╪══════════╪════╡
│/tmp/directors.cypher│statement: nodes(66), rels(44)│cypher│66   │44           │169       │104 │
└─────────────────────┴──────────────────────────────┴──────┴─────┴─────────────┴──────────┴────┘

Contents of exported file
begin
CREATE (:`Movie`:`UNIQUE IMPORT LABEL` {`title`:"The Matrix", `released`:1999, `tagline`:"Welcome to the Real World", `UNIQUE IMPORT ID`:1106});
CREATE (:`Person` {`name`:"Andy Wachowski", `born`:1967});
CREATE (:`Person` {`name`:"Lana Wachowski", `born`:1965});
....
CREATE (:`Person` {`name`:"Rob Reiner", `born`:1947});
commit
....
begin
CREATE INDEX ON :`Movie`(`title`);
CREATE CONSTRAINT ON (node:`Person`) ASSERT node.`name` IS UNIQUE;
CREATE CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
commit
schema await
begin
MATCH (n1:`Person`{`name`:"Andy Wachowski"}), (n2:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:1106}) CREATE (n1)-[:`DIRECTED`]->(n2);
....
MATCH (n1:`Person`{`name`:"Tony Scott"}), (n2:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:1135}) CREATE (n1)-[:`DIRECTED`]->(n2);
MATCH (n1:`Person`{`name`:"Cameron Crowe"}), (n2:`UNIQUE IMPORT LABEL`{`UNIQUE IMPORT ID`:1143}) CREATE (n1)-[:`DIRECTED`]->(n2);
commit
...
begin
MATCH (n:`UNIQUE IMPORT LABEL`)  WITH n LIMIT 10 REMOVE n:`UNIQUE IMPORT LABEL` REMOVE n.`UNIQUE IMPORT ID`;
commit
...
begin
DROP CONSTRAINT ON (node:`UNIQUE IMPORT LABEL`) ASSERT node.`UNIQUE IMPORT ID` IS UNIQUE;
commit

load again with neo4j-shell
./bin/neo4j-shell -file /tmp/directors.cypher

GraphML


The second export format I migrated is GraphML, which can then be used by other tools like yEd, Gephi, Cytoscape etc. as an import format.

The procedures API is similar to the Cypher script ones:

apoc.import.graphml(file-or-url,{batchSize: 10000, readLabels: true, storeNodeIds: false, defaultRelationshipType:"RELATED"})

Imports GraphML into the graph

apoc.export.graphml.all(file,config)

Exports whole database as GraphML to the provided file

apoc.export.graphml.data(nodes,rels,file,config)

Exports given nodes and relationships as GraphML to the provided file

apoc.export.graphml.graph(graph,file,config)

Exports given graph object as GraphML to the provided file

apoc.export.graphml.query(query,file,config)

Exports nodes and relationships from the Cypher statement as GraphML to the provided file

Here is an example of exporting the Panama Papers data to GraphML (after replacing the bundled APOC with the latest version) and loading it into Gephi.

Exporting the full database results in a 612 MB GraphML file. Unfortunately, Gephi struggles with rendering a file that large, so I’ll try again with the neighborhood of officers that have a country code of “ESP” for Spain, which is much less data.

call apoc.export.graphml.query("match p=(n:Officer)-->()<--() where n.country_codes = 'ESP' return p","/tmp/es.graphml",{})

╒═══════════════╤══════════════════════════════════╤═══════╤═════╤═════════════╤══════════╤════╕
│file           │source                            │format │nodes│relationships│properties│time│
╞═══════════════╪══════════════════════════════════╪═══════╪═════╪═════════════╪══════════╪════╡
│/tmp/es.graphml│statement: nodes(2876), rels(3194)│graphml│2876 │3194         │24534     │2284│
└───────────────┴──────────────────────────────────┴───────┴─────┴─────────────┴──────────┴────┘

Gephi graph data visualization using the Panama Papers data from Spain
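To round-trip such an export back into another Neo4j instance, the corresponding import procedure can be used. A minimal sketch, assuming the file produced above and the config options listed earlier:

// re-import the exported file, recreating labels from the GraphML data
// and defaulting untyped edges to RELATED
CALL apoc.import.graphml('/tmp/es.graphml',
  {batchSize: 5000, readLabels: true, defaultRelationshipType: 'RELATED'})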


Conclusion


I hope this article and series helps you to see how awesome user-defined procedures and APOC are.

If you have any comments, feedback, bugs or ideas to report, don’t hesitate to tell us. Please either raise GitHub issues or ask in the #apoc channel on our neo4j-users Slack. Of course you can join the growing list of contributors and submit a pull request with your suggested changes.

I’m looking ahead to the next articles, all of which I hope to publish before GraphConnect on October 13th and 14th in San Francisco. If you join me there, we can chat about procedures in person. We’ll try to set up a Neo4j Developer Relations booth with Q&A sessions, live demos and more.

In the next article, I’ll demonstrate the date- and number-formatting capabilities, utility functions and means to run Cypher statements in a more controlled fashion. Following will be the metadata procedures and the wide area of (manual and schema) index operations. After that, I’ll cover graph algorithms as well as custom expand and search functions.

Oh, and if you like the project please make sure to star it on GitHub and tell your friends, family and grandma to do the same. 🙂

Cheers,
Michael



Already a Neo4j expert?
Show off your graph database skills with an official Neo4j Certification. Take the exam and you’ll be Neo4j Certified in less than an hour.


Start My Certification


Catch up with the rest of the Introduction to APOC blog series:

The post APOC: Database Integration, Import and Export with Awesome Procedures On Cypher appeared first on Neo4j Graph Database.

“Google” Your Own Brain: Create a CMS with Neo4j & Elasticsearch [Community Post]


[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

Learn how to create a content management system (CMS) using Neo4j, Mazerunner and Elasticsearch

Grasp Theory is a project that is exploring a new way to catalogue and recall documents that are personally relevant. This article describes some high-level concepts being used that leverage Neo4j.

The Power of the Graph


Having a graph to represent connections between content is powerful. Graphs are a hot topic, and Google built a dynasty on their PageRank algorithm, which leverages the links between pages on the web. This helps Google serve us more relevant content very quickly.

Wait a minute though, “relevant” in the case of using Google doesn’t necessarily mean “personally relevant.” In fact, most times we are searching to find out what everyone else knows and is generally agreed upon as relevant. We are essentially peering into the connections of a brain which is the cumulative average of the world and Google allows us to do this through the colored lenses provided by their proprietary ranking algorithm.

Search engines are a valuable tool for a variety of use cases, however, there could be use cases where we may not want to search everyone else’s brains and instead search our own.

The Start of Our New Brain


Human memory is short and terribly fickle.
–Janine Di Giovanni
Suppose that throughout our schooling days we developed a content management system (CMS) and indexed into it all the information we were exposed to. We could then rely on this system to help us recall personally relevant information without requiring our actual brains to keep sharp representations of everything we were ever exposed to close at hand.

If we indexed all that information into something like Elasticsearch we could certainly search for relevant documents via basic text searches. Done!

Isn’t simple text search enough to search our own brains? Simple text-based search wasn’t enough for Google, so let’s explore a few things we can do to improve upon a text-based search in our new brain-based CMS.

It seems that there are a few quick wins we could implement:

  1. Provide related content to items we find
  2. Increase the relevancy of the search results

If we track the content and all the relationships between content items in our CMS using Neo4j, then #1 is already done. Nice, thanks Neo4j!

How do we address #2, the relevancy problem? Let’s take a page from Google’s playbook.

Enhancing Relevancy via Mazerunner


A good memory is one trained to forget the trivial.
–Clifton Fadiman
Our brain does an amazing job letting some things fade away but provides hooks into memories that we have deemed important. We can then jump to related memories via associations we have created through our experiences. It would be annoying, inefficient, and probably dangerous to be able to recall every associated memory that isn’t relevant.

Let’s use this idea as a model and work to enhance the relevancy of searches in our Neo4j CMS.

Fortunately, the Neo4j team took over a project created by Kenny Bastani called Mazerunner. Mazerunner is exactly the tool we need to enhance the relevancy of our search. As described here, Mazerunner integrates an existing Neo4j database with Apache Spark and GraphX to generate graph analytics like PageRank and then puts those values back into your Neo4j database.

NOTE: There is a new Apache Spark connector that is now the preferred way to use Spark with Neo4j. Check it out here.

To generate a PageRank value, you must tell Mazerunner which relationships to use and this will depend on your relationship structure.

See Mazerunner documentation for implementation details here.

Once we have a PageRank value for each piece of content, we could use that to tweak our search results.

Below is the result of running Mazerunner on a simple tree structure utilizing RELATED relationships between our nodes. Since PageRank essentially gives a weight proportional to the probability of reaching a node from a randomly selected node in the graph, it makes sense that our root node of the tree has a low PageRank, while our values trend upwards as we traverse down the branches.

An example of Mazerunner in a graph data structure


Simple example of PageRank values added to nodes using Mazerunner

So how do we utilize our new PageRank values? Once Mazerunner finishes adding PageRank values back into the nodes within our Neo4j graph database, it’s time to re-index each node with this new value into Elasticsearch. (See the latest APOC Elasticsearch integration details here as one possible implementation).
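As a rough illustration of that re-indexing step, the sketch below pushes each node’s PageRank value into an existing Elasticsearch document through the update API using apoc.es.postRaw. The label (Content), property names (id, pagerank) and the index/type names (brain/document) are assumptions for this example, not the Grasp Theory implementation:

// push each node's PageRank into its Elasticsearch document
MATCH (c:Content)
WHERE exists(c.pagerank)
CALL apoc.es.postRaw('localhost', 'brain/document/' + toString(c.id) + '/_update',
  {doc: {pageRank: c.pagerank}}) YIELD value
RETURN count(*)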

We can then tweak our Elasticsearch query to include the PageRank value in our score calculation of matched documents.

query: {
 function_score: {
   query: {
     filtered: {
       // omitted for brevity
     }
   },
   boost_mode: "replace",
   script_score: {
     params: {
       prPercent: pageRankPercent
     },
     // *** MAGIC: Incorporate Neo4j graph pagerank score ***
     script: "_score * (1 + _source.pageRank.value * prPercent)"
   }
 }
}

(See the Elasticsearch documentation for script_score.)

The tricky part here is that Elasticsearch provides an arbitrary score value for your results (description of problem here, documentation here). This score can be greater than one and varies from query to query, while your weighting is fixed per query, so you will need to tweak the weighting to get a “good enough” effect on your query results. This is accomplished above by adding a prPercent weight to the PageRank value.

The Results


The impact of integrating PageRank values from Mazerunner and Neo4j into your search results will vary based upon your scoring algorithms/weighting and the underlying graph structure used to calculate the PageRank values.

Check out the screenshot and video below of toggling PageRank weighted searches using the same tree-based data described above. Even though the example below only uses a single tree structure for data and a limited number of relationships, PageRank already provides some small enhancements to search results. Naturally, the more data and more relationships, the more relevancy our PageRank values will give us.

A Neo4j-powered document search engine CMS


With PageRank values indexed into Elasticsearch, toggling their use during a search is as simple as clicking a button.



What’s Next?


The “Grasp Theory” project is working to import more data to fine-tune the generation and use of PageRank values. More data around that is coming soon, but our personal brain search engine is certainly showing some promise and it is exciting to see what else might pop out while leveraging Neo4j as things progress.

Here are some other areas that might be worth exploring to take things a step further:
  1. With a new release of APOC for Neo4j, integration with Elasticsearch will be even more streamlined!
  2. Enhancing our graph with mappings from content nodes to semantic nodes could really help. This could drive recommendations between pieces of relevant content that have no direct relationships between them. This is like integrating a research paper from another project into your current project because they have semantic similarities.
  3. We could also look at the other graph analytics algorithms that Mazerunner provides. Perhaps calculating measures of betweenness centrality could provide some interesting insights?
  4. Add configurable relevancy via “priming”. Right now we have added PageRank to provide the global importance of content. We are essentially asking our brain a question while it is in the exact same state during every query. What would happen if we “primed” our brain the way it is primed right now by reading this article?

    If we searched right now for “graphs,” we would likely get graph theory or Neo4j-related results. If we had searched for “graphs” just before reading this article, we would likely get different results, such as ones related to Euclidean coordinates. “Priming” is indeed possible with our setup, so perhaps we’ll explore this in a future post.


Ready to use Neo4j in your own project?
Click below to get your free copy of the Learning Neo4j ebook and catch up to speed with the world’s leading graph database.


Download My Copy

The post “Google” Your Own Brain: Create a CMS with Neo4j & Elasticsearch [Community Post] appeared first on Neo4j Graph Database.


Using NLP + Neo4j for a Social Media Recommendation Engine


Introduction


In recent years, the rapid growth of social media communities has created a vast number of digital documents on the web. Recommending relevant documents to users is a strategic goal for effective customer engagement, but at the same time it is not a trivial problem.

In a previous blog post, we introduced the GraphAware Natural Language Processing (NLP) plugin. It provides the basis to realize more complex applications that leverage text analysis and to offer enhanced functionalities to end users.

An interesting use case is combining content-based recommendations with a collaborative filtering approach to deliver high-quality “suggestions.” This scenario fits well in applications that combine user-generated content, such as social media posts, with any sort of reaction: tagging, likes, and so on.

In this direction, starting from the ideas presented in the paper Social-aware Document Similarity Computation for Recommender Systems [1], we developed, as part of the GraphAware Enterprise Reco plugin for Neo4j, a recommendation engine that uses a combination of similarities as its model to provide high-quality recommendations.

Document Modelling


In a social community, a document (which could be a post, tweet, blog, etc.) can be characterized by three elements:
    • The document’s internal content and extracted tags
    • Tags that users associate with it
    • The readers’ interactions (i.e., view, comment, tag, like) with the document

The internal content of the document is static over time. However, the tags and users associated with the document are community-driven; they reflect the attitude of the community towards the document and can change over time.

With traditional information retrieval techniques, the internal contents of the document are indexed. The index is then used to help users search for documents of their interest.

These techniques are still popular in many information retrieval systems. However, using only the document’s content may miss meaning carried by tags and users. Recognizing the importance of tags as a supplement to internal content indexing, some systems use tags as external document metadata. This metadata is used to assist users with browsing or navigating document databases.

GraphAware Enterprise Reco uses the combined approach of computing document similarity for building recommender systems. The idea is that the meaning of a document is derived not only from its content, but also from its associated tags and user interactions.

“These three factors are viewed as three dimensions of a document in social space, named as Content, Tag, and User. Each dimension provides a different view of the document. In Content dimension, the meaning of the document is given by its author(s). However, in the Tag dimension, the meaning of the document is what it is perceived by the community. Each user may provide a different view of the document by tagging it. This view can be far different from the initial intention of the document’s author(s). In User dimension, the meaning of the document is exposed via its readers’ activities in the community.” [1]
Moreover, while analyzing “static” content and social tags, ontologies and semantics can be used to extract concept hierarchies. This extension makes it possible to find relationships between tags and, in this way, to discover hidden relationships between apparently unrelated documents.

So, for instance, if a document is tagged (automatically from content or by a user) with the tag violence while another is tagged with the tag war, at first analysis, they could appear unrelated, but after analyzing the semantic hierarchy of word violence (with ConceptNet 5 for instance) the system can reveal a relation between them.

The designed schema for the database will appear as follows:

Use natural language processing (NLP) and Neo4j to build a social media recommendation engine


This schema also shows how this complex model can be easily stored, and further extended, using graphs and Neo4j.

Similarity Computation


Using all the information stored, three different vectors will be created for each document:

Content- and ontology-based vector:

      Ci = {wc(i,1), wc(i,2), …, wc(i,n)} where n is the total number of tags in the database, wc(i,k) is the weight of the kth tag in the document or in the hierarchy of the tag. wc(i,k) is computed using the following formula: α*tf-idf(i,k), where α is a weight associated with the hierarchy in the ontology; it is equal to 1 if the tag is in the document or if it is a synonym of a tag in the document; less than 1 in other cases.

Social Tag-based vector:

      Ti = {wt(i,1), wt(i,2), …, wt(i,p)} where p is the total number of tags in the database, wt(i,k) is the weight of the kth tag for the document. wt(i,k) is the association frequency of the tag k to document i.

User vectors:

      Ui = {wu(i,1), wu(i,2), …, wu(i,q)} where q is the total number of users in the database, wu(i,k) is the weight of the kth user for the document. This weight can be computed in a different way, considering the different levels of interest expressed by a user for the document. Moreover, more than one user vector can be used if it is necessary to use different weights for each of the components (for instance, one vector for likes, one for rates, and so on).

Using these three (or more) vectors, three (or more) different cosine similarities are computed and then the value for the combined similarity is calculated in the following way:

CombinedSimilarity(i, j) = α*CosineSim(Ci, Cj) + β*CosineSim(Ti, Tj) + γ*CosineSim(Ui, Uj)

Where:

α + β + γ = 1

It is worth noting that the computed similarity represents new knowledge extracted from the data available in the graph database. It is stored as the model for the recommendation engine and can be used in several ways to provide suggestions to users.
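As a rough sketch of how one of these cosine similarities could be computed directly in Cypher, the query below compares the content vectors of two documents using tag weights stored on a DESCRIBES relationship. The labels, relationship type and weight property are illustrative only; the actual GraphAware Reco model may differ:

// dot product over the tags the two documents share
// (returns no row if they share no tags)
MATCH (d1:Document {id: 1})<-[t1:DESCRIBES]-(:Tag)-[t2:DESCRIBES]->(d2:Document {id: 2})
WITH d1, d2, sum(t1.weight * t2.weight) AS dotProduct
// norm of document 1's tag vector
MATCH (d1)<-[t:DESCRIBES]-(:Tag)
WITH d2, dotProduct, sqrt(sum(t.weight * t.weight)) AS norm1
// norm of document 2's tag vector, then the cosine similarity
MATCH (d2)<-[t:DESCRIBES]-(:Tag)
WITH dotProduct, norm1, sqrt(sum(t.weight * t.weight)) AS norm2
RETURN dotProduct / (norm1 * norm2) AS contentSimilarity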

Conclusion


In this use case, the GraphAware NLP Plugin is used to deliver high-quality recommendations to end users. The plugin provides content-based and ontology-based cosine similarities, which, together with the more classical “collaborative filtering” approach, produce completely new and more advanced functionalities in a straightforward way.

The GraphAware NLP Plugin can be used with other plugins available on the GraphAware products page. In particular, using the Neo4j2Elastic plugin for Neo4j and Graph-Aided Search plugin for Elasticsearch, it is possible to provide a complete end-to-end customized search framework.

The NLP plugin is going to be open-sourced under GPL in the future, and we would like to make sure it is production ready with private beta testers. If you’re interested to know more or see its usage in action, please get in touch.

If you’re attending GraphConnect in San Francisco in October this year, or in London next May, be sure to stop by our booth!

Reference


[1] Tran Vu Pham, Le Nguyen Thach, “Social-Aware Document Similarity Computation for Recommender Systems”, vol. 00, no., pp. 872-878, 2011, doi:10.1109/DASC.2011.147


Learn more about the GraphAware NLP plugin and meet the GraphAware team at GraphConnect San Francisco on October 13th, 2016. Click below to register – and we’ll see you in San Francisco soon!

Get My Ticket

The post Using NLP + Neo4j for a Social Media Recommendation Engine appeared first on Neo4j Graph Database.

The 5-Minute Interview: Christophe Willemsen, Senior Consultant at GraphAware

For this week’s 5-Minute Interview, I chatted with Christophe Willemsen, Senior Neo4j Consultant at GraphAware in London. Christophe and I spoke at GraphConnect San Francisco last October.

Here’s what we discussed:



Talk to me about how you use Neo4j.


I’m a Neo4j consultant at GraphAware, a Neo4j consulting company based in London. I’ve been working with the graph database for the last four years, and it has become more of a passion than a job.

One of our recent projects involved providing recommendations for holiday house accommodations. In a period of only two weeks, we went from zero to production with a recommendation engine, and then went back to do some fine-tuning. Our recommendations continually change based on updated information that is gathered.

Right now, with Neo4j we use natural language processing and Elasticsearch integration to provide graph-aided search. Through these queries we can provide Elasticsearch boosts and relevancy scores based on the user’s graph. Some big clients, such as Airbnb, are using it.

What made Neo4j stand out?


We were able to go into production much, much faster than with a traditional relational database. It also provides us with the flexibility to continually make changes over time — often our clients aren’t sure what they really want, so we have to make adjustments. We don’t really do any work without Neo4j and have increased our customer base, which is great.

Catch this week’s 5-Minute Interview with Christophe Willemsen, Senior Neo4j Consultant at GraphAware

If you could start over with Neo4j, taking everything you know now, what would you do differently?


I immediately fell in love with Neo4j. At the time I discovered graph databases, I was working on a DNA project for dog genealogy with traditional databases and was really struggling. From the very beginning I wasn’t sure how to move forward with my data. There isn’t really anything I would do differently; experience is really important.

Can you talk to me about some of your most interesting or surprising results you’ve had while using Neo4j?


We did an interesting project this last week, which we presented at a hackathon. We do a lot of computations with Apache Spark, which you can now trigger from within Cypher.

We created a procedure that triggers the Apache Spark process over a Spark cluster. This allows your Neo4j JVM to remain quiet, perform all computation outside of your main database, and automatically write the results back to Neo4j. It is going to be an important aspect of our future work.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com.


Want to learn more about graph databases and Neo4j? Click below to register for our online training class, Introduction to Graph Databases and master the world of graph technology in no time.

Sign Me Up

The post The 5-Minute Interview: Christophe Willemsen, Senior Consultant at GraphAware appeared first on Neo4j Graph Database.

The 5-Minute Interview: Galit Gontar, Software Engineer at Glidewell Laboratories

“We chose a graph database because it’s so much more expressive and a natural representation of the type of data we work with,” says Galit Gontar, Software Engineer at Glidewell Laboratories in Irvine, California.

In fact, Gontar and the Glidewell team have found the graph data model to be useful beyond just one or two niche use cases. The flexibility, she says, allows the Neo4j graph database to be used generically across the company’s entire range of applications.

In this week’s 5-Minute Interview (conducted at GraphConnect San Francisco), we discuss how Neo4j is used in Glidewell’s enterprise and manufacturing processes. Gontar also talks about the other technologies that the Glidewell team uses alongside Neo4j.



Tell us how your team uses Neo4j at Glidewell Laboratories.


Galit Gontar: We are a dental lab that manufactures prosthetic teeth, and we use Neo4j in a number of different ways. Our largest project is implementing a Neo4j-based workflow engine into our manufacturing processes, and then eventually into all of our enterprise processes.

What made Neo4j stand out?


Galit: It’s definitely the best graph database out there right now, and we chose a graph database for this type of implementation because it’s so much more expressive and a natural representation of the type of data we work with. And we really like working with it; it has a number of performance advantages and we really like using Cypher. Neo4j is very easy to use and easy to incorporate into our other systems.

Catch this week’s 5-Minute Interview with Galit Gontar, Software Engineer at Glidewell Laboratories

Can you talk to me about how you use Neo4j with other technologies?


Galit: We use MongoDB for a lot of our catalogs and Elasticsearch for our search systems. Most of our infrastructure is written in C#-based microservices that run in Docker containers.

Anything else you’d like to add or share?


Galit: I think Neo4j works very well as a generic system. I don’t think I’ve seen a lot of other examples of it used very, very generically. Generally speaking, the relationships are expressive – the actual way that an item relates to another item. But we’ve had a lot of success using Neo4j in fundamental, generic applications, and then reusing the same kind of schema throughout our company.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Want to learn more about graph databases and Neo4j? Click below to register for our online training class, Introduction to Graph Databases and master the world of graph technology in no time.

Sign Me Up

The post The 5-Minute Interview: Galit Gontar, Software Engineer at Glidewell Laboratories appeared first on Neo4j Graph Database.

Relevant Search Leveraging Knowledge Graphs with Neo4j

“Relevance is the practice of improving search results for users by satisfying their information needs in the context of a particular user experience, while balancing how ranking impacts business’s needs.”[1]

Providing relevant information to the user performing search queries or navigating a site is always a complex task. It requires a huge set of data, a process of progressive improvements, and self-tuning parameters together with infrastructure that can support them.

Such search infrastructure must be introduced seamlessly and smoothly into the existing platform, with access to all relevant data flows so that it always provides up-to-date data. Moreover, it should allow new data sources to be added easily, to cater to new requirements, without affecting the entire system or the current relevance.

Information must be stored and managed correctly, taking into account the relationships between individual items and providing a model and access patterns that can also be processed automatically by artificial minds (machines). These models are generally referred to as Knowledge Graphs. They have become a crucial resource for many tasks in machine learning, data mining and artificial intelligence applications.

Knowledge Graphs: An Introduction


A knowledge graph is a multi-relational graph composed of entities as nodes and relationships as edges with different types that describe facts in the world.

Out of the many features involved in the processing of data sources to create a knowledge graph, Natural Language Processing (NLP) plays an important role. It assists in reading and understanding text to automatically extract “knowledge” from a large number of data sources.[3]

The search goal varies based on the domain in which it is used; it differs substantially between web search, product catalog navigation on an ecommerce site, scientific literature discovery, and the expert search prominent in medicine, law and research. All these domains differ in terms of business goals, the definition of relevance, synonyms, ontologies and so on.

In this blog post, we introduce knowledge graphs as the core data source on top of which a relevant search application has been built. We describe in detail the data model, the feeding and updating processes, and the entire infrastructure applied to a concrete application. We will consider a product catalog for a generic ecommerce site as the use case for this article; however, the concepts and ideas could be applied easily to other scenarios.

The Use Case: Ecommerce


In all ecommerce sites, text search and catalog navigation are not only the entry points for users but they are also the main “salespeople.” Compared with web search engines, this use case has the advantage that the set of “items” to be searched is more controlled and regulated.

However, there are a lot of critical aspects and peculiarities that need to be taken into account while designing the search infrastructure for this specific application:

    • Multiple data sources: Products and related information come from various heterogeneous sources like product providers, information providers, and sellers.
    • Multiple category hierarchies: An ecommerce platform should provide multiple navigation paths to simplify access from several perspectives and shorten the time from desire to purchase. This requires storing and traversing multiple category hierarchies that are subject to change over time based on new business requirements.
    • Marketing strategy: New promotions, offers, and marketing campaigns are created to promote the site or specific products. All of them affect, or should affect, results boosting.
    • User signals and interactions: In order to provide a better and more customized user experience, clicks, purchases, search queries, and other user signals must be captured, processed and used to drive search results.
    • Supplier information: Product suppliers are the most important. They provide information like quantity, availability, delivery options, timing and changes in the product’s details.
    • Business constraints: Ecommerce sites have their own business interests, so they must also return search results that generate profit, clear expiring inventory, and satisfy supplier relationships.

All of these requirements and data can affect “relevance” during search in several ways, as well as how the product catalog should be navigated. Keeping these constraints in mind, designing a relevant search infrastructure for ecommerce vendors requires an entire ecosystem of data and related data flows, together with platforms to manage them.

Relevant Search


Relevant search revolves around four elements: text, user, context, and business goal:
    • Information extraction and NLP are key to providing search results that mostly satisfy the user’s text query in terms of content.
    • User modelling and recommendation engines allow for customizing results according to user preferences and profiles.
    • Context information like location, time, and so on, further refine results based on the needs of the user while performing the query.
    • Business goals drive the entire implementation, as search exists to contribute to the success and profitability of the organization.

In our previous blog posts on text search and NLP for social media recommendations, we described how to use advanced NLP features and graphs to combine text, user profiles and behaviour to customize the search experience using recommendation engines.

Relevant searches also require context information, previous searches, current business goals, and feedback loops to further customize user experience and increase revenues. These must be stored and processed in ways that can be easily accessed and navigated during searches without affecting the user experience in terms of response time and quality of the search.

Knowledge Graphs: The Model


In order to provide relevant search, the search architecture must be able to handle highly heterogeneous data in terms of sources, schema, volume and speed of generation. This data includes aspects such as textual descriptions and product features, marketing campaigns and business goals. Moreover, these have to be accessed as a single data source, so they must be normalized and stored using a unified schema structure that satisfies all the informational and navigational requirements of a relevant search.

The graph data model, considered as both the storage and the query model, provides the right support for all the components of a relevant search. Graphs are the right representational option for the following reasons:

  1. Information Extraction attempts to make the text’s semantic structure explicit by analysing its contents and identifying mentions of semantically defined entities and relationships within the text. These relationships can then be recorded in a database to search for a particular relationship or to infer additional information from the explicitly stated facts.

    Once “basic” data structures like tokens, events, relationships, and references are extracted from the text provided, related information can be extended by introducing new sources of knowledge like ontologies (ConceptNet 5, WordNet, DBpedia, domain-specific ontologies) or further processed/extended using services like AlchemyAPI.

  2. Recommendation Engines build models for users and items/products based on dynamic (such as user previous sessions) or static (such as description) data, which represent relationships of interests. Hence a graph is a very effective structure to store and query these relationships, even allowing them to be merged with other sources of knowledge such as user and item profiles.

  3. Context information is a multi-dimensional representation of a status or an event. The types and number of dimensions can change greatly and a graph allows for the required high degree of flexibility.

  4. A graph can be used to define a rule engine that could enforce whichever business goal is defined for the search.

Lately, the use of graphs for representing complex knowledge and storing it in an easy-to-query model has become prominent for information management, and the term “knowledge graph” is becoming increasingly popular. Sometimes defined as “encyclopaedias for machines,” knowledge graphs have become a crucial resource for advanced search, machine learning and data mining applications [4]. Nowadays, knowledge graph construction is one of the hottest topics in artificial intelligence (AI).

A knowledge graph, from a data model perspective, is a multi-relational graph composed of entities as nodes and relationships as edges with different types. An instance of an edge is a triple (e1, r, e2) which describes the directed relationship r between the two entities e1 and e2.
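In Neo4j terms, such a triple maps directly onto a node-relationship-node pattern. A minimal illustration, where the labels, relationship type and property values are made up for this example:

// the triple (iPhone 7, BELONGS_TO, Smartphones) as a Cypher pattern
MERGE (e1:ProductData {name: 'iPhone 7'})
MERGE (e2:Category {name: 'Smartphones'})
MERGE (e1)-[:BELONGS_TO]->(e2)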

According to this definition, we designed the following logical schema for this specific use case:

A logical graph database schema


This schema merges multiple knowledge graphs into one big knowledge graph that can be easily navigated. Many of the relationships in the schema above are explicitly loaded using data sources like prices, product descriptions, and some relationships between products (like IS_USEFUL_FOR), while others are inferred automatically by machine learning tools.

This is the logical model, which can be extended to a more generic and versatile design with various types of relationships. Consider this sample as a representation of product attributes:

Data model of ecommerce product attributes


The specific relationships that describe a product feature, for instance HAS_SIZE or HAS_COLOUR, are replaced with a more general and dynamic schema:

(p:ProductData)-[:HAS_ATTRIBUTE]->(a:Attribute),
(a)-[:HAS_KEY]->(k:Key {ref: "Size"}),
(a)-[:HAS_VALUE]->(v:Value {data: "128 GB"})
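With this generic schema, a query for products by an arbitrary feature becomes a single pattern over keys and values rather than a different relationship type per feature. A sketch, following the property names used in the pattern above:

// find all products whose "Size" attribute is "128 GB"
MATCH (p:ProductData)-[:HAS_ATTRIBUTE]->(a:Attribute),
      (a)-[:HAS_KEY]->(:Key {ref: "Size"}),
      (a)-[:HAS_VALUE]->(:Value {data: "128 GB"})
RETURN p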

Part of the model is built using NLP by processing the information available in product details. In this case, the GraphAware NLP framework, described in a previous blog post, is used to extract knowledge from text.

After the first round, in which text is processed and organized into tags as described in the schema, the information is extended using ConceptNet 5 to add new knowledge like synonyms, specification, generalization, localization, and other interesting relationships. Further processing allows computing similarities between products, clustering them, and automatically assigning multiple “keywords” to describe each cluster.
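For instance, once ConceptNet 5 reveals that two tags are related, that knowledge can be materialized as an extra relationship between the existing Tag nodes. The relationship type, property names and tag values below are illustrative rather than the plugin’s actual schema:

// record that two tags are synonyms, as learned from ConceptNet 5
MATCH (t1:Tag {value: 'notebook'}), (t2:Tag {value: 'laptop'})
MERGE (t1)-[r:IS_SYNONYM_OF {source: 'conceptnet5'}]->(t2)
RETURN r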

The knowledge graph is the heart of the infrastructure not only because it is central to aiding the search but also because it is a living system growing and learning day by day, following the user needs and the evolving business requirements.

Infrastructure: A 10k-Foot View


Considering the relevant search goals described earlier, the knowledge graph model designed for this specific use case, and the type and amount of data to be processed, together with the related data flows and the requirements in terms of textual search capabilities, we defined the following high-level architecture:

Learn how to use Neo4j knowledge graphs to make more relevant and powerful search engines


The data flow is composed of several data sources like product information sources, product offers, sellers, click streams, feedbacks and more. All of these data items flow from outside to inside the data architecture using Apache Kafka as the queue and streaming platform for building real-time data pipelines. Information goes through a multi-step process where it is transformed before being stored in Neo4j, the main database of the infrastructure.

In order to avoid concurrency issues, only one microservice reads from the queue and writes to Neo4j. The raw sources are then enriched and processed and new relationships between objects are created. In this way, the knowledge graph is created and maintained.

Many machine learning tools and data mining algorithms as well as Natural Language Processing operations are applied to the graph and new relationships are inferred and stored. In order to process this huge amount of data, an Apache Spark cluster is seamlessly integrated into the architecture through the Neo4j-Spark connector.

At this point, data is transformed to several document types and sent to an Elasticsearch cluster where it is stored as documents. In Elasticsearch, these documents are analyzed and indexed for providing text search. The front-end interacts with Neo4j for providing advanced features that require graph queries that cannot be expressed using documents or simple text searches.

We will now describe in more detail the two core elements of the infrastructure.

The Neo4j Roles


Neo4j is the core of the architecture – it is the main database, the “single source of truth” for the product catalog, since it stores the entire knowledge graph on which all searches and navigation are performed. It is a viable tool in a relevant search ecosystem, offering not only a suitable model for representing complex data (text, user models, business goals, and context information) but also efficient ways of navigating this data in real time.

Moreover, at an early stage of the “search improvement process,” Neo4j can help relevance engineers identify salient features describing the content, the user, or the search query. Later on, it helps find ways to instruct the search engine about those features through extraction and enrichment.

Once the data is stored in Neo4j, it goes through a process of enrichment that comprises three main categories: cleansing, existing data augmentation, and data merging.

First, cleansing. It’s usually well worth the time to parse through documents, look for mistakes such as misspellings and document duplications, and correct them. Otherwise, users might not find a document because it contains a misspelling of the query term. Or they may find 20 duplicates of the same document, which would have the effect of pushing other relevant documents off the end of the search results page. Neo4j and the GraphAware NLP framework provide features that find duplicates or synonyms and relate them, as well as search for misspellings of words.

Second, the existing data is post-processed to augment the features already there. For instance, machine learning techniques can be used to classify or cluster documents. The possibilities are endless. After this new metadata is attached to the documents, it can serve as a valuable feature for users to search through.

Finally, new information is merged into the documents from external sources. In our ecommerce use case, the products being sold often come from external vendors. Product data provided by the vendors can be sparse: for instance, missing important fields such as the product title. In this case, additional information can be appended to documents. The existing product codes can be used to look up product titles, or missing descriptions can be written in by hand. The goal is to provide users with every possible opportunity to find the document they’re looking for, and that means more and richer search features.

Elasticsearch is for SEARCH


It is worth noting here that Elasticsearch is not a database. It is built for providing text searches at super-high speed over a cluster of nodes. Also, it can’t provide full ACID support.

“Lucene, which Elasticsearch is built on, has a notion of transactions. Elasticsearch, on the other hand, does not have transactions in the typical sense. There is no way to rollback a submitted document, and you cannot submit a group of documents and have either all or none of them indexed. What it does have, however, is a write-ahead-log to ensure the durability of operations without having to do an expensive Lucene-commit. You can also specify the consistency level of index-operations, in terms of how many replicas must acknowledge the operation before returning. […] Visibility of changes is controlled when an index is refreshed, which by default is once per second, and happens on a shard-by-shard-basis.” [Source: Elasticsearch as a NoSQL Database]
Nonetheless, in our opinion, it is used as the main data store too often. This approach could lead to a lot of issues in terms of data consistency and ease of managing the data stored in the documents.

Elasticsearch can provide an efficient interface for accessing catalog information, offering not only advanced text search capability but also any sort of aggregation and faceting. Faceting is the capability of grouping search results to provide predefined filters that allow users to easily refine the search. This is an example from Amazon:

Example of Amazon faceting ecommerce search results


It can also be useful for providing analytics as well as other capabilities like collapsing results. The latter, added in the latest version (5.3.x at the time of writing), allows you to group results and avoid having to show the same results when they only contain minor differences.

These, among many others, are the reasons why, in the proposed architecture, Elasticsearch is used as the front-end search interface, providing product details as well as autocomplete and suggestion functionality.

Conclusion


This post continues our series on advanced applications and features built on top of a graph database using Neo4j. The knowledge graph plays a fundamental role since it gathers and represents all the data needed in an organic and homogeneous form and allows access, navigation and extensibility in an easy and performant way.

It is the core of a complex data flow with multiple sources and feeds the Elasticsearch front-end. In this blog post, our presentation of a complete end-to-end architecture illustrates an advanced search infrastructure that powers highly relevant searches.

If you believe GraphAware NLP framework or our expertise with knowledge graphs would be useful for your project or organisation, please drop an email to nlp@graphaware.com specifying “Knowledge Graph” in the subject and one of our GraphAware team members will get in touch.

GraphAware is a Gold sponsor of GraphConnect Europe. Use discount code GRAPHAWARE30 to get 30% off your tickets and trainings.


Join us at the Europe’s premier graph technology event: Get your ticket to GraphConnect Europe and we’ll see you on 11th May 2017 at the QEII Centre in downtown London!

Get My Ticket



References


[1] D. Turnbull, J. Berryman – Relevant Search, Manning
[2] A. L. Farris, G. S. Ingersoll, and T. S. Morton – Taming Text, Manning
[3] Google Knowledge Graph – https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html
[4] L. Del Corro – Knowledge graphs: Encyclopaedias for machines – https://www.ambiverse.com/knowledge-graphs-encyclopaedias-for-machines/
[5] E. Gabrilovich, N. Usunier – Constructing and Mining Web-Scale Knowledge Graphs, SIGIR ’16 Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval

The post Relevant Search Leveraging Knowledge Graphs with Neo4j appeared first on Neo4j Graph Database.

This Week in Neo4j – 20 May 2017


Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days.


This week’s featured community member is Ben Nussbaum, CTO of Neo4j Solution Partner AtomRain.

Ben Nussbaum – This week’s featured community member

Ben has been an active member of the Neo4j community for the last five years, while building out the GraphGrid data platform, which provides hosted Neo4j, graph algorithms, and advanced analytics.

Ben appeared twice on Rik van Bruggen‘s Graphistania podcast in the first half of 2016 and frequently runs Neo4j training sessions in Los Angeles.

The Neo4j GraphQL Community Graph Hackathon


My colleague Will Lyon announced the start of the Neo4j GraphQL Community Graph hackathon which will finish on Monday after the GraphQL-Europe conference in Berlin.



If you’d like to take part, Sashko Stubailo has created the community graph starter kit which provides the skeleton of an application that uses the Apollo GraphQL client to connect to and query the GraphQL community graph.

The Salesforce graph, Software analytics, Automated menu planning


State of graph databases survey


IBM released the results of their State of graph databases survey, where people were asked why they’re using graph databases and what they’re planning to use them for in the future.



Democratising data at Airbnb


Following on from their talk at GraphConnect Europe last week, Chris Williams and John Bodley explain how Airbnb have developed Dataportal, a novel data resource search and discovery tool to make sense of their internal data.

Dataportal combines Flask, Elasticsearch, and Neo4j to help employees discover and then search data that would usually only be available in team-specific silos.

ComputerWorld UK also have a detailed write up of the talk.

Neo4j available on AWS & Azure Marketplace


As of this week Neo4j is now available in the Azure Marketplace as well as the AWS Marketplace.

Neo4j is now available in the Azure Marketplace

If you use either of those cloud providers be sure to give it a try and let us know how you get on.

From The Knowledge Base


This week from the Neo4j Knowledge Base we have an article showing how to compare two graphs for equality using Cypher and APOC’s md5 function.

On the Podcast: Darko Križić


On the Graphistania podcast this week we have an interview with Darko Križić, the CTO of Neo4j partner PRODYNA.

Darko has been working on Neo4j projects for the last couple of years and chats with Rik about how they got into graph databases at PRODYNA, why graphs work well for modeling complex domain models, and the Cypher query language.

Tweet of the Week


My favourite tweet this week was by Tim Williams:

That’s all for this week. Have a great weekend!

Cheers, Mark

The post This Week in Neo4j – 20 May 2017 appeared first on Neo4j Graph Database.

The Top 10 Reasons You Should Attend GraphConnect New York

This autumn, GraphConnect will again be in the town that never sleeps – New York City! Whether you’re new to the world of graph database or you’ve been a part of the movement for some time, there’s something new for everybody at this year’s event.

Here are just a few of the best reasons why you should attend GraphConnect NYC on October 23 and 24:

10. The Location


Learn the top 10 reasons to attend the GraphConnect New York conference in 2017

This year, GraphConnect New York takes place in a variety of locations:
    • Training and Workshops will take place at the Convene Center, located right off of Times Square, which offers panoramic views of the neon lights, Central Park and the Hudson River.
    • GraphConnect sessions will be hosted at Pier 36, a state-of-the-art events venue with views of downtown Manhattan, the Manhattan and Brooklyn Bridges and the Statue of Liberty.
    • The GraphHack hackathon will be held at eBay’s NYC offices, a brand new space located in the heart of Chelsea within walking distance of a variety of incredible restaurants.

Come early or stay late to enjoy all the amazing activities NYC has to offer!

9. The disConnect Party


The disConnect party at GraphConnect

On Tuesday, you’ll spend the day watching both CEO Emil Eifrem and Chief Scientist Jim Webber talk about the future of Neo4j and graphs, attending a dozen sessions and lightning talks, getting help with your data modeling in the GraphClinic, and perhaps doing some Graph Karaoke. Afterwards, you’ll probably be ready to disconnect your brain, and start building relationships with your fellow engineers and business execs.

The post-conference disConnect party lets you mingle with your new connections over a few drinks — all while making plans on how you’ll change the world with the power of graph database technology.

Plus, there will be free drinks. Need I say more?

8. The 5 Customized Learning Tracks


Learning tracks at GraphConnect New York

With so many amazing submissions for this year’s Call for Papers, we had a really difficult time choosing the sessions. For you, that means amazing, high-quality presentations all around!

The Case Studies track covers how graph technology has helped businesses from a variety of industries achieve success, including an exciting use case from eBay, which uses Neo4j for artificial intelligence (AI) technology.

We also have not one, but two (!) How To tracks that teach developers, architects and DBAs how to use Neo4j for knowledge graphs, NLP, infosec, DevOps, machine learning, manufacturing, fraud detection, and more. Customers will also talk about moving applications from legacy databases to graphs.

The Neo4j Deep Dive track will teach you how to use the latest features and products announced at GraphConnect, cover the advances in openCypher, and even show how to use Elasticsearch with Neo4j.

The last track? Seventeen (17!!!) Lightning Talks, each being just 15 minutes in length. These sessions will cover topics ranging from reducing infectious diseases to analyzing Salesforce data — and even using your voice to power knowledge graphs.

7. DevZone


The Developer Zone at GraphConnect New York

New this year is our DevZone, a place for developers to chat with the speakers, lounge on couches, play Graph Karaoke, experience virtual reality, get Neo4j Certified and grab a snack.

Stop by, connect with other developers, learn something new, and have fun! (oh, and read #4!)

6. The GraphHack


The GraphHack hackathon at GraphConnect New York

The first day of the conference is jam-packed with training classes and workshops, and is brought to a close by the GraphHack – a graph database hackathon organized around the theme of “Graphs for Good.”

Developers will divide into teams to see who can build the best app for Neo4j that benefits society, with a chance to win some great prizes. Don’t miss it!

5. The Amazing Speaker Lineup


The speaker lineup at GraphConnect New York

GraphConnect always attracts the best speakers in the graph technology ecosystem and this year is no exception. This year’s highlights include:
    • Ashley Sun, LendingClub
    • Kenny Bastani, Pivotal
    • Julien Pierre, Microsoft
    • Andy Robbins, Bloodhound
    • Michael Zelenetz, NewYork-Presbyterian Hospital
    • Ajinkya Kale and Anuj Vatsa, eBay, Inc
    • Mark Hashimoto, Hai Thai, Jessica Lowing and Mark Ture, Comcast

Of course, there are still tons more amazing speakers to fill the day, all of whom can be found in the Agenda section on GraphConnect.com.

4. In-Person Access to Neo4j Engineering


Neo4j engineers at GraphConnect New York

Get the full graph database experience and meet the makers of the world’s #1 platform for connected data!

At GraphConnect New York, you’ll get to rub elbows with the very engineers who built Neo4j. That means networking, idea-swapping and technical questions. And what other database technology can offer such personal access?

Whether you have a quick question or you’re tackling an advanced graph challenge, plenty of Neo4j experts will also be in the DevZone GraphClinic to help get your graph database feeling well soon.

3. Expanded Options for Neo4j Training


Neo4j training at GraphConnect New York
The first day of GraphConnect – October 23rd – is devoted to instructor-led classroom training. In addition to the standard 8-hour training classes like Neo4j Fundamentals and Graph Modeling, we’ve added a great set of 4-hour workshops!

The new workshops will still be hands-on, but will be faster-paced and include sessions on building apps with Neo4j, using Spring Cloud and Spring Boot for microservices development, data science, GraphQL and more! Expect to fill your brain — some of our top engineers, trainers and partners will be leading these sessions.

2. The Big Announcements


Neo4j announcements at GraphConnect New York
In 2016, GraphConnect San Francisco saw the release of Neo4j 3.1, which introduced Causal Clustering for massively scalable graph data.

And the release of Neo4j 3.2 at this year’s GraphConnect Europe added multi-data center and cloud zone support. Plus – for the fourth release in a row – write performance increased significantly with the most recent boosts making Neo4j 3.2 360% faster than 2015’s version 2.3!

We can’t say exactly what will be announced on the keynote stage this year, but rest assured that it’ll be big!

1. The Relationships


Relationships and networking at GraphConnect New York
Graph technology is powerful because it leverages data relationships. GraphConnect is powerful because it builds your person-to-person relationships.

After all, trainings, presentations and announcements can all be experienced elsewhere or even remotely, but in-person networking and relationship building with fellow graphistas can’t happen anywhere else but GraphConnect.

You’ll find it difficult to be an orphan node at GraphConnect New York. That’s because everyone in attendance already understands the inherent value of relationships over individual data points.

Do you really need any other reasons to attend GraphConnect this year? I didn’t think so. We’ll see you in NYC!


What are you waiting for?
Click below to register for GraphConnect New York on October 23-24, 2017 at Pier 36 in New York City – and connect with leading graph experts from around the globe.


Get My Ticket

The post The Top 10 Reasons You Should Attend GraphConnect New York appeared first on Neo4j Graph Database.

GraphConnect New York Agenda: Everything You Need to Know [2017 Edition]

You already have all the reasons you should attend GraphConnect New York, and you know who the must-see presenters will be, but now it’s time for the best reason of all: the bottom-to-top agenda of keynotes, presentations and lightning talks you can attend at the United States graph technology event of the year.

See the full agenda right here on GraphConnect.com or check out the highlights below:

Keynotes & Big Announcements


Starting off the day will be Neo4j’s own CEO, Emil Eifrem, who will be taking us through the world of graphs past, present and – most importantly – future. As always, Emil reserves the biggest company and product announcements of the year for his GraphConnect keynote, so you won’t want to miss this big kickoff to an action-packed day.

Of this year’s announcement, Emil said, “The stuff we’re announcing now is absolutely mind-blowing. It’s 10 times more exciting to me than anything we’ve announced before — and I can’t wait to be there and talk about it.”



Closing us out for the day will be Jim Webber, Chief Scientist at Neo4j. His closing keynote address will dive into more technical and scientific facets of graph technology, including what’s currently powering the Neo4j product – and what’s in the pipeline for making Neo4j even more powerful. If you’re curious about the future of the graph sector, this will be a talk you don’t want to miss!

Five Main Learning Tracks


GraphConnect audience member taking notes at a presentation
Fika coffee break at GraphConnect

Throughout the Day: GraphClinics and Fikas


All day, whenever you’re between presentations or lightning talks, the GraphClinics will be open for free consulting and troubleshooting of your Neo4j deployment. The GraphClinics are staffed by Neo4j engineers and consultants, so you’ll be receiving absolutely top-notch advice and tips – all free of charge.

And don’t forget that during the day’s many fikas, you’ll have an opportunity to rub elbows with Neo4j executives and the entire Neo4j engineering team. Nothing beats a face-to-face meeting over coffee! (except perhaps tea…)

Wrap Up the Day with the disConnect Party


After a full day of graph talks, you’ll probably be ready to disconnect your brain and start building relationships with your fellow engineers and business execs.

The post-conference disConnect party lets you mingle with your new connections over a few drinks — all while making plans on how you’ll change the world with the power of graph database technology.

With such a jam-packed agenda and star-studded speaker lineup, you simply can’t afford to miss GraphConnect New York. Click below to get your ticket, and we’ll see you on October 24th!

Register for GraphConnect

The post GraphConnect New York Agenda: Everything You Need to Know [2017 Edition] appeared first on Neo4j Graph Database.


The ROI on Connected Data: The Overlooked Value of Context for Business Insights [+ Airbnb Case Study]

Your data is inherently valuable, but until you connect it, that value is largely hidden.

Those data relationships give your applications an integrated view that powers real-time, higher-order insights traditional technology cannot deliver.

Learn why you need data context for business insights in this series on the ROI of connected data


In this series, we’ll examine how investments in connected data return dividends for your bottom line – and beyond. Last week, we explored how increasing data’s connectedness increases its business value.

This week, we’ll take a closer look at how connected data gives you contextual insights for essential business use cases.

Connected Data Offers Business Context


The biggest benefit of connected data is the ability to provide an integrated view of the data to your analytic and operational applications, thereby gaining and growing intelligence downstream.

The connections can be made available to applications or business users to make operational decisions. You also gain context that allows you to better refine the information you’re collecting or the recommendations you’re producing.

Marketing may determine the best time to send an email to customers who previously purchased winter coats and dynamically display photos in their preferred colors. The more understanding you have of the relationships between data, the better and more refined your system is downstream.

Business Use Cases of Connected Data


Connected data applies to a variety of contexts.

In addition to refining the output of your recommendation engines, you can better understand the flow of money to detect fraud and money laundering (see below), and assess the risk of a network outage across computer networks.


A connected dataset for a fraud detection use case


Connected data also helps you see when and how relationships change over time. For example, you can determine when a customer moves and change the applicable data (such as mailing address) so that customer data doesn’t become obsolete.

Connected data is most powerful when it provides operational, real-time insights and not just after-the-fact analytics. Real-time insights allow business users and applications to make business decisions and act in real time. Thus, recommendation engines leverage data from the current user session – and from historical data – to deliver highly relevant suggestions.

Using a connected-data view, IT organizations proactively mitigate network issues that would otherwise cause an outage, and anti-fraud teams put an end to potentially malicious activity before it results in a substantial loss.

Case Study: Airbnb


With over 3500 employees located across 20 global offices, Airbnb is growing exponentially.

As a result of employee growth, they have experienced an explosion in both the volume and variety of internal data resources, such as tables, dashboards, reports, Superset charts, Tableau workbooks, knowledge posts and more.

As Airbnb grows, so do the problems around the volume, complexity and obscurity of data. Information and people become siloed, and employees end up navigating personal tribal knowledge instead of having clear and easy access to relevant data.

In order for this ocean of data resources to be of any use at all, the Airbnb team would need to help employees navigate the varying quality, complexity, relevance and trustworthiness of the data. In fact, lack of trust in the data was a constant problem: employees were afraid of accidentally using outdated or incorrect information. Instead, they would create their own additional data resources, further adding to the problem of myopic, isolated datasets.

To address these challenges, the Airbnb team created the Dataportal, a self-service system providing transparency to their complex and often-obscure data landscape. This search-and-discovery tool democratizes data and empowers Airbnb employees to easily find or discover data and feel confident about its trustworthiness and relevance.

When creating the Dataportal, the Airbnb team realized their ecosystem was best represented as a graph of connected data. Nodes were the various data resources: tables, dashboards, reports, users, teams, etc. Relationships were the already-present connections in how people used the data: consumption, production, association, team affinity, etc.

Using a graph data model, the relationships became just as pertinent as the nodes. Knowing who produced or consumed a data resource can be just as valuable as the resource itself. Connected data thus provides the necessary linkages between silos of data components and provides an understanding of the overall data landscape.

Given their connected data model, it was both logical and performant to use a graph database to store the data. Using Apache Hive as their master data store, Airbnb exports the data using Python to create a weighted PageRank of the graph data before pushing it into Neo4j where it’s synced with Elasticsearch for simple search and data discovery.
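
As a rough illustration of that ranking step, here is a minimal Python sketch (not Airbnb's actual code) of computing a weighted PageRank over an exported edge list with NetworkX; the node IDs, edge weights and field layout are assumptions.

import networkx as nx

# Each row exported from Hive: (source, target, weight) - an illustrative schema
edges = [
    ("user:alice", "table:bookings", 3.0),       # e.g., repeated consumption events
    ("user:bob", "table:bookings", 1.0),
    ("table:bookings", "dashboard:growth", 2.0),
]

graph = nx.DiGraph()
graph.add_weighted_edges_from(edges)

# PageRank weighted by edge weight; the scores are later stored on the nodes
# in Neo4j and used to boost search ranking
scores = nx.pagerank(graph, weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1]))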

Conclusion


As you can see, once you surface the connections in your data, the use cases are endless.

The insights that these connections enable allow your organization to remain nimble in a changing business world and overcome the challenges of digital transformation. In the end, having a connected-data view of your enterprise is a future-proof solution to unknown future business requirements.

Next week, we’ll explore how to harness connected data using graph database technology in conjunction with your existing data platforms and analytics tools.


Get more value from the connections in your data:
Click below to get your copy of The Return on Connected Data and learn how to create a sustainable competitive advantage with graph technology.


Read the White Paper


Catch up with the rest of the ROI on connected data blog series:

Democratizing Data Discovery at Airbnb

Learn how Airbnb democratized their data discovery with a graph database.
Editor’s Note: This presentation was given by John Bodley and Chris Williams at GraphConnect Europe in May 2017.

Presentation Summary


Airbnb, the online marketplace and hospitality service for people to lease or rent short-term lodging, generates many data points, which leads to logjams when users attempt to find the right data. Challenges managing all the data points have led the data team to search for solutions to “democratize the data,” helping employees with data exploration and discovery.

To address this challenge, Airbnb has developed the Dataportal, an internal data tool that helps with data discovery and decision-making and that runs on Neo4j. It’s designed to capture the company’s collective tribal knowledge.

As data accumulates, so do the challenges around the volume and complexity of the data. One example of where this data accumulates is in Airbnb’s Hive data warehouse. Airbnb has more than 200,000 tables in Hive spread across multiple clusters.

Each day the data starts off in Hive. Airbnb’s data engineers use Airflow to push it to Python. The data is eventually pushed to Neo4j by the Neo4j driver. The graph database is live, and every day they push updates from Hive into the graph database.

Why did Airbnb choose Neo4j? There are multiple reasons. Neo4j captures the relevancy of relationships between people and data resources, helping guide people to the data they need and want. On a technical level, it integrates well with Python and Elasticsearch.

Airbnb’s Dataportal UI is designed to help users, the ultimate holders of tribal knowledge, find the resources they need quickly.

Full Presentation: Democratizing Data at Airbnb


What we will be talking about today is how Airbnb uses Neo4j’s graph database to manage the many data points that accumulate in our Hive data warehouse.



What Is the Dataportal?


John Bodley: Airbnb is an online marketplace that connects people to unique travel experiences. We both work in an internal data tools team where our job is to help ensure that Airbnb makes data-informed business decisions.

The Dataportal is an internal data tool that we’re developing to help with data discovery and decision-making at Airbnb. We are going to describe how we modelled and engineered this solution, centered around Neo4j.

Addressing the Problem of Tribal Knowledge


The problem that the Dataportal project attempts to address is the proliferation of tribal knowledge. Relying on tribal knowledge often stifles productivity. As Airbnb grows, so do the challenges around the volume, the complexity and the obscurity of data. In a large and complex organization with a sea of data resources, users often struggle to find the right data.

We run an employee survey and consistently score really poorly on the question, “The information I need to do my job is easy to find.”

Data is often siloed, inaccessible and lacks context. I’m a recovering data scientist who wants to democratize data and provide context wherever possible.

Taming the Firehose of Hive


We have over 200,000 tables in our Hive data warehouse. It is spread across multiple clusters. When I joined Airbnb last year, it wasn’t evident how you could find the right table. We built a prototype, leveraging previous insights, giving users the ability to search for metadata. We quickly realized that we were somewhat myopic in our thinking and decided to include resources beyond just data tables.

Data Resources Beyond the Data Warehouse


We have over 10,000 Superset charts and dashboards. Superset is an open source data analytics platform. We have in excess of 6,000 experiments and metrics. We have over 6,000 Tableau workbooks and charts, and over 1,500 knowledge posts from Knowledge Repo, our open source, code-based knowledge-sharing platform that data scientists use to share their results, as well as a litany of other data types.

But most importantly, there’s over 3,500 employees at Airbnb. I can’t stress enough how valuable people are as a data resource. Surfacing who may be the point of contact for a resource is just as pertinent as the resource itself. To further complicate matters, we’re dispersed geographically, with over 20 offices worldwide.

The mandate of the Dataportal is quite simply to democratize data and to empower Airbnb employees to be data informed by aiding with data exploration, discovery and trust.

At a very high level, we want everyone to be able to search for data. The question is, how to frame our data in a meaningful way for searching. We have to be cognizant of ranking relevance as well. It should be fairly evident what we actually feed into our search indices, which is all these data resources and their associated metatypes.

The Relevancy of Relationships: Bringing People and Data Together


Thinking about our data in this way, we were missing something extremely important: relationships.

Our ecosystem is a graph, the data resources are nodes and the connectivity is all relationships. The relationships provide the necessary linkages between our siloed data components and the ability to understand the entire data ecosystem, all the way from logging to consumption.

Relationships are extremely pertinent for us. Knowing who created or consumed a resource (as shown below) is just as valuable as the resource itself. Where should we gather information from a plethora of disjointed tools? It would be really great if we could provide additional context.

Check out this graphic of how Airbnb defines their relevancy of data relationships with their employees.

Let’s walk through a high-level example, shown below. Using event logs, we discover a user consumes a Tableau chart, which lacks context. Piecing things together, we discover that the chart is from a Tableau workbook. The directionless edge is somewhat ambiguous, but we prefer the many-to-one direction from both a flow and a relevancy perspective. Digging a little further, both these resources were created by another user. Now we find an indirect relationship between these users.

We then discover that the workbook was derived from some aggregated table that wasn’t in Hive, thus exposing the underlying data to the user. Then we parse the Hive logs and determine that this table is actually derived from another table, which provides us with the underlying data. And finally, both these tables are associated with the same Hive schema, which may provide additional context with regards to the nature of the data.

How Airbnb's Dataportal graph search platform first took shape.

We leverage all these data sources and build a graph comprising the nodes and relationships, and this resides in Hive. We pull from a number of different sources. Hive is our persistent data store, where the table schema mimics Neo4j: we have a notion of labels, properties and an ID.

We pull from over six databases that come through scrapes that land in Hive. We also pull from a number of APIs, such as Google and Slack, and from some logging frameworks. That all goes into an Airflow Directed Acyclic Graph (DAG). (Airflow is an open source workflow tool that was also developed at Airbnb.) And then this workflow is run every day, and the graph is left to soak to prevent what we call “graph flickering.”

See the data resources Airbnb leverages to build a graph in Hive.

Dealing with “Graph Flickering”


Let me explain what I mean by graph flickering. Our graph is somewhat time-agnostic. It represents the most recent snapshot of the ecosystem. The issue is certain types of relationships are sporadic in nature, and that’s causing the graph to flicker. We resolve this by introducing the notion of relational state.

We have two sorts of relationships: persistent and transient.

Persistent relationships (see below) represent a snapshot in time of the system; they are the result of a DB scrape. In this example, the creator relationship will persist forever.

Check out how persistent relationships represent a snapshot in time.

Transient relationships, on the other hand, represent events that are somewhat sporadic in nature. In this example, the consumed relationship would only exist on certain days, which would cause the graph to flicker.

To solve this, we simply expand the time period from one to a trailing 28-day window, which acts as a smoothing function. This ensures the graph doesn’t flicker, but also enables us to capture only recent, and thus relevant, consumption information into our graph.

See how transient relationships are sporadic in nature.
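
As a rough sketch of how such a smoothed relationship might be rebuilt each day, the load could merge only the events from a trailing 28-day window (a minimal example with the official Neo4j Python driver; the CONSUMED type, labels, properties and connection details are assumptions, not Airbnb's actual schema):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Consumption events scraped from logs; only the trailing 28 days are loaded,
# so one quiet day doesn't make the relationship flicker in and out
consumption_events = [
    {"user_id": "alice", "resource_uuid": "chart-123", "days_ago": 3},
    {"user_id": "bob", "resource_uuid": "chart-123", "days_ago": 27},
]

merge_consumed = """
UNWIND $events AS event
MATCH (u:Entity:User {id: event.user_id})
MATCH (r:Entity {uuid: event.resource_uuid})
MERGE (u)-[c:CONSUMED]->(r)
SET c.days_since_last_view = event.days_ago
"""

with driver.session() as session:
    recent = [e for e in consumption_events if e["days_ago"] <= 28]
    session.run(merge_consumed, events=recent)
driver.close()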

How Airbnb Uses Neo4j with Python and Elasticsearch


Let’s touch upon how our data ends up in Neo4j and downstream resources.

Shown below is a very simplified view of our data path which, in itself, is a graph. Given that relationships have parity with nodes, it’s pertinent that we also discuss the conduit that connects these systems.

Every day, the data starts off in Hive. We use Airflow to push it to Python. In Python, the graph is represented as a NetworkX object, and from this we compute a weighted PageRank on the graph, which helps improve search ranking. The data is then pushed to Neo4j by the Neo4j driver.

We have to be cognizant of how we do a merge here. The graph database is live, and every day we push updates from Hive into the graph database. That’s a merge, and it is something we have to be quite cautious of.
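
A minimal sketch of what that daily, merge-style push might look like with the official Neo4j Python driver; the labels, key property and PageRank property are assumptions rather than Airbnb's production code:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Rows exported from Hive, carrying the PageRank score computed in NetworkX
table_rows = [
    {"id": "core_data.bookings", "pagerank": 0.0042},
    {"id": "core_data.listings", "pagerank": 0.0031},
]

# MERGE on a stable key so repeated daily loads update nodes instead of
# duplicating them; labels can't be parameterized, so one statement per label set
merge_tables = """
UNWIND $rows AS row
MERGE (t:Entity:Table {id: row.id})
SET t.pagerank = row.pagerank
"""

with driver.session() as session:
    session.run(merge_tables, rows=table_rows)
driver.close()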

From here, the flow forks into two directions. The nodes get pushed into Elasticsearch via a GraphAware plugin, which is based on transaction hooks. From there, Elasticsearch will serve as our search engine. Finally, we use Flask as a lightweight Python web app, which is used with other data tools. Results from Elasticsearch queries are fetched by the web server.

Additionally, results from Neo4j queries pertaining to connectivity are fetched by the web server via Neo4j, using that same driver.

Learn how Airbnb democratized their data discovery with a graph database.

Why did we choose Neo4j as our graph database?

There are four main reasons. First, our data represents a graph, so it felt logical to use a graph database to store the data. Second, it’s nimble. We wanted a really fast, performant system. Third, it’s popular; it’s the world’s number one graph database. The community edition is free, which is really super helpful for exploring and prototyping. And finally, it integrates well with Python and Elasticsearch, existing technologies we wanted to leverage.

Learny why Airbnb choose Neo4j's graph database.

There’s a lovely symbiotic relationship between Elasticsearch and Neo4j, courtesy of some GraphAware plugins. The first is a Neo4j plugin that asynchronously replicates data from Neo4j to Elasticsearch. That means we don’t need to actively manage our Elasticsearch cluster: all our data persists in Neo4j, which we use as the source of truth.

The second plugin lives in Elasticsearch and allows Elasticsearch to consult the Neo4j database during a search. This allows us to enrich search rankings by leveraging the graph topology. For example, we could sort by recently created, which is a property on the relationship, or most consumed, where we have to explore the topology of the graph.
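
For instance, a "most consumed" ranking can be derived from the topology with a query along these lines (a sketch only; the CONSUMED type and labels are assumed, and in the real setup the GraphAware plugins perform this enrichment rather than a hand-written query):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

most_consumed = """
MATCH (u:Entity:User)-[c:CONSUMED]->(r:Entity)
WHERE r.name CONTAINS $term
RETURN r.name AS resource, count(DISTINCT u) AS consumers
ORDER BY consumers DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(most_consumed, term="bookings"):
        print(record["resource"], record["consumers"])
driver.close()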

This is how we represent our data model. We defined a node label hierarchy as follows.

Check out Airbnb's node label hierarchy.

This hierarchy enables us to organize data in both Neo4j and Hive. The top-level :Entity label really represents some base abstract node type, which I’ll explain later.

Let’s walk through a few examples here. Our schema was created in such a way that the nodes are globally unique in our database, by combining the set of labels and the locally scoped ID property.

First, we have a user who’s keyed by their LDAP username, then a table that’s keyed by the table name and finally a Tableau chart that’s keyed by the corresponding DB instance inside the Tableau database.

User name nodes examples from Airbnb.

The graph cores are heavily leveraged in the user interface (UI), and they need to be incredibly fast. We can efficiently match queries by defining per label indices on the ID property and we leverage them for fast access. Here, we’re just explicitly forcing the use of the index because we’re using multiple labels.

Match queries using multiple labels.
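
A sketch of what that might look like in Neo4j 3.x-era Cypher, run through the Python driver; the index, label and property names follow the hierarchy described above but are otherwise illustrative:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Per-label index on the locally scoped id property (Neo4j 3.x syntax)
    session.run("CREATE INDEX ON :Table(id)")

    # The node carries multiple labels, so the planner is told explicitly
    # which label's index to use
    find_table = """
    MATCH (t:Entity:Table)
    USING INDEX t:Table(id)
    WHERE t.id = $id
    RETURN t
    """
    session.run(find_table, id="core_data.bookings")
driver.close()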

Ideally, we’d love to have a more abstract representation of the graph, moving from local to global uniqueness. To achieve that, we leverage another GraphAware plugin, UUID. This plugin assigns a global UUID on newly created entities that cannot be mutated in any way. This gives us global uniqueness. We can talk about entities in the graph by using just this one unique UUID property in addition to the entity label.

This helps us use parameterized queries, which leads to faster query planning and execution times. This is especially relevant when we do bulk loads. Every day we do a bulk load of data and we need that to be really performant.

Here’s this same sort of example as before. Now we’ve simplified this, so we can just purely match any entity using this UUID property, and it’s global.

See match queries.
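
With the UUID in place, the same kind of lookup simplifies to matching on the one global property (a sketch; the uuid value shown is a placeholder for the identifier maintained by the GraphAware UUID plugin described above):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

find_entity = """
MATCH (e:Entity {uuid: $uuid})
RETURN labels(e) AS labels, e.id AS id
"""

with driver.session() as session:
    record = session.run(find_entity, uuid="some-uuid-from-the-api").single()
    if record is not None:
        print(record["labels"], record["id"])
driver.close()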

We have a RESTful API. In the first example, you can match a node based on its labels and ID, which is useful if you have a slug-style URL. In the second, you can match a node based purely on the UUID. The third is how we’d get a created relationship, based on leveraging these two UUIDs. The front-end uses these APIs, as covered in the next section.

Check out match node labels and IDs.
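
A minimal Flask sketch in that spirit; the route shapes, relationship type and helper are assumptions for illustration, not the Dataportal's actual API:

from flask import Flask, jsonify
from neo4j import GraphDatabase

app = Flask(__name__)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_query(query, **params):
    with driver.session() as session:
        return [record.data() for record in session.run(query, **params)]

@app.route("/api/nodes/<label>/<node_id>")
def node_by_label_and_id(label, node_id):
    # Slug-style lookup; in practice the label would come from a known whitelist
    query = "MATCH (n:Entity:%s {id: $id}) RETURN n.uuid AS uuid" % label
    return jsonify(run_query(query, id=node_id))

@app.route("/api/entities/<uuid>")
def node_by_uuid(uuid):
    return jsonify(run_query("MATCH (n:Entity {uuid: $uuid}) RETURN n.id AS id", uuid=uuid))

@app.route("/api/created/<creator_uuid>/<resource_uuid>")
def created_relationship(creator_uuid, resource_uuid):
    query = ("MATCH (u:Entity {uuid: $u})-[r:CREATED]->(n:Entity {uuid: $n}) "
             "RETURN type(r) AS type")
    return jsonify(run_query(query, u=creator_uuid, n=resource_uuid))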

Designing the Front-end of the Dataportal


Chris Williams: I’m going to describe how we enable Airbnb employees to harness the power of our data resource graph through the web application.

The backends of data tools are often so complex that the design of the front-end is an afterthought. This should never be the case, and in fact, the complexity and data density of these tools makes intentional design even more critical.

One of our project goals is to help build trust in data. As users encounter painful or buggy interactions, these can chip away at their trust in your tool. On the other hand, a delightful data product can build trust and confidence. Therefore, with the Dataportal, we decided to embrace a product mindset from the start and ensure a thoughtful user interface and experience.

As a first step, we interviewed users across the company to assess needs and pain points around data resources and tribal knowledge. From these interviews, three overall user personas emerged. I want to point out that they span data literacy levels and many different use cases.

The first of these personas is Daphne Data. She is a technical data power user, the epitome of a tribal knowledge holder. She’s in the trenches tracing data lineage, but she also spends a lot of time explaining and pointing others to these resources.

Second, we have Manager Mel. Perhaps she’s less data literate, but she still needs to keep tabs on her team’s resources, share them with others, and stay up to date with other teams that she interacts with. Finally, we have Nathan New. He may be new to Airbnb, working with a new team, or new to data. In any case, he has no clue what’s going on and quickly needs to get ramped up.

Airbnb's Dataportal user personalities.

With these personas in mind, we built up the front end of the Dataportal to support data exploration, discovery and trust through a variety of product features. At a high level, these broadly include search, more in-depth resource detail and metadata exploration, and user-centric, team-centric and company-centric data.

We do not really allow free-form exploration of our graph as the Neo4j UI does. The Dataportal offers a highly curated view of the graph, which attempts to provide utility while maintaining guardrails, where necessary, for less data-literate employees.

Designing the Dataportal for exploration, discovery and trust.

The Dataportal is primarily a data resource search engine. Clearly, it has to have killer search functionality. We tried to embrace a clean and minimalistic design. This aesthetic allows us to maintain clarity despite all the data content, which adds a lot of complexity on its own.

We also tried to make the app feel really fast and snappy. Slow interactions generally disincentivize exploration.

At the top of the screen (see below) are search filters that are somewhat analogous to Google. Rather than images, news and videos, we have things like data resources, charts, groups, teams and people.

Data discovery and contextual search.

The search cards have a hierarchy of information. The overall goal is to help provide a lot of context to allow users to quickly gauge the relevancy of results. We have things like the name, the type. We highlight search terms, the owner of the resource, when it was last updated, the number of views, and so on. And we also try to show the top consumers of any given result set. This is just another way to surface relationships and provide a lot more context.

Continuing with this flow, from a search result, users typically want to explore a resource in greater detail. For this, we have content pages. Here is an example of a Hive table content page.

A hive table content page.

At the top of the page, we have a description linked to the external resource and social features, such as favoriting and pinning, so users can pin a resource to their team page. Below that, we have metadata about the data resource, including who created it, when it was last updated, who consumes it, and so on.

The relationships between nodes provide context. This context isn’t available in any of our other siloed data tools. It’s something that makes the Dataportal unique, tying the entire ecosystem together.

Another way to surface graph relationships is through related content, so we show direct connections to this resource. For a data table, this could be something like the charts or dashboards that directly pull from the data table.

We also have a lot of links to promote exploration. You can see who created this resource and find out what other resources that they work on.

The screen below highlights some of the features we built out specifically for exploring data tables. You can explore column details and value distributions for any table. Additionally, tracing data lineage is important, so we allow users to explore both the parent tables and the child tables of any given table.

We’re also really excited about being able to enrich and edit metadata on the fly: we can add table descriptions and column contents, and these are pushed directly to our Hive metastore.

Hive metastore for Airbnb's Dataportal.

The screen below highlights our Knowledge Repo, which is where data scientists can share analyses. You have dashboards and visualizations. We are typically iframing these data tools. That generates a log, which our graph then picks up; it trickles back into our graph and affects PageRank and the number of views.

Airbnb Knowledge Repo analyses.

Helping Users, the Ultimate Holders of Tribal Knowledge


Users are the ultimate holders of tribal knowledge, so we created a dedicated user page, shown below, to reflect that.

On the left is basic contact information. On the right are resources that the user frequently accesses, created or favorited, along with the groups to which they belong. To help build trust in data, we wanted to be transparent about it: you can look at the resources any person views, including what your manager views, and so on.

Along the lines of data transparency, we also made a conscious choice to keep former employees in the graph.

Take George, the handsome intern that all the ladies talk about: he created a lot of data resources and favorited things. If I want to find a cool dashboard he made last summer whose name I’ve forgotten, keeping him in the graph is really relevant.

An example of data transparency with former employees tribal knowledge.

Another source of tribal knowledge is found within an organization’s teams. Teams have tables they query regularly, dashboards they look at and go-to metric definitions. We found that team members spend a lot of time telling people about the same resources, and they wanted a way to quickly point people to these items.

For that, we created group pages. The group overview below shows who’s in a particular team.

Group pages of tribal knowledge in Airbnb's Dataportal.

To enable curating content, we decided to borrow some ideas from Pinterest, so you can pin any content to a page. If a team doesn’t have any content that’s been curated, there’s a Popular tab. Rather than displaying an empty page, we can leverage our graph to inspect what resources the people on a given team use on a regular basis and provide context that way.
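
A sketch of the kind of query that could back such a Popular tab, counting what a team's members have consumed (the MEMBER_OF and CONSUMED types and labels are assumed for illustration):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

popular_for_team = """
MATCH (:Entity:Team {id: $team_id})<-[:MEMBER_OF]-(u:Entity:User)
MATCH (u)-[c:CONSUMED]->(r:Entity)
RETURN r.uuid AS uuid, r.name AS name, count(c) AS views
ORDER BY views DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(popular_for_team, team_id="data-tools"):
        print(record["name"], record["views"])
driver.close()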

We leverage thumbnails for maximum context. We gathered about 15,000 thumbnails from Tableau, Knowledge Repo and our internal data tool Superset. They come from a combination of APIs and headless browser screenshots.

The screen below highlights the pinning and editing flows. On the left, similar to Pinterest, you can pin an item to a team page. On the right, you can customize and rearrange the resources on the team page.

Team page pinning of editing flows.

Finally, we have company metric data.

We found that people on a team typically keep a tight pulse on relevant information for their team. But as the company grows larger, they often feel more and more disconnected from high-level, company-wide metrics. For that, we created a high-level Airbnb dashboard where they can explore up-to-date company-level data.

Airbnb dashboard company level data.

Front-End Technology Stack


Our front-end technology stack is similar to what many teams use at Airbnb.

We leverage modern JavaScript, ES6. We use node package manager (NPM) to manage package dependencies and build the application. We use an open source package called React from Facebook for generating the Document Object Model (DOM) and the UI. We use Redux, which is an application state tool. We use a cool open source package from Khan Academy called Aphrodite, which essentially allows you to write Cascading Style Sheets (CSS) in JavaScript. We use ESLint to enforce Airbnb’s JavaScript style guide, which is also open source, and Enzyme, Mocha and Chai for testing.

Airbnb's Dataportal technology stack.

Challenges in Building the Dataportal


We faced a number of challenges in building the Dataportal.

It is an umbrella data tool that brings together all of our siloed data tools and generates a picture of the overall ecosystem. The problem with this is that any umbrella data tool is vulnerable to changes in the upstream dependencies. This can include things on the backend like schema changes, which could break our graph generation, or URL changes, which would break the front-end.

Additionally, data-dense design, creating a UI that’s simple and still functional for people across a large number of data literacy levels, is challenging. To complicate this, most internal design patterns aren’t built for data-rich applications. We had to do a lot of improvising and creation of our own components.

We have a non-trivial Git-like merging of the graph that happens when we scrape everything from Hive and then push that to production in Neo4j.

The data ecosystem is quite complex, and for less data literate people, this can be confusing. We’ve used the idea of proxy nodes, in some cases, to abstract some of those complexities. For example, we have lots of data tables, which are often replicated across different clusters. Non-technical users could be confused by this, so we actually accurately model it on the backend, and then expose a simplified proxy node on the front end.

Airbnb's Dataportal challenges.

Future Directions for Airbnb and the Graph Database


We’re considering a number of future directions.

The first is a network analysis that finds obsolete nodes. In our case, this could be things like data tables that haven’t been queried for a long time and are costing us thousands of dollars each month. It could also be critical paths between resources.

One idea that we’re exploring is a more active curation of data resources. If you search for something and you get five dashboards with the same name, it’s often hard, if you lack context, to tell which one is relevant to you. We have passive mechanisms like PageRank and surfacing metadata that would, hopefully, surface more relevant results. We are thinking about more active forms of certification that we could use to boost results in search ranking.

We’re also excited about moving from active exploration to delivering more relevant updates and content suggestions through alerts and recommendations. For example, “Your dashboard is broken,” “This table you created hasn’t been queried for several months and is costing us X amount,” or “This group that you follow just added a lot of new content.”

And then, finally, what feature set would be complete without gamification?

We’re thinking about providing fun ways to give content producers a sense of value by telling them, for example, “You have the most viewed dashboard this month.”

The future of Airbnb's Dataportal.


Inspired by this talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Get My Ticket

The 5-Minute Interview: Galit Gontar, Software Engineer at Glidewell Laboratories

Catch this week’s 5-Minute Interview with Galit Gontar, Software Engineer at Glidewell Laboratories
“We chose a graph database because it’s so much more expressive and a natural representation of the type of data we work with,” says Galit Gontar, Software Engineer at Glidewell Laboratories in Irvine, California.

In fact, Gontar and the Glidewell team have found the graph data model to be useful beyond just one or two niche use cases. The flexibility, she says, allows the Neo4j graph database to be used generically across the company’s entire range of applications.

In this week’s 5-Minute Interview (conducted at GraphConnect San Francisco), we discuss how Neo4j is used in Glidewell’s enterprise and manufacturing processes. Gontar also talks about the other technologies that the Glidewell team uses alongside Neo4j.



Tell us how your team uses Neo4j at Glidewell Laboratories.


Galit Gontar: We are a dental lab that manufactures prosthetic teeth, and we use Neo4j in a number of different ways. Our largest project is implementing a Neo4j-based workflow engine into our manufacturing processes, and then eventually into all of our enterprise processes.

What made Neo4j stand out?


Galit: It’s definitely the best graph database out there right now, and we chose a graph database for this type of implementation because it’s so much more expressive and a natural representation of the type of data we work with. And we really like working with it; it has a number of performance advantages and we really like using Cypher. Neo4j is very easy to use and easy to incorporate into our other systems.

Catch this week’s 5-Minute Interview with Galit Gontar, Software Engineer at Glidewell Laboratories

Can you talk to me about how you use Neo4j with other technologies?


Galit: We use MongoDB for a lot of our catalogs and Elasticsearch for our search systems. Most of our infrastructure is written in C#-based microservices that run in Docker containers.

Anything else you’d like to add or share?


Galit: I think Neo4j works very well as a generic system. I don’t think I’ve seen a lot of other examples of it used very, very generically. Generally speaking, the relationships are expressive – the actual way that an item relates to another item. But we’ve had a lot of success using Neo4j in fundamental, generic applications, and then reusing the same kind of schema throughout our company.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Want to learn more about graph databases and Neo4j? Click below to register for our online training class, Introduction to Graph Databases and master the world of graph technology in no time.

Sign Me Up

The 5-Minute Interview: Christophe Willemsen, Senior Consultant at GraphAware

Catch this week’s 5-Minute Interview with Christophe Willemsen, Senior Neo4j Consultant at GraphAware
For this week’s 5-Minute Interview, I chatted with Christophe Willemsen, Senior Neo4j Consultant at GraphAware in London. Christophe and I spoke at GraphConnect San Francisco last October.

Here’s what we discussed:



Talk to me about how you use Neo4j.


I’m a Neo4j consultant at GraphAware, a Neo4j consulting company based in London. I’ve been working with the graph database for the last four years, and it has become more of a passion than a job.

One of our last projects involved providing recommendations for holiday house accommodations. In a period of only two weeks, we went from zero to production with a recommendation engine, and then went back to do some fine-tuning. Our recommendations continually change based on updated information that is gathered.

Right now, with Neo4j we use natural language processing and Elasticsearch integration to provide graph-aided search. Through these queries we can provide Elasticsearch boosts and relevance scores based on the user’s graph. Some big clients, such as Airbnb, are using it.

What made Neo4j stand out?


We were able to go into production much, much faster than with a traditional relational database. It also provides us with the flexibility to continually make changes over time — often our clients aren’t sure what they really want, so we have to make adjustments. We don’t really do any work without Neo4j and have increased our customer base, which is great.

Catch this week’s 5-Minute Interview with Christophe Willemsen, Senior Neo4j Consultant at GraphAware

If you could start over with Neo4j, taking everything you know now, what would you do differently?


I immediately fell in love with Neo4j. At the time I discovered graph databases, I was working on a DNA project for dog genealogy with traditional databases and was really struggling. From the very beginning I wasn’t sure how to move forward with my data. There isn’t really anything I would do differently — experience is really important.

Can you talk to me about some of your most interesting or surprising results you’ve had while using Neo4j?


We did an interesting project just last week, which we presented at a hackathon. We do a lot of computations with Apache Spark, which you can now trigger from within Cypher.

We created a procedure in which we triggered the Apache Spark process over a Spark cluster. This allows your Neo4j JVM to remain quiet, perform all computation outside of your main database, and automatically write the results back to Neo4j. It is going to be an important aspect of our future work.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com.


Want to learn more about graph databases and Neo4j? Click below to register for our online training class, Introduction to Graph Databases and master the world of graph technology in no time.

Sign Me Up

Using Graph Structure Record Linkage on Irish Census Data with Neo4j

See How We Used Graph Structure Record Linkage to Extract Insights on Irish Census Data with Neo4j

For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together to identify the same residents.

Since then I’ve learned a bit about master data management and record linkage and so I thought I would give it another stab.

Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.

The data model of the imported data is very linear:

See How We Used Graph Structure Record Linkage to Extract Insights on Irish Census Data with Neo4j


In this post, however, I’m going to be focusing on Houses and Residents and creating relationships between them based on their properties.

Relations to the Head


To view an example of what a census record from 1911 Ireland looks like you can have a look at the McCarthys of 1901 and 1911. Charles is the head of the family with his wife Hannah, mother Ellen, children (two in 1901 and seven in 1911), and a servant (Timothy Walsh in 1901 and William Regan in 1911).

McCarthy census data from 1901: The McCarthy family of Barnagowlane, Cloghdowell, Cork, 1901
McCarthy census data from 1911: The McCarthy family of Barnagowlane, Cloghdonnell, Cork, 1911
Surname Forename Age Sex Relation to Head Religion   Surname Forename Age Sex Relation to Head Religion
McCarthy Charles 37 Male Head of Family Roman Catholic   McCarthy Charles 47 Male Head of Family Roman Catholic
McCarthy Hannah 25 Female Wife Roman Catholic   McCarthy Hannah 35 Female Wife Roman Catholic
McCarthy William 1 Male Son Roman Catholic   McCarthy William 11 Male Son Roman Catholic
McCarthy Bridget   Female Daughter Roman Catholic   McCarthy Bridget 10 Female Daughter Roman Catholic
              McCarthy Ellen 8 Female Daughter Roman Catholic
              McCarthy Kate 6 Female Daughter Roman Catholic
              McCarthy Florence 4 Male Son Roman Catholic
              McCarthy Charles Peter 2 Male Son Roman Catholic
              McCarthy Annie   Female Daughter Roman Catholic
McCarthy Ellen 65 Female Mother Roman Catholic   McCarthy ? Ellen 75 Female Mother Roman Catholic
Walsh Timothy 25 Male Servant Roman Catholic              
              Regan William 24 Male Servant Roman Catholic


The McCarthys are an almost exact match between the 1901 and 1911 census records. The names, ages, occupations, and relationships all match perfectly.

Unfortunately the story for other records is not so simple. Many times, houses – which to the human eye seem to be the same house – can have wildly varying details. For example, Hannah might be listed as Hana or Anne in a different census.

Likewise, ages vary a lot more than you might think. In examining the records I regularly found ages varying by a year or two and have even found a few houses with ages off by as much as 10-15 years.

In both censuses, there is a field for residents to fill out called “Relation to Head.” This gives us information about how each resident is related to the head of the house. In the case of the McCarthys, Charles is listed as “Head of Family” in both years. The rest of the family has a nice representation of things that we often see in the data: “Wife,” “Son,” “Daughter,” and “Servant.”

We might be tempted to say, “This person was the head in 1901, so they must be the same person who was the head in 1911.” Often, however, the head of the family can die or retire, leaving the role of head of the family to their wife or child.

So, can the “Relation to Head” values still be useful to us to match any given resident from 1901 to another resident in 1911?

First, let’s cover the general process of record linkage I have been using. To find a match for a resident, I start by using an Elasticsearch server (which contains a duplicate of my Neo4j census data) to quickly find a list of other residents with a match on very rough criteria:

    • Is the resident in the other census?
    • Does the sex match (or is it NULL)?
    • Is the resident’s age within 15 years of what it would be expected to be in the other census?
    • Does the name match, roughly (within an edit distance of 4)?

This comes back with anywhere from zero to hundreds of results. I call these “similarity candidates” and for each one, I create a relationship between the original record and the candidate.
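
For illustration only, since the author's pipeline is Ruby: a comparable rough-criteria query against Elasticsearch might look like the following Python sketch. The index name, field names and mapping are assumptions, and Elasticsearch's fuzziness (capped at 2) only approximates the edit distance of 4 described above.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def candidate_query(name, sex, expected_age, other_census_year):
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"census_year": other_census_year}},      # the other census
                    {"range": {"age": {"gte": expected_age - 15,       # age within 15 years
                                       "lte": expected_age + 15}}},
                ],
                "must": [
                    {"match": {"name": {"query": name, "fuzziness": 2}}},  # rough name match
                ],
                "should": [
                    {"term": {"sex": sex}},  # boosts a sex match; a missing sex simply doesn't boost
                ],
            }
        }
    }

results = es.search(index="residents", body=candidate_query("William McCarthy", "Male", 11, 1911))
for hit in results["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])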

With this list, I can compare the attributes of the two records (using the record_linkage gem I created) to see how closely they match.

The closer their name, sex, age, etc. matches, the higher score they get. Ideally, the real match should have the highest score, but that isn’t always true and can take some tuning.

In addition to this simple comparison of attributes, I have now added a process to take advantage of the similarity candidate relationships to compare family relationships.

Let’s start with this example of a sub-graph pattern:

mccarthy_charles_comparison


The relationship CHILD_OF is created whenever there is a “Son” or “Daughter” in the “Relation to Head” field. Likewise, we can create other gender-neutral relationships like MARRIED_TO, SIBLING_OF, NIECE_NEPHEW_OF, etc….

In this case, the resident in question is the 1901 record for William. When we are evaluating the 1911 record of William as a potential match we can explore other residents in the same house as evidence of similarity.

The diagram above shows that both records have a CHILD_OF relationship to the two “Charles” records which furthermore are linked via a SIMILARITY_CANDIDATE relationship. Because of this we can say that there is a greater chance that the two “William” records represent the same person.

This only gives us the ability to find these relationships between the head of the family and other residents. What about generically matching based on the relationship of any two residents of a house?

Let’s say that Charles died sometime between 1901 and 1911. If his wife Hannah takes over as the head of the family, we would have a sub-graph which looks like this:

mccarthy_hannah_comparison


We could say that when we have the paths -CHILD_OF-><-MARRIED_TO- and -CHILD_OF-> on either side, we can build our case for a match a bit more. This kind of matching can be used on all of the other residents of the house with SIMILARITY_CANDIDATE relationships.

For example, -CHILD_OF-><-CHILD_OF- could be matched to -CHILD_OF-><-CHILD_OF- even in this case, where the wife becomes the head of the house. Or if a child becomes the head then it could be compared to a -SIBLING_OF- relationship.
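
To make the idea concrete, here is a tiny sketch (in Python rather than the Ruby used below) of the kind of equivalence lookup this implies; the path strings and scores are purely illustrative:

# Identical paths score 1.0; known-equivalent path pairs (e.g., the wife has
# taken over as head of the family) get a partial score; anything else is penalized.
PATH_EQUIVALENCE_SCORES = {
    "-CHILD_OF-><-MARRIED_TO-": {"-CHILD_OF->": 0.8},
    "-CHILD_OF-><-CHILD_OF-": {"-SIBLING_OF-": 0.7},
}

def path_pair_score(path_1901, path_1911):
    if path_1901 == path_1911:
        return 1.0
    return PATH_EQUIVALENCE_SCORES.get(path_1901, {}).get(path_1911, -2.0)

print(path_pair_score("-CHILD_OF-><-MARRIED_TO-", "-CHILD_OF->"))  # 0.8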

The Code


So how do we actually do this with Neo4j? First let’s take our sub-graph and turn our nodes into variables:

Record Linkage in Irish Census Data


In this example let’s take resident h1 r1 (house 1, resident 1) as the resident in question and h2 r1 as the candidate that we want to compare it to. This is the sort of query that Neo4j is wonderful at both performing quickly and making easy to formulate.

Let’s look at part of the Ruby code:

def get_similarity_candidate_relationship_paths
  self.query_as(:h1_r1)
    .match('(h1:House), (h2:House)')
    .match('h1<-[:LIVES_IN]-h1_r1-[sc_1:similarity_candidate]-(h2_r1)-[:LIVES_IN]->h2')
    .match('h1<-[:LIVES_IN]-h1_r2-[sc_2:similarity_candidate]-(h2_r2)-[:LIVES_IN]->h2')
    .match('path1=h1_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of
     |cousin_of|child_in_law_of|step_child_of*1..2]-h1_r2')
    .match('path2=h2_r1-[:born_to|married_to|grandchild_of|niece_nephew_of|sibling_of
     |cousin_of|child_in_law_of|step_child_of*1..2]-h2_r2')
    .pluck(
      :h2_r1,
      'collect([path1, rels(path1), path2, rels(path2)])'
      ).each_with_object({}) do |(r2, data), result|

    result[r2] = data.inject(0) do |total, (path1, rels1, path2, rels2)|
      relations1 = relation_string_from_path_and_rels(path1, rels1)
      relations2 = relation_string_from_path_and_rels(path2, rels2)

      if relations1 == relations2
        1.0
      elsif score = (RELATION_EQUIVILENCE_SCORES[relations1] || {})[relations2]
        score
      else
        -2.0
      end + total
    end
  end
end


Here we start with a Cypher query using the Query API from neo4j.rb. The object upon which we’ve called get_similarity_candidate_relationship_paths is our h1_r1 anchor.

Note here that we match paths of either one or two relationships between two residents of the same house. Then we return all residents found via the SIMILARITY_CANDIDATE relationship from our anchor, along with the family relationship paths aggregated into an Array.

Once the Cypher query returns data, we call relation_string_from_path_and_rels, which transforms the path into a string like -BORN_TO-><-BORN_TO. This string gives us a simple way to express the path between the two residents.

We then can give a score based on the two paths. If the paths are the same then we say that the score is 1.0. If the pair of paths is something like -BORN_TO-><-BORN_TO and -SIBLING_OF-> then we can give a score based on a lookup.

We add these scores up to give us a total score comparing our anchor resident and each of its similarity candidates. All with just one query to the database.

Challenges


There are a couple of things that I needed to do to make this work:

Previously, I was simply grabbing one resident at a time, finding all of the similarity candidates, and then creating a set of relationships to link the resident with the candidates and to store the record linkage scores (both the individual scores for fields and the total score).

However, this approach requires all of the candidates in the house to have SIMILARITY_CANDIDATE relationships in order to compare family relationships. So now I first process all residents for a house to create the similarity candidate relationships and store the record linkage scores, and then go through them again with the graph-based comparisons, storing that score and updating the total.

Beyond that, there is the conceptual problem of determining the scoring when comparing paths. For example, if somebody was BORN_TO the head one year but their spouse takes over as the head, could we say that they’re BORN_TO the spouse if they are a step-child? Family relationships are complicated and don’t always fit neatly into our properties and algorithms.

Conclusion


Most record linkage focuses on the properties of an object, but we need to remember that relationships are data about our entities too. With Neo4j, we have a powerful tool for analyzing those data relationships naturally and quickly.

Additionally I have found that the ability to create relationships on the fly to aggregate calculations like the ones discussed above is a wonderful way to find the best solution quickly.



Want to learn more about how to use Neo4j for your project? Click below to get your free copy of the Learning Neo4j ebook and get up to speed with the world’s leading graph database.
