Welcome to Soft-Boiled’s documentation!

The usage examples below assume that you have created a zip file named soft-boiled.zip containing the top-level directory of the repo. To create the zip file, run "zip -r soft-boiled.zip __init__.py src" from inside the repository directory.

Example IPython notebooks demonstrating the functionality of these algorithms can also be found in the notebooks directory of this repository.

Spatial Label Propagation [slp.py]:

Usage:

sc.addPyFile('/path/to/zip/soft-boiled.zip')  # Can be an HDFS path
from src.algorithms import slp

# Create dataframe from parquet data
tweets = sqlCtx.read.parquet('hdfs:///post_etl_datasets/twitter')
tweets.registerTempTable('my_tweets')

# Get Known Locations
locs_known = slp.get_known_locs(sqlCtx, 'my_tweets', min_locs=3, dispersion_threshold=50, num_partitions=30)

# Get the @mention network (bi-directional @mentions)
edge_list = slp.get_edge_list(sqlCtx, 'my_tweets')

# Run spatial label propagation with 5 iterations
estimated_locs = slp.train_slp(locs_known, edge_list, 5, dispersion_threshold=100)


# Prepare the input functions for the evaluate function. In this case, we create a holdout function
# that filters out approximately 10% of the data for testing, and a closure that prepopulates
# some of the parameters of the train_slp function
holdout_10pct = lambda (src_id): src_id[-1] != '9'
train_f = lambda locs, edges: slp.train_slp(locs, edges, 4, neighbor_threshold=4, dispersion_threshold=150)

# Test results
test_results = slp.evaluate(locs_known, edge_list, holdout_10pct, train_f)

Options:

Related to calculating the median point amongst a collection of points:

dispersion_threshold: The maximum median distance (in km) that a point can be from the remaining points for a location to still be estimated.

min_locs: The number of geotagged posts that a user must have to be included in the ground truth.

Related to the actual label propagation:

num_iters: The number of iterations of label propagation to perform.

Gaussian Mixture Model [gmm.py]

Usage:

sc.addPyFile('/path/to/zip/soft-boiled.zip')  # Can be an HDFS path
from src.algorithms import gmm

# Create dataframe from parquet data
tweets = sqlCtx.read.parquet('hdfs:///post_etl_datasets/twitter')
tweets.registerTempTable('my_tweets')

# Train GMM model
gmm_model = gmm.train_gmm(sqlCtx, 'my_tweets', ['user.location', 'text'], min_occurrences=10, max_num_components=12)

# Test GMM model
test_results = gmm.run_gmm_test(sc, sqlCtx, 'my_tweets', ['user.location', 'text'], gmm_model)
print test_results

# Use GMM model to predict tweets
other_tweets = sqlCtx.read.parquet('hdfs:///post_etl_datasets/twitter')
estimated_locs = gmm.predict_user_gmm(sc, other_tweets, ['user.location'], gmm_model, radius=100, predict_lower_bound=0.2)

# Save model for future prediction use
gmm.save_model(gmm_model, '/local/path/to/model_file.csv.gz')

# Load a model, produces the same output as train
gmm_model = gmm.load_model('/local/path/to/model_file.csv.gz')

Options:

Related to GMM:

fields: A set of fields to use to train/test the GMM model. Currently only user.location and text are supported.

min_occurrences: The minimum number of times a token must appear with a known location in the text for it to be included in the model.

max_num_components: Limit on the number of GMM components that can be used

Predict User Options:

radius: Predict the probability that the user is within this distance (in km) of the most likely point; used with predict_lower_bound.

predict_lower_bound: Used with radius to filter out user location estimates whose probability is below this threshold.

Contents:

class src.algorithms.slp.LocEstimate(geo_coord, dispersion, dispersion_std_dev)
dispersion

Alias for field number 1

dispersion_std_dev

Alias for field number 2

geo_coord

Alias for field number 0
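
For example, assuming the known-locations RDD produced by slp.get_known_locs holds (user_id, LocEstimate) pairs (a sketch, not verified against the implementation), the fields can be read directly:

# Inspect one ground-truth estimate; assumes (user_id, LocEstimate) pairs
user_id, est = locs_known.first()
print user_id, est.geo_coord, est.dispersion, est.dispersion_std_dev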

src.algorithms.slp.evaluate(locs_known, edges, holdout_func, slp_closure)

Assesses how well SLP is performing. Given all known locations and all known edges, this function first applies the holdout to locs_known so that a ground-truth comparison can be made. It then passes the non-holdout set to the training function, which should yield estimated locations for the held-out users for comparison.

For example:

holdout = lambda (src_id) : src_id[-1] == '6'
trainer = lambda l, e : slp.train_slp(l, e, 3)
results = evaluate(locs_known, edges, holdout, trainer)
Parameters:
  • locs_known (rdd of LocEstimate objects) – The complete list of locations
  • edges (rdd of (src_id, (dest_id, weight))) – all available edge information
  • holdout_func (function) –

    function responsible for filtering a holdout data set. For example:

    lambda (src_id) : src_id[-1] == '6'
    

    can be used to get approximately 10% of the data since the src_id’s are evenly distributed numeric values

  • slp_closure (function closure) –

    a closure over the slp train function. For example:

    lambda locs, edges: slp.train_slp(locs, edges, 4, neighbor_threshold=4, dispersion_threshold=150)
    

    can be used for training with specific threshold parameters

Returns:

stats of the results from the SLP algorithm

median: median difference of predicted versus actual

mean: mean difference of predicted versus actual

coverage: ratio of number of predicted locations to number of original unknown locations

reserved_locs: number of known locations used to train

total_locs: number of known locations input into this function

found_locs: number of predicted locations

holdout_ratio: ratio of the holdout set to the entire set

Return type:

results (dict)
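
For example, the returned dictionary can be inspected directly, using the key names listed above:

# Print the evaluation summary returned by slp.evaluate
for key in ('median', 'mean', 'coverage', 'reserved_locs', 'total_locs', 'found_locs', 'holdout_ratio'):
    print key, test_results.get(key)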

src.algorithms.slp.get_edge_list(sqlCtx, table_name, num_partitions=300)

Given a loaded twitter table, this will return the @mention network in the form (src_id, (dest_id, num_@mentions))

Parameters:
  • sqlCtx (Spark SQL Context) – A Spark SQL context
  • table_name (string) – Table name that was registered when loading the data
  • num_partitions (int) – Number of partitions to use for the resulting RDD (a performance tuning parameter)
Returns:

edges loaded from the table

Return type:

edges (rdd (src_id, (dest_id, weight)))
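
A quick way to sanity-check the returned RDD, given the (src_id, (dest_id, weight)) structure described above:

# Peek at a few edges from the @mention network
for src_id, (dest_id, weight) in edge_list.take(3):
    print src_id, dest_id, weight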

src.algorithms.slp.get_known_locs(sqlCtx, table_name, include_places=True, min_locs=3, num_partitions=30, dispersion_threshold=50)

Given a loaded Twitter table, this will return all Twitter users with locations. A user's location is determined by the median location of all of that user's known tweets. A user must have at least min_locs located tweets in order for a location to be estimated.

Parameters:
  • sqlCtx (Spark SQL Context) – A Spark SQL context
  • table_name (string) – Table name that was registered when loading the data
  • min_locs (int) – Minimum number of tweets with a location required in order to infer a location for the user
  • num_partitions (int) – Number of partitions to use for the resulting RDD (a performance tuning parameter)
  • dispersion_threshold (int) – A distance threshold on the dispersion of a user's estimated location. Estimated points with dispersion greater than the threshold are considered unpredictable, given how dispersed the tweet locations are from one another.
Returns:

Found locations of users. This rdd is often used as the ground truth of locations

Return type:

locations (rdd of LocEstimate)
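
As a sketch, and again assuming the RDD holds (user_id, LocEstimate) pairs, the ground truth can be further filtered on dispersion before training (the 25 km cutoff below is illustrative):

# Keep only users whose known tweets are tightly clustered
tight_locs = locs_known.filter(lambda (user_id, est): est.dispersion < 25)
print tight_locs.count(), 'of', locs_known.count(), 'users retained'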

src.algorithms.slp.median(distance_func, vertices, weights=None)

Given a Python list of vertices and a distance function, this will find the vertex that is most central relative to all other vertices. All of the vertices must have geo-coordinates.

Parameters:
  • distance_func (function) – A function to calculate the distance between two GeoCoord objects
  • vertices (list) – List of GeoCoord objects
Returns:

The median point

Return type:

LocEstimate
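
A minimal sketch of calling median with a user-supplied distance function. The haversine helper below is hypothetical (the repo may already provide an equivalent), list_of_geo_coords is a placeholder for your own list of GeoCoord objects, and GeoCoord is assumed to expose lat and lon attributes:

from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    # Hypothetical great-circle distance in km; assumes GeoCoord has .lat and .lon
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

central = slp.median(haversine_km, list_of_geo_coords)
print central.geo_coord, central.dispersion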

src.algorithms.slp.train_slp(locs_known, edge_list, num_iters, neighbor_threshold=3, dispersion_threshold=100)

Core SLP algorithm

Parameters:
  • locs_known (rdd of LocEstimate objects) – Locations that are known for the SLP network
  • edge_list (rdd of edges (src_id, (dest_id, weight))) – edges representing the at mention network
  • num_iters (int) – number of iterations to run the algorithm
  • neighbor_threshold (int) – The minimum number of neighbors required in order for SLP to attempt to predict the location of a node in the network
  • dispersion_threshold (int) – The maximum median distance (in km) among a node's local @mention network allowed in order to predict the node's location.
Returns:

The locations found and known

Return type:

locations (rdd of LocEstimate objects)

src.algorithms.gmm.GMMLocEstimate

alias of LocEstimate

src.algorithms.gmm.combine_gmms(gmms)

Takes an array of Gaussian mixture models and produces a GMM that is the weighted sum of the models

Parameters:gmms (list) – A list of (mixture.GMM, median_error_on_training) models
Returns:A single GMM model that is the weighted sum of the input gmm models
Return type:new_gmm (mixture.GMM)
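
For example, to combine the per-token models for the tokens of a single tweet (the token values here are illustrative, and gmm_model is the dictionary returned by train_gmm):

# Look up each token in the trained model and combine the matching (GMM, error) entries
tokens = ['boston', 'fenway']
gmms = [gmm_model[t] for t in tokens if t in gmm_model]
if gmms:
    combined_gmm = gmm.combine_gmms(gmms)
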
src.algorithms.gmm.fit_gmm_to_locations(geo_coords, max_num_components)

Searches within bounds to fit a GMM with the optimal number of components

Parameters:
  • geo_coords (list) – A list of GeoCoord points to fit a GMM distribution to
  • max_num_components (int) – The maximum number of components that the GMM model can have
Returns:

Tuple containing the best mixture.GMM and the error of that model on the training data

Return type:

gmm_estimate (tuple)

src.algorithms.gmm.get_errors(model, points)

Computes the median error for a GMM model and a set of training points

Parameters:
  • model (mixture.GMM) – A GMM model for a word
  • points (list) – A list of (lat, lon) tuples
Returns:

The median distance to the training points from the most likely point

Return type:

median (float)
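
A small sketch of recomputing the error for a single token's model; the token and coordinates are illustrative:

# gmm_model maps word -> (mixture.GMM, error); get_errors recomputes the median error
word_gmm, stored_error = gmm_model['boston']
median_err = gmm.get_errors(word_gmm, [(42.36, -71.06), (42.37, -71.05)])
print median_err, stored_error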

src.algorithms.gmm.get_location_from_tweet(row)

Extracts the location from a tweet object. If geo.coordinates is not present, the center of place.bounding_box is used.

Parameters:row (Row) – A spark sql row containing a tweet
Returns:
GeoCoord: The location in the tweet
src.algorithms.gmm.get_most_likely_point(tokens, model_bcast, radius=None)

Creates the combined GMM and finds the most likely point. This function is called in a flatMap, so it returns a list with 0 or 1 items

Parameters:
  • tokens (list) – list of words in tweet
  • model_bcast (pyspark.Broadcast) – A broadcast version of a dictionary of GMM model for the entire vocabulary
  • radius (float) – Distance from most likely point at which we should estimate containment probability (if not None)
Returns:

A list with 0 or 1 GMMLocEstimates

Return type:

loc_estimate (list)

src.algorithms.gmm.load_model(input_fname)

Load a pre-trained model

Parameters:input_fname (str) – Local file path to read GMM model from
Returns:A dictionary of the form {word: (mixture.GMM, error)}
Return type:model (dict)
src.algorithms.gmm.predict_probability_area(model, upper_bound, lower_bound)

Predict the probability that the true location is within a specified bounding box given a GMM model

Parameters:
  • model (mixture.GMM) – GMM model to use
  • upper_bound (list) – [upper_lat, right_lon] of the bounding box
  • lower_bound (list) – [lower_lat, left_lon] of the bounding box
Returns:

Probability from 0 to 1 of true location being in bounding box

Return type:

total_prob (float)
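
For example, using a combined model such as combined_gmm above and an illustrative bounding box around Boston:

# [upper_lat, right_lon] and [lower_lat, left_lon], per the parameter descriptions above
prob_in_box = gmm.predict_probability_area(combined_gmm, [42.40, -70.99], [42.32, -71.19])
print prob_in_box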

src.algorithms.gmm.predict_probability_radius(gmm_model, radius, center_point)

Attempts to estimate the probability that the true location is within some radius of a given center point. The estimate is based on estimating the probability in the corners of the bounding box and subtracting it from the total probability mass.

Parameters:
  • gmm_model (mixture.GMM) – GMM model to use
  • radius (float) – Radius from center point to include in estimate
  • center_point (tuple) – (lat, lon) center point
Returns:

Probability from 0 to 1 of true location being in the specified radius

Return type:

total_prob (float)
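
And similarly for a radius around a center point (values illustrative):

# Probability that the true location is within 100 km of the given (lat, lon) center
prob_in_radius = gmm.predict_probability_radius(combined_gmm, 100, (42.36, -71.06))
print prob_in_radius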

src.algorithms.gmm.predict_user_gmm(sc, tweets_to_predict, fields, model, radius=None, predict_lower_bound=0, num_partitions=5000)

Takes a set of tweets and, for each user in those tweets, predicts a location. Also returned is the probability that the predicted location is within the given radius (e.g., 100 km) of the true point.

Parameters:
  • sc (pyspark.SparkContext) – Spark Context to use for execution
  • tweets_to_predict (RDD) – RDD of twitter Row objects
  • fields (list) – List of field names to extract and then use for GMM prediction
  • model (dict) – Dictionary of {word:(mixture.GMM, error)}
  • radius (float) – Distance from most likely point at which we should estimate containment probability (if not None)
  • predict_lower_bound (float) – Probability threshold below which user location estimates are filtered out
  • num_partitions (int) – Number of partitions. Should be based on size of the data
Returns:

An RDD of (id_str, GMMLocEstimate)

Return type:

loc_est_by_user (RDD)
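
The returned RDD can be inspected like any other pair RDD:

# Peek at a few per-user estimates; each value is a GMMLocEstimate
for id_str, loc_estimate in estimated_locs.take(5):
    print id_str, loc_estimate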

src.algorithms.gmm.run_gmm_test(sc, sqlCtx, table_name, fields, model, where_clause='')

Test a pretrained model on a table of test data

Parameters:
  • sc (pyspark.SparkContext) – Spark Context to use for execution
  • sqlCtx (pyspark.sql.SQLContext) – Spark SQL Context to use for sql queries
  • table_name (str) – Table name to query for test data
  • fields (list) – List of field names to extract and then use for GMM prediction
  • model (dict) – Dictionary of {word:(mixture.GMM, error)}
  • where_clause (str) – A where clause that can be applied to the query
Returns:

A description of the performance of the GMM Algorithm

Return type:

final_result (dict)

src.algorithms.gmm.save_model(model, output_fname)

Save the current model for future use

Parameters:
  • model (dict) – A dictionary of the form {word: (mixture.GMM, error)}
  • output_fname (str) – Local file path to store GMM model
src.algorithms.gmm.tokenize_tweet(inputRow, fields)

A simple tokenizer that takes a tweet as input, splits on whitespace, and returns the words in the tweet

Parameters:
  • inputRow (Row) – A spark sql row containing a tweet
  • fields (list) – A list of field names which directs tokenize on which fields to use as source data
Returns:

A list of words appearing in the tweet

Return type:

tokens (list)
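
A quick sketch of tokenizing a single row from the registered table, reusing the tweets dataframe from the usage section:

# Tokenize one tweet Row using the same fields passed to train_gmm
row = tweets.first()
tokens = gmm.tokenize_tweet(row, ['user.location', 'text'])
print tokens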

src.algorithms.gmm.train_gmm(sqlCtx, table_name, fields, min_occurrences=10, max_num_components=12, where_clause='')

Train a set of GMMs for a given set of training data

Parameters:
  • sqlCtx (pyspark.sql.SQLContext) – Spark SQL Context to use for sql queries
  • table_name (str) – Table name to query for test data
  • fields (list) – List of field names to extract and then use for GMM prediction
  • min_occurrences (int) – Number of times a word must appear to be included in the model
  • max_num_components (int) – The maximum number of components that the GMM model can have
  • where_clause (str) – A where clause that can be applied to the query
Returns:

Dictionary of {word:(mixture.GMM, error)}

Return type:

model (dict)

class src.algorithms.estimator.EstimatorCurve(w_stdev, wo_stdev)

The EstimatorCurve class is used to assess the confidence of a predicted location for SLP.

w_stdev

numpy arr – A two dimensional numpy array representing the estimator curve. The x axis is the standard deviations and y axis is the probability. The curve is a CDF. This curve is generated from known locations where at least two neighbors are at different locations.

wo_stdev

numpy_arr – A two dimensional numpy array representing the estimator curve

static build_curve(vals, desired_samples)

Static helper method for building the curve from a set of stdev samples

Parameters:
  • vals (rdd of floats) – The rdd containing the standard deviation of the distance between the estimated location and the actual location
  • desired_samples (int) – For larger RDDs it is more efficient to take a sample for the collect
Returns:

two dimensional array representing the curve.

Column 0 is the sorted stdevs and column 1 is the percentage for the CDF.

Return type:

curve (numpy.ndarray)

confidence_estimation_viewer(sc, eval_rdd)

Displays a plot of the estimated and actual probability that the true point is within an array of radius values

Parameters:
  • curve (numpy.ndarray) – x axis is stdev, y axis is percent
  • eval_rdd (rdd (src_id, (dist, loc_estimate))) – this is the result of the evaluator function
static load_from_file(name='estimator')

Loads an Estimator curve from csv files

Parameters:name (string) – prefix name for the two CSV files
static load_from_rdds(locs_known, edges, desired_samples=1000, dispersion_threshold=150, neighbor_threshold=3)

Creates an EstimatorCurve

Parameters:
  • locs_known (rdd of LocEstimate) – RDD of locations that are known
  • edges (rdd of (src_id, (dest_id, weight))) – RDD of edges in the network
  • desired_samples (int) – Limit the curve to just a sample of data
Returns:

A new EstimatorCurve representing the known input data

Return type:

EstimatorCurve
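
For example, building a curve from the SLP inputs created in the usage section, then saving and reloading it:

from src.algorithms.estimator import EstimatorCurve

# Build the curve from known locations and edges, then round-trip it through CSV
curve = EstimatorCurve.load_from_rdds(locs_known, edge_list, desired_samples=1000)
curve.save(name='estimator')
curve = EstimatorCurve.load_from_file(name='estimator')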

lookup(val, axis=0)
static lookup_static(table, val, axis=0)

Looks up the closest stdev by subtracting the value from the lookup table, taking the absolute value, and finding the entry closest to zero by sorting and taking the first element

Args: num_std_devs (float): the stdev to look up

Returns: CDF (float) : Percentage of actual locations found to be within the input stdev

plot(w_stdev_lim=10, wo_stdev_lim=1000)

Plots both the stdev curve and the distance curve for when the stdev is 0

Parameters:
  • w_stdev_lim (int) – x axis limit for the plot
  • wo_stdev_lim (int) – x axis limit for the plot
predict_probability_area(upper_bound, lower_bound, estimated_loc)

Given a prediction and a bounding box this will return a confidence range for that prediction

Parameters:
  • upper_bound (geoCoord) – bounding box top right geoCoord
  • lower_bound (geoCoord) – bounding box bottom left geoCoord
  • estimated_loc (LocEstimate) – geoCoord of the estimated location
Returns:

A probability range tuple (min probability, max probability)

Return type:

Probability Tuple(Tuple(float,float))

save(name='estimator')

Saves the EstimatorCurve as a csv

Parameters:name (string) – A prefix name for the filename. Two CSVs will be created: one for when the stdev is 0, and one for when it is greater than 0
validator(sc, eval_rdd)

Validates a curve

Parameters:
  • curve (numpy.ndarray) – x axis is stdev, y axis is percent
  • eval_rdd (rdd (src_id, (dist, loc_estimate))) – this is the result of the evaluator function
