Welcome to Soft-Boiled’s documentation!

The usage examples below assume that you have created a zip file named soft-boiled.zip containing the top-level directory of the repo. To create the zip file, run "zip -r soft-boiled.zip __init__.py src" from inside the repository directory.

Example IPython notebooks demonstrating the functionality of these algorithms can also be found in the notebooks directory of this repository.

Spatial Label Propagation [slp.py]:

Usage:

sc.addPyFile('/path/to/zip/soft-boiled.zip')  # Can be an HDFS path
from src.algorithms import slp

# Create dataframe from parquet data
tweets = sqlCtx.read.parquet('hdfs:///post_etl_datasets/twitter')
tweets.registerTempTable('my_tweets')

# Get Known Locations
locs_known = slp.get_known_locs(sqlCtx, 'my_tweets', min_locs=3, dispersion_threshold=50, num_partitions=30)

# Get the @mention network (bi-directional @mentions)
edge_list = slp.get_edge_list(sqlCtx, 'my_tweets')

# Run spatial label propagation with 5 iterations
estimated_locs = slp.train_slp(locs_known, edge_list, 5, dispersion_threshold=100)


# Prepare the input functions for the evaluate function. In this case, we create a holdout function
# that filters out approximately 10% of the data for testing, and a closure that prepopulates
# some of the parameters of the train_slp function
holdout_10pct = lambda (src_id): src_id[-1] != '9'
train_f = lambda locs, edges: slp.train_slp(locs, edges, 4, neighbor_threshold=4, dispersion_threshold=150)

# Test results
test_results = slp.evaluate(locs_known, edge_list, holdout_10pct, train_f)

Options:

Related to calculating the median point amongst a collection of points:

dispersion_threshold: The maximum median distance (in km) that a point can be from the remaining points for a location to still be estimated.

min_locs: The number of geotagged posts that a user must have to be included in the ground truth.

Related to the actual label propagation:

num_iters: The number of iterations of label propagation to perform.

Gaussian Mixture Model [gmm.py]

Usage:

sc.addPyFile('/path/to/zip/soft-boiled.zip')  # Can be an HDFS path
from src.algorithms import gmm

# Create dataframe from parquet data
tweets = sqlCtx.read.parquet('hdfs:///post_etl_datasets/twitter')
tweets.registerTempTable('my_tweets')

# Train GMM model
gmm_model = gmm.train_gmm(sqlCtx, 'my_tweets', ['user.location', 'text'], min_occurrences=10, max_num_components=12)

# Test GMM model
test_results = gmm.run_gmm_test(sc, sqlCtx, 'my_tweets', ['user.location', 'text'], gmm_model)
print test_results

# Use GMM model to predict tweets
other_tweets = sqlCtx.read.parquet('hdfs:///post_etl_datasets/twitter')
estimated_locs = gmm.predict_user_gmm(sc, other_tweets, ['user.location'], gmm_model, radius=100, predict_lower_bound=0.2)

# Save model for future prediction use
gmm.save_model(gmm_model, '/local/path/to/model_file.csv.gz')

# Load a model, produces the same output as train
gmm_model = gmm.load_model('/local/path/to/model_file.csv.gz')

Options:

Related to GMM:

fields: A set of fields to use to train/test the GMM model. Currently only user.location and text are supported.

min_occurrences: The minimum number of times a token must appear with a known location in the text for it to be included in the model.

max_num_components: Limit on the number of GMM components that can be used

Predict User Options:

radius: Predict the probability that the user is within this distance (in km) of the most likely point; used with predict_lower_bound.

predict_lower_bound: Used with radius to filter out user location estimates whose probability is below this threshold.

Contents:

class src.algorithms.slp.LocEstimate(geo_coord, dispersion, dispersion_std_dev)
dispersion

Alias for field number 1

dispersion_std_dev

Alias for field number 2

geo_coord

Alias for field number 0
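
For example, assuming the known-locations RDD produced by slp.get_known_locs holds (user_id, LocEstimate) pairs (a sketch, not verified against the implementation), the fields can be read directly:

# Inspect one ground-truth estimate; assumes (user_id, LocEstimate) pairs
user_id, est = locs_known.first()
print user_id, est.geo_coord, est.dispersion, est.dispersion_std_dev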

src.algorithms.slp.evaluate(locs_known, edges, holdout_func, slp_closure)

Assesses how well SLP is performing. Given all known locations and all known edges, this function first applies the holdout to locs_known so that a ground-truth comparison can be made. It then passes the non-holdout set to the training function, which should yield estimated locations for the held-out users for comparison.

For example:

holdout = lambda (src_id) : src_id[-1] == '6'
trainer = lambda l, e : slp.train_slp(l, e, 3)
results = evaluate(locs_known, edges, holdout, trainer)
Parameters:
  • locs_known (rdd of LocEstimate objects) – The complete list of locations
  • edges (rdd of (src_id, (dest_id, weight))) – all available edge information
  • holdout_func (function) –

    function responsible for filtering a holdout data set. For example:

    lambda (src_id) : src_id[-1] == '6'
    

    can be used to get approximately 10% of the data since the src_id’s are evenly distributed numeric values

  • slp_closure (function closure) –

    a closure over the slp train function. For example:

    lambda locs, edges: slp.train_slp(locs, edges, 4, neighbor_threshold=4, dispersion_threshold=150)
    

    can be used for training with specific threshold parameters

Returns:

stats of the results from the SLP algorithm

median: median difference of predicted versus actual

mean: mean difference of predicted versus actual

coverage: ratio of number of predicted locations to number of original unknown locations

reserved_locs: number of known locations used to train

total_locs: number of known locations input into this function

found_locs: number of predicted locations

holdout_ratio: ratio of the holdout set to the entire set

Return type:

results (dict)
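
For example, the returned dictionary can be inspected directly, using the key names listed above:

# Print the evaluation summary returned by slp.evaluate
for key in ('median', 'mean', 'coverage', 'reserved_locs', 'total_locs', 'found_locs', 'holdout_ratio'):
    print key, test_results.get(key)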

src.algorithms.slp.get_edge_list(sqlCtx, table_name, num_partitions=300)

Given a loaded twitter table, this will return the @mention network in the form (src_id, (dest_id, num_@mentions))

Parameters:
  • sqlCtx (Spark SQL Context) – A Spark SQL context
  • table_name (string) – Table name that was registered when loading the data
  • num_partitions (int) – Number of partitions to use for the resulting RDD (a performance tuning parameter)
Returns:

edges loaded from the table

Return type:

edges (rdd (src_id, (dest_id, weight)))
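
A quick way to sanity-check the returned RDD, given the (src_id, (dest_id, weight)) structure described above:

# Peek at a few edges from the @mention network
for src_id, (dest_id, weight) in edge_list.take(3):
    print src_id, dest_id, weight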

src.algorithms.slp.get_known_locs(sqlCtx, table_name, include_places=True, min_locs=3, num_partitions=30, dispersion_threshold=50)

Given a loaded Twitter table, this will return all Twitter users with locations. A user's location is determined by the median location of all of that user's known tweets. A user must have at least min_locs located tweets in order for a location to be estimated.

Parameters:
  • sqlCtx (Spark SQL Context) – A Spark SQL context
  • table_name (string) – Table name that was registered when loading the data
  • min_locs (int) – Minimum number of tweets with a location required in order to infer a location for the user
  • num_partitions (int) – Number of partitions to use for the resulting RDD (a performance tuning parameter)
  • dispersion_threshold (int) – A distance threshold on the dispersion of a user's estimated location. Estimated points with dispersion greater than the threshold are considered unpredictable, given how dispersed the tweet locations are from one another.
Returns:

Found locations of users. This rdd is often used as the ground truth of locations

Return type:

locations (rdd of LocEstimate)
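
As a sketch, and again assuming the RDD holds (user_id, LocEstimate) pairs, the ground truth can be further filtered on dispersion before training (the 25 km cutoff below is illustrative):

# Keep only users whose known tweets are tightly clustered
tight_locs = locs_known.filter(lambda (user_id, est): est.dispersion < 25)
print tight_locs.count(), 'of', locs_known.count(), 'users retained'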

src.algorithms.slp.median(distance_func, vertices, weights=None)

Given a Python list of vertices and a distance function, this will find the vertex that is most central relative to all other vertices. All of the vertices must have geo-coordinates.

Parameters:
  • distance_func (function) – A function to calculate the distance between two GeoCoord objects
  • vertices (list) – List of GeoCoord objects
Returns:

The median point

Return type:

LocEstimate
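
A minimal sketch of calling median with a user-supplied distance function. The haversine helper below is hypothetical (the repo may already provide an equivalent), list_of_geo_coords is a placeholder for your own list of GeoCoord objects, and GeoCoord is assumed to expose lat and lon attributes:

from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    # Hypothetical great-circle distance in km; assumes GeoCoord has .lat and .lon
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

central = slp.median(haversine_km, list_of_geo_coords)
print central.geo_coord, central.dispersion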

src.algorithms.slp.train_slp(locs_known, edge_list, num_iters, neighbor_threshold=3, dispersion_threshold=100)

Core SLP algorithm

Parameters:
  • locs_known (rdd of LocEstimate objects) – Locations that are known for the SLP network
  • edge_list (rdd of edges (src_id, (dest_id, weight))) – edges representing the at mention network
  • num_iters (int) – number of iterations to run the algorithm
  • neighbor_threshold (int) – The minimum number of neighbors required in order for SLP to attempt to predict the location of a node in the network
  • dispersion_threshold (int) – The maximum median distance (in km) among a node's local @mention network allowed in order to predict the node's location.
Returns:

The locations found and known

Return type:

locations (rdd of LocEstimate objects)

src.algorithms.gmm.GMMLocEstimate

alias of LocEstimate

src.algorithms.gmm.combine_gmms(gmms)

Takes an array of Gaussian mixture models and produces a GMM that is the weighted sum of the models

Parameters:gmms (list) – A list of (mixture.GMM, median_error_on_training) models
Returns:A single GMM model that is the weighted sum of the input gmm models
Return type:new_gmm (mixture.GMM)
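
For example, to combine the per-token models for the tokens of a single tweet (the token values here are illustrative, and gmm_model is the dictionary returned by train_gmm):

# Look up each token in the trained model and combine the matching (GMM, error) entries
tokens = ['boston', 'fenway']
gmms = [gmm_model[t] for t in tokens if t in gmm_model]
if gmms:
    combined_gmm = gmm.combine_gmms(gmms)
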
src.algorithms.gmm.fit_gmm_to_locations(geo_coords, max_num_components)

Searches within bounds to fit a GMM with the optimal number of components

Parameters:
  • geo_coords (list) – A list of GeoCoord points to fit a GMM distribution to
  • max_num_components (int) – The maximum number of components that the GMM model can have
Returns:

Tuple containing the best mixture.GMM and the error of that model on the training data

Return type:

gmm_estimate (tuple)

src.algorithms.gmm.get_errors(model, points)

Computes the median error for a GMM model and a set of training points

Parameters:
  • model (mixture.GMM) – A GMM model for a word
  • points (list) – A list of (lat, lon) tuples
Returns:

The median distance to the training points from the most likely point

Return type:

median (float)
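
A small sketch of recomputing the error for a single token's model; the token and coordinates are illustrative:

# gmm_model maps word -> (mixture.GMM, error); get_errors recomputes the median error
word_gmm, stored_error = gmm_model['boston']
median_err = gmm.get_errors(word_gmm, [(42.36, -71.06), (42.37, -71.05)])
print median_err, stored_error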

src.algorithms.gmm.get_location_from_tweet(row)

Extracts the location from a tweet object. If geo.coordinates is not present, the center of place.bounding_box is used.

Parameters:row (Row) – A spark sql row containing a tweet
Returns:
GeoCoord: The location in the tweet
src.algorithms.gmm.get_most_likely_point(tokens, model_bcast, radius=None)

Creates the combined GMM and finds the most likely point. This function is called in a flatMap, so it returns a list with 0 or 1 items

Parameters:
  • tokens (list) – list of words in tweet
  • model_bcast (pyspark.Broadcast) – A broadcast version of a dictionary of GMM model for the entire vocabulary
  • radius (float) – Distance from most likely point at which we should estimate containment probability (if not None)
Returns:

A list with 0 or 1 GMMLocEstimates

Return type:

loc_estimate (list)

src.algorithms.gmm.load_model(input_fname)

Load a pre-trained model

Parameters:input_fname (str) – Local file path to read GMM model from
Returns:A dictionary of the form {word: (mixture.GMM, error)}
Return type:model (dict)
src.algorithms.gmm.predict_probability_area(model, upper_bound, lower_bound)

Predict the probability that the true location is within a specified bounding box given a GMM model

Parameters:
  • model (mixture.GMM) – GMM model to use
  • upper_bound (list) – [upper_lat, right_lon] of the bounding box
  • lower_bound (list) – [lower_lat, left_lon] of the bounding box
Returns:

Probability from 0 to 1 of true location being in bounding box

Return type:

total_prob (float)
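
For example, using a combined model such as combined_gmm above and an illustrative bounding box around Boston:

# [upper_lat, right_lon] and [lower_lat, left_lon], per the parameter descriptions above
prob_in_box = gmm.predict_probability_area(combined_gmm, [42.40, -70.99], [42.32, -71.19])
print prob_in_box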

src.algorithms.gmm.predict_probability_radius(gmm_model, radius, center_point)

Attempts to estimate the probability that the true location is within some radius of a given center point. The estimate is based on estimating the probability in the corners of the bounding box and subtracting it from the total probability mass.

Parameters:
  • gmm_model (mixture.GMM) – GMM model to use
  • radius (float) – Radius from center point to include in estimate
  • center_point (tuple) – (lat, lon) center point
Returns:

Probability from 0 to 1 of true location being in the specified radius

Return type:

total_prob (float)
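
And similarly for a radius around a center point (values illustrative):

# Probability that the true location is within 100 km of the given (lat, lon) center
prob_in_radius = gmm.predict_probability_radius(combined_gmm, 100, (42.36, -71.06))
print prob_in_radius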

src.algorithms.gmm.predict_user_gmm(sc, tweets_to_predict, fields, model, radius=None, predict_lower_bound=0, num_partitions=5000)

Takes a set of tweets and, for each user in those tweets, predicts a location. Also returned is the probability that the predicted location is within the given radius (e.g., 100 km) of the true point.

Parameters:
  • sc (pyspark.SparkContext) – Spark Context to use for execution
  • tweets_to_predict (RDD) – RDD of twitter Row objects
  • fields (list) – List of field names to extract and then use for GMM prediction
  • model (dict) – Dictionary of {word:(mixture.GMM, error)}
  • radius (float) – Distance from most likely point at which we should estimate containment probability (if not None)
  • predict_lower_bound (float) – Probability threshold below which user location estimates are filtered out
  • num_partitions (int) – Number of partitions. Should be based on size of the data
Returns:

An RDD of (id_str, GMMLocEstimate)

Return type:

loc_est_by_user (RDD)
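
The returned RDD can be inspected like any other pair RDD:

# Peek at a few per-user estimates; each value is a GMMLocEstimate
for id_str, loc_estimate in estimated_locs.take(5):
    print id_str, loc_estimate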

src.algorithms.gmm.run_gmm_test(sc, sqlCtx, table_name, fields, model, where_clause='')

Test a pretrained model on a table of test data

Parameters:
  • sc (pyspark.SparkContext) – Spark Context to use for execution
  • sqlCtx (pyspark.sql.SQLContext) – Spark SQL Context to use for sql queries
  • table_name (str) – Table name to query for test data
  • fields (list) – List of field names to extract and then use for GMM prediction
  • model (dict) – Dictionary of {word:(mixture.GMM, error)}
  • where_clause (str) – A where clause that can be applied to the query
Returns:

A description of the performance of the GMM Algorithm

Return type:

final_result (dict)

src.algorithms.gmm.save_model(model, output_fname)

Save the current model for future use

Parameters:
  • model (dict) – A dictionary of the form {word: (mixture.GMM, error)}
  • output_fname (str) – Local file path to store GMM model
src.algorithms.gmm.tokenize_tweet(inputRow, fields)

A simple tokenizer that takes a tweet as input, splits on whitespace, and returns the words in the tweet

Parameters:
  • inputRow (Row) – A spark sql row containing a tweet
  • fields (list) – A list of field names which directs tokenize on which fields to use as source data
Returns:

A list of words appearing in the tweet

Return type:

tokens (list)
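
A quick sketch of tokenizing a single row from the registered table, reusing the tweets dataframe from the usage section:

# Tokenize one tweet Row using the same fields passed to train_gmm
row = tweets.first()
tokens = gmm.tokenize_tweet(row, ['user.location', 'text'])
print tokens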

src.algorithms.gmm.train_gmm(sqlCtx, table_name, fields, min_occurrences=10, max_num_components=12, where_clause='')

Train a set of GMMs for a given set of training data

Parameters:
  • sqlCtx (pyspark.sql.SQLContext) – Spark SQL Context to use for sql queries
  • table_name (str) – Table name to query for test data
  • fields (list) – List of field names to extract and then use for GMM prediction
  • min_occurrences (int) – Number of times a word must appear to be included in the model
  • max_num_components (int) – The maximum number of components that the GMM model can have
  • where_clause (str) – A where clause that can be applied to the query
Returns:

Dictionary of {word:(mixture.GMM, error)}

Return type:

model (dict)

class src.algorithms.estimator.EstimatorCurve(w_stdev, wo_stdev)

The EstimatorCurve class is used to assess the confidence of a predicted location for SLP.

w_stdev

numpy arr – A two dimensional numpy array representing the estimator curve. The x axis is the standard deviations and y axis is the probability. The curve is a CDF. This curve is generated from known locations where at least two neighbors are at different locations.

wo_stdev

numpy_arr – A two dimensional numpy array representing the estimator curve

static build_curve(vals, desired_samples)

Static helper method for building the curve from a set of stdev samples

Parameters:
  • vals (rdd of floats) – The rdd containing the standard deviation of the distance between the estimated location and the actual location
  • desired_samples (int) – For larger RDDs it is more efficient to take a sample for the collect
Returns:

two dimensional array representing the curve.

Column 0 is the sorted stdevs and column 1 is the percentage for the CDF.

Return type:

curve (numpy.ndarray)

confidence_estimation_viewer(sc, eval_rdd)

Displays a plot of the estimated and actual probability that the true point is within an array of radius values

Parameters:
  • curve (numpy.ndarray) – x axis is stdev, y axis is percent
  • eval_rdd (rdd (src_id, (dist, loc_estimate))) – this is the result of the evaluator function
static load_from_file(name='estimator')

Loads an Estimator curve from csv files

Parameters:name (string) – prefix name for the two CSV files
static load_from_rdds(locs_known, edges, desired_samples=1000, dispersion_threshold=150, neighbor_threshold=3)

Creates an EstimatorCurve

Parameters:
  • locs_known (rdd of LocEstimate) – RDD of locations that are known
  • edges (rdd of (src_id, (dest_id, weight))) – RDD of edges in the network
  • desired_samples (int) – Limit the curve to just a sample of data
Returns:

A new EstimatorCurve representing the known input data

Return type:

EstimatorCurve
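
For example, building a curve from the SLP inputs created in the usage section, then saving and reloading it:

from src.algorithms.estimator import EstimatorCurve

# Build the curve from known locations and edges, then round-trip it through CSV
curve = EstimatorCurve.load_from_rdds(locs_known, edge_list, desired_samples=1000)
curve.save(name='estimator')
curve = EstimatorCurve.load_from_file(name='estimator')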

lookup(val, axis=0)
static lookup_static(table, val, axis=0)

Looks up the closest stdev by subtracting the value from the lookup table, taking the absolute value, and finding the entry closest to zero by sorting and taking the first element

Args: num_std_devs (float): the stdev to look up

Returns: CDF (float) : Percentage of actual locations found to be within the input stdev

plot(w_stdev_lim=10, wo_stdev_lim=1000)

Plots both the stdev curve and the distance curve for when the stdev is 0

Parameters:
  • w_stdev_lim (int) – x axis limit for the plot
  • wo_stdev_lim (int) – x axis limit for the plot
predict_probability_area(upper_bound, lower_bound, estimated_loc)

Given a prediction and a bounding box this will return a confidence range for that prediction

Parameters:
  • upper_bound (geoCoord) – bounding box top right geoCoord
  • lower_bound (geoCoord) – bounding box bottom left geoCoord
  • estimated_loc (LocEstimate) – geoCoord of the estimated location
Returns:

A probability range tuple (min probability, max probability)

Return type:

Probability Tuple(Tuple(float,float))

save(name='estimator')

Saves the EstimatorCurve as a csv

Parameters:name (string) – A prefix name for the filename. Two CSVs will be created: one for when the stdev is 0, and one for when it is greater than 0
validator(sc, eval_rdd)

Validates a curve

Parameters:
  • curve (numpy.ndarray) – x axis is stdev, y axis is percent
  • eval_rdd (rdd (src_id, (dist, loc_estimate))) – this is the result of the evaluator function
