<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Symptoms of an Uncommon Code</title>
    <description>Do you have fatigue, headaches, and loss of appetite after hours of sleuthing for answers to coding problems? If so, you may be experiencing the symptoms of an uncommon code.
</description>
    <link>http://emmettmcquinn.com/blog/</link>
    <atom:link href="http://emmettmcquinn.com/blog/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 26 Oct 2021 22:54:44 -0500</pubDate>
    <lastBuildDate>Tue, 26 Oct 2021 22:54:44 -0500</lastBuildDate>
    <generator>Jekyll v3.9.0</generator>
    
      <item>
        <title>Learn from HPC How to Model ML Performance Beyond FLOPS, Params, MACS, or TOPS</title>
        <description>&lt;h2 id=&quot;learn-from-hpc-how-to-model-ml-performance-beyond-flops-params-macs-or-tops&quot;&gt;Learn from HPC How to Model ML Performance Beyond FLOPS, Params, MACS, or TOPS&lt;/h2&gt;

&lt;p&gt;You’ve invented a brand new model architecture and you want to show it’s the fastest architecture on the block.&lt;/p&gt;

&lt;p&gt;Do you evaluate performance using FLOPS? MACS? Param counts? TOPS?&lt;/p&gt;

&lt;p&gt;It turns out none of these by themselves sufficiently model performance, but fortunately the HPC community has solved this problem with a robust performance model for parallel processing!&lt;/p&gt;

&lt;p&gt;This basic equation models total time as the sum of communication time (from network, disk, memory, cache, and register movement) and computation time (arithmetic operations):&lt;/p&gt;

\[total\_time = communication\_time + computation\_time\]

&lt;p&gt;In the case of a neural network, we can model these terms as:&lt;/p&gt;

\[communication\_overhead = n\_params * param\_size + data\_between\_layers\]

\[communication\_time \approx \dfrac{communication\_overhead}{memory\_bandwidth}\]

\[computation\_overhead = n\_ops\]

\[computation\_time \approx \dfrac{computation\_overhead}{ops\_throughput}\]

&lt;p&gt;Data sent between layers is the sum of the tensor volume of each layer output in a network multiplied by the element size. In some cases the volume is the number of activations, but not all data transferred is due to an activation (e.g. a residual connection, concat operation, unfused bias add, projection operations, etc.).&lt;/p&gt;

&lt;p&gt;What this means is that the total time for a network is a function of memory bandwidth and arithmetic throughput. Estimating total time (inference latency) depends on a weighting between communication time and computation time. The exact weighting varies based on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory_bandwidth&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ops_throughput&lt;/code&gt; of the accelerator architecture and can be derived either from first principles or estimated based on observations.&lt;/p&gt;
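&lt;p&gt;As a minimal sketch of the model above (every constant here is made up for illustration; real values depend on your accelerator and your measurements):&lt;/p&gt;

```python
# Sketch of the performance model above. The function and all numeric
# constants are illustrative, not measurements of any real accelerator.

def estimate_latency_s(n_params, param_size_bytes, data_between_layers_bytes,
                       n_ops, memory_bandwidth_bytes_s, ops_throughput_ops_s):
    """Estimate inference latency as communication time + computation time."""
    communication_overhead = n_params * param_size_bytes + data_between_layers_bytes
    communication_time = communication_overhead / memory_bandwidth_bytes_s
    computation_time = n_ops / ops_throughput_ops_s
    return communication_time + computation_time

# Hypothetical 25M-parameter fp16 model on an accelerator with
# 100 GB/s of memory bandwidth and 10 TOPS of arithmetic throughput:
latency = estimate_latency_s(
    n_params=25e6, param_size_bytes=2, data_between_layers_bytes=50e6,
    n_ops=4e9, memory_bandwidth_bytes_s=100e9, ops_throughput_ops_s=10e12)
print(f"estimated latency: {latency * 1e3:.2f} ms")
```

&lt;p&gt;With these made-up numbers, communication time (1.0 ms) dominates computation time (0.4 ms), which is exactly the regime the next paragraph describes.&lt;/p&gt;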

&lt;p&gt;All major neural network accelerators have seen huge leaps in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ops_throughput&lt;/code&gt; in recent years, making computation time relatively small. Focusing on reducing communication time in networks is more important now than ever. In neural networks this means reducing network weights and the volume of tensor data sent between layers. This can be achieved with mechanisms like layer fusion and batch norm folding, but it also applies to low arithmetic intensity layers in general.&lt;/p&gt;

&lt;p&gt;For tiny microprocessors and CPUs, computation time can still be the bottleneck, as we found with the small DSPs we used at &lt;a href=&quot;https://whisper.ai/&quot;&gt;Whisper.ai&lt;/a&gt;. This HPC-inspired performance model combining communication time and computation time is flexible enough to describe accelerator bottlenecks both for big GPUs and TPUs and for tiny CPUs and DSPs.&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Like any model, this performance model can be wrong in some cases. For example, in TensorFlow Lite, if an op isn’t implemented by an accelerator, that layer may actually run on a completely different processor, incurring poor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ops_throughput&lt;/code&gt; and even more communication time.&lt;/p&gt;

&lt;p&gt;I started my career working in a computational neuroscience lab with spiking neural networks that effectively have 1-bit activations. 1-bit activations minimize both compute and communication. Unfortunately, the reality of communication within a chip means sending 1 bit of information is fairly inefficient. We may need a new performance model one day 🤓&lt;/p&gt;

&lt;h2 id=&quot;accelerator-oblivious-network-architectures&quot;&gt;Accelerator-Oblivious Network Architectures&lt;/h2&gt;

&lt;p&gt;Exploration of performance models like these could lead us to model architectures that are both thrifty with compute and communication time.&lt;/p&gt;

&lt;p&gt;Today the trend seems to be kicking off a NAS for each accelerator architecture you want to use - how brutish!&lt;/p&gt;

&lt;p&gt;Perhaps instead one day we will see network architectures with high arithmetic intensity good for many accelerators, in the limit becoming “accelerator-oblivious” (in reference to “&lt;a href=&quot;https://en.wikipedia.org/wiki/Cache-oblivious_algorithm&quot;&gt;Cache-oblivious&lt;/a&gt;” programming).&lt;/p&gt;

&lt;p&gt;In the meantime, start reporting communication and computation overheads:&lt;/p&gt;

\[communication\_overhead = n\_params * param\_size + data\_between\_layers\]

\[computation\_overhead = n\_ops\]

&lt;p&gt;From these numbers, anyone can plug in their own constants for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ops_throughput&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory_bandwidth&lt;/code&gt; to estimate inference latency and pick the architecture that makes sense for a given accelerator.&lt;/p&gt;
</description>
        <pubDate>Tue, 26 Oct 2021 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/2021/10/26/how-to-model-ml-performance-beyond-flops-params-macs-tops.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/2021/10/26/how-to-model-ml-performance-beyond-flops-params-macs-tops.html</guid>
        
        
      </item>
    
      <item>
        <title>Solving the XKCD NP-Complete Restaurant Order Problem with Python and Optlang</title>
        <description>&lt;h2 id=&quot;introduction-to-optimization-problems&quot;&gt;Introduction to Optimization Problems&lt;/h2&gt;

&lt;p&gt;If you ever find yourself looking for the minimum (or equivalently the maximum) of a function, you have an optimization problem.&lt;/p&gt;

&lt;p&gt;For continuous functions we can use calculus to provide analytical guidance toward minima. As an example, for a single variable function, a point where the first derivative is zero and the second derivative is positive is a local minimum. You may remember solving these problems in calculus classes, finding the minima or maxima of parabolas.&lt;/p&gt;
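&lt;p&gt;As a quick illustration (the parabola here is invented for the example), take f(x) = (x - 3)² + 1: the first derivative 2(x - 3) is zero at x = 3, and the second derivative is 2, which is positive, so x = 3 is a minimum:&lt;/p&gt;

```python
# Minimize f(x) = (x - 3)**2 + 1, an invented example.
# Calculus: f'(x) = 2*(x - 3) = 0 at x = 3, and f''(x) = 2 > 0,
# so x = 3 is a local (here also global) minimum.
def f(x):
    return (x - 3) ** 2 + 1

x_min = 3.0
# Sanity check: nearby points are no lower than the candidate minimum.
assert all(f(x_min) <= f(x_min + dx) for dx in (-1.0, -0.1, 0.1, 1.0))
print(f"minimum at x = {x_min}, f(x_min) = {f(x_min)}")
```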

&lt;p&gt;With discrete values, we lose the analytical guidance available for continuous functions. Our functions are no longer differentiable, as there are discontinuities between points.&lt;/p&gt;

&lt;p&gt;Discrete optimization problems exist in everyday life and vary in how hard they are to solve. They could be something as simple as picking the cheapest gas station from a list or as complex as &lt;a href=&quot;https://www.investopedia.com/terms/m/modernportfoliotheory.asp&quot;&gt;picking an investment portfolio that will minimize risk&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some optimization problems can be made easy to solve with a few assumptions. For example, if we have a list of gas stations and want to minimize price, we could assume that driving distance doesn’t matter when picking a gas station. In this case the optimal gas station is simply the one with the minimum price. If the list of gas stations were filtered to ones within an acceptable driving distance, this would provide a reasonable solution. However, if you are solving this problem at massive scale, you may want to model driving time and distance costs in the optimization. Pinching pennies at scale easily pays the salaries of many employees :)&lt;/p&gt;

&lt;p&gt;There is much art to choosing simplifying assumptions that make problems tractable. Optimization problems are hardest to solve properly when the system is discrete with many combinations of state. One particularly famous family of problems over discrete integer values can be optimized with a technique called “Integer Linear Programming”.&lt;/p&gt;

&lt;h2 id=&quot;integer-linear-programming&quot;&gt;Integer Linear Programming&lt;/h2&gt;

&lt;p&gt;Any function that can be written as a weighted sum of integer variables can be minimized with Integer Linear Programming (ILP). While integer programming problems are in general &lt;a href=&quot;https://en.wikipedia.org/wiki/NP-completeness&quot;&gt;NP-complete&lt;/a&gt;, smart people have created optimization packages that can find good and sometimes even optimal solutions.&lt;/p&gt;
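&lt;p&gt;In its standard form, an ILP minimizes a linear objective over integer variables subject to linear constraints:&lt;/p&gt;

\[\min_{x} \; c^T x \quad \text{subject to} \quad Ax \le b, \quad x \in \mathbb{Z}^n\]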

&lt;p&gt;For these sorts of optimization problems, constraints are placed on a solution that bound what is considered acceptable. Constraints can come from many places, such as physical limits of realizable solutions or desired properties such as profit being greater than cost.&lt;/p&gt;

&lt;h2 id=&quot;solving-the-xkcd-restaurant-order-problem---deliciously&quot;&gt;Solving the XKCD Restaurant Order Problem - Deliciously&lt;/h2&gt;

&lt;p&gt;XKCD has a comic, funny for us nerds and painful for everyone else, about embedding optimization problems in restaurant orders:
&lt;a href=&quot;https://xkcd.com/287/&quot;&gt;&lt;img src=&quot;http://imgs.xkcd.com/comics/np_complete.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The XKCD optimization problem is this: given the prices of several foods, find a combination of items whose prices sum to exactly $15.05:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Food&lt;/th&gt;
      &lt;th&gt;Cost&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Mixed Fruit&lt;/td&gt;
      &lt;td&gt;2.15&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;French Fries&lt;/td&gt;
      &lt;td&gt;2.75&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Side Salad&lt;/td&gt;
      &lt;td&gt;3.35&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Hot Wings&lt;/td&gt;
      &lt;td&gt;3.55&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mozzarella Sticks&lt;/td&gt;
      &lt;td&gt;4.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Sampler Plate&lt;/td&gt;
      &lt;td&gt;5.80&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;
We can use an integer programming solver to help our poor server find an optimal solution.&lt;/p&gt;

&lt;p&gt;For this instance there are actually multiple solutions that optimally satisfy the constraints. One solution is simply to have 7 orders of mixed fruit. However that order is not very diverse and way too healthy for the typical person. Let’s make it more delicious!&lt;/p&gt;

&lt;p&gt;We can add some fun while still staying within the realm of integer linear programming. Let’s say that we want to find the combination of foods that simultaneously minimizes healthiness. Wait, minimize healthiness? That’s right: for the most delicious meal, we want to minimize healthiness. This is simply because unhealthy food tastes better (visit New Orleans if you don’t believe me).&lt;/p&gt;

&lt;p&gt;To do this, let’s score each food with a healthy index, where 0 is deliciously unhealthy and 10 is necessary for a healthy life:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Food&lt;/th&gt;
      &lt;th&gt;Healthy Index&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Mixed Fruit&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;French Fries&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Side Salad&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Hot Wings&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mozzarella Sticks&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Sampler Plate&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;
We can then compute the total healthiness score by summing the product of the count of each food and its health index:&lt;/p&gt;

\[\sum_{i \in food} count_i * health(i)\]

&lt;p&gt;Additionally we will ask our solver to add the constraint that everything adds up to $15.05:&lt;/p&gt;

\[\$15.05 = \sum_{i \in food} count_i * cost(i)\]

&lt;p&gt;Notice that health, like cost, is simply a constant coefficient, because each food has a unique numerical value defined in the tables above.&lt;/p&gt;

&lt;h2 id=&quot;solving-integer-programming-problems-with-optlang-in-python&quot;&gt;Solving Integer Programming Problems with Optlang in Python&lt;/h2&gt;

&lt;p&gt;Before diving into the code there are several programming primitives important to understand in &lt;a href=&quot;https://github.com/opencobra/optlang&quot;&gt;optlang&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Variables&lt;/code&gt;: objects that represent symbolic variables. Under the hood, optlang uses sympy to provide these symbolic objects. They can have a domain specified, such as integer or real, and bounds specified; for example, if a variable must be non-negative, specify a lower bound of 0.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Objectives&lt;/code&gt;: the function to minimize.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Constraints&lt;/code&gt;: the boundaries a solution must exist within, defined as a combination of variables and equality or inequality operators.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Models&lt;/code&gt;: containers that compose objectives and constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let’s pick the foods that provide the most delicious meal while satisfying our constraint that the total equals $15.05:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;optlang&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Variable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Objective&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Constraint&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;costs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;mixed_fruit&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;french_fries&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;2.75&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;side_salad&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;3.35&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;hot_wings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;3.55&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;mozzarella_sticks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;4.20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;sampler_plate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;5.80&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;healthy_index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;mixed_fruit&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;french_fries&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;side_salad&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;hot_wings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;mozzarella_sticks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;sampler_plate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;v_is&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;costs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Variable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;%s&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;integer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;v_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;total_cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;costs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;total_healthiness&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;healthy_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v_is&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;objective&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Objective&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_healthiness&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;direction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;min&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;constraint&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Constraint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_cost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;15.05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ub&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;15.05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;xkcd server model&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;objective&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;objective&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;constraint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;optimize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
                    
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;optimization status: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;objective value: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;objective&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;                      
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;variable&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;variables&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;variable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;costs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;total cost: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cost&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The solution for this is to pick only 1 mixed fruit, 1 sampler plate, and 2 orders of hot wings. That sounds very delicious to me.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally drafted January 2015, but updated for posting in 2021.&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Mon, 05 Jul 2021 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/2021/07/05/solving-optimization-problems-in-python.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/2021/07/05/solving-optimization-problems-in-python.html</guid>
        
        
      </item>
    
      <item>
        <title>Enabling SSH with Intel's Euclid Dev Kit</title>
        <description>&lt;p&gt;After I ran the demo apps, I immediately wanted to hack on some code.&lt;/p&gt;

&lt;p&gt;Unfortunately, the current documentation is incomplete on how to discover the Euclid over WiFi and SSH into the device!&lt;/p&gt;

&lt;p&gt;It turns out that VNC is present, but not SSH. To fix this, we need to connect the Euclid to the internet, connect to the Euclid via VNC, and install SSH.&lt;/p&gt;

&lt;h1 id=&quot;connect-the-euclid-to-the-internet&quot;&gt;Connect the Euclid to the Internet&lt;/h1&gt;

&lt;p&gt;By default, the Euclid will self-host a WiFi access point after every restart. It will appear with a name something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EUCLID_XXXX&lt;/code&gt;. When you find the device, connect to it via WiFi.&lt;/p&gt;

&lt;p&gt;At this point it will self-host under the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10.42.0.1&lt;/code&gt; address. Put this in your browser and then click through to the WiFi settings page:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/assets/figures/euclid_wifi_settings.png&quot; alt=&quot;Euclid Wifi Settings&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At this point, click the scan button. The Euclid will disconnect while it discovers WiFi access points. After about 60 seconds, you should reconnect to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EUCLID_XXXX&lt;/code&gt; access point. Navigate to your desired host and enter the appropriate WiFi password. The Euclid will now connect to your WiFi access point, and we can install the utilities we need for SSH.&lt;/p&gt;

&lt;h1 id=&quot;connect-to-the-euclid-via-vnc&quot;&gt;Connect to the Euclid via VNC&lt;/h1&gt;

&lt;p&gt;The first step here is to figure out what the hostname is. The hostname is the same as the WiFi access point name in broadcast mode. This should be something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EUCLID_4285&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After discovering the hostname, we can use it to connect with a convenient &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.local&lt;/code&gt; address. The host broadcasts via Avahi with the underscore in the hostname removed. For example, if your hostname is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EUCLID_4285&lt;/code&gt;, it will be broadcast as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EUCLID4285.local&lt;/code&gt;. For most network utilities (including your browser, though mileage varies for mobile devices) you can connect via this address instead of an IP.&lt;/p&gt;
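&lt;p&gt;As a quick sanity check (using the example hostname from above), you can derive the broadcast name in a bash shell by dropping the underscore:&lt;/p&gt;

```shell
# Derive the mDNS name from the example hostname EUCLID_4285
# by removing the underscore and appending .local (bash syntax).
host="EUCLID_4285"
mdns="${host/_/}.local"
echo "$mdns"  # EUCLID4285.local

# To verify it resolves on your network (requires avahi-utils):
#   avahi-resolve -n "$mdns"
```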

&lt;p&gt;On your workstation, VNC into the Euclid device:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gvncviewer EUCLID4285.local
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The default password is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;euclid&lt;/code&gt;. I recommend changing this :)&lt;/p&gt;

&lt;h1 id=&quot;install-ssh&quot;&gt;Install SSH&lt;/h1&gt;

&lt;p&gt;Within VNC, log in and open a terminal on the Euclid and run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo apt-get install openssh-server
sudo service ssh start
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can then ssh into your Euclid:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ssh euclid@EUCLID4285.local
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The password will be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;euclid&lt;/code&gt;, or whatever you changed it to.&lt;/p&gt;

&lt;p&gt;Happy hacking!&lt;/p&gt;
</description>
        <pubDate>Sat, 10 Jun 2017 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/2017/06/10/intel-euclid-ssh.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/2017/06/10/intel-euclid-ssh.html</guid>
        
        
      </item>
    
      <item>
        <title>Interactive Exploration of College Football Team Schedules</title>
        <description>&lt;p&gt;This is an exciting year for college football. This is the first year since 1981 that Clemson is ranked #1. But what has the schedule looked like to get there, and how does it compare to other elite teams?&lt;/p&gt;

&lt;p&gt;Below is an interactive visualization of the network of FBS teams so far. Each circle represents a team, where colors represent the conference, and the large circles are the top 25 ranked teams. Independent teams are colored with gray.&lt;/p&gt;

&lt;p&gt;Teams that are drawn near each other tend to have similar schedules, whereas teams far apart have much different schedules.&lt;/p&gt;

&lt;p&gt;The lines represent a game played between two teams. The shorter the line the more normal the game fits within a schedule; most games within the same conference will have short lines, such as Stanford playing UCLA, whereas a very long line is when Stanford played UCF. As you move your mouse over each team, the teams played will emerge. Try it out!&lt;/p&gt;

&lt;style&gt;
	.node {
		stroke: #fff;
		stroke-width: 1.5px;
	}

	rect.background {
		fill: #ddd;
	}

	.bg {
		stroke: none;
		opacity: 0.2;
	}

	.link {
		stroke: #fff;
		stroke-width: 1px;
		transition: opacity 0.5s ease-in-out;
		transition: stroke-width 0.5s ease-in-out;
	}

	.link.hover {
		stroke-width: 10px;
		opacity: 1.0;
	}

	.neighbor.label {
		opacity: 1.0;
	}

	.label {
		pointer-events: none;
		font: 12pt sans-serif;
		color: black;
		stroke: none;
	}

	.label {
		transition: opacity 0.5s ease-in-out;
		opacity: 0.0;
	}

	.label-shadow {
		transition: opacity 0.8s ease-in-out;
		opacity: 0.0;
		pointer-events: none;
		font: 12pt sans-serif;
		stroke: #fff;
		stroke-width: 8px;
		fill: #fff;
		stroke-linejoin: round;
	}

	.neighbor.label-shadow {
		opacity: 0.7;
	}

	.conference-overlay {
		transition: opacity 1.0s ease-in-out;
		pointer-events: none;
	}

	.conference-overlay.mouseover {
		opacity: 0.0;
	}

	.conference-polygon {
		stroke-width: 32px;
		stroke-linejoin: round;
		opacity: 0.5;
	}

	.conference-label {
		text-align: center;
		text-anchor: middle;
		font: 16pt sans-serif;
		fill: #fff;
		stroke: #fff;
		opacity: 0.8;
	}
&lt;/style&gt;

&lt;div id=&quot;d3-container&quot;&gt;&lt;/div&gt;
&lt;script src=&quot;http://d3js.org/d3.v3.min.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;script&gt;
	function sign(value) {
		return value &gt; 0 ? 1 : -1;
	}

	/*
	 * embed in post
	 * declare a few statistics
     */

	// From https://github.com/substack/point-in-polygon
	function pointInPolygon(point, vs) {
		// ray-casting algorithm based on
		// http://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/pnpoly.html
		var x = point[0], y = point[1];

		var inside = false;
		for (var i = 0, j = vs.length - 1; i &lt; vs.length; j = i++) {
			var xi = vs[i][0], yi = vs[i][1];
			var xj = vs[j][0], yj = vs[j][1];

			var intersect = ((yi &gt; y) != (yj &gt; y))
				&amp;&amp; (x &lt; (xj - xi) * (y - yi) / (yj - yi) + xi);
			if (intersect) {
				inside = !inside;
			}
		}

		return inside;
	}

	function renderGraph(data, isStaticLayout, options) {
		options = options || {};
		options.width = options.width || 1024;
		options.height = options.height || 768;
		options.tx = options.tx || 0;
		options.ty = options.ty || 0;

		var colorScale = d3.scale.category10();

		var color = function (conference) {
			return (conference != &quot;FBS Independents&quot;) ? colorScale(conference) : &quot;#f0f0f0&quot;;
		};

		var svgContainer = d3.select(&quot;#d3-container&quot;).append(&quot;svg&quot;)
			.attr(&quot;width&quot;, options.width)
			.attr(&quot;height&quot;, options.height);

		svgContainer.append(&quot;rect&quot;)
			.attr(&quot;class&quot;, &quot;background&quot;)
			.attr(&quot;width&quot;, options.width)
			.attr(&quot;height&quot;, options.height)
			.attr(&quot;x&quot;, 0)
			.attr(&quot;y&quot;, 0);

		var svg = svgContainer.append(&quot;g&quot;)
			.attr(&quot;transform&quot;, &quot;translate(&quot; + options.tx + &quot;,&quot; + options.ty + &quot;)&quot;);

		var background = svg.append(&quot;g&quot;);
		var center = svg.append(&quot;g&quot;);
		var foreground = svg.append(&quot;g&quot;);

		var force = d3.layout.force()
			.charge(-700)
			.linkDistance(40)
			.size([options.width, options.height]);

		// Build conference node map
		var conferences = {};

		data.nodes.map(function(node, index) {
			if (!conferences[node.conference]) {
				conferences[node.conference] = [];
			}
			conferences[node.conference].push(node);
		});

		// Filter out Independents
		delete conferences[&quot;FBS Independents&quot;];

		var link = center.selectAll(&quot;.link&quot;)
			.data(data.links)
			.enter().append(&quot;line&quot;)
				.attr(&quot;class&quot;, &quot;link&quot;);

		var node = center.append(&quot;g&quot;)
			.attr(&quot;class&quot;, &quot;nodes&quot;)
			.selectAll(&quot;.node&quot;)
				.data(data.nodes)
				.enter().append(&quot;circle&quot;)
					.attr(&quot;class&quot;, function(d) {return d.ranked ? &quot;node ranked&quot; : &quot;node unranked&quot;})
					.attr(&quot;r&quot;, function(d) { return d.ranked ? 8 : 4 })
					.style(&quot;fill&quot;, function(d) { return color(d.conference); });

		function isLeftAnchor(node) {
			return node.x + options.tx &gt; options.width - 100;
		}
		function dxLabel(node) {
			return 12 * (isLeftAnchor(node) ? -1 : 1);
		}
		function textAnchorLabel(node) {
			return isLeftAnchor(node) ? &quot;end&quot; : &quot;start&quot;;
		}
		var labelShadow = center.append(&quot;g&quot;)
			.attr(&quot;class&quot;, &quot;label-shadows&quot;)
			.selectAll(&quot;.label-shadow&quot;)
				.data(data.nodes)
				.enter().append(&quot;text&quot;)
					.attr(&quot;class&quot;, &quot;label-shadow&quot;)
					.attr(&quot;dy&quot;, &quot;.35em&quot;)
					.attr(&quot;dx&quot;, dxLabel)
					.style(&quot;text-anchor&quot;, textAnchorLabel)
					.text(function(d) { return d.name });

		var label = center.append(&quot;g&quot;)
			.attr(&quot;class&quot;, &quot;labels&quot;)
			.selectAll(&quot;.label&quot;)
				.data(data.nodes)
				.enter().append(&quot;text&quot;)
					.attr(&quot;class&quot;, &quot;label&quot;)
					.attr(&quot;dy&quot;, &quot;.35em&quot;)
					.attr(&quot;dx&quot;, dxLabel)
					.style(&quot;text-anchor&quot;, textAnchorLabel)
					.text(function(d) { return d.name });

		node.on(&quot;mouseover&quot;, function(d) {
			var neighbor = {};
			data.links.forEach(function(link) {
				if (link.source.index === d.index || d.index === link.target.index) {
					neighbor[link.source.index] = true;
					neighbor[link.target.index] = true;
				}
			});
			link.classed(&quot;hover&quot;, function(link) {
				return link.source.index === d.index || d.index === link.target.index;
			});
			labelShadow.classed(&quot;neighbor&quot;, function(node) {
				return neighbor[node.index] === true;
			});
			label.classed(&quot;neighbor&quot;, function(node) {
				return neighbor[node.index] === true;
			});
		});

		// Set the stroke width back to normal when mouse leaves the node.
		node.on(&quot;mouseout&quot;, function() {
			link.classed(&quot;hover&quot;, false);
			labelShadow.classed(&quot;neighbor&quot;, false);
			label.classed(&quot;neighbor&quot;, false);
		});

		function transformTranslateNode(node) {
			return &quot;translate(&quot; + node.x + &quot;,&quot; + node.y + &quot;)&quot;;
		}

		function updateLayoutPositions() {
			link
				.attr(&quot;x1&quot;, function(d) { return d.source.x; })
				.attr(&quot;y1&quot;, function(d) { return d.source.y; })
				.attr(&quot;x2&quot;, function(d) { return d.target.x; })
				.attr(&quot;y2&quot;, function(d) { return d.target.y; })

			node.attr(&quot;transform&quot;, transformTranslateNode);
			label.attr(&quot;transform&quot;, transformTranslateNode);
			labelShadow.attr(&quot;transform&quot;, transformTranslateNode);
		}

		function polygonPathData(vertices) {
			return &quot;M&quot; + vertices.join(&quot;L&quot;) + &quot;Z&quot;;
		}

		function finalizeRendering() {
			link.attr(&quot;opacity&quot;, function(d) {
				var dx = d.source.x - d.target.x;
				var dy = d.source.y - d.target.y;
				var dist = Math.sqrt(dx * dx + dy * dy);
				return Math.max(Math.min(2000.0 / (dist * dist) + 0.5, 1.0), 0.5);
			});

			var conferenceOverlay = foreground.append(&quot;g&quot;)
				.attr(&quot;class&quot;, &quot;conference-overlay&quot;);
			var conferenceOverlayRegions = conferenceOverlay.append(&quot;g&quot;);
			var conferenceOverlayLabels = conferenceOverlay.append(&quot;g&quot;);

			Object.keys(conferences).map(function(key) {
				var fill = color(key)
				var nodes = conferences[key];
				var convexHullVertices = d3.geom.hull(nodes.map(function(d) { return [d.x, d.y]; }));
				var polygon = d3.geom.polygon(convexHullVertices);
				var centroid = polygon.centroid();

				var convexHull = conferenceOverlayRegions.append(&quot;path&quot;)
					.attr(&quot;class&quot;, &quot;conference-polygon&quot;);

				var label = conferenceOverlayLabels.append(&quot;text&quot;)
					.attr(&quot;class&quot;, &quot;conference-label&quot;)
					.attr(&quot;transform&quot;, &quot;translate(&quot; + centroid.join(&quot;,&quot;) + &quot;)&quot;)
					.text(key);

				background.append(&quot;path&quot;)
					.attr(&quot;class&quot;, &quot;conference-polygon&quot;)
					.datum(convexHullVertices)
						.style(&quot;stroke&quot;, fill)
						.style(&quot;fill&quot;, fill)
						.attr(&quot;d&quot;, polygonPathData);

				convexHull.datum(convexHullVertices)
					.style(&quot;stroke&quot;, fill)
					.style(&quot;fill&quot;, fill)
					.attr(&quot;d&quot;, polygonPathData);
			});

			var dilation = 10;
			// dilate the polygon for mouse over helping
			var points = data.nodes.map(function(d) { return [d.x, d.y]; });
			var polygon = d3.geom.polygon(d3.geom.hull(points));
			var centroid = polygon.centroid();
			var dilatedPolygon = polygon.map(function (point) {
				var dx = point[0] - centroid[0];
				var dy = point[1] - centroid[1];
				var x = sign(dx) * dilation + point[0];
				var y = sign(dy) * dilation + point[1];
				return [x, y];
			});
			svgContainer.on(&quot;mousemove&quot;, function() {
				var mouse = d3.mouse(this);
				mouse[0] -= options.tx;
				mouse[1] -= options.ty;
				var mouseInPolygon = pointInPolygon(mouse, dilatedPolygon);
				conferenceOverlay.classed(&quot;mouseover&quot;, mouseInPolygon);
			});
		}

		force
			.nodes(data.nodes)
			.links(data.links)
			.on(&quot;tick&quot;, updateLayoutPositions)
			.on(&quot;end&quot;, finalizeRendering);
		
		force.start().stop();
		updateLayoutPositions();
	}

	d3.json(&quot;/blog/assets/json/2015_11_10_fbs_graph.json&quot;, function (error, data) {
		renderGraph(data, true, {tx: -113, ty: -57, width: 735, height: 690});
	});
&lt;/script&gt;

&lt;h1 id=&quot;graph-facts&quot;&gt;Graph Facts&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;Some top 25 teams have not yet played any ranked opponents. Ohio State seems to often face this critique and this year is no exception.&lt;/li&gt;
  &lt;li&gt;LSU is the only ranked team in the country to have played every ranked team within their conference so far.&lt;/li&gt;
  &lt;li&gt;Few teams compete against ranked teams outside their own conference. However, the independent Notre Dame and BYU both scheduled top opponents across different conferences.&lt;/li&gt;
  &lt;li&gt;Wisconsin’s schedule so far tells us the most about relative team ranks by looking at the opponents played and how those opponents fared within their own games. This comes from a diverse schedule cutting across central teams in the Big Ten, Mountain West, Big 12, SEC, Mid-American, and Sun Belt.&lt;/li&gt;
  &lt;li&gt;Baylor has one of the least informative schedules so far. While they have played one game against Conference USA (Rice) and one against the American Athletic (SMU), neither opponent plays a central role in its conference. In fact, there have been no games played between the Big 12 and ACC so far.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 11 Nov 2015 00:00:00 -0600</pubDate>
        <link>http://emmettmcquinn.com/blog/2015/11/11/fbs-schedule.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/2015/11/11/fbs-schedule.html</guid>
        
        
      </item>
    
      <item>
        <title>Privacy Preserving Queries: Surveying Populations Without Revealing Individuals' Identities</title>
        <description>&lt;p&gt;Imagine that you were asked to discover whether the workplace is generally happy. Unfortunately, your organization happens to be run by a semi-evil dictator who would fire anyone he knows for sure is unhappy:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/assets/preview/semi_evil_dictator_080915.jpg&quot; alt=&quot;Semi-Evil Dictator&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Normally you might use a standard survey, but you cannot in good conscience ask individuals to fill out a survey that may get them fired. What can you do?&lt;/p&gt;

&lt;p&gt;Enter &lt;a href=&quot;https://en.wikipedia.org/wiki/Differential_privacy&quot;&gt;Differential Privacy&lt;/a&gt;. The key is to give an individual plausible deniability in their specific response, but collectively provide a relevant statistic. With randomness at our side, we can have a surprisingly simple survey approach that will let us know approximately if most people are happy or unhappy.&lt;/p&gt;

&lt;h1 id=&quot;randomized-response-function&quot;&gt;Randomized Response Function&lt;/h1&gt;

&lt;p&gt;Our survey, in which we ask whether an employee is happy or unhappy, can be generalized to a binary function that returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt; when happy or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt; when unhappy.&lt;/p&gt;

&lt;p&gt;With this generalization in mind we can write an algorithm that can guarantee privacy at an individual level with &lt;em&gt;any binary response query&lt;/em&gt;. This same approach will work regardless if you are asking an individual a survey question, &lt;a href=&quot;http://googleonlinesecurity.blogspot.com/2014/10/learning-statistics-with-privacy-aided.html&quot;&gt;sending anonymized user event statistics over the internet&lt;/a&gt;, or &lt;a href=&quot;http://googleresearch.blogspot.com/2015/08/the-reusable-holdout-preserving.html&quot;&gt;peeking into a test set without overfitting&lt;/a&gt; (to be covered in more detail later).&lt;/p&gt;

&lt;p&gt;The privacy preserving function that we will use is called the “randomized response function”, where individually we can return either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt; regardless of what our true response value is. The algorithm is quite simple:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Randomly pick &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;If you picked &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt;, return the true response&lt;/li&gt;
  &lt;li&gt;Else return a new random &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;script src=&quot;https://gist.github.com/1d133259381cc7211efb.js&quot;&gt; &lt;/script&gt;
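&lt;p&gt;In case the embedded gist doesn’t load, the three steps above can be sketched in a few lines of Python (a minimal sketch, not necessarily the gist’s exact code):&lt;/p&gt;

```python
import random

def randomized_response(true_response):
    # Step 1: randomly pick True or False (a fair coin flip).
    if random.random() > 0.5:
        # Step 2: if we picked True, return the true response.
        return true_response
    # Step 3: else return a new random True or False.
    return random.random() > 0.5
```

&lt;p&gt;With a fair coin, an individual’s output matches their true response only \(3/4\) of the time, which is exactly what gives them plausible deniability.&lt;/p&gt;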

&lt;p&gt;Super simple! If we break down every possible combination of values, we can see that while there is uncertainty about an individual’s response, when many individuals are combined together we begin to see a sketch of the true number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt; values:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;True Response&lt;/th&gt;
      &lt;th&gt;First Random Value&lt;/th&gt;
      &lt;th&gt;Second Random Value&lt;/th&gt;
      &lt;th&gt;Output&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
      &lt;td&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
      &lt;td&gt;False&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;
From this table we see when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True Response = True&lt;/code&gt;, \(p(True \vert True) = 3/4\). When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True Response = False&lt;/code&gt;, \(p(True \vert False) = 1/4\).&lt;/p&gt;

&lt;p&gt;If we ask an individual what their true response is and they say &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt;, then \(3/4\) of the time their true response really is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt;. So if you start firing people who say &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;, about \(1/4\) of those you fire will actually be happy - enough that even our semi-evil dictator takes pause.&lt;/p&gt;

&lt;p&gt;In our example, our dictator is only semi-evil. A fully evil dictator simply does not care about a false positive rate of \(1/4\) and will still fire anyone who tells him they are unhappy, knowing some happy employees will be fired as well. What this approach guarantees is that even if most unhappy employees will be fired (\(3/4\) on average) there will still be some unhappy employees left around. Our evil dictator will never be completely successful picking out individuals, and even if they fired \(1/4\) of the workplace we will still be able to estimate workplace happiness. That’s pretty cool!&lt;/p&gt;

&lt;h1 id=&quot;population-happiness&quot;&gt;Population Happiness&lt;/h1&gt;

&lt;p&gt;The next step is to use our individually randomized survey and determine if these individuals are happy overall. The expected number of happy responses is:&lt;/p&gt;

\[E[\#True] = \#True * p(True \vert True) + \#False * p(True \vert False)\]

\[E[\#True] = \#True * 3/4 + \#False * 1/4\]

&lt;p&gt;Individuals are happy overall when the number of truly happy responses exceeds the number of unhappy ones; put symbolically, \(\#True &amp;gt; \#False\). When \(\#True = \#False\) we expect \(E[\#True] = \#Surveys / 2\), and the expectation rises above this as the truly happy share grows. Therefore a simple approximate test to see if employees are generally happy is whether the observed \(\#True &amp;gt; \#Surveys/2\).&lt;/p&gt;
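&lt;p&gt;Inverting the expectation above gives a quick point estimate of the truly happy count: since \(E[\#True] = \#Surveys/4 + \#True_{actual}/2\), the actual count is roughly twice the observed excess over \(\#Surveys/4\). A minimal sketch in Python (function names are my own):&lt;/p&gt;

```python
def estimate_happy_count(observed_true, n_surveys):
    # Invert E[observed] = n_surveys/4 + actual/2 to recover the actual count.
    return 2.0 * (observed_true - n_surveys / 4.0)

def generally_happy(observed_true, n_surveys):
    # Algebraically equivalent to the threshold test observed_true > n_surveys/2.
    return estimate_happy_count(observed_true, n_surveys) > n_surveys / 2.0
```

&lt;p&gt;For example, 75 observed happy responses out of 100 surveys suggests everyone is truly happy, since \(2 \times (75 - 25) = 100\).&lt;/p&gt;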

&lt;h1 id=&quot;putting-it-all-together&quot;&gt;Putting it All Together&lt;/h1&gt;

&lt;p&gt;For this algorithm, the number of times our happiness threshold \(\#True &amp;gt; \#Surveys/2\) fails to indicate the correct response is captured in the &lt;a href=&quot;https://en.wikipedia.org/wiki/Binomial_distribution#Cumulative_distribution_function&quot;&gt;CDF of a Binomial Distribution&lt;/a&gt;. If you want to model how randomized response will perform with your population, &lt;a href=&quot;http://www.wolframalpha.com/input/?i=+CDF%5BBinomialDistribution%5B100%2C+0.5%5D%2C+k%5D&quot;&gt;plot the CDF of the Binomial distribution&lt;/a&gt; for \(n=\#employees\) and \(p=0.5\).&lt;/p&gt;
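&lt;p&gt;If you’d rather compute the CDF locally than plot it on Wolfram Alpha, it is a few lines with Python’s standard library (a sketch; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math.comb&lt;/code&gt; requires Python 3.8+):&lt;/p&gt;

```python
import math

def binomial_cdf(k, n, p):
    # P(X is at most k) for X ~ Binomial(n, p), summed term by term.
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
```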

&lt;p&gt;In practice, it’s always good to test randomized methods with simulation. This algorithm works great as the number of employees grows. If we only have 10 employees, expect a lot of noise:&lt;/p&gt;
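&lt;p&gt;A simulation in that spirit draws many surveys and measures how often the threshold test disagrees with the true majority (my own sketch, not the notebook’s exact code):&lt;/p&gt;

```python
import random

def randomized_response(true_response):
    # Report the truth on heads; report a fresh coin flip on tails.
    if random.random() > 0.5:
        return true_response
    return random.random() > 0.5

def simulate_error_rate(n_happy, n_unhappy, n_trials=2000):
    # Fraction of trials where the threshold test (observed majority True)
    # disagrees with the true majority.
    n = n_happy + n_unhappy
    truly_happy = n_happy * 2 > n
    errors = 0
    for _ in range(n_trials):
        responses = [True] * n_happy + [False] * n_unhappy
        observed = sum(randomized_response(r) for r in responses)
        if (observed * 2 > n) != truly_happy:
            errors += 1
    return errors / n_trials
```

&lt;p&gt;With 10 employees the error rate is substantial; with 100 it nearly vanishes, matching the qualitative story in the figures.&lt;/p&gt;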

&lt;p&gt;&lt;img src=&quot;/blog/assets/figures/randomized_employee_happiness_10.png&quot; alt=&quot;Randomized Employee Happiness Simulation&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, with 100 employees the estimate is much more reliable:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/assets/figures/randomized_employee_happiness_100.png&quot; alt=&quot;Randomized Employee Happiness Simulation&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While analyses with this approach are not exact, one should feel comfortable saying that employees are fairly happy when the threshold is passed. One can make the threshold stricter by adding an offset, such as one standard deviation: \(\#True &amp;gt; \#Surveys/2 + \sqrt{\#Surveys}/2\):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/assets/figures/randomized_employee_happiness_100_std.png&quot; alt=&quot;Randomized Employee Happiness Simulation with Confidence&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This significantly reduces the false positive rate but at the cost of false negatives.&lt;/p&gt;

&lt;p&gt;I’ve put the implementation of this up on &lt;a href=&quot;https://gist.github.com/uncommoncode/9a093ba9110e433537ca&quot;&gt;github in an ipython notebook&lt;/a&gt;, hopefully others find it helpful!&lt;/p&gt;

&lt;h1 id=&quot;cautionary-notes&quot;&gt;Cautionary Notes&lt;/h1&gt;

&lt;p&gt;This approach by itself doesn’t fully prevent information leakage when multiple queries are asked of the same employee, either by asking the same query multiple times or by asking multiple related queries.&lt;/p&gt;

&lt;p&gt;If a spy attempts to ask an employee the same question twice or more, the employee will inadvertently leak more information about their true response value. A simple solution to this is to only ever answer a question once as an employee, or alternatively return random values after the first response.&lt;/p&gt;
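&lt;p&gt;We can quantify how much a repeated query leaks with Bayes’ rule: each answer multiplies the odds by \(p(True \vert True) / p(True \vert False) = 3\), so the posterior sharpens quickly. A sketch assuming a uniform prior (the function name is my own):&lt;/p&gt;

```python
def posterior_happy(n_true, n_false):
    # P(truly happy | n_true True answers and n_false False answers),
    # with a uniform prior and p(True|happy) = 3/4, p(True|unhappy) = 1/4.
    likelihood_happy = (0.75 ** n_true) * (0.25 ** n_false)
    likelihood_unhappy = (0.25 ** n_true) * (0.75 ** n_false)
    return likelihood_happy / (likelihood_happy + likelihood_unhappy)
```

&lt;p&gt;A single \(True\) answer gives the spy only \(3/4\) confidence, but five consistent \(True\) answers push the posterior past \(0.99\) - hence the advice to answer each question only once.&lt;/p&gt;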

&lt;p&gt;The spy realizes they can’t ask the same question twice, so instead they ask multiple related questions, much like the game of 20 questions. For example, if you wanted to determine a user’s age, you could ask multiple questions like \(age &amp;lt; 40\), \(age &amp;lt; 30\), \(age &amp;lt; 20\). Although these are all binary questions, they are not independent and will probabilistically leak some information about the age of a user (e.g. if \(age &amp;lt; 20\) then \(age &amp;lt; 30\) and \(age &amp;lt; 40\)). With dependent queries the privacy guarantees will fail without modifications that incorporate additional noise. However, privacy is still preserved if the questions are independent.&lt;/p&gt;

&lt;p&gt;Less realistically, there are also timing attacks that could reveal when a user is returning their true response. The naive implementation of the randomized response function branches and draws random numbers lazily, so the two paths differ in cycle counts and memory accesses; it takes longer to draw two random numbers than one. It’s very unlikely this difference is measurable with a fast random number generator; however, one could easily write a branchless version that is not susceptible to timing attacks.&lt;/p&gt;

&lt;h1 id=&quot;learn-more&quot;&gt;Learn More&lt;/h1&gt;

&lt;p&gt;If you want to learn more about differential privacy, I highly recommend reading &lt;a href=&quot;http://www.cis.upenn.edu/~aaroth/privacybook.html&quot;&gt;“The Algorithmic Foundations of Differential Privacy”&lt;/a&gt; by &lt;a href=&quot;http://research.microsoft.com/en-us/people/dwork/&quot;&gt;Cynthia Dwork&lt;/a&gt; and &lt;a href=&quot;http://www.cis.upenn.edu/~aaroth/&quot;&gt;Aaron Roth&lt;/a&gt;. It covers both practical examples and theory behind common techniques.&lt;/p&gt;

&lt;p&gt;I will review some of the recent results using differential privacy within data analysis and machine learning in an upcoming post to be linked here in the future.&lt;/p&gt;
</description>
        <pubDate>Sun, 09 Aug 2015 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/2015/08/09/differential-privacy.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/2015/08/09/differential-privacy.html</guid>
        
        
      </item>
    
      <item>
        <title>Machine Learning Articles of the Week: Oculus Rift Occluded Face Reconstruction, Low-Precision Deep Neural Networks, Numerically Precise Floating Point Code Synthesis, and Learned Terrain Traversal for CGI</title>
        <description>&lt;p&gt;I’ve been catching up with some of the SIGGRAPH entries this year and there are quite a few that are simple but effective applications of machine learning to graphics problems. I suspect this is the new trend in graphics papers but it’s refreshing to see interesting applications of machine learning that aren’t using a multilayer deep learning architectures with bayesian hyperparameter optimization and new custom gradient descent algorithms.&lt;/p&gt;

&lt;h2 id=&quot;why-are-eight-bits-enough-for-deep-neural-networks&quot;&gt;&lt;a href=&quot;http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/&quot;&gt;Why are Eight Bits Enough for Deep Neural Networks?&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Neural networks are often implemented in double precision (or more) due to concerns about numerical stability, and performance is left on the table. Consider an architecture with 8-bit weights instead of 64. While on a standard x86 CPU the arithmetic throughput may not change much, you can see big savings from reduced memory traffic and increased cache locality. Pete Warden explores this and more in the context of deep learning architectures.&lt;/p&gt;
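&lt;p&gt;As a rough illustration of the idea (my own sketch, not Warden’s code), linear quantization stores each weight as an 8-bit integer plus one shared scale factor, an 8x memory saving over 64-bit floats at the cost of a small rounding error:&lt;/p&gt;

```python
def quantize_int8(weights):
    # Map floats onto integers in [-127, 127] with one shared scale factor.
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    # Recover approximate float weights from the integers.
    return [q * scale for q in quantized]
```

&lt;p&gt;Round-tripping a weight through 8 bits perturbs it by at most half a quantization step, an error deep networks tolerate surprisingly well.&lt;/p&gt;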

&lt;h2 id=&quot;facial-performance-sensing&quot;&gt;&lt;a href=&quot;http://www.hao-li.com/Hao_Li/Hao_Li_-_publications_%5BFacial_Performance_Sensing_Head-Mounted_Display%5D.html&quot;&gt;Facial Performance Sensing&lt;/a&gt;&lt;/h2&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/rgKkEnaaSDc&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Virtual reality environments would be much more intimate if you could experience other people’s facial expressions in realtime. While a 3D map of the face is not hard to capture with proper rigging, wearing an Oculus Rift or other head-mounted display blocks sensors from capturing facial expressions. The authors merge strain gauge and depth data with linear regression to estimate the part of a user’s facial map occluded by the Oculus Rift.&lt;/p&gt;

&lt;h2 id=&quot;synthesis-for-floating-point-expressions&quot;&gt;&lt;a href=&quot;https://github.com/uwplse/herbie&quot;&gt;Synthesis for Floating-Point Expressions&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Compile efficient floating point code from real-numbered math written in a lisp-like language. This is a pretty exciting research direction, where numerical solvers may be optimized for both performance and numerical stability. I’m hoping this turns into the “&lt;a href=&quot;https://github.com/StanfordPL/stoke-release&quot;&gt;Stochastic Superoptimization&lt;/a&gt;” of floating point math.&lt;/p&gt;

&lt;h2 id=&quot;dynamic-terrain-traversal-skills-using-reinforcement-learning&quot;&gt;&lt;a href=&quot;http://www.cs.ubc.ca/~van/papers/2015-TOG-terrainRL/index.html&quot;&gt;Dynamic Terrain Traversal Skills Using Reinforcement Learning&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Animating characters is hard. What if you could train a model to learn how to animate? This paper looks into that by using reinforcement learning to train both a dog and a biped to navigate terrain by alternating between running and jumping.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/mazfn4dHPRM&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
</description>
        <pubDate>Wed, 27 May 2015 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/27/articles.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/27/articles.html</guid>
        
        
        <category>MLArticlesOfTheWeek</category>
        
      </item>
    
      <item>
        <title>Machine Learning Articles of the Week: Troll Detection, Strategies for Live Interviews, Silicon Machine Learning, and more</title>
        <description>&lt;h2 id=&quot;towards-healthier-online-communities&quot;&gt;&lt;a href=&quot;http://i.stanford.edu/~jure/pub/talks2/disqus-www-may15.pdf&quot;&gt;Towards Healthier Online Communities&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Should be subtitled: Detecting Trolls Before They Get Banned.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Jure Leskovec &amp;amp; company look into several online communities to find features that contribute to positive and negative engagement within comments. The authors performed a user study indicating that users care less about total vote magnitude than about the proportion of positive to negative votes.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://i.stanford.edu/~jure/pub/talks2/disqus-www-may15.pdf&quot;&gt;&lt;img src=&quot;/blog/assets/preview/user_ratings_051915.png&quot; alt=&quot;image&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This led to further research in differences between downvoting behavior for negative content and personal reasons, and found that the proportion of downvoting is increasing over time - perhaps indicating increasingly unhealthy communities.&lt;/p&gt;

&lt;p&gt;A predictive model for trolls was created using bag of words (text quality), post deletion rate, post length, post frequency (user activity), upvotes (community), and moderator signals; it achieved 0.7+ ROC for various sites with a general model, and closer to 0.8 when trained on each site individually.&lt;/p&gt;

&lt;h2 id=&quot;why-live-interviews-are-a-particular-challenge-for-statisticians&quot;&gt;&lt;a href=&quot;http://understandinguncertainty.org/why-live-interviews-are-particular-challenge-statisticians&quot;&gt;Why Live Interviews are a Particular Challenge for Statisticians&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;If explaining what you do for work to friends and family is hard, imagine doing it while being interviewed by the press on live television. Before you live the horror yourself, it’s good to understand what the interviewer is interested in: connecting your numbers or algorithms to people, both the effect you have on people and the ways people affect your work.&lt;/p&gt;

&lt;h2 id=&quot;silicon-chips-that-see-are-going-to-make-your-smartphone-brilliant&quot;&gt;&lt;a href=&quot;http://www.technologyreview.com/news/537446/silicon-chips-that-see-are-going-to-make-your-smartphone-brilliant/&quot;&gt;Silicon Chips That See Are Going to Make Your Smartphone Brilliant&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Mobile advances in computing, perhaps in the form of custom machine learning or neuromorphic chips, will bring algorithms usually run on some server “in the cloud” closer to the sensors. It’s hard to predict whether there are killer use cases for training on the sensor, but this space is getting exciting, with industry teams from IBM, Qualcomm, Intel, and now Synopsys (which, surprisingly, has roots in neuromorphic computing).&lt;/p&gt;

&lt;h2 id=&quot;psychology-journal-bans-p-values&quot;&gt;&lt;a href=&quot;http://www.nature.com/news/psychology-journal-bans-p-values-1.17001&quot;&gt;Psychology journal bans P values&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;I’ve only semi-recently found out that null hypothesis significance testing is no longer cool. Good riddance! I think most people were forced into the religion in high school science classes: no result was significant without p-values, and with a low enough p-value, correlation supposedly implied causation. Wikipedia has a great summary of &lt;a href=&quot;http://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Criticism&quot;&gt;p-value critiques&lt;/a&gt;, and the alternatives section is also telling: there is no free lunch when trying to scientifically demonstrate causal behavior. You simply need more than just a p-value.&lt;/p&gt;
</description>
        <pubDate>Tue, 19 May 2015 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/19/articles.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/19/articles.html</guid>
        
        
        <category>MLArticlesOfTheWeek</category>
        
      </item>
    
      <item>
        <title>Machine Learning Articles of the Week: Learning Object Detectors from Scenes, Exploring Emojis on Instagram, A Tutorial on Dynamical Systems on Networks, and more</title>
        <description>&lt;h2 id=&quot;object-detectors-emerge-in-deep-scene-cnns&quot;&gt;&lt;a href=&quot;http://arxiv.org/abs/1412.6856&quot;&gt;Object Detectors Emerge In Deep Scene CNNs&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;ImageNet is all the rage with CNNs. These authors show that object detection emerges as a byproduct of training to recognize scenes, and that the resulting detectors are more reliable than prior ImageNet-based work, despite having no supervision for learning object detectors. The authors suggest the network hierarchy progresses from edges to textures to objects to scenes, which is kind of neat because it reflects early work in CV, such as David Marr’s &lt;a href=&quot;http://en.wikipedia.org/wiki/David_Marr_(neuroscientist)#Stages_of_vision&quot;&gt;“primal sketch”&lt;/a&gt;, which hoped to create similar levels of hierarchy.&lt;/p&gt;

&lt;p&gt;The authors discuss a couple of interesting approaches, like a type of selective whitening they call scene simplification, which improves understanding of the training data by iteratively removing components that do not affect the score. Several of these are pictured below:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://arxiv.org/abs/1412.6856&quot;&gt;&lt;img src=&quot;/blog/assets/preview/scene_simplification_051215.png&quot; alt=&quot;image&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;dynamical-systems-on-networks-a-tutorial&quot;&gt;&lt;a href=&quot;http://arxiv.org/abs/1403.7663&quot;&gt;Dynamical Systems on Networks: A Tutorial&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;I haven’t finished reading this yet but it is a great tutorial on analytically tractable dynamical systems applied to networks, from social contagions and voter models to coupled oscillators.&lt;/p&gt;

&lt;h2 id=&quot;profiling-top-kagglers-kazanova-currently-2-in-the-world&quot;&gt;&lt;a href=&quot;http://blog.kaggle.com/2015/05/07/profiling-top-kagglers-kazanovacurrently-2-in-the-world/&quot;&gt;Profiling Top Kagglers: KazAnova Currently #2 in the World&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;This is a quick interview with Marios Michailidis, one of the highest-performing Kagglers in the world, on some of his methods for success. The most important steps when seeing a new dataset are to understand the problem, create a metric, set up reliable cross-validation consistent with the leaderboard, understand the family of algorithmic approaches and where they could be useful, and iterate and test frequently.&lt;/p&gt;

&lt;h2 id=&quot;emojineering-part-1-machine-learning-for-emoji-trends&quot;&gt;&lt;a href=&quot;http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji&quot;&gt;Emojineering Part 1: Machine Learning for Emoji Trends&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Emoji usage has increased over time. This is a light read analyzing the increasing use of emoji among Instagram’s users. The authors then learned a word2vec vector space to find which hashtags are commonly associated with each emoji, and used it for visualization and analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji&quot;&gt;&lt;img src=&quot;/blog/assets/figures/emoji_word2vec_analogy.png&quot; alt=&quot;image&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
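&lt;p&gt;The nearest-neighbor lookup behind this kind of analysis can be sketched with a few lines of cosine similarity; the toy vectors below are invented for illustration and stand in for Instagram’s learned embeddings:&lt;/p&gt;

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors):
    # token whose embedding points in the most similar direction to the query
    return max(vectors, key=lambda tok: cosine(vectors[tok], query))

# hypothetical 2-d embeddings standing in for learned emoji/hashtag vectors
toy = {"#love": [1.0, 0.1], "#lol": [0.1, 1.0]}
```

&lt;p&gt;In a real word2vec space the vectors have hundreds of dimensions, but the lookup is the same idea.&lt;/p&gt;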

&lt;h2 id=&quot;becoming-a-bayesian&quot;&gt;&lt;a href=&quot;http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-1.html&quot;&gt;Becoming a Bayesian&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Curious about Bayesian Machine Learning? I really liked this post because it is one of the more honest reads on Bayesian Machine Learning; many people gloss over the difficulties that arise in practice.&lt;/p&gt;

&lt;p&gt;The author discusses his journey from traditional approaches to thinking “about a probabilistic model that relates observables to quantities of interest and of suitable prior distributions for any unknowns that are present in this model.” &lt;a href=&quot;http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-2.html&quot;&gt;A second part of the post&lt;/a&gt; is available where he delves into the delicate dance between approximate models to provide computationally tractable approaches and rigorously pure Bayesian models where modeling is performed independently of the inference procedure.&lt;/p&gt;

</description>
        <pubDate>Tue, 12 May 2015 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/12/articles.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/12/articles.html</guid>
        
        
        <category>MLArticlesOfTheWeek</category>
        
      </item>
    
      <item>
        <title>Machine Learning Articles of the Week: Network Dynamics with BuzzFeed and Quora, Unnecessary Distributed Machine Learning, and the Utility of Small Data</title>
<description>&lt;p&gt;This week is a double whammy of large consumer-focused organizations looking at network dynamics! Very exciting times, with so many interesting approaches being published each week.&lt;/p&gt;

&lt;h2 id=&quot;the-emperors-new-clothes-distributed-machine-learning&quot;&gt;&lt;a href=&quot;http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/&quot;&gt;The Emperor’s New Clothes: Distributed Machine Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When you can get workstations with 1TB of RAM, dozens of cores, and multiple GPUs, do you really need a distributed machine learning solution for most datasets?&lt;/p&gt;

&lt;p&gt;In my personal experience, most people write low-throughput code and reach for parallelization as the wrong tool to speed things up. However, once there is a decoupling between what creates the data and what consumes it (say, one host saves events to S3 and another reads S3 and creates features), it does make sense to go parallel to improve networking throughput between the two nodes or clusters. The reasons for the decoupling typically come from building highly available services, which is somewhat in conflict with single-node workstations for machine learning.&lt;/p&gt;

&lt;h2 id=&quot;introducing-pound-process-for-optimizing-and-understanding-network-diffusion&quot;&gt;&lt;a href=&quot;http://www.buzzfeed.com/daozers/introducing-pound-process-for-optimizing-and-understanding-n&quot;&gt;Introducing Pound: Process for Optimizing and Understanding Network Diffusion&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Twins Andrew and Adam Kelleher created a graph construction system called Pound that takes event data and creates a graph of nodes and edges that can be explored with network analysis tools. Of particular interest is understanding network diffusion of information cascades, visualized below by Adam Kelleher:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/assets/figures/pound.png&quot; alt=&quot;Pound Visualization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I believe this is at least loosely based on &lt;a href=&quot;http://cs.stanford.edu/people/jure/pubs/blogs-sdm07.pdf&quot;&gt;work&lt;/a&gt; by &lt;a href=&quot;http://cs.stanford.edu/people/jure/&quot;&gt;Jure Leskovec&lt;/a&gt;, and it’s very exciting to see where this work may continue.&lt;/p&gt;

&lt;h2 id=&quot;upvote-dynamics-on-the-quora-network&quot;&gt;&lt;a href=&quot;http://data.quora.com/Upvote-Dynamics-on-the-Quora-Network&quot;&gt;Upvote Dynamics on the Quora Network&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Quora is able to connect questions to experts who can answer them effectively. This post takes a graph-theoretic look at Quora’s network dynamics, constructing a graph of users with an incremental algorithm that adds new users and follow relationships. A sparsifying step removes paths that take longer than some time threshold, and then the longest path is computed. Exploring answer propagation dynamics through this metric, the author finds interesting results, such as that a high follower count significantly helps good answers early in a post’s life, while as time goes on low-follower-count users tend to converge.&lt;/p&gt;
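&lt;p&gt;The sparsify-then-longest-path idea can be sketched roughly as follows; the edge representation and threshold semantics here are my assumptions for illustration, not the post’s actual implementation:&lt;/p&gt;

```python
def longest_path(edges, threshold):
    """Longest chain after dropping slow edges.

    edges: (source, target, delay) triples; edges whose delay exceeds
    the threshold are removed in the sparsifying step.
    """
    adj = {}
    for u, v, delay in edges:
        if delay > threshold:
            continue  # sparsify: this propagation was too slow
        adj.setdefault(u, []).append(v)

    def depth(node, seen):
        # longest chain of hops starting at node, avoiding revisits
        best = 0
        for nxt in adj.get(node, []):
            if nxt not in seen:
                best = max(best, 1 + depth(nxt, seen | {nxt}))
        return best

    return max((depth(n, {n}) for n in adj), default=0)
```

&lt;p&gt;At scale this would be done with a proper graph framework, but the two-step structure (filter edges, then measure path length) is the same.&lt;/p&gt;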

&lt;h2 id=&quot;how-not-to-drown-in-numbers&quot;&gt;&lt;a href=&quot;http://www.nytimes.com/2015/05/03/opinion/sunday/how-not-to-drown-in-numbers.html&quot;&gt;How Not to Drown in Numbers&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;An interesting discussion about how useful small data is; things like qualitative studies or human insight are extremely valuable when a computer does not easily have access to the data. The author presents several real world cases where small data is useful even with “big data” approaches.&lt;/p&gt;
</description>
        <pubDate>Mon, 04 May 2015 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/04/articles.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/05/04/articles.html</guid>
        
        
        <category>MLArticlesOfTheWeek</category>
        
      </item>
    
      <item>
        <title>Machine Learning Articles of the Week: Big Data Neuroscience Pipeline, State of NLP, Compressed NNs, Faster NNs, and Improving Police Sketches with Genetic Algorithms</title>
        <description>&lt;h2 id=&quot;5-takeaways-on-the-state-of-natural-language-processing&quot;&gt;&lt;a href=&quot;http://www.wise.io/tech/five-takeaways-on-the-state-of-natural-language-processing&quot;&gt;5 Takeaways on the State of Natural Language Processing&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This is a collection of takeaways from a recent NLP get together: &lt;a href=&quot;https://code.google.com/p/word2vec/&quot;&gt;Word2Vec&lt;/a&gt; is popular and doing more than analogy parlor tricks, production grade NLP is popular, open source tools are not being sponsored, RNNs are popular, and there is a massive gender imbalance.&lt;/p&gt;

&lt;p&gt;One of the interesting presentations being summarized here was &lt;a href=&quot;http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/&quot;&gt;how StitchFix is using word2vec on user’s comments to improve recommendations&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;compressing-neural-networks-with-the-hashing-trick&quot;&gt;&lt;a href=&quot;http://arxiv.org/abs/1504.04788&quot;&gt;Compressing Neural Networks with the Hashing Trick&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Motivated by the computational and memory pressures of mobile architectures, the authors created a trainable hashed neural network that randomly groups weights into hashed buckets shared throughout the network. Their method can achieve a compression factor of 1/64 with only 2% error on MNIST.&lt;/p&gt;
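&lt;p&gt;A minimal sketch of the weight sharing behind the hashing trick (the hash function and indexing scheme below are my assumptions for illustration, not the paper’s exact method):&lt;/p&gt;

```python
import zlib

def bucket(layer, i, j, n_buckets):
    # hash each virtual weight position (layer, row, col) into a shared bucket
    key = f"{layer}:{i}:{j}".encode()
    return zlib.crc32(key) % n_buckets

def virtual_weight(real_weights, layer, i, j):
    # the full weight matrix is never stored; many virtual positions
    # deterministically share one of a small number of real values
    return real_weights[bucket(layer, i, j, len(real_weights))]
```

&lt;p&gt;Only the small &lt;code&gt;real_weights&lt;/code&gt; array is trained and stored, which is where the compression comes from.&lt;/p&gt;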

&lt;h2 id=&quot;analyzing--visualizing-neuroscience-data-by-jeremy-freeman&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=N17I5FrRTCw&quot;&gt;Analyzing + Visualizing Neuroscience data by Jeremy Freeman&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://www.jeremyfreeman.net/&quot;&gt;Jeremy Freeman&lt;/a&gt; of HHMI’s Janelia Farm Research Campus gave a very cool presentation of his state-of-the-art realtime and batch big data pipeline, using Python, Spark, d3, and a lot of custom bits to provide clean and interactive analysis and visualization systems called &lt;a href=&quot;https://github.com/thunder-project/thunder&quot;&gt;Thunder&lt;/a&gt; and &lt;a href=&quot;https://github.com/lightning-viz/lightning&quot;&gt;Lightning&lt;/a&gt;, respectively. The code is completely open source and available on &lt;a href=&quot;https://github.com/freeman-lab&quot;&gt;github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://research.janelia.org/zebrafish/pca.html&quot;&gt;&lt;img src=&quot;/blog/assets/preview/thunder_visuzalization.png&quot; alt=&quot;Thunder visualization&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;caffe-con-troll-shallow-ideas-to-speed-up-deep-learning&quot;&gt;&lt;a href=&quot;http://arxiv.org/abs/1504.04343&quot;&gt;Caffe con Troll: Shallow Ideas to Speed Up Deep Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Thinking about computer architecture and using smart blocked matrix operations provides big speedups (4.5x end to end) on CPU and GPU for Caffe, a deep learning framework. The authors suggest that current CPU-vs-GPU measurements rely on poor CPU implementations of deep learning, causing the CPU performance numbers to underperform.&lt;/p&gt;
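&lt;p&gt;The blocked (tiled) matrix multiply idea can be sketched in plain Python; real implementations use vectorized kernels, but the loop structure shows why tiling improves cache reuse:&lt;/p&gt;

```python
def blocked_matmul(A, B, block=2):
    """Multiply matrices A (n x m) and B (m x p) one tile at a time.

    Each (block x block) tile of A and B is reused across the inner loops
    while it is still hot in cache, instead of streaming whole rows.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # accumulate one tile of C from tiles of A and B
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + block, p)):
                            C[i][j] += a * B[k][j]
    return C
```

&lt;p&gt;Choosing the block size so the working tiles fit in L1/L2 cache is where the measured speedups come from.&lt;/p&gt;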

&lt;h2 id=&quot;the-psychology-of-police-sketches---and-why-theyre-usually-wrong&quot;&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=h81SuD2pltM&quot;&gt;The Psychology of Police Sketches - And Why They’re Usually Wrong&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Algorithmically generating police sketches with genetic algorithms and randomized feature selection improves conviction rates and hopefully reduces bias.&lt;/p&gt;
</description>
        <pubDate>Mon, 27 Apr 2015 00:00:00 -0500</pubDate>
        <link>http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/04/27/articles.html</link>
        <guid isPermaLink="true">http://emmettmcquinn.com/blog/mlarticlesoftheweek/2015/04/27/articles.html</guid>
        
        
        <category>MLArticlesOfTheWeek</category>
        
      </item>
    
  </channel>
</rss>
