[ICML 2019] Day 2 - U.S. Census, Time Series, Hawkes Processes, Shapley values, Topological Data Analysis, Deep Learning & Logic, Random Matrices, Optimal Transport for Graphs
The main conference began today (yesterday was the Tutorials). It started with an invited talk from Prof. John M. Abowd, Chief Scientist and Associate Director for Research and Methodology at the U.S. Census Bureau, on "The U.S. Census Bureau Tries to Be a Good Data Steward in the 21st Century", followed by the best paper award presentation on Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations.
Takeaways from the keynote "The U.S. Census Bureau Tries to Be a Good Data Steward in the 21st Century":
- By releasing census data (summary statistics), the U.S. Census Bureau has actually exposed private information of U.S. citizens: commercial databases could have been enriched with race and ethnicity information by linking them to the Census statistics databases, which contain this information. In general, models (like recommender systems) spit out statistics computed from their underlying data, and are thus vulnerable to database reconstruction attacks. The formalism addressing this problem is known as differential privacy and originated with cryptographers. Note that differential privacy was also central to last year's first keynote "AI and Security: Lessons, Challenges and Future Directions" by Prof. Dawn Song at ICML 2018 in Stockholm (cf. this post).
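For the record, a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy (the epsilon value and the toy counting query are my own choices, not from the talk):

```python
import numpy as np

def laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5):
    """Release a counting query with epsilon-differential privacy.

    Adding or removing one individual changes a count by at most `sensitivity`,
    so Laplace noise with scale sensitivity/epsilon masks any single person's
    contribution to the released statistic.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Toy census-style tabulation: number of residents of a block with a given attribute.
print(laplace_mechanism(42))
```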
Takeaways from the best paper Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations:
Two main contributions from the paper:
- Unsupervised learning of disentangled representations (few explanatory factors of variation) is fundamentally impossible without inductive biases on both the models and the data
- A large-scale experimental study training 12,000 state-of-the-art models shows that increased disentanglement does not lead to improved performance on downstream tasks
After this plenary session, the day was subdivided into three sessions (+ the evening poster session) with highly parallelized thematic tracks.
I attended the following sessions:
Supervised Learning
I read two of the session papers before the conference as they sounded interesting to me:
- Data Shapley: Equitable Valuation of Data for Machine Learning
- This paper proposes to use Shapley values to quantify how valuable a data point is to a machine learning model. Recently, Shapley values have been used to quantify how important a feature is when interpreting black-box machine learning models; the aim here is different. The motivations are of an economic nature: Shapley values could be a way to fairly remunerate people (or organizations) for contributing their data. The more useful (according to data Shapley) a data point is for the problem and model at hand (given all the other data points already collected), the more money it is worth. (A rough Monte Carlo sketch of the idea follows below.)
- GitHub
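A rough Monte Carlo sketch of the idea (my own simplification, with a logistic regression and validation accuracy as the performance metric; this is not the authors' TMC-Shapley implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mc_data_shapley(X_train, y_train, X_val, y_val, n_perms=100, seed=0):
    """Monte Carlo estimate of data Shapley values.

    For random permutations of the training set, each point's value is its
    average marginal gain in validation accuracy when appended to the points
    preceding it in the permutation.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev_score = 0.0  # score of the empty coalition (convention)
        for k in range(1, n + 1):
            idx = perm[:k]
            if len(np.unique(y_train[idx])) < 2:
                score = prev_score  # cannot fit a classifier on a single class yet
            else:
                model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
                score = model.score(X_val, y_val)
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / n_perms
```

The paper's TMC-Shapley additionally truncates each permutation once the marginal gains become negligible, which is what makes the computation tractable at scale.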
- Topological Data Analysis of Decision Boundaries with Application to Model Selection
- slides
- Another paper motivated by economic applications: matching vendor pre-trained models to customer data. From a technical perspective, the authors extend the standard Topological Data Analysis (TDA) toolkit, which works on point clouds of unlabeled data, to labeled point clouds. This lets them study the complexity of the decision boundaries of supervised machine learning models. They find that when choosing a pre-trained network, picking one whose topological complexity matches that of the dataset yields good generalization. Therefore, on a model marketplace, vendors should report the topological complexity measures of their models, customers should estimate these numbers on their data, and customers should then choose the model whose topological complexity measures match theirs most closely. (A crude sketch of a per-class topological complexity computation follows below.)
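A crude illustration of the flavour of the approach (my own simplification: I only compute per-class persistence diagrams with ripser and summarise them by total persistence, whereas the authors build labeled complexes that capture the decision boundary itself):

```python
import numpy as np
from ripser import ripser  # pip install ripser

def total_persistence(X, maxdim=1):
    """Sum of (death - birth) over finite H1 intervals: a crude complexity score."""
    dgms = ripser(X, maxdim=maxdim)['dgms']
    h1 = dgms[1]
    finite = h1[np.isfinite(h1[:, 1])]
    return float(np.sum(finite[:, 1] - finite[:, 0]))

def dataset_complexity(X, y):
    """Per-class topological complexity of a labeled point cloud; one could
    match these numbers against the complexities reported for vendor models."""
    return {label: total_persistence(X[y == label]) for label in np.unique(y)}
```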
Time Series
The Time Series session contained a bunch of interesting presentations. I was particularly interested in:
- Learning Hawkes Processes Under Synchronization Noise
- slides
- Multivariate Hawkes processes are used to model the occurrence of discrete events in continuous time. They are especially relevant when an arrival in one dimension can affect future arrivals in other dimensions (they are self-exciting and mutually exciting). Before this paper, the usual approach assumed that observations are noiseless, i.e. that the arrival times of the events are recorded accurately, without any delay. The authors introduce a new approach for learning the causal structure of multivariate Hawkes processes when events are subject to random and unknown time shifts: each dimension can have a different, but constant, time shift of its observations. The idea of the paper is to define a new process, the desynchronized multivariate Hawkes process, which is parametrized by (z, theta), where z is the time shift noise (considered as parameters) and theta the standard parameters of the multivariate Hawkes process. Estimating these parameters by maximum likelihood is challenging since the objective function is neither continuous nor smooth in z. To overcome this difficulty, the authors propose to smooth the objective function by approximating the kernels (which create the discontinuities) with functions that are differentiable everywhere; stochastic gradient descent is then applied to maximize the log-likelihood. (A rough sketch of this smoothing idea follows below.)
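To make the smoothing idea concrete, here is a rough PyTorch sketch under my own simplifications (exponential kernels with a shared decay, and a sigmoid in place of the hard causality indicator); it is not the authors' estimator:

```python
import torch

def smoothed_hawkes_loglik(times, dims, T, mu, alpha, beta, z, s=0.05):
    """Smoothed log-likelihood of an exponential-kernel multivariate Hawkes
    process whose dimension-d timestamps are all shifted by an offset z[d].

    The hard causality indicator 1{t_k > t_j} in the excitation kernel is
    replaced by sigmoid((t_k - t_j) / s), so the objective stays differentiable
    in z even when the shifts reorder events (a stand-in for the paper's
    smoothing of the kernels).
    """
    t = times + z[dims]                                   # de-synchronised timestamps
    n = len(t)
    dt = t.unsqueeze(0) - t.unsqueeze(1)                  # dt[j, k] = t_k - t_j
    a = alpha[dims][:, dims].T                            # a[j, k] = alpha[d_k, d_j]
    kern = a * beta * torch.exp(-beta * dt.clamp(min=0)) * torch.sigmoid(dt / s)
    kern = kern * (1.0 - torch.eye(n))                    # an event does not excite itself
    intensity = mu[dims] + kern.sum(dim=0)                # intensity at each event time
    compensator = mu.sum() * T + (
        alpha[:, dims].sum(dim=0) * (1.0 - torch.exp(-beta * (T - t).clamp(min=0)))
    ).sum()
    return torch.log(intensity).sum() - compensator

# Toy joint fit of (mu, alpha, z) by gradient ascent, with a fixed decay beta.
times = torch.tensor([0.3, 0.9, 1.1, 2.4, 2.5, 3.7])
dims = torch.tensor([0, 1, 0, 1, 0, 1])
mu = torch.full((2,), 0.5, requires_grad=True)
log_alpha = torch.zeros(2, 2, requires_grad=True)          # parametrise alpha > 0
z = torch.zeros(2, requires_grad=True)                     # per-dimension time shifts
beta = torch.tensor(1.0)
opt = torch.optim.Adam([mu, log_alpha, z], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    nll = -smoothed_hawkes_loglik(times, dims, 4.0, mu.abs(), log_alpha.exp(), beta, z)
    nll.backward()
    opt.step()
```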
- Deep Factors for Forecasting
- I will have to dig deeper into the paper and play with the code, but basically Amazon's researchers use a decomposition theorem to motivate their approach: some N-variate time series can be decomposed into a global time series and N local time series. It reminds me of Sklar's theorem, which decomposes an N-variate distribution into a joint uniform distribution over [0, 1]^N and N marginal distributions, and which motivated my first paper (strongly influenced by P. Very and P. Donnat) on the hierarchical clustering of credit default swap time series. Back to the (global, N local) decomposition: a deep net can be used to model the global time series and Gaussian processes to model the local time series. When there are hierarchical clusters of time series, the results can be extended: the "exchangeability" (invariance of the distribution under permutation of the variables) condition only needs to hold within each hierarchical cluster; in that case, there is a set of "global" time series. Doing so, they obtain a probabilistic time series model that scales. They are careful to point out that they consider time series (sub)sampled at discrete time steps, rather than a marked point process. (A toy sketch of the global/local decomposition follows below.)
- GitHub: GluonTS - Probabilistic Time Series Modeling in Python
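A toy sketch of the global/local decomposition in plain scikit-learn (my own simplification: a small MLP stands in for the deep global factor and one GP per series models the local part; nothing here uses the paper's actual model or GluonTS):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic panel: N series sharing a global seasonal pattern plus local noise.
rng = np.random.default_rng(0)
T, N = 200, 5
t = np.arange(T, dtype=float)[:, None]
global_pattern = np.sin(2 * np.pi * t.ravel() / 24)
series = np.stack([
    global_pattern + 0.3 * np.convolve(rng.standard_normal(T), np.ones(10) / 10, mode="same")
    for _ in range(N)
])

# "Global" factor: one network fitted on all the series pooled together.
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
mlp.fit(np.tile(t, (N, 1)), series.ravel())
global_fit = mlp.predict(t)

# "Local" models: one Gaussian process per series, fitted on the residuals.
local_gps = []
for i in range(N):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0) + WhiteKernel())
    gp.fit(t, series[i] - global_fit)
    local_gps.append(gp)

# Probabilistic reconstruction of series 0: shared global fit + local GP mean/std.
mean0, std0 = local_gps[0].predict(t, return_std=True)
reconstruction = global_fit + mean0
```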
- Imputing Missing Events in Continuous-Time Event Streams
- GitHub; related paper (Neural Hawkes Process) and its GitHub
Large Scale Learning and Systems
I didn’t get much from this session (very hardware oriented), except maybe for the following paper:
- DL2: Training and Querying Neural Networks with Logic
- This paper proposes an “SQL” for querying neural networks, and for training them with logical constraints. I like this idea of combining logic and neural networks; it was quite new to me. (A minimal sketch of compiling a logical constraint into a loss follows below.)
- GitHub
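A minimal PyTorch sketch of the spirit of the approach, i.e. compiling a logical formula into a differentiable penalty (the function name, the 0.9 margin and the class indices are mine, and this does not use DL2's actual DSL or API):

```python
import torch
import torch.nn.functional as F

def constraint_loss(logits, y, other, margin=0.9):
    """Hand-rolled differentiable penalty for the logical constraint
    "p(class y) >= margin  AND  p(class `other`) <= 1 - margin".

    Each atomic comparison becomes a hinge term; the conjunction is a sum,
    which is zero iff the whole formula is satisfied.
    """
    p = F.softmax(logits, dim=-1)
    ge = torch.relu(margin - p[:, y])              # violated part of p_y >= margin
    le = torch.relu(p[:, other] - (1 - margin))    # violated part of p_other <= 1 - margin
    return (ge + le).mean()

# Usage: add the constraint penalty to the usual cross-entropy during training.
logits = torch.randn(8, 10, requires_grad=True)
labels = torch.full((8,), 3)
loss = F.cross_entropy(logits, labels) + 0.5 * constraint_loss(logits, y=3, other=7)
loss.backward()
```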
Poster session
- Optimal Transport for structured data with application on graphs
- Now I can compute the Fréchet mean of graphs! (A minimal POT sketch follows below.)
- GitHub
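So that my future self remembers how, a minimal sketch with the POT library, using plain Gromov-Wasserstein on shortest-path matrices (the paper's Fused Gromov-Wasserstein also mixes in node features, which I drop here for simplicity):

```python
import networkx as nx
import ot  # POT: Python Optimal Transport

def shortest_path_matrix(G):
    """Structure matrix of a graph: pairwise shortest-path distances."""
    return nx.floyd_warshall_numpy(G).astype(float)

# Two toy graphs to compare and average.
graphs = [nx.cycle_graph(6), nx.path_graph(6)]
Cs = [shortest_path_matrix(G) for G in graphs]
ps = [ot.unif(C.shape[0]) for C in Cs]

# Gromov-Wasserstein coupling between the two graph structures.
coupling = ot.gromov.gromov_wasserstein(Cs[0], Cs[1], ps[0], ps[1], 'square_loss')

# A "Frechet mean" of the two structures: GW barycenter with 6 nodes.
C_bar = ot.gromov.gromov_barycenters(6, Cs, ps, ot.unif(6), [0.5, 0.5], 'square_loss')
```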
- Random Matrix Improved Covariance Estimation for a Large Class of Metrics
- Information Geometry and Random Matrices for covariance estimation… I’m somewhat familiar with both topics, and I was surprised that the research stream from Joel Bun, Jean-Philippe Bouchaud and the CFM was not cited, for example this extensive review.