ICML 2023 - 40 Years

Selected ML Papers from ICML 2023

This blog post serves as a summary and exploration of ~100 papers, providing insights into the key trends presented at ICML 2023. The papers can be categorized into several sub-fields, including Graph Neural Networks and Transformers, Large Language Models, Optimal Transport, Time Series Analysis, Causality, Clustering, PCA and Autoencoders, as well as a few miscellaneous topics.

Graph Neural Networks and Transformers

The first sub-field, Graph Neural Networks and Transformers, encompasses papers that delve into the fusion of graph theory and deep learning architectures. These papers explore novel methods to enhance the representation and understanding of complex graph-structured data. They aim to improve graph reasoning, graph generation, and graph embedding techniques, unlocking the potential for more accurate predictions and insights.

Transformers Meet Directed Graphs

paper

Transformers as Algorithms: Generalization and Stability in In-context Learning

paper

Fast Inference from Transformers via Speculative Decoding

paper

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

paper

Graph Inductive Biases in Transformers without Message Passing

paper

On the Connection Between MPNN and Graph Transformer

paper

Towards Understanding the Generalization of Graph Neural Networks

paper

Graph Generative Model for Benchmarking Graph Neural Networks

paper

XTab: Cross-table Pretraining for Tabular Transformers

paper

Feature Expansion for Graph Neural Networks

paper

Fisher Information Embedding for Node and Graph Learning

paper

GOAT: A Global Transformer on Large-scale Graphs

paper

Coder Reviewer Reranking for Code Generation

paper

Exphormer: Sparse Transformers for Graphs

paper

Distribution Free Prediction Sets for Node Classification

paper

Node Embedding from Neural Hamiltonian Orbits in Graph Neural Networks

paper

Relevant Walk Search for Explaining Graph Neural Networks

paper

Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs

paper

Conformal Prediction Sets for Graph Neural Networks

paper

Leveraging Label Non-Uniformity for Node Classification in Graph Neural Networks

paper

Large Language Models

Large Language Models have gained significant attention in recent years for their ability to generate coherent and contextually relevant text. The papers in this sub-field delve into the advancements and challenges related to these models. They address topics such as fine-tuning strategies, model interpretability, long-tail knowledge learning, efficiency at inference time, and exploring the limits and biases of language models.

POUF: Prompt-Oriented Unsupervised Fine-tuning for Large Pre-trained Models

paper

A very interesting paper for practical applications… Simple but neat idea: align the distributions of the unlabeled target data (potentially very different from the data on which the model was pre-trained) and of the textual prototypes (prompts), using a fine-tuning loss based on optimal transport and mutual information.
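To make this concrete, here is a minimal sketch of such an alignment objective (illustrative, not the authors' implementation): `features` and `prototypes` are hypothetical, L2-normalized embeddings of the unlabeled target samples and of the class prompts, the transport term uses a plain Sinkhorn iteration over cosine distances, and the mutual-information term rewards confident but class-balanced predictions.

```python
# Minimal sketch of a POUF-style alignment loss (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.05, n_iters=50):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    u = torch.full((n,), 1.0 / n, device=cost.device)
    v = torch.full((m,), 1.0 / m, device=cost.device)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]             # transport plan

def pouf_style_loss(features, prototypes, temperature=0.07):
    # features: (n, d) target embeddings; prototypes: (C, d) prompt embeddings (both unit norm).
    cost = 1.0 - features @ prototypes.T           # cosine distance
    plan = sinkhorn(cost)
    transport_loss = (plan * cost).sum()

    # Mutual information of the predicted class distribution:
    # low conditional entropy (confident samples), high marginal entropy (balanced classes).
    probs = F.softmax(features @ prototypes.T / temperature, dim=1)
    ent_cond = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    marginal = probs.mean(dim=0)
    ent_marg = -(marginal * marginal.clamp_min(1e-8).log()).sum()

    return transport_loss - (ent_marg - ent_cond)  # minimize transport cost, maximize MI
```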

Why do Nearest Neighbor Language Models Work?

paper

Prompting Large Language Model for Machine Translation: A Case Study

paper

Large Language Models Struggle to Learn Long-Tail Knowledge

An interesting empirical study showing the correlation (and causality) between the number of relevant pre-training documents and the question-answering accuracy of large language models.

In short, the more pre-training documents cover a (question, answer) topic, the better the model answers it.

  • Having larger LMs helps (R² of 98%), but the log-linear scaling makes it unrealistic for now as a serious improvement direction;

  • Adding a retrieval module (prompts enriched with relevant context) boosts answer accuracy on low-resource questions and seems a more promising research direction.

  • The authors focus on absolute counts of relevant documents, but what about ratios (relative to other topics in the corpus)?

  • How is Q&A accuracy impacted by documents that contradict each other on a fact, as a function of their ratios?

  • Does the majority view always win? Does it depend on the ratio, or is it more contextual?

paper

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

paper

Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning

paper

Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning

paper

SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot

paper

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

paper

Can Large Language Models Reason about Program Behavior?

paper

Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation

paper

Large Language Models Can Be Easily Distracted by Irrelevant Context

paper

DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

paper

Simple but interesting idea!

This paper poses a simple hypothesis: minor rewrites of model-generated text tend to have lower log probability under the model than the original sample, while minor rewrites of human-written text may have higher or lower log probability than the original sample.

The authors empirically verify this hypothesis and find that it holds across a diverse set of LLMs, even when the minor rewrites, or perturbations, come from alternative language models.
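A minimal sketch of the resulting curvature score, assuming a `perturb_fn` that returns minor rewrites of the text (e.g. via T5 mask filling); the scoring model below is illustrative, not the paper's exact setup.

```python
# DetectGPT-style curvature score (sketch): compare the log-likelihood of a text
# with the average log-likelihood of its minor rewrites under the same model.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                # illustrative scoring model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def avg_token_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss             # mean next-token cross-entropy
    return -loss.item()                            # average token log-probability

def detectgpt_score(text, perturb_fn, n_perturbations=20):
    original = avg_token_logprob(text)
    perturbed = [avg_token_logprob(perturb_fn(text)) for _ in range(n_perturbations)]
    mu = sum(perturbed) / len(perturbed)
    var = sum((p - mu) ** 2 for p in perturbed) / max(len(perturbed) - 1, 1)
    # Large positive scores suggest the text was generated by (a model close to) `model`.
    return (original - mu) / (math.sqrt(var) + 1e-8)
```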

FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU

paper

CodeIPPrompt: Intellectual Property Infringement Assessment of Code Language Models

paper

The Unreasonable Effectiveness of Few-shot Learning for Machine Translation

paper

Repository-Level Prompt Generation for Large Language Models of Code

paper

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

paper

A “models as data” contribution to the literature and to future research efforts.

The Pythia suite is the only publicly released suite of LLMs that satisfies three key properties:

  1. Models span several orders of magnitude of model scale.
  2. All models were trained on the same data in the same order.
  3. The data and intermediate checkpoints are publicly available for study.
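For illustration, loading one model size at one intermediate training step with Hugging Face transformers might look like this; the checkpoint name and revision tag below follow the published naming scheme but should be checked against the Pythia repository.

```python
# Illustrative only: one Pythia size at one training-step revision (tags assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/pythia-160m"   # sizes span roughly 70M to 12B parameters
revision = "step3000"                   # intermediate checkpoint tag (assumed)

tok = AutoTokenizer.from_pretrained(checkpoint, revision=revision)
model = AutoModelForCausalLM.from_pretrained(checkpoint, revision=revision)

ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=5)[0]))
```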

Specializing Smaller Language Models towards Multi-Step Reasoning

paper

Knowledge distillation from code-davinci-002 into smaller models: the authors specialize T5 and FlanT5 on chains of thought generated by code-davinci-002, so that the small models acquire chain-of-thought (CoT) ability on math problems.

CoT ability typically emerges in large language models (>100B parameters) and is not found in small models. This paper shows a way to obtain specialized CoT ability in small models, at the expense of losing generic abilities.

The authors also have to deal with the misalignment between the GPT tokenizer and the T5 tokenizer, which they solve with dynamic programming.
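As a rough illustration of the idea (a generic edit-style alignment, not the paper's exact procedure), two tokenizations of the same text can be aligned with dynamic programming as follows.

```python
# Generic dynamic-programming alignment of two tokenizations of the same text.
def align_tokens(src_tokens, tgt_tokens):
    """Return aligned index pairs (i, j); None marks a token with no counterpart."""
    n, m = len(src_tokens), len(tgt_tokens)
    # dp[i][j] = minimal edit cost of aligning the first i source and first j target tokens.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src_tokens[i - 1] == tgt_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / substitute
                           dp[i - 1][j] + 1,          # unmatched source token
                           dp[i][j - 1] + 1)          # unmatched target token
    # Backtrack to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if src_tokens[i - 1] == tgt_tokens[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + sub:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return list(reversed(pairs))

print(align_tokens(["un", "believ", "able"], ["unbeliev", "able"]))
# -> [(0, None), (1, 0), (2, 1)]
```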

The authors train and evaluate on several math word-problem datasets.

Besides this math CoT specialization, what other specializations might one want to try?

The authors also show the importance of using instruction-tuned checkpoints as the base model, because their generalization performance is better than that of the raw pretrained checkpoints.

Automatically Auditing Large Language Models via Discrete Optimization

paper

Pretraining Language Models with Human Preferences

paper

LEVER: Learning to Verify Language-to-Code Generation with Execution

A well-motivated approach: code language models (CodeLMs) are costly to fine-tune, so the authors propose a way to improve them (e.g. OpenAI Codex) without changing their parameters.

They train a verifier (a much smaller LM, about 0.5% of the CodeLM's original size) to classify whether a triplet (natural-language task description, code generated by the CodeLM, output obtained by executing that code) is correct or not.

Then they take the argmax of the CodeLM generation probability multiplied by the verification probability from the small LM, voilà!

Through various empirical studies, the authors show that combining both probabilities works better than pruning (thresholding, binary decisions), and that the two probabilities are calibrated very differently: the verifier (small classification LM) is better at detecting obvious mistakes that lead to faulty executions, whereas the original CodeLM is better at distinguishing among the top-ranked programs.

Not too dissimilar to diversification and alpha combination in quantitative finance.
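A minimal sketch of this reranking step, with hypothetical candidates and a placeholder heuristic standing in for the trained verifier:

```python
# LEVER-style reranking sketch: combine the code LM's generation probability with a
# verifier's probability of correctness, then pick the highest-scoring candidate.
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    program: str
    lm_logprob: float       # log-probability assigned by the frozen code LM
    exec_result: str        # output of executing the program (assumed precomputed)

def verifier_prob(question: str, cand: Candidate) -> float:
    """Placeholder for P(correct | question, program, execution result).
    The paper trains a small classifier LM for this role."""
    return 0.1 if "Error" in cand.exec_result else 0.8   # hypothetical heuristic

def rerank(question, candidates):
    # A product of probabilities (sum of logs) rather than hard pruning of candidates.
    def score(c):
        return c.lm_logprob + math.log(max(verifier_prob(question, c), 1e-12))
    return max(candidates, key=score)

candidates = [
    Candidate("df['x'].sum()", lm_logprob=-1.2, exec_result="42"),
    Candidate("df.x.total()", lm_logprob=-0.9, exec_result="AttributeError"),
]
print(rerank("sum the values in column x", candidates).program)   # -> df['x'].sum()
```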

paper

GitHub

Demo

Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models

paper

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

paper

Effective Structured Prompting by Meta-Learning and Representative Verbalizer

paper

Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise

paper

Optimal Transport

The field of Optimal Transport studies how to transport and map one probability distribution onto another. The papers in this sub-field propose new methods and insights into utilizing Optimal Transport for various ML tasks, including generative modeling, information maximization, and embedding high-dimensional features.

Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps

paper

InfoOT: Information Maximizing Optimal Transport

paper

Meta Optimal Transport

paper

Linear Optimal Partial Transport Embedding

paper

Time Series

Time Series Analysis is crucial in understanding and predicting temporal data patterns. The papers in this sub-field introduce innovative techniques for time series forecasting, explainability of predictions, and handling feature and label shifts in domain adaptation scenarios.

Learning Deep Time-index Models for Time Series Forecasting

paper

Learning Perturbations to Explain Time Series Predictions

paper

Domain Adaptation for Time Series Under Feature and Label Shifts

paper

Causality

Causal relationships play a fundamental role in understanding cause and effect in ML models. The papers in this sub-field explore metrics, algorithms, and frameworks for inferring and utilizing causal knowledge. They aim to enhance regression models with causal insights, generate counterfactual explanations, and uncover data manifolds entailed by structural causal models.

New metrics and search algorithms for weighted causal DAGs

paper

Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge

paper

High Fidelity Image Counterfactuals with Probabilistic Causal Models

paper

On Data Manifolds Entailed by Structural Causal Models

paper

Clustering

Clustering algorithms, Principal Component Analysis (PCA), and Autoencoders are essential tools for unsupervised learning and dimensionality reduction. The papers in this sub-field propose novel approaches for interpretable neural clustering, orthogonality-enforced latent spaces, structured variational autoencoders, and the fundamental limits of two-layer autoencoders.

XAI Beyond Classification: Interpretable Neural Clustering

paper

Multi-class Graph Clustering via Approximated Effective p-Resistance

paper

End-to-end Differentiable Clustering with Associative Memories

paper

PCA and Autoencoders

Extending Kernel PCA through Dualization: Sparsity, Robustness and Fast Algorithms

paper

Orthogonality-Enforced Latent Space in Autoencoders: An Approach to Learning Disentangled Representations

paper

Revisiting Structured Variational Autoencoders

paper

Fundamental Limits of Two-layer Autoencoders, and Achieving Them with Gradient Methods

paper

Misc.

Additionally, there are papers that cover a diverse range of topics. These include advancements in quantile regression, probabilistic attention models for event sequences, robust consensus ranking, synthetic data generation, model calibration, intellectual property infringement assessment, and more.

Faith-Shap: The Faithful Shapley Interaction Index

paper

Flexible Model Aggregation for Quantile Regression

paper

Probabilistic Attention-to-Influence Neural Models for Event Sequences

paper

When does Privileged information Explain Away Label Noise?

paper

Temporal Label Smoothing for Early Event Prediction

paper

Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning

paper

Never mind the metrics—what about the uncertainty? Visualising confusion matrix metric distributions

paper

A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models

paper

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

paper

A Large-Scale Study of Probabilistic Calibration in Neural Network Regression

paper

Trompt: Towards a Better Deep Neural Network for Tabular Data

paper

Synthetic Data, Real Errors: How (Not) to Publish and Use Synthetic Data

paper

Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think

paper

On the Power of Foundation Models

paper

Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models

paper

Improving Expert Predictions with Conformal Prediction

paper

Robust Consensus in Ranking Data Analysis: Definitions, Properties and Computational Issues

paper

BEATs: Audio Pre-Training with Acoustic Tokenizers

paper

Great Models Think Alike: Improving Model Reliability via Inter-Model Latent Agreement

paper

A New PHO-rmula for Improved Performance of Semi-Structured Networks

paper

Taxonomy-Structured Domain Adaptation

paper

Explainability as statistical inference

paper

Discrete Key-Value Bottleneck

paper

End-to-End Multi-Object Detection with a Regularized Mixture Model

paper

Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression

paper

Generative Graph Dictionary Learning

paper

Conformal Inference is (almost) Free for Neural Networks Trained with Early Stopping

paper

Random Teachers are Good Teachers

paper

Answering Complex Logical Queries on Knowledge Graphs via Query Computation Tree Optimization

paper

RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank

paper

Shapley Based Residual Decomposition for Instance Analysis

paper

Building Neural Networks on Matrix Manifolds: A Gyrovector Space Approach

paper