Klaus 3319
266 Ferst Dr NW
Atlanta, GA 30332
My name is Renzhi Wu (吴仁智). I’m a third-year Ph.D. student in Computer Science at Georgia Tech, advised by Prof. Xu Chu and Prof. Kexin Rong. I was fortunate to intern at Google, Celonis, Adobe, and Alibaba. I am on the job market!
Research Interests:
I am generally interested in machine learning and data management. I work on applying ML to data management problems (e.g., entity resolution and cardinality estimation) and on improving ML models from a data-centric perspective (e.g., via programmatic data labeling and data cleaning). My research is partly supported by a Meta (Facebook) Fellowship.
Previously:
I hold an M.Sc. in Thermophysics from Tsinghua University, where I worked on numerical simulation algorithms for the formation of water dew. I also hold an M.Eng. in Production System Engineering from RWTH Aachen University.
Before that, I obtained my bachelor’s degrees in Energy/Power Engineering and in Economics from Tsinghua University.
Publications
* denotes equal contribution.
2023
Learning Hyper Label Model for Programmatic Weak Supervision
Renzhi Wu, Shen-En Chen, Jieyu Zhang, and Xu Chu
To appear in ICLR 2023
To reduce human annotation effort, the programmatic weak supervision (PWS) paradigm abstracts weak supervision sources as labeling functions (LFs) and uses a label model to aggregate the outputs of multiple LFs into training labels. Most existing label models require a parameter-learning step for each dataset. In this work, we present a hyper label model that (once learned) infers the ground-truth labels for each dataset in a single forward pass, without dataset-specific parameter learning. The hyper label model approximates an optimal analytical (yet computationally intractable) solution of the ground-truth labels. We train the model on synthetic data generated in a way that ensures the model approximates the analytical optimal solution, and build the model upon a Graph Neural Network (GNN) so that its predictions are invariant (or equivariant) to the permutation of LFs (or data points). On 14 real-world datasets, our hyper label model outperforms the best existing methods in both accuracy (by 1.4 points on average) and efficiency (by six times on average).
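To give a flavor of the permutation-symmetry constraint (and only that), here is a minimal NumPy sketch that aggregates an LF output matrix with an operation invariant to LF permutation and equivariant to data-point permutation. It reduces to majority-vote-style pooling; the actual hyper label model replaces this fixed pooling with learned GNN layers.

```python
# Toy sketch of the permutation-symmetry idea, NOT the paper's architecture:
# aggregate an (n_points x n_LFs) weak-label matrix with an operation that is
# invariant to permuting LFs and equivariant to permuting data points.
import numpy as np

def toy_hyper_label_model(L, n_classes=2):
    """L: (n, m) matrix of LF votes in {0, ..., n_classes-1}, or -1 for abstain."""
    n, m = L.shape
    onehot = np.zeros((n, m, n_classes))
    for c in range(n_classes):
        onehot[:, :, c] = (L == c)            # abstains contribute nothing
    # Mean-pooling over the LF axis is invariant to LF permutation; applying
    # it row by row makes the output equivariant to data-point permutation.
    pooled = onehot.mean(axis=1)              # (n, n_classes)
    return pooled.argmax(axis=1)

L = np.array([[1, 1, -1],
              [0, 1,  0],
              [0, 0,  0]])
print(toy_hyper_label_model(L))               # -> [1 0 0]
```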
Ground Truth Inference for Weakly Supervised Entity Matching
Renzhi Wu, Alexander Bendeck, Xu Chu, and Yeye He
To appear in SIGMOD 2023
Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching performance; however, they require many labeled examples, which are often expensive or infeasible to obtain. This has inspired us to approach data labeling for EM using weak supervision. In particular, we use the labeling function abstraction popularized by Snorkel, where each labeling function (LF) is a user-provided program that can generate many noisy match/non-match labels quickly and cheaply. Given a set of user-written LFs, the quality of data labeling depends on a labeling model that can accurately infer the ground-truth labels.
The general form of our labeling model is simple while substantially outperforming the best existing method across ten general weak supervision datasets. To tailor the labeling model for EM, we formulate an approach to ensure that the final predictions of the labeling model satisfy the transitivity property required in EM, utilizing an exact solution where possible and an ML-based approximation in the remaining cases. On two single-table and nine two-table real-world EM datasets, we show that our labeling model results in a 9% higher F1 score on average than the best existing method. We also show that a deep learning EM end model (DeepMatcher) trained on labels generated from our weak supervision approach is comparable to an end model trained using tens of thousands of ground-truth labels, demonstrating that our approach can significantly reduce the labeling effort required in EM.
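For readers unfamiliar with the transitivity property: if records a and b match and b and c match, then a and c must also match. Below is a minimal sketch of one naive way to enforce it post hoc (connected components over predicted matches, via union-find); the paper's actual correction is more careful, using an exact solution where possible and an ML-based approximation otherwise.

```python
# Naive post-hoc transitivity enforcement for entity matching: treat
# predicted matches as edges, take connected components, and declare every
# pair inside a component a match. Illustrative baseline only.
def enforce_transitivity(pairs, preds):
    """pairs: list of (a, b) record-id tuples; preds: list of 0/1 labels."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for (a, b), y in zip(pairs, preds):
        if y == 1:
            union(a, b)
    # Relabel: a pair is a match iff both sides share a component.
    return [1 if find(a) == find(b) else 0 for a, b in pairs]

pairs = [("r1", "r2"), ("r2", "r3"), ("r1", "r3")]
preds = [1, 1, 0]                              # violates transitivity
print(enforce_transitivity(pairs, preds))      # -> [1, 1, 1]
```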
2022
Learning to be a Statistician: Learned Estimator for Number of Distinct Values
Renzhi Wu, Bolin Ding, Xu Chu, Zhewei Wei, Xiening Dai, Tao Guan, and Jingren Zhou
Proc. VLDB Endow. 2022
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as column-store compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets, as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload-agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload-agnostic, in the sense that the model/estimator can be trained with synthetically generated training data and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our learned estimator online for reproducibility.
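A hedged sketch of the supervised framing, under assumed feature and model choices that are illustrative rather than the paper's: generate synthetic columns, featurize a small random sample by its frequency profile, and regress toward log(NDV). The trained model can then be applied to samples from unseen tables.

```python
# Illustrative supervised framing for sample-based NDV estimation.
# The features (frequency profile) and the regressor are assumptions
# for this sketch, not the paper's exact design.
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def featurize(sample, population_size):
    freqs = Counter(Counter(sample).values())   # f_j = #values seen j times
    f = [freqs.get(j, 0) for j in (1, 2, 3)]
    return f + [len(set(sample)), len(sample), population_size]

def synth_example(n=10_000, sample_frac=0.01):
    ndv = rng.integers(1, n)
    # Skewed value frequencies so training covers diverse distributions.
    probs = rng.dirichlet(np.ones(ndv) * rng.uniform(0.1, 5.0))
    col = rng.choice(ndv, size=n, p=probs)
    sample = rng.choice(col, size=int(n * sample_frac), replace=False)
    return featurize(sample, n), np.log(len(set(col)))

X, y = zip(*(synth_example() for _ in range(500)))
model = GradientBoostingRegressor().fit(np.array(X), np.array(y))
# The trained model can now estimate NDV for an unseen table from a sample:
# ndv_hat = np.exp(model.predict([featurize(sample, table_size)])[0])
```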
2021
Demonstration of Panda: A Weakly Supervised Entity Matching System
Renzhi Wu, Prem Sakala, Peng Li, Xu Chu, and Yeye He
Proc. VLDB Endow. 2021
Entity matching (EM) refers to the problem of identifying tuple pairs in one or more relations that refer to the same real-world entities. Supervised machine learning (ML) approaches, and deep learning based approaches in particular, typically achieve state-of-the-art matching results. However, these approaches require many labeled examples, in the form of matching and non-matching pairs, which are expensive and time-consuming to label. In this paper, we introduce Panda, a weakly supervised system specifically designed for EM. Panda uses the same labeling function abstraction as Snorkel, where labeling functions (LFs) are user-provided programs that can generate large amounts of (somewhat noisy) labels quickly and cheaply, which can then be combined via a labeling model to generate accurate final predictions. To support users developing LFs for EM, Panda provides a browser-based integrated development environment (IDE). Panda’s IDE facilitates the development, debugging, and life-cycle management of LFs in the context of EM tasks, similar to how IDEs such as Visual Studio or Eclipse excel at general-purpose programming. Panda’s IDE includes many novel features purpose-built for EM, such as smart data sampling, a built-in library of EM utility functions, automatically generated LFs, visual debugging of LFs, and finally, an EM-specific labeling model. We show in this demo that the Panda IDE can greatly accelerate the development of high-quality EM solutions using weak supervision.
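To make the LF abstraction concrete, here is an illustrative Snorkel-style labeling function for EM; the `jaccard` helper and the thresholds are hypothetical for this sketch, not part of Panda.

```python
# Example of a user-written labeling function (LF) for entity matching:
# a small heuristic that votes match / non-match / abstain on a record pair.
# The helper and thresholds below are hypothetical.
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def lf_title_similarity(pair) -> int:
    """Vote based on title overlap; abstain in the ambiguous middle band."""
    sim = jaccard(pair["left_title"], pair["right_title"])
    if sim > 0.8:
        return MATCH
    if sim < 0.2:
        return NON_MATCH
    return ABSTAIN

pair = {"left_title": "iPhone 12 Pro 128GB",
        "right_title": "Apple iPhone 12 Pro (128 GB)"}
print(lf_title_similarity(pair))               # -> -1 (abstain)
```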
Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
Bojan Karlas*, Peng Li*, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang
Proc. VLDB Endow. 2021
Machine learning (ML) applications have been thriving recently, largely owing to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) – a test data example can be certainly predicted (CP’ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction.
We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP’ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed – we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds.
We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques in terms of classification accuracy, with mild manual cleaning effort.
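Below is a brute-force sketch of the CP definition for a 1-NN classifier, enumerating all possible worlds; the point of the paper is that the checking and counting queries can instead be answered in linear or polynomial time, so this exponential version is for illustration only.

```python
# Brute-force illustration of Certain Predictions (CP) for 1-NN: enumerate
# every possible world induced by missing cells, then answer the checking
# query (is the prediction invariant?) and the counting query (how many
# worlds support each label?). Exponential by design; not the paper's method.
from itertools import product
from collections import Counter

def one_nn(train, labels, x):
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train]
    return labels[dists.index(min(dists))]

def cp_queries(incomplete, labels, candidates, x):
    """incomplete: rows where None marks a missing cell;
    candidates: possible values for each missing cell, in row order."""
    votes = Counter()
    for fill in product(*candidates):
        it = iter(fill)
        world = [[v if v is not None else next(it) for v in row]
                 for row in incomplete]
        votes[one_nn(world, labels, x)] += 1
    certain = len(votes) == 1                  # Q1: checking query
    return certain, votes                      # Q2: counting query

rows = [[1.0, None], [4.0, 4.0]]
labels = ["a", "b"]
cands = [[0.0, 2.0]]             # possible values for the single missing cell
print(cp_queries(rows, labels, cands, x=[1.0, 1.0]))   # -> (True, {'a': 2})
```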
2020
ZeroER: Entity Resolution using Zero Labeled Examples
Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan
In Proceedings of the 2020 International Conference on Management of Data (SIGMOD 2020), online [Portland, OR, USA], June 14–19, 2020
Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve state-of-the-art results, they require a large number of labeled examples that are expensive to obtain and oftentimes infeasible. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires zero labeled examples, yet can achieve performance comparable to supervised approaches? In this paper, we answer in the affirmative through our proposed approach dubbed ZeroER. Our approach is based on a simple observation – the similarity vectors for matches should look different from those of unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way, resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves comparable performance to supervised approaches.
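A minimal sketch of the core observation on toy data: fit a two-component Gaussian mixture to similarity vectors and read the higher-similarity component as "match". The adaptive regularization and transitivity components of ZeroER are omitted here.

```python
# Toy sketch of the core ZeroER observation: matching pairs' similarity
# vectors are distributed differently from non-matching pairs', so a
# two-component Gaussian mixture can separate them with zero labels.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy similarity vectors (e.g., per-attribute string similarities in [0, 1]).
unmatches = rng.normal(0.25, 0.08, size=(300, 3))
matches = rng.normal(0.85, 0.05, size=(30, 3))
X = np.clip(np.vstack([unmatches, matches]), 0, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
match_comp = gmm.means_.mean(axis=1).argmax()   # component with higher sims
match_prob = gmm.predict_proba(X)[:, match_comp]
print((match_prob > 0.5).sum(), "pairs predicted as matches")   # ~30
```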
Dynamic pattern matching with multiple queries on large scale data streams
Sergey Sukhanov*, Renzhi Wu*, Christian Debes, and Abdelhak M. Zoubir
Signal Processing 2020
Similarity search in data streams is an important but challenging task in many practical areas where real-time pattern retrieval is required. Dynamic and fast-updating data streams are often subject to outliers, noise, and potential distortions in the amplitude and time dimensions. Such conditions typically lead to the failure of existing pattern matching algorithms and to an inability to retrieve the required patterns from the stream. The main reason for such failures is the limitation of the data normalization utilized in the majority of methods. Another reason is the lack of means to consider multiple examples of the same template to account for possible variations of the query signal. In this paper, we propose a dynamic normalization approach that brings streaming signal subsequences to the scale of the query template. This significantly improves pattern retrieval capabilities, especially when sampling variance or time distortions are present. We further develop a pattern matching approach utilizing the proposed normalization mechanism and extend it to the case where multiple examples of a query template are available. Multiple synthetic and real-data experiments demonstrate that this considerably improves the pattern matching rate for distorted data streams while providing real-time performance.
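A toy sketch of the query-aware normalization idea, using a simple range-rescaling stand-in for the paper's dynamic normalization (window handling, distortion modeling, and multi-query support are omitted).

```python
# Toy sliding-window matcher: rescale each window to the query template's
# scale before computing a distance, instead of normalizing both
# independently. A crude stand-in for the paper's dynamic normalization.
import numpy as np

def scale_to_template(window, template):
    """Map the window's value range onto the template's range."""
    w = (window - window.min()) / (np.ptp(window) + 1e-12)
    return w * np.ptp(template) + template.min()

def best_match(stream, template):
    m = len(template)
    dists = [np.linalg.norm(scale_to_template(stream[i:i + m], template)
                            - template)
             for i in range(len(stream) - m + 1)]
    return int(np.argmin(dists)), min(dists)

t = np.sin(np.linspace(0, 2 * np.pi, 50))
stream = np.concatenate([np.random.randn(100) * 0.1,
                         3 * t + 5,                    # rescaled, shifted copy
                         np.random.randn(100) * 0.1])
print(best_match(stream, t))      # offset likely near 100 despite rescaling
```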
GOGGLES: Automatic Image Labeling with Affinity Coding
Nilaksh Das, Sanya Chaba, Renzhi Wu, Sakshi Gandhi, Duen Horng Chau, and Xu Chu
In Proceedings of the 2020 International Conference on Management of Data (SIGMOD 2020), online [Portland, OR, USA], June 14–19, 2020
Generating large amounts of labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost of labeling training data. However, data programming relies on designing labeling functions, which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets, as it is hard to express domain knowledge using the raw features of images (pixels).
We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class should, on average, be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system, which implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set.
We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from 71% to 98% without requiring extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound.
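A minimal sketch of the affinity-coding premise with a single assumed affinity function (cosine similarity): each unlabeled instance takes the class whose development-set examples have the highest average affinity. GOGGLES' actual class inference uses a hierarchical generative model over many affinity functions.

```python
# Toy version of the affinity-coding premise: same-class pairs should score
# higher under an affinity function than cross-class pairs. Cosine similarity
# is an assumed stand-in for GOGGLES' learned affinity functions.
import numpy as np

def cosine_affinity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def label_by_affinity(unlabeled, dev_feats, dev_labels):
    labels = []
    for x in unlabeled:
        scores = {}
        for c in set(dev_labels):
            members = [d for d, y in zip(dev_feats, dev_labels) if y == c]
            scores[c] = np.mean([cosine_affinity(x, d) for d in members])
        labels.append(max(scores, key=scores.get))   # highest mean affinity
    return labels

dev = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ys = ["cat", "dog"]
print(label_by_affinity([np.array([0.9, 0.1])], dev, ys))   # -> ['cat']
```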
Reviewer/PC Services: ICLR, NeurIPS, KDD, SDM, Scientific Reports