The Aalto Data-Intensive System group (ADIS) seminars provide a forum to discuss the state of the art in systems research with experts from leading academic institutes and industry labs. All events are free to join; please reach out to our group members for access.
Valter Uotila, University of Helsinki, 26.Mar.2024, 14:00 EET, CS Building B322
[Abstract] Quantum computing has developed significantly in recent years. Developing algorithms to estimate various metrics for SQL queries has been an important research question in database research, since the estimates affect query optimization and database performance. This work presents a quantum natural language processing (QNLP)-inspired approach for constructing a quantum machine learning model that can classify SQL queries with respect to their execution times, costs, and cardinalities. From the quantum machine learning perspective, we compare our model and results to previous research in QNLP and conclude that our model reaches similar accuracy to the QNLP model in the classification tasks. This indicates that the QNLP model is a promising method even when applied to problems outside QNLP. We study the developed quantum machine learning model by calculating its expressibility and entangling-capability histograms. The results show that the model has favorable properties: it is expressible yet not too complex to be executed on quantum hardware, for example, on the current 20-qubit quantum computer in Finland.
[About the speaker] Valter Uotila is a second-year doctoral researcher at the University of Helsinki researching quantum computing applications for databases and data management. His research interests are in the intersection of quantum computing, databases and category theory.
Madelon Hulsebos, UC Berkeley, 22.Mar.2024, 17:00 EET, Online
[Abstract] The impressive capabilities of transformers have been explored for applications over language, code, and images, but millions of tables have long been overlooked, even though tables dominate the organizational data landscape and give rise to peculiar challenges. Unlike natural language, tables come with structure, heterogeneous and messy data, relations across tables, contextual interactions, and metadata. Accurately and robustly embedding tables is, however, key to many real-world applications, from data exploration and preparation to question answering and tabular ML. In this talk, I will discuss the general approaches taken towards adapting the transformer architecture to tables and give an overview of the tasks already explored in this space. I will zoom in on some of the shortcomings of these approaches and close with the open challenges and opportunities, and some ongoing work.
[About the speaker] Madelon Hulsebos is a postdoctoral fellow at UC Berkeley. She obtained her PhD from the Informatics Institute at the University of Amsterdam, during which she did research at the MIT Media Lab and Sigma Computing. She was awarded the BIDS-Accenture fellowship for her postdoctoral research on retrieval systems for structured data. Madelon's general research interest lies at the intersection of data management and machine learning, with recent contributions in methods, tools, and resources for Table Representation Learning.
Wenqi Jiang, ETH Zürich, 7.Mar.2024, 14:00 EET, Online
[Abstract] The recent advances in generative large language models (LLMs) are attributable to the surging number of model parameters trained on massive datasets. However, improving LLM quality by scaling up models leads to several major problems, including high computational costs. Instead of scaling up models, a promising direction, recently adopted by OpenAI, is the Retrieval-Augmented Language Model (RALM), which augments a large language model by retrieving context-specific knowledge from an external database via vector search. This strategy facilitates impressive text generation quality even with smaller models, thus reducing computational demands by orders of magnitude.
However, RALMs introduce unique system design challenges due to (a) the diverse workload characteristics between LLM inference and retrieval and (b) the various system requirements and bottlenecks for different RALM configurations, including model sizes, database sizes, and retrieval frequencies. In this talk, I will present Chameleon, a heterogeneous accelerator system integrating both LLM and retrieval accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both LLM inference and retrieval, while the disaggregation allows independent scaling of LLM and retrieval accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements retrieval accelerators on FPGAs and assigns LLM inference to GPUs, with a CPU server orchestrating these accelerators over the network. Evaluated on various RALMs, Chameleon exhibits up to 2.16× reduction in latency and 3.18× speedup in throughput compared to the hybrid CPU-GPU architecture.
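To make the retrieve-then-generate pattern concrete, here is a minimal, hedged sketch of the RALM loop described above, using toy documents and hand-made embedding vectors; the names (`DOCS`, `retrieve`, `generate`) and the inner-product scoring are illustrative assumptions, not Chameleon's actual implementation.

```python
# Toy RALM loop: vector search over a tiny document store, then
# prepending the retrieved context to the prompt before generation.
# Embeddings are hand-made 2-d vectors purely for illustration.

DOCS = [
    ("retrieval augments generation", [1.0, 0.0]),
    ("vector search finds context",   [0.0, 1.0]),
    ("FPGAs accelerate retrieval",    [0.7, 0.7]),
]

def retrieve(query_vec, k=1):
    """Rank documents by inner-product similarity (a toy vector search)."""
    def score(entry):
        _, emb = entry
        return sum(q * e for q, e in zip(query_vec, emb))
    ranked = sorted(DOCS, key=score, reverse=True)
    return [text for text, _ in ranked[:k]]

def generate(prompt, query_vec):
    """Stand-in for LLM inference: prepend retrieved context to the prompt."""
    context = "; ".join(retrieve(query_vec))
    return f"[context: {context}] {prompt}"
```

In a real RALM the encoder, the vector index, and the LLM would each be separate components, which is exactly why Chameleon can place retrieval on FPGAs and inference on GPUs and scale them independently.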
[About the speaker] Wenqi Jiang is a fourth-year Ph.D. student at ETH Zurich, where he is affiliated with the systems group advised by Gustavo Alonso and Torsten Hoefler. Wenqi's research interests span data management, computer architecture, and computer systems. His work primarily focuses on designing post-Moore data systems, which involve cross-stack solutions including algorithm, system, and architecture innovations. Some examples of his work include large language models, vector search, recommender systems, and spatial data processing.
Xiaozhe Yao, ETH Zürich, 26.Feb.2024, 14:00 EET, Online
[Abstract] Fine-tuning large language models (LLMs) for downstream tasks can greatly improve model quality; however, serving many different fine-tuned LLMs concurrently for users in multi-tenant environments is challenging. Dedicating GPU memory to each model is prohibitively expensive, and naively swapping large model weights in and out of GPU memory is slow. Our key insight is that fine-tuned models can be quickly swapped in and out of GPU memory by extracting and compressing the delta between each model and its pre-trained base model. We propose DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by a factor of 6× to 8× while maintaining high model quality. DeltaZip increases serving throughput by 1.5× to 3× and improves SLO attainment compared to a vanilla HuggingFace serving system.
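The delta idea above can be sketched in a few lines. This is a hedged, toy illustration only: weights are flat Python lists, and the threshold-based sparsification (`extract_delta`, `reconstruct`) is an assumed stand-in for DeltaZip's actual quantization and compression pipeline.

```python
# Toy version of delta-based model swapping: store only the entries
# where a fine-tuned model differs meaningfully from its base model,
# then reapply that sparse delta when the model is swapped in.

def extract_delta(base, finetuned, threshold=0.05):
    """Keep only weight changes whose magnitude exceeds a threshold
    (a crude, lossy form of delta compression)."""
    return {i: f - b
            for i, (b, f) in enumerate(zip(base, finetuned))
            if abs(f - b) > threshold}

def reconstruct(base, delta):
    """Swap a fine-tuned model in by applying its sparse delta to the base."""
    weights = list(base)
    for i, d in delta.items():
        weights[i] += d
    return weights
```

Because most fine-tuned weights stay close to the base model, the delta is far smaller than the full model, which is what makes fast swapping and 6× to 8× compression plausible in the first place.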
[About the speaker] Xiaozhe Yao is a second-year doctoral student in the Systems Group, Department of Computer Science, ETH Zürich, advised by Prof. Dr. Ana Klimović. Working across a wide spectrum of machine learning and systems, he aims to build systems that support large-scale machine learning and democratize it. Prior to ETH, Xiaozhe Yao received his Master's degree in Data Science at the University of Zurich, advised by Prof. Dr. Michael Böhlen and Qing Chen. Before that, he completed his Bachelor's degree in Computer Science at Shenzhen University, advised by Prof. Dr. Shiqi Yu. He interned at the Shenzhen Institute of Advanced Technology in 2016 as a data scientist.
Pedro Silvestre, Imperial College London, 19.Feb.2024, 13:00 EET, Online
[Abstract] Reinforcement Learning (RL) is an increasingly relevant area of algorithmic research. Though RL differs substantially from Supervised Learning (SL), today's RL frameworks are often simple wrappers over SL systems. In this talk, we first analyse the differences between SL and RL from the system designer's point-of-view, then discuss the issues and inefficiencies of RL frameworks arising from those differences. In particular, we discuss how the existence of cyclic and dynamic data dependencies in RL forces the decomposition of algorithms into disjoint dataflow graphs, preventing holistic analysis and optimisation.
We then propose TempoRL, a system designed to efficiently capture these cyclic and dynamic data dependencies in a single graph by instead viewing RL algorithms as Systems of Recurrence Equations (SREs). TempoRL is then able to holistically analyse and optimise this graph, applying both classic and novel transformations like automatic vectorisation (when memory allows) or incrementalisation (when memory is scarce). Because SREs impose no control flow, TempoRL is free to choose any execution schedule that respects the data dependencies. Luckily, by designing around SREs, we are able to leverage the powerful polyhedral analysis framework to find efficient and parallel execution schedules, as well as compute a memory management plan through dataflow analysis. The remainder of the talk discusses the surprising advantages that this novel computational model brings, and the applications it may have outside of RL.
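For a flavour of the recurrence-equation view, consider a quantity RL systems compute constantly: the discounted return, defined by the recurrence R[t] = r[t] + γ·R[t+1]. The sketch below is an assumed toy example, not TempoRL code; the point is that the equation only fixes data dependencies, leaving a scheduler free to pick any evaluation order that respects them (here, a simple backward scan).

```python
# Discounted returns via the recurrence R[t] = r[t] + gamma * R[t+1],
# with R[T] = 0 past the end of the episode. The recurrence itself
# imposes no control flow; this backward loop is just one valid schedule.

def discounted_returns(rewards, gamma=0.9):
    returns = [0.0] * len(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    return returns
```

A vectorising scheduler could instead evaluate the same recurrence over a whole batch of episodes at once, which is the kind of transformation the talk describes TempoRL applying automatically.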
[About the speaker] Pedro Silvestre is a PhD student in the Large-Scale Data & Systems Group at Imperial College London, under the supervision of Prof. Peter Pietzuch, working on dataflow systems for deep reinforcement learning. Before Imperial, Pedro was a Research Engineer in TU Delft's Web Information Systems Group, working on consistent fault tolerance for distributed stream processing. Pedro completed both his MSc and BSc at the NOVA School of Science and Technology.