Publications
2024
- RefQSR: Reference-based Quantization for Image Super-Resolution Networks. Hongjae Lee, Jun-Sang Yoo, and Seung-Won Jung. IEEE TIP 2024.
Single image super-resolution (SISR) aims to reconstruct a high-resolution image from its low-resolution observation. Recent deep learning-based SISR models show high performance at the expense of increased computational costs, limiting their use in resource-constrained environments. As a promising solution for computationally efficient network design, network quantization has been extensively studied. However, existing quantization methods developed for SISR have yet to effectively exploit image self-similarity, which is a new direction for exploration in this study. We introduce a novel method called reference-based quantization for image super-resolution (RefQSR) that applies high-bit quantization to several representative patches and uses them as references for low-bit quantization of the rest of the patches in an image. To this end, we design dedicated patch clustering and reference-based quantization modules and integrate them into existing SISR network quantization methods. The experimental results demonstrate the effectiveness of RefQSR on various SISR networks and quantization methods.
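The abstract above centers on quantizing a few representative patches at high bit-width and the remaining patches at low bit-width. The sketch below is only a rough illustration of that split, not the RefQSR implementation: the patch-selection heuristic (highest variance instead of the paper's dedicated clustering module) and all function names are assumptions, and a real system would condition the low-bit branch on matched reference features rather than quantizing patches independently.

```python
import torch

def quantize_uniform(x, n_bits):
    """Uniform quantization of a tensor with a single per-tensor scale."""
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    return torch.round((x - x_min) / scale) * scale + x_min

def pick_reference_patches(patches, num_refs=4):
    """Crude stand-in for patch clustering: treat the highest-variance
    patches as the 'representative' references."""
    variance = patches.flatten(1).var(dim=1)
    return torch.topk(variance, k=num_refs).indices

def mixed_bit_quantize(patches, high_bits=8, low_bits=4):
    """Quantize reference patches at high bit-width and all other
    patches at low bit-width."""
    ref_idx = pick_reference_patches(patches)
    out = torch.empty_like(patches)
    for i in range(patches.size(0)):
        bits = high_bits if i in ref_idx else low_bits
        out[i] = quantize_uniform(patches[i], bits)
    return out, ref_idx

if __name__ == "__main__":
    patches = torch.rand(16, 3, 48, 48)   # 16 patches cropped from one image
    quantized, refs = mixed_bit_quantize(patches)
    print("high-bit reference patches:", refs.tolist())
```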
2023
- Video Object Segmentation-aware Video Frame Interpolation. Jun-Sang Yoo, Hongjae Lee, and Seung-Won Jung. ICCV 2023.
Video frame interpolation (VFI) is a very active research topic due to its broad applicability, including video enhancement, video encoding, and slow-motion effects. VFI methods have been advanced by improving the overall image quality for challenging sequences containing occlusions, large motion, and dynamic texture. This mainstream research direction neglects that foreground and background regions have different importance in perceptual image quality. Moreover, accurate synthesis of moving objects can be of utmost importance in computer vision applications. In this paper, we propose a video object segmentation (VOS)-aware training framework called VOS-VFI that allows VFI models to interpolate frames with more precise object boundaries. Specifically, we exploit VOS as an auxiliary task to help train VFI models by providing additional loss functions, including segmentation loss and bi-directional consistency loss. From extensive experiments, we demonstrate that VOS-VFI can boost the performance of existing VFI models by rendering clear object boundaries. Moreover, VOS-VFI displays its effectiveness on multiple benchmarks for different applications, including video object segmentation, object pose estimation, and visual tracking.
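As a rough picture of how the auxiliary segmentation and consistency terms described above could be attached to a standard VFI reconstruction loss, here is a minimal sketch. It is not the VOS-VFI code; the loss weights, helper names, and the exact form of each term are assumptions.

```python
import torch
import torch.nn.functional as F

def vos_aware_vfi_loss(pred_frame, gt_frame,
                       pred_mask_fwd, pred_mask_bwd, gt_mask,
                       w_seg=0.1, w_consist=0.05):
    # Standard frame reconstruction loss on the interpolated frame.
    recon = F.l1_loss(pred_frame, gt_frame)
    # Auxiliary segmentation loss on the interpolated frame.
    seg = F.binary_cross_entropy_with_logits(pred_mask_fwd, gt_mask)
    # Bi-directional consistency: masks propagated forward and backward
    # in time should agree at the interpolated time step.
    consist = F.l1_loss(torch.sigmoid(pred_mask_fwd), torch.sigmoid(pred_mask_bwd))
    return recon + w_seg * seg + w_consist * consist

if __name__ == "__main__":
    B, C, H, W = 2, 3, 64, 64
    loss = vos_aware_vfi_loss(
        torch.rand(B, C, H, W), torch.rand(B, C, H, W),
        torch.randn(B, 1, H, W), torch.randn(B, 1, H, W),
        torch.randint(0, 2, (B, 1, H, W)).float(),
    )
    print(loss.item())
```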
- Pose and Shape Estimation of Humans in Vehicles. Kwang-Lim Ko, Jun-Sang Yoo, Chang-Woo Han, Jungyeop Kim, and Seung-Won Jung. IEEE Transactions on Intelligent Transportation Systems 2023.
As autonomous driving technologies advance, occupants are expected to be free from driving, diversifying interaction scenarios with vehicles. Despite the growing importance of in-vehicle occupant monitoring systems, most existing systems focus on the face or head tracking of occupants, and only a few studies have attempted to detect their poses. In this paper, we present the first in-vehicle environment-specialized framework for the joint estimation of 3D human pose and shape from a single image. To this end, we introduce a new dataset called Human In VEhicles (HIVE), which contains a large collection of synthesized humans with different shapes and poses in vehicle images. HIVE provides RGB and NIR in-vehicle image pairs with ground-truth 2D and 3D pose and shape annotations, respectively. In addition, to exploit the different characteristics of humans in vehicles and unconstrained environments, we present a new pose prior penalizing poses that deviate from in-vehicle poses. The pose prior is derived using a variational autoencoder trained with in-vehicle human pose data. By using the proposed HIVE dataset and pose prior along with an elaborately designed two-stage training procedure, our method exhibits significantly improved pose and shape estimation performance compared with state-of-the-art methods for real-world test images captured in vehicles under different conditions.
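The abstract above mentions a pose prior derived from a variational autoencoder trained on in-vehicle poses. The sketch below shows one common way such a prior is used as a penalty (in the spirit of VPoser-style priors); it is not the paper's model, and the encoder architecture, dimensions, and names are assumptions.

```python
import torch
import torch.nn as nn

class PosePriorEncoder(nn.Module):
    """VAE encoder mapping a body pose vector to a latent Gaussian. After
    training on in-vehicle poses, the latent norm of a new pose can serve
    as a prior penalty."""
    def __init__(self, pose_dim=63, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, pose):
        h = self.net(pose)
        return self.mu(h), self.logvar(h)

def pose_prior_penalty(encoder, pose):
    """Poses far from the training (in-vehicle) distribution tend to map to
    large latent norms, so this term discourages out-of-distribution poses."""
    mu, _ = encoder(pose)
    return (mu ** 2).sum(dim=-1).mean()

if __name__ == "__main__":
    enc = PosePriorEncoder()
    print(pose_prior_penalty(enc, torch.randn(4, 63)).item())
```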
- Hierarchical Spatiotemporal Transformers for Video Object Segmentation. Jun-Sang Yoo, Hongjae Lee, and Seung-Won Jung. ICCVW 2023.
This paper presents a novel framework called HST for semi-supervised video object segmentation (VOS). HST extracts image and video features using the latest Swin Transformer and Video Swin Transformer to inherit their inductive bias for the spatiotemporal locality, which is essential for temporally coherent VOS. To take full advantage of the image and video features, HST casts image and video features as a query and memory, respectively. By applying efficient memory read operations at multiple scales, HST produces hierarchical features for the precise reconstruction of object masks. HST shows effectiveness and robustness in handling challenging scenarios with occluded and fast-moving objects under cluttered backgrounds. In particular, HST-B outperforms the state-of-the-art competitors on multiple popular benchmarks.
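To make the query/memory formulation above concrete, here is a minimal attention-based memory read applied at two feature scales. It is only a generic illustration, not the HST implementation; the tensor shapes, the two-scale loop, and the function name are assumptions.

```python
import torch

def memory_read(query_feat, memory_key, memory_value):
    """Attention read: current-frame (image) features query the spatiotemporal
    (video) memory and retrieve features for mask prediction."""
    B, C, H, W = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)        # B x HW x C
    k = memory_key.flatten(2)                        # B x C x THW
    v = memory_value.flatten(2).transpose(1, 2)      # B x THW x C
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)   # B x HW x THW
    return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

if __name__ == "__main__":
    # Apply the same read at two feature scales to mimic a hierarchical decoder.
    for C, H, W in [(64, 32, 32), (64, 16, 16)]:
        image_feat = torch.rand(1, C, H, W)          # query (image) features
        video_key = torch.rand(1, C, 3, H, W)        # memory from 3 past frames
        video_value = torch.rand(1, C, 3, H, W)
        print(memory_read(image_feat, video_key, video_value).shape)
```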
- GPS-GLASS: Learning Nighttime Semantic Segmentation Using Daytime Video and GPS Data. Hongjae Lee, Changwoo Han, Jun-Sang Yoo, and Seung-Won Jung. ICCVW 2023.
Semantic segmentation for autonomous driving should be robust against various in-the-wild environments. Nighttime semantic segmentation is especially challenging due to a lack of annotated nighttime images and a large domain gap from daytime images with sufficient annotation. In this paper, we propose a novel GPS-based training framework for nighttime semantic segmentation. Given GPS-aligned pairs of daytime and nighttime images, we perform cross-domain correspondence matching to obtain pixel-level pseudo supervision. Moreover, we conduct flow estimation between daytime video frames and apply GPS-based scaling to acquire another pixel-level pseudo supervision. Using these pseudo supervisions with a confidence map, we train a nighttime semantic segmentation network without any annotation from nighttime images. Experimental results demonstrate the effectiveness of the proposed method on several nighttime semantic segmentation datasets.
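The training signal described above is pixel-level pseudo supervision weighted by a confidence map. A minimal sketch of confidence-weighted pseudo-label training is shown below; it is not the GPS-GLASS code, and the pseudo-label source (correspondence matching, flow estimation, and GPS-based scaling) is abstracted away entirely.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_ce(logits, pseudo_labels, confidence, ignore_index=255):
    """Per-pixel cross-entropy against pseudo labels, down-weighted by a
    confidence map so unreliable correspondences contribute less."""
    loss = F.cross_entropy(logits, pseudo_labels, reduction="none",
                           ignore_index=ignore_index)          # B x H x W
    valid = (pseudo_labels != ignore_index).float()
    return (loss * confidence * valid).sum() / valid.sum().clamp(min=1.0)

if __name__ == "__main__":
    B, K, H, W = 2, 19, 64, 64                 # 19 classes (Cityscapes-style)
    logits = torch.randn(B, K, H, W)           # nighttime network predictions
    pseudo = torch.randint(0, K, (B, H, W))    # pseudo labels from daytime images
    conf = torch.rand(B, H, W)                 # per-pixel confidence map
    print(confidence_weighted_ce(logits, pseudo, conf).item())
```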
2022
- RZSR: Reference-based Zero-Shot Super-Resolution with Depth Guided Self-Exemplars. Jun-Sang Yoo, Dong-Wook Kim, Yucheng Lu, and Seung-Won Jung. IEEE Transactions on Multimedia (TMM) 2022.
Recent methods for single image super-resolution (SISR) have demonstrated outstanding performance in generating high-resolution (HR) images from low-resolution (LR) images. However, most of these methods show their superiority using synthetically generated LR images, and their generalizability to real-world images is often not satisfactory. In this paper, we pay attention to two well-known strategies developed for robust super-resolution (SR), i.e., reference-based SR (RefSR) and zero-shot SR (ZSSR), and propose an integrated solution, called reference-based zero-shot SR (RZSR). Following the principle of ZSSR, we train an image-specific SR network at test time using training samples extracted only from the input image itself. To advance ZSSR, we obtain reference image patches with rich textures and high-frequency details, which are also extracted only from the input image using cross-scale matching. To this end, we construct an internal reference dataset and retrieve reference image patches from the dataset using depth information. Using LR patches and their corresponding HR reference patches, we train a RefSR network equipped with a non-local attention module. Experimental results demonstrate the superiority of the proposed RZSR compared to the previous ZSSR methods and robustness to unseen images compared to other fully supervised SISR methods.
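The ZSSR principle referenced above, training an image-specific network at test time on pairs drawn from the input image itself, can be illustrated with the minimal sketch below. It is not the RZSR implementation: the reference-patch retrieval, depth guidance, and non-local attention module are omitted, and the toy network, patch sizes, and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_internal_pairs(img, scale=2, patch=48, n=8):
    """Build LR/HR training pairs from the input image itself by treating the
    image as HR and its bicubic downscale as LR."""
    lr_img = F.interpolate(img, scale_factor=1 / scale, mode="bicubic",
                           align_corners=False)
    pairs = []
    for _ in range(n):
        _, _, h, w = lr_img.shape
        y = torch.randint(0, h - patch, (1,)).item()
        x = torch.randint(0, w - patch, (1,)).item()
        lr = lr_img[:, :, y:y + patch, x:x + patch]
        hr = img[:, :, y * scale:(y + patch) * scale, x * scale:(x + patch) * scale]
        pairs.append((lr, hr))
    return pairs

if __name__ == "__main__":
    img = torch.rand(1, 3, 256, 256)                      # the single test image
    net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Upsample(scale_factor=2, mode="bilinear"),
                        nn.Conv2d(64, 3, 3, padding=1))   # toy x2 SR network
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for lr, hr in make_internal_pairs(img):               # test-time training
        opt.zero_grad()
        loss = F.l1_loss(net(lr), hr)
        loss.backward()
        opt.step()
    print("final loss:", loss.item())
```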