Thursday, June 13, 2024

Artificial Intelligence news

Apple is promising personalized...

At its Worldwide Developer Conference on Monday, Apple for the first time...

What using artificial intelligence...

This story originally appeared in The Algorithm, our weekly newsletter on AI....

The data practitioner for...

The rise of generative AI, coupled with the rapid adoption and democratization...

Five ways criminals are...

Artificial intelligence has brought a big boost in productivity—to the criminal underworld.  Generative...
HomeMachine LearningImproving the Quality...

Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling



Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…



Article Source link and Credit

Continue reading

Swallowing the Bitter Pill: Simplified Scalable Conformer Generation

We present a novel way to predict molecular conformers through a simple formulation that sidesteps many of the heuristics of prior works and achieves state of the art results by using the advantages of scale. By training a...

KPConvX: Modernizing Kernel Point Convolution with Kernel Attention

In the field of deep point cloud understanding, KPConv is a unique architecture that uses kernel points to locate convolutional weights in space, instead of relying on Multi-Layer Perceptron (MLP) encodings. While it initially achieved success, it has...

Efficient Diffusion Models without Attention

Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational complexity increases with transformer depth/width or the number of input tokens and requires patchy approximation to operate on even latent input sequences....