preprocesing package

Submodules

preprocesing.config_file module

preprocesing.convert_images module

preprocesing.convert_images.convert_images_to_bw()

Convert the anime illustration images to black and white in parallel

preprocesing.convert_images.convert_single_image(image_path)

Opens an anime illustration image and converts it to black and white

preprocesing.extract_and_verify_fonts module

preprocesing.extract_and_verify_fonts.create_character_test_string(dataframe_file, render_text_test_file)

Create a string of the unique characters in the Japanese text corpus, used to test whether the fonts can render enough of the text
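The idea of collecting unique characters can be sketched as follows. This is a minimal illustration, assuming the corpus is available as an iterable of strings; the helper name `unique_characters` is hypothetical and is not part of this package, and the real function reads from `dataframe_file` and writes to `render_text_test_file` instead.

```python
def unique_characters(lines):
    """Return the unique non-whitespace characters of all lines as one string."""
    chars = set()
    for line in lines:
        # Whitespace needs no glyph coverage check, so it is skipped (an assumption)
        chars.update(ch for ch in line if not ch.isspace())
    # Sort by code point so the result is deterministic
    return "".join(sorted(chars))


# Two short sample lines standing in for the text corpus
test_string = unique_characters(["こんにちは", "こんばんは"])
```

Sorting is only there to make the output reproducible between runs; the order has no effect on a coverage check.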

preprocesing.extract_and_verify_fonts.extract_fonts()

Find the font files that are in zip format and extract them

preprocesing.extract_and_verify_fonts.get_font_files(fonts_zip_output, fonts_raw_dir, font_file_dir)

A function to find the .otf and .ttf font files from the scraped font files

Parameters
  • fonts_zip_output – Path for zip files of font files

  • fonts_raw_dir – Place where all the raw font files exist whether zipped or not

  • font_file_dir (str) – Out directory for font files

preprocesing.extract_and_verify_fonts.has_glyph(font, glyph)

Check if a font file has the character glyph specified

Parameters
  • font (TTFont) – A TTFont object from fontTools

  • glyph (str) – A character glyph

Returns

1 if the font has the glyph, 0 otherwise

Return type

int
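One common way to perform this check is to look up the character's code point in the font's cmap subtables. The sketch below assumes that approach and may differ from the package's actual implementation. It relies on the fontTools convention that `font["cmap"].tables` is a list of subtables, each with a `cmap` dict keyed by code point.

```python
def has_glyph(font, glyph):
    """Return 1 if any cmap subtable maps the character's code point, else 0."""
    code_point = ord(glyph)
    for table in font["cmap"].tables:
        # Each subtable's cmap maps integer code points to glyph names
        if code_point in table.cmap:
            return 1
    return 0


# Usage with fontTools (the font path here is a placeholder):
#   from fontTools.ttLib import TTFont
#   font = TTFont("path/to/font.otf")
#   has_glyph(font, "あ")
```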

preprocesing.extract_and_verify_fonts.make_char_list(row)

Helper function to make a set of characters from a row in the dataframe of the text corpus

Parameters

row – A row in the dataframe

Returns

A set of characters

Return type

list

preprocesing.extract_and_verify_fonts.move_files(paths)

Wrapper to move files used for parallel execution

Parameters

paths (list) – A pair of paths: index 0 is the source, index 1 is the destination
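Taking both paths as a single argument lets the function be passed directly to a pool's `map`. The sketch below assumes a thread pool from the standard library; the package may use a different executor, and the `move_all` driver is hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor


def move_files(paths):
    """Move paths[0] to paths[1]; one argument so it works with map()."""
    source, destination = paths
    shutil.move(source, destination)


def move_all(path_pairs, max_workers=4):
    """Hypothetical driver: move every (source, destination) pair concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the results so that any exception is raised here
        list(pool.map(move_files, path_pairs))
```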

preprocesing.extract_and_verify_fonts.unzip_file(paths)

Unzip a file

Parameters

paths – Path to unzip file from and to
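A minimal sketch of such a helper using the standard library's `zipfile` module. It assumes `paths` follows the same convention as `move_files`, with index 0 as the archive and index 1 as the output directory; that ordering is an assumption.

```python
import zipfile


def unzip_file(paths):
    """Extract the archive at paths[0] into the directory paths[1]."""
    archive_path, output_dir = paths
    with zipfile.ZipFile(archive_path) as archive:
        # extractall creates the output directory if it does not exist
        archive.extractall(output_dir)
```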

preprocesing.extract_and_verify_fonts.verify_font_files(dataframe_file, render_text_test_file, font_file_dir, font_dataset_path)

A function that tests whether the scraped font files meet the benchmark of rendering at least x% (as specified in the config) of the unique characters in the text corpus
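The benchmark amounts to a coverage ratio compared against a threshold. The sketch below is illustrative only: the names `coverage` and `meets_benchmark` are hypothetical, the set of supported characters is assumed to have been computed already (for example with `has_glyph`), and the threshold is expressed as a fraction rather than a percentage.

```python
def coverage(supported, test_string):
    """Fraction of the unique characters in test_string found in `supported`."""
    unique = set(test_string)
    if not unique:
        # An empty test string has nothing to cover
        return 0.0
    return len(unique & supported) / len(unique)


def meets_benchmark(supported, test_string, threshold):
    """True when coverage reaches the threshold (a fraction between 0 and 1)."""
    return coverage(supported, test_string) >= threshold
```

For example, a font that supports two of four unique characters has a coverage of 0.5 and would pass a threshold of 0.5 but fail one of 0.75.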

preprocesing.text_dataset_format_changer module

preprocesing.text_dataset_format_changer.convert_jesc_to_dataframe()

Convert the CSV file of the text to a Dask DataFrame

Module contents