Hi there 👋

Today, let's dive into 7 ML repos that the top 1% of developers use (and that you have likely never heard of)!

What defines the top 1%?

Ranking developers is a difficult problem, and every methodology has its issues. For example, if you rank developers by the number of lines of Python code they have written, you'll probably get some pretty good Python developers at the top. However, you may also get people who have just copy-pasted lots of Python code into their repos and aren't that good. 🙁

At Quine, we have developed a methodology that we think is robust in most cases, but again not 100% perfect! It's called DevRank (you can read more about how we calculate it here). The notion of the top 1% that I use in this article is based on DevRank. And yes, we continue working on this to make it better every day!

How do we know which repos the top 1% use?

We look at the repos that the 99th percentile has starred. We then compare the propensity of the top 1% of devs vs the bottom 50% of devs to star a repo, and automatically generate the list. In other words, these repositories are the hidden gems used by the top 1% of developers that are yet to be discovered by the wider developer community.

CleverCSV - I handle your messy CSVs

A package developed by some friends of ours to handle common pain points of loading CSV files. A small but common problem at the start of many ML pipelines, solved well. 🔮

CleverCSV is able to detect and load many different CSV dialects without needing to be told anything in its arguments. CSV files do not natively provide the information needed for this, so some clever inference is required by the library. CleverCSV can even handle messy CSV files that contain formatting mistakes.

In addition to the Python library, CleverCSV also includes a command-line interface for code generation, exploration and standardization.
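To get a feel for the inference problem CleverCSV tackles, here is a minimal sketch using only Python's standard library, whose csv.Sniffer performs a much simpler version of the same dialect detection (the sample data is invented, and CleverCSV's own API is not shown here):

```python
import csv
import io

# An invented "messy" CSV that uses semicolons; naive parsing with the
# default comma delimiter would read each line as a single field.
raw = "name;age;city\nAlice;30;Paris\nBob;25;Lyon\n"

# The stdlib Sniffer infers the dialect from the data itself. CleverCSV
# replaces this heuristic with a far more robust detection algorithm.
dialect = csv.Sniffer().sniff(raw)
print(dialect.delimiter)  # ';'

rows = list(csv.reader(io.StringIO(raw), dialect))
print(rows[1])  # ['Alice', '30', 'Paris']
```

CleverCSV is positioned as a drop-in replacement for the standard csv module, so code in this style should map over to it with much sharper detection.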
https://github.com/alan-turing-institute/CleverCSV

skll - Streamline ML workflows with scikit-learn through CLI

Are you writing endless boilerplate in sklearn to obtain cross-validated results with multiple algorithms? Try skll's interface instead for a much cleaner coding experience. ⚡️

Skll is designed to let you run machine learning experiments with scikit-learn more efficiently, reducing the need for extensive coding.

The leading utility provided is called run_experiment, which runs a series of learners on datasets specified in a configuration file.

It also offers a Python API for straightforward integration with existing code, including tools for format conversion and feature file operations.

https://github.com/EducationalTestingService/skll

BanditPAM - k-Medoids Clustering in Almost Linear-Time

Back to fundamental algos here: BanditPAM is a new k-medoids (think a robust "k-means") algorithm that can run in almost linear time. 🎉

- It runs in O(n log n) time rather than the O(n^2) of previous algorithms.
- Cluster centers are data points, and hence correspond to meaningful observations. The center of a k-means cluster may correspond to invalid data; this is not possible with k-medoids.
- Arbitrary distance metrics can be used (think L1, or Hamming distance, for example), whereas efficient k-means algos are typically limited to the L2 distance.

Implemented from this paper, BanditPAM is ideal for data scientists looking for a powerful, scalable clustering solution, especially those dealing with large or complex data.

https://github.com/motiwari/BanditPAM

recordlinkage - The record matcher and duplicate detector everyone needs

Have you ever struggled to match users across different datasets who have spelt their name wrong, or who have slightly different attributes? Use this great library, inspired by the Freely Extensible Biomedical Record Linkage (FEBRL) package and rebuilt for modern Python tooling. 🛠️
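As a taste of the underlying idea (not recordlinkage's actual API, which is built around indexing and comparison steps over pandas DataFrames), here is a toy fuzzy-matching sketch on invented data, using only the standard library:

```python
import difflib

# Two tiny invented "datasets" describing the same people with
# slightly different spellings.
census_a = [{"id": 1, "name": "Jonathan Smith"}, {"id": 2, "name": "Maria Garcia"}]
census_b = [{"id": "x", "name": "Jonathon Smyth"}, {"id": "y", "name": "Mary Garcia"}]

def similarity(a, b):
    # Crude string similarity in [0, 1]; recordlinkage ships proper
    # string comparators for this step.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair of records and keep the likely matches.
# (A real record-linkage indexing step exists to prune this quadratic loop.)
matches = [
    (ra["id"], rb["id"])
    for ra in census_a
    for rb in census_b
    if similarity(ra["name"], rb["name"]) > 0.8
]
print(matches)  # [(1, 'x'), (2, 'y')]
```

The 0.8 threshold is arbitrary here; in practice, choosing comparators and thresholds (or training a classifier over them) is exactly the work this library structures for you.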
- Provides a Python-native implementation of the powerful FEBRL library, making use of numpy and pandas.
- Includes both supervised and unsupervised approaches.
- Includes tools for generating matching pairs to enable supervised ML approaches.

RecordLinkage is ideal for data scientists looking for a flexible, Python-based solution to perform record linkage and data deduplication tasks.

https://github.com/J535D165/recordlinkage

dragnet - A sole focus on web page content extraction

Content extraction from webpages: Dragnet focuses on the main content and user comments of a page, and ignores the rest. It's handy for our scraper friends out there. 🕷️

Dragnet extracts the useful content of a web page by stripping away unwanted material such as advertising or navigation elements.

- Provides simple Python functions (extract_content and extract_content_and_comments) for extracting content from HTML strings, with the option to include or exclude comments.
- A sklearn-style extractor class is there for more advanced use, allowing customisation and training of extractors.

https://github.com/dragnet-org/dragnet

spacy-stanza - The latest StanfordNLP research models directly in spaCy

Interested in standard NLP tasks such as part-of-speech tagging, dependency parsing and named entity recognition? 🤔

SpaCy-Stanza wraps the Stanza (formerly StanfordNLP) library so that it can be used in spaCy pipelines.

- It supports 68 languages, making it versatile for various linguistic applications.
- The package includes named entity recognition capabilities for selected languages, extending its utility in natural language processing tasks.
- The package allows your pipeline to be customised with additional spaCy components.
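If spaCy's pipeline model is new to you, the core idea (a document flowing through a sequence of annotating components) can be sketched in plain Python. Every name below is invented for the illustration; this is not spaCy's or Stanza's real API:

```python
# Toy illustration of the "pipeline of components" idea that spaCy
# (and therefore spacy-stanza) is built around.

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def tag(doc):
    # Stand-in for a part-of-speech tagger: a tiny lookup table.
    lexicon = {"dogs": "NOUN", "bark": "VERB"}
    doc["tags"] = [lexicon.get(t.lower(), "X") for t in doc["tokens"]]
    return doc

def run_pipeline(text, components):
    # The document passes through each component in order; swapping
    # components is how spacy-stanza slots Stanza's models into spaCy.
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("Dogs bark", [tokenize, tag])
print(doc["tags"])  # ['NOUN', 'VERB']
```

With the real wrapper, you build the pipeline from a pretrained Stanza model (the documented entry point at the time of writing is spacy_stanza.load_pipeline) and get back regular spaCy Doc objects.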
https://github.com/explosion/spacy-stanza

Littleballoffur - "Swiss Army knife for graph sampling tasks"

Have you ever worked with a dataset so large that you needed to take a sample of it? For simple data, random sampling maintains the distribution in a smaller sample. However, in complex networks, snowball sampling - where you select initial users and then include their connections - better captures the network structure. This helps avoid bias in the analysis. 🔦

Now, do you have graph-structured data and need to work on samples of it (either for algorithmic or computational reasons)? 👩💻

- Littleballoffur offers a range of methods for sampling from graphs and networks, including node-, edge-, and exploration-based sampling.
- It is designed with a unified public interface, making it easy for users to apply complex sampling algorithms without deep technical know-how.

https://github.com/benedekrozemberczki/littleballoffur

I hope these discoveries are valuable to you and will help you build a more robust ML toolkit! ⚒️

If you are interested in leveraging these tools to create impactful projects in open source, you should first find out what your current DevRank is on Quine and see how it evolves in the coming months!

Lastly, please consider supporting these projects by starring them. ⭐️

PS: We are not affiliated with them. We just think that great projects deserve great recognition.

See you next week,

Your Hackernoon buddy 💚

Bap

If you want to join the self-proclaimed "coolest" server in open source 😝, you should join our discord server. We are here to help you on your journey in open source. 🫶