Hyperreal

Release Notes

0.6.1

This is a bugfix-only release: the templates for the web interface were missing from the release artifacts.

0.6.0

This release significantly improves the API by defining a clearer interface between the corpus and the index, and by supporting customised rendering of field values in different contexts. Future releases will build on this to allow more customised rendering of values in the web UI and to enable enhanced visualisations such as trendlines.

  • Improves UI by allowing rendering of matching snippets (concordance lines or keyword in context) for a query, along with the matching features and the document itself.
  • Improves clustering algorithm by making acceptance of moves depend on the ratio of cluster sizes: this prevents clusters from being emptied out too quickly and limits the accumulation of features into a single large cluster.
  • Significant improvements in the handling of stackexchange data: you can now work with the .7z archive files directly on the command line. You will need to install extra dependencies first though:

    # Install optional dependencies along with the upgraded library
    
    $ pip install --upgrade "hyperreal[stackexchange]"
    
    # Assumes a stackexchange archive file has been downloaded to the current directory.
    # Replaces the expatriates stackexchange content in expat.db with the contents of the archive.
    
    $ hyperreal stackexchange-corpus replace-sites expatriates.stackexchange.com.7z expat.db
    

Note: because of the changes to the corpus API and the migration API, it will likely be simpler to recreate your indexes from scratch against the new API than to migrate in place. If you do want to migrate your data in place, you will first need to migrate your index database using version 0.5.0.
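
To illustrate what a keyword-in-context (concordance) line looks like, here is a minimal standalone sketch. It uses only the standard library and does not call the Hyperreal API: `concordance` and its `width` parameter are hypothetical names chosen for this example, not part of the library.

```python
import re


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of keyword."""
    lines = []
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines


# Made-up document text, purely for demonstration.
document = "Moving abroad is hard. A visa is needed before moving, and moving costs money."
for line in concordance(document, "moving", width=20):
    print(line)
```

Each output line shows one match with a fixed amount of context on either side, which is the general idea behind the snippets shown in the UI.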

0.5.0

Further refinements and simplifications of the clustering process - it is now possible to control the grouping size in the hierarchical clustering improvement.

This release also migrates the packaging infrastructure to fully use pyproject.toml.

0.4.1

  • Bugfix for an edge case in cluster refinement. #64
  • Very rough initial implementation to allow searching for a single value in a single field - this currently only supports exact matches and has no error handling.

0.4.0

Lots of new features in this release:

  • support for network export of cluster-cluster cooccurrence via networkx (see the new export subcommand on the CLI, or the Index.create_cluster_cooccurrence_graph method).
  • support for structured sampling using the model as a guide for sampling documents
  • interface for corpus implementations to support exporting tabular renderings of documents
  • refactoring of the clustering method so more of the pieces can be used on their own - this will be used for further functionality in the future
  • reworking of the web UI render pathway, so the top of the page is calculated first and streamed as soon as it's ready - this enables better interactivity on very large datasets
  • more sophisticated tokenisation of stackexchange data: codeblocks are now pulled out and indexed separately.
  • initial examples for two large scale document collections: stackoverflow and Australian Federal Hansard
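
As a rough illustration of what cluster-cluster cooccurrence means, the sketch below counts how many documents each pair of clusters shares. It uses only the standard library and made-up data. `cluster_cooccurrence` and `doc_clusters` are hypothetical names, and this is not the implementation behind Index.create_cluster_cooccurrence_graph, whose output format is not described in these notes.

```python
from collections import Counter
from itertools import combinations


def cluster_cooccurrence(doc_clusters):
    """Count, for each pair of clusters, the documents in which both occur."""
    counts = Counter()
    for clusters in doc_clusters:
        # Sort so that each pair is always stored as (smaller, larger).
        for a, b in combinations(sorted(set(clusters)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical input: the set of cluster ids matched by each document.
doc_clusters = [{0, 1}, {0, 1, 2}, {2}]
for (a, b), weight in sorted(cluster_cooccurrence(doc_clusters).items()):
    print(a, b, weight)
```

Pairs and counts like these map naturally onto weighted graph edges, for example via networkx's Graph.add_weighted_edges_from.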

Experimental initial features:

  • initial experimental indexing of word positions to enable phrase and proximity queries - this is likely to change a lot in future releases
  • initial support for rendering of concordances (keywords in context)
  • initial support for intersecting a set of queries with a given field - this will be used in the future to enable first class support for temporal queries
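
To show why indexing word positions enables phrase queries, here is a minimal sketch of a positional index over whitespace-tokenised text. It is an assumption-laden toy, not Hyperreal's index format: `build_positional_index` and `phrase_query` are hypothetical names, and the documents are made up.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]} for whitespace-tokenised docs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index


def phrase_query(index, phrase):
    """Return sorted ids of documents containing the phrase's tokens consecutively."""
    tokens = phrase.lower().split()
    if not tokens or any(token not in index for token in tokens):
        return []
    # Only documents containing every token can match.
    candidates = set(index[tokens[0]])
    for token in tokens[1:]:
        candidates &= set(index[token])
    matches = []
    for doc_id in sorted(candidates):
        # The phrase matches when token i appears at start + i for some start.
        starts = set(index[tokens[0]][doc_id])
        for offset, token in enumerate(tokens[1:], start=1):
            starts &= {p - offset for p in index[token][doc_id]}
        if starts:
            matches.append(doc_id)
    return matches


docs = {
    1: "topic models need interpretation",
    2: "models of topic drift",
    3: "an interpretive topic model",
}
index = build_positional_index(docs)
print(phrase_query(index, "topic models"))
```

A proximity query follows the same pattern, except that positions are compared within a window rather than at exact offsets.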

0.3.1

This is a small utility release:

  • make available a bugfix that prevents zombie processes from automatically created process pools
  • factor out some functionality from the feature clustering algorithm into separate methods to allow more flexibility in future.

0.3.0

This release includes a number of tweaks and improvements to support some useful workflows for editing and manipulating models after the initial computational feature clustering. It also updates the clustering algorithm slightly so it should converge faster and produce more interesting clusterings when expanding the number of clusters.

The main component of this release is a new and improved approach to the clustering algorithm. This results in improvements in the objective score, and makes it possible to dynamically reduce the number of clusters in the model as well as increase it. This might be used in the future to allow new complications and changes for the clustering algorithm too.

  • Make PlainTextSqliteCorpus take a custom tokeniser function by @SamHames in #53
  • Skip Twitter related tests in CI due to Twitter API changes by @SamHames in #54
  • Refine the clustering algorithm by @SamHames in #55
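
For readers wondering what a custom tokeniser function might look like, here is a minimal sketch of one: a callable that takes a string and returns a list of tokens. The name `simple_tokeniser` and the token pattern are assumptions for illustration only. These notes do not state the exact signature PlainTextSqliteCorpus expects or how the function is passed in, so check the library documentation before relying on this shape.

```python
import re

# Hypothetical pattern: words, optional leading '#', and simple contractions.
TOKEN_PATTERN = re.compile(r"#?\w+(?:'\w+)?")


def simple_tokeniser(text):
    """Split text into lowercased tokens, keeping hashtags and contractions whole."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


print(simple_tokeniser("Don't stop #Learning!"))
```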

0.1.1 - Fix packaging related issues

This release is a bugfix for the distributed package: the wheel pushed to PyPI wasn't usable because it was missing key things.

0.1.0

This is the initial release of the Hyperreal package for interpretive topic modelling.

10 check-ins tagged with "release"
2024-07-09
05:06
Bugfix: make sure templates are properly included in the distributed package check-in: 4929916c99 user: sam_hames tags: trunk, release, 0.6.1
04:32
Bump version for release check-in: bdef0c24c5 user: sam_hames tags: trunk, release, 0.6.0
2023-10-02
06:10
Merge pull request #68 from SamHames/declarative_package_config Initial test of pyproject.toml only packaging check-in: 3f803d1c22 user: sam_hames tags: trunk, 0.5.0, release
2023-08-10
03:20
Handle no-op edgecase for clustering that cannot change A clustering with a single cluster selected and target_clusters not set, or set to 1 cannot change. This is still valid though, so now this is a no-op and the initial clustering is immediately returned. This closes #64. check-in: 5099f9b240 user: sam_hames tags: trunk, 0.4.1, release
2023-08-05
12:51
Add an example script for all of stackoverflow check-in: c0afbc5b55 user: sam_hames tags: trunk, 0.4.0, release
2023-05-29
04:40
Merge pull request #60 from SamHames/spinout_utility_methods Factor out useful subroutines into their own methods check-in: e11b2d8bc4 user: sam_hames tags: trunk, 0.3.1, release
2023-03-30
04:36
Workflow refinements (#56)
  * Expose pin/unpin of clusters in the front end
  * Workflow refinements and layout improvements
    - Make the default display denser, allowing more symbols present on screen at the same time
    - Don't use the simplistic BSTM ranking - just show random samples of matching documents in every case to ensure face validity of documents and cluster representations at each point
    - Improve? the default styling a little bit. It is still ugly :)
  * Update styling to show more content across the screen
  * Soften the colour contrast for the background
  * Process clusters with the most features first when pivoting by a query
  * Layout the cluster of features with flexbox
  * Make the twitter test case more complicated for the new feature subdivision
  * Improve styling for the navigation bar and add a bit more space
  * Sample from features to create new clusters. Change from splitting the cluster with the most features to create a new cluster, to sampling from all movable features. This means that new clusters can draw from many different features, and hopefully more likely to create a new useful cluster. Also, the termination criterion now focuses on the number of feature moves only, as that is core to the human assessment of the model, not the numerical objective.
  * Highlight the value box in features properly, not the containing div
  * Add a button to drag all visible features into their own cluster
  * Confirm before deleting an existing model through the CLI
  * Add back parameter that shouldn't have been removed in the cluster/create endpoint
  * Partial bugfix for incorrect number of clusters being dissolved when there is a pinned cluster
  * Don't allow any edits on pinned clusters
  * Two algorithmic tweaks
    1. Add a small amount of randomness to the greedy selection choice. This accepts improvements with a 99% probability. This is important for avoiding oscillation between the same two states that can happen without randomness.
    2. Prevent moving features to a cluster that it doesn't intersect with at all - also mark features that don't intersect with the rest of their current cluster with bad scores to encourage movement.
  * Expose the tolerance parameter through refine_clusters and the CLI
  * Change default feature move tolerance, enforce a range of values
check-in: e6a3a49efd user: sam_hames tags: trunk, 0.3.0, release
2023-02-24
04:45
Make sure process pools are cleaned up properly I noticed I had accumulated a lot of zombie processes, I think because interrupting test runs might not cleanup the background process pools. This change just makes sure the pool is a fixture which ensures that it is cleaned up. check-in: d4435888ae user: sam_hames tags: trunk, 0.2.0, release
2022-10-27
07:30
Formatting fixes check-in: ab43621fa3 user: sam_hames tags: trunk, 0.1.1, release
2022-10-26
06:16
Add software-wide DOI to the citation file [ci skip] check-in: 0dd2d6cb8d user: sam_hames tags: trunk, 0.1.0, release